Dev2
It’s been so long but the UI looks a little better now (?) The process has been basically fully streamlined:
- HF model download
  -> pick quant/download type
- Benchmarking
  -> start server -> run benchmark, score/100 evaluation, ETA and live console monitoring
- Result auditing (will continue to work on this)
  -> human review of responses flagged by LLM judge
- Results showcase/leaderboard
  -> WIP
Progress related to benchmarking:
- 3x question count compared to Waifmark 1
- ~2x faster parallel automated scoring
- memory system (short/long context) and agentic tool-calling (basic file reading/shell/research) functions are fully implemented as part of local agentic benchmarking.
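For anyone curious what parallel automated scoring can look like, here is a minimal Python sketch. It is not the actual implementation. `judge_response` is a hypothetical stand-in for the LLM judge call (here it just checks for keywords), and `score_all`, `max_workers`, and the item fields are all illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor


def judge_response(item):
    # Hypothetical stand-in for an LLM-judge call; returns a 0-100 score.
    # Here: percentage of expected keywords found in the response.
    expected = item["keywords"]
    text = item["response"].lower()
    hits = sum(1 for kw in expected if kw.lower() in text)
    return round(100 * hits / len(expected))


def score_all(items, max_workers=8):
    # Threads suit I/O-bound judge calls (e.g. requests to a local server).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(judge_response, items))  # map keeps input order
    overall = sum(scores) / len(scores)  # overall score out of 100
    return scores, overall


items = [
    {"response": "Paris is the capital of France.", "keywords": ["Paris"]},
    {"response": "I used grep and cat.", "keywords": ["grep", "ls"]},
]
scores, overall = score_all(items)
print(scores, overall)
```

Since the judge calls are mostly waiting on a server, a thread pool is usually enough. The real speedup depends on how many requests the server can handle at once.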