Devlog by @qb - Stardance

@qb on waifmark-2 · 12 days ago

5h 5m 9s logged

Dev2

It’s been a long time, but the UI looks a little better now (?). The process is now almost fully streamlined:

HF model download
  -> pick quant/download type
Benchmarking
  -> start server -> run benchmark, score/100 evaluation, ETA and live console monitoring
Result auditing (will continue to work on this)
  -> human review of responses flagged by LLM judge
Results showcase/leaderboard
  -> WIP
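
To make the benchmarking and auditing stages a bit more concrete, here is a rough sketch of how an ETA and a human-review queue could be computed. This is not Waifmark's actual code: `JudgedResponse`, its fields, `review_queue`, and `eta_seconds` are all hypothetical names, and the 0-100 score range is assumed from the "score/100" evaluation mentioned above.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    # All fields are assumptions for illustration, not Waifmark's real schema.
    question_id: str
    score: float   # judge score, assumed to be on a 0-100 scale
    flagged: bool  # True when the LLM judge marked the response for review


def review_queue(results, low=0.0, high=100.0):
    """Collect responses a human should audit.

    A response is queued if the judge flagged it, or if its score falls
    outside the expected [low, high] range (a sign the judge misbehaved).
    """
    return [r for r in results if r.flagged or not (low <= r.score <= high)]


def eta_seconds(done, total, elapsed):
    """Linear ETA: remaining questions times average seconds per question."""
    if done <= 0:
        return None  # nothing finished yet, so no estimate is possible
    return (total - done) * (elapsed / done)
```

A linear ETA like this is the simplest option; it will drift if later questions (for example, long-context ones) take much longer than early ones.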

Progress related to benchmarking:

3x the question count of Waifmark 1
~2x faster parallel automated scoring
Memory system (short/long context) and agentic tool-calling (basic file reading, shell, research) functions are fully implemented as part of local agentic benchmarking.
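
For the parallel scoring point, here is a minimal sketch of one way to score responses concurrently using Python's standard `concurrent.futures`. It only illustrates the general idea: `judge` is a hypothetical placeholder for the real LLM judge call, and `score_all` and `max_workers=4` are assumptions, not the project's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def judge(response: str) -> float:
    """Hypothetical stand-in for an LLM judge call; returns a score in [0, 1].

    Here it just gives 1.0 to any non-empty response so the sketch runs offline.
    """
    return 1.0 if response.strip() else 0.0


def score_all(responses, max_workers=4):
    """Score responses concurrently and return (per-response scores, score/100).

    Threads suit this because a real judge call would mostly wait on I/O.
    pool.map returns results in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(judge, responses))
    total = 100.0 * sum(scores) / len(scores) if scores else 0.0
    return scores, total
```

For example, `score_all(["a", "", "b", "c"])` scores the four responses in parallel and scales the mean to the score/100 format.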