5h 5m 9s logged

Dev2

It’s been a while, but the UI looks a little better now (?). The process is now basically fully streamlined:

  1. HF model download
    -> pick quant/download type
  2. Benchmarking
    -> start server -> run benchmark, score/100 evaluation, ETA and live console monitoring
  3. Result auditing (will continue to work on this)
    -> human review of responses flagged by the LLM judge
  4. Results showcase/leaderboard
    -> WIP
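
To make steps 2 and 3 a bit more concrete, here is a minimal sketch of how an ETA and a human review queue could be derived from judge output. It is not the project's actual code: the `JudgedResponse` fields, the 0-100 scale on each response, and both function names are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgedResponse:
    question_id: str
    score: int      # assumed 0-100 score assigned by the LLM judge
    flagged: bool   # assumed marker: the judge wants a human to double-check


def review_queue(results):
    """Return only flagged responses, lowest score first, for human auditing."""
    return sorted((r for r in results if r.flagged), key=lambda r: r.score)


def estimate_eta(done, total, elapsed_seconds):
    """Linear ETA in seconds from the average time per finished question."""
    if done == 0:
        return None  # nothing finished yet, so no estimate is possible
    return elapsed_seconds / done * (total - done)


results = [
    JudgedResponse("q1", 90, False),
    JudgedResponse("q2", 40, True),
    JudgedResponse("q3", 15, True),
]
queue = review_queue(results)
eta = estimate_eta(done=25, total=100, elapsed_seconds=50.0)
```

Sorting flagged items by score puts the most suspicious answers in front of the reviewer first; a real run would likely also want the judge's reasoning attached to each item.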

Progress related to benchmarking:

  • 3x question count compared to Waifmark 1
  • ~2x faster parallel automated scoring
  • Memory system (short/long context) and agentic tool calling (basic file reading, shell, research) are fully implemented as part of local agentic benchmarking.
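
As a rough illustration of the parallel scoring idea, the sketch below fans judge calls out over a thread pool and averages the results into a score out of 100. Everything here is hypothetical: `score_response` is a toy stand-in for a real LLM judge request, and the worker count is an arbitrary choice, not a setting taken from the project.

```python
from concurrent.futures import ThreadPoolExecutor


def score_response(response: str) -> int:
    """Toy stand-in for an LLM judge call; returns a 0-100 score."""
    # Placeholder heuristic: 10 points per word, capped at 100.
    return min(100, len(response.split()) * 10)


def score_all(responses, max_workers=4):
    """Score responses concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_response, responses))


def overall_score(scores):
    """Average the per-question scores into a single score out of 100."""
    return sum(scores) / len(scores) if scores else 0.0


scores = score_all(["a b c", "one"])
total = overall_score(scores)
```

Threads fit this case because a judge call would mostly wait on the local server's response rather than use the CPU, so several requests can be in flight at once.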