waifmark-2 - Stardance

@qb on waifmark-2 · 12 days ago

5h 5m 9s logged

Dev2

It’s been a while, but the UI looks a little better now (?). The process is now basically fully streamlined:

HF model download
-> pick quant/download type
Benchmarking
-> start server -> run benchmark, score/100 evaluation, ETA and live console monitoring
Result auditing (will continue to work on this)
-> human review of responses flagged by the LLM judge
Results showcase/leaderboard
-> WIP
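
As a rough illustration of the score/100 and ETA figures shown during a benchmark run (not the actual Waifmark 2 code), here is a minimal sketch. The names `Progress`, `eta_seconds` and `score_out_of_100` are hypothetical, and it assumes each question is worth the same number of points and that the ETA is a simple extrapolation from the average time per finished question.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Progress:
    done: int          # questions scored so far
    total: int         # questions in the whole run
    elapsed_s: float   # seconds since the run started


def eta_seconds(p: Progress) -> Optional[float]:
    """Estimate remaining seconds from the average time per finished question."""
    if p.done == 0:
        return None  # nothing finished yet, so there is no rate to extrapolate
    per_question = p.elapsed_s / p.done
    return per_question * (p.total - p.done)


def score_out_of_100(points: List[float], max_points: float = 1.0) -> float:
    """Average the per-question points and rescale to a 0-100 score."""
    if not points:
        return 0.0
    return 100.0 * sum(points) / (max_points * len(points))
```

A live monitor could call `eta_seconds` after every scored question and refresh the display; the estimate gets steadier as more questions finish.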

Progress related to benchmarking:

3x question count compared to Waifmark 1
~2x faster parallel automated scoring
Memory system (short/long context) and agentic tool-calling (basic file reading, shell, research) functions are fully implemented as part of local agentic benchmarking.
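
One way parallel automated scoring like this could be set up is with a thread pool, since judge calls mostly wait on the model server. This is only a sketch under assumptions: `judge_one` is a hypothetical stand-in that uses a keyword check where a real judge request would go, and the item fields (`id`, `answer`, `expected`) are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def judge_one(item: Dict) -> Dict:
    """Hypothetical stand-in for one judge call (normally a request to the judge model)."""
    # Placeholder rule: full marks if the expected keyword appears in the answer.
    hit = item["expected"].lower() in item["answer"].lower()
    return {"id": item["id"], "points": 1.0 if hit else 0.0}


def score_parallel(items: List[Dict], workers: int = 4) -> List[Dict]:
    """Score items concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(judge_one, items))
```

Any real speedup depends on how many requests the local server can handle at once, so `workers` is something to tune, not a fixed value.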

@qb on waifmark-2 · 15 days ago

2h 35m 35s logged

Devlog 0.1

(For a future project.)
I’m trying out LLM benchmarking on my M1 MacBook Pro, which probably needs a break!

Prior to Stardance -
Waifmark 1 was a benchmark testing local agentic capabilities and speech persona of small locally hosted (V)LLMs.
However, my benchmarking process for Waifmark 1 was unstandardised and troublesome, and I kept all the data in Excel, of all places.
Current stage -
Waifmark 2 is in the works. The benchmark is now evaluated by an LLM-as-a-Judge that flags low-confidence outputs and passes them on for human review (as is industry standard).
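
The flag-for-review step can be pictured roughly as below. This is a minimal sketch, not the real Waifmark 2 logic: the `Verdict` fields, the `triage` function and the 0.7 threshold are all assumptions, and it presumes the judge reports a confidence between 0 and 1 alongside its score.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Verdict:
    question_id: str
    points: float       # judge's score for the response
    confidence: float   # judge's self-reported confidence, 0.0 to 1.0 (assumed)


def triage(verdicts: List[Verdict], threshold: float = 0.7) -> Tuple[List[Verdict], List[Verdict]]:
    """Split verdicts into auto-accepted ones and ones queued for human review."""
    accepted: List[Verdict] = []
    needs_review: List[Verdict] = []
    for v in verdicts:
        # Anything below the threshold is not trusted and goes to a human.
        if v.confidence >= threshold:
            accepted.append(v)
        else:
            needs_review.append(v)
    return accepted, needs_review
```

A lower threshold means less manual work but more judge mistakes slipping through, so the cutoff is a trade-off to tune.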

Using the wonderful Streamlit library, I built a basic app that can download a model from Hugging Face (HF), serve it, and benchmark it in 3 steps. Unfortunately I can’t show any more of the process for now, but I’m very excited to join Stardance and to see what changes can be observed moving from Waifmark 1 -> 2!