Devlog by @qb - Stardance

2h 35m 35s logged

Devlog 0.1

(For a future project.)
I’m trying out LLM benchmarking on my M1 Macbook Pro that probably needs a break!

Prior to Stardance -
Waifmark 1 was a benchmark testing local agentic capabilities and speech persona of small locally hosted (V)LLMs.
However, my benchmarking process for Waifmark 1 was unstandardised and troublesome, and I kept all the data in excel out of all places.
Current stage -
Waifmark 2 is in the works. The benchmark is now evaluated by an LLM-as-a-Judge that can flag and pass low-confidence outputs for human review (as is industry standard).

Using the wonderful streamlit library I built a basic app that can download you a model from hf, serve the model and benchmark it in 3 steps. Unfortunately I cannot show any more behind the process as of now, but I’m very excited to join Stardance and to see what changes can be observed moving from Waifmark 1 -> 2!