waifmark-2
- 2 Devlogs
- 8 Total hours
Updating (and automating) my benchmark, which tests the local agentic capabilities and speech persona of small (V)LLMs.
Dev2
It’s been so long, but the UI looks a little better now (?). The process has been basically fully streamlined:
Progress related to benchmarking:
Devlog 0.1
(For a future project.)
I’m trying out LLM benchmarking on my M1 MacBook Pro, which probably needs a break!
Prior to Stardance -
Waifmark 1 was a benchmark testing local agentic capabilities and speech persona of small locally hosted (V)LLMs.
However, my benchmarking process for Waifmark 1 was unstandardised and troublesome, and I kept all the data in Excel, of all places.
Current stage -
Waifmark 2 is in the works. The benchmark is now evaluated by an LLM-as-a-Judge that flags low-confidence outputs and passes them on for human review (a common practice in LLM evaluation).
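To make the flag-and-pass idea concrete, here is a minimal sketch of routing judge verdicts by confidence. This is not the actual Waifmark 2 code: the `Verdict` fields, the `route_verdicts` helper and the 0.7 threshold are all hypothetical names and values chosen for illustration, and it assumes the judge reports a confidence between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """One judge decision (hypothetical structure, not Waifmark's real schema)."""
    sample_id: str
    score: float       # judge's score for the model output, 0.0 to 1.0
    confidence: float  # judge's self-reported confidence, 0.0 to 1.0


def route_verdicts(verdicts, min_confidence=0.7):
    """Split verdicts into auto-accepted ones and ones flagged for human review.

    min_confidence is an assumed cut-off; a real benchmark would tune it.
    """
    accepted, review = [], []
    for v in verdicts:
        if v.confidence >= min_confidence:
            accepted.append(v)
        else:
            review.append(v)
    return accepted, review


verdicts = [
    Verdict("q1", score=0.9, confidence=0.95),
    Verdict("q2", score=0.4, confidence=0.50),
    Verdict("q3", score=0.7, confidence=0.70),
]
accepted, review = route_verdicts(verdicts)
print([v.sample_id for v in accepted])  # ['q1', 'q3']
print([v.sample_id for v in review])    # ['q2']
```

Only the flagged list would need a human look, so the manual work scales with how unsure the judge is rather than with the size of the benchmark.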
Using the wonderful Streamlit library, I built a basic app that can download a model from Hugging Face (hf), serve it, and benchmark it in 3 steps. Unfortunately, I cannot show any more of what goes on behind the process as of now, but I’m very excited to join Stardance and to see what changes can be observed moving from Waifmark 1 -> 2!