2h 35m 35s logged

Devlog 0.1

(For a future project.)
I’m trying out LLM benchmarking on my M1 MacBook Pro, which probably needs a break!

  • Prior to Stardance -
    Waifmark 1 was a benchmark testing the agentic capabilities and speech persona of small, locally hosted (V)LLMs.
    However, my benchmarking process for Waifmark 1 was unstandardised and troublesome, and I kept all the data in Excel, of all places.

  • Current stage -
    Waifmark 2 is in the works. The benchmark is now evaluated by an LLM-as-a-Judge that flags low-confidence outputs and passes them on for human review (a common setup for this kind of evaluation).
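
To make the flag-and-pass idea concrete, here is a minimal sketch of how judge verdicts could be split by confidence. It is not the actual Waifmark 2 code: the `Verdict` fields, the `route` function, the sample prompt IDs, and the 0.7 threshold are all hypothetical, and it assumes the judge returns a confidence between 0.0 and 1.0 alongside its score.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """One judge result (hypothetical shape, not the real Waifmark 2 schema)."""
    prompt_id: str
    score: float       # judge's score for the model output, 0.0 to 1.0
    confidence: float  # judge's self-reported confidence, 0.0 to 1.0


def route(verdicts, threshold=0.7):
    """Split verdicts into auto-accepted ones and ones flagged for human review.

    The 0.7 threshold is an arbitrary illustrative value.
    """
    accepted, needs_review = [], []
    for v in verdicts:
        if v.confidence >= threshold:
            accepted.append(v)
        else:
            needs_review.append(v)
    return accepted, needs_review


# Made-up sample verdicts, purely for illustration.
verdicts = [
    Verdict("persona-01", score=0.9, confidence=0.95),
    Verdict("agentic-04", score=0.4, confidence=0.55),
    Verdict("persona-02", score=0.7, confidence=0.70),
]

accepted, needs_review = route(verdicts)
print([v.prompt_id for v in accepted])      # ['persona-01', 'persona-02']
print([v.prompt_id for v in needs_review])  # ['agentic-04']
```

Only the flagged verdicts would need a human look, so the review workload depends on where the threshold is set.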

Using the wonderful Streamlit library, I built a basic app that can download a model from Hugging Face, serve it, and benchmark it in 3 steps. Unfortunately I can’t show any more of the process just yet, but I’m very excited to join Stardance and to see what changes show up moving from Waifmark 1 -> 2!
