waifmark-2

  • 2 Devlogs
  • 8 Total hours

Updating (and automating) my benchmark, which tests the local agentic capabilities and speech persona of small (V)LLMs.

5h 5m 9s logged

Dev2

It’s been a while, but the UI looks a little better now (?). The process is now almost fully streamlined:

  1. HF model download
    -> pick quant/download type
  2. Benchmarking
    -> start server -> run benchmark, score/100 evaluation, ETA and live console monitoring
  3. Result auditing (will continue to work on this)
    -> human review of responses flagged by the LLM judge
  4. Results showcase/leaderboard
    -> WIP
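
To make step 2 a bit more concrete, here is a minimal sketch of how a score/100 and an ETA could be computed during a run. The function names and the averaging approach are assumptions for illustration, not the actual Waifmark 2 code.

```python
def score_out_of_100(points_earned: float, points_possible: float) -> float:
    """Normalise raw points to a 0-100 scale, rounded to one decimal place."""
    if points_possible <= 0:
        raise ValueError("points_possible must be positive")
    return round(100.0 * points_earned / points_possible, 1)


def eta_seconds(done: int, total: int, elapsed: float) -> float:
    """Estimate remaining seconds from the average time per finished question."""
    if done <= 0:
        # Nothing has finished yet, so there is no basis for an estimate.
        return float("inf")
    return (total - done) * (elapsed / done)
```

For example, `eta_seconds(10, 30, 20.0)` assumes the remaining 20 questions take as long on average as the first 10 did. A rolling average would react faster if later questions are slower, but this simple version is enough for a progress display.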

Progress related to benchmarking:

  • 3x the question count of Waifmark 1
  • ~2x faster parallel automated scoring
  • Memory system (short/long context) and agentic tool calling (basic file reading, shell, research) are fully implemented as part of local agentic benchmarking.
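
As a rough illustration of the parallel scoring point, a thread pool is one common way to overlap judge calls when each call is mostly waiting on a model server. This is a sketch under that assumption: `judge_stub` is a hypothetical stand-in for a real judge request, and the devlog does not say how Waifmark 2 actually parallelises scoring.

```python
from concurrent.futures import ThreadPoolExecutor


def judge_stub(response: str) -> int:
    """Hypothetical stand-in for one LLM-judge call: 1 for a non-empty answer, else 0."""
    return 1 if response.strip() else 0


def score_parallel(responses, judge=judge_stub, max_workers=4):
    """Score all responses concurrently and return the scores in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(judge, responses))
```

Threads only help here if the judge call is I/O-bound (for example an HTTP request to a local server). If the judge runs in-process and is compute-bound, the speedup would have to come from batching or a different setup.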


2h 35m 35s logged

Devlog 0.1

(For a future project.)
I’m trying out LLM benchmarking on my M1 MacBook Pro that probably needs a break!

  • Prior to Stardance -
    Waifmark 1 was a benchmark testing the local agentic capabilities and speech persona of small, locally hosted (V)LLMs.
    However, my benchmarking process for Waifmark 1 was unstandardised and troublesome, and I kept all the data in Excel, of all places.

  • Current stage -
    Waifmark 2 is in the works. The benchmark is now evaluated by an LLM-as-a-Judge that can flag low-confidence outputs and pass them on for human review (as is industry standard).
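
The flag-for-review idea can be sketched roughly like this, assuming the judge returns a confidence value between 0 and 1 alongside its score. The `Verdict` fields, the `triage` name and the 0.7 threshold are all hypothetical and not taken from the real project.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    question_id: str
    score: float       # score the judge assigned to the response
    confidence: float  # judge's self-reported confidence, 0.0 to 1.0 (assumed)


def triage(verdicts, threshold=0.7):
    """Split verdicts into auto-accepted ones and ones flagged for human review."""
    accepted, flagged = [], []
    for verdict in verdicts:
        if verdict.confidence >= threshold:
            accepted.append(verdict)
        else:
            flagged.append(verdict)
    return accepted, flagged
```

Everything in `flagged` would then go to the human review step, while `accepted` verdicts count towards the score directly.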

Using the wonderful Streamlit library, I built a basic app that can download a model from Hugging Face, serve it and benchmark it in 3 steps. Unfortunately I can’t show any more of the process behind it for now, but I’m very excited to join Stardance and to see what changes can be observed moving from Waifmark 1 -> 2!
