aero-deuce - Stardance

Ship #1

@zemu on aero-deuce · about 2 months ago

I made Aero-Deuce, a 12-billion-parameter AI language model that runs locally on consumer hardware. It’s an instruction-following chatbot fine-tuned and post-trained from Google’s Gemma 4 12B using QLoRA with the Muon optimizer. Muon is a relatively new optimizer that applies matrix orthogonalization to its updates to keep training efficient. The model was trained on 30,000 instruction-following examples across 2,000 steps, with the training loss going from 3.82 down to 0.57. The final model is exported in three formats: a GGUF file (~7GB) that works with anything supporting llama.cpp (LM Studio, Ollama, GPT4All), an MLX version optimized for Apple Silicon, and the raw LoRA adapter for anyone who wants to merge it or fine-tune further. I also built a live API endpoint on Modal with streaming support, so anyone can chat with it through a web interface.
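To make "matrix orthogonalization" concrete, here is a minimal sketch of the idea in plain Python. It is not the project's optimizer code: it uses the textbook cubic Newton-Schulz iteration as a stand-in for the tuned iteration Muon uses, and the 2x2 "gradient" is made up for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]

def orthogonalize(grad, steps=15, eps=1e-7):
    """Approximate the nearest orthogonal matrix to a full-rank `grad`.

    Uses the classic cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X*X^T*X.
    This is a stand-in chosen for clarity, not the tuned variant Muon uses.
    """
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration stable.
    norm = math.sqrt(sum(v * v for row in grad for v in row)) + eps
    x = [[v / norm for v in row] for row in grad]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x

grad = [[3.0, 1.0], [1.0, 2.0]]  # made-up 2x2 "gradient" for illustration
update = orthogonalize(grad)
gram = matmul(update, transpose(update))  # close to the identity if orthogonal
```

The point of the step is that every direction in the update ends up with roughly the same magnitude, instead of a few directions dominating.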

It's named "Aero-Deuce" because deuce means two and this is my second attempt at LLMs. My first was a custom 300M-parameter model with an unusual architecture, and I didn't have the funds to get that one off the ground.

The most challenging part was the training pipeline itself. I started on Modal with spot GPUs, which meant my training kept getting preempted. I had to build robust checkpoint resumption just to survive it. Then I switched to Lightning AI for the second half. The base model uses a "unified" architecture (gemma4_unified) that most tooling didn't support: PEFT didn't recognize it and MLX couldn't run it, so I had to manually strip multimodal weights, rename parameter keys, and patch configs to make it work everywhere. Exporting to GGUF required building llama.cpp from source. The HuggingFace uploads stalled repeatedly. There were a lot of nights where nothing worked and I had to dig through error logs at 2am.
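For readers who have not dealt with preemptible GPUs, here is a rough sketch of what checkpoint resumption involves. It is a toy under stated assumptions, not the project's pipeline: the state is a small JSON dict, and the names `save_checkpoint`, `load_checkpoint`, `train`, and the `stop_at` parameter are hypothetical.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write `state` atomically so a preemption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename on the same filesystem

def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)

def train(path, total_steps, save_every=100, stop_at=None):
    """Toy loop that resumes from `path`; `stop_at` simulates a preemption."""
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if stop_at is not None and step > stop_at:
            return state  # simulated spot-instance preemption
        state = {"step": step, "loss": 1.0 / step}  # placeholder for a real update
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(path, state)
    return state
```

A real training checkpoint would also need the adapter weights, optimizer state, scheduler state, and the data position. The write-to-temp-then-rename pattern is the part that protects against being killed halfway through a save.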

I'm proud that it actually works end to end. A 12B model that you can download as a single 7GB file and run on a MacBook Air: that's real. The training converged well, the model follows instructions properly, and it identifies itself as Aero-Deuce instead of claiming to be Gemma. I built the entire pipeline from scratch: data loading with loss masking, the Muon optimizer implementation, dual-optimizer parameter partitioning, checkpoint resumption, the export pipeline, and the inference API. It's not a notebook tutorial; it's a real system.
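Two of the pipeline pieces listed above, loss masking and dual-optimizer parameter partitioning, are easy to illustrate in a few lines. This is a sketch under assumptions, not the project's code: the function names and parameter names are hypothetical, and the rule "2-D matrices go to Muon, everything else goes to AdamW" is a common convention assumed here rather than something the post confirms.

```python
IGNORE_INDEX = -100  # label value PyTorch's cross-entropy loss skips by default

def mask_prompt_labels(prompt_ids, response_ids):
    """Build (input_ids, labels) so only response tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def partition_params(named_shapes):
    """Split parameter names into a Muon group and an AdamW group.

    `named_shapes` maps a parameter name to its shape tuple. 2-D matrices go
    to Muon; embeddings, norms, and other tensors go to AdamW (assumed rule).
    """
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        if len(shape) == 2 and "embed" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

# Hypothetical token ids and parameter names, for illustration only.
input_ids, labels = mask_prompt_labels([5, 6, 7], [8, 9])
shapes = {
    "embed_tokens.weight": (1000, 64),
    "layers.0.q_proj.lora_A.weight": (8, 64),
    "layers.0.q_proj.lora_B.weight": (64, 8),
    "layers.0.norm.weight": (64,),
}
muon_names, adamw_names = partition_params(shapes)
```

Masking the prompt means the model is only graded on the answer it should produce, not on reproducing the instruction.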

To test it: go to the live demo at https://aero-deuce-lander.vercel.app, or download the GGUF from huggingface.co/ZeZZm/aero-deuce-GGUF and open it in LM Studio or GPT4All. If you're on a Mac, you can run pip install mlx-lm and then python -m mlx_lm.generate --model ZeZZm/aero-deuce-MLX --prompt "anything". The code is at github.com/Ryz3nPlayZ/aero-deuce. Everything is licensed under Apache 2.0.

Thanks!

4 devlogs
10h
17.74x multiplier
174 Stardust



@zemu on aero-deuce · about 2 months ago

18m 6s logged

finished landing page


@zemu on aero-deuce · about 2 months ago

2h 3m 38s logged

adapter, GGUF, and MLX (q4) are all on HuggingFace
I made a simple landing page
hosted a custom inference endpoint (might not work very well because I'm broke)
should be able to run locally through tools like Ollama and llama.cpp


@zemu on aero-deuce · about 2 months ago

3h 32m 38s logged

I made a custom AI model called Aero-Deuce. it’s based on Google’s Gemma 4 12B, but we fine-tuned/post-trained it using a technique called QLoRA with a custom optimizer called Muon. the idea was to make a small, efficient instruction-following model that runs locally on consumer hardware.

we trained it on 30,000 instruction-following examples for 2,000 steps. the training loss dropped from 3.82 to 0.57. it runs at about 14-16 tokens per second on a MacBook Air M5.
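One way to read those loss numbers is as perplexity. The sketch below assumes the reported loss is mean per-token cross-entropy in nats, which the post does not state, so treat the conversion as illustrative.

```python
import math

def perplexity(mean_cross_entropy_nats):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_nats)

start = perplexity(3.82)  # loss at the start of training
end = perplexity(0.57)    # loss at the end of training
print(f"perplexity: {start:.1f} -> {end:.2f}")
```

Under that assumption, perplexity on the training data drops from about 45.6 to about 1.77. This is training loss, so it says nothing about held-out quality.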

the model is exported in three formats: a raw LoRA adapter (for merging with the base model), a GGUF file (runs in llama.cpp, LM Studio, Ollama, etc.), and an MLX version (optimized for Apple Silicon). all three will be on HuggingFace soon. the code is on GitHub.
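"Merging" a LoRA adapter means folding its low-rank update into the base weight. A minimal sketch of the standard formula, W + (alpha / r) * B @ A, using tiny made-up matrices (the function name and values are hypothetical, not taken from the project):

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha):
    """Return weight + (alpha / r) * lora_b @ lora_a, the standard LoRA merge."""
    rank = len(lora_a)  # lora_a has shape (rank, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape (out_features, in_features)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

weight = [[1.0, 0.0], [0.0, 1.0]]  # made-up base weight
lora_a = [[1.0, 2.0]]              # rank 1, in_features 2
lora_b = [[0.5], [0.0]]            # out_features 2, rank 1
merged = merge_lora(weight, lora_a, lora_b, alpha=2.0)
```

With a QLoRA-trained adapter the base weights are quantized during training, so the merge is typically done against a higher-precision copy of the base model.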

still need to benchmark it against the base model and submit it to the HuggingFace leaderboard.

code: github.com/Ryz3nPlayZ/aero-deuce
adapter: huggingface.co/ZeZZm/aero-deuce
gguf: huggingface.co/ZeZZm/aero-deuce-GGUF
mlx: huggingface.co/ZeZZm/aero-deuce-MLX


@zemu on aero-deuce · about 2 months ago

3h 55m 45s logged

45% of the way through the test post-training run, but I'm already broke