Makemore

  • 2 Devlogs
  • 15 Total hours

LLM generator practice. The Gradio demo link may not work while I am between versions.

11h 7m 7s logged

Tiny Shakespeare (decoder-only transformer)

Dataset:

  • Tiny Shakespeare, ~1.1M characters
  • vocab 65

Architecture: GPT-style decoder

  • token embedding + learned position embedding (embd=384)
  • 6 transformer blocks, each:
    • multi-head causal self-attention (8 heads, head_size=48)
    • feed-forward (Linear → ReLU → Linear)
    • pre-LayerNorm + residual connections on both sublayers
  • final LayerNorm → linear to vocab
  • block_size=64, dropout=0.2
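
To get a feel for the size of this config, here is a rough parameter count. It is only a sketch: the devlog does not state the feed-forward width or which layers carry a bias, so the `4 * embd` hidden width and the bias choices below are assumptions (they follow the usual GPT-style layout), and the helper names are made up for illustration.

```python
# Rough parameter count for the config listed above.
# Assumptions (not stated in the devlog): feed-forward hidden width is 4 * embd,
# q/k/v projections have no bias, every other Linear layer has a bias.

vocab_size = 65
embd = 384
n_blocks = 6
block_size = 64
ffn_hidden = 4 * embd  # assumption


def linear(n_in, n_out, bias=True):
    """Number of parameters in a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def layer_norm(n):
    """LayerNorm has one scale and one shift vector."""
    return 2 * n


# Token embedding table plus learned position embedding table.
embeddings = vocab_size * embd + block_size * embd
# 8 heads x head_size 48 = 384, so q, k, v together are three embd x embd maps,
# followed by the output projection.
attention = 3 * linear(embd, embd, bias=False) + linear(embd, embd)
feed_forward = linear(embd, ffn_hidden) + linear(ffn_hidden, embd)
per_block = attention + feed_forward + 2 * layer_norm(embd)
# Final LayerNorm plus the linear map to vocab logits.
head = layer_norm(embd) + linear(embd, vocab_size)

total = embeddings + n_blocks * per_block + head
print(f"per block: {per_block:,}")
print(f"total:     {total:,}")
```

Under these assumptions almost all of the parameters sit inside the 6 blocks; the embeddings and output head are tiny because the vocab is only 65.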

Training:

  • AdamW, lr=1e-4
  • ExponentialLR, gamma=0.99995
  • batch size 1024
  • 5000 iters
  • CUDA + torch.compile

Results:

  • train loss = 1.26
  • val loss = 1.52
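
For scale, the losses can be compared against a uniform-guess baseline and turned into per-character perplexity. This sketch assumes the reported numbers are mean cross-entropy in nats (the PyTorch default), which the devlog does not state explicitly.

```python
import math

vocab_size = 65
train_loss, val_loss = 1.26, 1.52  # values reported above

# Assumption: losses are mean cross-entropy measured in nats (natural log).
uniform_loss = math.log(vocab_size)      # loss of a model guessing all 65 chars equally
train_perplexity = math.exp(train_loss)  # effective number of equally likely next chars
val_perplexity = math.exp(val_loss)

print(f"uniform baseline loss: {uniform_loss:.3f}")
print(f"train perplexity:      {train_perplexity:.2f}")
print(f"val perplexity:        {val_perplexity:.2f}")
```

Read this way, the model narrows the next character from 65 equally likely options down to the equivalent of a handful on held-out text.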

Why it worked:

  • Attention replaces flattening. Instead of mashing all 64 embeddings into one vector, each position attends directly to the relevant earlier positions. Sequence structure is preserved instead of thrown away, which was the exact failure mode of mlp3.
  • The receptive field is the full block (64 chars): each position can attend to every earlier position in the block, not just a fixed local window.
  • Residual connections + pre-LayerNorm let 6 blocks stack and actually train.
  • Position embeddings give the model order information that a bag-of-embeddings MLP never had.
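
The causal part of the attention described above can be sketched in plain Python. This is not the training code: it takes the raw query-key scores as given (the scores are made up, and the usual scaling by the square root of head_size is left out) and only shows the masking and softmax step, where position t spreads its weight over positions 0..t and gives zero weight to the future.

```python
import math


def causal_attention_weights(scores):
    """Softmax each row over positions <= t; future positions get weight 0.

    scores[t][j] is the raw query-key score of position t looking at position j.
    """
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                      # causal mask: drop every j > t
        peak = max(visible)                         # subtract the max for stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights


# Toy 4-position example with made-up scores.
scores = [
    [0.0, 9.0, 9.0, 9.0],
    [1.0, 1.0, 9.0, 9.0],
    [0.0, 2.0, 0.0, 9.0],
    [3.0, 0.0, 0.0, 0.0],
]
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

The large scores above the diagonal never matter, and the last row shows a late position putting most of its weight on the very first one, which a fixed local window could not do.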

Against the mlp3 goals:

  • Goal was dev loss < 1.7 → hit 1.52.
  • Goal was “generate actual words and sentences” → it now produces speaker tags, line breaks, and mostly real English words in Shakespearean cadence. It learned the SPEAKER:\n dialogue form like mlp3 did, but this time it fills the dialogue with coherent structure instead of gibberish.

The train/dev gap:

  • 1.26 vs 1.52 → much tighter than mlp3’s 1.12/2.02 (a gap of 0.26 vs 0.90). Dropout 0.2 + a context long enough to actually be useful means it’s learning real patterns, not memorizing short-range ones.

Notes / loose ends:

  • LR never decays much → gamma=0.99995 over 5000 steps only drops lr from 1e-4 to ~7.8e-5, so the schedule is nearly flat. Loss was still falling at step 5000; more iters or a higher/decaying lr would likely push val below 1.5.
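
The decay arithmetic above is easy to check by hand. The sketch assumes the scheduler is stepped once per training iteration; the `target_lr` at the end is a hypothetical value, included only to show what gamma a stronger decay would need.

```python
base_lr = 1e-4
gamma = 0.99995
steps = 5000

# ExponentialLR multiplies the learning rate by gamma on every scheduler step.
# Assumption: one scheduler step per training iteration.
final_lr = base_lr * gamma ** steps
print(f"lr after {steps} steps: {final_lr:.3e}")
print(f"fraction of base lr:   {final_lr / base_lr:.3f}")

# Gamma that would reach a chosen target over the same number of steps.
target_lr = 1e-5  # hypothetical target, not from the devlog
needed_gamma = (target_lr / base_lr) ** (1 / steps)
print(f"gamma for a 10x decay: {needed_gamma:.6f}")
```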

Next:

  • Decay the LR properly (or warmup + cosine) and train longer to close in on ~1.4.
  • Scale block_size for longer context now that attention makes it affordable.
  • Develop GPT-2 (the old sucky model from 2019).
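
One possible shape for the warmup + cosine idea is sketched below. Every number except the 1e-4 peak and the 5000 iters is an assumption (`min_lr`, `warmup_iters`), and `lr_at` is a made-up helper name, not something from the training script.

```python
import math


def lr_at(step, max_lr=1e-4, min_lr=1e-5, warmup_iters=200, max_iters=5000):
    """Learning rate for a given training step (defaults are illustrative)."""
    if step < warmup_iters:
        # Linear ramp from max_lr / warmup_iters up to max_lr.
        return max_lr * (step + 1) / warmup_iters
    if step >= max_iters:
        return min_lr
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_iters) / (max_iters - warmup_iters)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * progress)) * (max_lr - min_lr)


for step in (0, 199, 200, 2600, 4999, 5000):
    print(step, f"{lr_at(step):.2e}")
```

In PyTorch a function like this could be plugged in through `torch.optim.lr_scheduler.LambdaLR`, keeping in mind that LambdaLR expects a multiplier on the base lr rather than an absolute value, so the result would need dividing by the optimizer's lr.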

3h 45m 24s logged

makemore — devlog

mlp2-gpu: names MLP

Dataset: Karpathy’s names.txt (32k names, vocab 27)

Architecture:
embed=10 -> flatten -> linear(200) -> tanh -> linear(27)

Training:

  • AdamW, lr=1e-3
  • 200 epochs
  • MPS

Result:
dev loss = 2.20

Samples:
bhuza
fremah
mykeslanna
talee

The model worked because names are short and mostly depend on local character patterns.

mlp3: Tiny Shakespeare (without a guide and with a single hidden layer)

Dataset:

  • 1.1M characters
  • vocab 65

Architecture:
embed=32, hidden=400, block_size=16

Training:

  • batch size 8192
  • 600 epochs
  • CUDA

Results:
train loss = 1.12
dev loss = 2.02

Samples:

LERCUTIF:
MARCAPUL:
HENRY:

Why it failed:

  • 16 characters of context is too short for prose.
  • Flattening embeddings throws away sequence structure.
  • The model learned the easiest pattern: SPEAKER:\n dialogue.
  • Large train/dev gap suggests overfitting to short-range patterns.

The model was not too small. It was the wrong architecture.

Next:

WaveNet-style hierarchical MLP from the next lecture.

Instead of flattening all embeddings at once, combine them in stages to preserve locality and expand the receptive field.
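
A shape-only sketch of what "combine them in stages" could look like for the mlp3 sizes. It does no training; it just lists the linear layers a pairwise hierarchy would need. The group size of 2 and reusing hidden=400 at every stage are assumptions, and `hierarchical_stages` is a hypothetical helper.

```python
def hierarchical_stages(block_size=16, embed=32, hidden=400, group=2):
    """Return (positions_out, features_in, features_out) for each fusing stage.

    Assumes block_size is a power of `group`.
    """
    stages = []
    positions, features = block_size, embed
    while positions > 1:
        positions //= group            # neighbours are merged in groups
        stages.append((positions, features * group, hidden))
        features = hidden              # the next stage sees hidden-sized vectors
    return stages


for positions, n_in, n_out in hierarchical_stages():
    print(f"{positions:2d} positions, linear({n_in} -> {n_out})")

# For comparison, the flat mlp3 layout fuses all 16 embeddings in one step.
print(f" 1 position,  linear({16 * 32} -> 400)")
```

With 16 characters and pairs, the context is fused over 4 stages, so neighbouring characters are combined first and wider context only enters at the later stages.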

Goal:

  • dev loss < 1.7 on Tiny Shakespeare
  • generate actual words and sentences

After that: transformers.
