@xerneas on Makemore · 5 days ago

11h 7m 7s logged

Tiny Shakespeare (decoder-only transformer)

Dataset:

Tiny Shakespeare, ~1.1M characters
vocab 65

Architecture: GPT-style decoder

token embedding + learned position embedding (embd=384)
6 transformer blocks, each:
- multi-head causal self-attention (8 heads, head_size=48)
- feed-forward (Linear → ReLU → Linear)
- pre-LayerNorm + residual connections on both sublayers
final LayerNorm → linear to vocab
block_size=64, dropout=0.2
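
For a sense of scale, here is a rough parameter count for the architecture above. It is only a sketch under assumptions the post does not state: a 4x feed-forward width, bias-free query/key/value projections, biases on every other linear layer, and an output head that is not tied to the token embedding. The function name and its arguments are made up for illustration.

```python
def gpt_param_count(vocab, n_embd, block_size, n_layer, n_head, head_size, ffn_mult=4):
    """Count learnable parameters of a GPT-style decoder (see assumptions above)."""
    # The heads are concatenated back to the embedding width.
    assert n_head * head_size == n_embd

    # Attention: q, k, v projections without bias (assumption),
    # plus an output projection with bias.
    attn = 3 * n_embd * n_embd + (n_embd * n_embd + n_embd)

    # Feed-forward: Linear -> ReLU -> Linear, hidden width = ffn_mult * n_embd (assumption).
    hidden = ffn_mult * n_embd
    ffn = (n_embd * hidden + hidden) + (hidden * n_embd + n_embd)

    # Two LayerNorms per block, each with a weight and a bias vector.
    norms = 2 * (2 * n_embd)
    block = attn + ffn + norms

    # Token embedding table + learned position embedding table.
    embeddings = vocab * n_embd + block_size * n_embd

    # Final LayerNorm + linear layer to the vocabulary (untied, with bias).
    head = 2 * n_embd + (n_embd * vocab + vocab)

    return embeddings + n_layer * block + head


total = gpt_param_count(vocab=65, n_embd=384, block_size=64,
                        n_layer=6, n_head=8, head_size=48)
print(f"{total:,} parameters")
```

Under those assumptions the count comes out at about 10.7M, almost all of it inside the 6 blocks; the embeddings and the output head are a rounding error at vocab 65.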

Training:

AdamW, lr=1e-4
ExponentialLR, gamma=0.99995
batch size 1024
5000 iters
CUDA + torch.compile

Results:

train loss = 1.26
val loss = 1.52

Why it worked:

Attention replaces flattening. Instead of mashing all 64 embeddings into one vector, each position attends directly to the earlier positions that matter. Sequence structure is preserved instead of thrown away, and throwing it away was the exact failure mode of mlp3.
The receptive field is the full block (64 chars): each position can attend to every earlier position in the block, not just a fixed local window.
Residual connections + pre-LayerNorm let 6 blocks stack and actually train.
Position embeddings give the model order information that a bag-of-embeddings MLP never had.
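
The "attends to earlier positions" part can be sketched in plain Python. This is not the training code (that runs in PyTorch on tensors); it is a dependency-free illustration of causal masking, and the helper names `causal_attention_weights` and `attend` are made up here. In the real model the scores would be scaled query-key dot products; here they are just passed in.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over a T x T score matrix with the future masked out.

    scores[t][s] is the raw affinity of query position t for key position s.
    Position t only gets weight on positions s <= t.
    """
    T = len(scores)
    weights = []
    for t in range(T):
        visible = scores[t][: t + 1]            # drop future positions
        m = max(visible)                        # subtract max for numerical stability
        exps = [math.exp(x - m) for x in visible]
        total = sum(exps)
        # Future positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (T - t - 1))
    return weights


def attend(weights, values):
    """Each output row is a weighted average of the value vectors it may see."""
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
            for row in weights]


# With all-zero scores, position t simply averages positions 0..t.
uniform = causal_attention_weights([[0.0] * 4 for _ in range(4)])
averaged = attend(uniform, [[1.0], [3.0], [5.0], [7.0]])
```

Unlike the flattened MLP, nothing here hard-codes which earlier position matters: the weights are recomputed for every input, and the position embeddings are what let the scores depend on order.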

Against the mlp3 goals:

Goal was dev loss < 1.7 → hit 1.52.
Goal was “generate actual words and sentences” → it now produces speaker tags, line breaks, and mostly real English words in Shakespearean cadence. It learned the SPEAKER:\n dialogue form like mlp3 did, but this time it fills the dialogue with coherent structure instead of gibberish.

The train/dev gap:

1.26 vs 1.52 → much tighter than mlp3’s 1.12/2.02 (a gap of 0.26 vs 0.90). Dropout 0.2 + a context long enough to actually be useful suggests it’s learning real patterns, not memorizing short-range ones.
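
To make the dev numbers easier to read, they can be turned into per-character perplexity. This assumes the reported losses are mean cross-entropy in nats per character, which is what PyTorch's cross-entropy loss returns by default; the post does not say so explicitly. The names below are illustrative.

```python
import math

VOCAB_SIZE = 65  # from the dataset notes

# Dev losses reported in the two devlogs.
DEV_LOSS = {"mlp3": 2.02, "transformer": 1.52}


def perplexity(loss_nats):
    """Effective number of equally likely next characters."""
    return math.exp(loss_nats)


# Loss of a model that guesses uniformly over the vocabulary.
uniform_loss = math.log(VOCAB_SIZE)

for name, loss in DEV_LOSS.items():
    print(f"{name}: dev loss {loss:.2f} -> perplexity {perplexity(loss):.2f}")
print(f"uniform guess: loss {uniform_loss:.2f} -> perplexity {VOCAB_SIZE}")
```

Read this way, the transformer is choosing among roughly 4.6 equally likely characters on held-out text, against roughly 7.5 for mlp3 and 65 for blind guessing.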

Notes / loose ends:

LR never decays much → gamma=0.99995 over 5000 steps only drops lr from 1e-4 to ~7.8e-5, so the schedule is nearly flat. Loss was still falling at step 5000; more iters or a higher/decaying lr would likely push val below 1.5.
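
The schedule arithmetic is easy to check by hand. The sketch below assumes the scheduler is stepped once per training iteration, so the learning rate after n steps is lr0 * gamma^n; the function name is made up.

```python
import math


def exponential_lr(base_lr, gamma, step):
    """Learning rate after `step` scheduler steps of exponential decay."""
    return base_lr * gamma ** step


BASE_LR = 1e-4
GAMMA = 0.99995
ITERS = 5000

final_lr = exponential_lr(BASE_LR, GAMMA, ITERS)

# Steps needed for this gamma to halve the learning rate.
steps_to_halve = math.log(0.5) / math.log(GAMMA)

print(f"lr after {ITERS} steps: {final_lr:.3e}")
print(f"steps to halve the lr: {steps_to_halve:.0f}")
```

With this gamma the learning rate would need close to 14k steps just to halve, so a 5000-step run never leaves the flat part of the curve.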

Next:

Decay the LR properly (or use warmup + cosine) and train longer to close in on ~1.4.
Scale block_size for longer context now that attention makes it affordable.
Develop GPT-2 (the old sucky model from 2019)
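
One possible shape for the warmup + cosine idea mentioned above, as a plain function of the step number. Only the 1e-4 peak and the 5000 iterations come from the run; the 200 warmup steps and the 1e-5 floor are placeholder values picked for illustration, not tuned settings.

```python
import math


def warmup_cosine_lr(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


MAX_LR = 1e-4        # peak lr, same as the current run
MIN_LR = 1e-5        # hypothetical floor
WARMUP_STEPS = 200   # hypothetical warmup length
TOTAL_STEPS = 5000   # same as the current run

schedule = [warmup_cosine_lr(s, MAX_LR, MIN_LR, WARMUP_STEPS, TOTAL_STEPS)
            for s in range(TOTAL_STEPS + 1)]
```

Compared with the nearly flat exponential schedule, this one ends the run at a tenth of the peak rate, which is the "decay the LR properly" part.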

@xerneas on Makemore · 19 days ago

3h 45m 24s logged

makemore — devlog

mlp2-gpu: names MLP

Dataset: Karpathy’s names.txt (32k names, vocab 27)

Architecture:
embed=10 -> flatten -> linear(200) -> tanh -> linear(27)

Training:

AdamW, lr=1e-3
200 epochs
MPS

Result:
dev loss = 2.20

Samples:
bhuza
fremah
mykeslanna
talee

The model worked because names are short and mostly depend on local character patterns.

mlp3: Tiny Shakespeare (without a guide and with a single hidden layer)

Dataset:

1.1M characters
vocab 65

Architecture:
embed=32, hidden=400, block_size=16

Training:

batch size 8192
600 epochs
CUDA

Results:
train loss = 1.12
dev loss = 2.02

Samples:

LERCUTIF:
MARCAPUL:
HENRY:

Why it failed:

16 characters of context is too short for prose.
Flattening embeddings throws away sequence structure.
The model learned the easiest pattern: SPEAKER:\n dialogue.
Large train/dev gap suggests overfitting to short-range patterns.

The model was not too small. It was the wrong architecture.

Next:

WaveNet-style hierarchical MLP from the next lecture.

Instead of flattening all embeddings at once, combine them in stages to preserve locality and expand the receptive field.
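
A small sketch of what "combine them in stages" means, in plain Python. It only traces the grouping: a real model would put a linear layer and a nonlinearity after every merge to bring each merged vector back to a fixed hidden size. The helper names and the group size of 2 are assumptions for illustration.

```python
def flatten_consecutive(seq, n=2):
    """Concatenate every n adjacent vectors into one longer vector."""
    if len(seq) % n != 0:
        raise ValueError("sequence length must be divisible by n")
    return [[x for vec in seq[i:i + n] for x in vec]
            for i in range(0, len(seq), n)]


def stage_lengths(block_size, n=2):
    """Sequence length after each merge stage, down to a single vector."""
    lengths = [block_size]
    while lengths[-1] > 1:
        if lengths[-1] % n != 0:
            raise ValueError("block_size must be a power of n")
        lengths.append(lengths[-1] // n)
    return lengths


# Toy embeddings: 16 positions, 2 numbers each.
embeddings = [[i, 10 * i] for i in range(16)]
merged = flatten_consecutive(embeddings)   # 8 vectors of 4 numbers
print(stage_lengths(16))                   # [16, 8, 4, 2, 1]
```

With block_size=16 and pairs, that is 4 merge stages instead of one big flatten: neighbours are fused first, and the receptive field doubles at every stage.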

Goal:

dev loss < 1.7 on Tiny Shakespeare
generate actual words and sentences

After that: transformers.
