makemore — devlog
mlp2-gpu: names MLP
Dataset: Karpathy’s names.txt (32k names, vocab 27)
Architecture:
embed=10 -> flatten -> linear(200) -> tanh -> linear(27)
Training:
- AdamW, lr=1e-3
- 200 epochs
- MPS
Result:
dev loss = 2.20
Samples:
bhuza
fremah
mykeslanna
talee
The model worked because names are short and mostly depend on local character patterns.
mlp3: Tiny Shakespeare (without a guide, with a single hidden layer)
Dataset:
- 1.1M characters
- vocab 65
Architecture:
embed=32, hidden=400, block_size=16
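For scale, here is a rough parameter count for this configuration. It is a sketch that assumes the same layout as the names MLP (embedding table -> flatten -> linear -> tanh -> linear to the 65-character vocab) with bias terms on both linear layers. The biases and the helper name mlp_param_count are assumptions, not taken from the training code.

```python
def mlp_param_count(vocab, embed, block_size, hidden):
    """Count parameters of embed -> flatten -> linear -> tanh -> linear (assumed layout)."""
    embedding = vocab * embed                            # lookup table
    hidden_layer = block_size * embed * hidden + hidden  # weights + bias (bias assumed)
    output_layer = hidden * vocab + vocab                # weights + bias (bias assumed)
    return embedding + hidden_layer + output_layer

total = mlp_param_count(vocab=65, embed=32, block_size=16, hidden=400)
print(total)
```

Under these assumptions the model has roughly 233k parameters, nearly all of them in the hidden layer that reads the flattened 16 x 32 input.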
Training:
- batch size 8192
- 600 epochs
- CUDA
Results:
train loss = 1.12
dev loss = 2.02
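To put these numbers in context, the losses can be converted to perplexity and compared with a uniform guess over the 65-character vocab. This is a sketch that assumes the loss is the usual cross-entropy in nats, which these notes do not state explicitly. The helper name is illustrative.

```python
import math

def perplexity(loss_nats):
    """Perplexity = exp(cross-entropy), assuming the loss is measured in nats."""
    return math.exp(loss_nats)

vocab = 65
uniform_loss = math.log(vocab)   # loss of guessing every character with equal probability
train_ppl = perplexity(1.12)
dev_ppl = perplexity(2.02)

print(f"uniform baseline loss: {uniform_loss:.2f}")
print(f"train perplexity: {train_ppl:.2f}")
print(f"dev perplexity:   {dev_ppl:.2f}")
```

Read this way, the model is choosing among about 3 equally likely characters per step on train and about 7.5 on dev, against 65 for a uniform guess.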
Samples:
LERCUTIF:
MARCAPUL:
HENRY:
Why it failed:
- 16 characters of context is too short for prose.
- Flattening embeddings throws away sequence structure.
- The model learned the easiest pattern: SPEAKER:\n dialogue.
- The large train/dev gap (1.12 vs. 2.02) suggests overfitting to short-range patterns.
The model was not too small. It was the wrong architecture.
Next:
WaveNet-style hierarchical MLP from the next lecture.
Instead of flattening all embeddings at once, combine them in stages to preserve locality and expand the receptive field.
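A shape-only sketch of that idea, in plain Python with no learned weights: each stage concatenates adjacent pairs of vectors, where a real model would follow each merge with a linear layer and a nonlinearity. The fan-in of 2, the demo embedding size, and the function name merge_pairs are assumptions for illustration. The actual design may differ.

```python
def merge_pairs(vectors):
    """Concatenate each adjacent pair of vectors, halving the sequence length."""
    assert len(vectors) % 2 == 0, "sequence length must be even"
    return [vectors[i] + vectors[i + 1] for i in range(0, len(vectors), 2)]

block_size, embed = 16, 4   # small embedding size, chosen only for the demo
# Stand-in embeddings: position t gets the vector [t, t, t, t].
seq = [[t] * embed for t in range(block_size)]

receptive_field = 1
stages = []
while len(seq) > 1:
    seq = merge_pairs(seq)  # a real layer would also project back to a fixed width
    receptive_field *= 2    # each merged vector now covers twice as many characters
    stages.append((len(seq), len(seq[0]), receptive_field))

for length, width, rf in stages:
    print(f"positions={length:2d} width={width:3d} receptive_field={rf}")
```

With a block size of 16 and pairwise merging, four stages reduce the sequence to one vector. Neighbouring characters are combined first, and distant ones only meet in the later stages.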
Goal:
- dev loss < 1.7 on Tiny Shakespeare
- generate actual words and sentences
After that: transformers.