Makemore
- 2 Devlogs
- 15 Total hours
LLM generator practice. The Gradio demo link may not work while I am between versions.
makemore — devlog
mlp2-gpu: names MLP
Dataset: Karpathy’s names.txt (32k names, vocab 27)
Architecture:
embed=10 -> flatten -> linear(200) -> tanh -> linear(27)
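A rough parameter count for this stack, as a sketch rather than the actual training code. The context length is not recorded in this entry, so `block_size=3` is an assumption, as is a bias on both linear layers. `count_params` and its argument names are made up for illustration.

```python
def count_params(vocab_size, embed_dim, block_size, hidden):
    """Count weights and biases for embed -> flatten -> linear -> tanh -> linear."""
    embedding = vocab_size * embed_dim            # one vector per character
    flat_width = block_size * embed_dim           # width after flattening the context
    hidden_layer = flat_width * hidden + hidden   # weights + biases (bias assumed)
    output_layer = hidden * vocab_size + vocab_size
    return embedding + hidden_layer + output_layer


# block_size=3 is an assumed value, not taken from this devlog.
names_mlp_params = count_params(vocab_size=27, embed_dim=10, block_size=3, hidden=200)
print(names_mlp_params)
```

Under those assumptions the model is small, on the order of ten thousand parameters, and most of them sit in the two linear layers rather than in the embedding table.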
Training:
Result:
dev loss = 2.20
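To put that number in context, here is a small sanity check. It assumes the loss is mean cross-entropy in nats (natural log), which is not stated above. Under that assumption, a model that guesses uniformly over the 27 characters scores ln(27), and exp(loss) reads as an effective number of equally likely next characters.

```python
import math

vocab_size = 27
dev_loss = 2.20  # reported dev loss, assumed to be mean cross-entropy in nats

uniform_loss = math.log(vocab_size)   # loss of a uniform guess over the vocab
perplexity = math.exp(dev_loss)       # effective number of choices per character

print(f"uniform baseline: {uniform_loss:.3f}")
print(f"dev perplexity:   {perplexity:.2f}")
```

The uniform baseline is about 3.30, so 2.20 corresponds to narrowing 27 options down to roughly 9 per character.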
Samples:
bhuza
fremah
mykeslanna
talee
The model worked because names are short and mostly depend on local character patterns.
mlp3: Tiny Shakespeare (without a guide, with a single hidden layer)
Dataset:
Architecture:
embed=32, hidden=400, block_size=16
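With block_size=16, each training example is 16 consecutive characters with the following character as the target. A minimal sketch of that windowing, assuming a plain sliding window with no padding. `build_examples` is a hypothetical helper and the toy string is only a placeholder, not the dataset.

```python
def build_examples(text, block_size):
    """Slide a window over `text`: block_size characters in, the next character out."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
    ids = [stoi[ch] for ch in text]

    contexts, targets = [], []
    for start in range(len(ids) - block_size):
        contexts.append(ids[start:start + block_size])
        targets.append(ids[start + block_size])
    return contexts, targets, stoi


# Placeholder text, only to show the shapes.
toy_text = "to be, or not to be: that is the question"
contexts, targets, stoi = build_examples(toy_text, block_size=16)
print(len(contexts), len(contexts[0]))
```

Each context then becomes 16 embeddings of width 32, so the flattened input to the hidden layer is 16 * 32 = 512 values wide.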
Training:
Results:
train loss = 1.12
dev loss = 2.02
Samples:
LERCUTIF:
MARCAPUL:
HENRY:
Why it failed:
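The model was not too small: a train loss of 1.12 against a dev loss of 2.02 shows it had enough capacity to fit the training data but did not generalize. It was the wrong architecture. A single hidden layer sees all 16 embeddings flattened at once and has no notion of which characters are neighbors.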
Next:
WaveNet-style hierarchical MLP from the next lecture.
Instead of flattening all embeddings at once, combine them in stages to preserve locality and expand the receptive field.
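A toy sketch of the staged grouping, to show how the shapes evolve. It only does the merging step: a real hierarchical model would put a learned linear layer and a nonlinearity after each merge, which is left out here. The pairwise grouping (`fan_in=2`) and the `merge_adjacent` helper are assumptions for illustration.

```python
def merge_adjacent(vectors, fan_in=2):
    """Concatenate each run of `fan_in` neighboring vectors into one longer vector."""
    if len(vectors) % fan_in != 0:
        raise ValueError("number of positions must be divisible by fan_in")
    return [
        [x for vec in vectors[i:i + fan_in] for x in vec]
        for i in range(0, len(vectors), fan_in)
    ]


block_size, embed_dim = 16, 32

# Stand-in embeddings: position p gets a vector filled with the value p.
stage = [[float(p)] * embed_dim for p in range(block_size)]
shapes = [(len(stage), len(stage[0]))]

# Merge neighbors until a single vector covers the whole context.
while len(stage) > 1:
    stage = merge_adjacent(stage)
    shapes.append((len(stage), len(stage[0])))

print(shapes)  # [(16, 32), (8, 64), (4, 128), (2, 256), (1, 512)]
```

Each stage only ever mixes neighbors, so early layers see 2 characters, the next see 4, then 8, then all 16. That is the sense in which locality is preserved while the receptive field grows.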
Goal:
After that: transformers.