Devlog by @xerneas

@xerneas on Makemore · 19 days ago

3h 45m 24s logged

makemore — devlog

mlp2-gpu: names MLP

Dataset: Karpathy’s names.txt (32k names, vocab 27)

Architecture:
embed=10 -> flatten -> linear(200) -> tanh -> linear(27)

Training:

Result:
dev loss = 2.20

Samples:
bhuza
fremah
mykeslanna
talee

The model worked because names are short and mostly depend on local character patterns.

mlp3: Tiny Shakespeare (without a guide and with single hidden layer)

Dataset:

Architecture:
embed=32, hidden=400, block_size=16

Training:

Results:
train loss = 1.12
dev loss = 2.02

Samples:

LERCUTIF:
MARCAPUL:
HENRY:

Why it failed:

The model was not too small. It was the wrong architecture.

WaveNet-style hierarchical MLP from the next lecture.

Instead of flattening all embeddings at once, combine them in stages to preserve locality and expand the receptive field.

Goal:

After that: transformers.