3h 45m 24s logged

makemore — devlog

mlp2-gpu: names MLP

Dataset: Karpathy’s names.txt (32k names, vocab 27)

Architecture:
embed=10 -> flatten -> linear(200) -> tanh -> linear(27)
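A rough sketch of how the inputs to that embedding layer could be built, for reference. The context length (block_size=3) is an assumption, since it is not recorded above, and the vocab of 27 is taken to be 26 lowercase letters plus a '.' boundary marker, as in the makemore lectures. The helper names are made up for illustration.

```python
# Sketch: build (context, target) pairs for a character-level MLP.
# block_size=3 is an assumption; the log does not state the context length.

def build_vocab():
    # 26 lowercase letters plus '.' as a start/end marker gives vocab 27.
    chars = ["."] + [chr(ord("a") + i) for i in range(26)]
    stoi = {ch: i for i, ch in enumerate(chars)}
    return chars, stoi

def make_examples(name, stoi, block_size=3):
    # Slide a fixed window over the name, padding the start with '.'.
    context = [stoi["."]] * block_size
    examples = []
    for ch in name + ".":
        target = stoi[ch]
        examples.append((list(context), target))
        context = context[1:] + [target]  # drop oldest, append newest
    return examples

chars, stoi = build_vocab()
for ctx, tgt in make_examples("talee", stoi):
    print("".join(chars[i] for i in ctx), "->", chars[tgt])
```

Each context is a list of block_size integer indices. The model embeds each index into 10 dimensions and flattens the result into one vector before the hidden layer.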

Training:

  • AdamW, lr=1e-3
  • 200 epochs
  • MPS

Result:
dev loss = 2.20

Samples:
bhuza
fremah
mykeslanna
talee

The model produced plausible names because names are short and depend mostly on local character patterns, which a small fixed context window can capture.

mlp3: Tiny Shakespeare (without a guide, with a single hidden layer)

Dataset:

  • 1.1M characters
  • vocab 65

Architecture:
embed=32, hidden=400, block_size=16
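To put a number on the size of this model, here is a quick parameter count. It assumes the same layout as the names MLP (embedding -> flatten -> linear -> tanh -> linear), with a bias on both linear layers and no normalization layers. Those details are assumptions, not something recorded above.

```python
# Sketch: parameter count for a flat single-hidden-layer character MLP.
# Assumes biases on both linear layers and no batch norm.

def mlp_param_count(vocab_size, embed_dim, block_size, hidden):
    embedding = vocab_size * embed_dim
    # The flattened input is block_size * embed_dim wide.
    hidden_layer = (block_size * embed_dim) * hidden + hidden
    output_layer = hidden * vocab_size + vocab_size
    return embedding + hidden_layer + output_layer

total = mlp_param_count(vocab_size=65, embed_dim=32, block_size=16, hidden=400)
print(total)  # 233345 under these assumptions
```

Almost all of that (about 205k) sits in the first linear layer, which sees the 16 embeddings as one flat 512-wide vector.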

Training:

  • batch size 8192
  • 600 epochs
  • CUDA

Results:
train loss = 1.12
dev loss = 2.02

Samples:

LERCUTIF:
MARCAPUL:
HENRY:

Why it failed:

  • 16 characters of context is too short for prose.
  • Flattening embeddings throws away sequence structure.
  • The model learned the easiest pattern: SPEAKER:\n dialogue.
  • The large train/dev gap (1.12 vs 2.02) suggests overfitting to short-range patterns.
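For scale, the loss numbers can be compared against a uniform-guess baseline. This assumes the reported losses are mean cross-entropy in nats (the default for PyTorch's cross-entropy loss), which is not stated above.

```python
import math

# Sketch: put the reported losses in context.
# Assumes losses are mean cross-entropy measured in nats.
vocab_size = 65
train_loss, dev_loss = 1.12, 2.02

uniform_loss = math.log(vocab_size)  # loss when guessing every char equally
gap = dev_loss - train_loss

print(f"uniform baseline: {uniform_loss:.2f}")
print(f"train/dev gap:    {gap:.2f}")
print(f"train perplexity: {math.exp(train_loss):.2f}")
print(f"dev perplexity:   {math.exp(dev_loss):.2f}")
```

Under that assumption the dev loss is still well below the uniform baseline of about 4.17, so the model did learn something. The gap of 0.90 means it is much more certain on the training text than on held-out text.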

The model was not too small. It was the wrong architecture.

Next:

WaveNet-style hierarchical MLP from the next lecture.

Instead of flattening all embeddings at once, combine them in stages to preserve locality and expand the receptive field.
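A minimal sketch of the staged-combining idea, in plain Python lists so it runs without any framework. The group size of 2 is an assumption, and flatten_consecutive is a name chosen here for illustration. A real model would put a learned linear layer and a nonlinearity after each grouping step, which is left out, so the widths below simply double.

```python
# Sketch: merge neighbouring embeddings in stages instead of all at once.
# Group size 2 is an assumption; learned layers between stages are omitted.

def flatten_consecutive(vectors, n):
    # Concatenate every n adjacent vectors into one longer vector.
    assert len(vectors) % n == 0
    return [sum(vectors[i:i + n], []) for i in range(0, len(vectors), n)]

block_size, embed_dim = 16, 32
x = [[0.0] * embed_dim for _ in range(block_size)]  # dummy embeddings

shapes = [(len(x), len(x[0]))]
while len(x) > 1:
    x = flatten_consecutive(x, 2)
    shapes.append((len(x), len(x[0])))

print(shapes)  # (positions, width) after each stage
```

With 16 positions and pairs, it takes four stages to reach a single vector. Each stage only mixes adjacent positions, so nearby characters are combined first and distant ones later.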

Goal:

  • dev loss < 1.7 on Tiny Shakespeare
  • generate actual words and sentences

After that: transformers.
