pre-LayerNorm + residual connections on both sublayers
final LayerNorm → linear to vocab
block_size=64, dropout=0.2
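A minimal sketch of what one such block might look like in PyTorch, assuming a standard causal self-attention layer followed by an MLP. The notes do not give the embedding width, head count, MLP expansion factor, or activation, so `n_embd`, `n_head`, the 4x hidden size, and GELU are illustrative assumptions, and `nn.MultiheadAttention` stands in for whatever attention code the model actually uses.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Pre-LayerNorm block: x + attn(ln1(x)), then x + mlp(ln2(x))."""

    def __init__(self, n_embd: int, n_head: int, dropout: float = 0.2):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(
            n_embd, n_head, dropout=dropout, batch_first=True
        )
        self.ln2 = nn.LayerNorm(n_embd)
        # 4x expansion and GELU are assumptions, not taken from the notes.
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # True above the diagonal means "may not attend":
        # position t only sees positions <= t.
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)  # normalize before the sublayer (pre-LN)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out  # residual around attention
        x = x + self.mlp(self.ln2(x))  # residual around the MLP
        return x
```

Because each sublayer only adds to `x`, the identity path runs through the whole stack untouched, which is the usual argument for why pre-LN stacks are easier to train.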
Training:
AdamW, lr=1e-4
ExponentialLR, gamma=0.99995
batch size 1024
5000 iters
CUDA + torch.compile
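The optimizer and scheduler settings above could be wired together roughly as follows. This is a sketch only: a single dummy parameter and a placeholder loss stand in for the real model and batch, and it assumes the scheduler is stepped once per iteration, which matches the per-step decay arithmetic in the notes further down.

```python
import torch

# Stand-in for model.parameters(); one parameter is enough to show the wiring.
params = [torch.nn.Parameter(torch.zeros(1))]

optimizer = torch.optim.AdamW(params, lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99995)

max_iters = 5000
for step in range(max_iters):
    optimizer.zero_grad(set_to_none=True)
    loss = (params[0] ** 2).sum()  # placeholder loss, not the language-model loss
    loss.backward()
    optimizer.step()
    scheduler.step()  # multiply the lr by gamma once per iteration

final_lr = scheduler.get_last_lr()[0]
print(f"lr after {max_iters} steps: {final_lr:.3e}")
```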
Results:
train loss = 1.26
val loss = 1.52
Why it worked:
Attention replaces flattening. Instead of mashing all 64 embeddings into one vector, each position attends directly to the relevant earlier positions. Sequence structure is preserved rather than thrown away, and throwing it away was the exact failure mode of mlp3.
The receptive field is the full block (64 chars): each position can attend to every earlier position in the block, not just a fixed local window.
Residual connections + pre-LayerNorm let 6 blocks stack and actually train.
Position embeddings give the model order information that a bag-of-embeddings MLP never had.
Against the mlp3 goals:
Goal was dev loss < 1.7 → hit 1.52.
Goal was “generate actual words and sentences” → it now produces speaker tags, line breaks, and mostly real English words in Shakespearean cadence. It learned the SPEAKER:\n dialogue form like mlp3 did, but this time it fills the dialogue with coherent structure instead of gibberish.
The train/dev gap:
1.26 vs 1.52 → a gap of 0.26, much tighter than mlp3’s 1.12/2.02 (a gap of 0.90). Dropout 0.2 + a context long enough to actually be useful suggests it’s learning real patterns, not memorizing short-range ones.
Notes / loose ends:
LR never decays much → gamma=0.99995 over 5000 steps only drops lr from 1e-4 to ~7.8e-5, so the schedule is nearly flat. Loss was still falling at step 5000; more iters or a higher/decaying lr would likely push val below 1.5.
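The decay arithmetic can be checked in a few lines of plain Python. The second half is a hypothetical extension: it asks what gamma would be needed to reach a chosen floor by the last step, and the 10% target is an illustrative assumption rather than a value from the run.

```python
base_lr = 1e-4
gamma = 0.99995
steps = 5000

# Exponential decay applied once per step: lr_t = base_lr * gamma**t
final_lr = base_lr * gamma ** steps
print(f"final lr: {final_lr:.2e} ({final_lr / base_lr:.0%} of the starting value)")

# Hypothetical target: decay to 10% of base_lr by the final step.
target_fraction = 0.10
needed_gamma = target_fraction ** (1 / steps)
print(f"gamma needed for a {target_fraction:.0%} floor: {needed_gamma:.6f}")
```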
Next:
Decay the LR properly (or warmup + cosine) and train longer to close in on ~1.4.
Scale block_size for longer context now that attention makes it affordable.
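One way the warmup + cosine option could look, written as a plain function of the step number. Every number here (`warmup_steps`, `peak_lr`, `min_lr`) is a placeholder for illustration and not a tuned or tested setting; `lr_at` is a hypothetical helper name.

```python
import math


def lr_at(step, max_steps=5000, warmup_steps=200, peak_lr=3e-4, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so the first step is small but non-zero.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + (peak_lr - min_lr) * cosine


for s in (0, 199, 200, 2600, 5000):
    print(s, f"{lr_at(s):.2e}")
```

In a PyTorch loop, the returned value can be written into each `param_group["lr"]` of the optimizer before the optimizer step, in place of the ExponentialLR scheduler.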