Tiny Shakespeare (decoder-only transformer)

token embedding + learned position embedding (embd=384)
6 transformer blocks, each:
- multi-head causal self-attention (8 heads, head_size=48)
- feed-forward (Linear → ReLU → Linear)
- pre-LayerNorm + residual connections on both sublayers
final LayerNorm → linear to vocab
block_size=64, dropout=0.2
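
A minimal sketch of the architecture listed above, assuming PyTorch (the notes do not name the framework). The hyperparameters come from the notes; `vocab_size=65`, the 4x feed-forward width, the fused q/k/v projection, the dropout placement, and all class names are assumptions made for illustration, not the original code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# From the notes: embd=384, 8 heads (head_size=48), 6 blocks, block_size=64, dropout=0.2.
n_embd, n_head, n_layer = 384, 8, 6
block_size, dropout = 64, 0.2
vocab_size = 65  # ASSUMPTION: character-level vocab; not stated in the notes


class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention with one fused q/k/v projection (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.proj = nn.Linear(n_embd, n_embd)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        hs = C // n_head  # 384 / 8 = 48
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, hs)
        q, k, v = (t.view(B, T, n_head, hs).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / hs ** 0.5
        # Lower-triangular mask: position t sees only positions <= t.
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        att = att.masked_fill(~mask, float("-inf"))
        att = self.drop(F.softmax(att, dim=-1))
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.drop(self.proj(out))


class Block(nn.Module):
    """Pre-LayerNorm block: x + attn(ln(x)), then x + ffwd(ln(x))."""

    def __init__(self):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention()
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # ASSUMPTION: 4x hidden width
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


class TinyShakespeareLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)  # learned positions
        self.blocks = nn.Sequential(*[Block() for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)  # (B, T, n_embd)
        x = self.ln_f(self.blocks(x))
        return self.head(x)  # (B, T, vocab_size) logits
```

Calling `TinyShakespeareLM()(idx)` on a `(batch, 64)` tensor of token ids returns per-position logits over the vocabulary; training would apply cross-entropy against the next character at each position.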

Why it worked:

Attention replaces flattening. Instead of mashing all 64 embeddings into one vector, each position attends directly to the relevant earlier positions. Sequence structure is preserved rather than thrown away, and throwing it away was the exact failure mode of mlp3.
The receptive field is the full block (64 chars) and every position can use all of it, not just a fixed local window.
Residual connections + pre-LayerNorm let 6 blocks stack and actually train.
Position embeddings give the model order information that a bag-of-embeddings MLP never had.
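
To make the "attends to earlier positions" point concrete, here is a small stdlib-only sketch of causal attention weights for a single head. The score matrix is made-up toy data, not model output, and the function name is hypothetical. In this sketch each position may also attend to itself.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over a T x T score matrix with future positions masked out."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Position t may only look at positions 0..t (itself and the past).
        visible = scores[t][: t + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (T - t - 1))
    return weights


# Toy scores for a 4-character context (illustrative numbers only).
scores = [
    [0.0, 9.0, 9.0, 9.0],
    [2.0, 0.0, 9.0, 9.0],
    [0.0, 3.0, 0.0, 9.0],
    [4.0, 0.0, 0.0, 0.0],
]
W = causal_attention_weights(scores)
for row in W:
    print([round(w, 2) for w in row])
```

The large scores above the diagonal are ignored entirely, and the last row puts most of its weight on position 0: a position can reach back across the whole context rather than only its nearest neighbours.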

Goal was dev loss < 1.7 → hit 1.52.
Goal was “generate actual words and sentences” → it now produces speaker tags, line breaks, and mostly real English words in Shakespearean cadence. It learned the SPEAKER:\n dialogue form like mlp3 did, but this time it fills the dialogue with coherent structure instead of gibberish.

Train 1.26 vs dev 1.52 (gap 0.26) → much tighter than mlp3’s 1.12/2.02 (gap 0.90). Dropout 0.2 + a context long enough to actually be useful means it’s learning real patterns, not memorizing short-range ones.

LR barely decays → gamma=0.99995 over 5000 steps only drops lr from 1e-4 to ~7.8e-5, so the schedule is nearly flat. Loss was still falling at step 5000; more iters or a higher lr with real decay would likely push dev loss below 1.5.
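
The decay arithmetic can be checked in a few lines. This assumes the schedule multiplies the lr by gamma once per training step, which is what the numbers above imply.

```python
import math

lr0, gamma, steps = 1e-4, 0.99995, 5000  # values from the notes

# Exponential decay applied once per step: lr_n = lr0 * gamma**n
lr_final = lr0 * gamma ** steps
print(f"lr after {steps} steps: {lr_final:.3e}")

# How many steps this gamma needs for a 10x drop: solve gamma**n = 0.1
n_10x = math.log(0.1) / math.log(gamma)
print(f"steps for a 10x decay: {n_10x:.0f}")
```

With this gamma a 10x reduction takes roughly 46,000 steps, so a 5000-step run only ever sees the flat start of the curve.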

Next steps:

Decay the LR properly (or warmup + cosine) and train longer to close in on ~1.4.
Scale block_size for longer context now that attention makes it affordable.
Build up to GPT-2 (the old, sucky model from 2019).
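
For the warmup + cosine option mentioned above, a minimal stdlib sketch of the schedule. Only `max_steps=5000` comes from the notes; `warmup_steps`, `peak_lr`, `min_lr`, and the function name `lr_at` are hypothetical values chosen for illustration, not tuned settings.

```python
import math


def lr_at(step, max_steps=5000, warmup_steps=200, peak_lr=3e-4, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly; reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)  # 0 -> 1
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    return min_lr + (peak_lr - min_lr) * cosine


for s in (0, 199, 200, 2600, 4999, 5000):
    print(s, f"{lr_at(s):.2e}")
```

In a PyTorch training loop, one way to apply this would be to write `lr_at(step)` into each entry of `optimizer.param_groups` before calling `optimizer.step()`.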