11h 7m 7s logged

Tiny Shakespeare (decoder-only transformer)

Dataset:

  • Tiny Shakespeare, ~1.1M characters
  • vocab 65

Architecture: GPT-style decoder

  • token embedding + learned position embedding (embd=384)
  • 6 transformer blocks, each:
    • multi-head causal self-attention (8 heads, head_size=48)
    • feed-forward (Linear → ReLU → Linear)
    • pre-LayerNorm + residual connections on both sublayers
  • final LayerNorm → linear to vocab
  • block_size=64, dropout=0.2
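A rough sketch of what one of those blocks could look like in PyTorch, using the sizes listed above (embd=384, 8 heads, head_size=48, dropout=0.2). This is not the code from the project: the class name `Block`, the 4x feed-forward width, the bias settings, and the exact dropout placement are all assumptions made for illustration.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """One pre-LayerNorm decoder block: causal self-attention, then feed-forward."""

    def __init__(self, n_embd=384, n_head=8, dropout=0.2, ffn_mult=4):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        # One fused projection for queries, keys and values (assumed layout).
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.proj = nn.Linear(n_embd, n_embd)
        self.attn_drop = nn.Dropout(dropout)
        self.resid_drop = nn.Dropout(dropout)
        # Linear -> ReLU -> Linear; the 4x hidden width is an assumption.
        self.ffn = nn.Sequential(
            nn.Linear(n_embd, ffn_mult * n_embd),
            nn.ReLU(),
            nn.Linear(ffn_mult * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def attend(self, x):
        B, T, C = x.shape
        head_size = C // self.n_head  # 384 // 8 = 48
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, head_size)
        q, k, v = (
            t.view(B, T, self.n_head, head_size).transpose(1, 2) for t in (q, k, v)
        )
        att = (q @ k.transpose(-2, -1)) / math.sqrt(head_size)
        # Causal mask: position t may only look at positions <= t.
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        att = att.masked_fill(~mask, float("-inf"))
        att = self.attn_drop(F.softmax(att, dim=-1))
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.resid_drop(self.proj(y))

    def forward(self, x):
        # Pre-LN: normalize first, then add the sublayer output to the residual.
        x = x + self.attend(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x


x = torch.randn(2, 64, 384)  # (batch, block_size, embd)
print(Block()(x).shape)  # torch.Size([2, 64, 384])
```

Stacking six of these between the embeddings and the final LayerNorm gives the layout listed above. Because the mask is lower-triangular, changing later characters cannot change the output at earlier positions.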

Training:

  • AdamW, lr=1e-4
  • ExponentialLR, gamma=0.99995
  • batch size 1024
  • 5000 iters
  • CUDA + torch.compile

Results:

  • train loss = 1.26
  • val loss = 1.52

Why it worked:

  • Attention replaces flattening. Instead of mashing all 64 embeddings into one vector, each position attends directly to the earlier positions that matter. Sequence structure is preserved rather than thrown away, and throwing it away was exactly mlp3's failure mode.
  • The receptive field is the full block (64 chars) and every position can use all of it, not just a fixed local window.
  • Residual connections + pre-LayerNorm let 6 blocks stack and actually train.
  • Position embeddings give the model order information that a bag-of-embeddings MLP never had.

Against the mlp3 goals:

  • Goal was dev loss < 1.7 → hit 1.52.
  • Goal was “generate actual words and sentences” → it now produces speaker tags, line breaks, and mostly real English words in Shakespearean cadence. It learned the SPEAKER:\n dialogue form like mlp3 did, but this time it fills the dialogue with coherent structure instead of gibberish.

The train/dev gap:

  • 1.26 vs 1.52 → much tighter than mlp3’s 1.12/2.02 (a gap of 0.26 vs 0.90). Dropout 0.2 + a context long enough to actually be useful means it’s learning real patterns, not memorizing short-range ones.

Notes / loose ends:

  • LR never decays much → gamma=0.99995 over 5000 steps only drops lr from 1e-4 to ~7.8e-5, so the schedule is nearly flat. Loss was still falling at step 5000; more iters or a higher/decaying lr would likely push val below 1.5.
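The decay arithmetic can be checked in a few lines. This sketch assumes the scheduler steps once per training iteration (the numbers above imply that); the helper name `exponential_lr` and the 1e-5 target are made up for illustration.

```python
def exponential_lr(base_lr, gamma, step):
    """Learning rate after `step` scheduler steps of per-step exponential decay."""
    return base_lr * gamma ** step


final_lr = exponential_lr(1e-4, 0.99995, 5000)
print(f"lr at step 5000: {final_lr:.2e}")  # roughly 7.8e-05, i.e. ~78% of the start

# Hypothetical target: what gamma would decay the lr 10x over the same 5000 steps?
target_lr = 1e-5
gamma_needed = (target_lr / 1e-4) ** (1 / 5000)
print(f"gamma for 10x decay: {gamma_needed:.6f}")
```

So a gamma that looks only slightly smaller on paper gives a schedule that decays an order of magnitude over the run, whereas 0.99995 barely moves.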

Next:

  • Decay the LR properly (or warmup + cosine) and train longer to close in on ~1.4.
  • Scale block_size for longer context now that attention makes it affordable.
  • Build a GPT-2 reimplementation (the old sucky model from 2019)
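For the first item, one possible shape for a warmup + cosine schedule is sketched below. Nothing here has been tried on this model yet: the function name `warmup_cosine_lr`, the 200 warmup steps, and the 1e-5 floor are placeholder assumptions, with only max_lr=1e-4 and 5000 steps taken from the run above.

```python
import math


def warmup_cosine_lr(step, max_lr=1e-4, min_lr=1e-5, warmup_steps=200, total_steps=5000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; step 0 already gets a small non-zero lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)


for s in (0, 199, 200, 2600, 5000):
    print(s, f"{warmup_cosine_lr(s):.2e}")
```

In a training loop this could be applied by writing the returned value into each `param_group["lr"]` of the optimizer before every step.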