Llama 3 Mini, from scratch
A tiny Llama-3-style transformer (~40M params) traced shape-by-shape, then trained on TinyShakespeare. Every matrix, every dimension.
This is the full forward pass of a Llama 3 Mini — a scaled-down Llama-3-style transformer small enough to pretrain on a free Colab T4. Zoom into the diagram above to follow any part; the notes below trace the shapes end to end.
The config (vs. full Llama 3 8B)
| | Mini | Llama 3 8B |
|---|---|---|
| Hidden dim d_model | 512 | 4096 |
| Layers | 8 | 32 |
| Heads | 8 | 32 |
| Total params | ~40M | 8B |
| VRAM | ~4–6 GB (Colab T4) | — |
We train on TinyShakespeare with a character-level vocab, batch size 128
and sequence length 128.
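To keep the numbers in one place while following the trace, the hyperparameters can be collected in a small config object. This is only a sketch: the class name `MiniConfig` and its field names are hypothetical, and the vocab size of 65 is the TinyShakespeare character count used in the embedding section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniConfig:
    """Hyperparameters quoted in the text; the class itself is illustrative."""
    d_model: int = 512      # hidden dimension
    n_layers: int = 8       # transformer blocks
    n_heads: int = 8        # attention heads
    vocab_size: int = 65    # TinyShakespeare characters
    batch_size: int = 128
    seq_len: int = 128

    @property
    def head_dim(self) -> int:
        # Channels per attention head: 512 / 8 = 64.
        return self.d_model // self.n_heads

    @property
    def tokens_per_batch(self) -> int:
        # Token positions processed in one training step.
        return self.batch_size * self.seq_len


cfg = MiniConfig()
print(cfg.head_dim, cfg.tokens_per_batch)  # → 64 16384
```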
1. Tokens → embeddings
Input tokens arrive as a (batch, seq) = (128, 128) integer matrix. The
embedding table maps each token id to a d_model-dim vector.
With a vocab of V = 65 characters and d_model = 512, that is
65 × 512 = 33,280 parameters. The output is the token embedding, of shape (128, 128, 512).
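The lookup above can be checked in a few lines of PyTorch. This is a minimal sketch, assuming the embedding is a plain `nn.Embedding`; the variable names are illustrative and the token ids are random stand-ins for real text.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 65, 512
batch, seq = 128, 128

embed = nn.Embedding(vocab_size, d_model)             # weight: (65, 512)
tokens = torch.randint(0, vocab_size, (batch, seq))   # (128, 128) integer ids
x = embed(tokens)                                     # (128, 128, 512)

# The table is the layer's only parameter: 65 * 512 entries.
n_params = sum(p.numel() for p in embed.parameters())
print(tokens.shape, x.shape, n_params)
```

Each output vector is just the row of the table indexed by the token id, so `x[b, t]` equals `embed.weight[tokens[b, t]]`.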
2. Inside one transformer block (×8)
Each block is RMSNorm → Attention → RMSNorm → SwiGLU FFN, with residual connections around the attention and feed-forward sub-layers.
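The wiring described above can be sketched as follows. This is an illustration rather than the notebook's code: the class names, the `eps` value, and the stand-in sub-layers (plain linear maps in place of real attention and SwiGLU) are all assumptions made so the shapes can be checked in isolation.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Divide each vector by its root-mean-square, then apply a learned gain."""

    def __init__(self, dim: int, eps: float = 1e-6):  # eps is an assumed value
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


class Block(nn.Module):
    """Pre-norm wiring: x + attn(norm(x)), then h + ffn(norm(h))."""

    def __init__(self, dim: int, attn: nn.Module, ffn: nn.Module):
        super().__init__()
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        self.attn, self.ffn = attn, ffn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x + self.attn(self.attn_norm(x))     # residual around attention
        return h + self.ffn(self.ffn_norm(h))    # residual around feed-forward


# Stand-in sub-layers, only to exercise the wiring and the shapes.
block = Block(512, attn=nn.Linear(512, 512), ffn=nn.Linear(512, 512))
x = torch.randn(4, 128, 512)   # small batch to keep the check light
y = block(x)
print(y.shape)                 # same shape in, same shape out
```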
Attention
From the normalized input we project three matrices, Q, K and V, each of shape
(128, 128, 512).
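A minimal sketch of the three projections and the split into heads, assuming bias-free linear layers and full-width K and V (which is what the stated (128, 128, 512) shapes imply). The names `wq`, `wk`, `wv` and `split_heads` are illustrative, and the input is a random stand-in for the RMSNorm output.

```python
import torch
import torch.nn as nn

B, T, d_model, n_heads = 128, 128, 512, 8
head_dim = d_model // n_heads   # 64

# Bias-free projections: an assumption about this model.
wq = nn.Linear(d_model, d_model, bias=False)
wk = nn.Linear(d_model, d_model, bias=False)
wv = nn.Linear(d_model, d_model, bias=False)

x_norm = torch.randn(B, T, d_model)            # stand-in for the normalized input
q, k, v = wq(x_norm), wk(x_norm), wv(x_norm)   # each (128, 128, 512)


def split_heads(t: torch.Tensor) -> torch.Tensor:
    # (B, T, 512) -> (B, T, 8, 64) -> (B, 8, T, 64)
    return t.view(B, T, n_heads, head_dim).transpose(1, 2)


qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)
print(q.shape, qh.shape)
```

Splitting is only a reshape: head 3 at position 5 is channels 192 to 255 of the original vector at that position.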
A few Llama-specific details worth noticing in the diagram:
- RoPE is applied to Q and K only, not to V. Rotary embeddings encode position by rotating the query/key vectors, so values stay untouched.
- Attention runs over 8 heads, so the 512 channels split into 8 × 64.
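To make the rotation concrete, here is a sketch of RoPE in the "rotate-half" formulation, one common way to write it; the notebook may pair channels differently. The function names and the frequency base of 10000 are assumptions.

```python
import torch


def rope_tables(seq_len: int, head_dim: int, base: float = 10000.0):
    # One rotation frequency per channel pair (base is an assumed value).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)   # (T, head_dim/2)
    angles = torch.cat([angles, angles], dim=-1)                    # (T, head_dim)
    return angles.cos(), angles.sin()


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Pairs channel i with channel i + head_dim/2 and maps (a, b) -> (-b, a).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)


def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: (..., T, head_dim); cos and sin broadcast over the leading dims.
    return x * cos + rotate_half(x) * sin


B, H, T, D = 2, 8, 128, 64
q, k = torch.randn(B, H, T, D), torch.randn(B, H, T, D)
cos, sin = rope_tables(T, D)
q_rot, k_rot = apply_rope(q, cos, sin), apply_rope(k, cos, sin)   # V is left alone
print(q_rot.shape)
```

Because each pair is only rotated, vector norms are unchanged, and the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m − n.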
```python
# scaled dot-product attention, per head
scores = (Q @ K.transpose(-2, -1)) / math.sqrt(head_dim) # (B, H, T, T)
scores = scores.masked_fill(causal_mask, float("-inf"))
attn = scores.softmax(dim=-1) @ V # (B, H, T, head_dim)
```
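For a version of the attention step that runs on its own, the sketch below builds the causal mask explicitly and merges the heads back into 512 channels afterwards. Random tensors stand in for the real Q, K and V, the batch is reduced to 4 to keep it quick, and the names `weights` and `merged` are illustrative.

```python
import math
import torch

B, H, T, head_dim = 4, 8, 128, 64   # small batch for a quick check
Q, K, V = (torch.randn(B, H, T, head_dim) for _ in range(3))

# True above the diagonal: position t must not attend to positions > t.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

scores = (Q @ K.transpose(-2, -1)) / math.sqrt(head_dim)   # (B, H, T, T)
scores = scores.masked_fill(causal_mask, float("-inf"))
weights = scores.softmax(dim=-1)                           # each row sums to 1
out = weights @ V                                          # (B, H, T, head_dim)

# Merge heads back: (B, H, T, 64) -> (B, T, 8, 64) -> (B, T, 512)
merged = out.transpose(1, 2).reshape(B, T, H * head_dim)
print(weights.shape, merged.shape)
```

Masked positions get a score of negative infinity, so the softmax assigns them a weight of exactly zero and the first position can only attend to itself.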
Feed-forward (SwiGLU)
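Llama uses a gated SwiGLU MLP instead of a plain ReLU MLP: the input is projected twice, one projection passes through SiLU and gates the other elementwise, and a third projection maps the result back to d_model.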
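A minimal sketch of such a gated MLP, assuming bias-free linear layers. The layer names and the hidden width of 1536 are hypothetical; the text does not state the width this model uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """down( silu(gate(x)) * up(x) ): a gated MLP with three weight matrices."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The SiLU-activated branch gates the linear branch elementwise.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


ffn = SwiGLU(dim=512, hidden_dim=1536)   # hidden_dim is a hypothetical value
x = torch.randn(4, 128, 512)
y = ffn(x)

n_params = sum(p.numel() for p in ffn.parameters())   # 3 * 512 * hidden_dim
print(y.shape, n_params)
```

The last dimension widens to `hidden_dim` inside the layer and returns to 512 at the output, so the residual addition around it still lines up.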
3. Output head
After the final RMSNorm, a linear layer projects back to vocab size, giving logits of shape (128, 128, 65), and a softmax over the last dimension turns them into next-token probabilities.
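The head and the training loss can be sketched like this. It assumes a bias-free linear head and a standard cross-entropy objective; the hidden states and targets are random stand-ins, the batch is reduced to 4, and the name `lm_head` is illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, d_model, vocab_size = 4, 128, 512, 65
h = torch.randn(B, T, d_model)                  # stand-in for the final RMSNorm output
lm_head = nn.Linear(d_model, vocab_size, bias=False)

logits = lm_head(h)                             # (B, T, 65)
probs = logits.softmax(dim=-1)                  # next-token distribution per position

# cross_entropy takes raw logits and applies log-softmax internally.
targets = torch.randint(0, vocab_size, (B, T))  # random stand-in targets
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

uniform_loss = math.log(vocab_size)             # loss of a uniform guess over 65 chars
print(logits.shape, loss.item(), uniform_loss)
```

A model that spreads probability evenly over all 65 characters scores ln(65) ≈ 4.17, which is a handy reference point for the first few training steps.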
Trace it yourself: the last dimension changes only at the embedding (token ids → 512) and the output head (512 → vocab). The attention projections stay at 512, and every block hands back the same shape it received.
Run the full training loop in the Colab notebook linked above.