Llama 3 Mini, from scratch

A tiny Llama-3-style transformer (~40M params) traced shape-by-shape, then trained on TinyShakespeare. Every matrix, every dimension.

[Interactive diagram: the full forward pass of Llama 3 Mini]

▶ Run the code in Google Colab

This is the full forward pass of a Llama 3 Mini — a scaled-down Llama-3-style transformer small enough to pretrain on a free Colab T4. Zoom into the diagram above to follow any part; the notes below trace the shapes end to end.

The config (vs. full Llama 3 8B)

Setting	Mini	Llama 3 8B
Hidden dim `d_model`	512	4096
Layers	8	32
Heads	8	32
Total params	~40M	8B
VRAM	~4–6 GB (Colab T4)	—

We train on TinyShakespeare with a character-level vocab, batch size 128 and sequence length 128.
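The settings above can be collected into one small config object. This is only a sketch: the class name `MiniConfig` and its field names are hypothetical and may not match the notebook, and the vocabulary size of 65 is the figure quoted in section 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniConfig:
    # Values come from the table and the training setup in the text;
    # the field names themselves are illustrative assumptions.
    d_model: int = 512
    n_layers: int = 8
    n_heads: int = 8
    vocab_size: int = 65    # character-level TinyShakespeare vocab
    batch_size: int = 128
    seq_len: int = 128

    @property
    def head_dim(self) -> int:
        # Channels handled by each attention head.
        assert self.d_model % self.n_heads == 0
        return self.d_model // self.n_heads

    @property
    def tokens_per_step(self) -> int:
        # Tokens processed in one forward pass.
        return self.batch_size * self.seq_len


cfg = MiniConfig()
print(cfg.head_dim, cfg.tokens_per_step)   # 64 16384
```

Deriving `head_dim` from `d_model` and `n_heads` rather than storing it separately keeps the three values from drifting apart.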

1. Tokens → embeddings

Input tokens arrive as a (batch, seq) = (128, 128) integer matrix. The embedding table maps each token id to a d_model-dim vector:

\text{params}_{\text{embed}} = V \times d_{\text{model}}

With a vocab of V = 65 characters and d_model = 512, that is 65 × 512 = 33,280 parameters. The output is the token embedding:

[\text{Batch}, \text{Seq}, d_{\text{model}}] = (128, 128, 512)
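To make the lookup concrete, here is a dependency-free sketch of a character-level tokenizer and embedding table. The sample text, the helper names (`encode`, `embed`) and the shrunken `d_model = 4` are assumptions for illustration; the real model uses the full TinyShakespeare text (V = 65) and d_model = 512.

```python
import random

text = "to be, or not to be"           # stand-in for the TinyShakespeare corpus
chars = sorted(set(text))              # character-level vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}
V, d_model = len(chars), 4             # toy sizes; the article uses V=65, d_model=512

random.seed(0)
# Embedding table: one d_model-dim row per token id, V * d_model parameters.
table = [[random.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(V)]


def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[ch] for ch in s]


def embed(batch_ids):
    """(batch, seq) ids -> (batch, seq, d_model) vectors by row lookup."""
    return [[table[i] for i in seq] for seq in batch_ids]


batch = [encode("to be"), encode("or no")]    # (batch, seq) = (2, 5)
out = embed(batch)
print(len(out), len(out[0]), len(out[0][0]))  # 2 5 4
print(65 * 512)                               # 33280 parameters at full size
```

The embedding is a pure table lookup, so no arithmetic happens at this step; the shape simply gains a trailing d_model axis.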

2. Inside one transformer block (×8)

Each block is RMSNorm → Attention → RMSNorm → SwiGLU FFN, with residual connections around the attention and feed-forward sub-layers.
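The wiring of one block can be sketched without any library. This assumes the pre-norm arrangement implied by the ordering above: normalize, apply the sub-layer, then add the result back to the residual stream. `attention` and `ffn` are placeholder callables (real attention looks at the whole sequence, not one vector), and the `eps` value is an assumption.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """RMSNorm over one token vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def block(x, attention, ffn, norm1_w, norm2_w):
    """One pre-norm block applied to a single token vector (shape-preserving)."""
    h = rms_norm(x, norm1_w)
    x = [a + b for a, b in zip(x, attention(h))]   # residual around attention
    h = rms_norm(x, norm2_w)
    x = [a + b for a, b in zip(x, ffn(h))]         # residual around the FFN
    return x


x = [3.0, -4.0, 0.0, 0.0]
ones = [1.0] * len(x)

normed = rms_norm(x, ones)
rms = math.sqrt(sum(v * v for v in normed) / len(normed))
print(round(rms, 3))                 # close to 1: the vector is rescaled to unit RMS

identity = lambda v: v               # placeholder sub-layers for the demo
y = block(x, identity, identity, ones, ones)
print(len(y) == len(x))              # the block preserves the vector width
```

Unlike LayerNorm, RMSNorm does not subtract the mean and has no bias term; it only rescales.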

Attention

From the normalized input x we compute three projections, each of shape (128, 128, 512):

Q = xW_Q,\quad K = xW_K,\quad V = xW_V

A few Llama-specific details worth noticing in the diagram:

RoPE is applied to Q and K only — not to V. Rotary embeddings encode position by rotating the query/key vectors, so values stay untouched.
Attention runs over 8 heads, so the 512 channels split into 8 × 64.
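The rotation idea behind RoPE can be checked with a few lines of plain Python. This sketch treats consecutive pairs of channels as complex numbers and rotates each pair by an angle proportional to the position. The interleaved pairing and the base of 10000 (the value from the original RoPE formulation) are assumptions here; implementations differ in how they pair channels, and the notebook may use a different base.

```python
import cmath
import math


def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos * base**(-2i/d)."""
    d = len(vec)                      # assumed even
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        z = complex(vec[2 * i], vec[2 * i + 1]) * cmath.exp(1j * theta)
        out += [z.real, z.imag]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]     # one head's query; head_dim = 4 for brevity
k = [0.2, -1.0, 0.7, 0.1]     # one head's key

# The same relative offset (2) at different absolute positions gives the same score.
s1 = dot(rope(q, 3), rope(k, 1))
s2 = dot(rope(q, 9), rope(k, 7))
print(math.isclose(s1, s2, rel_tol=1e-9, abs_tol=1e-12))      # True

# A rotation never changes the length of the vector.
print(math.isclose(dot(q, q), dot(rope(q, 5), rope(q, 5))))   # True
```

Because the attention score depends only on the offset between query and key positions, the model gets relative position information without any learned position table, and V needs no rotation at all.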

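import math
import torch
# scaled dot-product attention, per head; Q, K, V: (B, H, T, head_dim)
T, head_dim = Q.shape[-2:]
causal_mask = torch.ones(T, T, dtype=torch.bool, device=Q.device).triu(1)  # True = future position
scores = (Q @ K.transpose(-2, -1)) / math.sqrt(head_dim)  # (B, H, T, T)
scores = scores.masked_fill(causal_mask, float("-inf"))   # hide future tokens
attn = scores.softmax(dim=-1) @ V                          # (B, H, T, head_dim)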
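To see what the mask does, here is a dependency-free sketch of the same computation for a single head and a three-token sequence. The helper names and the toy numbers are made up for illustration.

```python
import math


def softmax(row):
    m = max(row)                              # subtract the max for stability
    exps = [math.exp(v - m) for v in row]     # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head attention over nested lists of shape (T, head_dim)."""
    T, head_dim = len(Q), len(Q[0])
    weights = []
    for t in range(T):
        row = []
        for s in range(T):
            if s > t:
                row.append(float("-inf"))     # future position: masked out
            else:
                score = sum(a * b for a, b in zip(Q[t], K[s]))
                row.append(score / math.sqrt(head_dim))
        weights.append(softmax(row))
    # Each output row is a weighted sum of the value vectors.
    out = [[sum(w * V[s][j] for s, w in enumerate(weights[t]))
            for j in range(len(V[0]))] for t in range(T)]
    return weights, out


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, out = causal_attention(Q, K, V)
print(weights[0])   # [1.0, 0.0, 0.0]: the first token can only see itself
print(out[0])       # [1.0, 2.0]: so its output is exactly V[0]
```

Every row of `weights` sums to 1, and every entry above the diagonal is exactly zero, which is what keeps the model from peeking at tokens it is supposed to predict.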

Feed-forward (SwiGLU)

Llama uses a gated SwiGLU MLP instead of a plain ReLU MLP:

\text{SwiGLU}(x) = \big(\text{SiLU}(xW_1) \odot xW_3\big)W_2
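The formula translates almost directly into code. In this sketch W_1 and W_3 map from d_model up to a hidden width and W_2 maps back down. The name `d_ff` for that hidden width and the toy sizes are assumptions, since the text does not state the hidden width of the Mini model.

```python
import math


def silu(v):
    """SiLU (also called swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns) -> length m."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def swiglu(x, W1, W3, W2):
    gate = [silu(v) for v in matvec(x, W1)]      # SiLU(x W1), width d_ff
    up = matvec(x, W3)                           # x W3, width d_ff
    hidden = [g * u for g, u in zip(gate, up)]   # elementwise product
    return matvec(hidden, W2)                    # back to width d_model


d_model, d_ff = 2, 3                             # toy sizes; d_ff is hypothetical
W1 = [[1.0, 0.0, -1.0], [0.5, 1.0, 0.0]]         # (d_model, d_ff)
W3 = [[0.0, 1.0, 1.0], [1.0, 0.0, 2.0]]          # (d_model, d_ff)
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # (d_ff, d_model)

y = swiglu([1.0, 2.0], W1, W3, W2)
print(len(y))                # 2: same width as the input
print(3 * d_model * d_ff)    # 18 weights across W1, W3 and W2
```

The gate branch decides, channel by channel, how much of the `up` branch passes through, which is the difference from a plain two-matrix MLP. The price is a third weight matrix.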

3. Output head

After the final RMSNorm, a linear layer projects back to vocab size, and a softmax turns it into next-token probabilities:

(128, 128, 512) \xrightarrow{\text{Linear}} (128, 128, 65) \xrightarrow{\text{softmax}} \text{probabilities}
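A useful sanity check follows from this step. If the logits carry no information, the softmax is uniform over the 65 characters and the cross-entropy loss is ln(65), so a freshly initialized model should start near that value. The sketch below assumes nothing beyond the vocabulary size; the function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)                               # subtract the max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target])


V = 65
uniform_logits = [0.0] * V                        # a model that knows nothing
probs = softmax(uniform_logits)
loss = cross_entropy(uniform_logits, target=7)    # target id chosen arbitrarily

print(round(sum(probs), 6))                       # 1.0
print(round(loss, 3), round(math.log(V), 3))      # 4.174 4.174
```

In training code the explicit softmax is usually folded into the loss function, which takes the raw logits; the separate softmax matters mainly when sampling text.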

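Trace it yourself: the last dimension of the residual stream changes in only two places, the embedding (token ids → 512) and the output head (512 → vocab). The attention projections keep it at 512. Inside a block, the split into 8 heads of 64 channels and the SwiGLU hidden layer reshape or widen the tensor temporarily, but every block returns (128, 128, 512).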
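The trace can also be written down as data, together with a rough weight tally. Several things in this sketch are assumptions rather than facts from the text: an attention output projection, no bias terms, an untied output head, and above all the feed-forward hidden width `d_ff`, which is left as an explicit argument because the text does not give it.

```python
def trace_shapes(B=128, T=128, d_model=512, n_heads=8, vocab=65):
    """Tensor shapes along the forward pass, in order."""
    head_dim = d_model // n_heads
    return [
        ("token ids", (B, T)),
        ("embeddings", (B, T, d_model)),
        ("Q, K, V", (B, T, d_model)),
        ("split into heads", (B, n_heads, T, head_dim)),
        ("attention scores", (B, n_heads, T, T)),
        ("block output", (B, T, d_model)),
        ("logits", (B, T, vocab)),
    ]


def count_params(d_ff, d_model=512, n_layers=8, vocab=65, tied=False):
    """Rough weight count; d_ff and the untied head are assumptions."""
    attn = 4 * d_model * d_model      # W_Q, W_K, W_V plus an output projection
    ffn = 3 * d_model * d_ff          # W_1, W_3, W_2 of the SwiGLU MLP
    norms = 2 * d_model               # two RMSNorm weight vectors per block
    embed = vocab * d_model
    head = 0 if tied else d_model * vocab
    final_norm = d_model
    return n_layers * (attn + ffn + norms) + embed + head + final_norm


for name, shape in trace_shapes():
    print(f"{name:18s} {shape}")

# The total is dominated by the blocks and depends strongly on d_ff.
for d_ff in (2048, 2560):             # hypothetical hidden widths
    print(d_ff, f"{count_params(d_ff):,}")
```

With a character-level vocabulary the embedding and output head contribute only about 33K weights each, so nearly all of the parameter budget sits in the attention and feed-forward matrices.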

Run the full training loop in the Colab notebook linked above.