Multi-Head Attention

A single attention head can encode roughly one direction in key-space: a single relation like "verbs reach for subjects". Real transformers run many heads in parallel, each with its own W_Q/W_K/W_V, so a single block can track multiple patterns simultaneously. The per-head outputs are concatenated and projected back to the same shape as the input.

Four heads, four jobs — same sentence, four hand-crafted patterns

Same attention math as Lab 21, but with four different W_Q/W_K pairs. Each head's target is hand-coded (not trained) so the patterns are interpretable; in a real model the patterns are learned and messier, but the principle is identical.

Per-token output mix — concat(h₁, h₂, h₃, h₄) → linear

For one selected token, here's what each head contributed (its attention-weighted sum of V). The model concatenates all four into a single vector of length n_heads · d_h (which equals D under the usual convention) and projects it back to D with an output matrix W_O. Try several tokens: for some of them the per-head contributions differ sharply, which makes the specialization easy to see.

[Interactive panel: for the selected token, what each head delivered to this position, followed by the concat → W_O projection (D dims).]
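
The concatenate-then-project step described above can be sketched in NumPy. This is a minimal illustration, not the lab's actual implementation: the weights are random rather than hand-crafted, the sizes (T=5, D=16, n_heads=4) are arbitrary, and the function name multi_head_attention is an assumption made for this example. It also omits the causal mask a decoder would apply.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (T, D); W_Q, W_K, W_V: (n_heads, D, d_h); W_O: (n_heads * d_h, D)."""
    n_heads, _, d_h = W_Q.shape
    T = X.shape[0]
    # Project the same input with each head's own matrices: (n_heads, T, d_h).
    Q = np.einsum("td,hde->hte", X, W_Q)
    K = np.einsum("td,hde->hte", X, W_K)
    V = np.einsum("td,hde->hte", X, W_V)
    # Scaled dot-product attention, batched over the head axis.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)   # (n_heads, T, T)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                # (n_heads, T, d_h)
    # For each token, lay the heads side by side: (T, n_heads * d_h).
    concat = heads.transpose(1, 0, 2).reshape(T, n_heads * d_h)
    # Project back to the model dimension D.
    return concat @ W_O, heads, concat

# Toy sizes and random weights (illustrative assumptions only).
rng = np.random.default_rng(0)
T, D, n_heads = 5, 16, 4
d_h = D // n_heads
X = rng.normal(size=(T, D))
W_Q = rng.normal(size=(n_heads, D, d_h))
W_K = rng.normal(size=(n_heads, D, d_h))
W_V = rng.normal(size=(n_heads, D, d_h))
W_O = rng.normal(size=(n_heads * d_h, D))

out, heads, concat = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)      # same shape as the input X
print(heads[:, 2])    # what each head delivered to token 2
```

Here heads[:, t] corresponds to the "what each head delivered" view for token t, and out[t] is that token's row after the W_O projection.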

Why one head isn't enough. A single head's W_Q can only project the embedding onto one specific "what am I looking for" direction, and the head produces just one attention distribution per position. If your model needs to track verbs-to-subjects, nouns-to-modifiers, and pronouns-to-antecedents at the same time, a single head can't do all three well: those are different directions in key-space that often conflict. Splitting D into many small heads lets each one pick its own direction without stepping on the others.

How dimensions add up. Standard convention: pick n_heads, then set d_h = D / n_heads (so D must be divisible by n_heads). For example, GPT-3 (175B) uses D=12288 and n_heads=96, so each head has d_h=128. The heads genuinely run in parallel: in practice they are computed as one batched matmul. The per-head outputs are concatenated back to length D and pushed through W_O before continuing.
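
The arithmetic above is small enough to check directly. The sketch below uses a hypothetical helper, head_dim, written for this example; it also shows that splitting D across heads leaves the parameter count of a projection unchanged, since n_heads matrices of shape (D, d_h) hold as many entries as one (D, D) matrix.

```python
def head_dim(d_model: int, n_heads: int) -> int:
    """Per-head width under the d_h = D / n_heads convention."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    return d_model // n_heads

# GPT-3 (175B) figures quoted in the text.
D, n_heads = 12288, 96
d_h = head_dim(D, n_heads)
print(d_h)  # 128

# Each head's W_Q is (D, d_h); all heads together match one (D, D) matrix.
per_head_params = D * d_h
total_params = n_heads * per_head_params
print(total_params == D * D)  # True
```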

What real heads learn. Mechanistic-interpretability work (Anthropic, others) has named real heads: induction heads (find an earlier occurrence of the current token and attend to whatever followed it, so a repeated pattern can be continued), copy heads (move information from one position to another), previous-token heads (attend to the immediately preceding position), name-mover heads (attend to a name earlier in the context and copy it toward the output). None of these were designed in; they emerged from training. The hand-crafted heads in this lab are a teaching aid; the real ones are stranger and more interesting.
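
To make "hand-crafted head" concrete, here is a toy previous-token head in NumPy. Everything in it is an assumption chosen for illustration: the inputs are one-hot position embeddings, W_Q and W_K are set by hand so that position i's query matches position i-1's key, and the scale of 20 is arbitrary. Learned previous-token heads in real models are not this clean.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T = 5
X = np.eye(T)               # toy assumption: one-hot position embeddings
scale = 20.0                # arbitrary sharpness for the hand-set weights

W_Q = scale * np.eye(T)     # query at position i is scale * e_i
W_K = np.eye(T, k=1)        # key at position j is e_{j+1}

Q = X @ W_Q
K = X @ W_K
# q_i . k_j is large only when j == i - 1.
scores = Q @ K.T / np.sqrt(T)

# Causal mask: a position may attend only to itself and earlier positions.
mask = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(mask, scores, -np.inf)

weights = softmax(scores, axis=-1)
# Position 0 has no predecessor, so it attends to itself.
print(weights.argmax(axis=-1))  # [0 0 1 2 3]
```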

Next (Lab 26): once you have the next-token distribution, how do you actually pick a token? Greedy decoding, temperature, top-k, top-p: different ways to choose.
