Multi-Head Attention

One attention head can encode one direction in key-space: a single relation like "verbs reach for subjects". Real transformers run many heads in parallel, each with its own WQ/WK/WV, so a single block can track multiple patterns simultaneously. The per-head outputs are concatenated and projected back to the same shape as the input.
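The paragraph above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lab's actual implementation: the function name, the random weights, the toy sizes, and the 1/sqrt(dh) score scaling (the usual convention) are all assumptions, and no causal mask is applied.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, WQ, WK, WV, WO):
    """X: (T, D). WQ/WK/WV: (n_heads, D, d_h). WO: (n_heads * d_h, D)."""
    n_heads, _, d_h = WQ.shape
    head_outputs = []
    for h in range(n_heads):
        Q = X @ WQ[h]                       # (T, d_h) queries for head h
        K = X @ WK[h]                       # (T, d_h) keys for head h
        V = X @ WV[h]                       # (T, d_h) values for head h
        scores = Q @ K.T / np.sqrt(d_h)     # (T, T) scaled dot products
        weights = softmax(scores, axis=-1)  # each row sums to 1
        head_outputs.append(weights @ V)    # (T, d_h) weighted sum of V
    concat = np.concatenate(head_outputs, axis=-1)  # (T, n_heads * d_h)
    return concat @ WO                              # (T, D), same shape as X

# Hypothetical toy sizes: 5 tokens, D = 16, four heads of width 4.
rng = np.random.default_rng(0)
T, D, n_heads = 5, 16, 4
d_h = D // n_heads
X = rng.normal(size=(T, D))
WQ = rng.normal(size=(n_heads, D, d_h))
WK = rng.normal(size=(n_heads, D, d_h))
WV = rng.normal(size=(n_heads, D, d_h))
WO = rng.normal(size=(n_heads * d_h, D))

out = multi_head_attention(X, WQ, WK, WV, WO)
```

The explicit loop over heads is only for readability; production code computes all heads in one batched matmul.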

Four heads, four jobs — same sentence, four hand-crafted patterns

Same attention math as Lab 21, but with four different WQ/WK pairs. Each head's target is hand-coded so the patterns are interpretable; in a real model the patterns are learned and messier, but the principle is identical.

Per-token output mix — concat(h1, h2, h3, h4) → linear

For one selected token, here's what each head contributed (its weighted sum of V). The model concatenates all four into a single vector of length nheads · dh and projects it back to D with an output matrix WO. Picking the right token shows clear per-head specialization.
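One way to read "what each head contributed" is that concatenating and multiplying by WO gives the same result as summing each head's output multiplied by its own block of WO rows. The sketch below checks that for one token; the sizes and random values are illustrative assumptions, not the lab's data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_heads, d_h, D = 4, 3, 12   # hypothetical toy sizes

# Per-head weighted sum of V for ONE selected token: one row per head.
head_out = rng.normal(size=(n_heads, d_h))
WO = rng.normal(size=(n_heads * d_h, D))

# concat(h1, h2, h3, h4) -> linear
concat = head_out.reshape(-1)   # (n_heads * d_h,), head 1 first
projected = concat @ WO         # (D,)

# Equivalent view: each head writes through its own block of WO rows.
per_head = np.stack([
    head_out[h] @ WO[h * d_h:(h + 1) * d_h]   # (D,) contribution of head h
    for h in range(n_heads)
])

# The per-head contributions add up to the projected output.
assert np.allclose(per_head.sum(axis=0), projected)
```

Each row of `per_head` is a D-dimensional vector, so the contributions of different heads can be compared directly in the output space.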


What each head delivered to this position

concat → WO projection (D dims)

Why one head isn't enough. A single head computes one query per token, so it produces one attention distribution per token: its WQ projects the embedding onto one specific "what am I looking for" direction. If your model needs to track verbs-to-subjects, nouns-to-modifiers, and pronouns-to-antecedents at the same time, a single head can't do it cleanly, because those are different directions in key-space that often conflict. Splitting D into many small heads lets each one pick its own direction without stepping on the others.

How dimensions add up. Standard convention: pick nheads, then set dh = D / nheads. So GPT-3 (175B) uses D=12288 and nheads=96 → each head has dh=128. The heads run in genuine parallel — same matmul, batched. The output is concatenated back to D and pushed through WO before continuing.
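The arithmetic and the "same matmul, batched" remark can be sketched as follows. The GPT-3 numbers come from the text; the toy sizes, the random weights, the `split_heads` helper, and the 1/sqrt(dh) scaling are assumptions made for illustration.

```python
import numpy as np

# GPT-3 (175B) numbers quoted in the text.
D, n_heads = 12288, 96
d_h = D // n_heads
assert d_h * n_heads == D   # D must divide evenly across heads
print(d_h)                  # 128

# Toy-sized batched version (illustrative sizes).
T, D_toy, H = 6, 32, 4
dh = D_toy // H
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D_toy))
# One (D, D) matrix per role holds the weights of all heads side by side.
WQ, WK, WV = (rng.normal(size=(D_toy, D_toy)) for _ in range(3))

def split_heads(M):
    # (T, D) -> (H, T, dh): columns are grouped head by head.
    return M.reshape(T, H, dh).transpose(1, 0, 2)

Q, K, V = split_heads(X @ WQ), split_heads(X @ WK), split_heads(X @ WV)

# A single batched matmul scores every head at once.
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dh)   # (H, T, T)
scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys

heads = weights @ V                                    # (H, T, dh)
concat = heads.transpose(1, 0, 2).reshape(T, H * dh)   # back to (T, D_toy)
```

After this step `concat` would be multiplied by WO, as described above.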

What real heads learn. Mechanistic-interpretability work (Anthropic, others) has named real heads: induction heads (find an earlier occurrence of the current token and attend to the token that followed it, so a repeated pattern can be continued), copy heads (move information from one position to another), previous-token heads (attend to the position just before the current one), name-mover heads (attend to a name earlier in the sentence and copy it toward the output). None of these were designed in; they emerged from training. The hand-crafted heads in this lab are a teaching aid; the real ones are stranger and more interesting.

Next (Lab 26): once you have the next-token distribution, how do you actually pick a token? Greedy, temperature, top-k, top-p — different ways to sample.