One attention head can encode one direction in key-space — a single relation like "verbs reach for subjects". Real transformers stack many heads in parallel, each with its own WQ/WK/WV, so a single block can track multiple patterns simultaneously. The per-head outputs concatenate and project to the same shape as the input.
Same attention math as Lab 21, but with four different trained WQ/WK pairs. Each head's target is hand-coded so the patterns are interpretable; in a real model the patterns are learned and messier, but the principle is identical.
For one selected token, here's what each head contributed (its weighted sum of V). The model concatenates all four into a single vector of length nheads · dh and projects it back to D with an output matrix WO. Picking the right token shows clear per-head specialization.
WQ can only project the embedding onto one specific "what am I looking for" direction. If your model needs to track both verbs-to-subjects and nouns-to-modifiers and pronouns-to-antecedents at the same time, a single head can't do it — those are different directions in key-space that often conflict. Splitting D into many small heads lets each one pick its own direction without stepping on the others.
nheads, then set dh = D / nheads. So GPT-3 (175B) uses D=12288 and nheads=96 → each head has dh=128. The heads run in genuine parallel — same matmul, batched. The output is concatenated back to D and pushed through WO before continuing.