Attention Head

An attention head is one operation, repeated: every token's Query is dot-producted with every token's Key, softmax turns each token's scores into a probability distribution, and each token's output is the sum of Values weighted by that distribution. This lab unpacks one head end-to-end so you can watch which words decide to look at which.

Q, K, V — linear projections of every embedding

Each token embedding x gets multiplied by three matrices to produce its query (Q = x·WQ), key (K = x·WK), and value (V = x·WV). In random mode those matrices are noise — Q/K/V carry no useful signal. In trained mode they're hand-built so verbs reach for animals and animals reach for verbs (using the same cluster colours as Lab 20).
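As a minimal sketch of the three projections, assuming NumPy and illustrative sizes (T = 5 tokens, D = 8, dh = 4) that are not taken from the lab itself, "random mode" amounts to something like the following. The names X, W_Q, W_K, W_V are placeholders for x, WQ, WK, WV above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, not the lab's): tokens, embedding width, head width.
T, D, dh = 5, 8, 4

X = rng.normal(size=(T, D))      # one token embedding per row
W_Q = rng.normal(size=(D, dh))   # "random mode": the projections are pure noise
W_K = rng.normal(size=(D, dh))
W_V = rng.normal(size=(D, dh))

Q = X @ W_Q   # what each token is looking for
K = X @ W_K   # what each token offers
V = X @ W_V   # what each token broadcasts if attended to

print(Q.shape, K.shape, V.shape)   # each is (T, dh)
```

Every row of Q, K, and V comes from the same row of X; only the projection matrix differs.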

[Heatmaps: Q, K, and V, each T × dh. Column labels: wat, mot, ani, abs.]

Attention — softmax(Q·Kᵀ / √dh) · V

Multiply Q by Kᵀ to get a T × T score matrix: row i is how much token i wants to look at each token in the sequence. Softmax along each row turns the scores into a probability distribution that sums to 1. Click any row to inspect what that query is attending to; the output below is the resulting weighted V.

Each token's output is its row of the attention matrix dotted with V — a 4-dim cluster-coloured vector showing what got mixed in.
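The formula softmax(Q·Kᵀ / √dh) · V can be sketched in a few lines of NumPy. This is an illustration under assumed inputs (random Q, K, V with T = 5 and dh = 4), not the lab's own code. The helper names softmax_rows and attention_head are hypothetical.

```python
import numpy as np

def softmax_rows(scores):
    """Softmax along each row; subtracting the row max avoids overflow."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_head(Q, K, V):
    """Return (output, weights) for a single unmasked attention head."""
    dh = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dh)    # T x T: every query against every key
    weights = softmax_rows(scores)    # each row is a probability distribution
    return weights @ V, weights       # each output row is a weighted sum of V rows

rng = np.random.default_rng(0)
T, dh = 5, 4                          # assumed sizes for the demo
Q, K, V = (rng.normal(size=(T, dh)) for _ in range(3))

out, weights = attention_head(Q, K, V)
print(weights.shape, out.shape)       # (T, T) and (T, dh)
```

Row i of weights is what the heatmap shows for query i, and row i of out is that row dotted with V.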

How it works. WQ, WK, WV are D × dh matrices learned by gradient descent. Their job is to read whatever lives in the embedding and emit three different "views" of it: what am I looking for, what do I offer, what should I broadcast if attended to. The dot product Q·Kᵀ is just every-query-against-every-key. If the entries of Q and K have roughly unit variance, each score is a sum of dh products, so its variance grows in proportion to dh; dividing by √dh brings it back to about 1 and keeps softmax from saturating as dh grows. Softmax along each row then concentrates weight on the highest-scoring keys while leaving a little on everything else.
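The effect of the √dh divisor can be checked numerically. The sketch below assumes query and key entries are independent draws with zero mean and unit variance, an idealisation that trained heads need not satisfy. The function score_variance is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_variance(dh, n=20000):
    """Empirical variance of q.k before and after dividing by sqrt(dh)."""
    q = rng.normal(size=(n, dh))      # n sample queries with unit-variance entries
    k = rng.normal(size=(n, dh))      # n sample keys with unit-variance entries
    raw = (q * k).sum(axis=1)         # one dot product per (q, k) pair
    return raw.var(), (raw / np.sqrt(dh)).var()

for dh in (4, 64, 256):
    raw_var, scaled_var = score_variance(dh)
    print(dh, round(raw_var, 1), round(scaled_var, 2))
```

Under these assumptions the raw variance tracks dh, while the scaled variance stays near 1 whatever the head width.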

The trained matrices here are interpretable on purpose. Real heads aren't this clean (they encode a stew of grammatical, semantic, and positional signals), but the mechanism is identical. Toggle trained off and the heatmap collapses to noise: same architecture, no learned content, no useful behaviour. The arithmetic is the same in both modes; everything useful lives in the learned weights.
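To make the contrast concrete, here is a toy hand-built head next to a random one. Everything in it is hypothetical: the four-token sentence, the two-feature embeddings [is_animal, is_verb], and the weight matrices are invented for illustration and are not the lab's actual matrices.

```python
import numpy as np

def attention_weights(X, W_Q, W_K):
    """Row-wise softmax of the scaled query-key scores."""
    Q, K = X @ W_Q, X @ W_K
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

tokens = ["the", "dog", "chased", "cat"]
# Hypothetical embeddings with two features: [is_animal, is_verb].
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])

# Hand-built weights: a key exposes the token's own class,
# a query asks for the opposite class (animal -> verb, verb -> animal).
W_K = 3.0 * np.eye(2)
W_Q = 3.0 * np.array([[0.0, 1.0],
                      [1.0, 0.0]])
trained = attention_weights(X, W_Q, W_K)

# Random weights: same arithmetic, no structure.
rng = np.random.default_rng(0)
untrained = attention_weights(X, rng.normal(size=(2, 2)), rng.normal(size=(2, 2)))

np.set_printoptions(precision=2, suppress=True)
print(trained)
print(untrained)
```

In the hand-built case the "chased" row puts almost all of its weight on "dog" and "cat", the two animal rows put almost all of theirs on "chased", and "the" (which has neither feature) attends uniformly. The random case runs the same function with no such pattern built in.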

Next (Lab 22): once one forward pass produces a distribution, the model can write into its own context and condition future passes on what it just wrote — that's chain-of-thought. (Lab 25 covers what happens when you stack many of these heads in parallel; in a real transformer a single block runs dozens of heads alongside an MLP, then does it again for many layers.)