An attention head is one operation, repeated: every token's Query is dotted with every token's Key (its own included), softmax turns each token's scores into a probability distribution, and each token's output is then a weighted sum of Values. This lab unpacks one head end-to-end so you can watch which words decide to look at which.
Each token embedding x gets multiplied by three matrices to produce its query (Q = x·WQ), key (K = x·WK), and value (V = x·WV). In random mode those matrices are noise — Q/K/V carry no useful signal. In trained mode they're hand-built so verbs reach for animals and animals reach for verbs (using the same cluster colours as Lab 20).
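The three projections can be sketched in plain Python. This is a minimal illustration, not the lab's own code: the sizes D = 4 and dh = 4, the one-hot embedding x, and the helper names random_matrix and project are all assumptions made for the example, and the Gaussian-noise matrices stand in for "random mode".

```python
import random

def random_matrix(rows, cols, rng):
    """Return a rows x cols matrix of Gaussian noise (stand-in for 'random mode')."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def project(x, W):
    """Row vector x (length D) times matrix W (D x dh) gives a length-dh vector."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

rng = random.Random(0)      # fixed seed so the run is repeatable
D, dh = 4, 4                # assumed sizes, not taken from the lab
x = [1.0, 0.0, 0.0, 0.0]    # hypothetical embedding for one token

WQ = random_matrix(D, dh, rng)
WK = random_matrix(D, dh, rng)
WV = random_matrix(D, dh, rng)

q = project(x, WQ)  # query: what am I looking for
k = project(x, WK)  # key: what do I offer
v = project(x, WV)  # value: what I broadcast if attended to
print(q, k, v)
```

Because the example embedding is one-hot, each projection simply picks out the first row of its matrix, which makes it easy to see that the same x yields three different vectors.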
Multiply Q by Kᵀ to get a T × T score matrix: row i is how much token i wants to look at each token, itself included. Softmax along each row turns the scores into a probability distribution that sums to 1. Click any row to inspect what that query is attending to; the output below is the resulting weighted V.
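As a rough sketch of this step, the snippet below builds the T × T weights for three tokens. The Q and K values are made up for illustration, the function names are hypothetical, and the division by √dh follows the standard scaled dot-product form rather than anything confirmed about the lab's implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """Return the T x T matrix softmax(Q K^T / sqrt(dh)), one row per query."""
    scale = math.sqrt(len(Q[0]))
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        for q in Q
    ]
    return [softmax(row) for row in scores]

# Hypothetical queries and keys for T = 3 tokens with dh = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

A = attention_weights(Q, K)
for row in A:
    print([round(w, 3) for w in row])
```

Each printed row is one query's distribution over all three keys. The third query matches the first two keys equally, so it splits most of its weight evenly between them.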
Each token's output is its row of the attention matrix multiplied by V, that is, a weighted sum of the value vectors: a 4-dim cluster-coloured vector showing what got mixed in.
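The weighted sum can be sketched as follows. The attention rows in A and the one-hot value vectors in V are invented for the example (one-hot values make the mixing visible directly), and attend is a hypothetical helper name.

```python
def attend(A, V):
    """Multiply the T x T attention matrix A by the T x d value matrix V."""
    d = len(V[0])
    return [
        [sum(a * v[j] for a, v in zip(row, V)) for j in range(d)]
        for row in A
    ]

# Hypothetical attention rows (each sums to 1) for T = 3 tokens.
A = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.25, 0.25, 0.5],
]
# Hypothetical 4-dim value vectors, one per token.
V = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

out = attend(A, V)
for row in out:
    print(row)
```

With one-hot values, each output vector reproduces its attention row in the first three coordinates, so the second token's output is an even blend of the first two value vectors.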
In a real transformer, WQ, WK, WV are D × dh matrices learned by gradient descent. Their job is to read whatever lives in the embedding and emit three different "views" of it: what am I looking for, what do I offer, what should I broadcast if attended to. The matrix product Q·Kᵀ is just every query dotted against every key; dividing by √dh keeps the variance of the scores from growing with dh, which would otherwise push softmax toward near one-hot rows. Softmax along each row concentrates weight on the highest-scoring keys while still leaving some weight on the rest.
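The effect of the √dh scaling can be checked empirically. The sketch below assumes query and key components are independent with zero mean and unit variance, which is an idealisation rather than a property of any trained model; under that assumption the raw dot product has variance close to dh, and the scaled one has variance close to 1. The helper dot_samples and the sample count are choices made for this example.

```python
import math
import random
import statistics

def dot_samples(dh, n, rng):
    """Draw n dot products of random unit-variance vectors of length dh."""
    samples = []
    for _ in range(n):
        q = [rng.gauss(0.0, 1.0) for _ in range(dh)]
        k = [rng.gauss(0.0, 1.0) for _ in range(dh)]
        samples.append(sum(a * b for a, b in zip(q, k)))
    return samples

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 2000                 # assumed sample count
results = {}
for dh in (4, 64):
    raw = dot_samples(dh, n, rng)
    scaled = [s / math.sqrt(dh) for s in raw]
    results[dh] = (statistics.pvariance(raw), statistics.pvariance(scaled))
    print(dh, results[dh])
```

The raw variance should grow roughly in proportion to dh while the scaled variance stays near 1, which is why the scores fed to softmax keep a similar spread regardless of head size.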