Attention Head

An attention head is one operation, repeated: every token's Query is dot-producted against every token's Key, softmax turns each token's scores into a probability distribution, and each token's output is the resulting weighted sum of Values. This lab unpacks one head end-to-end so you can watch which words decide to look at which.

Q, K, V — linear projections of every embedding

Each token embedding x gets multiplied by three matrices to produce its query (Q = x·W_Q), key (K = x·W_K), and value (V = x·W_V). In random mode those matrices are noise — Q/K/V carry no useful signal. In trained mode they're hand-built so verbs reach for animals and animals reach for verbs (using the same cluster colours as Lab 20).
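The three projections can be sketched in a few lines of plain Python. This is a minimal illustration of the "random mode" described above, not the lab's own code: the sizes T = 5 and D = 8, the seed, and the helper names `matmul` and `random_matrix` are assumptions made for the example, while d_h = 4 matches the 4-dim vectors the lab displays.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def random_matrix(rows, cols, rng):
    """Matrix of independent standard-normal entries (pure noise)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T, D, d_h = 5, 8, 4  # assumed sizes: 5 tokens, 8-dim embeddings, 4-dim head

X = random_matrix(T, D, rng)      # one token embedding per row
W_Q = random_matrix(D, d_h, rng)  # random mode: weights are noise
W_K = random_matrix(D, d_h, rng)
W_V = random_matrix(D, d_h, rng)

Q = matmul(X, W_Q)  # T x d_h: what each token is looking for
K = matmul(X, W_K)  # T x d_h: what each token offers
V = matmul(X, W_V)  # T x d_h: what each token broadcasts if attended to

print(len(Q), len(Q[0]))  # rows and columns of Q
```

Stacking the embeddings as rows of X means one matrix product yields the queries for every token at once; the same holds for K and V.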

trained weights (vs random noise)

[Heatmaps: Q, K, and V, each T × d_h, with columns labelled wat, mot, ani, abs]

Attention — softmax(Q·Kᵀ / √d_h) · V

Multiply Q by Kᵀ to get a T × T score matrix: row i holds how strongly token i wants to look at each token in the sequence. Softmax along each row turns the scores into a probability distribution that sums to 1. Click any row to inspect what that query is attending to; the output below is the resulting weighted V.
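As a rough sketch of the score-then-softmax step, the snippet below builds the T × T matrix and normalises each row. The toy Q and K values and the function names `softmax` and `attention_weights` are made up for illustration; subtracting the row maximum before exponentiating is a common numerical-stability trick and is an addition here, not something the lab describes.

```python
import math

def softmax(row):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """Row i of the result is token i's attention over all tokens."""
    d_h = len(Q[0])
    scale = math.sqrt(d_h)
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / scale
               for k_row in K]
              for q_row in Q]
    return [softmax(row) for row in scores]

# Toy example with T = 3 tokens and d_h = 2 (values chosen by hand).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

A = attention_weights(Q, K)
for row in A:
    print([round(p, 3) for p in row])
```

In this toy case the first query lines up with the first key only, so the first row puts its largest weight in column 0 and splits the remainder evenly between the other two.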

apply softmax (row-wise probabilities)

Each token's output is its row of the attention matrix dotted with V — a 4-dim cluster-coloured vector showing what got mixed in.
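The weighted sum itself is a short loop. The sketch below uses hand-picked attention rows and 4-dim value vectors (both invented for the example, as is the function name `weighted_values`) so the mixing is easy to read off.

```python
def weighted_values(A, V):
    """Row i of the result is sum over j of A[i][j] * V[j]."""
    n_cols = len(V[0])
    return [[sum(a * v_row[c] for a, v_row in zip(a_row, V))
             for c in range(n_cols)]
            for a_row in A]

# Hand-picked attention rows for T = 3 tokens (each row sums to 1).
A = [[1.0, 0.0, 0.0],   # token 0 attends only to token 0
     [0.5, 0.5, 0.0],   # token 1 splits evenly between tokens 0 and 1
     [0.0, 0.0, 1.0]]   # token 2 attends only to itself

# Hand-picked 4-dim value vectors, one per token.
V = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 0.0]]

out = weighted_values(A, V)
print(out[1])  # → [0.5, 0.5, 0.0, 0.0]
```

Token 1's output is an even blend of the first two value vectors, which is the "what got mixed in" reading of the output row.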

How it works. W_Q, W_K, W_V are D × d_h matrices, learned by gradient descent in a real model. Their job is to read whatever lives in the embedding and emit three different "views" of it: what am I looking for, what do I offer, what should I broadcast if attended to. The dot product Q·Kᵀ is just every-query-against-every-key; dividing by √d_h keeps the variance of the scores from growing with d_h, which would otherwise push softmax into saturation. Softmax along each row concentrates weight on the highest-scoring keys while still leaving some weight on the rest.
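The effect of the √d_h scaling can be checked empirically. The sketch below assumes query and key components are independent standard-normal draws (an idealisation, not a property of the lab's matrices); under that assumption the raw dot product has variance close to d_h, and the scaled one has variance close to 1. The function name `dot_variance`, the sample count, and the seed are arbitrary choices for the example.

```python
import math
import random

def dot_variance(d_h, scaled, n_samples=5000, seed=0):
    """Empirical variance of q.k for random d_h-dim q and k.

    Assumes every component is an independent standard-normal draw.
    If scaled is True, each dot product is divided by sqrt(d_h).
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        dot = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0)
                  for _ in range(d_h))
        values.append(dot / math.sqrt(d_h) if scaled else dot)
    mean = sum(values) / n_samples
    return sum((v - mean) ** 2 for v in values) / n_samples

for d_h in (4, 64):
    raw = dot_variance(d_h, scaled=False)
    fixed = dot_variance(d_h, scaled=True)
    print(f"d_h={d_h}: raw variance {raw:.1f}, scaled variance {fixed:.2f}")
```

The raw variance grows roughly in step with d_h while the scaled variance stays near 1, so the softmax inputs stay in a similar range whatever the head size.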

The trained matrices here are interpretable on purpose. Real heads aren't this clean (they encode a stew of grammatical, semantic, and positional signals), but the mechanism is identical. Toggle trained off and the heatmap collapses to noise: same architecture, no learned content, no useful behaviour. The arithmetic is identical in both modes; everything useful comes from what training puts into the weights.

Next (Lab 22): once one forward pass produces a distribution, the model can write into its own context and condition future passes on what it just wrote — that's chain-of-thought. (Lab 25 covers what happens when you stack many of these heads in parallel; in a real transformer a single block runs dozens of heads alongside an MLP, then does it again for many layers.)
