Positional Encoding

Attention is a set operation — every Q dot-products every K regardless of order. "Dog chases cat" and "cat chases dog" produce the same attention pattern up to relabeling. Positional encoding fixes this by stamping each embedding with a position-dependent vector before attention runs.

Bare attention is order-blind — same tokens, different order, same scores

Below: the same words rearranged. With PE off, the attention matrices contain identical scores between the same token pairs — attention can't tell what came first. Toggle PE on and slide α to add position information; the matrices start to diverge.

[Interactive demo: choose sentence A or sentence B, toggle "add positional encoding", and set α (PE strength, default 0.50). Panels: Sentence A attention matrix, Sentence B attention matrix.]
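The order-blindness shown in the two panels can be reproduced with a small sketch. It assumes random stand-in embeddings and random projection matrices (`emb`, `Wq`, `Wk` are hypothetical, not the lab's actual weights) and uses plain lists to stay dependency-free. With no positional encoding, the score for a given (query word, key word) pair is the same in both orderings.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    # W is a list of rows; returns W @ x
    return [dot(row, x) for row in W]

def attention_scores(X, Wq, Wk):
    """Scaled dot-product scores Q·Kᵀ/√d for a list of embedding vectors X."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    d = len(Q[0])
    return [[dot(q, k) / math.sqrt(d) for k in K] for q in Q]

def pair_scores(sentence, S):
    # Index scores by (query word, key word) instead of by position.
    return {(q, k): S[i][j]
            for i, q in enumerate(sentence)
            for j, k in enumerate(sentence)}

random.seed(0)
D = 4
# Hypothetical embeddings and weights, for illustration only.
emb = {w: [random.gauss(0, 1) for _ in range(D)] for w in ["dog", "chases", "cat"]}
Wq = [[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]
Wk = [[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]

sent_a = ["dog", "chases", "cat"]
sent_b = ["cat", "chases", "dog"]
pairs_a = pair_scores(sent_a, attention_scores([emb[w] for w in sent_a], Wq, Wk))
pairs_b = pair_scores(sent_b, attention_scores([emb[w] for w in sent_b], Wq, Wk))

# Largest difference between the two orderings for any word pair.
max_gap = max(abs(pairs_a[p] - pairs_b[p]) for p in pairs_a)
print(max_gap)
```

Reordering the sentence only permutes the rows and columns of the score matrix; no entry changes value, which is why `max_gap` comes out as zero here.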

Sinusoidal PE: PE[pos][2i] = sin(pos / 10000^(2i/D)), PE[pos][2i+1] = cos(pos / 10000^(2i/D))

Each position gets a unique vector of sines and cosines at geometrically spaced frequencies. Adding this to the token embedding gives every position a distinguishing fingerprint that attention can pick up on. The visualization below is the classic striped signature: low dims (small i) oscillate fast across positions, high dims oscillate slowly, because the divisor 10000^(2i/D) grows with i.

Rows = position (0 → T-1). Columns = dimension (0 → D-1). Teal = positive, orange = negative.
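A minimal sketch of how such a table could be computed, following the sinusoidal formula from the original 2017 paper. The function name `sinusoidal_pe` and the sizes T=8, D=6 are illustrative choices, and D is assumed to be even.

```python
import math

def sinusoidal_pe(T, D, base=10000.0):
    """Return a T x D table with
    PE[pos][2i]   = sin(pos / base^(2i/D))
    PE[pos][2i+1] = cos(pos / base^(2i/D)).
    Assumes D is even."""
    pe = [[0.0] * D for _ in range(T)]
    for pos in range(T):
        for i in range(D // 2):
            angle = pos / base ** (2 * i / D)
            pe[pos][2 * i] = math.sin(angle)
            pe[pos][2 * i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_pe(T=8, D=6)
for row in pe:
    print(" ".join(f"{v:+.2f}" for v in row))
```

With this indexing, column 0 uses the angle pos itself, so it changes fastest from row to row, while the last sine/cosine pair barely moves over a short sequence.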

Same token, different positions

The token "runs" appears at different positions in A and B. With PE off, its K vector is identical in both spots (similarity = 1.00). With PE on, the K vectors diverge, and that difference is what attention needs to tell the two occurrences apart.

[Interactive readout: K[runs] in A @ pos 1, K[runs] in B @ pos 1, and their cosine similarity.]

1.00 = indistinguishable. As a rough rule of thumb for this demo, anything below ~0.95 gives attention enough signal to separate them.
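The comparison can be sketched as follows, assuming a random stand-in embedding for "runs", a random key projection `Wk`, and positions 1 and 3 picked arbitrarily for the two occurrences (all hypothetical, not the lab's values).

```python
import math
import random

def pe_row(pos, D, base=10000.0):
    # One row of the sinusoidal table: [sin, cos, sin, cos, ...]
    row = []
    for i in range(D // 2):
        angle = pos / base ** (2 * i / D)
        row += [math.sin(angle), math.cos(angle)]
    return row

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

random.seed(1)
D = 8
runs_emb = [random.gauss(0, 1) for _ in range(D)]  # hypothetical embedding
Wk = [[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]  # hypothetical weights

def key(token_emb, pos, use_pe):
    """K vector for a token at `pos`, optionally with PE added first."""
    x = token_emb
    if use_pe:
        x = [t + p for t, p in zip(token_emb, pe_row(pos, D))]
    return [sum(w * xi for w, xi in zip(row, x)) for row in Wk]

sim_off = cosine(key(runs_emb, 1, False), key(runs_emb, 3, False))
sim_on = cosine(key(runs_emb, 1, True), key(runs_emb, 3, True))
print(f"PE off: {sim_off:.2f}   PE on: {sim_on:.2f}")
```

With PE off, the position argument is ignored, so both keys are the same vector and the similarity is 1. With PE on, the inputs to `Wk` differ, so the keys do too.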

Why this is a real problem. Without positional encoding, a transformer treats its input as a bag of tokens. "John saw Mary" = "Mary saw John" = "saw Mary John". The model has no way to learn subject-verb-object roles, no way to handle counting or ordering, no way to do anything sequential. Practically every transformer in use adds some position signal: sinusoidal (the original 2017 paper), learned absolute (BERT, GPT-2), or rotary, known as RoPE (LLaMA and many other modern LLMs).

How attention picks it up. When you add PE to embeddings before computing Q and K, the position info gets baked into every key. Now Q·Kᵀ doesn't just measure "what cluster is this" — it can also measure "is this nearby" or "is this token at position k". The attention head learns to use whichever signal is helpful for the task. In RoPE the trick is different: instead of adding position, you rotate Q and K by a position-dependent angle so that the dot product naturally encodes relative position.
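The RoPE idea can be illustrated with a small sketch. It assumes the common pairing of consecutive dimensions and the same base of 10000; the function name `rope` and the sample vectors are hypothetical. Each pair (x[2i], x[2i+1]) is rotated by an angle proportional to the position, and the resulting dot product depends only on the offset between query and key positions.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * base^(-2i/D)."""
    D = len(x)
    out = []
    for i in range(D // 2):
        theta = pos * base ** (-2 * i / D)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical query and key vectors (already projected).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same offset (query two positions after key), different absolute positions.
s_near = dot(rope(q, 5), rope(k, 3))
s_far = dot(rope(q, 12), rope(k, 10))
# A different offset (query one position after key).
s_other = dot(rope(q, 5), rope(k, 4))

print(s_near, s_far, s_other)
```

`s_near` and `s_far` agree up to floating-point rounding because rotating both vectors by a common extra angle leaves their dot product unchanged; `s_other` differs because the relative rotation is different.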

The "α" slider is artificial. Real models don't blend in PE — they just add it. The slider here lets you watch attention slide between order-blind and order-aware as you turn the position signal up.
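What the slider does can be approximated as computing attention on emb + α·PE. The sketch below assumes that blend and uses random stand-in embeddings and weights (hypothetical, not the lab's values). It reports, for a few α values, the largest change in any word pair's score between the two orderings.

```python
import math
import random

def pe_row(pos, D, base=10000.0):
    row = []
    for i in range(D // 2):
        angle = pos / base ** (2 * i / D)
        row += [math.sin(angle), math.cos(angle)]
    return row

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

random.seed(2)
D = 8
# Hypothetical embeddings and projection weights.
emb = {w: [random.gauss(0, 1) for _ in range(D)] for w in ["dog", "chases", "cat"]}
Wq = [[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]
Wk = [[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]

def pair_scores(sentence, alpha):
    """Scores Q·K/√D keyed by (query word, key word), inputs = emb + alpha * PE."""
    X = [[e + alpha * p for e, p in zip(emb[w], pe_row(pos, D))]
         for pos, w in enumerate(sentence)]
    Q = [[dot(r, x) for r in Wq] for x in X]
    K = [[dot(r, x) for r in Wk] for x in X]
    return {(qw, kw): dot(qv, kv) / math.sqrt(D)
            for qw, qv in zip(sentence, Q)
            for kw, kv in zip(sentence, K)}

def order_gap(alpha):
    a = pair_scores(["dog", "chases", "cat"], alpha)
    b = pair_scores(["cat", "chases", "dog"], alpha)
    return max(abs(a[p] - b[p]) for p in a)

gaps = {alpha: order_gap(alpha) for alpha in (0.0, 0.5, 1.0)}
for alpha, gap in gaps.items():
    print(f"alpha={alpha:.1f}  max score gap={gap:.4f}")
```

At α = 0 the gap is zero (order-blind); α = 1 corresponds to the plain addition that real models use.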

Next (Lab 25): stacking multiple attention heads in parallel, so the model can track several patterns at once (verbs→subjects, adjectives→nouns, position-relative, etc.).