Attention is a set operation — every Q dot-products every K regardless of order. "Dog chases cat" and "cat chases dog" produce the same attention pattern up to relabeling. Positional encoding fixes this by stamping each embedding with a position-dependent vector before attention runs.
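The "same pattern up to relabeling" claim can be checked numerically. The sketch below is illustrative only: the embeddings and the projection matrices `Wq` and `Wk` are random stand-ins, and `attention_scores` is a hypothetical helper, not code from the demo.

```python
import numpy as np

def attention_scores(X, Wq, Wk):
    """Softmax(Q K^T / sqrt(d)) for token embeddings X of shape (n, d)."""
    Q, K = X @ Wq, X @ Wk
    logits = Q @ K.T / np.sqrt(K.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
# Toy embeddings (assumption): one random vector per word, no position info.
emb = {w: rng.normal(size=d) for w in ["dog", "chases", "cat"]}
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))

A = np.stack([emb[w] for w in ["dog", "chases", "cat"]])
B = np.stack([emb[w] for w in ["cat", "chases", "dog"]])

perm = [2, 1, 0]  # B is A with the first and last tokens swapped
SA = attention_scores(A, Wq, Wk)
SB = attention_scores(B, Wq, Wk)

# Relabeling the rows and columns of SA by the permutation reproduces SB.
same_up_to_relabeling = np.allclose(SA[np.ix_(perm, perm)], SB)
print(same_up_to_relabeling)
```

Without any position signal, the score between "dog" and "cat" is the same in both sentences; only the row and column it lands in changes.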
Below: the same words rearranged. With PE off, the attention matrices contain identical scores between the same token pairs — attention can't tell what came first. Toggle PE on and slide α to add position information; the matrices start to diverge.
Each position gets a unique vector of sines and cosines at geometrically spaced frequencies. Adding this to the token embedding gives every position a distinguishing fingerprint that attention can pick up on. The visualization below is the classic stripey signature: in the standard layout, low dims oscillate fast across positions and high dims oscillate slowly.
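A minimal sketch of such an encoding, assuming the common interleaved sine/cosine layout with base 10000. The function name `sinusoidal_pe`, the sizes, and the scaling factor `alpha` are illustrative choices, not taken from the demo's source.

```python
import numpy as np

def sinusoidal_pe(n_positions, d_model, base=10000.0):
    """Return an (n_positions, d_model) matrix; d_model is assumed even."""
    pos = np.arange(n_positions)[:, None]       # shape (n, 1)
    i = np.arange(d_model // 2)[None, :]        # shape (1, d/2)
    freq = base ** (-2.0 * i / d_model)         # geometric spacing, starting at 1
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(pos * freq)            # even dims hold sines
    pe[:, 1::2] = np.cos(pos * freq)            # odd dims hold cosines
    return pe

pe = sinusoidal_pe(n_positions=50, d_model=16)

# Stamping token embeddings: alpha scales how much position is mixed in.
rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(50, 16))             # toy token embeddings
alpha = 1.0
x = tok_emb + alpha * pe
```

Plotting `pe` as a heatmap (positions on one axis, dims on the other) gives the striped pattern described above.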
The token "runs" appears at different positions in A and B. With PE off, its K vector is identical in both spots (similarity = 1.00). With PE on, the K vectors diverge, and that difference is what attention needs to tell the two occurrences apart.
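The same comparison can be reproduced in a few lines. This is a sketch under assumptions: the embedding for "runs", the key projection `Wk`, the positions 1 and 4, and the helper names are all made up for illustration, and cosine similarity stands in for whatever similarity measure the demo reports.

```python
import numpy as np

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Sinusoidal encoding for one integer position (d_model assumed even)."""
    freq = base ** (-2.0 * np.arange(d_model // 2) / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freq)
    pe[1::2] = np.cos(pos * freq)
    return pe

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
d = 16
runs = rng.normal(size=d)        # toy embedding for the token "runs"
Wk = rng.normal(size=(d, d))     # toy key projection

def key(pos, alpha):
    """K vector for "runs" at a given position; alpha=0 means PE off."""
    return (runs + alpha * sinusoidal_pe(pos, d)) @ Wk

sim_off = cosine(key(1, alpha=0.0), key(4, alpha=0.0))
sim_on = cosine(key(1, alpha=1.0), key(4, alpha=1.0))
print(f"PE off: {sim_off:.2f}  PE on: {sim_on:.2f}")
```

With `alpha=0.0` the two keys are the same vector, so the similarity is 1.00; any nonzero `alpha` pulls them apart.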
Q·Kᵀ doesn't just measure "what cluster is this"; it can also measure "is this nearby" or "is this token at position k". The attention head learns to use whichever signal is helpful for the task. In RoPE the trick is different: instead of adding position, you rotate Q and K by a position-dependent angle, so that the dot product depends on the two positions only through their offset.
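The relative-position property of the rotation can be seen in a small sketch. It assumes the pairwise-rotation form of RoPE with base 10000; the function `rope` and the random `q` and `k` vectors are illustrative, not a reference implementation.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x (even length) by angles pos * freq."""
    d = x.shape[-1]
    freq = base ** (-2.0 * np.arange(d // 2) / d)
    cos, sin = np.cos(pos * freq), np.sin(pos * freq)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin   # standard 2D rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (3) at very different absolute positions: same score.
s1 = rope(q, 5) @ rope(k, 2)
s2 = rope(q, 105) @ rope(k, 102)

# A different offset (1) generally gives a different score.
s3 = rope(q, 5) @ rope(k, 4)

print(np.isclose(s1, s2))
```

Because each pair is rotated rather than shifted, vector norms are preserved, and shifting both positions by the same amount cancels out in the dot product.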