Drift & LoRA

Two reality checks. Drift: as the model autoregressively appends its own output, attention gravitates toward those fresh tokens — the prompt becomes a fading minority. LoRA: instead of retraining the giant base matrix, learn a tiny low-rank update that bends behaviour without touching the core.

Context drift — attention pulls toward fresh tokens

Generation reuses the same attention head from Lab 21. Each step samples the next token from the cluster the model is currently attending to most, then appends it. The teal bars above each chip show how much attention the latest query is paying to that position — watch the mass migrate to the right.

Token legend: prompt · generated

Attention on prompt: 100% (healthy: query still grounded in prompt)

Attention entropy: 1.00 (normalised; low = collapsed onto a few tokens)

Cluster lock-in: fraction of generated tokens from the dominant cluster
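The three readouts above can be computed from a single attention row and the list of cluster ids generated so far. The following is a minimal sketch, not the lab's actual code: the function names, and the choice to normalise entropy by log(n), are assumptions made for illustration.

```python
import numpy as np

def attention_on_prompt(attn, n_prompt):
    """Share of attention mass that lands on the first n_prompt positions."""
    attn = np.asarray(attn, dtype=float)
    return float(attn[:n_prompt].sum() / attn.sum())

def normalised_entropy(attn):
    """Shannon entropy divided by log(n): 1.0 = uniform, 0.0 = one-hot."""
    p = np.asarray(attn, dtype=float)
    p = p / p.sum()
    if p.size < 2:
        return 0.0
    nz = p[p > 0]                      # skip zeros so log() stays finite
    return float(-(nz * np.log(nz)).sum() / np.log(p.size))

def cluster_lock_in(generated_clusters):
    """Fraction of generated tokens that belong to the most common cluster."""
    if len(generated_clusters) == 0:
        return 0.0
    _, counts = np.unique(generated_clusters, return_counts=True)
    return float(counts.max() / len(generated_clusters))

# Example: 2 prompt positions followed by 3 generated ones (made-up weights).
attn = np.array([0.05, 0.05, 0.10, 0.30, 0.50])
print(attention_on_prompt(attn, n_prompt=2))
print(normalised_entropy(attn))
print(cluster_lock_in([2, 2, 0, 2]))
```

With these definitions, a query that spreads its attention evenly over the prompt scores 1.0 on the first two readouts, and both fall as mass concentrates on a few recent tokens.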

LoRA — W + α·B·A  (low-rank adaptation)

The base weight matrix W is large and frozen. Fine-tuning learns two small matrices, B and A, whose product B·A has the same shape as W but rank at most r, so it needs far fewer parameters. At inference you add α·B·A to W: the model picks up the new behaviour, and removing the adapter restores the original exactly.

W (D × N, 🔒 frozen) + α · B (D × r) · A (r × N) = Weff (D × N, adapted)

Base output — softmax(x · W)

Adapted — softmax(x · (W + α·B·A))
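The two outputs can be compared in a few lines of NumPy. This is a toy sketch under stated assumptions: the sizes, the random values, and the convention of initialising B to zero (so the adapter starts as a no-op) are illustrative choices, and `B_trained` is a random stand-in for a B that training would produce.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lora_forward(x, W, B, A, alpha):
    """softmax(x · (W + α·B·A)), computed without building the D × N update."""
    return softmax(x @ W + alpha * ((x @ B) @ A))

rng = np.random.default_rng(0)
D, N, r, alpha = 8, 6, 2, 1.0          # toy sizes, chosen for illustration

W = rng.normal(size=(D, N))            # frozen base weights
A = 0.1 * rng.normal(size=(r, N))      # small random init (assumed convention)
B = np.zeros((D, r))                   # zero init: adapter starts as a no-op
x = rng.normal(size=(1, D))            # one input row vector

base = softmax(x @ W)
adapted_at_init = lora_forward(x, W, B, A, alpha)     # equals base while B is zero

B_trained = rng.normal(size=(D, r))    # hypothetical stand-in for a trained B
adapted = lora_forward(x, W, B_trained, A, alpha)
merged = softmax(x @ (W + alpha * (B_trained @ A)))   # same result, weights merged

print(np.round(base, 3))
print(np.round(adapted, 3))
```

Applying B and A as two thin multiplications and merging them into Weff give the same output; merging costs nothing extra at inference, while keeping them separate makes the adapter easy to swap.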

How drift happens. Attention assigns mass via softmax(Q·Kᵀ), and a fresh query at the last position naturally finds its highest dot-products with recent, similar keys. As the model emits tokens from its own dominant cluster, those become the easiest things to attend to next, which makes the model emit more of the same: positive feedback. In this demo, sampling temperature is the main lever for breaking out; at low temperature the chain tends to collapse onto whichever cluster won the first few steps.
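The feedback loop can be reproduced with a toy simulation. This sketch is not the lab's implementation: it assumes tokens are noisy copies of a few cluster centroids, uses the latest token's embedding as both query and key, and samples the next cluster from the attention mass per cluster raised to 1/temperature. The function name, the noise level, and the score scale of 4.0 are all hypothetical.

```python
import numpy as np

def simulate_drift(n_prompt=8, n_steps=40, n_clusters=4, dim=16,
                   temperature=0.5, seed=0):
    """Return (attention-on-prompt per step, generated cluster ids)."""
    rng = np.random.default_rng(seed)
    centroids = rng.normal(size=(n_clusters, dim))
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    def embed(c):
        # A token is its cluster centroid plus a little noise.
        return centroids[c] + 0.1 * rng.normal(size=dim)

    clusters = [int(c) for c in rng.integers(0, n_clusters, size=n_prompt)]
    keys = [embed(c) for c in clusters]
    prompt_share = []

    for _ in range(n_steps):
        K = np.stack(keys)
        q = K[-1]                              # query = latest token
        scores = 4.0 * (K @ q)                 # assumed sharpening factor
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()
        prompt_share.append(float(attn[:n_prompt].sum()))

        # Attention mass per cluster, then temperature-scaled sampling.
        mass = np.bincount(clusters, weights=attn, minlength=n_clusters)
        logits = np.log(mass + 1e-12) / temperature
        p = np.exp(logits - logits.max())
        p /= p.sum()
        nxt = int(rng.choice(n_clusters, p=p))
        clusters.append(nxt)
        keys.append(embed(nxt))

    return prompt_share, clusters[n_prompt:]

share, generated = simulate_drift(temperature=0.5)
print(share[0], share[-1])
print(generated)
```

On the first step every position belongs to the prompt, so the share starts at 1.0 and can only fall as generated tokens join the context. Rerunning with a higher `temperature` is one way to check how much exploration it takes to keep more than one cluster alive in this toy setup.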

Why hallucination feels confident. The model isn't fabricating from nothing — it's faithfully completing a context that's mostly its own previous output. Each step is locally valid; it's the trajectory that drifts.

How LoRA gets away with so few parameters. Empirically, the useful change needed to teach a base model a new style or task tends to have low rank: it occupies a small subspace of the full D × N weight space. So learning B (D × r) and A (r × N) for r ≪ min(D, N) captures the useful directions with r·(D + N) parameters instead of D·N. The base W stays frozen and shareable; an adapter is just B and A on a flash drive. Stack them, swap them, retire them; the base never changes.
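To make the saving concrete, here is a small worked example. The sizes D = N = 4096 and r = 8 are illustrative assumptions, not values taken from the lab, and the helper names are hypothetical.

```python
import numpy as np

def full_params(D, N):
    """Parameters in a dense D × N weight matrix."""
    return D * N

def lora_params(D, N, r):
    """Parameters in B (D × r) plus A (r × N)."""
    return r * (D + N)

D, N, r = 4096, 4096, 8                 # illustrative sizes
print(full_params(D, N))                # 16777216
print(lora_params(D, N, r))             # 65536
print(full_params(D, N) // lora_params(D, N, r))  # 256

# The product B·A has the full D × N shape but rank at most r.
rng = np.random.default_rng(0)
B = rng.normal(size=(64, 4))
A = rng.normal(size=(4, 48))
update = B @ A
print(update.shape)                     # (64, 48)
print(np.linalg.matrix_rank(update))    # at most 4
```

At these sizes the adapter stores 256 times fewer numbers than the matrix it modifies, which is why it is cheap to keep many adapters alongside one shared base.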

The core arc is done. Tokenize (Lab 20) → embed → attend (Lab 21) → chain (Lab 22) → drift and adapt (here). Same matrix multiplications all the way down. The "magic" is what the parameters end up encoding after training, not the operations that use them.

Next (Lab 24): positional encoding — how attention learns about word order in the first place. Then Lab 25 stacks heads in parallel, Lab 26 covers how a distribution becomes a token, and Lab 27 closes on retrieval.