Two reality checks. Drift: as the model autoregressively appends its own output, attention gravitates toward those fresh tokens — the prompt becomes a fading minority. LoRA: instead of retraining the giant base matrix, learn a tiny low-rank update that bends behaviour without touching the core.
Generation reuses the same attention head from Lab 21. Each step samples the next token from the cluster the model is currently attending to most, then appends it. The teal bars above each chip show how much attention the latest query is paying to that position — watch the mass migrate to the right.
The base weight matrix W is huge and frozen. Fine-tuning learns two small matrices B and A whose product has the same shape as W but vastly fewer parameters (rank r). At inference you just add α·B·A — the model gets new behaviour without losing the old.
softmax(Q·Kᵀ) — and a fresh query at the last position naturally finds the highest dot-product with recent, similar keys. As the model emits tokens from its own dominant cluster, those become the easiest things to attend to next, which makes the model emit more of the same. Positive feedback. Sampling temperature is the only real lever to break out; without it, the chain collapses onto whichever cluster won the first few steps.
D × N weight space. So learning B (D × r) and A (r × N) for r ≪ min(D, N) captures the useful direction with r·(D + N) parameters instead of D·N. The base W stays frozen and shareable; an adapter is just B and A on a flash drive. Stack them, swap them, retire them — the base never changes.