A transformer is stateless between tokens — every prediction looks at its full context and nothing else. Chain-of-thought exploits this by letting the model write into its own context: each reasoning token it generates is appended and conditions the next prediction. More tokens = more attention passes = more compute spent per answer. It isn't thinking; it's buying compute.
Direct vs Chain-of-thought — same model, different prompt
Same puzzle, same weights, two prompting strategies. Direct emits a single answer token. Chain of thought emits a few reasoning tokens first, and each one is fed back into the context. Watch how the answer distribution sharpens.
[Interactive demo. Counter: compute spent (tokens), direct vs CoT. Panel "Direct, one token": next-token distribution over candidate answers, with P(correct). Panel "Chain of thought": next-token distribution over candidate answers, conditioned on the context so far, with P(correct).]
P(correct) over generation — compute buys conviction
The orange line is the direct model's single answer attempt. The teal line is the CoT trajectory — flat until a reasoning token unlocks the right pattern (typically the model just says the answer mid-chain and then copies it forward via attention).
Legend: direct (one shot); chain of thought (step by step); vertical lines = reasoning tokens emitted.
How it works. Each forward pass takes the entire context, runs attention over it, and outputs a probability distribution for the next token. The model has no scratchpad, no working memory — only the visible context. When CoT generates a reasoning token, that token gets appended to the context, so the very next forward pass attends to it. If reasoning produces "12" at position k, then at position k+5 an induction-style attention head can copy that 12 straight into the answer slot. That's the whole mechanism.
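The loop described above can be sketched in a few lines. This is a toy under stated assumptions, not a real transformer: `toy_forward`, `REASONING_SCRIPT`, the prompt, and every probability are hypothetical and hand-authored, in the same spirit as the lab's distributions. What it illustrates is that the "model" is a pure function of the visible context, and that the answer sharpens only because "12" was written into that context earlier.

```python
PROMPT = ["What", "is", "3", "x", "4", "?"]
# Hypothetical scripted chain of thought; a real model would produce this itself.
REASONING_SCRIPT = ["3", "times", "4", "is", "12", ".", "Answer:"]


def toy_forward(context):
    """Stand-in for one forward pass: a pure function of the visible context.

    Returns a dict mapping candidate next tokens to made-up probabilities.
    Nothing is remembered between calls.
    """
    if context[-1] == "Answer:":
        if "12" in context:
            # Copy behaviour: the answer is already visible, so repeat it.
            return {"12": 0.90, "7": 0.05, "34": 0.05}
        # No reasoning in the context: an unsure (and here wrong) guess.
        return {"7": 0.40, "12": 0.35, "34": 0.25}
    # Otherwise continue the script, using only the context to find our place.
    n_generated = len(context) - context.index("?") - 1
    if n_generated < len(REASONING_SCRIPT):
        return {REASONING_SCRIPT[n_generated]: 1.0}
    return {"Answer:": 1.0}


def generate(prompt, max_new_tokens):
    """Greedy autoregressive decoding: predict, append, repeat."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        dist = toy_forward(context)        # one pass over the whole context
        token = max(dist, key=dist.get)    # greedy choice
        context.append(token)              # the model writes into its own context
        if context[-2] == "Answer:":       # stop once the answer token is emitted
            break
    return context


direct = generate(PROMPT + ["Answer:"], max_new_tokens=1)
cot = generate(PROMPT, max_new_tokens=20)
print("direct:", direct[-1])
print("cot:   ", " ".join(cot[len(PROMPT):]))
```

Calling `toy_forward` twice on the same context returns the same distribution; the only way to change the prediction is to change the context, which is exactly what the generated reasoning tokens do.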
Why "more tokens = better reasoning." Each emitted token costs one full forward pass, in which every layer runs attention over the whole context. Direct prediction gets one pass; CoT gets one per token. So CoT is literally trading tokens for compute, which is why reasoning models like o1 and Claude with thinking spend many reasoning tokens (often hidden from the user) before answering. The puzzles here are tiny but the dynamics scale.
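A rough way to put numbers on that trade is to count forward passes and the context positions each pass attends over. The function below is an illustrative accounting only: `decode_cost` is a hypothetical helper, the counts are per layer and per head, and it ignores the MLP blocks and the cost of first reading the prompt.

```python
def decode_cost(prompt_len, new_tokens):
    """Return (forward_passes, attended_positions) for generating new_tokens.

    Each generated token needs one forward pass, and in that pass the last
    position attends over every token currently in the context.
    """
    passes = 0
    attended = 0
    context_len = prompt_len
    for _ in range(new_tokens):
        passes += 1
        attended += context_len  # attention spans the whole current context
        context_len += 1         # the emitted token joins the context
    return passes, attended


# Made-up sizes: a 20-token prompt, direct answer vs. 8 reasoning tokens + answer.
print(decode_cost(20, 1))
print(decode_cost(20, 9))
```

With these made-up sizes the direct answer costs 1 pass over 20 positions, while the nine-token chain costs 9 passes over 216 positions in total, since each later pass also attends to the reasoning tokens already emitted.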
The distributions are hand-authored. Real models compute these from softmax(W_U · h_final), where h_final is the last-layer hidden state at the last position (see Lab 21 for the attention head that produces it). The point of this lab is the shape: how distributions shift as the model conditions on its own emissions.
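For reference, the softmax-over-unembedding step looks roughly like this in plain Python. The matrix and hidden state are made-up numbers, and the layout (one row of `W_U` per candidate token) is an assumption chosen for readability; some codebases store the transpose.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(W_U, h_final):
    """softmax(W_U · h_final): one logit per row of the unembedding matrix."""
    logits = [sum(w * h for w, h in zip(row, h_final)) for row in W_U]
    return softmax(logits)


# Made-up example: 3 candidate answers, hidden size 2.
W_U = [
    [1.0, 0.0],    # candidate 0
    [0.0, 1.0],    # candidate 1
    [-1.0, -1.0],  # candidate 2
]
h_final = [2.0, 0.5]  # hypothetical last-layer hidden state at the last position

probs = next_token_distribution(W_U, h_final)
print([round(p, 3) for p in probs])
```

Appending a reasoning token changes the context, which changes `h_final`, which in turn moves the logits; that is the only route by which the distribution shifts.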
Next (Lab 23): drift, hallucination, and LoRA — what happens when the context fills with the model's own tokens and starts crowding out the original prompt.