Sampling Strategies

After the model produces a next-token distribution (see Labs 21 & 22), you still have to pick a token. Greedy is deterministic but boring. Temperature reshapes the curve. Top-k and top-p (nucleus) crop the tail. Each strategy trades off fidelity against diversity in a different way — and the choice is what makes one model feel creative and another feel robotic.

The fixed distribution — same model output, different policies

Prompt: "The cat sat on the" → next token distribution shown below.

Four ways to sample — same distribution, very different outputs

Each panel reshapes the base distribution above. The teal bars are the active candidates after the strategy is applied; eliminated tokens grey out. Roll once to draw a single sample, or roll 200× to see the empirical distribution.
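The "roll once" versus "roll 200×" comparison can be sketched in a few lines. The token probabilities below are invented for illustration (the lab's actual numbers are not reproduced here), and `roll` is a hypothetical helper name, so treat this as a minimal sketch rather than the lab's own implementation.

```python
import random
from collections import Counter

# Hypothetical next-token distribution for "The cat sat on the"
# (values invented for illustration; they sum to 1).
base = {"mat": 0.40, "floor": 0.25, "couch": 0.15,
        "roof": 0.10, "moon": 0.07, "piano": 0.03}

def roll(dist, n, rng):
    """Draw n tokens from dist and return their empirical frequencies."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    draws = rng.choices(tokens, weights=weights, k=n)
    counts = Counter(draws)
    return {t: counts[t] / n for t in tokens}

rng = random.Random(0)            # seeded so the run is repeatable
single = roll(base, 1, rng)       # one roll: all mass lands on one token
empirical = roll(base, 200, rng)  # 200 rolls: frequencies approximate base

for tok in base:
    print(f"{tok:6s} true={base[tok]:.2f} empirical={empirical[tok]:.2f}")
```

A single roll tells you almost nothing about the distribution. With 200 rolls the frequencies settle near the true probabilities, though rare tokens still fluctuate noticeably.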

How they trade off. Greedy gives you the model's single most likely token at every step: repeatable, but bland and prone to loops. Temperature divides every logit by T before the softmax: T < 1 sharpens the distribution (more deterministic), T > 1 flattens it (more diverse), T → 0 approaches greedy, and T → ∞ approaches uniform random. Top-k keeps only the k highest-probability tokens and renormalizes; it is simple, but k is fixed in advance, so the same cutoff applies whether the distribution is peaked or flat. Top-p (nucleus sampling, Holtzman et al., 2019) keeps the smallest set of top-ranked tokens whose cumulative probability reaches p, then renormalizes. It is adaptive: a sharp distribution keeps very few tokens, a flat one keeps many.
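The four strategies can be written down compactly. The following is a minimal sketch using only the standard library; the logits are hypothetical values and the function names are chosen for illustration, not taken from any particular library.

```python
import math

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Index of the single most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k(probs, k):
    """Keep the k most likely tokens, zero out the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p(probs, p):
    """Keep the smallest top-ranked set whose mass reaches p, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def survivors(probs):
    """Number of tokens left with non-zero probability."""
    return sum(1 for x in probs if x > 0)

logits = [2.0, 1.5, 1.0, 0.5, 0.0, -1.0]  # hypothetical values
probs = softmax(logits)                   # T = 1
sharp = softmax(logits, temperature=0.5)  # more peaked
flat = softmax(logits, temperature=2.0)   # closer to uniform

print("greedy pick:", greedy(probs))
print("top-k (k=3) survivors:", survivors(top_k(probs, 3)))
for name, dist in [("sharp", sharp), ("base", probs), ("flat", flat)]:
    print(f"top-p (p=0.9) survivors on {name}:", survivors(top_p(dist, 0.9)))
```

The last loop shows the adaptive behavior: with the same p, the nucleus is smaller on the sharpened distribution and larger on the flattened one, whereas top-k always keeps exactly k tokens.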

Modern defaults. Most production LLM deployments combine these, typically temperature ≈ 0.7–1.0 with top-p ≈ 0.9. Reasoning models often run at lower temperatures (0.2–0.5) to keep the chain of thought coherent. Creative writing usually benefits from higher temperatures.
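A combined sampler might look like the sketch below. It assumes temperature is applied first and the nucleus cutoff second; implementations differ on the exact order and on how they handle T = 0, so the `sample` function and its defaults here are illustrative assumptions rather than any specific system's behavior.

```python
import math
import random

def sample(logits, temperature=0.8, top_p=0.9, rng=random):
    """Temperature-scale the logits, apply a nucleus cutoff, draw one index."""
    if temperature <= 0:
        # Assumption: treat T = 0 as greedy decoding to avoid dividing by zero.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Step 1: temperature-scaled softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Step 2: nucleus cutoff on the reshaped distribution.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Step 3: draw from the survivors (choices normalizes the weights itself).
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]

logits = [2.0, 1.5, 1.0, 0.5, 0.0, -1.0]  # hypothetical values
rng = random.Random(0)
draws = [sample(logits, temperature=0.8, top_p=0.9, rng=rng) for _ in range(200)]
print(sorted(set(draws)))  # only indices inside the nucleus can appear
```

With these hypothetical logits, the two least likely tokens fall outside the nucleus and are never drawn, while the remaining four still appear with varying frequency.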

Two things this lab leaves out. Beam search tracks the top-n partial sequences over multiple steps rather than committing to one token at a time. It is useful for translation and structured outputs, less useful for open-ended generation (it ends up bland because high-likelihood text is bland). Repetition penalties downweight tokens that have already appeared, a hack to fight the drift seen in Lab 23.
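To make the beam search idea concrete, here is a minimal sketch over a hypothetical toy model in which the next-token probabilities depend only on the previous token. The `MODEL` table and its numbers are invented for illustration; a real model would condition on the whole prefix.

```python
import math

# Hypothetical toy model: next-token probabilities given the previous token.
MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.3, "end": 0.3},
    "a":   {"cat": 0.9, "end": 0.1},
    "cat": {"end": 1.0},
    "dog": {"end": 1.0},
    "end": {"end": 1.0},
}

def beam_search(start, steps, beam_width):
    """Keep the beam_width best partial sequences (by log-probability) per step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for tok, p in MODEL[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Rank every extension of every beam, then prune to the best few.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# beam_width=1 commits to one token at a time, i.e. greedy decoding.
greedy_path = beam_search("<s>", steps=2, beam_width=1)[0]
beam_path = beam_search("<s>", steps=2, beam_width=2)[0]

print(greedy_path[0], round(math.exp(greedy_path[1]), 2))
print(beam_path[0], round(math.exp(beam_path[1]), 2))
```

In this toy setup, greedy commits to "the" (0.6) and ends up with a sequence of probability 0.24, while a beam of width 2 keeps "a" alive long enough to find "a cat" at 0.36. That is the sense in which beam search looks ahead instead of committing one token at a time.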

Next (Lab 27): retrieval-augmented generation — instead of trying to squeeze world knowledge into the weights, paste the relevant facts into the prompt.