Tokenizer & Embeddings

An LLM doesn't read text — it reads integers, then looks those integers up in a giant table of vectors. This lab shows both halves: byte-pair encoding growing a vocabulary from raw characters, and that vocabulary becoming a geometry where similar words sit near each other.

Byte-pair encoding — merge the most common pair, repeat

Start with characters. Count every adjacent pair across the corpus, merge the most frequent pair into a new token, and repeat. Common chunks like th and er, and then whole words, emerge as single tokens. The slider runs N merges; the vocabulary below grows live.

Readouts: vocab size, chars in corpus, tokens after merges, compression, last merge.
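
The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's actual code: the names `train_bpe` and `merge_pair` are hypothetical, the corpus is split on whitespace so merges never cross word boundaries, ties go to the pair seen first, and "compression" is assumed to mean characters divided by tokens.

```python
from collections import Counter

def merge_pair(tokens, a, b):
    """Replace every adjacent (a, b) in tokens with the single token a+b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(corpus, num_merges):
    """Learn up to num_merges merges; return (merges, tokenized words)."""
    # Assumption: words are split on whitespace and start as single characters.
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the corpus.
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break  # every word is already a single token
        # Most frequent pair; equal counts resolve to the pair seen first.
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Rewrite the corpus with the merged token.
        words = [merge_pair(w, a, b) for w in words]
    return merges, words

corpus = "low low low lower lowest"
merges, words = train_bpe(corpus, num_merges=2)

# Readouts in the spirit of the lab's counters (definitions assumed).
chars = sum(len(w) for w in corpus.split())
tokens = sum(len(w) for w in words)
vocab = set(corpus.replace(" ", "")) | {a + b for a, b in merges}
print(merges, len(vocab), f"compression {chars / tokens:.2f}x")
```

Raising `num_merges` plays the role of the slider: each extra step adds one entry to `merges` and one token to `vocab`.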

Now tokenize a new sentence with the learned vocab. Unknown characters fall back to single-char tokens (orange).
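
Encoding new text with a learned vocabulary might look like the sketch below. It is an illustration under assumptions: `encode` and `apply_merge` are hypothetical names, the merge list and vocabulary are hard-coded stand-ins for what training would produce, and the boolean flag marks tokens the vocabulary has never seen (the ones the lab colors orange).

```python
def apply_merge(tokens, a, b):
    """Replace every adjacent (a, b) in tokens with the single token a+b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def encode(text, merges, vocab):
    """Tokenize text by replaying merges in training order.

    Returns (token, known) pairs; known is False for characters that
    were never in the training corpus and so stay single-char tokens.
    """
    result = []
    for word in text.split():
        tokens = list(word)          # start from characters
        for a, b in merges:          # order matters: earlier merges feed later ones
            tokens = apply_merge(tokens, a, b)
        result.extend((t, t in vocab) for t in tokens)
    return result

# Hypothetical output of a short training run.
merges = [("l", "o"), ("lo", "w")]
vocab = {"l", "o", "w", "e", "r", "s", "t", "lo", "low"}

for token, known in encode("lowest slow zoo", merges, vocab):
    print(token, "" if known else "(unknown)")
```

Here "z" never appeared in training, so it stays a single-character token and is the only one flagged as unknown.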

Tokens become vectors — W_E[id]

Each token ID indexes a row of the embedding matrix W_E, a vector of D floats. Untrained, those rows are random noise in a D-dimensional space. Trained, tokens with similar meanings end up near each other. We project to 2D so you can see it.

Legend: water, motion, animals, abstract, unclustered / subword
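
The lookup described above is plain row indexing. A minimal NumPy sketch, with a made-up vocabulary size `V`, width `D`, init scale, and token IDs chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 9, 16                                # hypothetical vocab size and embedding width
W_E = rng.normal(0.0, 0.02, size=(V, D))    # untrained: each row is random noise

token_ids = np.array([8, 3, 5, 6])          # hypothetical IDs produced by the tokenizer
x = W_E[token_ids]                          # one row per token, shape (4, D)

# The same lookup written as a one-hot matrix product.
one_hot = np.eye(V)[token_ids]              # shape (4, V)
assert np.allclose(one_hot @ W_E, x)
```

The one-hot form shows why the table counts as a weight matrix: indexing is just a cheap way to do that multiplication.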
How it works. At each step BPE merges the highest-frequency adjacent pair, then rewrites the corpus with the merged token, so the vocabulary grows from characters to common chunks to whole words, most frequent first. The encoder just replays those merges in training order on new text. Each token ID then indexes W_E[id], a row of the embedding matrix: random at init, organized by training. Top-2 PCA finds the plane along which the embeddings vary the most, so we can show a D-dimensional space as a flat scatter.
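
The top-2 PCA projection can be sketched with a singular value decomposition. This is one common way to compute it, not necessarily how the lab does it; `pca_2d` is a hypothetical name and the random matrix stands in for an embedding table.

```python
import numpy as np

def pca_2d(W):
    """Project the rows of W onto their two highest-variance directions.

    Returns (coords, explained): coords has shape (n_rows, 2) and
    explained is the fraction of total variance those two axes capture.
    """
    centered = W - W.mean(axis=0)           # PCA works on mean-centered data
    # Rows of Vt are the principal directions, sorted by decreasing singular value.
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ Vt[:2].T            # coordinates in the top-2 plane
    explained = (S[:2] ** 2).sum() / (S ** 2).sum()
    return coords, explained

rng = np.random.default_rng(0)
W_E = rng.normal(size=(50, 16))             # stand-in for an untrained embedding table
coords, explained = pca_2d(W_E)
print(coords.shape, round(float(explained), 3))
```

For pure noise the explained fraction stays small, because no plane is much better than any other. Structured embeddings concentrate more variance in the top two axes, which is why clusters become visible in the scatter.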

The semantic toggle is a stand-in. Real embeddings get their structure from co-occurrence over billions of tokens. Here we simply blend cluster centers into matching tokens (water/motion/animals/abstract) so you can see what trained geometry looks like; uncheck it to compare against pure noise. Either way, the mechanism downstream is identical: QK^T attention works on whatever vectors are in the table.
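
Blending cluster centers into matching tokens could be done roughly as follows. Everything specific here is an assumption made for illustration: the function name `blend_clusters`, the token lists, the blend weight `alpha`, and the scale of the cluster centers. The lab's actual values are not stated.

```python
import numpy as np

def blend_clusters(W, tokens, clusters, centers, alpha):
    """Pull each clustered token's row toward its cluster center.

    alpha = 0 leaves the random rows untouched; alpha = 1 collapses
    every member onto its center. Unclustered tokens are not modified.
    """
    out = W.copy()
    for name, members in clusters.items():
        for tok in members:
            i = tokens.index(tok)
            out[i] = (1 - alpha) * W[i] + alpha * centers[name]
    return out

rng = np.random.default_rng(0)
D = 16
# Hypothetical vocabulary: two semantic clusters plus two subword tokens.
tokens = ["river", "rain", "sea", "run", "jump", "fly", "th", "er"]
clusters = {"water": ["river", "rain", "sea"], "motion": ["run", "jump", "fly"]}

W_E = rng.normal(size=(len(tokens), D))                     # pure noise
centers = {name: 3.0 * rng.normal(size=D) for name in clusters}
blended = blend_clusters(W_E, tokens, clusters, centers, alpha=0.8)

def dist(M, a, b):
    return float(np.linalg.norm(M[tokens.index(a)] - M[tokens.index(b)]))

print("same cluster :", round(dist(blended, "river", "rain"), 2))
print("cross cluster:", round(dist(blended, "river", "run"), 2))
```

After blending, tokens in the same cluster sit much closer together than tokens from different clusters, which is the geometry the toggle is meant to imitate.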

Next (Lab 21): these vectors flow into a single attention head, with Q, K, V projections and a softmax over QK^T. That's where the model starts using the geometry.
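
As a preview, a single head over a handful of embedded tokens might be sketched as below. This is not Lab 21's code: the weight shapes are invented, the division by the square root of the key width is the standard scaling rather than something stated here, and no causal mask is applied.

```python
import numpy as np

def attention_head(X, W_Q, W_K, W_V):
    """One attention head: project, score with QK^T, softmax, mix values."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # Assumption: standard scaling by sqrt of the key width.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
T, D, d_head = 4, 16, 8                     # hypothetical sizes
X = rng.normal(size=(T, D))                 # rows looked up from the embedding table
W_Q, W_K, W_V = (rng.normal(size=(D, d_head)) for _ in range(3))

out, weights = attention_head(X, W_Q, W_K, W_V)
print(out.shape, weights.sum(axis=-1))
```

Each row of `weights` is a probability distribution over the input tokens, computed from dot products between projected embeddings, so whatever geometry the table holds directly shapes which tokens attend to which.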