An LLM doesn't read text — it reads integers, then looks those integers up in a giant table of vectors. This lab shows both halves: byte-pair encoding growing a vocabulary from raw characters, and that vocabulary becoming a geometry where similar words sit near each other.
Start with characters. Count every adjacent pair across the corpus, merge the most frequent pair into a new token, and repeat. Common chunks like th and er emerge first, then whole words become single tokens. The slider runs N merges; the vocabulary below grows live.
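The merge loop above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's actual code: the names `learn_bpe` and `merge_pair` are hypothetical, and it assumes the corpus is pre-split into words so that merges never cross a word boundary.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words (assumed pre-split)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    words = Counter(tuple(w) for w in corpus)
    vocab = sorted({ch for w in words for ch in w})
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by how often its word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        vocab.append(best[0] + best[1])
        # Rewrite the corpus with the new token in place.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges, vocab

merges, vocab = learn_bpe("the then there the other".split(), num_merges=2)
print(merges)  # [('t', 'h'), ('th', 'e')]
```

With two merges on this toy corpus, th appears first and the follows, which mirrors the chunk-then-word progression the slider shows.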
Now tokenize a new sentence with the learned vocab. Unknown characters fall back to single-char tokens (orange).
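One way to picture this step: split the sentence into characters, replay the learned merges in order, and flag anything that is not in the vocabulary. The sketch below is an assumption about how such an encoder might work, not the lab's implementation; `bpe_encode`, the tiny `merges` list, and the `vocab` set are all made up for illustration.

```python
def bpe_encode(text, merges, vocab):
    """Return (token, known) pairs; `known` is False for out-of-vocabulary characters."""
    symbols = list(text)
    for a, b in merges:  # replay merges in the order they were learned
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    # An unknown character never takes part in a merge, so it stays a single-char token.
    return [(s, s in vocab) for s in symbols]

# Hypothetical learned state: two merges and the vocabulary they produce.
merges = [("t", "h"), ("th", "e")]
vocab = {"e", "h", "n", "o", "r", "t", " ", "th", "the"}

tokens = bpe_encode("the zen", merges, vocab)
print(tokens)
# [('the', True), (' ', True), ('z', False), ('e', True), ('n', True)]
```

Here z was never seen during training, so it comes out as a single-character token with `known` set to False, the case the lab colors orange.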
Each token ID indexes a row of the embedding matrix W_E, a vector of D floats. Untrained, those rows are random noise in a D-dimensional space. Trained, similar meanings end up near each other. We project to 2D so you can see it.
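The lookup itself is plain array indexing. The NumPy sketch below assumes a toy vocabulary of 8 tokens and D = 4; `W_E` stands for the embedding matrix, and the sizes and token IDs are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, D = 8, 4

# Untrained embedding matrix: one random D-dimensional row per token ID.
W_E = rng.normal(size=(vocab_size, D))

token_ids = np.array([6, 7, 2])  # hypothetical IDs from the tokenizer
X = W_E[token_ids]               # shape (3, D): one row per token

# The same lookup written as a one-hot matrix product.
one_hot = np.eye(vocab_size)[token_ids]
assert np.allclose(one_hot @ W_E, X)
```

Indexing and the one-hot product give the same result; indexing just skips the multiplications by zero.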
W_E[id], a row of the embedding matrix: random at init, organized by training. Top-2 PCA finds the plane along which embeddings vary the most, so we can show a D-dimensional space as a flat scatter.
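A top-2 PCA projection can be sketched with a singular value decomposition. This is one common way to compute it and may differ from what the lab does internally; `pca_2d` is a hypothetical helper and the random matrix `E` stands in for real embeddings.

```python
import numpy as np

def pca_2d(E):
    """Project the rows of E, shape (n, D), onto their top-2 principal components."""
    centered = E - E.mean(axis=0)  # PCA works on mean-centered data
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:2].T     # shape (n, 2)

rng = np.random.default_rng(0)
E = rng.normal(size=(50, 16))      # stand-in for 50 embeddings with D = 16
points = pca_2d(E)
print(points.shape)  # (50, 2)
```

The first output column carries at least as much variance as the second, so the scatter's horizontal axis is the single direction along which the embeddings spread the most.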
QK^T attention works on whatever vectors are in the table.
Q, K, V projections and a softmax over QK^T. That's where the model starts using the geometry.
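As a rough preview, the computation named here fits in a few lines of NumPy. This sketch covers only what the text mentions, projections followed by a softmax over QK^T; it leaves out the 1/sqrt(d) scaling and the causal mask that standard transformers also apply. The function name, the weight matrices, and the sizes are all illustrative assumptions.

```python
import numpy as np

def attention(X, W_Q, W_K, W_V):
    """Single-head attention over token vectors X, shape (n, D); unscaled and unmasked."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T                                      # (n, n) similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, D, d_head = 3, 4, 2
X = rng.normal(size=(n, D))  # stand-in for rows looked up from the embedding table
W_Q, W_K, W_V = (rng.normal(size=(D, d_head)) for _ in range(3))

out, weights = attention(X, W_Q, W_K, W_V)
print(out.shape, weights.shape)  # (3, 2) (3, 3)
```

Each row of `weights` sums to 1 and says how much one token draws from every other token, which is why the geometry of the table matters.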