RAG · Retrieval-Augmented Generation

Cramming world knowledge into model weights is expensive and slow to update, and the knowledge goes stale. RAG sidesteps the problem: embed your documents, embed the query, fetch the most similar docs at runtime, and paste them into the prompt. The model now answers with the docs in its context window: same weights, fresh facts.

Knowledge base + cosine retrieval: top-k by cos(q, d) = (q · d) / (‖q‖ ‖d‖)

Each document gets a vector via the same embedding scheme as Lab 20 (cluster-tinted average of word vectors). The query is embedded the same way. Cosine similarity ranks docs; the top-k feed the prompt.
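The ranking step described above can be sketched in a few lines. This is a minimal illustration, not Lab 20's actual scheme: the WORD_VECS table is invented for the example, embed uses a plain average with no cluster tinting, and the names embed, cosine, top_k, and DOCS are all hypothetical.

```python
import math

# Hypothetical 3-d word vectors, invented for illustration. They stand in for
# Lab 20's cluster-tinted vectors, which are not reproduced here.
WORD_VECS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "pet": [0.7, 0.3, 0.1],
    "stock": [0.0, 0.9, 0.1],
    "market": [0.1, 0.8, 0.2],
    "rain": [0.0, 0.1, 0.9],
    "storm": [0.1, 0.0, 0.8],
}


def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vecs = [WORD_VECS[w] for w in text.lower().split() if w in WORD_VECS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, docs, k=2):
    """Return the k (score, doc) pairs with the highest cosine similarity."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


DOCS = [
    "the cat and the dog",
    "stock market news",
    "rain and storm warning",
]

for score, doc in top_k("pet cat", DOCS, k=2):
    print(f"{score:.3f}  {doc}")
```

With these made-up vectors, the query "pet cat" ranks the cat-and-dog document first, because its averaged vector points in nearly the same direction as the query's.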

Augmented prompt — what actually goes to the LLM

The retrieved docs get spliced into the prompt template before the user query. The model then answers conditioned on real, current information — no need for it to "know" the facts in its weights.
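A minimal sketch of that splicing step follows. The template wording, the numbered [i] labels, and the function name build_prompt are assumptions made for illustration; the lab's actual template is not shown in the text.

```python
def build_prompt(query, retrieved_docs):
    """Splice retrieved docs into a prompt, placing them before the user query."""
    # Number each doc so an answer could refer back to its source.
    context = "\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


print(build_prompt(
    "When does the library open?",
    ["The library opens at 9am on weekdays.", "The cafe closes at 5pm."],
))
```

The model's weights never change; only the string it is conditioned on does.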

Output panels (interactive demo): base model alone (no retrieval) vs. with RAG context (retrieved docs in prompt).

How the embedding lookup actually works. At index time you embed every document once with an embedding model (OpenAI's text-embedding-3-small, Voyage AI's embedding models, sentence-transformers, etc.) and store the vectors in a vector index or database (FAISS, Pinecone, pgvector, LanceDB). At query time you embed the query with the same model and compute cosine similarity against every doc vector. For big corpora, an approximate-nearest-neighbour index (HNSW, IVF) replaces the exhaustive scan and typically answers in milliseconds or less across millions of docs, at the cost of occasionally missing a true nearest neighbour.
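The index-once, query-many pattern can be sketched with an exact brute-force index in plain Python. This is an assumption-laden toy: BruteForceIndex, normalize, and letter_counts are hypothetical names, the letter-count "embedding" exists only to make the demo runnable, and none of this mirrors the API of FAISS, Pinecone, pgvector, or LanceDB.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class BruteForceIndex:
    """Exact search: embeds each doc once, then scores every doc per query."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # caller-supplied embedding function (assumption)
        self.docs = []
        self.vectors = []  # unit-length vectors, so dot product equals cosine

    def add(self, doc):
        # Index time: embed once and store the normalized vector.
        self.docs.append(doc)
        self.vectors.append(normalize(self.embed_fn(doc)))

    def search(self, query, k=3):
        # Query time: embed the query, then scan every stored vector.
        q = normalize(self.embed_fn(query))
        scores = (
            (sum(a * b for a, b in zip(q, v)), i)
            for i, v in enumerate(self.vectors)
        )
        best = heapq.nlargest(k, scores)
        return [(self.docs[i], score) for score, i in best]


def letter_counts(text):
    """Stand-in embedding for the demo: counts of the letters a-z."""
    text = text.lower()
    return [float(text.count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]


index = BruteForceIndex(letter_counts)
for doc in ["aaa bbb", "xyz xyz", "abab"]:
    index.add(doc)
print(index.search("ab", k=2))
```

The scan costs one dot product per stored document, which is why large corpora switch to approximate indexes such as HNSW or IVF instead of this exhaustive loop.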

What this lab simplifies. Real production RAG adds several layers: chunking (break long docs into ~500-token windows so retrieved pieces fit in context), hybrid search (combining cosine similarity with BM25 keyword scores usually beats either alone), reranking (a second model rescores the top 20 and keeps the top 3), and citations (the answer states which doc supports each claim). The cluster-based embeddings used here are a teaching toy; real embedding models are trained on up to billions of text pairs and capture far richer semantics.
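Two of those layers are easy to sketch. The code below is illustrative only and rests on stated assumptions: chunk_words counts whitespace-separated words as a rough stand-in for tokens, and the overlap between windows is an added convenience the text does not mention. For hybrid search it uses reciprocal rank fusion, one common way to merge a cosine ranking with a BM25 ranking; the text does not say which fusion method production systems use, and the function names are hypothetical.

```python
def chunk_words(text, size=500, overlap=50):
    """Split text into windows of `size` words, each sharing `overlap` words
    with the previous window. Words only approximate model tokens here."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids into one list, best first.

    Each doc scores the sum of 1 / (k + rank) over the lists it appears in;
    k=60 is the constant commonly used with this method.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


text = " ".join(str(i) for i in range(10))
print(chunk_words(text, size=4, overlap=1))

cosine_ranking = ["a", "b", "c"]  # hypothetical result of vector search
bm25_ranking = ["b", "c", "a"]    # hypothetical result of keyword search
print(reciprocal_rank_fusion([cosine_ranking, bm25_ranking]))
```

Rank fusion needs only the order of each list, not the raw scores, which avoids having to put cosine similarities and BM25 scores on a common scale.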

When RAG works and when it doesn't. RAG shines on factual lookup over a known corpus (your docs, your codebase, your wiki). It struggles on multi-hop reasoning that requires synthesizing across many chunks, on questions where the right doc doesn't exist, and on queries the embedding model can't match to docs (synonyms, abstract questions). The characteristic failure mode: the model answers confidently from whatever was retrieved, even when the retrieval was wrong.
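One partial guard for the "right doc doesn't exist" case is to drop weak matches before building the prompt. The sketch below is an assumption, not part of the lab: select_context is a hypothetical helper, and the 0.5 threshold is arbitrary and would need tuning per embedding model. It does nothing for a retrieval that is wrong but still scores highly.

```python
def select_context(scored_docs, min_score=0.5, k=3):
    """Keep at most k docs whose similarity clears min_score.

    scored_docs: iterable of (doc, score) pairs.
    Returns an empty list when nothing is similar enough, so the caller can
    answer "not found" instead of prompting with irrelevant text.
    """
    kept = [(doc, score) for doc, score in scored_docs if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept[:k]]


# Hypothetical retrieval scores for two different queries.
print(select_context([("refund policy", 0.82), ("office hours", 0.31)]))
print(select_context([("refund policy", 0.12), ("office hours", 0.09)]))
```

An empty result is a signal to decline or fall back, rather than to paste the least-bad documents into the prompt.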

That closes the arc. Lab 20 built the embeddings, Lab 21 the attention, Lab 22 the reasoning, Lab 23 the failure modes and adaptation, Labs 24–26 the missing pieces (position, multi-head, sampling). RAG ties it all together: same model, same context window, smarter input.