Cognitive Orchestration

Free preview · Lesson 3

Retrieval-Augmented Generation (RAG)

A model only knows what it was trained on — frozen at a cutoff date, with no access to your private documents. Retrieval-Augmented Generation, introduced by Lewis et al. (2020), fixes this by fetching relevant text at query time and feeding it into the prompt.

What you'll take away
  • Trace the RAG pipeline: embed the question, retrieve from a vector store, augment the prompt, generate.
  • Recognise what RAG buys — freshness, grounding, privacy — and where its limits begin.

How it works

The pattern is a short pipeline: embed the user's question into a vector, search a vector store of your documents for the closest passages, paste those passages into the prompt, and let the model answer from the retrieved evidence.
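
The four steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the corpus is three invented sentences, embed is a toy word-count vector standing in for a real embedding model, the "vector store" is a plain list searched by cosine similarity, and generate is a placeholder where a model call would go. All function names here are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical corpus standing in for "your documents" (invented text).
DOCS = [
    "Refunds are accepted within 30 days of purchase.",
    "Standard shipping takes 3 to 5 business days.",
    "Support is available by email on weekdays.",
]

def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, docs, k=2):
    """Step 1 and 2: embed the question, return the k closest passages."""
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Step 3: paste the retrieved passages into the prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

def generate(prompt):
    """Step 4 placeholder: returns the prompt so the sketch runs offline."""
    return prompt

question = "How long does shipping take?"
print(generate(build_prompt(question, retrieve(question, DOCS, k=1))))
```

Swapping embed for a learned embedding and the list for an indexed vector store changes the quality and speed of retrieval, but not the shape of the pipeline.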

The classic RAG pipeline. The model's parametric memory is augmented with non-parametric memory you control.

This buys three things at once: freshness (answers draw on today's data rather than the training snapshot), grounding (answers can cite real sources, which reduces hallucination), and privacy/scope (your documents stay in a store you control, and the model reasons over your corpus at query time without being trained on it). Surveys such as Gao et al. (2023) trace how the basic pattern matured into reranking, query rewriting, and self-correcting variants like Self-RAG. When relationships between facts matter, not just isolated passages, GraphRAG retrieves over a knowledge graph instead of a flat index.

Example. A support bot that answers "What's our refund window?" should not guess from training data. With RAG it retrieves your current policy doc and answers from it, with a citation, so a wrong answer can be traced back to the passage it came from instead of being an invention with no trail.
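
One way to picture the citation step is to keep a source identifier attached to every passage and carry it into the prompt. The sketch below assumes a hypothetical two-file policy corpus (file names and policy text are invented for illustration) and uses simple word overlap in place of a real retriever; the model call itself is left out.

```python
import re

# Hypothetical policy snippets; file names and text are invented for illustration.
POLICY_DOCS = {
    "refund-policy.md": "Our refund window is 30 days from the delivery date.",
    "shipping-policy.md": "Orders ship within 2 business days.",
}

def tokens(text):
    """Lowercased set of words and numbers in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve_with_source(question, docs):
    """Return the (source_id, passage) pair with the highest Jaccard word overlap."""
    q = tokens(question)

    def score(item):
        t = tokens(item[1])
        return len(q & t) / len(q | t)

    return max(docs.items(), key=score)

def cited_prompt(question, source_id, passage):
    """Build a prompt that tags the passage with its source so the answer can cite it."""
    return (
        "Answer from the source below and cite it in brackets.\n"
        f"[{source_id}] {passage}\n"
        f"Question: {question}"
    )

question = "What's our refund window?"
source_id, passage = retrieve_with_source(question, POLICY_DOCS)
print(cited_prompt(question, source_id, passage))
```

Because the source identifier travels with the passage, anyone reviewing a bad answer can open the cited file and check whether the policy text or the model's reading of it was at fault.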

The limit

RAG answers what to put in the context window. It does not answer how much, in what order, or what to drop when the window fills. Those are the questions context engineering takes up next.

References & further reading