Module 11: RAG Engineering
Retrieval-Augmented Generation (RAG) is one of the most widely deployed patterns in production AI systems today. It solves a fundamental problem: a language model knows only what it saw during training, and nothing about your documents, your customers, or anything that happened after its knowledge cutoff. RAG gives the model access to your data at query time, without retraining and at a fraction of the cost.
This module covers the full RAG engineering stack: from the theory of why RAG exists, through every layer of the pipeline, to production evaluation and agentic extensions.
What You'll Learn
Lessons
| # | Lesson | Key Concepts |
|---|---|---|
| 01 | Why RAG Exists | LLM knowledge limits, hallucination, RAG vs. fine-tuning vs. prompt stuffing, founding paper |
| 02 | Document Ingestion and Chunking | Fixed-size, semantic, structure-aware chunking; PDF/DOCX parsing; metadata enrichment |
| 03 | Embedding Models in Production | Dense embeddings, model selection, fine-tuning embeddings, embedding pipelines, dimensionality |
| 04 | Vector Search in Practice | FAISS, pgvector, Pinecone, Weaviate; ANN algorithms; index tuning; exact vs. approximate search |
| 05 | Hybrid Search and Reranking | BM25 sparse retrieval, reciprocal rank fusion, cross-encoder reranking, Cohere Rerank |
| 06 | Query Transformation and HyDE | HyDE, multi-query expansion, step-back prompting, query decomposition for complex questions |
| 07 | Agentic RAG | Self-correcting retrieval loops, CRAG, multi-hop reasoning, tool-augmented retrieval |
| 08 | RAG Evaluation with RAGAS | Faithfulness, answer relevancy, context precision/recall; building golden datasets; CI eval |
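As a small taste of what Lesson 05 covers, here is a minimal sketch of reciprocal rank fusion, which merges a sparse (BM25) ranking and a dense (embedding) ranking into one list. The document IDs are made up for illustration, and `k=60` is a commonly used default rather than a value this module prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in;
    documents that rank well in several lists rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from two retrievers.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Because the fusion uses only ranks, it needs no score normalization between the two retrievers, which is one reason it is a popular first choice for hybrid search.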
Prerequisites
- Familiarity with LLM APIs and the Anthropic SDK
- Python async programming basics
- Basic understanding of embeddings and vector similarity (helpful but not required)
The Core Challenge of Production RAG
A RAG system that scores 85% in offline evaluation routinely fails users in ways your test set never predicted. The gap between benchmark performance and production quality is wider in RAG than in almost any other AI pattern - because RAG has more failure modes, stacked on top of each other.
The retrieval can fail (wrong chunks returned). The chunks can be malformed (tables split mid-row, code split mid-function). The context can be too long or too short. The generator can ignore the retrieved context and hallucinate anyway. And unlike a classification model, where you can measure accuracy on a holdout set, RAG failure is often invisible: the model confidently synthesizes, from partially correct chunks, an answer that contains subtle factual errors.
This module teaches you to engineer every layer of the stack so failures are visible, measurable, and systematically reduced.
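One way to make the retrieval layer measurable is to score it separately from generation. The sketch below computes precision@k and recall@k against a hand-labelled set of relevant chunk IDs. The function names and chunk IDs are hypothetical, and the RAGAS metrics covered in Lesson 08 are defined differently; this only illustrates the idea of a per-layer metric.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are labelled relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    return hits / len(top_k)


def recall_at_k(retrieved, relevant, k):
    """Fraction of the labelled-relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


# Hypothetical retriever output and golden labels for one query.
retrieved = ["c1", "c7", "c3", "c9"]
relevant = {"c1", "c3", "c5"}

p = precision_at_k(retrieved, relevant, k=3)  # 2 of the top 3 are relevant
r = recall_at_k(retrieved, relevant, k=3)     # 2 of the 3 relevant chunks found
print(f"precision@3={p:.2f} recall@3={r:.2f}")
```

Tracking these per query lets you tell a retrieval failure (low recall) apart from a generation failure (good recall, wrong answer).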
The RAG Stack at a Glance
Common RAG Failure Modes
| Layer | Failure | Symptom |
|---|---|---|
| Parsing | Table extracted as garbled text | Answers about tabular data are wrong |
| Chunking | Sentence split mid-entity | Retrieval misses the relevant chunk |
| Embedding | Wrong model for domain | Similar documents ranked far apart |
| Retrieval | Low precision top-k | Irrelevant context passed to generator |
| Reranking | Skipped to save latency | Precision drops 20-30% |
| Generation | Model ignores context | Hallucinations despite correct retrieval |
| Evaluation | Only measures offline accuracy | Production failures are invisible |
The lessons in this module work through this table layer by layer.
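The chunking row can be reproduced in a few lines. The sketch below compares naive fixed-size chunking with a simple sentence-aware chunker on a made-up passage. The regex sentence splitter is an assumption chosen for brevity; a production pipeline would likely use a proper sentence tokenizer or a structure-aware parser.

```python
import re


def fixed_size_chunks(text, size):
    """Split text every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text, max_chars):
    """Pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Made-up passage for illustration.
text = "Acme Corp was founded in 1999. It is based in New York City. It sells widgets."

naive = fixed_size_chunks(text, 50)
aware = sentence_chunks(text, 50)

print(naive)  # the boundary falls inside "New York City"
print(aware)  # every chunk ends on a sentence boundary
```

With the fixed-size splitter, no single chunk contains the full entity "New York City", so a query about the company's location has nothing complete to match against.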
