Module 04: RAG Systems
Retrieval-Augmented Generation (RAG) is the architectural pattern that helped move LLMs from impressive demos toward production-grade knowledge systems. Instead of relying only on a model's frozen parametric memory, a RAG system retrieves relevant context at inference time - giving it access to up-to-date, verifiable, domain-specific knowledge without retraining.
This module covers the full RAG engineering stack: from understanding why RAG exists and when to skip it, through chunking, embedding, vector storage, retrieval algorithms, reranking, hybrid search, evaluation, and advanced patterns including Graph RAG and Agentic RAG.
The Full RAG Pipeline
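As a rough orientation, the stages this module covers (chunking, embedding, indexing, retrieval, prompt assembly) can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not a production design: `embed` is a bag-of-words counter standing in for a trained embedding model, the index is a plain Python list rather than a vector database, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 12, overlap: int = 3) -> list[str]:
    """Split text into fixed-size word windows that overlap (requires overlap < size)."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the current window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: term counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Exact (brute-force) top-k search; ANN indexes approximate this step."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, contexts: list[str]) -> str:
    """Assemble the retrieved chunks and the question into one prompt string."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


documents = [
    "HNSW builds a layered graph so that nearest neighbor search visits few nodes.",
    "BM25 scores documents by keyword overlap with the query.",
    "Chunking splits long documents into smaller pieces before embedding.",
]

# Indexing: chunk every document, then store (chunk text, vector) pairs.
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]

question = "How does HNSW graph search work?"
prompt = build_prompt(question, retrieve(question, index))
print(prompt)  # in a real pipeline this prompt would be sent to an LLM
```

Each function corresponds to a component covered in a later lesson: `chunk` to chunking strategies, `embed` to embedding models, the list-based `index` and `retrieve` to vector databases and ANN search. Reranking and hybrid search would slot in between `retrieve` and `build_prompt`.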
Lessons in This Module
| # | Lesson | What You'll Learn |
|---|---|---|
| 01 | Why RAG and When Not To | Hallucination problem, RAG vs fine-tuning decision framework, cost analysis |
| 02 | Document Chunking Strategies | Fixed, semantic, recursive, parent-child chunking; overlap strategies |
| 03 | Embedding Models Deep Dive | MTEB benchmark, E5/BGE/GTE families, Matryoshka embeddings, fine-tuning |
| 04 | Vector Databases | Pinecone, Qdrant, Weaviate, Milvus, pgvector - comparisons and trade-offs |
| 05 | Retrieval Algorithms & ANN | HNSW, IVF, IVF-PQ, DiskANN - how approximate nearest neighbor works |
| 06 | Reranking | Cross-encoders, ColBERT, LLM-as-reranker, RRF, latency budgets |
| 07 | Hybrid Search (Dense + Sparse) | BM25, SPLADE, fusion methods - when hybrid beats pure dense |
| 08 | RAG Evaluation | RAGAS framework, faithfulness, context precision/recall, golden datasets |
| 09 | Advanced RAG Patterns | HyDE, multi-query, Self-RAG, CRAG, iterative retrieval |
| 10 | Graph RAG | Microsoft Graph RAG, entity extraction, community detection, global queries |
| 11 | Agentic RAG | Retrieval as a tool, multi-step reasoning, LangGraph RAG agents |
Prerequisites
Before starting this module, you should be comfortable with:
- Transformer architecture - attention mechanisms, encoder models (covered in Module 01)
- Embeddings basics - what a vector embedding represents, cosine similarity (Module 02)
- LLM prompting - context windows, system prompts, how models use in-context information (Module 03)
- Python async - many RAG libraries use async patterns for concurrent retrieval
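
For the async prerequisite, the pattern worth recognizing is fanning out several retrieval calls concurrently and collecting the results. The following is a minimal sketch using only the standard library; `search_source` is a hypothetical stand-in for a network call to a vector store or search API, and the delays are made-up values that simulate latency.

```python
import asyncio


async def search_source(name: str, query: str, delay: float) -> list[str]:
    """Hypothetical retriever: sleeps to simulate I/O, then returns one fake hit."""
    await asyncio.sleep(delay)
    return [f"{name}: result for {query!r}"]


async def retrieve_all(query: str) -> list[str]:
    """Query two sources concurrently; total wait is roughly the slowest call."""
    results = await asyncio.gather(
        search_source("dense", query, 0.02),
        search_source("sparse", query, 0.01),
    )
    # gather returns results in the order the awaitables were passed in,
    # regardless of which call finished first.
    return [hit for hits in results for hit in hits]


if __name__ == "__main__":
    print(asyncio.run(retrieve_all("what is HNSW?")))
```

If the `await` and `asyncio.gather` calls here read naturally, the async code in later lessons should not be an obstacle.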
Key Concepts Glossary
| Term | Definition |
|---|---|
| RAG | Retrieval-Augmented Generation - augmenting LLM generation with retrieved context |
| Chunking | Splitting documents into smaller pieces suitable for embedding and retrieval |
| Embedding | Dense vector representation of text, capturing semantic meaning |
| ANN | Approximate Nearest Neighbor - fast but approximate vector similarity search |
| HNSW | Hierarchical Navigable Small World - the dominant ANN graph algorithm |
| Reranking | Second-pass scoring of retrieved candidates using a more expensive model, typically a cross-encoder |
| BM25 | Best Match 25 - classic sparse retrieval algorithm, the baseline for keyword search |
| Hybrid Search | Combining dense (semantic) and sparse (keyword) retrieval signals |
| RAGAS | RAG Assessment - automated evaluation framework measuring faithfulness and relevance |
| HyDE | Hypothetical Document Embeddings - embed a hypothetical answer to improve retrieval |
| Graph RAG | RAG over knowledge graphs - handles entity relationships and multi-hop queries |
| Agentic RAG | RAG where an agent controls retrieval - iterative, multi-source, adaptive |
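
To make the hybrid search entry concrete, here is a small sketch of Reciprocal Rank Fusion (RRF), one of the fusion methods named in the lesson list. Each document scores the sum of `1 / (k + rank)` over the rankings it appears in. The document IDs and the two input rankings are invented for illustration, and `k = 60` is a commonly used default rather than a value taken from this module.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher-ranked documents (small rank) contribute larger scores.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical outputs of a dense (semantic) and a sparse (keyword) retriever.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF uses only rank positions, not raw scores, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.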
:::tip Start Here
If you're new to RAG, read lessons 01 through 05 in order - they build on each other. Lessons 06 onward can be read independently once you have the foundation.
:::

:::note Production Focus
Every lesson includes production engineering notes, common failure modes, and real cost/latency analysis. This is not a tutorial module - it's an engineering reference for building RAG systems that survive contact with production traffic.
:::
