Module 04: RAG Systems

Retrieval-Augmented Generation is the architectural pattern that transformed LLMs from impressive demos into production-grade knowledge systems. Instead of relying on a model's frozen parametric memory, RAG dynamically retrieves relevant context at inference time - giving your system access to up-to-date, verifiable, domain-specific knowledge without retraining.

This module covers the full RAG engineering stack: from understanding why RAG exists and when to skip it, through chunking, embedding, vector storage, retrieval algorithms, reranking, hybrid search, evaluation, and advanced patterns including Graph RAG and Agentic RAG.

The Full RAG Pipeline
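
As a rough illustration of the core stages this module covers (chunking, embedding, retrieval, and prompt assembly), here is a minimal sketch in plain Python. It assumes a toy term-frequency "embedding" in place of a real embedding model and a linear scan in place of a vector database. All function names (`chunk_text`, `embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Fixed-size chunking over whitespace tokens, with overlapping windows."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
        start += size - overlap
    return chunks


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Exact (brute-force) top-k retrieval; a vector database would use ANN."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the retrieved chunks and the question into one LLM prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


if __name__ == "__main__":
    corpus = [
        "HNSW is a graph index for approximate nearest neighbor search",
        "BM25 is a sparse keyword ranking function",
        "Chunking splits documents into smaller pieces",
    ]
    question = "what is BM25 keyword ranking"
    print(build_prompt(question, retrieve(question, corpus, k=1)))
```

In a real system, each stage would be swapped for the components covered in the lessons: a trained embedding model, an ANN index, and typically a reranking step before prompt assembly.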

Lessons in This Module

#	Lesson	What You'll Learn
01	Why RAG and When Not To	Hallucination problem, RAG vs fine-tuning decision framework, cost analysis
02	Document Chunking Strategies	Fixed, semantic, recursive, parent-child chunking; overlap strategies
03	Embedding Models Deep Dive	MTEB benchmark, E5/BGE/GTE families, Matryoshka embeddings, fine-tuning
04	Vector Databases	Pinecone, Qdrant, Weaviate, Milvus, pgvector - comparisons and trade-offs
05	Retrieval Algorithms & ANN	HNSW, IVF, IVF-PQ, DiskANN - how approximate nearest neighbor works
06	Reranking	Cross-encoders, ColBERT, LLM-as-reranker, RRF, latency budgets
07	Hybrid Search (Dense + Sparse)	BM25, SPLADE, fusion methods - when hybrid beats pure dense
08	RAG Evaluation	RAGAS framework, faithfulness, context precision/recall, golden datasets
09	Advanced RAG Patterns	HyDE, multi-query, Self-RAG, CRAG, iterative retrieval
10	Graph RAG	Microsoft Graph RAG, entity extraction, community detection, global queries
11	Agentic RAG	Retrieval as a tool, multi-step reasoning, LangGraph RAG agents

Prerequisites

Before starting this module, you should be comfortable with:

Transformer architecture - attention mechanisms, encoder models (covered in Module 01)
Embeddings basics - what a vector embedding represents, cosine similarity (Module 02)
LLM prompting - context windows, system prompts, how models use in-context information (Module 03)
Python async - many RAG libraries use async patterns for concurrent retrieval

Key Concepts Glossary

Term	Definition
RAG	Retrieval-Augmented Generation - augmenting LLM generation with retrieved context
Chunking	Splitting documents into smaller pieces suitable for embedding and retrieval
Embedding	Dense vector representation of text, capturing semantic meaning
ANN	Approximate Nearest Neighbor - fast but approximate vector similarity search
HNSW	Hierarchical Navigable Small World - the dominant ANN graph algorithm
Reranking	Second-pass scoring of retrieved candidates using a more expensive model, typically a cross-encoder
BM25	Best Matching 25 - classic sparse retrieval algorithm, the baseline for keyword search
Hybrid Search	Combining dense (semantic) and sparse (keyword) retrieval signals
RAGAS	RAG Assessment - automated evaluation framework measuring faithfulness and relevance
HyDE	Hypothetical Document Embeddings - embed an LLM-generated hypothetical answer and search with that embedding instead of the raw query
Graph RAG	RAG over knowledge graphs - handles entity relationships and multi-hop queries
Agentic RAG	RAG where an agent controls retrieval - iterative, multi-source, adaptive
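
The Hybrid Search entry above describes combining dense and sparse retrieval signals. One common way to do this is reciprocal rank fusion (RRF), which appears in the lesson list under Lesson 06. The sketch below assumes each retriever returns an ordered list of document IDs, and it uses the conventional smoothing constant `k = 60`. The function name `reciprocal_rank_fusion` and the document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    dense_hits = ["d1", "d2", "d3"]   # e.g. from an embedding search
    sparse_hits = ["d3", "d1", "d4"]  # e.g. from BM25
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

RRF uses only rank positions, so it needs no score normalization between the dense and sparse retrievers. The trade-off is that it discards how confident each retriever was about a given hit.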

:::tip Start Here
If you're new to RAG, read lessons 01 through 05 in order - they build on each other. Lessons 06 onward can be read independently once you have the foundation.
:::

:::note Production Focus
Every lesson includes production engineering notes, common failure modes, and real cost/latency analysis. This is not a tutorial module - it's an engineering reference for building RAG systems that survive contact with production traffic.
:::
