
Module 04: RAG Systems

Retrieval-Augmented Generation is the architectural pattern that transformed LLMs from impressive demos into production-grade knowledge systems. Instead of relying on a model's frozen parametric memory, RAG dynamically retrieves relevant context at inference time - giving your system access to up-to-date, verifiable, domain-specific knowledge without retraining.

This module covers the full RAG engineering stack: from understanding why RAG exists and when to skip it, through chunking, embedding, vector storage, retrieval algorithms, reranking, hybrid search, evaluation, and advanced patterns including Graph RAG and Agentic RAG.

The Full RAG Pipeline
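
As a rough orientation, the stages this module covers (chunk, embed, index, retrieve, assemble the prompt) can be sketched in a few lines of plain Python. Everything here is a toy assumption for illustration: `embed` is a bag-of-words stand-in for a real embedding model, the "index" is an in-memory list rather than a vector database, and the function names, `DOCUMENT` text, and prompt wording are hypothetical. The generation step is left out; the sketch stops at the prompt that would be sent to an LLM.

```python
import math
import re
from collections import Counter


def chunk(text: str) -> list[str]:
    # Naive chunking: one chunk per sentence (lesson 02 covers better strategies).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: bag-of-words term counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    # Exact (brute-force) search; real systems use ANN indexes such as HNSW.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, contexts: list[str]) -> str:
    # Assemble retrieved chunks into the context the LLM will see.
    joined = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


DOCUMENT = (
    "HNSW is a graph-based index for approximate nearest neighbor search. "
    "BM25 is a sparse keyword ranking function. "
    "Rerankers rescore a short candidate list with a more expensive model."
)

index = [(c, embed(c)) for c in chunk(DOCUMENT)]
top = retrieve("What is BM25?", index, k=1)
prompt = build_prompt("What is BM25?", top)
print(prompt)
```

Each function maps to one or more lessons: swapping `chunk`, `embed`, or `retrieve` for a production component is the subject of lessons 02 through 05.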

Lessons in This Module

| # | Lesson | What You'll Learn |
|---|--------|-------------------|
| 01 | Why RAG and When Not To | Hallucination problem, RAG vs fine-tuning decision framework, cost analysis |
| 02 | Document Chunking Strategies | Fixed, semantic, recursive, parent-child chunking; overlap strategies |
| 03 | Embedding Models Deep Dive | MTEB benchmark, E5/BGE/GTE families, Matryoshka embeddings, fine-tuning |
| 04 | Vector Databases | Pinecone, Qdrant, Weaviate, Milvus, pgvector - comparisons and trade-offs |
| 05 | Retrieval Algorithms & ANN | HNSW, IVF, IVF-PQ, DiskANN - how approximate nearest neighbor works |
| 06 | Reranking | Cross-encoders, ColBERT, LLM-as-reranker, RRF, latency budgets |
| 07 | Hybrid Search (Dense + Sparse) | BM25, SPLADE, fusion methods - when hybrid beats pure dense |
| 08 | RAG Evaluation | RAGAS framework, faithfulness, context precision/recall, golden datasets |
| 09 | Advanced RAG Patterns | HyDE, multi-query, Self-RAG, CRAG, iterative retrieval |
| 10 | Graph RAG | Microsoft Graph RAG, entity extraction, community detection, global queries |
| 11 | Agentic RAG | Retrieval as a tool, multi-step reasoning, LangGraph RAG agents |

Prerequisites

Before starting this module, you should be comfortable with:

  • Transformer architecture - attention mechanisms, encoder models (covered in Module 01)
  • Embeddings basics - what a vector embedding represents, cosine similarity (Module 02)
  • LLM prompting - context windows, system prompts, how models use in-context information (Module 03)
  • Python async - many RAG libraries use async patterns for concurrent retrieval
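
As a quick self-check for the async prerequisite, here is a minimal sketch of concurrent retrieval with `asyncio.gather`. The `search_source` coroutine, the source names, and the delays are hypothetical placeholders; a real retriever would call a vector database or search API instead of sleeping.

```python
import asyncio


async def search_source(name: str, query: str, delay: float) -> list[str]:
    # Hypothetical retriever: the sleep simulates network latency.
    await asyncio.sleep(delay)
    return [f"{name}: result for '{query}'"]


async def retrieve_all(query: str) -> list[str]:
    # gather() runs both searches concurrently and returns results
    # in argument order, regardless of which finishes first.
    batches = await asyncio.gather(
        search_source("vector_store", query, 0.02),
        search_source("keyword_index", query, 0.01),
    )
    return [hit for batch in batches for hit in batch]


results = asyncio.run(retrieve_all("hnsw"))
print(results)
```

If the ordering of `results` and the total wait time (roughly the slower of the two delays, not their sum) make sense to you, the async patterns used later in the module should be familiar.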

Key Concepts Glossary

| Term | Definition |
|------|------------|
| RAG | Retrieval-Augmented Generation - augmenting LLM generation with retrieved context |
| Chunking | Splitting documents into smaller pieces suitable for embedding and retrieval |
| Embedding | Dense vector representation of text, capturing semantic meaning |
| ANN | Approximate Nearest Neighbor - vector similarity search that trades a small amount of recall for speed |
| HNSW | Hierarchical Navigable Small World - the dominant graph-based ANN algorithm |
| Reranking | Second-pass scoring of retrieved candidates using a more expensive model, typically a cross-encoder |
| BM25 | Best Matching 25 - classic sparse retrieval algorithm, the baseline for keyword search |
| Hybrid Search | Combining dense (semantic) and sparse (keyword) retrieval signals |
| RAGAS | RAG Assessment - automated evaluation framework measuring faithfulness and relevance |
| HyDE | Hypothetical Document Embeddings - embed a hypothetical answer to improve retrieval |
| Graph RAG | RAG over knowledge graphs - handles entity relationships and multi-hop queries |
| Agentic RAG | RAG where an agent controls retrieval - iterative, multi-source, adaptive |
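
To make the Hybrid Search entry concrete, one common way to combine a dense ranking and a sparse ranking is Reciprocal Rank Fusion (the RRF listed under lesson 06). The sketch below assumes each retriever returns an ordered list of document IDs; the IDs, the function name `rrf`, and the alphabetical tie-break are illustrative, and `k=60` is the conventional default rather than a tuned value.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document earns 1 / (k + rank) from every ranking it appears in.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["d3", "d1", "d2"]   # hypothetical semantic (embedding) ranking
sparse = ["d1", "d4", "d3"]  # hypothetical keyword (BM25) ranking

fused = rrf([dense, sparse])
print(fused)
```

RRF uses only ranks, not raw scores, so it avoids having to normalize cosine similarities against BM25 scores, which live on different scales.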

:::tip Start Here
If you're new to RAG, read lessons 01 through 05 in order - they build on each other. Lessons 06 onward can be read independently once you have the foundation.
:::

:::note Production Focus
Every lesson includes production engineering notes, common failure modes, and real cost/latency analysis. This is not a tutorial module - it's an engineering reference for building RAG systems that survive contact with production traffic.
:::

© 2026 EngineersOfAI. All rights reserved.