Module 02: Pretraining and Fine-tuning

What This Module Covers

Most LLMs you use today, including GPT-4, Claude, Gemini, and LLaMA, went through broadly the same pipeline: pretraining on massive text corpora, fine-tuning to follow instructions, and alignment with human preferences. This module takes you deep into each stage of that pipeline.

You will learn why different training objectives produce different model capabilities, how the industry moved from full fine-tuning to parameter-efficient methods like LoRA, why RLHF was a turning point for instruction-following models, and why DPO is increasingly used in its place. By the end of this module, you will be able to make informed decisions about which training approach fits your use case.

The Full Pipeline

Lessons in This Module

#	Lesson	What You Will Learn
01	Language Modeling Objectives	MLM vs CLM, cross-entropy loss, perplexity, why objective choice matters
02	Masked Language Modeling (BERT)	The 15% masking trick, BERT architecture, NSP debate, RoBERTa improvements
03	Causal Language Modeling (GPT)	Autoregressive training, GPT evolution, sampling strategies, in-context learning
04	Pretraining at Scale	Multi-node training, ZeRO optimizer, Flash Attention, training data, costs
05	Supervised Fine-Tuning	Fine-tuning on labeled data, catastrophic forgetting, hyperparameters, evaluation
06	Instruction Tuning	Teaching models to follow instructions, FLAN, chain-of-thought, open datasets
07	LoRA	Low-rank weight updates, rank selection, alpha scaling, merging, PEFT
08	QLoRA	4-bit quantization + LoRA, NF4, double quantization, fine-tuning a 65B model on a single 48 GB GPU
09	Full Fine-Tuning vs PEFT	Decision framework, memory comparison, quality tradeoffs, practical guide
10	RLHF	Reward model training, PPO, KL penalty, reward hacking, InstructGPT results
11	DPO	Direct preference optimization, the math behind DPO, comparison with RLHF, training with TRL
12	Modern Alignment Techniques	RLAIF, Constitutional AI, iterative DPO, process reward models, open questions

Prerequisites

Before starting this module, you should have completed:

Module 01: Transformer Architecture - attention mechanism, positional encoding, feed-forward layers
Basic PyTorch - tensors, autograd, training loops
Familiarity with tokenization - BPE, WordPiece, SentencePiece

Key Concepts You Will Master

Training Objectives

Causal Language Modeling (CLM) - predict the next token
Masked Language Modeling (MLM) - predict masked tokens using bidirectional context
Cross-entropy loss and perplexity
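
To make the link between cross-entropy and perplexity concrete, here is a minimal sketch in plain Python. It assumes we already have the probability the model assigned to the correct token at each position; the list `probs` is made-up illustration data, not output from a real model. Cross-entropy is the mean negative log-probability of the correct tokens, and perplexity is its exponential.

```python
import math


def cross_entropy(token_probs):
    """Mean negative log-likelihood (in nats) of the correct tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(cross_entropy(token_probs))


# Hypothetical probabilities assigned to the correct token at four positions.
# For CLM these would be next-token probabilities; for MLM, the probabilities
# at the masked positions only.
probs = [0.5, 0.25, 0.5, 1.0]

loss = cross_entropy(probs)
ppl = perplexity(probs)
print(f"cross-entropy = {loss:.4f} nats, perplexity = {ppl:.4f}")
```

With these numbers the geometric mean of the probabilities is 0.5, so the perplexity is 2: on average the model is as uncertain as if it were choosing uniformly between two tokens.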

Pretraining Infrastructure

Tensor, pipeline, and data parallelism
ZeRO (DeepSpeed) - partitioning optimizer states, gradients, and parameters across data-parallel workers
Mixed precision training (BF16/FP16)
Flash Attention

Fine-tuning Methods

Full fine-tuning - update all parameters
LoRA - low-rank weight updates (Hu et al., 2021)
QLoRA - 4-bit base model + LoRA (Dettmers et al., 2023)
Prompt tuning and prefix tuning
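
As a rough illustration of why low-rank updates are parameter-efficient, the sketch below counts trainable parameters for one weight matrix. It assumes the standard LoRA formulation, where a frozen weight of shape `d_out x d_in` gets an update `B @ A` with `B` of shape `d_out x r` and `A` of shape `r x d_in`. The dimensions (4096) and rank (8) are illustrative choices, not recommendations from this module.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters in the LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes: one square projection matrix and a small rank.
d_out = d_in = 4096
rank = 8

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
fraction = lora / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={rank}):     {lora:,} trainable parameters")
print(f"fraction trained: {fraction:.4%}")
```

Under these assumptions LoRA trains 1/256 of the parameters of that matrix. The savings in optimizer state and gradient memory follow from the same ratio, which is the tradeoff the fine-tuning lessons examine in detail.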

Alignment

RLHF - Reinforcement Learning from Human Feedback
DPO - Direct Preference Optimization (Rafailov et al., 2023)
Constitutional AI and RLAIF
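
To give a feel for what DPO optimizes, here is a minimal sketch of the per-example DPO loss from Rafailov et al. (2023) in plain Python. It assumes the sequence log-probabilities of the chosen and rejected responses have already been computed under both the policy and a frozen reference model; the function name, argument names, and numeric values are hypothetical, and `beta=0.1` is only an illustrative setting.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio)))."""
    # How much more (in log space) the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    # Fine for a sketch; a real implementation would use a numerically stable logsigmoid.
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is zero, so the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)

print(f"baseline loss = {baseline:.4f}, improved loss = {improved:.4f}")
```

Note that no reward model or sampling loop appears here: the preference pair and the reference log-probabilities are enough to compute a gradient, which is the main practical difference from PPO-based RLHF.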

How to Use This Module

The lessons build on each other. Start with lesson 01 and work through sequentially. Each lesson includes working code examples you can run, production notes from real deployments, and interview Q&A calibrated to ML engineering interviews at top companies.

The module assumes you are an engineer who wants to understand not just what these techniques are, but why they work, when to use them, and how to debug them in production.
