Module 02: Pretraining and Fine-tuning

What This Module Covers

Nearly every LLM you use today - GPT-4, Claude, Gemini, LLaMA-based chat models - went through broadly the same pipeline: pretrain on a massive text corpus, fine-tune to follow instructions, then align with human preferences. This module covers each stage of that pipeline in depth.

You will learn why different training objectives produce different model capabilities, how the industry moved from full fine-tuning to parameter-efficient methods like LoRA, why RLHF was a turning point for instruction-following models, and why DPO is increasingly used in its place. By the end of this module, you will be able to make informed decisions about which training approach fits your use case.

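The Full Pipeline

Pretraining (massive text corpus) -> Supervised fine-tuning / instruction tuning -> Preference alignment (RLHF or DPO)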

Lessons in This Module

#  | Lesson | What You Will Learn
01 | Language Modeling Objectives | MLM vs CLM, cross-entropy loss, perplexity, why objective choice matters
02 | Masked Language Modeling (BERT) | The 15% masking trick, BERT architecture, NSP debate, RoBERTa improvements
03 | Causal Language Modeling (GPT) | Autoregressive training, GPT evolution, sampling strategies, in-context learning
04 | Pretraining at Scale | Multi-node training, ZeRO optimizer, Flash Attention, training data, costs
05 | Supervised Fine-Tuning | Fine-tuning on labeled data, catastrophic forgetting, hyperparameters, evaluation
06 | Instruction Tuning | Teaching models to follow instructions, FLAN, chain-of-thought, open datasets
07 | LoRA | Low-rank weight updates, rank selection, alpha scaling, merging, PEFT
08 | QLoRA | 4-bit quantization + LoRA, NF4, double quantization, 65B on one GPU
09 | Full Fine-Tuning vs PEFT | Decision framework, memory comparison, quality tradeoffs, practical guide
10 | RLHF | Reward model training, PPO, KL penalty, reward hacking, InstructGPT results
11 | DPO | Direct preference optimization, the math behind DPO, vs RLHF, TRL training
12 | Modern Alignment Techniques | RLAIF, Constitutional AI, iterative DPO, process reward models, open questions

Prerequisites

Before starting this module, you should be comfortable with:

  • Module 01: Transformer Architecture - attention mechanism, positional encoding, feed-forward layers
  • Basic PyTorch - tensors, autograd, training loops
  • Familiarity with tokenization - BPE, WordPiece, SentencePiece

Key Concepts You Will Master

Training Objectives

  • Causal Language Modeling (CLM) - predict the next token
  • Masked Language Modeling (MLM) - predict masked tokens using bidirectional context
  • Cross-entropy loss and perplexity
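
To make the link between cross-entropy and perplexity concrete, here is a minimal sketch in plain Python. The toy 4-token vocabulary, the logits, and the function names are illustrative assumptions, not code from the lessons. The same loss applies to both objectives: for CLM the scored positions are next-token predictions, for MLM they are the masked positions.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits_per_position, target_ids):
    """Mean negative log-likelihood of the target tokens, in nats."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(mean_loss)

# Hypothetical toy setup: vocabulary of 4 tokens, 3 scored positions.
targets = [2, 0, 1]

# A model that knows nothing spreads probability evenly over the vocabulary.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in targets]
loss = cross_entropy(uniform_logits, targets)
ppl = perplexity(loss)  # equals the vocabulary size for a uniform model

# A model that puts most of its mass on the correct token scores much lower.
confident_logits = [[0.0, 0.0, 5.0, 0.0],
                    [5.0, 0.0, 0.0, 0.0],
                    [0.0, 5.0, 0.0, 0.0]]
confident_ppl = perplexity(cross_entropy(confident_logits, targets))

print(f"uniform: loss={loss:.4f} ppl={ppl:.2f}")
print(f"confident: ppl={confident_ppl:.2f}")
```

A useful reading of perplexity is "effective number of equally likely choices per token": the uniform model above is choosing among all 4 tokens, while the confident model is close to 1.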

Pretraining Infrastructure

  • Tensor, pipeline, and data parallelism
  • ZeRO - partitioning optimizer states, gradients, and parameters across GPUs (DeepSpeed)
  • Mixed precision training (BF16/FP16)
  • Flash Attention
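
As a rough illustration of why sharding model states matters, the sketch below estimates per-GPU memory for model states under mixed-precision Adam. It assumes the commonly cited accounting of 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for optimizer state (FP32 master weights, momentum, variance). It ignores activations, buffers, and fragmentation, and the function name is hypothetical.

```python
def model_state_memory_gb(num_params, num_gpus, zero_stage=0):
    """Approximate per-GPU memory (GB) for model states only.

    Assumption: mixed-precision Adam with 2 + 2 + 12 bytes per parameter.
    zero_stage 1 shards optimizer states, 2 also shards gradients,
    3 also shards the parameters themselves.
    """
    weights = 2.0 * num_params
    grads = 2.0 * num_params
    optimizer = 12.0 * num_params
    if zero_stage >= 1:
        optimizer /= num_gpus
    if zero_stage >= 2:
        grads /= num_gpus
    if zero_stage >= 3:
        weights /= num_gpus
    return (weights + grads + optimizer) / 1e9

# Hypothetical example: a 7.5B-parameter model on 64 GPUs.
for stage in range(4):
    gb = model_state_memory_gb(7.5e9, 64, zero_stage=stage)
    print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
```

Plain data parallelism (stage 0) replicates all 16 bytes per parameter on every GPU, which is why the optimizer state, not the weights, is usually the first thing worth sharding.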

Fine-tuning Methods

  • Full fine-tuning - update all parameters
  • LoRA - low-rank weight updates (Hu et al., 2021)
  • QLoRA - 4-bit base model + LoRA (Dettmers et al., 2023)
  • Prompt tuning and prefix tuning
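
The idea behind LoRA's low-rank update can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the PEFT library's implementation: the class name, the tiny matrix sizes, and the initialization scale are all hypothetical. The frozen weight W is left untouched, and only the two small matrices A and B would be trained.

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in a LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

class LoRALinear:
    """Toy linear layer computing W x + (alpha / rank) * B (A x)."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight            # frozen pretrained weight, d_out x d_in
        self.scale = alpha / rank       # alpha scaling
        # A starts with small random values, B starts at zero, so the
        # adapter contributes nothing before training begins.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
layer = LoRALinear(W, rank=1, alpha=2)
x = [1.0, 0.5, -1.0]
print(layer.forward(x))  # same as the frozen layer's output at initialization

# Parameter savings for a hypothetical 4096 x 4096 projection at rank 8.
full = 4096 * 4096
lora = lora_param_count(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

Because B is zero at initialization, fine-tuning starts exactly from the pretrained model's behavior, and the learned product B A can later be merged into W so that inference costs nothing extra.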

Alignment

  • RLHF - Reinforcement Learning from Human Feedback
  • DPO - Direct Preference Optimization (Rafailov et al., 2023)
  • Constitutional AI and RLAIF
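
For a feel of what DPO optimizes, here is a minimal sketch of the per-example loss from Rafailov et al. (2023), written in plain Python. It assumes you already have summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model. The function name, the argument names, and the numbers are hypothetical, and beta=0.1 is only an illustrative default.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(margin))."""
    # Implicit rewards: how much more likely the policy makes each
    # response compared with the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the margin is 0 and the loss is log(2).
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

# Policy has shifted toward the rejected response: loss rises.
worse = dpo_loss(-12.0, -9.0, -10.0, -12.0)

print(f"at init: {at_init:.4f}  improved: {improved:.4f}  worse: {worse:.4f}")
```

Note that no separate reward model or sampling loop appears here: the preference signal is expressed directly as a classification-style loss over log-probability ratios, which is the main practical difference from PPO-based RLHF.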

How to Use This Module

The lessons build on each other, so start with lesson 01 and work through them in order. Each lesson includes runnable code examples, production notes from real deployments, and interview Q&A calibrated to ML engineering interviews at top companies.

The module assumes you are an engineer who wants to understand not just what these techniques are, but why they work, when to use them, and how to debug them in production.

© 2026 EngineersOfAI. All rights reserved.