LLM Pretraining - From Raw Data to Foundation Models
Reading time: ~50 min | Interview relevance: Critical (Tier 1-2) / High (Tier 3-4) | Roles: MLE, Research Eng, ML Infra Eng, LLM Eng
The Real Interview Moment
You are interviewing at a compute-focused AI startup. The lead researcher presents this scenario: "We have a cluster of 512 H100 GPUs and a budget of $2M in compute. We want to train the best possible model. How large should the model be? How much data do we need? How long will it take? Walk me through every decision from data collection to the first checkpoint."
You start with the compute budget. She pushes: "You said Chinchilla-optimal. But we are optimizing for inference cost, not training cost. How does that change your sizing? And what about data quality - does the scaling law still hold if 30% of our data is noisy?"
This is the definitive Tier 1/Tier 2 interview question. It tests whether you understand the end-to-end pretraining pipeline - not just the training loop, but the data engineering, tokenization decisions, scaling tradeoffs, and infrastructure challenges that consume 80% of the actual work.
Candidates who can only describe the training objective get a "lean hire" at best. Candidates who can reason about data quality, compute allocation, parallelism strategy, and failure recovery - with concrete numbers - get a "strong hire."
What You Will Master
- Design a data collection and filtering pipeline with quality heuristics
- Explain deduplication methods (MinHash, exact match, n-gram) and why they matter
- Compare BPE, SentencePiece, and tiktoken tokenization with vocabulary tradeoffs
- Derive the causal language modeling objective and its connection to compression
- Apply Chinchilla and inference-aware scaling laws with compute budget math
- Plan 3D parallelism (data, tensor, pipeline) for a given GPU cluster
- Design checkpointing and fault tolerance for multi-week training runs
- Explain curriculum learning and data mixing strategies
Self-Assessment: Where Are You Now?
| Skill | 1 -- Cannot | 2 -- Vaguely | 3 -- Can Explain | 4 -- Can Derive | 5 -- Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Design a data collection pipeline | | | | | | ___ |
| Explain deduplication methods | | | | | | ___ |
| Compare BPE vs SentencePiece vs tiktoken | | | | | | ___ |
| Derive causal LM training objective | | | | | | ___ |
| Apply Chinchilla scaling law | | | | | | ___ |
| Calculate compute budgets (GPU-hours) | | | | | | ___ |
| Plan 3D parallelism | | | | | | ___ |
| Design fault tolerance for training | | | | | | ___ |
| Explain curriculum learning | | | | | | ___ |
Target: All 4s and 5s before your interview.
Part 1 - Data Collection and Filtering
The Data Pipeline
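Pretraining data quality is the single most impactful factor in model quality. The pipeline runs from raw sources through quality filtering, deduplication, and tokenization to the final training mix; each stage is covered below.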
Data Sources
| Source | Volume | Quality | Use Case |
|---|---|---|---|
| Common Crawl | ~250B pages, ~100T tokens raw | Low (needs heavy filtering) | Web knowledge, general text |
| Wikipedia | ~4B tokens (English) | High | Factual knowledge |
| Books (Books3, Gutenberg) | ~10-50B tokens | Medium-High | Long-form reasoning, narrative |
| Academic papers (arXiv, S2ORC) | ~50B tokens | High | Scientific reasoning |
| Code (GitHub, StackOverflow) | ~500B+ tokens | Medium | Code generation, logical reasoning |
| Math (textbooks, competition data) | ~10B tokens | High | Mathematical reasoning |
| Curated instruction data | ~1-10B tokens | Very High | Instruction following |
A common interview question is "If you could only improve one thing about your pretraining data, what would it be?" The answer is almost always data quality, not quantity. LLaMA 3 was trained on 15T tokens - but the data team spent months building quality classifiers. RedPajama and FineWeb showed that better filtering on Common Crawl data can match proprietary datasets.
Quality Filtering Heuristics
The following heuristics are used by most open-source data pipelines (C4, RedPajama, FineWeb, Dolma):
| Filter | What It Catches | Implementation |
|---|---|---|
| Min word count | Stub pages, navigation menus | Remove documents with fewer than 50 words |
| Mean word length | Garbled text, encoded content | Remove if mean word length is outside 3-10 characters |
| Symbol-to-word ratio | Markup-heavy pages, log files | Remove if more than 10% of tokens are special characters |
| Repetition ratio | Boilerplate, template pages | Remove if any n-gram (2-4 words) repeats more than a threshold |
| "Dirty word" ratio | Adult content | Remove if offensive word fraction exceeds threshold |
| Perplexity filter | Gibberish, auto-generated text | Train a small LM on Wikipedia; remove high-perplexity docs |
| Classifier filter | General low quality | Train a classifier (e.g., fastText) on Wikipedia (positive) vs random web (negative) |
The "Wikipedia quality" classifier is a powerful idea: train a binary classifier to distinguish "looks like Wikipedia" from "looks like random web." Documents scoring high are more likely to be well-written, factual, and informative. This was used by GPT-3, LLaMA, and most subsequent models.
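A few of the rule-based filters in the table can be sketched in plain Python. This is a minimal illustration, not the implementation used by C4, RedPajama, FineWeb, or Dolma: the function name `passes_heuristics`, the trigram-based repetition check, and the `max_ngram_frac` threshold are assumptions made for the example, while the 50-word, 3-10 character, and 10% thresholds come from the table.

```python
import re
from collections import Counter


def passes_heuristics(text, min_words=50, mean_len_range=(3, 10),
                      max_symbol_ratio=0.10, max_ngram_frac=0.2):
    """Return True if a document survives a few simple quality heuristics."""
    words = text.split()
    if len(words) < min_words:
        return False  # stub pages, navigation menus

    mean_len = sum(len(w) for w in words) / len(words)
    if not (mean_len_range[0] <= mean_len <= mean_len_range[1]):
        return False  # garbled text, encoded content

    # A "symbol" here is a whitespace-separated token with no word character.
    symbols = sum(1 for w in words if not re.search(r"\w", w))
    if symbols / len(words) > max_symbol_ratio:
        return False  # markup-heavy pages, log files

    # Repetition: share of trigram slots taken by the most common trigram
    # (max_ngram_frac is an assumed threshold for illustration).
    trigrams = Counter(zip(words, words[1:], words[2:]))
    if trigrams:
        top_count = trigrams.most_common(1)[0][1]
        if top_count / max(len(words) - 2, 1) > max_ngram_frac:
            return False  # boilerplate, template pages
    return True
```

In a real pipeline these cheap checks would typically run before the more expensive perplexity and classifier filters, so that most junk is dropped early.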
Deduplication
Deduplication is critical because web crawls contain enormous amounts of repeated content (boilerplate, scraped sites, template pages). Training on duplicates:
- Wastes compute
- Memorizes specific text (privacy risk)
- Biases the model toward overrepresented content
- Can cause training instability
Deduplication methods range from exact matching through n-gram overlap to fuzzy matching.
MinHash deduplication is the most commonly used fuzzy method. The algorithm:
1. For each document, create a set of n-gram shingles (e.g., all 5-word sequences)
2. Apply $k$ hash functions to each shingle set - keep the minimum hash for each function
3. The resulting $k$ values form the "MinHash signature"
4. Use Locality-Sensitive Hashing (LSH) to group documents with similar signatures
5. Within each group, compute exact Jaccard similarity and remove near-duplicates (typically above 0.8 similarity)
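The MinHash steps above can be sketched with the standard library alone. This is a toy version under stated assumptions: salted `hashlib.blake2b` stands in for a family of independent hash functions, and the choice of 128 hashes split into 32 bands is illustrative, not a recommendation. Production pipelines use dedicated libraries and distributed LSH.

```python
import hashlib


def shingles(text, n=5):
    """Set of n-word shingles for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def minhash_signature(shingle_set, num_hashes=128):
    """One minimum hash value per (salted) hash function."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # assumed seeding scheme
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little")
            for s in shingle_set))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_groups(signatures, bands=32):
    """Group doc ids whose signatures agree on at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]
```

Candidate groups from `lsh_candidate_groups` would then be verified with exact Jaccard similarity on the shingle sets before anything is removed.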
Candidates often forget to mention cross-document deduplication across data sources. Wikipedia text appears in Common Crawl, books appear in multiple scraped sites, etc. You must deduplicate across all sources, not just within each source. This is a significant engineering challenge at the scale of trillions of tokens.
Part 2 - Tokenization
Why Tokenization Matters for LLMs
The tokenizer defines the model's "vocabulary" - its atomic units of processing. Every LLM design decision is affected:
- Context window: 128K tokens is not 128K words. Typical ratio is 1 word ≈ 1.3 tokens for English, but much worse for non-Latin scripts
- Training cost: More tokens per document = more FLOPs to process the same text
- Multilingual quality: A tokenizer trained primarily on English fragments non-Latin text into many small tokens, degrading quality
- Code quality: Whitespace handling matters - Python indentation should not consume excessive tokens
BPE (Byte Pair Encoding)
The standard tokenization algorithm for LLMs. Starting from a character vocabulary, iteratively merge the most frequent adjacent pair:
Algorithm:
1. Start with a vocabulary of all individual characters (or bytes)
2. Count all adjacent pairs in the training corpus
3. Merge the most frequent pair into a new token
4. Repeat steps 2-3 until vocabulary reaches target size
Example:
Corpus: "low low low low low lowest lowest newer newer wider"
Initial vocab: {l, o, w, e, s, t, n, r, i, d, <space>}
Step 1: Most frequent pair: (l, o), 7 occurrences (tied with (o, w)) → merge into "lo"
Step 2: Most frequent pair: (lo, w), 7 occurrences → merge into "low"
Step 3: Most frequent pair: (e, r), 3 occurrences → merge into "er"
Step 4: Several pairs tie at 2 occurrences; picking (e, s) → merge into "es"
Step 5: Picking (es, t) from the tied pairs → merge into "est"
Step 6: Picking (low, est) from the tied pairs → merge into "lowest"
...
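The merge loop is short enough to write out. The sketch below is a toy trainer, assuming whitespace pre-tokenization (pairs never cross word boundaries) and a "first pair encountered wins" tie-break; real tokenizers differ in both respects, and `train_bpe` is a name chosen for this example.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn BPE merges from a whitespace-separated toy corpus."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        merges.append(best)

        # Rewrite every word with the chosen pair merged into one symbol.
        merged_words = Counter()
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += count
        words = merged_words
    return merges
```

On the toy corpus used in this section, the first three merges come out as `('l', 'o')`, `('lo', 'w')`, `('e', 'r')`. After that several pairs are tied, so the order of later merges depends on the tie-breaking rule.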
BPE Variants in Modern LLMs
| Tokenizer | Base Unit | Vocab Size | Used By | Key Feature |
|---|---|---|---|---|
| GPT-2 BPE | Bytes | 50,257 | GPT-2, GPT-3 | Byte-level fallback |
| SentencePiece | Unicode | Varies | LLaMA, T5 | Language-agnostic, whitespace as token |
| tiktoken (cl100k) | Bytes | 100,256 | GPT-4, GPT-3.5 | Fast (Rust), regex pre-tokenization |
| LLaMA 3 tokenizer | Bytes | 128,256 | LLaMA 3 | Larger vocab for multilingual |
"Modern LLMs use Byte Pair Encoding or its variants. BPE starts with individual bytes or characters and iteratively merges the most frequent pairs until reaching a target vocabulary size. The key engineering decisions are: (1) vocabulary size - larger vocabs like LLaMA 3's 128K reduce sequence length but increase embedding table size, (2) byte-level vs character-level base - byte-level ensures nothing is out-of-vocabulary, (3) pre-tokenization regex - tiktoken uses regex to split on whitespace and punctuation before BPE, preventing merges across word boundaries. The tokenizer directly impacts context length, multilingual quality, and inference cost."
Vocabulary Size Tradeoffs
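The embedding table has $V \times d_{\text{model}}$ parameters. For $d_{\text{model}} = 4096$:
- $V = 32{,}000$ (LLaMA 2): 131M params (~0.5 GB in fp32)
- $V = 128{,}256$ (LLaMA 3): 525M params (~2 GB in fp32)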
Larger vocab means:
- Shorter sequences (fewer tokens per document) = less KV cache, faster inference
- Better multilingual (more tokens dedicated to non-English scripts)
- Larger embedding table (more parameters, more memory)
- Sparser updates (rare tokens get updated less during training)
The "Fertility" Problem
Fertility = average number of tokens per word for a given language. English is efficient (~1.3 tokens/word with most tokenizers). Other languages suffer:
| Language | Fertility (tiktoken cl100k) | Effective Context (128K tokens) |
|---|---|---|
| English | ~1.3 | ~98K words |
| Spanish | ~1.5 | ~85K words |
| Chinese | ~1.8 | ~71K words |
| Hindi | ~3.5 | ~37K words |
| Thai | ~4.0 | ~32K words |
This means a 128K context model effectively has 3x less context for Thai than for English. LLaMA 3 addressed this partially by quadrupling the vocabulary size from 32K to 128K.
Part 3 - Training Objectives
Causal Language Modeling (CLM)
The primary pretraining objective for decoder-only LLMs:
$$\mathcal{L}_{\text{CLM}} = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$$
Minimize the negative log-likelihood of each token given all preceding tokens. This is equivalent to compression - a model with lower loss is a better compressor of text.
Connection to perplexity:
$$\text{PPL} = \exp\left(\mathcal{L}_{\text{CLM}}\right)$$
A perplexity of 10 means the model is, on average, as uncertain as if it had to choose uniformly among 10 equally likely options for each next token.
The connection between language modeling and compression is a favorite interview topic at research-oriented companies. Key insight: a model that perfectly predicts the next token achieves optimal compression. Shannon's source coding theorem tells us the minimum description length equals the entropy of the source. So "better language model" and "better compressor" are mathematically the same thing.
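The relationship between loss, perplexity, and compressed size is easy to check numerically. The sketch below assumes you already have the probability the model assigned to each correct next token; the function names are illustrative.

```python
import math


def clm_loss(token_probs):
    """Mean negative log-likelihood in nats, given P(x_t | x_<t) per position."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(loss_nats):
    """Perplexity is the exponential of the mean loss."""
    return math.exp(loss_nats)


def bits_per_token(loss_nats):
    """Code length per token for an ideal entropy coder driven by the model."""
    return loss_nats / math.log(2)


# A model that assigns probability 0.1 to every correct token behaves like a
# uniform choice among 10 options.
loss = clm_loss([0.1] * 5)
print(perplexity(loss), bits_per_token(loss))
```

Lower loss means fewer bits per token, which is the compression view of the same quantity.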
Prefix Language Modeling
Used by PaLM, UL2, and some hybrid architectures. The input sequence is split into a prefix (bidirectional attention) and a target (causal attention):
$$\mathcal{L}_{\text{prefix}} = -\sum_{t=k+1}^{T} \log P_\theta(x_t \mid x_{1:k},\, x_{k+1:t-1})$$
The first $k$ tokens are the prefix - they can attend to each other bidirectionally. Loss is computed only on the target tokens $x_{k+1}, \dots, x_T$.
Why it helps: For tasks where the "input" is well-defined (e.g., a document to summarize), bidirectional attention on the input creates better representations. But it reduces training efficiency (fewer tokens contribute to the loss).
Fill-in-the-Middle (FIM)
Used by code models (Code LLaMA, StarCoder, DeepSeek Coder). During training, randomly split a document into three parts:
Original: [A][B][C]
FIM input: <PRE>[A]<SUF>[C]<MID>[B]
The model learns to predict the middle given the prefix and suffix. This is critical for code completion where you need to fill in a function body given the signature and the code below.
FIM rate: Typically 50% of training examples use FIM and 50% use standard CLM, i.e., the transformation is applied with probability $p_{\text{FIM}} = 0.5$ during training.
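The FIM transformation itself is a small data-preprocessing step. A minimal character-level sketch follows; it assumes the sentinels are plain strings (real models use dedicated special token IDs) and picks the two split points uniformly at random, which is one common choice rather than the only one.

```python
import random


def fim_transform(doc, fim_rate=0.5, rng=random):
    """With probability fim_rate, rewrite doc as <PRE>prefix<SUF>suffix<MID>middle."""
    if len(doc) < 2 or rng.random() >= fim_rate:
        return doc  # keep as a standard left-to-right example

    # Two distinct cut points split the document into prefix, middle, suffix.
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"


example = fim_transform("def add(a, b):\n    return a + b\n",
                        fim_rate=1.0, rng=random.Random(0))
print(example)
```

Because the middle is moved to the end, the ordinary next-token loss trains infilling with no change to the model architecture.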
Part 4 - Scaling Laws
The Kaplan Scaling Laws (OpenAI, 2020)
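The first quantitative scaling laws showed that loss scales as a power law with model size, dataset size, and compute:
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$
where $N$ = parameters, $D$ = tokens, $C$ = compute (FLOPs), and $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$.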
Key insight from Kaplan: For a fixed compute budget, larger models trained on less data outperform smaller models trained on more data. This led to GPT-3 (175B params trained on 300B tokens).
Chinchilla Scaling Law (Hoffmann et al., 2022)
Chinchilla revisited the Kaplan scaling laws with more careful experiments and found a different optimal allocation:
$$N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}$$
The ratio of tokens to parameters should be approximately $D/N \approx 20$. This means:
| Model Size | Chinchilla-Optimal Tokens | Example |
|---|---|---|
| 1B | 20B | |
| 7B | 140B | |
| 70B | 1.4T | Chinchilla (70B, 1.4T tokens) |
| 175B | 3.5T | GPT-3 was undertrained (only 300B) |
Many candidates recite "Chinchilla says 20 tokens per parameter" without understanding the implications. The real insight is that GPT-3 was massively undertrained - it should have seen 12x more data. But this also means training GPT-3 to Chinchilla-optimal would cost 12x more compute. The Chinchilla law tells you the compute-optimal frontier, not the practical one.
Inference-Aware Scaling Laws (2023-2024)
Chinchilla optimizes for training cost. But in practice, inference cost dominates total cost of ownership. A model served for millions of queries per day costs far more in inference than it cost to train.
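The inference-aware insight: It is cheaper to train a smaller model on MORE data than Chinchilla-optimal, because the smaller model is cheaper to serve.
$$\text{Total cost} = C_{\text{train}}(N, D) + n_{\text{queries}} \times C_{\text{inference}}(N)$$
If $n_{\text{queries}}$ is large (high query volume), the optimal strategy shifts toward smaller models: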
| Strategy | Model Size | Training Tokens | Train Cost | Inference Cost/Query | Best When |
|---|---|---|---|---|---|
| Chinchilla-optimal | 70B | 1.4T | $10M | $0.001 | Low query volume |
| Inference-aware | 7B | 2-5T | $2M | $0.0001 | High query volume |
| Over-train | 1B | 3T | $200K | $0.00001 | Edge/mobile |
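One way to put rough numbers on the crossover is to count FLOPs instead of dollars. The sketch below uses the common approximations of $6ND$ FLOPs for training and $2N$ FLOPs per generated or processed token at inference. It assumes, loudly, that the two models reach comparable quality, which is the whole question in practice; the function names and the 70B/8B comparison are illustrative.

```python
def total_flops(n_params, train_tokens, served_tokens):
    """Lifetime FLOPs: training (6ND) plus inference (2N per served token)."""
    train = 6 * n_params * train_tokens
    inference = 2 * n_params * served_tokens
    return train + inference


def crossover_served_tokens(big, small):
    """Served-token volume above which the smaller model has lower total FLOPs.

    big and small are (n_params, train_tokens) tuples.
    """
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = 6 * (n_small * d_small - n_big * d_big)
    saved_per_token = 2 * (n_big - n_small)
    return max(extra_training / saved_per_token, 0.0)


# 70B on 1.4T tokens (Chinchilla-style) vs 8B on 15T tokens (over-trained)
tokens = crossover_served_tokens((70e9, 1.4e12), (8e9, 15e12))
print(f"{tokens:.2e} served tokens")
```

Under these assumptions the over-trained 8B model wins on total FLOPs once roughly a trillion tokens have been served, a volume a popular product can reach quickly.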
The same logic explains why LLaMA 3 8B was trained on 15T tokens (roughly 94x the Chinchilla-optimal 160B tokens for its size) - Meta optimized for inference cost, not training cost.
"Chinchilla scaling says you should train with about 20 tokens per parameter to minimize training loss for a given compute budget. But this optimizes training cost, not total cost. In practice, LLaMA 3 8B was trained on 15 trillion tokens - roughly 94 times the Chinchilla-optimal amount - because a smaller model that is over-trained is much cheaper to serve. The right scaling law depends on your deployment scenario: if you are serving millions of queries, over-training a smaller model is more cost-effective than Chinchilla-optimal training of a larger model."
Compute Budget Estimation
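Given a compute budget, find the optimal model:
$$C \approx 6ND$$
For a budget of $B$ dollars at cost $c$ per GPU-hour:
$$C = \frac{B}{c} \times 3600 \times \text{peak FLOPS} \times \text{MFU}$$
Example: Budget = \$2M, H100 at \$3/GPU-hour, 1000 TFLOPS bf16, 40% MFU:
$$C = \frac{2 \times 10^{6}}{3} \times 3600 \times 10^{15} \times 0.4 \approx 9.6 \times 10^{23} \text{ FLOPs}$$
Chinchilla-optimal:
$$C = 6N \cdot 20N = 120N^2 \;\Rightarrow\; N = \sqrt{\frac{C}{120}} \approx 89\text{B}, \qquad D = 20N \approx 1.8\text{T}$$
So approximately a 90B parameter model on 1.8T tokens - but this is Chinchilla-optimal. For inference-aware, you would train a 70B model on ~2.3T tokens, or a 7B model on ~23T tokens.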
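The budget arithmetic is worth having as a small calculator, since interviewers tend to change one input and ask for the new answer. A minimal sketch, assuming $C = 6ND$ and a fixed tokens-per-parameter ratio; the function names are illustrative.

```python
import math


def chinchilla_plan(budget_usd, usd_per_gpu_hour, peak_flops, mfu,
                    tokens_per_param=20):
    """Return (gpu_hours, total_flops, n_params, n_tokens) for a dollar budget."""
    gpu_hours = budget_usd / usd_per_gpu_hour
    total_flops = gpu_hours * 3600 * peak_flops * mfu
    # C = 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * ratio))
    n_params = math.sqrt(total_flops / (6 * tokens_per_param))
    return gpu_hours, total_flops, n_params, tokens_per_param * n_params


def tokens_for_model(total_flops, n_params):
    """Tokens affordable for a fixed model size under C = 6ND."""
    return total_flops / (6 * n_params)


hours, flops, n, d = chinchilla_plan(2e6, 3.0, 1e15, 0.40)
print(f"{hours:,.0f} GPU-hours, {flops:.2e} FLOPs, N = {n / 1e9:.0f}B, D = {d / 1e12:.2f}T")
print(f"70B model: {tokens_for_model(flops, 70e9) / 1e12:.1f}T tokens")
```

With a \$2M budget at \$3/GPU-hour, 1000 TFLOPS peak, and 40% MFU, this gives about 9.6e23 FLOPs, a compute-optimal model near 89B parameters, and about 2.3T tokens if the model size is fixed at 70B instead.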
Part 5 - Training Infrastructure
The Three Dimensions of Parallelism
Training large LLMs requires distributing computation across hundreds or thousands of GPUs. The three main parallelism strategies address different bottlenecks:
Data Parallelism (DP)
Each GPU holds a complete copy of the model. The batch is split across GPUs, and gradients are synchronized (all-reduce) after each step.
Memory per GPU: Full model + optimizer states + activations for the batch chunk
Communication: All-reduce of gradients after each step. For a model with $N$ parameters and bf16 gradients: roughly $2 \times 2N$ bytes communicated per GPU per step (a reduce-scatter plus an all-gather, each moving about $2N$ bytes).
When to use: Always used as the outer parallelism dimension. Scales almost linearly with number of GPUs if communication is overlapped with computation.
FSDP (Fully Sharded Data Parallelism): Instead of replicating the full model on every GPU, shard the parameters, gradients, and optimizer states across $G$ GPUs. Each GPU holds only $1/G$ of the model state. All-gather parameters when needed for computation, then discard.
| DP Variant | Memory per GPU | Communication | Complexity |
|---|---|---|---|
| Standard DP | Full model + optimizer | Gradient all-reduce | Low |
| ZeRO Stage 1 | Full model + gradients, sharded optimizer | Gradient all-reduce (or reduce-scatter) + parameter all-gather | Medium |
| ZeRO Stage 2 | Full model, sharded optimizer + gradients | Gradient reduce-scatter + parameter all-gather | Medium |
| ZeRO Stage 3 / FSDP | Sharded everything | All-gather + reduce-scatter per layer | High |
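The memory column can be made concrete with per-parameter byte counts. The sketch below assumes mixed-precision Adam: 2 bytes for bf16 weights, 2 bytes for bf16 gradients, and 12 bytes of optimizer state (fp32 master weights plus two fp32 moments). Activations, buffers, and fragmentation are ignored, so treat the result as a lower bound.

```python
def model_state_gb(n_params, dp_degree, stage):
    """Per-GPU model-state memory in GB for ZeRO stage 0 (plain DP) to 3."""
    weights = 2 * n_params   # bf16 parameters
    grads = 2 * n_params     # bf16 gradients
    optim = 12 * n_params    # fp32 master weights + Adam first/second moments
    if stage >= 1:
        optim /= dp_degree   # Stage 1 shards optimizer states
    if stage >= 2:
        grads /= dp_degree   # Stage 2 also shards gradients
    if stage >= 3:
        weights /= dp_degree  # Stage 3 / FSDP also shards parameters
    return (weights + grads + optim) / 1e9


for stage in range(4):
    print(stage, round(model_state_gb(7e9, dp_degree=64, stage=stage), 2))
```

For a 7B model on 64 data-parallel ranks this goes from 112 GB per GPU with plain DP (which does not fit in 80 GB) down to under 2 GB with full sharding.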
Tensor Parallelism (TP)
Split individual matrix multiplications across GPUs. For attention: split heads across GPUs. For FFN: split the intermediate dimension.
Memory per GPU: $1/t$ of the model parameters per layer (for TP degree $t$)
Communication: Two all-reduce operations per layer (one for attention, one for FFN). Communication happens within a layer, so it must be fast - typically limited to GPUs connected by NVLink within a single node.
When to use: When a single layer does not fit in one GPU's memory, or when you need to reduce per-GPU memory. Typically TP = 2, 4, or 8 (within one node).
Tensor parallelism requires high-bandwidth communication within each layer's forward and backward pass. Never use TP across nodes connected by InfiniBand (too slow). TP is for intra-node (NVLink: 900 GB/s on H100), while DP and PP can go across nodes (InfiniBand: 400-800 Gb/s).
Pipeline Parallelism (PP)
Split layers across GPUs. GPU 0 handles layers 0-9, GPU 1 handles layers 10-19, etc.
Memory per GPU: Only the layers assigned to that GPU
Communication: Send activations between adjacent stages. Much less data than TP or DP gradient sync.
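The "bubble" problem: Naive PP has pipeline bubbles where GPUs wait for activations from the previous stage. Mitigated by micro-batching: split the mini-batch into $m$ micro-batches and pipeline them. With $p$ stages, the idle fraction is roughly $(p-1)/(m+p-1)$, so more micro-batches mean a smaller bubble.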
3D Parallelism in Practice
For a 512-GPU cluster training a 70B model:
| Dimension | Degree | Scope | Why |
|---|---|---|---|
| TP | 8 | Within node (8 GPUs/node) | Split large matrices across NVLink-connected GPUs |
| PP | 4 | Across 4 nodes | Split 80 layers into 4 stages of 20 layers |
| DP | 16 | Across remaining nodes | 512 / (8 * 4) = 16-way data parallel |
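Total: $8 \times 4 \times 16 = 512$ GPUs
Effective batch size: DP degree $\times$ micro-batch size per GPU $\times$ gradient accumulation steps. Multiply by the sequence length to get tokens per step; for example, 1,024 sequences $\times$ 4,096 tokens/seq $\approx$ 4M tokens per step.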
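The layout arithmetic is simple enough to script, which helps when an interviewer changes the cluster size mid-question. A minimal sketch with basic sanity checks; `plan_3d` and `tokens_per_step` are names chosen for this example, and the only topology rule encoded is that a TP group must fit inside one node.

```python
def plan_3d(total_gpus, tp, pp, gpus_per_node=8):
    """Return the data-parallel degree implied by a TP x PP choice."""
    if tp > gpus_per_node or gpus_per_node % tp != 0:
        raise ValueError("TP group must fit evenly inside one NVLink node")
    if total_gpus % (tp * pp) != 0:
        raise ValueError("total_gpus must be divisible by tp * pp")
    return total_gpus // (tp * pp)


def tokens_per_step(dp, micro_batch, grad_accum, seq_len):
    """Global batch in tokens: sequences per step times sequence length."""
    return dp * micro_batch * grad_accum * seq_len


dp = plan_3d(total_gpus=512, tp=8, pp=4)
print(dp, tokens_per_step(dp, micro_batch=4, grad_accum=16, seq_len=4096))
```

For the 512-GPU example this returns a DP degree of 16; the micro-batch and accumulation values in the usage line are placeholders, not a recommendation.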
Checkpointing and Fault Tolerance
A 70B model training run takes weeks to months. Hardware failures are not edge cases - they are guaranteed:
| Cluster Size | Mean Time Between Failure (estimated) |
|---|---|
| 64 GPUs | ~1 week |
| 512 GPUs | ~1 day |
| 4096 GPUs | ~2-4 hours |
| 16384 GPUs (LLaMA 3 405B) | Under 1 hour |
Checkpointing strategy:
- Frequency: Save every 100-1000 steps (balance between safety and I/O overhead)
- Asynchronous: Do not block training for checkpoint writes. Save to fast local NVMe, then async copy to distributed storage
- Sharded: Each DP rank saves its own shard. Full checkpoint = union of all shards
- Optimizer state: Must save optimizer states (Adam moments) - these are 2x the model size
Recovery protocol:
- Detect failure (heartbeat monitoring, NCCL timeout)
- Terminate all ranks
- Replace failed node (spare pool)
- Load latest checkpoint (verify with checksums)
- Resume training
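The "load latest checkpoint (verify with checksums)" step can be sketched with the standard library. This is a single-file toy under stated assumptions: checkpoints are opaque byte payloads, file names sort by step (zero-padded), and a SHA-256 sidecar file holds the checksum. Real systems save sharded tensors per rank and use their framework's checkpoint format.

```python
import hashlib
import os


def save_checkpoint(path, payload):
    """Write payload atomically, then record its SHA-256 in a sidecar file."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # make sure bytes hit disk before the rename
    os.replace(tmp, path)     # atomic rename: readers never see a partial file
    with open(path + ".sha256", "w") as f:
        f.write(hashlib.sha256(payload).hexdigest())


def load_latest_valid(paths):
    """Return (path, payload) for the newest checkpoint whose checksum matches."""
    for path in sorted(paths, reverse=True):  # assumes zero-padded step names
        try:
            with open(path, "rb") as f:
                payload = f.read()
            with open(path + ".sha256") as f:
                expected = f.read().strip()
        except FileNotFoundError:
            continue  # incomplete save; fall back to an older checkpoint
        if hashlib.sha256(payload).hexdigest() == expected:
            return path, payload
    return None
```

Falling back to an older checkpoint when the newest one is corrupt costs some recomputation but avoids resuming from bad state, which is usually the right trade.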
At infrastructure-focused companies, the fault tolerance question is a strong signal. Candidates who can discuss elastic training (adjusting DP degree when nodes fail/join), checkpoint formats (safetensors vs pickle), and training resumption strategies (learning rate schedule, batch size warmup after restart) demonstrate production experience.
Activation Checkpointing (Gradient Checkpointing)
Separate from model checkpointing. During backpropagation, you need activations from the forward pass. Normally all activations are stored in memory. Activation checkpointing discards intermediate activations and recomputes them during the backward pass:
Memory-compute tradeoff:
- Without checkpointing: memory = $O(L)$ for $L$ layers
- With full checkpointing: memory = $O(\sqrt{L})$, compute = ~33% more (one extra forward pass)
- Selective checkpointing: only recompute attention (most memory-hungry), keep FFN activations
Part 6 - Curriculum Learning
Data Mixing
The ratio of data sources during training significantly impacts model capabilities:
| Data Mix | Effect |
|---|---|
| More code | Better at reasoning, logic, structured output |
| More math | Better at quantitative reasoning |
| More books | Better at long-form coherence, narrative |
| More web | Better at general knowledge, diverse topics |
| More multilingual | Better at non-English tasks |
The LLaMA 3 data mix (approximate, based on Meta's report):
- General knowledge (mostly web): ~50%
- Math and reasoning: ~25%
- Code: ~17%
- Multilingual: ~8%
Curriculum Strategies
Rather than training on a fixed data distribution, curriculum learning adjusts the data mix over training:
- Quality annealing: Start with diverse web data, gradually increase the proportion of high-quality data (academic, curated) toward the end
- Domain upsampling: After initial training, do extra passes on domains where the model is weakest (identified by domain-specific eval)
- Difficulty ordering: Train on shorter, simpler documents first; gradually increase length and complexity
- Code scheduling: Introduce code data after basic language understanding is established
Anthropic and Google use sophisticated data mixing schedules that are proprietary. In interviews, demonstrate awareness that data mixing is an active research area. Saying "I would use a fixed 50/30/20 split" is fine for an initial answer, but follow up with "In practice, I would measure domain-specific loss during training and adjust the mix based on which domains are improving slowest."
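The "adjust the mix based on which domains are improving slowest" idea can be illustrated with a simple reweighting rule. This is a hypothetical heuristic written for this example, not a published method or any lab's actual schedule: domains whose loss improved less than average get upweighted, and weights are renormalized with a floor so no domain disappears.

```python
def update_mix(weights, prev_loss, curr_loss, step_size=0.5, floor=0.02):
    """Upweight domains whose loss improved less than the average domain."""
    improvement = {d: prev_loss[d] - curr_loss[d] for d in weights}
    mean_imp = sum(improvement.values()) / len(improvement)
    scale = abs(mean_imp) + 1e-8  # avoid division by zero when nothing moved

    raw = {}
    for d, w in weights.items():
        # Positive when this domain is lagging behind the average improvement.
        lag = (mean_imp - improvement[d]) / scale
        raw[d] = max(w * (1 + step_size * lag), floor)

    total = sum(raw.values())
    return {d: v / total for d, v in raw.items()}


mix = update_mix(
    weights={"web": 0.5, "code": 0.3, "math": 0.2},
    prev_loss={"web": 2.5, "code": 1.8, "math": 2.0},
    curr_loss={"web": 2.3, "code": 1.7, "math": 1.99},
)
print(mix)
```

In the usage example math barely improved, so its share rises while web's share falls. A real schedule would also smooth losses over many steps and guard against domains that are simply saturated.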
Long Context Training
Models like Gemini 1.5 (1M context) and LLaMA 3 (128K context) require special training strategies for long contexts:
- Context length schedule: Start training with 4K context, gradually extend to 32K, then 128K. Training on 128K from the start is wasteful (most documents are short)
- RoPE base frequency adjustment: When extending context, adjust the RoPE base frequency to maintain positional resolution
- Ring Attention / sequence parallelism: For very long sequences that exceed a single GPU's memory, distribute the sequence across GPUs
- Loss weighting: Optionally weight later tokens in long sequences more heavily to ensure the model learns to use long contexts
Practice Problems
Problem 1: Compute Budget Planning
You have a budget of \$500K for H100 GPUs at \$3/GPU-hour. The H100 achieves 1000 TFLOPS in bf16 with 35% model FLOP utilization. What is the largest Chinchilla-optimal model you can train? How many tokens does it need? How long will training take on 128 GPUs?
Hint 1 - Direction
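Calculate total available FLOPs from the budget, then use $C = 6ND$ with the Chinchilla constraint $D = 20N$.
Hint 2 - Insight
Total FLOPs = $\text{GPU-hours} \times 3600 \times \text{peak FLOPS} \times \text{MFU}$. With Chinchilla: $C = 6N \cdot 20N = 120N^2$, so $N = \sqrt{C/120}$.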
Hint 3 - Full Solution + Rubric
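Step 1: Total compute
$$\text{GPU-hours} = \frac{500{,}000}{3} \approx 166{,}667, \qquad C = 166{,}667 \times 3600 \times 10^{15} \times 0.35 \approx 2.1 \times 10^{23} \text{ FLOPs}$$
Step 2: Chinchilla-optimal model size
$$N = \sqrt{\frac{2.1 \times 10^{23}}{120}} \approx 4.2 \times 10^{10} = 42\text{B}$$
Step 3: Training tokens
$$D = 20N \approx 840\text{B}$$
Step 4: Training time on 128 GPUs
$$\frac{166{,}667 \text{ GPU-hours}}{128 \text{ GPUs}} \approx 1{,}302 \text{ hours} \approx 54 \text{ days}$$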
Answer: ~42B parameter model, ~840B tokens, ~54 days on 128 H100s.
Scoring Rubric:
| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Correct setup | All formulas correct | Minor errors in computation | Could not set up the problem |
| MFU awareness | Used 35% MFU | Assumed 100% utilization | Did not know what MFU means |
| Final answer | Within 20% of 42B | Within 50% | Off by more than 2x |
| Practical follow-up | "54 days is risky - need checkpoint every 30 min, spare nodes, and monitoring" | "That is a long time" | No consideration of practicality |
Problem 2: Data Pipeline Design
You are building a pretraining dataset for a 7B model focused on code and technical documentation. You have access to Common Crawl, GitHub, arXiv, and Stack Overflow. Design the data pipeline from raw data to training-ready tokens. Include filtering, deduplication, and mixing strategy.
Hint 1 - Direction
For each data source, think about: what is the quality distribution? What needs to be filtered out? How do you handle duplicates within and across sources?
Hint 2 - Insight
GitHub has a lot of auto-generated code, vendored dependencies, and low-quality repos. Common Crawl technical pages overlap with Stack Overflow. arXiv needs PDF-to-text conversion and has formatting issues. Design specific filters for each source, then global dedup, then mix.
Hint 3 - Full Solution + Rubric
Source-specific pipelines:
GitHub:
- Filter: stars greater than 5, exclude forks, auto-generated files (node_modules, vendor), binaries
- Language detection: keep top 20 programming languages
- Quality: remove files under 100 bytes or over 1MB, filter high-entropy files (minified JS)
- License filter: keep permissive licenses only (MIT, Apache, BSD)
- Near-dedup: MinHash on file content (80% threshold)
Common Crawl:
- URL filter: prioritize technical domains (`docs.*`, `developer.*`, `*.readthedocs.io`)
- Text extraction: trafilatura for main content extraction
- Quality: perplexity filter (Wikipedia-trained KenLM), length filter, repetition filter
- Language: English only (fastText)
- Near-dedup: MinHash on paragraphs
arXiv:
- PDF to text: GROBID or Nougat for structure-aware extraction
- Filter: keep `cs.*`, `stat.ML`, `math.*` categories
- Quality: remove papers with poor extraction (high formula-to-text ratio without proper rendering)
- Deduplicate across versions (v1, v2, etc.)
Stack Overflow:
- Filter: accepted answers only, or score greater than 5
- Format: question + accepted answer as a single document
- Remove code-only answers (no explanation)
Global pipeline:
- Cross-source deduplication: MinHash across all sources (web pages that copy SO answers, etc.)
- PII removal: regex for emails, phone numbers, API keys, IP addresses
- Tokenization: BPE with 32K vocab, trained on the combined corpus
Mixing (for code-focused 7B model):
- Code (GitHub): 40%
- Technical web (CC filtered): 25%
- Academic (arXiv): 15%
- Stack Overflow: 10%
- General web (CC general): 10%
Total target: 2T tokens (inference-aware over-training for 7B model)
Scoring Rubric:
| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Source-specific filters | Tailored to each source | Generic filters for all | No source-specific thinking |
| Deduplication | Within-source and cross-source | Mentioned dedup vaguely | No dedup strategy |
| Quality focus | Multiple quality signals | One filter | "Use all the data" |
| Data mixing | Justified ratios for code focus | Mentioned mixing | Equal split or random |
Problem 3: Scaling Law Application
A team trained a 1B model on 20B tokens and achieved a validation loss of 2.8. They trained a 3B model on 60B tokens and got 2.4. Assuming power-law scaling, predict the loss for a 10B model trained on 200B tokens. Is 200B tokens Chinchilla-optimal for 10B parameters?
Hint 1 - Direction
Use the power law $L(C) = L_\infty + A \cdot C^{-\alpha}$. With two data points, you can estimate the scaling behavior (with simplifying assumptions).
Hint 2 - Insight
Simpler approach: if both models are at Chinchilla-optimal (both have $D/N = 20$), then loss scales primarily with compute $C = 6ND$. $C_1 = 1.2 \times 10^{20}$, $C_2 = 1.08 \times 10^{21}$. The loss went from 2.8 to 2.4 with ~9x compute. Use this to extrapolate.
Hint 3 - Full Solution + Rubric
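Approach: Both data points are Chinchilla-optimal ($D/N = 20$), so we can use the compute scaling law:
$$L(C) = L_\infty + A \cdot C^{-\alpha}$$
Compute for each:
$$C_1 = 6 \times 10^{9} \times 2 \times 10^{10} = 1.2 \times 10^{20}, \quad C_2 = 6 \times 3 \times 10^{9} \times 6 \times 10^{10} = 1.08 \times 10^{21}, \quad C_3 = 6 \times 10^{10} \times 2 \times 10^{11} = 1.2 \times 10^{22}$$
Assuming $L_\infty \approx 1.7$ (typical for well-filtered web data; close to the irreducible term in the Chinchilla fit):
From data points:
$$2.8 - 1.7 = A \cdot C_1^{-\alpha}, \qquad 2.4 - 1.7 = A \cdot C_2^{-\alpha}$$
Dividing:
$$\frac{1.1}{0.7} = \left(\frac{C_2}{C_1}\right)^{\alpha} = 9^{\alpha} \;\Rightarrow\; \alpha = \frac{\ln 1.571}{\ln 9} \approx 0.206$$
For $C_3$ (about 11x $C_2$):
$$L(C_3) = 1.7 + 0.7 \times 11.1^{-0.206} \approx 1.7 + 0.43 \approx 2.1$$
Predicted loss: ~2.1
Is 200B Chinchilla-optimal for 10B? Yes: $200\text{B} / 10\text{B} = 20$, which matches the Chinchilla ratio.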
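The two-point fit can be checked in a few lines. The sketch assumes the form $L(C) = L_\infty + A \cdot C^{-\alpha}$ and takes the irreducible loss $L_\infty$ as an input, since two data points cannot pin down three unknowns; the value 1.7 used below is an assumption, and the function name is illustrative.

```python
import math


def fit_and_predict(c1, l1, c2, l2, c3, l_inf):
    """Fit alpha from two (compute, loss) points, then extrapolate to c3."""
    # (l1 - l_inf) / (l2 - l_inf) = (c2 / c1) ** alpha
    alpha = math.log((l1 - l_inf) / (l2 - l_inf)) / math.log(c2 / c1)
    predicted = l_inf + (l2 - l_inf) * (c3 / c2) ** (-alpha)
    return alpha, predicted


c1 = 6 * 1e9 * 20e9     # 1B params, 20B tokens
c2 = 6 * 3e9 * 60e9     # 3B params, 60B tokens
c3 = 6 * 10e9 * 200e9   # 10B params, 200B tokens
for l_inf in (1.5, 1.7):
    alpha, loss = fit_and_predict(c1, 2.8, c2, 2.4, c3, l_inf)
    print(l_inf, round(alpha, 3), round(loss, 2))
```

Trying a second value of $L_\infty$ shows that the prediction stays near 2.1 even though the fitted exponent changes, which is a useful caveat to say out loud in an interview.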
Scoring Rubric:
| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Set up scaling law | Correct formula with compute | Vague "loss decreases with scale" | Cannot set up |
| Numerical answer | Within 0.2 of ~2.1 | Within 0.5 | Cannot compute |
| Chinchilla check | Correctly identifies $D/N = 20$ | Mentions Chinchilla vaguely | Does not check |
| Caveats | "This assumes perfect scaling - real loss depends on data quality, training stability" | Mentions uncertainty | Presents answer as certain |
Problem 4: Parallelism Strategy
You need to train a 30B parameter model on a cluster of 256 H100 GPUs (32 nodes, 8 GPUs per node, NVLink within node, InfiniBand across nodes). Each GPU has 80 GB memory. Design the 3D parallelism strategy. Estimate memory usage per GPU.
Hint 1 - Direction
Start with memory requirements: 30B params in bf16 = 60 GB for weights alone. Optimizer states (Adam) = 12 bytes per param = 360 GB. This does not fit on one GPU or even one node.
Hint 2 - Insight
With FSDP (ZeRO-3), optimizer states are sharded. TP within nodes (8-way) to split layer computation. Consider PP = 2 or 4 for additional memory relief. DP handles the rest.
Hint 3 - Full Solution + Rubric
Memory analysis (no parallelism):
- Model weights (bf16): $30\text{B} \times 2 \text{ bytes} = 60$ GB
- Adam optimizer: $30\text{B} \times 12 \text{ bytes} = 360$ GB (fp32 weights + first moment + second moment)
- Gradients (bf16): 60 GB
- Activations: ~40-80 GB (depends on batch size and sequence length)
- Total: ~520-560 GB
Proposed strategy:
| Dimension | Degree | Scope | Rationale |
|---|---|---|---|
| TP | 4 | Within node | Splits each layer across 4 GPUs; 2 TP groups per node |
| PP | 2 | Across 2 nodes | Splits model into 2 pipeline stages |
| DP (FSDP) | 32 | Remaining dimension | 256 / (4 * 2) = 32-way data parallel with full sharding |
Memory per GPU with this config:
- Model weights: $60 / 4 = 15$ GB per GPU (only this GPU's TP shard; PP=2 would halve this again, so it is a conservative figure)
- Optimizer states: $360 / 256 \approx 1.4$ GB (sharded across all DP ranks, TP, and PP)
- Gradients (sharded): ~0.5 GB
- Activations (with checkpointing): ~10-20 GB
- Total: ~27-37 GB per GPU - fits comfortably in 80 GB
Why TP=4 not TP=8:
- TP=8 means all 8 GPUs in a node work on the same layer - maximum layer splitting
- TP=4 leaves room for 2 independent TP groups per node, which can be in different pipeline stages or DP groups
- TP=4 also has less communication overhead than TP=8
Effective batch size: With DP=32, micro-batch=2 per GPU, grad accumulation=4: $32 \times 2 \times 4 = 256$ sequences. At 4096 tokens/seq: ~1M tokens per step.
Scoring Rubric:
| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Memory calculation | Correct breakdown | Approximate but reasonable | Could not estimate |
| TP within node | Correctly placed within NVLink | Used TP across nodes | Did not consider hardware topology |
| FSDP for optimizer | Sharded optimizer across DP ranks | Mentioned FSDP | Full optimizer on every GPU |
| Practical details | Batch size, activation checkpointing, effective throughput | Mentioned some considerations | Pure theory |
Interview Cheat Sheet
| Concept | Key Fact / Formula | Common Follow-Up |
|---|---|---|
| Data filtering | Wikipedia-quality classifier + heuristics (length, repetition, perplexity) | "How do you handle bias in the filter?" |
| Deduplication | MinHash + LSH for fuzzy dedup; cross-source is critical | "What similarity threshold? How to scale?" |
| BPE tokenization | Iteratively merge most frequent pairs; byte-level for open vocab | "Vocab size tradeoffs? Multilingual impact?" |
| Causal LM objective | $\mathcal{L} = -\frac{1}{T}\sum_t \log P_\theta(x_t \mid x_{<t})$; equivalent to compression | "Connection to perplexity? To entropy?" |
| FIM | PRE + SUF + MID format; 50% rate; critical for code infilling | "Why not 100% FIM?" |
| Chinchilla | $D \approx 20N$ for compute-optimal; 70B needs 1.4T tokens | "Why is LLaMA 3 8B trained on 15T?" |
| Inference-aware | Over-train smaller models for cheaper serving | "What is the cost crossover point?" |
| Training FLOPs | $C \approx 6ND$; GPU-hours = $C / (\text{peak FLOPS} \times \text{MFU} \times 3600)$ | "What MFU is realistic?" |
| 3D parallelism | TP within node (NVLink), PP across few nodes, DP for the rest | "Why not TP=8 everywhere?" |
| FSDP | Shard optimizer + gradients + params across DP ranks | "Memory per GPU with ZeRO-3?" |
| Checkpointing | Every 100-1000 steps; async to distributed storage | "MTBF for 1000 GPUs?" |
| Data mixing | Code ~20-40%, web ~30-50%, math ~5-10%; curriculum over training | "How do you decide the mix?" |
Spaced Repetition Checkpoints
Day 0 (After reading this chapter)
- Draw the data pipeline from raw crawl to training-ready tokens (10 stages)
- Write the causal LM loss formula and explain its connection to compression
- Calculate: how many tokens is Chinchilla-optimal for a 13B model?
- List the three parallelism dimensions and when each is used
Day 3
- Explain MinHash deduplication in 60 seconds
- Compare BPE, SentencePiece, and tiktoken - key differences
- Why did the field move from Chinchilla-optimal to inference-aware scaling?
- What is the difference between activation checkpointing and model checkpointing?
Day 7
- Design a data pipeline for a code-focused model (from memory)
- Calculate training compute for a 7B model on 2T tokens; estimate GPU-hours on 64 H100s
- Explain 3D parallelism for a 70B model on 512 GPUs - assign TP, PP, DP degrees
- What is FIM and why is it important for code models?
Day 14
- Do Practice Problem 1 (compute budget) from scratch, timed (10 min)
- Explain curriculum learning - what changes during training and why
- Discuss fault tolerance: checkpointing frequency, recovery protocol, spare nodes
- Quiz yourself on all cheat sheet entries - 60 seconds each
Day 21
- Full mock: "You have $1M and need the best code model possible. Walk me through everything."
- Re-take the self-assessment - all scores should be 4+
- Solve Problem 3 (scaling law extrapolation) without hints
- Explain the end-to-end pretraining pipeline in 5 minutes to a non-expert
