LLM Pretraining - From Raw Data to Foundation Models

Reading time: ~50 min | Interview relevance: Critical (Tier 1-2) / High (Tier 3-4) | Roles: MLE, Research Eng, ML Infra Eng, LLM Eng

The Real Interview Moment

You are interviewing at a compute-focused AI startup. The lead researcher presents this scenario: "We have a cluster of 512 H100 GPUs and a budget of $2M in compute. We want to train the best possible model. How large should the model be? How much data do we need? How long will it take? Walk me through every decision from data collection to the first checkpoint."

You start with the compute budget. She pushes: "You said Chinchilla-optimal. But we are optimizing for inference cost, not training cost. How does that change your sizing? And what about data quality - does the scaling law still hold if 30% of our data is noisy?"

This is the definitive Tier 1/Tier 2 interview question. It tests whether you understand the end-to-end pretraining pipeline - not just the training loop, but the data engineering, tokenization decisions, scaling tradeoffs, and infrastructure challenges that consume 80% of the actual work.

Candidates who can only describe the training objective get a "lean hire" at best. Candidates who can reason about data quality, compute allocation, parallelism strategy, and failure recovery - with concrete numbers - get a "strong hire."

What You Will Master

Design a data collection and filtering pipeline with quality heuristics
Explain deduplication methods (MinHash, exact match, n-gram) and why they matter
Compare BPE, SentencePiece, and tiktoken tokenization with vocabulary tradeoffs
Derive the causal language modeling objective and its connection to compression
Apply Chinchilla and inference-aware scaling laws with compute budget math
Plan 3D parallelism (data, tensor, pipeline) for a given GPU cluster
Design checkpointing and fault tolerance for multi-week training runs
Explain curriculum learning and data mixing strategies

Self-Assessment: Where Are You Now?

Skill	1 -- Cannot	2 -- Vaguely	3 -- Can Explain	4 -- Can Derive	5 -- Can Teach	Your Score
Design a data collection pipeline						___
Explain deduplication methods						___
Compare BPE vs SentencePiece vs tiktoken						___
Derive causal LM training objective						___
Apply Chinchilla scaling law						___
Calculate compute budgets (GPU-hours)						___
Plan 3D parallelism						___
Design fault tolerance for training						___
Explain curriculum learning						___

Target: All 4s and 5s before your interview.

Part 1 - Data Collection and Filtering

The Data Pipeline

Pretraining data quality is the single most impactful factor in model quality. The pipeline looks like this:

[Figure: Pretraining Data Pipeline]

Data Sources

Source	Volume	Quality	Use Case
Common Crawl	~250B pages, ~100T tokens raw	Low (needs heavy filtering)	Web knowledge, general text
Wikipedia	~4B tokens (English)	High	Factual knowledge
Books (Books3, Gutenberg)	~10-50B tokens	Medium-High	Long-form reasoning, narrative
Academic papers (arXiv, S2ORC)	~50B tokens	High	Scientific reasoning
Code (GitHub, StackOverflow)	~500B+ tokens	Medium	Code generation, logical reasoning
Math (textbooks, competition data)	~10B tokens	High	Mathematical reasoning
Curated instruction data	~1-10B tokens	Very High	Instruction following

Interviewer's Perspective

A common interview question is "If you could only improve one thing about your pretraining data, what would it be?" The answer is almost always data quality, not quantity. LLaMA 3 was trained on 15T tokens - but the data team spent months building quality classifiers. RedPajama and FineWeb showed that better filtering on Common Crawl data can match proprietary datasets.

Quality Filtering Heuristics

The following heuristics are used by most open-source data pipelines (C4, RedPajama, FineWeb, Dolma):

Filter	What It Catches	Implementation
Min word count	Stub pages, navigation menus	Remove documents with fewer than 50 words
Mean word length	Garbled text, encoded content	Remove if mean word length is outside 3-10 characters
Symbol-to-word ratio	Markup-heavy pages, log files	Remove if more than 10% of tokens are special characters
Repetition ratio	Boilerplate, template pages	Remove if any n-gram (2-4 words) repeats more than a threshold
"Dirty word" ratio	Adult content	Remove if offensive word fraction exceeds threshold
Perplexity filter	Gibberish, auto-generated text	Train a small LM on Wikipedia; remove high-perplexity docs
Classifier filter	General low quality	Train a classifier (e.g., fastText) on Wikipedia (positive) vs random web (negative)

The "Wikipedia quality" classifier is a powerful idea: train a binary classifier to distinguish "looks like Wikipedia" from "looks like random web." Documents scoring high are more likely to be well-written, factual, and informative. This was used by GPT-3, LLaMA, and most subsequent models.
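A few of the heuristics in the table can be sketched as a single predicate. This is a minimal illustration, not the filter used by any particular pipeline: the function name is hypothetical, the 50-word, 3-10 character, and 10% thresholds come from the table, and the 20% repeated-trigram threshold is an assumption chosen only for the example.

```python
import re
from collections import Counter

def passes_quality_heuristics(text, min_words=50, mean_len_range=(3, 10),
                              max_symbol_ratio=0.10, max_trigram_share=0.20):
    """Return True if a document survives a few simple heuristic filters."""
    words = text.split()
    # Min word count: drop stub pages and navigation menus.
    if len(words) < min_words:
        return False
    # Mean word length: drop garbled or encoded content.
    mean_len = sum(len(w) for w in words) / len(words)
    if not (mean_len_range[0] <= mean_len <= mean_len_range[1]):
        return False
    # Symbol-to-word ratio: whitespace tokens with no alphanumeric character.
    symbols = sum(1 for w in words if not re.search(r"\w", w))
    if symbols / len(words) > max_symbol_ratio:
        return False
    # Repetition: share of all trigrams taken by the most common trigram
    # (the 20% default is an assumed threshold, not a published one).
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if trigrams:
        top_count = Counter(trigrams).most_common(1)[0][1]
        if top_count / len(trigrams) > max_trigram_share:
            return False
    return True
```

In a real pipeline these cheap checks would typically run before the more expensive perplexity and classifier filters, so that most junk is discarded early.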

Deduplication

Deduplication is critical because web crawls contain enormous amounts of repeated content (boilerplate, scraped sites, template pages). Training on duplicates:

Wastes compute
Memorizes specific text (privacy risk)
Biases the model toward overrepresented content
Can cause training instability

Deduplication methods:

[Figure: Deduplication Methods]

MinHash deduplication is the most commonly used method. The algorithm:

For each document, create a set of n-gram shingles (e.g., all 5-word sequences)
Apply $k$ hash functions to each shingle set - keep the minimum hash for each function
The resulting $k$ values form the "MinHash signature"
Use Locality-Sensitive Hashing (LSH) to group documents with similar signatures
Within each group, compute exact Jaccard similarity and remove near-duplicates (typically above 0.8 similarity)
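The signature steps above can be sketched with the standard library alone. This is an illustrative sketch rather than a production deduplicator: the function names are hypothetical, salted BLAKE2b stands in for the $k$ independent hash functions, and the LSH banding step is omitted, so signatures are compared pairwise.

```python
import hashlib

def shingles(text, n=5):
    """Set of n-word shingles (the whole text if it has fewer than n words)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(shingle_set, k=128):
    """k minimum hash values, one per salted hash function."""
    signature = []
    for seed in range(k):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little")
            for s in shingle_set))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)
```

The estimate's standard error shrinks roughly as $1/\sqrt{k}$, which is why candidates found via signatures are usually confirmed with the exact Jaccard similarity before removal.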

Common Trap

Candidates often forget to mention cross-document deduplication across data sources. Wikipedia text appears in Common Crawl, books appear in multiple scraped sites, etc. You must deduplicate across all sources, not just within each source. This is a significant engineering challenge at the scale of trillions of tokens.

Part 2 - Tokenization

Why Tokenization Matters for LLMs

The tokenizer defines the model's "vocabulary" - its atomic units of processing. Every LLM design decision is affected:

Context window: 128K tokens is not 128K words. Typical ratio is 1 word $\approx$ 1.3 tokens for English, but much worse for non-Latin scripts
Training cost: More tokens per document = more FLOPs to process the same text
Multilingual quality: A tokenizer trained primarily on English fragments non-Latin text into many small tokens, degrading quality
Code quality: Whitespace handling matters - Python indentation should not consume excessive tokens

BPE (Byte Pair Encoding)

The standard tokenization algorithm for LLMs. Starting from a character vocabulary, iteratively merge the most frequent adjacent pair:

Algorithm:

Start with a vocabulary of all individual characters (or bytes)
Count all adjacent pairs in the training corpus
Merge the most frequent pair into a new token
Repeat steps 2-3 until vocabulary reaches target size

Example:

Corpus: "low low low low low lowest lowest newer newer wider"

Initial vocab: {l, o, w, e, s, t, n, r, i, d, <space>}

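Pairs are counted within words and weighted by word frequency; ties are broken by first occurrence.

Step 1: Most frequent pair: (l, o), count 7 (tied with (o, w)) → merge into "lo"
Step 2: Most frequent pair: (lo, w), count 7 → merge into "low"
Step 3: Most frequent pair: (e, r), count 3 ("newer" x2, "wider" x1) → merge into "er"
Step 4: All remaining candidates are tied at count 2, including (low, e), (e, s), (s, t), (n, e), (e, w), and (w, er), so the next merge depends on the tie-breaking rule
...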
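The merge loop can be sketched in a few lines. This is a toy trainer under stated assumptions, not the implementation used by any production tokenizer: the function name is hypothetical, words are split on whitespace, pairs are counted within words only, and ties go to the pair encountered first.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Return the list of merged pairs learned from a whitespace-split corpus."""
    word_freqs = Counter(corpus.split())
    # Each distinct word starts as a tuple of single characters.
    words = {w: tuple(w) for w in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for w, symbols in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += word_freqs[w]
        if not pair_counts:
            break
        # max() returns the first maximal key, so ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        for w, symbols in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            words[w] = tuple(merged)
    return merges

corpus = "low low low low low lowest lowest newer newer wider"
print(train_bpe(corpus, 2))
```

On the toy corpus the first two merges are `('l', 'o')` and `('lo', 'w')`. Recounting all pairs on every iteration is quadratic and only suitable for illustration; real trainers update counts incrementally.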

BPE Variants in Modern LLMs

Tokenizer	Base Unit	Vocab Size	Used By	Key Feature
GPT-2 BPE	Bytes	50,257	GPT-2, GPT-3	Byte-level fallback
SentencePiece	Unicode	Varies	LLaMA 1/2 (BPE mode), T5 (unigram mode)	Language-agnostic, whitespace as token
tiktoken (cl100k)	Bytes	100,256	GPT-4, GPT-3.5	Fast (Rust), regex pre-tokenization
LLaMA 3 tokenizer	Bytes	128,256	LLaMA 3	Larger vocab for multilingual

60-Second Answer

"Modern LLMs use Byte Pair Encoding or its variants. BPE starts with individual bytes or characters and iteratively merges the most frequent pairs until reaching a target vocabulary size. The key engineering decisions are: (1) vocabulary size - larger vocabs like LLaMA 3's 128K reduce sequence length but increase embedding table size, (2) byte-level vs character-level base - byte-level ensures nothing is out-of-vocabulary, (3) pre-tokenization regex - tiktoken uses regex to split on whitespace and punctuation before BPE, preventing merges across word boundaries. The tokenizer directly impacts context length, multilingual quality, and inference cost."

Vocabulary Size Tradeoffs

\text{Embedding parameters} = V \times d

For $d = 4096$ :

$V = 32{,}000$ (LLaMA 2): 131M params (~0.5 GB in fp32)
$V = 128{,}256$ (LLaMA 3): 525M params (~2 GB in fp32)

Larger vocab means:

Shorter sequences (fewer tokens per document) = less KV cache, faster inference
Better multilingual (more tokens dedicated to non-English scripts)
Larger embedding table (more parameters, more memory)
Sparser updates (rare tokens get updated less during training)

[Figure: Vocabulary Size Tradeoffs]

The "Fertility" Problem

Fertility = average number of tokens per word for a given language. English is efficient (~1.3 tokens/word with most tokenizers). Other languages suffer:

Language	Fertility (tiktoken cl100k)	Effective Context (128K tokens)
English	~1.3	~98K words
Spanish	~1.5	~85K words
Chinese	~1.8	~71K words
Hindi	~3.5	~37K words
Thai	~4.0	~32K words

This means a 128K context model effectively has 3x less context for Thai than for English. LLaMA 3 addressed this partially by quadrupling the vocabulary size from 32K to 128K.
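The right-hand column of the fertility table is simply the context length divided by fertility. A small sketch, using the approximate fertility values from the table (the function name is hypothetical):

```python
def effective_context_words(context_tokens, fertility):
    """Approximate number of words that fit in a context window."""
    return context_tokens / fertility

# Approximate tokens-per-word values taken from the table above.
fertility = {"English": 1.3, "Spanish": 1.5, "Chinese": 1.8, "Hindi": 3.5, "Thai": 4.0}
for language, tokens_per_word in fertility.items():
    words = effective_context_words(128_000, tokens_per_word)
    print(f"{language}: ~{words / 1000:.0f}K words")
```

The same division applies to cost: at a fixed price per token, a language with 3x the fertility costs about 3x as much to process per word.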

Part 3 - Training Objectives

Causal Language Modeling (CLM)

The primary pretraining objective for decoder-only LLMs:

\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, x_2, \ldots, x_{t-1}; \theta)

Minimize the negative log-likelihood of each token given all preceding tokens. This is equivalent to compression - a model with lower loss is a better compressor of text.

Connection to perplexity:

\text{PPL} = \exp(\mathcal{L}_{\text{CLM}} / T)

A perplexity of 10 means the model is, on average, as uncertain as if it had to choose uniformly among 10 equally likely options for each next token.
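The loss, perplexity, and compression views can be tied together numerically. The sketch below assumes you already have the probability the model assigned to each actual next token; the function name is hypothetical.

```python
import math

def clm_metrics(token_probs):
    """token_probs[t] is P(x_t | x_<t) for the token that actually occurred."""
    nll = -sum(math.log(p) for p in token_probs)        # L_CLM, summed over the sequence
    perplexity = math.exp(nll / len(token_probs))       # exp of the mean per-token loss
    bits_per_token = nll / len(token_probs) / math.log(2)  # code length under an ideal coder
    return nll, perplexity, bits_per_token

# A model that assigns probability 0.1 to every correct token has perplexity 10.
nll, ppl, bpt = clm_metrics([0.1] * 5)
print(round(ppl, 6), round(bpt, 3))
```

The bits-per-token figure is the compression reading of the same quantity: lowering the loss by $\ln 2$ nats per token saves one bit per token for an ideal entropy coder driven by the model.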

Interviewer's Perspective

The connection between language modeling and compression is a favorite interview topic at research-oriented companies. Key insight: a model that perfectly predicts the next token achieves optimal compression. Shannon's source coding theorem tells us the minimum description length equals the entropy of the source. So "better language model" and "better compressor" are mathematically the same thing.

Prefix Language Modeling

Used by UL2, U-PaLM, and some hybrid architectures. The input sequence is split into a prefix (bidirectional attention) and a target (causal attention):

\mathcal{L}_{\text{PrefixLM}} = -\sum_{t=P+1}^{T} \log P(x_t \mid x_1, \ldots, x_P, x_{P+1}, \ldots, x_{t-1}; \theta)

The first $P$ tokens are the prefix - they can attend to each other bidirectionally. Loss is computed only on the target tokens $P+1, \ldots, T$ .

Why it helps: For tasks where the "input" is well-defined (e.g., a document to summarize), bidirectional attention on the input creates better representations. But it reduces training efficiency (fewer tokens contribute to the loss).

Fill-in-the-Middle (FIM)

Used by code models (Code LLaMA, StarCoder, DeepSeek Coder). During training, randomly split a document into three parts:

Original: [A][B][C]
FIM input: <PRE>[A]<SUF>[C]<MID>[B]

The model learns to predict the middle given the prefix and suffix. This is critical for code completion where you need to fill in a function body given the signature and the code below.

FIM rate: The transformation is applied to each training example with probability $p$ (the FIM rate). A common choice is $p = 0.5$, so about half of the examples use FIM and half use standard CLM.
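The transformation itself is a small data-preprocessing step. The sketch below is an illustration under assumptions: it works on characters rather than token ids, the split points are uniform random, and the sentinel strings are the placeholders from the format line above (real tokenizers use dedicated special tokens). The function name is hypothetical.

```python
import random

def fim_transform(doc, fim_rate=0.5, rng=random):
    """With probability fim_rate, rearrange doc into prefix-suffix-middle order."""
    if len(doc) < 2 or rng.random() >= fim_rate:
        return doc  # keep as a standard left-to-right CLM example
    # Two distinct cut points split the document into [A][B][C].
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

example = fim_transform("def add(a, b):\n    return a + b\n", fim_rate=1.0,
                        rng=random.Random(0))
print(example)
```

Because the rearranged example is still trained with the ordinary next-token loss, no architectural change is needed; only the data loader differs.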

[Figure: Fill-in-the-Middle Training]

Part 4 - Scaling Laws

The Kaplan Scaling Laws (OpenAI, 2020)

The first quantitative scaling laws showed that loss scales as a power law with model size, dataset size, and compute:

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}

where $N$ = parameters, $D$ = tokens, $C$ = compute (FLOPs), and $\alpha_N \approx 0.076$ , $\alpha_D \approx 0.095$ .

Key insight from Kaplan: For a fixed compute budget, larger models trained on less data outperform smaller models trained on more data. This led to GPT-3 (175B params trained on 300B tokens).

Chinchilla Scaling Law (Hoffmann et al., 2022)

Chinchilla revisited the Kaplan scaling laws with more careful experiments and found a different optimal allocation:

N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}

The ratio of tokens to parameters should be approximately $D/N \approx 20$ . This means:

Model Size	Chinchilla-Optimal Tokens	Example
1B	20B
7B	140B
70B	1.4T	Chinchilla (70B, 1.4T tokens)
175B	3.5T	GPT-3 was undertrained (only 300B)

Common Trap

Many candidates recite "Chinchilla says 20 tokens per parameter" without understanding the implications. The real insight is that GPT-3 was massively undertrained - it should have seen 12x more data. But this also means training GPT-3 to Chinchilla-optimal would cost 12x more compute. The Chinchilla law tells you the compute-optimal frontier, not the practical one.

Inference-Aware Scaling Laws (2023-2024)

Chinchilla optimizes for training cost. But in practice, inference cost dominates total cost of ownership. A model served for millions of queries per day costs far more in inference than it cost to train.

The inference-aware insight: It is cheaper to train a smaller model on MORE data than Chinchilla-optimal, because the smaller model is cheaper to serve.

\text{Total Cost} = C_{\text{train}} + C_{\text{inference}} \times T_{\text{deployment}}

If $T_{\text{deployment}}$ is large (high query volume), the optimal strategy shifts toward smaller models:

Strategy	Model Size	Training Tokens	Train Cost	Inference Cost/Query	Best When
Chinchilla-optimal	70B	1.4T	$10M	$0.001	Low query volume
Inference-aware	7B	2-5T	$2M	$0.0001	High query volume
Over-train	1B	3T	$200K	$0.00001	Edge/mobile

This explains LLaMA 3 8B trained on 15T tokens (about 1,875 tokens per parameter, roughly 94x the Chinchilla-optimal ratio of 20) - Meta optimized for inference cost, not training cost.

[Figure: Scaling Strategies]

60-Second Answer

"Chinchilla scaling says you should train with about 20 tokens per parameter to minimize training loss for a given compute budget. But this optimizes training cost, not total cost. In practice, LLaMA 3 8B was trained on 15 trillion tokens - roughly 94 times the Chinchilla-optimal amount - because a smaller model that is over-trained is much cheaper to serve. The right scaling law depends on your deployment scenario: if you are serving millions of queries, over-training a smaller model is more cost-effective than Chinchilla-optimal training of a larger model."

Compute Budget Estimation

Given a compute budget, find the optimal model:

C = 6ND \quad \text{(training FLOPs)}

For a budget of $B$ dollars at cost $p$ per GPU-hour:

C = \frac{B}{p} \times \text{GPU\_FLOPS} \times \text{MFU} \times 3600

Example: Budget = $2M, H100 at $3/GPU-hour, 1000 TFLOPS bf16, 40% MFU:

C = \frac{2{,}000{,}000}{3} \times 10^{15} \times 0.4 \times 3600 = 9.6 \times 10^{23} \text{ FLOPs}

Chinchilla-optimal: with $D = 20N$, $C = 6N \cdot 20N = 120N^2$, so $N = \sqrt{C/120} = \sqrt{8 \times 10^{21}} \approx 9 \times 10^{10}$ and $D = 20N \approx 1.8 \times 10^{12}$

So approximately a 90B parameter model on ~1.8T tokens - but this is Chinchilla-optimal. For inference-aware, you would train a 70B model on ~2.3T tokens, or a 7B model on ~23T tokens.
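For a fixed compute budget, the token count for any chosen model size follows directly from $C = 6ND$. A small sketch of that arithmetic, with hypothetical function names and the example's prices and utilization as inputs:

```python
def training_flops(budget_usd, usd_per_gpu_hour, peak_flops, mfu):
    """Total useful training FLOPs purchasable with the budget."""
    gpu_hours = budget_usd / usd_per_gpu_hour
    return gpu_hours * peak_flops * mfu * 3600

def tokens_for_model(total_flops, n_params):
    """Tokens affordable for a given model size, from C = 6 * N * D."""
    return total_flops / (6 * n_params)

C = training_flops(2_000_000, 3.0, 1e15, 0.40)
print(f"C = {C:.2e} FLOPs")
for n in (70e9, 7e9):
    print(f"{n / 1e9:.0f}B params -> {tokens_for_model(C, n) / 1e12:.1f}T tokens")
```

Sweeping the model size this way makes the tradeoff explicit: every 10x reduction in parameters buys 10x more training tokens at the same cost.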

Part 5 - Training Infrastructure

The Three Dimensions of Parallelism

Training large LLMs requires distributing computation across hundreds or thousands of GPUs. The three main parallelism strategies address different bottlenecks:

[Figure: 3D Parallelism]

Data Parallelism (DP)

Each GPU holds a complete copy of the model. The batch is split across GPUs, and gradients are synchronized (all-reduce) after each step.

Memory per GPU: Full model + optimizer states + activations for the batch chunk

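Communication: All-reduce of gradients after each step. With a ring all-reduce (reduce-scatter + all-gather), each GPU sends and receives roughly twice the gradient size per step: for a model with $N$ parameters and bf16 gradients, about $2 \times 2N = 4N$ bytes.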

When to use: Always used as the outer parallelism dimension. Scales almost linearly with number of GPUs if communication is overlapped with computation.

FSDP (Fully Sharded Data Parallelism): Instead of replicating the full model on every GPU, shard the parameters, gradients, and optimizer states across GPUs. Each GPU holds only $1/k$ of the model state. All-gather parameters when needed for computation, then discard.

DP Variant	Memory per GPU	Communication	Complexity
Standard DP	Full model + optimizer	Gradient all-reduce	Low
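ZeRO Stage 1	Full model + gradients, sharded optimizer	Gradient reduce-scatter + parameter all-gather (same volume as standard DP)	Medium
ZeRO Stage 2	Full model, sharded optimizer + gradients	Gradient reduce-scatter + parameter all-gather (same volume as standard DP)	Medium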
ZeRO Stage 3 / FSDP	Sharded everything	All-gather + reduce-scatter per layer	High
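The memory column can be made concrete with a rough estimator for model state (activations excluded). The byte counts are assumptions typical of mixed-precision Adam: 2 bytes per parameter for bf16 weights, 2 for bf16 gradients, and 12 for fp32 master weights plus two Adam moments. The function name is hypothetical.

```python
def model_state_gb_per_gpu(n_params, dp_degree, stage):
    """Approximate per-GPU model-state memory in GB for ZeRO stage 0-3."""
    weights = 2 * n_params     # bf16 weights (assumed)
    grads = 2 * n_params       # bf16 gradients (assumed)
    optimizer = 12 * n_params  # fp32 master weights + Adam m and v (assumed)
    if stage >= 1:
        optimizer /= dp_degree  # Stage 1 shards optimizer states
    if stage >= 2:
        grads /= dp_degree      # Stage 2 also shards gradients
    if stage >= 3:
        weights /= dp_degree    # Stage 3 / FSDP also shards parameters
    return (weights + grads + optimizer) / 1e9

for stage in range(4):
    gb = model_state_gb_per_gpu(7_000_000_000, dp_degree=64, stage=stage)
    print(f"stage {stage}: {gb:.1f} GB")
```

Most of the savings arrive at Stage 1, because optimizer states dominate model-state memory under these byte counts.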

Tensor Parallelism (TP)

Split individual matrix multiplications across GPUs. For attention: split heads across GPUs. For FFN: split the intermediate dimension.

Memory per GPU: $1/k$ of the model parameters per layer (for TP degree $k$ )

Communication: Two all-reduce operations per layer in the forward pass (one after attention, one after the FFN) and two more in the backward pass. Communication happens within a layer, so it must be fast - typically limited to GPUs connected by NVLink within a single node.

When to use: When a single layer does not fit in one GPU's memory, or when you need to reduce per-GPU memory. Typically TP = 2, 4, or 8 (within one node).

Common Trap

Tensor parallelism requires high-bandwidth communication within each layer's forward and backward pass. Never use TP across nodes connected by InfiniBand (too slow). TP is for intra-node (NVLink: 900 GB/s on H100), while DP and PP can go across nodes (InfiniBand: 400-800 Gb/s).

Pipeline Parallelism (PP)

Split layers across GPUs. GPU 0 handles layers 0-9, GPU 1 handles layers 10-19, etc.

Memory per GPU: Only the layers assigned to that GPU

Communication: Send activations between adjacent stages. Much less data than TP or DP gradient sync.

The "bubble" problem: Naive PP has pipeline bubbles where GPUs wait for activations from the previous stage. Mitigated by micro-batching: split the mini-batch into micro-batches and pipeline them.
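The effect of micro-batching can be quantified. For a simple schedule in which all forward passes run before the backward passes (GPipe-style, an assumption here), the idle fraction with $p$ stages and $m$ micro-batches is $(p - 1)/(m + p - 1)$. The function name is hypothetical.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a GPipe-style pipeline schedule."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: {bubble_fraction(4, m):.1%} idle")
```

With a single micro-batch, a 4-stage pipeline is idle 75% of the time; the bubble only becomes small once the number of micro-batches is several times the number of stages.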

[Figure: Pipeline Parallelism]

3D Parallelism in Practice

For a 512-GPU cluster training a 70B model:

Dimension	Degree	Scope	Why
TP	8	Within node (8 GPUs/node)	Split large matrices across NVLink-connected GPUs
PP	4	Across 4 nodes	Split 80 layers into 4 stages of 20 layers
DP	16	Across remaining nodes	512 / (8 * 4) = 16-way data parallel

Total: $8 \times 4 \times 16 = 512$ GPUs

Effective batch size: DP degree $\times$ micro-batch size per GPU $\times$ gradient accumulation steps. Typical: $16 \times 4 \times 8 = 512$ sequences $\times 4096$ tokens/seq $\approx 2\text{M}$ tokens per step.
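The layout arithmetic above is easy to get wrong under interview pressure, so it helps to write it as a check. A minimal sketch with a hypothetical helper; the only constraints it enforces are that TP stays within a node and that TP x PP divides the GPU count:

```python
def plan_3d(total_gpus, tp, pp, micro_batch, grad_accum, seq_len, gpus_per_node=8):
    """Derive the DP degree and per-step batch size for a TP x PP x DP layout."""
    if tp > gpus_per_node:
        raise ValueError("TP should stay within one NVLink-connected node")
    if total_gpus % (tp * pp) != 0:
        raise ValueError("TP * PP must divide the total GPU count")
    dp = total_gpus // (tp * pp)
    sequences = dp * micro_batch * grad_accum
    return {"dp": dp,
            "sequences_per_step": sequences,
            "tokens_per_step": sequences * seq_len}

print(plan_3d(512, tp=8, pp=4, micro_batch=4, grad_accum=8, seq_len=4096))
```

For the 512-GPU example this gives DP = 16 and 512 sequences, or about 2.1M tokens, per optimizer step.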

Checkpointing and Fault Tolerance

A 70B model training run takes weeks to months. Hardware failures are not edge cases - they are guaranteed:

Cluster Size	Mean Time Between Failure (estimated)
64 GPUs	~1 week
512 GPUs	~1 day
4096 GPUs	~2-4 hours
16384 GPUs (LLaMA 3 405B)	Under 1 hour

Checkpointing strategy:

Frequency: Save every 100-1000 steps (balance between safety and I/O overhead)
Asynchronous: Do not block training for checkpoint writes. Save to fast local NVMe, then async copy to distributed storage
Sharded: Each DP rank saves its own shard. Full checkpoint = union of all shards
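Optimizer state: Must save optimizer states (Adam moments) - two fp32 values per parameter, so 2x the parameter count and 4x the bytes of the bf16 weights (8 bytes vs 2 bytes per parameter)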

Recovery protocol:

Detect failure (heartbeat monitoring, NCCL timeout)
Terminate all ranks
Replace failed node (spare pool)
Load latest checkpoint (verify with checksums)
Resume training
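The "verify with checksums" step can be sketched with the standard library. This is a simplified, single-file illustration with hypothetical function names and a hypothetical `.sha256` sidecar convention; real training frameworks write sharded checkpoints through their own APIs.

```python
import hashlib
import os

def save_checkpoint(path, payload: bytes):
    """Write payload atomically, then record its SHA-256 in a sidecar file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())   # make sure the bytes reach disk before the rename
    os.replace(tmp_path, path)  # atomic rename: readers never see a partial file
    with open(path + ".sha256", "w") as f:
        f.write(hashlib.sha256(payload).hexdigest())

def load_checkpoint(path):
    """Read payload and refuse to return it if the checksum does not match."""
    with open(path, "rb") as f:
        payload = f.read()
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    if hashlib.sha256(payload).hexdigest() != expected:
        raise ValueError(f"checksum mismatch for {path}")
    return payload
```

On a checksum mismatch, the recovery logic would fall back to the previous checkpoint rather than resume from corrupt state, which is one reason to retain more than one.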

Interviewer's Perspective

At infrastructure-focused companies, the fault tolerance question is a strong signal. Candidates who can discuss elastic training (adjusting DP degree when nodes fail/join), checkpoint formats (safetensors vs pickle), and training resumption strategies (learning rate schedule, batch size warmup after restart) demonstrate production experience.

Activation Checkpointing (Gradient Checkpointing)

Separate from model checkpointing. During backpropagation, you need activations from the forward pass. Normally all activations are stored in memory. Activation checkpointing discards intermediate activations and recomputes them during the backward pass:

Memory-compute tradeoff:

Without checkpointing: memory = $O(L \times \text{activations per layer})$
With full checkpointing: memory = $O(\sqrt{L} \times \text{activations per layer})$ , compute = ~33% more (one extra forward pass)
Selective checkpointing: only recompute attention (most memory-hungry), keep FFN activations

Part 6 - Curriculum Learning

Data Mixing

The ratio of data sources during training significantly impacts model capabilities:

Data Mix	Effect
More code	Better at reasoning, logic, structured output
More math	Better at quantitative reasoning
More books	Better at long-form coherence, narrative
More web	Better at general knowledge, diverse topics
More multilingual	Better at non-English tasks

The LLaMA 3 data mix (approximate, as described in Meta's report):

General knowledge (mostly web): ~50%
Math and reasoning: ~25%
Code: ~17%
Multilingual: ~8%

Curriculum Strategies

Rather than training on a fixed data distribution, curriculum learning adjusts the data mix over training:

Quality annealing: Start with diverse web data, gradually increase the proportion of high-quality data (academic, curated) toward the end
Domain upsampling: After initial training, do extra passes on domains where the model is weakest (identified by domain-specific eval)
Difficulty ordering: Train on shorter, simpler documents first; gradually increase length and complexity
Code scheduling: Introduce code data after basic language understanding is established
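Quality annealing can be sketched as a linear interpolation between a starting and an ending mixture. The function names and the example weights are hypothetical, chosen only to illustrate the mechanism; real schedules are tuned empirically.

```python
import random

def annealed_mix(start_mix, end_mix, progress):
    """Linearly interpolate source weights; progress runs from 0.0 to 1.0."""
    p = min(max(progress, 0.0), 1.0)
    return {src: (1 - p) * start_mix[src] + p * end_mix[src] for src in start_mix}

def sample_source(mix, rng=random):
    """Pick the data source for the next document according to the mix."""
    sources = list(mix)
    return rng.choices(sources, weights=[mix[s] for s in sources], k=1)[0]

# Illustrative weights only: shift from web-heavy to more curated data.
start = {"web": 0.8, "code": 0.1, "academic": 0.1}
end = {"web": 0.4, "code": 0.3, "academic": 0.3}
for step, total in ((0, 100), (50, 100), (100, 100)):
    print(step, annealed_mix(start, end, step / total))
```

The same structure supports domain upsampling: instead of a fixed end mixture, the target weights can be recomputed from domain-specific eval losses during training.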

[Figure: Curriculum Learning]

Company Variation

Anthropic and Google use sophisticated data mixing schedules that are proprietary. In interviews, demonstrate awareness that data mixing is an active research area. Saying "I would use a fixed 50/30/20 split" is fine for an initial answer, but follow up with "In practice, I would measure domain-specific loss during training and adjust the mix based on which domains are improving slowest."

Long Context Training

Models like Gemini 1.5 (1M context) and LLaMA 3 (128K context) require special training strategies for long contexts:

Context length schedule: Start training with 4K context, gradually extend to 32K, then 128K. Training on 128K from the start is wasteful (most documents are short)
RoPE base frequency adjustment: When extending context, adjust the RoPE base frequency to maintain positional resolution
Ring Attention / sequence parallelism: For very long sequences that exceed a single GPU's memory, distribute the sequence across GPUs
Loss weighting: Optionally weight later tokens in long sequences more heavily to ensure the model learns to use long contexts

Practice Problems

Problem 1: Compute Budget Planning

You have a budget of $500K and access to H100 GPUs at $3/GPU-hour. The H100 achieves 1000 TFLOPS in bf16 with 35% model FLOP utilization. What is the largest Chinchilla-optimal model you can train? How many tokens does it need? How long will training take on 128 GPUs?

Hint 1 - Direction

Calculate total available FLOPs from the budget, then use $C = 6ND$ with the Chinchilla constraint $D \approx 20N$ .

Hint 2 - Insight

Total FLOPs = $\frac{500{,}000}{3} \times 10^{15} \times 0.35 \times 3600 \approx 2.1 \times 10^{23}$ . With Chinchilla: $C = 6N \times 20N = 120N^2$ , so $N = \sqrt{C/120}$ .

Hint 3 - Full Solution + Rubric

Step 1: Total compute

\text{GPU-hours} = \frac{500{,}000}{3} = 166{,}667 \text{ GPU-hours}

C = 166{,}667 \times 1000 \times 10^{12} \times 0.35 \times 3600 = 2.1 \times 10^{23} \text{ FLOPs}

Step 2: Chinchilla-optimal model size

C = 6ND, \quad D = 20N

2.1 \times 10^{23} = 6 \times N \times 20N = 120N^2

N = \sqrt{\frac{2.1 \times 10^{23}}{120}} = \sqrt{1.75 \times 10^{21}} \approx 4.2 \times 10^{10} \approx 42\text{B params}

Step 3: Training tokens

D = 20 \times 42\text{B} = 840\text{B tokens}

Step 4: Training time on 128 GPUs

\text{Wall clock} = \frac{166{,}667 \text{ GPU-hours}}{128 \text{ GPUs}} \approx 1{,}302 \text{ hours} \approx 54 \text{ days}

Answer: ~42B parameter model, ~840B tokens, ~54 days on 128 H100s.
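The four steps can be checked with a short script. The helper name is hypothetical, and the 20 tokens-per-parameter ratio is the Chinchilla rule of thumb used throughout this solution.

```python
import math

def chinchilla_plan(budget_usd, usd_per_gpu_hour, peak_flops, mfu, n_gpus,
                    tokens_per_param=20):
    """Return (parameters, tokens, wall-clock days) for a compute-optimal run."""
    gpu_hours = budget_usd / usd_per_gpu_hour
    total_flops = gpu_hours * peak_flops * mfu * 3600
    # C = 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * ratio))
    n_params = math.sqrt(total_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    days = gpu_hours / n_gpus / 24
    return n_params, n_tokens, days

n, d, days = chinchilla_plan(500_000, 3.0, 1e15, 0.35, n_gpus=128)
print(f"{n / 1e9:.0f}B params, {d / 1e9:.0f}B tokens, {days:.0f} days")
```

Note that the wall-clock estimate depends only on the GPU-hours purchased and the GPU count; MFU changes how much model you get for those hours, not how long they last.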

Scoring Rubric:

Criterion	Strong Hire	Lean Hire	No Hire
Correct setup	All formulas correct	Minor errors in computation	Could not set up the problem
MFU awareness	Used 35% MFU	Assumed 100% utilization	Did not know what MFU means
Final answer	Within 20% of 42B	Within 50%	Off by more than 2x
Practical follow-up	"54 days is risky - need checkpoint every 30 min, spare nodes, and monitoring"	"That is a long time"	No consideration of practicality

Problem 2: Data Pipeline Design

You are building a pretraining dataset for a 7B model focused on code and technical documentation. You have access to Common Crawl, GitHub, arXiv, and Stack Overflow. Design the data pipeline from raw data to training-ready tokens. Include filtering, deduplication, and mixing strategy.

Hint 1 - Direction

For each data source, think about: what is the quality distribution? What needs to be filtered out? How do you handle duplicates within and across sources?

Hint 2 - Insight

GitHub has a lot of auto-generated code, vendored dependencies, and low-quality repos. Common Crawl technical pages overlap with Stack Overflow. arXiv needs PDF-to-text conversion and has formatting issues. Design specific filters for each source, then global dedup, then mix.

Hint 3 - Full Solution + Rubric

Source-specific pipelines:

GitHub:

Filter: stars greater than 5, exclude forks, auto-generated files (node_modules, vendor), binaries
Language detection: keep top 20 programming languages
Quality: remove files under 100 bytes or over 1MB, filter high-entropy files (minified JS)
License filter: keep permissive licenses only (MIT, Apache, BSD)
Near-dedup: MinHash on file content (80% threshold)

Common Crawl:

URL filter: prioritize technical domains (docs., developer., *.readthedocs.io)
Text extraction: trafilatura for main content extraction
Quality: perplexity filter (Wikipedia-trained KenLM), length filter, repetition filter
Language: English only (fastText)
Near-dedup: MinHash on paragraphs

arXiv:

PDF to text: GROBID or Nougat for structure-aware extraction
Filter: keep cs., stat.ML, math. categories
Quality: remove papers with poor extraction (high formula-to-text ratio without proper rendering)
Deduplicate across versions (v1, v2, etc.)

Stack Overflow:

Filter: accepted answers only, or score greater than 5
Format: question + accepted answer as a single document
Remove code-only answers (no explanation)

Global pipeline:

Cross-source deduplication: MinHash across all sources (web pages that copy SO answers, etc.)
PII removal: regex for emails, phone numbers, API keys, IP addresses
Tokenization: BPE with 32K vocab, trained on the combined corpus

Mixing (for code-focused 7B model):

Code (GitHub): 40%
Technical web (CC filtered): 25%
Academic (arXiv): 15%
Stack Overflow: 10%
General web (CC general): 10%

Total target: 2T tokens (inference-aware over-training for 7B model)

Scoring Rubric:

Criterion	Strong Hire	Lean Hire	No Hire
Source-specific filters	Tailored to each source	Generic filters for all	No source-specific thinking
Deduplication	Within-source and cross-source	Mentioned dedup vaguely	No dedup strategy
Quality focus	Multiple quality signals	One filter	"Use all the data"
Data mixing	Justified ratios for code focus	Mentioned mixing	Equal split or random

Problem 3: Scaling Law Application

A team trained a 1B model on 20B tokens and achieved a validation loss of 2.8. They trained a 3B model on 60B tokens and got 2.4. Assuming power-law scaling, predict the loss for a 10B model trained on 200B tokens. Is 200B tokens Chinchilla-optimal for 10B parameters?

Hint 1 - Direction

Use the power law $L(N, D) \approx A/N^{\alpha} + B/D^{\beta} + L_{\infty}$ . With two data points, you can estimate the scaling behavior (with simplifying assumptions).

Hint 2 - Insight

Simpler approach: if both models are at Chinchilla-optimal (both have $D = 20N$ ), then loss scales primarily with compute $C = 6ND$ . $C_1 = 6 \times 1B \times 20B = 1.2 \times 10^{20}$ , $C_2 = 6 \times 3B \times 60B = 1.08 \times 10^{21}$ . The loss went from 2.8 to 2.4 with ~9x compute. Use this to extrapolate.

Hint 3 - Full Solution + Rubric

Approach: Both data points are Chinchilla-optimal ( $D/N = 20$ ), so we can use the compute scaling law:

L(C) = \left(\frac{C_c}{C}\right)^{\alpha} + L_{\infty}

Compute for each:

$C_1 = 6 \times 10^9 \times 20 \times 10^9 = 1.2 \times 10^{20}$
$C_2 = 6 \times 3 \times 10^9 \times 60 \times 10^9 = 1.08 \times 10^{21}$
$C_3 = 6 \times 10 \times 10^9 \times 200 \times 10^9 = 1.2 \times 10^{22}$

Assuming $L_{\infty} \approx 1.7$ (typical for well-filtered web data):

From data points:

$2.8 - 1.7 = 1.1 = (C_c/1.2 \times 10^{20})^{\alpha}$
$2.4 - 1.7 = 0.7 = (C_c/1.08 \times 10^{21})^{\alpha}$

Dividing: $\frac{1.1}{0.7} = \left(\frac{1.08 \times 10^{21}}{1.2 \times 10^{20}}\right)^{\alpha} = 9^{\alpha}$

\alpha = \frac{\log(1.1/0.7)}{\log(9)} = \frac{\log(1.571)}{2.197} = \frac{0.452}{2.197} \approx 0.206

For $C_3 = 1.2 \times 10^{22}$ (~11x $C_2$ ):

L_3 = 1.7 + 0.7 \times \left(\frac{1.08 \times 10^{21}}{1.2 \times 10^{22}}\right)^{0.206} = 1.7 + 0.7 \times (0.09)^{0.206}

(0.09)^{0.206} = e^{0.206 \times \ln(0.09)} = e^{0.206 \times (-2.408)} = e^{-0.496} \approx 0.609

L_3 \approx 1.7 + 0.7 \times 0.609 \approx 1.7 + 0.43 \approx 2.13

Predicted loss: ~2.1

Is 200B Chinchilla-optimal for 10B? Yes: $D/N = 200B/10B = 20$ , which matches the Chinchilla ratio.
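The two-point fit can be reproduced in a few lines. The function name is hypothetical, and the irreducible loss of 1.7 is the assumption made in this solution, not a measured value; the prediction is sensitive to it.

```python
import math

def fit_and_extrapolate(c1, l1, c2, l2, c3, l_inf):
    """Fit L(C) = (C_c / C)^alpha + L_inf through two points, then predict at c3."""
    alpha = math.log((l1 - l_inf) / (l2 - l_inf)) / math.log(c2 / c1)
    l3 = l_inf + (l2 - l_inf) * (c2 / c3) ** alpha
    return alpha, l3

c1 = 6 * 1e9 * 20e9     # 1B params, 20B tokens
c2 = 6 * 3e9 * 60e9     # 3B params, 60B tokens
c3 = 6 * 10e9 * 200e9   # 10B params, 200B tokens
alpha, l3 = fit_and_extrapolate(c1, 2.8, c2, 2.4, c3, l_inf=1.7)
print(f"alpha = {alpha:.3f}, predicted loss = {l3:.2f}")
```

Re-running with a different `l_inf` (for example 1.5 or 1.9) is a quick way to show an interviewer how much the extrapolation depends on that assumption.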

Scoring Rubric:

Criterion	Strong Hire	Lean Hire	No Hire
Set up scaling law	Correct formula with compute	Vague "loss decreases with scale"	Cannot set up
Numerical answer	Within 0.2 of ~2.1	Within 0.5	Cannot compute
Chinchilla check	Correctly identifies $D/N = 20$	Mentions Chinchilla vaguely	Does not check
Caveats	"This assumes perfect scaling - real loss depends on data quality, training stability"	Mentions uncertainty	Presents answer as certain

Problem 4: Parallelism Strategy

You need to train a 30B parameter model on a cluster of 256 H100 GPUs (32 nodes, 8 GPUs per node, NVLink within node, InfiniBand across nodes). Each GPU has 80 GB memory. Design the 3D parallelism strategy. Estimate memory usage per GPU.

Hint 1 - Direction

Start with memory requirements: 30B params in bf16 = 60 GB for weights alone. Optimizer states (Adam) = 12 bytes per param = 360 GB. This does not fit on one GPU or even one node.

Hint 2 - Insight

With FSDP (ZeRO-3), optimizer states are sharded. TP within nodes (8-way) to split layer computation. Consider PP = 2 or 4 for additional memory relief. DP handles the rest.

Hint 3 - Full Solution + Rubric

Memory analysis (no parallelism):

Model weights (bf16): $30 \times 10^9 \times 2 = 60$ GB
Adam optimizer: $30 \times 10^9 \times 12 = 360$ GB (fp32 weights + first moment + second moment)
Gradients (bf16): 60 GB
Activations: ~40-80 GB (depends on batch size and sequence length)
Total: ~520-560 GB

Proposed strategy:

Dimension	Degree	Scope	Rationale
TP	4	Within node	Splits each layer across 4 GPUs; 2 TP groups per node
PP	2	Across 2 nodes	Splits model into 2 pipeline stages
DP (FSDP)	32	Remaining dimension	256 / (4 * 2) = 32-way data parallel with full sharding

Memory per GPU with this config:

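Model weights: $60 \text{ GB} / 4 \text{ (TP)} / 2 \text{ (PP)} = 7.5$ GB per GPU (each GPU holds its TP shard of the layers in its own pipeline stage; FSDP shards this further between uses)
Optimizer states: $360 \text{ GB} / 32 \text{ (FSDP)} / 4 \text{ (TP)} / 2 \text{ (PP)} \approx 1.4$ GB (sharded across all DP ranks, TP, and PP)
Gradients (sharded): ~0.2-0.5 GB ($60 \text{ GB} / 256 \approx 0.23$ GB in bf16; about double if kept in fp32)
Activations (with checkpointing): ~10-20 GB
Total: ~19-29 GB per GPU - fits comfortably in 80 GB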
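The per-GPU figures follow from dividing each component by the parallelism degrees that shard it. The sketch below makes those divisors explicit. The function `per_gpu_state_gb` and the flag `split_weights_by_pp` are hypothetical names introduced for illustration: with the flag off, weights are divided by TP only (a conservative upper bound), and with it on they are also divided by the pipeline degree, since each stage holds only its own layers. Gradients are assumed to be bf16; fp32 gradients would double that line.

```python
def per_gpu_state_gb(n_params, tp, pp, dp, split_weights_by_pp=True):
    """Rough per-GPU static memory (GB) under TP x PP x FSDP sharding.

    Assumptions (illustrative, not measured):
      - bf16 weights and gradients (2 bytes per parameter each)
      - Adam state at 12 bytes per parameter (fp32 master copy + two moments)
      - optimizer state and gradients sharded evenly over all tp * pp * dp ranks
    """
    weights_total = n_params * 2 / 1e9
    adam_total = n_params * 12 / 1e9
    grads_total = n_params * 2 / 1e9

    weight_divisor = tp * pp if split_weights_by_pp else tp
    all_ranks = tp * pp * dp
    return {
        "weights": weights_total / weight_divisor,
        "adam": adam_total / all_ranks,
        "gradients": grads_total / all_ranks,
    }


with_pp = per_gpu_state_gb(30e9, tp=4, pp=2, dp=32)
tp_only = per_gpu_state_gb(30e9, tp=4, pp=2, dp=32, split_weights_by_pp=False)

for name, cfg in (("weights / (TP*PP)", with_pp), ("weights / TP only", tp_only)):
    static = sum(cfg.values())
    # Add the 10-20 GB activation estimate (with checkpointing) from the text.
    print(f"{name}: static {static:.1f} GB, total {static + 10:.0f}-{static + 20:.0f} GB")
```

Either way the total stays well under 80 GB, so the remaining headroom can go to larger micro-batches or less aggressive activation checkpointing.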

Why TP=4 not TP=8:

TP=8 means all 8 GPUs in a node work on the same layer - maximum layer splitting
TP=4 leaves room for 2 independent TP groups per node, which can be in different pipeline stages or DP groups
TP=4 also has less communication overhead than TP=8

Effective batch size: With DP=32, micro-batch=2 per DP rank, grad accumulation=4: $32 \times 2 \times 4 = 256$ sequences per optimizer step. At 4096 tokens/seq: $256 \times 4096 \approx 1$M tokens per step.
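The batch-size arithmetic is easy to get wrong under interview pressure, so here is a small sketch of it. The helper name `tokens_per_step` is hypothetical; it only multiplies the quantities named above.

```python
def tokens_per_step(dp, micro_batch, grad_accum, seq_len):
    """Return (sequences, tokens) consumed per optimizer step.

    TP and PP do not appear: they split the work on one micro-batch
    across GPUs rather than adding more data.
    """
    sequences = dp * micro_batch * grad_accum
    return sequences, sequences * seq_len


sequences, tokens = tokens_per_step(dp=32, micro_batch=2, grad_accum=4, seq_len=4096)
print(sequences, tokens)  # 256 1048576
```

Note that only the data-parallel degree scales the global batch, which is why raising TP or PP frees memory without changing the optimization hyperparameters.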

Scoring Rubric:

Criterion	Strong Hire	Lean Hire	No Hire
Memory calculation	Correct breakdown	Approximate but reasonable	Could not estimate
TP within node	Correctly placed within NVLink	Used TP across nodes	Did not consider hardware topology
FSDP for optimizer	Sharded optimizer across DP ranks	Mentioned FSDP	Full optimizer on every GPU
Practical details	Batch size, activation checkpointing, effective throughput	Mentioned some considerations	Pure theory

Interview Cheat Sheet

Concept	Key Fact / Formula	Common Follow-Up
Data filtering	Wikipedia-quality classifier + heuristics (length, repetition, perplexity)	"How do you handle bias in the filter?"
Deduplication	MinHash + LSH for fuzzy dedup; cross-source is critical	"What similarity threshold? How to scale?"
BPE tokenization	Iteratively merge most frequent pairs; byte-level for open vocab	"Vocab size tradeoffs? Multilingual impact?"
Causal LM objective	$-\sum_t \log P(x_t \mid x_{1..t-1})$ ; equivalent to compression	"Connection to perplexity? To entropy?"
FIM	`PRE + SUF + MID` format; 50% rate; critical for code infilling	"Why not 100% FIM?"
Chinchilla	$D/N \approx 20$ for compute-optimal; 70B needs 1.4T tokens	"Why is LLaMA 3 8B trained on 15T?"
Inference-aware	Over-train smaller models for cheaper serving	"What is the cost crossover point?"
Training FLOPs	$C = 6ND$; GPU-hours = $C / (\text{peak FLOPS per GPU} \times \text{MFU} \times 3600)$	"What MFU is realistic?"
3D parallelism	TP within node (NVLink), PP across few nodes, DP for the rest	"Why not TP=8 everywhere?"
FSDP	Shard optimizer + gradients + params across DP ranks	"Memory per GPU with ZeRO-3?"
Checkpointing	Every 100-1000 steps; async to distributed storage	"MTBF for 1000 GPUs?"
Data mixing	Code ~20-40%, web ~30-50%, math ~5-10%; curriculum over training	"How do you decide the mix?"
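The Chinchilla and Training FLOPs rows combine into a quick budget estimate. The following is a back-of-the-envelope sketch under stated assumptions: the function names are hypothetical, the peak throughput of about 989 TFLOPS is the dense bf16 figure commonly quoted for an H100 SXM, and the 40% MFU is an assumed value, not a measurement.

```python
H100_BF16_PEAK_FLOPS = 989e12  # assumed dense bf16 peak per GPU (H100 SXM)


def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal token count under the D/N ~ 20 rule of thumb."""
    return tokens_per_param * n_params


def training_flops(n_params, n_tokens):
    """C = 6ND: forward plus backward FLOPs for a dense transformer."""
    return 6 * n_params * n_tokens


def gpu_hours(total_flops, peak_flops_per_gpu=H100_BF16_PEAK_FLOPS, mfu=0.40):
    """Total GPU-hours; divide by the GPU count to get wall-clock hours."""
    return total_flops / (peak_flops_per_gpu * mfu * 3600)


n = 70e9
d = chinchilla_tokens(n)   # 1.4T tokens, matching the cheat sheet row
c = training_flops(n, d)   # about 5.9e23 FLOPs
hours = gpu_hours(c)
print(f"D = {d:.2e} tokens, C = {c:.2e} FLOPs, about {hours:,.0f} GPU-hours")
print(f"on 512 GPUs: about {hours / 512 / 24:.0f} days")
```

Under these assumptions a Chinchilla-optimal 70B run lands on the order of $4 \times 10^5$ GPU-hours; the result scales inversely with MFU, so stating the MFU you assumed matters as much as the final number.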

Spaced Repetition Checkpoints

Day 0 (After reading this chapter)

Draw the data pipeline from raw crawl to training-ready tokens (10 stages)
Write the causal LM loss formula and explain its connection to compression
Calculate: how many tokens is Chinchilla-optimal for a 13B model?
List the three parallelism dimensions and when each is used

Day 3

Explain MinHash deduplication in 60 seconds
Compare BPE, SentencePiece, and tiktoken - key differences
Why did the field move from Chinchilla-optimal to inference-aware scaling?
What is the difference between activation checkpointing and model checkpointing?

Day 7

Design a data pipeline for a code-focused model (from memory)
Calculate training compute for a 7B model on 2T tokens; estimate GPU-hours on 64 H100s
Explain 3D parallelism for a 70B model on 512 GPUs - assign TP, PP, DP degrees
What is FIM and why is it important for code models?

Day 14

Do Practice Problem 1 (compute budget) from scratch, timed (10 min)
Explain curriculum learning - what changes during training and why
Discuss fault tolerance: checkpointing frequency, recovery protocol, spare nodes
Quiz yourself on all cheat sheet entries - 60 seconds each

Day 21

Full mock: "You have $1M and need the best code model possible. Walk me through everything."
Re-take the self-assessment - all scores should be 4+
Solve Problem 3 (scaling law extrapolation) without hints
Explain the end-to-end pretraining pipeline in 5 minutes to a non-expert
