
LLM Pretraining - From Raw Data to Foundation Models

Reading time: ~50 min | Interview relevance: Critical (Tier 1-2) / High (Tier 3-4) | Roles: MLE, Research Eng, ML Infra Eng, LLM Eng

The Real Interview Moment

You are interviewing at a compute-focused AI startup. The lead researcher presents this scenario: "We have a cluster of 512 H100 GPUs and a budget of $2M in compute. We want to train the best possible model. How large should the model be? How much data do we need? How long will it take? Walk me through every decision from data collection to the first checkpoint."

You start with the compute budget. She pushes: "You said Chinchilla-optimal. But we are optimizing for inference cost, not training cost. How does that change your sizing? And what about data quality - does the scaling law still hold if 30% of our data is noisy?"

This is the definitive Tier 1/Tier 2 interview question. It tests whether you understand the end-to-end pretraining pipeline - not just the training loop, but the data engineering, tokenization decisions, scaling tradeoffs, and infrastructure challenges that consume 80% of the actual work.

Candidates who can only describe the training objective get a "lean hire" at best. Candidates who can reason about data quality, compute allocation, parallelism strategy, and failure recovery - with concrete numbers - get a "strong hire."

What You Will Master

  • Design a data collection and filtering pipeline with quality heuristics
  • Explain deduplication methods (MinHash, exact match, n-gram) and why they matter
  • Compare BPE, SentencePiece, and tiktoken tokenization with vocabulary tradeoffs
  • Derive the causal language modeling objective and its connection to compression
  • Apply Chinchilla and inference-aware scaling laws with compute budget math
  • Plan 3D parallelism (data, tensor, pipeline) for a given GPU cluster
  • Design checkpointing and fault tolerance for multi-week training runs
  • Explain curriculum learning and data mixing strategies

Self-Assessment: Where Are You Now?

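| Skill | 1 -- Cannot | 2 -- Vaguely | 3 -- Can Explain | 4 -- Can Derive | 5 -- Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Design a data collection pipeline | | | | | | ___ |
| Explain deduplication methods | | | | | | ___ |
| Compare BPE vs SentencePiece vs tiktoken | | | | | | ___ |
| Derive causal LM training objective | | | | | | ___ |
| Apply Chinchilla scaling law | | | | | | ___ |
| Calculate compute budgets (GPU-hours) | | | | | | ___ |
| Plan 3D parallelism | | | | | | ___ |
| Design fault tolerance for training | | | | | | ___ |
| Explain curriculum learning | | | | | | ___ |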

Target: All 4s and 5s before your interview.

Part 1 - Data Collection and Filtering

The Data Pipeline

Pretraining data quality is the single most impactful factor in model quality. The pipeline looks like this:

[Figure: Pretraining Data Pipeline]

Data Sources

| Source | Volume | Quality | Use Case |
|---|---|---|---|
| Common Crawl | ~250B pages, ~100T tokens raw | Low (needs heavy filtering) | Web knowledge, general text |
| Wikipedia | ~4B tokens (English) | High | Factual knowledge |
| Books (Books3, Gutenberg) | ~10-50B tokens | Medium-High | Long-form reasoning, narrative |
| Academic papers (arXiv, S2ORC) | ~50B tokens | High | Scientific reasoning |
| Code (GitHub, StackOverflow) | ~500B+ tokens | Medium | Code generation, logical reasoning |
| Math (textbooks, competition data) | ~10B tokens | High | Mathematical reasoning |
| Curated instruction data | ~1-10B tokens | Very High | Instruction following |

Interviewer's Perspective

A common interview question is "If you could only improve one thing about your pretraining data, what would it be?" The answer is almost always data quality, not quantity. LLaMA 3 was trained on 15T tokens - but the data team spent months building quality classifiers. RedPajama and FineWeb showed that better filtering on Common Crawl data can match proprietary datasets.

Quality Filtering Heuristics

The following heuristics are used by most open-source data pipelines (C4, RedPajama, FineWeb, Dolma):

| Filter | What It Catches | Implementation |
|---|---|---|
| Min word count | Stub pages, navigation menus | Remove documents with fewer than 50 words |
| Mean word length | Garbled text, encoded content | Remove if mean word length is outside 3-10 characters |
| Symbol-to-word ratio | Markup-heavy pages, log files | Remove if more than 10% of tokens are special characters |
| Repetition ratio | Boilerplate, template pages | Remove if any n-gram (2-4 words) repeats more than a threshold |
| "Dirty word" ratio | Adult content | Remove if offensive word fraction exceeds threshold |
| Perplexity filter | Gibberish, auto-generated text | Train a small LM on Wikipedia; remove high-perplexity docs |
| Classifier filter | General low quality | Train a classifier (e.g., fastText) on Wikipedia (positive) vs random web (negative) |

The "Wikipedia quality" classifier is a powerful idea: train a binary classifier to distinguish "looks like Wikipedia" from "looks like random web." Documents scoring high are more likely to be well-written, factual, and informative. This was used by GPT-3, LLaMA, and most subsequent models.
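The rule-based heuristics in the table above can be sketched in a few lines. This is a minimal illustration, not the filter used by any particular pipeline: the function name `passes_quality_heuristics`, the trigram-based repetition check, and the `max_repeat_frac` threshold are assumptions made for the example, and "symbol" is approximated as a whitespace-separated token with no letters or digits.

```python
import re
from collections import Counter


def passes_quality_heuristics(text, min_words=50, mean_len_range=(3, 10),
                              max_symbol_ratio=0.10, max_repeat_frac=0.20):
    """Return True if `text` survives the rule-based filters (illustrative thresholds)."""
    words = text.split()

    # Min word count: drop stubs and navigation menus.
    if len(words) < min_words:
        return False

    # Mean word length: drop garbled or encoded content.
    mean_len = sum(len(w) for w in words) / len(words)
    if not mean_len_range[0] <= mean_len <= mean_len_range[1]:
        return False

    # Symbol-to-word ratio: tokens containing no letter or digit count as symbols.
    symbol_words = sum(1 for w in words if not re.search(r"[A-Za-z0-9]", w))
    if symbol_words / len(words) > max_symbol_ratio:
        return False

    # Repetition ratio: share of all trigrams taken by the single most common one.
    trigrams = Counter(zip(words, words[1:], words[2:]))
    total = sum(trigrams.values())
    if total and trigrams.most_common(1)[0][1] / total > max_repeat_frac:
        return False

    return True
```

Cheap filters like these usually run first, so that the more expensive perplexity and classifier filters only see documents that already passed.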

Deduplication

Deduplication is critical because web crawls contain enormous amounts of repeated content (boilerplate, scraped sites, template pages). Training on duplicates:

  • Wastes compute
  • Memorizes specific text (privacy risk)
  • Biases the model toward overrepresented content
  • Can cause training instability

Deduplication methods:

[Figure: Deduplication Methods]

MinHash deduplication is the most commonly used method. The algorithm:

  1. For each document, create a set of n-gram shingles (e.g., all 5-word sequences)
  2. Apply $k$ hash functions to each shingle set - keep the minimum hash for each function
  3. The resulting $k$ values form the "MinHash signature"
  4. Use Locality-Sensitive Hashing (LSH) to group documents with similar signatures
  5. Within each group, compute exact Jaccard similarity and remove near-duplicates (typically above 0.8 similarity)
Common Trap

Candidates often forget to mention cross-document deduplication across data sources. Wikipedia text appears in Common Crawl, books appear in multiple scraped sites, etc. You must deduplicate across all sources, not just within each source. This is a significant engineering challenge at the scale of trillions of tokens.

Part 2 - Tokenization

Why Tokenization Matters for LLMs

The tokenizer defines the model's "vocabulary" - its atomic units of processing. Every LLM design decision is affected:

  • Context window: 128K tokens is not 128K words. Typical ratio is 1 word ≈ 1.3 tokens for English, but much worse for non-Latin scripts
  • Training cost: More tokens per document = more FLOPs to process the same text
  • Multilingual quality: A tokenizer trained primarily on English fragments non-Latin text into many small tokens, degrading quality
  • Code quality: Whitespace handling matters - Python indentation should not consume excessive tokens

BPE (Byte Pair Encoding)

The standard tokenization algorithm for LLMs. Starting from a character vocabulary, iteratively merge the most frequent adjacent pair:

Algorithm:

  1. Start with a vocabulary of all individual characters (or bytes)
  2. Count all adjacent pairs in the training corpus
  3. Merge the most frequent pair into a new token
  4. Repeat steps 2-3 until vocabulary reaches target size

Example:

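Corpus: "low low low low low lowest lowest newer newer wider"

Initial vocab: {l, o, w, e, s, t, n, r, i, d, <space>}

(Pairs are counted within words; word frequencies: low ×5, lowest ×2, newer ×2, wider ×1.)

Step 1: Most frequent pair: (l, o), count 7 → merge into "lo" (tied with (o, w); ties are broken arbitrarily)
Step 2: Most frequent pair: (lo, w), count 7 → merge into "low"
Step 3: Most frequent pair: (e, r), count 3 → merge into "er"
Step 4: Several pairs tie at count 2; choosing (e, s) → merge into "es"
Step 5: (es, t), count 2 (still tied) → merge into "est"
Step 6: (low, est), count 2 (still tied) → merge into "lowest"
...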
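The merge loop itself is short. Below is a minimal sketch of BPE training on a whitespace-split toy corpus, assuming pairs are counted within words only and ties go to the pair seen first; the function names are illustrative, and production tokenizers add byte-level fallback, regex pre-tokenization, and much faster data structures.

```python
from collections import Counter


def pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in word_freqs.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(word_freqs, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Return the ordered list of merges learned from `corpus`."""
    word_freqs = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(word_freqs)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties: first pair encountered wins
        merges.append(best)
        word_freqs = merge_pair(word_freqs, best)
    return merges
```

Calling `train_bpe(corpus, num_merges)` on the toy corpus returns the merge list in the order it was learned; at inference time the same merges are replayed in that order to tokenize new text.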

BPE Variants in Modern LLMs

| Tokenizer | Base Unit | Vocab Size | Used By | Key Feature |
|---|---|---|---|---|
| GPT-2 BPE | Bytes | 50,257 | GPT-2, GPT-3 | Byte-level fallback |
| SentencePiece | Unicode | Varies | LLaMA, T5 | Language-agnostic, whitespace as token |
| tiktoken (cl100k) | Bytes | 100,256 | GPT-4, GPT-3.5 | Fast (Rust), regex pre-tokenization |
| LLaMA 3 tokenizer | Bytes | 128,256 | LLaMA 3 | Larger vocab for multilingual |

60-Second Answer

"Modern LLMs use Byte Pair Encoding or its variants. BPE starts with individual bytes or characters and iteratively merges the most frequent pairs until reaching a target vocabulary size. The key engineering decisions are: (1) vocabulary size - larger vocabs like LLaMA 3's 128K reduce sequence length but increase embedding table size, (2) byte-level vs character-level base - byte-level ensures nothing is out-of-vocabulary, (3) pre-tokenization regex - tiktoken uses regex to split on whitespace and punctuation before BPE, preventing merges across word boundaries. The tokenizer directly impacts context length, multilingual quality, and inference cost."

Vocabulary Size Tradeoffs

$$\text{Embedding parameters} = V \times d$$

For $d = 4096$:

  • $V = 32{,}000$ (LLaMA 2): 131M params (~0.5 GB in fp32)
  • $V = 128{,}256$ (LLaMA 3): 525M params (~2 GB in fp32)

Larger vocab means:

  • Shorter sequences (fewer tokens per document) = less KV cache, faster inference
  • Better multilingual (more tokens dedicated to non-English scripts)
  • Larger embedding table (more parameters, more memory)
  • Sparser updates (rare tokens get updated less during training)
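The embedding-table arithmetic is easy to check directly. A small sketch, assuming fp32 storage (4 bytes per parameter) by default and a single embedding matrix (an untied output projection would double the count); `embedding_table` is a name chosen for the example.

```python
def embedding_table(vocab_size, d_model, bytes_per_param=4):
    """Return (parameter count, size in GB) for a V x d embedding matrix."""
    params = vocab_size * d_model
    return params, params * bytes_per_param / 1e9


for name, vocab in [("LLaMA 2", 32_000), ("LLaMA 3", 128_256)]:
    params, gigabytes = embedding_table(vocab, 4096)
    print(f"{name}: {params / 1e6:.0f}M params, {gigabytes:.2f} GB in fp32")
```

For a 7-8B model the larger table is a noticeable share of total parameters, which is part of why very large vocabularies are more attractive for bigger models.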

[Figure: Vocabulary Size Tradeoffs]

The "Fertility" Problem

Fertility = average number of tokens per word for a given language. English is efficient (~1.3 tokens/word with most tokenizers). Other languages suffer:

| Language | Fertility (tiktoken cl100k) | Effective Context (128K tokens) |
|---|---|---|
| English | ~1.3 | ~98K words |
| Spanish | ~1.5 | ~85K words |
| Chinese | ~1.8 | ~71K words |
| Hindi | ~3.5 | ~37K words |
| Thai | ~4.0 | ~32K words |

This means a 128K context model effectively has 3x less context for Thai than for English. LLaMA 3 addressed this partially by quadrupling the vocabulary size from 32K to 128K.

Part 3 - Training Objectives

Causal Language Modeling (CLM)

The primary pretraining objective for decoder-only LLMs:

$$\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, x_2, \ldots, x_{t-1}; \theta)$$

Minimize the negative log-likelihood of each token given all preceding tokens. This is equivalent to compression - a model with lower loss is a better compressor of text.

Connection to perplexity:

$$\text{PPL} = \exp(\mathcal{L}_{\text{CLM}} / T)$$

A perplexity of 10 means the model is, on average, as uncertain as if it had to choose uniformly among 10 equally likely options for each next token.
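The two formulas above, plus the compression view, fit in one small function. This sketch assumes you already have the probability the model assigned to each observed token; `clm_loss_and_perplexity` is an illustrative name, and bits per token is included because it is the quantity an ideal entropy coder driven by the model would achieve.

```python
import math


def clm_loss_and_perplexity(token_probs):
    """token_probs[t] is P(x_t | x_<t) for the token that actually occurred."""
    nll = -sum(math.log(p) for p in token_probs)   # summed negative log-likelihood
    mean_nll = nll / len(token_probs)              # loss per token, in nats
    perplexity = math.exp(mean_nll)
    bits_per_token = mean_nll / math.log(2)        # compression cost per token
    return nll, perplexity, bits_per_token


# A model that always gives the correct token probability 0.1:
loss, ppl, bits = clm_loss_and_perplexity([0.1] * 5)
print(f"loss={loss:.3f} nats, perplexity={ppl:.1f}, {bits:.2f} bits/token")
```

Lower loss means fewer bits per token, which is the sense in which a better language model is a better compressor.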

Interviewer's Perspective

The connection between language modeling and compression is a favorite interview topic at research-oriented companies. Key insight: a model that perfectly predicts the next token achieves optimal compression. Shannon's source coding theorem tells us the minimum description length equals the entropy of the source. So "better language model" and "better compressor" are mathematically the same thing.

Prefix Language Modeling

Used by UL2, U-PaLM (PaLM continued with UL2-style objectives), and some hybrid architectures. The input sequence is split into a prefix (bidirectional attention) and a target (causal attention):

$$\mathcal{L}_{\text{PrefixLM}} = -\sum_{t=P+1}^{T} \log P(x_t \mid x_1, \ldots, x_P, x_{P+1}, \ldots, x_{t-1}; \theta)$$

The first $P$ tokens are the prefix - they can attend to each other bidirectionally. Loss is computed only on the target tokens $P+1, \ldots, T$.

Why it helps: For tasks where the "input" is well-defined (e.g., a document to summarize), bidirectional attention on the input creates better representations. But it reduces training efficiency (fewer tokens contribute to the loss).

Fill-in-the-Middle (FIM)

Used by code models (Code LLaMA, StarCoder, DeepSeek Coder). During training, randomly split a document into three parts:

Original: [A][B][C]
FIM input: <PRE>[A]<SUF>[C]<MID>[B]

The model learns to predict the middle given the prefix and suffix. This is critical for code completion where you need to fill in a function body given the signature and the code below.

FIM rate: The transformation is applied to each training example with probability $p$, typically $p = 0.5$, so about half of the examples use FIM and half use standard CLM.
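The transformation can be sketched as a data-loading step. This is a minimal character-level illustration: the sentinel strings mirror the `<PRE>`, `<SUF>`, `<MID>` markers above (real tokenizers reserve dedicated special tokens), `apply_fim` is a made-up name, and split points are sampled uniformly, which is only one of several possible choices.

```python
import random

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def apply_fim(doc, fim_rate=0.5, rng=random):
    """With probability fim_rate, rewrite [A][B][C] as <PRE>A<SUF>C<MID>B."""
    if len(doc) < 2 or rng.random() >= fim_rate:
        return doc  # leave as a plain causal-LM example
    # Two distinct cut points split the document into prefix, middle, suffix.
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


example = "def add(x, y):\n    return x + y\n"
print(apply_fim(example, fim_rate=1.0, rng=random.Random(0)))
```

Because the result is still a single left-to-right sequence, the training loop and loss are unchanged; only the data pipeline knows about FIM.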

[Figure: Fill-in-the-Middle Training]

Part 4 - Scaling Laws

The Kaplan Scaling Laws (OpenAI, 2020)

The first quantitative scaling laws showed that loss scales as a power law with model size, dataset size, and compute:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $N$ = parameters, $D$ = tokens, $C$ = compute (FLOPs), and $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$.

Key insight from Kaplan: For a fixed compute budget, larger models trained on less data outperform smaller models trained on more data. This led to GPT-3 (175B params trained on 300B tokens).

Chinchilla Scaling Law (Hoffmann et al., 2022)

Chinchilla revisited the Kaplan scaling laws with more careful experiments and found a different optimal allocation:

$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$

The ratio of tokens to parameters should be approximately $D/N \approx 20$. This means:

| Model Size | Chinchilla-Optimal Tokens | Example |
|---|---|---|
| 1B | 20B | |
| 7B | 140B | |
| 70B | 1.4T | Chinchilla (70B, 1.4T tokens) |
| 175B | 3.5T | GPT-3 was undertrained (only 300B) |

Common Trap

Many candidates recite "Chinchilla says 20 tokens per parameter" without understanding the implications. The real insight is that GPT-3 was massively undertrained - it should have seen 12x more data. But this also means training GPT-3 to Chinchilla-optimal would cost 12x more compute. The Chinchilla law tells you the compute-optimal frontier, not the practical one: for GPT-3's actual compute budget, the compute-optimal choice would have been a smaller model trained on more tokens.

Inference-Aware Scaling Laws (2023-2024)

Chinchilla optimizes for training cost. But in practice, inference cost dominates total cost of ownership. A model served for millions of queries per day costs far more in inference than it cost to train.

The inference-aware insight: It is cheaper to train a smaller model on MORE data than Chinchilla-optimal, because the smaller model is cheaper to serve.

$$\text{Total Cost} = C_{\text{train}} + C_{\text{inference}} \times T_{\text{deployment}}$$

If $T_{\text{deployment}}$ is large (high query volume), the optimal strategy shifts toward smaller models:

| Strategy | Model Size | Training Tokens | Train Cost | Inference Cost/Query | Best When |
|---|---|---|---|---|---|
| Chinchilla-optimal | 70B | 1.4T | $10M | $0.001 | Low query volume |
| Inference-aware | 7B | 2-5T | $2M | $0.0001 | High query volume |
| Over-train | 1B | 3T | $200K | $0.00001 | Edge/mobile |

This explains LLaMA 3 8B trained on 15T tokens (about 94x the Chinchilla-optimal 160B tokens for its size, or roughly 1,900 tokens per parameter) - Meta optimized for inference cost, not training cost.

[Figure: Scaling Strategies]

60-Second Answer

"Chinchilla scaling says you should train with about 20 tokens per parameter to minimize training loss for a given compute budget. But this optimizes training cost, not total cost. In practice, LLaMA 3 8B was trained on 15 trillion tokens - roughly 94 times the Chinchilla-optimal amount - because a smaller model that is over-trained is much cheaper to serve. The right scaling law depends on your deployment scenario: if you are serving millions of queries, over-training a smaller model is more cost-effective than Chinchilla-optimal training of a larger model."

Compute Budget Estimation

Given a compute budget, find the optimal model:

$$C = 6ND \quad \text{(training FLOPs)}$$

For a budget of $B$ dollars at cost $p$ per GPU-hour:

$$C = \frac{B}{p} \times \text{GPU\_FLOPS} \times \text{MFU} \times 3600$$

Example: Budget = $2M, H100 at $3/GPU-hour, 1000 TFLOPS bf16, 40% MFU:

$$C = \frac{2{,}000{,}000}{3} \times 10^{15} \times 0.4 \times 3600 = 9.6 \times 10^{23} \text{ FLOPs}$$

Chinchilla-optimal: with $D = 20N$, $C = 6N \times 20N = 120N^2$, so $N = \sqrt{C/120} = \sqrt{8 \times 10^{21}} \approx 8.9 \times 10^{10}$ and $D = 20N \approx 1.8 \times 10^{12}$.

So approximately an 89B parameter model on 1.8T tokens - but this is Chinchilla-optimal. For inference-aware, you would spend the same compute on a 70B model trained on ~2.3T tokens, or a 7B model on ~23T tokens.
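The budget arithmetic can be wrapped in three small helpers. This is a sketch under the section's own assumptions ($C = 6ND$ and the Chinchilla ratio $D \approx 20N$ quoted earlier); the function names and the default `tokens_per_param=20` are illustrative choices.

```python
import math


def training_flops(budget_usd, usd_per_gpu_hour, peak_flops, mfu):
    """Total useful FLOPs a dollar budget buys at a given utilization."""
    gpu_hours = budget_usd / usd_per_gpu_hour
    return gpu_hours * 3600 * peak_flops * mfu


def chinchilla_optimal(flops, tokens_per_param=20):
    """Solve C = 6 * N * D with D = tokens_per_param * N for (N, D)."""
    n_params = math.sqrt(flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params


def tokens_for_model(flops, n_params):
    """Tokens affordable for a fixed model size: D = C / (6N)."""
    return flops / (6 * n_params)


C = training_flops(2_000_000, 3.0, 1e15, 0.40)
N, D = chinchilla_optimal(C)
print(f"C = {C:.2e} FLOPs, compute-optimal N = {N / 1e9:.0f}B, D = {D / 1e12:.2f}T")
for size in (70e9, 7e9):
    print(f"{size / 1e9:.0f}B model: {tokens_for_model(C, size) / 1e12:.1f}T tokens")
```

Keeping `tokens_for_model` separate makes the inference-aware tradeoff explicit: pick the model size your serving budget allows, then see how many tokens the training budget buys.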

Part 5 - Training Infrastructure

The Three Dimensions of Parallelism

Training large LLMs requires distributing computation across hundreds or thousands of GPUs. The three main parallelism strategies address different bottlenecks:

[Figure: 3D Parallelism]

Data Parallelism (DP)

Each GPU holds a complete copy of the model. The batch is split across GPUs, and gradients are synchronized (all-reduce) after each step.

Memory per GPU: Full model + optimizer states + activations for the batch chunk

Communication: All-reduce of gradients after each step. For a model with $N$ parameters, a ring all-reduce (reduce-scatter + all-gather) moves roughly $2N$ gradient values per GPU per step, i.e. about $4N$ bytes with bf16 gradients.

When to use: Always used as the outer parallelism dimension. Scales almost linearly with number of GPUs if communication is overlapped with computation.

FSDP (Fully Sharded Data Parallelism): Instead of replicating the full model on every GPU, shard the parameters, gradients, and optimizer states across GPUs. With $k$ GPUs, each GPU holds only $1/k$ of the model state. All-gather parameters when needed for computation, then discard.

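| DP Variant | Memory per GPU | Communication | Complexity |
|---|---|---|---|
| Standard DP | Full model + optimizer | Gradient all-reduce | Low |
| ZeRO Stage 1 | Full model, sharded optimizer | Gradient reduce-scatter + parameter all-gather (same volume as all-reduce) | Medium |
| ZeRO Stage 2 | Full model, sharded optimizer + gradients | Gradient reduce-scatter + parameter all-gather (same volume as all-reduce) | Medium |
| ZeRO Stage 3 / FSDP | Sharded everything | All-gather + reduce-scatter per layer (about 1.5x standard DP volume) | High |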
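The memory column can be made concrete with a per-parameter byte count. The sketch below assumes mixed-precision Adam: 2 bytes of bf16 weights, 2 bytes of bf16 gradients, and 12 bytes of optimizer state (fp32 master weights plus two moments) per parameter. Activations are excluded, and `model_state_gb` is an illustrative name.

```python
def model_state_gb(n_params, dp_degree, stage):
    """Per-GPU model-state memory in GB for ZeRO stage 0 (plain DP) through 3."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= dp_degree    # Stage 1 shards optimizer states
    if stage >= 2:
        grads /= dp_degree    # Stage 2 also shards gradients
    if stage >= 3:
        weights /= dp_degree  # Stage 3 / FSDP also shards parameters
    return n_params * (weights + grads + optim) / 1e9


for stage in range(4):
    print(f"7B model, 64-way DP, stage {stage}: {model_state_gb(7e9, 64, stage):.1f} GB")
```

Most of the saving arrives at Stage 1, because optimizer state is the largest of the three components.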

Tensor Parallelism (TP)

Split individual matrix multiplications across GPUs. For attention: split heads across GPUs. For FFN: split the intermediate dimension.

Memory per GPU: $1/k$ of the model parameters per layer (for TP degree $k$)

Communication: Two all-reduce operations per layer in the forward pass (one for attention, one for FFN), and two more in the backward pass. Communication happens within a layer, so it must be fast - typically limited to GPUs connected by NVLink within a single node.

When to use: When a single layer does not fit in one GPU's memory, or when you need to reduce per-GPU memory. Typically TP = 2, 4, or 8 (within one node).

Common Trap

Tensor parallelism requires high-bandwidth communication within each layer's forward and backward pass. Never use TP across nodes connected by InfiniBand (too slow). TP is for intra-node (NVLink: 900 GB/s on H100), while DP and PP can go across nodes (InfiniBand: 400-800 Gb/s).

Pipeline Parallelism (PP)

Split layers across GPUs. GPU 0 handles layers 0-9, GPU 1 handles layers 10-19, etc.

Memory per GPU: Only the layers assigned to that GPU

Communication: Send activations between adjacent stages. Much less data than TP or DP gradient sync.

The "bubble" problem: Naive PP has pipeline bubbles where GPUs wait for activations from the previous stage. Mitigated by micro-batching: split the mini-batch into micro-batches and pipeline them.
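How much micro-batching helps can be quantified. For a GPipe-style schedule with $p$ stages and $m$ equal-cost micro-batches, the idle fraction is $(p-1)/(m+p-1)$; the sketch below assumes that schedule and ignores communication time. `pipeline_bubble_fraction` is an illustrative name.

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Fraction of each step that a pipeline stage spends idle."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: {pipeline_bubble_fraction(4, m):.1%} idle")
```

A common rule of thumb that follows from the formula is to use several times more micro-batches than pipeline stages, since the bubble shrinks roughly as $p/m$.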

[Figure: Pipeline Parallelism]

3D Parallelism in Practice

For a 512-GPU cluster training a 70B model:

| Dimension | Degree | Scope | Why |
|---|---|---|---|
| TP | 8 | Within node (8 GPUs/node) | Split large matrices across NVLink-connected GPUs |
| PP | 4 | Across 4 nodes | Split 80 layers into 4 stages of 20 layers |
| DP | 16 | Across remaining nodes | 512 / (8 * 4) = 16-way data parallel |

Total: $8 \times 4 \times 16 = 512$ GPUs

Effective batch size: DP degree $\times$ micro-batch size per GPU $\times$ gradient accumulation steps. Typical: $16 \times 4 \times 8 = 512$ sequences $\times$ 4096 tokens/seq $\approx 2\text{M}$ tokens per step.
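A quick sanity check of a proposed layout catches most planning mistakes before a job is launched. This sketch assumes the constraints described above (TP stays inside a node, and TP x PP x DP must equal the GPU count); the function names are illustrative.

```python
def data_parallel_degree(world_size, tp, pp, gpus_per_node=8):
    """Validate a TP x PP layout and return the implied DP degree."""
    if tp > gpus_per_node or gpus_per_node % tp != 0:
        raise ValueError("TP groups must fit inside one node")
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // (tp * pp)


def tokens_per_step(dp, micro_batch, grad_accum, seq_len):
    """Global batch size in tokens for one optimizer step."""
    return dp * micro_batch * grad_accum * seq_len


dp = data_parallel_degree(world_size=512, tp=8, pp=4)
print(dp, tokens_per_step(dp, micro_batch=4, grad_accum=8, seq_len=4096))
```

Because the global batch depends on the DP degree, changing TP or PP without adjusting gradient accumulation silently changes the effective batch size.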

Checkpointing and Fault Tolerance

A 70B model training run takes weeks to months. Hardware failures are not edge cases - they are guaranteed:

| Cluster Size | Mean Time Between Failure (estimated) |
|---|---|
| 64 GPUs | ~1 week |
| 512 GPUs | ~1 day |
| 4096 GPUs | ~2-4 hours |
| 16384 GPUs (LLaMA 3 405B) | Under 1 hour |

Checkpointing strategy:

  1. Frequency: Save every 100-1000 steps (balance between safety and I/O overhead)
  2. Asynchronous: Do not block training for checkpoint writes. Save to fast local NVMe, then async copy to distributed storage
  3. Sharded: Each DP rank saves its own shard. Full checkpoint = union of all shards
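  4. Optimizer state: Must save optimizer states - the two Adam moments in fp32 are 8 bytes per parameter (4x the bf16 weights), and with the fp32 master weights the optimizer state totals 12 bytes per parameter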

Recovery protocol:

  1. Detect failure (heartbeat monitoring, NCCL timeout)
  2. Terminate all ranks
  3. Replace failed node (spare pool)
  4. Load latest checkpoint (verify with checksums)
  5. Resume training
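The "verify with checksums" step pairs naturally with an atomic write, so a crash in the middle of a save never leaves a half-written file that looks valid. The sketch below uses only the standard library and treats the checkpoint as opaque bytes; real runs write sharded tensors through the training framework. The function names and the `.sha256` sidecar file are assumptions made for the example.

```python
import hashlib
import os
import tempfile


def save_checkpoint(path, payload):
    """Write payload to a temp file, fsync, atomically rename, then record a digest."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())       # make sure the bytes reach the disk
    os.replace(tmp_path, path)     # atomic rename within the same filesystem
    digest = hashlib.sha256(payload).hexdigest()
    with open(path + ".sha256", "w") as f:
        f.write(digest)
    return digest


def load_checkpoint(path):
    """Read a checkpoint and refuse to return it if the digest does not match."""
    with open(path, "rb") as f:
        payload = f.read()
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    if hashlib.sha256(payload).hexdigest() != expected:
        raise ValueError(f"checksum mismatch for {path}")
    return payload
```

On a mismatch, the recovery logic would fall back to the previous checkpoint, which is one reason to keep more than one on disk.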
Interviewer's Perspective

At infrastructure-focused companies, the fault tolerance question is a strong signal. Candidates who can discuss elastic training (adjusting DP degree when nodes fail/join), checkpoint formats (safetensors vs pickle), and training resumption strategies (learning rate schedule, batch size warmup after restart) demonstrate production experience.

Activation Checkpointing (Gradient Checkpointing)

Separate from model checkpointing. During backpropagation, you need activations from the forward pass. Normally all activations are stored in memory. Activation checkpointing discards intermediate activations and recomputes them during the backward pass:

Memory-compute tradeoff:

  • Without checkpointing: memory = $O(L \times \text{activations per layer})$
  • With full checkpointing: memory = $O(\sqrt{L} \times \text{activations per layer})$, compute = ~33% more (one extra forward pass)
  • Selective checkpointing: only recompute attention (most memory-hungry), keep FFN activations

Part 6 - Curriculum Learning

Data Mixing

The ratio of data sources during training significantly impacts model capabilities:

| Data Mix | Effect |
|---|---|
| More code | Better at reasoning, logic, structured output |
| More math | Better at quantitative reasoning |
| More books | Better at long-form coherence, narrative |
| More web | Better at general knowledge, diverse topics |
| More multilingual | Better at non-English tasks |

The LLaMA 3 data mix (approximate, based on Meta's report):

  • General knowledge (mostly web): ~50%
  • Math and reasoning: ~25%
  • Code: ~17%
  • Multilingual: ~8%

Curriculum Strategies

Rather than training on a fixed data distribution, curriculum learning adjusts the data mix over training:

  1. Quality annealing: Start with diverse web data, gradually increase the proportion of high-quality data (academic, curated) toward the end
  2. Domain upsampling: After initial training, do extra passes on domains where the model is weakest (identified by domain-specific eval)
  3. Difficulty ordering: Train on shorter, simpler documents first; gradually increase length and complexity
  4. Code scheduling: Introduce code data after basic language understanding is established

[Figure: Curriculum Learning]

Company Variation

Anthropic and Google use sophisticated data mixing schedules that are proprietary. In interviews, demonstrate awareness that data mixing is an active research area. Saying "I would use a fixed 50/30/20 split" is fine for an initial answer, but follow up with "In practice, I would measure domain-specific loss during training and adjust the mix based on which domains are improving slowest."

Long Context Training

Models like Gemini 1.5 (1M context) and LLaMA 3 (128K context) require special training strategies for long contexts:

  1. Context length schedule: Start training with 4K context, gradually extend to 32K, then 128K. Training on 128K from the start is wasteful (most documents are short)
  2. RoPE base frequency adjustment: When extending context, adjust the RoPE base frequency to maintain positional resolution
  3. Ring Attention / sequence parallelism: For very long sequences that exceed a single GPU's memory, distribute the sequence across GPUs
  4. Loss weighting: Optionally weight later tokens in long sequences more heavily to ensure the model learns to use long contexts

Practice Problems

Problem 1: Compute Budget Planning

You have a budget of $500K and access to H100 GPUs at $3/GPU-hour. The H100 achieves 1000 TFLOPS in bf16 with 35% model FLOP utilization. What is the largest Chinchilla-optimal model you can train? How many tokens does it need? How long will training take on 128 GPUs?

Hint 1 - Direction

Calculate total available FLOPs from the budget, then use $C = 6ND$ with the Chinchilla constraint $D \approx 20N$.

Hint 2 - Insight

Total FLOPs = $\frac{500{,}000}{3} \times 10^{15} \times 0.35 \times 3600 \approx 2.1 \times 10^{23}$. With Chinchilla: $C = 6N \times 20N = 120N^2$, so $N = \sqrt{C/120}$.

Hint 3 - Full Solution + Rubric

Step 1: Total compute

$$\text{GPU-hours} = \frac{500{,}000}{3} = 166{,}667 \text{ GPU-hours}$$

$$C = 166{,}667 \times 1000 \times 10^{12} \times 0.35 \times 3600 = 2.1 \times 10^{23} \text{ FLOPs}$$

Step 2: Chinchilla-optimal model size

$$C = 6ND, \quad D = 20N$$

$$2.1 \times 10^{23} = 6 \times N \times 20N = 120N^2$$

$$N = \sqrt{\frac{2.1 \times 10^{23}}{120}} = \sqrt{1.75 \times 10^{21}} \approx 4.2 \times 10^{10} \approx 42\text{B params}$$

Step 3: Training tokens

$$D = 20 \times 42\text{B} = 840\text{B tokens}$$

Step 4: Training time on 128 GPUs

$$\text{Wall clock} = \frac{166{,}667 \text{ GPU-hours}}{128 \text{ GPUs}} \approx 1{,}302 \text{ hours} \approx 54 \text{ days}$$

Answer: ~42B parameter model, ~840B tokens, ~54 days on 128 H100s.

Scoring Rubric:

| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Correct setup | All formulas correct | Minor errors in computation | Could not set up the problem |
| MFU awareness | Used 35% MFU | Assumed 100% utilization | Did not know what MFU means |
| Final answer | Within 20% of 42B | Within 50% | Off by more than 2x |
| Practical follow-up | "54 days is risky - need checkpoint every 30 min, spare nodes, and monitoring" | "That is a long time" | No consideration of practicality |

Problem 2: Data Pipeline Design

You are building a pretraining dataset for a 7B model focused on code and technical documentation. You have access to Common Crawl, GitHub, arXiv, and Stack Overflow. Design the data pipeline from raw data to training-ready tokens. Include filtering, deduplication, and mixing strategy.

Hint 1 - Direction

For each data source, think about: what is the quality distribution? What needs to be filtered out? How do you handle duplicates within and across sources?

Hint 2 - Insight

GitHub has a lot of auto-generated code, vendored dependencies, and low-quality repos. Common Crawl technical pages overlap with Stack Overflow. arXiv needs PDF-to-text conversion and has formatting issues. Design specific filters for each source, then global dedup, then mix.

Hint 3 - Full Solution + Rubric

Source-specific pipelines:

GitHub:

  1. Filter: stars greater than 5, exclude forks, auto-generated files (node_modules, vendor), binaries
  2. Language detection: keep top 20 programming languages
  3. Quality: remove files under 100 bytes or over 1MB, filter high-entropy files (minified JS)
  4. License filter: keep permissive licenses only (MIT, Apache, BSD)
  5. Near-dedup: MinHash on file content (80% threshold)

Common Crawl:

  1. URL filter: prioritize technical domains (docs.*, developer.*, *.readthedocs.io)
  2. Text extraction: trafilatura for main content extraction
  3. Quality: perplexity filter (Wikipedia-trained KenLM), length filter, repetition filter
  4. Language: English only (fastText)
  5. Near-dedup: MinHash on paragraphs

arXiv:

  1. PDF to text: GROBID or Nougat for structure-aware extraction
  2. Filter: keep cs.*, stat.ML, math.* categories
  3. Quality: remove papers with poor extraction (high formula-to-text ratio without proper rendering)
  4. Deduplicate across versions (v1, v2, etc.)

Stack Overflow:

  1. Filter: accepted answers only, or score greater than 5
  2. Format: question + accepted answer as a single document
  3. Remove code-only answers (no explanation)

Global pipeline:

  1. Cross-source deduplication: MinHash across all sources (web pages that copy SO answers, etc.)
  2. PII removal: regex for emails, phone numbers, API keys, IP addresses
  3. Tokenization: BPE with 32K vocab, trained on the combined corpus

Mixing (for code-focused 7B model):

  • Code (GitHub): 40%
  • Technical web (CC filtered): 25%
  • Academic (arXiv): 15%
  • Stack Overflow: 10%
  • General web (CC general): 10%

Total target: 2T tokens (inference-aware over-training for 7B model)

Scoring Rubric:

| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Source-specific filters | Tailored to each source | Generic filters for all | No source-specific thinking |
| Deduplication | Within-source and cross-source | Mentioned dedup vaguely | No dedup strategy |
| Quality focus | Multiple quality signals | One filter | "Use all the data" |
| Data mixing | Justified ratios for code focus | Mentioned mixing | Equal split or random |

Problem 3: Scaling Law Application

A team trained a 1B model on 20B tokens and achieved a validation loss of 2.8. They trained a 3B model on 60B tokens and got 2.4. Assuming power-law scaling, predict the loss for a 10B model trained on 200B tokens. Is 200B tokens Chinchilla-optimal for 10B parameters?

Hint 1 - Direction

Use the power law $L(N, D) \approx A/N^{\alpha} + B/D^{\beta} + L_{\infty}$. With two data points, you can estimate the scaling behavior (with simplifying assumptions).

Hint 2 - Insight

Simpler approach: if both models are at Chinchilla-optimal (both have $D = 20N$), then loss scales primarily with compute $C = 6ND$. $C_1 = 6 \times 1\text{B} \times 20\text{B} = 1.2 \times 10^{20}$, $C_2 = 6 \times 3\text{B} \times 60\text{B} = 1.08 \times 10^{21}$. The loss went from 2.8 to 2.4 with ~9x compute. Use this to extrapolate.

Hint 3 - Full Solution + Rubric

Approach: Both data points are Chinchilla-optimal ($D/N = 20$), so we can use the compute scaling law:

$$L(C) = \left(\frac{C_c}{C}\right)^{\alpha} + L_{\infty}$$

Compute for each:

  • $C_1 = 6 \times 10^9 \times 20 \times 10^9 = 1.2 \times 10^{20}$
  • $C_2 = 6 \times 3 \times 10^9 \times 60 \times 10^9 = 1.08 \times 10^{21}$
  • $C_3 = 6 \times 10 \times 10^9 \times 200 \times 10^9 = 1.2 \times 10^{22}$

Assuming $L_{\infty} \approx 1.7$ (typical for well-filtered web data):

From data points:

  • $2.8 - 1.7 = 1.1 = (C_c / 1.2 \times 10^{20})^{\alpha}$
  • $2.4 - 1.7 = 0.7 = (C_c / 1.08 \times 10^{21})^{\alpha}$

Dividing: $\frac{1.1}{0.7} = \left(\frac{1.08 \times 10^{21}}{1.2 \times 10^{20}}\right)^{\alpha} = 9^{\alpha}$

$$\alpha = \frac{\ln(1.1/0.7)}{\ln(9)} = \frac{\ln(1.571)}{2.197} = \frac{0.452}{2.197} \approx 0.206$$

For $C_3 = 1.2 \times 10^{22}$ (about 11x $C_2$):

$$L_3 = 1.7 + 0.7 \times \left(\frac{1.08 \times 10^{21}}{1.2 \times 10^{22}}\right)^{0.206} = 1.7 + 0.7 \times (0.09)^{0.206}$$

$$(0.09)^{0.206} = e^{0.206 \times \ln(0.09)} = e^{0.206 \times (-2.408)} = e^{-0.496} \approx 0.609$$

$$L_3 \approx 1.7 + 0.7 \times 0.609 \approx 1.7 + 0.43 \approx 2.13$$

Predicted loss: ~2.1

Is 200B Chinchilla-optimal for 10B? Yes: D/N=200B/10B=20D/N = 200B/10B = 20, which matches the Chinchilla ratio.
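The same two-point fit is easy to reproduce in code, which also makes it simple to see how sensitive the prediction is to the assumed irreducible loss. This is a sketch of the calculation above, not a general fitting routine: the function names are illustrative, and `l_inf = 1.7` is the solution's assumption rather than a measured value.

```python
import math


def fit_alpha(c1, l1, c2, l2, l_inf):
    """Exponent of L(C) = (C_c / C)^alpha + L_inf from two (compute, loss) points."""
    return math.log((l1 - l_inf) / (l2 - l_inf)) / math.log(c2 / c1)


def predict_loss(c, c_ref, l_ref, alpha, l_inf):
    """Extrapolate from a reference point along the fitted power law."""
    return l_inf + (l_ref - l_inf) * (c_ref / c) ** alpha


c1, c2, c3 = 6 * 1e9 * 20e9, 6 * 3e9 * 60e9, 6 * 10e9 * 200e9
for l_inf in (1.5, 1.7, 1.9):
    alpha = fit_alpha(c1, 2.8, c2, 2.4, l_inf)
    l3 = predict_loss(c3, c2, 2.4, alpha, l_inf)
    print(f"L_inf={l_inf}: alpha={alpha:.3f}, predicted loss={l3:.2f}")
```

With only two data points the fit is exactly determined, so any change to the assumed irreducible loss shifts both the exponent and the prediction; in practice you would fit more points and report an interval.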

Scoring Rubric:

| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Set up scaling law | Correct formula with compute | Vague "loss decreases with scale" | Cannot set up |
| Numerical answer | Within 0.2 of ~2.1 | Within 0.5 | Cannot compute |
| Chinchilla check | Correctly identifies $D/N = 20$ | Mentions Chinchilla vaguely | Does not check |
| Caveats | "This assumes perfect scaling - real loss depends on data quality, training stability" | Mentions uncertainty | Presents answer as certain |

Problem 4: Parallelism Strategy

You need to train a 30B parameter model on a cluster of 256 H100 GPUs (32 nodes, 8 GPUs per node, NVLink within node, InfiniBand across nodes). Each GPU has 80 GB memory. Design the 3D parallelism strategy. Estimate memory usage per GPU.

Hint 1 - Direction

Start with memory requirements: 30B params in bf16 = 60 GB for weights alone. Optimizer states (Adam) = 12 bytes per param = 360 GB. This does not fit on one GPU, and once gradients and activations are added, even a full 8-GPU node (640 GB in total) has very little headroom.

Hint 2 - Insight

With FSDP (ZeRO-3), optimizer states are sharded. TP within nodes (8-way) to split layer computation. Consider PP = 2 or 4 for additional memory relief. DP handles the rest.

Hint 3 - Full Solution + Rubric

Memory analysis (no parallelism):

  • Model weights (bf16): $30 \times 10^9 \times 2 = 60$ GB
  • Adam optimizer: $30 \times 10^9 \times 12 = 360$ GB (fp32 weights + first moment + second moment)
  • Gradients (bf16): 60 GB
  • Activations: ~40-80 GB (depends on batch size and sequence length)
  • Total: ~520-560 GB (480 GB of weights, optimizer states, and gradients, plus activations)

Proposed strategy:

| Dimension | Degree | Scope | Rationale |
| --- | --- | --- | --- |
| TP | 4 | Within node | Splits each layer across 4 GPUs; 2 TP groups per node |
| PP | 2 | Across 2 nodes | Splits model into 2 pipeline stages |
| DP (FSDP) | 32 | Remaining dimension | 256 / (4 * 2) = 32-way data parallel with full sharding |
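To see how these degrees could map onto physical GPUs, here is a small sketch that assigns each global rank a (TP, DP, PP) coordinate. It assumes one particular rank ordering (TP varies fastest, then DP, then PP) and consecutive ranks within a node. Real frameworks expose their own ordering options, so the helper names and the layout below are illustrative only.

```python
from collections import namedtuple

GPUS_PER_NODE = 8
TP, PP, DP = 4, 2, 32
WORLD_SIZE = TP * PP * DP  # 256 GPUs

Coord = namedtuple("Coord", ["tp", "dp", "pp"])


def coord_of(rank):
    """Assumed ordering: TP index varies fastest, then DP, then PP."""
    return Coord(tp=rank % TP, dp=(rank // TP) % DP, pp=rank // (TP * DP))


def node_of(rank):
    """Assumes ranks are numbered consecutively within each 8-GPU node."""
    return rank // GPUS_PER_NODE


def tp_group(rank):
    """Ranks that share this rank's DP and PP coordinates."""
    base = rank - rank % TP
    return list(range(base, base + TP))


def pp_group(rank):
    """Ranks that hold the other pipeline stages for the same TP/DP coordinates."""
    within_stage = rank % (TP * DP)
    return [within_stage + stage * TP * DP for stage in range(PP)]


# Every TP group must sit inside a single node so its traffic stays on NVLink.
tp_stays_in_node = all(
    len({node_of(r) for r in tp_group(rank)}) == 1 for rank in range(WORLD_SIZE)
)

print("world size:", WORLD_SIZE)
print("rank 5 coord:", coord_of(5))
print("rank 5 TP group:", tp_group(5), "nodes:", {node_of(r) for r in tp_group(5)})
print("rank 5 PP group:", pp_group(5), "nodes:", [node_of(r) for r in pp_group(5)])
print("all TP groups within one node:", tp_stays_in_node)
```

Under this ordering the two TP groups in a node belong to different DP ranks, and each pipeline pair spans two nodes, so only the lighter point-to-point pipeline traffic crosses InfiniBand.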

Memory per GPU with this config:

  • Model weights: $60 \text{ GB} / 4 \text{ (TP)} = 15$ GB per GPU (only this GPU's TP shard; this is a conservative figure, since the PP=2 split would roughly halve it to 7.5 GB)
  • Optimizer states: $360 \text{ GB} / 32 \text{ (FSDP)} / 4 \text{ (TP)} / 2 \text{ (PP)} \approx 1.4$ GB (sharded across all DP ranks, TP, and PP)
  • Gradients (sharded): ~0.5 GB
  • Activations (with checkpointing): ~10-20 GB
  • Total: ~27-37 GB per GPU - fits comfortably in 80 GB
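The per-GPU estimate can be reproduced with a few lines of arithmetic. This is a rough sketch under the same assumptions as the bullets above: decimal gigabytes, 2 bytes per parameter for bf16 weights and gradients, and 12 bytes per parameter for Adam state. The function names are hypothetical. Gradients are sharded here the same way as the optimizer states, which comes out a little below the rounded ~0.5 GB figure.

```python
GB = 1e9  # decimal gigabytes, matching the estimates in the text


def unsharded_memory_gb(n_params):
    """Training state with no parallelism, excluding activations."""
    return {
        "weights": n_params * 2 / GB,     # bf16
        "gradients": n_params * 2 / GB,   # bf16
        "optimizer": n_params * 12 / GB,  # fp32 master weights + two Adam moments
    }


def per_gpu_memory_gb(n_params, tp, pp, dp, activations_gb):
    """Per-GPU estimate for a TP x PP x DP (fully sharded) layout."""
    full = unsharded_memory_gb(n_params)
    shards = tp * pp * dp
    return {
        # Weights are divided by TP only, as in the text (conservative);
        # dividing by pp as well would give the per-stage figure.
        "weights": full["weights"] / tp,
        "optimizer": full["optimizer"] / shards,
        "gradients": full["gradients"] / shards,
        "activations": activations_gb,  # assumed value, depends on batch and sequence length
    }


full = unsharded_memory_gb(30e9)
low = per_gpu_memory_gb(30e9, tp=4, pp=2, dp=32, activations_gb=10)
high = per_gpu_memory_gb(30e9, tp=4, pp=2, dp=32, activations_gb=20)

print("unsharded state (GB):", sum(full.values()))
for name, value in low.items():
    print(f"{name:12s} {value:6.2f} GB")
print(f"total per GPU: {sum(low.values()):.1f} to {sum(high.values()):.1f} GB")
```

Real usage is typically higher than this kind of estimate because of temporary all-gather buffers, fragmentation, and framework overhead, so headroom below 80 GB is worth keeping.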

Why TP=4 not TP=8:

  • TP=8 means all 8 GPUs in a node work on the same layer - maximum layer splitting
  • TP=4 leaves room for 2 independent TP groups per node, which can be in different pipeline stages or DP groups
  • TP=4 also has less communication overhead than TP=8

Effective batch size: With DP=32, micro-batch=2 per DP rank, grad accumulation=4: $32 \times 2 \times 4 = 256$ sequences. At 4096 tokens/seq: ~1M tokens per step.
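The same batch arithmetic as a short sketch. The 600B-token budget is an illustrative assumption (20 tokens per parameter for a 30B model, following the Chinchilla ratio used earlier), not a figure from the problem statement.

```python
import math


def global_batch(dp, micro_batch, grad_accum, seq_len):
    """Return (sequences per optimizer step, tokens per optimizer step)."""
    sequences = dp * micro_batch * grad_accum
    return sequences, sequences * seq_len


sequences, tokens = global_batch(dp=32, micro_batch=2, grad_accum=4, seq_len=4096)

# Hypothetical token budget: 20 tokens per parameter for a 30B model
token_budget = 20 * 30_000_000_000
steps = math.ceil(token_budget / tokens)

print(f"{sequences} sequences/step, {tokens:,} tokens/step")
print(f"{steps:,} optimizer steps for {token_budget:,} tokens")
```

Changing the micro-batch or accumulation count shifts memory and throughput but leaves the optimizer's view unchanged, as long as the product stays at 256 sequences.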

Scoring Rubric:

| Criterion | Strong Hire | Lean Hire | No Hire |
| --- | --- | --- | --- |
| Memory calculation | Correct breakdown | Approximate but reasonable | Could not estimate |
| TP within node | Correctly placed within NVLink | Used TP across nodes | Did not consider hardware topology |
| FSDP for optimizer | Sharded optimizer across DP ranks | Mentioned FSDP | Full optimizer on every GPU |
| Practical details | Batch size, activation checkpointing, effective throughput | Mentioned some considerations | Pure theory |

Interview Cheat Sheet

| Concept | Key Fact / Formula | Common Follow-Up |
| --- | --- | --- |
| Data filtering | Wikipedia-quality classifier + heuristics (length, repetition, perplexity) | "How do you handle bias in the filter?" |
| Deduplication | MinHash + LSH for fuzzy dedup; cross-source is critical | "What similarity threshold? How to scale?" |
| BPE tokenization | Iteratively merge most frequent pairs; byte-level for open vocab | "Vocab size tradeoffs? Multilingual impact?" |
| Causal LM objective | $-\sum_t \log P(x_t \mid x_{1..t-1})$; equivalent to compression | "Connection to perplexity? To entropy?" |
| FIM | PRE + SUF + MID format; 50% rate; critical for code infilling | "Why not 100% FIM?" |
| Chinchilla | $D/N \approx 20$ for compute-optimal; 70B needs 1.4T tokens | "Why is LLaMA 3 8B trained on 15T?" |
| Inference-aware | Over-train smaller models for cheaper serving | "What is the cost crossover point?" |
| Training FLOPs | $C = 6ND$; GPU-hours = $C / (\text{FLOPS} \times \text{MFU} \times 3600)$ | "What MFU is realistic?" |
| 3D parallelism | TP within node (NVLink), PP across few nodes, DP for the rest | "Why not TP=8 everywhere?" |
| FSDP | Shard optimizer + gradients + params across DP ranks | "Memory per GPU with ZeRO-3?" |
| Checkpointing | Every 100-1000 steps; async to distributed storage | "MTBF for 1000 GPUs?" |
| Data mixing | Code ~20-40%, web ~30-50%, math ~5-10%; curriculum over training | "How do you decide the mix?" |
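The Chinchilla and training-FLOPs rows can be combined into a quick calculator. This is a sketch under two loudly stated assumptions: a dense BF16 peak of about 989 TFLOPS per H100 (the SXM datasheet figure, so check it against your own hardware) and an MFU of 40%, which is a placeholder rather than a measured value. The 512-GPU cluster size comes from the opening scenario.

```python
H100_BF16_PEAK_FLOPS = 989e12  # ASSUMPTION: dense BF16 peak of an H100 SXM
ASSUMED_MFU = 0.40             # ASSUMPTION: placeholder model FLOPs utilization


def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal token count under the D/N ~ 20 rule of thumb."""
    return tokens_per_param * n_params


def training_flops(n_params, n_tokens):
    """C = 6ND approximation for a dense transformer."""
    return 6 * n_params * n_tokens


def gpu_hours(flops, peak_flops=H100_BF16_PEAK_FLOPS, mfu=ASSUMED_MFU):
    """GPU-hours = C / (FLOPS x MFU x 3600)."""
    return flops / (peak_flops * mfu * 3600)


n_params = 70e9
n_tokens = chinchilla_tokens(n_params)        # 1.4T tokens, as in the table
compute = training_flops(n_params, n_tokens)
hours = gpu_hours(compute)
days_on_512 = hours / 512 / 24

print(f"tokens:      {n_tokens:.2e}")
print(f"FLOPs:       {compute:.2e}")
print(f"GPU-hours:   {hours:,.0f}")
print(f"days on 512: {days_on_512:.1f}")
```

The result is very sensitive to MFU, so it is worth quoting a range (for example, rerunning with a lower and a higher value) rather than a single number in an interview.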

Spaced Repetition Checkpoints

Day 0 (After reading this chapter)

  • Draw the data pipeline from raw crawl to training-ready tokens (10 stages)
  • Write the causal LM loss formula and explain its connection to compression
  • Calculate: how many tokens is Chinchilla-optimal for a 13B model?
  • List the three parallelism dimensions and when each is used

Day 3

  • Explain MinHash deduplication in 60 seconds
  • Compare BPE, SentencePiece, and tiktoken - key differences
  • Why did the field move from Chinchilla-optimal to inference-aware scaling?
  • What is the difference between activation checkpointing and model checkpointing?

Day 7

  • Design a data pipeline for a code-focused model (from memory)
  • Calculate training compute for a 7B model on 2T tokens; estimate GPU-hours on 64 H100s
  • Explain 3D parallelism for a 70B model on 512 GPUs - assign TP, PP, DP degrees
  • What is FIM and why is it important for code models?

Day 14

  • Do Practice Problem 1 (compute budget) from scratch, timed (10 min)
  • Explain curriculum learning - what changes during training and why
  • Discuss fault tolerance: checkpointing frequency, recovery protocol, spare nodes
  • Quiz yourself on all cheat sheet entries - 60 seconds each

Day 21

  • Full mock: "You have $1M and need the best code model possible. Walk me through everything."
  • Re-take the self-assessment - all scores should be 4+
  • Solve Problem 3 (scaling law extrapolation) without hints
  • Explain the end-to-end pretraining pipeline in 5 minutes to a non-expert
© 2026 EngineersOfAI. All rights reserved.