Scaling Laws - The Science of Making Models Bigger
Reading time: ~45 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, LLM Engineer, ML Infrastructure Engineer
The Real Interview Moment
You are in an Anthropic research interview. The interviewer asks: "You have a fixed compute budget of roughly $10^{23}$ FLOPs. Should you train a 10B parameter model on 2T tokens, or a 70B parameter model on 300B tokens? Show me the math."
You start with the Chinchilla scaling law, and she follows up: "Kaplan et al. (2020) would have given a different answer. What did Kaplan get wrong, and why does the Chinchilla-optimal ratio matter for production deployments? If you are deploying to millions of users, would you actually train Chinchilla-optimally?"
This question tests whether you understand the quantitative science of scaling - not just "bigger is better," but the precise mathematical relationships between parameters, data, and compute. Candidates who can only say "more data is better" without citing the specific power law exponents or explaining the Kaplan-Chinchilla disagreement get a "lean no-hire." Candidates who can derive the compute-optimal allocation, explain why production models are deliberately overtrained, and reason about inference cost tradeoffs get a "strong hire."
What You Will Master
- State the three core scaling law variables and their power law relationships
- Derive the compute-optimal training allocation (Chinchilla law)
- Explain what Kaplan got wrong and why Chinchilla corrected it
- Calculate optimal model size and token count for a given compute budget
- Discuss why production models are deliberately overtrained
- Reason about inference cost and the training-inference tradeoff
- Apply scaling laws to practical model design decisions
Self-Assessment: Where Are You Now?
| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Explain | 4 - Can Derive | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| State the three scaling law variables | | | | | | ___ |
| Write the power law equations | | | | | | ___ |
| Explain Kaplan scaling laws | | | | | | ___ |
| Explain Chinchilla scaling laws | | | | | | ___ |
| Derive compute-optimal allocation | | | | | | ___ |
| Calculate optimal N and D for a compute budget | | | | | | ___ |
| Explain overtraining and why it is done | | | | | | ___ |
| Discuss inference cost tradeoffs | | | | | | ___ |
| Explain emergent abilities and scaling | | | | | | ___ |
| Apply scaling laws to a design decision | | | | | | ___ |
Target: All 4s and 5s before your interview.
Part 1 - The Three Variables of Scale
The Core Insight
The fundamental discovery of scaling laws research is that language model performance (measured by cross-entropy loss) follows smooth power law relationships with three variables:
- $N$ - Number of model parameters
- $D$ - Number of training tokens (dataset size)
- $C$ - Compute budget (in FLOPs)
These three variables are not independent. For a Transformer language model, the compute required for training is approximately:
$$C \approx 6ND$$
Where the factor of 6 accounts for the forward pass (~2ND FLOPs) and backward pass (~4ND FLOPs).
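As a quick sanity check, the $C \approx 6ND$ approximation can be applied to the two options from the interview question. This is a minimal sketch; the function name `training_flops` is just an illustrative helper, not something from a library.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ~= 6 * N * D.

    Roughly 2ND FLOPs for the forward pass plus 4ND for the backward pass.
    """
    return 6.0 * n_params * n_tokens


# The two options from the interview question
option_small = training_flops(10e9, 2e12)    # 10B params on 2T tokens
option_large = training_flops(70e9, 300e9)   # 70B params on 300B tokens

print(f"10B x 2T   : {option_small:.2e} FLOPs")
print(f"70B x 300B : {option_large:.2e} FLOPs")
```

Both options land within a few percent of each other, so the question really is about how to split one budget between parameters and data.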
Power Law Relationships
Each variable, when the others are not limiting, produces a power law improvement in loss:
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$
Where $N_c$, $D_c$, $C_c$ are characteristic constants and $\alpha_N$, $\alpha_D$, $\alpha_C$ are the scaling exponents.
The crucial implication: performance improves as a power law, not linearly. Doubling compute does not double performance - it produces a fixed fractional improvement. Getting the next increment of performance always costs more than the last.
"Scaling laws describe the empirical observation that language model loss follows smooth power laws with respect to model size, dataset size, and compute. The key equation is $C \approx 6ND$, relating compute to parameters and data. Given a fixed compute budget, there is a unique optimal allocation between model size and data size that minimizes loss. Kaplan et al. (2020) found you should scale models faster than data. Chinchilla (2022) corrected this, showing parameters and data should scale equally - roughly 20 tokens per parameter. This has profound implications for which models to train and deploy."
Part 2 - Kaplan Scaling Laws (2020)
The Paper
"Scaling Laws for Neural Language Models" - Kaplan et al., 2020 (OpenAI)
The Key Findings
Kaplan et al. trained hundreds of language models ranging from 768 parameters to 1.5 billion parameters and measured how loss scaled. Their findings:
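Finding 1: Smooth power laws. Loss falls as a smooth power law in each of $N$, $D$, and $C$ when the other two are not the bottleneck, with fitted exponents of roughly $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, and $\alpha_C \approx 0.050$:
$$L(N) = \left(\frac{8.8 \times 10^{13}}{N}\right)^{0.076}, \quad L(D) = \left(\frac{5.4 \times 10^{13}}{D}\right)^{0.095}$$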
Finding 2: Performance depends strongly on scale, weakly on architecture.
Model shape (depth vs width, number of heads) matters much less than total parameter count. A wide-shallow model with $N$ parameters performs similarly to a narrow-deep model with $N$ parameters.
Finding 3: The optimal allocation favors larger models.
Given a fixed compute budget, Kaplan concluded you should use most of the compute for a large model trained on relatively little data. Specifically, they claimed:
$$N_{opt} \propto C^{0.73}, \quad D_{opt} \propto C^{0.27}$$
This means compute should go mostly to model size, with data growing much more slowly.
```python
import numpy as np

# Kaplan scaling law predictions
def kaplan_loss_N(N):
    """Loss as a function of model parameters (Kaplan)."""
    return (8.8e13 / N) ** 0.076

def kaplan_loss_D(D):
    """Loss as a function of training tokens (Kaplan)."""
    return (5.4e13 / D) ** 0.095

def kaplan_optimal_allocation(C):
    """
    Kaplan's compute-optimal allocation.
    Favors larger models with less data.
    """
    # N_opt ∝ C^0.73, D_opt ∝ C^0.27
    # The 1.3e9 prefactor and the 1e21 FLOPs reference point are
    # illustrative fit constants; D_opt follows from C = 6ND.
    N_opt = 1.3e9 * (C / 1e21) ** 0.73
    D_opt = C / (6 * N_opt)
    tokens_per_param = D_opt / N_opt
    return N_opt, D_opt, tokens_per_param

# Example: what does Kaplan recommend for different compute budgets?
for log_C in [20, 21, 22, 23, 24]:
    C = 10 ** log_C
    N, D, ratio = kaplan_optimal_allocation(C)
    print(f"C=10^{log_C}: N={N/1e9:.1f}B, D={D/1e9:.0f}B tokens, "
          f"ratio={ratio:.1f} tokens/param")

# Kaplan's prediction: model size grows much faster than data.
# The tokens-per-parameter ratio therefore FALLS as compute grows
# (D/N ∝ C^-0.46): with these constants, from roughly 100 tokens/param
# at 1e21 FLOPs down to about 4 tokens/param at 1e24 FLOPs.
```
What Kaplan Got Wrong
Kaplan's conclusion - favor model size over data - had a critical methodological flaw:
They did not train to convergence. Kaplan used a fixed number of training steps per model size, which meant larger models saw proportionally fewer tokens relative to their capacity. This biased the results toward recommending larger models.
In other words, Kaplan compared:
- Small model trained for many tokens ✓ (near convergence)
- Large model trained for few tokens ✗ (far from convergence)
Naturally, the large model showed more room for improvement, making it seem like scaling model size was more efficient than scaling data.
Do not say "Kaplan showed that model size matters more than data." This is Kaplan's conclusion but it was wrong. Chinchilla showed that when you properly control for training duration, data is equally important. If an interviewer asks about Kaplan, always mention the correction: "Kaplan's methodology did not train to convergence, which biased the results toward larger models. Chinchilla corrected this."
Part 3 - Chinchilla Scaling Laws (2022)
The Paper
"Training Compute-Optimal Large Language Models" - Hoffmann et al., 2022 (DeepMind)
The Key Correction
Hoffmann et al. trained over 400 models ranging from 70M to 16B parameters, each trained on 5B to 500B tokens. Crucially, they varied both model size and training tokens for each compute budget, training many models to different degrees of convergence.
The Chinchilla Law
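Their finding was dramatically different from Kaplan:
$$N_{opt} \propto C^{0.5}, \quad D_{opt} \propto C^{0.5}$$
Parameters and data should scale equally. For every doubling of model size, the number of training tokens should also double; since $C \approx 6ND$, that means each 4x increase in compute is split as 2x parameters and 2x data.
The specific ratio they found:
$$D_{opt} \approx 20 \cdot N_{opt}$$
You should train on approximately 20 tokens per parameter.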
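Combining the 20 tokens-per-parameter rule with $C \approx 6ND$ gives $C \approx 120 N^2$, so the rule-of-thumb allocation is a one-line calculation. A minimal sketch follows; the helper name `chinchilla_rule_of_thumb` and its `tokens_per_param` argument are illustrative choices, not an established API.

```python
import math


def chinchilla_rule_of_thumb(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget using D = tokens_per_param * N and C = 6 * N * D.

    Substituting gives C = 6 * tokens_per_param * N**2, which is solved for N.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n_opt, d_opt = chinchilla_rule_of_thumb(1e23)
print(f"N ~ {n_opt / 1e9:.0f}B parameters, D ~ {d_opt / 1e9:.0f}B tokens")
```

Because $N \propto \sqrt{C}$ under this rule, quadrupling the budget doubles both the model size and the token count.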
The Joint Scaling Law
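Chinchilla proposed a parametric loss function that depends on both $N$ and $D$:
$$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$
Where:
- $A / N^\alpha$ - the model size term (decreasing loss as model gets bigger)
- $B / D^\beta$ - the data term (decreasing loss as data gets larger)
- $E$ - the irreducible loss (entropy of natural language, cannot be improved by any model)
- $\alpha \approx 0.34$, $\beta \approx 0.28$
The fitted parameters:
| Parameter | Value |
|---|---|
| $A$ | 406.4 |
| $B$ | 410.7 |
| $\alpha$ | 0.34 |
| $\beta$ | 0.28 |
| $E$ | 1.69 |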
Computing the Optimal Allocation
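Given compute $C$, we want to minimize $L(N, D)$ subject to $C = 6ND$:
$$\mathcal{L}(N, D, \lambda) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E + \lambda \, (6ND - C)$$
Using a Lagrange multiplier and solving:
$$\frac{\partial \mathcal{L}}{\partial N} = -\frac{\alpha A}{N^{\alpha+1}} + 6 \lambda D = 0 \implies \frac{\alpha A}{N^{\alpha+1}} = 6 \lambda D$$
Similarly for $D$:
$$\frac{\partial \mathcal{L}}{\partial D} = -\frac{\beta B}{D^{\beta+1}} + 6 \lambda N = 0 \implies \frac{\beta B}{D^{\beta+1}} = 6 \lambda N$$
Dividing:
$$\frac{\alpha A}{N^{\alpha+1}} \cdot \frac{D^{\beta+1}}{\beta B} = \frac{D}{N} \implies \frac{\alpha A}{N^\alpha} = \frac{\beta B}{D^\beta}$$
This says: at the optimum, the marginal improvement from scaling the model equals the marginal improvement from scaling the data. Neither is a bottleneck - both contribute equally.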
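The balance condition between the model term and the data term can be solved in closed form by substituting $D = C / (6N)$. The sketch below does that with the fitted constants quoted in this chapter ($A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$) and then checks that the two marginal terms really are equal at the solution. The function name `closed_form_allocation` is an illustrative choice.

```python
A, B = 406.4, 410.7        # fitted coefficients quoted in this chapter
ALPHA, BETA = 0.34, 0.28   # fitted exponents quoted in this chapter


def closed_form_allocation(compute_flops: float):
    """Solve alpha*A / N**alpha == beta*B / D**beta subject to 6*N*D == C.

    Substituting D = C / (6N) gives
        N = G * (C/6) ** (beta / (alpha + beta)),
        D = (C/6) ** (alpha / (alpha + beta)) / G,
    with G = (alpha*A / (beta*B)) ** (1 / (alpha + beta)).
    """
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    a = BETA / (ALPHA + BETA)    # exponent of C for N_opt
    b = ALPHA / (ALPHA + BETA)   # exponent of C for D_opt
    n_opt = g * (compute_flops / 6.0) ** a
    d_opt = (compute_flops / 6.0) ** b / g
    return n_opt, d_opt


n_opt, d_opt = closed_form_allocation(1e23)
model_marginal = ALPHA * A / n_opt ** ALPHA
data_marginal = BETA * B / d_opt ** BETA
print(f"N_opt = {n_opt:.3e}, D_opt = {d_opt:.3e}")
print(f"model term = {model_marginal:.5f}, data term = {data_marginal:.5f}")
```

Note that with these particular fitted constants the exponents are about 0.45 and 0.55 rather than exactly 0.5, so the implied tokens-per-parameter ratio drifts with compute instead of sitting exactly at 20.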
```python
import numpy as np
from scipy.optimize import minimize_scalar

# Chinchilla scaling law parameters
A = 406.4
B = 410.7
alpha = 0.34
beta = 0.28
E = 1.69

def chinchilla_loss(N, D):
    """Compute loss given model size N and data size D."""
    return A / N**alpha + B / D**beta + E

def optimal_allocation(C):
    """
    Find compute-optimal N and D for a given compute budget C.
    Constraint: C = 6 * N * D
    """
    def loss_given_N(log_N):
        N = np.exp(log_N)
        D = C / (6 * N)
        if D <= 0:
            return 1e10
        return chinchilla_loss(N, D)

    # Search over possible model sizes (in log space for numerical stability)
    result = minimize_scalar(
        loss_given_N,
        bounds=(np.log(1e6), np.log(1e12)),
        method='bounded'
    )
    N_opt = np.exp(result.x)
    D_opt = C / (6 * N_opt)
    loss = chinchilla_loss(N_opt, D_opt)
    ratio = D_opt / N_opt
    return N_opt, D_opt, loss, ratio

# What does Chinchilla recommend for various compute budgets?
print(f"{'Compute':>12s} | {'N_opt':>10s} | {'D_opt':>12s} | "
      f"{'Loss':>6s} | {'Tokens/Param':>12s}")
print("-" * 70)
for log_C in [19, 20, 21, 22, 23, 24, 25]:
    C = 10 ** log_C
    N, D, loss, ratio = optimal_allocation(C)
    print(f"10^{log_C:>2d} FLOPs | {N/1e9:>8.2f}B | {D/1e9:>10.1f}B tok | "
          f"{loss:>6.3f} | {ratio:>10.1f}")

# Note: with these fitted constants the ratio is NOT pinned at 20.
# The minimizer gives N_opt ∝ C^0.45 and D_opt ∝ C^0.55, so tokens/param
# rises slowly with compute (roughly 50 at 1e21 FLOPs, roughly 120 at 1e25).
# The ~20 tokens/param rule of thumb corresponds to the equal-exponent result.
```
Chinchilla's Proof by Construction
To validate their scaling law, DeepMind trained Chinchilla - a 70B parameter model on 1.4T tokens (about $5.9 \times 10^{23}$ FLOPs by $C \approx 6ND$). This was the compute-optimal allocation for their budget.
For comparison, Gopher was a 280B parameter model trained on 300B tokens with approximately the same compute budget. According to Kaplan, Gopher should have been preferred (larger model). According to Chinchilla, the compute would be better spent on a smaller model with more data.
| Model | Parameters | Tokens | Compute | Tokens/Param | MMLU |
|---|---|---|---|---|---|
| Gopher | 280B | 300B | ~$5.0 \times 10^{23}$ | 1.1 | 60.0% |
| Chinchilla | 70B | 1.4T | ~$5.9 \times 10^{23}$ | 20.0 | 67.6% |
Same compute, dramatically different performance. Chinchilla proved that Gopher was undertrained - it had too many parameters for the amount of data it saw.
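The Gopher versus Chinchilla comparison can also be run through the parametric loss from this chapter. The sketch below is only an illustration: it plugs both configurations into $L(N, D) = A/N^\alpha + B/D^\beta + E$ with the fitted constants quoted earlier, and the predicted losses are outputs of that fit, not measured benchmark results.

```python
A, B, ALPHA, BETA, E = 406.4, 410.7, 0.34, 0.28, 1.69  # constants quoted in this chapter


def parametric_loss(n_params: float, n_tokens: float) -> float:
    """Chinchilla-style parametric loss L(N, D) = A/N^alpha + B/D^beta + E."""
    return A / n_params ** ALPHA + B / n_tokens ** BETA + E


# (parameters, training tokens) from the comparison table
models = {"Gopher": (280e9, 300e9), "Chinchilla": (70e9, 1.4e12)}

summary = {}
for name, (n, d) in models.items():
    summary[name] = {
        "flops": 6 * n * d,                  # C ~= 6ND
        "tokens_per_param": d / n,
        "predicted_loss": parametric_loss(n, d),
    }
    s = summary[name]
    print(f"{name:>10s}: C={s['flops']:.2e} FLOPs, "
          f"{s['tokens_per_param']:.1f} tok/param, "
          f"predicted loss={s['predicted_loss']:.3f}")
```

Under this fit Gopher's loss is dominated by the data term, which is the quantitative version of "too many parameters for the amount of data it saw".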
If asked "What is the Chinchilla scaling law?" and you answer "Bigger models are better" - that is the opposite of the point. Chinchilla showed that simply making models bigger (Kaplan's approach) is wasteful. The optimal strategy is to scale parameters and data equally, at approximately 20 tokens per parameter. Many models at the time (GPT-3, Gopher) were massively undertrained relative to their size.
Part 4 - Implications for Model Design
Was GPT-3 Undertrained?
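GPT-3 (175B parameters) was trained on 300B tokens:
$$\frac{D}{N} = \frac{300\text{B}}{175\text{B}} \approx 1.7 \text{ tokens per parameter}$$
Chinchilla optimal would have been $20 \times 175\text{B} = 3.5\text{T}$ tokens. GPT-3 was trained on roughly 12x fewer tokens than Chinchilla recommends.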
This means one of two things:
- GPT-3's 300B-token dataset was sized for a far smaller model: at 20 tokens per parameter, 300B tokens is the Chinchilla-optimal amount for only ~15B parameters
- GPT-3 could have performed much better if trained on 3.5T tokens
The Model Landscape After Chinchilla
Chinchilla reshaped the entire field:
| Model | Year | Parameters | Tokens | Tokens/Param | Chinchilla-Optimal? |
|---|---|---|---|---|---|
| GPT-3 | 2020 | 175B | 300B | 1.7 | Severely undertrained |
| Gopher | 2021 | 280B | 300B | 1.1 | Severely undertrained |
| Chinchilla | 2022 | 70B | 1.4T | 20 | Yes |
| LLaMA-7B | 2023 | 7B | 1T | 143 | Deliberately overtrained |
| LLaMA-65B | 2023 | 65B | 1.4T | 21.5 | Approximately optimal |
| Mistral-7B | 2023 | 7B | ~8T (unofficial estimate) | ~1143 | Severely overtrained |
| LLaMA-3-8B | 2024 | 8B | 15T | 1875 | Extremely overtrained |
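The tokens-per-parameter column is easy to recompute, which is a useful habit when a new model is announced. A minimal sketch using the parameter and token counts from the table; the 25% tolerance band used to call a run "approximately optimal" is an arbitrary choice made for this example, and Mistral-7B is left out because its token count is only approximate.

```python
CHINCHILLA_RATIO = 20.0  # tokens per parameter (rule of thumb)

# (parameters, training tokens) as listed in the table
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-7B": (7e9, 1e12),
    "LLaMA-65B": (65e9, 1.4e12),
    "LLaMA-3-8B": (8e9, 15e12),
}


def classify(ratio: float, tolerance: float = 0.25) -> str:
    """Label a run relative to the 20 tokens/param rule (tolerance is arbitrary)."""
    if ratio < CHINCHILLA_RATIO * (1 - tolerance):
        return "undertrained"
    if ratio > CHINCHILLA_RATIO * (1 + tolerance):
        return "overtrained"
    return "approximately optimal"


report = {}
for name, (n_params, n_tokens) in MODELS.items():
    ratio = n_tokens / n_params
    report[name] = (ratio, classify(ratio))
    print(f"{name:>12s}: {ratio:8.1f} tokens/param -> {report[name][1]}")
```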
Why Are Modern Models Overtrained?
This brings us to a crucial insight that interview candidates often miss: Chinchilla-optimal training minimizes loss for a given training compute budget, but not for a given inference compute budget.
Part 5 - The Training-Inference Tradeoff
The Cost Model
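The total cost of an LLM is:
$$C_{\text{total}} = C_{\text{train}} + C_{\text{inference}} \approx 6ND + 2N \cdot Q$$
Where $Q$ is the total number of inference queries (counted in tokens) over the model's lifetime.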
For a model serving millions of users:
- Training cost is fixed (one-time)
- Inference cost scales with both model size and query volume
- Inference cost often dwarfs training cost
Why Overtrain?
Consider two ways to spend the same training compute:
Option A (Chinchilla-optimal): 70B model, 1.4T tokens
- Training cost: $6 \times 70\text{B} \times 1.4\text{T} \approx 5.9 \times 10^{23}$ FLOPs
- Inference cost per token: proportional to 70B parameters
Option B (Overtrained): 7B model, 14T tokens
- Training cost: $6 \times 7\text{B} \times 14\text{T} \approx 5.9 \times 10^{23}$ FLOPs
- Inference cost per token: proportional to 7B parameters - 10x cheaper!
Both use the same training compute. But Option B produces a model that is 10x cheaper to serve. If you are processing billions of queries, the inference savings far outweigh any suboptimality in training.
```python
import numpy as np

def total_cost(N_params, D_tokens, queries, cost_per_training_flop, cost_per_inference_flop):
    """
    Total cost = training + inference.
    N_params: model parameters
    D_tokens: training tokens
    queries: total lifetime inference queries (in tokens)
    """
    # Training cost: 6ND FLOPs
    training_flops = 6 * N_params * D_tokens
    training_cost = training_flops * cost_per_training_flop
    # Inference cost: ~2N FLOPs per token (forward pass only)
    inference_flops_per_token = 2 * N_params
    inference_cost = inference_flops_per_token * queries * cost_per_inference_flop
    return training_cost, inference_cost, training_cost + inference_cost

# Compare Chinchilla-optimal vs overtrained
# Assume same total training compute (same budget)
# The per-FLOP prices below are illustrative, not quoted market rates
cost_train = 1e-18  # $/FLOP for training
cost_infer = 3e-18  # $/FLOP for inference (less efficient due to small batches)

# Scenario: 1 billion inference queries of 500 tokens each
total_inference_tokens = 1e9 * 500

print(f"{'Model':>20s} | {'Train $':>12s} | {'Infer $':>12s} | {'Total $':>12s}")
print("-" * 65)
for name, N, D in [
    ("Chinchilla-70B", 70e9, 1.4e12),
    ("Overtrained-7B", 7e9, 14e12),
    ("Overtrained-1B", 1e9, 98e12),
]:
    tc, ic, total = total_cost(N, D, total_inference_tokens, cost_train, cost_infer)
    print(f"{name:>20s} | ${tc/1e6:>10.1f}M | ${ic/1e6:>10.1f}M | ${total/1e6:>10.1f}M")

# The overtrained 7B model costs the same to train but 10x less to serve
```
The LLaMA Philosophy
Meta's LLaMA (Touvron et al., 2023) explicitly embraced this insight: "The objective of the scaling laws is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale."
LLaMA-7B was trained on 1T tokens (143 tokens/param) - 7x more than Chinchilla-optimal. This "wasted" training compute produced a model that was dramatically cheaper to deploy and only slightly worse in quality.
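How much quality does overtraining give up? One rough way to put a number on "slightly worse" is to evaluate the Chinchilla parametric loss for a 70B model on 1.4T tokens and a 7B model on 14T tokens, which use the same training compute. This is a sketch under the assumption that the fitted constants quoted in this chapter extrapolate to these sizes; the losses are predictions of that fit, not measurements.

```python
A, B, ALPHA, BETA, E = 406.4, 410.7, 0.34, 0.28, 1.69  # constants quoted in this chapter


def parametric_loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = A/N^alpha + B/D^beta + E."""
    return A / n_params ** ALPHA + B / n_tokens ** BETA + E


def inference_flops_per_token(n_params: float) -> float:
    """Forward pass only: roughly 2N FLOPs per generated token."""
    return 2.0 * n_params


loss_optimal = parametric_loss(70e9, 1.4e12)      # Chinchilla-optimal 70B
loss_overtrained = parametric_loss(7e9, 14e12)    # overtrained 7B, same 6ND
serving_ratio = inference_flops_per_token(70e9) / inference_flops_per_token(7e9)

print(f"70B / 1.4T predicted loss: {loss_optimal:.3f}")
print(f" 7B / 14T  predicted loss: {loss_overtrained:.3f}")
print(f"loss penalty for overtraining: {loss_overtrained - loss_optimal:.3f}")
print(f"serving cost ratio (70B vs 7B): {serving_ratio:.0f}x")
```

Under this fit the overtrained model pays a loss penalty of a few hundredths of a nat while cutting per-token serving compute by 10x, which is the tradeoff the LLaMA authors were pointing at.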
"Chinchilla showed that compute-optimal training uses about 20 tokens per parameter. But modern models like LLaMA deliberately overtrain - using 100-2000 tokens per parameter - because inference cost, not training cost, dominates the total lifetime cost of a model. A 7B model trained on 14T tokens uses the same training compute as a 70B model trained on 1.4T tokens, but costs 10x less to serve. For production deployments serving millions of users, overtraining is the economically rational choice."
Part 6 - The Scaling Law Equations in Detail
The Kaplan Parametric Form
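Kaplan proposed that loss decomposes as:
$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$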
This form assumes a specific functional relationship between the model and data contributions to loss.
The Chinchilla Parametric Form
Chinchilla used a simpler additive decomposition:
$$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$
This assumes the model and data bottlenecks contribute independently, with an irreducible loss floor $E$.
Compute-Optimal Scaling Exponents
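For Chinchilla, the optimal allocation follows:
$$N_{opt} \propto C^{a}, \quad D_{opt} \propto C^{b}$$
Where $a = \frac{\beta}{\alpha + \beta}$ and $b = \frac{\alpha}{\alpha + \beta}$.
With $\alpha = 0.34$ and $\beta = 0.28$:
$$a = \frac{0.28}{0.62} \approx 0.45, \quad b = \frac{0.34}{0.62} \approx 0.55$$
Both are close to 0.5, confirming that parameters and data should scale approximately equally.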
Part 7 - Emergent Abilities and Scaling
What Are Emergent Abilities?
Wei et al. (2022) defined emergent abilities as capabilities that are absent in small models but present in large models - they appear to emerge suddenly at a certain scale rather than improving gradually.
Examples of claimed emergent abilities:
- Multi-step arithmetic (appears at ~100B parameters)
- Chain-of-thought reasoning (appears at ~60B parameters)
- Word unscrambling (appears at ~10B parameters)
The Debate: Are Emergent Abilities Real?
This is an active research debate that interviewers may probe:
Argument for emergence (Wei et al., 2022):
- Many benchmarks show near-random performance until a threshold model size, then rapid improvement
- This is consistent with phase transitions in physics
- Some capabilities genuinely require a minimum amount of knowledge/reasoning
Argument against emergence (Schaeffer et al., 2023):
- "Emergent" abilities may be an artifact of the metric chosen
- When you switch from accuracy (discrete) to log-likelihood (continuous), the improvement is smooth
- The "sudden jump" is because accuracy changes from 0% to non-zero at a threshold, but the underlying probability is smoothly improving
```python
import numpy as np

def demonstrate_metric_illusion():
    """
    Show how metric choice can create the illusion of emergence.
    """
    # Model capability (smooth power law improvement)
    scales = np.logspace(7, 11, 50)  # 10M to 100B parameters
    # Probability of getting a single step correct (smooth).
    # The per-step error follows an illustrative power law (1e7 / N) ** 0.3,
    # chosen so the error stays between 0 and 1 across this range.
    p_correct = 1 - (1e7 / scales) ** 0.3
    p_correct = np.clip(p_correct, 0.01, 0.99)
    # Task requires getting 5 steps ALL correct
    # Accuracy (discrete metric)
    accuracy = p_correct ** 5
    print("Scale (params) | p(single step) | accuracy (5 steps)")
    print("-" * 55)
    for i in range(0, len(scales), 5):
        print(f"{scales[i]:>13.0f} | {p_correct[i]:>14.3f} | {accuracy[i]:>18.3f}")
    # Key insight: p_correct improves smoothly,
    # but accuracy appears to "emerge" because
    # p^5 is very small until p is close to 1

demonstrate_metric_illusion()

# Reference values of p and p^5 (illustrative, not the printed output):
# p_correct: 0.3, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99
# accuracy: 0.002, 0.03, 0.17, 0.33, 0.59, 0.77, 0.95
# Accuracy looks like a sudden jump even though capability is smooth!
```
Do not state definitively that emergent abilities either "are real" or "are not real." This is an active debate. The sophisticated answer is: "Cross-entropy loss improves smoothly with scale (this is well-established). Whether downstream task performance shows genuine phase transitions or is an artifact of discrete metrics is debated. Schaeffer et al. showed that many claimed emergent abilities disappear when using continuous metrics like log-probability instead of accuracy. However, some complex multi-step reasoning capabilities do appear to require a minimum scale."
Part 8 - Scaling Beyond Language
Vision Scaling Laws
Zhai et al. (2022) found similar scaling laws for vision models (ViT), with power-law exponents fitted for ImageNet classification.
Multimodal Scaling
Scaling laws also apply to multimodal models. The key insight is that different modalities may have different scaling exponents, meaning the optimal data mix changes with scale.
Downstream Task Scaling
An important finding: the relationship between pre-training loss and downstream task performance is itself predictable.
For many tasks, this relationship is approximately linear or power-law, meaning scaling laws for pre-training loss translate into scaling laws for downstream performance. This is what allows companies like OpenAI and Anthropic to predict the capabilities of future models before training them.
Part 9 - Practical Applications of Scaling Laws
Compute Budget Planning
Scaling laws allow organizations to plan multi-million dollar training runs:
- Train small pilot models at multiple scales (1M, 10M, 100M, 1B parameters)
- Fit scaling law coefficients ($A$, $\alpha$, $B$, $\beta$, $E$)
- Extrapolate to predict the loss of a much larger model
- Decide whether the predicted improvement justifies the investment
```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, A, alpha, B, beta_param, E):
    """
    Chinchilla parametric scaling law.
    X is a 2D array: X[0] = N (params), X[1] = D (tokens)
    """
    N, D = X
    return A / N**alpha + B / D**beta_param + E

# Simulated pilot runs (in practice, these are real experiments)
# Caveat: five points for five parameters, all at 20 tokens/param, leave the
# fit underdetermined. A real study varies N and D independently across many runs.
pilot_data = {
    # (N_params, D_tokens): measured_loss
    (10e6, 200e6): 3.85,
    (50e6, 1e9): 3.45,
    (100e6, 2e9): 3.25,
    (500e6, 10e9): 2.95,
    (1e9, 20e9): 2.80,
}
N_vals = np.array([k[0] for k in pilot_data.keys()])
D_vals = np.array([k[1] for k in pilot_data.keys()])
losses = np.array(list(pilot_data.values()))

# Fit scaling law (bounded fit; a generous evaluation budget helps convergence)
X = np.array([N_vals, D_vals])
popt, pcov = curve_fit(
    scaling_law, X, losses,
    p0=[400, 0.3, 400, 0.3, 1.7],
    bounds=([0, 0, 0, 0, 0], [1e6, 1, 1e6, 1, 5]),
    maxfev=20000
)
A_fit, alpha_fit, B_fit, beta_fit, E_fit = popt
print(f"Fitted: A={A_fit:.1f}, α={alpha_fit:.3f}, "
      f"B={B_fit:.1f}, β={beta_fit:.3f}, E={E_fit:.3f}")

# Predict loss for a 70B model on 1.4T tokens
N_target = 70e9
D_target = 1.4e12
predicted_loss = scaling_law(np.array([[N_target], [D_target]]),
                             *popt)[0]
print(f"\nPredicted loss for 70B model on 1.4T tokens: {predicted_loss:.3f}")
print(f"Training compute: {6 * N_target * D_target:.2e} FLOPs")
```
Model Selection for Deployment
Given an inference budget (max model size for your hardware), scaling laws tell you how much data to train on:
| Inference Constraint | Optimal Model | Training Data (Chinchilla) | Training Data (Overtrained) |
|---|---|---|---|
| Edge device (1B params max) | 1B | 20B tokens | 200B-1T tokens |
| Single GPU (7B params max) | 7B | 140B tokens | 1T-15T tokens |
| Multi-GPU (70B params max) | 70B | 1.4T tokens | 3T-15T tokens |
| Cluster (400B+ params) | 400B | 8T tokens | 15T+ tokens |
Part 10 - Open Questions and Future Directions
Does Scaling Continue?
The billion-dollar question: will power law improvements continue indefinitely, or will they plateau?
Arguments for continued scaling:
- No theoretical ceiling identified for Transformer language models
- Each generation has found new data sources and techniques to maintain scaling
- Loss curves show no signs of bending (within current data)
Arguments for a plateau:
- Available high-quality text data is finite (~10-20T tokens of quality English text)
- Synthetic data generation may not substitute for real data
- Diminishing returns: power laws mean each increment costs exponentially more
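The diminishing-returns argument can be made concrete with a pure power law $L \propto X^{-\alpha}$. The sketch below uses an illustrative exponent of 0.05, which is an assumption chosen to be of the same order as the compute exponents discussed in this chapter; the helper names are hypothetical.

```python
def loss_ratio_after_scaling(scale_factor: float, exponent: float) -> float:
    """If L = (X_c / X) ** exponent, scaling X by scale_factor multiplies L by this."""
    return scale_factor ** (-exponent)


def scale_needed_for_loss_ratio(target_ratio: float, exponent: float) -> float:
    """Inverse of the above: how much X must grow to reach target_ratio * L."""
    return target_ratio ** (-1.0 / exponent)


ALPHA_C = 0.05  # illustrative compute exponent (assumption)

per_doubling = loss_ratio_after_scaling(2.0, ALPHA_C)
compute_for_10pct = scale_needed_for_loss_ratio(0.90, ALPHA_C)
compute_for_20pct = scale_needed_for_loss_ratio(0.80, ALPHA_C)

print(f"Doubling compute multiplies loss by {per_doubling:.4f}")
print(f"10% lower loss needs {compute_for_10pct:.1f}x compute")
print(f"20% lower loss needs {compute_for_20pct:.1f}x compute")
```

Each doubling buys the same small fractional reduction, so a second 10% improvement costs far more absolute compute than the first.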
Data Walls
The biggest practical constraint on scaling is data. Estimated high-quality text data:
| Source | Estimated Tokens |
|---|---|
| Common Crawl (filtered) | ~5-10T |
| Books | ~500B |
| Wikipedia | ~5B |
| Code (GitHub) | ~500B |
| Academic papers | ~100B |
| Total (deduplicated, quality-filtered) | ~10-20T |
At the Chinchilla ratio of 20 tokens/param, this limits Chinchilla-optimal models to ~500B-1T parameters. Larger models require either:
- Synthetic data generation
- Multimodal data (images, video, audio)
- Multi-epoch training (revisiting data, which has diminishing returns)
- New data sources (private data, specialized domains)
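The data-wall arithmetic is short enough to script. A minimal sketch, assuming the 20 tokens-per-parameter rule and single-epoch training; `max_chinchilla_model` is an illustrative helper name, and the token budgets are the rough estimates from the table.

```python
def max_chinchilla_model(available_tokens: float, tokens_per_param: float = 20.0):
    """Largest model trainable at the given ratio on a fixed token budget.

    Returns (parameters, training FLOPs), using C ~= 6 * N * D and one epoch.
    """
    n_params = available_tokens / tokens_per_param
    flops = 6.0 * n_params * available_tokens
    return n_params, flops


walls = {}
for tokens in (10e12, 15e12, 20e12):
    walls[tokens] = max_chinchilla_model(tokens)
    n_params, flops = walls[tokens]
    print(f"{tokens / 1e12:>4.0f}T tokens -> {n_params / 1e9:>6.0f}B params, "
          f"{flops:.2e} FLOPs")
```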
Beyond Loss: Scaling Laws for Capabilities
A frontier research direction is developing scaling laws not just for loss but for specific capabilities.
This would allow predicting not just "how good" a model is, but "what it can do" at each scale.
Practice Problems
Problem 1: Compute-Optimal Allocation
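You have a compute budget of $10^{23}$ FLOPs. Using the Chinchilla scaling law ($C = 6ND$ and $D = 20N$), calculate the optimal model size and training data.
Hint
From $C = 6ND$ and $D = 20N$: $C = 120N^2$. Therefore $N = \sqrt{C / 120} = \sqrt{10^{23} / 120} \approx 2.9 \times 10^{10}$ parameters. $D = 20N \approx 5.8 \times 10^{11}$ tokens. So approximately: a 29B parameter model trained on 580B tokens.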
Problem 2: Kaplan vs Chinchilla
The original GPT-3 (175B, 300B tokens) was designed based on Kaplan's scaling laws. What would Chinchilla recommend for the same compute budget?
Hint
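GPT-3 compute: $6 \times 175\text{B} \times 300\text{B} = 3.15 \times 10^{23}$ FLOPs. Chinchilla-optimal: $N = \sqrt{C / 120} \approx 51\text{B}$ parameters, $D = 20N \approx 1.0\text{T}$ tokens. Chinchilla recommends a model 3.4x smaller trained on 3.4x more data. This is in the same regime as Chinchilla itself (70B, 1.4T), which used roughly twice the compute.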
Problem 3: Overtraining Economics
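A company serves 100M queries per day, each requiring 500 tokens of generation. Compare the total 1-year cost of: (a) a 70B Chinchilla-optimal model, and (b) a 7B overtrained model with the same training compute. Assume inference costs $10^{-18}$ dollars per FLOP.
Hint
Annual queries: $100\text{M} \times 365 = 3.65 \times 10^{10}$ queries. Total inference tokens: $3.65 \times 10^{10} \times 500 = 1.825 \times 10^{13}$ tokens (18.25T). (a) 70B model: Inference FLOPs = $2 \times 70\text{B} \times 18.25\text{T} = 2.555 \times 10^{24}$. Inference cost = $2.555 \times 10^{24} \times 10^{-18}$ = \$2,555,000. (b) 7B model: Inference FLOPs = $2 \times 7\text{B} \times 18.25\text{T} = 2.555 \times 10^{23}$. Inference cost = $2.555 \times 10^{23} \times 10^{-18}$ = \$255,500. The 7B model saves ~\$2.3M per year in inference costs. Even if it required extra training compute, the savings pay for themselves within months.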
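The same calculation as a short script, which makes it easy to vary the traffic or price assumptions. The price of $10^{-18}$ dollars per FLOP is an assumption (it is the value consistent with the ~2.3M dollar saving quoted in the hint), and inference is approximated as $2N$ FLOPs per generated token.

```python
QUERIES_PER_DAY = 100e6
TOKENS_PER_QUERY = 500
DAYS_PER_YEAR = 365
COST_PER_FLOP = 1e-18  # dollars per FLOP (assumed price)


def annual_inference_cost(n_params: float) -> float:
    """Yearly serving cost in dollars, using ~2N FLOPs per generated token."""
    tokens_per_year = QUERIES_PER_DAY * DAYS_PER_YEAR * TOKENS_PER_QUERY
    flops_per_year = 2.0 * n_params * tokens_per_year
    return flops_per_year * COST_PER_FLOP


cost_70b = annual_inference_cost(70e9)
cost_7b = annual_inference_cost(7e9)
savings = cost_70b - cost_7b

print(f"70B model: ${cost_70b:,.0f} per year")
print(f" 7B model: ${cost_7b:,.0f} per year")
print(f"Savings  : ${savings:,.0f} per year")
```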
Problem 4: Data Wall
If the total available high-quality text data is 15T tokens and you follow Chinchilla-optimal scaling (20 tokens/param), what is the largest Chinchilla-optimal model you can train? What compute would it require?
Hint
$N = 15\text{T} / 20 = 750\text{B}$ parameters. Compute: $6 \times 750\text{B} \times 15\text{T} = 6.75 \times 10^{25}$ FLOPs. This is on the order of publicly circulated (unofficial) estimates of GPT-4's training compute. To go beyond 750B parameters Chinchilla-optimally, you need more data - through synthetic generation, multimodal sources, or multi-epoch training with careful deduplication.
Problem 5: Predicting Model Performance
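You train pilot models at 100M, 1B, and 10B parameters (each Chinchilla-optimal) and measure losses of 3.2, 2.8, and 2.5 respectively. Estimate the loss for a 100B parameter Chinchilla-optimal model. Assume $L(N) = A \cdot N^{-\alpha} + E$.
Hint
With three data points and three unknowns ($A$, $\alpha$, $E$), we can fit the curve. From the data: $A \cdot (10^{8})^{-\alpha} + E = 3.2$, $A \cdot (10^{9})^{-\alpha} + E = 2.8$, $A \cdot (10^{10})^{-\alpha} + E = 2.5$. Taking differences: $A \cdot (10^{8})^{-\alpha} (1 - 10^{-\alpha}) = 0.4$ and $A \cdot (10^{9})^{-\alpha} (1 - 10^{-\alpha}) = 0.3$. The ratio $10^{\alpha} = 0.4 / 0.3$, so $\alpha = \log_{10}(4/3) \approx 0.125$. Solving numerically gives $\alpha \approx 0.125$, $E \approx 1.6$, $A \approx 16.0$. Prediction for 100B: $L \approx 1.6 + 0.4 \times 0.75^2 / (1 - 0.75) \times 0.75 \approx 2.28$. But this should be validated with more pilot models - three points is the minimum for fitting.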
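Because the three pilot sizes are evenly spaced in log space, this fit has a closed-form solution. The sketch below assumes the single-variable form $L(N) = A \cdot N^{-\alpha} + E$ (an assumed functional form for this exercise) and a hypothetical helper `fit_three_points`; it is exact for three points, so it gives no indication of fit uncertainty.

```python
import math


def fit_three_points(l1: float, l2: float, l3: float, n1: float, step: float = 10.0):
    """Fit L(N) = A * N**(-alpha) + E to losses at n1, n1*step, n1*step**2.

    Successive loss differences form a geometric sequence with ratio
    r = step ** (-alpha), which pins down alpha, then E, then A.
    """
    r = (l2 - l3) / (l1 - l2)             # equals step ** (-alpha)
    alpha = -math.log(r) / math.log(step)
    e = l1 - (l1 - l2) / (1.0 - r)        # asymptote of the geometric series
    a = (l1 - e) * n1 ** alpha
    return a, alpha, e


a_fit, alpha_fit, e_fit = fit_three_points(3.2, 2.8, 2.5, n1=100e6)
predicted_100b = a_fit * (100e9) ** (-alpha_fit) + e_fit

print(f"A = {a_fit:.2f}, alpha = {alpha_fit:.4f}, E = {e_fit:.2f}")
print(f"Predicted loss at 100B parameters: {predicted_100b:.3f}")
```

The prediction simply continues the pattern in the data: each 10x in model size removes 75% as much loss as the previous 10x did.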
Interview Cheat Sheet
| Question | Key Points |
|---|---|
| "What are scaling laws?" | Loss follows power laws with params ($N$), data ($D$), compute ($C$). $C \approx 6ND$. |
| "What did Kaplan find?" | Scale models faster than data ($N_{opt} \propto C^{0.73}$, $D_{opt} \propto C^{0.27}$). Later shown to be wrong due to not training to convergence. |
| "What did Chinchilla find?" | Scale $N$ and $D$ equally ($N_{opt} \propto C^{0.5}$, $D_{opt} \propto C^{0.5}$). Optimal: ~20 tokens per parameter. |
| "What did Kaplan get wrong?" | Did not train to convergence. Larger models were compared at fewer tokens/param, biasing toward model size. |
| "Why was Chinchilla important?" | Proved Gopher (280B/300B tok) was worse than Chinchilla (70B/1.4T tok) at same compute. |
| "Why are modern models overtrained?" | Inference cost dominates. Smaller model + more data = same training cost but 10x cheaper inference. |
| "What is the data wall?" | ~10-20T quality tokens available. Limits Chinchilla-optimal models to ~500B-1T params. |
| "Are emergent abilities real?" | Debated. May be metric artifacts (discrete accuracy vs continuous log-prob). Multi-step reasoning may genuinely require scale. |
| "How are scaling laws used in practice?" | Train small pilots, fit power law, predict large model performance, decide if investment is justified. |
| "What is the irreducible loss?" | $E$ in $L(N, D) = A/N^\alpha + B/D^\beta + E$. Entropy of natural language. Cannot be reduced by any model. |
Spaced Repetition Checkpoints
Day 0 (Today)
- State the three scaling variables ($N$, $D$, $C$) and $C \approx 6ND$
- Explain the Chinchilla-optimal ratio (20 tokens/param)
- Explain what Kaplan got wrong
Day 3
- Write the Chinchilla parametric loss
- Calculate optimal $N$ and $D$ for a given compute budget
- Explain overtraining and inference cost tradeoffs
Day 7
- Do a quantitative comparison of GPT-3 vs Chinchilla
- Explain the data wall and its implications
- Discuss emergent abilities and the metric artifact argument
Day 14
- Mock interview: answer all 10 cheat sheet questions
- Derive the compute-optimal allocation using Lagrange multipliers
- Discuss practical scaling law usage for compute budget planning
Day 21
- Full 20-minute discussion covering Kaplan, Chinchilla, and practical implications
- Handle follow-up questions on data walls, overtraining economics, and emergence
- Design a scaling experiment for a hypothetical new model
Next Steps
You now understand the quantitative science behind model scaling - the power laws, the optimal allocations, and the economic tradeoffs that drive every major training decision in the industry. This knowledge connects directly to every other paper discussion in this handbook: the GPT series (Chapter 5) scaled according to these laws, the Transformer architecture (Chapter 3) is the substrate that scaling laws describe, and RLHF (Chapter 10) adds a human preference signal on top of the scaled base model.
