Scaling Laws - The Science of Making Models Bigger

Reading time: ~45 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, LLM Engineer, ML Infrastructure Engineer

The Real Interview Moment

You are in an Anthropic research interview. The interviewer asks: "You have a fixed compute budget of roughly $1.2 \times 10^{23}$ FLOPs. Should you train a 10B parameter model on 2T tokens, or a 70B parameter model on 300B tokens? Show me the math."

You start with the Chinchilla scaling law, and she follows up: "Kaplan et al. (2020) would have given a different answer. What did Kaplan get wrong, and why does the Chinchilla-optimal ratio matter for production deployments? If you are deploying to millions of users, would you actually train Chinchilla-optimally?"

This question tests whether you understand the quantitative science of scaling - not just "bigger is better," but the precise mathematical relationships between parameters, data, and compute. Candidates who can only say "more data is better" without citing the specific power law exponents or explaining the Kaplan-Chinchilla disagreement get a "lean no-hire." Candidates who can derive the compute-optimal allocation, explain why production models are deliberately overtrained, and reason about inference cost tradeoffs get a "strong hire."

What You Will Master

  • State the three core scaling law variables and their power law relationships
  • Derive the compute-optimal training allocation (Chinchilla law)
  • Explain what Kaplan got wrong and why Chinchilla corrected it
  • Calculate optimal model size and token count for a given compute budget
  • Discuss why production models are deliberately overtrained
  • Reason about inference cost and the training-inference tradeoff
  • Apply scaling laws to practical model design decisions

Self-Assessment: Where Are You Now?

| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Explain | 4 - Can Derive | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| State the three scaling law variables | | | | | | ___ |
| Write the power law equations | | | | | | ___ |
| Explain Kaplan scaling laws | | | | | | ___ |
| Explain Chinchilla scaling laws | | | | | | ___ |
| Derive compute-optimal allocation | | | | | | ___ |
| Calculate optimal N and D for a compute budget | | | | | | ___ |
| Explain overtraining and why it is done | | | | | | ___ |
| Discuss inference cost tradeoffs | | | | | | ___ |
| Explain emergent abilities and scaling | | | | | | ___ |
| Apply scaling laws to a design decision | | | | | | ___ |

Target: All 4s and 5s before your interview.

Part 1 - The Three Variables of Scale

The Core Insight

The fundamental discovery of scaling laws research is that language model performance (measured by cross-entropy loss) follows smooth power law relationships with three variables:

  1. $N$ - Number of model parameters
  2. $D$ - Number of training tokens (dataset size)
  3. $C$ - Compute budget (in FLOPs)

These three variables are not independent. For a Transformer language model, the compute required for training is approximately:

$$C \approx 6ND$$

Where the factor of 6 accounts for the forward pass (~2ND FLOPs) and backward pass (~4ND FLOPs).
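
As a quick sanity check on the $C \approx 6ND$ rule, the sketch below evaluates it for the two options in the opening interview question. The helper name `train_flops` is an illustrative choice, not a function from any library.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~2ND forward + ~4ND backward = 6ND FLOPs."""
    return 6.0 * n_params * n_tokens


option_small = train_flops(10e9, 2e12)   # 10B parameters on 2T tokens
option_large = train_flops(70e9, 300e9)  # 70B parameters on 300B tokens

print(f"10B on 2T tokens:   {option_small:.2e} FLOPs")
print(f"70B on 300B tokens: {option_large:.2e} FLOPs")
```

Both options land near $1.2 \times 10^{23}$ FLOPs, so the question is really about how to split a roughly equal budget between parameters and tokens.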

[Figure: The Scaling Triangle: Parameters, Data, Compute]

Power Law Relationships

Each variable, when the others are not limiting, produces a power law improvement in loss:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} \quad \text{(loss vs parameters, data not limiting)}$$

$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} \quad \text{(loss vs data, parameters not limiting)}$$

$$L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C} \quad \text{(loss vs compute, optimally allocated)}$$

Where $N_c$, $D_c$, $C_c$ are characteristic constants and $\alpha_N$, $\alpha_D$, $\alpha_C$ are the scaling exponents.

The crucial implication: performance improves as a power law, not linearly. Doubling compute does not double performance - it produces a fixed fractional improvement. Getting the next increment of performance always costs more than the last.
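
To make "fixed fractional improvement" concrete: for a pure power law $L \propto C^{-\alpha}$, multiplying compute by $k$ multiplies loss by $k^{-\alpha}$. The sketch below assumes Kaplan's compute exponent of 0.050 (quoted in Part 2) and ignores any irreducible loss term; the function names are illustrative.

```python
def loss_ratio(compute_multiplier: float, alpha: float) -> float:
    """Factor applied to the loss when compute is multiplied, for L ∝ C**(-alpha)."""
    return compute_multiplier ** (-alpha)


def compute_multiplier_for(loss_factor: float, alpha: float) -> float:
    """Compute multiplier needed to scale the loss by loss_factor (0.9 = 10% lower)."""
    return loss_factor ** (-1.0 / alpha)


alpha_c = 0.050  # assumed exponent; a pure power law with no loss floor

print(f"2x compute  -> loss x {loss_ratio(2, alpha_c):.4f}")
print(f"10x compute -> loss x {loss_ratio(10, alpha_c):.4f}")
print(f"10% lower loss needs {compute_multiplier_for(0.9, alpha_c):.1f}x compute")
```

Under these assumptions, doubling compute trims the loss by only about 3.4%, and a 10% reduction needs roughly 8x the compute.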

60-Second Answer

"Scaling laws describe the empirical observation that language model loss follows smooth power laws with respect to model size, dataset size, and compute. The key equation is $C \approx 6ND$, relating compute to parameters and data. Given a fixed compute budget, there is a unique optimal allocation between model size and data size that minimizes loss. Kaplan et al. (2020) found you should scale models faster than data. Chinchilla (2022) corrected this, showing parameters and data should scale equally - roughly 20 tokens per parameter. This has profound implications for which models to train and deploy."

Part 2 - Kaplan Scaling Laws (2020)

The Paper

"Scaling Laws for Neural Language Models" - Kaplan et al., 2020 (OpenAI)

The Key Findings

Kaplan et al. trained hundreds of language models ranging from 768 parameters to 1.5 billion parameters and measured how loss scaled. Their findings:

Finding 1: Smooth power laws.

$$L(N) = \left(\frac{8.8 \times 10^{13}}{N}\right)^{0.076}$$

$$L(D) = \left(\frac{5.4 \times 10^{13}}{D}\right)^{0.095}$$

$$L(C_{\min}) = \left(\frac{3.1 \times 10^8}{C_{\min}}\right)^{0.050}$$

Here $N$ counts non-embedding parameters, $D$ is in tokens, and $C_{\min}$ is measured in PF-days rather than FLOPs.

Finding 2: Performance depends strongly on scale, weakly on architecture.

Model shape (depth vs width, number of heads) matters much less than total parameter count. A wide-shallow model with $N$ parameters performs similarly to a narrow-deep model with $N$ parameters.

Finding 3: The optimal allocation favors larger models.

Given a fixed compute budget, Kaplan concluded you should use most of the compute for a large model trained on relatively little data. Specifically, they claimed:

$$N_{\text{opt}} \propto C^{0.73}, \quad D_{\text{opt}} \propto C^{0.27}$$

This means compute should go mostly to model size, with data growing much more slowly.

import numpy as np

# Kaplan scaling law predictions
def kaplan_loss_N(N):
    """Loss as a function of model parameters (Kaplan)."""
    return (8.8e13 / N) ** 0.076

def kaplan_loss_D(D):
    """Loss as a function of training tokens (Kaplan)."""
    return (5.4e13 / D) ** 0.095

def kaplan_optimal_allocation(C):
    """
    Kaplan-style compute-optimal allocation.
    Favors larger models with less data.
    """
    # N_opt ∝ C^0.73, and with C = 6ND this forces D_opt ∝ C^0.27.
    # The 1.3e9 prefactor and the 1e21 FLOP normalization are illustrative.
    N_opt = 1.3e9 * (C / 1e21) ** 0.73
    D_opt = C / (6 * N_opt)
    tokens_per_param = D_opt / N_opt
    return N_opt, D_opt, tokens_per_param

# Example: what does Kaplan recommend for different compute budgets?
for log_C in [20, 21, 22, 23, 24]:
    C = 10 ** log_C
    N, D, ratio = kaplan_optimal_allocation(C)
    print(f"C=10^{log_C}: N={N/1e9:.1f}B, D={D/1e9:.0f}B tokens, "
          f"ratio={ratio:.1f} tokens/param")

# Kaplan's prediction: model size grows much faster than data.
# The tokens-per-parameter ratio shrinks as compute grows (ratio ∝ C^-0.46),
# falling to single digits by C = 10^24 in this sketch.

What Kaplan Got Wrong

Kaplan's conclusion - favor model size over data - had a critical methodological flaw:

They did not train to convergence. Kaplan used a fixed training length and learning-rate schedule across model sizes, which meant larger models saw proportionally fewer tokens relative to their capacity and were measured far from their best achievable loss. This biased the results toward recommending larger models.

In other words, Kaplan compared:

  • Small model trained for many tokens ✓ (near convergence)
  • Large model trained for few tokens ✗ (far from convergence)

Naturally, the large model showed more room for improvement, making it seem like scaling model size was more efficient than scaling data.

Common Trap

Do not say "Kaplan showed that model size matters more than data." This is Kaplan's conclusion but it was wrong. Chinchilla showed that when you properly control for training duration, data is equally important. If an interviewer asks about Kaplan, always mention the correction: "Kaplan's methodology did not train to convergence, which biased the results toward larger models. Chinchilla corrected this."

Part 3 - Chinchilla Scaling Laws (2022)

The Paper

"Training Compute-Optimal Large Language Models" - Hoffmann et al., 2022 (DeepMind)

The Key Correction

Hoffmann et al. trained over 400 models ranging from 70M to 16B parameters, each trained on 5B to 500B tokens. Crucially, they varied both model size and training tokens for each compute budget, training many models to different degrees of convergence.

The Chinchilla Law

Their finding was dramatically different from Kaplan:

$$N_{\text{opt}} \propto C^{0.50}, \quad D_{\text{opt}} \propto C^{0.50}$$

Parameters and data should scale equally. Every time you double the model size, you should also double the training data, which costs 4x the compute. Equivalently, each doubling of compute scales both $N$ and $D$ by $\sqrt{2} \approx 1.41$.

The specific ratio they found:

$$D_{\text{opt}} \approx 20 \times N_{\text{opt}}$$

You should train on approximately 20 tokens per parameter.

[Figure: Kaplan vs Chinchilla Scaling Allocation]

The Joint Scaling Law

Chinchilla proposed a parametric loss function that depends on both $N$ and $D$:

$$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$

Where:

  • $A/N^\alpha$ - the model size term (decreasing loss as model gets bigger)
  • $B/D^\beta$ - the data term (decreasing loss as data gets larger)
  • $E$ - the irreducible loss (entropy of natural language, cannot be improved by any model)
  • $\alpha \approx 0.34$, $\beta \approx 0.28$

The fitted parameters:

| Parameter | Value |
|---|---|
| $A$ | 406.4 |
| $B$ | 410.7 |
| $\alpha$ | 0.34 |
| $\beta$ | 0.28 |
| $E$ | 1.69 |

Computing the Optimal Allocation

Given compute $C = 6ND$, we want to minimize $L(N, D)$ subject to $6ND = C$:

$$\min_{N, D} \left[\frac{A}{N^\alpha} + \frac{B}{D^\beta} + E\right] \quad \text{s.t.} \quad 6ND = C$$

Using a Lagrange multiplier and solving:

$$\frac{\partial}{\partial N}\left[\frac{A}{N^\alpha} + \frac{B}{D^\beta} + \lambda(6ND - C)\right] = 0$$

$$\frac{-\alpha A}{N^{\alpha+1}} + 6\lambda D = 0$$

Similarly for $D$:

$$\frac{-\beta B}{D^{\beta+1}} + 6\lambda N = 0$$

Dividing:

$$\frac{\alpha A}{N^{\alpha+1} D} = \frac{\beta B}{N D^{\beta+1}}$$

$$\frac{\alpha A}{N^\alpha} = \frac{\beta B}{D^\beta}$$

This says: at the optimum, the marginal improvement from scaling the model equals the marginal improvement from scaling the data. Neither is a bottleneck - both contribute equally.

import numpy as np
from scipy.optimize import minimize_scalar

# Chinchilla scaling law parameters
A = 406.4
B = 410.7
alpha = 0.34
beta = 0.28
E = 1.69

def chinchilla_loss(N, D):
    """Compute loss given model size N and data size D."""
    return A / N**alpha + B / D**beta + E

def optimal_allocation(C):
    """
    Find compute-optimal N and D for a given compute budget C.
    Constraint: C = 6 * N * D
    """
    def loss_given_N(log_N):
        N = np.exp(log_N)
        D = C / (6 * N)
        if D <= 0:
            return 1e10
        return chinchilla_loss(N, D)

    # Search over possible model sizes (in log space, 1M to 1T parameters)
    result = minimize_scalar(
        loss_given_N,
        bounds=(np.log(1e6), np.log(1e12)),
        method='bounded'
    )

    N_opt = np.exp(result.x)
    D_opt = C / (6 * N_opt)
    loss = chinchilla_loss(N_opt, D_opt)
    ratio = D_opt / N_opt

    return N_opt, D_opt, loss, ratio


# What does Chinchilla recommend for various compute budgets?
print(f"{'Compute':>12s} | {'N_opt':>10s} | {'D_opt':>12s} | "
      f"{'Loss':>6s} | {'Tokens/Param':>12s}")
print("-" * 70)

for log_C in [19, 20, 21, 22, 23, 24, 25]:
    C = 10 ** log_C
    N, D, loss, ratio = optimal_allocation(C)
    print(f"10^{log_C:>2d} FLOPs | {N/1e9:>8.2f}B | {D/1e9:>10.1f}B tok | "
          f"{loss:>6.3f} | {ratio:>10.1f}")

# With these rounded coefficients the fitted optimum is not a constant ratio:
# tokens/param grows slowly with compute, as C^((alpha - beta) / (alpha + beta)).
# The ~20 tokens/param rule of thumb matches the Chinchilla model itself (1.4T / 70B).

Chinchilla's Proof by Construction

To validate their scaling law, DeepMind trained Chinchilla - a 70B parameter model on 1.4T tokens ($C \approx 5.76 \times 10^{23}$ FLOPs; the $6ND$ approximation gives $5.88 \times 10^{23}$). This was the compute-optimal allocation for their budget.

For comparison, Gopher was a 280B parameter model trained on 300B tokens with approximately the same compute budget. According to Kaplan, Gopher should have been preferred (larger model). According to Chinchilla, the compute would be better spent on a smaller model with more data.

| Model | Parameters | Tokens | Compute (FLOPs) | Tokens/Param | MMLU |
|---|---|---|---|---|---|
| Gopher | 280B | 300B | ~$5.76 \times 10^{23}$ | 1.1 | 60.0% |
| Chinchilla | 70B | 1.4T | ~$5.76 \times 10^{23}$ | 20.0 | 67.6% |

Same compute, dramatically different performance. Chinchilla proved that Gopher was undertrained - it had too many parameters for the amount of data it saw.

Instant Rejection

If asked "What is the Chinchilla scaling law?" and you answer "Bigger models are better" - that is the opposite of the point. Chinchilla showed that simply making models bigger (Kaplan's approach) is wasteful. The optimal strategy is to scale parameters and data equally, at approximately 20 tokens per parameter. Many models at the time (GPT-3, Gopher) were massively undertrained relative to their size.

Part 4 - Implications for Model Design

Was GPT-3 Undertrained?

GPT-3 (175B parameters) was trained on 300B tokens:

$$\text{Tokens per parameter} = \frac{300\text{B}}{175\text{B}} \approx 1.7$$

Chinchilla optimal would have been $175\text{B} \times 20 = 3.5\text{T}$ tokens. GPT-3 was trained on roughly 12x fewer tokens than Chinchilla recommends.

Two counterfactuals follow:

  1. With the same training compute, a smaller model (roughly 50B parameters on roughly 1T tokens) would have reached a lower loss
  2. GPT-3 could have performed much better if trained on 3.5T tokens

The Model Landscape After Chinchilla

Chinchilla reshaped the entire field:

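| Model | Year | Parameters | Tokens | Tokens/Param | Chinchilla-Optimal? |
|---|---|---|---|---|---|
| GPT-3 | 2020 | 175B | 300B | 1.7 | Severely undertrained |
| Gopher | 2021 | 280B | 300B | 1.1 | Severely undertrained |
| Chinchilla | 2022 | 70B | 1.4T | 20 | Yes |
| LLaMA-7B | 2023 | 7B | 1T | 143 | Deliberately overtrained |
| LLaMA-65B | 2023 | 65B | 1.4T | 21.5 | Approximately optimal |
| Mistral-7B | 2023 | 7B | ~8T (unofficial estimate; not disclosed) | ~1143 | Severely overtrained |
| LLaMA-3-8B | 2024 | 8B | 15T | 1875 | Extremely overtrained |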
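
The Tokens/Param column is simple arithmetic, and recomputing it is a useful habit when reading a new model card. The sketch below reproduces the ratios for the models whose token counts are publicly stated; the dictionary and helper names are illustrative, and the 20:1 reference is the rule of thumb from Part 3.

```python
CHINCHILLA_RATIO = 20.0  # rule-of-thumb tokens per parameter

# (parameters, training tokens) as listed in the table above
models = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-7B": (7e9, 1e12),
    "LLaMA-65B": (65e9, 1.4e12),
    "LLaMA-3-8B": (8e9, 15e12),
}


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


for name, (n, d) in models.items():
    ratio = tokens_per_param(n, d)
    print(f"{name:>12s}: {ratio:7.1f} tokens/param "
          f"({ratio / CHINCHILLA_RATIO:6.2f}x the 20:1 rule)")
```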

Why Are Modern Models Overtrained?

This brings us to a crucial insight that interview candidates often miss: Chinchilla-optimal training minimizes loss for a given training compute budget, but not for a given inference compute budget.

[Figure: Training-Optimal vs Inference-Optimal Tradeoff]

Part 5 - The Training-Inference Tradeoff

The Cost Model

The total cost of an LLM is:

$$\text{Total Cost} = \text{Training Cost} + \text{Inference Cost} \times \text{Number of Queries}$$

$$\text{Total Cost} = C_{\text{train}} + c_{\text{inference}} \times N_{\text{params}} \times Q$$

Where $Q$ is the total number of inference queries over the model's lifetime.

For a model serving millions of users:

  • Training cost is fixed (one-time)
  • Inference cost scales with both model size and query volume
  • Inference cost often dwarfs training cost
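
One way to see when inference starts to dominate: training costs about $6ND$ FLOPs, while serving costs about $2N$ FLOPs per token (forward pass only). Setting the two equal, $N$ cancels and the break-even point is $3D$ served tokens. The sketch below assumes exactly that simplified accounting (no attention or KV-cache overhead, equal hardware efficiency); the function names are illustrative.

```python
def breakeven_inference_tokens(n_train_tokens: float) -> float:
    """Served tokens at which inference FLOPs (2N each) equal training FLOPs (6ND)."""
    return 3.0 * n_train_tokens  # 6ND / 2N, independent of N


def inference_share(n_params: float, n_train_tokens: float,
                    n_served_tokens: float) -> float:
    """Fraction of lifetime FLOPs spent on inference."""
    train = 6.0 * n_params * n_train_tokens
    infer = 2.0 * n_params * n_served_tokens
    return infer / (train + infer)


d_train = 1.4e12  # Chinchilla-style training run
print(f"Break-even: {breakeven_inference_tokens(d_train):.2e} served tokens")
print(f"Inference share at 20T served tokens: "
      f"{inference_share(70e9, d_train, 20e12):.1%}")
```

Past the break-even point, every additional served token makes model size matter more than training length in the total bill.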

Why Overtrain?

Consider two options that use the same training compute:

Option A (Chinchilla-optimal): 70B model, 1.4T tokens

  • Training cost: $C = 6 \times 70\text{B} \times 1.4\text{T} = 5.88 \times 10^{23}$ FLOPs
  • Inference cost per token: proportional to 70B parameters

Option B (Overtrained): 7B model, 14T tokens

  • Training cost: $C = 6 \times 7\text{B} \times 14\text{T} = 5.88 \times 10^{23}$ FLOPs
  • Inference cost per token: proportional to 7B parameters - 10x cheaper!

Both use the same training compute. But Option B produces a model that is 10x cheaper to serve. If you are processing billions of queries, the inference savings far outweigh any suboptimality in training.
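
How large is that "suboptimality"? A rough estimate comes from plugging both options into the parametric loss from Part 3. This is a sketch using the rounded coefficients quoted there, which are only one of the paper's estimates, so treat the gap as indicative rather than exact.

```python
# Rounded Chinchilla fit quoted in Part 3
A, B, ALPHA, BETA, E = 406.4, 410.7, 0.34, 0.28, 1.69


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss L(N, D) = A / N^alpha + B / D^beta + E."""
    return A / n_params**ALPHA + B / n_tokens**BETA + E


loss_a = chinchilla_loss(70e9, 1.4e12)  # Option A: 70B on 1.4T tokens
loss_b = chinchilla_loss(7e9, 14e12)    # Option B: 7B on 14T tokens

print(f"Option A: compute={6 * 70e9 * 1.4e12:.2e} FLOPs, loss={loss_a:.3f}")
print(f"Option B: compute={6 * 7e9 * 14e12:.2e} FLOPs, loss={loss_b:.3f}")
print(f"Loss penalty for the 10x smaller model: {loss_b - loss_a:.3f}")
```

Under this fit the 7B model's predicted loss is higher by only about 0.02, which is the quality price paid for a 10x reduction in per-token serving cost.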

import numpy as np

def total_cost(N_params, D_tokens, queries, cost_per_training_flop, cost_per_inference_flop):
    """
    Total cost = training + inference.

    N_params: model parameters
    D_tokens: training tokens
    queries: total lifetime inference queries (in tokens)
    """
    # Training cost: 6ND FLOPs
    training_flops = 6 * N_params * D_tokens
    training_cost = training_flops * cost_per_training_flop

    # Inference cost: ~2N FLOPs per token (forward pass only)
    inference_flops_per_token = 2 * N_params
    inference_cost = inference_flops_per_token * queries * cost_per_inference_flop

    return training_cost, inference_cost, training_cost + inference_cost


# Compare Chinchilla-optimal vs overtrained
# Assume same total training compute (same budget)
cost_train = 1e-18  # $/FLOP for training
cost_infer = 3e-18  # $/FLOP for inference (less efficient due to small batches)

# Scenario: 1 billion inference queries of 500 tokens each
total_inference_tokens = 1e9 * 500

print(f"{'Model':>20s} | {'Train $':>12s} | {'Infer $':>12s} | {'Total $':>12s}")
print("-" * 65)

for name, N, D in [
    ("Chinchilla-70B", 70e9, 1.4e12),
    ("Overtrained-7B", 7e9, 14e12),
    ("Overtrained-1B", 1e9, 98e12),
]:
    tc, ic, total = total_cost(N, D, total_inference_tokens, cost_train, cost_infer)
    print(f"{name:>20s} | ${tc/1e6:>10.1f}M | ${ic/1e6:>10.1f}M | ${total/1e6:>10.1f}M")

# The overtrained 7B model costs the same to train but 10x less to serve

The LLaMA Philosophy

Meta's LLaMA (Touvron et al., 2023) explicitly embraced this insight: "The objective of the scaling laws is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale."

LLaMA-7B was trained on 1T tokens (143 tokens/param) - 7x more than Chinchilla-optimal. This "wasted" training compute produced a model that was dramatically cheaper to deploy and only slightly worse in quality.

60-Second Answer

"Chinchilla showed that compute-optimal training uses about 20 tokens per parameter. But modern models like LLaMA deliberately overtrain - using 100-2000 tokens per parameter - because inference cost, not training cost, dominates the total lifetime cost of a model. A 7B model trained on 14T tokens uses the same training compute as a 70B model trained on 1.4T tokens, but costs 10x less to serve. For production deployments serving millions of users, overtraining is the economically rational choice."

Part 6 - The Scaling Law Equations in Detail

The Kaplan Parametric Form

Kaplan proposed that loss decomposes as:

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$

This form assumes a specific functional relationship between the model and data contributions to loss.
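
A useful property of this form is that it collapses to the single-variable laws in the limits: as $D \to \infty$ it reduces to $(N_c/N)^{\alpha_N}$, and as $N \to \infty$ to $(D_c/D)^{\alpha_D}$. The sketch below checks the first limit numerically, reusing the constants from the single-variable fits in Part 2 as an assumption (the joint fit in the paper may use slightly different values).

```python
# Constants from the single-variable fits quoted in Part 2 (assumed here)
N_C, ALPHA_N = 8.8e13, 0.076
D_C, ALPHA_D = 5.4e13, 0.095


def kaplan_joint_loss(n_params: float, n_tokens: float) -> float:
    """Kaplan's joint form: [(N_c/N)^(a_N/a_D) + D_c/D]^a_D."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D


def kaplan_loss_n(n_params: float) -> float:
    """Infinite-data limit: L(N) = (N_c/N)^a_N."""
    return (N_C / n_params) ** ALPHA_N


n = 1e9
for d in (1e9, 1e10, 1e11, 1e15):
    print(f"N=1e9, D={d:.0e}: L(N, D) = {kaplan_joint_loss(n, d):.3f}")
print(f"Infinite-data limit:  L(N)    = {kaplan_loss_n(n):.3f}")
```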

The Chinchilla Parametric Form

Chinchilla used a simpler additive decomposition:

$$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$

This assumes the model and data bottlenecks contribute independently, with an irreducible loss floor $E$.

Compute-Optimal Scaling Exponents

For Chinchilla, the optimal allocation follows:

$$N_{\text{opt}} = G \left(\frac{C}{6}\right)^a, \quad D_{\text{opt}} = G^{-1} \left(\frac{C}{6}\right)^b$$

Where $a = \frac{\beta}{\alpha + \beta}$ and $b = \frac{\alpha}{\alpha + \beta}$.

With $\alpha = 0.34$ and $\beta = 0.28$:

$$a = \frac{0.28}{0.34 + 0.28} = 0.452, \quad b = \frac{0.34}{0.34 + 0.28} = 0.548$$

Both are close to 0.5, confirming that parameters and data should scale approximately equally.
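
The closed form can be checked directly. Substituting $D = C/(6N)$ into the stationarity condition $\alpha A / N^\alpha = \beta B / D^\beta$ from Part 3 gives $G = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)}$; that expression is derived here rather than quoted from the text above, so the sketch verifies it numerically against both the compute constraint and the stationarity condition.

```python
# Rounded Chinchilla fit quoted in Part 3
A, B, ALPHA, BETA = 406.4, 410.7, 0.34, 0.28


def closed_form_allocation(compute: float) -> tuple[float, float]:
    """Closed-form minimizer of A/N^alpha + B/D^beta subject to 6ND = C."""
    a = BETA / (ALPHA + BETA)
    b = ALPHA / (ALPHA + BETA)
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * (compute / 6.0) ** a
    d_opt = (compute / 6.0) ** b / g
    return n_opt, d_opt


compute = 1e23
n_opt, d_opt = closed_form_allocation(compute)

# Both sides of the stationarity condition should agree at the optimum
lhs = ALPHA * A / n_opt**ALPHA
rhs = BETA * B / d_opt**BETA

print(f"N_opt = {n_opt:.3e}, D_opt = {d_opt:.3e}")
print(f"6 * N * D = {6 * n_opt * d_opt:.3e} (target {compute:.3e})")
print(f"alpha*A/N^alpha = {lhs:.6f}, beta*B/D^beta = {rhs:.6f}")
```

Because $b > a$ with these coefficients, the implied $D_{\text{opt}}/N_{\text{opt}}$ grows slowly with $C$ instead of staying fixed; the exactly equal split is the idealized $a = b = 0.5$ case.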

Part 7 - Emergent Abilities and Scaling

What Are Emergent Abilities?

Wei et al. (2022) defined emergent abilities as capabilities that are absent in small models but present in large models - they appear to emerge suddenly at a certain scale rather than improving gradually.

Examples of claimed emergent abilities:

  • Multi-step arithmetic (appears at ~100B parameters)
  • Chain-of-thought reasoning (appears at ~60B parameters)
  • Word unscrambling (appears at ~10B parameters)

[Figure: Standard Scaling vs Emergent Abilities]

The Debate: Are Emergent Abilities Real?

This is an active research debate that interviewers may probe:

Argument for emergence (Wei et al., 2022):

  • Many benchmarks show near-random performance until a threshold model size, then rapid improvement
  • This is consistent with phase transitions in physics
  • Some capabilities genuinely require a minimum amount of knowledge/reasoning

Argument against emergence (Schaeffer et al., 2023):

  • "Emergent" abilities may be an artifact of the metric chosen
  • When you switch from accuracy (discrete) to log-likelihood (continuous), the improvement is smooth
  • The "sudden jump" is because accuracy changes from 0% to non-zero at a threshold, but the underlying probability is smoothly improving

import numpy as np

def demonstrate_metric_illusion():
    """
    Show how metric choice can create the illusion of emergence.
    """
    # Model capability (smooth power law improvement)
    scales = np.logspace(7, 11, 50)  # 10M to 100B parameters
    # Per-step error rate falls as a smooth power law in scale
    # (illustrative constants: 70% error at 10M parameters, exponent 0.5)
    p_correct = 1 - 0.7 * (1e7 / scales) ** 0.5
    p_correct = np.clip(p_correct, 0.01, 0.99)

    # Task requires getting 5 steps ALL correct
    # Accuracy (discrete metric)
    accuracy = p_correct ** 5

    print("Scale (params) | p(single step) | accuracy (5 steps)")
    print("-" * 55)
    for i in range(0, len(scales), 5):
        print(f"{scales[i]:>13.0f} | {p_correct[i]:>14.3f} | {accuracy[i]:>18.3f}")

    # Key insight: p_correct improves smoothly,
    # but accuracy appears to "emerge" because
    # p^5 is very small until p is close to 1

demonstrate_metric_illusion()
# Illustrative (p, p**5) pairs:
# p_correct: 0.3, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99
# accuracy: 0.002, 0.03, 0.17, 0.33, 0.59, 0.77, 0.95
# Accuracy looks like a sudden jump even though capability is smooth!

Common Trap

Do not state definitively that emergent abilities either "are real" or "are not real." This is an active debate. The sophisticated answer is: "Cross-entropy loss improves smoothly with scale (this is well-established). Whether downstream task performance shows genuine phase transitions or is an artifact of discrete metrics is debated. Schaeffer et al. showed that many claimed emergent abilities disappear when using continuous metrics like log-probability instead of accuracy. However, some complex multi-step reasoning capabilities do appear to require a minimum scale."

Part 8 - Scaling Beyond Language

Vision Scaling Laws

Zhai et al. (2022) found similar scaling laws for vision models (ViT):

$$L(N, D) = \frac{a}{N^{\alpha}} + \frac{b}{D^{\beta}} + c$$

With $\alpha \approx 0.5$ and $\beta \approx 0.8$ for ImageNet classification.

Multimodal Scaling

Scaling laws also apply to multimodal models. The key insight is that different modalities may have different scaling exponents, meaning the optimal data mix changes with scale.

Downstream Task Scaling

An important finding: the relationship between pre-training loss and downstream task performance is itself predictable:

$$\text{Task performance} = f(L_{\text{pretrain}})$$

For many tasks, this relationship is approximately linear or power-law, meaning scaling laws for pre-training loss translate into scaling laws for downstream performance. This is what allows companies like OpenAI and Anthropic to predict the capabilities of future models before training them.

Part 9 - Practical Applications of Scaling Laws

Compute Budget Planning

Scaling laws allow organizations to plan multi-million dollar training runs:

  1. Train small pilot models at multiple scales (1M, 10M, 100M, 1B parameters)
  2. Fit scaling law coefficients ($A$, $B$, $\alpha$, $\beta$, $E$)
  3. Extrapolate to predict the loss of a much larger model
  4. Decide whether the predicted improvement justifies the investment

import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, A, alpha, B, beta_param, E):
    """
    Chinchilla parametric scaling law.
    X is a 2D array: X[0] = N (params), X[1] = D (tokens)
    """
    N, D = X
    return A / N**alpha + B / D**beta_param + E


# Simulated pilot runs (in practice, these are real experiments)
# Caveat: five points for five parameters leaves no degrees of freedom, and
# every run here uses D = 20N, so the N and D terms cannot be separated.
# A real study varies N and D independently across many more runs.
pilot_data = {
    # (N_params, D_tokens): measured_loss
    (10e6, 200e6): 3.85,
    (50e6, 1e9): 3.45,
    (100e6, 2e9): 3.25,
    (500e6, 10e9): 2.95,
    (1e9, 20e9): 2.80,
}

N_vals = np.array([k[0] for k in pilot_data.keys()])
D_vals = np.array([k[1] for k in pilot_data.keys()])
losses = np.array(list(pilot_data.values()))

# Fit scaling law (bounded least squares, starting near the Chinchilla values)
X = np.array([N_vals, D_vals])
popt, pcov = curve_fit(
    scaling_law, X, losses,
    p0=[400, 0.3, 400, 0.3, 1.7],
    bounds=([0, 0, 0, 0, 0], [1e6, 1, 1e6, 1, 5])
)
A_fit, alpha_fit, B_fit, beta_fit, E_fit = popt
print(f"Fitted: A={A_fit:.1f}, α={alpha_fit:.3f}, "
      f"B={B_fit:.1f}, β={beta_fit:.3f}, E={E_fit:.3f}")

# Predict loss for a 70B model on 1.4T tokens
N_target = 70e9
D_target = 1.4e12
predicted_loss = scaling_law(np.array([[N_target], [D_target]]),
                             *popt)[0]
print(f"\nPredicted loss for 70B model on 1.4T tokens: {predicted_loss:.3f}")
print(f"Training compute: {6 * N_target * D_target:.2e} FLOPs")

Model Selection for Deployment

Given an inference budget (max model size for your hardware), scaling laws tell you how much data to train on:

| Inference Constraint | Optimal Model | Training Data (Chinchilla) | Training Data (Overtrained) |
|---|---|---|---|
| Edge device (1B params max) | 1B | 20B tokens | 200B-1T tokens |
| Single GPU (7B params max) | 7B | 140B tokens | 1T-15T tokens |
| Multi-GPU (70B params max) | 70B | 1.4T tokens | 3T-15T tokens |
| Cluster (400B+ params) | 400B | 8T tokens | 15T+ tokens |

Part 10 - Open Questions and Future Directions

Does Scaling Continue?

The billion-dollar question: will power law improvements continue indefinitely, or will they plateau?

Arguments for continued scaling:

  • No theoretical ceiling identified for Transformer language models
  • Each generation has found new data sources and techniques to maintain scaling
  • Loss curves show no signs of bending (within current data)

Arguments for a plateau:

  • Available high-quality text data is finite (~10-20T tokens of quality English text)
  • Synthetic data generation may not substitute for real data
  • Diminishing returns: power laws mean each increment costs exponentially more

Data Walls

The biggest practical constraint on scaling is data. Estimated high-quality text data:

| Source | Estimated Tokens |
|---|---|
| Common Crawl (filtered) | ~5-10T |
| Books | ~500B |
| Wikipedia | ~5B |
| Code (GitHub) | ~500B |
| Academic papers | ~100B |
| Total (deduplicated, quality-filtered) | ~10-20T |

At the Chinchilla ratio of 20 tokens/param, this limits Chinchilla-optimal models to ~500B-1T parameters. Larger models require either:

  • Synthetic data generation
  • Multimodal data (images, video, audio)
  • Multi-epoch training (revisiting data, which has diminishing returns)
  • New data sources (private data, specialized domains)
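
The data-wall arithmetic is short enough to script. The sketch below assumes a single pass over the available tokens at a fixed tokens-per-parameter ratio; the helper name and the three token budgets are illustrative.

```python
def largest_ratio_limited_model(available_tokens: float,
                                tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Largest model trainable at a fixed tokens/param ratio, and its 6ND compute."""
    n_max = available_tokens / tokens_per_param
    compute = 6.0 * n_max * available_tokens
    return n_max, compute


for tokens in (10e12, 15e12, 20e12):
    n_max, compute = largest_ratio_limited_model(tokens)
    print(f"{tokens / 1e12:>4.0f}T tokens -> N_max = {n_max / 1e9:>6.0f}B params, "
          f"C = {compute:.2e} FLOPs")
```

Raising `tokens_per_param` models the overtrained regime: the same token supply then supports a proportionally smaller model.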

Beyond Loss: Scaling Laws for Capabilities

A frontier research direction is developing scaling laws not just for loss but for specific capabilities:

$$P(\text{can solve task } k) = f_k(N, D, \text{task complexity})$$

This would allow predicting not just "how good" a model is, but "what it can do" at each scale.

Practice Problems

Problem 1: Compute-Optimal Allocation

You have a compute budget of $C = 10^{23}$ FLOPs. Using the Chinchilla scaling law ($D_{\text{opt}} \approx 20N_{\text{opt}}$ and $C = 6ND$), calculate the optimal model size and training data.

Hint

From $C = 6ND$ and $D = 20N$: $C = 6N \cdot 20N = 120N^2$. Therefore $N = \sqrt{C/120} = \sqrt{10^{23}/120} \approx \sqrt{8.33 \times 10^{20}} \approx 2.89 \times 10^{10} \approx 29\text{B}$ parameters. $D = 20 \times 29\text{B} = 580\text{B}$ tokens. So approximately: a 29B parameter model trained on 580B tokens.
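
The same substitution works for any budget, so it is worth wrapping in a helper. This is a minimal sketch of the $N = \sqrt{C / (6r)}$ shortcut with the ratio $r$ left as a parameter; the function name is illustrative.

```python
import math


def ratio_optimal_allocation(compute: float,
                             tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Solve C = 6 * N * (r * N) for N, then set D = r * N."""
    n_opt = math.sqrt(compute / (6.0 * tokens_per_param))
    return n_opt, tokens_per_param * n_opt


n_opt, d_opt = ratio_optimal_allocation(1e23)
print(f"N = {n_opt / 1e9:.1f}B params, D = {d_opt / 1e9:.0f}B tokens")
# → N = 28.9B params, D = 577B tokens
```

Calling it with GPT-3's budget, `ratio_optimal_allocation(3.15e23)`, gives the roughly 51B parameter answer used in Problem 2.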

Problem 2: Kaplan vs Chinchilla

The original GPT-3 (175B, 300B tokens) was designed based on Kaplan's scaling laws. What would Chinchilla recommend for the same compute budget?

Hint

GPT-3 compute: $C = 6 \times 175\text{B} \times 300\text{B} = 3.15 \times 10^{23}$ FLOPs. Chinchilla-optimal: $N = \sqrt{C/120} = \sqrt{3.15 \times 10^{23}/120} \approx 51\text{B}$ parameters, $D = 20 \times 51\text{B} = 1.02\text{T}$ tokens. Chinchilla recommends a model 3.4x smaller trained on 3.4x more data. This is the same regime as the actual Chinchilla model (70B, 1.4T), which was sized for Gopher's somewhat larger compute budget.

Problem 3: Overtraining Economics

A company serves 100M queries per day, each requiring 500 tokens of generation. Compare the total 1-year cost of: (a) a 70B Chinchilla-optimal model, and (b) a 7B overtrained model with the same training compute. Assume inference costs $10^{-18}$ USD per FLOP.

Hint

Annual queries: $100\text{M} \times 365 = 36.5\text{B}$ queries. Total inference tokens: $36.5\text{B} \times 500 = 18.25\text{T}$ tokens. (a) 70B model: Inference FLOPs = $2 \times 70\text{B} \times 18.25\text{T} = 2.555 \times 10^{24}$. Inference cost = $2.555 \times 10^{24} \times 10^{-18}$ = 2,555,000 USD. (b) 7B model: Inference FLOPs = $2 \times 7\text{B} \times 18.25\text{T} = 2.555 \times 10^{23}$. Inference cost = $2.555 \times 10^{23} \times 10^{-18}$ = 255,500 USD. Training compute is identical, so the difference is all inference: the 7B model saves ~2.3M USD per year. Even if it required extra training compute, the savings pay for themselves within months.
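
The hint's arithmetic can be reproduced with a few lines. In this sketch the price per FLOP is a parameter; $10^{-18}$ USD per FLOP is the rate that yields the dollar figures in the hint, and the $2N$ FLOPs-per-token estimate is the same forward-pass approximation used in Part 5.

```python
def annual_inference_cost(n_params: float, queries_per_day: float,
                          tokens_per_query: float, price_per_flop: float,
                          days: int = 365) -> float:
    """Yearly serving cost assuming ~2N FLOPs per generated token."""
    tokens = queries_per_day * days * tokens_per_query
    flops = 2.0 * n_params * tokens
    return flops * price_per_flop


PRICE = 1e-18  # USD per FLOP (assumed rate)
cost_70b = annual_inference_cost(70e9, 100e6, 500, PRICE)
cost_7b = annual_inference_cost(7e9, 100e6, 500, PRICE)

print(f"70B model: {cost_70b:,.0f} USD per year")
print(f" 7B model: {cost_7b:,.0f} USD per year")
print(f"Savings:   {cost_70b - cost_7b:,.0f} USD per year")
```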

Problem 4: Data Wall

If the total available high-quality text data is 15T tokens and you follow Chinchilla-optimal scaling (20 tokens/param), what is the largest Chinchilla-optimal model you can train? What compute would it require?

Hint

$N_{\max} = D / 20 = 15\text{T} / 20 = 750\text{B}$ parameters. Compute: $C = 6 \times 750\text{B} \times 15\text{T} = 6.75 \times 10^{25}$ FLOPs. This is roughly $7\times$ the compute used for GPT-4 (estimated at $\sim 10^{25}$ FLOPs, an unofficial figure). To go beyond 750B parameters Chinchilla-optimally, you need more data - through synthetic generation, multimodal sources, or multi-epoch training with careful deduplication.

Problem 5: Predicting Model Performance

You train pilot models at 100M, 1B, and 10B parameters (each Chinchilla-optimal) and measure losses of 3.2, 2.8, and 2.5 respectively. Estimate the loss for a 100B parameter Chinchilla-optimal model. Assume $L(N) = A/N^{\alpha} + E$.

Hint

With three data points and three unknowns ($A$, $\alpha$, $E$), we can fit the curve exactly. From the data: $3.2 = A/10^{8\alpha} + E$, $2.8 = A/10^{9\alpha} + E$, $2.5 = A/10^{10\alpha} + E$. Taking differences: $0.4 = A(10^{-8\alpha} - 10^{-9\alpha})$ and $0.3 = A(10^{-9\alpha} - 10^{-10\alpha})$. The second bracket is $10^{-\alpha}$ times the first, so the ratio is $0.4/0.3 = 1.333 = 10^{\alpha}$, giving $\alpha = \log_{10}(4/3) \approx 0.125$. Then $A \cdot 10^{-8\alpha} = 0.4 / (1 - 0.75) = 1.6$, so $E = 3.2 - 1.6 = 1.60$ and $A = 1.6 \times 10^{8\alpha} \approx 16.0$. Prediction for 100B: each 10x step shrinks the reducible term by a factor of 0.75, so the next decrement is $0.3 \times 0.75 = 0.225$ and $L \approx 2.5 - 0.225 \approx 2.28$. But this should be validated with more pilot models - three points is the minimum for fitting.
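
When the pilot sizes are spaced by a constant factor, the three-point fit has a closed form, which makes the hint easy to verify. The sketch below assumes exactly that geometric spacing and noise-free losses; the function name is illustrative.

```python
import math


def fit_three_points(n_values, losses):
    """Exact fit of L = A / N**alpha + E through three geometrically spaced points."""
    n1, n2, n3 = n_values
    l1, l2, l3 = losses
    step = n2 / n1
    if not math.isclose(n3 / n2, step):
        raise ValueError("model sizes must be spaced by a constant factor")
    shrink = (l2 - l3) / (l1 - l2)             # equals step**(-alpha)
    alpha = -math.log(shrink) / math.log(step)
    reducible_at_n1 = (l1 - l2) / (1.0 - shrink)  # equals A / n1**alpha
    e = l1 - reducible_at_n1
    a = reducible_at_n1 * n1**alpha
    return a, alpha, e


a, alpha, e = fit_three_points((1e8, 1e9, 1e10), (3.2, 2.8, 2.5))
pred = a / 1e11**alpha + e

print(f"A = {a:.2f}, alpha = {alpha:.4f}, E = {e:.2f}")
print(f"Predicted loss at 100B parameters: {pred:.3f}")
```

With real, noisy measurements a least-squares fit over more than three pilot runs is the safer choice, since an exact three-point fit passes any measurement error straight into the extrapolation.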

Interview Cheat Sheet

| Question | Key Points |
|---|---|
| "What are scaling laws?" | Loss follows power laws with params ($N$), data ($D$), compute ($C$). $C \approx 6ND$. |
| "What did Kaplan find?" | Scale models faster than data ($N \propto C^{0.73}$). Later shown to be wrong due to not training to convergence. |
| "What did Chinchilla find?" | Scale $N$ and $D$ equally ($N \propto C^{0.5}$). Optimal: ~20 tokens per parameter. |
| "What did Kaplan get wrong?" | Did not train to convergence. Larger models were compared at fewer tokens/param, biasing toward model size. |
| "Why was Chinchilla important?" | Proved Gopher (280B/300B tok) was worse than Chinchilla (70B/1.4T tok) at same compute. |
| "Why are modern models overtrained?" | Inference cost dominates. Smaller model + more data = same training cost but 10x cheaper inference. |
| "What is the data wall?" | ~10-20T quality tokens available. Limits Chinchilla-optimal models to ~500B-1T params. |
| "Are emergent abilities real?" | Debated. May be metric artifacts (discrete accuracy vs continuous log-prob). Multi-step reasoning may genuinely require scale. |
| "How are scaling laws used in practice?" | Train small pilots, fit power law, predict large model performance, decide if investment is justified. |
| "What is the irreducible loss?" | $E$ in $L = A/N^\alpha + B/D^\beta + E$. Entropy of natural language. Cannot be reduced by any model. |

Spaced Repetition Checkpoints

Day 0 (Today)

  • State the three scaling variables ($N$, $D$, $C$) and $C \approx 6ND$
  • Explain the Chinchilla-optimal ratio (20 tokens/param)
  • Explain what Kaplan got wrong

Day 3

  • Write the Chinchilla parametric loss $L = A/N^\alpha + B/D^\beta + E$
  • Calculate optimal $N$ and $D$ for a given compute budget
  • Explain overtraining and inference cost tradeoffs

Day 7

  • Do a quantitative comparison of GPT-3 vs Chinchilla
  • Explain the data wall and its implications
  • Discuss emergent abilities and the metric artifact argument

Day 14

  • Mock interview: answer all 10 cheat sheet questions
  • Derive the compute-optimal allocation using Lagrange multipliers
  • Discuss practical scaling law usage for compute budget planning

Day 21

  • Full 20-minute discussion covering Kaplan, Chinchilla, and practical implications
  • Handle follow-up questions on data walls, overtraining economics, and emergence
  • Design a scaling experiment for a hypothetical new model

Next Steps

You now understand the quantitative science behind model scaling - the power laws, the optimal allocations, and the economic tradeoffs that drive every major training decision in the industry. This knowledge connects directly to every other paper discussion in this handbook: the GPT series (Chapter 5) scaled according to these laws, the Transformer architecture (Chapter 3) is the substrate that scaling laws describe, and RLHF (Chapter 10) adds a human preference signal on top of the scaled base model.

© 2026 EngineersOfAI. All rights reserved.