Scaling Laws - The Science of Making Models Bigger

Reading time: ~45 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, LLM Engineer, ML Infrastructure Engineer

The Real Interview Moment

You are in an Anthropic research interview. The interviewer asks: "You have a fixed compute budget of roughly $1.2 \times 10^{23}$ FLOPs. Should you train a 10B parameter model on 2T tokens, or a 70B parameter model on 300B tokens? Show me the math."

You start with the Chinchilla scaling law, and she follows up: "Kaplan et al. (2020) would have given a different answer. What did Kaplan get wrong, and why does the Chinchilla-optimal ratio matter for production deployments? If you are deploying to millions of users, would you actually train Chinchilla-optimally?"

This question tests whether you understand the quantitative science of scaling - not just "bigger is better," but the precise mathematical relationships between parameters, data, and compute. Candidates who can only say "more data is better" without citing the specific power law exponents or explaining the Kaplan-Chinchilla disagreement get a "lean no-hire." Candidates who can derive the compute-optimal allocation, explain why production models are deliberately overtrained, and reason about inference cost tradeoffs get a "strong hire."

What You Will Master

State the three core scaling law variables and their power law relationships
Derive the compute-optimal training allocation (Chinchilla law)
Explain what Kaplan got wrong and why Chinchilla corrected it
Calculate optimal model size and token count for a given compute budget
Discuss why production models are deliberately overtrained
Reason about inference cost and the training-inference tradeoff
Apply scaling laws to practical model design decisions

Self-Assessment: Where Are You Now?

Skill	1 - Cannot	2 - Vaguely	3 - Can Explain	4 - Can Derive	5 - Can Teach	Your Score
State the three scaling law variables						___
Write the power law equations						___
Explain Kaplan scaling laws						___
Explain Chinchilla scaling laws						___
Derive compute-optimal allocation						___
Calculate optimal N and D for a compute budget						___
Explain overtraining and why it is done						___
Discuss inference cost tradeoffs						___
Explain emergent abilities and scaling						___
Apply scaling laws to a design decision						___

Target: All 4s and 5s before your interview.

Part 1 - The Three Variables of Scale

The Core Insight

The fundamental discovery of scaling laws research is that language model performance (measured by cross-entropy loss) follows smooth power law relationships with three variables:

$N$ - Number of model parameters
$D$ - Number of training tokens (dataset size)
$C$ - Compute budget (in FLOPs)

These three variables are not independent. For a Transformer language model, the compute required for training is approximately:

$C \approx 6ND$

Where the factor of 6 accounts for the forward pass (~2ND FLOPs) and backward pass (~4ND FLOPs).
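As a quick sanity check, the approximation can be wrapped in a small helper. This is a minimal sketch: the function name `training_flops` is illustrative, and the two configurations are the ones from the opening interview question.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ≈ 6ND (2ND forward + 4ND backward)."""
    return 6.0 * n_params * n_tokens

# The two options from the opening interview question
small = training_flops(10e9, 2e12)    # 10B parameters, 2T tokens
large = training_flops(70e9, 300e9)   # 70B parameters, 300B tokens

print(f"10B on 2T tokens:   {small:.2e} FLOPs")
print(f"70B on 300B tokens: {large:.2e} FLOPs")
```

Both options land at nearly the same compute, which is what makes the question a pure allocation problem rather than a budget problem.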

[Figure: The Scaling Triangle: Parameters, Data, Compute]

Power Law Relationships

Each variable, when the others are not limiting, produces a power law improvement in loss:

$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} \quad \text{(loss vs parameters, data not limiting)}$

$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} \quad \text{(loss vs data, parameters not limiting)}$

$L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C} \quad \text{(loss vs compute, optimally allocated)}$

Where $N_c$, $D_c$, $C_c$ are characteristic constants and $\alpha_N$, $\alpha_D$, $\alpha_C$ are the scaling exponents.

The crucial implication: performance improves as a power law, not linearly. Doubling compute does not double performance; it multiplies the loss by a fixed factor, $2^{-\alpha_C}$, so every doubling buys the same fractional reduction. Because the exponents are small, getting the next increment of performance always costs more than the last.
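To put numbers on this, the sketch below evaluates how a pure power law $L \propto C^{-\alpha}$ responds to more compute. It assumes the compute exponent of 0.050 reported by Kaplan et al. purely for illustration; the helper name `loss_ratio` is hypothetical.

```python
def loss_ratio(compute_multiplier: float, alpha: float) -> float:
    """Factor by which a power-law loss L ∝ C^(-alpha) changes when compute is multiplied."""
    return compute_multiplier ** (-alpha)

alpha_C = 0.050  # illustrative: the compute exponent reported by Kaplan et al.

for mult in (2, 10, 100):
    ratio = loss_ratio(mult, alpha_C)
    print(f"{mult:>4d}x compute -> loss x {ratio:.3f} ({(1 - ratio) * 100:.1f}% lower)")

# Compute multiplier needed to halve the loss under this exponent
halving_multiplier = 2 ** (1 / alpha_C)
print(f"Halving the loss needs ~{halving_multiplier:.2e}x more compute")
```

Under this exponent, a doubling of compute lowers the loss by only a few percent, which is why scaling decisions are made in orders of magnitude.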

60-Second Answer

"Scaling laws describe the empirical observation that language model loss follows smooth power laws with respect to model size, dataset size, and compute. The key equation is $C \approx 6ND$ , relating compute to parameters and data. Given a fixed compute budget, there is a unique optimal allocation between model size and data size that minimizes loss. Kaplan et al. (2020) found you should scale models faster than data. Chinchilla (2022) corrected this, showing parameters and data should scale equally - roughly 20 tokens per parameter. This has profound implications for which models to train and deploy."

Part 2 - Kaplan Scaling Laws (2020)

The Paper

"Scaling Laws for Neural Language Models" - Kaplan et al., 2020 (OpenAI)

The Key Findings

Kaplan et al. trained hundreds of language models ranging from 768 parameters to 1.5 billion parameters and measured how loss scaled. Their findings:

Finding 1: Smooth power laws.

$L(N) = \left(\frac{8.8 \times 10^{13}}{N}\right)^{0.076}$

$L(D) = \left(\frac{5.4 \times 10^{13}}{D}\right)^{0.095}$

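$L(C_{\min}) = \left(\frac{3.1 \times 10^8}{C_{\min}}\right)^{0.050}$

Here $N$ counts non-embedding parameters, $D$ is in tokens, and $C_{\min}$ is measured in PF-days (1 PF-day $\approx 8.64 \times 10^{19}$ FLOPs).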

Finding 2: Performance depends strongly on scale, weakly on architecture.

Model shape (depth vs width, number of heads) matters much less than total parameter count. A wide-shallow model with $N$ parameters performs similarly to a narrow-deep model with $N$ parameters.

Finding 3: The optimal allocation favors larger models.

Given a fixed compute budget, Kaplan concluded you should use most of the compute for a large model trained on relatively little data. Specifically, they claimed:

$N_{\text{opt}} \propto C^{0.73}, \quad D_{\text{opt}} \propto C^{0.27}$

This means compute should go mostly to model size, with data growing much more slowly.

import numpy as np

# Kaplan scaling law predictions
def kaplan_loss_N(N):
    """Loss as a function of model parameters (Kaplan)."""
    return (8.8e13 / N) ** 0.076

def kaplan_loss_D(D):
    """Loss as a function of training tokens (Kaplan)."""
    return (5.4e13 / D) ** 0.095

def kaplan_optimal_allocation(C):
    """
    Kaplan's compute-optimal allocation.
    Favors larger models with less data.
    """
    # N_opt ∝ C^0.73, D_opt ∝ C^0.27
    # The 1.3e9 prefactor and 1e21-FLOP normalization are illustrative,
    # not Kaplan's published fit (which measures compute in PF-days).
    N_opt = 1.3e9 * (C / 1e21) ** 0.73
    D_opt = C / (6 * N_opt)
    tokens_per_param = D_opt / N_opt
    return N_opt, D_opt, tokens_per_param

# Example: what does Kaplan recommend for different compute budgets?
for log_C in [20, 21, 22, 23, 24]:
    C = 10 ** log_C
    N, D, ratio = kaplan_optimal_allocation(C)
    print(f"C=10^{log_C}: N={N/1e9:.1f}B, D={D/1e9:.0f}B tokens, "
          f"ratio={ratio:.1f} tokens/param")

# Kaplan's prediction: model size grows much faster than data.
# Tokens per parameter falls as compute grows (D/N ∝ C^-0.46):
# from a few hundred at 1e20 FLOPs to about 4 at 1e24 FLOPs in this sketch.

What Kaplan Got Wrong

Kaplan's conclusion - favor model size over data - had a critical methodological flaw:

They did not train to convergence. Kaplan et al. used a fixed training-token count and learning-rate schedule across model sizes, which meant larger models saw proportionally fewer tokens relative to their capacity. This biased the results toward recommending larger models.

In other words, Kaplan compared:

Small model trained for many tokens ✓ (near convergence)
Large model trained for few tokens ✗ (far from convergence)

Naturally, the large model showed more room for improvement, making it seem like scaling model size was more efficient than scaling data.

Common Trap

Do not say "Kaplan showed that model size matters more than data." This is Kaplan's conclusion but it was wrong. Chinchilla showed that when you properly control for training duration, data is equally important. If an interviewer asks about Kaplan, always mention the correction: "Kaplan's methodology did not train to convergence, which biased the results toward larger models. Chinchilla corrected this."

Part 3 - Chinchilla Scaling Laws (2022)

The Paper

"Training Compute-Optimal Large Language Models" - Hoffmann et al., 2022 (DeepMind)

The Key Correction

Hoffmann et al. trained over 400 models ranging from 70M to 16B parameters, each trained on 5B to 500B tokens. Crucially, they varied both model size and training tokens for each compute budget, training many models to different degrees of convergence.

The Chinchilla Law

Their finding was dramatically different from Kaplan:

$N_{\text{opt}} \propto C^{0.50}, \quad D_{\text{opt}} \propto C^{0.50}$

Parameters and data should scale equally. For every doubling of compute, you should double both the model size and the training data.

The specific ratio they found:

$D_{\text{opt}} \approx 20 \times N_{\text{opt}}$

You should train on approximately 20 tokens per parameter.
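Combining the 20 tokens-per-parameter rule with $C \approx 6ND$ gives a closed form: $C = 6N \cdot 20N = 120N^2$, so $N = \sqrt{C/120}$. A minimal sketch of that arithmetic follows; the function name is illustrative, and the ratio of 20 is a rule of thumb rather than an exact constant.

```python
import math

def chinchilla_rule_of_thumb(compute_flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6ND with D = r * N for N and D (r = tokens per parameter)."""
    n_opt = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

for c in (1e21, 1e22, 1e23, 1e24):
    n, d = chinchilla_rule_of_thumb(c)
    print(f"C={c:.0e}: N={n / 1e9:6.1f}B params, D={d / 1e9:7.0f}B tokens")
```

Because both $N$ and $D$ scale as $\sqrt{C}$, a 100x larger budget implies a model and a dataset that are each 10x larger.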

[Figure: Kaplan vs Chinchilla Scaling Allocation]

The Joint Scaling Law

Chinchilla proposed a parametric loss function that depends on both $N$ and $D$ :

$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$

Where:

$A/N^\alpha$ - the model size term (decreasing loss as model gets bigger)
$B/D^\beta$ - the data term (decreasing loss as data gets larger)
$E$ - the irreducible loss (entropy of natural language, cannot be improved by any model)
$\alpha \approx 0.34$ , $\beta \approx 0.28$

The fitted parameters:

Parameter	Value
$A$	$406.4$
$B$	$410.7$
$\alpha$	$0.34$
$\beta$	$0.28$
$E$	$1.69$

Computing the Optimal Allocation

Given compute $C = 6ND$ , we want to minimize $L(N, D)$ subject to $6ND = C$ :

$\min_{N, D} \left[\frac{A}{N^\alpha} + \frac{B}{D^\beta} + E\right] \quad \text{s.t.} \quad 6ND = C$

Using a Lagrange multiplier and solving:

$\frac{\partial}{\partial N}\left[\frac{A}{N^\alpha} + \frac{B}{D^\beta} + \lambda(6ND - C)\right] = 0$

$\frac{-\alpha A}{N^{\alpha+1}} + 6\lambda D = 0$

Similarly for $D$ :

$\frac{-\beta B}{D^{\beta+1}} + 6\lambda N = 0$

Dividing:

$\frac{\alpha A}{N^{\alpha+1} D} = \frac{\beta B}{N D^{\beta+1}}$

$\frac{\alpha A}{N^\alpha} = \frac{\beta B}{D^\beta}$

This says: at the optimum, the loss reduction from a small proportional increase in model size equals the loss reduction from the same proportional increase in data. Neither is a bottleneck, because shifting compute from one to the other can no longer lower the loss.
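The balance condition can be carried one step further to a closed form. The sketch below substitutes the constraint $D = C/(6N)$ into $\alpha A / N^{\alpha} = \beta B / D^{\beta}$ and solves for $N$; it uses only symbols already defined in this section.

```latex
\frac{\alpha A}{N^{\alpha}} = \beta B \left(\frac{6N}{C}\right)^{\beta}
\;\Longrightarrow\;
N^{\alpha+\beta} = \frac{\alpha A}{\beta B}\left(\frac{C}{6}\right)^{\beta}

N_{\text{opt}} = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}
                 \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}},
\qquad
D_{\text{opt}} = \frac{C}{6\,N_{\text{opt}}}
               = \left(\frac{\beta B}{\alpha A}\right)^{\frac{1}{\alpha+\beta}}
                 \left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}}
```

The exponents on $C$ depend only on $\alpha$ and $\beta$, which is why the relative size of the two fitted exponents decides how compute is split.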

import numpy as np
from scipy.optimize import minimize_scalar

# Chinchilla scaling law parameters
A = 406.4
B = 410.7
alpha = 0.34
beta = 0.28
E = 1.69

def chinchilla_loss(N, D):
    """Compute loss given model size N and data size D."""
    return A / N**alpha + B / D**beta + E

def optimal_allocation(C):
    """
    Find compute-optimal N and D for a given compute budget C.
    Constraint: C = 6 * N * D
    """
    def loss_given_N(log_N):
        N = np.exp(log_N)
        D = C / (6 * N)
        if D <= 0:
            return 1e10
        return chinchilla_loss(N, D)

    # Search over possible model sizes
    result = minimize_scalar(
        loss_given_N,
        bounds=(np.log(1e6), np.log(1e12)),
        method='bounded'
    )

    N_opt = np.exp(result.x)
    D_opt = C / (6 * N_opt)
    loss = chinchilla_loss(N_opt, D_opt)
    ratio = D_opt / N_opt

    return N_opt, D_opt, loss, ratio


# What does Chinchilla recommend for various compute budgets?
print(f"{'Compute':>12s} | {'N_opt':>10s} | {'D_opt':>12s} | "
      f"{'Loss':>6s} | {'Tokens/Param':>12s}")
print("-" * 70)

for log_C in [19, 20, 21, 22, 23, 24, 25]:
    C = 10 ** log_C
    N, D, loss, ratio = optimal_allocation(C)
    print(f"10^{log_C:>2d} FLOPs | {N/1e9:>8.2f}B | {D/1e9:>10.1f}B tok | "
          f"{loss:>6.3f} | {ratio:>10.1f}")

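# With these fitted constants the ratio is not fixed: it drifts upward with
# compute (D/N ∝ C^(b-a), with b - a ≈ 0.1), from roughly 30 tokens/param at
# 1e19 FLOPs to over 100 at 1e25 FLOPs.
# The ~20 tokens/param rule of thumb comes from Chinchilla's other two
# estimation approaches, whose exponents are both close to 0.5.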

Chinchilla's Proof by Construction

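To validate their scaling law, DeepMind trained Chinchilla - a 70B parameter model on 1.4T tokens. The paper puts the budget at about $5.76 \times 10^{23}$ FLOPs; the $C \approx 6ND$ approximation gives $6 \times 70\text{B} \times 1.4\text{T} = 5.88 \times 10^{23}$ FLOPs. This was the compute-optimal allocation for their budget.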

For comparison, Gopher was a 280B parameter model trained on 300B tokens with approximately the same compute budget. According to Kaplan, Gopher should have been preferred (larger model). According to Chinchilla, the compute would be better spent on a smaller model with more data.

Model	Parameters	Tokens	Compute	Tokens/Param	MMLU
Gopher	280B	300B	~ $5.76 \times 10^{23}$	1.1	60.0%
Chinchilla	70B	1.4T	~ $5.76 \times 10^{23}$	20.0	67.6%

Same compute, dramatically different performance. Chinchilla proved that Gopher was undertrained - it had too many parameters for the amount of data it saw.
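The same conclusion can be read off the parametric loss $L(N, D) = A/N^\alpha + B/D^\beta + E$. The sketch below plugs both configurations into that formula using the fitted constants listed earlier; the predicted losses are outputs of the fit, not the measured losses of either model.

```python
# Fitted constants reported by Hoffmann et al. (2022)
A, B, ALPHA, BETA, E = 406.4, 410.7, 0.34, 0.28, 1.69

def parametric_loss(n_params: float, n_tokens: float) -> float:
    """Chinchilla parametric loss L(N, D) = A/N^alpha + B/D^beta + E."""
    return A / n_params**ALPHA + B / n_tokens**BETA + E

gopher_loss = parametric_loss(280e9, 300e9)          # 280B params, 300B tokens
chinchilla_loss_value = parametric_loss(70e9, 1.4e12)  # 70B params, 1.4T tokens

print(f"Gopher-like     (280B, 300B tok): predicted loss {gopher_loss:.3f}")
print(f"Chinchilla-like (70B, 1.4T tok):  predicted loss {chinchilla_loss_value:.3f}")
```

For the Gopher-like configuration the data term dominates the reducible loss, which is the formula's way of saying the model is undertrained.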

Instant Rejection

If asked "What is the Chinchilla scaling law?" and you answer "Bigger models are better" - that is the opposite of the point. Chinchilla showed that simply making models bigger (Kaplan's approach) is wasteful. The optimal strategy is to scale parameters and data equally, at approximately 20 tokens per parameter. Many models at the time (GPT-3, Gopher) were massively undertrained relative to their size.

Part 4 - Implications for Model Design

Was GPT-3 Undertrained?

GPT-3 (175B parameters) was trained on 300B tokens:

$\text{Tokens per parameter} = \frac{300\text{B}}{175\text{B}} \approx 1.7$

Chinchilla optimal would have been $175\text{B} \times 20 = 3.5\text{T}$ tokens. GPT-3 was trained on roughly 12x fewer tokens than Chinchilla recommends.

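This means two things:

At GPT-3's compute budget ($6 \times 175\text{B} \times 300\text{B} \approx 3.15 \times 10^{23}$ FLOPs), Chinchilla scaling would favor a roughly 50B parameter model trained on about 1T tokens, reaching a lower loss for the same cost
At GPT-3's size, the model could have performed much better if trained on 3.5T tokens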

The Model Landscape After Chinchilla

Chinchilla reshaped the entire field:

Model	Year	Parameters	Tokens	Tokens/Param	Chinchilla-Optimal?
GPT-3	2020	175B	300B	1.7	Severely undertrained
Gopher	2021	280B	300B	1.1	Severely undertrained
Chinchilla	2022	70B	1.4T	20	Yes
LLaMA-7B	2023	7B	1T	143	Deliberately overtrained
LLaMA-65B	2023	65B	1.4T	21.5	Approximately optimal
Mistral-7B	2023	7B	~8T (unofficial estimate; not disclosed)	~1143	Severely overtrained (if the estimate holds)
LLaMA-3-8B	2024	8B	15T	1875	Extremely overtrained
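The tokens-per-parameter column is simple arithmetic and can be reproduced as follows. This sketch uses only the parameter and token counts from the table; Mistral-7B is left out because its token count is an estimate.

```python
# (parameters, training tokens) taken from the table above
models = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-7B": (7e9, 1e12),
    "LLaMA-65B": (65e9, 1.4e12),
    "LLaMA-3-8B": (8e9, 15e12),
}
CHINCHILLA_RATIO = 20.0  # rule-of-thumb tokens per parameter

def tokens_per_param(n_params: float, n_tokens: float) -> float:
    return n_tokens / n_params

for name, (n, d) in models.items():
    r = tokens_per_param(n, d)
    print(f"{name:<12s} {r:8.1f} tokens/param "
          f"({r / CHINCHILLA_RATIO:.2f}x the ~20 rule of thumb)")
```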

Why Are Modern Models Overtrained?

This brings us to a crucial insight that interview candidates often miss: Chinchilla-optimal training minimizes loss for a given training compute budget, but not for a given inference compute budget.

[Figure: Training-Optimal vs Inference-Optimal Tradeoff]

Part 5 - The Training-Inference Tradeoff

The Cost Model

The total cost of an LLM is:

$\text{Total Cost} = \text{Training Cost} + \text{Inference Cost} \times \text{Number of Queries}$

$\text{Total Cost} = C_{\text{train}} + c_{\text{inference}} \times N_{\text{params}} \times Q$

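Where $Q$ is the total inference volume over the model's lifetime, measured in tokens (queries times tokens per query), and $c_{\text{inference}}$ is the cost per parameter per token.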

For a model serving millions of users:

Training cost is fixed (one-time)
Inference cost scales with both model size and query volume
Inference cost often dwarfs training cost
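One way to see when inference overtakes training is to count FLOPs. Assuming about $2N$ FLOPs per inference token (forward pass only) against $6ND$ for training, cumulative inference compute matches training compute after $3D$ tokens, independent of model size. The sketch below applies this to a 70B model trained on 1.4T tokens; the traffic figure (100M queries per day at 500 tokens each) is a hypothetical workload, and the count ignores differences in hardware utilization between training and serving.

```python
def breakeven_inference_tokens(n_params: float, train_tokens: float) -> float:
    """Inference tokens at which cumulative inference FLOPs equal training FLOPs."""
    training_flops = 6.0 * n_params * train_tokens
    inference_flops_per_token = 2.0 * n_params
    return training_flops / inference_flops_per_token  # simplifies to 3 * train_tokens

q = breakeven_inference_tokens(70e9, 1.4e12)
print(f"Break-even after {q:.2e} inference tokens")

# Hypothetical traffic: 100M queries/day, 500 tokens each
tokens_per_day = 100e6 * 500
print(f"At that traffic, break-even arrives after ~{q / tokens_per_day:.0f} days")
```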

Why Overtrain?

Consider two options that spend the same training compute:

Option A (Chinchilla-optimal): 70B model, 1.4T tokens

Training cost: $C = 6 \times 70\text{B} \times 1.4\text{T} = 5.88 \times 10^{23}$ FLOPs
Inference cost per token: proportional to 70B parameters

Option B (Overtrained): 7B model, 14T tokens

Training cost: $C = 6 \times 7\text{B} \times 14\text{T} = 5.88 \times 10^{23}$ FLOPs
Inference cost per token: proportional to 7B parameters - 10x cheaper!

Both use the same training compute. But Option B produces a model that is 10x cheaper to serve. If you are processing billions of queries, the inference savings far outweigh any suboptimality in training.
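The quality side of this trade can be estimated with the Chinchilla parametric loss. The sketch below holds training compute fixed and shrinks the model, using the fitted constants from Part 3. Treat the numbers as rough: the token counts for the small models lie far outside the range the constants were fitted on.

```python
# Fitted constants reported by Hoffmann et al. (2022)
A, B, ALPHA, BETA, E = 406.4, 410.7, 0.34, 0.28, 1.69

def parametric_loss(n_params: float, n_tokens: float) -> float:
    return A / n_params**ALPHA + B / n_tokens**BETA + E

budget = 6 * 70e9 * 1.4e12  # training FLOPs shared by every option

results = {}
for n_params in (70e9, 7e9, 1e9):
    n_tokens = budget / (6 * n_params)  # spend the whole budget at this size
    results[n_params] = parametric_loss(n_params, n_tokens)
    print(f"N={n_params / 1e9:5.0f}B, D={n_tokens / 1e12:6.1f}T tokens "
          f"-> predicted loss {results[n_params]:.3f}")
```

Under these assumptions the 7B option gives up only a small amount of predicted loss, while pushing down to 1B costs noticeably more, so overtraining has limits.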

import numpy as np

def total_cost(N_params, D_tokens, queries, cost_per_training_flop, cost_per_inference_flop):
    """
    Total cost = training + inference.

    N_params: model parameters
    D_tokens: training tokens
    queries: total lifetime inference queries (in tokens)
    """
    # Training cost: 6ND FLOPs
    training_flops = 6 * N_params * D_tokens
    training_cost = training_flops * cost_per_training_flop

    # Inference cost: ~2N FLOPs per token (forward pass only)
    inference_flops_per_token = 2 * N_params
    inference_cost = inference_flops_per_token * queries * cost_per_inference_flop

    return training_cost, inference_cost, training_cost + inference_cost


# Compare Chinchilla-optimal vs overtrained
# Assume same total training compute (same budget)
cost_train = 1e-18    # $/FLOP for training
cost_infer = 3e-18    # $/FLOP for inference (less efficient due to small batches)

# Scenario: 1 billion inference queries of 500 tokens each
total_inference_tokens = 1e9 * 500

print(f"{'Model':>20s} | {'Train $':>12s} | {'Infer $':>12s} | {'Total $':>12s}")
print("-" * 65)

for name, N, D in [
    ("Chinchilla-70B", 70e9, 1.4e12),
    ("Overtrained-7B", 7e9, 14e12),
    ("Overtrained-1B", 1e9, 98e12),
]:
    tc, ic, total = total_cost(N, D, total_inference_tokens, cost_train, cost_infer)
    print(f"{name:>20s} | ${tc/1e6:>10.1f}M | ${ic/1e6:>10.1f}M | ${total/1e6:>10.1f}M")

# The overtrained 7B model costs the same to train but 10x less to serve

The LLaMA Philosophy

Meta's LLaMA (Touvron et al., 2023) explicitly embraced this insight: "The objective of the scaling laws is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale."

LLaMA-7B was trained on 1T tokens (143 tokens/param) - about 7x the Chinchilla-optimal token count for its size. This "wasted" training compute produced a model that is far cheaper to deploy than a larger compute-optimal model, at the price of a somewhat higher loss than the same compute could have bought.

60-Second Answer

"Chinchilla showed that compute-optimal training uses about 20 tokens per parameter. But modern models like LLaMA deliberately overtrain - using 100-2000 tokens per parameter - because inference cost, not training cost, dominates the total lifetime cost of a model. A 7B model trained on 14T tokens uses the same training compute as a 70B model trained on 1.4T tokens, but costs 10x less to serve. For production deployments serving millions of users, overtraining is the economically rational choice."

Part 6 - The Scaling Law Equations in Detail

The Kaplan Parametric Form

Kaplan proposed that loss decomposes as:

$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$

This form assumes a specific functional relationship between the model and data contributions to loss.

The Chinchilla Parametric Form

Chinchilla used a simpler additive decomposition:

$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$

This assumes the model and data bottlenecks contribute independently, with an irreducible loss floor $E$ .
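To see how the Kaplan form behaves, the sketch below implements it with the constants from Part 2 and checks that it collapses to the single-variable law $L(N)$ when data is effectively unlimited. The function name is illustrative, and $N$ here follows Kaplan's convention of counting non-embedding parameters.

```python
# Constants from the single-variable Kaplan fits in Part 2
N_C, ALPHA_N = 8.8e13, 0.076
D_C, ALPHA_D = 5.4e13, 0.095

def kaplan_joint_loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D

n = 1e9
unlimited_data = kaplan_joint_loss(n, 1e30)   # data term is negligible
single_variable = (N_C / n) ** ALPHA_N        # L(N) from Part 2
print(f"Joint form, D -> infinity: {unlimited_data:.4f}")
print(f"Single-variable L(N):      {single_variable:.4f}")
```

Unlike the Chinchilla form, this expression has no additive floor $E$: the loss tends to zero as both $N$ and $D$ grow.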

Compute-Optimal Scaling Exponents

For Chinchilla, the optimal allocation follows:

$N_{\text{opt}} = G \left(\frac{C}{6}\right)^a, \quad D_{\text{opt}} = G^{-1} \left(\frac{C}{6}\right)^b$

Where $a = \frac{\beta}{\alpha + \beta}$, $b = \frac{\alpha}{\alpha + \beta}$, and $G = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha + \beta)}$.

With $\alpha = 0.34$ and $\beta = 0.28$ :

$a = \frac{0.28}{0.34 + 0.28} = 0.452, \quad b = \frac{0.34}{0.34 + 0.28} = 0.548$

Both are close to 0.5, confirming that parameters and data should scale approximately equally.
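These exponents can be checked numerically. The sketch below evaluates the closed-form allocation, taking $G = (\alpha A / \beta B)^{1/(\alpha+\beta)}$ (which follows from the balance condition $\alpha A / N^\alpha = \beta B / D^\beta$), and confirms that the result satisfies both the compute constraint and that condition.

```python
# Fitted constants reported by Hoffmann et al. (2022)
A, B, ALPHA, BETA = 406.4, 410.7, 0.34, 0.28

a = BETA / (ALPHA + BETA)    # exponent for N_opt
b = ALPHA / (ALPHA + BETA)   # exponent for D_opt
G = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))

def closed_form_allocation(compute_flops: float):
    """N_opt = G (C/6)^a, D_opt = (C/6)^b / G."""
    base = compute_flops / 6.0
    return G * base**a, base**b / G

n, d = closed_form_allocation(1e23)
lhs = ALPHA * A / n**ALPHA   # marginal term for model size
rhs = BETA * B / d**BETA     # marginal term for data
print(f"a={a:.3f}, b={b:.3f}, G={G:.3f}")
print(f"N_opt={n:.3e}, D_opt={d:.3e}, 6ND={6 * n * d:.3e}")
print(f"balance check: {lhs:.5f} vs {rhs:.5f}")
```

Because $b > a$ under these constants, the implied tokens-per-parameter ratio rises slowly with compute rather than staying fixed at 20.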

Part 7 - Emergent Abilities and Scaling

What Are Emergent Abilities?

Wei et al. (2022) defined emergent abilities as capabilities that are absent in small models but present in large models - they appear to emerge suddenly at a certain scale rather than improving gradually.

Examples of claimed emergent abilities (the thresholds are approximate and vary by model family and benchmark):

Multi-step arithmetic (appears at ~100B parameters)
Chain-of-thought reasoning (appears at ~60B parameters)
Word unscrambling (appears at ~10B parameters)

[Figure: Standard Scaling vs Emergent Abilities]

The Debate: Are Emergent Abilities Real?

This is an active research debate that interviewers may probe:

Argument for emergence (Wei et al., 2022):

Many benchmarks show near-random performance until a threshold model size, then rapid improvement
This is consistent with phase transitions in physics
Some capabilities genuinely require a minimum amount of knowledge/reasoning

Argument against emergence (Schaeffer et al., 2023):

"Emergent" abilities may be an artifact of the metric chosen
When you switch from accuracy (discrete) to log-likelihood (continuous), the improvement is smooth
The "sudden jump" is because accuracy changes from 0% to non-zero at a threshold, but the underlying probability is smoothly improving

import numpy as np

def demonstrate_metric_illusion():
    """
    Show how metric choice can create the illusion of emergence.
    """
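    # Model capability (smooth power law improvement)
    scales = np.logspace(7, 11, 50)  # 10M to 100B parameters
    # Per-step error rate falls as a smooth power law in scale, so the
    # probability of getting a single step correct rises smoothly
    # (illustrative constants: about 0.44 at 100M and 0.90 at 100B parameters)
    p_correct = 1 - (1e7 / scales) ** 0.25
    p_correct = np.clip(p_correct, 0.01, 0.99)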

    # Task requires getting 5 steps ALL correct
    # Accuracy (discrete metric)
    accuracy = p_correct ** 5

    print("Scale (params) | p(single step) | accuracy (5 steps)")
    print("-" * 55)
    for i in range(0, len(scales), 5):
        print(f"{scales[i]:>13.0f} | {p_correct[i]:>14.3f} | {accuracy[i]:>18.3f}")

    # Key insight: p_correct improves smoothly,
    # but accuracy appears to "emerge" because
    # p^5 is very small until p is close to 1

demonstrate_metric_illusion()
# Illustrative mapping from p to p^5 (reference values, not the printed rows):
# p_correct: 0.3, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99
# accuracy:  0.002, 0.03, 0.17, 0.33, 0.59, 0.77, 0.95
# Accuracy looks like a sudden jump even though capability is smooth!

Common Trap

Do not state definitively that emergent abilities either "are real" or "are not real." This is an active debate. The sophisticated answer is: "Cross-entropy loss improves smoothly with scale (this is well-established). Whether downstream task performance shows genuine phase transitions or is an artifact of discrete metrics is debated. Schaeffer et al. showed that many claimed emergent abilities disappear when using continuous metrics like log-probability instead of accuracy. However, some complex multi-step reasoning capabilities do appear to require a minimum scale."

Part 8 - Scaling Beyond Language

Vision Scaling Laws

Zhai et al. (2022) found similar scaling laws for vision models (ViT):

$L(N, D) = \frac{a}{N^{\alpha}} + \frac{b}{D^{\beta}} + c$

With $\alpha \approx 0.5$ and $\beta \approx 0.8$ for ImageNet classification.

Multimodal Scaling

Scaling laws also apply to multimodal models. The key insight is that different modalities may have different scaling exponents, meaning the optimal data mix changes with scale.

Downstream Task Scaling

An important finding: the relationship between pre-training loss and downstream task performance is itself predictable:

$\text{Task performance} = f(L_{\text{pretrain}})$

For many tasks, this relationship is approximately linear or power-law, meaning scaling laws for pre-training loss translate into scaling laws for downstream performance. This is what allows companies like OpenAI and Anthropic to predict the capabilities of future models before training them.

Part 9 - Practical Applications of Scaling Laws

Compute Budget Planning

Scaling laws allow organizations to plan multi-million dollar training runs:

Train small pilot models at multiple scales (1M, 10M, 100M, 1B parameters)
Fit scaling law coefficients ( $A$ , $B$ , $\alpha$ , $\beta$ , $E$ )
Extrapolate to predict the loss of a much larger model
Decide whether the predicted improvement justifies the investment

import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, A, alpha, B, beta_param, E):
    """
    Chinchilla parametric scaling law.
    X is a 2D array: X[0] = N (params), X[1] = D (tokens)
    """
    N, D = X
    return A / N**alpha + B / D**beta_param + E


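# Simulated pilot runs (in practice, these are real measured losses).
# Losses are generated from the Chinchilla fit plus small noise so the
# example is self-consistent. N and D are varied independently: if every
# run used the same tokens-per-parameter ratio, the fit could not tell
# the N term and the D term apart.
rng = np.random.default_rng(0)
N_grid = np.array([10e6, 50e6, 100e6, 500e6, 1e9])
ratios = np.array([5.0, 20.0, 80.0])  # tokens per parameter
N_vals = np.repeat(N_grid, len(ratios))
D_vals = N_vals * np.tile(ratios, len(N_grid))
losses = scaling_law((N_vals, D_vals), 406.4, 0.34, 410.7, 0.28, 1.69)
losses = losses + rng.normal(0.0, 0.005, size=losses.shape)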

# Fit scaling law
X = np.array([N_vals, D_vals])
popt, pcov = curve_fit(
    scaling_law, X, losses,
    p0=[400, 0.3, 400, 0.3, 1.7],
    bounds=([0, 0, 0, 0, 0], [1e6, 1, 1e6, 1, 5])
)
A_fit, alpha_fit, B_fit, beta_fit, E_fit = popt
print(f"Fitted: A={A_fit:.1f}, α={alpha_fit:.3f}, "
      f"B={B_fit:.1f}, β={beta_fit:.3f}, E={E_fit:.3f}")

# Predict loss for a 70B model on 1.4T tokens
N_target = 70e9
D_target = 1.4e12
predicted_loss = scaling_law(np.array([[N_target], [D_target]]),
                             *popt)[0]
print(f"\nPredicted loss for 70B model on 1.4T tokens: {predicted_loss:.3f}")
print(f"Training compute: {6 * N_target * D_target:.2e} FLOPs")

Model Selection for Deployment

Given an inference budget (max model size for your hardware), scaling laws tell you how much data to train on:

Inference Constraint	Optimal Model	Training Data (Chinchilla)	Training Data (Overtrained)
Edge device (1B params max)	1B	20B tokens	200B-1T tokens
Single GPU (7B params max)	7B	140B tokens	1T-15T tokens
Multi-GPU (70B params max)	70B	1.4T tokens	3T-15T tokens
Cluster (400B+ params)	400B	8T tokens	15T+ tokens

Part 10 - Open Questions and Future Directions

Does Scaling Continue?

The billion-dollar question: will power law improvements continue indefinitely, or will they plateau?

Arguments for continued scaling:

No theoretical ceiling identified for Transformer language models
Each generation has found new data sources and techniques to maintain scaling
Loss curves show no signs of bending (within current data)

Arguments for a plateau:

Available high-quality text data is finite (~10-20T tokens of quality English text)
Synthetic data generation may not substitute for real data
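Diminishing returns: under a power law, each constant-factor reduction in reducible loss requires a constant-factor increase in compute, so every further absolute gain costs more than the last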

Data Walls

The biggest practical constraint on scaling is data. Estimated high-quality text data:

Source	Estimated Tokens
Common Crawl (filtered)	~5-10T
Books	~500B
Wikipedia	~5B
Code (GitHub)	~500B
Academic papers	~100B
Total (deduplicated, quality-filtered)	~10-20T

At the Chinchilla ratio of 20 tokens/param, this limits Chinchilla-optimal models to ~500B-1T parameters. Larger models require either:

Synthetic data generation
Multimodal data (images, video, audio)
Multi-epoch training (revisiting data, which has diminishing returns)
New data sources (private data, specialized domains)
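The data-wall arithmetic can be sketched directly: for a given stock of tokens, the largest Chinchilla-optimal model is $N = D/20$, and its training compute is $6ND$. The function name is illustrative, and the token stocks are the rough estimates from the table above.

```python
def data_limited_model(available_tokens: float, tokens_per_param: float = 20.0):
    """Largest model trainable at the given tokens/param ratio, and its compute."""
    n_max = available_tokens / tokens_per_param
    compute = 6.0 * n_max * available_tokens
    return n_max, compute

for tokens in (10e12, 15e12, 20e12):
    n_max, compute = data_limited_model(tokens)
    print(f"{tokens / 1e12:4.0f}T tokens -> up to {n_max / 1e9:5.0f}B params, "
          f"{compute:.2e} training FLOPs")
```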

Beyond Loss: Scaling Laws for Capabilities

A frontier research direction is developing scaling laws not just for loss but for specific capabilities:

$P(\text{can solve task } k) = f_k(N, D, \text{task complexity})$

This would allow predicting not just "how good" a model is, but "what it can do" at each scale.

Practice Problems

Problem 1: Compute-Optimal Allocation

You have a compute budget of $C = 10^{23}$ FLOPs. Using the Chinchilla scaling law ( $D_{\text{opt}} \approx 20N_{\text{opt}}$ and $C = 6ND$ ), calculate the optimal model size and training data.

Hint

From $C = 6ND$ and $D = 20N$ : $C = 6N \cdot 20N = 120N^2$ . Therefore $N = \sqrt{C/120} = \sqrt{10^{23}/120} \approx \sqrt{8.33 \times 10^{20}} \approx 2.89 \times 10^{10} \approx 29B$ parameters. $D = 20 \times 29\text{B} = 580\text{B}$ tokens. So approximately: a 29B parameter model trained on 580B tokens.

Problem 2: Kaplan vs Chinchilla

The original GPT-3 (175B, 300B tokens) was designed based on Kaplan's scaling laws. What would Chinchilla recommend for the same compute budget?

Hint

GPT-3 compute: $C = 6 \times 175\text{B} \times 300\text{B} = 3.15 \times 10^{23}$ FLOPs. Chinchilla-optimal: $N = \sqrt{C/120} = \sqrt{3.15 \times 10^{23}/120} \approx 51\text{B}$ parameters, $D = 20 \times 51\text{B} = 1.02\text{T}$ tokens. Chinchilla recommends a model 3.4x smaller trained on 3.4x more data. Chinchilla itself (70B, 1.4T) sits on the same 20 tokens/param line at a compute budget roughly 1.9x larger.

Problem 3: Overtraining Economics

A company serves 100M queries per day, each requiring 500 tokens of generation. Compare the total 1-year cost of: (a) a 70B Chinchilla-optimal model, and (b) a 7B overtrained model with the same training compute. Assume inference costs $10^{-18}$ dollars per FLOP.

Hint

Annual queries: $100\text{M} \times 365 = 36.5\text{B}$ queries. Total inference tokens: $36.5\text{B} \times 500 = 18.25\text{T}$ tokens. (a) 70B model: inference FLOPs = $2 \times 70\text{B} \times 18.25\text{T} = 2.555 \times 10^{24}$, so inference cost = $2.555 \times 10^{24} \times 10^{-18} \approx 2.56 \times 10^{6}$ dollars (about 2.56 million). (b) 7B model: inference FLOPs = $2 \times 7\text{B} \times 18.25\text{T} = 2.555 \times 10^{23}$, so inference cost $\approx 2.56 \times 10^{5}$ dollars (about 256 thousand). The 7B model saves about 2.3 million dollars per year in inference costs. Since both models share the same training compute, this inference gap is the entire difference in total cost. Even if the smaller model required extra training compute, savings of this size would pay for it within months.

Problem 4: Data Wall

If the total available high-quality text data is 15T tokens and you follow Chinchilla-optimal scaling (20 tokens/param), what is the largest Chinchilla-optimal model you can train? What compute would it require?

Hint

$N_{\max} = D / 20 = 15\text{T} / 20 = 750\text{B}$ parameters. Compute: $C = 6 \times 750\text{B} \times 15\text{T} = 6.75 \times 10^{25}$ FLOPs. This is roughly $7\times$ the compute unofficially estimated for GPT-4 ($\sim 10^{25}$ FLOPs). To go beyond 750B parameters Chinchilla-optimally, you need more data - through synthetic generation, multimodal sources, or multi-epoch training with careful deduplication.

Problem 5: Predicting Model Performance

You train pilot models at 100M, 1B, and 10B parameters (each Chinchilla-optimal) and measure losses of 3.2, 2.8, and 2.5 respectively. Estimate the loss for a 100B parameter Chinchilla-optimal model. Assume $L(N) = A/N^{\alpha} + E$ .

Hint

With three data points and three unknowns ($A$, $\alpha$, $E$), we can fit the curve exactly. From the data: $3.2 = A/10^{8\alpha} + E$, $2.8 = A/10^{9\alpha} + E$, $2.5 = A/10^{10\alpha} + E$. Taking differences: $0.4 = A(10^{-8\alpha} - 10^{-9\alpha})$ and $0.3 = A(10^{-9\alpha} - 10^{-10\alpha})$. The ratio simplifies: $0.4/0.3 = (10^{-8\alpha} - 10^{-9\alpha})/(10^{-9\alpha} - 10^{-10\alpha}) = 10^{\alpha}$, so $\alpha = \log_{10}(1.333) \approx 0.125$. Substituting back: $A = 0.4 / \left(10^{-8\alpha}(1 - 10^{-\alpha})\right) \approx 0.4 / (0.100 \times 0.25) \approx 16.0$ and $E = 3.2 - A \cdot 10^{-8\alpha} \approx 3.2 - 1.6 = 1.6$. Prediction for 100B: successive loss drops shrink by a factor of $0.75$ per decade of parameters, so the next drop is $0.3 \times 0.75 = 0.225$ and $L \approx 2.5 - 0.225 \approx 2.28$. This should be validated with more pilot models - three points is the minimum for fitting and leaves nothing to check the fit against.
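The three-point fit can also be solved in a few lines. This sketch assumes the pilot models are spaced exactly one decade apart in parameter count, as in the problem statement; the function name `fit_three_points` is illustrative.

```python
import math

def fit_three_points(l1, l2, l3, n1=1e8, spacing=10.0):
    """Fit L(N) = A / N**alpha + E to losses measured at N, 10N and 100N."""
    # Successive loss drops shrink by a constant factor r = spacing**(-alpha)
    r = (l2 - l3) / (l1 - l2)
    alpha = -math.log(r) / math.log(spacing)
    # l1 - l2 = A * n1**(-alpha) * (1 - r)
    A = (l1 - l2) / ((1 - r) * n1 ** (-alpha))
    E = l1 - A * n1 ** (-alpha)
    return A, alpha, E

A, alpha, E = fit_three_points(3.2, 2.8, 2.5)
predicted = A / (1e11) ** alpha + E  # extrapolate to 100B parameters

print(f"alpha={alpha:.3f}, A={A:.2f}, E={E:.2f}")
print(f"Predicted loss at 100B parameters: {predicted:.3f}")
```

With exactly three points the fit passes through every measurement, so it gives no indication of its own error; extra pilot runs are what make the extrapolation trustworthy.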

Interview Cheat Sheet

Question	Key Points
"What are scaling laws?"	Loss follows power laws with params ( $N$ ), data ( $D$ ), compute ( $C$ ). $C \approx 6ND$ .
"What did Kaplan find?"	Scale models faster than data ( $N \propto C^{0.73}$ ). Later shown to be wrong due to not training to convergence.
"What did Chinchilla find?"	Scale $N$ and $D$ equally ( $N \propto C^{0.5}$ ). Optimal: ~20 tokens per parameter.
"What did Kaplan get wrong?"	Did not train to convergence. Larger models were compared at fewer tokens/param, biasing toward model size.
"Why was Chinchilla important?"	Proved Gopher (280B/300B tok) was worse than Chinchilla (70B/1.4T tok) at same compute.
"Why are modern models overtrained?"	Inference cost dominates. Smaller model + more data = same training cost but 10x cheaper inference.
"What is the data wall?"	~10-20T quality tokens available. Limits Chinchilla-optimal models to ~500B-1T params.
"Are emergent abilities real?"	Debated. May be metric artifacts (discrete accuracy vs continuous log-prob). Multi-step reasoning may genuinely require scale.
"How are scaling laws used in practice?"	Train small pilots, fit power law, predict large model performance, decide if investment is justified.
"What is the irreducible loss?"	$E$ in $L = A/N^\alpha + B/D^\beta + E$ . Entropy of natural language. Cannot be reduced by any model.

Spaced Repetition Checkpoints

Day 0 (Today)

State the three scaling variables ( $N$ , $D$ , $C$ ) and $C \approx 6ND$
Explain the Chinchilla-optimal ratio (20 tokens/param)
Explain what Kaplan got wrong

Day 3

Write the Chinchilla parametric loss $L = A/N^\alpha + B/D^\beta + E$
Calculate optimal $N$ and $D$ for a given compute budget
Explain overtraining and inference cost tradeoffs

Day 7

Do a quantitative comparison of GPT-3 vs Chinchilla
Explain the data wall and its implications
Discuss emergent abilities and the metric artifact argument

Day 14

Mock interview: answer all 10 cheat sheet questions
Derive the compute-optimal allocation using Lagrange multipliers
Discuss practical scaling law usage for compute budget planning

Day 21

Full 20-minute discussion covering Kaplan, Chinchilla, and practical implications
Handle follow-up questions on data walls, overtraining economics, and emergence
Design a scaling experiment for a hypothetical new model

Next Steps

You now understand the quantitative science behind model scaling - the power laws, the optimal allocations, and the economic tradeoffs that drive every major training decision in the industry. This knowledge connects directly to every other paper discussion in this handbook: the GPT series (Chapter 5) scaled according to these laws, the Transformer architecture (Chapter 3) is the substrate that scaling laws describe, and RLHF (Chapter 10) adds a human preference signal on top of the scaled base model.
