What are scaling laws?

Empirical power-law relationships between LLM performance and compute, data, and parameters - from Kaplan (2020) to Chinchilla (2022) and beyond.

Scaling Laws

Reading time: ~35 min · Interview relevance: High · Target roles: ML Engineer, AI Engineer, Research Engineer

The Bet That Changed the Industry

In 2019, a group at OpenAI had a decision to make. GPT-2 had just been released - 1.5 billion parameters, genuinely impressive text generation. The question on the table: do they train a model roughly 100× bigger (GPT-3) or invest in algorithmic improvements?

The conventional wisdom, imported from computer vision, said algorithmic improvements matter more than scale. You should optimize your architecture, your training procedure, your data curation - not just throw more compute at the problem. Scaling was seen as brute force, not science.

Jared Kaplan, Sam McCandlish, and their collaborators had been running a different kind of experiment. Instead of training one large model, they had trained hundreds of small models at different scales - varying the number of parameters (N), the amount of training data (D), and the compute budget (C) independently. They were looking for a pattern.

What they found was startlingly clean: the test loss of language models follows a power law in each of these quantities. Double the compute, and the loss improves by a predictable, consistent amount. Double the parameters, same story. The improvement is not random noise. It is a law.

The paper they published - "Scaling Laws for Neural Language Models" (Kaplan et al., 2020) - gave the OpenAI leadership the framework they needed. If scaling obeys a power law, and you have enough compute, you can predict how much better GPT-3 will be before you spend a single GPU-hour training it. The bet was justified. They trained GPT-3.

The result wasn't just a better model. It was a different kind of model - one that, at 175 billion parameters, could perform tasks it was never explicitly trained for. Few-shot learning, arithmetic, coding, reasoning - none of these were objectives. They emerged from scale.

Understanding scaling laws is understanding why the AI industry poured tens of billions of dollars into compute in 2023. The power laws said it was worth it.

What Scaling Laws Say

Kaplan et al. (2020) found that for transformer language models, the cross-entropy test loss $L$ scales as a power law in the key quantities:

Parameters (N)

$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}$

With exponent $\alpha_N \approx 0.076$ (fitted from data). When data and compute are not the bottleneck, every 10× increase in parameters multiplies the loss by $10^{-0.076} \approx 0.84$, roughly a 16% reduction.

Data (D, training tokens)

$L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}$

With exponent $\alpha_D \approx 0.095$ . More training data → lower loss, following a power law.

Compute (C, FLOPs)

$L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}$

With exponent $\alpha_C \approx 0.050$ . More compute → lower loss.

These exponents look small, and they are: a power law with exponent 0.05 means that a 10× compute increase multiplies the loss by $10^{-0.05} \approx 0.89$, an 11% reduction in loss. To halve the loss requires $2^{1/0.05} = 2^{20} \approx 1{,}000{,}000\times$ more compute. Progress is steady but requires exponential investment.
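The arithmetic above is easy to reproduce. The sketch below assumes a pure power law with no irreducible loss term (a simplification), and the helper names `loss_multiplier` and `scale_needed` are illustrative rather than taken from the paper.

```python
def loss_multiplier(scale_factor: float, exponent: float) -> float:
    """Factor applied to the loss when X grows by scale_factor, for L ∝ X^(-exponent)."""
    return scale_factor ** (-exponent)


def scale_needed(target_multiplier: float, exponent: float) -> float:
    """Growth in X needed for the loss to shrink to target_multiplier times its value."""
    return target_multiplier ** (-1.0 / exponent)


ALPHA_C = 0.050  # Kaplan et al. compute exponent quoted above

ten_x = loss_multiplier(10, ALPHA_C)   # loss after a 10x compute increase
halving = scale_needed(0.5, ALPHA_C)   # compute growth needed to halve the loss

print(f"10x compute multiplies loss by {ten_x:.3f}")
print(f"Halving the loss needs {halving:,.0f}x more compute")
```

Swapping in $\alpha_N$ or $\alpha_D$ gives the same picture for parameters and data.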

The Three Axes and How They Interact

For a fixed compute budget $C$ , you have to decide how to allocate it between:

Model size (N): More parameters → better loss at fixed data
Training time (D tokens seen): More tokens → better loss at fixed model size

Kaplan et al. found that parameters scale faster than data when compute is the constraint. Given a fixed compute budget, you should spend more on model size and less on data:

$\text{Kaplan's rule}: N \propto C^{0.73}, \quad D \propto C^{0.27}$

This means: if you double compute, the optimal model is $2^{0.73} \approx 1.66\times$ larger, trained on $2^{0.27} \approx 1.21\times$ more data.

This was the rule the industry followed from 2020-2022. It said: bigger models beat more training data. GPT-3 was trained on ~300B tokens with 175B parameters - a relatively short training run by later standards.
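As a rough illustration of the allocation rule, the sketch below applies the two exponents to a compute multiplier. The function name `kaplan_allocation` is hypothetical, and the code only reproduces the proportionalities quoted above, not the constants of the original fits.

```python
def kaplan_allocation(compute_multiplier: float,
                      n_exponent: float = 0.73,
                      d_exponent: float = 0.27) -> tuple[float, float]:
    """Return (model-size growth, data growth) for a given growth in compute."""
    n_growth = compute_multiplier ** n_exponent
    d_growth = compute_multiplier ** d_exponent
    return n_growth, d_growth


n_growth, d_growth = kaplan_allocation(2.0)
print(f"2x compute -> {n_growth:.2f}x parameters, {d_growth:.2f}x tokens")

# Because the exponents sum to 1 and C is proportional to N * D,
# the two growth factors multiply back to the compute multiplier.
print(f"Product: {n_growth * d_growth:.2f}x")
```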

Chinchilla: The Rule That Changed Everything

In 2022, a DeepMind team (Hoffmann et al.) published "Training Compute-Optimal Large Language Models" - known as the Chinchilla paper.

They criticized Kaplan's analysis: the experiments had not properly varied training duration. They ran a cleaner experiment - for a range of fixed compute budgets (roughly $6 \times 10^{18}$ to $3 \times 10^{21}$ FLOPs in their IsoFLOP analysis), they trained models of many different sizes for varying durations, and found the optimal combination at each budget.

Their finding contradicted Kaplan:

$\text{Chinchilla's rule}: N \propto C^{0.5}, \quad D \propto C^{0.5}$

Optimal tokens per parameter: approximately 20.

This means: for a 70B parameter model, the compute-optimal training data is $70B \times 20 = 1.4T$ tokens.
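To turn the 20-tokens-per-parameter rule into a sizing formula, one can combine it with the common approximation $C \approx 6ND$ for training FLOPs. Treat this as a back-of-the-envelope sketch under that assumption, not the paper's fitted result:

$$
C \approx 6ND, \quad D \approx 20N \;\Rightarrow\; C \approx 120\,N^2 \;\Rightarrow\; N_{\text{opt}} \approx \sqrt{\frac{C}{120}}, \quad D_{\text{opt}} \approx 20\,N_{\text{opt}}
$$

For the 70B example: $C \approx 120 \times (7 \times 10^{10})^2 \approx 5.9 \times 10^{23}$ FLOPs.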

GPT-3 Was Undertrained

At 175B parameters and ~300B tokens of training, GPT-3 had a ratio of $300B / 175B \approx 1.7$ tokens per parameter - far below the optimal 20.

According to Chinchilla's analysis, GPT-3's compute budget (about $3 \times 10^{23}$ FLOPs) would have been better spent on a much smaller model trained on far more data: the 20-tokens-per-parameter rule puts the optimum near 50B parameters and 1T tokens.

DeepMind tested the idea at the budget of their own 280B-parameter Gopher model by training Chinchilla: 70B parameters, 1.4T tokens. It outperformed Gopher (280B), GPT-3 (175B), and Megatron-Turing NLG (530B) on most benchmarks. At 4× fewer parameters than Gopher.
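A quick check of the budgets involved, assuming the rule of thumb $C \approx 6ND$ (the papers' own FLOP accounting differs slightly) and the publicly reported parameter and token counts:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the C ≈ 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


# (parameters, training tokens) as reported for each model
runs = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

flops = {name: training_flops(n, d) for name, (n, d) in runs.items()}

for name, (n, d) in runs.items():
    print(f"{name}: {flops[name]:.2e} FLOPs, {d / n:.1f} tokens/param")

# Compute-optimal size at GPT-3's budget under the 20 tokens/param rule:
# C ≈ 6 * N * 20 * N  =>  N ≈ sqrt(C / 120)
gpt3_optimal_params = (flops["GPT-3"] / 120) ** 0.5
print(f"Optimal at GPT-3's budget: {gpt3_optimal_params / 1e9:.0f}B params, "
      f"{20 * gpt3_optimal_params / 1e12:.2f}T tokens")
```

Under this approximation Chinchilla's budget is close to Gopher's and nearly twice GPT-3's.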

Post-Chinchilla Reality: LLaMA's Twist

LLaMA (Meta AI, 2023) took Chinchilla's insight and pushed it further.

Chinchilla optimizes for training compute - it gives you the best model for a given training budget. But what about inference compute? You only train a model once. You run inference millions or billions of times.

For a fixed inference cost, a smaller model that has been trained longer is better than a larger model trained for fewer steps. A 7B model trained on 1T tokens - about 7× the Chinchilla-optimal ~140B tokens - keeps improving well past the compute-optimal point and can close much of the gap to a larger model trained on fewer tokens. And the 7B model is much cheaper to run.

LLaMA's choice: train smaller models for much longer than Chinchilla-optimal - past the point of diminishing returns on training loss, because the inference savings justify the extra training compute.

LLaMA 2 (7B): trained on 2T tokens (vs Chinchilla-optimal ~140B for 7B params). This is 14× "over-trained" by Chinchilla's metric. The resulting model is competitive with much larger models at inference time.

This shift in reasoning - optimize for inference cost, not training cost - changed how the open-source community thinks about model scaling.
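The trade-off can be sketched as a break-even calculation. The code below assumes training costs about $6ND$ FLOPs and inference about $2N$ FLOPs per served token (forward pass only, ignoring attention and KV-cache costs). The two configurations are hypothetical and the function names are illustrative; the calculation says nothing about whether the two models reach the same quality.

```python
def lifecycle_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Training FLOPs (≈ 6ND) plus inference FLOPs (≈ 2N per served token)."""
    training = 6.0 * n_params * train_tokens
    inference = 2.0 * n_params * served_tokens
    return training + inference


def break_even_served_tokens(small: tuple[float, float],
                             large: tuple[float, float]) -> float:
    """Served tokens after which the smaller, longer-trained model is cheaper overall.

    Each argument is (n_params, train_tokens).
    """
    (n_s, d_s), (n_l, d_l) = small, large
    if n_s >= n_l:
        raise ValueError("'small' must have fewer parameters than 'large'")
    extra_training = 6.0 * (n_s * d_s - n_l * d_l)   # extra cost paid up front
    saving_per_token = 2.0 * (n_l - n_s)             # saved on every served token
    return max(0.0, extra_training / saving_per_token)


# Hypothetical pair: a 7B model on 2T tokens vs a 13B model on 1T tokens
small_model = (7e9, 2e12)
large_model = (13e9, 1e12)

break_even = break_even_served_tokens(small_model, large_model)
print(f"Break-even after {break_even / 1e9:.0f}B served tokens")
```

Past the break-even point, every additional served token favors the smaller model.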

Emergent Capabilities

One of the most striking findings in scaling law research: some capabilities appear suddenly at certain scales, not gradually.

Wei et al. (2022) documented this in "Emergent Abilities of Large Language Models". They found that many capabilities - arithmetic, chain-of-thought reasoning, multi-step analogy - show near-zero accuracy for models below a certain scale threshold, then jump dramatically.

Examples of emergent capabilities (Wei et al., 2022):

3-digit arithmetic: Near 0% accuracy at 5B params, jumps to 70%+ at 100B
Chain-of-thought reasoning: Appears at ~100B params with zero examples; at 7B params, usable with few-shot prompting
Multi-language translation to English: Emerges around 10B params despite no explicit translation training
Code generation: Dramatically improves beyond 5B params

The controversy: Schaeffer et al. (2023) argued that many "emergent" abilities are artifacts of measurement - if you measure with a smooth metric (e.g., token-level log-probability) rather than a threshold metric (exact string match), the improvement appears gradual, not sudden. The threshold behavior is real, but it may reflect the metric's sharpness, not a genuine discontinuity in model capability.

Practical implication: when evaluating your model, the choice of metric matters. A model that "suddenly works" on a benchmark may have been gradually improving on a softer metric all along.
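A toy model makes the metric argument concrete. It assumes each answer token is correct independently with the same probability, which is a simplification for illustration and not the setup used by Schaeffer et al.; `exact_match_accuracy` is a hypothetical helper.

```python
def exact_match_accuracy(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token of the answer is right (exact string match)."""
    return per_token_accuracy ** answer_length


ANSWER_LENGTH = 10  # tokens in the target answer (illustrative)

# Per-token accuracy improves smoothly; exact match stays near zero, then shoots up.
for p in (0.5, 0.7, 0.9, 0.95, 0.99):
    em = exact_match_accuracy(p, ANSWER_LENGTH)
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

The underlying quantity improves steadily, while the all-or-nothing metric makes it look like a sudden jump.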

Practical Scaling Law Calculations

import numpy as np


def chinchilla_optimal(compute_budget_flops: float) -> dict:
    """
    Compute Chinchilla-optimal model size and training tokens
    for a given compute budget.

    Based on Hoffmann et al. (2022), using two rules of thumb:
    - Roughly 20 training tokens per parameter at the optimum
    - Training compute C ≈ 6 * N * D, which together give N ≈ sqrt(C / 120)

    Args:
        compute_budget_flops: Total FLOPs for training

    Returns:
        dict with optimal N (parameters), D (tokens), and ratios
    """
    # Simplification: rather than using the paper's fitted coefficients,
    # apply the ~20 tokens-per-parameter rule of thumb directly.

    tokens_per_param = 20  # Chinchilla's key result

    # Approximate: C ≈ 6 * N * D (about 6 FLOPs per parameter per training token)
    # So C ≈ 6 * N * 20 * N = 120 * N^2
    # => N = sqrt(C / 120)

    optimal_N = np.sqrt(compute_budget_flops / 120)
    optimal_D = tokens_per_param * optimal_N

    # Verify: C_check ≈ 6 * N * D
    C_check = 6 * optimal_N * optimal_D

    return {
        "compute_flops": compute_budget_flops,
        "optimal_parameters": optimal_N,
        "optimal_tokens": optimal_D,
        "tokens_per_param": tokens_per_param,
        "compute_check": C_check,
    }


def estimate_training_flops(N: int, D: int) -> float:
    """
    Estimate total training FLOPs for a model with N parameters trained on D tokens.
    Rule of thumb: C ≈ 6 * N * D
    (forward pass ≈ 2*N*D FLOPs, backward ≈ 4*N*D FLOPs for gradients + activations)
    """
    return 6 * N * D


def loss_prediction(N: float, D: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """
    Chinchilla parametric loss fit (Hoffmann et al., 2022):
    L(N, D) = E + A/N^alpha + B/D^beta

    Default values are the fitted constants reported in the Chinchilla paper.
    """
    return E + A / (N ** alpha) + B / (D ** beta)


# Example: planning a training run
print("=== Chinchilla-Optimal Training Plans ===\n")

compute_budgets = {
    "Small experiment": 1e19,      # ~10^19 FLOPs
    "BERT-scale": 1e21,           # ~10^21 FLOPs
    "GPT-3-scale": 3.14e23,       # GPT-3's actual compute budget
    "GPT-4-scale (estimated)": 2e25,
}

for name, C in compute_budgets.items():
    result = chinchilla_optimal(C)
    N = result["optimal_parameters"]
    D = result["optimal_tokens"]
    print(f"{name} ({C:.0e} FLOPs):")
    print(f"  Optimal model size: {N/1e6:.0f}M parameters")
    print(f"  Optimal training tokens: {D/1e9:.0f}B tokens")
    print(f"  Predicted loss: {loss_prediction(N, D):.3f}")
    print()

# Historical comparison
print("=== Historical Models vs Chinchilla-Optimal ===\n")
historical = [
    ("GPT-3 (actual)", 175e9, 300e9),
    ("Chinchilla 70B (compute-optimal)", 70e9, 1.4e12),
    ("LLaMA-2 7B (inference-optimal)", 7e9, 2e12),
    ("LLaMA-2 70B", 70e9, 2e12),
    ("LLaMA-3 8B", 8e9, 15e12),
]

for name, N, D in historical:
    actual_C = estimate_training_flops(int(N), int(D))
    chinchilla_N = chinchilla_optimal(actual_C)["optimal_parameters"]
    ratio = N / chinchilla_N
    tpp = D / N  # tokens per parameter

    print(f"{name}:")
    print(f"  Actual: {N/1e9:.0f}B params, {D/1e12:.1f}T tokens")
    print(f"  Tokens/param: {tpp:.0f} (Chinchilla-optimal: 20)")
    print(f"  Model size vs Chinchilla-optimal: {ratio:.1f}x {'larger' if ratio>1 else 'smaller'}")
    print()

Inference-Time Compute Scaling (o1, o3)

A new dimension of scaling appeared in late 2024: test-time compute.

Instead of scaling parameters or training data, OpenAI's o1 model scales the amount of compute spent during inference - allowing the model to "think longer" about hard problems using chain-of-thought reasoning and self-verification.

The key finding: for mathematical reasoning and coding, additional inference-time compute provides scaling gains similar to training-time compute, but along a separate dimension.

Empirically (Snell et al., 2024 "Scaling LLM Test-Time Compute Optimally"):

On math problems where a smaller model already has a non-trivial success rate, extra test-time compute can let it outperform a model 14× larger in a FLOPs-matched comparison
The optimal strategy for allocating inference compute depends on problem difficulty - easy and medium problems benefit most; on the hardest problems, additional pretraining tends to help more

This opens a new axis for capability improvement: instead of training a 10× larger model, you can train a smaller model and let it "think longer" on hard problems. This is the approach behind o1 (complex reasoning mode), o3 (even more inference compute), and similar models.
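One simple form of test-time scaling is best-of-n sampling. The sketch below assumes independent samples and a perfect verifier that always recognizes a correct answer. Both are idealizations, and this is not a description of how o1 or o3 work internally.

```python
def best_of_n_success(p_single: float, n_samples: int) -> float:
    """Chance that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** n_samples


# Illustrative single-sample success rates, not measured values
for p in (0.01, 0.05, 0.20):
    for n in (1, 4, 16):
        print(f"p={p:.2f}, n={n:>2}: success ≈ {best_of_n_success(p, n):.3f}")
```

Under these assumptions the payoff depends heavily on the base success rate: at 20% per sample, 16 samples almost always contain a correct answer, while at 1% they rarely do.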

Production Engineering Notes

Estimating Training Cost

Before training any large model, estimate compute:

def estimate_training_cost(
    num_params: int,
    num_tokens: int,
    gpu_flops_per_second: float = 312e12,  # A100 BF16: 312 TFLOPS
    gpu_efficiency: float = 0.4,            # Typical MFU (Model FLOPs Utilization)
    num_gpus: int = 1,
    price_per_gpu_hour: float = 3.0,        # USD per A100 hour
) -> dict:
    """
    Estimate training compute, time, and cost.
    """
    total_flops = 6 * num_params * num_tokens

    # Effective throughput
    effective_flops_per_second = gpu_flops_per_second * gpu_efficiency * num_gpus

    # Time
    seconds = total_flops / effective_flops_per_second
    hours = seconds / 3600
    gpu_hours = hours * num_gpus

    # Cost
    cost_usd = gpu_hours * price_per_gpu_hour

    return {
        "total_flops": total_flops,
        "wall_clock_hours": hours,
        "gpu_hours": gpu_hours,
        "estimated_cost_usd": cost_usd,
    }


# Examples
configs = [
    ("BERT-base (fine-tune, 3 epochs)", 110e6, 300e6, 4),
    ("7B fine-tune, 1B tokens", 7e9, 1e9, 64),
    ("7B pretrain (Chinchilla-optimal)", 7e9, 140e9, 64),
    ("7B pretrain (LLaMA-2 scale)", 7e9, 2e12, 1024),
    ("GPT-3 (approximate)", 175e9, 300e9, 1024),
]

for name, N, D, gpus in configs:
    result = estimate_training_cost(int(N), int(D), num_gpus=gpus)
    print(f"{name}:")
    print(f"  Compute: {result['total_flops']:.2e} FLOPs")
    print(f"  Wall clock: {result['wall_clock_hours']:.1f} hours")
    print(f"  GPU-hours: {result['gpu_hours']:.0f}")
    print(f"  Cost: ${result['estimated_cost_usd']:,.0f}")
    print()

The Model FLOPs Utilization (MFU)

MFU = actual FLOPs/sec / theoretical peak FLOPs/sec.

A100 (BF16): theoretical 312 TFLOPS. Typical MFU in LLM training: 35-50%. The rest is overhead: data loading, optimizer steps, gradient communication, memory bandwidth constraints.

Flash Attention, FSDP, tensor parallelism, and careful implementation of softmax/normalization all contribute to higher MFU. PaLM achieved 46.2% MFU - a benchmark for efficient large-scale training.
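MFU can be estimated from observed training throughput. The sketch below assumes about $6N$ training FLOPs per token (ignoring the attention term that some MFU definitions include). The throughput figure is a made-up example, not a benchmark.

```python
def model_flops_utilization(n_params: float,
                            tokens_per_second: float,
                            num_gpus: int,
                            peak_flops_per_gpu: float = 312e12) -> float:
    """MFU = achieved model FLOPs/sec divided by aggregate peak FLOPs/sec."""
    achieved = 6.0 * n_params * tokens_per_second   # ≈ 6N FLOPs per training token
    peak = num_gpus * peak_flops_per_gpu            # default: A100 BF16 peak
    return achieved / peak


# Hypothetical run: 7B model on 64 A100s processing 190k tokens/sec overall
mfu = model_flops_utilization(n_params=7e9, tokens_per_second=190_000, num_gpus=64)
print(f"MFU ≈ {mfu:.1%}")
```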

Common Mistakes

:::danger Applying Chinchilla's rule without considering inference cost

Chinchilla optimizes for training compute. If you're building a product that will serve millions of requests, inference cost dominates. A 70B Chinchilla-optimal model may be the best bang for training compute but is much more expensive to serve than a 7B model trained for longer. Always consider the full lifecycle - training + inference × (expected queries) - when choosing model size.

:::

:::warning Treating scaling laws as guarantees for all tasks

Scaling laws were measured on language modeling loss (cross-entropy on held-out text). They do not guarantee that every downstream task scales smoothly. Many tasks (especially formal reasoning, multi-step arithmetic, specific domain knowledge) show irregular scaling - plateaus, sudden jumps, or even degradation at certain scales.

:::

:::tip Emergent abilities are measurement-dependent

If you're trying to hit an emergent capability threshold, don't assume you need a specific model size. Try evaluating your intermediate checkpoints with different prompting strategies (few-shot, chain-of-thought). Capabilities that don't "emerge" with zero-shot prompting may already be present and accessible with few-shot examples.

:::

Interview Q&A

Q1: What are neural scaling laws? What do they tell us?

Answer: Neural scaling laws are empirical power-law relationships between model performance (test loss) and the key scale variables: model parameters (N), training data (D), and training compute (C).

Kaplan et al. (2020) found: $L \propto N^{-0.076}$ , $L \propto D^{-0.095}$ , $L \propto C^{-0.050}$ (each with other variables held fixed). The key properties:

Smooth, predictable improvement: Loss decreases continuously as you scale - no dramatic phase transitions at specific scales (though capabilities can show emergent behavior)
Predictability: You can train small models and extrapolate to estimate the loss of a 100× larger model - reducing the need to run expensive large-scale experiments
Trade-offs are quantified: Given a fixed compute budget, you can compute the optimal allocation between model size and training duration
Architectural insensitivity: Within the transformer family, the loss depends only weakly on model shape (depth vs. width) at a fixed parameter count, suggesting the laws reflect fundamental properties of learning from language, not architecture-specific effects
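The predictability property can be sketched as a straight-line fit in log-log space. The data points below are synthetic, generated from an assumed power law for illustration only; real loss curves are noisier and include an irreducible term.

```python
import math


def fit_power_law(xs: list[float], losses: list[float]) -> tuple[float, float]:
    """Fit loss ≈ a * x^(-alpha) by least squares on (log x, log loss)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in losses]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic "small model" results from an assumed law L = 10 * N^(-0.076)
small_sizes = [1e6, 1e7, 1e8, 1e9]
small_losses = [10.0 * n ** -0.076 for n in small_sizes]

a, alpha = fit_power_law(small_sizes, small_losses)
predicted = a * (1e11) ** (-alpha)   # extrapolate 100x beyond the largest run

print(f"fitted alpha ≈ {alpha:.3f}")
print(f"predicted loss at 100B params ≈ {predicted:.3f}")
```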

Q2: What is the Chinchilla result and how did it change LLM training practices?

Answer: Hoffmann et al. (2022) showed that Kaplan et al.'s compute-optimal training overweighted model size relative to data. Kaplan recommended allocating ~73% of a compute budget increase to model size and ~27% to training data. Chinchilla showed the optimal split is ~50%/50%.

The key quantitative result: for compute-optimal training, use approximately 20 tokens per parameter. A 70B parameter model should be trained on ~1.4T tokens.

Implication: GPT-3 (175B params, 300B tokens) was undertrained at ~1.7 tokens per parameter. Chinchilla (70B params, 1.4T tokens), trained with about the same compute as the 280B Gopher, outperforms both Gopher and GPT-3.

Changes in practice:

LLaMA (2023): 7B-65B parameter models trained on 1-1.4T tokens - Chinchilla-informed
LLaMA-2: Extended training to 2T tokens - pushing further for inference efficiency
LLaMA-3: 8B model trained on 15T tokens - dramatically "over-trained" by Chinchilla, optimized for inference
The industry shifted focus from "biggest model" to "best inference-cost tradeoff"

Q3: What are emergent abilities in LLMs? Are they real?

Answer: Emergent abilities are capabilities that appear suddenly at certain model scales, showing near-zero accuracy below a threshold and high accuracy above.

Examples: 3-digit arithmetic (near 0% at ~5B params, 70%+ at ~100B), chain-of-thought reasoning, multi-language translation without explicit training.

The controversy: Schaeffer et al. (2023) argued these apparent discontinuities are artifacts of evaluation metrics:

If you measure with a threshold metric (exact string match: wrong = 0, right = 1), you get sharp phase transitions
If you measure with a smooth metric (token-level log-probability), the improvement is gradual

Their evidence: on tasks that show "emergence" with exact-match metrics, alternative metrics show smooth scaling.

The honest answer: Both views have merit. The capability improvements are real. Whether the transition is truly discontinuous (in any useful sense) is debated. For engineering purposes: capabilities like multi-step reasoning reliably appear at ~10-100B parameters, and this threshold is consistent enough to plan around.

Q4: How would you estimate the compute and cost of fine-tuning a 70B parameter model on 10B tokens?

Answer: Using the standard approximation:

Compute (FLOPs): $C \approx 6ND$ for full fine-tuning (forward + backward): $= 6 \times 70 \times 10^9 \times 10 \times 10^9 = 4.2 \times 10^{21}$ FLOPs

Time on 64 × A100 (BF16):

Theoretical: 64 GPUs × 312 TFLOPS = 19,968 TFLOPS total
With 40% MFU: effective 7,987 TFLOPS = $7.987 \times 10^{15}$ FLOPs/sec
Time: $4.2 \times 10^{21} / (7.987 \times 10^{15}) \approx 526{,}000$ seconds ≈ 146 hours ≈ 6 days

Cost (at $3/A100-hour):

146 hours × 64 GPUs × 3 USD per GPU-hour ≈ 28,000 USD

Caveats:

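LoRA/QLoRA mainly reduce memory (only adapter weights get gradients and optimizer state); FLOPs drop less, because the forward pass and activation gradients still run through the full frozen model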
Gradient checkpointing adds ~30% compute overhead to save memory
Data loading, optimizer state, communication overhead reduce effective MFU
Real cost depends on cloud provider, GPU type, spot vs on-demand pricing

For most production fine-tuning: use LoRA (5-10× cheaper), consider 8-bit or 4-bit quantization, use spot instances for cost reduction.

Q5: What is the "inference-time compute" scaling paradigm? How does it differ from training-time scaling?

Answer: Training-time scaling (Kaplan, Chinchilla) improves model capability by investing more FLOPs during training: bigger models, more data, longer training. The resulting model is fixed at inference time.

Inference-time compute scaling (o1, o3, Snell et al. 2024) improves performance on specific tasks by investing more FLOPs during inference: generating more reasoning tokens, sampling multiple solutions and selecting the best, using tree search or other deliberate algorithms.

Key differences:

Cost model: Training-time compute is amortized over all inference calls. Inference-time compute scales with each query - spending 16× more compute per query is 16× more expensive to serve.

Task selectivity: Inference-time scaling helps most on hard, structured problems (math, code, formal reasoning) where more deliberate thinking improves the answer. It helps little on factual recall or generation tasks.

Complementarity: They scale along different axes. A base model improved by training-time scaling can be further improved for hard tasks by inference-time scaling. A reasoning model such as o1 that spends extra inference compute does better on hard problems than standard sampling from a comparable base model, and this benefit is on top of the base model's training-time improvements.

Practical implication: For mission-critical reasoning tasks (autonomous agents, complex problem solving), budget additional inference compute. For high-volume, simple tasks (classification, short-form generation), use smaller models with fast inference - inference-time scaling is too expensive.

The open research question: is there a fundamental limit to inference-time scaling, analogous to training-time scaling laws? Early evidence suggests yes - there are scaling laws for inference-time compute with similar power-law behavior, but the exponents and saturation points are task-dependent.
