Language Modeling Objectives

A Model That Reads the Internet

Picture a team at a research lab in 2019. They have 10,000 TPUs, access to the entire Common Crawl dataset - about 45 terabytes of text scraped from the web - and a clear goal: build a model that understands language. The question staring them down is not about architecture. The transformer architecture was already solid. The question is: what exactly should the model be trained to predict?

This sounds like a minor implementation detail. It is not. The training objective determines everything - what the model learns to do, what it is good at, what it cannot do. A model trained to fill in blanks learns something fundamentally different from a model trained to predict what comes next, even if both models have identical architectures and identical training data.

The team writes one line in their config file: objective: causal_lm. The model will learn to predict the next token given all previous tokens. This single decision will produce a model that is extraordinarily good at text generation but has no direct mechanism to look at context from both directions simultaneously. A different team makes a different choice: objective: masked_lm. Their model will be able to use context from both left and right to understand each token. It will be excellent at tasks that require understanding a full sentence - classification, question answering, named entity recognition - but it cannot generate text token by token the same way.

The first team is OpenAI. Their model is GPT-2. The second team is Google. Their model is BERT. Both became foundational, but for completely different things. The objective is not a detail. It is the soul of the model.

By the end of this lesson, you will understand exactly what each training objective computes, why the choice matters for downstream tasks, and how to implement the cross-entropy loss for language modeling from scratch.

Why This Exists: The Problem with Supervised Learning

Before self-supervised language modeling, building NLP systems required labeled data. You want a model to understand sentiment? Label 10,000 sentences as positive or negative. You want it to do named entity recognition? Annotate thousands of documents with entity spans. This process is expensive, slow, and produces models that are narrow - trained on one task and brittle when the input distribution shifts even slightly.

The key insight that changed everything: text is its own supervision signal. Every sentence is a sequence of tokens, and every token (except the first) is a label for the tokens before it. A 1TB corpus of text contains, implicitly, hundreds of billions of (input, label) pairs that cost nothing to generate - you just take the text and use each token as the target for predicting from the context.

This is called self-supervised learning. No human annotation required. The data labels itself. The result is a model that can be pretrained on raw text at massive scale, learning rich representations of language structure, facts, and reasoning - and then fine-tuned on small labeled datasets for specific tasks.

Historical Context: A Brief Timeline

1948 - Claude Shannon publishes "A Mathematical Theory of Communication." He defines entropy and proposes that language can be modeled probabilistically. In a 1951 follow-up, "Prediction and Entropy of Printed English," he asks how well humans can predict the next character in English text and estimates the entropy of English at roughly 0.6 to 1.3 bits per character.

2003 - Bengio et al. publish "A Neural Probabilistic Language Model." They train a feed-forward network to predict the next word and show that distributed representations (embeddings) outperform n-gram models. This is the ancestor of every modern LLM.

2013 - Word2Vec (Mikolov et al.) shows that simple self-supervised objectives on text produce word embeddings with remarkable semantic structure, including the famous king - man + woman ≈ queen property.

2018 - GPT (Radford et al.) and BERT (Devlin et al.) are released within months of each other, representing the two dominant paradigms: causal (left-to-right) and masked (bidirectional) language modeling.

2020 - GPT-3 (Brown et al.) demonstrates that scaling causal language models creates emergent few-shot capabilities. The training objective choice had consequences nobody anticipated.

The Two Main Paradigms

Causal Language Modeling (CLM)

Causal language modeling is the simplest possible formulation: given the previous tokens, predict the next one. Formally, the model learns a distribution:

$P(x_t \mid x_1, x_2, \ldots, x_{t-1})$

The probability of the full sequence is then the product of these conditional probabilities:

$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$

This factorization, called the chain rule of probability, is exact - it is not an approximation. Any joint distribution over sequences can be decomposed this way. The model is autoregressive: it generates one token at a time, each conditioned on all previous tokens.
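As a small illustration of the factorization above, the sketch below scores a sequence under a toy bigram model. The probability table `BIGRAM` and the `<s>` start symbol are made-up values for illustration only, and a bigram model conditions on just one previous token, whereas a transformer conditions on the full prefix.

```python
import math

# Hypothetical conditional probabilities P(next | previous), for illustration only.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "ran": 0.3},
}


def sequence_log_prob(tokens):
    """Sum of log P(x_t | x_{t-1}) over the sequence.

    Summing logs is the chain rule in log space; it avoids the underflow
    you would get from multiplying many small probabilities.
    """
    total = 0.0
    prev = "<s>"
    for tok in tokens:
        total += math.log(BIGRAM[prev][tok])
        prev = tok
    return total


log_p = sequence_log_prob(["the", "cat", "sat"])
# exp(log_p) equals the product 0.6 * 0.5 * 0.7 up to floating-point error.
print(math.exp(log_p))
```

Working in log space is also what the training loss does: the negative of this sum, averaged over tokens, is the cross-entropy introduced later in the lesson.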

The key architectural constraint: the attention mechanism uses a causal mask that prevents each position from attending to future positions. Position 5 can only see positions 1 through 5. This ensures the model cannot "cheat" during training by looking at the token it is supposed to predict.

For example, take the sentence "The cat sat on the mat". Given the input "The cat sat on the", the final target is "mat", but during training all five prediction targets (cat, sat, on, the, mat) are computed in parallel in a single forward pass using the causal mask - this is what makes the transformer efficient to train compared to RNNs, which had to process tokens sequentially.
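The shift between inputs and targets, and the visibility pattern enforced by the causal mask, can be sketched in plain Python. This is an illustrative toy with word-level tokens; real models operate on token ids and apply the mask inside attention.

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Inputs are all tokens but the last; targets are the same tokens shifted left by one.
inputs, targets = tokens[:-1], tokens[1:]

# Causal mask: row i marks which input positions position i may attend to (0-indexed).
n = len(inputs)
causal_mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Each row is one training example; all rows are produced by a single forward pass.
for i, target in enumerate(targets):
    visible = [tok for tok, allowed in zip(inputs, causal_mask[i]) if allowed]
    print(f"{' '.join(visible):<22} -> {target}")
```

The mask is lower-triangular: the first position sees only itself, the last sees the whole prefix.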

What CLM models learn well: Generation, continuation, in-context learning, reasoning by token-by-token chain of thought. GPT-2, GPT-3, LLaMA, Mistral, Claude - all use CLM.

What CLM models are less suited for: Tasks requiring full bidirectional context to classify a single span (e.g., "does this sentence express positive sentiment?").

Masked Language Modeling (MLM)

Masked language modeling takes a different approach. Instead of predicting the next token, the model predicts randomly masked tokens using context from both left and right.

$P(x_{\text{masked}} \mid x_{\text{unmasked}})$

At training time, some percentage of tokens (BERT uses 15%) are replaced with a special [MASK] token (or randomly replaced, or left unchanged). The model must predict the original tokens at those positions.

Because the model sees both left and right context for each prediction, it learns deeply bidirectional representations. A model predicting the word "bank" in "She deposited money at the ___" can use both the left context ("deposited money") and right context (if any exists) to predict the answer.

What MLM models learn well: Understanding tasks - classification, NER, extractive QA, semantic similarity. BERT, RoBERTa, and DeBERTa all use MLM; ELECTRA uses a closely related replaced-token-detection objective, covered below.

What MLM models cannot do natively: Autoregressive text generation. You cannot generate text token-by-token with a masked LM because the model expects the full context (including positions after the generation point) to be present.

The Cross-Entropy Loss

Both CLM and MLM use cross-entropy loss, just applied differently. Understanding this loss is essential.

For a single prediction, if the model outputs a probability distribution over the vocabulary $V$ and the true token is $x_t$, the cross-entropy loss is:

$\mathcal{L} = -\log P_\theta(x_t \mid \text{context})$

This is just the negative log probability assigned to the correct token. If the model assigns probability 0.9 to the correct token, the loss is $-\log(0.9) \approx 0.105$. If it assigns probability 0.01, the loss is $-\log(0.01) \approx 4.6$.

For an entire sequence of $T$ tokens, the average loss is:

$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1})$

For MLM, only the loss at masked positions is computed and the unmasked positions are ignored.

In PyTorch, this is computed using F.cross_entropy, which takes the raw logits (before softmax) and the target indices. This is numerically more stable than computing softmax first and then taking the log.
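To make the stability point concrete, here is a minimal pure-Python sketch of a logits-based cross-entropy using the log-sum-exp trick. It illustrates the idea only; it is not PyTorch's actual implementation, and the function names are chosen for this example.

```python
import math


def log_softmax(logits):
    """Stable log-softmax: subtract the max before exponentiating (log-sum-exp trick)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]


def cross_entropy(logits, target_index):
    """Negative log probability of the target token, computed from raw logits."""
    return -log_softmax(logits)[target_index]


# A naive softmax would call math.exp(1000.0), which overflows.
# Subtracting the max first keeps every exponent <= 0.
logits = [1000.0, 999.0, 998.0]
loss = cross_entropy(logits, target_index=0)
print(f"{loss:.4f}")
```

Only differences between logits matter, so shifting them all by the maximum changes nothing mathematically while keeping the exponentials in a safe range.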

Perplexity: What It Means Intuitively

Perplexity (PPL) is the most common metric for evaluating language models. It is defined as the exponentiated average negative log-likelihood:

$\text{PPL} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})\right)$

Intuitively, perplexity answers: "On average, how many tokens was the model equally confused between when making each prediction?"

Perplexity of 1 means perfect prediction - the model always assigns probability 1 to the correct next token
Perplexity of 100 means the model was as uncertain as if it were choosing uniformly among 100 options
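GPT-2 (1.5B parameters) reports a zero-shot perplexity of about 17.5 on WikiText-103
GPT-3 reports a zero-shot perplexity of about 20.5 on Penn Treebank, a different benchmark, so the two numbers are not directly comparable
A model that guesses uniformly over a vocabulary of 50,000 would have perplexity 50,000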
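The reference points above can be checked with a few lines of Python. This is a minimal sketch that takes the probabilities a model assigned to the correct tokens as input; the example probabilities are made up.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the correct tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A uniform model over V tokens assigns 1/V to every correct token, so PPL = V.
V = 50_000
uniform_ppl = perplexity([1 / V] * 10)

# With mixed confidence, PPL is the inverse geometric mean of the probabilities,
# so a single very unlikely token (0.01) pulls perplexity up sharply.
mixed_ppl = perplexity([0.9, 0.5, 0.01])

print(round(uniform_ppl), round(mixed_ppl, 2))
```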

note

Perplexity is only comparable across models trained with the same tokenizer. A tokenizer with a larger vocabulary splits the same text into fewer, more information-dense tokens, so per-token perplexity tends to be higher - even if the model is better.

Other Pretraining Objectives

Next Sentence Prediction (NSP)

BERT's original training included NSP: given two sentences A and B, predict whether B actually follows A in the original text (positive example) or is a randomly sampled sentence (negative example). The hypothesis was that NSP would teach models to understand inter-sentence relationships.

Did it work? RoBERTa (Liu et al., 2019) showed that removing NSP and training longer with larger batches actually improved downstream performance. The conclusion: NSP was too easy and did not add signal beyond what MLM already provided. Modern BERT-style models do not use NSP.

Sentence Order Prediction (SOP)

ALBERT (Lan et al., 2019) replaced NSP with Sentence Order Prediction: predict whether two consecutive segments are in the correct order or swapped. This is harder than NSP (both segments come from the same document, so the model cannot solve it by noticing topic differences) and proved more useful.

Span Corruption (T5)

T5 (Raffel et al., 2019) introduced a different objective: replace contiguous spans of tokens (not individual tokens) with sentinel tokens, one unique sentinel per span, and train the decoder to predict the original spans. T5 is an encoder-decoder model, so this bridges the gap between encoder-only (BERT) and decoder-only (GPT) models and enables the model to be used for both understanding and generation tasks.

For the sentence "Thank you for inviting me to your party last week.", one possible corruption is:

$\text{Input:} \quad \text{Thank you } \langle X \rangle \text{ me to your party } \langle Y \rangle \text{ week.}$

$\text{Target:} \quad \langle X \rangle \text{ for inviting } \langle Y \rangle \text{ last } \langle Z \rangle$
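The construction of a span-corruption example can be sketched in a few lines. This is an illustrative toy on whitespace tokens, using the example sentence from the T5 paper; the spans are chosen by hand here rather than sampled, and `<X>`, `<Y>`, `<Z>` stand in for the tokenizer's reserved sentinel tokens.

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each (start, end) span with a sentinel; the target lists the dropped spans.

    `spans` must be sorted, non-overlapping, half-open index ranges, and there
    must be at least one more sentinel than spans (the last one ends the target).
    """
    corrupted, target = [], []
    cursor = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        corrupted.extend(tokens[cursor:start])   # keep text before the span
        corrupted.append(sentinel)               # one sentinel replaces the whole span
        target.append(sentinel)
        target.extend(tokens[start:end])         # the decoder must reproduce the span
        cursor = end
    corrupted.extend(tokens[cursor:])
    target.append(sentinels[len(spans)])         # closing sentinel
    return corrupted, target


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(" ".join(corrupted))
print(" ".join(target))
```

Because each span collapses to a single sentinel, both the encoder input and the decoder target are shorter than the original text, which is part of what makes this objective cheap to train.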

Replaced Token Detection (ELECTRA)

ELECTRA (Clark et al., 2020) uses a generator network (a small MLM model) to replace some tokens, then trains a discriminator to detect which tokens were replaced. Unlike MLM (which only computes loss on 15% of tokens), ELECTRA computes a signal on every single token - making it dramatically more sample-efficient. The ELECTRA paper reports that ELECTRA-Small outperforms GPT on the GLUE benchmark despite using roughly 30x less compute.
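A minimal sketch of the discriminator's side of this objective, assuming the generator has already produced a corrupted sequence. The discriminator probabilities below are made-up numbers, and the function names are chosen for this example.

```python
import math


def rtd_labels(original, corrupted):
    """1 where the generator's token differs from the original, else 0."""
    return [int(o != c) for o, c in zip(original, corrupted)]


def rtd_loss(replaced_probs, labels):
    """Mean binary cross-entropy over every position, not just the corrupted ones."""
    total = 0.0
    for p, y in zip(replaced_probs, labels):
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(labels)


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # hypothetical generator output
labels = rtd_labels(original, corrupted)

# Hypothetical discriminator outputs: P(token was replaced) at each position.
probs = [0.1, 0.2, 0.9, 0.1, 0.3]
print(labels, round(rtd_loss(probs, labels), 4))
```

Every position contributes to the loss, which is the source of the sample-efficiency claim: an MLM batch of the same size would only learn from the masked subset.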

How Objective Choice Shapes Model Capabilities

The objective is not just a training detail - it is an architectural constraint that determines the information flow in the model. A causal model cannot use future context because the causal mask prevents it. This means at inference time, generating the 100th token requires conditioning on all 99 previous tokens, one new token per step (implementations typically cache the keys and values of earlier tokens rather than recomputing them). A masked model has no such constraint, but it also cannot generate sequentially.

This is why, when building a system that generates text (chatbots, code assistants, summarizers), you almost always use a CLM model. When building a classifier or information extractor, you often do better with an MLM model - or a fine-tuned CLM model that has seen enough instruction-tuning data to do classification.

Code: Implementing Language Modeling Loss from Scratch

import torch
import torch.nn.functional as F
from torch import Tensor
from typing import Optional


def causal_lm_loss(
    logits: Tensor,       # (batch, seq_len, vocab_size)
    labels: Tensor,       # (batch, seq_len) - token ids
    ignore_index: int = -100
) -> Tensor:
    """
    Compute causal language modeling loss.

    The key: we shift logits and labels so that at position t,
    we are predicting token t+1 from tokens 0..t.

    logits[:, :-1, :] are the predictions for positions 1..T
    labels[:, 1:] are the targets (the actual tokens at positions 1..T)
    """
    # Shift: predict token[t+1] from token[t]
    shift_logits = logits[:, :-1, :].contiguous()   # (batch, seq_len-1, vocab_size)
    shift_labels = labels[:, 1:].contiguous()         # (batch, seq_len-1)

    # Flatten to (batch * (seq_len - 1), vocab_size) and (batch * (seq_len - 1),)
    vocab_size = shift_logits.size(-1)
    shift_logits = shift_logits.view(-1, vocab_size)
    shift_labels = shift_labels.view(-1)

    # Cross-entropy loss - averages over non-ignored positions
    loss = F.cross_entropy(
        shift_logits,
        shift_labels,
        ignore_index=ignore_index,
        reduction='mean'
    )
    return loss


def masked_lm_loss(
    logits: Tensor,           # (batch, seq_len, vocab_size)
    labels: Tensor,           # (batch, seq_len) - -100 for non-masked positions
    ignore_index: int = -100
) -> Tensor:
    """
    Compute masked language modeling loss.

    Unlike CLM, we do NOT shift - we predict at the exact positions
    that were masked. Non-masked positions have label = ignore_index.
    """
    batch_size, seq_len, vocab_size = logits.shape
    flat_logits = logits.view(-1, vocab_size)
    flat_labels = labels.view(-1)

    loss = F.cross_entropy(
        flat_logits,
        flat_labels,
        ignore_index=ignore_index,
        reduction='mean'
    )
    return loss


def compute_perplexity(loss: float) -> float:
    """
    Perplexity = exp(average cross-entropy loss).

    If loss = 3.0 (nats), perplexity = e^3 ≈ 20.1
    This means the model is as uncertain as choosing uniformly from ~20 tokens.
    """
    import math
    return math.exp(loss)


# ---- Example: MLM masking logic ----

def create_mlm_inputs(
    input_ids: Tensor,        # (batch, seq_len)
    vocab_size: int,
    mask_token_id: int,
    mask_prob: float = 0.15,
    ignore_index: int = -100
):
    """
    Apply BERT-style masking:
    - Select 15% of tokens randomly
    - Of those:
      - 80% → replace with [MASK]
      - 10% → replace with random token
      - 10% → leave unchanged

    Returns masked input_ids and labels (only masked positions have real labels).
    """
    labels = input_ids.clone()

    # Create mask for which positions to consider (15% of tokens)
    probability_matrix = torch.full(input_ids.shape, mask_prob)
    masked_indices = torch.bernoulli(probability_matrix).bool()

    # Only compute loss on masked positions
    labels[~masked_indices] = ignore_index

    # 80% → [MASK]
    indices_replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked_indices
    input_ids[indices_replaced] = mask_token_id

    # 10% → random token (of the remaining 20% of masked positions)
    indices_random = (
        torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
        & masked_indices
        & ~indices_replaced
    )
    random_words = torch.randint(vocab_size, input_ids.shape)
    input_ids[indices_random] = random_words[indices_random]

    # Remaining 10% → unchanged (labels still set, model must predict original)

    return input_ids, labels


# ---- Demo ----
if __name__ == "__main__":
    torch.manual_seed(42)

    batch_size = 2
    seq_len = 10
    vocab_size = 1000

    # Simulate CLM
    logits = torch.randn(batch_size, seq_len, vocab_size)
    labels = torch.randint(0, vocab_size, (batch_size, seq_len))

    clm_loss = causal_lm_loss(logits, labels)
    print(f"CLM Loss: {clm_loss.item():.4f}")
    print(f"CLM Perplexity: {compute_perplexity(clm_loss.item()):.1f}")

    # Simulate MLM
    mlm_input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
    masked_ids, mlm_labels = create_mlm_inputs(
        mlm_input_ids.clone(),
        vocab_size=vocab_size,
        mask_token_id=103  # BERT's [MASK] token id
    )

    mlm_logits = torch.randn(batch_size, seq_len, vocab_size)
    mlm_loss_val = masked_lm_loss(mlm_logits, mlm_labels)
    print(f"\nMLM Loss: {mlm_loss_val.item():.4f}")
    print(f"MLM Perplexity: {compute_perplexity(mlm_loss_val.item()):.1f}")

    # Show which positions are masked
    print(f"\nOriginal IDs: {mlm_input_ids[0].tolist()}")
    print(f"Masked IDs:   {masked_ids[0].tolist()}")
    print(f"Labels:       {mlm_labels[0].tolist()} (-100 = not predicted)")

Practical Implementation with HuggingFace

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
)
import torch

# ---- CLM Example (GPT-2) ----
tokenizer_gpt = AutoTokenizer.from_pretrained("gpt2")
tokenizer_gpt.pad_token = tokenizer_gpt.eos_token

model_gpt = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The transformer architecture revolutionized natural language processing"
inputs = tokenizer_gpt(text, return_tensors="pt")

# Forward pass - HuggingFace handles the shift internally
outputs = model_gpt(**inputs, labels=inputs["input_ids"])
print(f"GPT-2 CLM loss: {outputs.loss.item():.4f}")
print(f"GPT-2 perplexity: {torch.exp(outputs.loss).item():.2f}")

# ---- MLM Example (BERT) ----
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
model_bert = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# DataCollatorForLanguageModeling handles BERT-style masking automatically
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer_bert,
    mlm=True,
    mlm_probability=0.15  # 15% masking rate
)

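# Tokenize a batch without padding; the collator pads and builds the attention mask,
# so padding positions are correctly excluded from attention
texts = [
    "The cat sat on the mat.",
    "Language models learn from raw text."
]
encoded = tokenizer_bert(texts, truncation=True)

# Apply padding and BERT-style masking
# Note: with inputs this short, occasionally no token is selected for masking,
# in which case the reported loss is NaN
batch = data_collator([
    {"input_ids": encoded["input_ids"][i], "attention_mask": encoded["attention_mask"][i]}
    for i in range(len(texts))
])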

outputs_bert = model_bert(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"] if "attention_mask" in batch else None,
    labels=batch["labels"]
)
print(f"\nBERT MLM loss: {outputs_bert.loss.item():.4f}")
print(f"BERT perplexity: {torch.exp(outputs_bert.loss).item():.2f}")

Production Engineering Notes

Tokenizer Consistency is Critical: Never compare perplexity across models with different tokenizers. A model with a 100K vocabulary will often show higher per-token perplexity than a model with a 32K vocabulary on the same text - even if the 100K-vocabulary model is strictly better. The denominator of the average changes: the larger vocabulary covers the same text with fewer tokens, so each token carries more information.

Streaming Cross-Entropy for Long Sequences: For sequences longer than 4096 tokens, computing logits for the full sequence requires storing a tensor of shape (batch, seq_len, vocab_size). At vocab_size=32K and seq_len=8192 in float32, this is 8192 × 32000 × 4 bytes ≈ 1GB per batch item. Use gradient checkpointing or compute loss in chunks.
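One way to read "compute loss in chunks" is sketched below in plain Python: accumulate the loss sum and the count of non-ignored targets chunk by chunk, so that only one chunk of logits exists at a time. The `logits_for` callback is a hypothetical stand-in for projecting a slice of hidden states to the vocabulary; in a real training loop you would also need the backward pass to avoid retaining every chunk, which this sketch does not address.

```python
import math

IGNORE_INDEX = -100


def nll_from_logits(logits, target):
    """Negative log-softmax at the target index, via log-sum-exp."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[target]


def chunked_mean_loss(logits_for, targets, chunk_size):
    """Mean cross-entropy without materializing logits for all positions at once.

    `logits_for(start, end)` is a hypothetical callback returning the logit
    rows for positions [start, end).
    """
    total, count = 0.0, 0
    for start in range(0, len(targets), chunk_size):
        end = min(start + chunk_size, len(targets))
        for row, tgt in zip(logits_for(start, end), targets[start:end]):
            if tgt == IGNORE_INDEX:
                continue  # same convention as ignore_index in F.cross_entropy
            total += nll_from_logits(row, tgt)
            count += 1
    return total / max(count, 1)


# Toy data: 6 positions, vocabulary of 4, one ignored target.
all_logits = [[float((i * j) % 5) for j in range(4)] for i in range(6)]
targets = [0, 3, IGNORE_INDEX, 1, 2, 0]

chunked = chunked_mean_loss(lambda s, e: all_logits[s:e], targets, chunk_size=2)
full = chunked_mean_loss(lambda s, e: all_logits[s:e], targets, chunk_size=len(targets))
print(abs(chunked - full) < 1e-12)
```

The key detail is dividing by the total number of non-ignored targets at the end, rather than averaging per-chunk means, which would weight chunks with few valid targets incorrectly.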

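Label Smoothing: Some training runs use label smoothing (typically 0.1), most famously the original Transformer - instead of hard targets (probability 1 at the correct token), use soft targets (roughly probability 0.9 at the correct token, with the remaining 0.1 spread over the other tokens). This discourages overconfident predictions and can slightly improve generalization; many large-scale LLM pretraining runs omit it, so treat it as a hyperparameter to validate rather than a default. Note that PyTorch's label_smoothing spreads the 0.1 uniformly over all V classes, including the correct one.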

loss = F.cross_entropy(logits, labels, label_smoothing=0.1)
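To see what the smoothing term does numerically, here is a minimal pure-Python sketch. It assumes the convention where the smoothing mass ε is spread uniformly over all V classes (so the correct class ends up with 1 - ε + ε/V), which differs slightly from the ε/(V-1) description; the function name and logits are illustrative.

```python
import math


def smoothed_cross_entropy(logits, target, eps=0.1):
    """Label-smoothed loss with the smoothing mass spread uniformly over all V classes."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - lse for z in logits]
    nll = -log_probs[target]                     # loss against the hard one-hot target
    uniform = -sum(log_probs) / len(log_probs)   # loss against a uniform target
    return (1.0 - eps) * nll + eps * uniform


logits = [4.0, 1.0, 0.5, 0.0]
hard = smoothed_cross_entropy(logits, target=0, eps=0.0)
soft = smoothed_cross_entropy(logits, target=0, eps=0.1)
print(f"hard={hard:.4f} smoothed={soft:.4f}")
```

For a confident, correct prediction the smoothed loss is higher than the hard-target loss: the uniform term penalizes pushing the other tokens' probabilities toward zero, which is the regularizing effect.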

warning

Sequence Packing vs Padding: Padding short sequences to the same length wastes compute on attention over padding tokens. Production training almost always uses sequence packing: concatenate multiple documents into one long sequence with a separator token. This eliminates padding waste entirely and significantly improves GPU utilization. Make sure your attention mask correctly separates packed documents if you do not want cross-document attention.
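A minimal sketch of packing with a document-aware causal mask, in plain Python. The token ids, the separator id `0`, and both function names are made up for illustration; production pipelines usually express the same idea through per-document offsets passed to a fused attention kernel rather than a dense mask.

```python
def pack_documents(docs, seq_len, sep_id):
    """Concatenate token-id lists with a separator and cut into fixed-length rows.

    Returns (rows, doc_ids); doc_ids[r][i] says which document token i of row r
    came from. The trailing partial row is dropped here for simplicity.
    """
    stream, stream_doc = [], []
    for doc_idx, doc in enumerate(docs):
        stream.extend(doc + [sep_id])
        stream_doc.extend([doc_idx] * (len(doc) + 1))
    n_rows = len(stream) // seq_len
    rows = [stream[r * seq_len:(r + 1) * seq_len] for r in range(n_rows)]
    doc_ids = [stream_doc[r * seq_len:(r + 1) * seq_len] for r in range(n_rows)]
    return rows, doc_ids


def packed_causal_mask(row_doc_ids):
    """Position i may attend to j only if j <= i and both belong to the same document."""
    n = len(row_doc_ids)
    return [[int(j <= i and row_doc_ids[i] == row_doc_ids[j]) for j in range(n)]
            for i in range(n)]


docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]  # made-up token ids
rows, doc_ids = pack_documents(docs, seq_len=6, sep_id=0)
mask = packed_causal_mask(doc_ids[0])
print(rows[0], doc_ids[0])
```

The resulting mask is block-diagonal within the causal triangle: tokens of the second document in a row cannot attend back into the first.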

Common Mistakes

danger

Computing CLM loss without the shift: A very common mistake in hand-written loss code is to compare the logits at position t directly with the label at position t. HuggingFace models for causal LM handle the shift internally when you pass labels=input_ids - but if you implement the loss manually, you must shift: shift_logits = logits[:, :-1, :], shift_labels = labels[:, 1:]. Failing to shift means you are computing loss at position t predicting token t (the token the model already sees as input), so the model learns to copy its input and the loss drops to near zero without learning anything useful.

danger

Comparing perplexity across datasets: Perplexity measured on your training set tells you almost nothing. Always use a held-out test set from the same distribution. More importantly, perplexity on one dataset says nothing about model quality on a different dataset or task. A model with perplexity 20 on Wikipedia might be terrible at medical text.

warning

Using an MLM model for generation: BERT and its variants cannot be used for standard autoregressive generation. If you try to use a masked LM to generate text by iteratively predicting one [MASK] at a time, you will typically find the outputs are repetitive and incoherent. Use CLM models (GPT family, LLaMA, Mistral) for generation tasks.

tip

Tracking the right metric during training: Log both training loss and validation perplexity. Validation perplexity is your real signal. If training loss is decreasing but validation perplexity stagnates or rises, you are overfitting - rare in large-scale pretraining but common in fine-tuning on small datasets.

Interview Q&A

Q1: What is the difference between causal and masked language modeling, and when would you choose each?

Causal LM predicts each token from all previous tokens (left-to-right only, enforced by a causal attention mask). Masked LM predicts randomly masked tokens using bidirectional context. Choose CLM when building generation systems - chatbots, code assistants, summarizers - because the autoregressive property allows you to generate one token at a time. The most powerful LLMs today (GPT-4, Claude, LLaMA) all use CLM. Choose MLM when doing classification, NER, or extraction tasks where you need to understand the full context of an input sequence. In practice, with large enough instruction-tuned CLM models, you often do not need a dedicated MLM model anymore - the CLM model can perform classification tasks via prompting.

Q2: What is perplexity and what does a perplexity of 50 tell you?

Perplexity is the exponentiated average negative log-likelihood: $\text{PPL} = \exp(\text{avg loss})$. Intuitively, it tells you how many tokens the model was equally confused between at each step. A perplexity of 50 means the model was as uncertain as if it had to choose uniformly from 50 equally likely options at each token. Lower is better. On English text, a good LLM achieves perplexity in the range of 5–30 depending on the dataset. A model that guesses uniformly over a 50K vocabulary would have perplexity 50,000. Critical caveat: perplexity is only comparable across models using the same tokenizer on the same test set.

Q3: Why did BERT's NSP objective end up being abandoned?

Next Sentence Prediction (NSP) was added to BERT to help it learn inter-sentence relationships. The intuition was reasonable: understand whether two sentences are coherent together. However, RoBERTa (Liu et al., 2019) showed that removing NSP and training with longer sequences on more data actually improved performance on downstream tasks. The problem with NSP was that it was too easy - the negative examples (randomly sampled sentences from the corpus) were obviously unrelated in topic, so the model could solve NSP without learning anything about discourse structure. The model learned to detect topic change rather than coherence. Sentence Order Prediction (used in ALBERT) is a harder version that proved more useful.

Q4: Why is the label for non-masked positions set to -100 in MLM training?

PyTorch's F.cross_entropy has an ignore_index parameter (default -100). When a label equals ignore_index, the loss contribution from that position is set to zero and excluded from the mean. This means you only train on the masked positions - the model is not penalized for its predictions at positions where the original token was not selected for masking. This is critical because, with bidirectional attention, an unmasked token is visible in the input: training on those positions with the original tokens as labels would mostly reward copying the input, a trivial task that would dilute the signal from the masked positions.

Q5: What is the relationship between cross-entropy loss and perplexity?

They measure the same thing in different units. Cross-entropy loss in nats (using natural log) is the per-token average: $\mathcal{L} = -\frac{1}{T}\sum_t \log P(x_t | x_{<t})$. Perplexity is just $e^{\mathcal{L}}$ - the exponentiated loss. The advantage of perplexity over raw loss is that it has a more intuitive interpretation (branching factor of the model) and is independent of the base of the logarithm once you exponentiate. In practice, ML engineers usually monitor loss during training and report perplexity in papers for comparability.

Q6: How does label smoothing affect language model training?

Label smoothing replaces hard one-hot targets with soft targets: instead of probability 1.0 at the correct token and 0 at all others, use $1 - \varepsilon$ at the correct token and $\varepsilon / (V-1)$ at all others (typically $\varepsilon = 0.1$). This prevents the model from becoming overconfident and forces it to put at least some probability mass on all tokens. The practical effect is mild regularization: the model's probability distributions tend to be less peaked, and it is less likely to assign near-zero probability to unusual continuations. The effect on perplexity is not uniformly positive - the original Transformer paper reported that label smoothing hurt perplexity while improving accuracy and BLEU - and many large-scale LLM pretraining runs do not use it.

The Information-Theoretic View

To fully understand language modeling objectives, it helps to understand what these losses are measuring from an information theory perspective.

Entropy and cross-entropy: Shannon entropy $H(p) = -\sum_x p(x) \log p(x)$ measures the inherent randomness in a distribution $p$. For natural language, this represents how hard it is to predict text even with a perfect model. Cross-entropy $H(p, q) = -\sum_x p(x) \log q(x)$ measures how well a model $q$ (our language model) approximates the true distribution $p$ (natural language). The cross-entropy loss in LLM training is exactly this: measuring how well the model's predicted distribution $q_\theta$ approximates the empirical distribution of next tokens in the training corpus.

KL divergence: The gap between cross-entropy and entropy is the KL divergence: $H(p, q) = H(p) + \text{KL}(p \| q)$. Since the true language entropy $H(p)$ is fixed, minimizing cross-entropy is equivalent to minimizing $\text{KL}(p \| q)$ - making the model's distribution as close as possible to the true language distribution.
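The identity can be checked numerically on a small example. The two distributions below are arbitrary stand-ins for a "true" next-token distribution and a model's prediction.

```python
import math


def entropy(p):
    """H(p) = -sum p(x) log p(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) = sum p(x) log(p(x) / q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.7, 0.2, 0.1]  # stand-in "true" next-token distribution
q = [0.5, 0.3, 0.2]  # stand-in model distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(f"{lhs:.6f} {rhs:.6f}")
```

Since the KL term is never negative, cross-entropy is bounded below by the entropy of the data: no model can push the loss under $H(p)$, which is why pretraining loss curves flatten toward a floor rather than toward zero.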

Bits per byte: an alternative to perplexity is "bits per byte" (BPB), which normalizes by the number of bytes rather than tokens. This is tokenizer-independent, making it suitable for comparing models with different vocabularies. GPT-4 class models achieve around 0.9-1.0 bits per byte on English text.
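Converting a per-token loss into bits per byte is a short calculation. The sketch below assumes the loss is a mean in nats over `n_tokens` tokens and that bytes are counted on the UTF-8 encoding of the scored text; the loss values and token counts are hypothetical.

```python
import math


def bits_per_byte(mean_loss_nats, n_tokens, text):
    """Convert a mean per-token loss (nats) into bits per UTF-8 byte of the scored text."""
    total_bits = mean_loss_nats * n_tokens / math.log(2)  # nats -> bits
    n_bytes = len(text.encode("utf-8"))
    return total_bits / n_bytes


text = "Language models learn from raw text."

# Hypothetical numbers: two tokenizers split the same text into different token counts.
# If both models assign the text the same total probability, BPB agrees even though
# the per-token losses (and hence perplexities) differ.
bpb_a = bits_per_byte(mean_loss_nats=3.0, n_tokens=8, text=text)
bpb_b = bits_per_byte(mean_loss_nats=2.0, n_tokens=12, text=text)
print(f"{bpb_a:.3f} {bpb_b:.3f}")
```

Model A looks worse by perplexity (e^3 versus e^2) purely because its tokenizer produces fewer tokens; normalizing by bytes removes that artifact.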

Sequence-Level vs Token-Level Objectives

The standard language modeling objectives (CLM and MLM) are token-level: the loss is averaged over individual token predictions. This is efficient - every token contributes a training signal - but it means the model's training objective is not directly aligned with sequence-level quality.

Sequence-level quality matters for generation: a model trained on token-level cross-entropy is optimized to make each individual token prediction accurate, not to generate coherent paragraphs. This is one reason why perplexity and human evaluation of generation quality can diverge - a model with perplexity 15 might generate worse text than a model with perplexity 18, if the former model produces locally probable but globally incoherent text.

BLEU and ROUGE for sequence evaluation: BLEU (for translation) and ROUGE (for summarization) measure n-gram overlap between generated text and reference text. These are crude but widely used. They correlate reasonably well with token-level perplexity but capture different aspects of quality.

Why token-level training still works: despite the objective mismatch, token-level training works because good token-level predictions imply good sequence-level understanding. A model that consistently predicts the most likely next token is implicitly modeling long-range coherence - it must track context across many tokens to make each individual prediction well.

Practical: Pretraining Loss Curve Analysis

Understanding what a healthy pretraining loss curve looks like is essential for anyone managing large training runs.

"""
Analysis tools for pretraining loss curves.
Helps identify issues early in a training run.
"""

import numpy as np
from typing import List


def analyze_loss_curve(
    losses: List[float],
    window_size: int = 100,
    spike_threshold: float = 1.5,
) -> dict:
    """
    Analyze a loss curve for common issues:
    - Loss spikes (sudden increases)
    - Plateau (loss stopped decreasing)
    - Divergence (loss increasing overall)
    """
    losses = np.array(losses)
    n = len(losses)

    results = {
        "final_loss": losses[-1],
        "best_loss": losses.min(),
        "spikes_detected": [],
        "plateau_start": None,
        "diverging": False,
    }

    # Rolling average for baseline
    if n > window_size:
        rolling_avg = np.convolve(losses, np.ones(window_size) / window_size, mode='valid')

        # Detect spikes: current loss > spike_threshold * recent average
        for i in range(window_size, n):
            recent_avg = rolling_avg[i - window_size]
            if losses[i] > spike_threshold * recent_avg:
                results["spikes_detected"].append({
                    "step": i,
                    "loss": losses[i],
                    "baseline": recent_avg,
                    "ratio": losses[i] / recent_avg,
                })

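        # Detect plateau: mean loss improved by less than 1% across the last 20% of steps
        # (compare averaged halves rather than two single noisy points)
        last_20pct = max(2, int(0.2 * n))
        recent_losses = losses[-last_20pct:]
        half = last_20pct // 2
        first_half_mean = recent_losses[:half].mean()
        improvement = (first_half_mean - recent_losses[half:].mean()) / first_half_mean
        if improvement < 0.01:
            results["plateau_start"] = n - last_20pct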

        # Detect divergence: loss is higher than 80% of training history
        if losses[-1] > np.percentile(losses, 80):
            results["diverging"] = True

    return results


def estimate_remaining_improvement(
    losses: List[float],
    target_loss: float = None,
) -> dict:
    """
    Fit a power law to the loss curve and estimate remaining improvement.
    Loss often follows L(N) = C * N^(-alpha) where N is training steps.
    """
    losses = np.array(losses)
    steps = np.arange(1, len(losses) + 1)

    # Fit log-linear model: log(L) = log(C) - alpha * log(N)
    log_losses = np.log(losses)
    log_steps = np.log(steps)

    # Linear regression in log space
    coeffs = np.polyfit(log_steps, log_losses, 1)
    alpha = -coeffs[0]   # Scaling exponent
    log_C = coeffs[1]

    # Predict future loss
    def predict_loss(n_steps):
        return np.exp(log_C) * (n_steps ** (-alpha))

    current_step = len(losses)
    result = {
        "scaling_exponent": alpha,  # Higher = faster decay = better scaling
        "current_loss": losses[-1],
        "predicted_2x_steps": predict_loss(current_step * 2),
        "predicted_5x_steps": predict_loss(current_step * 5),
        "predicted_10x_steps": predict_loss(current_step * 10),
    }

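    if target_loss is not None and alpha > 0:
        # Invert L = C * N^(-alpha) to estimate the steps needed to reach the target.
        # Skipped when alpha <= 0: a flat or rising fit never reaches the target.
        needed_steps = (np.exp(log_C) / target_loss) ** (1 / alpha)
        result["steps_to_target"] = needed_steps
        result["remaining_steps"] = needed_steps - current_step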

    return result


# Example usage
if __name__ == "__main__":
    # Simulate a healthy loss curve
    np.random.seed(42)
    steps = np.arange(1, 1001)
    # Power law decay with noise
    base_losses = 10 * steps ** (-0.3)
    noisy_losses = base_losses + np.random.normal(0, 0.1 * base_losses)
    # Add a spike at step 500
    noisy_losses[500] *= 1.8

    analysis = analyze_loss_curve(noisy_losses.tolist())
    print(f"Final loss: {analysis['final_loss']:.4f}")
    print(f"Spikes detected: {len(analysis['spikes_detected'])} spikes")
    if analysis['spikes_detected']:
        spike = analysis['spikes_detected'][0]
        print(f"  Spike at step {spike['step']}: {spike['ratio']:.1f}x baseline")
    print(f"Plateau: {'Yes at step ' + str(analysis['plateau_start']) if analysis['plateau_start'] else 'No'}")

    projection = estimate_remaining_improvement(noisy_losses.tolist())
    print(f"\nScaling exponent: {projection['scaling_exponent']:.3f}")
    print(f"Current loss: {projection['current_loss']:.4f}")
    print(f"2x more steps → loss: {projection['predicted_2x_steps']:.4f}")
    print(f"10x more steps → loss: {projection['predicted_10x_steps']:.4f}")

Connecting Objectives to Model Behavior

The most practical payoff of studying language modeling objectives is seeing how the pretraining objective shapes downstream capabilities:

Causal LM → Generation + in-context learning

The CLM objective creates a model that is excellent at "what comes next?" in any context. This directly enables: text completion, creative writing, code generation, question answering by continuing a prompt that includes the question format, and in-context learning by continuing a prompt that includes examples. The entire GPT family's capabilities stem from solving CLM at scale.

Masked LM → Dense representations

The MLM objective creates a model where every token's representation is informed by the full surrounding context. This is ideal for tasks that require understanding the full sentence or passage simultaneously - classification, NER, extractive QA, semantic similarity. The model learns to encode "what does this token mean in this context?" for every token.

Why CLM won at scale

Despite MLM producing richer per-token representations, CLM became the dominant paradigm for frontier models. The reasons: (1) CLM computes a loss on essentially every token, while MLM computes it only on the roughly 15% that are masked, so each sequence yields about 6.7x more training signal; (2) CLM naturally enables generation - the most commercially valuable capability; (3) CLM has scaled well - the GPT line showed that scaling CLM produces emergent capabilities that were never explicitly trained; (4) instruction tuning and RLHF work naturally with CLM (you generate a response and evaluate it) but awkwardly with MLM (it cannot generate autoregressively).
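The first point can be checked by counting how many positions actually receive a label. This is a minimal sketch, assuming the common convention of marking unscored positions with -100 (the default `ignore_index` of PyTorch's cross-entropy loss). The helper names `clm_labels`, `mlm_labels`, and `supervised_fraction` are illustrative, and the MLM helper only selects positions; it does not apply BERT's 80/10/10 replacement rule.

```python
import random

IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss


def clm_labels(token_ids):
    """Causal LM: every token except the first is a prediction target."""
    return [IGNORE_INDEX] + list(token_ids[1:])


def mlm_labels(token_ids, mask_prob=0.15, seed=0):
    """Masked LM: only a random ~15% of positions are prediction targets."""
    rng = random.Random(seed)
    return [t if rng.random() < mask_prob else IGNORE_INDEX for t in token_ids]


def supervised_fraction(labels):
    """Fraction of positions that contribute to the loss."""
    return sum(label != IGNORE_INDEX for label in labels) / len(labels)


tokens = list(range(10_000))  # stand-in token IDs, not real text
clm_frac = supervised_fraction(clm_labels(tokens))
mlm_frac = supervised_fraction(mlm_labels(tokens))
print(f"CLM supervised fraction: {clm_frac:.4f}")
print(f"MLM supervised fraction: {mlm_frac:.4f}")
print(f"Ratio: {clm_frac / mlm_frac:.1f}x")
```

The ratio lands near 1 / 0.15, which is where the 6.7x figure comes from; the exact value varies with the random seed.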

The result: BERT-family models remain competitive for embedding tasks (retrieval, similarity), but for every generation and instruction-following task, CLM dominates.

Evaluating Language Models Beyond Perplexity

Perplexity directly reflects the training objective (it is the exponential of the cross-entropy loss), but it is a poor proxy for downstream quality. Understanding complementary evaluation metrics is essential for production work.

Bits Per Character (BPC) and Bits Per Byte (BPB)

Perplexity depends on the tokenizer - a model with a large vocabulary (50K tokens) will have different perplexity than one with a small vocabulary (32K tokens) on the same text, even if they have identical character-level predictive ability. Bits per character (BPC) normalizes by the number of characters rather than tokens:

$\text{BPC} = \frac{\log_2 \text{perplexity}}{\text{avg\_chars\_per\_token}}$

Bits per byte (BPB) is similar but normalizes by UTF-8 bytes instead of characters, which matters for multilingual text where a single character can occupy several bytes. State-of-the-art language models achieve BPB around 0.9–1.0 on clean English text. The theoretical entropy of English text is approximately 0.7–1.0 bits per character - meaning frontier LLMs are approaching the theoretical limit.
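As a quick worked example of the BPC formula, the sketch below compares two hypothetical models. The perplexities and characters-per-token figures are made-up numbers chosen for illustration, not measurements of any real model.

```python
import math


def bits_per_character(perplexity: float, avg_chars_per_token: float) -> float:
    """BPC = log2(perplexity) / average characters per token."""
    return math.log2(perplexity) / avg_chars_per_token


# Hypothetical models (illustrative numbers only).
model_a = {"perplexity": 20.0, "avg_chars_per_token": 4.0}  # shorter tokens
model_b = {"perplexity": 30.0, "avg_chars_per_token": 4.6}  # longer tokens

bpc_a = bits_per_character(**model_a)
bpc_b = bits_per_character(**model_b)

print(f"Model A: perplexity {model_a['perplexity']:.0f}, BPC {bpc_a:.3f}")
print(f"Model B: perplexity {model_b['perplexity']:.0f}, BPC {bpc_b:.3f}")
```

Model B has the higher perplexity but the lower BPC (about 1.067 vs 1.080), because each of its tokens covers more characters. Ranking the two by perplexity alone would pick the wrong model.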

import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def compute_bits_per_byte(
    model_name: str,
    text: str,
    device: str = "cuda",
) -> dict[str, float]:
    """Compute BPB and BPC alongside perplexity for fair cross-model comparison."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16
    ).to(device).eval()

    # The text must fit inside the model's context window.
    encoded = tokenizer(text, return_tensors="pt")
    input_ids = encoded.input_ids.to(device)

    num_tokens = input_ids.shape[1]
    num_chars = len(text)
    num_bytes = len(text.encode("utf-8"))

    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        # Mean negative log-likelihood (nats) over the predicted tokens
        nll_per_token = outputs.loss.item()

    # Labels are shifted inside the model, so the first token is never predicted
    # and the mean covers num_tokens - 1 positions, not num_tokens.
    num_predicted = num_tokens - 1

    # Total NLL in bits: convert from nats by multiplying by log2(e)
    total_nll_bits = nll_per_token * num_predicted * math.log2(math.e)

    perplexity = math.exp(nll_per_token)
    # The unscored first token's characters are still counted in the
    # denominator; the effect is negligible for long texts.
    bpc = total_nll_bits / num_chars
    bpb = total_nll_bits / num_bytes

    return {
        "perplexity": perplexity,
        "bits_per_character": bpc,
        "bits_per_byte": bpb,
        "num_tokens": num_tokens,
        "avg_chars_per_token": num_chars / num_tokens,
    }

Downstream Task Evaluation

For most practical purposes, you care about downstream task performance, not perplexity. Key benchmarks and what they measure:

| Benchmark | Measures | Size (examples) | Type |
|---|---|---|---|
| MMLU | Knowledge across 57 domains | 14K | Multiple choice |
| HellaSwag | Physical/temporal commonsense | 70K | Multiple choice |
| TruthfulQA | Factual accuracy, avoiding myths | 817 | Multiple choice |
| HumanEval | Code generation correctness | 164 | Code with tests |
| GSM8K | Grade school math reasoning | 8,500 | Open generation |
| MATH | Competition math | 12,500 | Open generation |
| MT-Bench | Instruction following quality | 80 | LLM judge 1-10 |

from datasets import load_dataset

def evaluate_on_mmlu(
    model_generate_fn,    # Callable: (prompt: str) -> str
    subjects: list[str] | None = None,
    num_shots: int = 5,
) -> dict[str, float]:
    """
    Evaluate a model on MMLU (Massive Multitask Language Understanding).
    Returns accuracy by subject and overall.
    """
    dataset = load_dataset("cais/mmlu", "all", split="test")
    if subjects:
        dataset = dataset.filter(lambda x: x["subject"] in subjects)

    choices = ["A", "B", "C", "D"]
    subject_results = {}
    dev_cache = {}  # subject -> few-shot examples, loaded once per subject

    for example in dataset:
        subject = example["subject"]

        # Build few-shot prompt
        prompt = f"The following are multiple choice questions about {subject.replace('_', ' ')}.\n\n"

        # Add few-shot examples (cached so the dev split is not reloaded per question)
        if subject not in dev_cache:
            dev_dataset = load_dataset("cais/mmlu", subject, split="dev")
            dev_cache[subject] = list(dev_dataset)[:num_shots]
        for dev_ex in dev_cache[subject]:
            prompt += f"Question: {dev_ex['question']}\n"
            for i, choice in enumerate(dev_ex["choices"]):
                prompt += f"{choices[i]}. {choice}\n"
            prompt += f"Answer: {choices[dev_ex['answer']]}\n\n"

        # Add test question
        prompt += f"Question: {example['question']}\n"
        for i, choice in enumerate(example["choices"]):
            prompt += f"{choices[i]}. {choice}\n"
        prompt += "Answer:"

        response = model_generate_fn(prompt)
        predicted = response.strip()[0].upper() if response.strip() else "X"
        correct = choices[example["answer"]]

        if subject not in subject_results:
            subject_results[subject] = {"correct": 0, "total": 0}
        subject_results[subject]["total"] += 1
        if predicted == correct:
            subject_results[subject]["correct"] += 1

    overall_correct = sum(r["correct"] for r in subject_results.values())
    overall_total = sum(r["total"] for r in subject_results.values())

    return {
        "overall_accuracy": overall_correct / overall_total,
        "by_subject": {
            subject: r["correct"] / r["total"]
            for subject, r in subject_results.items()
        },
        "num_questions": overall_total,
    }

Common Mistakes

:::danger Evaluating cross-entropy loss across different tokenizers

Cross-entropy (and perplexity) depends heavily on tokenization. A model using BPE with vocabulary size 32K will have different perplexity than a model using BPE with 128K vocabulary on the same text, even with identical English comprehension. Never directly compare perplexity numbers across models with different tokenizers. Instead, compare bits-per-byte (BPB) or bits-per-character (BPC), which normalize by the raw text rather than the token count.

:::

:::warning Computing perplexity on text with sequence length exceeding the model's context window

If your evaluation text is longer than the model's context window (e.g., 5,000 tokens on a GPT-2 model with a 1,024-token context), you cannot score it in a single forward pass - positions beyond 1,024 have no learned position embedding, so the call either fails or yields meaningless values. Use a sliding window approach: compute perplexity in overlapping chunks, scoring only the latter half of each chunk's tokens (the first half is "context" that helps the model but is not scored).

:::
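The sliding window approach can be sketched without committing to a particular model. In the sketch below, `window_nll_fn` is a hypothetical callable (not a library API) that takes one window of token IDs and returns the negative log-likelihood in nats of every token in the window except the first, so `len(window) - 1` values. With a real model you would wrap a forward pass in it. The defaults of 1,024 and 512 mirror the GPT-2 example above and are assumptions.

```python
import math
from typing import Callable, Sequence


def sliding_window_perplexity(
    token_ids: Sequence[int],
    window_nll_fn: Callable[[Sequence[int]], Sequence[float]],
    max_length: int = 1024,
    stride: int = 512,
) -> float:
    """Perplexity over a long sequence, scoring each token exactly once."""
    n = len(token_ids)
    if n < 2:
        raise ValueError("need at least two tokens to score anything")
    if not 0 < stride < max_length:
        raise ValueError("stride must satisfy 0 < stride < max_length")

    total_nll = 0.0
    total_scored = 0
    prev_end = 1  # token 0 has no left context, so it is never scored

    for begin in range(0, n, stride):
        end = min(begin + max_length, n)
        nlls = list(window_nll_fn(token_ids[begin:end]))  # len(window) - 1 values
        new_tokens = end - prev_end  # tokens not scored by an earlier window
        if new_tokens > 0:
            # Earlier tokens in the window serve only as context.
            scored = nlls[-new_tokens:]
            total_nll += sum(scored)
            total_scored += len(scored)
        prev_end = end
        if end == n:
            break

    return math.exp(total_nll / total_scored)


# Toy check: a "model" that is uniform over 50 tokens has perplexity 50.
def uniform_nll(window):
    return [math.log(50.0)] * (len(window) - 1)


print(sliding_window_perplexity(list(range(5000)), uniform_nll))
```

With `stride` set to half of `max_length`, every window after the first scores only its latter half, matching the description above. A smaller stride gives each scored token more context at the cost of more forward passes.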

:::warning Confusing per-token loss with per-sequence loss

The default model(input_ids, labels=input_ids).loss in HuggingFace returns the mean loss per token, averaged over every scored position (labels set to -100 are skipped). This is correct for comparing runs with the same sequence length but misleading when comparing across different sequence lengths. If you care about loss for a specific subset of tokens (e.g., only the response tokens in a conversation, not the system prompt), implement custom label masking. See Lesson 05 (Supervised Fine-Tuning) for the masked loss implementation.

:::
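To make the idea of label masking concrete, here is a minimal pure-Python sketch. It is not the Lesson 05 implementation. It assumes the logits are already aligned with the labels (row `t` of `logits` predicts `labels[t]`) and uses -100 as the ignore value; `mask_prompt` and `masked_mean_nll` are illustrative names.

```python
import math

IGNORE_INDEX = -100  # label value meaning "do not score this position"


def mask_prompt(labels, prompt_len):
    """Return a copy of labels with the first prompt_len positions ignored."""
    return [IGNORE_INDEX] * prompt_len + list(labels[prompt_len:])


def masked_mean_nll(logits, labels):
    """Mean cross-entropy (nats) over positions whose label is not ignored."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    if count == 0:
        raise ValueError("all positions are masked; nothing to score")
    return total / count


# Toy example: 3 positions, vocabulary of 2. Position 0 plays the "prompt".
logits = [[10.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
labels = [1, 0, 1]

full_loss = masked_mean_nll(logits, labels)
response_loss = masked_mean_nll(logits, mask_prompt(labels, prompt_len=1))
print(f"All tokens:    {full_loss:.4f}")
print(f"Response only: {response_loss:.4f}")
```

In this toy case the badly predicted prompt token dominates the unmasked average, while the response-only loss equals ln 2 (two equally likely tokens at each scored position).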

Key Takeaways

Language modeling objectives are the foundation of everything in this module. The cross-entropy loss between the model's predicted distribution and the true token - simple in definition, profound in consequence - drives all of pretraining. Whether applied to the 15% of masked tokens in BERT or the 100% of tokens in GPT, this loss is what transforms randomly initialized weights into a model that understands language.

The choice of objective - CLM vs MLM - determined the direction of the entire field. CLM's roughly 6.7x denser training signal per sequence and its natural generation capability made it the winner at scale, giving rise to GPT-3, ChatGPT, LLaMA, and every frontier model we use today. Understanding why this objective was chosen and what its properties are is not academic history - it is the foundation for understanding why instruction tuning, RLHF, and DPO work the way they do.

