Causal Language Modeling and GPT

The Night GPT-3 Surprised Everyone

It is the summer of 2020. OpenAI has quietly given API access to a small group of researchers and developers. One of them, a programmer named Arram Sabeti, sits down at his computer and types a prompt into the API: "Here is a bedtime story about a bear in the style of Ernest Hemingway."

What comes back is not a mangled sentence or a repetitive loop. It is a coherent, stylistically distinct short story - terse sentences, the Hemingway cadence, a bear that stands for something. Arram tweets about it. Then more beta users try it. They get GPT-3 to write legal briefs, debug code, explain quantum mechanics, generate poetry, translate languages - all without any fine-tuning. Just a prompt.

The entire ML community's reaction is some version of: "Wait, that was not supposed to work." Nobody had predicted that scaling a simple next-token prediction objective to 175 billion parameters would produce a single model that could do all these tasks from a prompt alone. There was no explicit training for legal brief writing, or code generation, or Hemingway imitation. The model just learned to continue text - and continuing text, at scale, turned out to subsume almost everything.

This is the story of causal language modeling: an objective so simple it can be stated in one sentence, yet so powerful that it underpins every frontier language model in existence today.

Why This Exists: The Limits of Supervised Task Training

The dominant NLP paradigm before self-supervised pretraining was task-specific training. You want a translation model? Train on parallel corpora. You want a summarizer? Train on document-summary pairs. You want a question answering system? Train on QA datasets.

This approach has two fundamental problems. First, each model is narrow - a translation model cannot summarize. Second, curated labeled data is expensive. Most knowledge in the world is not in the form of (input, output) pairs. It is in books, the web, scientific papers, code, conversations - raw text.

Causal language modeling reframes the problem: instead of curating specific tasks, train on everything and predict the next token. The model learns from the raw text itself. The objective is trivially scalable - any text becomes training data. And the learned representations generalize to a vast range of downstream tasks because the model had to implicitly solve many sub-tasks (grammar, facts, reasoning, style) in order to predict text well.

Historical Context: The GPT Lineage

GPT-1 (Radford et al., 2018) - The Proof of Concept

117 million parameters. Trained on BooksCorpus (7,000 unpublished books, ~800M words). Key contribution: showed that a pretrained transformer fine-tuned with a simple linear layer could match or exceed task-specific models on 9 out of 12 NLP tasks. The "pretraining then fine-tuning" paradigm worked.

GPT-2 (Radford et al., 2019) - The First Shock

1.5 billion parameters. Trained on WebText - 40GB of text scraped from outbound links in Reddit posts with at least 3 karma (a proxy for quality). Key contribution: demonstrated zero-shot transfer - without any fine-tuning, GPT-2 achieved state-of-the-art results on 7 of the 8 language modeling benchmarks it was tested on. OpenAI controversially staged the release, claiming the model was "too dangerous to release." Looking back, this seems overstated, but it signaled that the field was starting to take capability concerns seriously.

GPT-3 (Brown et al., 2020) - The Paradigm Shift

175 billion parameters. Trained on ~300 billion tokens (Common Crawl filtered, WebText2, Books1, Books2, Wikipedia). Key contribution: few-shot learning without gradient updates. By showing the model a few examples in the prompt, it could solve tasks it was never explicitly trained for. The paper coined the term "in-context learning."

GPT-4 (OpenAI, 2023) - The Capability Leap

Architecture details were not published (OpenAI stopped publishing technical details). It is widely rumored, but not confirmed, to be a mixture-of-experts model. Key contribution: multimodal input (text + images), significantly improved reasoning, and a reported score around the 90th percentile on a simulated bar exam.

LLaMA (Touvron et al., 2023) - Open Source Arrives

Released in sizes from 7 billion to 65 billion parameters. Key contribution: showed that a smaller model trained longer on more tokens can outperform larger models trained for fewer steps. This builds on the Chinchilla result (Hoffmann et al., 2022): for a fixed compute budget, train smaller models on more data.

The Core Mechanism: Autoregressive Prediction

The causal language modeling objective is deceptively simple: given tokens $x_1, x_2, \ldots, x_{t-1}$ , predict $x_t$ .

$\mathcal{L} = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1})$
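To make the loss concrete, here is a minimal sketch that evaluates it for one short sequence. The probabilities are made-up values standing in for what a model assigned to each token that actually occurred; `causal_lm_loss` is a hypothetical helper name, not part of any library.

```python
import math

def causal_lm_loss(token_probs):
    """Sum of negative log-probabilities assigned to each true token.

    token_probs[t] plays the role of P_theta(x_t | x_1, ..., x_{t-1}) for the
    token that actually occurred at position t (illustrative values only).
    """
    return -sum(math.log(p) for p in token_probs)

# Toy 4-token sequence: probability the model gave to each correct next token.
probs = [0.5, 0.25, 0.8, 0.1]

loss = causal_lm_loss(probs)
mean_loss = loss / len(probs)        # per-token loss, what training curves usually show
perplexity = math.exp(mean_loss)     # exp of the per-token average loss

print(f"total loss = {loss:.3f}, per-token = {mean_loss:.3f}, perplexity = {perplexity:.2f}")
```

The single low-probability token (0.1) contributes half of the total loss, which is why confident mistakes dominate training signal.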

The "causal" part refers to the attention mask. In the standard transformer, every token can attend to every other token. In a causal transformer (decoder-only), each token can only attend to itself and previous tokens - future tokens are masked.

The mask is a strictly upper-triangular matrix of negative infinity values (zeros on and below the diagonal) added to the attention scores before softmax:

$\text{Attention mask}_{ij} = \begin{cases} 0 & \text{if } i \geq j \\ -\infty & \text{if } i < j \end{cases}$

Adding $-\infty$ before softmax makes the softmax output 0 at those positions - effectively preventing attention to future tokens.

Each token attends to all previous tokens and itself, but NOT to future tokens. This is enforced by the mask - not by sequential computation. The entire sequence is processed in parallel during training. At inference time (generation), tokens are produced one at a time because each new token must be produced before it can be attended to.
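The masking rule can be checked on a tiny example. The sketch below builds the additive mask for a 4-token sequence and applies a row-wise softmax; the raw scores are hypothetical (all equal) so the effect of the mask is easy to see. Real implementations do this with tensors, but the arithmetic is the same.

```python
import math

def causal_mask(n):
    """n x n additive mask: 0 where i >= j, -inf where i < j (future positions)."""
    return [[0.0 if i >= j else float("-inf") for j in range(n)] for i in range(n)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

n = 4
scores = [[1.0] * n for _ in range(n)]         # hypothetical raw attention scores
mask = causal_mask(n)

weights = [
    softmax([s + m for s, m in zip(score_row, mask_row)])
    for score_row, mask_row in zip(scores, mask)
]

for row in weights:
    print([round(w, 2) for w in row])
```

Row `i` of the result spreads its weight over positions `0..i` only, and every row is computed independently, which is what allows training to process all positions in parallel.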

Temperature and Sampling Strategies

Once the model produces a probability distribution over the vocabulary, how do you choose the next token? This is not a trivial question - the choice significantly affects the quality and diversity of generated text.

Greedy Decoding

Always pick the highest probability token:

$x_t = \arg\max_{v} P_\theta(v \mid x_{<t})$

Fast and deterministic. Often produces repetitive, "safe" text. Picking the most probable token at each step does not necessarily produce the globally most probable sequence, because a slightly less likely token now can lead to much more likely tokens later.

Temperature Sampling

Scale the logits by temperature $T$ before computing softmax:

$P_T(x_t = v) = \frac{\exp(z_v / T)}{\sum_{v'} \exp(z_{v'} / T)}$

$T < 1$ : "sharper" distribution - the model becomes more confident, more likely to pick high-probability tokens. Text becomes more predictable and focused.
$T = 1$ : sample from the raw model distribution.
$T > 1$ : "flatter" distribution - the model becomes more random, more likely to pick unusual tokens. Text becomes more creative but also more likely to be incoherent.

Typical values: T=0.7 for focused generation, T=1.0 for normal sampling, T=1.2+ for creative/diverse outputs.
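A small sketch of the temperature formula on three hypothetical logits shows the sharpening and flattening directly. `softmax_with_temperature` is an illustrative name, and the logits are arbitrary.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]   # hypothetical scores for a 3-token vocabulary

for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

As T drops, the top token's share of the probability mass grows; as T rises, the three probabilities move toward uniform.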

Top-k Sampling

Restrict sampling to only the top $k$ most probable tokens, then sample from that restricted distribution.

At each step, the model might assign near-zero probability to 49,990 tokens out of 50,000. Top-k with k=50 focuses sampling on the plausible tokens and avoids accidentally sampling from the long tail of low-probability nonsense.

Problem with top-k: $k$ is fixed, but the shape of the distribution varies. Sometimes the model is very confident and the top-5 tokens capture 95% of probability - in this case, top-50 is too loose. Other times the model is very uncertain and the top-50 tokens capture only 30% of probability - in this case, top-50 is too restrictive.

Top-p (Nucleus) Sampling

Instead of fixing the number of tokens, fix the cumulative probability threshold $p$ . Include the smallest set of tokens whose cumulative probability exceeds $p$ .

If p=0.9, at each step collect tokens starting from the highest probability until their probabilities sum to 0.9, then sample from that set.

This adapts to the model's uncertainty: when the model is confident, the nucleus is small (few high-probability tokens). When uncertain, the nucleus is large.
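The adaptive behaviour is easy to see on two made-up distributions. This sketch uses a hypothetical `nucleus` helper that returns the indices kept by top-p; it treats the threshold as "reaches p" (`>=`), which is one common convention.

```python
def nucleus(probs, p=0.9):
    """Indices of the smallest set of tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return kept

confident = [0.92, 0.05, 0.01, 0.01, 0.01]   # model is nearly sure of token 0
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]   # mass spread across all tokens

print(len(nucleus(confident)), "token(s) kept when confident")
print(len(nucleus(uncertain)), "token(s) kept when uncertain")
```

With the same p=0.9, the confident distribution keeps a single token while the uncertain one keeps all five, whereas a fixed k would keep the same number in both cases.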

Holtzman et al. (2020) showed that top-p sampling produces more human-like text than top-k. Most production systems use top_p=0.9 or top_p=0.95 with temperature 0.7-1.0.

Beam Search

Maintain $B$ partial sequences (beams) and at each step expand each beam into the top- $B$ continuations, keeping only the $B$ highest-probability sequences overall.

Beam search approximates the most likely sequence: it finds higher-probability sequences than greedy decoding, but because it only tracks $B$ candidates it is not guaranteed to find the global optimum. It was dominant in neural machine translation. For open-ended generation, it produces repetitive text ("degenerate repetition") because it maximizes probability rather than diversity.
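A toy example makes the difference between greedy decoding and beam search concrete. The "model" below is a hand-written lookup table of next-token probabilities (purely hypothetical), and for simplicity the sketch expands every continuation of each beam rather than only the top few.

```python
import math

# Hypothetical next-token distributions, keyed by the prefix generated so far.
MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(beam_width, steps=2):
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(steps):
        candidates = [
            (prefix + (tok,), logp + math.log(p))
            for prefix, logp in beams
            for tok, p in MODEL[prefix].items()
        ]
        # Keep only the beam_width highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    best_prefix, best_logp = beams[0]
    return best_prefix, math.exp(best_logp)

greedy_seq, greedy_prob = beam_search(beam_width=1)   # width 1 is greedy decoding
beam_seq, beam_prob = beam_search(beam_width=2)

print("greedy:", greedy_seq, round(greedy_prob, 2))
print("beam-2:", beam_seq, round(beam_prob, 2))
```

Greedy commits to "A" because it is the likeliest first token, then cannot recover; a beam of width 2 keeps "B" alive long enough to discover the higher-probability sequence.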

The In-Context Learning Surprise

When GPT-3 was released, something unexpected happened: the model could solve tasks it had never been fine-tuned for, simply by showing it examples in the prompt.

Zero-shot: just describe the task.

Translate English to French:
sea otter =>

Few-shot: show a few examples, then your input.

Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe =>

GPT-3 was not explicitly trained on French-English translation as a task. It was trained to predict the next token. But in order to predict tokens well, it had to learn patterns - including translation patterns - from the training corpus. The in-context examples "prime" the model to continue in the right pattern.
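Few-shot prompts like the one above are usually assembled programmatically. A minimal sketch, assuming the `=>` separator from the example; `build_few_shot_prompt` is a hypothetical helper, not a library function.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Format an instruction, worked examples, and an open query as one prompt."""
    lines = [instruction]
    lines += [f"{source} => {target}" for source, target in examples]
    lines.append(f"{query} =>")   # left open for the model to complete
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "plush giraffe",
)
print(prompt)
```

Passing an empty list of examples yields the zero-shot form of the same prompt.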

This was genuinely surprising. The standard assumption was that tasks required fine-tuning. GPT-3 showed that at sufficient scale, the pretraining objective itself creates a general-purpose task performer.

The theoretical understanding is still incomplete. One prominent hypothesis is that transformers at scale are doing some form of implicit in-context gradient descent - reading the examples in the prompt and updating their "effective weights" within the forward pass. This is an active research area, and the hypothesis is not settled.

KV Cache: Efficient Autoregressive Inference

During autoregressive generation, the model generates one token at a time. For each new token, it needs to attend to all previous tokens. Without caching, this means recomputing the key and value projections for all previous tokens at every step - $O(T^2)$ total computation.

The key-value (KV) cache avoids this. After computing keys and values for token $t$ , store them. When generating token $t+1$ , only compute keys and values for the new token and append to the cache. The attention computation for the new token attends to the cached keys and values.

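Memory cost: the KV cache for a full sequence grows to 2 * num_layers * seq_len * num_heads * head_dim * bytes_per_element, where the leading 2 counts keys and values. For a 7B model with 32 layers, 32 heads, head_dim=128, and a 4096-token sequence in FP16 (2 bytes): 2 * 32 * 4096 * 32 * 128 * 2 = 2^31 bytes, or about 2GB. For a 70B-class model with 80 layers and 64 full key/value heads at 32K tokens, the same formula gives roughly 86GB for a single sequence. Models that use grouped-query attention store fewer key/value heads and shrink this proportionally.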
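The formula is simple enough to evaluate directly. The sketch below wraps it in a hypothetical `kv_cache_bytes` helper; the `num_kv_heads` and `batch_size` parameters are additions for illustration, so that grouped-query attention and batching can be plugged in.

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim,
                   bytes_per_element=2, batch_size=1):
    """KV cache size in bytes; the leading 2 accounts for keys and values."""
    return (2 * num_layers * seq_len * num_kv_heads * head_dim
            * bytes_per_element * batch_size)

GIB = 1024 ** 3

# 7B-style configuration from the text: 32 layers, 32 heads, head_dim 128, FP16.
mha_7b = kv_cache_bytes(num_layers=32, seq_len=4096, num_kv_heads=32, head_dim=128)

# Same model if only 8 key/value heads were stored (a grouped-query assumption).
gqa_7b = kv_cache_bytes(num_layers=32, seq_len=4096, num_kv_heads=8, head_dim=128)

print(f"full multi-head: {mha_7b / GIB:.2f} GiB per sequence")
print(f"8 KV heads:      {gqa_7b / GIB:.2f} GiB per sequence")
```

Because every factor is multiplicative, doubling the context length or the batch size doubles the cache, which is the constraint that production serving systems are built around.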

Code: Autoregressive Generation with KV Cache

"""
Autoregressive text generation with:
1. Manual sampling implementation (greedy, top-k, top-p)
2. KV cache usage in HuggingFace
3. A manual generation loop that passes the KV cache explicitly
"""

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import Optional


# ---- Sampling functions ----

def greedy_decode(logits: torch.Tensor) -> int:
    """Pick the highest probability token."""
    return logits.argmax(-1).item()


def temperature_sample(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Scale logits by temperature and sample."""
    if temperature == 0:
        return greedy_decode(logits)
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()


def top_k_sample(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> int:
    """Sample from the top-k most probable tokens."""
    if temperature != 1.0:
        logits = logits / temperature

    # Set all tokens outside top-k to -infinity
    k = min(k, logits.size(-1))  # torch.topk raises if k exceeds the vocabulary size
    top_k_logits, top_k_indices = torch.topk(logits, k)
    filtered_logits = torch.full_like(logits, float('-inf'))
    filtered_logits.scatter_(-1, top_k_indices, top_k_logits)

    probs = F.softmax(filtered_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()


def top_p_sample(
    logits: torch.Tensor,
    p: float = 0.9,
    temperature: float = 1.0
) -> int:
    """
    Nucleus sampling: sample from the smallest set of tokens
    whose cumulative probability exceeds p.
    """
    if temperature != 1.0:
        logits = logits / temperature

    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    sorted_probs = F.softmax(sorted_logits, dim=-1)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # Drop a token only if the tokens ranked above it already hold more than p.
    # Subtracting the token's own probability keeps the token that pushes the
    # cumulative probability over the threshold.
    sorted_indices_to_remove = (cumulative_probs - sorted_probs) > p
    sorted_logits = sorted_logits.masked_fill(sorted_indices_to_remove, float('-inf'))

    # Restore original vocabulary ordering
    logits_filtered = torch.full_like(logits, float('-inf'))
    logits_filtered.scatter_(-1, sorted_indices, sorted_logits)

    probs = F.softmax(logits_filtered, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()


# ---- Generation with HuggingFace (recommended for production) ----

def generate_text(
    prompt: str,
    model_name: str = "gpt2",
    max_new_tokens: int = 100,
    temperature: float = 0.8,
    top_p: float = 0.9,
    top_k: int = 50,
    do_sample: bool = True,
):
    """
    Generate text using HuggingFace with KV cache (enabled by default).
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"]

    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            do_sample=do_sample,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            pad_token_id=tokenizer.eos_token_id,
            use_cache=True,              # KV cache - always True in production
            repetition_penalty=1.1,      # Slightly penalize repeated tokens
        )

    # Decode only the new tokens (not the prompt)
    new_tokens = output_ids[0][input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# ---- Manual autoregressive loop showing KV cache explicitly ----

def manual_generate_with_kv_cache(
    model,
    tokenizer,
    prompt: str,
    max_new_tokens: int = 50,
    temperature: float = 1.0,
):
    """
    Explicit autoregressive loop to show how KV cache works.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"]

    past_key_values = None  # KV cache starts empty
    generated_ids = input_ids

    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            if past_key_values is None:
                # First step: process entire prompt
                outputs = model(
                    input_ids=input_ids,
                    use_cache=True,
                )
            else:
                # Subsequent steps: only process the last generated token
                # KV cache provides context for all previous tokens
                outputs = model(
                    input_ids=generated_ids[:, -1:],  # Only last token
                    past_key_values=past_key_values,
                    use_cache=True,
                )

            # Cache stores K, V for all processed tokens
            past_key_values = outputs.past_key_values

            # Sample next token from the last position's logits
            next_token_logits = outputs.logits[:, -1, :]
            next_token_id = temperature_sample(
                next_token_logits[0],
                temperature=temperature
            )

            # Stop if EOS
            if next_token_id == tokenizer.eos_token_id:
                break

            # Append to sequence
            next_token_tensor = torch.tensor([[next_token_id]])
            generated_ids = torch.cat([generated_ids, next_token_tensor], dim=1)

    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)


# Demo
if __name__ == "__main__":
    prompt = "The transformer architecture revolutionized NLP because"
    result = generate_text(
        prompt=prompt,
        model_name="gpt2",
        max_new_tokens=80,
        temperature=0.8,
        top_p=0.95,
    )
    print(f"Prompt: {prompt}")
    print(f"Generated: {result}")

Production Engineering Notes

Batched Inference and KV Cache Memory

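The KV cache grows linearly with both sequence length and batch size. For a 70B model (80 layers, 64 heads, head_dim=128) serving batch size 32 with 4096-token context in FP16 and full multi-head attention: 2 * 80 * 4096 * 64 * 128 * 2 * 32 bytes = ~344GB just for the KV cache (about 10.7GB per request). Grouped-query attention with 8 key/value heads, as used in Llama-2-70B, cuts this by 8x to roughly 43GB. This is why serving large models requires careful batch size management and often requires quantizing the KV cache to INT8 or INT4.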

Speculative Decoding

At inference, the KV cache makes each generation step fast, but you still need to run the full model for each token. Speculative decoding (Leviathan et al., 2023) uses a small "draft" model to propose multiple tokens at once, then verifies them in a single forward pass of the large model. Speedups of 2-3x with identical output quality. Supported natively in HuggingFace generate().

Flash Attention for Long Context

Standard attention computes a (seq_len, seq_len) attention matrix, requiring $O(n^2)$ memory. For a 32K token context, this is 32000^2 * 2 bytes = ~2GB per attention head for each sequence. Flash Attention 2 (Dao, 2023) computes attention in tiles, never materializing the full matrix - reducing attention memory to $O(n)$ and typically achieving 2-4x speedup on modern hardware.
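The trick that makes tiling possible is an "online" softmax: keep a running maximum and running denominator, and rescale earlier partial results whenever a new tile raises the maximum. The sketch below shows this for a single query row with scalar values. It is an illustration of the idea only, not the actual Flash Attention kernel, and all names and numbers are hypothetical.

```python
import math

def attention_row_full(scores, values):
    """Reference: softmax over all scores at once, then weighted sum of values."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def attention_row_tiled(scores, values, tile_size=2):
    """Same result, visiting the scores one tile at a time."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax denominator, relative to running_max
    running_out = 0.0   # unnormalized weighted sum of values
    for start in range(0, len(scores), tile_size):
        s_tile = scores[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        new_max = max(running_max, max(s_tile))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum

scores = [0.5, 2.0, -1.0, 3.0, 0.0]   # one query's scores against 5 keys
values = [1.0, 2.0, 3.0, 4.0, 5.0]    # scalar stand-ins for value vectors

full = attention_row_full(scores, values)
tiled = attention_row_tiled(scores, values)
print(full, tiled)
```

Only one tile of scores is held at a time, so memory for this row stays constant in the tile size instead of growing with the sequence length.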

note

The Chinchilla Scaling Laws

Hoffmann et al. (2022) showed that many large models were trained with too little data relative to their size. The optimal ratio is roughly 20 training tokens per model parameter. A 7B model should train on ~140B tokens. This led to models like LLaMA - smaller models trained much longer that outperform larger but undertrained models. This is now the standard approach.

Common Mistakes

danger

Using deterministic decoding (greedy or beam search) for open-ended generation

Greedy decoding maximizes token-level probability but does not produce the globally most probable sequence. More importantly, for creative or diverse generation tasks, greedy output is repetitive and stilted. Use top-p sampling with p=0.9 and temperature around 0.7-0.9 for most generation tasks. Only use greedy/beam search for tasks with a single correct answer (translation, constrained generation).

danger

Not setting a repetition penalty for long generation

Without a repetition penalty, causal LMs often fall into repetition loops - especially GPT-2 and smaller models. The model enters a high-probability loop ("The cat sat on the mat. The cat sat on the mat..."). Set repetition_penalty=1.1 to 1.3 for long-form generation. The no_repeat_ngram_size parameter prevents any n-gram from repeating, which is more aggressive but effective.
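A repetition penalty is just a logit adjustment applied before sampling. The sketch below uses the common CTRL-style rule (divide positive logits by the penalty, multiply negative ones), which is assumed here to be close to what `repetition_penalty` does in HuggingFace; `apply_repetition_penalty` is a hypothetical name and the logits are made up.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the score of every token id that already appears in the output."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty   # shrink positive scores
        else:
            adjusted[token_id] *= penalty   # push negative scores further down
    return adjusted

logits = [2.0, -1.0, 0.5]       # hypothetical scores for a 3-token vocabulary
already_generated = [0, 1, 0]   # tokens 0 and 1 were produced earlier

print(apply_repetition_penalty(logits, already_generated))
```

Handling the sign separately matters: dividing a negative logit by the penalty would raise it and make the repeated token more likely.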

warning

Prompt format inconsistencies between training and inference

Instruction-tuned models (Llama-Instruct, ChatGPT) are trained with specific prompt templates. Using a different format at inference time significantly degrades quality. Always use the exact prompt template the model was fine-tuned with. For Llama-2-Chat, this is the [INST] ... [/INST] format. For ChatML models, this is <|im_start|>user ... <|im_end|>.

tip

Streaming generation for better user experience

For production applications, use streaming generation - yield tokens as they are generated rather than waiting for the full response. HuggingFace TextIteratorStreamer enables this. Users perceive streaming as faster even if total time is identical, and it allows users to stop generation early.

Interview Q&A

Q1: What is the causal attention mask and why is it necessary?

The causal attention mask is an upper-triangular matrix of negative infinity values added to the raw attention scores before the softmax operation. This forces the softmax to output zero weight for any future position - each token can only attend to itself and previous tokens. It is necessary because during training, we compute predictions for all positions in parallel for efficiency. Without the mask, position 5 would attend to position 6 and could trivially predict position 6's token (it already has it). The mask enforces the autoregressive property: each prediction must be made using only the information available at that point in time.

Q2: What is in-context learning and how does it work?

In-context learning (Brown et al., 2020) is the ability of large language models to perform tasks by conditioning on a few examples in the prompt, without any gradient updates. Show the model "English: cat, French: chat. English: dog, French: chien. English: book, French:" and it will complete the pattern. The mechanism is not fully understood, but leading theories suggest that transformers at scale are effectively performing a form of implicit gradient descent in the forward pass - reading the in-context examples and adapting their computation to the demonstrated pattern. Key finding from GPT-3: few-shot in-context learning scales with model size - larger models are dramatically better at it.

Q3: What are the trade-offs between top-k and top-p sampling?

Top-k fixes the number of candidate tokens (e.g., always sample from the 50 most probable tokens). This is simple but ignores the shape of the distribution - when the model is very confident, top-50 includes many implausible tokens; when the model is very uncertain, top-50 might miss probability mass. Top-p (nucleus) sampling fixes the cumulative probability threshold (e.g., sample from tokens that sum to 90% probability), which adapts to distribution shape. In practice, top-p produces more coherent text than top-k. Most production systems use top-p sampling with p=0.9 to 0.95 combined with temperature 0.7 to 1.0.

Q4: Why did the GPT paradigm (causal LM) win over BERT (masked LM) for large-scale models?

Three reasons. First, causal LM trains on every single token - loss is computed on 100% of tokens at every step. Masked LM only trains on the 15% of masked tokens, making it less sample-efficient. Second, causal LM enables autoregressive generation natively - you can generate text token by token. Masked LM cannot do this naturally. Third, in-context learning and instruction following emerged naturally from causal LM at scale, making it a general-purpose interface. Masked LM models required task-specific fine-tuning for each new task. That said, masked LM still dominates for embedding tasks where you need to encode a fixed-length document into a dense vector.

Q5: What is the KV cache and what are its memory implications at scale?

The KV cache stores the key and value projections computed for all previous tokens. Without it, generating each new token requires recomputing keys and values for all previous tokens - $O(n^2)$ total computation. With the cache, each step only computes keys and values for the single new token. Memory: 2 * num_layers * seq_len * (num_heads * head_dim) * bytes_per_element. For a 70B model with 80 layers, 64 heads, head_dim 128, and FP16 at a 4096-token context, full multi-head attention needs approximately 10.7GB per request, or about 344GB at batch size 32. Llama-2-70B uses grouped-query attention with 8 key/value heads, which reduces this to roughly 1.3GB per request and 43GB at batch size 32. This is why grouped-query attention, KV cache quantization (storing INT8 instead of FP16), and paged attention (vLLM) are important for production serving.

Modern Inference Optimizations

Understanding how CLM inference works in production requires knowing the gap between naive autoregressive decoding and the highly optimized systems that power GPT-4, Claude, and Llama deployments.

Speculative Decoding

Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) dramatically speeds up inference by using a small draft model to speculatively generate multiple tokens, then verifying them in parallel with the large target model.

The key insight: for most sequences, several consecutive tokens are "obvious" - the large model would have generated them too. Verifying multiple tokens in a single forward pass is much cheaper than generating them one by one.

Naive decoding:
  Large model: "The" → "capital" → "of" → "France" → "is" → "Paris"
  6 serial forward passes

Speculative decoding:
  Draft model (small, fast): speculatively generates "capital of France is Paris"
  Large model: verifies all 5 tokens in ONE forward pass
  → 2–3x wall-clock speedup, with an output distribution provably identical to the large model's
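The claim that output quality is unchanged rests on the accept/reject rule: accept a drafted token with probability min(1, p_target / p_draft), and on rejection resample from the normalized positive part of (p_target - p_draft). The sketch below computes the exact distribution of the token this rule emits for one position, using two made-up distributions, and assumes the draft assigns non-zero probability to every token.

```python
def speculative_step_distribution(target, draft):
    """Exact distribution of the token emitted by one accept/reject step."""
    # Probability of drafting token i AND accepting it: q_i * min(1, p_i / q_i).
    accept = [q * min(1.0, p / q) for p, q in zip(target, draft)]
    reject_mass = 1.0 - sum(accept)
    # On rejection, resample from max(p - q, 0), renormalized.
    residual = [max(p - q, 0.0) for p, q in zip(target, draft)]
    norm = sum(residual)
    if norm == 0.0:   # draft equals target, so nothing is ever rejected
        return accept
    return [a + reject_mass * r / norm for a, r in zip(accept, residual)]

target = [0.6, 0.3, 0.1]   # hypothetical large-model distribution
draft = [0.3, 0.5, 0.2]    # hypothetical draft-model distribution

emitted = speculative_step_distribution(target, draft)
print([round(x, 6) for x in emitted])
```

The emitted distribution matches `target` even though the draft disagrees substantially; a poor draft only lowers the acceptance rate (here 70%), and therefore the speedup, not the quality.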

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def speculative_decode(
    prompt: str,
    draft_model_name: str = "facebook/opt-125m",
    target_model_name: str = "facebook/opt-6.7b",
    max_new_tokens: int = 200,
    num_speculative: int = 5,   # tokens to draft before verification
) -> str:
    """Speculative decoding: draft with small model, verify with large."""
    tokenizer = AutoTokenizer.from_pretrained(target_model_name)
    draft_model = AutoModelForCausalLM.from_pretrained(
        draft_model_name, torch_dtype=torch.float16
    ).cuda()
    target_model = AutoModelForCausalLM.from_pretrained(
        target_model_name, torch_dtype=torch.float16
    ).cuda()

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    generated = input_ids.clone()

    with torch.no_grad():
        for _ in range(max_new_tokens // num_speculative):
            prefix_len = generated.shape[1]

            # Step 1: Draft model SAMPLES K speculative tokens. The acceptance
            # test below assumes tokens were drawn from the draft distribution,
            # so greedy drafting would not preserve the target distribution.
            # top_k=0 disables HF's default top-k filter.
            draft_output = draft_model.generate(
                generated,
                max_new_tokens=num_speculative,
                do_sample=True,
                top_k=0,
            )
            draft_tokens = draft_output[:, prefix_len:]

            # Step 2: Score the candidate with each model in ONE forward pass.
            # Logits at position i are the prediction for token i + 1.
            candidate = torch.cat([generated, draft_tokens], dim=1)
            target_logits = target_model(candidate).logits[0].float()
            draft_logits = draft_model(candidate).logits[0].float()

            # Step 3: Rejection sampling - accept tokens where target agrees
            accepted = []
            rejected = False
            # The draft may stop early at EOS, so use its actual length.
            for k in range(draft_tokens.shape[1]):
                draft_token = draft_tokens[0, k].item()
                target_probs = torch.softmax(target_logits[prefix_len + k - 1], dim=-1)
                draft_probs = torch.softmax(draft_logits[prefix_len + k - 1], dim=-1)
                # Accept with probability min(1, p_target / p_draft)
                accept_prob = min(1.0, (target_probs[draft_token] / (draft_probs[draft_token] + 1e-8)).item())
                if torch.rand(1).item() < accept_prob:
                    accepted.append(draft_token)
                else:
                    # Reject - sample from corrected distribution and stop
                    corrected = (target_probs - draft_probs).clamp(min=0)
                    corrected /= corrected.sum()
                    accepted.append(torch.multinomial(corrected, 1).item())
                    rejected = True
                    break

            if not rejected:
                # Every draft token was accepted: the target's logits at the
                # final position give one extra token for free.
                bonus_probs = torch.softmax(target_logits[-1], dim=-1)
                accepted.append(torch.multinomial(bonus_probs, 1).item())

            generated = torch.cat([
                generated,
                torch.tensor(accepted, device=generated.device).unsqueeze(0)
            ], dim=1)

            if tokenizer.eos_token_id in accepted:
                break

    return tokenizer.decode(generated[0], skip_special_tokens=True)

In practice, use HuggingFace's built-in speculative decoding:

# Much simpler - HF handles all the sampling logic
output = target_model.generate(
    input_ids,
    assistant_model=draft_model,   # Speculative decoding built-in
    max_new_tokens=200,
)

Continuous Batching and PagedAttention

The fundamental inefficiency in naive batch inference: you must reserve memory for the maximum possible KV cache at request start, even though most requests end early. This leads to GPU memory fragmentation - you cannot fit as many concurrent requests as the GPU's VRAM would theoretically allow.

PagedAttention (Kwon et al., 2023, vLLM) borrows the virtual memory concept from OS design. The KV cache is stored in non-contiguous physical blocks (like OS memory pages). A logical-to-physical block table maps each sequence's logical KV positions to physical GPU memory blocks. Blocks are allocated on demand and freed when the sequence finishes.
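The block-table idea can be sketched without any GPU code. The toy allocator below tracks which physical blocks each sequence owns and translates a logical token position into a (block, offset) slot. It is an illustration of the bookkeeping only; the class and method names are hypothetical and do not correspond to vLLM's internals.

```python
class PagedKVCache:
    """Toy block allocator: maps logical token positions to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.lengths = {}        # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:   # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))   # allocate on demand
        self.lengths[seq_id] = length + 1

    def physical_slot(self, seq_id, position):
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.lengths[seq_id]

cache = PagedKVCache(num_blocks=4, block_size=2)
for seq_id in ["a", "a", "b", "a"]:   # interleaved decoding of two requests
    cache.append_token(seq_id)

print(cache.block_tables)             # sequence "a" owns non-contiguous blocks
print(cache.physical_slot("a", 2))    # third token of "a"
cache.free("a")
print(cache.free_blocks)
```

Sequence "a" ends up in blocks 0 and 2 with "b" in between, and no memory is reserved for tokens that have not been generated yet.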

Result: near-zero KV cache waste, 2–4x higher throughput, and the ability to share KV cache blocks across requests with identical prefixes (prefix caching - useful when all requests share a long system prompt).

# Production CLM serving - use vLLM, not naive HuggingFace generate()
from vllm import LLM, SamplingParams

# vLLM handles PagedAttention, continuous batching, CUDA graphs automatically
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=4,       # Split across 4 GPUs
    gpu_memory_utilization=0.90,  # Fraction of each GPU's memory vLLM may use (weights, activations, KV cache)
    max_model_len=4096,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

prompts = [
    "Explain quantum entanglement in simple terms.",
    "Write a Python function that sorts a list of dictionaries by a key.",
    "What were the main causes of World War I?",
]

# Continuous batching - processes all prompts efficiently
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text[:200])

Context Length Scaling

Original GPT models used learned absolute position embeddings - fixed at training time. This limited context to 512 (GPT-1), 1024 (GPT-2), or 2048 (GPT-3) tokens. Modern models use:

Rotary Position Embeddings (RoPE) (Su et al., 2021): encode position as a rotation applied to query/key vectors. The rotation angle depends on both the absolute position and the index of the dimension pair within the attention head, so each pair rotates at its own frequency. Crucially, the attention score between positions $m$ and $n$ depends only on the relative position $m - n$ , which emerges naturally from the rotation algebra. This makes RoPE inherently relative and, combined with position interpolation tricks, allows a model to be extended to longer contexts than it saw during training.
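The relative-position property can be verified numerically on a single 2-D pair, which is the building block RoPE applies across the head dimension. This is a minimal sketch with an arbitrary frequency `theta` and made-up vectors, not a full RoPE implementation.

```python
import math

def rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by angle position * theta (one RoPE frequency pair)."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (x * c - y * s, x * s + y * c)

def score(q, k, m, n):
    """Dot product of q rotated to position m with k rotated to position n."""
    qm, kn = rotate(q, m), rotate(k, n)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k = (1.0, 2.0), (0.5, -1.0)       # hypothetical query and key

near = score(q, k, m=5, n=2)         # offset m - n = 3 near the start
far = score(q, k, m=105, n=102)      # same offset, 100 positions later
other = score(q, k, m=5, n=3)        # different offset

print(near, far, other)
```

The first two scores agree up to floating-point error because both pairs are 3 positions apart, while the third differs because the offset changed.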

YaRN (Peng et al., 2023): rescale RoPE's rotation frequencies, with different frequency bands scaled by different amounts, to extend a trained context window to 128K+ tokens without full retraining - just a small fine-tuning run on long-context data. Llama 3.1 (128K context) and Mistral (32K context) also rely on RoPE with adjusted frequencies, though their exact scaling recipes differ from YaRN.

# Checking whether your model supports long context
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(f"Max position embeddings: {config.max_position_embeddings}")  # 8192
print(f"RoPE base: {config.rope_theta}")  # 500000.0 (Llama-3 uses high base)
# Higher rope_theta = longer rotation wavelengths, so distant positions stay distinguishable at long context

Decoding Strategy Deep Dive

The choice of decoding strategy has a larger impact on output quality than most engineers realize. Here is a systematic comparison:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "The most important breakthrough in AI research was"
inputs = tokenizer(prompt, return_tensors="pt")

def decode_with_strategy(strategy_name: str, **kwargs) -> str:
    """Generate a continuation of the prompt; strategy_name is only a label."""
    with torch.no_grad():
        output = model.generate(
            **inputs,  # passes input_ids and attention_mask together
            max_new_tokens=50,
            pad_token_id=tokenizer.eos_token_id,
            **kwargs,
        )
    new_tokens = output[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

torch.manual_seed(0)  # make the sampling runs reproducible

# generate() applies top_k=50 by default when sampling; top_k=0 disables it
# so the temperature-only and top-p-only runs are not silently combined with top-k.
results = {
    "greedy": decode_with_strategy("greedy", do_sample=False),
    "temp_0.7": decode_with_strategy("temp_0.7", do_sample=True, temperature=0.7, top_k=0),
    "temp_1.5": decode_with_strategy("temp_1.5", do_sample=True, temperature=1.5, top_k=0),
    "top_k_50": decode_with_strategy("top_k_50", do_sample=True, top_k=50),
    "top_p_0.9": decode_with_strategy("top_p_0.9", do_sample=True, top_p=0.9, top_k=0),
    "top_p+temp": decode_with_strategy(
        "top_p+temp", do_sample=True, top_p=0.9, temperature=0.7, top_k=0
    ),
    "beam_search": decode_with_strategy(
        "beam_search", do_sample=False, num_beams=5,
        no_repeat_ngram_size=3, early_stopping=True
    ),
}

for strategy, text in results.items():
    print(f"\n[{strategy}] {text[:100]}")

When to use each strategy:

Strategy	Use Case	Trade-off
Greedy	Factual Q&A, code completion	Fast, deterministic, repetitive
Temperature 0.3–0.5	Summaries, classification	Near-deterministic, slightly varied
Temperature 0.7 + top-p 0.9	Chat, creative writing	Good default for most applications
Temperature 1.2+	Creative ideation, poetry	Very diverse, sometimes incoherent
Beam search (n=4–5)	Machine translation, structured output	High quality, slow, "safe" outputs
Beam search + n-gram block	Summarization	Prevents repetition artifacts
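The top-k and top-p rows above are easier to reason about once the filtering step is written out. The following is a minimal sketch over a hand-made probability vector, not the implementation used by any particular library; `top_k_filter`, `top_p_filter`, and the example probabilities are hypothetical and chosen only for illustration.

```python
def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Toy next-token distribution over a 5-token vocabulary (illustrative values).
probs = [0.5, 0.3, 0.1, 0.06, 0.04]

top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p=0.85)

print("top-k (k=2) keeps token ids:", sorted(top2))
print("top-p (p=0.85) keeps token ids:", sorted(nucleus))
```

Top-k always keeps a fixed number of candidates, while top-p adapts the candidate set to how peaked the distribution is, which is one reason top-p is often preferred for open-ended generation.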

tip

The practical default for production chat applications

Use temperature=0.7, top_p=0.9, do_sample=True and do NOT use beam search. Beam search is slower, and for chat applications the outputs sound slightly unnatural - humans don't generate "optimal" responses, they generate natural ones. Reserve beam search for structured output generation (translation, code) where you want the most probable coherent sequence.

Production Serving Architecture

A complete CLM inference endpoint handles more than just model.generate(). Here is what a production system looks like:

import uuid

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from transformers import AutoTokenizer
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()

# Initialize vLLM async engine - handles continuous batching automatically
engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.85,
    max_model_len=8192,
    enable_prefix_caching=True,  # Cache KV for repeated system prompts
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7
    top_p: float = 0.9
    stream: bool = True

def format_llama3_prompt(user_message: str, system_prompt: str = "") -> str:
    """Format prompt in Llama-3 chat template (each header is followed by a blank line)."""
    parts = ["<|begin_of_text|>"]
    if system_prompt:
        parts.append(f"<|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|>")
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{user_message}<|eot_id|>")
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

@app.post("/generate")
async def generate(request: GenerationRequest):
    formatted_prompt = format_llama3_prompt(request.prompt)
    request_id = str(uuid.uuid4())

    sampling_params = SamplingParams(
        temperature=request.temperature,
        top_p=request.top_p,
        max_tokens=request.max_tokens,
        stop=["<|eot_id|>", "<|end_of_text|>"],
    )

    async def stream_results():
        # vLLM yields the cumulative text so far; emit only the newly generated part
        sent = 0
        async for output in engine.generate(formatted_prompt, sampling_params, request_id):
            if output.outputs:
                text = output.outputs[0].text
                yield text[sent:]
                sent = len(text)

    if request.stream:
        return StreamingResponse(stream_results(), media_type="text/plain")
    else:
        # Non-streaming: concatenate the deltas into the full completion
        chunks = [chunk async for chunk in stream_results()]
        return {"text": "".join(chunks)}

note

The CLM inference stack (2024–2025)

The standard production CLM serving stack:

Model: Llama 3.1 (8B for latency-critical) or Llama 3.1 70B (quality-critical)
Serving: vLLM with PagedAttention (2–4x higher throughput than naive HuggingFace)
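Quantization: AWQ or GPTQ at INT4 for 70B models, shrinking the weights from roughly 140 GB (FP16) to roughly 35 GB so the model fits on 2x A100 (vs 4x)
Speculative decoding: Draft with Llama 3.2 1B to accelerate Llama 3.1 8B by up to 2–3x, depending on how often the draft tokens are accepted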
Context caching: Radix attention or prefix caching for shared system prompts
API layer: FastAPI + async streaming for sub-100ms time-to-first-token
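The quantization entry in this stack comes down to simple arithmetic on weight storage. The sketch below is a back-of-the-envelope estimate only: `weight_memory_gb` is a hypothetical helper, and it deliberately ignores quantization scales and zero-points, the KV cache, and activation memory, all of which add to the real footprint.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

fp16_70b = weight_memory_gb(70e9, 16)
int4_70b = weight_memory_gb(70e9, 4)
fp16_8b = weight_memory_gb(8e9, 16)

print(f"70B @ FP16: {fp16_70b:.0f} GB")
print(f"70B @ INT4: {int4_70b:.0f} GB")
print(f" 8B @ FP16: {fp16_8b:.0f} GB")
```

The remaining GPU memory after the weights are loaded is what the serving engine can spend on KV cache, so a smaller weight footprint also translates into more concurrent requests.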

Key Takeaways

Causal language modeling is one of those rare ideas where the simplest possible training objective - predict the next token - turns out to produce the most capable models when scaled. GPT-2 could generate coherent paragraphs; GPT-3 could solve reasoning problems with examples in context; GPT-4 can pass professional exams. The architecture barely changed - the scale did.

Understanding CLM gives you the foundation for everything that follows in this module. Fine-tuning (SFT, instruction tuning), alignment (RLHF, DPO), and efficient adaptation (LoRA, QLoRA) all operate on top of a pretrained causal LM. The pretrained model's quality is the ceiling for everything downstream - no fine-tuning technique can compensate for a poor base model.

The production serving story has also matured rapidly: vLLM with PagedAttention, speculative decoding with small draft models, and INT4 quantization via AWQ/GPTQ have collectively made 70B-class models economically viable to serve at consumer scale. Understanding these systems is increasingly part of what it means to be an AI engineer.

Interview Q&A (Extended)

Q6: Why did GPT-3's in-context learning ability come as a surprise to the research community?

Before GPT-3, the dominant assumption was that task-specific fine-tuning was necessary to apply LLMs to new tasks. GPT-3 (Brown et al., 2020) demonstrated that a large enough language model could perform translation, summarization, arithmetic, and question answering from just a few prompt examples, without any gradient updates. This was surprising for several reasons: (1) the model was trained only on next-token prediction, with no signal about what a "task" is; (2) the ability emerged suddenly at large scale - GPT-2 (1.5B) showed minimal few-shot capability; GPT-3 (175B) was dramatically better; (3) the mechanism was not designed - it emerged from scale. Leading theories today suggest that transformers at scale are performing something like implicit gradient descent via their attention patterns, but this remains an active research question.

Q7: How does temperature affect the diversity of generated text, and what are the failure modes at extreme values?

Temperature $T$ rescales the logits before the softmax: $p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$. As $T \to 0$: the softmax output approaches one-hot (argmax), selecting only the highest-probability token at each step, which is deterministic greedy decoding. Output is coherent and safe, but repetitive and generic. At $T = 1.0$: raw model probabilities, preserving the model's original uncertainty. At $T > 1.0$: the distribution is flattened, low-probability tokens become more likely, and diversity increases. Above $T \approx 1.5$: outputs tend to become incoherent because low-probability tokens that the model learned to suppress (grammatical errors, topic shifts, factual errors) start being sampled frequently. The standard practical range for creative generation: $T = 0.7$–$1.0$. For structured outputs (JSON, code): $T = 0.2$–$0.5$.
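The effect of temperature can be seen directly by applying the formula to a small set of logits. This is a minimal sketch with made-up logits; `softmax_with_temperature` and `entropy` are hypothetical helper names, and entropy is used here simply as a convenient measure of how spread out the distribution is.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Compute softmax(z / T) with the usual max-subtraction for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter, more diverse distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 2.0, 1.0, 0.5]  # illustrative values only

for t in (0.2, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top-1 prob={probs[0]:.3f}, entropy={entropy(probs):.3f}")
```

As the temperature rises, the top-1 probability falls and the entropy grows, which matches the progression from near-greedy to highly diverse sampling described above.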

Q8: What is the attention complexity problem and how has the research community addressed it?

Standard self-attention is $O(n^2)$ in sequence length, in both compute and memory. For a 32K token context window, this is $32000^2 \approx 10^9$ attention score computations per head per layer, which becomes the bottleneck for both training and inference. Solutions: (1) Flash Attention (Dao et al., 2022) - does not reduce the $O(n^2)$ compute but dramatically reduces memory by computing attention in tiles that fit in fast SRAM, avoiding slow HBM reads/writes and never materializing the full $n \times n$ score matrix, which makes much longer contexts feasible for the same memory; (2) Grouped Query Attention (GQA) - multiple query heads share a single key/value head, reducing KV cache by 4–8x with minimal quality loss (used in Llama 2 70B, Llama 3, Mistral); this targets KV cache memory rather than the quadratic compute; (3) Sliding window attention (Mistral, Longformer) - each token attends to only a local window of $w$ tokens, giving $O(n \cdot w)$ complexity; (4) Linear-time alternatives - linear attention variants and recurrent or state-space models (RWKV, Mamba) replace softmax attention with a fixed-size recurrent state for $O(n)$ complexity, at the cost of some modeling capacity.
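The gap between quadratic and windowed attention is easy to quantify. The sketch below counts attention scores for one head in one layer; the function names are hypothetical, the window size of 4096 is an illustrative assumption, and the full count uses $n^2$ as in the answer above (a causal mask would roughly halve it).

```python
def full_attention_scores(n: int) -> int:
    """Number of query-key scores when every token attends to every token."""
    return n * n

def sliding_window_scores(n: int, w: int) -> int:
    """Causal sliding window: token i attends to at most the w most recent tokens, itself included."""
    return sum(min(i + 1, w) for i in range(n))

n = 32_000
w = 4_096  # illustrative window size

full = full_attention_scores(n)
windowed = sliding_window_scores(n, w)
score_matrix_gb = full * 2 / 1e9  # bytes if the full matrix were stored in 16-bit floats

print(f"full attention:  {full:,} scores per head per layer")
print(f"sliding window:  {windowed:,} scores per head per layer")
print(f"materialized fp16 score matrix: {score_matrix_gb:.3f} GB per head")
```

The last figure is the memory that tiled kernels such as Flash Attention avoid allocating, since they never store the full score matrix.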

Q9: How do you choose between greedy, sampling, and beam search for a production CLM application?

The choice depends on the application type. For factual Q&A (retrieving a specific fact, answering a yes/no question): use greedy decoding or low-temperature sampling ($T = 0.3$). Determinism is valuable and the answer space is narrow. For chat and instruction following: temperature sampling ($T = 0.7$, top-p $= 0.9$) provides natural variation without incoherence. This is a common default for ChatGPT-style interfaces. For code generation: low temperature ($T = 0.2$–$0.4$) because code has strict correctness requirements; alternatively, use best-of-N sampling (generate 5–10 samples, run tests, return the first one that passes). For creative writing: temperature $= 1.0$–$1.1$ with top-p $= 0.95$. For machine translation: beam search (n=4–5) remains a strong choice because it produces globally coherent translations and is less prone to local degeneration.
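The best-of-N idea mentioned for code generation can be sketched in a few lines. Everything here is a stand-in: `best_of_n`, `toy_generate`, and `toy_passes_tests` are hypothetical names, the "model" simply replays a fixed list of candidates, and running generated code with `exec` is only acceptable in this toy because the strings are hard-coded (real model output should be executed in a sandbox).

```python
from typing import Callable, Optional

def best_of_n(
    generate: Callable[[], str],
    passes_tests: Callable[[str], bool],
    n: int = 5,
) -> Optional[str]:
    """Sample up to n candidates and return the first one that passes the tests."""
    for _ in range(n):
        candidate = generate()
        if passes_tests(candidate):
            return candidate
    return None  # no candidate passed

# Stand-in for sampled model outputs (hard-coded for illustration).
candidates = iter([
    "def add(a, b): return a - b",
    "def add(a, b): return a * b",
    "def add(a, b): return a + b",
])

def toy_generate() -> str:
    return next(candidates)

def toy_passes_tests(source: str) -> bool:
    namespace = {}
    exec(source, namespace)  # safe here only because the strings are hard-coded
    return namespace["add"](2, 3) == 5

result = best_of_n(toy_generate, toy_passes_tests, n=5)
print(result)
```

In a real system `generate` would call the model with sampling enabled, so that the N candidates actually differ from one another.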

Q10: What is Grouped Query Attention (GQA) and why is it important for inference?

Grouped Query Attention (Ainslie et al., 2023) interpolates between Multi-Head Attention (MHA, one KV head per Q head) and Multi-Query Attention (MQA, one KV head shared by all Q heads). In GQA, Q heads are divided into groups, and each group shares a single KV head. For example, with 32 Q heads and GQA groups=4: 8 Q heads share each of 4 KV pairs. Llama 2 (70B) uses GQA with 8 groups; Mistral-7B uses GQA with 8 groups. Why it matters for inference: the KV cache is the primary memory bottleneck at large batch sizes and long contexts. In a 32-head model, MQA reduces KV cache by 32x (one KV head instead of 32) with a small quality penalty. GQA recovers most of that quality while still getting 4–8x KV cache reduction. This directly translates to higher throughput (more concurrent requests per GPU) and support for longer contexts.
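The KV cache savings follow directly from the number of KV heads. The sketch below estimates cache size for one sequence under assumed dimensions: `kv_cache_bytes` is a hypothetical helper, and the 32-layer, 32-query-head, head-dimension-128 shape with 16-bit cache entries is an illustrative 7B-class configuration rather than the spec of a specific model.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 16-bit cache entries
) -> int:
    """Bytes needed to cache keys and values (hence the factor 2) for all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Assumed 7B-class shape: 32 layers, 32 query heads, head_dim 128, 8192-token context.
LAYERS, HEAD_DIM, SEQ_LEN = 32, 128, 8192

mha = kv_cache_bytes(LAYERS, n_kv_heads=32, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
gqa = kv_cache_bytes(LAYERS, n_kv_heads=8, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
mqa = kv_cache_bytes(LAYERS, n_kv_heads=1, head_dim=HEAD_DIM, seq_len=SEQ_LEN)

for name, size in (("MHA (32 KV heads)", mha), ("GQA (8 KV heads)", gqa), ("MQA (1 KV head)", mqa)):
    print(f"{name}: {size / 2**30:.3f} GiB per sequence")
```

Because the cache scales linearly with batch size and sequence length, a 4x reduction per sequence means roughly 4x as many concurrent sequences fit in the same memory budget.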

Practical CLM Model Selection Guide

Choosing the right CLM for a production application involves balancing quality, latency, and cost:

from typing import Optional

def select_clm_model(
    use_case: str,             # "chat", "code", "reasoning", "embedding"
    deployment: str,           # "cloud_api", "self_hosted_gpu", "on_device"
    quality_tier: str,         # "best_available", "production", "fast"
    latency_requirement_ms: Optional[int] = None,  # Target time-to-first-token in ms (not yet used by the lookup)
) -> dict:
    """
    Recommend a CLM for a given use case and constraints.
    Based on benchmark performance and inference benchmarks from 2024-2025.
    """
    recommendations = {
        ("chat", "cloud_api", "best_available"): {
            "model": "claude-opus-4 or gpt-4o",
            "context": "200K tokens",
            "cost": "$15-20 per 1M output tokens",
            "latency": "500-2000ms TTFT",
        },
        ("chat", "cloud_api", "production"): {
            "model": "claude-haiku-3-5 or gpt-4o-mini",
            "context": "200K tokens",
            "cost": "$0.4-1.0 per 1M output tokens",
            "latency": "100-500ms TTFT",
        },
        ("chat", "self_hosted_gpu", "production"): {
            "model": "Llama-3.1-8B-Instruct (AWQ INT4)",
            "hardware": "1x A100 80GB",
            "throughput": "~2000 tokens/s at batch=32",
            "latency": "50-150ms TTFT",
            "cost": "$1-3/hr GPU cost",
        },
        ("code", "self_hosted_gpu", "best_available"): {
            "model": "Qwen2.5-Coder-32B-Instruct or DeepSeek-Coder-V2",
            "hardware": "2-4x A100 80GB",
            "notes": "Best open-source code models as of 2025",
        },
        ("reasoning", "self_hosted_gpu", "best_available"): {
            "model": "DeepSeek-R1-Distill-Qwen-32B",
            "hardware": "2x A100 80GB",
            "notes": "Best open-source reasoning. Uses chain-of-thought. High latency due to long thinking tokens.",
        },
        ("chat", "on_device", "fast"): {
            "model": "Llama-3.2-1B or Phi-3-mini (4-bit GGUF)",
            "hardware": "M2 MacBook Pro or RTX 4060 laptop",
            "throughput": "15-40 tokens/s on CPU",
            "notes": "Suitable for offline/privacy-sensitive applications",
        },
    }

    key = (use_case, deployment, quality_tier)
    if key in recommendations:
        return recommendations[key]
    else:
        return {
            "note": f"No specific recommendation for ({use_case}, {deployment}, {quality_tier})",
            "suggestion": "Start with Llama-3.1-8B-Instruct for most self-hosted production use cases",
        }


# Example lookups
print(select_clm_model("chat", "self_hosted_gpu", "production"))
print(select_clm_model("code", "self_hosted_gpu", "best_available"))

The CLM landscape has bifurcated into two tiers: cloud API models (GPT-4o, Claude 3.5/4, Gemini) that offer maximum quality at pay-per-token pricing, and open-source self-hosted models (Llama 3.1, Qwen 2.5, Mistral) that offer lower cost at scale in exchange for infrastructure overhead. For most production applications serving thousands of users, the crossover point where self-hosting becomes cheaper than API usage is approximately 50-100 million output tokens per month on commodity A100 hardware, assuming top-tier API pricing like that listed in the guide above.
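The crossover estimate can be reproduced with a rough cost model. The sketch below is a simplification under stated assumptions: `breakeven_tokens_per_month` is a hypothetical helper, the $2/hr GPU price and $15-20 per 1M output tokens are taken from the ranges in the selection guide above, 730 is the approximate number of hours in a month, and engineering time, redundancy, and idle capacity are ignored.

```python
def breakeven_tokens_per_month(
    gpu_cost_per_hour: float,
    api_cost_per_million_tokens: float,
    hours_per_month: float = 730.0,
    num_gpus: int = 1,
) -> float:
    """Output tokens per month at which an always-on GPU costs the same as the API."""
    monthly_gpu_cost = gpu_cost_per_hour * hours_per_month * num_gpus
    return monthly_gpu_cost / api_cost_per_million_tokens * 1e6

at_15 = breakeven_tokens_per_month(gpu_cost_per_hour=2.0, api_cost_per_million_tokens=15.0)
at_20 = breakeven_tokens_per_month(gpu_cost_per_hour=2.0, api_cost_per_million_tokens=20.0)

print(f"break-even vs $15/1M tokens: {at_15 / 1e6:.0f}M tokens/month")
print(f"break-even vs $20/1M tokens: {at_20 / 1e6:.0f}M tokens/month")
```

Against cheaper API tiers the break-even volume is much higher, so the comparison should be rerun with the actual prices of the models being considered.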

