Master the sampling algorithms that control LLM output diversity - from greedy decoding to nucleus sampling - and learn when to use each in production.

Sampling Strategies: Temperature, Top-K, Top-P

The Production Scenario

Your AI writing assistant has been live for six months. The creative writing feature is getting complaints: outputs are repetitive, bland, always predictable. You look at the configuration and discover someone set temperature to 0.1 - probably from a "temperature = 0 for consistency" cargo cult rule applied without thinking. You bump it to 0.8 and the writing immediately becomes richer. Then you get a different complaint: the factual Q&A feature is now occasionally generating plausible-sounding nonsense, making things up with the same creative flair that helps fiction writing.

You realize there is no universal best setting. Every application needs a different point in the creativity-accuracy trade-off space. And you realize you do not actually know how these parameters interact mechanically - just that "higher = more random."

This lesson builds the precise mechanical understanding of what each parameter does to the probability distribution. Once you understand the mechanics, the right settings for every use case become obvious rather than trial-and-error.

The key insight is that temperature, top-K, and top-P are all different ways of reshaping the same probability distribution before sampling. Temperature rescales the logits. Top-K zeroes out all but the $K$ highest-probability tokens. Top-P keeps a minimal set of tokens whose probabilities sum to at least a threshold. These are independent operations that stack together, each addressing a different failure mode of pure random sampling.

Why This Exists: The Problems With Naive Approaches

Greedy Decoding Fails

The simplest approach: always pick the highest-probability token. Deterministic, fast, reproducible. But it produces degenerate outputs for any creative task:

Prompt: "The sun rose over the mountains and..."
Greedy: "the mountains and the mountains and the mountains and the mountains..."

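Greedy decoding falls into repetitive loops because once "the mountains" has been generated, it becomes the highest-probability continuation in context, and each repetition reinforces the next. Holtzman et al. (2020) call this neural text degeneration. It is distinct from exposure bias, which refers to the mismatch between training on ground-truth prefixes and generating from the model's own outputs.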

Pure Random Sampling Fails Too

Sampling from the raw softmax distribution solves repetition but introduces incoherence. The vocabulary has 32,000–128,000 tokens. Even an "unlikely" token with probability 0.001% gets sampled eventually, and the thousands of such tokens in the tail can jointly hold a noticeable share of the mass. After "The capital of France is," you want "Paris" - but pure sampling might occasionally produce "banana" or "quantum" just because they have tiny but nonzero probability.
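To put rough numbers on "eventually", the sketch below assumes a hypothetical distribution in which the tail tokens jointly hold 2% of the probability mass at every step (an illustrative figure, not a measurement) and treats steps as independent, which is a simplification. It computes the chance that at least one tail token appears in a generation of a given length.

```python
def prob_at_least_one_tail_token(tail_mass: float, num_tokens: int) -> float:
    """Chance that at least one of num_tokens independent draws lands in the tail."""
    return 1.0 - (1.0 - tail_mass) ** num_tokens


# tail_mass=0.02 is an assumed value chosen only for illustration
for n in (10, 100, 500):
    p = prob_at_least_one_tail_token(0.02, n)
    print(f"{n:>4} tokens: {p:.1%} chance of at least one tail token")
```

Under these assumptions, a tail that looks negligible per step becomes very likely to be hit somewhere in a long output, which is why the tail is worth suppressing.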

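The solution is to reshape the distribution before sampling. Top-K and top-P truncate it, zeroing out the probabilities of clearly wrong tokens. Temperature does not remove any token; it sharpens or flattens the distribution so that tail tokens become less or more likely.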

Historical Context

Early neural language models used beam search almost exclusively. Beam search maintains the $B$ highest-probability partial sequences simultaneously and was considered the gold standard for quality. It was the dominant decoding strategy from the seq2seq era (2014–2018).

The turning point was Holtzman et al. (2020), "The Curious Case of Neural Text Degeneration." They showed empirically that human text does not maximize probability - humans often use surprising but coherent words. Beam search produces text that is too predictable, too "safe," and often repetitive. They introduced top-P (nucleus) sampling and showed it produces more human-like text by multiple evaluation metrics.

Temperature scaling was used even earlier in the context of language model training (it comes from statistical mechanics, where temperature controls the randomness of a Boltzmann distribution). Top-K sampling was a natural precursor to top-P. The min-P sampling approach emerged around 2023 as a refinement that handles some failure modes of both top-K and top-P.

The Logit Distribution

Understanding sampling starts with the model's raw output: logits.

For each decode step, the model outputs a vector of unnormalized scores (logits) $z \in \mathbb{R}^V$ where $V$ is vocabulary size (typically 32,000–128,000). The probability of token $i$ is:

$P(i) = \text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$

All sampling strategies manipulate these logits or the resulting probabilities before the final sampling step.
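As a minimal sketch of the softmax formula above, here is a plain-Python version (standard library only, rather than the PyTorch used elsewhere in this lesson). Subtracting the maximum logit before exponentiating is a standard stability trick that leaves the result unchanged; the function name and sample logits are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                          # shift so the largest exponent is 0
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([3.2, 1.8, 1.1, 0.5])
print([round(p, 3) for p in probs])

# Large logits would overflow math.exp without the max subtraction
print(softmax([1000.0, 999.0]))
```

The shift works because multiplying numerator and denominator by the same constant $e^{-m}$ cancels out.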

Temperature Scaling

Temperature $T$ is applied by dividing logits by $T$ before the softmax:

$P_T(i) = \text{softmax}(z/T)_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$

What temperature does:

$T \to 0$ : Logit differences are amplified without bound. The highest logit dominates completely. Approaches greedy (argmax).
$T = 1$ : No change. Use the raw model probabilities.
$T > 1$ : Logits compressed toward zero. Distribution flattens - more tokens become equiprobable.
$T \to \infty$ : Uniform distribution over all tokens. Completely random.
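One way to see why these regimes follow from the formula is to look at the odds between two tokens $i$ and $j$. The shared denominator cancels, so the odds depend only on the logit gap and $T$ (a short derivation from the definition above, not tied to any particular model):

$\frac{P_T(i)}{P_T(j)} = \frac{e^{z_i/T}}{e^{z_j/T}} = e^{(z_i - z_j)/T}$

For an illustrative logit gap of $z_i - z_j = 1.4$, the odds are $e^{2.8} \approx 16.4$ at $T = 0.5$, $e^{1.4} \approx 4.06$ at $T = 1$, and $e^{0.7} \approx 2.01$ at $T = 2$. As $T \to 0$ the exponent grows without bound, and as $T \to \infty$ it goes to zero, so the odds approach 1.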

import torch
import torch.nn.functional as F


def apply_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Apply temperature scaling to logits. Temperature=0 gives greedy (argmax)."""
    if temperature <= 0:
        # Temperature 0 = argmax (greedy)
        one_hot = torch.zeros_like(logits)
        one_hot[logits.argmax()] = 1.0
        return one_hot
    return F.softmax(logits / temperature, dim=-1)


def visualize_temperature_effect():
    """
    Show how temperature reshapes the probability distribution.
    Uses a simplified vocabulary of 10 tokens.
    """
    # Simulate logits for a small vocabulary
    torch.manual_seed(42)
    logits = torch.tensor([3.2, 1.8, 1.1, 0.5, 0.3, -0.2, -0.5, -1.0, -2.0, -3.0])

    temperatures = [0.1, 0.5, 1.0, 1.5, 2.0]
    token_labels = [f"tok_{i}" for i in range(len(logits))]

    print("Temperature effect on probability distribution:")
    print(f"{'Token':>10}", end="")
    for T in temperatures:
        print(f"  T={T:3.1f}", end="")
    print()
    print("-" * 60)

    for i, label in enumerate(token_labels):
        print(f"{label:>10}", end="")
        for T in temperatures:
            prob = apply_temperature(logits, T)[i].item()
            print(f"  {prob:.3f}", end="")
        print()

    # The key insight: at T=0.1, almost all probability mass on tok_0
    # At T=2.0, the distribution is much flatter

Expected output (partial):

     Token  T=0.1  T=0.5  T=1.0  T=1.5  T=2.0
------------------------------------------------------------
     tok_0  1.000  0.921  0.636  0.445  0.342
     tok_1  0.000  0.056  0.157  0.175  0.170
     tok_2  0.000  0.014  0.078  0.110  0.120
     tok_3  0.000  0.004  0.043  0.074  0.089

Recommended temperature settings by task:

Task	Temperature	Rationale
Factual Q&A, extraction	0.0–0.2	Need deterministic, correct answers
Code generation	0.1–0.4	Syntax must be correct; small creativity OK
Summarization	0.3–0.6	Mostly faithful, some paraphrase variety
Chat/conversation	0.7–0.9	Natural, not robotic
Creative writing	0.8–1.2	Variety and surprise valued
Brainstorming	1.0–1.5	Diversity of ideas wanted

Top-K Sampling

Top-K sampling zeroes out the probability of all tokens except the $K$ highest-probability ones, then renormalizes and samples from the truncated distribution:

$P_K(i) = \begin{cases} \frac{P(i)}{\sum_{j \in \text{TopK}} P(j)} & \text{if } i \in \text{TopK}(P, K) \\ 0 & \text{otherwise} \end{cases}$

def top_k_sampling(logits: torch.Tensor, k: int, temperature: float = 1.0) -> int:
    """
    Sample from the top-K highest probability tokens.

    Args:
        logits: Raw unnormalized scores [vocab_size]
        k: Number of top tokens to keep
        temperature: Applied before filtering

    Returns:
        Sampled token index
    """
    # Apply temperature first
    scaled_logits = logits / max(temperature, 1e-8)

    # Zero out all but top-K
    if k > 0 and k < logits.shape[-1]:
        top_k_values, _ = torch.topk(scaled_logits, k)
        min_top_k = top_k_values[..., -1, None]  # Threshold value
        # Replace below-threshold with very negative (becomes ~0 after softmax)
        scaled_logits = scaled_logits.masked_fill(
            scaled_logits < min_top_k, float('-inf')
        )

    # Sample from the filtered distribution
    probs = F.softmax(scaled_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()


def demonstrate_top_k_problem():
    """
    Show the key problem with top-K: K is fixed regardless of distribution shape.
    """
    torch.manual_seed(42)

    # Case 1: Confident distribution (one token dominates)
    # K=10 would include 9 tokens that should almost never be sampled
    confident_logits = torch.tensor(
        [5.0] + [-2.0] * 99  # One dominant token
    )
    probs_confident = F.softmax(confident_logits, dim=-1)

    # Case 2: Uncertain distribution (many reasonable tokens)
    uncertain_logits = torch.randn(100)
    probs_uncertain = F.softmax(uncertain_logits, dim=-1)

    print("Problem with Top-K: K=10 on different distributions")
    print()
    print("Confident distribution (one clear winner):")
    top_probs, _ = torch.topk(probs_confident, 10)
    print(f"  Top-10 tokens cover {top_probs.sum():.1%} of probability mass")
    print(f"  Top-1 token has {top_probs[0]:.1%} of mass")
    print(f"  → K=10 includes 9 nearly-impossible tokens")
    print()
    print("Uncertain distribution (many reasonable choices):")
    top_probs2, _ = torch.topk(probs_uncertain, 10)
    print(f"  Top-10 tokens cover {top_probs2.sum():.1%} of probability mass")
    print(f"  Top-1 token has {top_probs2[0]:.1%} of mass")
    print(f"  → K=10 excludes many reasonable options")

The problem with top-K: $K$ is a fixed count, but the "right" number of candidates varies dramatically with distribution shape. When the model is confident (steep distribution), K=50 includes 49 tokens that should never be sampled. When the model is uncertain (flat distribution), K=50 might cut off many reasonable alternatives. Top-P solves exactly this problem.

Top-P (Nucleus) Sampling

Introduced by Holtzman et al. (2020), top-P sampling dynamically selects a minimal set of tokens whose cumulative probability exceeds a threshold $p$ :

Sort tokens by probability in descending order
Accumulate probabilities until the sum exceeds $p$
Include only those tokens (the "nucleus")
Renormalize and sample

$\text{nucleus}(p) = \arg\min_{V' \subseteq V} |V'| \quad \text{subject to} \quad \sum_{i \in V'} P(i) \geq p$

def top_p_sampling(logits: torch.Tensor, p: float, temperature: float = 1.0) -> int:
    """
    Nucleus (top-P) sampling: sample from minimal token set covering probability p.

    Args:
        logits: Raw unnormalized scores [vocab_size]
        p: Cumulative probability threshold (e.g., 0.9)
        temperature: Applied before filtering

    Returns:
        Sampled token index
    """
    # Apply temperature
    scaled_logits = logits / max(temperature, 1e-8)
    probs = F.softmax(scaled_logits, dim=-1)

    # Sort probabilities in descending order
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)

    # Compute cumulative probabilities
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # Find the cutoff: remove tokens once cumulative prob exceeds p
    # Shift right by 1 so that the first token exceeding p is kept
    sorted_indices_to_remove = cumulative_probs - sorted_probs > p
    # Always keep the top token (never remove if only one token)
    sorted_indices_to_remove[0] = False

    # Scatter back to original ordering
    indices_to_remove = sorted_indices_to_remove.scatter(
        0, sorted_indices, sorted_indices_to_remove
    )

    # Zero out removed tokens
    filtered_logits = scaled_logits.masked_fill(indices_to_remove, float('-inf'))
    final_probs = F.softmax(filtered_logits, dim=-1)

    return torch.multinomial(final_probs, num_samples=1).item()


def compare_topk_vs_topp():
    """
    Show how top-P adapts to distribution shape while top-K does not.
    """
    torch.manual_seed(0)

    # Confident distribution
    logits_conf = torch.tensor([4.0, 1.0] + [-3.0] * 98)
    probs_conf = F.softmax(logits_conf, dim=-1)

    # Uncertain distribution
    logits_unc = torch.tensor([1.5, 1.4, 1.3, 1.2, 1.1, 1.0] + [0.0] * 94)
    probs_unc = F.softmax(logits_unc, dim=-1)

    print("Top-K (K=10) vs Top-P (P=0.9) on different distributions:")
    print()

    for name, probs in [("Confident", probs_conf), ("Uncertain", probs_unc)]:
        # Count top-K candidates
        top_10_mass = torch.topk(probs, 10).values.sum().item()

        # Count top-P candidates (P=0.9)
        sorted_p, _ = torch.sort(probs, descending=True)
        cumsum = torch.cumsum(sorted_p, dim=-1)
        n_nucleus = (cumsum < 0.9).sum().item() + 1

        print(f"{name} distribution:")
        print(f"  Top-K (K=10) covers: {top_10_mass:.1%} of probability")
        print(f"  Top-P (P=0.9) nucleus size: {n_nucleus} tokens")
        print()

Why top-P is adaptive:

When the model is confident: top token has 95% probability. Nucleus at P=0.9 includes just 1 token. Top-K with K=50 would unnecessarily include 49 low-probability tokens.
When the model is uncertain: top 50 tokens each have ~2% probability. Nucleus at P=0.9 includes 45 tokens. Top-K with K=10 would artificially restrict to only 10.
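The two cases above can be checked with a few lines of plain Python. This is a sketch using exact fractions so the 90% boundary is not blurred by floating-point rounding; the distributions are constructed to match the numbers in the text, and `nucleus_size` is an illustrative helper name.

```python
from fractions import Fraction


def nucleus_size(probs, p):
    """Smallest number of top tokens whose cumulative probability reaches p."""
    total = Fraction(0)
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)


# Confident: one token at 95%, the remaining 5% spread over 50 tokens
confident = [Fraction(95, 100)] + [Fraction(1, 1000)] * 50
# Uncertain: 50 tokens at 2% each
uncertain = [Fraction(2, 100)] * 50

print(nucleus_size(confident, Fraction(9, 10)))  # 1
print(nucleus_size(uncertain, Fraction(9, 10)))  # 45
```

The same threshold yields a nucleus of 1 token in one case and 45 in the other, which is the adaptivity a fixed K cannot provide.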

Min-P Sampling

Min-P (2023) is a newer alternative that filters tokens below a fraction of the maximum token probability:

$\text{MinP threshold} = p_{\text{min}} \times \max_i P(i)$

Tokens with probability below this threshold are removed.

def min_p_sampling(logits: torch.Tensor, min_p: float, temperature: float = 1.0) -> int:
    """
    Min-P sampling: remove tokens below min_p * max_token_probability.
    More stable than top-P for high temperatures.

    Args:
        logits: Raw unnormalized scores [vocab_size]
        min_p: Minimum probability fraction relative to top token (e.g., 0.05)
        temperature: Applied before filtering
    """
    scaled_logits = logits / max(temperature, 1e-8)
    probs = F.softmax(scaled_logits, dim=-1)

    # Scale threshold relative to top token probability
    max_prob = probs.max()
    threshold = min_p * max_prob

    # Zero out tokens below threshold
    filtered_probs = probs.masked_fill(probs < threshold, 0.0)

    # Renormalize
    filtered_probs = filtered_probs / filtered_probs.sum()

    return torch.multinomial(filtered_probs, num_samples=1).item()

Min-P behaves better at high temperatures because the threshold scales with the top token's probability. When the model is very uncertain (flat distribution), the threshold is low, keeping many candidates. When the model is very confident, the threshold is high, keeping only the top options.
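A small worked example may make the scaling threshold concrete. The sketch below uses plain Python and two made-up five-token distributions, and counts how many tokens survive a min-P of 0.1 in each.

```python
def min_p_keep_count(probs, min_p):
    """Number of tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return sum(1 for prob in probs if prob >= threshold)


confident = [0.90, 0.05, 0.03, 0.01, 0.01]  # illustrative values
flat = [0.22, 0.21, 0.20, 0.19, 0.18]       # illustrative values

print(min_p_keep_count(confident, 0.1))  # threshold 0.09  -> keeps 1 token
print(min_p_keep_count(flat, 0.1))       # threshold 0.022 -> keeps all 5
```

The same `min_p` value prunes aggressively when one token dominates and barely prunes when the distribution is flat.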

Repetition Penalty

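Repetition penalty lowers the logits of tokens that already appear in the context. In the common formulation shown here, a positive logit is divided by the penalty and a negative logit is multiplied by it, so with a penalty greater than 1 both cases reduce the token's probability: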

def apply_repetition_penalty(
    logits: torch.Tensor,
    input_ids: torch.Tensor,
    penalty: float = 1.3
) -> torch.Tensor:
    """
    Apply repetition penalty to logits.
    Tokens that appeared in input_ids get their logits scaled down.

    Args:
        logits: Raw logits [vocab_size]
        input_ids: Previously generated token IDs [seq_len]
        penalty: > 1.0 discourages repetition (1.0 = no effect)
    """
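    if penalty == 1.0:
        return logits

    # Work on a copy so the caller's tensor is not modified in place
    logits = logits.clone()

    # Get unique tokens from previous context
    unique_tokens = set(input_ids.tolist())

    for token_id in unique_tokens:
        # Dividing a negative logit would raise its probability,
        # so negative logits are multiplied by the penalty instead
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty

    return logits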

Recommended penalty values:

1.0: No penalty (default)
1.1–1.2: Mild - good for most chat
1.3–1.5: Aggressive - use for long creative text
Above 1.5: Too aggressive - starts producing incoherent text
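To get a feel for how these values move probabilities, here is a plain-Python sketch on three made-up logits, where tokens 0 and 2 are assumed to have appeared already. The helper names and numbers are illustrative only.

```python
import math


def penalize(logit: float, penalty: float) -> float:
    """Sign-aware penalty: shrink positive logits, push negative logits further down."""
    return logit / penalty if logit > 0 else logit * penalty


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, -1.0]  # illustrative logits
seen = {0, 2}              # assume tokens 0 and 2 were already generated

for penalty in (1.0, 1.2, 1.5):
    adjusted = [penalize(z, penalty) if i in seen else z for i, z in enumerate(logits)]
    probs = softmax(adjusted)
    print(f"penalty={penalty}: " + ", ".join(f"{p:.3f}" for p in probs))
```

In this toy case a penalty of 1.5 is already enough to drop the previously top token below the unseen token 1, which hints at why large values distort output.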

Beam Search

Beam search maintains the $B$ highest-scoring partial sequences simultaneously:

def beam_search(
    model,
    input_ids: torch.Tensor,
    beam_width: int = 4,
    max_new_tokens: int = 50,
    length_penalty: float = 1.0
) -> list:
    """
    Beam search: maintain B best sequences at each step.
    Returns list of (score, token_ids) tuples, sorted by score.
    """
    # Initialize with a single beam holding the prompt (expects 1-D input_ids)
    beams = [(0.0, input_ids.tolist())]

    for _ in range(max_new_tokens):
        all_candidates = []

        for score, seq in beams:
            # Get logits for this sequence
            with torch.no_grad():
                ids = torch.tensor([seq])
                outputs = model(ids)
                logits = outputs.logits[0, -1, :]  # Last token logits
                log_probs = F.log_softmax(logits, dim=-1)

            # Get top B next tokens for this beam
            top_log_probs, top_tokens = torch.topk(log_probs, beam_width)

            for log_prob, token in zip(top_log_probs, top_tokens):
                new_score = score + log_prob.item()
                new_seq = seq + [token.item()]
                all_candidates.append((new_score, new_seq))

        # Keep top B candidates (with length normalization)
        all_candidates.sort(
            key=lambda x: x[0] / (len(x[1]) ** length_penalty),
            reverse=True
        )
        beams = all_candidates[:beam_width]

    return beams

Beam search vs sampling:

Aspect	Beam Search	Sampling (T+P)
Determinism	Yes (given same inputs)	No
Quality (factual)	Often better	Depends
Diversity	Low	High
Repetition risk	High (all beams similar)	Lower with penalty
Latency	B× slower	1×
Use case	Translation, summarization	Chat, creative tasks
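The difference between greedy decoding (beam width 1) and a wider beam is easiest to see on a toy problem. The sketch below uses a hypothetical hand-written two-step "model" (a lookup table of next-token probabilities, not a real LM) in which the locally best first token leads to a worse overall sequence.

```python
import math

# Hypothetical two-step "model": next-token probabilities for each prefix.
TOY_MODEL = {
    (): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("A",): {"x": 0.4, "y": 0.3, "z": 0.3},
    ("B",): {"x": 0.9, "y": 0.05, "z": 0.05},
    ("C",): {"x": 0.4, "y": 0.3, "z": 0.3},
}


def beam_search_toy(beam_width: int, steps: int = 2):
    """Return (log_prob, sequence) pairs for the best beams after `steps` tokens."""
    beams = [(0.0, ())]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, prob in TOY_MODEL[seq].items():
                candidates.append((score + math.log(prob), seq + (token,)))
        # Keep the beam_width highest-scoring partial sequences
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


greedy_score, greedy_seq = beam_search_toy(beam_width=1)[0]
beam_score, beam_seq = beam_search_toy(beam_width=2)[0]
print(greedy_seq, round(math.exp(greedy_score), 2))  # ('A', 'x') 0.2
print(beam_seq, round(math.exp(beam_score), 2))      # ('B', 'x') 0.36
```

Greedy commits to "A" and ends at probability 0.20, while a beam of width 2 keeps "B" alive and finds the sequence with probability 0.36. The cost is that each step scores every live beam, which is where the B× latency in the table comes from.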

Combining Sampling Parameters

In practice, you combine multiple techniques. The HuggingFace generate() API applies them in this order:

Apply repetition penalty to logits
Apply temperature (divide logits by T)
Apply top-K (zero out all but K highest)
Apply top-P (zero out until cumulative mass exceeds P)
Sample from remaining distribution

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def generate_with_sampling(
    model_name: str,
    prompt: str,
    temperature: float = 0.8,
    top_k: int = 50,
    top_p: float = 0.9,
    repetition_penalty: float = 1.1,
    max_new_tokens: int = 200,
    num_return_sequences: int = 3
):
    """
    Generate text with configurable sampling parameters.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
            do_sample=True,           # Required for temperature/top-k/top-p
            num_return_sequences=num_return_sequences,
            pad_token_id=tokenizer.eos_token_id
        )

    results = []
    for output in outputs:
        # Decode only the new tokens (not the prompt)
        new_tokens = output[inputs["input_ids"].shape[1]:]
        text = tokenizer.decode(new_tokens, skip_special_tokens=True)
        results.append(text)

    return results


# Task-specific configurations
TASK_CONFIGS = {
    "factual_qa": {
        "temperature": 0.1,
        "top_k": 10,
        "top_p": 0.9,
        "repetition_penalty": 1.0,
        "do_sample": True
    },
    "coding": {
        "temperature": 0.2,
        "top_k": 40,
        "top_p": 0.95,
        "repetition_penalty": 1.05,
        "do_sample": True
    },
    "chat": {
        "temperature": 0.7,
        "top_k": 50,
        "top_p": 0.9,
        "repetition_penalty": 1.1,
        "do_sample": True
    },
    "creative_writing": {
        "temperature": 1.0,
        "top_k": 0,       # Disable top-K, rely on top-P
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "do_sample": True
    },
    "brainstorming": {
        "temperature": 1.2,
        "top_k": 0,
        "top_p": 0.98,
        "repetition_penalty": 1.3,
        "do_sample": True
    }
}

Visualizing the Full Pipeline

Implementing Full Sampling from Scratch

import torch
import torch.nn.functional as F
from typing import Optional


def sample_next_token(
    logits: torch.Tensor,
    temperature: float = 1.0,
    top_k: int = 0,
    top_p: float = 1.0,
    min_p: float = 0.0,
    repetition_penalty: float = 1.0,
    previous_tokens: Optional[torch.Tensor] = None,
) -> int:
    """
    Complete sampling pipeline: temperature + top-K + top-P + min-P + repetition penalty.

    Args:
        logits: Raw model output [vocab_size]
        temperature: Scale factor (0 = greedy, 1 = no scaling, >1 = flatter)
        top_k: Keep only top K tokens (0 = disabled)
        top_p: Keep minimal nucleus covering probability P (1.0 = disabled)
        min_p: Filter tokens below min_p * max_prob (0.0 = disabled)
        repetition_penalty: Penalize previously used tokens (1.0 = no penalty)
        previous_tokens: Token IDs to penalize [seq_len]

    Returns:
        Sampled token index
    """
    # Step 1: Repetition penalty
    if repetition_penalty != 1.0 and previous_tokens is not None:
        logits = logits.clone()
        for token_id in set(previous_tokens.tolist()):
            if 0 <= token_id < len(logits):
                if logits[token_id] > 0:
                    logits[token_id] /= repetition_penalty
                else:
                    logits[token_id] *= repetition_penalty

    # Step 2: Temperature scaling (or greedy)
    if temperature <= 0:
        return logits.argmax().item()

    logits = logits / temperature

    # Step 3: Top-K filtering
    if top_k > 0:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[-1]] = float('-inf')

    # Step 4: Convert to probabilities
    probs = F.softmax(logits, dim=-1)

    # Step 5: Min-P filtering (on probabilities, not logits)
    if min_p > 0:
        min_threshold = min_p * probs.max()
        probs[probs < min_threshold] = 0.0
        probs = probs / probs.sum()

    # Step 6: Top-P (nucleus) filtering
    if top_p < 1.0:
        sorted_probs, sorted_indices = torch.sort(probs, descending=True)
        cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
        # Remove tokens once cumulative mass exceeds p
        sorted_indices_to_remove = cumulative_probs - sorted_probs > top_p
        sorted_indices_to_remove[0] = False  # Always keep top token
        probs[sorted_indices[sorted_indices_to_remove]] = 0.0
        probs = probs / probs.sum()

    # Step 7: Sample
    return torch.multinomial(probs, num_samples=1).item()


def benchmark_sampling_methods(vocab_size: int = 32000, n_samples: int = 1000):
    """Compare output diversity across sampling methods."""
    import time
    from collections import Counter

    torch.manual_seed(42)
    logits = torch.randn(vocab_size)

    methods = {
        "Greedy (T=0)": lambda l: l.argmax().item(),
        "T=0.5, K=50, P=0.9": lambda l: sample_next_token(l, temperature=0.5, top_k=50, top_p=0.9),
        "T=1.0, K=50, P=0.9": lambda l: sample_next_token(l, temperature=1.0, top_k=50, top_p=0.9),
        "T=1.5, P=0.95": lambda l: sample_next_token(l, temperature=1.5, top_p=0.95),
    }

    print(f"{'Method':<30} {'Unique tokens':>15} {'Top-1 frequency':>18}")
    print("-" * 65)

    for name, method in methods.items():
        t0 = time.perf_counter()
        samples = [method(logits.clone()) for _ in range(n_samples)]
        elapsed = time.perf_counter() - t0

        counter = Counter(samples)
        unique = len(counter)
        top1_freq = counter.most_common(1)[0][1] / n_samples

        print(f"{name:<30} {unique:>15} {top1_freq:>17.1%}")

Contrastive Decoding

Contrastive decoding (Li et al., 2022) aims to improve quality by subtracting the log-probabilities of a weaker "amateur" model from those of the strong "expert" model:

$\text{score}(x_t) = \log P_{\text{expert}}(x_t) - \log P_{\text{amateur}}(x_t)$

The idea: tokens that the amateur model also assigns high probability to are generic, common tokens. Subtracting them out amplifies what the expert model knows beyond the amateur. Applied to factual QA and reasoning, this may reduce hallucination, though the effect should be measured per task rather than assumed.

def contrastive_decoding(
    expert_logits: torch.Tensor,
    amateur_logits: torch.Tensor,
    alpha: float = 0.1,
    temperature: float = 1.0
) -> int:
    """
    Contrastive decoding: amplify expert model's unique predictions.

    Args:
        expert_logits: Large model logits [vocab_size]
        amateur_logits: Small model logits [vocab_size]
        alpha: Threshold - only consider tokens where expert prob > alpha * max_expert_prob
        temperature: Temperature for final sampling
    """
    expert_log_probs = F.log_softmax(expert_logits, dim=-1)
    amateur_log_probs = F.log_softmax(amateur_logits, dim=-1)

    # Adaptive plausibility constraint: only consider tokens
    # where expert assigns reasonable probability
    expert_probs = expert_log_probs.exp()
    cutoff = alpha * expert_probs.max()
    valid_tokens = expert_probs >= cutoff

    # Contrastive score
    contrastive_scores = expert_log_probs - amateur_log_probs

    # Mask invalid tokens
    contrastive_scores[~valid_tokens] = float('-inf')

    # Sample from contrastive scores
    return sample_next_token(contrastive_scores, temperature=temperature)

Production Engineering Notes

A/B Testing Sampling Parameters

Never change sampling parameters in production without A/B testing:

import hashlib


def get_sampling_config(user_id: str, task: str) -> dict:
    """
    Route users to sampling configurations based on user_id hash.
    Enables stable A/B testing - same user always gets same config.
    """
    # Hash user_id for stable routing
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100

    if hash_val < 50:
        # Control: current production config
        base = TASK_CONFIGS[task].copy()
        base["experiment"] = "control"
    else:
        # Treatment: candidate config
        base = TASK_CONFIGS[task].copy()
        base["temperature"] = base.get("temperature", 0.7) * 1.1
        base["experiment"] = "treatment_higher_temp"

    return base
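Before relying on hash-based routing, it can be worth sanity-checking that it is stable and roughly balanced. The sketch below mirrors the MD5 bucketing used above in a standalone helper (`bucket` is an illustrative name) and runs it over synthetic user IDs, which are made up for the check.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in [0, 100) using an MD5 hash."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100


# Stability: the same ID always lands in the same bucket
assert bucket("user-123") == bucket("user-123")

# Balance: with synthetic IDs, roughly half should fall below 50
ids = [f"user-{i}" for i in range(10_000)]
control_share = sum(bucket(u) < 50 for u in ids) / len(ids)
print(f"control share: {control_share:.1%}")
```

Python's built-in `hash()` is not a substitute here, since string hashes are randomized per process by default and would reshuffle users on every restart.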

Monitoring Output Quality

from dataclasses import dataclass
from typing import List
import statistics


@dataclass
class SamplingMetrics:
    """Track metrics that correlate with sampling quality issues."""
    request_id: str
    task_type: str
    config_name: str
    output_tokens: int
    unique_tokens_ratio: float   # Repetition indicator
    avg_token_probability: float  # Hallucination indicator


def compute_sampling_metrics(
    token_ids: List[int],
    token_probs: List[float],
    request_id: str,
    task_type: str,
    config_name: str
) -> SamplingMetrics:
    unique_ratio = len(set(token_ids)) / max(len(token_ids), 1)
    avg_prob = statistics.mean(token_probs) if token_probs else 0.0

    return SamplingMetrics(
        request_id=request_id,
        task_type=task_type,
        config_name=config_name,
        output_tokens=len(token_ids),
        unique_tokens_ratio=unique_ratio,
        avg_token_probability=avg_prob
    )
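As a usage sketch, the unique-token ratio can be turned into a simple alert. The token IDs below are made up, and the 0.5 threshold is a hypothetical starting point that would need tuning on real traffic; short, legitimately repetitive outputs can also score low.

```python
def unique_token_ratio(token_ids):
    """Fraction of distinct token IDs; values near 0 suggest a repetition loop."""
    return len(set(token_ids)) / max(len(token_ids), 1)


healthy = [5, 17, 42, 8, 99, 17, 3, 61]                 # made-up token IDs
looping = [5, 17, 42, 5, 17, 42, 5, 17, 42, 5, 17, 42]  # same trigram repeated

ALERT_BELOW = 0.5  # hypothetical threshold; tune on real traffic

for name, ids in [("healthy", healthy), ("looping", looping)]:
    ratio = unique_token_ratio(ids)
    flag = "ALERT" if ratio < ALERT_BELOW else "ok"
    print(f"{name}: unique ratio {ratio:.2f} -> {flag}")
```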

Common Mistakes

:::danger Setting temperature = 0 for all tasks

Temperature 0 (greedy decoding) is deterministic and fast, but degrades quality for any open-ended generation. It is appropriate for structured extraction (JSON, code with exact syntax), classification, or any task where there is exactly one right answer. For chat, summarization, or Q&A where multiple phrasings are acceptable, temperature 0 produces robotic, repetitive text. Always set non-zero temperature unless you specifically need determinism.

:::

:::danger Using top-K without top-P (or vice versa)

Top-K and top-P complement each other. Top-K alone fails on flat distributions (too restrictive) and steep distributions (not restrictive enough). Top-P alone can admit a very large number of tokens when the distribution is very flat, because many near-equal low-probability tokens are needed to reach the P threshold. A common production recipe is both: top-K=50 as a hard cap, top-P=0.9 as the dynamic nucleus. This guards against the edge cases of each method.

:::

:::warning Applying repetition penalty too aggressively

Repetition penalty above 1.5 tends to cause incoherence. Coherent text repeats tokens legitimately: function words, proper nouns, and variable names all recur, and the penalty cannot tell these apart from a pathological loop. Very aggressive values (1.5+) push the model to avoid all repetition, producing grammatically odd sentences. Values of 1.1–1.2 usually handle pathological repetition loops without harming coherent repetition.

:::

:::warning Not seeding random state for reproducibility in testing

Even with fixed sampling parameters, results vary because sampling is stochastic. Always set torch.manual_seed() and pass a seed parameter in production when you need reproducible outputs for debugging or regression testing. Log the seed used for each generation so you can replay failing cases.

:::
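The replay idea can be sketched without any ML dependencies. The example below uses Python's standard `random` module with a per-request generator instead of global state; `sample_token` and `generate` are illustrative names, and the three-token distribution is made up. In PyTorch the analogous pattern is a seeded `torch.Generator` passed to `torch.multinomial`.

```python
import random


def sample_token(probs, rng: random.Random) -> int:
    """Draw one index from a categorical distribution using the supplied generator."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(probs, seed: int, length: int = 10):
    """Sample `length` tokens; the seed fully determines the result."""
    rng = random.Random(seed)  # local generator: no hidden global state
    return [sample_token(probs, rng) for _ in range(length)]


probs = [0.5, 0.3, 0.2]  # made-up fixed distribution
run_a = generate(probs, seed=1234)
run_b = generate(probs, seed=1234)
print(run_a == run_b)  # True: same seed, same draws
```

A per-request generator also avoids one request's sampling perturbing another's, which a single global seed cannot guarantee under concurrency.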

Interview Questions

Q1: What is the difference between temperature scaling and top-P sampling? Can you use both?

Temperature scaling reshapes the entire probability distribution by dividing logits by $T$ before softmax - low temperature concentrates mass on likely tokens, high temperature spreads it. Top-P filtering removes tokens from the tail of the distribution by keeping only a minimal nucleus whose probabilities sum to at least $P$ . They address different problems: temperature controls overall sharpness, top-P controls tail truncation. They can and should be used together - temperature first to reshape the distribution, then top-P to remove the long tail. Standard production settings like T=0.8, P=0.9, K=50 combine all three.

Q2: Why does top-K with a fixed K fail on distributions with different shapes?

Top-K always keeps exactly $K$ tokens regardless of the distribution's shape. When the model is confident (one token has 95% probability), K=50 keeps 49 tokens that together have only 5% probability - adding noise without benefit. When the model is uncertain (each token has ~2% probability), K=50 might cut off many reasonable candidates. Top-P adapts the nucleus size to the distribution: for a confident distribution, P=0.9 might include only 1–3 tokens; for an uncertain distribution, P=0.9 might include 45+ tokens. This adaptive behavior is why top-P generally produces better text than top-K alone.

Q3: What is beam search and when is it better than sampling?

Beam search maintains the $B$ highest-scoring partial sequences simultaneously. At each step, it expands every beam, computes scores for all next tokens, and keeps the top $B$ candidates across all expansions. It maximizes the probability of the full sequence (approximately). Beam search is better than sampling for tasks with clear correct answers: machine translation, structured generation (SQL, regex), extractive summarization. It fails for open-ended generation because it produces overly safe, repetitive, generic text - the "beam search degeneracy" problem. Sampling is better for chat, creative writing, and any task where multiple good outputs exist.

Q4: What is the effect of temperature = 0 vs temperature approaching 0?

Temperature = 0 is undefined mathematically (division by zero in logit scaling) but is conventionally interpreted as argmax (greedy decoding) - always pick the highest-probability token. Temperature approaching 0 from above concentrates probability on the top token, approaching 100% as T approaches 0. In practice, frameworks implement temperature 0 as argmax directly rather than computing softmax of logits/epsilon, which also sidesteps numerical issues from dividing by a tiny number. Barring exact ties between the top logits, the two give the same outputs.
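The limit can be checked numerically. This plain-Python sketch (illustrative logits, standard library only) shows the top-token probability climbing toward 1 as T shrinks, and compares the result with a direct argmax.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Stable softmax of logits / temperature (temperature must be > 0)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [3.2, 1.8, 1.1]  # illustrative values
for t in (1.0, 0.5, 0.1, 0.01):
    top = softmax_with_temperature(logits, t)[0]
    print(f"T={t}: top-token probability {top:.6f}")

# Greedy decoding skips the softmax entirely and takes the argmax
argmax_index = max(range(len(logits)), key=logits.__getitem__)
print("argmax index:", argmax_index)
```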

Q5: If a model generates repetitive text in a loop, what is the correct fix?

First diagnose the cause: is it low temperature (greedy falls into loops), missing repetition penalty, or poor top-K/P settings? For a chat model, try temperature 0.7–0.8 + top-P 0.9 + repetition penalty 1.1. Repetition penalty specifically reduces the logit of tokens already in the context - effective for "the...the...the" loops. For serious repetition degeneration (generating the same phrase hundreds of times), there is likely a prompt or context issue. Ensure the KV cache is not corrupted and that the context window has not wrapped around in a way that creates feedback loops.

Q6: How would you tune sampling parameters for a medical Q&A application vs a creative writing assistant?

Medical Q&A: temperature 0.1–0.2 (need accurate, conservative answers), top-K 10–20, top-P 0.9, no repetition penalty, do_sample=True or even greedy if determinism is required. The risk of hallucination is high with creative sampling; you want the model to stay close to its highest-confidence outputs. Creative writing: temperature 0.9–1.1, top-K disabled (rely on top-P), top-P 0.95, repetition penalty 1.2. Diversity and surprise are valued. You want the model to explore lower-probability but coherent continuations. The right settings reflect the cost of each type of error: for medical use, a surprising but wrong fact is dangerous; for fiction, a predictable phrase is the failure mode.
