Sampling Strategies: Temperature, Top-K, Top-P
The Production Scenario
Your AI writing assistant has been live for six months. The creative writing feature is getting complaints: outputs are repetitive, bland, always predictable. You look at the configuration and discover someone set temperature to 0.1 - probably from a "temperature = 0 for consistency" cargo cult rule applied without thinking. You bump it to 0.8 and the writing immediately becomes richer. Then you get a different complaint: the factual Q&A feature is now occasionally generating plausible-sounding nonsense, making things up with the same creative flair that helps fiction writing.
You realize there is no universal best setting. Every application needs a different point in the creativity-accuracy trade-off space. And you realize you do not actually know how these parameters interact mechanically - just that "higher = more random."
This lesson builds the precise mechanical understanding of what each parameter does to the probability distribution. Once you understand the mechanics, the right settings for every use case become obvious rather than trial-and-error.
The key insight is that temperature, top-K, and top-P are all different ways of reshaping the same probability distribution before sampling. Temperature scales the logits. Top-K zeroes out all but the highest-probability tokens. Top-P finds a minimal set of tokens whose probabilities sum to a threshold. These are independent operations that stack together, each addressing a different failure mode of pure random sampling.
Why This Exists: The Problems With Naive Approaches
Greedy Decoding Fails
The simplest approach: always pick the highest-probability token. Deterministic, fast, reproducible. But it produces degenerate outputs for any creative task:
Prompt: "The sun rose over the mountains and..."
Greedy: "the mountains and the mountains and the mountains and the mountains..."
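Greedy decoding falls into repetitive loops because once you generate "the mountains," it becomes the highest-probability next token in context, creating a feedback loop. Holtzman et al. (2020) call this neural text degeneration. It is distinct from exposure bias, which is the mismatch between training on ground-truth prefixes and generating from the model's own outputs.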
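A toy illustration of that feedback loop, assuming a hypothetical hand-written bigram table (`NEXT_TOKEN_PROBS`, with made-up probabilities) in place of a real model. Because greedy decoding follows the single most likely successor at every step, it repeats the same cycle as soon as it revisits a token:

```python
# Hypothetical bigram "model": P(next | current). All values are made up.
NEXT_TOKEN_PROBS = {
    "and": {"the": 0.6, "a": 0.3, "then": 0.1},
    "the": {"mountains": 0.5, "sun": 0.3, "valley": 0.2},
    "mountains": {"and": 0.7, "rose": 0.2, ".": 0.1},
}


def greedy_decode(start: str, steps: int) -> list[str]:
    """Always follow the highest-probability successor (argmax)."""
    tokens = [start]
    for _ in range(steps):
        successors = NEXT_TOKEN_PROBS.get(tokens[-1])
        if successors is None:  # token has no outgoing entries in the toy table
            break
        tokens.append(max(successors, key=successors.get))
    return tokens


output = greedy_decode("and", steps=8)
print(" ".join(output))
```

A real model conditions on the full context rather than only the previous token, so this is a simplification, but the mechanism is the same: a deterministic argmax has no way to leave a cycle once it enters one.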
Pure Random Sampling Fails Too
Sampling from the raw softmax distribution solves repetition but introduces incoherence. The vocabulary has 32,000–128,000 tokens. Even an "unlikely" token with probability 0.001% gets sampled eventually. After "The capital of France is," you want "Paris" - but pure sampling might occasionally produce "banana" or "quantum" just because they have tiny but nonzero probability.
The solution is to reshape the distribution before sampling so that clearly wrong tokens get little or no probability. Top-K and top-P do this by truncation (zeroing out the tail); temperature does it by sharpening or flattening the whole distribution.
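A back-of-the-envelope sketch of why the tail matters even when each tail token is individually negligible. It assumes, as a simplification, that the total tail mass is the same at every step and that steps are independent; the tail masses and the 200-token response length are illustrative numbers, not measurements:

```python
def prob_at_least_one_tail_token(tail_mass: float, num_tokens: int) -> float:
    """P(at least one draw lands in the tail) across independent draws."""
    return 1.0 - (1.0 - tail_mass) ** num_tokens


for tail_mass in (0.001, 0.01, 0.02):
    p = prob_at_least_one_tail_token(tail_mass, num_tokens=200)
    print(f"tail mass {tail_mass:.1%} per step -> {p:.1%} chance of a tail token in 200 steps")
```

Under these assumptions, a tail that holds only 1% of the mass per step still produces at least one tail token in roughly 87% of 200-token responses.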
Historical Context
Early neural language models used beam search almost exclusively. Beam search maintains the $B$ highest-probability partial sequences simultaneously (where $B$ is the beam width) and was considered the gold standard for quality. It was the dominant decoding strategy from the seq2seq era (2014–2018).
The turning point was Holtzman et al. (2020), "The Curious Case of Neural Text Degeneration." They showed empirically that human text does not maximize probability - humans often use surprising but coherent words. Beam search produces text that is too predictable, too "safe," and often repetitive. They introduced top-P (nucleus) sampling and showed it produces more human-like text by multiple evaluation metrics.
Temperature scaling was used even earlier in the context of language model training (it comes from statistical mechanics, where temperature controls the randomness of a Boltzmann distribution). Top-K sampling was a natural precursor to top-P. The min-P sampling approach emerged around 2023 as a refinement that handles some failure modes of both top-K and top-P.
The Logit Distribution
Understanding sampling starts with the model's raw output: logits.
For each decode step, the model outputs a vector of unnormalized scores (logits) $z \in \mathbb{R}^{V}$, where $V$ is the vocabulary size (typically 32,000–128,000). The probability of token $i$ is:
$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}$$
All sampling strategies manipulate these logits or the resulting probabilities before the final sampling step.
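A minimal pure-Python sketch of the softmax step, written to show the usual max-subtraction trick. The function name and the example logits are illustrative; subtracting the maximum logit does not change the result but keeps `math.exp` from overflowing on large logits:

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: shifting all logits by a constant leaves it unchanged."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([3.2, 1.8, 1.1, 0.5])
print([round(p, 3) for p in probs])

# Without the max subtraction, math.exp(1000.0) would raise OverflowError.
print(softmax([1000.0, 999.0]))
```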
Temperature Scaling
Temperature is applied by dividing the logits by $T$ before the softmax:
$$p_i = \frac{e^{z_i / T}}{\sum_{j=1}^{V} e^{z_j / T}}$$
What temperature does:
- $T \to 0$: Logit differences are magnified without bound. The highest logit dominates completely. Approaches greedy (argmax).
- $T = 1$: No change. Use the raw model probabilities.
- $T > 1$: Logits compressed toward zero. Distribution flattens - more tokens become equiprobable.
- $T \to \infty$: Uniform distribution over all tokens. Completely random.
```python
import torch
import torch.nn.functional as F


def apply_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Apply temperature scaling to logits. Temperature=0 gives greedy (argmax)."""
    if temperature <= 0:
        # Temperature 0 = argmax (greedy)
        one_hot = torch.zeros_like(logits)
        one_hot[logits.argmax()] = 1.0
        return one_hot
    return F.softmax(logits / temperature, dim=-1)


def visualize_temperature_effect():
    """
    Show how temperature reshapes the probability distribution.
    Uses a simplified vocabulary of 10 tokens.
    """
    # Simulate logits for a small vocabulary
    logits = torch.tensor([3.2, 1.8, 1.1, 0.5, 0.3, -0.2, -0.5, -1.0, -2.0, -3.0])
    temperatures = [0.1, 0.5, 1.0, 1.5, 2.0]
    token_labels = [f"tok_{i}" for i in range(len(logits))]

    print("Temperature effect on probability distribution:")
    print(f"{'Token':>10}", end="")
    for T in temperatures:
        print(f" T={T:3.1f}", end="")
    print()
    print("-" * 60)

    for i, label in enumerate(token_labels):
        print(f"{label:>10}", end="")
        for T in temperatures:
            prob = apply_temperature(logits, T)[i].item()
            print(f" {prob:.3f}", end="")
        print()
    # The key insight: at T=0.1, almost all probability mass is on tok_0.
    # At T=2.0, the distribution is much flatter.
```
Expected output (partial):
```text
     Token T=0.1 T=0.5 T=1.0 T=1.5 T=2.0
------------------------------------------------------------
     tok_0 1.000 0.921 0.636 0.445 0.342
     tok_1 0.000 0.056 0.157 0.175 0.170
     tok_2 0.000 0.014 0.078 0.110 0.120
     tok_3 0.000 0.004 0.043 0.074 0.089
```
Recommended temperature settings by task:
| Task | Temperature | Rationale |
|---|---|---|
| Factual Q&A, extraction | 0.0–0.2 | Need deterministic, correct answers |
| Code generation | 0.1–0.4 | Syntax must be correct; small creativity OK |
| Summarization | 0.3–0.6 | Mostly faithful, some paraphrase variety |
| Chat/conversation | 0.7–0.9 | Natural, not robotic |
| Creative writing | 0.8–1.2 | Variety and surprise valued |
| Brainstorming | 1.0–1.5 | Diversity of ideas wanted |
Top-K Sampling
Top-K sampling zeroes out the probability of all tokens except the $K$ highest-probability ones, then renormalizes and samples from the truncated distribution:
```python
def top_k_sampling(logits: torch.Tensor, k: int, temperature: float = 1.0) -> int:
    """
    Sample from the top-K highest probability tokens.

    Args:
        logits: Raw unnormalized scores [vocab_size]
        k: Number of top tokens to keep
        temperature: Applied before filtering

    Returns:
        Sampled token index
    """
    # Apply temperature first
    scaled_logits = logits / max(temperature, 1e-8)

    # Zero out all but top-K
    if k > 0 and k < logits.shape[-1]:
        top_k_values, _ = torch.topk(scaled_logits, k)
        min_top_k = top_k_values[..., -1, None]  # Threshold value
        # Replace below-threshold with -inf (becomes 0 after softmax)
        scaled_logits = scaled_logits.masked_fill(
            scaled_logits < min_top_k, float('-inf')
        )

    # Sample from the filtered distribution
    probs = F.softmax(scaled_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()


def demonstrate_top_k_problem():
    """
    Show the key problem with top-K: K is fixed regardless of distribution shape.
    """
    torch.manual_seed(42)

    # Case 1: Confident distribution (one token dominates)
    # Top-50 would include 49 clearly wrong tokens
    confident_logits = torch.tensor(
        [5.0] + [-2.0] * 99  # One dominant token
    )
    probs_confident = F.softmax(confident_logits, dim=-1)

    # Case 2: Uncertain distribution (many reasonable tokens)
    uncertain_logits = torch.randn(100)
    probs_uncertain = F.softmax(uncertain_logits, dim=-1)

    print("Problem with Top-K: K=10 on different distributions")
    print()
    print("Confident distribution (one clear winner):")
    top_probs, _ = torch.topk(probs_confident, 10)
    print(f" Top-10 tokens cover {top_probs.sum():.1%} of probability mass")
    print(f" Top-1 token has {top_probs[0]:.1%} of mass")
    print(f" → K=10 includes 9 nearly-impossible tokens")
    print()
    print("Uncertain distribution (many reasonable choices):")
    top_probs2, _ = torch.topk(probs_uncertain, 10)
    print(f" Top-10 tokens cover {top_probs2.sum():.1%} of probability mass")
    print(f" Top-1 token has {top_probs2[0]:.1%} of mass")
    print(f" → K=10 excludes many reasonable options")
```
The problem with top-K: $K$ is a fixed count, but the "right" number of candidates varies dramatically with distribution shape. When the model is confident (steep distribution), K=50 includes 49 tokens that should never be sampled. When the model is uncertain (flat distribution), K=50 might cut off many reasonable alternatives. Top-P solves exactly this problem.
Top-P (Nucleus) Sampling
Introduced by Holtzman et al. (2020), top-P sampling dynamically selects a minimal set of tokens whose cumulative probability exceeds a threshold $p$:
- Sort tokens by probability in descending order
- Accumulate probabilities until the sum exceeds $p$
- Include only those tokens (the "nucleus")
- Renormalize and sample
```python
def top_p_sampling(logits: torch.Tensor, p: float, temperature: float = 1.0) -> int:
    """
    Nucleus (top-P) sampling: sample from minimal token set covering probability p.

    Args:
        logits: Raw unnormalized scores [vocab_size]
        p: Cumulative probability threshold (e.g., 0.9)
        temperature: Applied before filtering

    Returns:
        Sampled token index
    """
    # Apply temperature
    scaled_logits = logits / max(temperature, 1e-8)
    probs = F.softmax(scaled_logits, dim=-1)

    # Sort probabilities in descending order
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)

    # Compute cumulative probabilities
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # Find the cutoff: remove tokens once cumulative prob exceeds p.
    # Subtracting each token's own probability compares against the mass
    # *before* it, so the first token that crosses p is kept.
    sorted_indices_to_remove = cumulative_probs - sorted_probs > p
    # Always keep the top token (never remove if only one token)
    sorted_indices_to_remove[0] = False

    # Scatter back to original ordering
    indices_to_remove = sorted_indices_to_remove.scatter(
        0, sorted_indices, sorted_indices_to_remove
    )

    # Zero out removed tokens
    filtered_logits = scaled_logits.masked_fill(indices_to_remove, float('-inf'))
    final_probs = F.softmax(filtered_logits, dim=-1)
    return torch.multinomial(final_probs, num_samples=1).item()


def compare_topk_vs_topp():
    """
    Show how top-P adapts to distribution shape while top-K does not.
    """
    torch.manual_seed(0)

    # Confident distribution
    logits_conf = torch.tensor([4.0, 1.0] + [-3.0] * 98)
    probs_conf = F.softmax(logits_conf, dim=-1)

    # Uncertain distribution
    logits_unc = torch.tensor([1.5, 1.4, 1.3, 1.2, 1.1, 1.0] + [0.0] * 94)
    probs_unc = F.softmax(logits_unc, dim=-1)

    print("Top-K (K=10) vs Top-P (P=0.9) on different distributions:")
    print()
    for name, probs in [("Confident", probs_conf), ("Uncertain", probs_unc)]:
        # Mass covered by the top-K candidates
        top_10_mass = torch.topk(probs, 10).values.sum().item()

        # Count top-P candidates (P=0.9)
        sorted_p, _ = torch.sort(probs, descending=True)
        cumsum = torch.cumsum(sorted_p, dim=-1)
        n_nucleus = (cumsum < 0.9).sum().item() + 1

        print(f"{name} distribution:")
        print(f" Top-K (K=10) covers: {top_10_mass:.1%} of probability")
        print(f" Top-P (P=0.9) nucleus size: {n_nucleus} tokens")
        print()
```
Why top-P is adaptive:
- When the model is confident: top token has 95% probability. Nucleus at P=0.9 includes just 1 token. Top-K with K=50 would unnecessarily include 49 low-probability tokens.
- When the model is uncertain: top 50 tokens each have ~2% probability. Nucleus at P=0.9 includes 45 tokens. Top-K with K=10 would artificially restrict to only 10.
Min-P Sampling
Min-P (2023) is a newer alternative that filters tokens below a fraction $p_{\min}$ of the maximum token probability:
$$\text{threshold} = p_{\min} \cdot \max_i p_i$$
Tokens with probability below this threshold are removed.
```python
def min_p_sampling(logits: torch.Tensor, min_p: float, temperature: float = 1.0) -> int:
    """
    Min-P sampling: remove tokens below min_p * max_token_probability.
    More stable than top-P for high temperatures.

    Args:
        logits: Raw unnormalized scores [vocab_size]
        min_p: Minimum probability fraction relative to top token (e.g., 0.05)
        temperature: Applied before filtering
    """
    scaled_logits = logits / max(temperature, 1e-8)
    probs = F.softmax(scaled_logits, dim=-1)

    # Scale threshold relative to top token probability
    max_prob = probs.max()
    threshold = min_p * max_prob

    # Zero out tokens below threshold
    filtered_probs = probs.masked_fill(probs < threshold, 0.0)

    # Renormalize
    filtered_probs = filtered_probs / filtered_probs.sum()
    return torch.multinomial(filtered_probs, num_samples=1).item()
```
Min-P behaves better at high temperatures because the threshold scales with the top token's probability. When the model is very uncertain (flat distribution), the threshold is low, keeping many candidates. When the model is very confident, the threshold is high, keeping only the top options.
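A small pure-Python sketch of that claim, counting how many tokens survive each filter at two temperatures. The logits are made up for illustration (three plausible tokens plus a 1,000-token tail), and `top_p_kept` and `min_p_kept` are hypothetical helper names:

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_kept(probs, p):
    """Size of the smallest prefix (by descending probability) with mass >= p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)


def min_p_kept(probs, min_p):
    """Number of tokens at or above min_p * (top token probability)."""
    threshold = min_p * max(probs)
    return sum(1 for prob in probs if prob >= threshold)


# Made-up logits: three plausible tokens and a long tail of unlikely ones.
logits = [5.0, 4.5, 4.0] + [-4.0] * 1000
for T in (1.0, 2.0):
    probs = softmax(logits, T)
    print(f"T={T}: top-P(0.9) keeps {top_p_kept(probs, 0.9)}, "
          f"min-P(0.1) keeps {min_p_kept(probs, 0.1)}")
```

On this toy distribution, raising the temperature to 2.0 pushes enough mass into the tail that a 0.9 nucleus has to admit several hundred tail tokens, while the min-P filter still keeps only the three plausible candidates because each tail token remains far below 10% of the top token's probability.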
Repetition Penalty
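Repetition penalty rescales the logits of tokens that already appear in the context so they become less likely. With a penalty greater than 1, positive logits are divided by the penalty and negative logits are multiplied by it, so the logit always moves downward: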
```python
def apply_repetition_penalty(
    logits: torch.Tensor,
    input_ids: torch.Tensor,
    penalty: float = 1.3
) -> torch.Tensor:
    """
    Apply repetition penalty to logits.
    Tokens that appeared in input_ids get their logits scaled down.

    Args:
        logits: Raw logits [vocab_size]
        input_ids: Previously generated token IDs [seq_len]
        penalty: > 1.0 discourages repetition (1.0 = no effect)
    """
    if penalty == 1.0:
        return logits

    # Work on a copy so the caller's tensor is not modified in place
    logits = logits.clone()

    # Get unique tokens from previous context
    unique_tokens = set(input_ids.tolist())
    for token_id in unique_tokens:
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits
```
Recommended penalty values:
- 1.0: No penalty (default)
- 1.1–1.2: Mild - good for most chat
- 1.3–1.5: Aggressive - use for long creative text
- Above 1.5: Too aggressive - starts producing incoherent text
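To get a feel for these values, the following pure-Python sketch applies the same divide-or-multiply rule to a made-up four-token distribution in which token 0 has already been generated. The logits and the helper names (`softmax`, `penalize`) are illustrative assumptions:

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def penalize(logits, seen_ids, penalty):
    """Divide positive logits and multiply negative logits of seen tokens."""
    out = list(logits)
    for i in set(seen_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


logits = [3.0, 2.5, 1.0, -1.0]  # made-up logits; token 0 was already generated
for penalty in (1.0, 1.1, 1.3, 1.5, 2.0):
    p0 = softmax(penalize(logits, seen_ids=[0], penalty=penalty))[0]
    print(f"penalty={penalty:.1f} -> P(token 0) = {p0:.3f}")
```

Because the rule scales the logit rather than shifting it by a constant, the size of the effect depends on the logit's magnitude: the same penalty moves a large logit further than a small one.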
Beam Search
Beam search maintains the $B$ highest-scoring partial sequences simultaneously, where $B$ is the beam width:
```python
def beam_search(
    model,
    input_ids: torch.Tensor,
    beam_width: int = 4,
    max_new_tokens: int = 50,
    length_penalty: float = 1.0
) -> list:
    """
    Beam search: maintain B best sequences at each step.
    Expects input_ids to be a 1-D tensor of prompt token IDs.
    Returns list of (score, token_ids) tuples, sorted by score.
    """
    # Initialize with a single beam holding the prompt (cumulative log-prob 0.0)
    beams = [(0.0, input_ids.tolist())]

    for _ in range(max_new_tokens):
        all_candidates = []
        for score, seq in beams:
            # Get logits for this sequence
            with torch.no_grad():
                ids = torch.tensor([seq])
                outputs = model(ids)
                logits = outputs.logits[0, -1, :]  # Last token logits
                log_probs = F.log_softmax(logits, dim=-1)

            # Get top B next tokens for this beam
            top_log_probs, top_tokens = torch.topk(log_probs, beam_width)
            for log_prob, token in zip(top_log_probs, top_tokens):
                new_score = score + log_prob.item()
                new_seq = seq + [token.item()]
                all_candidates.append((new_score, new_seq))

        # Keep top B candidates (with length normalization)
        all_candidates.sort(
            key=lambda x: x[0] / (len(x[1]) ** length_penalty),
            reverse=True
        )
        beams = all_candidates[:beam_width]

    return beams
```
Beam search vs sampling:
| Aspect | Beam Search | Sampling (T+P) |
|---|---|---|
| Determinism | Yes (given same inputs) | No |
| Quality (factual) | Often better | Depends |
| Diversity | Low | High |
| Repetition risk | High (all beams similar) | Lower with penalty |
| Latency | B× slower | 1× |
| Use case | Translation, summarization | Chat, creative tasks |
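The `length_penalty` argument in the beam search code ranks candidates by `score / length ** length_penalty`. A small arithmetic sketch, using two made-up candidates, shows why this matters: summed log-probabilities only ever decrease as a sequence grows, so without normalization shorter sequences tend to win.

```python
def normalized_score(log_prob_sum: float, length: int, length_penalty: float) -> float:
    """Length-normalized beam score, matching the sort key used in beam search."""
    return log_prob_sum / (length ** length_penalty)


# (sum of log-probs, token count); both candidates are made up for illustration.
candidates = {"short": (-4.0, 4), "long": (-6.0, 8)}

for lp in (0.0, 1.0):
    best = max(candidates, key=lambda name: normalized_score(*candidates[name], lp))
    print(f"length_penalty={lp}: best = {best}")
```

With `length_penalty=0.0` the raw sums are compared (-4.0 beats -6.0, so the short candidate wins); with `length_penalty=1.0` the per-token averages are compared (-0.75 beats -1.0, so the long candidate wins).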
Combining Sampling Parameters
In practice, you combine multiple techniques. The HuggingFace generate() API applies them in this order:
- Apply repetition penalty to logits
- Apply temperature (divide logits by T)
- Apply top-K (zero out all but K highest)
- Apply top-P (zero out until cumulative mass exceeds P)
- Sample from remaining distribution
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


def generate_with_sampling(
    model_name: str,
    prompt: str,
    temperature: float = 0.8,
    top_k: int = 50,
    top_p: float = 0.9,
    repetition_penalty: float = 1.1,
    max_new_tokens: int = 200,
    num_return_sequences: int = 3
):
    """
    Generate text with configurable sampling parameters.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
            do_sample=True,  # Required for temperature/top-k/top-p
            num_return_sequences=num_return_sequences,
            pad_token_id=tokenizer.eos_token_id
        )

    results = []
    for output in outputs:
        # Decode only the new tokens (not the prompt)
        new_tokens = output[inputs["input_ids"].shape[1]:]
        text = tokenizer.decode(new_tokens, skip_special_tokens=True)
        results.append(text)
    return results


# Task-specific configurations
TASK_CONFIGS = {
    "factual_qa": {
        "temperature": 0.1,
        "top_k": 10,
        "top_p": 0.9,
        "repetition_penalty": 1.0,
        "do_sample": True
    },
    "coding": {
        "temperature": 0.2,
        "top_k": 40,
        "top_p": 0.95,
        "repetition_penalty": 1.05,
        "do_sample": True
    },
    "chat": {
        "temperature": 0.7,
        "top_k": 50,
        "top_p": 0.9,
        "repetition_penalty": 1.1,
        "do_sample": True
    },
    "creative_writing": {
        "temperature": 1.0,
        "top_k": 0,  # Disable top-K, rely on top-P
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "do_sample": True
    },
    "brainstorming": {
        "temperature": 1.2,
        "top_k": 0,
        "top_p": 0.98,
        "repetition_penalty": 1.3,
        "do_sample": True
    }
}
```
Implementing Full Sampling from Scratch
```python
import torch
import torch.nn.functional as F
from typing import Optional


def sample_next_token(
    logits: torch.Tensor,
    temperature: float = 1.0,
    top_k: int = 0,
    top_p: float = 1.0,
    min_p: float = 0.0,
    repetition_penalty: float = 1.0,
    previous_tokens: Optional[torch.Tensor] = None,
) -> int:
    """
    Complete sampling pipeline: temperature + top-K + top-P + min-P + repetition penalty.

    Args:
        logits: Raw model output [vocab_size]
        temperature: Scale factor (0 = greedy, 1 = no scaling, >1 = flatter)
        top_k: Keep only top K tokens (0 = disabled)
        top_p: Keep minimal nucleus covering probability P (1.0 = disabled)
        min_p: Filter tokens below min_p * max_prob (0.0 = disabled)
        repetition_penalty: Penalize previously used tokens (1.0 = no penalty)
        previous_tokens: Token IDs to penalize [seq_len]

    Returns:
        Sampled token index
    """
    # Step 1: Repetition penalty
    if repetition_penalty != 1.0 and previous_tokens is not None:
        logits = logits.clone()
        for token_id in set(previous_tokens.tolist()):
            if 0 <= token_id < len(logits):
                if logits[token_id] > 0:
                    logits[token_id] /= repetition_penalty
                else:
                    logits[token_id] *= repetition_penalty

    # Step 2: Temperature scaling (or greedy)
    if temperature <= 0:
        return logits.argmax().item()
    logits = logits / temperature

    # Step 3: Top-K filtering
    if top_k > 0:
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[-1]] = float('-inf')

    # Step 4: Convert to probabilities
    probs = F.softmax(logits, dim=-1)

    # Step 5: Min-P filtering (on probabilities, not logits)
    if min_p > 0:
        min_threshold = min_p * probs.max()
        probs[probs < min_threshold] = 0.0
        probs = probs / probs.sum()

    # Step 6: Top-P (nucleus) filtering
    if top_p < 1.0:
        sorted_probs, sorted_indices = torch.sort(probs, descending=True)
        cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
        # Remove tokens once cumulative mass exceeds p
        sorted_indices_to_remove = cumulative_probs - sorted_probs > top_p
        sorted_indices_to_remove[0] = False  # Always keep top token
        probs[sorted_indices[sorted_indices_to_remove]] = 0.0
        probs = probs / probs.sum()

    # Step 7: Sample
    return torch.multinomial(probs, num_samples=1).item()


def benchmark_sampling_methods(vocab_size: int = 32000, n_samples: int = 1000):
    """Compare output diversity across sampling methods."""
    import time
    from collections import Counter

    torch.manual_seed(42)
    logits = torch.randn(vocab_size)

    methods = {
        "Greedy (T=0)": lambda l: l.argmax().item(),
        "T=0.5, K=50, P=0.9": lambda l: sample_next_token(l, temperature=0.5, top_k=50, top_p=0.9),
        "T=1.0, K=50, P=0.9": lambda l: sample_next_token(l, temperature=1.0, top_k=50, top_p=0.9),
        "T=1.5, P=0.95": lambda l: sample_next_token(l, temperature=1.5, top_p=0.95),
    }

    print(f"{'Method':<30} {'Unique tokens':>15} {'Top-1 frequency':>18}")
    print("-" * 65)
    for name, method in methods.items():
        t0 = time.perf_counter()
        samples = [method(logits.clone()) for _ in range(n_samples)]
        elapsed = time.perf_counter() - t0
        counter = Counter(samples)
        unique = len(counter)
        top1_freq = counter.most_common(1)[0][1] / n_samples
        print(f"{name:<30} {unique:>15} {top1_freq:>17.1%}")
```
Contrastive Decoding
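Contrastive decoding (Li et al., 2022) improves quality by subtracting the log-probabilities of a weaker "amateur" model from those of the strong "expert" model:
$$\text{score}(x_t) = \log p_{\text{expert}}(x_t \mid x_{<t}) - \log p_{\text{amateur}}(x_t \mid x_{<t})$$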
The idea: tokens that the amateur model also assigns high probability to are generic, common tokens. Subtracting them out amplifies the expert model's unique knowledge. Applied to factual QA and reasoning, this reduces hallucination.
```python
def contrastive_decoding(
    expert_logits: torch.Tensor,
    amateur_logits: torch.Tensor,
    alpha: float = 0.1,
    temperature: float = 1.0
) -> int:
    """
    Contrastive decoding: amplify expert model's unique predictions.

    Args:
        expert_logits: Large model logits [vocab_size]
        amateur_logits: Small model logits [vocab_size]
        alpha: Threshold - only consider tokens where expert prob > alpha * max_expert_prob
        temperature: Temperature for final sampling
    """
    expert_log_probs = F.log_softmax(expert_logits, dim=-1)
    amateur_log_probs = F.log_softmax(amateur_logits, dim=-1)

    # Adaptive plausibility constraint: only consider tokens
    # where expert assigns reasonable probability
    expert_probs = expert_log_probs.exp()
    cutoff = alpha * expert_probs.max()
    valid_tokens = expert_probs >= cutoff

    # Contrastive score
    contrastive_scores = expert_log_probs - amateur_log_probs

    # Mask invalid tokens
    contrastive_scores[~valid_tokens] = float('-inf')

    # Sample from contrastive scores
    return sample_next_token(contrastive_scores, temperature=temperature)
```
Production Engineering Notes
A/B Testing Sampling Parameters
Never change sampling parameters in production without A/B testing:
```python
import hashlib


def get_sampling_config(user_id: str, task: str) -> dict:
    """
    Route users to sampling configurations based on user_id hash.
    Enables stable A/B testing - same user always gets same config.
    """
    # Hash user_id for stable routing
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100

    if hash_val < 50:
        # Control: current production config
        base = TASK_CONFIGS[task].copy()
        base["experiment"] = "control"
    else:
        # Treatment: candidate config
        base = TASK_CONFIGS[task].copy()
        base["temperature"] = base.get("temperature", 0.7) * 1.1
        base["experiment"] = "treatment_higher_temp"
    return base
```
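One possible refinement, sketched here under assumptions: if several experiments hash the bare `user_id`, the same users land in the treatment group every time, which correlates the experiments. Salting the hash with an experiment name decorrelates them. The function name `assign_bucket`, the experiment label, and the synthetic user IDs are all hypothetical:

```python
import hashlib


def assign_bucket(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically map (experiment, user) to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 100 < treatment_pct else "control"


# Sanity check on synthetic user IDs: the split should be close to the target.
users = [f"user_{i}" for i in range(10_000)]
share = sum(assign_bucket(u, "higher_temp_v1") == "treatment" for u in users) / len(users)
print(f"treatment share: {share:.1%}")
```

Checking the realized split before launching is cheap and catches routing mistakes such as an inverted comparison or a wrong modulus.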
Monitoring Output Quality
```python
from dataclasses import dataclass
from typing import List
import statistics


@dataclass
class SamplingMetrics:
    """Track metrics that correlate with sampling quality issues."""
    request_id: str
    task_type: str
    config_name: str
    output_tokens: int
    unique_tokens_ratio: float  # Repetition indicator
    avg_token_probability: float  # Hallucination indicator


def compute_sampling_metrics(
    token_ids: List[int],
    token_probs: List[float],
    request_id: str,
    task_type: str,
    config_name: str
) -> SamplingMetrics:
    unique_ratio = len(set(token_ids)) / max(len(token_ids), 1)
    avg_prob = statistics.mean(token_probs) if token_probs else 0.0
    return SamplingMetrics(
        request_id=request_id,
        task_type=task_type,
        config_name=config_name,
        output_tokens=len(token_ids),
        unique_tokens_ratio=unique_ratio,
        avg_token_probability=avg_prob
    )
```
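The unique-token ratio drops naturally on long outputs, because coherent text reuses common tokens. A complementary signal, sketched below as a suggestion rather than part of the metrics class above, is the fraction of repeated n-grams, which responds to looping phrases specifically. The function name and the sample token sequences are illustrative:

```python
def repeated_ngram_ratio(token_ids: list[int], n: int = 3) -> float:
    """Fraction of n-grams that duplicate an earlier n-gram (0.0 = no repetition)."""
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


looping = [5, 6, 7] * 10   # the same three tokens repeated ten times
varied = list(range(30))   # no token repeats at all
print(f"looping: {repeated_ngram_ratio(looping):.2f}")
print(f"varied:  {repeated_ngram_ratio(varied):.2f}")
```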
Common Mistakes
:::danger Setting temperature = 0 for all tasks
Temperature 0 (greedy decoding) is deterministic and fast, but degrades quality for any open-ended generation. It is appropriate for structured extraction (JSON, code with exact syntax), classification, or any task where there is exactly one right answer. For chat, summarization, or Q&A where multiple phrasings are acceptable, temperature 0 produces robotic, repetitive text. Always set non-zero temperature unless you specifically need determinism.
:::
:::danger Using top-K without top-P (or vice versa)
Top-K and top-P complement each other. Top-K alone fails on flat distributions (too restrictive) and steep distributions (not restrictive enough). Top-P alone can sometimes include too many tokens when the distribution is very flat (many near-equal probabilities all just below the P threshold). The standard production recipe is both: top-K=50 as a hard cap, top-P=0.9 as the dynamic nucleus. This prevents the edge cases of each method.
:::
:::warning Applying repetition penalty too aggressively
Repetition penalty above 1.5 causes incoherence. Natural text legitimately repeats tokens: articles, pronouns, and proper nouns recur throughout any coherent passage, and the penalty cannot tell this apart from a pathological loop. At 1.5 and above the model starts avoiding words it has already used, producing grammatically odd sentences. Values of 1.1–1.2 break pathological repetition loops without harming coherent repetition.
:::
:::warning Not seeding random state for reproducibility in testing
Even with fixed sampling parameters, results vary because sampling is stochastic. Always set torch.manual_seed() and pass a seed parameter in production when you need reproducible outputs for debugging or regression testing. Log the seed used for each generation so you can replay failing cases.
:::
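A minimal sketch of seeded sampling with a generator that is local to the call, so replaying one request does not disturb global RNG state shared with other code. The function name and the toy probabilities are assumptions; `torch.Generator` and the `generator` argument of `torch.multinomial` are standard PyTorch:

```python
import torch


def sample_with_seed(probs: torch.Tensor, seed: int, num_samples: int = 5) -> list[int]:
    """Draw samples from a generator seeded for this call only."""
    generator = torch.Generator().manual_seed(seed)
    draws = torch.multinomial(probs, num_samples, replacement=True, generator=generator)
    return draws.tolist()


probs = torch.tensor([0.5, 0.3, 0.15, 0.05])
first = sample_with_seed(probs, seed=1234)
replay = sample_with_seed(probs, seed=1234)  # e.g. the seed read back from a log
print(first, replay, first == replay)
```

The same seed reproduces the same draws on the same machine and library version. Bitwise reproducibility across hardware or PyTorch releases is not something to rely on, so it helps to log the environment alongside the seed.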
Interview Questions
Q1: What is the difference between temperature scaling and top-P sampling? Can you use both?
Temperature scaling reshapes the entire probability distribution by dividing logits by $T$ before softmax - low temperature concentrates mass on likely tokens, high temperature spreads it. Top-P filtering removes tokens from the tail of the distribution by keeping only a minimal nucleus whose probabilities sum to at least $p$. They address different problems: temperature controls overall sharpness, top-P controls tail truncation. They can and should be used together - temperature first to reshape the distribution, then top-P to remove the long tail. Standard production settings like T=0.8, P=0.9, K=50 combine all three.
Q2: Why does top-K with a fixed K fail on distributions with different shapes?
Top-K always keeps exactly $K$ tokens regardless of the distribution's shape. When the model is confident (one token has 95% probability), K=50 keeps 49 tokens that together have only 5% probability - adding noise without benefit. When the model is uncertain (each token has ~2% probability), K=50 might cut off many reasonable candidates. Top-P adapts the nucleus size to the distribution: for a confident distribution, P=0.9 might include only 1–3 tokens; for an uncertain distribution, P=0.9 might include 45+ tokens. This adaptive behavior is why top-P generally produces better text than top-K alone.
Q3: What is beam search and when is it better than sampling?
Beam search maintains the $B$ highest-scoring partial sequences simultaneously. At each step, it expands every beam, computes scores for all next tokens, and keeps the top $B$ candidates across all expansions. It maximizes the probability of the full sequence (approximately). Beam search is better than sampling for tasks with clear correct answers: machine translation, structured generation (SQL, regex), extractive summarization. It fails for open-ended generation because it produces overly safe, repetitive, generic text - the "beam search degeneracy" problem. Sampling is better for chat, creative writing, and any task where multiple good outputs exist.
Q4: What is the effect of temperature = 0 vs temperature approaching 0?
Temperature = 0 is undefined mathematically (division by zero in logit scaling) but is conventionally interpreted as argmax (greedy decoding) - always pick the highest-probability token. Temperature approaching 0 from above gives increasingly concentrated probability on the top token, approaching 100% as T approaches 0. In practice, frameworks implement temperature 0 as argmax directly rather than computing softmax of logits/epsilon. The outputs are identical in the limit, except when two logits tie exactly for the maximum: the limit splits probability evenly between them, while argmax picks one.
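The limit can be checked numerically with a few lines of pure Python. The logits are made up, and the helper name is illustrative; the max subtraction keeps the exponentials finite even at very small temperatures:

```python
import math


def softmax_with_temperature(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.0]
for T in (1.0, 0.5, 0.1, 0.01):
    top = softmax_with_temperature(logits, T)[0]
    print(f"T={T:<5} P(top token) = {top:.6f}")

# Exact tie for the maximum: the low-temperature limit splits the mass evenly.
print(softmax_with_temperature([1.0, 1.0], 0.01))
```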
Q5: If a model generates repetitive text in a loop, what is the correct fix?
First diagnose the cause: is it low temperature (greedy falls into loops), missing repetition penalty, or poor top-K/P settings? For a chat model, try temperature 0.7–0.8 + top-P 0.9 + repetition penalty 1.1. Repetition penalty specifically reduces the logit of tokens already in the context - effective for "the...the...the" loops. For serious repetition degeneration (generating the same phrase hundreds of times), there is likely a prompt or context issue. Ensure the KV cache is not corrupted and that the context window has not wrapped around in a way that creates feedback loops.
Q6: How would you tune sampling parameters for a medical Q&A application vs a creative writing assistant?
Medical Q&A: temperature 0.1–0.2 (need accurate, conservative answers), top-K 10–20, top-P 0.9, no repetition penalty, do_sample=True or even greedy if determinism is required. The risk of hallucination is high with creative sampling; you want the model to stay close to its highest-confidence outputs. Creative writing: temperature 0.9–1.1, top-K disabled (rely on top-P), top-P 0.95, repetition penalty 1.2. Diversity and surprise are valued. You want the model to explore lower-probability but coherent continuations. The right settings reflect the cost of each type of error: for medical use, a surprising but wrong fact is dangerous; for fiction, a predictable phrase is the failure mode.
