
Benchmarking Local Model Performance

The Inference Budget Conversation

It is a quarterly planning meeting at a mid-sized fintech company. The platform engineering team has been running a local LLM for six weeks - a Q4_K_M quantized Llama-3-8B on a single A10G GPU - handling document classification and summarization for the risk team. The model works. Users love it. Now the question on the whiteboard is: do we scale this up, or do we hit a wall?

The CTO asks: "How fast is it? Can it handle 50 concurrent analysts?" The team lead pulls up the dashboard. There is latency. There is uptime. There are error rates. There is no tokens-per-second metric. There is no time-to-first-token breakdown. There is no memory headroom number. There is no data showing what happens to throughput when context length doubles.

The meeting ends without an answer. Three engineers spend the next two weeks building out what should have been there from day one: a proper benchmarking harness. They discover the model runs at 42 tokens/second for generation but has a 3.8-second time-to-first-token for long documents. The analysts are waiting almost 4 seconds before they see any response - and the team had assumed the latency complaints were a UI issue.

This scenario repeats across engineering teams constantly. Everyone measures the wrong things at the wrong time - typically after users are already frustrated, not before deployment. Benchmarking is not an optional step you do when something feels slow. It is the instrument panel you build before you fly, so you can navigate with data instead of instinct.

This lesson builds that instrument panel. We cover every metric that matters, why each one matters differently depending on use case, how to collect measurements correctly (warm-up runs, statistical methods, thermal controls), and how to build a reusable harness that generates comparison tables you can actually put in front of stakeholders.

The numbers in this lesson come from real hardware. When we say Q4 is 1.8x faster than FP16 on a specific card, that is a measured number. Your numbers will differ - which is exactly why you need to run this harness yourself.


Why This Exists

The Problem: Intuition Is Wrong About LLM Performance

Before you have benchmarked a local LLM, your intuitions about performance are reliably wrong in specific ways.

You expect GPU to be faster than CPU. True, but how much faster depends on quantization level, model size, and context length - and for small models on Apple Silicon, the unified memory architecture makes the gap smaller than you would expect.

You expect higher-precision quantization (Q8 vs Q4) to be slower. True for tokens per second, but the relationship is not linear. Q5_K_M is often only 5-8% slower than Q4_K_M while being meaningfully more accurate. The quality-per-inference-cost curve has a sweet spot that varies by hardware.

You expect the model to run at the same speed throughout a conversation. Wrong. Time to first token increases dramatically with context length due to the prefill phase. A model that generates at 60 tokens/second can take 8 seconds to produce its first token for a 4096-token context. Users experience this as the model being slow even though the throughput is fine.

You expect memory usage to be fixed at model load time. Wrong. KV cache grows with context length. A 7B model loaded at 5.5 GB VRAM will consume 7-8 GB during a long conversation, and if you have multiple concurrent users, you can run out of VRAM unexpectedly mid-inference.

Benchmarking replaces these wrong intuitions with measured reality. The process is not complicated - but it requires discipline around measurement methodology that most engineers skip.

Why "It Feels Fast" Is Not Good Enough

Human perception of latency is non-linear and context-dependent. 500ms feels instant in a search bar, acceptable in a form submission, and painfully slow in a chat interface. "It feels fast" tells you nothing about whether it will still feel fast with 10 concurrent users, a 2000-token context, or on a Thursday afternoon when the GPU is thermal-throttling because the server room HVAC is underperforming.

Systematic benchmarking gives you numbers that let you reason about capacity, predict failure modes before they occur, and make quantitative trade-offs between model quality and inference cost.


Historical Context: How LLM Benchmarking Evolved

Early transformer benchmarks (2018-2020) focused almost entirely on model quality - GLUE scores, SQuAD F1, perplexity on WikiText. These are offline metrics computed on static datasets. They tell you nothing about runtime inference performance.

The need for inference benchmarks became urgent in 2022-2023 when running LLMs in production became feasible. The llama.cpp project (Georgi Gerganov, March 2023) shipped a built-in benchmark tool called llama-bench almost from the first release. This was deliberate - Gerganov was building a tool for running models on consumer hardware and needed to quantify the effect of different quantizations and threading configurations. llama-bench became the de facto standard for local LLM benchmarks.

vLLM (June 2023, UC Berkeley) introduced the concept of benchmarking LLMs as throughput-oriented systems rather than latency-oriented systems. Their benchmarking methodology - measuring requests per second across different batch sizes and concurrency levels - was borrowed directly from web server benchmarking tools like wrk and Apache Bench. This was the "aha moment": an LLM serving endpoint is a server, and should be benchmarked like one.

The current state is fragmented. llama-bench is the standard for single-user llama.cpp benchmarks. vLLM has its own benchmark suite. Ollama has basic built-in timing (via the --verbose flag) and ollama ps. There is no universal standard, which is why building your own harness - wrapping these tools into a consistent reporting format - is valuable.


Core Concepts and Metrics

The Four Metrics That Matter

Not all LLM performance metrics are equally important for all use cases. Understanding which metric to optimize is the first step.

Tokens Per Second (TPS) - Generation Throughput

TPS measures how quickly the model generates output tokens after the first token appears. This is what most people think of as "inference speed." It is measured in the steady state - after the initial prefill overhead.

For a single-user chat interface, TPS needs to be fast enough to not feel slow. Human reading speed is roughly 300 words per minute, or about 5 words per second, or about 6-7 tokens per second. A model generating at 30+ tokens/second will feel instantaneous because it outpaces reading speed. A model at 8-10 tokens/second is noticeably throttled.

For batch processing (summarizing 10,000 documents overnight), TPS is the primary cost driver. A 2x improvement in TPS means the batch job takes half as long.

The math for TPS is simple. If generating N tokens takes t seconds:

\text{TPS} = \frac{N}{t}

Time to First Token (TTFT) - Prefill Latency

TTFT is the time from when a request is submitted to when the first output token appears. It is dominated by the prefill phase: processing all input tokens through the full transformer stack.

Prefill cost grows with input length: roughly linearly while the feed-forward layers dominate, trending toward quadratic at long contexts because of attention. Optimizations like Flash Attention 2 improve the constants and memory footprint but do not remove this growth. Either way, TTFT grows with context length, while TPS largely does not. A rough first-order model:

\text{TTFT} \approx c \cdot n_{\text{input}} \cdot n_{\text{layers}} \cdot d_{\text{model}}

where c is a hardware-dependent constant, n_input is the input length in tokens, n_layers is the transformer depth, and d_model is the hidden dimension size.

For real-time chat, TTFT should be under 1 second ideally, under 2 seconds acceptably. For summarization of long documents where users expect to wait, 5-10 seconds TTFT is acceptable.

Memory Usage (VRAM + RAM)

Memory determines what models you can run on what hardware - it is the hard constraint everything else is built around. For GPU inference:

\text{VRAM required} \approx \text{model weights} + \text{KV cache} + \text{activations overhead}

Model weight size (in bytes) at different precisions:

\text{weight size} = \frac{\text{parameters} \times \text{bits per weight}}{8}

So a 7B parameter model at different precisions:

  • FP16 (16-bit): 7 \times 10^9 \times 2 \text{ bytes} = 14 \text{ GB}
  • INT8 (8-bit): 7 \times 10^9 \times 1 \text{ byte} = 7 \text{ GB}
  • Q4_K_M (~5 bits per weight effective): 7 \times 10^9 \times 0.625 \text{ bytes} \approx 4.4 \text{ GB}

KV cache size adds to this dynamically during inference:

\text{KV cache} = 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{bytes per element}

For Llama-3-8B (32 layers, 8 KV heads, head dim 128) at FP16, with 4096 context length:

2 \times 32 \times 8 \times 128 \times 4096 \times 2 \text{ bytes} = 536 \text{ MB}

This grows linearly with context length. At 8192 tokens, it doubles to ~1.07 GB.
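
Both formulas are easy to sanity-check in a few lines of Python. The sketch below reuses the Llama-3-8B-style numbers from the text above; swap in your own model's parameter count, layer count, KV-head count, and head dimension (the function names are purely illustrative):

# memory_estimate.py
# Rough VRAM estimate: weights + KV cache (activation overhead not included)

def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Model weight footprint in GB at a given effective precision."""
    return n_params * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_element: int = 2) -> float:
    """KV cache footprint in GB (keys + values) for a single sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_element / 1e9

if __name__ == "__main__":
    print(f"FP16 weights:    {weight_size_gb(7e9, 16):.1f} GB")   # ~14 GB
    print(f"Q4_K_M weights:  {weight_size_gb(7e9, 5):.1f} GB")    # ~4.4 GB at ~5 effective bits
    print(f"KV cache @ 4096: {kv_cache_gb(32, 8, 128, 4096):.2f} GB")  # ~0.54 GB
    print(f"KV cache @ 8192: {kv_cache_gb(32, 8, 128, 8192):.2f} GB")  # ~1.07 GB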

Power Draw (Watts)

Power matters for cost at scale. An A100 80GB draws 300-400W at full inference load. An RTX 4090 draws 400-450W. Apple M3 Max draws 30-60W total (for the whole chip, including GPU and CPU).

For continuous batch processing workloads where the model runs 24/7:

\text{daily energy (kWh)} = \frac{\text{watts} \times 24}{1000}

An A100 at 350W running 24 hours uses 8.4 kWh/day. At $0.10/kWh data center rates, that is $0.84/day, or about $307/year, just in electricity. For inference costs at scale, power-per-token is a key efficiency metric.
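
The same arithmetic as a quick helper, in case you want to fold it into a report (the electricity rate is an assumption; substitute your own):

def daily_energy_kwh(watts: float) -> float:
    """Energy used by a device drawing `watts` continuously for 24 hours."""
    return watts * 24 / 1000

watts = 350            # measured or nameplate draw
price_per_kwh = 0.10   # assumed data-center rate; replace with yours
kwh = daily_energy_kwh(watts)
print(f"{kwh:.1f} kWh/day -> ${kwh * price_per_kwh:.2f}/day, "
      f"${kwh * price_per_kwh * 365:.0f}/year")
# 8.4 kWh/day -> $0.84/day, $307/year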


Benchmarking Methodology

The Three Sins of Informal Benchmarking

Most informal benchmarks produce unreliable results because of three systematic errors.

Sin 1: No warm-up runs. The first inference call is always slower than subsequent ones. The model's weight tensors get loaded from disk into GPU memory (or RAM) during the first call. CPU caches are cold. The GPU's power management has not ramped up to full performance state (P0). A single-shot benchmark that does not include warm-up runs systematically overstates latency.

Sin 2: Single-sample measurements. A single measurement of "it took 3.2 seconds to generate 200 tokens" is not a reliable number. Variance from thermal state, OS scheduling, memory bandwidth contention, and other factors means a single number could be 20-30% away from the true steady-state mean. You need multiple runs and statistical analysis.

Sin 3: Not controlling for thermal throttling. Modern GPUs and CPUs reduce clock speeds when they get hot - this is called thermal throttling. A benchmark that runs for 30 minutes on a GPU that has been idle will show better performance in the first few minutes (before the GPU heats up) than in steady state. If you benchmark on a cold machine and deploy on a machine that runs 24/7, you will see worse production performance than your benchmark suggested.

Correct Methodology

Benchmark Protocol:
1. Warm up: Run 3-5 inference calls and discard results
2. Steady state: Wait for GPU temperature to plateau (nvidia-smi loop)
3. Sample: Run N=20+ iterations with identical prompts
4. Measure: Record wall-clock time using high-precision timer
5. Report: Mean, standard deviation, p50, p95, p99
6. Document: Hardware spec, driver version, model path, quantization

The minimum sample size for reliable mean estimates is around 10-20 runs for a metric with low variance (TPS). For TTFT, which can have higher variance, 30+ samples give a more stable estimate.
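
If you would rather derive the sample count from the data than pick a fixed number, one option is to keep sampling until a normal-approximation 95% confidence interval on the mean is tight enough. This is a heuristic sketch, not part of any standard tool; run_one_benchmark_iteration is a placeholder for your own measurement function:

import statistics

def mean_ci_width_pct(samples: list[float]) -> float:
    """Half-width of a ~95% normal-approximation CI on the mean, as a % of the mean."""
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / len(samples) ** 0.5  # standard error of the mean
    return 1.96 * sem / mean * 100

def enough_samples(samples: list[float], target_pct: float = 2.0) -> bool:
    """True once the mean is pinned down to within +/- target_pct percent."""
    return len(samples) >= 10 and mean_ci_width_pct(samples) <= target_pct

# Example: keep sampling until the TPS mean is known to within +/-2%
# while not enough_samples(tps_samples):
#     tps_samples.append(run_one_benchmark_iteration())  # placeholder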


Code: Benchmarking Tools

Tool 1: llama-bench (Built Into llama.cpp)

llama-bench is the fastest way to get a reliable TPS and TTFT measurement for any GGUF model. It handles warm-up automatically and reports statistics properly.

# Basic llama-bench usage
# Assumes llama.cpp is compiled and the llama-bench binary is in PATH

MODEL_PATH="/opt/models/gguf/llama-3.2-3b-instruct-Q4_K_M.gguf"

# Benchmark with default settings (prompt processing + token generation)
./llama-bench -m "$MODEL_PATH"

# More detailed: prompt length (-p, prefill tokens), generation length (-n),
# and repetitions (-r, default 5)
./llama-bench \
  -m "$MODEL_PATH" \
  -p 512 \
  -n 256 \
  -r 5 \
  --numa distribute  # NUMA-aware threading on multi-socket servers

# Output format: CSV for easy parsing, sweeping multiple prompt lengths
./llama-bench \
  -m "$MODEL_PATH" \
  -p "128,512,1024,2048" \
  -n 128 \
  -r 5 \
  -o csv > results.csv

Sample output:

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3.2 3B Q4_K - Medium | 1.93 GiB | 3.21 B | Metal | 1 | pp512 | 1623.81 ± 10.08 |
| llama 3.2 3B Q4_K - Medium | 1.93 GiB | 3.21 B | Metal | 1 | tg128 | 87.45 ± 0.53 |

The pp rows are prompt processing (prefill) throughput, which determines TTFT. The tg rows are token generation - what users experience as streaming speed.
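
If you capture the default human-readable output rather than CSV, a small parser can pull those rows into Python for further analysis. This is a sketch that assumes the table layout shown above, with the test name (pp512, tg128) and the "mean ± std" t/s value as the last two columns:

import re

def parse_llama_bench_markdown(text: str) -> list[dict]:
    """Parse rows like '| ... | pp512 | 1623.81 ± 10.08 |' from llama-bench output."""
    results = []
    for line in text.splitlines():
        match = re.search(r"\|\s*(pp|tg)(\d+)\s*\|\s*([\d.]+)\s*±\s*([\d.]+)\s*\|", line)
        if match:
            results.append({
                "phase": "prefill" if match.group(1) == "pp" else "generation",
                "tokens": int(match.group(2)),
                "tps_mean": float(match.group(3)),
                "tps_std": float(match.group(4)),
            })
    return results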

Tool 2: Python Benchmark Harness

For more control - custom prompts, multi-turn conversation simulation, comparison tables - a Python harness gives you flexibility that llama-bench does not:

# benchmark_harness.py
# Reusable benchmark harness for local LLM inference
# Supports: transformers (HuggingFace), llama-cpp-python, ollama

import time
import statistics
import json
from dataclasses import dataclass, field, asdict
from typing import Optional
from datetime import datetime

import psutil

try:
    import pynvml
    pynvml.nvmlInit()
    NVIDIA_AVAILABLE = True
except Exception:
    NVIDIA_AVAILABLE = False


@dataclass
class BenchmarkConfig:
    model_name: str
    backend: str          # "transformers", "llamacpp", "ollama"
    quantization: str     # e.g. "Q4_K_M", "FP16", "INT8"
    prompt_tokens: int = 512
    generation_tokens: int = 256
    n_warmup: int = 3
    n_samples: int = 20
    device: str = "cuda"  # "cuda", "cpu", "mps"


@dataclass
class BenchmarkResult:
    config: BenchmarkConfig
    timestamp: str
    ttft_ms: list[float] = field(default_factory=list)
    tps: list[float] = field(default_factory=list)
    vram_peak_gb: list[float] = field(default_factory=list)

    def summary(self) -> dict:
        def stats(values: list[float]) -> dict:
            if not values:
                return {}
            sorted_v = sorted(values)
            n = len(sorted_v)
            return {
                "mean": round(statistics.mean(values), 3),
                "std": round(statistics.stdev(values) if len(values) > 1 else 0, 3),
                "p50": round(sorted_v[n // 2], 3),
                "p95": round(sorted_v[int(n * 0.95)], 3),
                "p99": round(sorted_v[int(n * 0.99)], 3),
                "min": round(min(values), 3),
                "max": round(max(values), 3),
            }

        return {
            "model": self.config.model_name,
            "backend": self.config.backend,
            "quantization": self.config.quantization,
            "prompt_tokens": self.config.prompt_tokens,
            "generation_tokens": self.config.generation_tokens,
            "n_samples": len(self.tps),
            "ttft_ms": stats(self.ttft_ms),
            "tps": stats(self.tps),
            "vram_peak_gb": stats(self.vram_peak_gb),
        }


def get_vram_usage_gb() -> float:
    """Get current GPU memory usage in GB."""
    if NVIDIA_AVAILABLE:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return info.used / 1e9
    else:
        # Apple Silicon: total RAM used (unified memory)
        return psutil.virtual_memory().used / 1e9


def benchmark_transformers(
    model,
    tokenizer,
    config: BenchmarkConfig,
    prompt: str
) -> BenchmarkResult:
    """
    Benchmark a HuggingFace transformers model.
    """
    import torch

    result = BenchmarkResult(
        config=config,
        timestamp=datetime.utcnow().isoformat()
    )

    # Tokenize once - same input for all runs
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    actual_prompt_tokens = inputs.input_ids.shape[1]

    print(f"  Prompt length: {actual_prompt_tokens} tokens")
    print(f"  Generation length: {config.generation_tokens} tokens")
    print(f"  Warming up ({config.n_warmup} runs) ...")

    # Warm-up runs - discard results
    for _ in range(config.n_warmup):
        with torch.no_grad():
            _ = model.generate(
                **inputs,
                max_new_tokens=config.generation_tokens,
                do_sample=False,  # deterministic for benchmarking
                pad_token_id=tokenizer.eos_token_id,
            )

    print(f"  Sampling ({config.n_samples} runs) ...")

    for i in range(config.n_samples):
        # Full generation run for throughput
        start_time = time.perf_counter()
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=config.generation_tokens,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
            )
        end_time = time.perf_counter()
        vram_after = get_vram_usage_gb()

        # Note: total_time includes the prefill, so this slightly understates
        # pure decode speed for long prompts
        total_time = end_time - start_time
        n_generated = output.shape[1] - inputs.input_ids.shape[1]
        tps_val = n_generated / total_time

        # TTFT: time a 1-token generation (prefill + first decode step)
        ttft_start = time.perf_counter()
        with torch.no_grad():
            _ = model.generate(
                **inputs,
                max_new_tokens=1,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
            )
        ttft_end = time.perf_counter()

        result.ttft_ms.append((ttft_end - ttft_start) * 1000)
        result.tps.append(tps_val)
        result.vram_peak_gb.append(vram_after)

        if (i + 1) % 5 == 0:
            print(f"  {i+1}/{config.n_samples} - TPS: {tps_val:.1f}")

    return result


def benchmark_llamacpp(
    model_path: str,
    config: BenchmarkConfig,
    prompt: str
) -> BenchmarkResult:
    """
    Benchmark via llama-cpp-python bindings.
    """
    from llama_cpp import Llama

    result = BenchmarkResult(
        config=config,
        timestamp=datetime.utcnow().isoformat()
    )

    # Load model - time this separately as it's not part of the inference benchmark
    load_start = time.perf_counter()
    llm = Llama(
        model_path=model_path,
        n_gpu_layers=-1,  # offload all layers to GPU
        n_ctx=4096,
        verbose=False,
    )
    load_time = time.perf_counter() - load_start
    print(f"  Model load time: {load_time:.2f}s")

    # Tokenize to check prompt length
    tokens = llm.tokenize(prompt.encode())
    print(f"  Prompt length: {len(tokens)} tokens")

    print(f"  Warming up ({config.n_warmup} runs) ...")
    for _ in range(config.n_warmup):
        _ = llm(
            prompt,
            max_tokens=config.generation_tokens,
            echo=False,
            temperature=0,
        )

    print(f"  Sampling ({config.n_samples} runs) ...")
    for i in range(config.n_samples):
        llm.reset()  # drop cached prompt state so each run pays the full prefill cost

        # Stream the completion so the first chunk gives a real TTFT measurement
        gen_start = time.perf_counter()
        first_token_time = None
        n_generated = 0
        for _chunk in llm(
            prompt,
            max_tokens=config.generation_tokens,
            echo=False,
            temperature=0,
            stream=True,
        ):
            if first_token_time is None:
                first_token_time = time.perf_counter()
                result.ttft_ms.append((first_token_time - gen_start) * 1000)
            n_generated += 1  # each streamed chunk is roughly one token
        gen_end = time.perf_counter()

        if n_generated == 0:
            continue  # nothing generated; skip this sample

        # Throughput over the whole call (prefill included), matching the
        # transformers harness above
        tps_val = n_generated / (gen_end - gen_start)

        result.tps.append(tps_val)
        result.vram_peak_gb.append(get_vram_usage_gb())

        if (i + 1) % 5 == 0:
            print(f"  {i+1}/{config.n_samples} - TPS: {tps_val:.1f}")

    return result


def print_results_table(results: list[BenchmarkResult]):
    """Print a formatted comparison table of benchmark results."""
    print("\n" + "=" * 90)
    print(f"{'Model':<30} {'Quant':<10} {'TPS (mean)':<12} {'TPS (std)':<10} {'TTFT ms':<12} {'VRAM GB':<10}")
    print("-" * 90)

    for r in results:
        s = r.summary()
        print(
            f"{s['model']:<30} "
            f"{s['quantization']:<10} "
            f"{s['tps']['mean']:<12.1f} "
            f"{s['tps']['std']:<10.1f} "
            f"{s['ttft_ms']['mean']:<12.0f} "
            f"{s['vram_peak_gb']['mean']:<10.2f}"
        )
    print("=" * 90)


def save_results(results: list[BenchmarkResult], output_path: str):
    """Save results as JSON for further analysis."""
    data = [r.summary() for r in results]
    with open(output_path, "w") as f:
        json.dump(data, f, indent=2)
    print(f"\nResults saved to {output_path}")


# Example usage:
if __name__ == "__main__":
    # This example shows the pattern - actual model loading depends on your backend
    config = BenchmarkConfig(
        model_name="llama-3.2-3b-instruct",
        backend="llamacpp",
        quantization="Q4_K_M",
        prompt_tokens=512,
        generation_tokens=256,
        n_warmup=3,
        n_samples=20,
    )

    # Repeat a base paragraph to lengthen the prompt; tune the multiplier until the
    # tokenized length is close to config.prompt_tokens
    prompt = (
        "Explain the transformer attention mechanism in detail, covering the "
        "mathematical formulation, computational complexity, and practical "
        "optimizations used in production systems. "
    ) * 5

    result = benchmark_llamacpp(
        model_path="/opt/models/gguf/Llama-3.2-3B-Instruct-Q4_K_M.gguf",
        config=config,
        prompt=prompt,
    )

    print_results_table([result])
    save_results([result], "benchmark_results.json")

Tool 3: Quick Ollama Benchmarking

For Ollama users, a simple shell-based benchmark collects timing data:

#!/bin/bash
# ollama_benchmark.sh
# Quick benchmark for Ollama models

MODEL=${1:-"llama3.2:3b"}
N_RUNS=${2:-10}
PROMPT="Explain quantum entanglement to a software engineer. Cover the physics, the math, the experiments that proved it, and the practical applications in quantum computing. Be thorough and precise."

echo "Benchmarking $MODEL ($N_RUNS runs)"
echo "---"

TPS_VALUES=()

for i in $(seq 1 $N_RUNS); do
  # ollama prints timing stats (load duration, prompt eval rate, eval rate)
  # to stderr when run with --verbose
  VERBOSE_OUTPUT=$(ollama run --verbose "$MODEL" "$PROMPT" 2>&1)

  # Generation speed is the "eval rate" line; exclude the "prompt eval rate" line
  TPS=$(echo "$VERBOSE_OUTPUT" | grep -v 'prompt eval' | grep -oP 'eval rate:\s+\K[0-9.]+')

  echo "Run $i: TPS=$TPS"
  TPS_VALUES+=("$TPS")
done

# Compute mean using awk
echo ""
echo "Results:"
printf '%s\n' "${TPS_VALUES[@]}" | awk '{sum+=$1; count++} END {printf "Mean TPS: %.1f\n", sum/count}'

Comparing Quantization Levels: A Systematic Approach

The most common benchmark task is comparing Q4 vs Q5 vs Q8 vs FP16 on the same model. Here is a complete script:

#!/bin/bash
# compare_quantizations.sh
# Download and benchmark multiple quantization levels of the same model

MODEL_BASE="bartowski/Llama-3.2-3B-Instruct-GGUF"
OUTPUT_DIR="/tmp/benchmark_results"
QUANTS=("Q4_K_M" "Q5_K_M" "Q8_0")

mkdir -p "$OUTPUT_DIR"
RESULTS_FILE="$OUTPUT_DIR/comparison_$(date +%Y%m%d_%H%M%S).csv"

echo "model,quantization,tps_mean,tps_std,ttft_ms_mean,vram_gb,file_size_gb" > "$RESULTS_FILE"

for QUANT in "${QUANTS[@]}"; do
  FILENAME="Llama-3.2-3B-Instruct-${QUANT}.gguf"
  MODEL_PATH="$OUTPUT_DIR/$FILENAME"

  # Download if not present
  if [ ! -f "$MODEL_PATH" ]; then
    echo "Downloading $QUANT ..."
    huggingface-cli download \
      "$MODEL_BASE" \
      "$FILENAME" \
      --local-dir "$OUTPUT_DIR"
  fi

  FILE_SIZE=$(du -BG "$MODEL_PATH" | cut -f1 | tr -d 'G')

  echo "Benchmarking $QUANT ..."

  # Use llama-bench for reliable measurement
  BENCH_OUTPUT=$(./llama-bench \
    -m "$MODEL_PATH" \
    -p 512 \
    -n 256 \
    -r 10 \
    -o csv 2>/dev/null)

  # Parse pp (prefill) and tg (generation) lines
  TG_LINE=$(echo "$BENCH_OUTPUT" | grep ",tg")
  PP_LINE=$(echo "$BENCH_OUTPUT" | grep ",pp")

  TPS_MEAN=$(echo "$TG_LINE" | awk -F',' '{print $NF}' | awk -F'±' '{gsub(/ /,"",$1); print $1}')
  TPS_STD=$(echo "$TG_LINE" | awk -F'±' '{gsub(/ /,"",$NF); print $NF}')
  PP_TPS=$(echo "$PP_LINE" | awk -F',' '{print $NF}' | awk -F'±' '{gsub(/ /,"",$1); print $1}')

  # Estimate TTFT from prefill TPS and prompt length
  TTFT_MS=$(echo "scale=1; 512 / $PP_TPS * 1000" | bc)

  # Get VRAM usage
  VRAM=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits 2>/dev/null | head -1 || echo "N/A")
  VRAM_GB=$(echo "scale=2; $VRAM / 1024" | bc 2>/dev/null || echo "N/A")

  echo "llama-3.2-3b,$QUANT,$TPS_MEAN,$TPS_STD,$TTFT_MS,$VRAM_GB,$FILE_SIZE" >> "$RESULTS_FILE"

  echo "  TPS: $TPS_MEAN +/- $TPS_STD"
  echo "  TTFT (est): ${TTFT_MS}ms"
  echo ""
done

echo "Results written to $RESULTS_FILE"
cat "$RESULTS_FILE"

[Diagrams: Performance Metrics Breakdown · Quantization Trade-off Space · Benchmark Methodology Flow]

Context Length Impact on Performance

Context length has a non-linear effect on both TTFT and memory usage. Engineers who benchmark at short context lengths and deploy at long context lengths get surprised in production.

# context_length_benchmark.py
# Measures how TPS and TTFT change as context length increases

import statistics
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def benchmark_context_lengths(
    model_path: str,
    context_lengths: list[int] = [256, 512, 1024, 2048, 4096, 8192],
    generation_tokens: int = 128,
    n_samples: int = 5,
):
    tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        local_files_only=True,
        torch_dtype=torch.float16,
        device_map="auto",
    )
    model.eval()

    # Generate a long base prompt we can truncate to different lengths
    base_text = "The transformer architecture fundamentally changed natural language processing. " * 200
    base_tokens = tokenizer(base_text, return_tensors="pt")["input_ids"][0]

    results = []
    print(f"\n{'Context':<12} {'TTFT (ms)':<15} {'TPS':<12} {'VRAM (GB)':<12}")
    print("-" * 52)

    for ctx_len in context_lengths:
        # Truncate to exact context length
        if base_tokens.shape[0] < ctx_len:
            print(f"  Warning: base text shorter than {ctx_len} tokens, skipping")
            continue

        input_ids = base_tokens[:ctx_len].unsqueeze(0).to(model.device)

        ttft_samples = []
        tps_samples = []

        # Track peak VRAM for this context length only
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()

        # Warm up
        with torch.no_grad():
            _ = model.generate(
                input_ids,
                max_new_tokens=1,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
            )

        for _ in range(n_samples):
            # Measure TTFT (time to generate 1 token)
            t0 = time.perf_counter()
            with torch.no_grad():
                _ = model.generate(
                    input_ids,
                    max_new_tokens=1,
                    do_sample=False,
                    pad_token_id=tokenizer.eos_token_id,
                )
            ttft = (time.perf_counter() - t0) * 1000

            # Measure TPS (generation speed)
            t1 = time.perf_counter()
            with torch.no_grad():
                output = model.generate(
                    input_ids,
                    max_new_tokens=generation_tokens,
                    do_sample=False,
                    pad_token_id=tokenizer.eos_token_id,
                )
            gen_time = time.perf_counter() - t1
            n_generated = output.shape[1] - input_ids.shape[1]
            tps = n_generated / gen_time

            ttft_samples.append(ttft)
            tps_samples.append(tps)

        vram_gb = torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else 0

        row = {
            "context_tokens": ctx_len,
            "ttft_ms_mean": round(statistics.mean(ttft_samples), 1),
            "tps_mean": round(statistics.mean(tps_samples), 1),
            "vram_gb": round(vram_gb, 2),
        }
        results.append(row)

        print(
            f"{ctx_len:<12} "
            f"{row['ttft_ms_mean']:<15.1f} "
            f"{row['tps_mean']:<12.1f} "
            f"{row['vram_gb']:<12.2f}"
        )

    return results

Typical results on an RTX 4090 with Llama-3-8B Q4_K_M:

| Context Tokens | TTFT (ms) | TPS | VRAM (GB) |
| -------------: | --------: | --: | --------: |
| 256 | 180 | 89 | 5.2 |
| 512 | 310 | 87 | 5.4 |
| 1024 | 580 | 84 | 5.8 |
| 2048 | 1,100 | 79 | 6.5 |
| 4096 | 2,200 | 71 | 7.9 |
| 8192 | 4,600 | 58 | 10.7 |

The pattern is clear: TPS degrades slowly with context length (89 down to 58, about 35%), but TTFT grows nearly linearly (180ms up to 4600ms, 25x). If your production use case involves long document contexts, TTFT is the metric that will cause user complaints - not TPS.


Platform Comparisons

Measuring Thermal Throttling

On NVIDIA GPUs, thermal throttling kicks in when GPU temperature exceeds ~83C on most cards (varies by model). You can observe this during benchmarking:

# thermal_monitor.py
# Monitor GPU temperature and clock speed during benchmarking
# Run in a separate thread alongside your benchmark

import threading
import time

try:
import pynvml
pynvml.nvmlInit()
NVIDIA = True
except:
NVIDIA = False

def monitor_gpu(interval_s=0.5, stop_event=None, log_path="gpu_thermal.csv"):
if not NVIDIA:
print("NVIDIA GPU not found - thermal monitoring unavailable")
return

handle = pynvml.nvmlDeviceGetHandleByIndex(0)

with open(log_path, "w") as f:
f.write("timestamp,temp_c,gpu_clock_mhz,mem_clock_mhz,power_w,util_pct\n")

while stop_event is None or not stop_event.is_set():
timestamp = time.time()
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
clocks = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_GRAPHICS)
mem_clocks = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM)
power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000 # mW to W
util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu

f.write(f"{timestamp},{temp},{clocks},{mem_clocks},{power:.1f},{util}\n")
f.flush()
time.sleep(interval_s)


# Usage alongside a benchmark:
stop = threading.Event()
monitor_thread = threading.Thread(
target=monitor_gpu,
args=(0.5, stop, "/tmp/gpu_thermal.csv"),
daemon=True
)
monitor_thread.start()

# ... run your benchmark ...

stop.set()
monitor_thread.join()

After running, plot gpu_clock_mhz over time. If you see the clock drop from ~2520 MHz to ~1800 MHz partway through a long benchmark, your results from that point forward are not representative of sustained production performance.
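
Rather than eyeballing a plot, you can flag throttling directly from the CSV written by the monitor above. A small sketch (the 10% clock-drop threshold is an arbitrary starting point; the column names match the header the monitor writes):

import csv

def detect_throttling(log_path: str = "/tmp/gpu_thermal.csv", drop_pct: float = 10.0):
    """Flag samples where the GPU clock fell more than drop_pct below the early-run clock."""
    with open(log_path) as f:
        rows = list(csv.DictReader(f))
    if len(rows) < 20:
        return  # not enough data to establish a baseline

    # Baseline clock: average of the first 10 samples after the run started
    baseline = sum(float(r["gpu_clock_mhz"]) for r in rows[:10]) / 10
    threshold = baseline * (1 - drop_pct / 100)
    throttled = [r for r in rows if float(r["gpu_clock_mhz"]) < threshold]

    if throttled:
        first = throttled[0]
        print(f"Throttling detected: clock fell below {threshold:.0f} MHz "
              f"(baseline {baseline:.0f} MHz) at temp {first['temp_c']}C; "
              f"{len(throttled)}/{len(rows)} samples affected")
    else:
        print(f"No throttling: clock stayed within {drop_pct:.0f}% of {baseline:.0f} MHz")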

Apple Silicon vs NVIDIA: What the Numbers Show

Apple Silicon (M2/M3/M4) has unified memory - GPU and CPU share the same DRAM. This means:

  • A Mac with 16 GB of unified memory can run a 7B model in Q4 without memory pressure (it uses about 5 GB)
  • The memory bandwidth is shared between CPU and GPU operations
  • You cannot exceed the total system RAM (unlike NVIDIA where you have a separate VRAM pool)

Typical numbers (Llama-3-8B, Q4_K_M):

| Platform | TPS | TTFT (512 tok) | Memory Used | Power (W) | TPS/Watt |
| -------- | --: | -------------: | ----------- | --------: | -------: |
| RTX 4090 | 115 | 250 ms | 5.2 GB VRAM | 380 | 0.30 |
| A100 40GB | 89 | 180 ms | 5.2 GB VRAM | 310 | 0.29 |
| M3 Max 40-core | 42 | 420 ms | 5.2 GB unified | 45 | 0.93 |
| RTX 3080 | 57 | 380 ms | 5.2 GB VRAM | 270 | 0.21 |
| CPU only (Threadripper) | 11 | 1,800 ms | 5.2 GB RAM | 180 | 0.06 |

Apple Silicon's efficiency advantage is striking: 0.93 TPS/Watt vs 0.30 for an RTX 4090. For battery-powered or power-constrained deployments, Apple Silicon is dramatically more efficient. For raw throughput where power is not constrained, NVIDIA wins.


Production Engineering Notes

Automated Benchmark Regression Testing

When you update a model version, update quantization, or change your inference stack, you want to automatically detect performance regressions. A simple CI benchmark that runs on every model update:

# benchmark_regression.py
# Compare new model version against baseline; fail if regression > threshold

import json
import sys
from pathlib import Path

REGRESSION_THRESHOLD_PERCENT = 10  # Fail if TPS drops by more than 10%
TTFT_THRESHOLD_PERCENT = 15        # Fail if TTFT increases by more than 15%


def load_baseline(baseline_path: str) -> dict:
    with open(baseline_path) as f:
        return json.load(f)


def check_regression(current: dict, baseline: dict) -> list[str]:
    failures = []

    tps_current = current["tps"]["mean"]
    tps_baseline = baseline["tps"]["mean"]
    tps_change_pct = (tps_current - tps_baseline) / tps_baseline * 100

    if tps_change_pct < -REGRESSION_THRESHOLD_PERCENT:
        failures.append(
            f"TPS regression: {tps_baseline:.1f} -> {tps_current:.1f} "
            f"({tps_change_pct:.1f}%, threshold: -{REGRESSION_THRESHOLD_PERCENT}%)"
        )

    ttft_current = current["ttft_ms"]["mean"]
    ttft_baseline = baseline["ttft_ms"]["mean"]
    ttft_change_pct = (ttft_current - ttft_baseline) / ttft_baseline * 100

    if ttft_change_pct > TTFT_THRESHOLD_PERCENT:
        failures.append(
            f"TTFT regression: {ttft_baseline:.0f}ms -> {ttft_current:.0f}ms "
            f"(+{ttft_change_pct:.1f}%, threshold: +{TTFT_THRESHOLD_PERCENT}%)"
        )

    return failures


if __name__ == "__main__":
    current_path = sys.argv[1]
    baseline_path = sys.argv[2]

    with open(current_path) as f:
        current = json.load(f)[0]  # first result in list

    baseline = load_baseline(baseline_path)[0]

    failures = check_regression(current, baseline)

    print(f"Current TPS: {current['tps']['mean']:.1f} (baseline: {baseline['tps']['mean']:.1f})")
    print(f"Current TTFT: {current['ttft_ms']['mean']:.0f}ms (baseline: {baseline['ttft_ms']['mean']:.0f}ms)")

    if failures:
        print("\nREGRESSION DETECTED:")
        for f in failures:
            print(f"  - {f}")
        sys.exit(1)
    else:
        print("\nNo regression detected. Performance within acceptable bounds.")
        sys.exit(0)

Reporting Benchmark Results

Raw numbers are only useful if they are presented clearly. Here is a function that generates a Markdown comparison table from benchmark results:

def generate_markdown_table(results: list[dict]) -> str:
    """
    Generate a Markdown comparison table from a list of benchmark result dicts.
    """
    lines = []
    lines.append("| Model | Quantization | TPS (mean) | TTFT (ms) | VRAM (GB) | File Size (GB) |")
    lines.append("|-------|-------------|------------|-----------|-----------|---------------|")

    for r in sorted(results, key=lambda x: x["tps"]["mean"], reverse=True):
        lines.append(
            f"| {r['model']} "
            f"| {r['quantization']} "
            f"| {r['tps']['mean']:.1f} +/- {r['tps']['std']:.1f} "
            f"| {r['ttft_ms']['mean']:.0f} "
            f"| {r['vram_peak_gb']['mean']:.2f} "
            f"| - |"
        )

    return "\n".join(lines)

Common Mistakes

danger

Benchmarking with do_sample=True and random seeds

If you use sampling (temperature > 0) without a fixed random seed, every run generates different length outputs, making TPS meaningless. A run that generates 50 tokens is not comparable to a run that generates 250 tokens even if both are measuring the same model on the same hardware. Always use do_sample=False (greedy decoding) for benchmarking, or fix the random seed and fix the generation length with max_new_tokens.
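
A minimal sketch of benchmark-safe generation settings for the transformers backend, reusing the model, tokenizer, and inputs from the harness above; min_new_tokens pins the output length even if the model would otherwise stop early at an EOS token:

import torch

def deterministic_generate(model, tokenizer, inputs, n_tokens: int = 256):
    """Benchmark-safe generation: greedy decoding and a pinned output length."""
    torch.manual_seed(0)  # only relevant if sampling were ever enabled
    with torch.no_grad():
        return model.generate(
            **inputs,
            do_sample=False,            # greedy: identical output every run
            max_new_tokens=n_tokens,    # upper bound on generated tokens
            min_new_tokens=n_tokens,    # lower bound too, so the token count is fixed
            pad_token_id=tokenizer.eos_token_id,
        )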

danger

Measuring load time as part of inference time

Model loading from disk to GPU memory takes 5-30 seconds depending on model size, disk speed, and whether the weights are cached in the OS page cache. This is a one-time cost paid at server startup, not per-inference. If you time from_pretrained() + the first inference call together, you are measuring disk I/O, not inference speed. Always measure load time separately from inference time, and only report inference time (with warm-up) as your benchmark metric.

warning

Single prompt benchmarks miss prompt-sensitivity effects

LLM inference speed can vary slightly based on the prompt content due to speculative decoding, dynamic batching, and other optimizations. A benchmark with a single repeated prompt may hit cache effects (some frameworks cache KV states across identical prompts). Use multiple different prompts of the same token length to get representative measurements.

warning

Not reporting standard deviation alongside mean

A mean TPS of 87 sounds great. A mean TPS of 87 with a standard deviation of 30 means your model runs anywhere from 57 to 117 tokens/second depending on conditions you have not identified. High variance usually indicates thermal throttling, memory bandwidth contention (other processes sharing the GPU), or OS scheduling jitter. Always report standard deviation. If your std is more than 10% of your mean, investigate before reporting the numbers.


Interview Q&A

Q1: What is the difference between time to first token and tokens per second, and when does each matter most?

TTFT (time to first token) is the latency from request submission to when the first output character appears. It is dominated by the prefill phase: processing all input tokens through the transformer. TTFT scales roughly linearly with input length. For a short prompt (50 tokens), TTFT might be 50ms. For a long document (4096 tokens), TTFT might be 2-4 seconds.

TPS (tokens per second) is the steady-state throughput during the generation phase - how fast the model produces output tokens after the first one. TPS is relatively insensitive to input length (it scales with output tokens and model size, not input size).

When each matters: TTFT matters most for interactive applications where a user is watching a cursor blink waiting for the first word. A 3-second TTFT feels like a significant pause even if the subsequent generation is fast. TPS matters most for batch processing or long-form generation where the user is willing to wait for the complete response. It also matters for throughput - TPS directly determines how many tokens per hour your hardware can produce.

For a chat interface, you want TTFT under 1 second and TPS fast enough to outpace reading speed (~10 tokens/second minimum, 30+ ideal). For batch document processing, TTFT is almost irrelevant and TPS is what drives your operational cost.

Q2: You run a benchmark on your RTX 4090 and get 115 TPS for Llama-3-8B Q4. Your colleague gets 89 TPS on their A100 40GB for the same model. Why might the cheaper card be faster?

The RTX 4090 has a higher theoretical GPU clock speed (2520 MHz boost vs 1410 MHz for A100) and its FP16 tensor core performance is competitive with the A100 for small batch sizes. More importantly, for Q4 quantized inference with llama.cpp, the bottleneck is not compute - it is memory bandwidth. Weight matrices must be loaded from VRAM for every token generation step.

The RTX 4090 has 1008 GB/s memory bandwidth. The A100 40GB has 1555 GB/s. So the A100 has higher bandwidth - yet can be slower? Because llama.cpp with Q4 quantization requires dequantizing weights on-the-fly, which is a compute-bound operation, not purely bandwidth-bound. The 4090's higher clock speed gives it an edge on the dequantization step.

Additionally, the A100 is optimized for large-batch, datacenter workloads. Its architecture is tuned for high throughput at batch size 32-128. At batch size 1 (single user inference), consumer GPUs like the 4090 often match or beat it on per-request latency.

The lesson: for local single-user inference, GPU generation clock speed matters as much as memory bandwidth. Benchmark your specific hardware rather than assuming datacenter cards are universally faster.

Q3: How do you design a benchmark that properly accounts for thermal throttling?

Thermal throttling occurs when GPU temperature exceeds its threshold (~83C on most NVIDIA cards) and the driver reduces clock speeds to stay within power and thermal limits. A benchmark that ignores this will show optimistic performance that does not reflect real sustained inference.

A proper thermal-aware benchmark has three components:

First, run the model for a warm-up period of 5-10 minutes before starting measurements. This brings the GPU to thermal steady state. Track nvidia-smi output (or pynvml in Python) to confirm the temperature has plateaued - not still rising.

Second, monitor GPU clock speed throughout the benchmark. Clock speed is the clearest signal of throttling. If you see it drop from 2520 MHz to 1800 MHz during your benchmark, your later samples are measuring throttled performance and your earlier samples are measuring unthrottled performance. Either wait for thermal steady state before sampling, or explicitly discard samples taken while temperature is changing.

Third, document the thermal environment. A benchmark run in a well-cooled server room with 18C ambient temperature is not comparable to the same hardware in a cabinet with poor airflow at 28C ambient. Include ambient temperature, GPU idle temperature, and GPU load temperature in your benchmark report. This makes the results reproducible and helps explain differences across environments.

Q4: A user reports that your deployed model feels slow during afternoon hours but fast in the morning. How would you diagnose this?

This is a classic thermal throttling symptom combined with potentially a batch size effect. Diagnosis approach:

Step 1: Add metrics collection to your inference endpoint. Log wall-clock time for each request, GPU temperature at request time (via nvidia-smi or pynvml), and number of concurrent requests. Most logging stacks (Prometheus + Grafana) can handle this with a simple instrument call.

Step 2: Correlate afternoon slowdowns with GPU temperature. If temperature is consistently 5-10C higher in the afternoon (due to the datacenter warming up during the day), and your GPU was right at the thermal threshold, afternoon workloads trigger throttling while morning workloads do not.

Step 3: Check concurrent request patterns. If afternoon hours have higher user traffic, you may be hitting a batch size effect where the model processes multiple requests simultaneously. Even small batch sizes (2-4) can slow per-request TPS if GPU memory bandwidth is saturated.

Fixes: improve server room cooling (hardware), set a lower power limit on the GPU (nvidia-smi -pl 300 instead of 400W gives more thermal headroom at the cost of ~10% peak performance), or add request queuing so the model never processes more than one request simultaneously, trading throughput for consistency.

Q5: How do batch size and context length interact to affect both TPS and memory usage? How would you benchmark this interaction?

Batch size and context length both consume GPU memory, but in different ways. Model weights are constant regardless of batch size or context length. KV cache grows with both: it scales as O(batch_size × seq_len). At batch size 1 and 4096 context, the KV cache for Llama-3-8B is roughly 0.5 GB. At batch size 16 and 4096 context, it is roughly 8 GB - which, on top of the model weights, can push a 24 GB card over its limit.

For TPS: larger batch sizes improve GPU utilization by giving the tensor cores more work per clock cycle, improving throughput per request (though individual request latency may increase slightly due to contention). This is the classic throughput vs latency trade-off.

To benchmark the interaction:

# Sweep batch_size x context_length
# measure_tps / measure_vram are placeholders for your own harness functions
for batch_size in [1, 2, 4, 8, 16]:
    for ctx_len in [256, 512, 1024, 2048, 4096]:
        try:
            tps = measure_tps(batch_size, ctx_len)
            vram = measure_vram(batch_size, ctx_len)
            print(f"batch={batch_size}, ctx={ctx_len}: TPS={tps:.1f}, VRAM={vram:.1f}GB")
        except RuntimeError as e:  # CUDA OOM surfaces as a RuntimeError
            print(f"batch={batch_size}, ctx={ctx_len}: OOM - {e}")

The results typically show an efficiency "peak" at some batch size (often 4-8 for 7B models on 24 GB VRAM) and OOM errors as batch size and context length grow beyond VRAM capacity. The goal is to find the batch size that maximizes TPS without risking OOM in production.

Q6: You need to present benchmark results to a non-technical stakeholder. What numbers do you show, and what framing do you use?

Non-technical stakeholders need context, not raw numbers. Never just say "87 tokens per second." That is meaningless without a reference point.

Frame it in user experience terms:

"The model generates at 87 tokens per second, which is about 60 words per second - roughly 10x faster than a person can read. Users will see a complete 200-word summary in about 3 seconds of generation time. The 1.2-second delay before the first word appears is the part we are still optimizing."

For capacity planning: "Our current hardware handles one request at a time. A 200-word response takes about 5 seconds end-to-end. If we have 50 analysts each making 20 requests per day, that is 1,000 requests per 8-hour day, or about 1 request every 29 seconds. We have plenty of headroom."

For cost comparison: "Running this locally costs about $0.84/day in electricity, plus the hardware amortized over 3 years. The equivalent API calls at $0.002 per 1,000 tokens, at our volume of 2 million tokens per month, would cost $4,000/month. We break even in 6 weeks."

This framing turns benchmark numbers into business decisions, which is what stakeholders actually need.


Summary: A Reusable Benchmark Checklist

Before every benchmark:

  • Pin the model version (exact path and checksum)
  • Record hardware spec (GPU model, VRAM, driver version, CUDA version)
  • Set generation to deterministic (greedy decoding, no sampling)
  • Run 3-5 warm-up iterations and discard them
  • Wait for GPU temperature to stabilize

During benchmarking:

  • Collect at least 20 samples
  • Record all four primary metrics: TPS, TTFT, VRAM usage, power draw
  • Monitor GPU temperature and clock speed throughout
  • Use multiple prompts of the target token length (not one repeated prompt)

Reporting results:

  • Report mean AND standard deviation, not just mean
  • Include p95 and p99 for latency-sensitive workloads
  • Document the test prompt lengths used
  • Note ambient temperature and thermal state during benchmark
  • Compare against a baseline (previous version or alternative quantization)

The discipline of systematic benchmarking pays off quickly. Catching a 15% TTFT regression before deployment - not after users complain - is the difference between a smooth model update and a production incident postmortem.
