What is autoregressive decoding?

Understand how LLMs generate tokens one at a time, why decoding is memory-bandwidth bound, and how to reason about inference latency with the roofline model.

How does llm inference work in practice?

Autoregressive Decoding covers autoregressive decoding, llm inference, prefill phase from first principles with code examples. Free lesson at https://engineersofai.com/docs/llms/llm-inference/autoregressive-decoding

What is the difference between autoregressive decoding and prefill phase?

See the full breakdown at https://engineersofai.com/docs/llms/llm-inference/autoregressive-decoding

Autoregressive Decoding

The Production Scenario

It is 2:47 AM on a Tuesday. Your on-call phone buzzes. The dashboard shows P95 latency for the coding assistant has spiked from 800 ms to 6.2 seconds. Users are abandoning requests. Revenue is dropping. You pull up the GPU metrics: utilization is 45%, nowhere near saturated. Memory bandwidth utilization is 98%. You stare at the screen trying to understand how the GPU can be nearly idle on compute but completely bottlenecked on memory.

Three weeks later, after reading every inference optimization paper you can find, the answer becomes obvious: LLM decoding is not a compute problem. It was never a compute problem. You were solving the wrong equation entirely.

This lesson explains the fundamental reason inference is hard to optimize, why it defies your intuition about GPU utilization, and what the actual bottleneck is. Everything else in this module - KV caching, quantization, continuous batching, speculative decoding - only makes sense once you understand autoregressive decoding at the hardware level.

The insight is deceptively simple. During each decode step, you are loading the entire model from GPU memory to compute just one token. One token. For LLaMA-2 70B, that is 140 GB of weights loaded from HBM2e memory running at 2 TB/s - taking about 70 milliseconds of memory transfer time - to perform a trivial amount of arithmetic. The compute units sit idle, waiting for data. You are bottlenecked by memory bandwidth, not FLOP count.

Until you internalize this fact, you will keep trying to optimize the wrong thing.

Why This Exists: The Fundamental Problem

Language models are probabilistic. They do not "know" the next word - they compute a probability distribution over the entire vocabulary and sample from it. To get the second word, they need the first word as context. To get the third, they need the first two. This creates an unavoidable sequential dependency: you cannot compute token $t$ until you have token $t-1$ .

This is what "autoregressive" means. The model feeds its own output back as input. Every modern LLM - GPT-4, LLaMA, Claude, Gemini - works this way at inference time.

The sequential nature of decoding is the root cause of every performance challenge in LLM serving. You cannot simply throw more parallel compute at it. Each decode step is a tiny computation with enormous memory requirements, executed one step at a time.

Historical Context

The transformer architecture (Vaswani et al., 2017) was designed primarily for training efficiency. The attention mechanism - allowing every token to attend to every other token - was revolutionary for training, where you can process the entire sequence in parallel. The authors did not anticipate that serving these models would become a multi-billion-dollar engineering problem.

Early GPT models (2018–2020) were small enough that inference latency was not the dominant concern. When GPT-3 (Brown et al., 2020) arrived at 175B parameters, the scale changed everything. The inference community began seriously studying the memory-bandwidth bottleneck. The 2022–2023 period saw an explosion of inference optimization work: Flash Attention (Dao et al., 2022), PagedAttention (Kwon et al., 2023), LLM.int8() (Dettmers et al., 2022), and speculative decoding (Leviathan et al., 2022; Chen et al., 2022) all emerged in roughly 18 months.

The Two Phases of LLM Inference

LLM inference has two fundamentally different phases with different performance characteristics.

Phase 1: Prefill

During prefill, you process the entire input prompt in a single forward pass. If your prompt is 512 tokens, the transformer processes all 512 tokens simultaneously using the parallel attention mechanism.

Characteristics:

All input tokens are known upfront - fully parallel
Compute-intensive: many tokens, many attention operations
Produces the KV cache for all input tokens
Outputs the probability distribution for the first generated token
Duration scales roughly linearly with prompt length

For a 512-token prompt on an A100 80GB, prefill typically completes in 20–100 ms depending on model size.

Phase 2: Decode

After prefill, you enter the decode loop. Each iteration:

Takes the last generated token as input (or the last token from prefill on the first step)
Runs a full forward pass through all model layers
Computes attention over all previous tokens using the KV cache
Samples the next token from the output distribution
Appends the new token and repeats

Characteristics:

Each step generates exactly one token
Memory-bandwidth-bound (see roofline model below)
Duration per step is roughly constant (independent of how many tokens you've already generated, given a fixed batch size)
Total decode time scales linearly with output length

Why Decoding Is Memory-Bandwidth Bound

The Roofline Model

The roofline model is a visual performance model that tells you whether a computation is limited by compute (FLOPs) or memory bandwidth. Every operation falls somewhere on this curve.

Arithmetic intensity is the key metric:

$I = \frac{\text{FLOPs}}{\text{Bytes of memory accessed}}$

High arithmetic intensity = compute-bound. Low arithmetic intensity = memory-bound.

For an A100 SXM GPU:

Peak compute: 312 TFLOPS (FP16 tensor cores)
Peak memory bandwidth: 2,000 GB/s
Roofline "ridge point": $\frac{312 \times 10^{12}}{2 \times 10^{12}} = 156$ FLOPs/byte

Any operation with arithmetic intensity below 156 FLOPs/byte on an A100 is memory-bandwidth-bound.

Arithmetic Intensity of a Decode Step

During a single decode step (batch size = 1), for a linear layer with weight matrix $W \in \mathbb{R}^{d_{out} \times d_{in}}$ :

FLOPs: $2 \times d_{in} \times d_{out}$ (matrix-vector multiply)
Memory bytes read: $2 \times d_{in} \times d_{out}$ (reading the weight matrix in FP16)
Arithmetic intensity: $I = \frac{2 \times d_{in} \times d_{out}}{2 \times d_{in} \times d_{out}} = 1$ FLOP/byte

Arithmetic intensity of 1 FLOP/byte. The A100 ridge point is 156. We are 156× below the threshold for compute-bound operation.

This is the core insight: at batch size 1, you spend 99.4% of your time waiting for weight data to arrive from HBM memory. The tensor cores do trivial work. The bottleneck is pure memory bandwidth.

Increasing Batch Size Helps - To A Point

With batch size $B$ :

FLOPs: $2 \times B \times d_{in} \times d_{out}$
Memory bytes: still $2 \times d_{in} \times d_{out}$ (weights loaded once, reused for all batch items)
Arithmetic intensity: $I = B$ FLOPs/byte

At batch size 156, you finally reach the A100 ridge point and become compute-bound. This is why batching helps so much - it amortizes the weight loading cost across multiple requests.

But there is a catch: at batch size 156 with a 70B model, your KV cache alone could require hundreds of GB of memory. You quickly run out of GPU RAM before reaching the optimal batch size.

GPU Memory Breakdown

Understanding what lives in GPU memory is essential for capacity planning.

For LLaMA-2 70B in FP16:

Component	Size	Notes
Model weights	140 GB	70B params × 2 bytes (FP16)
KV cache (one sequence, 4096 ctx)	~16 GB	See Module 02 for formula
Activations	~1–2 GB	Temporary per forward pass
CUDA context + framework overhead	~2–3 GB	Fixed cost

An A100 80GB can barely fit the model weights. You need at least 2× A100s just to load the model, leaving no headroom for the KV cache. This is why large models require multi-GPU serving and why quantization is so important.

The Memory Capacity Formula

$\text{GPU memory required} = \text{model weights} + \text{KV cache} \times \text{batch size} + \text{overhead}$

$\text{model weights (FP16)} = \frac{N_{\text{params}} \times 2}{10^9} \text{ GB}$

$\text{KV cache per sequence} = 2 \times n_{\text{layers}} \times n_{\text{heads}} \times d_{\text{head}} \times L_{\text{seq}} \times \text{bytes}$

Where the factor of 2 is for K and V tensors. For LLaMA-2 70B (80 layers, 64 heads, 128 head dim, FP16):

$2 \times 80 \times 64 \times 128 \times 4096 \times 2 \text{ bytes} \approx 16.8 \text{ GB per sequence}$

This is why KV cache management is the central problem in LLM serving.

Latency Metrics

Production SLO definitions for LLM services:

Time to First Token (TTFT)

The time from request submission to receiving the first generated token. Dominated by the prefill phase.

$\text{TTFT} = t_{\text{queue}} + t_{\text{prefill}}$

A good TTFT for interactive applications: under 200 ms at P50, under 500 ms at P95.

Time Per Output Token (TPOT)

The average time between consecutive generated tokens during the decode phase. This determines how fast text appears on screen for streaming applications.

$\text{TPOT} = \frac{1}{\text{tokens/sec per sequence}}$

For a 70B model on 2× A100s with batch size 1: roughly 25–40 ms per token (25–40 tokens/sec).

End-to-End Latency

$\text{E2E latency} = \text{TTFT} + N_{\text{output tokens}} \times \text{TPOT}$

For a response of 200 tokens: $500 + 200 \times 35 = 7500 \text{ ms}$ - 7.5 seconds.

Throughput

$\text{throughput} = \frac{\text{total tokens generated}}{\text{total time}} \text{ (tokens/sec, all users)}$

Throughput and per-request latency trade off against each other. Larger batches increase throughput but increase TTFT (because new requests wait for the current batch to process).

Code: Profiling Prefill vs Decode Latency

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def profile_inference(model_name: str = "meta-llama/Llama-2-7b-hf"):
    """
    Profile prefill and decode phases separately.
    Requires ~14GB GPU memory for 7B FP16.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        device_map="cuda"
    )
    model.eval()

    prompt = "Explain the transformer architecture in detail, covering " * 20
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    input_ids = inputs["input_ids"]
    print(f"Prompt length: {input_ids.shape[1]} tokens")

    # --- Benchmark prefill ---
    torch.cuda.synchronize()

    NUM_RUNS = 5
    prefill_times = []

    for _ in range(NUM_RUNS):
        torch.cuda.synchronize()
        t0 = time.perf_counter()

        with torch.no_grad():
            # Single forward pass - this IS the prefill
            outputs = model(input_ids, use_cache=True)
            past_key_values = outputs.past_key_values

        torch.cuda.synchronize()
        t1 = time.perf_counter()
        prefill_times.append((t1 - t0) * 1000)

    prefill_ms = sum(prefill_times) / len(prefill_times)
    print(f"Prefill latency: {prefill_ms:.1f} ms avg over {NUM_RUNS} runs")

    # --- Benchmark single decode step ---
    # Start from the prefill state, generate one more token
    decode_times = []
    last_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    for _ in range(20):  # 20 decode steps
        torch.cuda.synchronize()
        t0 = time.perf_counter()

        with torch.no_grad():
            step_outputs = model(
                input_ids=last_token,
                past_key_values=past_key_values,
                use_cache=True
            )
            past_key_values = step_outputs.past_key_values
            last_token = step_outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)

        torch.cuda.synchronize()
        t1 = time.perf_counter()
        decode_times.append((t1 - t0) * 1000)

    decode_ms = sum(decode_times[2:]) / len(decode_times[2:])  # Skip warmup
    print(f"Decode step latency: {decode_ms:.1f} ms avg (= {1000/decode_ms:.0f} tokens/sec)")

    # Compute theoretical minimum based on memory bandwidth
    # Rough model: 7B params, FP16 = 14GB, must read entire model each step
    model_size_gb = sum(p.numel() * 2 for p in model.parameters()) / 1e9
    bw_gb_per_sec = 2000  # A100 peak bandwidth in GB/s
    theoretical_min_ms = (model_size_gb / bw_gb_per_sec) * 1000
    print(f"\nModel size: {model_size_gb:.1f} GB")
    print(f"Theoretical min decode step (BW-bound): {theoretical_min_ms:.1f} ms")
    print(f"Hardware efficiency: {theoretical_min_ms / decode_ms * 100:.0f}%")


if __name__ == "__main__":
    profile_inference()

Expected output for LLaMA-2 7B on an A100 80GB:

Prompt length: 256 tokens
Prefill latency: 18.3 ms avg over 5 runs
Decode step latency: 8.2 ms avg (= 122 tokens/sec)

Model size: 13.5 GB
Theoretical min decode step (BW-bound): 6.75 ms
Hardware efficiency: 82%

The hardware efficiency of ~82% means we are achieving 82% of the theoretical maximum given memory bandwidth. The remaining 18% is overhead from kernel launches, attention computation, and other operations.

Measuring Arithmetic Intensity with Torch Profiler

import torch
from torch.profiler import profile, ProfilerActivity

def measure_arithmetic_intensity(model, input_ids, past_key_values=None):
    """
    Use torch profiler to estimate FLOPs and memory operations.
    """
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
        with_flops=True,
        with_modules=True
    ) as prof:
        with torch.no_grad():
            if past_key_values is not None:
                # Decode step
                out = model(
                    input_ids=input_ids[:, -1:],
                    past_key_values=past_key_values,
                    use_cache=True
                )
            else:
                # Prefill
                out = model(input_ids, use_cache=True)

    # Summarize
    events = prof.key_averages()
    total_flops = sum(e.flops for e in events if e.flops > 0)
    print(f"Estimated FLOPs: {total_flops / 1e9:.1f} GFLOPs")
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

    return total_flops

Understanding the Roofline Visually

This diagram captures the key insight: prefill is compute-bound (good GPU utilization), decode at small batch sizes is deep in memory-bandwidth-bound territory.

Production Engineering Notes

SLO Definition

Define separate SLOs for TTFT and TPOT:

# Example SLO configuration for a production LLM API
SLO_CONFIG = {
    "chat": {
        "ttft_p50_ms": 150,
        "ttft_p95_ms": 400,
        "ttft_p99_ms": 800,
        "tpot_p50_ms": 35,
        "tpot_p95_ms": 70,
        "tpot_p99_ms": 120,
    },
    "batch_generation": {
        "ttft_p50_ms": 500,
        "ttft_p95_ms": 2000,
        "tpot_p50_ms": 20,
        "tpot_p95_ms": 50,
    }
}

Measuring TTFT and TPOT in Production

import time
from dataclasses import dataclass, field
from typing import List
import statistics

@dataclass
class InferenceMetrics:
    request_id: str
    prompt_tokens: int
    start_time: float = field(default_factory=time.time)
    first_token_time: float = None
    token_timestamps: List[float] = field(default_factory=list)

    @property
    def ttft_ms(self) -> float:
        if self.first_token_time is None:
            return None
        return (self.first_token_time - self.start_time) * 1000

    @property
    def tpot_ms(self) -> float:
        if len(self.token_timestamps) < 2:
            return None
        gaps = [
            (self.token_timestamps[i+1] - self.token_timestamps[i]) * 1000
            for i in range(len(self.token_timestamps) - 1)
        ]
        return statistics.mean(gaps)

    @property
    def total_output_tokens(self) -> int:
        return len(self.token_timestamps)

    @property
    def e2e_latency_ms(self) -> float:
        if not self.token_timestamps:
            return None
        return (self.token_timestamps[-1] - self.start_time) * 1000


def streaming_generate_with_metrics(model, tokenizer, prompt: str) -> InferenceMetrics:
    """
    Generate tokens one at a time, recording per-token timestamps.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    metrics = InferenceMetrics(
        request_id="req_001",
        prompt_tokens=inputs["input_ids"].shape[1]
    )

    past_key_values = None
    input_ids = inputs["input_ids"]

    with torch.no_grad():
        for step in range(256):  # max output tokens
            t = time.perf_counter()

            outputs = model(
                input_ids=input_ids if step == 0 else input_ids[:, -1:],
                past_key_values=past_key_values,
                use_cache=True
            )
            past_key_values = outputs.past_key_values
            next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            input_ids = torch.cat([input_ids, next_token], dim=1)

            now = time.perf_counter()

            if step == 0:
                metrics.first_token_time = now
            metrics.token_timestamps.append(now)

            # Check for EOS
            if next_token.item() == tokenizer.eos_token_id:
                break

    return metrics

GPU Memory Estimation Script

def estimate_gpu_memory_requirements(
    num_params_billions: float,
    dtype_bytes: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    batch_size: int,
    kv_dtype_bytes: int = 2
) -> dict:
    """
    Estimate GPU memory requirements for LLM serving.

    Args:
        num_params_billions: Model size (e.g., 70 for LLaMA-2 70B)
        dtype_bytes: 2 for FP16/BF16, 4 for FP32, 1 for INT8
        num_layers: Number of transformer layers
        num_kv_heads: Number of KV heads (use GQA value if applicable)
        head_dim: Dimension per head
        max_seq_len: Maximum sequence length
        batch_size: Number of concurrent sequences
        kv_dtype_bytes: Bytes per element in KV cache
    """
    weights_gb = (num_params_billions * 1e9 * dtype_bytes) / 1e9

    # KV cache: 2 (K+V) × layers × kv_heads × head_dim × seq_len × bytes
    kv_per_seq_gb = (
        2 * num_layers * num_kv_heads * head_dim * max_seq_len * kv_dtype_bytes
    ) / 1e9
    total_kv_gb = kv_per_seq_gb * batch_size

    overhead_gb = 3.0  # CUDA context, framework, activations

    total_gb = weights_gb + total_kv_gb + overhead_gb

    return {
        "weights_gb": round(weights_gb, 1),
        "kv_per_sequence_gb": round(kv_per_seq_gb, 2),
        "kv_total_gb": round(total_kv_gb, 1),
        "overhead_gb": overhead_gb,
        "total_gb": round(total_gb, 1),
        "notes": f"Max batch {batch_size} with {max_seq_len} context tokens"
    }


# Example: LLaMA-2 70B on 2× A100 80GB (160GB total)
llama70b = estimate_gpu_memory_requirements(
    num_params_billions=70,
    dtype_bytes=2,       # FP16
    num_layers=80,
    num_kv_heads=8,      # GQA: only 8 KV heads despite 64 query heads
    head_dim=128,
    max_seq_len=4096,
    batch_size=8
)

print("LLaMA-2 70B Memory Estimate:")
for k, v in llama70b.items():
    print(f"  {k}: {v}")

# Output:
# weights_gb: 140.0
# kv_per_sequence_gb: 0.67
# kv_total_gb: 5.4
# overhead_gb: 3.0
# total_gb: 148.4

Common Mistakes

:::danger Optimizing for GPU compute utilization instead of throughput GPU utilization percentage is a misleading metric for LLM inference. During memory-bandwidth-bound decode, the GPU compute units are mostly idle by design. "60% GPU utilization" during decode does NOT mean you have 40% headroom - it means you are already hitting the memory bandwidth ceiling. Monitor memory bandwidth utilization and tokens/second, not GPU utilization percentage. :::

:::danger Conflating prefill and decode performance Prefill is compute-bound. Decode is memory-bandwidth-bound. They respond differently to optimizations. INT4 quantization helps decode (reads 4× less data from memory) but barely helps prefill (already compute-bound). Batching helps decode dramatically (amortizes weight loading). Prefill can be sped up by increasing compute resources. Decode cannot. Never apply a prefill optimization to a decode bottleneck. :::

:::warning Ignoring the KV cache in memory calculations It is easy to calculate model weights and call it done. For a 7B model at FP16, weights are 14 GB - no problem on a single A100. But add 32 sequences at 4096 tokens each: the KV cache adds another 8–20 GB depending on architecture. With batch size 64, you can easily exceed GPU memory even for "small" models. Always calculate KV cache requirements alongside weights. :::

:::warning Not warming up before benchmarking The first 2–3 inference calls are slower due to CUDA kernel compilation, memory allocation, and cache warming. Always discard warmup runs when measuring latency. A minimum of 10 timed runs with 3 warmup runs is a reasonable benchmark protocol. :::

Interview Questions

Q1: Why is LLM decoding memory-bandwidth-bound and not compute-bound?

During each decode step with batch size 1, the model performs a matrix-vector multiplication: one token (a vector) against each weight matrix. The number of FLOPs is proportional to the number of parameters ( $2N$ for a matmul), but so is the number of bytes that must be read from memory ( $2N$ bytes for FP16). The arithmetic intensity is therefore 1 FLOP/byte. The A100 has a roofline ridge point at 156 FLOPs/byte, meaning any operation below that threshold is memory-bound. Decode at small batch sizes is deep in memory-bound territory - the compute units are largely idle, waiting for data from HBM memory.

Q2: What is the difference between TTFT and TPOT, and when does each matter more?

TTFT (Time To First Token) measures latency from request submission to the first output token. It is dominated by the prefill phase. TPOT (Time Per Output Token) measures the average inter-token delay during decode. TTFT matters for interactive applications where users notice the initial pause - a code assistant that takes 2 seconds before typing anything feels broken. TPOT matters for reading experience - if tokens arrive faster than ~30 ms each (33 tokens/sec), reading keeps up with generation. For batch processing (summarization pipelines, document processing), overall throughput matters more than either.

Q3: How does batch size affect the arithmetic intensity of decode?

With batch size $B$ , the model performs a matrix-batch-of-vectors multiply. The FLOPs scale as $2BN$ (proportional to batch size), but the bytes of weight data loaded remain $2N$ (weights are loaded once and reused across all batch items). Therefore arithmetic intensity scales linearly with batch size: $I = B$ FLOPs/byte. At batch size 156 on an A100 (ridge point = 156), decoding transitions from memory-bound to compute-bound. This is why batching is the single most impactful optimization for decode throughput - but it requires enough GPU memory to hold all the concurrent KV caches.

Q4: What are the two phases of inference and what are their compute characteristics?

Prefill processes all input tokens simultaneously using the full parallel attention mechanism. It is compute-intensive because many tokens contribute to many attention operations simultaneously. Arithmetic intensity during prefill is high (300+ FLOPs/byte for typical prompt lengths), making it compute-bound. Decode generates one token per forward pass, resulting in a matrix-vector multiply with arithmetic intensity near 1 FLOPs/byte at batch size 1 - deep in memory-bandwidth-bound territory. These different characteristics mean different optimization strategies apply: compute scaling (more FLOP capacity) helps prefill; memory capacity and bandwidth optimizations help decode.

Q5: Given a 70B parameter model in FP16 on 2× A100 80GB, what limits the maximum batch size you can serve?

Available GPU memory for KV cache = total GPU memory - model weights - overhead = 160 GB - 140 GB - 3 GB = 17 GB. With LLaMA-2 70B using GQA (8 KV heads, 128 head dim, 80 layers), KV cache per sequence at 4096 context = 2 × 80 × 8 × 128 × 4096 × 2 bytes ≈ 1.34 GB. Maximum batch size = floor(17 / 1.34) ≈ 12 sequences. In practice, memory fragmentation and variable sequence lengths reduce this further - PagedAttention (covered in Module 02) addresses this fragmentation problem.

Q6: Why can't you simply parallelize decode steps across multiple GPU cores to speed things up?

Each decode step has a hard sequential dependency on the previous step's output. Step $t$ requires the token generated at step $t-1$ as input. You cannot start step $t$ until step $t-1$ completes because you do not know the input. This is fundamentally different from training, where you have the full target sequence and can process everything in parallel. The only way to break this sequential bottleneck is speculative decoding (where a draft model speculatively guesses multiple future tokens, and the target model verifies them in parallel) - covered in Module 05.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Sampling Strategies: Temperature, Top-K, Top-P demo on the EngineersOfAI Playground - no code required.

:::

The Production Scenario​

Why This Exists: The Fundamental Problem​

Historical Context​

The Two Phases of LLM Inference​

Phase 1: Prefill​

Phase 2: Decode​

Why Decoding Is Memory-Bandwidth Bound​

The Roofline Model​

Arithmetic Intensity of a Decode Step​

Increasing Batch Size Helps - To A Point​

GPU Memory Breakdown​

The Memory Capacity Formula​

Latency Metrics​

Time to First Token (TTFT)​

Time Per Output Token (TPOT)​

End-to-End Latency​

Throughput​

Code: Profiling Prefill vs Decode Latency​

Measuring Arithmetic Intensity with Torch Profiler​

Understanding the Roofline Visually​

Production Engineering Notes​

SLO Definition​

Measuring TTFT and TPOT in Production​

GPU Memory Estimation Script​

Common Mistakes​

Interview Questions​