

Back-of-the-Envelope Estimation for ML Systems

The engineer who can estimate is worth ten engineers who can only measure. Measuring requires the system to exist. Estimating determines whether it should.

The Production Moment

Your manager has just come back from an executive meeting. "We're going to train a large language model," she says. "Internal use - document summarization, code review, customer support. The board approved a $2M infrastructure budget. Can we do it?"

You have 30 minutes before the follow-up meeting where you need to give an answer.

This is a back-of-envelope problem. You don't need exact numbers. You need to know whether a $2M budget is in the right ballpark or off by an order of magnitude. One order of magnitude means the project is feasible with careful planning. Two orders of magnitude means someone has fundamentally misunderstood what this costs.

You need the vocabulary of numbers - the key quantities every ML engineer should have internalized - and a systematic estimation framework that converts high-level goals into concrete infrastructure requirements.

Here is what that looks like in practice. A 7B parameter model trained on 1 trillion tokens (a reasonable scale for a capable internal LLM) requires approximately $190K in compute at A100 spot pricing (about $2/hour at 40% utilization). A 13B model trained on 2 trillion tokens - the scale of Llama 2 13B - would cost roughly $700K. A Chinchilla-optimal 70B model would cost roughly $2.6M. The answer to your manager's question depends entirely on the model scale she has in mind, and without asking, you can't give a useful answer.

The estimation framework makes these calculations fast, reliable, and defensible.

Why Estimation Matters

Engineers who skip estimation make two types of expensive mistakes.

The over-engineering mistake: Building a distributed Spark cluster to process 10 GB of training data that fits in memory on a single machine. Adding a Redis cluster for feature caching when the feature data is 500 MB and can live on the serving machine's local disk. Over-engineering wastes weeks of development time and ongoing operational cost.

The under-engineering mistake: Designing a serving system for 1,000 QPS when the actual peak is 50,000 QPS. Building a feature pipeline that processes data with a single-machine Pandas script when the dataset is 50 TB. Under-engineering produces systems that fail at scale, requiring emergency rewrites under production pressure.

Both mistakes are avoided by spending 15 minutes on estimation before architectural decisions are made. The numbers don't need to be precise - an order of magnitude estimate is enough to avoid both failure modes.

Key Numbers Every ML Engineer Should Know

Internalize these. They are the vocabulary of estimation.

Compute

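| Hardware | FP32 FLOPS | FP16 FLOPS (Tensor Core, dense) | Memory BW | Memory |
|---|---|---|---|---|
| A100 80GB | 19.5 TFLOPS | 312 TFLOPS | 2 TB/s | 80 GB |
| A100 40GB | 19.5 TFLOPS | 312 TFLOPS | 1.6 TB/s | 40 GB |
| H100 SXM | 67 TFLOPS | ~989 TFLOPS (1,979 with sparsity) | 3.35 TB/s | 80 GB |
| RTX 4090 | 82.6 TFLOPS | 165 TFLOPS | 1 TB/s | 24 GB |
| CPU (modern) | ~1 TFLOPS | ~2 TFLOPS | 50-100 GB/s | varies |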

The key insight: GPU memory bandwidth is often the bottleneck for inference, not FLOPS. An autoregressive LLM generating tokens one-at-a-time is memory-bandwidth-bound because it must load all model weights for every token generated.
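To see why bandwidth dominates, it helps to compare how many FLOPs each GPU can execute in the time it takes to read one byte from memory. The following is a minimal sketch using the A100 and RTX 4090 figures from the table above; the `GPUS` dictionary and the `ops_per_byte` helper are illustrative names introduced here, not part of any library.

```python
# Peak FP16 throughput (FLOP/s) and memory bandwidth (bytes/s) from the table
# above. Vendor specs use decimal units (1 TB/s = 1e12 bytes/s).
GPUS = {
    "A100 80GB": {"fp16_flops": 312e12, "mem_bw": 2.0e12},
    "A100 40GB": {"fp16_flops": 312e12, "mem_bw": 1.6e12},
    "RTX 4090": {"fp16_flops": 165e12, "mem_bw": 1.0e12},
}


def ops_per_byte(fp16_flops: float, mem_bw: float) -> float:
    """FLOPs the GPU can execute in the time it takes to read one byte."""
    return fp16_flops / mem_bw


for name, spec in GPUS.items():
    ratio = ops_per_byte(spec["fp16_flops"], spec["mem_bw"])
    print(f"{name}: ~{ratio:.0f} FLOPs per byte read")

# A100 80GB: ~156 FLOPs per byte read
# A100 40GB: ~195 FLOPs per byte read
# RTX 4090: ~165 FLOPs per byte read
```

A workload that performs far fewer FLOPs per byte than this ratio leaves the compute units waiting on memory, which is the situation in single-stream token generation.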

Storage and Memory

| Unit | Size | Practical examples |
|---|---|---|
| 1 float32 | 4 bytes | Single model weight |
| 1 float16/bfloat16 | 2 bytes | Half-precision weight |
| 1 int8 | 1 byte | INT8 quantized weight |
| 1B parameters (fp32) | 4 GB | GPT-2 XL (1.5B, ~6 GB) fits on 1 GPU |
| 7B parameters (fp16) | 14 GB | Llama 7B fits on 1 A100 40GB |
| 70B parameters (fp16) | 140 GB | Needs 2 × A100 80GB with tensor parallelism |
| 1 hour of video (720p) | ~1 GB | - |
| 1 million text tokens | ~4 MB | - |
| 1 trillion text tokens | ~4 TB | LLaMA 7B training data scale |

Networking

| Connection | Bandwidth | Latency |
|---|---|---|
| NVLink (GPU-GPU, same node) | 600 GB/s | microseconds |
| InfiniBand HDR (node-node) | 25 GB/s | microseconds |
| 100GbE (datacenter) | 12.5 GB/s | <1 ms |
| Internet backbone | varies | 10-100 ms |

Cost (approximate, 2024-2025)

| Resource | Cost |
|---|---|
| A100 80GB (cloud, on-demand) | $3-4/hour |
| A100 80GB (cloud, spot/preemptible) | $1-2/hour |
| H100 SXM (cloud, on-demand) | $8-12/hour |
| S3/GCS storage | $0.023/GB/month |
| Data transfer (egress) | $0.09/GB |

The Estimation Framework

The estimation cascade: Users → Requests → Data → Compute → Storage → Cost

Model Size Estimation

The first estimation every ML engineer must do: how much memory does the model require?

Rule: Model memory (in bytes) ≈ number of parameters × bytes per parameter

```python
def estimate_model_memory_gb(
    num_parameters: float,  # e.g., 7e9 for 7B model
    precision: str = "fp16",
) -> float:
    """Estimate GPU memory required for model weights."""
    bytes_per_param = {
        "fp32": 4,
        "fp16": 2,
        "bf16": 2,
        "int8": 1,
        "int4": 0.5,
    }
    bpp = bytes_per_param[precision]
    model_bytes = num_parameters * bpp
    # Divides by 1024**3, so the result is in GiB (slightly below decimal GB)
    model_gb = model_bytes / (1024 ** 3)
    return model_gb

# Examples
print(f"GPT-2 XL (1.5B, fp32): {estimate_model_memory_gb(1.5e9, 'fp32'):.1f} GB")
print(f"Llama 7B (7B, fp16): {estimate_model_memory_gb(7e9, 'fp16'):.1f} GB")
print(f"Llama 13B (13B, fp16): {estimate_model_memory_gb(13e9, 'fp16'):.1f} GB")
print(f"GPT-4 estimate (1T, fp16): {estimate_model_memory_gb(1e12, 'fp16'):.0f} GB")

# Output:
# GPT-2 XL (1.5B, fp32): 5.6 GB
# Llama 7B (7B, fp16): 13.0 GB
# Llama 13B (13B, fp16): 24.2 GB
# GPT-4 estimate (1T, fp16): 1863 GB
```

Model weights are only the floor. Serving also needs memory for activations and the KV cache, and training additionally needs gradients and optimizer states:

```python
def estimate_training_memory_gb(
    num_parameters: float,
    precision: str = "fp16",
    optimizer: str = "adam",
) -> dict:
    """
    Training memory breakdown.
    Adam optimizer with mixed precision (common setup):
    - fp16 weights: 2 bytes/param
    - fp32 master weights (for optimizer): 4 bytes/param
    - Adam: fp32 gradient + 2 fp32 momentum terms = 12 bytes/param
    Total: ~18 bytes/param for mixed-precision Adam
    """
    model_gb = estimate_model_memory_gb(num_parameters, precision)

    if optimizer == "adam" and precision in ("fp16", "bf16"):
        # Mixed precision training: fp16 forward, fp32 optimizer states
        optimizer_bytes_per_param = 4 + 4 + 4  # master weights + 2 moments
        grad_bytes_per_param = 4  # fp32 gradients
        # Crude placeholder (~10% of fp16 weights); real activation memory
        # depends on batch size, sequence length, and checkpointing.
        activation_overhead_gb = num_parameters * 2 / (1024**3) * 0.1
    else:
        optimizer_bytes_per_param = 0
        grad_bytes_per_param = 2
        activation_overhead_gb = 0

    optimizer_gb = (num_parameters * optimizer_bytes_per_param) / (1024**3)
    gradient_gb = (num_parameters * grad_bytes_per_param) / (1024**3)

    return {
        "model_weights": model_gb,
        "optimizer_states": optimizer_gb,
        "gradients": gradient_gb,
        "activations_estimate": activation_overhead_gb,
        "total_estimate": model_gb + optimizer_gb + gradient_gb + activation_overhead_gb,
    }

breakdown = estimate_training_memory_gb(7e9, "fp16", "adam")
for k, v in breakdown.items():
    print(f" {k}: {v:.1f} GB")
# model_weights: 13.0 GB
# optimizer_states: 78.2 GB <-- this is why training needs much more memory than inference!
# gradients: 26.1 GB
# activations_estimate: 1.3 GB
# total_estimate: 118.7 GB <-- need at least 2 x A100 80GB
```
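The same breakdown can be turned into a GPU count. The sketch below is one rough way to do it, assuming the ~18 bytes/param figure above, training state sharded evenly across GPUs (ZeRO/FSDP-style), and 20% of each GPU reserved for activations and fragmentation. The function name and the `usable_fraction` default are illustrative assumptions, not measured values.

```python
import math


def min_gpus_for_training_state(
    num_parameters: float,
    bytes_per_param: float = 18.0,  # mixed-precision Adam, per the breakdown above
    gpu_memory_gb: float = 80.0,    # A100 80GB
    usable_fraction: float = 0.8,   # assumption: keep ~20% free for activations
) -> int:
    """Smallest GPU count whose combined usable memory holds the training state."""
    state_gb = num_parameters * bytes_per_param / 1e9  # decimal GB
    return math.ceil(state_gb / (gpu_memory_gb * usable_fraction))


for params in (7e9, 13e9, 70e9):
    gpus = min_gpus_for_training_state(params)
    print(f"{params / 1e9:.0f}B parameters: at least {gpus} x A100 80GB")

# 7B parameters: at least 2 x A100 80GB
# 13B parameters: at least 4 x A100 80GB
# 70B parameters: at least 20 x A100 80GB
```

This is a memory floor only. Real clusters are usually much larger, because the GPU count is chosen to finish training in an acceptable time, not just to fit the state.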

Training Compute Estimation: The 6PD Rule

How many FLOPs does it take to train a model? The answer comes from Kaplan et al. (2020), "Scaling Laws for Neural Language Models," and Hoffmann et al. (2022), "Training Compute-Optimal Large Language Models" (the Chinchilla paper):

$$C \approx 6 \times P \times D$$

Where:

  • $C$ = total training FLOPs
  • $P$ = number of model parameters
  • $D$ = number of training tokens

The factor of 6 comes from two pieces. The forward pass costs about 2 FLOPs per parameter per token (one multiply and one add). The backward pass costs about twice the forward pass, because it computes gradients with respect to both activations and weights. Forward plus backward is therefore ≈ 3× the forward cost, or 6 FLOPs per parameter per token.
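Written out, under the usual approximation that matrix multiplications against the weights dominate and the attention-score FLOPs are ignored, the accounting is:

```latex
\begin{aligned}
C_{\text{forward}}  &\approx 2 \times P \times D \\
C_{\text{backward}} &\approx 2 \times C_{\text{forward}} = 4 \times P \times D \\
C &= C_{\text{forward}} + C_{\text{backward}} \approx 6 \times P \times D
\end{aligned}
```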

```python
def estimate_training_flops(
    num_parameters: float,  # e.g., 7e9 for 7B
    training_tokens: float,  # e.g., 1e12 for 1 trillion tokens
) -> dict:
    """
    Estimate training compute using the 6PD rule.
    Reference: Kaplan et al. (2020) Scaling Laws for Neural LMs
    """
    total_flops = 6 * num_parameters * training_tokens

    # Convert to practical units
    petaflops = total_flops / 1e15
    petaflop_days = petaflops / (24 * 3600)

    # Time estimate on GPU cluster
    a100_fp16_flops_per_second = 312e12  # 312 TFLOPS fp16
    # Assume 40% MFU (Model FLOPS Utilization) - typical for large model training
    effective_flops_per_second = a100_fp16_flops_per_second * 0.40

    a100_hours_single_gpu = (total_flops / effective_flops_per_second) / 3600
    a100_days_single_gpu = a100_hours_single_gpu / 24

    # On a 64-GPU cluster, assuming perfect linear scaling. Total GPU-hours,
    # and therefore cost at $2/hr spot, do not depend on the cluster size.
    cluster_hours = a100_hours_single_gpu / 64
    cluster_cost_64 = cluster_hours * 64 * 2.0

    return {
        "total_flops": f"{total_flops:.2e}",
        "petaflop_days": f"{petaflop_days:.1f}",
        "single_a100_days": f"{a100_days_single_gpu:.0f}",
        "cluster_64_gpu_days": f"{a100_days_single_gpu/64:.1f}",
        "cost_64_gpu_cluster_usd": f"${cluster_cost_64:,.0f}",
    }

# Practical examples
print("=== Llama 7B (1T tokens) ===")
result = estimate_training_flops(7e9, 1e12)
for k, v in result.items():
    print(f" {k}: {v}")

print("\n=== Llama 13B (1.4T tokens) ===")
result = estimate_training_flops(13e9, 1.4e12)
for k, v in result.items():
    print(f" {k}: {v}")

# === Llama 7B (1T tokens) ===
# total_flops: 4.20e+22
# petaflop_days: 486.1
# single_a100_days: 3895
# cluster_64_gpu_days: 60.9
# cost_64_gpu_cluster_usd: $186,966

# === Llama 13B (1.4T tokens) ===
# cost_64_gpu_cluster_usd: $486,111
```

:::note Chinchilla Scaling Law The Chinchilla paper (Hoffmann et al., 2022) showed that prior models were undertrained: too many parameters, not enough data. The compute-optimal ratio is approximately 20 tokens per parameter, so a 7B model should see ~140B tokens. Llama 2 nevertheless trained its 7B model on 2T tokens, far beyond Chinchilla-optimal. Training cost is paid once while inference cost is paid on every request, so over-training lets a smaller, cheaper-to-serve model reach the same quality. :::
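The 20-tokens-per-parameter heuristic is easy to turn into a quick check of how far a training plan sits from compute-optimal. A minimal sketch, treating the ratio as an approximation; the helper names are introduced here for illustration.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate ratio from the note above


def chinchilla_optimal_tokens(num_parameters: float) -> float:
    """Approximate compute-optimal token count for a given model size."""
    return CHINCHILLA_TOKENS_PER_PARAM * num_parameters


def overtraining_factor(num_parameters: float, training_tokens: float) -> float:
    """How many times more tokens than the compute-optimal count are used."""
    return training_tokens / chinchilla_optimal_tokens(num_parameters)


for label, params, tokens in [
    ("7B on 140B tokens", 7e9, 140e9),
    ("7B on 2T tokens", 7e9, 2e12),
]:
    factor = overtraining_factor(params, tokens)
    train_flops = 6 * params * tokens  # 6PD rule
    print(f"{label}: {factor:.1f}x Chinchilla-optimal, {train_flops:.2e} training FLOPs")

# 7B on 140B tokens: 1.0x Chinchilla-optimal, 5.88e+21 training FLOPs
# 7B on 2T tokens: 14.3x Chinchilla-optimal, 8.40e+22 training FLOPs
```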

Inference Throughput Estimation

For serving, you need to know how many requests per second a single GPU can handle.

```python
def estimate_inference_throughput(
    model_params: float,  # number of parameters
    hardware_memory_bw_gbps: float = 2000,  # GB/s (gigabytes); A100 80GB: ~2000
    precision_bytes: int = 2,  # fp16 = 2 bytes
    batch_size: int = 1,
    avg_output_tokens: int = 100,  # for generative models
) -> dict:
    """
    For autoregressive generation (LLMs), the bottleneck is memory bandwidth,
    not compute FLOPS. Every decoding step loads all model weights once.

    tokens/s per sequence ≈ memory bandwidth / model size in bytes
    """
    model_size_bytes = model_params * precision_bytes
    # Decimal GB, matching how vendors quote memory bandwidth
    model_size_gb = model_size_bytes / 1e9

    # Memory bandwidth from GB/s to bytes/s (1 GB/s = 1e9 bytes/s)
    memory_bw_bytes_per_second = hardware_memory_bw_gbps * 1e9

    # Time to load all weights once (one token generation step)
    time_per_token_seconds = model_size_bytes / memory_bw_bytes_per_second

    # Tokens per second per GPU (without batching)
    tokens_per_second = 1.0 / time_per_token_seconds

    # Requests per second (each request generates avg_output_tokens tokens)
    requests_per_second = tokens_per_second / avg_output_tokens

    # With batching: one weight load serves the whole batch, so throughput
    # scales roughly linearly. Capped at 16 as a crude stand-in for memory limits.
    effective_batch = min(batch_size, 16)
    batch_tokens_per_second = tokens_per_second * effective_batch
    batch_requests_per_second = batch_tokens_per_second / avg_output_tokens

    return {
        "model_size_gb": f"{model_size_gb:.1f} GB",
        "time_per_token_ms": f"{time_per_token_seconds*1000:.1f} ms",
        "tokens_per_second_bs1": f"{tokens_per_second:.0f}",
        "requests_per_second_bs1": f"{requests_per_second:.1f}",
        "requests_per_second_batched": f"{batch_requests_per_second:.1f}",
    }

print("=== Llama 7B on A100 80GB (batch of 16) ===")
for k, v in estimate_inference_throughput(7e9, 2000, 2, 16, 100).items():
    print(f" {k}: {v}")

# === Llama 7B on A100 80GB (batch of 16) ===
# model_size_gb: 14.0 GB
# time_per_token_ms: 7.0 ms
# tokens_per_second_bs1: 143
# requests_per_second_bs1: 1.4 <-- very low! need batching or more GPUs
# requests_per_second_batched: 22.9
```

This is why LLM serving is expensive: by this estimate, a 7B model on an A100 serves roughly 1–2 requests/second without batching and 20–25 with a batch of 16. To serve 1,000 QPS, you need on the order of 40–50 A100s even for a 7B model.
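The fleet-size step is a single division, but rounding up and adding headroom are easy to forget. A small sketch: the per-GPU rates of 25 and 1.5 requests/second are rounded stand-ins for the batched and unbatched estimates above, and `gpus_for_target_qps` is a hypothetical helper.

```python
import math


def gpus_for_target_qps(
    target_qps: float,
    requests_per_second_per_gpu: float,
    headroom: float = 1.0,  # e.g. 1.5 keeps 50% spare capacity
) -> int:
    """GPU count needed to sustain target_qps, rounded up to whole GPUs."""
    return math.ceil(target_qps * headroom / requests_per_second_per_gpu)


print(gpus_for_target_qps(1_000, 25))                # batched -> 40
print(gpus_for_target_qps(1_000, 1.5))               # unbatched -> 667
print(gpus_for_target_qps(1_000, 25, headroom=1.5))  # batched, 50% spare -> 60
```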

Storage Estimation

Feature Store Storage

```python
def estimate_feature_store_size(
    num_users: int,
    features_per_user: int,
    bytes_per_feature: float = 4.0,  # float32 = 4 bytes
    num_items: int = 0,
    features_per_item: int = 0,
    embedding_dim: int = 256,  # embedding dimension
) -> dict:
    """
    Estimate storage for an online feature store (e.g., Redis).
    Includes user features, item features, and embeddings.
    """
    user_feature_bytes = num_users * features_per_user * bytes_per_feature
    item_feature_bytes = num_items * features_per_item * bytes_per_feature

    # Embeddings: stored as float32 vectors
    user_embedding_bytes = num_users * embedding_dim * 4
    item_embedding_bytes = num_items * embedding_dim * 4

    # Redis overhead: ~2x raw data size due to keys, metadata, encoding
    redis_overhead = 2.0

    total_raw_gb = (user_feature_bytes + item_feature_bytes +
                    user_embedding_bytes + item_embedding_bytes) / (1024**3)
    total_redis_gb = total_raw_gb * redis_overhead

    return {
        "user_features_gb": user_feature_bytes / (1024**3),
        "item_features_gb": item_feature_bytes / (1024**3),
        "embeddings_gb": (user_embedding_bytes + item_embedding_bytes) / (1024**3),
        "total_raw_gb": total_raw_gb,
        "total_redis_estimate_gb": total_redis_gb,
        # 20% headroom on top of the Redis estimate, rounded up to whole nodes
        "redis_instance_recommendation": f"{int(total_redis_gb * 1.2 / 64) + 1} x 64GB Redis nodes",
    }

# Recommendation system: 50M users, 1B items
result = estimate_feature_store_size(
    num_users=50_000_000,
    features_per_user=100,
    num_items=1_000_000_000,
    features_per_item=50,
    embedding_dim=256,
)
for k, v in result.items():
    if isinstance(v, float):
        print(f" {k}: {v:.1f} GB")
    else:
        print(f" {k}: {v}")
```

Training Data Storage

```python
def estimate_training_data_storage(
    dau: int,
    events_per_user_per_day: float,
    bytes_per_event: int = 200,
    retention_days: int = 365,
    compression_ratio: float = 4.0,  # Parquet+Snappy typically 4-5x
) -> dict:
    """Estimate storage for raw events and compressed training data."""
    raw_events_per_day = dau * events_per_user_per_day
    raw_bytes_per_day = raw_events_per_day * bytes_per_event
    raw_gb_per_day = raw_bytes_per_day / (1024**3)
    # Total retained volume over retention_days (one year by default)
    raw_tb_per_year = raw_gb_per_day * retention_days / 1024

    compressed_tb_per_year = raw_tb_per_year / compression_ratio

    # S3 cost at $0.023/GB/month
    storage_cost_per_month = compressed_tb_per_year * 1024 * 0.023

    return {
        "raw_events_per_day": f"{raw_events_per_day:,.0f}",
        "raw_gb_per_day": f"{raw_gb_per_day:.1f}",
        "raw_tb_per_year": f"{raw_tb_per_year:.1f}",
        "compressed_tb_per_year": f"{compressed_tb_per_year:.1f}",
        "s3_cost_per_month_usd": f"${storage_cost_per_month:,.0f}",
    }

# Netflix-scale recommendation events
result = estimate_training_data_storage(
    dau=50_000_000,
    events_per_user_per_day=100,  # views, clicks, ratings
    bytes_per_event=200,
    retention_days=365,
)
for k, v in result.items():
    print(f" {k}: {v}")

# raw_events_per_day: 5,000,000,000
# raw_gb_per_day: 931.3
# raw_tb_per_year: 332.0
# compressed_tb_per_year: 83.0
# s3_cost_per_month_usd: $1,955
```

Worked Example: Real-Time Fraud Detection System

Let's walk through a complete estimation for a fraud detection system at payment scale.

Given: 10M active cards, average 3 transactions/card/day, peak 5× average during business hours.

```python
# Step 1: Traffic
dau = 10_000_000  # active cards
avg_transactions_per_day = 3
peak_factor = 5.0

avg_qps = (dau * avg_transactions_per_day) / 86_400
peak_qps = avg_qps * peak_factor
print(f"Average QPS: {avg_qps:.0f}")  # ~347 QPS
print(f"Peak QPS: {peak_qps:.0f}")  # ~1,736 QPS

# Step 2: Model sizing (gradient boosted tree for fraud)
# XGBoost model: 500 trees, max depth 6, ~1M parameters
# XGBoost CPU inference: ~2ms per request
model_latency_ms = 2.0
cpu_requests_per_second = 1000 / model_latency_ms  # 500 RPS per CPU core

# Number of CPU cores needed
cores_needed = peak_qps / cpu_requests_per_second
print(f"CPU cores needed (no headroom): {cores_needed:.1f}")  # ~3.5 cores
print(f"With 3x headroom: {cores_needed * 3:.0f} cores")  # ~10 cores

# Step 3: Feature store sizing
# Features: user velocity (30-min/1hr/24hr transaction count/amount),
# device fingerprint, merchant history, card age
# ~50 numerical features per inference
features_per_request = 50
feature_fetch_redis_ops = peak_qps  # one Redis GET per request
print(f"Redis GET ops/second: {feature_fetch_redis_ops:.0f}")  # 1736/s (trivial)

# User feature data in Redis
user_feature_size_bytes = features_per_request * 4  # 50 float32 features
total_redis_bytes = dau * user_feature_size_bytes
print(f"Redis memory for user features: {total_redis_bytes/1e9:.1f} GB")  # 2 GB (tiny!)

# Step 4: Training data storage
# 10M users × 3 tx/day × 365 days × 200 bytes/tx
raw_bytes_per_year = 10_000_000 * 3 * 365 * 200
print(f"Raw training data per year: {raw_bytes_per_year/1e9:.1f} GB")  # 2,190 GB (2.19 TB)
print(f"Compressed (4x): {raw_bytes_per_year/1e9/4:.1f} GB")  # ~548 GB

# Step 5: Cost estimate
# Serving: 4 x 8-core machines (c5.2xlarge, ~$0.40/hr) = $1.60/hr = $1,150/month
# Storage: 548 GB S3 = $12.60/month
# Total: ~$1,200/month
print("\nEstimated monthly serving cost: ~$1,200")
print("Conclusion: This is a CPU-served, non-GPU problem at this scale")
```

Conclusion: Fraud detection at 10M active cards does NOT need GPU serving. A gradient boosted model runs on CPU at about 2 ms per request. Feature storage fits in a single small Redis instance (2 GB - trivially small), and inference needs only a modest compute cluster. The expensive part is building the streaming feature pipeline (Kafka + Flink for real-time velocity features), not the model serving.

Worked Example: LLM Serving

```python
import math

# LLM serving: 100K MAU, avg 5 requests/day, avg 500 output tokens/request

mau = 100_000
requests_per_user_per_day = 5
output_tokens_per_request = 500

# Daily requests, assuming 30% of MAU are active on a given day
dau_requests = mau * 0.3 * requests_per_user_per_day
avg_qps = dau_requests / 86_400
peak_qps = avg_qps * 3  # 3x peak

print(f"Average QPS: {avg_qps:.1f}")  # ~1.7 QPS (much lower than it sounds!)
print(f"Peak QPS: {peak_qps:.1f}")  # ~5.2 QPS

# Tokens per second needed
avg_tokens_per_second = avg_qps * output_tokens_per_request
peak_tokens_per_second = peak_qps * output_tokens_per_request
print(f"Average tokens/s: {avg_tokens_per_second:.0f}")  # ~868 tok/s
print(f"Peak tokens/s: {peak_tokens_per_second:.0f}")  # ~2,604 tok/s

# Llama 7B on A100: ~150 tokens/second at batch size 1 (bandwidth-bound estimate)
tokens_per_second_per_gpu = 150

gpus_needed = peak_tokens_per_second / tokens_per_second_per_gpu
print(f"GPUs needed: {gpus_needed:.1f}")  # ~17.4 -> 18 GPUs without batching

# With vLLM continuous batching: much better throughput
# vLLM can achieve 500-1000 tokens/second on A100 with large batches
vllm_tokens_per_second = 600
gpus_vllm = peak_tokens_per_second / vllm_tokens_per_second
print(f"GPUs needed with vLLM: {gpus_vllm:.1f}")  # ~4.3 -> 5 GPUs

# Monthly cost: 5 x A100 80GB at ~$3/hr on-demand
gpu_count = math.ceil(gpus_vllm)
monthly_gpu_cost = gpu_count * 3.0 * 24 * 30
print(f"Monthly GPU cost: ${monthly_gpu_cost:,.0f}")  # $10,800/month
```

Common Estimation Mistakes

:::danger Off-By-Ten Errors The most common mistake is ignoring the difference between GB and TB, or between MB and GB. Always write units explicitly. "The feature store needs 400 of storage" is useless. "The feature store needs 400 GB of RAM, which requires a 7-node Redis cluster at 64 GB per node with 10% headroom" is an engineering decision. :::
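One habit that helps is to never print a bare number. The sketch below illustrates the idea with two hypothetical helpers, `format_bytes` and `nodes_needed`, and made-up inputs (300 GB, 64 GB nodes, 25% headroom).

```python
import math


def format_bytes(num_bytes: float) -> str:
    """Render a byte count with an explicit decimal unit."""
    for unit in ("B", "KB", "MB", "GB", "TB", "PB"):
        if num_bytes < 1000 or unit == "PB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000


def nodes_needed(required_gb: float, node_gb: float, headroom: float = 0.10) -> int:
    """Nodes required to hold required_gb plus fractional headroom, rounded up."""
    return math.ceil(required_gb * (1 + headroom) / node_gb)


print(format_bytes(300e9))                   # 300.0 GB
print(format_bytes(2.19e12))                 # 2.2 TB
print(nodes_needed(300, 64, headroom=0.25))  # 6
```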

:::warning Forgetting Peak Traffic Always estimate for peak, not average. A system designed for average QPS fails on the first holiday or product launch. Rule of thumb: 3× for consumer products, 5× for payments/financial, 10× for advertising (Black Friday). If you don't know the peak factor, use 5× and document the assumption. :::

:::warning Forgetting Inference Memory Overhead Model weights are not the only memory cost during inference. For transformer models, the KV cache (key-value cache for attention) can be 2–4× the model size during long-context inference. A 7B Llama model (14 GB weights in fp16) serving 10 concurrent requests with 4K context each needs approximately 14 GB (weights) + 20 GB (KV cache) = 34 GB - doesn't fit on a 24 GB consumer GPU. :::
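The 20 GB figure can be reproduced from the model shape. The sketch below assumes a Llama-7B-like configuration (32 layers, hidden size 4096, standard multi-head attention, fp16 cache); models that use grouped-query attention store fewer key/value heads and need proportionally less. `kv_cache_bytes` is an illustrative helper, not a library function.

```python
def kv_cache_bytes(
    num_layers: int,
    hidden_size: int,
    context_tokens: int,
    concurrent_requests: int,
    bytes_per_value: int = 2,  # fp16
) -> int:
    """KV cache size: one key and one value vector per layer per token."""
    per_token = 2 * num_layers * hidden_size * bytes_per_value
    return per_token * context_tokens * concurrent_requests


# Assumed Llama-7B-like shape: 32 layers, hidden size 4096
cache = kv_cache_bytes(
    num_layers=32,
    hidden_size=4096,
    context_tokens=4096,
    concurrent_requests=10,
)
print(f"KV cache: {cache / 1024**3:.1f} GiB")  # KV cache: 20.0 GiB
```

At 0.5 MiB per token, each 4K-context request holds about 2 GiB of cache, so concurrency is often limited by memory before it is limited by compute.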

Estimation Quick Reference

```python
# Quick estimation functions - useful in interviews

def model_gb(params_billions, dtype="fp16"):
    """Model size in GB. dtype: fp32=4B, fp16/bf16=2B, int8=1B, int4=0.5B"""
    bpb = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}
    return params_billions * bpb[dtype]

def training_cost_usd(params_billions, tokens_billions,
                      gpu_tflops_fp16=312, mfu=0.4,
                      spot_price_per_hour=2.0, num_gpus=64):
    flops = 6 * params_billions * 1e9 * tokens_billions * 1e9
    effective_flops_per_sec = gpu_tflops_fp16 * 1e12 * mfu
    cluster_seconds = flops / (effective_flops_per_sec * num_gpus)
    cluster_hours = cluster_seconds / 3600
    return cluster_hours * num_gpus * spot_price_per_hour

def peak_qps(dau_millions, requests_per_day, peak_factor=3.0):
    return (dau_millions * 1e6 * requests_per_day / 86_400) * peak_factor

# Examples
print(f"Llama 7B fp16: {model_gb(7):.0f} GB")
print(f"GPT-4 (1T fp16): {model_gb(1000):.0f} GB")
print(f"Train 7B/1T tokens (64x A100): ${training_cost_usd(7, 1000):,.0f}")
print(f"50M DAU, 10 req/day, 3x peak: {peak_qps(50, 10):.0f} QPS")
```

Interview Q&A

Q1: How would you estimate the compute cost to train a 10B parameter model on 200 billion tokens?

Use the 6PD rule: $C = 6 \times P \times D = 6 \times 10^{10} \times 2 \times 10^{11} = 1.2 \times 10^{22}$ FLOPs.

On a 64-GPU A100 cluster with 40% MFU: effective FLOPs = 64 × 312 × 10^12 × 0.4 = 7.99 × 10^15 FLOPs/second.

Time = 1.2 × 10^22 / 7.99 × 10^15 ≈ 1.5 × 10^6 seconds ≈ 17.4 days.

Cost at $2/hr spot × 64 GPUs × 417 hours ≈ $53,400.

Note: this is compute-optimal for a 10B model per Chinchilla (20 tokens/param × 10B = 200B tokens).

Q2: Why is LLM inference memory-bandwidth-bound rather than compute-bound?

For autoregressive token generation, the model generates one token at a time. Each token generation requires one forward pass through all transformer layers. The input sequence length is 1 new token (plus cached KV states), but you must load all model parameters from GPU memory to compute the forward pass.

The ratio of compute (FLOPs) to memory reads (bytes) for a single token is very low. Each weight read supports one multiply and one add, so about 2 FLOPs per 2-byte fp16 weight, or roughly 1 FLOP per byte. An A100 delivers 312 TFLOPS of FP16 compute but only 2 TB/s of memory bandwidth, a ratio of 156 FLOPs for every byte it can read. Inference at batch size 1 therefore uses well under 1% of the compute the hardware can deliver. The GPU sits idle waiting for data, not for compute.

Solution: batching increases the effective FLOPs/byte ratio, which is why vLLM's continuous batching dramatically improves throughput while adding latency.

Q3: How many GPU replicas do you need to serve GPT-3 (175B parameters) at 100 QPS with 200 output tokens per response?

First, model memory: 175B × 2 bytes (fp16) = 350 GB. This requires at least 5 × A100 80GB (400 GB total) using tensor parallelism.

Tokens per second needed: 100 QPS × 200 tokens = 20,000 tokens/second.

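With a 175B model on 8 × A100 (to fit comfortably with KV cache): time per token ≈ 350 GB / (8 × 2 TB/s) ≈ 22 ms. Tokens per second per replica: ~45 tok/s at batch size 1.

Replicas needed: 20,000 tok/s / 45 tok/s ≈ 444 replicas. Each replica is 8 GPUs. Total: ~3,552 A100s. This is the unbatched upper bound: batching many requests per replica reuses each weight read and cuts the GPU count roughly in proportion to the batch size, until compute or KV-cache memory becomes the limit.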

This is why GPT-3 at scale costs OpenAI an estimated $700K/day - it's the raw GPU economics.

Q4: How would you estimate the storage required for a feature store serving 100M users with 200 features each?

Feature storage: 100M users × 200 features × 4 bytes (float32) = 80 GB raw data.

Redis overhead: ~2× for keys, metadata, and encoding overhead = 160 GB Redis memory.

Embedding storage (512-dim): 100M × 512 × 4 bytes = 200 GB raw = 400 GB in Redis.

Total: ~560 GB Redis memory. This fits in a Redis cluster of 10 × 64 GB nodes (640 GB), before adding replicas for high availability.

S3 backing for the offline feature store: features computed daily and retained for 365 days = 80 GB × 365 / 4 (compression) ≈ 7.3 TB. At $0.023/GB/month, that is ~$168/month.

Q5: A product manager asks whether the team can train a 70B LLM on their $500K infrastructure budget. How would you evaluate this?

Training compute: 6 × 70B × D tokens. For Chinchilla-optimal: 70B × 20 = 1.4T tokens. FLOPs = 6 × 7 × 10^10 × 1.4 × 10^12 = 5.88 × 10^23.

On 256 × A100 cluster (reasonable for 70B): effective FLOPs = 256 × 312T × 0.40 = 3.19 × 10^16 per second. Time = 5.88 × 10^23 / 3.19 × 10^16 ≈ 1.84 × 10^7 seconds ≈ 213 days.

Cost at $2/hr spot × 256 GPUs × (213 × 24) hours ≈ $2.6M.

Conclusion: $500K is approximately 5× too small for a Chinchilla-optimal 70B model. A 7B model trained on 1T tokens (roughly Llama 7B scale) would cost ~$190K - within budget. The PM needs to reduce the target model scale by ~2-3× or increase the budget by ~5×.
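The same question can be run in reverse: given a budget, how large a Chinchilla-optimal model is affordable? Substituting D = 20P into C = 6PD gives C = 120P², so P = sqrt(C / 120). A rough sketch, assuming the same $2/hr spot price, 312 TFLOPS A100, and 40% MFU used above; `max_chinchilla_params` is a hypothetical helper name.

```python
import math


def max_chinchilla_params(
    budget_usd: float,
    price_per_gpu_hour: float = 2.0,  # assumed A100 spot price
    gpu_flops: float = 312e12,        # A100 fp16 peak
    mfu: float = 0.40,                # assumed utilization
    tokens_per_param: float = 20.0,   # approximate Chinchilla ratio
) -> float:
    """Largest compute-optimal parameter count trainable within budget_usd."""
    gpu_seconds = budget_usd / price_per_gpu_hour * 3600
    total_flops = gpu_seconds * gpu_flops * mfu
    # C = 6 * P * D with D = tokens_per_param * P  =>  P = sqrt(C / (6 * tokens_per_param))
    return math.sqrt(total_flops / (6 * tokens_per_param))


for budget in (500_000, 2_000_000):
    params = max_chinchilla_params(budget)
    print(f"${budget:,}: ~{params / 1e9:.0f}B parameters")

# $500,000: ~31B parameters
# $2,000,000: ~61B parameters
```

Because compute-optimal cost grows with the square of model size, quadrupling the budget only doubles the affordable parameter count.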

Summary

Back-of-envelope estimation is a practical skill, not a theoretical one. The key numbers to internalize: model size = parameters × bytes per parameter; training FLOPs = 6 × P × D; inference is memory-bandwidth-bound for LLMs; peak QPS = average × 3–5×.

The estimation cascade - users → requests → data → compute → storage → cost - gives you a systematic path from business requirements to infrastructure numbers. Combined with the reference hardware table, it takes 15 minutes to determine whether a proposed system is feasible, expensive but doable, or fundamentally misscaled.

:::tip Interview Technique In system design interviews, estimation is a signal of seniority. Candidates who say "we'd need some GPUs for serving" without quantifying are treated differently than candidates who say "at 1,000 QPS with a 7B model and 100-token average output, we need approximately 40 A100 GPUs with batched serving." The numbers don't need to be exact - within a factor of 2 is excellent. What matters is the reasoning. :::

© 2026 EngineersOfAI. All rights reserved.