Back-of-the-Envelope Estimation for ML Systems
The engineer who can estimate is worth ten engineers who can only measure. Measuring requires the system to exist. Estimating determines whether it should.
The Production Moment
Your manager has just come back from an executive meeting. "We're going to train a large language model," she says. "Internal use - document summarization, code review, customer support. The board approved a $2M infrastructure budget. Can we do it?"
You have 30 minutes before the follow-up meeting where you need to give an answer.
This is a back-of-envelope problem. You don't need exact numbers. You need to know whether a $2M budget is in the right ballpark or off by an order of magnitude. One order of magnitude means the project is feasible with careful planning. Two orders of magnitude means someone has fundamentally misunderstood what this costs.
You need the vocabulary of numbers - the key quantities every ML engineer should have internalized - and a systematic estimation framework that converts high-level goals into concrete infrastructure requirements.
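Here is what that looks like in practice. A 7B parameter model trained on 1 trillion tokens (a reasonable scale for a capable internal LLM) requires approximately 6 × 7 × 10^9 × 10^12 = 4.2 × 10^22 FLOPs. At 40% utilization on A100s rented at \$2/hour, that is roughly \$190K. A 70B model trained on 1.4 trillion tokens needs about 14× more compute, roughly \$2.6M. The answer to your manager's question depends entirely on the model scale she has in mind, and without asking, you can't give a useful answer.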
The estimation framework makes these calculations fast, reliable, and defensible.
Why Estimation Matters
Engineers who skip estimation make two types of expensive mistakes.
The over-engineering mistake: Building a distributed Spark cluster to process 10 GB of training data that fits in memory on a single machine. Adding a Redis cluster for feature caching when the feature data is 500 MB and can live on the serving machine's local disk. Over-engineering wastes weeks of development time and ongoing operational cost.
The under-engineering mistake: Designing a serving system for 1,000 QPS when the actual peak is 50,000 QPS. Building a feature pipeline that processes data with a single-machine Pandas script when the dataset is 50 TB. Under-engineering produces systems that fail at scale, requiring emergency rewrites under production pressure.
Both mistakes are avoided by spending 15 minutes on estimation before architectural decisions are made. The numbers don't need to be precise - an order of magnitude estimate is enough to avoid both failure modes.
Key Numbers Every ML Engineer Should Know
Internalize these. They are the vocabulary of estimation.
Compute
| Hardware | FP32 FLOPS | FP16 FLOPS | Memory BW | Memory |
|---|---|---|---|---|
| A100 80GB | 19.5 TFLOPS | 312 TFLOPS | 2 TB/s | 80 GB |
| A100 40GB | 19.5 TFLOPS | 312 TFLOPS | 1.6 TB/s | 40 GB |
| H100 SXM | 67 TFLOPS | ~989 TFLOPS (dense; 1,979 with sparsity) | 3.35 TB/s | 80 GB |
| RTX 4090 | 82.6 TFLOPS | 165 TFLOPS | 1 TB/s | 24 GB |
| CPU (modern) | ~1 TFLOPS | ~2 TFLOPS | 50-100 GB/s | varies |
The key insight: GPU memory bandwidth is often the bottleneck for inference, not FLOPS. An autoregressive LLM generating tokens one-at-a-time is memory-bandwidth-bound because it must load all model weights for every token generated.
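As a rough illustration of that claim, the sketch below turns the bandwidth column of the table into an upper bound on single-stream decoding speed for a 7B fp16 model. The helper name `max_decode_tokens_per_s` and the `GPUS` dictionary are illustrative, not part of any library, and the bound ignores KV-cache reads and kernel overhead, so real throughput will be lower.

```python
# Memory bandwidth (bytes/s) and memory capacity (bytes), taken from the table above.
GPUS = {
    "A100 80GB": (2.0e12, 80e9),
    "A100 40GB": (1.6e12, 40e9),
    "H100 SXM": (3.35e12, 80e9),
    "RTX 4090": (1.0e12, 24e9),
}


def max_decode_tokens_per_s(num_params: float, bytes_per_param: float, mem_bw: float) -> float:
    """Upper bound on tokens/s at batch size 1: every token reads all weights once."""
    return mem_bw / (num_params * bytes_per_param)


weight_bytes = 7e9 * 2  # 7B parameters in fp16
for name, (bandwidth, capacity) in GPUS.items():
    bound = max_decode_tokens_per_s(7e9, 2, bandwidth)
    print(f"{name}: <= {bound:.0f} tok/s, weights fit: {weight_bytes <= capacity}")
```

Note that the ranking follows memory bandwidth rather than the FLOPS columns, which is the point of the paragraph above.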
Storage and Memory
| Unit | Size | Practical examples |
|---|---|---|
| 1 float32 | 4 bytes | Single model weight |
| 1 float16/bfloat16 | 2 bytes | Half-precision weight |
| 1 int8 | 1 byte | INT8 quantized weight |
| 1B parameters (fp32) | 4 GB | GPT-2 XL (1.5B) needs ~6 GB, fits on 1 GPU |
| 7B parameters (fp16) | 14 GB | Llama 7B fits on 1 A100 40GB |
| 70B parameters (fp16) | 140 GB | Needs at least 2 x A100 80GB with tensor parallelism |
| 1 hour of video (720p) | ~1 GB | - |
| 1 million text tokens | ~4 MB | - |
| 1 trillion text tokens | ~4 TB | LLaMA 7B training data scale |
Networking
| Connection | Bandwidth | Latency |
|---|---|---|
| NVLink (GPU-GPU, same node) | 600 GB/s | microseconds |
| InfiniBand HDR (node-node) | 25 GB/s | microseconds |
| 100GbE (datacenter) | 12.5 GB/s | <1ms |
| Internet backbone | varies | 10-100ms |
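
To make the bandwidth column concrete, here is a minimal sketch that estimates how long one full copy of a 7B model's fp16 gradients (14 GB) takes on each link. The function name `transfer_seconds` is hypothetical, and the estimate is idealized: it ignores latency, protocol overhead, and the extra traffic of a real all-reduce.

```python
# Bandwidths in GB/s, copied from the table above.
LINK_BANDWIDTH_GB_PER_S = {
    "NVLink": 600.0,
    "InfiniBand HDR": 25.0,
    "100GbE": 12.5,
}


def transfer_seconds(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized time to move a payload across a link at full bandwidth."""
    return payload_gb / bandwidth_gb_per_s


gradient_gb = 14.0  # fp16 gradients of a 7B model: 7e9 params x 2 bytes
for link, bandwidth in LINK_BANDWIDTH_GB_PER_S.items():
    millis = transfer_seconds(gradient_gb, bandwidth) * 1000
    print(f"{link}: {millis:.0f} ms per full gradient copy")
```

The gap between tens of milliseconds inside a node and roughly a second across Ethernet is why multi-node training setups care so much about the interconnect.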
Cost (approximate, 2024-2025)
| Resource | Cost |
|---|---|
| A100 80GB (cloud, on-demand) | $3-4/hour |
| A100 80GB (cloud, spot/preemptible) | $1-2/hour |
| H100 SXM (cloud, on-demand) | $8-12/hour |
| S3/GCS storage | $0.023/GB/month |
| Data transfer (egress) | $0.09/GB |
The Estimation Framework
The estimation cascade: Users → Requests → Data → Compute → Storage → Cost
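The cascade can be sketched as a single function that walks each arrow in turn. Everything here is an assumption for illustration: the `Workload` fields, the `run_cascade` name, and the example inputs (1M DAU, 10 requests per user per day, 500 requests/second per server) are hypothetical, and the storage price is the S3 figure from the cost table.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    daily_active_users: int
    requests_per_user_per_day: float
    bytes_logged_per_request: int
    peak_factor: float
    requests_per_second_per_server: float  # measured or assumed capacity
    server_cost_per_hour: float
    storage_cost_per_gb_month: float = 0.023


def run_cascade(w: Workload) -> dict:
    """Walk users -> requests -> data -> compute -> storage -> cost."""
    requests_per_day = w.daily_active_users * w.requests_per_user_per_day
    avg_qps = requests_per_day / 86_400
    peak_qps = avg_qps * w.peak_factor
    data_gb_per_day = requests_per_day * w.bytes_logged_per_request / 1e9
    servers = math.ceil(peak_qps / w.requests_per_second_per_server)
    storage_gb_after_one_year = data_gb_per_day * 365
    monthly_cost_usd = (
        servers * w.server_cost_per_hour * 24 * 30
        + storage_gb_after_one_year * w.storage_cost_per_gb_month
    )
    return {
        "requests_per_day": requests_per_day,
        "peak_qps": peak_qps,
        "data_gb_per_day": data_gb_per_day,
        "servers": servers,
        "storage_gb_after_one_year": storage_gb_after_one_year,
        "monthly_cost_usd": monthly_cost_usd,
    }


example = Workload(1_000_000, 10, 200, 3.0, 500.0, 0.40)
for key, value in run_cascade(example).items():
    print(f"{key}: {value:,.1f}")
```

Each intermediate value is returned rather than only the final cost, so an assumption that looks wrong can be challenged at the step where it enters.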
Model Size Estimation
The first estimation every ML engineer must do: how much memory does the model require?
Rule: Model memory (in bytes) ≈ number of parameters × bytes per parameter
def estimate_model_memory_gb(
    num_parameters: float,  # e.g., 7e9 for a 7B model
    precision: str = "fp16",
) -> float:
    """Estimate GPU memory required for model weights, in GiB (1024**3 bytes)."""
    bytes_per_param = {
        "fp32": 4,
        "fp16": 2,
        "bf16": 2,
        "int8": 1,
        "int4": 0.5,
    }
    bpp = bytes_per_param[precision]
    model_bytes = num_parameters * bpp
    model_gb = model_bytes / (1024 ** 3)
    return model_gb
# Examples (values are GiB, so slightly below the decimal-GB figures in the table above)
print(f"GPT-2 XL (1.5B, fp32): {estimate_model_memory_gb(1.5e9, 'fp32'):.1f} GB")
print(f"Llama 7B (7B, fp16): {estimate_model_memory_gb(7e9, 'fp16'):.1f} GB")
print(f"Llama 13B (13B, fp16): {estimate_model_memory_gb(13e9, 'fp16'):.1f} GB")
# The 1T parameter count is a hypothetical size, not a confirmed figure for any model
print(f"Hypothetical 1T model (fp16): {estimate_model_memory_gb(1e12, 'fp16'):.0f} GB")
# Output:
# GPT-2 XL (1.5B, fp32): 5.6 GB
# Llama 7B (7B, fp16): 13.0 GB
# Llama 13B (13B, fp16): 24.2 GB
# Hypothetical 1T model (fp16): 1863 GB
But weights are only part of the memory footprint. For training, you also need gradients and optimizer states:
def estimate_training_memory_gb(
    num_parameters: float,
    precision: str = "fp16",
    optimizer: str = "adam",
) -> dict:
    """
    Training memory breakdown.
    Adam optimizer with mixed precision (common setup):
    - fp16 weights: 2 bytes/param
    - fp32 master weights + 2 fp32 Adam moments: 12 bytes/param
    - fp32 gradients: 4 bytes/param
    Total: ~18 bytes/param for mixed-precision Adam
    """
    model_gb = estimate_model_memory_gb(num_parameters, precision)
    if optimizer == "adam" and precision in ("fp16", "bf16"):
        # Mixed precision training: fp16 forward, fp32 optimizer states
        optimizer_bytes_per_param = 4 + 4 + 4  # master weights + 2 moments
        grad_bytes_per_param = 4  # fp32 gradients
        # Crude placeholder: ~10% of fp16 weight size; real activation memory
        # depends on batch size and sequence length
        activation_overhead_gb = num_parameters * 2 / (1024**3) * 0.1
    else:
        optimizer_bytes_per_param = 0
        grad_bytes_per_param = 2
        activation_overhead_gb = 0
    optimizer_gb = (num_parameters * optimizer_bytes_per_param) / (1024**3)
    gradient_gb = (num_parameters * grad_bytes_per_param) / (1024**3)
    return {
        "model_weights": model_gb,
        "optimizer_states": optimizer_gb,
        "gradients": gradient_gb,
        "activations_estimate": activation_overhead_gb,
        "total_estimate": model_gb + optimizer_gb + gradient_gb + activation_overhead_gb,
    }
breakdown = estimate_training_memory_gb(7e9, "fp16", "adam")
for k, v in breakdown.items():
    print(f" {k}: {v:.1f} GB")
# model_weights: 13.0 GB
# optimizer_states: 78.2 GB <-- this is why training needs much more memory than inference!
# gradients: 26.1 GB
# activations_estimate: 1.3 GB
# total_estimate: 118.7 GB <-- need at least 2 x A100 80GB
Training Compute Estimation: The 6PD Rule
How many FLOPS does it take to train a model? The answer comes from Kaplan et al. (2020), "Scaling Laws for Neural Language Models," and Hoffmann et al. (2022), "Training Compute-Optimal Large Language Models" (Chinchilla paper):
$$C \approx 6PD$$
Where:
- $C$ = total training FLOPs
- $P$ = number of model parameters
- $D$ = number of training tokens
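The factor of 6 comes from: 2 FLOPs per parameter per token for the forward pass (one multiply and one add), and a backward pass that costs roughly twice the forward pass. Forward plus backward is therefore about 3× the forward cost, or 6 FLOPs per parameter per token.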
def estimate_training_flops(
    num_parameters: float,  # e.g., 7e9 for 7B
    training_tokens: float,  # e.g., 1e12 for 1 trillion tokens
) -> dict:
    """
    Estimate training compute using the 6PD rule.
    Reference: Kaplan et al. (2020) Scaling Laws for Neural LMs
    """
    total_flops = 6 * num_parameters * training_tokens
    # Convert to practical units (petaFLOP/s-days: 1e15 FLOP/s sustained for one day)
    petaflops = total_flops / 1e15
    petaflop_days = petaflops / (24 * 3600)
    # Time estimate on GPU cluster
    a100_fp16_flops_per_second = 312e12  # 312 TFLOPS fp16
    # Assume 40% MFU (Model FLOPS Utilization) - typical for large model training
    effective_flops_per_second = a100_fp16_flops_per_second * 0.40
    a100_hours_single_gpu = (total_flops / effective_flops_per_second) / 3600
    a100_days_single_gpu = a100_hours_single_gpu / 24
    # On 64-GPU cluster at $2/hr spot (total GPU-hours, and so cost, do not
    # change with cluster size; only wall-clock time does)
    cluster_hours = a100_hours_single_gpu / 64
    cluster_cost_64 = cluster_hours * 64 * 2.0
    return {
        "total_flops": f"{total_flops:.2e}",
        "petaflop_days": f"{petaflop_days:.1f}",
        "single_a100_days": f"{a100_days_single_gpu:.0f}",
        "cluster_64_gpu_days": f"{a100_days_single_gpu/64:.1f}",
        "cost_64_gpu_cluster_usd": f"${cluster_cost_64:,.0f}",
    }
# Practical examples
print("=== Llama 7B (1T tokens) ===")
result = estimate_training_flops(7e9, 1e12)
for k, v in result.items():
    print(f" {k}: {v}")
print("\n=== Llama 13B (1.4T tokens) ===")
result = estimate_training_flops(13e9, 1.4e12)
for k, v in result.items():
    print(f" {k}: {v}")
# === Llama 7B (1T tokens) ===
# total_flops: 4.20e+22
# petaflop_days: 486.1
# single_a100_days: 3895
# cluster_64_gpu_days: 60.9
# cost_64_gpu_cluster_usd: $186,966
# === Llama 13B (1.4T tokens) ===
# total_flops: 1.09e+23
# petaflop_days: 1263.9
# single_a100_days: 10127
# cluster_64_gpu_days: 158.2
# cost_64_gpu_cluster_usd: $486,111
:::note Chinchilla Scaling Law The Chinchilla paper (Hoffmann et al., 2022) showed that prior models were undertrained - too many parameters, not enough data. The compute-optimal training ratio is approximately 20 tokens per parameter: a 7B model should see ~140B tokens for compute-optimal training, but Llama 2 trained on 2T tokens (much more than Chinchilla-optimal) because inference is expensive and over-training reduces inference costs by allowing a smaller model to reach the same quality. :::
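The 20-tokens-per-parameter ratio is easy to apply in code. The sketch below uses two hypothetical helpers, `chinchilla_optimal_tokens` and `overtraining_factor`, and treats the ratio of 20 as the approximate rule of thumb quoted in the note rather than an exact constant.

```python
def chinchilla_optimal_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token count for a given model size."""
    return num_params * tokens_per_param


def overtraining_factor(num_params: float, training_tokens: float) -> float:
    """How many times more tokens were used than the compute-optimal amount."""
    return training_tokens / chinchilla_optimal_tokens(num_params)


print(f"7B optimal tokens: {chinchilla_optimal_tokens(7e9):.2e}")
print(f"7B on 2T tokens:   {overtraining_factor(7e9, 2e12):.1f}x optimal")
print(f"70B optimal tokens: {chinchilla_optimal_tokens(70e9):.2e}")
```

A factor well above 1 signals a deliberate trade: more training compute spent once in exchange for a smaller, cheaper model at inference time.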
Inference Throughput Estimation
For serving, you need to know how many requests per second a single GPU can handle.
def estimate_inference_throughput(
    model_params: float,  # number of parameters
    hardware_memory_bw_gbps: float = 2000,  # in GB/s (not gigabits); A100 80GB: ~2000
    precision_bytes: int = 2,  # fp16 = 2 bytes
    batch_size: int = 1,
    avg_output_tokens: int = 100,  # for generative models
) -> dict:
    """
    For autoregressive generation (LLMs), the bottleneck is memory bandwidth,
    not compute FLOPS. Every token generated loads all model weights once.
    Tokens/second at batch size 1 = memory bandwidth / model size in bytes
    """
    model_size_bytes = model_params * precision_bytes
    model_size_gb = model_size_bytes / (1024 ** 3)
    # Memory bandwidth in GB/s to bytes/s
    memory_bw_bytes_per_second = hardware_memory_bw_gbps * (1024 ** 3)
    # Time to load all weights once (one token generation step)
    time_per_token_seconds = model_size_bytes / memory_bw_bytes_per_second
    # Tokens per second per GPU (without batching)
    tokens_per_second = 1.0 / time_per_token_seconds
    # Requests per second (each request generates avg_output_tokens tokens)
    requests_per_second = tokens_per_second / avg_output_tokens
    # With batching: assume linear scaling, capped at 16x (an optimistic
    # simplification; real gains depend on KV-cache memory and sequence lengths)
    batch_tokens_per_second = tokens_per_second * min(batch_size, 16)
    batch_requests_per_second = batch_tokens_per_second / avg_output_tokens
    return {
        "model_size_gb": f"{model_size_gb:.1f} GB",
        "time_per_token_ms": f"{time_per_token_seconds*1000:.1f} ms",
        "tokens_per_second_bs1": f"{tokens_per_second:.0f}",
        "requests_per_second_bs1": f"{requests_per_second:.1f}",
        "requests_per_second_batched": f"{batch_requests_per_second:.1f}",
    }
print("=== Llama 7B on A100 80GB, batch size 16 ===")
for k, v in estimate_inference_throughput(7e9, 2000, 2, 16, 100).items():
    print(f" {k}: {v}")
# === Llama 7B on A100 80GB, batch size 16 ===
# model_size_gb: 13.0 GB
# time_per_token_ms: 6.5 ms
# tokens_per_second_bs1: 153
# requests_per_second_bs1: 1.5 <-- very low! need batching or more GPUs
# requests_per_second_batched: 24.5
This is why LLM serving is expensive: a 7B model on an A100 serves roughly 1–2 requests/second without batching, and about 25 at batch size 16. To serve 1,000 QPS, you need ~40 A100s even for a 7B model.
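The GPU count in that last sentence is a one-line division, sketched here with a hypothetical helper `gpus_for_target_qps`. The per-GPU rates (25 batched, about 1.5 unbatched) are the rough figures from this section, and the `headroom` multiplier is an added assumption for peak traffic.

```python
import math


def gpus_for_target_qps(
    target_qps: float,
    requests_per_second_per_gpu: float,
    headroom: float = 1.0,
) -> int:
    """GPUs needed to sustain target_qps, with an optional safety multiplier."""
    return math.ceil(target_qps * headroom / requests_per_second_per_gpu)


print(gpus_for_target_qps(1_000, 25))       # batched
print(gpus_for_target_qps(1_000, 1.5))      # unbatched
print(gpus_for_target_qps(1_000, 25, 1.5))  # batched, 50% headroom
```

Rounding up with `math.ceil` matters at small scale: 1.3 GPUs of demand still means buying 2.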
Storage Estimation
Feature Store Storage
def estimate_feature_store_size(
    num_users: int,
    features_per_user: int,
    bytes_per_feature: float = 4.0,  # float32 = 4 bytes
    num_items: int = 0,
    features_per_item: int = 0,
    embedding_dim: int = 256,  # embedding dimension
) -> dict:
    """
    Estimate storage for an online feature store (e.g., Redis).
    Includes user features, item features, and embeddings.
    """
    user_feature_bytes = num_users * features_per_user * bytes_per_feature
    item_feature_bytes = num_items * features_per_item * bytes_per_feature
    # Embeddings: stored as float32 vectors
    user_embedding_bytes = num_users * embedding_dim * 4
    item_embedding_bytes = num_items * embedding_dim * 4
    # Redis overhead: ~2x raw data size due to keys, metadata, encoding
    redis_overhead = 2.0
    total_raw_gb = (user_feature_bytes + item_feature_bytes +
                    user_embedding_bytes + item_embedding_bytes) / (1024**3)
    total_redis_gb = total_raw_gb * redis_overhead
    return {
        "user_features_gb": user_feature_bytes / (1024**3),
        "item_features_gb": item_feature_bytes / (1024**3),
        "embeddings_gb": (user_embedding_bytes + item_embedding_bytes) / (1024**3),
        "total_raw_gb": total_raw_gb,
        "total_redis_estimate_gb": total_redis_gb,
        "redis_instance_recommendation": f"{int(total_redis_gb * 1.2 / 64) + 1} x 64GB Redis nodes",
    }
# Recommendation system: 50M users, 1B items
result = estimate_feature_store_size(
    num_users=50_000_000,
    features_per_user=100,
    num_items=1_000_000_000,
    features_per_item=50,
    embedding_dim=256,
)
for k, v in result.items():
    if isinstance(v, float):
        print(f" {k}: {v:.1f} GB")
    else:
        print(f" {k}: {v}")
Training Data Storage
def estimate_training_data_storage(
    dau: int,
    events_per_user_per_day: float,
    bytes_per_event: int = 200,
    retention_days: int = 365,
    compression_ratio: float = 4.0,  # Parquet+Snappy typically 4-5x
) -> dict:
    """Estimate storage for raw events and compressed training data."""
    raw_events_per_day = dau * events_per_user_per_day
    raw_bytes_per_day = raw_events_per_day * bytes_per_event
    raw_gb_per_day = raw_bytes_per_day / (1024**3)
    # "per year" assumes the default retention_days=365
    raw_tb_per_year = raw_gb_per_day * retention_days / 1024
    compressed_tb_per_year = raw_tb_per_year / compression_ratio
    # S3 cost at $0.023/GB/month
    storage_cost_per_month = compressed_tb_per_year * 1024 * 0.023
    return {
        "raw_events_per_day": f"{raw_events_per_day:,.0f}",
        "raw_gb_per_day": f"{raw_gb_per_day:.1f}",
        "raw_tb_per_year": f"{raw_tb_per_year:.1f}",
        "compressed_tb_per_year": f"{compressed_tb_per_year:.1f}",
        "s3_cost_per_month_usd": f"${storage_cost_per_month:,.0f}",
    }
# Netflix-scale recommendation events
result = estimate_training_data_storage(
    dau=50_000_000,
    events_per_user_per_day=100,  # views, clicks, ratings
    bytes_per_event=200,
    retention_days=365,
)
for k, v in result.items():
    print(f" {k}: {v}")
# raw_events_per_day: 5,000,000,000
# raw_gb_per_day: 931.3
# raw_tb_per_year: 332.0
# compressed_tb_per_year: 83.0
# s3_cost_per_month_usd: $1,955
Worked Example: Real-Time Fraud Detection System
Let's walk through a complete estimation for a fraud detection system at payment scale.
Given: 10M active cards, average 3 transactions/card/day, peak 5× average during business hours.
# Step 1: Traffic
dau = 10_000_000
avg_transactions_per_day = 3
peak_factor = 5.0
avg_qps = (dau * avg_transactions_per_day) / 86_400
peak_qps = avg_qps * peak_factor
print(f"Average QPS: {avg_qps:.0f}") # ~347 QPS
print(f"Peak QPS: {peak_qps:.0f}") # ~1,736 QPS
# Step 2: Model sizing (gradient boosted tree for fraud)
# XGBoost model: 500 trees, max depth 6, ~1M parameters
# XGBoost CPU inference: ~2ms per request
model_latency_ms = 2.0
cpu_requests_per_second = 1000 / model_latency_ms # 500 RPS per CPU core
# Number of CPU cores needed
cores_needed = peak_qps / cpu_requests_per_second
print(f"CPU cores needed (no headroom): {cores_needed:.1f}") # ~3.5 cores
print(f"With 3x headroom: {cores_needed * 3:.0f} cores") # ~10 cores
# Step 3: Feature store sizing
# Features: user velocity (30-min/1hr/24hr transaction count/amount),
# device fingerprint, merchant history, card age
# ~50 numerical features per inference
features_per_request = 50
feature_fetch_redis_ops = peak_qps # one Redis GET per request
print(f"Redis GET ops/second: {feature_fetch_redis_ops:.0f}") # 1736/s (trivial)
# User feature data in Redis
user_feature_size_bytes = 50 * 4 # 50 float32 features
total_redis_bytes = dau * user_feature_size_bytes
print(f"Redis memory for user features: {total_redis_bytes/1e9:.1f} GB") # 2 GB (tiny!)
# Step 4: Training data storage
# 10M users × 3 tx/day × 365 days × 200 bytes/tx
raw_bytes_per_year = 10_000_000 * 3 * 365 * 200
print(f"Raw training data per year: {raw_bytes_per_year/1e9:.1f} GB") # 2190.0 GB (~2.19 TB)
print(f"Compressed (4x): {raw_bytes_per_year/1e9/4:.1f} GB") # 548 GB
# Step 5: Cost estimate
# Serving: 4 x 8-core machines (c5.2xlarge, ~$0.40/hr) = $1.60/hr = $1,150/month
# Storage: 548 GB S3 = $12.60/month
# Total: ~$1,200/month
print(f"\nEstimated monthly serving cost: ~$1,200")
print("Conclusion: This is a CPU-served, non-GPU problem at this scale")
Conclusion: Fraud detection at 10M active cards does NOT need GPU serving. A gradient boosted model runs on CPU with roughly 2 ms latency. Feature storage is about 2 GB, which fits on a single small Redis instance (replicated for availability), and inference needs only a modest compute cluster. The expensive part is building the streaming feature pipeline (Kafka + Flink for real-time velocity features), not the model serving.
Worked Example: LLM Serving
# LLM serving: 100K MAU, avg 5 requests/day, avg 500 output tokens/request
mau = 100_000
requests_per_user_per_day = 5
output_tokens_per_request = 500
# Daily requests: assume 30% of MAU are active on a given day
daily_active_users = mau * 0.3
dau_requests = daily_active_users * requests_per_user_per_day  # 150,000
avg_qps = dau_requests / 86_400
peak_qps = avg_qps * 3 # 3x peak
print(f"Average QPS: {avg_qps:.1f}") # ~1.7 QPS (much lower than it sounds!)
print(f"Peak QPS: {peak_qps:.1f}") # ~5.2 QPS
# Tokens per second needed
avg_tokens_per_second = avg_qps * output_tokens_per_request
peak_tokens_per_second = peak_qps * output_tokens_per_request
print(f"Average tokens/s: {avg_tokens_per_second:.0f}") # ~868 tok/s
print(f"Peak tokens/s: {peak_tokens_per_second:.0f}") # ~2,604 tok/s
# Llama 7B on A100: ~150 tokens/second at batch size 1
tokens_per_second_per_gpu = 150
gpus_needed = peak_tokens_per_second / tokens_per_second_per_gpu
print(f"GPUs needed: {gpus_needed:.1f}") # ~17.4 without batching
# With vLLM continuous batching: much better throughput
# Assumed here: 500-1000 tokens/second on A100 with large batches
vllm_tokens_per_second = 600
gpus_vllm = peak_tokens_per_second / vllm_tokens_per_second
print(f"GPUs needed with vLLM: {gpus_vllm:.1f}") # ~4.3 -> 5 GPUs
# Monthly cost: 5 x A100 80GB at ~$3/hr on-demand
monthly_gpu_cost = 5 * 3.0 * 24 * 30
print(f"Monthly GPU cost: ${monthly_gpu_cost:,.0f}") # $10,800/month
Common Estimation Mistakes
:::danger Order-of-Magnitude Unit Errors The most common mistake is mixing up GB and TB, or MB and GB, which is a 1,000× error. Always write units explicitly. "The feature store needs 400 of storage" is useless. "The feature store needs 400 GB of RAM, which requires a 7-node Redis cluster at 64 GB per node with 10% headroom" is an engineering decision. :::
:::warning Forgetting Peak Traffic Always estimate for peak, not average. A system designed for average QPS fails on the first holiday or product launch. Rule of thumb: 3× for consumer products, 5× for payments/financial, 10× for advertising (Black Friday). If you don't know the peak factor, use 5× and document the assumption. :::
:::warning Forgetting Inference Memory Overhead Model weights are not the only memory cost during inference. For transformer models, the KV cache (key-value cache for attention) can be 2–4× the model size during long-context inference. A 7B Llama model (14 GB weights in fp16) serving 10 concurrent requests with 4K context each needs approximately 14 GB (weights) + 20 GB (KV cache) = 34 GB - doesn't fit on a 24 GB consumer GPU. :::
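The 20 GB KV-cache figure can be reproduced with a short calculation. This is a minimal sketch with a hypothetical helper `kv_cache_bytes`; it assumes the commonly cited Llama 7B shape (32 layers, hidden size 4096), standard multi-head attention with no grouped-query sharing, and fp16 cache values.

```python
def kv_cache_bytes(
    num_layers: int,
    hidden_size: int,
    context_tokens: int,
    concurrent_requests: int,
    bytes_per_value: int = 2,  # fp16
) -> int:
    """KV-cache size: one key and one value vector per layer per cached token."""
    per_token = 2 * num_layers * hidden_size * bytes_per_value
    return per_token * context_tokens * concurrent_requests


cache = kv_cache_bytes(num_layers=32, hidden_size=4096,
                       context_tokens=4096, concurrent_requests=10)
weights = 7e9 * 2  # fp16 weights
print(f"KV cache per token: {kv_cache_bytes(32, 4096, 1, 1) / 1024:.0f} KiB")
print(f"KV cache total:     {cache / 1024**3:.0f} GiB")
print(f"Weights + cache:    {(weights + cache) / 1e9:.0f} GB")
```

Because the cache grows linearly in both context length and concurrency, doubling either one doubles the memory needed on top of the fixed weight cost.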
Estimation Quick Reference
# Quick estimation functions - useful in interviews
def model_gb(params_billions, dtype="fp16"):
    """Model size in decimal GB. Bytes per parameter: fp32=4, fp16/bf16=2, int8=1, int4=0.5"""
    bytes_per_param = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}
    return params_billions * bytes_per_param[dtype]
def training_cost_usd(params_billions, tokens_billions,
                      gpu_tflops_fp16=312, mfu=0.4,
                      spot_price_per_hour=2.0, num_gpus=64):
    flops = 6 * params_billions * 1e9 * tokens_billions * 1e9
    effective_flops_per_sec = gpu_tflops_fp16 * 1e12 * mfu
    cluster_seconds = flops / (effective_flops_per_sec * num_gpus)
    cluster_hours = cluster_seconds / 3600
    return cluster_hours * num_gpus * spot_price_per_hour
def peak_qps(dau_millions, requests_per_day, peak_factor=3.0):
    return (dau_millions * 1e6 * requests_per_day / 86_400) * peak_factor
# Examples
print(f"Llama 7B fp16: {model_gb(7):.0f} GB")
# 1T parameters is a hypothetical size, not a confirmed figure for any model
print(f"Hypothetical 1T model (fp16): {model_gb(1000):.0f} GB")
print(f"Train 7B/1T tokens (64x A100): ${training_cost_usd(7, 1000):,.0f}")
print(f"50M DAU, 10 req/day, 3x peak: {peak_qps(50, 10):.0f} QPS")
Interview Q&A
Q1: How would you estimate the compute cost to train a 10B parameter model on 200 billion tokens?
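Use the 6PD rule: C = 6 × 10^10 × 2 × 10^11 = 1.2 × 10^22 FLOPs.
On a 64-GPU A100 cluster with 40% MFU: effective throughput = 64 × 312 × 10^12 × 0.4 = 7.99 × 10^15 FLOPs/second.
Time = 1.2 × 10^22 / (7.99 × 10^15) ≈ 1.5 × 10^6 seconds ≈ 17.4 days (about 417 hours).
Cost at \$2/hour spot: 64 GPUs × 417 hours × \$2 ≈ \$53,400.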
Note: this is compute-optimal for a 10B model per Chinchilla (20 tokens/param × 10B = 200B tokens).
Q2: Why is LLM inference memory-bandwidth-bound rather than compute-bound?
For autoregressive token generation, the model generates one token at a time. Each token generation requires one forward pass through all transformer layers. The input sequence length is 1 new token (plus cached KV states), but you must load all model parameters from GPU memory to compute the forward pass.
The ratio of compute (FLOPs) to memory reads (bytes) for a single token is very low: each weight read is used for one multiply and one add, so about 2 FLOPs per weight, or 1 FLOP per byte in fp16. An A100 can perform 312 TFLOPS in FP16 but has only 2 TB/s of memory bandwidth, a ratio of 156 FLOPs per byte read. Inference at batch size 1 therefore uses well under 1% of the compute the hardware can deliver. The GPU is sitting idle waiting for data, not compute.
Solution: batching increases the effective FLOPs/byte ratio, which is why vLLM's continuous batching dramatically improves throughput while adding latency.
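The effect of batching on the FLOPs-per-byte ratio can be sketched as follows. The function names are hypothetical, and the model is deliberately simplified: it counts bytes explicitly (an fp16 weight is 2 bytes and contributes 2 FLOPs per sequence in the batch) and ignores KV-cache and activation traffic.

```python
def hardware_flops_per_byte(peak_flops: float, mem_bw_bytes_per_s: float) -> float:
    """FLOPs the GPU can execute in the time it takes to read one byte."""
    return peak_flops / mem_bw_bytes_per_s


def decode_flops_per_byte(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per weight byte read during one decoding step."""
    return 2.0 * batch_size / bytes_per_param


a100_ratio = hardware_flops_per_byte(312e12, 2e12)
for batch_size in (1, 16, 64, 256):
    intensity = decode_flops_per_byte(batch_size)
    regime = "memory-bound" if intensity < a100_ratio else "compute-bound"
    print(f"batch={batch_size:>3}: {intensity:>5.0f} FLOPs/byte "
          f"vs {a100_ratio:.0f} available -> {regime}")
```

Under these assumptions the crossover sits at a batch size of roughly 156 on an A100, which gives a feel for how much batching is needed before compute becomes the limit.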
Q3: How many GPU replicas do you need to serve GPT-3 (175B parameters) at 100 QPS with 200 output tokens per response?
First, model memory: 175B × 2 bytes (fp16) = 350 GB. This requires at least 5 × A100 80GB (400 GB total) using tensor parallelism.
Tokens per second needed: 100 QPS × 200 tokens = 20,000 tokens/second.
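With a 175B model on 8 × A100 (to fit comfortably with KV cache), the time per token is bounded by weight reads: 350 GB / (8 × 2 TB/s) ≈ 22 ms per token, or ~45 tok/s per replica at batch size 1.
Replicas needed: 20,000 tok/s / 45 tok/s ≈ 444 replicas. Each replica is 8 GPUs. Total: ~3,552 A100s without batching. Batching many requests per weight read cuts this by an order of magnitude or more, which is why production serving depends on it.
This is why serving GPT-3-class models at scale has been estimated to cost on the order of $700K/day - it's the raw GPU economics.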
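The same arithmetic as a function, with batch size exposed as a parameter. This is a sketch under stated assumptions: the helper name `replicas_needed` is hypothetical, decoding is treated as purely limited by weight reads, and throughput is assumed to scale linearly with batch size, which is optimistic. It returns 438 rather than 444 at batch size 1 only because it does not round 45.7 tok/s down to 45.

```python
import math


def replicas_needed(
    num_params: float,
    bytes_per_param: float,
    gpus_per_replica: int,
    mem_bw_per_gpu: float,  # bytes/s
    target_qps: float,
    tokens_per_response: int,
    batch_size: int = 1,
) -> int:
    """Replicas required when decoding is limited by reading weights from memory."""
    step_seconds = num_params * bytes_per_param / (gpus_per_replica * mem_bw_per_gpu)
    tokens_per_second = batch_size / step_seconds  # one token per sequence per step
    return math.ceil(target_qps * tokens_per_response / tokens_per_second)


for batch_size in (1, 16):
    n = replicas_needed(175e9, 2, 8, 2e12, 100, 200, batch_size)
    print(f"batch={batch_size}: {n} replicas, {n * 8} A100s")
```

Even a modest batch size of 16 moves the answer from thousands of GPUs to a few hundred, so the batch-size assumption deserves to be stated out loud in an interview.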
Q4: How would you estimate the storage required for a feature store serving 100M users with 200 features each?
Feature storage: 100M users × 200 features × 4 bytes (float32) = 80 GB raw data.
Redis overhead: ~2× for keys, metadata, and encoding overhead = 160 GB Redis memory.
Embedding storage (512-dim): 100M × 512 × 4 bytes = 200 GB raw = 400 GB in Redis.
Total: ~560 GB Redis memory. This fits in a Redis cluster of 10 × 64 GB nodes with a standard replication setup.
S3 backing for the offline feature store: features computed daily, retaining 365 days = 80 GB × 365 / 4 (compression) ≈ 7.3 TB. At \$0.023/GB/month, that is about \$168/month.
Q5: A product manager asks whether the team can train a 70B LLM on their $500K infrastructure budget. How would you evaluate this?
Training compute: 6 × 70B × D tokens. For Chinchilla-optimal: 70B × 20 = 1.4T tokens. FLOPs = 6 × 7 × 10^10 × 1.4 × 10^12 = 5.88 × 10^23.
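On 256 × A100 cluster (reasonable for 70B): effective FLOPs = 256 × 312T × 0.40 = 3.19 × 10^16 per second. Time = 5.88 × 10^23 / (3.19 × 10^16) ≈ 1.84 × 10^7 seconds ≈ 213 days.
Cost at \$2/hour spot: 256 GPUs × 5,112 hours × \$2 ≈ \$2.6M.
Conclusion: the budget is about 5× too small for a Chinchilla-optimal 70B run. A 7B model on 1T tokens needs 4.2 × 10^22 FLOPs, roughly \$190K under the same assumptions, which is within budget. The PM needs to reduce the target model scale by ~3-4× or increase the budget by ~5×.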
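One way to turn this answer into a recommendation is to invert the cost formula and ask what model size the budget can buy. The sketch below assumes the same inputs used in this answer (A100 at 312 TFLOPS fp16, 40% MFU, \$2 per GPU-hour, 20 tokens per parameter); the function names are hypothetical.

```python
import math

A100_FP16_FLOPS = 312e12
MFU = 0.40
PRICE_PER_GPU_HOUR = 2.0
TOKENS_PER_PARAM = 20.0


def chinchilla_training_cost_usd(num_params: float) -> float:
    """Cost of one compute-optimal run: 6 * P * D with D = 20 * P."""
    flops = 6 * num_params * (TOKENS_PER_PARAM * num_params)
    gpu_hours = flops / (A100_FP16_FLOPS * MFU) / 3600
    return gpu_hours * PRICE_PER_GPU_HOUR


def largest_model_for_budget(budget_usd: float) -> float:
    """Invert the cost formula; cost grows with the square of model size."""
    gpu_seconds = budget_usd / PRICE_PER_GPU_HOUR * 3600
    flops = gpu_seconds * A100_FP16_FLOPS * MFU
    return math.sqrt(flops / (6 * TOKENS_PER_PARAM))


print(f"70B run: ${chinchilla_training_cost_usd(70e9):,.0f}")
print(f"Largest model for $500K: {largest_model_for_budget(500_000) / 1e9:.1f}B params")
```

The result covers a single successful run with no failed experiments, so a model noticeably smaller than this ceiling is the safer proposal.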
Summary
Back-of-envelope estimation is a practical skill, not a theoretical one. The key numbers to internalize: model size = parameters × bytes per parameter; training FLOPs = 6 × P × D; inference is memory-bandwidth-bound for LLMs; peak QPS = average × 3–5×.
The estimation cascade - users → requests → data → compute → storage → cost - gives you a systematic path from business requirements to infrastructure numbers. Combined with the reference hardware table, it takes 15 minutes to determine whether a proposed system is feasible, expensive but doable, or fundamentally misscaled.
:::tip Interview Technique In system design interviews, estimation is a signal of seniority. Candidates who say "we'd need some GPUs for serving" without quantifying are treated differently than candidates who say "at 10K QPS with a 7B model and 100-token average output, at roughly 25 requests/second per A100 with batching, we need approximately 400 A100 GPUs." The numbers don't need to be exact - within a factor of 2 is excellent. What matters is the reasoning. :::
