Hardware Requirements and Selection
The $3,000 Mistake
A startup's founding engineer spends three weeks researching hardware and orders two RTX 3090s for their local fine-tuning rig. The thinking is solid: 24 GB VRAM per card, each in a PCIe x16 slot, should handle a 13B model fine-tuning job. The hardware arrives, the system is assembled, and the first fine-tuning run starts. Thirty minutes in, the run has barely progressed. The engineer checks the GPU utilization dashboard: GPU 0 is at 98%, GPU 1 is at 12%. The bottleneck is PCIe bandwidth. The two cards are communicating at around 16 GB/s through the PCIe bus. For fine-tuning, gradient synchronization between GPUs requires transferring large tensors every optimizer step, and at 16 GB/s that transfer takes longer than the actual forward-backward pass on each card. The second GPU spends most of its time waiting.
The fix requires either a workstation motherboard with dual x16 PCIe slots (rare and expensive) or a faster interconnect than PCIe. The engineer spends another $1,200 on an NVLink bridge and a second motherboard before realizing that NVIDIA does not expose consumer RTX cards' NVLink for multi-GPU compute workloads. The actual solution for multi-GPU training is either the professional A6000 or A100 lines, which support NVLink (up to 600 GB/s on the A100), or single-GPU training on a card with enough VRAM to fit the model alone.
This story captures the central challenge of hardware selection for local AI work: the numbers on the spec sheet (VRAM, TFLOPS, memory bandwidth) interact in non-obvious ways depending on the specific workload - inference vs fine-tuning, single request vs batched, short context vs 128K context. A card that is perfect for chatbot inference at 4-bit quantization is the wrong choice for fine-tuning a code model on long sequences.
The goal of this lesson is to give you the mental model and the specific numbers to make correct hardware decisions for the most common local AI workloads. We will cover the VRAM math precisely, compare every relevant GPU tier as of 2025, walk through Apple Silicon's unique unified memory architecture, and end with a concrete hardware selection matrix for six real use cases.
The math here is not complicated, but it must be done correctly before you spend thousands of dollars on hardware. There is no return policy once you have burned a GPU into a fine-tuning run.
Why This Exists - The Era of Locally Runnable Large Models
Before mid-2023, running a useful language model locally was not practical for most engineers. GPT-2 (1.5B parameters) was fine, but it was not useful for code generation or instruction following. GPT-3 (175B parameters) ran only in data centers. The gap between "fits on consumer hardware" and "actually useful" was enormous.
Three things closed that gap simultaneously. First, instruction-tuning techniques (RLHF, DPO) showed that a well-trained 7B model could outperform a poorly trained 30B model on practical tasks. Smaller, better-trained models became genuinely useful. Second, quantization techniques - particularly GGUF 4-bit and 8-bit quantization - cut VRAM requirements by 4-8x with minimal quality degradation. A 7B model that required 14 GB at float16 now ran in 4-5 GB at 4-bit. Third, the open-source model ecosystem exploded. LLaMA-2 in July 2023, Mistral 7B in September 2023, and CodeLlama in August 2023 gave engineers capable models they could actually download and run.
The result: by early 2024, a $1,500 RTX 4090 could run a 13B model at float16 or a 70B model at 4-bit quantization. The question shifted from "can I run this locally?" to "what hardware should I buy for my specific use case?"
Historical Context - Memory Walls and the Bandwidth Problem
The fundamental constraint on LLM inference was identified long before the current generation of models. The memory wall problem - the gap between compute throughput (TFLOPS) and memory bandwidth (GB/s) - has been a concern in high-performance computing since the 1990s.
For LLM inference specifically, the problem was articulated clearly in a 2019 paper on multi-query attention by Noam Shazeer (one of the original "Attention Is All You Need" authors). The insight: during autoregressive generation, the model generates one token at a time. Each token generation requires reading the entire model weight matrix from GPU memory, doing a relatively small computation against a single token's activations, and writing one token output. The ratio of memory reads to compute operations is extremely high compared to training, where you process large batches. This means inference throughput is bounded by memory bandwidth, not compute throughput - a fact that took the industry several years to fully internalize.
This memory-bandwidth-bound nature of inference explains why raw TFLOPS numbers are misleading for LLM hardware comparisons. An RTX 4090's 82.6 TFLOPS of FP16 compute is largely irrelevant for single-request inference. What matters is its 1008 GB/s of memory bandwidth. In fact, the MI300X from AMD - designed specifically for inference - achieves 5.3 TB/s of memory bandwidth, which is why it outperforms the H100 (3.35 TB/s) on inference throughput per dollar despite lower TFLOPS numbers.
Apple's M-series chips applied a different solution: eliminate the distinction between CPU RAM and GPU RAM entirely. Unified memory means the GPU has access to the same high-bandwidth memory pool as the CPU. An M2 Ultra with 192 GB of unified memory can load a 70B model at float16 - something no consumer GPU can do - and run inference at a reasonable speed even though the chip's peak GPU compute is lower than a dedicated GPU.
Core Concepts
The VRAM Requirement Formula
The most important number for hardware selection is how much VRAM (or unified memory) your model requires. The base formula is:
VRAM_weights = P x B

where P is the number of parameters and B is the number of bytes per parameter based on the precision (with P in billions of parameters, the result is approximately in gigabytes):
| Precision | Bytes per param | Notes |
|---|---|---|
| FP32 (float32) | 4 bytes | Training only, rarely used for inference |
| BF16 / FP16 | 2 bytes | Standard inference, full quality |
| INT8 (8-bit) | 1 byte | bitsandbytes, minor quality loss |
| Q4_K_M (4-bit) | 0.5 bytes (approx) | GGUF/GPTQ, moderate quality loss |
| Q3_K (3-bit) | 0.375 bytes (approx) | Aggressive compression, noticeable degradation |
Add a 20% overhead factor for the KV cache, activation memory, and framework overhead:

VRAM_total = P x B x 1.2
Worked examples:
7B model at float16: 7 x 2 bytes = 14 GB, x 1.2 = 16.8 GB. Requires a 24 GB card (RTX 4090, RTX 3090, A6000).
7B model at Q4_K_M: 7 x 0.5 bytes = 3.5 GB, x 1.2 = 4.2 GB. Fits comfortably on a 6 GB GTX 1060 or the 8 GB integrated GPU of a modern laptop.
13B model at float16: 13 x 2 bytes = 26 GB, x 1.2 = 31 GB. Requires a 40+ GB card (A6000, A100 40GB) or two 24 GB cards.
70B model at float16: 70 x 2 bytes = 140 GB, x 1.2 = 168 GB. Requires either a pair of high-end data center GPUs (A100 80GB x2, H100 80GB x2) or an M2 Ultra (192 GB unified).
70B model at Q4_K_M: 70 x 0.5 bytes = 35 GB, x 1.2 = 42 GB. Fits on a single A6000 (48 GB) or two RTX 4090s (24 GB x2 = 48 GB effective, with bandwidth caveats).
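The arithmetic is simple enough to sanity-check in a few lines. A minimal sketch of the rule of thumb (not the fuller calculator given later in this lesson, which also models the KV cache):

```python
# Back-of-envelope VRAM estimate: parameters x bytes per parameter x 1.2 overhead.
# Treats 1 billion parameters as ~1 GB per byte of precision (decimal GB, matching
# the worked examples above); the detailed calculator later adds the KV cache.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4_k_m": 0.5}

def quick_vram_gb(params_billions: float, precision: str, overhead: float = 1.2) -> float:
    return round(params_billions * BYTES_PER_PARAM[precision] * overhead, 1)

if __name__ == "__main__":
    for params, prec in [(7, "fp16"), (7, "q4_k_m"), (13, "fp16"), (70, "fp16"), (70, "q4_k_m")]:
        print(f"{params}B @ {prec}: ~{quick_vram_gb(params, prec)} GB")
```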
KV Cache Growth with Context Length
The overhead factor of 1.2 is appropriate for short contexts (up to ~4K tokens). For long-context inference, the KV cache grows linearly with context length and can become the dominant VRAM consumer:
KV cache bytes = 2 x n_layers x n_kv_heads x d_head x L x bytes_per_element

where L is the sequence length and n_kv_heads is the number of key/value heads (8 rather than 32 for LLaMA 3.1 8B, which uses grouped-query attention). For a LLaMA 3.1 8B model running at 128K context in float16, the KV cache alone is approximately 16 GB - as large as the model weights themselves. This means the overhead factor for long-context inference is not 1.2 but can be 2.0 or higher.
Practical implication: if your use case involves long documents, code repositories, or multi-turn conversations that accumulate thousands of tokens, your VRAM requirements are substantially higher than the simple formula suggests.
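To see how quickly the KV cache overtakes the 1.2 overhead assumption, here is a minimal sketch of the cache formula applied to LLaMA 3.1 8B (32 layers, 8 KV heads, head dimension 128 - the grouped-query-attention configuration behind the 16 GB figure above):

```python
# KV cache bytes = 2 (K and V) x layers x KV heads x head_dim x seq_len x bytes per element.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, seq_len: int, bytes_per_el: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_el / 1024**3

# LLaMA 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128, fp16 cache
for ctx in (4_096, 32_768, 131_072):
    print(f"KV cache at {ctx:>7} tokens: {kv_cache_gb(32, 8, 128, ctx):5.1f} GB")
```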
Memory Bandwidth vs VRAM Capacity
VRAM capacity determines whether a model fits in memory at all. Memory bandwidth determines how fast inference runs once the model is loaded.
The relationship between bandwidth and token generation speed for a single-request inference workload is approximately:

tokens/sec = memory bandwidth (GB/s) / model size in memory (GB)

For a 7B model at float16 (14 GB) on an RTX 4090 (1008 GB/s): 1008 / 14 = 72 tokens per second theoretical maximum.

In practice, you get 60-70% of theoretical due to framework overhead, reaching roughly 40-50 tokens per second for single-request inference.

For a 70B model at Q4_K_M (35 GB) on an A6000 (768 GB/s): 768 / 35 = 22 tokens per second theoretical, or roughly 15 tokens per second in practice.
This bandwidth-bound analysis explains why quantized models on high-bandwidth cards often outperform full-precision models on lower-bandwidth cards, even when the lower-bandwidth card has "more raw compute."
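The same back-of-envelope math as a sketch; the 65% efficiency factor is an assumption sitting in the middle of the 60-70% range quoted above:

```python
# Bandwidth-bound ceiling on single-request decode speed:
# tokens/sec <= memory bandwidth / bytes read per token (roughly the model's size in memory).
def decode_speed(bandwidth_gb_s: float, model_size_gb: float, efficiency: float = 0.65):
    theoretical = bandwidth_gb_s / model_size_gb
    return round(theoretical, 1), round(theoretical * efficiency, 1)

cases = [
    ("RTX 4090, 7B fp16 (14 GB)",     1008, 14),
    ("A6000, 70B Q4_K_M (35 GB)",      768, 35),
    ("Desktop CPU, 7B Q4_K_M (4 GB)",  100,  4),
]
for name, bw, size in cases:
    theo, practical = decode_speed(bw, size)
    print(f"{name:<32} {theo:>5} tok/s theoretical, ~{practical} tok/s expected")
```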
GPU Tier Analysis
Consumer Tier - NVIDIA
RTX 4090 (24 GB GDDR6X)
- Memory bandwidth: 1008 GB/s
- VRAM: 24 GB
- Best for: 7B models at float16, 13B models at 8-bit, 70B models at Q4_K_M (offloading required)
- Inference speed (7B/fp16): ~45 tokens/sec single request
- Price (2025): $1,200-1,400 used
- Notes: The current best consumer GPU for local AI. Power hungry (450W TDP), requires a good PSU. NVLink not supported for multi-GPU.
RTX 4080 Super (16 GB GDDR6X)
- Memory bandwidth: 736 GB/s
- VRAM: 16 GB
- Best for: 7B models at float16, 13B models at Q4_K_M
- Inference speed (7B/fp16): ~35 tokens/sec single request
- Price (2025): $999-1,100 new
- Notes: Good price-performance ratio if 24 GB is not needed. Cannot run 13B at float16.
RTX 3090 (24 GB GDDR6X)
- Memory bandwidth: 936 GB/s
- VRAM: 24 GB
- Best for: Same workloads as RTX 4090 but 10-15% slower
- Inference speed (7B/fp16): ~38 tokens/sec single request
- Price (2025): $700-900 used
- Notes: Excellent value on the used market. Same VRAM as 4090 at 40-50% lower cost. Power draw similar to 4090. Good choice if budget is limited and the workload is inference-heavy rather than training.
RTX 4070 Ti Super (16 GB GDDR6X)
- Memory bandwidth: 672 GB/s
- VRAM: 16 GB
- Price (2025): $799 new
- Notes: The 16 GB ceiling limits you to 7B models at float16. Below the 4080 Super in most benchmarks. Only worth choosing if power efficiency matters (285W vs 320W for 4080S).
Prosumer Tier - NVIDIA
RTX 4000 Ada (20 GB GDDR6)
- Memory bandwidth: 432 GB/s
- VRAM: 20 GB
- Price (2025): $1,200-1,500
- Notes: Lower bandwidth than the gaming RTX line despite higher price. Designed for Workstation ISV certifications, not AI throughput. Not recommended for local inference.
RTX 6000 Ada / A6000 Ada (48 GB GDDR6)
- Memory bandwidth: 960 GB/s (Ada) / 768 GB/s (A6000)
- VRAM: 48 GB
- Best for: 70B models at Q4_K_M, 34B models at float16, fine-tuning 13B at full precision
- Inference speed (70B/Q4): ~15-18 tokens/sec
- Price (2025): $6,000-7,000 new
- Notes: The professional GPU that makes 70B inference practical without multi-GPU complexity. 4x the price of an RTX 4090 but 2x the VRAM with professional support guarantees.
A100 40GB / 80GB
- Memory bandwidth: 1935 GB/s (80GB HBM2e)
- VRAM: 40 GB or 80 GB HBM2e
- Best for: Multi-user inference serving, fine-tuning 70B models
- Inference speed (70B/INT8 on the 80GB card): ~25 tokens/sec, thanks to HBM2e bandwidth - a model-precision combination the A6000's 48 GB cannot hold at all
- Price (2025): $4,000-6,000 used (40GB)
- Notes: HBM2e memory delivers 2-3x the bandwidth of GDDR6. Transformative for batched inference throughput. The right choice when you are serving multiple users simultaneously. Requires a server chassis or PCIe riser - does not fit in a standard ATX case.
H100 80GB
- Memory bandwidth: 3350 GB/s (SXM5)
- VRAM: 80 GB HBM3
- Best for: High-throughput multi-user serving, 70B+ model research
- Price (2025): $25,000-35,000
- Notes: Only justified for teams running commercial inference at scale. At this price point, cloud instances (Lambda Labs, CoreWeave) are often more cost-effective unless you have guaranteed utilization above 80%.
Apple Silicon
Apple's M-series chips use unified memory architecture (UMA): there is a single pool of high-bandwidth LPDDR5/LPDDR5X memory shared between the CPU and GPU. This has two important consequences for LLM inference:
- The "GPU VRAM" is as large as the system RAM. An M3 Max with 128 GB RAM has 128 GB available to the GPU.
- Memory bandwidth is shared between CPU and GPU. The M3 Max achieves 400 GB/s system bandwidth, but all processes share this.
M3 Max (128 GB)
- Unified memory: 128 GB
- GPU bandwidth: ~400 GB/s (shared)
- Best for: 70B models at Q4_K_M, 34B models at float16
- Inference speed (70B/Q4 with MLX): ~8-12 tokens/sec
- Price (2025): $4,000-5,000 (MacBook Pro / Mac Studio)
- Notes: The most practical path to running 70B models in a laptop form factor. Slower per-token than an A6000 due to lower absolute bandwidth, but the mobility and silence are significant operational advantages.
M2 Ultra (192 GB)
- Unified memory: 192 GB
- GPU bandwidth: ~800 GB/s
- Best for: 70B models at float16 (only non-server option), 13B and 34B at full precision
- Inference speed (70B/fp16 with MLX): ~4-5 tokens/sec (bandwidth-bound: ~800 GB/s across ~140 GB of weights)
- Price (2025): $5,000-7,000 (Mac Studio Ultra, Mac Pro)
- Notes: The M2 Ultra is unique: it can load a 70B model at full float16 precision without quantization. The inference quality is noticeably better than 4-bit quantized versions. For researchers who need maximum inference quality without a server rack, this is the best option available.
M4 Pro (48 GB)
- Unified memory: 48 GB
- GPU bandwidth: ~273 GB/s
- Best for: 13B models at float16, 7B models with good throughput
- Inference speed (7B/fp16 with MLX): ~12-15 tokens/sec (bounded by the ~273 GB/s shared bandwidth)
- Price (2025): $2,000-2,500 (MacBook Pro 14")
- Notes: The best laptop option for engineers who need local inference without a dedicated workstation. The 48 GB ceiling fits 34B models at Q4_K_M.
CPU-Only Inference
CPU inference is feasible but slow. The key constraint is that CPUs have much lower memory bandwidth than GPUs:
- High-end desktop CPU (Core i9-13900K, Ryzen 9 7950X): ~100 GB/s DDR5 bandwidth
- Server CPU (dual-socket Xeon, dual EPYC): ~300-500 GB/s total
For a 7B model at Q4_K_M (4 GB): 100 GB/s / 4 GB = 25 tokens per second theoretical.
In practice with llama.cpp's CPU backend, you get 8-15 tokens/sec on a modern high-end desktop CPU with 8 performance cores engaged. This is usable for casual chatting but too slow for a production coding assistant that must respond in under 2 seconds.
The practical threshold: CPU inference with 7B at Q4_K_M is viable for personal use where you tolerate 1-2 second response latency. For 13B and above, CPU inference drops below 5 tokens/sec and becomes frustrating to use.
RAM requirements for CPU inference mirror the VRAM requirements above, because the model weights live in system RAM:
- 7B at Q4_K_M: ~5 GB model + ~4 GB OS and overhead → 16 GB minimum, 32 GB recommended
- 13B at Q4_K_M: ~9 GB model + ~4 GB OS and overhead → 16 GB minimum, 32 GB recommended
- 70B at Q4_K_M: ~42 GB model + ~4 GB OS and overhead → 64 GB minimum, 128 GB recommended
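A small sketch for checking whether a given model will fit in the machine's RAM before downloading it. It assumes psutil is installed (pip install psutil) and uses the approximate Q4_K_M sizes from the list above:

```python
import psutil  # pip install psutil

# Approximate Q4_K_M model sizes from the list above, plus headroom for the OS.
MODEL_GB = {"7B Q4_K_M": 5, "13B Q4_K_M": 9, "70B Q4_K_M": 42}
OS_HEADROOM_GB = 4

total_gb = psutil.virtual_memory().total / 1024**3
print(f"System RAM: {total_gb:.0f} GB")
for name, size in MODEL_GB.items():
    needed = size + OS_HEADROOM_GB
    print(f"  {name}: needs ~{needed} GB -> {'fits' if total_gb >= needed else 'does not fit'}")
```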
Multi-GPU Considerations
NVLink vs PCIe
When a model does not fit in a single GPU's VRAM, you have two options: quantize it until it fits, or spread it across multiple GPUs. Multi-GPU inference requires transferring activations between GPUs during every layer's computation (tensor parallelism) or between transformer blocks (pipeline parallelism). The bandwidth of this interconnect directly limits throughput.
PCIe Gen 4 x16: ~32 GB/s bidirectional per slot. With two GPUs in different PCIe slots communicating over the CPU's PCIe switch, effective bandwidth is 16-32 GB/s. For a 70B model split across two GPUs, the inter-GPU transfer every forward pass creates a bottleneck that reduces effective throughput by 30-50% compared to a single-GPU setup with equivalent total VRAM.
NVLink (data center cards): 600 GB/s bidirectional on the A100 (NVLink 3.0), 900 GB/s on the H100 (NVLink 4.0) - roughly 20x PCIe Gen 4. Two A100s connected via NVLink can run 70B inference at nearly the same tokens/sec as a single 80GB A100.
Consumer GPUs and NVLink: The RTX 3090 technically has NVLink connectors, but NVIDIA disabled multi-GPU tensor parallelism through the driver for non-professional cards. You can use PCIe communication only. The RTX 40-series removed NVLink from consumer cards entirely. Professional cards (A6000, A100, H100) retain full NVLink support.
Practical conclusion: for multi-GPU inference on consumer hardware, the PCIe bandwidth overhead usually makes it more cost-effective to buy a single larger card (A6000 at 48 GB) rather than two smaller cards (2x RTX 4090 at 24 GB each). The exception is fine-tuning with gradient checkpointing, where the inter-GPU communication pattern is less frequent and PCIe overhead is more tolerable.
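Before committing to a multi-GPU build, it is worth measuring the actual device-to-device bandwidth on the hardware you have. A minimal sketch using PyTorch (assumes two CUDA GPUs are visible): PCIe-attached consumer cards typically report tens of GB/s, NVLink-attached professional cards several hundred.

```python
import time
import torch

# Measure GPU0 -> GPU1 copy bandwidth with a 1 GiB buffer.
assert torch.cuda.device_count() >= 2, "needs at least two CUDA GPUs"

n_bytes = 1 << 30  # 1 GiB
src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:0")
dst = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:1")

for _ in range(3):  # warm-up copies
    dst.copy_(src, non_blocking=True)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dst.copy_(src, non_blocking=True)
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.perf_counter() - start

print(f"GPU0 -> GPU1: {iters * n_bytes / elapsed / 1e9:.1f} GB/s")
```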
Tensor Parallelism vs Pipeline Parallelism
With multiple GPUs available:
Tensor Parallelism (TP): The model's weight matrices are split column-wise or row-wise across GPUs. Every layer requires an all-reduce communication between all GPUs. High communication overhead, low latency per token. Best when NVLink is available.
Pipeline Parallelism (PP): Different transformer layers are assigned to different GPUs. Communication only happens between consecutive layers. Lower communication overhead, but introduces pipeline bubbles that reduce GPU utilization. Better suited to PCIe multi-GPU setups.
vLLM supports tensor parallelism via --tensor-parallel-size N. llama.cpp splits a model across GPUs with --split-mode (layer or row) and --tensor-split for per-GPU proportions, with -ngl controlling how many layers are offloaded to the GPUs at all. For PCIe multi-GPU setups, use pipeline-style layer splitting rather than tensor parallelism.
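As a sketch of the vLLM side (the llama.cpp flags above are command-line only), the offline Python API exposes the same tensor-parallel setting. The model name is an example: at float16 a 70B model needs two 80 GB cards, so substitute a model and quantization that fit your hardware.

```python
from vllm import LLM, SamplingParams

# Tensor parallelism splits each weight matrix across the visible GPUs.
# At fp16 a 70B model needs ~2x80 GB; pick a smaller or quantized model for consumer cards.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model name
    tensor_parallel_size=2,                      # shard across 2 GPUs
)
outputs = llm.generate(
    ["Summarize the difference between tensor and pipeline parallelism."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```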
Code Examples
Calculating VRAM Requirements in Python
"""
VRAM calculator for local LLM inference.
Estimates memory requirements for different model sizes and precisions.
"""
from dataclasses import dataclass
from typing import Optional
@dataclass
class ModelConfig:
name: str
params_billions: float
    num_layers: int
    num_heads: int      # attention (query) heads
    num_kv_heads: int   # key/value heads (fewer than num_heads in GQA models)
    head_dim: int
context_length: int = 4096
@dataclass
class PrecisionConfig:
name: str
bytes_per_param: float
PRECISIONS = {
"fp32": PrecisionConfig("FP32", 4.0),
"fp16": PrecisionConfig("FP16 / BF16", 2.0),
"int8": PrecisionConfig("INT8", 1.0),
"q4_k_m": PrecisionConfig("Q4_K_M (GGUF)", 0.5),
"q3_k": PrecisionConfig("Q3_K (GGUF)", 0.375),
}
MODELS = {
"llama3.2-3b": ModelConfig("LLaMA 3.2 3B", 3.2, 28, 24, 64),
"llama3.1-8b": ModelConfig("LLaMA 3.1 8B", 8.0, 32, 32, 128),
"llama3.1-70b": ModelConfig("LLaMA 3.1 70B", 70.0, 80, 64, 128),
"mistral-7b": ModelConfig("Mistral 7B", 7.0, 32, 32, 128),
"codellama-34b": ModelConfig("CodeLlama 34B", 34.0, 48, 64, 128),
"qwen2.5-72b": ModelConfig("Qwen 2.5 72B", 72.0, 80, 64, 128),
}
def compute_vram_gb(
model: ModelConfig,
precision: PrecisionConfig,
context_tokens: Optional[int] = None,
overhead_factor: float = 1.2,
) -> dict:
"""
Compute VRAM requirements for a model at a given precision.
Args:
model: Model configuration
precision: Precision configuration
context_tokens: Override context length for KV cache calculation
overhead_factor: Multiplier for framework/activation overhead (default 1.2)
Returns:
dict with weight_gb, kv_cache_gb, total_gb
"""
ctx = context_tokens or model.context_length
# Model weights
weight_gb = (model.params_billions * 1e9 * precision.bytes_per_param) / (1024 ** 3)
    # KV cache: 2 (K and V) * layers * KV heads * head_dim * seq_len * bytes_per_element
    # GQA models store K/V only for num_kv_heads; use fp16 (2 bytes) regardless of weight precision
    kv_bytes = 2 * model.num_layers * model.num_kv_heads * model.head_dim * ctx * 2
kv_gb = kv_bytes / (1024 ** 3)
# Total with overhead
total_gb = (weight_gb * overhead_factor) + kv_gb
return {
"weight_gb": round(weight_gb, 2),
"kv_cache_gb": round(kv_gb, 2),
"overhead_gb": round(weight_gb * (overhead_factor - 1.0), 2),
"total_gb": round(total_gb, 2),
}
def print_vram_table(model_key: str, context_lengths: list[int] = [4096, 32768, 131072]):
model = MODELS[model_key]
print(f"\n{model.name} - VRAM Requirements")
print(f"{'Precision':<20} {'Weights (GB)':<16} {'KV@4K (GB)':<14} {'KV@32K (GB)':<14} {'KV@128K (GB)':<14}")
print("-" * 80)
for prec_key, prec in PRECISIONS.items():
row = f"{prec.name:<20}"
for ctx in context_lengths:
result = compute_vram_gb(model, prec, context_tokens=ctx)
if ctx == context_lengths[0]:
row += f"{result['weight_gb']:<16}"
row += f"{result['kv_cache_gb']:<14}"
print(row)
# Example usage
if __name__ == "__main__":
for model_key in ["llama3.1-8b", "codellama-34b", "llama3.1-70b"]:
print_vram_table(model_key)
# Quick check: will 7B fp16 fit on an RTX 4090?
result = compute_vram_gb(MODELS["mistral-7b"], PRECISIONS["fp16"])
print(f"\nMistral 7B at FP16 needs {result['total_gb']} GB")
print(f"RTX 4090 has 24 GB - fits: {result['total_gb'] < 24}")
Hardware Benchmark Script
"""
Quick benchmark to measure actual tokens/sec on your hardware.
Run this after loading a model to get a real-world performance number.
"""
import time
import statistics
import requests
def benchmark_inference(
base_url: str = "http://localhost:11434",
model: str = "llama3.2:3b",
prompt: str = "Explain the transformer architecture in detail, covering attention mechanisms, positional encoding, and the feed-forward layers.",
n_runs: int = 5,
max_tokens: int = 200,
) -> dict:
"""
Benchmark inference speed against an Ollama-compatible API.
Returns tokens/sec statistics.
"""
results = []
for i in range(n_runs):
start = time.perf_counter()
response = requests.post(
f"{base_url}/api/generate",
json={
"model": model,
"prompt": prompt,
"options": {"num_predict": max_tokens},
"stream": False,
},
timeout=120,
)
response.raise_for_status()
data = response.json()
elapsed = time.perf_counter() - start
# Ollama returns eval_count (tokens generated) and eval_duration (nanoseconds)
eval_tokens = data.get("eval_count", 0)
eval_duration_ns = data.get("eval_duration", 1)
tokens_per_sec = eval_tokens / (eval_duration_ns / 1e9)
results.append(tokens_per_sec)
print(f"Run {i+1}/{n_runs}: {tokens_per_sec:.1f} tok/s ({eval_tokens} tokens in {elapsed:.1f}s)")
return {
"mean_tokens_per_sec": round(statistics.mean(results), 1),
"median_tokens_per_sec": round(statistics.median(results), 1),
"stdev": round(statistics.stdev(results) if len(results) > 1 else 0, 1),
"min": round(min(results), 1),
"max": round(max(results), 1),
}
if __name__ == "__main__":
stats = benchmark_inference(model="llama3.2:3b")
print(f"\nResults: {stats['mean_tokens_per_sec']} tok/s mean, "
f"{stats['median_tokens_per_sec']} tok/s median")
Hardware Selection Script
#!/bin/bash
# check-gpu-readiness.sh
# Diagnoses the local GPU setup for LLM inference readiness
echo "=== GPU Readiness Check for Local LLM Inference ==="
echo ""
# Check NVIDIA driver
if command -v nvidia-smi &>/dev/null; then
echo "NVIDIA driver detected"
nvidia-smi --query-gpu=name,memory.total,memory.free,driver_version,compute_cap \
--format=csv,noheader,nounits | while IFS=',' read -r name total free driver cap; do
echo ""
echo " GPU: $name"
echo " VRAM Total: ${total} MiB ($(echo "scale=1; $total/1024" | bc) GB)"
echo " VRAM Free: ${free} MiB"
echo " Driver: $driver"
echo " Compute Cap: $cap"
# VRAM-based recommendation
total_gb=$(echo "scale=1; $total/1024" | bc)
if (( $(echo "$total_gb >= 48" | bc -l) )); then
echo " Recommended: 70B models at FP16, fine-tuning 13B+"
elif (( $(echo "$total_gb >= 24" | bc -l) )); then
echo " Recommended: 13B at FP16, 70B at Q4_K_M (with offloading)"
elif (( $(echo "$total_gb >= 16" | bc -l) )); then
echo " Recommended: 7B at FP16, 13B at Q4_K_M"
elif (( $(echo "$total_gb >= 8" | bc -l) )); then
echo " Recommended: 7B at Q4_K_M or Q8, small models at FP16"
else
echo " Recommended: Small models only (3B or under at FP16)"
fi
done
elif [[ "$(uname)" == "Darwin" ]]; then
echo "Apple Silicon detected"
system_profiler SPHardwareDataType | grep -E "Chip|Memory:"
# Check mlx availability
if python3 -c "import mlx" &>/dev/null; then
echo " MLX installed - optimized Apple Silicon inference available"
else
echo " MLX not installed - run: pip install mlx-lm"
fi
else
echo "No GPU detected - CPU inference only"
echo ""
# Check system RAM for CPU inference
if command -v free &>/dev/null; then
total_ram=$(free -g | awk '/^Mem:/{print $2}')
echo " System RAM: ${total_ram} GB"
if (( total_ram >= 64 )); then
echo " Recommended: 70B models at Q4_K_M with llama.cpp CPU backend"
elif (( total_ram >= 32 )); then
echo " Recommended: 13B models at Q4_K_M"
elif (( total_ram >= 16 )); then
echo " Recommended: 7B models at Q4_K_M"
else
echo " Recommended: 3B models at Q4_K_M only"
fi
fi
fi
echo ""
echo "=== Docker GPU Support ==="
if command -v docker &>/dev/null && docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi &>/dev/null; then
echo "NVIDIA Container Toolkit configured correctly"
else
echo "GPU Docker passthrough not working"
echo " Install with: sudo apt install nvidia-container-toolkit && sudo systemctl restart docker"
fi
Hardware Selection Matrix
| Use Case | Minimum Hardware | Recommended | Best Option |
|---|---|---|---|
| Personal coding assistant (7B) | 8 GB VRAM (RTX 3070) | RTX 4080 16 GB | RTX 4090 24 GB |
| Personal chat / Q&A (7-13B) | 16 GB VRAM | RTX 4090 24 GB | M3 Max 128 GB |
| Local RAG stack (embeddings + 7B) | 12 GB VRAM | RTX 4090 24 GB | RTX 4090 + 64 GB RAM |
| 70B model inference | 48 GB VRAM (A6000) | M2 Ultra 192 GB | A100 80GB |
| Fine-tuning 7B (LoRA/QLoRA) | 16 GB VRAM | RTX 4090 24 GB | A100 40GB |
| Fine-tuning 13B (LoRA) | 24 GB VRAM | RTX 4090 24 GB | A100 40GB |
| Fine-tuning 70B (QLoRA) | 80 GB VRAM | 2x A100 40GB | H100 80GB |
| Air-gapped enterprise deployment | 24 GB VRAM | A6000 48 GB | A100 80GB |
| Low-budget / laptop | CPU or 8 GB | M4 Pro 48 GB | M3 Max 128 GB |
Architecture Diagrams
(Diagrams referenced by this lesson: VRAM requirements by model size and precision; GPU architecture comparison; multi-GPU interconnect impact.)
Production Engineering Notes
Thermal Management
High-end GPUs under sustained LLM inference load run hot. The RTX 4090 at 450W TDP in a closed case will thermal throttle after 20-30 minutes, dropping from 82 TFLOPS to 60-70 TFLOPS. For sustained inference server workloads, ensure:
- Case airflow: at minimum two 140mm intake fans and two 120mm exhaust fans
- GPU fan curve: set an aggressive custom curve in MSI Afterburner or nvidia-settings
- Ambient temperature: GPU max recommended ambient is 35°C; data closets often exceed this
- Power limit: setting the GPU to 80% TDP (360W for a 4090) reduces thermal output by 20% with less than 8% performance loss - worthwhile for 24/7 inference servers
Power Supply Sizing
An RTX 4090 draws 450W under load. The rest of a modern workstation system (CPU, RAM, storage, fans) draws 100-200W. Total system draw: 600-650W. Use a PSU rated at least 50% above maximum load for efficiency and headroom: an 850W or 1000W 80+ Gold PSU is appropriate for a single RTX 4090 system.
For dual A6000 workstations, each card draws 300W. Total system draw can reach 750-900W. A 1200W or 1600W PSU is appropriate.
Storage for Model Weights
Model weights are large read-once sequential files. You do not need an NVMe SSD for model storage - a fast SATA SSD is sufficient. Loading a 14 GB (7B FP16) model from a SATA SSD (~550 MB/s) takes roughly 25-30 seconds; a fast NVMe drive cuts this to a few seconds. The difference is negligible if you keep models loaded rather than reloading them per request.
What matters more: total capacity. A reasonable local AI workstation should have 2-4 TB of SSD storage for models. At 2025 prices, 4 TB SATA SSDs are cheap enough that there is no reason to run out of model storage.
ECC Memory Considerations
Professional GPUs (A100, A6000, H100) support ECC (Error Correcting Code) memory. Consumer GPUs (RTX series) do not. For LLM inference, a single bit flip in model weights produces a silent bad output rather than a crash. For production inference serving sensitive applications, this is an argument for professional GPU hardware with ECC support, even if the performance-per-dollar is lower.
For development and personal use, the risk of ECC-uncorrectable errors causing a meaningful output error is extremely low in practice. Consumer GPUs are fine.
Common Mistakes
:::danger Buying a Multi-GPU Consumer Setup Instead of a Single Professional Card The most common expensive mistake: purchasing two RTX 4090s to get 48 GB of effective VRAM for 70B models. Consumer GPUs cannot use NVLink for AI workloads. PCIe bandwidth between the two cards creates a 30-50% throughput penalty for tensor-parallel inference. The money is almost always better spent on a single A6000 (48 GB, NVLink-capable, professional support) or on an M2 Ultra (192 GB unified memory). Before buying multiple consumer GPUs for a multi-GPU setup, confirm the specific throughput numbers for your workload with PCIe interconnect. :::
:::danger Confusing VRAM with System RAM A common beginner error is buying 64 GB of system RAM and assuming that helps run large models. For GPU inference, only VRAM matters. System RAM does not extend the GPU's VRAM for model weights in normal inference. The exception is CPU offloading: llama.cpp and some vLLM configurations can offload specific layers to system RAM, but the performance penalty is severe (2-10x slower per layer offloaded). Do not plan a hardware purchase around CPU offloading as the primary inference strategy. :::
:::warning Ignoring Context Length in VRAM Calculations The simple VRAM formula (params x bytes x 1.2) assumes short contexts (4K tokens or less). If your application uses 32K+ context windows (long document analysis, repository-scale code generation, long conversations), the KV cache can consume as much VRAM as the model weights themselves. A 70B model at Q4_K_M fits in 42 GB. At 128K context, the KV cache adds another 15-20 GB. Your A6000 (48 GB) that was supposed to run 70B comfortably now runs out of memory. Always calculate KV cache size for your target context length before finalizing a hardware purchase. :::
:::warning Underestimating Power Infrastructure Requirements A workstation with an RTX 4090 running continuous inference will draw 600-700W from the wall. A standard 15A household circuit provides 1800W, but the circuit may already be loaded with other equipment. Running a high-TDP GPU at full load for 24 hours per day stresses both the PSU and the wall circuit. For a dedicated inference server in a home office: verify the circuit capacity, use a UPS (uninterruptible power supply) for clean power delivery to the GPU, and set a power limit in the NVIDIA control panel to reduce TDP by 10-20% for sustained operation. This extends hardware lifespan and prevents circuit breaker trips. :::
:::warning Apple Silicon Memory Bandwidth Is Shared The M2 Ultra's 800 GB/s unified memory bandwidth sounds impressive compared to an A6000 (768 GB/s) and technically it is. But on Apple Silicon, that bandwidth is shared between the GPU, CPU, Neural Engine, and media encoders. Under sustained LLM inference load, background system processes compete for bandwidth. On an NVIDIA card, VRAM bandwidth is exclusively available to the GPU. In practice, Apple Silicon inference throughput is 15-30% lower than the theoretical bandwidth number suggests when the system is under any other load. :::
Interview Q&A
Q: Walk me through how you would calculate the VRAM needed to run a 34B parameter model at float16 with a 32K token context window.
A: The calculation has two components. First, the model weights: 34 billion parameters at 2 bytes per parameter (float16) equals 68 GB, plus a 20% overhead factor for framework memory and activations, giving roughly 82 GB for the weights alone. Second, the KV cache: a LLaMA-architecture 34B model has approximately 48 transformer layers, grouped-query attention with 8 KV heads, and a head dimension of 128. The KV cache size is 2 (for key and value) times 48 layers times 8 KV heads times 128 head dimensions times 32,768 sequence length times 2 bytes (float16), which equals approximately 6 GB. Total VRAM required: 82 + 6 = roughly 88 GB. This means you need either two 48 GB cards (A6000-class), a single A100 or H100 80GB (which does not quite fit at full precision but works with a slightly reduced context or a quantized KV cache), or an M2 Ultra with 192 GB unified memory. A single A6000 at 48 GB cannot fit this configuration without aggressive quantization of both the weights and KV cache.
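As a cross-check, the same numbers fall out of the compute_vram_gb sketch from the code examples section earlier in this lesson (which reports binary GiB, hence slightly smaller figures than the decimal hand calculation):

```python
# Assumes compute_vram_gb, MODELS and PRECISIONS from the VRAM calculator above are in scope.
result = compute_vram_gb(MODELS["codellama-34b"], PRECISIONS["fp16"], context_tokens=32_768)
print(result)
# -> roughly 63 GiB weights + 13 GiB overhead + 6 GiB KV cache, about 82 GiB (~88 GB decimal)
```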
Q: Why does memory bandwidth matter more than TFLOPS for single-request LLM inference, and at what point does compute become the bottleneck instead?
A: During autoregressive token generation, the model generates one token at a time. For each token, the entire weight matrix must be read from GPU memory, multiplied against the current token's representation, and the output written back. The amount of computation per memory read is very low - for a 7B model, you read 14 GB of weights to perform roughly 7 billion multiply-add operations (about 14 GFLOPs of work). At the RTX 4090's 82 TFLOPS, you could do this compute in under 0.2 milliseconds if the data were already in registers. But reading 14 GB from VRAM at 1008 GB/s takes about 14 milliseconds. The memory read is the bottleneck by a factor of roughly 80x. Compute becomes the bottleneck when you are processing large batches: if you serve 64 requests simultaneously, the same weight read produces output for 64 tokens in parallel, making the computation 64x larger while the memory read stays the same size. At batch sizes of 32-128, modern inference servers shift from memory-bandwidth-bound to compute-bound. This is why vLLM and TGI emphasize continuous batching: filling the batch maximizes GPU utilization.
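A toy model of that crossover, using the RTX 4090 numbers from this answer (82 TFLOPS, 1008 GB/s, a 7B model at fp16). It is a sketch that ignores attention and KV cache reads, so the crossover point is approximate:

```python
# Per decode step: one full weight read (shared by the whole batch) vs batch x per-token compute.
weights_bytes = 14e9        # 7B params x 2 bytes (fp16)
flops_per_token = 2 * 7e9   # ~2 FLOPs per parameter per generated token
bandwidth = 1008e9          # bytes/s (RTX 4090)
compute = 82e12             # FLOP/s (RTX 4090, FP16)

memory_ms = weights_bytes / bandwidth * 1e3
for batch in (1, 8, 32, 64, 128):
    compute_ms = batch * flops_per_token / compute * 1e3
    bound = "memory-bound" if memory_ms > compute_ms else "compute-bound"
    print(f"batch {batch:>3}: memory {memory_ms:5.1f} ms vs compute {compute_ms:6.2f} ms -> {bound}")
```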
Q: A user asks whether to buy an RTX 4090 or an Apple M3 Max (128 GB) for local AI work. How do you advise them, and what questions do you ask first?
A: The questions I ask first are: What model sizes do they want to run? What is their primary use case (inference only, or also fine-tuning)? Do they need a laptop form factor? And what is their OS preference? The RTX 4090 is faster per token for models that fit within 24 GB - 7B at FP16, 13B at 8-bit, 70B at Q4_K_M with some offloading. At roughly $1,600, it is also significantly cheaper. It requires a full desktop system. Fine-tuning LoRA on 7B or 13B models works well on the 4090. The M3 Max (128 GB) runs 70B models at FP16 without quantization - something the 4090 cannot do at all. It is a laptop, draws far less power, and runs silently. Per-token speed is lower (8-15 tok/s for 70B vs the 4090's 10-15 tok/s for 70B at Q4), but the quality advantage of FP16 over Q4 is meaningful for complex reasoning tasks. My recommendation: if 70B at full precision matters or if laptop form factor is required, get the M3 Max. If the use case is primarily 7B-13B inference or LoRA fine-tuning and a desktop is acceptable, the RTX 4090 delivers more performance per dollar.
Q: Your team is designing an air-gapped inference server for a financial services client who needs to run 70B models with 64K context windows. Specify the hardware.
A: This is a workload requiring approximately 42 GB (70B at Q4_K_M) plus roughly 20 GB of KV cache at 64K context (float16 cache, grouped-query attention) - about 62 GB of VRAM. A 48 GB card (RTX 6000 Ada / A6000) does not fit. The practical options are a single A100 80GB (fits comfortably) or a single H100 80GB (faster, but at $25,000-35,000 it is overkill for a single-tenant workload). The air-gap compliance costs (physical security, audit logging, isolated network) typically exceed the hardware cost.
Q: Explain the unified memory architecture of Apple Silicon and why it changes the VRAM equation compared to discrete GPU setups.
A: In a traditional discrete GPU setup, the CPU has its own DDR5 system memory (connected via the memory controller on the CPU die) and the GPU has its own GDDR6X VRAM (on the GPU card). These are completely separate memory pools. Data must be explicitly transferred over PCIe whenever the CPU needs to send inputs to the GPU or receive results back. PCIe Gen 4 x16 provides 32 GB/s for this transfer, which is a bottleneck for workloads that mix CPU and GPU computation. Apple Silicon uses a single LPDDR5X memory pool on the same package as the CPU and GPU dies. There is no separate VRAM - all memory is accessible to both the CPU and GPU at full bandwidth. The M2 Ultra's 800 GB/s bandwidth is the bandwidth of this shared pool. This eliminates the PCIe transfer bottleneck for hybrid CPU-GPU workloads. For LLM inference, it means the GPU can access model weights at 800 GB/s without any data movement overhead. The implication for the VRAM equation is that "GPU memory" and "RAM" are the same thing. When you buy an M3 Max with 128 GB of unified memory, all 128 GB is available to the GPU for model weights and KV cache. No discrete GPU card comes close to this capacity at any consumer price point. The tradeoff is that the GPU's dedicated compute throughput is lower than a discrete high-end card - the M3 Max GPU peaks at roughly 14 TFLOPS (FP16) versus the RTX 4090's 82 TFLOPS. For inference, which is memory-bandwidth-bound as we discussed, this compute gap matters less than the bandwidth and capacity numbers suggest.
Q: What is the impact of PCIe generation and slot width on LLM inference performance, and when does it actually matter?
A: PCIe bandwidth affects LLM inference performance in two scenarios. The first is CPU-to-GPU data transfer: input token embeddings, KV cache for CPU-offloaded layers, and output logits must transfer over PCIe. For a single request, this transfer is small (a few kilobytes of activations). PCIe generation makes essentially no difference here. The second scenario is multi-GPU inference: when two or more GPUs communicate over PCIe during tensor parallelism, the inter-GPU bandwidth is the bottleneck. PCIe Gen 4 x16 provides 32 GB/s per direction. If both GPUs share a PCIe switch (connected to the CPU's root complex through the same upstream port), effective bandwidth may be halved to 16 GB/s. PCIe Gen 5, available on AMD's X670E/B650E platforms and Intel's Z790 desktop platform, doubles this to 64 GB/s, which meaningfully reduces the multi-GPU inference penalty. In practice: for single-GPU inference, PCIe generation is irrelevant - the VRAM bandwidth (1008 GB/s for an RTX 4090) dwarfs any PCIe transfer. For multi-GPU inference without NVLink, PCIe Gen 4 vs Gen 5 matters, but neither generation approaches NVLink bandwidth (600-900 GB/s), which is why NVLink-capable professional cards remain the correct choice for multi-GPU LLM workloads.
Advanced Topics
AMD GPUs for Local Inference
AMD GPUs are a viable but more complex alternative to NVIDIA for local LLM inference. The key toolchain difference: AMD uses ROCm (Radeon Open Compute) instead of CUDA. Support in the major inference frameworks is growing but uneven.
RX 7900 XTX (24 GB GDDR6)
- Memory bandwidth: 960 GB/s
- VRAM: 24 GB
- Price (2025): $900-1,000 new
- ROCm support: PyTorch 2.x, llama.cpp (with ROCm backend), limited vLLM support
- Notes: Strong specs for the price. The software ecosystem lags NVIDIA by 6-12 months on new features. Not recommended if you need bleeding-edge inference features (flash-attention 2, speculative decoding) as CUDA implementations arrive first.
MI300X (192 GB HBM3)
- Memory bandwidth: 5.3 TB/s
- VRAM: 192 GB
- Price (2025): $10,000-15,000
- Notes: AMD's server GPU designed specifically for inference. The 5.3 TB/s bandwidth (vs H100's 3.35 TB/s) makes it genuinely faster for large-model inference throughput. Major inference frameworks added MI300X support in 2024. For teams committed to cost-optimized inference at scale, worth evaluating.
To run llama.cpp with AMD ROCm:
# Install ROCm 6.x
sudo apt install rocm-libs rocm-dev
# Build llama.cpp with ROCm backend
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS="gfx1100" # gfx1100 = RDNA3 (RX 7900)
cmake --build . --config Release -j$(nproc)
# Run inference
./llama-cli -m /models/llama-3.1-8b-q4_k_m.gguf -p "Hello" -n 100
Workstation Build Examples
Build A - Solo Developer Coding Assistant ($2,000-2,500)
Components:
- GPU: RTX 4090 or RTX 3090 (24 GB VRAM) - $800-1,600
- CPU: AMD Ryzen 7 7700X or Intel i7-13700K - $280-350
- Motherboard: AMD X670 or Intel Z790 with PCIe 5.0 - $200-300
- RAM: 32 GB DDR5-6000 - $80-120
- Storage: 2 TB NVMe SSD (model weights) + 1 TB boot SSD - $150-200
- PSU: 1000W 80+ Gold - $120-150
- Case + cooling: Mid tower with good airflow, 360mm AIO - $200-300
Capable of: 7B at float16 (45 tok/s), 13B at 8-bit (28 tok/s), 70B at Q4_K_M with CPU offloading (~8 tok/s). LoRA fine-tuning on 7B.
Build B - Local AI Research Workstation ($5,000-6,000)
Components:
- GPU: NVIDIA A6000 48 GB - $4,500-5,500 (used/refurb)
- CPU: AMD Threadripper 7960X or Ryzen 9 7950X - $600-1,400
- Motherboard: TRX50 or X670E workstation - $400-600
- RAM: 128 GB DDR5 ECC - $400-600
- Storage: 4 TB NVMe (for multiple model families) + 2 TB NVMe boot - $400-600
- PSU: 1200W 80+ Platinum - $200-250
- Case: Full tower server chassis with hot-swap bays - $300-500
Capable of: 70B at Q4_K_M single-GPU (15-18 tok/s), 34B at float16, fine-tuning 13B full precision, 7B at float16 for multiple simultaneous users.
Monitoring Hardware Health Under LLM Load
LLM inference is one of the most sustained-compute workloads you can run on consumer hardware. Monitoring is essential to catch thermal throttling, memory errors, and power supply instability before they cause silent inference quality degradation.
# Real-time GPU monitoring during inference
watch -n 1 nvidia-smi --query-gpu=name,temperature.gpu,power.draw,memory.used,memory.total,utilization.gpu,clocks.current.sm \
--format=csv,noheader
# Log GPU metrics every 5 seconds to CSV for analysis
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,power.draw,memory.used,utilization.gpu,clocks.current.sm \
--format=csv --loop=5 > /tmp/gpu_metrics.csv &
# Check for throttling events (P-state changes indicate thermal throttling)
nvidia-smi --query-gpu=pstate,clocks_throttle_reasons.active \
--format=csv,noheader --loop=5
# System memory and CPU monitoring
vmstat 5 | tee /tmp/system_metrics.txt &
# For Apple Silicon: monitor all performance metrics
sudo powermetrics --samplers gpu_power,cpu_power,thermal -i 1000 | grep -E "GPU|CPU|DRAM"
Signs of thermal throttling:
- GPU clock frequency drops from 2500 MHz to 1800-2000 MHz during a long inference run
- nvidia-smi shows clocks_throttle_reasons.sw_thermal_slowdown as Active
- Tokens per second degrades progressively over a 20-30 minute session
If you observe throttling, check case airflow, apply fresh thermal paste to the GPU (if it is 2+ years old), or set a lower power limit with nvidia-smi -pl 380 (for RTX 4090, reducing from 450W to 380W cuts temperature by 5-8C with under 5% throughput loss).
Quantization Impact on Quality - When to Use Which Precision
The hardware selection decision directly interacts with which quantization levels you can run. Here is a practical quality guide:
Float16 / BF16: Full quality baseline. Use when you have the VRAM and quality matters. The difference from FP32 is negligible for inference (training is another story). Target for production applications where answer correctness is critical.
INT8 (8-bit): 1-2% degradation on most benchmarks. Invisible in practice for chat, code generation, and summarization tasks. The standard choice when you want near-full quality but need to fit a larger model. Use bitsandbytes with transformers or --quantization int8 in vLLM.
Q4_K_M (4-bit GGUF): 3-5% degradation on reasoning-heavy benchmarks (MMLU, GSM8K). Clearly noticeable on complex multi-step math and formal logic. Acceptable for most practical applications including code generation, summarization, and general Q&A. The standard choice for maximizing model size on limited VRAM.
Q3_K (3-bit GGUF): 8-12% degradation. Noticeable quality loss in complex reasoning, instruction following, and structured output generation. Use only when VRAM is severely constrained and the task is tolerant of errors (e.g., search re-ranking, classification, keyword extraction).
IQ2 / Q2 (2-bit): Significant quality loss. The model may produce incoherent responses on complex prompts. Generally only useful for very narrow classification tasks or as a speculative decoding draft model.
Practical rule: if your hardware requires more than 4-bit quantization to run a given model, consider running a smaller model at 4-bit or float16 instead. A well-trained 7B model at Q4_K_M almost always outperforms a 13B model at Q2 on practical tasks, because the quality degradation from extreme quantization exceeds the capacity advantage of the larger architecture.
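That rule is easy to encode. A minimal sketch that picks the highest-quality precision fitting a VRAM budget (short contexts, no CPU offloading considered):

```python
# Prefer the highest precision that fits; if even Q4_K_M does not fit, drop to a smaller model.
PRECISION_ORDER = [("fp16", 2.0), ("int8", 1.0), ("q4_k_m", 0.5)]  # best quality first

def pick_precision(params_billions: float, vram_gb: float, overhead: float = 1.2):
    for name, bytes_per_param in PRECISION_ORDER:
        needed = params_billions * bytes_per_param * overhead
        if needed <= vram_gb:
            return name, round(needed, 1)
    return None, None  # nothing at Q4_K_M or better fits

for params in (7, 13, 34, 70):
    prec, needed = pick_precision(params, vram_gb=24)  # e.g. a 24 GB RTX 4090
    verdict = f"{prec} (~{needed} GB)" if prec else "use a smaller model (or CPU offloading)"
    print(f"{params}B on a 24 GB card: {verdict}")
```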
