Tensor and Pipeline Parallelism
The Production Scenario
You have been asked to serve LLaMA-2 70B for your company's internal engineering assistant. You requisition the hardware and the allocation comes back: four A100 80GB GPUs. You do the math. LLaMA-2 70B in FP16 requires 140 GB just for the model weights - before you account for KV cache, activations, or any runtime overhead. Your four A100s give you 320 GB total, which is more than enough in aggregate, but each individual card holds only 80 GB. The model does not fit on a single GPU. A single GPU cannot even load it, let alone run inference.
This is not an edge case. It is the default situation for frontier-scale LLMs. GPT-4 is estimated at over 1 trillion parameters. Gemini Ultra, Claude 3 Opus, Llama-3 405B - all of them shatter the memory capacity of any single GPU available today. Even the modest frontier models like Llama-3 70B require 140 GB in FP16, demanding at least two A100 80GB cards. Quantization helps - INT4 brings LLaMA 70B down to roughly 35 GB - but quantization has quality trade-offs that matter for demanding tasks like code generation, reasoning, and complex instruction following.
The solution is to distribute the model across multiple GPUs. But "distribute" is not a single strategy - there are fundamentally different ways to split a model, each with different memory requirements, communication costs, and latency characteristics. Tensor parallelism splits individual weight matrices across GPUs. Pipeline parallelism splits layers of the model across GPUs. Data parallelism replicates the full model and splits the input data. These strategies can be composed - the approach that enabled Megatron-LM to train trillion-parameter models on clusters of thousands of GPUs is a combination of all three.
Understanding these parallelism strategies is not just for researchers building frontier models. It is essential knowledge for any engineer who needs to deploy models at scale. Which GPUs to buy? How many? How to connect them? Which serving framework to use? All of these decisions depend on understanding how model parallelism works under the hood.
Why This Exists: The Memory Wall
The Numbers
Modern LLMs are enormous. Memory requirement by precision:
| Model | Parameters | FP16 (bytes) | INT8 | INT4 |
|---|---|---|---|---|
| LLaMA-3 8B | 8B | 16 GB | 8 GB | 4 GB |
| LLaMA-3 70B | 70B | 140 GB | 70 GB | 35 GB |
| LLaMA-3 405B | 405B | 810 GB | 405 GB | 202 GB |
| GPT-4 (estimated) | ~1.8T | ~3,600 GB | ~1,800 GB | ~900 GB |
The largest single GPU as of 2025 - the NVIDIA H200 - has 141 GB HBM3e. LLaMA-3 405B does not fit on a single H200 even in INT4.
For inference, memory requirements are:
- Model weights: the dominant cost for large models ( bytes in FP16)
- KV cache: grows with batch size and sequence length - typically 20–40% of total GPU memory budget
- Activations: small for inference (unlike training), often negligible
Practical rule: A model needs approximately GB in FP16. A 70B model needs ~140 GB.
Why Not Just Quantize?
INT4 quantization reduces LLaMA-3 70B to ~35 GB, fitting on a single A100. But:
- Heavy quantization degrades accuracy on reasoning, math, and code generation tasks - the exact use cases where 70B beats 7B
- Some deployments require reproducible, full-precision outputs for auditing or compliance
- Quantization and multi-GPU parallelism are complementary, not alternatives - production systems often use both
Historical Context
Data parallelism has existed since the early deep learning era - PyTorch DDP, TensorFlow MirroredStrategy. It replicates the full model on each GPU, shards the data batch, and synchronizes gradients. This scales throughput but does nothing for models that exceed per-GPU memory.
Model parallelism became necessary around 2018–2019 as models grew beyond single-GPU capacity. Early implementations were manual: researchers hard-coded which layers ran on which GPU.
The landmark paper was Megatron-LM (Shoeybi et al., 2019) from NVIDIA, introducing systematic tensor parallelism for transformers. Megatron split attention heads and MLP weight matrices along specific dimensions with a clean two-operation (column-parallel then row-parallel) pattern requiring only one all-reduce per layer. This made large-scale training practical.
GPipe (Huang et al., 2019) from Google formalized pipeline parallelism with micro-batching to reduce bubble overhead from to where is the number of micro-batches.
3D parallelism - combining all three - emerged in 2021 as the standard recipe for 100B+ parameter training. Megatron-Turing NLG (530B, 2022) used TP=8, PP=35, DP=12 across 3,360 A100s.
Data Parallelism: Why It Is Not Enough
Before diving into tensor and pipeline parallelism, understand why data parallelism alone fails for large models.
Data parallelism replicates the full model on each GPU. Each GPU processes a different shard of the data batch. After the backward pass, gradients are all-reduced across GPUs so all copies remain identical.
Data Parallelism:
GPU 0: [Full Model Copy] ← data shard 0
GPU 1: [Full Model Copy] ← data shard 1
GPU 2: [Full Model Copy] ← data shard 2
GPU 3: [Full Model Copy] ← data shard 3
↕ All-reduce gradients
↕ (8× throughput, same per-GPU memory)
Problem: Each GPU holds the full model. If the model needs 140 GB and each GPU has 80 GB, data parallelism is impossible - no single GPU can even load the model.
Data parallelism scales throughput (more tokens per second) but does not reduce per-GPU memory requirements. You need tensor or pipeline parallelism for that.
Tensor Parallelism
Tensor parallelism (TP) splits individual weight matrices across GPUs. Each GPU holds a shard of each weight matrix and computes a partial result. Results are combined via all-reduce after each layer.
Column and Row Parallelism
For a linear layer , there are two ways to split the weight matrix :
Column parallelism: Split along columns.
Each GPU computes independently - input is replicated across all GPUs. Output is column-sharded. No all-reduce needed after this layer.
Row parallelism: Split along rows.
Each GPU receives a column shard of (from the previous column-parallel layer) and computes a partial sum. One all-reduce sums partial results to produce the full output on all GPUs.
Tensor Parallelism in Transformers
Megatron-LM applied this to every transformer block:
MLP block:
- : column-parallel - GPU holds columns of , applies GeLU locally to its shard
- : row-parallel - GPU holds corresponding rows, all-reduce after
Attention block: Split attention heads across GPUs.
- QKV projection: column-parallel - GPU computes its subset of attention heads
- Output projection: row-parallel - all-reduce after
Key insight: Pairing column-parallel followed by row-parallel requires only one all-reduce per transformer layer (for MLP) and one all-reduce per transformer layer (for attention). The activation function (GeLU, SiLU) applies locally within each GPU's column shard, requiring no communication.
Communication Cost: The NVLink Dependency
Each all-reduce in ring-all-reduce transfers bytes, where is the tensor size and is the TP degree.
For LLaMA-3 70B (hidden size 8192), generating a single token (1 token, FP16):
LLaMA-3 70B has 80 layers, two all-reduces per layer = 160 total all-reduces per forward pass:
| Interconnect | Bandwidth | Per all-reduce | Total per forward pass |
|---|---|---|---|
| NVLink A100 | 600 GB/s | 0.04 µs | 6.4 µs ≈ 0 ms |
| PCIe 4.0 x16 | 64 GB/s | 0.37 µs | 59 µs ≈ 0.06 ms |
For a single token, even PCIe is manageable. But the picture changes at larger batch sizes:
At batch=32, each activation tensor is 32× larger: 512 KB. Total TP communication over 160 all-reduces:
| Interconnect | Total per forward pass |
|---|---|
| NVLink A100 | ~0.2 ms - negligible |
| PCIe 4.0 x16 | ~1.9 ms - adds significant latency |
Conclusion: Tensor parallelism requires NVLink for practical inference latency. The standard rule: use TP only within a single node where GPUs are NVLink-connected.
Pipeline Parallelism
Pipeline parallelism (PP) splits the model's layers across GPUs. GPU 0 holds the first layers, GPU 1 holds the next layers, and so on. Data flows through the pipeline stage by stage.
Pipeline Parallelism - LLaMA-3 70B across 4 GPUs:
GPU 0: [Embeddings + Layers 0–19] → activations → GPU 1
GPU 1: [Layers 20–39] → activations → GPU 2
GPU 2: [Layers 40–59] → activations → GPU 3
GPU 3: [Layers 60–79 + LM Head] → output logits
Memory is split evenly: each GPU holds 140/4 = 35 GB of weights. Communication is just the activation tensor passed between stages - much lower bandwidth requirement than TP all-reduce.
The Pipeline Bubble
Naive pipeline parallelism has a severe efficiency problem. With a single batch, each stage must wait for the previous stage to finish before it can start:
Naive PP (4 stages, 1 batch):
Stage 0: [FWRD....][ idle ]
Stage 1: [ wait ][FWRD....][ idle ]
Stage 2: [ wait ][FWRD....][ idle ]
Stage 3: [ wait ][FWRD....]
Bubble fraction = (p-1)/p = 75% waste with p=4
75% of GPU-time is wasted. This makes naive PP unusable at scale.
Micro-batching (GPipe)
GPipe (2019) splits the global batch into micro-batches. While later stages process early micro-batches, earlier stages process new ones:
PP with m=4 micro-batches (F=forward, p=4 stages):
Stage 0: [F0][F1][F2][F3][ ]
Stage 1: [ ][F0][F1][F2][F3][ ]
Stage 2: [ ][ ][F0][F1][F2][F3]
Stage 3: [ ][ ][ ][F0][F1][F2][F3]
^bubble^ ^bubble^
Bubble fraction with stages and micro-batches:
For , : bubble fraction = . Much better than 75%.
For inference, micro-batching is less useful because low latency matters more than throughput - you want to process each request quickly, not accumulate a large micro-batch.
1F1B Schedule
The 1F1B (One Forward One Backward) schedule, used in Megatron-LM, interleaves forward and backward passes more tightly:
1F1B Schedule (p=4, m=8):
Stage 0: F0 F1 F2 F3 B3 B2 B1 B0 F4 F5 F6 F7 B7 B6 B5 B4
Stage 1: F0 F1 F2 F3 B3 B2 B1 B0 F4 F5 F6 F7 B7 B6 B5
Stage 2: F0 F1 F2 F3 B3 B2 B1 B0 F4 F5 F6 F7 B7 B6
Stage 3: F0 F1 F2 F3 B3 B2 B1 B0 F4 F5 F6 F7 B7
1F1B achieves the same bubble fraction as GPipe but requires storing activations only for micro-batches simultaneously (not all as in GPipe), reducing peak memory.
Sequence Parallelism
For very long contexts (128K+ tokens), activation tensors become memory-intensive:
For , (LLaMA 70B):
With 80 layers, peak activations alone can exceed 160 GB - far more than model weights.
Sequence parallelism (Korthikanti et al., 2022) extends tensor parallelism to also shard the sequence dimension. In regions between tensor-parallel layers (where activations are replicated), sequence parallelism instead partitions the token sequence across GPUs. Each GPU processes its own slice of tokens, reducing per-GPU activation memory by a factor of (the TP degree).
The all-reduce at the end of TP layers is replaced by an all-gather (to reconstruct the full sequence before the next TP layer) - same communication volume, same bandwidth requirement.
3D Parallelism: The Full Stack
Training at scale uses all three dimensions simultaneously:
3D Parallelism Layout:
Data Parallel (DP) replicas
┌─────────────────────────────────────┐
│ Pipeline Stage 0 │ Pipeline Stage 1 │ ... │
│ [GPU0][GPU1].. │ [GPU8][GPU9].. │ │
│ ^tensor parallel^ │ ^tensor parallel^ │ │
└─────────────────────────────────────┘
└─────── one model copy ─────────────┘
└──── × DP replicas (different data) ─────────────────┘
For Megatron-Turing NLG (530B parameters):
- TP=8: within each 8-GPU DGX node, NVLink all-reduce
- PP=35: 35 pipeline stages across nodes, InfiniBand activation passing
- DP=12: 12 model replicas for throughput, InfiniBand gradient sync
- Total: A100 GPUs
Expert Parallelism for MoE Models
Mixture-of-Experts (MoE) models like Mixtral 8x7B use a different parallelism dimension. In an MoE layer, a router sends each token to one of expert networks (feed-forward layers). With expert parallelism, each GPU holds a subset of experts:
- GPU 0 holds experts 0–1
- GPU 1 holds experts 2–3
- ... and so on
Each token is routed to the GPU holding its expert. This requires all-to-all communication: each GPU may need to send tokens to and receive tokens from any other GPU.
Expert parallelism scales the model's capacity (more parameters = more knowledge) without scaling inference compute per token - each token only passes through one or two experts per MoE layer.
Inference vs Training: Different Optimal Strategies
The best parallelism strategy differs significantly between training and inference:
| Aspect | Training | Inference |
|---|---|---|
| Typical batch size | 256–4096 per GPU | 1–32 per GPU |
| Memory pressure | Very high (weights + gradients + optimizer states + activations) | Moderate (weights + KV cache) |
| Latency requirement | Not critical (offline) | Critical (user-facing) |
| Pipeline parallelism | Excellent - large batches amortize bubble | Poor - small batches cannot amortize bubble |
| Tensor parallelism | TP=8 standard | TP=2 to 8 depending on model size |
| Communication budget | Can overlap with compute | Must be minimized for low latency |
For inference, prefer tensor parallelism over pipeline parallelism when both options are available. TP adds a small, fixed latency per layer (all-reduce). PP adds sequential latency proportional to the number of pipeline stages and cannot be hidden at small batch sizes.
Code: Analyzing Parallelism Trade-offs
"""
Analytical model comparing tensor vs pipeline parallelism for inference.
Helps choose the right strategy for a given hardware configuration.
"""
from dataclasses import dataclass
from typing import Literal
import math
@dataclass
class HardwareConfig:
n_gpus: int
gpu_memory_gb: float
nvlink: bool # True if GPUs connected via NVLink
interconnect_gbps: float # Bandwidth in GB/s (NVLink or PCIe)
@dataclass
class ModelConfig:
n_params: float # Total parameters
n_layers: int
hidden_size: int
n_heads: int
dtype_bytes: int = 2 # FP16 = 2 bytes
def analyze_tensor_parallel(
model: ModelConfig,
hw: HardwareConfig,
batch_size: int = 1,
seq_len: int = 1, # For decode, seq_len=1 (generating one token)
) -> dict:
"""Estimate tensor parallel inference efficiency."""
tp_degree = hw.n_gpus
weights_total_gb = model.n_params * model.dtype_bytes / 1e9
weights_per_gpu_gb = weights_total_gb / tp_degree
# Activation size per token per layer (hidden_size activations, FP16)
activation_bytes = model.hidden_size * batch_size * seq_len * model.dtype_bytes
# All-reduce volume per layer: 2*(p-1)/p * activation_size
# Two all-reduces per layer (MLP + attention)
ar_volume_per_layer = 2 * (tp_degree - 1) / tp_degree * activation_bytes
total_ar_volume = ar_volume_per_layer * model.n_layers * 2
# Communication time
comm_time_s = total_ar_volume / (hw.interconnect_gbps * 1e9)
comm_time_ms = comm_time_s * 1000
# Memory check
fits_on_gpu = weights_per_gpu_gb < hw.gpu_memory_gb
return {
"strategy": f"Tensor Parallel (TP={tp_degree})",
"weights_per_gpu_gb": round(weights_per_gpu_gb, 1),
"fits_in_memory": fits_on_gpu,
"total_ar_volume_mb": round(total_ar_volume / 1e6, 2),
"comm_overhead_ms": round(comm_time_ms, 3),
"requires_nvlink": not hw.nvlink and comm_time_ms > 5,
"recommendation": "good" if (fits_on_gpu and comm_time_ms < 5) else "problematic",
}
def analyze_pipeline_parallel(
model: ModelConfig,
hw: HardwareConfig,
batch_size: int = 1,
seq_len: int = 1,
n_microbatches: int = 1,
) -> dict:
"""Estimate pipeline parallel inference efficiency."""
pp_degree = hw.n_gpus
layers_per_stage = model.n_layers // pp_degree
weights_per_gpu_gb = (model.n_params / pp_degree) * model.dtype_bytes / 1e9
# Activation tensor passed between stages
# Shape: [batch_size, seq_len, hidden_size]
activation_bytes = model.hidden_size * batch_size * seq_len * model.dtype_bytes
activation_mb = activation_bytes / 1e6
# P2P communication time between stages (one tensor per stage transition)
comm_per_stage_ms = (activation_bytes / (hw.interconnect_gbps * 1e9)) * 1000
# Bubble fraction
bubble_fraction = (pp_degree - 1) / (n_microbatches + pp_degree - 1)
# Total pipeline latency: stages × forward_time_per_stage + comm
# Approximate: proportional to number of stages in sequence
pipeline_latency_overhead_ms = comm_per_stage_ms * pp_degree
fits_on_gpu = weights_per_gpu_gb < hw.gpu_memory_gb
return {
"strategy": f"Pipeline Parallel (PP={pp_degree})",
"weights_per_gpu_gb": round(weights_per_gpu_gb, 1),
"fits_in_memory": fits_on_gpu,
"activation_per_stage_mb": round(activation_mb, 2),
"comm_per_stage_ms": round(comm_per_stage_ms, 3),
"pipeline_latency_overhead_ms": round(pipeline_latency_overhead_ms, 3),
"bubble_fraction": round(bubble_fraction, 3),
"recommendation": "good" if fits_on_gpu and bubble_fraction < 0.2 else "suboptimal",
}
# Example: LLaMA-3 70B on 4× A100 80GB
llama_70b = ModelConfig(
n_params=70e9,
n_layers=80,
hidden_size=8192,
n_heads=64,
dtype_bytes=2, # FP16
)
# NVLink-connected (single node)
nvlink_hw = HardwareConfig(
n_gpus=4,
gpu_memory_gb=80,
nvlink=True,
interconnect_gbps=600, # NVLink A100
)
# PCIe-connected (separate servers)
pcie_hw = HardwareConfig(
n_gpus=4,
gpu_memory_gb=80,
nvlink=False,
interconnect_gbps=64, # PCIe 4.0 x16
)
print("=== LLaMA-3 70B on 4× A100 80GB ===")
print()
for hw_name, hw in [("NVLink (single node)", nvlink_hw), ("PCIe (multi-server)", pcie_hw)]:
print(f"Hardware: {hw_name}")
tp = analyze_tensor_parallel(llama_70b, hw, batch_size=8)
pp = analyze_pipeline_parallel(llama_70b, hw, batch_size=8, n_microbatches=4)
print(f" Tensor Parallel: {tp['weights_per_gpu_gb']} GB/GPU, "
f"comm={tp['comm_overhead_ms']} ms, {tp['recommendation']}")
print(f" Pipeline Parallel: {pp['weights_per_gpu_gb']} GB/GPU, "
f"comm_latency={pp['pipeline_latency_overhead_ms']} ms, "
f"bubble={pp['bubble_fraction']:.1%}, {pp['recommendation']}")
print()
Choosing a Parallelism Strategy
Production Engineering Notes
Check GPU Topology Before Choosing TP
# Show interconnect topology between all GPU pairs
nvidia-smi topo -m
# Output matrix shows connection type:
# NV1-NV4: NVLink (generation 1-4) - use for TP
# PIX: PCIe crossing - do NOT use for TP
# SYS: separate NUMA domains - absolutely no TP
vLLM Multi-GPU Configuration
# Tensor parallel - single node, 4 NVLink-connected GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 4 \
--dtype float16 \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--port 8000
# TP + PP - multi-node (8 GPUs per node, 2 nodes)
# Node 0 (head), Node 1 (worker) connected via InfiniBand
NCCL_IB_DISABLE=0 \
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-405B-Instruct \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--dtype bfloat16 \
--port 8000
NVLink vs InfiniBand Summary
| NVLink (A100) | InfiniBand HDR | PCIe 4.0 x16 | |
|---|---|---|---|
| Bandwidth | 600 GB/s | 200 GB/s | 64 GB/s |
| Use for | TP all-reduce | PP activation passing, DP gradients | Last resort |
| Topology | Within node | Node-to-node | Within node (no NVLink) |
| TP feasibility | Excellent | Marginal | Poor for inference |
:::danger Never Use Tensor Parallelism Over PCIe for Inference
If your GPUs connect via PCIe (not NVLink), tensor parallelism will be dramatically slower than running a quantized model on fewer GPUs. PCIe all-reduce adds 10–50 ms of overhead per forward pass on typical LLM configurations - making inference 3–10× slower than single-GPU.
Check before deploying:
nvidia-smi topo -m # Look for NVLink (NV1-NV4) between your target GPUs
If you see PIX or SYS instead of NVLink, either quantize to fit on fewer GPUs, or use pipeline parallelism instead. :::
:::warning Pipeline Parallelism Is Rarely Optimal for Online Inference
Pipeline parallelism was designed for training, where large batch sizes amortize the pipeline bubble. For online inference at batch sizes of 1–32, the bubble overhead is not amortized - a 4-stage pipeline with batch size 1 wastes 75% of GPU time in the bubble.
Only use pipeline parallelism for inference when:
- The model is too large for TP within a single node (nodes lack NVLink to each other)
- Throughput (offline batch workloads) matters more than latency
For interactive serving, prefer tensor parallelism. For offline batch summarization across multi-node clusters, PP becomes more justifiable. :::
Interview Questions
Q1: LLaMA-2 70B requires 140 GB in FP16. You have 4× A100 80GB GPUs connected via NVLink. What parallelism strategy do you use and why?
Use tensor parallelism with TP=4. Each GPU holds GB of model weights, leaving 45 GB per GPU for KV cache and activations - comfortable for realistic batch sizes and sequence lengths. The NVLink interconnect (600 GB/s) makes all-reduce communication negligible: for a single-token decode, total all-reduce traffic across 80 layers is about 19.2 MB, taking under 0.1 ms on NVLink.
Pipeline parallelism is the alternative: 35 GB per GPU (same memory distribution as TP=4). However, pipeline parallel inference adds sequential latency - tokens must travel through each pipeline stage, and the pipeline bubble is not amortized at inference batch sizes. For online serving, TP is strictly preferable.
Q2: Explain the difference between column-parallel and row-parallel linear layers and why pairing them requires only one all-reduce per MLP block.
Column-parallel splits the weight matrix along output columns. Each GPU computes independently - the input is replicated across all GPUs. No communication is needed after this layer because each GPU produces a valid column shard of the output.
Row-parallel splits along input rows. Each GPU receives a column-sharded input (from the preceding column-parallel layer) and computes a partial sum . An all-reduce sums these partial results across GPUs to produce the full output.
Pairing them in sequence: the first (column-parallel) layer produces column-sharded output with no communication. The activation function (GeLU, SiLU) applies to each GPU's shard locally - no communication. The second (row-parallel) layer consumes the sharded input and produces a full output after one all-reduce. Total: one all-reduce per MLP block, regardless of the number of GPUs.
Q3: What is the pipeline bubble and how does micro-batching reduce it?
The pipeline bubble is idle time in pipeline parallelism. With pipeline stages and a single batch, each stage must wait for the preceding stage to finish before processing can begin. Stage 1 waits for stage 0, stage 2 waits for stages 0–1, and so on. During this wait, GPUs are completely idle. Bubble fraction: . With : 75% of compute time is wasted.
Micro-batching splits the global batch into smaller micro-batches. While stage 2 processes micro-batch 0, stage 1 can start processing micro-batch 1, and stage 0 can start micro-batch 2. The stages are no longer fully sequential - they overlap. Bubble fraction reduces to . With , : bubble fraction = .
Q4: Why is tensor parallelism preferred over pipeline parallelism for LLM inference?
Three reasons:
-
Latency: TP all-reduce adds microseconds on NVLink - negligible. PP adds sequential stage latency: each token must flow through every stage sequentially, adding milliseconds per stage that accumulate across the entire sequence.
-
Bubble overhead: Micro-batching amortizes the bubble at large batch sizes (training uses 256–4096 per GPU). Inference batch sizes are 1–32. With batch=1, a 4-stage pipeline wastes 75% of compute to bubble. This cannot be hidden.
-
Continuous batching compatibility: Continuous batching systems (vLLM, TGI) dynamically change batch composition at every step. Pipeline parallelism with dynamic batches is complex - the micro-batch scheduling must be redesigned for variable-length sequences entering and leaving the pipeline. TP is transparent to the batching layer.
Q5: What is 3D parallelism and when is it required?
3D parallelism combines tensor, pipeline, and data parallelism simultaneously. Total GPU count = TP × PP × DP.
It is required when a single node's NVLink-connected GPUs (TP=8 on DGX A100) are insufficient to hold the model, and when you want to scale throughput further with data parallelism.
Example: LLaMA-3 405B in FP16 = 810 GB. One 8-GPU DGX A100 node provides 640 GB - not enough. Solution: TP=8 within each node for fast intra-node parallelism, PP=2 across 2 nodes to hold the full 810 GB (405 GB per node), DP=N to scale throughput. Communication: NVLink for TP (fast), InfiniBand for PP activation passing and DP gradient sync.
Q6: How does expert parallelism for MoE models differ from tensor parallelism, and what communication primitive does it require?
Tensor parallelism routes every token through every GPU - each GPU holds a shard of every weight matrix and all GPUs collaborate to produce the output for every token via all-reduce.
Expert parallelism routes each token to the GPU(s) holding its selected expert(s) - the router makes a per-token decision about which GPU does the work. Only the assigned GPU(s) compute for each token.
TP requires all-reduce (every GPU sends to every other GPU, symmetric). EP requires all-to-all (each GPU may send to any other GPU, asymmetric - depends on which experts are selected for each token). All-to-all is more complex to implement but can be efficient with NVLink or InfiniBand. The key advantage of EP: total model capacity scales with the number of experts (more parameters = more knowledge) while inference compute per token stays constant.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Model Parallelism: Tensor & Pipeline demo on the EngineersOfAI Playground - no code required.
:::
