Skip to main content

Tensor and Pipeline Parallelism

The Production Scenario

You have been asked to serve LLaMA-2 70B for your company's internal engineering assistant. You requisition the hardware and the allocation comes back: four A100 80GB GPUs. You do the math. LLaMA-2 70B in FP16 requires 140 GB just for the model weights - before you account for KV cache, activations, or any runtime overhead. Your four A100s give you 320 GB total, which is more than enough in aggregate, but each individual card holds only 80 GB. The model does not fit on a single GPU. A single GPU cannot even load it, let alone run inference.

This is not an edge case. It is the default situation for frontier-scale LLMs. GPT-4 is estimated at over 1 trillion parameters. Gemini Ultra, Claude 3 Opus, Llama-3 405B - all of them shatter the memory capacity of any single GPU available today. Even the modest frontier models like Llama-3 70B require 140 GB in FP16, demanding at least two A100 80GB cards. Quantization helps - INT4 brings LLaMA 70B down to roughly 35 GB - but quantization has quality trade-offs that matter for demanding tasks like code generation, reasoning, and complex instruction following.

The solution is to distribute the model across multiple GPUs. But "distribute" is not a single strategy - there are fundamentally different ways to split a model, each with different memory requirements, communication costs, and latency characteristics. Tensor parallelism splits individual weight matrices across GPUs. Pipeline parallelism splits layers of the model across GPUs. Data parallelism replicates the full model and splits the input data. These strategies can be composed - the approach that enabled Megatron-LM to train trillion-parameter models on clusters of thousands of GPUs is a combination of all three.

Understanding these parallelism strategies is not just for researchers building frontier models. It is essential knowledge for any engineer who needs to deploy models at scale. Which GPUs to buy? How many? How to connect them? Which serving framework to use? All of these decisions depend on understanding how model parallelism works under the hood.


Why This Exists: The Memory Wall

The Numbers

Modern LLMs are enormous. Memory requirement by precision:

ModelParametersFP16 (bytes)INT8INT4
LLaMA-3 8B8B16 GB8 GB4 GB
LLaMA-3 70B70B140 GB70 GB35 GB
LLaMA-3 405B405B810 GB405 GB202 GB
GPT-4 (estimated)~1.8T~3,600 GB~1,800 GB~900 GB

The largest single GPU as of 2025 - the NVIDIA H200 - has 141 GB HBM3e. LLaMA-3 405B does not fit on a single H200 even in INT4.

For inference, memory requirements are:

  • Model weights: the dominant cost for large models (2×parameters2 \times \text{parameters} bytes in FP16)
  • KV cache: grows with batch size and sequence length - typically 20–40% of total GPU memory budget
  • Activations: small for inference (unlike training), often negligible

Practical rule: A model needs approximately parameters109×2\frac{\text{parameters}}{10^9} \times 2 GB in FP16. A 70B model needs ~140 GB.

Why Not Just Quantize?

INT4 quantization reduces LLaMA-3 70B to ~35 GB, fitting on a single A100. But:

  • Heavy quantization degrades accuracy on reasoning, math, and code generation tasks - the exact use cases where 70B beats 7B
  • Some deployments require reproducible, full-precision outputs for auditing or compliance
  • Quantization and multi-GPU parallelism are complementary, not alternatives - production systems often use both

Historical Context

Data parallelism has existed since the early deep learning era - PyTorch DDP, TensorFlow MirroredStrategy. It replicates the full model on each GPU, shards the data batch, and synchronizes gradients. This scales throughput but does nothing for models that exceed per-GPU memory.

Model parallelism became necessary around 2018–2019 as models grew beyond single-GPU capacity. Early implementations were manual: researchers hard-coded which layers ran on which GPU.

The landmark paper was Megatron-LM (Shoeybi et al., 2019) from NVIDIA, introducing systematic tensor parallelism for transformers. Megatron split attention heads and MLP weight matrices along specific dimensions with a clean two-operation (column-parallel then row-parallel) pattern requiring only one all-reduce per layer. This made large-scale training practical.

GPipe (Huang et al., 2019) from Google formalized pipeline parallelism with micro-batching to reduce bubble overhead from O(1)O(1) to O(1/m)O(1/m) where mm is the number of micro-batches.

3D parallelism - combining all three - emerged in 2021 as the standard recipe for 100B+ parameter training. Megatron-Turing NLG (530B, 2022) used TP=8, PP=35, DP=12 across 3,360 A100s.


Data Parallelism: Why It Is Not Enough

Before diving into tensor and pipeline parallelism, understand why data parallelism alone fails for large models.

Data parallelism replicates the full model on each GPU. Each GPU processes a different shard of the data batch. After the backward pass, gradients are all-reduced across GPUs so all copies remain identical.

Data Parallelism:

GPU 0: [Full Model Copy] ← data shard 0
GPU 1: [Full Model Copy] ← data shard 1
GPU 2: [Full Model Copy] ← data shard 2
GPU 3: [Full Model Copy] ← data shard 3
↕ All-reduce gradients
↕ (8× throughput, same per-GPU memory)

Problem: Each GPU holds the full model. If the model needs 140 GB and each GPU has 80 GB, data parallelism is impossible - no single GPU can even load the model.

Data parallelism scales throughput (more tokens per second) but does not reduce per-GPU memory requirements. You need tensor or pipeline parallelism for that.


Tensor Parallelism

Tensor parallelism (TP) splits individual weight matrices across GPUs. Each GPU holds a shard of each weight matrix and computes a partial result. Results are combined via all-reduce after each layer.

Column and Row Parallelism

For a linear layer Y=XWY = XW, there are two ways to split the weight matrix WRdin×doutW \in \mathbb{R}^{d_{in} \times d_{out}}:

Column parallelism: Split WW along columns.

Y=X[W1W2...Wp]=[XW1XW2...XWp]Y = X[W_1 | W_2 | ... | W_p] = [XW_1 | XW_2 | ... | XW_p]

Each GPU ii computes Yi=XWiY_i = XW_i independently - input XX is replicated across all GPUs. Output is column-sharded. No all-reduce needed after this layer.

Row parallelism: Split WW along rows.

Y=[X1X2...Xp][W1W2Wp]=i=1pXiWiY = [X_1 | X_2 | ... | X_p] \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_p \end{bmatrix} = \sum_{i=1}^{p} X_i W_i

Each GPU ii receives a column shard of XX (from the previous column-parallel layer) and computes a partial sum. One all-reduce sums partial results to produce the full output on all GPUs.

Tensor Parallelism in Transformers

Megatron-LM applied this to every transformer block:

MLP block: output=GeLU(XA)B\text{output} = \text{GeLU}(XA) B

  • AA: column-parallel - GPU ii holds columns of AA, applies GeLU locally to its shard
  • BB: row-parallel - GPU ii holds corresponding rows, all-reduce after

Attention block: Split attention heads across GPUs.

  • QKV projection: column-parallel - GPU ii computes its subset of attention heads
  • Output projection: row-parallel - all-reduce after

Key insight: Pairing column-parallel followed by row-parallel requires only one all-reduce per transformer layer (for MLP) and one all-reduce per transformer layer (for attention). The activation function (GeLU, SiLU) applies locally within each GPU's column shard, requiring no communication.

Each all-reduce in ring-all-reduce transfers 2(p1)p×S\frac{2(p-1)}{p} \times S bytes, where SS is the tensor size and pp is the TP degree.

For LLaMA-3 70B (hidden size 8192), generating a single token (1 token, FP16):

S=8192×1×2 bytes=16 KBS = 8192 \times 1 \times 2 \text{ bytes} = 16 \text{ KB}

all-reduce transfer=2×34×16 KB=24 KB\text{all-reduce transfer} = \frac{2 \times 3}{4} \times 16 \text{ KB} = 24 \text{ KB}

LLaMA-3 70B has 80 layers, two all-reduces per layer = 160 total all-reduces per forward pass:

InterconnectBandwidthPer all-reduceTotal per forward pass
NVLink A100600 GB/s0.04 µs6.4 µs ≈ 0 ms
PCIe 4.0 x1664 GB/s0.37 µs59 µs ≈ 0.06 ms

For a single token, even PCIe is manageable. But the picture changes at larger batch sizes:

At batch=32, each activation tensor is 32× larger: 512 KB. Total TP communication over 160 all-reduces:

InterconnectTotal per forward pass
NVLink A100~0.2 ms - negligible
PCIe 4.0 x16~1.9 ms - adds significant latency

Conclusion: Tensor parallelism requires NVLink for practical inference latency. The standard rule: use TP only within a single node where GPUs are NVLink-connected.


Pipeline Parallelism

Pipeline parallelism (PP) splits the model's layers across GPUs. GPU 0 holds the first L/pL/p layers, GPU 1 holds the next L/pL/p layers, and so on. Data flows through the pipeline stage by stage.

Pipeline Parallelism - LLaMA-3 70B across 4 GPUs:

GPU 0: [Embeddings + Layers 0–19] → activations → GPU 1
GPU 1: [Layers 20–39] → activations → GPU 2
GPU 2: [Layers 40–59] → activations → GPU 3
GPU 3: [Layers 60–79 + LM Head] → output logits

Memory is split evenly: each GPU holds 140/4 = 35 GB of weights. Communication is just the activation tensor passed between stages - much lower bandwidth requirement than TP all-reduce.

The Pipeline Bubble

Naive pipeline parallelism has a severe efficiency problem. With a single batch, each stage must wait for the previous stage to finish before it can start:

Naive PP (4 stages, 1 batch):

Stage 0: [FWRD....][ idle ]
Stage 1: [ wait ][FWRD....][ idle ]
Stage 2: [ wait ][FWRD....][ idle ]
Stage 3: [ wait ][FWRD....]

Bubble fraction = (p-1)/p = 75% waste with p=4

75% of GPU-time is wasted. This makes naive PP unusable at scale.

Micro-batching (GPipe)

GPipe (2019) splits the global batch into mm micro-batches. While later stages process early micro-batches, earlier stages process new ones:

PP with m=4 micro-batches (F=forward, p=4 stages):

Stage 0: [F0][F1][F2][F3][ ]
Stage 1: [ ][F0][F1][F2][F3][ ]
Stage 2: [ ][ ][F0][F1][F2][F3]
Stage 3: [ ][ ][ ][F0][F1][F2][F3]
^bubble^ ^bubble^

Bubble fraction with pp stages and mm micro-batches:

bubble fraction=p1m+p1\text{bubble fraction} = \frac{p-1}{m + p - 1}

For p=4p=4, m=16m=16: bubble fraction = 3/1916%3/19 \approx 16\%. Much better than 75%.

For inference, micro-batching is less useful because low latency matters more than throughput - you want to process each request quickly, not accumulate a large micro-batch.

1F1B Schedule

The 1F1B (One Forward One Backward) schedule, used in Megatron-LM, interleaves forward and backward passes more tightly:

1F1B Schedule (p=4, m=8):

Stage 0: F0 F1 F2 F3 B3 B2 B1 B0 F4 F5 F6 F7 B7 B6 B5 B4
Stage 1: F0 F1 F2 F3 B3 B2 B1 B0 F4 F5 F6 F7 B7 B6 B5
Stage 2: F0 F1 F2 F3 B3 B2 B1 B0 F4 F5 F6 F7 B7 B6
Stage 3: F0 F1 F2 F3 B3 B2 B1 B0 F4 F5 F6 F7 B7

1F1B achieves the same bubble fraction as GPipe but requires storing activations only for pp micro-batches simultaneously (not all mm as in GPipe), reducing peak memory.


Sequence Parallelism

For very long contexts (128K+ tokens), activation tensors become memory-intensive:

activation per layer=Lseq×dhidden×2 bytes\text{activation per layer} = L_{\text{seq}} \times d_{\text{hidden}} \times 2 \text{ bytes}

For Lseq=128000L_{\text{seq}} = 128000, dhidden=8192d_{\text{hidden}} = 8192 (LLaMA 70B):

128000×8192×22 GB per layer128000 \times 8192 \times 2 \approx 2 \text{ GB per layer}

With 80 layers, peak activations alone can exceed 160 GB - far more than model weights.

Sequence parallelism (Korthikanti et al., 2022) extends tensor parallelism to also shard the sequence dimension. In regions between tensor-parallel layers (where activations are replicated), sequence parallelism instead partitions the token sequence across GPUs. Each GPU processes its own slice of tokens, reducing per-GPU activation memory by a factor of pp (the TP degree).

The all-reduce at the end of TP layers is replaced by an all-gather (to reconstruct the full sequence before the next TP layer) - same communication volume, same bandwidth requirement.


3D Parallelism: The Full Stack

Training at scale uses all three dimensions simultaneously:

3D Parallelism Layout:

Data Parallel (DP) replicas
┌─────────────────────────────────────┐
│ Pipeline Stage 0 │ Pipeline Stage 1 │ ... │
│ [GPU0][GPU1].. │ [GPU8][GPU9].. │ │
│ ^tensor parallel^ │ ^tensor parallel^ │ │
└─────────────────────────────────────┘
└─────── one model copy ─────────────┘
└──── × DP replicas (different data) ─────────────────┘

For Megatron-Turing NLG (530B parameters):

  • TP=8: within each 8-GPU DGX node, NVLink all-reduce
  • PP=35: 35 pipeline stages across nodes, InfiniBand activation passing
  • DP=12: 12 model replicas for throughput, InfiniBand gradient sync
  • Total: 8×35×12=3,3608 \times 35 \times 12 = 3,360 A100 GPUs

Expert Parallelism for MoE Models

Mixture-of-Experts (MoE) models like Mixtral 8x7B use a different parallelism dimension. In an MoE layer, a router sends each token to one of EE expert networks (feed-forward layers). With expert parallelism, each GPU holds a subset of experts:

  • GPU 0 holds experts 0–1
  • GPU 1 holds experts 2–3
  • ... and so on

Each token is routed to the GPU holding its expert. This requires all-to-all communication: each GPU may need to send tokens to and receive tokens from any other GPU.

Expert parallelism scales the model's capacity (more parameters = more knowledge) without scaling inference compute per token - each token only passes through one or two experts per MoE layer.


Inference vs Training: Different Optimal Strategies

The best parallelism strategy differs significantly between training and inference:

AspectTrainingInference
Typical batch size256–4096 per GPU1–32 per GPU
Memory pressureVery high (weights + gradients + optimizer states + activations)Moderate (weights + KV cache)
Latency requirementNot critical (offline)Critical (user-facing)
Pipeline parallelismExcellent - large batches amortize bubblePoor - small batches cannot amortize bubble
Tensor parallelismTP=8 standardTP=2 to 8 depending on model size
Communication budgetCan overlap with computeMust be minimized for low latency

For inference, prefer tensor parallelism over pipeline parallelism when both options are available. TP adds a small, fixed latency per layer (all-reduce). PP adds sequential latency proportional to the number of pipeline stages and cannot be hidden at small batch sizes.


Code: Analyzing Parallelism Trade-offs

"""
Analytical model comparing tensor vs pipeline parallelism for inference.
Helps choose the right strategy for a given hardware configuration.
"""

from dataclasses import dataclass
from typing import Literal
import math


@dataclass
class HardwareConfig:
n_gpus: int
gpu_memory_gb: float
nvlink: bool # True if GPUs connected via NVLink
interconnect_gbps: float # Bandwidth in GB/s (NVLink or PCIe)


@dataclass
class ModelConfig:
n_params: float # Total parameters
n_layers: int
hidden_size: int
n_heads: int
dtype_bytes: int = 2 # FP16 = 2 bytes


def analyze_tensor_parallel(
model: ModelConfig,
hw: HardwareConfig,
batch_size: int = 1,
seq_len: int = 1, # For decode, seq_len=1 (generating one token)
) -> dict:
"""Estimate tensor parallel inference efficiency."""

tp_degree = hw.n_gpus
weights_total_gb = model.n_params * model.dtype_bytes / 1e9
weights_per_gpu_gb = weights_total_gb / tp_degree

# Activation size per token per layer (hidden_size activations, FP16)
activation_bytes = model.hidden_size * batch_size * seq_len * model.dtype_bytes

# All-reduce volume per layer: 2*(p-1)/p * activation_size
# Two all-reduces per layer (MLP + attention)
ar_volume_per_layer = 2 * (tp_degree - 1) / tp_degree * activation_bytes
total_ar_volume = ar_volume_per_layer * model.n_layers * 2

# Communication time
comm_time_s = total_ar_volume / (hw.interconnect_gbps * 1e9)
comm_time_ms = comm_time_s * 1000

# Memory check
fits_on_gpu = weights_per_gpu_gb < hw.gpu_memory_gb

return {
"strategy": f"Tensor Parallel (TP={tp_degree})",
"weights_per_gpu_gb": round(weights_per_gpu_gb, 1),
"fits_in_memory": fits_on_gpu,
"total_ar_volume_mb": round(total_ar_volume / 1e6, 2),
"comm_overhead_ms": round(comm_time_ms, 3),
"requires_nvlink": not hw.nvlink and comm_time_ms > 5,
"recommendation": "good" if (fits_on_gpu and comm_time_ms < 5) else "problematic",
}


def analyze_pipeline_parallel(
model: ModelConfig,
hw: HardwareConfig,
batch_size: int = 1,
seq_len: int = 1,
n_microbatches: int = 1,
) -> dict:
"""Estimate pipeline parallel inference efficiency."""

pp_degree = hw.n_gpus
layers_per_stage = model.n_layers // pp_degree
weights_per_gpu_gb = (model.n_params / pp_degree) * model.dtype_bytes / 1e9

# Activation tensor passed between stages
# Shape: [batch_size, seq_len, hidden_size]
activation_bytes = model.hidden_size * batch_size * seq_len * model.dtype_bytes
activation_mb = activation_bytes / 1e6

# P2P communication time between stages (one tensor per stage transition)
comm_per_stage_ms = (activation_bytes / (hw.interconnect_gbps * 1e9)) * 1000

# Bubble fraction
bubble_fraction = (pp_degree - 1) / (n_microbatches + pp_degree - 1)

# Total pipeline latency: stages × forward_time_per_stage + comm
# Approximate: proportional to number of stages in sequence
pipeline_latency_overhead_ms = comm_per_stage_ms * pp_degree

fits_on_gpu = weights_per_gpu_gb < hw.gpu_memory_gb

return {
"strategy": f"Pipeline Parallel (PP={pp_degree})",
"weights_per_gpu_gb": round(weights_per_gpu_gb, 1),
"fits_in_memory": fits_on_gpu,
"activation_per_stage_mb": round(activation_mb, 2),
"comm_per_stage_ms": round(comm_per_stage_ms, 3),
"pipeline_latency_overhead_ms": round(pipeline_latency_overhead_ms, 3),
"bubble_fraction": round(bubble_fraction, 3),
"recommendation": "good" if fits_on_gpu and bubble_fraction < 0.2 else "suboptimal",
}


# Example: LLaMA-3 70B on 4× A100 80GB
llama_70b = ModelConfig(
n_params=70e9,
n_layers=80,
hidden_size=8192,
n_heads=64,
dtype_bytes=2, # FP16
)

# NVLink-connected (single node)
nvlink_hw = HardwareConfig(
n_gpus=4,
gpu_memory_gb=80,
nvlink=True,
interconnect_gbps=600, # NVLink A100
)

# PCIe-connected (separate servers)
pcie_hw = HardwareConfig(
n_gpus=4,
gpu_memory_gb=80,
nvlink=False,
interconnect_gbps=64, # PCIe 4.0 x16
)

print("=== LLaMA-3 70B on 4× A100 80GB ===")
print()

for hw_name, hw in [("NVLink (single node)", nvlink_hw), ("PCIe (multi-server)", pcie_hw)]:
print(f"Hardware: {hw_name}")
tp = analyze_tensor_parallel(llama_70b, hw, batch_size=8)
pp = analyze_pipeline_parallel(llama_70b, hw, batch_size=8, n_microbatches=4)

print(f" Tensor Parallel: {tp['weights_per_gpu_gb']} GB/GPU, "
f"comm={tp['comm_overhead_ms']} ms, {tp['recommendation']}")
print(f" Pipeline Parallel: {pp['weights_per_gpu_gb']} GB/GPU, "
f"comm_latency={pp['pipeline_latency_overhead_ms']} ms, "
f"bubble={pp['bubble_fraction']:.1%}, {pp['recommendation']}")
print()

Choosing a Parallelism Strategy


Production Engineering Notes

Check GPU Topology Before Choosing TP

# Show interconnect topology between all GPU pairs
nvidia-smi topo -m

# Output matrix shows connection type:
# NV1-NV4: NVLink (generation 1-4) - use for TP
# PIX: PCIe crossing - do NOT use for TP
# SYS: separate NUMA domains - absolutely no TP

vLLM Multi-GPU Configuration

# Tensor parallel - single node, 4 NVLink-connected GPUs
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 4 \
--dtype float16 \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--port 8000

# TP + PP - multi-node (8 GPUs per node, 2 nodes)
# Node 0 (head), Node 1 (worker) connected via InfiniBand
NCCL_IB_DISABLE=0 \
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-405B-Instruct \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--dtype bfloat16 \
--port 8000
NVLink (A100)InfiniBand HDRPCIe 4.0 x16
Bandwidth600 GB/s200 GB/s64 GB/s
Use forTP all-reducePP activation passing, DP gradientsLast resort
TopologyWithin nodeNode-to-nodeWithin node (no NVLink)
TP feasibilityExcellentMarginalPoor for inference

:::danger Never Use Tensor Parallelism Over PCIe for Inference

If your GPUs connect via PCIe (not NVLink), tensor parallelism will be dramatically slower than running a quantized model on fewer GPUs. PCIe all-reduce adds 10–50 ms of overhead per forward pass on typical LLM configurations - making inference 3–10× slower than single-GPU.

Check before deploying:

nvidia-smi topo -m # Look for NVLink (NV1-NV4) between your target GPUs

If you see PIX or SYS instead of NVLink, either quantize to fit on fewer GPUs, or use pipeline parallelism instead. :::

:::warning Pipeline Parallelism Is Rarely Optimal for Online Inference

Pipeline parallelism was designed for training, where large batch sizes amortize the pipeline bubble. For online inference at batch sizes of 1–32, the bubble overhead is not amortized - a 4-stage pipeline with batch size 1 wastes 75% of GPU time in the bubble.

Only use pipeline parallelism for inference when:

  1. The model is too large for TP within a single node (nodes lack NVLink to each other)
  2. Throughput (offline batch workloads) matters more than latency

For interactive serving, prefer tensor parallelism. For offline batch summarization across multi-node clusters, PP becomes more justifiable. :::


Interview Questions

Q1: LLaMA-2 70B requires 140 GB in FP16. You have 4× A100 80GB GPUs connected via NVLink. What parallelism strategy do you use and why?

Use tensor parallelism with TP=4. Each GPU holds 140/4=35140/4 = 35 GB of model weights, leaving 45 GB per GPU for KV cache and activations - comfortable for realistic batch sizes and sequence lengths. The NVLink interconnect (600 GB/s) makes all-reduce communication negligible: for a single-token decode, total all-reduce traffic across 80 layers is about 19.2 MB, taking under 0.1 ms on NVLink.

Pipeline parallelism is the alternative: 35 GB per GPU (same memory distribution as TP=4). However, pipeline parallel inference adds sequential latency - tokens must travel through each pipeline stage, and the pipeline bubble is not amortized at inference batch sizes. For online serving, TP is strictly preferable.

Q2: Explain the difference between column-parallel and row-parallel linear layers and why pairing them requires only one all-reduce per MLP block.

Column-parallel splits the weight matrix WW along output columns. Each GPU ii computes Yi=XWiY_i = XW_i independently - the input XX is replicated across all GPUs. No communication is needed after this layer because each GPU produces a valid column shard of the output.

Row-parallel splits WW along input rows. Each GPU ii receives a column-sharded input XiX_i (from the preceding column-parallel layer) and computes a partial sum XiWiX_i W_i. An all-reduce sums these partial results across GPUs to produce the full output.

Pairing them in sequence: the first (column-parallel) layer produces column-sharded output with no communication. The activation function (GeLU, SiLU) applies to each GPU's shard locally - no communication. The second (row-parallel) layer consumes the sharded input and produces a full output after one all-reduce. Total: one all-reduce per MLP block, regardless of the number of GPUs.

Q3: What is the pipeline bubble and how does micro-batching reduce it?

The pipeline bubble is idle time in pipeline parallelism. With pp pipeline stages and a single batch, each stage must wait for the preceding stage to finish before processing can begin. Stage 1 waits for stage 0, stage 2 waits for stages 0–1, and so on. During this wait, GPUs are completely idle. Bubble fraction: (p1)/p(p-1)/p. With p=4p=4: 75% of compute time is wasted.

Micro-batching splits the global batch into mm smaller micro-batches. While stage 2 processes micro-batch 0, stage 1 can start processing micro-batch 1, and stage 0 can start micro-batch 2. The stages are no longer fully sequential - they overlap. Bubble fraction reduces to (p1)/(m+p1)(p-1)/(m+p-1). With p=4p=4, m=16m=16: bubble fraction = 3/1916%3/19 \approx 16\%.

Q4: Why is tensor parallelism preferred over pipeline parallelism for LLM inference?

Three reasons:

  1. Latency: TP all-reduce adds microseconds on NVLink - negligible. PP adds sequential stage latency: each token must flow through every stage sequentially, adding milliseconds per stage that accumulate across the entire sequence.

  2. Bubble overhead: Micro-batching amortizes the bubble at large batch sizes (training uses 256–4096 per GPU). Inference batch sizes are 1–32. With batch=1, a 4-stage pipeline wastes 75% of compute to bubble. This cannot be hidden.

  3. Continuous batching compatibility: Continuous batching systems (vLLM, TGI) dynamically change batch composition at every step. Pipeline parallelism with dynamic batches is complex - the micro-batch scheduling must be redesigned for variable-length sequences entering and leaving the pipeline. TP is transparent to the batching layer.

Q5: What is 3D parallelism and when is it required?

3D parallelism combines tensor, pipeline, and data parallelism simultaneously. Total GPU count = TP × PP × DP.

It is required when a single node's NVLink-connected GPUs (TP=8 on DGX A100) are insufficient to hold the model, and when you want to scale throughput further with data parallelism.

Example: LLaMA-3 405B in FP16 = 810 GB. One 8-GPU DGX A100 node provides 640 GB - not enough. Solution: TP=8 within each node for fast intra-node parallelism, PP=2 across 2 nodes to hold the full 810 GB (405 GB per node), DP=N to scale throughput. Communication: NVLink for TP (fast), InfiniBand for PP activation passing and DP gradient sync.

Q6: How does expert parallelism for MoE models differ from tensor parallelism, and what communication primitive does it require?

Tensor parallelism routes every token through every GPU - each GPU holds a shard of every weight matrix and all GPUs collaborate to produce the output for every token via all-reduce.

Expert parallelism routes each token to the GPU(s) holding its selected expert(s) - the router makes a per-token decision about which GPU does the work. Only the assigned GPU(s) compute for each token.

TP requires all-reduce (every GPU sends to every other GPU, symmetric). EP requires all-to-all (each GPU may send to any other GPU, asymmetric - depends on which experts are selected for each token). All-to-all is more complex to implement but can be efficient with NVLink or InfiniBand. The key advantage of EP: total model capacity scales with the number of experts (more parameters = more knowledge) while inference compute per token stays constant.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Model Parallelism: Tensor & Pipeline demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.