
CUDA Streams and Async Execution

Reading time: ~38 min · Interview relevance: High · Target roles: CUDA Developer, ML Systems Engineer, Performance Engineer


The GPU That Was 40% Idle

The training server had four A100s linked by NVLink, each with 2 TB/s of HBM bandwidth. On paper, this machine was fast. But when the engineer checked Nsight Systems on the training job, the GPU utilization trace looked like a comb: compute, then idle, then compute, then idle, repeating every batch.

The timeline told the story clearly:

Timeline (one batch, time flows right):
──────────────────────────────────────────────────────────
H2D Transfer:  [===========]
Compute:                    [===================]
D2H Transfer:                                    [=======]
──────────────────────────────────────────────────────────
GPU idle:       ↑↑↑↑↑↑↑↑↑↑↑                       ↑↑↑↑↑↑↑

Every batch followed the same three-act structure: copy data to the GPU, run the forward and backward pass, copy gradients back to the CPU. Each act waited for the previous one to finish before starting. The GPU was idle during both transfers. On this workload, data transfers consumed about 40% of wall time - which meant the GPU sat idle for 40% of every batch.

The fix was CUDA streams. By using two streams and two device buffers, the engineer restructured the pipeline so that while the GPU was processing batch N, the host was already transferring batch N+1 to the device. The total wall time dropped by 35%. The same hardware, the same model, the same data - just a different execution schedule.

This lesson teaches you how CUDA streams work, why they enable concurrency, how to build the double-buffering pipeline that solved the problem above, and how PyTorch's stream and non-blocking transfer APIs give you these capabilities without writing raw CUDA.


Why This Exists - The GPU Cannot Guess What to Overlap

A GPU is a deeply pipelined machine with multiple independent execution units: shader cores, memory controllers, PCIe DMA engines, NVLink engines. These units can physically operate simultaneously. A DMA engine copying memory from host to device does not use the shader cores at all. In principle, a PCIe transfer (roughly 32 GB/s on a Gen4 x16 link) should be able to run in parallel with a kernel using shader cores and HBM.

The problem is that the GPU needs to know which operations are independent before it can overlap them. By default, the GPU is conservative: every operation waits for all previous operations to complete before starting. This ensures correctness but sacrifices performance when operations are actually independent.

CUDA streams are the API by which you communicate independence to the GPU. A stream is an ordered queue of operations. Operations within the same stream execute in the order they were submitted and cannot overlap each other. Operations in different streams can overlap if they are independent - if they do not share resources that would cause a hazard.

When you put the H2D transfer for batch N+1 in a different stream from the kernel processing batch N, you are telling the GPU: "these two things do not depend on each other, and you are allowed to execute them at the same time." The GPU then schedules them concurrently on different hardware units, hiding the transfer latency behind compute.

This is not magic. The hardware overlap is only possible because:

  1. The H2D transfer uses the DMA engine, not shader cores
  2. The kernel uses shader cores and HBM, not the PCIe DMA engine
  3. They access different memory regions (different device buffers)
  4. You have communicated this independence through separate streams

If any of these conditions is violated - if the kernel reads a buffer that the transfer is writing - the result would be a data race and corrupted output. Streams give you performance, but you are responsible for ensuring the operations in different streams are genuinely independent.
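
To make these conditions concrete, here is a minimal sketch using PyTorch's stream wrapper (covered in detail later in this lesson): an H2D copy of one buffer overlaps a matmul on a different buffer that is already resident on the device. The buffer names and sizes are illustrative, not part of any API.

import torch

copy_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()

h_next = torch.randn(1 << 22).pin_memory()          # next batch, pinned host memory
d_next = torch.empty(1 << 22, device="cuda")        # destination buffer for the copy
d_current = torch.randn(4096, 4096, device="cuda")  # data already resident on the GPU

torch.cuda.synchronize()  # make sure setup work is finished before overlapping

with torch.cuda.stream(copy_stream):
    # Condition 1: uses a DMA copy engine and touches only d_next
    d_next.copy_(h_next, non_blocking=True)

with torch.cuda.stream(compute_stream):
    # Conditions 2-3: uses the compute units and HBM, touches only d_current
    result = d_current @ d_current

# Condition 4 was met by putting the two operations on separate streams.
copy_stream.synchronize()
compute_stream.synchronize()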


Historical Context - From Synchronous Beginnings to Full Concurrency

Early CUDA (version 1.0, 2007) had no concurrency model. Every operation - kernel launches, memory copies - was serialized on a single implicit stream. The GPU executed exactly one thing at a time.

CUDA 2.0 (2008) introduced streams as a concept, but the hardware of that era (Tesla architecture) had a single copy engine and could not truly overlap a memcpy with kernel execution. Streams existed but their benefits were limited.

The Fermi architecture (2010) was the first to physically support concurrent kernel execution on the same GPU. For the first time, multiple kernels from different streams could run simultaneously on different SMs. This opened the door to the kind of overlap that makes streams genuinely valuable.

Kepler (2012) introduced Hyper-Q: instead of a single hardware queue that serialized stream submissions, Kepler had 32 hardware queues, allowing truly independent streams to be dispatched concurrently without false dependencies in the submission path. This was the point where streams went from theoretically useful to practically essential.

Pascal and later architectures improved the DMA engines (adding multiple copy engines), increased the number of concurrent kernel slots, and made the overlap between compute and memory transfers more reliable. Today, on Ampere and Hopper, you can run kernel computation, H2D transfer, and D2H transfer simultaneously on three different streams.


Core Concepts

The Default Stream

When you write CUDA code without explicitly creating a stream, all operations go on stream 0, also called the default stream or the null stream. The default stream has special serialization semantics: every operation on the default stream waits for all previous operations on ALL streams to complete before executing, and all operations on all streams wait for the default stream to drain before they execute.

This means that if you mix default-stream operations with non-default-stream operations, the default stream acts as a global barrier. Many subtle bugs in stream-based code come from accidentally using the default stream at a synchronization point and destroying the overlap you worked to create.

// These three operations run sequentially - no overlap possible
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, 0); // stream 0
my_kernel<<<grid, block, 0, 0>>>(d_a, d_b); // stream 0
cudaMemcpyAsync(h_c, d_c, bytes, cudaMemcpyDeviceToHost, 0); // stream 0

Creating and Using Non-Default Streams

#include <cuda_runtime.h>

// Create streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Launch kernel on stream1
my_kernel<<<grid, block, 0, stream1>>>(d_input1, d_output1);

// Launch a different kernel on stream2 (can run concurrently with stream1)
another_kernel<<<grid, block, 0, stream2>>>(d_input2, d_output2);

// Wait for both streams to complete
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

// Or wait for everything to complete
cudaDeviceSynchronize();

// Cleanup
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

Within stream1, all operations execute in submission order. Within stream2, same. Between stream1 and stream2, execution is concurrent where hardware supports it.

What Can and Cannot Overlap

Understanding what the hardware can actually overlap is critical:

Can overlap:

  • Kernel execution on stream A and H2D memcpy on stream B (uses different hardware units)
  • Kernel execution on stream A and D2H memcpy on stream B (same - different hardware units)
  • H2D transfer on stream A and D2H transfer on stream B (if two copy engines exist - Kepler+)
  • Two kernels on stream A and stream B (if both fit in available SM resources)

Cannot overlap:

  • Two operations on the same stream (always sequential by definition)
  • A kernel on stream A and a kernel on stream B that both write to the same memory region (data hazard - wrong results, not a hardware limitation)
  • Any operation that follows a cudaDeviceSynchronize() or default-stream operation until those complete

Pinned (Page-Locked) Memory - Required for Async Transfers

The GPU's DMA engine can only transfer data directly from pinned (page-locked) host memory. Standard malloc memory is pageable - the OS can swap it to disk at any time, and the DMA engine cannot handle that. If you call cudaMemcpyAsync with pageable memory, CUDA silently falls back to a synchronous copy through a staging buffer, eliminating all overlap benefit.

// Pageable memory - async copy silently becomes synchronous
float* h_data_pageable = (float*)malloc(bytes);
cudaMemcpyAsync(d_data, h_data_pageable, bytes,
                cudaMemcpyHostToDevice, stream); // actually synchronous!

// Pinned memory - async copy is truly asynchronous
float* h_data_pinned;
cudaMallocHost(&h_data_pinned, bytes); // allocate pinned
cudaMemcpyAsync(d_data, h_data_pinned, bytes,
                cudaMemcpyHostToDevice, stream); // truly async
cudaFreeHost(h_data_pinned); // free pinned memory

The tradeoff: pinned memory is not pageable, so the OS cannot reclaim it under memory pressure. Using too much pinned memory degrades overall system performance. As a rule of thumb, pin only the memory that participates in async transfers - typically your input/output staging buffers, not your entire dataset.


CUDA Events - Timing and Inter-Stream Synchronization

What CUDA Events Are

A CUDA event is a timestamp recorded on a stream. When you record an event on a stream, the GPU records the current time on that stream when it reaches that point in the command queue. Events serve two purposes:

  1. Timing: measure the elapsed GPU time between two events
  2. Synchronization: make one stream wait for a specific point in another stream

Event-Based Timing

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Record start on the stream
cudaEventRecord(start, stream1);

// Launch kernel
my_kernel<<<grid, block, 0, stream1>>>(d_input, d_output);

// Record stop on the same stream (after kernel in queue order)
cudaEventRecord(stop, stream1);

// Wait for the stop event to complete
cudaEventSynchronize(stop);

// Compute elapsed time in milliseconds
float elapsed_ms;
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("Kernel time: %.3f ms\n", elapsed_ms);

// Cleanup
cudaEventDestroy(start);
cudaEventDestroy(stop);

cudaEventRecord is non-blocking on the CPU. It inserts a timestamp into the stream's command queue. cudaEventSynchronize blocks the CPU until the GPU reaches and executes that event. cudaEventElapsedTime returns the GPU-measured time between two completed events.

This is more accurate than CPU-side timing because it measures GPU execution time directly, eliminating CPU overhead and scheduling jitter from the measurement.
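
To see the difference in practice, here is a small PyTorch sketch (PyTorch's event API, shown later in this lesson, wraps the same cudaEvent calls); the matrix size is illustrative. The naive CPU-side timing reports almost nothing because the kernel has not finished when perf_counter is read.

import time
import torch

x = torch.randn(4096, 4096, device="cuda")
_ = x @ x                      # warm up cuBLAS so launch cost is representative
torch.cuda.synchronize()

# CPU-side timing without a synchronize: the matmul launch returns
# immediately, so this mostly measures Python and launch overhead.
t0 = time.perf_counter()
y = x @ x
t1 = time.perf_counter()
print(f"CPU timer, no sync: {(t1 - t0) * 1e3:.3f} ms")  # misleadingly small

# Event-based timing measures what the GPU actually executed.
start = torch.cuda.Event(enable_timing=True)
stop = torch.cuda.Event(enable_timing=True)
start.record()
y = x @ x
stop.record()
stop.synchronize()             # block the CPU until the stop event has executed
print(f"CUDA events: {start.elapsed_time(stop):.3f} ms")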

Inter-Stream Synchronization with Events

Sometimes you need stream B to wait until stream A has reached a specific point, without waiting for stream A to fully complete. Events make this precise:

cudaStream_t streamA, streamB;
cudaStreamCreate(&streamA);
cudaStreamCreate(&streamB);

cudaEvent_t checkpoint;
cudaEventCreate(&checkpoint);

// Stream A does some work
stage1_kernel<<<grid, block, 0, streamA>>>(d_a);

// Record a checkpoint in stream A
cudaEventRecord(checkpoint, streamA);

// Stream B waits for the checkpoint before starting
// (does not block the CPU - inserts a wait into stream B's queue)
cudaStreamWaitEvent(streamB, checkpoint, 0);

// These run after stage1_kernel completes in stream A
stage2_kernel<<<grid, block, 0, streamA>>>(d_a, d_b);
stage2_dependent_kernel<<<grid, block, 0, streamB>>>(d_a, d_c);

cudaStreamWaitEvent is key: it is a GPU-side wait. It does not block the CPU. It inserts a dependency into stream B's hardware queue so that stream B's subsequent operations do not start until the GPU has executed the event in stream A. This is how you express "stream B depends on a specific result from stream A" without fully serializing both streams.


Double Buffering - The Core Overlap Pattern

Double buffering is the most important practical application of CUDA streams. The idea is straightforward:

  • Maintain two sets of device buffers: buffer A and buffer B
  • While the GPU processes data from buffer A (stream 1), transfer the next batch into buffer B (stream 2)
  • When processing of buffer A finishes, swap roles: stream 1 now processes buffer B while stream 2 loads the next batch into buffer A
  • The GPU never idles waiting for data, and data transfers are hidden behind computation

Without double buffering:
──────────────────────────────────────────────────────────────────
[H2D batch 0][compute batch 0][H2D batch 1][compute batch 1]...
Total time = sum of all individual phases

With double buffering:
──────────────────────────────────────────────────────────────────
[H2D batch 0][compute batch 0]
             [H2D batch 1][compute batch 1]
                          [H2D batch 2][compute batch 2]...
Total time = max(transfer time, compute time) per batch

If compute time > transfer time (typical for large models), transfers are completely hidden and throughput approaches the compute-only speed.

Complete Double Buffering Implementation

#include <cuda_runtime.h>
#include <stdio.h>

#define N_BATCHES 8
#define BATCH_SIZE (1 << 20) // 1M elements per batch
#define BYTES (BATCH_SIZE * sizeof(float))

__global__ void process_kernel(float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        // Simulate some compute work
        float val = input[idx];
        for (int i = 0; i < 16; ++i) {
            val = val * 1.0001f + 0.0001f;
        }
        output[idx] = val;
    }
}

int main() {
    // Allocate pinned host memory for all batches
    float* h_input[N_BATCHES];
    float* h_output[N_BATCHES];
    for (int i = 0; i < N_BATCHES; ++i) {
        cudaMallocHost(&h_input[i], BYTES);
        cudaMallocHost(&h_output[i], BYTES);
        // Initialize input data
        for (int j = 0; j < BATCH_SIZE; ++j) {
            h_input[i][j] = (float)(i * BATCH_SIZE + j);
        }
    }

    // Allocate two device buffers (double buffer)
    float* d_input[2];
    float* d_output[2];
    cudaMalloc(&d_input[0], BYTES);
    cudaMalloc(&d_input[1], BYTES);
    cudaMalloc(&d_output[0], BYTES);
    cudaMalloc(&d_output[1], BYTES);

    // Create two streams
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    // Create events for timing
    cudaEvent_t total_start, total_stop;
    cudaEventCreate(&total_start);
    cudaEventCreate(&total_stop);

    int threads = 256;
    int blocks = (BATCH_SIZE + threads - 1) / threads;

    cudaEventRecord(total_start, stream[0]);

    // Pipeline: process batch i while transferring batch i+1
    for (int i = 0; i < N_BATCHES; ++i) {
        int buf = i % 2; // alternates 0, 1, 0, 1, ...
        int s = buf;     // stream matches buffer index

        // H2D: transfer batch i into device buffer buf
        cudaMemcpyAsync(d_input[buf], h_input[i], BYTES,
                        cudaMemcpyHostToDevice, stream[s]);

        // Compute: process batch i on the same stream (after H2D completes)
        process_kernel<<<blocks, threads, 0, stream[s]>>>(
            d_input[buf], d_output[buf], BATCH_SIZE);

        // D2H: transfer result of batch i back to host
        cudaMemcpyAsync(h_output[i], d_output[buf], BYTES,
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    // Wait for all streams to finish
    cudaStreamSynchronize(stream[0]);
    cudaStreamSynchronize(stream[1]);

    cudaEventRecord(total_stop, stream[0]);
    cudaEventSynchronize(total_stop);

    float elapsed;
    cudaEventElapsedTime(&elapsed, total_start, total_stop);
    printf("Double-buffered pipeline: %.2f ms for %d batches\n",
           elapsed, N_BATCHES);
    printf("Throughput: %.2f GB/s\n",
           (float)(N_BATCHES * BYTES * 2) / (elapsed * 1e-3) / 1e9);

    // Cleanup
    cudaEventDestroy(total_start);
    cudaEventDestroy(total_stop);
    for (int s = 0; s < 2; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_input[s]);
        cudaFree(d_output[s]);
    }
    for (int i = 0; i < N_BATCHES; ++i) {
        cudaFreeHost(h_input[i]);
        cudaFreeHost(h_output[i]);
    }

    return 0;
}

The key insight in this code: stream[0] handles even batches (0, 2, 4, 6) and stream[1] handles odd batches (1, 3, 5, 7). Within each stream, H2D -> kernel -> D2H is sequential (correct). Between streams, batch 0 on stream[0] and batch 1 on stream[1] overlap (fast).




PyTorch Streams

Creating and Using Streams in PyTorch

PyTorch wraps CUDA streams with a Python-friendly API:

import torch

# Create a non-default CUDA stream
stream = torch.cuda.Stream()

# Use the stream as a context manager
with torch.cuda.stream(stream):
    # All CUDA operations inside this block go on 'stream'
    output = model(input_tensor)
    loss = criterion(output, labels)

# Outside the context, operations return to the default stream

# Synchronize the specific stream
stream.synchronize()

# Or synchronize all streams
torch.cuda.synchronize()

# Check the current stream
current = torch.cuda.current_stream()
default = torch.cuda.default_stream()

Non-Blocking Tensor Transfers

The PyTorch equivalent of cudaMemcpyAsync is the non_blocking=True argument on tensor .to() and .cuda() calls:

import torch

# Allocate pinned memory on host (required for true async)
# PyTorch DataLoader does this with pin_memory=True
x_pinned = torch.randn(1024, 512).pin_memory()

# Non-blocking transfer to GPU (returns immediately on CPU)
stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    x_gpu = x_pinned.to("cuda", non_blocking=True)

# x_gpu may not be ready yet on the CPU side,
# but the stream will ensure it is ready before any operation
# launched on 'stream' after this point uses x_gpu

# This kernel launch on the same stream is guaranteed to see
# the completed transfer because streams are ordered
with torch.cuda.stream(stream):
    output = model(x_gpu)  # safe - stream ordering ensures x_gpu is ready
:::danger non_blocking=True Without Pinned Memory If you call tensor.to(device, non_blocking=True) with pageable (non-pinned) host memory, PyTorch silently falls back to a synchronous copy. The non_blocking=True flag is silently ignored. You will get correct results but no overlap benefit. Always pair non_blocking=True with pinned memory (via pin_memory=True in DataLoader, or explicitly via tensor.pin_memory()). :::
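
A cheap guard against this pitfall is to check (or enforce) pinning before issuing the transfer; a minimal sketch:

import torch

x = torch.randn(1024, 512)   # ordinary pageable host tensor
print(x.is_pinned())          # False - non_blocking=True would silently degrade

x = x.pin_memory()            # returns a pinned copy of the tensor
print(x.is_pinned())          # True

# Now the H2D copy is genuinely asynchronous with respect to the CPU
x_gpu = x.to("cuda", non_blocking=True)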

DataLoader with Pin Memory and Non-Blocking Transfers

The most common and highest-value place to apply streams in PyTorch training is the data loading pipeline. The DataLoader's pin_memory=True option copies each batch it produces into pinned host memory, enabling non-blocking transfers to the GPU.

import torch
from torch.utils.data import DataLoader

# pin_memory=True: batches are copied into pinned host memory
train_loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,
    pin_memory=True,          # allocate batches in pinned memory
    persistent_workers=True,  # keep worker processes alive between epochs
)

model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())

# Create a stream for compute and use non_blocking transfers
compute_stream = torch.cuda.Stream()

for batch_idx, (inputs, labels) in enumerate(train_loader):
    # Non-blocking transfer: starts H2D copy, CPU returns immediately
    inputs = inputs.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)

    # The transfers were issued on the current (default) stream, so the
    # compute stream must wait for them before it touches the tensors
    compute_stream.wait_stream(torch.cuda.current_stream())

    with torch.cuda.stream(compute_stream):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()

    # The optimizer runs on the default stream, so it must wait for the
    # backward pass that was launched on compute_stream
    torch.cuda.current_stream().wait_stream(compute_stream)
    optimizer.step()
    optimizer.zero_grad()

With num_workers=4, the DataLoader prefetches and prepares batches on CPU worker processes. With pin_memory=True, those batches land in pinned memory. With non_blocking=True, the H2D transfer starts while the previous batch is still being computed. The result is that data transfer latency is almost entirely hidden behind compute - the GPU is always fed.
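
If you want the transfer of batch N+1 to be explicitly issued on its own stream while batch N computes (rather than relying only on DataLoader prefetching), the usual pattern is a small prefetcher wrapper. The sketch below is illustrative, not a PyTorch API: the CUDAPrefetcher name and the assumption that the wrapped loader yields (inputs, labels) pairs in pinned memory are mine.

import torch

class CUDAPrefetcher:
    """Overlap the H2D copy of batch N+1 with the compute of batch N.

    Illustrative sketch. Assumes the wrapped loader yields (inputs, labels)
    CPU tensor pairs in pinned memory (pin_memory=True).
    """

    def __init__(self, loader):
        self.loader = iter(loader)
        self.copy_stream = torch.cuda.Stream()
        self.next_inputs = None
        self.next_labels = None
        self._preload()

    def _preload(self):
        try:
            self.next_inputs, self.next_labels = next(self.loader)
        except StopIteration:
            self.next_inputs = None
            self.next_labels = None
            return
        with torch.cuda.stream(self.copy_stream):
            # Issued on the dedicated copy stream, so it can overlap with
            # compute running on the default stream.
            self.next_inputs = self.next_inputs.cuda(non_blocking=True)
            self.next_labels = self.next_labels.cuda(non_blocking=True)

    def next(self):
        if self.next_inputs is None:
            return None, None
        # GPU-side dependency: the compute stream may not touch this batch
        # until the copy stream has finished transferring it.
        torch.cuda.current_stream().wait_stream(self.copy_stream)
        inputs, labels = self.next_inputs, self.next_labels
        # Mark the tensors as used on the compute stream so the caching
        # allocator does not hand their memory back to the copy stream early.
        inputs.record_stream(torch.cuda.current_stream())
        labels.record_stream(torch.cuda.current_stream())
        self._preload()  # immediately start copying the following batch
        return inputs, labels

The training loop then calls prefetcher.next() instead of iterating the DataLoader directly and stops when it returns (None, None); the copy for the following batch is already in flight on copy_stream while the current batch runs forward and backward. The wait_stream call is what makes the handoff safe, and record_stream keeps the caching allocator from recycling the tensors' memory too early.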

Stream Dependencies in PyTorch

PyTorch handles stream dependencies automatically when you use the standard tensor operations inside a torch.cuda.stream() context. However, when you need explicit cross-stream synchronization, you use torch.cuda.Event:

import torch

stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()

# Create a CUDA event
event = torch.cuda.Event()

# Stream A does some work
with torch.cuda.stream(stream_a):
    result_a = some_expensive_op(input_a)

# Record event when stream A reaches this point
event.record(stream_a)

# Stream B waits for stream A's event before proceeding
# This is a GPU-side wait - does not block the CPU
stream_b.wait_event(event)

with torch.cuda.stream(stream_b):
    # This will not execute until stream A has reached the event
    result_b = another_op(result_a)

Multiple GPU Streams for Inference Serving

In production inference serving, you often have multiple client requests arriving simultaneously. If each request launches its GPU kernel on the default stream, they serialize even though they are completely independent. Using a stream per request (or a pool of streams) allows concurrent execution:

import torch
from typing import List


class StreamedInferenceServer:
    def __init__(self, model, n_streams=4):
        self.model = model.cuda().eval()
        # Pool of streams - one per concurrent request slot
        self.streams = [torch.cuda.Stream() for _ in range(n_streams)]
        self.n_streams = n_streams

    def infer(self, request_batch: torch.Tensor, stream_idx: int) -> torch.Tensor:
        stream = self.streams[stream_idx % self.n_streams]

        with torch.cuda.stream(stream):
            # Non-blocking transfer if input is in pinned memory
            x = request_batch.cuda(non_blocking=True)
            with torch.no_grad():
                result = self.model(x)

        # Keep result on GPU until caller needs it
        return result

    def infer_batch_of_requests(self, requests: List[torch.Tensor]) -> List[torch.Tensor]:
        results = []

        # Launch all requests concurrently on different streams
        for i, req in enumerate(requests):
            stream_idx = i % self.n_streams
            result = self.infer(req, stream_idx)
            # Record event on the stream to know when this result is ready
            event = torch.cuda.Event()
            self.streams[stream_idx].record_event(event)
            results.append((result, event))

        # Wait for each result and collect
        outputs = []
        for result, event in results:
            event.synchronize()
            outputs.append(result.cpu())

        return outputs

In practice, inference serving systems like Triton Inference Server and vLLM use stream pools extensively. Each request gets a stream, kernels from different requests run concurrently on the GPU, and the server achieves much higher QPS (queries per second) than a serialized approach. The practical speedup depends on the model size relative to GPU compute capacity - small models on large GPUs see the greatest benefit from concurrency.


Stream Priorities

CUDA supports stream priorities for workloads where some operations are more latency-sensitive than others:

// Query the supported priority range
int min_priority, max_priority;
cudaDeviceGetStreamPriorityRange(&min_priority, &max_priority);
// Lower numbers mean higher priority: min_priority is 0 (lowest, the default),
// max_priority is a small negative number (highest)

// Create a high-priority stream for latency-sensitive work
cudaStream_t high_priority_stream;
cudaStreamCreateWithPriority(&high_priority_stream,
                             cudaStreamNonBlocking,
                             max_priority);

// Create a low-priority stream for background work
cudaStream_t low_priority_stream;
cudaStreamCreateWithPriority(&low_priority_stream,
                             cudaStreamNonBlocking,
                             min_priority);

In PyTorch:

# High priority stream - preempts low priority when both are ready
high_stream = torch.cuda.Stream(priority=-1)
# Low priority stream - runs in background
low_stream = torch.cuda.Stream(priority=0)

with torch.cuda.stream(high_stream):
    # Latency-sensitive forward pass for serving
    output = serving_model(request_input)

with torch.cuda.stream(low_stream):
    # Background work: parameter update, logging, etc.
    background_task()

Stream priorities are hints to the scheduler, not hard guarantees. On Ampere and later, the GPU respects priorities when thread blocks from different streams are competing for SM resources.


Practical Benchmark: Measuring Overlap Benefit

Here is a self-contained benchmark that measures the actual overlap benefit on your hardware:

import torch

def benchmark_sequential(n_batches=20, batch_size=1024*1024, n_warmup=5):
    """All operations on default stream - fully sequential."""
    h_data = torch.randn(batch_size).pin_memory()
    d_data = torch.empty(batch_size, device="cuda")
    d_out = torch.empty(batch_size, device="cuda")

    # Warmup
    for _ in range(n_warmup):
        d_data.copy_(h_data)
        d_out.copy_(d_data)  # simulate compute
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    stop = torch.cuda.Event(enable_timing=True)

    start.record()
    for _ in range(n_batches):
        d_data.copy_(h_data)                    # H2D
        d_out = torch.relu(d_data * 2.0 + 1.0)  # compute
        h_out = d_out.cpu()                     # D2H (blocking)
    stop.record()
    torch.cuda.synchronize()

    return start.elapsed_time(stop)  # ms


def benchmark_streamed(n_batches=20, batch_size=1024*1024, n_warmup=5):
    """Use two streams to overlap H2D transfer with compute."""
    n_streams = 2
    streams = [torch.cuda.Stream() for _ in range(n_streams)]

    h_data = [torch.randn(batch_size).pin_memory() for _ in range(n_batches)]
    d_data = [torch.empty(batch_size, device="cuda") for _ in range(n_streams)]
    d_out = [torch.empty(batch_size, device="cuda") for _ in range(n_streams)]

    # Warmup
    for _ in range(n_warmup):
        with torch.cuda.stream(streams[0]):
            d_data[0].copy_(h_data[0], non_blocking=True)
            _ = torch.relu(d_data[0] * 2.0 + 1.0)
    streams[0].synchronize()

    start = torch.cuda.Event(enable_timing=True)
    stop = torch.cuda.Event(enable_timing=True)

    start.record()
    for i in range(n_batches):
        s = i % n_streams
        with torch.cuda.stream(streams[s]):
            d_data[s].copy_(h_data[i], non_blocking=True)
            d_out[s] = torch.relu(d_data[s] * 2.0 + 1.0)

    for s in streams:
        s.synchronize()
    stop.record()
    torch.cuda.synchronize()

    return start.elapsed_time(stop)  # ms


if __name__ == "__main__":
    seq_ms = benchmark_sequential()
    str_ms = benchmark_streamed()
    print(f"Sequential: {seq_ms:.1f} ms")
    print(f"Streamed: {str_ms:.1f} ms")
    print(f"Speedup: {seq_ms / str_ms:.2f}x")

On a system where the batch fits entirely in GPU L2 (compute-light workload), the speedup is modest because compute finishes before the next transfer even starts. On systems where batch processing time exceeds the transfer time - typical for real training workloads - the speedup approaches the theoretical limit of (compute_time + transfer_time) / max(compute_time, transfer_time).
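
A small helper makes it easy to turn profiler numbers into a go/no-go estimate before adding stream complexity; the inputs below are illustrative.

def ideal_overlap_speedup(compute_ms: float, transfer_ms: float) -> float:
    """Upper bound on speedup from hiding transfers behind compute."""
    sequential = compute_ms + transfer_ms
    overlapped = max(compute_ms, transfer_ms)
    return sequential / overlapped

print(ideal_overlap_speedup(8.0, 3.0))    # 1.375 - transfers mostly hidden
print(ideal_overlap_speedup(3.0, 8.0))    # 1.375 - compute hidden behind transfers instead
print(ideal_overlap_speedup(10.0, 0.5))   # 1.05  - little to gain, probably skip streams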


Production Engineering Notes

Always profile before adding streams. Streams add code complexity. If your workload is already compute-bound and transfers are negligible, streams will not help. Nsight Systems will tell you within minutes whether transfer overlap would help.

Pinned memory has a cost. cudaMallocHost is slow and consumes non-pageable physical memory. Never pin large portions of your dataset. Pin only the staging buffers used in the transfer pipeline - typically a few hundred MB.

Stream context managers are your friend in PyTorch. The with torch.cuda.stream(s): context manager correctly handles all the stream-switching bookkeeping. Do not manually call torch.cuda.set_stream() unless you have a specific reason.

torch.cuda.synchronize() kills overlap. Every call to torch.cuda.synchronize() is a global barrier. It destroys any in-flight overlap on any stream. Check your code for implicit synchronizations: Python-side tensor inspections (print(tensor), tensor.item(), tensor.numpy()) all trigger synchronization. Never call these inside your training loop.
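
One common workaround is to keep the running loss on the GPU and pay the synchronization only once per logging interval. A sketch, reusing the model, criterion, optimizer, and train_loader names from the DataLoader example above; log_interval is an illustrative choice:

import torch

log_interval = 100
running_loss = torch.zeros(1, device="cuda")

for step, (inputs, labels) in enumerate(train_loader):
    inputs = inputs.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)

    outputs = model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    running_loss += loss.detach()  # stays on the GPU - no synchronization

    if (step + 1) % log_interval == 0:
        # One .item() (and one sync) per log_interval steps instead of per step
        avg = (running_loss / log_interval).item()
        print(f"step {step + 1}: avg loss {avg:.4f}")
        running_loss.zero_()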

non_blocking=True does not mean the transfer is instant. It means the CPU does not wait for the transfer to complete. The tensor is not usable on the GPU until the stream containing the transfer has processed it. PyTorch's stream ordering ensures that operations on the same stream see the completed transfer, but if you manually move tensors to a different stream without using events, you will read uninitialized data.
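
When you do need to hand a tensor from a copy stream to a compute stream, the safe pattern is an explicit GPU-side dependency plus a record_stream hint to the caching allocator; a minimal sketch:

import torch

copy_stream = torch.cuda.Stream()
x_pinned = torch.randn(1 << 20).pin_memory()

# Issue the transfer on a dedicated copy stream
with torch.cuda.stream(copy_stream):
    x_gpu = x_pinned.to("cuda", non_blocking=True)

# GPU-side dependency: the current (compute) stream may not touch x_gpu
# until the copy stream has finished the transfer. The CPU is not blocked.
torch.cuda.current_stream().wait_stream(copy_stream)

# Tell the caching allocator that x_gpu is now used on the current stream,
# so its memory is not reused while the copy stream could still own it.
x_gpu.record_stream(torch.cuda.current_stream())

y = x_gpu * 2.0  # safe: ordered after the copy by the dependency above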

Worker count in DataLoader. With streams, the DataLoader worker count becomes more important. If num_workers=0 (single-process), the main thread prepares each batch synchronously, limiting how far ahead you can prefetch. num_workers=4 or higher ensures the prefetch queue stays full.


Common Mistakes

:::danger Calling tensor.item() or print(tensor) Inside Training Loop Both of these operations pull a value from the GPU to the CPU, which requires synchronizing the GPU. This destroys any in-flight stream overlap and forces a global barrier. Profile your training loop with Nsight Systems and look for unexpected cudaDeviceSynchronize events - they are almost always caused by .item() calls, typically inside logging code. :::

:::danger Using Default Stream After Non-Default Stream Operations The default stream (stream 0) has special serialization semantics: it waits for ALL non-default streams before executing, and ALL non-default streams wait for it to drain. If you intersperse default-stream operations with non-default streams, you inadvertently serialize everything. Either use all non-default streams, or use cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) to create non-blocking streams that do not synchronize with the default stream. :::

:::warning Forgetting to Synchronize Before Accessing Results After launching operations on a non-default stream, the CPU returns immediately. If you then access the result tensor on the CPU without synchronizing, you will read garbage - the GPU has not finished writing the output yet. Always call stream.synchronize() or record an event and call event.synchronize() before transferring results back to the CPU. :::

:::warning Assuming D2H and H2D Can Always Overlap On hardware with a single DMA copy engine (rare on modern cards, but possible on some embedded GPU variants), H2D and D2H cannot overlap even in separate streams - both use the same hardware unit. On A100 and H100, there are multiple copy engines and bidirectional overlap works. Benchmark on your target hardware rather than assuming. :::

:::warning Creating Too Many Streams Each CUDA stream has hardware overhead. Creating thousands of streams (e.g., one per model layer) does not create more parallelism - the hardware has a finite number of concurrent execution slots. A typical production pattern uses 2-4 streams. More than 8 streams rarely helps and can introduce overhead from context switching in the hardware scheduler. :::


Interview Questions and Answers

Q1: What is a CUDA stream and how do multiple streams enable concurrent execution?

A CUDA stream is an ordered queue of GPU operations. Operations submitted to the same stream execute in submission order, one at a time. Operations submitted to different streams can execute concurrently if the hardware supports it and they access different resources.

The GPU has multiple independent execution units: shader cores for compute, DMA engines for memory transfers, NVLink engines for multi-GPU communication. These can operate simultaneously. When you put a kernel launch and a memory transfer in different streams, you are telling the hardware that these operations are independent. The GPU can then schedule the kernel on the shader cores and the transfer on the DMA engine simultaneously, hiding the transfer latency behind compute.

Without streams (everything on the default stream), every operation waits for the previous one to complete, regardless of whether they actually depend on each other. Streams are the mechanism by which you express independence to the hardware.

Q2: Sketch a double-buffering pipeline that overlaps data loading with kernel computation.

You need two device buffers and two streams. Stream 0 with buffer A handles even batches; stream 1 with buffer B handles odd batches.

For batch i, on stream (i mod 2):

  1. Async H2D copy of batch i into device buffer (i mod 2)
  2. Kernel launch reading from that buffer
  3. Async D2H copy of result back to host

Because stream 0 and stream 1 operate independently, while stream 0 is running its kernel on batch 0 (in buffer A), stream 1 can already be copying batch 1 into buffer B. When batch 0's kernel finishes and stream 0 starts copying its result back, stream 1 starts its kernel on buffer B. The transfers and computation interleave continuously.

The requirement is pinned host memory - async transfers silently become synchronous with pageable memory. The speedup is (compute_time + transfer_time) / max(compute_time, transfer_time). If compute takes 8ms and transfer takes 3ms, ideal speedup is 11ms / 8ms = 1.37x.

Q3: Why does cudaMemcpyAsync require pinned host memory for true asynchrony?

The GPU's DMA (Direct Memory Access) engine can only read from memory at a fixed physical address. Pageable memory managed by the OS can be moved or swapped to disk at any time, meaning its physical address is not stable. The DMA engine cannot handle this - it needs to read from a stable physical location for the full duration of the transfer.

For pageable memory, CUDA internally allocates a small pinned staging buffer, synchronously copies your data from pageable memory into the staging buffer (on the CPU), then starts the async DMA transfer from the staging buffer to the GPU. The synchronous CPU copy defeats the purpose of async - by the time cudaMemcpyAsync returns, the data is already in the staging buffer, but you paid a CPU memcpy to get it there.

With pinned memory (cudaMallocHost), the physical address is locked and the DMA engine can transfer directly without any staging. cudaMemcpyAsync returns immediately on the CPU and the transfer genuinely runs in the background while the CPU continues to other work.

Q4: What is the difference between cudaStreamSynchronize and cudaDeviceSynchronize?

cudaStreamSynchronize(stream) blocks the CPU until all operations previously submitted to that specific stream have completed. Other streams continue running during this wait.

cudaDeviceSynchronize() blocks the CPU until ALL operations on ALL streams have completed. It is a global barrier.

In production code with multiple streams, cudaDeviceSynchronize is almost never the right choice - it destroys all concurrency. Use cudaStreamSynchronize to wait only for the specific stream whose results you need, or use cudaEventSynchronize to wait for a specific point within a stream.

In PyTorch, stream.synchronize() maps to cudaStreamSynchronize and torch.cuda.synchronize() maps to cudaDeviceSynchronize. The global synchronize is sometimes necessary at the end of a training step before logging metrics, but never inside the hot path.

Q5: How does non_blocking=True in PyTorch interact with streams, and what can go wrong?

tensor.to(device, non_blocking=True) submits the H2D transfer to the current stream and returns immediately on the CPU. The tensor object exists on the GPU side, but the data may not have arrived yet. Any operation launched on the same stream after the non_blocking transfer is safe - the stream ordering guarantees it will see the completed data. Operations on a different stream may race with the transfer.

What goes wrong: if you call .item(), .numpy(), or print() on a tensor immediately after a non-blocking transfer, these operations force a synchronization to fetch the value, which serializes the pipeline and destroys the overlap you were trying to create. Another common bug is moving tensors between streams without using events - if you do a non-blocking transfer on stream A, then use the tensor in an operation on stream B without stream_b.wait_event(event_from_stream_a), stream B may start the operation before stream A's transfer completes.

The safe pattern: keep operations on a tensor on the same stream as the transfer, or explicitly synchronize with events when crossing streams.

Q6: When is it NOT worth using CUDA streams?

Streams help when the bottleneck is data transfer latency that can be hidden behind compute. They do not help in several scenarios:

  1. Compute-bound kernels: if your kernel runs for 100ms and your transfer takes 0.5ms, hiding the transfer saves a negligible 0.5%. Not worth the added complexity.

  2. Very small batches: if the kernel itself finishes faster than the transfer, the "overlap window" is tiny. Streams add synchronization overhead that may exceed the benefit.

  3. Multiple kernels that share memory: two kernels that both read and write the same device buffer cannot safely run concurrently. Putting them on different streams would produce data races.

  4. When profiling shows no transfer time: if Nsight Systems shows no PCIe transfers (e.g., data is already preloaded to GPU memory), there is nothing to overlap.

The decision should always be driven by profiling. If Nsight Systems shows significant gaps in GPU compute caused by H2D/D2H transfers, streams are the fix. If the GPU is busy continuously, adding streams complicates the code without any benefit.


Summary

CUDA streams are the API for expressing concurrency to the GPU. The default stream serializes everything. Non-default streams communicate independence - operations in different streams can run simultaneously on different hardware units, with the DMA engines handling transfers while shader cores execute kernels.

The double-buffering pattern is the most practical application: two device buffers ping-pong between two streams so that transfers for batch N+1 overlap with computation for batch N, eliminating GPU idle time and increasing throughput. In PyTorch, pin_memory=True in the DataLoader plus non_blocking=True on transfers implements this pattern with minimal code.

CUDA events provide precise GPU-side timing and cross-stream synchronization - they let you say "stream B should wait for a specific point in stream A" without a global barrier. Stream priorities let latency-sensitive work preempt background work when both are ready.

The rule of thumb: profile first with Nsight Systems to confirm that transfer time is significant. If it is, add streams. If it is not, do not add complexity without benefit. On real training workloads where the DataLoader is the bottleneck, streams consistently reduce end-to-end training time by 15-35%.
