
TensorRT and Inference Optimization

The Production Crisis at 3 AM

The model launched two weeks ago. Accuracy metrics were excellent in testing - 94.2% mAP on the validation set, clean ablations, solid engineering review. The team shipped with confidence. Then usage ramped.

At peak traffic, the serving cluster is running 47 A100 GPUs. Your PyTorch model, wrapped in TorchServe, is handling about 340 requests per second. The product manager sends a Slack message: the competitor just cut their latency by 60% and halved their API pricing. Your infrastructure bill is $180,000 per month and climbing. Someone is going to have to explain to the CTO why you need twice as many GPUs as they do to serve the same traffic.

You pull up the profiler. Each forward pass through your ResNet-based detection model takes 8.4 ms on an A100. The A100 has 312 TFLOPS of dense FP16 compute. Your model has roughly 25 billion floating-point operations per inference. That should complete in 0.08 ms at peak hardware efficiency. You are running at roughly 1% of theoretical hardware utilization. The remaining 99% is overhead - framework dispatch, unnecessary memory copies, redundant kernel launches for operations that could be fused, and a graph structure that PyTorch's eager execution has no way to optimize across operation boundaries.

This is the problem TensorRT was built to solve. Not the 1% case - the 99% overhead case. When NVIDIA engineers analyzed production inference workloads in the mid-2010s, they found that the computational graph produced by deep learning frameworks was wildly inefficient when mapped to actual GPU hardware. A convolution followed by batch normalization followed by a ReLU activation is three separate kernel launches, three round trips through memory, three synchronization points. On the GPU, those three operations could run as a single fused kernel - one pass through the data, one memory access pattern, one kernel launch overhead. TensorRT's entire existence is about collapsing that gap.

After running TensorRT compilation on your model, the same workload completes in 1.1 ms. You go from 340 requests per second to 2,600 requests per second on the same hardware. The infrastructure bill drops from $180,000 to $23,000 per month. The latency story now beats the competitor. This is not a small optimization - it is an order-of-magnitude improvement from taking the framework overhead seriously.

Why This Exists - The Framework Overhead Problem

PyTorch was designed for research. The goal was maximum flexibility: define any computation you want, debug it with standard Python tools, change the architecture in the middle of a training run without recompilation. Eager execution - where each operation runs immediately when called - was the right tradeoff for that use case. But eager execution has a fundamental problem for deployment: you cannot optimize across operation boundaries.

Consider what happens when PyTorch runs conv -> batchnorm -> relu:

  1. Launch convolution kernel, write output to GPU DRAM (roughly 2 TB/s of HBM bandwidth on an A100 - fast, but still a full off-chip round trip)
  2. Launch batch normalization kernel, read from DRAM, write back to DRAM
  3. Launch ReLU kernel, read from DRAM, write back to DRAM

Three kernel launches. Three full memory round trips. For a computation that is genuinely simple: multiply inputs by weights, normalize, clip negatives. A hand-written fused kernel does one memory read (the input), computes all three operations in registers while the data is still in SRAM/shared memory, and does one memory write (the output). The fused version is 2-6x faster on typical tensor sizes, not because the arithmetic is different, but because the memory access pattern is completely different.

The framework overhead problem compounds as models get deeper. A 50-layer ResNet has roughly 150 individual operations in the computation graph. Each one is a separate kernel launch. Each kernel launch has overhead: checking tensor metadata, dispatching to the correct implementation, scheduling on the CUDA stream. At small batch sizes - which is almost always the case in real-time inference - this overhead dominates.

TensorRT solves this by treating the computation graph as an optimization target rather than an execution sequence. You describe what you want to compute (the graph), and TensorRT figures out the fastest way to execute it on the specific piece of hardware you are deploying to.
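
To make that concrete, here is a minimal PyTorch sketch of the conv -> batchnorm -> relu block above. The torch.compile call is used purely to illustrate the general idea of handing a whole graph to a compiler that can optimize across op boundaries; it is not TensorRT, which does the same thing offline with its own kernel catalog.

import torch
import torch.nn as nn

# Eager execution dispatches one kernel per op: conv, then batchnorm, then relu
block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
).cuda().eval()

x = torch.randn(8, 64, 56, 56, device="cuda")

with torch.no_grad():
    y_eager = block(x)              # three launches, three DRAM round trips

    # A graph compiler sees all three ops at once and can fuse across boundaries
    compiled_block = torch.compile(block)
    y_compiled = compiled_block(x)

print(torch.allclose(y_eager, y_compiled, atol=1e-4))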

Historical Context - From cuDNN to TensorRT

NVIDIA released cuDNN in 2014 as a library of hand-optimized CUDA kernels for deep learning primitives - convolutions, pooling, normalization, recurrent operations. This was a major improvement over naive CUDA implementations, but it only solved the per-kernel efficiency problem. The cross-kernel optimization problem remained.

The TensorRT project started internally at NVIDIA around 2015. The driving observation was simple: production inference workloads are not research workloads. In production, the model architecture does not change. The input tensor shape is often fixed (or at least constrained). The hardware is known. Given all those constraints, you can afford to do expensive offline compilation work - benchmarking hundreds of kernel implementations, measuring actual execution times on the target hardware, fusing operations together - and amortize that compilation cost across millions of inference calls.

TensorRT 1.0 shipped publicly in 2016. It handled convolution-heavy computer vision models and provided the first systematic approach to layer fusion and INT8 quantization in a production-ready library. The original target was autonomous driving inference - Tesla, Waymo, and others needed to run complex perception networks at 30+ fps on embedded GPUs with tight power budgets.

The "aha moment" that shaped TensorRT's design came from analyzing what operations actually appear together in production models. Conv followed by batch norm followed by activation is not a coincidence - it is an architectural pattern used in nearly every ResNet, EfficientNet, MobileNet, and their derivatives. If you have a library that specifically knows about CBR (Conv-BN-ReLU) fusion, you immediately accelerate the majority of deployed vision models. The lesson generalized: most deep learning models use a small vocabulary of operation patterns, and optimizing those patterns provides enormous leverage.

TensorRT 8.x (2021-2022) added dynamic shape support, improved transformer support, and INT8/FP16 calibration improvements that made it practical for NLP workloads. TensorRT-LLM, released in late 2023, extended the framework specifically for autoregressive LLM inference with native support for paged KV caches, in-flight batching, and multi-GPU tensor parallelism.

Core Concepts

The TensorRT Compilation Pipeline

TensorRT takes a model graph as input and produces an optimized execution engine. The pipeline has four distinct phases:

Phase 1: Graph Import TensorRT primarily ingests models through the ONNX format. PyTorch, TensorFlow, JAX, and most other frameworks can export to ONNX, making it a universal intermediate representation. The import step parses the ONNX graph and builds TensorRT's internal graph representation (an INetworkDefinition).

Phase 2: Graph Optimization This is where TensorRT earns its performance. The optimizer applies a series of graph transformations:

  • Dead code elimination: remove operations whose outputs are never used
  • Constant folding: pre-compute operations whose inputs are compile-time constants
  • Layer fusion: identify fusable operation sequences and replace them with single optimized kernels
  • Layout optimization: choose optimal tensor memory layouts (NCHW vs NHWC vs NCHW32 etc.) for each operation
  • Precision selection: decide which operations run in FP32 vs FP16 vs INT8 based on the precision configuration

Phase 3: Kernel Auto-Tuning For each fused subgraph, TensorRT maintains a catalog of candidate kernel implementations. For a matrix multiplication, there might be 50+ different CUDA kernel variants - different tile sizes, different memory access patterns, different degrees of loop unrolling. TensorRT runs a benchmarking loop: it literally executes each candidate kernel with representative input tensors and measures the actual wall-clock time on the target GPU. The fastest implementation wins and gets compiled into the engine.

This is the step that makes TensorRT non-portable: an engine compiled for an A100 will not be optimal on a T4, and may not even be loadable on an H100 with different CUDA compute capability. The auto-tuning step can also take a long time - from minutes to hours for large transformer models - because it is exhaustively benchmarking potentially thousands of kernel variants.

Phase 4: Engine Serialization The optimized engine is serialized to a binary .engine file. At runtime, this file is loaded and executed directly. No Python interpreter, no PyTorch overhead, no graph traversal - just the pre-compiled kernel sequence.

Layer Fusion - The Core Optimization

Layer fusion is what makes TensorRT dramatically faster than eager execution. The principle is straightforward: when the output of operation A is only used by operation B, and the combined computation can fit in GPU registers/shared memory without going to DRAM, run them as a single kernel.

The math of why this matters: an A100 has 40 MB of L2 cache and 80 GB of HBM2e DRAM. Bandwidth to L2 is several times higher than to DRAM, and registers and shared memory are faster still; DRAM bandwidth is roughly 2 TB/s. Loading a 10 MB activation tensor, processing it, writing it back to DRAM, then loading it again for the next operation spends that scarce DRAM bandwidth on a round trip that should never happen.
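
A quick back-of-the-envelope sketch of what one such round trip costs, using the approximate A100 figures above (the kernel-launch overhead is an assumed ballpark, not a measured number):

# Rough cost of one avoidable DRAM round trip for a 10 MB activation tensor
tensor_bytes = 10e6
dram_bandwidth = 2e12          # ~2 TB/s HBM bandwidth (A100)
launch_overhead_us = 5         # assumed few-microsecond launch cost per kernel

round_trip_us = 2 * tensor_bytes / dram_bandwidth * 1e6   # write out + read back
print(f"~{round_trip_us:.0f} us of memory traffic plus ~{launch_overhead_us} us launch overhead")
# Repeated across ~150 ops in a ResNet-50-sized graph, this pure overhead reaches
# the millisecond range - comparable to or larger than the useful arithmetic.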

The most common fusion patterns:

CBR Fusion (Conv + BN + ReLU)

Batch normalization applies a learned scale and shift per channel: $\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta$

Since $\mu$, $\sigma^2$, $\gamma$, and $\beta$ are all constants at inference time (frozen from training), the entire batch norm collapses to a per-channel linear transform: $\hat{x} = x \cdot \alpha + \beta'$, where $\alpha = \gamma / \sqrt{\sigma^2 + \epsilon}$ and $\beta' = \beta - \mu \alpha$.

This linear transform can be absorbed directly into the convolution weights: multiply the convolution kernel's output scale by α\alpha and add β\beta' to the bias. The result is a single convolution operation with modified weights - no batch norm kernel needed at all. TensorRT performs this weight absorption at compile time, eliminating BN entirely from the runtime graph.

Then ReLU - which is just max(0, x) - gets added as a flag on the convolution kernel. Modern cuDNN convolution kernels accept an activation parameter. The entire CBR block becomes one kernel launch.
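
A minimal NumPy sketch of the weight absorption described above - TensorRT performs the equivalent transformation internally at build time; this only shows the algebra:

import numpy as np

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold frozen batch-norm parameters into conv weights.
    W: (out_ch, in_ch, kh, kw); b and all BN params: (out_ch,)."""
    alpha = gamma / np.sqrt(var + eps)          # per-channel scale
    W_folded = W * alpha[:, None, None, None]   # scale each output channel's filter
    b_folded = (b - mean) * alpha + beta        # fold mean/shift into the bias
    return W_folded, b_folded

# Running the conv with (W_folded, b_folded) now computes conv + BN in one op;
# ReLU is then applied as an activation flag on the same kernel.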

Bias + GELU / Bias + SiLU Fusion

Transformer models use patterns like linear -> bias add -> GELU frequently (in FFN blocks). These fuse into a single kernel that computes the GELU activation directly from the matmul output without a separate memory round trip.

Attention Fusion (FlashAttention integration)

TensorRT 8.6+ integrates FlashAttention kernels that fuse the QK matmul, softmax, and AV matmul into a single kernel that never materializes the full O(n^2) attention matrix, keeping extra memory linear in the sequence length.

Precision Calibration - INT8 and FP8

Running inference in INT8 rather than FP16 or FP32 provides two benefits: tensors are 2x smaller than FP16 (4x smaller than FP32), which improves memory bandwidth utilization, and integer arithmetic on modern GPUs is 2x faster than FP16 (on A100: 312 TFLOPS FP16 vs 624 TOPS INT8).

The problem is that neural network weights and activations are floating-point. Converting to INT8 requires deciding how to map the continuous floating-point range onto 256 discrete integer values. The mapping is $x_{\text{int8}} = \text{round}(x_{\text{float32}} / \text{scale})$, where $\text{scale}$ is a calibration constant.

TensorRT performs INT8 calibration in three steps:

  1. Collect activation statistics: Run the model on a representative calibration dataset (typically 100-1000 samples from the production distribution). For each activation tensor, record the range of values observed.

  2. Compute scales: For each tensor, find the scale that minimizes quantization error. TensorRT supports several calibration algorithms:

    • MinMax calibration: $\text{scale} = \max(|x_{min}|, |x_{max}|) / 127$ (see the sketch after this list)
    • Entropy calibration (KL divergence): Find the scale that minimizes information loss between the FP32 and INT8 distributions
    • Percentile calibration: Clip at the 99.9th or 99.99th percentile to handle outliers
  3. Apply per-tensor or per-channel scales: Weights use per-channel quantization (better accuracy). Activations use per-tensor quantization (simpler hardware).
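
A small NumPy sketch of how the MinMax and percentile scales can be computed and what the resulting quantization error looks like. It mirrors the idea only - this is not TensorRT's internal calibrator:

import numpy as np

def minmax_scale(x: np.ndarray) -> float:
    # Symmetric MinMax: map the largest absolute value to the int8 limit 127
    return max(abs(x.min()), abs(x.max())) / 127.0

def percentile_scale(x: np.ndarray, pct: float = 99.99) -> float:
    # Clip outliers: use a high percentile of |x| instead of the absolute max
    return np.percentile(np.abs(x), pct) / 127.0

def quantize_dequantize(x: np.ndarray, scale: float) -> np.ndarray:
    # Simulate the INT8 round trip to inspect quantization error
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q.astype(np.float32) * scale

# Activation tensor with a few large outliers (common in transformer FFNs)
act = np.random.randn(10000).astype(np.float32)
act[:5] *= 30.0

for name, scale in [("minmax", minmax_scale(act)), ("p99.99", percentile_scale(act))]:
    err = np.abs(act - quantize_dequantize(act, scale)).mean()
    print(f"{name}: scale={scale:.4f}, mean abs error={err:.5f}")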

The accuracy loss from INT8 calibration on well-calibrated models is typically under 0.5% relative to FP16. On models with large activation outliers (which is common in large language models - this is precisely why LLM.int8() uses mixed-precision decomposition), accuracy can drop more significantly.

FP8 (introduced with NVIDIA H100) provides a middle ground: float format with 8 bits, giving ~1% accuracy loss compared to FP16 at hardware speeds matching INT8. TensorRT-LLM makes heavy use of FP8 for H100 deployments.

Dynamic Shapes

One historical limitation of TensorRT was that engines were compiled for a fixed input shape. A model compiled for batch=1, seq_len=512 would not work for batch=4, seq_len=128. This made it impractical for production serving where request shapes vary.

TensorRT 6+ introduced Optimization Profiles for dynamic shapes. You specify:

  • Minimum shape: the smallest inputs the engine should handle
  • Optimal shape: the shape to optimize for (used for auto-tuning benchmarks)
  • Maximum shape: the largest inputs the engine should handle

TensorRT compiles separate kernel selections for the range of shapes. At runtime, the engine checks which profile to use and selects the appropriate pre-tuned kernel variant.

The tradeoff: dynamic shape engines have larger compilation artifacts and slightly lower peak performance (because auto-tuning cannot perfectly optimize for every possible shape). For LLM serving, where sequence length varies dramatically, using dynamic shapes with well-chosen profiles is essential.
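
At runtime, the execution context has to be told which concrete shape the current request uses before the engine is launched. A sketch using the TensorRT Python API - the older binding-index calls are used here for consistency with the examples below; newer releases expose set_input_shape instead:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with open("model.engine", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())

context = engine.create_execution_context()

# Pick the concrete shape for this request; it must fall inside the
# [min, max] range of the optimization profile the engine was built with.
context.set_binding_shape(0, (4, 3, 224, 224))
assert context.all_binding_shapes_specified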

Code Examples

Basic PyTorch to TensorRT Conversion

import torch
import tensorrt as trt
import numpy as np
from pathlib import Path

# Step 1: Export PyTorch model to ONNX
model = MyProductionModel().cuda().half().eval()

dummy_input = torch.randn(1, 3, 224, 224, device="cuda", dtype=torch.float16)

# Export with dynamic batch size
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=17,
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
    do_constant_folding=True,  # Let ONNX fold constants before TRT sees it
)
print("ONNX export complete")

# Step 2: Build TensorRT engine
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path: str, engine_path: str, use_fp16: bool = True):
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(
             1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
         ) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:

        config = builder.create_builder_config()

        # Memory pool: how much workspace TensorRT can use during auto-tuning
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 * (1 << 30))  # 4 GB

        # Enable FP16 if supported
        if use_fp16 and builder.platform_has_fast_fp16:
            config.set_flag(trt.BuilderFlag.FP16)
            print("FP16 enabled")

        # Parse ONNX
        with open(onnx_path, "rb") as f:
            if not parser.parse(f.read()):
                for error in range(parser.num_errors):
                    print(f"ONNX parse error: {parser.get_error(error)}")
                raise RuntimeError("Failed to parse ONNX")

        # Dynamic shape profile
        profile = builder.create_optimization_profile()
        profile.set_shape(
            "input",
            min=(1, 3, 224, 224),
            opt=(8, 3, 224, 224),   # optimize for batch=8
            max=(32, 3, 224, 224),
        )
        config.add_optimization_profile(profile)

        # Build - this is the slow step (auto-tuning runs here)
        print("Building TensorRT engine (this may take several minutes)...")
        serialized_engine = builder.build_serialized_network(network, config)

        if serialized_engine is None:
            raise RuntimeError("Engine build failed")

        # Save to disk
        with open(engine_path, "wb") as f:
            f.write(serialized_engine)

        print(f"Engine saved to {engine_path}")
        return serialized_engine

build_engine("model.onnx", "model.engine", use_fp16=True)

Running Inference with a TensorRT Engine

import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit # noqa: initializes CUDA context

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

class TRTInferenceEngine:
    """Production-ready TensorRT inference wrapper."""

    def __init__(self, engine_path: str):
        with open(engine_path, "rb") as f:
            runtime = trt.Runtime(TRT_LOGGER)
            self.engine = runtime.deserialize_cuda_engine(f.read())

        self.context = self.engine.create_execution_context()
        self._allocate_buffers()

    def _allocate_buffers(self):
        """Pre-allocate pinned host memory and device memory."""
        self.inputs = []
        self.outputs = []
        self.bindings = []
        self.stream = cuda.Stream()

        for binding in self.engine:
            binding_idx = self.engine.get_binding_index(binding)
            # Note: for dynamic-shape engines, get_binding_shape returns -1 for
            # dynamic dimensions; size the buffers for the profile's max shape instead.
            shape = self.engine.get_binding_shape(binding_idx)
            dtype = trt.nptype(self.engine.get_binding_dtype(binding_idx))

            # Pinned host memory for fast DMA transfers
            host_mem = cuda.pagelocked_empty(trt.volume(shape), dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            self.bindings.append(int(device_mem))

            if self.engine.binding_is_input(binding_idx):
                self.inputs.append({"host": host_mem, "device": device_mem})
            else:
                self.outputs.append({"host": host_mem, "device": device_mem})

    def infer(self, input_array: np.ndarray) -> np.ndarray:
        """Run inference. input_array should be np.float16."""
        # Copy input to pinned memory
        np.copyto(self.inputs[0]["host"], input_array.ravel())

        # Transfer input to GPU
        cuda.memcpy_htod_async(
            self.inputs[0]["device"],
            self.inputs[0]["host"],
            self.stream,
        )

        # Execute
        self.context.execute_async_v2(
            bindings=self.bindings,
            stream_handle=self.stream.handle,
        )

        # Transfer output back
        cuda.memcpy_dtoh_async(
            self.outputs[0]["host"],
            self.outputs[0]["device"],
            self.stream,
        )

        self.stream.synchronize()
        return self.outputs[0]["host"].copy()



# Usage
engine = TRTInferenceEngine("model.engine")
input_data = np.random.randn(8, 3, 224, 224).astype(np.float16)
output = engine.infer(input_data)
print(f"Output shape: {output.shape}, dtype: {output.dtype}")

INT8 Calibration

import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: initializes CUDA context
from pathlib import Path
from torch.utils.data import DataLoader

class ImageCalibrator(trt.IInt8EntropyCalibrator2):
    """
    TensorRT INT8 calibrator using entropy minimization.
    Feed ~500-1000 representative samples from the production distribution.
    """

    def __init__(self, calibration_loader: DataLoader, cache_file: str = "calibration.cache"):
        super().__init__()
        self.loader = iter(calibration_loader)
        self.cache_file = cache_file
        self.batch_size = calibration_loader.batch_size

        # Pre-allocate device buffer for one FP32 NCHW 224x224 calibration batch
        self.device_input = cuda.mem_alloc(
            self.batch_size * 3 * 224 * 224 * np.dtype(np.float32).itemsize
        )

    def get_batch_size(self) -> int:
        return self.batch_size

    def get_batch(self, names):
        """Called by TensorRT for each calibration batch."""
        try:
            batch, _ = next(self.loader)
            batch_np = batch.numpy().astype(np.float32)
            cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch_np))
            return [int(self.device_input)]
        except StopIteration:
            return None

    def read_calibration_cache(self):
        """Return cached scales if available."""
        if Path(self.cache_file).exists():
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache: bytes):
        """Save computed scales for reuse."""
        with open(self.cache_file, "wb") as f:
            f.write(cache)
        print(f"Calibration cache written to {self.cache_file}")


def build_int8_engine(onnx_path: str, calibrator, engine_path: str):
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:

        config = builder.create_builder_config()
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 * (1 << 30))
        config.set_flag(trt.BuilderFlag.INT8)
        config.int8_calibrator = calibrator

        with open(onnx_path, "rb") as f:
            if not parser.parse(f.read()):
                raise RuntimeError("Failed to parse ONNX")

        serialized = builder.build_serialized_network(network, config)
        with open(engine_path, "wb") as f:
            f.write(serialized)
        print(f"INT8 engine saved to {engine_path}")

torch2trt - Simpler Conversion Path

# torch2trt wraps the TensorRT API in a cleaner interface
# Install: pip install git+https://github.com/NVIDIA-AI-IOT/torch2trt
import torch
from torch2trt import torch2trt

model = MyModel().cuda().half().eval()
dummy = torch.ones(1, 3, 224, 224, device="cuda", dtype=torch.float16)

# Compile to TensorRT - equivalent to the manual pipeline above
model_trt = torch2trt(
    model,
    [dummy],
    fp16_mode=True,
    max_batch_size=32,
    max_workspace_size=1 << 32,  # 4 GB
)

# Verify accuracy
with torch.no_grad():
    output_pytorch = model(dummy)
    output_trt = model_trt(dummy)

max_error = (output_pytorch - output_trt).abs().max().item()
print(f"Max error vs PyTorch: {max_error:.6f}")  # Should be < 0.01 for FP16

# Save and reload
torch.save(model_trt.state_dict(), "model_trt.pth")

Benchmarking the Speedup

import time
import torch
import numpy as np

def benchmark(model_fn, input_tensor, n_warmup=50, n_runs=500):
    """Measure inference latency with proper GPU timing."""
    # Warmup
    for _ in range(n_warmup):
        _ = model_fn(input_tensor)
    torch.cuda.synchronize()

    # Time with CUDA events for accuracy
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    latencies = []
    for _ in range(n_runs):
        start_event.record()
        _ = model_fn(input_tensor)
        end_event.record()
        torch.cuda.synchronize()
        latencies.append(start_event.elapsed_time(end_event))  # ms

    latencies = np.array(latencies)
    print(f"  Mean latency: {latencies.mean():.2f} ms")
    print(f"  P50 latency:  {np.percentile(latencies, 50):.2f} ms")
    print(f"  P99 latency:  {np.percentile(latencies, 99):.2f} ms")
    print(f"  Throughput:   {1000 / latencies.mean():.1f} inferences/sec")

input_fp16 = torch.randn(8, 3, 224, 224, device="cuda", dtype=torch.float16)

print("PyTorch FP16 eager:")
benchmark(lambda x: model(x), input_fp16)

print("\nTensorRT FP16:")
benchmark(lambda x: model_trt(x), input_fp16)

TensorRT-LLM for Large Language Models

# TensorRT-LLM uses a higher-level API designed for autoregressive LLMs
# This example shows the workflow for converting LLaMA-2 to TRT-LLM format

# Step 1: Convert weights to TRT-LLM format (run from trt-llm repo)
# python examples/llama/convert_checkpoint.py \
#     --model_dir /path/to/llama-2-7b-hf \
#     --output_dir ./trt_llama_checkpoints \
#     --dtype float16

# Step 2: Build the TRT-LLM engine
# python examples/llama/build.py \
#     --checkpoint_dir ./trt_llama_checkpoints \
#     --output_dir ./trt_llama_engine \
#     --gemm_plugin float16 \
#     --max_batch_size 8 \
#     --max_input_len 2048 \
#     --max_output_len 1024 \
#     --use_gpt_attention_plugin float16 \
#     --paged_kv_cache \
#     --remove_input_padding

# Step 3: Run inference with in-flight batching
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner, SamplingConfig

runner = ModelRunner.from_dir(
    engine_dir="./trt_llama_engine",
    rank=0,
)

sampling_config = SamplingConfig(
    end_id=2,   # EOS token
    pad_id=2,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
)

# Assumes a matching HuggingFace tokenizer has already been loaded, e.g.
# tokenizer = AutoTokenizer.from_pretrained("/path/to/llama-2-7b-hf")
input_ids = tokenizer.encode("Explain transformer attention:", return_tensors="pt")
outputs = runner.generate(
    batch_input_ids=[input_ids[0]],
    sampling_config=sampling_config,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Architecture Diagrams

[Diagram: Layer Fusion - Before and After]

[Diagram: TensorRT-LLM Architecture]

Triton Inference Server Integration

NVIDIA's Triton Inference Server provides a production-grade serving layer on top of TensorRT engines. It adds:

  • Dynamic batching: group incoming requests into batches automatically
  • Model ensembles: chain multiple models (e.g., preprocessor + backbone + postprocessor)
  • gRPC and HTTP/REST endpoints
  • Prometheus metrics: throughput, latency percentiles, queue depth
  • Multiple backend support: TensorRT, ONNX Runtime, PyTorch, TensorFlow
# Directory structure for a TensorRT model in Triton
models/
  resnet50_trt/
    config.pbtxt          # Model configuration
    1/
      model.plan          # Serialized TensorRT engine

# config.pbtxt
cat > models/resnet50_trt/config.pbtxt << 'EOF'
name: "resnet50_trt"
backend: "tensorrt"
max_batch_size: 32

input [
  {
    name: "input"
    data_type: TYPE_FP16
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP16
    dims: [1000]
  }
]

dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 1000
}

instance_group [
  {
    kind: KIND_GPU
    count: 1
    gpus: [0]
  }
]
EOF

# Start Triton server
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/models:/models \
  nvcr.io/nvidia/tritonserver:24.01-py3 \
  tritonserver --model-repository=/models

# Check model status
curl -s http://localhost:8000/v2/models/resnet50_trt/ready
# Client-side inference via Triton HTTP API
import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(url="localhost:8000")

# Prepare input
input_data = np.random.randn(8, 3, 224, 224).astype(np.float16)
inputs = [httpclient.InferInput("input", input_data.shape, "FP16")]
inputs[0].set_data_from_numpy(input_data)

outputs = [httpclient.InferRequestedOutput("output")]

# Synchronous inference
response = client.infer("resnet50_trt", inputs, outputs=outputs)
result = response.as_numpy("output")
print(f"Output shape: {result.shape}")

Production Engineering Notes

Engine Portability and Versioning

TensorRT engines are not portable across different CUDA or cuDNN versions. An engine compiled with TensorRT 8.6 on CUDA 11.8 will fail to load with TensorRT 9.0 or CUDA 12.1. This has important operational implications:

  • Always store the ONNX file alongside the engine file - it is the portable artifact
  • Tag engine files with TRT version, CUDA version, and GPU architecture: model_trt86_cuda118_sm80.engine
  • Build engines as part of the container build process, not separately
  • Pin base images to specific NVIDIA container versions: nvcr.io/nvidia/tensorrt:23.10-py3
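
One way to generate such a tag automatically at build time - a small sketch; the naming scheme itself is just a convention:

import tensorrt as trt
import torch

def engine_filename(model_name: str) -> str:
    # Encode the TensorRT version and GPU compute capability into the artifact
    # name, e.g. "resnet50_trt8.6.1_sm80.engine"
    major, minor = torch.cuda.get_device_capability()
    return f"{model_name}_trt{trt.__version__}_sm{major}{minor}.engine"

print(engine_filename("resnet50"))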

Build Time Management

TensorRT engine compilation can take 10-90 minutes for large models due to kernel auto-tuning. Strategies to manage this:

  1. Use the calibration cache to skip INT8 calibration after the first run
  2. Use a timing cache (--timingCacheFile in trtexec, or the timing-cache API on the builder config) to reuse tuning results across builds with similar layer shapes
  3. Build engines on representative hardware (same GPU SKU as production), not on development machines
  4. Cache engines in CI/CD artifacts keyed by model hash + TRT version + GPU arch
  5. For dynamic shapes, minimize the shape range - wider ranges force more auto-tuning
# Using timing cache to speed up subsequent builds
trtexec --onnx=model.onnx \
    --saveEngine=model.engine \
    --fp16 \
    --timingCacheFile=timing.cache \
    --minShapes=input:1x3x224x224 \
    --optShapes=input:8x3x224x224 \
    --maxShapes=input:32x3x224x224 \
    --workspace=4096   # workspace size in MB
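
The same timing cache is available from the Python builder API. A sketch assuming TensorRT 8.x - it would slot into the build_engine function shown earlier, before and after the build_serialized_network call:

import os
import tensorrt as trt

def attach_timing_cache(config: trt.IBuilderConfig, path: str = "timing.cache"):
    # Reuse a previous cache if one exists, otherwise start empty
    blob = open(path, "rb").read() if os.path.exists(path) else b""
    cache = config.create_timing_cache(blob)
    config.set_timing_cache(cache, ignore_mismatch=False)

def save_timing_cache(config: trt.IBuilderConfig, path: str = "timing.cache"):
    # Persist tuning results so the next build can skip re-benchmarking
    with open(path, "wb") as f:
        f.write(memoryview(config.get_timing_cache().serialize()))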

Accuracy Validation Protocol

Always validate TRT engine accuracy against the PyTorch baseline before deploying. Use a statistical test, not just a single sample:

def validate_trt_accuracy(pytorch_model, trt_engine, val_loader, tolerance=0.005):
    """
    Validate TRT engine accuracy against PyTorch.
    tolerance: maximum allowed relative accuracy drop.
    """
    pytorch_correct = 0
    trt_correct = 0
    total = 0

    for images, labels in val_loader:
        images_gpu = images.cuda().half()
        labels_np = labels.numpy()

        # PyTorch inference
        with torch.no_grad():
            pt_out = pytorch_model(images_gpu).argmax(dim=1).cpu().numpy()

        # TRT inference
        trt_out = trt_engine.infer(images_gpu.cpu().numpy())
        trt_preds = trt_out.reshape(len(labels), -1).argmax(axis=1)

        pytorch_correct += (pt_out == labels_np).sum()
        trt_correct += (trt_preds == labels_np).sum()
        total += len(labels)

    pt_acc = pytorch_correct / total
    trt_acc = trt_correct / total
    relative_drop = (pt_acc - trt_acc) / pt_acc

    print(f"PyTorch accuracy: {pt_acc:.4f}")
    print(f"TRT accuracy:     {trt_acc:.4f}")
    print(f"Relative drop:    {relative_drop:.4f} (tolerance: {tolerance})")

    if relative_drop > tolerance:
        raise ValueError(f"TRT accuracy drop {relative_drop:.4f} exceeds tolerance {tolerance}")
    return True

Unsupported Operations

TensorRT does not support every ONNX operation natively. When the parser encounters an unsupported op, it falls back to a plugin (if available) or fails the build. Common unsupported operations:

  • Custom Python autograd functions (obviously)
  • Dynamic control flow (if/while based on tensor values)
  • Some recent ONNX ops not yet implemented in TRT

When you hit an unsupported op:

  1. Check if a TRT plugin exists (NVIDIA maintains a plugin library)
  2. Implement a custom TRT plugin in CUDA
  3. Restructure the model to avoid the op (often possible)
  4. Fall back to ONNX Runtime for that subgraph and use TRT for the rest
# Use polygraphy to check which ops TensorRT supports before building
pip install polygraphy
polygraphy inspect capability model.onnx

Common Mistakes

:::danger Benchmarking Without GPU Warmup

TensorRT engines require a warmup period before reaching peak performance. The first few inferences load kernel code into the GPU instruction cache and trigger JIT compilation of any remaining runtime-compiled components. If you benchmark without warmup:

# WRONG - first inference is 3-10x slower due to cold cache
start = time.time()
output = engine.infer(input_data)
print(f"Latency: {(time.time() - start)*1000:.1f} ms") # Will show 50ms instead of 1ms

# CORRECT - always warmup
for _ in range(50):
    engine.infer(input_data)
torch.cuda.synchronize()
# Now benchmark

This is the single most common mistake in TensorRT benchmarking and leads to completely wrong conclusions about performance. :::

:::danger Building Engines on Development Hardware

TensorRT auto-tuning benchmarks kernel implementations on the actual GPU running the build. If you build on a workstation with a 3090 Ti and deploy to production A100s, the engine is optimized for the wrong hardware. Worse, the engine may refuse to load if the GPU compute capability is incompatible.

Always build TRT engines on the same GPU SKU (or same compute capability) as production. In CI/CD, use GPU-enabled build agents with the same instance type as your serving fleet. :::

:::warning INT8 Calibration Dataset Quality

INT8 calibration quality depends entirely on the calibration dataset being representative of production inputs. If your production data has different statistical properties than the calibration set - different image distributions, different text domains, different input scales - the learned quantization scales will be wrong and accuracy will suffer.

Use at least 500 samples from the actual production distribution. Never use the training set for calibration - use a held-out production sample. Monitor INT8 accuracy continuously in production, not just at deployment time. :::

:::warning Dynamic Shape Engine Build Times

Engines with wide dynamic shape ranges take much longer to build because TensorRT auto-tunes for multiple shape configurations. A model with min_seq_len=1, max_seq_len=4096 may take 4x longer to build than one with min=128, max=512. Narrow your shape ranges to realistic production bounds, and use multiple optimization profiles if needed (e.g., one profile for short sequences and one for long ones). :::

Interview Questions and Answers

Q1: Explain layer fusion in TensorRT. Why does it improve performance, and what is the most common fusion pattern you would find in a production vision model?

Layer fusion improves performance by reducing the number of memory round trips between the GPU's compute units and DRAM. DRAM bandwidth is 1-2 TB/s on high-end GPUs, which sounds fast, but is a bottleneck when you are repeatedly writing intermediate activation tensors to DRAM and reading them back for the next operation. Fused kernels keep intermediate results in registers and shared memory, which operate at 10-100x higher bandwidth.

The most common fusion in production vision models is Conv-BN-ReLU (CBR). Batch normalization during inference reduces to a per-channel linear transform (since running mean/variance are frozen). TensorRT absorbs this into the convolution weights at compile time, eliminating the BN kernel entirely. The ReLU is applied as a flag on the convolution kernel - modern cuDNN convolutions accept an activation parameter. The result is a single kernel with one memory read (input) and one memory write (output) instead of three kernel launches and three DRAM round trips.

Q2: What is kernel auto-tuning and why does it make TensorRT engines non-portable?

Kernel auto-tuning is TensorRT's process of benchmarking multiple candidate CUDA kernel implementations for each operation and selecting the fastest one for the target hardware. For a matrix multiplication, there might be 50+ kernel variants with different tile sizes (e.g., 128x128 vs 64x256), different memory prefetch strategies, different pipeline depths. TensorRT literally executes each candidate on the target GPU and measures wall-clock time. The winner is compiled into the engine.

This makes engines non-portable because the "winning" kernel for each operation depends on the specific GPU microarchitecture, CUDA version, and hardware characteristics. An A100 has different warp schedulers, different cache sizes, and different Tensor Core configurations than a T4 or H100. The optimal tile size for a matrix multiplication on an A100 is different from the optimal tile size on a T4. Additionally, CUDA kernel code is compiled to PTX and then to SASS (machine code) for a specific compute capability - running an A100 engine (sm_80) on an H100 (sm_90) may fail or produce incorrect results.

Q3: You are converting a transformer model to TensorRT and getting accuracy errors in INT8 mode. Walk through how you would diagnose this.

Start by confirming the FP16 engine is accurate first - if FP16 has errors, the problem is in the graph conversion, not quantization. Assuming FP16 is fine, the INT8 issue is almost certainly the quantization scales.

First, check whether you have a representative calibration dataset. Transformers are particularly sensitive to INT8 because attention scores and certain FFN activations have large outlier values - the softmax input can have values like +20 or -20 even though most values are small. A calibration dataset from the wrong domain will badly miscalibrate these.

Second, try switching from MinMax calibration to Entropy (KL divergence) calibration. MinMax includes outliers in the scale calculation, which forces the scale to be large, reducing the resolution for the bulk of values. Entropy calibration clips outliers intelligently.

Third, use the --layerPrecisions and --layerOutputTypes flags in trtexec (or the equivalent builder API settings) to keep specific layers in higher precision. Typically, the softmax operation and the final projection layer benefit from staying in FP16/FP32 even in an INT8 engine.

Fourth, use Polygraphy's debug tools to compare per-layer activations between the FP16 and INT8 engines to identify which layer is diverging.

Q4: What is TensorRT-LLM and what are the three key optimizations it adds over standard TensorRT for LLM inference?

TensorRT-LLM is NVIDIA's LLM-specific wrapper around TensorRT that adds optimizations designed for autoregressive generation workloads. Standard TensorRT is designed for fixed-graph, fixed-input-size inference, which does not map well to LLM generation where each decode step generates one token and the effective batch changes dynamically.

The three key optimizations: First, in-flight batching (continuous batching) - instead of waiting for all requests in a batch to complete before starting new ones, TRT-LLM inserts new requests as soon as slots free up mid-generation. This eliminates the GPU idle time caused by requests finishing at different times. Second, paged KV cache - instead of allocating a contiguous memory block for each sequence's key/value cache (which wastes memory when sequences are shorter than the maximum), paged KV cache uses fixed-size memory blocks and a page table, similar to OS virtual memory. This dramatically reduces memory fragmentation and allows serving 2-5x more concurrent sequences. Third, tensor parallelism integration - TRT-LLM handles the multi-GPU weight sharding and all-reduce communication natively, allowing transformer attention heads and FFN weight matrices to be split across multiple GPUs without manual partitioning code.

Q5: A TensorRT engine takes 45 minutes to build in your CI pipeline. What strategies would you use to reduce this build time?

Several approaches, in order of impact:

Use a timing cache. TensorRT saves the results of kernel auto-tuning benchmarks to a .cache file. On subsequent builds with similar layer configurations, it reuses cached timing data instead of re-benchmarking. This cuts build time dramatically after the first run - often 5-10x.

Narrow the dynamic shape range. If you specified min_seq=1, max_seq=4096, TensorRT auto-tunes for many shape configurations. Analyzing production traffic and narrowing to, say, min_seq=64, max_seq=2048 reduces the auto-tuning space substantially.

Use multiple optimization profiles strategically. Split the problem: one profile for short sequences (64-256) and one for long sequences (256-2048). Each profile has a narrower range, so each tunes faster than a single wide profile.

Lower the builder optimization level. TensorRT 8.6+ exposes a builder optimization level (--builderOptimizationLevel in trtexec) that controls how exhaustively kernels are tuned; dropping it a notch or two cuts build time substantially at a small cost in runtime performance.

Cache engines as CI artifacts keyed by (model weights hash + TRT version + GPU arch). If none of these change between builds, skip the build step entirely and deploy the cached engine.

Q6: Describe the precision hierarchy in TensorRT (FP32, TF32, FP16, BF16, INT8, FP8) and when you would choose each.

FP32: full 32-bit float, slowest, most accurate. Used for calibration reference and for layers that are numerically sensitive (like log-softmax or final linear projections in INT8 engines).

TF32: NVIDIA's default format on A100+, uses FP32 storage but only 10-bit mantissa precision for Tensor Core operations. Automatic on A100 for matrix multiplications. Provides most of FP32 range with near-FP16 speed. Usually the right default for training.

FP16: 16-bit float, ~2x faster than FP32 on A100 Tensor Cores. Primary format for inference. Good for most vision and NLP models. Watch for overflow (max ~65504) in softmax logits.

BF16: brain float 16, same exponent range as FP32 but only 7-bit mantissa. Less precise than FP16 but less prone to overflow. Better for training, similar inference performance to FP16.

INT8: 8-bit integer, ~2x faster than FP16 on Tensor Cores. Best for throughput-constrained production vision workloads. Requires calibration. 0.1-0.5% accuracy drop typical.

FP8: 8-bit float (H100+), faster than FP16, easier calibration than INT8 (wider dynamic range), ~0.1-0.3% accuracy drop. Best choice for LLM inference on H100.
