
FPGAs for AI Inference

Reading time: ~35 min · Interview relevance: Medium-High · Target roles: Hardware Engineer, Edge AI Developer, ML Infrastructure Engineer

FPGAs sit between GPUs and ASICs: more flexible than a fixed-function chip, faster to deploy than custom silicon, and capable of sub-microsecond latency that GPU kernel launch overhead alone cannot match. When microseconds separate profit from loss, FPGAs are not a niche choice - they are the only choice.


Opening Scenario: 2 Microseconds to Catch Fraud

It is 9:47 AM on a Tuesday at a mid-sized payment processing firm. In a data center in northern Virginia, their fraud detection system is processing 800,000 transactions per second. Each transaction must be scored by a neural network before it can proceed - no exceptions.

The previous system ran on NVIDIA A30 GPUs. The GPU scored transactions in 800 microseconds on average - fast by most standards. But the business SLA demanded decisions in under 100 microseconds, and the GPU's kernel launch overhead alone was 20-50 microseconds before the first multiply-accumulate even started. At peak load during Black Friday, latency spiked to 4 milliseconds. Cards were declined, customers were angry, and the fraud team was blamed for the infrastructure problem.

The architecture team evaluated every option. The Groq LPU was fast but required a full model rewrite. An ASIC would take 18 months and $40M. The solution they deployed was an FPGA cluster: six Xilinx Alveo U280 cards, each running a custom 8-bit quantized version of their 3-layer MLP fraud model. The model was synthesized directly into FPGA logic using Xilinx Vitis AI. There was no operating system, no kernel launch, no driver stack. The transaction bytes entered the FPGA, traveled through hardwired logic gates that performed the exact same computations as the neural network, and the fraud score appeared at the output pins.

Latency: 1.8 microseconds. Deterministic. Every single transaction.

This is what FPGAs do that nothing else can match: they erase the boundary between software and hardware. The neural network does not run on the FPGA - it becomes the FPGA.


Why This Exists: The Latency Wall

Before understanding FPGAs for AI, understand why every other option fails at the microsecond boundary.

A CPU executes instructions sequentially. To run a neural network layer, it fetches weights from memory, issues multiply instructions, accumulates results, and writes back. Even with SIMD vectorization, the memory latency for a cold cache miss is 50-100 nanoseconds per access. A small MLP with thousands of weights can take tens to hundreds of microseconds on a CPU once cache misses and framework overhead are included.

A GPU is massively parallel but has overhead costs baked into its architecture. Launching a CUDA kernel takes 5-20 microseconds in practice - this is the time to notify the GPU driver, DMA the kernel arguments, schedule warps on streaming multiprocessors, and begin execution. For a model that would only take 10 microseconds to actually compute, the launch overhead exceeds the compute time. GPUs are designed for throughput (processing large batches), not for the lowest possible single-query latency.
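
That overhead is easy to measure on any CUDA machine. A minimal sketch with PyTorch (assumes a CUDA-capable GPU; absolute numbers vary widely by GPU, driver, and framework version):

import time
import torch

# Tiny 3-layer MLP at batch size 1 - the arithmetic itself is only a few microseconds of work
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
).cuda().eval()
x = torch.randn(1, 128, device="cuda")

with torch.no_grad():
    for _ in range(100):              # warm-up: allocator, autotuning, lazy initialization
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(1000):
        model(x)
        torch.cuda.synchronize()      # force each call to finish before the next starts
    elapsed = time.perf_counter() - start

# Per-call time is dominated by launch and synchronization overhead, not by the multiplies
print(f"per-call latency: {elapsed / 1000 * 1e6:.1f} microseconds")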

An ASIC solves the latency problem but requires 12-24 months of chip design, mask costs of $30-100M, and locks you into one model architecture forever. If the model changes, you build a new chip.

FPGAs solve the latency problem with a different tradeoff: reconfigurable logic that you synthesize once (in hours, not years), can re-synthesize when the model changes (in hours again), and runs with deterministic sub-microsecond latency because there is no software layer. The logic is the neural network.


Historical Context: From Microsoft to the Cloud

FPGAs were invented in 1984 by Xilinx co-founder Ross Freeman. The original use case was rapid prototyping of digital logic circuits before committing to ASIC production. For two decades, FPGAs were the domain of hardware engineers building telecom equipment, radar systems, and aerospace applications.

The connection to AI inference arrived from an unexpected direction: Microsoft.

Project Catapult (2014) - Microsoft Research published results showing FPGAs accelerating the Bing search ranking algorithm by 2x while running at 10W per card. The key insight was not maximum throughput - it was that FPGAs could be inserted inline in the network data path. A search ranking request could be routed through an FPGA card in the switch fabric, scored, and returned to the CPU without any round-trip to a separate GPU server. The latency advantage was structural, not just computational.

Microsoft deployed FPGAs in every server in their datacenters starting in 2015. By 2017, Microsoft Azure offered FPGA acceleration as a cloud service, letting customers deploy custom logic directly into Azure's network infrastructure.

Intel acquires Altera (2015) - Intel paid $16.7B for Altera, one of the two dominant FPGA manufacturers alongside Xilinx. Intel's thesis was that FPGAs would become standard accelerators alongside CPUs in data centers, embedded in the server fabric. The Intel Stratix 10 and Agilex lines reflected this vision.

AMD acquires Xilinx (2022) - AMD paid $49B for Xilinx in the largest semiconductor acquisition in history at the time. Xilinx was the market leader in high-end FPGAs, with the Alveo data center line and the Versal ACAP (Adaptive Compute Acceleration Platform) which combined FPGA fabric with hardened AI engines. The acquisition gave AMD a portfolio spanning CPUs, GPUs, and FPGAs - every accelerator type under one roof.

FINN Project (2017-present) - Xilinx Research Labs, led by Yaman Umuroglu, launched the FINN framework: a research toolchain for deploying quantized neural networks, including binarized neural networks with 1-bit weights and activations, onto FPGAs. FINN demonstrated that sub-microsecond neural network inference was achievable on commodity FPGA hardware, and it remains an actively developed open-source toolchain alongside the production Vitis AI flow.


FPGA Fundamentals: The Hardware You Are Programming

To understand why FPGAs behave differently from GPUs, you need to understand what you are actually configuring.

The Basic Building Blocks

An FPGA is a chip containing millions of configurable elements connected by a programmable interconnect fabric. When you "program" an FPGA, you are not loading software - you are configuring which elements connect to which and what logic function each element performs.

Lookup Tables (LUTs) - The fundamental logic element. A 6-input LUT can implement any Boolean function of 6 inputs. It is literally a 64-entry truth table stored in SRAM. Configure the SRAM contents and you configure the logic function. A typical high-end FPGA has 1-2 million LUTs.
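
To make "truth table stored in SRAM" concrete, here is a toy Python model of a 6-input LUT (illustrative only - a real LUT is 64 SRAM cells feeding a multiplexer):

# A 6-input LUT modeled as a 64-entry truth table.
# Configuring the FPGA = filling in this table; evaluating logic = one table lookup.
def make_lut(boolean_fn):
    """Build the 64-entry table for an arbitrary 6-input Boolean function."""
    return [boolean_fn(*((i >> b) & 1 for b in range(6))) for i in range(64)]

def lut_eval(table, bits):
    """Evaluate the LUT: pack the 6 input bits into an index and look it up."""
    index = sum(bit << position for position, bit in enumerate(bits))
    return table[index]

# Example: configure the LUT as a 6-input majority vote
majority = make_lut(lambda *ins: int(sum(ins) >= 3))
print(lut_eval(majority, [1, 1, 0, 1, 0, 0]))  # 1
print(lut_eval(majority, [1, 0, 0, 1, 0, 0]))  # 0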

Flip-Flops (FFs) - 1-bit registers paired with the LUTs (typically one or two per LUT). They store state between clock cycles. This is where pipeline registers live, where partial sums accumulate, where sequence state is stored.

DSP Blocks - Hardened multiply-accumulate units. A Xilinx DSP48E2 block performs an 18x27-bit multiply in a single clock cycle, plus a 48-bit accumulate. A Xilinx Alveo U280 has 9,024 DSP blocks. At 300MHz clock, that is 9,024 * 300M = 2.7 trillion multiply-accumulates per second - but only if you can keep all DSPs fed with data simultaneously.

Block RAM (BRAM) - On-chip SRAM arranged in 36Kb blocks. The Alveo U280 has 1,080 BRAM blocks = ~4.7MB of on-chip memory. This is where weights for small models live. Fitting weights in BRAM eliminates DRAM latency entirely.

High Bandwidth Memory (HBM) - The Alveo U280 includes 8GB of HBM2 providing 460 GB/s bandwidth, connected directly to the FPGA fabric. For larger models that do not fit in BRAM, HBM provides GPU-class memory bandwidth without GPU-class power consumption.

How This Maps to Neural Networks

A matrix-vector multiply $\mathbf{y} = W\mathbf{x}$ can be implemented on an FPGA as a pipeline of DSP blocks:

  • Each DSP block handles one multiply: weight $w_{ij}$ times activation $x_j$
  • The accumulate chain across DSPs sums the partial products for one output neuron
  • Multiple such chains run simultaneously, one per output neuron
  • The pipeline advances one clock cycle at a time, processing new inputs every clock

This is fundamentally different from a GPU. A GPU executes this as thousands of CUDA threads scheduled on streaming multiprocessors, with shared memory coordination. The FPGA executes it as a fixed circuit where each wire carries exactly one value at each clock cycle. There is no thread scheduling, no cache coherency protocol, no kernel launch. The computation and the hardware are the same thing.

FPGA Neural Network Pipeline (simplified):

Clock Cycle 1:   [Input x0] -> DSP_0 (w0*x0) --------+
                 [Input x1] -> DSP_1 (w1*x1) --------+-> Accumulate -> Output y0
                 [Input x2] -> DSP_2 (w2*x2) --------+

Simultaneously:  [Input x0] -> DSP_3 (w3*x0) --------+
                 [Input x1] -> DSP_4 (w4*x1) --------+-> Accumulate -> Output y1
                 [Input x2] -> DSP_5 (w5*x2) --------+

Clock Cycle 2:   Next input arrives. Pipeline advances.
                 New x0, x1, x2 values flow through the same DSPs.
                 One clock cycle later, new y0 and y1 appear.

The latency through the pipeline is fixed: it is the number of pipeline stages times the clock period. At 300MHz, a 10-stage pipeline takes 33 nanoseconds. This is deterministic and unchanging.
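
The arithmetic behind these figures is short enough to keep in a scratch script; this sketch simply recomputes the numbers quoted in this section (assumed clock, pipeline depth, and DSP count, not measurements):

clock_hz = 300e6          # assumed fabric clock
pipeline_stages = 10      # assumed depth of the synthesized MAC pipeline
dsp_blocks = 9024         # DSP48E2 blocks on the Alveo U280

latency_ns = pipeline_stages / clock_hz * 1e9
peak_macs_per_s = dsp_blocks * clock_hz     # one INT8 MAC per DSP per cycle, if all stay fed
results_per_s = clock_hz                    # one new result per cycle once the pipeline is full

print(f"pipeline latency: {latency_ns:.1f} ns")                  # ~33 ns
print(f"peak MACs/s: {peak_macs_per_s / 1e12:.1f} trillion")     # ~2.7 trillion
print(f"results/s at II=1: {results_per_s / 1e6:.0f} million")   # 300 million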


Quantization: Why INT4/INT8 Matters So Much on FPGAs

On a GPU, quantization improves throughput. On an FPGA, quantization determines whether the model fits in hardware at all.

Each DSP48E2 block supports up to 18x27-bit multiplies. An FP32 multiply requires multiple DSP blocks or dedicated floating-point IP cores. An INT8 multiply fits in one DSP block with room to spare. An INT4 multiply fits in half a DSP block - meaning two INT4 multiplies per DSP per clock cycle.

The arithmetic is direct:

For a fully connected layer with 256 inputs and 256 outputs:

  • FP32: 256 * 256 = 65,536 multiplies needed. At roughly 3 DSPs per FP32 multiply, that is ~196,608 DSPs. The Alveo U280 has 9,024. You cannot implement this layer in one pass.
  • INT8: 65,536 multiplies at one per DSP = 65,536 DSPs needed. Still too large.
  • INT4: 65,536 multiplies at two per DSP = 32,768 DSPs needed. Still too large for one layer.
  • INT4 pipelined: Implement a subset of neurons per cycle and stream inputs through. With 9,024 DSPs you process 18,048 INT4 multiplies per cycle, streaming a 256x256 layer in roughly 4 cycles. The sketch after this list works through the same arithmetic.
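
A sketch of that resource arithmetic, using the rough cost model above (about 3 DSPs per FP32 multiply, one per INT8, half per INT4 - an approximation, not a vendor figure):

import math

# Rough DSP cost per multiply at each precision (approximation, not a vendor figure)
DSPS_PER_MULT = {"FP32": 3.0, "INT8": 1.0, "INT4": 0.5}

layer_mults = 256 * 256   # fully connected 256 -> 256
available_dsps = 9024     # Alveo U280

for precision, cost in DSPS_PER_MULT.items():
    dsps_needed = math.ceil(layer_mults * cost)
    # Cycles needed if the layer is instead streamed (time-multiplexed) over the chip's DSPs
    cycles = math.ceil(layer_mults * cost / available_dsps)
    fits = "fits" if dsps_needed <= available_dsps else "does not fit"
    print(f"{precision}: {dsps_needed:>6} DSPs fully unrolled ({fits}); ~{cycles} cycles streamed")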

This is why the FINN framework focused on binary neural networks (1-bit weights, 1-bit activations). A 1-bit multiply is an XNOR operation - implemented in a LUT, not a DSP, with essentially zero resource cost. A 512-neuron binary layer can fit entirely in BRAM with the compute in LUTs, achieving true sub-100-nanosecond latency.

:::warning Quantization on FPGAs is Not the Same as Quantization for GPUs

When you quantize a model for GPU deployment, you use INT8 with symmetric quantization and standard scales. When you quantize for FPGAs, you must account for the specific DSP block dimensions, the accumulator width (to avoid overflow), and whether you are targeting BRAMs (which have fixed-width ports). The quantization scheme must be hardware-aware. Xilinx Vitis AI handles this automatically, but custom HLS designs require manual calculation.

:::


HLS: Writing Hardware in C++

High-Level Synthesis (HLS) is the technology that made FPGAs accessible to software engineers. Without HLS, programming an FPGA required writing RTL (Register Transfer Level) code in Verilog or VHDL - essentially describing every wire, every register, every clock-cycle behavior explicitly. A small neural network layer might take weeks to implement in RTL.

HLS allows you to write the algorithm in C++ with special pragmas that guide the synthesis tool, and the tool generates RTL automatically.

// HLS implementation of a single fully connected layer
// Compiled with Xilinx Vitis HLS

#include "ap_fixed.h"    // Xilinx arbitrary precision fixed-point
#include "hls_stream.h"  // HLS streaming interface

// Use 8-bit fixed point: 1 sign bit, 3 integer bits, 4 fractional bits
typedef ap_fixed<8, 4> data_t;
typedef ap_fixed<16, 8> acc_t;  // wider accumulator to prevent overflow

const int INPUT_SIZE = 64;
const int OUTPUT_SIZE = 32;

void fc_layer(
    data_t input[INPUT_SIZE],
    data_t weights[OUTPUT_SIZE][INPUT_SIZE],
    data_t bias[OUTPUT_SIZE],
    data_t output[OUTPUT_SIZE]
) {
    // Pragma: pipeline this loop with initiation interval = 1
    // Means: accept new input every 1 clock cycle
    #pragma HLS PIPELINE II=1

    // Pragma: partition arrays into individual registers for parallel access
    #pragma HLS ARRAY_PARTITION variable=input complete dim=1
    #pragma HLS ARRAY_PARTITION variable=weights complete dim=2

compute_output:
    for (int i = 0; i < OUTPUT_SIZE; i++) {
        acc_t acc = bias[i];

    dot_product:
        for (int j = 0; j < INPUT_SIZE; j++) {
            // HLS will map this to DSP blocks automatically
            acc += (acc_t)(weights[i][j]) * (acc_t)(input[j]);
        }

        // Apply ReLU activation
        output[i] = (acc > 0) ? (data_t)acc : (data_t)0;
    }
}

The #pragma HLS PIPELINE II=1 instruction tells the synthesizer to pipeline the outer loop with an initiation interval of 1 - meaning a new set of inputs can enter the pipeline every clock cycle. The synthesizer analyzes data dependencies and inserts registers automatically to make this possible.

The #pragma HLS ARRAY_PARTITION variable=weights complete dim=2 instruction tells the synthesizer to split the weights array into individual elements stored in separate registers, allowing all INPUT_SIZE weights to be read in parallel in one clock cycle. Without this, the weights would be stored in BRAM with limited port bandwidth.

The result of HLS synthesis on this function, targeting an Alveo U280 at 300MHz:

  • Latency: 3 clock cycles = 10 nanoseconds for one forward pass (after the pipeline fills)
  • Throughput: 1 new result per clock cycle = 300 million inferences per second
  • DSP utilization: approximately 64 DSP blocks for the dot products
  • BRAM: weights fit in 1 BRAM block (32 * 64 weights * 8 bits = 16,384 bits, about 2KB)

This is the power of the pipelined hardware approach. The GPU equivalent would take 20+ microseconds just to launch the kernel.

:::danger HLS Pragmas Are Not Suggestions

HLS pragmas fundamentally change the hardware generated. Removing #pragma HLS ARRAY_PARTITION on the weights array will cause the synthesizer to generate a design that reads one weight per clock cycle instead of all weights in parallel. The resulting latency could be 64x worse. HLS code that "looks correct" in C simulation can generate hardware that is 100x slower than expected if pragmas are wrong or missing.

:::


The Xilinx Vitis AI Workflow

Xilinx Vitis AI (now AMD Vitis AI) is the production path for deploying PyTorch or TensorFlow models on Xilinx FPGAs without writing HLS code manually. It abstracts the hardware details behind a quantization and compilation toolchain.

The workflow has four stages:

Stage 1: Train in PyTorch (standard)

import torch
import torch.nn as nn

class FraudDetectionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 2)
        )

    def forward(self, x):
        return self.layers(x)

model = FraudDetectionModel()
# ... train normally with your fraud dataset ...
torch.save(model.state_dict(), "fraud_model.pth")

Stage 2: Quantize with Vitis AI Quantizer

# Requires: pip install pytorch-nndct (Vitis AI Python package)
import torch
from pytorch_nndct.apis import torch_quantizer

# Load the trained model
model = FraudDetectionModel()
model.load_state_dict(torch.load("fraud_model.pth"))
model.eval()

# Create the quantizer in 'calib' mode: it runs calibration forward passes to
# determine INT8 scale factors (the target DPU is specified later, at compile time)
quantizer = torch_quantizer(
    quant_mode='calib',
    module=model,
    input_args=torch.randn(1, 128),   # example input shape
    device=torch.device('cpu'),
    quant_config_file='quant_config.json'
)

# Run calibration on representative data (100-1000 samples is sufficient)
calib_loader = get_calibration_dataloader()   # your data loading code
with torch.no_grad():
    for batch in calib_loader:
        inputs, _ = batch
        quantizer.quant_model(inputs)

# Export calibration results
quantizer.export_quant_config()

# Switch to 'test' mode to verify quantized accuracy
quantizer_test = torch_quantizer(
    quant_mode='test',
    module=model,
    input_args=torch.randn(1, 128),
    device=torch.device('cpu')
)
quant_model = quantizer_test.quant_model

# Evaluate quantized model accuracy
# Should be within 0.5-1% of the FP32 baseline for INT8
accuracy = evaluate(quant_model, test_loader)   # evaluate(): your validation helper
print(f"INT8 quantized accuracy: {accuracy:.4f}")

# Export the quantized model in XIR format for the Vitis AI compiler
quantizer_test.export_xmodel()

Stage 3: Compile to DPU Instructions

# Run from the Vitis AI Docker container
# Vitis AI provides pre-built Docker images with all tools

vai_c_xir \
    --xmodel fraud_model_int8.xmodel \
    --arch /opt/vitis_ai/arch/DPUCAHX8H/Alveo-U280/arch.json \
    --output_dir ./compiled/ \
    --net_name fraud_detection

# Output: ./compiled/fraud_detection.xmodel
# This file contains DPU instruction sequences for the Alveo U280

Stage 4: Deploy on FPGA

# On the target system with an Alveo U280 installed
import time
import numpy as np
import vart
import xir

# Load the compiled model and create the DPU runner once, at startup.
# Runner creation is expensive and must stay off the per-transaction path.
graph = xir.Graph.deserialize("compiled/fraud_detection.xmodel")
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()

# Find the DPU subgraph (the part that runs on the FPGA fabric)
dpu_subgraph = [sg for sg in subgraphs if sg.has_attr("device")
                and sg.get_attr("device").upper() == "DPU"][0]

# Create the DPU runner - this connects to the physical FPGA hardware
runner = vart.Runner.create_runner(dpu_subgraph, "run")

def run_fraud_detection(transaction_features: np.ndarray) -> np.ndarray:
    """
    Run the fraud detection model on the FPGA.
    transaction_features: shape (batch_size, 128), dtype=float32
    Returns: shape (batch_size, 2) fraud probability logits
    """
    # Prepare input/output tensor buffers
    input_tensor_buffers = runner.get_inputs()
    output_tensor_buffers = runner.get_outputs()

    # Copy input data to the DPU input buffer
    # Vitis AI handles INT8 scaling internally
    input_data = np.ascontiguousarray(transaction_features, dtype=np.float32)
    np.copyto(input_tensor_buffers[0].numpy(), input_data)

    # Execute on the FPGA - the DPU streams data through the synthesized neural network
    job_id = runner.execute_async(input_tensor_buffers, output_tensor_buffers)
    runner.wait(job_id)

    # Read output
    return output_tensor_buffers[0].numpy().copy()

# Benchmark latency - the runner already exists, so this times only the inference call
test_transaction = np.random.randn(1, 128).astype(np.float32)

start = time.perf_counter()
result = run_fraud_detection(test_transaction)
end = time.perf_counter()

print(f"Inference latency: {(end - start) * 1e6:.2f} microseconds")
# Expected output: Inference latency: 2.1 microseconds

The FINN Framework: Pushing to the Extreme

Vitis AI targets standard INT8 quantized models. The FINN framework from Xilinx Research goes further, targeting binarized and extremely quantized neural networks where every operation is optimized to FPGA primitives.

FINN's key insight: a 1-bit weight times a 1-bit activation is an XNOR operation. XNOR is a single LUT operation - no DSP blocks required. A binarized layer with 512 inputs and 512 outputs requires 512 * 512 = 262,144 XNOR operations, each of which fits in one LUT, plus a popcount (adder tree) per output neuron to accumulate the results. The Alveo U280 has over a million LUTs. In theory, a 512x512 binarized layer can be fully unrolled in hardware - all 262,144 operations computed simultaneously in one clock cycle.

This is qualitatively different from INT8. With INT8, you pipeline the computation and accept throughput as the metric. With FINN's binarized approach, you can implement genuine combinational logic where the output literally appears at the output pins nanoseconds after the input arrives - no clock cycles, no pipeline stages, no initiation interval. This is how FINN achieves single-digit microsecond latency for small classification models.
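
A bit-level toy model shows why binarized layers map so cleanly onto LUTs, and why a popcount (adder tree) still follows the XNORs (a sketch of the arithmetic, not of FINN's generated hardware):

import numpy as np

# Binarized dot product: weights and activations are {-1, +1}, stored as bits {0, 1}.
# A multiply becomes XNOR, and the dot product becomes 2 * popcount(XNOR) - n.
rng = np.random.default_rng(0)
n = 512
w_bits = rng.integers(0, 2, size=n, dtype=np.uint8)   # 1 encodes +1, 0 encodes -1
x_bits = rng.integers(0, 2, size=n, dtype=np.uint8)

xnor = 1 - (w_bits ^ x_bits)            # 1 wherever the signs agree
popcount = int(xnor.sum())              # the adder tree the FPGA still needs after the XNORs
dot_binary = 2 * popcount - n           # dot product in the {-1, +1} domain

# Reference computation in the +/-1 domain
w = 2 * w_bits.astype(np.int32) - 1
x = 2 * x_bits.astype(np.int32) - 1
assert dot_binary == int(np.dot(w, x))
print(f"XNOR-popcount dot product: {dot_binary}")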

The tradeoff is accuracy. Binarized networks sacrifice 2-5% accuracy compared to FP32 on standard benchmarks. For applications like fraud detection where the cost of false negatives is high, this may be unacceptable. For applications like network intrusion detection where speed is more critical than perfection, FINN is the right tool.


When FPGAs Win: The Decision Matrix

Concrete Scenarios Where FPGAs Win

High-Frequency Trading (HFT) - Market data arrives as network packets. An FPGA can parse the packet, run the signal model, and emit an order - all before the packet would have even been delivered to a CPU. Firms like Jump Trading and Virtu use FPGAs in their colocation racks.

5G Base Station Signal Processing - FPGAs run the PDSCH (Physical Downlink Shared Channel) decoding pipeline in real time. The latency constraints are in tens of microseconds. No GPU can schedule work fast enough. FPGA vendors have dedicated 5G IP cores for this.

Medical Imaging Edge Processing - An ultrasound probe with a built-in FPGA runs a segmentation model locally before displaying the image. Power budget: 2W. No GPU fits this envelope. The Xilinx Zynq UltraScale+ combines a quad-core ARM processor with FPGA fabric on one chip.

Network Intrusion Detection (IDS) - A 100GbE network tap needs to classify every packet at line rate. At 100 Gbps with 64-byte packets, that is 148 million packets per second. FPGAs run the classification model in the data path; by the time the packet exits the FPGA, a decision has been made.

Concrete Scenarios Where GPUs Win

Training at Scale - FPGA synthesis takes hours for each design iteration. GPU code compiles in seconds. The development velocity advantage of GPUs for training is overwhelming.

Large Model Inference with Big Batches - An LLM serving 100 requests simultaneously needs the memory bandwidth and raw FLOPS of an H100 or A100. FPGAs have lower peak FLOPS and less HBM than top-end GPUs.

Rapid Model Iteration - If the model architecture changes every week, re-synthesizing an FPGA design every week is prohibitively slow. GPUs let you deploy new models in minutes.


Production Engineering Notes

Synthesis Time Planning

FPGA synthesis is a multi-hour process. Plan accordingly:

| Design Complexity | Synthesis Time |
|---|---|
| Small MLP (< 1M params, INT8) | 2-4 hours |
| Medium CNN (VGG-style, INT8) | 6-12 hours |
| Full ResNet-50 DPU compilation | 1-2 hours (Vitis AI pre-built DPU overlay) |
| Custom HLS design with heavy optimization | 8-24 hours |

The Vitis AI workflow dramatically reduces this for standard models because the DPU overlay is pre-synthesized. Only the model instructions (xmodel file) need to be compiled, which takes minutes. For custom HLS designs, full re-synthesis is required for any architectural change.

Thermal and Power Management

FPGAs have a significant advantage over GPUs in power efficiency for latency-critical workloads. But they still generate heat and require proper cooling:

# Power estimation for an Alveo U280 running the fraud detection model
# Figures from Xilinx Power Estimator (XPE)

dynamic_power_w = {
    "dsp_blocks": 45 * 0.28,   # 45 DSPs at 280mW average
    "bram": 8 * 1.2,           # 8 BRAMs at 1.2W each
    "interconnect": 3.5,       # routing power
    "io": 2.1,                 # PCIe + network interface
    "clocking": 1.8,           # PLLs and clock distribution
}
static_power_w = 12.0  # always-on: IO banks, config memory

total_power = sum(dynamic_power_w.values()) + static_power_w
# Total: ~42W for a small MLP on Alveo U280
# Compare: NVIDIA A30 = 165W, A100 = 400W

print(f"Estimated FPGA power: {total_power:.1f}W")
print(f"Power ratio vs A30: {165 / total_power:.1f}x lower power for the latency workload")

Partial Reconfiguration

One FPGA capability with no GPU equivalent: partial reconfiguration. You can update a portion of the FPGA fabric (one neural network layer, one processing pipeline) while the rest of the chip continues operating. For systems that need zero-downtime model updates, partial reconfiguration enables live model swaps without taking the inference endpoint offline.

This is used in production by financial firms that need to update trading signals during market hours without any service interruption.


Key Chips Reference

| Chip | Vendor | LUTs | DSP Blocks | On-chip Memory | HBM | Power |
|---|---|---|---|---|---|---|
| Alveo U280 | AMD/Xilinx | 1.08M | 9,024 | 4.7MB BRAM | 8GB HBM2 | 225W TDP |
| Alveo U55C | AMD/Xilinx | 1.3M | 9,024 | 3.8MB BRAM | 16GB HBM2 | 150W TDP |
| Versal AI Core | AMD/Xilinx | 400K | 1,968 | 34.6MB BRAM | - | 100W TDP |
| Stratix 10 GX | Intel | 1.52M | 5,760 | 59Mb MLAB | - | 250W TDP |
| Agilex 7 | Intel | 930K | 7,680 | 253Mb MLAB | - | 200W TDP |
| Zynq UltraScale+ | AMD/Xilinx | 504K | 2,520 | 22.5Mb BRAM | - | 30W (edge) |

The Versal AI Core series deserves special mention: it combines programmable FPGA fabric with hardened AI Engines (vector processors optimized for ML) and a processor system. It is AMD's answer to custom ASICs for AI - the AI Engines provide ASIC-level efficiency for standard tensor operations, while the FPGA fabric handles custom data path logic.


Common Mistakes

:::danger Treating FPGAs Like CPUs or GPUs

The biggest mistake engineers make when approaching FPGAs: thinking of them as another processor to write code for. The mental model must shift from "I am programming a processor" to "I am designing a circuit." Every piece of code you write in HLS will be translated into actual gates, wires, and registers. Writing a loop in HLS does not mean the FPGA executes it sequentially - with the right pragmas it may be fully unrolled into parallel hardware. Without the right pragmas, it may generate a state machine that takes N clock cycles. Pragma-free HLS code that "works in simulation" can generate hardware 100x slower than intended.

:::

:::danger Ignoring Synthesis Time in Your Development Cycle

A GPU developer iterates in minutes: change code, recompile, run, observe. An FPGA developer iterates in hours: change HLS code, run synthesis (2-8 hours), download the bitstream, run, observe. Teams that do not account for this in their project timeline routinely miss deadlines by months. The mitigation: simulate exhaustively in C++ (C simulation in Vitis HLS runs in seconds), use co-simulation to verify RTL behavior before committing to full synthesis, and use the Vitis AI abstraction for standard models to avoid HLS entirely.

:::

:::warning Underestimating Quantization Sensitivity

FPGAs typically require INT8 or lower precision for efficient implementation. Some models are sensitive to quantization and lose more than 2% accuracy at INT8. This is not a hardware problem - it is a model problem that must be solved with quantization-aware training (QAT) before deployment. If your model accuracy drops from 94% to 89% at INT8, do not blame the FPGA. Retrain with QAT.

:::

:::warning Thinking That "Lower FLOPS" Means Worse Performance

An Alveo U280 at peak theoretical FLOPS is slower than an A100. But for latency-critical batch-1 inference, the U280 can be 100x faster in wall-clock time. FLOPS is a throughput metric. Latency depends on pipeline depth, memory access patterns, and scheduling overhead - all of which favor FPGAs for small, frequent queries.

:::


Interview Questions and Answers

Q1: Explain the latency advantage of FPGAs over GPUs for neural network inference. Be specific about where the latency difference comes from.

A1: The FPGA's latency advantage over GPUs comes from two structural sources of GPU overhead. First, kernel launch overhead: when you call model(x) in PyTorch, the CUDA runtime must notify the GPU driver, DMA kernel arguments, schedule warps on streaming multiprocessors, and begin execution. This alone takes 5-20 microseconds before any computation starts. Second, the GPU's execution model is inherently batch-oriented - it achieves high throughput by processing hundreds of inputs simultaneously. For single-query inference, most of the hardware sits idle while the scheduler routes work.

An FPGA eliminates both problems. There is no kernel launch because there is no kernel - the neural network is synthesized as a fixed circuit. A transaction arriving on the input pins flows through the pipeline immediately, with the first computation beginning within the first clock cycle after the data arrives. The total latency equals the pipeline depth times the clock period, which can be 10-100 nanoseconds for small models. This is structurally impossible on a GPU.

Q2: What is High-Level Synthesis and what are its limitations compared to writing RTL directly?

A2: HLS is a compiler that translates C++ algorithmic descriptions into RTL (Verilog/VHDL) hardware descriptions. It eliminates the need to manually specify register transfers, clock-cycle timing, and wire routing - instead inferring these from the C++ code and pragma annotations.

Limitations: First, the quality of generated RTL (measured as resource utilization and clock frequency) is consistently worse than hand-written RTL by an expert - typically 1.5-3x more LUTs/DSPs for equivalent functionality. Second, HLS tools have limited support for irregular control flow (pointer aliasing, complex data structures, dynamic memory allocation) - anything that maps poorly to hardware is either unsupported or generates slow circuits. Third, debugging requires understanding both the C++ model and the generated RTL - when timing closure fails, you need RTL knowledge to diagnose it. Fourth, HLS synthesis still takes hours, so the iteration speed advantage is smaller than expected.

The tradeoff is appropriate for most ML acceleration tasks: HLS code is 10-100x faster to develop than RTL, and the resulting hardware is fast enough for production. Hand-written RTL is reserved for the most performance-critical inner loops.

Q3: How does quantization for FPGA differ from quantization for GPU, and why does it matter more on FPGAs?

A3: For GPU deployment, INT8 quantization primarily improves throughput and reduces memory bandwidth. The GPU's tensor cores natively support INT8; you use standard symmetric per-tensor or per-channel quantization with calibration, and the hardware handles the rest transparently.

For FPGA deployment, quantization is a hardware design decision. The bit-width of your data directly determines which resources are used: INT8 maps to one DSP block per multiply; INT4 maps to half a DSP block; FP32 requires multiple DSP blocks or floating-point IP cores. The bit-width also determines how many operations you can run in parallel (given a fixed number of DSPs), the accumulator width needed to prevent overflow, and whether weights fit in BRAM or require HBM. These are resource allocation decisions, not just accuracy tradeoffs. Custom FPGA designs often use unusual precisions like INT5 or INT6 because the specific bit-width optimally fills the available DSP resources.

Q4: Describe the Microsoft Project Catapult architecture and what it demonstrated about FPGAs in production ML.

A4: Project Catapult, published at ISCA 2014, deployed FPGAs in every server in a Microsoft datacenter for Bing search ranking acceleration. The key architectural insight was that FPGAs could be inserted into the network data path between servers, rather than treated as separate accelerators. A search ranking request could flow through an FPGA card in the datacenter fabric, be scored there, and return to the CPU - without the latency of routing to a separate GPU accelerator box.

The performance result was 2x improvement in ranking throughput at 10W per card. But the architectural impact was larger: it demonstrated that FPGAs could be operationally reliable at datacenter scale (thousands of boards, over months), that datacenter FPGAs could be reprogrammed in place for different tasks, and that the "inline" deployment model - FPGA in the data path, not the compute path - was viable for production services. This influenced Microsoft's Azure FPGA service and the broader industry's view of FPGAs as infrastructure components rather than exotic accelerators.

Q5: What is the FINN framework and what workloads is it designed for?

A5: FINN (Fast Inference on Neural Networks) is an open-source toolchain from AMD/Xilinx Research for compiling quantized neural networks - including binarized and extremely quantized networks - to FPGA hardware. Unlike Vitis AI, which targets standard INT8 quantized models running on a pre-built DPU overlay, FINN generates custom hardware for each neural network architecture.

FINN's target workloads are applications where absolute minimum latency is required and accuracy can be traded against speed. The canonical example is the FINN-R paper (2018), which demonstrated a binarized ResNet running at 12,000 frames per second with 300 microsecond latency on a Xilinx ZC706 FPGA. FINN uses 1-bit (binary) and 2-bit quantization, implementing multiplies as XNOR/AND operations in LUTs rather than DSPs. This enables full unrolling - every neuron computed simultaneously in combinational logic with no pipeline stages.

Appropriate use cases: network intrusion detection, radio signal classification, robotics control loops, and any edge application where the power envelope is under 5W and latency under 100 microseconds is required. Not appropriate for: large transformer models, any model requiring FP32 accuracy, or applications where rapid model updates are needed.

Q6: Walk through the resource utilization calculation for determining if a model fits on a given FPGA.

A6: The calculation has three components: compute (DSPs), storage (BRAM/HBM), and routing (LUTs for control and activation functions).

For a fully connected layer with MM inputs and NN outputs using INT8:

  • Compute: $M \times N$ multiply-accumulate operations. One DSP per INT8 multiply. For fully parallel execution you need $M \times N$ DSPs; for serial execution (one output per cycle) you need $M$ DSPs.
  • Storage: the weight matrix is $M \times N \times 1$ byte. Each BRAM block is 36Kb, roughly 4.5KB. You need $\lceil M \times N / 4608 \rceil$ BRAMs, plus separate BRAMs for activations.
  • Activations (ReLU): implemented in LUTs, roughly 4-8 LUTs per neuron.

For an MLP with layers [128, 256, 128, 64, 2] running INT8:

  • Layer 1 (128 to 256): 128*256 = 32,768 DSPs for full parallel; 32KB weights = 8 BRAMs
  • Layer 2 (256 to 128): 256*128 = 32,768 DSPs; 32KB weights = 8 BRAMs
  • Layer 3 (128 to 64): 8,192 DSPs; 8KB = 2 BRAMs
  • Layer 4 (64 to 2): 128 DSPs; trivial

Total fully parallel: 73,856 DSPs. Alveo U280 has 9,024. The model does not fit with full unrolling.

Solution: pipeline with reuse. Implement 9,000 DSPs and serialize the computation, processing different output neurons in sequence. Layer 1 takes $\lceil 32768 / 9000 \rceil = 4$ cycles instead of 1. Total latency becomes ~40 cycles = 133 nanoseconds at 300MHz. Still faster than any GPU for single-sample inference.
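
The same bookkeeping can be written as a short helper. This sketch reproduces the Q6 numbers under the stated assumptions (INT8, one MAC per DSP, 36Kb ≈ 4.5KB BRAMs) and reports compute cycles only - pipeline fill between layers, which the ~40-cycle figure above includes, is not modeled:

import math

def mlp_resource_estimate(layer_sizes, bits=8, bram_bytes=4608, dsps_available=9024):
    """DSPs for full unrolling, BRAMs for weights, and compute cycles when
    time-multiplexed over the available DSPs (INT8, one MAC per DSP per cycle)."""
    total_dsps = total_brams = total_cycles = 0
    for m, n in zip(layer_sizes, layer_sizes[1:]):
        macs = m * n
        weight_bytes = macs * bits // 8
        total_dsps += macs
        total_brams += math.ceil(weight_bytes / bram_bytes)
        total_cycles += math.ceil(macs / dsps_available)
    return total_dsps, total_brams, total_cycles

dsps, brams, cycles = mlp_resource_estimate([128, 256, 128, 64, 2])
print(f"fully parallel: {dsps} DSPs ({dsps / 9024:.1f}x what the U280 has), {brams} BRAMs for weights")
print(f"streamed over 9,024 DSPs: ~{cycles} compute cycles "
      "(pipeline fill between layers pushes the total toward the ~40 cycles quoted above)")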


Transformer Inference on FPGAs: Is It Viable?

The previous sections focused on MLPs and CNNs, which are the natural fit for FPGAs. Transformer inference is harder and worth addressing directly.

A transformer's self-attention mechanism is the bottleneck. For sequence length $L$ and embedding dimension $d$, the attention computation requires $O(L^2 d)$ operations. For $L = 512$, $d = 768$ (BERT-base), the Q, K, V and output projections are four $d \times d$ linear layers (manageable on an FPGA), and the attention itself adds two more matrix multiplies: the $L \times L$ score matrix $QK^\top$ and the attention-weighted sum over $V$. The multiply-accumulate count for one layer's attention block is:

$$\text{MACs} = 4 L d^2 + 2 L^2 d = 4 \times 512 \times 768^2 + 2 \times 512^2 \times 768 \approx 1.6 \text{ billion MACs} \approx 3.2 \text{ billion FLOPs}$$

(The position-wise feed-forward block, omitted here, adds roughly another $8 L d^2 \approx 2.4$ billion MACs per layer.)

At INT8 on an Alveo U280 with 9,024 DSPs at 300MHz, peak throughput is approximately:

$$\text{Throughput} = 9{,}024 \times 300 \times 10^6 \approx 2.7 \text{ trillion INT8 MACs/second}$$

So the attention block alone takes roughly $1.6 \times 10^9 / 2.7 \times 10^{12} \approx 0.6$ milliseconds per layer at perfect DSP utilization. Across BERT-base's 12 layers, with the feed-forward blocks included and realistic utilization well below peak, a full forward pass lands in the tens of milliseconds.
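
The same estimate in script form, so the assumptions are explicit (attention-block MACs only, perfect DSP utilization - both optimistic):

# BERT-base attention-block cost on an Alveo U280, using the assumptions above
L_seq, d, n_layers = 512, 768, 12
dsps, clock_hz = 9024, 300e6

attn_macs = 4 * L_seq * d * d + 2 * L_seq * L_seq * d   # projections + two attention matmuls
peak_macs_per_s = dsps * clock_hz                        # one INT8 MAC per DSP per cycle

per_layer_ms = attn_macs / peak_macs_per_s * 1e3
print(f"per layer: {attn_macs / 1e9:.1f} billion MACs -> {per_layer_ms:.2f} ms at peak utilization")
print(f"{n_layers} layers: {n_layers * per_layer_ms:.1f} ms at peak "
      "(feed-forward blocks and sub-peak utilization push this higher)")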

This is not competitive with an H100 running FlashAttention (sub-millisecond for BERT-base at batch 1). For transformer inference, FPGAs are viable only for BERT-tiny or smaller models, or for heavily quantized models (INT4/binary) where the reduced arithmetic density fits the FPGA better.

The conclusion: FPGAs for transformers are a research topic, not a production recommendation. The sweet spot remains MLPs and CNNs at the edge, or standard classification/detection models where model size is small enough to fit fully in BRAM.


Connecting FPGA Quantization to Training: Quantization-Aware Training

Deploying a model on an FPGA is not just an inference-time decision. You must prepare during training if you want the best accuracy at INT8.

Post-Training Quantization (PTQ) is the simplest approach: train in FP32, then calibrate INT8 scales using a small representative dataset. This works well for many models but can cause 1-3% accuracy loss for models that are sensitive to quantization noise.

Quantization-Aware Training (QAT) inserts fake quantization nodes during training, allowing the model to learn weights that are robust to INT8 rounding. The training uses the Straight-Through Estimator (STE) to allow gradients to flow through the non-differentiable quantization operation:

$$\hat{x} = \text{round}\left(\frac{x}{s}\right) \times s$$

During the backward pass, $\frac{\partial \hat{x}}{\partial x} = 1$ (STE approximation), so gradients flow as if no quantization occurred. After QAT, the model has learned to pack information into INT8 representations more efficiently.
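
The trick is small enough to show in isolation. A minimal sketch of fake quantization with a straight-through gradient (the fake_quantize helper here is illustrative, not the API exposed by the PyTorch or Vitis AI quantizers):

import torch

def fake_quantize(x, s):
    # Quantize in the forward pass; the .detach() makes the backward pass see an identity
    xq = torch.round(x / s) * s
    return x + (xq - x).detach()

x = torch.randn(4, requires_grad=True)
y = fake_quantize(x, s=0.1).sum()
y.backward()
print(x.grad)   # all ones: gradients flow as if no rounding happened

The full QAT recipe for the fraud model, using PyTorch's built-in QAT machinery, looks like this: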

import torch
import torch.nn as nn
from torch.quantization import QuantStub, DeQuantStub, prepare_qat, convert

class QATFraudModel(nn.Module):
    """
    Fraud detection model prepared for Quantization-Aware Training.
    The QuantStub/DeQuantStub nodes tell PyTorch where to insert
    fake quantization operations during QAT.
    """
    def __init__(self):
        super().__init__()
        # QuantStub inserts fake-quantize at the network input
        self.quant = QuantStub()

        self.layers = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 2)
        )

        # DeQuantStub removes fake-quantize at the network output
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.layers(x)
        x = self.dequant(x)
        return x

def run_qat(model: nn.Module, train_loader, val_loader, epochs=5):
    """Run Quantization-Aware Training and return an INT8-ready model."""

    # Set quantization config: fbgemm for x86 deployment,
    # qnnpack for ARM/edge, or a custom config for FPGA (via Vitis AI)
    model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

    # Fuse Linear-ReLU patterns before QAT (fusion runs on an eval-mode model);
    # fusion improves accuracy and maps better to hardware
    model.eval()
    model = torch.quantization.fuse_modules(model, [['layers.0', 'layers.1']])

    # Prepare the model for QAT: inserts fake-quantize observers
    model_qat = prepare_qat(model.train())

    optimizer = torch.optim.Adam(model_qat.parameters(), lr=1e-5)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        model_qat.train()
        for batch_features, batch_labels in train_loader:
            optimizer.zero_grad()
            output = model_qat(batch_features)
            loss = criterion(output, batch_labels)
            loss.backward()
            optimizer.step()

        # Evaluate with fake quantization active (evaluate(): your validation helper)
        model_qat.eval()
        val_acc = evaluate(model_qat, val_loader)
        print(f"Epoch {epoch + 1}: QAT val accuracy = {val_acc:.4f}")

    # Convert to actual INT8 weights (removes fake-quantize nodes)
    model_int8 = convert(model_qat.eval())

    return model_int8

# After this point, export model_int8 and feed it into the Vitis AI compiler
# The QAT-trained model typically shows < 0.3% accuracy drop vs FP32

The key difference from standard QAT: when targeting FPGAs via Vitis AI, use torch_quantizer in QAT mode instead of PyTorch's native prepare_qat. Vitis AI's quantizer is calibrated to the specific DSP block dimensions of the target DPU architecture and will generate more hardware-efficient INT8 representations than the generic PyTorch quantizer.


Real-World FPGA Deployment Architecture

A production FPGA deployment is more than just the FPGA card. The full system involves host CPU code, PCIe communication, the FPGA shell (management logic), and the user application (your neural network).

The PCIe link is the most common bottleneck for high-throughput FPGA inference:

  • PCIe Gen 4 x16: 64 GB/s peak, ~50 GB/s sustained
  • A fraud transaction is 128 FP32 features, about 512 bytes, so even at 800,000 TPS the link carries only ~0.4 GB/s - raw bandwidth is not the constraint here; the per-transfer DMA latency and host-side software overhead are (the sketch below runs the numbers)
  • For workloads with large inputs (a 1MB image, say), bandwidth does become the limit at roughly 50,000 transfers per second per card - then you either add cards or keep data resident in HBM and stream only results back over PCIe
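
A two-line budget check makes the distinction concrete (a sketch using the round numbers above; real sustained PCIe throughput depends on transfer size and DMA engine tuning):

# Budget check for the PCIe numbers above
pcie_sustained_bytes_per_s = 50e9     # ~50 GB/s sustained on Gen4 x16 (rough figure)
bytes_per_transaction = 128 * 4       # 128 FP32 features
tps = 800_000

fraud_bandwidth = bytes_per_transaction * tps               # bytes/s the fraud workload needs
large_transfers_per_s = pcie_sustained_bytes_per_s / 1e6    # 1MB payloads

print(f"fraud workload: {fraud_bandwidth / 1e9:.2f} GB/s of a ~50 GB/s link")          # ~0.41 GB/s
print(f"1MB payloads:   ~{large_transfers_per_s:,.0f} transfers/s per card at most")   # ~50,000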

The latency breakdown for the fraud detection system:

  • PCIe DMA to FPGA: ~2 microseconds (small packet, well-tuned DMA)
  • FPGA DPU inference: 1.8 microseconds
  • PCIe DMA back to host: ~1 microsecond
  • Total end-to-end: ~5 microseconds

For applications where even PCIe round-trip is too slow, the FPGA can be integrated directly into the network switch fabric (SmartNIC configuration), eliminating the host CPU from the data path entirely.


Intel OpenCL for FPGA: The Alternative to Xilinx HLS

Intel (formerly Altera) took a different approach to FPGA programming: OpenCL. Instead of C++ with pragmas (Xilinx HLS), Intel's oneAPI FPGA flow uses OpenCL kernels compiled to hardware.

// Intel FPGA OpenCL kernel for a single FC layer
// Compiled with the Intel oneAPI FPGA compiler

__attribute__((reqd_work_group_size(1, 1, 1)))
__kernel void fc_layer_fpga(
    __global const char * restrict input,    // INT8 input activations
    __global const char * restrict weights,  // INT8 weights
    __global const int * restrict bias,      // INT32 bias
    __global char * restrict output,         // INT8 output
    int input_size,
    int output_size
) {
    // FPGA-specific pragma: tell the compiler to pipeline this loop
    // with a target initiation interval of 1
    #pragma ii 1
    for (int out = 0; out < output_size; out++) {
        int acc = bias[out];

        // Inner loop: dot product
        // 'ivdep': tell the compiler there are no loop-carried dependencies
        #pragma ivdep
        for (int in = 0; in < input_size; in++) {
            acc += (int)weights[out * input_size + in] * (int)input[in];
        }

        // Clamp and store as INT8
        acc = acc >> 8;  // scale down (quantization scale applied)
        output[out] = (char)(acc > 127 ? 127 : (acc < -128 ? -128 : acc));
    }
}

The Intel OpenCL approach compiles to the same FPGA primitives as Xilinx HLS - LUTs, DSPs, and BRAM. The synthesis time is similar (hours). The difference is ecosystem: Intel's toolchain integrates with the oneAPI toolkit and is more familiar to developers who have used OpenCL for GPU programming.

The practical advice: use Xilinx Vitis AI for production deployment on Xilinx/AMD hardware. Use HLS only when you need custom operations not covered by the DPU overlay. Use Intel OpenCL only if you are targeting Stratix 10 or Agilex devices specifically. For new projects in 2024, the Vitis AI abstraction layer is mature enough that you rarely need to write HLS directly.


Quantization Math: From Theory to FPGA Registers

Understanding the math of INT8 quantization helps you debug accuracy issues and design custom quantization schemes for unusual FPGA configurations.

The standard linear quantization formula maps a floating-point value xx to an integer xqx_q:

$$x_q = \text{clamp}\left(\text{round}\left(\frac{x}{s}\right) + z,\; x_{q,\min},\; x_{q,\max}\right)$$

Where:

  • $s$ is the scale factor (a floating-point number)
  • $z$ is the zero-point (an integer offset for asymmetric quantization)
  • $x_{q,\min}$, $x_{q,\max}$ are the quantized range bounds (for INT8: -128 to 127)

For symmetric quantization (commonly used for weights):

$$s = \frac{\max(|x|)}{127}, \quad z = 0$$

For asymmetric quantization (commonly used for activations, which are non-negative after ReLU):

$$s = \frac{\max(x) - \min(x)}{255}, \quad z = -\text{round}\left(\frac{\min(x)}{s}\right) - 128$$

On the FPGA, the multiply-accumulate with INT8 quantized values produces an INT32 accumulator to avoid overflow:

$$y_q = \sum_i w_{q,i} \cdot x_{q,i}$$

$$y = s_w \cdot s_x \cdot \left(y_q - z_w \sum_i x_{q,i} - z_x \sum_i w_{q,i} + N \cdot z_w \cdot z_x\right)$$

This requantization step converts the INT32 accumulator back to INT8 for the next layer. On the FPGA, this is implemented as a right-shift (division by a power of 2) plus a clamp - no floating-point required.

:::note Why Symmetric Quantization is Preferred for Weights on FPGAs

Symmetric quantization with $z = 0$ eliminates the zero-point correction term from the requantization formula. This removes the $z_w \sum_i x_{q,i}$ term, which would require an additional accumulation operation per output neuron. With symmetric weight quantization, requantization is just a right-shift and clamp - mapping cleanly to a single DSP block with no additional logic.

:::
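
A small numerical sketch ties the formulas together: symmetric INT8 weights, asymmetric INT8 activations, an integer-only accumulation, and the dequantization step with the $z_w = 0$ simplification from the note above (illustrative Python, not the Vitis AI quantizer):

import numpy as np

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=64)).astype(np.float32)   # post-ReLU activations (non-negative)
w = rng.normal(size=64).astype(np.float32)           # weights for one output neuron

# Symmetric weight quantization: s_w = max|w| / 127, z_w = 0
s_w = float(np.abs(w).max()) / 127.0
w_q = np.clip(np.round(w / s_w), -128, 127).astype(np.int32)

# Asymmetric activation quantization over [min(x), max(x)]
s_x = float(x.max() - x.min()) / 255.0
z_x = int(-np.round(x.min() / s_x) - 128)
x_q = np.clip(np.round(x / s_x) + z_x, -128, 127).astype(np.int32)

# Integer-only accumulation: this is what the DSP chain on the FPGA computes
y_q = int(np.sum(w_q * x_q))

# Dequantize using the formula above; z_w = 0 drops two of the correction terms
y_from_int8 = s_w * s_x * (y_q - z_x * int(w_q.sum()))
y_fp32 = float(np.dot(w, x))
print(f"float32 dot product:     {y_fp32:.4f}")
print(f"reconstructed from INT8: {y_from_int8:.4f}")   # close, but not bit-identical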


Summary

FPGAs occupy a specific niche in the AI hardware landscape: they are the correct choice when GPU kernel launch overhead is itself a performance problem, when power budgets rule out GPU-class hardware, and when deterministic sub-microsecond latency is a hard requirement.

The key ideas:

  • FPGAs implement neural networks as circuits, not as software running on a processor. The neural network does not run on the FPGA - it becomes the FPGA.
  • DSP blocks, BRAMs, and LUTs map directly to the multiplications, weight storage, and activation functions of neural networks. Bit-width determines which resource class is used.
  • INT8 and lower quantization is required for practical FPGA deployment. The bit-width you choose affects both accuracy and how many parallel multiply-accumulate units fit in the available DSPs.
  • HLS (C++ with pragmas) and Vitis AI provide production toolchains without RTL expertise. The DPU overlay in Vitis AI is pre-synthesized; you only compile model instructions, which takes minutes.
  • Synthesis takes hours for full custom HLS designs. Design your iteration cycle accordingly: simulate in C, co-simulate to verify RTL, synthesize only when confident.
  • FPGAs win on latency and power efficiency for small models and single-sample inference. GPUs win on throughput, large models, and rapid model iteration.
  • The FINN framework extends FPGA AI to the extreme edge: binarized networks with combinational (zero-pipeline-stage) inference.
  • Transformers are a poor fit for current FPGAs because the attention mechanism's compute and memory traffic grow quadratically with sequence length. MLP and CNN workloads are the primary sweet spot.

The mental model shift required: stop thinking of FPGAs as slow computers and start thinking of them as configurable circuits where the neural network becomes the hardware itself. Once that shift happens, the latency advantage is obvious - you are not scheduling work on a processor, you are building a pipe where data flows through logic at the speed of electricity.
