
Hardware Acceleration Beyond GPU

Reading time: ~40 min · Interview relevance: High · Target roles: ML Engineer, Hardware Engineer, AI Systems Architect

NVIDIA's H100 costs $30,000 per card. For a startup serving 10 million inference requests per day with a model that fits in 1MB and runs in 0.5ms, the H100 is 10,000x overpowered. An FPGA implementation of the same model costs $1,500, fits in a standard PCIe slot, runs at 0.05ms latency, and uses 25 watts instead of 700. Knowing which hardware to use for which workload is a systems design skill that directly determines whether a product is economically viable.

When GPUs Are the Wrong Tool

The history of NVIDIA's dominance in ML is straightforward: CUDA was available, GPUs were fast at matrix multiplication, and the ML community standardized on this combination. But NVIDIA's success has a shadow: it has conditioned a generation of ML engineers to reach for a GPU by default, even when a GPU is the wrong answer.

Consider a production fraud detection system. The model is a gradient-boosted tree ensemble - fast and interpretable, deployed by most major banks. Input is a transaction with 200 features. Inference time on a well-optimized CPU: 50 microseconds. On a GPU you pay PCIe transfer overhead (5-10 microseconds) plus kernel launch latency (5-10 microseconds), and the actual compute is so simple that the GPU finishes in 2 microseconds. Total GPU latency: 15-25 microseconds before the result is copied back - a marginal win at best, one that disappears as soon as requests queue, purchased at roughly 200x the power consumption and 20x the cost of the CPU.

Now consider a different scenario: a satellite imagery processing pipeline that runs a 50-layer convolutional network on 10,000 image tiles per second, each 256x256 pixels, 24 hours a day. The total compute is enormous but highly regular - the same convolution kernel applied repeatedly to slightly different data. An FPGA implementation can pipeline this perfectly: each layer's output feeds directly into the next layer's input registers, with no memory transfers between layers, at a throughput that a GPU achieves only at power and cost levels that make the satellite operator's economics unworkable.

These two examples illustrate the core insight of this lesson: hardware acceleration is not "more powerful is better." It is about matching the computational pattern, data access pattern, latency requirement, power budget, and production volume to the right hardware implementation. A well-chosen FPGA beats a GPU not because it is more capable in general, but because it perfectly fits the specific workload.

This lesson covers the full hardware acceleration landscape: FPGAs, the ASIC design process, Google's TPU and its systolic array architecture, domain-specific accelerators, neuromorphic computing, photonic chips, and processing-in-memory. We will cover when to use each, how to program them, and how to do the economic analysis that determines which choice makes sense for your scale and workload.

Why This Exists - The Limits of General-Purpose Hardware

The von Neumann architecture that underlies both CPUs and GPUs has a fundamental bottleneck: the memory wall. Every computation requires fetching operands from memory, processing them in a small register file, and writing results back. The processor and memory are separate components connected by a bus with limited bandwidth. For compute-intensive workloads like matrix multiplication, the ratio of memory bandwidth to compute throughput determines whether you are memory-bound or compute-bound.

GPUs partially solve this with stacked HBM DRAM providing 2-3 TB/s of bandwidth and thousands of parallel execution units. But they are still general-purpose: the same hardware that computes matrix multiplications also handles branching, variable-length operations, dynamic memory allocation, and arbitrary data structures. Most of the transistors in an H100 serve generality that a specific ML inference workload never uses.

Domain-specific architectures eliminate this generality. A systolic array for matrix multiplication has no branch predictor, no out-of-order execution unit, no dynamic memory allocator - just an array of multiplier-accumulator cells connected in a grid pattern that perfectly maps to matrix multiplication. When you eliminate unused hardware, you can build more relevant hardware in the same die area, or achieve the same throughput with far fewer transistors (and thus far lower power and cost).

Historical Context - From DSPs to Neural Network Accelerators

Specialized computing hardware predates ML. Digital Signal Processors (DSPs) from the 1980s were the first domain-specific processors: fixed-point arithmetic, hardware multiply-accumulate, circular buffers for FIR filtering. DSPs powered modems, phone systems, and audio codecs for decades.

The first neural network accelerators appeared in the early 2010s, before the deep learning revolution. IBM's neurosynaptic research chips (2011, under the DARPA SyNAPSE program) explored spike-based computation. Google's TPU project began in 2013 (publicly revealed in 2016), motivated by the realization that if every user ran voice search for 3 minutes per day, Google would need to double its data center capacity to serve the inference on CPUs. The first-generation TPU's systolic array design delivered 15-30x higher inference performance and 30-80x better performance-per-watt than contemporary CPUs and GPUs.

Microsoft's Project Brainwave (2018) deployed FPGAs at scale in Azure for real-time AI inference. Amazon began deploying custom ML chips (Inferentia, 2019) in AWS. Apple integrated the Neural Engine into its A-series chips starting with the A11 in 2017, making it the first neural network accelerator to ship in billions of consumer devices.

By 2024, virtually every major technology company has custom ML silicon, and the startup ecosystem of AI accelerator companies (Cerebras, SambaNova, Groq, Graphcore, Tenstorrent) has attracted billions in investment. The era of GPU monoculture is ending - replaced by a diverse hardware landscape where the right choice depends on workload characteristics.

FPGAs for ML Inference

An FPGA (Field-Programmable Gate Array) is an integrated circuit containing a large array of configurable logic blocks (CLBs) connected by a programmable routing fabric. The logic blocks contain lookup tables (LUTs), flip-flops, and carry chains. DSP blocks provide high-throughput multiplier-accumulator operations. Block RAM (BRAM) provides on-chip storage.

The key distinction from a CPU or GPU: you are not programming software that runs on hardware. You are configuring the hardware itself. An FPGA inference engine is literally a circuit that implements your neural network as physical logic.

FPGA Resource Types

| Resource | Description | Use for ML |
|---|---|---|
| LUT (Look-Up Table) | 6-input boolean function generator | Control logic, activation functions |
| DSP (Digital Signal Processing block) | 18x25-bit or 27x18-bit MAC unit | Matrix multiplication, convolutions |
| BRAM (Block RAM) | 36 Kbit dual-port synchronous SRAM | Weight storage, activation buffers |
| URAM (UltraRAM, UltraScale+) | 288 Kbit single-port RAM | Large weight storage |
| HBM (High Bandwidth Memory, on some FPGAs) | ~460 GB/s stacked DRAM | Large model weights |

A Xilinx UltraScale+ VU9P (a flagship datacenter FPGA) has 6,840 DSP blocks. At a 500 MHz clock, each DSP performs one wide-precision multiply-accumulate per cycle: 6,840 * 500e6 = 3.42 tera-MACs per second. In INT8 mode with DSP packing, this rises to roughly 13.7 TOPS. Compare this to an A100 at 624 TOPS (INT8) - the FPGA is far less powerful on paper. But the FPGA wins on power (100W vs 400W), latency (a deterministic sub-millisecond pipeline vs variable GPU scheduling), and cost for medium-volume deployment.
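
As a back-of-the-envelope check, the peak-throughput arithmetic above can be captured in a few lines. The device figures are the ones quoted in this section; treat them as illustrative rather than vendor-guaranteed:

# Rough peak-throughput arithmetic for the figures quoted above.
def peak_tops(n_mac_units: int, clock_hz: float, macs_per_unit_per_cycle: int = 1,
              ops_per_mac: int = 2) -> float:
    """Peak throughput in TOPS = MAC units * clock * MACs/unit/cycle * ops/MAC."""
    return n_mac_units * clock_hz * macs_per_unit_per_cycle * ops_per_mac / 1e12

# VU9P-class FPGA: 6,840 DSPs at 500 MHz, one MAC each per cycle
print(peak_tops(6_840, 500e6, ops_per_mac=1))               # ~3.4 tera-MACs/s
print(peak_tops(6_840, 500e6, macs_per_unit_per_cycle=2))   # ~13.7 TOPS with INT8 packing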

FPGA Architecture for Neural Network Inference

The critical property of FPGA inference: data flows through the pipeline without touching external memory between layers. On a GPU, every layer's output is written to DRAM and the next layer reads it back from DRAM. On an FPGA, the output register of layer N feeds directly into the input register of layer N+1 at single-cycle latency. For a 10-layer network, this eliminates 9 round-trips to DRAM - the dominant latency component in GPU inference for small models.
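
A toy latency model of that difference, using made-up per-layer compute times and an assumed per-layer DRAM round-trip cost (the numbers are placeholders, not measurements):

# Toy model: pipelined on-chip dataflow vs per-layer DRAM round-trips.
def end_to_end_latency_us(layer_compute_us, dram_roundtrip_us=0.0):
    """Sum per-layer compute, plus one DRAM round-trip between consecutive layers."""
    n = len(layer_compute_us)
    return sum(layer_compute_us) + dram_roundtrip_us * max(n - 1, 0)

layers = [5.0] * 10                                            # 10 layers, 5 us each (placeholder)
print(end_to_end_latency_us(layers))                           # FPGA-style: activations stay on chip
print(end_to_end_latency_us(layers, dram_roundtrip_us=10.0))   # GPU-style: 9 extra round-trips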

Vitis AI - Xilinx's ML Deployment Framework

"""
Xilinx Vitis AI Python API for deploying quantized models to Xilinx FPGAs.
Vitis AI handles the compilation from PyTorch/TensorFlow to FPGA bitstream.

Prerequisites:
- Xilinx Alveo U50/U250 or Versal AI Edge card
- Vitis AI runtime: pip install vitis-ai-library
- Model quantized with vai_q_pytorch (Vitis AI quantizer)
"""
import numpy as np
from typing import List, Optional
import time


def quantize_model_for_fpga(
model,
calibration_data,
output_dir: str = "./quantized_model",
quant_mode: str = "calib",
):
"""
Quantize a PyTorch model for FPGA deployment using Vitis AI quantizer.

Vitis AI uses calibration-based post-training quantization:
1. Run forward passes with calibration data to measure activation ranges
2. Convert weights and activations to INT8 (or INT16 for critical layers)
3. Export quantized model in xmodel format for FPGA compilation

The quantization error is typically < 1% accuracy drop for classification
and < 2% for regression tasks.
"""
try:
from pytorch_nndct.apis import torch_quantizer
import torch
except ImportError:
print("Install Vitis AI toolkit: follow Xilinx installation guide")
return None

# Create quantizer in calibration mode
quantizer = torch_quantizer(
quant_mode,
module=model,
input_args=(next(iter(calibration_data))[0],),
output_dir=output_dir,
bitwidth=8, # INT8 quantization
)

quant_model = quantizer.quant_model

# Run calibration forward passes
quant_model.eval()
import torch
with torch.no_grad():
for batch_idx, (data, _) in enumerate(calibration_data):
quant_model(data)
if batch_idx >= 100: # 100 batches sufficient for calibration
break

# Export quantized model
quantizer.export_quant_config()
print(f"Quantized model exported to {output_dir}")

return quant_model


class VitisAIInferenceEngine:
    """
    FPGA inference engine using the Vitis AI runtime.
    Provides the same Python interface as a PyTorch model for easy integration.
    """

    def __init__(
        self,
        xmodel_path: str,
        device_id: int = 0,
        batch_size: int = 1,
    ):
        """
        Initialize the FPGA inference engine.

        Args:
            xmodel_path: Path to compiled .xmodel file (output of Vitis AI compiler)
            device_id: FPGA device index (0 for first Alveo card)
            batch_size: Fixed batch size (FPGA pipelines are sized at compile time)
        """
        self.xmodel_path = xmodel_path
        self.batch_size = batch_size

        try:
            import xir
            import vart
        except ImportError:
            print("Vitis AI runtime not installed. Install per Xilinx documentation.")
            self._runner = None
            return

        # Load the compiled model graph
        graph = xir.Graph.deserialize(xmodel_path)
        subgraphs = self._get_child_subgraph(graph)

        # Create runner (manages hardware resource allocation on the FPGA)
        self._runner = vart.Runner.create_runner(subgraphs[0], "run")

        # Get input/output tensor descriptors
        self._input_tensors = self._runner.get_input_tensors()
        self._output_tensors = self._runner.get_output_tensors()

        print("FPGA inference engine initialized:")
        print(f"  Model: {xmodel_path}")
        print(f"  Input shape: {self._input_tensors[0].dims}")
        print(f"  Output shape: {self._output_tensors[0].dims}")

    def infer(self, input_data: np.ndarray) -> np.ndarray:
        """
        Run inference on the FPGA.

        Args:
            input_data: numpy array with shape matching the model input
        Returns:
            numpy array with the model output
        """
        if self._runner is None:
            raise RuntimeError("FPGA runner not initialized")

        # Allocate output buffer
        output_shape = tuple(self._output_tensors[0].dims)
        output_data = np.zeros(output_shape, dtype=np.float32)

        # Submit inference job (asynchronous)
        job_id = self._runner.execute_async(
            [input_data.astype(np.float32)],
            [output_data],
        )

        # Wait for completion
        self._runner.wait(job_id)

        return output_data

    def benchmark(self, n_iterations: int = 1000) -> dict:
        """Run a latency and throughput benchmark on the FPGA."""
        # Generate random input
        input_shape = tuple(self._input_tensors[0].dims)
        dummy_input = np.random.randn(*input_shape).astype(np.float32)

        # Warmup
        for _ in range(10):
            self.infer(dummy_input)

        # Benchmark
        latencies = []
        for _ in range(n_iterations):
            t0 = time.perf_counter()
            self.infer(dummy_input)
            latencies.append((time.perf_counter() - t0) * 1000)

        latencies.sort()
        return {
            "p50_ms": latencies[len(latencies) // 2],
            "p99_ms": latencies[int(len(latencies) * 0.99)],
            "throughput_per_sec": 1000 / latencies[len(latencies) // 2],
        }

    def _get_child_subgraph(self, graph):
        """Extract the subgraphs suitable for DPU execution."""
        children = graph.get_root_subgraph().toposort_child_subgraph()
        return [c for c in children
                if c.has_attr("device") and c.get_attr("device").upper() == "DPU"]

def estimate_fpga_resources(
    model_params_millions: float,
    input_activations_mb: float,
    precision_bits: int = 8,
    clock_mhz: int = 300,
) -> dict:
    """
    Estimate FPGA resource requirements for deploying a model.
    Useful for selecting the right FPGA card before committing to synthesis.

    This is a rough model - actual synthesis results vary by architecture.
    """
    # Weight storage (BRAM/URAM); each BRAM tile holds 36 Kbit = 4,608 bytes
    weight_bytes = model_params_millions * 1e6 * (precision_bits / 8)
    weight_brams = weight_bytes / (36 * 1024 / 8)

    # Activation buffer (intermediate results between layers)
    activation_brams = (input_activations_mb * 1024 * 1024) / (36 * 1024 / 8)

    # DSP blocks (each handles one MAC operation per cycle):
    # size the array so the pipeline sustains the target sample rate
    target_samples_per_sec = 10_000
    cycles_per_sample = clock_mhz * 1e6 / target_samples_per_sec
    ops_per_sample = model_params_millions * 2e6  # 2 ops per parameter (multiply + add)
    dsps_needed = ops_per_sample / cycles_per_sample

    # LUT usage: roughly 2x DSP count for control logic and activation functions
    luts_needed = dsps_needed * 2

    # Select an appropriate FPGA card
    cards = {
        "Alveo U50": {"dsps": 5952, "brams": 1344, "urams": 320, "cost_usd": 8000},
        "Alveo U250": {"dsps": 12288, "brams": 2688, "urams": 960, "cost_usd": 20000},
        "Alveo U55C": {"dsps": 9024, "brams": 984, "urams": 640, "cost_usd": 15000},
        "Versal AI Core VC1902": {"dsps": 1968, "brams": 967, "urams": 463,
                                  "cost_usd": 7000, "ai_engines": 400},  # dedicated vector engines
    }

    recommended = None
    for name, specs in cards.items():
        if (specs["dsps"] >= dsps_needed * 1.2 and
                specs["brams"] >= (weight_brams + activation_brams) * 1.2):
            recommended = name
            break

    print("\nFPGA Resource Estimation:")
    print(f"  Model size: {model_params_millions:.1f}M parameters")
    print(f"  Weight storage: {weight_bytes/1e6:.0f} MB ({weight_brams:.0f} BRAMs)")
    print(f"  Activation buffer: {input_activations_mb:.1f} MB ({activation_brams:.0f} BRAMs)")
    print(f"  DSPs needed: {dsps_needed:.0f}")
    print(f"  LUTs estimated: {luts_needed:.0f}")
    print(f"  Recommended card: {recommended or 'Consider multi-FPGA or GPU instead'}")

    return {
        "weight_brams": weight_brams,
        "activation_brams": activation_brams,
        "dsps_needed": dsps_needed,
        "recommended_card": recommended,
    }
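
A hypothetical end-to-end usage sketch tying these pieces together. The paths, the data loader, and the compiled .xmodel name are placeholders, and the xmodel compilation step itself happens outside Python with the Vitis AI compiler:

# Hypothetical workflow: quantize, estimate resources, then run on the FPGA.
# Assumes `model` is a trained torch.nn.Module and `calib_loader` is a DataLoader.
quantized = quantize_model_for_fpga(model, calib_loader, output_dir="./quantized_model")

estimate_fpga_resources(model_params_millions=4.2, input_activations_mb=1.5)

# After compiling the exported model with the Vitis AI compiler,
# load the resulting .xmodel onto the card and benchmark it.
engine = VitisAIInferenceEngine("./compiled/model.xmodel")
stats = engine.benchmark(n_iterations=1000)
print(stats)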

ASIC Design Flow - From RTL to Silicon

An ASIC (Application-Specific Integrated Circuit) is a chip designed for one specific purpose. Unlike FPGAs, ASICs have no reconfigurability - the logic is etched permanently in silicon. This eliminates the overhead of the FPGA routing fabric, enabling 5-10x better performance per watt and 2-4x better performance per area at the cost of massive Non-Recurring Engineering (NRE) investment.

The ASIC design flow runs from RTL design and verification through logic synthesis, physical design, and signoff, ending in tape-out to the foundry (each phase is described in Q5 at the end of this lesson).

NRE costs (2024 estimates, TSMC 5nm process):

  • Full custom ASIC tape-out: $30-100 million
  • Engineering team (RTL, verification, physical design): $5-20 million per year
  • Post-silicon bring-up and debug: $2-5 million
  • Total time from spec to first silicon: 18-36 months

This is why ASIC design is only economical at high volume. The NRE amortizes across every chip produced. At 10,000 chips, each chip carries $5,000-$10,000 of NRE overhead. At 10,000,000 chips (smartphone scale), the NRE overhead is $5-$10 per chip.
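
The amortization arithmetic is trivial but worth making explicit (the NRE figure is illustrative, taken from the range quoted above):

# NRE amortization: per-chip overhead falls linearly with production volume.
nre_usd = 50_000_000  # illustrative, within the $30-100M range quoted above
for volume in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{volume:>12,} chips -> ${nre_usd / volume:>10,.2f} NRE per chip")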

Google TPU - Systolic Arrays in Production

Google's Tensor Processing Unit is the most deployed ML ASIC in the world. Its core innovation is the systolic array, a 2D grid of simple multiply-accumulate cells where data flows rhythmically from cell to cell, like a heartbeat ("systole").

How a Systolic Array Computes Matrix Multiplication

For multiplying matrix A (MxK) by matrix B (KxN):

$$C_{ij} = \sum_{k=0}^{K-1} A_{ik} \cdot B_{kj}$$

In a systolic array, B's columns are pre-loaded into the array's MAC units. A's rows are fed in from the left, advancing one cell per cycle. Each cell multiplies the A element it receives by the B element stored locally and passes the accumulated result down. After K cycles, the full dot product for one C element has accumulated.

The critical properties:

  1. Each weight value is loaded into the cell once and reused for all input rows - this is the maximum possible weight reuse, eliminating DRAM bandwidth for weights during computation.
  2. Computation is perfectly pipelined - once the pipeline fills, every MAC cell in the array is active on every cycle.
  3. No on-chip memory traffic for weights during computation - weights stay in the MAC cells.
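
To make the dataflow concrete, here is a small cycle-level simulation of a weight-stationary systolic array in NumPy. It is a didactic sketch, not how a TPU is actually implemented: B's values are held in the cells, rows of A are fed in from the left with a time skew, and partial sums flow down one cell per cycle.

import numpy as np


def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cycle-level simulation of a weight-stationary systolic array.

    Cell (k, n) permanently holds B[k, n]. Elements of A enter at the left
    edge of row k (skewed in time by k cycles), travel rightward one cell per
    cycle, and partial sums travel downward one cell per cycle. Finished dot
    products emerge from the bottom row.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"

    a_reg = np.zeros((K, N))  # A value currently latched in each cell
    p_reg = np.zeros((K, N))  # partial sum currently latched in each cell
    C = np.zeros((M, N))

    for cycle in range(M + K + N):  # enough cycles to fill and drain the pipeline
        # Collect results that finished at the bottom row on the previous cycle:
        # cell (K-1, n) completed C[i, n] where i = cycle - K - n.
        for n in range(N):
            i = cycle - K - n
            if 0 <= i < M:
                C[i, n] = p_reg[K - 1, n]

        # Update cells bottom-right to top-left so each read sees last cycle's value.
        for k in reversed(range(K)):
            for n in reversed(range(N)):
                if n > 0:
                    a_in = a_reg[k, n - 1]                    # from the cell to the left
                else:
                    i = cycle - k                             # skewed feed of row i of A
                    a_in = A[i, k] if 0 <= i < M else 0.0
                p_in = p_reg[k - 1, n] if k > 0 else 0.0      # from the cell above
                p_reg[k, n] = p_in + a_in * B[k, n]           # one MAC per cell per cycle
                a_reg[k, n] = a_in
    return C


A = np.random.randn(4, 3)
B = np.random.randn(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)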

TPU v4 (2023) has a 128x128 systolic array running at 1.1 GHz, delivering 275 TOPS (INT8) per chip. For transformer training (large GEMMs, near-perfect systolic array utilization), TPU v4 pods achieve efficiency that NVIDIA GPUs cannot match on a per-dollar basis for Google's specific workloads.

The limitation: systolic arrays excel at regular dense matrix multiplication but are inefficient for sparse operations, dynamic shapes, and models with complex control flow. This is why TPUs are powerful for training standard transformer architectures but challenging for research models with unusual architectures.

Intel OpenVINO with FPGA Backend

Intel's OpenVINO (Open Visual Inference and Neural network Optimization) toolkit supports FPGA deployment through the Intel FPGA plugin:

"""
Intel OpenVINO with FPGA acceleration (Intel Arria 10 / Stratix 10).
OpenVINO provides a unified API across CPU, GPU, VPU, and FPGA backends.
"""
from openvino.runtime import Core, CompiledModel, InferRequest
import numpy as np
import time
from typing import Optional, Dict


def compile_model_for_fpga(
model_path: str,
device: str = "HETERO:FPGA,CPU", # FPGA with CPU fallback for unsupported ops
config: Optional[Dict] = None,
) -> CompiledModel:
"""
Compile an OpenVINO model for FPGA execution.

The HETERO device automatically partitions the network:
- Supported layers run on FPGA (most conv, fully connected, pooling)
- Unsupported layers fall back to CPU (some normalization, custom ops)

Args:
model_path: Path to OpenVINO IR (.xml) or ONNX (.onnx) model
device: Target device. "FPGA" for FPGA-only, "HETERO:FPGA,CPU" for hybrid
config: Optional configuration dict for performance tuning

Returns:
CompiledModel ready for inference
"""
core = Core()

if config is None:
config = {
# Enable dynamic batch size (FPGA supports up to configured max)
"PERFORMANCE_HINT": "THROUGHPUT",
}

# Read and compile the model
print(f"Loading model: {model_path}")
model = core.read_model(model_path)

print(f"Compiling for device: {device}")
compiled = core.compile_model(model, device, config)

# Report which layers went to FPGA vs CPU
if "HETERO" in device:
try:
exec_graph = compiled.get_runtime_model()
fpga_layers = [n for n in exec_graph.get_ops()
if n.get_rt_info().get("affinity", "") == "FPGA"]
cpu_layers = [n for n in exec_graph.get_ops()
if n.get_rt_info().get("affinity", "") == "CPU"]
print(f"Layer distribution: {len(fpga_layers)} on FPGA, {len(cpu_layers)} on CPU")
except Exception:
pass

return compiled


def run_fpga_inference_openvino(
    compiled_model: CompiledModel,
    inputs: np.ndarray,
    async_infer: bool = True,
) -> np.ndarray:
    """
    Run inference using OpenVINO with the FPGA backend.

    Async inference is strongly recommended for FPGA: it allows the CPU
    to prepare the next batch while the FPGA processes the current one.
    """
    infer_request = compiled_model.create_infer_request()

    if async_infer:
        # Asynchronous: submit and wait (enables pipelining multiple requests)
        infer_request.start_async(inputs={0: inputs})
        infer_request.wait()
    else:
        # Synchronous: blocks until completion
        infer_request.infer(inputs={0: inputs})

    # Retrieve output
    output_tensor = infer_request.get_output_tensor(0)
    return output_tensor.data.copy()


def benchmark_fpga_vs_cpu_openvino(
    model_path: str,
    test_input: np.ndarray,
    n_iterations: int = 500,
) -> Dict:
    """
    Compare FPGA vs CPU inference latency and throughput using OpenVINO.
    """
    core = Core()
    model = core.read_model(model_path)

    results = {}

    for device_name in ["CPU", "HETERO:FPGA,CPU"]:
        try:
            compiled = core.compile_model(model, device_name)
            infer_request = compiled.create_infer_request()

            # Warmup
            for _ in range(20):
                infer_request.infer(inputs={0: test_input})

            # Benchmark
            latencies = []
            for _ in range(n_iterations):
                t0 = time.perf_counter()
                infer_request.infer(inputs={0: test_input})
                latencies.append((time.perf_counter() - t0) * 1000)

            latencies.sort()
            results[device_name] = {
                "p50_ms": latencies[len(latencies) // 2],
                "p99_ms": latencies[int(len(latencies) * 0.99)],
                "throughput": 1000.0 / latencies[len(latencies) // 2],
            }
            print(f"\n{device_name}:")
            print(f"  P50 latency: {results[device_name]['p50_ms']:.2f} ms")
            print(f"  P99 latency: {results[device_name]['p99_ms']:.2f} ms")
            print(f"  Throughput: {results[device_name]['throughput']:.0f} inferences/sec")

        except RuntimeError as e:
            print(f"{device_name} not available: {e}")

    return results
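
A hypothetical usage sketch of the helpers above (the model path and input shape are placeholders):

# Hypothetical usage of the OpenVINO helpers above.
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)   # placeholder input shape
compiled = compile_model_for_fpga("model.xml", device="HETERO:FPGA,CPU")
output = run_fpga_inference_openvino(compiled, dummy)
benchmark_fpga_vs_cpu_openvino("model.xml", dummy, n_iterations=500)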

Processing-In-Memory (PIM)

The memory wall - the fact that moving data between compute and memory consumes far more energy than the compute itself - motivates Processing-In-Memory (PIM): placing compute logic inside or very near the memory arrays, so data never has to travel far.

For ML inference, the critical operation is matrix-vector multiplication (MVM) during the forward pass. In standard architectures:

  1. Load weight matrix row from DRAM (expensive: ~4nJ per 64-byte cache line)
  2. Multiply with input vector in ALU (cheap: ~0.1nJ per FMA)
  3. Store result back to DRAM (expensive again)

Energy breakdown for a single layer inference: 95% of energy is data movement, 5% is computation. PIM flips this by doing the multiplication inside the DRAM arrays.

Samsung's HBM-PIM (2021) integrates SIMD ALUs inside each HBM2 die. The ALUs operate on data without reading it out of the memory array, achieving 2x bandwidth and 2.5x energy efficiency for AI workloads compared to standard HBM2.

SK Hynix's AiM (AI-in-Memory, 2022) integrates multiply-accumulate units alongside each DRAM bank, delivering on the order of 1 TOPS per chip by exploiting internal bank-level bandwidth far above the ~50 GB/s available at a conventional DRAM interface - enabling compute to happen where data lives.

For ML inference specifically, analog in-memory computing (analog PIM) takes this further: weights are stored as conductances in memristor or PCM (phase-change memory) devices. Kirchhoff's current law naturally computes the matrix-vector product in analog: input voltages on the rows, weight conductances on crossbar junctions, output currents on columns are the vector product. A single pass through the memristor crossbar computes one entire layer in one step, at energy cost proportional to the number of non-zero weights rather than the number of operations.
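
A minimal numerical sketch of the crossbar idea. It is idealized: real devices add conductance quantization, noise, drift, and ADC precision limits.

import numpy as np

# Idealized memristor crossbar: weights stored as conductances G (siemens),
# inputs applied as row voltages v (volts). By Ohm's law each junction passes
# current G[i, j] * v[i]; Kirchhoff's current law sums the currents on each
# column, so the vector of column currents is the matrix-vector product G.T @ v.
rng = np.random.default_rng(0)
G = rng.uniform(0.0, 1e-4, size=(128, 64))   # conductance matrix (one layer's weights)
v = rng.uniform(0.0, 0.2, size=128)          # input activations encoded as voltages

column_currents = G.T @ v                    # the entire layer's MVM in "one step"
assert np.allclose(column_currents,
                   np.array([np.dot(G[:, j], v) for j in range(64)]))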

Neuromorphic Computing - Intel Loihi

Neuromorphic computing takes inspiration from biological neural networks - not the simplified mathematical model used in ML, but the actual spiking behavior of biological neurons.

Intel's Loihi 2 (2021) is the leading research platform for neuromorphic computing. Key properties:

  • 1 million "neurons" per chip, each a leaky-integrate-and-fire (LIF) unit
  • Communication via sparse spike events - a neuron only fires when its membrane potential exceeds a threshold
  • Learning rules implemented in hardware: spike-timing-dependent plasticity (STDP)
  • Power consumption: 1-10mW for typical spiking workloads (vs 400W for an A100)

The energy efficiency advantage comes from sparsity. In a biological neural network, most neurons are silent most of the time. A spiking network that is 90% sparse at any given time only computes 10% of the connections. The energy cost is proportional to the number of spikes, not the total number of connections.
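
A tiny leaky-integrate-and-fire sketch illustrating why energy tracks spike count rather than connection count (all parameters are arbitrary):

import numpy as np

# Leaky integrate-and-fire neuron: the membrane potential leaks toward zero,
# integrates input current, and emits a spike only when it crosses a threshold.
# Energy is spent (roughly) per spike, not per connection.
rng = np.random.default_rng(1)
n_steps, threshold, leak = 1000, 1.0, 0.95
input_current = rng.random(n_steps) * 0.12   # weak drive (arbitrary units)

v, spikes = 0.0, 0
for t in range(n_steps):
    v = leak * v + input_current[t]
    if v >= threshold:      # fire and reset
        spikes += 1
        v = 0.0

print(f"{spikes} spikes in {n_steps} steps -> activity factor {spikes / n_steps:.1%}")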

Current limitations: programming neuromorphic hardware requires fundamentally different algorithms (spiking neural networks, SNN), not the standard backprop-trained models. State-of-the-art SNNs lag standard ANNs on most benchmarks by 5-15% accuracy. The hardware is not commercially available at scale. Neuromorphic computing remains primarily a research domain as of 2024, but is actively pursued for edge inference applications where power is the binding constraint.

Photonic Computing

Photonic computing uses light instead of electrons for computation. The motivation: photons travel at the speed of light, do not interact with each other (no electron-electron scattering), and can carry multiple data streams simultaneously on different wavelengths (wavelength-division multiplexing).

For ML inference, the core operation is matrix-vector multiplication. In a photonic chip:

  • Input values are encoded as light intensity or phase
  • Weights are implemented as Mach-Zehnder interferometers (beam splitters with tunable splitting ratio)
  • The matrix multiply happens as light propagates through the interferometer array
  • Output is measured by photodetectors

The key claims: photonic MVM operates at the speed of light, consumes energy only for encoding/decoding, not for computation (photons do not dissipate energy as they propagate through transparent waveguides). For a specific matrix size, photonic compute can be 10-100x more energy-efficient than electronic compute.

Lightmatter (startup, founded 2017) has demonstrated photonic AI accelerators in silicon photonics. Their "Passage" chip uses wavelength-division multiplexing to implement multiple matrix multiplies simultaneously on different wavelength channels. As of 2024, photonic computing for ML is pre-commercial and faces significant engineering challenges in precision (photonic ADCs/DACs limit to 4-8 bits effective precision vs 8-16 bits needed for inference), temperature sensitivity, and manufacturing yield.

Economic Analysis - FPGA vs GPU vs ASIC

"""
Cost-per-inference analysis across hardware platforms.
Helps decide which hardware makes economic sense for your scale and workload.
"""
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HardwarePlatform:
name: str
device_cost_usd: float # purchase price
nre_cost_usd: float # non-recurring engineering cost (0 for COTS)
dev_months: int # months to production deployment
power_watts: float # idle power (worst case)
peak_tops_int8: float # INT8 throughput (TOPS)
utilization_pct: float # expected utilization for target workload
effective_throughput_qps: int # actual inference queries per second
support_life_years: int # expected useful life
maintenance_cost_per_year: float


PLATFORMS = {
"gpu_a100": HardwarePlatform(
name="NVIDIA A100 80GB SXM",
device_cost_usd=30_000,
nre_cost_usd=0,
dev_months=1, # PyTorch model runs with minimal changes
power_watts=400,
peak_tops_int8=624,
utilization_pct=60, # typical for batch=1 inference
effective_throughput_qps=5_000,
support_life_years=5,
maintenance_cost_per_year=3_000,
),
"gpu_t4": HardwarePlatform(
name="NVIDIA T4 16GB",
device_cost_usd=2_500,
nre_cost_usd=0,
dev_months=1,
power_watts=70,
peak_tops_int8=130,
utilization_pct=55,
effective_throughput_qps=800,
support_life_years=5,
maintenance_cost_per_year=500,
),
"fpga_alveo_u50": HardwarePlatform(
name="Xilinx Alveo U50 (FPGA)",
device_cost_usd=8_000,
nre_cost_usd=200_000, # Vitis AI implementation: ~2-3 months of 2 engineers
dev_months=3,
power_watts=75,
peak_tops_int8=13,
utilization_pct=90, # pipeline can be near fully utilized
effective_throughput_qps=2_000,
support_life_years=7,
maintenance_cost_per_year=1_000,
),
"asic_custom": HardwarePlatform(
name="Custom ASIC (TSMC 7nm, 500K units)",
device_cost_usd=150, # marginal cost at 500K volume
nre_cost_usd=25_000_000, # design, verification, tape-out
dev_months=24,
power_watts=5,
peak_tops_int8=50,
utilization_pct=95,
effective_throughput_qps=50_000,
support_life_years=8,
maintenance_cost_per_year=50_000, # amortized sustaining engineering
),
"aws_inferentia": HardwarePlatform(
name="AWS Inferentia2 (via inf2 instance)",
device_cost_usd=0, # pay-per-use, amortize annual cost
nre_cost_usd=20_000, # AWS Neuron SDK port: ~1 month engineering
dev_months=1,
power_watts=75, # per chip (inf2.8xlarge has 1 chip)
peak_tops_int8=190,
utilization_pct=65,
effective_throughput_qps=3_000,
support_life_years=5,
maintenance_cost_per_year=5_000,
),
}


def cost_per_inference_analysis(
    target_qps: int = 10_000,
    hours_per_day: float = 24,
    days_per_year: float = 365,
    electricity_cost_per_kwh: float = 0.10,
    production_volume: int = 1,     # number of deployments sharing the NRE
) -> None:
    """
    Full cost-per-inference analysis for each hardware platform.

    Args:
        target_qps: Required inference queries per second
        hours_per_day: Operating hours per day
        days_per_year: Operating days per year
        electricity_cost_per_kwh: Data center electricity cost
        production_volume: For ASIC analysis - amortizes the NRE across deployments
    """
    hours_per_year = hours_per_day * days_per_year
    total_inferences_per_year = target_qps * 3600 * hours_per_year

    print("\nCost-Per-Inference Analysis")
    print(f"Target: {target_qps:,} QPS, {total_inferences_per_year/1e9:.1f}B inferences/year")
    print("=" * 80)

    for key, hw in PLATFORMS.items():
        # Number of units needed to meet the target QPS
        n_units = -(-target_qps // hw.effective_throughput_qps)  # ceiling division

        # Capital cost
        hardware_cost = n_units * hw.device_cost_usd
        nre_cost = hw.nre_cost_usd / production_volume  # NRE amortized across deployments

        # Annual operating cost
        power_kw = n_units * hw.power_watts / 1000
        annual_power_cost = power_kw * hours_per_year * electricity_cost_per_kwh
        annual_maint_cost = n_units * hw.maintenance_cost_per_year

        # Total cost of ownership over the hardware lifetime
        total_capex = hardware_cost + nre_cost
        annual_opex = annual_power_cost + annual_maint_cost
        tco_total = total_capex + annual_opex * hw.support_life_years

        # Per-inference cost
        total_inferences_lifetime = total_inferences_per_year * hw.support_life_years
        cost_per_million = tco_total / (total_inferences_lifetime / 1_000_000)

        print(f"\n{hw.name}:")
        print(f"  Units needed:           {n_units:>8,}")
        print(f"  Development time:       {hw.dev_months:>8} months")
        print(f"  Hardware capex:        ${hardware_cost:>12,.0f}")
        print(f"  NRE cost:              ${nre_cost:>12,.0f}")
        print(f"  Annual power cost:     ${annual_power_cost:>12,.0f} ({power_kw:.1f} kW)")
        print(f"  Annual maintenance:    ${annual_maint_cost:>12,.0f}")
        print(f"  {hw.support_life_years}-year TCO:            ${tco_total:>12,.0f}")
        print(f"  Cost per 1M inferences: ${cost_per_million:>8.4f}")


def when_does_asic_make_sense(
    fpga_unit_cost_usd: float = 8_000,
    fpga_nre_usd: float = 200_000,
    asic_nre_usd: float = 25_000_000,
    asic_marginal_cost_usd: float = 50,
    annual_volume_units: int = 10_000,
) -> dict:
    """
    Calculate the breakeven volume where an ASIC becomes cheaper than an FPGA.

    The fundamental equation:
        ASIC_total = ASIC_NRE + asic_unit_cost * volume
        FPGA_total = FPGA_NRE + fpga_unit_cost * volume

    Setting ASIC_total = FPGA_total and solving for volume:
        volume_breakeven = (ASIC_NRE - FPGA_NRE) / (fpga_unit_cost - asic_unit_cost)
    """
    # Effective per-unit cost at the stated annual volume (hardware + amortized NRE)
    fpga_effective_per_unit = fpga_unit_cost_usd + fpga_nre_usd / annual_volume_units
    asic_effective_per_unit = asic_marginal_cost_usd + asic_nre_usd / annual_volume_units

    breakeven_units = (asic_nre_usd - fpga_nre_usd) / (fpga_unit_cost_usd - asic_marginal_cost_usd)

    print("\nFPGA vs ASIC Breakeven Analysis:")
    print(f"  FPGA effective cost per unit (at {annual_volume_units:,} units): ${fpga_effective_per_unit:,.0f}")
    print(f"  ASIC effective cost per unit (at {annual_volume_units:,} units): ${asic_effective_per_unit:,.0f}")
    print(f"  Breakeven volume: {breakeven_units:,.0f} units")
    print(f"  At {annual_volume_units:,} units/year, breakeven time: {breakeven_units/annual_volume_units:.1f} years")

    return {
        "fpga_effective_per_unit": fpga_effective_per_unit,
        "asic_effective_per_unit": asic_effective_per_unit,
        "breakeven_units": breakeven_units,
    }

Decision Framework - Choosing the Right Hardware
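
The written framework below (the production notes plus Q4) boils down to a handful of questions. A minimal sketch of that decision logic follows; the thresholds are illustrative assumptions taken from this lesson, not hard rules:

def recommend_hardware(model_updates_per_quarter: int,
                       latency_budget_ms: float,
                       power_budget_watts: float,
                       annual_unit_volume: int) -> str:
    """Illustrative decision sketch distilled from the notes and Q&A below."""
    if annual_unit_volume >= 500_000:
        return "Consider a custom ASIC: volume amortizes the $30M+ NRE"
    stable_model = model_updates_per_quarter <= 1
    tight_latency_or_power = latency_budget_ms < 2 or power_budget_watts < 50
    if stable_model and tight_latency_or_power and annual_unit_volume >= 100:
        return "FPGA: stable model, strict latency/power, enough units to amortize NRE"
    if power_budget_watts < 1:
        return "Neuromorphic / microwatt-class accelerator (research territory)"
    return "GPU (or a cloud accelerator like Inferentia): fastest to deploy, easiest to update"


print(recommend_hardware(model_updates_per_quarter=12, latency_budget_ms=20,
                         power_budget_watts=300, annual_unit_volume=10))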

Production Engineering Notes

When FPGA Makes Sense for ML Inference

Three conditions together justify FPGA deployment: (1) the model architecture is stable and changes infrequently (loading a new bitstream takes minutes, but regenerating one takes hours - see the warning below), (2) latency requirements are strict (sub-millisecond) or power is the binding constraint, and (3) batch size 1 or near-1 is required (GPUs are inefficient at batch size 1).

The classic FPGA ML use case is edge inference: a computer vision model deployed on a drone, satellite, medical device, or industrial sensor. The model must run at 30+ fps, on battery power, at -40 to +85 degrees Celsius ambient temperature, on a $200 bill-of-materials budget. No GPU meets all of these constraints at once; an FPGA implementation does.

The NRE Cost Trap

The most common mistake when evaluating custom accelerators: forgetting to amortize the full NRE cost. An FPGA implementation requires:

  • 2-4 weeks learning Vitis AI / HLS tools
  • 4-8 weeks quantizing and adapting the model
  • 4-8 weeks implementing and debugging the FPGA pipeline
  • 2-4 weeks verification and testing

That is 3-6 months of an experienced engineer at $150,000-$250,000 fully-loaded annual cost: $37,500-$125,000 in engineering time. For a deployment of 10 FPGAs, this is $3,750-$12,500 per device in NRE - comparable to or greater than the device cost itself.

The FPGA path only makes financial sense when the engineering investment is amortized over many devices (100+) or when the power/latency advantages justify the premium at lower volumes.

:::warning FPGA Inference Pipelines Break on Model Updates An FPGA bitstream implements a specific model architecture with specific quantization parameters. When you update the model (new architecture, retrained weights, changed output classes), you must re-synthesize the bitstream, which takes hours of compilation time. For services where models update frequently (weekly retraining), GPU deployment with its instantaneous model hot-swap capability is far more operationally practical. :::

:::danger Quantization Accuracy Loss Can Be Larger on FPGA than GPU FPGA inference typically uses INT8 or INT16 fixed-point arithmetic. The quantization calibration process that minimizes accuracy loss requires representative calibration data and careful per-layer bit-width selection. Rushing this process - using the wrong calibration dataset, or applying uniform INT8 without checking outlier activations - can cause accuracy drops of 5-15% that are invisible in your development environment but catastrophic in production. Always validate the quantized FPGA model against a held-out test set that includes edge cases before deployment. :::

Interview Questions and Answers

Q1: Explain the systolic array architecture in Google's TPU and why it is more efficient than a GPU for transformer training.

A systolic array is a 2D grid of simple multiply-accumulate (MAC) cells. In the TPU's matrix multiply unit (MXU), the array is 128x128 cells. Matrix B's values are pre-loaded into the cells. Matrix A's rows are fed in from the left edge, advancing one cell per clock cycle. Each cell multiplies its incoming A value by its stored B weight and accumulates into a running partial sum, then passes the A value to the right. After K cycles (where K is the inner dimension of the matrix product), the output row accumulates in the bottom cells.

The critical efficiency advantage: weight values are loaded into cells once and stay there while an entire batch of inputs flows through. For a transformer attention layer computing QK^T with batch size 64, sequence length 512, and head dimension 64, the same key matrix K is multiplied against all 64 queries in the batch. In the systolic array, K is loaded once; all 64 queries flow through while K remains in the cells. The ratio of weight loads to useful computation is driven to its minimum.

On a GPU, every matrix multiply reads weights from HBM (2TB/s bandwidth) for every batch element. For small batch sizes or frequent weight access, GPU DRAM bandwidth is the bottleneck. The TPU's systolic array eliminates this bottleneck entirely for dense matrix multiplication.

The limitation: systolic arrays have poor utilization for irregular shapes (attention with variable sequence lengths), sparse operations (pruned models), and models with complex control flow. GPU's general SIMT (single instruction, multiple thread) model handles these cases gracefully; the TPU cannot.

Q2: A startup wants to deploy a 5MB computer vision model for inference on an industrial sensor at 30fps, 25W power budget, $500 device cost. Walk through the hardware decision.

The constraints eliminate most options immediately. GPU: a Jetson Nano module costs a minimum of roughly $300; at 5W idle and about 10W under inference load it fits the power budget, but it consumes $300 of the $500 budget for the compute module alone, leaving little for other components. Performance is more than adequate.

FPGA: Xilinx Artix-7 or Lattice ECP5 FPGAs cost $20-80 in volume and draw 2-5W for typical inference pipelines; quantized to INT8, a 5MB FP32 model shrinks to roughly 1.25MB and fits in the on-chip BRAM of a mid-range part (an Artix-7 XC7A200T has about 13 Mb of BRAM). At 30fps with a 5MB model, the compute requirement is modest - easily achievable on a mid-range FPGA.

The NRE analysis: implementing the model in Vitis AI or Xilinx HLS takes 2-3 engineer-months. At a $200K/year engineering rate, that is $33,000-$50,000 in NRE. If you produce 10,000 devices, the NRE is $3-$5 per device - acceptable. If you produce 100, it is $330-$500 per device - prohibitive.

For 100 devices: use the Jetson Nano. For 10,000+ devices: design an FPGA implementation. The roughly $470 per-unit bill-of-materials savings at 10,000 units ($4.7M total) more than justifies the $50,000 engineering investment. Payback time: 1-2 months of production.

Q3: What is the memory wall, and how do Processing-In-Memory architectures address it for neural network inference?

The memory wall refers to the growing disparity between processor compute speed and memory bandwidth. A modern GPU can perform 300+ TOPS (INT8) but has only 2TB/s of HBM bandwidth. For a matrix-vector multiplication where the weight matrix is 1GB, loading the weights from HBM takes 0.5ms regardless of how fast the compute units are. The processor sits idle waiting for data.

The energy manifestation is even more stark: moving 64 bytes from DRAM to a compute unit consumes approximately 4-8 nanojoules. The multiply-accumulate operations on those 16 float32 values consume approximately 0.05 nanojoules. Data movement consumes 80-160x more energy than computation. For ML inference, where every forward pass reads all model weights, the hardware mostly spends energy moving data, not computing.

PIM addresses this by placing compute logic adjacent to or inside memory arrays. Samsung's HBM-PIM integrates SIMD ALUs inside each HBM2 DRAM die. Instead of reading weights out to the GPU compute die, the multiply-accumulate happens inside the DRAM die where the data is stored. The effective bandwidth is 2x (data does not leave the die), and the energy per MAC drops by 4-10x because data travels micrometers instead of centimeters.

Analog PIM (memristor crossbars) goes further: the matrix-vector product occurs in a single step using Kirchhoff's current law, with energy proportional to the number of non-zero weights rather than the total number of operations.

Q4: When should a company choose FPGA over GPU for ML inference, and what are the three most common mistakes in the evaluation?

Choose FPGA when all three of the following are true: (1) the model is stable and changes less frequently than once per quarter, (2) either batch size 1 latency is critical (under 2ms) or power is strictly constrained (under 50W), and (3) production volume is sufficient to amortize the NRE investment (typically 100+ units for a commercial FPGA solution).

Common mistakes:

First, comparing peak theoretical TOPS. An Alveo U50 has 13 TOPS INT8. An A100 has 624 TOPS INT8. The FPGA looks 48x weaker. But for batch size 1 inference with a 5-layer CNN, the FPGA pipeline has no scheduling overhead, no PCIe transfer latency, and 100% utilization on every cycle. The FPGA latency is 0.05ms; the A100 latency is 1-2ms because GPU overhead dominates for tiny models.

Second, forgetting the NRE cost. The FPGA card is $8,000 and "seems cheaper" than a $30,000 A100. But implementing the model in Vitis AI costs $50,000-$150,000 in engineering time. At 10 units, the FPGA carries $5,000-$15,000 more per unit in NRE alone.

Third, not accounting for model update velocity. A team that retrains and deploys a new model every week on GPU does so with a docker push and a rolling restart. The same update on FPGA requires re-synthesizing the bitstream (4-8 hours of Vivado compilation), re-characterizing quantization, re-running validation, and reloading the bitstream onto every deployed device. For fast-moving ML products, this operational overhead is often the deciding factor against FPGA deployment.

Q5: Describe the ASIC design flow from RTL to working silicon, and explain why the NRE cost for a 5nm ASIC is $30-100 million.

The ASIC design flow has six major phases. RTL design (6-18 months): engineers write the hardware description in Verilog or SystemVerilog, describing logic gates and their interconnections. A typical ML accelerator has 50,000-500,000 lines of RTL. Verification (runs in parallel, often longer than design): proves the RTL is correct using simulation with hundreds of thousands of test vectors, formal property checking, and emulation. Verification typically consumes 60-70% of the total design effort.

Logic synthesis converts RTL to a gate-level netlist using a process design kit (PDK) from the foundry. Physical design (place and route) maps the logical netlist to actual transistor positions on silicon and routes metal wires between them. Signoff verifies that timing, power, and physical design rules are met - this requires running complex EDA tools that cost $5-10M/year in licenses.

Tape-out is submitting the final GDSII layout file to the foundry (TSMC, Samsung). For TSMC 5nm, the mask set costs $10-20 million - physical photomasks used to pattern each metal layer. Fabrication takes 3-6 months.

The $30-100M NRE breaks down roughly as: $10-20M in EDA tool licenses and compute for simulation/verification, $10-20M in engineering headcount (20-50 engineers for 18-24 months), $5-15M in physical verification and timing closure (multiple rounds of iteration), and $10-20M for the mask set. Each design bug found post-tape-out that requires a respin costs another $10-20M and 3-6 months of delay.

Q6: What is the current state and practical limitations of photonic and neuromorphic computing for production ML inference?

Photonic computing for ML inference is at TRL (Technology Readiness Level) 3-4 out of 9 - proof of concept demonstrated in lab, not yet production-ready. The fundamental advantages are real: matrix multiplication in a photonic chip consumes energy only for encoding inputs and reading outputs, not for the computation itself. Lightmatter has demonstrated a photonic chip performing BERT inference at lower energy than electronic alternatives.

The practical limitations blocking production deployment: (1) Precision - photonic ADC/DAC converters currently achieve 4-8 bits effective precision. ML inference requires 8-16 bits for acceptable accuracy. Increasing precision requires higher-power optical components that erode the energy efficiency advantage. (2) Temperature sensitivity - Mach-Zehnder interferometer splitting ratios drift with temperature at ~1pm/degree Celsius. Maintaining stable weights requires active thermal control, adding power and complexity. (3) Reprogrammability - changing weights requires heating thermo-optic phase shifters, which is slow (milliseconds) compared to electronic bit writes (nanoseconds). (4) Manufacturing yield - silicon photonics requires much tighter fabrication tolerances than electronic chips, resulting in higher defect rates and costs.

Neuromorphic computing via Intel Loihi 2 is at TRL 4-5: demonstrated on specific tasks (keyword detection, sparse coding, optimization problems) with compelling energy efficiency (1-10mW vs 400W for GPU). The main limitation is programmability: standard neural networks cannot run on Loihi directly. Spiking neural network (SNN) training algorithms are less mature, SNN accuracy on standard benchmarks lags ANN accuracy by 5-15%, and there is no equivalent of PyTorch for SNNs. Neuromorphic is compelling for specific applications (always-on edge sensing at microwatt power) but not a general-purpose replacement for GPU-based inference as of 2024.
