
Groq LPU Architecture

Reading time: ~35 min · Interview relevance: High · Target roles: ML Systems Engineer, AI Infrastructure, LLM Platform Engineer

Groq's LPU sustains around 280 tokens/second for LLaMA-3 70B at batch size 1, and 500-750 tokens/second for 8B models. A100-based serving of the same 70B model delivers well under 100 tokens/second at batch size 1. The difference is not clock speed or FLOPS - it is that Groq eliminated the memory roundtrip that GPUs are forced to take on every single generated token.

The Voice Assistant That Changed the Team's Thinking

The product team at a mid-size AI startup had spent six months building a real-time voice assistant. The pipeline was straightforward: speech-to-text, then an LLM for response generation, then text-to-speech. The speech and TTS components were fast - under 100ms combined. The LLM was the problem.

They were running LLaMA-3 8B on an A100. Serving one request at a time, the model produced about 120 tokens/second. That felt acceptable. But their 90th percentile latency told a different story. When two or three users spoke simultaneously - a completely normal scenario during business hours - latency ballooned. The GPU was trying to batch multiple decoding sequences together, and per-sequence performance dropped sharply as batch size grew.

They tried Groq's API on a Tuesday afternoon. Single-request throughput was 500 tokens/second for the 8B model. First-token latency was under 150ms. The voice assistant felt instantaneous. But when they simulated 50 concurrent users all calling the API at once, they noticed something odd: the total throughput didn't scale the way they expected. Groq was handling one sequence at a time beautifully but wasn't designed for massive concurrent batch processing the way a GPU cluster is.

The team had hit one of the fundamental trade-offs in custom silicon: every architecture optimizes for a specific workload shape. Groq's architecture is purpose-built for one thing - low-latency sequential token generation. Understanding why it achieves that, and exactly where it trades away other capabilities, is what this lesson is about.

The experience also highlighted something deeper. The bottleneck in LLM inference had nothing to do with compute. The A100 has 312 TFLOPS of tensor core throughput. Generating a single token from a 7B model requires roughly 7 billion multiply-accumulate operations - about 14 GFLOPs. At 312 TFLOPS, the theoretical compute time is 0.045 milliseconds. So why does it take 7-8ms in practice? The answer is memory bandwidth. And Groq built an entire chip architecture around eliminating that specific bottleneck.

Why This Exists - The Memory Wall in LLM Inference

To understand Groq you need to understand one equation. During autoregressive generation - producing tokens one at a time - the model performs a forward pass through every layer for every single token. That forward pass must read every weight in the model from memory.

A 7B parameter model in FP16 occupies 14GB of memory. Each token generation requires reading those 14GB. The A100 SXM has 80GB HBM2e with 2TB/s bandwidth:

$$\text{Token time} = \frac{14\ \text{GB}}{2000\ \text{GB/s}} = 7\ \text{ms} \quad \Rightarrow \quad \approx 143\ \text{tokens/sec}$$

This is the memory bandwidth ceiling. You cannot exceed it on an A100 at batch size 1, regardless of how much compute you add. The chip is not compute-bound - it is memory-bandwidth-bound. The tensor cores sit mostly idle while the memory system slowly feeds them weights.

The only way to break this ceiling is to either (a) increase memory bandwidth dramatically, or (b) eliminate the memory read entirely by keeping the weights somewhere the chip can access without going off-chip. Option (a) is what NVIDIA pursues with HBM3 and HBM3e - and they have pushed bandwidth to 3.35 TB/s on H100. Option (b) is what Groq did.

SRAM is fundamentally faster than DRAM. On-chip SRAM in modern silicon can deliver 10-50x lower latency than off-chip HBM. The catch is density - SRAM requires more silicon area per bit than DRAM. You cannot put 80GB of SRAM on a GPU die without making the chip prohibitively large and expensive. But for models that fit within a few hundred megabytes, SRAM becomes viable.

Groq's insight was that many real-world inference workloads use quantized 7B-13B models. An 8B model in INT8 is 8GB. Spread across multiple chips, it fits in on-chip SRAM. And once it fits, the memory bandwidth calculation changes entirely.

GPU Inference (batch=1):
Token 1: Load 14GB weights from HBM → compute → output token
Token 2: Load 14GB weights from HBM → compute → output token
Token N: Load 14GB weights from HBM → compute → output token
Cost: N × (14GB / HBM_bandwidth)

Groq Inference:
Startup: Load weights into SRAM once (or compile them in statically)
Token 1: Read weights from on-chip SRAM → compute → output token
Token 2: Read weights from on-chip SRAM → compute → output token
Token N: Read weights from on-chip SRAM → compute → output token
Cost: N × (14GB / SRAM_bandwidth) where SRAM_bandwidth >> HBM_bandwidth

This is the core of the Groq value proposition. The weights never leave the chip. Every token generation is a cheap in-chip read, not a full roundtrip through HBM.
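To make the ceiling concrete, here is a minimal sketch that plugs the bandwidth figures quoted in this lesson into the token-time formula. The Groq figure is aggregate internal SRAM bandwidth, so its result is an upper bound, not a measured throughput:

# Back-of-the-envelope token-time ceiling for memory-bound decoding (batch=1).
# Figures are the ones quoted in this lesson; real systems add overheads.

def tokens_per_sec_ceiling(model_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Every token must stream all weights once; bandwidth sets the ceiling."""
    token_time_sec = model_bytes / bandwidth_bytes_per_sec
    return 1.0 / token_time_sec

GB = 1e9
TB = 1e12

model_7b_fp16 = 14 * GB  # 7B params x 2 bytes

for name, bw in [
    ("A100 HBM2e (2.0 TB/s)", 2.0 * TB),
    ("H100 HBM3 (3.35 TB/s)", 3.35 * TB),
    ("Groq SRAM (~80 TB/s aggregate)", 80 * TB),
]:
    print(f"{name}: ~{tokens_per_sec_ceiling(model_7b_fp16, bw):.0f} tokens/sec ceiling")

# A100: ~143 tokens/sec (matches the 7 ms/token calculation above)
# H100: ~239 tokens/sec
# Groq: ~5700 tokens/sec upper bound - in practice, compute and
#       inter-chip communication dominate long before this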

Historical Context - From TPU Origins to LPU

Groq was founded in 2016 by Jonathan Ross and a team of engineers who had worked on Google's original TPU; Ross is often credited with initiating the TPU project at Google. The founding insight was that the TPU architecture, while revolutionary for training, still relied on dynamic scheduling and caching mechanisms that introduced unpredictability.

The goal was something more radical: a chip where every memory access and every compute operation is scheduled at compile time. No caches. No branch prediction. No out-of-order execution. No dynamic dispatch. Everything known ahead of time.

The timeline:

  • 2016 - Groq founded, Series A from Social Capital and other investors
  • 2019 - First public demo of GroqChip at MLPerf, showing exceptional inference performance
  • 2020 - GroqChip 1 released, 230MB SRAM per chip, used in first GroqCard PCIe accelerators
  • 2021 - GroqRack unveiled, multi-chip configurations for larger models
  • 2023 - Second-generation GroqChip, improved SRAM capacity and interconnect
  • 2024 - Groq coined the term "LPU" (Language Processing Unit) as a product category, GroqCloud API launched publicly with OpenAI-compatible REST interface

The LPU name is marketing, not a new architecture. The underlying chip has been the TSP (Tensor Streaming Processor) from the beginning. But "LPU" accurately captures the product positioning: this chip is optimized specifically for language model inference, not general matrix operations.

GroqCloud's API uses the same format as OpenAI's API. You can drop it in as a replacement with a single endpoint and key change. This made adoption frictionless for teams already using LangChain, LlamaIndex, or direct OpenAI client code.

Core Architecture - The Tensor Streaming Processor

The GroqChip is built around an architecture called the Tensor Streaming Processor (TSP). Understanding the TSP requires first understanding what it deliberately does not have - because the absences are as important as the features.

What the TSP Eliminates

No caches. Traditional CPUs and GPUs use multi-level cache hierarchies (L1, L2, L3) to hide memory latency. The cache is a bet: "the data we need next is probably nearby in memory, so we'll keep recent data close." This bet is usually right, but "usually" is not "always," and cache misses cause stalls. Groq has no caches because it does not need to guess - the compiler already knows exactly what data is needed at every cycle.

No branch prediction. CPUs speculatively execute code paths before knowing which branch will be taken. This speculative execution was the source of Spectre and Meltdown vulnerabilities. Groq has no speculative execution because the execution path is fixed at compile time.

No out-of-order execution. Modern CPUs reorder instructions dynamically to keep execution units busy while waiting for slow memory. Groq has no out-of-order execution because the compiler already produced an optimal in-order schedule with no pipeline stalls.

No dynamic dispatch. The runtime does not decide what runs when. The compiler wrote a cycle-accurate schedule before the chip ever received the model.

All of these omissions serve the same purpose: removing the sources of non-determinism from the chip. When nothing is dynamic, every operation takes a known, fixed number of cycles. The chip becomes a predictable machine.

SRAM-First Memory Architecture

Each GroqChip contains 230MB of on-chip SRAM. The SRAM bandwidth is approximately 80 TB/s of aggregate internal bandwidth across all compute units. For comparison:

Memory Type      Capacity (per chip/device)   Bandwidth           Latency
--------------------------------------------------------------------------
Groq SRAM        230 MB per chip              ~80 TB/s internal   ~nanoseconds
H100 HBM3        80 GB                        3.35 TB/s           ~hundreds of ns
A100 HBM2e       80 GB                        2.0 TB/s            ~hundreds of ns
DDR5 (CPU)       per DIMM                     ~100 GB/s           ~100 ns

The SRAM figure is not directly comparable to the HBM numbers: HBM bandwidth is measured across an external memory bus, while the Groq figure is aggregate internal bandwidth - from the SRAM cells directly to the compute units, with no external bus crossing. The latency difference is the key: reading from on-chip SRAM is 10-100x faster than reading from HBM.

For a 7B model in INT8 (8GB), you need multiple chips. A GroqCard contains 2 chips = 460MB SRAM. A GroqRack contains 9 GroqCards = 18 chips = 4.14 GB SRAM. The model is partitioned across chips, with each chip holding a slice of each layer's weights.
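A minimal capacity-math sketch using this lesson's per-chip and per-rack figures (the even split across chips is a simplifying assumption; the real compiler assigns tensors to specific chips and addresses):

import math

# Capacity math for fitting a model into on-chip SRAM, per this lesson's figures.
SRAM_PER_CHIP_GB = 0.23   # 230 MB per GroqChip
CHIPS_PER_CARD = 2        # per this lesson's GroqCard description
CARDS_PER_RACK = 9        # 18 chips, 4.14 GB SRAM per GroqRack

def chips_needed(model_gb: float) -> int:
    return math.ceil(model_gb / SRAM_PER_CHIP_GB)

def racks_needed(model_gb: float) -> int:
    chips_per_rack = CHIPS_PER_CARD * CARDS_PER_RACK
    return math.ceil(chips_needed(model_gb) / chips_per_rack)

for name, size_gb in [("8B INT8", 8.0), ("70B INT8", 70.0)]:
    print(f"{name}: {chips_needed(size_gb)} chips, {racks_needed(size_gb)} rack(s)")

# 8B INT8:  35 chips  -> 2 racks (18 chips each)
# 70B INT8: 305 chips -> 17 racks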

Compiler-Scheduled Execution

The Groq compiler is where the real intelligence lives. When you compile a model for Groq:

  1. The compiler analyzes the full model graph
  2. It assigns every weight tensor to specific SRAM addresses on specific chips
  3. It generates a cycle-accurate schedule: at cycle T, read these weights from SRAM address X, pipe them to compute unit Y, store result at SRAM address Z
  4. It schedules chip-to-chip communication over the high-speed interconnect with exact timing
  5. The resulting "program" is a static instruction stream with zero branching

This compilation step takes time - minutes to hours for large models. But you do it once. After compilation, every inference run executes the exact same instruction stream at the exact same cycle offsets. The chip has no decisions to make at runtime.

The consequence of this is deterministic latency. Not approximate, not "usually consistent" - mathematically guaranteed. If the compiler says a 200-token generation takes 400ms, it will take 400ms on the 1st run and the 10,000th run. This is unlike GPUs, where OS scheduling, CUDA stream management, memory allocation, and kernel launch overhead all introduce variability.

GPU execution timeline (batch=1, 200 tokens):
|←12ms→| token 1 [kernel launch + HBM read + compute]
|←9ms→| token 2 [variable - depends on memory state]
|←14ms→| token 3 [cache effect from other processes]
...
P95 latency != P50 latency

Groq execution timeline (batch=1, 200 tokens):
|←2ms→| token 1 [SRAM read + compute, static schedule]
|←2ms→| token 2 [identical - same instruction stream]
|←2ms→| token 3 [guaranteed - no variability sources]
...
P99 latency == P50 latency
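To make the idea of a static instruction stream concrete, here is a toy Python model of compile-time scheduling. This is purely illustrative - it is not Groq's compiler IR or instruction format - but it shows why execution time cannot vary when every operation carries a fixed cycle:

# Toy illustration of compile-time scheduling: every operation is pinned to an
# exact cycle before execution begins. NOT Groq's real instruction format.

from dataclasses import dataclass

@dataclass(frozen=True)
class ScheduledOp:
    cycle: int   # exact cycle this op fires - fixed at compile time
    unit: str    # which functional unit executes it
    op: str      # what it does

# A "compiled program" is just a fixed list - no branches, no runtime decisions.
program = [
    ScheduledOp(cycle=0, unit="sram",   op="read weights W0 @ addr 0x0000"),
    ScheduledOp(cycle=1, unit="matmul", op="x @ W0"),
    ScheduledOp(cycle=5, unit="sram",   op="write activations @ addr 0x8000"),
    ScheduledOp(cycle=6, unit="serdes", op="send activations to chip 1"),
]

def run(program: list[ScheduledOp]) -> int:
    """Executing the stream takes the same number of cycles every time."""
    last = 0
    for instr in program:
        last = instr.cycle  # nothing can stall or reorder: schedule == reality
    return last + 1

print(f"Total cycles: {run(program)} - identical on run 1 and run 10,000")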

GroqRack - Scaling to Larger Models

A single GroqCard (2 chips, 460MB SRAM) handles models up to ~460MB. A quantized Llama-3 8B is 8GB - too large for one card. GroqRack addresses this.

A GroqRack contains:

  • 9 GroqCards
  • 18 GroqChip TSPs total
  • 4.14 GB aggregate on-chip SRAM
  • 4.1 Tb/s chip-to-chip bandwidth (via high-speed SerDes interconnect)

The chips are connected in a topology designed by the Groq compiler. Each chip holds a partition of the model. When a token is generated, the activation flows through the chips in sequence, passing through each layer's partition. The chip-to-chip communication is also statically scheduled - the compiler knows exactly when each chip will need to send activations to the next.

For a 70B model in INT8 (70GB), you need multiple GroqRacks. GroqCloud provisions these transparently via the API.

The Math - Why SRAM Wins for Autoregressive Decoding

Let's work through the numbers precisely. This is the calculation that explains why Groq's architecture is fast.

Roofline Analysis for Token Generation

The roofline model separates workloads into two categories:

  • Compute-bound: the bottleneck is arithmetic throughput (FLOPS)
  • Memory-bound: the bottleneck is memory bandwidth (bytes/second)

For autoregressive generation at batch size 1, the arithmetic intensity is:

$$\text{Arithmetic Intensity} = \frac{\text{FLOPs per token}}{\text{Bytes loaded per token}}$$

For a transformer layer with hidden dimension $d$:

  • FLOPs per token: $\approx 2d^2$ (attention + FFN, simplified)
  • Bytes loaded (FP16 weights): $\approx 2d^2$ bytes ($d^2$ weights × 2 bytes)

$$\text{Arithmetic Intensity} = \frac{2d^2}{2d^2} = 1\ \text{FLOP/byte}$$

This is extremely low. The A100's ridge point - the arithmetic intensity at which compute, rather than memory, becomes the bottleneck - is about 156 FLOP/byte (312 TFLOPS / 2.0 TB/s HBM). Since 1 << 156, the workload is entirely memory-bandwidth-bound. You are using less than 1% of the A100's compute capability - the rest waits for memory.

On Groq with SRAM:

$$\text{Token time} = \frac{\text{Model size in bytes}}{\text{Effective SRAM bandwidth}}$$

But here's the key insight: with SRAM, those weight reads are served at far lower latency than HBM, and the access pattern is perfectly sequential and fully known to the compiler - so there is no effective latency penalty on top of the raw bandwidth.

Batch Size Sensitivity

The picture changes at large batch sizes. When you process B sequences simultaneously:

$$\text{Arithmetic Intensity at batch } B = \frac{2Bd^2}{2d^2} = B\ \text{FLOP/byte}$$

At batch sizes around the ridge point (~156 for an A100), the GPU becomes compute-bound and approaches full FLOPS utilization. A GPU cluster with 8 H100s, well-batched at B=256+, can process thousands of tokens per second in aggregate - far more than Groq's per-rack throughput.
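A minimal roofline sketch, using the A100 figures above and the simplified intensity of B FLOP/byte (real kernels and real rooflines differ in the details):

# Roofline sketch: at what batch size does decoding stop being memory-bound?
PEAK_FLOPS = 312e12   # A100 dense FP16 tensor-core throughput
HBM_BW = 2.0e12       # A100 HBM2e bytes/sec

ridge_point = PEAK_FLOPS / HBM_BW  # FLOP/byte where compute becomes the limit
print(f"Ridge point: {ridge_point:.0f} FLOP/byte")  # ~156

def attainable_flops(batch_size: int) -> float:
    """Attainable throughput = min(compute roof, intensity x bandwidth)."""
    intensity = batch_size  # B FLOP/byte under the simplification above
    return min(PEAK_FLOPS, intensity * HBM_BW)

for b in [1, 8, 32, 156, 512]:
    util = attainable_flops(b) / PEAK_FLOPS
    print(f"batch={b:>4}: {util:6.1%} of peak FLOPS usable")

# batch=1:   0.6% - almost all compute sits idle waiting on HBM
# batch=156: 100% - the workload crosses into compute-bound territory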

This is the fundamental trade-off:

Workload shape               Groq wins?   GPU cluster wins?
-----------------------------------------------------------
Single user, low latency     Yes          No (HBM-bound)
Voice assistant (B=1-4)      Yes          Marginal
API with B=32+               No           Yes (compute-bound)
Model training               No           Yes (always)
Embedding generation         Depends      Usually yes
Batch inference jobs         No           Yes

GroqCloud API - Practical Usage

GroqCloud provides an OpenAI-compatible REST API. The groq Python library is a thin wrapper with the same interface as openai.

Installation and Basic Usage

pip install groq
from groq import Groq

client = Groq(api_key="your_groq_api_key")

completion = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in 3 sentences."},
    ],
    temperature=0.7,
    max_tokens=512,
)

print(completion.choices[0].message.content)
print(f"Tokens generated: {completion.usage.completion_tokens}")
print(f"Generation time: {completion.usage.completion_time:.3f}s")
print(f"Tokens/sec: {completion.usage.completion_tokens / completion.usage.completion_time:.1f}")

Available Models (as of 2024)

# Available models on GroqCloud
GROQ_MODELS = {
    "llama-3.1-8b-instant": {
        "params": "8B",
        "context": 128_000,
        "typical_tps": 750,  # tokens per second at batch=1
    },
    "llama-3.1-70b-versatile": {
        "params": "70B",
        "context": 128_000,
        "typical_tps": 280,
    },
    "llama-3.3-70b-versatile": {
        "params": "70B",
        "context": 128_000,
        "typical_tps": 275,
    },
    "mixtral-8x7b-32768": {
        "params": "46.7B total, ~12.9B active (MoE)",
        "context": 32_768,
        "typical_tps": 500,
    },
    "gemma2-9b-it": {
        "params": "9B",
        "context": 8_192,
        "typical_tps": 500,
    },
}

Benchmarking Latency and Throughput

import statistics
import time

from groq import Groq

client = Groq(api_key="your_groq_api_key")

def benchmark_groq(model: str, prompt: str, n_runs: int = 10):
    """
    Measure end-to-end latency and tokens-per-second throughput.
    Groq's deterministic execution means variance should be very low.
    """
    latencies = []
    throughputs = []

    for _ in range(n_runs):
        start = time.perf_counter()

        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=200,
            stream=False,
        )

        elapsed = time.perf_counter() - start
        tokens = response.usage.completion_tokens

        latencies.append(elapsed * 1000)  # ms
        throughputs.append(tokens / elapsed)

    return {
        "model": model,
        "p50_latency_ms": statistics.median(latencies),
        "p95_latency_ms": sorted(latencies)[int(0.95 * n_runs)],
        "mean_tps": statistics.mean(throughputs),
        "std_tps": statistics.stdev(throughputs),
        "cv_percent": (statistics.stdev(throughputs) / statistics.mean(throughputs)) * 100,
    }

# Run benchmark
result = benchmark_groq(
    model="llama-3.1-8b-instant",
    prompt="Describe the architecture of a transformer model.",
    n_runs=20,
)

print(f"P50 latency: {result['p50_latency_ms']:.1f} ms")
print(f"P95 latency: {result['p95_latency_ms']:.1f} ms")
print(f"Mean TPS: {result['mean_tps']:.1f}")
print(f"Std TPS: {result['std_tps']:.1f}")
print(f"CV (variance): {result['cv_percent']:.1f}%")
# Groq CV is typically <2% - GPU APIs are often 10-30%

Streaming for Voice Applications

import time

from groq import Groq

client = Groq(api_key="your_groq_api_key")

def stream_with_ttft(model: str, prompt: str):
    """
    Measure Time to First Token (TTFT) - critical for interactive UX.
    Groq's SRAM architecture keeps TTFT extremely low.
    """
    start = time.perf_counter()
    first_token_time = None
    full_response = ""

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
        stream=True,
    )

    for chunk in stream:
        if chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.perf_counter()
                ttft_ms = (first_token_time - start) * 1000
                print(f"TTFT: {ttft_ms:.1f} ms")
            full_response += chunk.choices[0].delta.content

    total_time = time.perf_counter() - start
    print(f"Total time: {total_time*1000:.0f} ms")
    print(f"Response length: {len(full_response.split())} words")
    return full_response

# For a voice assistant, TTFT < 200ms feels instantaneous to users
response = stream_with_ttft(
    model="llama-3.1-8b-instant",
    prompt="What's the weather in San Francisco today?",
)

Drop-in Replacement for OpenAI

import os

# Option 1: the groq library is a drop-in replacement with the same interface
from groq import Groq

client = Groq(api_key=os.environ.get("GROQ_API_KEY"))  # base URL is set by the library

# Option 2: use the openai library directly with Groq's base URL
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1",
)

# Now all existing OpenAI-style code works with Groq
response = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Hello!"}],
)

Architecture Decision - When to Use Groq

Understanding the architecture is what lets you make the right tool choice: reach for Groq when per-request latency dominates, and for batched GPU serving when aggregate throughput and cost per token dominate. The workload-shape table above and the production notes below are the decision criteria.

Production Engineering Notes

Rate Limiting and Concurrency

GroqCloud has strict rate limits because Groq hardware handles one sequence at a time per chip cluster. Unlike GPUs that batch naturally, Groq queues requests. Check your rate limits:

import time

from groq import Groq, RateLimitError

client = Groq(api_key="your_key")

def call_with_backoff(messages, model="llama-3.1-8b-instant", max_retries=5):
    """
    Groq rate limits are tokens-per-minute based.
    Use exponential backoff when you hit them.
    """
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=512,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # exponential backoff: 1, 2, 4, 8, 16 seconds
            print(f"Rate limited. Waiting {wait}s (attempt {attempt+1}/{max_retries})")
            time.sleep(wait)

Concurrency Pattern for Batch Jobs

import asyncio

from groq import AsyncGroq

client = AsyncGroq(api_key="your_key")

async def process_single(prompt: str, semaphore: asyncio.Semaphore):
    """Process one prompt, respecting the concurrency limit."""
    async with semaphore:
        response = await client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=200,
        )
        return response.choices[0].message.content

async def batch_process(prompts: list[str], max_concurrent: int = 10):
    """
    Groq recommends max 10-20 concurrent requests.
    More than this hits rate limits without improving throughput
    because the hardware processes them sequentially anyway.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [process_single(p, semaphore) for p in prompts]
    return await asyncio.gather(*tasks)

# Usage
prompts = ["Explain neural networks.", "What is backpropagation?"] * 50
results = asyncio.run(batch_process(prompts))

Monitoring Latency Consistency

import time
from collections import deque

from groq import Groq

client = Groq(api_key="your_key")

class GroqLatencyMonitor:
    """
    Track Groq latency over time.
    Groq's deterministic execution means variance should stay <5%.
    A sudden increase in variance signals system-level issues
    (network, rate limiting).
    """

    def __init__(self, window_size: int = 100):
        self.latencies = deque(maxlen=window_size)
        self.tps_samples = deque(maxlen=window_size)

    def record(self, latency_ms: float, tps: float):
        self.latencies.append(latency_ms)
        self.tps_samples.append(tps)

    def stats(self):
        lats = sorted(self.latencies)
        n = len(lats)
        if n < 2:
            return {}
        return {
            "p50_ms": lats[n // 2],
            "p95_ms": lats[int(0.95 * n)],
            "p99_ms": lats[int(0.99 * n)],
            "mean_tps": sum(self.tps_samples) / len(self.tps_samples),
        }

monitor = GroqLatencyMonitor()

def monitored_call(prompt: str):
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    tps = resp.usage.completion_tokens / (elapsed_ms / 1000)
    monitor.record(elapsed_ms, tps)
    return resp

Trade-offs Summary

Dimension                 Groq LPU                                 GPU cluster
------------------------------------------------------------------------------
Batch-1 latency           Excellent (weights resident in SRAM)     HBM-bound
Latency variance          Deterministic - P99 ≈ P50                P99 often 2-4x P50
High-batch throughput     Limited                                  Excellent when well-batched
Model capacity            230 MB SRAM per chip, multi-chip beyond  80 GB HBM per GPU
Training / fine-tuning    Not supported                            Fully supported

Common Mistakes

:::danger Expecting Groq to Scale Like a GPU at High Batch Sizes

Groq's architecture is optimized for low-latency sequential decoding. When you send 100 concurrent requests to GroqCloud, the hardware processes them sequentially or with minimal parallelism - not in a large batch like a GPU would. The result is that aggregate throughput per dollar may actually be worse than a well-batched GPU for high-volume jobs.

Use Groq when latency per request matters. Use GPU APIs when you have hundreds of requests per second and care about aggregate throughput and cost.

:::

:::danger Assuming SRAM Means Unlimited Speed

Groq's 230MB per chip is a hard limit. A 70B INT8 model is 70GB - it requires 300+ chips (over 150 GroqCards) spread across many GroqRacks. The chip-to-chip communication overhead across a large configuration reduces the effective speedup compared to the single-chip case. For very large models, the advantage over well-configured H100 clusters narrows significantly.

Always check the specific model you need against the GroqCloud documentation on which hardware tier it runs on.

:::

:::warning Groq Does Not Support Training

The TSP architecture is compiled at model load time. There is no mechanism for gradient flow, dynamic graph construction, or weight updates. If your workload involves any fine-tuning, LoRA, or online learning, Groq is not the right tool. Use GPU infrastructure for training and optionally switch to Groq for inference serving.

:::

:::warning Compile Time is Not Inference Time

For cloud API usage this is invisible. But if you are using Groq hardware on-premise, model compilation can take hours for large models. Plan for cold-start time. Keep compiled model artifacts cached. Do not restart hardware unnecessarily.

:::

Interview Questions and Answers

Q1: Why does Groq's LPU achieve much higher inference throughput than a GPU for autoregressive generation at batch size 1?

A: The bottleneck in autoregressive generation at batch size 1 is not compute - it is memory bandwidth. Each generated token requires loading the full model weights to perform the forward pass. A 7B FP16 model is 14GB. On an A100 with 2 TB/s HBM bandwidth, each token takes a minimum of 7ms just for memory reads, regardless of how fast the matrix multiplications run. The arithmetic intensity at batch=1 is approximately 1 FLOP/byte, far below the GPU's compute-bound threshold.

Groq eliminates this bottleneck by storing model weights in on-chip SRAM. Once loaded (or compiled in), the weights are never read from slow off-chip memory. SRAM access is 10-100x lower latency than HBM. Combined with a deterministic compiler-scheduled execution model that has no runtime overhead, this allows Groq to generate tokens at 500-750 tokens/second for 8B models versus 120-150 on an A100.

Q2: What is deterministic execution and why does Groq's architecture depend on it?

A: Deterministic execution means that every memory access and compute operation is scheduled at compile time with exact cycle timing. There are no caches, no branch predictors, no out-of-order execution, and no dynamic dispatch in the TSP. The compiler analyzes the full model graph and produces a static instruction stream where every event is assigned a precise cycle offset.

This is important for two reasons. First, it enables the SRAM architecture to work - because all data access patterns are known at compile time, there is no need for caches (which exist to handle unpredictable access patterns). Second, it produces truly constant latency - not just "usually fast" but mathematically guaranteed. This is valuable for SLA-bound production systems and voice applications where P99 latency matters as much as median latency.

Q3: How does the memory bandwidth ceiling calculation work, and at what batch size does a GPU escape it?

A: The roofline model gives us the arithmetic intensity of a workload: FLOPs / bytes-loaded. For a transformer at batch size 1, arithmetic intensity is approximately 1 FLOP/byte. The A100's ridge point (where compute becomes the bottleneck instead of memory) is 312 TFLOPS / 2.0 TB/s = ~156 FLOP/byte. Since 1 << 156, batch=1 inference is purely memory-bound and uses less than 1% of available compute.

As batch size B increases, arithmetic intensity scales as B FLOP/byte (the weights are read once but used for B sequences). At B ≈ 156, the A100 crosses its ridge point and becomes compute-bound. In practice, batch sizes of 32-64 already put a GPU into a useful operating range where it exploits a meaningful fraction of its FLOPS. This is why GPU inference farms need careful request batching to be cost-efficient.

Q4: What are the practical limitations of Groq's SRAM-based architecture?

A: Three main limitations. First, capacity: 230MB per chip limits model size. Large models require multi-chip GroqRack configurations, which reintroduce some latency from chip-to-chip communication - a 70B INT8 model needs 300+ chips spanning multiple racks. Second, no training support: the static compiler cannot propagate gradients or update weights, so Groq is inference-only. Third, concurrency: unlike GPUs, which can batch hundreds of sequences together to improve throughput, Groq processes sequences with limited batching, so high-concurrency workloads hit throughput saturation earlier than a GPU cluster.

Q5: If you're building a production LLM API serving thousands of requests per minute, when would you choose Groq over a GPU cluster?

A: Groq wins when per-request latency is the primary metric and requests are largely independent. Specifically: voice assistants where TTFT under 200ms is required, interactive coding assistants, real-time customer support where response time affects user satisfaction, and any SLA contract where P99 latency must stay below a threshold. Groq's deterministic execution makes hitting tight latency SLAs much easier to guarantee.

A GPU cluster wins when you need: high aggregate throughput at lower cost per token, training or fine-tuning, models larger than what Groq racks support economically, or custom model architectures requiring CUDA-level control. For a truly high-volume API with QPS in the thousands and modest latency requirements, well-batched H100s will be more cost-effective.

Q6: How does the Groq compiler's static scheduling relate to the absence of a cache hierarchy?

A: They are two sides of the same design decision. Cache hierarchies exist to mitigate the performance cost of unpredictable memory access patterns - you keep recently-used data nearby in case it is needed again soon. This works because access patterns in general-purpose code are often locally regular but globally unpredictable.

In a neural network forward pass, the access pattern is entirely determined by the model graph. Every matrix multiplication knows exactly which weight tiles it needs, in what order, at what time. The Groq compiler exploits this by computing the exact memory access schedule at compile time and streaming data into the compute units precisely when it is needed. With a perfect schedule, a cache provides zero benefit - it is overhead with no payoff. Removing the cache hierarchy simplifies the chip, reduces area, reduces power, and eliminates the latency unpredictability that cache misses would cause.

Comparing Groq to Other Inference Hardware

Understanding Groq's position requires placing it in the context of the full inference hardware landscape. Each accelerator optimizes for a different point on the latency-throughput-cost surface.

Groq vs NVIDIA H100 for Inference

The H100 is the dominant inference accelerator in 2024. It wins on raw aggregate throughput at large batch sizes. The Tensor Cores can chew through large batches with near-100% utilization. For a production LLM API serving thousands of concurrent users, vLLM or TensorRT-LLM running on H100s is the standard.

The H100 loses to Groq specifically when requests are individual and latency-bound. At batch size 1, the H100's memory-bandwidth ceiling is 3.35 TB/s ÷ 14 GB per token ≈ 239 theoretical tokens/sec for a 7B FP16 model - and real-world numbers are lower due to kernel launch overhead, OS scheduling jitter, and HBM read latency. Groq at 750 tokens/sec represents a genuine 3x advantage.

The variance gap is less discussed but practically important. H100 P99 latency on a production API can be 2-4x the P50 latency due to CUDA stream contention, memory fragmentation from other running jobs, and HBM memory bus contention. Groq P99 is within 5% of P50. For latency SLAs that include P99 commitments, Groq is dramatically easier to reason about.

Groq vs Groq - The Multi-Chip Scaling Reality

Adding GroqChips to handle larger models introduces chip-to-chip communication latency. The 4.1 Tb/s chip-to-chip bandwidth in a GroqRack is fast, but it is not zero latency. As model partitioning requires activation tensors to flow between chips at every layer, the communication overhead grows proportionally to the number of inter-chip hops.

For a 70B model partitioned across hundreds of chips spanning multiple racks, each transformer layer requires a partial forward pass on one set of chips, communication across the interconnect, and continuation on the next set. The theoretical speedup from SRAM is real but reduced compared to a single-chip configuration. Groq's benchmarked 280 tokens/sec for 70B is still competitive, but the gap over H100 is smaller than for 8B models.
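A rough sketch of why the gap narrows. The per-hop latency below is an illustrative placeholder, not a published Groq figure, and the real compiler overlaps communication with compute, which this model ignores:

# Rough model of multi-chip overhead per token, with made-up hop latency.
def tokens_per_sec(compute_us_per_token: float,
                   n_inter_chip_hops: int,
                   hop_latency_us: float) -> float:
    """Serialized-communication model: every hop adds latency to each token."""
    total_us = compute_us_per_token + n_inter_chip_hops * hop_latency_us
    return 1e6 / total_us

COMPUTE_US = 1000.0  # ~1 ms of pure compute per token (illustrative)
HOP_US = 1.0         # assumed per-hop link + serialization latency (illustrative)

for hops in [0, 18, 300]:
    print(f"{hops:>3} hops: {tokens_per_sec(COMPUTE_US, hops, HOP_US):.0f} tokens/sec")

#   0 hops: 1000 tokens/sec
#  18 hops:  982 tokens/sec - small penalty within one rack
# 300 hops:  769 tokens/sec - overhead compounds as a model spans racks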

Quantization and Model Precision

Groq's SRAM advantage is most pronounced for quantized models. INT8 quantization halves model size, doubling how many parameters fit per chip:

# Memory footprint by quantization level
MODEL_SIZE_7B = {
    "fp32": "28 GB",   # 7B * 4 bytes
    "fp16": "14 GB",   # 7B * 2 bytes
    "int8": "7 GB",    # 7B * 1 byte
    "int4": "3.5 GB",  # 7B * 0.5 bytes
}

# GroqChips needed (230MB per chip)
CHIPS_NEEDED_7B = {
    "fp16": "61 chips (31 cards)",  # 14GB / 0.23GB per chip
    "int8": "31 chips (16 cards)",  # 7GB / 0.23GB per chip
    "int4": "16 chips (8 cards)",   # 3.5GB / 0.23GB per chip
}

# INT8 halves hardware cost with minimal quality loss
# Most Groq production deployments use INT8 or INT4 quantization

INT4 quantization for a 7B model brings it down to 3.5GB, fitting on 16 chips (8 GroqCards) - within a single nine-card GroqRack with room to spare. The quality trade-off for INT4 is model-dependent, but for many production use cases it is acceptable. Always benchmark quality metrics (MMLU, HellaSwag, etc.) before committing to a quantization level.

Groq in System Architecture

Where Groq Fits in a Production ML Stack

A pragmatic production system routes requests by latency requirements. Interactive, user-facing requests with strict SLAs go to Groq. Batch summarization jobs, embedding generation, and high-concurrency throughput-optimized requests go to GPU clusters. The routing logic can be as simple as a flag in your API request or as sophisticated as a latency-aware load balancer that measures queue depth on each backend.

Hybrid Cost Optimization

import os

from groq import Groq as GroqClient
from openai import OpenAI

groq_client = GroqClient(api_key=os.environ["GROQ_API_KEY"])
gpu_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # or self-hosted vLLM

def route_request(
    messages: list,
    max_tokens: int,
    latency_sensitive: bool,
    model_size: str = "8b",
):
    """
    Route inference requests based on latency requirements.
    Groq for latency-sensitive paths, GPU cluster for batch/cost paths.
    """
    if latency_sensitive:
        # Groq: low latency, deterministic, higher cost per token
        model = "llama-3.1-8b-instant" if model_size == "8b" else "llama-3.1-70b-versatile"
        return groq_client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
        )
    else:
        # GPU cluster: lower cost at volume, higher latency acceptable
        return gpu_client.chat.completions.create(
            model="gpt-4o-mini",  # or your vLLM endpoint's model name
            messages=messages,
            max_tokens=max_tokens,
        )

# Voice assistant: latency-sensitive
voice_response = route_request(
    messages=[{"role": "user", "content": "Book a table for 2 at 7pm"}],
    max_tokens=100,
    latency_sensitive=True,
)

# Nightly report generation: cost-sensitive
report_response = route_request(
    messages=[{"role": "user", "content": "Summarize last week's sales data: [...]"}],
    max_tokens=1000,
    latency_sensitive=False,
)

Summary

Groq's LPU architecture makes one fundamental bet: LLM inference at small batch sizes is memory-bandwidth-bound, and the right answer to that is to eliminate off-chip memory reads entirely, not to make off-chip memory faster.

The TSP achieves this through SRAM-first design (230MB per chip), compiler-scheduled execution (no dynamic dispatch, no caches, no branch prediction), and deterministic cycle-accurate timing. The result is 500-750 tokens/second at batch size 1 for 7B-8B models - 4-6x faster than equivalent GPU hardware - with latency variance near zero.

The trade-offs are real: SRAM capacity limits model size, training is not supported, and aggregate throughput at high concurrency does not scale as well as GPU batching. Groq wins for latency-sensitive, low-batch inference workloads. It loses for training, large models, and high-throughput batch jobs.

GroqCloud's OpenAI-compatible API makes adoption trivial for any team already using OpenAI or similar LLM providers. The groq Python library and the base_url trick with the OpenAI client mean zero code changes for most integrations.

The architecture represents a broader principle in hardware design: once you know your target workload precisely, removing generality is not a weakness - it is how you achieve 10x performance improvements. Groq chose to be the world's best inference chip for small-to-medium models at low latency. That focus is exactly why it works.
