Apple Silicon for AI

Reading time: ~35 min · Interview relevance: Medium · Target roles: iOS/macOS Engineer, On-Device AI, ML Infrastructure


35 Tokens Per Second, 8 Watts, In Your Backpack

It is 2 PM on a transatlantic flight. You have no internet connection, a fully charged M3 Max MacBook Pro, and a 7 billion parameter language model running locally at 35 tokens per second. The fan is silent. The battery indicator says you have six hours left. The model is answering questions about your code, summarizing documents, drafting emails - all with complete privacy, no API calls, no latency spikes.

That is not a thought experiment. It is the practical reality of Apple Silicon in 2024. The same chip that runs your code editor and plays 4K video is running a quantized LLaMA 3 model faster than a cloud API on a bad network day, at 8 watts of sustained power draw.

Now zoom out. A desktop GPU that runs LLaMA 3 7B at comparable speeds - an RTX 4070 - costs $600, draws 200 watts, and requires a PC case and power supply. The RTX 4090 runs it at 100+ tokens per second but draws 450 watts and costs $2,000 in the US. An Apple M3 Max MacBook Pro starts at $3,000 but also works as your laptop, runs all day on battery, fits in a bag, and runs the same model at competitive speed per watt.

This is not about Apple Silicon competing with H100 for training data centers. It does not. Apple Silicon wins in a completely different category: edge inference where memory bandwidth, power, and portability matter more than peak FLOPS. Understanding why requires understanding the architectural choice Apple made in 2020 that changed what "a computer's memory" means.


The Problem: The PCIe Bus Was Killing Inference Efficiency

Before Apple Silicon, every consumer computing device that wanted GPU acceleration faced the same bottleneck: the PCIe bus.

In a traditional CPU + discrete GPU setup, CPU memory (DRAM) and GPU memory (VRAM) are physically separate pools. The CPU's DDR5 RAM might be 64GB at 90 GB/s. The GPU's GDDR6X VRAM might be 24GB at 1 TB/s. They sit on different chips connected by PCIe 4.0 at 64 GB/s.

The consequence for inference: when you load a model, you first load it into CPU RAM, then copy it across PCIe to VRAM. When you update a KV cache, you write to VRAM. When you read logits for beam search, you copy back across PCIe. Each PCIe transfer tops out at 64 GB/s - roughly 16x slower than the GPU's VRAM bandwidth, and below even the CPU's DRAM bandwidth.

For a 7B BF16 model (14GB), the initial model load from storage to GPU involves:

  1. Storage to CPU RAM: limited by storage I/O (NVMe at ~7 GB/s)
  2. CPU RAM to GPU VRAM: PCIe at 64 GB/s (about 0.2 seconds)

Worse, if the model does not fit in VRAM, you have to split layers across CPU RAM and GPU VRAM, swapping layers across PCIe at 64 GB/s during every forward pass. This is catastrophically slow - a 70B model in 4-bit on a GPU with 24GB VRAM would need constant CPU-GPU swapping.
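To see the scale of the problem, here is a back-of-the-envelope sketch (plain Python; the model size, VRAM capacity, and bandwidth figures are the round numbers used above, not measurements):

# Rough arithmetic: why CPU<->GPU layer swapping over PCIe is so slow.
# All figures are illustrative assumptions from the text above.
model_bytes = 38e9    # 70B model in 4-bit (~38 GB)
vram_bytes = 24e9     # RTX 4090 VRAM
pcie_bw = 64e9        # PCIe 4.0 x16, bytes/sec
vram_bw = 1008e9      # GDDR6X bandwidth, bytes/sec

resident = min(model_bytes, vram_bytes)   # weights that stay in VRAM
streamed = model_bytes - resident         # weights re-sent over PCIe every token

per_token_s = resident / vram_bw + streamed / pcie_bw
print(f"~{per_token_s * 1000:.0f} ms/token -> ~{1 / per_token_s:.1f} tokens/sec")
# ~242 ms/token -> ~4.1 tokens/sec, dominated by the PCIe term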

Apple's answer to this was not a faster PCIe bus. It was eliminating the PCIe bus entirely.


Historical Context: From Neural Engine to M-Series

Apple's path to Apple Silicon for AI started in 2017 with a smartphone chip.

A11 Bionic (2017) - First Apple chip to include the Apple Neural Engine (ANE). Two dedicated neural network inference cores delivering 600 GOPS (giga-operations per second). Used for Face ID facial recognition and camera depth processing. Not programmable - Apple-only models ran on it via Core ML.

A12 Bionic (2018) - 8-core ANE at 5 TOPS. The API to use it from Core ML became public.

M1 (2020) - Apple's first Mac chip on their own silicon. 16-core ANE at 11 TOPS. 8GB or 16GB unified memory at 68.25 GB/s (the 2021 M1 Pro and M1 Max variants raised this to 200 and 400 GB/s). CPU and GPU share this memory pool - no discrete GPU. This was the architectural shift that mattered.

M1 Ultra (2022) - Two M1 Max dies connected by UltraFusion interconnect (2.5 TB/s die-to-die). Up to 128GB unified memory at 800 GB/s. First consumer chip where a 70B model in 4-bit (35GB) could fit entirely in unified memory.

M2 series (2022-2023) - Incremental improvements. M2 Max: 15.8 TOPS ANE, 38-core GPU, up to 96GB at 400 GB/s.

M3 series (2023) - Hardware ray tracing added to GPU. M3 Max: 18 TOPS ANE, 40-core GPU, up to 128GB at 400 GB/s. M3 Ultra: up to 192GB at 800 GB/s.

M4 (2024) - 38 TOPS ANE. Improved hardware-accelerated matrix multiplication for machine learning. Shipped first in the iPad Pro, then in M4 MacBook Pro variants. M4 Max: up to 128GB at 546 GB/s - the highest bandwidth in the M4 line.

The critical trend: each generation adds more addressable unified memory and higher bandwidth. The memory is the bottleneck for LLM inference, and Apple is systematically increasing it.


Unified Memory Architecture: Why It Changes Everything

What Unified Memory Actually Means

"Unified memory" is sometimes used loosely to mean shared memory accessible by CPU and GPU. What Apple Silicon does is architecturally stronger than that: the CPU, GPU, and ANE all access the exact same physical DRAM chips through the same memory controller.

There is no "CPU copy" and "GPU copy" of the same data. There is one physical array of bytes. The CPU, GPU, and ANE each have cache hierarchies that sit in front of that DRAM, but at the DRAM level they share the same address space with no separation.

Traditional Discrete GPU Setup
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CPU ─── CPU L1/L2/L3 Cache ─── DDR5 DRAM (64GB, 90 GB/s)
 │
 PCIe 4.0 x16
 (64 GB/s max)
 │
GPU ─── GPU L2 Cache ───────── GDDR6X VRAM (24GB, 1008 GB/s)

Model inference:
  Storage -> CPU RAM:      ~7 GB/s
  CPU RAM -> GPU VRAM:     ~64 GB/s
  GPU compute:             1008 GB/s bandwidth
  GPU result -> CPU RAM:   ~64 GB/s (for logit sampling)

Apple Silicon Unified Memory
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
        ┌──────────────────────────────┐
        │      Unified DRAM Pool       │
        │    (128GB max, 400+ GB/s)    │
        └──────────────────────────────┘
            ▲           ▲          ▲
            │           │          │
           CPU         GPU        ANE
        L1/L2/L3      L1/L2    dedicated
         caches       cache  SRAM buffers

Model inference:
  Storage -> Unified RAM:  ~7 GB/s (same as before)
  CPU to GPU "copy":       pointer arithmetic, 0 bytes moved
  Compute:                 full 400 GB/s shared across all processors
  Logit sampling:          CPU reads from same physical bytes, no copy

The "copy from CPU to GPU" in PyTorch or Core ML on Apple Silicon is a metadata operation - it updates a pointer or page table entry. The underlying DRAM bytes do not move. This eliminates the PCIe bottleneck entirely for inference.

The Memory Bandwidth Calculation for Inference

LLM inference speed (for single-user, small batch) is almost entirely memory-bandwidth-bound. The reason: during each token generation step, you read every parameter of the model once (to compute the forward pass) and generate a handful of FLOPs per byte read.

The theoretical peak tokens per second is:

\text{Tokens/sec} \approx \frac{\text{Memory Bandwidth (bytes/sec)}}{\text{Model Size (bytes)}}

For an M3 Max (400 GB/s) running LLaMA 3 7B in 4-bit quantization (3.5 GB):

\text{Tokens/sec} \approx \frac{400 \times 10^9}{3.5 \times 10^9} \approx 114 \text{ tokens/sec theoretical}

Real-world measurements with llama.cpp or MLX run at 40-70% of this theoretical limit, due to dequantization overhead, KV cache reads, and software inefficiency. Typical results: 35-60 tokens/second for 7B models on an M3 Max.

For an M3 Ultra (800 GB/s) running LLaMA 3 70B in 4-bit (38 GB):

\text{Tokens/sec} \approx \frac{800 \times 10^9}{38 \times 10^9} \approx 21 \text{ tokens/sec theoretical}

In practice, 10-15 tokens/sec. Usable for personal productivity. Not suitable for production serving.

This arithmetic makes the memory bandwidth per dollar a key purchase criterion for Apple Silicon ML use cases - not GPU core count, not TOPS, but GB/s.
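The same arithmetic is easy to script as a quick sizing tool. A minimal sketch in Python - the bandwidth figures are published chip specs, and the 0.5 efficiency factor is an assumption in line with the 40-70% range above:

# Estimate the memory-bandwidth ceiling on decode speed.
def tokens_per_sec(bandwidth_gb_s: float, model_gb: float,
                   efficiency: float = 0.5) -> tuple[float, float]:
    theoretical = bandwidth_gb_s / model_gb
    return theoretical, theoretical * efficiency

for chip, bw in [("M3", 100), ("M3 Max", 400), ("M3 Ultra", 800)]:
    theo, realistic = tokens_per_sec(bw, model_gb=3.5)  # 7B model in 4-bit
    print(f"{chip:9s} 7B Q4: {theo:4.0f} tok/s theoretical, ~{realistic:.0f} realistic")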


Apple Neural Engine: The Dedicated Matrix Multiply Hardware

What the ANE Is

Every M-series chip contains a component called the Apple Neural Engine (ANE). It is fixed-function hardware optimized specifically for dense matrix multiply operations - the same operations that dominate transformer inference.

Unlike the programmable GPU compute units or the general-purpose CPU cores, the ANE cannot run arbitrary code. It accepts a restricted set of operations: matrix multiply, convolution, activation functions, normalization. Models must be compiled to the ANE's supported operation set through Core ML's compilation pipeline.

The ANE's key properties:

  • Dedicated SRAM buffers: the ANE has fast on-chip SRAM for weight staging. This reduces DRAM accesses for repeated weight reads.
  • Very low power: the ANE delivers high TOPS per watt compared to the GPU
  • Fixed-function, not programmable: you cannot write custom ANE kernels as a third party
  • Core ML only (officially): the ANE is accessed through Core ML. No direct Metal API access.

ANE Compute Capacity

| Chip | ANE Cores | ANE TOPS | GPU Cores | Unified Memory | Bandwidth |
|---|---|---|---|---|---|
| M1 | 16 | 11 | 8 | 8-16GB | 68 GB/s |
| M1 Max | 16 | 11 | 32 | 32-64GB | 400 GB/s |
| M1 Ultra | 32 | 22 | 64 | 64-128GB | 800 GB/s |
| M2 | 16 | 15.8 | 10 | 8-24GB | 100 GB/s |
| M2 Max | 16 | 15.8 | 38 | 32-96GB | 400 GB/s |
| M3 | 16 | 18 | 10 | 8-24GB | 100 GB/s |
| M3 Max | 16 | 18 | 40 | 36-128GB | 400 GB/s |
| M3 Ultra | 32 | 36 | 80 | 64-192GB | 800 GB/s |
| M4 | 16 | 38 | 10 | 16-32GB | 120 GB/s |
| M4 Max | 16 | 38 | 40 | 36-128GB | 546 GB/s |

When the ANE Is Used vs GPU

The routing decision for where an operation runs:

  • ANE: Core ML models that have been compiled with computeUnits: .all or .cpuAndNeuralEngine. Standard CNN and transformer inference through Core ML.
  • GPU: Metal Performance Shaders (MPS) backend for PyTorch. MLX framework. Custom Metal compute kernels. Large batch inference.
  • CPU: Operations not supported by ANE or GPU. Fallback for unsupported layers. CPU-only Core ML.

The ANE is most efficient for Core ML inference of mobile-scale models (up to a few hundred million parameters). For LLM inference (7B+ parameters), the GPU via Metal or MLX is typically faster because the ANE's dedicated SRAM is too small to stage the full weight matrix without excessive DRAM traffic, and the GPU's larger L2 cache and higher compute parallelism compensate.
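For reference, this is roughly how the routing preference is expressed from Python with coremltools (the model path and input array are placeholders; the ComputeUnit values are the actual coremltools enum):

import coremltools as ct

# Constrain which compute units a compiled Core ML model may use.
# ALL lets Core ML pick ANE, GPU, or CPU per layer; CPU_AND_NE excludes the GPU.
model_ane = ct.models.MLModel("MyModel.mlpackage",
                              compute_units=ct.ComputeUnit.CPU_AND_NE)
model_all = ct.models.MLModel("MyModel.mlpackage",
                              compute_units=ct.ComputeUnit.ALL)

# input_array: a numpy array matching the model's input spec (placeholder)
prediction = model_all.predict({"input": input_array})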

:::note ANE Is Not Directly Programmable by Third Parties Apple has not released an ANE programming interface. The only way to target the ANE is through Core ML model compilation. mlx, llama.cpp, and PyTorch MPS all run on the GPU, not the ANE. When benchmarks say "Apple M3 runs 38 TOPS," that TOPS figure is for ANE workloads accessed via Core ML. :::


The GPU Compute Path: Metal Performance Shaders

For LLM inference, the actual compute path on Apple Silicon is the GPU through Metal.

Metal API Architecture

Metal is Apple's low-level GPU compute API - analogous to CUDA for NVIDIA, ROCm for AMD. Metal supports:

  • Compute shaders: arbitrary GPGPU compute written in the Metal Shading Language (MSL)
  • Metal Performance Shaders (MPS): Apple's library of optimized GPU kernels for ML operations - matrix multiply, convolution, LSTM, attention

PyTorch's MPS backend uses Metal under the hood. When you do:

import torch

device = torch.device("mps")
x = torch.randn(1024, 1024, device=device)
y = torch.randn(1024, 1024, device=device)
z = x @ y # Runs via Metal MPS

PyTorch dispatches to Apple's MPSMatrixMultiplication kernel, which Apple has hand-tuned for their GPU architecture.

PyTorch MPS Backend

PyTorch added MPS backend support in version 1.12 (2022). It covers most common ML operations:

import torch

# Check MPS availability
print(torch.backends.mps.is_available())  # True on Apple Silicon
print(torch.backends.mps.is_built())      # True if PyTorch built with MPS

device = torch.device("mps")

# Standard model inference
model = MyModel()
model.to(device)
model.eval()

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224, device=device)
    output = model(x)

# Training (works for fine-tuning smaller models)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:
    inputs = batch["input_ids"].to(device)
    labels = batch["labels"].to(device)

    outputs = model(inputs, labels=labels)
    loss = outputs.loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

:::warning MPS Backend Limitations in PyTorch Not all PyTorch operations are supported on MPS. Operations that fall back to CPU cause silent slowdowns - the tensor silently moves to CPU, runs, and moves back. For operations like some sparse attention variants or custom CUDA extensions ported to PyTorch, there may be no MPS implementation. Check PyTorch's MPS operation coverage before planning a training run. :::


MLX: Apple's Native ML Framework

In December 2023, Apple released MLX - an array computation framework for Apple Silicon designed from the ground up for the unified memory architecture.

Design Philosophy

MLX is built on two core ideas:

  1. Lazy evaluation: like JAX and similar frameworks, operations are not executed immediately. A computation graph is built and compiled before execution. This lets the compiler fuse operations and optimize memory layout.

  2. Unified memory as a first-class concept: MLX arrays live in unified memory by default. There is no "device" concept the way PyTorch has cuda vs cpu. Everything is on the same chip, accessible by all compute units. "Transferring" to GPU is a scheduling decision, not a data movement.

MLX API Basics

import mlx.core as mx
import mlx.nn as nn

# MLX arrays - in unified memory, no .to(device) needed
a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])

# Operations - lazy by default
c = a + b # Not yet computed
d = mx.sum(c) # Still not computed

# Force evaluation
mx.eval(d) # Now computes c = a + b, then the sum
print(d) # array(21, dtype=float32)

# Matrix operations
A = mx.random.normal([1024, 1024])
B = mx.random.normal([1024, 1024])
C = A @ B # Matrix multiply - dispatched to GPU
mx.eval(C) # Force execution

MLX Neural Network Module

MLX has a neural network layer library (mlx.nn) similar to PyTorch's torch.nn:

import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attention = nn.MultiHeadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff1 = nn.Linear(d_model, d_ff)
        self.ff2 = nn.Linear(d_ff, d_model)

    def __call__(self, x):
        # Self-attention with residual
        attn_out = self.attention(x, x, x)
        x = self.norm1(x + attn_out)

        # Feed-forward with residual
        ff = nn.gelu(self.ff1(x))
        ff = self.ff2(ff)
        x = self.norm2(x + ff)

        return x

# Create model
model = TransformerBlock(d_model=512, n_heads=8, d_ff=2048)

# Forward pass
x = mx.random.normal([8, 64, 512])  # [batch, seq_len, d_model]
output = model(x)
mx.eval(output)
print(output.shape)  # (8, 64, 512)

MLX for LLM Inference: mlx-lm

Apple's mlx-lm package provides a high-level interface for running LLMs with MLX:

pip install mlx-lm

from mlx_lm import load, generate

# Load model from Hugging Face Hub (MLX quantized format)
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

# Generate text
prompt = "Explain the attention mechanism in transformers"
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    verbose=True,  # Prints tokens/sec during generation
)
print(response)

# Load with an explicit tokenizer config
from mlx_lm import load, generate
from mlx_lm.utils import generate_step
import mlx.core as mx

model, tokenizer = load(
    "meta-llama/Llama-3-8B",
    tokenizer_config={"trust_remote_code": True},
)

# Streaming generation: generate_step yields (token, prob) pairs
for (token, _prob), _ in zip(
    generate_step(mx.array(tokenizer.encode(prompt)), model, temp=0.7, top_p=0.9),
    range(512),
):
    print(tokenizer.decode([token.item()]), end="", flush=True)

MLX Quantization

MLX supports 4-bit and 8-bit post-training quantization, essential for fitting large models in Apple Silicon's memory:

import mlx.core as mx
import mlx.nn as nn
from mlx.utils import tree_flatten, tree_unflatten
from mlx_lm import load

def quantize_model(model, bits=4, group_size=64):
    """
    Quantize model weights to reduce memory footprint.
    4-bit quantization: 7B model from 14GB -> 3.5GB
    """
    # Quantize all Linear layers in place
    nn.quantize(model, bits=bits, group_size=group_size)
    return model

# Load full precision, then quantize
model, tokenizer = load("meta-llama/Llama-3-8B")
quantized_model = quantize_model(model, bits=4)

# Memory usage after 4-bit quantization
# 7B params * 0.5 bytes (4-bit) = ~3.5GB
# Plus KV cache: 2 * n_layers * n_kv_heads * head_dim * seq_len * 2 bytes
# For 8192 context, LLaMA 3 8B (GQA, 8 KV heads): ~1GB KV cache
# Total: ~4.5GB - fits in a base M1 with 8GB unified memory

llama.cpp on Apple Silicon

llama.cpp is a C++ inference engine that compiles to Metal on Apple Silicon. It is often faster than MLX for single-model inference because of its highly optimized Metal kernels and lower framework overhead.

Installation and Basic Usage

# Build llama.cpp with Metal support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Metal backend is automatically enabled on macOS
cmake -B build -DGGML_METAL=ON
cmake --build build -j 8

# Download a GGUF model
# GGUF is llama.cpp's quantized model format
# Models available at huggingface.co/bartowski/ etc.
wget https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf

# Run inference
./build/bin/llama-cli \
-m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
-n 512 \
--n-gpu-layers 32 \
-p "Explain gradient descent"

# Key flags:
# --n-gpu-layers: how many layers to offload to GPU Metal
# 32 = all layers for 8B model (fits in M-series GPU)
# 0 = CPU only (much slower)
# -n: tokens to generate
# -c: context window size

Benchmarking with llama.cpp

# llama-bench: systematic throughput measurement
# -p 512: prompt processing tokens (prefill)
# -n 128: generation tokens (decode)
./build/bin/llama-bench \
    -m Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
    --n-gpu-layers 32 \
    -p 512 \
    -n 128

# Sample output:
# model         | size     | params | backend    | ngl | test  | t/s
# Llama 3 8B Q4 | 4.66 GiB | 8.03 B | Metal,BLAS |  32 | pp512 | 2847.15
# Llama 3 8B Q4 | 4.66 GiB | 8.03 B | Metal,BLAS |  32 | tg128 | 38.47
#
# pp512 = prompt processing (prefill): 2847 tokens/sec
# tg128 = token generation (decode): 38 tokens/sec

The prefill (processing the prompt) is much faster than generation because prefill is a compute-bound batch operation. Generation (one token at a time) is memory-bandwidth-bound.
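A small sketch of that asymmetry (plain Python; the 400 GB/s figure and the 512-token prompt are assumptions matching the benchmark above):

# Decode reads all weights once per generated token; prefill reads them once
# per batch of prompt tokens, so the bandwidth cost is amortized.
model_bytes = 3.5e9    # 8B model in 4-bit
bandwidth = 400e9      # assumed M3 Max, bytes/sec

decode_ceiling = bandwidth / model_bytes                  # ~114 tok/s
prefill_bytes_per_token = model_bytes / 512               # amortized over the prompt
prefill_ceiling = bandwidth / prefill_bytes_per_token     # ~58,000 tok/s (memory side only)

print(f"decode bandwidth ceiling:  ~{decode_ceiling:.0f} tok/s")
print(f"prefill bandwidth ceiling: ~{prefill_ceiling:.0f} tok/s")
# In practice prefill hits the compute ceiling long before this memory ceiling,
# which is why measured prefill is ~2,800 tok/s rather than ~58,000.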


Performance Comparison: Apple Silicon vs Discrete GPU

| Configuration | Model | Speed (tok/s gen) | Power Draw | Memory Available | Price (approx) |
|---|---|---|---|---|---|
| M3 (8GB) | 7B Q4 | 15-25 | 15W | 6GB usable | $1,299 MacBook Air |
| M3 Max (36GB) | 7B Q4 | 35-55 | 25W | 34GB usable | $3,499 MacBook Pro |
| M3 Max (128GB) | 70B Q4 | 5-10 | 40W | 126GB usable | $6,999 Mac Studio |
| M3 Ultra (192GB) | 70B Q4 | 10-15 | 60W | 190GB usable | $9,999 Mac Pro |
| RTX 4070 (12GB) | 7B Q4 | 50-80 | 200W | 11GB VRAM | $600 GPU only |
| RTX 4090 (24GB) | 7B Q4 | 100-140 | 450W | 23GB VRAM | $2,000 GPU only |
| RTX 4090 (24GB) | 70B Q4 | CPU offload needed | 450W+ | 23GB VRAM | $2,000 + system |
| H100 SXM5 (80GB) | 70B Q4 | 100-150 | 700W | 79GB VRAM | $30,000+ |

The RTX 4090 is faster for 7B models. But:

  • 70B models on RTX 4090: insufficient VRAM (24GB vs 38GB needed for Q4). Requires CPU offloading at 3-5 tokens/sec.
  • An M3 Ultra with 192GB runs 70B at 10-15 tok/sec at roughly 60W, and a 128GB M3 Max MacBook Pro runs it at 5-10 tok/sec on battery.
  • Watts per token/sec: RTX 4090 = ~4 W/(tok/s). M3 Max = ~0.5 W/(tok/s). Apple Silicon wins by 8x on power efficiency.


Memory Sizing Guide for Local LLM Deployment

The most common question: "which M-series chip do I need to run model X?"

Model memory requirements (rough estimates):

| Model | Full BF16 | Q8_0 | Q4_K_M | Q3_K_S |
|---|---|---|---|---|
| 3B | 6 GB | 3 GB | 1.9 GB | 1.5 GB |
| 7B / 8B | 14 GB | 7 GB | 4.5 GB | 3.5 GB |
| 13B | 26 GB | 13 GB | 8 GB | 6.5 GB |
| 30B | 60 GB | 30 GB | 19 GB | 14 GB |
| 70B | 140 GB | 70 GB | 43 GB | 32 GB |
| 405B | 810 GB | 405 GB | 250 GB | 185 GB |

Add KV cache on top: for a 4K context window, LLaMA 3 8B needs approximately 0.5GB additional. For 32K context: ~4GB. Always leave at least 2-4GB headroom for the OS and other processes.
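A minimal KV cache estimator makes that arithmetic concrete (plain Python; the LLaMA 3 8B shape constants are its published config values, and the formula assumes grouped-query attention with 16-bit cache entries):

# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes/element
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# LLaMA 3 8B: 32 layers, 8 KV heads (GQA), head_dim 128
print(kv_cache_gb(32, 8, 128, 4_096))    # ~0.5 GB at 4K context
print(kv_cache_gb(32, 8, 128, 32_768))   # ~4.3 GB at 32K context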

Practical recommendations:

  • 8GB M-series: 7B models in Q4_K_M or smaller. Tight but workable.
  • 16GB M-series: 7B in Q8, 13B in Q4. Good for development.
  • 36GB M3/M4 Max: 13B in Q8, 30B in Q4. Comfortable with context window headroom.
  • 64GB M-Ultra: 70B in Q4_K_M. The sweet spot for running frontier-size local models.
  • 128GB M3 Max or M4 Max: 70B in Q8, with generous headroom for long-context KV caches.
  • 192GB M3/M4 Ultra: 70B in BF16; 405B only in very aggressive (3-bit or below) quantization, with almost no headroom.

When Apple Silicon Wins and When It Loses

Apple Silicon Wins For:

  • Local inference with privacy: model weights never leave your machine. Zero API calls. HIPAA, GDPR, IP protection use cases.
  • Laptop inference on battery: no discrete GPU laptop runs LLMs at 8W. Period.
  • Models that exceed discrete GPU VRAM: 70B models require 40+ GB. No consumer discrete GPU has this. Apple Ultra chips do.
  • Developer iteration speed: no cloud API latency, no cost per token, no rate limits. Prototype LLM chains locally.
  • Consumer app inference: on-device inference in iOS/macOS apps using Core ML. Zero inference cost, works offline.
  • Power-constrained edge deployments: an M2 Mac mini runs at 15W idle, 45W load. A GPU server runs at 1-10 kW.

Apple Silicon Loses For:

  • Training: CUDA's ecosystem for training is unmatched. PyTorch MPS training works but has op gaps, lower throughput, and no equivalent to FlashAttention 2 or Triton kernels. Fine-tuning a 7B model on M-series is feasible but slow. Pre-training is impractical.
  • High-throughput serving: a single H100 can serve hundreds of concurrent users for a 7B model via continuous batching in vLLM. An M4 Max can serve a handful. For production inference serving with SLA requirements, Apple Silicon is not competitive.
  • Multi-GPU scaling: Apple Silicon has no equivalent to NVLink or even PCIe multi-GPU for collective operations. You cannot scale a training job across two Mac Pros in any practical way.
  • Cutting-edge research: most ML research infrastructure assumes CUDA. Triton, FlashAttention 3, PagedAttention, speculative decoding in vLLM - all CUDA. Staying on the research frontier requires NVIDIA hardware.
  • Fine-tuning large models: even with MLX, fine-tuning 70B models on Apple Silicon takes days for what an H100 cluster does in hours. The math does not work for production fine-tuning.

Complete MLX Benchmark Script

"""
Benchmark LLM inference speed on Apple Silicon using MLX.
Measures: time to first token (TTFT), tokens/sec generation,
and memory usage.
"""

import time
import resource
import mlx.core as mx
from mlx_lm import load, generate
from mlx_lm.utils import generate_step

def get_memory_usage_mb() -> float:
"""Return current process memory in MB (RSS)."""
usage = resource.getrusage(resource.RUSAGE_SELF)
return usage.ru_maxrss / 1024 / 1024 # Convert bytes to MB

def benchmark_model(
model_path: str,
prompt: str,
n_generate: int = 256,
n_warmup: int = 32,
) -> dict:
"""
Run a comprehensive benchmark for an MLX LLM.

Returns dict with:
- prefill_tokens_per_sec: prompt processing speed
- generate_tokens_per_sec: generation speed
- time_to_first_token_ms: latency
- peak_memory_mb: peak unified memory used
"""
print(f"Loading {model_path}...")
mem_before_load = get_memory_usage_mb()

model, tokenizer = load(model_path)
mx.eval(model.parameters()) # Force parameter materialization

mem_after_load = get_memory_usage_mb()
model_memory_mb = mem_after_load - mem_before_load
print(f"Model loaded. Memory delta: {model_memory_mb:.0f} MB")

# Tokenize prompt
input_ids = mx.array(tokenizer.encode(prompt))
prompt_tokens = len(input_ids)
print(f"Prompt: {prompt_tokens} tokens")

# Warmup run (compiles Metal shaders, warms caches)
print("Warming up...")
warmup_ids = mx.array(tokenizer.encode("Hello"))
tokens_generated = 0
for token, _ in zip(generate_step(warmup_ids, model), range(n_warmup)):
mx.eval(token)
tokens_generated += 1
print(f"Warmup complete ({tokens_generated} tokens)")

# Benchmark: prefill (time to first token)
print("Benchmarking prefill...")
t0 = time.perf_counter()

gen = generate_step(input_ids, model, temp=0.0) # Greedy for determinism
first_token = next(gen)
mx.eval(first_token) # Force Metal execution

ttft_ms = (time.perf_counter() - t0) * 1000
prefill_tps = prompt_tokens / (ttft_ms / 1000)
print(f"Time to first token: {ttft_ms:.1f} ms ({prefill_tps:.0f} prefill tok/s)")

# Benchmark: generation speed
print(f"Benchmarking generation ({n_generate} tokens)...")
tokens = [first_token]
t_gen_start = time.perf_counter()

for token, _ in zip(gen, range(n_generate - 1)):
mx.eval(token)
tokens.append(token)

gen_time = time.perf_counter() - t_gen_start
generate_tps = (n_generate - 1) / gen_time

# Decode output
output_ids = [t.item() for t in tokens]
output_text = tokenizer.decode(output_ids)
print(f"Generated: {output_text[:100]}...")

peak_memory_mb = get_memory_usage_mb()

results = {
"model": model_path,
"prompt_tokens": prompt_tokens,
"generated_tokens": n_generate,
"prefill_tokens_per_sec": prefill_tps,
"generate_tokens_per_sec": generate_tps,
"time_to_first_token_ms": ttft_ms,
"peak_memory_mb": peak_memory_mb,
"model_memory_mb": model_memory_mb,
}

print("\n--- Results ---")
for k, v in results.items():
if isinstance(v, float):
print(f" {k}: {v:.1f}")
else:
print(f" {k}: {v}")

return results

if __name__ == "__main__":
benchmark_model(
model_path="mlx-community/Meta-Llama-3-8B-Instruct-4bit",
prompt="Explain the transformer attention mechanism in detail, covering scaled dot-product attention, multi-head attention, and why positional encoding is necessary.",
n_generate=256,
n_warmup=32,
)

Core ML: Deploying Models in iOS and macOS Apps

For iOS and macOS app developers, the path to using the ANE is through Core ML. Core ML accepts models in .mlpackage or .mlmodel format and compiles them to use ANE, GPU, or CPU depending on the operation type and chip.

# Convert a PyTorch model to Core ML
import torch
import coremltools as ct

# Load PyTorch model
model = MyClassifier()
model.load_state_dict(torch.load("model.pth"))
model.eval()

# Trace the model
example_input = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)

# Convert to Core ML
mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(
        name="input",
        shape=example_input.shape,
        scale=1/255.0,
        bias=[0, 0, 0],
    )],
    compute_precision=ct.precision.FLOAT16,  # Use FP16 for ANE efficiency
    compute_units=ct.ComputeUnit.ALL,        # Allow ANE + GPU + CPU
)

# Save
mlmodel.save("MyClassifier.mlpackage")

In Swift (iOS/macOS app):

import CoreML
import Vision

// Load Core ML model
guard let model = try? MyClassifier(configuration: MLModelConfiguration()) else {
    fatalError("Failed to load model")
}

// Run inference - Core ML routes to ANE/GPU/CPU automatically
let input = MyClassifierInput(input: imageBuffer)
guard let output = try? model.prediction(input: input) else {
    fatalError("Inference failed")
}
print(output.classLabel) // Classification result

Thermal Throttling: The Laptop Reality

Apple Silicon laptops have a fixed thermal envelope. Sustained high-compute workloads trigger thermal throttling - the chip reduces clock speeds to stay within thermal limits.

What this means for LLM inference:

  • MacBook Air (no fan): no active cooling. Sustained inference at full speed for 2-3 minutes, then significant throttling (30-50% speed reduction). Not suitable for long running inference jobs.
  • MacBook Pro (fan): fan activates during sustained inference. Can sustain near-peak performance for much longer. Throttling appears after 30-60 minutes of maximum load.
  • Mac Mini and Mac Studio: desktop thermal design with active cooling. Sustained workloads run at near-peak speed indefinitely.

For production-like local inference (many requests over hours), prefer Mac Mini, Mac Studio, or Mac Pro over MacBook. The MacBook Pro is excellent for developer iteration but not for running a local inference server all day.

# Detect if Metal device is available and log chip info
import subprocess
import platform

def get_apple_silicon_info():
    """Get Apple Silicon chip details."""
    if platform.system() != "Darwin":
        return None

    result = subprocess.run(
        ["system_profiler", "SPHardwareDataType"],
        capture_output=True, text=True,
    )

    info = {}
    for line in result.stdout.split("\n"):
        if "Chip" in line:
            info["chip"] = line.split(":")[1].strip()
        elif "Memory" in line and "GB" in line:
            info["memory_gb"] = line.split(":")[1].strip()
        elif "Number of" in line and "Cores" in line:
            info["cores"] = line.strip()

    return info

chip_info = get_apple_silicon_info()
if chip_info:
    print(f"Chip: {chip_info.get('chip', 'Unknown')}")
    print(f"Memory: {chip_info.get('memory_gb', 'Unknown')}")

Common Mistakes and Pitfalls

:::danger Treating ANE TOPS as Equivalent to GPU GFLOPS When people say "M4 has 38 TOPS," they are referring to ANE operations per second through Core ML. You cannot use this number to compare against a GPU's FLOPS for arbitrary PyTorch operations. The ANE is not accessible to PyTorch, MLX, or llama.cpp. For LLM benchmarking purposes, the relevant number is GPU core count and memory bandwidth - not ANE TOPS. :::

:::danger Using PyTorch MPS for Production Inference PyTorch's MPS backend has incomplete op coverage. Some operations silently fall back to CPU, causing non-obvious slowdowns. For production local LLM inference, use MLX or llama.cpp instead. Both are designed specifically for Apple Silicon and achieve better throughput than PyTorch MPS for generation workloads. :::

:::warning Memory Estimation Without KV Cache When calculating whether a model fits in unified memory, people forget to add the KV cache. A 70B model in Q4_K_M is 43GB. But a 32K context window KV cache for LLaMA 3 70B adds another 8-10GB. Your 64GB Mac Studio may not comfortably run 70B with long contexts - you need the 128GB configuration for production use with long contexts. :::

:::warning Training on Apple Silicon for Research PyTorch MPS training works for small models and fine-tuning experiments. But it lacks FlashAttention 2, Triton support, and many custom ops. If you run a training experiment on MPS and plan to scale to NVIDIA GPUs, test compatibility before assuming the code is portable. The MPS path sometimes requires different implementations than the CUDA path. :::


Interview Questions and Answers

Q1: How does Apple Silicon's unified memory architecture benefit ML inference compared to a discrete GPU setup? What is the performance and practical impact?

In a discrete GPU setup, CPU RAM and GPU VRAM are physically separate with a PCIe bus (64 GB/s) connecting them. Loading a model requires a data copy from CPU RAM across PCIe to GPU VRAM. If the model exceeds VRAM, layers must be swapped across PCIe during inference - catastrophically slow.

Apple Silicon's unified memory means the CPU, GPU, and ANE all access the same physical DRAM. "Moving a tensor to GPU" is a metadata operation (pointer change or page table update), not a data movement. No PCIe bottleneck exists.

The practical impact:

  • 70B models in 4-bit (38GB) cannot fit in any consumer discrete GPU's VRAM (maximum 24GB on RTX 4090). They can run on an M-Ultra with 64GB+ unified memory at usable speeds.
  • Model loading is faster (no PCIe copy phase).
  • Memory bandwidth is fully shared. The M3 Max's 400 GB/s is available to all compute units simultaneously, not split across separate pools.
  • Power efficiency is far better because there is no separate GPU chip with its own power rail.

Q2: What is the Apple Neural Engine, how is it different from the GPU for ML inference, and what routing logic determines which unit runs a given operation?

The Apple Neural Engine (ANE) is dedicated fixed-function hardware for neural network inference. It has a specialized architecture for dense matrix multiply and convolution, with dedicated SRAM buffers for weight staging. It is not programmable by third parties - Apple uses it through Core ML's compilation pipeline.

The GPU is a general-purpose programmable parallel processor using the Metal API. It can run arbitrary compute shaders (GPGPU), Metal Performance Shaders, and has a much larger and more flexible compute model.

Routing logic (through Core ML):

  • Operations compile to ANE when: the layer type is supported (Linear, Conv2D, LayerNorm, standard activation functions), the model was compiled with computeUnits: .all or .cpuAndNeuralEngine, and the tensor shapes fit ANE constraints.
  • Operations fall to GPU when: the operation is Metal-supported but not ANE-supported (some complex attention variants), or when batch size or context length exceeds ANE buffer capacity.
  • Operations fall to CPU when: no Metal or ANE implementation exists.

For third-party inference (PyTorch MPS, MLX, llama.cpp): all computation runs on the GPU through Metal. The ANE is not accessible to these frameworks. The 38 TOPS ANE number is irrelevant for LLM benchmarks outside Core ML.

Q3: Walk through the memory bandwidth arithmetic for LLM inference on an M3 Max. Why is bandwidth the bottleneck, not FLOPS?

LLM autoregressive generation processes one token at a time (batch size 1 in the simplest case). Each forward pass requires reading all model weights from DRAM once. For a 7B parameter model in BF16 (2 bytes/param): 7B * 2 bytes = 14GB per token.

FLOPS count: each parameter participates in roughly 2 multiply-add operations per forward pass = 14 billion FLOPs per token.

M3 Max has approximately 14 TFLOPS GPU compute and 400 GB/s bandwidth.

Time to generate one token:

  • Compute bound: 14e9 FLOPS / 14e12 FLOPS/s = 1 ms = 1000 tokens/sec
  • Memory bound: 14e9 bytes / 400e9 bytes/s = 35 ms = 28 tokens/sec

The memory-bound estimate (28 tokens/sec) matches the order of magnitude of what is actually observed for an unquantized 7B model; the compute-bound estimate (1000 tokens/sec) is far higher. Therefore memory bandwidth is the bottleneck, not FLOPS. This is why buying more GPU cores does not help - you are already compute-idle while waiting for weight data from DRAM.

This is also why 4-bit quantization (from 14GB to 3.5GB model size) roughly 4x's inference speed - it reduces the memory bottleneck proportionally.

Q4: What is the difference between using MLX and llama.cpp for local LLM inference on Apple Silicon? When would you choose each?

Both frameworks use Metal to run on the Apple Silicon GPU. Key differences:

llama.cpp:

  • C++ implementation, lower overhead per token
  • GGUF quantization format - very wide model availability on Hugging Face
  • More mature Metal kernels for common quantization types (Q4_K_M, Q5_K_M, Q8_0)
  • Server mode (llama-server) for HTTP API serving
  • Better performance benchmarks for pure generation throughput
  • Choose llama.cpp for: running GGUF models, serving via HTTP API, maximum generation throughput

MLX:

  • Python-native framework, Pythonic API
  • Better for training and fine-tuning experiments (supports gradients)
  • Better integration with Hugging Face Transformers format
  • Can run custom Python-level model modifications without C++ knowledge
  • mlx-lm provides high-level LLM interface
  • Choose MLX for: Python workflows, fine-tuning experiments, custom model modifications, Hugging Face model format

For a developer using local LLMs for productivity (chat, coding help), llama.cpp via LM Studio or Ollama is easier and faster. For an ML engineer building custom inference pipelines or running fine-tuning experiments, MLX is the better tool.

Q5: How does thermal throttling affect Apple Silicon inference performance, and how would you design a local inference server on Apple hardware to handle this?

Apple Silicon laptops throttle when the chip's thermal output exceeds what passive cooling (MacBook Air) or active cooling (MacBook Pro fan) can dissipate. During sustained inference:

  • MacBook Air: throttles within 2-3 minutes of heavy load. Generation speed drops 30-50%.
  • MacBook Pro: fan activates, sustains near-peak for 15-30 minutes, then mild throttling.
  • Mac Mini / Mac Studio: designed for sustained workloads. Near-peak performance indefinitely.

Design considerations for a production local inference server:

  • Use Mac Mini, Mac Studio, or Mac Pro - not MacBook
  • Monitor chip temperature via sudo powermetrics --samplers smc and set alerting
  • Rate-limit concurrent inference requests to keep sustained power below throttling threshold
  • Use GGUF Q4_K_M instead of Q8_0 - generates fewer FLOPs per token, lower sustained power draw
  • Implement request queuing rather than parallel inference - Apple Silicon chips do not scale efficiently with multiple concurrent generation requests (unlike vLLM's continuous batching on GPU); see the sketch after this list
  • Monitor thermal state programmatically and return 503 if temperature exceeds threshold
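A minimal sketch of the request-queuing point (illustrative only; run_inference is a placeholder for whatever llama.cpp or mlx-lm call actually generates text):

import asyncio

# Serialize generation: one in-flight request at a time keeps sustained
# power below the throttling threshold and avoids inefficient concurrency.
_inference_slot = asyncio.Semaphore(1)

async def handle_request(prompt: str) -> str:
    async with _inference_slot:
        # Run the blocking generation call in a thread so the event loop stays responsive.
        return await asyncio.to_thread(run_inference, prompt)

def run_inference(prompt: str) -> str:
    # Placeholder: call llama-server, mlx_lm.generate, etc. here.
    return f"(generated text for: {prompt})"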

Q6: When would you recommend an M4 Max MacBook Pro over a cloud GPU for LLM inference? What are the breakeven economics?

The M4 Max at $3,499-$3,999 makes economic sense when:

  1. Privacy requirements are non-negotiable: healthcare, legal, defense, IP-sensitive industries where sending data to OpenAI/Anthropic APIs is prohibited. The hardware cost is a one-time expense, not an ongoing API bill.

  2. Long-running heavy personal use: frontier-model APIs charge on the order of a few dollars per million tokens, so tens of millions of tokens per month of personal and agent use adds up to thousands of dollars per year. At that volume the MacBook pays for itself in one to two years (see the breakeven sketch after this list).

  3. Developer iteration with no latency tolerance: locally running LLMs have 0ms network latency. At 40 tokens/sec with no network round-trip, the perceived responsiveness often beats a cloud API on a slow connection.

  4. 70B model access: running LLaMA 3 70B locally at 10-15 tok/sec is viable on M Ultra hardware. Via API, 70B-class models cost 5-10x more per token than 7B models. For many tasks, 70B quality at local speed beats 7B quality at API speed.
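A rough breakeven sketch (plain Python; the hardware price is taken from this question, while the blended API price and monthly token volume are assumptions to replace with your own numbers):

# Months until local hardware costs less than continued API usage.
hardware_cost = 3_999.0       # M4 Max MacBook Pro, high-memory configuration
api_price_per_mtok = 5.0      # assumed blended $/1M tokens for a frontier API
monthly_tokens_m = 60.0       # assumed 60M tokens/month of personal + agent use

monthly_api_cost = api_price_per_mtok * monthly_tokens_m   # $300/month
print(f"Breakeven after ~{hardware_cost / monthly_api_cost:.0f} months")   # ~13 months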

When cloud GPU wins:

  • Training: H100 clusters are orders of magnitude faster for training
  • High-concurrency serving: one H100 serving 100 concurrent users beats 100 M4 Max machines
  • Need for the absolute latest model: Anthropic/OpenAI's frontier models are not available for local deployment

Summary

Apple Silicon's unified memory architecture fundamentally changes the economics of local LLM inference. By eliminating the CPU-GPU PCIe bottleneck and making all compute units - CPU, GPU, and ANE - share the same physical DRAM, Apple Silicon allows consumer hardware to run models that would require expensive high-VRAM discrete GPUs in a traditional setup.

The key insight is that LLM autoregressive generation is memory-bandwidth-bound, not compute-bound. A 7B model at batch size 1 reads 14GB of weights per token. Memory bandwidth determines tokens per second, not GPU FLOPS. Apple Silicon's unified memory architecture means all 400-800 GB/s of bandwidth is available to the GPU without any PCIe bottleneck - and the model can be up to 128GB or 192GB, matching the chip's DRAM capacity.

The ANE provides dedicated matrix multiply hardware for Core ML models in iOS/macOS apps, but is not accessible to third-party frameworks. For LLM inference, the relevant tools are MLX (Python, gradients, fine-tuning) and llama.cpp (C++, maximum throughput, HTTP serving). PyTorch MPS works for standard models but has gaps for production inference.

Apple Silicon wins for local inference under power and portability constraints, for models exceeding discrete GPU VRAM, and for privacy-sensitive deployments. It loses for training, high-throughput multi-user serving, and research workloads requiring cutting-edge CUDA ecosystem tools.
