MLX for Apple Silicon
The Production Scenario
It is 11:47 PM on a Tuesday. A senior ML engineer at a fintech startup is on-call. An anomaly detection model has started producing garbage outputs in production - not catastrophically wrong, but wrong enough that risk management has flagged it. The on-call runbook says to run manual analysis using the internal LLM assistant to cross-check the model's reasoning.
The problem: the internal LLM assistant requires VPN access to the company's cloud inference cluster, and the VPN certificate has expired. The engineer cannot renew it until business hours. The model analysis has to happen now.
The engineer has a 16-inch MacBook Pro M3 Max sitting in front of them. They installed MLX three weeks ago following a company lunch-and-learn. They pull up a terminal, type four commands, and within two minutes they have LLaMA 3.1 8B running locally at 38 tokens per second with full context - no VPN, no cloud, no waiting. The anomaly gets analyzed. The on-call incident closes at 12:23 AM.
This is the practical case for MLX. Not a benchmark. Not a demo. A real engineer in a real situation where cloud access failed and local inference saved the incident response window. The story repeats across organizations every week: data that cannot leave the building, compliance requirements that prohibit cloud processing, air-gapped environments, and the simple need to work when the internet is unreliable.
Apple Silicon changed what "local inference" means. Before M-series chips, local inference meant either running a tiny quantized model on CPU at 3-4 tokens per second (painful) or having a dedicated GPU in a desktop (expensive, not portable). M-series chips collapsed that tradeoff. The unified memory architecture means an M2 Ultra with 192 GB RAM can run 70B parameter models. An M3 MacBook Pro with 36 GB RAM handles 13B models comfortably. The hardware was ready. MLX is the software framework built specifically to use it.
Why This Exists
The Problem Before MLX
When Apple Silicon launched in 2020, the ML ecosystem had a serious problem: it did not know what to do with unified memory.
PyTorch's model for GPU acceleration assumes a strict CPU/GPU boundary. You allocate tensors on CPU, then explicitly move them to GPU with .to('cuda') or .to('mps'). Operations happen in device memory. Results get copied back. This model made perfect sense for discrete GPUs where CPU memory and GPU memory are physically separate - copying is expensive and explicit, so you want to be deliberate about it.
Apple Silicon has no such boundary. The CPU, GPU, and Neural Engine all share the same physical memory pool. A tensor created on the CPU is immediately accessible to the GPU without any copy. But PyTorch's MPS (Metal Performance Shaders) backend still models the operation as if there were a device boundary: it allocates memory through Metal's abstractions, and the result is a framework that works but feels like a square peg in a round hole.
The consequence was performance that underdelivered on the hardware's potential. llama.cpp did better because it operated at a lower level, using Metal shaders directly and bypassing PyTorch's abstraction overhead. But llama.cpp is a C++ project designed for inference, not for research workflows or fine-tuning. Python researchers working on Apple Silicon had no good option.
Frameworks designed for research (PyTorch, JAX) had backends that treated Apple Silicon like a second-class GPU. Frameworks designed for inference (llama.cpp) had no Python API suitable for experimentation and fine-tuning.
What MLX Solves
Apple released MLX in December 2023, built internally by Apple's ML research team. The core design premise is simple: build a framework that treats unified memory as the default, not as an afterthought.
MLX exposes a NumPy-like Python API. Arrays exist in a single flat memory space. There is no .to('device') call because there is no device boundary. When you run a matrix multiply, MLX schedules it on whichever compute unit (CPU, GPU, Neural Engine) is appropriate - transparently. The programmer thinks about operations, not memory management.
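A minimal sketch of what that looks like in practice - only basic MLX calls (array creation, a matmul, mx.eval) are used here:
import mlx.core as mx
# Arrays live in unified memory - there is no .to('cuda') or .to('mps') step
w = mx.random.normal((4096, 4096))
x = mx.random.normal((1, 4096))
# The matmul runs on the GPU by default; the CPU reads the result from the
# same physical memory without any copy
y = x @ w
mx.eval(y)        # force evaluation (MLX is lazy - see below)
print(y.shape)    # (1, 4096)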
Beyond memory, MLX added two features that make it genuinely better than PyTorch for Apple Silicon inference and fine-tuning:
- Lazy evaluation - operations are not executed immediately when you call them. MLX builds a computation graph and executes it in a batch. This enables aggressive kernel fusion: multiple operations get merged into single Metal shader passes, eliminating intermediate allocations.
- Function transformation API - `mx.grad()` for gradients, `mx.vmap()` for vectorized maps, `mx.compile()` for function compilation. The functional API makes fine-tuning workflows that in PyTorch require multiple abstractions (DataLoader, Trainer, optimizer state management) expressible in clean, readable Python (see the small example below).
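As a small illustration of the functional style (a sketch, not taken from the MLX docs), a gradient function is just a transformation of an ordinary Python function:
import mlx.core as mx
def loss_fn(w, x, y):
    # Mean squared error of a linear model
    pred = x @ w
    return mx.mean((pred - y) ** 2)
# mx.grad returns a new function computing d(loss)/d(w) (the first argument)
grad_fn = mx.grad(loss_fn)
w = mx.random.normal((8, 1))
x = mx.random.normal((32, 8))
y = mx.random.normal((32, 1))
grads = grad_fn(w, x, y)  # same shape as w
print(grads.shape)        # (8, 1)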
The result is a framework where LLM inference typically runs two to three times faster than PyTorch MPS on the same hardware (see the comparison below), and fine-tuning workflows are significantly simpler to write and debug.
Historical Context
Who Built It and Why
MLX was developed by Apple's Machine Learning Research team, primarily by Awni Hannun and collaborators. Hannun is known in the ML community for his work on end-to-end speech recognition (the Deep Speech line of research) and, more recently, for efficient on-device inference tooling.
The "aha moment" for MLX's design came from a basic observation about how memory bandwidth affects transformer inference. In transformer models doing autoregressive generation, each forward pass reads all model weights from memory and produces a small amount of output (often just one token). The operation is almost entirely memory-bandwidth-bound - the limiting factor is how fast you can move weights from memory to the compute units, not how fast the compute units are.
Apple Silicon's unified memory architecture has remarkable memory bandwidth. The M3 Pro has 150 GB/s. The M3 Max reaches 400 GB/s. These numbers are in the range of many consumer GPUs used for ML. A datacenter A100 delivers roughly 2,000 GB/s of HBM bandwidth, but that bandwidth only helps for what fits in its VRAM - anything larger has to cross PCIe. The bottleneck for transformer inference is feeding weights to the compute units, and Apple Silicon's unified architecture gives every compute unit the full memory bandwidth with no CPU-GPU transfer at all.
Hannun's insight: if unified memory is designed right, the MacBook is not a "consumer device running AI on the side" - it is a competitive inference platform. But you need a framework that actually uses the architecture rather than papering over it with GPU abstractions designed for discrete cards.
MLX was open-sourced on GitHub in December 2023 and within six months had become the primary framework for running and fine-tuning LLMs on Apple Silicon, with the mlx-community on HuggingFace maintaining thousands of converted model weights.
The Unified Memory Architecture - Why It Matters for LLMs
How Traditional GPU Inference Works
To understand what MLX does differently, you need to understand the traditional GPU inference pipeline:
Traditional GPU Inference (NVIDIA):
CPU RAM PCIe Bus GPU VRAM
[weights] --> [copy ~32 GB/s] --> [weights]
[KV cache] --> [copy] --> [KV cache]
[input] --> [copy] --> [input]
[compute]
[copy back] <-- [output logits]
Every inference call involves PCIe transfers. PCIe Gen 4 x16 delivers roughly 32 GB/s in each direction. For a 7B parameter model in 4-bit quantization, the weights alone are ~4 GB. Loading them takes roughly 125 milliseconds even at full PCIe bandwidth. In practice you keep the model loaded in VRAM and only transfer inputs and outputs - but VRAM is a fixed resource you are fighting over.
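As a quick sanity check of that arithmetic (using the ~4 GB weight size and ~32 GB/s PCIe figures above):
# Back-of-envelope: time to push 4-bit 7B weights over PCIe Gen 4 x16
weights_gb = 4.0      # ~7B parameters at 4 bits each
pcie_gb_per_s = 32.0  # approximate PCIe Gen 4 x16 bandwidth
print(f"{weights_gb / pcie_gb_per_s * 1000:.0f} ms")  # ~125 ms per full transfer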
The deeper problem: if your model exceeds GPU VRAM, you either quantize more aggressively, use offloading (which hits that PCIe bottleneck repeatedly during inference), or split across multiple GPUs (expensive and complex).
Apple Silicon Unified Memory
Apple Silicon Unified Memory:
Single Physical Memory Pool
+---------------------------------+
| Weights | KV Cache | Input |
+---------------------------------+
| | |
[CPU cores] [GPU cores] [Neural Engine]
| | |
All units access same physical addresses
No copy required - zero-copy access
There is no copy. The weights sit in memory once. The GPU cores read them directly from the same physical addresses the CPU wrote them to. The Neural Engine does the same. When the output logits are computed, the CPU can read them without any transfer.
For transformer inference this translates to:
- No memory double-booking - you do not need RAM for the system AND VRAM for the model. 64 GB of unified memory is 64 GB for everything.
- Larger context windows - the KV cache grows with sequence length. With 64 GB unified memory you can maintain much longer conversations without cache eviction.
- Faster weight loading - weights load from unified memory at the full memory bus bandwidth, not PCIe speeds.
Memory Bandwidth Math
For a 7B parameter model in 4-bit quantization (4 bits per parameter), the weights occupy roughly 7 billion * 0.5 bytes ≈ 3.5 GB.
At each forward pass, all weights must be read once, so generation at batch size 1 is memory-bandwidth-bound:
- For an M3 Pro (150 GB/s): 150 / 3.5 ≈ 43 tokens per second theoretical ceiling
- For an M3 Max (400 GB/s): 400 / 3.5 ≈ 114 tokens per second theoretical ceiling
Real-world numbers fall short of these ceilings due to compute overhead, KV-cache reads, and memory access patterns, but they explain why an M3 Max sustains 80+ tok/s on 7B models while an M3 Pro sits around 40 tok/s. The bandwidth is the ceiling.
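The same ceiling as a two-line calculation (weights-only, batch size 1; real decoding also reads the KV cache, so treat these as upper bounds):
# Theoretical decode ceiling = memory bandwidth / bytes read per generated token
weights_gb = 3.5  # 7B parameters * 0.5 bytes (4-bit)
for chip, bandwidth_gb_s in [("M3 Pro", 150), ("M3 Max", 400)]:
    print(f"{chip}: ~{bandwidth_gb_s / weights_gb:.0f} tok/s ceiling")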
MLX Architecture - How It Works
Lazy Evaluation in Practice
When you write:
import mlx.core as mx
a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
c = a + b # Not executed yet
d = c * 2.0 # Not executed yet
e = mx.sum(d) # Not executed yet
mx.eval(e) # Execute entire graph at once
The three operations (+, *, sum) are fused into a single Metal kernel. The intermediate arrays c and d are never materialized in memory - the GPU processes the entire computation in a single pass. This is kernel fusion, and it dramatically reduces memory bandwidth consumption and kernel launch overhead.
For transformer inference, this matters enormously. Attention computation involves many elementwise operations, softmax, and matrix multiplies. Without fusion, each operation launches a separate kernel and materializes intermediate results. With MLX's lazy evaluation and fusion, many of these collapse into fewer, larger operations.
The MLX vs PyTorch MPS Comparison
The practical difference: on an M3 Pro running LLaMA 7B, PyTorch MPS achieves roughly 15-20 tok/s. MLX achieves 40-50 tok/s on the same hardware for the same model - a 2-3x speedup from better use of the same silicon.
Setting Up MLX
Installation
MLX requires Python 3.9+ and macOS 13.5+ (Ventura or later). It runs on any Apple Silicon Mac - M1, M2, M3, or M4 series.
# Install MLX
pip install mlx
# Install mlx-lm for LLM inference
pip install mlx-lm
# Verify installation
python -c "import mlx.core as mx; print(mx.default_device())"
# Expected output: Device(gpu, 0)
First Inference - LLaMA 3.1 8B
The mlx-lm package provides a high-level interface for LLM inference. The mlx-community organization on HuggingFace hosts pre-converted models ready to use:
from mlx_lm import load, generate
# Load the model (downloads on first run, cached afterwards)
# mlx-community hosts MLX-format models - no conversion needed
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
# Simple generation
response = generate(
model,
tokenizer,
prompt="Explain transformer attention in one paragraph",
max_tokens=200,
verbose=True # Shows tokens/sec
)
print(response)
Expected output on M3 Pro (18 GB):
Transformer attention allows each token to attend to all other tokens...
[Generated 200 tokens at 38.4 tok/s]
Command Line Interface
mlx-lm also ships a CLI for quick testing:
# Generate text from command line
python -m mlx_lm.generate \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--prompt "What is the capital of France?" \
--max-tokens 100
# Interactive chat
python -m mlx_lm.generate \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--prompt "" \
--chat-template # Uses the model's built-in chat template
Loading Models with Custom Parameters
from mlx_lm import load, generate
# Load with explicit configuration
model, tokenizer = load(
"mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
tokenizer_config={
"trust_remote_code": False,
"use_fast": True
}
)
# Format as chat messages using the model's chat template
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is gradient descent?"}
]
# Apply chat template
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
# Generate with full parameter control
response = generate(
model,
tokenizer,
prompt=prompt,
max_tokens=512,
temp=0.7, # Temperature
top_p=0.9, # Top-p sampling
repetition_penalty=1.1,
verbose=True
)
Streaming Generation
For production applications that need to display tokens as they arrive:
from mlx_lm import load, stream_generate
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
messages = [
{"role": "user", "content": "Write a Python function to sort a list"}
]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
# Stream tokens as they are generated
print("Response: ", end="", flush=True)
for token_text in stream_generate(
model,
tokenizer,
prompt=prompt,
max_tokens=400
):
print(token_text, end="", flush=True)
print() # Final newline
MLX Quantization Formats
Understanding the Options
MLX supports several quantization schemes. The mlx-community HuggingFace organization hosts models in standardized formats:
| Format | Bits per weight | Size (7B) | Speed (M3 Pro) | Quality loss |
|---|---|---|---|---|
| fp16 | 16 bits | 14 GB | 18 tok/s | None |
| 8-bit | 8 bits | 7 GB | 28 tok/s | Negligible |
| 4-bit | 4 bits | 3.5 GB | 45 tok/s | Minor |
| 3-bit | 3 bits | 2.6 GB | 52 tok/s | Moderate |
The naming convention on HuggingFace follows: mlx-community/Model-Name-Xbit where X is the quantization level.
For most use cases, 4-bit is the sweet spot: it runs fast, fits large models into available memory, and preserves enough precision for instruction following and reasoning tasks.
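The sizes in the table follow directly from bits per weight. A rough sketch of the arithmetic (it ignores quantization scales and embedding layers, which add a little on top):
def approx_model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk size of the quantized weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9
for bits in (16, 8, 4, 3):
    print(f"{bits}-bit 7B: ~{approx_model_size_gb(7, bits):.1f} GB")
# 16-bit ~14.0 GB, 8-bit ~7.0 GB, 4-bit ~3.5 GB, 3-bit ~2.6 GB - matching the table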
Converting Your Own Models
If you have a HuggingFace model that is not yet in the mlx-community, you can convert it:
# Convert HuggingFace model to MLX format (4-bit quantization)
python -m mlx_lm.convert \
--hf-path meta-llama/Llama-3.1-8B-Instruct \
--mlx-path ./llama-3.1-8b-mlx-4bit \
--quantize \
--q-bits 4
# Convert without quantization (fp16, for maximum quality)
python -m mlx_lm.convert \
--hf-path meta-llama/Llama-3.1-8B-Instruct \
--mlx-path ./llama-3.1-8b-mlx-fp16
The conversion process:
- Loads the original HuggingFace weights (PyTorch `.safetensors` format)
- Transposes weight matrices to match MLX's preferred memory layout
- Applies quantization using MLX's group quantization scheme
- Saves MLX-format `.safetensors` along with a `config.json` and `tokenizer.json`
# Programmatic conversion with Python API
from mlx_lm import convert
convert(
hf_path="mistralai/Mistral-7B-Instruct-v0.3",
mlx_path="./mistral-7b-mlx",
quantize=True,
q_bits=4,
q_group_size=64 # Group size for quantization - smaller = more accurate
)
Uploading to HuggingFace
Contributing converted models back to the community:
# Install HuggingFace Hub CLI
pip install huggingface_hub
# Login
huggingface-cli login
# Upload converted model
python -m mlx_lm.convert \
--hf-path meta-llama/Llama-3.1-8B-Instruct \
--mlx-path ./llama-3.1-8b-mlx-4bit \
--quantize \
--q-bits 4 \
--upload-repo your-username/Llama-3.1-8B-Instruct-4bit-MLX
Fine-Tuning with MLX LoRA
One of MLX's killer features is that it supports not just inference but actual on-device fine-tuning. LoRA (Low-Rank Adaptation) makes this practical.
How LoRA Works in MLX
LoRA adds trainable low-rank matrices to frozen pretrained weights. Instead of fine-tuning all parameters of a weight matrix $W \in \mathbb{R}^{d \times k}$, you train two smaller matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ where $r \ll \min(d, k)$, and the layer uses $W + BA$.
The number of trainable parameters drops from $d \times k$ to $r \times (d + k)$. For a 4096x4096 weight matrix with rank 8: $4096 \times 4096 = 16{,}777{,}216$ versus $8 \times (4096 + 4096) = 65{,}536$.
That is a 256x reduction in trainable parameters. An 8B model that requires 32 GB of VRAM for full fine-tuning on a GPU can be fine-tuned with LoRA in under 8 GB of unified memory.
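The reduction worked out in code (dimensions and rank match the example above):
d, k, r = 4096, 4096, 8
full_params = d * k        # parameters in the original weight matrix
lora_params = r * (d + k)  # parameters in the two low-rank factors B and A
print(full_params, lora_params, full_params // lora_params)
# 16777216 65536 256  -> the 256x reduction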
Setting Up Training Data
MLX LoRA expects JSONL format with one example per line:
{"prompt": "Classify this email as spam or not spam: 'Congratulations you won!'", "completion": "Spam"}
{"prompt": "Classify this email as spam or not spam: 'Meeting rescheduled to 3pm'", "completion": "Not spam"}
{"prompt": "Classify this email as spam or not spam: 'Your account has been limited'", "completion": "Spam"}
Or for instruction-following format:
{"messages": [{"role": "user", "content": "What is the boiling point of water?"}, {"role": "assistant", "content": "100 degrees Celsius at standard atmospheric pressure."}]}
Running LoRA Fine-Tuning
# Fine-tune with mlx-lm LoRA trainer
python -m mlx_lm.lora \
--model mlx-community/Llama-3.2-3B-Instruct-4bit \
--train \
--data ./my_data \
--iters 1000 \
--batch-size 4 \
--lora-layers 8 \
--learning-rate 1e-4 \
--save-every 100 \
--adapter-path ./my_adapters
Directory structure expected:
my_data/
train.jsonl # Training examples
valid.jsonl # Validation examples (optional)
test.jsonl # Test examples (optional)
Python API for Fine-Tuning
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
from mlx.utils import tree_flatten
from mlx_lm import load
from mlx_lm.tuner import LoRALinear
from mlx_lm.tuner.datasets import load_dataset
from mlx_lm.tuner.trainer import train, TrainingArgs
# Load base model
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
# Apply LoRA adapters to specific layers
# lora_layers=8 means the last 8 transformer layers get LoRA adapters
model.freeze() # Freeze all parameters
lora_layers = 8
for layer in model.model.layers[-lora_layers:]:
layer.self_attn.q_proj = LoRALinear.from_linear(
layer.self_attn.q_proj, rank=8
)
layer.self_attn.v_proj = LoRALinear.from_linear(
layer.self_attn.v_proj, rank=8
)
# Verify trainable parameter count
trainable = sum(
    v.size for _, v in tree_flatten(model.trainable_parameters())
)
total = sum(v.size for _, v in tree_flatten(model.parameters()))
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)")
# Training arguments
args = TrainingArgs(
batch_size=4,
iters=500,
val_batches=25,
steps_per_report=10,
steps_per_eval=100,
save_every=100,
adapter_path="./adapters",
max_seq_length=512,
learning_rate=1e-4
)
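# train_set and valid_set are the tokenized datasets built from the JSONL files
# described above (for example via the load_dataset helper imported earlier);
# they are assumed to already exist at this point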
# Train
optimizer = optim.Adam(learning_rate=args.learning_rate)
train(model, tokenizer, optimizer, train_set, valid_set, args)
Running Inference with Fine-Tuned Adapters
from mlx_lm import load, generate
# Load base model with custom adapters
model, tokenizer = load(
"mlx-community/Llama-3.2-3B-Instruct-4bit",
adapter_path="./my_adapters" # Path to saved LoRA adapters
)
# Inference works exactly like normal
response = generate(
model,
tokenizer,
prompt="Classify this email: 'Your credit card payment is due'",
max_tokens=50
)
Fusing Adapters for Production
After training, fuse adapters into the base weights for slightly faster inference:
python -m mlx_lm.fuse \
--model mlx-community/Llama-3.2-3B-Instruct-4bit \
--adapter-path ./my_adapters \
--save-path ./fused_model
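After fusing, the output directory is a standalone MLX model, so it loads without an adapter path (a minimal sketch):
from mlx_lm import load, generate
# The fused directory contains the merged weights, config, and tokenizer
model, tokenizer = load("./fused_model")
print(generate(model, tokenizer, prompt="Classify this email: 'You won a prize!'", max_tokens=20))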
Benchmarking MLX Performance
Real-World Numbers on M-Series Chips
The following benchmarks were run using mlx-lm with default settings, measuring tokens per second during generation (not prefill):
LLaMA 3.1 8B - 4-bit quantization
| Chip | Unified RAM | Tok/s |
|---|---|---|
| M1 | 16 GB | 18 tok/s |
| M2 | 16 GB | 22 tok/s |
| M2 Pro | 32 GB | 28 tok/s |
| M3 | 16 GB | 30 tok/s |
| M3 Pro | 36 GB | 38 tok/s |
| M3 Max | 48 GB | 65 tok/s |
| M2 Ultra | 192 GB | 55 tok/s |
LLaMA 3.1 70B - 4-bit quantization (requires 40 GB+ RAM)
| Chip | Unified RAM | Tok/s |
|---|---|---|
| M2 Ultra | 192 GB | 9 tok/s |
| M3 Max | 128 GB | 12 tok/s |
| M4 Max | 128 GB | 18 tok/s |
Writing a Benchmark Script
import time
import mlx.core as mx
from mlx_lm import load, generate
def benchmark_model(
model_name: str,
prompt: str = "Tell me about the history of computing",
n_tokens: int = 200,
n_runs: int = 3
) -> dict:
"""Benchmark an MLX model for tokens/second"""
print(f"Loading {model_name}...")
model, tokenizer = load(model_name)
# Warmup run (JIT compilation happens here)
print("Warming up (first run includes JIT compilation)...")
_ = generate(model, tokenizer, prompt=prompt, max_tokens=20)
# Benchmark runs
times = []
for i in range(n_runs):
start = time.perf_counter()
response = generate(
model,
tokenizer,
prompt=prompt,
max_tokens=n_tokens,
verbose=False
)
end = time.perf_counter()
elapsed = end - start
# Count actual tokens generated
output_tokens = len(tokenizer.encode(response))
tok_per_sec = output_tokens / elapsed
times.append(tok_per_sec)
print(f" Run {i+1}: {tok_per_sec:.1f} tok/s")
return {
"model": model_name,
"mean_tok_per_sec": sum(times) / len(times),
"max_tok_per_sec": max(times),
"min_tok_per_sec": min(times)
}
# Run benchmark
result = benchmark_model(
"mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
n_tokens=200,
n_runs=3
)
print(f"\nResults for {result['model']}:")
print(f" Mean: {result['mean_tok_per_sec']:.1f} tok/s")
print(f" Best: {result['max_tok_per_sec']:.1f} tok/s")
MLX vs llama.cpp Comparison
Key takeaway: llama.cpp is still roughly 10-15% faster on pure inference throughput due to lower-level Metal optimizations. MLX wins when you need a Python-native workflow, fine-tuning, or rapid experimentation.
Building a Production Application with MLX
A Simple REST API Wrapper
from fastapi import FastAPI
from pydantic import BaseModel
from mlx_lm import load, generate, stream_generate
from fastapi.responses import StreamingResponse
import json
app = FastAPI()
# Load model once at startup - expensive operation
print("Loading model...")
MODEL, TOKENIZER = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
print("Model ready")
class ChatRequest(BaseModel):
messages: list[dict]
max_tokens: int = 512
temperature: float = 0.7
stream: bool = False
@app.post("/chat")
async def chat(request: ChatRequest):
# Format with chat template
prompt = TOKENIZER.apply_chat_template(
request.messages,
tokenize=False,
add_generation_prompt=True
)
if request.stream:
# Streaming response
def generate_stream():
for token in stream_generate(
MODEL,
TOKENIZER,
prompt=prompt,
max_tokens=request.max_tokens,
temp=request.temperature
):
chunk = {"choices": [{"delta": {"content": token}}]}
yield f"data: {json.dumps(chunk)}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(
generate_stream(),
media_type="text/event-stream"
)
else:
# Non-streaming
response = generate(
MODEL,
TOKENIZER,
prompt=prompt,
max_tokens=request.max_tokens,
temp=request.temperature,
verbose=False
)
return {"choices": [{"message": {"role": "assistant", "content": response}}]}
# Run with: uvicorn app:app --host 0.0.0.0 --port 8080
Memory Management
For applications that need to load and unload models dynamically:
import gc
import mlx.core as mx
from mlx_lm import load
def unload_model(model, tokenizer):
"""Properly unload an MLX model and free memory"""
del model
del tokenizer
gc.collect()
# Force MLX to release cached GPU memory
mx.metal.clear_cache()
# Check memory usage
def get_memory_stats():
    """Get current MLX memory usage"""
    return {
        "peak_memory_gb": mx.metal.get_peak_memory() / (1024**3),
        "active_memory_gb": mx.metal.get_active_memory() / (1024**3),
    }
# Usage
print(f"Memory before load: {get_memory_stats()}")
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
print(f"Memory after load: {get_memory_stats()}")
unload_model(model, tokenizer)
print(f"Memory after unload: {get_memory_stats()}")
Production Engineering Notes
Memory Planning
The rough rule for MLX memory usage:
Required RAM = (model_size_on_disk * 1.2) + (context_length * num_layers * num_kv_heads * head_dim * 2 * 2 bytes)
For a 4-bit LLaMA 3.1 8B with 4096 context:
- Model: ~4.8 GB (4 GB weights plus ~20% MLX overhead)
- KV cache: 4096 * 32 layers * 8 KV heads * 128 head_dim * 2 (K and V) * 2 bytes ≈ 512 MB
- Total: ~5.3 GB
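The same rule as a helper function; the defaults below (32 layers, 8 KV heads, head_dim 128, fp16 cache) are the LLaMA 3.1 8B values assumed in the example above:
def required_ram_gb(model_disk_gb: float, context_len: int,
                    n_layers: int = 32, n_kv_heads: int = 8,
                    head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Rough unified-memory budget: weights (+20% overhead) plus the KV cache."""
    kv_cache_bytes = context_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem
    return model_disk_gb * 1.2 + kv_cache_bytes / 1024**3
print(f"{required_ram_gb(4.0, 4096):.1f} GB")  # ~5.3 GB for a 4-bit 8B at 4096 context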
Practical guidance by Mac model:
- MacBook Air M3 8 GB: 4-bit models up to 4B parameters
- MacBook Pro M3 16 GB: 4-bit models up to 8B parameters (8B 4-bit uses ~5 GB)
- MacBook Pro M3 Pro 36 GB: 4-bit models up to 13B, fp16 models up to 7B
- MacBook Pro M3 Max 48 GB: 4-bit models up to 34B parameters
- Mac Studio M2 Ultra 192 GB: 4-bit models up to 70B+ parameters
Thermal Throttling
Sustained inference on MacBook Pros will trigger thermal throttling after 10-15 minutes. Watch for:
# Monitor performance over time - detect throttling
import time
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
for i in range(10):
prompt = "Count from 1 to 100 in words"
start = time.time()
generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=False)
elapsed = time.time() - start
print(f"Iteration {i+1}: {200/elapsed:.1f} tok/s")
time.sleep(5) # Brief pause between runs
If you see tok/s drop by 20%+ after iteration 5-6, the chip is throttling. For sustained production workloads on MacBooks, cap generation to ~60-70% of peak speed to avoid thermal issues. Mac Studio and Mac Pro with active cooling handle sustained loads much better.
Batch Size Considerations
Unlike GPU inference where batch size significantly increases throughput, MLX on Apple Silicon shows diminishing returns on batch size due to memory bandwidth saturation:
- Batch size 1: 38 tok/s on M3 Pro 8B 4-bit
- Batch size 4: 42 tok/s (only 10% increase)
- Batch size 8: 43 tok/s (diminishing further)
For production servers on Apple Silicon, optimize for low latency (batch size 1-2) rather than throughput (large batches). The hardware is designed for interactive workloads, not datacenter batch processing.
Common Mistakes
:::danger Model too large for available RAM Attempting to load a model that exceeds available unified memory will cause MLX to swap to disk, making inference 10-100x slower than expected - often completely unusable.
Check before loading:
import subprocess
# Get available memory
result = subprocess.run(
["sysctl", "hw.memsize"],
capture_output=True, text=True
)
total_gb = int(result.stdout.split(":")[1]) / (1024**3)
print(f"Total RAM: {total_gb:.0f} GB")
# Rule of thumb: model size (on disk) * 1.2 should be < 70% of total RAM
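# Hypothetical example value: a 4-bit Llama 3.1 8B is roughly 4 GB on disk
model_size_gb = 4.0
if model_size_gb * 1.2 > 0.7 * total_gb:
    print("Warning: this model is likely too large to run comfortably on this Mac")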
:::
:::danger Forgetting the warmup run The first MLX inference call triggers JIT compilation, which takes 5-20 seconds depending on the model. If you benchmark without a warmup run, your first-call numbers will be artificially slow. Always run at least one generation before benchmarking. :::
:::warning Mixing MLX and PyTorch in the same process
Loading both mlx and torch in the same Python process works but competes for Metal GPU resources. If you need both frameworks, prefer running them in separate processes. Memory usage will also be higher than expected if both frameworks initialize their Metal command queues simultaneously.
:::
:::warning Chat template is not optional
Different models use different chat templates. Passing raw text without the model's chat template often produces incoherent outputs for instruction-tuned models. Always use tokenizer.apply_chat_template() for chat/instruct models:
# Wrong - raw prompt for an instruct model
response = generate(model, tokenizer, prompt="What is Python?")
# Right - use chat template
messages = [{"role": "user", "content": "What is Python?"}]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt)
:::
:::warning Trust remote code on converted models
When loading models from mlx-community, the code is already trusted (converted by Apple/community). But if you convert your own models from private HuggingFace repos, set trust_remote_code=True with caution - only for models you have audited.
:::
Interview Q&A
Q: What is the fundamental architectural difference between Apple Silicon's unified memory and a discrete GPU setup, and why does this matter for LLM inference?
A: In a discrete GPU setup, CPU RAM and GPU VRAM are physically separate. Transferring data between them goes through PCIe, which has roughly 32 GB/s bandwidth for PCIe Gen 4 x16. This creates a bottleneck for LLM inference: model weights, KV cache, and intermediate activations must be explicitly moved between CPU and GPU memory. If a model exceeds VRAM capacity, you must either quantize more aggressively, accept slow inference via memory offloading, or add more GPUs.
Apple Silicon uses a single physical memory pool accessible by all compute units (CPU, GPU, Neural Engine) at the same physical addresses. There is no copy for CPU-to-GPU transfer because the concept of separate memory spaces does not exist. For LLM inference - which is heavily memory-bandwidth-bound - this means the bottleneck is simply the memory bandwidth of the unified pool, not PCIe. An M3 Max's 400 GB/s bandwidth is competitive with many discrete GPUs for this workload, in a laptop form factor, and the model is never constrained by a separate VRAM pool.
Q: How does MLX's lazy evaluation improve performance compared to eager execution frameworks like PyTorch?
A: Lazy evaluation means operations are not executed when called - they are recorded in a computation graph. When mx.eval() is called (explicitly or implicitly when a value is needed), the entire graph is optimized and executed.
The primary benefit is kernel fusion: multiple consecutive operations that would each require a separate GPU kernel dispatch in eager mode can be merged into a single kernel. For example, a linear layer followed by an activation function followed by a bias add becomes one fused Metal shader rather than three separate dispatches. This eliminates intermediate memory allocations (the output of layer 1 that only exists to be input to layer 2 is never materialized) and reduces GPU kernel launch overhead. In practice this delivers 20-40% speedups for transformer workloads compared to PyTorch MPS, which uses eager execution with Metal backend.
Q: Explain LoRA fine-tuning and why it is practical for Apple Silicon.
A: LoRA (Low-Rank Adaptation) fine-tunes a pretrained model by adding small trainable matrices to frozen pretrained weights. For a weight matrix $W \in \mathbb{R}^{d \times k}$, LoRA adds $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$.
This is practical on Apple Silicon for two reasons. First, memory: only the LoRA matrices and their gradients need to be in active training state - the base model weights are frozen and can remain in their quantized form. A 4-bit quantized 8B model uses ~4 GB; LoRA adds perhaps 200-400 MB of trainable parameters. Total training memory is 6-8 GB rather than 32+ GB for full fine-tuning. Second, speed: fewer parameters means fewer gradient computations and smaller optimizer states. Training converges in minutes on Apple Silicon for small datasets, not hours.
Q: When would you choose llama.cpp over MLX for Apple Silicon inference?
A: llama.cpp wins when: (1) you need maximum raw inference throughput and the 10-15% performance advantage matters, (2) you are deploying a production server that will serve many users via REST API (llama.cpp's server mode has better concurrent request handling), (3) you need cross-platform support - the same llama.cpp setup can move to Linux/NVIDIA with minimal changes, and (4) you do not need Python integration and prefer a compiled binary.
MLX wins when: (1) you are doing research or experimentation in Python and need to modify model code, (2) you want to fine-tune with LoRA - mlx-lm's training workflow is significantly simpler than llama.cpp's fine-tuning story, (3) you are building Python applications that integrate model inference with other Python libraries, or (4) you want access to raw tensor operations for custom model implementations.
For most individual developers, MLX is the better starting point because the Python API is more approachable. For production serving at scale, llama.cpp's server mode is typically preferred.
Q: How do you handle a model that is too large for your Mac's RAM?
A: Several options, in order of preference:
- Use more aggressive quantization - switch from 8-bit to 4-bit, or 4-bit to 3-bit. A 13B model at 3-bit uses ~5 GB versus 8 GB at 4-bit, at some quality cost.
- Use a smaller model - a well-tuned 3B model often outperforms a mediocre 13B model for narrow tasks. Llama 3.2 3B and Phi-3 Mini are strong options for constrained environments.
- Use model splitting - MLX supports model sharding across multiple Macs via mlx-lm's distributed utilities (experimental), though this requires the Macs to be on the same network.
- Accept the slowdown from swapping - if the model only slightly exceeds available RAM (say, a 36 GB model on a 32 GB system) and latency is not critical, macOS's unified memory paging will use NVMe as overflow at 3-6 GB/s. This is dramatically slower than RAM-resident inference but may be acceptable for batch processing.
The fundamental constraint is that on Apple Silicon, what fits in RAM is what runs well. Unlike NVIDIA CUDA where you can do layer offloading across CPU/GPU with moderate performance loss, Apple Silicon does not have a separate "fast tier" to offload to.
Q: What is the significance of mx.compile() in MLX, and when should you use it?
A: mx.compile() takes a Python function and compiles it into an optimized program of fused Metal kernels. The first call to a compiled function is slow (tracing and compilation happen), but subsequent calls with the same shapes skip the Python interpreter and graph construction entirely and execute the pre-compiled program directly.
Use mx.compile() when you have a function that will be called thousands of times with the same structure (shapes, dtypes) but different values - for example, a single forward pass function in a training loop. The compilation overhead pays off quickly:
import mlx.core as mx
def forward(x, weights):
    return mx.tanh(x @ weights)
# Compile for repeated calls
compiled_forward = mx.compile(forward)
x = mx.random.normal((32, 256))
w = mx.random.normal((256, 256))
x2 = mx.random.normal((32, 256))
# First call: slow (traces and compiles)
result = compiled_forward(x, w)
# Subsequent calls with the same shapes: fast (runs the pre-compiled program)
result = compiled_forward(x2, w)  # ~3-5x faster than uncompiled
Do not use mx.compile() for functions called infrequently, functions with variable shapes (compilation is shape-specific), or functions with Python side effects (compilation may change execution order). The mlx-lm library already applies compilation internally to its model forward pass, so for standard inference you get the benefit automatically.
