Edge and Mobile Inference
A Friday Night in San Francisco
It is 11:47 PM on a Friday. Your ML team just shipped a new feature - a live camera-based meal analyzer that estimates nutritional content from a photo. Marketing loved the demo. Product put it in the release notes. The VP sent a congratulatory Slack message.
Then your on-call phone rings.
The P0 alert says API latency has spiked to 4.2 seconds per request. Your inference backend - a cluster of A100 GPUs costing $14,000/month - is saturated. The dinner rush hit simultaneously across time zones: Tokyo, London, and New York users are all opening the app at the same time to photograph their late-night meals. The queue has 40,000 pending requests. Users are force-quitting.
You have three choices. You could spin up more GPU instances, burning $3,000 per hour in emergency cloud spend. You could rate-limit users, which kills the product launch. Or - if you had built this differently from the start - you could serve every single inference request from the user's own device, at zero marginal cost to you, with no network round-trip, and zero dependency on your servers.
That third option is edge inference. The model runs on the phone, on the laptop, on the smart camera at the restaurant entrance. The computation happens where the data lives. The server sees no traffic, no cost, no failure mode. Users get responses in under 200 milliseconds because the GPU is 2 centimeters from their thumb.
This lesson is about building that third option - not just making it work, but making it fast, efficient, and maintainable in production. The hardware landscape for edge inference has shifted dramatically. A modern iPhone has a 38-TOPS neural engine. A Jetson Orin has 275 TOPS. Consumer laptops with Apple M3 chips can run a 7B-parameter language model at 30 tokens per second. The compute exists. What remains is knowing how to use it correctly.
Why This Exists - The Problem With Centralized Inference
The default ML deployment model is centralized. You train a model, deploy it to a cloud endpoint, and every inference call crosses the public internet to reach your GPU cluster. This worked when models were small (a few hundred MB), when latency requirements were loose (a second or two was acceptable), and when privacy regulations were sparse.
All three of those conditions have changed.
The latency problem. Real-time applications - live translation, AR object detection, voice assistants, medical imaging on a handheld scanner - require inference in under 100 milliseconds from the user's perspective. Round-trip time to even the closest cloud region often consumes 50-80ms of that budget before computation even starts. A 20ms inference on an A100 becomes a 120ms user experience because of networking alone. You cannot optimize away the speed of light.
The privacy problem. GDPR, HIPAA, CCPA, and newer state-level regulations increasingly require that sensitive data (health information, biometrics, financial data, children's data) not leave the user's device. The easiest compliance path is to never transmit the data in the first place. On-device inference is the only architecture that is architecturally private - not just contractually private.
The scale problem. If you have 100 million users and each makes 10 inference calls per day, you need to serve 1 billion inferences daily. At even $0.002 per inference, that is $2 million per day in serving costs. On-device inference shifts that cost entirely to the user's hardware - which they already own and which they replace every 2-3 years anyway.
The connectivity problem. Industrial IoT, agricultural sensors, offshore equipment, aircraft entertainment systems, rural healthcare devices - all operate in environments where reliable internet connectivity cannot be assumed. A model that requires a server call fails silently in these environments. On-device inference works in airplane mode.
The industry's response to these pressures was the dedicated neural processing unit (NPU). Beginning around 2017, every major chip vendor began embedding dedicated silicon for matrix multiply and convolution workloads alongside their CPU and GPU cores. Today, nearly every phone sold includes 3-5 TOPS of dedicated AI compute. High-end mobile SoCs hit 38-87 TOPS. The hardware has arrived. What has not caught up is the engineering knowledge to use it effectively.
Historical Context - From Cortex-M to Neural Engine
The story of edge AI hardware begins not with a research lab but with a business problem at Apple.
In 2017, Siri was embarrassingly slow. Face ID was about to launch and needed to process face geometry in under 100ms to feel instantaneous. Apple's A-series CPUs and GPUs could run the required neural networks, but the power draw was unacceptable - Face ID running on the GPU would drain a significant portion of the battery on every unlock. Apple needed inference at single-digit milliwatts, not the 5-10 watts a GPU demands.
The A11 Bionic chip, released September 2017, contained the first commercial Neural Engine - two dedicated cores that could execute 600 billion operations per second while consuming far less power than the GPU for the same workload. The architecture was purpose-built for the specific patterns of neural network inference: fixed dataflow through layers, large matrix multiplications, and activation functions applied element-wise at each layer boundary.
Qualcomm had been quietly building similar technology. The Hexagon DSP, which appeared in Snapdragon chips from 2006 onward, was originally designed for audio and modem signal processing. The fixed-point arithmetic and SIMD execution units that made it efficient for signal processing turned out to be almost exactly what quantized neural network inference required. Qualcomm added dedicated "Tensor Accelerator" blocks to Hexagon starting with the 855 in 2018.
Google took a different path. The Pixel Visual Core (2017) and later Tensor chip (2021) built a custom image signal processor with neural network acceleration embedded in the image pipeline - purpose-built for the camera processing workloads that dominate mobile AI compute.
NVIDIA entered the edge market through autonomous vehicles rather than phones. The Jetson platform, beginning with Jetson TK1 in 2014, offered a full CUDA-compatible GPU in a 10-watt envelope, explicitly targeting robotics and embedded systems that needed GPU-class compute without GPU-class power draw. The Jetson Orin, released in 2022, reached 275 TOPS while fitting in a 60-watt power budget - roughly what a light bulb consumes.
The "aha moment" for the industry came when researchers at Stanford published the MobileNet paper in 2017 (Howard et al.). They showed that by replacing standard convolutions with depthwise separable convolutions, you could cut computation by 8-9x with only minor accuracy loss. This was not just model compression - it was a fundamentally different operation count profile that mapped onto hardware with limited compute budgets. MobileNet was not a smaller ResNet. It was a different architecture designed from the ground up for the hardware constraints of edge devices. That conceptual shift - designing for the hardware rather than shrinking for the hardware - became the foundation of edge ML engineering.
Core Concepts
Understanding the Edge Hardware Budget
The fundamental constraint of edge inference is the power envelope. A datacenter A100 GPU draws 400 watts. A laptop has a sustained power budget of 15-45 watts shared across CPU, GPU, RAM, storage, display, and wireless radios. A smartphone runs at 5-7 watts total. A microcontroller might have 100 milliwatts.
Power matters because it converts directly to heat, which causes thermal throttling, which causes inconsistent performance. A model that runs at 30 tokens/second for 10 seconds but throttles to 8 tokens/second after the device heats up provides a worse user experience than a model that runs at 15 tokens/second consistently. Sustained throughput under thermal constraints is the real metric, not peak throughput.
The relationship between power, performance, and efficiency is captured in a single figure of merit: TOPS/Watt = peak throughput (tera-operations per second) ÷ sustained power draw (watts).
Apple M3's Neural Engine delivers roughly 18 TOPS at approximately 1.5 watts - 12 TOPS/Watt. A datacenter H100 delivers 3,958 TOPS at 700 watts - 5.7 TOPS/Watt. The mobile NPU is more than twice as efficient per operation. This is the fundamental insight behind dedicated inference silicon: specialization trades programmability for efficiency.
The memory budget is equally constraining. A flagship smartphone has 6-12 GB of LPDDR5 RAM shared between the OS, all running apps, and the AI workload. The model, its KV cache (for autoregressive generation), and intermediate activations must all fit within a few gigabytes. For context:
- LLaMA-3-8B in float16: 16 GB - does not fit on most phones
- LLaMA-3-8B in INT8: 8 GB - fits on high-end phones with care
- LLaMA-3-8B in INT4: 4 GB - fits comfortably on iPhone 15 Pro (8 GB RAM)
- Phi-3-mini-4K in INT4: ~2 GB - fits on most modern smartphones
The math is simple but unforgiving: required memory ≈ parameter count × bytes per parameter, plus the KV cache and activation workspace on top - and all of it competes with the OS and other apps for the same RAM.
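As a quick illustration, here is that arithmetic in Python, assuming LLaMA-3-8B-style dimensions (32 layers, 8 KV heads via grouped-query attention, head dimension 128); real deployments also need headroom for activations and runtime overhead.
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * bytes_per_param
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, context: int,
                bytes_per_elem: int = 2) -> float:
    # Keys and values for every layer, KV head, and position (FP16 entries)
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9
for label, bpp in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    total = weights_gb(8, bpp) + kv_cache_gb(32, 8, 128, 4096)
    print(f"8B model in {label}: ~{total:.1f} GB including a 4K-token KV cache")
# 8B model in FP16: ~16.5 GB, INT8: ~8.5 GB, INT4: ~4.5 GB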
Mobile NPU Architectures
Apple Neural Engine (ANE)
The ANE in the M3 chip delivers 18 TOPS using a systolic array architecture similar to Google's TPU but optimized for the specific shapes of CoreML graph operations. The ANE is not programmable in the traditional sense - you cannot write custom kernels for it. Instead, Apple provides the CoreML compiler, which automatically lowers model operations onto ANE compute units where possible and falls back to CPU or GPU for unsupported operations.
What the ANE excels at:
- Fixed convolution patterns (same shapes as training)
- Matrix multiply with specific dimensions
- Standard activation functions (ReLU, sigmoid, softmax)
- Elementwise operations
What forces fallback to CPU/GPU:
- Dynamic shapes (variable sequence lengths that change between calls)
- Custom operations not in the CoreML op set
- Very small batch sizes that don't amortize NPU dispatch overhead
- Operations that require integer arithmetic the ANE doesn't support
The practical implication: if you design your model for ANE acceleration, use static shapes wherever possible. For LLMs, this means padding inputs to fixed context lengths or using chunked prefill with fixed-size chunks.
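A minimal sketch of the fixed-shape approach for LLM prefill follows; the chunk size, pad token, and tensor layout are illustrative assumptions, not any specific runtime's API.
import torch
CHUNK = 512   # fixed prefill chunk size the graph was compiled for (assumed)
PAD_ID = 0    # padding token id (model-specific assumption)
def fixed_shape_chunks(token_ids):
    """Split/pad a prompt into (1, CHUNK) tensors so every call sees the same shape."""
    for start in range(0, len(token_ids), CHUNK):
        chunk = token_ids[start:start + CHUNK]
        attention_mask = [1] * len(chunk) + [0] * (CHUNK - len(chunk))
        chunk = chunk + [PAD_ID] * (CHUNK - len(chunk))
        # Both tensors always have shape (1, CHUNK), so an ANE-compiled graph
        # never encounters a new shape and never falls back to GPU/CPU
        yield torch.tensor([chunk]), torch.tensor([attention_mask])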
Qualcomm Hexagon DSP + Adreno GPU
Qualcomm's Snapdragon 8 Gen 3 reaches 45 TOPS through a combination of Hexagon NPU (dedicated matrix math), Hexagon DSP (signal processing and fixed-point SIMD), and Adreno GPU (shader-based general compute). Unlike Apple's integrated approach, Qualcomm exposes these through the Qualcomm AI Engine Direct SDK (formerly SNPE), allowing developers to target specific processing elements.
The Hexagon architecture is notable for its very low memory access latency due to tightly coupled scratchpad memory. For INT8 quantized models, the Hexagon NPU can achieve near-peak TOPS because memory bandwidth is rarely the bottleneck - the weights fit in local SRAM between layers.
Google Edge TPU and Tensor G3
The Edge TPU (available in Coral devices and integrated into Pixel chips as part of the Tensor SoC) is the most constrained of the major NPU architectures - it requires INT8 quantization and supports only a specific subset of TensorFlow operations. The upside is exceptional power efficiency: the USB Coral Edge TPU delivers 4 TOPS at under 2 watts.
The Tensor G3 in the Pixel 8 Pro integrates a more capable version of this accelerator alongside a Samsung-designed GPU, reaching 15 TOPS for AI workloads while prioritizing camera processing (Google's primary edge AI use case).
NVIDIA Jetson Orin
The Jetson AGX Orin represents a different point in the edge design space - not a phone but an embedded computing module for robotics, autonomous vehicles, and industrial systems. At 275 TOPS with a 60-watt TDP, it uses a full Ampere GPU (1792 CUDA cores, 56 Tensor Cores) plus an 8-core Arm CPU and dedicated deep learning accelerators.
The Jetson platform is unique in providing full CUDA compatibility - any model that runs on a datacenter GPU runs on Jetson without modification, just slower. This makes it ideal for prototyping edge deployments before optimizing for a more constrained NPU.
Deployment Frameworks
| Platform | Framework | Hardware Target |
|----------|-----------|-----------------|
| iOS / macOS | CoreML | ANE + GPU + CPU |
| Android | TFLite | Hexagon + Adreno + CPU |
| Android | ONNX Runtime | Hexagon + CPU |
| Cross-platform | ExecuTorch (Meta) | Multiple backends |
| Cross-platform | llama.cpp | CPU + Metal + CUDA + Vulkan |
| Apple Silicon | MLX | GPU + CPU unified memory |
| Jetson | TensorRT | Ampere GPU + DLA |
CoreML is Apple's first-party framework. It accepts models in the CoreML format (.mlpackage), which you generate from PyTorch, TensorFlow, or ONNX using the coremltools Python library. CoreML handles automatic dispatch to ANE, GPU, or CPU based on the operation and device state. For production iOS apps, CoreML is the correct default choice.
TFLite is Google's framework targeting Android and embedded systems. It has the largest ecosystem of pre-optimized operators and the most hardware targets via its delegate system. The Hexagon delegate routes operations to Qualcomm DSP; the GPU delegate routes to OpenCL or Vulkan; the XNNPACK delegate optimizes for ARM CPU SIMD.
ExecuTorch is Meta's edge inference framework, released in 2023. It is designed to export PyTorch models directly to edge targets without going through an intermediate format, reducing the friction of the PyTorch-to-edge pipeline. The XNNPACK backend covers CPU, and backends for ANE and Hexagon are in active development.
llama.cpp is the de facto standard for running LLMs on consumer hardware. Written in C++ with zero mandatory dependencies, it supports Metal (Apple GPU), CUDA, Vulkan, OpenCL, and pure CPU backends. The key innovation is aggressive use of quantization formats specifically designed for the memory bandwidth constraints of edge hardware: GGUF format supports Q2_K through Q8_0 quantization with mixed precision per tensor.
MLX is Apple's research framework for Apple Silicon, released in late 2023. Unlike CoreML, which hides hardware details behind a compiler, MLX is a NumPy-like array framework that operates directly on Apple's unified memory architecture. This matters because M-series chips have no separate CPU/GPU memory - the same physical RAM is accessed by both compute units with equal bandwidth. MLX exploits this to avoid memory copies entirely.
Code Examples
Converting a PyTorch Model to CoreML
import torch
import torchvision
import coremltools as ct
# Load a pre-trained model
model = torchvision.models.mobilenet_v3_small(weights="DEFAULT")  # pretrained ImageNet weights
model.eval()
# Trace the model with example input
# IMPORTANT: use static shapes - CoreML performance degrades with dynamic shapes
example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)
# Convert to CoreML
# compute_units controls which hardware is targeted
coreml_model = ct.convert(
traced_model,
inputs=[ct.TensorType(name="input_image", shape=(1, 3, 224, 224))],
outputs=[ct.TensorType(name="class_logits")],
compute_units=ct.ComputeUnit.ALL, # Allow ANE + GPU + CPU
minimum_deployment_target=ct.target.iOS17,
compute_precision=ct.precision.FLOAT16, # FP16 for ANE compatibility
)
# Add metadata
coreml_model.short_description = "MobileNetV3 Small - ImageNet classifier"
coreml_model.input_description["input_image"] = "224x224 RGB image, normalized 0-1"
coreml_model.output_description["class_logits"] = "1000-class logit scores"
# Save
coreml_model.save("mobilenet_v3_small.mlpackage")
# Optional: compress weights with 8-bit palettization (k-means weight clustering)
# to shrink the package size; run and benchmark on a Mac with Apple Silicon
import coremltools.optimize.coreml as cto
config = cto.OptimizationConfig(
    global_config=cto.OpPalettizerConfig(nbits=8, mode="kmeans")
)
palettized_model = cto.palettize_weights(coreml_model, config)
palettized_model.save("mobilenet_v3_small_palettized.mlpackage")
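A quick way to confirm the exported package loads and runs is to call it from coremltools on a Mac; this is a sanity-check sketch, with the compute_units argument mirroring the preference used at conversion time.
import numpy as np
import coremltools as ct
# Load the exported package and run one prediction through CoreML.
# Input/output names match those set during conversion above.
mlmodel = ct.models.MLModel(
    "mobilenet_v3_small.mlpackage",
    compute_units=ct.ComputeUnit.ALL,  # same ANE + GPU + CPU preference as above
)
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
out = mlmodel.predict({"input_image": x})
print(out["class_logits"].shape)  # expect (1, 1000)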
Running LLaMA-3-8B on Apple Silicon with llama.cpp
# Install llama.cpp with Metal support (macOS)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with Metal acceleration for the Apple GPU
# (recent llama.cpp versions enable Metal by default on Apple Silicon and use
#  -DGGML_METAL=ON; older versions use -DLLAMA_METAL=ON)
cmake -B build -DLLAMA_METAL=ON
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)
# Download a quantized LLaMA-3-8B model from Hugging Face
# Q4_K_M is the recommended balance of size and quality
pip install huggingface_hub
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
repo_id='bartowski/Meta-Llama-3-8B-Instruct-GGUF',
filename='Meta-Llama-3-8B-Instruct-Q4_K_M.gguf',
local_dir='./models'
)
"
# Run inference with Metal acceleration
./build/bin/llama-cli \
  --model models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  --prompt "Explain transformer attention in one paragraph:" \
  --n-predict 200 \
  --ctx-size 4096 \
  --threads 8 \
  --n-gpu-layers 33 \
  --flash-attn
# --n-gpu-layers 33 offloads all 33 layers (32 transformer blocks + output) to the Metal GPU
# --flash-attn enables flash attention for memory efficiency
# Expected output on M3 Max:
# llm_load_tensors: offloaded 33/33 layers to GPU
# llama_print_timings: eval time = 4523.41 ms / 200 tokens (22.62 ms per token)
# llama_print_timings: prompt eval speed: 187.3 tokens/s
# llama_print_timings: generation speed: 44.2 tokens/s
Running LLaMA-3-8B with MLX for Maximum Apple Silicon Performance
# pip install mlx mlx-lm
from mlx_lm import load, generate
import time
# Load model - MLX automatically quantizes on load if requested
# 4-bit quantization makes 8B model ~4.3GB - fits in M1/M2 Pro unified memory
model, tokenizer = load(
"mlx-community/Meta-Llama-3-8B-Instruct-4bit",
# tokenizer_config={"trust_remote_code": True}
)
prompt = "Explain the difference between edge inference and cloud inference:"
# Warm up (first call compiles Metal kernels)
_ = generate(model, tokenizer, prompt=prompt, max_tokens=10)
# Benchmark
start = time.perf_counter()
response = generate(
model,
tokenizer,
prompt=prompt,
max_tokens=200,
verbose=False,
)
elapsed = time.perf_counter() - start
tokens = len(tokenizer.encode(response))
print(f"Generated {tokens} tokens in {elapsed:.2f}s")
print(f"Throughput: {tokens / elapsed:.1f} tokens/sec")
print(f"\nResponse:\n{response}")
# Typical results on M3 Max (128GB unified memory):
# Generated 200 tokens in 5.8s
# Throughput: 34.5 tokens/sec
Measuring Power Draw During Inference (macOS)
# Use powermetrics to measure real-time power during inference
# Run this in one terminal while inference runs in another
sudo powermetrics \
--samplers gpu_power,cpu_power,thermal \
--sample-interval 1000 \
--show-all \
2>/dev/null | grep -E "(ANE|GPU|CPU|Power|Thermal)"
# Example output during LLM generation on M3 Max:
# CPU Power: 8543 mW
# GPU Power: 12891 mW
# ANE Power: 847 mW
# Combined Power: 22281 mW (22.3W total)
# Thermal pressure: Nominal
# For more detailed GPU metrics
sudo powermetrics --samplers gpu_power -i 500 | tail -20
TFLite Deployment for Android
import tensorflow as tf
import numpy as np
# Convert PyTorch model to ONNX first, then to TFLite
# Or start with a TF/Keras model
# Load a Keras MobileNetV2
base_model = tf.keras.applications.MobileNetV2(
input_shape=(224, 224, 3),
include_top=True,
weights='imagenet'
)
# Convert to TFLite with INT8 quantization
# Representative dataset required for full integer quantization
def representative_dataset():
for _ in range(100):
# Use real calibration data in production
data = np.random.rand(1, 224, 224, 3).astype(np.float32)
yield [data]
converter = tf.lite.TFLiteConverter.from_keras_model(base_model)
# Full INT8 quantization - required for Hexagon DSP acceleration
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
# Save the quantized model
with open('mobilenetv2_int8.tflite', 'wb') as f:
f.write(tflite_model)
print(f"FP32 model size: {len(tf.lite.TFLiteConverter.from_keras_model(base_model).convert()) / 1e6:.1f} MB")
print(f"INT8 model size: {len(tflite_model) / 1e6:.1f} MB")
# FP32 model size: 13.9 MB
# INT8 model size: 3.7 MB (3.8x reduction)
Model Optimization for Edge Deployment
Quantization - The Non-Negotiable Step
Edge inference without quantization is not a viable option. A 7B parameter model in FP32 weighs 28 GB. In FP16, 14 GB. In INT8, 7 GB. In INT4 with GPTQ or AWQ, roughly 3.5-4.5 GB. Only INT4 fits in phone RAM alongside the OS and other applications.
The four main quantization approaches for edge LLMs:
GPTQ - Post-training quantization that uses approximate second-order (Hessian) information to choose quantized weights layer by layer. Produces high-quality INT4 models with ~0.3-0.8 perplexity penalty. Requires a calibration dataset and significant compute to produce.
AWQ (Activation-aware Weight Quantization) - Finds salient weights (those that most affect output quality) and preserves higher precision for them. Often slightly better quality than GPTQ at the same bit width. Works well for INT4.
GGUF Q4_K_M - The llama.cpp format. Uses mixed precision: attention weights at higher precision, FFN weights at lower precision, with k-quant grouping that minimizes outlier damage. Practical and fast - no calibration required.
INT8 activation + INT4 weight - Used by ExecuTorch and some TFLite deployments. Keeps activations in INT8 for numerical stability while aggressively quantizing weights to INT4.
For non-LLM models (vision, audio), INT8 post-training quantization (PTQ) is usually sufficient and often free of meaningful accuracy loss. The TFLite representative dataset approach calibrates activation ranges without retraining.
Architecture Choices That Matter for Edge
Not all model architectures are equally efficient on edge hardware. Three principles:
Depthwise separable convolutions over standard convolutions. A standard 3x3 convolution over an H x W feature map with C_in input and C_out output channels costs H·W·9·C_in·C_out multiply-accumulates. The depthwise separable version (a 3x3 depthwise convolution followed by a 1x1 pointwise convolution) costs H·W·C_in·(9 + C_out) - roughly 8-9x fewer operations at typical channel counts. MobileNet, EfficientNet, and MobileViT all use this.
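A small PyTorch sketch of the comparison; the channel counts are illustrative, not taken from any specific model.
import torch.nn as nn
c_in, c_out, k = 128, 128, 3
standard = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1)
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, kernel_size=k, padding=1, groups=c_in),  # depthwise 3x3
    nn.Conv2d(c_in, c_out, kernel_size=1),                         # pointwise 1x1
)
def param_count(m):
    return sum(p.numel() for p in m.parameters())
print(f"standard conv params:  {param_count(standard):,}")   # ~147,584
print(f"separable conv params: {param_count(separable):,}")  # ~17,792
# MACs per output position: standard = 9*c_in*c_out, separable = 9*c_in + c_in*c_out
ratio = (k * k * c_in * c_out) / (k * k * c_in + c_in * c_out)
print(f"approximate compute reduction: {ratio:.1f}x")         # ~8.4x for these shapes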
Static shapes over dynamic shapes. NPUs are hardware-compiled for specific tensor shapes. A model that uses a sequence length of exactly 512 at every call can be fully compiled to the ANE. A model that accepts "up to 512 tokens" forces the compiler to generate multiple kernels or fall back to the GPU.
Activation function compatibility. SiLU (used in LLaMA) and GELU are not natively supported on all NPUs. On hardware that lacks native SiLU support, the operation decomposes into a sigmoid followed by an element-wise multiply - two operations instead of one. ReLU and ReLU6 are universally supported and often run faster.
Knowledge Distillation for Edge Models
When architecture selection and quantization are not enough, knowledge distillation provides a path to a smaller model that retains more accuracy than naive pruning:
import torch
import torch.nn as nn
import torch.nn.functional as F
class DistillationLoss(nn.Module):
"""Combined classification loss + distillation loss from teacher model."""
def __init__(self, temperature: float = 4.0, alpha: float = 0.7):
super().__init__()
self.temperature = temperature
self.alpha = alpha # weight of distillation vs hard label loss
def forward(
self,
student_logits: torch.Tensor,
teacher_logits: torch.Tensor,
targets: torch.Tensor,
) -> torch.Tensor:
# Hard label loss (standard cross-entropy)
hard_loss = F.cross_entropy(student_logits, targets)
# Soft label loss (KL divergence on temperature-softened distributions)
# Temperature scaling makes the probability distribution softer,
# revealing more of the teacher's "dark knowledge" about class relationships
student_soft = F.log_softmax(student_logits / self.temperature, dim=-1)
teacher_soft = F.softmax(teacher_logits / self.temperature, dim=-1)
# KL divergence: T^2 scaling restores gradient magnitude lost by temperature
distill_loss = (
F.kl_div(student_soft, teacher_soft, reduction="batchmean")
* self.temperature ** 2
)
return self.alpha * distill_loss + (1 - self.alpha) * hard_loss
# Training loop with distillation
def train_with_distillation(
teacher_model, student_model, dataloader, epochs=10
):
teacher_model.eval()
student_model.train()
optimizer = torch.optim.AdamW(student_model.parameters(), lr=1e-4)
criterion = DistillationLoss(temperature=4.0, alpha=0.7)
for epoch in range(epochs):
total_loss = 0.0
for images, labels in dataloader:
with torch.no_grad():
teacher_logits = teacher_model(images)
student_logits = student_model(images)
loss = criterion(student_logits, teacher_logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(dataloader):.4f}")
Production Engineering Notes
Thermal Management is Mandatory
Mobile devices throttle aggressively. The A16 Bionic drops from peak performance to 60% of peak after approximately 30 seconds of sustained full-load inference. This is a hardware safety feature, not a bug. Build your system to expect it:
- Measure sustained performance, not peak. Run a 5-minute benchmark, not a 30-second one. The numbers you put in your product spec must come from the plateau performance, not the warm-up peak.
- Use the low-power compute units when battery is below 20%. iOS exposes ProcessInfo.thermalState and UIDevice.batteryLevel. When the device is hot or battery is low, fall back to a smaller model or defer heavy inference.
- Spread bursts of small inferences out where possible. Five 50ms inferences spread over five seconds generate less heat build-up than five 50ms inferences fired in rapid succession, because the SoC gets time to dissipate heat between them (a minimal pacing sketch follows below).
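A minimal sketch of that pacing idea, written in Python for illustration (on-device code would live in Swift or Kotlin); run_inference and the 10 Hz cap are hypothetical.
import time
def paced_inference(frames, run_inference, max_rate_hz=10):
    """Yield one result per frame, but run the model at most max_rate_hz times/sec."""
    min_interval = 1.0 / max_rate_hz
    last_result = None
    last_time = float("-inf")
    for frame in frames:
        now = time.monotonic()
        if now - last_time >= min_interval:
            last_result = run_inference(frame)  # the heat-generating work
            last_time = now
        # Otherwise reuse the previous result - users rarely notice a 1-2 frame gap
        yield last_result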
iOS-Specific Production Considerations
CoreML model loading is expensive (300ms-2s depending on model size). Load the model once at app launch, cache it in memory, and keep it warm. Do not load and unload the model per-request.
The ANE is a shared resource across all apps. If the user switches to another app that uses the ANE, your inference may be preempted. Build retry logic and timeouts into your inference pipeline.
Use .mlpackage format (not the older .mlmodel) for all new deployments. The package format supports model versioning, allows the CoreML compiler to cache compiled graphs between app launches, and supports newer op types.
Android-Specific Production Considerations
TFLite delegate availability is device-dependent - the Hexagon delegate may not be available on all Android devices even if they use Snapdragon chips. Always implement fallback to CPU:
# Pseudocode - actual TFLite API is Java/Kotlin on Android
def load_interpreter_with_fallback(model_path):
# Try Hexagon DSP first (best efficiency)
try:
interpreter = tflite.Interpreter(
model_path=model_path,
experimental_delegates=[load_delegate('libQnnTFLiteDelegate.so')]
)
return interpreter, "hexagon"
except Exception:
pass
# Try GPU (good for parallelizable ops)
try:
interpreter = tflite.Interpreter(
model_path=model_path,
experimental_delegates=[load_delegate('libGpuDelegateV2.so')]
)
return interpreter, "gpu"
except Exception:
pass
# CPU fallback (always works, slowest)
interpreter = tflite.Interpreter(model_path=model_path)
return interpreter, "cpu"
Benchmarking Edge Inference Correctly
import time
import statistics
def benchmark_inference(model_fn, inputs, warmup_runs=5, benchmark_runs=50):
"""
Benchmark inference with proper warmup and statistical reporting.
Warmup is critical on mobile - first N calls compile/cache kernels.
"""
# Warmup - do not measure these
for _ in range(warmup_runs):
_ = model_fn(inputs)
# Benchmark runs
latencies = []
for _ in range(benchmark_runs):
start = time.perf_counter()
result = model_fn(inputs)
elapsed_ms = (time.perf_counter() - start) * 1000
latencies.append(elapsed_ms)
return {
"p50_ms": statistics.median(latencies),
"p95_ms": sorted(latencies)[int(0.95 * len(latencies))],
"p99_ms": sorted(latencies)[int(0.99 * len(latencies))],
"mean_ms": statistics.mean(latencies),
"stdev_ms": statistics.stdev(latencies),
"min_ms": min(latencies),
"max_ms": max(latencies),
}
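Example usage, assuming model is the CoreML MLModel loaded earlier in this lesson; any callable that runs one inference works as model_fn.
import numpy as np
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
stats = benchmark_inference(lambda x: model.predict({"input_image": x}), dummy)
print(f"p50 {stats['p50_ms']:.1f} ms | p95 {stats['p95_ms']:.1f} ms | max {stats['max_ms']:.1f} ms")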
Common Mistakes
:::danger Measuring Peak Performance and Shipping It as the Spec
The most common mistake in edge ML engineering is benchmarking for 10-30 seconds and using those numbers in product specs. Mobile devices run at peak performance for 20-60 seconds before thermal throttling kicks in. After 60 seconds at full load, performance can drop 30-60%. Always run 5-minute sustained benchmarks. The number that goes in your spec must be the performance at minute 4, not second 10.
:::
:::danger Using Dynamic Shapes in CoreML Models Targeting ANE
The Apple Neural Engine requires static tensor shapes for full acceleration. A model exported with dynamic sequence lengths (batch size "None" or variable context length) will fall back to GPU or CPU for the variable-dimension operations. For LLMs on iOS, this means using chunked prefill with fixed chunk sizes, or accepting that prefill runs on the GPU while generation runs on the ANE.
:::
:::warning Quantizing Without Calibration Data
Post-training quantization requires a representative dataset for calibration. Using random noise as calibration data (as many tutorials do for simplicity) produces incorrect activation range estimates and can reduce model accuracy by 5-15% compared to proper calibration with real data. Use 100-500 examples from your actual input distribution.
:::
:::warning Ignoring LPDDR Bandwidth as the Real Bottleneck
Edge NPUs often advertise impressive TOPS numbers, but the real bottleneck for LLM inference on edge is LPDDR memory bandwidth, not compute. A 7B-parameter INT4 model (~3.5 GB of weights) read from 68 GB/s LPDDR5 takes at minimum 3.5 GB ÷ 68 GB/s ≈ 51 ms per token generation step, regardless of NPU TOPS. This is why the M3 Max (400 GB/s bandwidth) generates tokens roughly 4x faster than the M2 (100 GB/s bandwidth) for the same model, even with similar TOPS.
:::
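To make that bandwidth arithmetic concrete, here is a quick back-of-envelope sketch; the weight size and bandwidth figures are the illustrative numbers used in this section, not measurements.
def min_token_latency_ms(weight_bytes: float, bandwidth_gb_s: float) -> float:
    # Every generated token must stream all weights through the memory bus once
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1000
weights = 3.5e9  # ~3.5 GB of INT4 weights for a 7B model
for name, bw in [("68 GB/s LPDDR5 phone", 68), ("M2 (100 GB/s)", 100), ("M3 Max (400 GB/s)", 400)]:
    ms = min_token_latency_ms(weights, bw)
    print(f"{name}: {ms:5.1f} ms/token minimum -> at most {1000 / ms:5.1f} tokens/s")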
:::warning Shipping Without Battery Level Checks
Running a 4-bit quantized 7B LLM on an iPhone draws approximately 6-8 watts when generating tokens - several times the phone's typical sustained budget. A user at 15% battery who runs your LLM feature for a few minutes will burn through a noticeable chunk of their remaining charge, and under that sustained load the device may hit a low-battery shutdown sooner than the gauge suggests. Ship with explicit battery level checks. If UIDevice.batteryLevel < 0.20, either disable the feature, use a smaller model, or show a clear warning.
:::
Interview Q&A
Q1: Why can't you just run a full FP32 LLM on a modern smartphone? Isn't the compute there?
The compute is sometimes there, but compute is not the binding constraint. Memory is. A 7B parameter model in FP32 requires 28 GB of RAM. A flagship iPhone has 8 GB total. Even setting aside RAM shared with the OS and other apps, the model physically does not fit. The solution is quantization to INT4, which brings the 7B model to ~4 GB. Beyond size, LPDDR5 memory bandwidth is 68-100 GB/s on mobile, versus 3,350 GB/s on an H100. For autoregressive generation, every token requires loading the full model weights once. At 68 GB/s, loading 4.3 GB of INT4 weights takes ~63ms minimum per token - you are bandwidth-limited, not compute-limited. This is why phone LLMs generate 20-40 tokens/second while server GPUs generate 2,000+ tokens/second.
Q2: What is the Apple Neural Engine and why does it matter for production deployments?
The ANE is a dedicated systolic-array accelerator embedded in Apple SoCs since the A11 (2017). It delivers 18-38 TOPS at approximately 1.5 watts - roughly 10-20x more efficient per operation than the GPU for supported workloads. The production implication is that ANE-accelerated inference extends battery life significantly: running MobileNetV3 on the ANE instead of the GPU consumes ~80% less power. For production iOS apps doing continuous inference (real-time camera processing, voice activity detection, keyboard prediction), the difference between GPU and ANE power draw determines whether users notice battery drain. The catch is that the ANE only supports specific operations and requires static tensor shapes. You access it exclusively through CoreML - there is no direct programming API.
Q3: Walk me through how you would deploy a 7B LLM to iPhone in production.
Start with model selection - choose a model that fits in INT4 within 3-4 GB (leaving headroom for OS and app memory). Phi-3-mini-4K (3.8B parameters) and Mistral-7B are both viable; LLaMA-3-8B is at the edge. Convert the weights to GGUF using llama.cpp's conversion script (convert_hf_to_gguf.py in current versions) and quantize to Q4_K_M with the llama-quantize tool. Integrate via the llama.cpp iOS framework (ggml-org/llama.cpp provides Swift bindings). Route generation to the Metal backend for GPU acceleration. Implement streaming output - the user should see tokens appear as they generate, not wait for the full response. Add thermal state checks using ProcessInfo.thermalState - downgrade to a smaller model or increase context reuse if thermal state is "serious" or "critical". Monitor memory pressure via os_proc_available_memory() and abort with a graceful error rather than crashing. Profile with Instruments (Xcode) to verify memory bandwidth usage and confirm that generation is bandwidth-bound, not CPU-bound.
Q4: What is the difference between TFLite delegates and CoreML compute units?
Both systems solve the same problem - dispatching inference operations to the most efficient available hardware - but with different philosophies. CoreML compute units are a declaration of preference (ComputeUnit.ALL, .cpuAndNeuralEngine, .cpuOnly), and the CoreML compiler automatically partitions the graph to run subgraphs on the best available unit per operation. The developer does not specify which operations go where. TFLite delegates are explicit plugins - you load a specific delegate (Hexagon, GPU, XNNPACK) and it claims the operations it can accelerate; unclaimed operations fall back to the CPU interpreter. CoreML's approach is higher-level and more portable; TFLite's approach gives more control but requires more explicit fallback handling. In practice, CoreML is the right choice for iOS and TFLite is the right choice for Android, where no equivalent Apple-managed compiler exists.
Q5: How does llama.cpp achieve competitive performance with no dependencies and a pure C++ implementation?
llama.cpp's performance comes from several engineering decisions. First, it implements GGUF quantization formats (Q4_K, Q5_K, Q8_0, etc.) that use block quantization with per-block scale factors, minimizing the accuracy loss from low-bit weights while fitting in cache-friendly memory layouts. Second, the Metal backend generates compute shaders that explicitly tile matrix multiplications to match the GPU's SIMD width and shared memory size - the same optimization that TensorRT applies automatically. Third, the memory layout is designed for sequential access during autoregressive generation: model weights are laid out so that each token generation step reads contiguous memory, maximizing DRAM row buffer hit rates on LPDDR. Fourth, mmap is used for model loading - the OS loads pages on demand, which means a 4GB model on a device with 8GB RAM does not require 4GB of allocations; pages are loaded lazily and evicted under memory pressure.
Q6: A team wants to add real-time pose estimation to their fitness app. The model runs fine in the lab at 30fps but drops to 12fps after 45 seconds of use. What is wrong and how do you fix it?
This is a thermal throttling problem. The diagnosis is confirmed by the timing: 45 seconds is typical for an A-series chip to hit thermal limits under sustained load. The GPU or ANE is hitting temperature thresholds and the OS is reducing clock frequency to protect the hardware. Fixes in order of impact: (1) Profile which processing units are active and whether you are unnecessarily using the GPU for operations the ANE handles more efficiently - unnecessary GPU usage generates more heat. (2) Check whether you are running the model at 30fps when 20fps is acceptable - reducing inference frequency linearly reduces heat. (3) Add explicit frame skipping: if the last inference took more than 50ms (below 20fps pace), skip the next frame's inference and use the previous result - users do not notice 1-2 frame gaps at 20fps. (4) Check the model's input pipeline - are you re-encoding YUV camera frames to RGB on the CPU on every frame? The Vision framework and CoreML can accept YCbCr pixel buffers directly. (5) If none of these work, distill to a smaller model or use MobileViT-XS instead of a full EfficientNet.
Q7: Compare the Apple Neural Engine to the Qualcomm Hexagon NPU. Which should you target and why?
They share the same goal - efficient neural network inference at low power - but differ in programmability and target workload profile.
The ANE is a black box from the developer's perspective. You cannot write ANE kernels. You describe a computation in CoreML format, and Apple's compiler decides what runs on the ANE. This simplifies development but limits control. The ANE excels at the specific operation patterns Apple has optimized for: CNNs for camera processing, transformer attention for on-device Siri and keyboard prediction, and the particular tensor shapes that appear in Apple's own ML features. If your model uses CoreML-native operations and static shapes, ANE acceleration is essentially automatic and highly efficient.
The Hexagon NPU is exposed through SNPE (Snapdragon Neural Processing Engine) and QNN (Qualcomm AI Engine Direct) with a more explicit API. Qualcomm publishes performance counters, memory access patterns, and cycle-level profiling tools. Advanced teams can write custom operations targeting specific Hexagon DSP instructions. This extra complexity yields extra control: Qualcomm hardware often outperforms ANE on custom architectures that fall outside CoreML's op set.
For production: if you target iOS, the ANE is the right choice and CoreML is the only sensible path. If you target Android, the Hexagon NPU via TFLite Hexagon delegate or SNPE is the highest-efficiency path on Snapdragon devices, but you must test on actual hardware because Hexagon support varies by Android OEM customization even on the same Snapdragon chip.
Q8: A startup is building a real-time translation earpiece. The model must run for 8 hours continuously on a 300mAh battery. Walk through how you would design the hardware and software stack.
A 300mAh battery at 3.7V holds 1.11 Wh of energy. Spread over 8 hours, the average power budget is 1.11 Wh ÷ 8 h ≈ 0.139 W. That is 139 milliwatts for the entire device - audio, BLE, CPU, and inference combined.
Typical MCU power at active state is 10-30 mW. Audio codec and microphone consume 5-10 mW. BLE transmission (if continuous) adds 10-15 mW. This leaves roughly 80-100 mW for inference.
At 80-100 mW, you cannot run a transformer model of any meaningful size. A 7B LLM draws roughly 6,000 mW on a phone, and even MobileNetV3 draws 50-80 mW there - on a device with a far larger battery and thermal budget than this earpiece's 139 mW total. The path to 80 mW inference has three requirements:
First, model architecture. You need something in the 500K-2M parameter range. Conformer-based speech recognition models like Google's streaming RNN-T run at that scale. For translation, you might use a cascaded pipeline: ASR model to text (on-device), BLE to phone, NMT on phone, audio synthesis back via BLE. This offloads the heavy compute to the phone.
Second, dedicated inference silicon. Qualcomm's Aro XSOC and Ambiq Apollo series MCUs have dedicated vector engines for neural network inference in the 0.1-1 TOPS range at under 5mW. The Ambiq Apollo4 does 50 GOPS at under 1mW in low-power mode. These enable running small transformer models within a 10-20mW inference budget.
Third, always-on vs triggered inference. Voice activity detection (VAD) with a 10K parameter model consumes under 1mW and runs continuously. The larger ASR/translation model is triggered only when speech is detected. This reduces duty cycle from 100% to 10-20%, effectively multiplying the power budget for the heavy model by 5-10x.
The production architecture: always-on VAD at 0.5mW, triggered streaming ASR at 15mW (30% duty cycle average), BLE to paired phone for NMT at 10mW, TTS playback. Total average: ~45mW. This fits in the 139mW budget with margin for audio and connectivity.
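A quick sketch of that budget arithmetic in Python; the per-component figures are the illustrative estimates from the answer above, and the TTS playback figure is an assumption chosen to match the ~45 mW total.
battery_wh = 0.300 * 3.7                 # 300 mAh at 3.7 V ≈ 1.11 Wh
budget_mw = battery_wh / 8 * 1000        # ≈ 139 mW average over 8 hours
draw_mw = {
    "always-on VAD": 0.5,
    "streaming ASR (duty-cycled)": 15.0,
    "BLE link to phone": 10.0,
    "TTS playback (assumed)": 20.0,
}
total_mw = sum(draw_mw.values())
print(f"budget {budget_mw:.0f} mW, planned draw {total_mw:.1f} mW, "
      f"margin {budget_mw - total_mw:.1f} mW for audio capture and MCU overhead")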
Hardware Comparison and Selection Guide
Choosing the Right Edge Platform
The right hardware depends entirely on the deployment context. Here is a decision framework organized by use case:
Consumer Mobile (iOS): Use Apple Silicon with CoreML as the default. For LLMs specifically, use MLX on M-series Macs and llama.cpp with Metal on iPhone. The ANE provides exceptional power efficiency for vision and audio models. The unified memory architecture on M-series makes large model inference feasible in ways that discrete GPU architectures cannot match.
Consumer Mobile (Android): Use TFLite with the Hexagon delegate as the default. For LLMs, use llama.cpp with the Vulkan backend. On Snapdragon 8 Gen 2+ devices, ONNX Runtime with QNN execution provider can reach ANE-comparable efficiency for quantized models. Test on actual target device families - performance varies enormously between Snapdragon, MediaTek, and Exynos SoCs.
Robotics and Industrial Edge: NVIDIA Jetson Orin is the default for any application requiring full CUDA compatibility and the ability to run the same model code that runs in the datacenter. The 60-watt power budget is high for a truly embedded system but acceptable for a robot, drone, or industrial gateway. For lower power budgets (under 10W), the Jetson Nano or Google Coral Edge TPU are alternatives.
Always-On Camera / Vision Applications: The Google Coral Edge TPU (USB or M.2 form factor) provides 4 TOPS at under 2 watts - optimal for always-on object detection at 30fps that must run indefinitely without draining a battery. The constraint is strictly INT8 and limited TF op support.
Hardware Specifications Comparison
| Platform | TOPS | TDP | Memory | Framework | Ideal Use Case |
|----------|------|-----|--------|-----------|----------------|
| Apple M3 ANE | 18 | 1.5W | Shared | CoreML/MLX | iOS/macOS app inference |
| Apple M3 Max | 38* | 15-30W | 128GB unified | CoreML/MLX | On-device LLMs, Mac apps |
| Qualcomm 8 Gen 3 | 45 | 5-7W | 8-16GB LPDDR5 | TFLite/SNPE | Android flagship inference |
| Google Tensor G3 | 15 | 5W | 12GB LPDDR5 | TFLite | Pixel camera AI |
| NVIDIA Jetson Orin | 275 | 60W | 64GB LPDDR5 | TensorRT | Robotics, autonomous systems |
| Google Coral TPU | 4 | 2W | N/A (uses host) | TFLite/EdgeTPU | Always-on vision |
| Raspberry Pi 4 | 0.1 | 5W | 4-8GB DDR4 | TFLite/NCNN | Low-power prototyping |
* ANE + GPU combined on M3 Max chip
The Unified Memory Advantage on Apple Silicon
The M-series chip's unified memory architecture deserves specific attention because it changes the economics of large model deployment. On a conventional laptop, the CPU has DDR5 RAM and the GPU has GDDR6 or HBM VRAM. Moving a tensor from CPU to GPU requires a PCIe transfer at 64 GB/s. Running a 70B model requires it to live in GPU VRAM, which on the best laptop GPU (RTX 4090 laptop) is 16 GB - far too small.
On Apple M3 Max, all 128 GB of LPDDR5 is physically shared between CPU, GPU, and ANE. There is no PCIe transfer. There is no separate VRAM limit. A 70B model in INT4 requires about 35 GB - it fits comfortably in M3 Max unified memory, accessed by the GPU at 400 GB/s bandwidth.
This is why MLX is particularly powerful on Apple Silicon. The framework assumes zero-copy access between CPU and GPU:
import mlx.core as mx
import mlx.nn as nn
# In MLX, arrays live in unified memory
# CPU and GPU access the SAME physical bytes - no copies
x = mx.array([1.0, 2.0, 3.0]) # Unified memory allocation
# Runs on GPU via Metal, but x was not "copied" to GPU
# It was always accessible by both
result = mx.sum(x * x)
mx.eval(result) # Lazy evaluation - compute executes here
print(result) # Accesses result back on CPU - no transfer needed
For LLM inference, this matters because the KV cache (which grows with each generated token) lives in the same memory pool as the model weights. On a 128 GB M3 Max, you can maintain a 128K-token KV cache for a 70B INT4 model simultaneously - something impossible on any laptop with discrete GPU.
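A rough sizing sketch for that claim, assuming LLaMA-3-70B-style dimensions (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and FP16 cache entries; these are estimates, not measurements.
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    # Keys and values for every layer, KV head, and position
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem
weights_gb = 70e9 * 0.5 / 1e9                             # ~35 GB of INT4 weights
cache_gb = kv_cache_bytes(80, 8, 128, 128 * 1024) / 1e9   # ~43 GB at 128K tokens
print(f"weights ~{weights_gb:.0f} GB + KV cache ~{cache_gb:.0f} GB "
      f"= ~{weights_gb + cache_gb:.0f} GB total")
# ~78 GB - fits in 128 GB of unified memory, impossible on a 16-24 GB discrete laptop GPU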
ExecuTorch and the PyTorch-Native Edge Path
Meta's ExecuTorch (released 2023) represents a third paradigm for edge deployment alongside CoreML and TFLite. Instead of converting to a vendor-specific intermediate format, ExecuTorch exports PyTorch models as portable program files (.pte) executed by a lightweight on-device runtime:
# ExecuTorch export - stays in the PyTorch ecosystem
import torch
from executorch.exir import to_edge
from torch.export import export
# Your standard PyTorch model
model = torch.nn.Sequential(
torch.nn.Linear(768, 256),
torch.nn.ReLU(),
torch.nn.Linear(256, 10),
)
model.eval()
# Export to ExecuTorch portable format
example_inputs = (torch.randn(1, 768),)
exported = export(model, example_inputs)
# Lower to edge format
edge_program = to_edge(exported)
# Compile with XNNPACK backend (optimized for ARM CPU)
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
edge_program = edge_program.to_backend(XnnpackPartitioner())
# Serialize to .pte file (runs on iOS/Android without Python)
executorch_program = edge_program.to_executorch()
with open("model.pte", "wb") as f:
f.write(executorch_program.buffer)
# model.pte can now be bundled directly in an iOS or Android app
# The ExecuTorch runtime is ~1MB and handles execution
ExecuTorch's advantage over CoreML and TFLite is that the export path is identical across all target platforms. A model exported once runs on iOS, Android, and embedded Linux without re-conversion. This matters for teams maintaining models across multiple platforms - one conversion pipeline instead of three.
Performance Profiling on Mobile
Profiling edge inference requires platform-specific tools. Knowing how to use them separates engineers who guess from engineers who fix:
# iOS: Instruments - Core ML Performance Operator View
# Launch from Xcode -> Product -> Profile -> Core ML
# Shows per-operator execution time and which compute unit handled each op
# macOS command-line sanity check with coremltools
# (per-operator profiling lives in Xcode's Core ML performance report)
python3 << 'EOF'
import coremltools as ct
import numpy as np
# A quick predict() from Python confirms the package loads and runs end to end
model = ct.models.MLModel("my_model.mlpackage")
example_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # match your model's input shape
result = model.predict({"input": example_input})  # use your model's input name
print(list(result.keys()))
EOF
# Android: NNAPI profiling via ADB
adb shell am start -n com.example.app/.MainActivity
adb shell setprop debug.nn.model_token 1
adb logcat | grep -E "(NN|NNAPI|TFLite)"
# NVIDIA Jetson: tegrastats for power and utilization
sudo tegrastats --interval 100
# Output: RAM 2048/7764MB SWAP 0/0MB CPU [45%@1190,23%@1190,67%@1190,12%@1190] GPU 78%@921 VIC off
Key Takeaways
Edge inference in 2024 is a solved engineering problem for models up to 7-8B parameters on high-end mobile hardware, and for models up to 1B parameters on mid-range devices. The hardware exists. The frameworks are mature. The remaining challenge is engineering discipline:
- Always benchmark sustained performance, not peak performance
- Quantization to INT4 is mandatory for LLMs; INT8 is sufficient for vision models
- Use static shapes for ANE acceleration; accept GPU fallback for dynamic sequences
- Memory bandwidth, not TOPS, is the binding constraint for LLM token generation
- CoreML on iOS and TFLite on Android are the production defaults; llama.cpp and MLX serve specialized cases
- Thermal management is a first-class engineering concern, not an afterthought
The engineer who understands these constraints builds products that work in production. The engineer who ignores them ships demos that fall apart after 45 seconds of real use.
