Apple Silicon for AI
Apple M-series unified memory architecture for ML inference - how the ANE, GPU, and CPU share one memory pool, why this matters for local LLMs, and how to run models with MLX and llama.cpp on Apple Silicon.
Apple M-series unified memory architecture for ML inference - how the ANE, GPU, and CPU share one memory pool, why this matters for local LLMs, and how to run models with MLX and llama.cpp on Apple Silicon.
Deep dive into AWS custom AI chips - Trainium for training and Inferentia for inference, NeuronCore-v2 architecture, the Neuron SDK compilation pipeline, and real-world cost-performance tradeoffs versus GPU instances.
How Cerebras builds the world's largest chip by using the entire silicon wafer as one device, eliminating inter-chip communication overhead for large model training and delivering linear scaling without distributed training frameworks.
A complete decision framework for AI accelerator selection - how to evaluate NVIDIA GPUs, TPUs, Trainium, Gaudi, Groq, and custom ASICs across workload fit, TCO, ecosystem maturity, and team capability.
How FPGAs enable sub-microsecond AI inference - reconfigurable logic, HLS programming, Xilinx Vitis AI, quantization strategies, and when FPGAs beat GPUs for latency-critical deployments.
Deep dive into Google's Tensor Processing Units - systolic array design, XLA compilation, TPU pod topology, and how to write high-performance JAX programs that avoid recompilation traps.
How Groq's Language Processing Unit eliminates the memory bottleneck for LLM inference by keeping model weights in on-chip SRAM and using deterministic compiler-scheduled execution.
Intel Gaudi AI accelerator architecture - Tensor Processor Cores, built-in RoCE scale-out networking, SynapseAI SDK, and price-performance positioning against NVIDIA H100 for LLM training.
TPUs, Trainium, Groq LPU, Cerebras WSE, Intel Gaudi, and Apple Silicon - how each architecture differs from GPUs and what workloads each wins on.