9 docs tagged with "custom-silicon"

Apple Silicon for AI

Apple M-series unified memory architecture for ML inference - how the Apple Neural Engine (ANE), GPU, and CPU share one memory pool, why this matters for local LLMs, and how to run models with MLX and llama.cpp on Apple Silicon.

AWS Trainium and Inferentia

Deep dive into AWS custom AI chips - Trainium for training and Inferentia for inference, NeuronCore-v2 architecture, the Neuron SDK compilation pipeline, and real-world cost-performance tradeoffs versus GPU instances.

Cerebras Wafer Scale Engine

How Cerebras builds the world's largest chip by using an entire silicon wafer as one device, eliminating inter-chip communication overhead for large model training and delivering near-linear scaling without distributed training frameworks.

Choosing Custom Silicon vs GPUs

A complete decision framework for AI accelerator selection - how to evaluate NVIDIA GPUs, TPUs, Trainium, Gaudi, Groq, and custom ASICs across workload fit, TCO, ecosystem maturity, and team capability.

FPGAs for AI Inference

How FPGAs enable sub-microsecond AI inference - reconfigurable logic, HLS programming, Xilinx Vitis AI, quantization strategies, and when FPGAs beat GPUs for latency-critical deployments.

Google TPU Architecture

Deep dive into Google's Tensor Processing Units - systolic array design, XLA compilation, TPU pod topology, and how to write high-performance JAX programs that avoid recompilation traps.

Groq LPU Architecture

How Groq's Language Processing Unit eliminates the memory bottleneck for LLM inference by keeping model weights in on-chip SRAM and using deterministic compiler-scheduled execution.

Intel Gaudi and Habana Labs

Intel Gaudi AI accelerator architecture - Tensor Processor Cores, built-in RoCE scale-out networking, SynapseAI SDK, and price-performance positioning against NVIDIA H100 for LLM training.

Module 3: Custom Silicon for AI

TPUs, Trainium, Groq LPU, Cerebras WSE, Intel Gaudi, and Apple Silicon - how each architecture differs from GPUs and which workloads each one is best suited to.