GPU architecture, CUDA programming, custom accelerators, kernel optimization, memory systems, and distributed training infrastructure, from transistors to clusters.
The hardware layer that determines whether your model trains fast or sits idle.
SM architecture, warp execution, the CUDA execution model, and the GPU memory hierarchy from first principles. (8 lessons)
Thread hierarchies, memory management, kernel optimization, and writing high-performance GPU kernels. (8 lessons)
TPUs, Trainium, Groq LPU, Cerebras, Gaudi, and Apple Silicon: their architectures, and when to choose a custom accelerator over a GPU. (8 lessons)
Occupancy, tiling, tensor cores, the Flash Attention kernel, kernel fusion, and Triton for custom kernels. (8 lessons)
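Occupancy, the first topic in this module, comes down to which SM resource runs out first. A minimal sketch of the estimate, using illustrative resource limits (roughly A100-class; the real limits depend on the GPU's compute capability):

```python
def sm_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                 max_threads=2048, reg_file=65536, smem_bytes=49152,
                 warp_size=32, max_warps=64):
    """Estimate SM occupancy: resident blocks per SM are capped by
    whichever resource (threads, registers, shared memory) is exhausted
    first, and occupancy is resident warps over the warp limit."""
    warps_per_block = (threads_per_block + warp_size - 1) // warp_size
    by_threads = max_threads // threads_per_block
    by_regs = reg_file // (regs_per_thread * threads_per_block)
    by_smem = smem_bytes // smem_per_block if smem_per_block else by_threads
    blocks = min(by_threads, by_regs, by_smem)
    return blocks * warps_per_block / max_warps

# 256 threads/block at 32 registers/thread fills the SM completely...
print(sm_occupancy(256, 32, 0))   # -> 1.0
# ...while 64 registers/thread halves occupancy (register-limited).
print(sm_occupancy(256, 64, 0))   # -> 0.5
```

The same arithmetic is what NVIDIA's occupancy calculator performs; high occupancy is a prerequisite for hiding memory latency, not a guarantee of peak throughput.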
HBM, roofline analysis, NVLink bandwidth, KV cache capacity planning, and storage I/O bottlenecks. (8 lessons)
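The KV cache sizing and roofline ideas in this module reduce to a few lines of arithmetic. A sketch, using a Llama-7B-like configuration (32 layers, 32 KV heads, head dimension 128, fp16) purely as an assumed example:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """KV cache footprint: keys plus values (the leading 2) for every
    layer, head, and token position, at the given element width."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

def attainable_flops(ai, peak_flops, peak_bw):
    """Roofline model: achievable throughput is the lesser of the compute
    roof and arithmetic intensity (FLOP/byte) times memory bandwidth."""
    return min(peak_flops, ai * peak_bw)

# One 4096-token sequence on the 7B-class config above: 2 GiB of cache.
print(kv_cache_bytes(32, 32, 128, 4096, 1) / 2**30)  # -> 2.0
```

The same two functions answer the planning questions the module poses: how many concurrent sequences fit in HBM, and whether a kernel at a given arithmetic intensity is bandwidth-bound or compute-bound.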
Multi-GPU architectures, NCCL, DGX systems, ZeRO, fault tolerance, and cloud vs. on-prem trade-offs. (8 lessons)
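A taste of the capacity math behind NCCL-style collectives: a ring all-reduce over N GPUs sends 2(N-1)/N of the message per GPU (reduce-scatter plus all-gather), which puts a bandwidth-only lower bound on gradient synchronization time. The link bandwidth you would plug in is whatever your interconnect provides; the sketch below leaves it as a parameter:

```python
def ring_allreduce_traffic(msg_bytes, n_gpus):
    """Per-GPU bytes over the ring: reduce-scatter plus all-gather,
    each phase moving (N-1)/N of the message."""
    return 2 * (n_gpus - 1) / n_gpus * msg_bytes

def allreduce_seconds(msg_bytes, n_gpus, link_bw_bytes_per_s):
    """Bandwidth-only lower bound; ignores latency, launch overhead,
    and any overlap of communication with compute."""
    return ring_allreduce_traffic(msg_bytes, n_gpus) / link_bw_bytes_per_s

# 1 GB of gradients across 8 GPUs -> 1.75 GB crosses each link.
print(ring_allreduce_traffic(1e9, 8))  # -> 1750000000.0
```

Because per-GPU traffic approaches 2x the message size as N grows, all-reduce cost is nearly independent of GPU count, which is why interconnect bandwidth, not cluster size, usually sets the data-parallel scaling ceiling.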
Inference-optimized hardware, KV cache management, speculative decoding, TensorRT, and edge inference. (8 lessons)
Every millisecond of latency and every dollar of compute traces back to hardware decisions.
Start Learning Free →