
Hardware and Silicon for AI

GPU architecture, CUDA programming, custom accelerators, kernel optimization, memory systems, and distributed training infrastructure - from transistors to clusters.

7 Modules · 56 Lessons · Free to start

7 Modules. From GPU Internals to Production Clusters.

The hardware layer that determines whether your model trains fast or sits idle.

01
Intermediate · Free

GPU Architecture

SM architecture, warp execution, CUDA execution model, and GPU memory hierarchy from first principles.

What you'll master

  • GPU vs CPU Architecture
  • Streaming Multiprocessors
  • CUDA Execution Model
  • GPU Memory Hierarchy
  • Warp Divergence and Efficiency
  • PCIe and NVLink
  • GPU Performance Metrics
  • Profiling with Nsight

8 lessons


Start for Free →
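A taste of the warp-divergence lesson: below is a minimal Python model (not real GPU code) of SIMT execution, where a warp of 32 threads hitting a divergent branch runs each side serially with inactive lanes masked off. The efficiency formula is standard; the scenario is illustrative.

```python
WARP_SIZE = 32

def simt_efficiency(active_lanes_per_pass):
    """Average fraction of lanes doing useful work per serialized pass.

    active_lanes_per_pass: number of active lanes in each pass the
    hardware makes through a divergent branch.
    """
    total_lane_slots = WARP_SIZE * len(active_lanes_per_pass)
    return sum(active_lanes_per_pass) / total_lane_slots

# No divergence: all 32 lanes take the same path -> one pass, 100% efficiency.
uniform = simt_efficiency([32])

# `if (threadIdx.x % 2 == 0)` splits the warp 16/16: the hardware executes
# the if-side and else-side back to back, halving throughput.
divergent = simt_efficiency([16, 16])  # 0.5
```

This is why divergence-free control flow (or divergence aligned to warp boundaries) is one of the first things Nsight profiling will point you at.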
02
Intermediate · Free

CUDA Programming

Thread hierarchies, memory management, kernel optimization, and writing high-performance GPU kernels.

What you'll master

  • CUDA Programming Model
  • Thread and Block Hierarchy
  • CUDA Memory Types
  • Kernel Optimization Fundamentals
  • Atomic Operations and Synchronization
  • Streams and Concurrency
  • cuBLAS and cuDNN
  • Debugging CUDA Code

8 lessons


Start for Free →
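The thread/block hierarchy lesson boils down to one indexing pattern. Here is a Python emulation of a 1-D CUDA launch: the ceiling-division grid sizing, the `blockIdx.x * blockDim.x + threadIdx.x` global index, and the bounds guard. The vector-add "kernel" is the usual hello-world example.

```python
def cdiv(a, b):
    """Ceiling division: the standard way to size a CUDA grid."""
    return (a + b - 1) // b

def launch_1d(n, block_dim, kernel):
    """Emulate a 1-D launch: run kernel(i) once per in-bounds global index."""
    grid_dim = cdiv(n, block_dim)
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx  # global thread index
            if i < n:                               # the usual bounds guard
                kernel(i)

# Vector add: out[i] = a[i] + b[i].
n = 1000
a = list(range(n))
b = [2 * x for x in a]
out = [0] * n
launch_1d(n, block_dim=256, kernel=lambda i: out.__setitem__(i, a[i] + b[i]))
```

Note that with `block_dim=256` the grid launches 1024 thread slots for 1000 elements; the `i < n` guard is what keeps the last 24 from writing out of bounds, exactly as in a real kernel.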
03
Advanced · Free

Custom Silicon

TPUs, Trainium, Groq LPU, Cerebras, Gaudi, Apple Silicon - architecture and when to choose custom accelerators.

What you'll master

  • Google TPU Architecture
  • AWS Trainium and Inferentia
  • Groq LPU Architecture
  • Cerebras Wafer-Scale Engine
  • Intel Gaudi and Habana
  • Apple Silicon for AI
  • FPGAs for AI Inference
  • Choosing Custom Silicon vs GPUs

8 lessons


Start for Free →
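The common thread in TPU-style silicon is the systolic array. Below is a functional Python model of a weight-stationary array: it shows the dataflow (weights pinned in place, activations streaming across, partial sums trickling down), not the cycle-level pipelining.

```python
def systolic_matmul(A, W):
    """Functional model of a weight-stationary systolic array.

    W[k][n] is pinned in processing element PE(k, n). Activation A[m][k]
    enters row k from the left; partial sums flow down column n, with
    PE(k, n) adding A[m][k] * W[k][n] before passing the sum along.
    """
    M, K = len(A), len(A[0])
    N = len(W[0])
    C = [[0] * N for _ in range(M)]
    for m in range(M):          # one input row streams through the array
        for n in range(N):      # each column accumulates one output element
            psum = 0
            for k in range(K):  # the partial sum's trip down the column
                psum += A[m][k] * W[k][n]
            C[m][n] = psum
    return C
```

The payoff of this layout is that weights are fetched from memory once and reused for every input row, which is why matmul-heavy accelerators can run far leaner memory systems per FLOP than general-purpose GPUs.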
04
Advanced · Free

Kernel Optimization

Occupancy, tiling, tensor cores, Flash Attention kernel, kernel fusion, and Triton for custom kernels.

What you'll master

  • Occupancy and Thread Block Tuning
  • Tiling and Shared Memory Optimization
  • Instruction-Level Optimization
  • Tensor Core Programming
  • Flash Attention Kernel Deep Dive
  • Kernel Fusion Strategies
  • Mixed Precision Kernels
  • Triton for Custom Kernels

8 lessons


Start for Free →
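Tiling is the heart of the shared-memory lesson. This Python sketch shows the blocking pattern only: each staged tile of A and B (the analogue of a shared-memory load) is reused `tile` times, which is what cuts global-memory traffic in a real kernel.

```python
def tiled_matmul(A, B, tile=2):
    """Square matmul with the loop blocking used in shared-memory kernels.

    The k0 loop stages one tile-sized block of A and B at a time; every
    staged element is then reused `tile` times by the inner loops, so
    global-memory traffic drops by roughly a factor of `tile`.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):        # "load tile into shared memory"
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        for k in range(k0, min(k0 + tile, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

A = [[i * 4 + j for j in range(4)] for i in range(4)]
I4 = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
C = tiled_matmul(A, I4, tile=2)   # multiplying by identity returns A
```

Flash Attention applies the same idea to attention: keep tiles of Q, K, and V in fast on-chip memory and never materialize the full attention matrix in HBM.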
05
Advanced · Free

Memory Systems

HBM, roofline analysis, NVLink bandwidth, KV cache capacity planning, and storage I/O bottlenecks.

What you'll master

  • GPU Memory Hierarchy Deep Dive
  • HBM and GDDR Memory Technologies
  • Memory Bandwidth Roofline Analysis
  • PCIe and NVLink Interconnects
  • CPU Memory Architecture for ML
  • Memory Capacity Planning for LLMs
  • Unified Memory and Pooling
  • Storage I/O for Training Pipelines

8 lessons


Start for Free →
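The roofline lesson reduces to one comparison: a kernel is bound by whichever is slower, compute or memory. A sketch with illustrative accelerator numbers (300 TFLOP/s peak, 1.5 TB/s HBM, roughly A100-class figures chosen for round arithmetic):

```python
def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline lower bound on kernel time: the slower of the
    compute time (flops / peak_flops) and memory time (bytes / peak_bw)."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

PEAK_FLOPS = 300e12   # 300 TFLOP/s (illustrative)
PEAK_BW = 1.5e12      # 1.5 TB/s HBM bandwidth (illustrative)

# Machine balance: arithmetic intensity needed to become compute-bound.
balance = PEAK_FLOPS / PEAK_BW   # 200 FLOP/byte

# Elementwise fp16 add: 1 FLOP per 6 bytes (read a, read b, write c),
# so intensity ~0.17 FLOP/byte -- three orders of magnitude under the
# balance point, i.e. firmly memory-bound.
elementwise_intensity = 1 / 6
```

This is why so much of LLM performance work (kernel fusion, KV-cache layout, quantization) is really about moving fewer bytes, not doing fewer FLOPs.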
06
Advanced · Free

Distributed Training Hardware

Multi-GPU architectures, NCCL, DGX systems, ZeRO, fault tolerance, and cloud vs on-prem.

What you'll master

  • Multi-GPU Training Architectures
  • GPU Cluster Networking
  • NCCL and Collective Communication
  • DGX and HGX System Design
  • ZeRO and Memory Efficiency
  • Gradient Checkpointing
  • Fault Tolerance in Large Clusters
  • Cloud vs On-Prem GPU Infrastructure

8 lessons


Start for Free →
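One number from the NCCL lesson worth carrying around: a bandwidth-optimal ring all-reduce moves 2(N-1)/N of the buffer per GPU, regardless of cluster size. The model and interconnect figures below are illustrative.

```python
def ring_allreduce_bytes_per_gpu(data_bytes, n_gpus):
    """Per-GPU traffic for a ring all-reduce: N-1 reduce-scatter steps plus
    N-1 all-gather steps, each moving data_bytes / N, sent and received."""
    return 2 * (n_gpus - 1) / n_gpus * data_bytes

# Gradients for a 7e9-parameter model in fp16 (~14 GB), 8 GPUs on one node:
traffic = ring_allreduce_bytes_per_gpu(14e9, 8)   # ~24.5 GB per GPU per step

# At an illustrative 300 GB/s of per-GPU interconnect bandwidth:
comm_seconds = traffic / 300e9
```

Since the 2(N-1)/N factor saturates near 2 as N grows, per-GPU all-reduce traffic stays nearly flat with cluster size; what grows is the synchronization surface, which is where the fault-tolerance lessons pick up.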
07
Advanced · Free

Inference Hardware

Inference-optimized hardware, KV cache management, speculative decoding, TensorRT, and edge inference.

What you'll master

  • GPU Inference vs Training Requirements
  • Quantization Hardware Tradeoffs
  • KV Cache Management and PagedAttention
  • Speculative Decoding
  • TensorRT and Inference Optimization
  • Batching Strategies for LLM Serving
  • Edge and Mobile Inference
  • Inference Cost Optimization

8 lessons


Start for Free →
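Why KV cache management gets a whole lesson: the cache dominates inference memory at long contexts. A sizing sketch using a Llama-2-7B-like configuration (32 layers, 32 KV heads, head dimension 128, fp16):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """KV cache footprint: 2 tensors (K and V), each of shape
    [layers, kv_heads, seq_len, head_dim], per sequence in the batch."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Llama-2-7B-like config: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch=1)   # 512 KiB/token

# A 4096-token sequence therefore needs ~2 GiB of KV cache -- before the
# weights -- which is why paged allocation (PagedAttention) and grouped-query
# attention matter so much for serving throughput.
per_4k_sequence_gib = kv_cache_bytes(32, 32, 128, 4096, 1) / 2**30
```

Run the same arithmetic with a batch dimension and it becomes clear that KV capacity, not FLOPs, usually caps how many requests a serving GPU can batch.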

Understand the hardware your models run on.

Every millisecond of latency and every dollar of compute traces back to hardware decisions.

Start Learning Free →