76 docs tagged with "system-design"

Approximate Nearest Neighbor Algorithms

Deep dive into HNSW, IVF, Product Quantization, IVFPQ, LSH, and DiskANN - how each algorithm trades recall for speed and how to choose the right one for your dataset.

Build vs Buy Analysis

A rigorous financial and risk framework for deciding when to build ML infrastructure in-house vs use managed services - applied to feature stores, vector DBs, LLMs, and more.

Building Embedding Pipelines

Design production embedding pipelines - model selection, batch ingestion, incremental indexing, zero-downtime model upgrades, embedding drift detection, normalization, and dimensionality reduction.

Caching for ML Serving

How to reduce ML serving cost and latency with result caching, semantic similarity caching, KV cache for transformers, prefix caching for LLMs, and feature caching with Redis.

Caching Strategies

Four caching layers for LLM applications - exact match, semantic similarity, provider prefix caching, and KV cache - with implementation patterns and production tradeoffs.

Cascade and Funnel Architecture

How multi-stage ranking systems reduce millions of candidates to a final ranked list within strict latency budgets - the architecture behind every major search and recommendation system.

Case Studies: Production LLM Systems

Five detailed production LLM architectures - GitHub Copilot, Notion AI, customer support bots, enterprise RAG, and code review agents - with real architecture decisions, scale numbers, and lessons learned.

CI/CD for ML

Build automated CI/CD pipelines for machine learning - from unit tests on transforms to canary deployments - so model degradation gets caught before it reaches users.

Cloud Cost Management

Implement a full FinOps practice for ML teams - from commitment-based discounts and tagging strategies to budget alerts and spot instance automation.

Computer Vision Systems

Production computer vision at scale - autonomous vehicle perception with 30 cameras at 100Hz, real-time object detection, model compression for edge, active learning, and quality metrics.

Consistency and Availability in ML Systems

How CAP theorem, eventual consistency, and training-serving skew apply to ML systems - feature stores, model versioning, multi-region serving, and when consistency actually matters.

Context Window Management

Engineering strategies for managing context windows in production LLM applications - history truncation, compression, RAG ordering, and prompt caching design.

Data Systems for ML - The Foundation Layer

The complete ML data stack - from raw storage through feature engineering to model training and serving, including data lakes, warehouses, lakehouses, and temporal joins.

Distributed Training Strategies

Master data parallelism (DDP, FSDP), tensor parallelism, pipeline parallelism, 3D parallelism, gradient accumulation, all-reduce communication, and bandwidth requirements for training large models.

Event-Driven Architecture for ML

Event sourcing and CQRS patterns for ML systems - event-driven state management, Kafka Streams for ML pipelines, event schema design, dead letter queues, and event replay for debugging.

Experiment Tracking

Design and govern ML experiment tracking at scale - from MLflow architecture to organizing 50 data scientists' experiments without chaos.

Feature Platform

Build a shared feature platform that eliminates cross-team feature duplication, ensures training-serving consistency, and serves fresh features at millisecond latency.

Feedback Loops and Data Flywheels

How recommendation systems create self-reinforcing feedback loops, how to detect them, and how inverse propensity weighting and exploration strategies break them to enable unbiased learning.

Fraud Detection Systems

Real-time payment fraud detection at Stripe scale - rule-based baselines, graph fraud detection, session-level features, adversarial robustness, and false positive cost analysis.

GPU Architecture for ML Engineers

Understand CUDA cores vs Tensor Cores, GPU memory hierarchy, FLOPS vs memory bandwidth, the roofline model, warp execution, and NVLink - the hardware knowledge that drives ML optimization.

GPU Cost Optimization

Systematically reduce GPU infrastructure costs with spot instances, GPU sharing via MPS and MIG, right-sizing, reserved instances, efficient batching, utilization monitoring, and GPU marketplace strategies.

GPU Memory Management

Master VRAM capacity planning, activation checkpointing, mixed precision training, ZeRO optimizer stages, CPU offloading, and OOM debugging for production ML workloads.

Guardrails and Safety Systems

Build layered defense-in-depth safety systems for LLM applications - input filtering, toxicity detection, PII redaction, prompt injection defense, output validation, and human review escalation.

Hybrid Search - Dense and Sparse Retrieval

Combine BM25 keyword search with dense vector search using SPLADE, Reciprocal Rank Fusion, and learned sparse models to build retrieval systems that beat pure semantic search.

Inference Cost Optimization

Reduce LLM inference costs by 60–80% through quantization, intelligent batching, right-sizing, and autoscaling - turning an $80K/month bill into $20K.

Inference Scaling

Horizontal and vertical scaling for ML inference - autoscaling policies, KEDA with custom GPU metrics, spot instances, global load balancing, and handling traffic spikes.

Kubernetes for ML

Use Kubernetes as ML infrastructure - from GPU scheduling and device plugins to Kubeflow Pipelines and autoscaling - and migrate ML workloads from VMs to K8s without disruption.

Large Language Model Systems

Deploying Llama-3-70B for a 100K DAU application - vLLM serving, tensor parallelism, KV cache management, speculative decoding, LoRA serving, cost management, and RAG integration.

Latency and Cost Tradeoffs

How to decompose LLM latency and cost, choose the right optimization strategies, and define SLOs that balance quality, speed, and budget.

LLM Gateway and Routing

Design and operate an LLM gateway - unified API, model routing, circuit breakers, budget enforcement, and fallback chains - using LiteLLM and custom routing logic.

LLM Product Architecture

The three fundamental LLM product patterns - chat, workflow automation, and autonomous agents - and how to design the production service graph for each.

Low-Latency Optimization

Engineering for ultra-low latency inference - NUMA awareness, CPU affinity, memory pre-allocation, lock-free data structures, cache line optimization, zero-copy inference, CUDA streams, and kernel profiling.

Metadata Filtering with Vector Search

Master pre-filtering vs post-filtering, the ACORN algorithm for filtered HNSW, namespace sharding for multi-tenancy, payload index design, and performance impact of filters in vector databases.

ML Cost Models

Learn to build a complete ML cost model - from compute and storage to hidden data transfer costs - so your team never gets blindsided by a $300K quarterly cloud bill.

ML Platform Design

Designing an internal ML platform for a team of 50 data scientists - feature stores, experiment tracking, model registry, serving infrastructure, and platform adoption strategies.

ML ROI and Business Cases

Build iron-clad ROI cases for ML investments - from quantifying recommendation system value to attributing A/B test results to long-term business outcomes.

MLOps Platform Architecture

Understand the MLOps maturity model from Level 0 to Level 3, design the components of a complete ML platform, and build a realistic 12-month roadmap from ad-hoc to automated.

Model Compilation and Optimization

Compiler-level optimizations for ML inference - TensorRT, torch.compile, ONNX export, kernel fusion, layer fusion, XLA, and profiling bottlenecks.

Model Efficiency Economics

Analyze the accuracy-cost Pareto frontier to determine when model improvements are economically justified - and how to build the business case that the current model is cost-optimal.

Model Monitoring Platform

Build production model monitoring infrastructure that catches data drift, prediction drift, and concept drift - detecting model degradation within 24 hours instead of two months.

Model Registry and Versioning

Design a model registry that enables 3-minute rollbacks, full model lineage, and controlled staging-to-production promotion - turning model lifecycle management from a manual process into a reliable system.

Module 01: Systems Foundations

Master the foundational principles of AI system design - from requirements gathering to distributed systems theory applied to machine learning.

Module 10 - AI Platform Engineering

Build the internal platform that lets data scientists ship models to production in days, not months - covering MLOps architecture, experiment tracking, CI/CD for ML, and Kubernetes-native ML infrastructure.

Module 8 - GPU and TPU Infrastructure

Master GPU architecture, memory management, distributed training, fault-tolerant clusters, TPU workloads, inference hardware, and cost optimization for ML infrastructure.

Module 9 - Cost & FinOps for AI

Master AI infrastructure economics - from cost modeling to FinOps culture - so you can build powerful systems without burning your budget.

Monitoring ML Serving in Production

Production monitoring for ML serving - inference latency histograms, GPU metrics, throughput monitoring, error rates, distributed tracing with OpenTelemetry, and drift detection.

Multi-Model Serving

How to serve hundreds of models efficiently - model multiplexing, ensembles in production, A/B testing infrastructure, shadow mode, canary deployments, and multi-tenant GPU resource isolation.

Multi-Task Learning Systems

How production ML systems share representations across multiple objectives simultaneously - covering hard vs soft parameter sharing, loss balancing, gradient conflicts, and negative transfer detection.

Observability for LLM Apps

Build production observability for LLM applications - distributed tracing, quality metrics, cost attribution, prompt versioning, and drift detection using LangSmith, Langfuse, and Helicone.

Online Learning

Continuous learning in production - online learning vs mini-batch, concept drift adaptation, Vowpal Wabbit, streaming gradient descent, bandit algorithms, and preventing catastrophic forgetting.

RAG System Design

How to design Retrieval-Augmented Generation systems for production - from naive RAG to advanced pipelines with chunking strategies, hybrid search, reranking, and RAG evaluation.

Real-Time Inference Design

Architecture for ML inference at 1M QPS with a sub-10ms SLA - synchronous vs asynchronous real-time serving, circuit breakers, fallback models, and timeout budget management.

Recommendation Systems at Scale

End-to-end system design for YouTube-scale video recommendation - candidate generation, multi-stage ranking, post-processing for diversity, cold start, and session modeling.

Running Vector Databases in Production

Master monitoring, capacity planning, index building strategy, warm-up, disaster recovery, index versioning, gradual rollout, and cost optimization for production vector database operations.

Scaling Vector Databases to Billions of Vectors

Architect horizontal sharding, replication, consistent hashing, hot-cold tiering, distributed HNSW, geographic distribution, and backup strategies for production vector databases at billion-vector scale.

Search and Retrieval Systems

Redesigning an Elasticsearch-only search system with neural search - from BM25 baseline through dense retrieval, learning to rank, query understanding, and search quality evaluation.

Self-Service ML Platform

Build ML platforms that data scientists actually use - applying product thinking to internal tooling, from user research and notebook-to-production workflows to adoption metrics and guardrails.

Specialized Inference Hardware

Compare AWS Inferentia/Trainium, NVIDIA L4/L40S, edge inference hardware (Jetson, Apple Neural Engine), hardware-specific quantization, and cost-performance tradeoffs for production AI inference.

Streaming Inference

Running ML inference on data streams - Kafka integration, Flink ML, stateful stream processing, windowed feature aggregations, exactly-once inference, and time semantics.

Temporal Features for Real-Time ML

Engineering time-based features for real-time ML - recency-weighted features, session features, sliding window aggregations, point-in-time joins, temporal leakage prevention, and clock skew in distributed systems.

The ML System Design Framework

A structured 4-step framework for approaching ML system design interviews and real production projects - from requirements to deep dive.

TPU Architecture and Use

Deep dive into Google TPU v4/v5 architecture, systolic arrays, XLA compilation, TPU pods, JAX programming model, cost comparison with GPUs, and when TPUs outperform GPU clusters.

Training Cost Optimization

Reduce model training costs by 60–80% through spot instances, gradient checkpointing, mixed precision, and compute-optimal training - without sacrificing accuracy.

Training Infrastructure at Scale

Build fault-tolerant GPU training clusters with InfiniBand, NCCL collective operations, Slurm and Kubernetes job scheduling, elastic training, and automatic checkpointing for multi-day training runs.

Two-Tower Models

How dual encoder architectures power billion-scale recommendation and search by separating user and item representations and querying them with approximate nearest neighbor search.

Vector Similarity Search Fundamentals

Master cosine similarity, dot product, L2 distance, exact vs approximate search, the curse of dimensionality, and how to evaluate vector search quality with recall@K.