Approximate Nearest Neighbor Algorithms
Deep dive into HNSW, IVF, Product Quantization, IVFPQ, LSH, and DiskANN - how each algorithm trades recall for speed and how to choose the right one for your dataset.
How to estimate storage, compute, memory, and infrastructure requirements for ML systems before writing a line of code - including the 6PD training compute rule and model sizing.
How static, dynamic, and continuous batching work - and how to go from 20% GPU utilization to 85% without increasing latency.
A rigorous financial and risk framework for deciding when to build ML infrastructure in-house vs use managed services - applied to feature stores, vector DBs, LLMs, and more.
Design production embedding pipelines - model selection, batch ingestion, incremental indexing, zero-downtime model upgrades, embedding drift detection, normalization, and dimensionality reduction.
How to reduce ML serving cost and latency with result caching, semantic similarity caching, KV cache for transformers, prefix caching for LLMs, and feature caching with Redis.
Four caching layers for LLM applications - exact match, semantic similarity, provider prefix caching, and KV cache - with implementation patterns and production tradeoffs.
How multi-stage ranking systems reduce millions of candidates to a final ranked list within strict latency budgets - the architecture behind every major search and recommendation system.
Five detailed production LLM architectures - GitHub Copilot, Notion AI, customer support bots, enterprise RAG, and code review agents - with real architecture decisions, scale numbers, and lessons learned.
Build automated CI/CD pipelines for machine learning - from unit tests on transforms to canary deployments - so model degradation gets caught before it reaches users.
Implement a full FinOps practice for ML teams - from commitment-based discounts and tagging strategies to budget alerts and spot instance automation.
Production computer vision at scale - autonomous vehicle perception with 30 cameras at 100Hz, real-time object detection, model compression for edge, active learning, and quality metrics.
How CAP theorem, eventual consistency, and training-serving skew apply to ML systems - feature stores, model versioning, multi-region serving, and when consistency actually matters.
Engineering strategies for managing context windows in production LLM applications - history truncation, compression, RAG ordering, and prompt caching design.
The complete ML data stack - from raw storage through feature engineering to model training and serving, including data lakes, warehouses, lakehouses, and temporal joins.
Master data parallelism (DDP, FSDP), tensor parallelism, pipeline parallelism, 3D parallelism, gradient accumulation, all-reduce communication, and bandwidth requirements for training large models.
Event sourcing and CQRS patterns for ML systems - event-driven state management, Kafka Streams for ML pipelines, event schema design, dead letter queues, and event replay for debugging.
Design and govern ML experiment tracking at scale - from MLflow architecture to organizing 50 data scientists' experiments without chaos.
How to design statistically rigorous experiments for ML systems - Bayesian vs frequentist A/B tests, network interference, interleaving, switchback experiments, and guardrail metrics.
Build a shared feature platform that eliminates cross-team feature duplication, ensures training-serving consistency, and serves fresh features at millisecond latency.
How recommendation systems create self-reinforcing feedback loops, how to detect them, and how inverse propensity weighting and exploration strategies break them to enable unbiased learning.
Real-time payment fraud detection at Stripe scale - rule-based baselines, graph fraud detection, session-level features, adversarial robustness, and false positive cost analysis.
Understand CUDA cores vs Tensor Cores, GPU memory hierarchy, FLOPS vs memory bandwidth, the roofline model, warp execution, and NVLink - the hardware knowledge that drives ML optimization.
Systematically reduce GPU infrastructure costs with spot instances, GPU sharing via MPS and MIG, right-sizing, reserved instances, efficient batching, utilization monitoring, and GPU marketplace strategies.
Master VRAM capacity planning, activation checkpointing, mixed precision training, ZeRO optimizer stages, CPU offloading, and OOM debugging for production ML workloads.
Build layered defense-in-depth safety systems for LLM applications - input filtering, toxicity detection, PII redaction, prompt injection defense, output validation, and human review escalation.
Combine BM25 keyword search with dense vector search using SPLADE, Reciprocal Rank Fusion, and learned sparse models to build retrieval systems that beat pure semantic search.
Reduce LLM inference costs by 60–80% through quantization, intelligent batching, right-sizing, and autoscaling - turning an $80K/month bill into $20K.
Horizontal and vertical scaling for ML inference - autoscaling policies, KEDA with custom GPU metrics, spot instances, global load balancing, and handling traffic spikes.
Use Kubernetes as ML infrastructure - from GPU scheduling and device plugins to Kubeflow Pipelines and autoscaling - migrating ML workloads from VMs to K8s without disruption.
Deploying Llama-3-70B for a 100K DAU application - vLLM serving, tensor parallelism, KV cache management, speculative decoding, LoRA serving, cost management, and RAG integration.
How to decompose LLM latency and cost, choose the right optimization strategies, and define SLOs that balance quality, speed, and budget.
Understanding the fundamental tension between latency and throughput in ML serving - Little's Law, tail latency, batching strategies, and caching for production ML systems.
Design and operate an LLM gateway - unified API, model routing, circuit breakers, budget enforcement, and fallback chains - using LiteLLM and custom routing logic.
The three fundamental LLM product patterns - chat, workflow automation, and autonomous agents - and how to design the production service graph for each.
Engineering for ultra-low latency inference - NUMA awareness, CPU affinity, memory pre-allocation, lock-free data structures, cache line optimization, zero-copy inference, CUDA streams, and kernel profiling.
Master pre-filtering vs post-filtering, the ACORN algorithm for filtered HNSW, namespace sharding for multi-tenancy, payload index design, and the performance impact of filters in vector databases.
Learn to build a complete ML cost model - from compute and storage to hidden data transfer costs - so your team never gets blindsided by a $300K quarterly cloud bill.
Designing an internal ML platform for a team of 50 data scientists - feature stores, experiment tracking, model registry, serving infrastructure, and platform adoption strategies.
Build iron-clad ROI cases for ML investments - from quantifying recommendation system value to attributing A/B test results to long-term business outcomes.
Understand the MLOps maturity model from Level 0 to Level 3, design the components of a complete ML platform, and build a realistic 12-month roadmap from ad-hoc to automated.
Compiler-level optimizations for ML inference - TensorRT, torch.compile, ONNX export, kernel fusion, layer fusion, XLA, and profiling bottlenecks.
Analyze the accuracy-cost Pareto frontier to determine when model improvements are economically justified - and how to make the business case that the current model is already cost-optimal.
Build production model monitoring infrastructure that catches data drift, prediction drift, and concept drift - detecting model degradation within 24 hours instead of two months.
How quantization reduces model size and inference latency - from FP32 to INT8 to INT4 - covering PTQ, QAT, GPTQ, AWQ, and GGUF with accuracy tradeoffs.
Design a model registry that enables 3-minute rollbacks, full model lineage, and controlled staging-to-production promotion - turning model lifecycle management from a manual process into a reliable system.
Master the foundational principles of AI system design - from requirements gathering to distributed systems theory applied to machine learning.
Production architecture for AI-powered products - from prototype to reliable, scalable, cost-efficient systems.
Build the internal platform that lets data scientists ship models to production in days, not months - covering MLOps architecture, experiment tracking, CI/CD for ML, and Kubernetes-native ML infrastructure.
Master vector similarity search, ANN algorithms, embedding pipelines, hybrid search, and production vector database deployment.
Master GPU architecture, memory management, distributed training, fault-tolerant clusters, TPU workloads, inference hardware, and cost optimization for ML infrastructure.
Master AI infrastructure economics - from cost modeling to FinOps culture - so you can build powerful systems without burning your budget.
Production monitoring for ML serving - inference latency histograms, GPU metrics, throughput monitoring, error rates, distributed tracing with OpenTelemetry, and drift detection.
How to serve hundreds of models efficiently - model multiplexing, ensembles in production, A/B testing infrastructure, shadow mode, canary deployments, and multi-tenant GPU resource isolation.
How production ML systems share representations across multiple objectives simultaneously - covering hard vs soft parameter sharing, loss balancing, gradient conflicts, and negative transfer detection.
Build production observability for LLM applications - distributed tracing, quality metrics, cost attribution, prompt versioning, and drift detection using LangSmith, Langfuse, and Helicone.
Continuous learning in production - online learning vs mini-batch, concept drift adaptation, Vowpal Wabbit, streaming gradient descent, bandit algorithms, and preventing catastrophic forgetting.
How to design Retrieval Augmented Generation systems for production - from naive RAG to advanced pipelines with chunking strategies, hybrid search, reranking, and RAG evaluation.
Architecture for ML inference at 1M QPS with sub-10ms SLA - synchronous vs async real-time, circuit breakers, fallback models, and timeout budget management.
End-to-end system design for YouTube-scale video recommendation - candidate generation, multi-stage ranking, post-processing for diversity, cold start, and session modeling.
How to gather, prioritize, and translate business requirements into technical specifications for ML systems - including latency budgets, SLOs, and ML-specific constraints.
Master monitoring, capacity planning, index building strategy, warm-up, disaster recovery, index versioning, gradual rollout, and cost optimization for production vector database operations.
Architect horizontal sharding, replication, consistent hashing, hot-cold tiering, distributed HNSW, geographic distribution, and backup strategies for production vector databases at billion-vector scale.
Redesigning an Elasticsearch-only search system with neural search - from BM25 baseline through dense retrieval, learning to rank, query understanding, and search quality evaluation.
Build ML platforms that data scientists actually use - applying product thinking to internal tooling, from user research and notebook-to-production workflows to adoption metrics and guardrails.
How to choose the right serving protocol for ML models - REST, gRPC, and WebSocket compared across latency, throughput, streaming, and operational complexity.
Compare AWS Inferentia/Trainium, NVIDIA L4/L40S, edge inference hardware (Jetson, Apple Neural Engine), hardware-specific quantization, and cost-performance tradeoffs for production AI inference.
Running ML inference on data streams - Kafka integration, Flink ML, stateful stream processing, windowed feature aggregations, exactly-once inference, and time semantics.
Engineering time-based features for real-time ML - recency-weighted features, session features, sliding window aggregations, point-in-time joins, temporal leakage prevention, and clock skew in distributed systems.
A structured 4-step framework for approaching ML system design interviews and real production projects - from requirements to deep dive.
Deep dive into Google TPU v4/v5 architecture, systolic arrays, XLA compilation, TPU pods, JAX programming model, cost comparison with GPUs, and when TPUs outperform GPU clusters.
Reduce model training costs by 60–80% through spot instances, gradient checkpointing, mixed precision, and compute-optimal training - without sacrificing accuracy.
Build fault-tolerant GPU training clusters with InfiniBand, NCCL collective operations, Slurm and Kubernetes job scheduling, elastic training, and automatic checkpointing for multi-day training runs.
How dual encoder architectures power billion-scale recommendation and search by separating user and item representations and querying them with approximate nearest neighbor search.
Systematic comparison of the major vector databases - architecture, managed vs self-hosted, hybrid search, filtering, update performance, consistency, and cost.
Master cosine similarity, dot product, L2 distance, exact vs approximate search, the curse of dimensionality, and how to evaluate vector search quality with recall@K.