104 docs tagged with "ai-systems"

Ad Click Prediction at Scale

End-to-end design of a production ad click prediction system - covering Wide & Deep learning, feature engineering at scale, online learning, calibration, and serving under 10ms.

AI Systems Design - Engineering Track

Design and build production ML systems - model serving, real-time inference, vector databases, GPU infrastructure, cost optimization, and platform engineering.

Approximate Nearest Neighbor Algorithms

Deep dive into HNSW, IVF, Product Quantization, IVFPQ, LSH, and DiskANN - how each algorithm trades recall for speed and how to choose the right one for your dataset.

Batch Inference Pipelines

Designing efficient batch inference pipelines for scoring millions of examples - architecture, GPU utilization, failure recovery, and production patterns.

Build vs Buy Analysis

A rigorous financial and risk framework for deciding when to build ML infrastructure in-house vs use managed services - applied to feature stores, vector DBs, LLMs, and more.

Building Embedding Pipelines

Design production embedding pipelines - model selection, batch ingestion, incremental indexing, zero-downtime model upgrades, embedding drift detection, normalization, and dimensionality reduction.

Caching for ML Serving

How to reduce ML serving cost and latency with result caching, semantic similarity caching, KV cache for transformers, prefix caching for LLMs, and feature caching with Redis.

Cascade and Funnel Architecture

How multi-stage ranking systems reduce millions of candidates to a final ranked list within strict latency budgets - the architecture behind every major search and recommendation system.

CI/CD for ML

Build automated CI/CD pipelines for machine learning - from unit tests on transforms to canary deployments - so model degradation gets caught before it reaches users.

Cloud Cost Management

Implement a full FinOps practice for ML teams - from commitment-based discounts and tagging strategies to budget alerts and spot instance automation.

Computer Vision Systems

Production computer vision at scale - autonomous vehicle perception with 30 cameras at 100Hz, real-time object detection, model compression for edge, active learning, and quality metrics.

Consistency and Availability in ML Systems

How CAP theorem, eventual consistency, and training-serving skew apply to ML systems - feature stores, model versioning, multi-region serving, and when consistency actually matters.

Data Lake and Data Warehouse for ML

The evolution from database to data lake to lakehouse - when to use each storage architecture for ML training data, feature engineering, and model serving.

Data Quality and Validation for ML

Why data quality is the number-one cause of ML failures in production - Great Expectations, data contracts, PSI distribution monitoring, and pipeline quality gates.

Data Systems for ML - The Foundation Layer

The complete ML data stack - from raw storage through feature engineering to model training and serving, including data lakes, warehouses, lakehouses, and temporal joins.

Data Versioning with Delta Lake

ACID transactions, time travel, schema evolution, and training data versioning with Delta Lake - building reproducible ML pipelines on object storage.

Designing a Content Moderation System

End-to-end design of a large-scale content moderation system - covering multi-modal ML pipelines, human review integration, active learning, adversarial robustness, and platform-scale architecture.

Designing a Fraud Detection System at Scale

End-to-end design of a real-time fraud detection system - covering feature engineering, imbalanced learning, streaming scoring, delayed labels, and graph-based fraud ring detection.

Designing a Recommendation System at Scale

End-to-end design of a recommendation system serving billions of items to millions of users - covering two-stage architecture, candidate generation, ranking, cold start, and serving at scale.

Designing a Search Ranking System

End-to-end design of a production search ranking system - covering query understanding, BM25 + dense retrieval, Learning to Rank, semantic reranking, and A/B testing metrics.

Distributed Training Strategies

Master data parallelism (DDP, FSDP), tensor parallelism, pipeline parallelism, 3D parallelism, gradient accumulation, all-reduce communication, and bandwidth requirements for training large models.

Edge ML Deployment

Deploying ML models to smartphones, IoT devices, and embedded systems - model compression, edge runtimes, OTA updates, federated learning, and real-world examples.

Event Sourcing for ML Systems

Learn how event sourcing enables auditable, reproducible ML systems - covering the event log, Kafka as an event store, temporal queries, and the projection pattern.

Event-Driven Architecture for ML

Event sourcing and CQRS patterns for ML systems - event-driven state management, Kafka Streams for ML pipelines, event schema design, dead letter queues, and event replay for debugging.

Event-Driven ML Architecture

Designing ML systems around events - event sourcing, CQRS for feature stores, the outbox pattern, and how LinkedIn's unified messaging platform drives ML at scale.

Experiment Tracking

Design and govern ML experiment tracking at scale - from MLflow architecture to organizing 50 data scientists' experiments without chaos.

Feature Platform

Build a shared feature platform that eliminates cross-team feature duplication, ensures training-serving consistency, and serves fresh features at millisecond latency.

Feature Store Architecture

How feature stores solve training-serving skew with a dual-store architecture - offline store for training, online store for serving, and point-in-time correct retrieval.

Feedback Loops and Data Flywheels

How recommendation systems create self-reinforcing feedback loops, how to detect them, and how inverse propensity weighting and exploration strategies break them to enable unbiased learning.

Fraud Detection Systems

Real-time payment fraud detection at Stripe scale - rule-based baselines, graph fraud detection, session-level features, adversarial robustness, and false positive cost analysis.

GPU Architecture for ML Engineers

Understand CUDA cores vs Tensor Cores, GPU memory hierarchy, FLOPS vs memory bandwidth, the roofline model, warp execution, and NVLink - the hardware knowledge that drives ML optimization.

GPU Cost Optimization

Systematically reduce GPU infrastructure costs with spot instances, GPU sharing via MPS and MIG, right-sizing, reserved instances, efficient batching, utilization monitoring, and GPU marketplace strategies.

GPU Memory Management

Master VRAM capacity planning, activation checkpointing, mixed precision training, ZeRO optimizer stages, CPU offloading, and OOM debugging for production ML workloads.

Hybrid Search - Dense and Sparse Retrieval

Combine BM25 keyword search with dense vector search using SPLADE, Reciprocal Rank Fusion, and learned sparse models to build retrieval systems that beat pure semantic search.

Inference Cost Optimization

Reduce LLM inference costs by 60–80% through quantization, intelligent batching, right-sizing, and autoscaling - turning an $80K/month bill into $20K.

Inference Scaling

Horizontal and vertical scaling for ML inference - autoscaling policies, KEDA with custom GPU metrics, spot instances, global load balancing, and handling traffic spikes.

Kubernetes for ML

Use Kubernetes as ML infrastructure - from GPU scheduling and device plugins to Kubeflow Pipelines and autoscaling - migrating ML workloads from VMs to K8s without disruption.

Lakehouse Architecture for ML

Lakehouse architecture for ML systems - Delta Lake, Apache Iceberg, Apache Hudi, medallion architecture, query engines, and ML pipelines on the lakehouse.

Large Language Model Systems

Deploying Llama-3-70B for a 100K DAU application - vLLM serving, tensor parallelism, KV cache management, speculative decoding, LoRA serving, cost management, and RAG integration.

LLM-Powered Product Architecture

End-to-end design of a production LLM-powered product - covering the serving stack, prompt management, RAG architecture, multi-LLM routing, streaming, cost management, and observability.

Low-Latency Inference Patterns

Engineering ML predictions under 10ms p99 - hardware choices, model optimization, batching strategies, pre-computation, memory layout, and real production targets.

Low-Latency Optimization

Engineering for ultra-low latency inference - NUMA awareness, CPU affinity, memory pre-allocation, lock-free data structures, cache line optimization, zero-copy inference, CUDA streams, and kernel profiling.

Metadata Filtering with Vector Search

Master pre-filtering vs post-filtering, the ACORN algorithm for filtered HNSW, namespace sharding for multi-tenancy, payload index design, and performance impact of filters in vector databases.

Microservices for ML Systems

Learn when and how to decompose ML systems into microservices - covering feature services, model services, service mesh, gRPC, and circuit breakers.

ML Cost Models

Learn to build a complete ML cost model - from compute and storage to hidden data transfer costs - so your team never gets blindsided by a $300K quarterly cloud bill.

ML Platform Design

Designing an internal ML platform for a team of 50 data scientists - feature stores, experiment tracking, model registry, serving infrastructure, and platform adoption strategies.

ML Platform Design

Learn how to design internal ML platforms that enable data scientists and engineers to train, deploy, and monitor models efficiently - covering platform components, build vs buy, and real-world case studies.

ML ROI and Business Cases

Build iron-clad ROI cases for ML investments - from quantifying recommendation system value to attributing A/B test results to long-term business outcomes.

MLOps Platform Architecture

Understand the MLOps maturity model from Level 0 to Level 3, design the components of a complete ML platform, and build a realistic 12-month roadmap from ad-hoc to automated.

Model Compilation and Optimization

Compiler-level optimizations for ML inference - TensorRT, torch.compile, ONNX export, kernel fusion, layer fusion, XLA, and profiling bottlenecks.

Model Efficiency Economics

Analyze the accuracy-cost Pareto frontier to determine when model improvements are economically justified - and how to build the business case that the current model is cost-optimal.

Model Monitoring Platform

Build production model monitoring infrastructure that catches data drift, prediction drift, and concept drift - detecting model degradation within 24 hours instead of two months.

Model Registry and Versioning

Design a model registry that enables 3-minute rollbacks, full model lineage, and controlled staging-to-production promotion - turning model lifecycle management from a manual process into a reliable system.

Module 01: Systems Foundations

Master the foundational principles of AI system design - from requirements gathering to distributed systems theory applied to machine learning.

Module 03: Model Serving

Production patterns for serving ML model predictions - from protocol choice and batching to quantization, compilation, caching, and autoscaling.

Module 04: Real-Time ML Systems

Architecture patterns for real-time machine learning - from sub-10ms inference at scale to online learning, streaming inference pipelines, and ultra-low-latency optimization.

Module 06: Case Studies

Real-world end-to-end case studies of production ML systems - recommendation, search, fraud, content moderation, ad click prediction, and LLM-powered products.

Module 10 - AI Platform Engineering

Build the internal platform that lets data scientists ship models to production in days, not months - covering MLOps architecture, experiment tracking, CI/CD for ML, and Kubernetes-native ML infrastructure.

Module 2 - Data Infrastructure

A complete map of the Data Infrastructure module covering data lakes, Spark, Kafka, feature stores, data quality, Delta Lake, and lakehouse architecture for ML.

Module 8 - GPU and TPU Infrastructure

Master GPU architecture, memory management, distributed training, fault-tolerant clusters, TPU workloads, inference hardware, and cost optimization for ML infrastructure.

Module 9 - Cost & FinOps for AI

Master AI infrastructure economics - from cost modeling to FinOps culture - so you can build powerful systems without burning your budget.

Monitoring ML Serving in Production

Production monitoring for ML serving - inference latency histograms, GPU metrics, throughput monitoring, error rates, distributed tracing with OpenTelemetry, and drift detection.

Multi-Model Serving

How to serve hundreds of models efficiently - model multiplexing, ensembles in production, A/B testing infrastructure, shadow mode, canary deployments, and multi-tenant GPU resource isolation.

Multi-Task Learning Systems

How production ML systems share representations across multiple objectives simultaneously - covering hard vs soft parameter sharing, loss balancing, gradient conflicts, and negative transfer detection.

Multi-Tenant ML Platforms

Learn how to design ML platforms that safely serve multiple teams from shared GPU infrastructure - covering Kubernetes isolation, fair scheduling, data isolation, cost attribution, and quota management.

Online Learning

Continuous learning in production - online learning vs mini-batch, concept drift adaptation, Vowpal Wabbit, streaming gradient descent, bandit algorithms, and preventing catastrophic forgetting.

RAG System Design

How to design Retrieval-Augmented Generation systems for production - from naive RAG to advanced pipelines with chunking strategies, hybrid search, reranking, and RAG evaluation.

Real-Time Feature Engineering at Scale

Computing ML features from raw events within milliseconds - Redis patterns, sliding window aggregations, session detection, and Uber's Michelangelo real-time pipeline.

Real-Time Inference Design

Architecture for ML inference at 1M QPS with sub-10ms SLA - synchronous vs async real-time, circuit breakers, fallback models, and timeout budget management.

Recommendation Systems at Scale

End-to-end system design for YouTube-scale video recommendation - candidate generation, multi-stage ranking, post-processing for diversity, cold start, and session modeling.

REST vs gRPC for ML Model Serving

A production engineer's guide to choosing between REST and gRPC for ML APIs - protocol mechanics, performance trade-offs, and when each wins.

Running Vector Databases in Production

Master monitoring, capacity planning, index building strategy, warm-up, disaster recovery, index versioning, gradual rollout, and cost optimization for production vector database operations.

Scaling Vector Databases to Billions of Vectors

Architect horizontal sharding, replication, consistent hashing, hot-cold tiering, distributed HNSW, geographic distribution, and backup strategies for production vector databases at billion-vector scale.

Search and Retrieval Systems

Redesigning an Elasticsearch-only search system with neural search - from BM25 baseline through dense retrieval, learning to rank, query understanding, and search quality evaluation.

Self-Service ML Platform

Build ML platforms that data scientists actually use - applying product thinking to internal tooling, from user research and notebook-to-production workflows to adoption metrics and guardrails.

Specialized Inference Hardware

Compare AWS Inferentia/Trainium, NVIDIA L4/L40S, edge inference hardware (Jetson, Apple Neural Engine), hardware-specific quantization, and cost-performance tradeoffs for production AI inference.

Stream Processing for ML Systems

Continuous feature computation on unbounded data streams using Apache Flink - windowing, watermarks, state management, and production ML feature pipelines.

Streaming Inference

Running ML inference on data streams - Kafka integration, Flink ML, stateful stream processing, windowed feature aggregations, exactly-once inference, and time semantics.

Synchronous vs Asynchronous Inference

When to use synchronous versus asynchronous inference patterns for ML systems - queue architectures, streaming, timeout handling, and production trade-offs.

Temporal Features for Real-Time ML

Engineering time-based features for real-time ML - recency-weighted features, session features, sliding window aggregations, point-in-time joins, temporal leakage prevention, and clock skew in distributed systems.

The ML System Design Framework

A structured 4-step framework for approaching ML system design interviews and real production projects - from requirements to deep dive.

TPU Architecture and Use

Deep dive into Google TPU v4/v5 architecture, systolic arrays, XLA compilation, TPU pods, JAX programming model, cost comparison with GPUs, and when TPUs outperform GPU clusters.

Training Cost Optimization

Reduce model training costs by 60–80% through spot instances, gradient checkpointing, mixed precision, and compute-optimal training - without sacrificing accuracy.

Training Infrastructure at Scale

Build fault-tolerant GPU training clusters with InfiniBand, NCCL collective operations, Slurm and Kubernetes job scheduling, elastic training, and automatic checkpointing for multi-day training runs.

Triton Inference Server and TorchServe

Production-grade ML serving frameworks - NVIDIA Triton's dynamic batching and multi-backend support, TorchServe's PyTorch-native serving, and when to use each.

Two-Tower Models

How dual encoder architectures power billion-scale recommendation and search by separating user and item representations and querying them with approximate nearest neighbor search.

Vector Similarity Search Fundamentals

Master cosine similarity, dot product, L2 distance, exact vs approximate search, the curse of dimensionality, and how to evaluate vector search quality with recall@K.