104 docs tagged with "ai-systems"

Ad Click Prediction at Scale

End-to-end design of a production ad click prediction system - covering Wide and Deep learning, feature engineering at scale, online learning, calibration, and serving under 10ms.

AI Systems Design - Engineering Track

Design and build production ML systems - model serving, real-time inference, vector databases, GPU infrastructure, cost optimization, and platform engineering.

Approximate Nearest Neighbor Algorithms

Deep dive into HNSW, IVF, Product Quantization, IVFPQ, LSH, and DiskANN - how each algorithm trades recall for speed and how to choose the right one for your dataset.

Back-of-the-Envelope Estimation for ML Systems

How to estimate storage, compute, memory, and infrastructure requirements for ML systems before writing a line of code - including the 6PD training compute rule and model sizing.

Batch Inference Pipelines

Designing efficient batch inference pipelines for scoring millions of examples - architecture, GPU utilization, failure recovery, and production patterns.

Batch Processing with Spark for ML Pipelines

How Apache Spark processes terabyte-scale training data - architecture, DataFrames, partitioning, joins, and integration with Delta Lake for ML feature engineering.

Batching Strategies for Inference

How static, dynamic, and continuous batching work - and how to go from 20% GPU utilization to 85% without increasing latency.

Build vs Buy Analysis

A rigorous financial and risk framework for deciding when to build ML infrastructure in-house vs use managed services - applied to feature stores, vector DBs, LLMs, and more.

Building Embedding Pipelines

Design production embedding pipelines - model selection, batch ingestion, incremental indexing, zero-downtime model upgrades, embedding drift detection, normalization, and dimensionality reduction.

Caching for ML Serving

How to reduce ML serving cost and latency with result caching, semantic similarity caching, KV cache for transformers, prefix caching for LLMs, and feature caching with Redis.

Canary and Blue-Green Deployments for ML Models

Safe model rollout strategies - canary deployments for gradual traffic migration, blue-green for instant switch, and automated rollback triggers.

Cascade and Funnel Architecture

How multi-stage ranking systems reduce millions of candidates to a final ranked list within strict latency budgets - the architecture behind every major search and recommendation system.

CI/CD for ML

Build automated CI/CD pipelines for machine learning - from unit tests on transforms to canary deployments - so model degradation gets caught before it reaches users.

Cloud Cost Management

Implement a full FinOps practice for ML teams - from commitment-based discounts and tagging strategies to budget alerts and spot instance automation.

Computer Vision Systems

Production computer vision at scale - autonomous vehicle perception with 30 cameras at 100Hz, real-time object detection, model compression for edge, active learning, and quality metrics.

Consistency and Availability in ML Systems

How CAP theorem, eventual consistency, and training-serving skew apply to ML systems - feature stores, model versioning, multi-region serving, and when consistency actually matters.

Data Lake and Data Warehouse for ML

The evolution from database to data lake to lakehouse - when to use each storage architecture for ML training data, feature engineering, and model serving.

Data Quality and Validation for ML

Why data quality is the number-one cause of ML failures in production - Great Expectations, data contracts, PSI distribution monitoring, and pipeline quality gates.

Data Systems for ML - The Foundation Layer

The complete ML data stack - from raw storage through feature engineering to model training and serving, including data lakes, warehouses, lakehouses, and temporal joins.

Data Versioning with Delta Lake

ACID transactions, time travel, schema evolution, and training data versioning with Delta Lake - building reproducible ML pipelines on object storage.

Designing a Content Moderation System

End-to-end design of a large-scale content moderation system - covering multi-modal ML pipelines, human review integration, active learning, adversarial robustness, and platform-scale architecture.

Designing a Fraud Detection System at Scale

End-to-end design of a real-time fraud detection system - covering feature engineering, imbalanced learning, streaming scoring, delayed labels, and graph-based fraud ring detection.

Designing a Recommendation System at Scale

End-to-end design of a recommendation system serving billions of items to millions of users - covering two-stage architecture, candidate generation, ranking, cold start, and serving at scale.

Designing a Search Ranking System

End-to-end design of a production search ranking system - covering query understanding, BM25 + dense retrieval, Learning to Rank, semantic reranking, and A/B testing metrics.

Distributed Training Strategies

Master data parallelism (DDP, FSDP), tensor parallelism, pipeline parallelism, 3D parallelism, gradient accumulation, all-reduce communication, and bandwidth requirements for training large models.

Edge ML Deployment

Deploying ML models to smartphones, IoT devices, and embedded systems - model compression, edge runtimes, OTA updates, federated learning, and real-world examples.

Event Sourcing for ML Systems

Learn how event sourcing enables auditable, reproducible ML systems - covering the event log, Kafka as an event store, temporal queries, and the projection pattern.

Event-Driven Architecture for ML

Event sourcing and CQRS patterns for ML systems - event-driven state management, Kafka Streams for ML pipelines, event schema design, dead letter queues, and event replay for debugging.

Event-Driven ML Architecture

Designing ML systems around events - event sourcing, CQRS for feature stores, the outbox pattern, and how LinkedIn's unified messaging platform drives ML at scale.

Experiment Tracking

Design and govern ML experiment tracking at scale - from MLflow architecture to organizing 50 data scientists' experiments without chaos.

Experimentation and A/B Testing for ML Systems

How to design statistically rigorous experiments for ML systems - Bayesian vs frequentist A/B tests, network interference, interleaving, switchback experiments, and guardrail metrics.

Feature Platform

Build a shared feature platform that eliminates cross-team feature duplication, ensures training-serving consistency, and serves fresh features at millisecond latency.

Feature Store Architecture

How feature stores solve training-serving skew with a dual-store architecture - offline store for training, online store for serving, and point-in-time correct retrieval.

Feedback Loops and Data Flywheels

How recommendation systems create self-reinforcing feedback loops, how to detect them, and how inverse propensity weighting and exploration strategies break them to enable unbiased learning.

Fraud Detection Systems

Real-time payment fraud detection at Stripe scale - rule-based baselines, graph fraud detection, session-level features, adversarial robustness, and false positive cost analysis.

GPU Architecture for ML Engineers

Understand CUDA cores vs Tensor Cores, GPU memory hierarchy, FLOPS vs memory bandwidth, the roofline model, warp execution, and NVLink - the hardware knowledge that drives ML optimization.

GPU Cost Optimization

Systematically reduce GPU infrastructure costs with spot instances, GPU sharing via MPS and MIG, right-sizing, reserved instances, efficient batching, utilization monitoring, and GPU marketplace strategies.

GPU Memory Management

Master VRAM capacity planning, activation checkpointing, mixed precision training, ZeRO optimizer stages, CPU offloading, and OOM debugging for production ML workloads.

Hybrid Search - Dense and Sparse Retrieval

Combine BM25 keyword search with dense vector search using SPLADE, Reciprocal Rank Fusion, and learned sparse models to build retrieval systems that beat pure semantic search.

Inference Cost Optimization

Reduce LLM inference costs by 60–80% through quantization, intelligent batching, right-sizing, and autoscaling - turning an $80K/month bill into $20K.

Inference Scaling

Horizontal and vertical scaling for ML inference - autoscaling policies, KEDA with custom GPU metrics, spot instances, global load balancing, and handling traffic spikes.

Kubernetes for ML

Use Kubernetes as ML infrastructure - from GPU scheduling and device plugins to Kubeflow Pipelines and autoscaling - migrating ML workloads from VMs to K8s without disruption.

Lakehouse Architecture for ML

Lakehouse architecture for ML systems - Delta Lake, Apache Iceberg, Apache Hudi, medallion architecture, query engines, and ML pipelines on the lakehouse.

Lambda and Kappa Architecture for ML Systems

Master Lambda and Kappa architecture - the two dominant patterns for building ML systems that handle both historical and real-time data at scale.

Large Language Model Systems

Deploying Llama-3-70B for a 100K DAU application - vLLM serving, tensor parallelism, KV cache management, speculative decoding, LoRA serving, cost management, and RAG integration.

Latency vs Throughput Trade-offs in ML Systems

Understanding the fundamental tension between latency and throughput in ML serving - Little's Law, tail latency, batching strategies, and caching for production ML systems.

LLM-Powered Product Architecture

End-to-end design of a production LLM-powered product - covering the serving stack, prompt management, RAG architecture, multi-LLM routing, streaming, cost management, and observability.

Low-Latency Inference Patterns

Engineering ML predictions under 10ms p99 - hardware choices, model optimization, batching strategies, pre-computation, memory layout, and real production targets.

Low-Latency Optimization

Engineering for ultra-low latency inference - NUMA awareness, CPU affinity, memory pre-allocation, lock-free data structures, cache line optimization, zero-copy inference, CUDA streams, and kernel profiling.

Metadata Filtering with Vector Search

Master pre-filtering vs post-filtering, the ACORN algorithm for filtered HNSW, namespace sharding for multi-tenancy, payload index design, and performance impact of filters in vector databases.

Microservices for ML Systems

Learn when and how to decompose ML systems into microservices - covering feature services, model services, service mesh, gRPC, and circuit breakers.

ML Cost Models

Learn to build a complete ML cost model - from compute and storage to hidden data transfer costs - so your team never gets blindsided by a $300K quarterly cloud bill.

ML Platform Design

Designing an internal ML platform for a team of 50 data scientists - feature stores, experiment tracking, model registry, serving infrastructure, and platform adoption strategies.

ML Platform Design

Learn how to design internal ML platforms that enable data scientists and engineers to train, deploy, and monitor models efficiently - covering platform components, build vs buy, and real-world case studies.

ML ROI and Business Cases

Build iron-clad ROI cases for ML investments - from quantifying recommendation system value to attributing A/B test results to long-term business outcomes.

MLOps Platform Architecture

Understand the MLOps maturity model from Level 0 to Level 3, design the components of a complete ML platform, and build a realistic 12-month roadmap from ad-hoc to automated.

Model Compilation and Optimization

Compiler-level optimizations for ML inference - TensorRT, torch.compile, ONNX export, kernel fusion, layer fusion, XLA, and profiling bottlenecks.

Model Efficiency Economics

Analyze the accuracy-cost Pareto frontier to determine when model improvements are economically justified - and how to build the business case that the current model is already cost-optimal.

Model Monitoring Platform

Build production model monitoring infrastructure that catches data drift, prediction drift, and concept drift - detecting model degradation within 24 hours instead of two months.

Model Quantization for Production Inference

How quantization reduces model size and inference latency - from FP32 to INT8 to INT4 - covering PTQ, QAT, GPTQ, AWQ, and GGUF with accuracy tradeoffs.

Model Registry and Versioning

Design a model registry that enables 3-minute rollbacks, full model lineage, and controlled staging-to-production promotion - turning model lifecycle management from a manual process into a reliable system.

Module 01: Systems Foundations

Master the foundational principles of AI system design - from requirements gathering to distributed systems theory applied to machine learning.

Module 03: Model Serving

Production patterns for serving ML model predictions - from protocol choice and batching to quantization, compilation, caching, and autoscaling.

Module 04: Real-Time ML Systems

Architecture patterns for real-time machine learning - from sub-10ms inference at scale to online learning, streaming inference pipelines, and ultra-low-latency optimization.

Module 05: ML Architecture Patterns

A deep dive into the architectural patterns that power production ML systems - from Lambda/Kappa to multi-tenant platforms.

Module 06: Case Studies

Real-world end-to-end case studies of production ML systems - recommendation, search, fraud, content moderation, ad click prediction, and LLM-powered products.

Module 10 - AI Platform Engineering

Build the internal platform that lets data scientists ship models to production in days, not months - covering MLOps architecture, experiment tracking, CI/CD for ML, and Kubernetes-native ML infrastructure.

Module 2 - Data Infrastructure

A complete map of the Data Infrastructure module covering data lakes, Spark, Kafka, feature stores, data quality, Delta Lake, and lakehouse architecture for ML.

Module 7 - Vector Database Engineering

Master vector similarity search, ANN algorithms, embedding pipelines, hybrid search, and production vector database deployment.

Module 8 - GPU and TPU Infrastructure

Master GPU architecture, memory management, distributed training, fault-tolerant clusters, TPU workloads, inference hardware, and cost optimization for ML infrastructure.

Module 9 - Cost & FinOps for AI

Master AI infrastructure economics - from cost modeling to FinOps culture - so you can build powerful systems without burning your budget.

Monitoring ML Serving in Production

Production monitoring for ML serving - inference latency histograms, GPU metrics, throughput monitoring, error rates, distributed tracing with OpenTelemetry, and drift detection.

Multi-Model Serving

How to serve hundreds of models efficiently - model multiplexing, ensembles in production, A/B testing infrastructure, shadow mode, canary deployments, and multi-tenant GPU resource isolation.

Multi-Task Learning Systems

How production ML systems share representations across multiple objectives simultaneously - covering hard vs soft parameter sharing, loss balancing, gradient conflicts, and negative transfer detection.

Multi-Tenant ML Platforms

Learn how to design ML platforms that safely serve multiple teams from shared GPU infrastructure - covering Kubernetes isolation, fair scheduling, data isolation, cost attribution, and quota management.

Online Feature Computation for Model Serving

How to compute ML features at request time without blowing your latency budget - caching strategies, vectorized computation, and production patterns.

Online Learning

Continuous learning in production - online learning vs mini-batch, concept drift adaptation, Vowpal Wabbit, streaming gradient descent, bandit algorithms, and preventing catastrophic forgetting.

RAG System Design

How to design Retrieval Augmented Generation systems for production - from naive RAG to advanced pipelines with chunking strategies, hybrid search, reranking, and RAG evaluation.

Real-Time Feature Engineering at Scale

Computing ML features from raw events within milliseconds - Redis patterns, sliding window aggregations, session detection, and Uber's Michelangelo real-time pipeline.

Real-Time Inference Design

Architecture for ML inference at 1M QPS with sub-10ms SLA - synchronous vs async real-time, circuit breakers, fallback models, and timeout budget management.

Recommendation Systems at Scale

End-to-end system design for YouTube-scale video recommendation - candidate generation, multi-stage ranking, post-processing for diversity, cold start, and session modeling.

Reproducibility and Auditability in ML Systems

Learn how to build fully reproducible ML systems - covering the reproducibility stack, DVC, MLflow, Docker, seed management, GDPR compliance, and financial model audits.

Requirements and Constraints for ML Systems

How to gather, prioritize, and translate business requirements into technical specifications for ML systems - including latency budgets, SLOs, and ML-specific constraints.

REST vs gRPC for ML Model Serving

A production engineer's guide to choosing between REST and gRPC for ML APIs - protocol mechanics, performance trade-offs, and when each wins.

Running Vector Databases in Production

Master monitoring, capacity planning, index building strategy, warm-up, disaster recovery, index versioning, gradual rollout, and cost optimization for production vector database operations.

Scaling Vector Databases to Billions of Vectors

Architect horizontal sharding, replication, consistent hashing, hot-cold tiering, distributed HNSW, geographic distribution, and backup strategies for production vector databases at billion-vector scale.

Search and Retrieval Systems

Redesigning an Elasticsearch-only search system with neural search - from BM25 baseline through dense retrieval, learning to rank, query understanding, and search quality evaluation.

Self-Service ML Platform

Build ML platforms that data scientists actually use - applying product thinking to internal tooling, from user research and notebook-to-production workflows to adoption metrics and guardrails.

Serving Architectures: REST vs gRPC vs WebSocket

How to choose the right serving protocol for ML models - REST, gRPC, and WebSocket compared across latency, throughput, streaming, and operational complexity.

Shadow Deployment for Safe Model Releases

How to validate new ML models on real production traffic without affecting users - traffic mirroring, prediction comparison, and graduation criteria.

Specialized Inference Hardware

Compare AWS Inferentia/Trainium, NVIDIA L4/L40S, edge inference hardware (Jetson, Apple Neural Engine), hardware-specific quantization, and cost-performance tradeoffs for production AI inference.

Stream Processing for ML Systems

Continuous feature computation on unbounded data streams using Apache Flink - windowing, watermarks, state management, and production ML feature pipelines.

Stream Processing with Kafka for Real-Time ML

How Apache Kafka and Flink enable real-time ML features - topics, consumer groups, exactly-once semantics, streaming feature computation, and architecture patterns.

Streaming Inference

Running ML inference on data streams - Kafka integration, Flink ML, stateful stream processing, windowed feature aggregations, exactly-once inference, and time semantics.

Synchronous vs Asynchronous Inference

When to use synchronous versus asynchronous inference patterns for ML systems - queue architectures, streaming, timeout handling, and production trade-offs.

Temporal Features for Real-Time ML

Engineering time-based features for real-time ML - recency-weighted features, session features, sliding window aggregations, point-in-time joins, temporal leakage prevention, and clock skew in distributed systems.

The ML System Design Framework

A structured 4-step framework for approaching ML system design interviews and real production projects - from requirements to deep dive.

TPU Architecture and Use

Deep dive into Google TPU v4/v5 architecture, systolic arrays, XLA compilation, TPU pods, JAX programming model, cost comparison with GPUs, and when TPUs outperform GPU clusters.

Training Cost Optimization

Reduce model training costs by 60–80% through spot instances, gradient checkpointing, mixed precision, and compute-optimal training - without sacrificing accuracy.

Training Infrastructure at Scale

Build fault-tolerant GPU training clusters with InfiniBand, NCCL collective operations, Slurm and Kubernetes job scheduling, elastic training, and automatic checkpointing for multi-day training runs.

Triton Inference Server and TorchServe

Production-grade ML serving frameworks - NVIDIA Triton's dynamic batching and multi-backend support, TorchServe's PyTorch-native serving, and when to use each.

Two-Tower Models

How dual encoder architectures power billion-scale recommendation and search by separating user and item representations and querying them with approximate nearest neighbor search.

Vector Databases Compared - Pinecone, Weaviate, Qdrant, Chroma, pgvector

Systematic comparison of the major vector databases - architecture, managed vs self-hosted, hybrid search, filtering, update performance, consistency, and cost.

Vector Similarity Search Fundamentals

Master cosine similarity, dot product, L2 distance, exact vs approximate search, the curse of dimensionality, and how to evaluate vector search quality with recall@K.
