Design: Anomaly Detection - Unsupervised Detection at Scale

Reading time: ~22 min | Interview relevance: High | Roles: MLE, MLOps

The Real Interview Moment

"Design an anomaly detection system for a cloud infrastructure platform - detect when servers, services, or networks are behaving abnormally." You describe training an autoencoder on normal data. The interviewer asks: "You have 10,000 time series metrics. Each has different scales, seasonality, and noise levels. Your autoencoder triggers 500 alerts per day but only 10 are real issues. How do you reduce the noise?"

Anomaly detection is deceptively hard. The ML model is often the easy part - the hard part is making the system operationally useful by reducing false alarms to a level where humans trust and act on the alerts.

What You Will Master

Statistical vs. ML-based anomaly detection methods
Time series anomaly detection with seasonality handling
Unsupervised approaches: isolation forest, autoencoders, clustering
Streaming vs. batch detection trade-offs
Alert fatigue: reducing false positives while maintaining recall
Root cause analysis and alert correlation

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

Monitor 10,000+ time series metrics (CPU, memory, latency, error rates, throughput)
Detect point anomalies (single spike), contextual anomalies (unusual for time of day), and collective anomalies (series of slightly unusual points)
Alert on-call engineers with context (what's anomalous, potential cause)

Non-functional requirements:

Detection latency: <5 minutes for critical metrics, <1 hour for others
False alarm share: <5% of alerts are false positives, i.e. alert precision >95% (trust requirement)
Recall: >90% for incidents that cause user impact
No usable labeled training data available (unsupervised requirement)

Step 2: Problem Formulation (5 min)

ML problem type: Unsupervised anomaly detection on multivariate time series.

Why unsupervised? Labeled anomalies are rare (maybe 50-100 incidents per year) and diverse (each incident is different). You can't train a supervised classifier with so few positives.

Anomaly Type	Example	Detection Approach
Point anomaly	CPU spike to 100%	Statistical threshold, z-score
Contextual anomaly	High latency at 3 AM (normally low traffic)	Seasonal decomposition
Collective anomaly	Slowly increasing error rate over 2 hours	Trend detection, change point
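The change-point entry in the last row can be illustrated with a one-sided CUSUM, which accumulates small positive deviations until their sum crosses a limit. This is a minimal sketch rather than a prescribed method: the function name and the `target`, `slack`, and `threshold` values are illustrative assumptions that would need tuning per metric.

```python
def cusum_upward(values, target, slack, threshold):
    """Return the first index where the one-sided CUSUM exceeds threshold, else None.

    target:    expected level of the metric (assumed known, e.g. a recent median)
    slack:     deviation tolerated per point before it counts as drift
    threshold: accumulated drift that triggers detection
    """
    cumulative = 0.0
    for i, x in enumerate(values):
        # Accumulate only the excess above target + slack; never drop below zero.
        cumulative = max(0.0, cumulative + (x - target - slack))
        if cumulative > threshold:
            return i
    return None


# Hypothetical error rate (%): flat at 1.0, then creeping up by 0.1 per point.
error_rate = [1.0] * 10 + [1.0 + 0.1 * k for k in range(1, 21)]
drift_index = cusum_upward(error_rate, target=1.0, slack=0.2, threshold=2.0)
print(drift_index, error_rate[drift_index])
```

In this toy series the detector fires while the error rate is still below 2%, a level that a fixed per-point threshold might not treat as alarming on its own.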

Step 3: Methods (8 min)

Method Comparison

Method	How It Works	Best For	Limitation
Statistical (3-sigma)	Flag points > 3σ from mean	Simple metrics, Gaussian data	Doesn't handle seasonality
Seasonal decomposition	STL decomposition → anomaly on residual	Metrics with daily/weekly patterns	Requires sufficient history
Isolation Forest	Random partitioning - anomalies isolated quickly	Multivariate, tabular features	Doesn't handle time dependencies
LSTM Autoencoder	Reconstruct time series - high error = anomaly	Complex temporal patterns	Training complexity, slow
Prophet-based	Forecast expected value → anomaly if actual deviates	Metrics with trend + seasonality	One model per metric = 10K models
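As a reference point for the first row of the table, a rolling 3-sigma check might look like the sketch below. The function name, the 30-point window, and the sample data are assumptions for illustration; the mean and standard deviation come from the trailing window only, so the detector adapts to slow level changes but still knows nothing about seasonality.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_zscore_flags(values, window=30, z_threshold=3.0):
    """Flag points more than z_threshold standard deviations from the trailing mean."""
    history = deque(maxlen=window)
    flags = []
    for x in values:
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # A zero-variance window cannot produce a meaningful z-score.
            flags.append(sigma > 0 and abs(x - mu) / sigma > z_threshold)
        else:
            # Not enough history yet: stay silent rather than guess.
            flags.append(False)
        history.append(x)
    return flags


# Hypothetical CPU readings oscillating between 50 and 54, with one spike to 100.
cpu = [50 + (i % 5) for i in range(40)] + [100] + [50 + (i % 5) for i in range(5)]
flags = rolling_zscore_flags(cpu)
print([i for i, flagged in enumerate(flags) if flagged])
```

Note that the spike enters the window after it is scored, which inflates the standard deviation for the next 30 points. A production version would probably exclude flagged points from the history or use a robust scale estimate.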

Recommended Approach: Layered Detection

Figure: Layered anomaly detection. Statistical, seasonal, and ML layers feed an alert pool, followed by correlation, prioritization, and alerting.

Step 4: Reducing False Alarms (8 min)

This is the most important section - the difference between a useful system and an ignored one.

The Alert Fatigue Problem

If you have 10K metrics, each with a 0.1% false alarm rate per hour:

10K × 0.001 × 24 = 240 false alerts per day
On-call engineers stop trusting the system after day 2
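The same arithmetic can be run in reverse to see how strict each detector has to be for a given alert budget. The helper names and the budget of 10 false alerts per day are assumptions for illustration, and the calculation treats metrics as independent, which real infrastructure metrics are not.

```python
def expected_false_alerts_per_day(num_metrics, false_alarm_rate_per_hour, hours=24):
    """Expected false alerts per day if every metric is checked once per hour."""
    return num_metrics * false_alarm_rate_per_hour * hours


def max_rate_for_budget(num_metrics, daily_budget, hours=24):
    """Per-metric hourly false alarm rate that keeps the fleet within daily_budget."""
    return daily_budget / (num_metrics * hours)


print(expected_false_alerts_per_day(10_000, 0.001))   # the 240/day figure above
print(max_rate_for_budget(10_000, daily_budget=10))   # rate needed for 10/day
```

Hitting 10 false alerts per day with per-metric thresholds alone would require roughly a 0.004% hourly false alarm rate, which is one reason the strategies below act on the pooled alerts rather than only tightening thresholds.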

Solutions

Strategy	How It Works	Impact
Alert correlation	Group alerts from the same root cause	50 alerts → 1 incident
Anomaly scoring	Severity score based on deviation magnitude + duration	Prioritize serious anomalies
Minimum duration	Require anomaly to persist for N minutes	Filters transient spikes
Feedback loop	Engineers mark alerts as true/false positive → adjust thresholds	Continuous improvement
Dependency-aware	If DB is down, don't alert on all dependent services	Reduces cascading alerts
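Two of these strategies, minimum duration and dependency-aware suppression, are simple enough to sketch directly. The function names, the service names, and the shape of the dependency map are hypothetical; the suppression below only checks direct dependencies, so a real system would likely walk the full dependency graph.

```python
def sustained_anomalies(flags, min_consecutive):
    """Return (start, end) index ranges where flags stay True for >= min_consecutive points."""
    runs = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_consecutive:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the series.
    if start is not None and len(flags) - start >= min_consecutive:
        runs.append((start, len(flags) - 1))
    return runs


def suppress_downstream(alerting, depends_on):
    """Keep only alerting services whose direct dependencies are not alerting too."""
    alerting = set(alerting)
    return [
        service
        for service in sorted(alerting)
        if not any(dep in alerting for dep in depends_on.get(service, []))
    ]


raw_flags = [False, True, False, True, True, True, False, True, True]
print(sustained_anomalies(raw_flags, min_consecutive=3))

# Hypothetical topology: checkout calls api, api calls db.
topology = {"api": ["db"], "checkout": ["api"]}
print(suppress_downstream({"db", "api", "checkout"}, topology))
```

With a three-point minimum only the sustained run survives, and with the database alerting only the database is paged, which matches the "50 alerts → 1 incident" intent of the table.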

Interviewer's Perspective

The candidate who only talks about the ML model gets a Lean Hire. The candidate who designs the full alerting pipeline - including correlation, deduplication, severity scoring, and feedback loops - gets a Strong Hire. The ML model is 20% of the value; the operational pipeline is 80%.

Step 5: Serving Architecture (5 min)

Figure: Anomaly detection streaming architecture. Kafka → Flink → Detection Models → Alert Aggregator → PagerDuty/Slack.

Component	Technology	Why
Metric ingestion	Kafka / Kinesis	Handle 10K metrics × 1 point/sec
Stream processing	Flink / Spark Streaming	Real-time feature computation (rolling stats)
Detection	Stateless scoring workers applied to each data point (rolling state lives in the state store)	Low latency, horizontally scalable
State storage	Redis (rolling windows) + TimescaleDB (history)	Fast lookups + long-term storage
Alerting	PagerDuty / Opsgenie integration	Routing, escalation, on-call schedules
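The "rolling stats" computed in the stream processor need to be updatable one point at a time. One common way to do that is Welford's online algorithm, sketched below with an in-memory dictionary standing in for the state store. The class name and metric key are assumptions, and this version is cumulative rather than windowed, so an exponentially weighted or windowed variant would be needed to forget old behavior.

```python
import math
from collections import defaultdict


class RunningStats:
    """Welford's online mean and variance: O(1) memory and O(1) work per point."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation of everything seen so far.
        return math.sqrt(self._m2 / self.n) if self.n > 0 else 0.0

    def zscore(self, x):
        return 0.0 if self.std == 0 else (x - self.mean) / self.std


# One state object per metric; in the architecture above this would live in Redis.
per_metric = defaultdict(RunningStats)
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    per_metric["host-1.cpu"].update(value)

stats = per_metric["host-1.cpu"]
print(stats.mean, stats.std, stats.zscore(9))
```

Because each update touches only the state for a single metric, the work partitions cleanly by metric key, which is what makes the detection tier horizontally scalable.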

Step 6: Evaluation (5 min)

The challenge: No labels for offline evaluation.

Approach	How
Inject synthetic anomalies	Add known anomalies to historical data, measure detection rate
Retrospective labeling	After an incident, label the time window and check if the system caught it
Alert precision tracking	Track the % of alerts that corresponded to actual incidents (feedback from engineers)
MTTR correlation	Does the system reduce mean time to resolution?
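Synthetic injection, the first row above, can be sketched as follows. Everything here is illustrative: the deterministic "history" stands in for real metric data, `threshold_detector` is a deliberately simple placeholder for whatever detector is under test, and only additive spikes are injected.

```python
import random
from statistics import mean, pstdev


def inject_spikes(series, num_spikes, magnitude, rng):
    """Return a copy of series with additive spikes, plus the set of spiked indices."""
    corrupted = list(series)
    positions = rng.sample(range(len(series)), num_spikes)
    for p in positions:
        corrupted[p] += magnitude
    return corrupted, set(positions)


def threshold_detector(clean_history, series, k=4.0):
    """Placeholder detector: flag points above mean + k * std of the clean history."""
    limit = mean(clean_history) + k * pstdev(clean_history)
    return {i for i, x in enumerate(series) if x > limit}


def precision_recall(flagged, truth):
    true_positives = len(flagged & truth)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


rng = random.Random(0)
# Stand-in for historical data: values wobbling between 99 and 101.
history = [100 + ((i * 7) % 11 - 5) * 0.2 for i in range(200)]
corrupted, truth = inject_spikes(history, num_spikes=5, magnitude=20.0, rng=rng)
flagged = threshold_detector(history, corrupted)
precision, recall = precision_recall(flagged, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Sweeping `magnitude` downward shows where recall starts to fall off, which gives a sensitivity curve for the detector. Injected anomalies are usually cleaner than real incidents, so this is better read as an upper bound on production recall than as an estimate of it.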

Practice Problems

Problem 1: Seasonal Anomaly

Direction

Your e-commerce platform has 5x traffic on Black Friday. Your anomaly detector fires hundreds of alerts because it sees "abnormal" traffic levels. How do you handle expected seasonality?

Key Insight

Use seasonal decomposition (STL) to separate trend, seasonality, and residual. Detect anomalies on the residual only. For known events (Black Friday, product launches), use event-aware models that adjust expectations. Alternatively, train separate models for "event" and "normal" periods. The key: the system should know what's expected and only alert on the unexpected.
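The "detect on the residual, adjust for known events" idea can be shown with a much simpler stand-in for STL: a per-hour median baseline with a MAD-based tolerance. The function names, the `event_multiplier` parameter, and the assumption that noise scales with the expected level are all illustrative choices, not part of STL itself.

```python
from collections import defaultdict
from statistics import median


def fit_hourly_baseline(history):
    """history: iterable of (hour_of_day, value). Returns {hour: (median, mad)}."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    baseline = {}
    for hour, values in by_hour.items():
        med = median(values)
        mad = median([abs(v - med) for v in values])
        baseline[hour] = (med, mad)
    return baseline


def is_anomalous(hour, value, baseline, event_multiplier=1.0, k=5.0):
    """Flag a value whose residual from the expected level exceeds k * MAD.

    event_multiplier scales the expectation for known events (e.g. 5.0 on
    Black Friday); scaling the tolerance by the same factor is an assumption.
    """
    med, mad = baseline[hour]
    expected = med * event_multiplier
    tolerance = k * max(mad * event_multiplier, 1e-9)
    return abs(value - expected) > tolerance


# Hypothetical two weeks of traffic: quiet at 3 AM, busy at 3 PM.
history = []
for day in range(14):
    history.append((3, 100 + day % 3))
    history.append((15, 1000 + 10 * (day % 3)))
baseline = fit_hourly_baseline(history)

print(is_anomalous(15, 1000, baseline))                        # ordinary afternoon
print(is_anomalous(3, 1000, baseline))                         # afternoon traffic at 3 AM
print(is_anomalous(15, 5000, baseline))                        # 5x traffic, no event context
print(is_anomalous(15, 5000, baseline, event_multiplier=5.0))  # 5x traffic on a known event
```

The same 5x reading is flagged or not depending on whether the system was told to expect it, which is the behavior the Black Friday scenario calls for.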

Problem 2: Multivariate Correlation

Direction

CPU, memory, and latency all increase together. Individually, each is within normal range. But the combination is unusual - normally when CPU is high, latency isn't. How do you detect this?

Key Insight

Multivariate anomaly detection: use a model that considers metric correlations. Options: (1) Multivariate autoencoder - learns normal correlations, flags when reconstruction error is high. (2) Isolation Forest on feature vectors (CPU, memory, latency combined). (3) Correlation matrix monitoring - alert when the correlation structure between metrics changes. The key insight: some anomalies are only visible in the joint distribution, not in individual metrics.
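A small two-metric example shows why the joint distribution matters. The sketch below uses Mahalanobis distance, which is none of the three options listed but illustrates the same principle with a closed-form 2×2 covariance inverse. The data is synthetic and the function names are assumptions; CPU and latency are made negatively correlated to mirror the scenario.

```python
import math
from statistics import mean


def fit_bivariate(xs, ys):
    """Return means and population (co)variances of two paired metrics."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return mx, my, sxx, syy, sxy


def mahalanobis(x, y, model):
    """Distance of (x, y) from the fitted mean, measured in units of the joint spread."""
    mx, my, sxx, syy, sxy = model
    dx, dy = x - mx, y - my
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)


# Synthetic "normal" behavior: latency falls as CPU rises, plus small jitter.
cpu = [30 + i for i in range(41)]
latency = [170 - c + ((i * 7) % 5 - 2) for i, c in enumerate(cpu)]
model = fit_bivariate(cpu, latency)

# Both points use in-range values (CPU 65; latency 105 or 135).
print(round(mahalanobis(65, 105, model), 2))  # follows the usual relationship
print(round(mahalanobis(65, 135, model), 2))  # high CPU and high latency together
```

Each coordinate of the second point is within about 1.3 standard deviations of its own mean, so per-metric 3-sigma checks would stay silent, yet its joint distance is far larger than that of the first point.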

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Design anomaly detection"	Layered detection + alert pipeline	"Statistical for simple, seasonal models for patterns, ML for complex - then correlate and prioritize"
"How do you handle false alarms?"	Alert correlation + feedback	"Correlate alerts to root causes, require minimum duration, engineer feedback loop"
"No labels available"	Unsupervised + synthetic evaluation	"Train unsupervised models, evaluate with synthetic injection and retrospective labeling"

Spaced Repetition Checkpoints

Day 0: List 3 anomaly detection methods and when you'd use each.
Day 3: Design the alert pipeline: how do you reduce 500 raw alerts to 10 actionable ones?
Day 7: Design anomaly detection for a fintech payment system in 45 minutes.
Day 14: Explain how to evaluate an anomaly detection system without labeled data.
Day 21: Mock interview with follow-ups on seasonality, multivariate detection, and alert fatigue.

What's Next

Machine Translation - Sequence-to-sequence generation at scale
A/B Testing Platform - Rigorous experiment design for ML systems
