Design: Anomaly Detection - Unsupervised Detection at Scale
Reading time: ~22 min | Interview relevance: High | Roles: MLE, MLOps
The Real Interview Moment
"Design an anomaly detection system for a cloud infrastructure platform - detect when servers, services, or networks are behaving abnormally." You describe training an autoencoder on normal data. The interviewer asks: "You have 10,000 time series metrics. Each has different scales, seasonality, and noise levels. Your autoencoder triggers 500 alerts per day but only 10 are real issues. How do you reduce the noise?"
Anomaly detection is deceptively hard. The ML model is often the easy part - the hard part is making the system operationally useful by reducing false alarms to a level where humans trust and act on the alerts.
What You Will Master
- Statistical vs. ML-based anomaly detection methods
- Time series anomaly detection with seasonality handling
- Unsupervised approaches: isolation forest, autoencoders, clustering
- Streaming vs. batch detection trade-offs
- Alert fatigue: reducing false positives while maintaining recall
- Root cause analysis and alert correlation
The Complete Design
Step 1: Requirements (5 min)
Functional requirements:
- Monitor 10,000+ time series metrics (CPU, memory, latency, error rates, throughput)
- Detect point anomalies (single spike), contextual anomalies (unusual for time of day), and collective anomalies (series of slightly unusual points)
- Alert on-call engineers with context (what's anomalous, potential cause)
Non-functional requirements:
- Detection latency: <5 minutes for critical metrics, <1 hour for others
- False alert rate: fewer than 5% of fired alerts are false positives, i.e. alert precision above 95% (trust requirement)
- Recall: >90% for incidents that cause user impact
- No labeled data available (unsupervised requirement)
Step 2: Problem Formulation (5 min)
ML problem type: Unsupervised anomaly detection on multivariate time series.
Why unsupervised? Labeled anomalies are rare (maybe 50-100 incidents per year) and diverse (each incident is different). You can't train a supervised classifier with so few positives.
| Anomaly Type | Example | Detection Approach |
|---|---|---|
| Point anomaly | CPU spike to 100% | Statistical threshold, z-score |
| Contextual anomaly | High latency at 3 AM (normally low traffic) | Seasonal decomposition |
| Collective anomaly | Slowly increasing error rate over 2 hours | Trend detection, change point |
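Collective anomalies are the hardest row to picture, because no single point looks wrong on its own. Below is a minimal sketch of one change-point technique, a one-sided CUSUM. It assumes you already have a baseline mean and standard deviation for the metric; the function name and the `slack` and `threshold` values are illustrative, not recommendations.

```python
def cusum_upward(values, baseline_mean, baseline_std, slack=0.5, threshold=4.0):
    """Return the index where an upward drift is first detected, or None.

    One-sided CUSUM: accumulate how far each standardized point sits above
    `slack` standard deviations, and clamp the running sum at zero.
    """
    s = 0.0
    for i, x in enumerate(values):
        z = (x - baseline_mean) / baseline_std
        s = max(0.0, s + z - slack)
        if s > threshold:
            return i
    return None


# Toy error rate (%): stable around 1.0, then creeping up by 0.1 per step.
normal = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 0.9, 1.1, 1.0]
drift = [1.0 + 0.1 * k for k in range(1, 16)]
series = normal + drift

first_alarm = cusum_upward(series, baseline_mean=1.0, baseline_std=0.2)

# Largest single-point z-score seen up to and including the alarm.
max_z_before_alarm = max((x - 1.0) / 0.2 for x in series[: first_alarm + 1])
```

In this toy series the alarm fires while every individual point is still under 3 standard deviations from the baseline, which is exactly the case a per-point threshold misses.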
Step 3: Methods (8 min)
Method Comparison
| Method | How It Works | Best For | Limitation |
|---|---|---|---|
| Statistical (3-sigma) | Flag points > 3σ from mean | Simple metrics, Gaussian data | Doesn't handle seasonality |
| Seasonal decomposition | STL decomposition → anomaly on residual | Metrics with daily/weekly patterns | Requires sufficient history |
| Isolation Forest | Random partitioning - anomalies isolated quickly | Multivariate, tabular features | Doesn't handle time dependencies |
| LSTM Autoencoder | Reconstruct time series - high error = anomaly | Complex temporal patterns | Training complexity, slow |
| Prophet-based | Forecast expected value → anomaly if actual deviates | Metrics with trend + seasonality | One model per metric = 10K models |
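To make the seasonal-decomposition row concrete, here is a minimal sketch. It uses a per-hour-of-day median as a crude stand-in for the seasonal component (real STL also models trend and is far more careful), then scores residuals with a robust z-score based on the median absolute deviation. The function name, the `(hour_of_day, value)` input shape, and the 3.5 threshold are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def seasonal_residual_anomalies(points, threshold=3.5):
    """points: list of (hour_of_day, value). Returns the indices flagged as anomalous."""
    # Seasonal baseline: the median value observed for each hour of the day.
    by_hour = defaultdict(list)
    for hour, value in points:
        by_hour[hour].append(value)
    baseline = {h: median(vs) for h, vs in by_hour.items()}

    # Residual = actual minus what is expected at that hour.
    residuals = [v - baseline[h] for h, v in points]

    # Robust z-score: median and MAD are not dragged around by the anomalies.
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    scale = (1.4826 * mad) or 1e-9
    return [i for i, r in enumerate(residuals) if abs(r - center) / scale > threshold]


def daily_latency(hour):
    """Toy pattern: 200 ms during business hours, 50 ms otherwise."""
    return 200.0 if 9 <= hour < 18 else 50.0


points = []
for day in range(7):
    for hour in range(24):
        jitter = ((day * 24 + hour) % 5) - 2  # deterministic noise in [-2, 2]
        points.append((hour, daily_latency(hour) + jitter))

spike_index = 6 * 24 + 3
points[spike_index] = (3, 200.0)  # daytime-level latency at 3 AM on the last day

flagged = seasonal_residual_anomalies(points)
```

The injected value of 200 ms is an ordinary daytime reading, so a single global threshold on raw values would not separate it from normal traffic. It only stands out once the expected 3 AM level has been subtracted.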
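Recommended Approach: Layered Detection
Use statistical checks for simple metrics, seasonal models for metrics with daily or weekly patterns, and ML models only where the temporal or multivariate structure is complex. Then correlate and prioritize the resulting alerts.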
Step 4: Reducing False Alarms (8 min)
This is the most important section - the difference between a useful system and an ignored one.
The Alert Fatigue Problem
If you have 10K metrics, each with a 0.1% false alarm rate per hour:
- 10K × 0.001 × 24 = 240 false alerts per day
- On-call engineers stop trusting the system after day 2
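The same arithmetic as a quick sketch, which also inverts the question: given a daily false-alert budget, what per-metric hourly rate could you afford? The budget of 5 alerts per day is a hypothetical number chosen for illustration.

```python
def false_alerts_per_day(num_metrics, false_alarm_rate_per_hour, hours=24):
    """Expected false alerts per day if every metric is checked once per hour."""
    return num_metrics * false_alarm_rate_per_hour * hours


def max_rate_for_budget(num_metrics, daily_budget, hours=24):
    """Per-metric hourly false alarm rate that keeps the total under daily_budget."""
    return daily_budget / (num_metrics * hours)


baseline = false_alerts_per_day(10_000, 0.001)                # the scenario above
required_rate = max_rate_for_budget(10_000, daily_budget=5)   # hypothetical budget
```

With these inputs the per-metric rate would have to fall to roughly 0.002% per hour, which suggests that tightening individual thresholds alone is unlikely to be enough.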
Solutions
| Strategy | How It Works | Impact |
|---|---|---|
| Alert correlation | Group alerts from the same root cause | 50 alerts → 1 incident |
| Anomaly scoring | Severity score based on deviation magnitude + duration | Prioritize serious anomalies |
| Minimum duration | Require anomaly to persist for N minutes | Filters transient spikes |
| Feedback loop | Engineers mark alerts as true/false positive → adjust thresholds | Continuous improvement |
| Dependency-aware | If DB is down, don't alert on all dependent services | Reduces cascading alerts |
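The correlation and dependency-aware rows can be sketched together. The example below assumes you have a service dependency map available (the `depends_on` dictionary and all service names are hypothetical) and collapses each alerting service into the furthest-upstream alerting service that could explain it.

```python
def collapse_to_root_causes(alerting, depends_on):
    """Group alerting services under an upstream alerting service.

    alerting:   set of service names currently alerting.
    depends_on: dict mapping service -> list of services it depends on.
    Returns a dict: root-cause service -> list of suppressed downstream services.
    Simplification: only the first alerting upstream dependency is followed.
    """
    def find_root(service, seen):
        for upstream in depends_on.get(service, []):
            if upstream in alerting and upstream not in seen:
                return find_root(upstream, seen | {upstream})
        return service

    incidents = {}
    for service in sorted(alerting):
        root = find_root(service, {service})
        incidents.setdefault(root, [])
        if root != service:
            incidents[root].append(service)
    return incidents


depends_on = {
    "checkout": ["payments", "cart"],
    "payments": ["db"],
    "cart": ["db"],
    "search": ["index"],
}
alerting = {"checkout", "payments", "cart", "db", "search"}

incidents = collapse_to_root_causes(alerting, depends_on)
```

Here five raw alerts collapse into two incidents: one rooted at `db` with three suppressed downstream services, and one for `search`, whose upstream dependency is healthy. A production version would also need to handle services with several failing dependencies at once.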
The candidate who only talks about the ML model gets a Lean Hire. The candidate who designs the full alerting pipeline - including correlation, deduplication, severity scoring, and feedback loops - gets a Strong Hire. As a rule of thumb, the ML model is about 20% of the value; the operational pipeline is the other 80%.
Step 5: Serving Architecture (5 min)
| Component | Technology | Why |
|---|---|---|
| Metric ingestion | Kafka / Kinesis | Handle 10K metrics × 1 point/sec |
| Stream processing | Flink / Spark Streaming | Real-time feature computation (rolling stats) |
| Detection | Stateless scoring workers applied to each data point (window state lives in the state store) | Low latency, horizontally scalable |
| State storage | Redis (rolling windows) + TimescaleDB (history) | Fast lookups + long-term storage |
| Alerting | PagerDuty / Opsgenie integration | Routing, escalation, on-call schedules |
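A minimal sketch of what one per-metric detection step could look like in the stream processor: a rolling z-score combined with the minimum-duration rule from Step 4. The class name and all parameter values are illustrative. The in-memory `deque` stands in for the rolling window that the table places in Redis.

```python
from collections import deque
from statistics import mean, pstdev


class RollingDetector:
    """Per-metric streaming detector: rolling z-score plus a persistence rule."""

    def __init__(self, window=30, z_threshold=3.0, min_consecutive=3):
        self.history = deque(maxlen=window)
        self.z_threshold = z_threshold
        self.min_consecutive = min_consecutive
        self.breaches = 0

    def update(self, value):
        """Consume one point; return True only at the moment an alert should fire."""
        is_breach = False
        if len(self.history) >= 5:  # wait for a few points before scoring
            mu = mean(self.history)
            sigma = pstdev(self.history) or 1e-9
            is_breach = abs(value - mu) / sigma > self.z_threshold
        if is_breach:
            self.breaches += 1
        else:
            self.breaches = 0
            self.history.append(value)  # only learn from points judged normal
        # Fire once, when the breach has persisted for exactly min_consecutive points.
        return self.breaches == self.min_consecutive


stream = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50,  # warm-up
          95,                                      # transient spike
          50, 51,
          95, 96, 97, 95]                          # sustained anomaly

detector = RollingDetector()
fired_at = [i for i, v in enumerate(stream) if detector.update(v)]
```

The single spike at index 10 is swallowed by the persistence rule, and the sustained anomaly fires exactly one alert rather than one per point. One trade-off of this design: because breaching points never enter the window, a legitimate level shift would keep breaching until something resets the baseline.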
Step 6: Evaluation (5 min)
The challenge: No labels for offline evaluation.
| Approach | How |
|---|---|
| Inject synthetic anomalies | Add known anomalies to historical data, measure detection rate |
| Retrospective labeling | After an incident, label the time window and check if the system caught it |
| Alert precision tracking | Track % of alerts that led to actual incidents (feedback from engineers) |
| MTTR correlation | Does the system reduce mean time to resolution? |
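The synthetic-injection row can be sketched as a small harness. It assumes a detector is any function that takes a series and returns a set of flagged indices. The toy z-score detector, the function names, and the injection magnitude are illustrative choices.

```python
import random
from statistics import mean, pstdev


def zscore_detector(series, threshold=3.0):
    """Toy detector under test: flag indices more than `threshold` sigmas from the mean."""
    mu, sigma = mean(series), pstdev(series)
    return {i for i, x in enumerate(series) if abs(x - mu) > threshold * sigma}


def evaluate_with_injection(clean_series, detector, n_inject=10, magnitude=8.0, seed=0):
    """Inject spikes of `magnitude` sigmas at random positions and score the detector."""
    rng = random.Random(seed)
    sigma = pstdev(clean_series)
    injected = set(rng.sample(range(len(clean_series)), n_inject))
    series = [x + magnitude * sigma if i in injected else x
              for i, x in enumerate(clean_series)]
    flagged = detector(series)
    recall = len(flagged & injected) / n_inject
    false_alerts = len(flagged - injected)
    return recall, false_alerts


# Deterministic "clean" history: values between 45 and 55.
clean = [50 + ((i * 7) % 11 - 5) for i in range(500)]

recall, false_alerts = evaluate_with_injection(clean, zscore_detector)
```

Large spikes like these are the easy case. Sweeping `magnitude` downward, or injecting drifts and level shifts instead of spikes, gives a more honest picture of where the detector starts to miss.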
Practice Problems
Problem 1: Seasonal Anomaly
Direction
Your e-commerce platform has 5x traffic on Black Friday. Your anomaly detector fires hundreds of alerts because it sees "abnormal" traffic levels. How do you handle expected seasonality?
Key Insight
Use seasonal decomposition (STL) to separate trend, seasonality, and residual. Detect anomalies on the residual only. For known events (Black Friday, product launches), use event-aware models that adjust expectations. Alternatively, train separate models for "event" and "normal" periods. The key: the system should know what's expected and only alert on the unexpected.
Problem 2: Multivariate Correlation
Direction
CPU, memory, and latency all increase together. Individually, each is within normal range. But the combination is unusual - normally when CPU is high, latency isn't. How do you detect this?
Key Insight
Multivariate anomaly detection: use a model that considers metric correlations. Options: (1) Multivariate autoencoder - learns normal correlations, flags when reconstruction error is high. (2) Isolation Forest on feature vectors (CPU, memory, latency combined). (3) Correlation matrix monitoring - alert when the correlation structure between metrics changes. The key insight: some anomalies are only visible in the joint distribution, not in individual metrics.
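A small sketch of the joint-distribution idea, using the Mahalanobis distance as a simple stand-in for the options listed above (it is not one of them, but it shows the same effect in a few lines). To keep the example short it uses only two metrics, CPU and latency, and toy data in which latency normally falls as CPU rises, as in the problem statement. All names and numbers are illustrative.

```python
from statistics import mean


def fit_2d(points):
    """Estimate the mean and covariance terms from (x, y) pairs of normal data."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return mx, my, sxx, syy, sxy


def mahalanobis_sq(point, model):
    """Squared Mahalanobis distance of a 2D point from the fitted distribution."""
    mx, my, sxx, syy, sxy = model
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy ** 2
    # Closed-form inverse of the 2x2 covariance matrix applied to (dx, dy).
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


# Toy normal behavior: CPU 20..80 %, latency (ms) falls as CPU rises, small jitter.
normal = [(20 + i, 300 - 2 * (20 + i) + ((i * 3) % 7 - 3)) for i in range(61)]
model = fit_2d(normal)

typical = (75, 150)  # high CPU, low latency: the usual relationship
odd = (75, 250)      # high CPU and high latency: each value seen before, never together

typical_score = mahalanobis_sq(typical, model)
odd_score = mahalanobis_sq(odd, model)
```

Both coordinates of `odd` sit inside the ranges observed in the normal data, so per-metric thresholds would pass it, yet its joint score is orders of magnitude larger than that of `typical`. A multivariate autoencoder or Isolation Forest generalizes the same idea to many metrics and nonlinear relationships.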
Interview Cheat Sheet
| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Design anomaly detection" | Layered detection + alert pipeline | "Statistical for simple, seasonal models for patterns, ML for complex - then correlate and prioritize" |
| "How do you handle false alarms?" | Alert correlation + feedback | "Correlate alerts to root causes, require minimum duration, engineer feedback loop" |
| "No labels available" | Unsupervised + synthetic evaluation | "Train unsupervised models, evaluate with synthetic injection and retrospective labeling" |
Spaced Repetition Checkpoints
- Day 0: List 3 anomaly detection methods and when you'd use each.
- Day 3: Design the alert pipeline: how do you reduce 500 raw alerts to 10 actionable ones?
- Day 7: Design anomaly detection for a fintech payment system in 45 minutes.
- Day 14: Explain how to evaluate an anomaly detection system without labeled data.
- Day 21: Mock interview with follow-ups on seasonality, multivariate detection, and alert fatigue.
What's Next
- Machine Translation - Sequence-to-sequence generation at scale
- A/B Testing Platform - Rigorous experiment design for ML systems
