Design: Anomaly Detection - Unsupervised Detection at Scale

Reading time: ~22 min | Interview relevance: High | Roles: MLE, MLOps

The Real Interview Moment

"Design an anomaly detection system for a cloud infrastructure platform - detect when servers, services, or networks are behaving abnormally." You describe training an autoencoder on normal data. The interviewer asks: "You have 10,000 time series metrics. Each has different scales, seasonality, and noise levels. Your autoencoder triggers 500 alerts per day but only 10 are real issues. How do you reduce the noise?"

Anomaly detection is deceptively hard. The ML model is often the easy part - the hard part is making the system operationally useful by reducing false alarms to a level where humans trust and act on the alerts.

What You Will Master

  • Statistical vs. ML-based anomaly detection methods
  • Time series anomaly detection with seasonality handling
  • Unsupervised approaches: isolation forest, autoencoders, clustering
  • Streaming vs. batch detection trade-offs
  • Alert fatigue: reducing false positives while maintaining recall
  • Root cause analysis and alert correlation

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

  • Monitor 10,000+ time series metrics (CPU, memory, latency, error rates, throughput)
  • Detect point anomalies (single spike), contextual anomalies (unusual for time of day), and collective anomalies (series of slightly unusual points)
  • Alert on-call engineers with context (what's anomalous, potential cause)

Non-functional requirements:

  • Detection latency: <5 minutes for critical metrics, <1 hour for others
  • False alarm share: <5% of fired alerts are false positives, i.e. alert precision >95% (trust requirement)
  • Recall: >90% for incidents that cause user impact
  • No labeled data available (unsupervised requirement)

Step 2: Problem Formulation (5 min)

ML problem type: Unsupervised anomaly detection on multivariate time series.

Why unsupervised? Labeled anomalies are rare (maybe 50-100 incidents per year) and diverse (each incident is different). That is too few, and too heterogeneous, a set of positives to train a reliable supervised classifier.

| Anomaly Type | Example | Detection Approach |
| --- | --- | --- |
| Point anomaly | CPU spike to 100% | Statistical threshold, z-score |
| Contextual anomaly | High latency at 3 AM (normally low traffic) | Seasonal decomposition |
| Collective anomaly | Slowly increasing error rate over 2 hours | Trend detection, change point |

Step 3: Methods (8 min)

Method Comparison

| Method | How It Works | Best For | Limitation |
| --- | --- | --- | --- |
| Statistical (3-sigma) | Flag points > 3σ from mean | Simple metrics, Gaussian data | Doesn't handle seasonality |
| Seasonal decomposition | STL decomposition → anomaly on residual | Metrics with daily/weekly patterns | Requires sufficient history |
| Isolation Forest | Random partitioning - anomalies isolated quickly | Multivariate, tabular features | Doesn't handle time dependencies |
| LSTM Autoencoder | Reconstruct time series - high error = anomaly | Complex temporal patterns | Training complexity, slow |
| Prophet-based | Forecast expected value → anomaly if actual deviates | Metrics with trend + seasonality | One model per metric = 10K models |
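
To make the "anomaly on residual" idea concrete, here is a minimal sketch. It does not run STL; as a simplifying assumption it swaps in a per-hour-of-day median as the seasonal baseline and scores the residual with a robust z-score (median absolute deviation). The function names and the latency numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import median

def fit_hourly_baseline(history):
    """history: iterable of (hour_of_day, value). Returns {hour: (median, MAD)}."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    baseline = {}
    for hour, values in by_hour.items():
        med = median(values)
        mad = median([abs(v - med) for v in values])
        baseline[hour] = (med, mad)
    return baseline

def residual_score(baseline, hour, value, min_scale=1e-9):
    """Robust z-score of the residual (value minus that hour's median)."""
    med, mad = baseline[hour]
    # 1.4826 * MAD approximates the standard deviation for Gaussian data.
    scale = max(1.4826 * mad, min_scale)
    return abs(value - med) / scale

# Synthetic 14-day latency history (illustrative): ~20 ms at night, ~160 ms by day.
history = []
for day in range(14):
    jitter = 1 + 0.05 * (day % 5 - 2)  # deterministic +/-10% day-to-day variation
    for hour in range(24):
        base = 20 if hour < 6 else 160
        history.append((hour, base * jitter))

baseline = fit_hourly_baseline(history)
night_score = residual_score(baseline, hour=3, value=150)
day_score = residual_score(baseline, hour=14, value=150)
print(f"150 ms at 03:00 -> score {night_score:.1f}")
print(f"150 ms at 14:00 -> score {day_score:.1f}")
```

The same 150 ms reading scores far above a threshold of 3 at 03:00 and well below it at 14:00, which is the contextual-anomaly behavior a single global 3-sigma threshold cannot express.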

Figure: Layered anomaly detection. Statistical, seasonal, and ML layers feed an alert pool, followed by correlation, prioritization, and alerting.

Step 4: Reducing False Alarms (8 min)

This is the most important section - the difference between a useful system and an ignored one.

The Alert Fatigue Problem

If you have 10K metrics, each with a 0.1% false alarm rate per hour:

  • 10K × 0.001 × 24 = 240 false alerts per day
  • On-call engineers stop trusting the system after day 2
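
The arithmetic above can also be run in reverse to ask what per-metric rate a given alert budget allows. The sketch below assumes independent metrics and a budget of 10 false alerts per day, which is an illustrative number rather than a requirement from the design.

```python
def expected_false_alerts_per_day(n_metrics, false_alarm_rate_per_hour, hours=24):
    """Expected number of false alerts per day across all metrics."""
    return n_metrics * false_alarm_rate_per_hour * hours

def max_rate_for_budget(n_metrics, daily_budget, hours=24):
    """Largest per-metric hourly false alarm rate that stays within the budget."""
    return daily_budget / (n_metrics * hours)

per_day = expected_false_alerts_per_day(10_000, 0.001)
allowed_rate = max_rate_for_budget(10_000, daily_budget=10)  # budget is an assumption

print(f"False alerts per day at 0.1%/hour: {per_day:.0f}")
print(f"Rate needed for 10/day: {allowed_rate:.6%} per metric-hour")
```

Hitting such a budget through thresholds alone would require a per-metric rate roughly 24 times lower, which is why the strategies below focus on the alert pipeline rather than on tightening every detector.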

Solutions

| Strategy | How It Works | Impact |
| --- | --- | --- |
| Alert correlation | Group alerts from the same root cause | 50 alerts → 1 incident |
| Anomaly scoring | Severity score based on deviation magnitude + duration | Prioritize serious anomalies |
| Minimum duration | Require anomaly to persist for N minutes | Filters transient spikes |
| Feedback loop | Engineers mark alerts as true/false positive → adjust thresholds | Continuous improvement |
| Dependency-aware | If DB is down, don't alert on all dependent services | Reduces cascading alerts |

Interviewer's Perspective

The candidate who only talks about the ML model gets a Lean Hire. The candidate who designs the full alerting pipeline - including correlation, deduplication, severity scoring, and feedback loops - gets a Strong Hire. The ML model is 20% of the value; the operational pipeline is 80%.
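
Two of the strategies from the Solutions table, minimum duration and dependency-aware suppression, can be sketched in a few lines. This is an illustration under stated assumptions: the service names and the `depends_on` map are hypothetical, and only direct dependencies are checked.

```python
def persistent_anomalies(flags, min_points):
    """Return indices where an anomalous run first reaches min_points samples."""
    fired, run = [], 0
    for i, is_anomalous in enumerate(flags):
        run = run + 1 if is_anomalous else 0
        if run == min_points:  # fire once per run, not on every later sample
            fired.append(i)
    return fired

def suppress_dependents(alerting, depends_on):
    """Keep alerts only for services whose direct upstreams are not alerting."""
    alerting = set(alerting)
    roots = []
    for service in sorted(alerting):
        upstream = depends_on.get(service, [])
        if not any(dep in alerting for dep in upstream):
            roots.append(service)
    return roots

# One transient spike (index 1) and one sustained anomaly (indices 3-6).
flags = [False, True, False, True, True, True, True, False]
fired = persistent_anomalies(flags, min_points=3)

# Hypothetical dependency map: service -> direct upstream dependencies.
depends_on = {
    "checkout": ["orders-db"],
    "cart": ["orders-db"],
    "search": ["search-index"],
}
roots = suppress_dependents(["checkout", "cart", "orders-db"], depends_on)

print(fired)  # [5]
print(roots)  # ['orders-db']
```

The transient spike never fires, the sustained run fires once, and three raw alerts collapse to the one upstream service that explains them. A production version would likely walk the dependency graph transitively.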

Step 5: Serving Architecture (5 min)

Figure: Anomaly detection streaming architecture. Kafka → Flink → Detection Models → Alert Aggregator → PagerDuty/Slack

| Component | Technology | Why |
| --- | --- | --- |
| Metric ingestion | Kafka / Kinesis | Handle 10K metrics × 1 point/sec |
| Stream processing | Flink / Spark Streaming | Real-time feature computation (rolling stats) |
| Detection | Stateless models on each data point | Low latency, horizontally scalable |
| State storage | Redis (rolling windows) + TimescaleDB (history) | Fast lookups + long-term storage |
| Alerting | PagerDuty / Opsgenie integration | Routing, escalation, on-call schedules |
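
As a rough illustration of the rolling-stats step, the sketch below keeps a fixed-size window per metric and scores each new point against it. The class name and parameters are hypothetical, and the in-process dictionary of deques stands in for the Redis rolling windows; it is not how Flink or Redis would actually be wired.

```python
from collections import defaultdict, deque
from math import sqrt

class RollingZScore:
    """Per-metric rolling z-score over the last `window` points."""

    def __init__(self, window=60, threshold=3.0, min_points=10):
        self.threshold = threshold
        self.min_points = min_points
        # One bounded buffer per metric name; old points fall off automatically.
        self.buffers = defaultdict(lambda: deque(maxlen=window))

    def update(self, metric, value):
        """Score `value` against the existing window, then add it. Returns True if anomalous."""
        buf = self.buffers[metric]
        anomalous = False
        if len(buf) >= self.min_points:
            mean = sum(buf) / len(buf)
            std = sqrt(sum((x - mean) ** 2 for x in buf) / len(buf))
            if std > 0:
                anomalous = abs(value - mean) / std > self.threshold
            else:
                anomalous = value != mean  # flat history: any change stands out
        buf.append(value)
        return anomalous

detector = RollingZScore(window=30, threshold=3.0, min_points=10)
warmup = [detector.update("cpu", 50 + (i % 3 - 1)) for i in range(20)]
spike = detector.update("cpu", 95)
print(any(warmup), spike)  # False True
```

Note that an anomalous point is still appended to the window, so a long incident gradually inflates the baseline; excluding flagged points from the window is one common mitigation.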

Step 6: Evaluation (5 min)

The challenge: No labels for offline evaluation.

| Approach | How |
| --- | --- |
| Inject synthetic anomalies | Add known anomalies to historical data, measure detection rate |
| Retrospective labeling | After an incident, label the time window and check if the system caught it |
| Alert precision tracking | Track % of alerts that led to actual incidents (feedback from engineers) |
| MTTR correlation | Does the system reduce mean time to resolution? |
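
Synthetic injection can be sketched as follows. This is an illustration under stated assumptions: the "historical" series is simulated Gaussian noise, the injected anomalies are simple additive spikes, and a global z-score detector stands in for whatever model is being evaluated.

```python
import random
from statistics import mean, pstdev

def inject_spikes(series, n_spikes, magnitude, rng):
    """Return a copy of series with additive spikes, plus the injected indices."""
    injected = set(rng.sample(range(len(series)), n_spikes))
    corrupted = [v + magnitude if i in injected else v for i, v in enumerate(series)]
    return corrupted, injected

def zscore_detector(series, threshold=4.0):
    """Stand-in detector: flag points more than `threshold` sigmas from the mean."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return set()
    return {i for i, v in enumerate(series) if abs(v - mu) / sigma > threshold}

def precision_recall(detected, injected):
    """Precision and recall of detected indices against the injected ground truth."""
    true_positives = len(detected & injected)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(injected) if injected else 0.0
    return precision, recall

rng = random.Random(42)  # fixed seed so the experiment is repeatable
clean = [rng.gauss(100, 5) for _ in range(1000)]
corrupted, injected = inject_spikes(clean, n_spikes=10, magnitude=50, rng=rng)
detected = zscore_detector(corrupted)
precision, recall = precision_recall(detected, injected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Large spikes are the easy case. Repeating the experiment with smaller magnitudes, level shifts, or slow drifts gives a detection-rate curve per anomaly type, which is more informative than a single number.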

Practice Problems

Problem 1: Seasonal Anomaly

Direction

Your e-commerce platform has 5x traffic on Black Friday. Your anomaly detector fires hundreds of alerts because it sees "abnormal" traffic levels. How do you handle expected seasonality?

Key Insight

Use seasonal decomposition (STL) to separate trend, seasonality, and residual. Detect anomalies on the residual only. For known events (Black Friday, product launches), use event-aware models that adjust expectations. Alternatively, train separate models for "event" and "normal" periods. The key: the system should know what's expected and only alert on the unexpected.

Problem 2: Multivariate Correlation

Direction

CPU, memory, and latency all increase together. Individually, each is within normal range. But the combination is unusual - normally when CPU is high, latency isn't. How do you detect this?

Key Insight

Multivariate anomaly detection: use a model that considers metric correlations. Options: (1) Multivariate autoencoder - learns normal correlations, flags when reconstruction error is high. (2) Isolation Forest on feature vectors (CPU, memory, latency combined). (3) Correlation matrix monitoring - alert when the correlation structure between metrics changes. The key insight: some anomalies are only visible in the joint distribution, not in individual metrics.
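
A small example shows how a point can be normal in each metric yet anomalous jointly. The sketch below uses Mahalanobis distance on two metrics, a simpler stand-in for the three options listed above and not one of them. The CPU and latency numbers are synthetic, constructed so that latency is low when CPU is high, as in the problem statement.

```python
from math import sqrt

def fit_2d(points):
    """Estimate means and the 2x2 covariance of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return mx, my, sxx, syy, sxy

def mahalanobis(model, x, y):
    """Distance of (x, y) from the mean, scaled by the covariance structure."""
    mx, my, sxx, syy, sxy = model
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x - mx, y - my
    # Quadratic form with the explicit inverse of the 2x2 covariance matrix.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(d2)

# Synthetic "normal" behavior: CPU 20-80%, latency falls as CPU rises.
normal = [(cpu, 120 - cpu + (i % 5 - 2) * 2) for i, cpu in enumerate(range(20, 81))]
model = fit_2d(normal)

usual = mahalanobis(model, 75, 45)    # high CPU, low latency: matches history
unusual = mahalanobis(model, 75, 95)  # high CPU AND high latency
print(f"usual={usual:.1f} unusual={unusual:.1f}")
```

Both test points lie inside the observed range of each individual metric, but only the second breaks the learned relationship between them, so only it gets a large distance.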

Interview Cheat Sheet

| Question Pattern | Framework | Key Phrases |
| --- | --- | --- |
| "Design anomaly detection" | Layered detection + alert pipeline | "Statistical for simple, seasonal models for patterns, ML for complex - then correlate and prioritize" |
| "How do you handle false alarms?" | Alert correlation + feedback | "Correlate alerts to root causes, require minimum duration, engineer feedback loop" |
| "No labels available" | Unsupervised + synthetic evaluation | "Train unsupervised models, evaluate with synthetic injection and retrospective labeling" |

Spaced Repetition Checkpoints

  • Day 0: List 3 anomaly detection methods and when you'd use each.
  • Day 3: Design the alert pipeline: how do you reduce 500 raw alerts to 10 actionable ones?
  • Day 7: Design anomaly detection for a fintech payment system in 45 minutes.
  • Day 14: Explain how to evaluate an anomaly detection system without labeled data.
  • Day 21: Mock interview with follow-ups on seasonality, multivariate detection, and alert fatigue.

© 2026 EngineersOfAI. All rights reserved.