Design: Content Moderation - Multi-Modal Classification with Human-in-the-Loop

Reading time: ~25 min | Interview relevance: High | Roles: MLE, AI Eng

The Real Interview Moment

"Design a content moderation system for a social media platform." You describe a text classifier for hate speech. The interviewer asks: "What about images? What about memes that combine text and images where neither is harmful alone but the combination is? What about sarcasm? What about different cultural norms - something acceptable in one country is offensive in another?"

Content moderation is the design problem that tests your understanding of multi-modal ML, human-AI collaboration, and the limits of AI. No model can handle this alone - the strongest designs show a layered system where ML handles scale and humans handle nuance.

What You Will Master

  • Multi-modal classification: text, images, video, audio
  • Multi-label taxonomy: hate speech, violence, nudity, spam, misinformation
  • Human-in-the-loop review and active learning
  • Policy enforcement and appeals process
  • Handling cultural context and evolving policies
  • Balancing speed (remove harmful content fast) with accuracy (don't over-censor)

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

  • Classify content across categories: hate speech, violence, nudity, spam, misinformation, self-harm, copyright
  • Handle multiple modalities: text, images, video, audio, text-in-image
  • Support user reporting and appeals
  • Process 500M posts per day

Non-functional requirements:

  • Latency: <500ms for synchronous moderation (before content is visible)
  • Recall: >99% for severe violations (child safety, terrorism), >85% across all categories
  • Precision: >95% for auto-removal (false removals are costly)
  • Scalability: Handle viral content (1M+ shares in hours)
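A quick back-of-the-envelope calculation helps turn the 500M posts/day figure into a serving target. The sketch below assumes a 3x peak-to-average ratio, which is an illustrative guess rather than a number from the requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(posts_per_day: int) -> float:
    """Average posts per second implied by a daily volume."""
    return posts_per_day / SECONDS_PER_DAY


def peak_qps(posts_per_day: int, peak_factor: float) -> float:
    """Peak load, assuming traffic peaks at `peak_factor` times the average."""
    return average_qps(posts_per_day) * peak_factor


avg = average_qps(500_000_000)
peak = peak_qps(500_000_000, peak_factor=3.0)  # 3x is an assumed ratio
print(f"average: {avg:,.0f} posts/s")       # average: 5,787 posts/s
print(f"peak (assumed 3x): {peak:,.0f} posts/s")  # peak (assumed 3x): 17,361 posts/s
```

Roughly 6K posts/s on average means the classifier fleet has to be sized per modality, since image and video inference cost far more per item than text.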

Step 2: Problem Formulation (5 min)

ML problem type: Multi-label, multi-modal classification with severity scoring.

Content Moderation - Multi-Modal Classification (Text, Image, Video, Multi-Modal) → Aggregation → Decision

| Category | Severity | Action | Latency Requirement |
| --- | --- | --- | --- |
| Child safety / terrorism | Critical | Auto-remove + report to authorities | Real-time (<100ms) |
| Nudity / graphic violence | High | Auto-remove, allow appeal | <500ms |
| Hate speech / harassment | Medium | Queue for review, reduce distribution | <1 min |
| Spam / clickbait | Low | Reduce distribution, label | Batch OK |
| Misinformation | Complex | Label, reduce distribution, fact-check queue | Hours |
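One way to keep this policy table out of model code is to express it as configuration, so a policy change becomes a config change rather than a retrain. The sketch below is a minimal illustration: the category keys, action strings, and the `pre_publish` flag are hypothetical names, and marking Critical and High as pre-publish is an assumption based on their latency budgets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    severity: str
    actions: tuple[str, ...]
    pre_publish: bool  # assumed: decide before the content becomes visible


# Hypothetical encoding of the table above.
POLICIES = {
    "child_safety": Policy("critical", ("auto_remove", "report_to_authorities"), True),
    "terrorism": Policy("critical", ("auto_remove", "report_to_authorities"), True),
    "nudity": Policy("high", ("auto_remove", "allow_appeal"), True),
    "graphic_violence": Policy("high", ("auto_remove", "allow_appeal"), True),
    "hate_speech": Policy("medium", ("queue_for_review", "reduce_distribution"), False),
    "harassment": Policy("medium", ("queue_for_review", "reduce_distribution"), False),
    "spam": Policy("low", ("reduce_distribution", "label"), False),
    "clickbait": Policy("low", ("reduce_distribution", "label"), False),
    "misinformation": Policy(
        "complex", ("label", "reduce_distribution", "fact_check_queue"), False
    ),
}


def actions_for(flagged: list[str]) -> list[str]:
    """Union of actions for all flagged categories, de-duplicated in order."""
    seen: dict[str, None] = {}
    for category in flagged:
        for action in POLICIES[category].actions:
            seen.setdefault(action, None)
    return list(seen)


print(actions_for(["hate_speech", "spam"]))
# ['queue_for_review', 'reduce_distribution', 'label']
```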

Step 3: Features & Data (8 min)

Per-Modality Features

| Modality | Model | Features |
| --- | --- | --- |
| Text | Fine-tuned BERT / multilingual model | Token embeddings, sentiment, toxicity scores |
| Image | CNN (ResNet/EfficientNet) fine-tuned | Visual features, NSFW score, OCR text extraction |
| Video | Frame sampling + image model + audio model | Keyframe analysis, audio transcription, scene detection |
| Multi-modal | CLIP-style model | Text-image alignment, meme understanding |
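The per-modality scores have to be merged before a decision is made. A minimal sketch of that aggregation step follows, assuming each model emits a score per category and taking the per-category maximum. Max-pooling is only one conservative choice; a learned fusion layer is a common alternative. The scores in the example are made up.

```python
def aggregate_scores(per_modality: dict[str, dict[str, float]]) -> dict[str, float]:
    """Combine per-modality category scores by taking the max per category."""
    combined: dict[str, float] = {}
    for scores in per_modality.values():
        for category, score in scores.items():
            combined[category] = max(combined.get(category, 0.0), score)
    return combined


# Illustrative meme: text and image look benign alone, the joint model does not.
meme_scores = {
    "text": {"hate_speech": 0.10, "spam": 0.02},
    "image": {"hate_speech": 0.05, "nudity": 0.01},
    "multimodal": {"hate_speech": 0.91},
}
print(aggregate_scores(meme_scores))
# {'hate_speech': 0.91, 'spam': 0.02, 'nudity': 0.01}
```

Taking the max means any single model can escalate a post, which favors recall; it also means one poorly calibrated model can flood the review queue.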

Training Data

  • Human annotations: Content review team labels examples (expensive, gold standard)
  • User reports: Noisy but abundant - 10K+ reports/day
  • Active learning: Model identifies uncertain cases, routes to human reviewers
  • Synthetic data: Augmentation for rare categories (paraphrasing, backtranslation)
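The active learning bullet can be sketched as uncertainty sampling: send reviewers the items whose predicted probability carries the most entropy. This assumes a single binary score per item and a fixed labeling budget; the item ids and scores below are illustrative.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction; highest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_for_labeling(scores: dict[str, float], budget: int) -> list[str]:
    """Return the `budget` item ids the model is least certain about."""
    ranked = sorted(scores, key=lambda item_id: binary_entropy(scores[item_id]), reverse=True)
    return ranked[:budget]


model_scores = {"a": 0.02, "b": 0.48, "c": 0.97, "d": 0.70}
print(select_for_labeling(model_scores, budget=2))  # ['b', 'd']
```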

Common Trap

Content moderation has severe class imbalance AND evolving definitions. "Hate speech" is culturally and temporally dependent - what counts as hate speech shifts over time and varies by region. Your model must be quick to update (new policy → new labels → fine-tune → deploy in days, not months). Discuss this in the interview.
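For the class-imbalance half of this trap, one common mitigation is to up-weight the rare positive class in the loss. The sketch below computes a negatives-to-positives ratio per label, the kind of value that can be passed as `pos_weight` to PyTorch's `BCEWithLogitsLoss`. The label counts are hypothetical.

```python
def positive_class_weights(label_counts: dict[str, int], total: int) -> dict[str, float]:
    """Per-label weight = (number of negatives) / (number of positives)."""
    weights: dict[str, float] = {}
    for label, positives in label_counts.items():
        if positives <= 0:
            raise ValueError(f"no positive examples for label {label!r}")
        weights[label] = (total - positives) / positives
    return weights


# Hypothetical counts in a 1M-example training set.
counts = {"spam": 50_000, "hate_speech": 2_000}
print(positive_class_weights(counts, total=1_000_000))
# {'spam': 19.0, 'hate_speech': 499.0}
```

Very large weights can hurt calibration, so recalibrating scores on a held-out set afterwards is usually worth mentioning.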

Step 4: Model (8 min)

The Layered Approach

Three-Layer Content Moderation - Hash Matching → ML Classifiers → Human Review Queue

Layer 1 - Hash Matching (deterministic):

  • PhotoDNA / perceptual hashing for known CSAM and terrorist content
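  • Match against databases of known content (e.g., NCMEC hash database): exact for cryptographic hashes, within a small distance threshold for perceptual hashes
  • Near-100% precision for known content, but no coverage of novel content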
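To make the hash-matching layer concrete, here is a minimal sketch of a perceptual hash lookup. It uses a simple average hash, not PhotoDNA (which is proprietary), and assumes the image has already been resized to an 8x8 grayscale grid. The `max_distance` threshold of 5 bits is an illustrative choice.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """64-bit average hash of an 8x8 grayscale grid (values 0-255)."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def matches_known(candidate: int, known_hashes: set[int], max_distance: int = 5) -> bool:
    """True if the candidate is within `max_distance` bits of any known hash."""
    return any(hamming_distance(candidate, known) <= max_distance for known in known_hashes)


original = [[10] * 4 + [200] * 4 for _ in range(8)]       # dark left, bright right
brightened = [[v + 5 for v in row] for row in original]   # small global edit
unrelated = [[10] * 8 for _ in range(4)] + [[200] * 8 for _ in range(4)]

known = {average_hash(original)}
print(matches_known(average_hash(brightened), known))  # True
print(matches_known(average_hash(unrelated), known))   # False
```

The linear scan over `known_hashes` is fine for a sketch; at database scale an index structure for Hamming-distance search would be needed.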

Layer 2 - ML Classifiers:

  • Separate classifiers per modality and category
  • Multi-task learning for related categories
  • Calibrated outputs for threshold-based decisions
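Calibrated outputs matter because the decision layer compares them against per-category thresholds. A minimal sketch follows, with made-up threshold values; in practice each threshold would be read off a validation precision-recall curve to hit the precision and recall targets for that category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    auto_remove: float  # at or above this score, remove without human review
    review: float       # at or above this score, send to the human queue


# Illustrative values only: severe categories get lower bars to favor recall.
THRESHOLDS = {
    "child_safety": Thresholds(auto_remove=0.50, review=0.05),
    "nudity": Thresholds(auto_remove=0.95, review=0.60),
    "spam": Thresholds(auto_remove=0.99, review=0.80),
}


def decide(category: str, score: float) -> str:
    """Map a calibrated probability to an action for one category."""
    limits = THRESHOLDS[category]
    if score >= limits.auto_remove:
        return "auto_remove"
    if score >= limits.review:
        return "human_review"
    return "allow"


print(decide("child_safety", 0.30))  # human_review
print(decide("spam", 0.30))          # allow
print(decide("nudity", 0.97))        # auto_remove
```

The same score of 0.30 leads to different actions depending on severity, which is the point of per-category thresholds.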

Layer 3 - Human Review:

  • Prioritized queue (severity × volume × uncertainty)
  • Human decisions become training data (active learning loop)
  • Reviewer consensus (2-3 reviewers for borderline cases)
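The prioritized queue can be sketched with a heap keyed on severity × volume × uncertainty. The severity weights and the uncertainty formula (1 at a score of 0.5, 0 at scores of 0 or 1) are assumptions made for illustration, as are the `ReviewQueue` name and its methods.

```python
import heapq

# Assumed weights; real values would come from policy.
SEVERITY_WEIGHT = {"critical": 100.0, "high": 10.0, "medium": 3.0, "low": 1.0}


def uncertainty(score: float) -> float:
    """1.0 when the model score is 0.5, falling to 0.0 at scores of 0 or 1."""
    return 1.0 - abs(2.0 * score - 1.0)


def priority(severity: str, views: int, score: float) -> float:
    return SEVERITY_WEIGHT[severity] * views * uncertainty(score)


class ReviewQueue:
    """Max-priority queue built on heapq (a min-heap) using negated priorities."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._counter = 0  # tie-breaker keeps insertion order stable

    def push(self, item_id: str, severity: str, views: int, score: float) -> None:
        entry = (-priority(severity, views, score), self._counter, item_id)
        heapq.heappush(self._heap, entry)
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]


queue = ReviewQueue()
queue.push("spam_post", "low", views=50_000, score=0.50)
queue.push("hate_post", "medium", views=2_000, score=0.60)
queue.push("violent_video", "high", views=10_000, score=0.55)
print([queue.pop() for _ in range(3)])
# ['violent_video', 'spam_post', 'hate_post']
```

A pure product sends confidently scored items to the back of the queue regardless of severity, so a real system would likely add a floor for critical categories.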

Step 5: Serving (8 min)

Architecture Decisions

| Component | Decision | Rationale |
| --- | --- | --- |
| Pre-publish vs. post-publish | Hybrid: pre-publish for severe categories, post-publish ML + distribution reduction for others | Balance speed with user experience |
| Model serving | GPU cluster for image/video models, CPU for text | Cost optimization |
| Human review platform | Priority queue with SLAs by severity | Critical content reviewed in <1 hour |
| Appeals | Separate review team, cannot be same reviewer | Fairness requirement |
| Feedback loop | Human decisions → retrain weekly | Continuous model improvement |

Handling Viral Content

When a post goes viral (shared 10K+ times):

  1. Proactive review: Flag for human review even if ML score is low
  2. Distribution throttle: Slow sharing while review is pending
  3. Batch action: If violation confirmed, remove all reshares simultaneously
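The three steps above can be sketched as a small state machine on the post. The `Post` fields and function names are hypothetical, and the `throttled` flag only records the decision; actual rate limiting would live in the sharing service.

```python
from dataclasses import dataclass, field

VIRAL_SHARE_THRESHOLD = 10_000  # the 10K+ shares figure from the text


@dataclass
class Post:
    post_id: str
    reshare_ids: list[str] = field(default_factory=list)
    review_state: str = "none"  # "none" | "pending" | "done"
    throttled: bool = False
    removed: bool = False


def on_share(post: Post, reshare_id: str, threshold: int = VIRAL_SHARE_THRESHOLD) -> None:
    """Record a reshare; on crossing the threshold, flag for review and throttle."""
    post.reshare_ids.append(reshare_id)
    if len(post.reshare_ids) >= threshold and post.review_state == "none":
        post.review_state = "pending"  # step 1: proactive review
        post.throttled = True          # step 2: distribution throttle


def on_review_decision(post: Post, is_violation: bool) -> list[str]:
    """Apply the reviewer's decision; return every id to remove in one batch."""
    post.review_state = "done"
    post.throttled = False
    if not is_violation:
        return []
    post.removed = True
    return [post.post_id, *post.reshare_ids]  # step 3: batch action


post = Post("p1")
for i in range(3):
    on_share(post, f"r{i}", threshold=3)  # tiny threshold for the demo
print(post.review_state, post.throttled)            # pending True
print(on_review_decision(post, is_violation=True))  # ['p1', 'r0', 'r1', 'r2']
```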

Step 6: Evaluation & Iteration (8 min)

Metrics

| Metric | What It Measures | Target |
| --- | --- | --- |
| Precision (auto-remove) | % of auto-removed content that was actually violating | > 95% |
| Recall (severe) | % of severe violations caught | > 99% |
| Recall (all) | % of all violations caught | > 85% |
| Time to action | Median time from post to removal | < 1 hour for severe |
| Appeal overturn rate | % of appealed decisions reversed | < 10% |
| Reviewer agreement | Inter-annotator agreement | Cohen's κ > 0.7 |
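A short sketch of how the precision, recall, and Cohen's κ rows might be computed from labeled samples. The label lists are made up, and κ here is the standard two-rater form.

```python
def precision_recall(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Precision and recall for binary labels (1 = violating)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


truth = [1, 1, 1, 0, 0, 0, 0, 1]
model = [1, 1, 0, 0, 0, 1, 0, 1]
print(precision_recall(truth, model))  # (0.75, 0.75)

reviewer_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
reviewer_b = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]
print(round(cohens_kappa(reviewer_a, reviewer_b), 3))  # 0.583
```

In this example the reviewers agree on 80% of items, yet κ is about 0.58, below the 0.7 target, because much of that agreement is expected by chance.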

Practice Problems

Problem 1: Meme Moderation

Direction

Memes combine text and images where neither may be harmful alone. "Cute puppy" + "I want to destroy all [group]" = harmful meme. How do you handle this?

Key Insight

Use a multi-modal model (CLIP-based or similar) that scores text and image jointly rather than independently, so it can learn that certain text-image combinations are harmful even when each modality alone is benign. Train on annotated meme datasets such as the Hateful Memes Challenge. The hard part is sarcasm and irony: "I love how [group] always [negative stereotype]" requires understanding pragmatics, not just semantics.

Problem 2: Cross-Lingual Hate Speech

Direction

Your platform operates in 100+ languages. You have good English training data but minimal data for Swahili, Tagalog, and Bengali. How do you moderate content in low-resource languages?

Key Insight

Use multilingual models (mBERT, XLM-R) for zero-shot transfer from English, then fine-tune on whatever labeled data exists per language. Add translation-based augmentation: translate English training data into the target languages. For very low-resource languages, treat model scores as low-confidence signals and route more content to community moderators who are native speakers. Monitor per-language metrics - don't let global metrics hide poor performance in specific languages.
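The per-language monitoring point can be illustrated with a small sketch that computes recall per language and flags any language below a floor. The evaluation data is synthetic and the 0.85 floor is borrowed from the overall recall target.

```python
from collections import defaultdict


def recall_by_language(examples: list[tuple[str, bool, bool]]) -> dict[str, float]:
    """examples: (language, is_violation, was_flagged). Returns recall per language."""
    caught: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for language, is_violation, was_flagged in examples:
        if is_violation:
            total[language] += 1
            if was_flagged:
                caught[language] += 1
    return {language: caught[language] / total[language] for language in total}


def languages_below_floor(recalls: dict[str, float], floor: float) -> list[str]:
    """Languages whose recall falls under the floor, sorted alphabetically."""
    return sorted(lang for lang, recall in recalls.items() if recall < floor)


# Synthetic evaluation set: English 90/100 caught, Swahili 4/10 caught.
eval_set = (
    [("en", True, True)] * 90 + [("en", True, False)] * 10
    + [("sw", True, True)] * 4 + [("sw", True, False)] * 6
)
recalls = recall_by_language(eval_set)
global_recall = sum(1 for _, v, f in eval_set if v and f) / len(eval_set)

print(round(global_recall, 3))               # 0.855
print(recalls)                               # {'en': 0.9, 'sw': 0.4}
print(languages_below_floor(recalls, 0.85))  # ['sw']
```

Global recall clears the 85% bar here while Swahili sits at 40%, which is exactly the failure mode a single aggregate metric hides.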

Interview Cheat Sheet

| Question Pattern | Framework | Key Phrases |
| --- | --- | --- |
| "Design content moderation" | Three-layer: hash → ML → human | "Hash matching for known content, ML classifiers for scale, human review for nuance" |
| "How do you handle new policies?" | Rapid iteration pipeline | "New policy → annotate examples → fine-tune → deploy in days" |
| "Precision vs. recall?" | Severity-based thresholds | "99% recall for child safety, 85% recall for spam - thresholds match severity" |
| "Multi-modal content?" | Joint models | "CLIP-based multi-modal model for text-image combinations" |

Spaced Repetition Checkpoints

  • Day 0: Draw the three-layer moderation architecture. Explain why each layer exists.
  • Day 3: Explain the precision-recall trade-off for content moderation. How does severity change the threshold?
  • Day 7: Design content moderation for a video-sharing platform in 45 minutes.
  • Day 14: Discuss the human-in-the-loop pipeline. How does active learning improve the model?
  • Day 21: Mock interview with follow-ups on multi-modal memes, cultural context, and appeals.

© 2026 EngineersOfAI. All rights reserved.