Design: Content Moderation - Multi-Modal Classification with Human-in-the-Loop

Reading time: ~25 min | Interview relevance: High | Roles: MLE, AI Eng

The Real Interview Moment

"Design a content moderation system for a social media platform." You describe a text classifier for hate speech. The interviewer asks: "What about images? What about memes that combine text and images where neither is harmful alone but the combination is? What about sarcasm? What about different cultural norms - something acceptable in one country is offensive in another?"

Content moderation is the design problem that tests your understanding of multi-modal ML, human-AI collaboration, and the limits of AI. No model can handle this alone - the strongest designs show a layered system where ML handles scale and humans handle nuance.

What You Will Master

Multi-modal classification: text, images, video, audio
Multi-label taxonomy: hate speech, violence, nudity, spam, misinformation
Human-in-the-loop review and active learning
Policy enforcement and appeals process
Handling cultural context and evolving policies
Balancing speed (remove harmful content fast) with accuracy (don't over-censor)

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

Classify content across categories: hate speech, violence, nudity, spam, misinformation, self-harm, copyright
Handle multiple modalities: text, images, video, audio, text-in-image
Support user reporting and appeals
Process 500M posts per day

Non-functional requirements:

Latency: <500ms for synchronous moderation (before content is visible)
Recall: >99% for severe violations (child safety, terrorism), >85% for others
Precision: >95% for auto-removal (false removals are costly)
Scalability: Handle viral content (1M+ shares in hours)
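A quick back-of-the-envelope check turns the 500M posts/day requirement into a per-second rate, which is what you size the serving tier against. The peak factor below is a hypothetical assumption, not a figure from this guide; in practice you would measure it from traffic.

```python
POSTS_PER_DAY = 500_000_000          # from the functional requirements
SECONDS_PER_DAY = 24 * 60 * 60

# Average sustained load if traffic were perfectly flat.
avg_posts_per_sec = POSTS_PER_DAY / SECONDS_PER_DAY

# Hypothetical peak-to-average ratio (assumption for illustration only).
PEAK_FACTOR = 3
peak_posts_per_sec = avg_posts_per_sec * PEAK_FACTOR

print(f"average: {avg_posts_per_sec:,.0f} posts/s")
print(f"peak (x{PEAK_FACTOR}): {peak_posts_per_sec:,.0f} posts/s")
```

At roughly 5.8K posts per second on average, running every heavy model on every post is expensive, which motivates cheap first-pass filters before the larger classifiers.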

Step 2: Problem Formulation (5 min)

ML problem type: Multi-label, multi-modal classification with severity scoring.

Pipeline: per-modality classifiers (text, image, video, multi-modal) → aggregation → decision

Category	Severity	Action	Latency Requirement
Child safety / terrorism	Critical	Auto-remove + report to authorities	Real-time (<100ms)
Nudity / graphic violence	High	Auto-remove, allow appeal	<500ms
Hate speech / harassment	Medium	Queue for review, reduce distribution	<1 min
Spam / clickbait	Low	Reduce distribution, label	Batch OK
Misinformation	Complex	Label, reduce distribution, fact-check queue	Hours
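The table above can be sketched as a policy lookup that a decision service consults once a category is flagged. This is a minimal sketch: the category keys, action names, and the `plan_actions` helper are hypothetical identifiers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    severity: str
    actions: tuple
    latency: str


# Mirrors the severity/action table; identifiers are illustrative.
POLICIES = {
    "child_safety": Policy("critical", ("auto_remove", "report_to_authorities"), "<100ms"),
    "terrorism": Policy("critical", ("auto_remove", "report_to_authorities"), "<100ms"),
    "nudity": Policy("high", ("auto_remove", "allow_appeal"), "<500ms"),
    "graphic_violence": Policy("high", ("auto_remove", "allow_appeal"), "<500ms"),
    "hate_speech": Policy("medium", ("queue_for_review", "reduce_distribution"), "<1 min"),
    "harassment": Policy("medium", ("queue_for_review", "reduce_distribution"), "<1 min"),
    "spam": Policy("low", ("reduce_distribution", "label"), "batch"),
    "clickbait": Policy("low", ("reduce_distribution", "label"), "batch"),
    "misinformation": Policy("complex", ("label", "reduce_distribution", "fact_check_queue"), "hours"),
}


def plan_actions(flagged_categories):
    """Merge actions for every flagged category (multi-label), keeping order."""
    actions = []
    for category in flagged_categories:
        for action in POLICIES[category].actions:
            if action not in actions:
                actions.append(action)
    return actions


print(plan_actions(["hate_speech", "spam"]))
```

Keeping the policy in data rather than code is one way to support fast policy changes: updating an action does not require retraining or redeploying a model.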

Step 3: Features & Data (8 min)

Per-Modality Features

Modality	Model	Features
Text	Fine-tuned BERT / multilingual model	Token embeddings, sentiment, toxicity scores
Image	Fine-tuned CNN (ResNet/EfficientNet)	Visual features, NSFW score, OCR text extraction
Video	Frame sampling + image model + audio model	Keyframe analysis, audio transcription, scene detection
Multi-modal	CLIP-style model	Text-image alignment, meme understanding
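One simple way to realize the aggregation step is to take, per category, the maximum score across modality models. This is an assumption for illustration (the guide does not prescribe an aggregation rule); a learned fusion layer is a common alternative. The function name and score layout are hypothetical.

```python
def aggregate(scores_by_modality):
    """Combine per-modality category scores by taking the max per category.

    scores_by_modality: {modality: {category: probability}}
    Returns {category: max probability across modalities}.
    """
    combined = {}
    for scores in scores_by_modality.values():
        for category, probability in scores.items():
            combined[category] = max(combined.get(category, 0.0), probability)
    return combined


example = {
    "text": {"hate_speech": 0.20, "spam": 0.70},
    "image": {"nudity": 0.10, "hate_speech": 0.05},
    "multimodal": {"hate_speech": 0.90},  # joint model sees the combination
}
print(aggregate(example))
```

Max aggregation favors recall: any single model can raise a flag. In the example, only the joint model scores the post as likely hate speech, and that signal survives aggregation.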

Training Data

Human annotations: Content review team labels examples (expensive, gold standard)
User reports: Noisy but abundant - 10K+ reports/day
Active learning: Model identifies uncertain cases, routes to human reviewers
Synthetic data: Augmentation for rare categories (paraphrasing, backtranslation)
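The active-learning bullet can be sketched as uncertainty sampling: send the items whose scores are closest to 0.5 to reviewers first. This is a minimal sketch assuming a binary classifier with probability outputs; `select_for_labeling` and the sample data are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; highest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_for_labeling(scored_items, budget):
    """Return the ids of the `budget` most uncertain items.

    scored_items: list of (item_id, violation_probability) pairs.
    """
    ranked = sorted(scored_items, key=lambda pair: binary_entropy(pair[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:budget]]


items = [("a", 0.99), ("b", 0.52), ("c", 0.10), ("d", 0.45)]
print(select_for_labeling(items, budget=2))
```

Labeling budget goes to the examples the model is least sure about, which tends to improve the decision boundary faster than labeling random samples.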

Common Trap

Content moderation has severe class imbalance AND evolving definitions. "Hate speech" is culturally and temporally dependent: what counts as hate speech shifts over time and varies by region. Your model must be quick to update (new policy → new labels → fine-tune → deploy in days, not months). Discuss this in the interview.

Step 4: Model (8 min)

The Layered Approach

Three-Layer Content Moderation - Hash Matching → ML Classifiers → Human Review Queue

Layer 1 - Hash Matching (deterministic):

PhotoDNA / perceptual hashing for known CSAM and terrorist content
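Match against shared hash databases (e.g., NCMEC hash database); perceptual hashes match within a small distance threshold, so resized or re-encoded copies are still caught
Near-100% precision for known content, but no coverage of novel content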
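Perceptual-hash matching can be sketched as a Hamming-distance comparison between fixed-length hashes. PhotoDNA itself is proprietary, so the 64-bit integer hashes, the distance threshold, and the function names below are assumptions for illustration only.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def matches_known_hash(candidate: int, known_hashes, max_distance: int = 4) -> bool:
    """True if the candidate is within max_distance bits of any known hash.

    max_distance is a hypothetical tolerance; real systems tune it to
    trade off missed near-duplicates against false matches.
    """
    return any(hamming_distance(candidate, known) <= max_distance for known in known_hashes)


known = {0xF0F0F0F0F0F0F0F0}           # stand-in for a hash database entry
near_copy = 0xF0F0F0F0F0F0F0F1         # differs by one bit (e.g. re-encoded)
unrelated = 0x0F0F0F0F0F0F0F0F         # differs in all 64 bits

print(matches_known_hash(near_copy, known))
print(matches_known_hash(unrelated, known))
```

A linear scan is fine for a sketch; at database scale you would use an index built for nearest-neighbor search over hashes.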

Layer 2 - ML Classifiers:

Separate classifiers per modality and category
Multi-task learning for related categories
Calibrated outputs for threshold-based decisions
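Threshold-based decisions on calibrated scores can be sketched as two cut-offs per category: a high one for auto-removal and a lower one for human review. The threshold values and the `decide` helper are hypothetical; in practice each threshold is chosen from a validation set to hit the precision or recall target for that category.

```python
# Hypothetical thresholds; None means the category is never auto-removed.
THRESHOLDS = {
    "nudity": {"auto_remove": 0.97, "review": 0.60},
    "hate_speech": {"auto_remove": None, "review": 0.50},
    "spam": {"auto_remove": None, "review": 0.80},
}


def decide(category, score):
    """Map a calibrated probability to an action for one category."""
    limits = THRESHOLDS[category]
    auto = limits["auto_remove"]
    if auto is not None and score >= auto:
        return "auto_remove"
    if score >= limits["review"]:
        return "human_review"
    return "allow"


print(decide("nudity", 0.98))
print(decide("hate_speech", 0.99))
```

Calibration matters here because a threshold of 0.97 only corresponds to a precision target if a score of 0.97 really means about a 97% chance of a violation.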

Layer 3 - Human Review:

Prioritized queue (severity × volume × uncertainty)
Human decisions become training data (active learning loop)
Reviewer consensus (2-3 reviewers for borderline cases)
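The prioritized queue can be sketched with a heap keyed on severity × volume × uncertainty. The severity weights, the use of view count as "volume", and the uncertainty formula (distance of the score from 0.5) are all assumptions for illustration.

```python
import heapq
import itertools

# Hypothetical weights; higher means more severe.
SEVERITY_WEIGHT = {"critical": 4, "high": 3, "medium": 2, "low": 1}


class ReviewQueue:
    """Max-priority queue built on heapq (a min-heap) by negating priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, item_id, severity, views, score):
        # Uncertainty is 1.0 at score 0.5 and 0.0 at score 0 or 1.
        uncertainty = 1.0 - abs(2.0 * score - 1.0)
        priority = SEVERITY_WEIGHT[severity] * views * uncertainty
        heapq.heappush(self._heap, (-priority, next(self._counter), item_id))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = ReviewQueue()
queue.push("a", "low", views=1000, score=0.5)
queue.push("b", "high", views=500, score=0.6)
queue.push("c", "critical", views=100, score=0.9)
order = [queue.pop() for _ in range(len(queue))]
print(order)
```

Because the terms are multiplied, a confidently scored item drops to the bottom of the queue; that is acceptable only if confident scores are already handled by automatic actions upstream.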

Step 5: Serving (8 min)

Architecture Decisions

Component	Decision	Rationale
Pre-publish vs. post-publish	Hybrid: pre-publish for severe categories, post-publish ML + distribution reduction for others	Balance speed with user experience
Model serving	GPU cluster for image/video models, CPU for text	Cost optimization
Human review platform	Priority queue with SLAs by severity	Critical content reviewed in <1 hour
Appeals	Separate review team; the original reviewer never handles the appeal	Fairness requirement
Feedback loop	Human decisions → retrain weekly	Continuous model improvement

Handling Viral Content

When a post goes viral (shared 10K+ times):

Proactive review: Flag for human review even if ML score is low
Distribution throttle: Slow sharing while review is pending
Batch action: If violation confirmed, remove all reshares simultaneously
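The three viral-content steps can be sketched as two event handlers: one that runs on every share and one that runs when a reviewer decides. The `Post` fields, status strings, and handler names are hypothetical; only the 10K share threshold comes from the text.

```python
from dataclasses import dataclass, field

VIRAL_SHARE_THRESHOLD = 10_000  # from the text above


@dataclass
class Post:
    post_id: str
    share_count: int = 0
    reshare_ids: list = field(default_factory=list)
    throttled: bool = False
    removed: bool = False
    review_status: str = "none"  # none | pending | violating | clean


def on_share(post, reshare_id):
    """Record a reshare; on crossing the threshold, flag and throttle."""
    post.share_count += 1
    post.reshare_ids.append(reshare_id)
    if post.share_count >= VIRAL_SHARE_THRESHOLD and post.review_status == "none":
        post.review_status = "pending"   # proactive review, regardless of ML score
        post.throttled = True            # slow sharing while review is pending


def on_review_decision(post, violating):
    """Apply the reviewer's decision; return ids to remove in one batch."""
    post.throttled = False
    if violating:
        post.review_status = "violating"
        post.removed = True
        return [post.post_id, *post.reshare_ids]
    post.review_status = "clean"
    return []
```

A usage example: a post at 9,999 shares receives one more share, is throttled and queued, and is then removed together with its reshares once a reviewer confirms the violation.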

Step 6: Evaluation & Iteration (8 min)

Metrics

Metric	What It Measures	Target
Precision (auto-remove)	% of auto-removed content that was actually violating	> 95%
Recall (severe)	% of severe violations caught	> 99%
Recall (all)	% of all violations caught	> 85%
Time to action	Median time from post to removal	< 1 hour for severe
Appeal overturn rate	% of appealed decisions reversed	< 10%
Reviewer agreement	Inter-annotator agreement	Cohen's κ > 0.7
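The precision, recall, and agreement metrics in the table can be computed from labeled samples as follows. This is a dependency-free sketch with toy data; the function names are illustrative, and a library implementation would normally be used in production.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = violating)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(precision, recall, kappa)
```

Recall is the hard one to measure in practice, because the denominator includes violations nobody caught; it is usually estimated by having reviewers label a random sample of all content.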

Practice Problems

Problem 1: Meme Moderation

Direction

Memes combine text and images where neither may be harmful alone. "Cute puppy" + "I want to destroy all [group]" = harmful meme. How do you handle this?

Key Insight

Use a multi-modal model (CLIP-based or similar) that processes text and image jointly, not independently. The model learns that certain text-image combinations are harmful even when each modality alone is benign. Training data: annotated meme datasets (e.g., Hateful Memes Challenge). The hard part is sarcasm and irony: "I love how [group] always [negative stereotype]" reads as praise on the surface, so catching it requires understanding pragmatics, not just semantics.

Problem 2: Cross-Lingual Hate Speech

Direction

Your platform operates in 100+ languages. You have good English training data but minimal data for Swahili, Tagalog, and Bengali. How do you moderate content in low-resource languages?

Key Insight

Use multilingual models (mBERT, XLM-R) for zero-shot transfer from English, then fine-tune on whatever labeled data exists per language. Use translation-based augmentation: translate English training data into the target languages. For very low-resource languages, treat model scores as low-confidence signals and route more content to community moderators who are native speakers. Monitor per-language metrics so that global metrics don't hide poor performance in specific languages.
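Per-language monitoring can be sketched as recall computed separately for each language. The record layout and the toy numbers are hypothetical; the point is that a healthy global figure can coexist with a badly served language.

```python
from collections import defaultdict


def recall_by_language(records):
    """Recall per language.

    records: iterable of (language, is_violation, was_flagged) tuples,
    typically drawn from a human-labeled audit sample.
    """
    caught = defaultdict(int)
    total = defaultdict(int)
    for language, is_violation, was_flagged in records:
        if is_violation:
            total[language] += 1
            if was_flagged:
                caught[language] += 1
    return {language: caught[language] / total[language] for language in total}


# Toy audit sample: English is well covered, Swahili is not.
records = (
    [("en", True, True)] * 9
    + [("en", True, False)]
    + [("sw", True, True)]
    + [("sw", True, False)] * 3
)
per_language = recall_by_language(records)
global_recall = sum(1 for _, v, f in records if v and f) / sum(1 for _, v, _ in records if v)
print(per_language, round(global_recall, 3))
```

Here the global recall is about 0.71 while Swahili recall is 0.25, which is the kind of gap a single aggregate number hides.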

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Design content moderation"	Three-layer: hash → ML → human	"Hash matching for known content, ML classifiers for scale, human review for nuance"
"How do you handle new policies?"	Rapid iteration pipeline	"New policy → annotate examples → fine-tune → deploy in days"
"Precision vs. recall?"	Severity-based thresholds	"99% recall for child safety, 85% recall for spam - thresholds match severity"
"Multi-modal content?"	Joint models	"CLIP-based multi-modal model for text-image combinations"

Spaced Repetition Checkpoints

Day 0: Draw the three-layer moderation architecture. Explain why each layer exists.
Day 3: Explain the precision-recall trade-off for content moderation. How does severity change the threshold?
Day 7: Design content moderation for a video-sharing platform in 45 minutes.
Day 14: Discuss the human-in-the-loop pipeline. How does active learning improve the model?
Day 21: Mock interview with follow-ups on multi-modal memes, cultural context, and appeals.

What's Next

AI Chatbot System - Guardrails for AI-generated content
Anomaly Detection - Detecting unusual patterns at scale
