Design: Content Moderation - Multi-Modal Classification with Human-in-the-Loop
Reading time: ~25 min | Interview relevance: High | Roles: MLE, AI Eng
The Real Interview Moment
"Design a content moderation system for a social media platform." You describe a text classifier for hate speech. The interviewer asks: "What about images? What about memes that combine text and images where neither is harmful alone but the combination is? What about sarcasm? What about different cultural norms - something acceptable in one country is offensive in another?"
Content moderation is the design problem that tests your understanding of multi-modal ML, human-AI collaboration, and the limits of AI. No model can handle this alone - the strongest designs show a layered system where ML handles scale and humans handle nuance.
What You Will Master
- Multi-modal classification: text, images, video, audio
- Multi-label taxonomy: hate speech, violence, nudity, spam, misinformation
- Human-in-the-loop review and active learning
- Policy enforcement and appeals process
- Handling cultural context and evolving policies
- Balancing speed (remove harmful content fast) with accuracy (don't over-censor)
The Complete Design
Step 1: Requirements (5 min)
Functional requirements:
- Classify content across categories: hate speech, violence, nudity, spam, misinformation, self-harm, copyright
- Handle multiple modalities: text, images, video, audio, text-in-image
- Support user reporting and appeals
- Process 500M posts per day
Non-functional requirements:
- Latency: <500ms for synchronous moderation (before content is visible)
- Recall: >99% for severe violations (child safety, terrorism), >85% across all violation categories
- Precision: >95% for auto-removal (false removals are costly)
- Scalability: Handle viral content (1M+ shares in hours)
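A quick back-of-envelope conversion of the 500M posts/day requirement into a per-second rate helps size the serving tier. The sketch below is only arithmetic; the 3x peak factor is an assumption for illustration, not a figure from the requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(posts_per_day: int) -> float:
    """Average posts per second if traffic were perfectly flat."""
    return posts_per_day / SECONDS_PER_DAY


def peak_qps(posts_per_day: int, peak_factor: float = 3.0) -> float:
    """Peak rate under an ASSUMED peak-to-average ratio (hypothetical 3x)."""
    return average_qps(posts_per_day) * peak_factor


avg = average_qps(500_000_000)
peak = peak_qps(500_000_000)
print(f"average: {avg:,.0f} posts/s")
print(f"assumed 3x peak: {peak:,.0f} posts/s")
```

At roughly 5.8K posts/s on average, every synchronous check has to be cheap, which is one reason to reserve the expensive models for content that the cheaper layers cannot resolve.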
Step 2: Problem Formulation (5 min)
ML problem type: Multi-label, multi-modal classification with severity scoring.
| Category | Severity | Action | Latency Requirement |
|---|---|---|---|
| Child safety / terrorism | Critical | Auto-remove + report to authorities | Real-time (<100ms) |
| Nudity / graphic violence | High | Auto-remove, allow appeal | <500ms |
| Hate speech / harassment | Medium | Queue for review, reduce distribution | <1 min |
| Spam / clickbait | Low | Reduce distribution, label | Batch OK |
| Misinformation | Complex | Label, reduce distribution, fact-check queue | Hours |
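
The table above can be encoded as a small policy lookup that maps flagged categories to enforcement actions. This is a minimal sketch: the category keys, action names, and the `plan_actions` helper are hypothetical identifiers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    severity: str
    actions: tuple[str, ...]
    latency_budget: str


# Rows mirror the severity table, most severe first. All names are illustrative.
POLICIES: dict[str, Policy] = {
    "child_safety_terrorism": Policy("critical", ("auto_remove", "report_to_authorities"), "<100ms"),
    "nudity_graphic_violence": Policy("high", ("auto_remove", "allow_appeal"), "<500ms"),
    "hate_speech_harassment": Policy("medium", ("queue_for_review", "reduce_distribution"), "<1 min"),
    "spam_clickbait": Policy("low", ("reduce_distribution", "label"), "batch"),
    "misinformation": Policy("complex", ("label", "reduce_distribution", "fact_check_queue"), "hours"),
}


def plan_actions(flagged_categories: set[str]) -> list[str]:
    """Union of actions for all flagged categories, in table order, without duplicates."""
    actions: list[str] = []
    for category, policy in POLICIES.items():
        if category in flagged_categories:
            for action in policy.actions:
                if action not in actions:
                    actions.append(action)
    return actions


print(plan_actions({"spam_clickbait", "hate_speech_harassment"}))
```

Keeping the policy in data rather than in model code means a policy change can alter actions without retraining anything.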
Step 3: Features & Data (8 min)
Per-Modality Features
| Modality | Model | Features |
|---|---|---|
| Text | Fine-tuned BERT / multilingual model | Token embeddings, sentiment, toxicity scores |
| Image | CNN (ResNet/EfficientNet) fine-tuned | Visual features, NSFW score, OCR text extraction |
| Video | Frame sampling + image model + audio model | Keyframe analysis, audio transcription, scene detection |
| Multi-modal | CLIP-style model | Text-image alignment, meme understanding |
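
For the video row, one common simplification is to score a sampled subset of frames with the image model and take the maximum. The sketch below assumes fixed-interval sampling and max aggregation; both choices, and the function names, are illustrative assumptions rather than the only reasonable design.

```python
def sample_frame_indices(total_frames: int, fps: float, every_seconds: float = 1.0) -> list[int]:
    """Indices of frames to score, one every `every_seconds` of video."""
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


def video_score(frame_scores: list[float]) -> float:
    """Max aggregation: one violating frame is enough to flag the video."""
    return max(frame_scores, default=0.0)


# A 10-second clip at 30 fps, sampled every 2 seconds.
indices = sample_frame_indices(total_frames=300, fps=30.0, every_seconds=2.0)
print(indices)
print(video_score([0.10, 0.92, 0.30]))
```

Max aggregation favors recall; a short violating segment between sampled frames can still be missed, which is why scene detection and audio transcription appear alongside keyframes in the table.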
Training Data
- Human annotations: Content review team labels examples (expensive, gold standard)
- User reports: Noisy but abundant - 10K+ reports/day
- Active learning: Model identifies uncertain cases, routes to human reviewers
- Synthetic data: Augmentation for rare categories (paraphrasing, backtranslation)
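The active learning bullet can be made concrete with uncertainty sampling: send reviewers the items whose score is closest to the decision boundary. This is a minimal sketch that assumes a single binary violation score per item and a 0.5 boundary; `select_for_review` and the post IDs are hypothetical.

```python
def select_for_review(scores: dict[str, float], budget: int) -> list[str]:
    """Return up to `budget` item IDs whose model score is closest to 0.5."""
    ranked = sorted(scores, key=lambda item_id: abs(scores[item_id] - 0.5))
    return ranked[:budget]


scores = {"post_a": 0.97, "post_b": 0.52, "post_c": 0.08, "post_d": 0.41}
picked = select_for_review(scores, budget=2)
print(picked)  # ['post_b', 'post_d']
```

Labels on these borderline items tend to move the decision boundary more than labels on items the model already scores confidently.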
Content moderation has severe class imbalance AND evolving definitions. "Hate speech" is culturally and temporally dependent - what counts as hate speech changes over time and varies by region. Your model must be quick to update (new policy → new labels → fine-tune → deploy in days, not months). Discuss this in the interview.
Step 4: Model (8 min)
The Layered Approach
Layer 1 - Hash Matching (deterministic):
- PhotoDNA / perceptual hashing for known CSAM and terrorist content
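- Match against vetted hash databases (e.g., NCMEC hash database)
- Near-100% precision for known content, since every database entry was already confirmed by human review; perceptual hashes tolerate re-encoding and resizing but can, rarely, collide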
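A generic way to picture perceptual hash matching is a Hamming-distance check against a set of known hashes. The sketch below assumes 64-bit integer hashes and a distance threshold of 4; it illustrates the general idea only and is not how PhotoDNA itself is implemented.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def matches_known_hash(candidate: int, known_hashes: set[int], max_distance: int = 4) -> bool:
    """True if the candidate is within `max_distance` bits of any known hash.

    max_distance is an ASSUMED tolerance; real systems tune it per hash function.
    """
    if candidate in known_hashes:  # fast path for exact matches
        return True
    return any(hamming_distance(candidate, known) <= max_distance for known in known_hashes)


known = {0xF0F0F0F0F0F0F0F0}
print(matches_known_hash(0xF0F0F0F0F0F0F0F1, known))  # one bit differs
print(matches_known_hash(0x0F0F0F0F0F0F0F0F, known))  # every bit differs
```

The linear scan is fine for a sketch; at production scale the lookup would need an index built for nearest-neighbor search over hashes.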
Layer 2 - ML Classifiers:
- Separate classifiers per modality and category
- Multi-task learning for related categories
- Calibrated outputs for threshold-based decisions
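Threshold-based decisions usually mean picking, per category, the lowest score threshold that still meets a precision target on held-out labeled data. The sketch below assumes binary labels and a small validation set; the function name and the data are illustrative.

```python
from typing import Optional


def threshold_for_precision(
    scores: list[float], labels: list[int], target_precision: float
) -> Optional[float]:
    """Lowest threshold whose precision on the validation set meets the target.

    Returns None when no threshold reaches the target.
    """
    best = None
    for threshold in sorted(set(scores), reverse=True):
        # Labels of everything the model would act on at this threshold.
        acted_on = [y for s, y in zip(scores, labels) if s >= threshold]
        precision = sum(acted_on) / len(acted_on)
        if precision >= target_precision:
            best = threshold  # keep going: a lower threshold means higher recall
    return best


val_scores = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60]
val_labels = [1, 1, 1, 0, 1, 0]
print(threshold_for_precision(val_scores, val_labels, 0.95))  # 0.9
print(threshold_for_precision(val_scores, val_labels, 0.75))  # 0.7
```

Raising the precision target pushes the threshold up and leaves more content for human review, which is the severity trade-off expressed as a number.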
Layer 3 - Human Review:
- Prioritized queue (severity × volume × uncertainty, where volume is the content's current reach)
- Human decisions become training data (active learning loop)
- Reviewer consensus (2-3 reviewers for borderline cases)
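The prioritized queue can be sketched with a heap keyed on severity × volume × uncertainty. Assumptions in this sketch: severity is a numeric weight, volume is a view count, and uncertainty is derived from how close the model score is to 0.5. `ReviewQueue` and the example values are hypothetical.

```python
import heapq


def priority(severity_weight: float, views: int, model_score: float) -> float:
    """severity x volume x uncertainty; uncertainty is 1.0 at score 0.5, 0.0 at 0 or 1."""
    uncertainty = 1.0 - abs(2.0 * model_score - 1.0)
    return severity_weight * views * uncertainty


class ReviewQueue:
    def __init__(self) -> None:
        self._heap: list[tuple[float, str]] = []

    def push(self, content_id: str, severity_weight: float, views: int, model_score: float) -> None:
        # heapq is a min-heap, so store the negated priority.
        p = priority(severity_weight, views, model_score)
        heapq.heappush(self._heap, (-p, content_id))

    def pop(self) -> str:
        """Return the content ID with the highest priority."""
        _, content_id = heapq.heappop(self._heap)
        return content_id


queue = ReviewQueue()
queue.push("a", severity_weight=4, views=1_000, model_score=0.5)
queue.push("b", severity_weight=2, views=50_000, model_score=0.9)
queue.push("c", severity_weight=1, views=100, model_score=0.5)
order = [queue.pop(), queue.pop(), queue.pop()]
print(order)  # ['b', 'a', 'c']
```

Because the terms multiply, a confidently scored item gets priority near zero; that is acceptable only if confident scores are handled by automatic action rather than by this queue.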
Step 5: Serving (8 min)
Architecture Decisions
| Component | Decision | Rationale |
|---|---|---|
| Pre-publish vs. post-publish | Hybrid: pre-publish for severe categories, post-publish ML + distribution reduction for others | Blocks the worst content before anyone sees it without adding delay to every post |
| Model serving | GPU cluster for image/video models, CPU for text | Cost optimization |
| Human review platform | Priority queue with SLAs by severity | Critical content reviewed in <1 hour |
| Appeals | Separate review team, cannot be same reviewer | Fairness requirement |
| Feedback loop | Human decisions → retrain weekly | Continuous model improvement |
Handling Viral Content
When a post goes viral (shared 10K+ times):
- Proactive review: Flag for human review even if ML score is low
- Distribution throttle: Slow sharing while review is pending
- Batch action: If violation confirmed, remove all reshares simultaneously
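The three viral-content rules can be sketched as a small decision function. The 10K share threshold comes from the text; the `Post` fields, status values, and action strings are hypothetical names for illustration.

```python
from dataclasses import dataclass, field

VIRAL_SHARE_THRESHOLD = 10_000  # "shared 10K+ times"


@dataclass
class Post:
    post_id: str
    share_count: int
    reshare_ids: list[str] = field(default_factory=list)
    review_status: str = "none"  # "none" | "pending" | "violation" | "cleared"


def viral_actions(post: Post) -> list[str]:
    """Actions to take for a post under the viral-content rules."""
    if post.share_count < VIRAL_SHARE_THRESHOLD:
        return []
    if post.review_status == "none":
        # Proactive review regardless of ML score, throttled while it waits.
        return ["enqueue_human_review", "throttle_sharing"]
    if post.review_status == "pending":
        return ["throttle_sharing"]
    if post.review_status == "violation":
        # Batch action: the original and every reshare are removed together.
        return [f"remove:{pid}" for pid in [post.post_id, *post.reshare_ids]]
    return []  # cleared: lift the throttle, no further action


post = Post("p1", share_count=25_000, reshare_ids=["r1", "r2"])
first = viral_actions(post)
post.review_status = "violation"
after_review = viral_actions(post)
print(first)
print(after_review)
```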
Step 6: Evaluation & Iteration (8 min)
Metrics
| Metric | What It Measures | Target |
|---|---|---|
| Precision (auto-remove) | % of auto-removed content that was actually violating | > 95% |
| Recall (severe) | % of severe violations caught | > 99% |
| Recall (all) | % of all violations caught | > 85% |
| Time to action | Median time from post to removal | < 1 hour for severe |
| Appeal overturn rate | % of appealed decisions reversed | < 10% |
| Reviewer agreement | Inter-annotator agreement | Cohen's κ > 0.7 |
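
The reviewer agreement row is worth computing by hand once, because raw agreement overstates reliability when one label dominates. The sketch below implements Cohen's κ for two raters from its standard definition; the example labels are made up.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


reviewer_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
reviewer_2 = [1, 1, 0, 0, 0, 0, 1, 0, 0, 1]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.3f}")
```

Here the reviewers agree on 8 of 10 items, yet κ comes out near 0.58, below the 0.7 target, because much of that agreement is expected by chance.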
Practice Problems
Problem 1: Meme Moderation
Direction
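Memes combine text and images where neither may be harmful alone. For example, the caption "Love the way you smell today" is harmless over a photo of flowers but becomes an attack when placed over a photo of a skunk and aimed at a specific group. How do you handle this?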
Key Insight
Use a multi-modal model (CLIP-based or similar) that processes text and image jointly, not independently. The model learns that certain text-image combinations are harmful even when each modality alone is benign. Training data: annotated meme datasets such as the Hateful Memes Challenge dataset. Remaining challenge: sarcasm and irony ("I love how [group] always [negative stereotype]") require understanding pragmatics, not just semantics.
Problem 2: Cross-Lingual Hate Speech
Direction
Your platform operates in 100+ languages. You have good English training data but minimal data for Swahili, Tagalog, and Bengali. How do you moderate content in low-resource languages?
Key Insight
Use multilingual models (mBERT, XLM-R) for zero-shot transfer from English, then fine-tune on whatever labeled data exists per language. Add translation-based augmentation: translate English training data into the target languages. For very low-resource languages, treat ML scores as low-confidence signals and route decisions to community moderators who are native speakers. Monitor per-language metrics - don't let global metrics hide poor performance in specific languages.
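The per-language monitoring point can be sketched as a breakdown of recall by language with a floor check. The record layout, language codes, counts, and the 0.85 floor used here are illustrative assumptions.

```python
from collections import defaultdict


def recall_by_language(records: list[tuple[str, bool, bool]]) -> dict[str, float]:
    """records: (language, is_violation, was_caught). Recall per language."""
    caught: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for language, is_violation, was_caught in records:
        if is_violation:
            total[language] += 1
            caught[language] += int(was_caught)
    return {language: caught[language] / total[language] for language in total}


def languages_below(recalls: dict[str, float], floor: float) -> list[str]:
    """Languages whose recall falls under the floor, sorted for stable alerts."""
    return sorted(language for language, recall in recalls.items() if recall < floor)


records = (
    [("en", True, True)] * 19 + [("en", True, False)]
    + [("sw", True, True), ("sw", True, False)]
    + [("en", False, False)]  # non-violations do not affect recall
)
recalls = recall_by_language(records)
global_recall = sum(1 for _, v, c in records if v and c) / sum(1 for _, v, _ in records if v)
print(recalls)
print(f"global recall: {global_recall:.3f}")
print(languages_below(recalls, floor=0.85))
```

In this made-up data the global recall is about 0.91 while Swahili sits at 0.5, which is exactly the gap a global metric hides.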
Interview Cheat Sheet
| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Design content moderation" | Three-layer: hash → ML → human | "Hash matching for known content, ML classifiers for scale, human review for nuance" |
| "How do you handle new policies?" | Rapid iteration pipeline | "New policy → annotate examples → fine-tune → deploy in days" |
| "Precision vs. recall?" | Severity-based thresholds | "99% recall for child safety, 85% recall for spam - thresholds match severity" |
| "Multi-modal content?" | Joint models | "CLIP-based multi-modal model for text-image combinations" |
Spaced Repetition Checkpoints
- Day 0: Draw the three-layer moderation architecture. Explain why each layer exists.
- Day 3: Explain the precision-recall trade-off for content moderation. How does severity change the threshold?
- Day 7: Design content moderation for a video-sharing platform in 45 minutes.
- Day 14: Discuss the human-in-the-loop pipeline. How does active learning improve the model?
- Day 21: Mock interview with follow-ups on multi-modal memes, cultural context, and appeals.
What's Next
- AI Chatbot System - Guardrails for AI-generated content
- Anomaly Detection - Detecting unusual patterns at scale
