Design: Machine Translation - Sequence-to-Sequence at Scale
Reading time: ~22 min | Interview relevance: Medium | Roles: MLE (specialized)
The Real Interview Moment
"Design a machine translation system that supports 100+ language pairs." You describe a Transformer encoder-decoder. The interviewer asks: "You can't train a separate model for each of the 10,000 possible language pairs. How do you handle languages with very little training data? How do you know when the translation is bad enough to show a warning?"
Translation system design tests whether you can build a multi-lingual ML system with graceful degradation - high quality for popular languages, acceptable quality for rare ones, and honest quality estimation throughout.
What You Will Master
- Encoder-decoder Transformer architecture for translation
- Multilingual models: one model for many language pairs
- Low-resource language strategies: transfer learning, back-translation
- Quality estimation: knowing when translations are unreliable
- Serving: batching, caching, and latency optimization
- Evaluation: BLEU, COMET, and human evaluation
The Complete Design
Step 1: Requirements (5 min)
Functional requirements:
- Translate text between 100+ languages
- Support document translation (up to 10K words) and real-time chat translation
- Quality indicators for unreliable translations
- Formality options (formal/informal) for applicable languages
Non-functional requirements:
- Latency: <500ms for sentences (<50 words), <5s for paragraphs
- Quality: BLEU > 30 for high-resource pairs, > 15 for low-resource
- Throughput: 1M translations per day
- Cost: GPU inference costs must be sustainable
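A quick back-of-envelope check turns the throughput target into a request rate and a rough GPU count. This is a minimal sketch: the peak-to-average factor and the per-GPU throughput are assumed placeholders, not measured figures from this design.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(requests_per_day: int) -> float:
    """Average requests per second implied by a daily volume."""
    return requests_per_day / SECONDS_PER_DAY


def gpus_needed(requests_per_day: int, peak_factor: float, qps_per_gpu: float) -> int:
    """GPUs needed to absorb peak traffic.

    peak_factor and qps_per_gpu are assumptions you would replace with
    traffic data and load-test results.
    """
    peak_qps = average_qps(requests_per_day) * peak_factor
    return math.ceil(peak_qps / qps_per_gpu)


if __name__ == "__main__":
    print(round(average_qps(1_000_000), 2))   # about 11.57 requests/second on average
    print(gpus_needed(1_000_000, peak_factor=3.0, qps_per_gpu=20.0))
```

At 1M translations per day the average load is only about 12 requests per second, so the cost driver is peak traffic and latency targets rather than raw daily volume.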
Step 2: Problem Formulation (5 min)
ML problem type: Sequence-to-sequence generation.
| Approach | How It Works | Pro | Con |
|---|---|---|---|
| Bilingual models | One model per language pair | Best quality per pair | 10K+ models needed |
| Multilingual model | One model for all language pairs | Scalable, transfer learning | Quality trade-off for high-resource pairs |
| Pivot through English | Source → English → Target | Only ~2N models (into and out of English) | Errors compound across two hops, nuance is lost |
| LLM-based | GPT-4 / Claude for translation | No training needed, high quality | Expensive, slow, limited control |
Recommendation: Multilingual Transformer (NLLB/mBART style) as the primary system, with LLM fallback for quality-critical translations.
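The scaling argument behind this recommendation can be made concrete by counting directed models (one per translation direction) for each approach. A minimal sketch; the function name `model_counts` is illustrative.

```python
def model_counts(num_languages: int) -> dict:
    """Number of directed translation models each approach needs."""
    n = num_languages
    return {
        # One model for every ordered (source, target) pair.
        "bilingual": n * (n - 1),
        # One model into the pivot and one out of it per non-pivot language.
        "pivot": 2 * (n - 1),
        # A single shared model covers every direction.
        "multilingual": 1,
    }


if __name__ == "__main__":
    print(model_counts(100))
```

For 100 languages this gives 9,900 bilingual models, 198 pivot models, or 1 multilingual model, which is why the bilingual approach does not scale.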
Step 3: Data & Training (8 min)
Training Data
| Source | Volume | Quality |
|---|---|---|
| Parallel corpora (Europarl, UN documents) | 100M+ sentence pairs (high-resource) | High |
| Web-crawled (CCMatrix, ParaCrawl) | Billions of pairs | Variable |
| Back-translation | Generate synthetic parallel data | Medium-High |
| Monolingual data | Billions of sentences per language | Used for pretraining |
Low-Resource Language Strategies
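Back-translation: Train a target→source model on the available parallel data, then use it to translate monolingual target-language text into the source language. Each result pairs a machine-generated source sentence with a human-written target sentence, which gives you synthetic parallel data. Train the source→target model on real + synthetic data. Because the target side stays human-written, the model still learns to produce fluent output. The effective training set can grow to double the real parallel data or more, limited by how much monolingual target text exists and how good the reverse model is.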
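The back-translation loop can be sketched in a few lines. This is an illustration only: `toy_reverse_model` is a hypothetical stand-in for a trained target→source model, and the function names are assumptions.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source_sentence, target_sentence)


def back_translate(
    monolingual_target: List[str],
    target_to_source: Callable[[str], str],
) -> List[Pair]:
    """Build synthetic (source, target) pairs from monolingual target text.

    The target side is human-written; only the source side is machine-generated.
    """
    return [(target_to_source(t), t) for t in monolingual_target]


def build_training_set(
    real_pairs: List[Pair],
    monolingual_target: List[str],
    target_to_source: Callable[[str], str],
) -> List[Pair]:
    """Combine real parallel data with back-translated synthetic data."""
    return real_pairs + back_translate(monolingual_target, target_to_source)


def toy_reverse_model(sentence: str) -> str:
    """Hypothetical French->English word lookup standing in for a real model."""
    lexicon = {"bonjour": "hello", "monde": "world"}
    return " ".join(lexicon.get(word, word) for word in sentence.split())


if __name__ == "__main__":
    real = [("hi", "salut")]
    print(build_training_set(real, ["bonjour monde"], toy_reverse_model))
```

In practice the synthetic pairs are often tagged or down-weighted so the model can tell them apart from real data; that choice is outside this sketch.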
Web-crawled parallel data is noisy - misaligned pairs, machine-translated content, wrong language. You need a data quality pipeline: language identification, alignment scoring, deduplication, and toxicity filtering. Mention this in the interview - "garbage in, garbage out" is especially true for MT.
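A minimal sketch of such a filter is shown here. It covers language identification, a crude length-ratio check as a stand-in for alignment scoring, and deduplication; toxicity filtering is omitted. `toy_detect` is a hypothetical placeholder for a real language-ID model, and the `max_len_ratio` threshold is an assumed value.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]


def filter_parallel(
    pairs: Iterable[Pair],
    src_lang: str,
    tgt_lang: str,
    detect_lang: Callable[[str], str],
    max_len_ratio: float = 2.5,
) -> List[Pair]:
    """Keep pairs that pass basic cleanliness checks."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # empty side
        if src == tgt:
            continue  # untranslated copy
        if detect_lang(src) != src_lang or detect_lang(tgt) != tgt_lang:
            continue  # wrong language on either side
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_len_ratio:
            continue  # likely misaligned
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue  # duplicate
        seen.add(key)
        kept.append((src, tgt))
    return kept


def toy_detect(text: str) -> str:
    """Hypothetical language ID: 'fr' if a known French word appears, else 'en'."""
    french_words = {"le", "la", "chat", "bonjour"}
    return "fr" if any(w in french_words for w in text.lower().split()) else "en"


if __name__ == "__main__":
    raw = [("the cat", "le chat"), ("the cat", "le chat"), ("hello", "hello")]
    print(filter_parallel(raw, "en", "fr", toy_detect))
```

A production pipeline would typically replace the length ratio with an embedding-based alignment score, but the filter-chain structure stays the same.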
Step 4: Model Architecture (8 min)
Multilingual Transformer
- Architecture: Encoder-decoder Transformer (similar to NLLB-200)
- Language token: Prepend target language token to decoder input
- Shared vocabulary: SentencePiece tokenizer trained on all languages (256K tokens)
- Model size: 600M-3B parameters depending on quality targets
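The language-token mechanism is what lets one set of weights serve every direction: the same source sentence yields different outputs depending on the tag the decoder starts from. The sketch below assumes a `<code>` tag format and NLLB-style language codes purely for illustration; real tokenizers define their own special tokens.

```python
from typing import Dict, List


def add_language_tag(tokens: List[str], lang_code: str) -> List[str]:
    """Prepend a language tag token (the tag format here is an assumption)."""
    return [f"<{lang_code}>"] + tokens


def make_training_example(
    src_tokens: List[str], tgt_tokens: List[str], tgt_lang: str
) -> Dict[str, List[str]]:
    """Encoder sees the source; decoder input starts with the target-language tag."""
    return {
        "encoder_input": list(src_tokens),
        "decoder_input": add_language_tag(tgt_tokens, tgt_lang),
    }


if __name__ == "__main__":
    src = ["Hello", "world"]
    # Same source, two target languages, one shared model.
    print(make_training_example(src, ["Bonjour", "le", "monde"], "fra_Latn"))
    print(make_training_example(src, ["Hola", "mundo"], "spa_Latn"))
```

At inference time the decoder is seeded with only the tag, and generation continues from there in the requested language.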
Decoding Strategies
| Strategy | How | Quality | Latency |
|---|---|---|---|
| Greedy | Pick highest probability token at each step | Lowest | Fastest |
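| Beam search (k=5) | Track the top 5 partial sequences at each step, return the highest-scoring one | Good | Slower: roughly 5x the decoder compute, though batching beams on GPU usually keeps wall-clock latency under 5x |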
| Sampling + temperature | Random sampling with temperature | Creative but inconsistent | Similar to greedy |
Recommendation: Beam search with k=4-5 for batch translation, greedy with quality check for real-time chat.
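The difference between greedy and beam search is easiest to see on a toy model where the locally best first token leads to a worse overall sequence. The sketch below uses a hand-written probability table (`TOY_MODEL`, an assumption for illustration) instead of a real decoder, and omits length normalization.

```python
import math
from typing import Dict, List, Tuple

EOS = "</s>"
Hypothesis = Tuple[Tuple[str, ...], float]  # (tokens, log-probability)

# Hypothetical next-token distributions keyed by prefix.
TOY_MODEL: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"z": 0.9, "w": 0.1},
}


def next_token_probs(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Unknown prefixes end the sequence."""
    return TOY_MODEL.get(prefix, {EOS: 1.0})


def greedy_decode(max_len: int = 10) -> Hypothesis:
    """Pick the single most probable token at every step."""
    prefix: Tuple[str, ...] = ()
    logp = 0.0
    while len(prefix) < max_len:
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)
        prefix += (token,)
        logp += math.log(probs[token])
        if token == EOS:
            break
    return prefix, logp


def beam_search(k: int, max_len: int = 10) -> Hypothesis:
    """Keep the k best partial sequences at every step."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        candidates = [
            (prefix + (token,), logp + math.log(p))
            for prefix, logp in beams
            for token, p in next_token_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for hyp in candidates[:k]:
            (finished if hyp[0][-1] == EOS else beams).append(hyp)
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])


if __name__ == "__main__":
    print(greedy_decode())
    print(beam_search(k=2))
```

Greedy commits to "a" because it is the most probable first token, while a beam of 2 keeps "b" alive and finds the sequence with higher total probability. With k=1, beam search reduces to greedy.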
Step 5: Serving (8 min)
Architecture
| Component | Technology | Purpose |
|---|---|---|
| Model serving | TorchServe / Triton on GPU | Inference |
| Batching | Dynamic batching (group requests by source/target language) | GPU utilization |
| Caching | Cache common translations (Redis) | Reduce GPU calls |
| Sentence splitting | Split documents into sentences, translate, reassemble | Parallelize long documents |
| Quality estimation | Lightweight model that scores translation quality | Warn users of low quality |
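The batching and caching rows can be sketched together: serve cache hits directly, then group the misses by language pair so each GPU batch shares one direction. In this illustration a plain dict stands in for Redis and `translate_batch` is a hypothetical callable wrapping the model server. Real dynamic batching also waits a few milliseconds to fill a batch, which is omitted here.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Request(NamedTuple):
    text: str
    src_lang: str
    tgt_lang: str


def group_by_language_pair(
    requests: List[Request], max_batch_size: int
) -> List[List[Request]]:
    """Bucket requests by (source, target) and split buckets into batches."""
    buckets = defaultdict(list)
    for req in requests:
        buckets[(req.src_lang, req.tgt_lang)].append(req)
    batches = []
    for bucket in buckets.values():
        for i in range(0, len(bucket), max_batch_size):
            batches.append(bucket[i:i + max_batch_size])
    return batches


def translate_with_cache(
    requests: List[Request],
    cache: Dict[Request, str],
    translate_batch: Callable[[List[Request]], List[str]],
    max_batch_size: int = 32,
) -> List[str]:
    """Return translations in request order, calling the model only for misses."""
    # dict.fromkeys removes duplicate misses while keeping first-seen order.
    misses = list(dict.fromkeys(r for r in requests if r not in cache))
    for batch in group_by_language_pair(misses, max_batch_size):
        for req, output in zip(batch, translate_batch(batch)):
            cache[req] = output
    return [cache[r] for r in requests]


if __name__ == "__main__":
    def fake_model(batch: List[Request]) -> List[str]:
        return [r.text.upper() for r in batch]  # placeholder "translation"

    reqs = [Request("hi", "en", "fr"), Request("yo", "en", "de"), Request("hi", "en", "fr")]
    print(translate_with_cache(reqs, {}, fake_model))
```

The cache key includes both language codes, since the same source text translates differently per target.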
Quality Estimation
A separate model that predicts translation quality without reference translations:
- Input: Source sentence + translated sentence
- Output: Quality score (0-1)
- Use cases: Show quality warnings, route to human translators, filter training data
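How the score drives these use cases can be sketched as a simple threshold router. The thresholds below are assumed placeholders; in practice they would be calibrated per language pair against human judgments.

```python
def route_translation(
    qe_score: float, warn_below: float = 0.7, human_below: float = 0.4
) -> str:
    """Map a quality-estimation score in [0, 1] to a serving decision.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if not 0.0 <= qe_score <= 1.0:
        raise ValueError("qe_score must be in [0, 1]")
    if qe_score < human_below:
        return "route_to_human"
    if qe_score < warn_below:
        return "show_with_warning"
    return "show"


if __name__ == "__main__":
    for score in (0.9, 0.55, 0.2):
        print(score, route_translation(score))
```

The same score can be reused offline, for example dropping training pairs below a cutoff.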
Step 6: Evaluation (5 min)
Metrics
| Metric | What It Measures | Limitation |
|---|---|---|
| BLEU | N-gram overlap with reference | Doesn't capture meaning, penalizes valid alternatives |
| COMET | Learned metric, correlates with human judgment | Requires trained model |
| chrF | Character n-gram F-score | Still surface overlap with a reference; more robust than BLEU for morphologically rich languages |
| Human evaluation | Side-by-side preference, adequacy, fluency | Expensive, slow |
Recommendation: Use COMET as primary automated metric, BLEU for backwards compatibility, human evaluation for major releases.
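To see why BLEU penalizes valid alternatives, here is a simplified sentence-level BLEU. It is a teaching sketch, not the standard implementation: it uses whitespace tokenization, n-grams up to 2 by default (standard BLEU uses up to 4 at corpus level), a single reference, and no smoothing.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis: str, reference: str, max_n: int = 2) -> float:
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Penalize hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat is on the mat"
    print(sentence_bleu("the cat is on the mat", reference))
    print(sentence_bleu("a cat sits on the mat", reference))
```

The second hypothesis is a reasonable paraphrase, yet it scores only about half of the exact match because it shares fewer n-grams with the reference. For reported numbers, use a standard tool such as sacreBLEU rather than a hand-rolled metric.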
Practice Problems
Problem 1: Translate Code-Switched Text
Direction
Users write messages mixing two languages: "Hey can you translate esto para mi?" (English + Spanish). How does your system handle this?
Key Insight
Code-switching detection: identify which segments are in which language. Options: (1) Translate the entire message treating it as the dominant language - the multilingual model may handle intra-sentence code-switching. (2) Segment by language, translate segments independently, recombine. (3) Train on code-switched data if available. The multilingual model approach is simplest and often works, because web-crawled training data typically contains some code-switched text. Confirm this on a code-switched test set before relying on it.
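Option (2) can be sketched with word-level language tagging followed by merging adjacent words in the same language. `TOY_LEXICON` is a hypothetical three-word Spanish list used only to make the example run; a real system would use a token-level language-ID model.

```python
from itertools import groupby
from typing import List, Tuple

# Hypothetical lexicon: any word not listed falls back to the default language.
TOY_LEXICON = {"esto": "es", "para": "es", "mi": "es"}


def tag_word(word: str, default_lang: str = "en") -> str:
    """Look up a word's language, ignoring case and trailing punctuation."""
    return TOY_LEXICON.get(word.lower().strip("?!.,"), default_lang)


def segment_by_language(message: str, default_lang: str = "en") -> List[Tuple[str, str]]:
    """Split a message into consecutive (language, text) segments."""
    words = message.split()
    return [
        (lang, " ".join(group))
        for lang, group in groupby(words, key=lambda w: tag_word(w, default_lang))
    ]


if __name__ == "__main__":
    print(segment_by_language("Hey can you translate esto para mi?"))
```

Translating segments independently loses cross-segment context, which is one reason option (1) is usually tried first.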
Problem 2: Domain-Specific Translation
Direction
Medical documents require precise translation - "benign" in medical context has specific meaning. How do you improve domain-specific quality?
Key Insight
Domain adaptation: (1) Fine-tune on in-domain parallel data (medical, legal, technical). (2) Terminology enforcement - maintain a glossary of domain terms that must be translated consistently. (3) Prefix-based domain conditioning - add a domain token ("<medical>") to signal the model. (4) Retrieval-augmented translation - retrieve similar translated sentences from a domain-specific translation memory.
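Two of these techniques are simple enough to sketch: prefixing a domain token and checking a translation against a glossary. The glossary entry and function names are illustrative assumptions, and the substring matching is deliberately crude (it ignores inflection and word boundaries).

```python
from typing import Dict, List


def add_domain_token(source: str, domain: str) -> str:
    """Prefix the source with a domain tag such as <medical>."""
    return f"<{domain}> {source}"


def glossary_violations(
    source: str, translation: str, glossary: Dict[str, str]
) -> List[str]:
    """Return source terms whose required target term is missing from the output."""
    src, out = source.lower(), translation.lower()
    return [
        term
        for term, required in glossary.items()
        if term.lower() in src and required.lower() not in out
    ]


if __name__ == "__main__":
    glossary = {"benign": "benigno"}  # hypothetical English->Spanish entry
    source = "The tumor is benign."
    print(add_domain_token(source, "medical"))
    print(glossary_violations(source, "El tumor es inofensivo.", glossary))
    print(glossary_violations(source, "El tumor es benigno.", glossary))
```

A violation can trigger a retry with constrained decoding or a handoff to a human reviewer, depending on how critical the document is.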
Interview Cheat Sheet
| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Design translation" | Multilingual Transformer + quality estimation | "Single multilingual model with language tokens, quality estimation for reliability" |
| "Low-resource languages?" | Transfer + back-translation | "Multilingual pretraining for transfer, back-translation for data augmentation" |
| "How do you evaluate?" | COMET + human eval | "COMET as primary automated metric, human eval for major releases" |
Spaced Repetition Checkpoints
- Day 0: Explain the encoder-decoder Transformer architecture for translation.
- Day 3: Compare bilingual vs. multilingual vs. pivot approaches. Trade-offs?
- Day 7: Design a translation system for a messaging app in 45 minutes.
- Day 14: Explain back-translation and why it helps low-resource languages.
- Day 21: Mock interview with follow-ups on quality estimation and domain adaptation.
What's Next
- Speech Recognition - Another sequence-to-sequence system with streaming constraints
- AI Chatbot System - Translation as a component of multilingual chatbots
