Design: Machine Translation - Sequence-to-Sequence at Scale

Reading time: ~22 min | Interview relevance: Medium | Roles: MLE (specialized)

The Real Interview Moment

"Design a machine translation system that supports 100+ languages." You describe a Transformer encoder-decoder. The interviewer asks: "You can't train a separate model for each of the roughly 10,000 possible language pairs. How do you handle languages with very little training data? How do you know when a translation is bad enough to show a warning?"

Translation system design tests whether you can build a multi-lingual ML system with graceful degradation - high quality for popular languages, acceptable quality for rare ones, and honest quality estimation throughout.

What You Will Master

Encoder-decoder Transformer architecture for translation
Multilingual models: one model for many language pairs
Low-resource language strategies: transfer learning, back-translation
Quality estimation: knowing when translations are unreliable
Serving: batching, caching, and latency optimization
Evaluation: BLEU, COMET, and human evaluation

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

Translate text between 100+ languages
Support document translation (up to 10K words) and real-time chat translation
Quality indicators for unreliable translations
Formality options (formal/informal) for applicable languages

Non-functional requirements:

Latency: <500ms for sentences (<50 words), <5s for paragraphs
Quality: BLEU > 30 for high-resource pairs, > 15 for low-resource
Throughput: 1M translations per day
Cost: GPU inference costs must be sustainable
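
To make the throughput requirement concrete, the daily volume can be converted into requests per second. This is a back-of-envelope sketch only: the 5x peak-to-average factor is an assumption for illustration and does not come from the requirements above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rps(requests_per_day: int) -> float:
    """Average requests per second for a given daily volume."""
    return requests_per_day / SECONDS_PER_DAY


def peak_rps(requests_per_day: int, peak_factor: float) -> float:
    """Peak requests per second, given an assumed peak-to-average ratio."""
    return average_rps(requests_per_day) * peak_factor


avg = average_rps(1_000_000)
# The 5x peak factor is an assumption, not a stated requirement.
peak = peak_rps(1_000_000, peak_factor=5.0)
print(f"average: {avg:.1f} req/s, assumed peak: {peak:.1f} req/s")
```

At roughly a dozen requests per second on average, the system is not throughput-bound in the way a search engine is; the harder constraints are per-request latency and GPU cost.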

Step 2: Problem Formulation (5 min)

ML problem type: Sequence-to-sequence generation.

Approach	How It Works	Pro	Con
Bilingual models	One model per language pair	Best quality per pair	10K+ models needed
Multilingual model	One model for all language pairs	Scalable, transfer learning	Quality trade-off for high-resource pairs
Pivot through English	Source → English → Target	Only ~2N models (each language to and from English)	Error compounds, loses nuance
LLM-based	GPT-4 / Claude for translation	No training needed, high quality	Expensive, slow, limited control

Recommendation: Multilingual Transformer (NLLB/mBART style) as the primary system, with LLM fallback for quality-critical translations.
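
The scaling argument behind the table can be checked with a few lines of arithmetic. The sketch below counts directional models for N languages under each approach; it assumes every direction needs its own model in the bilingual case and that English is the only pivot.

```python
def model_counts(num_languages: int) -> dict:
    """Number of directional models needed to cover all pairs among N languages."""
    n = num_languages
    return {
        "bilingual": n * (n - 1),      # one model per directed language pair
        "pivot_english": 2 * (n - 1),  # X->English and English->X per non-English language
        "multilingual": 1,             # one shared model for every pair
    }


counts = model_counts(100)
print(counts)
```

For 100 languages the bilingual approach needs 9,900 models, which is where the "10K+" figure comes from.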

Step 3: Data & Training (8 min)

Training Data

Source	Volume	Quality
Parallel corpora (Europarl, UN documents)	100M+ sentence pairs (high-resource)	High
Web-crawled (CCMatrix, ParaCrawl)	Billions of pairs	Variable
Back-translation	Generate synthetic parallel data	Medium-High
Monolingual data	Billions of sentences per language	Used for pretraining

Low-Resource Language Strategies

Figure: Low-resource language strategies (transfer learning, back-translation, pivot translation, zero-shot)

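Back-translation: Train a target→source model on the available parallel data. Use it to translate monolingual target text into the source language. This yields synthetic parallel data whose source side is machine-generated and whose target side is human-written. Train the source→target model on real + synthetic data. For a low-resource pair this can double the effective training data or more, depending on how much monolingual target text is available.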
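
The data flow of back-translation can be sketched as follows. This is a minimal illustration, not a training pipeline: `reverse_translate` stands in for a trained target→source model, and the toy lookup table used here is hypothetical.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    real_pairs: List[Pair],
    monolingual_target: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Combine real pairs with synthetic pairs built from monolingual target text.

    reverse_translate plays the role of the target->source model. The synthetic
    source side is machine-generated; the target side stays human-written.
    """
    synthetic = [(reverse_translate(t), t) for t in monolingual_target]
    return real_pairs + synthetic


# Hypothetical stand-in for a trained Spanish->English model.
toy_table = {"hola": "hello", "gracias": "thank you"}

real = [("good morning", "buenos dias")]
mono = ["hola", "gracias"]
train_set = back_translate(real, mono, lambda t: toy_table.get(t, "<unk>"))
print(train_set)
```

The key design point is that the human-written text always sits on the target side, so the model learns to produce fluent output even when the synthetic source is imperfect.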

Common Trap

Web-crawled parallel data is noisy - misaligned pairs, machine-translated content, wrong language. You need a data quality pipeline: language identification, alignment scoring, deduplication, and toxicity filtering. Mention this in the interview - "garbage in, garbage out" is especially true for MT.
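
A minimal sketch of such a filtering pipeline is shown below. It covers language identification, deduplication, and a length-ratio check as a crude proxy for alignment scoring; toxicity filtering and learned alignment scores are omitted. The `detect_lang` callable, the `max_len_ratio` default, and the toy detector are assumptions for illustration; a real pipeline would plug in a trained language-ID model.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def filter_pairs(
    pairs: Iterable[Pair],
    detect_lang: Callable[[str], str],
    src_lang: str,
    tgt_lang: str,
    max_len_ratio: float = 3.0,  # assumed threshold, tune per language pair
) -> List[Pair]:
    """Keep pairs that pass language ID, a length-ratio check, and deduplication."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # empty side
        if detect_lang(src) != src_lang or detect_lang(tgt) != tgt_lang:
            continue  # wrong language on either side
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_len_ratio:
            continue  # lengths too different, likely misaligned
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue  # duplicate pair
        seen.add(key)
        kept.append((src, tgt))
    return kept


# Toy detector for illustration only (hypothetical word list).
SPANISH = {"hola", "gracias", "la"}


def toy_detect(text: str) -> str:
    return "es" if any(w in SPANISH for w in text.lower().split()) else "en"


noisy = [
    ("hello", "hola"),
    ("Hello", "Hola"),           # duplicate after lowercasing
    ("thank you", "thank you"),  # target side is not Spanish
    ("yes", "la la la la la"),   # length ratio 5 exceeds the threshold
]
clean = filter_pairs(noisy, toy_detect, "en", "es")
print(clean)
```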

Step 4: Model Architecture (8 min)

Multilingual Transformer

Architecture: Encoder-decoder Transformer (similar to NLLB-200)
Language token: Prepend target language token to decoder input
Shared vocabulary: SentencePiece tokenizer trained on all languages (256K tokens)
Model size: 600M-3B parameters depending on quality targets
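
The language-token mechanism is simple enough to show directly. In the sketch below the `<2xx>` token format and the function names are hypothetical; real models define their own language-code tokens in the tokenizer vocabulary.

```python
from typing import List, Sequence


def lang_token(lang_code: str) -> str:
    """Hypothetical language-token format, for illustration only."""
    return f"<2{lang_code}>"


def build_decoder_input(target_lang: str, generated_so_far: Sequence[str]) -> List[str]:
    """Prepend the target-language token so the decoder knows which language to emit."""
    return [lang_token(target_lang)] + list(generated_so_far)


# The same encoded source can be decoded into different languages
# just by changing the first decoder token.
start_french = build_decoder_input("fr", [])
mid_german = build_decoder_input("de", ["Guten", "Morgen"])
print(start_french, mid_german)
```

Because the target language is just another token, adding a language pair does not change the architecture, only the training data.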

Decoding Strategies

Strategy	How	Quality	Latency
Greedy	Pick highest probability token at each step	Lowest	Fastest
Beam search (k=5)	Track top 5 sequences, pick best	Good	Slower: ~5x decoder compute, though batching beams on GPU usually keeps latency well under 5x
Sampling + temperature	Random sampling with temperature	Creative but inconsistent	Similar to greedy

Recommendation: Beam search with k=4-5 for batch translation, greedy with quality check for real-time chat.
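
The difference between greedy decoding and beam search is easiest to see on a toy model where the locally best first token leads to a worse overall sequence. The sketch below is a simplified illustration: the probability table is invented, and there is no length normalization, which production beam search usually adds.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"
Seq = Tuple[str, ...]
StepFn = Callable[[Seq], Dict[str, float]]  # prefix -> next-token probabilities


def greedy_decode(step: StepFn, max_len: int = 10) -> Tuple[Seq, float]:
    """Pick the most probable token at every step."""
    seq, score = (), 0.0
    for _ in range(max_len):
        probs = step(seq)
        tok = max(probs, key=probs.get)
        seq, score = seq + (tok,), score + math.log(probs[tok])
        if tok == EOS:
            break
    return seq, score


def beam_search(step: StepFn, k: int = 5, max_len: int = 10) -> Tuple[Seq, float]:
    """Keep the k best partial sequences by total log-probability."""
    beams: List[Tuple[Seq, float]] = [((), 0.0)]
    finished: List[Tuple[Seq, float]] = []
    for _ in range(max_len):
        candidates = [
            (seq + (tok,), score + math.log(p))
            for seq, score in beams
            for tok, p in step(seq).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:k]:
            (finished if seq[-1] == EOS else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])


# Invented toy distribution: "the" looks best first, but "a cat" wins overall.
TABLE = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.55, "dog": 0.45},
    ("a",): {"cat": 0.9, "dog": 0.1},
}


def toy_step(prefix: Seq) -> Dict[str, float]:
    return TABLE.get(prefix, {EOS: 1.0})


greedy_seq, greedy_score = greedy_decode(toy_step)
beam_seq, beam_score = beam_search(toy_step, k=2)
print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

Greedy commits to "the" and ends with a sequence probability of 0.33, while a beam of width 2 keeps "a" alive and finds the 0.36 sequence.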

Step 5: Serving (8 min)

Architecture

Component	Technology	Purpose
Model serving	TorchServe / Triton on GPU	Inference
Batching	Dynamic batching (group requests by source/target language)	GPU utilization
Caching	Cache common translations (Redis)	Reduce GPU calls
Sentence splitting	Split documents into sentences, translate, reassemble	Parallelize long documents
Quality estimation	Lightweight model that scores translation quality	Warn users of low quality
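
The sentence-splitting and caching rows above can be sketched together. This is an illustration under stated assumptions: an in-process dict stands in for Redis, the regex splitter is naive (real systems use a proper sentence segmenter), and `translate_batch` is a hypothetical callable wrapping the model server.

```python
import re
from typing import Callable, Dict, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


class CachedTranslator:
    """Split a document, translate only uncached sentences in one batch, reassemble."""

    def __init__(self, translate_batch: Callable[[List[str], str, str], List[str]]):
        self._translate_batch = translate_batch
        self._cache: Dict[Tuple[str, str, str], str] = {}  # stand-in for Redis
        self.model_calls = 0

    def translate_document(self, text: str, src: str, tgt: str) -> str:
        sentences = split_sentences(text)
        # dict.fromkeys removes repeated sentences while preserving order.
        missing = [s for s in dict.fromkeys(sentences) if (src, tgt, s) not in self._cache]
        if missing:
            self.model_calls += 1
            for s, t in zip(missing, self._translate_batch(missing, src, tgt)):
                self._cache[(src, tgt, s)] = t
        return " ".join(self._cache[(src, tgt, s)] for s in sentences)


# Toy "model" that upper-cases text, so the flow is visible without a GPU.
translator = CachedTranslator(lambda sents, src, tgt: [s.upper() for s in sents])
first = translator.translate_document("Hi there. How are you? Hi there.", "en", "fr")
second = translator.translate_document("Hi there. How are you? Hi there.", "en", "fr")
print(first, translator.model_calls)
```

Translating sentence by sentence loses cross-sentence context, so this trade-off is worth naming in the interview.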

Quality Estimation

A separate model that predicts translation quality without reference translations:

Input: Source sentence + translated sentence
Output: Quality score (0-1)
Use cases: Show quality warnings, route to human translators, filter training data
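
How the quality score drives product behavior can be sketched as a simple routing function. The two thresholds below are assumptions for illustration; in practice they would be calibrated against human judgments, likely per language pair.

```python
def route_by_quality(
    score: float,
    warn_below: float = 0.7,   # assumed threshold
    human_below: float = 0.4,  # assumed threshold
) -> str:
    """Map a quality-estimation score in [0, 1] to a serving action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score < human_below:
        return "route_to_human"
    if score < warn_below:
        return "show_with_warning"
    return "show"


for s in (0.9, 0.55, 0.2):
    print(s, route_by_quality(s))
```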

Step 6: Evaluation (5 min)

Metrics

Metric	What It Measures	Limitation
BLEU	N-gram overlap with reference	Doesn't capture meaning, penalizes valid alternatives
COMET	Learned metric, correlates with human judgment	Requires trained model
chrF	Character-level F-score; more robust than BLEU for morphologically rich languages	Still surface-level matching, doesn't capture meaning
Human evaluation	Side-by-side preference, adequacy, fluency	Expensive, slow

Recommendation: Use COMET as primary automated metric, BLEU for backwards compatibility, human evaluation for major releases.
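
The BLEU limitation in the table is easy to demonstrate with clipped n-gram precision, the core quantity BLEU is built from. The sketch below is not full BLEU: it omits the brevity penalty and the geometric mean over n-gram orders, and it tokenizes on whitespace only.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis: str, reference: str, n: int) -> float:
    """Fraction of hypothesis n-grams found in the reference, with clipped counts."""
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


reference = "the cat sat on the mat"
paraphrase = "a cat was sitting on the rug"  # a valid alternative translation

p1 = clipped_precision(paraphrase, reference, 1)
p2 = clipped_precision(paraphrase, reference, 2)
print(f"unigram precision: {p1:.3f}, bigram precision: {p2:.3f}")
```

The paraphrase preserves the meaning but shares few n-grams with the reference, so overlap metrics score it poorly. This is the gap learned metrics such as COMET are meant to close.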

Practice Problems

Problem 1: Translate Code-Switched Text

Direction

Users write messages mixing two languages: "Hey can you translate esto para mi?" (English + Spanish). How does your system handle this?

Key Insight

Code-switching detection: identify which segments are in which language. Options: (1) Translate the entire message, treating it as the dominant language - the multilingual model may handle intra-sentence code-switching. (2) Segment by language, translate segments independently, recombine. (3) Train on code-switched data if available. The multilingual model approach is simplest and often works because web-scale multilingual training data typically includes some code-switched text.
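
Option (2) can be sketched as token-level language tagging followed by grouping of consecutive tokens. The tagger here is a toy with a hypothetical word list; a real system would use a token-level language-ID model, and independent segment translation can lose context at the boundaries.

```python
from itertools import groupby
from typing import Callable, List, Tuple


def segment_by_language(text: str, tag_token: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Group consecutive tokens that share a language tag into segments."""
    tokens = text.split()
    return [(lang, " ".join(group)) for lang, group in groupby(tokens, key=tag_token)]


# Toy tagger for illustration only (hypothetical word list).
SPANISH_WORDS = {"esto", "para", "mi"}


def toy_tagger(token: str) -> str:
    return "es" if token.strip("?!.,").lower() in SPANISH_WORDS else "en"


segments = segment_by_language("Hey can you translate esto para mi?", toy_tagger)
print(segments)
```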

Problem 2: Domain-Specific Translation

Direction

Medical documents require precise translation - "benign" in medical context has specific meaning. How do you improve domain-specific quality?

Key Insight

Domain adaptation: (1) Fine-tune on in-domain parallel data (medical, legal, technical). (2) Terminology enforcement - maintain a glossary of domain terms that must be translated consistently. (3) Prefix-based domain conditioning - add a domain token ("<medical>") to signal the model. (4) Retrieval-augmented translation - retrieve similar translated sentences from a domain-specific translation memory.
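
Terminology enforcement (option 2) is often implemented as a post-translation check. The sketch below flags glossary terms that appear in the source but whose required translation is missing from the output. It uses plain substring matching, which is an assumption for brevity; inflected forms would need lemmatization or alignment-based matching.

```python
from typing import Dict, List


def glossary_violations(source: str, translation: str, glossary: Dict[str, str]) -> List[str]:
    """Return source terms whose required target translation is missing."""
    src_lower, tgt_lower = source.lower(), translation.lower()
    return [
        term
        for term, required in glossary.items()
        if term.lower() in src_lower and required.lower() not in tgt_lower
    ]


# Illustrative English->Spanish medical glossary entry.
glossary = {"benign": "benigno"}

ok = glossary_violations("The tumor is benign.", "El tumor es benigno.", glossary)
bad = glossary_violations("The tumor is benign.", "El tumor es inofensivo.", glossary)
print(ok, bad)
```

A violation can trigger constrained re-decoding, a fallback model, or a quality warning, depending on how critical the domain is.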

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Design translation"	Multilingual Transformer + quality estimation	"Single multilingual model with language tokens, quality estimation for reliability"
"Low-resource languages?"	Transfer + back-translation	"Multilingual pretraining for transfer, back-translation for data augmentation"
"How do you evaluate?"	COMET + human eval	"COMET as primary automated metric, human eval for major releases"

Spaced Repetition Checkpoints

Day 0: Explain the encoder-decoder Transformer architecture for translation.
Day 3: Compare bilingual vs. multilingual vs. pivot approaches. Trade-offs?
Day 7: Design a translation system for a messaging app in 45 minutes.
Day 14: Explain back-translation and why it helps low-resource languages.
Day 21: Mock interview with follow-ups on quality estimation and domain adaptation.

What's Next

Speech Recognition - Another sequence-to-sequence system with streaming constraints
AI Chatbot System - Translation as a component of multilingual chatbots
