Design: Machine Translation - Sequence-to-Sequence at Scale

Reading time: ~22 min | Interview relevance: Medium | Roles: MLE (specialized)

The Real Interview Moment

"Design a machine translation system that supports 100+ languages." You describe a Transformer encoder-decoder. The interviewer asks: "With 100 languages there are roughly 10,000 translation directions, so you can't train a separate model for each one. How do you handle languages with very little training data? How do you know when the translation is bad enough to show a warning?"

Translation system design tests whether you can build a multilingual ML system that degrades gracefully: high quality for popular languages, acceptable quality for rare ones, and honest quality estimation throughout.

What You Will Master

  • Encoder-decoder Transformer architecture for translation
  • Multilingual models: one model for many language pairs
  • Low-resource language strategies: transfer learning, back-translation
  • Quality estimation: knowing when translations are unreliable
  • Serving: batching, caching, and latency optimization
  • Evaluation: BLEU, COMET, and human evaluation

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

  • Translate text between 100+ languages
  • Support document translation (up to 10K words) and real-time chat translation
  • Quality indicators for unreliable translations
  • Formality options (formal/informal) for applicable languages

Non-functional requirements:

  • Latency: <500ms for sentences (<50 words), <5s for paragraphs
  • Quality: BLEU > 30 for high-resource pairs, > 15 for low-resource
  • Throughput: 1M translations per day
  • Cost: GPU inference costs must be sustainable
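
As a quick sanity check on the throughput requirement, the arithmetic can be sketched as follows. The 5x peak-to-average factor is an assumption for illustration, not a number from the requirements.

```python
# Back-of-the-envelope load estimate for the 1M translations/day requirement.
SECONDS_PER_DAY = 24 * 60 * 60

def average_qps(requests_per_day: int) -> float:
    """Average requests per second if traffic were perfectly uniform."""
    return requests_per_day / SECONDS_PER_DAY

def peak_qps(requests_per_day: int, peak_factor: float) -> float:
    """Peak load, assuming peak traffic is `peak_factor` times the average."""
    return average_qps(requests_per_day) * peak_factor

avg = average_qps(1_000_000)
peak = peak_qps(1_000_000, peak_factor=5.0)  # 5x is an assumed ratio
print(f"average: {avg:.1f} req/s, assumed peak: {peak:.1f} req/s")
# average: 11.6 req/s, assumed peak: 57.9 req/s
```

Roughly a dozen requests per second on average is modest, so in an interview the harder constraints are usually the latency targets and the cost per GPU-hour rather than raw throughput.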

Step 2: Problem Formulation (5 min)

ML problem type: Sequence-to-sequence generation.

Approach | How It Works | Pro | Con
Bilingual models | One model per language pair | Best quality per pair | 10K+ models needed
Multilingual model | One model for all language pairs | Scalable, transfer learning | Quality trade-off for high-resource pairs
Pivot through English | Source → English → Target | Only ~2N models for N languages (into and out of English) | Errors compound, nuance is lost
LLM-based | GPT-4 / Claude for translation | No training needed, high quality | Expensive, slow, limited control

Recommendation: Multilingual Transformer (NLLB/mBART style) as the primary system, with LLM fallback for quality-critical translations.
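
The model counts behind the comparison above follow from simple counting. The sketch below assumes each translation direction needs its own model in the bilingual and pivot setups (so en→fr and fr→en count separately); the function name is hypothetical.

```python
def models_needed(num_languages: int) -> dict[str, int]:
    """Number of models each approach needs to cover every translation direction."""
    n = num_languages
    return {
        "bilingual": n * (n - 1),          # one model per directed pair
        "pivot_via_english": 2 * (n - 1),  # X->en and en->X for each non-English X
        "multilingual": 1,                 # one shared model
    }

print(models_needed(100))
# {'bilingual': 9900, 'pivot_via_english': 198, 'multilingual': 1}
```

The quadratic growth of the bilingual option is what makes it a non-starter at 100+ languages, while the pivot option stays linear but pays for it in compounded errors.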

Step 3: Data & Training (8 min)

Training Data

Source | Volume | Quality
Parallel corpora (Europarl, UN documents) | 100M+ sentence pairs (high-resource) | High
Web-crawled (CCMatrix, ParaCrawl) | Billions of pairs | Variable
Back-translation | Generate synthetic parallel data | Medium-High
Monolingual data | Billions of sentences per language | Used for pretraining

Low-Resource Language Strategies

[Figure: low-resource language strategies (transfer learning, back-translation, pivot translation, zero-shot)]

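Back-translation: Train a target→source model on the available parallel data, then use it to translate monolingual target-language text into the source language. Each machine-translated source sentence, paired with its original target sentence, becomes a synthetic training pair whose target side is genuine human text. Train the source→target model on the real and synthetic pairs together. Because monolingual text is far more plentiful than parallel text, this can substantially enlarge the effective training set for low-resource pairs.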
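
The back-translation loop described above can be sketched in a few lines. This is a minimal illustration: `toy_reverse_model` is a hypothetical word-lookup stand-in for a trained target→source model, and the `"real"`/`"synthetic"` tags are an assumed convention so the two kinds of data can be weighted differently during training.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)

def back_translate(
    target_monolingual: List[str],
    translate_tgt_to_src: Callable[[str], str],
) -> List[Pair]:
    """Build synthetic (source, target) pairs from target-side monolingual text.

    The target side is genuine human text; only the source side is machine-made.
    """
    return [(translate_tgt_to_src(t), t) for t in target_monolingual]

def build_training_set(real: List[Pair], synthetic: List[Pair]) -> List[Tuple[str, str, str]]:
    """Mix real and synthetic pairs, tagging each with its origin."""
    return [(s, t, "real") for s, t in real] + [(s, t, "synthetic") for s, t in synthetic]

# Toy stand-in for a trained Spanish->English model (hypothetical, word lookup only).
toy_es_to_en = {"hola": "hello", "mundo": "world", "gracias": "thanks"}

def toy_reverse_model(sentence: str) -> str:
    return " ".join(toy_es_to_en.get(w, w) for w in sentence.split())

real_pairs = [("hello", "hola")]
synthetic_pairs = back_translate(["hola mundo", "gracias"], toy_reverse_model)
train = build_training_set(real_pairs, synthetic_pairs)
print(train)
```

Note which side is synthetic: the model learns to produce real human target text from slightly noisy source text, which is why back-translation tends to help fluency on the output side.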

Common Trap

Web-crawled parallel data is noisy: misaligned pairs, machine-translated content, and text in the wrong language. You need a data quality pipeline with language identification, alignment scoring, deduplication, and toxicity filtering. Mention this in the interview, because "garbage in, garbage out" is especially true for MT.
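
A first pass at such a pipeline might look like the sketch below. It is deliberately simplified: the word-count ratio is only a crude proxy for alignment scoring (and does not work for languages written without spaces), toxicity filtering is omitted, and `toy_detect_lang` is a hypothetical stand-in for a real language-identification model. The 2.5 ratio threshold is an assumed value.

```python
from typing import Callable, Iterable, List, Tuple

def filter_parallel_pairs(
    pairs: Iterable[Tuple[str, str]],
    detect_lang: Callable[[str], str],
    src_lang: str,
    tgt_lang: str,
    max_length_ratio: float = 2.5,  # assumed threshold
) -> List[Tuple[str, str]]:
    """Drop empty, copied, misaligned, wrong-language, and duplicate pairs."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side is empty
        if src == tgt:
            continue  # untranslated copy
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_length_ratio:
            continue  # lengths too different, likely misaligned
        if detect_lang(src) != src_lang or detect_lang(tgt) != tgt_lang:
            continue  # wrong language on either side
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue  # duplicate pair
        seen.add(key)
        kept.append((src, tgt))
    return kept

def toy_detect_lang(text: str) -> str:
    """Hypothetical language ID: flags Spanish by a few marker words."""
    spanish_markers = {"el", "la", "es", "gato", "hola"}
    return "es" if set(text.lower().split()) & spanish_markers else "en"

raw = [
    ("the cat is black", "el gato es negro"),
    ("the cat is black", "el gato es negro"),  # duplicate
    ("hello", "hola hola hola hola"),          # length ratio 4
    ("good morning", "good morning"),          # untranslated copy
    ("hello there", "hello friend"),           # target is not Spanish
]
clean = filter_parallel_pairs(raw, toy_detect_lang, "en", "es")
print(clean)
# [('the cat is black', 'el gato es negro')]
```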

Step 4: Model Architecture (8 min)

Multilingual Transformer

  • Architecture: Encoder-decoder Transformer (similar to NLLB-200)
  • Language token: Prepend target language token to decoder input
  • Shared vocabulary: SentencePiece tokenizer trained on all languages (256K tokens)
  • Model size: 600M-3B parameters depending on quality targets
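
The language-token mechanism can be illustrated with a toy function. This is a sketch, not any library's API: the tag format and the `</s>` end marker are illustrative, and tagging the encoder input with the source language is an assumption added here (the list above only specifies the target tag on the decoder side).

```python
from typing import List, Tuple

def tag_for_multilingual_model(
    source_tokens: List[str], src_lang: str, tgt_lang: str
) -> Tuple[List[str], List[str]]:
    """Return (encoder_input, decoder_prefix) with language tags attached."""
    # Assumed convention: mark the source language on the encoder side.
    encoder_input = [f"<{src_lang}>"] + source_tokens + ["</s>"]
    # The decoder is forced to start from the target-language tag,
    # which is what tells one shared model which language to produce.
    decoder_prefix = [f"<{tgt_lang}>"]
    return encoder_input, decoder_prefix

enc, dec = tag_for_multilingual_model(["Hello", "world"], "eng_Latn", "fra_Latn")
print(enc)  # ['<eng_Latn>', 'Hello', 'world', '</s>']
print(dec)  # ['<fra_Latn>']
```

Changing only `tgt_lang` switches the output language without touching the model weights, which is what makes a single model serve every pair.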

Decoding Strategies

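Strategy | How | Quality | Latency
Greedy | Pick highest-probability token at each step | Lowest | Fastest
Beam search (k=5) | Track top 5 sequences, pick best | Good | Up to ~5x the decoder compute of greedy; wall-clock cost is usually lower because beams are batched on the GPU
Sampling + temperature | Random sampling with temperature | Creative but inconsistent | Similar to greedy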

Recommendation: Beam search with k=4-5 for batch translation, greedy with quality check for real-time chat.
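
To see why beam search can beat greedy decoding, here is a minimal sketch over a hand-built toy distribution. The probabilities in `TOY_MODEL` are made up for illustration, and the search omits length normalization, which production decoders typically add.

```python
import math
from typing import Dict, List, Tuple

# Toy next-token distribution keyed by the prefix generated so far (made-up numbers).
TOY_MODEL: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"cat": 0.55, "dog": 0.45},
    ("the",): {"cat": 0.9, "dog": 0.1},
    ("a", "cat"): {"</s>": 1.0},
    ("a", "dog"): {"</s>": 1.0},
    ("the", "cat"): {"</s>": 1.0},
    ("the", "dog"): {"</s>": 1.0},
}

def beam_search(beam_size: int, max_len: int = 10) -> Tuple[List[str], float]:
    """Return the best finished sequence and its probability. beam_size=1 is greedy."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, prob in TOY_MODEL[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the top `beam_size` expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == "</s>":
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    best_seq, best_score = max(finished, key=lambda c: c[1])
    return list(best_seq), math.exp(best_score)

greedy_seq, greedy_p = beam_search(beam_size=1)
beam_seq, beam_p = beam_search(beam_size=2)
print(greedy_seq, round(greedy_p, 2))  # ['a', 'cat', '</s>'] 0.33
print(beam_seq, round(beam_p, 2))      # ['the', 'cat', '</s>'] 0.36
```

Greedy commits to "a" because it is the most likely first token, then cannot recover; the beam keeps "the" alive and finds the higher-probability sequence overall.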

Step 5: Serving (8 min)

Architecture

Component | Technology | Purpose
Model serving | TorchServe / Triton on GPU | Inference
Batching | Dynamic batching (group requests by source/target language) | GPU utilization
Caching | Cache common translations (Redis) | Reduce GPU calls
Sentence splitting | Split documents into sentences, translate, reassemble | Parallelize long documents
Quality estimation | Lightweight model that scores translation quality | Warn users of low quality
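
The caching and sentence-splitting rows can be combined into one request path, sketched below under simplifying assumptions: a plain dict stands in for Redis, the regex splitter is naive (real systems use a language-aware segmenter), and `translate_batch` is a hypothetical callable wrapping the GPU model.

```python
import re
from typing import Callable, Dict, List, Tuple

CacheKey = Tuple[str, str, str]  # (source language, target language, sentence)

def split_sentences(text: str) -> List[str]:
    """Naive splitter on ., !, ? followed by whitespace (illustrative only)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def translate_document(
    text: str,
    src: str,
    tgt: str,
    translate_batch: Callable[[List[str], str, str], List[str]],
    cache: Dict[CacheKey, str],
) -> str:
    """Split, serve cache hits, send only the misses to the model as one batch."""
    sentences = split_sentences(text)
    misses: List[str] = []
    for s in sentences:
        if (src, tgt, s) not in cache and s not in misses:
            misses.append(s)
    if misses:
        for s, out in zip(misses, translate_batch(misses, src, tgt)):
            cache[(src, tgt, s)] = out
    # Reassemble in the original sentence order.
    return " ".join(cache[(src, tgt, s)] for s in sentences)

# Fake model for demonstration: uppercases text and records each batch it receives.
calls: List[List[str]] = []

def fake_model(batch: List[str], src: str, tgt: str) -> List[str]:
    calls.append(list(batch))
    return [s.upper() for s in batch]

cache: Dict[CacheKey, str] = {}
out1 = translate_document("Hello there. How are you? Hello there.", "en", "fr", fake_model, cache)
out2 = translate_document("How are you?", "en", "fr", fake_model, cache)
print(out1)        # HELLO THERE. HOW ARE YOU? HELLO THERE.
print(len(calls))  # 1  (the second request was served entirely from cache)
```

Keying the cache on the language pair as well as the text matters, since the same source sentence has a different translation for every target language.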

Quality Estimation

A separate model that predicts translation quality without reference translations:

  • Input: Source sentence + translated sentence
  • Output: Quality score (0-1)
  • Use cases: Show quality warnings, route to human translators, filter training data
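
How the quality score drives behavior can be sketched as a simple routing function. The thresholds (0.7 and 0.4) and the three action names are assumptions for illustration, and `toy_qe` is a hypothetical length-ratio heuristic standing in for a trained quality estimation model.

```python
from typing import Callable

def route_translation(
    source: str,
    translation: str,
    qe_score: Callable[[str, str], float],
    warn_below: float = 0.7,      # assumed threshold
    escalate_below: float = 0.4,  # assumed threshold
) -> str:
    """Decide what to do with a translation based on its estimated quality."""
    score = qe_score(source, translation)
    if score < escalate_below:
        return "escalate"  # e.g. route to a human translator
    if score < warn_below:
        return "warn"      # show the translation with a quality warning
    return "show"

def toy_qe(source: str, translation: str) -> float:
    """Hypothetical scorer: ratio of shorter to longer word count."""
    a, b = len(source.split()), len(translation.split())
    return min(a, b) / max(a, b) if max(a, b) else 0.0

print(route_translation("how are you today", "comment allez-vous aujourd'hui", toy_qe))  # show
print(route_translation("how are you today", "bien", toy_qe))                            # escalate
```

In practice the thresholds would be calibrated per language pair against human judgments, since a score of 0.6 may mean different things for a high-resource and a low-resource pair.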

Step 6: Evaluation (5 min)

Metrics

Metric | What It Measures | Limitation
BLEU | N-gram overlap with reference | Doesn't capture meaning, penalizes valid alternatives
COMET | Learned metric, correlates with human judgment | Requires trained model
chrF | Character-level F-score (more robust than BLEU for morphologically rich languages) | Still surface overlap, doesn't capture meaning
Human evaluation | Side-by-side preference, adequacy, fluency | Expensive, slow

Recommendation: Use COMET as primary automated metric, BLEU for backwards compatibility, human evaluation for major releases.
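
The weakness of BLEU is easy to demonstrate with its core building block, clipped n-gram precision. The sketch below is not full BLEU (no brevity penalty, no geometric mean over n-gram orders), and the example sentences are made up.

```python
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hypothesis: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram precision: each hypothesis n-gram is credited at most
    as many times as it appears in the reference."""
    hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return overlap / total

reference = "the meeting was postponed until next week".split()
literal = "the meeting was postponed until next week".split()
paraphrase = "the meeting has been delayed to next week".split()

print(modified_precision(literal, reference, 1))     # 1.0
print(modified_precision(paraphrase, reference, 1))  # 0.5
print(round(modified_precision(paraphrase, reference, 2), 3))  # 0.286
```

The paraphrase is a perfectly good translation, yet it loses half its unigram credit and most of its bigram credit. That gap is the reason to lead with a learned metric like COMET and keep BLEU only for comparison with older results.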

Practice Problems

Problem 1: Translate Code-Switched Text

Direction

Users write messages mixing two languages: "Hey can you translate esto para mi?" (English + Spanish). How does your system handle this?

Key Insight

Code-switching detection: identify which segments are in which language. Options: (1) Translate the entire message as if it were in the dominant language and rely on the multilingual model to handle intra-sentence code-switching. (2) Segment by language, translate segments independently, recombine. (3) Train on code-switched data if available. The multilingual model approach is the simplest and often works, because web-crawled training data usually contains some code-switched text.
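
Option (2) can be sketched as grouping consecutive tokens by detected language. This is a toy illustration: `toy_token_lang` is a hypothetical lexicon lookup, whereas real token-level language identification is considerably harder (shared words, names, borrowed terms).

```python
from itertools import groupby
from typing import Callable, List, Tuple

def segment_by_language(
    tokens: List[str], token_lang: Callable[[str], str]
) -> List[Tuple[str, List[str]]]:
    """Group consecutive tokens that share a detected language into segments."""
    return [(lang, list(group)) for lang, group in groupby(tokens, key=token_lang)]

SPANISH_WORDS = {"esto", "para", "mi"}  # hypothetical toy lexicon

def toy_token_lang(token: str) -> str:
    return "es" if token.lower() in SPANISH_WORDS else "en"

segments = segment_by_language("Hey can you translate esto para mi".split(), toy_token_lang)
print(segments)
# [('en', ['Hey', 'can', 'you', 'translate']), ('es', ['esto', 'para', 'mi'])]
```

Translating segments independently loses cross-segment context (here, "esto" refers to something outside its own segment), which is one reason option (1) is often the better default.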

Problem 2: Domain-Specific Translation

Direction

Medical documents require precise translation: "benign" has a specific meaning in a medical context. How do you improve domain-specific quality?

Key Insight

Domain adaptation: (1) Fine-tune on in-domain parallel data (medical, legal, technical). (2) Terminology enforcement - maintain a glossary of domain terms that must be translated consistently. (3) Prefix-based domain conditioning - add a domain token ("<medical>") to signal the model. (4) Retrieval-augmented translation - retrieve similar translated sentences from a domain-specific translation memory.
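
Options (2) and (3) are simple enough to sketch. The function names are hypothetical, the glossary check uses crude case-insensitive substring matching (a real system would match on lemmas or aligned tokens), and the one-entry English→Spanish glossary is purely illustrative.

```python
from typing import Dict, List

def add_domain_tag(source: str, domain: str) -> str:
    """Prefix the source with a domain token, e.g. '<medical>'."""
    return f"<{domain}> {source}"

def glossary_violations(source: str, translation: str, glossary: Dict[str, str]) -> List[str]:
    """Source terms whose required target term is missing from the translation."""
    src, out = source.lower(), translation.lower()
    return [
        term
        for term, required in glossary.items()
        if term.lower() in src and required.lower() not in out
    ]

glossary = {"benign": "benigno"}  # illustrative English -> Spanish entry
source = "The tumor is benign."

print(add_domain_tag(source, "medical"))                                  # <medical> The tumor is benign.
print(glossary_violations(source, "El tumor es inofensivo.", glossary))  # ['benign']
print(glossary_violations(source, "El tumor es benigno.", glossary))     # []
```

A post-hoc check like this only detects violations; enforcing the term during generation requires constrained decoding or training the model to respect inline terminology hints.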

Interview Cheat Sheet

Question Pattern | Framework | Key Phrases
"Design translation" | Multilingual Transformer + quality estimation | "Single multilingual model with language tokens, quality estimation for reliability"
"Low-resource languages?" | Transfer + back-translation | "Multilingual pretraining for transfer, back-translation for data augmentation"
"How do you evaluate?" | COMET + human eval | "COMET as primary automated metric, human eval for major releases"

Spaced Repetition Checkpoints

  • Day 0: Explain the encoder-decoder Transformer architecture for translation.
  • Day 3: Compare bilingual vs. multilingual vs. pivot approaches. Trade-offs?
  • Day 7: Design a translation system for a messaging app in 45 minutes.
  • Day 14: Explain back-translation and why it helps low-resource languages.
  • Day 21: Mock interview with follow-ups on quality estimation and domain adaptation.

© 2026 EngineersOfAI. All rights reserved.