Design: Speech Recognition - Streaming ASR at Scale
Reading time: ~22 min | Interview relevance: Medium | Roles: MLE (specialized)
The Real Interview Moment
"Design a speech recognition system for a voice assistant." You describe a CTC-based model. The interviewer asks: "The user says 'Set a timer for 10 minutes' - they expect to see results as they speak, not after they finish. How do you do streaming recognition? And what happens in a noisy restaurant? Or when the user has a strong accent?"
Speech recognition design tests your understanding of streaming inference, noise robustness, and the audio-to-text pipeline end-to-end.
What You Will Master
- End-to-end ASR architectures (CTC, attention, transducer)
- Streaming vs. offline recognition trade-offs
- Language model integration and beam search decoding
- Noise robustness and speaker adaptation
- Serving: real-time streaming with partial results
- Evaluation: WER, latency, and robustness metrics
The Complete Design
Step 1: Requirements (5 min)
Functional requirements:
- Transcribe spoken language to text in real-time
- Support 50+ languages
- Handle voice commands (short utterances) and dictation (long-form)
- Streaming: show partial results as user speaks
Non-functional requirements:
- Latency: Partial results within 200ms of speech, final result within 500ms of silence
- Accuracy: Word Error Rate (WER) <5% for clean speech, <15% for noisy environments
- Throughput: 100K concurrent audio streams
Step 2: Problem Formulation (5 min)
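ML problem type: Sequence-to-sequence - map a long sequence of audio frames to a much shorter sequence of text tokens, with no frame-level alignment provided in the training data.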
Step 3: Architecture (8 min)
ASR Approaches
| Approach | How It Works | Streaming? | Quality |
|---|---|---|---|
| CTC (Connectionist Temporal Classification) | Encoder predicts a token or a blank at each frame; decoding collapses repeats and removes blanks | Yes | Good |
| Attention-based (Listen Attend Spell) | Encoder-decoder with attention | Difficult | Better |
| RNN-Transducer (RNN-T) | Encoder + prediction network + joint network | Yes (designed for it) | Best for streaming |
| Whisper-style | Large encoder-decoder trained on 680K hours | No (offline) | Best overall |
Recommendation: RNN-Transducer for streaming applications (voice assistant). Whisper-style for offline transcription (meeting notes, subtitles).
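The CTC row's "collapse repeats" step is easy to get wrong, so here is a minimal sketch of greedy CTC post-processing. It assumes the per-frame argmax labels are already available; the `_` blank symbol and the function name `ctc_collapse` are illustrative choices, not part of any particular toolkit.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol for this sketch

def ctc_collapse(frame_labels):
    """Collapse per-frame labels: merge adjacent repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)

# Per-frame argmax labels for the word "hello". The blank between the two
# runs of "l" is what allows the output to keep a double letter.
frames = list("hh_e_ll_l_oo")
print(ctc_collapse(frames))
```

The order matters: repeats are merged before blanks are removed. Removing blanks first would merge the two "l" runs into one.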
Feature Extraction
| Feature | Description | Typical Config |
|---|---|---|
| Mel spectrogram | Frequency representation of audio | 80 mel bins, 25ms window, 10ms hop |
| Log-mel filterbanks | Log-scaled mel spectrogram | Same 80-bin framing; the most common ASR input |
| Raw waveform | Direct input (no preprocessing) | Used by wav2vec 2.0, requires more data |
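To make the "25ms window, 10ms hop" configuration concrete, the sketch below works out how many feature frames one second of audio produces. The 16 kHz sample rate is an assumption (the table does not state one), and `num_frames` is a hypothetical helper that counts only full windows, with no padding.

```python
def num_frames(num_samples, sample_rate=16000, window_ms=25, hop_ms=10):
    """Number of full analysis windows that fit in the signal (no padding)."""
    window = sample_rate * window_ms // 1000   # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000         # 160 samples at 16 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop

one_second = 16000  # samples, under the assumed 16 kHz rate
frames = num_frames(one_second)
print(frames, "frames x 80 mel bins per second of audio")
```

Under these assumptions the encoder sees roughly 100 frames per second, which is why many encoders subsample in time before the expensive layers.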
Streaming Architecture (RNN-T)
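The RNN-T is designed for streaming: the encoder processes audio frames as they arrive, and at each frame the joint network either emits a token or emits a blank to move on to the next frame, so there is no need to wait for the full utterance. This only works if the encoder itself is causal or uses limited lookahead.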
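The emit-or-advance behaviour can be sketched as a greedy decoding loop. This is a toy illustration, not a real RNN-T: `joint` is a hypothetical stand-in for the prediction network, joint network, and argmax combined, and `toy_joint` fakes it so the loop can run without a model.

```python
BLANK = 0  # hypothetical blank token id

def rnnt_greedy_stream(encoder_frames, joint, max_symbols_per_frame=3):
    """Yield the growing hypothesis each time a non-blank token is emitted.

    `joint(frame, prev_token)` returns a token id; BLANK means
    "advance to the next audio frame".
    """
    hypothesis = []
    prev_token = BLANK
    for frame in encoder_frames:              # frames arrive one at a time
        for _ in range(max_symbols_per_frame):
            token = joint(frame, prev_token)
            if token == BLANK:
                break                          # nothing more to emit here
            hypothesis.append(token)
            prev_token = token
            yield list(hypothesis)             # partial result for the client

# Toy joint: each frame "contains" one token id (0 = silence). The token is
# emitted once, and blank is returned while the same token repeats.
def toy_joint(frame, prev_token):
    return BLANK if frame == prev_token else frame

partials = list(rnnt_greedy_stream([0, 5, 5, 0, 7], toy_joint))
print(partials)
```

Each yielded list is a partial result that could be pushed to the client immediately. The cap on symbols per frame is a common safeguard against a frame that never emits blank.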
Step 4: Language Model Integration (5 min)
The acoustic model alone cannot resolve words that sound the same ("their" vs. "there") and tends to mis-transcribe rare words. A language model uses context to pick the more plausible word sequence:
| Method | How It Works | Latency Impact |
|---|---|---|
| Shallow fusion | Add LM score to beam search | Small (+20ms) |
| Rescoring | Generate N-best list, rescore with LM | Medium (+100ms) |
| End-to-end with internal LM | Train the ASR model to include language modeling | None (built-in) |
Don't forget about domain-specific language models. A general ASR system struggles with medical terminology, legal jargon, or product names. Fine-tune the language model (or add a domain-specific phrase list) for vertical applications. "Take 50mg of metformin" requires medical vocabulary - a general LM might predict "met for men."
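As a rough illustration of shallow fusion on the metformin example, the sketch below combines an acoustic log-probability with a weighted LM log-probability. All probabilities and the `lm_weight` value are made up for illustration; in practice the weight is tuned on a development set.

```python
import math

def fused_score(am_logprob, lm_logprob, lm_weight=0.3):
    """Shallow fusion: acoustic log-prob plus a weighted LM log-prob."""
    return am_logprob + lm_weight * lm_logprob

# Hypothetical scores: the acoustic model slightly prefers the wrong phrase,
# while a medical-domain LM strongly prefers the drug name.
candidates = {
    "take 50mg of met for men": {"am": math.log(0.55), "lm": math.log(0.001)},
    "take 50mg of metformin":   {"am": math.log(0.45), "lm": math.log(0.200)},
}

best = max(
    candidates,
    key=lambda h: fused_score(candidates[h]["am"], candidates[h]["lm"]),
)
print(best)
```

With the LM weight set to zero the acoustic preference wins, and the wrong phrase is selected. That is the failure mode a domain LM is meant to fix.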
Step 5: Serving (8 min)
Streaming Pipeline
| Stage | What Happens | Latency |
|---|---|---|
| Audio capture | Client captures audio in 20ms chunks | Real-time |
| Streaming upload | WebSocket / gRPC stream to server | 10-50ms network |
| Feature extraction | Compute mel spectrogram | 5ms |
| Encoder | Process audio chunk | 20ms |
| Decoding | Emit tokens (partial result) | 10ms |
| Endpointing | Detect when user stops speaking | 200-500ms silence detection |
| Final result | LM rescoring + punctuation + formatting | 100ms |
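A quick way to sanity-check the table against the latency targets is to add up the stages. The sketch below does that with the table's numbers. It assumes audio capture costs one full 20 ms chunk, and it ignores queueing and the return trip to the client.

```python
# (stage, min_ms, max_ms) taken from the table above.
partial_path = [
    ("audio capture", 20, 20),      # assumed: wait for one full chunk
    ("streaming upload", 10, 50),
    ("feature extraction", 5, 5),
    ("encoder", 20, 20),
    ("decoding", 10, 10),
]
final_path = [
    ("endpointing", 200, 500),
    ("final result", 100, 100),
]

def budget(path):
    """Return (best_case_ms, worst_case_ms) for a list of stages."""
    return sum(lo for _, lo, _ in path), sum(hi for _, _, hi in path)

partial_lo, partial_hi = budget(partial_path)
final_lo, final_hi = budget(final_path)
print(f"partial result: {partial_lo}-{partial_hi} ms (target 200 ms)")
print(f"final after silence: {final_lo}-{final_hi} ms (target 500 ms)")
```

The partial path fits comfortably inside 200 ms. The final path only meets the 500 ms target if silence detection sits toward the low end of its 200-500 ms range, which is a trade-off worth raising in the interview.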
Key Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Protocol | gRPC bidirectional streaming | Low latency, supports streaming both ways |
| Partial results | Send after each emitted token | Users see text as they speak |
| Endpointing | Energy-based + model-based silence detection | Determine when to finalize |
| Punctuation | Separate punctuation model | ASR models don't naturally produce punctuation |
| GPU sharing | Batch multiple streams on one GPU | Cost efficiency |
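The energy-based half of the endpointing decision can be sketched as follows. The threshold, the 300 ms silence window, and the helper names are illustrative assumptions; a production endpointer would combine this with a model-based signal, as the table notes.

```python
import math

def rms(chunk):
    """Root-mean-square energy of one chunk of float samples."""
    return math.sqrt(sum(s * s for s in chunk) / len(chunk))

def find_endpoint(chunks, chunk_ms=20, energy_threshold=0.01, silence_ms=300):
    """Return the index of the chunk at which the utterance is finalized,
    or None if trailing silence never reaches `silence_ms`."""
    needed = silence_ms // chunk_ms
    silent_run = 0
    heard_speech = False
    for i, chunk in enumerate(chunks):
        if rms(chunk) >= energy_threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:              # ignore silence before any speech
            silent_run += 1
            if silent_run >= needed:
                return i
    return None

speech = [0.2, -0.2] * 160   # 320 samples = 20 ms at an assumed 16 kHz
silence = [0.0] * 320
stream = [silence] * 5 + [speech] * 10 + [silence] * 20
print(find_endpoint(stream))
```

A shorter `silence_ms` finalizes sooner but risks cutting off users who pause mid-sentence. This tension is the main reason to add a model-based signal.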
Noise Robustness
| Technique | How It Works |
|---|---|
| Data augmentation | Add noise, reverb, room impulse response during training |
| Beamforming | Use multiple microphones to focus on speaker direction |
| Speech enhancement | Preprocessing model that removes noise before ASR |
| Robust training | Multi-condition training on clean + noisy + reverberant data |
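Noise augmentation usually means mixing a noise clip into clean speech at a chosen signal-to-noise ratio. Below is a minimal sketch using only the standard library. The sine wave and Gaussian noise stand in for real recordings, and `mix_at_snr` is a hypothetical helper name.

```python
import math
import random

def power(signal):
    """Mean squared amplitude."""
    return sum(s * s for s in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so speech power / noise power matches `snr_db`, then add.
    Assumes `noise` is not all zeros."""
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]

random.seed(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [random.gauss(0.0, 1.0) for _ in range(16000)]

noisy = mix_at_snr(speech, noise, snr_db=10)

# Check: recover the added noise and measure the resulting SNR.
added = [y - s for y, s in zip(noisy, speech)]
measured = 10 * math.log10(power(speech) / power(added))
print(round(measured, 2))
```

During training the SNR is typically sampled per utterance from a range, so the model sees a spread of conditions rather than a single noise level.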
Step 6: Evaluation (5 min)
Metrics
| Metric | Definition | Target |
|---|---|---|
| WER | Word Error Rate: (substitutions + insertions + deletions) / number of words in the reference transcript | <5% clean, <15% noisy |
| RTF | Real-Time Factor: processing time / audio duration | <0.3 (about 3.3x faster than real-time) |
| Latency | Time from speech to displayed text | <200ms partial, <500ms final |
| Endpointing latency | Time from silence start to final result | <500ms |
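WER is a word-level edit distance normalized by the reference length, which the sketch below computes with standard dynamic programming. The example sentences are invented, and both function names are illustrative.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix processed so far
    # and hyp[:j]; it starts as the row for an empty reference.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                curr[j - 1] + 1,         # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1.0 means faster than real time."""
    return processing_seconds / audio_seconds

# One deletion ("a") and one substitution ("ten" -> "tin") over 6 words.
print(wer("set a timer for ten minutes", "set timer for tin minutes"))
print(real_time_factor(1.5, 5.0))
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many insertions.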
Evaluation Dimensions
- By accent: Ensure WER doesn't disproportionately increase for non-native speakers or under-represented regional accents
- By noise level: Measure WER at different SNR levels (clean, 15dB, 10dB, 5dB)
- By domain: Medical, legal, conversational - each has different vocabulary
- By utterance length: Short commands vs. long dictation
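Sliced evaluation is mostly bookkeeping. One reasonable sketch pools error counts per slice, so long utterances are not under-weighted. The slice names and counts below are hypothetical, for illustration only.

```python
from collections import defaultdict

def wer_by_slice(results):
    """results: iterable of (slice_name, edit_errors, reference_words).
    Returns pooled WER per slice: total errors / total reference words."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for name, err, n_ref in results:
        errors[name] += err
        words[name] += n_ref
    return {name: errors[name] / words[name] for name in errors}

# Hypothetical per-utterance counts.
results = [
    ("clean", 1, 40), ("clean", 1, 60),
    ("snr_5db", 6, 40), ("snr_5db", 9, 60),
]
report = wer_by_slice(results)
for name, value in report.items():
    print(f"{name}: {value:.1%}")
```

Pooling counts differs from averaging per-utterance WER, which over-weights short commands. Either can be defensible, but the choice should be stated alongside the numbers.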
Practice Problems
Problem 1: Meeting Transcription with Speaker Diarization
Direction
Transcribe a meeting with 4 speakers. Who said what?
Key Insight
Speaker diarization (who spoke when) + ASR (what they said). Pipeline: (1) Voice Activity Detection - detect speech segments. (2) Speaker embedding extraction (d-vector or ECAPA-TDNN). (3) Clustering speaker embeddings to identify speakers. (4) Assign speaker labels to ASR segments. Challenge: overlapping speech. Solutions: target-speaker extraction or multi-speaker ASR models.
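Step (3) of the pipeline above, clustering speaker embeddings, can be illustrated with a deliberately simple greedy scheme based on cosine similarity. This is a stand-in for the agglomerative or spectral clustering a real diarization system would more likely use. The 2-D vectors are toy embeddings, and the 0.8 threshold is an arbitrary assumption.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_speakers(embeddings, threshold=0.8):
    """Greedy clustering: join the most similar existing speaker if the
    similarity clears `threshold`, otherwise start a new speaker."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        if sims and max(sims) >= threshold:
            k = sims.index(max(sims))
            counts[k] += 1
            # Running mean keeps the centroid representative of the cluster.
            centroids[k] = [c + (e - c) / counts[k]
                            for c, e in zip(centroids[k], emb)]
        else:
            k = len(centroids)
            centroids.append(list(emb))
            counts.append(1)
        labels.append(k)
    return labels

# One toy embedding per speech segment, in time order.
segments = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [1.0, 0.0], [0.2, 0.9]]
print(assign_speakers(segments))
```

The returned labels line up with the segment order, so attaching them to ASR segments (step 4) is a simple zip. Overlapping speech breaks the one-label-per-segment assumption made here.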
Problem 2: Custom Wake Word
Direction
Design a system that listens for "Hey Assistant" and activates only then. It must run on-device with minimal battery usage.
Key Insight
Wake word detection is a small-footprint keyword spotting problem. Use a tiny neural network (10K-100K parameters) that runs continuously on a low-power DSP chip. Train with positive examples of the wake word plus negative examples of everything else. Key requirements: a low false rejection rate (<3%) and a very low false acceptance rate (<1 per day). Use a keyword-specific model rather than general ASR: it is much smaller and more power-efficient.
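One common way to keep false accepts down is to smooth the per-frame wake-word posterior before thresholding, so a single spurious spike does not trigger the device. The sketch below assumes the tiny network already outputs one posterior per frame. The window size, threshold, and sample values are illustrative.

```python
from collections import deque

def detect_wake_word(posteriors, window=5, threshold=0.8):
    """Return the first frame index where the moving average of the
    wake-word posterior reaches `threshold`, or None if it never does."""
    recent = deque(maxlen=window)
    for i, p in enumerate(posteriors):
        recent.append(p)
        if len(recent) == window and sum(recent) / window >= threshold:
            return i
    return None

# A single spurious spike does not fire; a sustained run does.
spike = [0.1, 0.1, 0.99, 0.1, 0.1, 0.1, 0.1]
wake = [0.1, 0.2, 0.9, 0.95, 0.9, 0.95, 0.9, 0.3]
print(detect_wake_word(spike), detect_wake_word(wake))
```

Raising the threshold or widening the window trades false accepts for false rejects, which is the tuning knob behind the two targets quoted above.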
Interview Cheat Sheet
| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Design speech recognition" | Feature extraction → encoder → decoder → LM | "Conformer encoder, RNN-T for streaming, LM rescoring for accuracy" |
| "How do you handle streaming?" | RNN-Transducer | "RNN-T emits tokens as audio arrives - no need to wait for complete utterance" |
| "Noisy environments?" | Multi-condition training + enhancement | "Data augmentation with noise/reverb, optional speech enhancement preprocessing" |
| "Different accents?" | Diverse training data + adaptation | "Train on diverse accents, fine-tune for specific demographics, monitor WER by accent" |
Spaced Repetition Checkpoints
- Day 0: Draw the ASR pipeline from memory. Explain CTC vs. RNN-T.
- Day 3: Explain streaming ASR. Why is RNN-T better suited for streaming than attention-based models?
- Day 7: Design a voice assistant ASR system in 45 minutes.
- Day 14: Discuss noise robustness techniques. How do you evaluate across environments?
- Day 21: Mock interview with follow-ups on speaker diarization, wake word detection, and on-device inference.
What's Next
- A/B Testing Platform - How to evaluate ML system changes rigorously
- Machine Translation - Another sequence-to-sequence system
