Design: Speech Recognition - Streaming ASR at Scale

Reading time: ~22 min | Interview relevance: Medium | Roles: MLE (specialized)

The Real Interview Moment

"Design a speech recognition system for a voice assistant." You describe a CTC-based model. The interviewer asks: "The user says 'Set a timer for 10 minutes' - they expect to see results as they speak, not after they finish. How do you do streaming recognition? And what happens in a noisy restaurant? Or when the user has a strong accent?"

Speech recognition design tests your understanding of streaming inference, noise robustness, and the audio-to-text pipeline end-to-end.

What You Will Master

End-to-end ASR architectures (CTC, attention, transducer)
Streaming vs. offline recognition trade-offs
Language model integration and beam search decoding
Noise robustness and speaker adaptation
Serving: real-time streaming with partial results
Evaluation: WER, latency, and robustness metrics

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

Transcribe spoken language to text in real-time
Support 50+ languages
Handle voice commands (short utterances) and dictation (long-form)
Streaming: show partial results as user speaks

Non-functional requirements:

Latency: Partial results within 200ms of speech, final result within 500ms of silence
Accuracy: Word Error Rate (WER) <5% for clean speech, <15% for noisy environments
Throughput: 100K concurrent audio streams
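
The throughput requirement can be turned into a rough ingress estimate. The sketch below assumes 16 kHz, 16-bit mono PCM with no compression; those audio parameters are assumptions, not part of the stated requirements, and a real client would likely send a compressed codec.

```python
# Assumed audio format (not given in the requirements): 16 kHz, 16-bit mono PCM.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
CONCURRENT_STREAMS = 100_000  # from the throughput requirement


def ingress_bytes_per_second(streams: int,
                             sample_rate_hz: int = SAMPLE_RATE_HZ,
                             bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Raw audio bytes arriving per second across all streams."""
    return streams * sample_rate_hz * bytes_per_sample


per_stream = ingress_bytes_per_second(1)
total = ingress_bytes_per_second(CONCURRENT_STREAMS)
print(f"per stream: {per_stream / 1e3:.0f} KB/s, total: {total / 1e9:.1f} GB/s")
```

Under these assumptions each stream is 32 KB/s and the fleet ingests about 3.2 GB/s of raw audio, which is a useful number to quote when sizing the streaming front end.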

Step 2: Problem Formulation (5 min)

ML problem type: Sequence-to-sequence - audio frames → text tokens.

ASR Pipeline - Audio Waveform → Feature Extraction → Acoustic Model → Decoder → Language Model → Text Output

Step 3: Architecture (8 min)

ASR Approaches

Approach	How It Works	Streaming?	Quality
CTC (Connectionist Temporal Classification)	Encoder predicts a token or a blank at each frame; decoding merges repeats, then drops blanks	Yes	Good
Attention-based (Listen Attend Spell)	Encoder-decoder with attention	Difficult	Better
RNN-Transducer (RNN-T)	Encoder + prediction network + joint network	Yes (designed for it)	Best for streaming
Whisper-style	Large encoder-decoder trained on 680K hours	No (offline)	Best overall

Recommendation: RNN-Transducer for streaming applications (voice assistant). Whisper-style for offline transcription (meeting notes, subtitles).
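
To make the CTC row concrete, here is a minimal sketch of the greedy collapse step: merge consecutive repeated labels, then remove the blank symbol. The `_` blank character and the function name `ctc_collapse` are illustrative choices, and the input is assumed to be the per-frame argmax labels from an encoder.

```python
from itertools import groupby

BLANK = "_"  # illustrative blank symbol


def ctc_collapse(frame_labels, blank=BLANK):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


# The blank between the two "l" runs is what preserves the double letter.
print(ctc_collapse(list("hh_e_ll_lo__")))  # hello
```

The order matters: dropping blanks before merging would turn "l_l" into a single "l".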

Feature Extraction

Feature	Description	Typical Config
Mel spectrogram	Frequency representation of audio	80 mel bins, 25ms window, 10ms hop
Log-mel filterbanks	Log-scaled mel spectrogram	Most common input for ASR
Raw waveform	Direct input (no preprocessing)	Used by wav2vec 2.0, requires more data
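
The typical configuration in the table (80 mel bins, 25ms window, 10ms hop) can be sketched with NumPy alone. This is a simplified illustration, not a reference front end: the 16 kHz sample rate, the 512-point FFT, the Hann window, and the HTK-style mel formula are all assumptions, and production toolkits differ in details such as pre-emphasis and normalization.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    hz_points = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2))
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel(waveform, sample_rate=16_000, n_mels=80, win_ms=25, hop_ms=10, n_fft=512):
    """Return an array of shape (n_frames, n_mels) of log-mel energies."""
    win = int(sample_rate * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples at 16 kHz
    n_frames = 1 + (len(waveform) - win) // hop
    index = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = waveform[index] * np.hanning(win)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + 1e-10)        # small floor avoids log(0)


features = log_mel(np.random.default_rng(0).standard_normal(16_000))
print(features.shape)  # (98, 80)
```

One second of audio yields 98 frames here because no padding is applied; with a 10ms hop the encoder sees roughly 100 feature vectors per second.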

Streaming Architecture (RNN-T)

RNN-Transducer Streaming ASR - Audio Frames → Encoder; Previous Tokens → Prediction Network; Encoder output + Prediction Network output → Joint Network → Next Token (or blank)

The RNN-T is designed for streaming: the encoder processes audio frames as they arrive, and the joint network can emit tokens at any time - no need to wait for the full utterance.
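
The emit-as-you-go behavior can be illustrated with a greedy transducer decoding loop. The sketch below is a toy under stated assumptions: `ToyTransducer` is a hypothetical stand-in with random weights, its encoder is stateless per frame (a real streaming encoder carries state or limited context), and the cap of three symbols per frame is an arbitrary safety limit. Only the control flow is the point: a blank advances to the next frame, a non-blank token is emitted immediately and fed back to the prediction network.

```python
import numpy as np

BLANK = 0  # token id reserved for "emit nothing, move to the next frame"


class ToyTransducer:
    """Hypothetical, randomly initialised stand-in for a trained RNN-T."""

    def __init__(self, feat_dim=80, hidden=16, vocab=5, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden = hidden
        self.w_enc = rng.standard_normal((feat_dim, hidden)) * 0.1
        self.embed = rng.standard_normal((vocab, hidden)) * 0.1
        self.w_pred = rng.standard_normal((hidden, hidden)) * 0.1
        self.w_joint = rng.standard_normal((hidden, vocab)) * 0.1

    def encode_frame(self, frame):
        return np.tanh(frame @ self.w_enc)

    def predict(self, state, token):
        # Prediction network: conditions only on previously emitted tokens.
        return np.tanh(state @ self.w_pred + self.embed[token])

    def joint(self, enc, pred):
        # Joint network: combines both views into logits over the vocabulary.
        return np.tanh(enc + pred) @ self.w_joint


def greedy_stream(model, frames, max_symbols_per_frame=3):
    """Yield tokens as soon as they are emitted, one audio frame at a time."""
    pred_state = model.predict(np.zeros(model.hidden), BLANK)
    for frame in frames:
        enc = model.encode_frame(frame)
        for _ in range(max_symbols_per_frame):
            token = int(np.argmax(model.joint(enc, pred_state)))
            if token == BLANK:
                break  # nothing more to say for this frame
            pred_state = model.predict(pred_state, token)
            yield token  # a partial result can be sent to the client here


frames = np.random.default_rng(1).standard_normal((20, 80))
for token in greedy_stream(ToyTransducer(), frames):
    print("emitted token id:", token)
```

Because the weights are random, the emitted ids are meaningless; in a trained model the same loop produces the partial transcript.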

Step 4: Language Model Integration (5 min)

The acoustic model alone cannot separate words that sound the same ("their" vs. "there"), because the audio is identical. A language model resolves these cases from context:

Method	How It Works	Latency Impact
Shallow fusion	Add LM score to beam search	Small (+20ms)
Rescoring	Generate N-best list, rescore with LM	Medium (+100ms)
End-to-end with internal LM	Train the ASR model to include language modeling	None (built-in)

Common Trap

Don't forget about domain-specific language models. A general ASR system struggles with medical terminology, legal jargon, or product names. Fine-tune the language model (or add a domain-specific phrase list) for vertical applications. "Take 50mg of metformin" requires medical vocabulary - a general LM might predict "met for men."
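
The fusion and rescoring rows, and the domain-LM trap, can be sketched as a weighted sum of log-probabilities over an N-best list. All scores below are invented for illustration, the two dictionaries stand in for real language models, and the weight of 0.3 is an assumed value that would normally be tuned on held-out data.

```python
def fused_score(asr_logprob, lm_logprob, lm_weight=0.3):
    """Combine ASR and LM log-probabilities (weight is an assumed value)."""
    return asr_logprob + lm_weight * lm_logprob


def rescore(nbest, lm_logprob_fn, lm_weight=0.3):
    """Pick the best hypothesis from (text, asr_logprob) pairs."""
    return max(nbest, key=lambda h: fused_score(h[1], lm_logprob_fn(h[0]), lm_weight))[0]


# Hypothetical N-best list: the ASR model slightly prefers the wrong words.
nbest = [
    ("take 50mg of met for men", -4.0),
    ("take 50mg of metformin", -4.5),
]

# Hypothetical LM scores: a general LM has rarely seen "metformin".
general_lm = {"take 50mg of met for men": -6.0, "take 50mg of metformin": -12.0}
medical_lm = {"take 50mg of met for men": -14.0, "take 50mg of metformin": -5.0}

print(rescore(nbest, general_lm.get))  # take 50mg of met for men
print(rescore(nbest, medical_lm.get))  # take 50mg of metformin
```

Shallow fusion applies the same weighted sum inside beam search at every step instead of once over the finished N-best list, which is why it can recover hypotheses that rescoring never sees.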

Step 5: Serving (8 min)

Streaming Pipeline

Stage	What Happens	Latency
Audio capture	Client captures audio in 20ms chunks	Real-time
Streaming upload	WebSocket / gRPC stream to server	10-50ms network
Feature extraction	Compute mel spectrogram	5ms
Encoder	Process audio chunk	20ms
Decoding	Emit tokens (partial result)	10ms
Endpointing	Detect when user stops speaking	200-500ms silence detection
Final result	LM rescoring + punctuation + formatting	100ms
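
Summing the table gives a quick check against the latency targets. The sketch below copies the per-stage numbers from the table; treating the 20ms capture chunk as a buffering delay and assuming the stages run strictly in sequence are simplifications.

```python
# (low_ms, high_ms) per stage, copied from the Streaming Pipeline table.
PARTIAL_PATH = {
    "audio capture (fill one 20ms chunk)": (20, 20),
    "streaming upload": (10, 50),
    "feature extraction": (5, 5),
    "encoder": (20, 20),
    "decoding": (10, 10),
}
FINAL_PATH = {
    "endpointing (silence detection)": (200, 500),
    "final result (rescoring, punctuation, formatting)": (100, 100),
}


def budget(stages):
    """Return (best_case_ms, worst_case_ms) for stages run in sequence."""
    lows, highs = zip(*stages.values())
    return sum(lows), sum(highs)


for name, stages in [("partial", PARTIAL_PATH), ("final", FINAL_PATH)]:
    low, high = budget(stages)
    print(f"{name}: {low}-{high} ms")
# partial: 65-105 ms
# final: 300-600 ms
```

The partial-result path fits comfortably inside a 200ms target. The final path only meets a 500ms target if the endpointing window stays at or below 400ms, or if final processing overlaps with silence detection.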

Key Design Decisions

Decision	Choice	Rationale
Protocol	gRPC bidirectional streaming	Low latency, supports streaming both ways
Partial results	Send after each emitted token	Users see text as they speak
Endpointing	Energy-based + model-based silence detection	Determine when to finalize
Punctuation	Separate punctuation model	Streaming ASR models are often trained on unpunctuated transcripts and don't emit punctuation
GPU sharing	Batch multiple streams on one GPU	Cost efficiency
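
The energy-based half of the endpointing decision can be sketched as a small state machine over 20ms chunks. The class name `EnergyEndpointer`, the -40 dB threshold, and the 300ms silence window are assumptions chosen for illustration; a model-based detector would replace the energy test with a learned speech/non-speech classifier.

```python
import numpy as np


def chunk_energy_db(chunk):
    """RMS level of a chunk in dB relative to full scale (samples in [-1, 1])."""
    rms = np.sqrt(np.mean(np.square(chunk)) + 1e-12)
    return 20.0 * np.log10(rms)


class EnergyEndpointer:
    """Hypothetical endpointer: fires after enough silence following speech."""

    def __init__(self, threshold_db=-40.0, chunk_ms=20, silence_ms=300):
        self.threshold_db = threshold_db
        self.chunks_needed = silence_ms // chunk_ms
        self.silent_chunks = 0
        self.heard_speech = False

    def update(self, chunk):
        """Feed one chunk; return True when the utterance should be finalized."""
        if chunk_energy_db(chunk) >= self.threshold_db:
            self.heard_speech = True
            self.silent_chunks = 0       # any speech resets the silence counter
        elif self.heard_speech:
            self.silent_chunks += 1      # leading silence is ignored
        return self.heard_speech and self.silent_chunks >= self.chunks_needed


endpointer = EnergyEndpointer()
speech = 0.1 * np.ones(320)   # 20ms at 16 kHz, about -20 dB
silence = np.zeros(320)
stream = [silence] * 5 + [speech] * 10 + [silence] * 20
fired_at = next(i for i, c in enumerate(stream) if endpointer.update(c))
print("finalize after chunk", fired_at)
```

A pure energy threshold fires on pauses inside a sentence and fails in loud rooms, which is the usual argument for combining it with a model-based detector.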

Noise Robustness

Technique	How It Works
Data augmentation	Add noise, reverb, room impulse response during training
Beamforming	Use multiple microphones to focus on speaker direction
Speech enhancement	Preprocessing model that removes noise before ASR
Robust training	Multi-condition training on clean + noisy + reverberant data
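
The data augmentation row often comes down to mixing a noise clip into clean speech at a chosen signal-to-noise ratio. A minimal sketch, assuming both signals are float NumPy arrays at the same sample rate; the function name `mix_at_snr` is illustrative, and reverb via room impulse responses is not shown.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the result has the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)        # loop or trim noise to length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12     # guard against silent noise
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


rng = np.random.default_rng(0)
speech = rng.standard_normal(16_000)   # stand-in for a clean utterance
noise = rng.standard_normal(4_000)     # stand-in for a shorter noise clip
noisy_10db = mix_at_snr(speech, noise, snr_db=10.0)
```

Sampling the SNR per training example (for instance across the 5dB to 15dB range) is one common way to cover several noise conditions with a single clean dataset.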

Step 6: Evaluation (5 min)

Metrics

Metric	Definition	Target
WER	Word Error Rate: (substitutions + insertions + deletions) / total words in the reference transcript	<5% clean, <15% noisy
RTF	Real-Time Factor: processing time / audio duration	<0.3 (about 3.3x faster than real-time)
Latency	Time from speech to displayed text	<200ms partial, <500ms final
Endpointing latency	Time from silence start to final result	<500ms
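
WER and RTF are both short enough to compute by hand in an interview. The sketch below implements WER as word-level edit distance divided by the reference length, plus RTF as a plain ratio. Whitespace tokenization with no text normalization is a simplifying assumption; real evaluations usually normalize case, punctuation, and numerals first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))          # distance from empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: below 1.0 means faster than real time."""
    return processing_seconds / audio_seconds


print(wer("set a timer for ten minutes", "set the timer for ten minute"))  # 2 of 6 words wrong
print(rtf(3.0, 10.0))  # 0.3
```

Because insertions count as errors but the denominator is the reference length, WER can exceed 100% on badly hallucinated output.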

Evaluation Dimensions

By accent: Ensure WER doesn't disproportionately increase for non-native speakers
By noise level: Measure WER at different SNR levels (clean, 15dB, 10dB, 5dB)
By domain: Medical, legal, conversational - each has different vocabulary
By utterance length: Short commands vs. long dictation

Practice Problems

Problem 1: Meeting Transcription with Speaker Diarization

Direction

Transcribe a meeting with 4 speakers. Who said what?

Key Insight

Speaker diarization (who spoke when) + ASR (what they said). Pipeline: (1) Voice Activity Detection - detect speech segments. (2) Speaker embedding extraction (d-vector or ECAPA-TDNN). (3) Clustering speaker embeddings to identify speakers. (4) Assign speaker labels to ASR segments. Challenge: overlapping speech. Solutions: target-speaker extraction or multi-speaker ASR models.
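
Step (3) of that pipeline can be sketched as greedy clustering of speaker embeddings by cosine similarity. This is a deliberate simplification: the 0.7 threshold and the function name `cluster_speakers` are assumptions, the three-dimensional vectors stand in for real d-vector or ECAPA-TDNN embeddings, and production systems more commonly use agglomerative or spectral clustering.

```python
import numpy as np


def cluster_speakers(embeddings, threshold=0.7):
    """Assign each segment embedding to the closest speaker, or open a new one."""
    centroid_sums, labels = [], []
    for emb in embeddings:
        emb = emb / np.linalg.norm(emb)
        sims = [float(c @ emb / np.linalg.norm(c)) for c in centroid_sums]
        if sims and max(sims) >= threshold:
            speaker = int(np.argmax(sims))
            centroid_sums[speaker] = centroid_sums[speaker] + emb  # update centroid
        else:
            centroid_sums.append(emb.copy())   # first time we hear this voice
            speaker = len(centroid_sums) - 1
        labels.append(speaker)
    return labels


# Toy embeddings for five speech segments from two speakers.
segments = np.array([
    [1.00, 0.05, 0.00],
    [0.00, 1.00, 0.05],
    [0.95, 0.00, 0.10],
    [0.05, 0.90, 0.00],
    [1.00, 0.10, 0.05],
])
print(cluster_speakers(segments))  # [0, 1, 0, 1, 0]
```

The resulting labels are then attached to the ASR segments by time overlap; overlapping speech breaks the one-speaker-per-segment assumption this sketch relies on.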

Problem 2: Custom Wake Word

Direction

Design a system that listens for "Hey Assistant" and activates only then. It must run on-device with minimal battery usage.

Key Insight

Wake word detection is a small-footprint keyword spotting problem. Use a tiny neural network (10K-100K parameters) that runs continuously on a low-power DSP chip. Train with positive examples of the wake word + negative examples of everything else. Key requirements: very low false rejection rate (<3%) and very low false acceptance rate (<1 per day). Use a keyword-specific model rather than general ASR - it is much smaller and more power-efficient.
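
Two quick calculations show why these targets are demanding. The sketch assumes the detector makes 10 independent decisions per second on audio that never contains the wake word, and that weights are stored as 8-bit integers; all three are assumptions for illustration rather than figures from the problem statement.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_false_positive_rate(max_false_accepts_per_day, decisions_per_second):
    """Per-decision false positive probability that keeps daily accepts in budget."""
    return max_false_accepts_per_day / (decisions_per_second * SECONDS_PER_DAY)


def model_size_kb(n_params, bytes_per_param=1):
    """Weight storage in kilobytes (1 byte per parameter assumes int8)."""
    return n_params * bytes_per_param / 1000


rate = max_false_positive_rate(max_false_accepts_per_day=1, decisions_per_second=10)
print(f"per-decision false positive rate must stay below {rate:.2e}")
print(f"model size: {model_size_kb(10_000):.0f}-{model_size_kb(100_000):.0f} KB")
```

With 864,000 decisions per day, one false accept per day corresponds to roughly one error per million decisions, while the whole model fits in about 10-100 KB of weight storage.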

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Design speech recognition"	Feature extraction → encoder → decoder → LM	"Conformer encoder, RNN-T for streaming, LM rescoring for accuracy"
"How do you handle streaming?"	RNN-Transducer	"RNN-T emits tokens as audio arrives - no need to wait for complete utterance"
"Noisy environments?"	Multi-condition training + enhancement	"Data augmentation with noise/reverb, optional speech enhancement preprocessing"
"Different accents?"	Diverse training data + adaptation	"Train on diverse accents, fine-tune for specific demographics, monitor WER by accent"

Spaced Repetition Checkpoints

Day 0: Draw the ASR pipeline from memory. Explain CTC vs. RNN-T.
Day 3: Explain streaming ASR. Why is RNN-T better suited for streaming than attention-based models?
Day 7: Design a voice assistant ASR system in 45 minutes.
Day 14: Discuss noise robustness techniques. How do you evaluate across environments?
Day 21: Mock interview with follow-ups on speaker diarization, wake word detection, and on-device inference.

What's Next

A/B Testing Platform - How to evaluate ML system changes rigorously
Machine Translation - Another sequence-to-sequence system
