Design: Speech Recognition - Streaming ASR at Scale

Reading time: ~22 min | Interview relevance: Medium | Roles: MLE (specialized)

The Real Interview Moment

"Design a speech recognition system for a voice assistant." You describe a CTC-based model. The interviewer asks: "The user says 'Set a timer for 10 minutes' - they expect to see results as they speak, not after they finish. How do you do streaming recognition? And what happens in a noisy restaurant? Or when the user has a strong accent?"

Speech recognition design tests your understanding of streaming inference, noise robustness, and the audio-to-text pipeline end-to-end.

What You Will Master

  • End-to-end ASR architectures (CTC, attention, transducer)
  • Streaming vs. offline recognition trade-offs
  • Language model integration and beam search decoding
  • Noise robustness and speaker adaptation
  • Serving: real-time streaming with partial results
  • Evaluation: WER, latency, and robustness metrics

The Complete Design

Step 1: Requirements (5 min)

Functional requirements:

  • Transcribe spoken language to text in real-time
  • Support 50+ languages
  • Handle voice commands (short utterances) and dictation (long-form)
  • Streaming: show partial results as user speaks

Non-functional requirements:

  • Latency: Partial results within 200ms of speech, final result within 500ms of silence
  • Accuracy: Word Error Rate (WER) <5% for clean speech, <15% for noisy environments
  • Throughput: 100K concurrent audio streams

Step 2: Problem Formulation (5 min)

ML problem type: Sequence-to-sequence - audio frames → text tokens.

ASR Pipeline - Audio Waveform → Feature Extraction → Acoustic Model → Decoder → Language Model → Text Output

Step 3: Architecture (8 min)

ASR Approaches

| Approach | How It Works | Streaming? | Quality |
|---|---|---|---|
| CTC (Connectionist Temporal Classification) | Encoder predicts a token or a blank at each frame; repeats are collapsed, then blanks are removed | Yes | Good |
| Attention-based (Listen, Attend and Spell) | Encoder-decoder with attention | Difficult | Better |
| RNN-Transducer (RNN-T) | Encoder + prediction network + joint network | Yes (designed for it) | Best for streaming |
| Whisper-style | Large encoder-decoder trained on 680K hours | No (offline) | Best overall |

Recommendation: RNN-Transducer for streaming applications (voice assistant). Whisper-style for offline transcription (meeting notes, subtitles).
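
To make the CTC row of the table concrete, here is a minimal sketch of greedy CTC decoding: take the best label per frame, merge consecutive repeats, then drop blanks. The blank symbol `_` and the function name `ctc_greedy_collapse` are assumptions for illustration, not part of any particular toolkit.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this sketch


def ctc_greedy_collapse(frame_labels):
    """Collapse per-frame labels: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# One label per audio frame; the blank between the two "ll" runs keeps the double "l".
frames = list("hh_e_ll_ll_oo")
print(ctc_greedy_collapse(frames))  # hello
```

The blank is what lets CTC output genuinely repeated characters: without it, "ll ll" would collapse to a single "l".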

Feature Extraction

| Feature | Description | Typical Config |
|---|---|---|
| Mel spectrogram | Frequency representation of audio | 80 mel bins, 25ms window, 10ms hop |
| Log-mel filterbanks | Log-scaled mel spectrogram | Most common input for ASR |
| Raw waveform | Direct input (no preprocessing) | Used by wav2vec 2.0, requires more data |
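
The typical config in the table (80 mel bins, 25ms window, 10ms hop) can be sketched in plain NumPy. This is an illustrative implementation, not a reference front end: the 16 kHz sample rate, the 512-point FFT, the Hann window, and all function names are assumptions.

```python
import numpy as np

SAMPLE_RATE = 16_000              # assumed sample rate
WIN = SAMPLE_RATE * 25 // 1000    # 25 ms window -> 400 samples
HOP = SAMPLE_RATE * 10 // 1000    # 10 ms hop -> 160 samples
N_FFT = 512                       # assumed FFT size
N_MELS = 80


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SAMPLE_RATE):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fbank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank


def log_mel(waveform):
    """Return a (frames, N_MELS) matrix of log-mel energies."""
    n_frames = 1 + (len(waveform) - WIN) // HOP
    idx = np.arange(WIN)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = waveform[idx] * np.hanning(WIN)
    power = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1)) ** 2
    return np.log(power @ mel_filterbank().T + 1e-10)  # small floor avoids log(0)


one_second = np.random.default_rng(0).standard_normal(SAMPLE_RATE)
features = log_mel(one_second)
print(features.shape)  # (98, 80)
```

With a 10ms hop, one second of audio yields roughly 100 feature frames, which is the sequence length the encoder actually sees.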

Streaming Architecture (RNN-T)

RNN-Transducer Streaming ASR - Audio Frames → Encoder; Previous Tokens → Prediction Network; Encoder output + Prediction Network output → Joint Network → Next Token

The RNN-T is designed for streaming: the encoder processes audio frames as they arrive, and the joint network can emit tokens at any time - no need to wait for the full utterance.
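
The emit-or-advance behavior described above can be sketched as a greedy decoding loop. This is a toy illustration under stated assumptions: `toy_joint` is a hypothetical stand-in that folds the prediction network and joint network into one function, frames are plain integers rather than encoder outputs, and token id 0 is assumed to be blank.

```python
BLANK = 0                    # assumed blank token id
MAX_SYMBOLS_PER_FRAME = 5    # guard so one frame cannot emit forever


def rnnt_greedy_stream(frames, joint):
    """Yield the growing hypothesis every time a non-blank token is emitted."""
    hypothesis = []
    for frame in frames:                       # frames arrive one at a time
        for _ in range(MAX_SYMBOLS_PER_FRAME):
            token = joint(frame, hypothesis)   # best token given audio + history
            if token == BLANK:
                break                          # blank: move on to the next frame
            hypothesis.append(token)
            yield list(hypothesis)             # partial result for the client


def toy_joint(frame, history):
    """Hypothetical joint: emit the frame's token unless it was just emitted."""
    last = history[-1] if history else BLANK
    return BLANK if frame == last else frame


# 0 = silence; 3 and 7 are made-up token ids "heard" in those frames.
partials = list(rnnt_greedy_stream([0, 3, 3, 0, 7], toy_joint))
print(partials)  # [[3], [3, 7]]
```

The key property is that the loop never looks at future frames, so each partial result is available as soon as the frame that triggers it has been processed.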

Step 4: Language Model Integration (5 min)

The acoustic model alone cannot reliably choose between words that sound alike ("their" vs. "there"). A language model uses the surrounding context to resolve these:

| Method | How It Works | Latency Impact |
|---|---|---|
| Shallow fusion | Add LM score to beam search | Small (+20ms) |
| Rescoring | Generate N-best list, rescore with LM | Medium (+100ms) |
| End-to-end with internal LM | Train the ASR model to include language modeling | None (built-in) |

Common Trap

Don't forget about domain-specific language models. A general ASR system struggles with medical terminology, legal jargon, or product names. Fine-tune the language model (or add a domain-specific phrase list) for vertical applications. "Take 50mg of metformin" requires medical vocabulary - a general LM might predict "met for men."
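
A minimal sketch of N-best rescoring with a domain language model, using the metformin example. Every number here is made up for illustration: the acoustic scores, the LM scores, and the interpolation weight `LM_WEIGHT` are assumptions, and `medical_lm` is a hypothetical lookup table standing in for a real domain LM.

```python
LM_WEIGHT = 0.5  # assumed interpolation weight; normally tuned on a dev set


def rescore(nbest, lm_logprob, lm_weight=LM_WEIGHT):
    """Sort hypotheses by acoustic log-prob + lm_weight * LM log-prob (best first)."""
    scored = [(am + lm_weight * lm_logprob(text), text) for text, am in nbest]
    return [text for _, text in sorted(scored, reverse=True)]


# Hypothetical N-best list: (text, acoustic log-prob). The wrong one scores higher.
nbest = [
    ("take 50mg of met for men", -4.0),
    ("take 50mg of metformin", -4.3),
]

# Hypothetical domain LM that has seen medical text.
DOMAIN_LM = {
    "take 50mg of metformin": -2.0,
    "take 50mg of met for men": -9.0,
}


def medical_lm(text):
    return DOMAIN_LM.get(text, -20.0)  # unseen text gets a low score


best = rescore(nbest, medical_lm)[0]
print(best)  # take 50mg of metformin
```

Shallow fusion applies the same weighted sum, but inside beam search at every step instead of once on the finished N-best list.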

Step 5: Serving (8 min)

Streaming Pipeline

| Stage | What Happens | Latency |
|---|---|---|
| Audio capture | Client captures audio in 20ms chunks | Real-time |
| Streaming upload | WebSocket / gRPC stream to server | 10-50ms network |
| Feature extraction | Compute mel spectrogram | 5ms |
| Encoder | Process audio chunk | 20ms |
| Decoding | Emit tokens (partial result) | 10ms |
| Endpointing | Detect when user stops speaking | 200-500ms silence detection |
| Final result | LM rescoring + punctuation + formatting | 100ms |
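
As a sanity check, the per-stage numbers in the table can be summed against the 200ms partial-result target from the requirements. This sketch assumes the stages run strictly one after another and that the 20ms chunk counts as buffering delay; both are simplifications.

```python
PARTIAL_TARGET_MS = 200

# (stage, best-case ms, worst-case ms), taken from the table above.
PARTIAL_PATH = [
    ("audio chunk buffering", 20, 20),
    ("network", 10, 50),
    ("feature extraction", 5, 5),
    ("encoder", 20, 20),
    ("decoding", 10, 10),
]


def budget(stages):
    """Return (best-case, worst-case) total latency in milliseconds."""
    best = sum(low for _, low, _ in stages)
    worst = sum(high for _, _, high in stages)
    return best, worst


best_ms, worst_ms = budget(PARTIAL_PATH)
print(best_ms, worst_ms, worst_ms <= PARTIAL_TARGET_MS)  # 65 105 True
```

By the same arithmetic, a 500ms silence window plus 100ms of final processing would overshoot a 500ms final-result target, so under these assumptions the endpointing window needs to stay at or below about 400ms.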

Key Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Protocol | gRPC bidirectional streaming | Low latency, supports streaming both ways |
| Partial results | Send after each emitted token | Users see text as they speak |
| Endpointing | Energy-based + model-based silence detection | Determine when to finalize |
| Punctuation | Separate punctuation model | ASR models don't naturally produce punctuation |
| GPU sharing | Batch multiple streams on one GPU | Cost efficiency |
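
The energy-based half of the endpointing decision can be sketched as a small state machine: wait until speech has been heard, then finalize after enough consecutive quiet chunks. The threshold, the 400ms silence window, and the class name `EnergyEndpointer` are assumptions for illustration; a production system would combine this with a model-based detector.

```python
FRAME_MS = 20             # chunk size used by the client
SILENCE_MS = 400          # assumed trailing-silence window
ENERGY_THRESHOLD = 1e-3   # assumed mean-square energy threshold


class EnergyEndpointer:
    """Signals end of utterance after sustained low energy following speech."""

    def __init__(self, threshold=ENERGY_THRESHOLD, silence_ms=SILENCE_MS,
                 frame_ms=FRAME_MS):
        self.threshold = threshold
        self.frames_needed = silence_ms // frame_ms
        self.silent_frames = 0
        self.heard_speech = False

    def push(self, chunk):
        """Feed one chunk of samples; return True when the utterance should end."""
        energy = sum(s * s for s in chunk) / len(chunk)
        if energy >= self.threshold:
            self.heard_speech = True
            self.silent_frames = 0      # speech resets the silence counter
        elif self.heard_speech:
            self.silent_frames += 1     # only count silence after speech
        return self.heard_speech and self.silent_frames >= self.frames_needed


speech = [0.1] * 320    # 20 ms at an assumed 16 kHz
silence = [0.0] * 320
endpointer = EnergyEndpointer()
stream = [silence] * 5 + [speech] * 3 + [silence] * 25
fired_at = next(i for i, chunk in enumerate(stream) if endpointer.push(chunk))
print(fired_at)  # 27
```

Leading silence is ignored on purpose, so a user who pauses before speaking does not trigger an empty final result.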

Noise Robustness

| Technique | How It Works |
|---|---|
| Data augmentation | Add noise, reverb, room impulse response during training |
| Beamforming | Use multiple microphones to focus on speaker direction |
| Speech enhancement | Preprocessing model that removes noise before ASR |
| Robust training | Multi-condition training on clean + noisy + reverberant data |
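
The data augmentation row usually comes down to mixing a noise clip into clean speech at a chosen signal-to-noise ratio. A minimal NumPy sketch follows; the function name `mix_at_snr` and the synthetic signals are assumptions for illustration.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio is snr_db, then add it."""
    noise = np.resize(noise, speech.shape)       # repeat or trim noise to fit
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16_000) / 16_000)  # stand-in for speech
noise = rng.standard_normal(4_000)                             # shorter noise clip
noisy = mix_at_snr(speech, noise, snr_db=10)

# Verify the achieved SNR from the added component.
added = noisy - speech
measured_snr_db = 10 * np.log10(np.mean(speech ** 2) / np.mean(added ** 2))
print(round(float(measured_snr_db), 3))  # 10.0
```

Sampling `snr_db` randomly per training example (for instance across the 5dB to 15dB range used in evaluation) is one common way to get multi-condition training data.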

Step 6: Evaluation (5 min)

Metrics

| Metric | Definition | Target |
|---|---|---|
| WER | Word Error Rate: (substitutions + insertions + deletions) / total words in the reference transcript | <5% clean, <15% noisy |
| RTF | Real-Time Factor: processing time / audio duration | <0.3 (more than 3x faster than real-time) |
| Latency | Time from speech to displayed text | <200ms partial, <500ms final |
| Endpointing latency | Time from silence start to final result | <500ms |
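
The WER and RTF definitions in the table translate directly into code. This sketch computes WER with a word-level edit distance; the function names and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))     # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                            # deletion
                curr[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),   # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """Values below 1.0 mean the system runs faster than real time."""
    return processing_seconds / audio_seconds


wer = word_error_rate("set a timer for ten minutes", "set timer for tan minutes")
print(f"{wer:.3f}")                  # 0.333 (one deletion + one substitution over 6 words)
print(real_time_factor(1.5, 6.0))    # 0.25
```

Because insertions count as errors but are not bounded by the reference length, WER can exceed 100% on very poor hypotheses.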

Evaluation Dimensions

  • By accent: Ensure WER doesn't disproportionately increase for non-native speakers
  • By noise level: Measure WER at different SNR levels (clean, 15dB, 10dB, 5dB)
  • By domain: Medical, legal, conversational - each has different vocabulary
  • By utterance length: Short commands vs. long dictation
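
Slicing WER along these dimensions is mostly bookkeeping. The sketch below pools error counts per slice before dividing, so long utterances are weighted by their word count. The slice names and counts are made-up illustration data, and `wer_by_slice` is a hypothetical helper.

```python
from collections import defaultdict


def wer_by_slice(results):
    """results: iterable of (slice_name, word_errors, reference_words)."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for name, word_errors, reference_words in results:
        errors[name] += word_errors
        words[name] += reference_words
    return {name: errors[name] / words[name] for name in words}


# Hypothetical per-utterance counts tagged with a noise condition.
results = [
    ("clean", 2, 100),
    ("clean", 3, 100),
    ("snr_5db", 12, 100),
    ("snr_5db", 18, 100),
]
report = wer_by_slice(results)
print(report)
```

The same function works for accent, domain, or utterance-length buckets; only the slice label changes.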

Practice Problems

Problem 1: Meeting Transcription with Speaker Diarization

Direction

Transcribe a meeting with 4 speakers. Who said what?

Key Insight

Speaker diarization (who spoke when) + ASR (what they said). Pipeline: (1) Voice Activity Detection - detect speech segments. (2) Speaker embedding extraction (d-vector or ECAPA-TDNN). (3) Clustering speaker embeddings to identify speakers. (4) Assign speaker labels to ASR segments. Challenge: overlapping speech. Solutions: target-speaker extraction or multi-speaker ASR models.
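
Step (3) of the pipeline above, clustering speaker embeddings, can be illustrated with a simple greedy scheme: assign each segment to the most similar existing speaker, or start a new one. This is a deliberately simplified stand-in (real systems more often use agglomerative or spectral clustering), and the 2-D embeddings and the threshold are assumptions.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.7  # assumed cosine-similarity cutoff


def cluster_speakers(embeddings, threshold=SIMILARITY_THRESHOLD):
    """Greedy clustering: return one speaker index per segment embedding."""
    centroids, labels = [], []
    for emb in embeddings:
        emb = emb / np.linalg.norm(emb)
        sims = [float(emb @ (c / np.linalg.norm(c))) for c in centroids]
        if sims and max(sims) >= threshold:
            k = int(np.argmax(sims))
            centroids[k] = centroids[k] + emb    # running sum acts as the centroid
        else:
            k = len(centroids)                   # no close match: new speaker
            centroids.append(emb.copy())
        labels.append(k)
    return labels


# Toy 2-D embeddings; real d-vectors have hundreds of dimensions.
segments = ["hi everyone", "let's start", "sounds good", "agreed", "first item"]
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]])
labels = cluster_speakers(embeddings)
for speaker, text in zip(labels, segments):
    print(f"speaker_{speaker}: {text}")
```

Step (4) is then just the `zip` at the end: each ASR segment inherits the speaker label of its embedding.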

Problem 2: Custom Wake Word

Direction

Design a system that listens for "Hey Assistant" and activates only then. It must run on-device with minimal battery usage.

Key Insight

Wake word detection is a small-footprint keyword spotting problem. Use a tiny neural network (10K-100K parameters) that runs continuously on a low-power DSP chip. Train with positive examples of the wake word + negative examples of everything else. Key requirements: very low false rejection rate (<3%) and very low false acceptance rate (<1 per day). Use a keyword-specific model rather than general ASR: it is much smaller and more power-efficient.
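
The two requirements above are measured on different data: false rejections on recordings that contain the wake word, false accepts on long stretches of audio that do not. A small sketch of the arithmetic, with made-up test counts:

```python
def false_rejection_rate(missed, wake_word_utterances):
    """Fraction of genuine wake-word utterances the detector failed to trigger on."""
    return missed / wake_word_utterances


def false_accepts_per_day(false_accepts, negative_audio_hours):
    """False triggers normalized to a 24-hour period of non-wake-word audio."""
    return false_accepts / negative_audio_hours * 24


# Hypothetical evaluation run.
frr = false_rejection_rate(missed=12, wake_word_utterances=500)
fa_per_day = false_accepts_per_day(false_accepts=3, negative_audio_hours=100)

print(f"FRR: {frr:.1%}, false accepts/day: {fa_per_day:.2f}")
print(frr < 0.03 and fa_per_day < 1)  # True
```

Raising the detection threshold trades one metric for the other, so both need to be reported at the same operating point.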

Interview Cheat Sheet

| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Design speech recognition" | Feature extraction → encoder → decoder → LM | "Conformer encoder, RNN-T for streaming, LM rescoring for accuracy" |
| "How do you handle streaming?" | RNN-Transducer | "RNN-T emits tokens as audio arrives - no need to wait for complete utterance" |
| "Noisy environments?" | Multi-condition training + enhancement | "Data augmentation with noise/reverb, optional speech enhancement preprocessing" |
| "Different accents?" | Diverse training data + adaptation | "Train on diverse accents, fine-tune for specific demographics, monitor WER by accent" |

Spaced Repetition Checkpoints

  • Day 0: Draw the ASR pipeline from memory. Explain CTC vs. RNN-T.
  • Day 3: Explain streaming ASR. Why is RNN-T better suited for streaming than attention-based models?
  • Day 7: Design a voice assistant ASR system in 45 minutes.
  • Day 14: Discuss noise robustness techniques. How do you evaluate across environments?
  • Day 21: Mock interview with follow-ups on speaker diarization, wake word detection, and on-device inference.

© 2026 EngineersOfAI. All rights reserved.