Audio-Language Models
Reading time: ~28 min | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Backend Engineer, MLOps Engineer
The Voice Interface That Broke at Scale
Your team shipped a customer service voice bot six months ago. The architecture was classic: a commercial ASR provider for transcription, a GPT-4 backend for response generation, a separate TTS provider for speech synthesis. Three vendors, three billing relationships, three latency contributions. Each component worked in isolation - 5% word error rate on clean audio, coherent GPT-4 responses, natural-sounding TTS. The demo was impressive.
In production, the seams showed immediately. Your ASR provider returned transcripts with confidence scores but no timing information, so the LLM could not tell whether a customer had finished speaking or just paused. The TTS provider introduced 800ms of latency before the first word was spoken. When a customer said "no, I mean the OTHER account" mid-sentence, the bot completed its previous thought, apologized, and started over - losing the context of what "the other account" referred to across the three-system boundary.
The most damaging failure was accent and noise sensitivity. The ASR provider had not been trained on your customer base, which skewed heavily toward non-native English speakers calling from noisy environments. Word error rates on this real distribution were closer to 30%, not 5%. You were feeding GPT-4 corrupted input and the model was gamely hallucinating coherent responses to nonsense transcripts.
You had built a pipeline that was the sum of three independently mediocre components rather than one end-to-end excellent system. The problems were not bugs - they were architectural. You needed to rethink from first principles how audio and language should be unified.
This lesson is about that rethinking. It covers how modern audio-language models actually work: how audio is represented as tokens, why Whisper's training approach enabled unprecedented robustness, how audio tokens enable end-to-end voice models, and how to build production speech pipelines that handle the real-world challenges your three-vendor system could not.
The Audio Representation Problem
Language models work with sequences of discrete tokens. Audio is a continuous waveform - a 1D signal sampled at 16,000 or 44,100 samples per second. To get from raw audio to something a neural network can learn from, you need a representation that:
- Captures perceptually relevant information (what a human hears)
- Is computationally tractable (not 44,100 numbers per second)
- Enables learning of temporal patterns (what comes before and after)
Two representation families dominate:
Spectrograms: The Frequency-Domain Approach
Rather than looking at the raw waveform amplitude over time, a spectrogram shows which frequencies are present at each point in time. The log-Mel spectrogram - used by Whisper and most modern ASR systems - applies the following pipeline:
- STFT (Short-Time Fourier Transform): Divide the audio into overlapping windows (typically 25ms with 10ms stride). Apply FFT to each window to get the frequency spectrum.
- Mel filterbanks: Apply 80 (or 128) triangular filters spaced on the Mel scale - a perceptual scale that matches human hearing sensitivity (more resolution at low frequencies, less at high).
- Log compression: Take the log of filter energies, compressing the dynamic range.
The result: a 2D matrix of shape (80 mel_bins, num_frames). Each column is a snapshot of the audio's frequency content at a point in time. For 30 seconds of audio at 10ms stride: 3,000 frames. For Whisper: 80 mel bins × 3,000 frames = 240,000 values per 30-second chunk.
This is the standard input format for Whisper and most speech models.
```python
import numpy as np
import librosa


def compute_log_mel_spectrogram(
    audio_path: str,
    sample_rate: int = 16000,
    n_mels: int = 80,
    n_fft: int = 400,       # 25ms window at 16kHz
    hop_length: int = 160,  # 10ms stride at 16kHz
) -> np.ndarray:
    """
    Compute a log-Mel spectrogram - the audio representation family Whisper uses.
    Returns array of shape (n_mels, n_frames).
    """
    # Resample to the target rate and downmix to mono while loading
    audio, _ = librosa.load(audio_path, sr=sample_rate, mono=True)
    # STFT power spectrum, shape (1 + n_fft // 2, n_frames)
    stft = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)) ** 2
    # Apply Mel filterbanks, shape (n_mels, 1 + n_fft // 2)
    mel_filterbank = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels)
    mel_spec = mel_filterbank @ stft
    # Log compression (add small epsilon for numerical stability).
    # Whisper's own front end uses log10 with clamping and rescaling;
    # natural log is used here to keep the example short.
    log_mel = np.log(mel_spec + 1e-9)
    return log_mel


def normalize_spectrogram(log_mel: np.ndarray) -> np.ndarray:
    """Normalize log-Mel spectrogram to zero mean, unit variance (per-utterance)."""
    mean = log_mel.mean()
    std = log_mel.std()
    return (log_mel - mean) / (std + 1e-6)
```
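To make the Mel filterbank step more concrete, the sketch below shows where 80 equally spaced Mel points land on the Hz axis. It assumes the common HTK-style formula m = 2595 * log10(1 + f / 700); librosa defaults to a slightly different (Slaney) variant, so treat the exact numbers as illustrative. The function names and the 0-8000 Hz range are choices made for this example.

```python
import numpy as np


def hz_to_mel(hz):
    """HTK-style Hz -> Mel conversion (one of several Mel formulas in use)."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_center_frequencies(n_mels: int = 80, f_min: float = 0.0, f_max: float = 8000.0):
    """Center frequencies (Hz) of n_mels filters spaced evenly on the Mel scale."""
    # n_mels + 2 points: the two outermost points are the band edges, not centers
    mel_points = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2)
    return mel_to_hz(mel_points)[1:-1]


centers = mel_center_frequencies()
spacing = np.diff(centers)
print(f"First filters are {spacing[0]:.1f} Hz apart, last filters {spacing[-1]:.1f} Hz apart")
```

Equal steps in Mel become progressively wider steps in Hz, which is the "more resolution at low frequencies" property described above.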
Discrete Audio Tokens: The Generation Approach
For audio generation - speech synthesis, music generation, audio continuation - you need a discrete token representation that a language model can generate autoregressively. Two neural codecs dominate:
EnCodec (Meta, 2022): A convolutional encoder compresses audio into continuous latent codes, followed by Residual Vector Quantization (RVQ) that maps each latent to a sequence of discrete codebook indices. For 24kHz audio, EnCodec runs at 75 time steps per second; at the 6 kbps setting it produces 8 codebook indices per time step. Each index comes from a codebook of 1,024 entries. The full discrete representation for 1 second of audio: 75 time steps × 8 codebooks = 600 tokens.
SoundStream (Google, 2021): Similar architecture to EnCodec - convolutional encoder + RVQ. Used as the acoustic tokenizer in AudioLM and MusicLM. (MusicGen, covered below, uses EnCodec.)
These codecs make audio amenable to language modeling: instead of predicting continuous waveform samples, a model predicts the next discrete audio token - just like predicting the next text token.
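The residual quantization step can be sketched in a few lines of NumPy. This is a toy illustration, not EnCodec's implementation: the codebooks here are random rather than learned, the latent dimension of 16 is arbitrary, and the function names are invented for the example. It does show the mechanics: each codebook quantizes what the previous ones left over, and one second of audio ends up as a 75 × 8 grid of integers.

```python
import numpy as np


def rvq_encode(latents: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """
    latents:   (T, D) continuous encoder outputs
    codebooks: (Q, K, D) Q codebooks with K entries each
    returns:   (T, Q) integer indices
    """
    residual = latents.copy()
    indices = np.empty((latents.shape[0], codebooks.shape[0]), dtype=np.int64)
    for q, codebook in enumerate(codebooks):
        # Squared distance from every residual vector to every codebook entry: (T, K)
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        indices[:, q] = dists.argmin(axis=1)
        # The next codebook only has to model what this one missed
        residual = residual - codebook[indices[:, q]]
    return indices


def rvq_decode(indices: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Sum the selected entry from each codebook to approximate the latents."""
    return sum(codebooks[q][indices[:, q]] for q in range(codebooks.shape[0]))


rng = np.random.default_rng(0)
T, D, Q, K = 75, 16, 8, 1024  # 75 time steps = 1 second at 75Hz
latents = rng.normal(size=(T, D))
# Random stand-ins for learned codebooks, with a smaller scale at each stage
codebooks = np.stack([rng.normal(scale=0.5 ** q, size=(K, D)) for q in range(Q)])

tokens = rvq_encode(latents, codebooks)
reconstruction = rvq_decode(tokens, codebooks)
print(tokens.shape, tokens.size)  # (75, 8) 600
```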
Whisper: The Landmark ASR System
Radford et al. (OpenAI, 2022) published "Robust Speech Recognition via Large-Scale Weak Supervision" - the Whisper paper. The key insight was about training data, not architecture.
Training Data: The Weak Supervision Insight
Prior ASR systems were trained on carefully curated, clean, human-verified transcriptions - maybe 1,000-10,000 hours. Whisper was trained on 680,000 hours of audio scraped from the internet, paired with automatically obtained transcripts.
This data is "weakly supervised" - the transcripts were not manually verified. Many are imperfect. But at 680K hours, the scale dwarfs any manually curated dataset. The diversity covers:
- 99 languages
- Multiple speakers, accents, dialects
- Various recording conditions: studio, phone, outdoor, noisy, quiet
- Multiple tasks: transcription, translation, language identification
The weak supervision at scale produced a model far more robust to real-world conditions than anything trained on clean data.
Architecture: Encoder-Decoder Transformer
Whisper is a standard encoder-decoder transformer - the same architecture as the original Transformer paper.
Encoder: The log-Mel spectrogram passes through two 1D convolutional layers for local feature extraction; the second has stride 2, which halves the 3,000 input frames to 1,500 positions. Sinusoidal positional encodings are added, and a standard transformer encoder with multi-head self-attention produces a sequence of audio feature vectors.
Decoder: A standard transformer decoder with causal (masked) self-attention and cross-attention over the encoder output. The decoder generates transcription tokens autoregressively.
Special tokens: Whisper uses special tokens to control the task:
- <|transcribe|> or <|translate|>: whether to transcribe or translate to English
- <|en|>, <|fr|>, <|zh|>, etc.: language tokens for language identification and conditioning
- <|nospeech|>: indicates no speech detected in the audio chunk
- Timestamp tokens <|0.00|> through <|30.00|>: mark segment start and end times within the 30-second window (word-level timestamps are derived separately, from cross-attention alignment)
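As a rough illustration of how these tokens fit together, the sketch below assembles the decoder prompt as a list of strings. Two tokens not listed above, <|startoftranscript|> and <|notimestamps|>, come from the Whisper paper; the helper names are made up for this example, and the 20ms timestamp resolution is the value the paper reports. Real code would go through the tokenizer to get integer IDs rather than building strings by hand.

```python
def build_decoder_prompt(
    language: str = "en",
    task: str = "transcribe",
    timestamps: bool = False,
) -> list[str]:
    """Return the special-token prefix the decoder is conditioned on."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    prompt = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Tells the decoder to emit plain text with no timestamp tokens
        prompt.append("<|notimestamps|>")
    return prompt


def timestamp_token(seconds: float, resolution: float = 0.02) -> str:
    """Quantize a time offset inside the 30-second window to a timestamp token."""
    if not 0.0 <= seconds <= 30.0:
        raise ValueError("timestamps are relative to a 30-second window")
    quantized = round(seconds / resolution) * resolution
    return f"<|{quantized:.2f}|>"


print(build_decoder_prompt("fr", "translate"))
print(timestamp_token(1.234))
```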
Model Sizes
| Model | Parameters | Layers | Heads | WER (LibriSpeech clean) | Relative speed |
|---|---|---|---|---|---|
| tiny | 39M | 4 | 6 | 5.7% | ~32x |
| base | 74M | 6 | 8 | 4.2% | ~16x |
| small | 244M | 12 | 12 | 3.2% | ~6x |
| medium | 769M | 24 | 16 | 2.9% | ~2x |
| large-v3 | 1.5B | 32 | 20 | 2.7% | 1x |
Relative speed is measured against the large model when transcribing English speech on an A100, as reported in the Whisper repository: tiny runs roughly 32x faster than large. These figures are not real-time factors. Actual throughput depends on your hardware, batch size, and decoding options, so measure it on your own setup.
For most production use cases: small or medium offers the best speed-accuracy trade-off. large-v3 is for maximum accuracy when latency is not critical.
Code: Speech-to-Text Pipeline with Whisper
```python
import time
from dataclasses import dataclass
from typing import Optional

import whisper


@dataclass
class TranscriptionResult:
    text: str
    language: str
    segments: list[dict]
    duration_seconds: float
    processing_time_seconds: float
    rtf: float  # real-time factor = processing time / audio duration (lower is faster)


def load_whisper_model(size: str = "base", device: str = "cpu") -> whisper.Whisper:
    """Load a Whisper model."""
    print(f"Loading Whisper {size}...")
    model = whisper.load_model(size, device=device)
    print("Model loaded.")
    return model


def transcribe_audio(
    model: whisper.Whisper,
    audio_path: str,
    language: Optional[str] = None,
    task: str = "transcribe",  # or "translate" to translate to English
    word_timestamps: bool = False,
    verbose: bool = False,
) -> TranscriptionResult:
    """
    Transcribe an audio file using Whisper.

    Args:
        model: Loaded Whisper model
        audio_path: Path to audio file (MP3, WAV, M4A, etc.)
        language: Language code (e.g., "en", "fr") or None for auto-detection
        task: "transcribe" or "translate"
        word_timestamps: Whether to return word-level timestamps
    """
    start_time = time.perf_counter()
    # load_audio decodes via ffmpeg and resamples to 16kHz mono
    audio = whisper.load_audio(audio_path)
    duration = len(audio) / 16000
    # transcribe() handles long audio internally with a sliding 30-second window
    result = model.transcribe(
        audio,
        language=language,
        task=task,
        word_timestamps=word_timestamps,
        verbose=verbose,
    )
    processing_time = time.perf_counter() - start_time
    rtf = processing_time / duration
    return TranscriptionResult(
        text=result["text"].strip(),
        language=result["language"],
        segments=result["segments"],
        duration_seconds=duration,
        processing_time_seconds=processing_time,
        rtf=rtf,
    )


def transcribe_long_audio(
    model: whisper.Whisper,
    audio_path: str,
    chunk_duration: float = 30.0,
    overlap: float = 2.0,
    language: Optional[str] = None,
) -> str:
    """
    Transcribe audio longer than 30 seconds using fixed-size chunks.

    Whisper's encoder takes 30-second windows. This simple version slices the
    audio at fixed offsets; VAD-based chunking avoids cutting words in half.
    Overlapping chunks repeat a few words at each boundary, so the joined
    text may need de-duplication.
    """
    if not 0.0 <= overlap < chunk_duration <= 30.0:
        raise ValueError("need 0 <= overlap < chunk_duration <= 30")
    audio = whisper.load_audio(audio_path)
    sample_rate = 16000
    chunk_samples = int(chunk_duration * sample_rate)
    step_samples = chunk_samples - int(overlap * sample_rate)
    # Half precision is not supported for this path on CPU
    use_fp16 = model.device.type != "cpu"
    transcription_parts = []
    for start in range(0, len(audio), step_samples):
        # Pad (or trim) to the exact 30-second input Whisper expects
        chunk = whisper.pad_or_trim(audio[start:start + chunk_samples])
        # n_mels differs between checkpoints (80 for most, 128 for large-v3)
        mel = whisper.log_mel_spectrogram(chunk, n_mels=model.dims.n_mels).to(model.device)
        # Detect language from the first chunk
        if language is None and start == 0:
            _, probs = model.detect_language(mel)
            language = max(probs, key=probs.get)
            print(f"Detected language: {language}")
        options = whisper.DecodingOptions(
            language=language,
            task="transcribe",
            fp16=use_fp16,
        )
        result = whisper.decode(model, mel, options)
        transcription_parts.append(result.text.strip())
        # Stop once this chunk reached the end, so the tail is not decoded twice
        if start + chunk_samples >= len(audio):
            break
    return " ".join(transcription_parts)


def batch_transcribe(
    model: whisper.Whisper,
    audio_paths: list[str],
    language: Optional[str] = None,
) -> list[TranscriptionResult]:
    """Transcribe multiple audio files one after another."""
    results = []
    for path in audio_paths:
        print(f"Transcribing: {path}")
        result = transcribe_audio(model, path, language=language)
        results.append(result)
        print(f"  Duration: {result.duration_seconds:.1f}s | RTF: {result.rtf:.2f} | Lang: {result.language}")
    return results


if __name__ == "__main__":
    model = load_whisper_model("small")
    # Basic transcription
    result = transcribe_audio(
        model,
        "audio_sample.mp3",
        word_timestamps=True,
    )
    print(f"Text: {result.text}")
    print(f"Language: {result.language}")
    print(f"RTF: {result.rtf:.2f} (below 1.0 is faster than real time)")
    # Segment-level timestamps; with word_timestamps=True each segment
    # also carries a "words" list with per-word start and end times
    for segment in result.segments[:3]:
        print(f"  [{segment['start']:.2f}s - {segment['end']:.2f}s]: {segment['text']}")
```
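Chunked transcription with overlap repeats a few words at every boundary. One simple way to clean that up, sketched below, is to look for the longest run of words that ends one chunk and begins the next, and drop the duplicate. This is a heuristic chosen for this example, not something Whisper provides: the function name and the 20-word search limit are assumptions, and a single-word match on a common word can produce a false merge.

```python
def merge_overlapping_transcripts(parts: list[str], max_overlap_words: int = 20) -> str:
    """Join chunk transcripts, removing words duplicated across chunk boundaries."""

    def normalize(words: list[str]) -> list[str]:
        # Compare case-insensitively and ignore trailing punctuation
        return [w.lower().strip(".,!?") for w in words]

    merged: list[str] = []
    for part in parts:
        words = part.split()
        limit = min(max_overlap_words, len(merged), len(words))
        overlap = 0
        # Try the longest candidate overlap first
        for n in range(limit, 0, -1):
            if normalize(merged[-n:]) == normalize(words[:n]):
                overlap = n
                break
        merged.extend(words[overlap:])
    return " ".join(merged)


chunks = ["the quick brown fox", "brown fox jumps over", "over the lazy dog"]
print(merge_overlapping_transcripts(chunks))
# the quick brown fox jumps over the lazy dog
```

Timestamp-based alignment is more reliable when segment times are available, since it does not depend on the two chunks agreeing on the exact words.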
Code: Streaming Transcription with Whisper
```python
import queue
import threading
from typing import Callable, Optional

import numpy as np
import sounddevice as sd
import whisper


class StreamingTranscriber:
    """
    Near-real-time speech-to-text using Whisper with overlapping windows.

    Architecture:
    - Audio capture thread: records audio in small chunks (e.g., 1 second)
    - Accumulation buffer: builds up 5-second windows for transcription
    - Transcription thread: processes windows asynchronously
    """

    def __init__(
        self,
        model_size: str = "small",
        sample_rate: int = 16000,
        chunk_seconds: float = 1.0,
        window_seconds: float = 5.0,
        language: Optional[str] = None,
        on_transcript: Optional[Callable[[str], None]] = None,
    ):
        self.model = whisper.load_model(model_size)
        self.sample_rate = sample_rate
        self.chunk_samples = int(chunk_seconds * sample_rate)
        self.window_samples = int(window_seconds * sample_rate)
        self.language = language
        self.on_transcript = on_transcript or print
        self.audio_queue: queue.Queue = queue.Queue()
        self.audio_buffer = np.array([], dtype=np.float32)
        self.is_running = False

    def _audio_callback(self, indata, frames, time_info, status):
        """Called by sounddevice for each audio chunk."""
        if status:
            print(f"Audio status: {status}")
        self.audio_queue.put(indata[:, 0].copy())  # mono channel

    def _transcription_loop(self):
        """Process audio chunks from the queue."""
        while self.is_running:
            # Get audio from queue
            try:
                chunk = self.audio_queue.get(timeout=0.1)
            except queue.Empty:
                continue
            self.audio_buffer = np.append(self.audio_buffer, chunk)
            # Transcribe when buffer reaches window size
            if len(self.audio_buffer) >= self.window_samples:
                window = self.audio_buffer[-self.window_samples:]
                window_padded = whisper.pad_or_trim(window)
                mel = whisper.log_mel_spectrogram(
                    window_padded, n_mels=self.model.dims.n_mels
                ).to(self.model.device)
                options = whisper.DecodingOptions(
                    language=self.language,
                    task="transcribe",
                    # model.device is a torch.device, so compare its type, not the object
                    fp16=self.model.device.type != "cpu",
                )
                result = whisper.decode(self.model, mel, options)
                if result.text.strip() and result.no_speech_prob < 0.6:
                    self.on_transcript(result.text.strip())
                # Keep last 2 seconds for context (overlap); words in the
                # overlap can appear again in the next transcript
                self.audio_buffer = self.audio_buffer[-2 * self.sample_rate:]

    def start(self):
        """Start real-time transcription."""
        self.is_running = True
        transcription_thread = threading.Thread(target=self._transcription_loop, daemon=True)
        transcription_thread.start()
        print("Listening... (Ctrl+C to stop)")
        with sd.InputStream(
            samplerate=self.sample_rate,
            channels=1,
            dtype="float32",
            blocksize=self.chunk_samples,
            callback=self._audio_callback,
        ):
            try:
                while self.is_running:
                    sd.sleep(100)
            except KeyboardInterrupt:
                self.is_running = False
                print("\nStopped.")


if __name__ == "__main__":
    def on_transcript(text: str):
        print(f"[TRANSCRIPT] {text}")

    transcriber = StreamingTranscriber(
        model_size="small",
        window_seconds=5.0,
        language="en",
        on_transcript=on_transcript,
    )
    transcriber.start()
```
Audio Language Models for Generation
AudioLM: Predicting Audio Autoregressively
AudioLM (Borsos et al., Google, 2022) was a breakthrough for audio generation: treat audio generation as a language modeling problem over discrete audio tokens.
The pipeline:
- Encode audio to semantic tokens using a self-supervised model (w2v-BERT) - these capture high-level content like phonemes and prosody.
- Encode audio to acoustic tokens using SoundStream (EnCodec-style) - these capture fine-grained details.
- Train a hierarchical language model: predict semantic tokens first, then predict acoustic tokens conditioned on semantic tokens.
At generation time: sample semantic tokens autoregressively (capturing content and prosody), then sample acoustic tokens autoregressively conditioned on semantic tokens (capturing voice quality and fine detail), then decode back to waveform.
AudioLM can continue an audio clip in the same speaker's voice with consistent prosody and acoustic quality - without any explicit speaker modeling.
MusicGen: Text-to-Music
Meta's MusicGen (Copet et al., 2023) applies a similar approach to music. The model uses EnCodec for audio tokenization (4 codebooks at 50Hz) and a single-stage transformer decoder conditioned on text embeddings from a frozen T5 encoder.
Key innovation: instead of predicting the 4 codebook streams sequentially, MusicGen uses a delay pattern - each codebook is offset by 1 time step from the previous. This allows parallel generation with a single autoregressive transformer while maintaining inter-codebook dependencies.
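The delay pattern is easy to see on a small grid. The sketch below shifts codebook k right by k steps and pads the gaps; it is an illustration of the idea only, with a made-up PAD value and function names, and is not taken from the audiocraft code. At each step the transformer predicts one column of the delayed grid, so the sequence has T + K - 1 steps instead of the K × T steps that flattening all codebooks would need.

```python
import numpy as np

PAD = -1  # placeholder for positions that hold no real token


def apply_delay_pattern(codes: np.ndarray) -> np.ndarray:
    """codes: (K, T) codebook indices -> (K, T + K - 1) with codebook k delayed by k steps."""
    n_codebooks, n_steps = codes.shape
    delayed = np.full((n_codebooks, n_steps + n_codebooks - 1), PAD, dtype=codes.dtype)
    for k in range(n_codebooks):
        delayed[k, k:k + n_steps] = codes[k]
    return delayed


def undo_delay_pattern(delayed: np.ndarray) -> np.ndarray:
    """Invert apply_delay_pattern, realigning all codebooks in time."""
    n_codebooks, total = delayed.shape
    n_steps = total - n_codebooks + 1
    return np.stack([delayed[k, k:k + n_steps] for k in range(n_codebooks)])


codes = np.arange(4 * 5).reshape(4, 5)  # 4 codebooks, 5 time steps
delayed = apply_delay_pattern(codes)
print(delayed)
```

In column s of the delayed grid, codebook k holds its token for time step s - k, so by the time the model predicts the second codebook for a frame it has already seen the first codebook for that same frame.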
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write


def generate_music(
    description: str,
    duration: int = 10,
    model_size: str = "small",
) -> None:
    """Generate music from a text description using MusicGen."""
    model = MusicGen.get_pretrained(f"facebook/musicgen-{model_size}")
    model.set_generation_params(duration=duration)
    descriptions = [description]
    wav = model.generate(descriptions)
    for idx, (desc, one_wav) in enumerate(zip(descriptions, wav)):
        # Save at 32kHz
        audio_write(
            f"music_{idx}",
            one_wav.cpu(),
            model.sample_rate,
            strategy="loudness",
            loudness_compressor=True,
        )
        print(f"Saved music_{idx}.wav")


if __name__ == "__main__":
    generate_music(
        description="an upbeat jazz piece with piano and drums, energetic, 120 BPM",
        duration=15,
        model_size="melody",
    )
```
Whisper Fine-Tuning for Domain-Specific ASR
Whisper's generic training is excellent but can be improved significantly for domain-specific vocabulary (medical terms, technical jargon, proper nouns) or specific accents and noise conditions.
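```python
from dataclasses import dataclass
from typing import Any

import evaluate
import librosa
from datasets import Dataset
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    """Pad audio features and label ids separately, masking label padding with -100."""

    processor: Any

    def __call__(self, features: list[dict]) -> dict:
        audio = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(audio, return_tensors="pt")
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # -100 is ignored by the cross-entropy loss
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )
        # The model prepends the start token itself, so drop it if the tokenizer added it
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch


def fine_tune_whisper(
    train_audio_paths: list[str],
    train_transcripts: list[str],
    eval_audio_paths: list[str],
    eval_transcripts: list[str],
    output_dir: str = "./whisper-finetuned",
    model_name: str = "openai/whisper-small",
    num_epochs: int = 3,
    learning_rate: float = 1e-5,
) -> None:
    """
    Fine-tune Whisper on domain-specific audio data.

    A held-out eval set is required: evaluation runs every epoch and the
    best checkpoint is selected by WER.
    """
    processor = WhisperProcessor.from_pretrained(model_name, language="en", task="transcribe")
    model = WhisperForConditionalGeneration.from_pretrained(model_name)
    # Disable the generation cache for training
    model.config.use_cache = False

    def prepare_example(audio_path: str, transcript: str) -> dict:
        audio, _ = librosa.load(audio_path, sr=16000, mono=True)
        input_features = processor.feature_extractor(
            audio, sampling_rate=16000
        ).input_features[0]
        labels = processor.tokenizer(transcript).input_ids
        return {"input_features": input_features, "labels": labels}

    def build_dataset(paths: list[str], transcripts: list[str]) -> Dataset:
        return Dataset.from_list([prepare_example(p, t) for p, t in zip(paths, transcripts)])

    train_dataset = build_dataset(train_audio_paths, train_transcripts)
    eval_dataset = build_dataset(eval_audio_paths, eval_transcripts)
    wer_metric = evaluate.load("wer")

    def compute_metrics(pred):
        pred_ids = pred.predictions
        label_ids = pred.label_ids
        # Restore padding so the tokenizer can decode the labels
        label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
        pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
        label_str = processor.batch_decode(label_ids, skip_special_tokens=True)
        wer = wer_metric.compute(predictions=pred_str, references=label_str)
        return {"wer": wer}

    training_args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        learning_rate=learning_rate,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        fp16=True,  # requires a CUDA GPU
        predict_with_generate=True,
        evaluation_strategy="epoch",  # newer transformers releases call this eval_strategy
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="wer",
        greater_is_better=False,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=DataCollatorSpeechSeq2SeqWithPadding(processor),
        # Passed so the feature extractor is saved with checkpoints;
        # newer transformers releases use processing_class for this
        tokenizer=processor.feature_extractor,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    model.save_pretrained(output_dir)
    processor.save_pretrained(output_dir)
    print(f"Fine-tuned model saved to {output_dir}")
```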
Production Engineering: Building Robust Speech Pipelines
Voice Activity Detection (VAD)
Do not run ASR on silence. VAD detects when a speaker is talking and segments the audio into speech and non-speech regions. This dramatically reduces unnecessary ASR calls and improves transcription quality (ASR on silence produces hallucinated text).
```python
import numpy as np


class SimpleEnergyVAD:
    """
    Simple energy-based VAD for demonstration.
    In production, use Silero VAD or WebRTC VAD.
    """

    def __init__(
        self,
        threshold_db: float = -45.0,
        min_speech_ms: int = 300,
        sample_rate: int = 16000,
        frame_ms: int = 20,
    ):
        # Convert dB relative to full scale (1.0) into a linear amplitude
        self.threshold = 10 ** (threshold_db / 20)
        self.min_speech_samples = int(min_speech_ms * sample_rate / 1000)
        self.sample_rate = sample_rate
        self.frame_samples = int(frame_ms * sample_rate / 1000)

    def is_speech_frame(self, frame: np.ndarray) -> bool:
        """Check if an audio frame contains speech."""
        rms = np.sqrt(np.mean(frame ** 2))
        return bool(rms > self.threshold)

    def get_speech_segments(
        self, audio: np.ndarray
    ) -> list[tuple[float, float]]:
        """Return list of (start_sec, end_sec) speech segments."""
        n_frames = len(audio) // self.frame_samples
        is_speech = [
            self.is_speech_frame(audio[i * self.frame_samples:(i + 1) * self.frame_samples])
            for i in range(n_frames)
        ]
        segments = []
        start = None
        # The trailing False closes a segment that runs to the end of the audio
        for i, speech in enumerate(is_speech + [False]):
            if speech and start is None:
                start = i * self.frame_samples
            elif not speech and start is not None:
                end = i * self.frame_samples
                if end - start >= self.min_speech_samples:
                    segments.append((
                        start / self.sample_rate,
                        end / self.sample_rate,
                    ))
                start = None
        return segments


# Production VAD using Silero
def setup_silero_vad():
    """Load the Silero VAD model (recommended for production)."""
    import torch

    model, utils = torch.hub.load(
        repo_or_dir="snakers4/silero-vad",
        model="silero_vad",
        force_reload=False,
    )
    return model, utils


def get_speech_timestamps_silero(audio: np.ndarray, model, utils, sample_rate: int = 16000):
    """Get speech timestamps using Silero VAD."""
    import torch

    # utils is a tuple of helpers; the first is get_speech_timestamps
    get_speech_ts, _, _, _, _ = utils
    audio_tensor = torch.from_numpy(audio).float()
    speech_timestamps = get_speech_ts(
        audio_tensor,
        model,
        sampling_rate=sample_rate,
        min_speech_duration_ms=200,
        min_silence_duration_ms=500,
        return_seconds=True,
    )
    return [(ts["start"], ts["end"]) for ts in speech_timestamps]
```
Speaker Diarization
Who spoke when? Diarization is essential for multi-speaker audio like meetings and call center recordings.
```python
import librosa
import whisper
from pyannote.audio import Pipeline


def diarize_audio(audio_path: str, hf_token: str) -> list[dict]:
    """
    Speaker diarization using pyannote.audio.
    Requires a HuggingFace token with accepted terms for pyannote models.
    """
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=hf_token,  # the keyword name can differ between pyannote releases
    )
    diarization = pipeline(audio_path)
    segments = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        segments.append({
            "speaker": speaker,
            "start": turn.start,
            "end": turn.end,
        })
    return segments


def transcribe_with_diarization(
    whisper_model,
    audio_path: str,
    diarization_segments: list[dict],
    language: str = "en",
) -> list[dict]:
    """Combine Whisper transcription with speaker diarization labels."""
    audio, _ = librosa.load(audio_path, sr=16000, mono=True)
    results = []
    for segment in diarization_segments:
        start_sample = int(segment["start"] * 16000)
        end_sample = int(segment["end"] * 16000)
        audio_chunk = audio[start_sample:end_sample]
        if len(audio_chunk) < 400:  # skip very short segments (under 25ms)
            continue
        # Note: pad_or_trim cuts turns longer than 30 seconds; split those first
        audio_padded = whisper.pad_or_trim(audio_chunk)
        mel = whisper.log_mel_spectrogram(
            audio_padded, n_mels=whisper_model.dims.n_mels
        ).to(whisper_model.device)
        options = whisper.DecodingOptions(language=language, fp16=False)
        result = whisper.decode(whisper_model, mel, options)
        results.append({
            "speaker": segment["speaker"],
            "start": segment["start"],
            "end": segment["end"],
            "text": result.text.strip(),
        })
    return results
```
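A different way to combine the two outputs, sketched below, is to transcribe the whole file once and then label each transcript segment with the speaker whose diarization turns overlap it most. This is an alternative to decoding each speaker turn separately, not what the code above does. It assumes transcript segments are dicts with "start", "end", and "text" keys (the shape of the segments returned by model.transcribe); the function names and the "UNKNOWN" label are choices made for this example.

```python
def overlap_seconds(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    transcript_segments: list[dict],
    diarization_segments: list[dict],
    unknown_label: str = "UNKNOWN",
) -> list[dict]:
    """Label each transcript segment with the speaker that overlaps it the most."""
    labeled = []
    for seg in transcript_segments:
        totals: dict[str, float] = {}
        for turn in diarization_segments:
            shared = overlap_seconds(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > 0:
                totals[turn["speaker"]] = totals.get(turn["speaker"], 0.0) + shared
        speaker = max(totals, key=totals.get) if totals else unknown_label
        labeled.append({**seg, "speaker": speaker})
    return labeled


transcript = [
    {"start": 0.0, "end": 2.0, "text": "Hi, thanks for calling."},
    {"start": 2.0, "end": 5.0, "text": "I need help with my other account."},
]
turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5},
    {"speaker": "SPEAKER_01", "start": 2.5, "end": 6.0},
]
for row in assign_speakers(transcript, turns):
    print(f"{row['speaker']}: {row['text']}")
```

The trade-off: decoding once keeps full sentence context for the ASR model, but a segment that spans a speaker change gets a single label.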
Latency Optimization for Real-Time Voice
The target for interactive voice systems is end-to-end latency under 500ms (ASR + LLM + TTS). Breaking this down:
| Component | Target Latency | Optimization |
|---|---|---|
| VAD detection | <20ms | silero-vad with threshold |
| ASR (Whisper small) | 100-200ms | GPU, batching, faster-whisper |
| LLM response start | 200-400ms | streaming generation |
| TTS first audio | 100-300ms | streaming synthesis |
| Total | 420-920ms | Use smallest models that meet accuracy bar |
For sub-500ms total latency: use faster-whisper (a CTranslate2-based reimplementation of Whisper that its maintainers report as up to 4x faster than the reference implementation at the same accuracy), stream LLM output, and use a TTS model with low time-to-first-byte.
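The totals in the table are just sums of the per-component ranges, which makes the budget easy to keep in code and re-check as you swap components. The sketch below uses the table's numbers; treating "<20ms" for VAD as a flat 20ms is an assumption, and the names are invented for this example.

```python
# (best_case_ms, worst_case_ms) per component, taken from the table above
LATENCY_BUDGET_MS = {
    "vad": (20, 20),              # "<20ms" treated as 20ms in both cases
    "asr": (100, 200),
    "llm_first_token": (200, 400),
    "tts_first_audio": (100, 300),
}


def total_latency(budget: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum best-case and worst-case latencies across sequential components."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


def meets_target(budget: dict[str, tuple[int, int]], target_ms: int = 500) -> bool:
    """True only if even the worst case stays within the target."""
    return total_latency(budget)[1] <= target_ms


best, worst = total_latency(LATENCY_BUDGET_MS)
print(f"End-to-end: {best}-{worst}ms, meets 500ms target: {meets_target(LATENCY_BUDGET_MS)}")
# End-to-end: 420-920ms, meets 500ms target: False
```

Even the best case leaves only 80ms of headroom under 500ms, which is why the components have to overlap through streaming rather than run strictly one after another.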
```python
from typing import Optional

from faster_whisper import WhisperModel


def setup_faster_whisper(model_size: str = "small", device: str = "cuda") -> WhisperModel:
    """
    Load faster-whisper, a Whisper reimplementation on the CTranslate2 runtime.
    Its maintainers report up to 4x speedups over the reference implementation;
    actual gains vary by hardware, so measure on your own setup.
    """
    model = WhisperModel(
        model_size,
        device=device,
        compute_type="float16" if device == "cuda" else "int8",
    )
    return model


def transcribe_faster(model: WhisperModel, audio_path: str, language: Optional[str] = None) -> dict:
    """Transcribe using faster-whisper with segment-level results."""
    segments, info = model.transcribe(
        audio_path,
        beam_size=5,
        language=language,
        condition_on_previous_text=False,  # faster, less context drift
        vad_filter=True,  # built-in VAD
        vad_parameters=dict(min_silence_duration_ms=500),
    )
    # segments is a generator: transcription actually runs as it is consumed here
    transcript = " ".join(segment.text for segment in segments)
    return {
        "text": transcript.strip(),
        "language": info.language,
        "language_probability": info.language_probability,
    }
```
Common Mistakes
:::danger Using Whisper Without VAD on Long Audio
Whisper is designed for 30-second chunks. On long audio without VAD, Whisper may hallucinate text during silence periods - generating plausible but completely fabricated transcriptions for quiet parts of the audio. Always apply VAD before Whisper to segment audio into speech-only regions. Filter out chunks where Whisper's no_speech_prob > 0.6.
:::
:::danger Assuming Word Error Rate on Clean Benchmarks Predicts Real-World Performance
Whisper achieves 2-3% WER on LibriSpeech (studio-quality, native English speakers). Real-world WER on non-native speakers, phone audio, background noise, or domain-specific vocabulary can be 10-30%. Always benchmark on data that matches your production distribution before committing to a model size. If WER is high, fine-tune on in-domain data.
:::
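Benchmarking on your own data only needs reference transcripts and a WER function. The sketch below computes WER as word-level edit distance divided by the number of reference words. It is a minimal version for illustration: it lowercases and splits on whitespace only, whereas evaluation libraries such as jiwer, or the evaluate WER metric used in the fine-tuning code earlier, handle text normalization more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "switch to the other account"
hypothesis = "switch to another account"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")  # WER: 40%
```

Here "the other" became "another": one deletion plus one substitution against five reference words. WER can exceed 100% when the hypothesis contains many inserted words.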
:::warning Blocking on ASR Before Starting LLM Processing
In voice assistant pipelines, the naive approach waits for ASR to complete, then sends the transcript to the LLM, then starts TTS. This creates latency that compounds. Instead: (1) detect end-of-speech with VAD, (2) start ASR, (3) stream ASR partial results to the LLM as soon as first words appear, (4) stream LLM output to TTS as tokens are generated. The user hears the first word of the response much faster.
:::
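One common way to implement the LLM-to-TTS hand-off is to buffer streamed tokens and flush to the synthesizer at sentence boundaries, so speech can start after the first sentence instead of after the full response. The sketch below shows only that buffering logic. The token stream is a hard-coded stand-in for a real LLM client, flushing per sentence (rather than per token) is a design choice assumed for this example, and the boundary check is naive: it would split on the period in "3.5" if a token ended there.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield a sentence as soon as it is complete, without waiting for the full response."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends
    if buffer.strip():
        yield buffer.strip()


def fake_llm_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    yield from ["Sure", ",", " I", " can", " help", ".", " Which", " account", "?",
                " Take", " your", " time"]


for sentence in sentences_from_tokens(fake_llm_stream()):
    # In a real pipeline this is where the sentence would be sent to TTS
    print(f"-> TTS: {sentence}")
```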
:::warning Ignoring Audio Format and Sample Rate Requirements
Whisper expects 16kHz mono audio. Most audio from consumer devices (phones, microphones) is at 44.1kHz or 48kHz, often stereo. Always resample and convert to mono before processing. librosa.load(path, sr=16000, mono=True) handles this. If 44.1kHz samples are passed in without resampling, Whisper interprets them as 16kHz: the audio is effectively slowed down by about 2.76x (44,100 / 16,000) and shifted down in pitch, destroying transcription quality.
:::
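The distortion factor is simply the ratio of the true sample rate to the rate the model assumes, and a guard at the pipeline entry point is cheap insurance against it. The sketch below shows both; the function names and error messages are invented for this example.

```python
import numpy as np

WHISPER_SAMPLE_RATE = 16000


def apparent_duration_seconds(n_samples: int, assumed_rate_hz: int = WHISPER_SAMPLE_RATE) -> float:
    """How long a buffer appears to last when read at the assumed sample rate."""
    return n_samples / assumed_rate_hz


def validate_whisper_input(samples: np.ndarray, sample_rate_hz: int) -> None:
    """Raise early instead of silently transcribing mis-sampled or multi-channel audio."""
    if sample_rate_hz != WHISPER_SAMPLE_RATE:
        raise ValueError(
            f"expected {WHISPER_SAMPLE_RATE} Hz audio, got {sample_rate_hz} Hz; resample first"
        )
    if samples.ndim != 1:
        raise ValueError(f"expected mono audio with shape (n,), got shape {samples.shape}")


# Ten seconds recorded at 44.1kHz, misread as 16kHz
recorded = np.zeros(10 * 44100, dtype=np.float32)
print(f"Appears to last {apparent_duration_seconds(len(recorded)):.2f}s instead of 10s")
```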
Interview Questions and Answers
Q1: How does Whisper represent audio before passing it to the transformer? What is a log-Mel spectrogram?
Whisper converts raw audio (resampled to 16kHz) to a log-Mel spectrogram through three steps: (1) Short-Time Fourier Transform - the audio is divided into overlapping 25ms windows at 10ms stride, and FFT is applied to each window to get the frequency content over time. (2) Mel filterbank - the frequency axis is compressed using 80 triangular filters spaced on the Mel scale, which matches human hearing sensitivity (more frequency resolution at low frequencies). (3) Log compression - the log of filter energies is taken, compressing the dynamic range and matching the way humans perceive loudness. The result is a 2D matrix of shape (80, num_frames) where each column captures the spectral snapshot at one time point. Whisper processes 30-second chunks, producing approximately 80x3000 spectrogram matrices. This is passed through two 1D convolutional layers for local feature extraction, then positional encodings are added, and the transformer encoder processes the resulting sequence.
Q2: What makes Whisper more robust than previous ASR systems?
The key is training data scale and diversity. Previous ASR systems trained on 1,000-10,000 hours of carefully curated, clean audio with verified transcriptions. Whisper trained on 680,000 hours of audio from the internet - roughly 70-700x more data - with automatically obtained transcripts that are noisy but diverse. This diversity exposed the model to 99 languages, multiple accents, dialects, recording conditions (studio, phone, noisy environments), and audio quality levels. The weak supervision at scale produces a model that generalizes far better to real-world conditions than models trained on clean but narrow datasets. The architecture (encoder-decoder transformer with special task tokens) also allows a single model to handle transcription, translation, and language identification, unlike previous systems that required separate models for each.
Q3: Compare traditional ASR pipelines to end-to-end models like Whisper. What are the trade-offs?
Traditional ASR pipelines have three separate components: acoustic model (maps speech to phonemes), pronunciation lexicon (maps phonemes to words), and language model (scores word sequences). Each component is trained separately. Trade-offs: modular (can update each component independently), interpretable (you can trace errors to a specific component), but requires significant engineering to integrate and optimize each hand-off. End-to-end models like Whisper are a single neural network trained directly from audio to text. Trade-offs: simpler architecture (no explicit lexicon), better generalization from large-scale training, easier to deploy (one model), but harder to diagnose errors (error in "which component"?), and harder to update the vocabulary without retraining. In practice, Whisper consistently outperforms traditional pipelines on out-of-domain data because end-to-end training better captures the audio-to-text mapping without the error propagation across components.
Q4: How would you build a low-latency real-time transcription system?
The architecture: (1) Audio capture - use 20ms chunks at 16kHz for low-latency capture. (2) VAD - apply Silero VAD on each chunk to detect speech activity; buffer only when speech is detected. (3) Streaming transcription - accumulate 3-5 seconds of speech, then run Whisper on that window. Use overlap between windows (last 2 seconds carried over) to handle words that span chunk boundaries. (4) Transcript assembly - detect sentence endpoints and emit completed sentences downstream. Key optimizations: use faster-whisper (reported as up to 4x faster than original Whisper), set compute_type="float16" on GPU or int8 on CPU, disable beam search for lowest latency (beam_size=1), use condition_on_previous_text=False to avoid context accumulation overhead. Typical achievable latency for 5-second windows on GPU: 200-400ms from end-of-speech to transcript.
Q5: What is the difference between Whisper's approach and native audio models like GPT-4o?
Whisper is a classic encoder-decoder model trained for speech recognition: it takes a fixed 30-second audio chunk, converts it to a log-Mel spectrogram, encodes it with a transformer encoder, and decodes text autoregressively. Its decoder acts as a language model over transcripts, but it is trained only to emit text for the audio it hears; it does not reason about or respond to what was said. Native audio models like GPT-4o process audio as a first-class input modality, understanding not just the words but the paralinguistic content - prosody, emotion, speaking rate, hesitations - alongside the semantics. GPT-4o can respond with "you seem nervous about this" based on vocal cues, not just words. The architecture difference: GPT-4o likely uses audio codecs to tokenize audio into discrete tokens that are processed alongside text tokens in a unified context window, versus Whisper's separate encoder-decoder design. GPT-4o also handles native audio output (speech synthesis) in the same model, while Whisper only outputs text (transcription or translation into English).
