
# Dynamic Programming for ML

## The Hidden DP in Everything You Deploy

A speech recognition engineer at a major tech company was debugging why their model occasionally produced sequences with repeated characters that made no phonetic sense. The model was trained with CTC loss - Connectionist Temporal Classification - and was supposed to handle the alignment between audio frames and text characters automatically. But something in the decoding was wrong.

The root cause took two days to find: the team had replaced the CTC greedy decoder with a custom "faster" implementation that picked the most likely character at each timestep without applying the CTC blank-token rules. Correct CTC decoding rests on a dynamic programming view of how frame-level alignments map to text: repeated frame predictions are collapsed, blank tokens are removed, and a genuinely doubled character can appear in the output only when a blank separates the repeats in the alignment. The custom decoder ignored all of this structure, producing hallucinated character sequences.

Dynamic programming is not just a technique for solving textbook problems about coins and knapsacks. It is embedded in the core algorithms of modern ML: the Viterbi algorithm that powers sequence labeling in NLP, the CTC decoding algorithm in every speech recognition and OCR system, the beam search decoder in every language model, the Bellman equation that defines value functions in reinforcement learning, and the edit distance that underlies WER and TER evaluation metrics.

The engineer who does not understand DP treats these as black boxes. The engineer who does understand it can debug the CTC decoder, implement a custom beam search with constraints, tune the Bellman backup operators for their RL environment, and understand why value iteration converges. This lesson builds that understanding - from the principles to the production code.


## Why This Exists

The fundamental problem DP solves is optimization over structures with shared substructure. Many problems can be decomposed into subproblems, but naive recursion solves the same subproblem exponentially many times. DP solves each subproblem once and stores the result.

Two properties must hold:

  1. Optimal substructure: the optimal solution to the whole problem can be built from optimal solutions to subproblems.
  2. Overlapping subproblems: subproblems recur, so caching their solutions avoids redundant computation.

Without DP, edit distance between two strings of length $n$ and $m$ costs $O(3^{n+m})$ via naive recursion. With DP: $O(nm)$. For comparing two 100-word strings: $3^{200}$ operations vs $10,000$ operations - a factor of roughly $10^{91}$.
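To see the overlapping-subproblem blowup concretely, here is a small counting sketch (illustrative, not from the original - the function name is mine): it runs the naive edit-distance recursion and tallies how often each $(i, j)$ subproblem is revisited.

```python
from collections import Counter

def naive_edit_distance_counted(a: str, b: str):
    """Naive recursion; `visits` records how often each (i, j) subproblem recurs."""
    visits = Counter()

    def rec(i: int, j: int) -> int:
        visits[(i, j)] += 1
        if i == 0:
            return j
        if j == 0:
            return i
        if a[i-1] == b[j-1]:
            return rec(i-1, j-1)
        return 1 + min(rec(i-1, j), rec(i, j-1), rec(i-1, j-1))

    return rec(len(a), len(b)), visits

dist, visits = naive_edit_distance_counted("kitten", "sitting")
print(f"distance = {dist}")                               # 3
print(f"distinct subproblems = {len(visits)}")            # at most 7 * 8 = 56
print(f"total recursive calls = {sum(visits.values())}")  # far more than 56
```

A DP table solves each distinct subproblem exactly once; the gap between distinct subproblems and total recursive calls is the entire speedup.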


## Historical Context

Dynamic programming was invented by Richard Bellman in the 1950s at the RAND Corporation. The name is deliberately obscure - Bellman chose it specifically to hide the mathematical nature of his work from hostile government funders who were suspicious of mathematics. "Dynamic" sounded operational; "programming" meant planning (in the military sense, not computer sense). By the time anyone realized it was mathematics, the name had stuck.

Bellman's original application was optimal control theory - finding the optimal policy for a system over time. The Bellman equation, which defines the value of a state as the immediate reward plus the discounted value of the next state under the optimal policy, is the foundation of modern reinforcement learning.

The Viterbi algorithm was developed by Andrew Viterbi in 1967 for decoding convolutional codes in digital communications. It was independently discovered for HMM decoding in the 1970s and became the dominant sequence decoding algorithm for speech recognition in the 1980s and 1990s. Modern CRF models still use Viterbi decoding.

CTC (Connectionist Temporal Classification) was introduced by Alex Graves et al. in 2006, extending DP-based alignment ideas from HMMs to neural networks. It enabled end-to-end training of speech recognition models without pre-aligned training data.


## Core Concepts

### The Two Implementations: Memoization vs Tabulation

DP can be implemented in two equivalent ways:

Memoization (top-down): implement the natural recursion, but cache the result of each unique subproblem. When the recursive call encounters a subproblem it has already solved, return the cached result. Implemented with a dictionary or array of results.

Tabulation (bottom-up): identify the subproblem ordering (small problems before large), allocate a table to store all subproblem results, and fill in the table from smallest to largest. No recursion - iterative loops over the table.

Memoization is easier to implement (just add a cache to the naive recursion) but has function-call overhead and risks hitting Python's recursion limit on deep problems. Tabulation is faster in practice (no function-call overhead, better cache locality) and handles large problems without stack issues. For ML applications, tabulation is almost always preferred.

```python
# Classic example: Fibonacci
# O(2^n) without caching, O(n) with DP

def fib_naive(n: int) -> int:
    """O(2^n) - exponential without caching."""
    if n <= 1:
        return n
    return fib_naive(n-1) + fib_naive(n-2)

# Memoization (top-down)
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memo(n: int) -> int:
    """O(n) time, O(n) space."""
    if n <= 1:
        return n
    return fib_memo(n-1) + fib_memo(n-2)

# Tabulation (bottom-up) - preferred for ML applications
def fib_dp(n: int) -> int:
    """O(n) time, O(1) space (rolling window)."""
    if n <= 1:
        return n
    prev, curr = 0, 1
    for _ in range(2, n+1):
        prev, curr = curr, prev + curr
    return curr
```

### Edit Distance: The Foundation of NLP Evaluation

Edit distance (Levenshtein distance) is the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another. It is the basis for:

  - TER (Translation Edit Rate) scoring in MT evaluation (word-level edit operations)
  - WER (Word Error Rate) in ASR evaluation
  - Fuzzy string matching in data processing pipelines
  - Approximate string matching in feature engineering

Subproblem: let $\text{dp}[i][j]$ = minimum edit distance to transform the first $i$ characters of string $A$ into the first $j$ characters of string $B$.

Recurrence:

$$\text{dp}[i][j] = \begin{cases} i & \text{if } j = 0 \text{ (delete all of } A\text{)} \\ j & \text{if } i = 0 \text{ (insert all of } B\text{)} \\ \text{dp}[i-1][j-1] & \text{if } A[i] = B[j] \text{ (no edit needed)} \\ 1 + \min\begin{pmatrix}\text{dp}[i-1][j] \\ \text{dp}[i][j-1] \\ \text{dp}[i-1][j-1]\end{pmatrix} & \text{otherwise (delete, insert, or substitute)} \end{cases}$$

Time: $O(nm)$, Space: $O(nm)$ or $O(\min(n,m))$ with the rolling-row optimization.

```python
import numpy as np
from typing import List, Tuple, Optional

def edit_distance(
    a: str,
    b: str,
    return_alignment: bool = False
) -> Tuple[int, Optional[List[Tuple[str, str, str]]]]:
    """
    Levenshtein edit distance between strings a and b.
    O(nm) time, O(nm) space (for alignment traceback).

    Args:
        a, b: input strings
        return_alignment: if True, return the alignment operations

    Returns:
        (distance, operations) where operations is a list of
        (op, char_from_a, char_from_b) tuples if return_alignment=True
    """
    n, m = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = np.zeros((n+1, m+1), dtype=np.int32)

    # Base cases
    for i in range(n+1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(m+1):
        dp[0][j] = j  # insert all of b[:j]

    # Fill DP table
    for i in range(1, n+1):
        for j in range(1, m+1):
            if a[i-1] == b[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(
                    dp[i-1][j],    # delete a[i-1]
                    dp[i][j-1],    # insert b[j-1]
                    dp[i-1][j-1]   # substitute a[i-1] -> b[j-1]
                )

    distance = int(dp[n][m])
    if not return_alignment:
        return distance, None

    # Traceback to find alignment operations
    operations = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i-1] == b[j-1]:
            operations.append(("match", a[i-1], b[j-1]))
            i -= 1
            j -= 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + 1:
            operations.append(("substitute", a[i-1], b[j-1]))
            i -= 1
            j -= 1
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            operations.append(("delete", a[i-1], "-"))
            i -= 1
        else:
            operations.append(("insert", "-", b[j-1]))
            j -= 1
    operations.reverse()
    return distance, operations

# Word-level edit distance for WER (Word Error Rate)
def word_error_rate(reference: str, hypothesis: str) -> float:
    """
    Compute WER for speech recognition evaluation.
    WER = edit_distance(ref_words, hyp_words) / len(ref_words)
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()

    # Same DP as edit_distance, but over word tokens instead of characters
    n, m = len(ref_words), len(hyp_words)
    dp = np.zeros((n+1, m+1), dtype=np.int32)
    for i in range(n+1):
        dp[i][0] = i
    for j in range(m+1):
        dp[0][j] = j
    for i in range(1, n+1):
        for j in range(1, m+1):
            if ref_words[i-1] == hyp_words[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
    return dp[n][m] / max(len(ref_words), 1)

# Demo
ref = "the quick brown fox jumps over the lazy dog"
hyp = "the quick brown fox jumped over lazy dog"
wer = word_error_rate(ref, hyp)

dist, ops = edit_distance("intention", "execution", return_alignment=True)
print(f"Edit distance('intention', 'execution') = {dist}")
for op, ca, cb in ops:
    print(f"  {op:12s}: '{ca}' -> '{cb}'")
print(f"\nWER: {wer:.3f} ({wer*100:.1f}%)")
```

### Viterbi Algorithm: Optimal Sequence Decoding

The Viterbi algorithm finds the most probable sequence of hidden states given an observed sequence, in a model where each state emits an observation and transitions to the next state according to learned probabilities.
This is HMM (Hidden Markov Model) decoding, but the same algorithm applies to CRF (Conditional Random Field) sequence labeling used in modern NLP. In NER (Named Entity Recognition) or POS tagging, you want:

$$\hat{y} = \arg\max_y P(y | x) = \arg\max_y \prod_t P(y_t | y_{t-1}, x_t)$$

Naive approach: enumerate all $K^T$ possible label sequences where $K$ = number of labels and $T$ = sequence length. For $K=10$ labels and $T=100$ tokens: $10^{100}$ sequences. Completely infeasible.

Viterbi uses DP: $\text{dp}[t][k]$ = the maximum probability of any path that ends in state $k$ at time $t$.

$$\text{dp}[t][k] = \max_{k'} \left( \text{dp}[t-1][k'] \cdot P(\text{transition: } k' \to k) \right) \cdot P(\text{emission: observation}_t | k)$$

Time: $O(TK^2)$ instead of $O(K^T)$. For $K=10$, $T=100$: $10,000$ operations vs $10^{100}$.

```python
import numpy as np
from typing import List, Tuple

class ViterbiDecoder:
    """
    Viterbi algorithm for sequence labeling.

    Used in:
    - NER (Named Entity Recognition): O-PER-ORG-LOC labels
    - POS tagging: NN-VB-DT labels
    - HMM speech recognition
    - CRF decoding (uses the same DP structure)

    Complexity: O(T * K^2) where T = sequence length, K = number of states
    """

    def __init__(self, states: List[str]):
        self.states = states
        self.K = len(states)
        self.state_to_idx = {s: i for i, s in enumerate(states)}

    def decode(
        self,
        emission_scores: np.ndarray,    # (T, K): log P(observation_t | state_k)
        transition_scores: np.ndarray,  # (K, K): log P(state_j | state_i)
        start_scores: np.ndarray,       # (K,): log P(initial state = k)
    ) -> Tuple[List[str], float]:
        """
        Find the most likely state sequence given observation scores.

        Args:
            emission_scores: (T, K) log-probabilities from the model
            transition_scores: (K, K) log-probabilities of transitioning
                from state i to j
            start_scores: (K,) log-probabilities of starting in each state

        Returns:
            (best_sequence, best_score)
        """
        T, K = emission_scores.shape
        assert K == self.K

        # dp[t][k] = max log-prob of any path ending at state k at time t
        dp = np.full((T, K), -np.inf)
        # backpointer[t][k] = which state at t-1 led to the best path ending at k at t
        backptr = np.zeros((T, K), dtype=np.int32)

        # Initialization: t=0
        dp[0] = start_scores + emission_scores[0]

        # Forward pass: t = 1, ..., T-1
        for t in range(1, T):
            for k in range(K):
                # For each possible current state k, find the best previous state k'
                # dp[t-1][k'] + transition[k'][k] + emission[t][k]
                scores = dp[t-1] + transition_scores[:, k]
                backptr[t][k] = np.argmax(scores)
                dp[t][k] = scores[backptr[t][k]] + emission_scores[t][k]

        # Backtrack: find the best final state
        best_last_state = int(np.argmax(dp[T-1]))
        best_score = float(dp[T-1][best_last_state])

        # Trace back the best path
        path = [best_last_state]
        for t in range(T-1, 0, -1):
            path.append(int(backptr[t][path[-1]]))
        path.reverse()

        best_sequence = [self.states[k] for k in path]
        return best_sequence, best_score

# Demo: NER sequence labeling
# BIO tagging scheme: B-PER=begin person, I-PER=inside person, O=outside
states = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
decoder = ViterbiDecoder(states)

K = len(states)
T = 8  # "Barack Obama works at OpenAI in San Francisco"

# Mock emission scores (from a neural network, log-softmax output)
np.random.seed(42)
emission_scores = np.log(np.random.dirichlet(np.ones(K), size=T))

# Manually set some plausible emissions for demonstration
# "Barack" -> B-PER likely
emission_scores[0, 1] = -0.1  # B-PER
# "Obama" -> I-PER likely
emission_scores[1, 2] = -0.1  # I-PER
# "OpenAI" -> B-ORG likely
emission_scores[4, 3] = -0.1  # B-ORG ("OpenAI" is token index 4)
# "San" -> B-LOC likely
emission_scores[6, 5] = -0.1  # B-LOC
# "Francisco" -> I-LOC likely
emission_scores[7, 6] = -0.1  # I-LOC

# Transition scores (log-probabilities)
# Encode: I-X can only follow B-X or I-X
transition_log = np.log(np.ones((K, K)) / K)  # uniform baseline
# Penalize invalid BIO transitions heavily
transition_log[0, 2] = -10.0  # O -> I-PER invalid
transition_log[0, 4] = -10.0  # O -> I-ORG invalid
transition_log[0, 6] = -10.0  # O -> I-LOC invalid
transition_log[1, 4] = -5.0   # B-PER -> I-ORG unlikely
transition_log[3, 2] = -5.0   # B-ORG -> I-PER unlikely

start_scores = np.log(np.array([0.5, 0.2, 0.01, 0.15, 0.01, 0.12, 0.01]))

tokens = ["Barack", "Obama", "works", "at", "OpenAI", "in", "San", "Francisco"]
sequence, score = decoder.decode(emission_scores, transition_log, start_scores)

print("Viterbi NER decoding:")
for token, label in zip(tokens, sequence):
    print(f"  {token:12s} {label}")
print(f"Best path log-score: {score:.3f}")
```

### CTC: Dynamic Programming for Sequence Alignment

CTC (Connectionist Temporal Classification) is the training criterion and decoding algorithm for sequence-to-sequence models where the input (e.g., audio frames) is longer than the output (e.g., character sequence) and the alignment is unknown.

The key insight: define an extended label set with a special blank token $\epsilon$. An output sequence "cat" can be generated by many frame-level sequences: "cccaaattt", "c-a-t", "cc--at", etc. (where - represents $\epsilon$). CTC computes the probability of the output as the sum over all valid alignments.

The forward algorithm (similar to Viterbi but summing instead of maximizing): Let $\alpha(t, s)$ = probability of having emitted the first $s$ labels of the output (including blanks) after processing $t$ input frames.

$$\alpha(t, s) = \begin{cases} \alpha(t-1, s) + \alpha(t-1, s-1) & \text{if } l_s = \epsilon \text{ or } l_s = l_{s-2} \\ \alpha(t-1, s) + \alpha(t-1, s-1) + \alpha(t-1, s-2) & \text{if } l_s \neq \epsilon \text{ and } l_s \neq l_{s-2} \end{cases}$$

where multiplication at each step is by the emission probability of the current label from the model.

```python
import numpy as np
from typing import List

def ctc_greedy_decode(log_probs: np.ndarray, blank_id: int = 0) -> List[int]:
    """
    CTC greedy decode: take argmax at each timestep, collapse repeats,
    then remove blanks.

    Args:
        log_probs: (T, vocab_size) log-probabilities from the model
        blank_id: index of the blank token

    Returns:
        decoded token sequence
    """
    # Step 1: argmax at each timestep
    best_path = np.argmax(log_probs, axis=1)  # (T,)

    # Step 2: collapse consecutive repeats
    # "aaabbbccc" -> "abc"
    collapsed = [best_path[0]]
    for t in range(1, len(best_path)):
        if best_path[t] != best_path[t-1]:
            collapsed.append(best_path[t])

    # Step 3: remove blank tokens
    decoded = [token for token in collapsed if token != blank_id]
    return decoded

def ctc_forward_log_prob(
    log_probs: np.ndarray,
    targets: List[int],
    blank_id: int = 0
) -> float:
    """
    Compute log P(targets | log_probs) summing over all valid CTC alignments.
    This is what CTC loss uses during training.
    Args:
        log_probs: (T, vocab_size) log-probabilities from model
        targets: target token sequence (without blanks)
        blank_id: index of blank token

    Returns:
        log probability of the target sequence
    """
    T = len(log_probs)

    # Extended label sequence: insert blank between each label and at start/end
    # "cat" -> [blank, c, blank, a, blank, t, blank]
    extended = [blank_id]
    for token in targets:
        extended.append(token)
        extended.append(blank_id)
    S = len(extended)  # 2*len(targets) + 1

    NEG_INF = -1e30

    # alpha[s] = log probability of being at position s in extended at time t
    alpha = np.full(S, NEG_INF)
    alpha[0] = log_probs[0, blank_id]         # start at first blank
    if S > 1:
        alpha[1] = log_probs[0, extended[1]]  # or first real token

    def log_sum_exp(a, b):
        """Numerically stable log(exp(a) + exp(b))."""
        if a == NEG_INF:
            return b
        if b == NEG_INF:
            return a
        if a > b:
            return a + np.log1p(np.exp(b - a))
        return b + np.log1p(np.exp(a - b))

    # Forward pass over time
    for t in range(1, T):
        new_alpha = np.full(S, NEG_INF)
        for s in range(S):
            l = extended[s]
            # Can stay at same position
            new_alpha[s] = log_sum_exp(new_alpha[s], alpha[s])
            # Can advance from previous position
            if s > 0:
                new_alpha[s] = log_sum_exp(new_alpha[s], alpha[s-1])
            # Can skip the blank if current and two positions back are different labels
            if s > 1 and extended[s] != blank_id and extended[s] != extended[s-2]:
                new_alpha[s] = log_sum_exp(new_alpha[s], alpha[s-2])
            # Multiply by emission probability
            if new_alpha[s] != NEG_INF:
                new_alpha[s] += log_probs[t, l]
        alpha = new_alpha

    # Final probability: sum of ending at last blank or last real token
    if S >= 2:
        log_p = log_sum_exp(alpha[S-1], alpha[S-2])
    else:
        log_p = alpha[S-1]
    return float(log_p)

# Demo
T, vocab_size = 20, 28  # 20 frames, 26 letters + blank + space
blank_id = 0

# Simulate model outputs (log-probabilities)
np.random.seed(42)
raw_probs = np.random.dirichlet(np.ones(vocab_size), size=T)
log_probs = np.log(raw_probs + 1e-8)

# Greedy decode
decoded = ctc_greedy_decode(log_probs, blank_id=blank_id)
print(f"CTC greedy decoded: {decoded}")

# Compute CTC log-prob for target "cat" = [3, 1, 20] (a=1, c=3, t=20)
targets = [3, 1, 20]
log_p = ctc_forward_log_prob(log_probs, targets, blank_id=blank_id)
print(f"CTC log P(cat | frames) = {log_p:.4f}")
```

### Dynamic Time Warping for Time Series Similarity

Dynamic Time Warping (DTW) measures similarity between two time series that may be misaligned in time. Instead of comparing point-to-point, DTW finds the optimal alignment (warping path) between the two sequences by stretching or compressing the time axis.

This is DP: $\text{dtw}[i][j]$ = minimum accumulated distance to align the first $i$ points of series $A$ with the first $j$ points of series $B$.

$$\text{dtw}[i][j] = d(A_i, B_j) + \min\begin{pmatrix}\text{dtw}[i-1][j] \\ \text{dtw}[i][j-1] \\ \text{dtw}[i-1][j-1]\end{pmatrix}$$

ML applications:

- Comparing multivariate sensor time series (IoT anomaly detection)
- Audio template matching (word spotting without full recognition)
- Accelerometer gesture recognition
- Financial time series similarity for market regime detection
- Comparing training loss curves across experiments

```python
import numpy as np
from typing import List, Optional, Tuple

def dtw_distance(
    series_a: np.ndarray,
    series_b: np.ndarray,
    window: Optional[int] = None,
    return_path: bool = False
) -> Tuple[float, List[Tuple[int, int]]]:
    """
    Dynamic Time Warping distance between two time series.
    Args:
        series_a: (n,) or (n, d) time series
        series_b: (m,) or (m, d) time series
        window: Sakoe-Chiba band constraint (None = no constraint).
            Using a window prevents extreme warpings and speeds up DTW
        return_path: if True, return the alignment path

    Returns:
        (dtw_distance, warping_path)

    Complexity: O(n*m) without window, O(n*w) with window w
    """
    n, m = len(series_a), len(series_b)
    INF = float('inf')

    # Cost function: element-wise squared distance
    def point_distance(a, b):
        return float(np.sum((np.array(a) - np.array(b)) ** 2))

    # DTW matrix
    dp = np.full((n+1, m+1), INF)
    dp[0][0] = 0.0

    for i in range(1, n+1):
        # Apply Sakoe-Chiba band constraint if specified
        j_start = max(1, i - window) if window else 1
        j_end = min(m+1, i + window + 1) if window else m+1
        for j in range(j_start, j_end):
            cost = point_distance(series_a[i-1], series_b[j-1])
            dp[i][j] = cost + min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])

    distance = float(dp[n][m])
    if not return_path:
        return distance, []

    # Traceback the warping path
    path = [(n, m)]
    i, j = n, m
    while i > 1 or j > 1:
        if i == 1:
            j -= 1
        elif j == 1:
            i -= 1
        else:
            best = min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
            if dp[i-1][j-1] == best:
                i -= 1
                j -= 1
            elif dp[i-1][j] == best:
                i -= 1
            else:
                j -= 1
        path.append((i, j))
    path.reverse()
    return distance, path

# Demo: comparing training loss curves
np.random.seed(42)
n_steps = 100

# Simulate two training runs with different convergence speed
steps_a = np.arange(n_steps)
loss_a = 2.0 * np.exp(-steps_a / 30) + 0.5 + 0.05 * np.random.randn(n_steps)
# Run B converges faster but plateaus at a higher loss
loss_b = 2.0 * np.exp(-steps_a / 20) + 0.6 + 0.05 * np.random.randn(n_steps)

# Euclidean distance (point-to-point)
euclidean = np.sqrt(np.sum((loss_a - loss_b) ** 2))

# DTW distance (allows time warping)
dtw_dist, path = dtw_distance(loss_a, loss_b, window=15, return_path=True)

print(f"Euclidean distance: {euclidean:.3f}")
print(f"DTW distance: {dtw_dist:.3f}")
print(f"Path length: {len(path)} (vs {n_steps} diagonal)")
print("DTW allows comparing curves that converge at different rates")
```

### Bellman Equation and Value Iteration in RL

Reinforcement learning is, at its core, a DP problem. The agent must find the optimal policy - the mapping from states to actions that maximizes total expected reward.

The Bellman equation expresses the value of a state recursively:

$$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' | s, a) V^*(s') \right]$$

where:

- $V^*(s)$ = optimal value of state $s$ (maximum expected discounted future reward)
- $R(s, a)$ = immediate reward of taking action $a$ in state $s$
- $\gamma \in [0, 1)$ = discount factor
- $P(s' | s, a)$ = transition probability to state $s'$

The Bellman equation has optimal substructure: the value of $s$ depends on the values of successor states $s'$. Value iteration solves this by iterating the Bellman operator until convergence.

$$V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' | s, a) V_k(s') \right]$$

This converges to $V^*$ in the limit because the Bellman operator is a contraction mapping with factor $\gamma < 1$.

```python
import numpy as np
from typing import Tuple

class MDPValueIteration:
    """
    Value iteration for solving Markov Decision Processes.
    Bellman backup: V_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) * V_k(s') ]
    Converges when max |V_{k+1}(s) - V_k(s)| < epsilon (contraction mapping)

    Application: resource allocation for ML serving
    - States: server load levels (0%, 25%, 50%, 75%, 100%)
    - Actions: scale up, maintain, scale down
    - Rewards: throughput - cost
    """

    def __init__(
        self,
        n_states: int,
        n_actions: int,
        transition_probs: np.ndarray,  # (n_states, n_actions, n_states)
        rewards: np.ndarray,           # (n_states, n_actions)
        gamma: float = 0.95
    ):
        self.n_states = n_states
        self.n_actions = n_actions
        self.P = transition_probs
        self.R = rewards
        self.gamma = gamma

    def value_iteration(
        self,
        epsilon: float = 1e-6,
        max_iters: int = 1000
    ) -> Tuple[np.ndarray, np.ndarray, int]:
        """
        Run value iteration until convergence.

        Returns:
            (V_star, pi_star, n_iterations)
            V_star: optimal value function (n_states,)
            pi_star: optimal policy (n_states,) - action index for each state
        """
        V = np.zeros(self.n_states)

        for iteration in range(max_iters):
            V_new = np.zeros(self.n_states)
            for s in range(self.n_states):
                # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s,a) * V(s')
                Q_values = self.R[s] + self.gamma * np.dot(self.P[s], V)
                V_new[s] = np.max(Q_values)

            # Check convergence
            delta = np.max(np.abs(V_new - V))
            V = V_new
            if delta < epsilon:
                break

        # Extract optimal policy
        pi_star = np.zeros(self.n_states, dtype=np.int32)
        for s in range(self.n_states):
            Q_values = self.R[s] + self.gamma * np.dot(self.P[s], V)
            pi_star[s] = np.argmax(Q_values)

        return V, pi_star, iteration + 1

    def policy_iteration(
        self,
        max_iters: int = 100
    ) -> Tuple[np.ndarray, np.ndarray, int]:
        """
        Policy iteration: alternates between policy evaluation and improvement.
        Often converges faster than value iteration in practice.
        """
        # Initialize with arbitrary policy (all action 0)
        pi = np.zeros(self.n_states, dtype=np.int32)

        for iteration in range(max_iters):
            # Policy evaluation: compute V^pi (solve linear system)
            # V^pi(s) = R(s, pi(s)) + gamma * sum_{s'} P(s'|s,pi(s)) * V^pi(s')
            # Rearranged: (I - gamma * P^pi) * V = R^pi
            R_pi = np.array([self.R[s, pi[s]] for s in range(self.n_states)])
            P_pi = np.array([self.P[s, pi[s]] for s in range(self.n_states)])
            A = np.eye(self.n_states) - self.gamma * P_pi
            V = np.linalg.solve(A, R_pi)

            # Policy improvement
            pi_new = np.zeros(self.n_states, dtype=np.int32)
            for s in range(self.n_states):
                Q_values = self.R[s] + self.gamma * np.dot(self.P[s], V)
                pi_new[s] = np.argmax(Q_values)

            if np.all(pi_new == pi):
                break  # Policy converged
            pi = pi_new

        return V, pi, iteration + 1

# Demo: a 4-state server-load MDP
# States: [IDLE, LOW_LOAD, HIGH_LOAD, OVERLOADED]
# Actions: [SCALE_DOWN=0, MAINTAIN=1, SCALE_UP=2]
n_states, n_actions = 4, 3
states = ["IDLE", "LOW", "HIGH", "OVERLOADED"]
actions = ["scale_down", "maintain", "scale_up"]

# Transition probabilities P[s, a, s'] (must sum to 1 over s')
P = np.zeros((n_states, n_actions, n_states))
# From IDLE:
P[0, 0] = [0.9, 0.1, 0.0, 0.0]  # scale_down stays idle mostly
P[0, 1] = [0.7, 0.2, 0.1, 0.0]  # maintain, some chance of load increase
P[0, 2] = [0.8, 0.2, 0.0, 0.0]  # scale_up wasteful but stays idle
# From LOW_LOAD:
P[1, 0] = [0.3, 0.6, 0.1, 0.0]  # scale_down can become idle or stay low
P[1, 1] = [0.1, 0.5, 0.3, 0.1]  # maintain, can go high
P[1, 2] = [0.1, 0.7, 0.2, 0.0]  # scale_up, helps prevent overload
# From HIGH_LOAD:
P[2, 0] = [0.0, 0.1, 0.4, 0.5]  # scale_down risky
P[2, 1] = [0.0, 0.1, 0.6, 0.3]  # maintain, some overload risk
P[2, 2] = [0.0, 0.2, 0.7, 0.1]  # scale_up prevents overload
# From OVERLOADED:
P[3, 0] = [0.0, 0.0, 0.2, 0.8]  # scale_down makes it worse
P[3, 1] = [0.0, 0.0, 0.3, 0.7]  # maintain, still bad
P[3, 2] = [0.0, 0.1, 0.6, 0.3]  # scale_up helps

# Rewards R[s, a]: throughput - cost
# Overloaded is very bad; scaling up costs more
R = np.array([
    # scale_down  maintain  scale_up
    [  1.0,        0.5,     -0.5],  # IDLE
    [  2.0,        3.0,      2.0],  # LOW_LOAD
    [  1.0,        2.0,      3.0],  # HIGH_LOAD
    [ -5.0,       -8.0,     -3.0],  # OVERLOADED
])

mdp = MDPValueIteration(n_states, n_actions, P, R, gamma=0.95)
V_star, pi_star, n_iters = mdp.value_iteration(epsilon=1e-8)

print(f"Value iteration converged in {n_iters} iterations")
print("\nOptimal Value Function V*(s):")
for s, v in zip(states, V_star):
    print(f"  {s:15s}: {v:.3f}")
print("\nOptimal Policy pi*(s):")
for s, a in zip(states, pi_star):
    print(f"  {s:15s}: {actions[a]}")
```

### Beam Search as Approximate DP

True sequence-generation DP would maintain the exact probability of every partial sequence. With vocabulary size $V$ and sequence length $T$, this requires $O(V^T)$ space - completely infeasible.

Beam search is approximate DP: at each step, keep only the $k$ highest-probability partial sequences (the "beam"). This bounds the per-step work to scoring $O(k \cdot V)$ candidates and the stored hypotheses to $O(k \cdot T)$ tokens, at the cost of potentially missing the globally optimal sequence.

The relationship to exact search: a beam wide enough to retain every prefix degenerates to exhaustive (exact) search, while beam search with $k = 1$ is greedy decoding. Practical beam search uses $k = 4$ to $k = 50$ and achieves near-optimal quality for most tasks.

```python
import heapq
import numpy as np
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass(order=True)
class Hypothesis:
    """A beam hypothesis: a partial sequence with its score."""
    neg_score: float  # negative (length-penalized) score; heapq is a min-heap
    token_ids: List[int] = field(compare=False)
    log_prob: float = field(compare=False, default=0.0)  # raw cumulative log-prob

    @property
    def score(self) -> float:
        return -self.neg_score

    @property
    def last_token(self) -> int:
        return self.token_ids[-1] if self.token_ids else -1

def beam_search(
    model_fn: Callable[[List[int]], np.ndarray],  # returns log-probs over vocab
    beam_width: int = 4,
    max_length: int = 50,
    vocab_size: int = 32000,
    eos_token_id: int = 2,
    bos_token_id: int = 1,
    length_penalty: float = 0.6,  # Google's length penalty for neural MT
) -> List[Hypothesis]:
    """
    Beam search decoder.
    Complexity: O(T * k * V * log k) where:
        T = decoding steps, k = beam width, V = vocab size

    The length penalty prevents short sequences from dominating:
        score(hyp) = log_prob / ((5 + len) / 6)^alpha
    """
    # Initialize with start token
    initial_hyp = Hypothesis(neg_score=0.0, token_ids=[bos_token_id])
    beam = [initial_hyp]
    completed = []

    for step in range(max_length):
        if not beam:
            break

        # Collect all new candidates from expanding the beam
        # Use a heap to keep only top-k candidates
        candidates = []
        for hyp in beam:
            if hyp.last_token == eos_token_id:
                completed.append(hyp)
                continue

            # Get model predictions for this partial sequence
            log_probs = model_fn(hyp.token_ids)  # (vocab_size,)

            # Only consider the top candidates per beam hypothesis
            # This bounds the candidates to O(k^2) per step (not k*V)
            top_k = np.argpartition(log_probs, -beam_width * 2)[-beam_width * 2:]

            for token_id in top_k:
                # Accumulate the raw log-prob; apply the length penalty only
                # when ranking, so the penalty does not compound across steps
                new_log_prob = hyp.log_prob + float(log_probs[token_id])
                new_ids = hyp.token_ids + [int(token_id)]
                penalized_score = new_log_prob / ((5 + len(new_ids)) / 6) ** length_penalty
                new_hyp = Hypothesis(
                    neg_score=-penalized_score,
                    token_ids=new_ids,
                    log_prob=new_log_prob
                )
                candidates.append(new_hyp)

        # Keep top beam_width candidates
        if not candidates:
            break
        beam = heapq.nsmallest(beam_width, candidates)

    # Add remaining beam hypotheses to completed
    completed.extend(beam)

    # Sort completed hypotheses by score (descending)
    completed.sort(key=lambda h: h.score, reverse=True)
    return completed

# Demo with mock model
np.random.seed(42)
vocab_size = 100

def mock_lm(token_ids: List[int]) -> np.ndarray:
    """Deterministic mock language model."""
    rng = np.random.default_rng(sum(token_ids[:5]))
    logits = rng.standard_normal(vocab_size)
    # Slight tendency to end sequence (EOS=2) as length grows
    logits[2] += 0.1 * len(token_ids)
    # log-softmax
    log_probs = logits - np.log(np.sum(np.exp(logits - logits.max()))) - logits.max()
    return log_probs.astype(np.float32)

results = beam_search(
    model_fn=mock_lm,
    beam_width=4,
    max_length=15,
    vocab_size=vocab_size,
    eos_token_id=2,
    bos_token_id=1
)

print("Beam search results (top 3 hypotheses):")
for i, hyp in enumerate(results[:3]):
    print(f"  [{i+1}] score={hyp.score:.3f}, length={len(hyp.token_ids)}, "
          f"tokens={hyp.token_ids[:8]}...")
```

### Knapsack Problem in ML Resource Allocation

The 0/1 knapsack problem appears in ML serving: given a set of model variants (or LoRA adapters, or quantization configs) with different latency costs and accuracy values, which subset should be deployed given a total latency budget?

Let items have weights $w_i$ (latency) and values $v_i$ (accuracy gain), with total capacity $W$ (latency budget). Find the subset with maximum total value within the capacity constraint.

DP: $\text{dp}[i][c]$ = maximum value using first $i$ items with capacity $c$.

$$\text{dp}[i][c] = \max(\text{dp}[i-1][c], v_i + \text{dp}[i-1][c - w_i])$$

```python
from typing import List, Tuple

def knapsack_01(weights: List[int], values: List[float], capacity: int) -> Tuple[float, List[int]]:
    """
    0/1 Knapsack: which model variants to serve within a latency budget.
    O(n * W) time, O(n * W) space.
    Example application:
        weights = [latency_ms_of_each_model_variant]
        values = [accuracy_of_each_model_variant]
        capacity = 100  # 100ms total latency budget
    """
    n = len(weights)
    dp = [[0.0] * (capacity + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        w_i, v_i = weights[i-1], values[i-1]
        for c in range(capacity + 1):
            # Don't take item i
            dp[i][c] = dp[i-1][c]
            # Take item i if it fits
            if c >= w_i:
                dp[i][c] = max(dp[i][c], v_i + dp[i-1][c - w_i])

    # Traceback which items to include
    selected = []
    c = capacity
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i-1][c]:
            selected.append(i-1)  # item i-1 was selected
            c -= weights[i-1]
    selected.reverse()
    return dp[n][capacity], selected

# Demo: selecting model variants within latency budget
models = ["tiny-7b", "base-13b", "large-30b", "xl-70b", "retrieval-reranker"]
latencies_ms = [20, 40, 80, 150, 30]
accuracy_gains = [0.60, 0.75, 0.85, 0.92, 0.10]  # relative accuracies
budget_ms = 100  # 100ms end-to-end latency budget

best_value, selected_idx = knapsack_01(latencies_ms, accuracy_gains, budget_ms)
print(f"Optimal model selection within {budget_ms}ms budget:")
total_latency = 0
for idx in selected_idx:
    print(f"  {models[idx]}: {latencies_ms[idx]}ms, accuracy={accuracy_gains[idx]:.2f}")
    total_latency += latencies_ms[idx]
print(f"  Total latency: {total_latency}ms, Total accuracy: {best_value:.2f}")
```

---

## Mermaid: DP Pattern Recognition Guide

```mermaid
flowchart TD
    Problem["New Problem"]:::blue
    Q1{"Sequence of<br/>decisions<br/>over time?"}:::purple
    Q2{"Find optimal<br/>alignment or<br/>edit sequence?"}:::purple
    Q3{"Maximize reward<br/>over future<br/>states?"}:::purple
    Q4{"Best path<br/>through labeled<br/>sequence space?"}:::purple

    RL["Bellman Equation<br/>Value Iteration<br/>Policy Gradient"]:::teal
    Align["Edit Distance<br/>CTC Forward<br/>DTW"]:::green
    Decode["Viterbi<br/>CRF Decoding<br/>HMM Decoding"]:::green
    Gen["Beam Search<br/>Top-k Sampling<br/>Greedy Decode"]:::orange

    Problem --> Q1
    Problem --> Q2
    Problem --> Q3
    Problem --> Q4
    Q1 -->|"yes"| Gen
    Q2 -->|"yes"| Align
    Q3 -->|"yes"| RL
    Q4 -->|"yes"| Decode

    classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
    classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
    classDef teal fill:#ccfbf1,color:#134e4a,stroke:#14b8a6
    classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
    classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
```

---

## Mermaid: Viterbi Forward Pass (3-State HMM)

```mermaid
flowchart LR
    T0["Time t=0"]:::blue
    T1["Time t=1"]:::blue
    T2["Time t=2"]:::blue

    S0A["State A<br/>prob=0.6"]:::green
    S0B["State B<br/>prob=0.3"]:::teal
    S0C["State C<br/>prob=0.1"]:::orange
    S1A["State A<br/>dp=0.4"]:::green
    S1B["State B<br/>dp=0.5"]:::teal
    S1C["State C<br/>dp=0.1"]:::orange
    S2A["State A<br/>dp=0.2"]:::green
    S2B["State B<br/>dp=0.7 BEST"]:::teal
    S2C["State C<br/>dp=0.1"]:::orange

    T0 --> S0A & S0B & S0C
    T1 --> S1A & S1B & S1C
    T2 --> S2A & S2B & S2C

    S0A -->|"trans * emit"| S1A
    S0B -->|"trans * emit"| S1A
    S0A -->|"trans * emit"| S1B
    S0B -->|"trans * emit"| S1B
    S1A -->|"trans * emit"| S2B
    S1B -->|"trans * emit"| S2B

    classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
    classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
    classDef teal fill:#ccfbf1,color:#134e4a,stroke:#14b8a6
    classDef orange fill:#ffedd5,color:#7c2d12,stroke:#ea580c
```

---

## Production Engineering Notes
**1. Log-space arithmetic is mandatory for DP over probabilities.** When multiplying many probabilities together (as in Viterbi or CTC), the product underflows for realistic sequence lengths: $0.9^{100} \approx 2.7 \times 10^{-5}$ is still representable, but real per-step probabilities are far smaller than 0.9, and the running product quickly drops below float32's smallest normal value ($\approx 10^{-38}$) and becomes zero. Always work in log-space: replace multiplication with addition ($\log(ab) = \log a + \log b$) and use `logsumexp` for sums of probabilities ($\log(\sum_i e^{a_i})$ computed stably), as the `log_sum_exp` helper in the CTC code above does.

**2. Beam search is not guaranteed optimal - and that is fine.** Beam search finds a high-quality sequence but not necessarily the highest-probability sequence. For MT and summarization, this is acceptable and even sometimes beneficial (the global optimum under a language model may not be the human-preferred output). For structured prediction tasks where correctness matters (code generation, formal verification), consider breadth-first search or exhaustive search for short outputs.

**3. CTC greedy decode vs beam decode.** CTC greedy decoding (argmax at each frame, collapse repeats, remove blanks) is $O(T)$ but misses valid alternatives. CTC beam search decoding marginalizes over valid blank-label interleavings and can find a better transcription at the cost of $O(T \cdot k \cdot V)$. Production ASR systems typically default to greedy decoding for speed and offer beam search as an accuracy option (Whisper, for example, follows this pattern, though it is an attention-based decoder rather than CTC). Know your latency-accuracy tradeoff.

**4. Value iteration speed.** Tabular value iteration over small state spaces ($< 10^5$ states) is fast with vectorized NumPy. For large state spaces, use approximation methods: neural value functions (DQN), actor-critic methods, or sparse DP on only the reachable states. Value iteration requires the full transition matrix in memory - $O(S^2 A)$ space, which becomes infeasible beyond roughly $10^4$ states without exploiting sparsity.

**5. Edit distance optimization for long strings.** The standard $O(nm)$ edit distance is expensive for long strings. Optimizations: (1) compute only a diagonal band of width $2d+1$ if you know the edit distance is at most $d$: $O(nd)$ time (see the sketch below); (2) use Ukkonen's algorithm for $O(\min(n, m) \cdot d)$ with early termination; (3) for approximate matching with threshold $\theta$, use a BK-tree to find all corpus strings within distance $\theta$ while examining only a fraction of the $N$ corpus strings in practice, instead of scanning all of them.
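To make optimization (1) concrete, here is a minimal banded-Levenshtein sketch (illustrative, not from the original - the function name and the dictionary-based band representation are mine). It fills only the cells with $|i - j| \leq d$, giving $O(nd)$ time, and is exact whenever the true distance is at most $d$:

```python
def banded_edit_distance(a: str, b: str, d: int) -> int:
    """Levenshtein distance restricted to a diagonal band of half-width d.
    O(n * d) time. Exact when the true distance is <= d; otherwise returns
    a value > d, which callers can treat as "too far"."""
    n, m = len(a), len(b)
    if abs(n - m) > d:
        return d + 1  # length difference alone forces more than d edits
    INF = d + 1
    # Row i=0 inside the band: dp[0][j] = j
    prev = {j: j for j in range(0, min(m, d) + 1)}
    for i in range(1, n + 1):
        curr = {}
        for j in range(max(0, i - d), min(m, i + d) + 1):
            if j == 0:
                curr[j] = i
                continue
            best = min(
                prev.get(j, INF) + 1,                       # delete a[i-1]
                curr.get(j - 1, INF) + 1,                   # insert b[j-1]
                prev.get(j - 1, INF) + (a[i-1] != b[j-1]),  # match / substitute
            )
            curr[j] = min(best, INF)  # clamp anything past the band to d+1
        prev = curr
    return prev.get(m, INF)

# Within the band, this agrees with the full O(nm) algorithm
print(banded_edit_distance("intention", "execution", d=6))  # 5
print(banded_edit_distance("kitten", "sitting", d=3))       # 3
```

Any cost that strays outside the band is clamped to $d + 1$, so a result greater than $d$ simply means "more than $d$ edits", which is usually all a fuzzy-matching pipeline needs.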
:::danger Using DP When Subproblems Are Not Independent

DP requires that subproblem solutions are reusable across the full problem. If subproblems have side effects or their optimal solution depends on context outside the DP table, the algorithm gives wrong answers. Example: in CTC decoding, the blank token has specific interaction rules - a greedy decoder that ignores these rules will silently produce wrong outputs.

Always verify that your DP subproblem definition captures all relevant state. If you find you need to pass extra context into the DP table, that context must become part of the state dimensions.

:::

:::warning Beam Search Width Is a Hyperparameter That Affects Quality, Not Just Speed

Reducing beam width from 5 to 1 (greedy) reduces memory and compute but also reduces output quality - sometimes dramatically. For tasks where beam search finds meaningfully better outputs (translation, complex code generation), tuning beam width is part of model evaluation. For tasks where outputs are robust to greedy decoding (simple summarization, most chatbot responses), beam width matters less.

Always measure quality vs beam width for your specific task before deploying with a fixed beam width.

:::

:::tip Memoization First, Tabulation for Production

When prototyping a new DP-based ML component, start with memoization (add `@lru_cache` to your recursive function). It is faster to implement and easier to verify. Once the algorithm is correct, convert to tabulation for the production version: iterative tabulation avoids Python recursion limits, has better cache locality, and is easier to vectorize with NumPy.

The transition from memoized recursion to tabulation is mechanical: identify the base cases, determine the iteration order (usually small indices first), and replace recursive calls with table lookups.

:::

---

## Interview Q&A

**Q1: What are the two conditions required for DP, and give an ML example that satisfies both?**

The two conditions are: (1) Optimal substructure - the optimal solution to the problem contains optimal solutions to subproblems; (2) Overlapping subproblems - the same subproblems are solved multiple times in the naive recursive approach.

ML example: Viterbi decoding for NER. (1) Optimal substructure: the most probable label sequence ending at position $t$ in state $k$ consists of the most probable path to position $t-1$ in some state $k'$, plus the transition $k' \to k$ and emission of observation $t$. The optimal full path contains the optimal prefix. (2) Overlapping subproblems: for a sequence of length $T$ with $K$ states, the naive approach would enumerate $K^T$ paths. Each state-at-time pair $(t, k)$ is visited by $K^{t-1}$ paths - all paths that reach state $k$ at time $t$. DP computes the best path to each $(t, k)$ once, reducing $O(K^T)$ to $O(TK^2)$.

**Q2: How does the Viterbi algorithm differ from the forward algorithm in HMMs?**

Both use the same DP recurrence structure but differ in the aggregation operation.

The forward algorithm computes $\alpha(t, k) = P(o_{1..t}, \text{state}_t = k)$ by summing over all paths: $\alpha(t, k) = P(o_t | k) \sum_{k'} P(k | k') \alpha(t-1, k')$. The total likelihood is $\sum_k \alpha(T, k)$. Used for training (the E-step in Baum-Welch, the forward-backward algorithm).

The Viterbi algorithm computes $\text{dp}(t, k) = \max_{\text{paths}} P(o_{1..t}, \text{path ends in state } k)$ by taking the max: $\text{dp}(t, k) = P(o_t | k) \max_{k'} P(k | k') \text{dp}(t-1, k')$. The best total path is $\max_k \text{dp}(T, k)$ with traceback. Used for inference/decoding.

Both are $O(TK^2)$, but Viterbi requires storing the backpointer table (which previous state was optimal), while forward-backward stores both forward and backward variables for computing posteriors. A minimal code sketch of the forward pass follows.
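Here is that sketch (illustrative, not from the original - the function name is mine). It assumes the same array shapes as `ViterbiDecoder.decode` above and differs from Viterbi in exactly one place: the max over previous states becomes a log-sum-exp.

```python
import numpy as np

def forward_log_likelihood(
    emission_scores: np.ndarray,    # (T, K): log P(observation_t | state_k)
    transition_scores: np.ndarray,  # (K, K): log P(state_j | state_i)
    start_scores: np.ndarray,       # (K,): log P(initial state = k)
) -> float:
    """Forward algorithm: the Viterbi recurrence with max replaced by logsumexp.
    Returns log P(observations), marginalized over all state paths."""
    def logsumexp(x: np.ndarray) -> float:
        m = float(np.max(x))
        return m + float(np.log(np.sum(np.exp(x - m))))

    T, K = emission_scores.shape
    log_alpha = start_scores + emission_scores[0]  # t = 0
    for t in range(1, T):
        log_alpha = np.array([
            # Viterbi takes the max over previous states k'; forward sums instead
            logsumexp(log_alpha + transition_scores[:, k]) + emission_scores[t, k]
            for k in range(K)
        ])
    return logsumexp(log_alpha)  # total log-likelihood over all paths

# With the NER demo arrays from above:
# log_Z = forward_log_likelihood(emission_scores, transition_log, start_scores)
# Any single path's score, including Viterbi's best, is <= log_Z.
```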
**Q3: What is CTC loss and how does the DP make it trainable?**

CTC (Connectionist Temporal Classification) trains sequence-to-sequence models where the alignment between input frames and output tokens is unknown. The DP makes it trainable by computing the probability of the target transcript as the sum over all valid frame-level alignments (including all possible insertions of blank tokens and repetitions).

The CTC forward variable $\alpha(t, s)$ = probability of emitting the first $s$ extended-label symbols after processing $t$ frames. The recurrence is $O(T \cdot S)$ where $S = 2|\text{target}| + 1$ (extended with blanks). The total probability is $\alpha(T, S-1) + \alpha(T, S-2)$. The backward variable $\beta(t, s)$ is computed symmetrically. Together, they give $P(\hat{l}_{t,s}) = \alpha(t,s) \cdot \beta(t,s) / P(\text{target})$, the gradient signal for training.

Without this DP, training would require knowing the alignment - which is exactly the information we do not have in speech recognition (we know "cat" was said, but not which frames correspond to which phoneme).

**Q4: How does the Bellman equation lead to value iteration, and when does value iteration converge?**

The Bellman optimality equation is a fixed-point condition: $V^* = \mathcal{T} V^*$ where $\mathcal{T}$ is the Bellman operator $(\mathcal{T}V)(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s')]$. Value iteration applies $\mathcal{T}$ repeatedly starting from any initial $V_0$.

Convergence is guaranteed because $\mathcal{T}$ is a $\gamma$-contraction in the sup-norm: $\|\mathcal{T}V - \mathcal{T}V'\|_\infty \leq \gamma \|V - V'\|_\infty$. Since $\gamma < 1$, repeated application converges geometrically to the unique fixed point $V^*$. After $k$ iterations: $\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$. For $\gamma = 0.99$ and initial error 10, reaching precision $\epsilon$ requires $k \geq \ln(10/\epsilon) / \ln(1/0.99) \approx 100 \ln(10/\epsilon)$ iterations - about 1,600 for $\epsilon = 10^{-6}$ (checked numerically below).

Policy iteration often converges faster in practice because each policy improvement step is more informative than a single Bellman backup. But value iteration is simpler to implement and sufficient for small MDPs.
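A quick numeric check of that bound (a sketch, not from the original): solve $\gamma^k \cdot 10 \leq \epsilon$ for $k$ and confirm the geometric error bound holds.

```python
import numpy as np

gamma, eps, initial_error = 0.99, 1e-6, 10.0

# Solve gamma^k * initial_error <= eps for k
k = int(np.ceil(np.log(initial_error / eps) / np.log(1.0 / gamma)))
print(f"iterations needed: {k}")  # 1604, i.e. about 1,600

# Sanity check: the error bound after k iterations is indeed below eps
assert initial_error * gamma ** k <= eps
```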
**Q5: Why is beam search not strict dynamic programming and what does it miss?**

Strict DP over sequence generation would maintain the exact probability of every partial sequence. Beam search prunes all but the top-$k$ partial sequences at each step - permanently discarding lower-probability prefixes even if they could lead to the globally optimal complete sequence.

Concretely: suppose the globally optimal sequence starts with the 5th most likely token at step 1. With beam width $k=4$, that token is pruned after step 1 and can never be recovered. Beam search is guaranteed to find the optimum only if the optimal sequence is never pruned, which cannot be guaranteed in general. The alternative is exact search (exponential in sequence length) or optimal search with pruning bounds (A* decoding).

In practice, beam search finds sequences that are qualitatively indistinguishable from the optimum for most generation tasks, and the approximation error is acceptable. Diverse beam search (maintaining beams that are dissimilar to each other) helps when you need multiple distinct good outputs.

**Q6: How would you use dynamic time warping to compare training loss curves from different runs?**

Training loss curves often converge at different rates across runs with different hyperparameters or hardware configurations. Point-to-point comparison (Euclidean distance) penalizes runs that converge faster - two identical runs where one just progresses faster would appear dissimilar.

DTW finds the optimal time alignment: it can match the "rapid initial descent" in the fast-converging run with the correspondingly shaped region in the slow-converging run, regardless of when it occurs. DTW distance $= 0$ for identical curves with different time scales.

Implementation: apply the DTW DP with a Sakoe-Chiba band constraint (window parameter) to prevent degenerate alignments (matching the entire fast run to a single point in the slow run). The window should be about 10-20% of the sequence length.

Practical use: cluster training runs by DTW similarity of their loss curves to find hyperparameter regions with similar learning dynamics. Use DTW distance as a feature for anomaly detection in training monitoring systems - a run with unusually high DTW distance from all similar runs may have a data loading bug or hardware issue.
© 2026 EngineersOfAI. All rights reserved.