Skip to main content

Lagrange Multipliers - Constrained Optimization and the Math Behind SVMs

Reading time: ~22 minutes | Level: Mathematical Foundations → ML Engineering

You are training a logistic regression model, and you want to prevent it from memorizing noise. You add an L2 penalty term to the loss: L(w) = cross_entropy(w) + λ‖w‖². You tune λ until the model generalizes.

But here is what you are actually doing mathematically: you are solving a constrained optimization problem. You are minimizing the cross-entropy loss subject to the constraint that the parameter norm stays bounded: ‖w‖² ≤ C. Lagrange multipliers are the mathematical machinery that converts this constrained problem into the unconstrained penalized problem you actually solve.

SVMs, regularized regression, trust regions, projection onto probability simplexes - all of these are constrained optimization problems solved via Lagrange multipliers.

What You Will Learn

  • What constrained optimization is and why it arises in ML
  • The Lagrangian function: converting constraints to penalty terms
  • Equality constraints and the geometric intuition
  • Inequality constraints and the KKT conditions
  • The SVM dual problem: deriving SVMs from first principles
  • L1 and L2 regularization as constrained optimization
  • Practical constrained optimization in Python

Prerequisites

  • Lesson 01: Derivatives and Gradients (required)
  • Lesson 04: Convex Functions and Optimization (recommended)
  • Basic understanding of SVM concept (helpful but not required)

Part 1 - Constrained Optimization Setup

The problem

Standard optimization: minimize f(x) with no restrictions on x.

Constrained optimization: minimize f(x) subject to constraints on x.

General form:

minxf(x)\min_x f(x) subject to:gi(x)=0i=1,,m(equality constraints)\text{subject to:} \quad g_i(x) = 0 \quad i = 1, \ldots, m \quad \text{(equality constraints)} hj(x)0j=1,,p(inequality constraints)\quad\quad\quad\quad\quad h_j(x) \leq 0 \quad j = 1, \ldots, p \quad \text{(inequality constraints)}

Why constraints arise in ML

ML problemConstraintWhy
L2-regularized regression‖w‖² ≤ CLimit parameter magnitude
L1-regularized regression‖w‖₁ ≤ CPromote sparsity
SVM (hard margin)yᵢ(wᵀxᵢ + b) ≥ 1Enforce margin
Attention weightsΣᵢ αᵢ = 1Probability simplex
Probability outputpᵢ ≥ 0, Σᵢ pᵢ = 1Valid distribution
Trust region‖Δθ‖ ≤ rLimit step size

A motivating example

Unconstrained minimization: min f(x,y) = x² + y² has minimum at origin.

Constrained minimization: min f(x,y) = x² + y² subject to x + y = 1.

Without the constraint, we move freely to (0,0). With the constraint, we must stay on the line x+y=1. The minimum on this line is at (0.5, 0.5).

import numpy as np
from scipy.optimize import minimize

# Solve constrained problem numerically
def objective(xy):
x, y = xy
return x**2 + y**2

constraint = {'type': 'eq', 'fun': lambda xy: xy[0] + xy[1] - 1}

result = minimize(objective, [0.0, 0.0], constraints=[constraint])
print(f"Constrained minimum: x={result.x[0]:.4f}, y={result.x[1]:.4f}")
print(f"Objective value: {result.fun:.4f}")
# Output: x=0.5, y=0.5, value=0.5

Part 2 - The Lagrangian: Equality Constraints

Geometric intuition

For minimizing f(x) subject to g(x) = 0, at the optimal point x*:

  • The gradient of f must be parallel to the gradient of g
  • If they were not parallel, you could move along the constraint surface (keeping g = 0) in a direction that decreases f

Mathematically: ∇f(x*) = λ∇g(x*) for some scalar λ (the Lagrange multiplier).

The Lagrangian function

Instead of solving the constrained problem directly, we introduce the Lagrangian:

L(x,λ)=f(x)+λg(x)\mathcal{L}(x, \lambda) = f(x) + \lambda g(x)

The Lagrange multiplier λ is an auxiliary variable. The constrained optimum satisfies:

xL=0f(x)+λg(x)=0\nabla_x \mathcal{L} = 0 \quad \Rightarrow \quad \nabla f(x) + \lambda \nabla g(x) = 0 λL=0g(x)=0(enforces the constraint)\nabla_\lambda \mathcal{L} = 0 \quad \Rightarrow \quad g(x) = 0 \quad \text{(enforces the constraint)}

Example: solve min x² + y² subject to x + y = 1

Lagrangian: L(x, y, λ) = x² + y² + λ(x + y - 1)

Stationarity conditions:

  • ∂L/∂x = 2x + λ = 0 → x = -λ/2
  • ∂L/∂y = 2y + λ = 0 → y = -λ/2
  • ∂L/∂λ = x + y - 1 = 0 → constraint

From the first two: x = y. Substituting into constraint: 2x = 1, so x = y = 0.5. And λ = -1.

import numpy as np
from scipy.optimize import fsolve

# Solve the KKT system for equality constraint
def kkt_system(vars):
"""
Stationarity conditions for min x^2 + y^2 s.t. x + y = 1
"""
x, y, lam = vars
# dL/dx = 0: 2x + lambda = 0
eq1 = 2*x + lam
# dL/dy = 0: 2y + lambda = 0
eq2 = 2*y + lam
# dL/dlambda = 0: constraint g(x,y) = x + y - 1 = 0
eq3 = x + y - 1
return [eq1, eq2, eq3]

solution = fsolve(kkt_system, [0.0, 0.0, 0.0])
x_opt, y_opt, lambda_opt = solution
print(f"Optimal x: {x_opt:.4f}, y: {y_opt:.4f}")
print(f"Lagrange multiplier λ: {lambda_opt:.4f}")
print(f"Constraint satisfied: {abs(x_opt + y_opt - 1) < 1e-6}")
print(f"Gradient condition: ∇f = {[2*x_opt, 2*y_opt]}, λ∇g = {[lambda_opt, lambda_opt]}")

Interpretation of the Lagrange multiplier

The Lagrange multiplier λ has an important interpretation: it is the rate at which the optimal value changes as the constraint is relaxed.

If the constraint is g(x) = c (instead of g(x) = 0), then: df(x(c))dc=λ\frac{d f(x^*(c))}{dc} = -\lambda

In economics: λ is the "shadow price" of the constraint - how much you pay (in objective value) for tightening the constraint.

In ML regularization: λ in L(w) = f(w) + λ‖w‖² controls how much we penalize for violating the constraint ‖w‖² ≤ C. Larger λ → tighter constraint → more regularization.

Part 3 - Inequality Constraints and KKT Conditions

Extending to inequality constraints

For min f(x) subject to h(x) ≤ 0, the optimal point can either:

  1. Lie in the interior (h(x*) < 0) - constraint is not active, acts like unconstrained optimization
  2. Lie on the boundary (h(x*) = 0) - constraint is active, like an equality constraint

The KKT (Karush-Kuhn-Tucker) conditions handle both cases elegantly.

KKT conditions

For the general problem: minxf(x)s.t.gi(x)=0,hj(x)0\min_x f(x) \quad \text{s.t.} \quad g_i(x) = 0, \quad h_j(x) \leq 0

The KKT conditions are necessary for a local optimum (and sufficient if f is convex):

1. Stationarity (gradient condition): f(x)+iλigi(x)+jμjhj(x)=0\nabla f(x^*) + \sum_i \lambda_i \nabla g_i(x^*) + \sum_j \mu_j \nabla h_j(x^*) = 0

2. Primal feasibility (constraints satisfied): gi(x)=0,hj(x)0g_i(x^*) = 0, \quad h_j(x^*) \leq 0

3. Dual feasibility (inequality multipliers non-negative): μj0\mu_j \geq 0

4. Complementary slackness (either constraint is active or multiplier is zero): μjhj(x)=0\mu_j h_j(x^*) = 0

Understanding complementary slackness

Complementary slackness says: either h_j(x*) = 0 (constraint is active/tight) or μ_j = 0 (multiplier is zero). They cannot both be non-zero.

Geometric intuition:

  • If constraint is not active (h_j < 0): constraint plays no role in optimality. Set μ_j = 0.
  • If constraint is active (h_j = 0): like an equality constraint. μ_j ≥ 0 adjusts gradient condition.
import numpy as np
from scipy.optimize import minimize

def solve_with_inequality_constraint():
"""
Solve: min f(x,y) = (x-2)^2 + (y-3)^2
subject to: x^2 + y^2 <= 4 (stay in disk of radius 2)
"""
def objective(xy):
x, y = xy
return (x-2)**2 + (y-3)**2

# Constraint: x^2 + y^2 - 4 <= 0
constraint = {
'type': 'ineq', # scipy uses >= 0 convention
'fun': lambda xy: 4 - xy[0]**2 - xy[1]**2
}

# Unconstrained minimum: (2, 3) - outside the disk
result_unconstrained = minimize(objective, [0.0, 0.0])
print(f"Unconstrained minimum: {result_unconstrained.x.round(4)}")

# Constrained minimum: closest point on disk boundary to (2,3)
result_constrained = minimize(objective, [0.5, 0.5], constraints=[constraint])
print(f"Constrained minimum: {result_constrained.x.round(4)}")
print(f"Is on boundary? {abs(np.linalg.norm(result_constrained.x) - 2) < 1e-4}")

# Verify KKT: constraint is active (boundary), so mu > 0
# Direction from origin to (2,3): normalized
direction = np.array([2.0, 3.0]) / np.sqrt(13)
point_on_boundary = 2 * direction # radius 2 in direction of (2,3)
print(f"KKT solution: {point_on_boundary.round(4)}") # ≈ [1.109, 1.664]

solve_with_inequality_constraint()

Part 4 - SVM: Lagrange Multipliers in Action

Support Vector Machines are the cleanest example of constrained optimization in ML.

The primal SVM problem

For linearly separable data, find hyperplane (w, b) that:

  • Classifies all training points correctly
  • Maximizes the margin (distance from hyperplane to nearest points)

Primal (hard-margin SVM): minw,b12w2\min_{w, b} \frac{1}{2} \|w\|^2 subject to:yi(wTxi+b)1i=1,,n\text{subject to:} \quad y_i(w^T x_i + b) \geq 1 \quad \forall i = 1, \ldots, n

The objective minimizes ‖w‖² (equivalent to maximizing the margin 2/‖w‖). The constraints enforce correct classification with margin ≥ 1/‖w‖.

Lagrangian for SVM

Introduce Lagrange multipliers αᵢ ≥ 0 for each constraint:

L(w,b,α)=12w2i=1nαi[yi(wTxi+b)1]\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i [y_i(w^T x_i + b) - 1]

Deriving the dual problem

Stationarity with respect to w: Lw=wiαiyixi=0    w=iαiyixi\frac{\partial \mathcal{L}}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \implies w = \sum_i \alpha_i y_i x_i

Stationarity with respect to b: Lb=iαiyi=0    iαiyi=0\frac{\partial \mathcal{L}}{\partial b} = -\sum_i \alpha_i y_i = 0 \implies \sum_i \alpha_i y_i = 0

Substituting w = Σᵢ αᵢ yᵢ xᵢ back into the Lagrangian gives the dual problem:

maxαiαi12i,jαiαjyiyjxiTxj\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j subject to:αi0,iαiyi=0\text{subject to:} \quad \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0

What the dual reveals

  1. The solution w depends only on αᵢ > 0: Points with αᵢ = 0 contribute nothing to w. By complementary slackness, αᵢ > 0 only when yᵢ(wᵀxᵢ + b) = 1 (exactly on the margin boundary). These are the support vectors.

  2. The kernel trick appears naturally: The dual depends only on dot products xᵢᵀxⱼ. Replace these with K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ) to get non-linear SVMs without explicitly computing φ.

  3. Sparsity: Most αᵢ = 0. SVMs are sparse models - prediction depends only on support vectors.

import numpy as np

def simple_svm_dual(X: np.ndarray, y: np.ndarray, C: float = None) -> np.ndarray:
"""
Solve the SVM dual problem via scipy.
Returns Lagrange multipliers alpha_i.
"""
from scipy.optimize import minimize as sp_minimize

n = len(y)
# Gram matrix: K[i,j] = x_i^T x_j
K = X @ X.T

def dual_objective(alpha):
"""Negative dual (we minimize, dual is a maximization)."""
return -np.sum(alpha) + 0.5 * alpha @ (y[:, None] * y[None, :] * K) @ alpha

def dual_gradient(alpha):
return -np.ones(n) + (y[:, None] * y[None, :] * K) @ alpha

constraints = [{'type': 'eq', 'fun': lambda a: np.sum(a * y)}]

if C is not None:
bounds = [(0, C)] * n # soft-margin SVM: 0 <= alpha_i <= C
else:
bounds = [(0, None)] * n # hard-margin: alpha_i >= 0

result = sp_minimize(
dual_objective, np.zeros(n),
jac=dual_gradient,
method='SLSQP',
bounds=bounds,
constraints=constraints
)
return result.x

# Example with linearly separable data
np.random.seed(42)
X_pos = np.random.randn(20, 2) + np.array([2, 2])
X_neg = np.random.randn(20, 2) + np.array([-2, -2])
X = np.vstack([X_pos, X_neg])
y = np.array([1.0]*20 + [-1.0]*20)

alpha = simple_svm_dual(X, y)

# Recover w from dual solution
w = np.sum(alpha[:, None] * y[:, None] * X, axis=0)

support_vectors = np.where(alpha > 1e-4)[0]
print(f"Number of support vectors: {len(support_vectors)}")
print(f"Support vector indices: {support_vectors}")
print(f"Weight vector w: {w.round(4)}")
print(f"Margin: {2 / np.linalg.norm(w):.4f}")

Part 5 - Regularization as Constrained Optimization

The Lagrangian connection makes explicit what regularization is doing.

L2 regularization (Ridge)

Constrained form: min f(w) subject to ‖w‖² ≤ C

Lagrangian: L(w, λ) = f(w) + λ(‖w‖² - C) = f(w) + λ‖w‖²

Unconstrained penalized form (after absorbing -λC into constant): min f(w) + λ‖w‖²

This is exactly the Ridge regression / L2-regularized logistic regression objective. The regularization strength λ is the Lagrange multiplier.

KKT complementary slackness: Either ‖w‖² = C (constraint active, λ > 0) or λ = 0 (no regularization needed, unconstrained minimum satisfies ‖w‖² ≤ C already).

L1 regularization (Lasso)

Constrained form: min f(w) subject to ‖w‖₁ ≤ C

Lagrangian: L(w, λ) = f(w) + λ‖w‖₁

Penalized form: min f(w) + λ‖w‖₁

The L1 ball (constraint region) has corners at coordinate axes. When the loss function's level sets touch this region, they most often touch at a corner - where one or more weights are exactly zero. This is the geometric reason L1 regularization produces sparse models.

import numpy as np
from scipy.optimize import minimize

def regularized_regression_comparison():
"""
Show equivalence between constrained and penalized formulations.
"""
np.random.seed(42)
n, d = 50, 10
X = np.random.randn(n, d)
true_w = np.array([1.0, -2.0] + [0.0] * 8) # sparse true weights
y = X @ true_w + 0.1 * np.random.randn(n)

def mse(w): return np.mean((X @ w - y)**2)

# Penalized form: min MSE + lambda * ||w||^2
lambda_val = 0.5
def ridge_objective(w): return mse(w) + lambda_val * np.sum(w**2)

result_ridge = minimize(ridge_objective, np.zeros(d))
w_ridge = result_ridge.x

# Equivalent constrained form: min MSE s.t. ||w||^2 <= C
# (find C that corresponds to lambda_val)
C = np.sum(w_ridge**2) # the constraint that becomes active

def mse_with_l2_constraint(w):
return mse(w)

l2_constraint = {'type': 'ineq', 'fun': lambda w: C - np.sum(w**2)}
result_constrained = minimize(mse_with_l2_constraint, np.zeros(d),
constraints=[l2_constraint])
w_constrained = result_constrained.x

print("Equivalence of penalized vs constrained Ridge regression:")
print(f"Penalized (λ={lambda_val}): w = {w_ridge[:5].round(4)} ...")
print(f"Constrained (C={C:.4f}): w = {w_constrained[:5].round(4)} ...")
print(f"Max difference: {np.max(np.abs(w_ridge - w_constrained)):.2e}")

regularized_regression_comparison()

Soft-margin SVM: regularization and slack variables

The hard-margin SVM fails when data is not linearly separable. The soft-margin SVM introduces slack variables ξᵢ ≥ 0:

minw,b,ξ12w2+Ciξi\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C\sum_i \xi_i s.t.:yi(wTxi+b)1ξi,ξi0\text{s.t.:} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0

The parameter C controls the trade-off:

  • Large C: penalize violations heavily → small slack, larger margin → risk overfitting
  • Small C: allow violations → large slack, smaller margin → more regularization

This is equivalent to hinge loss + L2 regularization: min ‖w‖² + C·Σ max(0, 1 - yᵢ(wᵀxᵢ + b))

Part 6 - Practical Constrained Optimization

Projected gradient descent

For simple constraint sets (e.g., L2 ball, simplex), we can use projected gradient descent: take a gradient step, then project back onto the feasible set.

import numpy as np

def project_onto_l2_ball(w: np.ndarray, radius: float) -> np.ndarray:
"""Project w onto the L2 ball ‖w‖ ≤ radius."""
norm = np.linalg.norm(w)
if norm <= radius:
return w # already feasible
return w * (radius / norm) # scale to radius

def project_onto_simplex(v: np.ndarray) -> np.ndarray:
"""
Project onto probability simplex: {w : w_i ≥ 0, Σw_i = 1}.
Uses the O(n log n) algorithm of Duchi et al.
"""
n = len(v)
u = np.sort(v)[::-1]
cssv = np.cumsum(u)
rho = np.max(np.where(u * np.arange(1, n+1) > (cssv - 1)))
theta = (cssv[rho] - 1) / (rho + 1)
return np.maximum(v - theta, 0)

def projected_gradient_descent(
grad_fn,
theta0: np.ndarray,
projection_fn,
lr: float = 0.01,
n_steps: int = 100
) -> np.ndarray:
"""Gradient descent with projection onto feasible set."""
theta = theta0.copy()
for _ in range(n_steps):
grad = grad_fn(theta)
theta = theta - lr * grad # gradient step
theta = projection_fn(theta) # project back to feasible set
return theta

# Example: minimize quadratic subject to unit L2 ball
grad_fn = lambda w: 2 * w - np.array([3.0, 4.0]) # gradient of ||w - [1.5, 2]||^2
project_l2 = lambda w: project_onto_l2_ball(w, radius=1.0)

theta0 = np.array([0.5, 0.5])
theta_opt = projected_gradient_descent(grad_fn, theta0, project_l2, lr=0.1, n_steps=200)

print(f"Constrained minimum: {theta_opt.round(4)}")
print(f"Norm: {np.linalg.norm(theta_opt):.4f}") # should be 1.0 (on boundary)
print(f"Expected: {(np.array([3., 4.]) / 5).round(4)}") # unit vector in direction [3,4]

Augmented Lagrangian method

For harder constraints, augmented Lagrangian adds a quadratic penalty to improve conditioning:

Lρ(x,λ)=f(x)+λg(x)+ρ2g(x)2\mathcal{L}_\rho(x, \lambda) = f(x) + \lambda g(x) + \frac{\rho}{2} g(x)^2

This alternates between:

  1. Minimizing over x (unconstrained optimization)
  2. Updating λ ← λ + ρ·g(x)
import numpy as np
from scipy.optimize import minimize as sp_minimize

def augmented_lagrangian(
f, grad_f, g, grad_g,
x0: np.ndarray,
rho: float = 1.0,
n_outer: int = 20,
n_inner: int = 100,
lr: float = 0.01
) -> np.ndarray:
"""
Augmented Lagrangian method for equality constrained optimization.
min f(x) s.t. g(x) = 0
"""
x = x0.copy()
lam = 0.0 # Lagrange multiplier estimate

for outer in range(n_outer):
# Minimize augmented Lagrangian over x
def aug_lag(x):
g_val = g(x)
return f(x) + lam * g_val + (rho/2) * g_val**2

def aug_lag_grad(x):
g_val = g(x)
return grad_f(x) + (lam + rho * g_val) * grad_g(x)

result = sp_minimize(aug_lag, x, jac=aug_lag_grad, method='L-BFGS-B')
x = result.x

# Update Lagrange multiplier
lam = lam + rho * g(x)

if abs(g(x)) < 1e-8:
print(f"Converged at outer iteration {outer+1}")
break

return x

# Example: min x^2 + y^2 s.t. x + y = 1
f = lambda xy: xy[0]**2 + xy[1]**2
grad_f = lambda xy: 2 * xy
g = lambda xy: xy[0] + xy[1] - 1
grad_g = lambda xy: np.ones(2)

x_opt = augmented_lagrangian(f, grad_f, g, grad_g, x0=np.array([0.0, 0.0]))
print(f"\nAugmented Lagrangian solution: {x_opt.round(6)}")
print(f"Constraint satisfied: |g(x)| = {abs(g(x_opt)):.2e}")

Part 7 - Common Mistakes and Engineering Red Flags

:::warning Confusing the Lagrange multiplier with the regularization strength The regularization strength λ in L(w) + λ‖w‖² is the Lagrange multiplier from the constrained formulation. But scikit-learn and some other libraries use C = 1/λ (so larger C = less regularization). Always check which convention the library uses.

# sklearn LogisticRegression uses C (inverse regularization strength)
# L(w) = cross_entropy + (1/C) * ||w||^2
# C=1.0 → λ=1.0 (default, moderate regularization)
# C=0.01 → λ=100 (strong regularization)
# C=100 → λ=0.01 (weak regularization)

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C=0.1) # strong regularization

:::

:::danger KKT conditions are necessary but not always sufficient For non-convex problems, satisfying the KKT conditions does not guarantee a global or even local minimum - it just means you might be at a critical point. For convex problems with constraint qualification, KKT conditions are sufficient for a global minimum. :::

:::tip Dual problems are sometimes easier to solve The SVM dual problem is often easier to solve than the primal because:

  1. The dual dimension is n (number of training examples), not d (feature dimension)
  2. The dual structure allows the kernel trick
  3. Most αᵢ = 0 at the solution (sparsity), which is exploitable

For very high-dimensional feature spaces (kernel methods), always prefer the dual. :::

Interview Questions

Q1: What are Lagrange multipliers and when do you use them?

Lagrange multipliers are auxiliary variables that convert a constrained optimization problem into an unconstrained one.

For equality constraints (min f(x) s.t. g(x) = 0):

The Lagrangian is L(x, λ) = f(x) + λg(x). At the optimum, both ∂L/∂x = 0 (gradient of objective balances gradient of constraint) and ∂L/∂λ = 0 (the constraint is satisfied).

Geometric intuition: At the optimum, ∇f must be parallel to ∇g. If it were not, you could move along the constraint surface (keeping g = 0) in a direction that decreases f - contradicting optimality.

In ML: Ridge regression (L2 regularization), Lasso (L1), SVM, trust region methods, projection onto probability simplex - all are constrained optimization problems with associated Lagrange multipliers.

The multiplier value λ measures the sensitivity of the optimal objective to the constraint: how much the minimum increases when you tighten the constraint.

Q2: What are the KKT conditions and when are they sufficient?

KKT conditions (for min f(x) s.t. gᵢ(x)=0, hⱼ(x)≤0):

  1. Stationarity: ∇f(x*) + Σᵢλᵢ∇gᵢ(x*) + Σⱼμⱼ∇hⱼ(x*) = 0
  2. Primal feasibility: gᵢ(x*)=0, hⱼ(x*)≤0
  3. Dual feasibility: μⱼ ≥ 0
  4. Complementary slackness: μⱼhⱼ(x*) = 0

Necessity: KKT conditions are necessary for any local optimum (under constraint qualifications - e.g., LICQ: constraint gradients are linearly independent).

Sufficiency: KKT conditions are sufficient for a global optimum when:

  • f is convex
  • Equality constraint functions gᵢ are affine (linear)
  • Inequality constraint functions hⱼ are convex

Practical takeaway: For SVM (convex problem + linear/convex constraints), satisfying KKT guarantees the global optimum. For neural networks (non-convex), KKT is only a necessary condition.

Q3: Derive why L1 regularization leads to sparse solutions but L2 does not.

Constrained optimization perspective:

L2 constraint (Ridge): ‖w‖² ≤ C - the feasible region is a smooth sphere in d dimensions.

L1 constraint (Lasso): ‖w‖₁ ≤ C - the feasible region is a diamond (cross-polytope) with corners at the coordinate axes.

Why sparsity arises:

Optimization finds the point where the loss function's level set first touches the constraint region. The loss function's level sets are smooth (typically ellipsoids for convex losses). For L2, the sphere has no corners - the first contact is at a smooth point where no coordinate is forced to exactly zero.

For L1, the diamond has corners at coordinates like (C, 0, 0, ..., 0). These corners are the most common first-contact points because the corners jut out farther than any smooth point in each coordinate direction. Contact at a corner forces one or more weights to be exactly zero - sparsity.

Gradient perspective: The subdifferential of |wᵢ| at wᵢ = 0 is the interval [-1, 1]. Stationarity requires the loss gradient ∂L/∂wᵢ ∈ [-λ, λ]. For many features, this is satisfied with wᵢ = 0 - the optimal sparse solution.

Practical implication: L1 for feature selection (many weights exactly zero), L2 for shrinkage (all weights small but non-zero), ElasticNet = L1 + L2 for mixed behavior.

Q4: Explain the SVM dual problem and what Lagrange multipliers mean in that context.

Primal SVM: min (1/2)‖w‖² s.t. yᵢ(wᵀxᵢ+b) ≥ 1 for all i.

Lagrangian: L(w,b,α) = (1/2)‖w‖² - Σᵢ αᵢ[yᵢ(wᵀxᵢ+b) - 1], with αᵢ ≥ 0.

Deriving the dual: Setting ∂L/∂w = 0 gives w = Σᵢ αᵢyᵢxᵢ. Setting ∂L/∂b = 0 gives Σᵢ αᵢyᵢ = 0. Substituting back gives the dual:

max Σᵢ αᵢ - (1/2)Σᵢ,ⱼ αᵢαⱼyᵢyⱼ(xᵢᵀxⱼ) s.t. αᵢ ≥ 0, Σᵢ αᵢyᵢ = 0.

What αᵢ mean:

  • αᵢ = 0: point xᵢ is correctly classified and not on the margin - it contributes nothing to w
  • αᵢ > 0: point xᵢ is on the margin boundary - these are support vectors
  • By complementary slackness: αᵢ[yᵢ(wᵀxᵢ+b) - 1] = 0, so αᵢ > 0 only when the margin constraint is tight

Key insights:

  1. The solution w is a linear combination of support vectors: w = Σᵢ αᵢyᵢxᵢ
  2. The dual depends only on dot products xᵢᵀxⱼ → enables the kernel trick
  3. Most training points have αᵢ = 0 → prediction uses only support vectors (sparse)
Q5: How is regularization in ML equivalent to constrained optimization?

L2 regularization (Ridge regression: min f(w) + λ‖w‖²) is equivalent to the constrained problem:

min f(w) subject to ‖w‖² ≤ C

where C and λ are related by the Lagrange multiplier condition at the optimum.

Proof: The Lagrangian for the constrained problem is L(w, μ) = f(w) + μ(‖w‖² - C). Setting ∂L/∂w = 0 gives ∇f(w) + 2μw = 0, which is the same as the stationarity condition for the penalized problem min f(w) + λ‖w‖² with λ = μ.

The Lagrange multiplier λ is the shadow price of the constraint: if you relax the constraint (increase C), the optimal loss decreases by λ per unit increase in C.

Practical implications:

  1. Tuning λ is equivalent to tuning the constraint size C - two ways of specifying the same regularization
  2. The "right" λ depends on the scale of f(w) - standardizing features before training makes λ more interpretable
  3. L1 regularization corresponds to an L1-ball constraint; Elastic Net to an intersection of L1 and L2 balls

Quick Reference

ConceptFormulaCode
Lagrangian (equality)L(x,λ) = f(x) + λg(x)scipy.optimize.minimize(constraints={'type':'eq'})
Lagrangian (inequality)L(x,μ) = f(x) + μh(x)constraints={'type':'ineq'}
KKT stationarity∇f + Σλᵢ∇gᵢ + Σμⱼ∇hⱼ = 0Verify at solution
KKT comp. slacknessμⱼhⱼ = 0Either constraint active or μⱼ=0
SVM primalmin (1/2)‖w‖² s.t. yᵢ(wᵀxᵢ+b)≥1sklearn.svm.SVC
SVM dualmax Σαᵢ - (1/2)ΣΣαᵢαⱼyᵢyⱼ(xᵢᵀxⱼ)Solved internally
L2 as constraintmin f(w) s.t. ‖w‖²≤CEquiv. to min f(w)+λ‖w‖²
L1 as constraintmin f(w) s.t. ‖w‖₁≤CEquiv. to min f(w)+λ‖w‖₁
Projected GDθ ← Proj(θ - α∇f(θ))project_onto_l2_ball(theta)

Key Takeaways

  • Constrained optimization arises naturally in ML: regularization, SVMs, probability constraints, trust regions
  • The Lagrangian L(x,λ) = f(x) + λg(x) converts an equality-constrained problem to an unconstrained stationarity condition
  • The Lagrange multiplier λ measures the rate at which the optimal objective changes as the constraint is relaxed - it is the "cost" of the constraint
  • KKT conditions extend Lagrange multipliers to inequality constraints via complementary slackness (μⱼhⱼ = 0)
  • The SVM dual problem arises from applying Lagrange multipliers to the primal and reveals that the solution depends only on support vectors and dot products (enabling the kernel trick)
  • L1 and L2 regularization are constrained optimization problems - the regularization strength λ is the corresponding Lagrange multiplier
  • Projected gradient descent handles simple constraint sets (L2 ball, simplex) efficiently without solving a constrained program explicitly

Next: Taylor Series and Approximations →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Lagrange Multipliers demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.