Convex Functions and Optimization - Why Some Problems Are Easy and Others Are Not
Reading time: ~22 minutes | Level: Mathematical Foundations → ML Engineering
You are training logistic regression. It converges cleanly in 50 iterations. You switch to a 5-layer neural network for the same task. Suddenly you are babysitting hyperparameters, watching loss curves plateau, wondering if you are stuck in a local minimum.
The difference is convexity. Logistic regression has a convex loss function. Gradient descent on a convex function is mathematically guaranteed to find the global minimum. Neural networks have non-convex loss functions. Gradient descent on non-convex functions offers no such guarantees.
Understanding convexity tells you exactly why some ML problems are easy to optimize and others are not - and what happens in the loss landscape when you are training a deep network.
What You Will Learn
- Formal definition of convex functions and convex sets
- How convexity guarantees global optima for gradient descent
- The geometry of convex vs. non-convex loss landscapes
- Strongly convex functions and linear convergence rates
- Why deep learning loss functions are non-convex
- Saddle points, flat regions, and local minima in practice
- Jensen's inequality and its ML applications
- Why deep learning still works despite non-convexity
Prerequisites
- Lesson 01: Derivatives and Gradients (required)
- Lesson 03: Gradient Descent Mechanics (recommended)
- Basic linear algebra (vectors, matrices)
Part 1 - Convex Sets
A set C ⊆ ℝⁿ is convex if the line segment between any two points in C lies entirely within C:
Geometric interpretation
Convex sets: Non-convex sets:
╭────╮ ╭──╮ ╭──╮
/ \ / \ / \
│ │ ✓ │ \/ │ ✗
\ / \ /
╰────╯ ╰──────────╯
(circle) (two blobs joined by a dip)
Line segment between Line segment exits
any two points stays inside the set - not convex
ML relevance: The constraint sets in regularized optimization are often convex. The feasibility of efficiently solving constrained optimization problems depends on whether both the objective and the constraint set are convex.
Common convex sets
| Set | Definition | ML Use |
|---|---|---|
| ℝⁿ (all of space) | trivially convex | unconstrained optimization |
| Halfspace {x : aᵀx ≤ b} | defined by linear inequality | linear SVM constraints |
| Ball {x : ‖x‖₂ ≤ r} | L2-bounded region | L2 regularization constraint |
| Simplex {x : Σxᵢ = 1, xᵢ ≥ 0} | probability distributions | softmax outputs |
| Positive semidefinite cone | {A : A ≽ 0} | covariance matrices |
Part 2 - Convex Functions: Definition and Tests
A function f: ℝⁿ → ℝ is convex if its domain is a convex set and for all x, y in the domain:
Geometric meaning: The function value at any weighted average of two points is at most the weighted average of the function values at those points. The function "curves upward" - it lies below the chord connecting any two points.
Convex function: Non-convex function:
f(x) f(x)
│ / │ /\/\/\
│ / │ / \
│ chord / │ / \
│ .....' │ / ←chord→\
│ ...../ ← f(x) lies below │
│ ..... │ Function pops above chord
│ / │
└─────────────── x └────────────────── x
λx + (1-λ)y
First-order condition for convexity
If f is differentiable, it is convex if and only if:
Meaning: The tangent hyperplane at any point lies below the function everywhere. The linear approximation underestimates the function. This is the geometric heart of why gradient descent works on convex functions - the first-order Taylor approximation is always a lower bound.
Second-order condition for convexity
If f is twice differentiable, it is convex if and only if the Hessian is positive semidefinite:
The Hessian ∇²f (matrix of second-order partial derivatives) is positive semidefinite when all its eigenvalues are ≥ 0, meaning the function curves upward in every direction.
import numpy as np
def check_convexity_numerical(
f,
domain_bounds: tuple,
n_samples: int = 1000
) -> bool:
"""
Test convexity numerically by checking the definition:
f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y) for random (x, y, λ)
"""
lo, hi = domain_bounds
violations = 0
tol = 1e-8
for _ in range(n_samples):
x = np.random.uniform(lo, hi)
y = np.random.uniform(lo, hi)
lam = np.random.uniform(0, 1)
midpoint = lam * x + (1 - lam) * y
convex_combo = lam * f(x) + (1 - lam) * f(y)
if f(midpoint) > convex_combo + tol:
violations += 1
return violations == 0
# Test various functions
functions = {
'x^2 (convex)': lambda x: x**2,
'x^4 (convex)': lambda x: x**4,
'exp(x) (convex)': lambda x: np.exp(x),
'x^3 (NOT convex)': lambda x: x**3,
'-x^2 (concave, NOT convex)': lambda x: -x**2,
'sin(x) (NOT convex on [-π, π])': lambda x: np.sin(x),
}
print("Convexity test results:")
for name, f in functions.items():
is_convex = check_convexity_numerical(f, (-3, 3))
print(f" {name}: {'CONVEX' if is_convex else 'NOT convex'}")
Strongly convex functions
f is μ-strongly convex if:
Strong convexity means the function curves upward by at least μ in every direction (has a lower bound on curvature). This leads to linear convergence for gradient descent:
| Type | Gradient descent convergence | Examples |
|---|---|---|
| Convex | O(1/t) (sublinear) | |
| Strongly convex | O((1-αμ)ᵗ) (linear) | L2-regularized linear/logistic regression |
| Smooth + convex | O(1/t) with larger constant | Logistic regression (no regularization) |
Part 3 - Key Convex Loss Functions in ML
Linear regression MSE
The Hessian is ∇²L = (2/n)XᵀX, which is always positive semidefinite. MSE for linear regression is convex. With L2 regularization, the Hessian is (2/n)XᵀX + 2λI, which is positive definite (strongly convex).
Logistic regression
The Hessian is (1/n)Xᵀ·diag(σ(Xw)(1-σ(Xw)))·X, which is always positive semidefinite. Logistic regression is convex. One global minimum (no local minima to get stuck in).
import numpy as np
def logistic_regression_hessian(X: np.ndarray, w: np.ndarray) -> np.ndarray:
"""
Compute the Hessian of logistic regression loss.
Should be positive semidefinite (all eigenvalues >= 0).
"""
def sigmoid(z): return 1 / (1 + np.exp(-z))
s = sigmoid(X @ w) # (n,)
S = np.diag(s * (1 - s)) # (n, n) diagonal matrix
H = (1/len(w)) * X.T @ S @ X # (d, d)
return H
# Verify PSD: all eigenvalues should be >= 0
np.random.seed(42)
n, d = 100, 5
X = np.random.randn(n, d)
w = np.random.randn(d)
H = logistic_regression_hessian(X, w)
eigenvalues = np.linalg.eigvalsh(H)
print(f"Hessian eigenvalues: {eigenvalues.round(6)}")
print(f"All non-negative: {np.all(eigenvalues >= -1e-10)}") # True → convex
Jensen's inequality
For a convex function f and random variable X:
"The function of the average is ≤ the average of the function values."
ML applications:
- Expected loss is at least the loss at the average: Averaging model predictions (ensembling) can be better than any single model when the loss is convex
- KL divergence derivation: KL(P‖Q) ≥ 0 follows from Jensen's inequality applied to -log
- Evidence lower bound (ELBO) in VAEs: Jensen's inequality bounds the intractable log-likelihood
import numpy as np
# Verify Jensen's inequality numerically
np.random.seed(42)
f = np.exp # exp is convex
n_samples = 10000
X = np.random.randn(n_samples) # random variable
# Jensen: f(E[X]) <= E[f(X)]
f_of_mean = f(np.mean(X))
mean_of_f = np.mean(f(X))
print(f"f(E[X]) = exp(E[X]) = {f_of_mean:.4f}")
print(f"E[f(X)] = E[exp(X)] = {mean_of_f:.4f}")
print(f"Jensen holds: {f_of_mean <= mean_of_f}") # True
print(f"Gap: {mean_of_f - f_of_mean:.4f}")
Part 4 - Non-Convex Loss Landscapes in Deep Learning
Neural networks have non-convex loss functions. The parameters θ enter the loss in a highly non-linear way, with multiple compositions of nonlinear functions.
Why deep network loss is non-convex
Symmetry: Any permutation of neurons in a hidden layer gives the same network output. This means there are exponentially many parameter configurations with the same loss - all "global minima" relative to each other. Gradient descent can find any of them.
Weight scaling: Scaling up the weights of layer 1 by a factor and scaling down layer 2 by the same factor preserves the network output. This creates continuous "ridges" in parameter space where the loss is constant.
Non-linear activations: ReLU, tanh, sigmoid create kinks and non-smooth regions that break convexity.
The loss landscape: what it actually looks like
Simplified 1D view of deep network loss:
L(θ)
│ local
│ local minimum global
│ max ↓ min
│ \ /\/\ /\/\ ↓
│ \ / saddle \ /\ /─────\___/
│ \/ \/ \/
│
└──────────────────────────────────── θ
Key features:
- Multiple local minima: Gradient descent may converge to different solutions depending on initialization
- Saddle points: Points where ∇f = 0 but not a minimum (some directions curve up, some curve down)
- Flat regions / plateaus: Very small gradients everywhere, slow convergence
- Sharp vs. flat minima: Sharp minima tend to generalize worse (sensitive to perturbations)
Saddle points dominate the landscape
For random high-dimensional functions, the majority of critical points (where ∇f = 0) are saddle points, not local minima. The probability of a point being a local minimum decreases exponentially with dimension.
Why this matters: At a saddle point, the gradient is zero, so gradient descent theoretically gets stuck. In practice, the noise in mini-batch gradient estimates gives a random kick that escapes saddle points. This is a key reason why SGD performs well in practice even for non-convex problems.
import numpy as np
# Visualize a saddle point in 2D: f(x,y) = x^2 - y^2
def f(x, y): return x**2 - y**2
def grad_f(x, y): return np.array([2*x, -2*y])
# Saddle point at (0, 0): gradient is 0
origin = np.array([0.0, 0.0])
print(f"Gradient at origin: {grad_f(*origin)}") # [0, 0] - looks like minimum!
# But the Hessian reveals it is a saddle point
# Hessian = [[2, 0], [0, -2]] - one positive, one negative eigenvalue
H = np.array([[2, 0], [0, -2]])
eigenvalues = np.linalg.eigvalsh(H)
print(f"Hessian eigenvalues: {eigenvalues}") # [2, -2] → saddle point!
print("Saddle point! (mixed-sign eigenvalues of Hessian)")
# Gradient descent near saddle: tiny perturbation escapes
theta = np.array([0.001, 0.0]) # slightly off saddle
lr = 0.1
print(f"\nGradient descent near saddle:")
for step in range(5):
grad = grad_f(*theta)
theta -= lr * grad
print(f" Step {step+1}: θ={theta.round(4)}, ‖∇f‖={np.linalg.norm(grad):.4f}")
# x grows (curvature positive), y stays near 0 → escapes saddle in x direction
Second-order criticality test (Hessian)
At a critical point ∇f = 0, the Hessian ∇²f tells you what type of critical point it is:
| Hessian ∇²f | Eigenvalues | Critical point type |
|---|---|---|
| Positive definite (∇²f ≻ 0) | All > 0 | Local minimum |
| Negative definite (∇²f ≺ 0) | All < 0 | Local maximum |
| Indefinite (mixed) | Some > 0, some < 0 | Saddle point |
| Positive semidefinite | All ≥ 0, some = 0 | Degenerate case |
Part 5 - Why Deep Learning Works Despite Non-Convexity
This is a central question in deep learning theory. If the loss is non-convex, why do networks consistently converge to good solutions?
Reason 1: Local minima are nearly as good as global minima
For overparameterized networks (far more parameters than data points), research suggests that most local minima have similar quality to the global minimum. The "bad" local minima essentially disappear in high dimensions.
Intuition: In 1D, a local minimum is a valley. In 1,000,000D, for a local minimum to exist, the function must curve upward in all 1,000,000 directions. This becomes exponentially rare, and when it happens, the local minimum is usually close in value to the global minimum.
Reason 2: SGD noise escapes saddle points and poor local minima
Mini-batch gradient noise provides an implicit random walk. If the noise scale is comparable to the width of a "bad" local minimum, SGD will escape it. This is why models trained with SGD often generalize better than those trained with second-order methods that converge more precisely to a specific minimum.
Reason 3: Flat minima generalize better
Not all minima are equal. A flat minimum (one in a wide valley in the loss landscape) generalizes better than a sharp minimum (one in a narrow spike). Here is why:
After training on dataset D, if the model is deployed on slightly different data D' (distribution shift, new examples), its parameters are effectively perturbed. A flat minimum maintains low loss under small perturbations; a sharp minimum does not.
SGD with small learning rates and batch noise naturally prefers flat minima - another reason SGD often generalizes better than full-batch methods.
import numpy as np
# Illustration: sharp vs flat minimum response to perturbation
def sharp_minimum(theta):
"""Loss near a sharp minimum: f(θ) = 100θ²"""
return 100 * theta**2
def flat_minimum(theta):
"""Loss near a flat minimum: f(θ) = θ²"""
return theta**2
# Both have minimum at θ=0, loss=0
perturbation = 0.1 # small distribution shift
sharp_loss = sharp_minimum(perturbation)
flat_loss = flat_minimum(perturbation)
print(f"Under perturbation of {perturbation}:")
print(f" Sharp minimum loss: {sharp_loss:.4f}") # 1.0 (big jump)
print(f" Flat minimum loss: {flat_loss:.4f}") # 0.01 (small jump)
print(f" Ratio: {sharp_loss/flat_loss:.1f}x worse for sharp minimum")
Reason 4: The loss landscape is better than expected
Empirical studies (Goodfellow et al., Li et al.) show that neural network loss landscapes, while non-convex, tend to have:
- Relatively few "bad" local minima in high-dimensional overparameterized networks
- Many saddle points but with escape directions detectable by SGD noise
- A percolation structure where one can find paths from initialization to a good minimum without climbing high barriers
Reason 5: Residual connections create a convex-like structure
ResNets add identity shortcuts: output = F(x) + x. The gradient of the loss with respect to the input passes through the identity mapping directly:
The identity term I means the gradient always has a direct path backward - this prevents vanishing gradients and creates a loss landscape that behaves more like a convex function in many respects.
Part 6 - Practical Implications
When convexity is lost: early warning signs
| Symptom | Possible cause |
|---|---|
| Loss oscillates without decreasing | Stuck in flat region or oscillating between saddle points |
| Loss converges to different values across runs | Multiple local minima; initialization-dependent |
| Very slow convergence | High condition number; using fixed learning rate in narrow valley |
| Loss suddenly spikes after many epochs | Escaped a local minimum; or learning rate too high |
| Training loss low but validation loss high | Sharp minimum (memorization); need regularization |
import numpy as np
def analyze_loss_landscape(
f, # loss function
grad_f, # gradient function
theta: np.ndarray,
perturbation_scale: float = 0.01,
n_directions: int = 10
) -> dict:
"""
Analyze local loss landscape around a critical point.
Estimate whether it looks like a minimum, saddle, or flat region.
"""
grad_norm = np.linalg.norm(grad_f(theta))
# Probe curvature in random directions
curvatures = []
for _ in range(n_directions):
direction = np.random.randn(len(theta))
direction /= np.linalg.norm(direction)
eps = perturbation_scale
# Numerical second derivative in direction
f_plus = f(theta + eps * direction)
f_center = f(theta)
f_minus = f(theta - eps * direction)
curvature = (f_plus - 2*f_center + f_minus) / (eps**2)
curvatures.append(curvature)
curvatures = np.array(curvatures)
n_positive = np.sum(curvatures > 0)
n_negative = np.sum(curvatures < 0)
if grad_norm < 1e-4:
if n_negative == 0:
point_type = "local minimum"
elif n_positive == 0:
point_type = "local maximum"
else:
point_type = f"saddle point ({n_positive} up, {n_negative} down)"
else:
point_type = "not a critical point"
return {
'grad_norm': grad_norm,
'point_type': point_type,
'avg_curvature': curvatures.mean(),
'min_curvature': curvatures.min(),
'max_curvature': curvatures.max(),
}
# Example: analyze a known saddle point
f_saddle = lambda t: t[0]**2 - t[1]**2
grad_saddle = lambda t: np.array([2*t[0], -2*t[1]])
analysis = analyze_loss_landscape(f_saddle, grad_saddle, np.array([0.0, 0.0]))
print(f"Analysis at origin:")
for k, v in analysis.items():
print(f" {k}: {v}")
Part 7 - Common Mistakes and Engineering Red Flags
:::warning Assuming your loss is convex when it is not Linear and logistic regression have convex losses - gradient descent is guaranteed to find the global minimum. Any non-linear model (neural networks, kernel methods with non-linear features) has a non-convex loss. This matters for debugging: if your linear regression is not converging, it is definitely a learning rate or data issue; if your neural network is not converging, it could also be a local minimum or saddle point issue. :::
:::tip Practical convexity check For any loss function, check the Hessian eigenvalues at the optimum. If all eigenvalues are positive, you found a local minimum. If some are negative, you are at a saddle point. If some are near zero, you are in a flat region.
import torch
def hessian_at_point(model, loss_fn, X, y):
"""Compute Hessian of loss at current parameters using torch.autograd."""
params = list(model.parameters())
# Flatten all parameters into a single vector
flat_params = torch.cat([p.view(-1) for p in params])
n = len(flat_params)
# This is expensive - only use for small models
loss = loss_fn(model(X), y)
grads = torch.autograd.grad(loss, params, create_graph=True)
flat_grads = torch.cat([g.view(-1) for g in grads])
H = torch.zeros(n, n)
for i in range(n):
second_grads = torch.autograd.grad(flat_grads[i], params, retain_graph=True)
H[i] = torch.cat([g.view(-1) for g in second_grads])
eigenvalues = torch.linalg.eigvalsh(H)
return eigenvalues
:::
:::danger Conflating "local minimum" with "bad solution" In overparameterized deep networks, local minima are not necessarily bad. A local minimum found by SGD often has similar quality to the global minimum. If your model is performing poorly, the issue is almost certainly not "stuck in a bad local minimum" - it is more likely learning rate, architecture, data quality, or regularization. :::
Interview Questions
Q1: What is a convex function, and why does convexity matter for optimization?
A function f is convex if for any two points x, y and any λ ∈ [0,1]:
Geometrically: the function lies below the chord connecting any two points - it "curves upward."
Why it matters for optimization:
-
No bad local minima: Every local minimum of a convex function is also a global minimum. Gradient descent is guaranteed to find the global optimum.
-
First-order condition is sufficient: ∇f(x) = 0 is necessary AND sufficient for x to be a global minimum (for convex functions). For non-convex functions, ∇f = 0 only identifies candidates, not guarantees.
-
Predictable convergence: For strongly convex functions, gradient descent converges at a linear rate O((1-αμ)ᵗ). For convex functions, it converges at sublinear rate O(1/t).
-
ML examples: Linear regression (MSE loss), logistic regression, SVM (hinge loss + regularization) are convex - their global optimum is guaranteed. Neural networks are non-convex.
Q2: Why is logistic regression convex but neural networks are not?
Logistic regression:
The logistic regression loss is:
The Hessian is ∇²L = (1/n)Xᵀ·diag(σ(Xw)(1-σ(Xw)))·X, which is always positive semidefinite (X.T @ D @ X with D diagonal non-negative is always PSD). So logistic regression is convex.
Neural networks:
For a 2-layer network ŷ = σ(W₂σ(W₁x)), consider what happens to the parameters (W₁, W₂):
- You can multiply W₁ by c and divide W₂ by c for any c≠0 - same output, different parameters
- This creates a "ridge" of equivalent parameter settings - many equivalent global minima
- The parameters interact multiplicatively, so the Hessian has negative eigenvalues
Additionally, the same function is represented by different parameter sets (permutation symmetry), creating multiple global minima that are not even connected by gradient flow.
Q3: What is a saddle point, and how does gradient descent handle it?
A saddle point is a critical point (∇f = 0) where the Hessian has both positive and negative eigenvalues. The function decreases in some directions and increases in others - like the center of a saddle (hyperbolic paraboloid).
Theoretically, gradient descent gets stuck at saddle points because the gradient is zero. In practice, several mechanisms help escape:
-
SGD noise: Mini-batch gradient estimates are noisy. The noise projects onto the negative curvature directions, providing a kick that moves the optimizer off the saddle.
-
Perturbation tricks: Some algorithms explicitly add noise (e.g., perturbed gradient descent) to provably escape saddle points.
-
Second-order methods: Newton's method can detect negative curvature (from the Hessian) and move in the negative curvature direction rather than the negative gradient direction.
In high-dimensional networks: The fraction of critical points that are local minima (versus saddle points) decreases exponentially with dimension. Most critical points in deep networks are saddle points, not local minima. SGD typically escapes them successfully because of gradient noise.
Q4: Why does small-batch SGD sometimes find solutions that generalize better than full-batch gradient descent?
The key is the flat vs. sharp minima distinction.
Loss landscape has many local minima. Flat minima (wide valleys) and sharp minima (narrow spikes) may have similar training loss, but different generalization behavior:
-
Flat minimum: Small perturbation to parameters → small change in loss. If training and test distributions differ slightly (real-world distribution shift), the model stays near its minimum. Generalizes well.
-
Sharp minimum: Small perturbation → large change in loss. The model is sensitive to small distribution shifts. Generalizes poorly (memorizes noise).
Why SGD prefers flat minima: The noise in mini-batch gradient estimates effectively adds a random walk to the optimization. Sharp minima are narrow - the random walk escapes them. Flat minima are wide - the random walk stays inside them. So SGD converges to and stays in flat minima.
Full-batch gradient descent has no noise, converges more precisely to whichever minimum it finds, and is more likely to settle in a sharp minimum.
This is empirically validated (Keskar et al. 2017: large-batch training consistently leads to sharp minima and worse generalization than small-batch training at the same learning rate).
Q5: State Jensen's inequality and give two ML applications.
Jensen's inequality: For a convex function f and random variable X:
"The function of the average is at most the average of the function values."
Application 1: Why ensembling works
For regression with MSE loss (convex in predictions):
The loss of the ensemble prediction is at most the average loss of the individual models. Ensembling always improves (or maintains) performance - Jensen's inequality is the mathematical guarantee.
Application 2: Deriving the ELBO in Variational Autoencoders
The VAE objective involves log p(x) which is intractable. Jensen's inequality gives:
The inequality holds because log is concave (the inequality flips). We maximize the tractable ELBO as a lower bound on the intractable log-likelihood. This is the foundation of variational inference.
Quick Reference
| Concept | Definition | Test |
|---|---|---|
| Convex set | Line segment between any two points stays in set | Check λx+(1-λ)y ∈ C |
| Convex function | f(λx+(1-λ)y) ≤ λf(x)+(1-λ)f(y) | Check Hessian eigenvalues ≥ 0 |
| Strongly convex | Curves upward by at least μ in every direction | μ-PD Hessian |
| Local minimum | ∇f = 0, Hessian PD | All eigenvalues > 0 |
| Saddle point | ∇f = 0, Hessian indefinite | Mixed eigenvalue signs |
| Jensen's inequality | f(E[X]) ≤ E[f(X)] for convex f | Apply to ensembling, ELBO |
| Flat minimum | Low curvature → good generalization | Small Hessian eigenvalues at minimum |
Key Takeaways
- A convex function lies below any chord - the Hessian is positive semidefinite everywhere
- Gradient descent on a convex function is guaranteed to find the global minimum; every local minimum is global
- Linear regression (MSE) and logistic regression are convex; neural networks are not
- Deep network loss landscapes have many saddle points (not local minima) in high dimensions - SGD noise escapes them effectively
- Flat minima generalize better than sharp minima because they are robust to small perturbations
- Despite non-convexity, deep learning works because overparameterized networks have many high-quality local minima and SGD implicitly prefers flat, generalizable solutions
- Jensen's inequality underlies ensembling guarantees and the ELBO in VAEs
Next: Lagrange Multipliers →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Convex Functions demo on the EngineersOfAI Playground - no code required.
:::
