Convex Functions and Optimization - Why Some Problems Are Easy and Others Are Not

Reading time: ~22 minutes | Level: Mathematical Foundations → ML Engineering

You are training logistic regression. It converges cleanly in 50 iterations. You switch to a 5-layer neural network for the same task. Suddenly you are babysitting hyperparameters, watching loss curves plateau, wondering if you are stuck in a local minimum.

The difference is convexity. Logistic regression has a convex loss function. Gradient descent on a convex function is mathematically guaranteed to find the global minimum. Neural networks have non-convex loss functions. Gradient descent on non-convex functions offers no such guarantees.

Understanding convexity tells you exactly why some ML problems are easy to optimize and others are not - and what happens in the loss landscape when you are training a deep network.

What You Will Learn

Formal definition of convex functions and convex sets
How convexity guarantees global optima for gradient descent
The geometry of convex vs. non-convex loss landscapes
Strongly convex functions and linear convergence rates
Why deep learning loss functions are non-convex
Saddle points, flat regions, and local minima in practice
Jensen's inequality and its ML applications
Why deep learning still works despite non-convexity

Prerequisites

Lesson 01: Derivatives and Gradients (required)
Lesson 03: Gradient Descent Mechanics (recommended)
Basic linear algebra (vectors, matrices)

Part 1 - Convex Sets

A set C ⊆ ℝⁿ is convex if the line segment between any two points in C lies entirely within C:

$\forall\, x, y \in C \text{ and } \forall\, \lambda \in [0, 1]: \quad \lambda x + (1-\lambda) y \in C$

Geometric interpretation

Convex sets:                Non-convex sets:
    ╭────╮                      ╭──╮    ╭──╮
   /      \                    /    \  /    \
  │        │      ✓           │      \/      │  ✗
   \      /                    \            /
    ╰────╯                      ╰──────────╯
  (circle)                    (two blobs joined by a dip)

Line segment between          Line segment exits
any two points stays inside   the set - not convex

ML relevance: The constraint sets in regularized optimization are often convex. The feasibility of efficiently solving constrained optimization problems depends on whether both the objective and the constraint set are convex.

Common convex sets

Set	Definition	ML Use
ℝⁿ (all of space)	trivially convex	unconstrained optimization
Halfspace {x : aᵀx ≤ b}	defined by linear inequality	linear SVM constraints
Ball {x : ‖x‖₂ ≤ r}	L2-bounded region	L2 regularization constraint
Simplex {x : Σxᵢ = 1, xᵢ ≥ 0}	probability distributions	softmax outputs
Positive semidefinite cone	{A : A ≽ 0}	covariance matrices

Part 2 - Convex Functions: Definition and Tests

A function f: ℝⁿ → ℝ is convex if its domain is a convex set and for all x, y in the domain:

$f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y) \quad \forall \lambda \in [0, 1]$

Geometric meaning: The function value at any weighted average of two points is at most the weighted average of the function values at those points. The function "curves upward" - it lies below the chord connecting any two points.

Convex function:                 Non-convex function:

f(x)                            f(x)
│           /                   │    /\/\/\
│          /                    │   /      \
│    chord /                    │  /        \
│    .....'                     │ /    ←chord→\
│   ...../ ← f(x) lies below   │
│  .....                        │ Function pops above chord
│ /                             │
└─────────────── x              └────────────────── x
    λx + (1-λ)y

First-order condition for convexity

If f is differentiable, it is convex if and only if:

$f(y) \geq f(x) + \nabla f(x)^T (y - x) \quad \forall\, x, y$

Meaning: The tangent hyperplane at any point lies below the function everywhere. The linear approximation underestimates the function. This is the geometric heart of why gradient descent works on convex functions - the first-order Taylor approximation is always a lower bound.

Second-order condition for convexity

If f is twice differentiable, it is convex if and only if the Hessian is positive semidefinite:

$\nabla^2 f(x) \succeq 0 \quad \forall\, x$

The Hessian ∇²f (matrix of second-order partial derivatives) is positive semidefinite when all its eigenvalues are ≥ 0, meaning the function curves upward in every direction.

import numpy as np

def check_convexity_numerical(
    f,
    domain_bounds: tuple,
    n_samples: int = 1000
) -> bool:
    """
    Test convexity numerically by checking the definition:
    f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y) for random (x, y, λ)
    """
    lo, hi = domain_bounds
    violations = 0
    tol = 1e-8

    for _ in range(n_samples):
        x = np.random.uniform(lo, hi)
        y = np.random.uniform(lo, hi)
        lam = np.random.uniform(0, 1)

        midpoint = lam * x + (1 - lam) * y
        convex_combo = lam * f(x) + (1 - lam) * f(y)

        if f(midpoint) > convex_combo + tol:
            violations += 1

    return violations == 0

# Test various functions
functions = {
    'x^2 (convex)': lambda x: x**2,
    'x^4 (convex)': lambda x: x**4,
    'exp(x) (convex)': lambda x: np.exp(x),
    'x^3 (NOT convex)': lambda x: x**3,
    '-x^2 (concave, NOT convex)': lambda x: -x**2,
    'sin(x) (NOT convex on [-π, π])': lambda x: np.sin(x),
}

print("Convexity test results:")
for name, f in functions.items():
    is_convex = check_convexity_numerical(f, (-3, 3))
    print(f"  {name}: {'CONVEX' if is_convex else 'NOT convex'}")

Strongly convex functions

f is μ-strongly convex if:

$f(y) \geq f(x) + \nabla f(x)^T (y-x) + \frac{\mu}{2} \|y - x\|^2$

Strong convexity means the function curves upward by at least μ in every direction (has a lower bound on curvature). This leads to linear convergence for gradient descent:

$L(\theta_t) - L(\theta^*) \leq \left(1 - \alpha\mu\right)^t \left(L(\theta_0) - L(\theta^*)\right)$

Type	Gradient descent convergence	Examples
Convex	O(1/t) (sublinear)
Strongly convex	O((1-αμ)ᵗ) (linear)	L2-regularized linear/logistic regression
Smooth + convex	O(1/t) with larger constant	Logistic regression (no regularization)

Part 3 - Key Convex Loss Functions in ML

Linear regression MSE

$L(w) = \frac{1}{n} \|Xw - y\|^2 = \frac{1}{n} (Xw-y)^T(Xw-y)$

The Hessian is ∇²L = (2/n)XᵀX, which is always positive semidefinite. MSE for linear regression is convex. With L2 regularization, the Hessian is (2/n)XᵀX + 2λI, which is positive definite (strongly convex).

Logistic regression

$L(w) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log \sigma(w^T x_i) + (1-y_i) \log(1 - \sigma(w^T x_i)) \right]$

The Hessian is (1/n)Xᵀ·diag(σ(Xw)(1-σ(Xw)))·X, which is always positive semidefinite. Logistic regression is convex. One global minimum (no local minima to get stuck in).

import numpy as np

def logistic_regression_hessian(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """
    Compute the Hessian of logistic regression loss.
    Should be positive semidefinite (all eigenvalues >= 0).
    """
    def sigmoid(z): return 1 / (1 + np.exp(-z))

    s = sigmoid(X @ w)          # (n,)
    S = np.diag(s * (1 - s))    # (n, n) diagonal matrix

    H = (1/len(w)) * X.T @ S @ X  # (d, d)
    return H

# Verify PSD: all eigenvalues should be >= 0
np.random.seed(42)
n, d = 100, 5
X = np.random.randn(n, d)
w = np.random.randn(d)

H = logistic_regression_hessian(X, w)
eigenvalues = np.linalg.eigvalsh(H)

print(f"Hessian eigenvalues: {eigenvalues.round(6)}")
print(f"All non-negative: {np.all(eigenvalues >= -1e-10)}")  # True → convex

Jensen's inequality

For a convex function f and random variable X:

$f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$

"The function of the average is ≤ the average of the function values."

ML applications:

Expected loss is at least the loss at the average: Averaging model predictions (ensembling) can be better than any single model when the loss is convex
KL divergence derivation: KL(P‖Q) ≥ 0 follows from Jensen's inequality applied to -log
Evidence lower bound (ELBO) in VAEs: Jensen's inequality bounds the intractable log-likelihood

import numpy as np

# Verify Jensen's inequality numerically
np.random.seed(42)
f = np.exp  # exp is convex

n_samples = 10000
X = np.random.randn(n_samples)  # random variable

# Jensen: f(E[X]) <= E[f(X)]
f_of_mean = f(np.mean(X))
mean_of_f = np.mean(f(X))

print(f"f(E[X]) = exp(E[X]) = {f_of_mean:.4f}")
print(f"E[f(X)] = E[exp(X)] = {mean_of_f:.4f}")
print(f"Jensen holds: {f_of_mean <= mean_of_f}")  # True
print(f"Gap: {mean_of_f - f_of_mean:.4f}")

Part 4 - Non-Convex Loss Landscapes in Deep Learning

Neural networks have non-convex loss functions. The parameters θ enter the loss in a highly non-linear way, with multiple compositions of nonlinear functions.

Why deep network loss is non-convex

Symmetry: Any permutation of neurons in a hidden layer gives the same network output. This means there are exponentially many parameter configurations with the same loss - all "global minima" relative to each other. Gradient descent can find any of them.

Weight scaling: Scaling up the weights of layer 1 by a factor and scaling down layer 2 by the same factor preserves the network output. This creates continuous "ridges" in parameter space where the loss is constant.

Non-linear activations: ReLU, tanh, sigmoid create kinks and non-smooth regions that break convexity.

The loss landscape: what it actually looks like

Simplified 1D view of deep network loss:

L(θ)
│                      local
│    local             minimum        global
│    max               ↓              min
│     \    /\/\   /\/\               ↓
│      \  / saddle \  /\  /─────\___/
│       \/          \/  \/
│
└──────────────────────────────────── θ

Key features:

Multiple local minima: Gradient descent may converge to different solutions depending on initialization
Saddle points: Points where ∇f = 0 but not a minimum (some directions curve up, some curve down)
Flat regions / plateaus: Very small gradients everywhere, slow convergence
Sharp vs. flat minima: Sharp minima tend to generalize worse (sensitive to perturbations)

Saddle points dominate the landscape

For random high-dimensional functions, the majority of critical points (where ∇f = 0) are saddle points, not local minima. The probability of a point being a local minimum decreases exponentially with dimension.

Why this matters: At a saddle point, the gradient is zero, so gradient descent theoretically gets stuck. In practice, the noise in mini-batch gradient estimates gives a random kick that escapes saddle points. This is a key reason why SGD performs well in practice even for non-convex problems.

import numpy as np

# Visualize a saddle point in 2D: f(x,y) = x^2 - y^2
def f(x, y): return x**2 - y**2
def grad_f(x, y): return np.array([2*x, -2*y])

# Saddle point at (0, 0): gradient is 0
origin = np.array([0.0, 0.0])
print(f"Gradient at origin: {grad_f(*origin)}")  # [0, 0] - looks like minimum!

# But the Hessian reveals it is a saddle point
# Hessian = [[2, 0], [0, -2]] - one positive, one negative eigenvalue
H = np.array([[2, 0], [0, -2]])
eigenvalues = np.linalg.eigvalsh(H)
print(f"Hessian eigenvalues: {eigenvalues}")  # [2, -2] → saddle point!
print("Saddle point! (mixed-sign eigenvalues of Hessian)")

# Gradient descent near saddle: tiny perturbation escapes
theta = np.array([0.001, 0.0])  # slightly off saddle
lr = 0.1
print(f"\nGradient descent near saddle:")
for step in range(5):
    grad = grad_f(*theta)
    theta -= lr * grad
    print(f"  Step {step+1}: θ={theta.round(4)}, ‖∇f‖={np.linalg.norm(grad):.4f}")
# x grows (curvature positive), y stays near 0 → escapes saddle in x direction

Second-order criticality test (Hessian)

At a critical point ∇f = 0, the Hessian ∇²f tells you what type of critical point it is:

Hessian ∇²f	Eigenvalues	Critical point type
Positive definite (∇²f ≻ 0)	All > 0	Local minimum
Negative definite (∇²f ≺ 0)	All < 0	Local maximum
Indefinite (mixed)	Some > 0, some < 0	Saddle point
Positive semidefinite	All ≥ 0, some = 0	Degenerate case

Part 5 - Why Deep Learning Works Despite Non-Convexity

This is a central question in deep learning theory. If the loss is non-convex, why do networks consistently converge to good solutions?

Reason 1: Local minima are nearly as good as global minima

For overparameterized networks (far more parameters than data points), research suggests that most local minima have similar quality to the global minimum. The "bad" local minima essentially disappear in high dimensions.

Intuition: In 1D, a local minimum is a valley. In 1,000,000D, for a local minimum to exist, the function must curve upward in all 1,000,000 directions. This becomes exponentially rare, and when it happens, the local minimum is usually close in value to the global minimum.

Reason 2: SGD noise escapes saddle points and poor local minima

Mini-batch gradient noise provides an implicit random walk. If the noise scale is comparable to the width of a "bad" local minimum, SGD will escape it. This is why models trained with SGD often generalize better than those trained with second-order methods that converge more precisely to a specific minimum.

Reason 3: Flat minima generalize better

Not all minima are equal. A flat minimum (one in a wide valley in the loss landscape) generalizes better than a sharp minimum (one in a narrow spike). Here is why:

After training on dataset D, if the model is deployed on slightly different data D' (distribution shift, new examples), its parameters are effectively perturbed. A flat minimum maintains low loss under small perturbations; a sharp minimum does not.

SGD with small learning rates and batch noise naturally prefers flat minima - another reason SGD often generalizes better than full-batch methods.

import numpy as np

# Illustration: sharp vs flat minimum response to perturbation
def sharp_minimum(theta):
    """Loss near a sharp minimum: f(θ) = 100θ²"""
    return 100 * theta**2

def flat_minimum(theta):
    """Loss near a flat minimum: f(θ) = θ²"""
    return theta**2

# Both have minimum at θ=0, loss=0
perturbation = 0.1  # small distribution shift

sharp_loss = sharp_minimum(perturbation)
flat_loss = flat_minimum(perturbation)

print(f"Under perturbation of {perturbation}:")
print(f"  Sharp minimum loss: {sharp_loss:.4f}")  # 1.0 (big jump)
print(f"  Flat minimum loss:  {flat_loss:.4f}")   # 0.01 (small jump)
print(f"  Ratio: {sharp_loss/flat_loss:.1f}x worse for sharp minimum")

Reason 4: The loss landscape is better than expected

Empirical studies (Goodfellow et al., Li et al.) show that neural network loss landscapes, while non-convex, tend to have:

Relatively few "bad" local minima in high-dimensional overparameterized networks
Many saddle points but with escape directions detectable by SGD noise
A percolation structure where one can find paths from initialization to a good minimum without climbing high barriers

Reason 5: Residual connections create a convex-like structure

ResNets add identity shortcuts: output = F(x) + x. The gradient of the loss with respect to the input passes through the identity mapping directly:

$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial \text{output}} \cdot \left(\frac{\partial F(x)}{\partial x} + I\right)$

The identity term I means the gradient always has a direct path backward - this prevents vanishing gradients and creates a loss landscape that behaves more like a convex function in many respects.

Part 6 - Practical Implications

When convexity is lost: early warning signs

Symptom	Possible cause
Loss oscillates without decreasing	Stuck in flat region or oscillating between saddle points
Loss converges to different values across runs	Multiple local minima; initialization-dependent
Very slow convergence	High condition number; using fixed learning rate in narrow valley
Loss suddenly spikes after many epochs	Escaped a local minimum; or learning rate too high
Training loss low but validation loss high	Sharp minimum (memorization); need regularization

import numpy as np

def analyze_loss_landscape(
    f,          # loss function
    grad_f,     # gradient function
    theta: np.ndarray,
    perturbation_scale: float = 0.01,
    n_directions: int = 10
) -> dict:
    """
    Analyze local loss landscape around a critical point.
    Estimate whether it looks like a minimum, saddle, or flat region.
    """
    grad_norm = np.linalg.norm(grad_f(theta))

    # Probe curvature in random directions
    curvatures = []
    for _ in range(n_directions):
        direction = np.random.randn(len(theta))
        direction /= np.linalg.norm(direction)
        eps = perturbation_scale

        # Numerical second derivative in direction
        f_plus = f(theta + eps * direction)
        f_center = f(theta)
        f_minus = f(theta - eps * direction)
        curvature = (f_plus - 2*f_center + f_minus) / (eps**2)
        curvatures.append(curvature)

    curvatures = np.array(curvatures)
    n_positive = np.sum(curvatures > 0)
    n_negative = np.sum(curvatures < 0)

    if grad_norm < 1e-4:
        if n_negative == 0:
            point_type = "local minimum"
        elif n_positive == 0:
            point_type = "local maximum"
        else:
            point_type = f"saddle point ({n_positive} up, {n_negative} down)"
    else:
        point_type = "not a critical point"

    return {
        'grad_norm': grad_norm,
        'point_type': point_type,
        'avg_curvature': curvatures.mean(),
        'min_curvature': curvatures.min(),
        'max_curvature': curvatures.max(),
    }

# Example: analyze a known saddle point
f_saddle = lambda t: t[0]**2 - t[1]**2
grad_saddle = lambda t: np.array([2*t[0], -2*t[1]])

analysis = analyze_loss_landscape(f_saddle, grad_saddle, np.array([0.0, 0.0]))
print(f"Analysis at origin:")
for k, v in analysis.items():
    print(f"  {k}: {v}")

Part 7 - Common Mistakes and Engineering Red Flags

:::warning Assuming your loss is convex when it is not Linear and logistic regression have convex losses - gradient descent is guaranteed to find the global minimum. Any non-linear model (neural networks, kernel methods with non-linear features) has a non-convex loss. This matters for debugging: if your linear regression is not converging, it is definitely a learning rate or data issue; if your neural network is not converging, it could also be a local minimum or saddle point issue. :::

:::tip Practical convexity check For any loss function, check the Hessian eigenvalues at the optimum. If all eigenvalues are positive, you found a local minimum. If some are negative, you are at a saddle point. If some are near zero, you are in a flat region.

import torch

def hessian_at_point(model, loss_fn, X, y):
    """Compute Hessian of loss at current parameters using torch.autograd."""
    params = list(model.parameters())
    # Flatten all parameters into a single vector
    flat_params = torch.cat([p.view(-1) for p in params])
    n = len(flat_params)

    # This is expensive - only use for small models
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grads = torch.cat([g.view(-1) for g in grads])

    H = torch.zeros(n, n)
    for i in range(n):
        second_grads = torch.autograd.grad(flat_grads[i], params, retain_graph=True)
        H[i] = torch.cat([g.view(-1) for g in second_grads])

    eigenvalues = torch.linalg.eigvalsh(H)
    return eigenvalues

:::

:::danger Conflating "local minimum" with "bad solution" In overparameterized deep networks, local minima are not necessarily bad. A local minimum found by SGD often has similar quality to the global minimum. If your model is performing poorly, the issue is almost certainly not "stuck in a bad local minimum" - it is more likely learning rate, architecture, data quality, or regularization. :::

Interview Questions

Q1: What is a convex function, and why does convexity matter for optimization?

A function f is convex if for any two points x, y and any λ ∈ [0,1]: $f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$

Geometrically: the function lies below the chord connecting any two points - it "curves upward."

Why it matters for optimization:

No bad local minima: Every local minimum of a convex function is also a global minimum. Gradient descent is guaranteed to find the global optimum.
First-order condition is sufficient: ∇f(x) = 0 is necessary AND sufficient for x to be a global minimum (for convex functions). For non-convex functions, ∇f = 0 only identifies candidates, not guarantees.
Predictable convergence: For strongly convex functions, gradient descent converges at a linear rate O((1-αμ)ᵗ). For convex functions, it converges at sublinear rate O(1/t).
ML examples: Linear regression (MSE loss), logistic regression, SVM (hinge loss + regularization) are convex - their global optimum is guaranteed. Neural networks are non-convex.

Q2: Why is logistic regression convex but neural networks are not?

Logistic regression:

The logistic regression loss is: $L(w) = -\frac{1}{n}\sum_i [y_i \log\sigma(w^Tx_i) + (1-y_i)\log(1-\sigma(w^Tx_i))]$

The Hessian is ∇²L = (1/n)Xᵀ·diag(σ(Xw)(1-σ(Xw)))·X, which is always positive semidefinite (X.T @ D @ X with D diagonal non-negative is always PSD). So logistic regression is convex.

Neural networks:

For a 2-layer network ŷ = σ(W₂σ(W₁x)), consider what happens to the parameters (W₁, W₂):

You can multiply W₁ by c and divide W₂ by c for any c≠0 - same output, different parameters
This creates a "ridge" of equivalent parameter settings - many equivalent global minima
The parameters interact multiplicatively, so the Hessian has negative eigenvalues

Additionally, the same function is represented by different parameter sets (permutation symmetry), creating multiple global minima that are not even connected by gradient flow.

Q3: What is a saddle point, and how does gradient descent handle it?

A saddle point is a critical point (∇f = 0) where the Hessian has both positive and negative eigenvalues. The function decreases in some directions and increases in others - like the center of a saddle (hyperbolic paraboloid).

Theoretically, gradient descent gets stuck at saddle points because the gradient is zero. In practice, several mechanisms help escape:

SGD noise: Mini-batch gradient estimates are noisy. The noise projects onto the negative curvature directions, providing a kick that moves the optimizer off the saddle.
Perturbation tricks: Some algorithms explicitly add noise (e.g., perturbed gradient descent) to provably escape saddle points.
Second-order methods: Newton's method can detect negative curvature (from the Hessian) and move in the negative curvature direction rather than the negative gradient direction.

In high-dimensional networks: The fraction of critical points that are local minima (versus saddle points) decreases exponentially with dimension. Most critical points in deep networks are saddle points, not local minima. SGD typically escapes them successfully because of gradient noise.

Q4: Why does small-batch SGD sometimes find solutions that generalize better than full-batch gradient descent?

The key is the flat vs. sharp minima distinction.

Loss landscape has many local minima. Flat minima (wide valleys) and sharp minima (narrow spikes) may have similar training loss, but different generalization behavior:

Flat minimum: Small perturbation to parameters → small change in loss. If training and test distributions differ slightly (real-world distribution shift), the model stays near its minimum. Generalizes well.
Sharp minimum: Small perturbation → large change in loss. The model is sensitive to small distribution shifts. Generalizes poorly (memorizes noise).

Why SGD prefers flat minima: The noise in mini-batch gradient estimates effectively adds a random walk to the optimization. Sharp minima are narrow - the random walk escapes them. Flat minima are wide - the random walk stays inside them. So SGD converges to and stays in flat minima.

Full-batch gradient descent has no noise, converges more precisely to whichever minimum it finds, and is more likely to settle in a sharp minimum.

This is empirically validated (Keskar et al. 2017: large-batch training consistently leads to sharp minima and worse generalization than small-batch training at the same learning rate).

Q5: State Jensen's inequality and give two ML applications.

Jensen's inequality: For a convex function f and random variable X: $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$

"The function of the average is at most the average of the function values."

Application 1: Why ensembling works

For regression with MSE loss (convex in predictions):

$L\left(\frac{1}{K}\sum_k f_k(x)\right) \leq \frac{1}{K}\sum_k L(f_k(x))$

The loss of the ensemble prediction is at most the average loss of the individual models. Ensembling always improves (or maintains) performance - Jensen's inequality is the mathematical guarantee.

Application 2: Deriving the ELBO in Variational Autoencoders

The VAE objective involves log p(x) which is intractable. Jensen's inequality gives:

$\log p(x) = \log \mathbb{E}_{q(z|x)}\left[\frac{p(x,z)}{q(z|x)}\right] \geq \mathbb{E}_{q(z|x)}\left[\log\frac{p(x,z)}{q(z|x)}\right] = \text{ELBO}$

The inequality holds because log is concave (the inequality flips). We maximize the tractable ELBO as a lower bound on the intractable log-likelihood. This is the foundation of variational inference.

Quick Reference

Concept	Definition	Test
Convex set	Line segment between any two points stays in set	Check λx+(1-λ)y ∈ C
Convex function	f(λx+(1-λ)y) ≤ λf(x)+(1-λ)f(y)	Check Hessian eigenvalues ≥ 0
Strongly convex	Curves upward by at least μ in every direction	μ-PD Hessian
Local minimum	∇f = 0, Hessian PD	All eigenvalues > 0
Saddle point	∇f = 0, Hessian indefinite	Mixed eigenvalue signs
Jensen's inequality	f(E[X]) ≤ E[f(X)] for convex f	Apply to ensembling, ELBO
Flat minimum	Low curvature → good generalization	Small Hessian eigenvalues at minimum

Key Takeaways

A convex function lies below any chord - the Hessian is positive semidefinite everywhere
Gradient descent on a convex function is guaranteed to find the global minimum; every local minimum is global
Linear regression (MSE) and logistic regression are convex; neural networks are not
Deep network loss landscapes have many saddle points (not local minima) in high dimensions - SGD noise escapes them effectively
Flat minima generalize better than sharp minima because they are robust to small perturbations
Despite non-convexity, deep learning works because overparameterized networks have many high-quality local minima and SGD implicitly prefers flat, generalizable solutions
Jensen's inequality underlies ensembling guarantees and the ELBO in VAEs

Next: Lagrange Multipliers →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Convex Functions demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Prerequisites​

Part 1 - Convex Sets​

Geometric interpretation​

Common convex sets​

Part 2 - Convex Functions: Definition and Tests​

First-order condition for convexity​

Second-order condition for convexity​

Strongly convex functions​

Part 3 - Key Convex Loss Functions in ML​

Linear regression MSE​

Logistic regression​

Jensen's inequality​

Part 4 - Non-Convex Loss Landscapes in Deep Learning​

Why deep network loss is non-convex​

The loss landscape: what it actually looks like​

Saddle points dominate the landscape​

Second-order criticality test (Hessian)​

Part 5 - Why Deep Learning Works Despite Non-Convexity​

Reason 1: Local minima are nearly as good as global minima​

Reason 2: SGD noise escapes saddle points and poor local minima​

Reason 3: Flat minima generalize better​

Reason 4: The loss landscape is better than expected​

Reason 5: Residual connections create a convex-like structure​

Part 6 - Practical Implications​

When convexity is lost: early warning signs​

Part 7 - Common Mistakes and Engineering Red Flags​

Interview Questions​

Quick Reference​

Key Takeaways​