Stationarity and Ergodicity - The Prerequisites for Time Series ML
Reading time: ~22 minutes | Level: Time Series Foundations → Production ML
Your ML team trains a gradient boosted model to predict energy demand using the past 3 years of hourly consumption data. Backtesting shows 95% accuracy. You deploy. The model is 40% off every winter morning.
The problem: energy demand has a trend component that grew over your training period. Your model learned to predict "what level does demand run at in this period" rather than "what is the deviation from the long-run trend." This is the classic consequence of training on non-stationary data without accounting for it.
Stationarity is the first thing you must check before applying any time series ML model.
What You Will Learn
- Strict vs weak (covariance) stationarity - formal definitions
- Why stationarity matters for ML: assumptions underlying forecasting models
- Unit roots: the most common form of non-stationarity
- Augmented Dickey-Fuller (ADF) test: implementation and interpretation
- Differencing: transforming non-stationary to stationary
- Trend decomposition and deseasonalization
- Ergodicity: why time averages equal ensemble averages
Part 1 - Stationarity: Formal Definitions
Strict stationarity
A time series is strictly stationary if the joint distribution of any collection is the same as the joint distribution of for all , all time points , and all shifts .
Informally: the statistical properties of the process do not change over time.
Weak (covariance) stationarity
A time series is weakly stationary (covariance stationary) if:
- Constant mean: for all
- Constant variance: for all
- Autocovariance depends only on lag: depends only on , not on
:::note Weak vs strict stationarity Strict stationarity implies weak stationarity (assuming finite variance). Weak stationarity does not imply strict stationarity in general. However, for Gaussian processes, weak and strict stationarity are equivalent (since Gaussians are fully characterized by first and second moments).
In ML practice, we almost always work with weak stationarity. :::
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
np.random.seed(42)
n = 500
# --- Stationary processes ---
# White noise: E[X_t]=0, Var(X_t)=1, Cov(X_t, X_s)=0 for t≠s
white_noise = np.random.randn(n)
# AR(1) stationary (|φ|<1): X_t = 0.7 X_{t-1} + ε_t
ar1_stationary = np.zeros(n)
for t in range(1, n):
ar1_stationary[t] = 0.7 * ar1_stationary[t-1] + np.random.randn()
# --- Non-stationary processes ---
# Random walk (unit root): X_t = X_{t-1} + ε_t
random_walk = np.cumsum(np.random.randn(n))
# Trend-stationary: X_t = 0.02t + ε_t
trend_stationary = 0.02 * np.arange(n) + 0.5 * np.random.randn(n)
# Heteroskedastic (non-constant variance)
heteroskedastic = np.random.randn(n) * (1 + np.arange(n) / n)
# Check: does the mean/variance look constant over time?
for name, series in [('White noise', white_noise),
('AR(1)', ar1_stationary),
('Random walk', random_walk),
('Trend', trend_stationary)]:
first_half_mean = series[:n//2].mean()
second_half_mean = series[n//2:].mean()
first_half_var = series[:n//2].var()
second_half_var = series[n//2:].var()
print(f"{name}:")
print(f" Mean: {first_half_mean:.2f} (1st half) vs {second_half_mean:.2f} (2nd half)")
print(f" Var: {first_half_var:.2f} (1st half) vs {second_half_var:.2f} (2nd half)")
Part 2 - Why Stationarity Matters for ML
The i.i.d. assumption violation
Most ML models (linear regression, XGBoost, neural networks) assume training examples are independently and identically distributed (i.i.d.). For time series:
- Non-i.i.d.: Observations at nearby times are correlated
- Non-stationary: The distribution changes over time
If you train on non-stationary data without addressing the non-stationarity, you are learning a relation that no longer holds at test time.
Train/test split on non-stationary data
import numpy as np
from sklearn.linear_model import LinearRegression
import pandas as pd
np.random.seed(42)
n = 1000
# Non-stationary series: trending upward
trend = np.linspace(0, 10, n)
noise = np.random.randn(n) * 0.5
y = trend + noise
# Feature: lag-1 value
X = y[:-1].reshape(-1, 1)
y_target = y[1:]
# BAD: random shuffle split (introduces future leakage)
from sklearn.model_selection import train_test_split
X_train_bad, X_test_bad, y_train_bad, y_test_bad = train_test_split(
X, y_target, test_size=0.2, shuffle=True, random_state=42) # WRONG
# CORRECT: time-based split - only use past to predict future
split_idx = int(0.8 * len(X))
X_train = X[:split_idx]
y_train = y_target[:split_idx]
X_test = X[split_idx:]
y_test = y_target[split_idx:]
# Train on random split (inflated performance due to future leakage)
model_bad = LinearRegression().fit(X_train_bad, y_train_bad)
score_bad = model_bad.score(X_test_bad, y_test_bad)
# Train on time split
model_correct = LinearRegression().fit(X_train, y_train)
score_correct = model_correct.score(X_test, y_test)
print(f"R² (random split, leaky): {score_bad:.4f}") # Artificially high
print(f"R² (time split, honest): {score_correct:.4f}")
# For trending data: both may be high here, but the insight holds for non-linear patterns
Distribution shift over time
import numpy as np
np.random.seed(42)
n = 1000
# Simulated financial data: regime change
# First 500 steps: low volatility bull market
# Next 500 steps: high volatility bear market
low_vol = np.random.randn(500) * 0.01 + 0.0005
high_vol = np.random.randn(500) * 0.03 - 0.001
returns = np.concatenate([low_vol, high_vol])
prices = np.exp(np.cumsum(returns)) * 100
print("First 500 observations (training-like period):")
print(f" Mean return: {low_vol.mean():.6f}")
print(f" Std return: {low_vol.std():.6f}")
print("\nNext 500 observations (test-like period):")
print(f" Mean return: {high_vol.mean():.6f}")
print(f" Std return: {high_vol.std():.6f}")
# A model trained on the first regime will massively underestimate volatility
# in the second regime - catastrophic for risk management
Part 3 - Unit Roots and the ADF Test
The unit root process
A time series has a unit root if:
or more generally:
The "unit" refers to - the characteristic polynomial has a root at 1.
Properties of unit root processes:
- Mean changes over time (trending or drifting)
- Variance grows without bound:
- Shocks are permanent: a perturbation at time affects all future values
- Standard regression on integrated series produces spurious correlations
The Augmented Dickey-Fuller (ADF) test
The ADF test is a hypothesis test for unit roots:
- : Series has a unit root (non-stationary)
- : Series is stationary
The test regression:
Test statistic: -statistic for . The augmented terms account for serial correlation.
Decision: If test statistic < critical value (usually at 5% level), reject → series is stationary.
import numpy as np
from statsmodels.tsa.stattools import adfuller
def adf_test(series: np.ndarray, name: str = '', verbose: bool = True) -> dict:
"""
Augmented Dickey-Fuller test with interpretation.
Returns: dict with test_statistic, p_value, is_stationary
"""
result = adfuller(series, autolag='AIC') # AIC selects optimal lag order
output = {
'test_statistic': result[0],
'p_value': result[1],
'n_lags_used': result[2],
'critical_values': result[4],
'is_stationary': result[1] < 0.05 # Common threshold
}
if verbose:
print(f"\nADF Test: {name}")
print(f" Test Statistic: {result[0]:.4f}")
print(f" p-value: {result[1]:.4f}")
print(f" Lags used: {result[2]}")
print(f" Critical Values:")
for level, cv in result[4].items():
print(f" {level}: {cv:.4f}")
conclusion = "STATIONARY" if output['is_stationary'] else "NON-STATIONARY (unit root)"
print(f" Conclusion: {conclusion} (p < 0.05: {result[1] < 0.05})")
return output
# Test different processes
np.random.seed(42)
n = 500
white_noise = np.random.randn(n)
random_walk = np.cumsum(np.random.randn(n))
ar1_stationary = np.zeros(n)
for t in range(1, n):
ar1_stationary[t] = 0.7 * ar1_stationary[t-1] + np.random.randn()
adf_test(white_noise, "White Noise")
adf_test(ar1_stationary, "AR(1), φ=0.7")
adf_test(random_walk, "Random Walk")
# Real-world example: S&P 500 price level vs log returns
import pandas as pd
# Simulated S&P 500-like data
returns_sim = 0.001 + 0.015 * np.random.randn(252 * 5) # 5 years daily
prices_sim = 3000 * np.exp(np.cumsum(returns_sim))
adf_test(prices_sim, "Price Level (non-stationary expected)")
adf_test(np.diff(np.log(prices_sim)), "Log Returns (stationary expected)")
Part 4 - Differencing to Achieve Stationarity
First differencing
If is a random walk (non-stationary), then is white noise (stationary):
In general, a series that requires rounds of differencing to become stationary is called "integrated of order ", denoted .
import numpy as np
from statsmodels.tsa.stattools import adfuller
def difference_to_stationarity(series: np.ndarray, max_d: int = 3) -> tuple:
"""
Apply differencing until ADF test accepts stationarity.
Returns: (differenced_series, d)
"""
d = 0
current_series = series.copy()
while d <= max_d:
result = adfuller(current_series, autolag='AIC')
p_value = result[1]
print(f"d={d}: ADF p-value = {p_value:.4f} ({'stationary' if p_value < 0.05 else 'non-stationary'})")
if p_value < 0.05:
return current_series, d
# Apply one more round of differencing
current_series = np.diff(current_series)
d += 1
print(f"Warning: series still non-stationary after {max_d} differences")
return current_series, max_d
np.random.seed(42)
# I(1) process: random walk
random_walk = np.cumsum(np.random.randn(500))
stationary_rw, d_rw = difference_to_stationarity(random_walk)
print(f"Random walk needed d={d_rw} differences")
# I(2) process: integrated twice
rw2 = np.cumsum(np.cumsum(np.random.randn(500)))
stationary_rw2, d_rw2 = difference_to_stationarity(rw2)
print(f"I(2) series needed d={d_rw2} differences")
Seasonal differencing
For seasonal data with period :
import numpy as np
def seasonal_difference(series: np.ndarray, s: int = 12) -> np.ndarray:
"""
Seasonal differencing with period s.
Removes seasonal non-stationarity without removing trend.
"""
return series[s:] - series[:-s]
# Monthly data with annual seasonality
np.random.seed(42)
t = np.arange(120) # 10 years monthly
seasonal = 5 * np.sin(2 * np.pi * t / 12) # Annual cycle
trend = 0.05 * t
noise = np.random.randn(120)
series = seasonal + trend + noise
# Regular differencing removes trend but seasonal pattern remains
diff_1 = np.diff(series)
# Seasonal differencing removes seasonal component
diff_s = seasonal_difference(series, s=12)
# Both together (what SARIMA does)
diff_both = np.diff(seasonal_difference(series, s=12))
print("ADF p-values:")
for name, data in [("Original", series),
("1st diff", diff_1),
("Seasonal diff", diff_s),
("Both diffs", diff_both)]:
pval = adfuller(data, autolag='AIC')[1]
print(f" {name}: {pval:.4f}")
Part 5 - Trend Decomposition and Deseasonalization
For ML with time series, a common preprocessing pipeline:
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose, STL
def decompose_and_make_stationary(series: pd.Series, period: int = 12) -> dict:
"""
Decompose time series into trend, seasonal, and residual.
Return stationary residuals for ML modeling.
"""
# STL (Seasonal-Trend decomposition using LOESS) - more robust than classical
stl = STL(series, period=period, robust=True)
result = stl.fit()
trend = result.trend
seasonal = result.seasonal
residual = result.resid
# The residual should be stationary (or close to it)
pval_residual = adfuller(residual.dropna())[1]
print(f"Residual ADF p-value: {pval_residual:.4f} "
f"({'stationary' if pval_residual < 0.05 else 'non-stationary'})")
return {
'trend': trend,
'seasonal': seasonal,
'residual': residual,
'original': series
}
# ML workflow: predict residuals, then add back trend and seasonal
def ml_time_series_workflow(series: pd.Series, model, horizon: int = 12):
"""
Proper ML time series pipeline:
1. Decompose into trend + seasonal + residual
2. Difference residuals to stationarity
3. Train ML model on stationary residuals
4. Predict residuals, reconstruct full forecast
"""
components = decompose_and_make_stationary(series)
residuals = components['residual'].dropna()
# Feature engineering on stationary residuals
# (lag features, rolling stats - valid only after stationarity)
df = pd.DataFrame({'residual': residuals})
for lag in [1, 2, 3, 6, 12]:
df[f'lag_{lag}'] = df['residual'].shift(lag)
df = df.dropna()
X = df.drop('residual', axis=1)
y = df['residual']
# Train model (time-based split)
split = int(0.8 * len(X))
model.fit(X[:split], y[:split])
residual_forecast = model.predict(X[split:split+horizon])
# Reconstruct forecast by adding back trend and seasonal components
# (In practice, forecast trend and seasonal separately)
return residual_forecast
Part 6 - Ergodicity
Definition
A stationary process is ergodic if time averages converge to ensemble averages:
In other words: a single long realization of the process contains enough information to estimate all population statistics.
Why ergodicity matters in ML
In ML, we have one realization of a time series (the observed data). We use time averages (e.g., sample mean, sample autocorrelation) to estimate population quantities. This is only valid if the process is ergodic.
Non-ergodic example: Suppose you measure one person's resting heart rate for 10,000 days. The time average is that person's average heart rate - but it tells you nothing about the population distribution of heart rates. If you want to model the population, you need many individuals (ensemble), not one long series.
import numpy as np
# Ergodic process: AR(1) with |φ| < 1
# Time average converges to population mean
np.random.seed(42)
n = 10000
ar1 = np.zeros(n)
for t in range(1, n):
ar1[t] = 0.7 * ar1[t-1] + np.random.randn()
# Time average at different T
for T in [100, 1000, 5000, 10000]:
time_avg = ar1[:T].mean()
print(f"T={T:6d}: time_avg = {time_avg:.4f}") # Should converge to 0 (true mean)
# Converges to population mean E[X] = 0
# Non-ergodic: different realizations have different long-run means
# (like measuring different people's heart rates)
n_realizations = 5
individual_means = np.random.normal(65, 10, n_realizations) # True heart rate per person
series_per_person = [mean + np.random.randn(1000) for mean in individual_means]
print("\nPer-person time averages (non-ergodic scenario):")
for i, (series, true_mean) in enumerate(zip(series_per_person, individual_means)):
print(f" Person {i+1}: time_avg={series.mean():.1f}, true_mean={true_mean:.1f}")
# Time averages are person-specific - not estimates of population mean
Interview Questions
Q1: What is the difference between strict and weak stationarity? Which do we need for ARIMA models?
Strict stationarity: The joint distribution of is invariant to time shifts for any collection of time points. This means ALL statistical properties (mean, variance, higher moments, entire distribution) are constant over time.
Weak (covariance) stationarity: Only requires:
- Constant mean:
- Constant variance:
- Autocovariance depends only on lag:
Key relationships:
- Strict → weak (if variance is finite)
- Weak → strict only for Gaussian processes (where first two moments determine the distribution)
- In practice, weak stationarity is sufficient for most analysis
ARIMA requires: Weak stationarity for the differenced series. The Wold decomposition theorem guarantees that any weakly stationary process can be represented as an MA(∞) - the theoretical foundation for ARIMA models. Strict stationarity is not needed because ARIMA only models first and second moments.
ML applications: If your series is weakly stationary (confirmed by ADF test, constant rolling mean and variance), ARIMA and related models will work correctly. If only strictly stationary but not weakly (infinite variance - rare), standard ARIMA breaks down.
Q2: What is the Augmented Dickey-Fuller test, and how do you interpret its output?
The ADF test tests for unit roots:
- : Series has a unit root (non-stationary)
- : Series is stationary
The test regression:
The test statistic is the -ratio for . The "augmented" part (lags of ) controls for serial correlation that would invalidate the standard Dickey-Fuller test.
Interpreting output:
from statsmodels.tsa.stattools import adfuller
result = adfuller(series)
# result[0]: test statistic
# result[1]: p-value
# result[4]: critical values at 1%, 5%, 10%
# Decision rule:
# p-value < 0.05: reject H₀ → series is stationary
# p-value ≥ 0.05: fail to reject H₀ → series has unit root
Critical values: The ADF distribution is non-standard (left-skewed). Critical values at 1% ≈ -3.5, 5% ≈ -2.9, 10% ≈ -2.6. More negative test statistic = stronger evidence for stationarity.
Practical nuances:
- Low power with short series (T < 100): test may fail to detect stationarity
- Near-unit-root processes (): ADF may not reject even though the process is technically stationary
- Structural breaks: ADF may give misleading results if the series has a break mid-sample
- Always use
autolag='AIC'to let the model select the lag order automatically
When both ADF and KPSS agree:
- ADF rejects, KPSS does not reject → stationary (high confidence)
- ADF does not reject, KPSS rejects → unit root (high confidence)
- Both reject or both don't → structural break or near-unit-root (investigate further)
Q3: Why is it dangerous to train an ML model on non-stationary time series data without preprocessing?
Problem 1: Spurious correlations. Two independent random walks and can have very high correlation (often > 0.8) purely by chance, because both trend in the same direction. A model trained on the levels will learn this spurious relationship. In deployment, when the trends diverge, the model fails spectacularly. This is called "spurious regression."
Problem 2: Distribution shift. If the mean or variance of the series changes over time, the distribution seen at test time is different from training. The model has learned the wrong distribution. Example: training on 2015-2019 data and testing on 2020 (pandemic) data.
Problem 3: Violation of ARIMA assumptions. ARIMA requires a stationary process after differencing. Applying ARIMA to a non-stationary series without proper differencing gives biased parameter estimates.
Problem 4: Invalid standard errors. OLS standard errors assume stationarity. With integrated series, standard errors are incorrect → hypothesis tests are invalid → model selection (feature selection, regularization) is flawed.
Correct approach:
- Test for stationarity (ADF test, visual inspection of rolling mean/std)
- If non-stationary: difference, log-transform, or detrend
- Model the stationary residuals
- Convert predictions back to original scale
For deep learning: LSTMs and Transformers can technically model non-stationary series, but they are much more data-hungry to do so correctly. Pre-whitening (making the series stationary) dramatically reduces the amount of data needed for reliable learning.
Q4: What is ergodicity and why is it important for time series ML?
Ergodicity: A stationary process is ergodic if time averages (from a single realization) converge to ensemble averages (expectations over many independent realizations).
Why it matters for ML: In time series ML, we have ONE trajectory (e.g., 5 years of daily stock prices). We estimate population parameters (mean, variance, autocorrelation structure) from this single trajectory using time averages.
This is only valid if the process is ergodic. If not, the time average may converge to something specific to that realization, not the population parameter.
Practical ergodicity conditions: Processes with fast-decaying autocorrelation are ergodic. Specifically, if (absolute summability of autocovariance), the process is ergodic for the mean. This holds for any stationary ARMA process.
Non-ergodic situations in ML:
- Panel data: Each user in your recommendation system is a different "realization." Using one user's long history to estimate population-level patterns is non-ergodic. Need many users.
- Structural breaks: Before and after a regime change, the process has different statistical properties - a single long series spanning the break is non-ergodic for the pre-break period.
Practical implication: If you are fitting ARIMA to your time series, you are implicitly assuming ergodicity. If the series is ergodic, your parameter estimates from the single observed series are consistent. If not, you need ensemble data or regime-specific modeling.
Q5: How would you preprocess a non-stationary financial time series for ML feature engineering?
Step-by-step pipeline:
1. Inspect visually and test:
# Rolling mean and variance should be constant
rolling_mean = pd.Series(prices).rolling(50).mean()
rolling_std = pd.Series(prices).rolling(50).std()
# If rolling_mean trends → non-stationary mean
# If rolling_std trends → non-stationary variance (GARCH effects in finance)
2. Apply appropriate transformation:
- Log prices → log returns via first difference: eliminates multiplicative trend, achieves approximate stationarity
- Seasonal data: STL decomposition → model residuals
- Multiple seasonalities: FBProphet or MSTL
3. Test again:
log_returns = np.diff(np.log(prices))
adf_result = adfuller(log_returns)
# p-value should be < 0.01 for log returns - typically true for financial data
4. Engineer features from stationary series:
df = pd.DataFrame({'return': log_returns})
# Lag features (only on stationary series to avoid leakage)
for lag in [1, 2, 3, 5, 10]:
df[f'lag_{lag}'] = df['return'].shift(lag)
# Rolling statistics
df['rolling_mean_5'] = df['return'].rolling(5).mean()
df['rolling_std_5'] = df['return'].rolling(5).std()
df['rolling_skew_20'] = df['return'].rolling(20).skew()
5. Time-based cross-validation (never shuffle):
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
model.fit(X[train_idx], y[train_idx])
score = model.score(X[val_idx], y[val_idx])
Key rule: All feature engineering must use only past information relative to the prediction target - no future leakage.
Quick Reference
| Concept | Test | Fix |
|---|---|---|
| Unit root (random walk) | ADF, KPSS | First differencing |
| Seasonal non-stationarity | Plot seasonality | Seasonal differencing |
| Trending mean | Visual + ADF | Detrend or difference |
| Heteroskedasticity | Ljung-Box on squared residuals | GARCH model or log transform |
| ADF p-value | Interpretation |
|---|---|
| < 0.01 | Strongly stationary |
| 0.01–0.05 | Stationary (reject H₀ at 5%) |
| 0.05–0.10 | Borderline - check visually |
| > 0.10 | Non-stationary (unit root) |
Next: Lesson 02: Autocorrelation and PACF →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Stationarity demo on the EngineersOfAI Playground - no code required.
:::
