Skip to main content

ARIMA Models

Reading time: ~45 minutes Interview relevance: Very High - ARIMA is the classical forecasting baseline; every ML forecasting interview starts here Target roles: ML Engineer, Data Scientist, Financial ML, Research Engineer

The Real Interview Moment

You're presenting your LSTM forecasting model to a senior data scientist at a retail company. Your model achieves a MAPE of 8% on the test set. The senior DS asks: "Did you compare this to ARIMA?"

You did not.

"Why not? ARIMA with seasonal components often gets 5–7% MAPE on retail demand forecasting. Before you convince the business to run a GPU cluster for LSTMs, you need to prove that the complexity is justified. What's your ARIMA baseline?"

This scenario plays out constantly in industry. Neural networks are not always better than ARIMA. For short, well-behaved time series with strong seasonality, SARIMA frequently outperforms deep learning - at a fraction of the computational cost.

ARIMA is not a legacy technique you dismiss. It is the baseline you must beat and explain. Understanding its mathematics tells you exactly when it will and won't work - and what your deep learning model must be doing better.

Building Up to ARIMA

ARIMA combines three components: Autoregressive (AR), Integrated (I), and Moving Average (MA). Each adds expressive power for different temporal structures.

1. Autoregressive Model - AR(p)

An AR(p) model expresses the current value as a linear function of the pp most recent values:

Xt=ϕ1Xt1+ϕ2Xt2++ϕpXtp+ϵtX_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + \epsilon_t

where ϵtN(0,σ2)\epsilon_t \sim \mathcal{N}(0, \sigma^2) is white noise and {ϕi}\{\phi_i\} are AR coefficients.

Backshift notation: Define the backshift operator BB such that BkXt=XtkB^k X_t = X_{t-k}. Then:

ϕp(B)Xt=ϵt,ϕp(B)=1ϕ1Bϕ2B2ϕpBp\phi_p(B) X_t = \epsilon_t, \quad \phi_p(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p

Stationarity condition for AR(p): The process is stationary if and only if all roots of the characteristic polynomial ϕp(z)=0\phi_p(z) = 0 lie outside the unit circle in the complex plane.

For AR(1): phi1<1|phi_1| < 1 for stationarity. For AR(2): The stationarity region is the triangle:

  • ϕ1+ϕ2<1\phi_1 + \phi_2 < 1
  • ϕ2ϕ1<1\phi_2 - \phi_1 < 1
  • 1<ϕ2<1-1 < \phi_2 < 1

Unconditional mean: μ=E[Xt]=0\mu = \mathbb{E}[X_t] = 0 (centered). For a model with intercept cc: μ=c/(1ϕ1ϕp)\mu = c / (1 - \phi_1 - \cdots - \phi_p).

2. Moving Average Model - MA(q)

An MA(q) model expresses the current value as a linear combination of current and past noise shocks:

Xt=ϵt+θ1ϵt1+θ2ϵt2++θqϵtqX_t = \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \cdots + \theta_q \epsilon_{t-q}

In backshift notation: Xt=θq(B)ϵtX_t = \theta_q(B) \epsilon_t

Always stationary: Any finite MA(q) is stationary regardless of {θi}\{\theta_i\}.

Invertibility condition: The MA process is invertible (can be written as infinite AR) if all roots of θq(z)=0\theta_q(z) = 0 lie outside the unit circle. Invertibility ensures a unique MA representation and is needed for parameter estimation.

Key property: ACF is exactly zero for lags >q> q. (This is why ACF identifies MA order.)

3. ARMA(p, q)

Combines both:

ϕp(B)Xt=θq(B)ϵt\phi_p(B) X_t = \theta_q(B) \epsilon_t

Expanded:

Xtϕ1Xt1ϕpXtp=ϵt+θ1ϵt1++θqϵtqX_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} = \epsilon_t + \theta_1 \epsilon_{t-1} + \cdots + \theta_q \epsilon_{t-q}

ARMA(p,q) is the most general stationary linear model. It can parsimoniously represent many processes that would require high-order pure AR or MA models.

4. Integration - ARIMA(p, d, q)

Most real-world series (prices, demand) are non-stationary - their mean drifts over time. The "I" in ARIMA handles this via differencing.

Define the difference operator =1B\nabla = 1 - B:

  • Xt=XtXt1\nabla X_t = X_t - X_{t-1} (first difference)
  • 2Xt=(Xt)=Xt2Xt1+Xt2\nabla^2 X_t = \nabla(\nabla X_t) = X_t - 2X_{t-1} + X_{t-2} (second difference)

ARIMA(p, d, q): Apply dd differences to achieve stationarity, then fit ARMA(p,q):

ϕp(B)(1B)dXt=θq(B)ϵt\phi_p(B)(1-B)^d X_t = \theta_q(B)\epsilon_t

ddMeaningWhen to use
0No differencing (already stationary)Stationary series
1First difference XtXt1X_t - X_{t-1}Random walks, trending series
2Second differenceSeries where growth rate is trending

In practice, d2d \leq 2 is almost always sufficient. Rarely d=3d = 3.

SARIMA: Handling Seasonality

Seasonal patterns require seasonal AR/MA operators at lags s,2s,3s,s, 2s, 3s, \ldots

SARIMA(p,d,q)(P,D,Q)s(p,d,q)(P,D,Q)_s:

ΦP(Bs)ϕp(B)(1Bs)D(1B)dXt=ΘQ(Bs)θq(B)ϵt\Phi_P(B^s)\phi_p(B)(1-B^s)^D(1-B)^d X_t = \Theta_Q(B^s)\theta_q(B)\epsilon_t

where:

  • (p,d,q)(p,d,q): non-seasonal AR, integration, MA orders
  • (P,D,Q)(P,D,Q): seasonal AR, integration, MA orders
  • ss: seasonal period (12 for monthly, 4 for quarterly, 7 for daily-weekly, 52 for weekly)

Example SARIMA(1,1,1)(1,1,1)[12]: (1Φ1B12)seasonal AR(1ϕ1B)AR(1B12)seasonal diff(1B)diffXt=(1+Θ1B12)seasonal MA(1+θ1B)MAϵt\underbrace{(1-\Phi_1 B^{12})}_{\text{seasonal AR}}\underbrace{(1-\phi_1 B)}_{\text{AR}}\underbrace{(1-B^{12})}_{\text{seasonal diff}}\underbrace{(1-B)}_{\text{diff}} X_t = \underbrace{(1+\Theta_1 B^{12})}_{\text{seasonal MA}}\underbrace{(1+\theta_1 B)}_{\text{MA}}\epsilon_t

Python Implementation

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.stats.diagnostic import acorr_ljungbox
from sklearn.metrics import mean_squared_error, mean_absolute_error
import itertools

# ─── Generate test data ───────────────────────────────────────────────────────
np.random.seed(42)
n = 300

# AR(2) process (known ground truth for validation)
def generate_arima(n: int, phi: list = None, theta: list = None,
d: int = 0, sigma: float = 1.0) -> np.ndarray:
"""Generate ARIMA(p,d,q) data for testing."""
if phi is None: phi = []
if theta is None: theta = []

p = len(phi)
q = len(theta)
extra = max(p, q) + d + 20 # burn-in

eps = np.random.normal(0, sigma, n + extra)
x = np.zeros(n + extra)

for t in range(max(p, d), n + extra):
# AR component
ar_part = sum(phi[j] * x[t-1-j] for j in range(p))
# MA component
ma_part = sum(theta[j] * eps[t-1-j] for j in range(q)) if q > 0 else 0
x[t] = ar_part + ma_part + eps[t]

# Apply differencing removal (integration) if d > 0
result = x[extra:]
if d > 0:
result = np.cumsum(result) # undo differencing (simulate I(d))

return result

# Ground truth: ARIMA(2,1,1)
series_raw = generate_arima(n, phi=[0.6, -0.2], theta=[0.4], d=1)
series = pd.Series(series_raw, name="y")
print(f"Generated ARIMA(2,1,1) series: mean={series.mean():.2f}, std={series.std():.2f}")

The Box-Jenkins Methodology

The Box-Jenkins methodology is the systematic workflow for fitting ARIMA models.

┌─────────────────────────────────────────────────────┐
│ Box-Jenkins Methodology │
│ │
│ 1. Identification │
│ ├── Check stationarity (ADF test) │
│ ├── Determine d (differencing order) │
│ └── Read ACF/PACF to estimate p, q │
│ │
│ 2. Estimation │
│ ├── Fit ARIMA(p,d,q) via MLE │
│ └── Check parameter significance │
│ │
│ 3. Diagnostic Checking │
│ ├── Residuals ~ white noise? (Ljung-Box) │
│ ├── Residual ACF/PACF significant? │
│ └── Residuals normally distributed? │
│ │
│ 4. Forecasting │
│ ├── Point forecasts with confidence intervals │
│ └── Evaluate on held-out data │
└─────────────────────────────────────────────────────┘
class BoxJenkinsPipeline:
"""
Complete Box-Jenkins ARIMA workflow.
"""

def __init__(self, series: pd.Series, test_size: int = 30):
self.series = series
self.test_size = test_size
self.train = series[:-test_size]
self.test = series[-test_size:]
self.model_fit = None
self.best_order = None

# ── Step 1: Identification ─────────────────────────────────────────────
def step1_identify(self) -> dict:
"""Determine d, p, q from ADF test and ACF/PACF."""
print("=" * 60)
print("STEP 1: IDENTIFICATION")
print("=" * 60)

# Determine d: difference until stationary
d = 0
current = self.train.values.copy()
while d <= 2:
adf_stat, adf_p, _, _, crit, _ = adfuller(current)
print(f"ADF test (d={d}): stat={adf_stat:.4f}, p={adf_p:.4f} "
f"→ {'STATIONARY' if adf_p < 0.05 else 'NON-STATIONARY'}")
if adf_p < 0.05:
break
current = np.diff(current)
d += 1

print(f"\nDifferencing order d = {d}")

# After differencing, read ACF and PACF
acf_vals = acf(current, nlags=20, fft=True)
pacf_vals = pacf(current, nlags=20, method='ywmle')
ci = 1.96 / np.sqrt(len(current))

sig_acf = [k for k in range(1, 21) if abs(acf_vals[k]) > ci]
sig_pacf = [k for k in range(1, 21) if abs(pacf_vals[k]) > ci]

print(f"Significant ACF lags: {sig_acf[:5]}")
print(f"Significant PACF lags: {sig_pacf[:5]}")

# Heuristic p, q
p_candidate = min(sig_pacf[:1], default=[1])[0] if sig_pacf else 1
q_candidate = min(sig_acf[:1], default=[1])[0] if sig_acf else 1

print(f"\nInitial candidates: p={p_candidate}, d={d}, q={q_candidate}")

return {'d': d, 'p_initial': p_candidate, 'q_initial': q_candidate,
'sig_acf': sig_acf, 'sig_pacf': sig_pacf}

# ── Step 2: Estimation (grid search over p, q) ─────────────────────────
def step2_estimate(
self,
d: int = 1,
p_range: range = range(0, 4),
q_range: range = range(0, 4)
) -> pd.DataFrame:
"""Fit multiple ARIMA models, select by AIC."""
print("\n" + "=" * 60)
print("STEP 2: ESTIMATION (Model Selection by AIC)")
print("=" * 60)

results = []
for p, q in itertools.product(p_range, q_range):
try:
model = ARIMA(self.train, order=(p, d, q))
fitted = model.fit(method='innovations_mle')
results.append({
'p': p, 'd': d, 'q': q,
'aic': fitted.aic,
'bic': fitted.bic,
'order': (p, d, q)
})
except Exception:
pass # some configurations may not converge

df = pd.DataFrame(results).sort_values('aic')
print(f"\n{'Order':>15} | {'AIC':>10} | {'BIC':>10}")
print("-" * 40)
for _, row in df.head(5).iterrows():
order_str = f"({row['p']},{row['d']},{row['q']})"
print(f"{order_str:>15} | {row['aic']:>10.2f} | {row['bic']:>10.2f}")

best = df.iloc[0]
self.best_order = (int(best['p']), int(best['d']), int(best['q']))
print(f"\nBest model by AIC: ARIMA{self.best_order}")

# Fit best model
self.model_fit = ARIMA(self.train, order=self.best_order).fit()

print(f"\nModel Parameters:")
print(self.model_fit.summary().tables[1])

return df

# ── Step 3: Diagnostic Checking ─────────────────────────────────────────
def step3_diagnostics(self) -> dict:
"""Check residuals for white noise."""
print("\n" + "=" * 60)
print("STEP 3: DIAGNOSTIC CHECKING")
print("=" * 60)

residuals = self.model_fit.resid

# Ljung-Box test on residuals
lb_results = acorr_ljungbox(residuals, lags=[10, 20], return_df=True)
lb_p10 = lb_results['lb_pvalue'].iloc[0]
lb_p20 = lb_results['lb_pvalue'].iloc[1]

# Residual ACF (should all be insignificant)
res_acf = acf(residuals, nlags=20, fft=True)
ci = 1.96 / np.sqrt(len(residuals))
sig_res_acf = [k for k in range(1, 21) if abs(res_acf[k]) > ci]

# Normality test
from scipy.stats import shapiro, normaltest
_, normality_p = normaltest(residuals) # D'Agostino-Pearson

diagnostics = {
'lb_p_lag10': lb_p10,
'lb_p_lag20': lb_p20,
'residuals_white_noise': lb_p10 > 0.05 and lb_p20 > 0.05,
'sig_residual_acf_lags': sig_res_acf,
'normality_p': normality_p,
'residual_mean': residuals.mean(),
'residual_std': residuals.std()
}

print(f"Residual diagnostics:")
print(f" Ljung-Box p (lag 10): {lb_p10:.4f} "
f"→ {'✓ white noise' if lb_p10 > 0.05 else '✗ autocorrelation remains'}")
print(f" Ljung-Box p (lag 20): {lb_p20:.4f} "
f"→ {'✓ white noise' if lb_p20 > 0.05 else '✗ autocorrelation remains'}")
print(f" Sig. residual ACF lags: {sig_res_acf[:5]}")
print(f" Normality p-value: {normality_p:.4f} "
f"→ {'✓ normal' if normality_p > 0.05 else 'non-normal'}")
print(f" Residual mean: {residuals.mean():.6f} (should be ~0)")

if sig_res_acf:
print(f"\n ⚠ Significant residual ACF at lags {sig_res_acf[:3]}")
print(f" Consider increasing p or q to capture remaining structure.")

return diagnostics

# ── Step 4: Forecasting ─────────────────────────────────────────────────
def step4_forecast(self, steps: int = None) -> pd.DataFrame:
"""Generate forecasts with confidence intervals."""
print("\n" + "=" * 60)
print("STEP 4: FORECASTING")
print("=" * 60)

if steps is None:
steps = self.test_size

# In-sample fit
in_sample = self.model_fit.fittedvalues

# Out-of-sample forecast
forecast_result = self.model_fit.get_forecast(steps=steps)
forecast_mean = forecast_result.predicted_mean
forecast_ci = forecast_result.conf_int(alpha=0.05)

# Evaluation
# Align test with forecast (handle index alignment)
y_true = self.test.values
y_pred = forecast_mean.values[:len(y_true)]

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
mape = np.mean(np.abs((y_true - y_pred) / (y_true + 1e-8))) * 100

print(f"\nForecast metrics on {steps}-step ahead:")
print(f" RMSE: {rmse:.4f}")
print(f" MAE: {mae:.4f}")
print(f" MAPE: {mape:.2f}%")

# Naive baseline (last observed value)
naive_pred = np.full(len(y_true), self.train.iloc[-1])
naive_rmse = np.sqrt(mean_squared_error(y_true, naive_pred))
print(f"\n Naive baseline RMSE: {naive_rmse:.4f}")
print(f" ARIMA improvement: {(naive_rmse - rmse)/naive_rmse:.1%}")

return pd.DataFrame({
'forecast': forecast_mean,
'lower_95': forecast_ci.iloc[:, 0],
'upper_95': forecast_ci.iloc[:, 1]
})

# Run the full pipeline
pipeline = BoxJenkinsPipeline(series, test_size=30)
ident = pipeline.step1_identify()
model_df = pipeline.step2_estimate(d=ident['d'])
diag = pipeline.step3_diagnostics()
forecasts = pipeline.step4_forecast()

Model Selection: AIC vs BIC

Both AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) balance goodness-of-fit against model complexity.

AIC=2ln(L^)+2k\text{AIC} = -2\ln(\hat{L}) + 2k

BIC=2ln(L^)+kln(T)\text{BIC} = -2\ln(\hat{L}) + k\ln(T)

where L^\hat{L} is the maximized likelihood, kk is the number of parameters, and TT is the sample size.

Choose lower AIC/BIC.

CriterionPenaltyPreferenceWhen to use
AIC2k2kPenalizes lessWhen forecasting accuracy is the goal; tends to select larger models
BICklnTk \ln TPenalizes moreWhen interpretability matters; tends to select smaller models
AICc2k+2k2/(Tk1)2k + 2k^2/(T-k-1)AIC correctedSmall samples (T<40kT < 40k) - use AICc instead of AIC
def aic_bic_comparison(
series: pd.Series,
d: int = 1,
max_p: int = 4,
max_q: int = 4,
criterion: str = 'aic'
) -> tuple:
"""Grid search for best ARIMA order using AIC or BIC."""
best_score = np.inf
best_order = None
scores = {}

for p in range(max_p + 1):
for q in range(max_q + 1):
if p == 0 and q == 0:
continue
try:
model = ARIMA(series, order=(p, d, q))
fitted = model.fit()
score = fitted.aic if criterion == 'aic' else fitted.bic
scores[(p, d, q)] = score
if score < best_score:
best_score = score
best_order = (p, d, q)
except Exception:
pass

return best_order, best_score, scores

best_order_aic, best_aic, _ = aic_bic_comparison(series, d=1, criterion='aic')
best_order_bic, best_bic, _ = aic_bic_comparison(series, d=1, criterion='bic')

print(f"Best by AIC: ARIMA{best_order_aic}, AIC={best_aic:.2f}")
print(f"Best by BIC: ARIMA{best_order_bic}, BIC={best_bic:.2f}")

SARIMA in Practice: Monthly Retail Sales

def fit_sarima_monthly(
series: pd.Series,
test_months: int = 24
) -> dict:
"""
Fit SARIMA model to monthly data with annual seasonality.

Classic application: retail sales, energy consumption,
subscription counts - any KPI with monthly/quarterly patterns.
"""
train = series[:-test_months]
test = series[-test_months:]

print("Fitting SARIMA models for monthly data...")
print("Seasonal period: 12 months\n")

# Common SARIMA specifications for monthly data
sarima_configs = [
{'order': (1,1,1), 'seasonal_order': (1,1,1,12)},
{'order': (0,1,1), 'seasonal_order': (0,1,1,12)}, # airline model
{'order': (1,1,0), 'seasonal_order': (1,1,0,12)},
{'order': (2,1,1), 'seasonal_order': (1,1,1,12)},
{'order': (0,1,1), 'seasonal_order': (1,1,0,12)},
]

results = []
for config in sarima_configs:
try:
model = SARIMAX(
train,
order=config['order'],
seasonal_order=config['seasonal_order'],
enforce_stationarity=False,
enforce_invertibility=False
)
fitted = model.fit(disp=False)

# Forecast
forecast = fitted.get_forecast(steps=test_months)
y_pred = forecast.predicted_mean.values
y_true = test.values

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
results.append({
'config': f"SARIMA{config['order']}{config['seasonal_order']}",
'aic': fitted.aic,
'bic': fitted.bic,
'test_rmse': rmse,
'fitted': fitted
})
except Exception as e:
pass

df = pd.DataFrame([{k: v for k, v in r.items() if k != 'fitted'}
for r in results]).sort_values('aic')

print(f"{'Model':>40} | {'AIC':>8} | {'BIC':>8} | {'RMSE':>8}")
print("-" * 72)
for _, row in df.iterrows():
print(f"{row['config']:>40} | {row['aic']:>8.2f} | {row['bic']:>8.2f} | {row['test_rmse']:>8.4f}")

best_result = min(results, key=lambda x: x['aic'])
return best_result

# Generate monthly sales data with seasonality
np.random.seed(0)
n_months = 120 # 10 years of monthly data
t = np.arange(n_months)

# Seasonal pattern + trend + AR(1) structure
seasonal = 50 * np.sin(2 * np.pi * t / 12) + 20 * np.sin(2 * np.pi * 2 * t / 12)
trend = 2 * t
ar_noise = np.zeros(n_months)
for i in range(1, n_months):
ar_noise[i] = 0.7 * ar_noise[i-1] + np.random.normal(0, 15)

sales = 500 + trend + seasonal + ar_noise
sales_series = pd.Series(sales)

best = fit_sarima_monthly(sales_series, test_months=24)
print(f"\nBest model: {best['config']}, RMSE={best['test_rmse']:.4f}")

ARIMA vs Deep Learning: When Does Each Win?

def benchmark_arima_vs_naive(
series: pd.Series,
test_size: int = 24,
arima_order: tuple = (1,1,1)
) -> pd.DataFrame:
"""
Benchmark ARIMA against naive baselines.
ARIMA should beat these baselines; if not, something is wrong.
"""
train = series[:-test_size].values
test = series[-test_size:].values

results = {}

# Naive: last value
naive_last = np.full(test_size, train[-1])
results['Naive (last value)'] = np.sqrt(mean_squared_error(test, naive_last))

# Naive: seasonal (same period last year, for monthly data)
if len(train) >= 12:
naive_seasonal = train[-12:][:test_size] if test_size <= 12 else np.tile(train[-12:], 3)[:test_size]
results['Naive (seasonal)'] = np.sqrt(mean_squared_error(test, naive_seasonal))

# ARIMA
try:
model = ARIMA(train, order=arima_order)
fitted = model.fit()
arima_pred = fitted.forecast(steps=test_size)
results['ARIMA'] = np.sqrt(mean_squared_error(test, arima_pred))
except Exception as e:
results['ARIMA'] = float('nan')

# Exponential smoothing (simple)
from statsmodels.tsa.holtwinters import ExponentialSmoothing
hw_model = ExponentialSmoothing(train, trend='add', seasonal='add',
seasonal_periods=12).fit(optimized=True)
hw_pred = hw_model.forecast(test_size)
results['Holt-Winters'] = np.sqrt(mean_squared_error(test, hw_pred))

df = pd.DataFrame.from_dict(results, orient='index', columns=['RMSE'])
df = df.sort_values('RMSE')

print("Benchmark Results (RMSE):")
print(f"{'Model':<30} | {'RMSE':>10}")
print("-" * 45)
for model_name, row in df.iterrows():
print(f"{model_name:<30} | {row['RMSE']:>10.4f}")

return df

benchmark_arima_vs_naive(sales_series)

Decision framework for ARIMA vs neural networks:

FactorARIMA winsNeural net wins
Series length< 200 observations> 1000+ observations
Computational budgetLow (CPU, seconds)High (GPU, hours)
Interpretability requiredHigh (coefficients matter)Low (black box OK)
SeasonalityWell-defined seasonal periodComplex, multi-scale
Exogenous variablesFew, linear relationshipsMany, complex interactions
Distribution shiftStable (in-distribution)Adaptive models handle shift
Multiple seriesEach series separate modelSingle model generalizes

Auto-ARIMA: Automated Order Selection

def auto_arima_simple(
series: pd.Series | np.ndarray,
max_p: int = 5,
max_q: int = 5,
max_d: int = 2,
criterion: str = 'aic',
verbose: bool = True
) -> tuple:
"""
Simple auto-ARIMA implementation: grid search with AIC/BIC.

For production, use:
- pmdarima.auto_arima (battle-tested, handles edge cases)
- statsforecast.AutoARIMA (fastest implementation, Nixtla)
"""
if isinstance(series, np.ndarray):
series = pd.Series(series)

# Determine d by ADF test
d = 0
current = series.values.copy()
while d < max_d:
_, adf_p, *_ = adfuller(current)
if adf_p < 0.05:
break
current = np.diff(current)
d += 1

best_score = np.inf
best_order = None
best_model = None

for p in range(max_p + 1):
for q in range(max_q + 1):
if p == 0 and q == 0:
continue
try:
model = ARIMA(series, order=(p, d, q))
fitted = model.fit(method='innovations_mle')
score = fitted.aic if criterion == 'aic' else fitted.bic
if score < best_score:
best_score = score
best_order = (p, d, q)
best_model = fitted
except Exception:
pass

if verbose:
print(f"Auto-ARIMA result: ARIMA{best_order}, {criterion.upper()}={best_score:.2f}")

return best_order, best_model

order, model = auto_arima_simple(series)

Online Forecasting: Recursive Update

In production ML systems, you often need to update forecasts as new data arrives, without refitting from scratch.

def recursive_arima_forecast(
initial_train: np.ndarray,
new_observations: np.ndarray,
order: tuple = (1, 1, 1),
refit_every: int = 10
) -> np.ndarray:
"""
Rolling walk-forward ARIMA forecast.

Strategy 1: Refit the model every `refit_every` periods
Strategy 2: Use model.apply() to update state without full refit (faster)

In production systems:
- Full refit: expensive but gets updated parameter estimates
- State update (Kalman filter approach): O(1) per observation
- Compromise: refit every N periods, state-update in between
"""
all_data = list(initial_train)
forecasts = []

current_model = ARIMA(all_data, order=order).fit(method='innovations_mle')

for i, y_new in enumerate(new_observations):
# Forecast next step
fc = current_model.forecast(steps=1)[0]
forecasts.append(fc)

# Add new observation
all_data.append(y_new)

# Refit periodically
if (i + 1) % refit_every == 0:
current_model = ARIMA(all_data, order=order).fit(method='innovations_mle')

return np.array(forecasts)

# Run walk-forward evaluation
train_size = 200
initial = series.values[:train_size]
holdout = series.values[train_size:]

rolling_preds = recursive_arima_forecast(initial, holdout, order=best_order_aic)
rolling_rmse = np.sqrt(mean_squared_error(holdout, rolling_preds))
print(f"\nWalk-forward ARIMA RMSE: {rolling_rmse:.4f}")

Interview Questions

Q1: Explain ARIMA(p,d,q) - what does each component mean?

AR(p) - Autoregressive of order p: The current value XtX_t depends on the last pp observed values. Each lag XtkX_{t-k} has a coefficient ϕk\phi_k. Captures "momentum" and "mean reversion" patterns. PACF cuts off at lag pp.

I(d) - Integrated of order d: The series is differenced dd times to achieve stationarity. d=1d=1 removes linear trends (the difference XtXt1X_t - X_{t-1} is stationary). d=2d=2 removes quadratic trends. Most real-world series need d=0d=0 or d=1d=1.

MA(q) - Moving Average of order q: The current value depends on the last qq noise shocks ϵtk\epsilon_{t-k}. Captures how shocks propagate through the series. ACF cuts off at lag qq.

Combined ARIMA(p,d,q): Take the dd-th difference of the series to get a stationary series, then fit ARMA(p,q) to it.

Practical identification:

  1. Run ADF test → if non-stationary, difference once (d=1) and re-test
  2. Look at ACF of differenced series: if cuts off at q → MA(q) component
  3. Look at PACF of differenced series: if cuts off at p → AR(p) component
  4. Confirm with AIC/BIC grid search
Q2: What is the difference between AIC and BIC, and which should you use?

Both penalize model complexity (number of parameters kk) to prevent overfitting:

AIC=2lnL^+2kBIC=2lnL^+klnT\text{AIC} = -2\ln\hat{L} + 2k \qquad \text{BIC} = -2\ln\hat{L} + k\ln T

BIC penalizes more for large TT (since lnT>2\ln T > 2 when T>8T > 8), so BIC selects simpler models than AIC when sample sizes are moderate to large.

AIC is optimal for forecasting: It minimizes one-step-ahead prediction error asymptotically (Akaike, 1974). If your goal is accurate forecasts, use AIC.

BIC is consistent: It selects the true model order as TT \to \infty (if the true model is in the candidate set). If you want to identify the true data-generating process, use BIC.

Practical rule of thumb:

  • AIC → when you care about forecast accuracy (common in ML)
  • BIC → when you care about model parsimony/interpretability
  • If AIC and BIC agree: high confidence in the selected model
  • If they disagree: AIC selects a larger model; compare both on a validation set

For small samples (T<40kT < 40k), use AICc (corrected AIC) to avoid overfitting.

Q3: Why do you check residuals after fitting ARIMA? What patterns indicate a misspecified model?

After fitting ARIMA, the residuals should be white noise - no remaining temporal structure. If they're not, the model failed to capture some pattern.

What to check:

  1. Ljung-Box test on residuals: Test H0H_0: residuals are white noise. Reject (p<0.05p < 0.05) → model is misspecified.

  2. Residual ACF and PACF: Should be insignificant at all lags.

    • Significant spike at lag kk in ACF → add MA(kk) term
    • Significant spike at lag kk in PACF → add AR(kk) term
    • Spikes at seasonal lags (12, 24) → add seasonal SARIMA components
  3. Residual mean ≈ 0: Non-zero mean → include intercept term.

  4. Residual normality: For valid prediction intervals, residuals should be approximately normal. Non-normality doesn't affect point forecasts but invalidates confidence intervals.

  5. Ljung-Box on squared residuals: Detects ARCH effects (heteroskedasticity). If significant → residuals have non-constant variance; fit GARCH after ARIMA.

Common misspecification patterns and fixes:

  • ACF residual spike at lag 1 → q too small, increase q
  • PACF residual spike at lag 2 → p too small, increase p
  • Slow ACF decay in residuals → d insufficient, increase d
  • Seasonal ACF spikes in residuals → need SARIMA (P,D,Q)s(P, D, Q)_s
  • Residual variance increases over time → transform series (log), or fit GARCH
Q4: When should you use ARIMA vs a neural network for forecasting?

Use ARIMA when:

  • Short series (< 200 observations) - neural nets overfit
  • Need prediction intervals (ARIMA has exact distributional theory)
  • Interpretability required (AR/MA coefficients have clear meaning)
  • Fast inference / no GPU (ARIMA is CPU-only, milliseconds)
  • Strong simple seasonality (SARIMA handles this naturally)
  • Prototyping / establishing a baseline
  • Business stakeholders need to understand the model

Use neural networks when:

  • Long series (1000+ observations)
  • Multiple related series (one model generalizes across series)
  • Rich exogenous covariates (promotional variables, weather, macroeconomic)
  • Complex non-linear dynamics
  • Irregular seasonality or multiple seasonal periods
  • Transfer learning across series (pre-trained Chronos, TimeGPT)
  • When you've proven ARIMA can't capture the dynamics

The key discipline: Always fit ARIMA first and measure its performance. A neural network that underperforms ARIMA is a red flag - either the data doesn't have complex temporal structure worth learning, or the neural net is improperly configured.

In practice, ensembles often win: ARIMA captures the linear temporal structure; a gradient boosted model captures non-linear residuals.

Q5: What are unit roots and why do they matter for ARIMA?

A unit root occurs when the AR characteristic polynomial has a root exactly on the unit circle (modulus = 1). The simplest example is AR(1) with ϕ1=1\phi_1 = 1:

Xt=Xt1+ϵtX_t = X_{t-1} + \epsilon_t

This is a random walk - a unit root process. It is non-stationary: its variance grows without bound over time (Var(Xt)=tσ2\text{Var}(X_t) = t\sigma^2).

Why unit roots matter for ARIMA:

  1. Stationarity requirement: Standard ARIMA assumes the differenced series is stationary. If the original series has a unit root, d=1d=1 differencing removes it and gives a stationary ARMA process.

  2. Hypothesis testing: The ADF (Augmented Dickey-Fuller) test tests H0H_0: unit root present vs H1H_1: stationary. You must reject H0H_0 before fitting ARMA. If H0H_0 is not rejected → difference first.

  3. Spurious regression risk: Regressing two independent random walks (both with unit roots) produces statistically significant coefficients even though they're completely unrelated. This is why you cannot apply OLS to non-stationary series directly.

  4. Integration order: A series with dd unit roots is I(d)I(d) (integrated of order dd). The ARIMA differencing parameter dd is the number of unit roots to remove.

  5. Cointegration: If two I(1)I(1) series have a linear combination that is I(0)I(0) (stationary), they are cointegrated - they share a common stochastic trend. This is handled by VECM (Vector Error Correction Model), not ARIMA.

Key Takeaways

  • AR(p): current value depends on pp past values; MA(q): depends on qq past shocks; ARIMA(p,d,q): difference dd times then fit ARMA(p,q)
  • SARIMA(p,d,q)(P,D,Q)s(p,d,q)(P,D,Q)_s handles seasonal patterns at period ss
  • Box-Jenkins methodology: Identify (ADF + ACF/PACF) → Estimate (MLE) → Diagnose (residual ACF, Ljung-Box) → Forecast
  • Use AIC for forecasting accuracy, BIC for model parsimony
  • Residuals must be white noise - persistent residual ACF indicates model misspecification
  • Always fit ARIMA before neural networks - it's the baseline you must beat and explain

Next: Lesson 05: State Space Models →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the ARIMA Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.