ARIMA Models

Reading time: ~45 minutes Interview relevance: Very High - ARIMA is the classical forecasting baseline; every ML forecasting interview starts here Target roles: ML Engineer, Data Scientist, Financial ML, Research Engineer

The Real Interview Moment

You're presenting your LSTM forecasting model to a senior data scientist at a retail company. Your model achieves a MAPE of 8% on the test set. The senior DS asks: "Did you compare this to ARIMA?"

You did not.

"Why not? ARIMA with seasonal components often gets 5–7% MAPE on retail demand forecasting. Before you convince the business to run a GPU cluster for LSTMs, you need to prove that the complexity is justified. What's your ARIMA baseline?"

This scenario plays out constantly in industry. Neural networks are not always better than ARIMA. For short, well-behaved time series with strong seasonality, SARIMA frequently outperforms deep learning - at a fraction of the computational cost.

ARIMA is not a legacy technique you dismiss. It is the baseline you must beat and explain. Understanding its mathematics tells you exactly when it will and won't work - and what your deep learning model must be doing better.

Building Up to ARIMA

ARIMA combines three components: Autoregressive (AR), Integrated (I), and Moving Average (MA). Each adds expressive power for different temporal structures.

1. Autoregressive Model - AR(p)

An AR(p) model expresses the current value as a linear function of the $p$ most recent values:

$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + \epsilon_t$

where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$ is white noise and $\{\phi_i\}$ are AR coefficients.

Backshift notation: Define the backshift operator $B$ such that $B^k X_t = X_{t-k}$ . Then:

$\phi_p(B) X_t = \epsilon_t, \quad \phi_p(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p$

Stationarity condition for AR(p): The process is stationary if and only if all roots of the characteristic polynomial $\phi_p(z) = 0$ lie outside the unit circle in the complex plane.

For AR(1): $|phi_1| < 1$ for stationarity. For AR(2): The stationarity region is the triangle:

$\phi_1 + \phi_2 < 1$
$\phi_2 - \phi_1 < 1$
$-1 < \phi_2 < 1$

Unconditional mean: $\mu = \mathbb{E}[X_t] = 0$ (centered). For a model with intercept $c$ : $\mu = c / (1 - \phi_1 - \cdots - \phi_p)$ .

2. Moving Average Model - MA(q)

An MA(q) model expresses the current value as a linear combination of current and past noise shocks:

$X_t = \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \cdots + \theta_q \epsilon_{t-q}$

In backshift notation: $X_t = \theta_q(B) \epsilon_t$

Always stationary: Any finite MA(q) is stationary regardless of $\{\theta_i\}$ .

Invertibility condition: The MA process is invertible (can be written as infinite AR) if all roots of $\theta_q(z) = 0$ lie outside the unit circle. Invertibility ensures a unique MA representation and is needed for parameter estimation.

Key property: ACF is exactly zero for lags $> q$ . (This is why ACF identifies MA order.)

3. ARMA(p, q)

Combines both:

$\phi_p(B) X_t = \theta_q(B) \epsilon_t$

Expanded:

$X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} = \epsilon_t + \theta_1 \epsilon_{t-1} + \cdots + \theta_q \epsilon_{t-q}$

ARMA(p,q) is the most general stationary linear model. It can parsimoniously represent many processes that would require high-order pure AR or MA models.

4. Integration - ARIMA(p, d, q)

Most real-world series (prices, demand) are non-stationary - their mean drifts over time. The "I" in ARIMA handles this via differencing.

Define the difference operator $\nabla = 1 - B$ :

$\nabla X_t = X_t - X_{t-1}$ (first difference)
$\nabla^2 X_t = \nabla(\nabla X_t) = X_t - 2X_{t-1} + X_{t-2}$ (second difference)

ARIMA(p, d, q): Apply $d$ differences to achieve stationarity, then fit ARMA(p,q):

$\phi_p(B)(1-B)^d X_t = \theta_q(B)\epsilon_t$

$d$	Meaning	When to use
0	No differencing (already stationary)	Stationary series
1	First difference $X_t - X_{t-1}$	Random walks, trending series
2	Second difference	Series where growth rate is trending

In practice, $d \leq 2$ is almost always sufficient. Rarely $d = 3$ .

SARIMA: Handling Seasonality

Seasonal patterns require seasonal AR/MA operators at lags $s, 2s, 3s, \ldots$

SARIMA $(p,d,q)(P,D,Q)_s$ :

$\Phi_P(B^s)\phi_p(B)(1-B^s)^D(1-B)^d X_t = \Theta_Q(B^s)\theta_q(B)\epsilon_t$

where:

$(p,d,q)$ : non-seasonal AR, integration, MA orders
$(P,D,Q)$ : seasonal AR, integration, MA orders
$s$ : seasonal period (12 for monthly, 4 for quarterly, 7 for daily-weekly, 52 for weekly)

Example SARIMA(1,1,1)(1,1,1)[12]: $\underbrace{(1-\Phi_1 B^{12})}_{\text{seasonal AR}}\underbrace{(1-\phi_1 B)}_{\text{AR}}\underbrace{(1-B^{12})}_{\text{seasonal diff}}\underbrace{(1-B)}_{\text{diff}} X_t = \underbrace{(1+\Theta_1 B^{12})}_{\text{seasonal MA}}\underbrace{(1+\theta_1 B)}_{\text{MA}}\epsilon_t$

Python Implementation

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.stats.diagnostic import acorr_ljungbox
from sklearn.metrics import mean_squared_error, mean_absolute_error
import itertools

# ─── Generate test data ───────────────────────────────────────────────────────
np.random.seed(42)
n = 300

# AR(2) process (known ground truth for validation)
def generate_arima(n: int, phi: list = None, theta: list = None,
                   d: int = 0, sigma: float = 1.0) -> np.ndarray:
    """Generate ARIMA(p,d,q) data for testing."""
    if phi is None: phi = []
    if theta is None: theta = []

    p = len(phi)
    q = len(theta)
    extra = max(p, q) + d + 20  # burn-in

    eps = np.random.normal(0, sigma, n + extra)
    x = np.zeros(n + extra)

    for t in range(max(p, d), n + extra):
        # AR component
        ar_part = sum(phi[j] * x[t-1-j] for j in range(p))
        # MA component
        ma_part = sum(theta[j] * eps[t-1-j] for j in range(q)) if q > 0 else 0
        x[t] = ar_part + ma_part + eps[t]

    # Apply differencing removal (integration) if d > 0
    result = x[extra:]
    if d > 0:
        result = np.cumsum(result)  # undo differencing (simulate I(d))

    return result

# Ground truth: ARIMA(2,1,1)
series_raw = generate_arima(n, phi=[0.6, -0.2], theta=[0.4], d=1)
series = pd.Series(series_raw, name="y")
print(f"Generated ARIMA(2,1,1) series: mean={series.mean():.2f}, std={series.std():.2f}")

The Box-Jenkins Methodology

The Box-Jenkins methodology is the systematic workflow for fitting ARIMA models.

┌─────────────────────────────────────────────────────┐
│             Box-Jenkins Methodology                  │
│                                                     │
│  1. Identification                                  │
│     ├── Check stationarity (ADF test)               │
│     ├── Determine d (differencing order)            │
│     └── Read ACF/PACF to estimate p, q             │
│                                                     │
│  2. Estimation                                      │
│     ├── Fit ARIMA(p,d,q) via MLE                   │
│     └── Check parameter significance               │
│                                                     │
│  3. Diagnostic Checking                             │
│     ├── Residuals ~ white noise? (Ljung-Box)       │
│     ├── Residual ACF/PACF significant?              │
│     └── Residuals normally distributed?            │
│                                                     │
│  4. Forecasting                                     │
│     ├── Point forecasts with confidence intervals  │
│     └── Evaluate on held-out data                 │
└─────────────────────────────────────────────────────┘

class BoxJenkinsPipeline:
    """
    Complete Box-Jenkins ARIMA workflow.
    """

    def __init__(self, series: pd.Series, test_size: int = 30):
        self.series = series
        self.test_size = test_size
        self.train = series[:-test_size]
        self.test = series[-test_size:]
        self.model_fit = None
        self.best_order = None

    # ── Step 1: Identification ─────────────────────────────────────────────
    def step1_identify(self) -> dict:
        """Determine d, p, q from ADF test and ACF/PACF."""
        print("=" * 60)
        print("STEP 1: IDENTIFICATION")
        print("=" * 60)

        # Determine d: difference until stationary
        d = 0
        current = self.train.values.copy()
        while d <= 2:
            adf_stat, adf_p, _, _, crit, _ = adfuller(current)
            print(f"ADF test (d={d}): stat={adf_stat:.4f}, p={adf_p:.4f} "
                  f"→ {'STATIONARY' if adf_p < 0.05 else 'NON-STATIONARY'}")
            if adf_p < 0.05:
                break
            current = np.diff(current)
            d += 1

        print(f"\nDifferencing order d = {d}")

        # After differencing, read ACF and PACF
        acf_vals = acf(current, nlags=20, fft=True)
        pacf_vals = pacf(current, nlags=20, method='ywmle')
        ci = 1.96 / np.sqrt(len(current))

        sig_acf = [k for k in range(1, 21) if abs(acf_vals[k]) > ci]
        sig_pacf = [k for k in range(1, 21) if abs(pacf_vals[k]) > ci]

        print(f"Significant ACF lags:  {sig_acf[:5]}")
        print(f"Significant PACF lags: {sig_pacf[:5]}")

        # Heuristic p, q
        p_candidate = min(sig_pacf[:1], default=[1])[0] if sig_pacf else 1
        q_candidate = min(sig_acf[:1], default=[1])[0] if sig_acf else 1

        print(f"\nInitial candidates: p={p_candidate}, d={d}, q={q_candidate}")

        return {'d': d, 'p_initial': p_candidate, 'q_initial': q_candidate,
                'sig_acf': sig_acf, 'sig_pacf': sig_pacf}

    # ── Step 2: Estimation (grid search over p, q) ─────────────────────────
    def step2_estimate(
        self,
        d: int = 1,
        p_range: range = range(0, 4),
        q_range: range = range(0, 4)
    ) -> pd.DataFrame:
        """Fit multiple ARIMA models, select by AIC."""
        print("\n" + "=" * 60)
        print("STEP 2: ESTIMATION (Model Selection by AIC)")
        print("=" * 60)

        results = []
        for p, q in itertools.product(p_range, q_range):
            try:
                model = ARIMA(self.train, order=(p, d, q))
                fitted = model.fit(method='innovations_mle')
                results.append({
                    'p': p, 'd': d, 'q': q,
                    'aic': fitted.aic,
                    'bic': fitted.bic,
                    'order': (p, d, q)
                })
            except Exception:
                pass  # some configurations may not converge

        df = pd.DataFrame(results).sort_values('aic')
        print(f"\n{'Order':>15} | {'AIC':>10} | {'BIC':>10}")
        print("-" * 40)
        for _, row in df.head(5).iterrows():
            order_str = f"({row['p']},{row['d']},{row['q']})"
            print(f"{order_str:>15} | {row['aic']:>10.2f} | {row['bic']:>10.2f}")

        best = df.iloc[0]
        self.best_order = (int(best['p']), int(best['d']), int(best['q']))
        print(f"\nBest model by AIC: ARIMA{self.best_order}")

        # Fit best model
        self.model_fit = ARIMA(self.train, order=self.best_order).fit()

        print(f"\nModel Parameters:")
        print(self.model_fit.summary().tables[1])

        return df

    # ── Step 3: Diagnostic Checking ─────────────────────────────────────────
    def step3_diagnostics(self) -> dict:
        """Check residuals for white noise."""
        print("\n" + "=" * 60)
        print("STEP 3: DIAGNOSTIC CHECKING")
        print("=" * 60)

        residuals = self.model_fit.resid

        # Ljung-Box test on residuals
        lb_results = acorr_ljungbox(residuals, lags=[10, 20], return_df=True)
        lb_p10 = lb_results['lb_pvalue'].iloc[0]
        lb_p20 = lb_results['lb_pvalue'].iloc[1]

        # Residual ACF (should all be insignificant)
        res_acf = acf(residuals, nlags=20, fft=True)
        ci = 1.96 / np.sqrt(len(residuals))
        sig_res_acf = [k for k in range(1, 21) if abs(res_acf[k]) > ci]

        # Normality test
        from scipy.stats import shapiro, normaltest
        _, normality_p = normaltest(residuals)  # D'Agostino-Pearson

        diagnostics = {
            'lb_p_lag10': lb_p10,
            'lb_p_lag20': lb_p20,
            'residuals_white_noise': lb_p10 > 0.05 and lb_p20 > 0.05,
            'sig_residual_acf_lags': sig_res_acf,
            'normality_p': normality_p,
            'residual_mean': residuals.mean(),
            'residual_std': residuals.std()
        }

        print(f"Residual diagnostics:")
        print(f"  Ljung-Box p (lag 10): {lb_p10:.4f} "
              f"→ {'✓ white noise' if lb_p10 > 0.05 else '✗ autocorrelation remains'}")
        print(f"  Ljung-Box p (lag 20): {lb_p20:.4f} "
              f"→ {'✓ white noise' if lb_p20 > 0.05 else '✗ autocorrelation remains'}")
        print(f"  Sig. residual ACF lags: {sig_res_acf[:5]}")
        print(f"  Normality p-value: {normality_p:.4f} "
              f"→ {'✓ normal' if normality_p > 0.05 else 'non-normal'}")
        print(f"  Residual mean: {residuals.mean():.6f} (should be ~0)")

        if sig_res_acf:
            print(f"\n  ⚠ Significant residual ACF at lags {sig_res_acf[:3]}")
            print(f"  Consider increasing p or q to capture remaining structure.")

        return diagnostics

    # ── Step 4: Forecasting ─────────────────────────────────────────────────
    def step4_forecast(self, steps: int = None) -> pd.DataFrame:
        """Generate forecasts with confidence intervals."""
        print("\n" + "=" * 60)
        print("STEP 4: FORECASTING")
        print("=" * 60)

        if steps is None:
            steps = self.test_size

        # In-sample fit
        in_sample = self.model_fit.fittedvalues

        # Out-of-sample forecast
        forecast_result = self.model_fit.get_forecast(steps=steps)
        forecast_mean = forecast_result.predicted_mean
        forecast_ci = forecast_result.conf_int(alpha=0.05)

        # Evaluation
        # Align test with forecast (handle index alignment)
        y_true = self.test.values
        y_pred = forecast_mean.values[:len(y_true)]

        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        mae = mean_absolute_error(y_true, y_pred)
        mape = np.mean(np.abs((y_true - y_pred) / (y_true + 1e-8))) * 100

        print(f"\nForecast metrics on {steps}-step ahead:")
        print(f"  RMSE: {rmse:.4f}")
        print(f"  MAE:  {mae:.4f}")
        print(f"  MAPE: {mape:.2f}%")

        # Naive baseline (last observed value)
        naive_pred = np.full(len(y_true), self.train.iloc[-1])
        naive_rmse = np.sqrt(mean_squared_error(y_true, naive_pred))
        print(f"\n  Naive baseline RMSE: {naive_rmse:.4f}")
        print(f"  ARIMA improvement:   {(naive_rmse - rmse)/naive_rmse:.1%}")

        return pd.DataFrame({
            'forecast': forecast_mean,
            'lower_95': forecast_ci.iloc[:, 0],
            'upper_95': forecast_ci.iloc[:, 1]
        })

# Run the full pipeline
pipeline = BoxJenkinsPipeline(series, test_size=30)
ident = pipeline.step1_identify()
model_df = pipeline.step2_estimate(d=ident['d'])
diag = pipeline.step3_diagnostics()
forecasts = pipeline.step4_forecast()

Model Selection: AIC vs BIC

Both AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) balance goodness-of-fit against model complexity.

$\text{AIC} = -2\ln(\hat{L}) + 2k$

$\text{BIC} = -2\ln(\hat{L}) + k\ln(T)$

where $\hat{L}$ is the maximized likelihood, $k$ is the number of parameters, and $T$ is the sample size.

Choose lower AIC/BIC.

Criterion	Penalty	Preference	When to use
AIC	$2k$	Penalizes less	When forecasting accuracy is the goal; tends to select larger models
BIC	$k \ln T$	Penalizes more	When interpretability matters; tends to select smaller models
AICc	$2k + 2k^2/(T-k-1)$	AIC corrected	Small samples ( $T < 40k$ ) - use AICc instead of AIC

def aic_bic_comparison(
    series: pd.Series,
    d: int = 1,
    max_p: int = 4,
    max_q: int = 4,
    criterion: str = 'aic'
) -> tuple:
    """Grid search for best ARIMA order using AIC or BIC."""
    best_score = np.inf
    best_order = None
    scores = {}

    for p in range(max_p + 1):
        for q in range(max_q + 1):
            if p == 0 and q == 0:
                continue
            try:
                model = ARIMA(series, order=(p, d, q))
                fitted = model.fit()
                score = fitted.aic if criterion == 'aic' else fitted.bic
                scores[(p, d, q)] = score
                if score < best_score:
                    best_score = score
                    best_order = (p, d, q)
            except Exception:
                pass

    return best_order, best_score, scores

best_order_aic, best_aic, _ = aic_bic_comparison(series, d=1, criterion='aic')
best_order_bic, best_bic, _ = aic_bic_comparison(series, d=1, criterion='bic')

print(f"Best by AIC: ARIMA{best_order_aic}, AIC={best_aic:.2f}")
print(f"Best by BIC: ARIMA{best_order_bic}, BIC={best_bic:.2f}")

SARIMA in Practice: Monthly Retail Sales

def fit_sarima_monthly(
    series: pd.Series,
    test_months: int = 24
) -> dict:
    """
    Fit SARIMA model to monthly data with annual seasonality.

    Classic application: retail sales, energy consumption,
    subscription counts - any KPI with monthly/quarterly patterns.
    """
    train = series[:-test_months]
    test = series[-test_months:]

    print("Fitting SARIMA models for monthly data...")
    print("Seasonal period: 12 months\n")

    # Common SARIMA specifications for monthly data
    sarima_configs = [
        {'order': (1,1,1), 'seasonal_order': (1,1,1,12)},
        {'order': (0,1,1), 'seasonal_order': (0,1,1,12)},  # airline model
        {'order': (1,1,0), 'seasonal_order': (1,1,0,12)},
        {'order': (2,1,1), 'seasonal_order': (1,1,1,12)},
        {'order': (0,1,1), 'seasonal_order': (1,1,0,12)},
    ]

    results = []
    for config in sarima_configs:
        try:
            model = SARIMAX(
                train,
                order=config['order'],
                seasonal_order=config['seasonal_order'],
                enforce_stationarity=False,
                enforce_invertibility=False
            )
            fitted = model.fit(disp=False)

            # Forecast
            forecast = fitted.get_forecast(steps=test_months)
            y_pred = forecast.predicted_mean.values
            y_true = test.values

            rmse = np.sqrt(mean_squared_error(y_true, y_pred))
            results.append({
                'config': f"SARIMA{config['order']}{config['seasonal_order']}",
                'aic': fitted.aic,
                'bic': fitted.bic,
                'test_rmse': rmse,
                'fitted': fitted
            })
        except Exception as e:
            pass

    df = pd.DataFrame([{k: v for k, v in r.items() if k != 'fitted'}
                       for r in results]).sort_values('aic')

    print(f"{'Model':>40} | {'AIC':>8} | {'BIC':>8} | {'RMSE':>8}")
    print("-" * 72)
    for _, row in df.iterrows():
        print(f"{row['config']:>40} | {row['aic']:>8.2f} | {row['bic']:>8.2f} | {row['test_rmse']:>8.4f}")

    best_result = min(results, key=lambda x: x['aic'])
    return best_result

# Generate monthly sales data with seasonality
np.random.seed(0)
n_months = 120  # 10 years of monthly data
t = np.arange(n_months)

# Seasonal pattern + trend + AR(1) structure
seasonal = 50 * np.sin(2 * np.pi * t / 12) + 20 * np.sin(2 * np.pi * 2 * t / 12)
trend = 2 * t
ar_noise = np.zeros(n_months)
for i in range(1, n_months):
    ar_noise[i] = 0.7 * ar_noise[i-1] + np.random.normal(0, 15)

sales = 500 + trend + seasonal + ar_noise
sales_series = pd.Series(sales)

best = fit_sarima_monthly(sales_series, test_months=24)
print(f"\nBest model: {best['config']}, RMSE={best['test_rmse']:.4f}")

ARIMA vs Deep Learning: When Does Each Win?

def benchmark_arima_vs_naive(
    series: pd.Series,
    test_size: int = 24,
    arima_order: tuple = (1,1,1)
) -> pd.DataFrame:
    """
    Benchmark ARIMA against naive baselines.
    ARIMA should beat these baselines; if not, something is wrong.
    """
    train = series[:-test_size].values
    test = series[-test_size:].values

    results = {}

    # Naive: last value
    naive_last = np.full(test_size, train[-1])
    results['Naive (last value)'] = np.sqrt(mean_squared_error(test, naive_last))

    # Naive: seasonal (same period last year, for monthly data)
    if len(train) >= 12:
        naive_seasonal = train[-12:][:test_size] if test_size <= 12 else np.tile(train[-12:], 3)[:test_size]
        results['Naive (seasonal)'] = np.sqrt(mean_squared_error(test, naive_seasonal))

    # ARIMA
    try:
        model = ARIMA(train, order=arima_order)
        fitted = model.fit()
        arima_pred = fitted.forecast(steps=test_size)
        results['ARIMA'] = np.sqrt(mean_squared_error(test, arima_pred))
    except Exception as e:
        results['ARIMA'] = float('nan')

    # Exponential smoothing (simple)
    from statsmodels.tsa.holtwinters import ExponentialSmoothing
    hw_model = ExponentialSmoothing(train, trend='add', seasonal='add',
                                    seasonal_periods=12).fit(optimized=True)
    hw_pred = hw_model.forecast(test_size)
    results['Holt-Winters'] = np.sqrt(mean_squared_error(test, hw_pred))

    df = pd.DataFrame.from_dict(results, orient='index', columns=['RMSE'])
    df = df.sort_values('RMSE')

    print("Benchmark Results (RMSE):")
    print(f"{'Model':<30} | {'RMSE':>10}")
    print("-" * 45)
    for model_name, row in df.iterrows():
        print(f"{model_name:<30} | {row['RMSE']:>10.4f}")

    return df

benchmark_arima_vs_naive(sales_series)

Decision framework for ARIMA vs neural networks:

Factor	ARIMA wins	Neural net wins
Series length	< 200 observations	> 1000+ observations
Computational budget	Low (CPU, seconds)	High (GPU, hours)
Interpretability required	High (coefficients matter)	Low (black box OK)
Seasonality	Well-defined seasonal period	Complex, multi-scale
Exogenous variables	Few, linear relationships	Many, complex interactions
Distribution shift	Stable (in-distribution)	Adaptive models handle shift
Multiple series	Each series separate model	Single model generalizes

Auto-ARIMA: Automated Order Selection

def auto_arima_simple(
    series: pd.Series | np.ndarray,
    max_p: int = 5,
    max_q: int = 5,
    max_d: int = 2,
    criterion: str = 'aic',
    verbose: bool = True
) -> tuple:
    """
    Simple auto-ARIMA implementation: grid search with AIC/BIC.

    For production, use:
    - pmdarima.auto_arima (battle-tested, handles edge cases)
    - statsforecast.AutoARIMA (fastest implementation, Nixtla)
    """
    if isinstance(series, np.ndarray):
        series = pd.Series(series)

    # Determine d by ADF test
    d = 0
    current = series.values.copy()
    while d < max_d:
        _, adf_p, *_ = adfuller(current)
        if adf_p < 0.05:
            break
        current = np.diff(current)
        d += 1

    best_score = np.inf
    best_order = None
    best_model = None

    for p in range(max_p + 1):
        for q in range(max_q + 1):
            if p == 0 and q == 0:
                continue
            try:
                model = ARIMA(series, order=(p, d, q))
                fitted = model.fit(method='innovations_mle')
                score = fitted.aic if criterion == 'aic' else fitted.bic
                if score < best_score:
                    best_score = score
                    best_order = (p, d, q)
                    best_model = fitted
            except Exception:
                pass

    if verbose:
        print(f"Auto-ARIMA result: ARIMA{best_order}, {criterion.upper()}={best_score:.2f}")

    return best_order, best_model

order, model = auto_arima_simple(series)

Online Forecasting: Recursive Update

In production ML systems, you often need to update forecasts as new data arrives, without refitting from scratch.

def recursive_arima_forecast(
    initial_train: np.ndarray,
    new_observations: np.ndarray,
    order: tuple = (1, 1, 1),
    refit_every: int = 10
) -> np.ndarray:
    """
    Rolling walk-forward ARIMA forecast.

    Strategy 1: Refit the model every `refit_every` periods
    Strategy 2: Use model.apply() to update state without full refit (faster)

    In production systems:
    - Full refit: expensive but gets updated parameter estimates
    - State update (Kalman filter approach): O(1) per observation
    - Compromise: refit every N periods, state-update in between
    """
    all_data = list(initial_train)
    forecasts = []

    current_model = ARIMA(all_data, order=order).fit(method='innovations_mle')

    for i, y_new in enumerate(new_observations):
        # Forecast next step
        fc = current_model.forecast(steps=1)[0]
        forecasts.append(fc)

        # Add new observation
        all_data.append(y_new)

        # Refit periodically
        if (i + 1) % refit_every == 0:
            current_model = ARIMA(all_data, order=order).fit(method='innovations_mle')

    return np.array(forecasts)

# Run walk-forward evaluation
train_size = 200
initial = series.values[:train_size]
holdout = series.values[train_size:]

rolling_preds = recursive_arima_forecast(initial, holdout, order=best_order_aic)
rolling_rmse = np.sqrt(mean_squared_error(holdout, rolling_preds))
print(f"\nWalk-forward ARIMA RMSE: {rolling_rmse:.4f}")

Interview Questions

Q1: Explain ARIMA(p,d,q) - what does each component mean?

AR(p) - Autoregressive of order p: The current value $X_t$ depends on the last $p$ observed values. Each lag $X_{t-k}$ has a coefficient $\phi_k$ . Captures "momentum" and "mean reversion" patterns. PACF cuts off at lag $p$ .

I(d) - Integrated of order d: The series is differenced $d$ times to achieve stationarity. $d=1$ removes linear trends (the difference $X_t - X_{t-1}$ is stationary). $d=2$ removes quadratic trends. Most real-world series need $d=0$ or $d=1$ .

MA(q) - Moving Average of order q: The current value depends on the last $q$ noise shocks $\epsilon_{t-k}$ . Captures how shocks propagate through the series. ACF cuts off at lag $q$ .

Combined ARIMA(p,d,q): Take the $d$ -th difference of the series to get a stationary series, then fit ARMA(p,q) to it.

Practical identification:

Run ADF test → if non-stationary, difference once (d=1) and re-test
Look at ACF of differenced series: if cuts off at q → MA(q) component
Look at PACF of differenced series: if cuts off at p → AR(p) component
Confirm with AIC/BIC grid search

Q2: What is the difference between AIC and BIC, and which should you use?

Both penalize model complexity (number of parameters $k$ ) to prevent overfitting:

$\text{AIC} = -2\ln\hat{L} + 2k \qquad \text{BIC} = -2\ln\hat{L} + k\ln T$

BIC penalizes more for large $T$ (since $\ln T > 2$ when $T > 8$ ), so BIC selects simpler models than AIC when sample sizes are moderate to large.

AIC is optimal for forecasting: It minimizes one-step-ahead prediction error asymptotically (Akaike, 1974). If your goal is accurate forecasts, use AIC.

BIC is consistent: It selects the true model order as $T \to \infty$ (if the true model is in the candidate set). If you want to identify the true data-generating process, use BIC.

Practical rule of thumb:

AIC → when you care about forecast accuracy (common in ML)
BIC → when you care about model parsimony/interpretability
If AIC and BIC agree: high confidence in the selected model
If they disagree: AIC selects a larger model; compare both on a validation set

For small samples ( $T < 40k$ ), use AICc (corrected AIC) to avoid overfitting.

Q3: Why do you check residuals after fitting ARIMA? What patterns indicate a misspecified model?

After fitting ARIMA, the residuals should be white noise - no remaining temporal structure. If they're not, the model failed to capture some pattern.

What to check:

Ljung-Box test on residuals: Test $H_0$ : residuals are white noise. Reject ( $p < 0.05$ ) → model is misspecified.
Residual ACF and PACF: Should be insignificant at all lags.
- Significant spike at lag $k$ in ACF → add MA( $k$ ) term
- Significant spike at lag $k$ in PACF → add AR( $k$ ) term
- Spikes at seasonal lags (12, 24) → add seasonal SARIMA components
Residual mean ≈ 0: Non-zero mean → include intercept term.
Residual normality: For valid prediction intervals, residuals should be approximately normal. Non-normality doesn't affect point forecasts but invalidates confidence intervals.
Ljung-Box on squared residuals: Detects ARCH effects (heteroskedasticity). If significant → residuals have non-constant variance; fit GARCH after ARIMA.

Common misspecification patterns and fixes:

ACF residual spike at lag 1 → q too small, increase q
PACF residual spike at lag 2 → p too small, increase p
Slow ACF decay in residuals → d insufficient, increase d
Seasonal ACF spikes in residuals → need SARIMA $(P, D, Q)_s$
Residual variance increases over time → transform series (log), or fit GARCH

Q4: When should you use ARIMA vs a neural network for forecasting?

Use ARIMA when:

Short series (< 200 observations) - neural nets overfit
Need prediction intervals (ARIMA has exact distributional theory)
Interpretability required (AR/MA coefficients have clear meaning)
Fast inference / no GPU (ARIMA is CPU-only, milliseconds)
Strong simple seasonality (SARIMA handles this naturally)
Prototyping / establishing a baseline
Business stakeholders need to understand the model

Use neural networks when:

Long series (1000+ observations)
Multiple related series (one model generalizes across series)
Rich exogenous covariates (promotional variables, weather, macroeconomic)
Complex non-linear dynamics
Irregular seasonality or multiple seasonal periods
Transfer learning across series (pre-trained Chronos, TimeGPT)
When you've proven ARIMA can't capture the dynamics

The key discipline: Always fit ARIMA first and measure its performance. A neural network that underperforms ARIMA is a red flag - either the data doesn't have complex temporal structure worth learning, or the neural net is improperly configured.

In practice, ensembles often win: ARIMA captures the linear temporal structure; a gradient boosted model captures non-linear residuals.

Q5: What are unit roots and why do they matter for ARIMA?

A unit root occurs when the AR characteristic polynomial has a root exactly on the unit circle (modulus = 1). The simplest example is AR(1) with $\phi_1 = 1$ :

$X_t = X_{t-1} + \epsilon_t$

This is a random walk - a unit root process. It is non-stationary: its variance grows without bound over time ( $\text{Var}(X_t) = t\sigma^2$ ).

Why unit roots matter for ARIMA:

Stationarity requirement: Standard ARIMA assumes the differenced series is stationary. If the original series has a unit root, $d=1$ differencing removes it and gives a stationary ARMA process.
Hypothesis testing: The ADF (Augmented Dickey-Fuller) test tests $H_0$ : unit root present vs $H_1$ : stationary. You must reject $H_0$ before fitting ARMA. If $H_0$ is not rejected → difference first.
Spurious regression risk: Regressing two independent random walks (both with unit roots) produces statistically significant coefficients even though they're completely unrelated. This is why you cannot apply OLS to non-stationary series directly.
Integration order: A series with $d$ unit roots is $I(d)$ (integrated of order $d$ ). The ARIMA differencing parameter $d$ is the number of unit roots to remove.
Cointegration: If two $I(1)$ series have a linear combination that is $I(0)$ (stationary), they are cointegrated - they share a common stochastic trend. This is handled by VECM (Vector Error Correction Model), not ARIMA.

Key Takeaways

AR(p): current value depends on $p$ past values; MA(q): depends on $q$ past shocks; ARIMA(p,d,q): difference $d$ times then fit ARMA(p,q)
SARIMA $(p,d,q)(P,D,Q)_s$ handles seasonal patterns at period $s$
Box-Jenkins methodology: Identify (ADF + ACF/PACF) → Estimate (MLE) → Diagnose (residual ACF, Ljung-Box) → Forecast
Use AIC for forecasting accuracy, BIC for model parsimony
Residuals must be white noise - persistent residual ACF indicates model misspecification
Always fit ARIMA before neural networks - it's the baseline you must beat and explain

Next: Lesson 05: State Space Models →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the ARIMA Explorer demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Building Up to ARIMA​

1. Autoregressive Model - AR(p)​

2. Moving Average Model - MA(q)​

3. ARMA(p, q)​

4. Integration - ARIMA(p, d, q)​

SARIMA: Handling Seasonality​

Python Implementation​

The Box-Jenkins Methodology​

Model Selection: AIC vs BIC​

SARIMA in Practice: Monthly Retail Sales​

ARIMA vs Deep Learning: When Does Each Win?​

Auto-ARIMA: Automated Order Selection​

Online Forecasting: Recursive Update​

Interview Questions​

Key Takeaways​