Skip to main content

Time Series Forecasting Patterns

Reading time: 45 minutes | Interview relevance: Very High | Target roles: MLE, AI Engineer, Data Scientist, MLOps Engineer


The Alert at 3 AM

It is Black Friday eve. The demand forecasting model that has been running flawlessly for nine months just started producing nonsense. The predictions for tomorrow's electronics category are 40% below last year's actuals, and the warehouse team has already adjusted their staffing based on these numbers. Your VP of Operations is on a call demanding answers.

You pull up the model's training logs. A junior engineer, trying to speed up an experiment two weeks ago, switched the dataset split from a time-based cutoff to a random 80/20 split. The model scored a respectable 0.94 R-squared on validation and nobody noticed the problem. But "random 80/20" on time series means the model trained on data from the future to predict the past. It saw November demand signals while predicting October. The model learned patterns that only exist when you can cheat. Black Friday is the first real-world test where it cannot cheat, and it is failing spectacularly.

The fix takes four hours: revert to temporal split, retrain from scratch, redeploy. The ops team has to scramble to restore original staffing. The business takes a hit. And the root cause was a single line of code - train_test_split(X, y, test_size=0.2, random_state=42) - that felt safe because every classification tutorial uses exactly that pattern.

Time series forecasting is where standard ML intuitions break most catastrophically. The data is ordered. The future cannot inform the past. The patterns shift over time. The seasonality compounds with trend. And the evaluation methodology that works everywhere else will silently corrupt your results here. This lesson covers the forecasting patterns that actually work in production, the classical methods that still outperform deep learning in many regimes, the modern neural approaches for complex multivariate cases, and - most critically - the temporal validation methodology that separates practitioners who build reliable systems from those who build impressive demos.

By the end of this lesson, you will know how to decompose any time series, match the right forecasting method to the problem at hand, implement LSTM forecasting with proper temporal splitting in PyTorch, and answer every forecasting question that comes up in ML engineering interviews.


Why This Exists - Why Standard ML Fails on Time Series

Standard supervised learning assumes something that time series data fundamentally violates: that each sample is independent and identically distributed (i.i.d.). When you train an image classifier, the label for one photo has nothing to do with the label for the next photo in your dataset. Shuffle the dataset and nothing changes - the model trains just as well.

Time series data is the opposite. Each observation is explicitly dependent on previous observations. Tomorrow's temperature is strongly related to today's temperature. Next quarter's revenue depends on this quarter's revenue and the seasonal patterns from a year ago. The order of the data is not just metadata - it IS the signal.

This creates three categories of failure when you apply standard ML tooling to time series problems.

Data leakage through random splitting. When you randomly split time series data into train and test sets, observations from the test period get randomly distributed into the training set. Your model trains on November data while being evaluated on October data. It has seen the future. The evaluation metric looks great - but the model has memorized future patterns that cannot be available at actual inference time. This is data leakage. In production, the model encounters truly unseen future data and performs far worse than advertised.

Non-stationarity. A stationary time series has a constant mean and variance over time. Real-world time series are almost never stationary. Revenue grows. Seasonality shifts as customer behavior changes. The distribution of your target variable drifts year over year. Standard ML models trained on a stationary assumption will degrade as the distribution shifts - and in time series, distribution shift is the norm, not the exception.

Temporal dependencies require temporal context. Standard feature vectors are static snapshots. Time series models need to reason about sequences - what happened 7 days ago, 30 days ago, last year at this exact time. Standard ML models do not have an inductive bias for sequential reasoning. You need either engineered lag features (classical ML path) or architectures specifically designed to process sequences (RNN, LSTM, Transformer paths).

These three failure modes explain why forecasting has its own distinct literature, its own evaluation protocols, and its own set of specialized architectures. The rest of this lesson builds the tools to handle all three correctly.


Historical Context

Time series forecasting has a longer empirical history than most ML subfields. The forecasting community was rigorous about evaluation methodology - rolling-origin evaluation, multiple forecast horizons - decades before the ML community started worrying about dataset contamination.

1970 - Box-Jenkins and ARIMA. George Box and Gwilym Jenkins published "Time Series Analysis: Forecasting and Control" in 1970, introducing the ARIMA framework. The insight was that many time series can be characterized by three numbers: how many autoregressive terms (pp), how many differencing operations to achieve stationarity (dd), and how many moving average terms (qq). ARIMA(p,d,q)\text{ARIMA}(p,d,q) became the dominant forecasting framework for nearly five decades. It is still competitive on univariate monthly and quarterly series.

2017 - Facebook Prophet. Sean Taylor and Benjamin Letham at Facebook published "Forecasting at Scale" (2017), introducing Prophet. The innovation was not a new mathematical framework - Prophet decomposes trend, seasonality, and holidays using a Bayesian framework with intuitive parameters. The real contribution was operationalizability: automatic changepoint detection, human-readable parameters, and the ability to incorporate domain knowledge without PhD-level time series expertise. Prophet made robust forecasting accessible to data analysts. It is still widely used for business forecasting with weekly or daily granularity and strong seasonality.

2019 - N-BEATS. Boris Oreshkin and colleagues at Element AI published "N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting" (2019). N-BEATS demonstrated that pure deep learning - no ARIMA components, no domain-specific architecture choices - could match and beat classical methods at scale. The architecture uses residual stacks of fully connected layers with basis expansion. It won the M4 forecasting competition, which was a significant result because competition datasets favor methods that generalize across diverse series types.

2023 - PatchTST. Yuqi Nie and colleagues published "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers" (2023), introducing PatchTST. The key insight was that treating time series with channel independence (each variate forecasted independently) and patching (grouping consecutive time steps into patches, like image patches in ViT) dramatically improves Transformer forecasting performance. PatchTST became state-of-the-art on multiple long-horizon forecasting benchmarks, closing the gap between Transformers and the surprisingly strong Linear baseline (DLinear, 2023) that had challenged Transformer-based forecasting methods.

The trajectory from ARIMA to PatchTST shows a consistent pattern: each generation either extends applicability (more series types, longer horizons, multivariate inputs) or reduces the expertise required to deploy. The classical methods did not disappear - they remain competitive in their regimes. The modern landscape is a portfolio, not a replacement chain.


Core Decomposition: Trend, Seasonality, and Residuals

Before choosing a forecasting method, decompose the series. Decomposition reveals the structure of the time series and tells you which components your model needs to handle.

Every time series can be broken into three components:

Trend - the long-run direction of the series. Revenue growing at 15% per year. User count declining after a product change. Trend is the low-frequency, persistent directional movement.

Seasonality - repeating patterns tied to a calendar or fixed cycle. Retail sales spike in December. Energy consumption peaks in summer and winter. Website traffic drops on weekends. Seasonality has a fixed, known period.

Residuals - everything that is neither trend nor seasonality. Random shocks, one-off events, measurement noise. A good decomposition produces residuals that look like white noise - no structure remaining.

The decomposition can be additive or multiplicative:

  • Additive: use when the seasonal amplitude is constant regardless of the level.

yt=Tt+St+Rty_t = T_t + S_t + R_t

  • Multiplicative: use when the seasonal amplitude grows proportionally with the level. Log-transforming a multiplicative series converts it to additive.

yt=TtStRty_t = T_t \cdot S_t \cdot R_t

STL Decomposition

STL (Seasonal and Trend decomposition using Loess) is the practical decomposition workhorse, introduced by Cleveland et al. (1990). STL uses locally weighted regression (Loess) to iteratively estimate trend and seasonality, making it robust to outliers and flexible to non-linear trends - improvements over classical additive decomposition which assumes linear trend.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Generate a synthetic series: trend + seasonality + noise
np.random.seed(42)
n = 365 * 2 # 2 years of daily data
t = np.arange(n)

trend = 0.05 * t + 100 # linear growth
seasonality = 20 * np.sin(2 * np.pi * t / 365) # annual cycle
residual = np.random.normal(0, 3, n)

y = trend + seasonality + residual

dates = pd.date_range(start="2022-01-01", periods=n, freq="D")
series = pd.Series(y, index=dates, name="revenue")

# STL decomposition
stl = STL(series, period=365, robust=True)
result = stl.fit()

# The three components
trend_component = result.trend
seasonal_component = result.seasonal
residual_component = result.resid

print(f"Trend range: {trend_component.min():.1f} to {trend_component.max():.1f}")
print(f"Seasonal amplitude: {seasonal_component.max() - seasonal_component.min():.1f}")
print(f"Residual std dev: {residual_component.std():.2f}")

# Strength of seasonality: ratio of seasonal variance to total variance after detrending
var_seasonal = np.var(seasonal_component)
var_remainder = np.var(residual_component)
seasonal_strength = max(0, 1 - var_remainder / (var_seasonal + var_remainder))
print(f"Seasonal strength: {seasonal_strength:.3f}") # 0 to 1

The seasonal strength metric is critical for model selection. A strength above 0.6 indicates that a method which explicitly models seasonality (SARIMA, Prophet, seasonal naive baseline) will likely outperform a method that ignores it. A strength below 0.2 suggests the series is dominated by trend and noise - simpler methods may suffice.

:::tip Decompose First, Always Run STL decomposition before writing a single line of model code. It reveals whether you need seasonal handling, whether the trend is linear or exponential, and whether the residuals have any structure left (autocorrelation in residuals means your model is missing something). :::


Classical Methods: ARIMA and SARIMA

Classical methods remain highly competitive on univariate series with fewer than a few hundred observations, or when the series is well-characterized by linear autoregressive structure. They are also interpretable, fast to train, and have tight prediction intervals that are statistically valid under model assumptions.

Intuition Behind ARIMA

ARIMA combines three ideas:

Autoregression (AR) - predict the next value as a weighted linear combination of the past pp values:

yt=c+ϕ1yt1+ϕ2yt2++ϕpytp+εty_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t

This captures the inertia of a series - tomorrow's temperature is strongly related to today's.

Integration (I) - differencing the series dd times to achieve stationarity. First-order differencing computes:

yt=ytyt1y'_t = y_t - y_{t-1}

Most real-world series become stationary after 1 or 2 differences.

Moving Average (MA) - predict the next value as a function of the past qq forecast errors:

yt=μ+εt+θ1εt1++θqεtqy_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}

This captures the impact of shocks that decay over time.

ARIMA(p,d,q)\text{ARIMA}(p, d, q) combines all three. The art is in choosing p,d,qp, d, q - traditionally done using ACF (Autocorrelation Function) and PACF (Partial ACF) plots. The auto_arima function in the pmdarima library automates this via information criteria (AIC, BIC).

SARIMA for Seasonal Series

SARIMA extends ARIMA with a seasonal component: SARIMA(p,d,q)(P,D,Q,m)\text{SARIMA}(p, d, q)(P, D, Q, m) where:

  • mm is the seasonal period (12 for monthly data with annual seasonality, 7 for daily with weekly seasonality)
  • P,D,QP, D, Q are the seasonal equivalents of p,d,qp, d, q - they operate on lags that are multiples of mm
import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_percentage_error

# Generate monthly seasonal series
np.random.seed(0)
n_months = 60 # 5 years of monthly data
t = np.arange(n_months)
trend = 0.8 * t + 200
seasonality = 30 * np.sin(2 * np.pi * t / 12)
noise = np.random.normal(0, 5, n_months)
monthly_series = trend + seasonality + noise

dates = pd.date_range(start="2020-01-01", periods=n_months, freq="MS")
series = pd.Series(monthly_series, index=dates)

# Temporal split: last 12 months as test set
train = series.iloc[:-12]
test = series.iloc[-12:]

# Fit SARIMA(1,1,1)(1,1,1,12) - a common starting configuration for monthly data
model = SARIMAX(
train,
order=(1, 1, 1),
seasonal_order=(1, 1, 1, 12),
enforce_stationarity=False,
enforce_invertibility=False
)
fitted = model.fit(disp=False)

# Forecast
forecast = fitted.forecast(steps=12)
mape = mean_absolute_percentage_error(test, forecast) * 100

print(f"SARIMA MAPE on test set: {mape:.2f}%")
print(f"AIC: {fitted.aic:.2f}")
print(f"\nForecast vs Actual (last 12 months):")
for date, pred, actual in zip(test.index, forecast, test):
print(f" {date.strftime('%Y-%m')}: pred={pred:.1f}, actual={actual:.1f}, error={abs(pred-actual)/actual*100:.1f}%")

When to Use ARIMA / SARIMA

ARIMA family methods are the right choice when:

  • The series is univariate (one variable to forecast)
  • You have fewer than roughly 500 observations - deep learning overfits badly in this regime
  • The series shows clear linear autocorrelation structure
  • Interpretability matters - ARIMA coefficients have statistical meaning
  • You need statistically valid confidence intervals, not just point forecasts

When the number of series is large (thousands of SKUs, millions of users), ARIMA requires fitting one model per series, which becomes computationally expensive. This is where global deep learning models that train across many series simultaneously become valuable.


Deep Learning Approaches

Deep learning forecasters are not universally better than ARIMA. They win on specific problem types: long input and output horizons, multivariate inputs with complex cross-variable interactions, and when training data spans thousands of related series that allow a global model to extract shared patterns.

LSTM Forecaster

LSTM (Long Short-Term Memory) is the foundational deep learning approach for univariate and multivariate time series. The architecture processes input sequences one step at a time, maintaining a hidden state and cell state that act as memory. An LSTM forecaster takes a window of past observations as input and predicts future values.

The typical pattern is encoder-decoder: an LSTM encoder processes the historical window, and either the final hidden state is passed to a linear output head (direct multi-step forecast) or a decoder LSTM generates predictions autoregressively.

Full PyTorch implementation is in the implementation section below.

N-BEATS

N-BEATS (Neural Basis Expansion Analysis for Time Series, Oreshkin et al. 2019) is a pure MLP architecture with a clever residual structure. Each block in the network:

  1. Receives the backcast window (historical data)
  2. Produces a "backcast" - what the block explains about the past
  3. Produces a "forecast" - the block's contribution to the future prediction
  4. Passes the residual (unexplained past) to the next block

By stacking blocks organized into stacks (one stack for trend, one for seasonality), N-BEATS decomposes the forecast into interpretable components while allowing end-to-end gradient learning. The key insight: basis expansion forces each block to predict using a specific functional form (polynomial for trend, Fourier terms for seasonality), making the decomposition interpretable without manual feature engineering.

N-BEATS won the M4 competition (2018) and remains one of the strongest single-series univariate forecasters.

Temporal Fusion Transformer (TFT)

The Temporal Fusion Transformer (Bryan Lim et al., Google, 2021) is the gold standard for multi-horizon, multivariate forecasting when you have multiple related time series and various types of inputs:

  • Past observed inputs - historical values of the target and related variables
  • Static covariates - time-invariant features (store size, product category)
  • Future known inputs - calendar features, promotions planned ahead

TFT uses:

  • Variable selection networks - learn which inputs are relevant at each timestep
  • Gating mechanisms - skip connections that allow the model to ignore irrelevant components
  • Multi-head attention - attend to important past timesteps with interpretable attention weights
  • Quantile regression - output prediction intervals, not just point forecasts

TFT is production-grade at scale: Google uses it for traffic forecasting, Uber for trip demand, Amazon for inventory planning. It requires substantially more data than ARIMA (typically 10,000 or more observations across series) but produces significantly better results on complex multivariate problems.


Critical Concept: Train/Val/Test Split for Time Series

This is the most important section in this lesson. Get this wrong and everything else is irrelevant.

The fundamental rule: time must always flow forward. Your model must never see data from a future time period while training or during hyperparameter selection.

Why Random Split Is Data Leakage

Consider a series of daily sales from 2022-01-01 to 2024-12-31. If you use train_test_split(X, y, test_size=0.2, random_state=42):

  • 20% of the days (randomly scattered throughout the 3-year period) go into the test set
  • The model trains on days from December 2024 to predict days from January 2022
  • The model learns from context windows that contain future data
  • Features like "week of year," "month," and lag features from surrounding days leak future information

The model's test performance is an artifact of leakage, not genuine predictive ability. You will see this in practice as the model performing beautifully in backtesting but failing badly when deployed - because deployment is the first time it actually has to predict the future without having seen it.

The Three Correct Splits

Hold-out (Fixed Origin) Split - the simplest correct approach. Train on everything before cutoff date, validate on the next window, test on the final window. Periods are non-overlapping and always in chronological order.

|-------- train --------|--- val ---|-- test --|
Jan 2022 Dec 2023 Mar 2024 Jun 2024

Walk-Forward Validation (Rolling Origin) - the rigorous approach. The training window expands (or rolls) and the validation window moves forward, simulating the actual online deployment scenario. Each "fold" is a separate forecasting origin.

Train: Jan-Mar | Val: Apr
Train: Jan-Apr | Val: May
Train: Jan-May | Val: Jun
Train: Jan-Jun | Val: Jul
...and so on

This matches how the model will actually be used: retrained periodically with all available data up to the current date, then used to forecast the next period.

Blocked Cross-Validation - used when you want to use sklearn-style cross-validation tools. Split the series into blocks, use all combinations where training blocks come before validation blocks.

Walk-forward validation gives the most realistic picture of deployment performance and is the standard in competitive forecasting (Makridakis M competitions). It is computationally expensive - you fit one model per fold - but the performance estimate is reliable.


NumPy From Scratch: Sliding Window Dataset and Walk-Forward Validation

Sliding Window Dataset Creation

import numpy as np

def create_sliding_windows(series: np.ndarray, lookback: int, horizon: int):
"""
Convert a 1D time series into supervised learning format using sliding windows.

Parameters
----------
series : 1D numpy array of time series values
lookback : number of past timesteps to use as features (input window)
horizon : number of future timesteps to predict (output window)

Returns
-------
X : shape (n_samples, lookback) -- input windows
y : shape (n_samples, horizon) -- target windows
"""
n = len(series)
n_samples = n - lookback - horizon + 1

if n_samples <= 0:
raise ValueError(
f"Series length {n} is too short for lookback={lookback} + horizon={horizon}"
)

X = np.zeros((n_samples, lookback))
y = np.zeros((n_samples, horizon))

for i in range(n_samples):
X[i] = series[i : i + lookback]
y[i] = series[i + lookback : i + lookback + horizon]

return X, y


# Example: create dataset from synthetic series
np.random.seed(42)
n = 500
t = np.arange(n)
series = 0.1 * t + 50 + 15 * np.sin(2 * np.pi * t / 52) + np.random.normal(0, 2, n)

lookback = 52 # 1 year of weekly data
horizon = 13 # forecast next quarter

X, y = create_sliding_windows(series, lookback, horizon)
print(f"Series length: {n}")
print(f"Dataset shape -- X: {X.shape}, y: {y.shape}")
# X: (435, 52) y: (435, 13)

Vectorized Version (No Loop, Faster for Large Series)

def create_sliding_windows_fast(series: np.ndarray, lookback: int, horizon: int):
"""Vectorized sliding window using numpy stride tricks -- much faster for large series."""
from numpy.lib.stride_tricks import sliding_window_view

total_window = lookback + horizon
windows = sliding_window_view(series, total_window) # shape: (n - total_window + 1, total_window)

X = windows[:, :lookback]
y = windows[:, lookback:]
return X, y


# Verify it matches the loop version
X_fast, y_fast = create_sliding_windows_fast(series, lookback, horizon)
assert np.allclose(X, X_fast), "Results do not match!"
print("Vectorized version matches loop version.")
print(f"X shape: {X_fast.shape}")

Walk-Forward Validation From Scratch

def walk_forward_splits(
n: int,
lookback: int,
horizon: int,
n_folds: int,
gap: int = 0
):
"""
Generate walk-forward validation splits for time series.

Each fold produces a (train_indices, val_indices) pair where
- train_indices are all indices available up to the fold cutoff
- val_indices are the next `horizon` indices after the gap

Parameters
----------
n : total length of the time series (number of observations)
lookback : minimum training size (first fold has exactly this many training points)
horizon : length of each validation window
n_folds : number of folds to generate
gap : number of timesteps between end of training and start of validation
(use this to prevent leakage when features use future values)

Returns
-------
List of (train_indices, val_indices) tuples
"""
splits = []

for fold in range(n_folds - 1, -1, -1):
# val window ends at the last position for this fold
val_end = n - fold * horizon # exclusive index
val_start = val_end - horizon # inclusive index
train_end = val_start - gap # where training ends

if train_end < lookback:
print(f" Fold {n_folds - fold}: skipped -- not enough training data")
continue

train_indices = np.arange(0, train_end)
val_indices = np.arange(val_start, val_end)

splits.append((train_indices, val_indices))

return splits


# Demonstrate walk-forward splits on our series
splits = walk_forward_splits(
n=len(series),
lookback=100,
horizon=13,
n_folds=5,
gap=0
)

print(f"\nWalk-forward validation: {len(splits)} folds")
for i, (train_idx, val_idx) in enumerate(splits):
print(
f" Fold {i+1}: train t=0..{train_idx[-1]} "
f"({len(train_idx)} points) | "
f"val t={val_idx[0]}..{val_idx[-1]} "
f"({len(val_idx)} points)"
)

Running Walk-Forward Validation with a Simple Baseline

from sklearn.metrics import mean_absolute_error

def seasonal_naive_forecast(train: np.ndarray, horizon: int, period: int = 52) -> np.ndarray:
"""
Seasonal naive: forecast = value from same point in the previous season.
A strong baseline that many complex models fail to beat consistently.
"""
forecast = np.zeros(horizon)
for h in range(horizon):
# Index into train at position: most recent season's matching point
idx = len(train) - period + (h % period)
if idx < 0:
idx = h % period # fallback if series is shorter than one period
forecast[h] = train[idx]
return forecast


# Evaluate seasonal naive via walk-forward validation
maes = []
for fold_num, (train_idx, val_idx) in enumerate(splits):
train_fold = series[train_idx]
val_fold = series[val_idx]

forecast = seasonal_naive_forecast(train_fold, horizon=len(val_fold), period=52)
mae = mean_absolute_error(val_fold, forecast)
maes.append(mae)
print(f"Fold {fold_num+1}: MAE = {mae:.3f}")

print(f"\nMean MAE across folds: {np.mean(maes):.3f} (+/- {np.std(maes):.3f})")

:::note Baseline First Always compute a seasonal naive baseline before training any model. If your LSTM cannot beat seasonal naive, you have a training problem, not a data problem. Seasonal naive is a surprisingly strong benchmark - it won the M3 competition in several categories. :::


PyTorch Implementation: LSTM Forecaster with Proper Temporal Splitting

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error

# ------------------------------------------------------------------
# 1. Dataset
# ------------------------------------------------------------------

class TimeSeriesDataset(Dataset):
"""
Sliding window dataset for single-step or multi-step forecasting.
Each sample is (input_window, target_window).
"""

def __init__(self, series: np.ndarray, lookback: int, horizon: int):
super().__init__()
self.lookback = lookback
self.horizon = horizon

n_samples = len(series) - lookback - horizon + 1
assert n_samples > 0, "Series too short for these lookback/horizon settings"

self.X = torch.zeros(n_samples, lookback, 1, dtype=torch.float32)
self.y = torch.zeros(n_samples, horizon, dtype=torch.float32)

for i in range(n_samples):
self.X[i, :, 0] = torch.tensor(series[i : i + lookback], dtype=torch.float32)
self.y[i] = torch.tensor(series[i + lookback : i + lookback + horizon], dtype=torch.float32)

def __len__(self):
return len(self.X)

def __getitem__(self, idx):
return self.X[idx], self.y[idx]


# ------------------------------------------------------------------
# 2. Model
# ------------------------------------------------------------------

class LSTMForecaster(nn.Module):
"""
LSTM-based forecaster.

Architecture:
- Stacked LSTM encoder processes the input window
- Final hidden state passed to a linear projection head
- Output: direct multi-step forecast (all horizon steps at once)

Direct multi-step (seq2val) is simpler and often more accurate than
autoregressive decoding for medium horizons (up to ~30 steps).
"""

def __init__(
self,
input_size: int = 1,
hidden_size: int = 64,
num_layers: int = 2,
horizon: int = 13,
dropout: float = 0.2
):
super().__init__()
self.hidden_size = hidden_size
self.num_layers = num_layers

self.lstm = nn.LSTM(
input_size=input_size,
hidden_size=hidden_size,
num_layers=num_layers,
batch_first=True, # input shape: (batch, seq_len, features)
dropout=dropout if num_layers > 1 else 0.0
)

# Layer norm improves stability for time series
self.layer_norm = nn.LayerNorm(hidden_size)

# Output head: hidden state -> forecast horizon
self.fc = nn.Sequential(
nn.Linear(hidden_size, hidden_size // 2),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(hidden_size // 2, horizon)
)

def forward(self, x: torch.Tensor) -> torch.Tensor:
"""
x shape: (batch_size, lookback, input_size)
returns: (batch_size, horizon)
"""
# LSTM output: (batch, seq_len, hidden_size), (h_n, c_n)
lstm_out, (h_n, c_n) = self.lstm(x)

# Take the last layer's hidden state as the sequence encoding
# h_n shape: (num_layers, batch, hidden_size)
last_hidden = h_n[-1] # shape: (batch, hidden_size)
last_hidden = self.layer_norm(last_hidden)

# Project to forecast horizon
forecast = self.fc(last_hidden) # shape: (batch, horizon)
return forecast


# ------------------------------------------------------------------
# 3. Training Loop
# ------------------------------------------------------------------

def train_epoch(model, loader, optimizer, criterion, device):
model.train()
total_loss = 0.0
for X_batch, y_batch in loader:
X_batch = X_batch.to(device)
y_batch = y_batch.to(device)

optimizer.zero_grad()
pred = model(X_batch)
loss = criterion(pred, y_batch)
loss.backward()

# Gradient clipping is important for LSTM stability
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
total_loss += loss.item() * len(X_batch)

return total_loss / len(loader.dataset)


@torch.no_grad()
def eval_epoch(model, loader, criterion, device):
model.eval()
total_loss = 0.0
for X_batch, y_batch in loader:
X_batch = X_batch.to(device)
y_batch = y_batch.to(device)
pred = model(X_batch)
loss = criterion(pred, y_batch)
total_loss += loss.item() * len(X_batch)
return total_loss / len(loader.dataset)


# ------------------------------------------------------------------
# 4. Full Training Run with Temporal Split
# ------------------------------------------------------------------

def run_lstm_forecasting():
# --- Generate data ---
np.random.seed(42)
torch.manual_seed(42)
n = 1000
t = np.arange(n)
series = (
0.05 * t
+ 50
+ 20 * np.sin(2 * np.pi * t / 52) # annual seasonality (weekly data)
+ 5 * np.sin(2 * np.pi * t / 4) # monthly cycle
+ np.random.normal(0, 2, n)
)

# --- Temporal split (NEVER random split) ---
test_size = 52 # last year as test
val_size = 52 # year before as validation
lookback = 104 # 2 years of context
horizon = 13 # forecast 1 quarter ahead

train_end = n - test_size - val_size
val_end = n - test_size

train_raw = series[:train_end]
val_raw = series[train_end - lookback : val_end] # include lookback context
test_raw = series[val_end - lookback :] # include lookback context

print(f"Train: {len(train_raw)} points | Val context: {len(val_raw)} | Test context: {len(test_raw)}")

# --- Normalization (fit ONLY on training data) ---
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_raw.reshape(-1, 1)).flatten()
val_scaled = scaler.transform(val_raw.reshape(-1, 1)).flatten()
test_scaled = scaler.transform(test_raw.reshape(-1, 1)).flatten()

# --- Datasets ---
train_ds = TimeSeriesDataset(train_scaled, lookback, horizon)
val_ds = TimeSeriesDataset(val_scaled, lookback, horizon)
test_ds = TimeSeriesDataset(test_scaled, lookback, horizon)

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=64, shuffle=False)
test_loader = DataLoader(test_ds, batch_size=64, shuffle=False)

print(f"Train samples: {len(train_ds)} | Val samples: {len(val_ds)} | Test samples: {len(test_ds)}")

# --- Model ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMForecaster(
input_size=1,
hidden_size=64,
num_layers=2,
horizon=horizon,
dropout=0.2
).to(device)

criterion = nn.HuberLoss(delta=1.0) # robust to outliers vs MSE
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
optimizer, mode="min", factor=0.5, patience=5
)

# --- Training ---
best_val_loss = float("inf")
best_state = None
patience_count = 0
early_stop_patience = 15

for epoch in range(100):
train_loss = train_epoch(model, train_loader, optimizer, criterion, device)
val_loss = eval_epoch(model, val_loader, criterion, device)
scheduler.step(val_loss)

if val_loss < best_val_loss:
best_val_loss = val_loss
best_state = {k: v.clone() for k, v in model.state_dict().items()}
patience_count = 0
else:
patience_count += 1

if (epoch + 1) % 10 == 0:
lr = optimizer.param_groups[0]['lr']
print(f"Epoch {epoch+1:3d}: train_loss={train_loss:.4f} val_loss={val_loss:.4f} lr={lr:.6f}")

if patience_count >= early_stop_patience:
print(f"Early stopping at epoch {epoch+1}")
break

# --- Evaluation on held-out test set ---
model.load_state_dict(best_state)
model.eval()

all_preds = []
all_targets = []

with torch.no_grad():
for X_batch, y_batch in test_loader:
pred = model(X_batch.to(device)).cpu().numpy()
all_preds.append(pred)
all_targets.append(y_batch.numpy())

preds = np.concatenate(all_preds)
targets = np.concatenate(all_targets)

# Inverse scale to original units
preds_orig = scaler.inverse_transform(preds.reshape(-1, 1)).reshape(preds.shape)
targets_orig = scaler.inverse_transform(targets.reshape(-1, 1)).reshape(targets.shape)

mae = mean_absolute_error(targets_orig.flatten(), preds_orig.flatten())
mape = np.mean(np.abs((targets_orig - preds_orig) / (targets_orig + 1e-8))) * 100

print(f"\nTest MAE: {mae:.3f}")
print(f"Test MAPE: {mape:.2f}%")
return model, scaler


if __name__ == "__main__":
model, scaler = run_lstm_forecasting()

Forecasting Pipeline


Production Engineering Notes

Feature Engineering for Time Series

Feature engineering dramatically impacts model performance, especially for gradient boosting approaches (LightGBM, XGBoost are very strong time series forecasters when combined with rich lag features).

Lag features - the value at t-1, t-2, ..., t-k. Use lags that match the seasonality period (lag 7 for daily data with weekly seasonality, lag 52 for weekly data with annual seasonality). These are the most impactful features.

Rolling statistics - rolling mean, rolling std, rolling min/max over windows of 7, 14, 30 days. Capture local trend and volatility.

import pandas as pd
import numpy as np

def create_time_features(df: pd.DataFrame, target_col: str) -> pd.DataFrame:
"""
Add calendar and lag features to a time series dataframe.
df must have a DatetimeIndex sorted in ascending order.
target_col is the name of the column being forecasted.
"""
df = df.copy()
assert df.index.dtype == "datetime64[ns]", "Index must be a DatetimeIndex"

# Calendar features -- cyclically encoded to preserve continuity
# Raw integer encoding (e.g., day 0=Mon, day 6=Sun) has a discontinuity
# at the wrap-around. Sin/cos encoding closes the loop.
df["day_of_week_sin"] = np.sin(2 * np.pi * df.index.dayofweek / 7)
df["day_of_week_cos"] = np.cos(2 * np.pi * df.index.dayofweek / 7)
df["month_sin"] = np.sin(2 * np.pi * df.index.month / 12)
df["month_cos"] = np.cos(2 * np.pi * df.index.month / 12)
week_of_year = df.index.isocalendar().week.astype(int)
df["week_of_year_sin"] = np.sin(2 * np.pi * week_of_year / 52)
df["week_of_year_cos"] = np.cos(2 * np.pi * week_of_year / 52)

# Lag features
for lag in [1, 7, 14, 28, 365]:
df[f"lag_{lag}"] = df[target_col].shift(lag)

# Rolling statistics (computed on shifted values to prevent leakage)
# shift(1) ensures we use only past values, not the current observation
for window in [7, 14, 28]:
shifted = df[target_col].shift(1)
df[f"rolling_mean_{window}"] = shifted.rolling(window).mean()
df[f"rolling_std_{window}"] = shifted.rolling(window).std()
df[f"rolling_max_{window}"] = shifted.rolling(window).max()

# Difference features (first-order and seasonal)
df["diff_1"] = df[target_col].diff(1)
df["diff_7"] = df[target_col].diff(7)

# Drop rows with NaN lag values (early rows without enough history)
df = df.dropna()

return df

:::warning Cyclic Encoding for Calendar Features Never encode day-of-week as a raw integer (0-6). The model sees a gap of 5 between Monday (0) and Saturday (5) - but Saturday to Sunday (5 to 6) and Sunday to Monday (6 to 0) wrap around. Sine/cosine encoding preserves the circular structure so the model understands that Sunday and Monday are adjacent. :::

Handling Missing Data

Time series with missing values require careful handling - forward-filling blindly is often wrong.

Completely at random (MCAR) - short gaps (1-3 periods) can be linearly interpolated without introducing significant bias. Use pd.Series.interpolate(method='time').

At regular intervals - if data is missing every weekend (e.g., stock prices), this is structural missingness. Do not impute - model it as a 5-day series or use business-day-aware indexing.

Long gaps - gaps longer than one seasonal period mean you cannot rely on the series before the gap to predict after it. Consider treating post-gap data as a new series or using global model priors.

def handle_missing_values(series: pd.Series, max_gap_to_interpolate: int = 3) -> pd.Series:
"""
Impute short gaps by linear interpolation.
Longer gaps are left as NaN for the caller to handle explicitly.
"""
is_null = series.isnull()
# Compute the length of each consecutive NaN run
gap_lengths = is_null.astype(int).groupby((~is_null).cumsum()).transform("sum")

interpolated = series.copy()
short_gap_mask = is_null & (gap_lengths <= max_gap_to_interpolate)

if short_gap_mask.any():
interpolated = interpolated.interpolate(method="time", limit=max_gap_to_interpolate)
# Restore long gaps as NaN (interpolate may overshoot across long gaps)
long_gap_mask = is_null & (gap_lengths > max_gap_to_interpolate)
interpolated[long_gap_mask] = np.nan

n_short = short_gap_mask.sum()
n_long = (is_null & ~short_gap_mask).sum()
print(f"Imputed {n_short} short-gap values. Left {n_long} long-gap values as NaN.")
return interpolated

Cold Start Problem

The cold start problem: a new product, store, or user has no historical data. How do you forecast demand when there is no series to learn from?

Strategy 1: Global model with static features. Train a global model (TFT, LightGBM) on all existing series. For new items, provide static covariates (category, brand, price point) and short-horizon lags from whatever data is available (days or weeks of data). The global model transfers patterns from similar items.

Strategy 2: Cluster-based initialization. Cluster existing series by shape and pattern. Assign new items to the most similar cluster. Use the cluster's representative series as the initial forecast, then update as the new series accumulates data.

Strategy 3: Hierarchical reconciliation. If the new item belongs to a known category, use the category-level forecast distributed proportionally based on item characteristics. Reconcile bottom-up when item data accumulates.

Prediction Intervals

Point forecasts alone are dangerous for decision-making. A demand forecast of 500 units with 80% prediction interval [400, 600] drives very different inventory decisions than [200, 800].

Conformal prediction (distribution-free, valid coverage guarantees):

def compute_conformal_intervals(
calibration_residuals: np.ndarray,
point_forecasts: np.ndarray,
coverage: float = 0.90
) -> tuple:
"""
Compute conformal prediction intervals.

Requires a calibration set of residuals: actual minus predicted.
Returns (lower_bounds, upper_bounds) with at least `coverage` empirical coverage.
"""
alpha = 1 - coverage
# Quantile of |residual| at the (1 - alpha) level
q = np.quantile(np.abs(calibration_residuals), 1 - alpha)

lower = point_forecasts - q
upper = point_forecasts + q

print(f"Conformal interval half-width: {q:.3f}")
print(f"Coverage target: {coverage:.0%} | Interval: [pred - {q:.2f}, pred + {q:.2f}]")
return lower, upper

Quantile regression (model-native, more flexible):

Train the model with pinball loss instead of MSE. For 90% coverage, train two models: one predicting the 5th percentile (alpha=0.05) and one predicting the 95th percentile (alpha=0.95). The TFT natively supports quantile outputs.

def pinball_loss(y_pred: torch.Tensor, y_true: torch.Tensor, alpha: float) -> torch.Tensor:
"""
Pinball (quantile) loss for quantile regression.
alpha = 0.05 trains a model to predict the 5th percentile.
alpha = 0.95 trains a model to predict the 95th percentile.
Together they form a 90% prediction interval.
"""
errors = y_true - y_pred
loss = torch.where(errors >= 0, alpha * errors, (alpha - 1) * errors)
return loss.mean()

Common Mistakes

:::danger Random Train/Test Split on Time Series This is the number one mistake in applied time series work and it is extremely common because every ML tutorial uses train_test_split. On time series, a random split is data leakage. The model sees future data during training. Validation metrics look great. Deployment performance is poor.

Always split by time. The training set is everything before a cutoff date. Validation is the next chronological window. Test is the final window. Never shuffle. :::

:::danger Normalizing Before Splitting Fitting the scaler (StandardScaler, MinMaxScaler) on the full dataset before splitting leaks the mean and standard deviation of the test period into the training normalization. Fit the scaler ONLY on the training set. Apply (transform only) to validation and test.

# WRONG -- leaks test set statistics into normalization
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(X_all)
X_train, X_test = temporal_split(X_all_scaled)

# CORRECT -- fit only on training data, then transform the rest
X_train, X_test = temporal_split(X_all)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # transform only, not fit_transform

:::

:::danger Using Future Features as Inputs Features like "next week's promotion" or "number of days until holiday" are only valid if they are genuinely known at inference time. If your model uses a feature that is derived from future observations, it will fail in production when that information is unavailable. Every input feature must pass this test: "Would I know this value at the time I make the prediction?" :::

:::warning Ignoring the Baseline Before tuning your LSTM, compute seasonal naive and simple exponential smoothing. If your LSTM does not substantially beat naive, something is wrong with training, architecture, or data. Complex models failing to beat simple baselines is common and embarrassing at production review. :::

:::warning Single-Point Evaluation Evaluating on a single test window is high variance - you might have picked a lucky or unlucky test period. Use walk-forward validation with multiple folds. Report mean and standard deviation of error across folds, not a single number. :::

:::warning Evaluating on Log-Transformed Series If you log-transform the target for training (common to handle exponential growth), remember to back-transform predictions before computing metrics. MAPE computed on log scale is not the same as MAPE on the original scale, and the business does not care about log-scale errors. :::


Interview Q&A

Q1: What is the key difference between time series train/test split and standard ML train/test split?

Answer: In standard ML, training and test samples are i.i.d., so random shuffling does not break any assumptions. You can safely use train_test_split with shuffle=True.

In time series, data has temporal ordering and temporal dependencies. Shuffling introduces data leakage in two ways. First, observations from the test period appear in the training set, letting the model observe future patterns. Second, lag features and rolling statistics computed on shuffled data will include values from what would have been "future" timesteps relative to the prediction point.

The correct approach is always a temporal split: all observations before a cutoff date go into training, observations after the cutoff go into test. Walk-forward (rolling origin) validation extends this to produce multiple folds, each with training data preceding the validation window.

The practical impact is large. In many real-world cases, switching from random split to temporal split changes the reported validation MAPE from a single-digit figure to something substantially worse - the random split reports artificially inflated accuracy. The temporal split gives the honest estimate of real-world performance.


Q2: When would you choose SARIMA over an LSTM for a forecasting problem?

Answer: SARIMA wins over LSTM in several common scenarios.

Short series: LSTM requires sufficient data to learn robust patterns through gradient descent. On series with fewer than 200-500 observations, LSTM overfits. SARIMA's parameter count is small (p+d+q+P+D+Qp+d+q+P+D+Q, typically under 10 parameters), making it well-suited to short histories.

Univariate with strong linear autocorrelation: When ACF and PACF plots show clear exponential decay (AR behavior) or abrupt cutoffs (MA behavior), ARIMA models the data generating process correctly. LSTM is a flexible function approximator but does not have this structural bias - it works harder to learn what ARIMA gets by design.

Need for confidence intervals: ARIMA provides statistically valid prediction intervals under model assumptions. LSTM point forecasts require additional machinery (conformal prediction, quantile regression, Monte Carlo dropout) to produce uncertainty estimates.

Interpretability: ARIMA coefficients have economic interpretations. AR coefficients tell you how strongly the series depends on its own past. This is valuable when explaining forecast behavior to business stakeholders.

The practical heuristic: if you have fewer than 500 observations and the series is univariate, start with SARIMA. If you have thousands of observations or multiple related series with complex covariates, move to LSTM or TFT.


Q3: Explain STL decomposition and when you would use it.

Answer: STL (Seasonal and Trend decomposition using Loess, Cleveland et al. 1990) decomposes a time series into three components: trend, seasonal, and residual. It works by applying locally weighted regression (Loess smoothing) iteratively - first estimating the trend, removing it, estimating the seasonal pattern from the de-trended series, and repeating until convergence.

STL has two major advantages over classical decomposition methods like X-11 or Census X-13. First, it handles any seasonal period - not just monthly (period=12) or quarterly (period=4), but also daily with weekly seasonality (period=7) or weekly with annual seasonality (period=52). Second, it is robust to outliers through a re-weighting scheme that reduces the influence of anomalous observations on the trend and seasonal estimates.

You would use STL in three situations. Before model selection: decompose the series to measure seasonal strength and trend character, which determines which model family is appropriate. As a preprocessing step for classical methods: some practitioners fit ARIMA only on the STL residuals (which are closer to stationary), then add back the trend and seasonal components at forecast time - this is called MSTL + ARIMA. And for anomaly detection: STL residuals that are unusually large relative to their historical distribution indicate anomalous observations worth investigating.

The seasonal strength metric is particularly valuable:

Seasonal Strength=max ⁣(0, 1Var(Rt)Var(St+Rt))\text{Seasonal Strength} = \max\!\left(0,\ 1 - \frac{\text{Var}(R_t)}{\text{Var}(S_t + R_t)}\right)

Values above 0.6 indicate strong, predictable seasonality that the model must explicitly handle.


Q4: What is walk-forward validation and why is it preferred over a single hold-out set for time series?

Answer: Walk-forward validation (also called rolling origin evaluation) generates multiple train/validation pairs where the validation window advances forward in time for each fold. In the expanding window variant, the training set grows with each fold. In the rolling window variant, both the training start and end advance together.

The motivation: a single hold-out test set produces one error estimate, which is highly sensitive to whether the held-out period happened to be particularly easy or hard to forecast. A single number does not tell you whether the model degrades over time (which would indicate concept drift) or whether it performs consistently.

Walk-forward validation simulates the actual deployment scenario: the model is retrained periodically with all available data up to the current date, then used to forecast the next period. Each fold in walk-forward corresponds to one such deployment cycle.

Walk-forward also produces multiple error estimates that can be averaged and have standard deviation measured, giving a much more reliable picture of expected production performance. For example, if 5-fold walk-forward validation produces MAEs of [18.2, 22.1, 19.8, 24.3, 20.1] with mean 20.9 and std 2.2, you have a much more honest expectation than a single test set reporting 19.5 MAE.

The tradeoff is computational cost: you fit one model per fold. For classical methods this is fast. For deep learning, walk-forward can mean training 5-10 separate models. A practical compromise is to use walk-forward only for model selection during experimentation, then train a single final model on all available data for deployment.


Q5: How would you handle the cold start problem for a new product in a demand forecasting system?

Answer: Cold start in demand forecasting means predicting demand for a product that has little or no sales history. Three strategies work in practice, often combined.

Global model with static features. Train a single model across all products. New products have zero lag features but do have static attributes: category, brand, price tier, similarity to existing products. A model like LightGBM or TFT trained globally will use these static features to transfer patterns from similar products. The prediction quality is roughly equivalent to the most similar product with data.

Item embedding similarity. Learn embeddings for products in the historical dataset. Embed new products using their static attributes and find nearest neighbors in embedding space. Use the k-nearest neighbors' historical patterns as the prior for the new product's forecast. This is effective when there is a rich vocabulary of product attributes.

Hierarchical forecasting. Forecast at the category or brand level (where data is abundant) and disaggregate to the product level using proportional splits based on attributes. A new product's share within its category is estimated from its positioning (price point relative to category average, shelf placement category). As the product accumulates sales, progressively shift weight from the prior (category proportion) toward the product's own history.

In practice, cold start forecasts have much wider uncertainty than mature product forecasts. Production systems should output wider prediction intervals for cold start items and trigger more frequent review cycles - weekly reforecast instead of monthly - until sufficient history accumulates (typically 2-3 seasonal cycles worth of data).


Q6: What is the difference between recursive and direct multi-step forecasting, and when does each approach work better?

Answer: When forecasting multiple steps ahead, you have two architectural choices.

Recursive (iterated) multi-step: Forecast one step ahead. Feed that prediction back as input to forecast step two. Repeat for all horizon steps. This uses a single one-step-ahead model and applies it iteratively. The advantage: one model to train, efficient inference. The disadvantage: errors compound. If the first prediction is slightly wrong, the second prediction uses a corrupted input, and errors accumulate over the horizon.

Direct multi-step: Train a separate model (or a single model with multi-output head) that directly outputs all hh steps simultaneously. The model is trained with a loss computed over the full horizon, not just the next step. This avoids error compounding entirely - the model learns to distribute prediction error across the full horizon rather than having it cascade.

Empirically, direct multi-step outperforms recursive for longer horizons (beyond 5-10 steps). Recursive wins for very short horizons where error compounding is negligible.

The LSTM implementation above uses direct multi-step: the output head maps the final hidden state to a vector of horizon values in a single forward pass. N-BEATS also uses this pattern. The TFT uses a decoder architecture that generates predictions step-by-step but conditions each step on all past inputs rather than the previous prediction, giving it properties of both approaches.

For production systems forecasting 12-52 weeks ahead, direct multi-step is almost always the right choice. For short-horizon systems (1-3 steps), either approach is reasonable and the simpler recursive model may be preferable for its flexibility.


Key Takeaways

Time series forecasting requires a distinct discipline from standard ML. The single most important practice is correct temporal splitting - train on the past, validate on the adjacent future, test on the held-out final window. Walk-forward validation gives you reliable performance estimates that match deployment conditions.

Decomposition (STL) before model selection is not optional - it reveals the structure you need to model. Classical methods (SARIMA) remain competitive and are often superior for short, univariate series. Deep learning (LSTM, N-BEATS, TFT) wins on long histories, multivariate inputs, and large-scale multi-series settings.

Feature engineering - lag features, rolling statistics, cyclically encoded calendar features - is the highest-leverage activity for gradient boosting forecasters. For neural methods, the sliding window dataset structure is the equivalent: how you frame the supervised learning problem from a raw time series determines what the model can learn.

Prediction intervals matter as much as point forecasts. Business decisions around inventory, staffing, and procurement depend on the range of plausible outcomes, not just the most likely value. Every production forecasting system should output calibrated prediction intervals alongside the point forecast.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the ARIMA Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.