
Demand Forecasting Systems

The 3 AM Restock Alert

It is 3 AM at a Walmart distribution center in Memphis. A night-shift manager gets an alert: hand sanitizer demand across the Southeast region has spiked 400% in the last six hours. The forecasting system has already recomputed reorder quantities for 12,000 affected SKUs. Purchase orders are being generated automatically. Trucks are being rerouted.

Nobody made a decision. The ML pipeline made it.

This is not a hypothetical. During the early days of COVID-19, retailers whose forecasting systems could ingest real-time signals - news trends, search data, weather - and recompute demand within hours survived stockouts far better than those running weekly batch jobs. The difference between a good forecast and a bad one was not just accuracy. It was latency. Speed of adaptation.

Demand forecasting sounds like a statistics problem from a 1970s operations research textbook. In some ways it is. But at the scale of a modern retailer - Walmart forecasting 500 million SKU-store combinations, Amazon updating predictions every 30 minutes across its marketplace - it becomes one of the hardest engineering problems in applied ML. The data is noisy, the patterns are hierarchical, the signals are heterogeneous, and the cost of being wrong is asymmetric: too much inventory destroys margin, too little destroys customer trust.

This lesson walks through the full stack: from classical ARIMA to Temporal Fusion Transformers, from simple lag features to multi-source external signals, from single-store pilots to national-scale hierarchical forecasting.


Why This Exists

Before ML, retail forecasting was done one of two ways. The first was gut instinct from experienced buyers - people who had spent 20 years watching category performance and could smell a trend. The second was naive statistical models: same week last year, maybe adjusted for a trend factor.

Both approaches share a fatal flaw: they cannot process information at scale. A seasoned buyer manages maybe 500 SKUs. A category manager at Target handles 3,000. But Target carries 80,000 SKUs. Nobody can manually track 80,000 items, their interactions with promotions, the weather, competitor activity, social trends, and supply constraints simultaneously.

The cost of bad forecasting is enormous. In retail, inventory carrying costs run 20-30% of item cost annually. A retailer holding $10B in inventory that is 10% higher than optimal is burning $200-300M per year in carrying costs alone - before accounting for markdowns on items that never sell and stockout losses on items that run out. Industry estimates put the global cost of retail out-of-stocks at $1 trillion annually.

ML-based forecasting systems exist because the problem is too complex and too large for any other approach. The value is not that ML is magic. The value is that ML can process thousands of signals, find non-linear interactions, adapt to new patterns, and do it for millions of items simultaneously.


Historical Context

The history of demand forecasting mirrors the history of statistical modeling itself.

In the 1950s, Charles Holt developed what became exponential smoothing - a simple but powerful idea that recent observations should carry more weight than older ones. His student Winters extended this to handle seasonality in 1960. These methods are still used today.

Box and Jenkins formalized ARIMA in their 1970 textbook, giving practitioners a rigorous framework for identifying autoregressive and moving-average components in time series. ARIMA dominated retail forecasting for two decades.

The ML revolution in forecasting started seriously around 2015-2017. The M4 competition (2018) was a watershed: for the first time, a hybrid ML entry (ES-RNN, combining exponential smoothing with LSTMs) won decisively over purely classical methods. The M5 competition (2020) used Walmart data specifically - 42,840 hierarchical time series - and was dominated by gradient boosting methods, particularly LightGBM.

Two deep learning architectures emerged as dominant in production:

  • N-BEATS (2019, Oreshkin et al.) - a pure deep learning model with interpretable decomposition into trend and seasonality
  • Temporal Fusion Transformer (2021, Lim et al.) - attention-based model handling multiple time series, static covariates, and future known inputs simultaneously

Google, Amazon, and Alibaba all now use variants of these architectures at scale. The field has settled into a pragmatic consensus: gradient boosting for tabular feature engineering, deep learning for capturing complex temporal patterns, and ensembles for production.


Core Concepts

The Forecasting Hierarchy

Retail data is naturally hierarchical. You can forecast at the national level, regional level, store level, or SKU level. These levels are constrained: SKU-level forecasts must sum to the store forecast, store-level forecasts must sum to the regional forecast, and regional forecasts must sum to the national forecast.

National Total
├── Region: East
│ ├── District: NY Metro
│ │ ├── Store: Manhattan #001
│ │ │ ├── Category: Beverages
│ │ │ │ ├── Subcategory: Carbonated
│ │ │ │ │ └── SKU: Diet Coke 12-pack

Two broad strategies exist for hierarchical forecasting:

Bottom-up: Forecast each SKU-store combination independently, then aggregate. Captures local patterns but misses cross-level signal. Noisy at low levels.

Top-down: Forecast at the national or regional level, then disaggregate using historical proportions. Smoother, but loses local variation.

Middle-out + Reconciliation: Forecast at a middle level (say, store-category), then reconcile both up and down. The optimal linear reconciliation approach (MinT, Wickramasuriya et al. 2019) minimizes the trace of the covariance matrix of the reconciled forecast errors.

The M5 competition winner used a bottom-up approach with LightGBM - suggesting that with enough features, individual item forecasts can be made accurate enough to aggregate reliably.
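A minimal sketch of the reconciliation idea on a toy two-store hierarchy. The summing matrix S and the projection used here are the standard construction behind MinT-family methods; this is the identity-covariance (OLS) special case, since full MinT also requires an estimate of the forecast error covariance.

import numpy as np

# Toy hierarchy: national total = store_A + store_B
# Rows of S: [total, store_A, store_B]; columns are the bottom-level series.
S = np.array([[1, 1],
              [1, 0],
              [0, 1]])

# Incoherent base forecasts produced independently at each level
y_hat = np.array([95.0,   # national forecast
                  60.0,   # store A forecast
                  55.0])  # store B forecast (60 + 55 != 95)

# Bottom-up: keep the bottom-level forecasts and simply re-aggregate
bottom_up = S @ y_hat[1:]

# OLS reconciliation: project the base forecasts onto the coherent
# subspace spanned by S, blending information from all levels.
P = np.linalg.inv(S.T @ S) @ S.T
reconciled = S @ (P @ y_hat)

print(bottom_up)    # coherent: [115., 60., 55.]
print(reconciled)   # coherent: [101.67, 53.33, 48.33]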

Classical Methods: When They Still Work

Exponential Smoothing (ETS) remains competitive for slow-moving items with clear seasonal patterns and no external signal. It handles intermittent demand (many zeros) better than complex models.

ARIMA works well for stable, stationary series. SARIMA extends this to seasonal data. The Box-Jenkins methodology (identify, estimate, diagnose) is still a valid workflow for individual series analysis.
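A minimal sketch of both classical baselines using statsmodels, assuming a single store-SKU daily series; the orders and seasonal periods are illustrative, not tuned.

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.statespace.sarimax import SARIMAX

def classical_forecasts(series: pd.Series, horizon: int = 28):
    """series: daily sales for one store-SKU, indexed by date."""
    # Holt-Winters (ETS): additive trend and weekly seasonality
    ets = ExponentialSmoothing(
        series, trend='add', seasonal='add', seasonal_periods=7
    ).fit()
    ets_forecast = ets.forecast(horizon)

    # SARIMA: (p,d,q)(P,D,Q,s) with a weekly seasonal period
    sarima = SARIMAX(
        series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7)
    ).fit(disp=False)
    sarima_forecast = sarima.forecast(horizon)

    return ets_forecast, sarima_forecast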

When to use classical methods:

  • Fewer than 1000 SKUs
  • Clear seasonality, stable trend
  • No complex external signals
  • Interpretability required for regulators

When to move to ML:

  • Thousands to millions of SKUs
  • Promotions, weather, events as inputs
  • Cross-series patterns to exploit
  • Need for continuous retraining as patterns shift

Gradient Boosting for Forecasting

The key insight that made LightGBM dominate the M5 competition: forecasting can be reframed as a supervised regression problem if you engineer the right features from time lags.

You are not training a model on time series directly. You are training a model where each row represents a (store, SKU, date) combination, and the features are derived from historical values of that series and related series.

Feature engineering categories:

  1. Lag features: sales at t-7, t-14, t-28, t-365 (same day last week, two weeks ago, four weeks ago, last year)
  2. Rolling statistics: rolling mean and standard deviation over 7, 14, 28, 56, 112-day windows
  3. Calendar features: day of week, month, week of year, is_holiday, days_until_holiday, days_since_holiday
  4. Promotional features: is_on_promotion, discount_percentage, promotion_type
  5. Price features: current price, relative price vs. average, competitor price ratio
  6. External signals: temperature, precipitation, local events, economic indicators
  7. Product attributes: category, subcategory, brand, price tier, weight

The target is typically log-transformed sales to handle skew: y = log(1 + sales).

Multi-step forecasting strategy: For a 28-day horizon, you either train 28 separate models (direct multi-step) or recursively feed predictions back as inputs (recursive). Direct is more accurate. Recursive compounds errors. Most production systems use direct with shared feature representations.
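A minimal sketch of the direct strategy: one LightGBM model per forecast step, each trained on the same feature matrix but with a target shifted h days into the future. Column names match the pipeline later in this lesson; hyperparameters are illustrative.

import lightgbm as lgb

def train_direct_models(df, feature_cols, horizon=28):
    """Train one LightGBM regressor per forecast step (direct multi-step)."""
    models = {}
    for h in range(1, horizon + 1):
        # Target for step h: log sales h days ahead of the feature date
        target_h = df.groupby(['store_id', 'sku_id'])['log_sales'].shift(-h)
        mask = target_h.notna()

        model_h = lgb.LGBMRegressor(
            n_estimators=500, learning_rate=0.05, num_leaves=127
        )
        model_h.fit(df.loc[mask, feature_cols], target_h[mask])
        models[h] = model_h
    return models

# At inference, step h uses models[h] on today's features - no recursion,
# so prediction errors do not compound across the horizon.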

Temporal Fusion Transformer (TFT)

TFT (Lim et al., 2021) is the state-of-the-art deep learning architecture for multi-horizon forecasting with multiple input types. Understanding its architecture clarifies why it outperforms simpler models.

TFT explicitly handles three types of inputs that are common in retail:

  • Static covariates: store location, store format, product category - things that do not change over time
  • Past observed inputs: historical sales, past prices, past promotions
  • Future known inputs: planned promotions, holidays, scheduled events

The architecture uses:

  1. Variable Selection Networks: gate which features matter at each step (handles irrelevant features gracefully)
  2. LSTM encoder-decoder: processes temporal dependencies
  3. Multi-head attention: captures long-range dependencies beyond LSTM's reach
  4. Quantile outputs: produces P10, P50, P90 predictions, not just point estimates

The quantile output is critical for inventory optimization downstream. You do not just need a point forecast - you need a forecast distribution to compute safety stock correctly.
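A hedged sketch of how these three input types map onto a model configuration, assuming the open-source pytorch-forecasting implementation of TFT; column names, lengths, and hyperparameters are illustrative.

from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss

# train_df is assumed to carry an integer time_idx per series
training = TimeSeriesDataSet(
    train_df,
    time_idx='time_idx',
    target='sales',
    group_ids=['store_id', 'sku_id'],
    # Static covariates: constant per series
    static_categoricals=['store_format', 'category'],
    # Future known inputs: available across the forecast horizon
    time_varying_known_reals=['time_idx', 'is_on_promo', 'is_holiday'],
    # Past observed inputs: only known up to the forecast origin
    time_varying_unknown_reals=['sales', 'price'],
    max_encoder_length=90,
    max_prediction_length=28,
)

tft = TemporalFusionTransformer.from_dataset(
    training,
    hidden_size=64,
    attention_head_size=4,
    dropout=0.1,
    # Quantile loss yields P10/P50/P90 forecasts directly
    loss=QuantileLoss(quantiles=[0.1, 0.5, 0.9]),
)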

The Cold Start Problem

New products have no sales history. New stores have no local history. This is the cold start problem.

Approaches in order of sophistication:

  1. Similar item lookup: find the most similar existing SKU by attributes (category, price tier, brand) and use its launch trajectory
  2. Meta-learning: train a model on "launch weeks 1-N" for all historical product launches, predicting trajectory from product attributes
  3. Hierarchical warm-up: use category-level forecasts as a prior, update as initial sales data arrives (Bayesian updating)
  4. Expert priors + rapid adaptation: start with a buyer's range estimate, use Thompson sampling to update the forecast distribution as sales trickle in

Amazon uses a combination of #2 and #3 at scale - a launch prediction model trained on millions of product histories, combined with rapid Bayesian updating as actual sales arrive.
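A minimal sketch of the Bayesian-updating half of this, using a Gamma-Poisson conjugate update: the category-level forecast seeds the prior on the item's daily demand rate, and each day of observed sales narrows the posterior. The prior strength is an assumption to tune.

from scipy.stats import gamma

def warm_start_posterior(category_daily_rate, observed_daily_sales=(),
                         prior_strength_days=14):
    """
    Gamma prior on a Poisson daily demand rate.
    Prior mean = category_daily_rate; prior_strength_days controls how
    many days of real sales it takes to overrule the prior.
    """
    alpha = category_daily_rate * prior_strength_days
    beta = prior_strength_days
    # Conjugate update: add observed units sold and observed days
    alpha += sum(observed_daily_sales)
    beta += len(observed_daily_sales)

    posterior_mean = alpha / beta
    p10 = gamma.ppf(0.10, a=alpha, scale=1 / beta)
    p90 = gamma.ppf(0.90, a=alpha, scale=1 / beta)
    return posterior_mean, (p10, p90)

# Launch day: pure prior from the category forecast
print(warm_start_posterior(category_daily_rate=5.0))
# After a week of strong sales, the estimate shifts up and tightens
print(warm_start_posterior(5.0, observed_daily_sales=[9, 11, 8, 12, 10, 9, 13]))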

External Signals at Scale

The most impactful external signals for retail demand:

| Signal | Impact | Data Source |
| --- | --- | --- |
| Weather (temperature, precipitation) | 15-25% variance in weather-sensitive categories | NOAA, DarkSky, Tomorrow.io |
| Holidays (national, religious, local) | 50-300% lift on relevant categories | Calendar databases |
| Promotions | 20-200% lift depending on discount depth | Internal promotion planning systems |
| Competitor pricing | 5-20% demand shift | Web scraping, price intelligence vendors |
| Social media trends | 50-500% spike for viral items | Twitter API, Google Trends |
| Economic indicators (unemployment, CPI) | Slow-moving, affects category mix | BLS, Federal Reserve |

Integrating these signals requires matching them to the right spatial and temporal granularity. A temperature signal for New York City is not useful for a store in Phoenix. A national promotion signal needs to be joined at store level.
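A minimal sketch of granularity matching, assuming a store dimension table that maps each store to a weather station; table and column names are illustrative.

import pandas as pd

def join_weather(sales_df, store_dim, weather_df):
    """
    sales_df:   store_id, date, sales, ...
    store_dim:  store_id, weather_station_id, region_id
    weather_df: weather_station_id, date, temperature, precipitation
    """
    # Map each store to its weather station first...
    df = sales_df.merge(store_dim[['store_id', 'weather_station_id']],
                        on='store_id', how='left')
    # ...then join weather at (station, date) granularity, so a Phoenix
    # store never inherits New York temperatures.
    df = df.merge(weather_df, on=['weather_station_id', 'date'], how='left')
    return df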


Practical Implementation

LightGBM Demand Forecasting Pipeline

import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_percentage_error
import warnings
warnings.filterwarnings('ignore')

# ============================================================
# 1. Data Loading and Preprocessing
# ============================================================

def load_retail_data(filepath: str) -> pd.DataFrame:
    """
    Expected columns: date, store_id, sku_id, sales, price,
    is_on_promo, promo_discount_pct, temperature
    """
    df = pd.read_parquet(filepath)
    df['date'] = pd.to_datetime(df['date'])
    df = df.sort_values(['store_id', 'sku_id', 'date']).reset_index(drop=True)
    # Log-transform sales (add 1 to handle zeros)
    df['log_sales'] = np.log1p(df['sales'])
    return df


def create_calendar_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add calendar-based features to the dataframe."""
    df = df.copy()
    df['day_of_week'] = df['date'].dt.dayofweek
    df['month'] = df['date'].dt.month
    df['week_of_year'] = df['date'].dt.isocalendar().week.astype(int)
    df['quarter'] = df['date'].dt.quarter
    df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
    df['day_of_month'] = df['date'].dt.day
    df['days_in_month'] = df['date'].dt.days_in_month

    # Cyclical encoding for day_of_week and month
    df['dow_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
    df['dow_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)
    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

    return df


# ============================================================
# 2. Lag and Rolling Feature Engineering
# ============================================================

def create_lag_features(
    df: pd.DataFrame,
    target_col: str = 'log_sales',
    group_cols: list = ['store_id', 'sku_id'],
    lags: list = [7, 14, 21, 28, 35, 42, 56, 84, 112, 365]
) -> pd.DataFrame:
    """
    Create lag features within each store-SKU group.
    Lags are in days - we shift within the sorted group.
    """
    df = df.copy()
    for lag in lags:
        df[f'lag_{lag}d'] = (
            df.groupby(group_cols)[target_col]
            .shift(lag)
        )
    return df


def create_rolling_features(
    df: pd.DataFrame,
    target_col: str = 'log_sales',
    group_cols: list = ['store_id', 'sku_id'],
    windows: list = [7, 14, 28, 56, 112]
) -> pd.DataFrame:
    """
    Rolling mean and std for each window size.
    Uses shift(1) to prevent data leakage.
    """
    df = df.copy()
    for window in windows:
        rolled = (
            df.groupby(group_cols)[target_col]
            .transform(lambda x: x.shift(1).rolling(window, min_periods=1).mean())
        )
        df[f'rolling_mean_{window}d'] = rolled

        rolled_std = (
            df.groupby(group_cols)[target_col]
            .transform(lambda x: x.shift(1).rolling(window, min_periods=1).std())
        )
        df[f'rolling_std_{window}d'] = rolled_std

    # Rolling max (captures peaks like holiday spikes)
    for window in [7, 28]:
        df[f'rolling_max_{window}d'] = (
            df.groupby(group_cols)[target_col]
            .transform(lambda x: x.shift(1).rolling(window, min_periods=1).max())
        )

    return df


def create_trend_features(
    df: pd.DataFrame,
    target_col: str = 'log_sales',
    group_cols: list = ['store_id', 'sku_id']
) -> pd.DataFrame:
    """Momentum and trend features."""
    df = df.copy()
    # Week-over-week change
    df['wow_change'] = (
        df.groupby(group_cols)[target_col].shift(7) -
        df.groupby(group_cols)[target_col].shift(14)
    )
    # Month-over-month growth rate
    lag_28 = df.groupby(group_cols)[target_col].shift(28)
    lag_56 = df.groupby(group_cols)[target_col].shift(56)
    df['mom_growth'] = (lag_28 - lag_56) / (lag_56 + 1e-6)

    return df


# ============================================================
# 3. Full Feature Pipeline
# ============================================================

def build_feature_matrix(df: pd.DataFrame) -> tuple:
    """
    Build the complete feature matrix for LightGBM training.
    Returns (features_df, feature_columns, target_column).
    """
    df = create_calendar_features(df)
    df = create_lag_features(df)
    df = create_rolling_features(df)
    df = create_trend_features(df)

    feature_cols = (
        # Calendar
        ['day_of_week', 'month', 'week_of_year', 'quarter',
         'is_weekend', 'day_of_month',
         'dow_sin', 'dow_cos', 'month_sin', 'month_cos'] +
        # Lag features
        [f'lag_{lag}d' for lag in [7, 14, 21, 28, 35, 42, 56, 84, 112, 365]] +
        # Rolling features
        [f'rolling_mean_{w}d' for w in [7, 14, 28, 56, 112]] +
        [f'rolling_std_{w}d' for w in [7, 14, 28, 56, 112]] +
        [f'rolling_max_{w}d' for w in [7, 28]] +
        # Trend
        ['wow_change', 'mom_growth'] +
        # Store/product context
        ['price', 'is_on_promo', 'promo_discount_pct', 'temperature']
    )

    # Drop rows with NaN from lag computation
    df_clean = df.dropna(subset=feature_cols)

    return df_clean, feature_cols, 'log_sales'


# ============================================================
# 4. Model Training with Time Series Cross-Validation
# ============================================================

def train_lgbm_forecaster(
    df: pd.DataFrame,
    feature_cols: list,
    target_col: str,
    forecast_horizon: int = 28
) -> tuple:
    """
    Train LightGBM with a time-based holdout: the final forecast_horizon
    days are held out for validation.
    Returns trained model and validation metrics.
    """
    # Time-based train/val split
    cutoff_date = df['date'].max() - pd.Timedelta(days=forecast_horizon)
    train_df = df[df['date'] <= cutoff_date]
    val_df = df[df['date'] > cutoff_date]

    X_train = train_df[feature_cols]
    y_train = train_df[target_col]
    X_val = val_df[feature_cols]
    y_val = val_df[target_col]

    lgb_params = {
        'objective': 'regression',
        'metric': 'rmse',
        'num_leaves': 127,
        'learning_rate': 0.05,
        'feature_fraction': 0.8,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'min_child_samples': 20,
        'reg_alpha': 0.1,
        'reg_lambda': 0.1,
        'n_jobs': -1,
        'verbose': -1,
    }

    train_data = lgb.Dataset(X_train, label=y_train)
    val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

    callbacks = [
        lgb.early_stopping(50, verbose=False),
        lgb.log_evaluation(period=100)
    ]

    model = lgb.train(
        lgb_params,
        train_data,
        num_boost_round=1000,
        valid_sets=[val_data],
        callbacks=callbacks
    )

    # Evaluate
    val_preds_log = model.predict(X_val)
    val_preds = np.expm1(val_preds_log)  # Inverse log transform
    val_actual = np.expm1(y_val.values)

    # MAPE on non-zero actuals (the M5 competition scored on WRMSSE, a
    # revenue-weighted scaled error; MAPE is used here as a simpler proxy)
    mape = mean_absolute_percentage_error(
        val_actual[val_actual > 0],
        val_preds[val_actual > 0]
    )

    print(f"Validation MAPE: {mape:.4f}")
    print(f"Best iteration: {model.best_iteration}")

    return model, mape


# ============================================================
# 5. Hierarchical Reconciliation
# ============================================================

def bottom_up_reconciliation(
    sku_store_forecasts: pd.DataFrame,
    hierarchy: dict
) -> dict:
    """
    Aggregate SKU-store forecasts up the hierarchy.
    hierarchy: {'store_to_region': {store_id: region_id, ...}, ...}

    Returns dict of forecasts at each level.
    """
    results = {'sku_store': sku_store_forecasts}

    # Aggregate to store level
    store_forecasts = (
        sku_store_forecasts
        .groupby(['store_id', 'date'])['forecast']
        .sum()
        .reset_index()
    )
    results['store'] = store_forecasts

    # Aggregate to region level (requires store-to-region mapping)
    if 'store_to_region' in hierarchy:
        mapping = hierarchy['store_to_region']
        store_forecasts['region_id'] = store_forecasts['store_id'].map(mapping)
        region_forecasts = (
            store_forecasts
            .groupby(['region_id', 'date'])['forecast']
            .sum()
            .reset_index()
        )
        results['region'] = region_forecasts

    return results


# ============================================================
# 6. Inference Pipeline
# ============================================================

def generate_forecasts(
    model: lgb.Booster,
    df: pd.DataFrame,
    feature_cols: list,
    forecast_horizon: int = 28
) -> pd.DataFrame:
    """
    Generate rolling forecasts for the next forecast_horizon days.
    Uses actual features when available, predicted when not.
    """
    last_date = df['date'].max()
    forecast_dates = pd.date_range(
        start=last_date + pd.Timedelta(days=1),
        periods=forecast_horizon,
        freq='D'
    )

    all_forecasts = []

    for store_id in df['store_id'].unique():
        store_df = df[df['store_id'] == store_id].copy()

        for sku_id in store_df['sku_id'].unique():
            series = store_df[store_df['sku_id'] == sku_id].copy()

            for forecast_date in forecast_dates:
                # Build feature row for this date
                # In production, future promotions/prices would be known inputs
                feature_row = build_forecast_row(series, forecast_date)
                if feature_row is None:
                    continue

                log_pred = model.predict(
                    feature_row[feature_cols].values.reshape(1, -1)
                )[0]
                pred_sales = np.expm1(log_pred)

                # Append prediction to series for subsequent lags
                new_row = feature_row.copy()
                new_row['log_sales'] = log_pred
                new_row['sales'] = pred_sales
                series = pd.concat([series, new_row.to_frame().T], ignore_index=True)

                all_forecasts.append({
                    'store_id': store_id,
                    'sku_id': sku_id,
                    'forecast_date': forecast_date,
                    'forecast_sales': pred_sales,
                    'log_forecast': log_pred
                })

    return pd.DataFrame(all_forecasts)


def build_forecast_row(series: pd.DataFrame, forecast_date: pd.Timestamp) -> pd.Series:
    """Build a single feature row for a future date using available history."""
    # This function would extract lag features from the series
    # for the given forecast_date. Implementation depends on feature set.
    # Placeholder for illustration.
    row = pd.Series({'date': forecast_date})
    row['day_of_week'] = forecast_date.dayofweek
    row['month'] = forecast_date.month
    # ... populate remaining features from series history
    return row


# ============================================================
# 7. Model Evaluation - Multiple Metrics
# ============================================================

def evaluate_forecasts(
    actuals: pd.DataFrame,
    forecasts: pd.DataFrame
) -> dict:
    """
    Compute retail-relevant metrics.
    Both dataframes need: store_id, sku_id, date, sales/forecast_sales
    """
    merged = actuals.merge(
        forecasts,
        left_on=['store_id', 'sku_id', 'date'],
        right_on=['store_id', 'sku_id', 'forecast_date']
    )

    actual = merged['sales'].values
    predicted = merged['forecast_sales'].values

    # Mask zeros for percentage metrics
    nonzero = actual > 0

    metrics = {}

    # MAPE - Mean Absolute Percentage Error
    metrics['mape'] = np.mean(
        np.abs(actual[nonzero] - predicted[nonzero]) / actual[nonzero]
    )

    # RMSE
    metrics['rmse'] = np.sqrt(np.mean((actual - predicted) ** 2))

    # Bias - are we systematically over or under forecasting?
    metrics['bias'] = np.mean(predicted - actual)

    # Stockout rate (when actual > 1.2 * forecast - under-forecasted)
    metrics['underforecast_rate'] = np.mean(actual > 1.2 * predicted)

    # Overstock rate (when forecast > 1.5 * actual)
    metrics['overforecast_rate'] = np.mean(predicted > 1.5 * actual)

    return metrics

Architecture Diagrams

(Diagrams in this section: forecasting system architecture, feature engineering flow, hierarchical forecasting.)


Production Engineering Notes

Scale Challenges

At Walmart or Amazon scale, you are forecasting 500 million+ store-SKU combinations. This changes the engineering constraints entirely.

Parallelization: Use distributed Spark or Dask for feature engineering. Group by SKU category or store region and process in parallel. LightGBM supports distributed training via its built-in MPI backend.

Feature storage: Precompute and store lag/rolling features in a feature store. Do not recompute at inference time. At 500 million series, even a rolling 28-day mean that takes 100ms per series adds up to roughly a year and a half of sequential compute.

Incremental updates: Do not retrain from scratch daily. Use online learning variants or fine-tune existing models on recent data. LightGBM supports continuing training from a checkpoint.
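A minimal sketch of the incremental update, using lgb.train's init_model argument; hyperparameters and round counts are illustrative.

import lightgbm as lgb

def incremental_update(existing_model: lgb.Booster,
                       recent_X, recent_y, params: dict) -> lgb.Booster:
    """Continue boosting from an existing model on recent data only."""
    recent_data = lgb.Dataset(recent_X, label=recent_y)
    updated = lgb.train(
        params,
        recent_data,
        num_boost_round=100,       # a small number of additional trees
        init_model=existing_model  # start from the deployed model
    )
    return updated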

Model lifecycle: Maintain separate models per category (beverages, apparel, electronics) rather than one global model. Different categories have different temporal dynamics.

Data Quality Issues

The biggest source of production forecast errors is not model selection - it is data quality.

Common retail data issues:

  • POS system outages: stores report 0 sales but the store was actually open - you need to detect and impute
  • Promotional flag lag: promotions are entered retroactively in some systems, causing the model to see promotional lift without the promotional flag
  • Cannibalization: when one SKU goes on promotion, similar SKUs see demand drop - this is often unmodeled
  • Returns: gross sales vs net sales - large return events distort rolling statistics

Monitoring: Track mean absolute error per category per week. Alert when a category's error rate exceeds 2x its historical baseline. This catches distribution shifts before they cascade.
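A minimal sketch of that alerting rule, assuming a table of weekly per-category absolute errors; column names are illustrative.

import pandas as pd

def categories_to_alert(errors: pd.DataFrame, current_week) -> list:
    """
    errors: columns [category, week, mae]
    Flags categories whose current-week MAE exceeds 2x their
    trailing historical baseline.
    """
    history = errors[errors['week'] < current_week]
    baseline = history.groupby('category')['mae'].mean()

    current = (errors[errors['week'] == current_week]
               .set_index('category')['mae'])

    ratio = current / baseline
    return ratio[ratio > 2.0].index.tolist()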

Retraining Strategy

Do not retrain on a fixed calendar. Retrain when:

  1. A statistically significant distribution shift is detected (Population Stability Index > 0.25)
  2. Forecast error exceeds 2x rolling average for 3 consecutive days
  3. A major structural change occurs (new store opening, category reset)

Use Champion/Challenger deployment: new model serves 5-10% of SKUs in shadow mode before full rollout.
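A minimal sketch of the PSI check behind trigger #1 above; the bin count is an assumption, and 0.25 is the usual "significant shift" threshold.

import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a training-time feature sample and a recent sample."""
    # Equal-width bins from the reference (training) distribution
    edges = np.linspace(np.min(expected), np.max(expected), n_bins + 1)
    edges[0], edges[-1] = -np.inf, np.inf

    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Avoid log(0) and division by zero in empty bins
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)

    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))

# Retrain when PSI on a key feature (e.g., a 28-day rolling mean) > 0.25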


Common Mistakes

:::danger Data Leakage in Lag Features The most common and most damaging mistake in retail forecasting: using same-day data in your lag features. If you compute rolling_mean_7d without shifting by 1, you include the current day's sales as a feature when predicting the current day. The model learns to copy its input. Validation looks perfect. Production is a disaster. Always use .shift(1) before any rolling computation. :::

:::danger Ignoring Promotion Events in Validation If you split train/val by time and your validation period contains a major promotion that your training period did not have, your validation error will be inflated for the wrong reason. Worse, if you tune your model to perform well on a promotional validation set, it will over-fit to promotional patterns and under-perform on non-promotional periods. Use multiple validation windows across different promotional cycles. :::

:::warning Metric Selection Mismatch RMSE heavily penalizes large errors on high-volume items. MAPE penalizes errors on low-volume items (an error of 1 unit on a 2-unit-per-week item is 50% MAPE). Neither aligns with business value. Use WRMSSE (M5 metric) which weights errors by item revenue, or define your own business metric based on the cost of overstock vs stockout for each category. :::
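A minimal sketch of the M5-style metric mentioned above, computed per series and then revenue-weighted; the weighting here is a simplification (the official M5 weights are based on recent dollar sales).

import numpy as np

def rmsse(y_train, y_true, y_pred):
    """Root Mean Squared Scaled Error for one series (M5 building block)."""
    # Scale: mean squared one-step naive error on the training history
    scale = np.mean(np.diff(y_train) ** 2)
    return np.sqrt(np.mean((y_true - y_pred) ** 2) / scale)

def wrmsse(series_list):
    """
    series_list: tuples of (y_train, y_true, y_pred, revenue) per series.
    Weights each series' RMSSE by its share of total revenue.
    """
    total_revenue = sum(rev for *_, rev in series_list)
    return sum(
        (rev / total_revenue) * rmsse(y_tr, y_t, y_p)
        for y_tr, y_t, y_p, rev in series_list
    )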

:::warning Cold Start with Similar Item Heuristics Using "nearest neighbor by category and price" as your cold start proxy is fine for stable categories. In fashion, electronics, and seasonal goods, it fails spectacularly because product success is highly unpredictable and the distribution of launch trajectories is fat-tailed. Supplement with explicit uncertainty - give inventory planning the P10 and P90 of your cold start forecast, not just the median. :::


Interview Questions and Answers

Q1: How would you handle the cold start problem for a new product launch at a retailer with 80,000 SKUs?

A: The cold start problem requires a layered approach. Start with meta-learning: train a model on the first 4-8 weeks of sales for all historical product launches, using only product attributes (category, price tier, brand, physical attributes) as features. This gives you a launch trajectory prior. When a new product launches, you immediately have a day-1 forecast from attributes alone. As actual sales data arrives, apply Bayesian updating - your uncertainty about the true demand rate narrows with each observation. Critically, communicate uncertainty to downstream systems. Give inventory planning a P10-P90 range, not a point estimate. For fashion specifically, where launch trajectories are highly unpredictable, use a conservative safety stock multiplier for the first 2-3 weeks.

Q2: The M5 competition was dominated by LightGBM. When would you choose TFT over LightGBM in production?

A: LightGBM wins when you have well-engineered tabular features, sufficient training data, and need fast inference. TFT wins in three situations: (1) when you want calibrated probabilistic forecasts natively - TFT's quantile outputs give you P10/P50/P90 without separate quantile regression models; (2) when you have future known inputs like planned promotions that span multiple future time steps - TFT's architecture is designed for this; (3) when series are long and exhibit complex long-range dependencies that lag features cannot capture cleanly. In practice, the best production systems ensemble both: LightGBM for speed and tabular signal, TFT for probabilistic coverage and temporal structure.

Q3: Explain hierarchical forecasting and why naive summation fails.

A: In hierarchical forecasting, you forecast at multiple levels (national, regional, store, SKU) and need forecasts to be consistent - SKU forecasts must sum to store totals, store totals to regional totals, etc. Naive approach: forecast each level independently. Problem: the independently-generated forecasts are not coherent - SKU forecasts might sum to 120 units at a store while the store-level forecast is 95 units. This is confusing for planners. Optimal Reconciliation (MinT) solves this by projecting all forecasts onto the set of coherent forecasts that minimize the trace of the forecast error covariance matrix. In practice, bottom-up (aggregate from SKU) often outperforms top-down (disaggregate from national) because granular models capture local patterns better.

Q4: How do you detect and handle outliers in retail time series - specifically holiday spikes and promotional effects?

A: There are two categories. Legitimate outliers like holiday spikes should be kept but modeled explicitly. Add is_holiday and days_since_holiday features; these are not outliers but structural patterns. Anomalous outliers - a store reporting 10,000 units sold on a day it was closed due to a snowstorm - should be identified and imputed. Detection: STL decomposition extracts a seasonal + trend baseline; residuals exceeding 3-5 standard deviations on known closure days are anomalies. Imputation: for short outages (1-3 days), use the rolling median from adjacent periods. For longer outages, use sales from the same day in the same week from prior years. The key is that imputed values should inform the rolling feature computation used by the model - so downstream features do not inherit the anomaly.

Q5: You have a forecasting system with MAPE of 15% in offline validation, but after deployment the actual MAPE is 35%. What are the likely causes and how would you diagnose this?

A: This is a classic train-serve skew problem. Three primary causes: (1) Feature leakage in training - validate that no same-day or future information leaks into lag features; inspect feature computation code carefully. (2) Distribution shift - the deployment period has different statistical properties than the validation period (new products, competitor activity, macro events). Compare the distribution of each feature in validation vs. production. (3) Training/serving feature inconsistency - the feature computation in training uses batch historical data, but at serving time features are computed from a live feature store that may have latency issues, missing values, or different join logic. This is the most common production issue. Instrument feature values at serving time and compare their distributions to training data distributions. Also check: is the MAPE higher on specific categories or stores? Localizing the error pattern usually points to the root cause.

Q6: Describe how you would build a real-time demand sensing system that can react to a viral social media event within hours.

A: Real-time demand sensing requires a streaming pipeline that bypasses the typical batch forecasting cycle. Architecture: (1) Ingest social media signals (Twitter/TikTok trending products) via API streaming into Kafka. (2) A Flink streaming job matches trending product mentions to SKUs in the product catalog using fuzzy string matching and category classification. (3) For matched SKUs, compute a "social velocity" feature: rate of mentions over last 1 hour vs last 24-hour baseline. (4) Trigger a micro-forecast update for affected SKUs using the social velocity as an exogenous override. (5) Push updated forecasts to inventory systems with a "high confidence" or "speculative" flag. The challenge is false positives - not every trending mention translates to actual purchase intent. Use historical examples (past viral products) to calibrate the social velocity to expected demand multiplier. Keep humans in the loop for large purchase orders: alert the buying team rather than auto-generating POs above a threshold.
