Patient Outcome Prediction
The Sepsis Alert That Cried Wolf
At a 600-bed teaching hospital in Pittsburgh, the clinical informatics team deployed a sepsis prediction model in 2019. The model was a gradient boosted tree trained on vital signs, lab values, and nursing assessments. Internal validation looked excellent: AUC 0.87, sensitivity 82% at the chosen operating threshold.
Six months after deployment, nursing leadership asked to shut it down.
The alert firing rate was 2.3 alerts per patient per day. On a typical 30-bed ICU, that was 69 alerts every 24 hours. Nurses were spending an average of 8 minutes per alert - reviewing the patient, checking vitals, calling the resident. This was consuming over 9 hours of nursing time per day from a single model. And because most alerts did not correspond to actual clinical deterioration, nurses began acknowledging them without acting. The sensitivity of the alert was mathematically 82%, but the effective sensitivity - the probability that a nurse would respond meaningfully - had dropped to near zero through alert fatigue.
This story has been repeated at dozens of health systems across the United States. Epic Systems, whose clinical software runs in roughly 40% of US hospitals, has deployed its Deterioration Index model to thousands of hospitals. Studies have found that the alert firing rate varies enormously across institutions, and the correlation between alert response and actual patient outcomes is weak in many settings.
The fundamental engineering challenge in clinical prediction is not building a model with high AUC. It is building a model that fires the right number of alerts, at the right time, with the right contextual information, to drive the right clinical action. This requires thinking about the entire sociotechnical system - not just the model's ROC curve.
Patient outcome prediction is one of the most mature applications of ML in healthcare. The MIMIC dataset has enabled hundreds of publications on ICU outcome prediction. Sepsis prediction, readmission risk, and in-hospital mortality scoring models are deployed in real clinical systems at scale. The lessons from these deployments are as instructive as the technical foundations.
The gap between academic benchmark performance and real-world clinical impact is larger in this domain than almost any other in applied AI. Understanding why is as important as understanding how to build the models.
Why This Exists - The Clinical Decision Support Gap
Clinicians make decisions under extreme uncertainty. On a 20-patient ICU floor, multiple patients are on ventilators, continuous infusions, and multiple monitoring lines. The human brain cannot continuously track all physiological trends across all patients simultaneously. Pattern recognition fatigue sets in within hours.
The classic evidence base for clinical prediction tools comes from scoring systems developed in the 1980s and 1990s: APACHE II for ICU severity, SOFA for organ failure assessment, CURB-65 for pneumonia severity. These were derived from logistic regression on paper chart data, with at most a handful of variables. They are still widely used because they have been validated prospectively across thousands of patients and clinicians trust them.
The hypothesis behind ML-based prediction: if simple logistic regression on 5-6 variables gives AUC 0.75-0.80, a gradient boosting model on hundreds of features should give AUC 0.85-0.90 or better. This hypothesis has been largely confirmed in retrospective studies. The question is whether better retrospective performance translates to better clinical outcomes when deployed prospectively.
Historical Context
Clinical prediction rules based on bedside observations date to the mid-20th century. The Apgar Score (1952) quantified neonatal condition with 5 variables on a 0-10 scale. The APACHE scoring system (Knaus et al., 1981) formalized ICU severity of illness with physiological measurements. These were the foundations of prognostic scoring in critical care.
The shift toward ML-based prediction began with the availability of large electronic clinical databases. MIMIC-II was released in 2011, providing longitudinal ICU data from over 25,000 patients. This enabled the first systematic ML studies of ICU outcomes at a scale that the earlier chart-review-derived scoring systems could never have reached.
The 2014 i2b2 challenge on heart disease risk from clinical notes, and the 2016 "Big Data in Intensive Care Medicine" roadmap paper by Celi et al., crystallized the research agenda. By 2018, multiple papers demonstrated LSTM models on MIMIC vital sign time series achieving AUC above 0.88 for in-hospital mortality, significantly outperforming APACHE IV.
Meanwhile, Sepsis-3 (Singer et al., JAMA 2016) redefined what sepsis means, complicating development of ML-based sepsis early-warning models (such as Henry et al.'s TREWScore): if the definition of the outcome changes, every pre-2016 sepsis prediction model is predicting something slightly different from the current clinical standard.
The Epic Deterioration Index, deployed to hundreds of Epic hospitals, is the most widely deployed clinical prediction model in history. Published in JAMIA in 2020 (Downing et al.), it uses logistic regression with L1 regularization on 29 features. The simplicity was intentional: interpretability and workflow integration took precedence over maximum AUC.
Core Concepts
Feature Engineering from EHR Data
Raw EHR data is not model-ready. Transforming it into useful features requires domain knowledge combined with practical data engineering.
Vital signs: heart rate, blood pressure (systolic, diastolic, mean), respiratory rate, oxygen saturation, temperature. These are often recorded at irregular intervals - every 15 minutes in a monitored ICU, or once per shift on a general ward. Missing data is informative: a missing vital sign often means "patient is stable and not requiring monitoring" or alternatively "the nurse was too busy to record it."
Lab values: complete blood count (CBC), basic metabolic panel (BMP), lactate, procalcitonin, troponin, coagulation studies (PT/INR). Lab values are even more irregularly sampled - a critically ill patient may have arterial blood gas measured every hour; a stable post-surgical patient may have a CBC once daily.
Medications: vasopressor administration (norepinephrine, vasopressin) indicates circulatory shock. Antibiotics initiation may indicate suspected infection. Insulin infusions indicate metabolic derangement or stress hyperglycemia.
Demographics: age, sex, BMI, admission type (elective, emergency, medical, surgical), admission diagnosis.
Prior utilization: number of previous hospitalizations, prior ICU admissions, comorbidity burden (Charlson Comorbidity Index).
Temporal features: time since admission, time of day (circadian effects on physiology), day of week (weekend effect on staffing and outcomes is real).
Feature engineering strategies:
For time series (vitals, labs): compute summary statistics over rolling windows - mean, min, max, standard deviation, slope (trend), time-since-last-measurement, number of measurements in window. A falling mean arterial pressure over the past 6 hours with increasing lactate is a very different signal than a stable low MAP.
For missing values: do not impute naively with mean/median. In clinical data, missingness is often informative. Strategies: forward-fill (carry last known value forward), multiple imputation, or explicit missingness indicators (a binary feature "lactate was measured in last 4 hours" captures whether clinicians ordered the test, which is itself a clinical decision).
LSTM for Vital Sign Time Series
The structured temporal nature of ICU physiological data - measurements arriving at irregular intervals over hours to days - makes it a natural fit for recurrent neural networks.
An LSTM processes a sequence $x_1, x_2, \ldots, x_T$, where each $x_t$ is a feature vector (vital signs + labs at time $t$). The LSTM maintains a hidden state $h_t$ that summarizes all information up to time $t$:

$$h_t = \mathrm{LSTM}(x_t, h_{t-1})$$

The final hidden state $h_T$ (or an attention-weighted combination of all hidden states) is used for classification or regression. For mortality prediction, it is passed through a sigmoid layer to produce a probability in $[0, 1]$.
For irregular time series, a simple modification: include time-since-last-observation as an additional input feature at each timestep, or use an ODE-based model (Neural ODE, GRU-ODE-Bayes) that explicitly models continuous dynamics between observations.
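A minimal sketch of the first option - augmenting each timestep with an observation indicator and a time-since-last-observation channel (GRU-D-style inputs). The function name and the hourly-binning assumption are illustrative:

```python
import numpy as np

def add_delta_time_channels(values: np.ndarray, observed: np.ndarray) -> np.ndarray:
    """
    values:   (T, F) forward-filled measurements, binned to hourly steps
    observed: (T, F) binary mask, 1 where a real measurement occurred at that step
    Returns a (T, 3F) array of [value, observed, hours_since_last_observation]
    per feature, ready to feed to an LSTM/GRU.
    """
    T, F = values.shape
    delta = np.zeros((T, F), dtype=np.float32)
    for t in range(1, T):
        # Gap resets after an observed step, otherwise keeps growing by one hour
        delta[t] = np.where(observed[t - 1] == 1, 1.0, delta[t - 1] + 1.0)
    return np.concatenate([values, observed, delta], axis=1)
```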
Survival Analysis - When, Not Just If
Binary classification (will this patient be readmitted?) ignores timing. Readmission at 3 days is clinically very different from readmission at 29 days. Survival analysis answers the question: when will the event occur?
Kaplan-Meier curves are non-parametric estimates of the survival function $S(t)$ - the probability that the event has not occurred by time $t$. The KM estimator handles censored observations (patients who left the study without experiencing the event):

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$

where $d_i$ is the number of events at time $t_i$ and $n_i$ is the number of patients at risk just before $t_i$.
The Cox proportional hazards model extends this to include covariates. The hazard function for patient $i$ is:

$$h_i(t) = h_0(t)\, \exp(\beta^\top x_i)$$

where $h_0(t)$ is the baseline hazard (non-parametric, estimated from the data), $x_i$ is the feature vector, and $\beta$ are coefficients estimated by maximizing the partial likelihood. The proportional hazards assumption: the ratio of hazards between any two patients is constant over time, determined entirely by their covariates.
The Cox model outputs a risk score for each patient, proportional to their relative hazard. Concordance index (C-statistic) measures how often the model correctly ranks pairs of patients by survival time - analogous to AUC-ROC for classification.
For deep learning survival models: DeepSurv (Katzman et al., 2018) replaces the linear predictor $\beta^\top x$ with a deep neural network, capturing non-linear covariate effects while preserving the Cox framework. The partial likelihood loss is differentiable and can be optimized with SGD.
Calibration - Are Predicted Probabilities Trustworthy?
A model with AUC 0.90 can still be terribly calibrated. Calibration measures whether predicted probabilities match observed event rates. If the model says "40% probability of readmission" for a group of patients, roughly 40% of those patients should actually be readmitted.
Perfect calibration: for every predicted probability $p$, $P(y = 1 \mid \hat{p} = p) = p$.
The Brier score measures calibration alongside discrimination:

$$\text{Brier} = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)^2$$

where $y_i \in \{0, 1\}$ is the observed outcome and $\hat{p}_i$ is the predicted probability for patient $i$.
The calibration plot (reliability diagram) is the standard visualization: bin predictions by decile, plot mean predicted probability vs observed event rate. A perfectly calibrated model lies on the diagonal.
Common calibration problem in clinical ML: models trained on imbalanced data with oversampling or class weights tend to be overconfident - they predict higher probabilities than are empirically observed. Platt scaling (logistic regression on the raw output scores) and isotonic regression are standard post-hoc calibration methods.
Why calibration matters clinically: if a physician sees "85% probability of sepsis," they expect that 85% of patients with this alert actually develop sepsis. If only 20% do (poor calibration), the physician quickly learns not to trust the probability, and the model becomes less useful than a simple threshold alert.
Code Examples
EHR Feature Engineering Pipeline
import pandas as pd
import numpy as np
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass
@dataclass
class TimeSeriesConfig:
"""Configuration for vital sign feature extraction."""
vital_columns: List[str]
look_back_hours: List[int] # e.g., [2, 6, 12, 24]
aggregations: List[str] # e.g., ['mean', 'min', 'max', 'std', 'slope']
def extract_vital_features(
vitals_df: pd.DataFrame,
patient_id: str,
prediction_time: pd.Timestamp,
config: TimeSeriesConfig
) -> Dict[str, float]:
"""
Extract summary features from vital sign time series
for a given patient at a given prediction time.
Args:
vitals_df: DataFrame with columns [patient_id, charttime, vital_name, value]
patient_id: Patient identifier
prediction_time: The time at which we are making the prediction
config: Feature extraction configuration
Returns:
Dictionary of feature_name -> value
"""
features = {}
# Filter to this patient's vitals before prediction time
patient_vitals = vitals_df[
(vitals_df['patient_id'] == patient_id) &
(vitals_df['charttime'] <= prediction_time)
].copy()
for vital in config.vital_columns:
vital_data = patient_vitals[patient_vitals['vital_name'] == vital].copy()
for hours in config.look_back_hours:
window_start = prediction_time - pd.Timedelta(hours=hours)
window_data = vital_data[vital_data['charttime'] >= window_start]
# Missing data features - is this vital being monitored?
features[f"{vital}_{hours}h_count"] = len(window_data)
features[f"{vital}_{hours}h_has_data"] = int(len(window_data) > 0)
if len(window_data) == 0:
# No data in window - fill with NaN, handle downstream
for agg in config.aggregations:
features[f"{vital}_{hours}h_{agg}"] = np.nan
features[f"{vital}_{hours}h_hours_since"] = hours # max lookback
continue
values = window_data['value'].values
            # Aggregations over the look-back window
for agg in config.aggregations:
if agg == 'mean':
features[f"{vital}_{hours}h_mean"] = np.mean(values)
elif agg == 'min':
features[f"{vital}_{hours}h_min"] = np.min(values)
elif agg == 'max':
features[f"{vital}_{hours}h_max"] = np.max(values)
elif agg == 'std':
features[f"{vital}_{hours}h_std"] = np.std(values) if len(values) > 1 else 0.0
elif agg == 'slope':
if len(values) > 1:
# Linear trend over the window
times_hours = (
window_data['charttime'].values.astype(np.int64) / 1e9 / 3600
)
times_normalized = times_hours - times_hours[0]
slope = np.polyfit(times_normalized, values, 1)[0]
features[f"{vital}_{hours}h_slope"] = slope
else:
features[f"{vital}_{hours}h_slope"] = 0.0
# Time since most recent measurement
most_recent = vital_data['charttime'].max()
hours_since = (prediction_time - most_recent).total_seconds() / 3600
features[f"{vital}_{hours}h_hours_since_last"] = min(hours_since, hours)
return features
def compute_shock_index(sbp: float, hr: float) -> float:
"""
Shock Index = Heart Rate / Systolic BP.
SI > 0.9 is associated with hemodynamic instability.
SI > 1.0 is associated with major adverse outcomes.
"""
    if sbp is None or hr is None or sbp <= 0:
        return np.nan
return hr / sbp
def compute_sofa_score(
pao2_fio2: Optional[float], # Respiratory component
platelets: Optional[float], # Coagulation
bilirubin: Optional[float], # Liver
map_vasopressor: Optional[str], # Cardiovascular
gcs: Optional[float], # Neurological
creatinine: Optional[float] # Renal
) -> int:
"""
Sequential Organ Failure Assessment (SOFA) score.
    Used in the Sepsis-3 definition (an acute increase in SOFA of >= 2 points).
Range 0-24. Higher = more severe organ failure.
"""
score = 0
# Respiratory: PaO2/FiO2 ratio
if pao2_fio2 is not None:
if pao2_fio2 < 100: score += 4
elif pao2_fio2 < 200: score += 3
elif pao2_fio2 < 300: score += 2
elif pao2_fio2 < 400: score += 1
# Coagulation: Platelets (x10^3/uL)
if platelets is not None:
if platelets < 20: score += 4
elif platelets < 50: score += 3
elif platelets < 100: score += 2
elif platelets < 150: score += 1
# Liver: Bilirubin (mg/dL)
if bilirubin is not None:
if bilirubin >= 12.0: score += 4
elif bilirubin >= 6.0: score += 3
elif bilirubin >= 2.0: score += 2
elif bilirubin >= 1.2: score += 1
    # Cardiovascular: MAP / vasopressor support
    # Simplified: expects a coarse category string rather than actual drug doses
    if map_vasopressor is not None:
        cardio_points = {
            "map_lt_70": 1,               # MAP < 70 mmHg, no vasopressors
            "low_dose_vasopressor": 2,
            "moderate_dose_vasopressor": 3,
            "high_dose_vasopressor": 4,
        }
        score += cardio_points.get(map_vasopressor, 0)
    # Neurological: Glasgow Coma Scale
if gcs is not None:
if gcs < 6: score += 4
elif gcs < 10: score += 3
elif gcs < 13: score += 2
elif gcs < 15: score += 1
# Renal: Creatinine (mg/dL)
if creatinine is not None:
if creatinine >= 5.0: score += 4
elif creatinine >= 3.5: score += 3
elif creatinine >= 2.0: score += 2
elif creatinine >= 1.2: score += 1
return score
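A quick usage sketch for the pipeline above with a toy DataFrame (values are made up and purely illustrative):

```python
vitals = pd.DataFrame({
    "patient_id": ["p1"] * 4,
    "charttime": pd.to_datetime([
        "2023-01-01 06:00", "2023-01-01 08:00",
        "2023-01-01 10:00", "2023-01-01 11:30",
    ]),
    "vital_name": ["heart_rate"] * 4,
    "value": [88.0, 94.0, 110.0, 121.0],
})

config = TimeSeriesConfig(
    vital_columns=["heart_rate"],
    look_back_hours=[6, 12],
    aggregations=["mean", "max", "slope"],
)

feats = extract_vital_features(
    vitals,
    patient_id="p1",
    prediction_time=pd.Timestamp("2023-01-01 12:00"),
    config=config,
)
print(feats["heart_rate_6h_max"])    # 121.0
print(feats["heart_rate_6h_slope"])  # positive slope: heart rate is rising
```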
30-Day Readmission Prediction Model
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.calibration import calibration_curve
import lightgbm as lgb
from typing import List, Dict, Tuple, Optional
class ReadmissionDataset(Dataset):
"""
Dataset for 30-day hospital readmission prediction.
Features:
- Static: demographics, comorbidities, admission type
- Aggregated vitals: mean/min/max over last 24h before discharge
- Labs: last values before discharge, trend indicators
- Utilization: prior admission history
"""
def __init__(
self,
features: np.ndarray,
labels: np.ndarray,
feature_names: List[str]
):
assert len(features) == len(labels)
self.features = torch.tensor(features, dtype=torch.float32)
self.labels = torch.tensor(labels, dtype=torch.float32)
self.feature_names = feature_names
def __len__(self):
return len(self.labels)
def __getitem__(self, idx):
return self.features[idx], self.labels[idx]
def train_readmission_lgbm(
X_train: np.ndarray,
y_train: np.ndarray,
X_val: np.ndarray,
y_val: np.ndarray,
feature_names: List[str]
) -> lgb.Booster:
"""
Train a LightGBM model for readmission prediction.
LightGBM is often the best choice for tabular EHR features:
- Handles missing values natively (uses a learned split direction)
- Much faster than XGBoost on large datasets
- Built-in categorical feature handling
- Strong performance with less hyperparameter tuning
"""
# Handle class imbalance: readmission rate is typically 10-15%
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(f"Class imbalance ratio (neg:pos): {pos_weight:.1f}:1")
train_data = lgb.Dataset(
X_train, label=y_train, feature_name=feature_names
)
val_data = lgb.Dataset(
X_val, label=y_val, feature_name=feature_names, reference=train_data
)
params = {
'objective': 'binary',
'metric': ['auc', 'binary_logloss'],
'scale_pos_weight': pos_weight, # Handle class imbalance
# Regularization to prevent overfitting on small clinical datasets
'num_leaves': 31,
'min_child_samples': 50, # Minimum samples in leaf - important for medical data
'max_depth': 6,
'learning_rate': 0.05,
'n_estimators': 1000,
'feature_fraction': 0.8, # Column subsampling
'bagging_fraction': 0.8, # Row subsampling
'bagging_freq': 5,
'lambda_l1': 0.1,
'lambda_l2': 1.0,
'verbosity': -1,
'random_state': 42,
}
callbacks = [
lgb.early_stopping(stopping_rounds=50, verbose=True),
lgb.log_evaluation(period=50),
]
model = lgb.train(
params,
train_data,
valid_sets=[val_data],
num_boost_round=1000,
callbacks=callbacks,
)
# Evaluate
val_probs = model.predict(X_val)
auc = roc_auc_score(y_val, val_probs)
brier = brier_score_loss(y_val, val_probs)
print(f"\nValidation AUC: {auc:.4f}")
print(f"Validation Brier Score: {brier:.4f}")
# Feature importance analysis
importance_df = pd.DataFrame({
'feature': feature_names,
'importance': model.feature_importance(importance_type='gain')
}).sort_values('importance', ascending=False)
print("\nTop 15 Features by Gain Importance:")
print(importance_df.head(15).to_string(index=False))
return model
class PlattCalibrator:
"""
Post-hoc calibration using Platt scaling.
Trains a logistic regression on model output scores.
"""
def __init__(self):
from sklearn.linear_model import LogisticRegression
self.lr = LogisticRegression()
self.fitted = False
def fit(self, raw_scores: np.ndarray, labels: np.ndarray):
"""Fit calibration on a held-out calibration set."""
self.lr.fit(raw_scores.reshape(-1, 1), labels)
self.fitted = True
def predict_proba(self, raw_scores: np.ndarray) -> np.ndarray:
"""Return calibrated probabilities."""
assert self.fitted, "Must call fit() before predict_proba()"
return self.lr.predict_proba(raw_scores.reshape(-1, 1))[:, 1]
def evaluate_calibration(
self,
probs: np.ndarray,
labels: np.ndarray,
n_bins: int = 10
) -> Dict[str, float]:
"""Compute calibration metrics."""
fraction_pos, mean_pred = calibration_curve(
labels, probs, n_bins=n_bins
)
        # Expected Calibration Error (ECE): bin-size-weighted average absolute
        # gap between mean predicted probability and observed event rate.
        # Use the same uniform [0, 1] bins as calibration_curve and keep only
        # the non-empty bins so the arrays stay aligned.
        bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
        bin_sizes = np.histogram(probs, bins=bin_edges)[0]
        nonempty = bin_sizes > 0
        ece = np.sum(
            np.abs(fraction_pos - mean_pred) * bin_sizes[nonempty] / len(probs)
        )
return {
"ece": ece,
"mean_calibration_error": np.mean(np.abs(fraction_pos - mean_pred)),
"max_calibration_error": np.max(np.abs(fraction_pos - mean_pred)),
"brier_score": brier_score_loss(labels, probs),
}
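A sketch of wiring the calibrator in after training. In practice the raw scores would come from the LightGBM model on a held-out calibration split; here synthetic, deliberately overconfident scores stand in for them:

```python
rng = np.random.default_rng(42)
# Synthetic stand-in for (model scores, labels) on a held-out calibration set
cal_labels = rng.binomial(1, 0.12, size=5000)
cal_scores = np.clip(0.35 * cal_labels + rng.normal(0.25, 0.15, size=5000), 0.0, 1.0)

calibrator = PlattCalibrator()
calibrator.fit(cal_scores, cal_labels)
calibrated_probs = calibrator.predict_proba(cal_scores)

# For a real evaluation, compute these on a separate test split
print(calibrator.evaluate_calibration(cal_scores, cal_labels))       # before scaling
print(calibrator.evaluate_calibration(calibrated_probs, cal_labels)) # after scaling
```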
LSTM for ICU Mortality Prediction
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import numpy as np
from typing import List, Tuple, Optional
class ICUTimeSeriesDataset(Dataset):
"""
Dataset for ICU mortality prediction from multivariate time series.
Input: physiological measurements at each hour of ICU stay
Output: in-hospital mortality (binary)
Handles irregular sampling by using:
- Forward-fill for short gaps (< 4 hours)
- Missingness indicators for longer gaps
"""
def __init__(
self,
sequences: List[np.ndarray], # Each: (T, n_features)
labels: np.ndarray,
max_seq_len: int = 48, # 48 hours of history
n_features: int = 34 # Number of vital/lab features
):
self.max_seq_len = max_seq_len
self.n_features = n_features
self.labels = torch.tensor(labels, dtype=torch.float32)
# Pad or truncate sequences to max_seq_len
self.sequences = []
self.masks = []
for seq in sequences:
T = min(len(seq), max_seq_len)
padded = np.zeros((max_seq_len, n_features), dtype=np.float32)
mask = np.zeros(max_seq_len, dtype=np.float32)
padded[:T] = seq[:T]
mask[:T] = 1.0
self.sequences.append(padded)
self.masks.append(mask)
self.sequences = torch.tensor(np.array(self.sequences), dtype=torch.float32)
self.masks = torch.tensor(np.array(self.masks), dtype=torch.float32)
def __len__(self):
return len(self.labels)
def __getitem__(self, idx):
return self.sequences[idx], self.masks[idx], self.labels[idx]
class ICUMortalityLSTM(nn.Module):
"""
Bidirectional LSTM with temporal attention for ICU mortality prediction.
Architecture notes:
- Bidirectional: future context helps even in online prediction
because we use the full available history at prediction time
- Temporal attention: allows the model to focus on the most
clinically relevant time windows (e.g., the deterioration event)
    - Single sigmoid output head for mortality; the same encoder could be
      extended with additional heads (e.g., readmission) for multi-task training
"""
def __init__(
self,
n_features: int = 34,
hidden_dim: int = 256,
num_layers: int = 2,
dropout: float = 0.3,
bidirectional: bool = True,
):
super().__init__()
self.n_features = n_features
self.hidden_dim = hidden_dim
self.bidirectional = bidirectional
self.directions = 2 if bidirectional else 1
# Feature normalization layer (learns scale/shift per feature)
self.input_norm = nn.LayerNorm(n_features)
# LSTM
self.lstm = nn.LSTM(
input_size=n_features,
hidden_size=hidden_dim,
num_layers=num_layers,
batch_first=True,
dropout=dropout if num_layers > 1 else 0,
bidirectional=bidirectional
)
lstm_output_dim = hidden_dim * self.directions
# Temporal attention mechanism
self.attention = nn.Sequential(
nn.Linear(lstm_output_dim, 128),
nn.Tanh(),
nn.Linear(128, 1),
)
self.dropout = nn.Dropout(dropout)
# Classification head
self.classifier = nn.Sequential(
nn.Linear(lstm_output_dim, 128),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(128, 64),
nn.ReLU(),
nn.Linear(64, 1)
)
def forward(
self,
x: torch.Tensor, # (batch, T, n_features)
mask: torch.Tensor # (batch, T) - 1 for valid, 0 for padding
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Returns:
logit: (batch,) mortality prediction logit
attention_weights: (batch, T) for interpretability
"""
batch_size, T, _ = x.shape
# Normalize input features
x = self.input_norm(x)
# LSTM encoding
lstm_out, _ = self.lstm(x) # (batch, T, hidden*directions)
lstm_out = self.dropout(lstm_out)
# Temporal attention
attn_scores = self.attention(lstm_out).squeeze(-1) # (batch, T)
# Mask padding positions to -inf before softmax
mask_expanded = mask.bool()
attn_scores = attn_scores.masked_fill(~mask_expanded, float('-inf'))
attn_weights = F.softmax(attn_scores, dim=1) # (batch, T)
# Attention-weighted pooling
context = (lstm_out * attn_weights.unsqueeze(-1)).sum(dim=1) # (batch, hidden*directions)
# Classification
logit = self.classifier(context).squeeze(-1) # (batch,)
return logit, attn_weights
def train_icu_lstm(
model: ICUMortalityLSTM,
train_loader: DataLoader,
val_loader: DataLoader,
epochs: int = 30,
lr: float = 1e-3,
device: str = "cuda"
) -> ICUMortalityLSTM:
"""Train the ICU LSTM model."""
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
# Compute positive weight for class imbalance
# ICU mortality rate is roughly 10-15%
pos_weight = torch.tensor([6.0]).to(device) # ~85% neg, 15% pos
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
best_val_auc = 0.0
for epoch in range(epochs):
# Training
model.train()
total_loss = 0.0
for sequences, masks, labels in train_loader:
sequences = sequences.to(device)
masks = masks.to(device)
labels = labels.to(device)
optimizer.zero_grad()
logits, _ = model(sequences, masks)
loss = criterion(logits, labels)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()
# Validation
model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
for sequences, masks, labels in val_loader:
sequences, masks = sequences.to(device), masks.to(device)
logits, _ = model(sequences, masks)
probs = torch.sigmoid(logits).cpu().numpy()
all_preds.extend(probs)
all_labels.extend(labels.numpy())
from sklearn.metrics import roc_auc_score
val_auc = roc_auc_score(all_labels, all_preds)
scheduler.step()
print(f"Epoch {epoch+1}/{epochs}: Loss={total_loss/len(train_loader):.4f}, "
f"Val AUC={val_auc:.4f}")
if val_auc > best_val_auc:
best_val_auc = val_auc
torch.save(model.state_dict(), "best_icu_lstm.pt")
return model
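A smoke-test sketch for the training loop above using randomly generated sequences (shapes only - real inputs would come from the EHR feature pipeline):

```python
rng = np.random.default_rng(0)
T, F = 48, 34

def synthetic_split(n: int) -> ICUTimeSeriesDataset:
    seqs = [rng.normal(size=(int(rng.integers(12, T + 1)), F)).astype(np.float32)
            for _ in range(n)]
    labels = rng.binomial(1, 0.15, size=n).astype(np.float32)
    return ICUTimeSeriesDataset(seqs, labels, max_seq_len=T, n_features=F)

train_loader = DataLoader(synthetic_split(512), batch_size=64, shuffle=True)
val_loader = DataLoader(synthetic_split(128), batch_size=64)

model = ICUMortalityLSTM(n_features=F)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = train_icu_lstm(model, train_loader, val_loader, epochs=2, device=device)
```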
Survival Analysis with Cox Model
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.statistics import logrank_test
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from typing import List, Dict, Tuple, Optional
class ReadmissionSurvivalAnalysis:
"""
Survival analysis for hospital readmission.
Two components:
1. Kaplan-Meier curves for descriptive analysis
(how does survival differ across risk groups?)
2. Cox proportional hazards for multivariate adjustment
(which features independently predict time-to-readmission?)
"""
def __init__(self):
self.km_fitter = KaplanMeierFitter()
self.cox_model = CoxPHFitter(penalizer=0.1) # L2 regularization
def fit_kaplan_meier(
self,
df: pd.DataFrame,
duration_col: str = "days_to_readmission",
event_col: str = "readmitted_30d",
group_col: Optional[str] = None,
):
"""
Fit Kaplan-Meier survival curves.
Args:
duration_col: time to event or censoring (days)
event_col: 1 if event occurred, 0 if censored
group_col: if provided, plot separate curves per group
(e.g., high-risk vs low-risk patients)
"""
if group_col is None:
self.km_fitter.fit(
df[duration_col], event_observed=df[event_col]
)
median_survival = self.km_fitter.median_survival_time_
print(f"Median time to readmission: {median_survival:.1f} days")
return self.km_fitter
        # Fit stratified curves (call km.plot_survival_function() on each to plot)
groups = df[group_col].unique()
km_per_group = {}
for group in sorted(groups):
mask = df[group_col] == group
km = KaplanMeierFitter(label=f"{group_col}={group}")
km.fit(df.loc[mask, duration_col], event_observed=df.loc[mask, event_col])
km_per_group[group] = km
# Log-rank test for group difference
if len(groups) == 2:
g1, g2 = sorted(groups)
m1, m2 = df[group_col] == g1, df[group_col] == g2
result = logrank_test(
df.loc[m1, duration_col], df.loc[m2, duration_col],
event_observed_A=df.loc[m1, event_col],
event_observed_B=df.loc[m2, event_col]
)
print(f"Log-rank test p-value: {result.p_value:.4f}")
return km_per_group
def fit_cox_model(
self,
df: pd.DataFrame,
duration_col: str = "days_to_readmission",
event_col: str = "readmitted_30d",
feature_cols: List[str] = None
):
"""
Fit Cox proportional hazards model.
Returns hazard ratios for each feature with confidence intervals.
Hazard ratio > 1: feature increases risk (shortens time to readmission)
Hazard ratio < 1: feature decreases risk (protective)
"""
if feature_cols is None:
feature_cols = [c for c in df.columns if c not in [duration_col, event_col]]
# lifelines expects duration and event in the same DataFrame
cox_df = df[feature_cols + [duration_col, event_col]].copy()
cox_df = cox_df.dropna() # Cox model requires complete cases (or use imputation)
self.cox_model.fit(
cox_df,
duration_col=duration_col,
event_col=event_col,
)
self.cox_model.print_summary()
return self.cox_model
def compute_c_index(
self,
df: pd.DataFrame,
duration_col: str,
event_col: str,
feature_cols: List[str],
n_splits: int = 5
) -> Tuple[float, float]:
"""
Cross-validated concordance index (C-statistic) for Cox model.
C-index = probability that predictions correctly rank two patients.
Equivalent to AUC-ROC for survival models.
"""
from lifelines.utils import concordance_index
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
c_indices = []
cox_df = df[feature_cols + [duration_col, event_col]].dropna()
X = cox_df[feature_cols].values
T = cox_df[duration_col].values
E = cox_df[event_col].values
for train_idx, test_idx in kf.split(X):
# Train Cox model on fold
train_df = cox_df.iloc[train_idx]
test_df = cox_df.iloc[test_idx]
fold_cox = CoxPHFitter(penalizer=0.1)
fold_cox.fit(train_df, duration_col=duration_col, event_col=event_col)
# Predict risk scores on test fold
# Higher risk score = expected shorter time to event
risk_scores = fold_cox.predict_partial_hazard(test_df).values
c_idx = concordance_index(
test_df[duration_col].values,
-risk_scores, # Negate because concordance_index expects higher=longer survival
test_df[event_col].values
)
c_indices.append(c_idx)
mean_c = np.mean(c_indices)
std_c = np.std(c_indices)
print(f"Cross-validated C-index: {mean_c:.4f} +/- {std_c:.4f}")
return mean_c, std_c
Production Engineering Notes
Alert Fatigue - The Deployment Killer
The gap between a model performing well in benchmarks and being useful in production is often explained by alert fatigue. When a clinical decision support system fires too frequently, clinicians habituate to it and learn to ignore it. A study by Shah et al. (2020) found that physicians overrode 96% of drug-drug interaction alerts in a major EHR - the alerts were technically correct but clinically irrelevant in most contexts.
Strategies to combat alert fatigue:
Positive Predictive Value (PPV) targeting: do not choose your operating threshold by sensitivity alone. Target a minimum PPV of 20-30% for alerts that require clinical action. If only 5% of alerts represent true positive deterioration, you will overwhelm clinicians (see the threshold-selection sketch after this list).
Alert suppression rules: do not fire an alert for the same patient more than once per 4-hour window without a significant new signal. If a patient was assessed and documented as stable, suppress alerts for 2 hours. These rules reduce noise without systematically missing new events.
Contextual filtering: do not fire a sepsis alert on a patient who is already on 4 antibiotics and whose team has documented "known sepsis under treatment." NLP on the recent nursing notes can enable this context-aware suppression.
Tiered alert severity: not all alerts are equal. A confidence level of 90% warrants a pager message to the bedside physician. A confidence of 65% warrants a flag in the EMR for the next time a clinician opens the chart. Tiering based on confidence and severity drastically reduces the burden on clinicians.
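A minimal sketch of PPV-targeted threshold selection on a validation set, as referenced in the first strategy above; the 25% floor is an illustrative choice, not a clinical standard:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_min_ppv(y_true: np.ndarray, y_prob: np.ndarray,
                          min_ppv: float = 0.25) -> float:
    """Lowest threshold whose precision (PPV) meets the floor, i.e. the most
    sensitive operating point that still respects the alert-burden constraint."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    meets_floor = precision[:-1] >= min_ppv  # precision has one extra trailing point
    if not meets_floor.any():
        raise ValueError("No threshold achieves the requested PPV on this data")
    idx = int(np.argmax(meets_floor))  # first qualifying (lowest) threshold = highest recall
    return float(thresholds[idx])
```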
Temporal Validation - The Prospective Reality Check
A model validated only on retrospective data is not validated at all for production use. Two problems:
Label leakage: features derived from the medical record after the prediction time can leak into training. "Number of blood cultures ordered in the next 4 hours" is a perfect predictor of sepsis - it is also not available at prediction time. Careful temporal feature extraction (only use data from before the prediction timestamp) prevents this.
Distribution shift: the model's training distribution may not match the deployment distribution. A model trained on 2015-2018 data deployed in 2023 may perform poorly because patient demographics, illness severity, treatment protocols, and EHR documentation practices have all changed.
The prospective shadow mode deployment (running the model without firing alerts, collecting predictions alongside ground truth, comparing to retrospective validation AUC) is the minimum standard before any production deployment.
MIMIC-IV for Development - Practical Notes
MIMIC-IV (version 2.2) requires a PhysioNet account and completion of the CITI "Data or Specimens Only Research" course. Access is free after DUA signing.
The data are typically loaded into PostgreSQL. The MIMIC-IV demo (100 patients) is available without a DUA for initial development. Key tables:
mimiciv_hosp.admissions: one row per hospital admission
mimiciv_hosp.labevents: all lab values with timestamps
mimiciv_icu.chartevents: all ICU charted vitals (very large table, ~300M rows)
mimiciv_icu.icustays: ICU admission/discharge times
mimiciv_hosp.prescriptions: medication orders
Use the mimic-code repository on GitHub for pre-built SQL scripts that extract SOFA scores, APACHE scores, and standard cohorts.
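A sketch of pulling a starter cohort from a local PostgreSQL load of MIMIC-IV. The connection string is a placeholder, and the table/column names reflect MIMIC-IV v2.x as documented on PhysioNet - verify them against your installed schema:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mimiciv")  # placeholder

cohort_sql = """
SELECT i.subject_id,
       i.hadm_id,
       i.stay_id,
       i.intime,
       i.outtime,
       a.hospital_expire_flag AS in_hospital_mortality
FROM mimiciv_icu.icustays i
JOIN mimiciv_hosp.admissions a USING (hadm_id)
WHERE i.los >= 1.0   -- at least one full day of ICU data
"""

cohort = pd.read_sql(cohort_sql, engine)
print(f"{len(cohort)} ICU stays, mortality rate {cohort['in_hospital_mortality'].mean():.1%}")
```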
Common Mistakes
:::danger Time Leakage in Feature Engineering
The most common and most damaging bug in clinical prediction: including features computed from data that occurs after the prediction time. A 30-day readmission model that includes "number of ER visits in next 7 days" as a feature will appear to have AUC 0.99 and will be completely useless in production. This bug is easier to introduce than it sounds, especially when working with denormalized feature tables.
Fix: always track a prediction_time for each prediction row, and filter all feature computation to only use data with charttime < prediction_time. Write a test that verifies no feature in the training set uses future data by shuffling the timestamps and confirming performance drops to near random.
:::
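A sketch of the leakage check described in the fix. The bookkeeping DataFrame and its column names are hypothetical - the point is that the feature pipeline should emit, for every feature value, the timestamp of the raw record it was computed from:

```python
import pandas as pd

def assert_no_future_data(feature_sources: pd.DataFrame) -> None:
    """
    feature_sources: one row per (prediction row, raw record) pair used during
    feature computation, with columns ['prediction_time', 'source_charttime']
    (hypothetical bookkeeping emitted by the feature pipeline).
    """
    leaked = feature_sources[
        feature_sources["source_charttime"] >= feature_sources["prediction_time"]
    ]
    assert leaked.empty, (
        f"{len(leaked)} feature source records occur at or after prediction_time "
        "- time leakage in feature engineering"
    )
```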
:::danger Reporting AUC Without Calibration Metrics
AUC measures discrimination (can the model rank patients correctly?) but not calibration (are predicted probabilities accurate?). A model with excellent AUC but poor calibration is misleading to clinicians who interpret probabilities literally. "This patient has a 78% chance of readmission" from a poorly calibrated model may actually correspond to a 30% observed readmission rate.
Fix: always report Brier score and Expected Calibration Error alongside AUC. Always include calibration plots (reliability diagrams) in model evaluation reports. Apply Platt scaling or isotonic regression on a held-out calibration set before deployment.
:::
:::warning Using Random Split for Temporal EHR Data
EHR data has a natural time ordering. Using a random train/test split on temporal EHR data allows future data to inform past predictions (temporal leakage at the dataset level): the model can be trained on admissions from late in the study period and evaluated on earlier ones, even though treatment protocols and documentation practices evolved in between.
Fix: use temporal splitting: train on data from years 1-3, validate on year 4, test on year 5. This simulates the real deployment scenario: the model is trained on historical data and deployed prospectively.
:::
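A minimal sketch of the temporal split described in the fix, assuming an admit_year column on the cohort DataFrame:

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, year_col: str = "admit_year"):
    """Train on the earliest years, validate on the next year, test on the most recent."""
    years = sorted(df[year_col].unique())
    train = df[df[year_col].isin(years[:-2])]
    val = df[df[year_col] == years[-2]]
    test = df[df[year_col] == years[-1]]
    return train, val, test
```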
:::warning Ignoring Patient-Level Clustering in Train/Val Split
A single patient may have multiple hospital admissions. If admission 1 goes to train and admission 3 goes to validation for the same patient, the model sees the same patient in both splits. This inflates performance metrics because the model learns patient-specific patterns (which would not generalize to new patients).
Fix: split at the patient level, not the admission level. Use GroupKFold with patient ID as the group key. Report performance per-patient rather than per-admission.
:::
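A sketch of patient-level splitting with scikit-learn's GroupKFold, as described in the fix (the data here are synthetic placeholders):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_admissions = 1000
patient_ids = rng.integers(0, 400, size=n_admissions)  # some patients have several admissions
X = rng.normal(size=(n_admissions, 20))
y = rng.binomial(1, 0.12, size=n_admissions)

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=patient_ids)):
    # No patient appears on both sides of the split within a fold
    assert set(patient_ids[train_idx]).isdisjoint(patient_ids[val_idx])
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val admissions")
```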
Interview Questions and Answers
Q1: You build a sepsis prediction model with AUC 0.88 on the MIMIC-IV benchmark. How would you approach deploying it in a real ICU?
AUC is a starting point, not a finish line. The deployment journey from 0.88 AUC to a clinically useful system involves:
First, prospective shadow deployment: run the model live but suppress all alerts for 4-8 weeks. Collect predictions alongside documented clinical assessments and outcomes. Compare prospective AUC to the MIMIC benchmark. If performance drops by more than 5-8 points, investigate - the likely cause is distribution shift between MIMIC and the target institution.
Second, alert threshold selection: choose the operating threshold by targeting a clinically acceptable positive predictive value (at least 20-30%) rather than by sensitivity alone. Work with the clinical informatics team and frontline nurses to determine the maximum alert burden the workflow can absorb.
Third, workflow integration: where does the alert appear? On a monitor screen, in the EHR, as a pager message? What is the expected clinical action? Who should be notified? An alert without a clear prescribed action is noise.
Fourth, measurement framework: define prospectively what success means. Does the alert change clinical behavior (time to antibiotic administration, blood cultures ordered)? Does it improve outcomes (sepsis mortality, ICU length of stay)? Instrument these metrics before deployment.
Fifth, post-market monitoring: track alert firing rate, override rate, and outcome metrics weekly. Alert fatigue develops over 3-6 months as clinicians learn to ignore alerts. Catch it early.
Q2: What is the difference between model discrimination and calibration? Why does it matter for clinical prediction?
Discrimination (measured by AUC-ROC) is the model's ability to rank patients: given two patients, one who will be readmitted and one who will not, what is the probability the model assigns a higher risk score to the one who gets readmitted? A model with perfect discrimination always ranks true positives above true negatives. AUC = 1.0 means perfect discrimination.
Calibration measures whether the predicted probability is the true probability. If I have 100 patients each assigned a 30% readmission probability by the model, and 30 of them actually get readmitted, the model is perfectly calibrated at that probability level. Expected Calibration Error (ECE) quantifies the average magnitude of the gap between predicted and actual rates.
Clinical relevance: physicians make decisions based on predicted probabilities. A physician seeing "70% readmission risk" takes different action than "30% readmission risk." If the model's 70% corresponds to a true rate of 30% (overconfident), the physician is systematically over-treating. If 30% corresponds to a true rate of 70% (underconfident), they are systematically under-treating. Good discrimination with poor calibration produces a model that is useful for ranking patients (triaging) but not for absolute risk communication.
Q3: Explain the Cox proportional hazards model. What does the proportional hazards assumption mean and how would you test it?
The Cox model relates patient covariates to the hazard (instantaneous risk) of an event. The hazard at time $t$ for patient $i$ is $h_i(t) = h_0(t)\exp(\beta^\top x_i)$. The key elements: $h_0(t)$ is the baseline hazard (common to all patients), and $\exp(\beta^\top x_i)$ is the hazard ratio - how much higher or lower the patient's hazard is relative to baseline.
The proportional hazards (PH) assumption: the ratio of hazards between any two patients is constant over time. Patient A's hazard is always twice patient B's hazard, at every point in time. This means that the effect of covariates on hazard does not change over time.
Testing PH assumption: (1) Schoenfeld residuals test - plot scaled Schoenfeld residuals versus time for each covariate; a non-zero slope indicates a time-varying effect. (2) log(-log(S(t))) vs log(t) plot for categorical variables - parallel lines indicate proportional hazards. (3) Include covariate-by-time interaction terms; a significant interaction rejects PH.
If PH assumption is violated for a covariate: use time-varying covariates, stratify on that variable (allowing a different baseline hazard per stratum), or use a parametric survival model (Weibull, log-normal) instead.
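A sketch of the Schoenfeld-residual check from the testing step above, using lifelines' built-in helper (the DataFrame here is synthetic, only to make the snippet self-contained):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 1500
df = pd.DataFrame({
    "age": rng.normal(65, 12, n),
    "charlson_index": rng.integers(0, 8, n).astype(float),
    "days_to_readmission": np.clip(rng.exponential(20, n), 1, 30),
    "readmitted_30d": rng.binomial(1, 0.4, n),
})

cph = CoxPHFitter(penalizer=0.1)
cph.fit(df, duration_col="days_to_readmission", event_col="readmitted_30d")

# Scaled Schoenfeld residual test per covariate: flags covariates whose effect
# appears to change over time, i.e. a proportional hazards violation
cph.check_assumptions(df, p_value_threshold=0.05, show_plots=False)
```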
Q4: How do you handle extreme class imbalance in 30-day readmission prediction where only 12% of patients are readmitted?
The 12% positive rate is moderate imbalance - worse than balanced but not extreme. Several strategies, in order of practical impact:
Scale the loss: scale_pos_weight = (100-12)/12 = 7.3 in LightGBM/XGBoost penalizes missing a readmission 7.3x more than a false positive. In expectation this resembles random oversampling of the minority class, but it is simpler and avoids generating synthetic patients.
Probability threshold tuning: the default classification threshold of 0.5 is rarely appropriate. At 12% prevalence, a 0.5 threshold classifies nearly everything as negative. Use the Youden index (J = sensitivity + specificity - 1) to find the threshold that best balances sensitivity and specificity on the validation set (see the sketch after this answer).
Evaluate with the right metrics: accuracy is nearly useless at 12% prevalence. Use the F1 score (harmonic mean of precision and recall), AUPRC (area under the precision-recall curve, which is more informative than AUROC for imbalanced data), and sensitivity at a clinically relevant specificity level.
Avoid SMOTE for tree-based models: SMOTE (synthetic oversampling) is often recommended for imbalanced data but its benefit is unclear for gradient boosted trees which naturally handle imbalance through class weights. Reserve SMOTE for neural networks if class weights alone are insufficient.
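A sketch of Youden-index threshold selection, as referenced in the threshold-tuning point above:

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    """Threshold maximizing Youden's J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    j = tpr - fpr  # specificity = 1 - FPR, so J reduces to TPR - FPR
    return float(thresholds[int(np.argmax(j))])
```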
Q5: A clinical prediction model performs AUC 0.87 on internal validation but only 0.74 when deployed at a different hospital. What are the likely causes and how do you investigate?
A 13-point AUC drop on external deployment is large and indicates substantial dataset shift (covariate shift, label shift, or both). Investigation approach:
Check population demographics: compare age distribution, sex ratio, racial composition, admission type mix (elective vs emergency), and disease severity indices (APACHE scores) between training data and deployment site. Any systematic difference in case mix is a suspect.
Check feature distributions: for each feature in the model, compare the distribution at training site versus deployment site. Use statistical tests (KS test for continuous, chi-squared for categorical). Features with p < 0.05 are shifted. Vital sign distributions may differ because of different monitoring protocols. Lab value distributions may differ because of different analyzer calibration standards.
Check label definitions: is "readmission" defined identically? Training data may have included all-cause readmissions, but the deployment site may only capture readmissions within their health system, missing readmissions to competitor hospitals.
Check time granularity: if the training site had vitals charted every 15 minutes (ICU) but the deployment site charts every 4 hours (general ward), features like "standard deviation of HR over past 2 hours" will have very different meaning.
Remediation options: re-train on combined data from both sites. Apply domain adaptation (reweighting training samples to better match the target distribution). Add site-specific calibration with a small labeled dataset from the deployment site.
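A sketch of the per-feature distribution comparison described above; train_df and deploy_df are assumed to be DataFrames with identical feature columns:

```python
import pandas as pd
from scipy.stats import ks_2samp

def shifted_features(train_df: pd.DataFrame, deploy_df: pd.DataFrame,
                     alpha: float = 0.05) -> pd.DataFrame:
    """Two-sample KS test per numeric feature; returns shifted features ranked by KS statistic."""
    rows = []
    for col in train_df.select_dtypes(include="number").columns:
        stat, p = ks_2samp(train_df[col].dropna(), deploy_df[col].dropna())
        rows.append({"feature": col, "ks_stat": stat, "p_value": p})
    result = pd.DataFrame(rows).sort_values("ks_stat", ascending=False)
    return result[result["p_value"] < alpha]
```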
This lesson is part of the Applied AI - AI in Healthcare module. Next: Radiology AI in Production.
