Patient Outcome Prediction
The Sepsis Alert That Cried Wolf
At a 600-bed teaching hospital in Pittsburgh, the clinical informatics team deployed a sepsis prediction model in 2019. The model was a gradient boosted tree trained on vital signs, lab values, and nursing assessments. Internal validation looked excellent: AUC 0.87, sensitivity 82% at the chosen operating threshold.
Six months after deployment, nursing leadership asked to shut it down.
The alert firing rate was 2.3 alerts per patient per day. On a typical 30-bed ICU, that was 69 alerts every 24 hours. Nurses were spending an average of 8 minutes per alert - reviewing the patient, checking vitals, calling the resident. This was consuming over 9 hours of nursing time per day from a single model. And because most alerts did not correspond to actual clinical deterioration, nurses began acknowledging them without acting. The sensitivity of the alert was mathematically 82%, but the effective sensitivity - the probability that a nurse would respond meaningfully - had dropped to near zero through alert fatigue.
This story has been repeated at dozens of health systems across the United States. Epic Systems, whose clinical software runs in roughly 40% of US hospitals, has deployed its Deterioration Index model to thousands of hospitals. Studies have found that the alert firing rate varies enormously across institutions, and the correlation between alert response and actual patient outcomes is weak in many settings.
The fundamental engineering challenge in clinical prediction is not building a model with high AUC. It is building a model that fires the right number of alerts, at the right time, with the right contextual information, to drive the right clinical action. This requires thinking about the entire sociotechnical system - not just the model's ROC curve.
Patient outcome prediction is one of the most mature applications of ML in healthcare. The MIMIC dataset has enabled hundreds of publications on ICU outcome prediction. Sepsis prediction, readmission risk, and in-hospital mortality scoring models are deployed in real clinical systems at scale. The lessons from these deployments are as instructive as the technical foundations.
The gap between academic benchmark performance and real-world clinical impact is larger in this domain than almost any other in applied AI. Understanding why is as important as understanding how to build the models.
Why This Exists - The Clinical Decision Support Gap
Clinicians make decisions under extreme uncertainty. On a 20-patient ICU floor, multiple patients are on ventilators, continuous infusions, and multiple monitoring lines. The human brain cannot continuously track all physiological trends across all patients simultaneously. Pattern recognition fatigue sets in within hours.
The classic evidence base for clinical prediction tools comes from scoring systems developed in the 1980s and 1990s: APACHE II for ICU severity, SOFA for organ failure assessment, CURB-65 for pneumonia severity. These were derived from logistic regression on paper chart data, with at most a handful of variables. They are still widely used because they have been validated prospectively across thousands of patients and clinicians trust them.
The hypothesis behind ML-based prediction: if simple logistic regression on 5-6 variables gives AUC 0.75-0.80, a gradient boosting model on hundreds of features should give AUC 0.85-0.90 or better. This hypothesis has been largely confirmed in retrospective studies. The question is whether better retrospective performance translates to better clinical outcomes when deployed prospectively.
Historical Context
Clinical prediction rules based on bedside observations date to the mid-20th century. The Apgar Score (1952) quantified neonatal condition with 5 variables on a 0-10 scale. The APACHE scoring system (Knaus et al., 1981) formalized ICU severity of illness with physiological measurements. These were the foundations of prognostic scoring in critical care.
The shift toward ML-based prediction began with the availability of large electronic clinical databases. MIMIC-II was released in 2011, providing longitudinal ICU data from over 25,000 patients. This enabled the first systematic ML studies of ICU outcomes at a scale that the earlier chart-review-derived scoring systems could never have reached.
The 2014 i2b2 challenge on heart disease risk from clinical notes, and the 2016 "Big Data in Intensive Care Medicine" roadmap paper by Celi et al., crystallized the research agenda. By 2018, multiple papers demonstrated LSTM models on MIMIC vital sign time series achieving AUC above 0.88 for in-hospital mortality, significantly outperforming APACHE IV.
Meanwhile, Sepsis-3 (Singer et al., JAMA 2016) redefined what sepsis means, complicating development of ML-based sepsis early-warning models (such as Henry et al.'s TREWScore): if the definition of the outcome changes, every pre-2016 sepsis prediction model is predicting something slightly different from the current clinical standard.
The Epic Deterioration Index, deployed to hundreds of Epic hospitals, is the most widely deployed clinical prediction model in history. Published in JAMIA in 2020 (Downing et al.), it uses logistic regression with L1 regularization on 29 features. The simplicity was intentional: interpretability and workflow integration took precedence over maximum AUC.
Core Concepts
Feature Engineering from EHR Data
Raw EHR data is not model-ready. Transforming it into useful features requires domain knowledge combined with practical data engineering.
Vital signs: heart rate, blood pressure (systolic, diastolic, mean), respiratory rate, oxygen saturation, temperature. These are often recorded at irregular intervals - every 15 minutes in a monitored ICU, or once per shift on a general ward. Missing data is informative: a missing vital sign often means "patient is stable and not requiring monitoring" or alternatively "the nurse was too busy to record it."
Lab values: complete blood count (CBC), basic metabolic panel (BMP), lactate, procalcitonin, troponin, coagulation studies (PT/INR). Lab values are even more irregularly sampled - a critically ill patient may have arterial blood gas measured every hour; a stable post-surgical patient may have a CBC once daily.
Medications: vasopressor administration (norepinephrine, vasopressin) indicates circulatory shock. Antibiotics initiation may indicate suspected infection. Insulin infusions indicate metabolic derangement or stress hyperglycemia.
Demographics: age, sex, BMI, admission type (elective, emergency, medical, surgical), admission diagnosis.
Prior utilization: number of previous hospitalizations, prior ICU admissions, comorbidity burden (Charlson Comorbidity Index).
Temporal features: time since admission, time of day (circadian effects on physiology), day of week (weekend effect on staffing and outcomes is real).
Feature engineering strategies:
For time series (vitals, labs): compute summary statistics over rolling windows - mean, min, max, standard deviation, slope (trend), time-since-last-measurement, number of measurements in window. A falling mean arterial pressure over the past 6 hours with increasing lactate is a very different signal than a stable low MAP.
For missing values: do not impute naively with mean/median. In clinical data, missingness is often informative. Strategies: forward-fill (carry last known value forward), multiple imputation, or explicit missingness indicators (a binary feature "lactate was measured in last 4 hours" captures whether clinicians ordered the test, which is itself a clinical decision).
LSTM for Vital Sign Time Series
The structured temporal nature of ICU physiological data - measurements arriving at irregular intervals over hours to days - makes it a natural fit for recurrent neural networks.
An LSTM processes a sequence $x_1, x_2, \ldots, x_T$, where each $x_t$ is a feature vector (vital signs + labs at time $t$). The LSTM maintains a hidden state $h_t$ that summarizes all information up to time $t$:

$$h_t = \mathrm{LSTM}(x_t, h_{t-1})$$

The final hidden state $h_T$ (or an attention-weighted combination of all hidden states) is used for classification or regression. For mortality prediction, it is passed through a sigmoid layer to produce a probability in $[0, 1]$.
For irregular time series, a simple modification: include time-since-last-observation as an additional input feature at each timestep, or use an ODE-based model (Neural ODE, GRU-ODE-Bayes) that explicitly models continuous dynamics between observations.
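A minimal sketch of the first option - augmenting each timestep with an observation indicator and a time-since-last-observation channel (GRU-D-style inputs). The function name and the hourly-binning assumption are illustrative:

```python
import numpy as np

def add_delta_time_channels(values: np.ndarray, observed: np.ndarray) -> np.ndarray:
    """
    values:   (T, F) forward-filled measurements, binned to hourly steps
    observed: (T, F) binary mask, 1 where a real measurement occurred at that step
    Returns a (T, 3F) array of [value, observed, hours_since_last_observation]
    per feature, ready to feed to an LSTM/GRU.
    """
    T, F = values.shape
    delta = np.zeros((T, F), dtype=np.float32)
    for t in range(1, T):
        # Gap resets after an observed step, otherwise keeps growing by one hour
        delta[t] = np.where(observed[t - 1] == 1, 1.0, delta[t - 1] + 1.0)
    return np.concatenate([values, observed, delta], axis=1)
```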
Survival Analysis - When, Not Just If
Binary classification (will this patient be readmitted?) ignores timing. Readmission at 3 days is clinically very different from readmission at 29 days. Survival analysis answers the question: when will the event occur?
Kaplan-Meier curves are non-parametric estimates of the survival function $S(t)$ - the probability that the event has not occurred by time $t$. The KM estimator handles censored observations (patients who left the study without experiencing the event):

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$

where $d_i$ is the number of events at time $t_i$ and $n_i$ is the number of patients at risk just before $t_i$.
The Cox proportional hazards model extends this to include covariates. The hazard function for patient $i$ is:

$$h_i(t) = h_0(t)\, \exp(\beta^\top x_i)$$

where $h_0(t)$ is the baseline hazard (non-parametric, estimated from the data), $x_i$ is the feature vector, and $\beta$ are coefficients estimated by maximizing the partial likelihood. The proportional hazards assumption: the ratio of hazards between any two patients is constant over time, determined entirely by their covariates.
The Cox model outputs a risk score for each patient, proportional to their relative hazard. Concordance index (C-statistic) measures how often the model correctly ranks pairs of patients by survival time - analogous to AUC-ROC for classification.
For deep learning survival models: DeepSurv (Katzman et al., 2018) replaces the linear predictor $\beta^\top x$ with a deep neural network, capturing non-linear covariate effects while preserving the Cox framework. The partial likelihood loss is differentiable and can be optimized with SGD.
Calibration - Are Predicted Probabilities Trustworthy?
A model with AUC 0.90 can still be terribly calibrated. Calibration measures whether predicted probabilities match observed event rates. If the model says "40% probability of readmission" for a group of patients, roughly 40% of those patients should actually be readmitted.
Perfect calibration: for every predicted probability $p$, $P(y = 1 \mid \hat{p} = p) = p$.
The Brier score measures calibration alongside discrimination:

$$\text{Brier} = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)^2$$

where $y_i \in \{0, 1\}$ is the observed outcome and $\hat{p}_i$ is the predicted probability for patient $i$.
The calibration plot (reliability diagram) is the standard visualization: bin predictions by decile, plot mean predicted probability vs observed event rate. A perfectly calibrated model lies on the diagonal.
Common calibration problem in clinical ML: models trained on imbalanced data with oversampling or class weights tend to be overconfident - they predict higher probabilities than are empirically observed. Platt scaling (logistic regression on the raw output scores) and isotonic regression are standard post-hoc calibration methods.
Why calibration matters clinically: if a physician sees "85% probability of sepsis," they expect that 85% of patients with this alert actually develop sepsis. If only 20% do (poor calibration), the physician quickly learns not to trust the probability, and the model becomes less useful than a simple threshold alert.
Code Examples
EHR Feature Engineering Pipeline
import pandas as pd
import numpy as np
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass
@dataclass
class TimeSeriesConfig:
"""Configuration for vital sign feature extraction."""
vital_columns: List[str]
look_back_hours: List[int] # e.g., [2, 6, 12, 24]
aggregations: List[str] # e.g., ['mean', 'min', 'max', 'std', 'slope']
def extract_vital_features(
vitals_df: pd.DataFrame,
patient_id: str,
prediction_time: pd.Timestamp,
config: TimeSeriesConfig
) -> Dict[str, float]:
"""
Extract summary features from vital sign time series
for a given patient at a given prediction time.
Args:
vitals_df: DataFrame with columns [patient_id, charttime, vital_name, value]
patient_id: Patient identifier
prediction_time: The time at which we are making the prediction
config: Feature extraction configuration
Returns:
Dictionary of feature_name -> value
"""
features = {}
# Filter to this patient's vitals before prediction time
patient_vitals = vitals_df[
(vitals_df['patient_id'] == patient_id) &
(vitals_df['charttime'] <= prediction_time)
].copy()
for vital in config.vital_columns:
vital_data = patient_vitals[patient_vitals['vital_name'] == vital].copy()
for hours in config.look_back_hours:
window_start = prediction_time - pd.Timedelta(hours=hours)
window_data = vital_data[vital_data['charttime'] >= window_start]
# Missing data features - is this vital being monitored?
features[f"{vital}_{hours}h_count"] = len(window_data)
features[f"{vital}_{hours}h_has_data"] = int(len(window_data) > 0)
if len(window_data) == 0:
# No data in window - fill with NaN, handle downstream
for agg in config.aggregations:
features[f"{vital}_{hours}h_{agg}"] = np.nan
features[f"{vital}_{hours}h_hours_since"] = hours # max lookback
continue
values = window_data['value'].values
            # Aggregations over the look-back window
for agg in config.aggregations:
if agg == 'mean':
features[f"{vital}_{hours}h_mean"] = np.mean(values)
elif agg == 'min':
features[f"{vital}_{hours}h_min"] = np.min(values)
elif agg == 'max':
features[f"{vital}_{hours}h_max"] = np.max(values)
elif agg == 'std':
features[f"{vital}_{hours}h_std"] = np.std(values) if len(values) > 1 else 0.0
elif agg == 'slope':
if len(values) > 1:
# Linear trend over the window
times_hours = (
window_data['charttime'].values.astype(np.int64) / 1e9 / 3600
)
times_normalized = times_hours - times_hours[0]
slope = np.polyfit(times_normalized, values, 1)[0]
features[f"{vital}_{hours}h_slope"] = slope
else:
features[f"{vital}_{hours}h_slope"] = 0.0
# Time since most recent measurement
most_recent = vital_data['charttime'].max()
hours_since = (prediction_time - most_recent).total_seconds() / 3600
features[f"{vital}_{hours}h_hours_since_last"] = min(hours_since, hours)
return features
def compute_shock_index(sbp: float, hr: float) -> float:
"""
Shock Index = Heart Rate / Systolic BP.
SI > 0.9 is associated with hemodynamic instability.
SI > 1.0 is associated with major adverse outcomes.
"""
    if sbp is None or hr is None or sbp <= 0:
        return np.nan
return hr / sbp
def compute_sofa_score(
pao2_fio2: Optional[float], # Respiratory component
platelets: Optional[float], # Coagulation
bilirubin: Optional[float], # Liver
map_vasopressor: Optional[str], # Cardiovascular
gcs: Optional[float], # Neurological
creatinine: Optional[float] # Renal
) -> int:
"""
Sequential Organ Failure Assessment (SOFA) score.
    Used in the Sepsis-3 definition (an acute increase in SOFA of >= 2 points).
Range 0-24. Higher = more severe organ failure.
"""
score = 0
# Respiratory: PaO2/FiO2 ratio
if pao2_fio2 is not None:
if pao2_fio2 < 100: score += 4
elif pao2_fio2 < 200: score += 3
elif pao2_fio2 < 300: score += 2
elif pao2_fio2 < 400: score += 1
# Coagulation: Platelets (x10^3/uL)
if platelets is not None:
if platelets < 20: score += 4
elif platelets < 50: score += 3
elif platelets < 100: score += 2
elif platelets < 150: score += 1
# Liver: Bilirubin (mg/dL)
if bilirubin is not None:
if bilirubin >= 12.0: score += 4
elif bilirubin >= 6.0: score += 3
elif bilirubin >= 2.0: score += 2
elif bilirubin >= 1.2: score += 1
    # Cardiovascular: MAP / vasopressor support
    # Simplified: expects a coarse category string rather than actual drug doses
    if map_vasopressor is not None:
        cardio_points = {
            "map_lt_70": 1,               # MAP < 70 mmHg, no vasopressors
            "low_dose_vasopressor": 2,
            "moderate_dose_vasopressor": 3,
            "high_dose_vasopressor": 4,
        }
        score += cardio_points.get(map_vasopressor, 0)
    # Neurological: Glasgow Coma Scale
if gcs is not None:
if gcs < 6: score += 4
elif gcs < 10: score += 3
elif gcs < 13: score += 2
elif gcs < 15: score += 1
# Renal: Creatinine (mg/dL)
if creatinine is not None:
if creatinine >= 5.0: score += 4
elif creatinine >= 3.5: score += 3
elif creatinine >= 2.0: score += 2
elif creatinine >= 1.2: score += 1
return score
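A quick usage sketch for the pipeline above with a toy DataFrame (values are made up and purely illustrative):

```python
vitals = pd.DataFrame({
    "patient_id": ["p1"] * 4,
    "charttime": pd.to_datetime([
        "2023-01-01 06:00", "2023-01-01 08:00",
        "2023-01-01 10:00", "2023-01-01 11:30",
    ]),
    "vital_name": ["heart_rate"] * 4,
    "value": [88.0, 94.0, 110.0, 121.0],
})

config = TimeSeriesConfig(
    vital_columns=["heart_rate"],
    look_back_hours=[6, 12],
    aggregations=["mean", "max", "slope"],
)

feats = extract_vital_features(
    vitals,
    patient_id="p1",
    prediction_time=pd.Timestamp("2023-01-01 12:00"),
    config=config,
)
print(feats["heart_rate_6h_max"])    # 121.0
print(feats["heart_rate_6h_slope"])  # positive slope: heart rate is rising
```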
30-Day Readmission Prediction Model
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.calibration import calibration_curve
import lightgbm as lgb
from typing import List, Dict, Tuple, Optional
class ReadmissionDataset(Dataset):
"""
Dataset for 30-day hospital readmission prediction.
Features:
- Static: demographics, comorbidities, admission type
- Aggregated vitals: mean/min/max over last 24h before discharge
- Labs: last values before discharge, trend indicators
- Utilization: prior admission history
"""
def __init__(
self,
features: np.ndarray,
labels: np.ndarray,
feature_names: List[str]
):
assert len(features) == len(labels)
self.features = torch.tensor(features, dtype=torch.float32)
self.labels = torch.tensor(labels, dtype=torch.float32)
self.feature_names = feature_names
def __len__(self):
return len(self.labels)
def __getitem__(self, idx):
return self.features[idx], self.labels[idx]
def train_readmission_lgbm(
X_train: np.ndarray,
y_train: np.ndarray,
X_val: np.ndarray,
y_val: np.ndarray,
feature_names: List[str]
) -> lgb.Booster:
"""
Train a LightGBM model for readmission prediction.
LightGBM is often the best choice for tabular EHR features:
- Handles missing values natively (uses a learned split direction)
- Much faster than XGBoost on large datasets
- Built-in categorical feature handling
- Strong performance with less hyperparameter tuning
"""
# Handle class imbalance: readmission rate is typically 10-15%
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(f"Class imbalance ratio (neg:pos): {pos_weight:.1f}:1")
train_data = lgb.Dataset(
X_train, label=y_train, feature_name=feature_names
)
val_data = lgb.Dataset(
X_val, label=y_val, feature_name=feature_names, reference=train_data
)
params = {
'objective': 'binary',
'metric': ['auc', 'binary_logloss'],
'scale_pos_weight': pos_weight, # Handle class imbalance
# Regularization to prevent overfitting on small clinical datasets
'num_leaves': 31,
'min_child_samples': 50, # Minimum samples in leaf - important for medical data
'max_depth': 6,
'learning_rate': 0.05,
'n_estimators': 1000,
'feature_fraction': 0.8, # Column subsampling
'bagging_fraction': 0.8, # Row subsampling
'bagging_freq': 5,
'lambda_l1': 0.1,
'lambda_l2': 1.0,
'verbosity': -1,
'random_state': 42,
}
callbacks = [
lgb.early_stopping(stopping_rounds=50, verbose=True),
lgb.log_evaluation(period=50),
]
model = lgb.train(
params,
train_data,
valid_sets=[val_data],
num_boost_round=1000,
callbacks=callbacks,
)
# Evaluate
val_probs = model.predict(X_val)
auc = roc_auc_score(y_val, val_probs)
brier = brier_score_loss(y_val, val_probs)
print(f"\nValidation AUC: {auc:.4f}")
print(f"Validation Brier Score: {brier:.4f}")
# Feature importance analysis
importance_df = pd.DataFrame({
'feature': feature_names,
'importance': model.feature_importance(importance_type='gain')
}).sort_values('importance', ascending=False)
print("\nTop 15 Features by Gain Importance:")
print(importance_df.head(15).to_string(index=False))
return model
class PlattCalibrator:
"""
Post-hoc calibration using Platt scaling.
Trains a logistic regression on model output scores.
"""
def __init__(self):
from sklearn.linear_model import LogisticRegression
self.lr = LogisticRegression()
self.fitted = False
def fit(self, raw_scores: np.ndarray, labels: np.ndarray):
"""Fit calibration on a held-out calibration set."""
self.lr.fit(raw_scores.reshape(-1, 1), labels)
self.fitted = True
def predict_proba(self, raw_scores: np.ndarray) -> np.ndarray:
"""Return calibrated probabilities."""
assert self.fitted, "Must call fit() before predict_proba()"
return self.lr.predict_proba(raw_scores.reshape(-1, 1))[:, 1]
def evaluate_calibration(
self,
probs: np.ndarray,
labels: np.ndarray,
n_bins: int = 10
) -> Dict[str, float]:
"""Compute calibration metrics."""
fraction_pos, mean_pred = calibration_curve(
labels, probs, n_bins=n_bins
)
        # Expected Calibration Error (ECE): bin-size-weighted average absolute
        # gap between mean predicted probability and observed event rate.
        # Use the same uniform [0, 1] bins as calibration_curve and keep only
        # the non-empty bins so the arrays stay aligned.
        bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
        bin_sizes = np.histogram(probs, bins=bin_edges)[0]
        nonempty = bin_sizes > 0
        ece = np.sum(
            np.abs(fraction_pos - mean_pred) * bin_sizes[nonempty] / len(probs)
        )
return {
"ece": ece,
"mean_calibration_error": np.mean(np.abs(fraction_pos - mean_pred)),
"max_calibration_error": np.max(np.abs(fraction_pos - mean_pred)),
"brier_score": brier_score_loss(labels, probs),
}
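A sketch of wiring the calibrator in after training. In practice the raw scores would come from the LightGBM model on a held-out calibration split; here synthetic, deliberately overconfident scores stand in for them:

```python
rng = np.random.default_rng(42)
# Synthetic stand-in for (model scores, labels) on a held-out calibration set
cal_labels = rng.binomial(1, 0.12, size=5000)
cal_scores = np.clip(0.35 * cal_labels + rng.normal(0.25, 0.15, size=5000), 0.0, 1.0)

calibrator = PlattCalibrator()
calibrator.fit(cal_scores, cal_labels)
calibrated_probs = calibrator.predict_proba(cal_scores)

# For a real evaluation, compute these on a separate test split
print(calibrator.evaluate_calibration(cal_scores, cal_labels))       # before scaling
print(calibrator.evaluate_calibration(calibrated_probs, cal_labels)) # after scaling
```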
LSTM for ICU Mortality Prediction
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import numpy as np
from typing import List, Tuple, Optional
class ICUTimeSeriesDataset(Dataset):
"""
Dataset for ICU mortality prediction from multivariate time series.
Input: physiological measurements at each hour of ICU stay
Output: in-hospital mortality (binary)
Handles irregular sampling by using:
- Forward-fill for short gaps (< 4 hours)
- Missingness indicators for longer gaps
"""
def __init__(
self,
sequences: List[np.ndarray], # Each: (T, n_features)
labels: np.ndarray,
max_seq_len: int = 48, # 48 hours of history
n_features: int = 34 # Number of vital/lab features
):
self.max_seq_len = max_seq_len
self.n_features = n_features
self.labels = torch.tensor(labels, dtype=torch.float32)
# Pad or truncate sequences to max_seq_len
self.sequences = []
self.masks = []
for seq in sequences:
T = min(len(seq), max_seq_len)
padded = np.zeros((max_seq_len, n_features), dtype=np.float32)
mask = np.zeros(max_seq_len, dtype=np.float32)
padded[:T] = seq[:T]
mask[:T] = 1.0
self.sequences.append(padded)
self.masks.append(mask)
self.sequences = torch.tensor(np.array(self.sequences), dtype=torch.float32)
self.masks = torch.tensor(np.array(self.masks), dtype=torch.float32)
def __len__(self):
return len(self.labels)
def __getitem__(self, idx):
return self.sequences[idx], self.masks[idx], self.labels[idx]
class ICUMortalityLSTM(nn.Module):
"""
Bidirectional LSTM with temporal attention for ICU mortality prediction.
Architecture notes:
- Bidirectional: future context helps even in online prediction
because we use the full available history at prediction time
- Temporal attention: allows the model to focus on the most
clinically relevant time windows (e.g., the deterioration event)
    - Single sigmoid output head for mortality; the same encoder could be
      extended with additional heads (e.g., readmission) for multi-task training
"""
def __init__(
self,
n_features: int = 34,
hidden_dim: int = 256,
num_layers: int = 2,
dropout: float = 0.3,
bidirectional: bool = True,
):
super().__init__()
self.n_features = n_features
self.hidden_dim = hidden_dim
self.bidirectional = bidirectional
self.directions = 2 if bidirectional else 1
# Feature normalization layer (learns scale/shift per feature)
self.input_norm = nn.LayerNorm(n_features)
# LSTM
self.lstm = nn.LSTM(
input_size=n_features,
hidden_size=hidden_dim,
num_layers=num_layers,
batch_first=True,
dropout=dropout if num_layers > 1 else 0,
bidirectional=bidirectional
)
lstm_output_dim = hidden_dim * self.directions
# Temporal attention mechanism
self.attention = nn.Sequential(
nn.Linear(lstm_output_dim, 128),
nn.Tanh(),
nn.Linear(128, 1),
)
self.dropout = nn.Dropout(dropout)
# Classification head
self.classifier = nn.Sequential(
nn.Linear(lstm_output_dim, 128),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(128, 64),
nn.ReLU(),
nn.Linear(64, 1)
)
def forward(
self,
x: torch.Tensor, # (batch, T, n_features)
mask: torch.Tensor # (batch, T) - 1 for valid, 0 for padding
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Returns:
logit: (batch,) mortality prediction logit
attention_weights: (batch, T) for interpretability
"""
batch_size, T, _ = x.shape
# Normalize input features
x = self.input_norm(x)
# LSTM encoding
lstm_out, _ = self.lstm(x) # (batch, T, hidden*directions)
lstm_out = self.dropout(lstm_out)
# Temporal attention
attn_scores = self.attention(lstm_out).squeeze(-1) # (batch, T)
# Mask padding positions to -inf before softmax
mask_expanded = mask.bool()
attn_scores = attn_scores.masked_fill(~mask_expanded, float('-inf'))
attn_weights = F.softmax(attn_scores, dim=1) # (batch, T)
# Attention-weighted pooling
context = (lstm_out * attn_weights.unsqueeze(-1)).sum(dim=1) # (batch, hidden*directions)
# Classification
logit = self.classifier(context).squeeze(-1) # (batch,)
return logit, attn_weights
def train_icu_lstm(
model: ICUMortalityLSTM,
train_loader: DataLoader,
val_loader: DataLoader,
epochs: int = 30,
lr: float = 1e-3,
device: str = "cuda"
) -> ICUMortalityLSTM:
"""Train the ICU LSTM model."""
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
# Compute positive weight for class imbalance
# ICU mortality rate is roughly 10-15%
pos_weight = torch.tensor([6.0]).to(device) # ~85% neg, 15% pos
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
best_val_auc = 0.0
for epoch in range(epochs):
# Training
model.train()
total_loss = 0.0
for sequences, masks, labels in train_loader:
sequences = sequences.to(device)
masks = masks.to(device)
labels = labels.to(device)
optimizer.zero_grad()
logits, _ = model(sequences, masks)
loss = criterion(logits, labels)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_loss += loss.item()
# Validation
model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
for sequences, masks, labels in val_loader:
sequences, masks = sequences.to(device), masks.to(device)
logits, _ = model(sequences, masks)
probs = torch.sigmoid(logits).cpu().numpy()
all_preds.extend(probs)
all_labels.extend(labels.numpy())
from sklearn.metrics import roc_auc_score
val_auc = roc_auc_score(all_labels, all_preds)
scheduler.step()
print(f"Epoch {epoch+1}/{epochs}: Loss={total_loss/len(train_loader):.4f}, "
f"Val AUC={val_auc:.4f}")
if val_auc > best_val_auc:
best_val_auc = val_auc
torch.save(model.state_dict(), "best_icu_lstm.pt")
return model
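A smoke-test sketch for the training loop above using randomly generated sequences (shapes only - real inputs would come from the EHR feature pipeline):

```python
rng = np.random.default_rng(0)
T, F = 48, 34

def synthetic_split(n: int) -> ICUTimeSeriesDataset:
    seqs = [rng.normal(size=(int(rng.integers(12, T + 1)), F)).astype(np.float32)
            for _ in range(n)]
    labels = rng.binomial(1, 0.15, size=n).astype(np.float32)
    return ICUTimeSeriesDataset(seqs, labels, max_seq_len=T, n_features=F)

train_loader = DataLoader(synthetic_split(512), batch_size=64, shuffle=True)
val_loader = DataLoader(synthetic_split(128), batch_size=64)

model = ICUMortalityLSTM(n_features=F)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = train_icu_lstm(model, train_loader, val_loader, epochs=2, device=device)
```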
Survival Analysis with Cox Model
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.statistics import logrank_test
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from typing import List, Dict, Tuple, Optional
class ReadmissionSurvivalAnalysis:
"""
Survival analysis for hospital readmission.
Two components:
1. Kaplan-Meier curves for descriptive analysis
(how does survival differ across risk groups?)
2. Cox proportional hazards for multivariate adjustment
(which features independently predict time-to-readmission?)
"""
def __init__(self):
self.km_fitter = KaplanMeierFitter()
self.cox_model = CoxPHFitter(penalizer=0.1) # L2 regularization
def fit_kaplan_meier(
self,
df: pd.DataFrame,
duration_col: str = "days_to_readmission",
event_col: str = "readmitted_30d",
group_col: Optional[str] = None,
):
"""
Fit Kaplan-Meier survival curves.
Args:
duration_col: time to event or censoring (days)
event_col: 1 if event occurred, 0 if censored
group_col: if provided, plot separate curves per group
(e.g., high-risk vs low-risk patients)
"""
if group_col is None:
self.km_fitter.fit(
df[duration_col], event_observed=df[event_col]
)
median_survival = self.km_fitter.median_survival_time_
print(f"Median time to readmission: {median_survival:.1f} days")
return self.km_fitter
        # Fit stratified curves (call km.plot_survival_function() on each to plot)
groups = df[group_col].unique()
km_per_group = {}
for group in sorted(groups):
mask = df[group_col] == group
km = KaplanMeierFitter(label=f"{group_col}={group}")
km.fit(df.loc[mask, duration_col], event_observed=df.loc[mask, event_col])
km_per_group[group] = km
# Log-rank test for group difference
if len(groups) == 2:
g1, g2 = sorted(groups)
m1, m2 = df[group_col] == g1, df[group_col] == g2
result = logrank_test(
df.loc[m1, duration_col], df.loc[m2, duration_col],
event_observed_A=df.loc[m1, event_col],
event_observed_B=df.loc[m2, event_col]
)
print(f"Log-rank test p-value: {result.p_value:.4f}")
return km_per_group
def fit_cox_model(
self,
df: pd.DataFrame,
duration_col: str = "days_to_readmission",
event_col: str = "readmitted_30d",
feature_cols: List[str] = None
):
"""
Fit Cox proportional hazards model.
Returns hazard ratios for each feature with confidence intervals.
Hazard ratio > 1: feature increases risk (shortens time to readmission)
Hazard ratio < 1: feature decreases risk (protective)
"""
if feature_cols is None:
feature_cols = [c for c in df.columns if c not in [duration_col, event_col]]
# lifelines expects duration and event in the same DataFrame
cox_df = df[feature_cols + [duration_col, event_col]].copy()
cox_df = cox_df.dropna() # Cox model requires complete cases (or use imputation)
self.cox_model.fit(
cox_df,
duration_col=duration_col,
event_col=event_col,
)
self.cox_model.print_summary()
return self.cox_model
def compute_c_index(
self,
df: pd.DataFrame,
duration_col: str,
event_col: str,
feature_cols: List[str],
n_splits: int = 5
) -> Tuple[float, float]:
"""
Cross-validated concordance index (C-statistic) for Cox model.
C-index = probability that predictions correctly rank two patients.
Equivalent to AUC-ROC for survival models.
"""
from lifelines.utils import concordance_index
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
c_indices = []
cox_df = df[feature_cols + [duration_col, event_col]].dropna()
X = cox_df[feature_cols].values
T = cox_df[duration_col].values
E = cox_df[event_col].values
for train_idx, test_idx in kf.split(X):
# Train Cox model on fold
train_df = cox_df.iloc[train_idx]
test_df = cox_df.iloc[test_idx]
fold_cox = CoxPHFitter(penalizer=0.1)
fold_cox.fit(train_df, duration_col=duration_col, event_col=event_col)
# Predict risk scores on test fold
# Higher risk score = expected shorter time to event
risk_scores = fold_cox.predict_partial_hazard(test_df).values
c_idx = concordance_index(
test_df[duration_col].values,
-risk_scores, # Negate because concordance_index expects higher=longer survival
test_df[event_col].values
)
c_indices.append(c_idx)
mean_c = np.mean(c_indices)
std_c = np.std(c_indices)
print(f"Cross-validated C-index: {mean_c:.4f} +/- {std_c:.4f}")
return mean_c, std_c
Production Engineering Notes
Alert Fatigue - The Deployment Killer
The gap between a model performing well in benchmarks and being useful in production is often explained by alert fatigue. When a clinical decision support system fires too frequently, clinicians habituate to it and learn to ignore it. A study by Shah et al. (2020) found that physicians overrode 96% of drug-drug interaction alerts in a major EHR - the alerts were technically correct but clinically irrelevant in most contexts.
Strategies to combat alert fatigue:
Positive Predictive Value (PPV) targeting: do not choose your operating threshold by sensitivity alone. Target a minimum PPV of 20-30% for alerts that require clinical action. If only 5% of alerts represent true positive deterioration, you will overwhelm clinicians (see the threshold-selection sketch after this list).
Alert suppression rules: do not fire an alert for the same patient more than once per 4-hour window without a significant new signal. If a patient was assessed and documented as stable, suppress alerts for 2 hours. These rules reduce noise without systematically missing new events.
Contextual filtering: do not fire a sepsis alert on a patient who is already on 4 antibiotics and whose team has documented "known sepsis under treatment." NLP on the recent nursing notes can enable this context-aware suppression.
Tiered alert severity: not all alerts are equal. A confidence level of 90% warrants a pager message to the bedside physician. A confidence of 65% warrants a flag in the EMR for the next time a clinician opens the chart. Tiering based on confidence and severity drastically reduces the burden on clinicians.
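A minimal sketch of PPV-targeted threshold selection on a validation set, as referenced in the first strategy above; the 25% floor is an illustrative choice, not a clinical standard:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_min_ppv(y_true: np.ndarray, y_prob: np.ndarray,
                          min_ppv: float = 0.25) -> float:
    """Lowest threshold whose precision (PPV) meets the floor, i.e. the most
    sensitive operating point that still respects the alert-burden constraint."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    meets_floor = precision[:-1] >= min_ppv  # precision has one extra trailing point
    if not meets_floor.any():
        raise ValueError("No threshold achieves the requested PPV on this data")
    idx = int(np.argmax(meets_floor))  # first qualifying (lowest) threshold = highest recall
    return float(thresholds[idx])
```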
Temporal Validation - The Prospective Reality Check
A model validated only on retrospective data is not validated at all for production use. Two problems:
Label leakage: features derived from the medical record after the prediction time can leak into training. "Number of blood cultures ordered in the next 4 hours" is a perfect predictor of sepsis - it is also not available at prediction time. Careful temporal feature extraction (only use data from before the prediction timestamp) prevents this.
Distribution shift: the model's training distribution may not match the deployment distribution. A model trained on 2015-2018 data deployed in 2023 may perform poorly because patient demographics, illness severity, treatment protocols, and EHR documentation practices have all changed.
The prospective shadow mode deployment (running the model without firing alerts, collecting predictions alongside ground truth, comparing to retrospective validation AUC) is the minimum standard before any production deployment.
MIMIC-IV for Development - Practical Notes
MIMIC-IV (version 2.2) requires a PhysioNet account and completion of the CITI "Data or Specimens Only Research" course. Access is free after DUA signing.
The data are typically loaded into PostgreSQL. The MIMIC-IV demo (100 patients) is available without a DUA for initial development. Key tables:
mimiciv_hosp.admissions: one row per hospital admission
mimiciv_hosp.labevents: all lab values with timestamps
mimiciv_icu.chartevents: all ICU charted vitals (very large table, ~300M rows)
mimiciv_icu.icustays: ICU admission/discharge times
mimiciv_hosp.prescriptions: medication orders
Use the mimic-code repository on GitHub for pre-built SQL scripts that extract SOFA scores, APACHE scores, and standard cohorts.
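A sketch of pulling a starter cohort from a local PostgreSQL load of MIMIC-IV. The connection string is a placeholder, and the table/column names reflect MIMIC-IV v2.x as documented on PhysioNet - verify them against your installed schema:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/mimiciv")  # placeholder

cohort_sql = """
SELECT i.subject_id,
       i.hadm_id,
       i.stay_id,
       i.intime,
       i.outtime,
       a.hospital_expire_flag AS in_hospital_mortality
FROM mimiciv_icu.icustays i
JOIN mimiciv_hosp.admissions a USING (hadm_id)
WHERE i.los >= 1.0   -- at least one full day of ICU data
"""

cohort = pd.read_sql(cohort_sql, engine)
print(f"{len(cohort)} ICU stays, mortality rate {cohort['in_hospital_mortality'].mean():.1%}")
```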
Common Mistakes
:::danger Time Leakage in Feature Engineering
The most common and most damaging bug in clinical prediction: including features computed from data that occurs after the prediction time. A 30-day readmission model that includes "number of ER visits in next 7 days" as a feature will appear to have AUC 0.99 and will be completely useless in production. This bug is easier to introduce than it sounds, especially when working with denormalized feature tables.
Fix: always track a prediction_time for each prediction row, and filter all feature computation to only use data with charttime < prediction_time. Write a test that verifies no feature in the training set uses future data by shuffling the timestamps and confirming performance drops to near random.
:::
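A sketch of the leakage check described in the fix. The bookkeeping DataFrame and its column names are hypothetical - the point is that the feature pipeline should emit, for every feature value, the timestamp of the raw record it was computed from:

```python
import pandas as pd

def assert_no_future_data(feature_sources: pd.DataFrame) -> None:
    """
    feature_sources: one row per (prediction row, raw record) pair used during
    feature computation, with columns ['prediction_time', 'source_charttime']
    (hypothetical bookkeeping emitted by the feature pipeline).
    """
    leaked = feature_sources[
        feature_sources["source_charttime"] >= feature_sources["prediction_time"]
    ]
    assert leaked.empty, (
        f"{len(leaked)} feature source records occur at or after prediction_time "
        "- time leakage in feature engineering"
    )
```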
:::danger Reporting AUC Without Calibration Metrics
AUC measures discrimination (can the model rank patients correctly?) but not calibration (are predicted probabilities accurate?). A model with excellent AUC but poor calibration is misleading to clinicians who interpret probabilities literally. "This patient has a 78% chance of readmission" from a poorly calibrated model may actually correspond to a 30% observed readmission rate.
Fix: always report Brier score and Expected Calibration Error alongside AUC. Always include calibration plots (reliability diagrams) in model evaluation reports. Apply Platt scaling or isotonic regression on a held-out calibration set before deployment.
:::
:::warning Using Random Split for Temporal EHR Data
EHR data has a natural time ordering. Using a random train/test split on temporal EHR data allows future data to inform past predictions (temporal leakage at the dataset level): the model can be trained on admissions from late in the study period and evaluated on earlier ones, even though treatment protocols and documentation practices evolved in between.
Fix: use temporal splitting: train on data from years 1-3, validate on year 4, test on year 5. This simulates the real deployment scenario: the model is trained on historical data and deployed prospectively.
:::
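A minimal sketch of the temporal split described in the fix, assuming an admit_year column on the cohort DataFrame:

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, year_col: str = "admit_year"):
    """Train on the earliest years, validate on the next year, test on the most recent."""
    years = sorted(df[year_col].unique())
    train = df[df[year_col].isin(years[:-2])]
    val = df[df[year_col] == years[-2]]
    test = df[df[year_col] == years[-1]]
    return train, val, test
```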
:::warning Ignoring Patient-Level Clustering in Train/Val Split
A single patient may have multiple hospital admissions. If admission 1 goes to train and admission 3 goes to validation for the same patient, the model sees the same patient in both splits. This inflates performance metrics because the model learns patient-specific patterns (which would not generalize to new patients).
Fix: split at the patient level, not the admission level. Use GroupKFold with patient ID as the group key. Report performance per-patient rather than per-admission.
:::
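A sketch of patient-level splitting with scikit-learn's GroupKFold, as described in the fix (the data here are synthetic placeholders):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_admissions = 1000
patient_ids = rng.integers(0, 400, size=n_admissions)  # some patients have several admissions
X = rng.normal(size=(n_admissions, 20))
y = rng.binomial(1, 0.12, size=n_admissions)

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=patient_ids)):
    # No patient appears on both sides of the split within a fold
    assert set(patient_ids[train_idx]).isdisjoint(patient_ids[val_idx])
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val admissions")
```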
Interview Questions and Answers
Q1: You build a sepsis prediction model with AUC 0.88 on the MIMIC-IV benchmark. How would you approach deploying it in a real ICU?
AUC is a starting point, not a finish line. The deployment journey from 0.88 AUC to a clinically useful system involves:
First, prospective shadow deployment: run the model live but suppress all alerts for 4-8 weeks. Collect predictions alongside documented clinical assessments and outcomes. Compare prospective AUC to the MIMIC benchmark. If performance drops by more than 5-8 points, investigate - the likely cause is distribution shift between MIMIC and the target institution.
Second, alert threshold selection: choose the operating threshold by targeting a clinically acceptable positive predictive value (at least 20-30%) rather than by sensitivity alone. Work with the clinical informatics team and frontline nurses to determine the maximum alert burden the workflow can absorb.
Third, workflow integration: where does the alert appear? On a monitor screen, in the EHR, as a pager message? What is the expected clinical action? Who should be notified? An alert without a clear prescribed action is noise.
Fourth, measurement framework: define prospectively what success means. Does the alert change clinical behavior (time to antibiotic administration, blood cultures ordered)? Does it improve outcomes (sepsis mortality, ICU length of stay)? Instrument these metrics before deployment.
Fifth, post-market monitoring: track alert firing rate, override rate, and outcome metrics weekly. Alert fatigue develops over 3-6 months as clinicians learn to ignore alerts. Catch it early.
Q2: What is the difference between model discrimination and calibration? Why does it matter for clinical prediction?
Discrimination (measured by AUC-ROC) is the model's ability to rank patients: given two patients, one who will be readmitted and one who will not, what is the probability the model assigns a higher risk score to the one who gets readmitted? A model with perfect discrimination always ranks true positives above true negatives. AUC = 1.0 means perfect discrimination.
Calibration measures whether the predicted probability is the true probability. If I have 100 patients each assigned a 30% readmission probability by the model, and 30 of them actually get readmitted, the model is perfectly calibrated at that probability level. Expected Calibration Error (ECE) quantifies the average magnitude of the gap between predicted and actual rates.
Clinical relevance: physicians make decisions based on predicted probabilities. A physician seeing "70% readmission risk" takes different action than "30% readmission risk." If the model's 70% corresponds to a true rate of 30% (overconfident), the physician is systematically over-treating. If 30% corresponds to a true rate of 70% (underconfident), they are systematically under-treating. Good discrimination with poor calibration produces a model that is useful for ranking patients (triaging) but not for absolute risk communication.
Q3: Explain the Cox proportional hazards model. What does the proportional hazards assumption mean and how would you test it?
The Cox model relates patient covariates to the hazard (instantaneous risk) of an event. The hazard at time $t$ for patient $i$ is $h_i(t) = h_0(t)\exp(\beta^\top x_i)$. The key elements: $h_0(t)$ is the baseline hazard (common to all patients), and $\exp(\beta^\top x_i)$ is the hazard ratio - how much higher or lower the patient's hazard is relative to baseline.
The proportional hazards (PH) assumption: the ratio of hazards between any two patients is constant over time. Patient A's hazard is always twice patient B's hazard, at every point in time. This means that the effect of covariates on hazard does not change over time.
Testing PH assumption: (1) Schoenfeld residuals test - plot scaled Schoenfeld residuals versus time for each covariate; a non-zero slope indicates a time-varying effect. (2) log(-log(S(t))) vs log(t) plot for categorical variables - parallel lines indicate proportional hazards. (3) Include covariate-by-time interaction terms; a significant interaction rejects PH.
If PH assumption is violated for a covariate: use time-varying covariates, stratify on that variable (allowing a different baseline hazard per stratum), or use a parametric survival model (Weibull, log-normal) instead.
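A sketch of the Schoenfeld-residual check from the testing step above, using lifelines' built-in helper (the DataFrame here is synthetic, only to make the snippet self-contained):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 1500
df = pd.DataFrame({
    "age": rng.normal(65, 12, n),
    "charlson_index": rng.integers(0, 8, n).astype(float),
    "days_to_readmission": np.clip(rng.exponential(20, n), 1, 30),
    "readmitted_30d": rng.binomial(1, 0.4, n),
})

cph = CoxPHFitter(penalizer=0.1)
cph.fit(df, duration_col="days_to_readmission", event_col="readmitted_30d")

# Scaled Schoenfeld residual test per covariate: flags covariates whose effect
# appears to change over time, i.e. a proportional hazards violation
cph.check_assumptions(df, p_value_threshold=0.05, show_plots=False)
```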
Q4: How do you handle extreme class imbalance in 30-day readmission prediction where only 12% of patients are readmitted?
The 12% positive rate is moderate imbalance - worse than balanced but not extreme. Several strategies, in order of practical impact:
Scale the loss: scale_pos_weight = (100-12)/12 = 7.3 in LightGBM/XGBoost penalizes missing a readmission 7.3x more than a false positive. In expectation this resembles random oversampling of the minority class, but it is simpler and avoids generating synthetic patients.
Probability threshold tuning: the default classification threshold of 0.5 is rarely appropriate. At 12% prevalence, a 0.5 threshold classifies nearly everything as negative. Use the Youden index (J = sensitivity + specificity - 1) to find the threshold that best balances sensitivity and specificity on the validation set (see the sketch after this answer).
Evaluate with the right metrics: accuracy is nearly useless at 12% prevalence. Use the F1 score (harmonic mean of precision and recall), AUPRC (area under the precision-recall curve, which is more informative than AUROC for imbalanced data), and sensitivity at a clinically relevant specificity level.
Avoid SMOTE for tree-based models: SMOTE (synthetic oversampling) is often recommended for imbalanced data but its benefit is unclear for gradient boosted trees which naturally handle imbalance through class weights. Reserve SMOTE for neural networks if class weights alone are insufficient.
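A sketch of Youden-index threshold selection, as referenced in the threshold-tuning point above:

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    """Threshold maximizing Youden's J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    j = tpr - fpr  # specificity = 1 - FPR, so J reduces to TPR - FPR
    return float(thresholds[int(np.argmax(j))])
```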
Q5: A clinical prediction model performs AUC 0.87 on internal validation but only 0.74 when deployed at a different hospital. What are the likely causes and how do you investigate?
A 13-point AUC drop on external deployment is large and indicates substantial dataset shift (covariate shift, label shift, or both). Investigation approach:
Check population demographics: compare age distribution, sex ratio, racial composition, admission type mix (elective vs emergency), and disease severity indices (APACHE scores) between training data and deployment site. Any systematic difference in case mix is a suspect.
Check feature distributions: for each feature in the model, compare the distribution at training site versus deployment site. Use statistical tests (KS test for continuous, chi-squared for categorical). Features with p < 0.05 are shifted. Vital sign distributions may differ because of different monitoring protocols. Lab value distributions may differ because of different analyzer calibration standards.
Check label definitions: is "readmission" defined identically? Training data may have included all-cause readmissions, but the deployment site may only capture readmissions within their health system, missing readmissions to competitor hospitals.
Check time granularity: if the training site had vitals charted every 15 minutes (ICU) but the deployment site charts every 4 hours (general ward), features like "standard deviation of HR over past 2 hours" will have very different meaning.
Remediation options: re-train on combined data from both sites. Apply domain adaptation (reweighting training samples to better match the target distribution). Add site-specific calibration with a small labeled dataset from the deployment site.
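A sketch of the per-feature distribution comparison described above; train_df and deploy_df are assumed to be DataFrames with identical feature columns:

```python
import pandas as pd
from scipy.stats import ks_2samp

def shifted_features(train_df: pd.DataFrame, deploy_df: pd.DataFrame,
                     alpha: float = 0.05) -> pd.DataFrame:
    """Two-sample KS test per numeric feature; returns shifted features ranked by KS statistic."""
    rows = []
    for col in train_df.select_dtypes(include="number").columns:
        stat, p = ks_2samp(train_df[col].dropna(), deploy_df[col].dropna())
        rows.append({"feature": col, "ks_stat": stat, "p_value": p})
    result = pd.DataFrame(rows).sort_values("ks_stat", ascending=False)
    return result[result["p_value"] < alpha]
```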
This lesson is part of the Applied AI - AI in Healthcare module. Next: Radiology AI in Production.
