Federated Learning in Healthcare
The Data That Cannot Move
A tumor board at a major cancer center is reviewing a 47-year-old woman with a rare glioblastoma subtype. The neuroradiologist pulls up the latest MRI. The oncologist looks at the pathology report. They discuss treatment options for twenty minutes and reach a recommendation. Somewhere in that room, someone mentions that a hospital in Germany published a case series on this exact subtype, and a similar institution in South Korea has a small cohort with good outcomes on a modified protocol.
The data exists. The tumor board knows it exists. And it is completely out of reach. Not because of technical barriers - the imaging data and pathology slides could be transferred electronically in minutes. Because of regulatory and ethical barriers that exist for excellent reasons. Patient data from a German hospital cannot simply be exported to a US institution. GDPR prohibits it without explicit patient consent and data processing agreements. HIPAA imposes its own constraints on the receiving end. The South Korean hospital's institutional review board approved their data for use only within their institution.
Rare diseases make this problem particularly acute. A condition affecting 1 in 100,000 people means that even a large hospital system might see 10-20 cases per year. No single institution has enough cases to train a robust AI model. The cases are distributed across hundreds of institutions worldwide, each holding a fragment of the dataset that would, in aggregate, be sufficient. The data is effectively siloed by the very regulations that protect patient privacy.
This is the problem federated learning was designed for. Instead of moving the data to the model, you move the model to the data. Each hospital trains locally on their own patients. They share only model parameters - gradients during training, or aggregated weight updates. The central server combines these updates, produces an improved global model, and sends it back for the next round. No patient records ever leave the institution that holds them.
The University of Pennsylvania demonstrated this concretely in 2022. Researchers coordinated federated training of a tumor segmentation model across 71 institutions on 6,314 glioblastoma patients - one of the largest brain tumor datasets assembled. The federated model outperformed models trained at any single institution, approaching the performance of a centrally trained model. The patients' scans never left their hospitals. This was not a proof-of-concept. It was a production-scale demonstration of federated learning solving a real clinical problem.
Why This Exists - The Pooling Barrier
The intuition behind why more data produces better models is straightforward: rare patterns become common patterns when you aggregate enough examples. A single hospital sees 15 cases of a rare arrhythmia pattern per year. A network of 50 hospitals sees 750. The model trained on 750 cases will generalize far better to new patients than the model trained on 15. The limiting factor has never been algorithmic sophistication - it has been data access.
The alternatives to federated learning - traditional data pooling approaches - all fail for medical data at scale. Centralized data pools require legal data sharing agreements that take years to negotiate and may never be achievable across international borders. Synthetic data generation (training a generative model locally and sharing synthetic patient records) is promising, but the privacy guarantees of generative models are not well understood - generative models can memorize and emit near-copies of real training records, and membership inference attacks can reveal whether a specific patient was in the training set. Data enclaves (physically bringing analysts to where the data is) do not scale to training deep learning models that need to process the full dataset many times.
Federated learning is not a perfect solution. But it is the best available approach for training on distributed medical data at scale, and it is now mature enough to deploy in production systems.
Historical Context - From Parallel SGD to Healthcare FL
The concept of training a model on distributed data without centralizing it appeared in the distributed systems literature in the early 2000s, but it was Google's 2017 paper "Communication-Efficient Learning of Deep Networks from Decentralized Data" (McMahan et al.) that defined the modern federated learning framework. The paper introduced the FedAvg algorithm and the term "federated learning," motivated initially by Google's desire to train language models on mobile devices without uploading users' private data to central servers.
The healthcare application followed immediately. In 2018, Sheller et al. from Penn showed that a brain tumor segmentation model trained federally across four institutions achieved equivalent performance to centralized training. In 2019, NVIDIA launched CLARA Federated Learning (later renamed NVIDIA FLARE), a production framework for healthcare FL with support for split learning, homomorphic encryption, and differential privacy. In 2021, the FeTS challenge (Federated Tumor Segmentation) established benchmarks for cross-institutional FL in neuro-oncology.
By 2023, FL had moved from research demonstrations to real deployments. Intel and Mass General Brigham deployed FL for COVID-19 outcome prediction across six hospitals. Owkin (a Paris-based AI startup) built a federated learning platform used by hospitals across Europe for oncology research. The FDA issued guidance acknowledging federated learning as a viable approach for training medical AI models.
The differential privacy thread runs parallel. Dwork et al. (2006) formalized differential privacy as a mathematical framework for quantifying privacy loss. Abadi et al. (2016, Google) showed how to train neural networks with differential privacy using DP-SGD (differentially private stochastic gradient descent). The combination of federated learning with differential privacy - local training + noisy gradients - provides two layers of privacy protection.
Core Concepts
FedAvg - The Foundation Algorithm
FedAvg (Federated Averaging) is the canonical federated learning algorithm. The setup: N clients (hospitals), each with a local dataset $D_k$ of size $n_k$. Total data: $n = \sum_{k=1}^{N} n_k$. Global model parameters: $w$.
Each round of FedAvg:
- Server selects a subset of K clients (often all N, or a fraction for large federations)
- Server sends current global weights $w^{(t)}$ to selected clients
- Each selected client k performs E local gradient descent steps on their local data: $w_k \leftarrow w_k - \eta \nabla F_k(w_k)$, where $F_k$ is the local loss on client $k$'s data
- Clients send updated weights $w_k$ back to server
- Server aggregates by weighted averaging: $w^{(t+1)} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n} w_k$
The weighted average ensures that hospitals with more patients contribute proportionally more to the global model. This is appropriate when you believe each hospital's dataset is a representative sample of the overall population.
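To make the aggregation step concrete, here is a minimal sketch of the server-side weighted average in plain NumPy. The function name and signature are illustrative assumptions, not part of any particular framework:

import numpy as np
from typing import List

def fedavg_aggregate(
    client_weights: List[List[np.ndarray]],  # one list of layer arrays per client
    client_sizes: List[int],                 # n_k: each client's local dataset size
) -> List[np.ndarray]:
    """Weighted average of client weights: w = sum_k (n_k / n) * w_k."""
    total = sum(client_sizes)
    num_layers = len(client_weights[0])
    aggregated = []
    for layer in range(num_layers):
        layer_sum = np.zeros_like(client_weights[0][layer])
        for weights, n_k in zip(client_weights, client_sizes):
            layer_sum += (n_k / total) * weights[layer]
        aggregated.append(layer_sum)
    return aggregated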
Why does this work? Federated averaging converges to the same solution as centralized gradient descent when the local datasets are IID (independently and identically distributed). In the IID case, the gradient computed on any subset of the data is an unbiased estimate of the true gradient, so averaging locally updated weights is approximately equivalent to averaging gradients.
Why does it sometimes not work? When datasets are non-IID - when each hospital's patients are systematically different from other hospitals' patients - local gradient descent on hospital k's data moves the weights in the direction that minimizes hospital k's loss, which may not be the direction that minimizes the global loss. After E steps of local training, $w_k$ has drifted away from the shared global weights in a hospital-specific direction. Averaging diverged weights can be worse than not training at all in extreme cases. This is the "client drift" problem.
Non-IID Data - The Healthcare Challenge
In federated learning literature, "non-IID" means that the data distribution differs across clients. In healthcare, non-IID is not an exception - it is the default. Every hospital is systematically different:
- Patient population: An urban safety net hospital serves a different demographic than a suburban private hospital. Different ages, different comorbidities, different socioeconomic factors.
- Scanner equipment: Hospital A has a 3T MRI scanner, hospital B has a 1.5T. The images look measurably different.
- Clinical protocols: One hospital routinely adds gadolinium contrast, another does not. One does 5mm slice thickness CT, another 1.25mm.
- Label distribution: A cancer center sees far more malignant findings than a community hospital. A pediatric hospital has no adult patients.
The consequence: a model trained with naive FedAvg will perform well on the average distribution across hospitals but potentially poorly on any individual hospital's data - and patients at that hospital will receive worse AI-assisted care. This is a clinical equity problem layered on top of a technical problem.
Solutions to non-IID data in healthcare FL:
FedProx: Adds a proximal term to the local loss to prevent excessive drift:
$$\min_w \; F_k(w) + \frac{\mu}{2} \lVert w - w^{(t)} \rVert^2$$
The term $\frac{\mu}{2} \lVert w - w^{(t)} \rVert^2$ penalizes moving too far from the global weights $w^{(t)}$ during local training, keeping hospitals from drifting into hospital-specific local optima. In practice it works better than FedAvg on heterogeneous data; the Flower client in the Code Examples section implements exactly this term.
Per-FedAvg / Personalized FL: Instead of a single global model, each hospital maintains a personalized model that combines the global model's general knowledge with local fine-tuning. The MAML (Model-Agnostic Meta-Learning) approach trains the global model to be a good initialization point for fast local adaptation rather than a one-size-fits-all model.
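A minimal sketch of one simple personalization scheme - freeze the shared feature extractor, fine-tune only the classifier head on local data. It assumes a model exposing features and classifier submodules, like the ChestXRayClassifier defined in the Code Examples section below; this is one personalization variant, not the full Per-FedAvg/MAML procedure:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def personalize_locally(
    global_model: nn.Module,   # trained federated model with .features / .classifier
    local_loader: DataLoader,
    num_epochs: int = 3,
    lr: float = 1e-5,
) -> nn.Module:
    """Freeze the shared feature extractor; fine-tune only the classifier
    head on this hospital's data. The global model supplies general
    knowledge; the head adapts to the local distribution."""
    for p in global_model.features.parameters():
        p.requires_grad = False
    optimizer = torch.optim.Adam(global_model.classifier.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    global_model.train()
    for _ in range(num_epochs):
        for images, labels in local_loader:
            optimizer.zero_grad()
            loss = criterion(global_model(images), labels.float())
            loss.backward()
            optimizer.step()
    return global_model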
Clustered FL: Group hospitals with similar data distributions and train one global model per cluster. Requires a way to identify which hospitals are similar without sharing data - typically by comparing gradient directions or by clustering on metadata (hospital size, imaging protocol, patient population statistics that do not contain PHI).
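A naive illustration of the gradient-direction approach: greedily group hospitals whose flattened model updates point in similar directions. The threshold and function name are illustrative assumptions, not a published algorithm:

import numpy as np

def cluster_clients_by_update(
    client_updates: dict,          # hospital_id -> flattened weight-delta vector
    similarity_threshold: float = 0.5,
) -> list:
    """Greedy clustering of clients by cosine similarity of their model
    updates; clients whose updates point in similar directions are
    assumed to have similar data distributions."""
    clusters: list = []
    for cid in client_updates:
        placed = False
        for cluster in clusters:
            rep = client_updates[cluster[0]]   # cluster representative
            u = client_updates[cid]
            cos = float(np.dot(u, rep) / (np.linalg.norm(u) * np.linalg.norm(rep) + 1e-12))
            if cos >= similarity_threshold:
                cluster.append(cid)
                placed = True
                break
        if not placed:
            clusters.append([cid])
    return clusters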
Differential Privacy
Differential privacy provides a formal mathematical guarantee of privacy. Informally: a mechanism M is $(\epsilon, \delta)$-differentially private if, for any two datasets D and D' that differ by one record, and any possible output set S:
$$\Pr[M(D) \in S] \le e^{\epsilon} \Pr[M(D') \in S] + \delta$$
Intuitively: knowing the output of M tells you almost nothing about whether any specific individual is in the dataset. Small $\epsilon$ (strong privacy) means the output distribution is nearly identical whether or not any given patient is included.
In DP-SGD (Differentially Private SGD), the mechanism M is the training algorithm itself. To make gradient updates $(\epsilon, \delta)$-DP:
- Clip gradients per example: $\bar{g}_i = g_i / \max\left(1, \frac{\lVert g_i \rVert_2}{C}\right)$, where C is the clipping threshold. This bounds the maximum influence of any single example.
- Add Gaussian noise: $\tilde{g} = \frac{1}{B}\left(\sum_i \bar{g}_i + \mathcal{N}(0, \sigma^2 C^2 I)\right)$. The noise magnitude $\sigma$ controls the privacy-utility tradeoff.
The privacy cost accumulates over training steps: more steps, more noise needed (or weaker privacy guarantee). The moments accountant (Abadi et al., 2016) provides tight bounds on the total privacy cost over T steps.
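To make the mechanics concrete, here is a minimal manual sketch of one DP-SGD step in plain PyTorch. It is illustrative only - it omits privacy accounting, and in practice you would use a vetted library such as Opacus, shown in the Code Examples section:

import torch

def dp_sgd_step(model, images, labels, criterion, lr=1e-4, clip_norm=1.0, noise_sigma=1.0):
    """One manually implemented DP-SGD step: per-example gradient clipping,
    then Gaussian noise added to the summed clipped gradients."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    batch_size = images.shape[0]
    for i in range(batch_size):  # per-example gradients (slow loop; libraries vectorize this)
        loss = criterion(model(images[i:i + 1]), labels[i:i + 1].float())
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip_norm / (total_norm.item() + 1e-12))  # clip to norm C
        for s, g in zip(summed, grads):
            s.add_(g, alpha=scale)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * noise_sigma * clip_norm
            p.add_((s + noise) / batch_size, alpha=-lr)  # step on the noisy average gradient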
The tradeoff is stark. With $\epsilon \approx 10$ (moderate privacy), model utility loss on CIFAR-10 is roughly 3-4%. With $\epsilon \approx 1$ (strong privacy), utility loss is 15-20% or more. For medical imaging models, this can mean the difference between clinically useful and not. Current practice in healthcare FL often uses $\epsilon = 10$-$50$ - weaker guarantees than theorists would prefer, but providing measurable protection against reconstruction attacks while preserving model quality.
Communication Efficiency
In a typical FL round, each client sends a full model update back to the server. For a ResNet-50, that is 25 million floats at 4 bytes each = 100 MB per round per client. With 50 hospitals and 100 rounds of training, that is 500 GB of gradient traffic. Hospitals running on standard internet connections cannot sustain this.
Gradient compression: Quantize gradients to 8-bit or 4-bit integers. 4-8x reduction in communication volume with minimal accuracy loss. Libraries like Flower have built-in compression strategies.
Gradient sparsification: Only transmit the top-k gradients by magnitude. 99% sparsification (sending only 1% of gradient values) can work reasonably well with error feedback (accumulating the compressed residual and adding it back in the next round).
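A minimal sketch of top-k sparsification with error feedback in plain NumPy (class and parameter names are illustrative assumptions):

import numpy as np

class TopKCompressor:
    """Keep only the top-k gradient entries by magnitude; accumulate what
    was dropped (the residual) and add it back before the next compression."""
    def __init__(self, keep_fraction: float = 0.01):
        self.keep_fraction = keep_fraction
        self.residual = None  # error feedback buffer

    def compress(self, grad: np.ndarray) -> np.ndarray:
        flat = grad.ravel().copy()
        if self.residual is None:
            self.residual = np.zeros_like(flat)
        flat += self.residual                      # add back previously dropped mass
        k = max(1, int(len(flat) * self.keep_fraction))
        threshold = np.partition(np.abs(flat), -k)[-k]
        sparse = np.where(np.abs(flat) >= threshold, flat, 0.0)  # transmit only these
        self.residual = flat - sparse              # remember what was dropped
        return sparse.reshape(grad.shape)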
Periodic averaging: Instead of communicating every step, aggregate every 10-20 local steps. This is essentially FedAvg (E > 1). More communication efficient but more susceptible to non-IID problems.
Code Examples
FedAvg Implementation with Flower
# Flower (flwr) is a production federated learning framework
# Install: pip install flwr torch torchvision
import flwr as fl
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from typing import Dict, List, Optional, Tuple
from collections import OrderedDict
# A simple CNN for chest X-ray classification
class ChestXRayClassifier(nn.Module):
def __init__(self, num_classes: int = 14):
super().__init__()
# In production, use a pretrained DenseNet-121 (CheXNet architecture)
self.features = nn.Sequential(
nn.Conv2d(1, 32, 3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(32, 64, 3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(64, 128, 3, padding=1),
nn.ReLU(),
nn.AdaptiveAvgPool2d(4),
)
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(128 * 4 * 4, 256),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(256, num_classes),
)
def forward(self, x):
return self.classifier(self.features(x))
def get_weights(model: nn.Module) -> List[np.ndarray]:
"""Extract model weights as numpy arrays for Flower."""
return [val.cpu().numpy() for _, val in model.state_dict().items()]
def set_weights(model: nn.Module, weights: List[np.ndarray]) -> None:
"""Set model weights from numpy arrays."""
params_dict = zip(model.state_dict().keys(), weights)
state_dict = OrderedDict({k: torch.tensor(v) for k, v in params_dict})
model.load_state_dict(state_dict, strict=True)
class HospitalClient(fl.client.NumPyClient):
"""
Federated learning client representing a single hospital.
Each hospital runs this independently with their own data.
"""
def __init__(
self,
hospital_id: str,
train_loader: DataLoader,
val_loader: DataLoader,
device: str = "cpu",
):
self.hospital_id = hospital_id
self.train_loader = train_loader
self.val_loader = val_loader
self.device = torch.device(device)
self.model = ChestXRayClassifier().to(self.device)
self.criterion = nn.BCEWithLogitsLoss()
def get_parameters(self, config: Dict) -> List[np.ndarray]:
"""Return current model parameters to server."""
return get_weights(self.model)
def fit(self, parameters: List[np.ndarray], config: Dict) -> Tuple[List[np.ndarray], int, Dict]:
"""
Receive global model, train locally, return updated weights.
config can contain: num_epochs, learning_rate, proximal_mu (for FedProx)
"""
# Load global model weights
set_weights(self.model, parameters)
num_epochs = config.get("num_epochs", 5)
lr = config.get("learning_rate", 1e-4)
proximal_mu = config.get("proximal_mu", 0.0) # 0 = FedAvg, >0 = FedProx
optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
# FedProx: store the global model weights for proximal term
global_weights = [p.clone().detach() for p in self.model.parameters()] if proximal_mu > 0 else None
self.model.train()
total_loss = 0.0
n_batches = 0
for epoch in range(num_epochs):
for images, labels in self.train_loader:
images, labels = images.to(self.device), labels.to(self.device)
optimizer.zero_grad()
outputs = self.model(images)
loss = self.criterion(outputs, labels.float())
# FedProx proximal term
if proximal_mu > 0 and global_weights is not None:
proximal_term = sum(
torch.norm(p - g) ** 2
for p, g in zip(self.model.parameters(), global_weights)
)
loss += (proximal_mu / 2) * proximal_term
loss.backward()
optimizer.step()
total_loss += loss.item()
n_batches += 1
avg_loss = total_loss / max(n_batches, 1)
return (
get_weights(self.model),
len(self.train_loader.dataset),
{"hospital": self.hospital_id, "train_loss": avg_loss},
)
def evaluate(self, parameters: List[np.ndarray], config: Dict) -> Tuple[float, int, Dict]:
"""Evaluate global model on local validation set."""
set_weights(self.model, parameters)
self.model.eval()
total_loss = 0.0
n_correct = 0
n_total = 0
with torch.no_grad():
for images, labels in self.val_loader:
images, labels = images.to(self.device), labels.to(self.device)
outputs = self.model(images)
loss = self.criterion(outputs, labels.float())
total_loss += loss.item()
                # Exact-match accuracy: a prediction counts as correct only if all labels match
preds = (torch.sigmoid(outputs) > 0.5).float()
n_correct += (preds == labels).all(dim=1).sum().item()
n_total += labels.size(0)
accuracy = n_correct / max(n_total, 1)
return (
total_loss / len(self.val_loader),
len(self.val_loader.dataset),
{"accuracy": accuracy, "hospital": self.hospital_id},
)
# Server-side strategy
def create_fedavg_strategy(num_hospitals: int, min_fit_clients: Optional[int] = None) -> fl.server.strategy.FedAvg:
"""
Create a FedAvg aggregation strategy for the central server.
"""
min_fit = min_fit_clients or max(2, num_hospitals // 2)
strategy = fl.server.strategy.FedAvg(
fraction_fit=1.0, # Train on 100% of available clients
fraction_evaluate=1.0, # Evaluate on 100% of clients
min_fit_clients=min_fit,
min_evaluate_clients=min_fit,
min_available_clients=min_fit,
on_fit_config_fn=lambda server_round: {
"num_epochs": 5,
"learning_rate": max(1e-5, 1e-4 * (0.95 ** server_round)), # LR decay
"proximal_mu": 0.01, # Use FedProx
},
)
return strategy
# Simulate a federated training run (for testing without actual distributed nodes)
def simulate_federated_training(
hospital_datasets: Dict[str, Tuple[DataLoader, DataLoader]],
num_rounds: int = 20,
):
"""
Simulate federated learning across multiple hospitals.
In production, each hospital runs a Flower client on their own server.
"""
    hospital_ids = list(hospital_datasets.keys())

    def client_fn(cid: str) -> HospitalClient:
        # Flower passes cid as a stringified index ("0", "1", ...);
        # map it back to this hospital's key in hospital_datasets
        hospital_id = hospital_ids[int(cid)]
        train_loader, val_loader = hospital_datasets[hospital_id]
        return HospitalClient(
            hospital_id=hospital_id,
            train_loader=train_loader,
            val_loader=val_loader,
        )
strategy = create_fedavg_strategy(len(hospital_datasets))
history = fl.simulation.start_simulation(
client_fn=client_fn,
num_clients=len(hospital_datasets),
config=fl.server.ServerConfig(num_rounds=num_rounds),
strategy=strategy,
client_resources={"num_cpus": 2, "num_gpus": 0.0},
)
return history
Differential Privacy with DP-SGD
from opacus import PrivacyEngine
from opacus.validators import ModuleValidator
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
def train_with_differential_privacy(
model: nn.Module,
train_loader: DataLoader,
target_epsilon: float = 10.0,
target_delta: float = 1e-5,
max_grad_norm: float = 1.0,
num_epochs: int = 10,
learning_rate: float = 1e-4,
device: str = "cpu",
) -> tuple[nn.Module, float]:
"""
Train a model with differential privacy using Opacus (Facebook's DP library).
Provides formal (epsilon, delta)-DP guarantees.
target_epsilon: privacy budget (lower = more private, more noise, less utility)
target_delta: probability of privacy failure (typically 1/n_training_examples)
max_grad_norm: gradient clipping bound C
"""
device_obj = torch.device(device)
model = model.to(device_obj)
# Validate model for DP compatibility (no unsupported layers like BatchNorm)
# Replace BatchNorm with GroupNorm for DP compatibility
model = ModuleValidator.fix(model)
errors = ModuleValidator.validate(model, strict=False)
if errors:
print(f"DP validation warnings: {errors}")
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.BCEWithLogitsLoss()
# Attach PrivacyEngine to the optimizer
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
module=model,
optimizer=optimizer,
data_loader=train_loader,
epochs=num_epochs,
target_epsilon=target_epsilon,
target_delta=target_delta,
max_grad_norm=max_grad_norm,
)
# Training loop
model.train()
for epoch in range(num_epochs):
total_loss = 0.0
n_batches = 0
for images, labels in train_loader:
images, labels = images.to(device_obj), labels.to(device_obj)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels.float())
loss.backward()
optimizer.step()
total_loss += loss.item()
n_batches += 1
# Track privacy budget spent so far
epsilon_spent = privacy_engine.get_epsilon(delta=target_delta)
print(f"Epoch {epoch+1}/{num_epochs}: loss={total_loss/n_batches:.4f}, epsilon={epsilon_spent:.2f}")
if epsilon_spent >= target_epsilon:
print(f"Privacy budget exhausted at epoch {epoch+1}. Stopping training.")
break
final_epsilon = privacy_engine.get_epsilon(delta=target_delta)
print(f"Final privacy guarantee: ({final_epsilon:.2f}, {target_delta})-DP")
return model, final_epsilon
def privacy_utility_analysis(
model_class,
train_loader: DataLoader,
epsilon_values: list[float] = [1.0, 2.0, 5.0, 10.0, 50.0, float("inf")],
num_epochs: int = 10,
) -> list[dict]:
"""
Empirically measure the privacy-utility tradeoff for a given model and dataset.
Trains models at different epsilon values and reports validation accuracy.
"""
results = []
for epsilon in epsilon_values:
model = model_class()
if epsilon == float("inf"):
# Train without DP as baseline
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# ... standard training loop ...
final_epsilon = float("inf")
# Placeholder for illustration
val_accuracy = 0.82
else:
model, final_epsilon = train_with_differential_privacy(
model, train_loader,
target_epsilon=epsilon,
num_epochs=num_epochs,
)
val_accuracy = 0.0 # Would evaluate on val set
results.append({
"target_epsilon": epsilon,
"actual_epsilon": final_epsilon,
"val_accuracy": val_accuracy,
})
print(f"epsilon={epsilon}: val_accuracy={val_accuracy:.4f}")
return results
Production Engineering Notes
NVIDIA FLARE for Healthcare: NVIDIA FLARE (Federated Learning Application Runtime Environment) is the most mature production FL framework for healthcare. Key features: admin console for monitoring training across sites, secure aggregation with homomorphic encryption support, compatibility with MONAI (Medical Open Network for AI) for medical image processing, site-level privacy filters that prevent exfiltrating raw gradients, and an event-driven architecture for custom pre/post processing. For new healthcare FL projects, NVIDIA FLARE is the default choice unless you need something FLARE does not support.
Communication Security: In federated learning, gradient updates transmitted between hospitals and the central server must be encrypted in transit (TLS 1.3 minimum) and ideally also aggregated using secure aggregation protocols that prevent the server from seeing individual hospital's updates. Without secure aggregation, a compromised server can potentially reconstruct training data from gradient updates (gradient inversion attacks). For highly sensitive data (HIV status, mental health records, substance use), require secure aggregation using protocols like SecAgg.
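As a toy illustration of the additive-masking idea behind secure aggregation (real SecAgg adds pairwise key agreement, dropout recovery, and finite-field arithmetic - this sketch only shows why the masks cancel in the sum):

import numpy as np

def masked_updates(updates: dict) -> dict:
    """Toy additive masking: each pair of clients (i, j) with i < j shares a
    random mask; i adds it, j subtracts it. Individual masked updates look
    random to the server, but the masks cancel exactly in the sum."""
    ids = sorted(updates.keys())
    rng = np.random.default_rng(0)  # stands in for pairwise key agreement
    masked = {cid: updates[cid].astype(np.float64).copy() for cid in ids}
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            mask = rng.normal(size=updates[ids[a]].shape)
            masked[ids[a]] += mask
            masked[ids[b]] -= mask
    return masked

# The server sees only masked vectors, yet their sum equals the true sum:
# sum(masked.values()) == sum(updates.values()) up to floating-point error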
System Heterogeneity: Not all hospitals have the same compute capacity. A major academic center may have 8 A100 GPUs; a rural critical access hospital may have only CPU compute. FL frameworks must accommodate stragglers: clients that take 10x longer to complete local training. FedAvg handles this by setting a minimum number of responding clients and discarding late responses. But if the rural hospitals are systematically slow and never contribute their updates, the model will be biased toward large academic centers. Consider different update frequency policies for resource-constrained sites.
Validation Strategy: Federated models must be validated on held-out data from each site to detect non-IID performance disparities. A global model with AUC 0.88 on average but 0.72 at one site is not a good model - it is a model with a site-specific failure mode. Build validation pipelines that report per-site performance breakdowns and flag sites where the model underperforms by more than 2 standard deviations from the site mean.
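A sketch of that flagging logic (function and report names are illustrative; in practice this runs inside the validation pipeline):

import statistics

def flag_underperforming_sites(site_auc: dict, num_sd: float = 2.0) -> list:
    """Given per-site AUCs, flag any site more than num_sd standard
    deviations below the across-site mean."""
    aucs = list(site_auc.values())
    mean_auc = statistics.mean(aucs)
    sd_auc = statistics.stdev(aucs) if len(aucs) > 1 else 0.0
    threshold = mean_auc - num_sd * sd_auc
    return [site for site, auc in site_auc.items() if auc < threshold]

# Usage: flag_underperforming_sites({"site_A": 0.88, "site_B": 0.87, ...})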
Legal and Governance Infrastructure: FL eliminates the need to share data but does not eliminate the need for legal agreements. You still need: a multi-site IRB protocol, a business associate agreement (BAA) with the FL platform provider if they are a covered entity, data use agreements specifying what the model updates can contain, and an agreement on who owns the resulting model and how it can be used. Building the legal framework typically takes longer than building the technical system.
Common Mistakes
:::danger Assuming Federated = Private Federated learning without differential privacy provides no formal privacy guarantee. Gradient inversion attacks (Zhu et al., NeurIPS 2019) showed that model gradients can be inverted to approximately reconstruct training images for small batch sizes. For natural images, an attacker with access to gradients computed on a batch of 1-8 images can often recover recognizable images. For medical imaging, this means an adversary controlling the central server could potentially reconstruct patient images from gradient updates. If your threat model includes a malicious or compromised server, you must add differential privacy or secure aggregation - FedAvg alone is not sufficient. :::
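A toy sketch of the attack's core loop, in the spirit of Zhu et al.: optimize a dummy input until the gradient it produces matches the observed gradient update. Real attacks also recover labels and use more carefully matched objectives; this simplified version only conveys the mechanism:

import torch

def invert_gradients(model, observed_grads, input_shape, label, steps=200, lr=0.1):
    """Toy gradient inversion: optimize a random dummy input so that the
    gradient it induces matches the gradient observed by the server."""
    dummy = torch.randn(input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([dummy], lr=lr)
    criterion = torch.nn.BCEWithLogitsLoss()
    params = [p for p in model.parameters() if p.requires_grad]
    for _ in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(dummy), label)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # L2 distance between the dummy's gradient and the observed gradient
        mismatch = sum(((g - o) ** 2).sum() for g, o in zip(grads, observed_grads))
        mismatch.backward()
        optimizer.step()
    return dummy.detach()  # approximate reconstruction of the training input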
:::danger Evaluating Only Global Average Performance Reporting "our federated model achieved AUC 0.88 on the held-out test set" without breaking down performance by site is misleading in healthcare. The average can look excellent while one hospital has AUC 0.65 - which in clinical practice means that hospital's patients are receiving systematically worse AI assistance. Always report per-site performance. For deployment decisions, use minimum site performance (not mean) as the primary criterion. A model should not be deployed to a site where its performance is below clinical utility threshold even if the federated average is high. :::
:::warning Convergence with Too Few Local Steps FedAvg with E=1 (one local gradient step) is very communication-efficient but may not converge. FedAvg with E=100 local steps before aggregation reduces communication overhead but increases client drift on non-IID data. The right value of E depends on the degree of data heterogeneity across clients. A practical default: E=5-10 local epochs, with FedProx regularization (mu=0.01). Monitor both global loss convergence and per-client loss - if per-client losses diverge from each other over rounds, increase mu or decrease E. :::
:::warning Ignoring the Cold Start Problem at New Sites When deploying a federated model to a new hospital that did not participate in training, the model may perform poorly on that hospital's data due to distribution shift from the training hospitals. Naive deployment without local validation and fine-tuning can result in silent performance degradation. Always run a local validation study at any new site before clinical deployment, using 100-200 locally labeled examples to measure the model's performance on local data. If performance is below threshold, run a local fine-tuning round (with or without the federated framework) before enabling clinical use. :::
Interview Q&A
Q: Explain the FedAvg algorithm from scratch. When does it converge to the same solution as centralized training, and when does it diverge?
A: FedAvg works in rounds. Each round: the server sends the current global model to all clients, each client runs E epochs of local SGD on their own data, clients send updated weights back to the server, and the server computes a weighted average of the updates (weighted by dataset size). In the IID case - where each client's data is drawn from the same distribution - the local gradient direction on each client is an unbiased estimate of the global gradient. Averaging locally updated weights is therefore approximately equivalent to running more SGD steps on the global dataset, and FedAvg converges to roughly the same solution as centralized training. In the non-IID case, each client's gradient points toward the local minimum of that client's loss function, which may differ substantially from the global minimum. After E local steps, each client's weights have moved toward their local minimum. Averaging these diverged weights may not point in a useful direction for the global objective. The more heterogeneous the client distributions and the more local steps E, the worse client drift becomes. Convergence guarantees for FedAvg on non-IID data exist (Li et al., 2020) but require assumptions on the degree of heterogeneity and often require smaller learning rates or fewer local steps than the IID case.
Q: What is $(\epsilon, \delta)$-differential privacy and what does it mean in plain English for a healthcare ML application?
A: Differential privacy quantifies how much the presence or absence of any single individual in your training dataset can affect the model's output. An $(\epsilon, \delta)$-DP mechanism guarantees that for any two datasets D and D' that differ by one record, the probability that the mechanism outputs any particular result shifts by at most a multiplicative factor of $e^{\epsilon}$, with the guarantee allowed to fail with probability at most $\delta$. In plain English for healthcare: after training your model with $(\epsilon, \delta)$-DP, an adversary who can query the model cannot determine with meaningful confidence whether patient Alice was in your training data, even if they know everything about Alice. Practically: $\epsilon \approx 1$ is very strong privacy at significant utility cost; $\epsilon \approx 10$ is moderate privacy; $\epsilon \approx 50$ provides weaker but still measurable protection. The $\delta$ term is the probability that the bound fails - set it well below $1/n$, where n is the number of training examples. For a hospital dataset of 10,000 patients, $\delta = 10^{-5}$ is appropriate.
Q: Describe gradient inversion attacks. What are their practical implications for healthcare FL deployment?
A: Gradient inversion (Zhu et al., 2019; also Geiping et al., 2020) is a reconstruction attack where an adversary with access to a model's gradient update attempts to recover the training data that produced it. The attack works by solving an optimization problem: find input data $\hat{x}$ such that the gradient produced by $\hat{x}$ matches the observed gradient. For image data with small batch sizes (1-8 images), these attacks can reconstruct recognizable images within minutes on a single GPU. Practical implications for healthcare FL: (1) the central server is a potential adversary; a compromised server operator could reconstruct patient images from gradient updates without ever seeing the images directly; (2) single-sample batches should be avoided in medical imaging FL for this reason - use batch size 8-32, which increases gradient mixing and makes inversion harder; (3) gradient noise from DP-SGD substantially degrades inversion quality; (4) secure aggregation protocols prevent the server from seeing individual gradients by aggregating at the cryptographic layer; (5) the attack is less practical at the scale of real medical imaging models (millions of parameters, larger batches), but the risk is non-negligible for high-stakes data.
Q: Your federated model shows AUC 0.85 on the aggregated test set but only 0.68 at one specific hospital. What are the likely causes and how do you diagnose each?
A: There are four main candidate causes. First, severe non-IID data: this hospital's patient population may be dramatically different from the other hospitals (different disease prevalence, different imaging protocol, different demographics). Diagnose by comparing label distributions and image statistics from this site to others. Second, the hospital contributed few training samples: if this hospital had 50 patients while others had 500+, its local data pattern is underrepresented in the global model. Diagnose by checking each hospital's weight in the FedAvg aggregation. Third, covariate shift from different equipment: scanner model, field strength, or acquisition protocol differs from training hospitals. Diagnose by comparing image-level statistics (mean intensity, SNR, spatial resolution) to training data. Fourth, label noise at this site: if this hospital's radiologists use different labeling criteria, the "ground truth" labels are inconsistent with the model's training distribution. Diagnose by running inter-rater reliability checks on a sample of this site's labels. The remediation strategy differs: for non-IID, use personalized FL with local fine-tuning. For underrepresentation, increase oversampling of this site's data or use stratified aggregation. For covariate shift, apply domain adaptation. For label noise, re-label a subset with consensus reads.
Q: Compare NVIDIA FLARE and Flower as federated learning frameworks for a healthcare deployment. What factors would drive your choice?
A: Flower is a research-oriented framework with a clean, Pythonic API, strong simulation capabilities for rapid prototyping, and excellent documentation for implementing custom aggregation strategies. It supports TensorFlow, PyTorch, and JAX, has a growing ecosystem, and is well-suited for academic collaborations or smaller deployments where the engineering team controls both server and all client environments. NVIDIA FLARE is a production-hardened framework specifically designed for healthcare, with enterprise features: an admin console for monitoring multi-site training, integration with MONAI for medical image processing, support for DICOM data pipelines, site-level access controls and audit logging, homomorphic encryption for secure aggregation, and production deployment patterns on air-gapped hospital networks. Choose Flower when: you are doing research, prototyping, or have a small number of well-controlled clients. Choose NVIDIA FLARE when: you are deploying to actual hospital IT environments, you need compliance audit trails, you are working with MONAI-based imaging pipelines, or you need enterprise support. For a production multi-hospital deployment in healthcare, NVIDIA FLARE is almost always the right choice despite its steeper learning curve.
Q: How would you design a federated learning system for training a model on EHR data (structured tabular clinical data) rather than medical images? What changes relative to imaging FL?
A: EHR-based FL differs from imaging FL in several important ways. Data representation: tabular EHR data has missing values at rates up to 60-70% (labs not ordered, diagnoses not coded), and the missingness patterns differ by hospital - different hospitals order different lab panels, use different diagnostic codes, and document differently. The model must handle missing data explicitly (imputation or missingness-aware architectures) rather than assuming complete inputs. Feature engineering: a glucose value of 250 mg/dL means the same thing regardless of hospital, but how it is coded (ICD-10 code for diabetes type, lab value vs. diagnosis code) may differ. A feature alignment step across institutions is required before training. Statistical heterogeneity: for mortality prediction, ICU hospitals have systematically higher mortality rates than general medicine wards - this is not noise, it is real heterogeneity. The FL aggregation must account for this. Privacy surface: EHR data with many features is more re-identifiable than a single image. Even model gradients computed on EHR data may reveal more information than gradients from images. DP requirements may be stricter. Model architecture: tabular EHR data is well-served by gradient boosted trees (XGBoost, LightGBM) or transformer architectures (TabTransformer, SAINT) rather than CNNs. Federated gradient boosting is less mature than federated neural network training - research solutions exist but production-grade tools are limited.
