Clinical NLP and EHR Systems
The Note That Almost Killed the Patient
A 67-year-old man was admitted to a hospital in Boston with shortness of breath. His discharge summary, written at 11 PM by a tired intern, read: "Patient has no history of atrial fibrillation. Denies chest pain. No known drug allergies. Previous workup negative for PE. Prescribed aspirin 81mg daily."
Three weeks later, he was readmitted. The triage nurse used a clinical decision support tool that scraped his chart and flagged him as low risk for thromboembolic events - because the system had parsed "no history of atrial fibrillation" and extracted "atrial fibrillation" as a positive diagnosis. The negation "no history of" was silently ignored. The patient was not anticoagulated. He coded on day 2.
This is not a hypothetical. Negation handling is one of the oldest and most critical problems in clinical NLP, documented in the literature since 2001. The NegEx algorithm, a rule-based negation detection system published by Wendy Chapman at the University of Pittsburgh, is still used in production healthcare systems today precisely because the consequences of getting this wrong are catastrophic.
Clinical text is unlike any other domain. It is written under time pressure, full of abbreviations ("SOB" for shortness of breath, "hx" for history, "Rx" for prescription), specialized jargon, misspellings, and temporal qualifications. A patient chart spanning a 10-day hospitalization might contain 200 individual text entries from nurses, physicians, pharmacists, and therapists - each written in a slightly different style, often contradicting each other as the patient's condition evolves.
Unlocking the information in this unstructured text is one of the highest-value applications of NLP in healthcare. Over 70% of clinically relevant information in an EHR is in free-text form, not in structured fields. A patient's smoking status, family history, the details of their pain, their functional status, their concerns - these are in the notes, not in checkboxes. Building systems that can accurately extract this information, at scale, across millions of patients, is what clinical NLP is about.
The market opportunity is enormous. The global clinical NLP market is projected to exceed $5 billion by 2028. Health systems use it for automated ICD-10 coding (saving billions in manual coding costs), quality measurement, adverse drug event surveillance, clinical trial recruitment, and real-world evidence generation. Getting the engineering right is both technically demanding and genuinely consequential.
Why This Exists - The Structured vs Unstructured Gap
Electronic Health Records were designed around billing, not clinical intelligence. The structured fields in an EHR - diagnosis codes, procedure codes, medication orders - are optimized for reimbursement workflows. They systematically miss the clinical nuance that matters for prediction and research.
Consider a patient with type 2 diabetes. Their EHR has an ICD-10 code E11.9 (type 2 diabetes without complications). But the physician's note from last month says: "Poorly controlled T2DM despite triple therapy. A1c trending up from 7.8 to 9.1 over 6 months. Patient reports inconsistent medication adherence due to cost concerns. Starting GLP-1 agonist today, discussed dietary modifications." This information - clinical trajectory, adherence barriers, treatment response - is nowhere in the structured data.
This is why clinical NLP exists: to extract the signal that billing-optimized structured data cannot capture.
Historical Context
Clinical NLP predates the deep learning era by decades. The field traces its roots to MEDLINE and the development of Medical Language Processing systems in the 1970s and 1980s at research groups like Columbia University and MIT. The Medical Language Extraction and Encoding system (MedLEE), developed by Carol Friedman at Columbia in the 1990s, was one of the earliest clinical NLP systems used in production for radiology report processing.
The 2000s saw the development of several influential open-source tools: the National Center for Biomedical Ontology's tools, cTAKES (Clinical Text Analysis and Knowledge Extraction System) from Mayo Clinic in 2010, and MedSpaCy in later years. These rule-based and hybrid systems are still widely deployed.
The MIMIC-III dataset, released by MIT in 2016, was transformational. It provided de-identified EHR data from over 40,000 ICU patients, including free-text clinical notes. This enabled the first large-scale NLP research on real ICU notes and became the benchmark dataset for clinical NLP through the late 2010s.
BioBERT (Lee et al., 2019) marked the transition to deep learning dominance. By pretraining BERT on PubMed abstracts and PMC full-text articles, then fine-tuning on NLP benchmarks, BioBERT achieved new state-of-the-art on multiple biomedical NLP tasks. ClinicalBERT (Alsentzer et al., 2019) took this further by pretraining on MIMIC-III clinical notes, capturing the informal language, abbreviations, and note-specific syntax that PubMed abstracts do not contain.
PubMedBERT (Gu et al., 2021) demonstrated that pretraining from scratch on biomedical text, rather than starting from general-domain BERT and continuing pretraining, produced better results - the domain gap between Wikipedia/BookCorpus (BERT's training data) and biomedical literature is large enough that domain-specific pretraining from scratch beats continued pretraining.
Core Concepts
The Clinical NLP Stack
Clinical NLP systems are typically built as pipelines of components, each handling a specific aspect of the text. Understanding the pipeline structure is critical for building production systems.
Raw clinical text
-> Sentence splitting
-> Tokenization
-> Named Entity Recognition (NER)
-> Negation detection
-> Temporal normalization
-> Coreference resolution
-> Relation extraction
-> Structured output
Each stage is a potential failure mode. The overall system performance is bounded by the weakest stage.
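As a rough illustration of how these stages compose, here is a minimal, hypothetical pipeline skeleton; the stage functions named in the final comment are placeholders, not a real library API:
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class PipelineDoc:
    """Carries the note text and the output accumulated by each pipeline stage."""
    text: str
    sentences: List[str] = field(default_factory=list)
    entities: List[dict] = field(default_factory=list)

def run_pipeline(
    text: str,
    stages: List[Callable[["PipelineDoc"], "PipelineDoc"]],
) -> "PipelineDoc":
    """Apply each stage in order; any stage that degrades the doc degrades everything downstream."""
    doc = PipelineDoc(text=text)
    for stage in stages:
        doc = stage(doc)
    return doc

# Hypothetical usage with placeholder stage functions:
# result = run_pipeline(note_text, [split_sentences, run_ner, apply_negation, normalize_dates])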
Named Entity Recognition for Clinical Text
NER in the clinical domain identifies spans of text that correspond to medical entities: medications, diagnoses, procedures, anatomical locations, lab values, dosages.
The classic sequence labeling formulation uses BIO tagging: each token is labeled B (beginning of an entity), I (inside an entity), or O (outside any entity). For a sentence like "started metformin 500mg twice daily," the labels would be:
started O
metformin B-MEDICATION
500mg B-DOSAGE
twice B-FREQUENCY
daily I-FREQUENCY
Modern NER models use transformer-based token classification. The input sentence is tokenized, passed through BERT (or a clinical BERT variant), and a linear classification head on top of the token representations predicts the BIO label for each token.
The loss function is standard cross-entropy over token labels:
\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T}\sum_{c=1}^{C} y_{t,c}\,\log p_{t,c}
where T is the sequence length, C is the number of label classes, y_{t,c} is 1 if token t has gold label c (and 0 otherwise), and p_{t,c} is the model's predicted probability of token t having label c.
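In PyTorch this is the standard cross_entropy call with ignore_index, the same -100 convention the fine-tuning code later in this lesson relies on; the shapes below are illustrative:
import torch
import torch.nn.functional as F

# logits: (batch, seq_len, num_labels) from the token classification head
# labels: (batch, seq_len), with -100 marking positions excluded from the loss
batch, seq_len, num_labels = 2, 16, 15
logits = torch.randn(batch, seq_len, num_labels)
labels = torch.randint(0, num_labels, (batch, seq_len))
labels[:, 0] = -100  # e.g. [CLS] and continuation-subword positions

loss = F.cross_entropy(
    logits.view(-1, num_labels),   # flatten to (batch * seq_len, num_labels)
    labels.view(-1),               # flatten to (batch * seq_len,)
    ignore_index=-100,             # -100 positions do not contribute to the sum
)
print(loss.item())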
Negation Detection - NegEx and Beyond
Negation in clinical text is pervasive. Studies estimate that 20-30% of clinical findings mentioned in notes are negated. The word "no" in "no fever" negates the finding. The phrase "denies" in "denies chest pain" negates. "History of" versus "no history of" can mean the opposite.
The NegEx algorithm (Chapman et al., 2001) is a simple but effective rule-based approach. It maintains three lists:
- Pseudo-negation terms: phrases that look like negations but are not ("not only," "not just")
- Pre-negation terms: negation words that appear before the finding ("no," "without," "denies," "negative for")
- Post-negation terms: negation words that appear after the finding ("absent," "ruled out," "unlikely")
Given a candidate medical concept, NegEx searches a window of 5 tokens before and after the concept for negation trigger phrases. If a trigger is found (and no sentence boundary or conjunction intervenes), the concept is marked as negated.
For production use, this rule-based foundation is extended with transformer-based models that can handle complex nested negations and context-dependent cases. NegBERT (Khandelwal & Sawant, 2020) fine-tunes BERT specifically for negation detection, outperforming NegEx on complex clinical sentences.
De-identification - HIPAA and Protected Health Information
Before any clinical NLP research or model training, data must be de-identified. HIPAA's Safe Harbor de-identification standard specifies 18 categories of Protected Health Information (PHI) that must be removed or generalized: names, geographic subdivisions smaller than a state, all elements of dates (except year) directly related to an individual and all ages over 89, phone numbers, fax numbers, email addresses, Social Security numbers, medical record numbers, health plan beneficiary numbers, account numbers, certificate/license numbers, vehicle identifiers, device identifiers and serial numbers, URLs, IP addresses, biometric identifiers, full-face photographs, and any other unique identifying number or code.
Clinical NLP systems that operate on identifiable data must run in secure, HIPAA-compliant environments (BAAs with cloud providers, network segmentation, audit logging). De-identified research datasets (MIMIC-III/IV, i2b2) can be processed without these constraints, but only after verifying de-identification completeness.
Automated de-identification itself is an NLP task: detecting PHI mentions and either removing them or replacing them with synthetic equivalents. PHILTER (Norgeot et al., 2020) and the i2b2 2014 de-identification challenge systems are standard benchmarks. F1 scores above 98% are required for production de-identification.
BERT Variants for Clinical NLP
The choice of pretrained model significantly affects downstream performance on clinical tasks. The key options:
BioBERT (Lee et al., 2019): BERT-base initialized from BERT-base-uncased, then continued pretraining on PubMed abstracts (4.5B tokens) and PMC full-text articles (13.5B tokens). Strong on biomedical NER and relation extraction from scientific literature. Less suited for clinical notes, which differ substantially from academic prose.
ClinicalBERT (Alsentzer et al., 2019): BERT-base initialized from BioBERT, then continued pretraining on all clinical notes from MIMIC-III (2 billion tokens). Strong for clinical NER, ICD coding, and mortality prediction from notes. Best choice for tasks involving inpatient clinical notes.
PubMedBERT (Gu et al., 2021): BERT-base-sized model pretrained from scratch (random initialization) on all PubMed abstracts (3.1B tokens), without Wikipedia or BookCorpus. Demonstrates that domain-specific pretraining from scratch outperforms domain adaptation starting from general BERT. Best for biomedical literature NLP.
GatorTron (Yang et al., 2022): Large transformer (8.9B parameters) pretrained on 90 billion words of clinical text from University of Florida Health system. State-of-the-art on most clinical NLP benchmarks. Requires significant GPU resources for fine-tuning.
For most production use cases with limited GPU budget, ClinicalBERT fine-tuned on task-specific labeled data is the pragmatic choice.
ICD-10 Coding Automation
Medical billing requires assigning ICD-10 diagnosis and procedure codes to every encounter. This is traditionally done by human medical coders who read discharge summaries and assign codes from the 70,000+ ICD-10 code set. Automated ICD coding is a multi-label classification problem: given a discharge summary, predict all applicable ICD codes.
The challenge is the extremely long tail of code distribution. The top 50 codes cover perhaps 30% of encounters; the remaining codes appear rarely, with some codes appearing only a few times in large multi-hospital datasets.
State-of-the-art systems (CAML, MultiResCNN, LAAT) combine convolutional or attention architectures with label-attention mechanisms: for each target code, the model learns to attend to the text spans most relevant to that code. This is efficient because most codes depend only on specific phrases rather than the entire note.
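A minimal sketch of the label-attention idea, with illustrative dimensions; this shows the general mechanism, not the exact CAML or LAAT architecture:
import torch
import torch.nn as nn

class LabelAttentionHead(nn.Module):
    """One learned query vector per ICD code attends over the note's token representations."""
    def __init__(self, hidden_size: int, num_codes: int):
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(num_codes, hidden_size))
        self.output = nn.Linear(hidden_size, 1)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden_size) from a text encoder
        # scores -> attention: (batch, num_codes, seq_len), one distribution per code
        scores = torch.einsum("lh,bsh->bls", self.label_queries, token_states)
        attention = torch.softmax(scores, dim=-1)
        # per-code context vectors: (batch, num_codes, hidden_size)
        context = torch.bmm(attention, token_states)
        # one logit per code: (batch, num_codes); apply sigmoid per code downstream
        return self.output(context).squeeze(-1)

# logits = LabelAttentionHead(hidden_size=768, num_codes=50)(encoder_output)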
Code Examples
Clinical NER Pipeline with HuggingFace Transformers
from transformers import (
AutoTokenizer, AutoModelForTokenClassification,
pipeline, TrainingArguments, Trainer
)
from datasets import Dataset, DatasetDict
import numpy as np
from typing import List, Dict, Tuple
import re
# -----------------------------------------------
# Label schema for medication extraction NER
# -----------------------------------------------
# We use BIO tagging with 7 entity types:
# MEDICATION, DOSAGE, FREQUENCY, ROUTE, DURATION, REASON, ADE (adverse drug event)
LABEL_LIST = [
"O",
"B-MEDICATION", "I-MEDICATION",
"B-DOSAGE", "I-DOSAGE",
"B-FREQUENCY", "I-FREQUENCY",
"B-ROUTE", "I-ROUTE",
"B-DURATION", "I-DURATION",
"B-REASON", "I-REASON",
"B-ADE", "I-ADE",
]
LABEL2ID = {l: i for i, l in enumerate(LABEL_LIST)}
ID2LABEL = {i: l for l, i in LABEL2ID.items()}
class ClinicalNERDataset:
"""
Tokenize and align labels for clinical NER.
The challenge: BERT uses WordPiece tokenization, which splits words
into subwords. "metformin" might become ["met", "##form", "##in"].
Original word-level labels must be realigned to subword tokens.
Standard practice: only label the first subword of each word.
"""
def __init__(self, tokenizer_name: str = "emilyalsentzer/Bio_ClinicalBERT"):
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
def tokenize_and_align_labels(
self,
examples: Dict,
label_all_tokens: bool = False
) -> Dict:
"""
Args:
examples: dict with 'tokens' (list of word lists) and
'ner_tags' (list of label lists)
label_all_tokens: If True, label all subwords.
If False (standard), only label first subword.
"""
tokenized_inputs = self.tokenizer(
examples["tokens"],
truncation=True,
is_split_into_words=True,
max_length=512,
padding="max_length",
)
all_labels = []
for i, labels in enumerate(examples["ner_tags"]):
word_ids = tokenized_inputs.word_ids(batch_index=i)
previous_word_idx = None
label_ids = []
for word_idx in word_ids:
if word_idx is None:
# Special tokens [CLS], [SEP], [PAD] -> -100 (ignored in loss)
label_ids.append(-100)
elif word_idx != previous_word_idx:
# First subword of a word: use the true label
label_ids.append(LABEL2ID[labels[word_idx]])
else:
# Continuation subword: -100 if only labeling first subword
label_ids.append(
LABEL2ID[labels[word_idx]] if label_all_tokens else -100
)
previous_word_idx = word_idx
all_labels.append(label_ids)
tokenized_inputs["labels"] = all_labels
return tokenized_inputs
def compute_ner_metrics(eval_pred) -> Dict[str, float]:
"""
Compute span-level F1, precision, recall for NER evaluation.
Span-level (not token-level) is the standard for NER reporting.
"""
from seqeval.metrics import (
classification_report, f1_score, precision_score, recall_score
)
predictions, labels = eval_pred
predictions = np.argmax(predictions, axis=2)
# Convert ids back to label strings, removing -100 padding
true_labels = [
[ID2LABEL[l] for l in label if l != -100]
for label in labels
]
pred_labels = [
[ID2LABEL[p] for p, l in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
return {
"precision": precision_score(true_labels, pred_labels),
"recall": recall_score(true_labels, pred_labels),
"f1": f1_score(true_labels, pred_labels),
}
def train_clinical_ner(
train_data: List[Dict],
val_data: List[Dict],
model_name: str = "emilyalsentzer/Bio_ClinicalBERT",
output_dir: str = "./clinical_ner_model",
epochs: int = 5,
):
"""
Fine-tune ClinicalBERT for medication NER.
Expected data format:
[
{
"tokens": ["Patient", "started", "metformin", "500mg", "BID"],
"ner_tags": ["O", "O", "B-MEDICATION", "B-DOSAGE", "B-FREQUENCY"]
},
...
]
"""
ner_dataset_builder = ClinicalNERDataset(tokenizer_name=model_name)
tokenizer = ner_dataset_builder.tokenizer
# Convert list of dicts to HuggingFace Dataset format
train_ds = Dataset.from_list(train_data)
val_ds = Dataset.from_list(val_data)
dataset = DatasetDict({"train": train_ds, "validation": val_ds})
# Tokenize and align labels
tokenized_dataset = dataset.map(
ner_dataset_builder.tokenize_and_align_labels,
batched=True,
remove_columns=dataset["train"].column_names,
)
# Load model
model = AutoModelForTokenClassification.from_pretrained(
model_name,
num_labels=len(LABEL_LIST),
id2label=ID2LABEL,
label2id=LABEL2ID,
)
training_args = TrainingArguments(
output_dir=output_dir,
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=epochs,
weight_decay=0.01,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="f1",
warmup_ratio=0.1,
fp16=True, # Mixed precision if GPU supports it
logging_steps=50,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["validation"],
tokenizer=tokenizer,
compute_metrics=compute_ner_metrics,
)
trainer.train()
return trainer
NegEx - Negation Detection
import re
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class MedicalConcept:
text: str
start: int
end: int
label: str
negated: bool = False
uncertain: bool = False
class NegExDetector:
"""
Implementation of the NegEx algorithm (Chapman et al., 2001).
Extended with uncertainty detection.
The algorithm searches a window around each detected concept
for trigger phrases that indicate negation or uncertainty.
"""
# Phrases that APPEAR to negate but don't
PSEUDO_NEGATION = [
"no increase", "no change", "no significant change",
"not only", "not just", "gram negative", "not certain",
"no further", "without further",
]
# Phrases appearing BEFORE the concept that negate it
PRE_NEGATION = [
"no", "not", "without", "denies", "denied", "deny",
"no evidence of", "no evidence for", "absence of",
"negative for", "free of", "rules out", "ruled out",
"rule out", "unremarkable for", "unlikely",
"cannot see", "cannot identify", "fails to reveal",
"no sign of", "no signs of", "no complaint of",
"not complain of", "does not appear", "does not indicate",
]
# Phrases appearing AFTER the concept that negate it
POST_NEGATION = [
"absent", "was ruled out", "has been ruled out",
"is ruled out", "are ruled out", "is not present",
"was not present", "is not demonstrated",
"is not appreciated", "is not seen",
]
# Uncertainty triggers
UNCERTAINTY_TRIGGERS = [
"possible", "possibly", "probable", "probably", "likely",
"suspected", "suspect", "questionable", "question of",
"cannot exclude", "may represent", "differential includes",
"consider", "cannot rule out", "rule out",
]
# Window size in tokens
WINDOW_SIZE = 5
def __init__(self):
# Compile patterns sorted by length (longer patterns first, more specific)
self._pre_neg_patterns = sorted(
self.PRE_NEGATION, key=len, reverse=True
)
self._post_neg_patterns = sorted(
self.POST_NEGATION, key=len, reverse=True
)
self._pseudo_patterns = sorted(
self.PSEUDO_NEGATION, key=len, reverse=True
)
self._uncertainty_patterns = sorted(
self.UNCERTAINTY_TRIGGERS, key=len, reverse=True
)
def _get_sentence_window(
self, text: str, concept: MedicalConcept
) -> tuple:
"""
Get the text window before and after the concept within the sentence.
Respects sentence boundaries (., !, ?).
"""
# Find sentence boundaries
# A sentence ends at '.', '!', '?' followed by whitespace or end
sentence_end_pattern = re.compile(r'[.!?]\s')
# Get pre-concept window
pre_text = text[:concept.start].lower()
# Find last sentence boundary before concept
boundaries = list(sentence_end_pattern.finditer(pre_text))
if boundaries:
last_boundary = boundaries[-1].end()
pre_window = pre_text[last_boundary:]
else:
pre_window = pre_text
# Get post-concept window
post_text = text[concept.end:].lower()
# Find first sentence boundary after concept
boundary_match = sentence_end_pattern.search(post_text)
if boundary_match:
post_window = post_text[:boundary_match.start()]
else:
post_window = post_text
# Further limit to WINDOW_SIZE tokens
pre_tokens = pre_window.split()[-self.WINDOW_SIZE:]
post_tokens = post_window.split()[:self.WINDOW_SIZE]
return ' '.join(pre_tokens), ' '.join(post_tokens)
def detect(
self, text: str, concepts: List[MedicalConcept]
) -> List[MedicalConcept]:
"""
Apply NegEx to a list of extracted concepts in a clinical text.
Returns concepts with negated/uncertain flags updated.
"""
for concept in concepts:
pre_window, post_window = self._get_sentence_window(text, concept)
# Check for pseudo-negation first (overrides negation)
is_pseudo = any(
p in pre_window or p in post_window
for p in self._pseudo_patterns
)
if is_pseudo:
continue
# Check pre-negation triggers
is_negated = any(
pre_window.endswith(trigger) or
f"{trigger} " in pre_window or
pre_window == trigger
for trigger in self._pre_neg_patterns
)
# Check post-negation triggers
if not is_negated:
is_negated = any(
post_window.startswith(trigger) or
f" {trigger}" in post_window
for trigger in self._post_neg_patterns
)
# Check uncertainty
is_uncertain = any(
trigger in pre_window or trigger in post_window
for trigger in self._uncertainty_patterns
)
concept.negated = is_negated
concept.uncertain = is_uncertain and not is_negated
return concepts
# Example usage
def demo_negex():
negex = NegExDetector()
test_cases = [
("Patient has no fever and denies chest pain.",
[MedicalConcept("fever", 15, 20, "SYMPTOM"),
MedicalConcept("chest pain", 32, 42, "SYMPTOM")]),
("CT chest negative for pulmonary embolism.",
[MedicalConcept("pulmonary embolism", 22, 40, "DIAGNOSIS")]),
("History of hypertension, no current symptoms.",
[MedicalConcept("hypertension", 11, 23, "DIAGNOSIS")]),
("Possible pneumonia in right lower lobe.",
[MedicalConcept("pneumonia", 9, 18, "DIAGNOSIS")]),
]
for text, concepts in test_cases:
result = negex.detect(text, concepts)
print(f"\nText: {text}")
for c in result:
status = "NEGATED" if c.negated else ("UNCERTAIN" if c.uncertain else "AFFIRMED")
print(f" {c.text} ({c.label}): {status}")
De-identification Pipeline
import re
from typing import List, Tuple
from dataclasses import dataclass
@dataclass
class PHISpan:
start: int
end: int
phi_type: str
original_text: str
class ClinicalDeidentifier:
"""
Rule-based PHI detection for the 18 HIPAA Safe Harbor categories.
In production, this would be augmented with a trained NER model
for high-recall detection of names and institution names.
This demonstrates the rule-based components; transformer-based
name detection is handled by a separate NER model.
"""
# Regex patterns for structured PHI
PATTERNS = {
"PHONE": re.compile(
r'\b(\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b'
),
"SSN": re.compile(
r'\b\d{3}[-\s]?\d{2}[-\s]?\d{4}\b'
),
"DATE": re.compile(
r'\b(?:0?[1-9]|1[0-2])[/\-.](?:0?[1-9]|[12]\d|3[01])[/\-.](?:19|20)\d{2}\b'
r'|\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|'
r'Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|'
r'Nov(?:ember)?|Dec(?:ember)?)\s+\d{1,2},?\s+\d{4}\b',
re.IGNORECASE
),
"EMAIL": re.compile(
r'\b[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}\b'
),
"IP_ADDRESS": re.compile(
r'\b(?:\d{1,3}\.){3}\d{1,3}\b'
),
"URL": re.compile(
r'https?://[^\s<>"{}|\\^`\[\]]+'
),
"MRN": re.compile(
# Common MRN formats: MRN: 12345678 or MR# 12345678
r'(?:MRN|MR#|Medical Record(?:\s+Number)?)[:\s#]*(\d{5,10})',
re.IGNORECASE
),
"ZIP_5DIGIT": re.compile(
r'\b\d{5}(?:-\d{4})?\b'
),
"AGE_OVER_89": re.compile(
r'\b(9\d|1[01]\d)[-\s]?(?:year(?:s)?[-\s]old|y/o|yo)\b',
re.IGNORECASE
),
}
# Replacement tokens by PHI type
REPLACEMENT = {
"PHONE": "[PHONE]",
"SSN": "[SSN]",
"DATE": "[DATE]",
"EMAIL": "[EMAIL]",
"IP_ADDRESS": "[IP]",
"URL": "[URL]",
"MRN": "[MRN]",
"ZIP_5DIGIT": "[ZIP]",
"AGE_OVER_89": "[AGE_OVER_89]",
"NAME": "[NAME]",
"LOCATION": "[LOCATION]",
}
def detect_phi(self, text: str) -> List[PHISpan]:
"""Detect all PHI spans in text using rule-based patterns."""
spans = []
for phi_type, pattern in self.PATTERNS.items():
for match in pattern.finditer(text):
spans.append(PHISpan(
start=match.start(),
end=match.end(),
phi_type=phi_type,
original_text=match.group()
))
# Sort by start position
spans.sort(key=lambda s: s.start)
# Remove overlapping spans (keep the earliest-starting span when spans overlap)
non_overlapping = []
last_end = -1
for span in spans:
if span.start >= last_end:
non_overlapping.append(span)
last_end = span.end
return non_overlapping
def deidentify(self, text: str, phi_spans: List[PHISpan]) -> str:
"""Replace detected PHI spans with placeholder tokens."""
result = []
last_end = 0
for span in phi_spans:
# Keep text before this span
result.append(text[last_end:span.start])
# Replace PHI with placeholder
result.append(self.REPLACEMENT.get(span.phi_type, "[PHI]"))
last_end = span.end
# Keep remaining text after last span
result.append(text[last_end:])
return ''.join(result)
def process(self, text: str) -> Tuple[str, List[PHISpan]]:
"""Full de-identification pipeline."""
phi_spans = self.detect_phi(text)
deidentified_text = self.deidentify(text, phi_spans)
return deidentified_text, phi_spans
ICD-10 Coding Pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn as nn
from typing import List, Tuple
import numpy as np
class ICD10Coder:
"""
Multi-label ICD-10 code assignment from clinical discharge notes.
This example uses a plain multi-label classification head on ClinicalBERT
(one sigmoid output per code). Label-attention architectures (CAML, LAAT)
replace this head with per-code attention over token representations,
as sketched in the Core Concepts section.
"""
def __init__(
self,
model_checkpoint: str,
code_list: List[str],
threshold: float = 0.5
):
self.tokenizer = AutoTokenizer.from_pretrained(
"emilyalsentzer/Bio_ClinicalBERT"
)
self.code_list = code_list
self.threshold = threshold
self.num_labels = len(code_list)
# In production, this would be a fine-tuned model loaded from checkpoint
# For illustration, showing the architecture
self.model = self._build_model()
def _build_model(self) -> nn.Module:
return AutoModelForSequenceClassification.from_pretrained(
"emilyalsentzer/Bio_ClinicalBERT",
num_labels=self.num_labels,
problem_type="multi_label_classification"
)
def predict(self, note_text: str) -> List[Tuple[str, float]]:
"""
Predict ICD-10 codes for a discharge note.
Long notes (>512 tokens) are handled by chunking and
averaging predictions across chunks.
"""
inputs = self.tokenizer(
note_text,
max_length=512,
truncation=True,
padding="max_length",
return_tensors="pt"
)
with torch.no_grad():
outputs = self.model(**inputs)
probs = torch.sigmoid(outputs.logits).squeeze().numpy()
# Return codes above threshold with their probabilities
predicted = [
(self.code_list[i], float(probs[i]))
for i in range(self.num_labels)
if probs[i] >= self.threshold
]
return sorted(predicted, key=lambda x: x[1], reverse=True)
def predict_long_note(
self, note_text: str, chunk_overlap: int = 50
) -> List[Tuple[str, float]]:
"""
Handle notes longer than 512 tokens using sliding window.
Max pool predictions across chunks.
"""
tokens = self.tokenizer.tokenize(note_text)
chunk_size = 510 # Leave room for [CLS] and [SEP]
all_probs = []
for i in range(0, len(tokens), chunk_size - chunk_overlap):
chunk = tokens[i:i + chunk_size]
chunk_text = self.tokenizer.convert_tokens_to_string(chunk)
inputs = self.tokenizer(
chunk_text,
max_length=512,
truncation=True,
padding="max_length",
return_tensors="pt"
)
with torch.no_grad():
outputs = self.model(**inputs)
probs = torch.sigmoid(outputs.logits).squeeze().numpy()
all_probs.append(probs)
# Max pool: if any chunk predicts a code, the note has that code
aggregated = np.max(all_probs, axis=0)
predicted = [
(self.code_list[i], float(aggregated[i]))
for i in range(self.num_labels)
if aggregated[i] >= self.threshold
]
return sorted(predicted, key=lambda x: x[1], reverse=True)
Production Engineering Notes
Handling Clinical Abbreviations
Clinical text is dense with abbreviations that are highly context-dependent. "MS" can mean Multiple Sclerosis, Morphine Sulfate, Mitral Stenosis, or Mental Status. "CP" is Chest Pain or Cerebral Palsy. "SOB" is Shortness of Breath.
Production systems combine medical abbreviation dictionaries (UMLS maintains an extensive abbreviation inventory) with contextual disambiguation: given the candidate expansions for an abbreviation, use contextual (BERT) embeddings of the surrounding text to select the intended sense. Alternatively, training NER models on data that contains abbreviations teaches the model to handle them directly, without explicit expansion.
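A minimal sketch of the embedding-similarity approach, assuming a small hypothetical abbreviation dictionary (a real system would pull candidates from UMLS) and reusing the Bio_ClinicalBERT checkpoint from earlier:
import torch
from transformers import AutoTokenizer, AutoModel

# Hypothetical mini-dictionary of candidate expansions
ABBREVIATIONS = {
    "MS": ["multiple sclerosis", "morphine sulfate", "mitral stenosis", "mental status"],
    "CP": ["chest pain", "cerebral palsy"],
}

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

def embed(text: str) -> torch.Tensor:
    """Mean-pooled ClinicalBERT embedding of a short text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

def expand_abbreviation(abbrev: str, sentence: str) -> str:
    """Pick the candidate expansion whose embedding is closest to the sentence context."""
    context = embed(sentence)
    scored = [
        (torch.cosine_similarity(context, embed(candidate), dim=0).item(), candidate)
        for candidate in ABBREVIATIONS[abbrev]
    ]
    return max(scored)[1]

# Example call (output depends on the model, so treat it as illustrative):
# expand_abbreviation("MS", "Pt with chronic back pain, on MS 15mg q4h prn.")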
Long Document Handling
Clinical notes can be very long. ICU nursing notes, discharge summaries for complex patients, and H&P documents frequently exceed 1,000 words - well beyond BERT's 512-token limit. Standard strategies:
Chunking with max-pooling: split into overlapping 512-token windows, run inference on each, pool predictions. Simple and works well for multi-label classification (ICD coding).
Hierarchical models: encode each sentence independently with BERT, then apply a second transformer or LSTM over sentence embeddings. Captures document structure better than flat chunking.
Longformer / BigBird: transformer architectures with sparse attention that scale to 4096+ tokens. LongformerForTokenClassification is a practical choice for NER on long clinical notes.
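A minimal sketch of the Longformer option for long-note NER, reusing LABEL_LIST, ID2LABEL, and LABEL2ID from the NER example above; long_note_text is an assumed input string:
from transformers import AutoTokenizer, LongformerForTokenClassification

long_tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
long_model = LongformerForTokenClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=len(LABEL_LIST),   # same BIO label schema as the earlier example
    id2label=ID2LABEL,
    label2id=LABEL2ID,
)

# Same token-classification setup as before, but an entire note fits in one pass
inputs = long_tokenizer(
    long_note_text,               # assumed: a full discharge summary string
    truncation=True,
    max_length=4096,
    return_tensors="pt",
)
logits = long_model(**inputs).logits  # (1, seq_len, num_labels)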
Calibration and Uncertainty
Clinical NLP models that are overconfident cause clinical errors. A medication extraction model that outputs "metformin" with 0.99 probability on a note that says "no metformin" (where negation was missed) gives clinicians false confidence.
Calibration approaches: temperature scaling on the output logits, conformal prediction for set-valued outputs with coverage guarantees, and Monte Carlo dropout for uncertainty estimation. Report calibration curves (reliability diagrams) alongside F1 in model evaluation.
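A minimal sketch of temperature scaling in the style of Guo et al. (2017): fit a single scalar T on held-out validation logits so that softmax(logits / T) better matches observed accuracy.
import torch

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """
    Fit one temperature T on validation logits (n_examples, n_classes)
    and integer labels (n_examples,) by minimizing NLL of scaled logits.
    """
    log_t = torch.zeros(1, requires_grad=True)  # optimize log(T) so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return float(log_t.exp())

# At inference: calibrated_probs = torch.softmax(test_logits / T, dim=-1) with the fitted T.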
Common Mistakes
:::danger Ignoring Negation in Production Systems
The most common and dangerous bug in clinical NLP: treating every extracted entity as affirmed (positively present). A system that extracts "diabetes" from "no history of diabetes" and marks it as a positive diagnosis can cause incorrect risk stratification, wrong medication recommendations, and patient harm.
Fix: Negation detection is not optional. Integrate NegEx or a trained negation model as a mandatory step in every clinical NER pipeline. Add explicit regression tests covering negation patterns. :::
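A sketch of what such a regression test might look like, reusing the NegExDetector and MedicalConcept classes from the code above (character offsets are computed for these exact strings):
def test_negation_regression():
    negex = NegExDetector()

    # An explicitly negated finding must come back negated
    text = "No evidence of pneumonia."
    (concept,) = negex.detect(text, [MedicalConcept("pneumonia", 15, 24, "DIAGNOSIS")])
    assert concept.negated

    # An affirmed finding must NOT be marked negated
    text = "Patient reports fever."
    (concept,) = negex.detect(text, [MedicalConcept("fever", 16, 21, "SYMPTOM")])
    assert not concept.negated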
:::danger Training on Non-De-identified Data Without HIPAA Compliance
Using raw clinical data from an EHR without proper de-identification or without operating under a HIPAA Business Associate Agreement (BAA) is a federal violation. The penalties are significant: up to $1.9 million per violation category per year, and criminal charges for willful neglect.
Fix: Never process identifiable PHI outside of a HIPAA-compliant environment. For model development, use properly de-identified datasets (MIMIC-III/IV under PhysioNet DUA, i2b2 datasets under their DUA). For production systems, ensure cloud infrastructure is covered by a BAA. :::
:::warning Token Alignment Bugs with BERT WordPiece Tokenization
A very common implementation bug: training a token classifier with BERT tokenization but incorrectly aligning labels to subword tokens. If "metformin" tokenizes to ["met", "##form", "##in"] and you assign B-MEDICATION to all three subword tokens, the model learns inconsistently. The standard is to label only the first subword and assign -100 (ignored in loss) to continuation subwords.
Fix: Use the word_ids() method from HuggingFace tokenizers (shown in the code examples above) and carefully map word-level labels to subword tokens. Write unit tests with known tokenization cases.
:::
:::warning Evaluating at Token Level Instead of Span Level
A model that correctly labels 4 out of 5 subwords in a single entity span achieves 80% token-level accuracy on that span, but 0% span-level F1 (the span is wrong). Token-level metrics dramatically overestimate real NER performance.
Fix: Always use span-level evaluation with the seqeval library for NER. Report entity-level F1, not token-level F1. The seqeval library implements the CoNLL-2003 evaluation protocol which is the standard for NER benchmarking.
:::
Interview Questions and Answers
Q1: What makes clinical NLP different from general-domain NLP, and why can't you just use off-the-shelf BERT?
Clinical text differs from general text along every dimension that matters for NLP:
Vocabulary: clinical notes contain thousands of medical abbreviations (SOB, HTN, DM2, CABG), brand drug names, procedure codes, and specialized terminology not present in Wikipedia or BookCorpus - BERT's training data.
Syntax: clinical text is telegraphic, not grammatical. "Pt c/o SOB x3d, +DOE, -orthopnea" is a complete nursing note. Standard sentence splitters and POS taggers fail on this.
Discourse: clinical notes follow note-type specific templates (SOAP format, systems review) that differ from narrative prose.
Negation density: 20-30% of clinical mentions are negated, vs roughly 5% in news text. A general BERT model has no learned representation for clinical negation patterns.
These factors mean that general BERT significantly underperforms domain-adapted models like ClinicalBERT on clinical tasks. ClinicalBERT, pretrained on 2B tokens of MIMIC-III notes, encodes the statistical regularities of clinical language that general BERT cannot capture.
Q2: Explain the NegEx algorithm. What are its failure modes?
NegEx searches a fixed-size token window around each detected medical concept for negation trigger phrases. It maintains lists of pre-negation triggers (appearing before the concept) and post-negation triggers (appearing after). If a trigger is found within the window and no sentence boundary intervenes, the concept is marked negated.
Failure modes:
Long-range negation: "The patient was admitted with fever, tachycardia, and hypotension, none of which were present at discharge." NegEx's 5-token window misses "none of which" if the concept span is far from the trigger.
Scope ambiguity: "No fever or chills" - both fever and chills are negated. NegEx handles this via window size but can fail on complex conjunctions.
Double negation: "Cannot rule out pneumonia" is an assertion, not a negation. NegEx partially handles this with pseudo-negation lists, but coverage is incomplete.
Conditional negation: "If fever develops, start antibiotics" - fever here is in a conditional context, not actually present or absent.
Modern transformer-based negation models (NegBERT) outperform NegEx on complex cases by learning these patterns from annotated data rather than relying on handcrafted trigger lists.
Q3: How would you design a de-identification pipeline that achieves F1 above 98% for production use?
A production de-identification system uses a hybrid approach:
First layer - rule-based: regex patterns for structured PHI (dates, phone numbers, SSNs, emails, MRNs). These achieve near-perfect precision and recall for the patterns they cover.
Second layer - NER model: a ClinicalBERT fine-tuned on annotated de-identification data (i2b2 2014 challenge data is the standard benchmark) for unstructured PHI like person names, organization names, geographic locations. This model handles the free-text cases that rules cannot cover.
Third layer - post-processing: additional passes for residual PHI patterns, consistency checking (if "John" was identified as a name in sentence 1, other occurrences of "John" in the same note should also be flagged).
Validation: use a held-out annotated gold standard dataset. Manually review a sample of de-identified output weekly in production. Track recall specifically: from a compliance perspective, a missed PHI mention is far worse than over-redacting text that is not PHI.
F1 above 98% at the entity level is achievable on the i2b2 benchmark with fine-tuned transformer models. The gap between benchmark performance and production performance is usually due to institution-specific naming conventions and MRN formats not seen in training data.
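A sketch of the merge step between the rule-based and model-based layers, reusing ClinicalDeidentifier and PHISpan from the earlier code; the NER checkpoint path is a placeholder for a model fine-tuned on i2b2-style de-identification data:
from transformers import pipeline

# Placeholder path: a token-classification model fine-tuned for PHI detection
phi_ner = pipeline(
    "token-classification",
    model="./deid_ner_model",          # hypothetical fine-tuned checkpoint
    aggregation_strategy="simple",     # merge subwords into whole-entity spans
)

def hybrid_deidentify(text: str) -> str:
    deid = ClinicalDeidentifier()
    spans = deid.detect_phi(text)                      # layer 1: regex PHI
    for ent in phi_ner(text):                          # layer 2: model PHI (names, locations, ...)
        spans.append(PHISpan(
            start=ent["start"],
            end=ent["end"],
            phi_type=ent["entity_group"],              # e.g. NAME, LOCATION
            original_text=text[ent["start"]:ent["end"]],
        ))
    # Resolve overlaps across both layers (prefer longer spans at the same start), then redact
    spans.sort(key=lambda s: (s.start, -(s.end - s.start)))
    merged, last_end = [], -1
    for span in spans:
        if span.start >= last_end:
            merged.append(span)
            last_end = span.end
    return deid.deidentify(text, merged)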
Q4: What is the MIMIC-III dataset and why is it the gold standard for clinical NLP research?
MIMIC-III (Medical Information Mart for Intensive Care III) is a de-identified critical care database containing data from over 40,000 patients admitted to critical care units at Beth Israel Deaconess Medical Center between 2001 and 2012. It includes: complete ICU time series (vital signs, lab values at hourly granularity), all clinical notes (discharge summaries, nursing notes, radiology reports, ECG reports), ICD-9 billing codes, medication orders, procedure records, and demographics.
It is the gold standard because: it is the largest and most comprehensive publicly available clinical EHR dataset, it is properly de-identified under the HIPAA Safe Harbor standard, access is free for researchers after completing a training course and data use agreement, and it has extensive associated literature establishing benchmark results.
The MIMIC-IV release (2020) extended coverage to 2019, added a hospital database, and improved data quality. PhysioNet hosts both datasets.
Q5: How do you handle the extreme label imbalance in ICD-10 coding? Some codes appear in less than 0.1% of cases.
ICD-10 code frequency follows a power law. The top 50 codes might cover 40% of all coded encounters; the bottom 3,000 codes might each appear in fewer than 10 cases in a typical training dataset.
Strategies, in order of effectiveness:
Hierarchical code representation: ICD-10 codes have a natural hierarchy (E11.9 is under E11 is under E10-E14). Training with code hierarchy awareness - predicting at multiple levels of granularity - transfers signal from common parent codes to rare child codes.
Code description embeddings: encode the ICD-10 code description text (each code has a natural language description) using BERT and use this embedding to initialize the classifier head for that code (see the sketch after this answer). Rare codes benefit from semantic similarity to common codes with similar descriptions.
Macro-averaged F1 as training objective: standard binary cross-entropy with pos_weight handles per-code imbalance but does not directly optimize macro-averaged F1 across all codes. Loss functions that better target rare labels (e.g., poly loss, asymmetric loss) improve rare-code performance.
Few-shot learning: treat rare codes as a few-shot classification problem, using meta-learning techniques to generalize from the few available examples.
In practice, most production ICD coding systems focus on the top 1,000-2,000 codes that cover 90%+ of encounter volume, and route rare codes to human coders.
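A minimal sketch of the code-description-embedding strategy from this answer: encode each code's official description with the same encoder, then copy those vectors into the multi-label classifier head as its initial weights. Dimensions assume a BERT-base (768-d) encoder; the code subset is illustrative, not the setup of any published system.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "emilyalsentzer/Bio_ClinicalBERT"
# Hypothetical subset of codes with their descriptions; order defines the label indices
CODE_DESCRIPTIONS = {
    "E11.9": "Type 2 diabetes mellitus without complications",
    "I10": "Essential (primary) hypertension",
    "J18.9": "Pneumonia, unspecified organism",
}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(CODE_DESCRIPTIONS),
    problem_type="multi_label_classification",
)

with torch.no_grad():
    desc_inputs = tokenizer(
        list(CODE_DESCRIPTIONS.values()),
        padding=True, truncation=True, return_tensors="pt",
    )
    # (num_codes, 768) mean-pooled description embeddings from the encoder
    desc_embeddings = model.bert(**desc_inputs).last_hidden_state.mean(dim=1)
    # Initialize the classifier weights (num_codes, 768) with the description vectors
    model.classifier.weight.copy_(desc_embeddings)
    model.classifier.bias.zero_()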
This lesson is part of the Applied AI - AI in Healthcare module. Next: Drug Discovery with AI.
