
Genomics and Protein Folding

The 50-Year Problem That Fell in a Weekend

In 1972, Christian Anfinsen won the Nobel Prize in Chemistry for demonstrating that a protein's three-dimensional structure is entirely determined by its amino acid sequence. The sequence encodes the structure. Somewhere in those letters - A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y - is the complete blueprint for how the chain of amino acids will fold into the precise shape that makes a hemoglobin molecule carry oxygen, a p53 protein suppress tumors, or an ACE2 receptor bind to a coronavirus spike protein.

The catch was that we had no idea how to compute it. The "protein folding problem" - predicting 3D structure from amino acid sequence - was identified as one of the great open problems in biology. Structural biologists spent careers solving single protein structures using X-ray crystallography, cryo-electron microscopy, or NMR spectroscopy. Each structure took months to years and hundreds of thousands of dollars. The Protein Data Bank, the global repository of experimentally determined structures, had accumulated roughly 180,000 structures over 50 years of collective effort by thousands of labs worldwide.

DeepMind's AlphaFold 2, presented at CASP14 in December 2020, effectively solved the single-chain structure prediction problem, achieving a median backbone accuracy of 0.96 angstroms RMSD (Cα) across the CASP14 targets - within the error range of the experimental structures themselves. The computational biology community's reaction was a mixture of awe and mild existential crisis. John Moult, who co-founded CASP, described it as "a stunning advance." The head of structural biology at the Medical Research Council said it was "the most significant contribution" to structural biology in decades.

By 2022, DeepMind and EMBL's European Bioinformatics Institute had published predicted structures for over 200 million proteins - virtually every known protein sequence in UniProt. The AlphaFold Protein Structure Database became the most-used structural biology resource in history within months of launch. A tool trained on sequence data had, in a single release, expanded the known structural landscape of life by more than a factor of 1000.

The implications for medicine, drug discovery, and biology are still unfolding (no pun intended). Understanding what proteins look like enables rational drug design. Identifying structural similarities between pathogen proteins and human proteins explains cross-reactivity. Predicting the impact of mutations on protein structure is now computationally tractable rather than requiring years of experimental work. And the techniques that made AlphaFold possible - attention mechanisms over multiple sequence alignments, geometric deep learning, iterative structure refinement - are now being applied to DNA, RNA, and entire biological pathways.

Why This Exists - Biology's Data Problem

Biology generates data at a scale that overwhelms human analysis. A single whole-genome sequencing run produces 200 GB of raw data. One single-cell RNA sequencing experiment on a tumor biopsy generates expression profiles for 50,000 individual cells. A clinical trial of a new cancer drug generates genomic, transcriptomic, proteomic, and clinical data for hundreds of patients. The bottleneck is not generating biological data - the bottleneck is understanding what it means.

The drug discovery pipeline has a fundamental efficiency problem: it costs over $2 billion and takes 12 years on average to bring a new drug to market, with a clinical trial failure rate above 90%. The majority of late-stage failures are due to unexpected toxicity or lack of efficacy that could have been predicted from structural or genomic data if we knew how to read it. AI in genomics and protein science is fundamentally about compressing this cycle - identifying drug targets from genomic data, predicting whether a molecule will bind a protein before synthesizing it, and understanding which patients are likely to respond to a treatment before enrolling them.

Historical Context - From the Human Genome to Foundation Models

The Human Genome Project completed in 2003 after 13 years of work at a cost of approximately $2.7 billion. It produced the first complete human genome sequence - 3 billion base pairs encoding approximately 20,000 protein-coding genes. The leaders of the project announced at the completion ceremony that this would "revolutionize medicine." That revolution was slower and stranger than anyone anticipated.

The first generation of genomics AI focused on genome-wide association studies (GWAS) - scanning the genomes of thousands of patients and healthy controls to find single-nucleotide polymorphisms (SNPs) statistically associated with disease. By 2015, GWAS had identified thousands of associations but explained only a small fraction of disease heritability. Type 2 diabetes had hundreds of identified risk variants that collectively explained less than 20% of its heritability.

The missing heritability problem drove the development of polygenic risk scores (PRS) - combining hundreds or thousands of individually weak SNP associations into a single summary score. PRS for coronary artery disease, developed by researchers at the Broad Institute, identified individuals at 3x the population risk of heart attack. For breast cancer, PRS could identify women at the same risk level as BRCA1 carriers without having the BRCA1 mutation. These were clinically actionable predictions, but they required large, diverse training sets to be valid across ancestries.

AlphaFold 1 (2018) used deep convolutional networks on features derived from sequence databases to predict distograms - probability distributions over the distance between each pair of amino acids - and then folded structures by gradient descent on the resulting potential. It showed progress but did not solve the problem. AlphaFold 2 (2020) replaced this with an end-to-end differentiable architecture based on attention over multiple sequence alignments, geometrically-aware transformers, and iterative structure refinement. The jump in accuracy between AF1 and AF2 was larger than all progress made in the previous 25 years of the CASP competition combined.

Since 2022, the field has moved toward biological foundation models: large language models trained on DNA, RNA, and protein sequences in the same way GPT was trained on text. Nucleotide Transformer (2023) trained a 2.5B parameter transformer on 3,202 human genomes and 850 other species. Evo (2024, Arc Institute) trained a 7B parameter model on 2.7 million prokaryotic genomes. These models can predict variant effects, generate novel functional sequences, and design CRISPR guide RNAs.

Core Concepts

How AlphaFold 2 Works

AlphaFold 2 approaches protein structure prediction as a geometric learning problem. Given a protein sequence, the goal is to predict the 3D coordinates of every heavy atom in the protein - the backbone (four heavy atoms per residue) plus side chains, which averages roughly 8 heavy atoms per residue, so on the order of 8L atoms for a protein of length L.

Step 1 - Multiple Sequence Alignment (MSA): Evolution is the key insight. If a protein sequence has been conserved across many species, residues that co-evolve (change together across species) are likely to be in physical contact in 3D space. AlphaFold 2 builds a multiple sequence alignment (MSA) of the query sequence against thousands of related sequences from databases like UniRef90 and MGnify. This produces a matrix of shape (N_seq, L) where N_seq can be thousands of sequences.

Step 2 - Pairwise Features: AlphaFold 2 also retrieves template structures from PDB - known protein structures that have some sequence similarity to the query. Templates provide geometric priors about what folds are plausible.

Step 3 - Evoformer: The core neural network is the Evoformer, a stack of 48 transformer blocks that operate on two representations simultaneously:

  • MSA representation: shape (N_seq, L, c_m) - tracks information about each sequence-residue pair
  • Pair representation: shape (L, L, c_z) - tracks information about each pair of residue positions

The Evoformer alternates between:

  • MSA row attention (attention across residues within each sequence in the MSA)
  • MSA column attention (attention across sequences at each position)
  • Outer product mean (updates pair representation from MSA)
  • Triangle attention (maintains geometric consistency - if residue A is close to B and B is close to C, A and C should be related)
  • Transition feedforward layers

The triangle attention is the key geometric insight. Real protein structures must satisfy triangle inequality constraints in 3D space. Standard attention has no geometric inductive bias. Triangle attention explicitly models the relationship between triplets of residue pairs, encouraging the pair representation to encode geometrically valid distance and orientation information.
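The constraint that triangle attention encourages can be made concrete with a brute-force check (a toy sketch, not part of AlphaFold; the helper name `triangle_violations` is invented here): a matrix of pairwise distances can only be realized by points in 3D space if every triple of positions satisfies the triangle inequality.

```python
def triangle_violations(dist, tol=1e-6):
    """Count triples (i, j, k) whose pairwise distances violate the
    triangle inequality d(i, k) <= d(i, j) + d(j, k).

    dist: symmetric matrix (list of lists) of pairwise distances.
    A valid 3D structure has zero violations; a predicted distance
    map with many violations cannot be realized in space.
    """
    n = len(dist)
    violations = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if i != j and j != k and i != k:
                    if dist[i][k] > dist[i][j] + dist[j][k] + tol:
                        violations += 1
    return violations


# Distances between three points on a line at 0, 3, 5: always consistent
valid = [[0, 3, 5], [3, 0, 2], [5, 2, 0]]
# An impossible map: d(0,2)=10 but d(0,1) + d(1,2) = 5
invalid = [[0, 3, 10], [3, 0, 2], [10, 2, 0]]
```

A standard attention stack has no mechanism pushing the pair representation toward zero violations; triangle attention builds that pressure in by always updating a pair (i, k) in the context of the third edges (i, j) and (j, k).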

Step 4 - Structure Module: Takes the pair representation and iteratively builds a 3D structure. Represents each residue as a rigid body (a backbone frame described by a rotation and translation). Uses "invariant point attention" - attention that is equivariant to global rotation and translation of the protein, so predictions don't depend on arbitrary coordinate system choices.

Step 5 - Recycling: The entire pipeline is run 3-4 times, with each cycle's output used as input for the next. This iterative refinement is crucial for accuracy.

Alongside the structure loss, the network trains an auxiliary head that outputs the predicted local distance difference test (pLDDT) - a per-residue confidence score reported with every structure. The pLDDT score is remarkably well-calibrated: regions with pLDDT above 70 are generally correct; below 50 are unreliable (often intrinsically disordered regions).

$$\text{pLDDT}(i) = \text{expected LDDT score for residue } i \in [0, 100]$$

The full LDDT metric measures the fraction of Cα-Cα distances under 15 angstroms in the reference structure that are reproduced within thresholds of 0.5, 1, 2, and 4 angstroms:

$$\text{LDDT} = \frac{1}{4} \sum_{t \in \{0.5, 1, 2, 4\}} \frac{\text{contacts correctly predicted within } t \text{ angstroms}}{\text{total contacts within 15 angstroms}}$$
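The formula above can be sketched directly (a simplified toy, with a hypothetical `lddt_ca` helper; the real metric also excludes near-sequence neighbors and averages per-residue):

```python
def lddt_ca(ref_dist, pred_dist, inclusion_radius=15.0):
    """Simplified LDDT over Cα-Cα distances.

    ref_dist, pred_dist: dicts mapping residue pairs (i, j) -> distance in Å.
    Scores the fraction of reference distances under 15 Å that are
    reproduced within 0.5, 1, 2, and 4 Å, averaged over the four thresholds.
    """
    thresholds = (0.5, 1.0, 2.0, 4.0)
    pairs = [p for p, d in ref_dist.items() if d < inclusion_radius]
    if not pairs:
        return 0.0
    score = 0.0
    for t in thresholds:
        preserved = sum(1 for p in pairs if abs(pred_dist[p] - ref_dist[p]) <= t)
        score += preserved / len(pairs)
    return 100 * score / len(thresholds)


ref = {(0, 2): 6.0, (0, 3): 9.0}   # reference Cα distances
pred = {(0, 2): 6.3, (0, 3): 12.0}  # prediction: one good pair, one off by 3 Å
print(lddt_ca(ref, ref))   # 100.0
print(lddt_ca(ref, pred))  # 62.5
```

Note that LDDT is superposition-free: it compares local distances rather than aligned coordinates, so it does not punish a correct domain for a flexible hinge elsewhere.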

Variant Calling

Variant calling is the process of identifying differences between an individual's genome sequence and a reference genome. When you sequence someone's genome with short-read sequencing (Illumina), you get 150-base-pair fragments that must be aligned to the 3-billion-base-pair reference genome, and then the aligned reads are analyzed to find positions where the individual's sequence differs.

The main types of variants:

  • SNVs (Single Nucleotide Variants): Single base changes. The most common. rs334 is a SNV in the HBB gene that causes sickle cell disease when homozygous.
  • Indels: Small insertions or deletions (1-50 bp). The BRCA1 frameshift mutations that dramatically increase breast cancer risk are mostly indels.
  • CNVs (Copy Number Variants): Regions of the genome that are duplicated or deleted. HER2 amplification in breast cancer is a CNV.
  • SVs (Structural Variants): Large rearrangements - translocations, inversions, large deletions. The BCR-ABL fusion gene in CML is caused by a translocation.
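The first three classes can be distinguished mechanically from VCF-style REF/ALT alleles; a minimal sketch (the `classify_variant` helper is invented for illustration):

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify a variant from its VCF-style REF and ALT alleles.

    Covers the small-variant classes; CNVs and large SVs are typically
    reported with symbolic alleles (e.g. <DEL>, <DUP>) or breakend
    notation rather than literal sequences.
    """
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "SV"  # symbolic or breakend ALT
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "MNV"  # multi-nucleotide substitution, same length


print(classify_variant("A", "T"))     # SNV
print(classify_variant("ATG", "A"))   # deletion
print(classify_variant("A", "ACCT"))  # insertion
```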

Deep learning has improved variant calling significantly. Google's DeepVariant (2018) treats variant calling as an image classification problem: each candidate variant position is rendered as a pileup image (read alignments visualized as a 2D grid), and a CNN classifies whether the variant is real or an artifact. DeepVariant achieves lower error rates than the best traditional tools (GATK HaplotypeCaller) especially for difficult genomic regions with high GC content or repetitive sequences.
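The core of the pileup idea is simply turning aligned reads into a fixed-shape array a CNN can classify. The toy encoder below (hypothetical `pileup_tensor`, one-hot bases only) shows that step; DeepVariant's real pileup images add channels for base quality, mapping quality, strand, and variant support.

```python
import numpy as np

BASE_CHANNEL = {"A": 0, "C": 1, "G": 2, "T": 3}


def pileup_tensor(reads, window):
    """Encode aligned reads over a genomic window as a one-hot tensor
    of shape (n_reads, window, 4) - a stripped-down version of the
    pileup images DeepVariant classifies.

    reads: list of (start_offset, sequence) pairs, with offsets
    relative to the window start.
    """
    tensor = np.zeros((len(reads), window, 4), dtype=np.float32)
    for row, (start, seq) in enumerate(reads):
        for i, base in enumerate(seq):
            col = start + i
            if 0 <= col < window and base in BASE_CHANNEL:
                tensor[row, col, BASE_CHANNEL[base]] = 1.0
    return tensor


reads = [(0, "ACGT"), (2, "GTAA")]
t = pileup_tensor(reads, window=6)
```

Framing the problem this way lets variant calling inherit a decade of image-classification machinery instead of hand-tuned error models.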

Polygenic Risk Scores

A polygenic risk score (PRS) aggregates the effects of thousands of genetic variants to estimate an individual's genetic predisposition to a complex disease. For a disease with N associated variants, the PRS for individual i is:

$$\text{PRS}_i = \sum_{j=1}^{N} \beta_j \cdot G_{ij}$$

where $\beta_j$ is the effect size of variant $j$ (typically the log odds ratio from GWAS) and $G_{ij}$ is the dosage of the effect allele at variant $j$ for individual $i$ (0, 1, or 2 copies).
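As a worked toy example of the sum (made-up effect sizes, not from any real GWAS):

```python
# Toy effect sizes (log odds ratios) for three illustrative variants
betas = {"rs1": 0.12, "rs2": -0.05, "rs3": 0.30}

# Effect-allele dosages (0, 1, or 2 copies) for one individual
dosages = {"rs1": 2, "rs2": 1, "rs3": 0}

prs = sum(betas[snp] * dosages[snp] for snp in betas)
print(round(prs, 3))  # 0.12*2 + (-0.05)*1 + 0.30*0 = 0.19
```

Raw scores like this are only meaningful relative to a population distribution, which is why PRS values are usually standardized and reported as percentiles.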

The clinical utility of PRS is that it captures risk information that a clinician examining family history cannot see. A patient with no family history of heart disease but a top 5% PRS for coronary artery disease has a lifetime risk comparable to someone with familial hypercholesterolemia - and should be managed accordingly (early statin therapy, aggressive lipid monitoring).

The critical limitation of PRS is portability across ancestries. GWAS studies have been conducted predominantly in populations of European ancestry. A PRS trained on European GWAS performs dramatically worse in African-ancestry individuals because: (1) allele frequencies differ, so variants discovered in Europeans may not be common in African populations; (2) linkage disequilibrium patterns differ, so the tagging SNPs that proxy the causal variants in Europeans may not tag the same variants in Africans. This is not just a technical problem - a PRS that is worse at identifying high-risk individuals in Black patients than white patients is a health equity problem.

DNA Language Models

The analogy between protein sequences and natural language is surprisingly deep. Amino acids are "words," proteins are "sentences," and the training objective of predicting masked residues (BERT-style) or predicting the next residue (GPT-style) forces the model to learn the statistical regularities of functional protein sequences.

DNA is harder. The "vocabulary" is only 4 tokens (A, C, G, T), but the sequence is 3 billion tokens long and encodes information at multiple scales - individual codons, regulatory motifs, entire chromosomes. Early DNA language models like DNABERT (2021) tokenized DNA into 6-mers and trained BERT on human genome sequences, achieving state-of-the-art performance on promoter prediction, splice site detection, and transcription factor binding.
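The 6-mer scheme is easy to make concrete (a minimal sketch; `kmer_tokenize` is an illustrative name, not DNABERT's actual tokenizer code):

```python
def kmer_tokenize(seq: str, k: int = 6, stride: int = 1) -> list[str]:
    """Tokenize a DNA sequence into overlapping k-mers, DNABERT-style.

    With stride=1 each token overlaps its neighbor by k-1 bases, so a
    sequence of length L yields L - k + 1 tokens drawn from a
    vocabulary of 4**k (4096 for k=6) plus special tokens.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


tokens = kmer_tokenize("ACGTACGTAC")
print(tokens)  # ['ACGTAC', 'CGTACG', 'GTACGT', 'TACGTA', 'ACGTAC']
```

The overlap means adjacent tokens share most of their bases, which leaks the masked token during BERT-style pretraining - one reason later models (Nucleotide Transformer, Evo) moved to non-overlapping or single-nucleotide tokenization.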

Nucleotide Transformer (2023, InstaDeep/EMBL-EBI) scaled to 2.5B parameters and trained on 3,202 human genomes plus 850 other species. The multi-species pretraining creates representations that capture evolutionary conservation - positions conserved across millions of years of evolution are likely functionally important. The model achieves strong performance on 18 downstream genomic tasks without any task-specific architecture changes.

Evo (2024, Arc Institute) took a different approach: train a 7B parameter model on DNA sequences from 2.7 million prokaryotic and phage genomes at single-nucleotide resolution, then use it to generate novel sequences, predict variant effects across all possible single-nucleotide changes in a gene, and design CRISPR guide RNAs.

Code Examples

ESMFold for Fast Protein Structure Prediction

# ESMFold is Meta's single-sequence protein structure predictor
# Much faster than AlphaFold2 (no MSA required) at some cost to accuracy
# Install: pip install "fair-esm[esmfold]"

import torch
import esm
from pathlib import Path

_ESMFOLD_MODEL = None  # cached so batch callers don't reload ~2.5 GB per call


def _get_esmfold_model():
    """Load ESMFold once and cache it (downloads ~2.5 GB on first run)."""
    global _ESMFOLD_MODEL
    if _ESMFOLD_MODEL is None:
        model = esm.pretrained.esmfold_v1().eval()
        if torch.cuda.is_available():
            model = model.cuda()
        _ESMFOLD_MODEL = model
    return _ESMFOLD_MODEL


def predict_structure_esm(sequence: str, output_pdb: str) -> dict:
    """
    Predict protein structure using ESMFold.
    ESMFold uses ESM-2 language model embeddings directly - no MSA needed.
    Suitable for rapid screening; use AlphaFold2 for research-quality predictions.

    Args:
        sequence: Amino acid sequence in single-letter code (e.g., "MKTIIALSYIFCLVFA...")
        output_pdb: Path to save PDB file

    Returns:
        Dict with PDB path, pLDDT summary, and residue count
    """
    model = _get_esmfold_model()

    # Clean the sequence
    sequence = sequence.upper().replace(" ", "").replace("\n", "")

    # Validate amino acids
    valid_aas = set("ACDEFGHIKLMNPQRSTVWY")
    invalid = set(sequence) - valid_aas
    if invalid:
        raise ValueError(f"Invalid amino acids in sequence: {invalid}")

    with torch.no_grad():
        output = model.infer_pdb(sequence)

    # Save PDB file
    Path(output_pdb).parent.mkdir(parents=True, exist_ok=True)
    with open(output_pdb, "w") as f:
        f.write(output)

    # Parse pLDDT from the PDB B-factor column (ESMFold writes pLDDT there)
    plddt_scores = []
    for line in output.split("\n"):
        if line.startswith("ATOM"):
            try:
                bfactor = float(line[60:66].strip())
                plddt_scores.append(bfactor)
            except (ValueError, IndexError):
                pass

    mean_plddt = sum(plddt_scores) / len(plddt_scores) if plddt_scores else 0

    return {
        "pdb_path": output_pdb,
        "mean_plddt": mean_plddt,
        "residue_count": len(sequence),
        "high_confidence_fraction": (
            sum(1 for s in plddt_scores if s >= 70) / len(plddt_scores)
            if plddt_scores else 0
        ),
    }


def visualize_structure_notebook(pdb_path: str):
    """
    Visualize protein structure in Jupyter using py3Dmol.
    Color by pLDDT confidence (blue=high, red=low).
    """
    try:
        import py3Dmol
    except ImportError:
        print("Install py3Dmol: pip install py3Dmol")
        return

    with open(pdb_path) as f:
        pdb_str = f.read()

    view = py3Dmol.view(width=800, height=600)
    view.addModel(pdb_str, "pdb")

    # Color by B-factor (pLDDT for AlphaFold/ESMFold predictions)
    # "roygb" gradient runs red (low confidence) to blue (high confidence)
    view.setStyle({"cartoon": {
        "colorscheme": {
            "prop": "b",
            "gradient": "roygb",
            "min": 50,
            "max": 100,
        }
    }})
    view.zoomTo()
    view.show()
    return view


# Batch prediction for variant effect analysis
def predict_variant_effects_batch(
    wild_type_seq: str,
    variants: list[tuple[int, str, str]],  # (position, original_aa, mutant_aa)
    output_dir: str,
) -> list[dict]:
    """
    Predict structural effects of amino acid variants using ESMFold.
    Compares pLDDT for the WT vs each mutant.

    variants: list of (1-indexed position, WT amino acid, mutant amino acid)
        e.g., [(157, 'V', 'I'), (183, 'R', 'H')] for Val157Ile and Arg183His
    """
    results = []
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Predict WT structure
    wt_result = predict_structure_esm(wild_type_seq, str(output_dir / "wildtype.pdb"))
    results.append({"type": "wildtype", "plddt": wt_result["mean_plddt"], **wt_result})

    for pos, orig_aa, mut_aa in variants:
        idx = pos - 1  # Convert to 0-indexed
        if idx < 0 or idx >= len(wild_type_seq):
            print(f"Warning: position {pos} out of range for sequence of length {len(wild_type_seq)}")
            continue

        if wild_type_seq[idx] != orig_aa:
            print(f"Warning: position {pos} is {wild_type_seq[idx]}, not {orig_aa}")

        mutant_seq = wild_type_seq[:idx] + mut_aa + wild_type_seq[idx + 1:]
        variant_name = f"{orig_aa}{pos}{mut_aa}"
        pdb_path = str(output_dir / f"{variant_name}.pdb")

        try:
            mut_result = predict_structure_esm(mutant_seq, pdb_path)
            plddt_delta = mut_result["mean_plddt"] - wt_result["mean_plddt"]
            results.append({
                "type": "variant",
                "variant": variant_name,
                "plddt": mut_result["mean_plddt"],
                "plddt_delta": plddt_delta,
                # Heuristic cutoff; validate against experimental stability data
                "predicted_destabilizing": plddt_delta < -5.0,
                **mut_result,
            })
        except Exception as e:
            results.append({"type": "variant", "variant": variant_name, "error": str(e)})

    return results

Polygenic Risk Score Calculation

import pandas as pd
import numpy as np
from pathlib import Path
from typing import Optional


def calculate_prs(
    gwas_summary_stats: pd.DataFrame,
    genotype_file: str,
    output_path: Optional[str] = None,
) -> pd.Series:
    """
    Calculate polygenic risk scores from GWAS summary statistics and genotype data.

    gwas_summary_stats: DataFrame with columns [snp_id, effect_allele, other_allele, beta, p_value]
    genotype_file: Path to PLINK .raw format file (individuals x SNPs, dosage 0/1/2)

    Returns: Series of PRS values indexed by individual ID
    """
    # Load GWAS summary statistics.
    # Typically you would apply p-value thresholding and LD clumping before this
    # step; here we assume the caller has already filtered to independent,
    # significant variants.
    stats = gwas_summary_stats.set_index("snp_id")

    # Load the genotype dosage matrix (from PLINK --recode A).
    # .raw files are whitespace-delimited; SNP columns are named
    # "<snp_id>_<counted_allele>" after the FID/IID/PAT/MAT/SEX/PHENOTYPE columns.
    geno = pd.read_csv(genotype_file, sep=r"\s+", index_col="IID")
    geno = geno.drop(columns=["FID", "PAT", "MAT", "SEX", "PHENOTYPE"], errors="ignore")
    geno.columns = [c.rsplit("_", 1)[0] for c in geno.columns]  # strip allele suffix

    # NOTE: a real pipeline must also verify that PLINK's counted allele matches
    # the GWAS effect allele, flipping dosage to (2 - dosage) where it does not.
    snp_cols = [c for c in geno.columns if c in stats.index]

    if len(snp_cols) == 0:
        raise ValueError("No overlapping SNPs between summary stats and genotype file")

    print(f"Calculating PRS using {len(snp_cols)} / {len(stats)} available variants")

    # Align effect sizes to the genotype column order
    betas = stats.loc[snp_cols, "beta"].values  # shape: (n_snps,)
    dosages = geno[snp_cols].values             # shape: (n_individuals, n_snps)

    # PRS = sum of (beta_j * dosage_j) for each individual
    prs = dosages @ betas  # (n_individuals, n_snps) @ (n_snps,) -> (n_individuals,)
    prs_series = pd.Series(prs, index=geno.index, name="PRS")

    # Standardize to mean=0, SD=1 for interpretability across different variant sets
    prs_standardized = (prs_series - prs_series.mean()) / prs_series.std()

    if output_path:
        prs_standardized.to_csv(output_path, header=True)
        print(f"PRS saved to {output_path}")

    return prs_standardized


def stratify_by_prs(prs: pd.Series) -> pd.DataFrame:
    """
    Stratify individuals into PRS percentile bins for clinical risk communication.
    Returns DataFrame with PRS, percentile, and risk tier.
    """
    df = pd.DataFrame({"prs": prs})
    df["percentile"] = df["prs"].rank(pct=True) * 100

    # Define clinical risk tiers based on percentile.
    # These thresholds are illustrative - real thresholds depend on disease and PRS.
    def risk_tier(percentile: float) -> str:
        if percentile >= 95:
            return "Very High (top 5%)"
        elif percentile >= 80:
            return "High (top 20%)"
        elif percentile >= 20:
            return "Average"
        elif percentile >= 5:
            return "Low (bottom 20%)"
        else:
            return "Very Low (bottom 5%)"

    df["risk_tier"] = df["percentile"].apply(risk_tier)

    return df


def ancestry_adjusted_prs(
    prs_raw: pd.Series,
    principal_components: pd.DataFrame,
    n_pcs: int = 10,
) -> pd.Series:
    """
    Adjust PRS for ancestry using linear regression on principal components.
    Removes population stratification effects that can inflate PRS in some ancestries.

    principal_components: DataFrame of genomic PCs, indexed by individual ID
    """
    from sklearn.linear_model import LinearRegression

    # Align indices
    common_ids = prs_raw.index.intersection(principal_components.index)
    y = prs_raw.loc[common_ids].values
    X = principal_components.loc[common_ids].iloc[:, :n_pcs].values

    # Regress PRS on top PCs
    reg = LinearRegression().fit(X, y)

    # Residuals are the ancestry-adjusted PRS
    residuals = y - reg.predict(X)
    prs_adjusted = pd.Series(residuals, index=common_ids, name="PRS_adjusted")

    # Standardize
    prs_adjusted = (prs_adjusted - prs_adjusted.mean()) / prs_adjusted.std()

    return prs_adjusted

Single-Cell RNA Sequencing Analysis

# scVI: deep generative model for single-cell RNA data
# Paper: Lopez et al. (2018) Nature Methods
# Install: pip install scvi-tools

import scvi
import anndata as ad
import scanpy as sc
import numpy as np
import pandas as pd
from typing import Optional


def analyze_scrna_with_scvi(
    adata: ad.AnnData,
    n_latent: int = 10,
    n_layers: int = 2,
    batch_key: Optional[str] = "batch",
    n_epochs: int = 400,
) -> ad.AnnData:
    """
    Analyze single-cell RNA data using scVI for:
    - Batch correction across multiple samples/technologies
    - Dimensionality reduction to latent space
    - Differential expression analysis

    adata: AnnData object with raw counts in adata.X
    batch_key: Column in adata.obs indicating batch/sample
    """
    # Preprocessing
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)

    # Calculate QC metrics
    adata.var["mt"] = adata.var_names.str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)

    # Filter low-quality cells
    adata = adata[adata.obs.pct_counts_mt < 20].copy()
    adata = adata[adata.obs.n_genes_by_counts < 5000].copy()

    # Highly variable genes (speeds up scVI significantly)
    sc.pp.highly_variable_genes(
        adata,
        n_top_genes=3000,
        subset=True,
        flavor="seurat_v3",
        batch_key=batch_key,
    )

    # Setup scVI model
    scvi.model.SCVI.setup_anndata(
        adata,
        layer=None,  # Use adata.X (raw counts)
        batch_key=batch_key,
    )

    # Train scVI - learns a latent representation that captures biology
    # while removing batch effects (different protocols, sequencing runs)
    vae = scvi.model.SCVI(
        adata,
        n_latent=n_latent,
        n_layers=n_layers,
        gene_likelihood="nb",  # Negative binomial for count data
    )

    vae.train(
        max_epochs=n_epochs,
        early_stopping=True,
        early_stopping_patience=45,
    )

    # Get batch-corrected latent representation
    adata.obsm["X_scVI"] = vae.get_latent_representation()

    # UMAP for visualization
    sc.pp.neighbors(adata, use_rep="X_scVI", n_neighbors=15)
    sc.tl.umap(adata)

    # Leiden clustering
    sc.tl.leiden(adata, resolution=0.5)

    # Keep the model handy for differential expression. This lives only in
    # memory - uns entries must be serializable if you write adata to disk.
    adata.uns["scvi_model"] = vae

    return adata


def differential_expression_scvi(
    adata: ad.AnnData,
    group1_cells: list[str],
    group2_cells: list[str],
    top_n: int = 50,
) -> pd.DataFrame:
    """
    Differential expression between two cell groups using scVI's Bayesian test.
    Returns genes with Bayes Factor > 3 (strong evidence) sorted by effect size.
    """
    vae = adata.uns["scvi_model"]

    # Custom cell groups are passed as boolean masks via idx1/idx2;
    # group1/group2 instead refer to categories of a `groupby` column
    idx1 = adata.obs_names.isin(group1_cells)
    idx2 = adata.obs_names.isin(group2_cells)

    de_results = vae.differential_expression(
        adata=adata,
        idx1=idx1,
        idx2=idx2,
    )

    # Filter to significant results
    significant = de_results[de_results["bayes_factor"] > 3].copy()
    significant = significant.sort_values("lfc_mean", key=abs, ascending=False)

    return significant.head(top_n)

Production Engineering Notes

MSA Quality Determines AlphaFold Quality: AlphaFold 2's accuracy degrades significantly for proteins with few homologs in sequence databases. Orphan proteins, recently evolved proteins, and proteins from poorly sampled organisms (e.g., many marine microbes) may have fewer than 30 sequences in their MSA, causing pLDDT scores to drop substantially. For drug discovery pipelines, always check the depth of the MSA before trusting a structure. ESMFold, which skips the MSA entirely, trades accuracy for speed - appropriate for screening but not for lead optimization.

Variant Effect Prediction at Scale: A typical missense effect prediction task for a clinical sequencing lab involves scoring thousands of variants of uncertain significance (VUS) per month. Running full AlphaFold 2 on each VUS is not tractable. In practice, use ESMFold or ESM-1v (trained specifically for variant effect prediction) for high-throughput screening, then escalate to full AlphaFold 2 plus molecular dynamics simulation for the handful of variants with strong clinical evidence. ESM-1v can score a single variant in under 1 second on a GPU vs. minutes for AlphaFold 2.

Data Engineering for Genomics: Genomics data pipelines are among the most demanding in bioinformatics. A single whole-genome sequencing FASTQ file is 50-100 GB. A cohort of 10,000 patients generates a petabyte of raw data. The standard tools (BWA-MEM2 for alignment, GATK for variant calling) are designed to process one sample at a time. At scale, you need: cloud object storage (S3/GCS) for raw data, a workflow manager (Nextflow, Snakemake, WDL) for parallelization, scatter-gather patterns across chromosomes to parallelize within-sample processing, and a variant database (Hail, BigQuery) for joint genotyping across cohorts.

Ancestry Bias in Models: Every model trained on publicly available genomic data inherits the ancestry composition of those datasets. gnomAD's samples skew heavily toward European ancestry. The UK Biobank that trains many PRS models is roughly 94% white British. Models trained on these datasets will be better calibrated for European-ancestry individuals. For clinical deployment, always test PRS performance stratified by ancestry using AUC and calibration curves. If your AUC drops from 0.80 in European-ancestry patients to 0.65 in African-ancestry patients, that is a clinical equity problem that must be addressed before deployment.
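The stratified check can be sketched with a plain rank-based AUC (the helpers `auc` and `auc_by_ancestry` are invented for illustration; in practice you would use sklearn's roc_auc_score plus calibration curves):

```python
def auc(scores, labels):
    """Rank-based AUC: the probability a random case outranks a random
    control (Mann-Whitney U normalized by n_case * n_control)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("Need both cases and controls")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def auc_by_ancestry(scores, labels, ancestry):
    """AUC computed separately within each ancestry group. A large gap
    between groups signals a portability problem to fix before
    clinical deployment."""
    return {
        g: auc(
            [s for s, a in zip(scores, ancestry) if a == g],
            [y for y, a in zip(labels, ancestry) if a == g],
        )
        for g in sorted(set(ancestry))
    }
```

Computing AUC within each group, rather than pooling, is the point: a pooled AUC can look acceptable while one ancestry group receives near-random predictions.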

Common Mistakes

:::danger Using pLDDT as a Measure of Prediction Accuracy for All Regions AlphaFold's pLDDT score tells you how confident the model is about a prediction, not whether the prediction is biologically correct. Intrinsically disordered regions (IDRs) - stretches of protein that do not adopt a fixed structure - will have low pLDDT scores (below 50) in AlphaFold predictions. This is correct behavior: low pLDDT means the model is correctly signaling uncertainty about a region that genuinely does not have a fixed structure. IDRs are functionally important (many transcription factor activation domains are disordered) and are not failures of the model. Never filter out or ignore low-pLDDT regions as "poor predictions" without checking whether the region is known to be disordered. :::
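A practical first step before discarding anything is to locate the low-pLDDT stretches explicitly (a minimal sketch; `low_confidence_runs` is an illustrative helper, and the cutoffs are the conventional pLDDT thresholds, not hard rules):

```python
def low_confidence_runs(plddt, cutoff=50.0, min_len=10):
    """Find contiguous runs of residues with pLDDT below the cutoff.

    Long low-pLDDT stretches are candidate intrinsically disordered
    regions, not necessarily prediction failures - cross-reference a
    disorder predictor or annotation before discarding them.

    Returns (start, end) pairs, 0-indexed and end-exclusive.
    """
    runs, start = [], None
    for i, score in enumerate(plddt):
        if score < cutoff and start is None:
            start = i
        elif score >= cutoff and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(plddt) - start >= min_len:
        runs.append((start, len(plddt)))
    return runs


# A confident domain, a 15-residue low-confidence stretch, another domain
scores = [90.0] * 20 + [30.0] * 15 + [85.0] * 10
print(low_confidence_runs(scores))  # [(20, 35)]
```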

:::danger Training PRS on Test Population Data Polygenic risk scores must be trained (GWAS discovery) and validated on independent cohorts. A common error is discovering variants in a large meta-analysis that includes your validation cohort, then reporting inflated performance because the model has seen related individuals. This is not the same as standard train/test split - the issue is that GWAS summary statistics from a large study will have learned the specific population structure of your validation cohort even if no individual was present in both. Always use strictly independent validation cohorts, ideally from different countries or studies. :::

:::warning Confusing Gene Expression with Protein Abundance Gene expression (RNA-seq) measures mRNA levels, not protein levels. These are correlated (r ~ 0.4-0.6) but not the same. Post-transcriptional regulation, protein stability, and secretion all mean that a gene with 10x higher expression does not necessarily produce 10x more protein. For mechanistic drug target validation, mRNA data is a starting point. For actual drug target engagement, you need protein-level evidence (proteomics, Western blot, immunofluorescence). Building clinical ML models that claim to predict protein-level phenotypes from RNA-seq alone is a common overreach in computational biology papers. :::

:::warning Reference Genome Version Incompatibility Human genome builds (GRCh37/hg19 vs GRCh38/hg38) use different coordinate systems. A variant at position chr1:100,000 in GRCh37 is at a completely different genomic location in GRCh38. If your GWAS summary statistics are in GRCh37 coordinates but your genotype data is aligned to GRCh38, your PRS calculation will be nonsense - you will be assigning GWAS effect sizes to wrong genomic positions. Always verify the genome build of every dataset in your pipeline. Use liftOver to convert coordinates when mixing builds. Many public GWAS databases have both builds available; always check the version column. :::

Interview Q&A

Q: Explain the role of multiple sequence alignments in AlphaFold 2. Why does evolutionary information help predict protein structure?

A: The key insight is co-evolution. If two residues in a protein are in physical contact in 3D space, and a mutation in one residue disrupts the protein's function, evolution will often compensate by changing the contacting residue to restore the interaction. Over millions of years and across thousands of related organisms, residues that are in physical contact will therefore tend to change together - they co-evolve. By looking at an alignment of the query protein against thousands of homologs, the Evoformer can detect which pairs of positions have correlated mutations. High mutual information between positions i and j is a strong signal that residues i and j are close in 3D space. This is why AlphaFold 2 performs much worse on proteins with few homologs (shallow MSA): there is less evolutionary data to learn from. The MSA-based approach is effectively crowdsourcing structural information from the evolutionary record.
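The correlated-mutation signal the answer describes can be demonstrated on a toy alignment with raw mutual information between columns (a sketch only - production methods like direct coupling analysis correct for phylogeny and indirect correlations; `column_mi` is an invented helper):

```python
from collections import Counter
from math import log2


def column_mi(msa, i, j):
    """Mutual information between columns i and j of an MSA -
    a simple co-evolution signal."""
    n = len(msa)
    pi = Counter(seq[i] for seq in msa)
    pj = Counter(seq[j] for seq in msa)
    pij = Counter((seq[i], seq[j]) for seq in msa)
    return sum(
        (c / n) * log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
        for (a, b), c in pij.items()
    )


# Toy alignment: columns 0 and 1 always change together (compensating
# mutations), while column 2 varies independently of column 0.
msa = ["ARL", "ARV", "KEL", "KEV", "ARL", "KEV"]
print(column_mi(msa, 0, 1) > column_mi(msa, 0, 2))  # True
```

High MI between two columns across thousands of homologs is exactly the statistical footprint of compensating mutations that the Evoformer exploits at scale.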

Q: What is the difference between DeepVariant and traditional variant callers like GATK HaplotypeCaller? When would you choose each?

A: GATK HaplotypeCaller is a Bayesian probabilistic model: it computes the probability of observing the reads given each possible genotype (0/0, 0/1, 1/1) using local haplotype assembly and a hidden Markov model. DeepVariant is a CNN-based approach that represents each candidate variant as a pileup image and classifies it as a real variant or an artifact. DeepVariant generally outperforms GATK in accuracy (lower false positive and false negative rates), especially in repetitive and high-GC regions where the HMM struggles. However, GATK remains preferred when: you need joint genotyping across a large cohort (DeepVariant emits per-sample gVCFs that must be merged with a separate tool such as GLnexus, whereas GATK's GenotypeGVCFs is an integrated workflow); you are working in a regulated clinical environment where a tool with an extensive validation history is required; or you are analyzing non-human genomes (DeepVariant's released models are trained primarily on human data). For research sequencing of human samples, DeepVariant is generally the better choice today.
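A rough sketch of the pileup-image idea: each read becomes a row, each window position a column, and separate channels hold base identity, base quality, and strand. The channel layout and value encodings below are illustrative, not DeepVariant's actual multi-channel scheme.

```python
import numpy as np

# Illustrative base-to-value encoding (not DeepVariant's actual mapping).
BASES = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}

def encode_pileup(reads, window):
    """Encode reads as a (channels, rows, cols) tensor for a CNN.

    reads: list of (sequence, start_col, base_qualities, is_reverse) tuples.
    """
    tensor = np.zeros((3, len(reads), window))
    for r, (seq, start, quals, rev) in enumerate(reads):
        for k, base in enumerate(seq):
            col = start + k
            if 0 <= col < window:
                tensor[0, r, col] = BASES[base]          # base identity
                tensor[1, r, col] = quals[k] / 60.0      # scaled base quality
                tensor[2, r, col] = 1.0 if rev else 0.5  # strand
    return tensor

reads = [("ACGT", 0, [30, 40, 40, 30], False),
         ("CGTA", 1, [40, 40, 30, 30], True)]
t = encode_pileup(reads, window=6)
print(t.shape)  # (3, 2, 6)
```

The point of the representation is that alignment artifacts (strand bias, low-quality pileups, ragged read ends) become visual patterns a CNN can learn, rather than features a hand-built HMM must anticipate.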

Q: Describe the main sources of batch effects in single-cell RNA sequencing and how scVI addresses them.

A: Batch effects in scRNA-seq arise from differences in sample handling (fresh vs. frozen tissue), sequencing platforms (10x Chromium v2 vs. v3 vs. v3.1), library preparation date, sequencing depth, and ambient RNA contamination. These technical sources of variation can be as large as or larger than genuine biological differences between cell types or conditions. scVI addresses batch effects with a variational autoencoder in which the encoder maps each cell's gene expression to a latent representation z and the decoder reconstructs the counts conditioned on both z and the batch label. Because the batch label is supplied to the decoder but not the encoder, the latent space z is pushed to encode only batch-independent biological variation. At inference time, you can either (1) use the latent space directly for clustering (it is batch-corrected), or (2) generate "corrected" counts by decoding every cell with a single reference batch label. scVI models counts as negative binomial, which is appropriate for the overdispersed count data typical of scRNA-seq.
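The conditioning asymmetry described above (decoder sees the batch label, encoder does not) can be shown at the shape level with plain numpy. This is a toy linear stand-in, not scVI's actual deep architecture or negative binomial likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_latent, n_batches = 100, 10, 3

# Toy linear "networks" standing in for scVI's encoder/decoder.
W_enc = rng.normal(size=(n_genes, n_latent))              # genes -> z
W_dec = rng.normal(size=(n_latent + n_batches, n_genes))  # z AND batch -> genes

def encode(x):
    return x @ W_enc  # batch label is NOT an input: z must be batch-free

def decode(z, batch_idx):
    onehot = np.eye(n_batches)[batch_idx]
    return np.concatenate([z, onehot]) @ W_dec  # batch enters only here

x = rng.poisson(2.0, size=n_genes).astype(float)  # fake expression counts
z = encode(x)
recon_b0 = decode(z, 0)  # "corrected" output under a reference batch label
recon_b2 = decode(z, 2)  # same cell decoded as if it came from batch 2
print(z.shape, recon_b0.shape)  # (10,) (100,)
```

Because the decoder can explain batch-driven shifts through the one-hot input, reconstruction loss no longer rewards storing batch identity in z, which is the mechanism behind the correction.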

Q: What is a variant of uncertain significance (VUS) and how would you design an AI pipeline to help reclassify VUSs?

A: A VUS is a genetic variant observed in a patient whose clinical significance (pathogenic or benign) cannot be determined from the available evidence. VUSs are an enormous problem in clinical genetics - ClinVar contains over 600,000 VUSs for hereditary disease genes. An AI reclassification pipeline would combine multiple lines of evidence: (1) computational pathogenicity predictors - CADD score (evolutionary conservation plus functional genomics), AlphaMissense (Google DeepMind's variant effect predictor built on AlphaFold representations), ESM-1v (protein language model log-likelihood of the mutated sequence); (2) structural context from AlphaFold - is the variant in the protein's active site, a buried hydrophobic core, or an exposed loop? Variants disrupting core packing or catalytic residues are more likely pathogenic; (3) population frequency from gnomAD - variants observed in more than 0.1% of healthy individuals are unlikely to be highly penetrant; (4) functional assay data - deep mutational scanning experiments that measure the functional impact of every possible substitution in a gene. The pipeline combines these signals with a gradient boosted model or ensemble, calibrated against ClinVar's curated pathogenic/benign labels, and outputs a reclassification probability together with the contributing evidence features for clinical geneticist review.
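A minimal sketch of the evidence-combination step: the feature names and hand-set logistic weights below are made up for illustration, standing in for a model actually fit to curated ClinVar labels.

```python
import math

def combine_evidence(features, weights, bias):
    """Logistic combination of pathogenicity evidence into one probability.

    Feature names and weights are illustrative placeholders; a real
    pipeline would learn them (e.g. via gradient boosting) from
    ClinVar pathogenic/benign training labels.
    """
    score = bias + sum(weights[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-score))

weights = {"cadd_scaled": 0.08, "alphamissense": 2.5,
           "in_active_site": 1.2, "log10_gnomad_af": -0.9}
variant = {"cadd_scaled": 28.0, "alphamissense": 0.91,
           "in_active_site": 1.0, "log10_gnomad_af": -6.0}  # very rare variant
p = combine_evidence(variant, weights, bias=-8.0)
print(round(p, 3))  # p > 0.9: multiple concordant lines of evidence
```

The key design point is that the output is a calibrated probability plus its input features, not a black-box verdict - the clinical geneticist sees which evidence drove the score.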

Q: AlphaFold predicted structures for 200 million proteins. What are the remaining unsolved problems in computational structural biology?

A: AlphaFold solved single-chain monomeric structure prediction. The unsolved problems are: (1) multi-protein complexes - AlphaFold-Multimer (2022) made progress on protein-protein interactions, but accuracy drops for heteromeric complexes; (2) protein-ligand interactions - predicting where and how small molecules bind to proteins, critical for drug discovery; AlphaFold 2 does not model ligands; (3) protein dynamics - proteins are not static structures; they flex, change conformation upon binding, and transition between functional states. AlphaFold predicts one low-energy structure, not the conformational ensemble that NMR can provide; (4) membrane proteins in lipid bilayers - the membrane environment dramatically affects structure, and AlphaFold's training data is dominated by soluble proteins; (5) intrinsically disordered proteins that fold upon binding - predicting the bound-state structure of a disordered region; (6) quantitative prediction of mutation effects on thermodynamic stability, not just qualitative classification; (7) RNA structure prediction - still much harder than protein structure prediction, largely because far fewer solved RNA structures and shallower alignments are available for training; and (8) protein design - generating novel sequences that fold into specified target structures, which has advanced significantly with RFdiffusion and ProteinMPNN but is not fully solved.

Q: Explain the concept of linkage disequilibrium and why it matters for PRS portability across ancestries.

A: Linkage disequilibrium (LD) is the non-random association of alleles at two loci - knowing the allele at one position gives you information about the allele at a nearby position. LD arises because when a mutation first occurs, it sits on a specific haplotype (a particular combination of nearby variants), and it takes many generations of recombination to break up that association. In practice, variants within ~500 kb of each other often have LD above 0.3 in European populations. For PRS, most GWAS-identified variants are not the causal variants themselves - they are "tag SNPs" in LD with the causal variant. In a European-ancestry study, SNP rs12345 might show an odds ratio of 1.2 for diabetes not because rs12345 directly causes the disease, but because it tags the actual causal variant rs67890 (r2 = 0.85 in Europeans). African-ancestry populations have a larger effective population size and more generations of recombination in their history, so LD blocks are shorter and the LD pattern around the same causal variant differs: rs12345 may barely tag rs67890 (r2 = 0.2), and applying rs12345's European-derived effect size misattributes the signal. The solution is to discover and validate variants in diverse populations, run multi-ancestry GWAS, or fine-map to the actual causal variants, whose effects transfer far better across ancestries.
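The r2 statistic in the answer above comes directly from haplotype and allele frequencies; a minimal calculation (frequencies invented for illustration) shows how the same tag/causal pair can be tightly linked in one population and weakly linked in another.

```python
def ld_r2(p_ab, p_a, p_b):
    """r^2 between two loci from the haplotype frequency p_AB and the
    allele frequencies p_A, p_B:
        D   = p_AB - p_A * p_B
        r^2 = D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    """
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Same allele frequencies, different haplotype structure (made-up numbers):
print(ld_r2(p_ab=0.28, p_a=0.30, p_b=0.30))  # ~0.82: tag SNP works well
print(ld_r2(p_ab=0.12, p_a=0.30, p_b=0.30))  # ~0.02: tag SNP is nearly useless
```

With r2 near zero, the tag SNP carries almost none of the causal variant's signal, so a PRS built on it transfers poorly to that population.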

© 2026 EngineersOfAI. All rights reserved.