
Drug Discovery with AI

Twelve Years, Two Point Five Billion Dollars, and One Approved Drug

In 2003, a team of chemists at a mid-sized pharmaceutical company began work on a promising new compound. The target was a kinase involved in a specific cancer pathway. The biology was compelling. The early in-vitro results were exciting. Twelve years and an estimated $2.3 billion later, after two Phase II failures and one partial Phase III response, the program was terminated. The compound had unacceptable liver toxicity that the existing screening methods had not caught until late-stage clinical trials.

This is not an exceptional story. It is the median story. The average time from target identification to an approved drug is 12-15 years. The average cost is $2.5 billion (including the costs of failed programs). The Phase II to Phase III failure rate is roughly 50%. Of compounds that enter Phase I trials, only about 10% receive FDA approval. The attrition is brutal and expensive, and it is getting worse - as the easy molecular targets have been addressed, the remaining biology is harder, and the regulatory bar is higher.

The fundamental problem is one of search space size. The estimated number of drug-like molecules (following Lipinski's Rule of Five for oral bioavailability) is approximately $10^{60}$. The entire history of pharmaceutical chemistry has synthesized and tested roughly 100 million distinct compounds. Even if you could synthesize and test one million new compounds per day, it would take longer than the age of the universe to explore drug-like chemical space exhaustively.

This is exactly the problem that machine learning was built to solve: learning to predict structure from a vast combinatorial space without exhaustive enumeration. AlphaFold 2's solution to the protein structure prediction problem in 2020 was perhaps the most dramatic demonstration of this in scientific history. A problem that had stumped structural biologists for 50 years was solved, not by incremental experimental progress, but by a deep learning system trained on the Protein Data Bank.

The implications extend far beyond protein folding. Graph neural networks can now predict binding affinity, solubility, toxicity, and metabolic stability - properties that would each have required days of laboratory work - in milliseconds. Generative models can propose novel molecular scaffolds with specified property profiles that chemists have never synthesized. Virtual screening can triage a library of a billion virtual compounds to a shortlist of a thousand high-priority candidates before a single molecule is synthesized.

The first AI-designed drug candidate to enter Phase I trials was DSP-1181, a compound generated by Exscientia's AI platform and approved for clinical trials in 2020 - approximately 12 months from project start, versus an industry average of 4-5 years for hit identification alone. The race to validate AI drug discovery is underway, and the engineering challenges are formidable.

Why This Exists - The Productivity Crisis in Pharma

Eroom's Law (Eroom is Moore's Law spelled backwards) describes an empirical observation formalized by Jack Scannell et al. in 2012: the number of new drugs approved per billion dollars of inflation-adjusted R&D spending has roughly halved every nine years since the 1950s. Where Moore's Law describes exponential improvement, pharmaceutical R&D productivity has declined exponentially.

The causes are structural: better (more stringent) safety requirements mean more failures; the focus on harder biology (CNS diseases, rare diseases, complex cancers) increases risk; and "better than the Beatles" - it is increasingly hard to be clinically meaningfully better than existing treatments to earn reimbursement.

AI attacks the productivity crisis at multiple points in the pipeline. The three highest-leverage applications are: (1) protein structure prediction to enable structure-based drug design for previously undruggable targets, (2) ADMET property prediction to filter out toxic compounds early before expensive synthesis, and (3) generative molecular design to propose novel candidates with specified property profiles.

Historical Context

The computational approaches to drug discovery predate deep learning by decades. Quantitative Structure-Activity Relationship (QSAR) modeling, developed in the 1960s by Corwin Hansch and Toshio Fujita, established the principle that molecular structure determines biological activity and that this relationship can be modeled mathematically. Classical QSAR used handcrafted molecular descriptors - atom counts, topological indices, electrostatic properties - as features for linear regression or classical ML models.

The SMILES (Simplified Molecular Input Line Entry System) notation was developed by David Weininger at the EPA in 1988. SMILES encodes molecular structure as a linear string, enabling molecular databases and computational processing. The ZINC database (Irwin & Shoichet, 2005) compiled millions of commercially available compounds in SMILES format, enabling large-scale virtual screening.

The deep learning transition happened in stages. In 2012, Merck sponsored a Kaggle competition on molecular property prediction. The winning solution, from George Dahl and collaborators in Geoffrey Hinton's group, used multi-task deep feedforward networks on the provided molecular descriptors, outperforming all classical baselines. This result attracted immediate attention from the pharma industry.

The graph neural network revolution for molecules began with "Convolutional Networks on Graphs for Learning Molecular Fingerprints" (Duvenaud et al., 2015, from Aspuru-Guzik's group) and "Neural Message Passing for Quantum Chemistry" (Gilmer et al., 2017, Google Brain). These established that treating molecules natively as graphs - atoms as nodes, bonds as edges - outperformed fingerprint-based approaches on quantum chemistry property prediction.

The AlphaFold 2 moment came at the CASP14 competition in 2020. AlphaFold 2 predicted protein structures with atomic accuracy (median GDT_TS score of 92.4, approaching the error margin of experimental methods) on protein families it had never seen. The subsequent open-sourcing of AlphaFold 2 and the release of structure predictions for the entire human proteome (20,000+ proteins) fundamentally changed what is possible in structure-based drug design.

Core Concepts

Molecules as Graphs - The Foundational Insight

The key insight that unlocked deep learning for molecular science is representing molecules as graphs rather than as SMILES strings or fixed-length fingerprint vectors.

A molecule is naturally a graph: atoms are nodes, covalent bonds are edges. Each node has features: atomic number, formal charge, number of hydrogens, hybridization state, aromaticity. Each edge has features: bond type (single, double, triple, aromatic), whether the bond is in a ring.

Graph Neural Networks (GNNs) process this structure directly through message passing. In each layer, every atom aggregates information from its bonded neighbors:

$$h_v^{(k+1)} = \text{UPDATE}\left(h_v^{(k)},\ \text{AGGREGATE}\left(\{h_u^{(k)} : u \in \mathcal{N}(v)\}\right)\right)$$

where $h_v^{(k)}$ is the hidden representation of atom $v$ at layer $k$, and $\mathcal{N}(v)$ is the set of atoms bonded to $v$.

After $K$ layers of message passing, each atom's representation encodes information from all atoms within $K$ bonds. A global readout then aggregates atom representations into a molecular representation:

$$h_{\text{mol}} = \text{READOUT}\left(\{h_v^{(K)} : v \in V\}\right)$$

Typical readout functions: sum, mean, or attention-weighted sum over atom representations. This $h_{\text{mol}}$ vector is then used for property prediction.

The advantage over SMILES-string models or fingerprints: GNNs operate on the actual molecular topology, can share parameters across chemically similar substructures, and naturally handle molecules of any size without padding or truncation.
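The update-then-readout loop above can be sketched in a few lines of NumPy: sum aggregation over bonded neighbors, a linear update with ReLU, and a mean readout. This is a toy illustration of the mechanics, not a trained GNN; the weight matrices here are random.

```python
import numpy as np

# Toy 3-atom molecule (e.g., C-O-C): adjacency matrix A, atoms as rows of H
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))       # initial atom features, d = 4
W_self = rng.normal(size=(4, 4))  # update weights for the atom itself
W_nbr = rng.normal(size=(4, 4))   # update weights for aggregated neighbors

def message_passing_step(H, A):
    """One layer: AGGREGATE = sum over bonded neighbors, UPDATE = linear + ReLU."""
    M = A @ H                                     # sum over u in N(v)
    return np.maximum(0, H @ W_self + M @ W_nbr)  # h_v^{(k+1)}

for _ in range(2):       # K = 2 layers -> each atom sees a 2-bond neighborhood
    H = message_passing_step(H, A)

h_mol = H.mean(axis=0)   # READOUT: mean over atoms -> molecular vector
print(h_mol.shape)       # (4,)
```

In a real GNN the UPDATE and AGGREGATE functions are learned, and edge (bond) features enter the messages as well.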

SMILES Notation - The Molecular Language

SMILES encodes molecular structure as a linear string. Rules:

  • Atoms are represented by their atomic symbol: C (carbon), N (nitrogen), O (oxygen), F (fluorine), etc.
  • Lowercase letters indicate aromatic atoms: c (aromatic carbon), n (aromatic nitrogen)
  • Bonds: - single (implicit), = double, # triple
  • Branches: enclosed in parentheses
  • Rings: indicated by matching numbers
  • Charges: [N+], [O-]

  • Aspirin: CC(=O)Oc1ccccc1C(=O)O
  • Caffeine: Cn1cnc2c1c(=O)n(c(=O)n2C)C
  • Ibuprofen: CC(C)Cc1ccc(cc1)C(C)C(=O)O

SMILES lets molecules be treated as sequences, so sequence models (RNNs, Transformers) apply directly. The SELFIES (Self-Referencing Embedded Strings) notation, developed by Aspuru-Guzik's group, improves on SMILES by guaranteeing chemical validity for any generated string - a critical property for generative models.
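A quick illustration with RDKit, the standard open-source cheminformatics toolkit: invalid SMILES strings parse to `None`, which is exactly the failure mode SELFIES eliminates by construction.

```python
from rdkit import Chem

valid = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
invalid = Chem.MolFromSmiles("c1ccccc")              # unclosed aromatic ring

print(valid is not None)    # True
print(invalid is None)      # True: RDKit rejects the broken ring

# Canonicalization: many SMILES strings map to one molecule,
# so databases store a single canonical form per compound.
print(Chem.MolToSmiles(valid))
print(valid.GetNumAtoms())  # heavy atoms only: 13 for aspirin (C9H8O4)
```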

ADMET Properties - The Filters That Kill Drugs

ADMET stands for Absorption, Distribution, Metabolism, Excretion, Toxicity. These are the pharmacokinetic properties that determine whether a compound with good in-vitro activity against a target will actually work as a drug in a human body.

  • Absorption: does the compound reach the bloodstream from the gut? (oral bioavailability)
  • Distribution: does it reach the target tissue? Blood-brain barrier penetration?
  • Metabolism: how quickly is it broken down? Are the metabolites toxic?
  • Excretion: how is it cleared from the body? Kidney-cleared or liver-cleared?
  • Toxicity: does it damage liver cells (hepatotoxicity)? Disrupt cardiac ion channels (hERG liability)? Genotoxic?

Lipinski's Rule of Five (Lipinski et al., 1997) provides a first-pass filter for oral bioavailability:

  • Molecular weight ≤ 500 Da
  • LogP (lipophilicity) ≤ 5
  • Hydrogen bond donors ≤ 5
  • Hydrogen bond acceptors ≤ 10

Compounds violating more than one rule are unlikely to be orally bioavailable. ML models trained on experimental ADMET data from screening campaigns now predict all of these properties with accuracy comparable to, or exceeding, early experimental assays - at a fraction of the cost and time.
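The four rules translate directly into RDKit descriptor calls. A minimal sketch of the filter:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def rule_of_five_violations(smiles: str) -> int:
    """Count Lipinski violations; more than one suggests poor oral bioavailability."""
    mol = Chem.MolFromSmiles(smiles)
    return sum([
        Descriptors.MolWt(mol) > 500,      # molecular weight <= 500 Da
        Descriptors.MolLogP(mol) > 5,      # LogP <= 5
        Lipinski.NumHDonors(mol) > 5,      # H-bond donors <= 5
        Lipinski.NumHAcceptors(mol) > 10,  # H-bond acceptors <= 10
    ])

# Aspirin (MW ~180, LogP ~1.3, 1 donor, 4 acceptors): no violations
print(rule_of_five_violations("CC(=O)Oc1ccccc1C(=O)O"))  # 0
```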

AlphaFold 2 - Structure-Based Drug Design

Knowing a protein's 3D structure enables rational drug design: designing a molecule that physically fits into the protein's binding pocket like a key into a lock. Without the structure, you are designing molecules blind.

Before AlphaFold 2, X-ray crystallography and cryo-electron microscopy were the methods for determining protein structures, but they are slow (months to years), expensive, and some proteins resist crystallization. Of the estimated 200 million protein sequences in UniProt, experimental structures existed for roughly 170,000 - less than 0.1%.

AlphaFold 2's innovation was treating protein structure prediction as an end-to-end deep learning problem. Key architectural components:

Multiple Sequence Alignment (MSA) encoding: protein sequences that are evolutionarily related (homologs) constrain the structure through coevolutionary signals. Residues that are in physical contact tend to coevolve. AlphaFold 2 extracts these coevolution signals from thousands of aligned homologous sequences using a novel MSA transformer.

Evoformer: a stack of transformer blocks operating on both the MSA representation and a pairwise residue interaction matrix, refining both representations iteratively.

Structure module: an SE(3)-equivariant transformer that places residues in 3D space, explicitly handling the rigid body geometry of amino acid backbones.

Recycling: the full network is applied 3-4 times, each time refining the predicted structure.

The output is atomic coordinates for every amino acid in the protein. AlphaFold 2 predicted structures for the entire human proteome, the proteomes of 47 other organisms, and over 200 million proteins from UniProt - all released publicly by DeepMind in collaboration with EMBL-EBI.

Molecular Generation - Designing New Drugs

Once you can predict molecular properties, the next question is: can you generate novel molecules with desired properties? This is the inverse problem.

Two main approaches:

SMILES-based generative models: treat SMILES strings as sequences and apply language model techniques (RNN, VAE, Transformer). The character-level model learns to generate valid SMILES strings. Property optimization is done by:

  • Bayesian optimization in the latent space of a VAE
  • Reinforcement learning: reward the model for generating molecules with high predicted property scores
  • Genetic algorithms: mutate and crossbreed high-scoring SMILES strings
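Whatever the outer strategy, all three share the same inner loop: score candidates with a property oracle and keep the best. A minimal sketch using RDKit's QED as the oracle (the proposal/mutation step, which is what actually differs between the methods, is omitted):

```python
from rdkit import Chem
from rdkit.Chem import QED

def select_top(candidates, k=2):
    """Score candidate SMILES with QED and keep the k best - the selection
    step shared by RL, Bayesian optimization, and genetic algorithms."""
    scored = []
    for smi in candidates:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:          # invalid strings are silently discarded
            scored.append((QED.qed(mol), smi))
    scored.sort(reverse=True)        # higher QED first
    return scored[:k]

pool = [
    "CC(=O)Oc1ccccc1C(=O)O",        # aspirin
    "CC(C)Cc1ccc(cc1)C(C)C(=O)O",   # ibuprofen
    "c1ccccc1",                     # benzene
    "not_a_smiles",                 # invalid, filtered out
]
best = select_top(pool, k=2)
```

In an RL setup the QED score (or a predicted binding affinity) becomes the reward; in a GA the survivors seed the next round of mutation and crossover.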

Graph generative models: directly generate molecular graphs atom by atom, bond by bond. The Junction Tree VAE (Jin et al., 2018) first generates a scaffold from molecular fragments (the junction tree), then assembles the atoms. This guarantees chemical validity and makes generation more interpretable.

Diffusion models for molecules: DiffSBDD and DiffLinker apply diffusion processes to 3D molecular coordinates, generating molecules conditioned on a protein binding pocket structure. These are currently state-of-the-art for structure-based drug design.

Virtual Screening and Binding Affinity

Given a protein structure and a library of candidate molecules, virtual screening predicts which molecules are most likely to bind to the protein and inhibit its function.

Classical docking software (AutoDock Vina, Glide) uses physics-based scoring functions to estimate binding pose and affinity. These are computationally expensive: screening a million compounds against a single protein target takes days.

ML-based scoring functions replace the physics-based scoring with a neural network trained on experimental binding affinity data (Ki, IC50, Kd values from ChEMBL or BindingDB). This is faster but requires experimental data for training.

The binding affinity prediction problem is often framed as regression:

$$\widehat{\Delta G} = f_\theta(G_{\text{ligand}}, G_{\text{protein pocket}})$$

where $\Delta G$ is the free energy of binding (directly related to $K_d$ via $\Delta G = RT \ln K_d$), and the model takes both the ligand graph and a graph representation of the protein binding-pocket residues as input. Geometric deep learning models that operate on 3D atomic coordinates (e.g., EquiBind, DiffDock) are the current state of the art.
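The relationship between free energy and dissociation constant is worth checking numerically: at 298 K, every tenfold improvement in Kd is worth about 1.36 kcal/mol of binding free energy (using R in kcal/(mol·K)):

```python
import math

R = 1.987e-3   # gas constant, kcal/(mol K)
T = 298.0      # temperature, K

def delta_g(kd_molar: float) -> float:
    """Binding free energy (kcal/mol) from a dissociation constant (M)."""
    return R * T * math.log(kd_molar)

# A 1 nM binder: roughly -12.3 kcal/mol
print(round(delta_g(1e-9), 2))   # -12.27

# Each 10x improvement in Kd gains ~1.36 kcal/mol
print(round(delta_g(1e-9) - delta_g(1e-8), 2))  # -1.36
```

This is why medicinal chemists think in "log units" of potency: a single well-placed hydrogen bond (~1-2 kcal/mol) can move affinity by an order of magnitude.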

Code Examples

Molecular Property Prediction with DeepChem and GNNs

```python
import deepchem as dc
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_mean_pool
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader
from rdkit import Chem
import numpy as np
from typing import Optional

# --------------------------------------------------
# 1. Molecular featurization using DeepChem
# --------------------------------------------------

def load_lipophilicity_dataset():
    """
    Load the Lipophilicity dataset from MoleculeNet.
    Predicting LogD (distribution coefficient) from molecular structure.

    This is a benchmark regression task in molecular ML.
    Dataset: 4200 compounds with experimental LogD values.
    """
    tasks, datasets, transformers = dc.molnet.load_lipo(
        featurizer='GraphConv',
        splitter='scaffold'  # Scaffold split is CRITICAL for molecular ML
    )
    train, val, test = datasets
    return tasks, train, val, test, transformers


def smiles_to_graph(smiles: str) -> Optional[Data]:
    """
    Convert a SMILES string to a PyTorch Geometric graph.
    Atoms as nodes with features, bonds as edges with features.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None

    # --- Node features (per atom) ---
    atom_features = []
    for atom in mol.GetAtoms():
        features = [
            atom.GetAtomicNum(),        # Atomic number (1-118)
            atom.GetFormalCharge(),     # Formal charge
            atom.GetNumImplicitHs(),    # Number of implicit H
            int(atom.GetIsAromatic()),  # Aromaticity flag
            int(atom.IsInRing()),       # Ring membership
            atom.GetDegree(),           # Number of bonds
            # RDKit hybridization enum (SP=2, SP2=3, SP3=4, ...)
            int(atom.GetHybridization()),
        ]
        atom_features.append(features)

    x = torch.tensor(atom_features, dtype=torch.float)

    # --- Edge features (per bond) ---
    edge_index = []
    edge_attr = []

    for bond in mol.GetBonds():
        i = bond.GetBeginAtomIdx()
        j = bond.GetEndAtomIdx()

        bond_features = [
            # Bond order: SINGLE=1.0, DOUBLE=2.0, TRIPLE=3.0, AROMATIC=1.5
            bond.GetBondTypeAsDouble(),
            int(bond.GetIsAromatic()),
            int(bond.IsInRing()),
            int(bond.GetIsConjugated()),
        ]

        # Add both directions (undirected graph)
        edge_index.extend([[i, j], [j, i]])
        edge_attr.extend([bond_features, bond_features])

    if len(edge_index) == 0:
        # Handle single-atom molecules
        edge_index = torch.zeros((2, 0), dtype=torch.long)
        edge_attr = torch.zeros((0, 4), dtype=torch.float)
    else:
        edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()
        edge_attr = torch.tensor(edge_attr, dtype=torch.float)

    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr)


# --------------------------------------------------
# 2. Graph Neural Network for molecular property prediction
# --------------------------------------------------

class MolecularGNN(nn.Module):
    """
    Message-passing GNN for molecular property prediction.

    Architecture:
    - 4 Graph Attention (GAT) layers for atom message passing
    - Global mean pooling to get a molecular representation
    - MLP regression head

    GAT is preferred over GCN for molecular tasks:
    attention weights allow the model to learn which
    neighboring atoms are most relevant for predicting
    the target property.
    """

    def __init__(
        self,
        node_features: int = 7,
        hidden_dim: int = 128,
        output_dim: int = 1,
        num_layers: int = 4,
        dropout: float = 0.1,
        heads: int = 4
    ):
        super().__init__()

        self.node_embedding = nn.Linear(node_features, hidden_dim)

        # Stack of GAT layers. With concat=True a layer outputs
        # hidden_dim * heads features; the final layer averages over
        # heads (concat=False) so the pooled vector has hidden_dim features.
        self.gat_layers = nn.ModuleList()
        for i in range(num_layers):
            in_dim = hidden_dim * heads if i > 0 else hidden_dim
            self.gat_layers.append(
                GATConv(
                    in_dim,
                    hidden_dim,
                    heads=heads,
                    dropout=dropout,
                    concat=(i < num_layers - 1)
                )
            )

        self.dropout = nn.Dropout(dropout)

        # Property prediction head
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, output_dim)
        )

    def forward(self, data: Data) -> torch.Tensor:
        x, edge_index, batch = data.x, data.edge_index, data.batch

        # Initial node embedding
        x = self.node_embedding(x)
        x = F.relu(x)

        # Message passing layers
        for i, layer in enumerate(self.gat_layers):
            x = layer(x, edge_index)
            if i < len(self.gat_layers) - 1:
                x = F.elu(x)
                x = self.dropout(x)
            else:
                x = F.relu(x)

        # Global pooling: aggregate atom representations into a molecule representation
        x = global_mean_pool(x, batch)  # (num_graphs, hidden_dim)

        # Property prediction
        return self.mlp(x).squeeze(-1)


def train_gnn_property(
    model: MolecularGNN,
    train_loader: DataLoader,
    val_loader: DataLoader,
    epochs: int = 100,
    lr: float = 1e-3,
    device: str = "cuda",
):
    """Train GNN for molecular property regression."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, patience=10, factor=0.5
    )

    best_val_rmse = float('inf')

    for epoch in range(epochs):
        # Training
        model.train()
        total_loss = 0.0
        for batch in train_loader:
            batch = batch.to(device)
            optimizer.zero_grad()
            pred = model(batch)
            loss = F.mse_loss(pred, batch.y.view(-1))
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            total_loss += loss.item() * batch.num_graphs

        avg_train_loss = total_loss / len(train_loader.dataset)

        # Validation
        model.eval()
        val_preds, val_labels = [], []
        with torch.no_grad():
            for batch in val_loader:
                batch = batch.to(device)
                pred = model(batch)
                val_preds.extend(pred.cpu().numpy())
                val_labels.extend(batch.y.view(-1).cpu().numpy())

        val_rmse = np.sqrt(np.mean((np.array(val_preds) - np.array(val_labels))**2))
        scheduler.step(val_rmse)

        print(f"Epoch {epoch+1}/{epochs}: "
              f"Train MSE={avg_train_loss:.4f}, Val RMSE={val_rmse:.4f}")

        if val_rmse < best_val_rmse:
            best_val_rmse = val_rmse
            torch.save(model.state_dict(), "best_gnn_model.pt")

    return best_val_rmse
```

ADMET Property Prediction Pipeline

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED, Lipinski
from rdkit.Chem import rdMolDescriptors
from typing import Dict, List

class ADMETPredictor:
    """
    Comprehensive ADMET property prediction.

    Combines:
    1. Rule-based filters (Lipinski, PAINS, structural alerts)
    2. ML-predicted properties (solubility, BBB, hERG, hepatotoxicity)
    3. Drug-likeness scoring (QED)

    In production at a pharma company, this would be connected to
    trained models validated against proprietary assay data.
    Here we use rule-based RDKit calculations for illustration.
    """

    # Structural alerts for known toxicophores
    TOXICOPHORE_SMARTS = {
        "Michael acceptor": "[CX3]=[CX3][CX3]=[O]",
        "Aldehyde": "[CX3H1](=O)[#6]",
        "Epoxide": "[OX2r3]",
        "Peroxide": "[OX2,OX1-][OX2,OX1-]",
        "Quinone": "O=C1C=CC(=O)C=C1",
        "Nitro (aliphatic)": "[CX4][NX3](=[OX1])[OX1-,OX2H]",
    }

    def __init__(self):
        # Compile SMARTS patterns once
        self.alert_patterns = {
            name: Chem.MolFromSmarts(smarts)
            for name, smarts in self.TOXICOPHORE_SMARTS.items()
        }

    def lipinski_filter(self, mol) -> Dict[str, float]:
        """Lipinski Rule of Five assessment."""
        mw = Descriptors.MolWt(mol)
        logp = Descriptors.MolLogP(mol)
        hbd = Lipinski.NumHDonors(mol)
        hba = Lipinski.NumHAcceptors(mol)

        violations = sum([
            mw > 500,
            logp > 5,
            hbd > 5,
            hba > 10
        ])

        return {
            "molecular_weight": mw,
            "logP": logp,
            "HBD": hbd,
            "HBA": hba,
            "lipinski_violations": violations,
            "passes_lipinski": violations <= 1
        }

    def structural_alerts(self, mol) -> Dict[str, bool]:
        """Check for known toxicophore substructures."""
        alerts = {}
        for name, pattern in self.alert_patterns.items():
            if pattern is not None:
                alerts[name] = mol.HasSubstructMatch(pattern)
        return alerts

    def drug_likeness_qed(self, mol) -> float:
        """
        QED (Quantitative Estimate of Drug-likeness, Bickerton et al., 2012).
        Composite score from 0 to 1 based on 8 molecular properties.
        > 0.67 is considered drug-like.
        """
        return QED.qed(mol)

    def tpsa(self, mol) -> float:
        """Topological Polar Surface Area - predictor of oral bioavailability."""
        return rdMolDescriptors.CalcTPSA(mol)

    def predict_properties(self, smiles: str) -> Dict:
        """Full ADMET prediction for a SMILES string."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return {"valid_smiles": False}

        result = {"valid_smiles": True, "smiles": smiles}

        # Rule-based properties
        result.update(self.lipinski_filter(mol))
        result["structural_alerts"] = self.structural_alerts(mol)
        result["qed"] = self.drug_likeness_qed(mol)
        result["tpsa"] = self.tpsa(mol)

        # Veber criteria: TPSA > 140 or rotatable bonds > 10 -> poor oral absorption
        result["rotatable_bonds"] = rdMolDescriptors.CalcNumRotatableBonds(mol)

        # Number of aromatic rings (higher = often more metabolically labile)
        result["aromatic_rings"] = rdMolDescriptors.CalcNumAromaticRings(mol)

        # Overall assessment
        result["oral_bioavailability_likely"] = (
            result["passes_lipinski"] and
            result["tpsa"] <= 140 and
            result["rotatable_bonds"] <= 10 and
            not any(result["structural_alerts"].values())
        )

        return result

    def batch_screen(
        self, smiles_list: List[str], top_n: int = 100
    ) -> List[Dict]:
        """
        Screen a list of SMILES and return top candidates by QED.
        This is the first-pass filter in a virtual screening campaign.
        """
        results = []
        for smiles in smiles_list:
            props = self.predict_properties(smiles)
            if props.get("valid_smiles") and props.get("oral_bioavailability_likely"):
                results.append(props)

        # Sort by QED (higher is better)
        results.sort(key=lambda x: x.get("qed", 0), reverse=True)
        return results[:top_n]


# Example usage
predictor = ADMETPredictor()

test_molecules = [
    ("Aspirin", "CC(=O)Oc1ccccc1C(=O)O"),
    ("Ibuprofen", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"),
    ("Caffeine", "Cn1cnc2c1c(=O)n(c(=O)n2C)C"),
    ("Lovastatin", "CCC(C)C(=O)O[C@@H]1CC(=C/C=C/2C[C@@H](O)C[C@H](O)C2=O)\\C=C\\[C@]3([H])C1C[C@H](C)CC3"),
]

for name, smiles in test_molecules:
    props = predictor.predict_properties(smiles)
    if not props["valid_smiles"]:
        print(f"\n{name}: invalid SMILES")
        continue
    print(f"\n{name}:")
    print(f"  MW: {props['molecular_weight']:.1f}, LogP: {props['logP']:.2f}")
    print(f"  QED: {props['qed']:.3f}, TPSA: {props['tpsa']:.1f}")
    print(f"  Oral bioavailability likely: {props['oral_bioavailability_likely']}")
```

Generative Molecular Design with SMILES

```python
import re

import torch
import torch.nn as nn
from torch.utils.data import Dataset
from typing import List, Tuple

class SMILESDataset(Dataset):
    """Dataset for SMILES string language modeling."""

    # Vocabulary for a token-level SMILES model.
    # Two-character tokens (Cl, Br, @@) must be matched before single
    # characters, so encoding uses the regex tokenizer below rather
    # than naive character splitting.
    VOCAB = ["<PAD>", "<SOS>", "<EOS>", "Cl", "Br", "@@"] + list(
        "BCNOSPFIcnops=#()-+[]@/\\.123456789"
    )
    TOKEN_RE = re.compile(r"Cl|Br|@@|.")

    def __init__(self, smiles_list: List[str], max_length: int = 120):
        self.smiles_list = smiles_list
        self.max_length = max_length
        self.char2idx = {c: i for i, c in enumerate(self.VOCAB)}
        self.idx2char = {i: c for i, c in enumerate(self.VOCAB)}
        self.pad_idx = self.char2idx["<PAD>"]
        self.sos_idx = self.char2idx["<SOS>"]
        self.eos_idx = self.char2idx["<EOS>"]

    def encode(self, smiles: str) -> torch.Tensor:
        """Encode a SMILES string to integer indices."""
        tokens = ["<SOS>"] + self.TOKEN_RE.findall(smiles) + ["<EOS>"]
        indices = [self.char2idx.get(t, self.pad_idx) for t in tokens]

        # Pad or truncate to max_length
        if len(indices) < self.max_length:
            indices += [self.pad_idx] * (self.max_length - len(indices))
        else:
            indices = indices[:self.max_length]

        return torch.tensor(indices, dtype=torch.long)

    def __len__(self):
        return len(self.smiles_list)

    def __getitem__(self, idx):
        encoded = self.encode(self.smiles_list[idx])
        return encoded[:-1], encoded[1:]  # Input, target (shifted by 1)


class SMILESGenerator(nn.Module):
    """
    LSTM-based SMILES generator.

    Trained as a language model: predicts the next token
    given all previous tokens. Generation is done by
    autoregressive sampling.

    More powerful alternatives: GPT-style transformers,
    but an LSTM is sufficient and faster to train on small datasets.
    """

    def __init__(
        self,
        vocab_size: int,
        embed_dim: int = 128,
        hidden_dim: int = 512,
        num_layers: int = 3,
        dropout: float = 0.1
    ):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim, num_layers,
            batch_first=True, dropout=dropout
        )
        self.output_layer = nn.Linear(hidden_dim, vocab_size)

    def forward(
        self, x: torch.Tensor, hidden=None
    ) -> Tuple[torch.Tensor, tuple]:
        embedded = self.embedding(x)
        output, hidden = self.lstm(embedded, hidden)
        logits = self.output_layer(output)
        return logits, hidden

    def generate(
        self,
        dataset: SMILESDataset,
        max_length: int = 120,
        temperature: float = 1.0,
        device: str = "cpu"
    ) -> str:
        """
        Autoregressive SMILES generation.

        Temperature controls diversity:
        - Low T (0.5): more conservative, higher validity rate
        - High T (1.2): more diverse, lower validity rate
        """
        self.eval()
        with torch.no_grad():
            # Start with the <SOS> token
            x = torch.tensor([[dataset.sos_idx]], dtype=torch.long).to(device)
            hidden = None
            generated = []

            for _ in range(max_length):
                logits, hidden = self(x, hidden)
                # Apply temperature scaling
                probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
                # Sample from the distribution
                next_token = torch.multinomial(probs, num_samples=1).item()

                if next_token in (dataset.eos_idx, dataset.pad_idx):
                    break

                generated.append(dataset.idx2char.get(next_token, ''))
                x = torch.tensor([[next_token]], dtype=torch.long).to(device)

        return ''.join(generated)


def generate_and_filter(
    model: SMILESGenerator,
    dataset: SMILESDataset,
    n_samples: int = 10000,
    target_qed_min: float = 0.6,
    device: str = "cpu"
) -> List[str]:
    """
    Generate molecules and filter for drug-like candidates.
    This is the basic RL-free generation pipeline.
    """
    from rdkit import Chem
    from rdkit.Chem import QED

    valid_molecules = []
    drug_like_molecules = []

    for _ in range(n_samples):
        smiles = model.generate(dataset, temperature=0.8, device=device)

        # Check validity
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        valid_molecules.append(smiles)

        # Check drug-likeness
        if QED.qed(mol) >= target_qed_min:
            drug_like_molecules.append(smiles)

    validity_rate = len(valid_molecules) / n_samples
    drug_like_rate = len(drug_like_molecules) / n_samples

    print(f"Validity rate: {validity_rate:.2%}")
    print(f"Drug-like rate (QED >= {target_qed_min}): {drug_like_rate:.2%}")

    return drug_like_molecules
```


Production Engineering Notes

Scaffold Splitting - The Critical Validation Practice

This is the single most important methodological point in molecular ML: you must use scaffold splitting, not random splitting.

In chemistry, a scaffold is the core ring system of a molecule. Molecules sharing the same scaffold tend to have similar properties. If you randomly split your dataset, related molecules will appear in both train and test sets, and your model will appear to generalize far better than it actually does.

Scaffold splitting (Bemis-Murcko scaffolds) ensures that molecules with the same core scaffold are assigned to the same split. This creates a more realistic test of generalization to structurally novel chemotypes - which is the actual challenge in drug discovery.

The performance gap between random and scaffold splits is often 20-40% in RMSE. A model that looks excellent with random splitting may completely fail in prospective virtual screening because all the test compounds are structurally novel.
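RDKit computes Bemis-Murcko scaffolds directly; grouping compounds by their scaffold string before assigning splits is the core of the practice. A sketch with illustrative molecules:

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_of(smiles: str) -> str:
    """Bemis-Murcko scaffold: ring systems plus the linkers between them."""
    return MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles)

# Ibuprofen and aspirin both collapse to a bare benzene scaffold, so a
# scaffold split assigns them to the same fold - never one in train
# and the other in test.
print(scaffold_of("CC(C)Cc1ccc(cc1)C(C)C(=O)O"))  # ibuprofen -> c1ccccc1
print(scaffold_of("CC(=O)Oc1ccccc1C(=O)O"))       # aspirin   -> c1ccccc1

def group_by_scaffold(smiles_list):
    """Bucket compounds by scaffold; split assignment then operates on buckets."""
    groups = {}
    for smi in smiles_list:
        groups.setdefault(scaffold_of(smi), []).append(smi)
    return groups
```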

Activity Cliffs - The Hard Problem

An activity cliff is a pair of structurally similar molecules that have drastically different biological activities. Two molecules that differ by a single fluorine atom may have a 1000-fold difference in binding affinity. Current GNNs struggle with activity cliffs because small structural differences that create large property differences are hard to learn from local graph neighborhoods.

Production molecular ML systems handle activity cliffs by: flagging predictions when the query molecule is structurally dissimilar to training data (using Tanimoto similarity to nearest neighbor), reporting confidence intervals, and routing flagged compounds to additional experimental validation rather than trusting the ML prediction blindly.
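The applicability-domain check described above takes only a few lines with RDKit Morgan fingerprints (a sketch; the 0.3 cutoff is the heuristic from this text, not a universal constant):

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles: str):
    """Radius-2 Morgan fingerprint (ECFP4-like), 2048 bits."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def in_domain(query: str, training_smiles, cutoff: float = 0.3) -> bool:
    """Flag a prediction as out-of-domain when the query's nearest
    training-set neighbor falls below the Tanimoto cutoff."""
    qfp = morgan_fp(query)
    nearest = max(
        DataStructs.TanimotoSimilarity(qfp, morgan_fp(s))
        for s in training_smiles
    )
    return nearest >= cutoff

train = ["CC(=O)Oc1ccccc1C(=O)O", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]
print(in_domain("CC(=O)Oc1ccccc1C(=O)O", train))  # identical compound: True
print(in_domain("C", train))                      # methane: far out of domain
```

In production this check gates every prediction: out-of-domain compounds get wide confidence intervals and are routed to experimental validation instead of being trusted.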

Data Provenance and IP

In pharma, the provenance of training data has legal implications. Models trained on proprietary internal assay data have IP implications - the model weights may encode patentable information about structure-activity relationships. Production molecular ML systems need:

Clear documentation of data sources. Separation of models trained on public data (ChEMBL, BindingDB, PubChem BioAssay) from models trained on proprietary data. Legal review of model weights before any external publication or open-sourcing.

Common Mistakes

:::danger Random Split in Molecular ML Validation

Using random train/test splitting in molecular ML produces optimistically biased performance metrics. The test set will contain structural analogs of training compounds, and the model will appear to generalize far better than it will in prospective use (where all test compounds are structurally novel).

Fix: Always use scaffold-based splitting (Bemis-Murcko scaffolds) or time-based splitting for prospective validity. Use DeepChem's ScaffoldSplitter or RDKit's MurckoScaffold utilities. Report both random and scaffold split performance so readers understand the gap. :::

:::danger Ignoring Activity Cliffs in Model Confidence Reporting

A GNN that performs well on average RMSE but fails catastrophically on activity cliffs will mislead medicinal chemists into deprioritizing highly active compounds. An ML-predicted "low potency" compound that actually has 100nM activity will be incorrectly filtered from the virtual screening hit list.

Fix: Add a novelty/applicability domain check to every prediction: compute Tanimoto similarity to the nearest training compound. Flag predictions with nearest-neighbor Tanimoto < 0.3 as "out of domain" with reduced confidence. Never report a single point prediction without a confidence estimate. :::

:::warning Conflating In-vitro Activity with In-vivo Efficacy

A model trained on IC50 values from a biochemical assay predicts in-vitro potency against an isolated target protein. This is completely different from in-vivo efficacy in a whole organism. A compound with sub-nanomolar IC50 may have no in-vivo effect because it cannot reach the target tissue (poor distribution), is rapidly metabolized (poor half-life), or the assay target is not the disease-relevant mechanism.

Fix: Be explicit about what property each model predicts. Never use the word "efficacy" when you mean "in-vitro potency." The ADMET pipeline exists precisely because target activity alone is insufficient for drug discovery. :::

:::warning Training on Single-Source Assay Data Without Considering Assay Artifacts

Different labs measuring the same compound against the same target can produce IC50 values that differ by 10-100 fold due to assay conditions, cell line differences, and protocol variations. Models trained on a single assay source may learn assay-specific artifacts.

Fix: When training on ChEMBL or similar aggregated databases, filter for dose-response confirmatory assay data (not single-concentration primary screening data), check for high variability across reported values for the same compound-target pair, and curate carefully. Use confidence scores from ChEMBL to filter low-quality measurements. :::
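The variability check can be sketched in pure Python over (compound, target, pIC50) triples. The one-log-unit tolerance is a hypothetical choice, and a real pipeline would operate on curated ChEMBL exports rather than in-memory tuples.

```python
from collections import defaultdict
from statistics import median

def consolidate_measurements(records, max_log_spread=1.0):
    """Collapse replicate (compound, target) pIC50 values to their median,
    discarding pairs whose reported values disagree by more than
    max_log_spread log units (likely assay artifacts)."""
    by_pair = defaultdict(list)
    for cid, tid, pic50 in records:
        by_pair[(cid, tid)].append(pic50)
    return {
        pair: median(vals)
        for pair, vals in by_pair.items()
        if max(vals) - min(vals) <= max_log_spread
    }
```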

Interview Questions and Answers

Q1: What is AlphaFold 2 and why was it a breakthrough for drug discovery?

AlphaFold 2 is a deep learning system from DeepMind that predicts the 3D structure of a protein from its amino acid sequence. At CASP14 in 2020, it achieved a median GDT_TS score of 92.4 on previously unseen proteins, approaching the accuracy of experimental methods like X-ray crystallography.

The drug discovery breakthrough is in structure-based drug design. To design a molecule that inhibits a protein, you ideally need to know the 3D shape of the protein's binding site. Before AlphaFold 2, experimental structure determination (X-ray crystallography, cryo-EM) was the bottleneck - expensive, slow, and often impossible for membrane proteins that resist crystallization. AlphaFold 2 provides predicted structures for essentially any protein sequence in days rather than months, at no cost.

DeepMind released predicted structures for the entire human proteome (20,000+ proteins) and 200 million proteins from UniProt. This has opened structure-based drug design for thousands of previously "undruggable" targets where no structural information existed.

Q2: Why are graph neural networks preferred over fingerprint-based methods for molecular property prediction?

Morgan fingerprints (circular fingerprints, ECFP) are fixed-length binary vectors that encode the presence of specific molecular substructures within a given radius of each atom. They are effective and fast, but have limitations.

Fixed representation: a 2048-bit fingerprint has a fixed information capacity, regardless of molecular complexity. Bit collisions (different substructures hashing to the same bit) cause information loss, particularly for large molecules.

No learned features: fingerprint bits correspond to predefined substructure patterns, not patterns optimized for the target prediction task.

GNNs learn task-specific features through message passing. The atom and bond representations are optimized end-to-end for the property being predicted. For quantum chemistry properties (HOMO-LUMO gap, dipole moment), GNNs trained on QM9 significantly outperform ECFP-based models. For ADMET prediction tasks, the performance gain of GNNs over deep networks on ECFP is smaller but consistent.

The practical tradeoff: ECFP + XGBoost is faster to train and competitive for simple tasks. GNNs are worth the added complexity for tasks where local chemical context over several bond lengths matters.
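The bit-collision limitation is easy to demonstrate: folding more distinct substructure identifiers than there are fingerprint bits guarantees collisions by the pigeonhole principle. The sketch below uses a toy stable hash over made-up substructure keys, not RDKit's actual ECFP environment hashing.

```python
import hashlib

def fold_into_bits(substructure_keys, n_bits=2048):
    """Fold arbitrary substructure identifiers into a fixed-width bit vector.
    Distinct keys landing on the same bit become indistinguishable."""
    on_bits = set()
    for key in substructure_keys:
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        on_bits.add(h % n_bits)
    return on_bits

# A large molecule can generate more atom environments than a
# 2048-bit vector can keep distinct.
keys = [f"env_{i}" for i in range(3000)]
fp = fold_into_bits(keys)
assert len(fp) < len(keys)  # pigeonhole: information was lost
```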

Q3: Explain the concept of Eroom's Law and how AI is expected to address it.

Eroom's Law (Jack Scannell et al., 2012) observes that the number of new drugs approved per billion dollars of R&D spending has halved approximately every 9 years since the 1950s. This is the inverse of Moore's Law. Pharma R&D productivity has been declining despite massive increases in spending and technological capability.

The causes: increasingly stringent safety requirements (harder to pass Phase III), focus on difficult targets (CNS, oncology) with high failure rates, regulatory requirements for better-than-existing-treatment efficacy, and the "best picks" being exhausted early.

AI addresses specific nodes of this inefficiency. Virtual screening reduces the cost of hit identification by orders of magnitude - identifying candidates with ML predictions costs roughly $0.001 per compound versus $100+ per compound in HTS. ADMET prediction catches toxic compounds before expensive synthesis, reducing late-stage attrition. AlphaFold enables structure-based design for previously inaccessible targets.

AI cannot address the fundamental biology difficulty or the regulatory environment. But if it moves the 10% approval rate for Phase I compounds to 15%, or reduces average time from target to IND by 2 years, the economic impact is enormous. Whether this is sufficient to reverse Eroom's Law remains to be proven - we are only beginning to see Phase II data from AI-designed programs.

Q4: What is the difference between virtual screening and de novo molecular design?

Virtual screening involves selecting compounds from an existing library (physical or virtual) and ranking them by predicted activity against a target. The search space is constrained to enumerated compounds. Ultra-large virtual libraries now contain 10^9 to 10^11 compounds accessible via on-demand synthesis - these are screened computationally, and only the top hits are synthesized.

De novo design starts from scratch, using generative models to propose novel molecular structures not necessarily in any existing library. The model explores chemical space unconstrained by existing compound collections. The challenge is that generative models can easily produce chemically valid molecules that are synthetically inaccessible (too many steps to make) or have other practical problems.

In practice, the two are often combined: generate novel scaffolds with de novo design, screen them with virtual screening tools, filter with ADMET predictors, then pass the top hits to synthetic chemistry feasibility assessment before ordering synthesis.
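The combined workflow can be sketched as a simple triage function. All model interfaces here are hypothetical stand-ins (callables returning scores), and the cutoffs are illustrative; the synthesis score follows the common SAscore convention that higher means harder to make.

```python
def triage(candidates, activity, admet, synth_difficulty,
           min_activity=0.5, min_admet=0.5, max_synth=4.0):
    """Filter de novo candidates through activity, ADMET, and synthesizability
    gates, then rank survivors by predicted activity (best first)."""
    survivors = [
        smi for smi in candidates
        if activity(smi) >= min_activity
        and admet(smi) >= min_admet
        and synth_difficulty(smi) <= max_synth
    ]
    return sorted(survivors, key=activity, reverse=True)
```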

Q5: How would you validate a molecular property prediction model before using it in a real virtual screening campaign?

A rigorous validation protocol involves multiple steps:

External validation: evaluate on an external test set from a different source than the training data. For example, train on ChEMBL and test on internal proprietary assay data. This tests true generalization.

Scaffold split performance: as discussed, this is the minimum standard for prospective relevance. Report performance on both random and scaffold splits.

Temporal validation: if your data has timestamps (assay data collected over time), split by time: train on older data, test on newer data. This simulates prospective use.

Applicability domain: characterize the chemical space where the model is reliable. Compute nearest-neighbor Tanimoto similarity between test compounds and training set. Plot RMSE as a function of Tanimoto similarity to quantify how performance degrades with structural novelty.
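The similarity-binned error analysis can be sketched as below, assuming predictions, ground truth, and precomputed nearest-neighbor Tanimoto similarities are already in hand; the bin edges are illustrative.

```python
import math
from bisect import bisect_right

def rmse_by_similarity(preds, truths, nn_sims, edges=(0.3, 0.5, 0.7)):
    """RMSE per nearest-neighbor-similarity bucket:
    [0, 0.3), [0.3, 0.5), [0.5, 0.7), [0.7, 1.0].

    Rising RMSE in the low-similarity buckets quantifies how performance
    degrades on structurally novel compounds."""
    buckets = [[] for _ in range(len(edges) + 1)]
    for p, t, s in zip(preds, truths, nn_sims):
        buckets[bisect_right(edges, s)].append((p - t) ** 2)
    return [math.sqrt(sum(b) / len(b)) if b else None for b in buckets]
```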

Prospective validation: the gold standard. Run a virtual screen on a target, synthesize the top ML-ranked compounds, measure experimental activity, and compute enrichment factor (how many actives in the top N ML predictions vs random). This is expensive but the only truly honest validation.
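The enrichment factor itself is a short calculation. This sketch assumes binary activity labels and that a higher score means the model predicts the compound is more active.

```python
def enrichment_factor(scores, labels, top_frac=0.01):
    """EF = (hit rate in the top fraction of ML-ranked compounds)
            / (hit rate of the whole library, i.e. random selection)."""
    n = len(scores)
    top_n = max(1, int(n * top_frac))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits_in_top = sum(label for _, label in ranked[:top_n])
    overall_hit_rate = sum(labels) / n
    return (hits_in_top / top_n) / overall_hit_rate
```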


This lesson is part of the Applied AI - AI in Healthcare module. Next: Patient Outcome Prediction.

© 2026 EngineersOfAI. All rights reserved.