Therapeutic properties
Molchanica uses a neural network to infer therapeutic properties of arbitrary small organic molecules. The network is built with the Burn toolkit and trained on data from the Therapeutic Data Commons (TDC), so the TDC documentation provides details on each of these properties.
Viewing properties
To view the therapeutic properties of a small molecule, click the Show details button near the top of the sidebar. This shows both molecular characterization data and inferred therapeutic properties for the active molecule.
Screening
(todo)
Model description
Our model implements a hybrid Graph Neural Network (GNN) and Multi-Layer Perceptron (MLP) architecture designed for Quantitative Structure-Property Relationship (QSPR) modeling. The system is engineered to regress empirical data, e.g. from TDC. The model combines topological graph data from the molecule with computed properties that characterize the molecule as a whole.
The GNN branch processes the molecule's topology using a Graph Convolutional Network (GCN) approach. Atoms are treated as nodes initialized with one-hot encoded elemental identity and degree connectivity, while bonds define the edges. The network uses a symmetric normalized adjacency matrix to perform three hops of message passing (graph convolution), allowing the model to aggregate local chemical environments into node embeddings. This branch concludes with a global average pooling layer (masked mean) to generate a latent structural embedding of the molecule that does not depend on atom ordering (permutation-invariant).
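The message-passing steps described above can be sketched in plain Python. This is a minimal illustration, not Molchanica's Burn implementation: the self-loops, the ReLU activation, the absence of bias terms, the scalar degree feature, and the identity weight matrices are all assumptions made to keep the example small. The function names are hypothetical.

```python
import math

def normalized_adjacency(n_atoms, bonds):
    """Return D^-1/2 (A + I) D^-1/2 as a dense list-of-lists matrix."""
    a = [[0.0] * n_atoms for _ in range(n_atoms)]
    for i in range(n_atoms):
        a[i][i] = 1.0  # assumed self-loop so each atom keeps its own features
    for i, j in bonds:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n_atoms)]
        for i in range(n_atoms)
    ]

def matmul(x, y):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]

def gcn_layer(a_hat, h, w):
    """One hop of message passing: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]

def masked_mean(h, mask):
    """Average node embeddings over real (unpadded) atoms only."""
    n_real = sum(mask)
    return [
        sum(h[i][d] for i in range(len(h)) if mask[i]) / n_real
        for d in range(len(h[0]))
    ]

def embed(features, bonds, mask, weights):
    """Run one GCN layer per weight matrix, then pool into a single vector."""
    a_hat = normalized_adjacency(len(features), bonds)
    h = features
    for w in weights:
        h = gcn_layer(a_hat, h, w)
    return masked_mean(h, mask)

# Ethanol heavy atoms (C, C, O) plus one padding row.
# Assumed feature layout: [is_carbon, is_oxygen, degree].
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ethanol = [[1.0, 0.0, 1.0], [1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
bonds = [(0, 1), (1, 2)]
mask = [1, 1, 1, 0]
embedding = embed(ethanol, bonds, mask, [IDENTITY] * 3)  # three hops
```

Because the pooling step averages over atoms, listing the same atoms in a different order produces the same embedding, and the mask keeps padding rows from diluting the mean.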
The MLP branch encodes a vector of global molecular descriptors, including LogP, Topological Polar Surface Area (TPSA), molecular weight, and ring counts. These descriptors are pre-processed via log-transformation and Z-score normalization (StandardScaler). The latent representations from the GNN and MLP branches are concatenated, stabilized via layer normalization, and passed through a final dense head to predict the scalar biological property.
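The descriptor pre-processing and the fusion step can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the log step is taken to be log(1 + x) applied only to non-negative columns, the MLP encoder and trained weights are omitted, layer normalization has no learnable gain or bias here, and the descriptor rows are made-up values for demonstration. All function names are hypothetical.

```python
import math

def log_selected(row, log_cols):
    """Apply log(1 + x) to the chosen non-negative columns only."""
    return [math.log1p(v) if i in log_cols else v for i, v in enumerate(row)]

def fit_scaler(rows, log_cols):
    """Per-column mean and population standard deviation after the log step."""
    logged = [log_selected(r, log_cols) for r in rows]
    n = len(logged)
    means = [sum(col) / n for col in zip(*logged)]
    stds = []
    for col, m in zip(zip(*logged), means):
        var = sum((v - m) ** 2 for v in col) / n
        stds.append(math.sqrt(var) or 1.0)  # constant column: avoid dividing by zero
    return means, stds

def transform(row, means, stds, log_cols):
    """Log-transform, then Z-score one descriptor row with fitted statistics."""
    logged = log_selected(row, log_cols)
    return [(v - m) / s for v, m, s in zip(logged, means, stds)]

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (nearly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def fuse_and_predict(gnn_latent, mlp_latent, head_weights, head_bias):
    """Concatenate both latents, layer-normalize, apply a single dense output."""
    fused = layer_norm(gnn_latent + mlp_latent)  # list concatenation
    return sum(w * v for w, v in zip(head_weights, fused)) + head_bias

# Made-up descriptor rows: [LogP, molecular weight, TPSA, ring count].
rows = [
    [-0.3, 46.07, 20.2, 0.0],
    [2.1, 78.11, 0.0, 1.0],
    [1.2, 180.16, 63.6, 1.0],
]
LOG_COLS = {1, 2, 3}  # LogP can be negative, so it skips the log step
means, stds = fit_scaler(rows, LOG_COLS)
scaled = [transform(r, means, stds, LOG_COLS) for r in rows]
```

The scaler statistics are fitted on the training rows only and then reused unchanged at inference time, so a new molecule is scaled with the same means and standard deviations the model saw during training.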
Training is performed using a separate binary compiled from Molchanica's code base. We train on a GPU using Burn's WGPU or CUDA backends. Inference runs on the CPU using the NdArray backend, and takes <1ms per molecule, per property on typical PC hardware. (GPU inference takes longer for these properties: they are relatively simple to compute, so the run time is I/O limited.) This high speed makes screening molecules for specific therapeutic applications feasible.
Typical performance
(todo: Including metrics and how we analyzed using train/test splits)
We use the train/validation split method that TDC recommends for each dataset. In most cases, this is a scaffold split, which groups molecules by their core structure (scaffold) so that structurally similar molecules fall in the same subset. We perform this splitting using the TDC Python library's built-in functions, then feed the resulting split indices into our native training, inference, and validation pipeline.
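To illustrate the idea behind a scaffold split, here is a minimal sketch in plain Python. It is a generic grouping scheme, not TDC's exact algorithm, and it assumes the scaffold of each molecule has already been computed as a string (for example by a cheminformatics toolkit). The function name, split fractions, and example scaffold strings are hypothetical.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.7, frac_valid=0.1):
    """Assign whole scaffold groups to train, validation, or test index lists."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first, so big scaffold families land in the training set.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = round(frac_train * n)
    valid_cap = round(frac_valid * n)

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Ten molecules described only by their (precomputed) scaffold strings.
scaffolds = (
    ["c1ccccc1"] * 4
    + ["c1ccncc1"] * 3
    + ["c1ccc2[nH]ccc2c1", "c1ccoc1", ""]  # "" stands for an acyclic molecule
)
train_idx, valid_idx, test_idx = scaffold_split(scaffolds)
```

Because a scaffold group is never divided, the test set contains core structures the model has not seen during training, which gives a stricter estimate of generalization than a random split.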
ADME
Absorption
Cell permeability (Caco-2)
Predicts the effective permeability of the molecule using the Caco-2 cell line (human colon epithelial cancer cells) as a proxy. This metric approximates the rate at which a drug permeates through human intestinal tissue, a critical factor for oral delivery.
PAMPA permeability
Predicts the outcome of the Parallel Artificial Membrane Permeability Assay (PAMPA). This is a non-cell-based high-throughput assay that evaluates passive diffusion across an artificial membrane, identifying compounds with high permeability (1) versus low-to-moderate permeability (0).
Intestinal absorption (HIA)
Predicts Human Intestinal Absorption (HIA). This binary classification determines if an orally administered drug can be successfully absorbed from the gastrointestinal system into the bloodstream.
P-glycoprotein inhibition
Predicts the likelihood of the molecule inhibiting P-glycoprotein (Pgp/ABCB1). Pgp is an ABC transporter involved in intestinal absorption, brain penetration, and drug metabolism. Inhibitors can alter the bioavailability of other drugs (drug-drug interactions) or overcome multidrug resistance.
Bioavailability (Oral)
Predicts the rate and extent to which the active ingredient is absorbed from a drug product and becomes available at the site of action. This is a binary classification of oral bioavailability.
Lipophilicity
Predicts the ability of the drug to dissolve in a lipid (fat/oil) environment. High lipophilicity often correlates with high metabolic turnover, poor solubility, and low absorption.
Solubility (AqSolDB)
Predicts aqueous solubility, measuring the drug's ability to dissolve in water. Poor water solubility can lead to slow absorption, inadequate bioavailability, and toxicity.
Hydration Free Energy (FreeSolv)
Regresses the hydration free energy of the molecule in water. This is derived from the Free Solvation Database (FreeSolv), combining experimental data with alchemical free energy calculations from molecular dynamics.
Distribution
Blood brain barrier permeability (BBB)
Predicts the ability of the molecule to penetrate the Blood-Brain Barrier (BBB), the protective membrane separating circulating blood from brain extracellular fluid. This is a binary classification crucial for CNS-targeting drugs (which must cross) and non-CNS drugs (which should generally avoid crossing).
Plasma protein binding rate (PPBR)
Predicts the percentage of the drug that binds to plasma proteins in the blood. A high binding rate decreases the efficiency of diffusion to the site of action.
Volume of distribution (VDss)
Predicts the Volume of Distribution at steady state (VDss). This measures the degree of a drug's concentration in body tissue compared to its concentration in blood. A higher VDss indicates greater distribution into tissue, often associated with high lipid solubility.
Metabolism
CYP P450 Inhibition
The Cytochrome P450 (CYP) gene family is essential for the metabolic breakdown of drugs in the liver. Inhibition of these enzymes leads to poor metabolism and adverse drug-drug interactions. Molchanica predicts binary inhibition for the following major isoforms:
- CYP1A2: Metabolizes polycyclic aromatic hydrocarbons and substrates like caffeine.
- CYP2C9: Major role in oxidation of xenobiotic and endogenous compounds.
- CYP2C19: Involved in protein processing and transport.
- CYP2D6: Highly expressed in the liver and CNS (substantia nigra).
- CYP3A4: Found in the liver and intestine; oxidizes toxins and small foreign organic molecules.
Toxicity
Acute Toxicity (LD50)
Regresses the acute toxicity LD50, defined as the most conservative dose that leads to lethal adverse effects. A higher dose indicates lower lethality.
hERG Blockers (Cardiotoxicity)
Predicts the blockade of the human ether-à-go-go related gene (hERG) potassium channel. hERG is crucial for coordinating the heart's beating; blocking it can lead to severe arrhythmias (e.g., Torsades de Pointes). Molchanica incorporates data from multiple datasets (hERG Central, Karim et al.) to predict inhibition probability and IC50 concentration thresholds.
Ames Mutagenicity
Predicts mutagenicity based on the Ames test (bacterial reverse mutation assay). A positive result (1) indicates the compound can induce genetic damage and frameshift mutations, posing a risk of cell death or cancer.
DILI (Drug Induced Liver Injury)
Predicts the potential for Drug-Induced Liver Injury (DILI). DILI is a potentially fatal liver disease and a frequent cause of safety-related drug withdrawals and clinical trial failures.
Skin Reaction
Predicts the potential for skin sensitization. This binary classifier identifies if repetitive exposure to the chemical agent can induce an immune reaction leading to allergic contact dermatitis.
Carcinogens
Predicts whether the substance promotes carcinogenesis (the formation of cancer) via genome damage or disruption of cellular metabolic processes.
ClinTox
Predicts the likelihood of clinical trial failure due to toxicity. This dataset differentiates between drugs that failed clinical trials for toxicity reasons and those associated with successful trials.
Molecule parameters used to train the model
- The molecule's bond graph and elements
- Atom count
- Bond count
- Molecular weight
- Number of heavy atoms
- Number of hydrogen bond acceptors
- Number of hydrogen bond donors
- Number of heteroatoms (atoms that are neither carbon nor hydrogen)
- Number of halogen atoms
- Number of rotatable bonds
- Number of Amine groups
- Number of Amide groups
- Number of Carbonyl groups
- Number of Carboxyl groups
- Number of valence electrons
- Number of aromatic rings
- Number of saturated rings
- Number of aliphatic rings
- LogP
- Molar refractivity
- Polar surface area, calculated using topology (TPSA)
- Molecule volume, calculated using topology
- Wiener index
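As an example of a descriptor derived purely from the bond graph, the Wiener index is the sum of shortest-path distances (counted in bonds) over all pairs of atoms. The sketch below computes it with a breadth-first search from each atom. It assumes a connected, hydrogen-suppressed graph, which is the usual convention; whether Molchanica includes hydrogens is not stated here. The function name is hypothetical.

```python
from collections import deque

def wiener_index(n_atoms, bonds):
    """Sum of shortest-path bond counts over all unordered pairs of atoms."""
    neighbors = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    total = 0
    for start in range(n_atoms):
        # Breadth-first search gives shortest distances in an unweighted graph.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nxt in neighbors[atom]:
                if nxt not in dist:
                    dist[nxt] = dist[atom] + 1
                    queue.append(nxt)
        total += sum(dist.values())
    return total // 2  # each pair was counted once from each end

# n-Butane carbon chain: C0-C1-C2-C3
butane = wiener_index(4, [(0, 1), (1, 2), (2, 3)])

# Six-membered ring (the carbon skeleton of benzene or cyclohexane)
ring = wiener_index(6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])
```

For the butane chain the pair distances are 1, 2, 3, 1, 2, and 1, giving 10; for the six-membered ring each atom sees distances 1, 1, 2, 2, and 3, giving 6 × 9 / 2 = 27. Compact, branched molecules yield smaller values than extended chains with the same atom count.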
Molecular dynamics
(WIP: Ab initio solubility determination using molecular dynamics)