# Ignota DSG Team Notes - Day 1
Day 2 notes: https://hackmd.io/0qNBkU0HRXKf3Jrgsl7HiQ
## Outline
(thanks Pareena for making the outline!)
1) Introductions – come up with team name for breaking the ice
2) Admin
a) Agreeing on a way to meet every day
b) Talking about Git, make sure everyone knows how to use it and encourage to push and pull often
c) Talk about writing up, making sure everyone knows how to use overleaf
d) Maybe make draft slides simultaneously while writing up.
e) A group review at the end of the day/beginning of the next day about report and progress
3) Group discussion about challenge
a) what people think would work well with the problem
b) and what they are keen to work on
4) Agreeing on a broad approach
a) Split main problem into smaller chunks
b) Talking about the common code we need (way to load and cross fold the data, evaluation of models)
5) Split team into small groeups to work on different things
## Organisational things
- Groups structure thoughts
- Compound representation
- SMILES -> Descriptors
- SMILES -> Fingerprints
# Notes
- Meeting for standup at 11am
- Remember to use python 3.10
- https://github.com/alan-turing-institute/DSGDec2023Ig
- [DeepChem Splitters](https://deepchem.readthedocs.io/en/latest/api_reference/splitters.html)
- [OPIG Out-of-distribution generalisation](https://www.blopig.com/blog/2021/06/out-of-distribution-generalisation-and-scaffold-splitting-in-molecular-property-prediction/)
## Group 1 Notes
- [Review of Physicochemical Descriptors](https://www.sciencedirect.com/science/article/pii/S0928098720300634)
- scaffolding (structurally similar molecules & data leakage)
- scaffolding across the three datasets, skikit-learn k-fold
- deepchem for scaffolding
- three separate models?
- sharing scaffolding results & models for comparison between groups 1 & 2
Organisation:
- starting to generate PyTorch pipeline
- Literature searching
- Beginning scaffolding pipeline
## Group 2 Notes
- Random forest on the molecular fingerprint as a baseline
- Build loader for k-fold cross validation:
- First ascertain imbalance of data set
- Stratified split for each fold: group split where scaffolds are the groups
- Investigate potential for scaffold structuring
- DeepChem has library for separating SMILES into scaffolds
- Variety of metrics: MCC, F1 score, AUC, Accuracy
- Exploration of data sets: bullet point summary
- Similarly for existing literature: bullet point summary
Data:
- Load as NumPy array for larger datasets
- Cav: 553 entries, _ (50%) toxic, _ (50%) non-toxic
- Herg: 1845 entries, 200 (10%) toxic, 1800 (90%) non-toxic
- Note that descriptor data set has 1845 entries, fingerprint data has 1835
- Nav: _ entries, _ (39%) toxic, _ (61%) non-toxic
- Data found by running scripts/show-class-inbalance.py
- /cav/ignota_turing_dsg_cav_ecfp.csv positives: 278 (0.50%), negatives: 276 (0.50%)
- /cav/ignota_turing_dsg_cav_mordred.csv positives: 286 (0.51%), negatives: 276 (0.49%)
- /herg/ignota_turing_dsg_herg_ecfp.csv positives: 9124 (0.10%), negatives: 85180 (0.90%)
- /herg/ignota_turing_dsg_herg_mordred.csv positives: 9148 (0.10%), negatives: 85279 (0.90%)
- /nav/ignota_turing_dsg_nav_ecfp.csv positives: 712 (0.39%), negatives: 1124 (0.61%)
- /nav/ignota_turing_dsg_nav_mordred.csv positives: 718 (0.39%), negatives: 1128 (0.61%)
- Note that different representations have slightly different counts as some molecules failed to be created and are dropped
--- Post Lunch
## Framework
Using Pytorch Lightning to setup up pipeline for loading and training data - working on lightning_skeleton branch
## Scaffolding
- Use DeepChem ScaffoldSplitter class and / or existing snippet to separate data into scaffolds
- Investigate number of scaffolds, size of each scaffold, then see similarity / overlap between scaffolds by UMAP / clustering
- Use this to determine:
- how train / val / test split should be done
- how many buckets we should use, of what size, etc.
Nav:
- 909 scaffolds, largest of size 99
- 5-fold cross validation, umap plots for each fold
- are there disctinct clusters between training and test sets? Or are there overlaps between train-test in chemical space?
POA:
- fingerprint splitter based on tanimoto similarity
- visualise clustering between train and test on UMAP, PCA, & pairwise tanimoto similarity
- scaffold splitter
- visualise clusering between train and test on UMAP, PCA, & pairwise tanimoto similarity
- visualise top n scaffolds in our train and test
- maximise distance between tanimoto similarity between scaffolds?
Implemented `scaffolding.py` file with `DataSplitter` class. Splits dataset into 5x train-test splits for 5-fold CV.
Split data -> train/test & hoholdout
Rob:
Implemented 'loader.py' that allows loading of 5 fold cross fold based on the 'scaffolding.py' into numpy arrays and into a pytorch Dataset
## Exploratory Data Analysis
focused on descriptors rather than fingerprints
- correlations between mordred descriptors and toxicity label
- notes on mordred descriptors
- [list of descriptors](https://mordred-descriptor.github.io/documentation/master/descriptors.html)
- [list of constructors and number of descriptors](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0258-y/tables/3)
- 2d: 43 constructors
- 3d: 5 constructors
- 19 descriptors were found to be significant (at 0.01 level) with label
- all found significant descriptors are 2D descriptors, constructed by:
- ATS (autocorrelation of topological structure)
- ATSC (centered ATS)
- MATS (Moran coefficient descriptor)
- GATS (Gery coeffcient descriptor)
- AtomTypeEState (atom type e-state)
- number of sCH3
- number of ssNH
- number of sOH
- sum of sCH3
- sum of ssNH
- sum of sOH
- ETA delta epsilon (type A)
- HBondDonor (hydrogen bond donor descriptor)
- RingCount (ring count)
- 6-membered aromatic hetero ring count
- SLogP (Wildman-Crippen LogP)
- [some useful figures to illustrate the structures](https://www.sciencedirect.com/science/article/pii/S0223523420302592)
- if time allows: random forest of descriptors
- comparison of different representation
- [this work](https://pubmed.ncbi.nlm.nih.gov/31138104/) used Dragon (discontinued) for physiochemical descriptors (2432 nonconstant molecular descriptors) of hERG
- The neural network model achieved an area under the receiver operating characteristic curve (AUC) of 0.764, with an accuracy of 90.1%, a Matthews correlation coefficient (MCC) of 0.368, a sensitivity of 0.321, and a specificity of 0.967, when ten-fold cross-validation was performed.
- its prediction webtool is not available any longer
## Literature Research
### [Images of chemical structures as molecular representations for deep learning](https://link.springer.com/article/10.1557/s43578-022-00628-9)
- 2D images of chemical structures
Advantages:
- allows the user to leverage data augmentation- increases the predictive capabilities of the model by expanding the size and diversity of the datasets.
- does not require specialist chemometrics understanding.
#### Methodology
- Input styles used:
- Spectrophore
- PubChem Fingerprint
- Extended Connectivity Fingerprint (ECFP)
- MACCS Key Fingerprint
- Mol2Vec
- RDKit Descriptors
- Mordred Chemical Descriptors (https://github.com/mordred-descriptor/mordred)
Rest were generated using - https://github.com/deepchem/deepchem
- Data Augmentation:
- Solubility data augmentation - Rotations were applied (6 × 30°) to each training example first, and then the resulting images were subjected to three reflections. These reflections occurred along the horizontal axis, along the vertical axis and along both axes together.
- Co-crystal data augmentation - Components were additionally stacked in both possible positions, namely component 1 above component 2 and vice versa. Following this, all of the augmentations outlined for the solubility dataset were carried out.

- Model implementation - ResNet architecture was used to evaluate the performance of images as inputs.
- Evaluation & Metrics - each dataset was split into training and validation subsets using tenfold cross-validation. The metrics used to assess the performance were accuracy (ACC) and Area Under the Receiver Operating Characteristic Curve (ROC) for the classification task. In the regression task, R2, mean-squared error (MSE), root-mean-squared error (RMSE), and mean absolute error (MAE) were recorded. These metrics were chosen as they are all commonly used in the literature for evaluating AI models.
### Molecular representations in AI-driven drug discovery: a review and practical guide
Graph representation
- distinction between the notations and file formats built using molecular graphs (e.g. SMILES strings, Molfiles), and the abstract mathematical structure/data structure of a graph itself
- general idea: mapping the atoms and bonds that make up a molecule into sets of nodes and edges
- Graphs are formally 2D data structures with no spatial relationships between elements (only pair-wise realtionships)-> 3D info must be encoded into a graph representation
- nodes: chirality, physiochemical properties, etc
- edges: bond lengths, bond types, etc.
- all subgraphs are interpretable whereas all SMILES Substrings are not
- matrix representations of graphs are node order dependent
- graph traversal algorithm determines node order (depth-first or breadth-first search)
- Limitations of molecular graph representations
- multi-valent bonds (have been introduced by hypergraphs, but not commonly used)
- not very compact
### SMILES-BERT
- https://dl.acm.org/doi/10.1145/3307339.3342186
### ChemBERT
- pre-trained on ChEMBL database
- paper: https://spj.science.org/doi/10.34133/research.0004
-
## Fingerprints (Sarveshwari)
Four classes of 2D fingerprint types can be distinguished: (i) dictionary-based, (ii) topological or path-based, (iii) circular fingerprints and (iv) pharmacophores.
### Evaluating fingerprints
Just as there is wide variety of fingerprinting algorithms, there are multiple methods for evaluating Virtual Screening performance and little consensus as to which is best.
The area under the receiver operating characteristic (**ROC**) curve (**AUC**)/enrichment factor (**EF**) at a given fraction χ of the data set
* Advantage: The advantage of AUC is that it is bounded, running from 0 to 1 with 0.5 corresponding to randomness, and that it is independent of the ratio of actives to inactives and other external parameters.
* Disadvantage: The AUC method has been critized as being inappropriate for comparing VS methods as it is not sufficiently sensitive to early recognition. The EF explicitly measures early recognition but it is dependent on the ratio of actives to inactives and the choice of χ.
To try and overcome these limitations numerous other evaluation methods, such as robust initial enhancement (**RIE**) and Boltzmann-enhanced discrimination of ROC (**BEDROC**), have been proposed.
* Advantage: The RIE uses a continuously decreasing exponential weight as a function of rank and is thus sensitive to early recognition.
* Disadvantage: It is dependent both on an adjustable parameter, the exponential weight, and the ratio of actives to inactives. RIE values can therefore not be easily compared between different data sets. The BEDROC is constructed by, in essence, forcing the RIE to be bounded by 0 and 1, avoiding the dependence on the active/inactive ratio.
| Name | Type |
|---------- | ----------|
| ECFP4 | circular fingerprints with bit string | | |
| long ECFP4 | circular fingerprints with bit string | |
| ECFP6 | circular fingerprints with bit string | | |
| long ECFP6 | circular fingerprints with bit string | | |
| FCFP4 | circular fingerprints with bit string | | |
| FCFC4 | circular fingerprints with count vector | | |
| ECFC4 | circular fingerprints with count vector | | |
| MACCS | dictionary-based fingerprint | | |
| RDK5 | Path-based fingerprint | | |
| long Avalon | Path-based fingerprint | | |
| Avalon | Path-based fingerprint | | |
| TT | Path-based fingerprint | | |
| AP | Path-based fingerprint | | |
| ECFC0 | Text | Text | |
#### ECFPs
ECFP belongs to the best performing fingerprints in small molecule virtual screening and target prediction benchmarks, together with the related MinHashed fingerprint MHFP6.
ECFPs are circular fingerprints with a number of useful qualities: they can be very **rapidly calculated**; they are **not predefined** and can represent an essentially infinite number of different molecular features (including stereochemical information); their **features represent the presence of particular substructures**, allowing easier interpretation of analysis results; and the ECFP algorithm can be **tailored to generate different types of circular fingerprints**, optimized for different uses. *Being already formatted as a vector with numerical entries, they are well suited to be used as features in ML models for chemistry.*
**Comparison to other fingerprint methods**
ECFPs have several advantages when contrasted to ***predefined substructural keys***.
1. Since the set of keys is fixed, the system may not contain keys appropriate to the novel structural variation in a given compound library.
2. The keys are designed for substructure search, which limits their utility for activity modeling and categorization (see the Hydrogen-Filled Substructural Features Section for this discussion).
3. Even the most intelligently selected set of keys18 is limited in coverage, making it unlikely that more detailed and specific substructures particular to some unusual activity class would be represented.
Another commonly used class of fingerprints is available through Daylight Chemical Information Systems. It uses features based upon the presence of paths of varying lengths containing specific atom types. This generates a sparse binary vector, which is commonly “folded” to a bitset of specified size (e.g., 1024 bits) to reduce its size for ease of manipulation.
ECFPs have several advantages when contrasted to such ***pathbased fingerprints***.
1. Daylight keys were designed for substructure and similarity search, which limits their effectiveness for activity modeling. If some path feature is found to be useful during an analysis, then interpretation can still be difficult, as a given path may not correspond to an easily identifiable function group or substructural entity.
2. Further complicated but often relevant chemical attachment patterns, such as quaternary centers or cyclic patterns, cannot be directly represented by a path-based description. Path-based schemes also cannot capture atom-based stereochemistry.
**Disadvantages of ECFPs (ECFP4) and MHFP6:**
Both have a poor perception of the global features of molecules such as size and shape. They also fail at perceiving structural differences that may be important in larger molecules, such as distinguishing between regioisomers in extended ring systems (e.g. 2,7- versus 2,8-dichlorodioxin), between linkers of different lengths, or between scrambled peptide sequences of identical composition and length.
### Contemporary fingerprints
#### MAP4 (MinHashed Atom-Pair fingerprint up to four bonds)
The ECFP4 and MHFP6 limitations can be addressed by using atom-pair fingerprints, which encode molecular shape and are often used for scaffold-hopping. However, they do not encode molecular structure in detail and perform poorly in small molecule benchmarking studies compared to substructure fingerprints such as ECFP4 and MHFP6.
MAP4 encodes atom pairs and their bond distance similarly to the **AP fingerprint implemented by RDKit**, however in MAP4 atom characteristics are replaced by the circular substructure around each atom of the pair, written in **SMILES format**. MAP4 uses the same MinHashing technique as **MHFP6**, a principle borrowed from natural language processing which enables fast similarity searches in very large databases by locality sensitive hashing (LSH). LSH is a technique that allows the creation of self-tuning indexes, which are then used to generate a forest of trees that can be traversed for an approximate but fast similarity search.
---
## Alternative representations (Sarveshwari)
* **Coulomb Matrix**
Ab-initio electronic structure calculations typically require a set of nuclear charges {Z} and the corresponding Cartesian coordinates {R} as input. The Coulomb Matrix (CM) M, proposed by Rupp et al., encodes this information by use of the atomic self-energies and internuclear Coulomb repulsion operator.
In the construction of coulomb matrix, we first use the nuclear charges and distance matrix generated by RDKit to acquire the original coulomb matrix, then an optional random atom index sorting and binary expansion transformation can be applied during training in order to achieve atom index invariance, as reported by Montavon et al.
*Example code for converting from SMILES to Coulomb Matrices **using RDKit in Python**:*
```
import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.rdMolDescriptors import CalcCoulombMat
smiles = 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C'
caffeine = Chem.MolFromSmiles(smiles, sanitize=True)
caffeine = Chem.AddHs(caffeine)
AllChem.EmbedMolecule(caffeine)
AllChem.UFFOptimizeMolecule(caffeine)
CM = CalcCoulombMat(caffeine)
list(CM[0])
```
*The implementation of coulomb matrix descriptors by Matthias Rupp et. al. **using the ChemML library***
```
from chemml.chem import CoulombMatrix, Molecule
m1 = Molecule('c1ccc1', 'smiles')
m2 = Molecule('CNC', 'smiles')
m3 = Molecule('CC', 'smiles')
m4 = Molecule('CCC', 'smiles')
molecules = [m1, m2, m3, m4]
for mol in molecules: mol.to_xyz(optimizer='UFF')
cm = CoulombMatrix(cm_type='SC', n_jobs=-1)
features = cm.represent(molecules)
```
In the above, cm_type is Coulomb matrix type, which can be one of the five below -
* ‘Unsorted_Matrix’ or ‘UM’
* ‘Unsorted_Triangular’ or ‘UT’
* ‘Eigenspectrum’ or ‘E’
* ‘Sorted_Coulomb’ or ‘SC’ (default)
* ‘Random_Coulomb’ or ‘RC’
*Generate Coulomb matrices **from scratch using RDKit**:*
```
"""
Generate coulomb matrices for molecules.
See Montavon et al., _New Journal of Physics_ __15__ (2013) 095003.
"""
import numpy as np
from typing import Any, List, Optional
from deepchem.utils.typing import RDKitMol
from deepchem.utils.data_utils import pad_array
from deepchem.feat.base_classes import MolecularFeaturizer
class CoulombMatrix(MolecularFeaturizer):
"""Calculate Coulomb matrices for molecules.
Coulomb matrices provide a representation of the electronic structure of
a molecule. For a molecule with `N` atoms, the Coulomb matrix is a
`N X N` matrix where each element gives the strength of the
electrostatic interaction between two atoms. The method is described
in more detail in [1]_.
Examples
--------
>>> import deepchem as dc
>>> featurizers = dc.feat.CoulombMatrix(max_atoms=23)
>>> input_file = 'deepchem/feat/tests/data/water.sdf' # really backed by water.sdf.csv
>>> tasks = ["atomization_energy"]
>>> loader = dc.data.SDFLoader(tasks, featurizer=featurizers)
>>> dataset = loader.create_dataset(input_file)
References
----------
.. [1] Montavon, Grégoire, et al. "Learning invariant representations of
molecules for atomization energy prediction." Advances in neural information
processing systems. 2012.
Note
----
This class requires RDKit to be installed.
"""
def __init__(self,
max_atoms: int,
remove_hydrogens: bool = False,
randomize: bool = False,
upper_tri: bool = False,
n_samples: int = 1,
seed: Optional[int] = None):
"""Initialize this featurizer.
Parameters
----------
max_atoms: int
The maximum number of atoms expected for molecules this featurizer will
process.
remove_hydrogens: bool, optional (default False)
If True, remove hydrogens before processing them.
randomize: bool, optional (default False)
If True, use method `randomize_coulomb_matrices` to randomize Coulomb matrices.
upper_tri: bool, optional (default False)
Generate only upper triangle part of Coulomb matrices.
n_samples: int, optional (default 1)
If `randomize` is set to True, the number of random samples to draw.
seed: int, optional (default None)
Random seed to use.
"""
self.max_atoms = int(max_atoms)
self.remove_hydrogens = remove_hydrogens
self.randomize = randomize
self.upper_tri = upper_tri
self.n_samples = n_samples
if seed is not None:
seed = int(seed)
self.seed = seed
def _featurize(self, datapoint: RDKitMol, **kwargs) -> np.ndarray:
"""
Calculate Coulomb matrices for molecules. If extra randomized
matrices are generated, they are treated as if they are features
for additional conformers.
Since Coulomb matrices are symmetric, only the (flattened) upper
triangular portion is returned.
Parameters
----------
datapoint: rdkit.Chem.rdchem.Mol
RDKit Mol object
Returns
-------
np.ndarray
The coulomb matrices of the given molecule.
The default shape is `(num_confs, max_atoms, max_atoms)`.
If num_confs == 1, the shape is `(max_atoms, max_atoms)`.
"""
if 'mol' in kwargs:
datapoint = kwargs.get("mol")
raise DeprecationWarning(
'Mol is being phased out as a parameter, please pass "datapoint" instead.'
)
features = self.coulomb_matrix(datapoint)
if self.upper_tri:
features = np.asarray(
[f[np.triu_indices_from(f)] for f in features])
if features.shape[0] == 1:
# `(1, max_atoms, max_atoms)` -> `(max_atoms, max_atoms)`
features = np.squeeze(features, axis=0)
return features
def coulomb_matrix(self, mol: RDKitMol) -> np.ndarray:
"""
Generate Coulomb matrices for each conformer of the given molecule.
Parameters
----------
mol: rdkit.Chem.rdchem.Mol
RDKit Mol object
Returns
-------
np.ndarray
The coulomb matrices of the given molecule
"""
try:
from rdkit import Chem
from rdkit.Chem import AllChem
except ModuleNotFoundError:
raise ImportError("This class requires RDKit to be installed.")
# Check whether num_confs >=1 or not
num_confs = len(mol.GetConformers())
if num_confs == 0:
mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, AllChem.ETKDG())
if self.remove_hydrogens:
mol = Chem.RemoveHs(mol)
n_atoms = mol.GetNumAtoms()
z = [atom.GetAtomicNum() for atom in mol.GetAtoms()]
rval = []
for conf in mol.GetConformers():
d = self.get_interatomic_distances(conf)
m = np.outer(z, z) / d
m[range(n_atoms), range(n_atoms)] = 0.5 * np.array(z)**2.4
if self.randomize:
for random_m in self.randomize_coulomb_matrix(m):
random_m = pad_array(random_m, self.max_atoms)
rval.append(random_m)
else:
m = pad_array(m, self.max_atoms)
rval.append(m)
return np.asarray(rval)
def randomize_coulomb_matrix(self, m: np.ndarray) -> List[np.ndarray]:
"""Randomize a Coulomb matrix as decribed in [1]_:
1. Compute row norms for M in a vector row_norms.
2. Sample a zero-mean unit-variance noise vector e with dimension
equal to row_norms.
3. Permute the rows and columns of M with the permutation that
sorts row_norms + e.
Parameters
----------
m: np.ndarray
Coulomb matrix.
Returns
-------
List[np.ndarray]
List of the random coulomb matrix
References
----------
.. [1] Montavon et al., New Journal of Physics, 15, (2013), 095003
"""
rval = []
row_norms = np.asarray([np.linalg.norm(row) for row in m], dtype=float)
rng = np.random.RandomState(self.seed)
for i in range(self.n_samples):
e = rng.normal(size=row_norms.size)
p = np.argsort(row_norms + e)
new = m[p][:, p] # permute rows first, then columns
rval.append(new)
return rval
@staticmethod
def get_interatomic_distances(conf: Any) -> np.ndarray:
"""
Get interatomic distances for atoms in a molecular conformer.
Parameters
----------
conf: rdkit.Chem.rdchem.Conformer
Molecule conformer.
Returns
-------
np.ndarray
The distances matrix for all atoms in a molecule
"""
n_atoms = conf.GetNumAtoms()
coords = [
# Convert AtomPositions from Angstrom to bohr (atomic units)
conf.GetAtomPosition(i).__idiv__(0.52917721092)
for i in range(n_atoms)
]
d = np.zeros((n_atoms, n_atoms), dtype=float)
for i in range(n_atoms):
for j in range(i):
d[i, j] = coords[i].Distance(coords[j])
d[j, i] = d[i, j]
return d
class CoulombMatrixEig(CoulombMatrix):
"""Calculate the eigenvalues of Coulomb matrices for molecules.
This featurizer computes the eigenvalues of the Coulomb matrices for provided
molecules. Coulomb matrices are described in [1]_.
Examples
--------
>>> import deepchem as dc
>>> featurizers = dc.feat.CoulombMatrixEig(max_atoms=23)
>>> input_file = 'deepchem/feat/tests/data/water.sdf' # really backed by water.sdf.csv
>>> tasks = ["atomization_energy"]
>>> loader = dc.data.SDFLoader(tasks, featurizer=featurizers)
>>> dataset = loader.create_dataset(input_file)
References
----------
.. [1] Montavon, Grégoire, et al. "Learning invariant representations of
molecules for atomization energy prediction." Advances in neural information
processing systems. 2012.
"""
def __init__(self,
max_atoms: int,
remove_hydrogens: bool = False,
randomize: bool = False,
n_samples: int = 1,
seed: Optional[int] = None):
"""Initialize this featurizer.
Parameters
----------
max_atoms: int
The maximum number of atoms expected for molecules this featurizer will
process.
remove_hydrogens: bool, optional (default False)
If True, remove hydrogens before processing them.
randomize: bool, optional (default False)
If True, use method `randomize_coulomb_matrices` to randomize Coulomb matrices.
n_samples: int, optional (default 1)
If `randomize` is set to True, the number of random samples to draw.
seed: int, optional (default None)
Random seed to use.
"""
self.max_atoms = int(max_atoms)
self.remove_hydrogens = remove_hydrogens
self.randomize = randomize
self.n_samples = n_samples
if seed is not None:
seed = int(seed)
self.seed = seed
def _featurize(self, datapoint: RDKitMol, **kwargs) -> np.ndarray:
"""
Calculate eigenvalues of Coulomb matrix for molecules. Eigenvalues
are returned sorted by absolute value in descending order and padded
by max_atoms.
Parameters
----------
datapoint: rdkit.Chem.rdchem.Mol
RDKit Mol object
Returns
-------
np.ndarray
The eigenvalues of Coulomb matrix for molecules.
The default shape is `(num_confs, max_atoms)`.
If num_confs == 1, the shape is `(max_atoms,)`.
"""
if 'mol' in kwargs:
datapoint = kwargs.get("mol")
raise DeprecationWarning(
'Mol is being phased out as a parameter, please pass "datapoint" instead.'
)
cmat = self.coulomb_matrix(datapoint)
features_list = []
for f in cmat:
w, v = np.linalg.eig(f)
w_abs = np.abs(w)
sortidx = np.argsort(w_abs)
sortidx = sortidx[::-1]
w = w[sortidx]
f = pad_array(w, self.max_atoms)
features_list.append(f)
features = np.asarray(features_list)
if features.shape[0] == 1:
# `(1, max_atoms)` -> `(max_atoms,)`
features = np.squeeze(features, axis=0)
return features
```
* **Grid Featurizer**
The grid featurizer was inspired by the NNscore featurizer and SPLIF69 but optimized for speed, robustness, and generalizability. The intermolecular interactions enumerated by the featurizer include salt bridges and hydrogen bonding between protein and ligand, intraligand circular fingerprints, intra-protein circular fingerprints, and protein-ligand SPLIF fingerprints.
RdkitGridFeaturizer featurizes protein-ligand complex using flat features or a 3D grid (in which each voxel is described with a vector of features).
```
```
* **Symmetry Function**
Symmetry function, first introduced by Behler and Parrinello, is another common encoding of atomic coordinates information. It focuses on preserving the rotational and permutation symmetry of the system. The local environment of an atom in the molecule is expressed as a series of radial and angular symmetry functions with different distance and angle cutoffs, the former focusing on distances between atom pairs and the latter focusing on angles formed within triplets of atoms.
In the paper titled "ANI-1: An extensible neural network potential with DFT accuracy at force
field computational cost", the authors demonstrate how a deep neural network (NN) trained on quantum mechanical (QM) DFT calculations can learn an accurate and fully transferable potential for organic molecules. Authors introduce ANAKIN-ME (Accurate NeurAl networK engINe for Molecular Energies) or ANI in short, which can be implemented using the DeepChem package in Python using the following command:
```
deepchem.models.ANIRegression
```
Dataset preparation from SMILES to symmetry function in Python:
```
import numpy as np
from deepchem.utils.typing import RDKitMol
from deepchem.utils.data_utils import pad_array
from deepchem.feat.base_classes import MolecularFeaturizer
from deepchem.feat.molecule_featurizers.atomic_coordinates import AtomicCoordinates
class BPSymmetryFunctionInput(MolecularFeaturizer):
"""Calculate symmetry function for each atom in the molecules
This method is described in [1]_.
Examples
--------
>>> import deepchem as dc
>>> smiles = ['C1C=CC=CC=1']
>>> featurizer = dc.feat.BPSymmetryFunctionInput(max_atoms=10)
>>> features = featurizer.featurize(smiles)
>>> type(features[0])
<class 'numpy.ndarray'>
>>> features[0].shape # (max_atoms, 4)
(10, 4)
References
----------
.. [1] Behler, Jörg, and Michele Parrinello. "Generalized neural-network
representation of high-dimensional potential-energy surfaces." Physical
review letters 98.14 (2007): 146401.
Note
----
This class requires RDKit to be installed.
"""
def __init__(self, max_atoms: int):
"""Initialize this featurizer.
Parameters
----------
max_atoms: int
The maximum number of atoms expected for molecules this featurizer will
process.
"""
self.max_atoms = max_atoms
self.coordfeat = AtomicCoordinates(use_bohr=True)
def _featurize(self, datapoint: RDKitMol, **kwargs) -> np.ndarray:
"""Calculate symmetry function.
Parameters
----------
datapoint: rdkit.Chem.rdchem.Mol
RDKit Mol object
Returns
-------
np.ndarray
A numpy array of symmetry function. The shape is `(max_atoms, 4)`.
"""
if 'mol' in kwargs:
datapoint = kwargs.get("mol")
raise DeprecationWarning(
'Mol is being phased out as a parameter, please pass "datapoint" instead.'
)
coordinates = self.coordfeat._featurize(datapoint)
atom_numbers = np.array(
[atom.GetAtomicNum() for atom in datapoint.GetAtoms()])
atom_numbers = np.expand_dims(atom_numbers, axis=1)
assert atom_numbers.shape[0] == coordinates.shape[0]
features = np.concatenate([atom_numbers, coordinates], axis=1)
return pad_array(features, (self.max_atoms, 4))
```
* **Weave**
Weave featurization encodes both local chemical environment and connectivity of atoms in a molecule. Atomic feature vectors are exactly the same, while connectivity is represented by more detailed pair features instead of neighbor listing. The weave featurization calculates a feature vector for each pair of atoms in the molecule, including bond properties (if directly connected), graph distance and ring info, forming a feature matrix. The method supports graph based models that utilize properties of both nodes (atoms) and edges (bonds).
Weave convolutions were introduced in Kearnes, Steven, et al. Unlike Duvenaud graph convolutions, weave convolutions require a quadratic matrix of interaction descriptors for each pair of atoms. These extra descriptors may provide for additional descriptive power but at the cost of a larger featurized dataset.
```
import deepchem as dc
mols = ["CCC"]
featurizer = dc.feat.WeaveFeaturizer()
features = featurizer.featurize(mols)
type(features[0])
features[0].get_num_atoms() # 3 atoms in compound
features[0].get_num_features() # feature size
type(features[0].get_atom_features())
features[0].get_atom_features().shape
type(features[0].get_pair_features())
features[0].get_pair_features().shape
```
* **Image**
*SMILEs to image code:*
```
"""
Featurizer implementations used in ChemCeption models.
SmilesToImage featurizer for ChemCeption models taken from https://arxiv.org/abs/1710.02238
"""
import numpy as np
from deepchem.utils.typing import RDKitMol
from deepchem.feat.base_classes import MolecularFeaturizer
class SmilesToImage(MolecularFeaturizer):
"""Convert SMILES string to an image.
SmilesToImage Featurizer takes a SMILES string, and turns it into an image.
Details taken from [1]_.
The default size of for the image is 80 x 80. Two image modes are currently
supported - std & engd. std is the gray scale specification,
with atomic numbers as pixel values for atom positions and a constant value of
2 for bond positions. engd is a 4-channel specification, which uses atom
properties like hybridization, valency, charges in addition to atomic number.
Bond type is also used for the bonds.
The coordinates of all atoms are computed, and lines are drawn between atoms
to indicate bonds. For the respective channels, the atom and bond positions are
set to the property values as mentioned in the paper.
References
----------
.. [1] Goh, Garrett B., et al. "Using rule-based labels for weak supervised
learning: a ChemNet for transferable chemical property prediction."
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining. 2018.
Notes
-----
This class requires RDKit to be installed.
"""
def __init__(self,
img_size: int = 80,
res: float = 0.5,
max_len: int = 250,
img_spec: str = "std"):
"""
Parameters
----------
img_size: int, default 80
Size of the image tensor
res: float, default 0.5
Displays the resolution of each pixel in Angstrom
max_len: int, default 250
Maximum allowed length of SMILES string
img_spec: str, default std
Indicates the channel organization of the image tensor
"""
if img_spec not in ["std", "engd"]:
raise ValueError(
"Image mode must be one of std or engd. {} is not supported".format(
img_spec))
self.img_size = img_size
self.max_len = max_len
self.res = res
self.img_spec = img_spec
self.embed = int(img_size * res / 2)
def _featurize(self, mol: RDKitMol) -> np.ndarray:
"""Featurizes a single SMILE into an image.
Parameters
----------
mol: rdkit.Chem.rdchem.Mol
RDKit Mol object
Returns
-------
np.ndarray
A 3D array of image, the shape is `(img_size, img_size, 1)`.
If the length of SMILES is longer than `max_len`, this value is an empty array.
"""
try:
from rdkit import Chem
from rdkit.Chem import AllChem
except ModuleNotFoundError:
raise ImportError("This class requires RDKit to be installed.")
smile = Chem.MolToSmiles(mol)
if len(smile) > self.max_len:
return np.array([])
cmol = Chem.Mol(mol.ToBinary())
cmol.ComputeGasteigerCharges()
AllChem.Compute2DCoords(cmol)
atom_coords = cmol.GetConformer(0).GetPositions()
if self.img_spec == "std":
# Setup image
img = np.zeros((self.img_size, self.img_size, 1))
# Compute bond properties
bond_props = np.array(
[[2.0, bond.GetBeginAtomIdx(),
bond.GetEndAtomIdx()] for bond in mol.GetBonds()])
# Compute atom properties
atom_props = np.array([[atom.GetAtomicNum()] for atom in cmol.GetAtoms()])
bond_props = bond_props.astype(np.float32)
atom_props = atom_props.astype(np.float32)
else:
# Setup image
img = np.zeros((self.img_size, self.img_size, 4))
# Compute bond properties
bond_props = np.array([[
bond.GetBondTypeAsDouble(),
bond.GetBeginAtomIdx(),
bond.GetEndAtomIdx()
] for bond in mol.GetBonds()])
# Compute atom properties
atom_props = np.array([[
atom.GetAtomicNum(),
atom.GetProp("_GasteigerCharge"),
atom.GetExplicitValence(),
atom.GetHybridization().real,
] for atom in cmol.GetAtoms()])
bond_props = bond_props.astype(np.float32)
atom_props = atom_props.astype(np.float32)
partial_charges = atom_props[:, 1]
if np.any(np.isnan(partial_charges)):
return np.array([])
frac = np.linspace(0, 1, int(1 / self.res * 2))
# Reshape done for proper broadcast
frac = frac.reshape(-1, 1, 1)
bond_begin_idxs = bond_props[:, 1].astype(int)
bond_end_idxs = bond_props[:, 2].astype(int)
# Reshapes, and axes manipulations to facilitate vector processing.
begin_coords = atom_coords[bond_begin_idxs]
begin_coords = np.expand_dims(begin_coords.T, axis=0)
end_coords = atom_coords[bond_end_idxs]
end_coords = np.expand_dims(end_coords.T, axis=0)
# Draw a line between the two atoms.
# The coordinates of this line, are indicated in line_coords
line_coords = frac * begin_coords + (1 - frac) * end_coords
# Turn the line coordinates into image positions
bond_line_idxs = np.ceil(
(line_coords[:, 0] + self.embed) / self.res).astype(int)
bond_line_idys = np.ceil(
(line_coords[:, 1] + self.embed) / self.res).astype(int)
# Set the bond line coordinates to the bond property used.
img[bond_line_idxs, bond_line_idys, 0] = bond_props[:, 0]
# Turn atomic coordinates into image positions
atom_idxs = np.round(
(atom_coords[:, 0] + self.embed) / self.res).astype(int)
atom_idys = np.round(
(atom_coords[:, 1] + self.embed) / self.res).astype(int)
# Set the atom positions in image to different atomic properties in channels
img[atom_idxs, atom_idys, :] = atom_props
return img
```
---
### Scaffold Splitting -
- dgl-lifesci - https://github.com/awslabs/dgl-lifesci
- GNN-based modeling on custom datasets for molecular property prediction, reaction prediction, and molecule generation.
-
### A systematic study of key elements underlying molecular property prediction
- MoleculeNet
## References
*Fingerprint Representations*
* Rogers, D., & Hahn, M. (2010). Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling, 50(5), 742–754. doi:10.1021/ci100050t
* https://docs.chemaxon.com/display/docs/extended-connectivity-fingerprint-ecfp.md
* Capecchi, A., Probst, D. & Reymond, JL. One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J Cheminform 12, 43 (2020).
* Riniker, S., Landrum, G.A. Open-source platform to benchmark fingerprints for ligand-based virtual screening. J Cheminform 5, 26 (2013). https://doi.org/10.1186/1758-2946-5-26
* Wu, Zhenqin et al. “MoleculeNet: a benchmark for molecular machine learning.” Chemical science vol. 9,2 513-530. 31 Oct. 2017
* Goh, Garrett B., et al. “Using rule-based labels for weak supervised learning: a ChemNet for transferable chemical property prediction.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
* Kearnes, Steven, et al. “Molecular graph convolutions: moving beyond fingerprints.” Journal of computer-aided molecular design 30.8 (2016): 595-608.
*Graph Molecular Representations*
1. Xiong, Z. et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J. Med. Chem. 63, 8749–8760 (2019)
2. Na, G. S., Chang, H. & Kim, H. W. Machine-guided representation for accurate graph-based molecular machine learning. Phys. Chem. Chem. Phys. 22, 18526–18535 (2020)
3. Jiang, D. et al. Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. J. Cheminformatics 13, 1–23 (2021)
4. Wilkinson, M.R., Martinez-Hernandez, U., Wilson, C.C. et al. Images of chemical structures as molecular representations for deep learning. Journal of Materials Research 37, 2293–2303 (2022). https://doi.org/10.1557/s43578-022-00628-9
*Molecular property prediction*
1. Deng, J., Yang, Z., Wang, H. et al. A systematic study of key elements underlying molecular property prediction. Nat Commun 14, 6395 (2023). https://doi.org/10.1038/s41467-023-41948-6
## Baseline models
- Based on tabular data
- dropped smiles, nans
- Models:
- Random forest
**Random split stratified by class
**
ECFP
| | CAV | HERG | NAV |
| --------- | ---- | ---- | ---- |
| Accuracy: |0.67(+-0.11)|0.91 (+-0.00) |0.6(+-0.06)|
|F1: |0.64(+-0.13)| 0.88 (+-0.00) |0.55(+-0.04)|
| MCC: | 0.39(+-0.22) | 0.29 (+-0.01) | 0.11(+-0.12) |
MORDRED
| | CAV | HERG | NAV |
| --- | ---- | ---- | ---- |
| Accuracy: | Text | Text | Text |
Scaffold split:
ECFP
| | CAV | HERG | NAV |
| --------- | ---- | ---- | ---- |
| Accuracy: |0.72(+-0.04)|0.91 (+-0.00) |0.74(+-0.04)|
|F1: |0.71(+-0.06)|0.87 (+-0.00) |0.71(+-0.06)|
| MCC: | 0.44(+-0.11) | 0.25 (+-0.03) | 0.44(+-0.11) |
MORDRED
| | CAV | HERG | NAV |
| --------- | ---- | ---- | ---- |
| Accuracy:| 0.77 (+-0.04) | | 0.77 (+-0.04) |
|F1: | 0.77 (+-0.05) | | 0.77 (+-0.05)
|MCC: | 0.52 (+-0.10) | Text | 0.52 (+-0.10)
ADABOOST [CM sum across folds]:
* NAV, ECFP:
| Predicted: | 0 | 1 |
| -------- | -------- | -------- |
|0|942| 181|
| 1 | 290| 422|
* NAV, MORDRED:
| Predicted: | 0 | 1 |
| -------- | -------- | -------- |
|0|938 |189|
| 1 | 304 |414 |
* CAV, ECFP:
| Predicted: | 0 | 1 |
| -------- | -------- | -------- |
|0|218 | 57|
| 1 | 75 |203|
* CAV, MORDRED:
| Predicted: | 0 | 1 |
| -------- | -------- | -------- |
|0|215 | 60|
| 1 | 59| 227|
* HERG, ECFP:
| Predicted: | 0 | 1 |
| -------- | -------- | -------- |
|0|83548 |1631
| 1 | 5874 | 3250|
* HERG, MORDRED:
| Predicted: | 0 | 1 |
| -------- | -------- | -------- |
|0|83372 | 1906|
| 1 | 4781 | 4367|