# Ignota DSG day 2 notes
Day 1 notes: https://hackmd.io/a8bFPxfuQgm9HHT9rirXwg
## Literature
3D CNN:
[Pafnucy](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6198856/)
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1702-0
[KDEEP](https://pubs.acs.org/doi/10.1021/acs.jcim.7b00650)
For conversion from SMILES to 3D coordinates: https://downloads.ccdc.cam.ac.uk/documentation/API/descriptive_docs/1d_to_3d.html (needs a license)
GNN:
* [DGL-LifeSci](https://pubs.acs.org/doi/10.1021/acsomega.1c04017#)
* https://github.com/junxia97/awesome-pretrain-on-molecules#Applications
## Scaffolding
- Scaffold splits to visualise train/test similarity
- Alternative using Tanimoto similarity based on ECFP (see the sketch below)
- UMAP based on top descriptors to see how well Tanimoto separates them
- Looking into graph representations
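A minimal sketch (assuming RDKit; the SMILES are placeholders) of the Tanimoto-similarity-on-ECFP idea above:

```python
# Tanimoto similarity between two molecules based on ECFP (Morgan) fingerprints
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles_a, smiles_b = "CCO", "CCN"   # placeholder molecules
fp_a = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_a), 2, nBits=2048)
fp_b = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_b), 2, nBits=2048)
print(DataStructs.TanimotoSimilarity(fp_a, fp_b))   # 1.0 = identical fingerprints
```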
## Framework
- Being tested on basic MLPs before more complex networks
- May need help to incorporate k-fold data
## Baseline Models
- Results in HackMD Day 1
- Random forest, AdaBoost, XGBoost
- 5-fold split, recorded values are averaged over all 5 with standard deviations listed
- Scaffold split yields better results than stratified random split
- Metrics (see the sketch at the end of this section):
- Generally disregard accuracy
- F1 and MCC are good to look at
- Cohen's kappa and AUPRC also recommended
- Precision and recall scores also useful to record
- Balanced accuracy (mean of per-class recall), available in scikit-learn
- AdaBoost shows a slight improvement over RF
- Next steps: incorporate this info into the report
- Recommended to include separate metrics for train/val/test
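A minimal sketch (assuming scikit-learn; `y_true`, `y_pred`, `y_prob` are placeholder arrays) of the metrics suggested above:

```python
from sklearn.metrics import (f1_score, matthews_corrcoef, cohen_kappa_score,
                             average_precision_score, precision_score, recall_score,
                             balanced_accuracy_score)

y_true = [0, 1, 1, 0, 1]             # placeholder labels
y_pred = [0, 1, 0, 0, 1]             # placeholder hard predictions
y_prob = [0.1, 0.9, 0.4, 0.2, 0.8]   # placeholder probabilities of the positive class

print("F1:", f1_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))
print("AUPRC:", average_precision_score(y_true, y_prob))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
```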
## Feature Analysis
- Feature importance using random forests (see the sketch below)
- Took the 3 most important features and saw comparable results to using the full feature set
- Next steps: try with inclusion of scaffolds
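A minimal sketch (assuming scikit-learn; `X` and `y` are placeholder arrays standing in for the descriptor matrix and labels) of ranking features by random-forest importance:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(100, 10)          # placeholder descriptor matrix
y = np.random.randint(0, 2, 100)     # placeholder binary labels

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
top3 = np.argsort(rf.feature_importances_)[::-1][:3]   # indices of the 3 most important features
print(top3, rf.feature_importances_[top3])
```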
## Representations
- Variants of GNNs, literature review
- Need to decide on one or two representations to take further
- One that simplifies the data, another that incorporates all variables?
- Image representation could allow use of pre-trained models
- Transformers or GNNs?
### Images
- 2D generation:
- RDKit: https://www.rdkit.org/docs/GettingStartedInPython.html
- DeepChem: https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#smilestoimage
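A minimal sketch (assuming RDKit; the SMILES and image size are placeholders) of generating a 2D image from a SMILES string:

```python
from rdkit import Chem
from rdkit.Chem import Draw

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, as an example
img = Draw.MolToImage(mol, size=(224, 224))         # PIL image, e.g. sized for a ResNet input
img.save("aspirin.png")
```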
## Literature Review
- Insufficiency of 2D descriptors compared to 3D ones; do they lose key info related to bonding mechanisms?
## Discussion of Representations and Models
- A transformer would need to learn 3D structure from (e.g.) SMILES; a graph representation would incorporate this structure already
- GNNs may capture only local interactions between atoms, not global structure. Ways around this:
- Residual information can carry through as long as the graph has sufficient layers
- Transformers can be implemented more easily (native PyTorch); GNNs are more difficult
- Potentially half a day to set up libraries / environments
- Smiles2seq / Mol2vec
- 3 teams: transformers, images, graphs?
- More useful to explore all routes and find which connects best
- E.g., a transformer with fingerprints might be best for the NaV data, another combination best for the hERG data
- 3D representations would be novel to explore
- Could various models be concatenated at the end (e.g., through an MLP)?
- Does this divide the group too much and reduce overall power?
- GNNs
- Features on nodes (atoms) and edges (bonds)
- Can 3D info be incorporated as well? Standardise axes and incorporate bond angles?
- FCHL19: hierarchy of 3D structural information
- Multi-body tensor representations?
- Determine which representation goes best with which method
- Divide into small groups, explore for half a day, and get an idea of how promising each avenue is (e.g., whether graph setup takes too long)
- Determine groups based on domain expertise
## Pros and cons list for each representation
- 3D Images:
- 3D matrix of 0s and 1s
- A 4th dimension: a vector of length n to indicate qualities of the molecule at each position
- Sparse matrix but proven efficacy
- Could take a 2D image and include third dimension with info vector
- Could look at spherical coordinates; which atom to reference from?
- ETKDG and other approaches for approximating the 3D structures
- By converting to 3D images, bonds would need to be inferred from spacing (may not be very robust)
- Possible to extract coordinates from SMILES and convert them to spherical coordinates. What reference axis would be used?
Proposed approach (see the sketch below):
* Generate 3D coordinates with RDKit or DeepChem
* An alternative is to generate 3D structures with a 3D prediction model [like this](https://github.com/divelab/MoleculeX/tree/molx/Molecule3D)
* Need to choose an image resolution - take the smallest bond length and use a multiple X of it? Look at differences in distances between bonds
* Need to calculate the image size - take the max width of compounds across the dataset
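A minimal sketch (assuming RDKit; the SMILES is a placeholder) of the first step above - embedding a molecule in 3D with ETKDG and reading out atomic coordinates:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))     # ethanol, as an example; add explicit hydrogens
AllChem.EmbedMolecule(mol, AllChem.ETKDGv3())   # ETKDG 3D embedding
AllChem.MMFFOptimizeMolecule(mol)               # optional force-field refinement
coords = mol.GetConformer().GetPositions()      # (n_atoms, 3) array of 3D coordinates
print(coords)
```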
- 2D Images:
- Proven capability of just 2D image as input, without third vector of information
- Use of pre-trained network for transfer learning
- Normalise resolution based on standard bond sizes?
- One idea is to use prediction tools for bond distances and obtain the distributions of the bond distances (?)
- ImageMol (ResNet-18 pretrained on 10M drug-like molecule images from PubChem):
- https://github.com/HongxinXiang/ImageMol
- includes visualisations such as gradcam for explainability
- Data processing involves running a script from a different repo: https://github.com/jrwnter/cddd/blob/master/cddd/preprocessing.py . This strips salts and removes stereochemistry information from the SMILES, as well as computing certain attributes ("MolLogP", "MolMR", "BalabanJ", "NumHAcceptors", "NumHDonors", "NumValenceElectrons", "TPSA")
- One representation to explore: [SELFIES](https://github.com/aspuru-guzik-group/selfies) (see the sketch below)
- [SMILES to SELFIES](https://github.com/JohnMommers/SMILES_TO_SELFIES/blob/main/SMILES_TO_SELFIES.ipynb)
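A minimal sketch (assuming the `selfies` package; the SMILES is a placeholder) of the SMILES-to-SELFIES conversion linked above:

```python
import selfies as sf

smiles = "c1ccccc1O"                          # phenol, as an example
selfies_str = sf.encoder(smiles)              # SMILES -> SELFIES
print(selfies_str, sf.decoder(selfies_str))   # round-trip back to SMILES
```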
# Graphs for Toxicity Prediction
[GNN Overview](https://distill.pub/2021/gnn-intro/)
- Graph-level task: predict a property of the entire graph
- Node-level task: predict properties of particular nodes within the graph
- Edge-level task: predict properties of particular edges within the graph
GitHub repository for lit. review on pretrained models, applications etc.: https://github.com/junxia97/awesome-pretrain-on-molecules#Applications
Paper: [Recent Advances in Toxicity Prediction: Applications of Deep Graph Learning](https://pubs.acs.org/doi/full/10.1021/acs.chemrestox.2c00384?casa_token=v2nf132twGMAAAAA%3A_ImqE8TQ6q9aTGv6_WbdHWmmcmdPCllBTPQJE9s2RKv_yrULL9KKLaw7BRGySaWn7mQKxFrIRAb3R2mp)
GROVER (pre-trained GNN) paper - https://arxiv.org/pdf/2007.02835.pdf
## Architectures
Considerations:
1. Min/max distance between nodes / size of molecules
- Larger molecules will require more GNN layers for message passing
- Some techniques, e.g. virtual edges, can be applied to small molecules to transfer information between distant nodes
- Alternatively, include a context vector (master node connected to all other nodes)
Our general modeling template for this problem will be built up using sequential GNN layers, followed by a linear model with a sigmoid activation for classification. The design space for our GNN has many levers that can customize the model (see the sketch after this list):
1. The number of GNN layers, also called the depth.
2. The dimensionality of each attribute when updated. The update function is a 1-layer MLP with a relu activation function and a layer norm for normalization of activations.
3. The aggregation function used in pooling: max, mean or sum.
4. The graph attributes that get updated, or styles of message passing: nodes, edges and global representation. We control these via boolean toggles (on or off). A baseline model would be a graph-independent GNN (all message-passing off) which aggregates all data at the end into a single global attribute. Toggling on all message-passing functions yields a GraphNets architecture.
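A minimal sketch (plain PyTorch with dense adjacency matrices; all class and variable names are hypothetical) of this design space - stacked GNN layers with an MLP update, mean aggregation, and a linear + sigmoid head:

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One round of node message passing: mean over neighbours, then a
    1-layer MLP update with ReLU and LayerNorm."""
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.LayerNorm(dim))

    def forward(self, h, adj):
        # h: (n_nodes, dim) node features; adj: (n_nodes, n_nodes) 0/1 adjacency
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        msg = (adj @ h) / deg                             # mean aggregation over neighbours
        return self.update(torch.cat([h, msg], dim=-1))

class SimpleGNN(nn.Module):
    def __init__(self, in_dim, hidden_dim=64, depth=3):
        super().__init__()
        self.embed = nn.Linear(in_dim, hidden_dim)
        self.layers = nn.ModuleList([SimpleGNNLayer(hidden_dim) for _ in range(depth)])
        self.readout = nn.Linear(hidden_dim, 1)           # graph-level binary classification

    def forward(self, x, adj):
        h = self.embed(x)
        for layer in self.layers:                         # depth = number of GNN layers
            h = layer(h, adj)
        g = h.mean(dim=0)                                 # mean pooling to a global representation
        return torch.sigmoid(self.readout(g))
```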
### Equivariance
- Equivariance is a property of certain mathematical models and algorithms: if the input is transformed in a particular way (e.g. rotated or translated), the output transforms in a corresponding, predictable way.
- In the context of 3D graph neural networks, equivariance is crucial because it allows the network to handle 3D data consistently under such transformations, e.g. rotation or translation of the data.
> TorchMD-NET is an equivariant message passing NN (MPNN)
[TorchMD-NET](https://arxiv.org/abs/2202.02541):
By building on top of the Transformer (Vaswani et al., 2017) architecture, we are centering the design around the attention mechanism, achieving state-of-the-art accuracy on multiple benchmarks while relying solely on a learned featurization of atomic types and coordinates.
### Graphormer
The Transformer was in part the answer to a problem in sequence processing that is also faced by GNNs: signal dilution between far away elements.
[OPIG](https://www.blopig.com/blog/2022/10/graphormer-merging-gnns-and-transformers-for-cheminformatics/)
A powerful technique: the Graphormer won the 2021 Open Graph Benchmark Large-Scale Challenge (OGB-LSC) in quantum chemistry.
### Strategies for pre-training [GNNs](https://arxiv.org/pdf/1905.12265.pdf)
The key to the success of our strategy is to pre-train an expressive GNN at the level of individual nodes as well as entire graphs so that the GNN can learn useful local
and global representations simultaneously.
We find that naïve strategies, which pre-train GNNs at the level of either entire graphs or individual nodes, give limited improvement and can even lead to negative transfer on many downstream tasks.
In contrast, our strategy avoids negative transfer and improves generalization significantly across downstream tasks, leading up to 9.4% absolute improvements in ROC-AUC over non-pre-trained models and achieving state-of-the-art performance for molecular property prediction and protein function prediction.
## Representations for Molecules
1. ChemRL-GEM, Geometry Enhanced Molecular Representation Learning for Property Prediction: https://arxiv.org/pdf/2106.06130v4.pdf
2. Uni-Mol, A Universal 3D Molecular Representation Learning Framework: https://chemrxiv.org/engage/api-gateway/chemrxiv/assets/orp/resource/item/628e5b4d5d948517f5ce6d72/original/uni-mol-a-universal-3d-molecular-representation-learning-framework.pdf

| Framework | Trained on | Applied on | Task | Performance |
| ------------- | ----------------- | ---------- | ------------- | ------------ |
| GROVER (large) | ChEMBL and ZINC15 | ToxCast | Classification | 0.737 (0.010) |
| ChemRL-GEM | ZINC15 | ToxCast | Classification | 0.742 (0.004) |
### ChemRL-GEM
#### Pre-training:
For the geometry-level tasks, the Merck molecular force field (MMFF) function from the RDKit package is used to obtain simulated 3D coordinates of the atoms in the molecules.
**The geometric features of the molecule, including bond lengths, bond angles and atomic distance matrices, are calculated by the simulated 3D coordinates.**
For the graph-level tasks, two kinds of molecular fingerprints are predicted: 1) the Molecular ACCess System (MACCS) key; 2) the extended-connectivity fingerprint (ECFP). See the sketch below.
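A minimal sketch (assuming RDKit; the SMILES is a placeholder) of computing these two graph-level pre-training targets:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

mol = Chem.MolFromSmiles("c1ccccc1O")   # phenol, as an example
maccs = MACCSkeys.GenMACCSKeys(mol)     # 167-bit MACCS key
ecfp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)   # ECFP4 (Morgan, radius 2)
```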
Graph-based Representations. Many works [24, 40, 44, 45, 18] have showcased the great potential of graph neural networks on modeling molecules by taking each atom as a node and each chemical bond as an edge. For example, Attentive FP [54] proposes to extend graph attention mechanism in order to learn aggregation weights. Furthermore, several works [42, 30] start to take the atomic distance into edge features to consider partial geometry information. **However, they still lack the ability to model the full geometry information due to the shortage of traditional GNN architecture.**
Recently, some studies [23, 40] apply self-supervised learning methods to GNNs for molecular property prediction to overcome the insufficiency of the labeled samples. These works learn the molecular representation vectors by exploiting the node-level and graph-level tasks, where the node-level tasks learn the local domain knowledge by predicting the node properties and the graph-level tasks learn the global domain knowledge by predicting biological activities. **Although existing self-supervised learning methods can boost the generalization ability, they neglect the spatial knowledge that is strongly related to the molecular properties.**
### What atomic information to include with nodes and edges?
1. [Pafnucy](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6198856/)
2. [KDEEP](https://pubs.acs.org/doi/10.1021/acs.jcim.7b00650)
## Alternative applications & data integrations
1. [A Deep Learning Approach to Antibiotic Discovery](https://www.sciencedirect.com/science/article/pii/S0092867420301021)
### Initial model training
- Trained on FDA-approved drugs and natural products screened for growth inhibition
- Evaluation metrics: ROC-AUC plots
- Rank-ordered prediction scores of Drug Repurposing Hub molecules that were not present in the training dataset
- ~50% true positives
- t-distributed stochastic neighbor embedding (t-SNE)
- Tanimoto similarity score
2. [The prediction of molecular toxicity based on BiGRU and GraphSAGE](https://www.sciencedirect.com/science/article/pii/S001048252201232X)
## Baseline GNN creation & performance
### Dataset creation
#### Node features
- type / atomic number
- degree (excluding neighboring hydrogen atoms)
- total degree (including neighboring hydrogen atoms)
- explicit valence
- implicit valence
- hybridization
- total number of neighboring hydrogen atoms
- formal charge
- number of radical electrons
- aromatic atom
- ring membership
- chirality
- mass
- (ChemBERT encoding)
#### Edge features
- type
- conjugated bond
- ring membership
- stereo configuration
Others:
- bond length
- bond angle
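A minimal sketch (assuming RDKit; the SMILES and the integer-casting of RDKit enums are illustrative choices, not the exact featurization used) of extracting the node and edge features listed above:

```python
from rdkit import Chem

def atom_features(atom):
    return [
        atom.GetAtomicNum(),              # type / atomic number
        atom.GetDegree(),                 # degree excluding hydrogens
        atom.GetTotalDegree(),            # degree including hydrogens
        atom.GetExplicitValence(),
        atom.GetImplicitValence(),
        int(atom.GetHybridization()),
        atom.GetTotalNumHs(),             # neighboring hydrogens
        atom.GetFormalCharge(),
        atom.GetNumRadicalElectrons(),
        int(atom.GetIsAromatic()),
        int(atom.IsInRing()),
        int(atom.GetChiralTag()),         # chirality
        atom.GetMass(),
    ]

def bond_features(bond):
    return [
        int(bond.GetBondType()),          # bond type
        int(bond.GetIsConjugated()),
        int(bond.IsInRing()),
        int(bond.GetStereo()),            # stereo configuration
    ]

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, as an example
node_feats = [atom_features(a) for a in mol.GetAtoms()]
edge_feats = [bond_features(b) for b in mol.GetBonds()]
```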
## Baseline Models
Simple GNN model trained on a set of graphs (incorporating the node and edge features above) from the NaV dataset.
**Lilli's results**
MCC = 0.50
**Matt's results**
MCC = 0.541
F-score = 0.733
Cohen's kappa = 0.539
### Pretraining model
DeepChem can be used to train a GNN model on Tox21 data (`from deepchem.molnet import load_tox21`).
Example for a graph-conv model:
"""
Script that trains graph-conv models on Tox21 dataset.
"""
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import json
np.random.seed(123)
import tensorflow as tf
tf.random.set_seed(123)
import deepchem as dc
from deepchem.molnet import load_tox21
from deepchem.models.graph_models import PetroskiSuchModel
model_dir = "/tmp/graph_conv"
# Load Tox21 dataset
tox21_tasks, tox21_datasets, transformers = load_tox21(
featurizer='AdjacencyConv')
train_dataset, valid_dataset, test_dataset = tox21_datasets
print(train_dataset.data_dir)
print(valid_dataset.data_dir)
# Fit models
metric = dc.metrics.Metric(
dc.metrics.roc_auc_score, np.mean, mode="classification")
# Batch size of models
batch_size = 128
model = PetroskiSuchModel(
len(tox21_tasks), batch_size=batch_size, mode='classification')
model.fit(train_dataset, nb_epoch=10)
print("Evaluating model")
train_scores = model.evaluate(train_dataset, [metric], transformers)
valid_scores = model.evaluate(valid_dataset, [metric], transformers)
print("Train scores")
print(train_scores)
print("Validation scores")
print(valid_scores)
### GNN architecture
https://lifesci.dgl.ai/api/model.zoo.html
## Potential exploration
1. Geometric approaches
- Modelling 3D molecular geometry (e.g., GeoGNN)
2. Graphormers
- reducing signal dilution between distant elements using attention.
3. Equivariant models (EGNNs)
4. Transformers
   - SMILES Transformer: https://arxiv.org/pdf/1911.04738.pdf
- Created a baseline model using fingerprints as inputs to the transformer; this is only a single vector, not a sequence of vectors, just to exercise some transformer code. Using a simple random split (i.e. scaffolding not yet applied), the model performs as follows on the hERG dataset:
  - Test Accuracy: 0.9305
  - Test Precision: 0.6122
  - Test Recall: 0.7840
  - Test F1 Score: 0.6876
  - Test MCC Score: 0.6554

- Reweighting the minority class helped quite a bit in reducing the imbalance problem; more complex methods for dealing with class imbalance could be considered. We could also investigate changing the probability cut-off for a molecule to be predicted to belong to the minority class
- Fingerprints aren't ideal due to having dimension 2048, so even a single hidden layer results in a large number of weights
- Idea for any of the representations: descriptors represent molecule-level features; we could use the most performant of them as context in whatever other model we use, e.g. add them as global features in a graph
- Drug-likeness features (https://www.nature.com/articles/s41467-023-41948-6.pdf): ['MolWt', 'MolLogP', 'NumHDonors', 'NumHAcceptors', 'NumRotatableBonds', 'MolMR', 'TPSA'] (see the sketch after this list)
- Drug-likeness feature datasets for the ECFP files can be found in "DSGDec2023Ig/src/representation"
- Also writing code to remove molecules that are not represented in both the ECFP and Mordred files for each dataset: "DSGDec2023Ig/src/scripts/smiles_mismatch_ecfp_mordred.py"
- These resulting data files have been saved in "/shared"
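A minimal sketch (assuming RDKit; the SMILES is a placeholder) of computing the drug-likeness descriptors listed above, which could then be attached as global (molecule-level) features:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

names = ['MolWt', 'MolLogP', 'NumHDonors', 'NumHAcceptors',
         'NumRotatableBonds', 'MolMR', 'TPSA']
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, as an example
features = {name: getattr(Descriptors, name)(mol) for name in names}
print(features)
```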
## Method comparison discussion
### Graphs
- Testing done on the NaV dataset so far
- Training takes ~1-2 minutes
- RDKit can easily extract info for node and edge characteristics
- Limitations:
- Don't represent 3D molecular geometry well, only relationships between atoms. Potential solutions:
- Geometric: pretrain on large data set with geometric info encoded (bond angles, bond lengths etc.)
- Graphormers: combination of transformers with graphs to prevent signal dilution across large molecules, by using attention.
- Equivariant models: model structure allows for encoding of rotations of same molecule, increases model robustness.
- Would need to generate the 3D data ourselves
- Pros:
- Existing code (github repos) for geometric and graphormers.
### Transformers
- Baseline model with a single vector (ECFP) input (not the intended input for transformers)
- Graph input data could be more useful here
- Baseline model on the hERG dataset, random split without scaffold
- Class imbalance considerations: weighting the loss function improved performance
- Descriptors could be included at a global level
- Rerun with scaffold split and fingerprint splits
- Rerun with hERG data for comparison with graphs
- Many pre-trained transformers available, on Hugging Face for example, but fitting these to our purposes could be challenging
- Try SMILES2vec as well:
- If this can be implemented quickly, we could see if it improves performance
### Images
- Can extract 3D info from SMILES
- Challenges:
- Finding the best resolution for the grid; it could get too large
- Bond structures not explicitly encoded, would be challenging to represent. Could use spherical coordinate representation to get around this.
2D images:
- Existing network trained on 10M PubChem images through 3 different self-supervised learning methods
- Could be tuned for our purposes
Going forward:
- Focus on 2D, getting pretrained model to work
- Could have model set up and training by lunch tomorrow
### Conclusion
All groups continue for now, discuss again tomorrow lunch time