owned this note
owned this note
Linked with GitHub
# Generating Plastic-Degrading Enzymes
## 1. Abstract
We present modifications to the design algorithm Conditioning by Adaptive Sampling (CbAS) (Brookes et al.) and apply it to the design of plastic-degrading enzymes. We curated a dataset of 212 unique PETase sequences and their relative catalytic activities, 159 of which have an experimental Tm value. We implemented CbAS in PyTorch and report the optimization trajectory along with the model uncertainties. Generative models discovered by CbAS are able to consistently sample sequences that are predicted to surpass the catalytic activity and thermostability of the best sequence in the training data. We plan to synthesize the generated sequences in the wetlab for iGEM Toronto's 2021 project.
## 2. Motivation
Our main motivation for applying CbAS to protein design is the promising results of iGEM Toronto's 2019 project. In it, the best sequence as predicted by CbAS was synthesized and not only was it able to fold, but also had a catalytic activity competitive with the rationally-designed enzyme in Austin et al. As such, we sought to more carefully monitor the CbAS optimization trajectory with various modifications to the code described below.
## 3. Generative Model
### Variational Autoencoder
We train a VAE encoder with two hidden layers for the encoder and two hidden layers for the decoder with a 20-dimensional latent space. The model that is trained on the initial set of PETase sequences is denoted `vae_0` and is a parameterization of the prior probability distribution over PETase sequences. For the prior, we ensured that resulting model is not overfitted to the training data by selecting the model with the lowest loss on a held-out set over 100 training epochs. This is only done for `vae_0` since it is unclear what overfitting means for subsequent VAE models during optimization. 1000 samples from `vae_0` with no decoder noise
i.e. only take the $\mu$ output of the decoder and ignore the $\sigma$ output) are visualized below.
### Masked Language Model
Our work on using a masked language model to generate PETase sequences is described [here](https://hackmd.io/@m5-sBsYvQRKTCgxto14g4g/BJjTWJLrv).
## 4. Discriminative Model
### Catalytic Activity Oracle
We train 20 feed-forward neural networks with independent initializations on the same train/test split. Each network takes a batch of protein sequences as input, represented in binary using one-hot-encoding. The output of each network is passed through the $Tanh$ function. The mean of the ensemble was taken to be the predicted catalytic activity. In practice, we predict the log of the relative catalytic activity to wildtype PETase, normalized between -0.5 and 0.5. The intuition against a (0,1) or (-1,1) normalization is to avoid regions of the activation function with zero gradients. Additionally, improvements made to the sequences during optimization can be easily seen as any value above 0.5.
Plots of the oracle's performance on the training (blue) data and the test (orange) data are show below:
### Thermostability Oracle
The same procedure was used to train an ensemble of 20 feed-forward neural networks to predict Tm normalized between -0.5 and 0.5.
The thermostability oracle performs surprisingly well on test data, having an $R^2 = 0.95$ in the second figure.
## 5. Filtering
A limitation of CbAS is that it only incorporates protein sequence information. How do we know the generated sequences will even fold, putting aside the objective that it also needs to act as a thermostable enzyme? To increase our confidence in the generated sequences, we can incorporate information from protein structure and additional heuristics from the biochemistry of PETase.
Example interactions with a plastic ligand that should be preserved in generated sequences.
Proteinsolver is an algorithm based on Graph Neural Networks that takes as input a graph that represents the geometry of a protein backbone. Nodes represent amino acids and two nodes are connected by an edge if they have atoms within 6 Angstroms apart. The graph for PETase looks like the following:
Proteinsolver outputs sequences that are predicted to satisfy the backbone geometry. Thus, we generated 1000 sequences using Proteinsolver based on the backbone structure of PETase (PDB: 6EQD) and parameterized a probability distribution over these sequences with a VAE. We calculate the evidence lower bound (ELBO) for each of the generated sequences under this Proteinsolver distribution and use it to rank the generated sequences. The higher the ELBO, the more likely the sequence is to fold into the PETase structure. A visualization of the sequence distribution that is predicted to fold into the PETase structure is shown below.
In addition to proper folding, we should also ensure that the generated sequences have the active site and other important residues preserved. For PETase, these are Ser160, Asp206, His237, Cys176, Cys212, Cys246, Cys262, Tyr87, Trp185. We align the generated sequences with the wildtype sequence using Clustal Omega. Then, if the important residues listed above are not present within 2 amino acid positions from the postion in the wildtype sequence, we replace the amino acid at the aligned position in the generated sequence with the appropriate important residue. For example, the following figures show a before and after of such a procedure.
## 6. Optimization Algorithm
We use Brookes et al.'s Conditioning by Adaptive Sampling algorithm for design. The algorithm starts by training oracles--in our case 40 oracles, 20 for catalytic activity prediction and 20 for thermostability prediction. It also trains a base VAE model, `vae_0` on the existing PETase sequences. At each iteration, the algorithm samples 1000 proposal sequences using `vae_i`, where $i$ is the current iteration number. The oracles are used to make predictions of the generated sequences' property values. We moniter the progress of the optimization procedure by reporting top five sequences ranked by the sum of their predicted catalytic activity and thermostability values. We also report the oracles' uncertainty with its predictions as the variance of each ensemble. We made three modifications to their original code:
1. We implemented our models in PyTorch
2. We add an additional filtering step as described above
3. The weights used for weighted MLE was normalized between 0 and 2 to prevent numerical overflow. While we realize this overflow suggests that the variance of the weights is too large, the results after normalization appear sensible after sequence alignment.
## 7. Future Directions
We aim to validate our designs in the wetlab. In addition, we aim to explore alternative sequence optimization approaches, including but not limited to simulated annealing, Bayesian optimization, genetic algorithm, greedy search, stochastic sequence propagation, and activation maximization. A fair comparison between all such algorithms is crucial--to that end we aim to benchmark these algorithms on the Rough Mount Fuji model, akin to benchmarking general optimization algorithms on the Ackley function. To incorporate information from structure, a moonshot goal is to implement a differentiable molecular dynamics network for PETase and its interaction with its ligand. The feature of differentiability allows us to backpropagate through the simulator and perturb variables in the system that can affect our desiderata. A desideratum can be the interatomic distance between the $C\beta$ of Ser160 and a carbonyl carbon on the ligand. A continuous representation of the atoms in the PETase+ligand system would be required for backpropagation to perturb the system's variables such as amino acid identity.
## 8. References
Papers for constructing PETase sequence dataset:
Austin, Harry P., et al. "Characterization and engineering of a plastic-degrading aromatic polyesterase." Proceedings of the National Academy of Sciences 115.19 (2018): E4350-E4357.
Cui, Yinglu, et al. "Computational redesign of PETase for plastic biodegradation by GRAPE strategy." BioRxiv (2019): 787069.
Han, Xu, et al. "Structural insight into catalytic mechanism of PET hydrolase." Nature communications 8.1 (2017): 1-6.
Joo, Seongjoon, et al. "Structural insight into molecular mechanism of poly (ethylene terephthalate) degradation." Nature communications 9.1 (2018): 1-12.
Liu, Bing, et al. "protein crystallography and site‐direct mutagenesis analysis of the poly (ethylene terephthalate) hydrolase PETase from Ideonella sakaiensis." ChemBioChem 19.14 (2018): 1471-1475.
Liu, Congcong, et al. "Structural and functional characterization of polyethylene terephthalate hydrolase from Ideonella sakaiensis." Biochemical and biophysical research communications 508.1 (2019): 289-294.
Ma, Yuan, et al. "Enhanced poly (ethylene terephthalate) hydrolase activity by protein engineering." Engineering 4.6 (2018): 888-893.
Son, Hyeoncheol Francis, et al. "Rational protein engineering of thermo-stable PETase from Ideonella sakaiensis for highly efficient PET degradation." ACS Catalysis 9.4 (2019): 3519-3526.
Taniguchi, Ikuo, et al. "Biodegradation of PET: current status and application aspects." ACS Catalysis 9.5 (2019): 4089-4105.
Yoshida, Shosuke, et al. "A bacterium that degrades and assimilates poly (ethylene terephthalate)." Science 351.6278 (2016): 1196-1199.
Brookes, David H., Hahnbeom Park, and Jennifer Listgarten. "Conditioning by adaptive sampling for robust design." arXiv preprint arXiv:1901.10060 (2019).
DeLano, Warren Lyford. "PyMOL." (2002): 700.
Gligorijevic, Vladimir, et al. "Structure-based function prediction using graph convolutional networks." bioRxiv (2020): 786236.
Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).
Paszke, Adam, et al. "Automatic differentiation in pytorch." (2017).
Rao, Roshan, et al. "Evaluating protein transfer learning with TAPE." Advances in Neural Information Processing Systems. 2019.
Vig, Jesse, et al. "Bertology meets biology: Interpreting attention in protein language models." arXiv preprint arXiv:2006.15222 (2020).
## 9. Supplemental
### Graph Convolutional Network
Information from structure can also be useful when predicting protein function. The intuition is that knowledge of which amino acids are close together in 3D space can be useful when such amino acids are mutated together or separatedly. To explore this idea, we implemented a Graph Convolutional Network (GCN) to predict catalytic activity. An edge between two nodes exist if any two atoms of the two nodes are within 6 Angstroms apart. Node features represent amino acid identity. We tried two ways of encoding amino acids, one-hot-encoding, and a biochemical feature based embedding. In the second case, the model performance was worse than one-hot-encoding. However, the performance of the one-hot-encoding model was similar to the performance of a simple feed-forward neural network. We decided against using GCNs as oracles because the added model complexity was not warranted for no improvement in performance. A possible way to improve performance could be to add a sequence model, e.g. an LSTM to process the sequence information and concatenate its output with the GCN output for a final prediction, similar to the approach described in Gligorijevic et al.
### TAPE Embeddings
Motivated by the results in Vig et al, we hypothesized that using TAPE embeddings (Rao et al.) as input to our oracles, instead of using one-hot-encoding, may improve model performance. However, model performance severely degrades compared to using one-hot-encoding inputs. Visualized below is actual vs. predicted catalytic activity of the oracle with the lowest training loss. A variety of architecture and learning rate schedules could not rescue the model performance.
Code to reproduce the above result is available [here](https://colab.research.google.com/drive/1_cKlfw0INN54R3Pfrai4fnQTYJtXwJCB?usp=sharing).
All files listed below can be found on [iGEM Toronto Drylab's Github](https://github.com/igemto-drylab/igemto-drylab).
`petase_activity.csv` contains 212 sequences and their relative catalytic activity to wildtype.
`petase_stability.csv` contains 159 sequences and their Tm values.
`CbAS.ipynb` includes all the code for the optimization algorithm, including training the VAE and oracles.