Tianyu Lu
# Generating Plastic-Degrading Enzymes

## 1. Abstract

We present modifications to the design algorithm Conditioning by Adaptive Sampling (CbAS) (Brookes et al.) and apply it to the design of plastic-degrading enzymes. We curated a dataset of 212 unique PETase sequences and their relative catalytic activities, 159 of which have an experimental Tm value. We implemented CbAS in PyTorch and report the optimization trajectory along with the model uncertainties. Generative models discovered by CbAS consistently sample sequences that are predicted to surpass the catalytic activity and thermostability of the best sequence in the training data. We plan to synthesize the generated sequences in the wetlab for iGEM Toronto's 2021 project.

## 2. Motivation

Our main motivation for applying CbAS to protein design is the promising results of iGEM Toronto's 2019 project. In that project, the best sequence as predicted by CbAS was synthesized; not only did it fold, it also had a catalytic activity competitive with the rationally designed enzyme in Austin et al. We therefore sought to monitor the CbAS optimization trajectory more carefully, with the modifications to the code described below.

## 3. Generative Model

### Variational Autoencoder

We train a VAE with two hidden layers for the encoder, two hidden layers for the decoder, and a 20-dimensional latent space. The model trained on the initial set of PETase sequences is denoted `vae_0` and is a parameterization of the prior probability distribution over PETase sequences. For the prior, we ensured that the resulting model is not overfitted to the training data by selecting the model with the lowest loss on a held-out set over 100 training epochs. This is only done for `vae_0`, since it is unclear what overfitting means for subsequent VAE models during optimization. 1000 samples from `vae_0` with no decoder noise (i.e. taking only the $\mu$ output of the decoder and ignoring the $\sigma$ output) are visualized below.

![](https://i.imgur.com/rPnjPSu.png)

### Masked Language Model

Our work on using a masked language model to generate PETase sequences is described [here](https://hackmd.io/@m5-sBsYvQRKTCgxto14g4g/BJjTWJLrv).

## 4. Discriminative Model

### Catalytic Activity Oracle

We train 20 feed-forward neural networks with independent initializations on the same train/test split. Each network takes a batch of protein sequences as input, represented as one-hot encodings. The output of each network is passed through the $\tanh$ function, and the mean of the ensemble is taken as the predicted catalytic activity. In practice, we predict the log of the catalytic activity relative to wildtype PETase, normalized between -0.5 and 0.5. We avoid a (0,1) or (-1,1) normalization so that the targets stay away from the regions of the activation function where gradients approach zero. Additionally, improvements made to the sequences during optimization are easy to see: they are any value above 0.5. Plots of the oracle's performance on the training (blue) data and the test (orange) data are shown below:

![](https://i.imgur.com/LeLVGKd.png)

### Thermostability Oracle

The same procedure was used to train an ensemble of 20 feed-forward neural networks to predict Tm normalized between -0.5 and 0.5.

![](https://i.imgur.com/c6bIOha.png)

The thermostability oracle performs surprisingly well on test data, with $R^2 = 0.95$ in the second figure.

## 5. Filtering

A limitation of CbAS is that it only incorporates protein sequence information. How do we know the generated sequences will even fold, putting aside the objective that they also need to act as thermostable enzymes? To increase our confidence in the generated sequences, we can incorporate information from protein structure and additional heuristics from the biochemistry of PETase.
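Looking back at the oracles of Section 4 before moving on to the filters, the target normalization and the ensemble aggregation can be sketched in a few lines of plain Python. This is a minimal illustration and not the code in the project notebook: the function names `normalize_log_activity` and `ensemble_predict` are hypothetical, min-max scaling is an assumed way of reaching the [-0.5, 0.5] range, and the use of population variance as the uncertainty is likewise an assumption. Relative activities are assumed to be strictly positive so that the logarithm is defined.

```python
import math
from statistics import mean, pvariance


def normalize_log_activity(rel_activities):
    """Log-transform relative activities, then min-max scale to [-0.5, 0.5].

    Assumption: min-max scaling; the write-up only states the target range.
    """
    logs = [math.log(a) for a in rel_activities]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        raise ValueError("need at least two distinct activity values")
    return [(x - lo) / (hi - lo) - 0.5 for x in logs]


def ensemble_predict(member_outputs):
    """Aggregate the raw outputs of all ensemble members for one sequence.

    Each raw output is passed through tanh; the ensemble mean is the
    prediction and the variance across members is the uncertainty.
    """
    squashed = [math.tanh(o) for o in member_outputs]
    return mean(squashed), pvariance(squashed)


# Toy data: activities relative to wildtype (wildtype itself is 1.0).
targets = normalize_log_activity([0.1, 1.0, 10.0])
prediction, uncertainty = ensemble_predict([0.30, 0.35, 0.28])
```

Under this scaling the best training sequence sits at exactly 0.5, which is why a predicted value above 0.5 reads directly as an improvement over the training data.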
![](https://i.imgur.com/qkap9ak.png =450x)

Example interactions with a plastic ligand that should be preserved in generated sequences.

### Proteinsolver

Proteinsolver is an algorithm based on Graph Neural Networks that takes as input a graph representing the geometry of a protein backbone. Nodes represent amino acids, and two nodes are connected by an edge if they have atoms within 6 Angstroms of each other. The graph for PETase looks like the following:

![](https://i.imgur.com/oUku4dG.png =350x)

Proteinsolver outputs sequences that are predicted to satisfy the backbone geometry. We therefore generated 1000 sequences using Proteinsolver based on the backbone structure of PETase (PDB: 6EQD) and parameterized a probability distribution over these sequences with a VAE. We calculate the evidence lower bound (ELBO) of each sequence generated by CbAS under this Proteinsolver distribution and use it to rank the generated sequences. The higher the ELBO, the more likely the sequence is to fold into the PETase structure. A visualization of the sequence distribution that is predicted to fold into the PETase structure is shown below.

![](https://i.imgur.com/WNv4LoI.png)

### Heuristics

In addition to proper folding, we should also ensure that the generated sequences preserve the active site and other important residues. For PETase, these are Ser160, Asp206, His237, Cys176, Cys212, Cys246, Cys262, Tyr87, and Trp185. We align the generated sequences with the wildtype sequence using Clustal Omega. Then, if an important residue listed above is not present within 2 amino acid positions of its position in the wildtype sequence, we replace the amino acid at the aligned position in the generated sequence with the appropriate important residue. For example, the following figures show a before and after of such a procedure.

![](https://i.imgur.com/EfDdwkp.png)

![](https://i.imgur.com/ii5ZiVa.png)

## 6. Optimization Algorithm

We use Brookes et al.'s Conditioning by Adaptive Sampling algorithm for design. The algorithm starts by training oracles: in our case 40 oracles, 20 for catalytic activity prediction and 20 for thermostability prediction. It also trains a base VAE model, `vae_0`, on the existing PETase sequences. At each iteration, the algorithm samples 1000 proposal sequences using `vae_i`, where $i$ is the current iteration number. The oracles are used to predict the property values of the generated sequences. We monitor the progress of the optimization procedure by reporting the top five sequences ranked by the sum of their predicted catalytic activity and thermostability values. We also report the oracles' uncertainty in their predictions as the variance of each ensemble.

We made three modifications to their original code:

1. We implemented our models in PyTorch.
2. We added a filtering step as described above.
3. The weights used for weighted MLE were normalized between 0 and 2 to prevent numerical overflow. While we realize this overflow suggests that the variance of the weights is too large, the results after normalization appear sensible after sequence alignment.

## 7. Future Directions

We aim to validate our designs in the wetlab. In addition, we aim to explore alternative sequence optimization approaches, including but not limited to simulated annealing, Bayesian optimization, genetic algorithms, greedy search, stochastic sequence propagation, and activation maximization. A fair comparison between all such algorithms is crucial; to that end we aim to benchmark these algorithms on the Rough Mount Fuji model, akin to benchmarking general optimization algorithms on the Ackley function.

To incorporate information from structure, a moonshot goal is to implement a differentiable molecular dynamics network for PETase and its interaction with its ligand.
Differentiability allows us to backpropagate through the simulator and perturb variables in the system that can affect our desiderata. A desideratum can be the interatomic distance between the $C_\beta$ of Ser160 and a carbonyl carbon on the ligand. A continuous representation of the atoms in the PETase+ligand system would be required for backpropagation to perturb the system's variables, such as amino acid identity.

## 8. References

Papers for constructing PETase sequence dataset:

Austin, Harry P., et al. "Characterization and engineering of a plastic-degrading aromatic polyesterase." Proceedings of the National Academy of Sciences 115.19 (2018): E4350-E4357.

Cui, Yinglu, et al. "Computational redesign of PETase for plastic biodegradation by GRAPE strategy." BioRxiv (2019): 787069.

Han, Xu, et al. "Structural insight into catalytic mechanism of PET hydrolase." Nature communications 8.1 (2017): 1-6.

Joo, Seongjoon, et al. "Structural insight into molecular mechanism of poly (ethylene terephthalate) degradation." Nature communications 9.1 (2018): 1-12.

Liu, Bing, et al. "Protein crystallography and site‐direct mutagenesis analysis of the poly (ethylene terephthalate) hydrolase PETase from Ideonella sakaiensis." ChemBioChem 19.14 (2018): 1471-1475.

Liu, Congcong, et al. "Structural and functional characterization of polyethylene terephthalate hydrolase from Ideonella sakaiensis." Biochemical and biophysical research communications 508.1 (2019): 289-294.

Ma, Yuan, et al. "Enhanced poly (ethylene terephthalate) hydrolase activity by protein engineering." Engineering 4.6 (2018): 888-893.

Son, Hyeoncheol Francis, et al. "Rational protein engineering of thermo-stable PETase from Ideonella sakaiensis for highly efficient PET degradation." ACS Catalysis 9.4 (2019): 3519-3526.

Taniguchi, Ikuo, et al. "Biodegradation of PET: current status and application aspects." ACS Catalysis 9.5 (2019): 4089-4105.

Yoshida, Shosuke, et al. "A bacterium that degrades and assimilates poly (ethylene terephthalate)." Science 351.6278 (2016): 1196-1199.

Other papers:

Brookes, David H., Hahnbeom Park, and Jennifer Listgarten. "Conditioning by adaptive sampling for robust design." arXiv preprint arXiv:1901.10060 (2019).

DeLano, Warren Lyford. "PyMOL." (2002): 700.

Gligorijevic, Vladimir, et al. "Structure-based function prediction using graph convolutional networks." bioRxiv (2020): 786236.

Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).

Paszke, Adam, et al. "Automatic differentiation in pytorch." (2017).

Rao, Roshan, et al. "Evaluating protein transfer learning with TAPE." Advances in Neural Information Processing Systems. 2019.

Vig, Jesse, et al. "Bertology meets biology: Interpreting attention in protein language models." arXiv preprint arXiv:2006.15222 (2020).

## 9. Supplemental

### Graph Convolutional Network

Information from structure can also be useful when predicting protein function. The intuition is that knowing which amino acids are close together in 3D space can be useful when such amino acids are mutated together or separately. To explore this idea, we implemented a Graph Convolutional Network (GCN) to predict catalytic activity. An edge between two nodes exists if any two atoms of the two nodes are within 6 Angstroms of each other. Node features represent amino acid identity. We tried two ways of encoding amino acids: one-hot encoding and an embedding based on biochemical features. The biochemical embedding performed worse than one-hot encoding, and the one-hot-encoded GCN performed similarly to a simple feed-forward neural network. We decided against using GCNs as oracles because the added model complexity was not warranted given no improvement in performance. A possible way to improve performance could be to add a sequence model, e.g. an LSTM, to process the sequence information and concatenate its output with the GCN output for a final prediction, similar to the approach described in Gligorijevic et al.

### TAPE Embeddings

Motivated by the results in Vig et al., we hypothesized that using TAPE embeddings (Rao et al.) as input to our oracles, instead of one-hot encoding, may improve model performance. However, model performance degrades severely compared to using one-hot-encoded inputs. Visualized below is actual vs. predicted catalytic activity of the oracle with the lowest training loss. Varying the architecture and the learning rate schedule did not rescue the model performance.

![](https://i.imgur.com/qrRfC0t.png =300x)

Code to reproduce the above result is available [here](https://colab.research.google.com/drive/1_cKlfw0INN54R3Pfrai4fnQTYJtXwJCB?usp=sharing).

### Code

All files listed below can be found on [iGEM Toronto Drylab's Github](https://github.com/igemto-drylab/igemto-drylab).

`petase_activity.csv` contains 212 sequences and their catalytic activity relative to wildtype.

`petase_stability.csv` contains 159 sequences and their Tm values.

`CbAS.ipynb` includes all the code for the optimization algorithm, including training the VAE and oracles.
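For orientation before opening the notebook, the weight normalization from Section 6 (modification 3) could look roughly like the sketch below. It is not the notebook's code. It assumes the importance weights are available in log space, and it swaps in the standard trick of subtracting the largest log-weight before exponentiating, which is one possible way to land in the range 0 to 2; the write-up does not say which rescaling was actually used. The names `normalize_weights` and `weighted_nll` are hypothetical.

```python
import math


def normalize_weights(log_weights, upper=2.0):
    """Rescale importance weights into [0, upper] without overflow.

    Assumption: weights are given as log-weights. Subtracting the largest
    log-weight keeps every exponent <= 0, so math.exp cannot overflow,
    and the largest weight maps exactly to `upper`.
    """
    if not log_weights:
        raise ValueError("need at least one weight")
    largest = max(log_weights)
    return [upper * math.exp(lw - largest) for lw in log_weights]


def weighted_nll(neg_log_likelihoods, weights):
    """Weighted average of per-sequence negative log-likelihoods.

    This is the quantity a weighted-MLE update would minimize; dividing
    by the total weight does not change the minimizer.
    """
    total = sum(weights)
    return sum(w * nll for w, nll in zip(weights, neg_log_likelihoods)) / total


# math.exp(1000.0) would raise OverflowError; the rescaled version does not.
weights = normalize_weights([1000.0, 999.0, 0.0])
loss = weighted_nll([1.0, 3.0, 5.0], weights)
```

Rescaling only removes the overflow. When the log-weights are spread this widely, a handful of sequences still dominate the update, which is the large-variance caveat noted in Section 6.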
