HackMD
  • Beta
    Beta  Get a sneak peek of HackMD’s new design
    Turn on the feature preview and give us feedback.
    Go → Got it
      • Create new note
      • Create a note from template
    • Beta  Get a sneak peek of HackMD’s new design
      Beta  Get a sneak peek of HackMD’s new design
      Turn on the feature preview and give us feedback.
      Go → Got it
      • Sharing Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • More (Comment, Invitee)
      • Publishing
        Please check the box to agree to the Community Guidelines.
        Everyone on the web can find and read all notes of this public team.
        After the note is published, everyone on the web can find and read this note.
        See all published notes on profile page.
      • Commenting Enable
        Disabled Forbidden Owners Signed-in users Everyone
      • Permission
        • Forbidden
        • Owners
        • Signed-in users
        • Everyone
      • Invitee
      • No invitee
      • Options
      • Versions and GitHub Sync
      • Transfer ownership
      • Delete this note
      • Template
      • Save as template
      • Insert from template
      • Export
      • Dropbox
      • Google Drive Export to Google Drive
      • Gist
      • Import
      • Dropbox
      • Google Drive Import from Google Drive
      • Gist
      • Clipboard
      • Download
      • Markdown
      • HTML
      • Raw HTML
    Menu Sharing Create Help
    Create Create new note Create a note from template
    Menu
    Options
    Versions and GitHub Sync Transfer ownership Delete this note
    Export
    Dropbox Google Drive Export to Google Drive Gist
    Import
    Dropbox Google Drive Import from Google Drive Gist Clipboard
    Download
    Markdown HTML Raw HTML
    Back
    Sharing
    Sharing Link copied
    /edit
    View mode
    • Edit mode
    • View mode
    • Book mode
    • Slide mode
    Edit mode View mode Book mode Slide mode
    Note Permission
    Read
    Only me
    • Only me
    • Signed-in users
    • Everyone
    Only me Signed-in users Everyone
    Write
    Only me
    • Only me
    • Signed-in users
    • Everyone
    Only me Signed-in users Everyone
    More (Comment, Invitee)
    Publishing
    Please check the box to agree to the Community Guidelines.
    Everyone on the web can find and read all notes of this public team.
    After the note is published, everyone on the web can find and read this note.
    See all published notes on profile page.
    More (Comment, Invitee)
    Commenting Enable
    Disabled Forbidden Owners Signed-in users Everyone
    Permission
    Owners
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Invitee
    No invitee
       owned this note    owned this note      
    Published Linked with GitHub
    Like BookmarkBookmarked
    Subscribed
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    Subscribe
    # Generating Plastic-Degrading Enzymes ## 1. Abstract We present modifications to the design algorithm Conditioning by Adaptive Sampling (CbAS) (Brookes et al.) and apply it to the design of plastic-degrading enzymes. We curated a dataset of 212 unique PETase sequences and their relative catalytic activities, 159 of which have an experimental Tm value. We implemented CbAS in PyTorch and report the optimization trajectory along with the model uncertainties. Generative models discovered by CbAS are able to consistently sample sequences that are predicted to surpass the catalytic activity and thermostability of the best sequence in the training data. We plan to synthesize the generated sequences in the wetlab for iGEM Toronto's 2021 project. ## 2. Motivation Our main motivation for applying CbAS to protein design is the promising results of iGEM Toronto's 2019 project. In it, the best sequence as predicted by CbAS was synthesized and not only was it able to fold, but also had a catalytic activity competitive with the rationally-designed enzyme in Austin et al. As such, we sought to more carefully monitor the CbAS optimization trajectory with various modifications to the code described below. ## 3. Generative Model ### Variational Autoencoder We train a VAE encoder with two hidden layers for the encoder and two hidden layers for the decoder with a 20-dimensional latent space. The model that is trained on the initial set of PETase sequences is denoted `vae_0` and is a parameterization of the prior probability distribution over PETase sequences. For the prior, we ensured that resulting model is not overfitted to the training data by selecting the model with the lowest loss on a held-out set over 100 training epochs. This is only done for `vae_0` since it is unclear what overfitting means for subsequent VAE models during optimization. 1000 samples from `vae_0` with no decoder noise i.e. only take the $\mu$ output of the decoder and ignore the $\sigma$ output) are visualized below. ![](https://i.imgur.com/rPnjPSu.png) ### Masked Language Model Our work on using a masked language model to generate PETase sequences is described [here](https://hackmd.io/@m5-sBsYvQRKTCgxto14g4g/BJjTWJLrv). ## 4. Discriminative Model ### Catalytic Activity Oracle We train 20 feed-forward neural networks with independent initializations on the same train/test split. Each network takes a batch of protein sequences as input, represented in binary using one-hot-encoding. The output of each network is passed through the $Tanh$ function. The mean of the ensemble was taken to be the predicted catalytic activity. In practice, we predict the log of the relative catalytic activity to wildtype PETase, normalized between -0.5 and 0.5. The intuition against a (0,1) or (-1,1) normalization is to avoid regions of the activation function with zero gradients. Additionally, improvements made to the sequences during optimization can be easily seen as any value above 0.5. Plots of the oracle's performance on the training (blue) data and the test (orange) data are show below: ![](https://i.imgur.com/LeLVGKd.png) ### Thermostability Oracle The same procedure was used to train an ensemble of 20 feed-forward neural networks to predict Tm normalized between -0.5 and 0.5. ![](https://i.imgur.com/c6bIOha.png) The thermostability oracle performs surprisingly well on test data, having an $R^2 = 0.95$ in the second figure. ## 5. Filtering A limitation of CbAS is that it only incorporates protein sequence information. How do we know the generated sequences will even fold, putting aside the objective that it also needs to act as a thermostable enzyme? To increase our confidence in the generated sequences, we can incorporate information from protein structure and additional heuristics from the biochemistry of PETase. ![](https://i.imgur.com/qkap9ak.png =450x) Example interactions with a plastic ligand that should be preserved in generated sequences. ### Proteinsolver Proteinsolver is an algorithm based on Graph Neural Networks that takes as input a graph that represents the geometry of a protein backbone. Nodes represent amino acids and two nodes are connected by an edge if they have atoms within 6 Angstroms apart. The graph for PETase looks like the following: ![](https://i.imgur.com/oUku4dG.png =350x) Proteinsolver outputs sequences that are predicted to satisfy the backbone geometry. Thus, we generated 1000 sequences using Proteinsolver based on the backbone structure of PETase (PDB: 6EQD) and parameterized a probability distribution over these sequences with a VAE. We calculate the evidence lower bound (ELBO) for each of the generated sequences under this Proteinsolver distribution and use it to rank the generated sequences. The higher the ELBO, the more likely the sequence is to fold into the PETase structure. A visualization of the sequence distribution that is predicted to fold into the PETase structure is shown below. ![](https://i.imgur.com/WNv4LoI.png) ### Heuristics In addition to proper folding, we should also ensure that the generated sequences have the active site and other important residues preserved. For PETase, these are Ser160, Asp206, His237, Cys176, Cys212, Cys246, Cys262, Tyr87, Trp185. We align the generated sequences with the wildtype sequence using Clustal Omega. Then, if the important residues listed above are not present within 2 amino acid positions from the postion in the wildtype sequence, we replace the amino acid at the aligned position in the generated sequence with the appropriate important residue. For example, the following figures show a before and after of such a procedure. ![](https://i.imgur.com/EfDdwkp.png) ![](https://i.imgur.com/ii5ZiVa.png) ## 6. Optimization Algorithm We use Brookes et al.'s Conditioning by Adaptive Sampling algorithm for design. The algorithm starts by training oracles--in our case 40 oracles, 20 for catalytic activity prediction and 20 for thermostability prediction. It also trains a base VAE model, `vae_0` on the existing PETase sequences. At each iteration, the algorithm samples 1000 proposal sequences using `vae_i`, where $i$ is the current iteration number. The oracles are used to make predictions of the generated sequences' property values. We moniter the progress of the optimization procedure by reporting top five sequences ranked by the sum of their predicted catalytic activity and thermostability values. We also report the oracles' uncertainty with its predictions as the variance of each ensemble. We made three modifications to their original code: 1. We implemented our models in PyTorch 2. We add an additional filtering step as described above 3. The weights used for weighted MLE was normalized between 0 and 2 to prevent numerical overflow. While we realize this overflow suggests that the variance of the weights is too large, the results after normalization appear sensible after sequence alignment. ## 7. Future Directions We aim to validate our designs in the wetlab. In addition, we aim to explore alternative sequence optimization approaches, including but not limited to simulated annealing, Bayesian optimization, genetic algorithm, greedy search, stochastic sequence propagation, and activation maximization. A fair comparison between all such algorithms is crucial--to that end we aim to benchmark these algorithms on the Rough Mount Fuji model, akin to benchmarking general optimization algorithms on the Ackley function. To incorporate information from structure, a moonshot goal is to implement a differentiable molecular dynamics network for PETase and its interaction with its ligand. The feature of differentiability allows us to backpropagate through the simulator and perturb variables in the system that can affect our desiderata. A desideratum can be the interatomic distance between the $C\beta$ of Ser160 and a carbonyl carbon on the ligand. A continuous representation of the atoms in the PETase+ligand system would be required for backpropagation to perturb the system's variables such as amino acid identity. ## 8. References Papers for constructing PETase sequence dataset: Austin, Harry P., et al. "Characterization and engineering of a plastic-degrading aromatic polyesterase." Proceedings of the National Academy of Sciences 115.19 (2018): E4350-E4357. Cui, Yinglu, et al. "Computational redesign of PETase for plastic biodegradation by GRAPE strategy." BioRxiv (2019): 787069. Han, Xu, et al. "Structural insight into catalytic mechanism of PET hydrolase." Nature communications 8.1 (2017): 1-6. Joo, Seongjoon, et al. "Structural insight into molecular mechanism of poly (ethylene terephthalate) degradation." Nature communications 9.1 (2018): 1-12. Liu, Bing, et al. "protein crystallography and site‐direct mutagenesis analysis of the poly (ethylene terephthalate) hydrolase PETase from Ideonella sakaiensis." ChemBioChem 19.14 (2018): 1471-1475. Liu, Congcong, et al. "Structural and functional characterization of polyethylene terephthalate hydrolase from Ideonella sakaiensis." Biochemical and biophysical research communications 508.1 (2019): 289-294. Ma, Yuan, et al. "Enhanced poly (ethylene terephthalate) hydrolase activity by protein engineering." Engineering 4.6 (2018): 888-893. Son, Hyeoncheol Francis, et al. "Rational protein engineering of thermo-stable PETase from Ideonella sakaiensis for highly efficient PET degradation." ACS Catalysis 9.4 (2019): 3519-3526. Taniguchi, Ikuo, et al. "Biodegradation of PET: current status and application aspects." ACS Catalysis 9.5 (2019): 4089-4105. Yoshida, Shosuke, et al. "A bacterium that degrades and assimilates poly (ethylene terephthalate)." Science 351.6278 (2016): 1196-1199. Other papers: Brookes, David H., Hahnbeom Park, and Jennifer Listgarten. "Conditioning by adaptive sampling for robust design." arXiv preprint arXiv:1901.10060 (2019). DeLano, Warren Lyford. "PyMOL." (2002): 700. Gligorijevic, Vladimir, et al. "Structure-based function prediction using graph convolutional networks." bioRxiv (2020): 786236. Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013). Paszke, Adam, et al. "Automatic differentiation in pytorch." (2017). Rao, Roshan, et al. "Evaluating protein transfer learning with TAPE." Advances in Neural Information Processing Systems. 2019. Vig, Jesse, et al. "Bertology meets biology: Interpreting attention in protein language models." arXiv preprint arXiv:2006.15222 (2020). ## 9. Supplemental ### Graph Convolutional Network Information from structure can also be useful when predicting protein function. The intuition is that knowledge of which amino acids are close together in 3D space can be useful when such amino acids are mutated together or separatedly. To explore this idea, we implemented a Graph Convolutional Network (GCN) to predict catalytic activity. An edge between two nodes exist if any two atoms of the two nodes are within 6 Angstroms apart. Node features represent amino acid identity. We tried two ways of encoding amino acids, one-hot-encoding, and a biochemical feature based embedding. In the second case, the model performance was worse than one-hot-encoding. However, the performance of the one-hot-encoding model was similar to the performance of a simple feed-forward neural network. We decided against using GCNs as oracles because the added model complexity was not warranted for no improvement in performance. A possible way to improve performance could be to add a sequence model, e.g. an LSTM to process the sequence information and concatenate its output with the GCN output for a final prediction, similar to the approach described in Gligorijevic et al. ### TAPE Embeddings Motivated by the results in Vig et al, we hypothesized that using TAPE embeddings (Rao et al.) as input to our oracles, instead of using one-hot-encoding, may improve model performance. However, model performance severely degrades compared to using one-hot-encoding inputs. Visualized below is actual vs. predicted catalytic activity of the oracle with the lowest training loss. A variety of architecture and learning rate schedules could not rescue the model performance. ![](https://i.imgur.com/qrRfC0t.png =300x) Code to reproduce the above result is available [here](https://colab.research.google.com/drive/1_cKlfw0INN54R3Pfrai4fnQTYJtXwJCB?usp=sharing). ### Code All files listed below can be found on [iGEM Toronto Drylab's Github](https://github.com/igemto-drylab/igemto-drylab). `petase_activity.csv` contains 212 sequences and their relative catalytic activity to wildtype. `petase_stability.csv` contains 159 sequences and their Tm values. `CbAS.ipynb` includes all the code for the optimization algorithm, including training the VAE and oracles.

    Import from clipboard

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lost their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template is not available.


    Upgrade

    All
    • All
    • Team
    No template found.

    Create custom template


    Upgrade

    Delete template

    Do you really want to delete this template?

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Tutorials

    Book Mode Tutorial

    Slide Mode Tutorial

    YAML Metadata

    Contacts

    Facebook

    Twitter

    Feedback

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions

    Versions and GitHub Sync

    Sign in to link this note to GitHub Learn more
    This note is not linked with GitHub Learn more
     
    Add badge Pull Push GitHub Link Settings
    Upgrade now

    Version named by    

    More Less
    • Edit
    • Delete

    Note content is identical to the latest version.
    Compare with
      Choose a version
      No search result
      Version not found

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub

        Please sign in to GitHub and install the HackMD app on your GitHub repo. Learn more

         Sign in to GitHub

        HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Available push count

        Upgrade

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Upgrade

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully