1. Abstract
We present modifications to the design algorithm Conditioning by Adaptive Sampling (CbAS) (Brookes et al.) and apply it to the design of plastic-degrading enzymes. We curated a dataset of 212 unique PETase sequences with their relative catalytic activities, 159 of which also have an experimentally measured Tm. We implemented CbAS in PyTorch and report the optimization trajectory along with the model uncertainties. The generative models discovered by CbAS consistently sample sequences predicted to surpass the catalytic activity and thermostability of the best sequence in the training data. We plan to synthesize the generated sequences in the wet lab as part of iGEM Toronto's 2021 project.
2. Motivation
Our main motivation for applying CbAS to protein design comes from the promising results of iGEM Toronto's 2019 project. In that project, the best sequence predicted by CbAS was synthesized; not only did it fold, it also showed catalytic activity competitive with the rationally designed enzyme of Austin et al. We therefore sought to monitor the CbAS optimization trajectory more carefully, using the modifications to the code described below.
3. Generative Model
Variational Autoencoder
We train a VAE with two hidden layers in the encoder, two hidden layers in the decoder, and a 20-dimensional latent space. The model trained on the initial set of PETase sequences is denoted vae_0 and parameterizes the prior probability distribution over PETase sequences. To ensure the prior is not overfitted to the training data, we train for 100 epochs and select the checkpoint with the lowest loss on a held-out set. This is done only for vae_0, since it is unclear what overfitting means for the subsequent VAE models fit during optimization.

[Figure: 1000 samples from vae_0 with no decoder noise]
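As a concrete illustration, the sketch below shows a minimal PyTorch setup consistent with this description: two hidden layers per network and a 20-dimensional latent space, trained for 100 epochs with the lowest-held-out-loss checkpoint kept as vae_0, and noise-free sampling done by decoding prior draws and taking the argmax residue at each position. The sequence length, alphabet, hidden width, activation, learning rate, batch sizes, and the placeholder data are illustrative assumptions, not values taken from the project code.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset, random_split

SEQ_LEN = 290   # assumed alignment length; depends on the curated PETase dataset
N_TOKENS = 21   # assumed alphabet: 20 amino acids plus a gap token
HIDDEN = 256    # assumed hidden-layer width
LATENT = 20     # latent dimensionality stated in the text


class SequenceVAE(nn.Module):
    """VAE with two hidden layers in the encoder and two in the decoder."""

    def __init__(self):
        super().__init__()
        d_in = SEQ_LEN * N_TOKENS
        self.encoder = nn.Sequential(
            nn.Linear(d_in, HIDDEN), nn.ELU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ELU(),
        )
        self.to_mu = nn.Linear(HIDDEN, LATENT)
        self.to_logvar = nn.Linear(HIDDEN, LATENT)
        self.decoder = nn.Sequential(
            nn.Linear(LATENT, HIDDEN), nn.ELU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ELU(),
            nn.Linear(HIDDEN, d_in),
        )

    def encode(self, x):
        h = self.encoder(x.flatten(1))
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, z):
        # Per-position categorical logits over the amino-acid alphabet.
        return self.decoder(z).view(-1, SEQ_LEN, N_TOKENS)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.decode(z), mu, logvar


def elbo_loss(logits, x, mu, logvar):
    # Reconstruction term: cross-entropy against the one-hot input sequence.
    recon = F.cross_entropy(logits.transpose(1, 2), x.argmax(-1), reduction="sum")
    # KL divergence between q(z|x) and the standard-normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (recon + kl) / x.size(0)


# Placeholder random data standing in for the 212 curated one-hot PETase sequences.
data = torch.eye(N_TOKENS)[torch.randint(N_TOKENS, (212, SEQ_LEN))]
train_set, val_set = random_split(TensorDataset(data), [170, 42])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)

# Train for 100 epochs, keeping the checkpoint with the lowest held-out loss.
model = SequenceVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
best_val, best_state = float("inf"), None
for epoch in range(100):
    model.train()
    for (x,) in train_loader:
        opt.zero_grad()
        logits, mu, logvar = model(x)
        elbo_loss(logits, x, mu, logvar).backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        val = 0.0
        for (x,) in val_loader:
            logits, mu, logvar = model(x)
            val += elbo_loss(logits, x, mu, logvar).item()
    if val < best_val:
        best_val, best_state = val, copy.deepcopy(model.state_dict())
model.load_state_dict(best_state)  # this checkpoint plays the role of vae_0

# 1000 samples with no decoder noise: decode prior draws and take the argmax
# residue at each position instead of sampling the decoder's categoricals.
with torch.no_grad():
    z = torch.randn(1000, LATENT)
    samples = model.decode(z).argmax(-1)  # (1000, SEQ_LEN) residue indices
```

Decoding with the argmax rather than sampling from the per-position categoricals is one way to realize "no decoder noise"; it makes each latent draw map deterministically to a single sequence.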