<!-- README: add new slides for new meetings at the end, keeping old slides since these might be helpful during the meeting. Use a different background color for each meeting and start with a slide that contains the week number and the meeting type (e.g. Bernd/Alice/Internal). -->
<!-- .slide: data-background="#1A237E" -->
# Slide deck
---
<!-- .slide: data-background="#1A237E" -->
## Agenda
1. *GenomeNet* team update
2. *deepG* updates
3. Network interpretation
4. Manuscript draft
5. Transfer learning
---
## *GenomeNet* team update
<!-- .slide: data-background="#1A237E" -->
**New people!**
1. **Julia Moosbauer** (part time, PhD)
2. **Anil Gündüz** (full time, PhD)
---
<!-- .slide: data-background="#1A237E" -->
## *deepG* progress
- Added support for WaveNet [[paper]](https://arxiv.org/pdf/1609.03499.pdf), a DL architecture developed by DeepMind
- Should handle long-term dependencies better
- **Skip connections** allow a complete bypass of convolution layers
- Performance on the NCBI language model is already slightly better than with the LSTM model (38% vs. ~35% test accuracy)
----
<!-- .slide: data-background="#1A237E" -->
> <small>[...] WaveNet, a deep generative model of raw audio waveforms. We show that WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems, reducing the gap with human performance by over 50%.</small>
----
<!-- .slide: data-background="#1A237E" -->

Dilated convolutions let the receptive field grow exponentially with depth, so much longer subsequences can be used (sketch on the next slide)
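----
<!-- .slide: data-background="#1A237E" -->
A minimal keras sketch of stacked dilated causal convolutions (layer sizes and input shape are illustrative assumptions, not the actual *deepG* implementation):
```
# doubling the dilation rate per layer grows the receptive field exponentially,
# so far longer subsequences can be covered than with a plain convolution stack
library(keras)
model <- keras_model_sequential() %>%
  layer_conv_1d(filters = 32, kernel_size = 2, dilation_rate = 1,
                padding = "causal", activation = "relu",
                input_shape = c(1000, 4)) %>%   # 1,000 nt, one-hot A/C/G/T
  layer_conv_1d(filters = 32, kernel_size = 2, dilation_rate = 2,
                padding = "causal", activation = "relu") %>%
  layer_conv_1d(filters = 32, kernel_size = 2, dilation_rate = 4,
                padding = "causal", activation = "relu") %>%
  layer_conv_1d(filters = 32, kernel_size = 2, dilation_rate = 8,
                padding = "causal", activation = "relu")
```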
----
<!-- .slide: data-background="#1A237E" -->
**Practical implications**
- currently we train to predict the next character using subsequences of 100-400 nt, and accuracy does not improve with longer subsequences (due to known long-term dependency problems [[paper](https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0)])
- with *WaveNet* we can train on subsequences of length 1,000-10,000
----
<!-- .slide: data-background="#1A237E" -->
**Further changes**
- instead of predicting the next character, we predict the shifted input sequence (should be more efficient)
- this required rewriting the data generator function for the different target format
**Example:**
`AGCAGAACGTTTGAGGATTAGGTCAAATTG` -> <span><!-- .element: class="fragment highlight-red" -->`_AGCAGAACGTTTGAGGATTAGGTCAAATTG`</span>
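----
<!-- .slide: data-background="#1A237E" -->
A minimal sketch of how the input/target pair from the example on the previous slide can be built at the character level (illustrative only, not the actual *deepG* generator code):
```
# the target is the input shifted by one position and padded with "_",
# so every position gets a target (many-to-many)
chars  <- strsplit("AGCAGAACGTTTGAGGATTAGGTCAAATTG", "")[[1]]
input  <- chars
target <- c("_", chars[-length(chars)])   # same length as the input
rbind(input = input, target = target)[, 1:8]
```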
----

<!-- .slide: data-background="#1A237E" -->
from *many-to-one* to *many-to-many*
----
<!-- .slide: data-background="#1A237E" -->
**Software enhancements**
- Code rearranged to support different network architectures; previously the training call was tied to the LSTM architecture
- Added options for the output format of the data generator
----
<!-- .slide: data-background="#1A237E" -->
```
# create LSTM model
model_lstm <- create_model_lstm_cnn(
  maxlen = 50,
  layer.size = 32,
  layers.lstm = 2,
  learning.rate = 0.001,
  num_targets = 4)

# start training
trainNetwork(model = model_lstm,
  path = "/train/path",
  path.val = "/test/path",
  train_type = "lm",
  run.name = "lstm",
  batch.size = 32,
  epochs = 15,
  tensorboard.log = "/tensorboard/path")
```
----
<!-- .slide: data-background="#1A237E" -->
```
# create wavenet model
model_wavenet <- create_model_wavenet(
  residual_blocks = 2,
  maxlen = 1000,
  output_channels = 4,
  learning.rate = 0.001)

# start training
trainNetwork(model = model_wavenet,
  path = "/train/path",
  path.val = "/test/path",
  train_type = "lm",
  run.name = "wavenet",
  batch.size = 32,
  epochs = 15,
  tensorboard.log = "/tensorboard/path",
  wavenet_format = TRUE)
```
----
<!-- .slide: data-background="#1A237E" -->
**Stateful LSTM**
- Normal (stateless) LSTM layers cannot see beyond the current sample
- Example: Given a sequence that repeats $000100010001 \dots$ and an input length of 2, a normal LSTM network cannot confidently predict the next character for the input $00$. A stateful LSTM network can look beyond the current batch and make confident predictions, since it "knows" what came before $00$ (either $10$ or $01$); a minimal sketch follows on the next slide.
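----
<!-- .slide: data-background="#1A237E" -->
A minimal keras sketch of what `stateful = TRUE` changes (all sizes are illustrative assumptions):
```
# with stateful = TRUE the cell state is carried over from one batch to the
# next instead of being reset, so the network remembers what preceded the
# current subsequence
library(keras)
model <- keras_model_sequential() %>%
  layer_lstm(units = 32, stateful = TRUE,
             batch_input_shape = c(1, 2, 4)) %>%   # (batch, timesteps, alphabet)
  layer_dense(units = 4, activation = "softmax")
# the state has to be reset manually, e.g. at the start of a new sequence
model %>% reset_states()
```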
----
<!-- .slide: data-background="#1A237E" -->
**Predict target in middle**
- Example: Given a sequence $s = x_1, \dots, x_n, y, x_{n+1}, \dots, x_m$, create two LSTM networks: one tries to predict the target $y$ from $x_1, \dots, x_n$, the second from the reversed right context $x_m, \dots, x_{n+1}$ (sketch on the next slide).
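----
<!-- .slide: data-background="#1A237E" -->
A minimal keras sketch of the two networks (sizes and the one-hot 4-letter alphabet are illustrative assumptions):
```
library(keras)
# one network reads x_1, ..., x_n; the other the reversed right context
# x_m, ..., x_{n+1}; both predict the middle target y
make_lstm <- function(timesteps = 50) {
  keras_model_sequential() %>%
    layer_lstm(units = 32, input_shape = c(timesteps, 4)) %>%
    layer_dense(units = 4, activation = "softmax")
}
model_left  <- make_lstm()   # trained on x_1, ..., x_n
model_right <- make_lstm()   # trained on x_m, ..., x_{n+1}
```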
----
<!-- .slide: data-background="#1A237E" -->
- So far the best architecture for the NCBI language model
- Previous models had a test accuracy of around 37% and very unstable learning curves
----
<!-- .slide: data-background="#1A237E" -->
**Research: Network pruning**
- Reduce the complexity of a network by deleting "insignificant" weights (toy illustration on the next slide)
- Can improve performance significantly
- Existing implementations are not yet usable from R
- Implementing it from scratch is too time-consuming at the moment
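----
<!-- .slide: data-background="#1A237E" -->
A toy illustration of the idea behind magnitude pruning (not a full pruning implementation; assumes an already trained keras `model`):
```
library(keras)
w <- get_weights(model)                   # list of weight matrices
threshold <- quantile(abs(w[[1]]), 0.5)   # e.g. the smallest 50% of weights
w[[1]][abs(w[[1]]) < threshold] <- 0      # "delete" insignificant weights
set_weights(model, w)
```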
----
<!-- .slide: data-background="#1A237E" -->
**Classification: CRISPR vs. random reads**
----
## Network interpretation
<!-- .slide: data-background="#1A237E" -->

----
<!-- .slide: data-background="#1A237E" -->
- Script for automated evaluation of "network interpretation"
- *Input:* a set of GFF and FASTA files for some genomes
- *Output:* a ranked list of neurons based on logistic regression (L1) plus density plots; every category in the GFF file will be evaluated (sketch on the next slide)
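----
<!-- .slide: data-background="#1A237E" -->
A minimal sketch of the neuron ranking step using `glmnet` (an assumed choice; the `activations` matrix and the binary `label` vector per GFF category are assumed to exist already):
```
library(glmnet)
# activations: positions x neurons, label: 0/1 for one GFF category (e.g. CDS)
fit <- cv.glmnet(activations, label, family = "binomial", alpha = 1)  # L1 penalty
coefs <- as.numeric(coef(fit, s = "lambda.min"))[-1]                  # drop intercept
ranking <- order(abs(coefs), decreasing = TRUE)                       # ranked neuron list
head(data.frame(neuron = ranking, weight = coefs[ranking]))
```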
----
<!-- .slide: data-background="#1A237E" -->
- Using Prokka, these are features such as _CDS_, _repeat_region_, _rRNA_, ...
- but we plan to add higher-level features by processing the GFF files
- Neuron activation for binary features
----
<!-- .slide: data-background="#1A237E" -->
**Neuron activation for binary features**
<p align="center">
<img height="550" src="https://i.imgur.com/v4LoOj5.png">
</p>
----
<!-- .slide: data-background="#1A237E" -->
**Neuron activation for binary features**
<p align="center">
<img height="550" src="https://i.imgur.com/oZneLfw.png">
</p>
----
<!-- .slide: data-background="#1A237E" -->
**Activations for random neurons**
<p align="center">
<img height="550" src="https://i.imgur.com/ZTupKkV.png">
</p>
---
## Manuscript draft
<!-- .slide: data-background="#1A237E" -->
- [overleaf](https://www.overleaf.com/6939532179nvstrnvyyyrh)
- [pdf](https://web.tresorit.com/l/NWf3p#afPxYs1lFAIqBCVUxDtjaQ)
- use optimized model architecture (in progress)
- thinking about a _null model_ to compare against for sequence search, similar to hmmer
---
## Transfer learning
<!-- .slide: data-background="#1A237E" -->

{"metaMigratedAt":"2023-06-15T07:18:04.622Z","metaMigratedFrom":"YAML","title":"GenomeNet Slide deck","breaks":true,"description":"View the slide with \"Slide Mode\".","slideOptions":"{\"transition\":\"slide\"}","contributors":"[{\"id\":\"09a36919-362c-4351-a08a-0ea036fee9d6\",\"add\":6761,\"del\":3407},{\"id\":\"6c2e9acb-a863-46f0-8240-af54ded6ea16\",\"add\":11948,\"del\":8026}]"}