---
tags: Meta
---
:::info
Notebook to collect ideas. Unconventional ideas are welcome too.
**Format:**
==Importance== (1-5) / author of the idea, to facilitate 1:1 discussions
:::
Ideas
===
Architecture
---
- [ ] ==5== **hyperparameter optimization** -- multi-fidelity approaches such as Hyperband / ASHA? (see the successive-halving sketch after this list) / [name=Martin]
    - Best run on the server with 3 T4 GPUs (==bioinf###==)
- NCBI complete genomes dataset (==path?==)
- [x] ==4== **stateful RNN** ([R example](https://philipperemy.github.io/keras-stateful-lstm/)). Need to make sure that samples in batches are ordered by their occurrence in the genome (see the stateful LSTM sketch after this list) / [name=Martin]
- [ ] ==?== **"Continuize" the architecture landscape**? Currently the goal is to have one implemention of e.g. LSTM/CNN and one for WaveNet and one for Transformer, usw. But for optimization I guess it would be nice to either combine these architectures somehow or to implement them as "one big architecture" with a lot of arguments that e.g. controls the skip-connections? Not sure if this is helpful/feasible at all. / [name=Philipp]
    - The most prominent instance of this that I know is "[DARTS](https://arxiv.org/abs/1806.09055)"; it has the problem that it takes a lot of GPU memory. Then there is "[ProxylessNAS](https://openreview.net/forum?id=HylVB3AqYm)", which tries to reduce the memory requirement by replacing the continuous relaxation with a probabilistic one. I really like these approaches and think they would go somewhere and we should try them, but they are not simple. [name=Martin]
- [ ] ==5== **transformer**? Not sure what the best way is to implement it in R Keras (see the attention-block sketch after this list)
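
A minimal successive-halving sketch (the core of Hyperband, which ASHA runs asynchronously) in plain R; `sample_config()` and `train_eval()` are hypothetical placeholders for our search space and for a training run under a given epoch budget:

```R
# Successive halving: start many configs on a small budget, repeatedly
# keep the best 1/eta fraction and multiply the budget by eta.
successive_halving <- function(sample_config, train_eval,
                               n_configs = 27, min_budget = 1, eta = 3) {
  configs <- lapply(seq_len(n_configs), function(i) sample_config())
  budget <- min_budget
  while (length(configs) > 1) {
    # train_eval(config, budget) is assumed to return a validation loss
    losses <- vapply(configs, train_eval, numeric(1), budget = budget)
    keep <- order(losses)[seq_len(max(1, floor(length(configs) / eta)))]
    configs <- configs[keep]
    budget <- budget * eta
  }
  configs[[1]]
}
```

Hyperband proper would wrap this in an outer loop over several `(n_configs, min_budget)` brackets, to hedge against configs that only pay off at larger budgets.
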
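A minimal sketch of the stateful setup in R Keras (batch size and window length are arbitrary here). With `stateful = TRUE`, Keras carries the state of batch row i over to row i of the next batch, which is exactly why the generator must order samples by genome position:

```R
library(keras)

batch_size <- 32
timesteps  <- 100

model <- keras_model_sequential() %>%
  layer_lstm(units = 128, stateful = TRUE, return_sequences = TRUE,
             # stateful layers require a fixed batch size
             batch_input_shape = c(batch_size, timesteps, 4)) %>%
  layer_dense(units = 4, activation = "softmax")

# Row i of consecutive batches must hold consecutive genome windows
# (i.e. 32 parallel streams through the genome); after a full pass,
# clear the carried-over states:
reset_states(model)
```
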
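One concrete reading of the "one big architecture" idea: a single builder function whose arguments span the design space (block type, depth, skip connections), so the hyperparameter optimization above can search over architecture choices directly. A rough sketch, all argument names are ours:

```R
library(keras)

build_model <- function(input_len = 100,
                        block = c("lstm", "cnn"),
                        n_blocks = 2,
                        units = 128,
                        use_skip = TRUE) {
  block <- match.arg(block)
  input <- layer_input(shape = c(input_len, 4))
  x <- input
  for (i in seq_len(n_blocks)) {
    h <- switch(block,
      lstm = x %>% layer_lstm(units = units, return_sequences = TRUE),
      cnn  = x %>% layer_conv_1d(filters = units, kernel_size = 9,
                                 padding = "same", activation = "relu"))
    # the skip connection becomes just another searchable argument
    # (shapes only match from the second block on)
    x <- if (use_skip && i > 1) layer_add(list(x, h)) else h
  }
  output <- x %>% layer_dense(units = 4, activation = "softmax")
  keras_model(input, output)
}
```
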
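For the transformer item, a sketch of a single encoder block in R Keras, assuming a keras/TensorFlow version recent enough to ship `layer_multi_head_attention` (TF >= 2.4); on older versions the attention would have to be hand-rolled from backend ops. Positional encodings are omitted:

```R
library(keras)

transformer_block <- function(x, embed_dim, num_heads = 4, ff_dim = 128) {
  # self-attention: query and value are the same sequence
  att <- layer_multi_head_attention(list(x, x), num_heads = num_heads,
                                    key_dim = embed_dim %/% num_heads)
  x <- layer_layer_normalization(layer_add(list(x, att)))
  ff <- x %>%
    layer_dense(units = ff_dim, activation = "relu") %>%
    layer_dense(units = embed_dim)  # back to embedding width for the residual
  layer_layer_normalization(layer_add(list(x, ff)))
}

# usage: input already embedded to width embed_dim (plus positional encoding)
inp <- layer_input(shape = c(100, 64))
out <- transformer_block(inp, embed_dim = 64)
```
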
Explainability
---
- [ ] ==3== function that takes an `h5` file of neuron responses and calculates the **neuron-wise correlation with a set of known features**, and outputs the neurons with the highest correlation coefficients. E.g. check _if_ neurons are activated exclusively on low-level features (within a CDS or within the same amino acid). Alternatively, detect a neuron / set of neurons that can predict these low-level features as in [^1]. For this, a dataset with the low-level features needs to be created (see the implementation sketch below the example output). / [name=Philipp]
- We'd have to take care to correct for multiple testing with lots of neurons... [name=Martin]
    - Could try other models, e.g. fit an L1-penalized lm on the outcomes we want [name=Martin]
```R
> identifyExplainableNeurons(states = "ecoli.h5", targetFile = "ecoli_cds.txt")
Neuron 4: Accuracy = 0.94
Neuron 112: Accuracy = 0.89
...
```
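
A rough sketch of how this could look internally, assuming (our invention) that the h5 file stores a positions x neurons matrix in a dataset called `"states"` and that the target file holds one 0/1 label per position; it applies the multiple-testing correction raised above:

```R
library(rhdf5)

identifyExplainableNeurons <- function(states, targetFile, top = 10) {
  S <- h5read(states, "states")             # positions x neurons (assumed)
  y <- scan(targetFile, what = integer())   # 0/1 label per position
  tests <- lapply(seq_len(ncol(S)), function(j) cor.test(S[, j], y))
  r <- vapply(tests, function(t) unname(t$estimate), numeric(1))
  p <- p.adjust(vapply(tests, function(t) t$p.value, numeric(1)),
                method = "BH")              # correct for many neurons
  ord <- order(abs(r), decreasing = TRUE)[seq_len(top)]
  data.frame(neuron = ord, cor = r[ord], p_adj = p[ord])
}
```

The L1 alternative would replace the per-neuron tests with a single `glmnet(S, y, family = "binomial")` fit and read off the neurons with non-zero coefficients.
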
- [ ] ==?== Check for neuron response to phenotypes
- general phenotype association - [data](http://protraits.irb.hr)
    - CRISPR cassette orientation - [publication](https://www.frontiersin.org/articles/10.3389/fmicb.2019.02054/full)
    - check if there are neurons firing *before* the CRISPR cassette starts, i.e. that *predict* the occurrence of a CRISPR cassette (see the upstream-window sketch after this list). There are publications claiming this is an AT-rich sequence
    - is there an **Acquisition Affecting Motif** (AAM) as presented here: https://www.pnas.org/content/pnas/110/35/14396.full.pdf
- [ ] ==2== add an option so that inference not only saves one particular layer as h5 **but all layers**, while keeping the layer order (three-dimensional list), e.g. by having the inference function handle a list of layers (see the multi-output sketch after this list) / [name=Philipp]
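
For the "neurons firing before the cassette" idea above, a plain-R sketch of one possible test; the activation matrix `S` (positions x neurons) and the vector of cassette start positions are hypothetical inputs:

```R
# For each neuron, compare activation in windows upstream of CRISPR
# cassette starts against random background positions.
upstream_signal <- function(S, cassette_starts, window = 200, n_bg = 1000) {
  starts <- cassette_starts[cassette_starts > window]
  upstream <- unlist(lapply(starts, function(s) seq(s - window, s - 1)))
  background <- sample(setdiff(seq_len(nrow(S)), upstream), n_bg)
  p <- apply(S, 2, function(a)
    wilcox.test(a[upstream], a[background])$p.value)
  p.adjust(p, method = "BH")  # again, correct for the many neurons
}
```
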
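For the all-layers option, Keras makes this fairly direct: wrap the trained model in a new model that returns every layer's output, then write one h5 dataset per layer. A sketch; `model` and `x` stand for the trained model and an input batch, and the file layout is our invention:

```R
library(keras)
library(rhdf5)

# a model that returns the output of every layer, preserving layer order
multi_out <- keras_model(
  inputs  = model$input,
  outputs = lapply(model$layers, function(l) l$output)
)
states <- predict(multi_out, x)   # list: one array per layer, in order

# one h5 dataset per layer, indexed to keep the original order
h5createFile("states_all.h5")
for (i in seq_along(states)) {
  h5write(states[[i]], "states_all.h5", paste0("layer_", i))
}
```
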
Pretraining
---
- [ ] ==2== Train an LM on **3D structures** found in folding-structure databases and seed/add this model as a layer in GenomeNet? Would be complicated since these structures have other representations than $[A, C, G, T]$? Or is there any known work that uses such "unrelated" representations to seed a new model? Martin mentioned this? / [name=Philipp]
- Maybe via [Encoder-Decoder](https://keras.rstudio.com/articles/examples/nmt_attention.html) model? [name=Philipp]
Datasets
---
- [ ] ==3== **Datasets for explainability**: genome sequence (e.g. `ACGTACAAAGATAA`) binarized for coding sequences (e.g. `00011111111100` if the CDS spans the codons `TAC AAA GAT`). Same for all amino acids? E.g. one dataset for the amino acid *Lysine* would be `ACGTACAAAGATAA` -> `00000011100000`, since the codons `AAA` and `AAG` encode Lysine and the in-frame `AAA` sits at positions 7-9. If this is outside a coding sequence, then everything should be `0` (or maybe two datasets, one with everything set to zero and one without). See the mask-building sketch below. / [name=Philipp]
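
A sketch of building both mask types from a sequence plus CDS coordinates; the function and argument names are ours, and the codon table is restricted to Lysine:

```R
# Binary masks over a nucleotide sequence: one marking CDS positions,
# one marking in-frame codons of a given amino acid (Lysine: AAA, AAG).
amino_mask <- function(seq, cds_start, cds_end, codons = c("AAA", "AAG")) {
  n <- nchar(seq)
  cds <- integer(n)
  cds[cds_start:cds_end] <- 1L

  aa <- integer(n)
  for (p in seq(cds_start, cds_end - 2, by = 3)) {  # walk in-frame codons
    if (substr(seq, p, p + 2) %in% codons) aa[p:(p + 2)] <- 1L
  }
  list(cds = cds, lysine = aa)
}

amino_mask("ACGTACAAAGATAA", cds_start = 4, cds_end = 12)
# $cds:    0 0 0 1 1 1 1 1 1 1 1 1 0 0
# $lysine: 0 0 0 0 0 0 1 1 1 0 0 0 0 0
```
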
References
---
[^1]: Radford, Jozefowicz, Sutskever: Learning to Generate Reviews and Discovering Sentiment. https://arxiv.org/pdf/1704.01444.pdf