---
tags: Meta
---
:::info
Notebook to collect ideas. Unconventional ideas are welcome too.
**Format:**
==Importance== (1-5) / author of the idea, to facilitate 1:1 discussions
:::
Ideas
===
Architecture
---
- [ ] ==5== **hyperparameter optimization** -- multi-fidelity approaches such as Hyperband / ASHA? (see the successive-halving sketch after this list) / [name=Martin]
    - Best run on the server with 3 T4 GPUs (==bioinf###==)
- NCBI complete genomes dataset (==path?==)
- [x] ==4== **stateful RNN** ([R example](https://philipperemy.github.io/keras-stateful-lstm/)). Need to make sure that samples in batches are ordered by their occurrence in the genome (see the stateful LSTM sketch after this list) / [name=Martin]
- [ ] ==?== **"Continuize" the architecture landscape**? Currently the goal is to have one implemention of e.g. LSTM/CNN and one for WaveNet and one for Transformer, usw. But for optimization I guess it would be nice to either combine these architectures somehow or to implement them as "one big architecture" with a lot of arguments that e.g. controls the skip-connections? Not sure if this is helpful/feasible at all. / [name=Philipp]
    - The most prominent instance of this that I know is "[DARTS](https://arxiv.org/abs/1806.09055)"; it has the problem that it takes a lot of GPU memory. Then there is "[ProxylessNAS](https://openreview.net/forum?id=HylVB3AqYm)", which tries to reduce the memory requirement by replacing the continuous relaxation with a probabilistic one. I really like these approaches and think they would go somewhere and we should try them, but they are not simple. [name=Martin]
- [ ] ==5== **transformer**? Not sure what the best way is to implement it in R Keras (see the attention-block sketch after this list)
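
A minimal successive-halving sketch (the core of Hyperband, which ASHA runs asynchronously) in plain R; `sample_config()` and `train_eval()` are hypothetical placeholders for our search space and for a training run under a given epoch budget:

```R
# Successive halving: start many configs on a small budget, repeatedly
# keep the best 1/eta fraction and multiply the budget by eta.
successive_halving <- function(sample_config, train_eval,
                               n_configs = 27, min_budget = 1, eta = 3) {
  configs <- lapply(seq_len(n_configs), function(i) sample_config())
  budget <- min_budget
  while (length(configs) > 1) {
    # train_eval(config, budget) is assumed to return a validation loss
    losses <- vapply(configs, train_eval, numeric(1), budget = budget)
    keep <- order(losses)[seq_len(max(1, floor(length(configs) / eta)))]
    configs <- configs[keep]
    budget <- budget * eta
  }
  configs[[1]]
}
```

Hyperband proper would wrap this in an outer loop over several `(n_configs, min_budget)` brackets, to hedge against configs that only pay off at larger budgets.
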
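A minimal sketch of the stateful setup in R Keras (batch size and window length are arbitrary here). With `stateful = TRUE`, Keras carries the state of batch row i over to row i of the next batch, which is exactly why the generator must order samples by genome position:

```R
library(keras)

batch_size <- 32
timesteps  <- 100

model <- keras_model_sequential() %>%
  layer_lstm(units = 128, stateful = TRUE, return_sequences = TRUE,
             # stateful layers require a fixed batch size
             batch_input_shape = c(batch_size, timesteps, 4)) %>%
  layer_dense(units = 4, activation = "softmax")

# Row i of consecutive batches must hold consecutive genome windows
# (i.e. 32 parallel streams through the genome); after a full pass,
# clear the carried-over states:
reset_states(model)
```
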
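One concrete reading of the "one big architecture" idea: a single builder function whose arguments span the design space (block type, depth, skip connections), so the hyperparameter optimization above can search over architecture choices directly. A rough sketch, all argument names are ours:

```R
library(keras)

build_model <- function(input_len = 100,
                        block = c("lstm", "cnn"),
                        n_blocks = 2,
                        units = 128,
                        use_skip = TRUE) {
  block <- match.arg(block)
  input <- layer_input(shape = c(input_len, 4))
  x <- input
  for (i in seq_len(n_blocks)) {
    h <- switch(block,
      lstm = x %>% layer_lstm(units = units, return_sequences = TRUE),
      cnn  = x %>% layer_conv_1d(filters = units, kernel_size = 9,
                                 padding = "same", activation = "relu"))
    # the skip connection becomes just another searchable argument
    # (shapes only match from the second block on)
    x <- if (use_skip && i > 1) layer_add(list(x, h)) else h
  }
  output <- x %>% layer_dense(units = 4, activation = "softmax")
  keras_model(input, output)
}
```
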
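For the transformer item, a sketch of a single encoder block in R Keras, assuming a keras/TensorFlow version recent enough to ship `layer_multi_head_attention` (TF >= 2.4); on older versions the attention would have to be hand-rolled from backend ops. Positional encodings are omitted:

```R
library(keras)

transformer_block <- function(x, embed_dim, num_heads = 4, ff_dim = 128) {
  # self-attention: query and value are the same sequence
  att <- layer_multi_head_attention(list(x, x), num_heads = num_heads,
                                    key_dim = embed_dim %/% num_heads)
  x <- layer_layer_normalization(layer_add(list(x, att)))
  ff <- x %>%
    layer_dense(units = ff_dim, activation = "relu") %>%
    layer_dense(units = embed_dim)  # back to embedding width for the residual
  layer_layer_normalization(layer_add(list(x, ff)))
}

# usage: input already embedded to width embed_dim (plus positional encoding)
inp <- layer_input(shape = c(100, 64))
out <- transformer_block(inp, embed_dim = 64)
```
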
Explainability
---
- [ ] ==3== function that takes an `h5` file of neuron responses and calculates the **neuron-wise correlation with a set of known features**, and outputs the neurons with the highest correlation coefficients. E.g. check _if_ neurons are activated exclusively on low-level features (within a CDS or within the same amino acid). Alternatively, detect a neuron / set of neurons that can predict these low-level features as in [^1]. For this, a dataset with the low-level features needs to be created (see the implementation sketch below the example output). / [name=Philipp]
- We'd have to take care to correct for multiple testing with lots of neurons... [name=Martin]
    - Could try other models, e.g. fit an L1-penalized lm on the outcomes we want [name=Martin]
```R
> identifyExplainableNeurons(states = "ecoli.h5", targetFile = "ecoli_cds.txt")
Neuron 4: Accuracy = 0.94
Neuron 112: Accuracy = 0.89
...
```
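
A rough sketch of how this could look internally, assuming (our invention) that the h5 file stores a positions x neurons matrix in a dataset called `"states"` and that the target file holds one 0/1 label per position; it applies the multiple-testing correction raised above:

```R
library(rhdf5)

identifyExplainableNeurons <- function(states, targetFile, top = 10) {
  S <- h5read(states, "states")             # positions x neurons (assumed)
  y <- scan(targetFile, what = integer())   # 0/1 label per position
  tests <- lapply(seq_len(ncol(S)), function(j) cor.test(S[, j], y))
  r <- vapply(tests, function(t) unname(t$estimate), numeric(1))
  p <- p.adjust(vapply(tests, function(t) t$p.value, numeric(1)),
                method = "BH")              # correct for many neurons
  ord <- order(abs(r), decreasing = TRUE)[seq_len(top)]
  data.frame(neuron = ord, cor = r[ord], p_adj = p[ord])
}
```

The L1 alternative would replace the per-neuron tests with a single `glmnet(S, y, family = "binomial")` fit and read off the neurons with non-zero coefficients.
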
- [ ] ==?== Check for neuron response to phenotypes
- general phenotype association - [data](http://protraits.irb.hr)
    - CRISPR cassette orientation - [publication](https://www.frontiersin.org/articles/10.3389/fmicb.2019.02054/full)
    - check if there are neurons firing *before* the CRISPR cassette starts, i.e. that *predict* the occurrence of a CRISPR cassette (see the upstream-window sketch after this list). There are publications claiming this is an AT-rich sequence
    - is there an **Acquisition Affecting Motif** (AAM) as presented here: https://www.pnas.org/content/pnas/110/35/14396.full.pdf
- [ ] ==2== add an option so that inference not only saves one particular layer as h5 **but all layers**, while keeping the layer order (three-dimensional list), e.g. by having the inference function handle a list of layers (see the multi-output sketch after this list) / [name=Philipp]
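
For the "neurons firing before the cassette" idea above, a plain-R sketch of one possible test; the activation matrix `S` (positions x neurons) and the vector of cassette start positions are hypothetical inputs:

```R
# For each neuron, compare activation in windows upstream of CRISPR
# cassette starts against random background positions.
upstream_signal <- function(S, cassette_starts, window = 200, n_bg = 1000) {
  starts <- cassette_starts[cassette_starts > window]
  upstream <- unlist(lapply(starts, function(s) seq(s - window, s - 1)))
  background <- sample(setdiff(seq_len(nrow(S)), upstream), n_bg)
  p <- apply(S, 2, function(a)
    wilcox.test(a[upstream], a[background])$p.value)
  p.adjust(p, method = "BH")  # again, correct for the many neurons
}
```
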
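For the all-layers option, Keras makes this fairly direct: wrap the trained model in a new model that returns every layer's output, then write one h5 dataset per layer. A sketch; `model` and `x` stand for the trained model and an input batch, and the file layout is our invention:

```R
library(keras)
library(rhdf5)

# a model that returns the output of every layer, preserving layer order
multi_out <- keras_model(
  inputs  = model$input,
  outputs = lapply(model$layers, function(l) l$output)
)
states <- predict(multi_out, x)   # list: one array per layer, in order

# one h5 dataset per layer, indexed to keep the original order
h5createFile("states_all.h5")
for (i in seq_along(states)) {
  h5write(states[[i]], "states_all.h5", paste0("layer_", i))
}
```
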
Pretraining
---
- [ ] ==2== Train an LM on **3D structures** found in folding-structure databases and seed/add this model as a layer in GenomeNet? Would be complicated since these structures have other representations than $[A, C, G, T]$? Or is there any known work that uses such "unrelated" representations to seed a new model? Martin mentioned this? / [name=Philipp]
- Maybe via [Encoder-Decoder](https://keras.rstudio.com/articles/examples/nmt_attention.html) model? [name=Philipp]
Datasets
---
- [ ] ==3== **Datasets for explainability**: genome sequence (e.g. `ACGTACAAAGATAA`) binarized for coding sequences (e.g. `00011111111100` if the CDS spans the codons `TAC AAA GAT`). Same for all amino acids? E.g. one dataset for the amino acid *Lysine* would be `ACGTACAAAGATAA` -> `00000011100000`, since the codons `AAA` and `AAG` encode Lysine and the in-frame `AAA` sits at positions 7-9. If this is outside a coding sequence, then everything should be `0` (or maybe two datasets, one with everything set to zero and one without). See the mask-building sketch below. / [name=Philipp]
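
A sketch of building both mask types from a sequence plus CDS coordinates; the function and argument names are ours, and the codon table is restricted to Lysine:

```R
# Binary masks over a nucleotide sequence: one marking CDS positions,
# one marking in-frame codons of a given amino acid (Lysine: AAA, AAG).
amino_mask <- function(seq, cds_start, cds_end, codons = c("AAA", "AAG")) {
  n <- nchar(seq)
  cds <- integer(n)
  cds[cds_start:cds_end] <- 1L

  aa <- integer(n)
  for (p in seq(cds_start, cds_end - 2, by = 3)) {  # walk in-frame codons
    if (substr(seq, p, p + 2) %in% codons) aa[p:(p + 2)] <- 1L
  }
  list(cds = cds, lysine = aa)
}

amino_mask("ACGTACAAAGATAA", cds_start = 4, cds_end = 12)
# $cds:    0 0 0 1 1 1 1 1 1 1 1 1 0 0
# $lysine: 0 0 0 0 0 0 1 1 1 0 0 0 0 0
```
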
References
---
[^1]: Radford, Jozefowicz, Sutskever: Learning to Generate Reviews and Discovering Sentiment. https://arxiv.org/pdf/1704.01444.pdf