---
tags: Manuscript drafts
---
<!-- please leave 'Howto' in the task section, you can also add more separated by ',' -->
MS: A benchmark for genomic language models
===
<!-- add a few sentences in tl;dr style -->
[TOC]
:::info
**Notes:**
- We should discuss the scope of the benchmark; maybe it makes sense to just focus on the LM without any particular prediction task (except imputation?) [name=Philipp]
- Not sure which datasets we should supply, since the total training dataset is immensely big? [name=Philipp]
- We should automate all benchmarks, e.g. a function that takes the trained `h5` model and outputs all metrics, probably in a new GitHub repo?
:::
Introduction
---
- DL methods are effective, and language models of genomic sequences can help uncover patterns
- it is not known which architectures work well on this kind of data
- Despite these advances, it is difficult to measure or compare the utility of DL in genomics due to a lack of standard evaluation criteria and benchmarking datasets
- the lack of benchmarking data results in a set of poorly understood tradeoffs
- end-to-end approach: we measure not only the time per iteration or throughput, but also the performance of training (e.g. time, cost) and inference
- This release of the benchmark provides end-to-end learning and inference tasks, including a training set consisting of genomes
- We present benchmark specification and goals
- We present benchmark results of different architectures we tested for the specific tasks and training tradeoffs
Benchmark overview
---
### Datasets
:::info
**Note**: Maybe just focus on PRO and EUK for now and extend later
:::
- build on existing public datasets
- two types of data
- full (or partial) genomes with large contigs
- read-sized fragments (150 nucleotides)
- three biological kingdoms
| Category | Dataset |
| -------- | -------- |
| High quality prokaryotic genomes | PRO |
| Read-sized prokaryotic genomic fragments | PRO~reads~ |
| High quality eukaryotic genomes | EUK |
| Read-sized eukaryotic genomic fragments | EUK~reads~ |
| High quality virus genomes | VIR |
| Read-sized virus genomic fragments | VIR~reads~ |
**Table 1:** Datasets
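The read-sized datasets can be derived from the full genomes by cutting each contig into fixed-length 150-nucleotide windows. A minimal sketch of that preprocessing step (the function name and the non-overlapping stride are illustrative assumptions, not the actual pipeline code):

```python
def fragment_genome(sequence, fragment_length=150, stride=150):
    """Cut a contig into read-sized, non-overlapping fragments.

    A trailing fragment shorter than `fragment_length` is dropped,
    so every sample has a fixed input size.
    """
    return [
        sequence[i:i + fragment_length]
        for i in range(0, len(sequence) - fragment_length + 1, stride)
    ]

# Example: a 400-nt contig yields two complete 150-nt fragments
# (the trailing 100 nt are discarded).
contig = "ACGT" * 100  # 400 nucleotides
fragments = fragment_genome(contig)
```

An overlapping stride (e.g. `stride=75`) would instead produce an augmented, partially redundant fragment set.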
### Tasks
:::info
**Note**: Probably we should implement this in a way that lets us extend these tasks later on, starting with just the "General transferability test" and "Imputation test"
:::
- imputation capability is the most meaningful parameter
- transferability test is the average predictive performance of the model on other datasets, e.g. a model trained on _EUK_ tested on _PRO_
- Rare species test is the average predictive performance on rare species
- low-level neuron interpretability tasks
- neuron importance test through neuron deletion
Note: use a word other than _interpretability_ (see the contrastive predictive coding paper for how they refer to this).
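For the single-nucleotide imputation test, one straightforward metric is the accuracy of recovering masked nucleotides. A sketch under the assumption that the model can predict the nucleotide at a masked position; the `predict_masked` callable and the majority-vote baseline are hypothetical stand-ins, not part of the benchmark code:

```python
import random

ALPHABET = "ACGT"

def imputation_accuracy(sequences, predict_masked, n_masks=1, seed=0):
    """Mask random single nucleotides and score how often the model
    recovers them.

    `predict_masked(seq, pos)` is a hypothetical callable returning the
    model's predicted nucleotide at the masked position `pos` of `seq`.
    """
    rng = random.Random(seed)
    correct = total = 0
    for seq in sequences:
        for _ in range(n_masks):
            pos = rng.randrange(len(seq))
            truth = seq[pos]
            masked = seq[:pos] + "N" + seq[pos + 1:]  # hide the true base
            if predict_masked(masked, pos) == truth:
                correct += 1
            total += 1
    return correct / total

# Toy baseline: always predict the most frequent nucleotide in the fragment.
def majority_predictor(seq, pos):
    counts = {b: seq.count(b) for b in ALPHABET}
    return max(counts, key=counts.get)

score = imputation_accuracy(["A" * 20] * 5, majority_predictor)
```

The gene-sized variant (imputation test 2) would mask a contiguous window instead of a single position and could score per-position accuracy over the window.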
| Task | Metric |
| -------- | -------- |
| General transferability test | |
| Rare species test | |
| Genome conservation test | |
| Imputation test 1 (of single nucleotides) | |
| Imputation test 2 (gene-sized fragements) | |
| Neuron interpretability test 1 (AA) | |
| Neuron interpretability test 2 (CDS) | |
| Neuron importance test | |
**Table 2:** Tasks
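The general transferability test can be reduced to a single number by averaging a model's predictive performance over all datasets it was *not* trained on, e.g. a model trained on _EUK_ scored on _PRO_ and _VIR_. A minimal sketch with placeholder scores:

```python
def transferability_score(per_dataset_scores, trained_on):
    """Average predictive performance on every dataset except the one
    the model was trained on."""
    held_out = {name: s for name, s in per_dataset_scores.items()
                if name != trained_on}
    return sum(held_out.values()) / len(held_out)

# Hypothetical per-dataset scores for a model trained on EUK:
scores = {"PRO": 0.61, "EUK": 0.83, "VIR": 0.55}
t = transferability_score(scores, trained_on="EUK")  # mean of PRO and VIR scores
```

The rare species test follows the same pattern, restricting the held-out evaluation sets to rare species.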
Results
---
We tested the following architectures
GenomeNet~LSTM~
: description
GenomeNet~stateful-LSTM~
: description
GenomeNet~CNN~
: description
GenomeNet~CNN+LSTM~
: description
GenomeNet~skip-connections~
: description
Prediction type:
- end, middle
---
| Task | PRO | EUK | VIR |
| -------- | -------- | -------- | -------- |
| Rare species test | ==x== | ==x== | ==x== |
| General transferability test | ==x== | ==x== | ==x== |
| Imputation test 1 | ==x== | ==x== | ==x== |
| Imputation test 2 | ==x== | ==x== | ==x== |
...
**Table 3:** Summary of performances of GenomeNet~LSTM~
Software
---
```r
transferabilityTest(model = "genomenet_cnn_lstm.hdf5")
```
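As noted above, all benchmarks should be callable behind a single automated entry point that takes one trained model and returns every metric. One possible shape for that interface, sketched in Python with placeholder task functions (the real package would load the trained `h5` model and dispatch each task from Table 2):

```python
def run_benchmark(predict_fn, tasks):
    """Run every registered benchmark task against one model and
    collect the metrics in a single report.

    `predict_fn` stands in for the trained model (e.g. loaded from an
    `h5` file); `tasks` maps task names to callables taking the model.
    """
    return {name: task(predict_fn) for name, task in tasks.items()}

# Placeholder task registry; real implementations would evaluate the
# model on the datasets in Table 1.
tasks = {
    "general_transferability": lambda model: 0.58,
    "imputation_single_nt": lambda model: 0.91,
}
report = run_benchmark(predict_fn=None, tasks=tasks)
```

A registry design like this makes the benchmark extensible: adding a task later (e.g. the neuron importance test) only requires registering one more callable, without changing the runner.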