---
tags: Manuscript drafts
---
<!-- please leave 'Howto' in the task section, you can also add more separated by ',' -->

MS: A benchmark for genomic language models
===

<!-- add a few sentences in tl;dr style -->

[TOC]

:::info
**Notes:**
- We should discuss the scope of the benchmark; maybe it makes sense to focus just on language modeling without any particular prediction task (except imputation?) [name=Philipp]
- Not sure which datasets we should supply, since the total training dataset is immensely big. [name=Philipp]
- We should automate all benchmarks, e.g. a single function that takes the trained `h5` model and outputs all metrics, probably in a new GitHub repo.
:::

Introduction
---

- DL methods are effective, and language models of genomic sequence can help to uncover patterns
- it is not known which architectures work well on this kind of data
- despite these advances, it is difficult to measure or compare the utility of DL in genomics due to a lack of standard evaluation criteria and benchmarking datasets
- the lack of benchmarking data results in a set of poorly understood tradeoffs
- end-to-end approach: we measure not only the time per iteration or throughput, but also the performance of training (e.g.
time, cost) and inference
- this release of the benchmark provides end-to-end learning and inference tasks, including a training set consisting of genomes
- we present the benchmark specification and goals
- we present benchmark results of the different architectures we tested for the specific tasks, along with training tradeoffs

Benchmark overview
---

### Datasets

:::info
**Note**: Maybe focus on PRO and EUK for now and extend later
:::

- build on existing public datasets
- two types of data
  - full (or partial) genomes with large contigs
  - read-sized fragments (150 nucleotides)
- three biological kingdoms

| Category | Dataset |
| -------- | -------- |
| High-quality prokaryotic genomes | PRO |
| Read-sized prokaryotic genomic fragments | PRO~reads~ |
| High-quality eukaryotic genomes | EUK |
| Read-sized eukaryotic genomic fragments | EUK~reads~ |
| High-quality virus genomes | VIR |
| Read-sized virus genomic fragments | VIR~reads~ |

**Table 1:** Datasets

### Tasks

:::info
**Note**: Probably we should implement this in a way that lets us extend the set of tasks later on, starting with just the "General transferability test" and the "Imputation test"
:::

- imputation capability is the most meaningful parameter
- the transferability test is the average predictive performance of the model on other datasets, e.g. a model trained on _EUK_ tested on _PRO_
- the rare species test is the average predictive performance on rare species
- low-level neuron interpretability tasks
- neuron importance test through neuron deletion

Note: use another word than _interpretability_ (see the contrastive predictive coding paper for what they call this).
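The transferability test described above can be sketched in a few lines. This is an illustrative Python sketch, not the benchmark implementation; the function name `transferability_score` and the numbers are hypothetical placeholders.

```python
# Illustrative sketch (not the benchmark implementation): the general
# transferability score of a model trained on one dataset is the average
# of its predictive performance on all *other* datasets.

def transferability_score(train_set, performance):
    """Average performance over every dataset except the training set.

    `performance` maps dataset name -> predictive performance (e.g.
    accuracy) of the model that was trained on `train_set`.
    """
    others = [v for name, v in performance.items() if name != train_set]
    return sum(others) / len(others)

# Hypothetical numbers for a model trained on EUK, evaluated everywhere:
scores = {"EUK": 0.92, "PRO": 0.61, "VIR": 0.55}
avg = transferability_score("EUK", scores)  # mean over PRO and VIR only
```

The in-domain score (here _EUK_) is deliberately excluded, so the metric isolates how well the learned representation carries over to unseen kingdoms.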
| Task | Metric |
| -------- | -------- |
| General transferability test | |
| Rare species test | |
| Genome conservation test | |
| Imputation test 1 (single nucleotides) | |
| Imputation test 2 (gene-sized fragments) | |
| Neuron interpretability test 1 (AA) | |
| Neuron interpretability test 2 (CDS) | |
| Neuron importance test | |

**Table 2:** Tasks

Results
---

We tested the following architectures:

GenomeNet~LSTM~
: description

GenomeNet~stateful-LSTM~
: description

GenomeNet~CNN~
: description

GenomeNet~CNN+LSTM~
: description

GenomeNet~skip-connections~
: description

Prediction type:
- end, middle

---

| Task | PRO | EUK | VIR |
| -------- | -------- | -------- | -------- |
| Rare species test | ==x== | ==x== | ==x== |
| General transferability test | ==x== | ==x== | ==x== |
| Imputation test 1 | ==x== | ==x== | ==x== |
| Imputation test 2 | ==x== | ==x== | ==x== |
| ... | | | |

**Table 3:** Summary of performances of GenomeNet~LSTM~

Software
---

```R!
transferabilityTest(model = "genomenet_cnn_lstm.hdf5")
```
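Imputation test 1 could be automated along the same lines as the function above. The following Python sketch shows the scoring logic only: mask single nucleotides, ask a model to fill them in, and report exact-recovery accuracy. The stand-in `predict_nt` is a trivial majority-vote predictor used purely for demonstration; the real benchmark would plug in the trained model instead.

```python
# Illustrative sketch of "Imputation test 1": mask single nucleotides in a
# sequence, impute them, and report the fraction recovered exactly.
from collections import Counter

def predict_nt(context):
    """Dummy stand-in model: predict the most frequent visible nucleotide."""
    visible = [c for c in context if c != "N"]
    return Counter(visible).most_common(1)[0][0]

def imputation_accuracy(seq, positions, model=predict_nt):
    """Mask each position in turn, impute it, and score exact recovery."""
    hits = 0
    for pos in positions:
        masked = seq[:pos] + "N" + seq[pos + 1:]  # hide the true nucleotide
        hits += model(masked) == seq[pos]
    return hits / len(positions)

acc = imputation_accuracy("AAAAGAAA", positions=[2, 4])
```

Imputation test 2 would follow the same pattern with gene-sized masked fragments instead of single positions.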