---
title: 'Omnibenchmark projects'
disqus: hackmd
---
Omnibenchmark projects
===
## Table of Contents
[TOC]
## Development
### Project 1: Omnibenchmark CLI
#### Background:
Each of the omnibenchmark modules is a self-contained GitLab project that comes with input and output datasets (bundled as "renku datasets"), a renku workflow that can be exported as a CWL file or a Python class object, and a container to execute the workflow in. All of these components can be queried and accessed across the entire benchmark via the renku and GitLab APIs.
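As an illustration of the GitLab API side, listing all projects under a benchmark group could look like the sketch below. The API root for the GitLab behind renkulab.io and the group path are assumptions; the endpoint itself is GitLab's standard `GET /groups/:id/projects`.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed API root of the GitLab instance behind renkulab.io
GITLAB_API = "https://renkulab.io/gitlab/api/v4"

def group_projects_url(group_path: str, page: int = 1, per_page: int = 100) -> str:
    """URL for GitLab's `GET /groups/:id/projects`; the group path is
    URL-encoded so nested groups like 'omnibenchmark/omni-batch-py' work."""
    return (f"{GITLAB_API}/groups/{quote(group_path, safe='')}/projects"
            f"?include_subgroups=true&page={page}&per_page={per_page}")

def list_group_projects(group_path: str, token: str = ""):
    """Fetch one page of projects; private projects need a personal access token."""
    headers = {"PRIVATE-TOKEN": token} if token else {}
    with urlopen(Request(group_projects_url(group_path), headers=headers)) as resp:
        return json.load(resp)
```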
#### Aims:
1. Develop a CLI to query and access datasets and workflows project-wide across omnibenchmark.
2. Execute workflows (from CWL files or renku CLI commands) via the CLI with existing inputs/outputs, but without generating renku triples for the renku KG.
3. Extend renku-independent workflow execution to new inputs and outputs.
#### Suggested (not implemented) method signatures:
```bash=
# see https://github.com/omnibenchmark/omnibenchmark-py/blob/main/omnibenchmark/management/general_checks.py
$ omnicli query_benchmarks
["omni_clustering", "spatial_clustering"]
# see here https://renkulab.io/gitlab/omnibenchmark/omni-batch-py/orchestrator-py/-/blob/master/.gitlab-ci.yml
$ omnicli list_stages --benchmark='omni-batch-py'
["data_run", "process_run", "param_run", "method_run", "metric_run", "summary_run"]
# optional stage argument that defaults to all?
# see https://renkulab.io/gitlab/omnibenchmark/omni_site/src/data_gen/get_projects.py
$ omnicli list_projects --benchmark='omni-batch-py' --stage='param_run'
["https://renkulab.io/gitlab/omnibenchmark/omni-batch-py/omni-batch-param-py"]
# Optional arguments with "all" as defaults?
# https://github.com/omnibenchmark/omnibenchmark-py/blob/d5925cffe6fc9ed7a5d59d828fe12beeba246bf3/omnibenchmark/management/data_commands.py#L44
# maybe a flag about the return? names|ids|names + ids + keywords ?
$ omnicli query_renku_datasets --benchmark='omni-batch-py' --project='mnn-py' --keyword='omni_batch_method'
["https://renkulab.io/datasets/2203a15d700940e29ce418cf8fc263f3"]
$ omnicli download_renku_datasets --id='2203a15d700940e29ce418cf8fc263f3'
# Optional name argument with default to all
$ omnicli download_renku_datasets --project='mnn-py' --name='omni_batch_param'
$ omnicli download_renku_datasets --benchmark='omni-batch-py' --keyword='omni_batch_method'
# see https://renkulab.io/gitlab/omnibenchmark/omni-batch-py/csf-patients-py/-/blob/master/.renku/metadata/plans | renku workflow export (https://github.com/SwissDataScienceCenter/renku-python/blob/38aa53c3fa2a15976ff1cce68b5a21ca24df2078/renku/core/workflow/plan.py#L548)
$ omnicli run_workflow --benchmark='omni-batch-py' --project='mnn-py' --use_docker=yes --input_1="path/to/local/file" --input_2="path/to/metadata" --output_1="path/to/new/file"
```
#### Links:
[renku CLI](https://github.com/SwissDataScienceCenter/renku-python/blob/develop/renku/ui/cli/run.py)
[CWL](https://www.commonwl.org/)
[python click](https://click.palletsprojects.com/en/8.1.x/)
[omnibenchmark python](https://github.com/almutlue/omnibenchmark-py)
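A possible skeleton for such a CLI; python click (linked above) is the likely implementation choice, but the same subcommand layout is sketched here with stdlib argparse to stay dependency-free. All subcommand and option names simply mirror the suggested signatures above and are hypothetical:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Skeleton of the proposed `omnicli` subcommand layout (names hypothetical)."""
    parser = argparse.ArgumentParser(prog="omnicli")
    sub = parser.add_subparsers(dest="command", required=True)

    stages = sub.add_parser("list_stages", help="list the stages of a benchmark")
    stages.add_argument("--benchmark", required=True)

    query = sub.add_parser("query_renku_datasets", help="find renku datasets")
    query.add_argument("--benchmark")
    query.add_argument("--project")
    query.add_argument("--keyword")
    # flag about the return value, as suggested above: names | ids | full (names + ids + keywords)
    query.add_argument("--format", choices=["names", "ids", "full"], default="ids")
    return parser

# Example: omnicli list_stages --benchmark omni-batch-py
args = build_parser().parse_args(["list_stages", "--benchmark", "omni-batch-py"])
```

Each subparser would then dispatch to the corresponding renku/GitLab API call; click's group/command decorators give the same structure with less boilerplate.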
### Project 2: Renku CLI plugin
#### Background:
Renku tracks CLI commands by generating triples that are sent to a knowledge graph database. We host the triples on a Jena Fuseki SPARQL server. We query the server mainly to retrieve the metadata associated with file lineages: which parameters were used in combination with a method, which dataset it comes from, how many processing/method steps it underwent, etc. These metadata are retrieved with a collection of SPARQL queries (`omniSparql`, see below) at the end of each benchmark. Although functional, the queries could be optimized, as they currently work 'step-by-step' and undirected.
Renku has several plugin hooks that can be used to add metadata (triples) and commands to the Renku CLI. Such plugins can build on existing graph ontologies.
#### Aim:
To use the full potential of the renku KG, we want to add custom triples. This would be particularly useful for attaching file lineage triples to renku dataset files, which would enable input file matching in omnibenchmark based on lineage trees. To do so, we could write a renku CLI plugin that adds these triples whenever files are added to renku datasets.
Alternatively, the SPARQL queries in `omniSparql` could be optimized to retrieve and store the files lineage along the workflow.
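Either way, the lineage question boils down to SPARQL against the renku KG. A minimal sketch of what a transitive (rather than step-by-step) lineage query could look like, assuming PROV-O-style predicates; the exact graph shape here is an assumption, and the authoritative queries live in `omniSparql`:

```python
def lineage_query(file_path: str) -> str:
    """Build a SPARQL query that walks a file's lineage transitively.

    Predicate names follow PROV-O (prov:wasGeneratedBy / prov:used), which the
    renku ontology builds on; the graph shape is illustrative, not verified.
    """
    return f"""
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT DISTINCT ?ancestor WHERE {{
  ?entity prov:atLocation "{file_path}" .
  # One step gives the parents (inputs used by the generating activity);
  # the '+' property path makes the walk transitive in a single query,
  # instead of the current step-by-step approach.
  ?entity (prov:wasGeneratedBy/prov:used)+ ?ancestor .
}}
"""
```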
#### Links:
[renku CLI plugin](https://renku.readthedocs.io/en/stable/renku-python/docs/reference/plugins.html#develop-plugins-reference)
[renku CLI](https://renku.readthedocs.io/en/stable/renku-python/docs/reference/commands/index.html)
[renku ontology](https://swissdatasciencecenter.github.io/renku-ontology/)
[renku graph](https://github.com/SwissDataScienceCenter/renku-graph)
[omniSparql](https://github.com/omnibenchmark/omniSparql)
#### Queries
| Description | Use case | Reference |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **File lineage**: Dataset ids from all input files that were necessary to generate this file, across all projects, e.g., dataset ids from all inputs, the inputs to those inputs, etc. | File matching, benchmark summaries, benchmark "reconstruction" | [omni-sparql](https://github.com/omnibenchmark/omniSparql) |
| Sub-queries: **Parents**: Inputs used in the activity, where this file was generated (is an output) | "" | "" |
| Sub-queries: **File - renku dataset**: Dataset ID(s) (*maybe keyword as well?*) that a renku file belongs to. A file can belong to multiple renku datasets. | "" | "" |
| Sub-queries: **Renku dataset - project**: Project a renku dataset was generated in. There will be different versions of the same dataset, with different ids and projects linked; we want the project that generated the original version of the dataset. | dataset import, link lineages to projects (?) | [omnibenchmark-python/original dataset id](https://github.com/omnibenchmark/omnibenchmark-py/blob/d5925cffe6fc9ed7a5d59d828fe12beeba246bf3/omnibenchmark/management/data_commands.py#L257) |
| **Benchmark projects**: All projects associated with a certain benchmark (+stage) (Orchestrator); | website, omni-cli, summaries (?) | [gitlab api/check orchestrator](https://github.com/omnibenchmark/omnibenchmark-py/blob/d5925cffe6fc9ed7a5d59d828fe12beeba246bf3/omnibenchmark/management/data_commands.py#L298) |
| **Ghost datasets**: retrieve renku datasets/workflows that are still in the Knowledge Graph but that were deleted and replaced by newer versions. | Maintenance (filtering out these datasets) | Some examples of such datasets: ['pca_hc'](https://renkulab.io/knowledge-graph/datasets?query=pca_hc), ['sc3_clustering'](https://renkulab.io/knowledge-graph/datasets?query=sc3_clustering), ['cidr-clustering'](https://renkulab.io/knowledge-graph/datasets?query=cidr-clustering) |
| **Renku dataset size**: the size of all/individual files of a renku dataset (not sure whether this information is stored anywhere; to be investigated). | Website, metric summaries | - |
### Project 3: Summary dashboard
#### Background:
Results from omnibenchmark are currently summarized in a dedicated project, and the result tables are used as inputs for the bettr shiny dashboard, which users can explore interactively. While this process works in general, several pieces need to be optimized/improved:
#### Aim/Tasks:
1. Optimal defaults (MCDA)
2. Extend bettr to allow parameter and dataset filtering (see this [issue](https://github.com/federicomarini/bettr/issues/8)).
3. Optimize app performance (?)
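Task 1 amounts to collapsing many metric columns into one default ranking. A minimal weighted-sum MCDA sketch; the metric names, weights, and the weighted-sum aggregation are illustrative defaults, not what bettr actually implements:

```python
def mcda_scores(results, weights, higher_is_better):
    """Aggregate per-metric results into one score per method (weighted-sum MCDA).

    `results` maps method -> {metric: value}. Each metric is min-max normalised
    across methods, inverted where lower is better, then combined with `weights`.
    """
    metrics = list(weights)
    lo = {m: min(r[m] for r in results.values()) for m in metrics}
    hi = {m: max(r[m] for r in results.values()) for m in metrics}
    scores = {}
    for method, vals in results.items():
        total = 0.0
        for m in metrics:
            span = hi[m] - lo[m]
            norm = 0.0 if span == 0 else (vals[m] - lo[m]) / span
            if not higher_is_better[m]:  # e.g. runtime: smaller is better
                norm = 1.0 - norm
            total += weights[m] * norm
        scores[method] = total
    return scores
```

Sensible default weights (e.g. uniform over accuracy-type metrics, small weight on runtime) could then ship with the dashboard while remaining user-adjustable.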
#### Links:
[bettr](https://github.com/federicomarini/bettr)
[summary project](https://renkulab.io/gitlab/omnibenchmark/omni-batch-py/omni-batch-summary-py)
[bettr dashboard](http://imlspenticton.uzh.ch:3840/omni_batch/)
[MCDA](https://en.wikipedia.org/wiki/Multiple-criteria_decision_analysis)
## Some definitions
Some polysemic terms to be aware of:
- dataset
- dataset (statistics)
- [renku dataset](https://renku.readthedocs.io/en/latest/tutorials/first_steps/03_add_data.html#)
    - fuseki2 dataset, a persistent knowledge base served by Jena Fuseki2
- renku
- [renku or renkulab](https://github.com/SwissDataScienceCenter/renku)
- renkulab.io, the renkulab deployment at https://renkulab.io/
- [renku python](https://github.com/SwissDataScienceCenter/renku-python)
- [renku client](https://renku.readthedocs.io/en/latest/introduction/what-is-renku.html#renku-client)
- parameter
- parameter (statistics)
- renku arguments to renku CLI commands
- workflow
- workflow (computing)
- [renku workflow](https://renku.readthedocs.io/en/latest/tutorials/first_steps/08_create_workflow.html)
- renku plan
- renku activity
- metadata
- metadata (computing)
- .renku files
- (queries to) renkulab endpoints: graphml, gitlab API, renku API
- (queries to) our fuseki2 datasets
- omnibenchmark
- [omnibenchmark (framework)](https://github.com/orgs/omnibenchmark), omnibenchmark, omniValidator, omniSparql (python modules), custom templates
- omnibenchmark (production), current system with renkulab.io, omnibenchmark framework, custom triplestore, bettR, computing resources at UZH. [List of current benchmarks](http://omnibenchmark.org/p/benchmarks/).
- omnibenchmark (physical server)
## Useful (not yet implemented) queries to the triplestore
- which methods are run on dataset X?
- which are the omb dependencies/omb full lineage of metric result Y?
- which parameters are used by method Z?
- which normalization (preprocessing) was run on a method file?
- which renku datasets/workflows are still in the KG but not in a benchmark (i.e., 'ghost datasets')?
- what size is a renku dataset X? (not sure that this information is available here)
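Each of these would become a SPARQL query against our Fuseki endpoint. As a sketch, here is one of them ("which parameters are used by method Z?") as a query template, plus a generic submitter using the standard SPARQL 1.1 protocol. The predicate names and the endpoint URL are assumptions to be checked against the renku ontology; the vetted queries should end up in `omniSparql`:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def method_params_query(method_name: str) -> str:
    """SPARQL for 'which parameters are used by method Z?'.

    The predicates are illustrative guesses in the spirit of the renku
    ontology and schema.org, not verified against the actual graph.
    """
    return f"""
PREFIX renku: <https://swissdatasciencecenter.github.io/renku-ontology#>
PREFIX schema: <http://schema.org/>
SELECT ?parameter ?default WHERE {{
  ?plan schema:name "{method_name}" ;
        renku:hasArguments ?parameter .
  ?parameter schema:defaultValue ?default .
}}
"""

def run_sparql(endpoint: str, query: str) -> list:
    """POST a SELECT query to a Fuseki dataset's query endpoint
    (e.g. https://<fuseki-host>/<dataset>/query, a placeholder) and
    return the JSON result bindings (SPARQL 1.1 protocol)."""
    body = urlencode({"query": query}).encode()
    req = Request(endpoint, data=body, headers={
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/sparql-results+json",
    })
    with urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]
```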
###### tags: `omb` `development`