
# Day 1 Reproducible Research Insights

## Nextflow exposure

From its [website](https://www.nextflow.io/)

_"Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages._
_Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters."_

### Disambiguation

Don't worry, it reads like alien speak even to me... Let's do some disambiguation:

`containers`: think of a light-weight virtual machine, _i.e._ a small-footprint, isolated environment that packages software together with its dependencies, most frequently Linux-based. Common container types are [`docker`](https://www.docker.com/) and [`singularity`](https://apptainer.org/) (now called `apptainer`).

`DSL`: [_Domain Specific Language_](https://en.wikipedia.org/wiki/Domain-specific_language), as opposed to a general-purpose language. HTML is an example of the former, Python of the latter.

### Relevance

Why do we care? Because `Nextflow` and its Python-based alternative [snakemake](https://snakemake.readthedocs.io/en/stable/) will allow us to implement reproducible research. They will also help you preprocess your data faster, benefiting from the combined effort of many others who are experts in different types of data. Finally, it will be a good plus on your CV, either as a user or as a developer.

### Caveat

This workshop is not about such workflows, so we will only get a short exposure. But check out Physalia, as Carlo often organises such workshops.

## Hands-on

### Aim

This group work is about figuring out how to run a Nextflow pipeline.

### Setup

__This is group work and you have been assigned to a breakout room. Turn on your video and talk to each other. Ideally, one of you shares their screen.__

### Goal

Your goal is to generate the `nf-params.json` and upload it to the [living document](https://hackmd.io/@bschiffthaler/S1cNpLZvs/%2F_6QZvwThQIqu3wElYENtfw).

### Instructions

Nextflow has a library of pipelines under development (initially mostly by NBIS), called [`nf-core`](https://nf-co.re/).

1. Find the RNA-seq pipeline.
2. Locate its summary figure. We are interested in the pink path, _i.e._ running the initial QC, trimming adapters, sorting out rRNA and quantifying the expression using an alignment-free approach.
3. The relevant tabs for this step are `Introduction`, `Usage docs` and `Parameters`. Keep these accessible, then click `Launch version 3.9`.
4. Alright! Fill in all the relevant fields on this page, so as to run what is described in point 2. See some help below.
5. Once you are satisfied, click `Launch Workflow`. It will open a new page, where you will see a code block containing the `json` configuration. Copy it to the living document.

#### Settings

1. Set the outdir to `/home/training/` (or whichever directory you fancy)
2. Set the genome to `/berlin2022/db/reference/genome/mock.fa.gz`
3. Set the gff to `/berlin2022/db/reference/genome/mock.gff.gz`
4. Set the transcript_fasta to `/berlin2022/db/reference/fasta/Pabies1.0-all-phase.gff3.CDSandLTR-TE.fa.gz`
5. Set the salmon index to `/berlin2022/db/reference/indices/Pabies1.0-all-phase.gff3.CDSandLTR-TE_salmon-version-1dot5dot1`
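
If you want to cross-check what the launch page gives you, the settings above can be sketched as a small Python script that writes them out as JSON. Treat it as a rough sketch only: the paths come from the list above, but the key names `fasta` (for the genome) and `salmon_index` (for the salmon index) are assumptions, and so are the helper names `to_params_json` and `missing_keys`. Check the key names against the `Parameters` tab. The options needed for point 2 of the instructions are deliberately left out; you still have to find those yourselves.

```python
import json

# Values are copied from the Settings list.
# ASSUMPTION: the key names "fasta" and "salmon_index" are guesses for the
# "genome" and "salmon index" settings; verify them in the Parameters tab.
params = {
    "outdir": "/home/training/",
    "fasta": "/berlin2022/db/reference/genome/mock.fa.gz",
    "gff": "/berlin2022/db/reference/genome/mock.gff.gz",
    "transcript_fasta": "/berlin2022/db/reference/fasta/Pabies1.0-all-phase.gff3.CDSandLTR-TE.fa.gz",
    "salmon_index": "/berlin2022/db/reference/indices/Pabies1.0-all-phase.gff3.CDSandLTR-TE_salmon-version-1dot5dot1",
}


def to_params_json(p):
    """Serialise a parameter dictionary as indented JSON text (hypothetical helper)."""
    return json.dumps(p, indent=4)


def missing_keys(text, required):
    """Return the required keys that are absent from a JSON params document (hypothetical helper)."""
    loaded = json.loads(text)
    return [key for key in required if key not in loaded]


if __name__ == "__main__":
    text = to_params_json(params)
    print(text)
    # Report any of the five settings that did not make it into the JSON.
    print("missing:", missing_keys(text, list(params)))
```

The same `missing_keys` check can be pointed at the `json` block copied from the launch page, to confirm that none of the five settings was forgotten before you paste it into the living document. A file like this is what Nextflow's `-params-file` option expects.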