# Nextflow
### keywords
- portability
- scalability
- reproducibility
- containerisation

Golden rules of reproducibility:
- use GitHub
- use Nextflow :blush:
- isolate the pipeline using Docker
- test with a small dataset
- unit testing with a CI

Future:
- allow creating a workflow from other workflows (challenging to implement)
- improve cloud support (Azure, Google, OpenStack)
- optimize remote storage *caching*
- Kubernetes: cloud-agnostic container clustering and management (https://kubernetes.io)
## Phil Ewels - Nextflow at SciLifeLab
> Slides online >[here](https://www.slideshare.net/tallphil/standardising-swedish-genomics-analyses-using-nextflow/1)<.
To handle the equivalent of one human genome every 4 minutes, they need to scale and be strong on automation.
### basic bioinformatics services
initial data analysis for major protocols (QC)
want to be automated, reliable, easy for others to run + reproducible
NGI is ISO17025 certified
Team of 10 bioinformaticians
Mature pipelines:
* Cancer Analysis Workflow
* NGI-RNAseq

In testing:
* NGI-Methyl
* NGI-smRNAseq
* NGI-ChipSeq
* and more!
* Sharing is caring, put your stuff on GitHub!
* Use continuous integration, e.g. Travis
* Use versioned releases
### Config files
Build multiple small config files, each covering a small block of functionality. This makes it easier to set up different profiles!
example:
![](https://i.imgur.com/47EV9ig.jpg)
Code snippets:
1. [`check_max` function](https://github.com/SciLifeLab/NGI-RNAseq/blob/master/nextflow.config#L70-L86)
2. [base config file](https://github.com/SciLifeLab/NGI-RNAseq/blob/master/conf/base.config)
3. [Overwrite the limits](https://github.com/SciLifeLab/NGI-RNAseq/blob/master/conf/uppmax-devel.config#L21-L26)
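
The idea behind a `check_max`-style helper is to cap each resource request at a site-wide ceiling, so the same base config can run on a laptop or a cluster. Below is a minimal sketch in plain Groovy. The limit names (`cpus`, `memoryGb`, `timeHours`) are hypothetical and plain numbers stand in for Nextflow's memory and duration types; the linked NGI-RNAseq function is the real implementation and differs in detail.

```groovy
// Hypothetical site limits; in a pipeline these would typically come from params
// and be overridden in a site-specific config file.
def limits = [cpus: 16, memoryGb: 128, timeHours: 240]

// Return the requested amount, capped at the configured maximum for that resource.
// Resources without a configured maximum are passed through unchanged.
def checkMax(Map limits, String type, Number requested) {
    def cap = limits[type]
    if (cap == null) {
        return requested
    }
    return requested > cap ? cap : requested
}

println checkMax(limits, 'cpus', 32)      // request above the ceiling is capped
println checkMax(limits, 'memoryGb', 64)  // request below the ceiling is kept
```

Overriding only the `limits` map per site keeps the per-process requests in one shared base config.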
---
Recommends Illumina iGenomes as a resource for reference genomes:
[ftp link to iGenomes](ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com)
[link to AWS bucket](https://ewels.github.io/AWS-iGenomes/)
### Gotchas
#### Dodgy file patterns, e.g. PE vs SE (try to be explicit)
```groovy
params.singleEnd = false
Channel
.fromFilePairs( params.reads, size: params.singleEnd ? 1 : 2 )
```
The wrong way to do it was with `size: -1`, which accepts any number of files per sample and so treats paired-end files as single-end if the glob pattern doesn't use proper `{1,2}` tags.
#### Overwriting params
Existing params (e.g. `params.xyz`) cannot be overwritten.
Create a separate variable and overwrite that instead.
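
A minimal sketch of that pattern as a Nextflow script fragment. The parameter name `genome` and the alias values are hypothetical, chosen only for illustration:

```groovy
// Nextflow script fragment. `genome` and the alias values are hypothetical.
params.genome = 'GRCh37'   // default; may be replaced from the command line or a config file

// Copy the parameter into an ordinary variable and modify the copy,
// leaving params.genome itself untouched.
def genome = params.genome
if (genome == 'hg19') {
    genome = 'GRCh37'
}

println "Using genome: ${genome}"
```

The rest of the script then refers to `genome` rather than `params.genome`.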
### MultiQC
[MultiQC](http://multiqc.info) is a nice way to get an HTML report with stats at the end of a workflow. See an [example process](https://github.com/SciLifeLab/NGI-RNAseq/blob/master/main.nf#L1040-L1075).
The pipeline has a [special process](https://github.com/SciLifeLab/NGI-MethylSeq/blob/master/bismark.nf#L477-L503) to dump tool versions, which are [aggregated into a YAML](https://github.com/SciLifeLab/NGI-MethylSeq/blob/master/bin/scrape_software_versions.py) file and [picked up](http://multiqc.info/docs/#custom-content) by MultiQC.
### Email notifications
Default is to use `mail` in `workflow.onComplete` (see [docs](https://www.nextflow.io/docs/latest/metadata.html#notification-message)).
Phil has an HTML equivalent - see [code](https://github.com/SciLifeLab/NGI-MethylSeq/blob/master/bismark.nf#L541-L625).
There is also a hackathon project to make the HTML emails even nicer (and easier).
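
For reference, the plain-text default can be sketched as below. It follows the pattern from the Nextflow documentation of piping a message into the system `mail` command; the recipient address and subject line are placeholders, and a working `mail` setup on the host is assumed.

```groovy
// Nextflow script fragment; the recipient address and subject are placeholders.
workflow.onComplete {
    def subject = "Pipeline ${workflow.success ? 'completed' : 'failed'}"
    def recipient = 'you@example.com'

    def body = """\
        Pipeline execution summary
        ---------------------------
        Completed at: ${workflow.complete}
        Duration    : ${workflow.duration}
        Success     : ${workflow.success}
        Work dir    : ${workflow.workDir}
        Exit status : ${workflow.exitStatus}
        """.stripIndent()

    // Pipe the message body into the system `mail` command
    ['mail', '-s', subject, recipient].execute() << body
}
```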
### Use Groovy syntax highlighting
Bugs are easier to spot, e.g. in GitHub pull requests.
Using `.gitattributes`:
```
*.nf linguist-language=Groovy
*.config linguist-language=Groovy
```
_Note from Phil: This fixes the "code type" bar at the top of the main GitHub page too. It didn't use to affect syntax highlighting, but maybe it does now..?_
In file headers, for editor syntax highlighting (vim modeline and Emacs mode line):
```groovy
/*
vim: syntax=groovy
-*- mode: groovy;-*-
*/
```
### Saving intermediates
If a special parameter is set, intermediate output is stored in a dedicated directory (e.g. so that other analysis parts/workflows can pick up from the main workflow).
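
One way to sketch this is with `publishDir` and a `saveAs` closure that returns `null` when the flag is off, so nothing is published. The example uses current DSL2 syntax, which postdates these notes. The parameter names (`saveIntermediates`, `outdir`, `reads`) and the process are hypothetical, and `cp` stands in for a real trimming step.

```groovy
// Hypothetical parameters
params.reads = 'data/*.fastq'
params.outdir = 'results'
params.saveIntermediates = false

process TRIM_READS {
    // Publish to a dedicated directory only when the flag is set;
    // returning null from saveAs means the file is not published.
    publishDir "${params.outdir}/intermediates/trimmed", mode: 'copy',
        saveAs: { filename -> params.saveIntermediates ? filename : null }

    input:
    path reads

    output:
    path "trimmed_${reads}"

    script:
    """
    cp ${reads} trimmed_${reads}
    """
}

workflow {
    TRIM_READS(Channel.fromPath(params.reads))
}
```

Running with `--saveIntermediates` on the command line would switch the publishing on.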
### future work
* use singularity for everything
* Benchmark AWS run pricing
* Refine pipelines
### Links
* http://opensource.scilifelab.se/
* https://github.com/scilifelab
#### NGI pipelines
* [NGI-RNAseq](https://github.com/SciLifeLab/NGI-RNAseq)
* [CAW](http://github.com/SciLifeLab/CAW) _(Cancer Analysis Workflow)_
* [NGI-MethylSeq](https://github.com/SciLifeLab/NGI-MethylSeq)
* [NGI-ChIPseq](https://github.com/SciLifeLab/NGI-ChIPseq)
* [NGI-smRNAseq](https://github.com/SciLifeLab/NGI-smRNAseq)
---
## Scott Hazelhurst - Building pipelines to support African Bioinformatics
* 8 collaborative centres, 13 research projects, Biorepositories, Pan-African Bioinformatics Network for H3Africa
* [AWI Gen](https://www.wits.ac.za/research/sbimb/research/awi-gen/): Genetic & environmental factors in cardio-metabolic disorders in African populations - hub at Witwatersrand
* Within H3ABionet, more than 20 different workflows

Some constraints for workflows to work in a diverse environment (basically the African continent):
- needs to work on both a laptop and HPC
### Exploring different workflows
* Four workflows, two different systems (Nextflow and CWL)
* Nextflow for GWAS analysis, CWL for the NGS / 16S analysis parts at H3ABionet
* CWL is more a language specification than a tool; several tools support it
  * [cwltool](https://github.com/common-workflow-language/cwltool) (reference implementation)
  * Docker support, parallelism, language based on YAML

Experimental converter nf <--> cwl: [link](https://github.com/nextflow-io/cwl2nxf)

H3ABionet pipelines are [here](https://github.com/h3abionet)
## Evan Floden: Inside-Out DBs & modules with nextflow
### reproducibility
Problem: different OSes yield different p-values, and genome annotations differ
Solution: containers!
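
In Nextflow, pinning a pipeline to a container is a configuration matter. A minimal `nextflow.config` sketch, where the image name is a placeholder and not a real image:

```groovy
// nextflow.config fragment; the image name is a placeholder
process.container = 'quay.io/example/my-pipeline:1.0.0'
docker.enabled = true
```

Pinning an explicit version tag rather than `latest` keeps the software environment fixed between runs.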
### Data lives in databases
* Figshare, Dryad, NCBI, zenodo, ENA, SRA
* secure data integration, using local SRA caches and enabling fast downloads via aspera for cases where the
* Flow: prefetch, validate and fastq-dump
* Prefetching with special docker container containing aspera client / directly adjustable to use e.g. a private user to access nonpublic data at the SRA
ncbi [github org](https://github.com/ncbi)
nextflow implementation of the [tuxedo pipeline](http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html): [link](https://github.com/skptic/tuxedo-nf) including pulling of original data from SRA
### Software registries
* https://bio.tools
* https://omictools.com/
* http://biocontainers.pro
* Galaxy toolshed
### nextflow modules (experimental)
A module "mapping" would allow you to select which mapper you want (a "component") and to "swap" one tool for another
* [module demo](https://github.com/skptic/nf-module-demo)
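
The swapping idea can be illustrated in plain Groovy, independently of how the experimental modules are actually implemented. In this sketch a "mapping" module is a hypothetical map from component name to a closure that builds the aligner command line. The names and the simplified command lines are assumptions and are not taken from the demo repository.

```groovy
// Hypothetical "mapping" module: each component maps an index and a reads file
// to the command line of one aligner. Output handling is omitted for brevity.
def mappingComponents = [
    bowtie2: { String index, String reads -> "bowtie2 -x ${index} -U ${reads}".toString() },
    bwa    : { String index, String reads -> "bwa mem ${index} ${reads}".toString() }
]

// Select a component by name; fail early on an unknown mapper.
def mappingCommand(Map components, String mapper, String index, String reads) {
    def component = components[mapper]
    if (component == null) {
        throw new IllegalArgumentException("Unknown mapper '${mapper}', expected one of ${components.keySet()}")
    }
    return component(index, reads)
}

println mappingCommand(mappingComponents, 'bowtie2', 'genome_idx', 'sample.fastq')
println mappingCommand(mappingComponents, 'bwa', 'genome.fa', 'sample.fastq')
```

Swapping one tool for another then only changes the `mapper` argument, not the surrounding workflow.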
### Sevenbridges
Idea: Converting workflows published as Public Apps into Nextflow lite version
## Tim Dudgeon: Nextflow for chemistry
### OpenRiskNet
Horizon 2020 funded; aims to standardise ways to access chemical data and workflows
### Squonk
computational notebook, browser based
[link](https://squonk.it)
focused on cheminformatics at the moment
InformaticsMatter [pipelines](https://github.com/InformaticsMatters/pipelines)
## Luca Cozzuto: From Zero to Nextflow
> _"In Bioinformatics nothing is standardised"_
Custom scripts, magic one-liners, ... For a core facility, it is not easy to optimize the time spent per project.
Nextflow allowed them to:
* automate and parallelize
* add steps to their pipelines in a painless manner
* remain a multi-programming-language core facility
## [Alessia Visconti](https://github.com/alesssia): Simplifying shotgun metagenomic analysis with Nextflow
1300 samples
22M reads per sample
Workflow? at first (lots of) batch files, was a "disaster"
Then dedicated informatics resources for their workflow: https://github.com/alesssia/MAP
Then switched to nextflow :tada:
[pipeline link](https://github.com/alesssia/YAMP)
## [Matthieu Foll](https://twitter.com/m_foll) - Computational workflows for cancer analysis at the IARC / France
> Slides online >[here](https://www.slideshare.net/MatthieuFoll/computational-workflows-for-omics-analyses-at-the-iarc)<.
* mostly high-throughput sequencing and array data (genetics, transcriptomics, epigenetics)
* starts with description what workflows should look like
### Our philosophy
* Do it once, do it right and use it everywhere
* keep it simple, stupid (KISS principle)
* most systems work best if they are kept simple
* simplicity should be a key goal in design
* code easier to maintain and to understand
### Our design
* Too much automation is not for us
* Hard to read, to maintain and to keep modular; we prefer to have one alignment pipeline, one variant calling pipeline, one annotation pipeline, one QC pipeline
* One Pipeline = One github repo
* Docker + Singularity containers
* CircleCI for tests and deployment
* Standardised readme, params, help etc
* Use GitHub issues and releases
* Master branch <- beta branch <- dev branch
* [One central repo](https://github.com/IARCbioinfo/IARC-nf) references all nextflow pipelines IARC-nf
* List pipelines with a short description
* Have a hello-world template for new pipelines ready there
* One pipeline = one repo, ends always with "-nf"
* Common instructions to use the pipelines (install nextflow, configuration, basic usage, docker)
* Dagre-created graph of the pipeline, showing what the pipeline looks like (automatically produced by CircleCI)
### Challenges
* Unix pipes
* CWL / WDL - do we need this?
### What we love
* Integration with GitHub
* Running any pipeline on any machine in <5 Minutes
* Cluster compatibility is great
* Separate the pipeline definition from the execution aspects
* History, log, trace, timeline
* Resume a pipeline
* Docker and Singularity
* The Gitter Chat for help (!)
### What we hate
* The learning curve
* When we think we have to guess the syntax
* Debugging
* Syncing channels for multiple inputs
* Creating sets/lists/channels to have multiple inputs in a process
* Dealing with optional steps in a pipeline
* Large "work" directories
* Ending up with several logs and trace files in a directory
* Copy/Pasting processes in different pipelines
### What we would love
* Deleting large intermediate files as soon as they are no longer needed
* Importing processes
* Automatically generating usage from params
* Splitting bed files with a splitBed operator
* A nice HTML report, an email, a web UI for monitoring
* nextflow available in the clouds we want to use
## [Hugues Fontenelle](https://twitter.com/hugues_f?lang=en): Medical Genetics at Oslo University Hospital
* Application of NF pipelines in clinical medicine (does this patient have disease X?)
* Web UI (Flask - python) for the log and the execution (not suitable for a wide general implementation)
* Importance of security in the data transfer and not possible to work with docker
%missed this part unfortunately
## [Frédéric Lemoine](https://github.com/fredericlemoine): Institut Pasteur
* Bootstrap in phylogenetics. Transfer Bootstrap Expectation (TBE) - stability measure [0,1] per branch
### Workflows
* Automatically reporting from the pipeline (using plots)
* The choice implementation in Nextflow is useful when, depending on the size of the data, you need one path or another.
* For the simulated data the pipeline is the same just with a changed first step to simulate the data (? need for modules)
* Available in github [NF pipelines](https://github.com/evolbioinfo/booster-workflows)
## Johnny Wu: Workflow efforts within Roche sequencing
* Collaborative development and optimization of components within workflow across sites
* Support product development/optimization
* Usability for assay team
### Integration into nextflow
* Framework needs to be updated to enable running in two different modes:
* Local node
* SGE cluster
* Research site
* Transfer of components, dependencies and workflow across sites
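
On the Nextflow side, a local/SGE switch is usually expressed with configuration profiles. A minimal `nextflow.config` sketch, assuming hypothetical profile names and queue name (this is not Roche's actual setup):

```groovy
// nextflow.config fragment; profile names and the queue name are hypothetical
profiles {
    local {
        process.executor = 'local'
    }
    cluster {
        process.executor = 'sge'
        process.queue = 'all.q'
    }
}
```

The mode is then chosen at launch time, e.g. `nextflow run main.nf -profile cluster`.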
## Mike Smoot - [Synthetic genomics](https://www.syntheticgenomics.com)
* Science (non) fiction!
  * Algae biofuel
  * Pig to human transplant ...
* Basic Cluster architecture: Celery queue, Cluster master node: celery worker, NextFlow Internal DAG, Slurm
* Nextflow submission microservice: an HTTP POST endpoint for submitting NF jobs, with the params posted as JSON
* Validation of the params file prior to running the pipeline
* Multi queues
* Pipeline cost estimation (trace.tsv file), dynamic query of AWS
* SLURM + AWS autoscaling logic
* Ansible scripts used everywhere
* NF wishlist:
  * Modules
  * Automatic batching of processes
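
The step of validating the posted params JSON before a pipeline is launched could look roughly like the plain-Groovy sketch below. The schema format, the parameter names and the `validateParams` helper are all hypothetical; the talk did not show the actual implementation.

```groovy
import groovy.json.JsonSlurper

// Hypothetical schema: required parameter name -> expected type
def schema = [reads: String, genome: String, max_cpus: Integer]

// Return a list of human-readable problems; an empty list means the params are acceptable.
// Malformed JSON is not handled here and surfaces as an exception from JsonSlurper.
List<String> validateParams(String json, Map<String, Class> schema) {
    def parsed = new JsonSlurper().parseText(json)
    if (!(parsed instanceof Map)) {
        return ['params file must contain a JSON object']
    }
    def errors = []
    schema.each { name, type ->
        if (!parsed.containsKey(name)) {
            errors << ('missing parameter: ' + name)
        } else if (!type.isInstance(parsed[name])) {
            errors << ('parameter ' + name + ' should be of type ' + type.simpleName)
        }
    }
    return errors
}

def errors = validateParams('{"reads": "data/*.fastq", "max_cpus": "many"}', schema)
println errors.isEmpty() ? 'params OK' : errors.join('\n')
```

Rejecting a bad request at submission time is cheaper than letting the pipeline fail after it has been queued.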
## Angel Pizarro - AWS Scientific Computing
* Why is cloud computing good for research: Elastic, Scalable, Time for Science, Globally accessible ...
* Introducing AWS Batch