
# Nextflow

### keywords

- portability
- scalability
- reproducibility
- containerisation

Golden rules of reproducibility:

- use GitHub
- use Nextflow :blush:
- isolate the pipeline using Docker
- use a small dataset to test
- unit testing with a CI

Future:

- allow creating a workflow from other workflows (challenging to implement)
- improve cloud support (Azure, Google, OpenStack)
- optimize remote storage *caching*
- Kubernetes: cloud-agnostic container clustering and management, https://kubernetes.io

## Phil Ewels - Nextflow at SciLifeLab

> Slides online >[here](https://www.slideshare.net/tallphil/standardising-swedish-genomics-analyses-using-nextflow/1)<.

To handle the equivalent of one human genome every 4 minutes, they need to scale and be strong on automation.

### basic bioinformatics services

Initial data analysis for major protocols (QC). Pipelines should be automated, reliable, easy for others to run, and reproducible.

- NGI is ISO17025 certified
- Team of 10 bioinformaticians

Mature pipelines:

* Cancer Analysis Workflow
* NGI-RNAseq

In testing:

* NGI-Methyl
* NGI-smRNAseq
* NGI-ChipSeq
* and more!

- sharing is caring, put your stuff on GitHub!
- Use continuous integration, e.g. Travis
- Use versioned releases

### Config files

Build multiple small config files around small blocks of functions. This makes it easier to set up different profiles!

Example:

![](https://i.imgur.com/47EV9ig.jpg)

Code snippets:

1. [`check_max` function](https://github.com/SciLifeLab/NGI-RNAseq/blob/master/nextflow.config#L70-L86)
2. [base config file](https://github.com/SciLifeLab/NGI-RNAseq/blob/master/conf/base.config)
3. [Overwrite the limits](https://github.com/SciLifeLab/NGI-RNAseq/blob/master/conf/uppmax-devel.config#L21-L26)

---

Recommends Illumina iGenomes as a resource for reference genomes:

* [ftp link to iGenomes](ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com)
* [link to AWS bucket](https://ewels.github.io/AWS-iGenomes/)

### Gotchas

#### Dodgy file patterns, e.g.
PE vs SE (try to be explicit)

```groovy
// Be explicit about the library layout instead of letting the glob decide
params.singleEnd = false

// Expect 1 file per sample for single-end data, 2 for paired-end
Channel
    .fromFilePairs( params.reads, size: params.singleEnd ? 1 : 2 )
```

The wrong way to do it was with `size: -1` (any number of files per sample), which treats paired-end files as single-end if the glob pattern doesn't use proper `{1,2}` tags.

#### Overwriting params

Existing params (e.g. `params.xyz`) cannot be overwritten. Create and overwrite a variable instead.

### MultiQC

[MultiQC](http://multiqc.info) is a nice way to get an HTML report with stats at the end of a workflow. See an [example process](https://github.com/SciLifeLab/NGI-RNAseq/blob/master/main.nf#L1040-L1075).

The pipeline has a [special process](https://github.com/SciLifeLab/NGI-MethylSeq/blob/master/bismark.nf#L477-L503) to dump tool versions, which are [aggregated into a YAML](https://github.com/SciLifeLab/NGI-MethylSeq/blob/master/bin/scrape_software_versions.py) file and [picked up](http://multiqc.info/docs/#custom-content) by MultiQC.

### Email notifications

The default is to use `mail` in `workflow.onComplete` (see [docs](https://www.nextflow.io/docs/latest/metadata.html#notification-message)). Phil has an HTML equivalent - see [code](https://github.com/SciLifeLab/NGI-MethylSeq/blob/master/bismark.nf#L541-L625). Making the HTML emails even nicer (and easier) is also a hackathon project.

### Use Groovy syntax highlighting

Bugs are easier to spot, e.g. in GitHub pull requests, using .gitattributes:

```
*.nf linguist-language=Groovy
*.config linguist-language=Groovy
```

_Note from Phil: This fixes the "code type" bar at the top of the main GitHub page too. It didn't use to affect syntax highlighting, but maybe it does now..?_

In file headers, using vim syntax highlighting:

```groovy
/*
vim: syntax=groovy
-*- mode: groovy;-*-
*/
```

### Saving intermediates

If a special parameter is set, the output is stored in a special directory for intermediate results (e.g.
different analysis parts/workflows picking up from the main workflow)

### future work

* use Singularity for everything
* Benchmark AWS run pricing
* Refine pipelines

### Links

* http://opensource.scilifelab.se/
* https://github.com/scilifelab

#### NGI pipelines

* [NGI-RNAseq](https://github.com/SciLifeLab/NGI-RNAseq)
* [CAW](http://github.com/SciLifeLab/CAW) _(Cancer Analysis Workflow)_
* [NGI-MethylSeq](https://github.com/SciLifeLab/NGI-MethylSeq)
* [NGI-ChIPseq](https://github.com/SciLifeLab/NGI-ChIPseq)
* [NGI-smRNAseq](https://github.com/SciLifeLab/NGI-smRNAseq)

---

## Scott Hazelhurst - Building pipelines to support African Bioinformatics

* 8 collaborative centres, 13 research projects, Biorepositories, Pan-African Bioinformatics Network for H3Africa
* [AWI Gen](https://www.wits.ac.za/research/sbimb/research/awi-gen/): Genetic & environmental factors in cardio-metabolic disorders in African populations - hub at Witwatersrand
* Within H3ABionet, more than 20 different workflows

Some constraints for workflows to work in a diverse environment (basically the African continent):

- needs to work on laptop and HPC

### Exploring different workflows

* Four workflows, two different systems (Nextflow and CWL)
* Nextflow for GWAS analysis, CWL for NGS / 16S analysis parts at H3ABionet
* CWL is more a language specification than a tool - several tools support it
    * [cwltool](https://github.com/common-workflow-language/cwltool) (reference implementation)
    * Docker support, parallelism, language based on YAML

Experimental converter nf <--> cwl: [link](https://github.com/nextflow-io/cwl2nxf)

H3ABionet pipelines [here](https://github.com/h3abionet)

## Evan Floden: Inside-Out DBs & modules with nextflow

### reproducibility

Problem: different OSes, different p values, different genome annotation

Solution: containers!
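
As a minimal sketch of the container approach, a `nextflow.config` can pin every process to one image so that each run sees the same OS and tool versions. The image name below is a hypothetical placeholder and is not from the talk:

```groovy
// nextflow.config (sketch; the image name is a hypothetical placeholder)
process.container = 'example/rnaseq-tools:1.0'  // pin an explicit tag rather than 'latest'
docker.enabled    = true                        // run each process inside the container
```

On systems where Docker is not allowed, `singularity.enabled = true` would take the place of the `docker.enabled` line.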
### Data lives in databases

* Figshare, Dryad, NCBI, Zenodo, ENA, SRA
* secure data integration, using local SRA caches and enabling fast downloads via Aspera for cases where the
* flow: prefetch, validate and fastq-dump
* Prefetching with a special Docker container containing the Aspera client / directly adjustable to use e.g. a private user to access nonpublic data at the SRA

NCBI [github org](https://github.com/ncbi)

Nextflow implementation of the [tuxedo pipeline](http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html): [link](https://github.com/skptic/tuxedo-nf), including pulling of the original data from SRA

### Software registries

* https://bio.tools
* https://omictools.com/
* http://biocontainers.pro
* Galaxy toolshed

### nextflow modules (experimental)

A module "mapping" would allow you to select which mapper you want (a "component") and to "swap" one tool for another

* [module demo](https://github.com/skptic/nf-module-demo)

### Sevenbridges

Idea: Converting workflows published as Public Apps into a Nextflow lite version

## Tim Dudgeon: Nextflow for chemistry

### OpenRiskNet

Horizon 2020 funded; aims to standardise ways to access chemical data and workflows

### Squonk

Computational notebook, browser based: [link](https://squonk.it)

Focused on cheminformatics at the moment

InformaticsMatters [pipelines](https://github.com/InformaticsMatters/pipelines)

## Luca Cozzuto: From Zero to Nextflow

> _"In Bioinformatics nothing is standardised"_

Custom scripts, magic one-liners, ...

As a core facility, it is not easy to optimize the time spent per project.

Nextflow allowed them to:

* automate and parallelize
* add steps to their pipelines in a painless manner
* keep being a multi-programming language core facility

## [Alessia Visconti](https://github.com/alesssia): Simplifying shotgun metagenomic analysis with Nextflow

* 1300 samples
* 22M reads per sample

Workflow?
* At first (lots of) batch files, which was a "disaster"
* Then dedicated informatics resources for their workflow: https://github.com/alesssia/MAP
* Then switched to Nextflow :tada: [pipeline link](https://github.com/alesssia/YAMP)

## [Matthieu Foll](https://twitter.com/m_foll) - Computational workflows for cancer analysis at the IARC / France

> Slides online >[here](https://www.slideshare.net/MatthieuFoll/computational-workflows-for-omics-analyses-at-the-iarc)<.

* mostly high throughput sequencing and array data (genetics, transcriptomics, epigenetics)
* starts with a description of what workflows should look like

### Our philosophy

* Do it once, do it right and use it everywhere
* Keep it simple, stupid (KISS principle)
    * most systems work best if they are kept simple
    * simplicity should be a key goal in design
    * code is easier to maintain and to understand

### Our design

* Too much automation is not for us
    * Hard to read, to maintain and to keep modular; we prefer to have one alignment pipeline, one variant calling pipeline, one annotation pipeline, one QC pipeline
* One pipeline = one GitHub repo
* Docker + Singularity containers
* CircleCI for tests and deployment
* Standardised readme, params, help etc.
* Use GitHub issues and releases
* Master branch <- beta branch <- dev branch
* [One central repo](https://github.com/IARCbioinfo/IARC-nf) references all Nextflow pipelines: IARC-nf
    * Lists pipelines with a short description
    * Has a hello-world template ready for new pipelines
    * One pipeline = one repo, whose name always ends with "-nf"
    * Common instructions to use the pipelines (install Nextflow, configuration, basic usage, Docker)
    * Dagre-generated graph of each pipeline, showing what the pipeline looks like (automatically produced by CircleCI)

### Challenges

* Unix pipes
* CWL / WDL - do we need this?
### What we love

* Integration with GitHub
* Running any pipeline on any machine in <5 minutes
* Cluster compatibility is great
* Separating the pipeline definition from the execution aspects
* History, log, trace, timeline
* Resuming a pipeline
* Docker and Singularity
* The Gitter chat for help (!)

### What we hate

* The learning curve
* When we think we have to guess the syntax
* Debugging
* Syncing channels for multiple inputs
* Creating sets/lists/channels to have multiple inputs in a process
* Dealing with optional steps in a pipeline
* Large "work" directories
* Ending up with several logs and trace files in a directory
* Copy/pasting processes in different pipelines

### What we would love

* Deleting large intermediate files as soon as they are no longer needed
* Importing processes
* Automatically generating usage from params
* Splitting bed files with a splitBed operator
* A nice HTML report, an email, a web UI for monitoring
* Nextflow available in the clouds we want to use

## [Hugues Fontenelle](https://twitter.com/hugues_f?lang=en): Medical Genetics at Oslo University Hospital

* Application of NF pipelines in clinical medicine (Does this patient have X disease?)
* Web UI (Flask - Python) for the log and the execution (not suitable for a wide general implementation)
* Importance of security in the data transfer, and not possible to work with Docker

_(missed this part unfortunately)_

## [Frédéric Lemoine](https://github.com/fredericlemoine): Institut Pasteur

* Bootstrap in phylogenetics. Transfer Bootstrap Expectation (TBE) - stability measure [0,1] per branch

### Workflows

* Automatic reporting from the pipeline (using plots)
* The choice implementation in Nextflow is useful when, depending on the size of the data, you need one path or another.
* For the simulated data the pipeline is the same, just with a changed first step to simulate the data (?
need for modules)
* Available on GitHub: [NF pipelines](https://github.com/evolbioinfo/booster-workflows)

## Johnny Wu: Workflow efforts within Roche sequencing

* Collaborative development and optimization of components within workflow across sites
* Support product development/optimization
* Usability for assay team

### Integration into nextflow

* Framework needs to be updated to enable running in two different modes:
    * Local node
    * SGE cluster
* Research site
* Transfer of components, dependencies and workflow across sites

## Mike Smoot - [Synthetic genomics](https://www.syntheticgenomics.com)

* Science (non) fiction! Algae biofuel, pig-to-human transplant, ...
* Basic cluster architecture: Celery queue, cluster master node: Celery worker, Nextflow internal DAG, Slurm
* Nextflow submission microservice: an HTTP POST endpoint for submitting NF jobs, with posted JSON for params
* Validation of the params file prior to running the pipeline
* Multi queues
* Pipeline cost estimation (trace.tsv file), dynamic query of AWS
* SLURM + AWS autoscaling logic
* Ansible scripts used everywhere
* NF wishlist:
    * Modules
    * Automatic batching of processes

## Angel Pizarro - AWS Scientific Computing

* Why is cloud computing good for research: Elastic, Scalable, Time for Science, Globally accessible ...
* Introducing AWS Batch
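
Nextflow can target AWS Batch through its `awsbatch` executor. The following is a minimal config sketch, not something shown in the talk; the queue, bucket, region and image names are all hypothetical placeholders:

```groovy
// nextflow.config (sketch; queue, bucket, region and image are hypothetical placeholders)
process.executor  = 'awsbatch'                  // submit each process as an AWS Batch job
process.queue     = 'my-batch-queue'            // an existing AWS Batch job queue
process.container = 'example/rnaseq-tools:1.0'  // Batch jobs run inside containers
workDir           = 's3://my-bucket/work'       // work directory on S3, shared by all jobs
aws.region        = 'eu-west-1'
```

Keeping these settings in their own small config file fits the "multiple small config files" advice from the SciLifeLab talk, since the same pipeline can then switch between a local run and a Batch run by swapping profiles.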
