Introduction to Nextflow

Who is it for?

If you need to process hundreds of DNA/RNA sequenced samples every week, then Nextflow is for you!

Many bioinformatics tasks, such as running fastqc, are sequential by default and consume a lot of compute time while you wait to start the next step in your workflow. For example, to run fastqc on all of your fastq files, you might use a for loop like the one below.

for i in *.fastq.gz ; do fastqc "$i" ; done

Note that this loop is sequential: it processes only one file at a time. If fastqc takes 10 minutes per fastq file, the total adds up quickly when you process hundreds of samples. For example, 300 files would take 3,000 minutes, or about 50 hours.
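For comparison, the same kind of loop can be parallelised by hand. The sketch below is only an illustration under stated assumptions: it uses `xargs -P` (available in GNU and BSD xargs) to run up to four jobs at once, the temporary directory and file names are made up, and `echo processed` stands in for fastqc so that the snippet runs on a machine without fastqc installed.

```shell
# Create a few empty placeholder files (hypothetical names) in a temp directory.
tmpdir=$(mktemp -d)
touch "$tmpdir/a.fastq.gz" "$tmpdir/b.fastq.gz" "$tmpdir/c.fastq.gz"

# -print0 / -0 keep file names with spaces intact.
# -n 1 passes one file per job; -P 4 runs up to four jobs at the same time.
# "echo processed" is a stand-in: swap it for "fastqc" on a real system.
result=$(find "$tmpdir" -name '*.fastq.gz' -print0 \
  | xargs -0 -n 1 -P 4 echo processed \
  | sort)

echo "$result"

# Remove the placeholder files again.
rm -rf "$tmpdir"
```

This works for a single step, but the job count, the ordering of steps and the hand-off of files between tools all have to be managed manually. Nextflow takes over that scheduling for you.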

Why use Nextflow?

Many bioinformatics pipelines consist of several tools that are chained together in a dataflow programming manner to get the final results. Some of these tools can be run from the Bash shell, some are written in Python and require a Python interpreter, and some are written in R and require an R console. Bioinformaticians and data analysts often also have custom scripts that they run on their samples to get the desired results.

Doing this requires jumping from one interpreter to another, which is hard to manage from a single script file. Nextflow makes this easier: each process defines its own code block, so you can chain all of your analysis steps together in one main file.

Pipelines

In practice, a Nextflow pipeline is made by joining together different processes. Each process can be written in any scripting language that can be executed by the Linux platform (Bash, Perl, Ruby, Python, etc.).

Processes are executed independently and are isolated from each other. Any process can define one or more channels as an input and output. The interaction between these processes, and ultimately the pipeline execution flow itself, is implicitly defined by these input and output declarations.
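As a rough sketch of what "any scripting language" means in practice, the snippet below writes a small Nextflow script with one Bash process and one Python process that read from the same channel. The file name polyglot.nf, the process names SHOUT_BASH and LENGTH_PYTHON, and the channel names are all made up for illustration. It assumes Nextflow DSL2 and a python3 on the PATH when the script is eventually run.

```shell
# Write an illustrative Nextflow script to polyglot.nf (hypothetical file name).
# The quoted 'EOF' stops the shell from expanding $word inside the heredoc.
cat > polyglot.nf <<'EOF'
#!/usr/bin/env nextflow

// A process whose script block is plain Bash (the default interpreter).
process SHOUT_BASH {
    input:
    val word

    output:
    stdout

    script:
    """
    echo '$word' | tr '[a-z]' '[A-Z]'
    """
}

// A process whose script block starts with a shebang, so it runs as Python.
process LENGTH_PYTHON {
    input:
    val word

    output:
    stdout

    script:
    """
    #!/usr/bin/env python3
    print(len('$word'))
    """
}

workflow {
    // Both processes take the same input channel; they run independently.
    words_ch = Channel.of('Hello', 'world')
    shouted_ch = SHOUT_BASH(words_ch)
    lengths_ch = LENGTH_PYTHON(words_ch)
    shouted_ch.view()
    lengths_ch.view()
}
EOF
```

With Nextflow installed, this could be run with `nextflow run polyglot.nf`. Only the input and output declarations connect the processes to the rest of the pipeline, so the language inside each script block does not matter to the workflow.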

Let's take a look at a Nextflow script
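The script image from the original note is not available, so what follows is a reconstruction based on the line-by-line description in this section. It should be read as a sketch rather than the author's exact file. The `printf ... | split -b 6` and `cat ... | tr` commands are assumptions chosen to match the described behaviour (6-character chunks named chunk_aa and chunk_ab, then uppercase conversion). The shell command writes the script to hello.nf. Its 26 non-blank lines correspond, in order, to the descriptions that follow.

```shell
# Write the example script to hello.nf.
# The quoted 'EOF' keeps $x and $y literal so that Nextflow, not the shell,
# substitutes them when the pipeline runs.
cat > hello.nf <<'EOF'
#!/usr/bin/env nextflow

params.greeting = 'Hello world!'
greeting_ch = Channel.of(params.greeting)

process SPLITLETTERS {
    input:
    val x

    output:
    path 'chunk_*'

    """
    printf '$x' | split -b 6 - chunk_
    """
}

process CONVERTTOUPPER {
    input:
    path y

    output:
    stdout

    """
    cat $y | tr '[a-z]' '[A-Z]'
    """
}

workflow {
    letters_ch = SPLITLETTERS(greeting_ch)
    results_ch = CONVERTTOUPPER(letters_ch.flatten())
    results_ch.view { it }
}
EOF
```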


1: The code begins with a shebang, which declares Nextflow as the interpreter.
2: Declares a parameter greeting that is initialized with the value 'Hello world!'.
3: Initializes a channel labelled greeting_ch, which contains the value from params.greeting. Channels are the input type for processes in Nextflow.
4: Begins the first process, defined as SPLITLETTERS.
5: Input declaration for the SPLITLETTERS process. Inputs can be values (val), files or paths (path), or other qualifiers.
6: Tells the process to expect an input value (val), which we assign to the variable 'x'.
7: Output declaration for the SPLITLETTERS process.
8: Tells the process to expect one or more output files (path), with filenames starting with 'chunk_', as output from the script. The process sends the output as a channel.
9: Three double quotes initiate the code block to execute in this process.
10: Code to execute: printing the input value x (called using the dollar symbol [$] prefix), splitting the string into chunks with a length of 6 characters ("Hello " and "world!"), and saving each to a file (chunk_aa and chunk_ab).
11: Three double quotes end the code block.
12: End of the first process block.
13: Begins the second process, defined as CONVERTTOUPPER.
14: Input declaration for the CONVERTTOUPPER process.
15: Tells the process to expect one or more input files (path; i.e. chunk_aa and chunk_ab), which we assign to the variable 'y'.
16: Output declaration for the CONVERTTOUPPER process.
17: Tells the process to expect output as standard output (stdout) and send this output as a channel.
18: Three double quotes initiate the code block to execute in this process.
19: Script to read files (cat) using the '$y' input variable, then pipe to uppercase conversion, outputting to standard output.
20: Three double quotes end the code block.
21: End of the second process block.
22: Start of the workflow scope, where each process can be called.
23: Execute the process SPLITLETTERS on the greeting_ch (aka greeting channel), and store the output in the channel letters_ch.
24: Execute the process CONVERTTOUPPER on the letters channel letters_ch, which is flattened using the operator .flatten(). This transforms the input channel so that every item is emitted as a separate element. We store the output in the channel results_ch.
25: The final output (in the results_ch channel) is printed to screen using the view operator (appended onto the channel name).
26: End of the workflow scope.

In practice

Now copy the above example into your favourite text editor and save it to a file named hello.nf.

Execute the script by entering the following command in your terminal:

nextflow run hello.nf

The output will look similar to the text shown below:

Launching `hello.nf` [fabulous_torricelli] DSL2 - revision: 197a0e289a
executor >  local (3)
[c8/c36893] process > SPLITLETTERS (1)   [100%] 1 of 1 ✔
[1a/3c54ed] process > CONVERTTOUPPER (2) [100%] 2 of 2 ✔
HELLO
WORLD!

1: The Nextflow version executed (printed as the first line of the real output; it is not reproduced in the excerpt above).
2: The script name, the generated run name, and the script revision.
3: The executor used (in the above case: local).
4: The first process is executed once (1). The line starts with a unique hexadecimal value, which identifies the task and is the prefix of the directory where it was executed, and ends with the percentage and job completion information.
5: The second process is executed twice (2), once for chunk_aa and once for chunk_ab.
6-7: The result strings from stdout are printed.

Further study

There is a good tutorial on building an RNA-seq analysis pipeline at https://training.seqera.io/#_simple_rna_seq_pipeline. Try following it.
