# Day 1 Doc
## Schedule
| Time (CET) | Program |
|------------|---------|
| 14:00 - 14:55 | Setup & Introductions |
| 15:00 - 15:40 | Getting Started with Nextflow |
| 15:45 - 16:20 | Nextflow Scripting |
| 16:25 - 16:50 | Nextflow parameterization |
| 16:55 - 17:35 | Channels |
| | |
| 17:35 - 18:30 | **Break** |
| | |
| 18:30 - 19:10 | Processes |
| 19:15 - 20:00 | Processes II |
| Extra? | Workflows |
## Session 0 - Setup
This is a living document that we will use throughout the course, with the aim of documenting as much as possible so that it becomes a useful dry-lab workbook.
### Poll 1
1. I have installed the nextflow extension: ***** ***** *****
2. I will use the web version: ***
### Poll 2
1. I am connected: Anna Tom Tracy Matthew Mikhail | * | Ioana Junmo Francesco Juan |Panos |Brittany Hira
2. I am having connection issues: Alex
3. Other (please specify):
4. Could not connect via desktop app (Panos)
### Connection
Use the command palette (`Menu > View > Command Palette`) and start typing `>Remote-SSH: Connect to Host...`. Then select `franklin` as the host (or `franklin.upsc.se`). Alternatively, use the green button in the bottom-left corner.
The initial setup is to `Add New SSH Host` and type in `ssh student@franklin.upsc.se:PORT`, where your port is listed in the table below. You will be prompted for the operating system; choose `linux`. A new window will open, prompting for the password. Use the password we shared (and keep it secret!).
| Name | Port |
|------|------|
|Mikhail Osipovitch|9000|
|Kun Li|9001|
|Alexander Vergara|9002|
|Tom Jenkins|9003|
| Hira Naveed|9004|
|Nick Pestell|9005|
|Brittany Roberts|9006|
|Matthew Adair|9007|
|Panagiotis Provataris|9008|
|Ioana Onut Brännström|9009|
|Junmo Sung|9010|
|Juan Gaitan|9011|
|Francesco Gualdrini|9012|
|Anna Sommer|9013|
|Tracy Chew|9014|
|Mohsen Hegab|9015|
```
student@franklin.upsc.se:PORT
example: student@franklin.upsc.se:9019
```
### :question: Questions!
* Is there any extension within VS Code for a VPN connection? Or if we need a VPN agent, do we run it outside VS Code (some institutions require one when working off-site)?
    * The VPN is independent of VS Code; if you have _e.g._ Cisco connected, traffic should get routed through it and you should be fine. Discuss the specifics with your corporate IT.
* Is there an extension within VScode for X11 Forwarding? And keen to hear any about other cool extensions that you recommend :smile:
    * In general, X11 and VS Code do not play well with each other, but there are data science plugins, _e.g._ for Jupyter, R, etc., that render plots inside VS Code, so there isn't really a need for it.
* I see Nextflow is already pre-installed in the franklin space. Could you show best practice for installing on HPC, AWS or others (containers, conda envs, etc.)? Which way to install Nextflow is the best?
    * `conda` works well; most of the time, your local IT support (on HPCs - High Performance Computing clusters) will have it pre-installed.
* `curl -s https://get.nextflow.io | bash`
* Needs:
* Java 11+
* curl
* bash
* What is the best practice for order of code? I.e. I've seen colleagues do params > process(es) > workflow. You have presented params > workflow > process(es).
    * I don't think that there are real guidelines for that (Nico), but I might have missed a Good Coding Practice guide for NF. When workflows get really big, you can actually split them into multiple files. My opinion is that as long as you have a clear structure and it's shared with your colleagues, it should be fine.
* I see under input: path read - how did it know to check input_ch?
* In the workflow, you call the process `NUM_LINES(input_ch)`. In the process you assign the input as the variable `read` (name is arbitrary, you choose), which is of type `path` (_i.e._ it expects a filename)
* I see! What if you have multiple inputs per process?
* Then you will provide multiple inputs :smile: I'm sure we will hear about that :wink:
* If you want now to do something else with the reads how would you continue the workflow? Can you highlight the blocks that execute a feature one more time. Thank you.
    * We will talk more about this, but basically you would either extend the `script` in the current `process`, or you would create a new `process` that takes as input the output of `NUM_LINES`. How to chain the processes is what you define in the `workflow` section. Whether you extend the `script` or create a new `process` depends on several factors, but usually you'd like processes to do atomic operations; _i.e._ a very clearly defined single (if not simple) operation (like compressing a file). The advantage is that such a process can easily be re-used. It is, however, a balancing act between being atomic and being time-efficient when writing the pipeline. (my opinion, Nico)
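A minimal sketch of such chaining, with two hypothetical processes (not the course's `wc.nf`; `params.input` is assumed to be a file glob):
```{nextflow}
// Hypothetical atomic process: count lines per file
process COUNT_LINES {
    input:
    path(reads)

    output:
    path("${reads}.count")

    script:
    """
    zcat ${reads} | wc -l > ${reads}.count
    """
}

// Hypothetical second atomic process: compress the counts
process COMPRESS {
    input:
    path(counts)

    output:
    path("${counts}.gz")

    script:
    """
    gzip -c ${counts} > ${counts}.gz
    """
}

workflow {
    input_ch = Channel.fromPath(params.input)
    // the output channel of COUNT_LINES becomes the input of COMPRESS
    COMPRESS(COUNT_LINES(input_ch))
}
```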
* The #!… will always be the same even when coding and using a conda env?
* Yes, unless you are on a very exotic Operating System, but I have not seen any in a very long time, so let's go for 99% :wink:
* How can one add a comment at the end of a line? Like the use of `#` in bash.
* Like in Java: use `//` - see what Bastian is doing in the code. You can have multi-line comments using:
```{nextflow}
/*
some longer comments
*/
```
* How do you connect the parameter file with the nextflow script. I missed it in the explanation.
* See the example in section 3 [below](https://hackmd.io/HkpBvb1eQ82vGPLVKwMDqg?both#Use-a-param-file) :smile:
* I know there are usually several config files in additions to the params. Could you suggest best practice to organize and keep tidy an NF code?
    * As you just heard, that's also more or less a personal preference. Bastian recommends having Nextflow settings in `nextflow.config` and params in a separate file. If you have a large pipeline and a lot of config, you can use the `includeConfig` statement in `nextflow.config` to split your config across multiple files.
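A minimal sketch of such a split (the file names `params.config` and `resources.config` are hypothetical):
```{nextflow}
// nextflow.config
includeConfig 'params.config'     // pipeline parameters
includeConfig 'resources.config'  // per-process cpu/memory settings
```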
* As mentioned, can the separate params file be written in Groovy and sourced from within the nf workflow? (Can vs. should :smile:)
* You can, but you probably shouldn't (if you still want, you can look at the `import` keyword in Groovy)
* How would you recommend architecting your scripts if for example, you had 100 samples to align, 2 fastq's each? i.e. run nextflow command 100 times, or write code to create params.json and run nextflow once?
    * Much easier than this! There is a channel factory to identify paired-end (PE) data. Then for the 100 samples, you just use globbing as we just did (_i.e._ `nextflow run ... --input "*.fq.gz"`) and you let Nextflow do the magic. With `input` being your `input_ch`, Nextflow is going to iterate per PE set. If in addition you configure Nextflow to submit to your queueing system (_e.g._ SLURM, PBS, whatever), it will handle all the job submission. Neat!
* So it will automatically pattern match .fq.gz files to identify pairs? That is pretty cool :)
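A minimal sketch of that pairing, using the `Channel.fromFilePairs` factory (the glob below assumes files named like `ref1_1.fq.gz` / `ref1_2.fq.gz`):
```{nextflow}
// Groups mate files per sample and emits [sample_id, [read1, read2]] tuples,
// e.g. [ref1, [ref1_1.fq.gz, ref1_2.fq.gz]]
Channel.fromFilePairs("data/yeast/reads/ref*_{1,2}.fq.gz", checkIfExists: true)
       .view()
```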
* I didn't really get it: what is the use of channels?
    * Channels are how Nextflow communicates between processes: https://www.nextflow.io/docs/latest/channel.html. channel (data input) -> process -> channel (process output) -> possible channel modification (next data input) -> next process -> next channel -> next process -> _etc_...
* So in a real example, would you say use sample IDs in a queue channel to process each sample 1 by 1, then use value channel to keep calling reference genome channel?
    * Yup. If you look at https://www.nextflow.io/docs/latest/channel.html, you will see there are queue channel factories to get sequencing files as paired-end reads, or even straight from the SRA. Reusing a queue channel is useful if you want a fork in your workflow, _e.g._ running fastqc on all raw data (one process) while running sortmerna on the raw data (second process).
* difference between value and queue channels?
    * Nextflow distinguishes two different kinds of channels: queue channels and value channels. A queue channel is a non-blocking unidirectional FIFO queue which connects two processes or operators. A value channel, a.k.a. singleton channel, is by definition bound to a single value and can be read an unlimited number of times without consuming its content. So a queue is something you take values from, and the queue eventually gets exhausted once all the values it stores have been emitted. A value channel is infinitely re-usable.
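A minimal sketch of the difference:
```{nextflow}
// Value channel: bound to a single value, re-usable by any number of processes
ch_genome = Channel.value("GRCh38")

// Queue channel: FIFO, its items are consumed as they are read
ch_samples = Channel.of("sample1", "sample2", "sample3")

ch_genome.view()   // prints GRCh38
ch_samples.view()  // prints sample1, sample2, sample3 (one per line)
```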
* I saw in nf-core that not all params are converted to actual value Channels. Is this normal practice? in several codes they simply assign params values to other variables i.e. ch_fast = params.faster
* Yes, for example if you have a param which is your genome version, you do not need it to be in a channel, a param is enough. It's again a more or less personal choice. You could also create a `Channel.value(params.genome)` :smirk:
* Is it advisable to tie processes to their own environment (conda or docker) to track down tool versioning?
    * Yes. That ensures reproducibility. As an example of a process:
```{nextflow}
process FASTQC_SE {
    // Run fastqc per file
    tag "fastqc ${reads}"
    publishDir "results/$params.dataset/fastqc/", mode: "copy"
    container 'docker://bschiffthaler/fastqc'

    input:
    path(reads)

    output:
    path("${reads.simpleName}_fastqc.html")

    script:
    """
    fastqc ${reads}
    """
}
```
I have Singularity enabled (more about that tomorrow), so Singularity is going to pull the `docker://bschiffthaler/fastqc` image. If you use a tag, _e.g._ `docker://bschiffthaler/fastqc:0.11.9`, you would force a specific version.
Of course the container could be a config entry/param, and you could keep all versions in a separate file.
* So with conda, automation of the download is not doable like with docker?
* It also is, you just need to tell conda what version to fetch.
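A minimal sketch using the `conda` directive (the exact package pin is hypothetical):
```{nextflow}
process FASTQC_SE {
    // Nextflow creates and activates this conda environment for the process
    conda 'bioconda::fastqc=0.11.9'

    input:
    path(reads)

    script:
    """
    fastqc ${reads}
    """
}
```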
* How do you turn a queue channel into a value channel holding a list? Like when all the samples have been aligned, I want to collect all of them into a single channel.
* It's called `collect` and we will see it later :slightly_smiling_face: - see [collect](https://www.nextflow.io/docs/latest/operator.html?highlight=collect#collect) in the doc.
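A minimal sketch of `collect`:
```{nextflow}
// collect() gathers all items of a queue channel into a single list,
// emitted as a value channel
Channel.of(1, 2, 3)
       .collect()
       .view()  // prints [1, 2, 3]
```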
* Can the naming in the work directory be controlled? Like outputting meaningful names?
* Sure. You can control the output name: check the code bit above where I output a filename depending on the input. Input is `reads`, output is `"${reads.simpleName}_fastqc.html"`
It is actually quite essential, as tools (such as fastqc) often have "hardcoded" output file names, so you will want to capture them accordingly. There is more to it, in the sense that you can choose which outputs will be kept and which will be considered temporary. Usually you do not want all the files created during pre-processing to be saved. In my example above, as you can see, I actually copy the fastqc results to my "publish" directory, where the final results will be found. Instead of copying, I could move them, saving space (tbh, the block above is work in progress; when prototyping I like to copy and be fairly verbose. At a later stage, I will clean it up).
* What is the advantage of using the value channels as opposed to use directly the params.kmer for example as the params.kmer will also be reusable?
    * It is more flexible. You can have `params.samples = "*.fq.gz"` and use that, or you can use a channel like:
```{nextflow}
ch_input = Channel.fromFilePairs("$params.dataset" + '/*_{1,2}.f*q.gz', checkIfExists:true)
```
that will allow you to handle paired-end data without any extra effort of handling it yourself.
Even if that channel is consumable, you can use it in multiple processes:
```{nextflow}
workflow {
FASTQC(ch_input)
SORTMERNA(ch_input)
}
```
So the same channel is used for FastQC while also being forked to run through SortMeRNA (rRNA identification / removal).
### :exclamation: Comments:
* Reentrant: the same as "resume". The pipeline can restart from the latest completed stage, skipping all the previous steps. It is not the default in Nextflow; you need to use the `-resume` flag on the command line (note the single dash: it is a Nextflow option, not a pipeline parameter).
* DSL - A domain-specific language (DSL) is a [computer language](https://en.wikipedia.org/wiki/Computer_language) specialized to a particular application [domain](https://en.wikipedia.org/wiki/Domain_(software_engineering)). The second iteration of the Nextflow DSL (a.k.a. DSL2) greatly simplified the use of Nextflow, see this blog [post](http://www.ens-lyon.fr/LBMC/intranet/services-communs/pole-bioinformatique/bioinfoclub_list/nextflow-dsl2-laurent-modolo).
## Links
[Nextflow documentation](https://www.nextflow.io/docs/latest/)
[Course from SciLifeLab, the Swedish National Sequencing / Bioinformatics facility](https://uppsala.instructure.com/courses/58267/pages/nextflow-1-introduction?module_item_id=387489)
[Some extra content from the same course](https://uppsala.instructure.com/courses/58267/pages/nextflow-7-extra-material?module_item_id=468797)
[Another good "book" reference](https://bioinformaticsworkbook.org/dataAnalysis/nextflow/02_creatingAworkflow.html#gsc.tab=0)
[Some good tips from the Andersen lab (Denmark)](https://andersenlab.org/dry-guide/2021-12-01/writing-nextflow/)
[Existing workflows - nf-core](https://nf-co.re/) - you can pull them and run them. The webpage also offers the possibility to graphically configure them and export the corresponding config file.
## Session 1 - Getting Started with Nextflow
```{bash}
cd nf-training/
cp scripts/introduction/wc.nf .
nextflow run wc.nf
```
## Session 2 - Nextflow scripting
```{nextflow}
#!/usr/bin/env nextflow
// printing some lines
println("Hello, world!")
println "Hello city"
// Types
my_var = 1
my_f = 3.14159265
my_bool = false
my_s = "chr1"
my_pattern = /\d+/
text = """
this is a multi
line string
"""
println "Current ${my_var}_chromosome"
// Compounds
kmers = [1,2,4]
kmers[0]
kmers[-1]
println kmers[0..1]
println "My kmers: ${kmers[0..1]}"
println "The list of kmers is ${kmers.size()} elements long"
/*
some longer comments
*/
// Maps
roi = [chr: "chr10", start: 10000, end: 12000, genes: ["ATP1B2", "TP53"]]
println roi["chr"]
println roi.chr
println roi.get("chr")
// Closures
square = { it * it }
// "it" is the default variable name
// you can redefine the variable used in closures as follows:
// square = { variable -> variable * variable}
x = [1,2,3,4]
y = x.collect(square)
println x
println y
```
## Session 3 - Workflow parameterization
```{bash}
nextflow run wc.nf --input "data/yeast/reads/ref2*.fq.gz"
```
### Exercise:
Re-run the Nextflow script `wc.nf`, changing the pipeline input to all files in the directory `data/yeast/reads/` that begin with `ref` and end with `.fq.gz`.
Put a star next to this comment: ***** ***** ****
```{bash}
student@f638cd152d21:~/nf-training$ nextflow run wc.nf --input "data/yeast/reads/ref*.fq.gz"
N E X T F L O W ~ version 22.04.5
Launching `wc.nf` [marvelous_ritchie] DSL2 - revision: c6e739fcd6
executor > local (6)
[3d/724e39] process > NUM_LINES (6) [100%] 6 of 6 ✔
ref2_2.fq.gz 81720
ref3_2.fq.gz 52592
ref2_1.fq.gz 81720
ref1_1.fq.gz 58708
ref1_2.fq.gz 58708
ref3_1.fq.gz 52592
```
### Exercise 2:
1. Command-line: 11111111111
2. External file: 2222222222
3. In the pipeline: 33333333
____
Answer:
1. Parameters specified on the command line (--something value)
2. Parameters provided using the -params-file option
3. Config file specified using the -c my_config option
4. The config file named nextflow.config in the current directory
5. The config file named nextflow.config in the workflow project directory
6. The config file $HOME/.nextflow/config
7. Values defined within the pipeline script itself (e.g. main.nf)
### Use a param file:
The `params.json` file
```{json}
{
"input": "data/yeast/reads/ref1_1.fq.gz",
"sleep": 2
}
```
The script call:
```{bash}
nextflow run wc.nf -params-file params.json
```
## Session 4 - Channels
```{nextflow}
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
ch1 = Channel.value("GRCh38")
ch2 = Channel.value(["chr1","chr2","chrX"])
ch3 = Channel.value(["chr": "chr1", "start": 10000, "stop":12000])
ch2.view()
chr_channel = Channel.of("chr1","chr2","chrX")
chr_channel.view()
```
If you use the Factory: `Channel.fromSRA()` you will need an API key to access the data, see:
https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
Channels docs:
https://www.nextflow.io/docs/latest/channel.html#channel-factory
## Session 5 - Processes
In a file called `process.nf`
```{nextflow}
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
// salmon index -t <fasta_transcriptome> -i <out_dir> --kmerLen 29
process INDEX {
script:
"""
salmon index -t ${projectDir}/data/yeast/transcriptome/Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz \
-i ${projectDir}/data/yeast/salmon_index --kmerLen 29
"""
}
workflow {
INDEX()
}
```
```{bash}
nextflow run process.nf
```
You can use conditionals in processes, _e.g._
```{nextflow}
script:
// the condition is Groovy code, so it sits outside the script string
if (params.aligner == "salmon")
    """
    salmon index -t ${projectDir}/data/yeast/transcriptome/Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz \
        -i ${projectDir}/data/yeast/salmon_index --kmerLen 29
    """
else if (...)
    """
    ...
    """
```
## Session 6 - Processes Part 2
## Session 7 - Workflow
## Session 8 - Operators
## Session 9 - Nextflow configuration