### Learning objectives

In this exercise you learn how to

- run a [nf-core](https://nf-co.re/) [Nextflow](https://www.nextflow.io/) pipeline,
- configure the resources according to what is available,
- deal with alternative parameter names,
- understand the [nf-core/pangenome](https://github.com/nf-core/pangenome) pipeline's output:
  - [MultiQC](https://multiqc.info/),
  - used CPU, RAM, ...,
  - workflow timeline,
  - output folders,
  - where are my communities?

### Getting started

Make sure you have `screen`, `wget`, `git`, `Nextflow`, and `Docker` installed. All tools are already available on the course workstations.

If you haven't done so before, clone the `pggb` repository into your home folder:

```shell
cd ~
git clone https://github.com/pangenome/pggb.git
```

Now create a directory to work in for this tutorial:

```shell
mkdir day2_nextflow
cd day2_nextflow
ln -s ~/pggb/data
```

Download the 8-haplotype yeast FASTA sequences ([Yeast Population Reference Panel](https://yjx1217.github.io/Yeast_PacBio_2016/welcome/)):

```shell
wget https://zenodo.org/record/7933393/files/scerevisiae8.fa.gz
```

One can distribute the available compute resources efficiently across the different processes of a Nextflow pipeline using [config](https://www.nextflow.io/docs/latest/config.html) files. During the course you have access to 8 threads and 16 gigabytes of memory. To ensure that each run consumes at most these resources and does not take them away from other course participants, please create the following config file:

```shell
mempang23_config="executor {
    cpus = 8
    memory = 16.GB
}"
echo "$mempang23_config" > mempang23.config
```

### Building an LPA pangenome graph with nf-core/pangenome

Whilst `mempang23.config` limits the maximum allocatable resources, one can assign resources to each step of the pipeline using a different config file:

```shell
wget https://raw.githubusercontent.com/nf-core/pangenome/a_brave_new_world/conf/hla.config
```

Check it out! Let's build an LPA pangenome graph.
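Per-process resource files like `hla.config` use Nextflow's `process` scope with `withName` selectors. A minimal sketch of the shape such a file takes (the process names and numbers here are illustrative assumptions, not the actual contents of `hla.config`; check the downloaded file for the real values):

```groovy
// Illustrative per-process resource overrides in Nextflow config syntax.
// Process names and values are assumptions; compare with the real hla.config.
process {
    withName: 'WFMASH_MAP' {
        cpus   = 4
        memory = 2.GB
    }
    withName: 'WFMASH_ALIGN' {
        cpus   = 4
        memory = 2.GB
    }
}
```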
If you are interested in setting additional parameters, you can always visit https://nf-co.re/pangenome/a_brave_new_world/parameters for details.

```shell
nextflow run nf-core/pangenome -r a_brave_new_world -profile docker -c mempang23.config,hla.config --input data/LPA/LPA.fa.gz --n_haplotypes 14 --outdir LPA14
```

Copy the MultiQC report and other statistics to your local machine in order to open them in a web browser:

```shell
scp $USER@$MASCHINE:~/LPA14/multiqc/multiqc_report.html .
scp $USER@$MASCHINE:~/LPA14/pipeline_info/execution_*.html .
scp $USER@$MASCHINE:~/LPA14/pipeline_info/pipeline_dag_*.html .
```

In the MultiQC report you will find vital graph statistics, lots of 1D graph visualizations, and a 2D graph visualization, serving as both quantitative and qualitative graph validation.

In `execution_report_*.html` you can find an overview of the executed pipeline and especially the resource consumption of each process. If you notice that a process consumed much less RAM than it was given in `hla.config`, you might want to adjust this. Assuming you want to run `nf-core/pangenome` on a cluster, it is crucial to limit the allocated resources of each process: smaller jobs usually have a higher chance of being scheduled early by the cluster scheduler.

In `execution_timeline_*.html` one can observe when each process was executed and which processes ran in parallel, assuming resources were available.

Also take a look at all the output folders. In which one is the final graph stored? If you are not sure, take a look at the `pipeline_dag_*.html` file.

### Parallelizing Base-Level Alignments Across a Cluster

One advantage that `nf-core/pangenome` has over `pggb` is that it can parallelize the often heavy base-level alignments across the nodes of a cluster. The parameter `--wfmash_chunks` determines into how many equally large subproblems the alignments are split after the `WFMASH_MAP` process.
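If you want a hint for locating the final graph, a small shell helper can search the output directory. This is a hypothetical sketch: `LPA14` is the `--outdir` used above, and the `*.gfa` pattern is an assumption (pangenome graphs are typically emitted as GFA files):

```shell
# Hypothetical helper: search an nf-core/pangenome output directory
# for graph files. "LPA14" is the --outdir from the run above.
outdir=LPA14
if [ -d "$outdir" ]; then
    find "$outdir" -name '*.gfa'
else
    echo "no output yet -- run the pipeline first ($outdir missing)"
fi
```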
It is recommended that this number roughly matches the number of available nodes. During the course a full cluster is not available, so we improvise: in `hla.config` the number of CPUs for `WFMASH_ALIGN` is set to 4, so on our simulated 8-thread/16 GB machine at most 2 `WFMASH_ALIGN` processes can be executed in parallel. For this lesson it is sufficient to execute only the alignment step of the pipeline, which is selected with `--wfmash_only`:

```shell
nextflow run nf-core/pangenome -r a_brave_new_world -profile docker -c mempang23.config,hla.config --input data/LPA/LPA.fa.gz --n_haplotypes 14 --outdir LPA14_wfmash --wfmash_only --wfmash_chunks 4
```

Examine the `execution_timeline_*.html` to find out if it worked. If you are interested, you can play around with the resources in `hla.config` and see how this affects the parallelism.

### Building a Community Yeast Pangenome

In the previous tutorials you learned how to partition the yeast sequences into communities manually. The `nf-core/pangenome` pipeline can do this automatically. On top of that, it creates a pangenome graph for each of the communities on the fly, merging all of them into one final graph.

Before we create the yeast graph, let's open a `screen` session, since the graph construction will take ~15 minutes:

```shell
screen -R yeast
```

Create your own `yeast.config`. You can start from `hla.config`:

```shell
cp hla.config yeast.config
```

Modify it as you see fit. Once you have run the pipeline, or even during the run, you may have to adjust it (e.g. because you did not reserve enough resources for a specific process).
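When sizing per-process requests, the arithmetic behind the earlier claim that at most 2 `WFMASH_ALIGN` processes run in parallel is simple integer division of the budget by the request. A small shell sketch (the 8-CPU/16 GB budget and the 4-CPU request are the course numbers; the 6 GB per-process memory request is an assumed example):

```shell
# Estimate how many instances of a process can run concurrently
# under the course budget, given a per-process request.
total_cpus=8; total_mem_gb=16   # course budget (mempang23.config)
proc_cpus=4;  proc_mem_gb=6     # per-process request (memory is an assumed example)

by_cpu=$((total_cpus / proc_cpus))
by_mem=$((total_mem_gb / proc_mem_gb))
# The effective parallelism is the smaller of the two bounds.
if [ "$by_cpu" -lt "$by_mem" ]; then echo "$by_cpu"; else echo "$by_mem"; fi
```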
Let's start building:

```shell
nextflow run nf-core/pangenome -r a_brave_new_world -profile docker -c mempang23.config,yeast.config --input scerevisiae8.fa.gz --n_haplotypes 8 --outdir yeast8 --wfmash_map_pct_id 95 --communities --smoothxg_poa_length "1100,"
```

Since this will take some time, you can watch all the steps, or you can check out what's happening in the background. Type `CTRL+A`, then `D`, in order to detach from the `screen` session. With

```shell
less .nextflow.log
```

you can observe what Nextflow is actually doing. If you scroll down to the bottom, you may see in which work directory each process is storing its files and commands. You can `cd` in there and take a look at, for example, `.command.sh` or `.command.log`.

You can always go back to the `screen` session via `screen -r yeast`. Once there, press `CTRL+C` several times in order to abort the pipeline run. Nextflow stores all intermediate files, so we can just continue with `-resume`:

```shell
nextflow run nf-core/pangenome -r a_brave_new_world -profile docker -c mempang23.config,yeast.config --input scerevisiae8.fa.gz --n_haplotypes 8 --outdir yeast8 --wfmash_map_pct_id 95 --communities --smoothxg_poa_length "1100," -resume
```

Once the pipeline is done, you can take a look at the MultiQC report of each community and at the final graph.

### Bonus Challenge

Configure the config file(s) for the lowest run time possible. You can even collaborate with your colleagues on the same VM.
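For the bonus challenge, the execution report and timeline tell you where the time goes. A hedged starting point is to widen the CPU allocation of the processes that dominate the timeline, up to the course budget. The selector patterns and numbers below are assumptions to adapt, not verified contents of any shipped config; use the process names from your own `execution_report_*.html`:

```groovy
// Tuning sketch for the bonus challenge (all values are assumptions).
process {
    withName: '.*WFMASH.*' {
        cpus   = 8          // the full course CPU budget
        memory = 12.GB
    }
    withName: '.*SMOOTHXG.*' {
        cpus   = 8
        memory = 14.GB
    }
}
```

Note the trade-off: a process that requests all 8 CPUs can never run two instances in parallel under the `executor` limit in `mempang23.config`, so for chunked steps such as `WFMASH_ALIGN` a smaller per-process allocation may finish sooner overall.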