MemPanG23 - Day 2b - Pangenome openness and nf-core/pangenome

NOTE: Allow starting a web server for browsing outputs

Let's get a web server running. Log-in again to your virtual machine from another terminal and run:

python -m http.server 889N # replace N with your user number (for example: N = 1 for user 1)

You can access this by pointing your web browser at http://<your_ip>:889N/, where <your_ip> is the ip address of your instance.

curl ifconfig.me

can show you your ip address.

TEMPORARY SOLUTION TO DOWNLOAD FILES ON YOUR COMPUTER

rsync -av user1@<YOUR_MACHINE>:/path/of/your/file .

Pangenome openness

Learning objectives

In this exercise you learn how to

evaluate and interpret the openness of a pangenome.

Getting started

Make sure you have panacus installed. It is already available on the course VMs.

Now create a directory to work on for this tutorial:

cd ~
mkdir day2_openness
cd day2_openness

Download the 44 haplotypes chrM HPRC human pangenome graph (Liao, Asri, Ebler et al., 2023) from the HPRC Pangenome Resources and the 50 haplotypes E. coli pangenome graph (Garrison, Guarracino et al., 2023):

mkdir chrM
cd chrM
wget -c https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/chroms/chrM.hprc-v1.0-pggb.gfa.gz
gunzip chrM.hprc-v1.0-pggb.gfa.gz
mv chrM.hprc-v1.0-pggb.gfa chrM.gfa
cd ..

mkdir ecoli50
cd ecoli50
wget -c https://zenodo.org/record/7937947/files/ecoli50.gfa.zst
zstd -d ecoli50.gfa.zst
cd ..

`odgi heaps`

odgi heaps calculates permutations of the path pangenome coverage in the graph. The number of permutations affects the accuracy of the subsequent power law regression. This regression happens in this Rscript that uses the heap's law (Tettelin et al., 2008) to calculate a pangenome growth curve from all odgi heaps permutations. For more details, take a look at https://en.wikipedia.org/wiki/Pan-genome#Classification.

`Panacus`

panacus is able to calculate the pangenome openness without the need to perform any permutations. Indeed, it directly applies the binomial formula described in Parmigiani et al., 2022, Section 2.1, Eq 1.

panacus exposes a parameter (-c) that allow users to chose which graph feature (sequence, node, edge) is taken into account to calculate the growth histogram parameter. The coverage -l parameter sets the minimum number of haplotypes visiting a graph feature in order for this graph feature to be included into the calculation at all. With -q one can set the minimum fraction of haplotypes that must share a graph feature after each time a haplotype is added to the growth histograph.

For example, assuming a 100 haplotypes pangenome graph, setting -c bps -q 0,1,0.5,0.1 would calculate the pangenome growth in sequence scape (-c bps) for 4 different cases. Remember, quorum sets the minimum fraction of haplotypes for a nucleotide to be included in the openness calculation.

without setting any quorum (-q 0), so all sequences are considered.
limited to sequences that are traversed by 100% of haplotypes (-q 1). This is the core pangenome.
limited to sequences that are traversed by at least 50% of the haplotypes (-q 0.5). This is the shell pangenome.
limited to sequences that are traversed by at least 10% of the haplotypes (-q 0.1). This is the cloud pangenome.

panacus fits two curves, one following heap's law, and one using heap's power law for modeling the data.

A Closed Pangenome?

Create the matrix of path pangenome coverage permutations of the chrM graph with odgi heaps subsequently performing the heap's law regression:

cd chrM
odgi heaps -i chrM.gfa -S -n 100 > chrM.gfa.heaps.tsv
Rscript PATH_TO_ODGI_GIT_REPOSITORY/scripts/heaps_fit.R chrM.gfa.heaps.tsv chrM.gfa.heaps.tsv.pdf # this should be executed on your local machine

Taking a look at the PDF, we can surprisingly observe 2 traces of permutations. How can this happen?

**Hint**

Take a look at the 1D visualization of the graph.

**Explanation**

The CHM13 reference was linearized differently than all the other mitochondrial sequences. Therefore it has an additional tip. odgi heaps permutation algorithm is reflecting this, because the permutation always starts with the beginning of each genome.

So to get a better pangenome growth curve with odgi heap we remove the CHM13 reference sequence and run the analysis again:

odgi paths -i chrM.gfa -L | head -n 1 > chrM.gfa.chm13
odgi paths -i chrM.gfa -X chrM.gfa.chm13 -o chrM.gfa.no_chm13.og
odgi heaps -i chrM.gfa.no_chm13.og -S -n 100 > chrM.gfa.no_chm13.og.heaps.tsv
Rscript PATH_TO_ODGI_GIT_REPOSITORY/scripts/heaps_fit.R chrM.gfa.no_chm13.og.heaps.tsv chrM.gfa.heaps.no_chm13.og.tsv.pdf # this should be executed on your local machine

This looks much better! With every added genome, the number of newly added bases is really low. Let's take a closer look with panacus.

Create and visualize a growth histograph of the chrM graph in sequence space with panacus:

RUST_LOG=info panacus histgrowth chrM.gfa -c bp -q 0,1,0.5,0.1 -t 16 > chrM.gfa.histgrowth.tsv
panacus-visualize.py -e -l "lower right" chrM.gfa.histgrowth.tsv > chrM.gfa.histgrowth.tsv.pdf

**See PNG**

Taking a look at the top Figure in the PDF there is a nearly asymptotic growth of the number of nucleotides with increasing number of genomes.

The bottome Figure shows the added number of bases per added individual. Which is basically 0 for all, except for the first one. In conclusion, ChrM is a closed pangenome.
The core pangenome (quorum >= 1) is very large, with the shell pangenome (quorum >= 0.5) being even slightly larger, the cloud pangenome (quorum >= 0.1) is not noticably larger.

A Open Pangenome

Create the matrix of path pangenome coverage permutations of the E.coli graph with odgi heaps subsequently performing the power law regression:

cd ~/day2_openness/ecoli50
odgi heaps -i ecoli50.gfa -S -n 100 -t 16 > ecoli50.gfa.heaps.tsv
Rscript heaps_fit.R ecoli50.gfa.heaps.tsv ecoli50.gfa.heaps.tsv.pdf

With every added genome, the number of newly added bases is at least 100k. Let's take a closer look with panacus.

Create and visualize a growth histograph of the E. coli graph in sequence space with panacus:

RUST_LOG=info panacus histgrowth ecoli50.gfa -c bp -q 0,1,0.5,0.1 -t 16 > ecoli50.gfa.histgrowth.tsv
panacus-visualize.py -e -l "upper left" ecoli50.gfa.histgrowth.tsv > ecoli50.gfa.histgrowth.tsv.pdf

**See PNG**

Taking a look at the top Figure in the PDF there is a polynomial growth of the number of nucleotides with increasing number of genomes visible.

The bottom Figure shows the added number of bases per added individual. Even the 50th added individual adds more than 100k new bases to the pangenome. In conclusion, the E. coli pangenome is open. The core pangenome (quorum >= 1) is quite small, with the shell pangenome (quorum >= 0.5) not adding much more sequence, and the cloud pangenome (quorum >= 0.1) adding some more sequence.

Acknowledgments

The MemPanG23 team wants to thank Daniel Doerr for revising and polishing the Pangenome openness part of the tutorial.

nf-core/pangenome

Learning objectives

In this exercise you learn how to

run a nf-core Nextflow pipeline,
configure the resources according to what is available,
deal with alternative parameter names,
understand the nf-core/pangenome pipeline's output:
- MutiQC,
- used CPU, RAM, …
- workflow timeline,
- output folders,
- where are my communities?

Getting started

Make sure you have screen, wget, git, Nextflow, and Docker installed. All tools are already available on the course VMs.

If you haven't done before, clone the pggb repository into your home folder:

cd ~
git clone https://github.com/pangenome/pggb.git

Now create a directory to work in for this tutorial:

mkdir day2_nextflow
cd day2_nextflow
ln -s ~/pggb/data

Download the 8 haplotypes yeast FASTA sequences (Yeast Population Reference Panel):

wget https://zenodo.org/record/7933393/files/scerevisiae8.fa.gz

One can distribute the available compute resources efficiently across the different processes of the Nextflow pipeline using config files. During the course you have access to 16 threads with 32 gigabytes of memory. To ensure that each run only consumes up to these resources and does not snitch them from other course participants, please create the following config file:

mempang23_config="executor {
  cpus = 16
  memory = 32.GB
}"
echo "$mempang23_config" > mempang23.config

Building a LPA pangenome graph with nf-core/pangenome

Whilst we can limit the maximum allocatable resources with mempang23.config, one can assign resources for each step of the pipeline using a different config file:

wget https://raw.githubusercontent.com/nf-core/pangenome/a_brave_new_world/conf/hla.config

Check it out!

Let's build an LPA pangenome graph. If you are interested in setting additional parameters you can always visit https://nf-co.re/pangenome/a_brave_new_world/parameters for details. All parameters starting with one - are handed over to Nextflow, all parameters starting with two - are handled by the pipeline itself:

nextflow run nf-core/pangenome -r a_brave_new_world -profile docker -c mempang23.config,hla.config --input data/LPA/LPA.fa.gz --n_haplotypes 14 --outdir LPA14

In which folder can the final ODGI graph be found? And in which folder in the final GFA graph?

**See Solution**

ODGI: odgi_sort
GFA: odgi_view

Open the MultiQC report and other statistics on your local machine in order to take a closer look.

LPA14/multiqc/multiqc_report.html .
LPA14/pipeline_info/execution_*.html .
LPA14/pipeline_info/pipeline_dag_*.html .

In the MultiQC report you will find vital graph statistics, lots of 1D graph visualizations and a 2D graph visualization serving both as quantitative and qualitative graph validation information. In execution_report_*.html* you can find an overview of the executed pipeline and especially the resource consumption of each process of the pipeline. If you notice that a process is consuming much less RAM than it was given in hla.config you might want to adjust this. Assuming you want to run nf-core/pangenome on a cluster, it is crucial to limit the allocated resources for each process, so your jobs usually have a higher chance to be submitted by the cluster scheduler. In execution_timeline_*.html one can observe when which process was executed and which processes were submitted in parallel, assuming resources were available.

Also take a look at all the output folders. In which one is the final graph stored? If you are not sure, take a look a the pipeline_dag_*.html file.

Parallelizing base-level alignments across a cluster

One advantage that nf-core/pangenome has over pggb is that it can parallelize the often heavy base-level alignments across nodes of a cluster. The parameter --wfmash_chunks determines into how many equally large subproblems the alignments should be split after the WFMASH_MAP process. It is recommended that this number roughly fits the number of available nodes one has. During the course, a full cluster is not available, so we are improvising. In hla.config the number of CPUs for WFMASH_ALIGN is set to 4. Assuming we are able to run this in parallel on our simulated 16T/32GB machine, one can expect that at most 4 WFMASH_ALIGN process can be executed in parallel. For this lesson, it is sufficient, that we only execute the alignment step of the pipeline, which is set with --wfmash_only:

nextflow run nf-core/pangenome -r a_brave_new_world -profile docker -c mempang23.config,hla.config --input data/LPA/LPA.fa.gz --n_haplotypes 14 --outdir LAP14_wfmash --wfmash_only --wfmash_chunks 4

Examine the execution_timeline_*.html to find out if it worked. If you are interested, you can play around with the resources in hla.config and see how this affects the parallelism.

Building a community yeast pangenome

In the previous tutorials, you learned how to partition the yeast sequences into communities manually. The nf-core/pangenome pipeline is able to do this automatically. On top of that, it will create a pangenome graph for each of the communities on the fly, merging all of them in one final graph.
Before we create the yeast graph, let's open a screen session, since the graph construction will take ~8 minutes:

screen -R yeast

Create your own yeast.config. You can start with the hla.config one:

cp hla.config yeast.config

Modify it as you see fit. Once you ran the pipeline, or even during the pipeline run, you may have to adjust it (e.g. because you did not reserve enough resources for a specific process). Let's start building:

nextflow run nf-core/pangenome -r a_brave_new_world -profile docker -c mempang23.config,yeast.config --input scerevisiae8.fa.gz --n_haplotypes 8 --outdir yeast8 --wfmash_map_pct_id 95 --communities --smoothxg_poa_length "1100,"

Since this will take some time, you can watch all the steps, or you can check out what's happening in the background. Type CONTROL+A+D in order to detach from the screen session. With

less .nextflow.log

you can observe what Nextflow is actually doing. If you scroll down to the bottom, you may see in which output folder each process is storing there files and commands. You can cd in there and take a look at for example .command.sh or command.log. You can always go back to the screen session via screen -r yeast. Once there, type several times CONTROL+C in order to abort the pipeline run. Nextflow stores all intermediate files, so we can just continue again with -resume:

nextflow run nf-core/pangenome -r a_brave_new_world -profile docker -c mempang23.config,yeast.config --input scerevisiae8.fa.gz --n_haplotypes 8 --outdir yeast8 --wfmash_map_pct_id 95 --communities --smoothxg_poa_length "1100," -resume

Once the pipeline is done, you can take a look a the MultiQC report of each community and the final graph(s).

Bonus Challenge

Configure the config file(s) for the lowest run time possible. You can even collaborate with your colleagues on the same VM only executing one Nextflow instance.