<!---
panacus 0.2 https://anaconda.org/bioconda/panacus
--->
# HPRC HUGO24 Workshop - Building and Analyzing Pangenome Graphs
#### NOTE: Start a web server for browsing outputs
Let's get a web server running. Log in to your virtual machine again from another terminal and run:
```shell
python -m http.server 889N # replace N with your user number (for example: N = 1 for user 1)
```
You can access this by pointing your web browser at `http://<your_ip>:889N/`, where `<your_ip>` is the IP address of your instance.
curl ifconfig.me
will show you your IP address.
<br/>
##### TEMPORARY SOLUTION TO DOWNLOAD FILES TO YOUR COMPUTER
rsync -av user1@<YOUR_MACHINE>:/path/of/your/file .
## PGGB
### Learning objectives
- build pangenome graphs using pggb
- explore pggb's results
- understand how parameters affect the built pangenome graphs
### Getting started
Make sure you have `pggb` v0.5.4 and its tools installed. It is already available on the course workstations. If you want to build everything on your laptop, follow the instructions at the [pggb homepage](https://github.com/pangenome/pggb) (`guix`, `docker`, `singularity`, and `conda` alternatives are available). Then check out the pggb repository:
cd ~
git clone https://github.com/pangenome/pggb.git
Now create a directory to work in for this part of the tutorial:
mkdir hprc_hugo24_pggb
cd hprc_hugo24_pggb
ln -s ../pggb/data
<!---
Take a look at the sequence names of `data/HLA/DRB1-3123.fa.gz.fai`:
cut -f 1 data/HLA/DRB1-3123.fa.gz.fai
gi|568815592:32578768-32589835
gi|568815529:3998044-4011446
gi|568815551:3814534-3830133
gi|568815561:3988942-4004531
gi|568815567:3779003-3792415
gi|568815569:3979127-3993865
gi|345525392:5000-18402
gi|29124352:124254-137656
gi|28212469:126036-137103
gi|28212470:131613-146345
gi|528476637:32549024-32560088
gi|157702218:147985-163915
They do not correspond to the [PanSN-spec](https://github.com/pangenome/PanSN-spec) with '#' as the delimiter. Let's fix this first:
cp data/HLA/DRB1-3123.fa.gz ./
gunzip DRB1-3123.fa.gz
cat DRB1-3123.fa | while read line; do [[ $line = '>'* ]] && echo $(echo $line | cut -f 1 -d" ")#1 || echo $line; done > DRB1-3123.pansn.fa
bgzip DRB1-3123.pansn.fa
samtools faidx DRB1-3123.pansn.fa.gz
--->
### Build HLA pangenome graphs
The [human leukocyte antigen (HLA)](https://en.wikipedia.org/wiki/Human_leukocyte_antigen) system is a complex of genes on chromosome 6 in humans which encode cell-surface proteins responsible for the regulation of the immune system.
Let's build a pangenome graph from a collection of sequences of the DRB1-3123 gene:
pggb -i data/HLA/DRB1-3123.fa.gz -n 12 -t 16 -o out_DRB1_3123
Run `pggb` without parameters to get information on the meaning of each parameter:
pggb
Take a look at the files in the `out_DRB1_3123` folder.
Why did we specify `-n 12`?
<details>
<summary>Click me for the answer</summary>
The pggb graph is defined by the number of mappings per segment of each genome (`-n, --n-mappings N`). Ideally, you should set this to the number of haplotypes in the pangenome, because one expects at most `number of haplotypes - 1` secondary mappings and alignments per segment. Keep in mind that the total work of alignment is proportional to `N*N`, and these multimappings can be highly redundant. If you provide an `N` that is not equal to the number of haplotypes, pass the actual number of haplotypes to `-H`; this helps `smoothxg` determine the right POA problem size.
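As a quick sanity check you can count the haplotypes directly from the FASTA index that ships with the `pggb` example data (the expected count in the comment is our assumption based on the 12 input sequences):
```shell
# count the number of input sequences (haplotypes) in the FASTA index
wc -l < data/HLA/DRB1-3123.fa.gz.fai
# expected: 12
```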
<!---
`pggb` expects sequence names to follow the [PanSN-spec](https://github.com/pangenome/PanSN-spec). This potentially enables `pggb` to automatically detect the correct paremeter setting for `-c, --n-mappings N` and `-n, --n-haplotypes N`. For teaching, we set this manually.
Setting this parameter is important, because the `pggb` graph is defined by the number of mappings per segment of each genome `-c, --n-mappings N`. Ideally, you should set this to the number of haplotypes minus 1 in the pangenome. Because that's the maximum number of secondary mappings and alignments that we expect. Keep in mind that the total work of alignment is proportional to `O(N) = N*N`, and these multimappings can be highly redundant. `-n, --n-haplotypes N` helps `smoothxg` to determine the right POA problem size.
--->
</details>
<br />
How many alignments were executed during the pairwise alignment (take a look at the PAF output)? Visualize the alignments:
cd out_DRB1_3123
pafplot *.paf --size 2000
cd ..
This creates a `*.paf.png` file; examine the PNG on your local laptop.
Use `odgi stats` to obtain the graph length, and the number of nodes, edges, and paths. Do you think the resulting pangenome graph represents the input sequences well? Check the length and the number of the input sequences to answer this question.
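A possible invocation (a sketch; the `*.smooth.final.og` glob assumes the default `pggb` output naming):
```shell
# graph summary: length, nodes, edges, paths
odgi stats -i out_DRB1_3123/*.smooth.final.og -S
# input sequence names and lengths, for comparison
cut -f 1,2 data/HLA/DRB1-3123.fa.gz.fai
```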
<details>
<summary>Click me for the answer</summary>
The total graph length is much longer than the length of most of the input sequences. This indicates that the input sequences are under-aligned.
</details>
<br />
How many blocks were selected and 'smoothed' during the two rounds of graph normalization?
<details>
<summary>Hint</summary>
Take a look at the `*.log` file to answer this question.
</details>
<br />
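One way to search the log for this information (the exact wording of the `smoothxg` messages may differ between versions, so treat this only as a starting point):
```shell
# look for block-related messages from the two smoothxg rounds
grep -i "block" out_DRB1_3123/*.log | head
```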
Try building the same pangenome graph by specifying a lower percent identity (`-p 90` by default):
pggb -i data/HLA/DRB1-3123.fa.gz -p 80 -n 12 -t 16 -o out2_DRB1_3123
Check graph statistics. Does this pangenome graph represent the input sequences better or worse than the previously produced graph?
<details>
<summary>Click me for the answer</summary>
The total graph length is now closer to the lengths of the input sequences.
</details>
<br />
Try to decrease the number of mappings to keep for each segment:
pggb -i data/HLA/DRB1-3123.fa.gz -p 80 -n 6 -t 16 -o out3_DRB1_3123
How does it affect the graph?
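To compare the runs side by side, you could loop over the output folders (a sketch; again assuming the default `pggb` output naming):
```shell
# print summary statistics for each pggb run produced so far
for d in out_DRB1_3123 out2_DRB1_3123 out3_DRB1_3123; do
    echo "== $d"
    odgi stats -i "$d"/*.smooth.final.og -S
done
```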
Try to increase the target sequence length for the partial order alignment (POA) problem (`-G 700,900,1100` by default):
pggb -i data/HLA/DRB1-3123.fa.gz -p 80 -n 12 -t 16 -G 1400,1800,2200 -o out4_DRB1_3123
How is this changing the runtime and the memory usage? How is this affecting graph statistics? How many blocks were selected and 'smoothed' during the two rounds of graph normalization?
<details>
<summary>Hint</summary>
Take a look at the `*.log` file to answer this question.
</details>
<br />
Take the second `pggb` run and try to increase the segment length (`-s 5000` by default):
pggb -i data/HLA/DRB1-3123.fa.gz -s 20000 -p 80 -n 12 -t 16 -o out5_DRB1_3123
How is this affecting graph statistics? Why?
<details>
<summary>Hint</summary>
The segment length for mapping is now so large, given the sequence identity, that some mappings are no longer possible.
</details>
<br />
<!---
`pggb` produces intermediate graphs during the process. Let's keep all of them:
pggb -i DRB1-3123.pansn.fa.gz -p 80 -n 12 -c 11 -t 16 --keep-temp-files -o out2_DRB1_3123_keep_intermediate_graphs
What does the file with name ending with `.seqwish.gfa` contain? And what about the file with name ending with `.smooth.1.gfa`?
Take a look at the graph statistics of all the GFA files in the `out2_DRB1_3123_keep_intermediate_graphs` folder.
<details>
<summary>Click me for the answer</summary>
- The `.seqwish.gfa` graph is the graph that directly comes out of `seqwish`.
- `.smooth.[0-1].gfa` graphs are intermediate graphs after each round of `smoothxg`.
</details>
<br />
--->
## nf-core/pangenome
### Learning objectives
In this exercise you learn how to
- run a [nf-core](https://nf-co.re/) [Nextflow](https://www.nextflow.io/) pipeline,
- configure the resources according to what is available,
- deal with alternative parameter names,
- understand the [nf-core/pangenome](https://github.com/nf-core/pangenome) pipeline's output:
- [MultiQC](https://multiqc.info/),
- used CPU, RAM, ...
- workflow timeline,
- output folders
### Getting started
Make sure you have `screen`, `wget`, `git`, `Nextflow`, and `Docker` installed. All tools are already available on the course VMs.
Now create a directory to work in for this tutorial:
cd ~
mkdir hprc_hugo24_nextflow
cd hprc_hugo24_nextflow
One can distribute the available compute resources efficiently across the different processes of the Nextflow pipeline using [config](https://www.nextflow.io/docs/latest/config.html) files. During the course you have access to 32 threads with 64 gigabytes of memory. To ensure that each run only consumes up to these resources, please create the following config file:
hprc_hugo24_config="executor {
cpus = 32
memory = 64.GB
}"
echo "$hprc_hugo24_config" > hprc_hugo24.config
Prepare input data:
# download the HPRC PGGB chr20 graph
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/chroms/chr20.hprc-v1.0-pggb.gfa.gz
gunzip chr20.hprc-v1.0-pggb.gfa.gz
We want to build a graph with 4 haplotypes, so we need to extract a subset from all the sequences in the graph:
# extract all sequences in FASTA format with ODGI
odgi paths -i chr20.hprc-v1.0-pggb.gfa -f -t 32 -P > chr20.hprc-v1.0-pggb.gfa.fa
# index the FASTA
samtools faidx chr20.hprc-v1.0-pggb.gfa.fa
We select the two references CHM13 and GRCh38, and the 2 haplotypes of the HG01978 diploid sample:
grep "chm13\|grch38\|HG01978" chr20.hprc-v1.0-pggb.gfa.fa.fai | cut -f 1 > chr20.pan4.txt
# fetch the sequences of the desired haplotypes
samtools faidx chr20.hprc-v1.0-pggb.gfa.fa -r chr20.pan4.txt > chr20.hprc.pan4.fa
# zip it
bgzip -@32 chr20.hprc.pan4.fa
# index the FASTA
samtools faidx chr20.hprc.pan4.fa.gz
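As a sanity check, you can verify that the subset really contains 4 haplotypes (this assumes the PanSN-style `sample#haplotype#contig` path names used by the HPRC graph; the reference paths only carry a single `#`):
```shell
# list the distinct sample#haplotype prefixes in the subset FASTA
cut -f 1 chr20.hprc.pan4.fa.gz.fai | cut -f 1,2 -d '#' | sort -u
```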
### Building a 4-haplotype HPRC chr20 pangenome graph with nf-core/pangenome
Whilst we can limit the maximum allocatable resources with `hprc_hugo24.config`, one can assign resources for each step of the pipeline using a different config file:
chr20_hprc_pan4_config="process {
withName:'MULTIQC|MULTIQC_COMMUNITY|SAMTOOLS_FAIDX|CUSTOM_DUMPSOFTWAREVERSIONS' {
// these tools can only make use of one thread and need little RAM
cpus = 1
memory = 1.GB
}
withName:'TABIX_BGZIP|ODGI_STATS|WFMASH_ALIGN|VG_DECONSTRUCT' {
cpus = 8
memory = 8.GB
}
withName:'WFMASH_MAP_ALIGN|WFMASH_MAP|SEQWISH|ODGI_BUILD|ODGI_UNCHOP|ODGI_SORT|ODGI_LAYOUT|WFMASH_MAP_COMMUNITY|ODGI_SQUEEZE' {
cpus = 16
memory = 16.GB
}
withName:'SMOOTHXG' {
cpus = 32
memory = 32.GB
}
withName:'GFAFFIX|ODGI_VIEW|ODGI_VIZ*|ODGI_DRAW|SPLIT_APPROX_MAPPINGS_IN_CHUNKS|PAF2NET|NET2COMMUNITIES|EXTRACT_COMMUNITIES' {
// these tools can only make use of one thread and need medium RAM
cpus = 1
memory = 8.GB
}
}"
echo "$chr20_hprc_pan4_config" > chr20.hprc.pan4.config
Let's build the chromosome 20 pangenome graph. If you are interested in setting additional parameters you can always visit https://nf-co.re/pangenome/1.1.2/parameters for details. All parameters starting with one `-` are handed over to Nextflow, all parameters starting with two `-` are handled by the pipeline itself:
nextflow run nf-core/pangenome -r 1.1.2 --input chr20.hprc.pan4.fa.gz --outdir chr20.hprc.pan4_out --n_haplotypes 4 --wfmash_map_pct_id 98 --wfmash_segment_length 10k --wfmash_n_mappings 3 --seqwish_min_match_length 311 --smoothxg_poa_length "1000," -c hprc_hugo24.config,chr20.hprc.pan4.config --wfmash_exclude_delim '#' -profile docker --wfmash_chunks 4
<details>
<summary>Click me for the explanations of some Nextflow parameters</summary>
- `nextflow run`: Execute a Nextflow pipeline.
- `nf-core/pangenome -r 1.1.2`: Select pipeline https://github.com/nf-core/pangenome for execution in its version 1.1.2.
- `--n_haplotypes 4`: We have 4 haplotypes as input.
- `--wfmash_map_pct_id 98`: Genomic sequences of human vary by ~2%.
- `--wfmash_n_mappings 3`: We want to retain 3 mappings for each segment, because each segment could map to 3 other haplotypes.
- `--seqwish_min_match_length 311`: Filter out exact matches below this length. This can smooth the graph locally and prevent complex local graph topologies from forming due to differential alignments.
- `--wfmash_exclude_delim '#'`: Our input sequences follow the [PanSN-spec](https://github.com/pangenome/PanSN-spec), so the idea is to skip mappings between sequences whose names share the same prefix up to the first '#'. Since our sample `HG01978` still consists of contigs, this reduces the mapping problem and speeds up `WFMASH_MAP`.
- `--wfmash_chunks 4`: One advantage of `nf-core/pangenome` over `pggb` is that it can parallelize the often heavy base-pair level alignments across the nodes of a cluster. The parameter `--wfmash_chunks` determines into how many equally large subproblems the alignments are split after the `WFMASH_MAP` process. It is recommended that this number roughly matches the number of available nodes. During the course a full cluster is not available, so we are improvising. In `chr20.hprc.pan4.config` the number of CPUs for `WFMASH_ALIGN` is set to 8, so on our 32-thread/64-GB machine at most 4 `WFMASH_ALIGN` processes can be executed in parallel.
</details>
<br />
*In which folder can the final ODGI graph be found? And in which folder the final GFA graph?*
<details>
<summary>Click me for the answer</summary>
- ODGI: *FINAL_ODGI*
- GFA: *FINAL_GFA*
</details>
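To sanity-check the result, you could look at the final graph's statistics (the exact subfolder and file name depend on the pipeline version and the input file name, so the path below is an assumption):
```shell
# summary statistics of the final nf-core/pangenome graph
odgi stats -i chr20.hprc.pan4_out/FINAL_ODGI/*.og -S
```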
Copy the MultiQC report and other statistics to your local machine (for example with `rsync`, as shown at the top of this document) in order to take a closer look:
chr20.hprc.pan4_out/multiqc/multiqc_report.html .
chr20.hprc.pan4_out/pipeline_info/execution_*.html .
chr20.hprc.pan4_out/pipeline_info/pipeline_dag_*.html .
In the MultiQC report you will find vital graph statistics, lots of 1D graph visualizations, and a 2D graph visualization, serving both as quantitative and qualitative graph validation. In `execution_report_*.html` you can find an overview of the executed pipeline and, in particular, the resource consumption of each process. If you notice that a process consumed much less RAM than it was given in `chr20.hprc.pan4.config`, you would want to adjust this in future runs. Assuming you want to run `nf-core/pangenome` on a cluster, it is crucial to limit the allocated resources for each process, so your jobs have a higher chance of being scheduled quickly. In `execution_timeline_*.html` one can observe when each process was executed and which processes ran in parallel, assuming resources were available.
## ODGI
### Learning objectives
- extract subgraphs representing loci of interest
- visualize graph annotation
- make phylogenetic trees
### Getting started
Check out the odgi repository (we need some of its example data):
cd ~
git clone https://github.com/pangenome/odgi.git
Now create a directory to work in for this tutorial:
cd ~
mkdir hprc_hugo24_subgraphs
cd hprc_hugo24_subgraphs
ln -s ~/odgi/test
### Exploring the HPRC chromosome 6 pangenome graph
Download the pangenome graph of human chromosome 6 in GFA format, decompress it, and convert it to a graph in odgi format.
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/scratch/2021_11_16_pggb_wgg.88/chroms/chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa.gz
gunzip chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa.gz
odgi build -g chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa -o chr6.pan.og -t 32 -P
This graph contains contigs of 88 haploid, phased human genome assemblies from 44 individuals, plus the `chm13` and `grch38` reference genomes.
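Optionally, get a first impression of the graph's size before extracting anything:
```shell
# summary statistics of the full chromosome 6 graph
odgi stats -i chr6.pan.og -S
```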
#### Extraction
The [major histocompatibility complex (MHC)](https://en.wikipedia.org/wiki/Major_histocompatibility_complex) is a large locus in vertebrate DNA containing a set of closely linked polymorphic genes that code for cell surface proteins essential for the adaptive immune system. In humans, the MHC region occurs on chromosome 6. The human MHC is also called the HLA (human leukocyte antigen) complex (often just the HLA).
To see the coordinates of some HLA genes, execute:
head test/chr6.HLA_genes.bed -n 5
The coordinates are expressed with respect to the `grch38` reference genome.
To extract the subgraph containing all the HLA genes annotated in the `chr6.HLA_genes.bed` file, execute:
odgi extract -i chr6.pan.og -o chr6.pan.MHC.og -b <(bedtools merge -i test/chr6.HLA_genes.bed -d 10000000) -d 0 -E -t 32 -P
The instruction extracts:
- the nodes belonging to the `grch38#chr6` path ranges specified in the `chr6.HLA_genes.bed` file via `-b`,
- all nodes between the min and max positions touched by the given path ranges, even if they belong to other paths (`-E`),
- the edges connecting all the extracted nodes, and
- the paths traversing all the extracted nodes.
How many paths are present in the extracted subgraph? With 90 haplotypes (44 diploid samples plus 2 haploid reference genomes), how many paths would you expect in the subgraph if the MHC locus were solved with a single contig per haplotype?
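One way to check the path count:
```shell
# count the paths in the extracted MHC subgraph
odgi paths -i chr6.pan.MHC.og -L | wc -l
```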
To visualize the graph, execute:
odgi sort -i chr6.pan.MHC.og -o - -O | odgi viz -i - -o chr6.pan.MHC.png -s '#' -a 20
Why are we using `odgi sort` before visualizing the graph?
Are there haplotypes where the MHC locus is not resolved with a single contig? If so, which ones? Count the number of contigs for each haplotype.
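One possible way to count contigs per haplotype (assuming PanSN `sample#haplotype#contig` path names):
```shell
# number of contigs per sample#haplotype prefix in the MHC subgraph
odgi paths -i chr6.pan.MHC.og -L | cut -f 1,2 -d '#' | sort | uniq -c
```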
Generate the graph layout with `odgi layout` (remember to specify the number of threads), then visualize the layout with `odgi draw`. Specify `-P` to get information on the progress.
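A possible invocation (the layout parameters and the output file names here are only a suggestion):
```shell
# compute the 2D layout of the MHC subgraph
odgi layout -i chr6.pan.MHC.og -o chr6.pan.MHC.lay -t 32 -P
# draw the layout to a PNG
odgi draw -i chr6.pan.MHC.og -c chr6.pan.MHC.lay -p chr6.pan.MHC.layout.png
```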
The MHC locus includes the complement component 4 (C4) region, which encodes proteins involved in the complement system. In humans, the C4 gene exists as 2 functionally distinct genes, C4A and C4B, which both vary in structure and copy number ([Sekar et al., 2016](https://www.nature.com/articles/nature16549)). Moreover, C4A and C4B genes segregate in both long and short genomic forms, distinguished by the presence or absence of a human endogenous retroviral (HERV) sequence.
Find C4 coordinates:
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz
zgrep 'gene_id "C4A"\|gene_id "C4B"' hg38.ncbiRefSeq.gtf.gz |
awk '$1 == "chr6"' | cut -f 1,4,5 |
bedtools sort | bedtools merge -d 15000 | bedtools slop -l 10000 -r 20000 -g hg38.chrom.sizes |
sed 's/chr6/grch38#chr6/g' > hg38.ncbiRefSeq.C4.coordinates.bed
Extract the C4 locus:
odgi extract -i chr6.pan.og -b hg38.ncbiRefSeq.C4.coordinates.bed -o - -d 0 -E -t 32 -P | odgi explode -i - --biggest 1 --sorting-criteria P --optimize -p chr6.pan.C4
odgi sort -i chr6.pan.C4.0.og -o chr6.pan.C4.sorted.og -p Ygs -x 100 -t 32 -P
What is `odgi explode` doing?
For the `odgi viz` visualization, select the haplotypes to display
odgi paths -i chr6.pan.C4.sorted.og -L | grep 'chr6\|HG00438\|HG0107\|HG01952' > chr6.selected_paths.txt
and visualize them:
# odgi viz: default (binned) mode
odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt
# odgi viz: color by strand
odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.z.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt -z
# odgi viz: color by position
odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.du.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt -du
# odgi viz: color by depth
odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.m.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt -m -B Spectral:4
For the `chr6.pan.C4.sorted.m.png` image we used the Spectral color palette with 4 levels of node depth, so white indicates no depth, while grey, red, and yellow indicate depth 1, 2, and greater than or equal to 3, respectively. What information does this image provide about the state of the C4 region in the selected haplotypes?
Visualize all haplotypes with `odgi viz`, coloring by depth. How many haplotypes have three copies of the C4 region? How many haplotypes are missing the HERV sequence? What is the copy number state of the `chm13` and `grch38` reference genomes?
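For example, reusing the depth-colored settings from above but dropping the path selection (a sketch; you may want to increase `-y` so that all ~90 paths remain readable):
```shell
# all haplotypes of the C4 locus, colored by node depth
odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.all.m.png -c 12 -w 100 -y 500 -m -B Spectral:4
```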
Use `odgi layout` and `odgi draw` to compute and visualize the layout of the C4 locus. Try to find out how to add the annotation in `/home/participant/odgi/test/chr6.C4.bed` to `odgi draw`'s SVG output. The HERV sequence may be present or absent in the C4 regions across haplotypes: how is this reflected in the structure of the graph?
### Primate chromosome 6
Download the pangenome graph of primate chromosome 6 in GFA format, decompress it, and convert it to a graph in `odgi` format.
wget https://zenodo.org/record/7933393/files/primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.gfa.zst
zstd -d primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.gfa.zst
odgi build -g primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.gfa -o primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.og -t 16 -P
This graph contains chromosome 6 contigs of six diploid, phased primate genome assemblies (2 haplotypes for each sample), plus the chm13 and grch38 reference genomes. Primate genomes were downloaded from https://genomeark.github.io/t2t-all/.
Compute the dissimilarity (distance) between all possible pairs of haplotypes:
odgi similarity -i primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.og --distances -t 32 -D '#' -p 2 -P > primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.dist.tsv
The `-D` and `-p` options specify to use the 2nd occurrence of the '#' character to group the results. As path names follow the PanSN-spec, this means that results are grouped by haplotype. Take a look at the output:
head primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.dist.tsv
Construct a phylogenetic tree using the `jaccard.distance` column (run the following R code, for example on your local machine):
library(tidyverse)
library(ape)
library(ggtree)
path_dist_tsv <- 'primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.dist.tsv'
# Read sparse matrix
sparse_matrix_df <- read_tsv(path_dist_tsv)
# Prepare distance matrix
jaccard_dist_df <- sparse_matrix_df %>%
  arrange(group.a, group.b) %>%
  select(group.a, group.b, jaccard.distance) %>%
  pivot_wider(names_from = group.b, values_from = jaccard.distance) %>%
  column_to_rownames(var = "group.a")
# Clustering
jaccard_hc <- as.dist(jaccard_dist_df) %>% hclust()
# Open a pdf device with the specified width and height
pdf(file = "dendrogram.haplotypes.pdf", width = 5, height = 6)
# Plot the dendrogram
plot(
jaccard_hc,
# Label at same height
hang = -1,
main = 'primates14.chr6',
xlab = 'Haplotype',
ylab = 'Jaccard distance',
sub = '',
cex = 1.2
)
# Close the device and save the file
dev.off()
Try to make the tree by grouping the results by sample.
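One way to do this (a sketch: with PanSN names, grouping on the 1st occurrence of '#' via `-p 1` groups by sample; the output file name below is our own choice):
```shell
# dissimilarities grouped by sample instead of by haplotype
odgi similarity -i primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.og --distances -t 32 -D '#' -p 1 -P \
    > primates14.chr6.by_sample.dist.tsv
```
Then rerun the R code above on the new TSV.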
## Bonus: Pangenome growth curve
### Learning objectives
In this exercise you learn how to
- Evaluate and interpret the growth curve of a pangenome.
### Getting started
Make sure you have `panacus` installed. It is already available on the course VMs.
Now create a directory to work on for this tutorial:
cd ~
mkdir hprc_openness
cd hprc_openness
Download the 44 haplotypes chrM HPRC human pangenome graph ([Liao, Asri, Ebler et al., 2023](https://doi.org/10.1038/s41586-023-05896-x)) from the [HPRC Pangenome Resources](https://github.com/human-pangenomics/hpp_pangenome_resources) and the 50 haplotypes *E. coli* pangenome graph ([Garrison, Guarracino et al., 2023](https://www.biorxiv.org/content/10.1101/2023.04.05.535718v1)):
mkdir chrM
cd chrM
wget -c https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/chroms/chrM.hprc-v1.0-pggb.gfa.gz
gunzip chrM.hprc-v1.0-pggb.gfa.gz
mv chrM.hprc-v1.0-pggb.gfa chrM.gfa
cd ..
mkdir ecoli50
cd ecoli50
wget -c https://zenodo.org/record/7937947/files/ecoli50.gfa.zst
zstd -d ecoli50.gfa.zst
cd ..
### `odgi heaps`
`odgi heaps` calculates permutations of the path pangenome coverage in the graph. The number of permutations affects the accuracy of the subsequent power law regression. This regression happens in this [Rscript](https://github.com/pangenome/odgi/blob/master/scripts/heaps_fit.R), which uses Heaps' law ([Tettelin et al., 2008](https://www.sciencedirect.com/science/article/pii/S1369527408001239?via=ihub#section0020)) to calculate a pangenome growth curve from all `odgi heaps` permutations. For more details, take a look at https://en.wikipedia.org/wiki/Pan-genome#Classification.
### `Panacus`
`panacus` is able to calculate the pangenome openness without the need to perform any permutations. Indeed, it directly applies the binomial formula described in [Parmigiani et al., 2022](https://www.biorxiv.org/content/10.1101/2022.11.15.516472v2), Section 2.1, Eq 1.
`panacus` exposes a parameter (`-c`) that allows users to choose which graph feature (sequence, node, edge) is taken into account to calculate the growth histogram. The coverage `-l` parameter sets the _minimum number_ of haplotypes visiting a graph feature in order for this graph feature to be included in the calculation at all. With `-q` one can set the _minimum fraction_ of haplotypes that must share a graph feature *each time a haplotype is added to the growth histogram*.
For example, assuming a pangenome graph of 100 haplotypes, setting `-c bp -q 0,1,0.5,0.1` would calculate the pangenome growth in sequence space (`-c bp`) for 4 different cases. Remember, the quorum sets the minimum fraction of haplotypes for a nucleotide to be included in the openness calculation:
- without setting any quorum (`-q 0`), so all sequences are considered.
- limited to sequences that are traversed by 100% of haplotypes (`-q 1`). This is the `core pangenome`.
- limited to sequences that are traversed by at least 50% of the haplotypes (`-q 0.5`). This is the `shell pangenome`.
- limited to sequences that are traversed by at least 10% of the haplotypes (`-q 0.1`). This is the `cloud pangenome`.
`panacus` fits two curves for modeling the data: one following Heaps' law and one using Heaps' power law.
### Pangenome growth curve of the HPRC chrM pangenome graph
Create the matrix of path pangenome coverage permutations of the chrM graph with `odgi heaps`, then perform the Heaps' law regression:
cd chrM
odgi heaps -i chrM.gfa -S -n 100 > chrM.gfa.heaps.tsv
Rscript PATH_TO_ODGI_GIT_REPOSITORY/scripts/heaps_fit.R chrM.gfa.heaps.tsv chrM.gfa.heaps.tsv.pdf # this should be executed on your local machine
Taking a look at the PDF, we can surprisingly observe 2 traces of permutations. *How can this happen*?
<details>
<summary>**Hint**</summary>
Take a look at the 1D visualization of the graph.
</details>
<details>
<summary>**Explanation**</summary>
The CHM13 reference was linearized differently than all the other mitochondrial sequences, so it has an additional `tip`. The `odgi heaps` permutation algorithm reflects this, because each permutation always starts at the beginning of each genome.
</details>
So to get a cleaner pangenome growth curve with `odgi heaps` we remove the CHM13 reference sequence and run the analysis again:
odgi paths -i chrM.gfa -L | head -n 1 > chrM.gfa.chm13
odgi paths -i chrM.gfa -X chrM.gfa.chm13 -o chrM.gfa.no_chm13.og
odgi heaps -i chrM.gfa.no_chm13.og -S -n 100 > chrM.gfa.no_chm13.og.heaps.tsv
Rscript PATH_TO_ODGI_GIT_REPOSITORY/scripts/heaps_fit.R chrM.gfa.no_chm13.og.heaps.tsv chrM.gfa.no_chm13.og.heaps.tsv.pdf # this should be executed on your local machine
This looks much better! With every added genome, the number of newly added bases is really low. Let's take a closer look with `panacus`.
Create and visualize a growth histogram of the chrM graph in sequence space with `panacus`:
RUST_LOG=info panacus histgrowth chrM.gfa -c bp -q 0,1,0.5,0.1 -t 16 > chrM.gfa.histgrowth.tsv
panacus-visualize.py -e -l "lower right" chrM.gfa.histgrowth.tsv > chrM.gfa.histgrowth.tsv.pdf
<details>
<summary>**See PNG**</summary>

</details>
Taking a look at the top figure in the PDF, we see a nearly asymptotic growth of the number of nucleotides with an increasing number of genomes.
<!---
A value of`γ=0.019` indicates a closed pangenome, because $γ$ is very close to $0$.
--->
The bottom figure shows the number of bases added per added individual, which is basically 0 for all but the first one. In conclusion, chrM is a closed pangenome.
The core pangenome (`quorum >= 1`) is very large, the shell pangenome (`quorum >= 0.5`) is only slightly larger, and the cloud pangenome (`quorum >= 0.1`) is not noticeably larger.
### Pangenome growth curve of the *E.coli* pangenome graph
Create the matrix of path pangenome coverage permutations of the *E. coli* graph with `odgi heaps`, then perform the power law regression:
cd ~/hprc_openness/ecoli50
odgi heaps -i ecoli50.gfa -S -n 100 -t 16 > ecoli50.gfa.heaps.tsv
Rscript PATH_TO_ODGI_GIT_REPOSITORY/scripts/heaps_fit.R ecoli50.gfa.heaps.tsv ecoli50.gfa.heaps.tsv.pdf # this should be executed on your local machine
With every added genome, the number of newly added bases is at least 100k. Let's take a closer look with `panacus`.
Create and visualize a growth histogram of the *E. coli* graph in sequence space with `panacus`:
RUST_LOG=info panacus histgrowth ecoli50.gfa -c bp -q 0,1,0.5,0.1 -t 16 > ecoli50.gfa.histgrowth.tsv
panacus-visualize.py -e -l "upper left" ecoli50.gfa.histgrowth.tsv > ecoli50.gfa.histgrowth.tsv.pdf
<details>
<summary>**See PNG**</summary>

</details>
Taking a look at the top figure in the PDF, a polynomial growth of the number of nucleotides with an increasing number of genomes is visible.
<!---The value `1>γ=0.386>0` indicates an open pangenome.
--->
The bottom figure shows the number of bases added per added individual. Even the 50th added individual adds more than 100k new bases to the pangenome. In conclusion, the *E. coli* pangenome is open. The core pangenome (`quorum >= 1`) is quite small, the shell pangenome (`quorum >= 0.5`) does not add much more sequence, and the cloud pangenome (`quorum >= 0.1`) adds some more sequence.