# MemPanG23 - Day 2a - Understanding pangenomes
## Sequence partitioning and pangenome graph building
### Learning objectives
- collect and preprocess *de novo* human assemblies
- partition the assembly contigs by chromosome
- built chromosome-specific pangenome graphs
### Getting started
#### NOTE: Allow starting a web server for browsing outputs
Let's get a web server running. Log-in again to your virtual machine from another terminal and run:
```shell
python -m http.server 889N # replace N with your user number (for example: N = 1 for user 1)
```
You can access this by pointing your web browser at `http://<your_ip>:889N/`, where <your_ip> is the ip address of your instance.
curl ifconfig.me
can show you your ip address.
<br/>
##### TEMPORARY SOLUTION TO DOWNLOAD FILES ON YOUR COMPUTER
rsync -av user1@<YOUR_MACHINE>:/path/of/your/file .
Now create a directory to work on for this tutorial:
cd ~
mkdir day2_hsapiens
cd day2_hsapiens
Get the URLs of the assemblies:
mkdir -p ~/day2_hsapiens/assemblies
cd ~/day2_hsapiens/assemblies
wget https://raw.githubusercontent.com/human-pangenomics/HPP_Year1_Assemblies/main/assembly_index/Year1_assemblies_v2_genbank.index
grep 'chm13\|h38' Year1_assemblies_v2_genbank.index | awk '{ print $2 }' | sed 's%s3://human-pangenomics/working/%https://s3-us-west-2.amazonaws.com/human-pangenomics/working/%g' > refs.urls
grep 'chm13\|h38' -v Year1_assemblies_v2_genbank.index | awk '{ print $2; print $3 }' | sed 's%s3://human-pangenomics/working/%https://s3-us-west-2.amazonaws.com/human-pangenomics/working/%g' > samples.urls
Download the 2 haplotypes of the `HG01978` diploid sample and the references:
cat refs.urls <(grep 'HG01978' samples.urls | head -n 6) | parallel -j 4 'wget -q {} && echo got {}'
Unpack the assemblies:
ls *.f1_assembly_v2_genbank.fa.gz | while read f; do echo $f; gunzip $f && samtools faidx $(basename $f .gz); done
### Pangenome Sequence Naming
We follow the [PanSN-spec](https://github.com/pangenome/PanSN-spec) naming to simplify the identification of samples and haplotypes in pangenomes. We do that by adding a prefix to the reference sequences:
fastix -p 'grch38#1#' <(zcat GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz) | bgzip -c -@ 16 > grch38_full.fa.gz && samtools faidx grch38_full.fa.gz
fastix -p 'chm13#1#' <(zcat chm13.draft_v1.1.fasta.gz) | bgzip -c -@ 16 > chm13.fa.gz && samtools faidx chm13.fa.gz
# Remove unplaced contigs from grch38 that are (hopefully) represented in chm13
samtools faidx grch38_full.fa.gz $(cat grch38_full.fa.gz.fai | cut -f 1 | grep -v _ ) | bgzip -@ 16 -c > grch38.fa.gz
# Cleaning
rm chm13.draft_v1.1.fasta.gz GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz grch38_full.fa.gz.*
Take a look at how sequence names are changed in the FASTA files.
### Sequence partitioning
Put the two references together:
zcat chm13.fa.gz grch38.fa.gz | bgzip -c -@ 16 > chm13+grch38.fa.gz && samtools faidx chm13+grch38.fa.gz
Why are we using two reference genomes? Does the old `grch38` reference have chromosomes that the `chm13` reference does not have? If so, which ones?
We partition contigs by chromosome by mapping each assembly against the scaffolded references:
PATH_REFERENCE_FA_GZ=~/day2_hsapiens/assemblies/chm13+grch38.fa.gz
mkdir -p ~/day2_hsapiens/assemblies/partitioning
ls *.f1_assembly_v2_genbank.fa | while read FASTA; do
NAME=$(basename $FASTA .fa);
echo $NAME
PATH_PAF=~/day2_hsapiens/assemblies/partitioning/$NAME.vs.ref.paf
wfmash $PATH_REFERENCE_FA_GZ ~/day2_hsapiens/assemblies/$FASTA -s 50k -l 150k -p 95 -N -m -t 16 > $PATH_PAF
done
Run `wfmash` without parameters to get information on the meaning of each parameter. What does `-N` mean?
For each haplotype (for each FASTA file), count how many contigs were partitioned in each reference chromosome. Which of the two references (`grch38` and` chm13`) do more contigs map to? If there is a clear winner, why?
What is the chromosome against which more contigs map? Make a plot displaying the number of contigs (on the y-axis) with respect to the chromosome (on the x-axis) to which they map. Consider `grch38` and `chm13` in the same way: for example, merge the number of contigs mapping against `grch38#1#chr1` and `chm13#1#chr1` in a single count for `chr1`.
Subset by chromosome, including the references. To save time and space, let's take only sequences from chromosome 20:
cd ~/day2_hsapiens/assemblies
DIR_PARTITIONING=~/day2_hsapiens/assemblies/partitioning
mkdir -p parts
( echo 20 ) | while read i; do
awk '$6 ~ "chr'$i'$"' $(ls $DIR_PARTITIONING/*.vs.ref.paf | \
grep -v unaligned | sort) | cut -f 1 | sort | uniq \
> parts/chr$i.contigs;
done
( echo 20 ) | while read i; do
echo chr$i
samtools faidx chm13+grch38.fa.gz chm13#1#chr$i grch38#1#chr$i > parts/chr$i.pan.fa
ls *.f1_assembly_v2_genbank.fa | while read FASTA; do
NAME=$(basename $FASTA .fa);
echo chr$i $NAME
samtools faidx $FASTA $( comm -12 <(cut -f 1 $FASTA.fai | sort) <(sort parts/chr$i.contigs) ) >> parts/chr$i.pan.fa
done
bgzip -@ 16 parts/chr$i.pan.fa && samtools faidx parts/chr$i.pan.fa.gz
done
### Pangenome graph building
Build the pangenome graph for chromosome 20. Since human presents a low sequence divergence, we will set the mapping identity (`-p` parameter) in `pggb` to `98`.
mkdir -p ~/day2_hsapiens/graphs
pggb -i ~/day2_hsapiens/assemblies/parts/chr20.pan.fa.gz -o ~/day2_hsapiens/graphs/chr20.pan -t 16 -p 98 -s 5000 -n 4 -k 229 -O 0.001 -G 1000
Run `pggb` without parameters to get information on the meaning of each parameter:
pggb
Why did we specify `-n 4`?
The `-k` parameter is used to filter exact matches shorter than 229 bps. Graph induction with `seqwish` often works better when we filter very short matches out of the input alignments. In practice, these often occur in regions of low alignment quality, which are typical of areas with large indels and structural variations in the `wfmash` alignments. This underalignment is then resolved in the final `smoothxg` step. Removing short matches can simplify the graph and remove spurious relationships caused by short repeated homologies.
The `-O` parameter specifies to expand a little bit the partial order alignment (POA) blocks at both sides. This help in better resolving POA blocks boundaries and avoiding wrong alignments due to not optimal block selection.
**Try this at home**: build the pangenome graph for all chromosomes.
Use `odgi stats` to obtain the graph length, and the number of nodes, edges, and paths. Do you think the resulting pangenome graph represents the input sequences well? Check the length and the number of the input sequences to answer this question.
Take a look at the PNG files in the `chr20.pan` folder. Is the layout of the graph roughly linear? Are there unaligned regions?
Generate another `odgi viz` visualization with
cd ~/day2_hsapiens/graphs/chr20.pan
odgi paths -i chr20.pan.fa.gz.e212b17.f599738.1d7f1c3.smooth.final.og -L | cut -f 1,2 -d '#' | uniq > prefixes.txt
odgi viz -i chr20.pan.fa.gz.e212b17.f599738.1d7f1c3.smooth.final.og -o chr20.pan.fa.gz.e212b17.f599738.1d7f1c3.smooth.final.og.viz_multiqc.2.png -x 1500 -y 500 -a 10 -I Consensus_ -M prefixes.txt
What do you think is different between the `chr20.pan.fa.gz.e212b17.f599738.ff92e05.smooth.final.og.viz_multiqc.png` image and the newly generated image (`chr20.pan.fa.gz.e212b17.f599738.ff92e05.smooth.final.og.viz_multiqc.2.png`)?
<details>
<summary>Click me for the answer</summary>
The contigs of the two additional haplotypes are visually collapsed into one path each.
</details>
</br>
**Optional step (only if you have a working gfaestus installed on your computer)**
Go to the folder where the `gfaestus` repository is checked out and open the graph layout with it:
cd ~/gfaestus
./target/release/gfaestus chr20.pan/chr20.pan.fa.gz.*.smooth.final.gfa chr20.pan.fa.gz.*.smooth.final.og.lay.tsv
Download the coordinates of the p and q amrs on `chm13` in BED format:
wget https://raw.githubusercontent.com/pangenome/pggb-paper/main/data/chm13.pq_arms.bed
and load the file in `gfaestus`: `Tools` -> `BED Label Wizard` -> Go to the folder where the BED file is -> `Double click on the file` (do not click `Accept`, it is buggy).
Produce full gene annotations and visualize them:
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.annotation.gff3.gz
zcat gencode.v40.annotation.gff3.gz | grep ^chr20 | awk '$3 == "gene"' | cut -f 1,4,5,9 | tr ';' '\t' | grep gene_name= | cut -f 1-3,7 | sed s/gene_name=// | sed s/^chr20/grch38#1#chr20/ > grch38#1#chr20.gene.bed
## ODGI
### Learning objectives
- extract subgraphs representing *loci* of interest
- visualize graph annotation
- untangle the pangenome graph
- make phylogenetic trees
### Getting started
Check out `odgi` repository (we need one of its example):
cd ~
git clone https://github.com/pangenome/odgi.git
Now create a directory to work on for this tutorial:
cd ~
mkdir day2_subgraphs
cd day2_subgraphs
ln -s ~/odgi/test
### Lipoprotein A graph
Construct an `odgi` file from the LPA dataset in GFA format:
odgi build -g test/LPA.gfa -o LPA.og
The command creates a file called `LPA.og`, which contains the input graph in `odgi` format. This graph contains 13 contigs from 7 haploid human genome assemblies from 6 individuals plus the `chm13` cell line.
Use `odgi paths` to display the sequence names, that is the path names.
To see the variants for the two contigs of the `HG02572` sample, execute:
zgrep -v '^##' test/LPA.chm13__LPA__tig00000001.vcf.gz | head -n 9 | cut -f 1-9,16,17 | column -t
In the `LPA.chm13__LPA__tig00000001.vcf.gz` file, the variants were called with respect to the `chm13__LPA__tig00000001` contig, which was used as the reference path. The `ID` field in the VCF lists the nodes involved in the variant. A `>` means that the node is visited in forward orientation, a `<` means that the node is visited in reverse orientation.
The insertion at position **1050** (G > GT) is present only in one of the `HG02572`'s contig (`HG02572__LPA__tig00000001`). To extract the sub-graph where this insertion falls, execute:
odgi extract -i LPA.og -n 23 -c 1 -d 0 -o LPA.21_23_G_GT.og
The instruction extracts:
- the node with ID 23 (`-n 23`),
- the nodes reachable from this node following a single edge (`-c 1`) in the graph topology,
- the edges connecting all the extracted nodes, and
- the paths traversing all the extracted nodes.
How many nodes/edges the extracted graph contain? How many paths ? Which are their names?
Use `Bandage` to visualize the graph. Display node IDs. What can you see between nodes 21 and 23?
### Human chromosome 6
Download the pangenome graph of the Human chromosome 6 in GFA format, decompress it, and convert it to a graph in `odgi` format.
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/scratch/2021_11_16_pggb_wgg.88/chroms/chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa.gz
gunzip chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa.gz
odgi build -g chr6.pan.fa.a2fb268.4030258.6a1ecc2.smooth.gfa -o chr6.pan.og -t 16 -P
This graph contains contigs of 88 haploid, phased human genome assemblies from 44 individuals, plus the `chm13` and `grch38` reference genomes.
#### Extraction
The [major histocompatibility complex (MHC)](https://en.wikipedia.org/wiki/Major_histocompatibility_complex) is a large locus in vertebrate DNA containing a set of closely linked polymorphic genes that code for cell surface proteins essential for the adaptive immune system. In humans, the MHC region occurs on chromosome 6. The human MHC is also called the HLA (human leukocyte antigen) complex (often just the HLA).
To see the coordinates of some HLA genes, execute:
head test/chr6.HLA_genes.bed -n 5
The coordinates are expressed with respect to the `grch38` reference genome.
To extract the subgraph containing all the HLA genes annotated in the `chr6.HLA_genes.bed` file, execute:
odgi extract -i chr6.pan.og -o chr6.pan.MHC.og -b <(bedtools merge -i test/chr6.HLA_genes.bed -d 10000000) -d 0 -E -t 16 -P
The instruction extracts:
- the nodes belonging to the `grch38#chr6` path ranges specified in the the `chr6.HLA_genes.bed` file via `-b`,
- all nodes between the min and max positions touched by the given path ranges, also if they belong to other paths (`-E`),
- the edges connecting all the extracted nodes, and
- the paths traversing all the extracted nodes.
How many paths are present in the extracted subgraph? With 90 haplotypes (44 diploid samples plus 2 haploid reference genomes), how many paths would you expect in the subgraph if the MHC locus were solved with a single contig per haplotype?
To visualize the graph, execute:
odgi sort -i chr6.pan.MHC.og -o - -O | odgi viz -i - -o chr6.pan.MHC.png -s '#' -a 20
Why are we using `odgi sort` before visualizing the graph?
Are there haplotypes where the MHC locus is not resolved with a single contig? If so, which ones? Counts the number of contigs for each haplotype.
Generate the graph layout with `odgi layout`, asking to generate both the `lay` and the `TSV` files (remember to specify the number of threads). Visualize the layout with `odgi draw` and `gfaestus`. Specify `-P` to get information on the progress.
The MHC locus includes the complement component 4 (C4) region, which encodes proteins involved in the complement system. In humans, the C4 gene exists as 2 functionally distinct genes, C4A and C4B, which both vary in structure and **copy number** ([Sekar et al., 2016](https://doi.org/10.1038/nature16549)). Moreover, C4A and C4B genes segregate in both long and short genomic forms, distinguished by the **presence or absence** of a human endogenous retroviral (HERV) sequence.
Find C4 coordinates:
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz
zgrep 'gene_id "C4A"\|gene_id "C4B"' hg38.ncbiRefSeq.gtf.gz |
awk '$1 == "chr6"' | cut -f 1,4,5 |
bedtools sort | bedtools merge -d 15000 | bedtools slop -l 10000 -r 20000 -g hg38.chrom.sizes |
sed 's/chr6/grch38#chr6/g' > hg38.ncbiRefSeq.C4.coordinates.bed
Extract the C4 locus:
odgi extract -i chr6.pan.og -b hg38.ncbiRefSeq.C4.coordinates.bed -o - -d 0 -E -t 16 -P | odgi explode -i - --biggest 1 --sorting-criteria P --optimize -p chr6.pan.C4
odgi sort -i chr6.pan.C4.0.og -o chr6.pan.C4.sorted.og -p Ygs -x 100 -t 16 -P
What `odgi explode` is doing?
Regarding the `odgi viz` visualization, select the haplotypes to visualize
odgi paths -i chr6.pan.C4.sorted.og -L | grep 'chr6\|HG00438\|HG0107\|HG01952' > chr6.selected_paths.txt
and visualize them
# odgi viz: default (binned) mode
odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt
# odgi viz: color by strand
odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.z.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt -z
# odgi viz: color by position
odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.du.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt -du
# odgi viz: color by depth
odgi viz -i chr6.pan.C4.sorted.og -o chr6.pan.C4.sorted.m.png -c 12 -w 100 -y 50 -p chr6.selected_paths.txt -m -B Spectral:4
For the `chr6.pan.C4.sorted.m.png` image we used the Spectra color palette with 4 levels of node depths, so white indicates no depth, while grey, red, and yellow indicate depth 1, 2, and greater than or equal to 3, respectively. What information does this image provide us about the state of the C4 region in the selected haplotypes?
Visualize all haplotypes with `odgi viz`, coloring by depth. How many haplotypes have three copies of the C4 region? How many haplotypes are missing the HERV sequence? What is the copy number state of the `chm13` and `grch38` reference genomes?
Use `odgi layout` and `odgi draw` to compute and visualize the layout of the C4 locus. Visualize it in `gfaestus` too, displaying also the following annotation in the `/home/participant/odgi/test/chr6.C4.bed` file. The HERV sequence may be present or absent in the C4 regions across haplotypes: how does this reflect on the structure of the graph?
#### Untangling
To obtain another overview of a collapsed locus, we can apply `odgi untangle` to segment paths into linear segments by breaking these segments where the paths loop back on themselves. In this way, we can obtain information on the copy number status of the sequences in the locus.
To untangle the C4 graph, execute:
(echo query.name query.start query.end ref.name ref.start ref.end score inv self.cov n.th |
tr ' ' '\t'; odgi untangle -i chr6.pan.C4.sorted.og -r $(odgi paths -i chr6.pan.C4.sorted.og -L | grep grch38) -t 16 -m 256 -P |
bedtools sort -i - ) | awk '$8 == "-" { x=$6; $6=$5; $5=x; } { print }' |
tr ' ' '\t' > chr6.pan.C4.sorted.untangle.bed
Take a look at the `chr6.pan.C4.sorted.untangle.bed` file. For each segment in the query (`query.name`, `query.start`, and `query.end` columns), the best match on the reference is reported (`ref.name`, `ref.start`, and `ref.end`), with information about the quality of the match (`score`), the strand (`inv`), the copy number status (`self.cov`), and its rank over all possible matches (`n.th`).
Try to visualize the results with `ggplot2` in R (hint: the intervals in the BED file can be displayed with `geom_segment`). Compare such a visualization with the visualization obtained with the `odgi viz` coloring by depth.
A pangenome graph represents the alignment of many genome sequences. By embedding gene annotations into the graph as paths, we "align" them with all other paths.
#### Injection
A pangenome graph represents the alignment of many genome sequences. By embedding gene annotations into the graph as paths, we "align" them with all other paths.
We start with gene annotations against the GRCh38 reference. Our annotations are against the full `grch38#chr6`, in `test/chr6.C4.bed`. Take a look at the first column in the annotation file
head test/chr6.C4.bed
However, the C4 locus graph `chr6.c4.gfa` is over the reference range, that is `grch38#chr6:31972046-32055647`. With `odgi paths` we can take a look at the names of the paths in the graph:
odgi paths -i test/chr6.C4.gfa -L | grep grc
So, we must adjust the annotations to match the subgraph to ensure that path names and coordinates exactly correspond between the BED and GFA. We do so using `odgi procbed`, which cuts BED records to fit within a given subgraph:
odgi procbed -i test/chr6.C4.gfa -b test/chr6.C4.bed > chr6.C4.adj.bed
The coordinate space now matches that of the C4 subgraph. Now, we can inject these annotations into the graph:
odgi inject -i test/chr6.C4.gfa -b chr6.C4.adj.bed -o chr6.C4.genes.og
Use `odgi viz` to visualize the new subgraph with the injected paths.
We now use the gene names and the `gggenes` output format from `odgi untangle` to obtain a gene arrow map. We specify the injected paths as target paths:
odgi paths -i chr6.C4.genes.og -L | tail -4 > chr6.C4.gene.names.txt
odgi untangle -R chr6.C4.gene.names.txt -i chr6.C4.genes.og -j 0.5 -t 16 -g -P > chr6.C4.gene.gggenes.tsv
We use `-j 0.5` to filter out low-quality matches.
We can then load this into R for plotting with `gggenes`, don't forget to download the file first via the setup python server interface in your browser:
require(ggplot2)
require(gggenes)
x <- read.delim('chr6.C4.gene.gggenes.tsv')
ggplot(x, aes(xmin=start, xmax=end, y=molecule, fill=gene, forward=strand)) + geom_gene_arrow()
ggsave('c4.gggenes.subset.png', height=1.5, width=15)
The plot will look a bit odd because some of the paths are in reverse complement orientation relative to the annotations. We can clean this up by using `odgi flip`, which flips paths around if they tend to be in the reverse complement orientation relative to the graph:
odgi flip -i chr6.C4.genes.og -o - -t 16 | odgi view -i - -g > chr6.C4.genes.flip.gfa
odgi untangle -R chr6.C4.gene.names.txt -i chr6.C4.genes.flip.gfa -j 0.5 -t 16 -g -P > chr6.C4.gene.gggenes.tsv
Plot the new results: what is changed? Why do all results start with a short blue arrow (and never with a short orange one)?
### Primate chromosome 6
Download the pangenome graph of the primate chromosome 6 in GFA format, decompress it, and convert it to a graph in `odgi` format.
wget https://zenodo.org/record/7933393/files/primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.gfa.zst
zstd -d primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.gfa.zst
odgi build -g primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.gfa -o primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.og -t 16 -P
This graph contains contigs of chromosome 6 of six diploid (2 haplotypes for each sample), phased primate genome assemblies, plus the `chm13` and `grch38` reference genomes. Primate genomes were downloaded from https://genomeark.github.io/t2t-all/.
Compute the dissimilarity (distance) between all possible pairs of haplotypes:
odgi similarity -i primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.og --distances -t 16 -D '#' -p 2 -P > primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.dist.tsv
The `-D` and `-p` options specifies to use the 2nd occurrence of the `#` character to group the results. As path names follow the [PanSN-spec](https://github.com/pangenome/PanSN-spec) specification, this means that results are grouped by haplotype. Take a look at the output:
head primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.dist.tsv
Construct a phylogenetic tree by using the `jaccard.distance`:
library(tidyverse)
library(ape)
library(ggtree)
path_dist_tsv <- 'primates14.chr6.fa.gz.667b9b6.c2fac19.ee137be.smooth.final.dist.tsv'
# Read sparse matrix
sparse_matrix_df <- read_tsv(path_dist_tsv)
# Prepare distance matrix
jaccard_dist_df <- sparse_matrix_df %>%
arrange(group.a, group.b) %>%
select(group.a, group.b, dice.distance) %>%
pivot_wider(names_from = group.b, values_from = dice.distance) %>%
column_to_rownames(var = "group.a")
# Clustering
jaccard_hc <- as.dist(jaccard_dist_df) %>% hclust()
# Open a pdf device with the specified width and height
pdf(file = "dendrogram.haplotypes.pdf", width = 5, height = 6)
# Plot the dendrogram
plot(
jaccard_hc,
# Label at same height
hang = -1,
main = 'primate14.chr6',
xlab = 'Haplotype',
ylab = 'Jaccard distance',
sub = '',
cex = 1.2
)
# Close the device and save the file
dev.off()
Try to make the tree by grouping the results by sample. Is there something strange? Try extracting a sub-region from the pangenome graph (the MHC region for example, or a randomly chosen region) and create its corresponding tree.
## Saccharomyces cerevisiae pangenome graph
### Learning objectives
In this exercise you learn how to
- identify variants in the graph
- partition sequences
### Getting started
Create a directory to work on for this tutorial:
mkdir -p ~/day2_yeast
cd ~/day2_yeast
Download 7 *Saccharomyces cerevisiae* assemblies ([Yue et al., 2016](https://doi.org/10.1038/ng.3847)) from the [Yeast Population Reference Panel (YPRP)](https://yjx1217.github.io/Yeast_PacBio_2016/welcome/) plus the `SGD` reference:
mkdir assemblies
cd assemblies
wget -c http://hypervolu.me/~guarracino/CPANG22/yeast/DBVPG6044.fa.gz
wget -c http://hypervolu.me/~guarracino/CPANG22/yeast/DBVPG6765.fa.gz
wget -c http://hypervolu.me/~guarracino/CPANG22/yeast/S288C.fa.gz
wget -c http://hypervolu.me/~guarracino/CPANG22/yeast/SGDref.fa.gz
wget -c http://hypervolu.me/~guarracino/CPANG22/yeast/SK1.fa.gz
wget -c http://hypervolu.me/~guarracino/CPANG22/yeast/UWOPS034614.fa.gz
wget -c http://hypervolu.me/~guarracino/CPANG22/yeast/Y12.fa.gz
wget -c http://hypervolu.me/~guarracino/CPANG22/yeast/YPS128.fa.gz
### Pangenome Sequence Naming
Rename the sequences according to the [PanSN-spec](https://github.com/pangenome/PanSN-spec) specification, using [fastix](https://github.com/ekg/fastix). The sequence names have to follow such a scheme:
DBVPG6044#1#chrI
DBVPG6044#1#chrII
...
S288C#1#chrI
S288C#1#chrII
...
SGDref#1#chrI
SGDref#1#chrII
...
Group sequences by chromosome, creating a FASTA file for each of them (`scerevisiae8.chrI.fasta`, `scerevisiae8.chrII.fasta`, ...). Moreover, put all sequences in a single FASTA file called `scerevisiae8.fasta`. Compress all FASTA files with `bgzip` and index them.
<details>
<summary>Click me for the answer</summary>
# Solution for chrI
for chr in *.fa.gz; do samtools faidx "$chr" "chrI" > "$chr".chrI.fa && fastix -p "$(echo "$chr" | cut -f 1 -d '.')#" "$chr".chrI.fa > "$chr".chrI.PanSN.fa; done
cat *.chrI.PanSN.fa > scerevisiae8.chrI.fasta
bgzip -@16 scerevisiae8.chrI.fasta
samtools faidx scerevisiae8.chrI.fasta.gz
# Solution for all chromosomes together
for chr in *.fa.gz; do fastix -p "$(echo "$chr" | cut -f 1 -d '.')#" <(zcat "$chr") > "$chr".PanSN.fa; done
cat *.fa > scerevisiae8.fasta
bgzip -@ 16 scerevisiae8.fasta
samtools faidx scerevisiae8.fasta.gz
</details>
</br>
### Pangenome graph building
Run `pggb` on each chromosome separately (including the mitochondrial one, chrMT), calling variants with respect to the `SGDref` assembly. For example, for chromosome 1, run:
cd ~/day2_yeast
mkdir -p graphs
pggb -i assemblies/scerevisiae8.chrI.fasta.gz -s 20000 -p 90 -n 8 -t 16 -o graphs/scerevisiae8.chrI -V SGDref:#
The `-V SGDref:#` parameter specify to call variants with respect to the path having as sample name `SGDref`; the sample name is identified by using `#` as separator (take a look at the [PanSN-spec](https://github.com/pangenome/PanSN-spec) specification).
Counts the number of variants in the VCF file for each chromosomes and take a look at the graphs. Is there a chromosome for which the corresponding graph looks wrong? If yes, try to solve the problem.
Use `odgi squeeze` to put all the graphs together in the same file. Try to visualize the graph layout of the squeezed graph (the graph just generated with all chromosomes together) with `odgi layout` and `odgi draw`.
Run `pggb` on all chromosomes jointly, giving in input the `scerevisiae8.fasta.gz` file. Call variants too with respect to the `SGDref` sample. Take a look at the graph layout. Is the layout of the graph obtained by running `pggb` separately on each chromosome (and then combining its results) similar to the graph obtained by running `pggb` on all chromosomes jointly?
In the newly obtained VCF, count how many variants are called for each chromosome. Are these counts similar to those obtained by calling variants on each chromosome separately? Are there variants esclusive (that do not appear in the chromosome-based VCF files) to this new VCF file?
### Sequence partitioning
We can't really expect to pairwise map all sequences together and obtain well separated connected components. It is likely to get a giant connected component, and probably a few smaller ones, due to incorrect mappings or false homologies. This might unnecessarily increase the computational burden, as well as complicate the downstream analyzes. Therefore, it is recommended to split up the input sequences into communities in order to find the latent structure of their mutual relationship.
We need to obtain the mutual relationship between the input assemblies in order to detect the underlying communities. To compute the pairwise mappings with `wfmash`, execute:
cd ~/day2_yeast
cd assemblies
wfmash scerevisiae8.fasta.gz -p 90 -n 7 -t 16 -m > scerevisiae8.mapping.paf
Why did we specify `-n 7`?
To project the PAF mappings into a network format (an edge list), execute:
python3 ~/pggb/scripts/paf2net.py -p scerevisiae8.mapping.paf
The `paf2net.py` script creates 3 files:
- `scerevisiae8.mapping.paf.edges.list.txt` is the edge list representing the pairs of sequences mapped in the PAF;
- `scerevisiae8.mapping.paf.edges.weights.txt` is a list of edge weights (long and high estimated identity mappings have greater weight);
- `scerevisiae8.mapping.paf.vertices.id2name.txt` is the 'id to sequence name' map.
To identity the communities, execute:
python3 ~/pggb/scripts/net2communities.py \
-e scerevisiae8.mapping.paf.edges.list.txt \
-w scerevisiae8.mapping.paf.edges.weights.txt \
-n scerevisiae8.mapping.paf.vertices.id2name.txt
How many communities were detected?
The `paf2net.py` script creates a set of `*.community.*.txt` files one for each the communities detected. Each `txt` file lists the sequences that belong to the same community.
Are there communities that contain multiple chromosomes? Which ones?
Identity the communities again, but this time add the `--plot` option to visualize them too:
python3 ~/pggb/scripts/net2communities.py \
-e scerevisiae8.mapping.paf.edges.list.txt \
-w scerevisiae8.mapping.paf.edges.weights.txt \
-n scerevisiae8.mapping.paf.vertices.id2name.txt \
--plot
Take a look at the `scerevisiae8.mapping.paf.edges.list.txt.communities.pdf` file.
Write a little script that take the `*.community.*.txt` files in input and create the corresponding FASTA files, ready to be input to `pggb`. Run `pggb` on the communities with multiple chromosomes and explore the results (layout and variants).