MemPanG23 - Day 1b - Introduction to pangenomics and PGGB

# MemPanG23 - Day 1b - Introduction to pangenomics and PGGB ## Introduction to pangenomics ### Learning objectives - construct graphs using `vg construct` - visualize graphs using `vg view` and `Bandage` - convert graphs using `vg convert` ### Getting started Make sure you have `vg` installed. It is already available on the course workstations. If you want to build it on your laptop, follow the instructions at the [vg homepage](https://github.com/vgteam/vg) (note that building `vg` and all submodules from source can take ~1h). In this exercise, you will use small toy examples from the `test` directory. So make sure you have checked out `vg` repository: cd ~ git clone https://github.com/vgteam/vg.git Now create a directory to work on for this tutorial: mkdir day1_intro cd day1_intro ln -s ~/vg/test/tiny ### Constructing and viewing your first graphs Like many other toolkits, `vg` is comes with many different subcommands. First we will use `vg construct` to build our first graph. Run it without parameters to get information on its usage: vg construct We will construct a reference-based graph from the sequence in file `tiny/tiny.fa`, which looks like this: >x CAAATAAGGCTTGGAAATTTTCTGGAGTTCTATTATATTCCAACTCTCTG To make the graph, we’ll add variants from the `tiny.vcf.gz` VCF file, whose contents look like this: zgrep '^##' -v tiny/tiny.vcf.gz | column -t #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1 x 9 . G A 99 . AC=1;LEN=1;NA=1;NS=1;TYPE=snp GT 1|0 x 10 . C T 99 . AC=2;LEN=1;NA=1;NS=1;TYPE=snp GT 1|1 x 14 . G A 99 . AC=1;LEN=1;NA=1;NS=1;TYPE=snp GT 1|0 x 34 . T A 99 . AC=2;LEN=1;NA=1;NS=1;TYPE=snp GT 1|1 x 39 . T A 99 . AC=1;LEN=1;NA=1;NS=1;TYPE=snp GT 1|0 To construct a graph, run vg construct -r tiny/tiny.fa -m 32 > tiny.ref.vg This will create a graph that just consists of a linear chain of nodes, each with 32 characters. The `-m` option tells `vg` to put at most 32 characters into each graph node (you might want to run it with different values and observe the different results). Now let’s look at the graph. To visualize a graph, you can use `vg view`. Per default, `vg view` will output a graph in [GFA](https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md) format. By adding `-d`, you can generate [DOT](https://www.graphviz.org/doc/info/lang.html) output: vg view tiny.ref.vg vg view -d tiny.ref.vg Next, we use [graphviz](https://graphviz.org/) to layout the graph representation in DOT format: vg view -d tiny.ref.vg | dot -T pdf -o tiny.ref.pdf View the PDF and compare it to the input sequence. #### NOTE: Allow starting a web server for browsing outputs Let's get a web server running. Log-in again to your virtual machine from another terminal and run: ```shell python -m http.server 889N # replace N with your user number (for example: N = 1 for user 1) ``` You can access this by pointing your web browser at `http://<your_ip>:889N/`, where <your_ip> is the ip address of your instance. curl ifconfig.me can show you your ip address. ##### TEMPORARY SOLUTION TO DOWNLOAD FILES ON YOUR COMPUTER rsync -av user1@<YOUR_MACHINE>:/path/of/your/file . Now vary the parameter passed to `-m` of `vg construct` and visualize the result. Now let's build a new graph that has some variants built into it. First, take a look at at `tiny/tiny.vcf.gz`, which contains variants in (gzipped) [VCF](https://samtools.github.io/hts-specs/VCFv4.2.pdf) format. vg construct -r tiny/tiny.fa -v tiny/tiny.vcf.gz -m 32 > tiny.vg Are there any warnings in the output? If yes, what do they indicate? Visualize the resulting graph. Ok, that's nice, but you might wonder which sequence of nodes actually corresponds to the sequence (`tiny.fa`) you started from? To keep track of that, `vg` adds a **path** to the graph. Let's add this path to the visualization. vg view -dp tiny.vg | dot -T pdf -o tiny.pdf You find the output too crowded? Option `-S` removes the sequence labels and only plots node IDs. vg view -dpS tiny.vg | dot -T pdf -o tiny.pdf It's also possible to read the graph in different formats. For instance, we can write the graph in GFA, modify it using text processing tools, and read it back in: vg view tiny.vg > tiny.gfa In GFA format, each line is a separate record of some part of the graph. The lines come in several types, which are indicated by the first character of the line. What line types do you see? What do you think they indicate? <details> <summary>Click me for the answers</summary> You should see the following line types: - `H`: a header. - `S`: a "sequence" line, which is the sequence and ID of a node in the graph. - `L`: a "link" line, which is an edge in the graph. - `P`: a "path" line, which labels a path of interest in the graph. In this case, the path is the walk that the reference sequence takes through the graph. It should be noted, however, that the format does not specify that these lines come in a particular order. </details> We can use `grep` to remove the path lines: cat tiny.gfa | grep -v '^P' | vg view -dp - | dot -T pdf -o tiny.no_path.pdf Try to remove the nodes and/or the edges from the GFA file and visualize them. What happens? Another tool for visualizing (not too big) graphs is [Bandage](https://github.com/rrwick/Bandage). It supports graphs in GFA format. Download the graph and try to visualize it locally with: Bandage load tiny.gfa Let's step up to a slightly bigger example. ln -s ~/vg/test/1mb1kgp This directory contains 1Mbp of 1000 Genomes data for `chr20:1000000-2000000`. As for the tiny example, let's' build one linear graph that only contains the reference sequence and one graph that additionally encodes the known sequence variation. The reference sequence is contained in `1mb1kgp/z.fa`, and the variation is contained in `1mb1kgp/z.vcf.gz`. Make a reference-only graph named `ref.vg`, and a graph with variation named `z.vg`. Look at the previous examples to figure out the command. Try to load the graph in `Bandage` too (it will be slow). Try to generate a PNG image with `Bandage` (take a look at the `Bandage -h` output). ## PGGB ### Learning objectives - build pangenome graphs using `pggb` - explore `pggb`'s results - understand how parameters affect the built pangenome graphs ### Getting started Make sure you have `pggb` and its tools installed. It is already available on the course workstations. If you want to build everything on your laptop, follow the instructions at the [pggb homepage](https://github.com/pangenome/pggb#installation) (`guix`, `docker`, `singularity`, and `conda` alternatives are available). So make sure you have checked out `pggb` repository: cd ~ git clone https://github.com/pangenome/pggb.git Check out also `wfmash` repository (we need one of its scrips): cd ~ git clone https://github.com/waveygang/wfmash.git Now create a directory to work on for this tutorial: mkdir day1_pggb cd day1_pggb ln -s ~/pggb/data ### Build HLA pangenome graphs The [human leukocyte antigen (HLA)](https://en.wikipedia.org/wiki/Human_leukocyte_antigen) system is a complex of genes on chromosome 6 in humans which encode cell-surface proteins responsible for the regulation of the immune system. Let's build a pangenome graph from a collection of sequences of the DRB1-3123 gene: pggb -i data/HLA/DRB1-3123.fa.gz -n 12 -t 16 -o out_DRB1_3123 Run `pggb` without parameters to get information on the meaning of each parameter: pggb Take a look at the files in the `out_DRB1_3123` folder. Visualize the graph with `Bandage`. Why did we specify `-n 12`? <details> <summary>Click me for the answer</summary> The pggb graph is defined by the number of mappings per segment of each genome `-n, --n-mappings N`. Ideally, you should set this to equal the number of haplotypes in the pangenome. Because that's the maximum number of secondary mappings and alignments that we expect. Keep in mind that the total work of alignment is proportional to `N*N`, and these multimappings can be highly redundant. If you provide a N that is not equal to the number of haplotypes, provide the actual number of haplotypes to `-H`. This helps smoothxg to determine the right POA problem size. </details> How many alignments were executed during the pairwise alignment (take a look at the `PAF` output)? Visualize the alignments: cd out_DRB1_3123 ~/wfmash/scripts/paf2dotplot png large *paf cd .. Use `odgi stats` to obtain the graph length, and the number of nodes, edges, and paths. Do you think the resulting pangenome graph represents the input sequences well? Check the length and the number of the input sequences to answer this question. <details> <summary>Click me for the answer</summary> The total graph length is nearly double the length of most of the input sequences. This indicates a strong underalignment of all the sequences. </details> How many blocks were selected and 'smoothed' during the two rounds of graph normalization (take a look at the `*.log` file to answer this question)? Try building the same pangenome graph by specifying a lower percent identity (`-p 90` by default): pggb -i data/HLA/DRB1-3123.fa.gz -p 80 -n 12 -t 16 -o out2_DRB1_3123 Check graph statistics. Does this pangenome graph represent better or worse the input sequences than the previously produced graph? <details> <summary>Click me for the answer</summary> The total graph length is now closter each length of most of the input sequences, the underalignment is basically fixed. </details> Try to decrease the number of mappings to keep for each segment: pggb -i data/HLA/DRB1-3123.fa.gz -p 80 -n 6 -t 16 -o out3_DRB1_3123 How does it affect the graph? Try to increase the target sequence length for the partial order alignment (POA) problem (`-G 700,900,1100` by default): pggb -i data/HLA/DRB1-3123.fa.gz -p 80 -n 12 -t 16 -G 1400,1800,2200 -o out4_DRB1_3123 How is this changing the runtime and the memory usage? How is this affecting graph statistics? How many blocks were selected and 'smoothed' during the two rounds of graph normalization? Try 1 or 4 rounds of normalization (for example, by specifying `-G 700`, or `-G 700,900,1100,1300`). How does this affect graph statistics? Take the second `pggb` run and try to increase the segment length (`-s 5000` by default): pggb -i data/HLA/DRB1-3123.fa.gz -s 20000 -p 80 -n 12 -t 16 -o out5_DRB1_3123 How is this affecting graph statistics? Why? `pggb` produces intermediate graphs during the process. Let's keep all of them: pggb -i data/HLA/DRB1-3123.fa.gz -p 80 -n 12 -t 16 --keep-temp-files -o out2_DRB1_3123_keep_intermediate_graphs What does the file with name ending with `.seqwish.gfa` contain? and what about the file with name ending with `.smooth.1.gfa`? Take a look at the graph statistics of all the GFA files in the `out2_DRB1_3123_keep_intermediate_graphs` folder. Choose another HLA gene from the `data` folder and explore how the statistics of the resulting graph change as` s`, `p`,` n` change. If you have `R` installed on your laptop, produce scatter plots where on the x-axis there are the tested values of one of the `pggb` parameters (`s`, `p`, or `n`) and on the y-axis one of the graph statistics (length, number of nodes, or number of edges). You can do that using the final graph and/or the intermediate ones. ### Build LPA pangenome graphs [Lipoprotein(a) (LPA)](https://en.wikipedia.org/wiki/Lipoprotein(a)) is a low-density lipoprotein variant containing a protein called apolipoprotein(a). Genetic and epidemiological studies have identified lipoprotein(a) as a risk factor for atherosclerosis and related diseases, such as coronary heart disease and stroke. Try to make LPA pangenome graphs. The input sequences are in `data/LPA/LPA.fa.gz`. Sequences in this locus have a peculiarity: which one? Hint: visualize the alignments and take a look at the graph layout (with `Bandage` and/or in the `.draw_multiqc.png` files).