IMplicit Pangenome Graph

# Towards IMplicit Pangenome Graph Construction Tools > This MD explores how we could use https://github.com/ekg/impg in order to build a pangenome graph in small pieces feeding these into https://github.com/pangenome/pggb, lacing them together and normalizing the boundaries of our laced pieces in the final step. First, let's generate some alignments from 8 yeast genomes from the [YPRP](https://yjx1217.github.io/Yeast_PacBio_2016/welcome/). ``` wfmash \ scerevisiae8.fasta.gz \ scerevisiae8.fasta.gz \ \ --threads 4 \ \ -n 7 -s 5000 -p 90.0 -X \ -l 25000 -k 19 -H 0.001 -2 30 > scerevisiae8.fasta.gz.paf ``` Somehow we need to detect the connected components inherent in the alignments... I would guess that this is already possible with the Leiden algorithm. Here, for simplicity, just work with one component, specifically the one that is spanned by the sequence `S288C#chrI`. ## By Component To get an overview of what is possible using `S288C#chrI` as our anchor sequence, let's see what's in the FASTA index. ``` grep -E "S288C#chrI" scerevisiae8.fasta.gz.fai | head -n 1 S288C#chrI 219929 24333564 60 61 ``` Now we know the maximum sequence length of our chosen anchor `S288C` is 219929 bases. Since IMPG is 0-based, we need to subtract 1 when we query the whole length of this sequence: ``` impg -p scerevisiae8.fasta.gz.paf -q "S288C#chrI:0-219928" -x S288C#chrI 0 219928 SGDref#chrI 0 219136 YPS128#chrI 8328 219784 DBVPG6044#chrI 1152 198912 Y12#chrI 1792 185000 DBVPG6765#chrI 2816 204416 SK1#chrI 17560 209304 UWOPS034614#chrI 16912 204685 ``` Unsurprisingly, we don't hit all positions of all the sequences. > During my latest trials, one can overshoot the maximum sequence lengths in IMGS: `S288C#chrI:0-220000` was also possible to enter without any errors. Not sure if I am doing something wrong, or the input alignments are at fault, or it is a bug. I tend to say it is the last one. Now we would need to extend transitively to the left and to the right, until we hit all positions of all sequences. Let's try this manually to the left: ``` impg -p scerevisiae8.fasta.gz.paf -q "YPS128#chrI:0-8327" -x YPS128#chrI 0 8327 SK1#chrI 219704 228349 DBVPG6044#chrI 209258 214933 S288C#chrI 1024 1439 SGDref#chrI 1280 1390 ``` Uuh, interesting. We also collected sequences with very high positions, presumably because the assemblies were done on the reverse strand or in the different direction. Also note that sequence `UWOPS034614#chrI` does not appear anymore! Let's give it another try for this missing sequence: ``` impg -p scerevisiae8.fasta.gz.paf -q "UWOPS034614#chrI:0-16911" -x UWOPS034614#chrI 0 16911 SK1#chrI 8072 32821 S288C#chrI 24336 30070 SGDref#chrI 23952 29631 DBVPG6044#chrI 11536 15390 DBVPG6765#chrI 12048 16262 Y12#chrI 13456 16290 YPS128#chrI 26008 27148 ``` Hmmm.... this is weird. Because now `SC288C#chrI` is appearing again, although, we already went through it's full range. Is this expected or a bug? I need to ask Erik. I am theorizing that this might point to regions of repeats? Anyhow, not understanding what's going on here, I can't complete my extensions to the left or right. ## By Ranges Now we want to split our input into even smaller ranges, specifically we want to use `S288C#chrI` as the anchor sequence gain, but split it into 3 equally sized pieces. Query for the 1st piece: ``` impg -p scerevisiae8.fasta.gz.paf -q "S288C#chrI:0-73309" -x S288C#chrI 0 73309 SGDref#chrI 0 72869 YPS128#chrI 8328 70380 DBVPG6044#chrI 1152 58611 Y12#chrI 1792 59591 DBVPG6765#chrI 2816 59461 SK1#chrI 17560 76038 UWOPS034614#chrI 16912 60133 ``` Query for the 2nd piece: ``` impg -p scerevisiae8.fasta.gz.paf -q "S288C#chrI:73310-146618" -x S288C#chrI 73310 146618 SGDref#chrI 72870 146175 YPS128#chrI 70381 143650 DBVPG6044#chrI 58612 132568 Y12#chrI 59592 132892 DBVPG6765#chrI 59462 138728 SK1#chrI 76039 149338 UWOPS034614#chrI 60134 133792 ``` Query for the 3rd piece: ``` impg -p scerevisiae8.fasta.gz.paf -q "S288C#chrI:146619-219928" -x S288C#chrI 146619 219928 SGDref#chrI 146176 219136 YPS128#chrI 143651 219784 DBVPG6044#chrI 132569 198912 Y12#chrI 132893 185000 DBVPG6765#chrI 138729 204416 SK1#chrI 149339 209304 UWOPS034614#chrI 133793 204685 ``` The good news: We get perfectly overlapping ranges. No bugs! The bad news: I still don't know how to extend to the left or right. In the end, we need to ensure that we hit all full ranges of all paths. We can store this in some tree structure and validate and extend in parallel. Next we would run PGGB for each of our BED ranges. This means that at the moment, we would need to extract all these ranges from the FASTA with `samtools faidx`. Once IMPG supports PAF output, we can directly start with the SEQWISH phase. Once we have the graphs, we need to lace these together. A task for the future `odgi lace` tool. Then we need to run POA at the boundaries of our laced graphs, to normalize there, too. Maybe another subcommand in ODGI, `odgi poa`, could do this. Then we sort the final graph and are done. Once we did this for all components, we can put them into one graph with `odgi squeeze`.