# Towards IMplicit Pangenome Graph Construction Tools
> This MD explores how we could use https://github.com/ekg/impg in order to build a pangenome graph in small pieces feeding these into https://github.com/pangenome/pggb, lacing them together and normalizing the boundaries of our laced pieces in the final step.
First, let's generate some alignments from 8 yeast genomes from the [YPRP](https://yjx1217.github.io/Yeast_PacBio_2016/welcome/).
```
wfmash \
scerevisiae8.fasta.gz \
scerevisiae8.fasta.gz \
\
--threads 4 \
\
-n 7 -s 5000 -p 90.0 -X \
-l 25000 -k 19 -H 0.001 -2 30 > scerevisiae8.fasta.gz.paf
```
Somehow we need to detect the connected components inherent in the alignments... I would guess that this is already possible with the Leiden algorithm.
Here, for simplicity, just work with one component, specifically the one that is spanned by the sequence `S288C#chrI`.
## By Component
To get an overview of what is possible using `S288C#chrI` as our anchor sequence, let's see what's in the FASTA index.
```
grep -E "S288C#chrI" scerevisiae8.fasta.gz.fai | head -n 1
S288C#chrI 219929 24333564 60 61
```
Now we know the maximum sequence length of our chosen anchor `S288C` is 219929 bases. Since IMPG is 0-based, we need to subtract 1 when we query the whole length of this sequence:
```
impg -p scerevisiae8.fasta.gz.paf -q "S288C#chrI:0-219928" -x
S288C#chrI 0 219928
SGDref#chrI 0 219136
YPS128#chrI 8328 219784
DBVPG6044#chrI 1152 198912
Y12#chrI 1792 185000
DBVPG6765#chrI 2816 204416
SK1#chrI 17560 209304
UWOPS034614#chrI 16912 204685
```
Unsurprisingly, we don't hit all positions of all the sequences.
> During my latest trials, one can overshoot the maximum sequence lengths in IMGS: `S288C#chrI:0-220000` was also possible to enter without any errors. Not sure if I am doing something wrong, or the input alignments are at fault, or it is a bug. I tend to say it is the last one.
Now we would need to extend transitively to the left and to the right, until we hit all positions of all sequences. Let's try this manually to the left:
```
impg -p scerevisiae8.fasta.gz.paf -q "YPS128#chrI:0-8327" -x
YPS128#chrI 0 8327
SK1#chrI 219704 228349
DBVPG6044#chrI 209258 214933
S288C#chrI 1024 1439
SGDref#chrI 1280 1390
```
Uuh, interesting. We also collected sequences with very high positions, presumably because the assemblies were done on the reverse strand or in the different direction. Also note that sequence `UWOPS034614#chrI` does not appear anymore!
Let's give it another try for this missing sequence:
```
impg -p scerevisiae8.fasta.gz.paf -q "UWOPS034614#chrI:0-16911" -x
UWOPS034614#chrI 0 16911
SK1#chrI 8072 32821
S288C#chrI 24336 30070
SGDref#chrI 23952 29631
DBVPG6044#chrI 11536 15390
DBVPG6765#chrI 12048 16262
Y12#chrI 13456 16290
YPS128#chrI 26008 27148
```
Hmmm.... this is weird. Because now `SC288C#chrI` is appearing again, although, we already went through it's full range. Is this expected or a bug? I need to ask Erik.
I am theorizing that this might point to regions of repeats?
Anyhow, not understanding what's going on here, I can't complete my extensions to the left or right.
## By Ranges
Now we want to split our input into even smaller ranges, specifically we want to use `S288C#chrI` as the anchor sequence gain, but split it into 3 equally sized pieces.
Query for the 1st piece:
```
impg -p scerevisiae8.fasta.gz.paf -q "S288C#chrI:0-73309" -x
S288C#chrI 0 73309
SGDref#chrI 0 72869
YPS128#chrI 8328 70380
DBVPG6044#chrI 1152 58611
Y12#chrI 1792 59591
DBVPG6765#chrI 2816 59461
SK1#chrI 17560 76038
UWOPS034614#chrI 16912 60133
```
Query for the 2nd piece:
```
impg -p scerevisiae8.fasta.gz.paf -q "S288C#chrI:73310-146618" -x
S288C#chrI 73310 146618
SGDref#chrI 72870 146175
YPS128#chrI 70381 143650
DBVPG6044#chrI 58612 132568
Y12#chrI 59592 132892
DBVPG6765#chrI 59462 138728
SK1#chrI 76039 149338
UWOPS034614#chrI 60134 133792
```
Query for the 3rd piece:
```
impg -p scerevisiae8.fasta.gz.paf -q "S288C#chrI:146619-219928" -x
S288C#chrI 146619 219928
SGDref#chrI 146176 219136
YPS128#chrI 143651 219784
DBVPG6044#chrI 132569 198912
Y12#chrI 132893 185000
DBVPG6765#chrI 138729 204416
SK1#chrI 149339 209304
UWOPS034614#chrI 133793 204685
```
The good news: We get perfectly overlapping ranges. No bugs! The bad news: I still don't know how to extend to the left or right.
In the end, we need to ensure that we hit all full ranges of all paths. We can store this in some tree structure and validate and extend in parallel.
Next we would run PGGB for each of our BED ranges. This means that at the moment, we would need to extract all these ranges from the FASTA with `samtools faidx`. Once IMPG supports PAF output, we can directly start with the SEQWISH phase.
Once we have the graphs, we need to lace these together. A task for the future `odgi lace` tool. Then we need to run POA at the boundaries of our laced graphs, to normalize there, too. Maybe another subcommand in ODGI, `odgi poa`, could do this.
Then we sort the final graph and are done.
Once we did this for all components, we can put them into one graph with `odgi squeeze`.