
# GFA2bin tool

1. Below is "our" pipeline.

   ![Screenshot from 2025-03-31 14-45-31](https://hackmd.io/_uploads/HkggZ_O6kx.png)

2. The tool was developed in the meantime: https://github.com/MoinSebi/gfa2bin

   This tool does not start from our input and output. It works directly on the matrix of reads per node. As we have done in the past, it creates a genotype matrix from the matrix of reads per node (using different normalization steps).

   ![Screenshot from 2025-04-01 10-13-09](https://hackmd.io/_uploads/rkTqMYtp1e.png)

Below is how it works:

`gfa2bin cov --p file.pack --output <output>`

![Screenshot from 2025-03-31 13-18-40](https://hackmd.io/_uploads/HysNPwuake.png)

Normalizations:

```
-a, --absolute-threshold <absolute-threshold>    Set a absolute threshold
    --method <method>                            Normalization method (mean|median|percentile)
```

The absolute threshold is a hard cutoff value for coverage. When I tried it with the example tests, it didn't work.

Main differences:

- Compared to our pipeline, it uses `gaf2pack` to produce the matrix of reads per node, which gives this output: With our tool, we instead obtained node IDs and coverage only.

  ![Screenshot from 2025-03-31 14-41-41](https://hackmd.io/_uploads/HykSlOd6kx.png)

- It uses an rGFA file: https://github.com/lh3/gfatools/blob/master/doc/rGFA.md
- In addition, it is possible to compress the pack file (maybe interesting for human data):

  Compress a plain-text coverage file to "pack compressed". Mainly used to reduce the storage size of the coverage file. Maximum coverage in these files is 65,535. Higher coverages are truncated.

#### One sample

- Use `gaf2pack`

  `gaf2pack --gfa D_C_mm10.fa.gz.bf3285f.eb0f3d3.867196c.smooth.final.gfa --alignment BXD32_BDref_inject.gaf --output BXD32.gaf2pack.txt`

- Use `gfa2bin`

  `gfa2bin cov --output /scratch/BXD32.geno --packlist /lizardfs/flaviav/mouse/panQTL_result/BXD32/path.txt`

Default:

```
4740 entries have been truncated (have a coverage above 65,535).
4740 entries have been truncated (have a coverage above 65,535).
```
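To make the truncation message and the normalization options more concrete, here is a minimal Python sketch of the idea. It is not the gfa2bin implementation. The function names, the `>=` comparison, the nearest-rank percentile, and the choice to compute the statistic over all nodes (including zero-coverage ones) are assumptions made for illustration. The only facts taken from the notes are the 65,535 ceiling (the largest value of a 16-bit unsigned integer), the absolute threshold as a hard cutoff, and the three method names.

```python
import statistics

U16_MAX = 65_535  # largest value a 16-bit unsigned integer can hold


def truncate_coverage(coverage):
    """Clamp each value to U16_MAX; also return how many values were clamped."""
    clamped = [min(c, U16_MAX) for c in coverage]
    n_truncated = sum(1 for c in coverage if c > U16_MAX)
    return clamped, n_truncated


def threshold_for(coverage, method="mean", percentile=50):
    """Derive a per-sample cutoff from the coverage values (hypothetical helper)."""
    if method == "mean":
        return statistics.mean(coverage)
    if method == "median":
        return statistics.median(coverage)
    if method == "percentile":
        # Nearest-rank percentile, using integer ceiling division.
        ordered = sorted(coverage)
        rank = max(1, (percentile * len(ordered) + 99) // 100)
        return ordered[rank - 1]
    raise ValueError(f"unknown method: {method}")


def call_genotypes(coverage, absolute_threshold=None, method="mean"):
    """Return 1 (present) or 0 (absent) per node.

    An absolute threshold, when given, takes precedence over the
    statistic-based cutoff (an assumption of this sketch).
    """
    if absolute_threshold is not None:
        cutoff = absolute_threshold
    else:
        cutoff = threshold_for(coverage, method)
    return [1 if c >= cutoff else 0 for c in coverage]


if __name__ == "__main__":
    node_cov = [0, 3, 12, 70_000, 5]  # made-up coverage per node
    clamped, n = truncate_coverage(node_cov)
    print(f"{n} entries have been truncated (have a coverage above 65,535).")
    print(call_genotypes(clamped, absolute_threshold=5))
```

With a hard cutoff every sample is judged on the same scale, while the mean, median, or percentile variants adapt the cutoff to each sample's sequencing depth, which is presumably why the tool offers both.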