## BIG data vs the HPRC pangenome

#### 1. What is the Graph Alignment Format (GAF) file?

GAF is a TAB-delimited format for sequence-to-graph alignments. It is a strict superset of the PAF format. Each GAF line consists of 12 mandatory fields:

```
Col  Type    Description
1    string  Query sequence name
2    int     Query sequence length
3    int     Query start (0-based; closed)
4    int     Query end (0-based; open)
5    char    Strand relative to the path: "+" or "-"
6    string  Path matching /([><][^\s><]+(:\d+-\d+)?)+|([^\s><]+)/
7    int     Path length
8    int     Start position on the path (0-based)
9    int     End position on the path (0-based)
10   int     Number of residue matches
11   int     Alignment block length
12   int     Mapping quality (0-255; 255 for missing)
```

A path in column 6 is defined by the following grammar:

```
<path>       <- <stableId> | <orientIntv>+
<orientIntv> <- ('>' | '<') (<segId> | <stableIntv>)
<stableIntv> <- <stableId> ':' <start> '-' <end>
```

where `<segId>` is the second column on an S-line in GFA and `<stableId>` is an identifier on an SN tag in rGFA. With `<segId>`, GAF can encode alignments against an ordinary GFA. `<stableId>` is preferred for alignments against rGFA. Notably, if a path consists of a single interval on a stable sequence, we can simplify the entire path to `<stableId>` without `<start>` and `<end>`, and assume the orientation to be forward (`>`). Such a GAF line is reduced to PAF.

The following example shows the read mapping for GTGGCT and CGTTTCC against the example rGFA in the rGFA specification. In the unstable segment coordinate, the GAF is:

```
read1  6  0  6  +  >s2>s3>s4  12  2  8  6  6  60  cg:Z:6M
read2  7  0  7  +  >s2>s5>s6  11  1  8  7  7  60  cg:Z:7M
```

In the stable coordinate, the GAF becomes:

```
read1  6  0  6  +  chr1  17  7  13  6  6  60  cg:Z:6M
read2  7  0  7  +  >chr1:5-8>foo:8-16  11  1  8  7  7  60  cg:Z:7M
```

#### 2. Difference between gafpack and the new gaf2pack

gaf2pack reports the coverage for each base in the nodes.
To plot it, the coverage values of each node have to be summed; the result is the cumulative coverage of that node. The format is a tab-separated file with sequence position (seq.pos), node ID (node.id), node offset (node.offset, 0-based) and coverage.

```
seq.pos  node.id  node.offset  coverage
423      30       61           6
424      30       62           0
425      30       63           2
426      30       64           2
```

I am not sure about the node offset. Maybe node.offset means the 0-based position within that specific node (so offset 0 would be the first base in the node). For example, position 423 in the overall sequence would correspond to offset 61 (0-based, so actually the 62nd base) within node 30, and it has a coverage of 6.

#### 3. Plots

All the plots I produced were based on the cumulative coverage. Also, because of the complexity of the matrix, I never produced a plot for the whole matrix; instead, I used a Rust script to extract only 1 million random nodes.

#### 4. Masked regions

We can't remove the nodes in the matrices that are not useful (not covered by the exome), because we don't have their positions with respect to the reference. We need to wait until the position matrix is created.
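
As a side note to section 2, the per-node summation could be sketched in Python roughly as follows. This is only a minimal sketch: it assumes the gaf2pack output is a tab-separated file with exactly the header shown in section 2 (`seq.pos`, `node.id`, `node.offset`, `coverage`) and integer coverage values. The function name `cumulative_coverage` is hypothetical, and this is not the script that was used for the plots.

```python
import csv
import io
from collections import defaultdict


def cumulative_coverage(handle):
    """Sum the per-base coverage of every node into one value per node.

    `handle` is any open text file (or file-like object) holding the
    tab-separated table; the column names are assumed, not guaranteed.
    """
    totals = defaultdict(int)
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # node.id is kept as a string; coverage is assumed to be an integer
        totals[row["node.id"]] += int(row["coverage"])
    return dict(totals)


# The four example rows from section 2, all belonging to node 30
example = (
    "seq.pos\tnode.id\tnode.offset\tcoverage\n"
    "423\t30\t61\t6\n"
    "424\t30\t62\t0\n"
    "425\t30\t63\t2\n"
    "426\t30\t64\t2\n"
)

per_node = cumulative_coverage(io.StringIO(example))
print(per_node)  # {'30': 10}
```

For a real file, the same function can be called on `open(path, newline="")`. Since the rows are read one at a time, only the per-node totals are held in memory, not the whole per-base table.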