Sourmash Coverage Binning

[article](https://almob.biomedcentral.com/articles/10.1186/s13015-022-00221-z)

1. 3-mer frequencies (composition)

   > For each long read, we count the frequencies of all 64 3-mers in this read and merge the reverse complements to form a vector of 32 dimensions. The resulting vector is then normalised by the total number of 3-mers observed in the read.

   - sketch singleton 3-mers --> vector of 3-mer frequencies per read/contig (build csv of frequencies of all 32 canonical 3-mers)

2. 15-mer --> coverage (compare to MetaBAT's read-mapping coverage file?)

   - sketch full file --> overall 15-mer abundances
   - stream through contigs/reads --> apply abundances from full file to k-mers found in contigs
   - normalization: any normalization???
     - contig-level normalization = (sum abunds of each k-mer) / n_kmers in contig/read

   > While an all-vs-all alignment of long reads may provide coverage information for each long read, it is usually too time-consuming to perform the quadratic number of pairwise alignments on large scale long-read datasets. Given a sufficiently large k, the frequency of a k-mer is defined as the number of occurrences of this k-mer in the entire dataset. Long reads from high-abundance species tend to contain k-mers with higher frequencies compared to long reads from low-abundance species. Hence, a k-mer frequency vector can be computed for each long read to represent coverage information without performing alignments [15] to represent read coverage. In order to obtain such coverage histograms, we first compute the k-mer counts of all long reads in the entire dataset by DSK [22] (the default value of k=15). The counts are then indexed in memory by encoding each nucleotide in 2 bits as per the encoding (i.e., A=00, C=01, T=10 and G=11) [22]. The resulting index is in the form [(k-mer, count)] (as key, value pairs), where [count] is the number of occurrences of the k-mer in the entire dataset. Now for each k-mer of a read, we obtain the frequency from the index.
   >
   > These frequencies are then used to build a normalised histogram [...]. We chose a preset bin width [...] for the histogram and obtain a vector of *bins* dimensions. By default we set [...]. All the k-mers with counts exceeding the histogram limits are added into the last index of the histogram. We also normalise the histogram by the total number of k-mers observed in the read.

See also: https://www.nature.com/articles/s42003-023-05452-3
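
The 3-mer composition vector from step 1 can be sketched in plain Python as below. This is only an illustration of the counting and normalisation described in the quote, not sourmash code and not the paper's implementation. The names `canonical`, `CANONICAL_3MERS`, and `trimer_profile` are made up for this sketch, and skipping 3-mers that contain non-ACGT characters is an assumption.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


# 64 possible 3-mers collapse to 32 canonical ones (odd k has no palindromes).
CANONICAL_3MERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=3)})
INDEX = {kmer: i for i, kmer in enumerate(CANONICAL_3MERS)}


def trimer_profile(seq):
    """Normalised 32-dimensional canonical 3-mer frequency vector for one read/contig."""
    seq = seq.upper()
    counts = [0] * len(CANONICAL_3MERS)
    total = 0
    for i in range(len(seq) - 2):
        idx = INDEX.get(canonical(seq[i:i + 3]))
        if idx is None:
            # Assumption: 3-mers with ambiguous bases (e.g. N) are skipped.
            continue
        counts[idx] += 1
        total += 1
    if total == 0:
        return [0.0] * len(CANONICAL_3MERS)
    # Normalise by the number of 3-mers actually counted in this sequence.
    return [c / total for c in counts]
```

Writing one row of `trimer_profile(seq)` per read or contig, with `CANONICAL_3MERS` as the header, would give the csv mentioned in the notes.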
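
The 15-mer coverage idea from step 2 can be sketched the same way. Here a `collections.Counter` stands in for the DSK index (or a sourmash abundance sketch of the full file), so this is a stand-in technique rather than either tool's API. All function names are hypothetical. `bin_width` and `n_bins` are required arguments because their default values are not recoverable from the quoted text. Counting canonical k-mers and using `count // bin_width` as the bin index are assumptions.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def canonical_kmers(seq, k):
    """Yield canonical k-mers of seq, skipping any that contain non-ACGT characters."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= _BASES:
            yield min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def count_kmers(reads, k=15):
    """Dataset-wide k-mer abundances (in-memory stand-in for a DSK index)."""
    counts = Counter()
    for seq in reads:
        counts.update(canonical_kmers(seq, k))
    return counts


def coverage_histogram(seq, counts, k, bin_width, n_bins):
    """Normalised histogram of dataset-wide abundances for the k-mers of one read."""
    hist = [0] * n_bins
    total = 0
    for kmer in canonical_kmers(seq, k):
        # Abundances beyond the histogram range fall into the last bin.
        hist[min(counts[kmer] // bin_width, n_bins - 1)] += 1
        total += 1
    return [h / total for h in hist] if total else [0.0] * n_bins


def mean_abundance(seq, counts, k=15):
    """Contig-level normalisation from the notes: sum of abundances / n_kmers."""
    abunds = [counts[kmer] for kmer in canonical_kmers(seq, k)]
    return sum(abunds) / len(abunds) if abunds else 0.0
```

`mean_abundance` gives a single number per contig that could be compared against a read-mapping depth value, while `coverage_histogram` gives the per-read vector described in the quote.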