Sourmash Coverage Binning
===
[article](https://almob.biomedcentral.com/articles/10.1186/s13015-022-00221-z)
1. 3-mer frequencies (composition)
> For each long read, we count the frequencies of all 64 3-mers in this read and merge the reverse complements to form a vector of 32 dimensions. The resulting vector is then normalised by the total number of 3-mers observed in the read.
- sketch 3-mers per record (singleton) --> vector of 3-mer frequencies per read/contig
(build a CSV of the frequencies of all 32 canonical 3-mers)
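
A minimal Python sketch of the composition vector from the quote above, assuming each read/contig is available as a plain DNA string. The helper names (`canonical`, `trimer_profile`) are hypothetical, and skipping windows that contain non-ACGT characters is an assumption; this is not sourmash's or the paper's code.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


# 64 possible 3-mers collapse to 32 canonical ones (no 3-mer is its own reverse complement).
CANONICAL_3MERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=3)})


def trimer_profile(seq):
    """32-dimensional vector of canonical 3-mer frequencies for one read/contig."""
    seq = seq.upper()
    counts = dict.fromkeys(CANONICAL_3MERS, 0)
    total = 0
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if set(kmer) <= set("ACGT"):  # assumption: ignore windows with N or other symbols
            counts[canonical(kmer)] += 1
            total += 1
    if total == 0:
        return [0.0] * len(CANONICAL_3MERS)
    # Normalise by the total number of 3-mers observed in the read.
    return [counts[k] / total for k in CANONICAL_3MERS]
```

One row per read/contig, with `CANONICAL_3MERS` as the column header, would give the CSV mentioned in the note.
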
2. 15-mer --> coverage
(compare to MetaBAT's read-mapping coverage file?)
- sketch full file --> overall 15-mer abundances
- stream through contigs/reads --> apply abundances from full file to k-mers found in contigs.
- normalization: is any needed?
- contig-level normalization = (sum of abundances of the k-mers in the contig/read) / (number of k-mers in the contig/read)
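
The two passes in the bullets above could be sketched in plain Python as follows. This is only an illustration under stated assumptions: it counts every canonical k-mer exactly, whereas a sourmash sketch would typically hold a subsampled set of hashes, and the function names (`canonical_kmers`, `count_dataset_kmers`, `mean_kmer_abundance`) are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Yield the canonical form of every ACGT-only k-mer in seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # assumption: skip windows with N
            yield min(kmer, kmer.translate(COMPLEMENT)[::-1])


def count_dataset_kmers(seqs, k=15):
    """Pass 1: k-mer abundances over the whole file."""
    abundances = Counter()
    for seq in seqs:
        abundances.update(canonical_kmers(seq, k))
    return abundances


def mean_kmer_abundance(seq, abundances, k=15):
    """Pass 2: (sum of dataset abundances of the k-mers in seq) / (number of k-mers in seq)."""
    kmers = list(canonical_kmers(seq, k))
    if not kmers:
        return 0.0
    return sum(abundances[km] for km in kmers) / len(kmers)
```

For example, with `abund = count_dataset_kmers(reads)`, calling `mean_kmer_abundance(contig, abund)` gives one coverage-like number per contig that could be lined up against a read-mapping depth value.
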
> While an all-vs-all alignment of long reads may provide coverage information for each long read, it is usually too time-consuming to perform the quadratic number of pairwise alignments on large scale long-read datasets. Given a sufficiently large k, the frequency of a k-mer is defined as the number of occurrences of this k-mer in the entire dataset. Long reads from high-abundance species tend to contain k-mers with higher frequencies compared to long reads from low-abundance species. Hence, a k-mer frequency vector can be computed for each long read to represent coverage information without performing alignments [15] to represent read coverage. In order to obtain such coverage histograms, we first compute the k-mer counts of all long reads in the entire dataset by DSK [22] (the default value of k=15). The counts are then indexed in memory by encoding each nucleotide in 2 bits as per the encoding (i.e., A=00, C=01, T=10 and G=11) [22]. The resulting index is in the form $(k_i, \mathrm{coverage}(k_i))$ (as key, value pairs), where $\mathrm{coverage}(k_i)$ is the number of occurrences of the k-mer $k_i$ in the entire dataset. Now for each k-mer $k_i$ of a read, we obtain the frequency from the index. These frequencies are then used to build a normalised histogram, $V^C$. We chose a preset bin width ($bin\_width$) for the histogram and obtain a vector of $bins$ dimensions. By default we set [...]. All the k-mers with counts exceeding the histogram limits are added into the last index of the histogram. We also normalise the histogram by the total number of k-mers observed in the read.
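
A small Python sketch of the two mechanical steps in the quote above: the 2-bit k-mer encoding and the normalised coverage histogram with an overflow bin. The function names are hypothetical, the mapping of a count `c` to bin `c // bin_width` is an assumption about the exact bin boundaries, and the `bin_width`/`bins` values in the usage note are illustrative only, not the paper's defaults.

```python
# 2-bit codes as given in the quoted text.
TWO_BIT = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}


def encode_kmer(kmer):
    """Pack a k-mer into an integer, 2 bits per nucleotide (first base in the high bits)."""
    value = 0
    for base in kmer:
        value = (value << 2) | TWO_BIT[base]
    return value


def coverage_histogram(kmer_counts, bin_width, bins):
    """Normalised histogram of the dataset-wide counts of one read's k-mers.

    kmer_counts: one dataset-wide count per k-mer in the read.
    Counts beyond the histogram range are added to the last bin.
    """
    hist = [0] * bins
    for count in kmer_counts:
        index = min(count // bin_width, bins - 1)  # assumption: bin = count // bin_width
        hist[index] += 1
    total = len(kmer_counts)
    if total == 0:
        return [0.0] * bins
    # Normalise by the total number of k-mers observed in the read.
    return [h / total for h in hist]
```

With illustrative values, `coverage_histogram([1, 4, 5, 12, 100], bin_width=5, bins=4)` places two counts in the first bin, one each in the second and third, and the out-of-range count of 100 in the last bin.
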
also https://www.nature.com/articles/s42003-023-05452-3