---
tags: stamps2025
---
# Metagenomics intro III: reference-based metagenome interpretation & enhancing reference catalogs (STAMPS 2025)
Titus Brown
July 17, 2025
## Measuring assignability of metagenomes using a reference database
Let's run that sourmash gather one... more... time:
```
conda activate sourmash
cd ~/
sourmash gather -k 51 --scaled 10_000 \
/opt/shared-2/sourmash-mgx/SRR11125891.sig \
/opt/shared-2/sourmash-db/entire-2025-07-11.k51.rocksdb \
--threshold-bp=0
```
At the bottom you should see:
```
found 1536 matches total;
the recovered matches hit 76.5% of the abundance-weighted query.
the recovered matches hit 60.6% of the query k-mers (unweighted).
```
This tells you that 76.5% of the metagenome matches to at least one thing in the database we are using - that is the "assignability" or "explainability" of the metagenome.
My prediction would be that you can map at least 76% of the reads in the metagenome to at least one known genome.
It turns out this is highly dependent on k-mer size - smaller k-mer sizes match better! Here's the same analysis but on _all_ the 21-mers from GTDB rs226:
```
sourmash gather -k 21 \
/opt/shared-2/sourmash-mgx/SRR11125891.sig \
/opt/shared-2/sourmash-db/gtdb-rs226.k21.scaled100_000.merged.sig.zip \
--threshold-bp=0
```
it should yield:
```
the recovered matches hit 89.9% of the abundance-weighted query.
the recovered matches hit 76.7% of the query k-mers (unweighted).
```
So, at k=21, about 90% of the metagenome matches to something bacterial or archaeal.
## Measuring the impact of new MAGs on assignability
With sourmash, it's relatively straightforward to add new databases into your search.
I've placed Annie ter Horst's new pig catalog on our shared instances here: `/opt/shared-2/sourmash-db/new-pig-catalog.k21.sig.zip`.
You can see there's over 11,000 new genomes in it:
```
conda activate sourmash
sourmash sig summarize /opt/shared-2/sourmash-db/new-pig-catalog.k21.sig.zip
```
and you can run SRR11125891 against it with this command:
```
sourmash gather -k 21 --scaled=100_000 \
/opt/shared-2/sourmash-mgx/SRR11125891.sig \
/opt/shared-2/sourmash-db/gtdb-rs226.k21.scaled100_000.merged.sig.zip \
/opt/shared-2/sourmash-db/new-pig-catalog.k21.sig.zip
```
and you should see:
```
the recovered matches hit 92.9% of the abundance-weighted query.
the recovered matches hit 79.1% of the query k-mers (unweighted)
```
along with list of the matches _after_ the GTDB matches. So that's a relatively small increase for this metagenome, but it shows the process.
<!--
## The challenges of assembly and binning
assemble metagenome
(bin metagenome)
## Classifying a newly assembled genome
-->
## To reference, or not to reference - that is the question!
We'll talk about assembly tomorrow morning, as one way to become less dependent on reference databases. Assembly has its own issues, however.
But, basically, a good "default" workflow in metagenomics these days seems to be to plan on generating a new set of reference genomes from each environment by doing a focused set of deep and high quality sequencing, followed by assembly and binning. Then, you can do lower coverage sequencing of other samples from that same environment, and analyze it using the constructed catalog.
## Reference- and assembly-independent comparison of metagenomes using k-mers
There's actually no particular reason to insist that k-mers be part of a reference database; you can compare k-mers in metagenomes directly:
```
cd ~/day3-kmers
sourmash scripts upset /opt/shared-2/sourmash-mgx/{SRR11125891,SRR8655119,SRR11126365}.sig -k 31 -o upset.png --show-singletons
```
In the resulting `upset.png`, you will see:

which shows you that there's actually relatively little overlap between these three metagenomes - but there is *some*.
(Can anyone hazard a guess as to why there is so little overlap in here? One thing sticks out for me.)
### What would an overlap mean?
Let's take a look at a much larger analysis that I did for some collaborators on a collection of acid mine drainage data sets:

What do the overlaps here mean? Shared content!! That may be accessible for building new genomes!!
While this is still pretty early days, I think there is reason to believe that this _kind_ of analysis (looking at overlaps between large collections of metagenomes) can suggest which samples should be prioritized for co-assembly and co-binning to produce catalogs.
## Annotated bibliography
Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers, Irber et al., 2023. [link](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2). This is the core sourmash paper that explains how sourmash works and what 'sourmash gather' does.
Simka: Multiple comparative metagenomics using multiset k-mer counting. Benoit et al, (2016). [link](https://peerj.com/articles/cs-94/) One of my favorite examples of "just let's use k-mers to compare things, ok?"
Smith et al. (2022) – Database choice impacts accuracy in rumen metagenomic classification (focus: RefSeq):
https://animalmicrobiome.biomedcentral.com/articles/10.1186/s42523-022-00207-7
Xie et al. (2021) – An integrated gene catalog and MAGs from ruminant gastrointestinal microbiomes (focus: GenBank):
https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-021-01078-x
[Comprehensive taxonomic identification of microbial species in metagenomic data using SingleM and Sandpiper](https://www.nature.com/articles/s41587-025-02738-1), Woodcroft et al., 2025.