---
tags: stamps2025
---
[toc]
# Metagenomics intro I: sampling & presence/absence (STAMPS 2025)
Titus Brown
July 17, 2025
## Vibe check?
Is the AC running today?
## Background motivation and the day ahead!!
Metagenomes are big, and complicated. They are mixtures of many different genomic sequences, at different relative abundances, and this makes it tricky to understand them intuitively.
Many metagenome analyses are inherently **reference based**, in that they use **reference genomes** to interpret metagenome content. This has become the dominant paradigm in the last 5 years.
Today is "eat your wheaties" day: my job is to prepare you for more fun and advanced analyses in the coming week.
We will talk about four major aspects of metagenomics, one each session:
1. Shotgun sequencing of microbiomes, and how to think about sequencing depth and detection of genomes;
2. The two most common basic sequence analysis approaches used in reference-based metagenomics: read mapping and k-mers;
3. Why (and a bit of how) to think about constructing and using "metagenome-assembled genomes";
4. How metagenome content (taxonomy) analysis and genome evolution interact;
My goal for today is to give you a set of intellectual tools for thinking about metagenomics, and my hope is that this will enable you to better understand and interpret your own and other people's analyses, as well as to design your own analyses.
Let's gooooooo!
## Stickies
We'll be using two colors of stickies today!
Green: "Done!"
Purple/red: "Need help!"
No sticky at all: "working on it!"
## Metagenome sequencing is weighted sampling from a mixture
### Generating a treemap!
Connect to [your cloud computer](https://github.com/mblstamps/stamps2025/wiki/Accessing-our-cloud-computers).
Start a Terminal or shell prompt. Then run:
```
conda activate sourmash
pip install -U sourmash_plugin_betterplot
conda tos accept --override-channels --channel defaults
```
This will add some new features to the betterplot plugin that I first showed you on Tuesday (for the sankey diagram). It also makes sure you've accepted the legal agreement for conda :).
Make a new folder:
```
mkdir ~/day3
cd ~/day3
```
and run an analysis of the SRR11125891 metagenome using sourmash:
```
sourmash gather -k 51 --scaled 10_000 \
/opt/shared-2/sourmash-mgx/SRR11125891.sig \
/opt/shared-2/sourmash-db/entire-2025-07-11.k51.rocksdb \
-o SRR11125891.gather.csv
```
Here I've placed the metagenome signature in a shared location on the computers, `/opt/shared-2`, which is also where the sourmash database is.
With this command, we're searching our pig gut metagenome SRR11125891 (see [description](https://www.ebi.ac.uk/ena/browser/view/SRR11125891)) against over 700,000 genomes to generate a summary of its content.
Now, convert the genome names into a taxonomic summary:
```
sourmash tax metagenome -g SRR11125891.gather.csv \
-t /opt/shared-2/sourmash-db/entire-2025-07-11.lineages.sqldb \
-o SRR11125891
```
And, finally, plot the taxonomy with the treemap command:
```
sourmash scripts treemap SRR11125891.summarized.csv \
-o SRR11125891.treemap.species.png \
-r species
```
This generates a "treemap", which is a proportional representation of metagenome sequencing content by genomic content, aggregated (in this case) by _species_.

(Green stickies up when done! Red if you're running into trouble!)
If you have extra time:
- try changing the command from summarizing to species, to instead use genus, family, order, class, phylum or superkingdom!
- Open and inspect the 'gather.csv' file. Look at the 'intersect_bp', 'average_abund', and 'name' columns and think about them. ([Here](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html#appendix-d-gather-csv-output-columns) is a guide to the columns.)
- Try using [Tuesday's instructions](https://hackmd.io/zEpAkqJGTP2DNUmso8jNcw?view#Step-5-Generate-a-sankey-plot) to generate a sankey or taxburst plot.
- Try modifying the commands above to look at different metagenomes - I've put a few in `/opt/shared-2/sourmash-mgx/`, and you can use `ls /opt/shared-2/sourmash-mgx/` to get their names.
## My dart board analogy
(switch to slides & group discussion)
## (Back to computers)
Briefly discuss:
* columns!
## When is a specific genome "present" in a metagenome?
One of the things that every analysis tool has to decide is when to treat something as "real".
### Digression: hidden threshold!
For example, there's an implicit threshold in the gather command above that I didn't tell you about - it by default ignores overlaps between your metagenome and a genome that are less than 50kb in size!
Before we do anything... how would you expect the treemap to change if we include more data??
Let's modify our command to remove the threshold and rerun everything:
```
sourmash gather -k 51 --scaled 10_000 \
--threshold-bp 0 \
/opt/shared-2/sourmash-mgx/SRR11125891.sig \
/opt/shared-2/sourmash-db/entire-2025-07-11.k51.rocksdb \
-o SRR11125891.gather.t0.csv
sourmash tax metagenome -g SRR11125891.gather.t0.csv \
-t /opt/shared-2/sourmash-db/entire-2025-07-11.lineages.sqldb \
-o SRR11125891.t0
sourmash scripts treemap SRR11125891.t0.summarized.csv \
-o SRR11125891.t0.treemap.species.png \
-r species
```
~~(Note: I made a mistake above, intentionally. Where is it?)~~
A few points, echoing Ben's advice from last night:
* "Relaxing" thresholds and seeing how your summaries change is a good thing to do to build intuition!
* (What typically happens to specificity and sensitivity when you make your thresholds less stringent?)
* Remember you can almost always filter _later_. I usually recommend a "triage" approach: generate results with maximum sensitivity, throw stuff out later as you develop reasons to do so.
* Beware of hidden thresholds :sob:
::::spoiler How _does_ the treemap change?
* unclassified goes down!
* _many_ more species are found...
* a few of the top species decrease their percent...
* ...most don't.
Remind me to address the question of why a few species decrease when we talk about sourmash gather in session 4 :)
::::
### Undigression: when is a genome present?
As I was saying, every tool has to make a decision about what to discard or ignore.
Let's take a look at a particular kind of plot: a presence/abundance plot.

To generate this plot, run:
```
sourmash scripts presence_filter SRR11125891.gather.csv \
-o SRR11125891.presence.png \
--green-color Entero --red-color Clostr \
-N 10
```
This is a scatterplot of the gather file, looking at how much of each genome _uniquely_ intersects this metagenome (X axis), against the estimated abundance of that genome in the metagenome (Y axis). Note that the Y axis is log10 scale, so it starts at 1 and goes up to about 100.
The green dots are genomes with "Entero" in the name, and the red dots are genomes with "Clostr" in the name. That's not important, really, it's just to make the plot look pretty.
## A fun exercise: choose your decision parameters!!
You can adjust the following parameters:
* `-N num` sets the minimum threshold in basepairs divided by 10,000, so `-N 10` is saying, discard matches smaller than 100kb in size.
* `--ani` switches the X axis to use average nucleotide identity instead of fraction of genome.
* `--min-ani` sets your lowest ANI threshold for display a genome (b/t 0.7 and 1)
* `--min-fraction` sets your lowest detection threshold for fraction of genomes (b/t 0 and 1)
You can also feed in the threshold=0 gather file from above - what happens? :stuck_out_tongue_winking_eye:
Your goal is to play with these parameters and see if you get a plot that you "like".
:::success
If you want, this is a good time to play with hackmd! Create a new hackmd at hackmd.io and copy paste the following into it:
````
```
sourmash scripts presence_filter SRR11125891.gather.csv \
-o SRR11125891.presence.png \
--green-color Entero --red-color Clostr \
-N 10
```
````
Then, generate a few different command lines, copy/paste them back into your Terminal, and drag/drop or upload the resulting images into your hackmd. Voila! Reproducibility!
:::
### The exercise...
Try out a few different parameters, develop some intuition on your own. (5 minutes)
Find a friendly neighbor or two, chat about your charts. Come up with questions. (5 minutes)
Let's discuss the questions as a group!
Adjust your charts, pick one that you like (2 minutes).
Tell me about your favorite set of parameters! (or post in slack - let's create a new channel, '#session-chatter'!)
### Some points to make
* playing with parameters for viz is basically "free", right? It's "just" time.
* of the options, only one does something to a parameter that is not directly represented on the figure. Which one?
* why do you like the plot that you like?
Titus will show you his favorite version of this plot upon request, but he wants to stress that you should not in any circumstance believe his arguments.
## Concluding thoughts
Metagenome sequencing and in particular _depth_ is all about sampling a weighted mixture, aka throwing darts at a large, complex dart board.
Sequencing more deeply will increase your sensitivity but also dramatically oversequence highly abundant stuff!
Long reads will generally not show you new things, especially since with Illumina you get many _more_ darts. But you may be more confident in what you do see, and it may be more useful for downstream analyses.
It's very hard to see viruses and rare things without enriching. Christina will talk about this!
Metagenomes contain a lot of stuff and it's hard to figure out what's really "there", although generally speaking it's a lot of stuff.
Always ask people how their software decides what's there!
## Annotated bibliography
YACHT: an ANI-based statistical test to detect microbial presence/absence in a metagenomic sample, Koslicki et al, 2024. [link](https://academic.oup.com/bioinformatics/article/40/2/btae047/7588873). This is one of the first papers to really tackle this question.
Comparative performance of reference-based metagenomic tools to identify species-level taxa among families of bacteria: benchmarking Mycobacteriaceae and Neisseriaceae. Harrison et al., 2025. [link](https://www.biorxiv.org/content/10.1101/2025.06.23.661092v1) - Via Jim Shaw, author of sylph; see [his comment on bluesky](https://bsky.app/profile/jimshaw.bsky.social/post/3lt3n6u7d3k2c).
Biogeographic distribution of five Antarctic cyanobacteria using large-scale k-mer searching with sourmash branchwater, Lumian et al, 2024. [link](https://www.frontiersin.org/journals/microbiology/articles/10.3389/fmicb.2024.1328083/full). A paper of "mine" from a great collaboration. Raises a lot of questions :stuck_out_tongue_winking_eye: We'll talk a bit more about this later, too.