Annotate the eukaryotic and bacterial taxonomy of a metagenome using sourmash

[TOC] # Annotate the eukaryotic and bacterial taxonomy of a metagenome using sourmash C. Titus Brown Aug 5, 2025 DRAFT tutorial; updates &c will happen here: [sourmash#3765](https://github.com/sourmash-bio/sourmash/issues/3765). ## Description This tutorial demonstrates how to use sourmash to search a metagenome signature against a combined database of GTDB+NCBI eukaryote genomes. This addresses a challenge in metagenomics where people may want to use the GTDB database of genomes and/or the GTDB taxonomy, but also want to discover eukaryotic organisms that are not in GTDB. At the end, this tutorial provides several visualizations of taxonomic composition as well. ## Compute requirements You'll need about 70 GB of disk space to run this tutorial, including the software install. The memory and CPU requirements are minimal (1 GB RAM, ~2 minutes on one CPU). ## Install sourmash, taxburst, and the betterplot plugin for sourmash Install [sourmash](https://sourmash.readthedocs.io/), [taxburst](https://taxburst.github.io/taxburst/), and [sourmash betterplot](https://github.com/sourmash-bio/sourmash_plugin_betterplot/). Create a new conda environment with sourmash installed: ``` conda create -n smash-euk-tutorial -y 'sourmash >= 4.9.3' python=3.12 ``` Activate, and then install some pure Python packages: ``` conda activate smash-euk-tutorial pip install sourmash_plugin_betterplot taxburst ``` ## Download databases We need two of the [sourmash indexed RocksDB databases](https://sourmash.readthedocs.io/en/latest/databases.html) - the GTDB RS226 entire database, and the NCBI Eukaryotic dindexed database. Download the indexed GTDB RS226 database (Bacteria and Archaea). This requires 35 GB of disk space. ``` curl -L https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-rs226-k51.dna.rocksdb.tar.gz | tar xzf - ``` Download and unpack the Jan 2025 NCBI eukaryotes database. This requires 23 GB of disk space. ``` curl -L https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/ncbi-euks-2025.01/ncbi-euks-all-2025.01-k51.dna.rocksdb.tar.gz | tar xzf - ``` You should now have two directories, both ending in ``.rocksdb`: ``` ls -FC ``` >gtdb-rs226-k51.dna.rocksdb/ >ncbi-euks-all-2025.01-k51.dna.rocksdb/ ## Download a metagenome to analyze Grab a metagenome sig from our repository of all public SRA data sets: ``` curl -L https://wort.sourmash.bio/v1/view/sra/SRR11125891 -o SRR11125891.sig ``` You can replace `SRR11125891` with any public SRA accession you're interested in, [or sketch your own](https://sourmash.readthedocs.io/en/latest/sourmash-sketch.html). ## Run `sourmash gather` to select overlapping genomes Now run sourmash gather to search the downloaded metagenome against both databases: ``` sourmash gather SRR11125891.sig \ gtdb-rs226-k51.dna.rocksdb \ ncbi-euks-all-2025.01-k51.dna.rocksdb \ -o SRR11125891.x.gtdb+euks.csv \ -k 51 --scaled 10_000 ``` This will take ~1-2 min and require 1 GB of RAM. The output of this command will be a CSV file that contains a [minimum metagenome cover](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2) listing a short list of known genomes that are present. ## Annotate the sourmash gather results with GTDB and NCBI eukaryote taxonomies. Next we will generate a taxonomic summary based on the matching genomes. For this we will need to combine the GTDB taxonomy (for bacteria and archaea) with the NCBI eukaryotic taxonomy. Start by downloading gtdb-rs226.lineages.csv and ncbi-eukaryotes.2025.01.lineages.csv, again, from the [prepared databases page](https://sourmash.readthedocs.io/en/latest/databases.html): ``` curl -JLO https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-rs226.lineages.csv curl -JLO https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/ncbi-euks-2025.01/ncbi-eukaryotes.2025.01.lineages.csv ``` Then, use `sourmash tax` to annotate the gather output with taxonomy: ``` sourmash tax annotate -g SRR11125891.x.gtdb+euks.csv \ -t gtdb-rs226.lineages.csv \ ncbi-eukaryotes.2025.01.lineages.csv ``` Notes: * you could also use the NCBI taxonomy instead of the GTDB taxonomy for the bacterial and archaeal matches, since every GTDB genome is also in NCBI. * if you plan to run `sourmash tax` multiple times, you should use `sourmash tax prepare` to combine the CSV files into a SQLite database, which will be much faster! ## Create a taxburst taxonomy of the results [taxburst](https://taxburst.github.io/) (a fork of [Krona](https://github.com/marbl/Krona)) is a nice way to explore taxonomic results. Note the presence of some putative host contamination in this pig metagenome - the eukaryote 'Sus scrofa' is pig :). Run: ``` taxburst -F tax_annotate SRR11125891.x.gtdb+euks.with-lineages.csv \ -o SRR11125891.x.gtdb+euks.taxburst.html ``` <iframe src="https://farm.cse.ucdavis.edu/~ctbrown/2025-euk-tax-tutorial-files/SRR11125891.x.gtdb+euks.taxburst.html" width="1200" height="800" frameborder="1" allowfullscreen></iframe> (If anyone has better ideas for the iframe embedding, please let me know!) ## Generate a Sankey flow diagram of the results Sankey flow diagrams are nice ways to view hierarchical decompositions! This is a nice interactive one, produced by [the betterplot plugin for sourmash](https://github.com/sourmash-bio/sourmash_plugin_betterplot/). ``` sourmash scripts sankey \ --annotate-csv SRR11125891.x.gtdb+euks.with-lineages.csv \ -o SRR11125891.x.gtdb+euks.sankey.html ``` <iframe src="https://farm.cse.ucdavis.edu/~ctbrown/2025-euk-tax-tutorial-files/SRR11125891.x.gtdb+euks.sankey.html" width="1200" height="800" frameborder="1" allowfullscreen></iframe> ## Generate a treemap of the results Last but not least, here is a proportional representation of species detected in the metagenome. This is a 'treemap', which does a nice job of showing the content weighting. First we need a new format for the taxonomic summary: ``` sourmash tax metagenome -F csv_summary \ -o SRR11125891.x.gtdb+euks \ -g SRR11125891.x.gtdb+euks.csv \ -t gtdb-rs226.lineages.csv \ ncbi-eukaryotes.2025.01.lineages.csv ``` and then we run treemap (again from [the betterplot plugin for sourmash](https://github.com/sourmash-bio/sourmash_plugin_betterplot/): ``` sourmash scripts treemap \ SRR11125891.x.gtdb+euks.summarized.csv \ -o SRR11125891.x.gtdb+euks.treemap.png \ -r species ``` The file `SRR11125891.x.gtdb+euks.treemap.png` will contain the following image: ![a treemap of the metagenome content at species level](https://farm.cse.ucdavis.edu/~ctbrown/2025-euk-tax-tutorial-files/SRR11125891.x.gtdb+euks.treemap.png) ## Additional notes and discussion points ### Why are we using k=51? K=51 is a very stringent match, and we are certainly missing some bacterial content here. However, we've found that lower k-mer sizes introduce too many false positives with large eukaryotic genomes. We're working on it! ### Can we do a simultaneous search for viruses? We do have viral databases [for sourmash](https://sourmash.readthedocs.io/en/latest/databases.html), but the search parameters are incompatible with k=51/scaled=10_000. We're working on it :). See [sourmash-bio/sourmash#1340](https://github.com/sourmash-bio/sourmash/issues/1340).