2024 ICPPB Sourmash LIN demo

2024 ICPPB Sourmash LIN demo === # Analyzing Metagenome Composition using the LIN taxonomic framework Tessa Pierce Ward July 2024 requires sourmash v4.8+ All materials for this workshop: https://github.com/vinatzer-lab/ICPPB2024_workshop --- This tutorial uses the `sourmash taxonomy` module, which was introduced via [blog post](https://bluegenes.github.io/sourmash-tax/) and was recently shown to perfom well for taxonomic profiling of long (and short) reads in [Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets](https://link.springer.com/article/10.1186/s12859-022-05103-0), Portik et al., 2022. In this tutorial, we'll use sourmash gather to analyze metagenomes using the [LIN taxonomic framework](https://dl.acm.org/doi/pdf/10.1145/3535508.3545546). Specifically, we will analyze plant metagenomes for the presence of _Ralstonia solanacearum_. The goal is to see if we can correctly assign the sequence in each file to the correct phylogenetic group, distinguishing between pathogenic and non-pathogenic strains. :::info **Simulated Samples: Ralstonia + tomato host:** - `Sample0` - no Ralstonia - `Sample-II` - Ralstonia solanacearum, PhylIIB - `SampleIV` - Ralstonia solanacearum, Phyl-IV **Additonal Sample (nanopore):** - `barcode16` - ?? ## Install sourmash First, we need to install the software! We'll use conda/mamba to do this. The below command installs [sourmash](http://sourmash.readthedocs.io/). Install the software: ``` # create a new environment mamba create -n sourmash -y -c conda-forge -c bioconda sourmash ``` then activate the conda environment: ``` conda activate sourmash ``` > Victory conditions: your prompt should start with > `(sourmash) ` > and you should now be able to run `sourmash` and have it output usage information!! ## Create a working subdirectory Make a directory named `smash_lin`, change into it: ``` mkdir -p ~/smash_lin cd ~/smash_lin ``` Now make a couple useful folders: ``` mkdir -p inputs mkdir -p databases ``` ## Download relevant data ### First, download a database and taxonomic information Here, we know we are looking specifically for Ralstonia, so we can download a database containing signatures of 32 Ralstonia genomes (pathogenic and not) and the corresponding taxonomic and lingroup information. ``` # database curl -JLO https://osf.io/download/wxtk3/ mv ralstonia*.zip ./databases/ralstonia.zip # taxonomy csv curl -JLO https://osf.io/download/sj2z7/ mv ralstonia32.lin-taxonomy.csv ./databases # lingroup csv curl -JLO https://osf.io/download/nqms2/ mv LINgroups.csv ./databases/ralstonia.lingroups.csv ls databases # look at the database files ``` > For cases where you do not yet know what organisms you may encounter, we provide pre-prepared databases for GTDB and GenBank [here](https://sourmash.readthedocs.io/en/latest/databases.html) ### Next, download pre-made sourmash signatures made from the input metagenomes ``` # download Sample-0 signature curl -JLO https://osf.io/download/dvyt9/ # download Sample-II signature curl -JLO https://osf.io/download/agwdu/ # download Sample-IV signature curl -JLO https://osf.io/download/rngjq/ # move downloaded signatures to ./inputs mv Sample*.zip ./inputs # look at available input files ls inputs ``` ## Look at the signatures Let's start with the `Sample-II` sample ### First, let's look at the metagenome signature. By running `sourmash sig fileinfo`, we can see information on the signatures available within the zip file. Here, you can see I've generated the metagenome signature with `scaled=100` and built three ksizes, `k=21`, k=31` and `k=51` Run: ``` sourmash sig fileinfo ./inputs/Sample-II.sc1000.zip ``` In the output, you should see: ``` ** loading from './inputs/Sample-II.sc1000.zip' path filetype: ZipFileLinearIndex location: /Users/ntward/dib-lab/sandbox/smash_lin/inputs/Sample-II.sc1000.zip is database? yes has manifest? yes num signatures: 3 ** examining manifest... total hashes: 105825 summary of sketches: 1 sketches with DNA, k=21, scaled=1000, abund 33335 total hashes 1 sketches with DNA, k=31, scaled=1000, abund 35516 total hashes 1 sketches with DNA, k=51, scaled=1000, abund 36974 total hashes ``` ### We can also look at the database Here, you can see I've generated the database with `scaled=1000` and built three ksizes, `k=21`, `k=31` and `k=51` Run: ``` sourmash sig fileinfo ./databases/ralstonia.zip ``` In the output, you should see: ``` ** loading from './databases/ralstonia.zip' path filetype: ZipFileLinearIndex location: /Users/ntward/dib-lab/sandbox/smash_lin/databases/ralstonia.zip is database? yes has manifest? yes num signatures: 96 ** examining manifest... total hashes: 524340 summary of sketches: 32 sketches with DNA, k=21, scaled=1000 174967 total hashes 32 sketches with DNA, k=51, scaled=1000 174975 total hashes 32 sketches with DNA, k=31, scaled=1000 174398 total hashes ``` There's a lot of things to digest in this output but the two main ones are: * there are 32 genomes represented in this database, each of which are sketched at k=21,k=31,k=51 * this database represents ~524 *million* k-mers (multiply number of hashes by the scaled number) ## Run sourmash gather using ksize 31 Now let's run `sourmash gather` to find the closest reference genome(s) in the database. If you want to read more about what sourmash is doing, please see [Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2), Irber et al., 2022. Run: ``` sourmash gather inputs/Sample-II.sc1000.zip \ databases/ralstonia.zip -k 31 \ --output Sample-II.k31.gather.csv ``` You should see the following output: ``` selecting specified query k=31 loaded query: Sample-II... (k=31, DNA) -- loaded 96 total signatures from 1 locations. after selecting signatures compatible with search, 32 remain. Starting prefetch sweep across databases. Prefetch found 31 signatures with overlap >= 50.0 kbp. Doing gather to generate minimum metagenome cover. overlap p_query p_match avg_abund --------- ------- ------- --------- 1.3 Mbp 3.9% 26.6% 1.2 GCF_001373295.1 Ralstonia solanacearum RS2 found less than 50.0 kbp in common. => exiting found 1 matches total; the recovered matches hit 3.9% of the abundance-weighted query. the recovered matches hit 3.6% of the query k-mers (unweighted). ``` The first step of gather ("prefetch") found all potential matches with at least 50kb matching sequence (31 of 32 total database genomes). Then, the greedy algorithm narrowed this to a single best match, ` GCF_001373295.1` which shared an estimated 1.3 Mbp with the metagenome (~3.9% of the total query dataset). We can visualize this by looking at a venn diagram of the shared k-mers between metagenome sample and the top match. The yellow intersection represend <4% of the metagenome and ~26.6% of the Ralstonia RS2 reference genome. This small match percentage is expected, though, since the dataset is a simulated plant metagenome with an in silico Ralstonia spike-in, and we are just searching for `Ralstonia` here. ![sampleii.venn](https://hackmd.io/_uploads/SyssO_QPC.png) ## Add taxonomic information and summarize up lingroups `sourmash gather` finds the smallest set of reference genomes that contains all the known information (k-mers) in the metagenome. In most cases, `gather` will find many metagenome matches. Here, we're only looking for `Ralstonia` matches and we only have a single gather result. Regardless, let's use `sourmash tax metagenome` to add taxonomic information and see if we've correctly assigned the pathogenic sequence. ### First, let's look at the relevant taxonomy files. These commands will show the first few lines of each file. If you prefer, you can look at a more human-friendly view by opening the files in a spreadsheet program. - **taxonomy_csv:** `databases/ralstonia32.lin-taxonomy.csv` - the essential columns are `lin` (`14;1;0;...`) and `accession` (`GCF_00`...) - **lingroups information:** `databases/ralstonia.lingroups.csv` - both columns are essential (`name`, `lin`) Look at the taxonomy file: ``` head -n 5 databases/ralstonia32.lin-taxonomy.csv ``` You should see: ``` lin,species,strain,filename,accession 14;1;0;0;0;0;0;0;0;0;6;0;1;0;1;0;0;0;0;0,Ralstonia solanacearum,OE1_1,GCF_001879565.1_ASM187956v1_genomic.fna,GCF_001879565.1 14;1;0;0;0;0;0;0;0;0;6;0;1;0;0;0;0;0;0;0,Ralstonia solanacearum,PSS1308,GCF_001870805.1_ASM187080v1_genomic.fna,GCF_001870805.1 14;1;0;0;0;0;0;0;0;0;2;1;0;0;0;0;0;0;0;0,Ralstonia solanacearum,FJAT_1458,GCF_001887535.1_ASM188753v1_genomic.fna,GCF_001887535.1 14;1;0;0;0;0;0;0;0;0;2;0;0;4;4;0;0;0;0;0,Ralstonia solanacearum,Pe_13,GCF_012062595.1_ASM1206259v1_genomic.fna,GCF_012062595.1 ``` > The key columns are: > - `accession`, containing identifiers matching the database sketches > - `lin`, containing the LIN taxonomic information. Now, let's look at the lingroups file ``` head -n5 databases/ralstonia.lingroups.csv ``` You should see: ``` name,lin A_Total_reads;B_PhylI,14;1;0;0;0;0;0;0;0;0 A_Total_reads;B_PhylI;C_seq14,14;1;0;0;0;0;0;0;0;0;3 A_Total_reads;B_PhylI;C_seq15,14;1;0;0;0;0;0;0;0;0;2 A_Total_reads;B_PhylI;C_seq34,14;1;0;0;0;0;0;0;0;0;6 ``` > Here, we have two columns: > - `name` - the name for each lingroup. > - `lin` - the LIN prefix corresponding to each group. ### Now, run `sourmash tax metagenome` to integrate taxonomic information into `gather` results Using the `gather` output we generated above, we can integrate taxonomic information and summarize up "ranks" (lin positions). We can produce several different types of outputs, including a `lingroup` report. `lingroup` format summarizes the taxonomic information at each `lingroup`, and produces a report with 4 columns: - `name` (from lingroups file) - `lin` (from lingroups file) - `percent_containment` - total % of the file matched to this lingroup - `num_bp_contained` - estimated number of bp matched to this lingroup > Since sourmash assigns all k-mers to individual genomes, no reads/base pairs are "assigned" to higher taxonomic ranks or lingroups (as with Kraken-style LCA). Here, "percent_containment" and "num_bp_contained" is calculated by summarizing the assignments made to all genomes in a lingroup. This is akin to the "contained" information in Kraken-style reports. Run `tax metagenome`: ``` sourmash tax metagenome -g Sample-II.k31.gather.csv \ -t databases/ralstonia32.lin-taxonomy.csv \ --lins --lingroup databases/ralstonia.lingroups.csv ``` You should see: ``` Trying to read LIN taxonomy assignments. loaded 1 gather results from 'Sample-II.k31.gather.csv'. loaded results for 1 queries from 1 gather CSVs Read 20 lingroup rows and found 20 distinct lingroup prefixes. ``` and the results: ``` A_Total_reads;B_PhylII 14;1;0;0;0;3;0 3.94 1464000 A_Total_reads;B_PhylII;C_IIB 14;1;0;0;0;3;0;0 3.94 1464000 A_Total_reads;B_PhylII;C_IIB;D_seq1&seq2 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0 3.94 1464000 A_Total_reads;B_PhylII;C_IIB;D_seq1&seq2;E_seq1 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 3.94 1464000 ``` :::info Here, the most specific lingroup we assign to is `A_Total_reads;B_PhylII;C_IIB;D_seq1&seq2;E_seq1;`, which means this is in **phylotype IIB, sequevar 1**. This is the USDA select agent! ::: #### Now output the lingroup report to a file (instead of to the terminal) use `-o` to provide an output basename for taxonomic output. ``` gather_csv_output="Sample-II.k31.gather.csv" taxonomy_csv="databases/ralstonia32.lin-taxonomy.csv" lingroups_csv="databases/ralstonia.lingroups.csv" sourmash tax metagenome -g Sample-II.k31.gather.csv -t databases/ralstonia32.lin-taxonomy.csv \ --lins --lingroup databases/ralstonia.lingroups.csv\ -o "Sample-II" ``` > You should see `saving 'lingroup' output to 'Sample-II.lingroup.tsv'` in the output. #### Optionally, write multiple output formats You can use `-F` to specify additional output formats. Here, I've added `csv_summary`. Note that while the `lingroup` format will be generated automatically if you specify the `--lingroup` file, you can also specify it with `-F lingroup` if you want, as I've done here. Run: ``` gather_csv_output="Sample-II.k31.gather.csv" taxonomy_csv="databases/ralstonia32.lin-taxonomy.csv" lingroups_csv="databases/ralstonia.lingroups.csv" sourmash tax metagenome -g $gather_csv_output -t $taxonomy_csv \ --lins --lingroup $lingroups_csv \ -F lingroup csv_summary -o "Sample-II" ``` You should see the following in the output: ``` saving 'csv_summary' output to 'Sample-II.summarized.csv'. saving 'lingroup' output to 'Sample-II.lingroup.txt'. ``` The `csv_summary` format is the **full** summary of this sample, e.g. the summary at each taxonomic rank (LIN position). It also includes an entry with the `unclassified` portion at each rank. > Note: Multiple output formats require the `-o` `--output-base` to be specified, as each must be written to a file. Here's an abbreviated version of the `gather` results for `Sample-II`, with lingroup information added: | | **ksize** | **scaled** | **best overlap** | **gather match(es)** | **lingroup** | **lin** | | ------- | --------- | ---------- | ---------------- | -------------------- | ------------ | ---------------------------------- | | **Sample-II** | 31 | 1000 | 1.3 Mbp | GCF_001373295.1 | A_Total_reads;B_PhylII;C_IIB;D_seq1&seq2;E_seq1 | 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 | :::warning **Interlude** - These samples were generated via read simulation from ralstonia genomes, and it looks like this one was created from the RS2 genome we are matching here. Let's try excluding this specific genome to see if we still find the same results without an exact database match. This should be more realistic. We have a gather command-line option for just this situation, `--exclude-db-pattern`: gather: ``` query="inputs/Sample-II.sc1000.zip" database="databases/ralstonia.zip" gather_csv_output="Sample-II.k31.gather.noRS2.csv" sourmash gather $query $database -k 31 -o $gather_csv_output --exclude-db-pattern "RS2" ``` gather results: ``` Starting prefetch sweep across databases. Prefetch found 30 signatures with overlap >= 50.0 kbp. Doing gather to generate minimum metagenome cover. overlap p_query p_match avg_abund --------- ------- ------- --------- 1.2 Mbp 3.9% 24.0% 1.2 GCF_002251655.1 Ralstonia solanacearum UW551 found less than 50.0 kbp in common. => exiting found 1 matches total; the recovered matches hit 3.9% of the abundance-weighted query. the recovered matches hit 3.5% of the query k-mers (unweighted). ``` tax: ``` gather_csv_output="Sample-II.k31.gather.noRS2.csv" taxonomy_csv="databases/ralstonia32.lin-taxonomy.csv" lingroups_csv="databases/ralstonia.lingroups.csv" sourmash tax metagenome -g $gather_csv_output -t $taxonomy_csv \ --lins --lingroup $lingroups_csv ``` tax results: ``` name lin percent_containment num_bp_contained A_Total_reads;B_PhylII 14;1;0;0;0;3;0 3.91 1451000 A_Total_reads;B_PhylII;C_IIB 14;1;0;0;0;3;0;0 3.91 1451000 A_Total_reads;B_PhylII;C_IIB;D_seq1&seq2 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0 3.91 1451000 A_Total_reads;B_PhylII;C_IIB;D_seq1&seq2;E_seq1 14;1;0;0;0;3;0;0;0;0;1;0;0;0;0;0;0 3.91 1451000 ``` Without the exact genome the reads were generated from, we get a slightly smaller sequence overlap (1.2Mbp instead of 1.3Mbp). However, when we add the LINgroup information, we still find the right group, `A_Total_reads;B_PhylII;C_IIB;D_seq1&seq2;E_seq1`, or Phylotype IIB, sequevar 1 (pathogenic). ::: ## Run the other samples ### Now run with `Sample-0` sample #### sourmash gather Run: ``` sourmash gather inputs/Sample-0.sc1000.zip databases/ralstonia.zip -k 31 -o Sample-0.dna.k31.gather.csv ``` You should see: ``` selecting specified query k=31 loaded query: Sample-0... (k=31, DNA) -- loaded 96 total signatures from 1 locations. after selecting signatures compatible with search, 32 remain. Starting prefetch sweep across databases. Prefetch found 0 signatures with overlap >= 50.0 kbp. Doing gather to generate minimum metagenome cover. found less than 50.0 kbp in common. => exiting No matches found for --threshold-bp at 50.0 kbp. ``` #### gather found no sequence matches! Let's try lowering the detection threshold: ``` # use a 3kb detection threshold sourmash gather inputs/Sample-0.sc100.zip databases/ralstonia.zip -k 31 --threshold-bp 3000 -o Sample-0.k31.gather.csv ``` This time, you should see: ``` selecting specified query k=31 loaded query: Sample-0... (k=31, DNA) -- loaded 96 total signatures from 1 locations. after selecting signatures compatible with search, 32 remain. Starting prefetch sweep across databases. Prefetch found 0 signatures with overlap >= 3.0 kbp. Doing gather to generate minimum metagenome cover. found less than 3.0 kbp in common. => exiting No matches found for --threshold-bp at 3.0 kbp. ``` Which means we *still* didn't find anything! It turns out that Sample-0 is a control sample that does not have any *Ralstonia*. :::info Note: We typically recommend requiring matches to have 3 or more matching k-mers (threshold = `3 * scaled`) ::: Another option we won't pursue here is generating sketches for both the database and the queries at higher resolution (lower scaled, e.g. scaled=100). See more information on scaled and thresholds [here](https://sourmash.readthedocs.io/en/latest/faq.html#what-scaled-values-should-i-use-with-sourmash). ### Run the last sample, Sample-IV ``` sourmash gather inputs/Sample-IV.sc1000.zip databases/ralstonia.zip -k 31 -o Sample-IV.k31.gather.csv ``` we see: ``` Prefetch found 32 signatures with overlap >= 3.0 kbp. Doing gather to generate minimum metagenome cover. overlap p_query p_match avg_abund --------- ------- ------- --------- 1.2 Mbp 3.7% 22.1% 1.2 GCF_003515185.1 Ralstonia solanacearum SL3175 found less than 3.0 kbp in common. => exiting found 1 matches total; the recovered matches hit 3.7% of the abundance-weighted query. the recovered matches hit 3.4% of the query k-mers (unweighted). ``` We can look directly at the k-mer overlap between the SampleIV and this Ralstonia genome: ![sampleiv.venn](https://hackmd.io/_uploads/rk1FLdQDC.png) #### run tax metagenome ``` gather_csv_output="Sample-IV.k31-sc1000.gather.csv" taxonomy_csv="databases/ralstonia32.lin-taxonomy.csv" lingroups_csv="databases/ralstonia.lingroups.csv" sourmash tax metagenome -g Sample-IV.k31-sc1000.gather.csv -t databases/ralstonia32.lin-taxonomy.csv \ --lins --lingroup databases/ralstonia.lingroups.csv\ -F lingroup csv_summary -o "Sample-IV" ``` LINgroup output file: ``` name lin percent_containment num_bp_contained A_Total_reads;B_PhylIV 14;1;0;0;0;2;0;0;0 3.72 1404000 A_Total_reads;B_PhylIV;C_seq10 14;1;0;0;0;2;0;0;0;0;0;0 3.72 1404000 ``` :::info We find that this genome is in **Phylotype IV, seq10** group (non-pathogenic). ::: ## run the infected field sample ("barcode 16"; nanopore): ### Download the signature ``` curl -JLO https://osf.io/download/s2q83/ mv bc16.scaled1000.zip ./inputs ``` ### run gather ``` sourmash gather inputs/bc16.scaled1000.zip databases/ralstonia.zip \ -k 31 --output barcode16.k31.gather.csv ``` gather results: ``` prefetch found 32 signatures with overlap >= 50.0 kbp. Doing gather to generate minimum metagenome cover. overlap p_query p_match avg_abund --------- ------- ------- --------- 5.1 Mbp 17.8% 97.2% 57.5 GCF_002251605.2 Ralstonia solanacearum UW700 2.7 Mbp 0.3% 4.0% 21.2 GCF_000825845.1 Ralstonia solanacearum Grenada9_1 2.2 Mbp 0.0% 2.1% 5.0 GCF_001644815.1 Ralstonia solanacearum CFBP6783 0.8 Mbp 0.0% 1.6% 2.0 GCF_012062595.1 Ralstonia solanacearum Pe_13 4.9 Mbp 0.2% 1.2% 44.8 GCF_002251695.1 Ralstonia solanacearum K60 2.1 Mbp 0.0% 1.0% 1.4 GCF_000212635.3 Ralstonia solanacearum MolK2 found less than 50.0 kbp in common. => exiting found 6 matches total; the recovered matches hit 18.3% of the abundance-weighted query. the recovered matches hit 0.6% of the query k-mers (unweighted). ``` The initial search (prefetch) found that all 32 genomes had shared sequence with our query. The minimum metagenome cover shows 6 genomes with non-overlapping matches. Nearly 18% of the abundance-weighted query matched to the GCF_002251605.2 Ralstonia solanacearum UW700 genome. :::info Just for fun, let's try adding a random tomato genome to the database, to see if we can match the host k-mers: ``` curl -JLO https://osf.io/download/28pjz/ query="inputs/bc16.scaled1000.zip" database1="databases/ralstonia.zip" database2="GCF_000188115.5_SL3.1.zip" gather_csv_output="barcode16.k31.gather.csv" sourmash gather $query $database1 $database2 -k 31 -o $gather_csv_output ``` new results: ``` Starting prefetch sweep across databases. Prefetch found 33 signatures with overlap >= 50.0 kbp. Doing gather to generate minimum metagenome cover. overlap p_query p_match avg_abund --------- ------- ------- --------- 210.1 Mbp 19.8% 33.4% 1.5 GCF_000188115.5 Solanum lycopersicum Heinz 1706 5.1 Mbp 17.8% 97.2% 57.5 GCF_002251605.2 Ralstonia solanacearum UW700 2.7 Mbp 0.3% 4.0% 21.2 GCF_000825845.1 Ralstonia solanacearum Grenada9_1 2.2 Mbp 0.0% 2.1% 5.0 GCF_001644815.1 Ralstonia solanacearum CFBP6783 0.8 Mbp 0.0% 1.6% 2.0 GCF_012062595.1 Ralstonia solanacearum Pe_13 4.9 Mbp 0.2% 1.2% 44.8 GCF_002251695.1 Ralstonia solanacearum K60 2.1 Mbp 0.0% 1.0% 1.4 GCF_000212635.3 Ralstonia solanacearum MolK2 found less than 50.0 kbp in common. => exiting found 7 matches total; the recovered matches hit 38.2% of the abundance-weighted query. the recovered matches hit 21.7% of the query k-mers (unweighted). ``` We can plot the k-mer overlap between this sample and the two top matches: Ralstonia UW700 and the Heinz 1706 tomato genome. Here, we see that while the Ralstonia k-mers are a small portion of the overall file (0.6%; small circle on the left), nearly the entire reference genome is present in the barcode16 sample (97.2%). This tomato genome, in constrast, shares a little less than half its content with the sample. It likely does not reflect the cultivar where bc16 was sampled from. ![bc16.venn](https://hackmd.io/_uploads/H1UMrdXDR.png) > plotted with the sourmash_plugin_venn library https://github.com/sourmash-bio/sourmash_plugin_venn ::: #### run tax metagenome We can run tax metagenome with these results. However, since we don't have LIN or LINgroup information for the tomato genome, the results will only include the Ralstonia matches. Since the tomato genome did not share any k-mers with the Ralstonia genomes, there will be no impact on Ralstonia taxonomic assignment. ``` gather_csv_output="barcode16.k31.gather.csv" taxonomy_csv="databases/ralstonia32.lin-taxonomy.csv" lingroups_csv="databases/ralstonia.lingroups.csv" sourmash tax metagenome -g barcode16.k31.gather.csv \ -t databases/ralstonia32.lin-taxonomy.csv \ --lins --lingroup databases/ralstonia.lingroups.csv \ -F lingroup csv_summary -o "barcode16" ``` LINgroup results: ``` name lin percent_containment num_bp_contained A_Total_reads;B_PhylII 14;1;0;0;0;3;0 18.33 300671000 A_Total_reads;B_PhylII;C_IIC 14;1;0;0;0;3;0;2 18.01 295389000 A_Total_reads;B_PhylII;C_IIA 14;1;0;0;0;3;0;1 0.28 4629000 A_Total_reads;B_PhylII;C_IIB 14;1;0;0;0;3;0;0 0.04 653000 A_Total_reads;B_PhylII;C_IIB;D_seq4 14;1;0;0;0;3;0;0;1;0;0;0;0;0 0.04 579000 A_Total_reads;B_PhylI 14;1;0;0;0;0;0;0;0;0 0.01 182000 A_Total_reads;B_PhylI;C_seq15 14;1;0;0;0;0;0;0;0;0;2 0.01 182000 ``` :::info 18% of the sample matched to A_Total_reads;B_PhylII;C_IIC, so we find that this samples is also in **Phylotype IIC**, which is not the USDA select agent and is not pathogenic. ::: ## Summary and concluding thoughts The LIN taxonomic framework may be useful distinguishing groups below the species level, and we can use LINs and LINgroups with `sourmash tax`. For low level matches, the gather greedy approach can struggle. In cases where there is an identical % match between two reference genomes, the reported match is selected at random. We are working on ways to better warn users about places where this behavior occurs and welcome feedback and suggestions on our [issue tracker](https://github.com/sourmash-bio/sourmash/issues/new). We typically recommend running at `scaled=1000` (our default), as this works for most microbial use cases. However, for smaller samples and databases or for distinguishing between highly related genomes, you may want to run at higher resolution (lower scaled), e.g. scaled=100 or lower. Note, higher resolution signatures are larger and take longer to build and search. See more information on scaled and thresholds [here](https://sourmash.readthedocs.io/en/latest/faq.html#what-scaled-values-should-i-use-with-sourmash). Sourmash taxonomy can also be used with NCBI, GTDB, and ICTV taxonomies. For a walkthough using GTDB and sample-specific assemblies with an environmental metagenome, see [here](https://sourmash.readthedocs.io/en/latest/tutorial-lemonade.html). NOTE: We're in the process of upgrading sourmash commands for multithreading and faster processing. If you get comfortable with these commands and want to process more samples, please check out the **[branchwater plugin](https://github.com/sourmash-bio/sourmash_plugin_branchwater/tree/main/doc) e.g. `fastmultigather` command for faster execution and to run multiple samples at once.** ## Post-workshop challenge: use the full GTDB reference database While you can always create your own reference databases, as we've done here, one of sourmash's strengths is its ability to search large databases. Here is a walkthrough for running the same analysis using the full GTDB reference database (rs214). ``` curl -JLO https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs214/gtdb-rs214-k31.zip mv gtdb-rs214-k31.zip ./databases ``` > Note, this database is 12G, so downloading may take a while. For all available databases, see [sourmash prepared databases](https://sourmash.readthedocs.io/en/latest/databases.html#gtdb-r08-rs214-all-genomes-403k). ``` query="inputs/Sample-IV.sc1000.zip" database="databases/gtdb-rs214-k31.zip" gather_csv_output="Sample-IV.k31-sc1000.gather-gtdb.csv" sourmash gather $query $database -k 31 -o $gather_csv_output ``` output: ``` sourmash gather $query $database -k 31 -o $gather_csv_output == This is sourmash version 4.8.9. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == selecting specified query k=31 loaded query: Sample-IV... (k=31, DNA) -- loaded 402709 total signatures from 1 locations. after selecting signatures compatible with search, 402709 remain. Starting prefetch sweep across databases. Prefetch found 350 signatures with overlap >= 50.0 kbp. Doing gather to generate minimum metagenome cover. overlap p_query p_match avg_abund --------- ------- ------- --------- 1.2 Mbp 3.7% 22.1% 1.2 GCF_003515185.1 Ralstonia solanacearum strain=SL3175, ASM351518v1 found less than 50.0 kbp in common. => exiting found 1 matches total; the recovered matches hit 3.7% of the abundance-weighted query. the recovered matches hit 3.4% of the query k-mers (unweighted). ``` Despite searching a much larger database which includes 526 Ralstonia genomes (350 of which had overlap >= 50.0 kbp), the gather results show that this sample is still most similar to `GCF_003515185.1`, which is in the Phylotype-IV LIN group.