Gina Vazquez
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights New
    • Engagement control
    • Make a copy
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Note Insights Versions and GitHub Sync Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Make a copy Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       Owned this note    Owned this note      
    Published Linked with GitHub
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    # From raw metagenome reads to phyloseq taxonomy table using sourmash gather and sourmash taxonomy (STAMPS 2024) Updated by CTB 7/24/24 for STAMPS: uses `fastgather` now. ## Introduction to `gather` and sourmash `taxonomy` Sourmash `gather` is a tool that will tell you the minimum set of genomes in a database needed to “cover” all of the sequences in your sample of interest. Sourmash `gather` is _really_ accurate. Below I’ve included two plots from a recent [preprint](https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2.abstract) that details how sourmash gather works. They show that sourmash `gather` has the highest completeness and purity at the species level, while maintaining a low L1 norm error. Completeness is the ratio of taxa correctly identified in the ground truth (higher is better), purity is the ratio of correctly predicted taxa over all predicted taxa (higher is better), and L1 norm error is the ratio true positives to false positives (lower is better). ![image](https://hackmd.io/_uploads/SkkkjE6dC.png) ![image](https://hackmd.io/_uploads/B1aKBra_R.png) Sourmash `gather` works best for metagenomes from environments that have been sequenced before or from which many genomes have been isolated and sequenced. Because k-mers are so specific, a genome needs to be in a database in order for sourmash `gather` to find a match. Sourmash `gather` won’t find much above species (k = 31) or genus (k = 21) similarity, so if most of the organisms in a sample are new, sourmash won’t be able to label them. [Sourmash `taxonomy`](https://bluegenes.github.io/sourmash-tax/) makes the sourmash `gather` output more interpretable. Previously, we only output the statistics about the genomes in a database that were found in a metagenome. While this information is useful, taxonomic labels make this type of information infinitely _more_ useful. The `taxonomy` commands were added into sourmash in August 2021, but we haven’t done a good job advertising them or writing tutorials about how to use them and how to integrate with other visualization and analysis tools. This blog post seeks to close that gap a bit by demonstrating how to go from raw metagenome reads to a phyloseq taxonomy table using `sourmash gather` and `sourmash taxonomy` to make the actual taxonomic assignments. ## Tutorial Software In the RStudio terminal: ```bash conda create -y -n smash_to_phylo sourmash sourmash_plugin_branchwater conda activate smash_to_phylo mkdir ~/sourmash_tutorial cd ~/sourmash_tutorial ``` ### Tutorial Data We’ll use six samples from the iHMP IBD cohort. These are short shotgun metagenomes from stool microbiomes. While this is a longitudinal study, these are all time point 1 samples taken from different individuals. All individuals were symptomatic at time point 1, but three individuals were diagnosed with Crohn’s disease (cd) at the end of the year, while three individuals were not (nonIBD). | run_accession | sample_id | group | | ------------- | --------- | ----- | | SRR5936131 | PSM7J199 | cd | | SRR5947006 |PSM7J1BJ | cd | | SRR5935765 | HSM6XRQB | cd | | SRR5936197 | MSM9VZMM | nonibd | | SRR5946923 | MSM9VZL5 | nonibd | |SRR5946920| HSM6XRQS| nonibd If you want to start with `.fastq` files start by downloading the files from the [European Nucleotide Archive](https://www.ebi.ac.uk/ena/browser/home). Since these are paired-end files, we’ll end up running 12 download commands. *Read below before downloading* ```[bash] # Terminal # SRR5936131 wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR593/001/SRR5936131/SRR5936131_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR593/001/SRR5936131/SRR5936131_2.fastq.gz # SRR5947006 wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR594/006/SRR5947006/SRR5947006_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR594/006/SRR5947006/SRR5947006_2.fastq.gz # SRR5935765 wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR593/005/SRR5935765/SRR5935765_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR593/005/SRR5935765/SRR5935765_2.fastq.gz # SRR5936197 wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR593/007/SRR5936197/SRR5936197_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR593/007/SRR5936197/SRR5936197_2.fastq.gz # SRR5946923 wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR594/003/SRR5946923/SRR5946923_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR594/003/SRR5946923/SRR5946923_2.fastq.gz # SRR5946920 wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR594/000/SRR5946920/SRR5946920_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR594/000/SRR5946920/SRR5946920_2.fastq.gz ``` **Suggested:** If short on time :point_left: :::info ****To download or not to download**** :stopwatch: To make this tutorial realistic, I’ve started with real FASTQ files from real metagenomes. If you have a slow internet connection, they may take a long time to download. If you would prefer to start this tutorial from the sourmash signatures, you can download those (small) files using the commands below in lieu of downloading the data and creating them yourself. Note that the download step and the sketch step are the steps that take the longest, so starting from the signatures will save a substantial (hours?) amount of time. ```[bash] # Terminal wget -O sigs.tar.gz https://osf.io/download/9sfpt/ gunzip sigs.tar.gz tar -xvf sigs.tar ``` ::: ## Using sourmash `gather` to determine the composition of a metagenome Using the FASTQ files, we’ll first generate FracMinHash sketches using `sourmash sketch`. These are the files that `sourmash scripts fastgather` will ingest. **Note: You do not have to do this if you downloaded the sketches** *Skip this step if you downloaded the `.sig` files.* ```[bash] # Terminal for infile in *_1.fastq.gz do bn=$(basename ${infile} _1.fastq.gz) sourmash sketch dna -p k=21,k=31,k=51,scaled=1000,abund --merge ${bn} -o ${bn}.sig ${infile} ${bn}_2.fastq.gz done ``` We also need a sourmash database. For this tutorial, we’ll use the GTDB rs207 representatives database. It’s relatively small (~2GB) so it will be faster to download and run, however there are many more [prepared databases available](https://sourmash.readthedocs.io/en/latest/databases.html), including the full GTDB rs214. The full database will generally match more sequences in the sample than the representatives database, so it would likely produce more complete results for most metagenomes. ```[bash] # Terminal #cd sigs/ #wget -O gtdb-rs207.genomic-reps.dna.k31.zip https://osf.io/3a6gn/download #@CTB: on Jetstream instances, instead link in: ln -s /opt/shared/sourmash-data/gtdb-rs214-k31.zip . ``` ::: info ****Choosing a Database**** :thinking_face: The minimum database I would recommend using is the full GTDB rs214 (~400k genomes). The GTDB databases will work well for well-studied environments like the human microbiomes. If you’re working with a sample from a more complex (ocean, soil) or understudied environment, you will probably see more matches when comparing your sample against all of GenBank. Additionally, GTDB only focuses on bacteria and archaea; if you’re interested in viruses, fungi, or protozoa, these genomes are only in the GenBank database. ::: Now, we’ll run [sourmash `fastgather`](https://github.com/sourmash-bio/sourmash_plugin_branchwater/blob/main/doc/README.md) to figure out the minimum set of genomes in our database that cover all of the known k-mers in our metagenomes. ```[bash] # Terminal for infile in *sig do bn=$(basename $infile .sig) sourmash scripts fastgather ${infile} gtdb-rs214-k31.zip -o ${bn}_gather_gtdb_rs214.csv done ``` ## Determining the taxonomic composition of a sample using sourmash `taxonomy` Once we have our gather results, we have to translate them into a taxonomy using `sourmash taxonomy`. The genomes by default don’t have a taxonomic lineage, so without this information we wouldn’t be able to relate one genome to another very easily. First, we have to download and format the taxonomy spreadsheet: ```[bash] # Terminal #wget -O gtdb-rs207.taxonomy.csv.gz https://osf.io/v3zmg/download #gunzip gtdb-rs207.taxonomy.csv.gz #sourmash tax prepare -t gtdb-rs207.taxonomy.csv -o gtdb-rs207.taxonomy.sqldb -F sql # @CTB on STAMPS jetstream instead link it in: ln -fs /opt/shared/sourmash-data/gtdb-rs214.lineages.sqldb . ``` We will use `sourmash taxonomy annotate` to add the taxonomic lineage to each gather match. This command adds a new column into the gather output csv file, and by default writes a new file containing all of the information. ```[bash] # Terminal for infile in *_gather_gtdb_rs214.csv do sourmash tax annotate -g ${infile} -t gtdb-rs214.lineages.sqldb done ``` ## Importing the gather results into a phyloseq object [Phyloseq](https://joey711.github.io/phyloseq/) (and its successors like [speedyseq](https://github.com/mikemc/speedyseq)) organize microbial ecology data into R data structures. While these packages offer a lot of functionality for analyzing microbial abundance information, other packages also use phyloseq data structures even if they don’t use phyloseq functions. One particularly handy function is `tax_glom()` which allows users to agglomerate counts up levels of taxonomy, for example summarizing all counts at the phylum level. The sourmash `gather` -> `taxonomy` framework has all of the information we need to take advantage of these types of functionality, if only we can get all the data into the right format to be ingested by phyloseq! We outline the commands needed below. These command are all executed in R. We’ll use the tidyverse for the majority of data wrangling before creating the actual phyloseq object. In the Rstudio **console**: *(make sure your working directory is sourmash_tutorial/)* ```[r] # R Console library(tidyverse) library(phyloseq) # create a metadata table ------------------------------------------------- # create a metadata that has the SRR run accession numbers and other phenotypic information about the samples # usually, this might be in a csv or tsv file that might be read in with read_csv(). # here, we'll make it from the table earlier in the workflow. run_accessions <- c("SRR5936131", "SRR5947006", "SRR5935765", "SRR5936197", "SRR5946923", "SRR5946920") sample_ids <- c("PSM7J199", "PSM7J1BJ", "HSM6XRQB", "MSM9VZMM", "MSM9VZL5", "HSM6XRQS") groups <- c("cd", "cd ", "cd", "nonibd", "nonibd", "nonibd") metadata <- data.frame(run_accessions = run_accessions, sample_ids = sample_ids, groups = groups) %>% column_to_rownames("run_accessions") # read in sourmash gather & taxonomy results ------------------------------ # read in the sourmash taxonomy results from all samples into a single data frame sourmash_taxonomy_results <- Sys.glob("*gather_gtdb_rs214.with-lineages.csv") %>% map_dfr(read_csv, col_types = "ddddddddcccddddcccdc") %>% mutate(match_name = gsub(" .*", "", match_name)) # We need two tables: a tax table and an "otu" table. # The tax table will hold the taxonomic lineages of each of our gather matches. # To make this, we'll make a table with two columns: the genome match and the lineage of the genome. # The "otu" table will have the counts of each genome in each sample. # We'll call this our gather_table. tax_table <- sourmash_taxonomy_results %>% select(match_name, lineage) %>% distinct() %>% separate(lineage, into = c("domain", "phylum", "class", "order", "family", "genus", "species"), sep = ";") %>% column_to_rownames("match_name") gather_table <- sourmash_taxonomy_results %>% mutate(n_unique_kmers = (unique_intersect_bp / scaled) * average_abund) %>% # calculate the number of uniquely matched k-mers and multiply it by average k-mer abundance select(query_name, match_name, n_unique_kmers) %>% # select only the columns that have information we need pivot_wider(id_cols = match_name, names_from = query_name, values_from = n_unique_kmers) %>% # transform to wide format replace(is.na(.), 0) %>% # replace all NAs with 0 column_to_rownames("match_name") # move the genome name to a rowname # create a phyloseq object ------------------------------------------------ physeq <- phyloseq(otu_table(gather_table, taxa_are_rows = T), tax_table(as.matrix(tax_table)), sample_data(metadata)) ``` # From raw metagenome reads to taxonomic differential abundance with sourmash and corncob Now we demonstrates how to do differential abundance analysis with the same data using corncob. Corncob works really well for microbial count data because it accounts for sparsity (lots of unobserved taxonomic lineages in samples) and variable sequencing depth. ## Building the phyloseq object We've already built a phyloseq object from the sourmash taxonomy output. Corncob doesn’t require a phyloseq object to run, but it does play well with phyloseq objects so we'll use that as a jumping point. ## Differential abundance testing with corncob Once we have the `phyloseq` object physeq, running corncob is fairly straightforward. In this case, our two groups are stool microbiomes from individuals with Crohn’s disease (cd) and individuals without Crohn’s disease (nonIBD). Let’s start by testing for differential abundance for a specific genome that `sourmash gather` identified in our samples. We’ll test a genome that was identified in all of our samples. Using the below code block, we can take a look at 6 genomes that were identified in all samples. ```[r] # R Console sourmash_taxonomy_results %>% group_by(match_name) %>% tally() %>% arrange(desc(n)) %>% head() ``` ``` # output match_name n <chr> <int> 1 GCA_001915545.1 6 2 GCA_003112755.1 6 3 GCA_003573975.1 6 4 GCA_003584705.1 6 5 GCA_900557355.1 6 6 GCA_900755095.1 6 ``` Randomly pulling out one of these genomes, let’s take a look at its taxonomic lineage: ```[r] # R console sourmash_taxonomy_results %>% filter(match_name == "GCA_003112755.1") %>% select(lineage) %>% distinct() ``` This genome is assigned to `s__Lawsonibacter asaccharolyticus` in the GTDB taxonomy. ``` # output lineage <chr> 1 d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Oscillospiraceae;g__Lawsonibacter;s__Lawsonibacter asaccharolyticus ``` ## Testing differential abundance for a single genome Corncob builds a model for each genome/ASV/taxonomic lineage separately, so we can start by building a model for this genome specifically: ```[r] # R Console library(corncob) install.packages("janitor") # if necessary, comment out after installing library(janitor) corncob <- bbdml(formula = GCA_003112755.1 ~ groups, phi.formula = ~ 1, allow_noninteger = TRUE, data = physeq) corncob ``` We can plot the results: ```[r] # R Console plot(corncob, color = groups) ``` While our results show that this genome is significantly differentially abundant in CD when compared to nonIBD, this result doesn’t seem super biologically meaningful; visually, the mean abundances of the genome aren’t that different between groups. The [corncob vignette](https://cran.r-project.org/web/packages/corncob/vignettes/corncob-intro.pdf) includes a lot more information about creating and interpreting models, especially for more complex experimental designs. ## Testing differential abundance at different levels of taxonomy Because corncob integrates well with phyloseq, we can agglomerate to different levels of taxonomy and test for differential abundance systematically. Let’s test for differential abundance for all genuses observed in our data set. ```[r] # R Console # Aggregate the data at the genus and family level physeq_genus <- physeq %>% tax_glom("genus") corncob_genus <- differentialTest(formula = ~ groups, phi.formula = ~ 1, formula_null = ~ 1, phi.formula_null = ~ 1, data = physeq_genus, allow_noninteger = T, test = "Wald", B = 100, fdr_cutoff = 0.1) ``` Then we can plot the results, showing only taxa that were significantly different at an FDR cutoff of 0.1 ```[r] # R Console plot(corncob_genus) ``` ![image](https://hackmd.io/_uploads/rJq_IK6O0.png) By default, the `bbdml()` function does not do multiple testing correction because it only performs one test. If you choose to do multiple tests, be sure to include p value correction. In contrast, `differentialTest()` does calculate the false discovery rate for all of the tests performed for that run. Using `corncob_genus$significant_models`, we can see the results of corncob. Below, we’ll parse the output into a dataframe for easier downstream use of the corncob results. First, we’ll write a function that takes the `differentialTest()` results object as input and parses it into a dataframe of taxa-labelled coefficients. ```[R] # R console glance_significant_models <- function(corncob_results){ df <- data.frame() for(i in 1:length(corncob_results$significant_taxa)){ taxa_name <- corncob_results$significant_taxa[i] taxa_df <- corncob_results$significant_models[[i]]$coefficients %>% as.data.frame() %>% rownames_to_column("covariate") %>% mutate(taxa_name = taxa_name) df <- bind_rows(df, taxa_df) } df <- clean_names(df) %>% select(covariate, estimate, std_error, t_value, fdr = pr_t, taxa_name) return(df) } ``` Then we’ll use this function to parse the results and filter to coefficients with a false discovery rate < 0.1. ```[r] # R Console glance_significant_models(corncob_genus) %>% filter(fdr < 0.1) %>% filter(!grepl("Intercept", covariate)) ``` |covariate | estimate| std_error|t_value | fdr | taxa_name| | :-- | --: | --: | --: | --: | :-- | | mu.groupsnonibd | -2.6052725 | 0.6667681 | -3.907314 | 0.05969535 | GCF_002234575.2 | | mu.groupscd | -1.4867950 | 0.4398837 | -3.379973 | 0.07749496 | GCF_000012825.1 | | mu.groupscd | -0.8150979 | 0.2502086 | -3.257673 | 0.08270706 | GCF_009696065.1 | | mu.groupsnonibd | -2.2554143 | 0.2852505 | -7.906784 | 0.01562175 | GCF_009696065.1 | | mu.groupscd | 3.5106793 | 1.1003016 | 3.190652 | 0.08577909 | GCA_003478935.1 | | mu.groupscd | 3.4957077 | 1.1778541 | 2.967861 | 0.09725123 | GCF_015557635.1 | | mu.groupscd | 3.0423122 | 0.9160323 | 3.321185 | 0.07993957 | GCF_900537995.1 | | mu.groupscd | 3.5415046 | 1.1134170 | 3.180753 | 0.08624662 | GCF_003478505.1 | | mu.groupsnonibd | 3.2390747 | 1.0717325 | 3.022279 | 0.09425566 | GCF_003478505.1 | | mu.groupscd | 3.9035629 | 1.2645871 | 3.086828 | 0.09087019 | GCF_015560765.1 | Phyloseq has an …odd behavior of assigning a genome/OTU name instead of the lineage to aggregated taxa. The last parsing step we’ll perform is re-deriving the genus-level lineage for these results using the `physeq_genus` taxonomy table. ```{r} # R console physeq_genus_tax_table <- tax_table(physeq_genus) %>% as.data.frame() %>% rownames_to_column("taxa_name") glance_significant_models(corncob_genus) %>% filter(fdr < 0.1) %>% filter(!grepl("Intercept", covariate)) %>% left_join(physeq_genus_tax_table) ``` Instead of printing this dataframe to the console, we can write it to a file to use later, for example for visualization. ```[r] # R Console glance_significant_models(corncob_genus) %>% filter(fdr < 0.1) %>% filter(!grepl("Intercept", covariate)) %>% left_join(physeq_genus_tax_table) %>% write_tsv("corncob_results_genus.tsv") ``` | covariate | estimate|std_error | t_value | fdr | taxa_name| | | :-- | --: | --: | --: | --: | :-- | --- | | mu.groupsnonibd | -2.6052725 | 0.6667681 | -3.907314 | 0.05969535 | GCF_002234575.2 | | | mu.groupscd | -1.4867950 | 0.4398837 | -3.379973 | 0.07749496 | GCF_000012825.1 | | | mu.groupscd | -0.8150979 | 0.2502086 | -3.257673 | 0.08270706 | GCF_009696065.1 | | | mu.groupsnonibd | -2.2554143 | 0.2852505 | -7.906784 | 0.01562175 | GCF_009696065.1 | | | mu.groupscd | 3.5106793 | 1.1003016 | 3.190652 | 0.08577909 | GCA_003478935.1 | | | mu.groupscd | 3.4957077 | 1.1778541 | 2.967861 | 0.09725123 | GCF_015557635.1 | | | mu.groupscd | 3.0423122 | 0.9160323 | 3.321185 | 0.07993957 | GCF_900537995.1 | | | mu.groupscd | 3.5415046 | 1.1134170 | 3.180753 | 0.08624662 | GCF_003478505.1 | | | mu.groupsnonibd | 3.2390747 | 1.0717325 | 3.022279 | 0.09425566 | GCF_003478505.1 | | | mu.groupscd | 3.9035629 | 1.2645871 | 3.086828 | 0.09087019 | GCF_015560765.1 | # Visualizing the taxonomic composition of metagenomes using sourmash and metacoder Here, we’ll use the tidyverse and metacoder R packages to visualize the taxonomic composition of a metagenome produced from sourmash `gather` and sourmash `taxonomy`. [Metacoder](https://grunwaldlab.github.io/metacoder_documentation/) is an R package for visualizing heat trees from taxonomic data. First we’ll visualize the taxonomy of all of our samples, then we’ll color this visualization by taxa identified within a specific group, and lastly we’ll [incorporate differential abundance results from the previous tutorial](https://taylorreiter.github.io/2022-08-29-From-raw-metagenome-reads-to-taxonomic-differential-abundance-with-sourmash-and-corncob) into the visualization. ## Importing the sourmash gather and sourmash taxonomy results into a metacoder object We’ll start by loading the necessary R packages, creating a metadata dataframe, and reading the sourmash taxonomy results into a dataframe. ```[r] # R console library(tidyverse) #@CTB you'll need to do this once, on the Jetstream instances: # install.packages("metacoder") library(metacoder) # create a metadata that has the SRR run accession numbers and other phenotypic information about the samples # usually, this might be in a csv or tsv file that might be read in with read_csv(). # here, we'll make it from the table earlier in the workflow. run_accessions <- c("SRR5936131", "SRR5947006", "SRR5935765", "SRR5936197", "SRR5946923", "SRR5946920") sample_ids <- c("PSM7J199", "PSM7J1BJ", "HSM6XRQB", "MSM9VZMM", "MSM9VZL5", "HSM6XRQS") groups <- c("cd", "cd ", "cd", "nonibd", "nonibd", "nonibd") metadata <- data.frame(run_accessions = run_accessions, sample_ids = sample_ids, groups = groups) # read in the sourmash taxonomy results from all samples into a single data frame sourmash_taxonomy_results <- Sys.glob("*gather_gtdb_rs214.with-lineages.csv") %>% map_dfr(read_csv, col_types = "ddddddddcccddddcccdc") %>% # read in all of the taxonomy results mutate(match_name = gsub(" .*", "", match_name)) %>% # simplify the genome name to only include the accession mutate(n_unique_kmers = unique_intersect_bp / scaled) # calculate the number of uniquely matched k-mers ``` We’ll join the sourmash taxonomy results to the metadata we have to add information about our sample groups. ``` # R Console sourmash_taxonomy_results <- sourmash_taxonomy_results %>% left_join(metadata, by = c("query_name" = "run_accessions")) ``` Then, we’ll parse the sourmash `taxonomy` results into a metacoder object. To start with, this object contains the input data we supplied it (`query_name`, `groups`, `name` (genome accession), `n_unique_weighted_found`, and `lineage`) and parsed taxonomic data derived from each observed lineage. ```[r] # R Console #parse the taxonomy results into a metacoder object obj <- parse_tax_data(sourmash_taxonomy_results %>% select(query_name, groups, match_name, n_unique_weighted_found, lineage), class_cols = "lineage", # the column that contains taxonomic information class_sep = ";", # The character used to separate taxa in the classification class_regex = "^(.+)__(.+)$", # Regex identifying where the data for each taxon is class_key = c(tax_rank = "info", # A key describing each regex capture group tax_name = "taxon_name")) ``` We can then calculate the number of abundance-weighted k-mers observed at each level of taxonomy by using the `calc_taxon_abund()`. ```[r] # R Console #set number of unique k-mers assigned to a genome as the abundance information obj$data$tax_abund <- calc_taxon_abund(obj, "tax_data", cols = c("n_unique_weighted_found")) ``` ## Visualizing the taxonomic composition of the metagenome samples ### Taxonomy of all samples Using the metacoder object, we can now visualize our abundance and taxonomy data using the `heat_tree()` function. Giving `heat_tree()` the full object, we’ll first visualize the taxonomic breakdown of all of our samples combined. ```[r] #R Console # generate a heat_tree plot with all taxa from all samples set.seed(1) # This makes the plot appear the same each time it is run heat_tree(obj, node_label = taxon_names, node_size = n_obs, node_size_axis_label = "k-mer abundance", node_color_axis_label = "samples with genomes", #output_file = "metacoder_all.png", # uncomment to save file upon running layout = "davidson-harel", # The primary layout algorithm initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations ``` ![image](https://hackmd.io/_uploads/Hk8AAY6d0.png) # Conclusion [Metacoder](https://grunwaldlab.github.io/metacoder_documentation/) offers many visualization options, including coloring by many groups. See the documentation and [example](https://grunwaldlab.github.io/metacoder_documentation/example.html) for more. ### Addendum: a `manysketch` CSV file - ``` name,read1,read2 SRR5936131,SRR5936131_1.fastq.gz,SRR5936131_2.fastq.gz SRR5947006,SRR5947006_1.fastq.gz,SRR5947006_2.fastq.gz SRR5935765,SRR5935765_1.fastq.gz,SRR5935765_2.fastq.gz SRR5936197,SRR5936197_1.fastq.gz,SRR5936197_2.fastq.gz SRR5946923,SRR5946923_1.fastq.gz,SRR5946923_2.fastq.gz SRR5946920,SRR5946920_1.fastq.gz,SRR5946920_2.fastq.gz ``` **Adapted from:** Reiter,Taylor.(2022, August 29). [*Adventures in Seq Land*.](https://taylorreiter.github.io/2022-07-28-From-raw-metagenome-reads-to-phyloseq-taxonomy-table-using-sourmash-gather-and-sourmash-taxonomy/)

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully