STAMPS 2023 GToTree tutorial

---

<a href="https://github.com/AstrobioMike/AstrobioMike.github.io/raw/master/images/GToTree-logo-1200px.png"><img src="https://github.com/AstrobioMike/AstrobioMike.github.io/raw/master/images/GToTree-logo-1200px.png"></a>

---

> [GToTree](https://github.com/AstrobioMike/GToTree/wiki) is a user-friendly workflow for phylogenomics. This page is an example of using GToTree to make a phylogenomic tree incorporating newly recovered **M**etagenome-**A**ssembled **G**enomes (MAGs) and references from the stellar [**G**enome **T**axonomy **D**ata**B**ase (GTDB)](https://gtdb.ecogenomic.org/).

<br>

---

**Contents**

[toc]

---

> **NOTE**
> This page assumes some baseline familiarity with the Unix-like command line. You can find a [Crash Course here](https://astrobiomike.github.io/unix/unix-intro) if wanted 🙂

## Accessing our Unix-like environment

You can get to your Unix-like environment by accessing your cloud instance (links on [this page](https://hackmd.io/@ctb/HynqHkUqh) during the course) and getting to a Terminal via either Jupyter Lab or RStudio.

## Environment creation

Here we are going to install GToTree on our instances with [conda](https://astrobiomike.github.io/unix/conda-intro)/[mamba](https://astrobiomike.github.io/unix/conda-intro#bonus-mamba-no-5):

```bash
mamba create -y -n gtotree -c astrobiomike \
             -c conda-forge -c bioconda -c defaults \
             gtotree
```

> This was v1.8.1 at the time this page was initially put together.

## Environment activation

```bash
conda activate gtotree
```

And creating a working directory and changing into it:

```bash
mkdir -p ~/gtotree
cd ~/gtotree/
```

And our prompt should look something like this: `(gtotree) stamps@<IP>:~/gtotree$`

---

## Alteromonas example

The example here focuses on the *Alteromonas* genus, where we pretend we have 9 newly recovered *Alteromonas* MAGs and we want to integrate them with some *Alteromonas* reference genomes to get an overview of where our new MAGs fit within the estimated evolutionary history of the known *Alteromonas* genus. We are just focusing on this genus to keep live run times down under ~5 minutes, but the process would be the same if we wanted to show where our new MAGs fit in across an entire Domain or all Domains.

---

### Conceptually how we may have gotten to this point

To get to our hypothetical situation here, we may have:

- utilized metagenomes from several different locations, assembled them, and recovered bins (groups of contigs we think are representative of a super-closely related microbial population)
- estimated their quality (e.g., with [CheckM2](https://github.com/chklovski/CheckM2)) and "promoted" the high-quality ones to what we'd call MAGs
- taxonomically classified our MAGs (preferably with the [GTDB-tk](https://github.com/Ecogenomics/GTDBTk) program if working with bacteria or archaea)

We would usually have a bunch of different MAGs from that process, but for this example we are just focusing on those that were classified as being part of the *Alteromonas* genus.

---

### What we need to provide to GToTree for this example

- **input genomes** - here we're going to do that with:
    - **1)** a file holding the accessions of all the reference genomes we want to include
    - **2)** a file holding the location of all our MAG fasta files
- **which SCGs (single-copy genes) to target**

There is a lot more GToTree can do, detailed at the [wiki](https://github.com/AstrobioMike/GToTree/wiki).
But this page is an example of the baseline/most common usage.

### Downloading our MAGs

This code block will download and unpack the fasta files of our MAGs (which would be the typical format we'd have these in at this point; also, this would all work the same if these were isolate genomes instead of MAGs, or a mix of the two):

```bash
curl -L -o Alteromonas-MAGs.tar https://figshare.com/ndownloader/files/36444270
tar -xf Alteromonas-MAGs.tar
```

```bash
ls *.gz
```

We can use that to make a file that we will give to GToTree that holds their locations (remember that just the file name is actually the "[relative path](https://astrobiomike.github.io/unix/getting-started#absolute-vs-relative-path)" (address) of where these files are located):

```bash
ls *.gz > our-fasta-files.txt
head our-fasta-files.txt
```

---

### Getting *Alteromonas* reference genomes to include in our tree

To add reference genomes to our tree alongside our new MAGs, we could download them all ourselves (as fasta or GenBank files for instance), or we can just give GToTree a file holding the NCBI/GTDB accessions for the genomes we want :+1:

To get our list of *Alteromonas* reference accessions, we are going to search the [GTDB](https://gtdb.ecogenomic.org/) based on taxonomy.

#### Using "representative" genomes

When wanting to span a certain breadth of diversity, we often don't need to have many versions of highly closely related genomes included. Using "representative" genomes can help with this. Representative genomes are a slimmed down set of manually and computationally selected genomes that are designed to capture the breadth of microbial diversity in genomic lineage space. It is common to use either [NCBI's representative genomes](https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#representative_genomes) or [GTDB's species representatives](https://gtdb.ecogenomic.org/faq#how-are-gtdb-species-clusters-formed).
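
To make the idea of a "representative" set a bit more concrete, here is a minimal sketch on a made-up table. The accessions, species names, and file names are invented for illustration, and the `gtdb_representative` column with `t`/`f` values is an assumption about the table layout, not real GTDB output. It just shows how keeping one flagged genome per species shrinks the list:

```bash
# Toy example only: all accessions and species below are made up
toy_dir=$(mktemp -d)

# Build a small tab-separated table: accession, species, representative flag
printf 'accession\tspecies\tgtdb_representative\n'  > "${toy_dir}/toy-metadata.tsv"
printf 'GENOME_A1\tAlteromonas sp. A\tt\n'         >> "${toy_dir}/toy-metadata.tsv"
printf 'GENOME_A2\tAlteromonas sp. A\tf\n'         >> "${toy_dir}/toy-metadata.tsv"
printf 'GENOME_A3\tAlteromonas sp. A\tf\n'         >> "${toy_dir}/toy-metadata.tsv"
printf 'GENOME_B1\tAlteromonas sp. B\tt\n'         >> "${toy_dir}/toy-metadata.tsv"
printf 'GENOME_B2\tAlteromonas sp. B\tf\n'         >> "${toy_dir}/toy-metadata.tsv"

# Skip the header (NR > 1) and keep only accessions flagged as representative
awk -F '\t' 'NR > 1 && $3 == "t" { print $1 }' \
    "${toy_dir}/toy-metadata.tsv" > "${toy_dir}/toy-rep-accs.txt"

# 5 genomes in, 2 representatives out (one per species)
cat "${toy_dir}/toy-rep-accs.txt"
```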
In the commands below, we add the flag `--GTDB-representatives-only` to focus on just the GTDB species representatives.

First we can see how many *Alteromonas* exist in GTDB (the `--get-taxon-counts` flag tells the program just to return this information). This is typically how I would start so I have some idea of how many things I'll be working with.

> This will need to download some GTDB info the first time it's run, but should only take ~1 minute.

```bash
gtt-get-accessions-from-GTDB -t Alteromonas --GTDB-representatives-only --get-taxon-counts

# Downloading and parsing archaeal and bacterial metadata tables from
# GTDB (only needs to be done once)...

# Using GTDB v214.1: Released Jun 9th, 2023

# The rank 'genus' has 256 Alteromonas entries.

# In considering only GTDB representative genomes:

# The rank 'genus' has 51 Alteromonas representative genome entries.
```

So there are 256 in there total, and 51 when considering only [GTDB representatives](https://gtdb.ecogenomic.org/faq#how-are-gtdb-species-clusters-formed).

We can create files with the information for these reference genomes by removing the `--get-taxon-counts` flag from the above command:

```bash
gtt-get-accessions-from-GTDB -t Alteromonas --GTDB-representatives-only
```

It tells us it wrote out information to two files:

```
# The targeted NCBI accessions were written to:
#     GTDB-Alteromonas-genus-GTDB-rep-accs.txt

# A subset GTDB table of these targets was written to:
#     GTDB-Alteromonas-genus-GTDB-rep-metadata.tsv
```

The tsv file holds all the information from GTDB for these 51 *Alteromonas* reference genomes, and the txt file just holds the accessions we need to give to GToTree:

```bash
head GTDB-Alteromonas-genus-GTDB-rep-accs.txt
```

---

### GToTree

And we're ready to kick off GToTree! On our setup this will take about 5 minutes the first time it's run.
So let's start it and then look at a code breakdown while it's running:

```bash
GToTree -a GTDB-Alteromonas-genus-GTDB-rep-accs.txt \
        -f our-fasta-files.txt \
        -H Bacteria -D -L Species -j 16 \
        -o alteromonas-gtotree-output
```

> **CODE BREAKDOWN**
> - `GToTree` - this is our base command
> - `-a` - where we provide our file holding the GTDB/NCBI accessions we want to include
> - `-f` - where we provide our file holding the locations of our input fasta MAGs/genomes
> - `-H` - here we are specifying the SCG-set we want to use (we could see all pre-built available ones by running `gtt-hmms`)
> - `-D` - specifies to adjust our tree labels to hold lineage information (from GTDB) rather than just keeping the input accession ID (making it easier to explore the tree later)
> - `-L` - lets us specify what lineage info we want on the tree labels (here just including "species" because all of our input genomes are of the same genus; though note that species names include genus designations ¯\\\_(ツ)_/¯)
> - `-j` - where we specify how many cpus to use concurrently on steps that can be run in parallel
> - `-o` - here is where we specify the output directory

Here's a high-level view of what GToTree is doing for us:

- it will download any reference genome files needed for those we provided as accessions to the `-a` argument
- it will go through each individual input genome and identify any of our target single-copy genes
- it does some filtering

That should take 4-5 minutes (mostly due to needing to download some NCBI data the first time it's run) :+1:

---

### Downloading our tree file

We are going to download our tree file to our local computer so we can visualize it.
There are a few ways we can do this; the images below are an example of using the GUI file-browser in the Jupyter Lab environment:

At the top-left, we can open our browser if it's not visible by clicking the folder icon:

<a href="https://i.imgur.com/vYBkMLy.png"><img src="https://i.imgur.com/vYBkMLy.png" width="50%"></a>

Then we can go into the "alteromonas-gtotree-output" directory and download the "alteromonas-gtotree-output.tre" file:

<a href="https://i.imgur.com/2krqCMh.png"><img src="https://i.imgur.com/2krqCMh.png" width="50%" align="center"></a>

In your regular file-browser GUI, try to find where that downloaded on your computer (it is most likely in your "Downloads" folder), and put up a yellow stickie if you're having trouble finding it.

You can also just download it by clicking this link if wanted: [alteromonas-gtotree-output.tre file](https://figshare.com/ndownloader/files/41673465).

---

### Visualizing our tree

We're going to use the [interactive Tree of Life](https://itol.embl.de/upload.cgi) site to visualize our tree. So go to [this upload page](https://itol.embl.de/upload.cgi) in your browser:

![](https://i.imgur.com/2dMRhRH.png)

And drag and drop the "alteromonas-gtotree-output.tre" file from your file-browser onto the iToL upload page. This should autoload to show us the starting tree:

<a href="https://hackmd.io/_uploads/ByjhDbncn.png"><img src="https://hackmd.io/_uploads/ByjhDbncn.png"></a>

> This is an "unrooted" tree, which means it is not suggestive of anything in terms of direction of evolution (e.g., we wouldn't be able to make any interpretations like "this diverged *before* that"). But unrooted trees still tell us about relatedness (e.g., "this is most closely related to this"), based on what we've done to generate this view.

It can help organization/visualization of unrooted trees to "mid-point root".
We can do that by clicking on "Advanced" in the control panel at the top-right:

<a href="https://i.imgur.com/LZe2L0m.png"><img src="https://i.imgur.com/LZe2L0m.png" width="50%"></a>

Then clicking on "Midpoint root" at the bottom:

<a href="https://i.imgur.com/3TZi27I.png"><img src="https://i.imgur.com/3TZi27I.png" width="50%"></a>

Which will re-orient our tree based on the furthest points in it:

<a href="https://hackmd.io/_uploads/rJGuYbnc3.png"><img src="https://hackmd.io/_uploads/rJGuYbnc3.png"></a>

Back in our terminal, we can also make a quick iToL file to color our MAG labels. First we need a file that holds just their names, which we can quickly make from the "our-fasta-files.txt" file we used. Here is a way we can use `cut` to do that:

```bash
cut -f 1 -d "." our-fasta-files.txt > MAG-labels.txt
head MAG-labels.txt
```

And we can format this for iToL with the following helper program:

```bash
gtt-gen-itol-map -w labels -o iToL-label-colors.txt -g MAG-labels.txt
head iToL-label-colors.txt
```

Download that from our instance similar to how we did above (you may need to navigate the browser up a level), or you can download it by clicking here: [iToL-label-colors.txt file](https://figshare.com/ndownloader/files/41673687).

Again find that file in your local computer file-browser, and drag and drop that file on top of our tree, and it will color all our MAG genome labels for us:

<a href="https://hackmd.io/_uploads/rk0CcW39n.png"><img src="https://hackmd.io/_uploads/rk0CcW39n.png"></a>

This gives us an overview of which reference genomes our MAGs are most closely related to based on this method and SCGs we used. For instance, our MAG-1 here is most closely related to that *Alteromonas mediterranea* we included:

![](https://hackmd.io/_uploads/rysOjW293.png)

### Citations output file!

GToTree relies on many wonderful other programs. With each run, GToTree produces a "citations.txt" file with citation information specific for that run.
Please be sure to cite the developers appropriately :)

Here is the output citations file from this run:

```bash
cat alteromonas-gtotree-output/citations.txt | sed 's/^/# /'

# GToTree v1.8.1
# Lee MD. GToTree: a user-friendly workflow for phylogenomics. Bioinformatics. 2019; (March):1-3. doi.org/10.1093/bioinformatics/btz188
#
# HMMER3 v3.3.2
# Eddy SR. Accelerated profile HMM searches. PLoS Comput. Biol. 2011; (7)10. doi.org/10.1371/journal.pcbi.1002195
#
# Muscle 5.1.linux64
# Edgar RC. MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping. bioRxiv. 2021.06.20.449169. doi.org/10.1101/2021.06.20.449169
#
# TrimAl v1.4.rev15
# Gutierrez SC. et al. TrimAl: a Tool for automatic alignment trimming. Bioinformatics. 2009; 25, 1972-1973. doi.org/10.1093/bioinformatics/btp348
#
# Prodigal v2.6.3
# Hyatt, D. et al. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics. 2010; 28, 2223-2230. doi.org/10.1186/1471-2105-11-119
#
# Genome Taxonomy Database (GTDB) ;
# Parks DH et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat. Biotech. 2020. doi.org/10.1038/s41587-020-0501-8
#
# FastTree 2 v2.1.11
# Price MN et al. FastTree 2 - approximately maximum-likelihood trees for large alignments. PLoS One. 2010; 5. doi.org/10.1371/journal.pone.0009490
#
# GNU Parallel v20230522
# Tange O. GNU Parallel 2018. doi.org/10.5281/zenodo.1146014
#
```

And an example of how this could be written up as part of the methods:

:::::info
**Example methods with citations from above**

The *Alteromonas* phylogenomic tree was produced with GToTree v1.8.1 (Lee 2019), using the prepackaged single-copy gene-set for bacteria (74 target genes). Briefly, prodigal v2.6.3 (Hyatt et al. 2010) was used to predict genes in input genomes provided as fasta files. Reference genomes were accessed from the Genome Taxonomy Database v214.1 (Parks et al. 2020) based on searching for *Alteromonas* species representatives.
Target genes were identified with HMMER3 v3.3.2 (Eddy 2011), individually aligned with muscle v5.1 (Edgar 2021), trimmed with trimal v1.4.rev15 (Capella-Gutiérrez et al. 2009), and concatenated prior to phylogenetic estimation with FastTree2 v2.1.11 (Price et al. 2010). GNU Parallel (Tange 2018) was used where possible to run stages in parallel.
:::::

After getting a tree into iToL like we have above, it is common to use iToL to do a lot of coloring/annotation, and then if this is for a publication, we might want to further beautify the tree in a figure-editing program such as [Affinity Designer](https://affinity.serif.com/en-us/designer/) (one-time license fee) or [Inkscape](https://inkscape.org/) (free), as has been done with the examples below.

---

## Fin

If working with MAGs from exploratory metagenomics, it is often helpful to make an overview of where the newly recovered MAGs fit in across an entire Domain, e.g., like this one where the blue branches and marks on the outer ring indicate newly recovered MAGs:

<a href="https://i.imgur.com/xFn3bCX.jpg"><img src="https://i.imgur.com/xFn3bCX.jpg"></a>

It can also be helpful to zoom in on clades with newly recovered MAGs to show the distribution of certain traits, like this one, which shows which members of the Order Rubrobacterales have a pathway for carbon-fixation:

<a href="https://i.imgur.com/fjCxXVp.png"><img src="https://i.imgur.com/fjCxXVp.png"></a>

A search for specific functions in each genome can also be performed by GToTree if we specify the target functions with [KO](https://www.genome.jp/kegg/ko.html) or [Pfam](https://pfam.xfam.org/) IDs provided in a file (like the [example here](https://github.com/AstrobioMike/GToTree/wiki/example-usage#visualization-of-gene-presenceabsence-across-the-bacterial-domain)).
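
As a rough sketch of what such a targets file might look like, the block below writes two KO IDs to a file and checks that each line looks like a KO identifier (a `K` followed by five digits). The file name, the one-ID-per-line layout, and the two IDs picked here (which I believe are RuBisCO large subunit and phosphoribulokinase, chosen only because they relate to carbon fixation) are assumptions for illustration. The GToTree wiki is the place to confirm the expected file format and the option used to pass it in:

```bash
# Hypothetical targets file: name, layout, and IDs are illustrative assumptions
ko_dir=$(mktemp -d)

# One KO ID per line
printf 'K01601\nK00855\n' > "${ko_dir}/target-KOs.txt"

# Count lines that do NOT look like a KO ID (K + 5 digits);
# grep exits non-zero when nothing matches, so '|| true' keeps the script going
bad_ids=$(grep -c -v -E '^K[0-9]{5}$' "${ko_dir}/target-KOs.txt" || true)

echo "malformed KO IDs found: ${bad_ids}"
```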

---

If new to this arena, and there are a million moving parts and terms while we're trying to find our footing in it all, it can be confusing to know when something like GToTree (or any program) might be useful. To try to help with that, and discuss some other nuances and caveats of GToTree, there is a ["Things to consider" page here](https://github.com/AstrobioMike/GToTree/wiki/Things-to-consider). Please check that out if interested, and definitely don't hesitate to reach out to see if it seems like this might be useful for your work, or if there is something else more suitable that we might be able to point you towards 🙂

---