GToTree is a user-friendly workflow for phylogenomics. This page is an example of using GToTree to make a phylogenomic tree incorporating newly recovered Metagenome-Assembled Genomes (MAGs) and references from the stellar Genome Taxonomy DataBase (GTDB).
This is already done on our instances, but if we wanted to install GToTree with conda/mamba in a new location, we would do so like this:
And creating a working directory and changing into it:
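As a rough sketch of those two steps: the install line is left commented out because it needs conda/mamba and network access, and its channel list is an assumption from memory that should be checked against the GToTree installation notes. The directory name comes from the prompt shown just below.

```shell
# Installation (commented out: needs conda/mamba and network access;
# the channel list is an assumption, check the GToTree install notes)
#   mamba create -n gtotree -c conda-forge -c bioconda -c astrobiomike gtotree
#   conda activate gtotree

# Create the working directory (if it doesn't exist yet) and move into it
mkdir -p ~/gtotree
cd ~/gtotree
```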
And our prompt should look something like this:
(gtotree) stamps2022@XXX.XXX.XXX.XXX:~/gtotree$
The example here focuses on the Alteromonas genus, where we pretend we have 9 newly recovered Alteromonas MAGs and we want to integrate them with some Alteromonas reference genomes to get an overview of where our new MAGs fit within the estimated evolutionary history of the known Alteromonas genus.
We are just focusing on this genus to keep live run times down under ~5 minutes, but the process would be the same if we wanted to show where our new MAGs fit in across an entire Domain.
To get to our hypothetical situation here, we may have utilized metagenomes from several different locations, assembled them, recovered bins (groups of contigs we think are representative of a super-closely related microbial population), estimated their quality (e.g., with CheckM2), "promoted" the high-quality ones to what we'd call MAGs, and then taxonomically classified our MAGs (preferably with the stellar GTDB-Tk program).
We would usually have a bunch of different MAGs from that process, but for now we are just focusing on those that were classified as being part of the Alteromonas genus.
This code block will download and unpack the fasta files of our MAGs (which would be the typical format we'd have these in at this point):
We can use those to make a file that we will give to GToTree that holds their locations (remember that a bare file name is itself a "relative path", i.e., the address of where that file is located relative to where we are):
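One way this could look, assuming our MAG files end with ".fa" (the extension is a guess here, so adjust the pattern to match the real files). The first line only creates empty placeholder files if no matching files are present, so the sketch can run on its own; with the real MAGs in place it does nothing.

```shell
# Stand-in only: if no MAG fasta files are here yet, make empty placeholders so the sketch runs
ls *.fa > /dev/null 2>&1 || touch MAG-1.fa MAG-2.fa MAG-3.fa

# Write one file name (relative path) per line
ls *.fa > our-fasta-files.txt

# Quick check of what was written
cat our-fasta-files.txt
```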
To add reference genomes to our tree alongside our new MAGs, we could download them all ourselves (as fasta or GenBank files for instance), or we can just give GToTree a file holding the NCBI/GTDB accessions for the genomes we want :+1:
To get our list of Alteromonas reference accessions, we are going to search the GTDB based on taxonomy.
When wanting to span some specific breadth of diversity, we don't need to have many versions of highly closely related genomes. Using "representative" genomes can help with this. Representative genomes are a slimmed down set of manually and computationally selected genomes that are designed to capture the breadth of microbial diversity in genomic lineage space. I regularly make use of both NCBI's representative genomes and GTDB's species representatives. In the program we are using below, we are adding the flag `--GTDB-representatives-only` to focus on just them.
First we can see how many Alteromonas exist in GTDB (the `--get-taxon-counts` flag tells the program just to return this information). This is typically how I would start so I have some idea of how many things I'll be working with.
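A sketch of what those counting calls might look like, assuming the search helper is `gtt-get-accessions-from-GTDB` and that it takes the target taxon with `-t` (both recalled from GToTree's helper programs rather than stated on this page, so check its help menu). The check around the calls just keeps this from erroring out if the gtotree environment isn't active.

```shell
taxon="Alteromonas"

if command -v gtt-get-accessions-from-GTDB > /dev/null 2>&1; then
    # Count all genomes of this taxon in GTDB
    gtt-get-accessions-from-GTDB -t "$taxon" --get-taxon-counts
    # Count only the GTDB species representatives
    gtt-get-accessions-from-GTDB -t "$taxon" --GTDB-representatives-only --get-taxon-counts
else
    echo "gtt-get-accessions-from-GTDB not found; activate the gtotree environment first" >&2
fi
```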
So 231 in there total, and 50 when considering only GTDB representatives.
We can create files with the information for these reference genomes by removing the `--get-taxon-counts` flag.
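Under the same assumptions about the helper's name and its `-t` option, the call without the counting flag might look like this:

```shell
taxon="Alteromonas"

if command -v gtt-get-accessions-from-GTDB > /dev/null 2>&1; then
    # Without --get-taxon-counts, the program writes its output files
    gtt-get-accessions-from-GTDB -t "$taxon" --GTDB-representatives-only
else
    echo "gtt-get-accessions-from-GTDB not found; activate the gtotree environment first" >&2
fi
```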
And now it also tells us it wrote out information to two files:
The tsv file holds all the information from GTDB for these 50 Alteromonas reference genomes, and the txt file just holds the accessions we need to give to GToTree:
Here we are just going to pretend we care about these 4 KO terms:
By giving a file holding these to GToTree, as it searches each input genome for the SCGs we are using for treeing, it will also look for these functions, store the sequences for us, and make some summary output files about their distributions across our input genomes.
Here we are just putting these into a file:
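A minimal sketch of making that file, one KO ID per line. Only the two KO IDs that are named elsewhere on this page are written here, so the remaining IDs of interest would need to be added, and the file name "target-KOs.txt" is just a placeholder.

```shell
# Only the two KO IDs named on this page are listed; add any others one per line
printf '%s\n' K00001 K03895 > target-KOs.txt

# Check the contents
cat target-KOs.txt
```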
And we're ready to kick off GToTree! On our setup this will take about 4-5 minutes, with a code breakdown following:
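One plausible form of the command, pieced together from the flags described in the breakdown. The input file names "ref-accessions.txt" and "target-KOs.txt" are placeholders (swap in the accessions file written by the GTDB search and the KO file made above), `-j 4` is an arbitrary CPU count, and `Bacteria` and `Species` are assumed to be the values that match the bacterial SCG-set and species labels used on this page.

```shell
# Placeholder input names: replace with the real accessions file and KO file
accs="ref-accessions.txt"
kos="target-KOs.txt"

if command -v GToTree > /dev/null 2>&1 && [ -f "$accs" ]; then
    GToTree -a "$accs" -f our-fasta-files.txt -H Bacteria -K "$kos" \
            -D -L Species -j 4 -o alteromonas-gtotree-output
else
    echo "GToTree or $accs not found; check the environment and input file names" >&2
fi
```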
CODE BREAKDOWN
- `GToTree` - this is our base command
- `-a` - where we provide our file holding the reference accessions we want to include
- `-f` - where we provide our file holding the locations of our input fasta MAGs/genomes
- `-H` - here we are specifying the SCG-set we want to use (we could see all pre-packaged ones by running `gtt-hmms`)
- `-K` - where we specify our file holding the additional functions we want to search for (as KO IDs in this example)
- `-D` - specifies to adjust our tree labels to hold lineage information (from GTDB) rather than just keeping the input accession ID (making it easier to explore the tree later)
- `-L` - lets us specify what lineage info we want on the tree labels (here just including "species" because all of our input genomes are of the same genus)
- `-j` - where we specify how many cpus to use concurrently on steps that can be run in parallel
- `-o` - here is where we specify the output directory
That should take 4-5 minutes :+1:
We are going to download our tree file to our local computer so we can visualize it. There are a few ways we can do this, but we'll just use the GUI in our jupyter lab environment:
At the top-left, we can open our browser if it's not visible by clicking the folder icon:
Then we can go into the "alteromonas-gtotree-output" directory and download the "alteromonas-gtotree-output.tre" file:
In your regular file-browser GUI, try to find where that downloaded on your computer (it is most likely in your "Downloads" folder), and put up a yellow stickie if you're having trouble finding it.
You can also just download it by clicking this link if wanted: alteromonas-gtotree-output.tre file
We're going to use the interactive Tree of Life site to visualize our tree. So go to this upload page in your browser:
And drag and drop the "alteromonas-gtotree-output.tre" file from your file-browser onto the iToL upload page.
This should autoload to show us the starting tree:
This is an "unrooted" tree, which means we shouldn't be looking at it in terms of direction of evolution, and we can't make any interpretations like "this diverged before this". But unrooted trees still tell us about relatedness (e.g., "this is most closely related to this" – based on what we've done/are looking at).
It can help organization/visualization to "mid-point root" unrooted trees like this. We can do that by clicking on "Advanced" in the control panel at the top-right:
Then clicking on "Midpoint root" at the bottom:
Which will re-orient our tree based on the furthest points in it:
Back in our terminal, we can also make a quick iToL file to color our MAG labels. First we need a file that holds just their names, which we can quickly make from the "our-fasta-files.txt" file we used. Here is a way we can use `cut` to do that:
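A small sketch of that, assuming each line looks like "MAG-1.fa" so that everything before the first "." is the label. The output name "our-mag-labels.txt" is a placeholder, and the first line only creates a tiny example input if "our-fasta-files.txt" isn't already present.

```shell
# Stand-in only: create a small example list if our-fasta-files.txt is not here
[ -f our-fasta-files.txt ] || printf '%s\n' MAG-1.fa MAG-2.fa > our-fasta-files.txt

# Keep everything before the first "." (e.g., MAG-1.fa becomes MAG-1)
cut -d "." -f 1 our-fasta-files.txt > our-mag-labels.txt

# Check the result
cat our-mag-labels.txt
```

Note that this would cut too much if the file names had other periods in them, in which case stripping just the extension (e.g., with `sed`) would be safer.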
And we can format this for iToL with the following helper program:
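A sketch of that call, assuming the helper is `gtt-gen-itol-map` with `-g` for the file of labels, `-w` for what to color, and `-o` for the output file. These options are recalled from memory rather than taken from this page, so it's worth confirming them with `gtt-gen-itol-map -h`. The labels file name is the placeholder from the `cut` step.

```shell
labels="our-mag-labels.txt"

if command -v gtt-gen-itol-map > /dev/null 2>&1 && [ -f "$labels" ]; then
    # -w labels is assumed to color the tip labels rather than the branches
    gtt-gen-itol-map -g "$labels" -w labels -o iToL-label-colors.txt
else
    echo "gtt-gen-itol-map or $labels not found; check the environment and file name" >&2
fi
```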
Download that from our instance similar to how we did above (you may need to navigate the browser up a level), or you can download it by clicking here: iToL-label-colors.txt file
Again find that file in your local computer file-browser, and drag and drop that file on top of our tree:
This gives us an overview of which reference genomes our MAGs are most closely related to based on this method and SCGs we used. For instance, our MAG-1 here is most closely related to that Alteromonas mediterranea we included:
Remember we also added some KO terms to be searched:
Back in our command-line environment, all of these outputs are in a sub-directory called "KO_search_results/":
We can glance at one of the summary outputs from that, a table with counts of KOs identified per genome, with the following:
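As a hedged sketch, assuming the output directory was named "alteromonas-gtotree-output" and that the summary tables are written as ".tsv" files (the exact file names aren't reproduced here), we could list the sub-directory and peek at the top of any tables in it:

```shell
ko_dir="alteromonas-gtotree-output/KO_search_results"

if [ -d "$ko_dir" ]; then
    # See everything GToTree wrote for the KO search
    ls "$ko_dir"
    # Show the first few lines of each summary table
    for f in "$ko_dir"/*.tsv; do
        if [ -f "$f" ]; then
            echo "== $f =="
            head -n 5 "$f"
        fi
    done
else
    echo "$ko_dir not found; run GToTree with -K first" >&2
fi
```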
And if we were interested in looking into any sequences of those functions, those files were also produced, e.g., the sequences that were annotated as K00001 can be found in this file:
GToTree also created files for these that we can just drop onto our iToL tree. These are in this subdirectory of our outputs:
There are only 3 (even though we searched for 4) because "K03895" (aerobactin synthase) was not detected in any of them.
So let's download them to our local computers too. Again, you can do this through the jupyter lab interface, or can just click these links:
And again, if we find where those are in our file-browser, we can just drop them onto the tree page, which will color the branches of the genomes that had them detected. And we can toggle them on/off with the legend:
GToTree relies on many wonderful other programs. With each run, GToTree produces a "citations.txt" file with citation information specific for that run. Please be sure to cite the developers appropriately :)
Here is the output citations file from this run:
And an example of how I would write this part of the methods:
Example methods with citations from above
The Alteromonas phylogenomic tree was produced with GToTree v1.6.36 (Lee 2019), using the prepackaged single-copy gene-set for bacteria (74 target genes). Briefly, prodigal v2.6.3 (Hyatt et al. 2010) was used to predict genes on input genomes provided as fasta files. Reference genomes were accessed from the Genome Taxonomy Database v207 (Parks et al. 2022) based on searching for Alteromonas species representatives. Target genes were identified with HMMER3 v3.3.2 (Eddy 2011), individually aligned with muscle v5.1 (Edgar 2021), trimmed with trimal v1.4.rev15 (Capella-Gutiérrez et al. 2009), and concatenated prior to phylogenetic estimation with FastTree2 v2.1.11 (Price et al. 2010). KOfamScan v1.3.0 (Aramaki et al. 2020) was used to search all included genomes for functions of interest. And GNU Parallel (Tange 2018) was used where possible to run stages in parallel.
These days, I'm mostly working with exploratory metagenomics datasets, and for those I use GToTree to make a quick overview figure of recovered MAGs across domains, e.g., like this one where the blue branches and marks on the outer ring indicate newly recovered MAGs:
And also for zooming in on clades with newly recovered MAGs to show the distribution of certain traits, like this one shows which members of the Order Rubrobacterales have a pathway for carbon-fixation:
If you're new to this arena, with a million moving parts and terms to keep track of while finding your footing, it can be hard to know when something like GToTree (or any program) might be useful. To try to help with that, and with some other nuances and caveats of GToTree, I have a "Things to consider" page here. Please check that out if interested, and definitely don't hesitate to reach out to me to see if it seems like this might be useful for your work, or if there is somewhere else I might be able to point you towards 🙂