

GToTree is a user-friendly workflow for phylogenomics. This page is an example of using GToTree to make a phylogenomic tree incorporating newly recovered Metagenome-Assembled Genomes (MAGs) and references from the stellar Genome Taxonomy DataBase (GTDB).





NOTE
This page assumes some baseline familiarity with the Unix-like command line. You can find a Crash Course here if wanted 🙂

Accessing our Unix-like environment

You can get to your Unix-like environment by accessing your cloud instance (links on this page during the course) and getting to a Terminal via either Jupyter Lab or RStudio.

Environment creation

Here we are going to install GToTree on our instances with conda/mamba:

mamba create -y -n gtotree -c astrobiomike \
             -c conda-forge -c bioconda -c defaults \
             gtotree

This was v1.8.1 at the time this page was initially put together.

Environment activation

conda activate gtotree

And creating a working directory and changing into it:

mkdir -p ~/gtotree
cd ~/gtotree/

And our prompt should look something like this:

(gtotree) stamps@<IP>:~/gtotree$
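Before moving on, it can be worth confirming that the activated environment actually exposes the programs used on this page. The following is a minimal sketch: `check_tool` is a hypothetical helper name (not part of GToTree), and it relies only on the shell's built-in `command -v`.

```shell
# Hypothetical helper (not part of GToTree): report whether a program is on the PATH.
check_tool() {
    if command -v "$1" > /dev/null 2>&1; then
        printf 'found: %s\n' "$1"
    else
        printf 'missing: %s\n' "$1"
    fi
}

# The programs called on this page; each should be reported as found
# once the gtotree environment is active.
for tool in GToTree gtt-get-accessions-from-GTDB gtt-gen-itol-map; do
    check_tool "$tool"
done
```

If any of these come back as missing, the most likely cause is that the environment was not activated in the current terminal.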


Alteromonas example

The example here focuses on the Alteromonas genus, where we pretend we have 9 newly recovered Alteromonas MAGs and we want to integrate them with some Alteromonas reference genomes to get an overview of where our new MAGs fit within the estimated evolutionary history of the known Alteromonas genus.

We are just focusing on this genus to keep live run times down under ~5 minutes, but the process would be the same if we wanted to show where our new MAGs fit in across an entire Domain or all Domains.


Conceptually how we may have gotten to this point

To get to our hypothetical situation here, we may have:

  • used metagenomes from several different locations, assembled them, and recovered bins (groups of contigs we think are representative of a very closely related microbial population)
  • estimated their quality (e.g., with CheckM2) and "promoted" the high-quality ones to what we'd call MAGs
  • taxonomically classified our MAGs (preferably with the GTDB-Tk program if working with bacteria or archaea)

We would usually have a bunch of different MAGs from that process, but for this example we are just focusing on those that were classified as being part of the Alteromonas genus.


What we need to provide to GToTree for this example

  • input genomes
    • here we're going to do that with:
      • 1) a file holding the accessions of all the reference genomes we want to include
      • 2) a file holding the location of all our MAG fasta files
  • which SCGs (single-copy genes) to target

There is a lot more GToTree can do, detailed at the wiki. But this page is an example of the baseline/most common usage.

Downloading our MAGs

This code block will download and unpack the fasta files of our MAGs (which would be the typical format we'd have these in at this point; also, this would all work the same if these were isolate genomes instead of MAGs, or a mix of the two):

curl -L -o Alteromonas-MAGs.tar https://figshare.com/ndownloader/files/36444270

tar -xf Alteromonas-MAGs.tar
ls *.gz
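A quick sanity check after unpacking is to count how many contigs each MAG holds, since every contig in a fasta file starts with a `>` header line. This is only a sketch: `count_contigs` is a hypothetical helper name, and it assumes the files are gzip-compressed fasta, as the `.gz` extension suggests.

```shell
# Hypothetical helper: count fasta header lines (one per contig) in a gzipped fasta file.
count_contigs() {
    gzip -dc "$1" | grep -c '^>'
}

# Print "<file><TAB><number of contigs>" for every gzipped file in the current directory.
for f in *.gz; do
    [ -e "$f" ] || continue   # skip the literal pattern if nothing matches
    printf '%s\t%s\n' "$f" "$(count_contigs "$f")"
done
```

A file that reports 0 contigs, or that makes `gzip` complain, probably did not download or unpack properly.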

We can use that to make a file holding their locations, which we will give to GToTree (remember that a bare file name is itself a "relative path", i.e., the address of the file relative to where we currently are):

ls *.gz > our-fasta-files.txt

head our-fasta-files.txt
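Because the paths in this file are relative, they only resolve when GToTree is started from the same directory. One way to check that, sketched here with a hypothetical helper named `count_missing`, is to test each listed path and report how many cannot be found:

```shell
# Hypothetical helper: print how many paths listed in a file do not exist.
count_missing() {
    missing=0
    while IFS= read -r path || [ -n "$path" ]; do
        if [ -z "$path" ]; then
            continue                          # ignore blank lines
        fi
        if [ ! -e "$path" ]; then
            printf 'missing: %s\n' "$path" >&2
            missing=$((missing + 1))
        fi
    done < "$1"
    printf '%s\n' "$missing"
}

# Expect 0 if every listed MAG file can be found from the current directory.
if [ -f our-fasta-files.txt ]; then
    count_missing our-fasta-files.txt
fi
```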

Getting Alteromonas reference genomes to include in our tree

To add reference genomes to our tree alongside our new MAGs, we could download them all ourselves (as fasta or GenBank files for instance), or we can just give GToTree a file holding the NCBI/GTDB accessions for the genomes we want 👍

To get our list of Alteromonas reference accessions, we are going to search the GTDB based on taxonomy.

Using "representative" genomes

When wanting to span a certain breadth of diversity, we often don't need to include many versions of very closely related genomes. Using "representative" genomes can help with this. Representative genomes are a slimmed-down set of manually and computationally selected genomes that are designed to capture the breadth of microbial diversity in genomic lineage space. It is common to use either NCBI's representative genomes or GTDB's species representatives.

In the commands below, we are adding the flag --GTDB-representatives-only to focus on just them.

First we can see how many Alteromonas exist in GTDB (the --get-taxon-counts flag tells the program just to return this information). This is typically how I would start so I have some idea of how many things I'll be working with.

This will need to download some GTDB info the first time it's run, but should only take ~1 minute.

gtt-get-accessions-from-GTDB -t Alteromonas --GTDB-representatives-only --get-taxon-counts

#   Downloading and parsing archaeal and bacterial metadata tables from
#   GTDB (only needs to be done once)...

#     Using GTDB v214.1: Released Jun 9th, 2023


#     The rank 'genus' has 256 Alteromonas entries.

#   In considering only GTDB representative genomes:

#     The rank 'genus' has 51 Alteromonas representative genome entries.

So there are 256 in there total, and 51 when considering only GTDB representatives.

We can create files with the information for these reference genomes by removing the --get-taxon-counts flag from the above command:

gtt-get-accessions-from-GTDB -t Alteromonas --GTDB-representatives-only

And it also tells us it wrote out information to two files:

#   The targeted NCBI accessions were written to:
#     GTDB-Alteromonas-genus-GTDB-rep-accs.txt

#   A subset GTDB table of these targets was written to:
#     GTDB-Alteromonas-genus-GTDB-rep-metadata.tsv

The tsv file holds all the information from GTDB for these 51 Alteromonas reference genomes, and the txt file just holds the accessions we need to give to GToTree:

head GTDB-Alteromonas-genus-GTDB-rep-accs.txt
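It can also be useful to confirm that the accessions file holds the number of entries we expect, with no duplicates, and to peek at which columns the metadata table provides. The sketch below uses a hypothetical helper, `count_total_and_unique`, and deliberately makes no assumption about what the metadata columns are called:

```shell
# Hypothetical helper: print "<total lines> <unique lines>" for a one-per-line list.
count_total_and_unique() {
    total=$(wc -l < "$1" | tr -d ' ')
    unique=$(sort -u "$1" | wc -l | tr -d ' ')
    printf '%s %s\n' "$total" "$unique"
}

accs=GTDB-Alteromonas-genus-GTDB-rep-accs.txt
meta=GTDB-Alteromonas-genus-GTDB-rep-metadata.tsv

# Both numbers should match the representative count reported above
# (51 with the GTDB release used when this page was written).
if [ -f "$accs" ]; then
    count_total_and_unique "$accs"
fi

# List the first few column names of the metadata table, one per line,
# without assuming what they are called.
if [ -f "$meta" ]; then
    head -n 1 "$meta" | tr '\t' '\n' | head
fi
```

The counts will differ with newer GTDB releases, so the number printed by the --get-taxon-counts step is the one to compare against.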

GToTree

And we're ready to kick off GToTree! On our setup this will take about 5 minutes the first time it's run. So let's start it and then look at a code breakdown while it's running:

GToTree -a GTDB-Alteromonas-genus-GTDB-rep-accs.txt \
        -f our-fasta-files.txt \
        -H Bacteria -D -L Species -j 16 \
        -o alteromonas-gtotree-output

CODE BREAKDOWN

  • GToTree - this is our base command
    • -a - where we provide our file holding the GTDB/NCBI accessions we want to include
    • -f - where we provide our file holding the locations of our input fasta MAGs/genomes
    • -H - here we are specifying the SCG-set we want to use (we could see all pre-built available ones by running gtt-hmms)
    • -D - specifies to adjust our tree labels to hold lineage information (from GTDB) rather than just keeping the input accession ID (making it easier to explore the tree later)
    • -L - lets us specify what lineage info we want on the tree labels (here just including "species" because all of our input genomes are of the same genus; though note that species names include genus designations ¯\_(ツ)_/¯)
    • -j - where we specify how many cpus to use concurrently on steps that can be run in parallel
    • -o - here is where we specify the output directory
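The value given to -j only helps up to the number of CPUs the machine actually has. If running this somewhere other than the course instances, a small sketch like the following can pick a sensible value. Note that `pick_jobs` is a hypothetical helper name, and `getconf _NPROCESSORS_ONLN` is assumed to be available (it is on typical Linux and macOS systems).

```shell
# Hypothetical helper: print the smaller of a requested job count and the
# number of CPUs the system reports (falling back to 1 if that is unknown).
pick_jobs() {
    requested=$1
    available=$(getconf _NPROCESSORS_ONLN 2> /dev/null || echo 1)
    case "$available" in
        ''|*[!0-9]*) available=1 ;;   # guard against non-numeric output
    esac
    if [ "$available" -lt "$requested" ]; then
        echo "$available"
    else
        echo "$requested"
    fi
}

jobs_to_use=$(pick_jobs 16)
echo "would run GToTree with: -j $jobs_to_use"
```

The printed value could then be used in place of the 16 in the GToTree command.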

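Here's a high-level view of what GToTree is doing for us:

  • it will download any reference genome files needed for those we provided as accessions to the -a argument
  • it will go through each individual input genome and identify any of our target single-copy genes
  • it will do some filtering
  • it will align each target gene individually, trim the alignments, concatenate them, and estimate the tree (the programs used for each of these steps are listed in the citations output further down this page)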

That should take 4-5 minutes (mostly due to needing to download some NCBI data the first time it's run) 👍
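Once the run finishes, a rough way to check that the tree holds about as many genomes as we put in is to count the tips in the Newick file. A Newick tree has one more tip than it has commas, so the sketch below counts commas and adds one. It assumes no label contains a comma, and `count_tips` is a hypothetical helper name.

```shell
# Hypothetical helper: estimate the number of tips in a Newick file as (commas + 1).
# This assumes no label contains a comma.
count_tips() {
    tr -cd ',' < "$1" | wc -c | awk '{ print $1 + 1 }'
}

tree=alteromonas-gtotree-output/alteromonas-gtotree-output.tre
if [ -f "$tree" ]; then
    count_tips "$tree"
fi
```

With 9 MAGs and 51 reference genomes as input, the most this could print here is 60; a smaller number would mean some genomes were dropped along the way (for example by the filtering step).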


Downloading our tree file

We are going to download our tree file to our local computer so we can visualize it. There are a few ways we can do this; the images below are an example of using the GUI file-browser in the Jupyter Lab environment:

At the top-left, we can open our browser if it's not visible by clicking the folder icon:

Then we can go into the "alteromonas-gtotree-output" directory and download the "alteromonas-gtotree-output.tre" file:

In your regular file-browser GUI, try to find where that downloaded on your computer (it is most likely in your "Downloads" folder), and put up a yellow stickie if you're having trouble finding it.

You can also just download it by clicking this link if wanted: alteromonas-gtotree-output.tre file.


Visualizing our tree

We're going to use the interactive Tree of Life site to visualize our tree. So go to this upload page in your browser:

And drag and drop the "alteromonas-gtotree-output.tre" file from your file-browser onto the iToL upload page.

This should autoload to show us the starting tree:

This is an "unrooted" tree, which means it is not suggestive of anything in terms of direction of evolution (e.g., we wouldn't be able to make any interpretations like "this diverged before that"). But unrooted trees still tell us about relatedness (e.g., "this is most closely related to this"), based on what we've done to generate this view.

It can help organization/visualization of unrooted trees to "mid-point root". We can do that by clicking on "Advanced" in the control panel at the top-right:

Then clicking on "Midpoint root" at the bottom:

Which will re-orient our tree based on the furthest points in it:

Back in our terminal, we can also make a quick iToL file to color our MAG labels. First we need a file that holds just their names, which we can quickly make from the "our-fasta-files.txt" file we used. Here is a way we can use cut to do that:

cut -f 1 -d "." our-fasta-files.txt > MAG-labels.txt

head MAG-labels.txt
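To make it clearer what that cut command keeps, here is a small sketch run on made-up file names (they are hypothetical, used only for illustration). It also includes a hypothetical helper, `duplicated_labels`, for checking that no two MAGs ended up with the same label, which could happen if file names only differ after the first period.

```shell
# Hypothetical file names, used only to show what the cut command keeps.
printf '%s\n' 'MAG-1.fa.gz' 'MAG-2.v2.fa.gz' | cut -f 1 -d "."
# The first name becomes MAG-1; the second becomes MAG-2 (the ".v2" part is dropped too).

# Hypothetical helper: print any label that occurs more than once in a file.
duplicated_labels() {
    sort "$1" | uniq -d
}

# No output here means every MAG label is unique.
if [ -f MAG-labels.txt ]; then
    duplicated_labels MAG-labels.txt
fi
```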

And we can format this for iToL with the following helper program:

gtt-gen-itol-map -w labels -o iToL-label-colors.txt -g MAG-labels.txt

head iToL-label-colors.txt

Download that from our instance similar to how we did above (you may need to navigate the browser up a level), or you can download it by clicking here: iToL-label-colors.txt file.

Again find that file in your local computer file-browser, and drag and drop that file on top of our tree, and it will color all our MAG genome labels for us:

This gives us an overview of which reference genomes our MAGs are most closely related to, based on the method and SCGs we used. For instance, our MAG-1 here is most closely related to the Alteromonas mediterranea we included:

Citations output file!

GToTree relies on many wonderful other programs. With each run, GToTree produces a "citations.txt" file with citation information specific for that run. Please be sure to cite the developers appropriately :)

Here is the output citations file from this run:

cat alteromonas-gtotree-output/citations.txt | sed 's/^/# /'
# GToTree v1.8.1
# Lee MD. GToTree: a user-friendly workflow for phylogenomics. Bioinformatics. 2019; (March):1-3. doi.org/10.1093/bioinformatics/btz188
# 
# HMMER3 v3.3.2
# Eddy SR. Accelerated profile HMM searches. PLoS Comput. Biol. 2011; (7)10. doi.org/10.1371/journal.pcbi.1002195
# 
# Muscle 5.1.linux64
# Edgar RC. MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping. bioRxiv. 2021.06.20.449169. doi.org/10.1101/2021.06.20.449169
# 
# TrimAl v1.4.rev15
# Gutierrez SC. et al. TrimAl: a Tool for automatic alignment trimming. Bioinformatics. 2009; 25, 1972-1973. doi.org/10.1093/bioinformatics/btp348
# 
# Prodigal v2.6.3
# Hyatt, D. et al. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics. 2010; 28, 2223-2230. doi.org/10.1186/1471-2105-11-119
# 
# Genome Taxonomy Database (GTDB) ; 
# Parks DH et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat. Biotech. 2020. doi.org/10.1038/s41587-020-0501-8
# 
# FastTree 2 v2.1.11
# Price MN et al. FastTree 2 - approximately maximum-likelihood trees for large alignments. PLoS One. 2010; 5. doi.org/10.1371/journal.pone.0009490
# 
# GNU Parallel v20230522
# Tange O. GNU Parallel 2018. doi.org/10.5281/zenodo.1146014
# 

And an example of how this could be written up as part of the methods:

Example methods with citations from above

The Alteromonas phylogenomic tree was produced with GToTree v1.8.1 (Lee 2019), using the prepackaged single-copy gene-set for bacteria (74 target genes). Briefly, prodigal v2.6.3 (Hyatt et al. 2010) was used to predict genes in input genomes provided as fasta files. Reference genomes were accessed from the Genome Taxonomy Database v214.1 (Parks et al. 2020) based on searching for Alteromonas species representatives. Target genes were identified with HMMER3 v3.3.2 (Eddy 2011), individually aligned with muscle v5.1 (Edgar 2021), trimmed with trimal v1.4.rev15 (Capella-Gutiérrez et al. 2009), and concatenated prior to phylogenetic estimation with FastTree2 v2.1.11 (Price et al. 2010). GNU Parallel (Tange 2018) was used where possible to run stages in parallel.

After getting a tree into iToL like we have above, it is common to use iToL to do a lot of coloring/annotation. If this is for a publication, we might then want to further beautify the tree in a figure-editing program such as Affinity Designer (one-time license fee) or Inkscape (free), as has been done with the examples below.


Fin

If working with MAGs from exploratory metagenomics, it is often helpful to make an overview of where the newly recovered MAGs fit in across an entire Domain, e.g., like this one where the blue branches and marks on the outer ring indicate newly recovered MAGs:

It can also be helpful to zoom in on clades with newly recovered MAGs to show the distribution of certain traits, like this one, which shows which members of the Order Rubrobacterales have a pathway for carbon-fixation:

GToTree can also search each genome for specific functions if the target functions are specified (with KO or PFam IDs) in a file (like the example here).


If new to this arena, and there are a million moving parts and terms while we're trying to find our footing in it all, it can be confusing to know when something like GToTree (or any program) might be useful. To try to help with that, and discuss some other nuances and caveats of GToTree, there is a "Things to consider" page here. Please check that out if interested, and definitely don't hesitate to reach out to see if it seems like this might be useful for your work, or if there is something else more suitable that we might be able to point you towards 🙂