
STAMPS 2022 GToTree tutorial



GToTree is a user-friendly workflow for phylogenomics. This page is an example of using GToTree to make a phylogenomic tree incorporating newly recovered Metagenome-Assembled Genomes (MAGs) and references from the stellar Genome Taxonomy DataBase (GTDB).





Environment creation

This is already done on our instances, but if we wanted to install GToTree with conda/mamba in a new location, we would do so like this:

# DON'T RUN IF WORKING ON OUR CLOUD INSTANCES
# mamba create -y -n gtotree -c conda-forge -c bioconda -c defaults -c astrobiomike gtotree

Environment activation

conda activate gtotree

And creating a working directory and changing into it:

mkdir -p ~/gtotree
cd ~/gtotree/

And our prompt should look something like this:

(gtotree) stamps2022@XXX.XXX.XXX.XXX:~/gtotree$


Alteromonas example

The example here focuses on the Alteromonas genus, where we pretend we have 9 newly recovered Alteromonas MAGs and we want to integrate them with some Alteromonas reference genomes to get an overview of where our new MAGs fit within the estimated evolutionary history of the known Alteromonas genus.

We are just focusing on this genus to keep live run times down under ~5 minutes, but the process would be the same if we wanted to show where our new MAGs fit in across an entire Domain.


Conceptually how we may have gotten to this point

To get to our hypothetical situation here, we may have utilized metagenomes from several different locations, assembled them, recovered bins (groups of contigs we think are representative of a very closely related microbial population), estimated their quality (e.g., with CheckM2), "promoted" the high-quality ones to what we'd call MAGs, and then taxonomically classified those MAGs (preferably with the stellar GTDB-Tk program).

We would usually have a bunch of different MAGs from that process, but for now we are just focusing on those that were classified as being part of the Alteromonas genus.


What we need to provide to GToTree

  • input genomes
    • here we're going to do that with: 1) a file holding the accessions of all the reference genomes we want to include; and 2) a file holding the location of all our MAG fasta files
  • which SCGs to use
  • optional: additional functions we want to look for in all our genomes (specified with KO or Pfam IDs)

Downloading our MAGs

This code block will download and unpack the fasta files of our MAGs (which would be the typical format we'd have these in at this point):

curl -L -o Alteromonas-MAGs.tar https://figshare.com/ndownloader/files/36444270

tar -xf Alteromonas-MAGs.tar
ls *.gz

We can use that to make a file holding their locations, which we will give to GToTree (remember that a bare file name is itself a "relative path", i.e., the address of the file relative to where we currently are):

ls *.gz > our-fasta-files.txt

head our-fasta-files.txt
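Before handing this list to GToTree, it can be worth confirming that every entry really points at a file from where we are sitting. Here is a minimal sketch of one way to do that, assuming one path per line; `check_listed_files` is a hypothetical helper name made up for this example, not something that ships with GToTree:

```shell
# Hypothetical helper (not part of GToTree): report entries in a list file
# that do not point at an existing, non-empty file.
check_listed_files() {
    list_file="$1"
    missing=0
    while IFS= read -r path; do
        # skip blank lines
        [ -z "$path" ] && continue
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path"
            missing=$((missing + 1))
        fi
    done < "$list_file"
    echo "entries with problems: $missing"
}

# Only run the check if the list exists in the current directory.
if [ -f our-fasta-files.txt ]; then
    check_listed_files our-fasta-files.txt
fi
```

Because the entries are relative paths, this only works when run from the same directory we would run GToTree from.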

Getting Alteromonas reference genomes to include in our tree

To add reference genomes to our tree alongside our new MAGs, we could download them all ourselves (as fasta or GenBank files for instance), or we can just give GToTree a file holding the NCBI/GTDB accessions for the genomes we want :+1:

To get our list of Alteromonas reference accessions, we are going to search the GTDB based on taxonomy.

Using "representative" genomes

When wanting to span some specific breadth of diversity, we don't need to have many versions of very closely related genomes. Using "representative" genomes can help with this. Representative genomes are a slimmed down set of manually and computationally selected genomes that are designed to capture the breadth of microbial diversity in genomic lineage space. I regularly make use of both NCBI's representative genomes and GTDB's species representatives. In the commands below, we add the flag --GTDB-representatives-only to focus on just them.

First we can see how many Alteromonas exist in GTDB (the --get-taxon-counts flag tells the program just to return this information). This is typically how I would start so I have some idea of how many things I'll be working with.

gtt-get-accessions-from-GTDB -t Alteromonas --GTDB-representatives-only --get-taxon-counts

#   Reading in the GTDB info table...
#     Using GTDB v207: Released April 08, 2022


#     The rank 'genus' has 231 Alteromonas entries.

#   In considering only GTDB representative genomes:

#     The rank 'genus' has 50 Alteromonas representative genome entries.

So 231 in there total, and 50 when considering only GTDB representatives.

We can create files with the information for these reference genomes by removing the --get-taxon-counts flag.

gtt-get-accessions-from-GTDB -t Alteromonas --GTDB-representatives-only

And now it also tells us it wrote out information to two files:

#   The targeted NCBI accessions were written to:
#     GTDB-Alteromonas-genus-GTDB-rep-accs.txt

#   A subset GTDB table of these targets was written to:
#     GTDB-Alteromonas-genus-GTDB-rep-metadata.tsv

The tsv file holds all the information from GTDB for these 50 Alteromonas reference genomes, and the txt file just holds the accessions we need to give to GToTree:

head GTDB-Alteromonas-genus-GTDB-rep-accs.txt
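As a quick sanity check, we would expect this file to hold 50 accessions, matching the representative count reported above. A minimal sketch, assuming one accession per line; `summarize_accessions` is a hypothetical helper name for this example only:

```shell
# Hypothetical helper: print the number of non-blank lines in a file,
# and the number of distinct lines that occur more than once.
summarize_accessions() {
    acc_file="$1"
    echo "accessions: $(grep -c . "$acc_file")"
    echo "duplicates: $(grep . "$acc_file" | sort | uniq -d | wc -l | tr -d ' ')"
}

# Only run if the accessions file is in the current directory.
if [ -f GTDB-Alteromonas-genus-GTDB-rep-accs.txt ]; then
    summarize_accessions GTDB-Alteromonas-genus-GTDB-rep-accs.txt
fi
```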

Here we are just going to pretend we care about these 4 KO terms: K00001, K03782, K03895, and K16092.

By giving a file holding these to GToTree, as it searches each input genome for the SCGs we are using for treeing, it will also look for these functions, store the sequences for us, and make some summary output files about their distributions across our input genomes.

Here we are just putting these into a file:

printf "K00001\nK03782\nK03895\nK16092\n" > target-KOs.txt

head target-KOs.txt
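Since a typo in this file is easy to miss, here is a minimal sketch of a format check. KO identifiers are a "K" followed by five digits, so anything else gets flagged (including blank lines or stray whitespace); `find_malformed_kos` is a hypothetical helper name:

```shell
# Hypothetical helper: print any lines that are not "K" followed by five digits.
find_malformed_kos() {
    grep -vE '^K[0-9]{5}$' "$1"
}

# grep -v exits 0 when it printed at least one offending line.
if [ -f target-KOs.txt ]; then
    if find_malformed_kos target-KOs.txt; then
        echo "the lines above do not look like KO IDs"
    else
        echo "all lines look like KO IDs"
    fi
fi
```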

GToTree

And we're ready to kick off GToTree! On our setup this will take about 4-5 minutes, with a code breakdown following:

GToTree -a GTDB-Alteromonas-genus-GTDB-rep-accs.txt \
        -f our-fasta-files.txt \
        -H Bacteria -K target-KOs.txt \
        -D -L Species -j 8 \
        -o alteromonas-gtotree-output

CODE BREAKDOWN

  • GToTree - this is our base command
    • -a - where we provide our file holding the reference accessions we want to include
    • -f - where we provide our file holding the locations of our input fasta MAGs/genomes
    • -H - here we are specifying the SCG-set we want to use (we could see all pre-packaged ones by running gtt-hmms)
    • -K - where we specify our file holding the additional functions we want to search for (as KO IDs in this example)
    • -D - specifies to adjust our tree labels to hold lineage information (from GTDB) rather than just keeping the input accession ID (making it easier to explore the tree later)
    • -L - lets us specify what lineage info we want on the tree labels (here just including "species" because all of our input genomes are of the same genus)
    • -j - where we specify how many cpus to use concurrently on steps that can be run in parallel
    • -o - here is where we specify the output directory
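The value given to -j only helps up to the number of CPUs the machine actually has. If we were running this somewhere other than our workshop instances, a minimal sketch like the following could pick a sensible value; `pick_jobs` is a hypothetical helper name, and it assumes `getconf _NPROCESSORS_ONLN` is available (falling back to 1 if not):

```shell
# Hypothetical helper: print the requested job count, lowered to the number
# of CPUs available if the machine has fewer than requested.
pick_jobs() {
    wanted="$1"
    available=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$available" -lt "$wanted" ]; then
        echo "$available"
    else
        echo "$wanted"
    fi
}

# 8 is the value used in the GToTree command above.
echo "would pass: -j $(pick_jobs 8)"
```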

That should take 4-5 minutes :+1:


Downloading our tree file

We are going to download our tree file to our local computer so we can visualize it. There are a few ways we can do this, but we'll just use the GUI in our jupyter lab environment:

At the top-left, we can open our browser if it's not visible by clicking the folder icon:

Then we can go into the "alteromonas-gtotree-output" directory and download the "alteromonas-gtotree-output.tre" file:

In your regular file-browser GUI, try to find where that downloaded on your computer (it is most likely in your "Downloads" folder), and put up a yellow stickie if you're having trouble finding it.

You can also just download it by clicking this link if wanted: alteromonas-gtotree-output.tre file


Visualizing our tree

We're going to use the interactive Tree of Life site to visualize our tree. So go to this upload page in your browser:

And drag and drop the "alteromonas-gtotree-output.tre" file from your file-browser onto the iToL upload page.

This should autoload to show us the starting tree:

This is an "unrooted" tree, which means we shouldn't be looking at it in terms of direction of evolution, and we can't make any interpretations like "this diverged before this". But unrooted trees still tell us about relatedness (e.g., "this is most closely related to this" – based on what we've done/are looking at).

It can help organization/visualization to "mid-point root" unrooted trees like this. We can do that by clicking on "Advanced" in the control panel at the top-right:

Then clicking on "Midpoint root" at the bottom:

Which will re-orient our tree based on the furthest points in it:

Back in our terminal, we can also make a quick iToL file to color our MAG labels. First we need a file that holds just their names, which we can quickly make from the "our-fasta-files.txt" file we used. Here is a way we can use cut to do that:

cut -f 1 -d "." our-fasta-files.txt > MAG-labels.txt

head MAG-labels.txt
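To make it clearer what that cut command is doing, here is a small sketch on made-up file names (both names and the ".fa.gz" suffix are assumptions for illustration only). It also shows the one caveat worth knowing: cut keeps only what comes before the first dot.

```shell
# "MAG-1.fa.gz" is a hypothetical file name used only for illustration.
printf 'MAG-1.fa.gz\n' | cut -f 1 -d "."
# prints: MAG-1

# cut keeps everything before the FIRST dot, so a name with an extra dot
# gets shortened more than we might want:
printf 'MAG.v2.fa.gz\n' | cut -f 1 -d "."
# prints: MAG

# Stripping a known suffix with sed avoids that (the suffix is an assumption):
printf 'MAG.v2.fa.gz\n' | sed 's/\.fa\.gz$//'
# prints: MAG.v2
```

With file names that only have dots in their extensions, the cut approach is all we need.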

And we can format this for iToL with the following helper program:

gtt-gen-itol-map -w labels -o iToL-label-colors.txt -g MAG-labels.txt

head iToL-label-colors.txt

Download that from our instance similar to how we did above (you may need to navigate the browser up a level), or you can download it by clicking here: iToL-label-colors.txt file

Again find that file in your local computer file-browser, and drag and drop that file on top of our tree:

This gives us an overview of which reference genomes our MAGs are most closely related to based on this method and SCGs we used. For instance, our MAG-1 here is most closely related to that Alteromonas mediterranea we included:

Remember we also added some KO terms to be searched:

Back in our command-line environment, all of these outputs are in a sub-directory called "KO_search_results/":

ls alteromonas-gtotree-output/KO_search_results/
# individual_genome_results  KO-hit-counts.tsv  target_KO_profiles
# iToL_files                 KO_hit_seqs        target-KOs.tsv

We can glance at one of the summary outputs from that, a table with counts of KOs identified per genome, with the following:

head alteromonas-gtotree-output/KO_search_results/KO-hit-counts.tsv | column -t
# assembly_id      total_gene_count  K00001  K03782  K03895  K16092
# GCA_000020585.3  3965              0       1       0       1
# GCA_000172635.2  3872              1       2       0       1
# GCA_000213655.1  4371              0       2       0       1
# GCA_000378185.1  3607              0       1       0       0
# GCA_000730385.1  3445              0       2       0       1
# GCA_000753865.1  3777              0       2       0       1
# GCA_000808575.1  3583              0       1       0       1
# GCA_001562115.1  4078              0       2       0       1
# GCA_001757105.1  4136              0       1       0       1
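If we wanted to pull out just the genomes that had a given KO detected, something like the following could work. This is a minimal sketch assuming the table is tab-delimited with a header row like the one shown above; `genomes_with_ko` is a hypothetical helper name, not a GToTree program:

```shell
# Hypothetical helper: list genomes with at least one hit for a given KO
# in a tab-separated hit-counts table (the column is located by header name).
genomes_with_ko() {
    counts_tsv="$1"
    ko="$2"
    awk -F '\t' -v ko="$ko" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == ko) col = i; next }
        col && ($col + 0) > 0 { print $1 }
    ' "$counts_tsv"
}

counts=alteromonas-gtotree-output/KO_search_results/KO-hit-counts.tsv
if [ -f "$counts" ]; then
    genomes_with_ko "$counts" K00001
fi
```

Looking the column up by name rather than position means the same helper works regardless of how many KOs we searched for.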

And if we were interested in looking into any sequences of those functions, those files were also produced, e.g., the sequences that were annotated as K00001 can be found in this file:

head alteromonas-gtotree-output/KO_search_results/KO_hit_seqs/K00001-hits.faa  | sed 's/^/# /'
# >GCA_000172635.2_472
# MKAVVFESFGGQLNIQDVKKPMPKSHGVVIRVKATGVCRSDWHGWMGHDDGINLPHVPGHEFAGVVESVGSDVKRFKAGDKVTVPFISACGRCSECSSGNHQVCGNQTQPGFTHWGSFAEYVEIDFGDVNLVHLPEEIDYVTAASLGCRFATSFRAVVDQGKVSAGQWVAVHGCGGVGLSAVMIASSIGANVIAVDISDDALALAQKLGADVVINAKNEKDVSATIKRLSSGGAHVSIDALGNPITCVNSVNSLRKRGKHVQVGLLLAEQSTPPIPMDTVVAHELEVIGSHGMQAYRYEAMFGMIATKKLNPQALIGKIISLQEAPKALTEMDKYSQPGVTVISFDEK
# >GCA_900129565.1_360
# MKAVLIEQFSQRPSIQQVADPTPNSHGVVIRVKATGVCRSDWHCWQGHDTDVVLPHVPGHEFAGIVEAVGKDVSKFKVGDRVTVPFINACGNCNECHSGNHQVCGNQTQPGFTHWGSFAEYTTVDHADVNLVTLPEHMDFTTAASLGCRFVTAFRAVVDQGGVSAGQWVAVHGCGGVGLSAIMIASAAGANVIAVDIAEDKLALAKSLGAVITLNANTTDDVAAAIKEVTKGGAHVSLDALGHPVTCVNSINCLRKLGKHIQVGLLLAEHATPPIPMDNVVANELQILGSHGMQAFRYTAMMALILSGKLQPEKLLGKTIALEDAIDAMVDMDTSTSVGVTVVTHF
# >MAG-2_485
# MKAVVFESFGGQLNIQDVKKPMPKSHGVVIRVKATGVCRSDWHGWMGHDDGINLPHVPGHEFAGVVESVGSDVKRFKAGDKVTVPFISACGRCSECSSGNHQVCGNQTQPGFTHWGSFAEYVEIDFGDVNLVHLPEEIDYVTAASLGCRFATSFRAVVDQGKVSAGQWVAVHGCGGVGLSAVMIASSIGANVIAVDISDDALALAQKLGADVVINAKNEKDVSATIKRLSSGGAHVSIDALGNPITCVNSVNSLRKRGKHVQVGLLLAEQSTPPIPMDTVVAHELEVIGSHGMQAYRYEAMFGMIATKKLNPQALIGKIISLQEAPKALTEMDKYSQPGVTVISFDEK
# >GCA_017794925.1_1036
# MKAVLIEQFSQRPTIQQVADPTPNSHGVVIRVKATGVCRSDWHCWQGHDTDVVLPHVPGHEFAGIVEAVGKDVSKFKVGDRVTVPFINACGNCNECHSGNHQVCGNQTQPGFTHWGSFAEYTTVDHADVNLVTLPEHMDFTTAASLGCRFVTAFRAVVDQGGVTAGQWVAVHGCGGVGLSAIMIASAAGANVIAVDIAEDKLALAKSLGAVITLNANNTDDVAAAIKEVTKGGAHVSLDALGHPVTCVNSINCLRKLGKHIQVGLLLAEHATPPIPMDNVVANELQILGSHGMQAFRYNAMMALILSGKLQPEKLLGKTIALEDAIDAMVDMDTSTSVGVTVVTHF
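To summarize one of these files without scrolling through sequences, we can count the headers and list which genomes they came from. A minimal sketch, assuming (based on the headers shown above) that each header is the genome ID followed by an underscore and a gene number; both function names are hypothetical:

```shell
# Hypothetical helper: count the sequences in a fasta file.
count_fasta_seqs() {
    grep -c '^>' "$1"
}

# Hypothetical helper: list the unique genome IDs in the headers, assuming
# headers look like ">GENOME-ID_GENE-NUMBER".
genomes_in_fasta() {
    grep '^>' "$1" | sed -E 's/^>//; s/_[0-9]+$//' | sort -u
}

hits=alteromonas-gtotree-output/KO_search_results/KO_hit_seqs/K00001-hits.faa
if [ -f "$hits" ]; then
    echo "sequences: $(count_fasta_seqs "$hits")"
    genomes_in_fasta "$hits"
fi
```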

GToTree also created files for these that we can just drop onto our iToL tree. These are in this subdirectory of our outputs:

ls alteromonas-gtotree-output/KO_search_results/iToL_files/
# K00001-iToL.txt  K03782-iToL.txt  K16092-iToL.txt

There are only 3 (even though we searched for 4) because "K03895" (aerobactin synthase) was not detected in any of them.
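With only 4 targets this is easy to spot by eye, but with a longer list we might want to check which KOs ended up without an iToL file. A minimal sketch, assuming the files are named like those listed above (KO ID followed by "-iToL.txt"); `kos_without_itol_file` is a hypothetical helper name:

```shell
# Hypothetical helper: print each KO in a list file that has no matching
# "<KO>-iToL.txt" file in the given directory.
kos_without_itol_file() {
    ko_list="$1"
    itol_dir="$2"
    while IFS= read -r ko; do
        # skip blank lines
        [ -z "$ko" ] && continue
        [ -f "${itol_dir}/${ko}-iToL.txt" ] || echo "$ko"
    done < "$ko_list"
}

itol_dir=alteromonas-gtotree-output/KO_search_results/iToL_files
if [ -f target-KOs.txt ] && [ -d "$itol_dir" ]; then
    kos_without_itol_file target-KOs.txt "$itol_dir"
fi
```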

So let's download them to our local computers too. Again, you can do this through the jupyter lab interface, or can just click these links:

And again, if we find where those are in our file-browser, we can just drop them onto the tree page, which will color the branches of the genomes that had them detected. And we can toggle them on/off with the legend:

Citations output file!

GToTree relies on many wonderful other programs. With each run, GToTree produces a "citations.txt" file with citation information specific for that run. Please be sure to cite the developers appropriately :)

Here is the output citations file from this run:

cat alteromonas-gtotree-output/citations.txt | sed 's/^/# /'
# GToTree v1.6.36
# Lee MD. GToTree: a user-friendly workflow for phylogenomics. Bioinformatics. 2019; (March):1-3. doi.org/10.1093/bioinformatics/btz188
# 
# HMMER3 v3.3.2
# Eddy SR. Accelerated profile HMM searches. PLoS Comput. Biol. 2011; (7)10. doi.org/10.1371/journal.pcbi.1002195
# 
# Muscle 5.1.linux64
# Edgar RC. MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping. bioRxiv. 2021.06.20.449169. doi.org/10.1101/2021.06.20.449169
# 
# TrimAl v1.4.rev15
# Gutierrez SC. et al. TrimAl: a Tool for automatic alignment trimming. Bioinformatics. 2009; 25, 1972–1973. doi.org/10.1093/bioinformatics/btp348
# 
# Prodigal v2.6.3
# Hyatt, D. et al. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics. 2010; 28, 2223–2230. doi.org/10.1186/1471-2105-11-119
# 
# KOfamScan 1.3.0
# Aramaki, T et al. KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics. 2020. doi.org/10.1093/bioinformatics/btz859
# 
# Genome Taxonomy Database (GTDB) v207; Released April 08, 2022
# Parks DH et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat. Biotech. 2020. doi.org/10.1038/s41587-020-0501-8
# 
# FastTree 2 v2.1.11
# Price MN et al. FastTree 2 - approximately maximum-likelihood trees for large alignments. PLoS One. 2010; 5. doi.org/10.1371/journal.pone.0009490
# 
# GNU Parallel v20220522
# Tange O. GNU Parallel 2018. doi.org/10.5281/zenodo.1146014

And an example of how I would write this part of the methods:

Example methods with citations from above

The Alteromonas phylogenomic tree was produced with GToTree v1.6.36 (Lee 2019), using the prepackaged single-copy gene-set for bacteria (74 target genes). Briefly, prodigal v2.6.3 (Hyatt et al. 2010) was used to predict genes on input genomes provided as fasta files. Reference genomes were accessed from the Genome Taxonomy Database v207 (Parks et al. 2020) based on searching for Alteromonas species representatives. Target genes were identified with HMMER3 v3.3.2 (Eddy 2011), individually aligned with muscle v5.1 (Edgar 2021), trimmed with trimal v1.4.rev15 (Capella-Gutiérrez et al. 2009), and concatenated prior to phylogenetic estimation with FastTree2 v2.1.11 (Price et al. 2010). KOfamScan v1.3.0 (Aramaki et al. 2020) was used to search all included genomes for functions of interest. And GNU Parallel (Tange 2018) was used where possible to run stages in parallel.


Fin

These days, I'm mostly working with exploratory metagenomics datasets, and for those I use GToTree to make a quick overview figure of recovered MAGs across domains, e.g., like this one where the blue branches and marks on the outer ring indicate newly recovered MAGs:

And also for zooming in on clades with newly recovered MAGs to show the distribution of certain traits, like this one shows which members of the Order Rubrobacterales have a pathway for carbon-fixation:


If you're new to this arena, with a million moving parts and terms to keep track of while finding your footing, it can be hard to know when something like GToTree (or any program) might be useful. To try to help with that, and with some other nuances and caveats of GToTree, I have a "Things to consider" page here. Please check that out if interested, and definitely don't hesitate to reach out to me to see if it seems like this might be useful for your work, or if there is somewhere else I might be able to point you towards 🙂