--- title: Talk slides template tags: Templates, Talk description: View the slide with "Slide Mode". --- # STAMPS 2025 - Phylogenetics Tutorial --- ## Setting up GToTree (and bit) - using conda If you want to install both GToTree and bit (**Not necessary in the cluster computers**), we can proceed using conda (or mamba if you prefer). ```python #We are creating an environment for bit conda create -n bit -c astrobiomike -c conda-forge -c bioconda -c defaults bit #WHENEVER we need to use a bit line command conda activate bit #And we are creating an environment for GToTree conda create -y -n gtotree -c astrobiomike -c conda-forge -c bioconda gtotree #WHENEVER we are using a gtotree function conda activate gtotree ``` **Small note:** Installing can be tricky for people with a windows computer or a Mac with an M chip. For Windows, it is recommended to use the Windows Subsystem for Linux (WSL) and installing miniconda in the WSL terminal. For Mac with an M chip, installing the x86_64 version of miniconda as described [here](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda) is recommended, but I will not tell you what to do. For details on GToTree go to [here](https://github.com/AstrobioMike/GToTree/wiki). --- ## First steps on the virtual computer Open the terminal, either from the Rstudio instance or the JupyterLab. We will be using R later for visualization, so I recommend having both in different tabs for good measure. #### Setting a separate folder for our example (recommended) and downloading data Download the zip folder and unzip in your preferred folder. ```bash #To download curl -O https://raw.githubusercontent.com/mblstamps/stamps2025/main/day7/Phylogenomics/PhylogenomicsLab.zip #To unzip and enter the folder unzip PhylogenomicsLab.zip cd PhylogenomicsLab ``` --- ## Inputs needed for GToTree ### Metagenome-Assembled Genomes (MAGs) - FASTA files: a text file holding location of FASTA files in your computer. - Accessions codes: a text file listing all ### Single-Copy Genes? (SCGs) - Use one of the 15 [sets provided by GToTree](https://github.com/AstrobioMike/GToTree/wiki/SCG-sets). - Use your own profile Hidden Markov models (HMMs) file (a myriad available at [Pfam database](https://www.ebi.ac.uk/interpro/download/Pfam/)) In the folder we downloaded above, some MAGs Fasta files are contained in the folder data/MAGs, while the file with Accession codes is given in data/accession-codes.txt. We can create a file with the MAGs location paths to pass to the GToTree program by running. ```bash ls data/MAGsFastaFiles/*.gz > data/local-fasta-files.txt head data/local-fasta-files.txt ``` --- ## Genomes we will be exploring here In this short example, we will be focusing on Lactic Acid Bacteria (LAB), particularly representatives of the order Lactobacillales. These bacteria are widely known for their capacity of lactic acid production from sugars (fermentative metabolism), which is central to their ecological function. In microbiomes, Lactobacillus species are key contributors to host health. Here is how I constructed the small set of genomes we will be seeing (this data is already in your folder). **No need to run it**. For the following process, we need to load the bit environment. And we will start by pulling all NCBI acessions of representative genomes in the order Lactobacillales. This will produce a tsv table called GTDB-Lactobacillales-order-GTDB-rep-metadata, containing the accession codes and all available information on the samples. ```bash #Entering proper bit environment to run this command lines conda activate bit #Pulling accession codes bit-get-accessions-from-GTDB -t Lactobacillales -r order --GTDB-representatives-only #Mantaining only the information I care about #(Taxonomy and sample source) awk -F'\t' ' NR==1 { for (i = 1; i <= NF; i++) { colname[i] = $i if ($i == "ncbi_isolation_source") sources_col = i } } { out = "" for (i = 1; i <= 8; i++) { out = out (i == 1 ? "" : OFS) $i } out = out OFS $(sources_col) print out } ' OFS='\t' GTDB-Lactobacillales-order-GTDB-rep-metadata.tsv > filtered-GTDB-table.tsv ``` From here, you can select the samples you care about. While you can use any preferred method to select which genomes to include, I manually filtered (using excel) some representatives from the geni: Lactobacillus, #Streptococcus, Lactococcus and Aerococcus. Check our data by running (or just open the file data/GTDB-Lactobacillales-order-metadata.tsv directly) ```bash #Seeing values of the genomes in our sample head GTDB-Lactobacillales-order-metadata.tsv ``` Assuming we collect the relevant Accession codes in a text file my-accession-codes.text, we can use that directly with GToTree, or we can download the MAGs files beforehand with the following code (**no need to run!!**). ```bash bit-dl-ncbi-assemblies -w my-accession-codes.text -f fasta ``` ## Running the code for the "Species" tree Let's jump head first into the code and then I will explain! ```bash #Do not forget to enter the gtotree environment conda activate gtotree #And we run the estimation workflow with GToTree -f data/local-fasta-files.txt -a data/accession-codes.txt \ -H Bacteria -T IQ-TREE \ -m data/genome-to-label.txt \ -D -L Species -j 8 \ -o species-tree-output ``` #### Understanding the arguments in the command \-f : File containing a list of all pathways to our fasta files, each of them a representative genome. \-a : File containing a list of reference accessions codes, particularly for genomes in the [NCBI database](https://www.ncbi.nlm.nih.gov/datasets/) \-g : (**Not featured above**) File containing a list of file names or paths to GenBank files. \-H : Specifying which single-copy gene-set to use for the tree estimation. \-T : Specifying the estimation program to be used. We are using IQ-TREE! \-m : A two column file (tab delimited table) containing some input genomes with desired labels. Make sure all genomes included are part of the input genomes. **Names depend on how the genome as provided, either the NCBI accession or the file name**. ```bash cat data/genome-to-label.txt ``` \-D : Add if you want GToTree to add GTDB taxonomy information to the labels that appear in the final alignment and in the tree. Which specific taxonomic ranks get added can be specified with the -L argument. \-L : Specifying which ranks you would like added to the labels, as a comma-separated list. (**default: Domain,Phylum,Class,Species,Strain**) \-j : Specifying how many jobs to run in parallel during steps that are parallelizable. (**default: 1**) \-o : Giving name for the output folder where all results will be stored by GToTree. (**default: GToTree_output**) ## Tree Visualization We are looking at our trees employing the R packages - ape - ggtree Although we are not going to explore the myriad of tools this packages offer, they allow for reading and manipulating trees, as well as flexible tools for visualization similar as other plotting devides in ggplot. Jump into the VisualizeTree.R file in your virtual computer to visualize this first tree (it should be inside the main folder) ## Let's explore a single gene tree instead Now, let's imagine we are interested in understanding how a particular gene might be affected by environmental factors. And to enrich our exploration, we might wonder what evolutionary history this gene has among Lactic Acid Bacteria. Here, we decided to explore this evolutionary history for Ldh_1_N, which encodes part of a domain crucial inn the conversion of pyruvate to lactate, an important metabolic step in lactic acid bacteria. It plays an important role at maintaining balance during anaerobic fermentation and as an accessory gene, Ldh_1_N can be subject to horizontal gene transfer (HGT), especially in complex microbial ecosystems like the gut or fermented food environments where gene exchange is frequent. We downloaded the appropriate HMM file from the Pfam database and is now in data/HMMs/PF00056.hmm ```bash GToTree -f data/local-fasta-files.txt -a data/accession-codes.txt -T IQ-TREE \ -H data/HMMs/PF00056.hmm -m data/genome-to-label.txt \ -t -D -L Species -j 8 -B \ -o PF00056-tree-output ``` **NOTE:** There is an important addition to the above code! So let's jump into R again and visualize this tree. ## What about the individual gene evolutionary histories for the SCG we used? Can we actually say that they share the same evolutionary history? Let's explore a couple of them individually. We downloaded the Bacteria gene set used by GToTree in the file data/HMMs/Bacteria.hmm. We can take a quick peek on what these files look like: ```bash head data/HMMs/Bacteria.hmm -n 25 ``` Inside the document, each gene is described by a sequence of lines, among them the "NAME", a description given after the keyword "DESC" and the accession code "ACC". To gain information on all genes contained, we can summarize this information first ```bash awk ' /^NAME/ { name[n++] = $2 } /^ACC/ { acc[a++] = $2 } /^DESC/ { desc[d++] = $2 } END { for (i = 0; i < n || i < a || i < d; i++) { print name[i] "\t" acc[i] "\t" desc[i] } } ' data/HMMs/Bacteria.hmm > core-genes-summary.txt awk ' NR == 1 { for (i = 1; i <= NF; i++) { header[i] = $i sum[i] = 0 } next } { for (i = 1; i <= NF; i++) { sum[i] += $i } } END { for (i = 1; i <= NF; i++) { if (sum[i] == 10) print header[i] } } ' species-tree-output/SCG_hit_counts.tsv > good_columns.txt awk 'NR==FNR { good[$1]; next } FNR==1 || ($1 in good)' good_columns.txt core-genes-summary.txt > filtered-core-genes-summary.txt cat filtered-core-genes-summary.txt ``` We can pick one or a few or as many of these genes we want, and run GToTree again. On my part, I will grab the first Ribosomal gene I see! Ribosomal_L1 with accession PF00687.21. We can create our HMM file for this using the command `hmmfetch` from GToTree. ```bash echo 'PF00687.21' > temporal-hmm-acc.txt hmmfetch -f data/HMMs/Bacteria.hmm temporal-hmm-acc.txt > PF00687.hmm rm temporal-hmm-acc.txt ``` And we can run GToTree again ```bash GToTree -f data/local-fasta-files.txt -a data/accession-codes.txt -T IQ-TREE \ -H PF00687.hmm -m data/genome-to-label.txt \ -t -D -L Species -j 8 \ -o PF00687-tree-output; ``` **Question:** do you notice what is missing in the code above in comparison to the previous gene tree? ### Activity: Select another gene from the filtered core genes and estimate its evolutionary history. - Does it have the same topology as the Species tree? Or with the topology of the Ribosomal_L1 gene tree? Raise a green sticky note! (if you still have one) - Is it different? Raise a pink sticky note! (if you still have one)