Try   HackMD

GToTree subset target HMMs example

This page gives an overview of how one could subset a pre-packaged HMM single-copy geneset. For example, if you had specific genomes that you wanted to work with, and wanted to use a pre-packaged HMM file (like Cyanobacteria.hmm which holds 251 target genes at the time of putting this page together (12-Feb-2024)), but you wanted to drop that down to some subset of those.



General process (example code below)

  1. Run GToTree with everything (all wanted input genomes, and pointing to the pre-packaged HMM file that is suitable)
  2. Use the output "SCG_hit_counts.tsv" table to figure out which target genes we want to use based on the hit-counts of each target gene per genome; put the wanted gene-names (from their names in the "SCG_hit_counts.tsv" file) into a plain text file, one per line
  3. Make a suset SCG-targets hmm file with hmmfetch (which comes installed with the GToTree conda environment)
  4. Run GToTree again, with all input genomes, but now pointing to the newly created, subset hmm file passed to the -h argument

Example code

The example code will be based on running the gtt-test.sh program to start (so there is standard data to be used for this example), but would work the same after adjusting for your situation. If you have trouble, feel free to reach out to me for help: MikeLee<at>bmsis<dot>org

The below assumes starting with GToTree already installed via conda as described here.

Activating conda environment to start:

conda activate gtotree

1. Run GToTree with everything

gtt-test.sh

# that downloads test data and runs it as follows using the 
# pre-packaged "Universal" SCG HMM set

# GToTree -a GToTree-test-data/ncbi_accessions.txt \
#         -g GToTree-test-data/genbank_files.txt \
#         -f GToTree-test-data/fasta_files.txt \
#         -A GToTree-test-data/amino_acid_files.txt \
#         -m GToTree-test-data/genome_to_id_map.tsv \
#         -p GToTree-test-data/pfam_targets.txt \
#         -H Universal -t -D -j 4 -o GToTree-test-output -F

2. Get wanted target-genes from "SCG_hit_counts.tsv" output file

If we look in the outputs directory, we can see the "SCG_hit_counts.tsv" file:

column -ts $'\t' GToTree-test-output/SCG_hit_counts.tsv
# assembly_id                          Ribosomal_L14  Ribosomal_L15e  Ribosomal_L16  Ribosomal_L18  Ribosomal_L22  ribosomal_L24  Ribosomal_L2  Ribosomal_L3  Ribosomal_L4  Ribosomal_L5  Ribosomal_L6  Ribosomal_S10  Ribosomal_S17  Ribosomal_S19  Ribosomal_S3_C  Ribosomal_S8
# GCA_000299365.1                      1              1               1              1              1              0              1             1             1             1             1             1              1              1              1               1
# GCA_003818365.1                      0              0               0              0              0              0              0             0             0             0             0             0              0              0              0               0
# GCF_000153765.1                      1              0               1              0              1              1              1             1             1             1             1             1              1              1              1               1
# GCA_000172635                        1              0               1              0              1              1              1             1             1             1             1             1              1              1              1               1
# GCF_900162675.1                      1              0               1              0              1              1              1             1             1             1             1             1              1              1              1               1
# GCF_000013045.1                      1              0               1              0              1              1              1             1             1             1             1             2              1              1              1               1
# GCF_000972705.1                      1              0               1              0              1              1              1             1             1             1             1             1              1              1              1               1
# GCF_000012505.1_ASM1250v1_genomic    1              0               1              0              1              1              1             1             1             1             1             1              1              1              1               1
# GCA_900473895.1_N32_genomic          1              0               1              0              1              1              1             1             1             1             1             1              1              1              1               1
# GCA_000009925.1_ASM992v1_genomic     1              0               1              0              1              1              1             1             1             1             1             1              1              1              1               1
# GCA_000012825.1_ASM1282v1_genomic    1              0               1              0              1              1              1             1             1             1             1             1              1              1              1               1
# GCF_000020585.3_ASM2058v3_protein    1              0               1              0              1              1              1             1             1             1             1             1              1              1              1               1
# GCF_001886455.1_ASM188645v1_protein  1              0               1              0              1              1              1             1             1             1             1             1              1              1              1               1

The target genes are listed as the column names (after the first column), and the rows represent how many times each target gene was found in each input genome.

We can use that to filter down the just the target genes we want to include. If it is a large table, you may want to filter it with python or R. But no matter how you get there, we ultimately just want a list of the target gene names we want to include in our next alignment/treeing process.

We need to retain the exact names of the target genes that are the headers in that table. For the purposes of the example here, we are going to pretend we want to keep the following target genes:

  • Ribosomal_L16
  • Ribosomal_L3
  • Ribosomal_S3_C

We need to have them in a plain-text file, with one per line. Here is just making one for the example on this page:

printf "Ribosomal_L16\nRibosomal_L3\nRibosomal_S3_C\n" > wanted-target-genes.txt

And looking at that new file we just made, holding just the target genes we want (comment symbols added by me below, they shouldn't be in the file):

cat wanted-target-genes.txt
# Ribosomal_L16
# Ribosomal_L3
# Ribosomal_S3_C

3. Make subset SCG-targets hmm file with hmmfetch

Now we are going to make a subset target-genes HMM file that will only hold the genes we have listed in the file we just made. The code below depends on the starting HMM set that was used. In this example thus far, we have used the "Universal" set, so that is what is pointed to below. But if we were using something else, we'd need to change the code below accordingly.

This should work in the active GToTree conda environment, as that has the hmmfetch program installed, and it should have a special variable that holds the location of the pre-packaged HMMs.

We can look at that location like so:

gtt-hmms
# that will print out where the HMMs are or would be stored

ls ${GToTree_HMM_dir}
# that will print out the HMMs that are stored there already

The ls command above should hold the starting HMM set since it was utilized above (and therefore downloaded) with the initial run done in Step 1. For the example on this page, the starting HMM file is going to be the "Universal_Hug_et_al.hmm" file at that location. But if you used a different starting HMM set, like "Cyanobacteria", for the Step 1 for your data, then you'd want to point at the "Cyanobacteria.hmm" file at that location. Here is the code for the Universal HMM set that follows the example so far on this page:

hmmfetch -f ${GToTree_HMM_dir}Universal_Hug_et_al.hmm wanted-target-genes.txt > Universal-subset.hmm

Where:

  • hmmfetch - the base command
    • -f - here we are specifying the input full HMM file (you would change this if using something other than the pre-packaged "Universal" HMMs)
    • wanted-target-genes.txt - then comes a positional argument that is our input file holding the gene names we want to pull out
    • > Universal-subset.hmm - then we are "redirecting" the output into this file we are naming right now (you'd want to change this too based on the starting pre-packaged HMM file used)

Based on the structure of what these HMM files look like, we can count how many target genes there are with the following. Here is counting how many were in the starting "Universal" file here, and the how many are in the new subset (which should be the 3 we listed above):

# counting how many target genes in the initial HMM
grep -c "^NAME" ${GToTree_HMM_dir}Universal_Hug_et_al.hmm
# 16

# counting how many target genes in our new, subset HMM file
grep -c "^NAME" Universal-subset.hmm
# 3

4. Run GToTree again, now specifying newly created SCG-targets hmm file

Now we just need to re-run GToTree with all the arguments/options we normally want, except we want to point to our newly created HMM subset file instead of one of the pre-packaged ones. Here is an example using the test data downloaded above:

GToTree -a GToTree-test-data/ncbi_accessions.txt \
        -g GToTree-test-data/genbank_files.txt \
        -f GToTree-test-data/fasta_files.txt \
        -A GToTree-test-data/amino_acid_files.txt \
        -H Universal-subset.hmm \
        -o new-GToTree-output

And after that finishes, now if we glance at the "SCG_hit_counts.tsv" summary file, we can see it only used the target genes we subset down to:

column -ts $'\t' new-GToTree-output/SCG_hit_counts.tsv
# assembly_id                          Ribosomal_L16  Ribosomal_L3  Ribosomal_S3_C
# GCA_000172635                        1              1             1
# GCA_000299365.1                      1              1             1
# GCA_003818365.1                      0              0             0
# GCF_000153765.1                      1              1             1
# GCF_900162675.1                      1              1             1
# GCF_000013045.1                      1              1             1
# GCF_000972705.1                      1              1             1
# GCA_900473895.1_N32_genomic          1              1             1
# GCF_000012505.1_ASM1250v1_genomic    1              1             1
# GCA_000009925.1_ASM992v1_genomic     1              1             1
# GCA_000012825.1_ASM1282v1_genomic    1              1             1
# GCF_000020585.3_ASM2058v3_protein    1              1             1
# GCF_001886455.1_ASM188645v1_protein  1              1             1

Again, if you're having trouble applying the above to your specific case, feel free to reach out to me for help: MikeLee<at>bmsis<dot>org 🙂