This page gives an overview of how one could subset a pre-packaged HMM single-copy geneset. For example, if you had specific genomes that you wanted to work with, and wanted to use a pre-packaged HMM file (like Cyanobacteria.hmm which holds 251 target genes at the time of putting this page together (12-Feb-2024)), but you wanted to drop that down to some subset of those.
hmmfetch
(which comes installed with the GToTree conda environment)-h
argumentThe example code will be based on running the
gtt-test.sh
program to start (so there is standard data to be used for this example), but would work the same after adjusting for your situation. If you have trouble, feel free to reach out to me for help: MikeLee<at>bmsis<dot>orgThe below assumes starting with GToTree already installed via conda as described here.
Activating conda environment to start:
If we look in the outputs directory, we can see the "SCG_hit_counts.tsv" file:
The target genes are listed as the column names (after the first column), and the rows represent how many times each target gene was found in each input genome.
We can use that to filter down the just the target genes we want to include. If it is a large table, you may want to filter it with python or R. But no matter how you get there, we ultimately just want a list of the target gene names we want to include in our next alignment/treeing process.
We need to retain the exact names of the target genes that are the headers in that table. For the purposes of the example here, we are going to pretend we want to keep the following target genes:
We need to have them in a plain-text file, with one per line. Here is just making one for the example on this page:
And looking at that new file we just made, holding just the target genes we want (comment symbols added by me below, they shouldn't be in the file):
hmmfetch
Now we are going to make a subset target-genes HMM file that will only hold the genes we have listed in the file we just made. The code below depends on the starting HMM set that was used. In this example thus far, we have used the "Universal" set, so that is what is pointed to below. But if we were using something else, we'd need to change the code below accordingly.
This should work in the active GToTree conda environment, as that has the hmmfetch
program installed, and it should have a special variable that holds the location of the pre-packaged HMMs.
We can look at that location like so:
The ls
command above should hold the starting HMM set since it was utilized above (and therefore downloaded) with the initial run done in Step 1. For the example on this page, the starting HMM file is going to be the "Universal_Hug_et_al.hmm" file at that location. But if you used a different starting HMM set, like "Cyanobacteria", for the Step 1 for your data, then you'd want to point at the "Cyanobacteria.hmm" file at that location. Here is the code for the Universal HMM set that follows the example so far on this page:
Where:
hmmfetch
- the base command
-f
- here we are specifying the input full HMM file (you would change this if using something other than the pre-packaged "Universal" HMMs)wanted-target-genes.txt
- then comes a positional argument that is our input file holding the gene names we want to pull out> Universal-subset.hmm
- then we are "redirecting" the output into this file we are naming right now (you'd want to change this too based on the starting pre-packaged HMM file used)
Based on the structure of what these HMM files look like, we can count how many target genes there are with the following. Here is counting how many were in the starting "Universal" file here, and the how many are in the new subset (which should be the 3 we listed above):
Now we just need to re-run GToTree with all the arguments/options we normally want, except we want to point to our newly created HMM subset file instead of one of the pre-packaged ones. Here is an example using the test data downloaded above:
And after that finishes, now if we glance at the "SCG_hit_counts.tsv" summary file, we can see it only used the target genes we subset down to:
Again, if you're having trouble applying the above to your specific case, feel free to reach out to me for help: MikeLee<at>bmsis<dot>org 🙂