GToTree subset target HMMs example

# GToTree subset target HMMs example > This page gives an overview of how one could subset a pre-packaged HMM single-copy geneset. For example, if you had specific genomes that you wanted to work with, and wanted to use a pre-packaged HMM file (like Cyanobacteria.hmm which holds 251 target genes at the time of putting this page together (12-Feb-2024)), but you wanted to drop that down to some subset of those. --- [toc] --- ## General process (example code below) 1. Run GToTree with everything (all wanted input genomes, and pointing to the pre-packaged HMM file that is suitable) 2. Use the output "SCG_hit_counts.tsv" table to figure out which target genes we want to use based on the hit-counts of each target gene per genome; put the wanted gene-names (from their names in the "SCG_hit_counts.tsv" file) into a plain text file, one per line 3. Make a suset SCG-targets hmm file with `hmmfetch` (which comes installed with the GToTree conda environment) 4. Run GToTree again, with all input genomes, but now pointing to the newly created, subset hmm file passed to the `-h` argument --- ## Example code > The example code will be based on running the `gtt-test.sh` program to start (so there is standard data to be used for this example), but would work the same after adjusting for your situation. If you have trouble, feel free to reach out to me for help: MikeLee\<at\>bmsis\<dot\>org > >The below assumes starting with GToTree already installed via conda as described [here](https://github.com/AstrobioMike/GToTree/wiki/installation#conda-quickstart). Activating conda environment to start: ```bash conda activate gtotree ``` ### 1. Run GToTree with everything ```bash gtt-test.sh # that downloads test data and runs it as follows using the # pre-packaged "Universal" SCG HMM set # GToTree -a GToTree-test-data/ncbi_accessions.txt \ # -g GToTree-test-data/genbank_files.txt \ # -f GToTree-test-data/fasta_files.txt \ # -A GToTree-test-data/amino_acid_files.txt \ # -m GToTree-test-data/genome_to_id_map.tsv \ # -p GToTree-test-data/pfam_targets.txt \ # -H Universal -t -D -j 4 -o GToTree-test-output -F ``` ### 2. Get wanted target-genes from "SCG_hit_counts.tsv" output file If we look in the outputs directory, we can see the "SCG_hit_counts.tsv" file: ```bash column -ts $'\t' GToTree-test-output/SCG_hit_counts.tsv # assembly_id Ribosomal_L14 Ribosomal_L15e Ribosomal_L16 Ribosomal_L18 Ribosomal_L22 ribosomal_L24 Ribosomal_L2 Ribosomal_L3 Ribosomal_L4 Ribosomal_L5 Ribosomal_L6 Ribosomal_S10 Ribosomal_S17 Ribosomal_S19 Ribosomal_S3_C Ribosomal_S8 # GCA_000299365.1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 # GCA_003818365.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 # GCF_000153765.1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 # GCA_000172635 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 # GCF_900162675.1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 # GCF_000013045.1 1 0 1 0 1 1 1 1 1 1 1 2 1 1 1 1 # GCF_000972705.1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 # GCF_000012505.1_ASM1250v1_genomic 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 # GCA_900473895.1_N32_genomic 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 # GCA_000009925.1_ASM992v1_genomic 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 # GCA_000012825.1_ASM1282v1_genomic 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 # GCF_000020585.3_ASM2058v3_protein 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 # GCF_001886455.1_ASM188645v1_protein 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 ``` The target genes are listed as the column names (after the first column), and the rows represent how many times each target gene was found in each input genome. We can use that to filter down the just the target genes we want to include. If it is a large table, you may want to filter it with python or R. But no matter how you get there, we ultimately just want a list of the target gene names we want to include in our next alignment/treeing process. We need to retain the exact names of the target genes that are the headers in that table. For the purposes of the example here, we are going to pretend we want to keep the following target genes: - Ribosomal_L16 - Ribosomal_L3 - Ribosomal_S3_C We need to have them in a plain-text file, with one per line. Here is just making one for the example on this page: ```bash printf "Ribosomal_L16\nRibosomal_L3\nRibosomal_S3_C\n" > wanted-target-genes.txt ``` And looking at that new file we just made, holding just the target genes we want (comment symbols added by me below, they shouldn't be in the file): ```bash cat wanted-target-genes.txt # Ribosomal_L16 # Ribosomal_L3 # Ribosomal_S3_C ``` ### 3. Make subset SCG-targets hmm file with `hmmfetch` Now we are going to make a subset target-genes HMM file that will only hold the genes we have listed in the file we just made. The code below depends on the starting HMM set that was used. In this example thus far, we have used the "Universal" set, so that is what is pointed to below. But if we were using something else, we'd need to change the code below accordingly. This should work in the active GToTree conda environment, as that has the `hmmfetch` program installed, and it should have a special variable that holds the location of the pre-packaged HMMs. We can look at that location like so: ```bash gtt-hmms # that will print out where the HMMs are or would be stored ls ${GToTree_HMM_dir} # that will print out the HMMs that are stored there already ``` The `ls` command above should hold the starting HMM set since it was utilized above (and therefore downloaded) with the initial run done in Step 1. For the example on this page, the starting HMM file is going to be the "Universal_Hug_et_al.hmm" file at that location. But if you used a different starting HMM set, like "Cyanobacteria", for the Step 1 for your data, then you'd want to point at the "Cyanobacteria.hmm" file at that location. Here is the code for the Universal HMM set that follows the example so far on this page: ```bash hmmfetch -f ${GToTree_HMM_dir}Universal_Hug_et_al.hmm wanted-target-genes.txt > Universal-subset.hmm ``` > **Where:** > - `hmmfetch` - the base command > - `-f` - here we are specifying the input full HMM file (you would change this if using something other than the pre-packaged "Universal" HMMs) > - `wanted-target-genes.txt` - then comes a positional argument that is our input file holding the gene names we want to pull out > - `> Universal-subset.hmm` - then we are "[redirecting](https://astrobiomike.github.io/unix/wild-redirectors#redirectors)" the output into this file we are naming right now (you'd want to change this too based on the starting pre-packaged HMM file used) Based on the structure of what these HMM files look like, we can count how many target genes there are with the following. Here is counting how many were in the starting "Universal" file here, and the how many are in the new subset (which should be the 3 we listed above): ```bash # counting how many target genes in the initial HMM grep -c "^NAME" ${GToTree_HMM_dir}Universal_Hug_et_al.hmm # 16 # counting how many target genes in our new, subset HMM file grep -c "^NAME" Universal-subset.hmm # 3 ``` ### 4. Run GToTree again, now specifying newly created SCG-targets hmm file Now we just need to re-run GToTree with all the arguments/options we normally want, except we want to point to our newly created HMM subset file instead of one of the pre-packaged ones. Here is an example using the test data downloaded above: ```bash GToTree -a GToTree-test-data/ncbi_accessions.txt \ -g GToTree-test-data/genbank_files.txt \ -f GToTree-test-data/fasta_files.txt \ -A GToTree-test-data/amino_acid_files.txt \ -H Universal-subset.hmm \ -o new-GToTree-output ``` And after that finishes, now if we glance at the "SCG_hit_counts.tsv" summary file, we can see it only used the target genes we subset down to: ```bash column -ts $'\t' new-GToTree-output/SCG_hit_counts.tsv # assembly_id Ribosomal_L16 Ribosomal_L3 Ribosomal_S3_C # GCA_000172635 1 1 1 # GCA_000299365.1 1 1 1 # GCA_003818365.1 0 0 0 # GCF_000153765.1 1 1 1 # GCF_900162675.1 1 1 1 # GCF_000013045.1 1 1 1 # GCF_000972705.1 1 1 1 # GCA_900473895.1_N32_genomic 1 1 1 # GCF_000012505.1_ASM1250v1_genomic 1 1 1 # GCA_000009925.1_ASM992v1_genomic 1 1 1 # GCA_000012825.1_ASM1282v1_genomic 1 1 1 # GCF_000020585.3_ASM2058v3_protein 1 1 1 # GCF_001886455.1_ASM188645v1_protein 1 1 1 ``` --- Again, if you're having trouble applying the above to your specific case, feel free to reach out to me for help: MikeLee<at>bmsis<dot>org 🙂