---
tags: Scott
---

# Working up genome

[toc]

---

## Tools/versions used and environment setup

A conda environment for everything below can be created and activated with:

```bash
conda create -n S-bug -c conda-forge -c bioconda -c defaults -c astrobiomike \
             gtdbtk=1.2.0 bit=1.8.11 gtotree=1.5.36 fastani=1.31

conda activate S-bug
```

---

## Assigning taxonomy with [gtdb-tk](https://github.com/Ecogenomics/GTDBTk#gtdb-tk)

Starting with the `ID19966_finalcontigs.fa` file in the current working directory.

```bash
gtdbtk classify_wf --genome_dir ./ -x fa --out_dir gtdbtk-tax-output/ --cpus 40
```

**Classification output**

```bash
column -ts $'\t' gtdbtk-tax-output/gtdbtk.bac120.summary.tsv
```

```bash
# user_genome  classification  fastani_reference  fastani_reference_radius  fastani_taxonomy  fastani_ani  fastani_af  closest_placement_reference  closest_placement_radius  closest_placement_taxonomy  closest_placement_ani  closest_placement_af  pplacer_taxonomy  classification_method  note  other_related_references(genome_id,species_name,radius,ANI,AF)  aa_percent  translation_table  red_value  warnings
# ID19966_finalcontigs  d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__;s__  N/A  N/A  N/A  N/A  N/A  GCF_002029235.1  95.0  d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__Clostridium_AE;s__Clostridium_AE oryzae  77.38  0.12  d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__;s__  taxonomic classification defined by topology and ANI  N/A  N/A  89.9  11  0.8497590752069591  N/A
```

GTDB-Tk assigns the genome only down to family (Clostridiaceae); genus and species are left blank. The closest placement reference, *Clostridium AE oryzae* (GCF_002029235.1), has an ANI of 77.38 with an alignment fraction of 0.12.

---

## Peeking at how many Clostridiaceae are in GTDB and NCBI

Seeing how many Clostridiaceae there are in GTDB:

```bash
gtt-get-accessions-from-GTDB -t Clostridiaceae --get-taxon-counts --GTDB-representatives-only
```

```bash
# Reading in the GTDB info table...
# Using GTDB v95: Released July 17, 2020
#
#
# The rank 'family' has 843 Clostridiaceae entries.
#
# In considering only GTDB representative genomes:
#
# The rank 'family' has 153 Clostridiaceae representative genome entries.
```

Seeing how many there are in NCBI as searched on 16-Sept-2020:

```bash
# "representative" in RefSeq (https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#representative_genomes)
esearch -query 'Clostridiaceae[ORGN] AND "latest refseq"[filter] AND "representative genome"[filter] AND (latest[filter] AND all[filter] NOT anomalous[filter])' -db assembly
# 149

# all in RefSeq
esearch -query 'Clostridiaceae[ORGN] AND "latest refseq"[filter] AND (latest[filter] AND all[filter] NOT anomalous[filter])' -db assembly
# 1538
```

| DB | Number of Clostridiaceae genomes |
|:---|:---:|
|GTDB representatives| 153 |
|GTDB all | 843 |
|NCBI RefSeq representatives |149|
|NCBI RefSeq all|1,538|

---

## Making Clostridiaceae phylogenomic trees

> **Notes**
> * As currently made, they are not "rooted", and therefore not about evolution-through-time in any sense. They are just about relatedness (things closer to each other are more closely related).
> * All are currently made using the 119 single-copy genes specific to the Firmicutes phylum that are packaged with GToTree.

### GTDB Clostridiaceae [representatives](https://gtdb.ecogenomic.org/faq#gtdb_species_clusters) only

```bash
gtt-get-accessions-from-GTDB -t Clostridiaceae --GTDB-representatives-only
```

```bash
# making file holding input fastas
ls ID19966_finalcontigs.fa > fasta-files.txt

# making file with custom label for Scott's bug on the tree
printf "ID19966_finalcontigs.fa\tScotts_bug\n" > my-labels.tsv

# GToTree command
GToTree -a GTDB-Clostridiaceae-family-GTDB-rep-accs.txt -f fasta-files.txt -m my-labels.tsv -H Firmicutes -D -L Genus,Species -j 20 -o GToTree-clostridiaceae-gtdb-representatives-out
```

GTDB's "representative" genomes are the ones chosen to stand for each of what GTDB has deemed distinct species clusters.

GTDB delineates different *Clostridium* genera just with letters after "*Clostridium*" (e.g. *Clostridium L* and *Clostridium X* are two different genera based on genomics that were previously both just called *Clostridium*).

By eye, it certainly doesn't appear that Scott's bug is any closer to *Clostridium AE oryzae* than others that are distinct species are to each other, e.g. here's a snapshot:

<a href="https://i.imgur.com/96ISGqq.png"><img src="https://i.imgur.com/96ISGqq.png"></a>

<br>

**Full tree can be explored [here](https://itol.embl.de/tree/7184244121131381600278079).**

Making the same one with NCBI taxonomy info added instead of GTDB taxonomy (`-t` flag instead of `-D`):

```bash
GToTree -a GTDB-Clostridiaceae-family-GTDB-rep-accs.txt -f fasta-files.txt -m my-labels.tsv -H Firmicutes -t -L Genus,Species -j 20 -o GToTree-clostridiaceae-gtdb-representatives-with-NCBI-tax-out
```

Here's a similar snapshot of that same region:

<a href="https://i.imgur.com/x4euJO5.png"><img src="https://i.imgur.com/x4euJO5.png"></a>

<br>

**And that full tree can be explored [here](https://itol.embl.de/tree/7184244121131381600278079).**

### GTDB all Clostridiaceae

```bash
gtt-get-accessions-from-GTDB -t Clostridiaceae
```

```bash
GToTree -a GTDB-Clostridiaceae-family-accs.txt -f fasta-files.txt -m my-labels.tsv -H Firmicutes -D -L Genus,Species -j 20 -o GToTree-clostridiaceae-gtdb-all-out
```

Great, nothing closer jumped in. Still closest to the *Clostridium AE oryzae* bugger.

**Full tree can be explored [here](https://itol.embl.de/tree/7184244121332991600299365).**

### NCBI RefSeq Clostridiaceae [representatives](https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#representative_genomes) only

```bash
esearch -query 'Clostridiaceae[ORGN] AND "latest refseq"[filter] AND "representative genome"[filter] AND (latest[filter] AND all[filter] NOT anomalous[filter])' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > NCBI-Clostridiaceae-refseq-representatives-accs.txt
```

```bash
GToTree -a NCBI-Clostridiaceae-refseq-representatives-accs.txt -f fasta-files.txt -m my-labels.tsv -H Firmicutes -D -L Genus,Species -j 20 -o GToTree-clostridiaceae-ncbi-refseq-representatives-out
```

Similar story by eye as the GTDB representative tree: nothing looks super-close.

**Full tree can be explored [here](https://itol.embl.de/tree/7184244121306601600281377).**

Making the same tree but with NCBI taxonomy info added instead of GTDB taxonomy (`-t` flag instead of `-D`):

```bash
GToTree -a NCBI-Clostridiaceae-refseq-representatives-accs.txt -f fasta-files.txt -m my-labels.tsv -H Firmicutes -t -L Genus,Species -j 20 -o GToTree-clostridiaceae-ncbi-refseq-representatives-with-NCBI-tax-out
```

**That full tree can be explored [here](https://itol.embl.de/tree/7184244121303711600298726).**

### NCBI RefSeq all Clostridiaceae

```bash
esearch -query 'Clostridiaceae[ORGN] AND "latest refseq"[filter] AND (latest[filter] AND all[filter] NOT anomalous[filter])' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > NCBI-Clostridiaceae-refseq-all-accs.txt
```

The majority of these won't have GTDB counterparts, so only making the tree with NCBI taxonomy:

```bash
GToTree -a NCBI-Clostridiaceae-refseq-all-accs.txt -f fasta-files.txt -m my-labels.tsv -H Firmicutes -t -L Genus,Species -j 20 -o GToTree-clostridiaceae-ncbi-refseq-all-out-with-NCBI-tax
```

Good, still nothing closer than *Clostridium oryzae*.

**Full tree can be explored [here](https://itol.embl.de/tree/7184244121342491600313289) (labels are off in the default view to speed up drawing and interactivity; they can be turned on in the menu at top right).**

---

## ANI work

### GTDB Clostridiaceae [representatives](https://gtdb.ecogenomic.org/faq#gtdb_species_clusters) only

**IN PROGRESS** on kuat in screen `scott`

Downloading reference genomes:

```bash
mkdir ani-work
cd ani-work

bit-dl-ncbi-assemblies -w ../GTDB-Clostridiaceae-family-GTDB-rep-accs.txt -f fasta -j 10

gunzip *.gz

ls *.fa > genome-list.txt
```

Running fastANI:

```bash
fastANI --ql genome-list.txt --rl genome-list.txt -o GTDB-Clostridiaceae-family-GTDB-rep-genomes-fastani -t 10
```

Getting genus and species info for each from the info table `bit` produced when getting the accessions from GTDB:

```bash
cut -f 1,7,8 ../GTDB-Clostridiaceae-family-GTDB-rep-metadata.tsv | cut -f 2- -d "_" > GTDB-Clostridiaceae-family-GTDB-rep-genus-species.tsv
```

Time for parsing in R when ANI is done.

### GTDB all Clostridiaceae

**COMING SOON**

### NCBI RefSeq Clostridiaceae [representatives](https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#representative_genomes) only

**COMING SOON**

### NCBI RefSeq all Clostridiaceae

**COMING SOON**

---

## Getting pairwise distances from trees

### GTDB Clostridiaceae [representatives](https://gtdb.ecogenomic.org/faq#gtdb_species_clusters) only

**COMING SOON**

### GTDB all Clostridiaceae

**COMING SOON**

### NCBI RefSeq Clostridiaceae [representatives](https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#representative_genomes) only

**COMING SOON**

### NCBI RefSeq all Clostridiaceae

**COMING SOON**

---

## Summarizing ANI and phylogenomic distances based on species designations within GTDB and NCBI

**COMING SOON**
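
Until the R parsing is in place, a quick first look at a finished fastANI table can be had in the shell. The sketch below is only that: `top_ani_hits` is a hypothetical helper name (it is not part of fastANI, `bit`, or GToTree), and it assumes the standard five-column, tab-delimited fastANI output (query, reference, ANI, mapped fragments, total query fragments). It also assumes the genome of interest was included in the fastANI run, so that its file name appears in column 1.

```bash
# Hypothetical helper: list the closest references to one query genome
# from a fastANI output table, best ANI first.
# Prints: reference, ANI, alignment fraction (mapped / total query fragments).
top_ani_hits() (
    fastani_out="$1"   # fastANI output file (tab-delimited, 5 columns)
    query="$2"         # query genome name exactly as it appears in column 1
    n="${3:-10}"       # number of hits to print (default 10)
    tab="$(printf '\t')"

    # keep rows for this query, drop the self-comparison, skip rows with no fragments
    awk -F '\t' -v q="$query" '
        $1 == q && $2 != q && $5 > 0 {
            printf "%s\t%s\t%.2f\n", $2, $3, $4 / $5
        }' "$fastani_out" |
        sort -t "$tab" -k2,2nr |
        head -n "$n"
)
```

For example, `top_ani_hits GTDB-Clostridiaceae-family-GTDB-rep-genomes-fastani ID19966_finalcontigs.fa 5` would print the five references with the highest ANI to Scott's bug, provided that file name was listed in the query list given to fastANI.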