---
tags: Scott
---
# Working up genome
[toc]
---
## Tools/versions used and environment setup
A conda environment with the tools used below can be created and activated with:
```bash
conda create -n S-bug -c conda-forge -c bioconda -c defaults -c astrobiomike \
gtdbtk=1.2.0 bit=1.8.11 gtotree=1.5.36 fastani=1.31
conda activate S-bug
```
---
## Assigning taxonomy with [gtdb-tk](https://github.com/Ecogenomics/GTDBTk#gtdb-tk)
Starting with the `ID19966_finalcontigs.fa` file in the current working directory.
```bash
gtdbtk classify_wf --genome_dir ./ -x fa --out_dir gtdbtk-tax-output/ --cpus 40
```
**Classification output**
```bash
column -ts $'\t' gtdbtk-tax-output/gtdbtk.bac120.summary.tsv
```
```bash
# user_genome classification fastani_reference fastani_reference_radius fastani_taxonomy fastani_ani fastani_af closest_placement_reference closest_placement_radius closest_placement_taxonomy closest_placement_ani closest_placement_af pplacer_taxonomy classification_method note other_related_references(genome_id,species_name,radius,ANI,AF) aa_percent translation_table red_value warnings
# ID19966_finalcontigs d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__;s__ N/A N/A N/A N/A N/A GCF_002029235.1 95.0 d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__Clostridium_AE;s__Clostridium_AE oryzae 77.38 0.12 d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__;s__ taxonomic classification defined by topology and ANI N/A N/A 89.9 11 0.8497590752069591 N/A
```
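GTDB-Tk only classifies it down to family (Clostridiaceae), with no genus or species assigned. The closest placement reference is GCF_002029235.1 (*Clostridium AE oryzae*), at an ANI of 77.38 with an alignment fraction of 0.12.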
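The summary table is wide, so it can help to pull out just a few columns by header name. A minimal sketch, assuming the column names shown in the output above; `pick_cols` is a hypothetical helper name, not part of GTDB-Tk:

```bash
# Hypothetical helper: print selected columns of a tab-separated table,
# looked up by header name (first line of the file).
# Usage: pick_cols <file.tsv> <col1,col2,...>
pick_cols() {
    awk -F '\t' -v want="$2" '
        BEGIN { OFS = "\t"; n = split(want, names, ",") }
        NR == 1 {
            # map each header name to its column index
            for (i = 1; i <= NF; i++) idx[$i] = i
            # stop early if a requested column is not in the header
            for (j = 1; j <= n; j++) if (!(names[j] in idx)) {
                print "column not found: " names[j] > "/dev/stderr"
                exit 1
            }
        }
        {
            line = ""
            for (j = 1; j <= n; j++)
                line = line (j > 1 ? OFS : "") $(idx[names[j]])
            print line
        }
    ' "$1"
}
```

For example, `pick_cols gtdbtk-tax-output/gtdbtk.bac120.summary.tsv user_genome,closest_placement_reference,closest_placement_ani,closest_placement_af | column -ts $'\t'` should show just the closest-placement info.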
---
## Peeking at how many Clostridiaceae are in GTDB and NCBI
Seeing how many Clostridiaceae there are in GTDB:
```bash
gtt-get-accessions-from-GTDB -t Clostridiaceae --get-taxon-counts --GTDB-representatives-only
```
```bash
# Reading in the GTDB info table...
# Using GTDB v95: Released July 17, 2020
#
#
# The rank 'family' has 843 Clostridiaceae entries.
#
# In considering only GTDB representative genomes:
#
# The rank 'family' has 153 Clostridiaceae representative genome entries.
```
Seeing how many there are in NCBI as searched on 16-Sept-2020:
```bash
# "representative" in RefSeq (https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#representative_genomes)
esearch -query 'Clostridiaceae[ORGN] AND "latest refseq"[filter] AND "representative genome"[filter] AND (latest[filter] AND all[filter] NOT anomalous[filter])' -db assembly
# 149
# all in RefSeq
esearch -query 'Clostridiaceae[ORGN] AND "latest refseq"[filter] AND (latest[filter] AND all[filter] NOT anomalous[filter])' -db assembly
# 1538
```
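The counts noted in the comments above come from the XML summary that `esearch` prints. A small sketch for pulling out just the number, assuming that output holds a single `<Count>` element; `entrez_count` is a hypothetical helper name, and EDirect's own `xtract -pattern ENTREZ_DIRECT -element Count` should give the same thing:

```bash
# Hypothetical helper: read the XML that esearch writes to stdout
# and print only the number inside the first <Count> element.
entrez_count() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p' | head -n 1
}
```

It would be used by piping either of the `esearch` commands above into `entrez_count`.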
| DB | Number of Clostridiaceae genomes |
|:---|:---:|
|GTDB representatives| 153 |
|GTDB all | 843 |
|NCBI RefSeq representatives |149|
|NCBI RefSeq all|1,538|
---
## Making Clostridiaceae phylogenomic trees
> **Notes**
> * As currently made, they are not "rooted", and therefore not about evolution-through-time in any sense. They are just about relatedness (things closer to each other are more closely related).
> * All are made currently using 119 single-copy genes specific to the Firmicutes phylum packaged with GToTree.
### GTDB Clostridiaceae [representatives](https://gtdb.ecogenomic.org/faq#gtdb_species_clusters) only
```bash
gtt-get-accessions-from-GTDB -t Clostridiaceae --GTDB-representatives-only
```
```bash
# making file holding input fastas
ls ID19966_finalcontigs.fa > fasta-files.txt
# making file with custom label for Scott's bug on the tree
printf "ID19966_finalcontigs.fa\tScotts_bug\n" > my-labels.tsv
# GToTree command
GToTree -a GTDB-Clostridiaceae-family-GTDB-rep-accs.txt -f fasta-files.txt -m my-labels.tsv -H Firmicutes -D -L Genus,Species -j 20 -o GToTree-clostridiaceae-gtdb-representatives-out
```
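Since the GToTree runs take a while, a quick check of the two input files before starting can save a restart. A sketch under the assumption that every key in the first column of `my-labels.tsv` is a fasta file name (as it is here); `check_inputs` is a hypothetical helper, not something shipped with GToTree:

```bash
# Hypothetical pre-flight check for the files passed to GToTree's -f and -m flags.
# Usage: check_inputs fasta-files.txt my-labels.tsv
check_inputs() {
    fasta_list=$1
    labels=$2
    status=0
    tab=$(printf '\t')

    # every fasta listed should exist and be non-empty
    while IFS= read -r f; do
        [ -s "$f" ] || { echo "missing or empty: $f" >&2; status=1; }
    done < "$fasta_list"

    # every key in column 1 of the labels file should be in the fasta list
    while IFS=$tab read -r key _; do
        grep -Fxq -- "$key" "$fasta_list" ||
            { echo "label key not in fasta list: $key" >&2; status=1; }
    done < "$labels"

    return "$status"
}
```

It returns non-zero and reports on stderr if anything is off. Labels keyed on reference accessions rather than fasta file names would be flagged by this version.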
These "representative" genomes are the ones GTDB has chosen to represent what it has deemed distinct species clusters. GTDB delineates different *Clostridium* genera with letters added after "*Clostridium*" (e.g. *Clostridium L* and *Clostridium X* are two different genera based on genomics that were previously both just called *Clostridium*).
By eye, it certainly doesn't appear that Scott's bug is any closer to *Clostridium AE oryzae* than others that are distinct species, e.g. here's a snapshot:
<a href="https://i.imgur.com/96ISGqq.png"><img src="https://i.imgur.com/96ISGqq.png"></a>
<br>
**Full tree can be explored [here](https://itol.embl.de/tree/7184244121131381600278079).**
Making the same one with NCBI taxonomy info added instead of GTDB taxonomy (`-t` flag instead of `-D`):
```bash
GToTree -a GTDB-Clostridiaceae-family-GTDB-rep-accs.txt -f fasta-files.txt -m my-labels.tsv -H Firmicutes -t -L Genus,Species -j 20 -o GToTree-clostridiaceae-gtdb-representatives-with-NCBI-tax-out
```
Here's a similar snapshot of that same region:
<a href="https://i.imgur.com/x4euJO5.png"><img src="https://i.imgur.com/x4euJO5.png"></a>
<br>
**And that full tree can be explored [here](https://itol.embl.de/tree/7184244121131381600278079).**
### GTDB all Clostridiaceae
```bash
gtt-get-accessions-from-GTDB -t Clostridiaceae
```
```bash
GToTree -a GTDB-Clostridiaceae-family-accs.txt -f fasta-files.txt -m my-labels.tsv -H Firmicutes -D -L Genus,Species -j 20 -o GToTree-clostridiaceae-gtdb-all-out
```
Great, nothing closer jumped in. Still closest to the *Clostridium AE oryzae* bugger.
**Full tree can be explored [here](https://itol.embl.de/tree/7184244121332991600299365).**
### NCBI RefSeq Clostridiaceae [representatives](https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#representative_genomes) only
```bash
esearch -query 'Clostridiaceae[ORGN] AND "latest refseq"[filter] AND "representative genome"[filter] AND (latest[filter] AND all[filter] NOT anomalous[filter])' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > NCBI-Clostridiaceae-refseq-representatives-accs.txt
```
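Because `xtract` was run with `-def "NA"`, any record lacking an `AssemblyAccession` ends up as an `NA` line, so the list is worth a look before handing it to GToTree. A minimal sketch; `summarize_accs` is a hypothetical helper name:

```bash
# Hypothetical helper: report the total number of entries in an accession list,
# how many are "NA" placeholders, and how many accessions occur more than once.
summarize_accs() {
    awk '
        NF == 0 { next }              # skip blank lines
        { total++ }
        $1 == "NA" { na++ }
        seen[$1]++ == 1 { dup++ }     # counted once, on the second occurrence
        END { printf "total=%d NA=%d duplicated=%d\n", total, na, dup }
    ' "$1"
}
```

Running `summarize_accs NCBI-Clostridiaceae-refseq-representatives-accs.txt` should report a total matching the `esearch` count from the same day (149 when searched on 16-Sept-2020).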
```bash
GToTree -a NCBI-Clostridiaceae-refseq-representatives-accs.txt -f fasta-files.txt -m my-labels.tsv -H Firmicutes -D -L Genus,Species -j 20 -o GToTree-clostridiaceae-ncbi-refseq-representatives-out
```
Similar story by eye as the GTDB representative tree, nothing looks super-close.
**Full tree can be explored [here](https://itol.embl.de/tree/7184244121306601600281377).**
Making the same tree but with NCBI taxonomy info added instead of GTDB taxonomy (`-t` flag instead of `-D`):
```bash
GToTree -a NCBI-Clostridiaceae-refseq-representatives-accs.txt -f fasta-files.txt -m my-labels.tsv -H Firmicutes -t -L Genus,Species -j 20 -o GToTree-clostridiaceae-ncbi-refseq-representatives-with-NCBI-tax-out
```
**That full tree can be explored [here](https://itol.embl.de/tree/7184244121303711600298726).**
### NCBI RefSeq all Clostridiaceae
```bash
esearch -query 'Clostridiaceae[ORGN] AND "latest refseq"[filter] AND (latest[filter] AND all[filter] NOT anomalous[filter])' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > NCBI-Clostridiaceae-refseq-all-accs.txt
```
The majority of these won't have GTDB counterparts, so only making the tree with NCBI taxonomy:
```bash
GToTree -a NCBI-Clostridiaceae-refseq-all-accs.txt -f fasta-files.txt -m my-labels.tsv -H Firmicutes -t -L Genus,Species -j 20 -o GToTree-clostridiaceae-ncbi-refseq-all-out-with-NCBI-tax
```
Good, still nothing closer than *Clostridium oryzae*.
**Full tree can be explored [here](https://itol.embl.de/tree/7184244121342491600313289) (labels are off on default view to speed up drawing and interactivity, can be clicked on in menu at top right).**
---
## ANI work
### GTDB Clostridiaceae [representatives](https://gtdb.ecogenomic.org/faq#gtdb_species_clusters) only
**IN PROGRESS** on kuat in screen `scott`
Downloading reference genomes:
```bash
mkdir ani-work
cd ani-work
bit-dl-ncbi-assemblies -w ../GTDB-Clostridiaceae-family-GTDB-rep-accs.txt -f fasta -j 10
gunzip *.gz
ls *.fa > genome-list.txt
```
Running fastANI:
```bash
fastANI --ql genome-list.txt --rl genome-list.txt -o GTDB-Clostridiaceae-family-GTDB-rep-genomes-fastani -t 10
```
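For a first look once the run finishes, the most similar non-self pairs can be pulled straight from the output. A sketch assuming fastANI's usual five columns (query, reference, ANI, matched fragments, total query fragments) and file names without spaces; `top_ani` is a hypothetical helper name:

```bash
# Hypothetical helper: list the highest-ANI pairs from a fastANI output file,
# leaving out each genome's comparison against itself.
# Usage: top_ani <fastani-output> [number-of-rows, default 10]
top_ani() {
    awk '$1 != $2' "$1" | sort -k3,3nr | head -n "${2:-10}"
}
```

For example, `top_ani GTDB-Clostridiaceae-family-GTDB-rep-genomes-fastani 20`. Note that fastANI does not report pairs whose ANI falls well below about 80%, so a missing pair means "too distant to estimate", not a failed comparison.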
Getting genus and species info for each from the metadata table that `gtt-get-accessions-from-GTDB` produced when getting the accessions from GTDB:
```bash
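# first cut keeps columns 1, 7, and 8 of the metadata table; second cut drops everything up to the first underscore (the prefix in front of the accession)
cut -f 1,7,8 ../GTDB-Clostridiaceae-family-GTDB-rep-metadata.tsv | cut -f 2- -d "_" > GTDB-Clostridiaceae-family-GTDB-rep-genus-species.tsv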
```
Time for parsing in R when ANI is done.
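Before moving to R, a quick shell version of the join can serve as a sanity check. This is only a sketch, and it assumes things not confirmed above: that the genomes in `genome-list.txt` are named `<accession>.fa`, and that the genus-species table has accession, genus, and species as its three tab-separated columns. `annotate_ani` is a hypothetical helper name:

```bash
# Hypothetical helper: add genus and species to each side of a fastANI comparison.
# Usage: annotate_ani <genus-species.tsv> <fastani-output>
# Prints: query, query genus, query species, ref, ref genus, ref species, ANI
annotate_ani() {
    awk '
        # strip any leading path and the trailing .fa to get back to the accession
        function acc(path) { sub(/.*\//, "", path); sub(/\.fa$/, "", path); return path }
        NR == FNR { tax[$1] = $2 OFS $3; next }    # first file: taxonomy lookup
        {
            q = acc($1); r = acc($2)
            print q, ((q in tax) ? tax[q] : "NA" OFS "NA"), \
                  r, ((r in tax) ? tax[r] : "NA" OFS "NA"), $3
        }
    ' OFS='\t' FS='\t' "$1" FS=' ' "$2"
}
```

The taxonomy table is split on tabs (species names contain spaces), while the fastANI output is split on any whitespace. Genomes with no match in the table get `NA` for genus and species.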
### GTDB all Clostridiaceae
**COMING SOON**
### NCBI RefSeq Clostridiaceae [representatives](https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#representative_genomes) only
**COMING SOON**
### NCBI RefSeq all Clostridiaceae
**COMING SOON**
---
## Getting pairwise distances from trees
### GTDB Clostridiaceae [representatives](https://gtdb.ecogenomic.org/faq#gtdb_species_clusters) only
**COMING SOON**
### GTDB all Clostridiaceae
**COMING SOON**
### NCBI RefSeq Clostridiaceae [representatives](https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#representative_genomes) only
**COMING SOON**
### NCBI RefSeq all Clostridiaceae
**COMING SOON**
---
## Summarizing ANI and phylogenomic distances based on species designations within GTDB and NCBI
**COMING SOON**