---
tags: GToTree
---
# Subsetting abundant classes
This whole process takes about 3 minutes 🙂
## Installing bit package
This approach uses tools and programs in my [bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit) package. We can install this in its own conda environment and enter it like so:
```bash
conda create -y -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit
conda activate bit
```
## Downloading needed ad-hoc scripts
```bash
curl -L https://ndownloader.figshare.com/files/23373092 -o ad-hoc-parse-ncbi-table.py
curl -L https://ndownloader.figshare.com/files/23374082 -o ad-hoc-subset-abundant-classes.py
```
## Getting all latest refseq, complete, representative genomes
```bash
esearch -query '"latest refseq"[filter] AND "complete genome"[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > target-refseq-accessions
```
## Downloading NCBI assembly summary file
This let's allows us to link the accessions to their taxids:
```bash
curl -L ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt -o ncbi_RefSeq_assembly_info.tsv
```
## Parsing full assembly summary file down to target accessions and their taxids only
```bash
python ad-hoc-parse-ncbi-table.py -a target-refseq-accessions -t ncbi_RefSeq_assembly_info.tsv
```
## Getting lineage info for each target accession
First updating the ncbi taxonomy database just to make sure it's up-to-date:
```bash
bit-update-ncbi-taxonomy
```
Getting lineage info for each taxid:
```bash
cut -f 2 target-accs-and-taxids.tsv | tail -n +2 | taxonkit lineage | taxonkit reformat -r NA | cut -f3 | tr ";" "\t" > lineages.tmp
```
Adding header:
```bash
cat <(printf "domain\tphylum\tclass\torder\tfamily\tgenus\tspecific_name\n") lineages.tmp > lineages.tsv && rm lineages.tmp
```
Adding accessions and taxids in front of lineage info:
```bash
paste target-accs-and-taxids.tsv lineages.tsv > target-accs-with-lineages.tsv
```
Cleaning up some intermediate files:
```bash
rm lineages.tsv target-refseq-accessions ncbi_RefSeq_assembly_info.tsv
```
## Randomly subsetting highly abundant classes
This script counts how many times each class shows up. By default, if a specific class makes up > 5% of the total number of accessions, it will randomly subset that class down to 25% of what it was.
So if there are 1,000 total accessions, and Gammaproteobacteria have 100 entries, the program will randomly select 25 Gammaproteobacteria accessions only.
The final output is a single-column file of the filtered accessions ready to go to GToTree.
Run like so:
```bash
python ad-hoc-subset-abundant-classes.py -l target-accs-with-lineages.tsv
```
That generates the `subset-accessions.txt` file. There is a help menu with `-h`. The `-p` flag is what determines what fraction of the total a class must make up before it will be subsampled (0.05 by default). And the `-f` flag is what determines how much to subsample those classes that are above the abundance threshold (0.25 by default).