Subsetting abundant classes

--- tags: GToTree --- # Subsetting abundant classes This whole process takes about 3 minutes 🙂 ## Installing bit package This approach uses tools and programs in my [bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit) package. We can install this in its own conda environment and enter it like so: ```bash conda create -y -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit conda activate bit ``` ## Downloading needed ad-hoc scripts ```bash curl -L https://ndownloader.figshare.com/files/23373092 -o ad-hoc-parse-ncbi-table.py curl -L https://ndownloader.figshare.com/files/23374082 -o ad-hoc-subset-abundant-classes.py ``` ## Getting all latest refseq, complete, representative genomes ```bash esearch -query '"latest refseq"[filter] AND "complete genome"[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > target-refseq-accessions ``` ## Downloading NCBI assembly summary file This let's allows us to link the accessions to their taxids: ```bash curl -L ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt -o ncbi_RefSeq_assembly_info.tsv ``` ## Parsing full assembly summary file down to target accessions and their taxids only ```bash python ad-hoc-parse-ncbi-table.py -a target-refseq-accessions -t ncbi_RefSeq_assembly_info.tsv ``` ## Getting lineage info for each target accession First updating the ncbi taxonomy database just to make sure it's up-to-date: ```bash bit-update-ncbi-taxonomy ``` Getting lineage info for each taxid: ```bash cut -f 2 target-accs-and-taxids.tsv | tail -n +2 | taxonkit lineage | taxonkit reformat -r NA | cut -f3 | tr ";" "\t" > lineages.tmp ``` Adding header: ```bash cat <(printf "domain\tphylum\tclass\torder\tfamily\tgenus\tspecific_name\n") lineages.tmp > lineages.tsv && rm lineages.tmp ``` Adding accessions and taxids in front of lineage info: ```bash paste target-accs-and-taxids.tsv lineages.tsv > target-accs-with-lineages.tsv ``` Cleaning up some intermediate files: ```bash rm lineages.tsv target-refseq-accessions ncbi_RefSeq_assembly_info.tsv ``` ## Randomly subsetting highly abundant classes This script counts how many times each class shows up. By default, if a specific class makes up > 5% of the total number of accessions, it will randomly subset that class down to 25% of what it was. So if there are 1,000 total accessions, and Gammaproteobacteria have 100 entries, the program will randomly select 25 Gammaproteobacteria accessions only. The final output is a single-column file of the filtered accessions ready to go to GToTree. Run like so: ```bash python ad-hoc-subset-abundant-classes.py -l target-accs-with-lineages.tsv ``` That generates the `subset-accessions.txt` file. There is a help menu with `-h`. The `-p` flag is what determines what fraction of the total a class must make up before it will be subsampled (0.05 by default). And the `-f` flag is what determines how much to subsample those classes that are above the abundance threshold (0.25 by default).