---
tags: cov-irt
---

# Taxon summaries of kraken2 output

---

# UPDATE
> **In getting closer to publication, I split these CoV-IRT microbial subgroup related programs into their [own conda package](https://github.com/AstrobioMike/CoV-IRT-Micro). The custom programs on this page that start with `bit-` will be replaced by versions that start with `cov-` and are included in that conda install. That should be installed with conda as shown on [that page](https://github.com/AstrobioMike/CoV-IRT-Micro), and the `bit` install instructions below should be ignored.**

---

[toc]

New programs `bit-kraken2-to-taxon-summaries` and `bit-combine-kraken2-taxon-summaries` are available as of [bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit) v1.8.13.

```bash
conda install -c conda-forge -c bioconda -c defaults -c astrobiomike bit=1.8.13
conda activate bit
```

## Summarizing single sample
This operates on the regular output file from a kraken2 run. Getting a kraken2 output file to use as an example:

```bash
curl -L -o CRR125950_qual_complexity_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out https://osf.io/hu4m6/download
```

That's the per-read classification output, which looks like this:

```bash
head CRR125950_qual_complexity_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out | column | sed 's/^/# /'
```

```
# U A00159:145:H75T2DMXX:1:1101:7735:13792 unclassified (taxid 0) 16 0:0
# U A00159:145:H75T2DMXX:1:1101:11216:13557 unclassified (taxid 0) 30 0:0
# U A00159:145:H75T2DMXX:1:1101:22688:14074 unclassified (taxid 0) 26 0:0
# U A00159:145:H75T2DMXX:1:1101:1325:14559 unclassified (taxid 0) 31 0:0
# U A00159:145:H75T2DMXX:1:1101:23719:15013 unclassified (taxid 0) 30 0:0
# C A00159:145:H75T2DMXX:1:1102:11388:8312 Ochrobactrum (taxid 528) 194 0:12 1224:16 28211:7 528:15 356:1 528:12 356:3 528:9 28211:25 528:2 28211:5 528:2 356:1 28211:5 528:3 356:1 528:22 356:1 528:8 28211:7 2:3
# U A00159:145:H75T2DMXX:1:1102:15465:8390 unclassified (taxid 0) 27 0:0
# U A00159:145:H75T2DMXX:1:1102:6343:7560 unclassified (taxid 0) 271 0:237
# U A00159:145:H75T2DMXX:1:1102:30101:11600 unclassified (taxid 0) 26 0:0
# U A00159:145:H75T2DMXX:1:1101:19678:2221 unclassified (taxid 0) 279 0:245
```

Running the summary command:

```bash
bit-kraken2-to-taxon-summaries -i CRR125950_qual_complexity_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out -o example-out.tsv
```

This counts how many reads were assigned to each taxid, adds full lineage info, and adds a row for unclassified reads. The output has a column with the number of reads classified to each taxon and a column with that count as a percent of all reads in the sample. It looks like this afterwards:

```bash
head example-out.tsv | column | sed 's/^/# /'
```

```
# taxid domain phylum class order family genus species read_counts percent_of_reads
# 0 Unclassified Unclassified Unclassified Unclassified Unclassified Unclassified Unclassified 766758 94.70380589843856
# 1 NA NA NA NA NA NA NA 3 0.0003705359679264066
# 2 Bacteria NA NA NA NA NA NA 884 0.10918459854898115
# 9 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Erwiniaceae Buchnera Buchnera aphidicola 34 0.004199407636499275
# 56 Bacteria Proteobacteria Deltaproteobacteria Myxococcales Polyangiaceae Sorangium Sorangium cellulosum 1 0.0001235119893088022
# 63 Bacteria Proteobacteria Betaproteobacteria Neisseriales Neisseriaceae Vitreoscilla Vitreoscilla filiformis 13 0.0016056558610144287
# 114 Bacteria Planctomycetes Planctomycetia Gemmatales Gemmataceae Gemmata Gemmata obscuriglobus 1 0.0001235119893088022
# 154 Bacteria Spirochaetes Spirochaetia Spirochaetales Spirochaetaceae Spirochaeta Spirochaeta thermophila 2 0.0002470239786176044
# 157 Bacteria Spirochaetes Spirochaetia Spirochaetales Spirochaetaceae Treponema NA 2 0.0002470239786176044
```

## Combining multiple sample summaries
Here we'll just make a copy of the last output so there is a second file for the example:

```bash
cp example-out.tsv example-out2.tsv
```

And combining the summaries:

```bash
bit-combine-kraken2-taxon-summaries -i example-out.tsv example-out2.tsv -o combined-kraken2-taxon-summaries.tsv
```

Similar to our other combining program, by default this uses the basename of each input file to make the column headers, and the output looks like this (new columns on the right):

```bash
head combined-kraken2-taxon-summaries.tsv | column | sed 's/^/# /'
```

```
# taxid domain phylum class order family genus species example-out_read_counts example-out_perc_of_reads example-out2_read_counts example-out2_perc_of_reads
# 0 Unclassified Unclassified Unclassified Unclassified Unclassified Unclassified Unclassified 766758 94.70380589843857 766758 94.70380589843857
# 1 NA NA NA NA NA NA NA 3 0.00037053596792640663 3 0.00037053596792640663
# 2 Bacteria NA NA NA NA NA NA 884 0.10918459854898116 884 0.10918459854898116
# 9 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Erwiniaceae Buchnera Buchnera aphidicola 34 0.004199407636499275 34 0.004199407636499275
# 56 Bacteria Proteobacteria Deltaproteobacteria Myxococcales Polyangiaceae Sorangium Sorangium cellulosum 1 0.0001235119893088022 1 0.0001235119893088022
# 63 Bacteria Proteobacteria Betaproteobacteria Neisseriales Neisseriaceae Vitreoscilla Vitreoscilla filiformis 13 0.0016056558610144287 13 0.0016056558610144287
# 114 Bacteria Planctomycetes Planctomycetia Gemmatales Gemmataceae Gemmata Gemmata obscuriglobus 1 0.0001235119893088022 1 0.0001235119893088022
# 154 Bacteria Spirochaetes Spirochaetia Spirochaetales Spirochaetaceae Spirochaeta Spirochaeta thermophila 2 0.0002470239786176044 2 0.0002470239786176044
# 157 Bacteria Spirochaetes Spirochaetia Spirochaetales Spirochaetaceae Treponema NA 2 0.0002470239786176044 2 0.0002470239786176044
```

Alternatively, a comma-delimited list of wanted names can be provided with `-n`, but the names need to be in the same order as the input files, e.g.:

```bash
bit-combine-kraken2-taxon-summaries -i example-out.tsv example-out2.tsv -n A,B -o combined-kraken2-taxon-summaries.tsv
```

```
# taxid domain phylum class order family genus species A_read_counts A_perc_of_reads B_read_counts B_perc_of_reads
# 0 Unclassified Unclassified Unclassified Unclassified Unclassified Unclassified Unclassified 766758 94.70380589843857 766758 94.70380589843857
# 1 NA NA NA NA NA NA NA 3 0.00037053596792640663 3 0.00037053596792640663
# 2 Bacteria NA NA NA NA NA NA 884 0.10918459854898116 884 0.10918459854898116
# 9 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Erwiniaceae Buchnera Buchnera aphidicola 34 0.004199407636499275 34 0.004199407636499275
# 56 Bacteria Proteobacteria Deltaproteobacteria Myxococcales Polyangiaceae Sorangium Sorangium cellulosum 1 0.0001235119893088022 1 0.0001235119893088022
# 63 Bacteria Proteobacteria Betaproteobacteria Neisseriales Neisseriaceae Vitreoscilla Vitreoscilla filiformis 13 0.0016056558610144287 13 0.0016056558610144287
# 114 Bacteria Planctomycetes Planctomycetia Gemmatales Gemmataceae Gemmata Gemmata obscuriglobus 1 0.0001235119893088022 1 0.0001235119893088022
# 154 Bacteria Spirochaetes Spirochaetia Spirochaetales Spirochaetaceae Spirochaeta Spirochaeta thermophila 2 0.0002470239786176044 2 0.0002470239786176044
# 157 Bacteria Spirochaetes Spirochaetia Spirochaetales Spirochaetaceae Treponema NA 2 0.0002470239786176044 2 0.0002470239786176044
```

Lastly, since the input files are given as a space-delimited list, they can be provided with shell wildcards if wanted, e.g.:

```bash
bit-combine-kraken2-taxon-summaries -i example*.tsv -o combined-kraken2-taxon-summaries.tsv
```
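
For a rough sense of the counting done in the single-sample step, here is a minimal sketch that tallies reads per taxid and reports each count as a percent of all reads. This is not how `bit-kraken2-to-taxon-summaries` is implemented, and it leaves out the lineage lookup entirely. The function name `count_taxids` is hypothetical, and the sketch assumes the kraken2 output is tab-delimited and was generated with `--use-names`, so the third column looks like `Ochrobactrum (taxid 528)` as in the example above.

```shell
# Hypothetical helper (not part of bit): tally reads per taxid from a
# kraken2 per-read output file. Assumes tab-delimited input where column 3
# has the form "name (taxid N)", as written by kraken2 --use-names.
count_taxids() {
    # header is printed separately so it stays on top after sorting
    printf 'taxid\tread_counts\tpercent_of_reads\n'
    awk -F '\t' '
        {
            # strip everything except the number inside "(taxid N)"
            taxid = $3
            sub(/^.*\(taxid /, "", taxid)
            sub(/\).*$/, "", taxid)
            counts[taxid]++
            total++
        }
        END {
            # percent is relative to all reads, classified or not
            for (t in counts) {
                printf "%s\t%d\t%.4f\n", t, counts[t], counts[t] / total * 100
            }
        }
    ' "$1" | sort -k1,1n
}
```

It could be run on the example file with `count_taxids CRR125950_qual_complexity_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out | head`. The percentages are rounded to four decimal places here, so they will not match the full-precision values in the tables above digit for digit.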