GO summarizing and combining files with domain information included example

--- tags: cov-irt --- # GO summarizing and combining files with domain information included example --- # UPDATE > **In getting closer to publication, I split these CoV-IRT microbial subgroup related programs into their [own conda package](https://github.com/AstrobioMike/CoV-IRT-Micro). The custom programs on this page that start with `bit-` will be replaced by versions that just start with `cov-` that are included in that conda install. That should be installed with conda as shown on [that page](https://github.com/AstrobioMike/CoV-IRT-Micro), and the install `bit` instructions below should be ignored.** --- [TOC] ## Installing [bit](https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit) package in a conda environment ```bash conda create -y -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit=1.8.16 conda activate bit ``` ## Getting example data This one needs to start with a different input than we used before as it needs to have the associated taxids in it. I think the seqscreen_report.tsv files might work, so long as they are the same as this example one I pulled. The following is from [this page/compressed directory](https://osf.io/u47vn/) and is this file: "SRR10903401_rnaspades_contigs_seqscreen_fast/report_generation/seqscreen_report.tsv" ```bash curl -L -o SRR10903401_rnaspades_contigs_seqscreen_fast_report_generation_seqscreen_report.tsv https://ndownloader.figshare.com/files/23125316 ``` Cutting down to just unit ID, taxid, and GO annotations (this is the format needed for the summary program, 3 columns, tab-delimited, the third column can hold multiple GO IDs delimited by semi-colons): ```bash cut -f 1,2,3 SRR10903401_rnaspades_contigs_seqscreen_fast_report_generation_seqscreen_report.tsv > ex-SRR10903401-seqscreen-out.tsv ``` This looks like this: ``` query taxid go NODE_1_length_32580_cov_43.160704_g0_i0 1283333 GO:0019082;GO:0003723;GO:0005524;GO:0019079;GO:0008233;GO:0008168;GO:0004386;GO:0003968;GO:0016896;GO:0039520;GO:0016787;GO:0008234;GO:0090503;GO:0000166;GO:0008270;GO:0032259;GO:0016779;GO:0006351;GO:0016032;GO:0001172;GO:0003676;GO:0039503;GO:0046872;GO:0006508;GO:0039694;GO:0004197;GO:0016740;GO:0008242 NODE_2_length_17507_cov_42.377194_g0_i1 1283332 GO:0019079;GO:0008233;GO:0008168;GO:0003968;GO:0016787;GO:0008234;GO:0090503;GO:0032259;GO:0036459;GO:0039579;GO:0016032;GO:0006508;GO:0039502;GO:0004197;GO:0030683;GO:0016740;GO:0039648;GO:0004519;GO:0019082;GO:0003723;GO:0005524;GO:0004386;GO:0016896;GO:0039520;GO:0000166;GO:0008270;GO:0039657;GO:0016779;GO:0006351;GO:0001172;GO:0003676;GO:0004518;GO:0039548;GO:0039595;GO:0046872;GO:0039503;GO:0039694;GO:0090305;GO:0008242 NODE_3_length_12488_cov_62.984615_g1_i0 511433 GO:0019079;GO:0008233;GO:0008168;GO:0039644;GO:0003968;GO:0016787;GO:0008234;GO:0090503;GO:0032259;GO:0042802;GO:0036459;GO:0039579;GO:0016032;GO:0006508;GO:0039502;GO:0004197;GO:0030683;GO:0016740;GO:0039648;GO:0004519;GO:0019082;GO:0003723;GO:0005524;GO:0004386;GO:0000175;GO:0016896;GO:0039520;GO:0000166;GO:0039657;GO:0008270;GO:0016779;GO:0006351;GO:0004518;GO:0003676;GO:0001172;GO:0039595;GO:0004527;GO:0046872;GO:0039503;GO:0039694;GO:0019083;GO:0090305;GO:0008242 NODE_4_length_9120_cov_42.882834_g0_i2 1415834 GO:0019079;GO:0008233;GO:0003968;GO:0016787;GO:0008234;GO:0016032;GO:0006508;GO:0004197;GO:0016740;GO:0019082;GO:0003723;GO:0039520;GO:0008270;GO:0003676;GO:0001172;GO:0046872;GO:0008242 ``` ## Summarizing with domain info Note that the two new commands that incorporate the domain information start as `bit-cov...`. I left the others in there as they were before (just `bit-...`), so their names might look similar. I also made a couple other small tweeks to this summary command you might notice. Now *not* reporting zeroes is the default and so `--keep-zeroes` is needed if wanted to keep all rows. And it only returns one summary table by default, rather than also individual ones for each namespace. > **NOTE** > Cassie noted keeping the rows with zero is preferred to help her merging tables later, so the example here includes that `--keep-zeroes` flag (this can increase file sizes quite a bit of course, so we can keep this in mind if that becomes a problem). ```bash bit-cov-summarize-go-annots-w-domains -g go_basic \ -i ex-SRR10903401-seqscreen-out.tsv \ -o ex-SRR10903401-summary.tsv \ --keep-zeroes ``` Which looks like this: ```bash head ex-SRR10903401-summary.tsv | gsed 's/\t/ | /g' | sed 's/^/| /' | sed 's/$/ |/' ``` | GO_term | namespace | depth | name | term_counts | term_perc | arc_counts | arc_perc | bac_counts | bac_perc | euk_counts | euk_perc | vir_counts | vir_perc | NA_tax_counts | NA_tax_perc | | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | | GO:0044699 | biological_process | 0 | biological_process | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | | GO:0007582 | biological_process | 0 | biological_process | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | | GO:0000004 | biological_process | 0 | biological_process | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | | GO:0008150 | biological_process | 0 | biological_process | 35 | 0.6793478260869565 | 0 | 0.0 | 8 | 2.4615384615384617 | 27 | 0.5844155844155844 | 0 | 0.0 | 0 | 0.0 | | GO:0044700 | biological_process | 1 | signaling | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | | GO:0044763 | biological_process | 1 | cellular process | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | | GO:0000003 | biological_process | 1 | reproduction | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | | GO:0006791 | biological_process | 1 | sulfur utilization | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | | GO:0110148 | biological_process | 1 | biomineralization | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | **Column definitions** |Column | Column Name | Column Definition | |---|---|---| | 1 | GO_term | GO term ID | | 2 | namespace | [namespace](http://geneontology.org/docs/ontology-documentation/) | | 3 | depth | depth or level in the GO hierarchy | | 4 | name | [term name](http://geneontology.org/docs/GO-term-elements) | | 5 | term_counts | how many times this GO term was seen in this sample | | 6 | term_perc | prior `term_counts` column normalized to 100% | | 7 | arc_counts | how many times this GO term was seen in this sample on a read aligned to an archaeal UniProt reference sequence | | 8 | arc_perc | prior `arc_counts` column normalized to 100% | | 9 | bac_counts | same as `arc_counts` column, but for bacteria | | 10 | bac_perc | same as `arc_perc` column, but for bacteria | | 11 | euk_counts | "" | | 12 | euk_perc | "" | | 13 | vir_counts | "" | | 14 | vir_perc | "" | | 15 | NA_tax_counts | "" | | 16 | NA_tax_perc | "" | This program by default outputs one summary table holding all GO terms detected. It can additionally output individual tables for each GO "[namespace](http://geneontology.org/docs/ontology-documentation/)" if wanted by adding the `--by-namespace` flag. And it can return all GO terms if wanted (including those not detected) by adding the `--keep-zeroes` flag. ## Combining summaries with domain info ```bash # making a copy of the previous output so we can provide 2 to this command cp ex-SRR10903401-summary.tsv ex-SRR1090340x-summary.tsv bit-cov-combine-go-summaries-with-domains \ -i ex-SRR10903401-summary.tsv ex-SRR1090340x-summary.tsv \ -o ex-combined-summaries.tsv ``` Multiple input files can be provided as a space-delimited list like above (or with wildcard expansion, e.g. for above `-i ex*summary.tsv` – but be sure of the order if also using the `-n` option to rename them. By default it will take the basename of the input file up to the last period to name the columns in the merged table. But a comma-delimited list of wanted sample names (in the same order as the input files) can be provided to the `-n` option if wanted, e.g.: ```bash bit-cov-combine-go-summaries-with-domains \ -i ex-SRR10903401-summary.tsv ex-SRR1090340x-summary.tsv \ -n SRR10903401,SRR1090340x \ -o ex-combined-summaries.tsv ``` Which gives us a table that looks like this: ```bash head ex-combined-summaries.tsv | gsed 's/\t/ | /g' | sed 's/^/| /' | sed 's/$/ |/' ``` | GO_term | namespace | depth | name | SRR10903401_term_counts | SRR10903401_term_perc | SRR10903401_arc_counts | SRR10903401_arc_perc | SRR10903401_bac_counts | SRR10903401_bac_perc | SRR10903401_euk_counts | SRR10903401_euk_perc | SRR10903401_vir_counts | SRR10903401_vir_perc | SRR10903401_NA_tax_counts | SRR10903401_NA_tax_perc | SRR1090340x_term_counts | SRR1090340x_term_perc | SRR1090340x_arc_counts | SRR1090340x_arc_perc | SRR1090340x_bac_counts | SRR1090340x_bac_perc | SRR1090340x_euk_counts | SRR1090340x_euk_perc | SRR1090340x_vir_counts | SRR1090340x_vir_perc | SRR1090340x_NA_tax_counts | SRR1090340x_NA_tax_perc | | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | | GO:0044699 | biological_process | 0 | biological_process | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | | GO:0007582 | biological_process | 0 | biological_process | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | | GO:0000004 | biological_process | 0 | biological_process | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | | GO:0008150 | biological_process | 0 | biological_process | 35 | 0.6793478260869565 | 0 | 0.0 | 8 | 2.4615384615384617 | 27 | 0.5844155844155844 | 0 | 0.0 | 0 | 0.0 | 35 | 0.6793478260869565 | 0 | 0.0 | 8 | 2.4615384615384617 | 27 | 0.5844155844155844 | 0 | 0.0 | 0 | 0.0 | | GO:0044700 | biological_process | 1 | signaling | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | | GO:0044763 | biological_process | 1 | cellular process | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | | GO:0000003 | biological_process | 1 | reproduction | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | | GO:0006791 | biological_process | 1 | sulfur utilization | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | | GO:0110148 | biological_process | 1 | biomineralization | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | Column definitions are same as above, except prepended to the columns that change per sample are the unique sample identifiers. Last note, tracking the associated domain information is only going to work on the full GO annotations. Meaning we can't run the go slim step and track the domain info through that. # Propagating counts to parent terms I think we should be using these versions for beta-diversity (things like clustering samples into groups and permutational anova-type tests) as well as for any differential abundance work we do (e.g. DESeq2/corncob). The program is `bit-cov-propagate-go-terms`, is in `bit` as of version 1.8.16, and it takes the output from the above `bit-cov-combine-go-summaries-with-domains`, here called "ex-combined-summaries.tsv". ```bash bit-cov-propagate-go-terms -i ex-combined-summaries.tsv \ -o ex-combined-summaries-parents-propagated.tsv ``` > By default this removes any rows (GO terms) that don't show up in any sample. If we wanted to keep them, we couuld provided the `--keep-zeroes` flag. Now if we look at, for example, GO:0002376 – "immune system process" in both tables, **Original** ```bash grep "^GO_term\|^GO:0002376" ex-combined-summaries.tsv | cut -f 1-5 | column ``` <a href="https://i.imgur.com/A3gdsRF.png"><img src="https://i.imgur.com/A3gdsRF.png"></a> ```bash grep "^GO_term\|^GO:0002376" ex-combined-summaries-parents-propagated.tsv | cut -f 1-5 | column ``` **Counts propagated** <a href="https://i.imgur.com/6g9KjiY.png"><img src="https://i.imgur.com/6g9KjiY.png"></a> > This GO term was being represented with 24, but it encompasses an additional 169 child terms that are in this sample, bringing this specific term's representation in this sample up to 193. Those 169 child terms enable greater resolution into the likely function, and it is informative to have counts for them, but it is still the case that "immune system process" encompasses those child terms, and we can more accurately see its representation in this sample when the counts are propagated up like this. (It is the default to do this with [GO enrichment analysis](https://github.com/tanghaibao/goatools#find-go-enrichment-of-genes-under-study) also.) ## Random thoughts on propagated parent terms #### On percentages included The percent columns are still included in these tables, but I find within-sample views like that to be less useful now. I think across-sample contrasts of terms are fine (which is what beta-diversity approaches and differential abundance testing does), but within-sample views (like looking at percentages of different terms in contrast to each other) may be more misleading or less useful now than before the propagation. Not speaking authoritatively, just something that crossed my mind and to think about.