Subsetting primary plaac results tabs (all GO terms)

--- tags: prions-fo-life title: Subsetting primary plaac results tabs (all GO terms) --- [toc] # Subsetting primary plaac results tabs focusing on all GO terms (and also adding full protein sequences) --- > **NOTE** > This page is focused on all GO terms. For enriched only, see this page: https://hackmd.io/@astrobiomike/subsetting-primary-plaac-results --- Working with data from 1-Jun-2021, UniProt "Standard" proteomes only. The files used below are from [this drive](https://drive.google.com/drive/u/0/folders/1AO0Q_nHx4iNA1loP8LO8CGmSUloJ9lVH). The primary plaac results tables are in each domain's subdirectory/bucket. E.g., in the [archaea one](https://drive.google.com/drive/u/0/folders/1xH3kNsh1O5NdxUI9dbp7FV-frijGWvnK), the file is called "archaea-plaac-results.tsv.gz". And for adding the full protein sequences, the amino acid fasta files in each domain's directory are needed, e.g., archaea-proteomes.faa.gz for archaea. --- > **NOTE** > The code below works based on the directory structure in [the google drive](https://drive.google.com/drive/u/0/folders/1AO0Q_nHx4iNA1loP8LO8CGmSUloJ9lVH), with commands being run from within the sub-directory "[parsing-plaac-results](https://drive.google.com/drive/u/0/folders/18RFVd3lCqmcKFCXaG27dbbc_KJ0M7sz-)". If not working in that structure, the scripts below would need to be modified to point to the correct file paths. --- # Results **Output files of plaac results including all the plaac-positive proteins for specific GO terms can be found in the "[parsing-plaac-results-all-GO-terms](https://drive.google.com/drive/u/0/folders/1sUrgvc2rt4p8pyhRt21C1aE9BLNZqxsZ)" subdirectory.** For example, GO:0009055 is in its subdirectory there. The 3 \*.tsv tables hold the plaac results (and added on full protein sequence in the files that include "-with-prot-seqs" in their names) for plaac-positive proteins associated with GO:0005509 for each domain. ## Conda env This conda environment holds any required stuff. ```bash conda create -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit conda activate bit ``` ## Pre-setup Making new files that hold GO annotations for all plaac-positive proteins for each, and fasta files of only plaac-positive proteins, in order to make the searching process go faster when we do a new GO term. **These steps only needed to be done once**, and are documented here, but the output files are included in the drive "[parsing-plaac-results-all-GO-terms](https://drive.google.com/drive/u/0/folders/1sUrgvc2rt4p8pyhRt21C1aE9BLNZqxsZ)" ```bash # subsetting protein to GO mapping files down to just plaac-positive bit-filter-table -i ../archaea/reference-genome-info/archaea-proteins-GO-associations.tsv -o archaea-plaac-positive-proteins-GO-associations.tsv --no-header -w ../archaea/plaac-core-score-0/archaea-plaac-core-score-0-positive-protein-accs.txt bit-filter-table -i ../bacteria/reference-genome-info/bacteria-proteins-GO-associations.tsv -o bacteria-plaac-positive-proteins-GO-associations.tsv -w ../bacteria/plaac-core-score-0/bacteria-plaac-core-score-0-positive-protein-accs.txt --no-header bit-filter-table -i ../eukarya/reference-genome-info/eukarya-proteins-GO-associations.tsv.gz -o eukarya-plaac-positive-proteins-GO-associations.tsv -w ../eukarya/plaac-core-score-0/eukarya-plaac-core-score-0-positive-protein-accs.txt --no-header --gz # making fasta files of just plaac-positive proteins bit-parse-fasta-by-headers -i ../archaea/reference-genome-info/archaea-proteomes.faa.gz -w ../archaea/plaac-core-score-0/archaea-plaac-core-score-0-positive-protein-accs.txt -o archaea-plaac-positive-proteins.faa --gz bit-parse-fasta-by-headers -i ../bacteria/reference-genome-info/bacteria-proteomes.faa.gz -w ../bacteria/plaac-core-score-0/bacteria-plaac-core-score-0-positive-protein-accs.txt -o bacteria-plaac-positive-proteins.faa --gz bit-parse-fasta-by-headers -i ../eukarya/reference-genome-info/eukarya-proteomes.faa.gz -w ../eukarya/plaac-core-score-0/eukarya-plaac-core-score-0-positive-protein-accs.txt -o eukarya-plaac-positive-proteins.faa --gz ``` ## Process This uses the scripts pasted below and in the new google drive for this "[parsing-plaac-results-all-GO-terms](https://drive.google.com/drive/u/0/folders/1sUrgvc2rt4p8pyhRt21C1aE9BLNZqxsZ)". Doing the ones first requested here, but all should work if there are some plaac positives for them. ### GO_0009055 ```bash # getting the plaac-positive protein IDs associated with this GO term for each domain bash get-wanted-protein-accs-all-GO.sh GO:0009055 # making subset plaac tables for each domain python subset-plaac-result-tab-all-GO.py -i ../archaea/archaea-plaac-results.tsv.gz -w GO_0009055/GO_0009055-target-archaea-uniprot-accs.txt -o GO_0009055/GO_0009055-archaea-plaac-results-all-GO.tsv python subset-plaac-result-tab-all-GO.py -i ../bacteria/bacteria-plaac-results.tsv.gz -w GO_0009055/GO_0009055-target-bacteria-uniprot-accs.txt -o GO_0009055/GO_0009055-bacteria-plaac-results-all-GO.tsv python subset-plaac-result-tab-all-GO.py -i ../eukarya/eukarya-plaac-results.tsv.gz -w GO_0009055/GO_0009055-target-eukarya-uniprot-accs.txt -o GO_0009055/GO_0009055-eukarya-plaac-results-all-GO.tsv # getting full protein seqs and adding to plaac results tab bash get-protein-seqs-and-combine-with-plaac-results-tab-all-GO.sh GO:0009055 ``` ### GO_0008061 ```bash # getting the plaac-positive protein IDs associated with this GO term for each domain bash get-wanted-protein-accs-all-GO.sh GO:0008061 # making subset plaac tables for each domain python subset-plaac-result-tab-all-GO.py -i ../archaea/archaea-plaac-results.tsv.gz -w GO_0008061/GO_0008061-target-archaea-uniprot-accs.txt -o GO_0008061/GO_0008061-archaea-plaac-results-all-GO.tsv python subset-plaac-result-tab-all-GO.py -i ../bacteria/bacteria-plaac-results.tsv.gz -w GO_0008061/GO_0008061-target-bacteria-uniprot-accs.txt -o GO_0008061/GO_0008061-bacteria-plaac-results-all-GO.tsv python subset-plaac-result-tab-all-GO.py -i ../eukarya/eukarya-plaac-results.tsv.gz -w GO_0008061/GO_0008061-target-eukarya-uniprot-accs.txt -o GO_0008061/GO_0008061-eukarya-plaac-results-all-GO.tsv # getting full protein seqs and adding to plaac results tab bash get-protein-seqs-and-combine-with-plaac-results-tab-all-GO.sh GO:0008061 ``` ### GO_0006313 ```bash # getting the plaac-positive protein IDs associated with this GO term for each domain bash get-wanted-protein-accs-all-GO.sh GO:0006313 # making subset plaac tables for each domain python subset-plaac-result-tab-all-GO.py -i ../archaea/archaea-plaac-results.tsv.gz -w GO_0006313/GO_0006313-target-archaea-uniprot-accs.txt -o GO_0006313/GO_0006313-archaea-plaac-results-all-GO.tsv python subset-plaac-result-tab-all-GO.py -i ../bacteria/bacteria-plaac-results.tsv.gz -w GO_0006313/GO_0006313-target-bacteria-uniprot-accs.txt -o GO_0006313/GO_0006313-bacteria-plaac-results-all-GO.tsv python subset-plaac-result-tab-all-GO.py -i ../eukarya/eukarya-plaac-results.tsv.gz -w GO_0006313/GO_0006313-target-eukarya-uniprot-accs.txt -o GO_0006313/GO_0006313-eukarya-plaac-results-all-GO.tsv # getting full protein seqs and adding to plaac results tab bash get-protein-seqs-and-combine-with-plaac-results-tab-all-GO.sh GO:0006313 ``` ### GO_0005507 ```bash # getting the plaac-positive protein IDs associated with this GO term for each domain bash get-wanted-protein-accs-all-GO.sh GO:0005507 # making subset plaac tables for each domain python subset-plaac-result-tab-all-GO.py -i ../archaea/archaea-plaac-results.tsv.gz -w GO_0005507/GO_0005507-target-archaea-uniprot-accs.txt -o GO_0005507/GO_0005507-archaea-plaac-results-all-GO.tsv python subset-plaac-result-tab-all-GO.py -i ../bacteria/bacteria-plaac-results.tsv.gz -w GO_0005507/GO_0005507-target-bacteria-uniprot-accs.txt -o GO_0005507/GO_0005507-bacteria-plaac-results-all-GO.tsv python subset-plaac-result-tab-all-GO.py -i ../eukarya/eukarya-plaac-results.tsv.gz -w GO_0005507/GO_0005507-target-eukarya-uniprot-accs.txt -o GO_0005507/GO_0005507-eukarya-plaac-results-all-GO.tsv # getting full protein seqs and adding to plaac results tab bash get-protein-seqs-and-combine-with-plaac-results-tab-all-GO.sh GO:0005507 ``` ## Scripts ### get-wanted-protein-accs-all-GO.sh ```bash= #!/usr/bin/env bash set -e if [ "$#" == 0 ] || [ $1 == "-h" ] || [ $1 == "help" ] || [ $1 == "--help" ]; then printf "\n This script gets the protein accessions of plaac-positive proteins associated with\n" printf " the provided GO term for all 3 domains. Paths to required files are set inside the script.\n" printf " Having 'bit' installed ensures anything needed is present:\n" printf " https://github.com/AstrobioMike/bit#bioinformatics-tools-bit.\n\n" printf " Usage:\n\t bash get-wanted-protein-accs-all-GO.sh GO:0009055\n\n" exit fi # paths to required files archaea_protein_to_GO_map="archaea-plaac-positive-proteins-GO-associations.tsv" bacteria_protein_to_GO_map="bacteria-plaac-positive-proteins-GO-associations.tsv" eukarya_protein_to_GO_map="eukarya-plaac-positive-proteins-GO-associations.tsv" # changing target GO ID ":"" to "_" for filename purposes mod_GO_ID=$(echo ${1} | tr ":" "_") mkdir -p ${mod_GO_ID} # doing archaea grep "${1}" ${archaea_protein_to_GO_map} | cut -f 1 > ${mod_GO_ID}/${mod_GO_ID}-target-archaea-uniprot-accs.txt # doing bacteria grep "${1}" ${bacteria_protein_to_GO_map} | cut -f 1 > ${mod_GO_ID}/${mod_GO_ID}-target-bacteria-uniprot-accs.txt # doing eukarya grep "${1}" ${eukarya_protein_to_GO_map} | cut -f 1 > ${mod_GO_ID}/${mod_GO_ID}-target-eukarya-uniprot-accs.txt ``` ### subset-plaac-result-tab-all-GO.py ```python= #!/usr/bin/env python import argparse import gzip import pandas as pd parser = argparse.ArgumentParser(description='This script has a very specific set of skills.') required = parser.add_argument_group('required arguments') required.add_argument("-i", "--input-tsv", help='One of the domain-level files of plaac results such as "archaea-plaac-results.tsv.gz" (expects to be gzip\'d as currently written)', action="store", required=True) required.add_argument("-w", "--wanted-things", help="Our unique protein IDs we want.", action="store", dest="wanted", required=True) parser.add_argument("-o", "--output-tsv", help='Default: "out.tsv"', action="store", dest="output_tsv", default="out.tsv") args = parser.parse_args() # creating empty set and populating with target accessions targets_set = set() for line in open(args.wanted): line = line.strip() targets_set.add(line) # adding "SEQid" to our set of targets so that the first header row is written out also targets_set.add("SEQid") # opening new output table and writing to it as we iterate through the gzip'd input table with open(args.output_tsv, "w") as out_tab: # opening input gz table with gzip.open(args.input_tsv, "rt") as in_file: for line in in_file: # writing out line to new file if its first column entry is in our targets_set if line.strip().split("\t")[0] in targets_set: out_tab.write(line) # reading in output file as table and sorting by COREscore tab = pd.read_csv(args.output_tsv, sep = "\t") tab.sort_values(by = ['COREscore'], ascending = False, inplace = True) # writing out tab.to_csv(args.output_tsv, index = False, sep = "\t") ``` ### get-protein-seqs-and-combine-with-plaac-results-tab-all-GO.sh ```bash= #!/usr/bin/env bash set -e ## example usage: bash get-protein-seqs-and-combine-with-plaac-results-tab-all-GO.sh GO:0009055 # should be run last in process, see this page for details: https://hackmd.io/@astrobiomike/subsetting-primary-plaac-results-all-GO-terms # changing target GO ID ":"" to "_" for filename purposes mod_GO_ID=$(echo ${1} | tr ":" "_") ## checking directory exists already if [ ! -d ${mod_GO_ID} ]; then printf "\n The expected ${mod_GO_ID} directory doesn't exist already. This script needs\n" printf " to be run after others. See this page for details:\n\n" printf " https://hackmd.io/@astrobiomike/subsetting-primary-plaac-results-all-GO-terms\n\n" exit fi # paths to required files archaea_faa_file="archaea-plaac-positive-proteins.faa" bacteria_faa_file="bacteria-plaac-positive-proteins.faa" eukarya_faa_file="eukarya-plaac-positive-proteins.faa" archaea_plaac_file="${mod_GO_ID}/${mod_GO_ID}-archaea-plaac-results-all-GO.tsv" bacteria_plaac_file="${mod_GO_ID}/${mod_GO_ID}-bacteria-plaac-results-all-GO.tsv" eukarya_plaac_file="${mod_GO_ID}/${mod_GO_ID}-eukarya-plaac-results-all-GO.tsv" ### getting archaea seqs ### # getting list from plaac table so can use it to reorder them for sticking together at end cut -f 1 ${archaea_plaac_file} | tail -n +2 > ${mod_GO_ID}-wanted-prot-IDs.tmp bit-parse-fasta-by-headers -i ${archaea_faa_file} -w ${mod_GO_ID}-wanted-prot-IDs.tmp -o ${mod_GO_ID}-wanted-prots.faa.tmp # ordering so can stick together bit-reorder-fasta -i ${mod_GO_ID}-wanted-prots.faa.tmp -w ${mod_GO_ID}-wanted-prot-IDs.tmp -o ${mod_GO_ID}-wanted-prots-ordered.faa.tmp # making new column with just seqs printf "protein_seq\n" > ${mod_GO_ID}-seq-col.tmp grep -v "^>" ${mod_GO_ID}-wanted-prots-ordered.faa.tmp >> ${mod_GO_ID}-seq-col.tmp # adding to plaac output tab paste ${archaea_plaac_file} ${mod_GO_ID}-seq-col.tmp > ${mod_GO_ID}/${mod_GO_ID}-archaea-plaac-results-with-prot-seqs.tsv # clean up rm ${mod_GO_ID}-wanted-prot-IDs.tmp ${mod_GO_ID}-wanted-prots.faa.tmp ${mod_GO_ID}-seq-col.tmp ${mod_GO_ID}-wanted-prots-ordered.faa.tmp ### doing bacteria ### # getting list from plaac table so can use it to reorder them for sticking together at end cut -f 1 ${bacteria_plaac_file} | tail -n +2 > ${mod_GO_ID}-wanted-prot-IDs.tmp bit-parse-fasta-by-headers -i ${bacteria_faa_file} -w ${mod_GO_ID}-wanted-prot-IDs.tmp -o ${mod_GO_ID}-wanted-prots.faa.tmp # ordering so can stick together bit-reorder-fasta -i ${mod_GO_ID}-wanted-prots.faa.tmp -w ${mod_GO_ID}-wanted-prot-IDs.tmp -o ${mod_GO_ID}-wanted-prots-ordered.faa.tmp # making new column with just seqs printf "protein_seq\n" > ${mod_GO_ID}-seq-col.tmp grep -v "^>" ${mod_GO_ID}-wanted-prots-ordered.faa.tmp >> ${mod_GO_ID}-seq-col.tmp # adding to plaac output tab paste ${bacteria_plaac_file} ${mod_GO_ID}-seq-col.tmp > ${mod_GO_ID}/${mod_GO_ID}-bacteria-plaac-results-with-prot-seqs.tsv # clean up rm ${mod_GO_ID}-wanted-prot-IDs.tmp ${mod_GO_ID}-wanted-prots.faa.tmp ${mod_GO_ID}-seq-col.tmp ${mod_GO_ID}-wanted-prots-ordered.faa.tmp ### doing eukarya ### # getting list from plaac table so can use it to reorder them for sticking together at end cut -f 1 ${eukarya_plaac_file} | tail -n +2 > ${mod_GO_ID}-wanted-prot-IDs.tmp bit-parse-fasta-by-headers -i ${eukarya_faa_file} -w ${mod_GO_ID}-wanted-prot-IDs.tmp -o ${mod_GO_ID}-wanted-prots.faa.tmp # ordering so can stick together bit-reorder-fasta -i ${mod_GO_ID}-wanted-prots.faa.tmp -w ${mod_GO_ID}-wanted-prot-IDs.tmp -o ${mod_GO_ID}-wanted-prots-ordered.faa.tmp # making new column with just seqs printf "protein_seq\n" > ${mod_GO_ID}-seq-col.tmp grep -v "^>" ${mod_GO_ID}-wanted-prots-ordered.faa.tmp >> ${mod_GO_ID}-seq-col.tmp # adding to plaac output tab paste ${eukarya_plaac_file} ${mod_GO_ID}-seq-col.tmp > ${mod_GO_ID}/${mod_GO_ID}-eukarya-plaac-results-with-prot-seqs.tsv # clean up rm ${mod_GO_ID}-wanted-prot-IDs.tmp ${mod_GO_ID}-wanted-prots.faa.tmp ${mod_GO_ID}-seq-col.tmp ${mod_GO_ID}-wanted-prots-ordered.faa.tmp ```