---
tags: prions-fo-life
title: Subsetting primary plaac results tabs (all GO terms)
---
[toc]
# Subsetting primary plaac results tabs focusing on all GO terms (and also adding full protein sequences)
---
> **NOTE**
> This page is focused on all GO terms. For enriched only, see this page: https://hackmd.io/@astrobiomike/subsetting-primary-plaac-results
---
Working with data from 1-Jun-2021, UniProt "Standard" proteomes only. The files used below are from [this drive](https://drive.google.com/drive/u/0/folders/1AO0Q_nHx4iNA1loP8LO8CGmSUloJ9lVH). The primary plaac results tables are in each domain's subdirectory/bucket. E.g., in the [archaea one](https://drive.google.com/drive/u/0/folders/1xH3kNsh1O5NdxUI9dbp7FV-frijGWvnK), the file is called "archaea-plaac-results.tsv.gz". And for adding the full protein sequences, the amino acid fasta files in each domain's directory are needed, e.g., archaea-proteomes.faa.gz for archaea.
---
> **NOTE**
> The code below works based on the directory structure in [the google drive](https://drive.google.com/drive/u/0/folders/1AO0Q_nHx4iNA1loP8LO8CGmSUloJ9lVH), with commands being run from within the sub-directory "[parsing-plaac-results](https://drive.google.com/drive/u/0/folders/18RFVd3lCqmcKFCXaG27dbbc_KJ0M7sz-)". If not working in that structure, the scripts below would need to be modified to point to the correct file paths.
---
# Results
**Output files of plaac results including all the plaac-positive proteins for specific GO terms can be found in the "[parsing-plaac-results-all-GO-terms](https://drive.google.com/drive/u/0/folders/1sUrgvc2rt4p8pyhRt21C1aE9BLNZqxsZ)" subdirectory.**
For example, GO:0009055 is in its subdirectory there. The 3 \*.tsv tables hold the plaac results (and added on full protein sequence in the files that include "-with-prot-seqs" in their names) for plaac-positive proteins associated with GO:0005509 for each domain.
## Conda env
This conda environment holds any required stuff.
```bash
conda create -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit
conda activate bit
```
## Pre-setup
Making new files that hold GO annotations for all plaac-positive proteins for each, and fasta files of only plaac-positive proteins, in order to make the searching process go faster when we do a new GO term. **These steps only needed to be done once**, and are documented here, but the output files are included in the drive "[parsing-plaac-results-all-GO-terms](https://drive.google.com/drive/u/0/folders/1sUrgvc2rt4p8pyhRt21C1aE9BLNZqxsZ)"
```bash
# subsetting protein to GO mapping files down to just plaac-positive
bit-filter-table -i ../archaea/reference-genome-info/archaea-proteins-GO-associations.tsv -o archaea-plaac-positive-proteins-GO-associations.tsv --no-header -w ../archaea/plaac-core-score-0/archaea-plaac-core-score-0-positive-protein-accs.txt
bit-filter-table -i ../bacteria/reference-genome-info/bacteria-proteins-GO-associations.tsv -o bacteria-plaac-positive-proteins-GO-associations.tsv -w ../bacteria/plaac-core-score-0/bacteria-plaac-core-score-0-positive-protein-accs.txt --no-header
bit-filter-table -i ../eukarya/reference-genome-info/eukarya-proteins-GO-associations.tsv.gz -o eukarya-plaac-positive-proteins-GO-associations.tsv -w ../eukarya/plaac-core-score-0/eukarya-plaac-core-score-0-positive-protein-accs.txt --no-header --gz
# making fasta files of just plaac-positive proteins
bit-parse-fasta-by-headers -i ../archaea/reference-genome-info/archaea-proteomes.faa.gz -w ../archaea/plaac-core-score-0/archaea-plaac-core-score-0-positive-protein-accs.txt -o archaea-plaac-positive-proteins.faa --gz
bit-parse-fasta-by-headers -i ../bacteria/reference-genome-info/bacteria-proteomes.faa.gz -w ../bacteria/plaac-core-score-0/bacteria-plaac-core-score-0-positive-protein-accs.txt -o bacteria-plaac-positive-proteins.faa --gz
bit-parse-fasta-by-headers -i ../eukarya/reference-genome-info/eukarya-proteomes.faa.gz -w ../eukarya/plaac-core-score-0/eukarya-plaac-core-score-0-positive-protein-accs.txt -o eukarya-plaac-positive-proteins.faa --gz
```
## Process
This uses the scripts pasted below and in the new google drive for this "[parsing-plaac-results-all-GO-terms](https://drive.google.com/drive/u/0/folders/1sUrgvc2rt4p8pyhRt21C1aE9BLNZqxsZ)".
Doing the ones first requested here, but all should work if there are some plaac positives for them.
### GO_0009055
```bash
# getting the plaac-positive protein IDs associated with this GO term for each domain
bash get-wanted-protein-accs-all-GO.sh GO:0009055
# making subset plaac tables for each domain
python subset-plaac-result-tab-all-GO.py -i ../archaea/archaea-plaac-results.tsv.gz -w GO_0009055/GO_0009055-target-archaea-uniprot-accs.txt -o GO_0009055/GO_0009055-archaea-plaac-results-all-GO.tsv
python subset-plaac-result-tab-all-GO.py -i ../bacteria/bacteria-plaac-results.tsv.gz -w GO_0009055/GO_0009055-target-bacteria-uniprot-accs.txt -o GO_0009055/GO_0009055-bacteria-plaac-results-all-GO.tsv
python subset-plaac-result-tab-all-GO.py -i ../eukarya/eukarya-plaac-results.tsv.gz -w GO_0009055/GO_0009055-target-eukarya-uniprot-accs.txt -o GO_0009055/GO_0009055-eukarya-plaac-results-all-GO.tsv
# getting full protein seqs and adding to plaac results tab
bash get-protein-seqs-and-combine-with-plaac-results-tab-all-GO.sh GO:0009055
```
### GO_0008061
```bash
# getting the plaac-positive protein IDs associated with this GO term for each domain
bash get-wanted-protein-accs-all-GO.sh GO:0008061
# making subset plaac tables for each domain
python subset-plaac-result-tab-all-GO.py -i ../archaea/archaea-plaac-results.tsv.gz -w GO_0008061/GO_0008061-target-archaea-uniprot-accs.txt -o GO_0008061/GO_0008061-archaea-plaac-results-all-GO.tsv
python subset-plaac-result-tab-all-GO.py -i ../bacteria/bacteria-plaac-results.tsv.gz -w GO_0008061/GO_0008061-target-bacteria-uniprot-accs.txt -o GO_0008061/GO_0008061-bacteria-plaac-results-all-GO.tsv
python subset-plaac-result-tab-all-GO.py -i ../eukarya/eukarya-plaac-results.tsv.gz -w GO_0008061/GO_0008061-target-eukarya-uniprot-accs.txt -o GO_0008061/GO_0008061-eukarya-plaac-results-all-GO.tsv
# getting full protein seqs and adding to plaac results tab
bash get-protein-seqs-and-combine-with-plaac-results-tab-all-GO.sh GO:0008061
```
### GO_0006313
```bash
# getting the plaac-positive protein IDs associated with this GO term for each domain
bash get-wanted-protein-accs-all-GO.sh GO:0006313
# making subset plaac tables for each domain
python subset-plaac-result-tab-all-GO.py -i ../archaea/archaea-plaac-results.tsv.gz -w GO_0006313/GO_0006313-target-archaea-uniprot-accs.txt -o GO_0006313/GO_0006313-archaea-plaac-results-all-GO.tsv
python subset-plaac-result-tab-all-GO.py -i ../bacteria/bacteria-plaac-results.tsv.gz -w GO_0006313/GO_0006313-target-bacteria-uniprot-accs.txt -o GO_0006313/GO_0006313-bacteria-plaac-results-all-GO.tsv
python subset-plaac-result-tab-all-GO.py -i ../eukarya/eukarya-plaac-results.tsv.gz -w GO_0006313/GO_0006313-target-eukarya-uniprot-accs.txt -o GO_0006313/GO_0006313-eukarya-plaac-results-all-GO.tsv
# getting full protein seqs and adding to plaac results tab
bash get-protein-seqs-and-combine-with-plaac-results-tab-all-GO.sh GO:0006313
```
### GO_0005507
```bash
# getting the plaac-positive protein IDs associated with this GO term for each domain
bash get-wanted-protein-accs-all-GO.sh GO:0005507
# making subset plaac tables for each domain
python subset-plaac-result-tab-all-GO.py -i ../archaea/archaea-plaac-results.tsv.gz -w GO_0005507/GO_0005507-target-archaea-uniprot-accs.txt -o GO_0005507/GO_0005507-archaea-plaac-results-all-GO.tsv
python subset-plaac-result-tab-all-GO.py -i ../bacteria/bacteria-plaac-results.tsv.gz -w GO_0005507/GO_0005507-target-bacteria-uniprot-accs.txt -o GO_0005507/GO_0005507-bacteria-plaac-results-all-GO.tsv
python subset-plaac-result-tab-all-GO.py -i ../eukarya/eukarya-plaac-results.tsv.gz -w GO_0005507/GO_0005507-target-eukarya-uniprot-accs.txt -o GO_0005507/GO_0005507-eukarya-plaac-results-all-GO.tsv
# getting full protein seqs and adding to plaac results tab
bash get-protein-seqs-and-combine-with-plaac-results-tab-all-GO.sh GO:0005507
```
## Scripts
### get-wanted-protein-accs-all-GO.sh
```bash=
#!/usr/bin/env bash
set -e
if [ "$#" == 0 ] || [ $1 == "-h" ] || [ $1 == "help" ] || [ $1 == "--help" ]; then
printf "\n This script gets the protein accessions of plaac-positive proteins associated with\n"
printf " the provided GO term for all 3 domains. Paths to required files are set inside the script.\n"
printf " Having 'bit' installed ensures anything needed is present:\n"
printf " https://github.com/AstrobioMike/bit#bioinformatics-tools-bit.\n\n"
printf " Usage:\n\t bash get-wanted-protein-accs-all-GO.sh GO:0009055\n\n"
exit
fi
# paths to required files
archaea_protein_to_GO_map="archaea-plaac-positive-proteins-GO-associations.tsv"
bacteria_protein_to_GO_map="bacteria-plaac-positive-proteins-GO-associations.tsv"
eukarya_protein_to_GO_map="eukarya-plaac-positive-proteins-GO-associations.tsv"
# changing target GO ID ":"" to "_" for filename purposes
mod_GO_ID=$(echo ${1} | tr ":" "_")
mkdir -p ${mod_GO_ID}
# doing archaea
grep "${1}" ${archaea_protein_to_GO_map} | cut -f 1 > ${mod_GO_ID}/${mod_GO_ID}-target-archaea-uniprot-accs.txt
# doing bacteria
grep "${1}" ${bacteria_protein_to_GO_map} | cut -f 1 > ${mod_GO_ID}/${mod_GO_ID}-target-bacteria-uniprot-accs.txt
# doing eukarya
grep "${1}" ${eukarya_protein_to_GO_map} | cut -f 1 > ${mod_GO_ID}/${mod_GO_ID}-target-eukarya-uniprot-accs.txt
```
### subset-plaac-result-tab-all-GO.py
```python=
#!/usr/bin/env python
import argparse
import gzip
import pandas as pd
parser = argparse.ArgumentParser(description='This script has a very specific set of skills.')
required = parser.add_argument_group('required arguments')
required.add_argument("-i", "--input-tsv", help='One of the domain-level files of plaac results such as "archaea-plaac-results.tsv.gz" (expects to be gzip\'d as currently written)', action="store", required=True)
required.add_argument("-w", "--wanted-things", help="Our unique protein IDs we want.", action="store", dest="wanted", required=True)
parser.add_argument("-o", "--output-tsv", help='Default: "out.tsv"', action="store", dest="output_tsv", default="out.tsv")
args = parser.parse_args()
# creating empty set and populating with target accessions
targets_set = set()
for line in open(args.wanted):
line = line.strip()
targets_set.add(line)
# adding "SEQid" to our set of targets so that the first header row is written out also
targets_set.add("SEQid")
# opening new output table and writing to it as we iterate through the gzip'd input table
with open(args.output_tsv, "w") as out_tab:
# opening input gz table
with gzip.open(args.input_tsv, "rt") as in_file:
for line in in_file:
# writing out line to new file if its first column entry is in our targets_set
if line.strip().split("\t")[0] in targets_set:
out_tab.write(line)
# reading in output file as table and sorting by COREscore
tab = pd.read_csv(args.output_tsv, sep = "\t")
tab.sort_values(by = ['COREscore'], ascending = False, inplace = True)
# writing out
tab.to_csv(args.output_tsv, index = False, sep = "\t")
```
### get-protein-seqs-and-combine-with-plaac-results-tab-all-GO.sh
```bash=
#!/usr/bin/env bash
set -e
## example usage: bash get-protein-seqs-and-combine-with-plaac-results-tab-all-GO.sh GO:0009055
# should be run last in process, see this page for details: https://hackmd.io/@astrobiomike/subsetting-primary-plaac-results-all-GO-terms
# changing target GO ID ":"" to "_" for filename purposes
mod_GO_ID=$(echo ${1} | tr ":" "_")
## checking directory exists already
if [ ! -d ${mod_GO_ID} ]; then
printf "\n The expected ${mod_GO_ID} directory doesn't exist already. This script needs\n"
printf " to be run after others. See this page for details:\n\n"
printf " https://hackmd.io/@astrobiomike/subsetting-primary-plaac-results-all-GO-terms\n\n"
exit
fi
# paths to required files
archaea_faa_file="archaea-plaac-positive-proteins.faa"
bacteria_faa_file="bacteria-plaac-positive-proteins.faa"
eukarya_faa_file="eukarya-plaac-positive-proteins.faa"
archaea_plaac_file="${mod_GO_ID}/${mod_GO_ID}-archaea-plaac-results-all-GO.tsv"
bacteria_plaac_file="${mod_GO_ID}/${mod_GO_ID}-bacteria-plaac-results-all-GO.tsv"
eukarya_plaac_file="${mod_GO_ID}/${mod_GO_ID}-eukarya-plaac-results-all-GO.tsv"
### getting archaea seqs ###
# getting list from plaac table so can use it to reorder them for sticking together at end
cut -f 1 ${archaea_plaac_file} | tail -n +2 > ${mod_GO_ID}-wanted-prot-IDs.tmp
bit-parse-fasta-by-headers -i ${archaea_faa_file} -w ${mod_GO_ID}-wanted-prot-IDs.tmp -o ${mod_GO_ID}-wanted-prots.faa.tmp
# ordering so can stick together
bit-reorder-fasta -i ${mod_GO_ID}-wanted-prots.faa.tmp -w ${mod_GO_ID}-wanted-prot-IDs.tmp -o ${mod_GO_ID}-wanted-prots-ordered.faa.tmp
# making new column with just seqs
printf "protein_seq\n" > ${mod_GO_ID}-seq-col.tmp
grep -v "^>" ${mod_GO_ID}-wanted-prots-ordered.faa.tmp >> ${mod_GO_ID}-seq-col.tmp
# adding to plaac output tab
paste ${archaea_plaac_file} ${mod_GO_ID}-seq-col.tmp > ${mod_GO_ID}/${mod_GO_ID}-archaea-plaac-results-with-prot-seqs.tsv
# clean up
rm ${mod_GO_ID}-wanted-prot-IDs.tmp ${mod_GO_ID}-wanted-prots.faa.tmp ${mod_GO_ID}-seq-col.tmp ${mod_GO_ID}-wanted-prots-ordered.faa.tmp
### doing bacteria ###
# getting list from plaac table so can use it to reorder them for sticking together at end
cut -f 1 ${bacteria_plaac_file} | tail -n +2 > ${mod_GO_ID}-wanted-prot-IDs.tmp
bit-parse-fasta-by-headers -i ${bacteria_faa_file} -w ${mod_GO_ID}-wanted-prot-IDs.tmp -o ${mod_GO_ID}-wanted-prots.faa.tmp
# ordering so can stick together
bit-reorder-fasta -i ${mod_GO_ID}-wanted-prots.faa.tmp -w ${mod_GO_ID}-wanted-prot-IDs.tmp -o ${mod_GO_ID}-wanted-prots-ordered.faa.tmp
# making new column with just seqs
printf "protein_seq\n" > ${mod_GO_ID}-seq-col.tmp
grep -v "^>" ${mod_GO_ID}-wanted-prots-ordered.faa.tmp >> ${mod_GO_ID}-seq-col.tmp
# adding to plaac output tab
paste ${bacteria_plaac_file} ${mod_GO_ID}-seq-col.tmp > ${mod_GO_ID}/${mod_GO_ID}-bacteria-plaac-results-with-prot-seqs.tsv
# clean up
rm ${mod_GO_ID}-wanted-prot-IDs.tmp ${mod_GO_ID}-wanted-prots.faa.tmp ${mod_GO_ID}-seq-col.tmp ${mod_GO_ID}-wanted-prots-ordered.faa.tmp
### doing eukarya ###
# getting list from plaac table so can use it to reorder them for sticking together at end
cut -f 1 ${eukarya_plaac_file} | tail -n +2 > ${mod_GO_ID}-wanted-prot-IDs.tmp
bit-parse-fasta-by-headers -i ${eukarya_faa_file} -w ${mod_GO_ID}-wanted-prot-IDs.tmp -o ${mod_GO_ID}-wanted-prots.faa.tmp
# ordering so can stick together
bit-reorder-fasta -i ${mod_GO_ID}-wanted-prots.faa.tmp -w ${mod_GO_ID}-wanted-prot-IDs.tmp -o ${mod_GO_ID}-wanted-prots-ordered.faa.tmp
# making new column with just seqs
printf "protein_seq\n" > ${mod_GO_ID}-seq-col.tmp
grep -v "^>" ${mod_GO_ID}-wanted-prots-ordered.faa.tmp >> ${mod_GO_ID}-seq-col.tmp
# adding to plaac output tab
paste ${eukarya_plaac_file} ${mod_GO_ID}-seq-col.tmp > ${mod_GO_ID}/${mod_GO_ID}-eukarya-plaac-results-with-prot-seqs.tsv
# clean up
rm ${mod_GO_ID}-wanted-prot-IDs.tmp ${mod_GO_ID}-wanted-prots.faa.tmp ${mod_GO_ID}-seq-col.tmp ${mod_GO_ID}-wanted-prots-ordered.faa.tmp
```