---
tags: prions-fo-life
title: Getting plaac results and AA seqs for specific GO terms from ISS data
---
# Getting plaac results and AA seqs for specific GO terms from ISS data
> This takes off from what was explained on [this page](https://hackmd.io/@astrobiomike/annotating-and-running-plaac-on-genomes).
[toc]
Working with data from 1-Jun-2021, UniProt "Standard" proteomes only. The files used below are from [this 'ISS-genomes-work' drive](https://drive.google.com/drive/u/0/folders/10Le_AIfZUCQA74eDMgjsQAAIfDwljvDV
), and the results of the following work were placed in there. The code below depends on that directory structure being matched.
## Purpose
Prompted by Tom's [message](https://prionprion.slack.com/archives/DE4C8952L/p1640124746010400) to me about this:
> *these are GOs from the biofilm/ISS/ARIA project: GO:0015628, GO:0098776
Would you be able to send me PLAAC results for them? I need sequences of Full length/PrD/core score*
> *From file
Combined-GLDS-genome-plaac-positive-GO-term-info*
So this code is to take a given GO term, then grab all the plaac positive proteins and their plaac results that were annotated with that GO term.
## Conda env
This conda environment holds any required stuff.
```bash
conda create -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit
conda activate bit
```
## Doing an example
As written, this expects the directory structure as laid out in [this google drive](https://drive.google.com/drive/u/0/folders/10Le_AIfZUCQA74eDMgjsQAAIfDwljvDV), and us calling the program from that main directory with all the GLDS-* subdirectories. Script is below and in the 'helper-scripts/' subdirectory of the drive.
### GO:0015628
```bash
bash helper-scripts/get-protein-info-for-specific-GO.sh GO:0015628
```
For this particular one, which has 17, it takes about 10 seconds, but will take longer if there are more proteins associated with a specific GO term. When it's done, it'll print something like this to the terminal:
```
Info for 17 proteins annotated with GO:0015628 written in directory:
GO_0015628-plaac-positive-protein-info/
```
And the new directory holds the table with columns protein_ID, COREscore, PRDaa, and protein_seq:
```
protein_ID COREscore PRDaa protein_seq
GLDS-263-PISK00000000-Acinetobacter_radioresistens_973 35.572 GQSASSSSSSSSSNSNNRSNSINNLISNNQNSSSTTGSSSSSFSTPSINLNGNSSNSQNSSITSFNG MAFLNHQRPLWALLAAAPLIASVSTHVQAQTWKINLRDADLTAFINEVADITGKNFAVDPRVRGNVTVISNKPLNKDEVYDLFLGVLNVNGVVALPSGNTVKLVPDSNAKNSGIPYDARSRAGGDQIVTRVIWLQNTNPNDLIPALRPLMPQFAHLAAVAGTNALIVSDRATNIYQLENIVRNLDGTGQNDIEAISLQSSQAEEVIGLLETMSATGASKDFIGSRVRIIADNRTNRILIKGDPDSRKRLRQMVEMLDVPSADRLGGLKVFRLKYASAKNLAEILQGLVTGQSASSSSSSSSSNSNNRSNSINNLISNNQNSSSTTGSSSSSFSTPSINLNGNSSNSQNSSITSFNGAGISIIADPTQNALVVKAEPQLMREVEAAIQQLDIRRQQVLIEAAIIEVSGDDADQLGIQWALGDLSSGVGLISFSNVGASLASIAAGYATGGASGAAAAIAGDANKGNGATLGIGNFENSRKAYGALIQALKTNTKSNFLSTPSIVTMDNEEAYIVVGQNVPFVTGSVSTGTSGTVNPYTTVERKDVGVTLKVIPHIGDNGTVRLEVEQEVSDVQNNKGQATDLVTNKRAIKTAVLAEHGQTVVLGGLIADNTSLSRQGIPGLSSIPYLGRLFRADARSNIKRNLLVFIHPTIVGDANDVRRISQQRYNQLYSLQLSMDKNGNFAKLPENVADIYNPQAPVVNSPYQKVPTATQNKAVVVTTPVAVEAPQVQQKTVQLPAKETERSKNTVTTTTIRPASSQ
GLDS-361-F8_7S_12B-Acinetobacter_pittii_249 44.968 SSSNSNNNSSNSSNPINNLMGNNQNSSSNTSGSNGSSISTPSINLNGNSNNNNQNSISSFSQNG MALLNHQRPLWALLAAAPLIATVSSSVYAQTWKINLRDADLTAFINEVADITGKNFAVDPRVRGNVTVISNKPLNKNEVYDLFLGVLNVNGVVAIPSGNTIKLVPDSNVKNSGIPYDSRNRLRGDQIVTRVIWLENTNPNDLIPALRPLMPQFAHMAAIAGTNALIVSDRAANIYQLENIIRNLDGTGQNDIEAISLQSSQAEEIITQLEAMSATGASKDFNGARIRIIADNRTNRILVKGDPETRKRIRHMIEMLDVPSADRLGGLKVFRLKYASAKNLSEILQGLVTGQAVSSSNSNNNSSNSSNPINNLMGNNQNSSSNTSGSNGSSISTPSINLNGNSNNNNQNSISSFSQNGVSIIADNAQNSLVVKADPQLMREIESAIQQLDVRRQQVLIEAAIIEVSGDDADQLGIQWALGDLSSGIGLLSFSNVGASLSSIAAGYLSGGSAGAASAIAGGANKGNGATLALGNFENSRKAYGALIQALKTNTKSNLLSTPSIVTMDNEEAYIVVGQNVPFVTGSVTTNSTGINPYTTVERKDVGVTLKVVPHIGEGGTVRLEVEQEVSAVQDSRGQAADLVTSKRAIKTAVLAEHGQTVVLGGLVSDDTSLSRQGIPGLSSIPYVGRLFRSDNRSNVKRNLLVFIHPTIVGDANDVRRLSQQRYSQLYSLQLAMDKNGNFAKLPEQVDDVYNQKMTPPSIASQPKNYQQVPSGGKTSTVTTPVAVEPAVQKQTLHLPDPEVNRTKNTVTTTTLRPSTAQ
...
```
## Script
```bash=
#!/usr/bin/env bash
set -e
## see this page for info: https://hackmd.io/@astrobiomike/getting-prot-and-plaac-info-for-specific-GO-terms
## expects 1 positional argument as a GO term, e.g. GO:0015628
## example usage: bash helper-scripts/get-protein-info-for-specific-GO.sh GO:0015628
# removing ":" from input GO term ID
target_go_file_prefix=$(echo ${1} | tr ":" "_")
# some vars
GO_to_protein_map_file="combined/Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tsv"
output_dir="${target_go_file_prefix}-plaac-positive-protein-info"
# creating output dir
mkdir -p ${output_dir}
# getting protein IDs with this GO term
grep "${1}" ${GO_to_protein_map_file} | cut -f 1 > ${target_go_file_prefix}-target-proteins.tmp
# capturing number to report
num_prots=$(wc -l ${target_go_file_prefix}-target-proteins.tmp | sed 's/^ *//' | tr -s " " "\t" | cut -f 1)
# getting unique genome files we need to search (cutting off protein ID number from end of seq)
sed 's/_[0-9]*$//' ${target_go_file_prefix}-target-proteins.tmp | sort -u > ${target_go_file_prefix}-target-genomes.tmp
# looping through target genomes and to grab wanted protein seqs and wanted plaac result info
for genome in $(cat ${target_go_file_prefix}-target-genomes.tmp)
do
curr_GLDS=$(echo ${genome} | cut -f 1,2 -d "-")
### getting wanted protein seqs
bit-parse-fasta-by-headers -i ${curr_GLDS}/gene-calls/${genome}-gene-calls.faa -w ${target_go_file_prefix}-target-proteins.tmp -o ${target_go_file_prefix}-${genome}-prots.faa.tmp
### getting wanted plaac results
# making sure output file isn't there already, as we're appending
rm -rf ${target_go_file_prefix}-${genome}-plaac-results.tmp
for protein_ID in $(cat ${target_go_file_prefix}-target-proteins.tmp)
do
grep -w -m 1 "^${protein_ID}" ${curr_GLDS}/plaac-results/${genome}-plaac-out.tsv | cut -f 1,12,26 >> ${target_go_file_prefix}-${genome}-plaac-results.tmp
done
done
# combining protein files
cat ${target_go_file_prefix}-*.faa.tmp > ${output_dir}/${target_go_file_prefix}-plaac-positive-AA-seqs.faa.tmp
rm ${target_go_file_prefix}-*.faa.tmp
# combining plaac results output
printf "protein_ID\tCOREscore\tPRDaa\n" > ${output_dir}/${target_go_file_prefix}-plaac-positive-slimmed-results.tmp
cat ${target_go_file_prefix}-*-plaac-results.tmp >> ${output_dir}/${target_go_file_prefix}-plaac-positive-slimmed-results.tmp
rm ${target_go_file_prefix}-*-plaac-results.tmp
# clean up
rm ${target_go_file_prefix}-target-proteins.tmp ${target_go_file_prefix}-target-genomes.tmp
# reording seqs (to be sure) and sticking together
tail -n +2 ${output_dir}/${target_go_file_prefix}-plaac-positive-slimmed-results.tmp | cut -f 1 > ${target_go_file_prefix}-seqs-wanted-order.tmp
bit-parse-fasta-by-headers -i ${output_dir}/${target_go_file_prefix}-plaac-positive-AA-seqs.faa.tmp -w ${target_go_file_prefix}-seqs-wanted-order.tmp -o ${target_go_file_prefix}-AA-seqs-in-wanted-order.faa.tmp
# getting just seqs and sticking to output plaac table
printf "protein_seq\n" > ${target_go_file_prefix}-just-AA-seqs.tmp
grep -v ">" ${target_go_file_prefix}-AA-seqs-in-wanted-order.faa.tmp >> ${target_go_file_prefix}-just-AA-seqs.tmp
paste ${output_dir}/${target_go_file_prefix}-plaac-positive-slimmed-results.tmp ${target_go_file_prefix}-just-AA-seqs.tmp > ${output_dir}/${target_go_file_prefix}-plaac-positive-info.tsv
# clean up
rm ${target_go_file_prefix}-just-AA-seqs.tmp ${target_go_file_prefix}-seqs-wanted-order.tmp ${output_dir}/${target_go_file_prefix}-plaac-positive-AA-seqs.faa.tmp
rm ${target_go_file_prefix}-AA-seqs-in-wanted-order.faa.tmp ${output_dir}/${target_go_file_prefix}-plaac-positive-slimmed-results.tmp
printf "\n Info for ${num_prots} proteins annotated with ${1} written in directory:\n"
printf " ${output_dir}/\n\n"
```