Getting plaac results and AA seqs for specific GO terms from ISS data

--- tags: prions-fo-life title: Getting plaac results and AA seqs for specific GO terms from ISS data --- # Getting plaac results and AA seqs for specific GO terms from ISS data > This takes off from what was explained on [this page](https://hackmd.io/@astrobiomike/annotating-and-running-plaac-on-genomes). [toc] Working with data from 1-Jun-2021, UniProt "Standard" proteomes only. The files used below are from [this 'ISS-genomes-work' drive](https://drive.google.com/drive/u/0/folders/10Le_AIfZUCQA74eDMgjsQAAIfDwljvDV ), and the results of the following work were placed in there. The code below depends on that directory structure being matched. ## Purpose Prompted by Tom's [message](https://prionprion.slack.com/archives/DE4C8952L/p1640124746010400) to me about this: > *these are GOs from the biofilm/ISS/ARIA project: GO:0015628, GO:0098776 Would you be able to send me PLAAC results for them? I need sequences of Full length/PrD/core score* > *From file Combined-GLDS-genome-plaac-positive-GO-term-info* So this code is to take a given GO term, then grab all the plaac positive proteins and their plaac results that were annotated with that GO term. ## Conda env This conda environment holds any required stuff. ```bash conda create -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit conda activate bit ``` ## Doing an example As written, this expects the directory structure as laid out in [this google drive](https://drive.google.com/drive/u/0/folders/10Le_AIfZUCQA74eDMgjsQAAIfDwljvDV), and us calling the program from that main directory with all the GLDS-* subdirectories. Script is below and in the 'helper-scripts/' subdirectory of the drive. ### GO:0015628 ```bash bash helper-scripts/get-protein-info-for-specific-GO.sh GO:0015628 ``` For this particular one, which has 17, it takes about 10 seconds, but will take longer if there are more proteins associated with a specific GO term. When it's done, it'll print something like this to the terminal: ``` Info for 17 proteins annotated with GO:0015628 written in directory: GO_0015628-plaac-positive-protein-info/ ``` And the new directory holds the table with columns protein_ID, COREscore, PRDaa, and protein_seq: ``` protein_ID COREscore PRDaa protein_seq GLDS-263-PISK00000000-Acinetobacter_radioresistens_973 35.572 GQSASSSSSSSSSNSNNRSNSINNLISNNQNSSSTTGSSSSSFSTPSINLNGNSSNSQNSSITSFNG MAFLNHQRPLWALLAAAPLIASVSTHVQAQTWKINLRDADLTAFINEVADITGKNFAVDPRVRGNVTVISNKPLNKDEVYDLFLGVLNVNGVVALPSGNTVKLVPDSNAKNSGIPYDARSRAGGDQIVTRVIWLQNTNPNDLIPALRPLMPQFAHLAAVAGTNALIVSDRATNIYQLENIVRNLDGTGQNDIEAISLQSSQAEEVIGLLETMSATGASKDFIGSRVRIIADNRTNRILIKGDPDSRKRLRQMVEMLDVPSADRLGGLKVFRLKYASAKNLAEILQGLVTGQSASSSSSSSSSNSNNRSNSINNLISNNQNSSSTTGSSSSSFSTPSINLNGNSSNSQNSSITSFNGAGISIIADPTQNALVVKAEPQLMREVEAAIQQLDIRRQQVLIEAAIIEVSGDDADQLGIQWALGDLSSGVGLISFSNVGASLASIAAGYATGGASGAAAAIAGDANKGNGATLGIGNFENSRKAYGALIQALKTNTKSNFLSTPSIVTMDNEEAYIVVGQNVPFVTGSVSTGTSGTVNPYTTVERKDVGVTLKVIPHIGDNGTVRLEVEQEVSDVQNNKGQATDLVTNKRAIKTAVLAEHGQTVVLGGLIADNTSLSRQGIPGLSSIPYLGRLFRADARSNIKRNLLVFIHPTIVGDANDVRRISQQRYNQLYSLQLSMDKNGNFAKLPENVADIYNPQAPVVNSPYQKVPTATQNKAVVVTTPVAVEAPQVQQKTVQLPAKETERSKNTVTTTTIRPASSQ GLDS-361-F8_7S_12B-Acinetobacter_pittii_249 44.968 SSSNSNNNSSNSSNPINNLMGNNQNSSSNTSGSNGSSISTPSINLNGNSNNNNQNSISSFSQNG MALLNHQRPLWALLAAAPLIATVSSSVYAQTWKINLRDADLTAFINEVADITGKNFAVDPRVRGNVTVISNKPLNKNEVYDLFLGVLNVNGVVAIPSGNTIKLVPDSNVKNSGIPYDSRNRLRGDQIVTRVIWLENTNPNDLIPALRPLMPQFAHMAAIAGTNALIVSDRAANIYQLENIIRNLDGTGQNDIEAISLQSSQAEEIITQLEAMSATGASKDFNGARIRIIADNRTNRILVKGDPETRKRIRHMIEMLDVPSADRLGGLKVFRLKYASAKNLSEILQGLVTGQAVSSSNSNNNSSNSSNPINNLMGNNQNSSSNTSGSNGSSISTPSINLNGNSNNNNQNSISSFSQNGVSIIADNAQNSLVVKADPQLMREIESAIQQLDVRRQQVLIEAAIIEVSGDDADQLGIQWALGDLSSGIGLLSFSNVGASLSSIAAGYLSGGSAGAASAIAGGANKGNGATLALGNFENSRKAYGALIQALKTNTKSNLLSTPSIVTMDNEEAYIVVGQNVPFVTGSVTTNSTGINPYTTVERKDVGVTLKVVPHIGEGGTVRLEVEQEVSAVQDSRGQAADLVTSKRAIKTAVLAEHGQTVVLGGLVSDDTSLSRQGIPGLSSIPYVGRLFRSDNRSNVKRNLLVFIHPTIVGDANDVRRLSQQRYSQLYSLQLAMDKNGNFAKLPEQVDDVYNQKMTPPSIASQPKNYQQVPSGGKTSTVTTPVAVEPAVQKQTLHLPDPEVNRTKNTVTTTTLRPSTAQ ... ``` ## Script ```bash= #!/usr/bin/env bash set -e ## see this page for info: https://hackmd.io/@astrobiomike/getting-prot-and-plaac-info-for-specific-GO-terms ## expects 1 positional argument as a GO term, e.g. GO:0015628 ## example usage: bash helper-scripts/get-protein-info-for-specific-GO.sh GO:0015628 # removing ":" from input GO term ID target_go_file_prefix=$(echo ${1} | tr ":" "_") # some vars GO_to_protein_map_file="combined/Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tsv" output_dir="${target_go_file_prefix}-plaac-positive-protein-info" # creating output dir mkdir -p ${output_dir} # getting protein IDs with this GO term grep "${1}" ${GO_to_protein_map_file} | cut -f 1 > ${target_go_file_prefix}-target-proteins.tmp # capturing number to report num_prots=$(wc -l ${target_go_file_prefix}-target-proteins.tmp | sed 's/^ *//' | tr -s " " "\t" | cut -f 1) # getting unique genome files we need to search (cutting off protein ID number from end of seq) sed 's/_[0-9]*$//' ${target_go_file_prefix}-target-proteins.tmp | sort -u > ${target_go_file_prefix}-target-genomes.tmp # looping through target genomes and to grab wanted protein seqs and wanted plaac result info for genome in $(cat ${target_go_file_prefix}-target-genomes.tmp) do curr_GLDS=$(echo ${genome} | cut -f 1,2 -d "-") ### getting wanted protein seqs bit-parse-fasta-by-headers -i ${curr_GLDS}/gene-calls/${genome}-gene-calls.faa -w ${target_go_file_prefix}-target-proteins.tmp -o ${target_go_file_prefix}-${genome}-prots.faa.tmp ### getting wanted plaac results # making sure output file isn't there already, as we're appending rm -rf ${target_go_file_prefix}-${genome}-plaac-results.tmp for protein_ID in $(cat ${target_go_file_prefix}-target-proteins.tmp) do grep -w -m 1 "^${protein_ID}" ${curr_GLDS}/plaac-results/${genome}-plaac-out.tsv | cut -f 1,12,26 >> ${target_go_file_prefix}-${genome}-plaac-results.tmp done done # combining protein files cat ${target_go_file_prefix}-*.faa.tmp > ${output_dir}/${target_go_file_prefix}-plaac-positive-AA-seqs.faa.tmp rm ${target_go_file_prefix}-*.faa.tmp # combining plaac results output printf "protein_ID\tCOREscore\tPRDaa\n" > ${output_dir}/${target_go_file_prefix}-plaac-positive-slimmed-results.tmp cat ${target_go_file_prefix}-*-plaac-results.tmp >> ${output_dir}/${target_go_file_prefix}-plaac-positive-slimmed-results.tmp rm ${target_go_file_prefix}-*-plaac-results.tmp # clean up rm ${target_go_file_prefix}-target-proteins.tmp ${target_go_file_prefix}-target-genomes.tmp # reording seqs (to be sure) and sticking together tail -n +2 ${output_dir}/${target_go_file_prefix}-plaac-positive-slimmed-results.tmp | cut -f 1 > ${target_go_file_prefix}-seqs-wanted-order.tmp bit-parse-fasta-by-headers -i ${output_dir}/${target_go_file_prefix}-plaac-positive-AA-seqs.faa.tmp -w ${target_go_file_prefix}-seqs-wanted-order.tmp -o ${target_go_file_prefix}-AA-seqs-in-wanted-order.faa.tmp # getting just seqs and sticking to output plaac table printf "protein_seq\n" > ${target_go_file_prefix}-just-AA-seqs.tmp grep -v ">" ${target_go_file_prefix}-AA-seqs-in-wanted-order.faa.tmp >> ${target_go_file_prefix}-just-AA-seqs.tmp paste ${output_dir}/${target_go_file_prefix}-plaac-positive-slimmed-results.tmp ${target_go_file_prefix}-just-AA-seqs.tmp > ${output_dir}/${target_go_file_prefix}-plaac-positive-info.tsv # clean up rm ${target_go_file_prefix}-just-AA-seqs.tmp ${target_go_file_prefix}-seqs-wanted-order.tmp ${output_dir}/${target_go_file_prefix}-plaac-positive-AA-seqs.faa.tmp rm ${target_go_file_prefix}-AA-seqs-in-wanted-order.faa.tmp ${output_dir}/${target_go_file_prefix}-plaac-positive-slimmed-results.tmp printf "\n Info for ${num_prots} proteins annotated with ${1} written in directory:\n" printf " ${output_dir}/\n\n" ```