---
tags: prions-fo-life
title: Annotating and running plaac on genomes
---
# Annotating and running plaac on genomes
[toc]
---
> Mike you're figuring this out on kuat here: `~/nasa/prions-fo-life/ISS-isolates/testing`
---
# Running through process with a GLDS dataset
## Downloading the genomes
Processing here with GLDS-311. Used scripts are below this section.
```bash
curl -L -o GLDS-311-files.json https://genelab-data.ndc.nasa.gov/genelab/data/glds/files/311
```
And parsing it:
```bash
python ../helper-scripts/filenames-json-to-tsv.py GLDS-311-files.json GLDS-311-files.tsv
```
### Getting metadata
```bash
curl -L -o GLDS-311-meta.json https://genelab-data.ndc.nasa.gov/genelab/data/glds/meta/311
```
Trying to parse that was quickly becoming a nightmare for me, so doing things manually for the metadata.
#### Getting info tables
Downloaded the "Samples" table and the "Assays/Measurements" table. Since these are comma-delimited, and they have commas in certain fields, they are not friendly to parsing at the command line. So i selected specific columns to be downloaded. For the "Samples" table: Source Name, Sample Name, and Organism (these are what was provided by the submitter). For the "Assays/Measurements" table, Sample Name, NCBI Assembly Accession, and Derived Data File.
Cleaning up and combining into one tsv (this will need to be done differently for each GLDS based on what's in these tables):
```bash
# making sure they are in the same order
diff <( tr -d '"' < GLDS-311_samples_export.csv | cut -f 2 -d "," ) <( tr -d '"' < GLDS-311_genome_sequencing_export.csv | cut -f 1 -d "," )
# combining
tr -d '"' < GLDS-311_samples_export.csv | tr "," "\t" | sed 's/Characteristics: //' | cut -f 1-3 | sed 's/ $//' > p1.tmp
tr -d '"' < GLDS-311_genome_sequencing_export.csv | tr "," "\t" | cut -f 2-3 | sed 's/Parameter Value: //' > p2.tmp
paste p1.tmp p2.tmp | sed \$d > GLDS-311-genome-info.tsv
rm p1.tmp p2.tmp
```
### Building download commands
Doing it with shorter and more helpful filenames. This will need to be done specific for each GLDS depending on the info there and format.
```bash
# getting info in same order as our links
for i in $(tail -n +2 GLDS-311-files.tsv | cut -f 1)
do
grep ${i} GLDS-311-genome-info.tsv | cut -f 1,3 | tr " " "_" | tr "\t" "-" | sed 's/^/GLDS-311-/' | sed 's/$/.fasta.gz/'
done > wanted-filenames.tmp
# making sure they are all unique (should be the same number)
sort -u wanted-filenames.tmp | wc -l
tail -n +2 GLDS-311-files.tsv | cut -f 1 | sort -u | wc -l
paste -d " " <( sed 's/^/curl -L -o /' wanted-filenames.tmp ) <( tail -n +2 GLDS-311-files.tsv | cut -f 2 ) > dl-GLDS-311-genomes.sh
```
Adding our file names to summary file:
```bash
paste <( cat <( printf "our_file_name\n" ) wanted-filenames.tmp ) GLDS-311-genome-info.tsv > t && mv t GLDS-311-genome-info.tsv
rm *.tmp
```
### Downloading
```bash
bash dl-GLDS-311-genomes.sh
mkdir genomes
mv *.fasta.gz genomes
gunzip genomes/*.gz
```
## Running process with scripts
First making a list of unique IDs:
```bash
ls genomes/*.fasta | cut -f 2 -d "/" | cut -f 1 -d "." > unique-genome-IDs.txt
```
Scripts are below, but are utilized like so:
```bash
bash ../helper-scripts/call-genes-on-proks.sh unique-genome-IDs.txt
bash ../helper-scripts/run-plaac-on-genomes.sh unique-genome-IDs.txt
bash ../helper-scripts/run-interproscan-on-genomes.sh unique-genome-IDs.txt
# took 30 mins with this number of prots
## getting plaac-positive proteins
grep -hv "#" plaac-results/*-plaac-out.tsv \
| awk -F $'\t' ' { if ( $12 != "NaN" ) print $1 } ' \
| sed '/^SEQid$/d' > GLDS-311-genome-plaac-positive-protein-IDs.txt
## getting GO annotations
python ../helper-scripts/get-GO-annots.py \
-w GLDS-311-genome-plaac-positive-protein-IDs.txt \
-i interproscan-results/all-genes-interpro-out.tsv \
-o GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tmp
## propagating parent terms
python ../helper-scripts/add-GO-parents.py \
-i GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tmp \
-o GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tsv
rm GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tmp
```
This table looks like this:
```
protein_ID GO_terms
GLDS-311-JC-0128-Pantoea_brenneri_2122 NA
GLDS-311-JC-0128-Pantoea_brenneri_2164 GO:0140694;GO:0071840;GO:0008150;GO:0016043;GO:0009424;GO:0070925;GO:0044781;GO:0005198;GO:0030030;GO:0006996;GO:0110165;GO:0003674;GO:0005575;GO:0022607;GO:0009987;GO:0044780;GO:0030031
GLDS-311-JC-0128-Pantoea_brenneri_2285 GO:0006928;GO:0008150;GO:0048870;GO:0003774;GO:0001539;GO:0040011;GO:0110165;GO:0071973;GO:0003674;GO:0005575;GO:0097588;GO:0009987;GO:0009431
GLDS-311-JC-0128-Pantoea_brenneri_2996 GO:0005515;GO:0097367;GO:0042834;GO:0005488;GO:0003674;GO:0005539
GLDS-311-JC-0129-Pantoea_brenneri_464 GO:0006928;GO:0008150;GO:0048870;GO:0003774;GO:0001539;GO:0040011;GO:0110165;GO:0071973;GO:0003674;GO:0005575;GO:0097588;GO:0009987;GO:0009431
GLDS-311-JC-0129-Pantoea_brenneri_2324 NA
GLDS-311-JC-0129-Pantoea_brenneri_2418 GO:0140694;GO:0071840;GO:0008150;GO:0016043;GO:0009424;GO:0070925;GO:0044781;GO:0005198;GO:0030030;GO:0006996;GO:0110165;GO:0003674;GO:0005575;GO:0022607;GO:0009987;GO:0044780;GO:0030031
GLDS-311-JC-0129-Pantoea_brenneri_3069 GO:0005515;GO:0097367;GO:0042834;GO:0005488;GO:0003674;GO:0005539
GLDS-311-JC-0130-Pantoea_brenneri_98 GO:0006928;GO:0008150;GO:0048870;GO:0003774;GO:0001539;GO:0040011;GO:0110165;GO:0071973;GO:0003674;GO:0005575;GO:0097588;GO:0009987;GO:0009431
```
Making a table of all GO terms in plaac-positive proteins here:
```bash
tail -n +2 GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tsv \
| cut -f 2 | grep -v "NA" | tr ";" "\n" \
| sort -u > GLDS-311-GO-terms-in-genome-plaac-positive-proteins.txt
python ../helper-scripts/get-GO-general-info.py \
-g GLDS-311-GO-terms-in-genome-plaac-positive-proteins.txt \
-o GLDS-311-genome-plaac-positive-GO-term-info.tsv
```
And this looks like this:
```
go_term namespace name depth
GO:0001539 biological_process cilium or flagellum-dependent cell motility 4
GO:0003674 molecular_function molecular_function 0
GO:0003774 molecular_function cytoskeletal motor activity 1
GO:0005198 molecular_function structural molecule activity 1
GO:0005215 molecular_function transporter activity 1
GO:0005488 molecular_function binding 1
GO:0005515 molecular_function protein binding 2
GO:0005539 molecular_function glycosaminoglycan binding 3
GO:0005575 cellular_component cellular_component 0
```
# Making a combined table of all GLDS datasets run
```bash
mkdir combined
cd combined
ls ../*/*-genome-plaac-positive-proteins-and-GO-terms.tsv | sed 's/^/# /'
# ../GLDS-262/GLDS-262-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-263/GLDS-263-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-291/GLDS-291-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-298/GLDS-298-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-300/GLDS-300-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-302/GLDS-302-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-303/GLDS-303-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-311/GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-361/GLDS-361-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-64/GLDS-64-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-67/GLDS-67-genome-plaac-positive-proteins-and-GO-terms.tsv
printf "protein_ID\tGO_terms\n" > Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tmp
for file in $(ls ../*/*-genome-plaac-positive-proteins-and-GO-terms.tsv)
do
tail -n +2 ${file} >> Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tmp
done
mv Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tmp Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tsv
```
Making a table of all GO terms in plaac-positive proteins here:
```bash
tail -n +2 Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tsv \
| cut -f 2 | grep -v "NA" | tr ";" "\n" \
| sort -u > Combined-GLDS-GO-terms-in-genome-plaac-positive-proteins.txt
python ../helper-scripts/get-GO-general-info.py \
-g Combined-GLDS-GO-terms-in-genome-plaac-positive-proteins.txt \
-o Combined-GLDS-genome-plaac-positive-GO-term-info.tsv
```
Adding a column that has counts of how many proteins have that GO term:
```bash
for term in $(cut -f 1 Combined-GLDS-genome-plaac-positive-GO-term-info.tsv | tail -n +2)
do
grep -w -c "${term}" Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tsv
done > prot_counts.tmp
paste Combined-GLDS-genome-plaac-positive-GO-term-info.tsv \
<( cat <( printf "num_prots_with\n" ) prot_counts.tmp ) > t
mv t Combined-GLDS-genome-plaac-positive-GO-term-info.tsv
# sorting by counts
head -n 1 Combined-GLDS-genome-plaac-positive-GO-term-info.tsv > header.tmp
tail -n +2 Combined-GLDS-genome-plaac-positive-GO-term-info.tsv | sort -nrk 5 > data.tmp
cat header.tmp data.tmp > Combined-GLDS-genome-plaac-positive-GO-term-info.tsv
rm *.tmp
head Combined-GLDS-genome-plaac-positive-GO-term-info.tsv | column | sed 's/^/# /'
# go_term namespace name depth num_prots_with
# GO:0003674 molecular_function molecular_function 0 287
# GO:0005488 molecular_function binding 1 199
# GO:0008150 biological_process biological_process 0 182
# GO:0005575 cellular_component cellular_component 0 118
# GO:0016020 cellular_component membrane 2 52
# GO:0040011 biological_process locomotion 1 32
# GO:0051179 biological_process localization 1 23
# GO:0046903 biological_process secretion 4 23
# GO:0006810 biological_process transport 3 23
```
---
### Scripts
**filenames-json-to-tsv.py**
```python=
import json
import sys
with open(sys.argv[1], "r") as json_file:
data = json.load(json_file)
curr_study = list(data['studies'].keys())[0]
# this is a list of dictionaries
file_data = data['studies'][curr_study]['study_files']
link_prefix = "https://genelab-data.ndc.nasa.gov"
# now we will open a file for writing
with open(sys.argv[2], 'w') as outfile:
# starting header
outfile.write("file_name\turl\n")
for entry in file_data:
# making sure it is a fasta.gz (likely to be the genome files we're trying to pull; though some have these other extensions
possible_genome_fasta_extentions = [".fasta.gz", ".fasta", ".fsa_nt.gz", ".fna.gz"]
if entry['file_name'].endswith(tuple(possible_genome_fasta_extentions)):
outfile.write(str(entry['file_name']) + "\t" + link_prefix + str(entry['remote_url']) + "\n")
```
**call-genes-on-proks.sh**
```bash=
#!/usr/bin/env bash
set -e
# directory vars
genomes_dir="genomes"
gene_calls_dir="gene-calls"
mkdir -p ${gene_calls_dir}
printf "\n Calling genes...\n\n"
for genome in $(cat $1)
do
printf "\n ${genome}\n"
# predicting genes
source activate prodigal
prodigal -q -c -a ${gene_calls_dir}/${genome}-gene-calls.tmp \
-o ${gene_calls_dir}/${genome}-gene-calls.gbk \
-i ${genomes_dir}/${genome}.fasta
# need to remove "*" characters to work with interproscan
tr -d "*" < ${gene_calls_dir}/${genome}-gene-calls.tmp > ${gene_calls_dir}/${genome}-gene-calls.faa
conda deactivate
# cleaning up headers
source activate bit
bit-rename-fasta-headers -i ${gene_calls_dir}/${genome}-gene-calls.faa -w ${genome} \
-o ${gene_calls_dir}/${genome}-gene-calls.tmp
mv ${gene_calls_dir}/${genome}-gene-calls.tmp ${gene_calls_dir}/${genome}-gene-calls.faa
conda deactivate
done
printf "\n"
```
**run-plaac-on-genomes.sh**
```bash=
#!/usr/bin/env bash
set -e
# directory vars
gene_calls_dir="gene-calls"
plaac_dir="plaac-results"
mkdir -p ${plaac_dir}
printf "\n Running plaac...\n\n"
for genome in $(cat $1)
do
printf "\n ${genome}\n"
# predicting genes
source activate plaac
java -jar ~/bin/plaac/cli/target/plaac.jar -i ${gene_calls_dir}/${genome}-gene-calls.faa -a 0 > ${plaac_dir}/${genome}-plaac-out.tsv
conda deactivate
done
printf "\n"
```
**run-interproscan-on-genomes.sh**
```bash=
#!/usr/bin/env bash
set -e
# directory vars
gene_calls_dir="gene-calls"
interproscan_dir="interproscan-results"
mkdir -p ${interproscan_dir}
printf "\n Running interproscan...\n\n"
# first combining all input fastas into one file so interpro scan only needs to load refs once, then will parse after
rm -rf all-genes.faa.tmp temp/
for genome in $(cat $1)
do
cat ${gene_calls_dir}/${genome}-gene-calls.faa >> all-genes.faa.tmp
done
source activate interproscan
interproscan.sh --cpu 40 --goterms --disable-residue-annot -f tsv -i all-genes.faa.tmp -o ${interproscan_dir}/all-genes-interpro-out.tsv 2> ${interproscan_dir}/interproscan-run-stderr.log
rm -rf all-genes.faa.tmp temp/
conda deactivate
rmdir temp/
printf "\n"
```
**get-GO-annots.py**
```bash=
#!/usr/bin/env python
import argparse
parser = argparse.ArgumentParser(description='This takes a list of protein IDs and pulls out their GO annotations from the interproscan output.')
required = parser.add_argument_group('required arguments')
required.add_argument("-w", "--wanted-protein-IDs", help="Single-column file with wanted protein IDs", action="store", required=True)
required.add_argument("-i", "--interproscan-table", help='The output table from interproscan', action="store", required=True)
parser.add_argument("-o", "--output-tsv", help='Default: "out.tsv"', action="store", default="out.tsv")
args = parser.parse_args()
# creating dictionary to hold stuff (keys = protein IDs; values = list of associated GO terms)
targets_dict = {}
for line in open(args.wanted_protein_IDs):
line = line.strip()
targets_dict[line] = []
# populating dictionary
for line in open(args.interproscan_table):
line = line.strip().split("\t")
if line[0] in targets_dict.keys() and len(line) > 13:
if line[13] != "-":
if len(line[13].split("|")) > 1:
for term in line[13].split("|"):
targets_dict[line[0]].append(term)
else:
targets_dict[line[0]].append(line[13])
# writing out
with open(args.output_tsv, "w") as out_tab:
out_tab.write("protein_ID\tGO_terms\n")
for prot_ID in targets_dict:
if len(targets_dict[prot_ID]) == 0:
out_tab.write(str(prot_ID) + "\t" + "NA" + "\n")
else:
out_tab.write(str(prot_ID) + "\t" + str(";".join(list(set(targets_dict[prot_ID]))) + "\n"))
```
**add-GO-parents.py**
```bash=
#!/usr/bin/env python
## this is modified from bit-get-go-term-info (github.com/AstrobioMike/bit), it needs to be run in an environment with that package active
# lots of this is from this great tutorial: GO Tutorial in Python - Solutions.ipynb, which comes from here: http://gohandbook.org/doku.php ; https://nbviewer.jupyter.org/urls/dessimozlab.github.io/go-handbook/GO%20Tutorial%20in%20Python%20-%20Solutions.ipynb
from goatools import obo_parser
import os
import argparse
import pandas as pd
import sys
import contextlib
parser = argparse.ArgumentParser(description="Add the parents of GO terms to the \
'*-genome-plaac-positive-proteins-and-GO-terms.tmp' files.\
Needs to be run in an environment with the bit package active \
(github.com/AstrobioMike/bit).")
required = parser.add_argument_group('required arguments')
required.add_argument("-i", "--input-tab", help='Input tab (', action="store", required=True)
parser.add_argument("-g", "--GO_obo_file", help='GO obo file to use (e.g. from: geneontology.org/docs/download-ontology/). By default will \
use "go-basic.obo". "goslim_metagenomics.obo" is also a pre-packaged option (enter `-g goslim_metagenomics` to specify it). Or \
a different obo-formatted file can be specified here.',
action="store", dest="obo", default="go_basic")
parser.add_argument("-o", "--output-tsv", help='Default: "out.tsv"', action="store", dest="output_tsv", default="out.tsv")
if len(sys.argv)==1:
parser.print_help(sys.stderr)
sys.exit(0)
args = parser.parse_args()
### checking and setting up obo file location
go_data_dir = os.environ["GO_DB_DIR"]
if args.obo == "goslim_metagenomics":
go_obo = go_data_dir + "goslim_metagenomics.obo"
elif args.obo == "go_basic":
go_obo = go_data_dir + "go-basic.obo"
else:
go_obo = args.obo
## loading GO database
go = obo_parser.GODag(go_obo)
# reading in starting tab
in_tab = pd.read_csv(args.input_tab, sep = "\t")
# iterating over rows
for index, row in in_tab.iterrows():
curr_GO_terms = row.GO_terms
if pd.isna(curr_GO_terms):
continue
curr_GO_terms_list = curr_GO_terms.split(";")
# initializing new output list that will include parents
new_GO_term_list = []
for curr_GO_term in curr_GO_terms_list:
# getting parents
try:
curr_GO_term_obj = go[curr_GO_term]
curr_parents_list = list(curr_GO_term_obj.get_all_parents())
new_GO_term_list += curr_parents_list
new_GO_term_list.append(curr_GO_term)
except:
new_GO_term_list.append(curr_GO_term)
# set used to remove any dupes that may have been introduced
new_GO_term_list = list(set(new_GO_term_list))
# replacing GO_terms entry of current row to now include the parent GO terms
row.GO_terms = ";".join(new_GO_term_list)
# writing out
in_tab.to_csv(args.output_tsv, index = False, sep = "\t", na_rep = "NA")
```
**get-GO-general-info.py**
```python=
#!/usr/bin/env python
from goatools import obo_parser
import os
import argparse
import pandas as pd
import sys
parser = argparse.ArgumentParser(description='Writes out a table of GO info for a provided list of GO terms. It relies on `bit` being installed in current environment (https://github.com/AstrobioMike/bit#bioinformatics-tools-bit).')
required = parser.add_argument_group('required arguments')
required.add_argument("-g", "--input-go-terms", help='Single-column list of target GO terms', action="store", required=True)
parser.add_argument("-o", "--output-tsv", help='Output table (default: "out.tsv")', action="store", dest="output_tsv", default="out.tsv")
if len(sys.argv)==1:
parser.print_help(sys.stderr)
sys.exit(0)
args = parser.parse_args()
### setting GO obo file location – this program relies on `bit` being installed in current environment (https://github.com/AstrobioMike/bit#bioinformatics-tools-bit)
go_data_dir = os.environ["GO_DB_DIR"]
go_obo = go_data_dir + "go-basic.obo"
## helper function
def get_general_info(go_id):
go_term = go[go_id]
name = go_term.name
namespace = go_term.namespace
depth = go_term.depth
go_term_line = str(go_id) + "\t" + str(namespace) + "\t" + str(name) + "\t" + str(depth)
return go_term_line
## loading GO database
print("\n\tGO obo file being used:")
go = obo_parser.GODag(go_obo, load_obsolete = True)
print("")
# output header
header = ["go_term", "namespace", "name", "depth"]
# iterating through wanted IDs, and writing general info to output table
with open(args.output_tsv, "w") as out_tab:
out_tab.write(str("\t".join(header)) + "\n")
with open(args.input_go_terms) as wanted:
for line in wanted:
go_term = line.strip()
curr_go_info = get_general_info(go_term)
out_tab.write(str(curr_go_info) + "\n")
```
---
# Running through process with one genome
Grabbing one from [GLDS-303](https://genelab-data.ndc.nasa.gov/genelab/accession/GLDS-303/)
## Getting genome nucleotide fasta
```bash
wget https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz?version=1
mv GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz\?version\=1 GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz
gunzip GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz
```
## Predicting genes
This one is a prokaryote, so using prodigal.
```bash
conda activate prodigal
prodigal -q -c -a GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp \
-o GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.gbk \
-i GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta
# need to remove "*" characters to work with interproscan
tr -d "*" < GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp > GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa
conda deactivate
```
### Cleaning up headers
```bash
conda activate bit
bit-rename-fasta-headers -i GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa \
-w GLDS-303_wgs_contigs_JC-0222_S1153_L001 \
-o GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp
mv GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa
conda deactivate
```
## Running plaac
```bash
conda activate plaac
java -jar ~/bin/plaac/cli/target/plaac.jar -i GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa -a 0 > GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-out.tsv
conda deactivate
```
## Annotating with Interproscan
I'm not entirely sure which things need to be run in order to get GO terms. All I can find is it links the IPR IDs to GO terms, but nothing specifically gives IPR IDs. I think they might come from matches to the other databases. So, just letting all run.
```bash
interproscan.sh --cpu 10 --goterms --disable-residue-annot -f tsv -i GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa -o GLDS-303_wgs_contigs_JC-0222_S1153_L001-interpro-out.tsv
# took 13 minutes on ~5,000 proteins
```
> **NOTE**
> It will likely be much faster to get all genome proteins together in one file, run this annotation once, and then parse the output based on genomes. Due to renaming headers above, they will all have their genome ID in them followed by _#.
## Getting plaac positive protein IDs
```bash
grep -v "#" GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-out.tsv \
| awk -F $'\t' ' { if ( $12 != "NaN" ) print $1 } ' \
> GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-prot-IDs.txt
```
## Getting GO annotations that exist for the plaac positives
I should write something in python to do this part.
```bash
# starting output file
printf "protein_ID\tGO_terms\n" > GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-proteins-and-GO-annots.tsv
for ID in $(cat GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-prot-IDs.txt)
do
curr_GO_terms=$(grep "^${ID}" GLDS-303_wgs_contigs_JC-0222_S1153_L001-interpro-out.tsv | cut -f 14 | sed '/^$/d' | sed '/^-$/d' | tr "|" "\n" | sort -u | tr "\n" "," | sed 's/|$/\n/')
printf "${ID}\t${curr_GO_terms}\n" >> GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-proteins-and-GO-annots.tsv
done
```
> **NOTE**
> This will leave us with a list of the GO terms for plaac-positive proteins. Then need to write something to find which hold GO terms that are part of the cellular component namespace.
# Environment setup
```bash
conda install -c conda-forge mamba
```
## Plaac
From https://github.com/whitehead/plaac
The below was done when the latest commit was from 27-Jan-2020 (https://github.com/whitehead/plaac/commit/44ca72882060eec4dc1696ebcf7f4f4185348232)
Needs to be built under python 2, so creating its own environment
```bash
mamba create -n plaac python=2
conda activate plaac
```
```
mkdir -p ~/bin/
cd ~/bin/
wget https://github.com/whitehead/plaac/archive/master.zip
unzip master.zip && rm master.zip
mv plaac-master/ plaac/
cd plaac/cli/
bash build_plaac.sh
```
> If this gives an error like `javac: command not found`, we can try installing java in the same conda environment like so: `conda install -c conda-forge -c bioconda -c defaults java-jdk`
We can test for basic install success with:
```bash
java -jar ~/bin/plaac/cli/target/plaac.jar -h
```
Depending on system, may need to modify the `build_plaac.sh` script to work with updated javac. This is how I've done that in the past. Here is the contents of the modified file, 'new-build-plaac.sh':
```bash
SCRIPT_DIR=`dirname $0`
cd $SCRIPT_DIR
echo cleanup
rm -rf target
mkdir target
echo build
javac -source 1.6 -target 1.6 -d target src/*.java
echo package
cp src/mainClass target/.
cp -r src/util target/.
cd target
jar cmf mainClass plaac.jar *
chmod +x plaac.jar
echo "ok (see target/plaac.jar)"
# also build PLAAC docs
echo "build-docs"
cd -
# generate documentation for command-line parameters, and dotfile for default hmm
java -jar target/plaac.jar -s -d -i dummy -h target/hmm_default.txt > target/plaac_headers.txt
./build_docs.py target/plaac_headers.txt > target/_plaac_headers.haml
echo "ok (see target/_plaac_headers.haml)"
if ! type "dot" > /dev/null; then
echo "dot not found, skipping genereration of hmm_default.png"
else
dot target/hmm_default.txt -Tpng > target/hmm_default.png
echo "ok (see target/hmm_default.png)"
fi
```
If the previous build_plaac.sh command failed, i'd make the above file and then run that one to try, e.g.:
```bash
bash new-build-plaac.sh
```
```bash
java -jar ~/bin/plaac/cli/target/plaac.jar -h
```
Adding an alias for plaac:
```bash
mkdir -p ${CONDA_PREFIX}/etc/conda/activate.d/
echo 'alias plaac="java -jar ~/bin/plaac/cli/target/plaac.jar"' >> ${CONDA_PREFIX}/etc/conda/activate.d/plaac.sh
# making sure it works
conda deactivate
conda activate plaac
plaac -h
conda deactivate
```
## MetaEuk
```bash
mamba create -n metaeuk -c conda-forge -c bioconda -c defaults metaeuk=5.34c21f2
```
## Prodigal
```bash
mamba create -n prodigal -c conda-forge -c bioconda -c defaults prodigal=2.6.3
```
## Interproscan
```bash
conda create -n interproscan -c conda-forge -c bioconda -c defaults interproscan=5.52_86.0
# after conda install, it prints out directions to do the following
wget http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.52-86.0/interproscan-5.52-86.0-64-bit.tar.gz
tar -pxvzf interproscan-5.52-86.0-64-bit.tar.gz
rm -rf /home/mdlee4/miniconda3/envs/interproscan2/share/InterProScan/data/
mv interproscan-5.52-86.0/data/ /home/mdlee4/miniconda3/envs/interproscan2/share/InterProScan
rm -rf interproscan-5.52-86.0/ interproscan-5.52-86.0-64-bit.tar.gz
# test
interproscan.sh -i ${CONDA_PREFIX}/share/InterProScan/test_all_appl.fasta -f tsv
interproscan.sh -i ${CONDA_PREFIX}/share/InterProScan/test_all_appl.fasta -f tsv -dp
```
Setting to automatically delete working directory after finishing (though it only deletes the files, not the temp dir it makes):
```bash
sed 's/delete.temporary.directory.on.completion=false/delete.temporary.directory.on.completion=true/' ${CONDA_PREFIX}/share/InterProScan/interproscan.properties > t && mv t ${CONDA_PREFIX}/share/InterProScan/interproscan.properties
```
## bit
Using this to clean up headers after gene-calling and needed for any scripts that use the GO database.
```bash
mamba create -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit
```