Annotating and running plaac on genomes

Annotating and running plaac on genomes
Running through process with a GLDS dataset
- Downloading the genomes
- Running process with scripts
Making a combined table of all GLDS datasets run
- - Scripts
Running through process with one genome
Environment setup
- Plaac
- MetaEuk
- Prodigal
- Interproscan
- bit

Mike you're figuring this out on kuat here: ~/nasa/prions-fo-life/ISS-isolates/testing

Running through process with a GLDS dataset

Downloading the genomes

Processing here with GLDS-311. Used scripts are below this section.

curl -L -o GLDS-311-files.json https://genelab-data.ndc.nasa.gov/genelab/data/glds/files/311

And parsing it:

python ../helper-scripts/filenames-json-to-tsv.py GLDS-311-files.json GLDS-311-files.tsv

Getting metadata


curl -L -o GLDS-311-meta.json https://genelab-data.ndc.nasa.gov/genelab/data/glds/meta/311

Trying to parse that was quickly becoming a nightmare for me, so doing things manually for the metadata.

Getting info tables

Downloaded the "Samples" table and the "Assays/Measurements" table. Since these are comma-delimited, and they have commas in certain fields, they are not friendly to parsing at the command line. So i selected specific columns to be downloaded. For the "Samples" table: Source Name, Sample Name, and Organism (these are what was provided by the submitter). For the "Assays/Measurements" table, Sample Name, NCBI Assembly Accession, and Derived Data File.

Cleaning up and combining into one tsv (this will need to be done differently for each GLDS based on what's in these tables):

# making sure they are in the same order
diff <( tr -d '"' < GLDS-311_samples_export.csv | cut -f 2 -d "," ) <( tr -d '"' < GLDS-311_genome_sequencing_export.csv | cut -f 1 -d "," )

# combining
tr -d '"' < GLDS-311_samples_export.csv | tr "," "\t" | sed 's/Characteristics: //' | cut -f 1-3 | sed 's/ $//' > p1.tmp

tr -d '"' < GLDS-311_genome_sequencing_export.csv | tr "," "\t" | cut -f 2-3 | sed 's/Parameter Value:  //' > p2.tmp

paste p1.tmp p2.tmp | sed \$d > GLDS-311-genome-info.tsv

rm p1.tmp p2.tmp

Building download commands

Doing it with shorter and more helpful filenames. This will need to be done specific for each GLDS depending on the info there and format.

# getting info in same order as our links
for i in $(tail -n +2 GLDS-311-files.tsv | cut -f 1)
do 

    grep ${i} GLDS-311-genome-info.tsv | cut -f 1,3 | tr " " "_" | tr "\t" "-" | sed 's/^/GLDS-311-/' | sed 's/$/.fasta.gz/'
    
done > wanted-filenames.tmp

# making sure they are all unique (should be the same number)
sort -u wanted-filenames.tmp | wc -l
tail -n +2 GLDS-311-files.tsv | cut -f 1 | sort -u | wc -l

paste -d " " <( sed 's/^/curl -L -o /' wanted-filenames.tmp ) <( tail -n +2 GLDS-311-files.tsv | cut -f 2 ) > dl-GLDS-311-genomes.sh

Adding our file names to summary file:

paste <( cat <( printf "our_file_name\n" ) wanted-filenames.tmp ) GLDS-311-genome-info.tsv > t && mv t GLDS-311-genome-info.tsv

rm *.tmp

Downloading

bash dl-GLDS-311-genomes.sh

mkdir genomes
mv *.fasta.gz genomes
gunzip genomes/*.gz

Running process with scripts

First making a list of unique IDs:

ls genomes/*.fasta | cut -f 2 -d "/" | cut -f 1 -d "." > unique-genome-IDs.txt

Scripts are below, but are utilized like so:

bash ../helper-scripts/call-genes-on-proks.sh unique-genome-IDs.txt

bash ../helper-scripts/run-plaac-on-genomes.sh unique-genome-IDs.txt

bash ../helper-scripts/run-interproscan-on-genomes.sh unique-genome-IDs.txt
  # took 30 mins with this number of prots

## getting plaac-positive proteins
grep -hv "#" plaac-results/*-plaac-out.tsv \
             | awk -F $'\t' ' { if ( $12 != "NaN" ) print $1 } ' \
             | sed '/^SEQid$/d' > GLDS-311-genome-plaac-positive-protein-IDs.txt

## getting GO annotations
python ../helper-scripts/get-GO-annots.py \
       -w GLDS-311-genome-plaac-positive-protein-IDs.txt \
       -i interproscan-results/all-genes-interpro-out.tsv \
       -o GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tmp
       
## propagating parent terms
python ../helper-scripts/add-GO-parents.py \
       -i GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tmp \
       -o GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tsv
       
rm GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tmp

This table looks like this:

protein_ID                              GO_terms
GLDS-311-JC-0128-Pantoea_brenneri_2122  NA
GLDS-311-JC-0128-Pantoea_brenneri_2164  GO:0140694;GO:0071840;GO:0008150;GO:0016043;GO:0009424;GO:0070925;GO:0044781;GO:0005198;GO:0030030;GO:0006996;GO:0110165;GO:0003674;GO:0005575;GO:0022607;GO:0009987;GO:0044780;GO:0030031
GLDS-311-JC-0128-Pantoea_brenneri_2285  GO:0006928;GO:0008150;GO:0048870;GO:0003774;GO:0001539;GO:0040011;GO:0110165;GO:0071973;GO:0003674;GO:0005575;GO:0097588;GO:0009987;GO:0009431
GLDS-311-JC-0128-Pantoea_brenneri_2996  GO:0005515;GO:0097367;GO:0042834;GO:0005488;GO:0003674;GO:0005539
GLDS-311-JC-0129-Pantoea_brenneri_464   GO:0006928;GO:0008150;GO:0048870;GO:0003774;GO:0001539;GO:0040011;GO:0110165;GO:0071973;GO:0003674;GO:0005575;GO:0097588;GO:0009987;GO:0009431
GLDS-311-JC-0129-Pantoea_brenneri_2324  NA
GLDS-311-JC-0129-Pantoea_brenneri_2418  GO:0140694;GO:0071840;GO:0008150;GO:0016043;GO:0009424;GO:0070925;GO:0044781;GO:0005198;GO:0030030;GO:0006996;GO:0110165;GO:0003674;GO:0005575;GO:0022607;GO:0009987;GO:0044780;GO:0030031
GLDS-311-JC-0129-Pantoea_brenneri_3069  GO:0005515;GO:0097367;GO:0042834;GO:0005488;GO:0003674;GO:0005539
GLDS-311-JC-0130-Pantoea_brenneri_98    GO:0006928;GO:0008150;GO:0048870;GO:0003774;GO:0001539;GO:0040011;GO:0110165;GO:0071973;GO:0003674;GO:0005575;GO:0097588;GO:0009987;GO:0009431

Making a table of all GO terms in plaac-positive proteins here:

tail -n +2 GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tsv \
     | cut -f 2 | grep -v "NA" | tr ";" "\n" \
     | sort -u > GLDS-311-GO-terms-in-genome-plaac-positive-proteins.txt

python ../helper-scripts/get-GO-general-info.py \
       -g GLDS-311-GO-terms-in-genome-plaac-positive-proteins.txt \
       -o GLDS-311-genome-plaac-positive-GO-term-info.tsv

And this looks like this:

go_term     namespace           name                                         depth
GO:0001539  biological_process  cilium or flagellum-dependent cell motility  4
GO:0003674  molecular_function  molecular_function                           0
GO:0003774  molecular_function  cytoskeletal motor activity                  1
GO:0005198  molecular_function  structural molecule activity                 1
GO:0005215  molecular_function  transporter activity                         1
GO:0005488  molecular_function  binding                                      1
GO:0005515  molecular_function  protein binding                              2
GO:0005539  molecular_function  glycosaminoglycan binding                    3
GO:0005575  cellular_component  cellular_component                           0

Making a combined table of all GLDS datasets run

mkdir combined
cd combined

ls ../*/*-genome-plaac-positive-proteins-and-GO-terms.tsv | sed 's/^/# /'
# ../GLDS-262/GLDS-262-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-263/GLDS-263-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-291/GLDS-291-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-298/GLDS-298-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-300/GLDS-300-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-302/GLDS-302-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-303/GLDS-303-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-311/GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-361/GLDS-361-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-64/GLDS-64-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-67/GLDS-67-genome-plaac-positive-proteins-and-GO-terms.tsv


printf "protein_ID\tGO_terms\n" > Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tmp

for file in $(ls ../*/*-genome-plaac-positive-proteins-and-GO-terms.tsv)
do
    tail -n +2 ${file} >> Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tmp    
done

mv Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tmp Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tsv

Making a table of all GO terms in plaac-positive proteins here:

tail -n +2 Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tsv \
     | cut -f 2 | grep -v "NA" | tr ";" "\n" \
     | sort -u > Combined-GLDS-GO-terms-in-genome-plaac-positive-proteins.txt

python ../helper-scripts/get-GO-general-info.py \
       -g Combined-GLDS-GO-terms-in-genome-plaac-positive-proteins.txt \
       -o Combined-GLDS-genome-plaac-positive-GO-term-info.tsv

Adding a column that has counts of how many proteins have that GO term:

for term in $(cut -f 1 Combined-GLDS-genome-plaac-positive-GO-term-info.tsv | tail -n +2)
do
    grep -w -c "${term}" Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tsv
done > prot_counts.tmp

paste Combined-GLDS-genome-plaac-positive-GO-term-info.tsv \
      <( cat <( printf "num_prots_with\n" ) prot_counts.tmp ) > t

mv t Combined-GLDS-genome-plaac-positive-GO-term-info.tsv

# sorting by counts
head -n 1 Combined-GLDS-genome-plaac-positive-GO-term-info.tsv > header.tmp
tail -n +2 Combined-GLDS-genome-plaac-positive-GO-term-info.tsv | sort -nrk 5 > data.tmp

cat header.tmp data.tmp > Combined-GLDS-genome-plaac-positive-GO-term-info.tsv

rm *.tmp

head Combined-GLDS-genome-plaac-positive-GO-term-info.tsv | column | sed 's/^/# /'
# go_term     namespace           name                depth  num_prots_with
# GO:0003674  molecular_function  molecular_function  0      287
# GO:0005488  molecular_function  binding             1      199
# GO:0008150  biological_process  biological_process  0      182
# GO:0005575  cellular_component  cellular_component  0      118
# GO:0016020  cellular_component  membrane            2      52
# GO:0040011  biological_process  locomotion          1      32
# GO:0051179  biological_process  localization        1      23
# GO:0046903  biological_process  secretion           4      23
# GO:0006810  biological_process  transport           3      23

Scripts

filenames-json-to-tsv.py


























import json
import sys

with open(sys.argv[1], "r") as json_file:
    data = json.load(json_file)

curr_study = list(data['studies'].keys())[0]

# this is a list of dictionaries
file_data = data['studies'][curr_study]['study_files']

link_prefix = "https://genelab-data.ndc.nasa.gov"

# now we will open a file for writing
with open(sys.argv[2], 'w') as outfile:

    # starting header
    outfile.write("file_name\turl\n")

    for entry in file_data:

        # making sure it is a fasta.gz (likely to be the genome files we're trying to pull; though some have these other extensions
        possible_genome_fasta_extentions = [".fasta.gz", ".fasta", ".fsa_nt.gz", ".fna.gz"]

        if entry['file_name'].endswith(tuple(possible_genome_fasta_extentions)):
            outfile.write(str(entry['file_name']) + "\t" + link_prefix + str(entry['remote_url']) + "\n")

call-genes-on-proks.sh









































#!/usr/bin/env bash
set -e

# directory vars
genomes_dir="genomes"
gene_calls_dir="gene-calls"

mkdir -p ${gene_calls_dir}

printf "\n  Calling genes...\n\n"

for genome in $(cat $1)
do

    printf "\n    ${genome}\n"

    # predicting genes
    source activate prodigal

    prodigal -q -c -a ${gene_calls_dir}/${genome}-gene-calls.tmp \
             -o ${gene_calls_dir}/${genome}-gene-calls.gbk \
             -i ${genomes_dir}/${genome}.fasta

    # need to remove "*" characters to work with interproscan
    tr -d "*" < ${gene_calls_dir}/${genome}-gene-calls.tmp > ${gene_calls_dir}/${genome}-gene-calls.faa

    conda deactivate

    # cleaning up headers
    source activate bit

    bit-rename-fasta-headers -i ${gene_calls_dir}/${genome}-gene-calls.faa -w ${genome} \
                             -o ${gene_calls_dir}/${genome}-gene-calls.tmp

    mv ${gene_calls_dir}/${genome}-gene-calls.tmp ${gene_calls_dir}/${genome}-gene-calls.faa

    conda deactivate

done

printf "\n"

run-plaac-on-genomes.sh


























#!/usr/bin/env bash
set -e

# directory vars
gene_calls_dir="gene-calls"
plaac_dir="plaac-results"

mkdir -p ${plaac_dir}

printf "\n  Running plaac...\n\n"

for genome in $(cat $1)
do

    printf "\n    ${genome}\n"

    # predicting genes
    source activate plaac

    java -jar ~/bin/plaac/cli/target/plaac.jar -i ${gene_calls_dir}/${genome}-gene-calls.faa -a 0 > ${plaac_dir}/${genome}-plaac-out.tsv

    conda deactivate

done

printf "\n"

run-interproscan-on-genomes.sh

































#!/usr/bin/env bash
set -e

# directory vars
gene_calls_dir="gene-calls"
interproscan_dir="interproscan-results"

mkdir -p ${interproscan_dir}

printf "\n  Running interproscan...\n\n"

# first combining all input fastas into one file so interpro scan only needs to load refs once, then will parse after

rm -rf all-genes.faa.tmp temp/

for genome in $(cat $1)
do

    cat ${gene_calls_dir}/${genome}-gene-calls.faa >> all-genes.faa.tmp

done

source activate interproscan

interproscan.sh --cpu 40 --goterms --disable-residue-annot -f tsv -i all-genes.faa.tmp -o ${interproscan_dir}/all-genes-interpro-out.tsv 2> ${interproscan_dir}/interproscan-run-stderr.log

rm -rf all-genes.faa.tmp temp/

conda deactivate

rmdir temp/

printf "\n"

get-GO-annots.py
























































#!/usr/bin/env python

import argparse

parser = argparse.ArgumentParser(description='This takes a list of protein IDs and pulls out their GO annotations from the interproscan output.')

required = parser.add_argument_group('required arguments')

required.add_argument("-w", "--wanted-protein-IDs", help="Single-column file with wanted protein IDs", action="store", required=True)
required.add_argument("-i", "--interproscan-table", help='The output table from interproscan', action="store", required=True)

parser.add_argument("-o", "--output-tsv", help='Default: "out.tsv"', action="store", default="out.tsv")

args = parser.parse_args()

# creating dictionary to hold stuff (keys = protein IDs; values = list of associated GO terms)
targets_dict = {}

for line in open(args.wanted_protein_IDs):
    line = line.strip()
    targets_dict[line] = []

# populating dictionary

for line in open(args.interproscan_table):

    line = line.strip().split("\t")

    if line[0] in targets_dict.keys() and len(line) > 13:

        if line[13] != "-":

            if len(line[13].split("|")) > 1:

                for term in line[13].split("|"):

                    targets_dict[line[0]].append(term)

            else:

                targets_dict[line[0]].append(line[13])

# writing out
with open(args.output_tsv, "w") as out_tab:

    out_tab.write("protein_ID\tGO_terms\n")

    for prot_ID in targets_dict:

        if len(targets_dict[prot_ID]) == 0:

            out_tab.write(str(prot_ID) + "\t" + "NA" + "\n")

        else:

            out_tab.write(str(prot_ID) + "\t" + str(";".join(list(set(targets_dict[prot_ID]))) + "\n"))

add-GO-parents.py






















































































#!/usr/bin/env python

## this is modified from bit-get-go-term-info (github.com/AstrobioMike/bit), it needs to be run in an environment with that package active
# lots of this is from this great tutorial: GO Tutorial in Python - Solutions.ipynb, which comes from here: http://gohandbook.org/doku.php ; https://nbviewer.jupyter.org/urls/dessimozlab.github.io/go-handbook/GO%20Tutorial%20in%20Python%20-%20Solutions.ipynb

from goatools import obo_parser
import os
import argparse
import pandas as pd
import sys
import contextlib

parser = argparse.ArgumentParser(description="Add the parents of GO terms to the \
                                              '*-genome-plaac-positive-proteins-and-GO-terms.tmp' files.\
                                              Needs to be run in an environment with the bit package active \
                                              (github.com/AstrobioMike/bit).")

required = parser.add_argument_group('required arguments')

required.add_argument("-i", "--input-tab", help='Input tab (', action="store", required=True)

parser.add_argument("-g", "--GO_obo_file", help='GO obo file to use (e.g. from: geneontology.org/docs/download-ontology/). By default will \
                                                 use "go-basic.obo". "goslim_metagenomics.obo" is also a pre-packaged option (enter `-g goslim_metagenomics` to specify it). Or \
                                                 a different obo-formatted file can be specified here.',
                    action="store", dest="obo", default="go_basic")

parser.add_argument("-o", "--output-tsv", help='Default: "out.tsv"', action="store", dest="output_tsv", default="out.tsv")


if len(sys.argv)==1:
  parser.print_help(sys.stderr)
  sys.exit(0)

args = parser.parse_args()

### checking and setting up obo file location
go_data_dir = os.environ["GO_DB_DIR"]

if args.obo == "goslim_metagenomics":
    go_obo = go_data_dir + "goslim_metagenomics.obo"

elif args.obo == "go_basic":
    go_obo = go_data_dir + "go-basic.obo"

else:
    go_obo = args.obo

## loading GO database
go = obo_parser.GODag(go_obo)

# reading in starting tab
in_tab = pd.read_csv(args.input_tab, sep = "\t")


# iterating over rows
for index, row in in_tab.iterrows():
    curr_GO_terms = row.GO_terms

    if pd.isna(curr_GO_terms):
        continue    

    curr_GO_terms_list = curr_GO_terms.split(";")

    # initializing new output list that will include parents
    new_GO_term_list = []

    for curr_GO_term in curr_GO_terms_list:

        # getting parents
        try:
            curr_GO_term_obj = go[curr_GO_term]
            curr_parents_list = list(curr_GO_term_obj.get_all_parents())
            new_GO_term_list += curr_parents_list
            new_GO_term_list.append(curr_GO_term)

        except:
            new_GO_term_list.append(curr_GO_term)

    # set used to remove any dupes that may have been introduced
    new_GO_term_list = list(set(new_GO_term_list))

    # replacing GO_terms entry of current row to now include the parent GO terms
    row.GO_terms = ";".join(new_GO_term_list)

# writing out
in_tab.to_csv(args.output_tsv, index = False, sep = "\t", na_rep = "NA")

get-GO-general-info.py























































#!/usr/bin/env python

from goatools import obo_parser
import os
import argparse
import pandas as pd
import sys

parser = argparse.ArgumentParser(description='Writes out a table of GO info for a provided list of GO terms. It relies on `bit` being installed in current environment (https://github.com/AstrobioMike/bit#bioinformatics-tools-bit).')

required = parser.add_argument_group('required arguments')

required.add_argument("-g", "--input-go-terms", help='Single-column list of target GO terms', action="store", required=True)
parser.add_argument("-o", "--output-tsv", help='Output table (default: "out.tsv")', action="store", dest="output_tsv", default="out.tsv")

if len(sys.argv)==1:
  parser.print_help(sys.stderr)
  sys.exit(0)

args = parser.parse_args()

### setting GO obo file location – this program relies on `bit` being installed in current environment (https://github.com/AstrobioMike/bit#bioinformatics-tools-bit)
go_data_dir = os.environ["GO_DB_DIR"]
go_obo = go_data_dir + "go-basic.obo"

## helper function
def get_general_info(go_id):
    go_term = go[go_id]
    name = go_term.name
    namespace = go_term.namespace
    depth = go_term.depth

    go_term_line = str(go_id) + "\t" + str(namespace) + "\t" + str(name) + "\t" + str(depth)
    return go_term_line


## loading GO database
print("\n\tGO obo file being used:")
go = obo_parser.GODag(go_obo, load_obsolete = True)
print("")

# output header
header = ["go_term", "namespace", "name", "depth"]

# iterating through wanted IDs, and writing general info to output table
with open(args.output_tsv, "w") as out_tab:

    out_tab.write(str("\t".join(header)) + "\n")

    with open(args.input_go_terms) as wanted:

        for line in wanted:
            go_term = line.strip()
            curr_go_info = get_general_info(go_term)
            out_tab.write(str(curr_go_info) + "\n")

Running through process with one genome

Grabbing one from GLDS-303

Getting genome nucleotide fasta

wget https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz?version=1

mv GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz\?version\=1 GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz

gunzip GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz

Predicting genes

This one is a prokaryote, so using prodigal.

conda activate prodigal

prodigal -q -c -a GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp \
         -o GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.gbk \
         -i GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta

# need to remove "*" characters to work with interproscan
tr -d "*" < GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp > GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa

conda deactivate

Cleaning up headers

conda activate bit

bit-rename-fasta-headers -i GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa \
                         -w GLDS-303_wgs_contigs_JC-0222_S1153_L001 \
                         -o GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp

mv GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa

conda deactivate

Running plaac

conda activate plaac

java -jar ~/bin/plaac/cli/target/plaac.jar -i GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa -a 0 > GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-out.tsv

conda deactivate

Annotating with Interproscan

I'm not entirely sure which things need to be run in order to get GO terms. All I can find is it links the IPR IDs to GO terms, but nothing specifically gives IPR IDs. I think they might come from matches to the other databases. So, just letting all run.

interproscan.sh --cpu 10 --goterms --disable-residue-annot -f tsv -i GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa -o GLDS-303_wgs_contigs_JC-0222_S1153_L001-interpro-out.tsv

 # took 13 minutes on ~5,000 proteins

NOTE
It will likely be much faster to get all genome proteins together in one file, run this annotation once, and then parse the output based on genomes. Due to renaming headers above, they will all have their genome ID in them followed by _#.

Getting plaac positive protein IDs

grep -v "#" GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-out.tsv \
            | awk -F $'\t' ' { if ( $12 != "NaN" ) print $1 } ' \
            > GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-prot-IDs.txt

Getting GO annotations that exist for the plaac positives

I should write something in python to do this part.

# starting output file
printf "protein_ID\tGO_terms\n" > GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-proteins-and-GO-annots.tsv

for ID in $(cat GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-prot-IDs.txt)
do

    curr_GO_terms=$(grep "^${ID}" GLDS-303_wgs_contigs_JC-0222_S1153_L001-interpro-out.tsv | cut -f 14 | sed '/^$/d' | sed '/^-$/d' | tr "|" "\n" | sort -u | tr "\n" "," | sed 's/|$/\n/')
    
    printf "${ID}\t${curr_GO_terms}\n" >> GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-proteins-and-GO-annots.tsv

done

NOTE
This will leave us with a list of the GO terms for plaac-positive proteins. Then need to write something to find which hold GO terms that are part of the cellular component namespace.

Environment setup

conda install -c conda-forge mamba

Plaac

From https://github.com/whitehead/plaac

The below was done when the latest commit was from 27-Jan-2020 (https://github.com/whitehead/plaac/commit/44ca72882060eec4dc1696ebcf7f4f4185348232)

Needs to be built under python 2, so creating its own environment

mamba create -n plaac python=2
conda activate plaac

mkdir -p ~/bin/
cd ~/bin/
wget https://github.com/whitehead/plaac/archive/master.zip
unzip master.zip && rm master.zip
mv plaac-master/ plaac/

cd plaac/cli/

bash build_plaac.sh

If this gives an error like javac: command not found, we can try installing java in the same conda environment like so: conda install -c conda-forge -c bioconda -c defaults java-jdk

We can test for basic install success with:

java -jar ~/bin/plaac/cli/target/plaac.jar -h

Depending on system, may need to modify the build_plaac.sh script to work with updated javac. This is how I've done that in the past. Here is the contents of the modified file, 'new-build-plaac.sh':

SCRIPT_DIR=`dirname $0`
cd $SCRIPT_DIR

echo cleanup
rm -rf target
mkdir target

echo build
javac -source 1.6 -target 1.6 -d target src/*.java

echo package
cp src/mainClass target/.
cp -r src/util target/.
cd target
jar cmf mainClass plaac.jar *
chmod +x plaac.jar

echo "ok (see target/plaac.jar)"

# also build PLAAC docs
echo "build-docs"

cd -
# generate documentation for command-line parameters, and dotfile for default hmm
java -jar target/plaac.jar -s -d -i dummy -h target/hmm_default.txt > target/plaac_headers.txt
./build_docs.py target/plaac_headers.txt > target/_plaac_headers.haml

echo "ok (see target/_plaac_headers.haml)"

if ! type "dot" > /dev/null; then
  echo "dot not found, skipping genereration of hmm_default.png"
else
  dot target/hmm_default.txt -Tpng > target/hmm_default.png
  echo "ok (see target/hmm_default.png)"
fi

If the previous build_plaac.sh command failed, i'd make the above file and then run that one to try, e.g.:

bash new-build-plaac.sh

java -jar ~/bin/plaac/cli/target/plaac.jar -h

Adding an alias for plaac:

mkdir -p ${CONDA_PREFIX}/etc/conda/activate.d/
echo 'alias plaac="java -jar ~/bin/plaac/cli/target/plaac.jar"' >> ${CONDA_PREFIX}/etc/conda/activate.d/plaac.sh

# making sure it works
conda deactivate

conda activate plaac

plaac -h

conda deactivate

MetaEuk

mamba create -n metaeuk -c conda-forge -c bioconda -c defaults metaeuk=5.34c21f2

Prodigal

mamba create -n prodigal -c conda-forge -c bioconda -c defaults prodigal=2.6.3

Interproscan

conda create -n interproscan -c conda-forge -c bioconda -c defaults interproscan=5.52_86.0

# after conda install, it prints out directions to do the following
wget http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.52-86.0/interproscan-5.52-86.0-64-bit.tar.gz

tar -pxvzf interproscan-5.52-86.0-64-bit.tar.gz

rm -rf /home/mdlee4/miniconda3/envs/interproscan2/share/InterProScan/data/

mv interproscan-5.52-86.0/data/ /home/mdlee4/miniconda3/envs/interproscan2/share/InterProScan

rm -rf interproscan-5.52-86.0/ interproscan-5.52-86.0-64-bit.tar.gz

# test
interproscan.sh -i ${CONDA_PREFIX}/share/InterProScan/test_all_appl.fasta -f tsv

interproscan.sh -i ${CONDA_PREFIX}/share/InterProScan/test_all_appl.fasta -f tsv -dp

Setting to automatically delete working directory after finishing (though it only deletes the files, not the temp dir it makes):

sed 's/delete.temporary.directory.on.completion=false/delete.temporary.directory.on.completion=true/' ${CONDA_PREFIX}/share/InterProScan/interproscan.properties > t && mv t ${CONDA_PREFIX}/share/InterProScan/interproscan.properties

bit

Using this to clean up headers after gene-calling and needed for any scripts that use the GO database.

mamba create -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit