--- tags: prions-fo-life title: Annotating and running plaac on genomes --- # Annotating and running plaac on genomes [toc] --- > Mike you're figuring this out on kuat here: `~/nasa/prions-fo-life/ISS-isolates/testing` --- # Running through process with a GLDS dataset ## Downloading the genomes Processing here with GLDS-311. Used scripts are below this section. ```bash curl -L -o GLDS-311-files.json https://genelab-data.ndc.nasa.gov/genelab/data/glds/files/311 ``` And parsing it: ```bash python ../helper-scripts/filenames-json-to-tsv.py GLDS-311-files.json GLDS-311-files.tsv ``` ### Getting metadata ```bash curl -L -o GLDS-311-meta.json https://genelab-data.ndc.nasa.gov/genelab/data/glds/meta/311 ``` Trying to parse that was quickly becoming a nightmare for me, so doing things manually for the metadata. #### Getting info tables Downloaded the "Samples" table and the "Assays/Measurements" table. Since these are comma-delimited, and they have commas in certain fields, they are not friendly to parsing at the command line. So i selected specific columns to be downloaded. For the "Samples" table: Source Name, Sample Name, and Organism (these are what was provided by the submitter). For the "Assays/Measurements" table, Sample Name, NCBI Assembly Accession, and Derived Data File. Cleaning up and combining into one tsv (this will need to be done differently for each GLDS based on what's in these tables): ```bash # making sure they are in the same order diff <( tr -d '"' < GLDS-311_samples_export.csv | cut -f 2 -d "," ) <( tr -d '"' < GLDS-311_genome_sequencing_export.csv | cut -f 1 -d "," ) # combining tr -d '"' < GLDS-311_samples_export.csv | tr "," "\t" | sed 's/Characteristics: //' | cut -f 1-3 | sed 's/ $//' > p1.tmp tr -d '"' < GLDS-311_genome_sequencing_export.csv | tr "," "\t" | cut -f 2-3 | sed 's/Parameter Value: //' > p2.tmp paste p1.tmp p2.tmp | sed \$d > GLDS-311-genome-info.tsv rm p1.tmp p2.tmp ``` ### Building download commands Doing it with shorter and more helpful filenames. This will need to be done specific for each GLDS depending on the info there and format. ```bash # getting info in same order as our links for i in $(tail -n +2 GLDS-311-files.tsv | cut -f 1) do grep ${i} GLDS-311-genome-info.tsv | cut -f 1,3 | tr " " "_" | tr "\t" "-" | sed 's/^/GLDS-311-/' | sed 's/$/.fasta.gz/' done > wanted-filenames.tmp # making sure they are all unique (should be the same number) sort -u wanted-filenames.tmp | wc -l tail -n +2 GLDS-311-files.tsv | cut -f 1 | sort -u | wc -l paste -d " " <( sed 's/^/curl -L -o /' wanted-filenames.tmp ) <( tail -n +2 GLDS-311-files.tsv | cut -f 2 ) > dl-GLDS-311-genomes.sh ``` Adding our file names to summary file: ```bash paste <( cat <( printf "our_file_name\n" ) wanted-filenames.tmp ) GLDS-311-genome-info.tsv > t && mv t GLDS-311-genome-info.tsv rm *.tmp ``` ### Downloading ```bash bash dl-GLDS-311-genomes.sh mkdir genomes mv *.fasta.gz genomes gunzip genomes/*.gz ``` ## Running process with scripts First making a list of unique IDs: ```bash ls genomes/*.fasta | cut -f 2 -d "/" | cut -f 1 -d "." > unique-genome-IDs.txt ``` Scripts are below, but are utilized like so: ```bash bash ../helper-scripts/call-genes-on-proks.sh unique-genome-IDs.txt bash ../helper-scripts/run-plaac-on-genomes.sh unique-genome-IDs.txt bash ../helper-scripts/run-interproscan-on-genomes.sh unique-genome-IDs.txt # took 30 mins with this number of prots ## getting plaac-positive proteins grep -hv "#" plaac-results/*-plaac-out.tsv \ | awk -F $'\t' ' { if ( $12 != "NaN" ) print $1 } ' \ | sed '/^SEQid$/d' > GLDS-311-genome-plaac-positive-protein-IDs.txt ## getting GO annotations python ../helper-scripts/get-GO-annots.py \ -w GLDS-311-genome-plaac-positive-protein-IDs.txt \ -i interproscan-results/all-genes-interpro-out.tsv \ -o GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tmp ## propagating parent terms python ../helper-scripts/add-GO-parents.py \ -i GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tmp \ -o GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tsv rm GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tmp ``` This table looks like this: ``` protein_ID GO_terms GLDS-311-JC-0128-Pantoea_brenneri_2122 NA GLDS-311-JC-0128-Pantoea_brenneri_2164 GO:0140694;GO:0071840;GO:0008150;GO:0016043;GO:0009424;GO:0070925;GO:0044781;GO:0005198;GO:0030030;GO:0006996;GO:0110165;GO:0003674;GO:0005575;GO:0022607;GO:0009987;GO:0044780;GO:0030031 GLDS-311-JC-0128-Pantoea_brenneri_2285 GO:0006928;GO:0008150;GO:0048870;GO:0003774;GO:0001539;GO:0040011;GO:0110165;GO:0071973;GO:0003674;GO:0005575;GO:0097588;GO:0009987;GO:0009431 GLDS-311-JC-0128-Pantoea_brenneri_2996 GO:0005515;GO:0097367;GO:0042834;GO:0005488;GO:0003674;GO:0005539 GLDS-311-JC-0129-Pantoea_brenneri_464 GO:0006928;GO:0008150;GO:0048870;GO:0003774;GO:0001539;GO:0040011;GO:0110165;GO:0071973;GO:0003674;GO:0005575;GO:0097588;GO:0009987;GO:0009431 GLDS-311-JC-0129-Pantoea_brenneri_2324 NA GLDS-311-JC-0129-Pantoea_brenneri_2418 GO:0140694;GO:0071840;GO:0008150;GO:0016043;GO:0009424;GO:0070925;GO:0044781;GO:0005198;GO:0030030;GO:0006996;GO:0110165;GO:0003674;GO:0005575;GO:0022607;GO:0009987;GO:0044780;GO:0030031 GLDS-311-JC-0129-Pantoea_brenneri_3069 GO:0005515;GO:0097367;GO:0042834;GO:0005488;GO:0003674;GO:0005539 GLDS-311-JC-0130-Pantoea_brenneri_98 GO:0006928;GO:0008150;GO:0048870;GO:0003774;GO:0001539;GO:0040011;GO:0110165;GO:0071973;GO:0003674;GO:0005575;GO:0097588;GO:0009987;GO:0009431 ``` Making a table of all GO terms in plaac-positive proteins here: ```bash tail -n +2 GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tsv \ | cut -f 2 | grep -v "NA" | tr ";" "\n" \ | sort -u > GLDS-311-GO-terms-in-genome-plaac-positive-proteins.txt python ../helper-scripts/get-GO-general-info.py \ -g GLDS-311-GO-terms-in-genome-plaac-positive-proteins.txt \ -o GLDS-311-genome-plaac-positive-GO-term-info.tsv ``` And this looks like this: ``` go_term namespace name depth GO:0001539 biological_process cilium or flagellum-dependent cell motility 4 GO:0003674 molecular_function molecular_function 0 GO:0003774 molecular_function cytoskeletal motor activity 1 GO:0005198 molecular_function structural molecule activity 1 GO:0005215 molecular_function transporter activity 1 GO:0005488 molecular_function binding 1 GO:0005515 molecular_function protein binding 2 GO:0005539 molecular_function glycosaminoglycan binding 3 GO:0005575 cellular_component cellular_component 0 ``` # Making a combined table of all GLDS datasets run ```bash mkdir combined cd combined ls ../*/*-genome-plaac-positive-proteins-and-GO-terms.tsv | sed 's/^/# /' # ../GLDS-262/GLDS-262-genome-plaac-positive-proteins-and-GO-terms.tsv # ../GLDS-263/GLDS-263-genome-plaac-positive-proteins-and-GO-terms.tsv # ../GLDS-291/GLDS-291-genome-plaac-positive-proteins-and-GO-terms.tsv # ../GLDS-298/GLDS-298-genome-plaac-positive-proteins-and-GO-terms.tsv # ../GLDS-300/GLDS-300-genome-plaac-positive-proteins-and-GO-terms.tsv # ../GLDS-302/GLDS-302-genome-plaac-positive-proteins-and-GO-terms.tsv # ../GLDS-303/GLDS-303-genome-plaac-positive-proteins-and-GO-terms.tsv # ../GLDS-311/GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tsv # ../GLDS-361/GLDS-361-genome-plaac-positive-proteins-and-GO-terms.tsv # ../GLDS-64/GLDS-64-genome-plaac-positive-proteins-and-GO-terms.tsv # ../GLDS-67/GLDS-67-genome-plaac-positive-proteins-and-GO-terms.tsv printf "protein_ID\tGO_terms\n" > Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tmp for file in $(ls ../*/*-genome-plaac-positive-proteins-and-GO-terms.tsv) do tail -n +2 ${file} >> Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tmp done mv Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tmp Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tsv ``` Making a table of all GO terms in plaac-positive proteins here: ```bash tail -n +2 Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tsv \ | cut -f 2 | grep -v "NA" | tr ";" "\n" \ | sort -u > Combined-GLDS-GO-terms-in-genome-plaac-positive-proteins.txt python ../helper-scripts/get-GO-general-info.py \ -g Combined-GLDS-GO-terms-in-genome-plaac-positive-proteins.txt \ -o Combined-GLDS-genome-plaac-positive-GO-term-info.tsv ``` Adding a column that has counts of how many proteins have that GO term: ```bash for term in $(cut -f 1 Combined-GLDS-genome-plaac-positive-GO-term-info.tsv | tail -n +2) do grep -w -c "${term}" Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tsv done > prot_counts.tmp paste Combined-GLDS-genome-plaac-positive-GO-term-info.tsv \ <( cat <( printf "num_prots_with\n" ) prot_counts.tmp ) > t mv t Combined-GLDS-genome-plaac-positive-GO-term-info.tsv # sorting by counts head -n 1 Combined-GLDS-genome-plaac-positive-GO-term-info.tsv > header.tmp tail -n +2 Combined-GLDS-genome-plaac-positive-GO-term-info.tsv | sort -nrk 5 > data.tmp cat header.tmp data.tmp > Combined-GLDS-genome-plaac-positive-GO-term-info.tsv rm *.tmp head Combined-GLDS-genome-plaac-positive-GO-term-info.tsv | column | sed 's/^/# /' # go_term namespace name depth num_prots_with # GO:0003674 molecular_function molecular_function 0 287 # GO:0005488 molecular_function binding 1 199 # GO:0008150 biological_process biological_process 0 182 # GO:0005575 cellular_component cellular_component 0 118 # GO:0016020 cellular_component membrane 2 52 # GO:0040011 biological_process locomotion 1 32 # GO:0051179 biological_process localization 1 23 # GO:0046903 biological_process secretion 4 23 # GO:0006810 biological_process transport 3 23 ``` --- ### Scripts **filenames-json-to-tsv.py** ```python= import json import sys with open(sys.argv[1], "r") as json_file: data = json.load(json_file) curr_study = list(data['studies'].keys())[0] # this is a list of dictionaries file_data = data['studies'][curr_study]['study_files'] link_prefix = "https://genelab-data.ndc.nasa.gov" # now we will open a file for writing with open(sys.argv[2], 'w') as outfile: # starting header outfile.write("file_name\turl\n") for entry in file_data: # making sure it is a fasta.gz (likely to be the genome files we're trying to pull; though some have these other extensions possible_genome_fasta_extentions = [".fasta.gz", ".fasta", ".fsa_nt.gz", ".fna.gz"] if entry['file_name'].endswith(tuple(possible_genome_fasta_extentions)): outfile.write(str(entry['file_name']) + "\t" + link_prefix + str(entry['remote_url']) + "\n") ``` **call-genes-on-proks.sh** ```bash= #!/usr/bin/env bash set -e # directory vars genomes_dir="genomes" gene_calls_dir="gene-calls" mkdir -p ${gene_calls_dir} printf "\n Calling genes...\n\n" for genome in $(cat $1) do printf "\n ${genome}\n" # predicting genes source activate prodigal prodigal -q -c -a ${gene_calls_dir}/${genome}-gene-calls.tmp \ -o ${gene_calls_dir}/${genome}-gene-calls.gbk \ -i ${genomes_dir}/${genome}.fasta # need to remove "*" characters to work with interproscan tr -d "*" < ${gene_calls_dir}/${genome}-gene-calls.tmp > ${gene_calls_dir}/${genome}-gene-calls.faa conda deactivate # cleaning up headers source activate bit bit-rename-fasta-headers -i ${gene_calls_dir}/${genome}-gene-calls.faa -w ${genome} \ -o ${gene_calls_dir}/${genome}-gene-calls.tmp mv ${gene_calls_dir}/${genome}-gene-calls.tmp ${gene_calls_dir}/${genome}-gene-calls.faa conda deactivate done printf "\n" ``` **run-plaac-on-genomes.sh** ```bash= #!/usr/bin/env bash set -e # directory vars gene_calls_dir="gene-calls" plaac_dir="plaac-results" mkdir -p ${plaac_dir} printf "\n Running plaac...\n\n" for genome in $(cat $1) do printf "\n ${genome}\n" # predicting genes source activate plaac java -jar ~/bin/plaac/cli/target/plaac.jar -i ${gene_calls_dir}/${genome}-gene-calls.faa -a 0 > ${plaac_dir}/${genome}-plaac-out.tsv conda deactivate done printf "\n" ``` **run-interproscan-on-genomes.sh** ```bash= #!/usr/bin/env bash set -e # directory vars gene_calls_dir="gene-calls" interproscan_dir="interproscan-results" mkdir -p ${interproscan_dir} printf "\n Running interproscan...\n\n" # first combining all input fastas into one file so interpro scan only needs to load refs once, then will parse after rm -rf all-genes.faa.tmp temp/ for genome in $(cat $1) do cat ${gene_calls_dir}/${genome}-gene-calls.faa >> all-genes.faa.tmp done source activate interproscan interproscan.sh --cpu 40 --goterms --disable-residue-annot -f tsv -i all-genes.faa.tmp -o ${interproscan_dir}/all-genes-interpro-out.tsv 2> ${interproscan_dir}/interproscan-run-stderr.log rm -rf all-genes.faa.tmp temp/ conda deactivate rmdir temp/ printf "\n" ``` **get-GO-annots.py** ```bash= #!/usr/bin/env python import argparse parser = argparse.ArgumentParser(description='This takes a list of protein IDs and pulls out their GO annotations from the interproscan output.') required = parser.add_argument_group('required arguments') required.add_argument("-w", "--wanted-protein-IDs", help="Single-column file with wanted protein IDs", action="store", required=True) required.add_argument("-i", "--interproscan-table", help='The output table from interproscan', action="store", required=True) parser.add_argument("-o", "--output-tsv", help='Default: "out.tsv"', action="store", default="out.tsv") args = parser.parse_args() # creating dictionary to hold stuff (keys = protein IDs; values = list of associated GO terms) targets_dict = {} for line in open(args.wanted_protein_IDs): line = line.strip() targets_dict[line] = [] # populating dictionary for line in open(args.interproscan_table): line = line.strip().split("\t") if line[0] in targets_dict.keys() and len(line) > 13: if line[13] != "-": if len(line[13].split("|")) > 1: for term in line[13].split("|"): targets_dict[line[0]].append(term) else: targets_dict[line[0]].append(line[13]) # writing out with open(args.output_tsv, "w") as out_tab: out_tab.write("protein_ID\tGO_terms\n") for prot_ID in targets_dict: if len(targets_dict[prot_ID]) == 0: out_tab.write(str(prot_ID) + "\t" + "NA" + "\n") else: out_tab.write(str(prot_ID) + "\t" + str(";".join(list(set(targets_dict[prot_ID]))) + "\n")) ``` **add-GO-parents.py** ```bash= #!/usr/bin/env python ## this is modified from bit-get-go-term-info (github.com/AstrobioMike/bit), it needs to be run in an environment with that package active # lots of this is from this great tutorial: GO Tutorial in Python - Solutions.ipynb, which comes from here: http://gohandbook.org/doku.php ; https://nbviewer.jupyter.org/urls/dessimozlab.github.io/go-handbook/GO%20Tutorial%20in%20Python%20-%20Solutions.ipynb from goatools import obo_parser import os import argparse import pandas as pd import sys import contextlib parser = argparse.ArgumentParser(description="Add the parents of GO terms to the \ '*-genome-plaac-positive-proteins-and-GO-terms.tmp' files.\ Needs to be run in an environment with the bit package active \ (github.com/AstrobioMike/bit).") required = parser.add_argument_group('required arguments') required.add_argument("-i", "--input-tab", help='Input tab (', action="store", required=True) parser.add_argument("-g", "--GO_obo_file", help='GO obo file to use (e.g. from: geneontology.org/docs/download-ontology/). By default will \ use "go-basic.obo". "goslim_metagenomics.obo" is also a pre-packaged option (enter `-g goslim_metagenomics` to specify it). Or \ a different obo-formatted file can be specified here.', action="store", dest="obo", default="go_basic") parser.add_argument("-o", "--output-tsv", help='Default: "out.tsv"', action="store", dest="output_tsv", default="out.tsv") if len(sys.argv)==1: parser.print_help(sys.stderr) sys.exit(0) args = parser.parse_args() ### checking and setting up obo file location go_data_dir = os.environ["GO_DB_DIR"] if args.obo == "goslim_metagenomics": go_obo = go_data_dir + "goslim_metagenomics.obo" elif args.obo == "go_basic": go_obo = go_data_dir + "go-basic.obo" else: go_obo = args.obo ## loading GO database go = obo_parser.GODag(go_obo) # reading in starting tab in_tab = pd.read_csv(args.input_tab, sep = "\t") # iterating over rows for index, row in in_tab.iterrows(): curr_GO_terms = row.GO_terms if pd.isna(curr_GO_terms): continue curr_GO_terms_list = curr_GO_terms.split(";") # initializing new output list that will include parents new_GO_term_list = [] for curr_GO_term in curr_GO_terms_list: # getting parents try: curr_GO_term_obj = go[curr_GO_term] curr_parents_list = list(curr_GO_term_obj.get_all_parents()) new_GO_term_list += curr_parents_list new_GO_term_list.append(curr_GO_term) except: new_GO_term_list.append(curr_GO_term) # set used to remove any dupes that may have been introduced new_GO_term_list = list(set(new_GO_term_list)) # replacing GO_terms entry of current row to now include the parent GO terms row.GO_terms = ";".join(new_GO_term_list) # writing out in_tab.to_csv(args.output_tsv, index = False, sep = "\t", na_rep = "NA") ``` **get-GO-general-info.py** ```python= #!/usr/bin/env python from goatools import obo_parser import os import argparse import pandas as pd import sys parser = argparse.ArgumentParser(description='Writes out a table of GO info for a provided list of GO terms. It relies on `bit` being installed in current environment (https://github.com/AstrobioMike/bit#bioinformatics-tools-bit).') required = parser.add_argument_group('required arguments') required.add_argument("-g", "--input-go-terms", help='Single-column list of target GO terms', action="store", required=True) parser.add_argument("-o", "--output-tsv", help='Output table (default: "out.tsv")', action="store", dest="output_tsv", default="out.tsv") if len(sys.argv)==1: parser.print_help(sys.stderr) sys.exit(0) args = parser.parse_args() ### setting GO obo file location – this program relies on `bit` being installed in current environment (https://github.com/AstrobioMike/bit#bioinformatics-tools-bit) go_data_dir = os.environ["GO_DB_DIR"] go_obo = go_data_dir + "go-basic.obo" ## helper function def get_general_info(go_id): go_term = go[go_id] name = go_term.name namespace = go_term.namespace depth = go_term.depth go_term_line = str(go_id) + "\t" + str(namespace) + "\t" + str(name) + "\t" + str(depth) return go_term_line ## loading GO database print("\n\tGO obo file being used:") go = obo_parser.GODag(go_obo, load_obsolete = True) print("") # output header header = ["go_term", "namespace", "name", "depth"] # iterating through wanted IDs, and writing general info to output table with open(args.output_tsv, "w") as out_tab: out_tab.write(str("\t".join(header)) + "\n") with open(args.input_go_terms) as wanted: for line in wanted: go_term = line.strip() curr_go_info = get_general_info(go_term) out_tab.write(str(curr_go_info) + "\n") ``` --- # Running through process with one genome Grabbing one from [GLDS-303](https://genelab-data.ndc.nasa.gov/genelab/accession/GLDS-303/) ## Getting genome nucleotide fasta ```bash wget https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz?version=1 mv GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz\?version\=1 GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz gunzip GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz ``` ## Predicting genes This one is a prokaryote, so using prodigal. ```bash conda activate prodigal prodigal -q -c -a GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp \ -o GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.gbk \ -i GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta # need to remove "*" characters to work with interproscan tr -d "*" < GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp > GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa conda deactivate ``` ### Cleaning up headers ```bash conda activate bit bit-rename-fasta-headers -i GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa \ -w GLDS-303_wgs_contigs_JC-0222_S1153_L001 \ -o GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp mv GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa conda deactivate ``` ## Running plaac ```bash conda activate plaac java -jar ~/bin/plaac/cli/target/plaac.jar -i GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa -a 0 > GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-out.tsv conda deactivate ``` ## Annotating with Interproscan I'm not entirely sure which things need to be run in order to get GO terms. All I can find is it links the IPR IDs to GO terms, but nothing specifically gives IPR IDs. I think they might come from matches to the other databases. So, just letting all run. ```bash interproscan.sh --cpu 10 --goterms --disable-residue-annot -f tsv -i GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa -o GLDS-303_wgs_contigs_JC-0222_S1153_L001-interpro-out.tsv # took 13 minutes on ~5,000 proteins ``` > **NOTE** > It will likely be much faster to get all genome proteins together in one file, run this annotation once, and then parse the output based on genomes. Due to renaming headers above, they will all have their genome ID in them followed by _#. ## Getting plaac positive protein IDs ```bash grep -v "#" GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-out.tsv \ | awk -F $'\t' ' { if ( $12 != "NaN" ) print $1 } ' \ > GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-prot-IDs.txt ``` ## Getting GO annotations that exist for the plaac positives I should write something in python to do this part. ```bash # starting output file printf "protein_ID\tGO_terms\n" > GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-proteins-and-GO-annots.tsv for ID in $(cat GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-prot-IDs.txt) do curr_GO_terms=$(grep "^${ID}" GLDS-303_wgs_contigs_JC-0222_S1153_L001-interpro-out.tsv | cut -f 14 | sed '/^$/d' | sed '/^-$/d' | tr "|" "\n" | sort -u | tr "\n" "," | sed 's/|$/\n/') printf "${ID}\t${curr_GO_terms}\n" >> GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-proteins-and-GO-annots.tsv done ``` > **NOTE** > This will leave us with a list of the GO terms for plaac-positive proteins. Then need to write something to find which hold GO terms that are part of the cellular component namespace. # Environment setup ```bash conda install -c conda-forge mamba ``` ## Plaac From https://github.com/whitehead/plaac The below was done when the latest commit was from 27-Jan-2020 (https://github.com/whitehead/plaac/commit/44ca72882060eec4dc1696ebcf7f4f4185348232) Needs to be built under python 2, so creating its own environment ```bash mamba create -n plaac python=2 conda activate plaac ``` ``` mkdir -p ~/bin/ cd ~/bin/ wget https://github.com/whitehead/plaac/archive/master.zip unzip master.zip && rm master.zip mv plaac-master/ plaac/ cd plaac/cli/ bash build_plaac.sh ``` > If this gives an error like `javac: command not found`, we can try installing java in the same conda environment like so: `conda install -c conda-forge -c bioconda -c defaults java-jdk` We can test for basic install success with: ```bash java -jar ~/bin/plaac/cli/target/plaac.jar -h ``` Depending on system, may need to modify the `build_plaac.sh` script to work with updated javac. This is how I've done that in the past. Here is the contents of the modified file, 'new-build-plaac.sh': ```bash SCRIPT_DIR=`dirname $0` cd $SCRIPT_DIR echo cleanup rm -rf target mkdir target echo build javac -source 1.6 -target 1.6 -d target src/*.java echo package cp src/mainClass target/. cp -r src/util target/. cd target jar cmf mainClass plaac.jar * chmod +x plaac.jar echo "ok (see target/plaac.jar)" # also build PLAAC docs echo "build-docs" cd - # generate documentation for command-line parameters, and dotfile for default hmm java -jar target/plaac.jar -s -d -i dummy -h target/hmm_default.txt > target/plaac_headers.txt ./build_docs.py target/plaac_headers.txt > target/_plaac_headers.haml echo "ok (see target/_plaac_headers.haml)" if ! type "dot" > /dev/null; then echo "dot not found, skipping genereration of hmm_default.png" else dot target/hmm_default.txt -Tpng > target/hmm_default.png echo "ok (see target/hmm_default.png)" fi ``` If the previous build_plaac.sh command failed, i'd make the above file and then run that one to try, e.g.: ```bash bash new-build-plaac.sh ``` ```bash java -jar ~/bin/plaac/cli/target/plaac.jar -h ``` Adding an alias for plaac: ```bash mkdir -p ${CONDA_PREFIX}/etc/conda/activate.d/ echo 'alias plaac="java -jar ~/bin/plaac/cli/target/plaac.jar"' >> ${CONDA_PREFIX}/etc/conda/activate.d/plaac.sh # making sure it works conda deactivate conda activate plaac plaac -h conda deactivate ``` ## MetaEuk ```bash mamba create -n metaeuk -c conda-forge -c bioconda -c defaults metaeuk=5.34c21f2 ``` ## Prodigal ```bash mamba create -n prodigal -c conda-forge -c bioconda -c defaults prodigal=2.6.3 ``` ## Interproscan ```bash conda create -n interproscan -c conda-forge -c bioconda -c defaults interproscan=5.52_86.0 # after conda install, it prints out directions to do the following wget http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.52-86.0/interproscan-5.52-86.0-64-bit.tar.gz tar -pxvzf interproscan-5.52-86.0-64-bit.tar.gz rm -rf /home/mdlee4/miniconda3/envs/interproscan2/share/InterProScan/data/ mv interproscan-5.52-86.0/data/ /home/mdlee4/miniconda3/envs/interproscan2/share/InterProScan rm -rf interproscan-5.52-86.0/ interproscan-5.52-86.0-64-bit.tar.gz # test interproscan.sh -i ${CONDA_PREFIX}/share/InterProScan/test_all_appl.fasta -f tsv interproscan.sh -i ${CONDA_PREFIX}/share/InterProScan/test_all_appl.fasta -f tsv -dp ``` Setting to automatically delete working directory after finishing (though it only deletes the files, not the temp dir it makes): ```bash sed 's/delete.temporary.directory.on.completion=false/delete.temporary.directory.on.completion=true/' ${CONDA_PREFIX}/share/InterProScan/interproscan.properties > t && mv t ${CONDA_PREFIX}/share/InterProScan/interproscan.properties ``` ## bit Using this to clean up headers after gene-calling and needed for any scripts that use the GO database. ```bash mamba create -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit ```