Mike you're figuring this out on kuat here:
~/nasa/prions-fo-life/ISS-isolates/testing
Processing here with GLDS-311. Used scripts are below this section.
curl -L -o GLDS-311-files.json https://genelab-data.ndc.nasa.gov/genelab/data/glds/files/311
And parsing it:
python ../helper-scripts/filenames-json-to-tsv.py GLDS-311-files.json GLDS-311-files.tsv
curl -L -o GLDS-311-meta.json https://genelab-data.ndc.nasa.gov/genelab/data/glds/meta/311
Trying to parse that was quickly becoming a nightmare for me, so doing things manually for the metadata.
Downloaded the "Samples" table and the "Assays/Measurements" table. Since these are comma-delimited, and they have commas in certain fields, they are not friendly to parsing at the command line. So i selected specific columns to be downloaded. For the "Samples" table: Source Name, Sample Name, and Organism (these are what was provided by the submitter). For the "Assays/Measurements" table, Sample Name, NCBI Assembly Accession, and Derived Data File.
Cleaning up and combining into one tsv (this will need to be done differently for each GLDS based on what's in these tables):
# making sure they are in the same order
diff <( tr -d '"' < GLDS-311_samples_export.csv | cut -f 2 -d "," ) <( tr -d '"' < GLDS-311_genome_sequencing_export.csv | cut -f 1 -d "," )
# combining
tr -d '"' < GLDS-311_samples_export.csv | tr "," "\t" | sed 's/Characteristics: //' | cut -f 1-3 | sed 's/ $//' > p1.tmp
tr -d '"' < GLDS-311_genome_sequencing_export.csv | tr "," "\t" | cut -f 2-3 | sed 's/Parameter Value: //' > p2.tmp
paste p1.tmp p2.tmp | sed \$d > GLDS-311-genome-info.tsv
rm p1.tmp p2.tmp
Doing it with shorter and more helpful filenames. This will need to be done specific for each GLDS depending on the info there and format.
# getting info in same order as our links
for i in $(tail -n +2 GLDS-311-files.tsv | cut -f 1)
do
grep ${i} GLDS-311-genome-info.tsv | cut -f 1,3 | tr " " "_" | tr "\t" "-" | sed 's/^/GLDS-311-/' | sed 's/$/.fasta.gz/'
done > wanted-filenames.tmp
# making sure they are all unique (should be the same number)
sort -u wanted-filenames.tmp | wc -l
tail -n +2 GLDS-311-files.tsv | cut -f 1 | sort -u | wc -l
paste -d " " <( sed 's/^/curl -L -o /' wanted-filenames.tmp ) <( tail -n +2 GLDS-311-files.tsv | cut -f 2 ) > dl-GLDS-311-genomes.sh
Adding our file names to summary file:
paste <( cat <( printf "our_file_name\n" ) wanted-filenames.tmp ) GLDS-311-genome-info.tsv > t && mv t GLDS-311-genome-info.tsv
rm *.tmp
bash dl-GLDS-311-genomes.sh
mkdir genomes
mv *.fasta.gz genomes
gunzip genomes/*.gz
First making a list of unique IDs:
ls genomes/*.fasta | cut -f 2 -d "/" | cut -f 1 -d "." > unique-genome-IDs.txt
Scripts are below, but are utilized like so:
bash ../helper-scripts/call-genes-on-proks.sh unique-genome-IDs.txt
bash ../helper-scripts/run-plaac-on-genomes.sh unique-genome-IDs.txt
bash ../helper-scripts/run-interproscan-on-genomes.sh unique-genome-IDs.txt
# took 30 mins with this number of prots
## getting plaac-positive proteins
grep -hv "#" plaac-results/*-plaac-out.tsv \
| awk -F $'\t' ' { if ( $12 != "NaN" ) print $1 } ' \
| sed '/^SEQid$/d' > GLDS-311-genome-plaac-positive-protein-IDs.txt
## getting GO annotations
python ../helper-scripts/get-GO-annots.py \
-w GLDS-311-genome-plaac-positive-protein-IDs.txt \
-i interproscan-results/all-genes-interpro-out.tsv \
-o GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tmp
## propagating parent terms
python ../helper-scripts/add-GO-parents.py \
-i GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tmp \
-o GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tsv
rm GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tmp
This table looks like this:
protein_ID GO_terms
GLDS-311-JC-0128-Pantoea_brenneri_2122 NA
GLDS-311-JC-0128-Pantoea_brenneri_2164 GO:0140694;GO:0071840;GO:0008150;GO:0016043;GO:0009424;GO:0070925;GO:0044781;GO:0005198;GO:0030030;GO:0006996;GO:0110165;GO:0003674;GO:0005575;GO:0022607;GO:0009987;GO:0044780;GO:0030031
GLDS-311-JC-0128-Pantoea_brenneri_2285 GO:0006928;GO:0008150;GO:0048870;GO:0003774;GO:0001539;GO:0040011;GO:0110165;GO:0071973;GO:0003674;GO:0005575;GO:0097588;GO:0009987;GO:0009431
GLDS-311-JC-0128-Pantoea_brenneri_2996 GO:0005515;GO:0097367;GO:0042834;GO:0005488;GO:0003674;GO:0005539
GLDS-311-JC-0129-Pantoea_brenneri_464 GO:0006928;GO:0008150;GO:0048870;GO:0003774;GO:0001539;GO:0040011;GO:0110165;GO:0071973;GO:0003674;GO:0005575;GO:0097588;GO:0009987;GO:0009431
GLDS-311-JC-0129-Pantoea_brenneri_2324 NA
GLDS-311-JC-0129-Pantoea_brenneri_2418 GO:0140694;GO:0071840;GO:0008150;GO:0016043;GO:0009424;GO:0070925;GO:0044781;GO:0005198;GO:0030030;GO:0006996;GO:0110165;GO:0003674;GO:0005575;GO:0022607;GO:0009987;GO:0044780;GO:0030031
GLDS-311-JC-0129-Pantoea_brenneri_3069 GO:0005515;GO:0097367;GO:0042834;GO:0005488;GO:0003674;GO:0005539
GLDS-311-JC-0130-Pantoea_brenneri_98 GO:0006928;GO:0008150;GO:0048870;GO:0003774;GO:0001539;GO:0040011;GO:0110165;GO:0071973;GO:0003674;GO:0005575;GO:0097588;GO:0009987;GO:0009431
Making a table of all GO terms in plaac-positive proteins here:
tail -n +2 GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tsv \
| cut -f 2 | grep -v "NA" | tr ";" "\n" \
| sort -u > GLDS-311-GO-terms-in-genome-plaac-positive-proteins.txt
python ../helper-scripts/get-GO-general-info.py \
-g GLDS-311-GO-terms-in-genome-plaac-positive-proteins.txt \
-o GLDS-311-genome-plaac-positive-GO-term-info.tsv
And this looks like this:
go_term namespace name depth
GO:0001539 biological_process cilium or flagellum-dependent cell motility 4
GO:0003674 molecular_function molecular_function 0
GO:0003774 molecular_function cytoskeletal motor activity 1
GO:0005198 molecular_function structural molecule activity 1
GO:0005215 molecular_function transporter activity 1
GO:0005488 molecular_function binding 1
GO:0005515 molecular_function protein binding 2
GO:0005539 molecular_function glycosaminoglycan binding 3
GO:0005575 cellular_component cellular_component 0
mkdir combined
cd combined
ls ../*/*-genome-plaac-positive-proteins-and-GO-terms.tsv | sed 's/^/# /'
# ../GLDS-262/GLDS-262-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-263/GLDS-263-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-291/GLDS-291-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-298/GLDS-298-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-300/GLDS-300-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-302/GLDS-302-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-303/GLDS-303-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-311/GLDS-311-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-361/GLDS-361-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-64/GLDS-64-genome-plaac-positive-proteins-and-GO-terms.tsv
# ../GLDS-67/GLDS-67-genome-plaac-positive-proteins-and-GO-terms.tsv
printf "protein_ID\tGO_terms\n" > Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tmp
for file in $(ls ../*/*-genome-plaac-positive-proteins-and-GO-terms.tsv)
do
tail -n +2 ${file} >> Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tmp
done
mv Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tmp Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tsv
Making a table of all GO terms in plaac-positive proteins here:
tail -n +2 Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tsv \
| cut -f 2 | grep -v "NA" | tr ";" "\n" \
| sort -u > Combined-GLDS-GO-terms-in-genome-plaac-positive-proteins.txt
python ../helper-scripts/get-GO-general-info.py \
-g Combined-GLDS-GO-terms-in-genome-plaac-positive-proteins.txt \
-o Combined-GLDS-genome-plaac-positive-GO-term-info.tsv
Adding a column that has counts of how many proteins have that GO term:
for term in $(cut -f 1 Combined-GLDS-genome-plaac-positive-GO-term-info.tsv | tail -n +2)
do
grep -w -c "${term}" Combined-GLDS-genome-plaac-positive-proteins-and-GO-terms.tsv
done > prot_counts.tmp
paste Combined-GLDS-genome-plaac-positive-GO-term-info.tsv \
<( cat <( printf "num_prots_with\n" ) prot_counts.tmp ) > t
mv t Combined-GLDS-genome-plaac-positive-GO-term-info.tsv
# sorting by counts
head -n 1 Combined-GLDS-genome-plaac-positive-GO-term-info.tsv > header.tmp
tail -n +2 Combined-GLDS-genome-plaac-positive-GO-term-info.tsv | sort -nrk 5 > data.tmp
cat header.tmp data.tmp > Combined-GLDS-genome-plaac-positive-GO-term-info.tsv
rm *.tmp
head Combined-GLDS-genome-plaac-positive-GO-term-info.tsv | column | sed 's/^/# /'
# go_term namespace name depth num_prots_with
# GO:0003674 molecular_function molecular_function 0 287
# GO:0005488 molecular_function binding 1 199
# GO:0008150 biological_process biological_process 0 182
# GO:0005575 cellular_component cellular_component 0 118
# GO:0016020 cellular_component membrane 2 52
# GO:0040011 biological_process locomotion 1 32
# GO:0051179 biological_process localization 1 23
# GO:0046903 biological_process secretion 4 23
# GO:0006810 biological_process transport 3 23
import json
import sys
with open(sys.argv[1], "r") as json_file:
data = json.load(json_file)
curr_study = list(data['studies'].keys())[0]
# this is a list of dictionaries
file_data = data['studies'][curr_study]['study_files']
link_prefix = "https://genelab-data.ndc.nasa.gov"
# now we will open a file for writing
with open(sys.argv[2], 'w') as outfile:
# starting header
outfile.write("file_name\turl\n")
for entry in file_data:
# making sure it is a fasta.gz (likely to be the genome files we're trying to pull; though some have these other extensions
possible_genome_fasta_extentions = [".fasta.gz", ".fasta", ".fsa_nt.gz", ".fna.gz"]
if entry['file_name'].endswith(tuple(possible_genome_fasta_extentions)):
outfile.write(str(entry['file_name']) + "\t" + link_prefix + str(entry['remote_url']) + "\n")
#!/usr/bin/env bash
set -e
# directory vars
genomes_dir="genomes"
gene_calls_dir="gene-calls"
mkdir -p ${gene_calls_dir}
printf "\n Calling genes...\n\n"
for genome in $(cat $1)
do
printf "\n ${genome}\n"
# predicting genes
source activate prodigal
prodigal -q -c -a ${gene_calls_dir}/${genome}-gene-calls.tmp \
-o ${gene_calls_dir}/${genome}-gene-calls.gbk \
-i ${genomes_dir}/${genome}.fasta
# need to remove "*" characters to work with interproscan
tr -d "*" < ${gene_calls_dir}/${genome}-gene-calls.tmp > ${gene_calls_dir}/${genome}-gene-calls.faa
conda deactivate
# cleaning up headers
source activate bit
bit-rename-fasta-headers -i ${gene_calls_dir}/${genome}-gene-calls.faa -w ${genome} \
-o ${gene_calls_dir}/${genome}-gene-calls.tmp
mv ${gene_calls_dir}/${genome}-gene-calls.tmp ${gene_calls_dir}/${genome}-gene-calls.faa
conda deactivate
done
printf "\n"
#!/usr/bin/env bash
set -e
# directory vars
gene_calls_dir="gene-calls"
plaac_dir="plaac-results"
mkdir -p ${plaac_dir}
printf "\n Running plaac...\n\n"
for genome in $(cat $1)
do
printf "\n ${genome}\n"
# predicting genes
source activate plaac
java -jar ~/bin/plaac/cli/target/plaac.jar -i ${gene_calls_dir}/${genome}-gene-calls.faa -a 0 > ${plaac_dir}/${genome}-plaac-out.tsv
conda deactivate
done
printf "\n"
run-interproscan-on-genomes.sh
#!/usr/bin/env bash
set -e
# directory vars
gene_calls_dir="gene-calls"
interproscan_dir="interproscan-results"
mkdir -p ${interproscan_dir}
printf "\n Running interproscan...\n\n"
# first combining all input fastas into one file so interpro scan only needs to load refs once, then will parse after
rm -rf all-genes.faa.tmp temp/
for genome in $(cat $1)
do
cat ${gene_calls_dir}/${genome}-gene-calls.faa >> all-genes.faa.tmp
done
source activate interproscan
interproscan.sh --cpu 40 --goterms --disable-residue-annot -f tsv -i all-genes.faa.tmp -o ${interproscan_dir}/all-genes-interpro-out.tsv 2> ${interproscan_dir}/interproscan-run-stderr.log
rm -rf all-genes.faa.tmp temp/
conda deactivate
rmdir temp/
printf "\n"
#!/usr/bin/env python
import argparse
parser = argparse.ArgumentParser(description='This takes a list of protein IDs and pulls out their GO annotations from the interproscan output.')
required = parser.add_argument_group('required arguments')
required.add_argument("-w", "--wanted-protein-IDs", help="Single-column file with wanted protein IDs", action="store", required=True)
required.add_argument("-i", "--interproscan-table", help='The output table from interproscan', action="store", required=True)
parser.add_argument("-o", "--output-tsv", help='Default: "out.tsv"', action="store", default="out.tsv")
args = parser.parse_args()
# creating dictionary to hold stuff (keys = protein IDs; values = list of associated GO terms)
targets_dict = {}
for line in open(args.wanted_protein_IDs):
line = line.strip()
targets_dict[line] = []
# populating dictionary
for line in open(args.interproscan_table):
line = line.strip().split("\t")
if line[0] in targets_dict.keys() and len(line) > 13:
if line[13] != "-":
if len(line[13].split("|")) > 1:
for term in line[13].split("|"):
targets_dict[line[0]].append(term)
else:
targets_dict[line[0]].append(line[13])
# writing out
with open(args.output_tsv, "w") as out_tab:
out_tab.write("protein_ID\tGO_terms\n")
for prot_ID in targets_dict:
if len(targets_dict[prot_ID]) == 0:
out_tab.write(str(prot_ID) + "\t" + "NA" + "\n")
else:
out_tab.write(str(prot_ID) + "\t" + str(";".join(list(set(targets_dict[prot_ID]))) + "\n"))
#!/usr/bin/env python
## this is modified from bit-get-go-term-info (github.com/AstrobioMike/bit), it needs to be run in an environment with that package active
# lots of this is from this great tutorial: GO Tutorial in Python - Solutions.ipynb, which comes from here: http://gohandbook.org/doku.php ; https://nbviewer.jupyter.org/urls/dessimozlab.github.io/go-handbook/GO%20Tutorial%20in%20Python%20-%20Solutions.ipynb
from goatools import obo_parser
import os
import argparse
import pandas as pd
import sys
import contextlib
parser = argparse.ArgumentParser(description="Add the parents of GO terms to the \
'*-genome-plaac-positive-proteins-and-GO-terms.tmp' files.\
Needs to be run in an environment with the bit package active \
(github.com/AstrobioMike/bit).")
required = parser.add_argument_group('required arguments')
required.add_argument("-i", "--input-tab", help='Input tab (', action="store", required=True)
parser.add_argument("-g", "--GO_obo_file", help='GO obo file to use (e.g. from: geneontology.org/docs/download-ontology/). By default will \
use "go-basic.obo". "goslim_metagenomics.obo" is also a pre-packaged option (enter `-g goslim_metagenomics` to specify it). Or \
a different obo-formatted file can be specified here.',
action="store", dest="obo", default="go_basic")
parser.add_argument("-o", "--output-tsv", help='Default: "out.tsv"', action="store", dest="output_tsv", default="out.tsv")
if len(sys.argv)==1:
parser.print_help(sys.stderr)
sys.exit(0)
args = parser.parse_args()
### checking and setting up obo file location
go_data_dir = os.environ["GO_DB_DIR"]
if args.obo == "goslim_metagenomics":
go_obo = go_data_dir + "goslim_metagenomics.obo"
elif args.obo == "go_basic":
go_obo = go_data_dir + "go-basic.obo"
else:
go_obo = args.obo
## loading GO database
go = obo_parser.GODag(go_obo)
# reading in starting tab
in_tab = pd.read_csv(args.input_tab, sep = "\t")
# iterating over rows
for index, row in in_tab.iterrows():
curr_GO_terms = row.GO_terms
if pd.isna(curr_GO_terms):
continue
curr_GO_terms_list = curr_GO_terms.split(";")
# initializing new output list that will include parents
new_GO_term_list = []
for curr_GO_term in curr_GO_terms_list:
# getting parents
try:
curr_GO_term_obj = go[curr_GO_term]
curr_parents_list = list(curr_GO_term_obj.get_all_parents())
new_GO_term_list += curr_parents_list
new_GO_term_list.append(curr_GO_term)
except:
new_GO_term_list.append(curr_GO_term)
# set used to remove any dupes that may have been introduced
new_GO_term_list = list(set(new_GO_term_list))
# replacing GO_terms entry of current row to now include the parent GO terms
row.GO_terms = ";".join(new_GO_term_list)
# writing out
in_tab.to_csv(args.output_tsv, index = False, sep = "\t", na_rep = "NA")
#!/usr/bin/env python
from goatools import obo_parser
import os
import argparse
import pandas as pd
import sys
parser = argparse.ArgumentParser(description='Writes out a table of GO info for a provided list of GO terms. It relies on `bit` being installed in current environment (https://github.com/AstrobioMike/bit#bioinformatics-tools-bit).')
required = parser.add_argument_group('required arguments')
required.add_argument("-g", "--input-go-terms", help='Single-column list of target GO terms', action="store", required=True)
parser.add_argument("-o", "--output-tsv", help='Output table (default: "out.tsv")', action="store", dest="output_tsv", default="out.tsv")
if len(sys.argv)==1:
parser.print_help(sys.stderr)
sys.exit(0)
args = parser.parse_args()
### setting GO obo file location – this program relies on `bit` being installed in current environment (https://github.com/AstrobioMike/bit#bioinformatics-tools-bit)
go_data_dir = os.environ["GO_DB_DIR"]
go_obo = go_data_dir + "go-basic.obo"
## helper function
def get_general_info(go_id):
go_term = go[go_id]
name = go_term.name
namespace = go_term.namespace
depth = go_term.depth
go_term_line = str(go_id) + "\t" + str(namespace) + "\t" + str(name) + "\t" + str(depth)
return go_term_line
## loading GO database
print("\n\tGO obo file being used:")
go = obo_parser.GODag(go_obo, load_obsolete = True)
print("")
# output header
header = ["go_term", "namespace", "name", "depth"]
# iterating through wanted IDs, and writing general info to output table
with open(args.output_tsv, "w") as out_tab:
out_tab.write(str("\t".join(header)) + "\n")
with open(args.input_go_terms) as wanted:
for line in wanted:
go_term = line.strip()
curr_go_info = get_general_info(go_term)
out_tab.write(str(curr_go_info) + "\n")
Grabbing one from GLDS-303
wget https://genelab-data.ndc.nasa.gov/genelab/static/media/dataset/GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz?version=1
mv GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz\?version\=1 GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz
gunzip GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta.gz
This one is a prokaryote, so using prodigal.
conda activate prodigal
prodigal -q -c -a GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp \
-o GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.gbk \
-i GLDS-303_wgs_contigs_JC-0222_S1153_L001.fasta
# need to remove "*" characters to work with interproscan
tr -d "*" < GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp > GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa
conda deactivate
conda activate bit
bit-rename-fasta-headers -i GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa \
-w GLDS-303_wgs_contigs_JC-0222_S1153_L001 \
-o GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp
mv GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.tmp GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa
conda deactivate
conda activate plaac
java -jar ~/bin/plaac/cli/target/plaac.jar -i GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa -a 0 > GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-out.tsv
conda deactivate
I'm not entirely sure which things need to be run in order to get GO terms. All I can find is it links the IPR IDs to GO terms, but nothing specifically gives IPR IDs. I think they might come from matches to the other databases. So, just letting all run.
interproscan.sh --cpu 10 --goterms --disable-residue-annot -f tsv -i GLDS-303_wgs_contigs_JC-0222_S1153_L001-gene-calls.faa -o GLDS-303_wgs_contigs_JC-0222_S1153_L001-interpro-out.tsv
# took 13 minutes on ~5,000 proteins
NOTE
It will likely be much faster to get all genome proteins together in one file, run this annotation once, and then parse the output based on genomes. Due to renaming headers above, they will all have their genome ID in them followed by _#.
grep -v "#" GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-out.tsv \
| awk -F $'\t' ' { if ( $12 != "NaN" ) print $1 } ' \
> GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-prot-IDs.txt
I should write something in python to do this part.
# starting output file
printf "protein_ID\tGO_terms\n" > GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-proteins-and-GO-annots.tsv
for ID in $(cat GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-prot-IDs.txt)
do
curr_GO_terms=$(grep "^${ID}" GLDS-303_wgs_contigs_JC-0222_S1153_L001-interpro-out.tsv | cut -f 14 | sed '/^$/d' | sed '/^-$/d' | tr "|" "\n" | sort -u | tr "\n" "," | sed 's/|$/\n/')
printf "${ID}\t${curr_GO_terms}\n" >> GLDS-303_wgs_contigs_JC-0222_S1153_L001-plaac-positive-proteins-and-GO-annots.tsv
done
NOTE
This will leave us with a list of the GO terms for plaac-positive proteins. Then need to write something to find which hold GO terms that are part of the cellular component namespace.
conda install -c conda-forge mamba
From https://github.com/whitehead/plaac
The below was done when the latest commit was from 27-Jan-2020 (https://github.com/whitehead/plaac/commit/44ca72882060eec4dc1696ebcf7f4f4185348232)
Needs to be built under python 2, so creating its own environment
mamba create -n plaac python=2
conda activate plaac
mkdir -p ~/bin/
cd ~/bin/
wget https://github.com/whitehead/plaac/archive/master.zip
unzip master.zip && rm master.zip
mv plaac-master/ plaac/
cd plaac/cli/
bash build_plaac.sh
If this gives an error like
javac: command not found
, we can try installing java in the same conda environment like so:conda install -c conda-forge -c bioconda -c defaults java-jdk
We can test for basic install success with:
java -jar ~/bin/plaac/cli/target/plaac.jar -h
Depending on system, may need to modify the build_plaac.sh
script to work with updated javac. This is how I've done that in the past. Here is the contents of the modified file, 'new-build-plaac.sh':
SCRIPT_DIR=`dirname $0`
cd $SCRIPT_DIR
echo cleanup
rm -rf target
mkdir target
echo build
javac -source 1.6 -target 1.6 -d target src/*.java
echo package
cp src/mainClass target/.
cp -r src/util target/.
cd target
jar cmf mainClass plaac.jar *
chmod +x plaac.jar
echo "ok (see target/plaac.jar)"
# also build PLAAC docs
echo "build-docs"
cd -
# generate documentation for command-line parameters, and dotfile for default hmm
java -jar target/plaac.jar -s -d -i dummy -h target/hmm_default.txt > target/plaac_headers.txt
./build_docs.py target/plaac_headers.txt > target/_plaac_headers.haml
echo "ok (see target/_plaac_headers.haml)"
if ! type "dot" > /dev/null; then
echo "dot not found, skipping genereration of hmm_default.png"
else
dot target/hmm_default.txt -Tpng > target/hmm_default.png
echo "ok (see target/hmm_default.png)"
fi
If the previous build_plaac.sh command failed, i'd make the above file and then run that one to try, e.g.:
bash new-build-plaac.sh
java -jar ~/bin/plaac/cli/target/plaac.jar -h
Adding an alias for plaac:
mkdir -p ${CONDA_PREFIX}/etc/conda/activate.d/
echo 'alias plaac="java -jar ~/bin/plaac/cli/target/plaac.jar"' >> ${CONDA_PREFIX}/etc/conda/activate.d/plaac.sh
# making sure it works
conda deactivate
conda activate plaac
plaac -h
conda deactivate
mamba create -n metaeuk -c conda-forge -c bioconda -c defaults metaeuk=5.34c21f2
mamba create -n prodigal -c conda-forge -c bioconda -c defaults prodigal=2.6.3
conda create -n interproscan -c conda-forge -c bioconda -c defaults interproscan=5.52_86.0
# after conda install, it prints out directions to do the following
wget http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.52-86.0/interproscan-5.52-86.0-64-bit.tar.gz
tar -pxvzf interproscan-5.52-86.0-64-bit.tar.gz
rm -rf /home/mdlee4/miniconda3/envs/interproscan2/share/InterProScan/data/
mv interproscan-5.52-86.0/data/ /home/mdlee4/miniconda3/envs/interproscan2/share/InterProScan
rm -rf interproscan-5.52-86.0/ interproscan-5.52-86.0-64-bit.tar.gz
# test
interproscan.sh -i ${CONDA_PREFIX}/share/InterProScan/test_all_appl.fasta -f tsv
interproscan.sh -i ${CONDA_PREFIX}/share/InterProScan/test_all_appl.fasta -f tsv -dp
Setting to automatically delete working directory after finishing (though it only deletes the files, not the temp dir it makes):
sed 's/delete.temporary.directory.on.completion=false/delete.temporary.directory.on.completion=true/' ${CONDA_PREFIX}/share/InterProScan/interproscan.properties > t && mv t ${CONDA_PREFIX}/share/InterProScan/interproscan.properties
Using this to clean up headers after gene-calling and needed for any scripts that use the GO database.
mamba create -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit
Interactive function plots and tables are here. BRAILLE update (31-Mar-2021) Previous update docs https://hackmd.io/@astrobiomike/BRAILLE-notes-17-Mar-2021 https://hackmd.io/@astrobiomike/BRAILLE-update-24-Feb-2021 https://hackmd.io/@astrobiomike/BRAILLE-update-3-Feb-2021 https://hackmd.io/@astrobiomike/BRAILLE-notes-12-Dec-2020
Nov 24, 2024Overview Metagenomics attempts to sequence all the DNA present in a sample. It can provide a window into the taxonomy and functional potential of a mixed community. There are a ton of things that can be done with metagenomics data as this non-exhaustive overview figure begins to highlight: <a href="https://astrobiomike.github.io/images/metagenomics_overview.png "><img src="https://astrobiomike.github.io/images/metagenomics_overview.png "></a> This page is an introduction to some concepts about one of the things we can try to do with metagenomics data: recovering metagenome-assembled genomes (MAGs). Key concepts
Jul 25, 2024<a href="https://github.com/AstrobioMike/AstrobioMike.github.io/raw/master/images/GToTree-logo-1200px.png "><img src="https://github.com/AstrobioMike/AstrobioMike.github.io/raw/master/images/GToTree-logo-1200px.png "></a>
Jul 22, 2024GUI used was Jetstream2 exosphere: https://jetstream2.exosphere.app/ Summary info The base image created below is publicly available as "STAMPS-2023" and includes: conda v23.5.2 / mamba v1.4.9 jupyterlab v3.6.3 in base conda env an anvio-dev conda environment R v4.3.1 / Rstudio Server (2023.06.1-524) with:BiocManager 1.30.21 remotes 2.4.2
Jul 8, 2024or
By clicking below, you agree to our terms of service.
New to HackMD? Sign up