---
tags: prions-fo-life
title: Annotating plaac-positive proteins with InterProScan and KOs
---
# Annotating plaac-positive proteins with InterProScan and KOs
Working with data from 1-Jun-2021, UniProt "Standard" proteomes only. The files used below are from [this drive](https://drive.google.com/drive/u/0/folders/1AO0Q_nHx4iNA1loP8LO8CGmSUloJ9lVH).

---
> **NOTE**
> The code below assumes the directory structure in [the google drive](https://drive.google.com/drive/u/0/folders/1AO0Q_nHx4iNA1loP8LO8CGmSUloJ9lVH), with commands run from within the sub-directory "[further-annotating-plaac-positive-proteins](https://drive.google.com/drive/u/0/folders/1unPlMfWbnvGCV5fD59qXVGwOepE7-R18)". If working in a different structure, the file paths below need to be modified to point to the correct locations.
---
[toc]
# Environment setup
```bash
conda install -c conda-forge mamba
```
## bit
```bash
mamba create -n bit -c conda-forge -c bioconda -c defaults -c astrobiomike bit=1.8.42
```
## InterProScan
```bash
mamba create -n interproscan5 -c conda-forge -c bioconda -c defaults interproscan=5.54_87.0
conda activate interproscan5
# setting up reference db
wget http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.54_87.0/interproscan-5.54_87.0-64-bit.tar.gz
tar -pxvzf interproscan-5.54_87.0-64-bit.tar.gz
rm -rf ${CONDA_PREFIX}/share/InterProScan/data/
mv interproscan-5.54_87.0/data/ ${CONDA_PREFIX}/share/InterProScan
rm -rf interproscan-5.54_87.0/ interproscan-5.54_87.0-64-bit.tar.gz
# test (first run uses the pre-calculated match lookup service, second disables it with -dp)
interproscan.sh -i ${CONDA_PREFIX}/share/InterProScan/test_all_appl.fasta -f tsv
interproscan.sh -i ${CONDA_PREFIX}/share/InterProScan/test_all_appl.fasta -f tsv -dp
```
Setting InterProScan to automatically delete its working files when a run finishes (note this only deletes the files, not the temp directory it creates):
```bash
# writing to a temp file rather than using `sed -i`, so this also works with the sed that ships on darwin
sed 's/delete.temporary.directory.on.completion=false/delete.temporary.directory.on.completion=true/' ${CONDA_PREFIX}/share/InterProScan/interproscan.properties > t && mv t ${CONDA_PREFIX}/share/InterProScan/interproscan.properties
```
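To confirm the change took effect, the property can be checked with `grep`. The sketch below runs the same sed-to-temp-file pattern against a throwaway mock file so it can be tried without touching the real install. The `mock.properties` name and its contents are made up for illustration; on a real install, the file to check would be `${CONDA_PREFIX}/share/InterProScan/interproscan.properties`.

```bash
# Mock properties file standing in for interproscan.properties (contents are illustrative only)
workdir=$(mktemp -d)
props="${workdir}/mock.properties"
printf 'some.other.setting=1\ndelete.temporary.directory.on.completion=false\n' > "${props}"

# Same pattern as above: write the edited copy to a temp file, then move it into place
sed 's/delete.temporary.directory.on.completion=false/delete.temporary.directory.on.completion=true/' "${props}" > "${props}.tmp" && mv "${props}.tmp" "${props}"

# Report the current value of the setting
setting=$(grep '^delete.temporary.directory.on.completion=' "${props}")
echo "${setting}"
```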
## KOFamScan
```bash
mamba create -y -n kofamscan -c conda-forge -c bioconda -c defaults kofamscan=1.3.0 hmmer=3.3.0
# getting ref db, need to point to these when running
curl -L -O ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz
gunzip ko_list.gz
curl -L -O ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
tar -xzvf profiles.tar.gz && rm profiles.tar.gz
```
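Since `exec_annotation` needs to be pointed at both `ko_list` and `profiles/`, it can help to check they are in place before starting a long run. The following is a minimal sketch; `check_kofam_db` is a hypothetical helper (not part of KOFamScan), and the demo uses mock files rather than the real reference database.

```bash
# Hypothetical helper: verify the KO list and HMM profile directory look usable
check_kofam_db() {
    local ko_list="$1" profiles_dir="$2"
    # ko_list must exist and be non-empty
    [ -s "${ko_list}" ] || { echo "missing or empty: ${ko_list}" >&2; return 1; }
    # profiles directory must exist
    [ -d "${profiles_dir}" ] || { echo "missing directory: ${profiles_dir}" >&2; return 1; }
    # profiles directory should hold at least one *.hmm file
    find "${profiles_dir}" -maxdepth 1 -name '*.hmm' | head -n 1 | grep -q . \
        || { echo "no .hmm files in: ${profiles_dir}" >&2; return 1; }
}

# Demo on mock files (made up for illustration)
workdir=$(mktemp -d)
mkdir "${workdir}/profiles"
printf 'knum\tthreshold\n' > "${workdir}/ko_list"
touch "${workdir}/profiles/K00001.hmm"

check_kofam_db "${workdir}/ko_list" "${workdir}/profiles" && echo "reference db looks ready"
```

With the real files downloaded as above, the call would be `check_kofam_db ko_list profiles/`.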
# Getting fasta files of plaac-positive proteins
```bash
mkdir archaea bacteria eukarya
conda activate bit
bit-parse-fasta-by-headers -i ../archaea/reference-genome-info/archaea-proteomes.faa.gz -w ../archaea/plaac-core-score-0/archaea-plaac-core-score-0-positive-protein-accs.txt -o archaea/archaea-plaac-positive-seqs.faa --gz
bit-parse-fasta-by-headers -i ../bacteria/reference-genome-info/bacteria-proteomes.faa.gz -w ../bacteria/plaac-core-score-0/bacteria-plaac-core-score-0-positive-protein-accs.txt -o bacteria/bacteria-plaac-positive-seqs.faa --gz
bit-parse-fasta-by-headers -i ../eukarya/reference-genome-info/eukarya-proteomes.faa.gz -w ../eukarya/plaac-core-score-0/eukarya-plaac-core-score-0-positive-protein-accs.txt -o eukarya/eukarya-plaac-positive-seqs.faa --gz
```
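A quick sanity check after pulling the sequences is to compare the number of records in each output fasta against the number of accessions requested. The sketch below shows the idea on mock data; `count_fasta_records` is a hypothetical helper, the accessions and sequences are made up, and the comparison assumes the accession list has one unique accession per line.

```bash
# Hypothetical helper: count fasta records (header lines start with ">")
count_fasta_records() {
    grep -c '^>' "$1"
}

# Mock data for illustration only
workdir=$(mktemp -d)
printf 'acc1\nacc2\n' > "${workdir}/wanted-accs.txt"
printf '>acc1\nMQNNQ\n>acc2\nMSYYN\n' > "${workdir}/out.faa"

n_wanted=$(grep -c . "${workdir}/wanted-accs.txt")   # non-empty lines in the accession list
n_found=$(count_fasta_records "${workdir}/out.faa")

if [ "${n_wanted}" -eq "${n_found}" ]; then
    printf "OK: %s sequences for %s accessions\n" "${n_found}" "${n_wanted}"
else
    printf "MISMATCH: %s sequences for %s accessions\n" "${n_found}" "${n_wanted}"
fi
```

For the real files, the inputs would be, e.g., `../archaea/plaac-core-score-0/archaea-plaac-core-score-0-positive-protein-accs.txt` and `archaea/archaea-plaac-positive-seqs.faa`.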
# Running annotations
Scripts are below and in the [google drive](https://drive.google.com/drive/u/0/folders/1unPlMfWbnvGCV5fD59qXVGwOepE7-R18) in the "helper-scripts" subdirectory.
```bash
bash helper-scripts/run-kofamscan.sh
bash helper-scripts/run-interproscan.sh
```
# Scripts
## run-kofamscan.sh
```bash=
#!/usr/bin/env bash
set -e
eval "$(conda shell.bash hook)"
for domain in archaea bacteria eukarya
do
printf "\n\n\tDoing ${domain}\n\n"
mkdir -p ${domain}/${domain}-ko-annotations
conda activate kofamscan
exec_annotation -p profiles/ -k ko_list --cpu 50 -f detail-tsv -o ${domain}/${domain}-ko-annotations/${domain}-ko-annots.tmp ${domain}/${domain}-plaac-positive-seqs.faa
rm -rf tmp
conda deactivate
conda activate bit
bit-filter-KOFamScan-results -i ${domain}/${domain}-ko-annotations/${domain}-ko-annots.tmp -o ${domain}/${domain}-ko-annotations/${domain}-plaac-positive-KO-annots.tsv
rm ${domain}/${domain}-ko-annotations/${domain}-ko-annots.tmp
done
```
## run-interproscan.sh
```bash=
#!/usr/bin/env bash
set -e
eval "$(conda shell.bash hook)"
conda activate interproscan5
for domain in archaea bacteria
do
printf "\n\n\tDoing ${domain}\n\n"
mkdir -p ${domain}/${domain}-interproscan-out
interproscan.sh --cpu 20 --goterms --disable-residue-annot -f tsv -i ${domain}/${domain}-plaac-positive-seqs.faa -o ${domain}/${domain}-interproscan-out/${domain}-interpro-out.tsv 2> ${domain}/${domain}-interproscan-out/${domain}-interpro-stderr.log
rm -rf temp/
done
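# splitting euks because it's a lot of seqs
# (splitting on line numbers only keeps records intact if the fasta is unwrapped, i.e. 2 lines per record)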
head -n 110000 eukarya/eukarya-plaac-positive-seqs.faa > eukarya/eukarya-plaac-positive-seqs-p1.faa
sed -n '110001,210000p' eukarya/eukarya-plaac-positive-seqs.faa > eukarya/eukarya-plaac-positive-seqs-p2.faa
sed -n '210001,310000p' eukarya/eukarya-plaac-positive-seqs.faa > eukarya/eukarya-plaac-positive-seqs-p3.faa
sed -n '310001,421018p' eukarya/eukarya-plaac-positive-seqs.faa > eukarya/eukarya-plaac-positive-seqs-p4.faa
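# the loop above only made output dirs for archaea and bacteria, so making the eukarya one here
mkdir -p eukarya/eukarya-interproscan-out
for set in p1 p2 p3 p4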
do
printf "\n\n\tDoing eukarya ${set}\n\n"
interproscan.sh --cpu 20 --goterms --disable-residue-annot -f tsv -i eukarya/eukarya-plaac-positive-seqs-${set}.faa -o eukarya/eukarya-interproscan-out/eukarya-interpro-out-${set}.tsv 2> eukarya/eukarya-interproscan-out/eukarya-interpro-${set}-stderr.log
rm -rf temp
done
```
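The eukarya split above relies on hard-coded line numbers for this specific file. As an alternative (not what was run here), the fasta could be split by record count, which avoids the hard-coded total and also works if sequences are wrapped over multiple lines. A minimal sketch follows; `split_fasta_by_records` is a hypothetical helper, it produces equal-sized chunks rather than the exact ranges used above, and the demo input is made up.

```bash
# Hypothetical helper: split a fasta into chunks of at most N records each,
# written to <prefix>-p1.faa, <prefix>-p2.faa, ...
split_fasta_by_records() {
    local fasta="$1" n_per_chunk="$2" prefix="$3"
    awk -v n="${n_per_chunk}" -v prefix="${prefix}" '
        /^>/ {
            if (count % n == 0) {            # current chunk is full (or this is the first record)
                if (out != "") close(out)
                chunk++
                out = prefix "-p" chunk ".faa"
            }
            count++
        }
        { if (out != "") print > out }       # skip anything before the first header
    ' "${fasta}"
}

# Demo on a mock fasta with 5 records, one of them wrapped over two lines
workdir=$(mktemp -d)
printf '>a\nMQ\n>b\nMS\nYY\n>c\nMN\n>d\nMG\n>e\nMA\n' > "${workdir}/mock.faa"
split_fasta_by_records "${workdir}/mock.faa" 2 "${workdir}/mock"

# Show how many records landed in each chunk
grep -c '^>' "${workdir}"/mock-p*.faa
```

The resulting `-p1`, `-p2`, ... suffixes match the naming used in the loop above, so the list in `for set in ...` would need to cover however many chunks are produced.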