# Lab 6: rpblast and orthofinder
-rpblast is used to compare a protein query sequence agaisnt a database of protein files
-COGs, or clusters of orthologous groups,categorize proteins across genomes based on orthology(or should be).Orthologs are genes in different species that evolved from a common ancestor, retaining similar functions. COGs help predict functions of less-studied genes by associating them with well-known counterparts in the same group.
-We will be using COGs profiles to see if our hypothetocal proteins form last weeks resemble any known proetin profiles in the COG database
1. Establish where system should look for its blast database:
```
export BLASTDB="/courses/bi278/Course_Materials/blastdb"
echo $BLASTDB
```
We are running rpblast of group of hypothetical proteins from last week to annotate for potential fucntions
i copied the subset of hypothicalprotiens file to my current lab folder:
```
cp /home2/cibene26/lab_05/SUBHPPROKKA.faa .
```
2. Run rpsblast of protein file:
```
rpsblast -query bbqs395_hypothetical.faa -db Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > bbqs395_hypothetical.rpsblast.table
```
for me this looked like:
```
rpsblast -query SUBHPPROKKA.faa -db Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > TSUBHPPROKKA.rpsblast.table
```
result was table full of matches for any of my hypothetical proteins that scored sufficiently high with COG profiles (format 6 results)
TSUBHPPROKKA.rpsblast.table is table for any found proteins for my subset of hps
3. Keep only the best COG match for each query sequence that have query coverage of at least 70. This means at least 70% of the query sequence is covered by the subject that matched.
```
awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t" bbqs395_hypothetical.rpsblast.table > bbqs395_hypothetical.cog.table
```
For me this looked like
```
awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t" TSUBHPPROKKA.rpsblast.table > CSUBHPPROKKA.cog.table
```
CSUBHPPROKKA.cog.table is the reorganized table that shows that at least 70% of the query sequnce is covered by the subject that matched and gene id. the output i got:
bbqs295_02086 COG1886
4. Look up the functional categories for the COG numbers you have. This info is stored in a table cognames2003-2014.tab
```
awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print $1, $2, a[$2]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab bbqs395_hypothetical.cog.table > temp
```
For me this loooked like;
```
awk -F "/t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print 41, $2, a[$2]} else {print $0}}' OFS="/t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab CSUBHPPROKKA.cog.table > temp
```
the output I got:
bbqs295_02086 COG1886 NU
(NU are its two functional categories, the first is its primary one)
5.Now we want to keep the first category for each COG number. COGs can belong to multiple functional categories but the first one is its primary category.
```
awk -F "\t" '{if ( length($3)>1 ) { $3 = substr($3, 0, 1) } else { $3 = $3 }; print}' OFS="\t" temp > temp2
```
the output i got from temp2:
bbqs295_02086 COG1886 N
6.Lastly, we want to have a full description of the functional category as well. This can be looked up in another table: fun2003-2014.tab
```
awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0, a[$3]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > bbqs395_hypothetical.cog.categorized
```
for me this looked like:
```
awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0, a[$3]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > CSUBPPROKKA.cog.categorized
```
this is the output i got from CSUBPPROKKA.cog.categorized:
bbqs295_02086 COG1886 N Cell motility
If it looks like mine, you will see a gene_id, a COG number, a letter category, and a decription of the category.
remove temp files:
```
rm temp temp2
```
not all hypothetical proteins are actually completely unknown. Many can be related to known proteins or look like them. COGs and other orthology prediction mechanisms are great for this type of application.It is also interesting that there appear to be COGs that are common enough proteins to warrant a formal COG description but we don’t know what their specific functions are. These fall into category S.
8. OPTIONAL. what do the individual COG numbers stand for too. We can add that information to the final table.
```
awk -F "\t" 'NR==FNR {a[$1]=$3;next} {if ($2 in a){print $1, $2, a[$2], $3, $4} else {print "no match"}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab bbqs395_hypothetical.cog.categorized
```
## Exercise 2: Orthofinder and its Output
Orthofinder is a popular orthogroup detection software that finds homologous groups of genes (=orthogroup). It can be used for many different purposes, but we are using it as a way to run a genome comparison analysis.
1. Frst collect all predicted protein sequences (PROKKA*.faa equivalent) from the genomes you want to compare into a directory. Then run Orthofinder within that folder.
```
#orthofinder -f ./ -X
```
I now have results in a nested set of directories: Orthofinder/Results_date.
These results are in: /courses/bi278/Course_Materials/lab_06/
```
#mv OrthoFinder/Results_Oct22/ .
#rmdir OrthoFinder/
#cd Results_Oct22
```
2. Check overall statistics
```
cat Comparative_Genomics_Statistics/Statistics_Overall.tsv
```
results, fist part of the file:
```
Number of species 7
Number of genes 26880
Number of genes in orthogroups 25526
Number of unassigned genes 1354
Percentage of genes in orthogroups 95.0
Percentage of unassigned genes 5.0
Number of orthogroups 4357
Number of species-specific orthogroups 272
Number of genes in species-specific orthogroups 960
Percentage of genes in species-specific orthogroups 3.6
Mean orthogroup size 5.9
Median orthogroup size 7.0
```
For my 7 genomes, 95 percent of all genes belong to orthogroups while the remaining 5 percent do not.
3. Check out the species tree file. Orthofinder uses phylogenies (trees) to more accurately find orthogroups so we need to know what the species tree looks like
```
cat Species_Tree/SpeciesTree_rooted_node_labels.txt
```
I then visualized mine at iToL
4. Now I want to check out the file that describes the orthogroups of genes shared by more than one genome.
```
head Orthogroups/Orthogroups.tsv
```
This file lists out all the genes from each genome that belong to each detected orthogroup (first column). The orthogroups are listed from largest to smallest, so if you check out the tail of this same file you will find genes that are shared by only 2 genomes.
5. A different way of achieving the same outcome would be to use this file: `Orthogroups/Orthogroups.GeneCount.tsv`
```
awk '$2=="0" && $6=="0" && $7=="0" && $8=="0" && $9>=3' ../Orthogroups/Orthogroups.GeneCount.tsv | wc -l
```
With this command, I am asking to see lines where the second (outgroup) and 6-8th columns (P. hayleyella) are 0 (no genes are included in the orthogroup), and the total gene count (column 9) is 3 or larger.
6. For each of the orthogroups, you can find fasta files for the genes included here by orthogroup id: `Orthogroup_Sequences`
7. We can also check out which genes are unique to individual genomes here: `Orthogroups/Orthogroups_UnassignedGenes.tsv`
This file can point you to genes that are unique to individual genomes, that may do something distinct but may also do something similar while appearing very different.
## Exercise 3. Mkae and use a custom BLAST database
1. Make your own directory to store BLAST databases in your bi278 home. (these examples are all executed from my bi278 home; I called my directory myblastdb) (in ~)
2. In order to search against this new database, you will have to add this directory to your $BLASTDB. Let’s do that now
```
export BLASTDB="/courses/bi278/Course_Materials/blastdb:/home2/COLBYID/myblastdb"
```
3. To make protein databases, modify and execute the following command on an amino acid multi-fasta file:
```
makeblastdb -in PATH/PROKKA.faa -dbtype prot -title NAME_prot -out myblastdb/NAME_prot -parse_seqids
```
for me this looked like:
```
makeblastdb -in /home2/cibene26/lab_05/PROKKA_10032023AHP.faa -dbtype prot -title P.bonniea_bbqs395_prot -out myblastdb/P.bonniea_bbqs395_prot -parse_seqids
```
output said; Adding sequences from FASTA; added 11 sequences in 0.000670195 seconds.
4. Make nucleotide bases, modify the following:
```
makeblastdb -in PATH/PROKKA.ffn -dbtype nucl -title NAME_nucl -out myblastdb/NAME_nucl -parse_seqids
```
for me this looked like:
```
makeblastdb -in /home2/cibene26/lab_03/bbqs395/PROKKA_10032023.ffn -dbtype nucl -title P.bonniea_bbqs395_nucl -out myblastdb/P.bonniea_bbqs395_nucl -parse_seqids
```
This will create a series of files: *.phr, *.pin, *.psq and more for protein databases and *.nhr, *.nin, *.nsq and more for nucleotide databases.\Note that you’ll have to give different databases different titles, and ideally ones that you can remember easily so you can use them later.
5. Lastly, you can combine any BLAST databases of the same type (nucleotide or protein) you make in this way using a command like the one below to search against multiple databases at once. You just need to use the exact titles you previously gave the databases you want to combine in -dblist. In this scenario, I created 4 protein databases from *.faa files and wanted to combine them into a single searchable database. The names of these custom databases are listed in double-quotes.
```
#blastdb_aliastool -dbtype prot -title 4burk_prot -out myblastdb/4burk_prot -dblist "bb395_prot bb433_prot bh155_prot bh69_prot"
```
6. Let’s make sure everything worked by searching against one of your new custom databases. I want to find all the matches to tssH, a secretion system gene. I downloaded a sequence from Uniprot (Universal Protein Resource), a database that contains both curated and electronically annotated proteins. Swissprot (the same one from last lab) is a highly curated subset of Uniprot.
```
blastp -query /courses/bi278/Course_Materials/lab_06/tssH_bpseudo.fasta -db YOUR_DB -outfmt 6
```
**Questions for NCBI**
Are there any recent ehnhancemnts or updates to NCBI's databases or tools based on user feedback?
How does NCBI ensure the accuracy and relaibility of the information in its database?