# BI278 Lab notebook 6
### Reversed position specific BLAST
`Rpsblast` - used to compare a protein query sequence against a database of protein profiles
`ssh` into bi278
Establish where the system should look for its BLAST database
```
export BLASTDB="/courses/bi278/Course_Materials/blastdb"
echo $BLASTDB
```
Run `rpsblast` on a group of proteins or whole genomes. Use the hypothetical proteins from lab 5.
`mkdir lab_06`
`cd lab_06`
```
rpsblast -query ../lab_05/bh69_hypothetical_protein.faa -db Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > bh69_hypothetical.rpsblast.table
```
Keep only the best COG match for each query sequence with a query coverage of at least 70.
```
awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t" bh69_hypothetical.rpsblast.table > bh69_hypothetical.cog.table
```
* Takes the format 6 table and splits it up by both tabs and commas. The comma is because the last column includes the COG number along with a lot of other additional descriptors separated by commas. We want to just grab the COG number from this column for the next step.
* It then looks at all the rows that match the same query ($1) as long as each has query coverage of at least 70
* It keeps the first one (best match) and prints two columns (=fields)
#### Categorize COGs
Look up the function categories for the COG numbers -> stored in a table: `cognames2003-2014.tab`
```
awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print $1, $2, a[$2]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab bh69_hypothetical.cog.table > temp
```
This takes the first file `cognames2003-2014.tab` and makes a look-up table that can access the categories by COG number. It then takes the second file `bh69_hypothetical.cog.table` and goes through it row by row and looks up what category the COG number in that row belongs to. It then writes the row along with the COG category that it looked up and stores it in a new file.
Keep the first category for each COG number which is its primary category (if it belongs to multiple categories)
```
awk -F "\t" '{if (length($3)>1 ) {$3 = substr($3, 0, 1) } else { $3 = $3 }; print}' OFS="\t" temp > temp2
```
Keeps the first category (letter) from the third field
Look up the full description of the functional category in table: `fun2003-2014.tab`
```
awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0,a[$3]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > bh69_hypothetical.cog.categorized
```
Then remove the temp files: `rm temp temp2`
### Make and use a custom BLAST database
Make some local custom nucleotide and protein databases to search against. This is the first step to doing reciprocal BLAST in order to identify BBH/RBH between a small number of genomes.
Make your own directory for BLAST databases in your bi278 home: `mkdir ~/myblastdb`
Add the directory to your `$BLASTDB`
```
export BLASTDB="/courses/bi278/Course_Materials/blastdb:/home2/tywang23/myblastdb"
```
Make protein database using an amino acid multi-fasta file:
```
makeblastdb -in /courses/bi278/Course_Materials/lab_04/bhqs69/PROKKA_09242022.faa -dbtype prot -title bhqs69_prot -out ~/myblastdb/bhqs_prot -parse_seqids
```
Make nucleotide database:
```
makeblastdb -in /courses/bi278/Course_Materials/lab_04/bhqs69/PROKKA_09242022.ffn -dbtype nucl -title bhqs69_nucl -out ~/myblastdb/bhqs69_nucl -parse_seqids
```
Repeat for the other genomes in the lab_04:
* `bbqs395`
* `bbqs433`
* `bhqs155`
Combine the databses of the same type (nucleotide or protein)
Use the titles of the databases you want to combine in `-dblist`
```
blastdb_aliastool -dbtype prot -title 4burk_prot -out ~/myblastdb/4burk_prot -dblist "bbqs395_prot bbqs433_prot bhqs155_prot bhqs69_prot "
blastdb_aliastool -dbtype nucl -title 4burk_nucl -out ~/myblastdb/4burk_nucl -dblist "bbqs395_nucl bbqs433_nucl bhqs155_nucl bhqs69_nucl "
```
**Test it out**
Find all matches to tssH, a secretion system gene sequence downloaded from Uniport (in the Course materials, lab_06 folder)
```
blastp -query /courses/bi278/Course_Materials/lab_06/tssH_bpseudo.fasta -db 4burk_prot -outfmt 6
```
Got results from all four of the custom databases, so it works:

Can see that the results contain for several subjects 2
HSPs that were not combined in the gapped realignment step. I can tell this by the columns that represent `qstart` `qend` `sstart` `send`. These proteins must be just a little bit different compared to the query sequence that there is a section in the middle that doesn’t want to align between the query and the subject.
### Orthofinder and its output
`Orthofinder` is a popular orthogroup detection software that finds homologous groups of genes that are descended from a single ancestor gene. It can be used for many different purposes, but we are using it as a way to run a genome comparison analysis.
Collect all predicted protein sequences (prokka.faa) into a directory and run orthofinder within that folder
[Run `orthofinder` on the predicted protein sequences of 2 strains of *P. bonniea* (Bbqs395, Bbqs433), 2 strains of *P. hayleyella* (Bhqs155, Bhqs69), and *Burkholderia
pseudomallei* downloaded from the Burkholderia genome database.
`orthofinder -f ./ -X`
In this command, -f specifies where my protein fasta files are (./ means in this directory) and -X specifies that I don’t want `orthofinder` to add species names to my gene ids because I already did this when running `prokka`.
The results are in a nested set of directories: Orthofinder/Results_date.
Move the results up a level and get rid of the empty directory for convenience.
`mv OrthoFinder/Results_Oct22/ `.
`rmdir OrthoFinder/`
`cd Results_Oct22` ]
The results are in: `/courses/bi278/Course_Materials/lab_06/`
Look at the results
```
cd /courses/bi278/Course_Materials/lab_06/Results_Oct22/
```
* Check the overall statistics:
`cat Comparative_Genomics_Statistics/Statistics_Overall.tsv`

For the 5 genomes, 93.5 percent of all genes belong to orthogroups while the remaining 6.5 percent do not.
* Check the species tree file
`Orthofinder` uses phylogenies (trees) to more accurately find orthogroups so we need to know what the species tree looks like and what the names of the internal nodes are.
`cat Species_Tree/SpeciesTree_rooted_node_labels.txt`
* Visualize it at iToL
* Check out the file that describes the orthogroups of genes shared by more than one genome. (N0 is ‘en’-‘zero’, not ‘no’)
`head Phylogenetic_Hierarchical_Orthogroups/N0.tsv`
This file lists out all the genes that belong to each detected orthogroup (second column).
* The orthogroups are listed from largest to smallest, so if you check out the tail of this same file you will find genes that are shared by only 2 genomes.
* The first column refers to the hierarchical level at this the orthogroup is found.
* N0 includes all orthogroups.
* In this case, N1 are orthogroups that exclude *B. pseudomallei*, N2 are inclusive only of *P. hayleyella* genomes, and N3 is inclusive of only *P. bonniea* genomes.
This file will make it easy to sort out the members or orthogroups that are only found in a particular subset of genomes.
Find genes that are shared between genome 1 (*Burkholderia pseudomallei*) and genomes 4 and 5 (Bhqs155 and Bhqs69). The genomes are listed as column names in the header after the first three fields.
Find rows of this file that have an entry in fields $4, $7, $8 but not $5, $6 (the two *P. bonniea*):
```
awk -F "\t" '$4!="" && $5=="" && $6=="" && $7!="" && $8!="" {print}' OFS="\t" Phylogenetic_Hierarchical_Orthogroups/N0.tsv
```
`==` means equal to
`!=` means not equal to
`&&` means AND
See which genes are unique to individual genomes:
`head Orthogroups/Orthogroups_UnassignedGenes.tsv`
Once you’ve gotten specific orthogroups and therefore genes of interest to look at, you may want to go back and check out their functions (now you know how to assign COGs), and do some additional literature research. It could also help to know specific types of genes you are looking for ahead of time and you can use a combination of custom `BLAST` with `Orthofinder` results to see if those genes exist in some but not all of your genomes.