# BI 278 Lab 6 Notes
## Exercise 1: Reversed position specific BLAST
To use COG profiles to determine whether your hypothetical proteins resemble any known protein profiles in the COG database
### 1. To establish where your system should look for its BLAST databases:
`export BLASTDB="/courses/bi278/Course_Materials/blastdb"
echo $BLASTDB`
### 2. To run rpsblast on a group of proteins use:
`rpsblast -query bbqs395_hypothetical.faa -db Cog -evalue 0.01 -
max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs
stitle" > bbqs395_hypothetical.rpsblast.table`
This will give the a table full of matches.
### 3. In order to keep only the best COG match for each query sequence that have at least 70 (at least 70% of the query sequence is covered by the subject that matched):
`awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t"
bbqs395_hypothetical.rpsblast.table >
bbqs395_hypothetical.cog.table`
- Takes the format 6 table and splits it up by both tabs and commas. The comma thing is because the last column includes the COG number along with a lot of other additional descriptors separated by commas. We want to just grab the COG number from this column for the next step.
- It then looks at all the rows that match the same query ($1) as long as each has query coverage of at least 70
- It keeps the first one (best match) and prints two columns (=fields). You should be able to figure out what these fields are based on the context above.
### 4. To look up the functional categories for the COG numbers you have first from the table cognames2003-2014.tab
`awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print $1,
$2, a[$2]} else {print $0}}' OFS="\t"
/courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab
bbqs395_hypothetical.cog.table > temp`
- This command takes the first file cognames2003-2014.tab and makes a look-up table that can access the categories by COG number. It then takes the second file bbqs395_hypothetical.cog.table and goes through it row by row and looks up what category the COG number in that row belongs to. It then writes the row along with the COG category that it looked up and stores it in a new file. You should take a look at the temp file that was just created to understand the next command.
### 5. Now, to keep the first category for each COG number (the first one is its primary category)
`awk -F "\t" '{if ( length($3)>1 ) { $3 = substr($3, 0, 1) }
else { $3 = $3 }; print}' OFS="\t" temp > temp2`
- This command keeps the first category (letter) from the third field
### 6. To have full description of the functional category (looked up in the table fun2003-2014.tab)
`awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0,
a[$3]} else {print $0}}' OFS="\t"
/courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > bbqs395_hypothetical.cog.categorized`
### 7. Remove the temp files created in this process.
Looking at the final file, we see that not all hypothetical proteins are actually hypothetical!
## Exercise 2: Making and Using a custom BLAST database
### 8. Make your own directory for BLAST databases in your bi278 home.
### 9. Add this directory to your $BLASTDB using
`export
BLASTDB="/courses/bi278/Course_Materials/blastdb:/home2/colbyid
/myblastdb"`
### 10. To make protein databases, modify and execute the following command on an amino acid multi-fasta file:
`makeblastdb -in
/courses/bi278/Course_Materials/lab_04/prokka.faa -dbtype prot
-title genomename_prot -out myblastdb/genomename_prot -
parse_seqids`
- genomename_prt is the name of the database you are creating
- myblastdb/genomename_prot is your database folder and the name of the database you are creating
### 11. To create a nucleotide databases, modify the following:
`makeblastdb -in
/courses/bi278/Course_Materials/lab_04/prokka.ffn -dbtype nucl
-title genomename_nucl -out myblastdb/genomename_nucl -
parse_seqids`
This creates a series of files for protein databases and other files for nucleotide databases. Note that you give different database titles so you can remember them and use them later.
### 12. To combine BLAST databases of the same type using the files with similar ends (.pin, .pos as compared to .not .ntf) use the command
`blastdb_aliastool -dbtype prot -title 4burk_prot -out
myblastdb/4burk_prot -dblist "bb395_prot bb433_prot bh155_prot
bh69_prot "`
### 13. To check if everything works with your databases, you can download a sequence from Uniprot and run the command:
`blastp -query tssH_bpseudo.fasta -db 4burk_prot -outfmt 6`
tssH_bpseudo.fasta - the downloaded sequence, it can change. Can also put path before it to go to downloaded file.
## Exercise 3: Orthofinder and its output
### 14. Collect all predicted protein sequences into a directory then run orthofinder within that folder.
`#orthofinder -f ./ -X`
- -f specifies where my protein fasta files are
- (./ means here because I was in the same directory as these files)
- -X specifies that I don't want orthofinder to add species names to my gene ids because it was already done when running prokka.
- Within the orthofinder results file, there are many different result files organized into several directories:
- info on running orthofinder and its results: https://github.com/davidemms/OrthoFinder#what-orthofinder-provides
### 15. To check the overall statistics of of Orthofinder results, check the overall folder with the command:
`cat Comparative_Genomics_Statistics/Statistics_Overall.tsv`

### 16. To check out the species tree file (phylogenies) use the command:
`cat Species_Tree/SpeciesTree_rooted_node_labels.txt`
Refer to end of lab 4 notes to see how to input this tree into iTol
### 17. Use the command
`head Phylogenetic_Hierarchical_Orthogroups/N0.tsv`
### to check out the file that describes the orthogroups of genes shared by more than one genome
- lists out all the genes that belong to each detected orthogroup
- They are arranged by largest to smallest, so the tail command would find genes shared only by 2 genomes
This file makes it easy to sort out the members of orthogroups that are only found in a particular subset of genomes
### 18. To find genes that are shared between genome 1 and genomes 4 and 5, you have to use the genomes listings as column names in the header after the first three fields. So, you can find rows of this file that have an entry in fields $4, $7, $8 but not $5, $6 (these $ correspond with the genomes):
`awk -F "\t" '$4!="" && $5=="" && $6=="" && $7!="" && $8!=""
{print}' OFS="\t" Phylogenetic_Hierarchical_Orthogroups/N0.tsv`
- == means equal to
- != means not equal to
- && means AND (logical)
- If you want to look at clade-specific genes, it may save time to look at the other files N1-N3.
### 19. To look at which genes are unique to individual genomes, use the command:
`head Orthogroups/Orthogroups_UnassignedGenes.tsv`
given these unique genes (genes of interest) you may want to check out their functions using COGs, and do additional literature research.
- knowing the genes you are looking for ahead of time is also helpful as it allows for combined use of custom BLAST and Orthofinder to see if those genes exist in some but not all of your genomes