BI278 Lab 6 – rpsblast and Orthofinder

# BI278 Lab 6 – rpsblast and Orthofinder Goal: use rpsblast for functional characterization, learn how to make your own blast databases & use Orthofinder for orthogroup detection Steps: 1) run rpsblast 2) make your own blast database 3) parse orthofinder results **Exercise 1. Reversed Position Specific BLAST** Rpsblast = used to compare protein query sequence against a database of protein profiles Essentially, it's the reverse version of PSI-BLAST (position-specific iterated BLAST) NCBI has collection of protein profiles stored in conserved domain databases (CDD), which includes profile models based on NCBI-curated domains, NCBIfams & Pfam HMM profiles, COGs, and other external data sources (we're using COGs in this lab) Goal: use COG profiles to detemine whether hypothetical proteins from prev week resemble any known protein profiles in the COG database 1. Establish where your system should look for its BLAST databases export BLASTDB="/courses/bi278/Course_Materials/blastdb" echo $BLASTDB Now you can input anything you'd like BUT a good test of how rpblast works would be to use hyp proteins from last week Basically this is applicable to any group of proteins you want to annotate for their potential functions (including whole genome) 2. Run rpblast on your group of proteins! REMEMBER: you need to start w/ protein fasta file. rpsblast -query bbqs395_hypothetical.faa -db Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > bbqs395_hypothetical.rpsblast.table #FOR ME: rpsblast -query PROKKA_09272022.faa.hypotheticalprotein -db Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > bbqs395_hypothetical.rpsblast.table Like other 6 format results, you'll get table full of matches for any of your hyp proteins that scored sufficiently high w/ COG profiles Look at result file because next we are going to keep only the most significant COG match for each query sequence. 3. Keep only the best COG match for each query seq that have query coverage of at least 70 ==> basically this means that at least 70% of the query sequence is covered by the subject that matched awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t" bbqs395_hypothetical.rpsblast.table > bbqs395_hypothetical.cog.table What does this command do? - Takes the format 6 table & splits it up by both tabs & commas (the comma thing is because the last column includes the COG number along with a lot of other additional descriptors separated by commas.) We want to just grab COG # from this column for the next step - Then it looks at all the rows that match the same query ($1) as long as each has query coverage of at least 70 - It keeps the first one (best match) and prints two columns (=fields). You should be able to figure out what these fields are based on the context above. Now we can look up what each of these COGs is & categorize them to 17 broad functional categories individual COGs belong to. 4. Look up the functional categories for the COG numbers you have first. This info is stored in table: cognames2003-2014.tab. awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print $1, $2, a[$2]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab bbqs395_hypothetical.cog.table > temp This command basically takes the first file (cognames2003-2014.tab) and makes a look-up table that can access the categories by COG number. It then takes the second file bbqs395_hypothetical.cog.table & goes thru it row by row & looks up what category the COG # in that row belongs to. Then, it writes row along w/ COG category that it looked up & stores it in a new file. Look at this temp file to understand next command!! 5. Now we want to keep the first category for each COG #. This is because COGs can belong to multiple functional categories but first one is its primary category. awk -F "\t" '{if ( length($3)>1 ) { $3 = substr($3, 0, 1) } else { $3 = $3 }; print}' OFS="\t" temp > temp2 This command keeps the first category (letter) from the third field. 6. Lastly, we want to have a full description of the functional category as well. This can be looked up in another table: fun2003-2014.tab awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0, a[$3]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > bbqs395_hypothetical.cog.categorized ![](https://i.imgur.com/J3BKfly.png) 7. We will also remove the temp files we made in the process (just to clean up!) rm temp temp2 FINAL THOUGHTS OF EXERCISE 1: - look at the final file: - you should see that hyp proteins are not hyp! - basically many can be related to known proteins (though perhaps from other organisms) COGs and other orthology prediction mechanisms = great for this type of application. Also interesting that there appear to be COGs that are common enough proteins to warrant a formal COG description but we don’t know what their specific functions are. **Exercise 2. Make and use a custom BLAST database** Goal: make some custom nucleotide & protein databases to search against. This is the first step to doing reciprocal BLAST in order to identify BBH/RBH between a small # of genomes 8. Make your own directory for BLAST databases in your bi278 hme mkdir myblastdb 9. In order to search against this new database, you have to add this directory to your $BLASTDB (which tells blast where your directories are) export BLASTDB="/courses/bi278/Course_Materials/blastdb:/home2/sdivit25/myblastdb" 10. To make protein databases, modify nd execute the following comand on an amino acid multi-fasta file: makeblastdb -in /courses/bi278/Course_Materials/lab_04/prokka.faa -dbtype prot -title genomename_prot -out myblastdb/genomename_prot - parse_seqids makeblastdb -in /courses/bi278/Course_Materials/lab_04/bbqs395/PROKKA_09162022.faa -dbtype prot -title bbqs395_prot -out myblastdb/bbqs395_prot -parse_seqids 11. To make nucleotide databases, modify the following: makeblastdb -in /courses/bi278/Course_Materials/lab_04/prokka.ffn -dbtype nucl -title genomename_nucl -out myblastdb/genomename_nucl -parse_seqids makeblastdb -in /courses/bi278/Course_Materials/lab_04/bbqs395/PROKKA_09162022.ffn -dbtype nucl -title bbqs395_nucl -out myblastdb/bbqs395_nucl -parse_seqids This will create a series of files: *.phr, *.pin, *.psq & more for protein databases & *.nhr, *.nin, *.nsq & more for nucleotide databases NOTE: you'll have to give diff databases diff titles & ideally ones that you can remember easily so you can use them later 12. Lastly, you can combine any BLAST databases of the same type (nucleotide or protein) you make in this way using a command like the one below to search against multiple databases you want to cmbine in -dblist. In this scenario (EXAMPLE): Prof Noh created 4 protein databases from the .faa files in lab_04 & wanted to combine them into a single searchable database. The names of these custom databases are listed in double quotes. blastdb_aliastool -dbtype prot -title 4burk_prot -out myblastdb/4burk_prot -dblist "bb395_prot bb433_prot bh155_prot bh69_prot" 13. Make sure everything worked by searching against one of new custom databases! Ex: Prof Noh wants to find all the matches to tssH (secretion systeme gene). She downloaded a seq from Uniprot (Universal Protein Resource), a database that contains both curated & electronically annotated proteins. Swiss-Prot is highly curated subset of Uniprot. blastp -query tssH_bpseudo.fasta -db 4burk_prot -outfmt 6 Results should look something like this: ![](https://i.imgur.com/3Tv2YbS.png) Prof Noh's comments: "One thing I notice with these results is that the results contain for several subjects 2 HSPs that were not combined in the gapped realignment step. I can tell this by the columns that represent qstart qend sstart send. These proteins must be just a little bit different compared to the query sequence that there is a section in the middle that doesn’t want to align between the query and the subject." **Exercise 3. Orthofinder and its output** Orthofinder = popular orthogroup detection software that finds homologous groups of genes that are descended from a single ancestor gene. It can be used for many diff purposes ==> we are using it as a way to run a genome comparison analysis One of the foci of comparative genomic analysis = compare which homologous genes are common across genomes & which genes are unique There are many states in betweeen completely shared & completely unique for genes (ex: some proportion of genomes is shared but not all) Think about it basically like a set theory applied to genes each genome possesses. This may also remind you about what BUSCO is supposed to represent: single copy genes shared by all members of particular groups, that are universal enough that we use these genes to determine whether our sequencing project was successful or not at capturing most of the genome. NOTE: using software is low key easy part and interpreting results is low key the hard part 14. First collect all predicted protein seqs (prokka.faa equivalent) into a directory. Then run orthofinder within that folder. Prof Noh's results are from running orthofinder on the predicted protein sequences of 2 strains of P. bonniea (bbqs395, bbqs433), 2 strains of P.hayleyella (bhqs155, bhqs69), and Burkholderia pseudomallei downloaded from the Burkholderia genome database. #orthofinder -f ./ -X In this command: -f specifies where our protein fasta files are -./ means here because we were in same directory as these files -X specifies that we don't want orthofinder to add species names to our gene ids because we already did this when running prokka Now, we have results in a nested set of directories: Orthofinder/Results_date. We moved the results up a level & got rid of the empty directory for convenience. These results are in: /courses/bi278/Course_Materials/lab_06/ #mv OrthoFinder/Results_Oct25/ . #rmdir OrthoFinder/ #cd Results_Oct25 Changing into the output directory you will see that there are many results files organized into several directories! You can read about running Orthofinder & its results here: https://github.com/davidemms/OrthoFinder#what-orthofinder-provides Now, look at the key results! 15. Check overall stats: cat Comparative_Genomics_Statistics/Statistics_Overall.tsv ![](https://i.imgur.com/jxfG9US.png) Results are telling you that (esp first part of file uptil "% of genes in species-specific orthogroups") for Prof Noh's 5 genomes, 93.5% of all genes belong to orthogroups while remaining 6.5% do not. The files that are most relevant for finding out which genes you may want to focus on at a larger scale = highlighted/emphasized in next steps. 16. Look at species tree file. Orthofinder uses phylogenies (trees) to more accurately find orthogroups so we need to know what the species tree looks like & what the names of the internal nodes are. cat Species_Tree/SpeciesTree_rooted_node_labels_txt Prof Noh then visualized it w/ iTol (done in prev lab)! ![](https://i.imgur.com/xgOpemH.png) 17. Now you want to check out the file that describes the orthogroups of genes shared by more than one genome. N0 is 'en'-'zero', not 'no' head Phylogenetic_Hierarchical_Orthogroups/N0.tsv This file lists out all the genes that belong to each detected orthogroup (second column). The orthogroups are listed from largest to smallest so if you check out the tail of this same file you will find genes that are shared by only 2 genomes. The first column refers to the hierarchical level at this the orthogroup is found. N0 includes all orthogroups. Then in my case, N1 are orthogroups that exclude B. pseudomallei, N2 are inclusive only of P. hayleyella genomes, & N3 is inclusive of only P. bonniea genomes. NOTE: This file will make it easy to sort out the members or orthogroups that are only found in a particular subset of genomes. 18. Ex: Prof Noh wants to find genes that are shared between genome 1 (Burkholderia pseduomallei) and genomes 4 &5 (bhqs155 & bhqs69). The genomes are listed as column names in the header after the first three fields. I can now find rows of this file that have an entry in fields $4, $7, $8 but not $5, $6 (the two P. bonniea): awk -F "\t" '$4!="" && $5=="" && $6=="" && $7!="" && $8!="" {print}' OFS="\t" Phylogenetic_Hierarchical_Orthogroups/N0.tsv NOTE: == means equal to != means not equal to && means AND (logical) There are 174 Orthologs that match this condition out of 4337 total! So we narrowed down genes of interest quite a bit. If you wanted to look at clade-specific genes, it may save some time to look at the other files N1-N3. (These files are just subsets of the N0 file so you will find the same info in the big file.) Results from middle: ![](https://i.imgur.com/XEAeBx3.png) 19. Now let's see which genes are unique to individual genomes: head Orthogroups/Orthogroups_UnassignedGenes.tsv ^This file can point you to genes that are unique to individual genomes, that may do something distinct but may instead do something similar. - Prof Noh's comments on this^: - One of the things that make biology both cool but also difficult is that there are often a few different ways to achieve the same goal. Think of convergent evolution, where the overarching phenotype may be the same but the underlying genetics can be distinct. Prof Noh's final comments: Once you’ve gotten specific orthogroups and therefore genes of interest to look at, you may want to go back and check out their functions (now you know how to assign COGs), and do some additional literature research. It could also help to know specific types of genes you are looking for ahead of time and you can use a combination of custom BLAST with Orthofinder results to see if those genes exist in some but not all of your genomes.