BI278 Lab 6: rpsblast and orthofinder

# BI278 Lab 6: rpsblast and orthofinder 1. Reversed position specific BLAST Many kinds of BLAST, rpsblast is used to compare protein query sequence against a database of protein profiles. Reverse of PSI-BLAST. NCBI has a collection of protein profiles in their conserved domain database (CDD), and we will be using COGs (clusters of orthologous groups) in this lab. COGS were developed to define curated groups of proteins across genomes that should be orthologous to each other. We find orthologs and orthogroups to use the functional predictions from well-studied members of each group and to extrapolate to what the poorly-studied or newly discovered members of the group do biologically. 1. Establish where your system should look for BLAST databases. ``` export BLASTDB="/courses/bi278/Course_Materials/blastdb" echo $BLASTDB #move files from lab_05 into lab_06 for example: cp ~/lab_05/PROKKA_09162022.faa ~/lab_06/ ``` Now, you can add any input but a good test is to use the hypothetical proteins from last week. 2. Run rpsblast on your group of proteins. Start with protein fasta file. ``` rpsblast -query PROKKA_10042022.faa.hypotheticalprotein -db Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > bbqs395_hypothetical.rpsblast.table ``` (like other format 6 results) full table of matches for any of the hypothetical proteins that scored sifficiently high with COG profiles. Check out your result file because we're only going to keep the most significant COG match for each query sequence ``` head bbqs395_hypothetical.rpsblast.table ``` 3. Keep only the best COG match for each query sequence that have query coverage of at least 70. This means that at least 70% of the query suquence is covered by the subject that matched. ``` awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t" bbqs395_hypothetical.rpsblast.table > bbqs395_hypothetical.cog.table ``` This comand: - takes the format 6 table and splits it up by tabs and commas. the comma is because the last column includes the COG number along with a lot of other descriptors seperated by commas. we just want to grab the COG number from this column for the next step. - it looks at all of the rows that match the same query ($1) as long as each query has coverage of at least 70 - it keeps the first one (best match) and prints two columns (=fields). you should be able to figure out what these fields are based on the context above Now, we can look up what each of the COGs is and then categorize them into the 17 broad functional categories individual COGs belong to. 4. Look up functional categories for the COG numbers. This info is stored in a table: cognames2003-2014.tab ``` awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print $1, $2, a[$2]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab bbqs395_hypothetical.cog.table > temp ``` This command takes the first file cognames2003-2014.tab and makes a look-up table that can access the categories by COG number. It then takes the second file bbqs395_hypothetical.cog.table and goes through it row by row and looks up what category the COG number in the row belogns to. It then writes the row along with the COG category that it looked up and stores it in a new file. Take a look at the temp file created to understand the comand (head temp). 5. Now we want to keep the first category for each COG number. COGs can belong to multifunctional categories but the first one is its primary category. ``` awk -F "\t" '{if ( length($3)>1 ) { $3 = substr($3, 0, 1) } else { $3 = $3 }; print}' OFS="\t" temp > temp2 ``` This command keeps the first category (letter) from the third field. 6. Lastly, we want to have a full description of the functional category as well. This can be looked up in another table: fun2003-2014.tab ``` awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0, a[$3]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > bbqs395_hypothetical.cog.categorized ``` 7. Clean up! remove temp files ``` rm temp temp2 ``` Final file: - hypothetical proteins are not all hypothetical - Many related to known proteins - COGS and orhtology prediction are great for this type of application ![](https://i.imgur.com/vMTJVaJ.png) 2.Exercise 2. Make and use a custom BLAST database 8. Make a directory for BLAST databases in bi278 home 9. Add this directory to $BLASTDB ``` export BLASTDB="/courses/bi278/Course_Materials/blastdb:/home2/cgsnow25/BLAST" ``` 10. To make protein databases, modify and excecute this command on an amino acid multi-fasta file ``` makeblastdb -in /courses/bi278/Course_Materials/lab_04/prokka.faa -dbtype prot -title genomename_prot -out myblastdb/genomename_prot -parse_seqids makeblastdb -in /courses/bi278/Course_Materials/lab_04/bbqs395PROKKA_09162022.faa -dbtype prot -title bbqs395_prot -out BLAST/bbqs395_prot –parse_seqids #didn't work, renamed .faa file as bbqs395_prokka.faa and moved to home. ran: makeblastdb -in bbqs395_prokka.faa -dbtype prot -title bbqs395_prot -out BLAST/bbqs395_prot -parse_seqids ``` 11. To make nucleotide databases, modify: ``` makeblastdb -in /courses/bi278/Course_Materials/lab_04/bbqs395/PROKKA_09162022.fnn -dbtype nucl -title bbqs395_nucl -out BLAST/bbqs395_nucl -parse_seqids ``` This will create a series of files: *.phr, *.pin, and *.psq for more protein databases, and *.nhr, *.nin, and *.nsq and more for nucleotide databases. *** remember!! amino acids are .faa, nucleotides are .ffn 12. You can combine any BLAST databases of the same type (nucleotide or protein) that you make by using a command like the one below to search against multiple databases at once. You just need to use the exact titles you previously gave the databases you want to combine in -dblist. She created 4 protein databases from the .faa files in lab_04 and wanted to combine them into a single searchable database. ``` blastdb_aliastool -dbtype prot -title 4burk_prot -out myblastdb/4burk_prot -dblist "bb395_prot bb433_prot bh155_prot bh69_prot " ``` 13. Make sure that everything worked by searching agains on =e of the new custom databases. If you want to find all of the matches to tssH (secretion system gene), download seqence from Uniprot. ``` blastp -query tssH_bpseudo.fasta -db 4burk_prot -outfmt 6 ``` ![](https://i.imgur.com/WhmAHP9.png) Prof. noh's comment: One thing I notice with these results is that the results contain for several subjects 2 HSPs that were not combined in the gapped realignment step. I can tell this by the columns that represent qstart qend sstart send. These proteins must be just a little bit different compared to the query sequence that there is a section in the middle that doesn’t want to align between the query and the subject. 3.Excersize 3. Orthofinder and its output Orthofinder is a popular orthogroup detection software that finds homologous groups of genes that are descended from a single ancestor gene (used for many purposes but we are using it to run genome comparison analysis). Comparative genomics analyses--> compare which homologous genes are in common across genomes and which are unique. Many states are in between comletely shared/completely unique. BUSCO is supposed to represent single copy genes shared by all members of particular grops (universal enough that we use these genes to determine whether our sequencing project was successful or not at capturing most of the genome). 14. First collect all of the predicted protein sequences (prokka .faa equivalent) into a directory. Then run orthofinder within that folder. My results are from running orthofinder on the predicted protein sequences of 2 strains of P. bonniea (bbqs395 and bbqs433) and 2 strains of p. hayleyalla (bhqs155 and bhqs69) and burkholderia pseudomallei downloaded from the burkholderia genome database. ``` #orthofinder -f ./ -X ``` In this command, -f specifies where my protein fasta files are (./ means here because i was in the same directory as these files). and -X specifies that i don't want orthofinder to add species names to my gene ids because I already did this when running prokka. I now have results in a nested set of directories: Orthofinder/Results_date. I moved the results up a level and god rid of the empty directory for convenience. Results are in /courses/bi278/Course_Materials/lab_06/ ``` #mv OrthoFinder/Results_Oct22/ . #rmdir OrthoFinder/ #cd Results_Oct22 ``` Go into Results_Oct22 Changing into the output directory, there are many results files organized into several directories. Key results: 15. Check the overall statistics ``` cat Comparative_Genomics_Statistics/Statistics_Overall.tsv ``` ![](https://i.imgur.com/Z3wNMdx.png) For my 5 genomes, 93.5 percent of all genes belong to orthogroups while the remaining 6.5 percent do not. The files that are most relevant for finding out which genes you may want to focus on at a larger scale are highlighted in the following steps. 16. Check out the species tree file. Orthofinder uses phylogenies trees) to more accurately find orthogroups so we need to know what the species tree looks like and what the names of the internal nodes are ``` cat Species_Tree/SpeciesTree_rooted_node_labels_txt ``` Visualized in iTol 17. Now I want to check out the file that describes the orthogroups of genes shared by more than one genome. (N0 is ‘en’-‘zero’, not ‘no’) ``` head Phylogenetic_Hierarchical_Orthogroups/N0.tsv ``` This file lists out all the genes that belong to each detected orthogroup (second column). The orthogroups are listed from largest to smallest, so if you check out the tail of this same file you will find genes that are shared by only 2 genomes. The first column refers to the hierarchical level at this the orthogroup is found. N0 includes all orthogroups. Then in my case, N1 are orthogroups that exclude B. pseudomallei, N2 are inclusive only of P. hayleyella genomes, and N3 is inclusive of only P. bonniea genomes. This file will make it easy to sort out the members or orthogroups that are only found in a particular subset of genomes. 18. Want to find genes shared between genome 1 (burkholderia pseudomallei) and genomes 4 and 5 (bhqs155 and bhqs69). The genomes are listed as column names in the header after the first three fields. I can now find rows of this file that have an entry in fields $4, $7, $8 but not $5, $6 (the two P. bonniea): ``` awk -F "\t" '$4!="" && $5=="" && $6=="" && $7!="" && $8!="" {print}' OFS="\t" Phylogenetic_Hierarchical_Orthogroups/N0.tsv ``` == means equal to != means not equal to && means AND (logical) Prof Noh note: "There are 174 orthogroups that match this condition out of a total of 4337, so I’ve narrowed down my genes of interest quite a bit. If you wanted to look at clade-specific genes, it may save some time to look at the other files N1-N3. (These files are just subsets of the N0 file so you will find the same information in the big file.)" 19. See which genes are unique to individual genomes ``` head Orthogroups/Orthogroups_UnassignedGenes.tsv ``` - Different ways to achieve the same goal - convergent evolution: overarching phenotype is the same, underlying genetics is distinct - Once you have specific orthogroups, you can go back and check out functions - if you know specific types of genes ahead of time, you can use a comination of custom BLAST with Orthofinder results to see if those genes exist in some but not all genomes