Bi278 Lab 6 v2 rpsblast and Orthofinder

# Bi278 Lab 6 v2 rpsblast and Orthofinder ### By Lee Ferenc 10/25/2023 ## 1. Reversed position specific BLAST Rpsblast is used to compare a protein query sequence against a database of protein profiles (NCBI has a collection of protein profiles on their conserved domain database (CDD)). COGs (Clusters Of Orthologous Groups) were developed as a way of defining curated groups of proteins across genomes that should be orthologous to each other. This is to extropolate well-studied groups onto poorly-studied ones. ### Step 1: Establish BLAST onto ssh. ``` export BLASTDB="/courses/bi278/Course_Materials/blastdb" echo $BLASTDB ``` ### Step 2: Run rpsblast on your group of proteins. You need to start with a protein fasta file. ``` rpsblast -query bbqs395_hypothetical.faa -db Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > bbqs395_hypothetical.rpsblast.table ``` #### My example: ``` rpsblast -query subset_hypo_proteins_prokka_09262023.faa -db Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > hypo_proteins_prokka.rpsblast.table ``` ### Step 3: Keep only the best COG match for each query sequence that have query coverage of at least 70. This means at least 70% of the query sequence is covered by the subject that matched. ``` awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t" bbqs395_hypothetical.rpsblast.table > bbqs395_hypothetical.cog.table ``` The command: * Takes the format 5 table and splits it by tabs and commas. We have to do this to get the COG # * Match the same query ($1) as long as each query has a coverage of 70 or above * Keeps best match and prints two columns (=fields) #### My example: `awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t" hypo_proteins_prokka.rpsblast.table > hypo_proteins_prokka.cog.table ` Okay so that didn't work so none of them were =>70. Oops. Trying w/ another subset of just toxin proteins. ``` rpsblast -query toxin.faa -db Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > toxin_prokka.rpsblast.table awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t" toxin_prokka.rpsblast.table > toxin_prokka.cog.table` ``` Output: ``` BHQS69_01563 COG2867 BHQS69_01564 COG2914 BHQS69_01744 COG2337 BHQS69_01799 COG2274 BHQS69_01800 COG0845 BHQS69_02317 COG5450 ``` ### Step 4: Look up the functional categories for the COG numbers you have. Info in table: `cognames2003-2014.tab` ``` awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print $1, $2, a[$2]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab bbqs395_hypothetical.cog.table > temp ``` Command: * Takes `cognames2003-2014.tab` and makes a look-up table that can access categories by COG number * Takes input file and goes thru row by row and finds what category each COG number belongs to * Writes the row along with the COG category that it looked up and stores it in a new file (`temp`) #### My example: ``` awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print $1, $2, a[$2]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab toxin_prokka.cog.table > temp ``` Output of file head: ``` BHQS69_01563 COG2867 J BHQS69_01564 COG2914 V BHQS69_01744 COG2337 V BHQS69_01799 COG2274 V BHQS69_01800 COG0845 MV BHQS69_02317 COG5450 K ``` ### Step 5: Keep the first category for each COG number. COGs can belong to multiple functional categories but the first one is its primary category. ``` awk -F "\t" '{if ( length($3)>1 ) { $3 = substr($3, 0, 1) } else { $3 = $3 }; print}' OFS="\t" temp > temp2 ``` #### My example (same as above) Output of file head: ``` BHQS69_01563 COG2867 J BHQS69_01564 COG2914 V BHQS69_01744 COG2337 V BHQS69_01799 COG2274 V BHQS69_01800 COG0845 M BHQS69_02317 COG5450 K ``` ### Step 6: Get full description of the functional category Other table for this: `fun2003-2014.tab` ``` awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0, a[$3]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > bbqs395_hypothetical.cog.categorized ``` #### My example: ``` awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0, a[$3]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > prokka_toxins.cog.categorized ``` Output of file head: ``` BHQS69_01563 COG2867 J Translation, ribosomal structure and biogenesis BHQS69_01564 COG2914 V Defense mechanisms BHQS69_01744 COG2337 V Defense mechanisms BHQS69_01799 COG2274 V Defense mechanisms BHQS69_01800 COG0845 M Cell wall/membrane/envelope biogenesis BHQS69_02317 COG5450 K Transcription ``` ### Step 7: Clean up (remove temp files) `rm temp temp2` ### (Optional) Step 8: Add what the individual COG numbers stand for to the final table ``` awk -F "\t" 'NR==FNR {a[$1]=$3;next} {if ($2 in a){print $1, $2, a[$2], $3, $4} else {print "no match"}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab bbqs395_hypothetical.cog.categorized ``` #### My example: ``` awk -F "\t" 'NR==FNR {a[$1]=$3;next} {if ($2 in a){print $1, $2, a[$2], $3, $4} else {print "no match"}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab prokka_toxins.cog.categorized ``` Output: ``` BHQS69_01563 COG2867 Ribosome association toxin PasT (RatA) of the RatAB toxin-antitoxin module J Translation, ribosomal structure and biogenesis BHQS69_01564 COG2914 Putative antitoxin component PasI (RatB) of the RatAB toxin-antitoxin module, ubiquitin-RnfH superfamily V Defense mechanisms BHQS69_01744 COG2337 mRNA-degrading endonuclease, toxin component of the MazEF toxin-antitoxin module V Defense mechanisms BHQS69_01799 COG2274 ABC-type bacteriocin/lantibiotic exporters, contain an N-terminal double-glycine peptidase domain V Defense mechanisms BHQS69_01800 COG0845 Multidrug efflux pump subunit AcrA (membrane-fusion protein) M Cell wall/membrane/envelope biogenesis BHQS69_02317 COG5450 Transcription regulator of the Arc/MetJ class K Transcription ``` ## 2. Orthofinder Orthofinder is a popular orthogroup detection software that finds homologous groups of genes (=orthogroup). ### Step 1: Collect predicted protein sequences (PROKKA*.faa equivalent) from the genomes you want to compare into a directory. Then run Orthofinder. ``` orthofinder -f ./ -X ``` `-f` Species where protein fasta files, (`./` means here) and `-X` species I don't want orthofiner to add species names to gene ids. #### My example: Due to the files taking too long I used the files from `/courses/bi278/Course_Materials/lab_06/Results_Oct23/` and didn't copy since there's a lot of folders ### Step 2: Check the overall statistics. ``` cat Comparative_Genomics_Statistics/Statistics_Overall.tsv ``` Looks something like: ``` Number of species 7 Number of genes 26880 Number of genes in orthogroups 25526 Number of unassigned genes 1354 Percentage of genes in orthogroups 95.0 Percentage of unassigned genes 5.0 Number of orthogroups 4357 Number of species-specific orthogroups 272 Number of genes in species-specific orthogroups 960 Percentage of genes in species-specific orthogroups 3.6 Mean orthogroup size 5.9 Median orthogroup size 7.0 ``` Note: For all examples I'm using the same files so I don't see a reason to create examples when the path is the only different thing ### Step 3: Examine species tree file. Orthofinder uses phylogenies (trees) to more accurately find orthogroups so we need to know what the species tree looks like. ``` cat Species_Tree/SpeciesTree_rooted_node_labels.txt ``` Output: ``` (Burkholderia_pseudomallei_K96243_132:0.139475,(((bbqs433_prokka:0.00032017,bbqs395_prokka:0.00012987)N4:0.00129303,bbqs859.prokka:0.0026247)N2:0.0949503,((bhqs11.prokka:0.00262492,bhqs155_prokka:0.000209001)N5:0.000404395,bhqs69_prokka:0.000737986)N3:0.10081)N1:0.139475)N0; ``` I input it into the tree of life (see lab 3) and it looked the same as the tutorial. ![](https://hackmd.io/_uploads/Bk-TSFBf6.png) ### Step 4: Examine file. ``` head Orthogroups/Orthogroups.tsv ``` Output: Incomprehensible cat for `.tsv` format but it's all the genes from each genome that belond to each detected orthogrou (first column). Listed largest to smallest. You can just take the `.tsv` file and upload it to Google Slides so it's readable. Then you can see how many groups of size 3+. It was 453. ### Step 4: Examine File (But Different) ``` awk '$2=="0" && $6=="0" && $7=="0" && $8=="0" && $9>=3' ../Orthogroups/Orthogroups.GeneCount.tsv | wc -l ``` You also get 453. :) ### Step 5: Find fasta files for the genes by orthogroup id for each orthogroup: Found here: `Orthogroup_Sequences` Remember, each fasta file is one gene. ### Step 6: Check out which genes are unique to individual genomes Here: ``` Orthogroups/Orthogroups_UnassignedGenes.tsv ``` ## 3. Make and use a custom BLAST database ### Step 1: Make your own directory to store BLAST databases in your bi278 home. ``` mkdir enfere24blastDB ``` ### Step 2: Add this directory to your $BLASTDB ``` export BLASTDB="/courses/bi278/Course_Materials/blastdb:/home2/enfere24/enfere24blastDB" ``` ### Step 3: Make protein databases ``` makeblastdb -in PATH/PROKKA.faa -dbtype prot -title NAME_prot -out enfere24blastDB/NAME_prot -parse_seqids ``` This will make a series of files: *.phr, *.pin, *.psq and more. #### My Example: ``` makeblastdb -in lab_03/PROKKA_09262023.faa -dbtype prot -title BHQS69_prot -out enfere24blastDB/BHQS69_prot -parse_seqids ``` ### Step 4: Make nucleotide databases ``` makeblastdb -in PATH/PROKKA.ffn -dbtype nucl -title NAME_nucl -out enfere24blastDB/NAME_nucl -parse_seqids ``` Files created: *.nhr, *.nin, *.nsq and more. My example: ``` makeblastdb -in lab_03/PROKKA_09262023.ffn -dbtype nucl -title BHQS69_nucl -out enfere24blastDB/BHQS69_nucl -parse_seqids ``` (Don't worry for the project we'll make a different BLAST database this just testing. I did have to create a new DB cause I accidentally ran this. ) ### Step 5: Combine BLAST databases You can combine any BLAST databases of the same type (nucleotide or protein). This way you can search against multiple databases. Lab example: ``` blastdb_aliastool -dbtype prot -title 4burk_prot -out myblastdb/4burk_prot -dblist "bb395_prot bb433_prot bh155_prot bh69_prot" ``` Make sure to use the exact titles you give the databases you want to combine in `-dblist`. ### Step 6: Test. ``` blastp -query /courses/bi278/Course_Materials/lab_06/tssH_bpseudo.fasta -db YOUR_DB -outfmt 6 ``` My example: ``` blastp -query /courses/bi278/Course_Materials/lab_06/tssH_bpseudo.fasta -db BHQS69_prot -outfmt 6 ``` So rememer the database name is "BHQS69_prot" or whatever you named it in step 3 Output: ``` tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_01420 59.216 510 198 3 1 501 1 509 3.00e-166 509 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_02560 48.356 517 256 4 1 515 53 560 1.22e-144 452 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_02560 49.062 320 150 2 613 929 582 891 5.17e-78 271 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_03640 44.372 462 249 2 1 454 15 476 5.89e-108 354 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_03640 49.235 327 161 2 604 925 540 866 1.85e-86 295 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_01886 41.935 496 250 7 10 495 3 470 2.20e-101 335 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_01886 39.244 344 197 5 610 948 520 856 5.70e-73 256 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_02364 42.995 414 206 10 32 430 21 419 3.34e-79 271 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_02364 35.197 304 179 8 658 956 459 749 5.80e-46 174 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_02351 72.222 18 5 0 984 1001 1311 1328 0.035 34.7 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_02351 72.222 18 5 0 987 1004 1311 1328 0.035 34.7 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_02351 72.222 18 5 0 990 1007 1311 1328 0.035 34.7 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_02351 68.750 16 5 0 993 1008 1311 1326 1.5 29.3 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_02351 61.111 18 7 0 981 998 1311 1328 1.8 29.3 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_03064 35.484 62 37 2 736 795 137 197 0.11 32.7 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_02258 28.846 52 33 1 303 350 125 176 0.12 32.7 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_00225 39.474 38 23 0 688 725 275 312 0.20 32.0 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_03498 26.882 186 103 8 687 857 26 193 0.28 31.2 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_02301 30.081 123 52 8 662 772 156 256 0.33 31.2 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_02301 27.731 119 57 6 208 311 165 269 0.39 31.2 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_01188 27.612 134 84 3 462 594 77 198 0.58 30.4 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_00631 30.612 49 31 1 676 724 149 194 2.1 28.5 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_01307 40.476 42 24 1 693 733 31 72 3.4 27.7 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_03134 38.776 49 29 1 572 619 20 68 4.2 26.9 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_03397 36.207 58 31 2 877 928 322 379 4.6 27.3 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_01126 45.455 22 12 0 949 970 27 48 4.9 25.0 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_02698 35.526 76 37 2 25 91 29 101 5.4 26.6 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_02758 30.526 95 40 4 4 81 87 172 5.7 26.9 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BHQS69_00111 25.974 77 52 1 225 301 101 172 6.7 26.9 ``` As for NCBI questions, I genuinely don't have any right now. If I do, I'll bug my cousin and he'll probably have the answer or know who will. Only good family member. (He works for the NDB which is apart of the NIH and is specifically involved with coordinating with other NIH branches)