# Lab 6 `mkdir labwk6` ## Exercise 1: Reversed position specific BLAST 1. `export BLASTDB="/courses/bi278/Course_Materials/blastdb"` to establish where the system should look for the BLAST database 2. `echo $BLASTDB` got the output /courses/bi278/Course_Materials/blastdb so I know it worked 3. Next I needed to run rpsblast on my group of proteins, starting with a fasta file `rpsblast -query hypPROKKA -db Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > hyp.rpsblast.table` 4. NOTE: Keep only the best COG match for each query sequence that have query coverage of at least 70, meaning at least 70% of the query sequence is covered by the subject that matched. Use the commad `awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t" hyp.rpsblast.table > hyp.cog.table` 5. Next we can look up what each of these COGs is and categorize them to the 17 broad functional categroies the individuals COGs belong to. First we want to look up the functional categories for the COG numbers we have. This is stored in `cognames2003-2014.tab` Use the command: `awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print $1, $2, a[$2]} else {print $0}}' OFS="t\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab hyp.cog.table > temp` Then `head temp` to make sure this worked and to see what was created 6. I now want to keep the first category for each COG number, because the first one is the primary category `awk -F "\t" '{if ( length($3)>1 ) { $3 = substr($3, 0, 1) } else { $3 = $3 }; print}' OFS="\t" temp > temp2` I then opened temp2 to look at the new file 7. Now I wanted to have a full description of the functional category which can be looked up in the table `fun2003-2014.tab` Use the command `awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0, a[$3]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > hyp.cog.categorized` 8. `rm temp temp2` ## Exercise 2: Make and use a custom BLAST database 9. `cd ~` to make sure the directory for the BLAST database is in my bi278 home`mkdir blastdb` 10. In order to search against the new database I need to add this to the $BLASTDB using the command `export BLASTDB="/courses/bi278/Course_Materials/blastdb:/home2/slpuro26/blastdb"` 11. To make protein databases, use the command `makeblastdb -in /courses/bi278/Course_Materials/labwk4/prokka.faa -dbtype prot -title genomename_prot -out blastdb/genomename_prot - parse_seqids` BUT modify on an amino acid multi-fasta file. The command now looks like`makeblastdb -in ~/labwk5/hypPROKKA -dbtype prot -title hypPROKKA_protdb -out blastdb/hypPROKKA_protdb -parse_seqids` 12. To make a nucleotide database, use the command `makeblastdb -in ~/labwk5/PROKKA_09162022.ffn -dbtype nucl -title PROKKA_nucldb -out blastdb/PROKKA_nucldb -parse_seqids` 13. Lastly, combine any BLAST databases of the same type to search against multiple datbases at once. To do this, list the exact titles given to the databases I want to combine in `-dblist`. command example `blastdb_aliastool -dbtype prot -title 4burk_prot -out myblastdb/4burk_port -dblist "bb395_prot bb433_prot bh155_prot bh69_prot "` I could do this using the files in the lab_06 directory 14. to check that it worked and searched against the new databases I created `blastp -query tssH_bpseudo.fasta -db hypPROKKA_protdb -outfmt 6 ` this worked! my results: tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BO395_03207 29.927 137 81 4 687 811 26 159 0.32 29.3 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BO395_02418 34.146 41 25 1 200 240 437 475 0.39 29.3 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BO395_02877 38.776 49 29 1 572 619 20 68 1.3 26.6 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BO395_02020 34.694 49 32 0 501 549 565 613 1.8 27.3 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BO395_02020 56.667 30 13 0 954 983 610 639 5.9 25.4 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BO395_02020 28.947 76 52 1 507 580 619 694 7.1 25.0 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BO395_02020 45.238 42 23 0 955 996 629 670 8.0 25.0 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BO395_01993 46.512 43 23 0 77 119 104 146 2.5 26.6 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BO395_01456 45.000 40 20 1 423 462 36 73 3.3 26.2 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BO395_00274 44.444 27 15 0 566 592 43 69 3.5 24.3 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BO395_03341 28.125 32 23 0 511 542 369 400 5.3 25.4 tr|A0A8A4DWX9|A0A8A4DWX9_BURPE BO395_02667 23.256 86 50 2 431 512 44 117 8.3 24.3 ## Exercise 3: Orthofinder and its output 15. Collect all the predicted protein sequences into a directory `mkdir forortho` and then copied the four .faa files from the lab_06 file into this new forortho file `cp bhqs69_prokka.faa bbqs395_prokka.faa bbqs433_prokka.faa bhqs155_prokka.faa ~/labwk6/forortho` 16. then run orthofinder within that folder`#orthofinder -f ./ -X` 17. `mv OrthoFinder/Results_Oct25/ .` 18. `rmdir OrthoFinder` 19. `cd Results_Oct25` These commands are not actually being run 20. Now `cd /courses/bi278/Course_Materials/lab_06/Results_Oct22` 21. `cat Comparative_Genomics_Statistics/Statistics_Overall.tsv` 22. To look at the species tree file: `cat Species_Tree/SpeciesTree_rooted_node_labels_txt` or can visualize using iToL as we have in a previous lab 24. To look at the file that describe the orthogroups of genes shared by more than one genome `head Phylogenetic_Hierarchical_Orthogroups/N0.tsv` This lists out all of the genes that belong to each of the detected orthogroups, losted from smallest to largest. This file makes it easy to sort out the members or orthogroups that are only found in a paticular subset of genomes 25. For example, to find genes that are shared between genome 1 and genomes 4 and 5. Use the command `awk -F "\t" '$4!="" && $5=="" && $6=="" && $7!="" && $8!="" {print}' OFS="\t" Phylogenetic_Hierarchical_Orthogroups/N0.tsv 26. To see which genes are unqiue to individual genomes `head Orthogroups/Orthogroups_UnassignedGenes.tsv` my results for this were Orthogroup Burkholderia_pseudomallei_K96243_132 bbqs395_prokka bbqs433_prokka bhqs155_prokka bhqs69_prokka OG0004292 BPSL0020 OG0004293 BPSL0024 OG0004294 BPSL0025 OG0004295 BPSL0035 OG0004296 BPSL0042 OG0004297 BPSL0043 OG0004298 BPSL0056 OG0004299 BPSL0057 OG0004300 BPSL0059