BI278 Lab#6 - HackMD

# BI278 Lab#6 ### Exercise 1. Reversed position specific BLAST 1. As before, we need to establish where your system should look for its BLAST databases. ``` export BLASTDB="/courses/bi278/Course_Materials/blastdb" echo $BLASTDB ``` 2. Run rpsblast on your group of proteins. Note that you need to start with a protein fasta file. First I copy my hypothetical protein file from last week into this week's folder ``` cp lab_05/hypothetical_protein.faa lab_06 ``` Then I run this command in lab06 directory ``` rpsblast -query hypothetical_protein.faa -db Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > bbqs395_hypothetical.rpsblast.table ``` 3. Keep only the best COG match for each query sequence that have query coverage of at least 70. This means at least 70% of the query sequence is covered by the subject that matched. ``` awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t" bbqs395_hypothetical.rpsblast.table > bbqs395_hypothetical.cog.table ``` Next, we can look up what each of these COGs is, and categorize them to the 17 broad functional categories individual COGs belong to. 4. Let’s look up the functional categories for the COG numbers you have first. This information is stored in a table: cognames2003-2014.tab ``` awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print $1, $2, a[$2]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab bbqs395_hypothetical.cog.table > temp ``` This command takes the first file cognames2003-2014.tab and makes a look-up table that can access the categories by COG number. It then takes the second file bbqs395_hypothetical.cog.table and goes through it row by row and looks up what category the COG number in that row belongs to. It then writes the row along with the COG category that it looked up and stores it in a new file. You should take a look at the temp file that was just created to understand the next command. The first 5 entries of my temp file: >BB433_00043 COG3247 S BB433_00056 COG3159 S BB433_00062 COG5285 Q BB433_00077 COG3568 R BB433_00091 COG0748 P 5. Now we want to keep the first category for each COG number. COGs can belong to multiple functional categories but the first one is its primary category. ``` awk -F "\t" '{if ( length($3)>1 ) { $3 = substr($3, 0, 1) } else { $3 = $3 }; print}' OFS="\t" temp > temp2 ``` This command keeps the first category (letter) from the third field. For example, this one change from >BB433_00132 COG3501 UXR to >BB433_00132 COG3501 U 6. Lastly, we want to have a full description of the functional category as well. This can be looked up in another table: fun2003-2014.tab ``` awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0, a[$3]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > bbqs395_hypothetical.cog.categorized ``` The first 5 entries of my file: >BB433_00043 COG3247 S Function unknown BB433_00056 COG3159 S Function unknown BB433_00062 COG5285 Q Secondary metabolites biosynthesis, transport and catabolism BB433_00077 COG3568 R General function prediction only BB433_00091 COG0748 P Inorganic ion transport and metabolism 7. We will also remove the temp files we made in the process, to clean up after ourselves. ``` rm temp temp2 ``` ### Exercise 2. Make and use a custom BLAST database 8. Make your own directory for BLAST databases in your bi278 home. ``` mkdir myblastdb ``` 9. In order to search against this new database, you will have to add this directory to your $BLASTDB. Let’s do that now: ``` export BLASTDB="/courses/bi278/Course_Materials/blastdb:/home2/colbyid /myblastdb" ``` 10. To make protein databases, modify and execute the following command on an amino acid multi-fasta file: ``` makeblastdb -in /courses/bi278/Course_Materials/lab_04/bbqs395/PROKKA_09162022.faa -dbtype prot -title bbqs395_prot -out myblastdb/bbqs395_prot -parse_seqids ``` 11. To make nucleotide databases, modify the following: ``` makeblastdb -in /courses/bi278/Course_Materials/lab_04/bbqs395/PROKKA_09162022.ffn -dbtype nucl -title bbqs395_nucl -out myblastdb/bbqs395_nucl -parse_seqids ``` 12. Lastly, you can combine any BLAST databases of the same type (nucleotide or protein) you make in this way using a command like the one below to search against multiple databases at once. You just need to use the exact titles you previously gave the databases you want to combine in -dblist. In this scenario, I created 4 protein databases from the .faa files in lab_04 and wanted to combine them into a single searchable database. The names of these custom databases are listed in doublequotes. First I repect step 10 for bbqs433 bhqs155 bhqs69: ``` #for bbqs433 makeblastdb -in /courses/bi278/Course_Materials/lab_04/bbqs433/PROKKA_09242022.faa -dbtype prot -title bbqs433_prot -out myblastdb/bbqs433_prot -parse_seqids #for bhqs155 makeblastdb -in /courses/bi278/Course_Materials/lab_04/bhqs155/PROKKA_10222022.faa -dbtype prot -title bhqs155_prot -out myblastdb/bhqs155_prot -parse_seqids #bhqs69 makeblastdb -in /courses/bi278/Course_Materials/lab_04/bhqs69/PROKKA_09242022.faa -dbtype prot -title bhqs69_prot -out myblastdb/bhqs69_prot -parse_seqids ``` Then I run the following command: ``` blastdb_aliastool -dbtype prot -title 4burk_prot -out myblastdb/4burk_prot -dblist "bbqs395_prot bbqs433_prot bhqs155_prot bhqs69_prot" ``` 13. Let’s make sure everything worked by searching against one of your new custom databases. I want to find all the matches to tssH, a secretion system gene. I downloaded a sequence from Uniprot (Universal Protein Resource), a database that contains both curated and electronically annotated proteins. Swiss-Prot is a highly curated subset of Uniprot. ``` blastp -query lab_06/tssH_bpseudo.fasta -db 4burk_prot -outfmt 6 ``` ### Exercise 3. Orthofinder and its output 14. First collect all predicted protein sequences (prokka .faa equivalent) into a directory. Then run orthofinder within that folder. My results are from running orthofinder on the predicted protein sequences of 2 strains of P. bonniea (Bbqs395, Bbqs433), 2 strains of P. hayleyella (Bhqs155, Bhqs69), and Burkholderia pseudomallei downloaded from the Burkholderia genome database. ``` orthofinder -f ./ -X ``` 15. Check the overall statistics. ``` cat OrthoFinder/Results_Oct25/Comparative_Genomics_Statistics/Statistics_Overall.tsv ``` >Number of species 3 Number of genes 10586 Number of genes in orthogroups 10110 Number of unassigned genes 476 Percentage of genes in orthogroups 95.5 Percentage of unassigned genes 4.5 Number of orthogroups 3428 Number of species-specific orthogroups 40 Number of genes in species-specific orthogroups 149 Percentage of genes in species-specific orthogroups 1.4 For my 3 genomes, 95.5 percent of all genes belong to orthogroups while the remaining 4.5 percent do not 16. Check out the species tree file. Orthofinder uses phylogenies (trees) to more accurately find orthogroups so we need to know what the species tree looks like and what the names of the internal nodes are. ``` cat SpeciesTree_rooted_node_labels.txt ``` >(PROKKA_09162022:0.5,(PROKKA_09242022:1,PROKKA_10222022:1)N1:0.5)N0; 17. Now I want to check out the file that describes the orthogroups of genes shared by more than one genome. (N0 is ‘en’-‘zero’, not ‘no’) ``` head N0.tsv ``` 18. For example, I want to find genes that are shared between genome 1 (Burkholderia pseudomallei) and genomes 4 and 5 (Bhqs155 and Bhqs69). The genomes are listed as column names in the header after the first three fields. I can now find rows of this file that have an entry in fields $4, $7, $8 but not $5, $6 (the two P. bonniea): ``` awk -F "\t" '$4!="" && $5=="" && $6=="" && $7!="" && $8!="" {print}' OFS="\t" N0.tsv ``` >N0.HOG0003477 OG0003422 - BH69_03640 BH155_03627 N0.HOG0003478 OG0003423 - BH69_03641 BH155_03628 N0.HOG0003479 OG0003424 - BH69_03642 BH155_03629 N0.HOG0003480 OG0003425 - BH69_03643 BH155_03630 N0.HOG0003481 OG0003426 - BH69_03644 BH155_03631 N0.HOG0003482 OG0003427 - BH69_03649 BH155_03636 19. Next let’s see which genes are unique to individual genomes. ``` head Orthogroups/Orthogroups_UnassignedGenes.tsv ``` >Orthogroup PROKKA_09162022 PROKKA_09242022 PROKKA_10222022 OG0003428 BO395_00030 OG0003429 BO395_00031 OG0003430 BO395_00037 OG0003431 BO395_00040 OG0003432 BO395_00045 OG0003433 BO395_00051 OG0003434 BO395_00054 OG0003435 BO395_00057 OG0003436 BO395_00062