Lab 06: rpsblast and Orthofinder

# Lab 06: rpsblast and Orthofinder ## Exercise 1: Reversed position specific BLAST Point the system towards where the BLAST dbs are located. ``` export BLASTDB="/courses/bi278/Course_Materials/blastdb" echo $BLASTDB ``` Run rpsblast on group of proteins. I had to use a protein sequence provided by Professor Noh as the one that I used last week was producing no hits even after removing the evalue as a parameter (setting it to its default value) ``` rpsblast -query bbqs395_hypothetical.faa -db Cog - max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > bbqs395_hypothetical.rpsblast.table ``` Keep only the best COG (clusters of orthologous groups) matches (i.e. any query squence that has a query coverage of above 70%) ``` awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t" M_tuberculosis.rpsblast.table > M_tuberculosis.cog.table ``` We can look up the category of each COG by using the look-up table, cognames2003-2014.tab ``` awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a {print$1, $2, a[$2]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab M_tuberculosis.cog.table > temp ``` The first part of the command creates the look-up table from the cogs2003-2014.tab table in the course materials folder and the second part of the command looks through the input file and categorizes each COG in a new column. ``` awk -F "\t" '{if ( length($3)>1 ) { $3 = substr($3, 0, 1)} else { $3 = $3 }; print}' OFS="\t" temp > temp2 ``` Since COGs can belong to multiple functional categories, we need to them down to only the first category because it is the primary functional group for the COG. We can then take this table of labeled COGs and add a full description of each functional category. These descriptions can be found in the fun2003-2014.tab in the Course Materials folder ``` awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0, a[$3]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > $species.cog.categorized rm temp temp2 ``` ## Exercise 2: Make and use a custom BLAST database Point BLASTDB to where our new custom BLAST database will be stored ``` export BLASTDB="/courses/bi278/Course_Materials/blastdb:/home2/kyamad23/myblastdb" ``` Make a protein database out of the lab_04 bbqs433 amino acid multi-fasta sequence. ``` makeblastdb -in ../lab_04/PROKKA_09242022.faa -dbtype prot -title bbqs433_prot -out myblastdb/bbqs433_prot -parse_seqids ``` Make a nucleotide database out of the ffn file from the same lab and species. ``` makeblastdb -in ../lab_04/PROKKA_09242022.ffn -dbtype nucl -title bbqs433_nucl -out myblastdb/bbqs433_nucl -parse_seqids ``` This outputted various files in the myblastdb folder. Combine BLAST databases of the same type. The command requires exact titles of the input files. This can be done after creating a protein database from the bbqs395 P. bonniea species similarly to the way we constructed the database above. ``` blastdb_aliastool -dbtype prot -title 2burk_prot -out myblastdb/2burk_prot -dblist "bbqs433_prot bbqs395_prot" ``` To test whether our blast databases were properly combined, we can query our custom blast db for sequences from tssH ``` blastp -query tssH_bpseudo.fasta -db 2burk_prot -outfmt 6 ``` ## Exercise 3: Orthofinder and its output Move all faa PROKKA protein annotations in to one folder in lab_06 (bbqs433 and bbqs395) and run OrthoFinder ``` orthofinder -f ./ -X ``` Since the species names are already in the gene annotations, we use the -X flag to not add them again. ``` mv OrthoFinder/Results_Oct25 ../ rmdir OrthoFinder/ cd ../Results_Oct25 ``` We can access the first few lines of the statistical overview produced by orthofinder. ``` cat Comparative_Genomics_Statistics/Statistics_Overall.tsv ``` ![](https://i.imgur.com/GtbSW1x.png) For the three species analyzed, 99.7% of the genes belong to orthogroups. ``` cat Species_Tree/SpeciesTree_rooted_node_labels_txt ``` By inputing the results of this command into iTol, we can create a tree visualization of phylogenies. ![](https://i.imgur.com/2onyX9X.png) If we had run the orthofinder on the protein sequences of more species, the phylogenetic tree would be more robust. The code below returns the orthogroups of genes shared by more than one genome. ``` head Phylogenetic_Hierarchical_Orthogroups/N0.tsv ``` ![](https://i.imgur.com/hvSWbVU.png) Assuming we had run the orthofinder on more than two genomes, below is the code that lists how we would search for genes that are shared within certain genomes but not others: ``` awk -F "\t" '$4!="" && $5=="" && $6=="" && $7!="" && $8!="" {print}' OFS="\t" Phylogenetic_Hierarchical_Orthogroups/N0.tsv ``` It specifies rows of this file that have an entry in fields $4, $7, $8 but not $5, $6. To find genes that were not assigned to an orthogroup, run the following command: ``` head Orthogroups/Orthogroups_UnassignedGenes.tsv ``` ![](https://i.imgur.com/HMnYqsx.png) The faidx command in samtools allows us to return the amino acid sequences from given indices. The rpsblast command takes the protein sequence and creates an rpsblast table. ``` samtools faidx ./prot_dir/PROKKA_09242022.faa BB433_00643 BB433_01586 BB433_01908 BB433_02771 BB433_02772 BB433_02773 BB433_02776 BB433_02948 BB433_02979 > bb433_unassigned.faa rpsblast -query bb433_unassigned.faa -db Cog - max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > bb433_unassigned.rpsblast.table ``` -------------------------------------------------------- Pulls the second column out of the file. ``` cat Orthogroups_UnassignedGenes.tsv | tr "\t" "~" | cut -d"~" -f2 > unassigned.txt ```