# Lab #6

Goal: use rpsblast to understand functional characterization, and use Orthofinder for orthogroup detection. Orthofinder takes a lot of memory to run, so we will only be working with results today.

The steps we will be following today are:

• Run rpsblast
• Parse Orthofinder results

## Exercise 1. Reverse Position-Specific BLAST

Taken from Professor Noh's lab guide: rpsblast is used to compare a protein query sequence against a database of protein profiles. NCBI has a collection of protein profiles stored in what they call their Conserved Domain Database (CDD). The database includes profile models based on NCBI-curated domains, NCBIfams and Pfam HMM profiles, COGs, and other external data sources. We will be using COGs today.

COGs are clusters of orthologous groups: a way of defining curated groups of proteins across genomes that are predicted to be orthologous to each other. By finding orthologs we can transfer the functional predictions of well-studied protein groups and extrapolate what newly discovered proteins are and what they do biologically. Today we'll be using COG profiles to determine whether our hypothetical proteins from previous weeks resemble any known protein profiles in the COG database.

First we need to establish where our system should find its BLAST databases.

```
export BLASTDB="/courses/bi278/Course_Materials/blastdb"
echo $BLASTDB
```

Now that we have established where the system should find the database, we can run any BLAST we would like, but Prof. Noh wants us to use our hypothetical proteins from last week. What we do with these will be applicable to any group of proteins you want to annotate for their potential functions, including whole genomes.

Running rpsblast on the hypothetical proteins:

`rpsblast -query 09242022_hypothetical.prot.faa -db Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > 09242022_hypothetical.prot.rpsblast.table`

The resulting table contains the COG hits, and we only want to keep the most significant COG result for each query, so use this command:

`awk -F'[\t,]' '!x[$1]++ && $4>=70 {print $1,$5}' OFS="\t" 09242022_hypothetical.prot.rpsblast.table > 09242022_hypothetical.prot.cog.table`

Professor Noh's analysis of the above command:

"This command: Takes the format 6 table and splits it up by both tabs and commas. Previously we used only tabs to split these types of tables. The comma is needed here because the last column includes the COG number along with a lot of other additional descriptors separated by commas. We want to just grab the COG number from this column for the next step. It then looks at all the rows that match the same query ($1) as long as each has query coverage of at least 70. It keeps the first result for each query (best match) and prints two columns (=fields). You should be able to figure out what these fields are based on the context above."
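As a quick sanity check (not part of the lab guide), you can count how many hypothetical proteins retained a significant COG hit after the filtering step; this assumes the filtered table produced above.

```bash
# Hypothetical sanity check: how many query proteins kept a COG hit with
# >= 70% query coverage? Each query should appear at most once in this table.
cut -f1 09242022_hypothetical.prot.cog.table | sort -u | wc -l
```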
Now that we have the significant-coverage COGs, we can look up the functional category for each COG number we have. These categories are stored in the table `cognames2003-2014.tab`. To look up the categories, use this command:

`awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print $1, $2, a[$2]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab 09242022_hypothetical.prot.cog.table > temp`

Professor Noh's analysis of this command:

"The command takes the first file cognames2003-2014.tab and makes a look-up table that can access the categories by COG number. It then takes the second file 09242022_hypothetical.prot.cog.table and goes through it row by row and looks up what category the COG number in that row belongs to. It then writes the row along with the COG category that it looked up and stores it in a new file."

This temporary file thus has an additional column with the categories.

Now we want to keep only the first category for each COG number. COGs can belong to multiple functional categories, but the first one is the primary category, so we use the following command to isolate it:

`awk -F "\t" '{if ( length($3)>1 ) { $3 = substr($3, 1, 1) }; print}' OFS="\t" temp > temp2`

The categories are still just symbolized by a letter, and we want a full description, so we use this command to retrieve that description from the table fun2003-2014.tab:

`awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0, a[$3]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > 09242022_hypothetical.prot.cog.categorized`

Use the head or cat command to check this. Finally, we have to remove the temp files we were using:

`rm temp temp2`

Prof. Noh's comments on our final file:

"Take a look at the final file. If it looks like mine, you will see a gene_id, a COG number, a letter category, and a description of the category. You can maybe also see that not all hypothetical proteins are actually completely unknown. Many can be related to known proteins or look like them. COGs and other orthology prediction mechanisms are great for this type of application. It is also interesting that there appear to be COGs that are common enough proteins to warrant a formal COG description but we don’t know what their specific functions are. These fall into category S."

To add what each COG number stands for to the file, we can use the following command:

`awk -F "\t" 'NR==FNR {a[$1]=$3;next} {if ($2 in a){print $1, $2, a[$2], $3, $4} else {print "no match"}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab 09242022_hypothetical.prot.cog.categorized`
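A small hypothetical follow-up (not from the lab guide): since the categorized file has the letter category in its third column, you can tally how many hypothetical proteins fall into each functional category.

```bash
# Hypothetical summary: count hypothetical proteins per COG functional category
# (column 3 of the categorized file), most common categories first.
cut -f3 09242022_hypothetical.prot.cog.categorized | sort | uniq -c | sort -rn
```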
## Exercise 2. Orthofinder and its output

Orthofinder is an orthogroup detection software that finds homologous groups of genes. We will be using it for genome comparison analysis. One thing we like doing is comparing which homologous genes are common or unique across genomes. This isn't so clear cut, as there are many states of homology between completely shared and completely unique (e.g. shared by some proportion of genomes but not all). Prof. Noh suggests: "Think of it as set theory applied to the genes each genome possesses."

This should remind us of BUSCO from the first few weeks of lecture. "BUSCO is supposed to represent: single copy genes shared by all members of particular groups, that are universal enough that we use these genes to determine whether our sequencing project was successful or not at capturing most of the genome"

Running the software will be easy; interpreting the results will be hard.

From Prof. Noh on her results: "My results are from running orthofinder on the predicted protein sequences of 3 strains of P. bonniea (Bbqs395, Bbqs433, Bbqs859), 3 strains of P. hayleyella (Bhqs155, Bhqs69, Bhqs11). I also included as an outgroup taxon Burkholderia pseudomallei, whose file I downloaded from the Burkholderia genome database".

For our projects, Prof. Noh suggests using the same outgroup as the Husnik et al. paper: Pseudomonas aeruginosa. This organism is in a different taxonomic group than the Enterobacterales we are working with.

In order to run Orthofinder we need to organize all of our .faa files into one folder and run it on that folder, so use the mv command to move the files into a new folder. I need to take all the .faa files from bbqs433, bhqs69, bbqs395, and bhqs155.

Use this command to run Orthofinder:

`orthofinder -f ./ -X`

Prof. Noh's interpretation of this command: "In this command, -f specifies where my protein fasta files are (./ means here because I was in the same directory as these files) and -X specifies that I don’t want orthofinder to add species names to my gene ids because I already did this when running prokka"

We are just going to look at Prof. Noh's results here. There are many results files in the new directory, and you can read a bit more about Orthofinder and its results here: https://github.com/davidemms/OrthoFinder#what-orthofinder-provides

To find the overall statistics, look within the overall statistics results tsv file:

`cat Comparative_Genomics_Statistics/Statistics_Overall.tsv`

To find the tree, open the following file and paste its contents into iTOL (see the previous lab journal for the link):

`cat Species_Tree/SpeciesTree_rooted_node_labels.txt`

To see the file that describes the orthologs of genes shared by more than one of our genomes, look at the following document:

`head Orthogroups/Orthogroups.tsv`

Prof. Noh on this file: "This file lists out all the genes from each genome that belong to each detected orthogroup (first column). The orthogroups are listed from largest to smallest, so if you check out the tail of this same file you will find genes that are shared by only 2 genomes."

She also suggests looking at the results in a spreadsheet in order to see the set-type comparisons. In the spreadsheet you can look at orthologs not in the outgroup, orthologs in it, and so on; really, you can apply multiple filters. Professor Noh created a filter so she could see orthogroups of genes that are not in the outgroup, and also not in any of the P. hayleyella genomes. In other words, they are “(blank)” according to Google Sheets. There were 453 groups of size 3+. Some of these include gene duplication events, so they are not actually shared orthogroups across all three P. bonniea genomes. If we want to compare the two Paraburkholderia species to understand genomic factors that differentiate them, this and the equivalent P. hayleyella-shared orthogroups is where Prof. Noh suggests we would start.

You can also achieve this without using a spreadsheet, using the following command and the file `Orthogroups/Orthogroups.GeneCount.tsv`:

`awk '$2=="0" && $6=="0" && $7=="0" && $8=="0" && $9>=3' Orthogroups/Orthogroups.GeneCount.tsv | wc -l`

This command asks to see lines where the second column (the outgroup) and the 6th-8th columns (P. hayleyella) are 0 (no genes are included in the orthogroup), and the total gene count (column 9) is 3 or larger.
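As a hypothetical next step (not part of the lab guide), the same filter can print the orthogroup IDs instead of just counting them, which makes it easier to pull the corresponding fasta files later; this assumes the orthogroup ID is the first column of Orthogroups.GeneCount.tsv.

```bash
# Hypothetical variation: list the qualifying orthogroup IDs (column 1)
# instead of counting lines, so they can be looked up in Orthogroup_Sequences/.
awk '$2=="0" && $6=="0" && $7=="0" && $8=="0" && $9>=3 {print $1}' Orthogroups/Orthogroups.GeneCount.tsv
```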
For each of the orthogroups, you can find fasta files for the genes included, organized by orthogroup id, here: `Orthogroup_Sequences`

We can also see which genes are unique to individual genomes here: `Orthogroups/Orthogroups_UnassignedGenes.tsv`

Prof. Noh's comments: "One of the things that make biology both cool but also difficult is that there are often a few different ways to achieve the same goal. Think of convergent evolution, where the overarching phenotype may be the same but the underlying genetics can be distinct. This file can point you to genes that are unique to individual genomes, that may do something distinct but may also do something similar while appearing very different."

After we have the orthogroups and the genes of interest to look at, we can go back and check out their functions (we do this by assigning COGs, see above) and look at the literature on those genes. Alternatively, we can have a few specific types of genes we are looking for ahead of time and use a combination of custom BLAST with Orthofinder results to see whether those genes exist in some but not all of our genomes.

## Exercise 3. Make and use a custom BLAST database

We'll make local custom nucleotide and protein databases to search against. This may be helpful for seeing the similarities across two taxa. This is the first step to doing reciprocal BLAST in order to identify bidirectional/reciprocal best hits (BBH/RBH) between a small number of genomes.

1. Make your own directory to store BLAST databases in your bi278 home. (These examples are all executed from my bi278 home; I called my directory myblastdb.)
2. In order to search against this new database, you will have to add this directory to your $BLASTDB. `export BLASTDB="/courses/bi278/Course_Materials/blastdb:/home2/tsbroa25/myblastdb"`
3. To make protein databases, modify and execute the following command on an amino acid multi-fasta file: `makeblastdb -in ~/lab_04/lab_04/bhqs69/PROKKA*.faa -dbtype prot -title bhqs69_prot -out myblastdb/bhqs69_prot -parse_seqids`
4. To make a nucleotide database, use the following: `makeblastdb -in ~/lab_04/lab_04/bbqs395/PROKKA*.ffn -dbtype nucl -title bbqs395_nucl -out myblastdb/bbqs395_nucl -parse_seqids`

When you make a new database you will end up with a series of files: *.phr, *.pin, *.psq and more for protein databases, and *.nhr, *.nin, *.nsq and more for nucleotide databases. Just like files, every database needs a different title.

Lastly, you can combine any BLAST databases of the same type (nucleotide or protein) made this way, using a command like the one below, to search against multiple databases at once:

`blastdb_aliastool -dbtype prot -title 4burk_prot -out myblastdb/4burk_prot -dblist "bb395_prot bb433_prot bh155_prot bh69_prot"`

This is the example from Prof. Noh, hence why there are four databases being combined.

Searching against databases: we want matches of tssH, a secretion system gene. Prof. Noh downloaded a sequence from UniProt (Universal Protein Resource), a database that contains both curated and electronically annotated proteins. Use the following command to run that blastp:

`blastp -query /courses/bi278/Course_Materials/lab_06/tssH_bpseudo.fasta -db bhqs69_prot -outfmt 6`
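A hypothetical extension (not part of the lab guide): the same query can be run against the combined database built with blastdb_aliastool, assuming you created the `4burk_prot` alias as in Prof. Noh's example, and the output format can be customized to include query coverage and the alignment coordinate columns discussed below.

```bash
# Hypothetical search against the combined protein database; the extra outfmt
# fields report coverage and alignment coordinates, which help spot split HSPs.
blastp -query /courses/bi278/Course_Materials/lab_06/tssH_bpseudo.fasta \
  -db 4burk_prot \
  -outfmt "6 qseqid sseqid pident evalue qcovs qstart qend sstart send"
```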
Sometimes you will get two HSPs that were not aligned during gapped realignment. You can tell this from the columns in the BLAST result that represent qstart, qend, sstart, and send.

Prof. Noh's comments on this: "These proteins must be just a little bit different compared to the query sequence that there is a section in the middle that doesn’t want to align between the query and the subject. You might recall seeing gaps like this in the MSA lab a few weeks ago"

### Questions for NCBI

Why are the COG databases no longer being regularly updated?