# Lab #5: BLAST
we will use command line BLAST on our typical prokka annotations (Which we learned last lab)
The steps we will follow today are:
Run command-line BLAST
Test different BLAST flavors
Parse the output of BLAST searches
As always first start by logging onto the terminal using `ssh tsbroa25@bi278`
## Command-line BLAST
We will be using command-line BLAST (a.k.a. BLAST+).
Command-line BLAST allows us to input many sequences into the BLAST algorithm through scripting rather than using a web interface such as NCBI or DNA subway.
DIRECTLY FROM PROFESSOR NOH's LAB 5 GUIDE:
Table 1. BLAST+ opions you can choose among depending on your search goal
Prior to any BLAST Searches you need a Data Base
##### Setting the Database
1. set an environmental variable $BLASTDB
Run the Following Command
`echo $BLASTDB`
if this turns up nothing that means that the environmental variable is empty. To fix this you need to set an evironmental variable.
Note: every new shh sesion requires you to reset the environmental variable.
2. Tell BLAST where these databases are:
to locate a data base use the following command `export BLASTDB="$BLASTDB:/courses/bi278/Course_Materials/blastdb"`
To verify this worked reenter `echo $BLASTDB`
For this lab we are working with two protien databases and two nucleotide databases from NCBI.
Here is a link to all of the BLAST databases provided by NCBI: https://ftp.ncbi.nlm.nih.gov/blast/db/
The README file in the blastdb directory contains information about the databases.
3. Gain a better understanding of how BLAST+ works
Type in the following commands.
`blastn -help
blastp -help`
#### 1.1 Running blastp
Following is from Professor NOH:
Before we get started we need files to work with. To Do this we will use the Prokka Files that contain the nucleotide and protein sequences for all the genes that Prokka identified for your genome in the last lab. Both files should be in FASTA format but will have different extensions to distinguish between nucleotide and protein. (Beware that you do NOT want the other fasta file that contains your genome sequence)
Remember that fasta files can contain multiple sequences that each belong to a different gene etc. The name of each sequence is declared in a specific way, that includes the symbol >. We can use this symbol to delimit (separate) records (sequences) in a multi-fasta file (a file that contains multiple fasta sequences). There are many ways you can do this, but here’s a way that uses the programming language awk:
`awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PATH/PROKKA_FILE`
So for my specific situation I am using this path way `/home2/tsbroa25/lab_04/lab_04/bhqs69/PROKKA_09242022.faa`
and this command
`BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" /home2/tsbroa25/lab_04/lab_04/bhqs69/PROKKA_09242022.faa`
This command works by
- Reading the file you designate at the end of the command using whatever you said was the record separator (RS=">")
- Match a particular pattern (/pattern/) in each record, so for us it is "hypothetical protein"
- When a record contains hypothetical protien it then prints the symbol > followed by a record with the pattern of $0 (I'm not sure what that means)
- then the output record separator ORS helps with printing?
The `awk` command is in single quotes, the output format `ORS` is outside the quotes. After the awk command you need to specify the input file for it to use, and then have it either printed to your screen or sent to a file by using the `>` operator such as above. Typically prof Noh suggests using the `|` operator to pipe your command to `head` or `tail` before sending it to a file (I assume this works as a check)
for example: `SOMECOMMAND | head`
so to see the hypothetical protiens use
```
awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" /home2/tsbroa25/lab_04/lab_04/bhqs69/PROKKA_09242022.faa | head
```
###### Modifying the awk Command to learn how it functions.
try printing out all the genes that contain the word Replication
`awk 'BEGIN {RS=">"} /replication/ {print ">"$0}' ORS="" /home2/tsbroa25/lab_04/lab_04/bhqs69/PROKKA_09242022.faa | head`
###### Now Start with Rewriting the hypothetical protiens to their own hypothetical protien file.
You can do this by using the following command
`awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" /home2/tsbroa25/lab_04/lab_04/bhqs69/PROKKA_09242022.faa > /home2/tsbroa25/lab_04/lab_04/PROKKA_09242022_hyp_prot.faa`
Next, we want to take a subset of these proteins and test different BLASTs using just a few of them. Running a few will allows us to see the software before running the entire dataset.
Professor Noh suggets opening your file and using a text editor to subset the data set. You can just copy and paste however many I want into the new file. But You can also use the following awk command to do it as well. This command takes a percentage of the file's records for example, 0.2% = 0.002:
`awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" /home2/tsbroa25/lab_04/lab_04/PROKKA_09242022_hyp_prot.faa`
Notice how only the middle of the awk command changes
now send this subset to a new file using the following command
`awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" /home2/tsbroa25/lab_04/lab_04/PROKKA_09242022_hyp_prot.faa > /home2/tsbroa25/lab_04/lab_04/PROKKA_09242022_sub.faa`
finally runa protien blast on this subset
This is a basic amino acid query that we can adjust further using optional flags (I choose to use the refseq_select_prot database)
` blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -query PROKKA_09242022_sub.faa`
Note: The dashes - belong next to each option name, and each option should be seperated by a space.
There are multiple output formats you can use for example
```
-outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10
or
-outfmt 3 -num_descriptions 10 -num_alignments 10
```
to use these you would need `>` and a file for it to write too.
so I used the command
```
blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query PROKKA_09242022_sub.faa > PROKKA_09242022_sub_exp1.faa
```
But I could've also used
```
blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt 3 -num_descriptions 10 -num_alignments 10 -query PROKKA_09242022_sub.faa > PROKKA_09242022_sub_exp2.faa
```
The `-max_target_seqs`, `-num_descriptions`, and `-num_alignments options` can all be used to limit the results you see in the middle of the database search process. DO NOT USE THESE FOR YOUR PROJECT because you will get different results if these numbers are too small.
Instead use the defaults (250 or 500 depending on which option) and filter your results after the BLAST search. At minimum you should at least use a larger number (typiclly in the hundreds).
Blast con sometimes be slow as it is attempting to run through all of the subjects withing a data base which can be as high as 63 million or more subjects. (swissport is smaller at 480,000 subjects)
If a blast is taking too long you can use `&` at the end of a comma to run another command but DO NOT run two blasts symltaenously since the server will get too overloaded.
You should also see examples of out put in this weeks file labeled example*blastp*outfmt*txt
every output format gives overlapping information in different formats but professor Noh suggests 3 and 6
Output format 6 often ends up being the most useful for her because it contains the following data: "qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" all of it in table format.
You can also use format 6 results to get the amino acid or nucleotide sequences of the subjects you are matching to.
You can edit the data oputs by using multiple options, see the tablke below for refrence.



If you're interested in finding all of the subject sequences of one of your queries (BH69_01498), take the subject sequence ids from output 6 and save them as a list.
Use the following command to do that using the output file from your blast
`awk '$1=="BH69_01498" {print $2}' example_blastp_outfmt6.txt | sort | uniq > example_blastp_BH69_01498.sseqids.txt`
Then you can take that list and use `blastdbcmd` to find the entries from the search target database that match the subject sequence ids:
`blastdbcmd -db refseq_select_prot -entry_batch example_blastp_BH69_01498.sseqids.txt`
### 1.2 Running blastn
This is a basic nulceotide blast search to start with and then adjust with options. from Prof Noh: " if you are interested in protein coding genes you are unlikely to need to use nucleotide BLAST. But there are other genomic features that do not code for proteins, so in those cases you should use blastn rather than blastp."
You need to run this on a fnn file not a faa so to do this I isolated the hypothetical protiens from the fnn file and
Use the following basic command to run a blastn
`blastn -task megablast -num_threads 2 -db nt_prok -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query PROKKA_09242022_hyp_prot.fnn > PROKKA_09242022_blastn_nt_prok.fnn`
We have two nucleotide databases we can use env_nt (93 million sequences) and nt_prok (8 million sequences)
I then also ran a blastn on the env_nt database
`blastn -task megablast -num_threads 2 -db env_nt -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query PROKKA_09242022_hyp_prot.fnn > PROKKA_09242022_blastn_env_nt.fnn`
This took very long to run. However, I could've founnd the ids from my faa files and then using those sequence ID's matched and isolated the same sequences in the fnn files. using `samtools faidx`
`samtools faidx ~/lab_04/lab_04/bhqs69/PROKKA_*faa`
env_nt contains nucleotide sequences from environmental samples so its good at finding viruses and other similarly hard to detect organisms
nt_prok contains prokaryotic nucleotide sequences which is helpful for, well, prokaryotes.
### 1.3 Parse the BLAST+ Results
Blast formating is a tricky thing to get right as there is so much variance in how it can get done.
Prof Noh "BLAST+ has the unfortunate characteristic of truncating subject names. If you take a look at the top of you output format 3 files, this will be most obvious. I made BLAST+ output the additional fields stitle and sscinames and because of this."
Profesor Noh's columns in her example are in the following order "qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs stitle sscinames staxid"
Each of these topics corresponds to a different $ and number
query sequence id ($1)
subject sequence id ($2)
full name of subject sequence ($14)
scientific name of subject sequence ($15)
e-value ($11)
bitscore ($12)
query coverage ($13)
and to get these into the final output table use the following command
`awk -F "\t" '{print $1,$2,$14,$15,$11,$12,$13}' OFS="\t" PROKKA_09242022_blastn_nt_prok.fnn`
This command takes the information from the output file and reorganizes it aacording to which topics you want present.
Typically, outputs in genomics are tables as they are easy for future analysis, and awk is a useful command for oragnizing these tables.
Pandas library in Python can also serve as a tool to work for tables.
###### Q1. What are the default values of the options matrix, evalue, max_target_seqs, num_descriptions, num_alignments for each type of BLAST? At what point in the BLAST search would these filters be applied, and under what circumstances would you consider using a value other than the default? (Use each command with the -help option or check out the BLAST+ online manual appendix; between the two you can see all the defaults when you do this)
blastn:
matrix: No Default (https://www.arabidopsis.org/Blast/BLASToptions.jsp)
evalue: 10
max_target_seqs: 500
num_descriptions: 500
num_alignments: 250
blastp:
matrix: Blosum62
evalue: 10
max_target_seqs: 500
num_descriptions: 500
num_alignments: 250
###### Q2. I encourage you to run blastp searches with a non-hypothetical protein that looks interesting and check out the results you get by changing some of the parameters above. Report back on what you find.
I found that the results were far less varied. This makes sense as, if a protien is known, there has already been an identified link between that protien and an origin species.