# BI278 Lab 5 - BLAST+
Goal: use command-line BLAST on Prokka annnotations
Steps:
1) Run commandline BLAST
2) Test diff BLAST flavors
3) Parse output of results
NOTE: -r before directory you're moving to helps copy everything in that directory
**Exercise. command-line BLAST**
*Why is command-line BLAST important?*
Command-line BLAST allows us to run a bunch of seqs through the BLAST algorithm through scripting rather than through a web-broswer interface!
***BLAST+ options that you can choose among depending on your search goal***
**blastp program**
blastp: traditional blastp to compare a protein query to a protein database
blastp-short: blastp optimized for queries shorter than 30 residues
blastp-fast: a faster version of blastp that uses a larger word size (6 vs. 2-3)
**blastn program**
blastn: traditional blastn requiring an exact match of 11
blastn-short: blastn program optimized for sequences shorter than 50 bases
megablast: traditional megablast used to find very similar (e.g.intraspecies or closely related species) sequence
dc-megablast: discontinuous megablast used to find more distant (e.g. interspecies) sequences
To run a BLAST search from bi278 --> need to tell BLAST+ where to look for your databases. To do that, you need to set an environmental variable, which will tell your Unix environment where to look for searchable databases. This command will help check this:
echo $BLASTDB
This tells BLAST+ where my databases are:
export BLASTDB="$BLASTDB:/courses/bi278/Course_Materials/blastdb"
Check whether it worked by doing the first command again:
echo $BLASTDB
NOTE: you have to run this export command every session or logging onto bi278 before using BLAST unless you modify your .bash_profile
We're working with 3 databases:
1) nt (for nucleotides)
2) swissprot (for proteins)
3) refseq_select_prot (for proteins)
^these databases are provided by NCBI: ftp://ftp.ncbi.nlm.nih.gov/blast/db/
REMEMBER: these databases have to be downloaded and processed or else BLAST+ will not be able to use them
**1.1 - Run Command Line BLASTP**
Check Prokka output & identify two files that respectively contain the nucleotide & protein sequences for all the genes that Prokka identified for your genome:
1) PROKKA_09272022.ffn (nucleotide)
2) PROKKA_09272022.faa (protein)
^both files should be and are in the FASTA format but will have diff extensions to distinguish between them
REMEMBER: fasta files can contain multiple sequences that each belong to a different gene etc. The name of each sequence is declared in a specific way, that includes the symbol “>”.
We can use this symbol to delimit (separate) records (sequences) in a multi-fasta file (a file that contains multiple fasta sequences). There are many ways you can do this, but here’s a way that uses the programming language awk:
awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PROKKA_inputfile
^This command:
- reads the file you designate at the end of the command by whatever you tell it to use as a record separator (RS=">")
- matches a particular pattern (/pattern/) you give it in each record, such as the word 'hypothetical'
- when a record contains that pattern, it then prints the symbol '>' followed by the record ($0)
- when it prints, it doesn't use an additional output record separator (ORS="")
Breakdown of command:
- The entire awk command is contained within single quotes, and anything to do with the output format is basically outside these single quotes.
- After the command, you specify the file you want awk to work on, and then you can either have everything printed to your screen or have it sent to a file (as above using the > operator).
- Typically before sending anything to a file I will always check what the output looks like using the | operator to pipe my command to head or tail:
some command (awk) | head
NOTE: this command depends on your FASTA headers being informative (they need to be more than just gene ids).
To see Prof. Noh's prokka annotation files for eith bbqs395, bbqs4333, or bhqs69, they're still in:
courses/bi278/Course_Materials/lab_04/
1. Write hypothetical proteins to their own fasta file. Do this for just the protein sequences.
awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PROKKA_09272022.faa > PROKKA_09272022.faa.hypotheticalprotein
1.5) Now, we want to subset these proteins & test diff BLASTs using just a few of them. This will essentially help you see how your software behaves before running your entire dataset thru potential commands that you need to modify to get the results you actually want.
Simplest way to do this:
Open up your file using a text editor, and copy and paste however many you want into a new file.
Command line way:
Use awk by taking a percentage of a file's records:
awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" hypothetical_inputfile
^changing proportion (0.02) changes output
- 0.003 gave me a subset of 3 hypothetical proteins
2) Take a subset of your hypothetical proteins (3 or less total) and send them to their own fasta file.
awk 'BEGIN {RS=">"; srand()} (rand()<=.003) {print ">"$0}' ORS="" PROKKA_09272022.faa.hypotheticalprotein > PROKKA_09272022.hypotheticalprotein.fasta
3) Run a protein blast (blastp). This is the basic protein (amino acid) query blast you should start w/ and then adjust as you go w/ optional flags:
blastp -task blastp-fast -num_threads 2 -db database_name -evalue 1e-6 -query faa_file
blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -query PROKKA_09272022.hypotheticalprotein.fasta
If good results then send results of searches to a file w/ the > operator so you can work with them.
blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -query PROKKA_09272022.hypotheticalprotein.fasta > PROKKA_09272022.hypotheticalprotein.fasta.blastp
Database: swissprot database is smaller, manually curated databased compared to reseq_select_prot. If match is present, swissprot can give a really good informative match. BUT since database is so small, you might not get a match at all. Use refseq_select_prot most of the time.
evalue threshold (-evalue 1e-6): determines stringency of search. If you don't get any matches, try removing this option. But when you look at your results, note that good e-values for BLAST searches are considered to be those that are 1e-6 or smaller (including 0). Basically, this means you can adjust your BLAST parameters to get atches, but not all matches are informative.
4. Now try adding to the search above one of those specifications for how you want your search output to look:
-outfmt "6 std ppos qcovs stitle sscinames staxid" - max_target_seqs 10
-outfmt 3 -num_descriptions 10 -num_alignments 10
So this becomes:
blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query faa_file(fasta) > subset_blastp_fmt6.txt
blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt 3 -num_descriptions 10 -num_alignments 10 -query faa_file(fasta) > subset_blastp_fmt3.txt
*** The -max_target_seqs, -num_descriptions, and -num_alignments options limit the results you see somewhere in the middle of the database search process. When you run BLAST for a project, Professor Noh DOESN'T recommend using these options because you will get different results of these numbers are too small.
Best to use either the defaults (250 or 500 depending on which option) and filter your results after the BLAST search, or use at least a larger number (e.g. 100+).
REMEMBER: Some BLAST searches can be quite slow so even when using your hypothetical sequence subset you'll probs have to wait a little for your command to finish running (or use Prof Noh's results instead).
NOTE: You can use & at the end to run a search in the background but DON'T run another BLAST search b/c you'll just "bog" down your server and speed won't improve.
*Why do BLAST searches take so long to finish?*
b/c of size of each database --> when BLAST is running, it's finding the best matches by searching through MILLIONS of potential match subjects in the database
Look at examples of output files in this week's folder named:
example*blastp*outfmt*txt
cd /courses/bi278/Course_Materials/lab_05
cat example_blastp_outfmt6.txt
cat example_blastp_outfmt3.txt
^Output format 6 is usually most helpful b/c it contains all of the following data in table format (equivalent to keyword "std"):
"qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore"
We asked BLAST to output a few more pieces of data in addition to these in the command above. You can check out other options for data fields to include in your output in Table 2 (in the lab5 handout).
You can also use format 6 results to get the amino acid or nucleotide sequences of the subjects you are matching to.
Say (ex) we're interested in finding all the subject sequences of one of the queries BO395_01177. The subject sequence ids that match this query can be found in the output format 6 results. We just need to get the ones that match the specific query of interest ($1=="BO395_01177").
5. Now, take the subject sequence ids from the second column ($2) and save them as a list:
awk '$1=="BO395_01177" {print $2}' example_blastp_outfmt6.txt | sort | uniq > example_blastp_BO395_01177.sseqids.txt
FOR ME:
awk '$1=="BB395_00122" {print $2}' subset_blastp_fmt6.txt | sort | uniq > subset_blastp_BB395_00122.sseqids.txt
6. Then, use the blastdbcmd to find the entries from my search target database that match the subject sequence ids:
blastdbcmd -db refseq_select_prot -entry_batch example_blastp_BO395_01177.sseqids.txt
FOR ME:
blastdbcmd -db refseq_select_prot -entry_batch subset_blastp_BB395_00122.sseqids.txt
**SKIP 1.2 - Run Command Line BLASTN**
This is a basic nucleotide BLAST search you should start with and adjust as you go.
blastn -task megablast -num_threads 2 -query ffn_file -db nt -evalue 1e-6
7. Run nucleotide BLAST search for on query seq that you got results back in blastp search above.
**1.3 - Parse the Output of BLAST Results**
You can use following command to pull out some key fields from the output format 6 file I generated using the modification (output the additional fields stitle and sscinames). Columns should be in this order:
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos qcovs stitle sscinames staxid
Let’s say I want these columns to end up in my reorganized final output file:
1. query sequence id
2. subject sequence id
3. full name of subject sequence
4. scientific name of subject sequence
5. percent positive scoring matches
6. query coverage
7. e-value
.
8. Run a command to retrieve these fields only:
awk -F "\t" '{print $1,$2,$15,$16,$13,$14,$11}' OFS="\t" example_blastp_outfmt6.txt
FOR ME:
awk -F "\t" '{print $1,$2,$15,$16,$13,$14,$11}' OFS="\t" subset_blastp_fmt6.txt
^This command basically takes the info from the output file and reorganizes it. Each of the columns (fields) of the table format at specified w/ dollar-signs ($) & numbers for each column.
9. Parse results from various BLAST searches you have been running to figure out your answers to the following questions.
NOTE: you will have at most 10 results for each query seq (unless you changed this option)
Q1: **Can't answer because we didn't do 1.2**
Q2:
The default values:
matrix - BLOSUM62
evalue - 10 (Expectation value (E) threshold for saving hits)
max_target_seqs - 500 (Max number of aligned seqs to keep)
num_descriptions - 500 (Number of database seqs to show one-line descriptions for)
num_alignments - 250 (Number of database seqs to show alignments for)
You would change the defaults in order to match the specificities of your query (higher threshold, etc).