# BI278 Lab 5 Notes
utilizing blast on PROKKA annotations
## blastp
blastp: traditional blastp to compare a protein query to a protein database
blastp-short: blastp optimized for queries shorter than 30 residues
blastp-fast: a faster version of blastp that uses sa larger word size (6 vs. 2-3)
## blastn
blastn: traditional blastn requiring an exact match of 11
blastn-short: blastn program optimized for sequences shorter than 50 bases
megablast: traditional megablast used to find very similar (e.g. intraspecious or closely related species) sequence
dc-megablast- discontinous megablast used to find more distant (e.g. interspeies) sequences.
To check if there is an environmental variable to tell the UNIX environment where to look for searchable databases use the following command:
`echo $BLASTDB`
If nothing shows up with this command, set an environmental variable utilizing the following command:
`export
BLASTDB="$BLASTDB:/courses/bi278/Course_Materials/blastdb"`
Use the echo command to check if this worked.
You will have to do this export command every session logging onto bi278 (before using BLAST).
For help, use the commands:
`blastn -help`
`blastp -help`
## 1.1 Run Command Line Blastp
Utilize the ">" symobol to seperate sequences in a multi-fasta file (a file that contains multiple fasta sequences) using the following command:
awk `BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}` ORS="" PROKKA_inputfile
to check the output of any given command utilize the command
`some command | head`
(I used this to show the results of the awk BEGIN command)
### 1. Writing hypothetical proteins to their own fasta file. (used only for protein sequences in this lab)
Utilize the command:
awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PROKKA_09242022.ffn > hypotheticalprotein
- "hypothetical protein"- could be any file name
- this file will be put in the directory you are currently working from
### 2. In order to take any percentage of a file's records utilize the command:
awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}'
ORS="" hypothetical_inputfile
Upon application of this command, I found issue in the usage of single quotes. If there is an error make sure you are using ', not `
- hypothetical_inputfile - was the PROKKA file in this case
- "0.002"- takes 0.2 percent of the input file
to send it to a seperate file, use
awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}'
ORS="" hypothetical_inputfile > newfile
- "newfile"- the file it is being sent to
- ">" command moves something to somewhere else
### 3. Running a protein blast. The following command is the basic protein query blast that should be used:
`blastp -task blastp-fast -num_threads 2 -db PATH/database_name -evalue 1e-6 -query newfile`
* The "PATH/database_name" for this lab was `/courses/bi278/Course_Materials/blastdb/refseq_select_prot`
- 1e-6: determines stringency of a search. If there are no matches, try removing this option. Good e-values for BLAST searches are within 1e-6.
- Initially, if you get no hits on a scan with 1e-b, take out "-evalue 1e-6"
OPTIONS FOR DATABASES:
swissprot (for proteins)- smaller but better
refseq_select_prot (for proteins)- larger database
nt (for nucleotides)
### 4. To add specifications to your search output use the commands:
`-outfmt "6 std ppos qcovs stitle sscinames staxid" - max_target_seqs 10`
`-outfmt 3 -num_descriptions 10 -num_alignments 10`
Example (applied to this lab):
`blastp -task blastp-fast -num_threads 2 -db
refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs
stitle sscinames staxid" -max_target_seqs 10 -query
faa_file > subset_blastp_fmt6.txt`
`blastp -task blastp-fast -num_threads 2 -db
refseq_select_prot -evalue 1e-6 -outfmt 3 -num_descriptions
10 -num_alignments 10 -query faa_file >
subset_blastp_fmt3.txt`
- max_target_seqs, -num_descriptions and -num_alignments options limit the results you see somewhere in the middle of the database search process. DON'T USE THESE FOR future PROJECTS USING BLAST
- use the default numbers: either 250 or 500
Helpful: BLAST searches may take awhile, you can use the "&" to run a search in the background, just not another BLAST search.
- Output formats: most helpful are outmft3 and outmft6.
- Output format 6 (outmft6) contains all of the following data in table format: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore.
- This data is very helpful!
Following a BLAST, find your queries by opening the file that was exported
- A query should like like this "BO395_0117898" or something similar
To find the subject sequence ids of one of your queries use the output format 6 results and specificy your query of interest.
- Can be done by $1=="YOUR QUERY (BO395...)"
### 5. Taking the subject sequence ids and saving them as a list (see above process to specify query):
Utilize the command:
```
awk '$1=="BO395_01177" {print $2}'
example_blastp_outfmt6.txt | sort | uniq >
example_blastp_BO395_01177.sseqids.txt
```
I believe this was all written in one line of code.
### 6. Use the blastdbcmd to find the entries from a search target database that match the subject seqeunce ids
Utilize the command:
```
blastdbcmd -db refseq_select_prot -entry_batch
example_blastp_BO395_01177.sseqids.txt
```
## 1.2 Run Command Line Blast- skipped during Lab 5
The basic nucleotide blast search that should be started with and adjusted as you go:
`blastn -task megablast -num_threads 2 -query ffn_file -db
nt -evalue 1e-6`
- "ffn_file": the other PROKKA file gotten from previous labs
### 7. Run a nucleotide BLAST search for one query sentence that you got results back from blastp search above. (you must use nt database for blastn)
- this will take a long time, and will hang.
## 1.3 Parse the Output of BLAST results
The format specifiers for BLAST:

Within the output format 6 file, columns should be in the following order:
qseqid sseqid pident length mismatch gapopen
qstart qend sstart send evalue bitscore ppos
qcovs stitle sscinames staxid
Given this information, if you wanted these columns to end up in the reorganized final output file:
1. query sequence id
2. subject sequence id
3. full name of subject sequence
4. scientific name of subject sequence
5. percent positive scoring matches
6. query coverage
7. e-value
### 8. you would use this command to retrieve only these fields:
`awk -F "\t" '{print $1,$2,$15,$16,$13,$14,$11}' OFS="\t"
example_blastp_outfmt6.txt`
- "example_blastp_outfmt6.txt"- variable dependent on what you used.
- This command will reorganize information from the output file.
- Each of the columns of the table are specified using a dollar sign and a number.