# Lab 05
## Exercise: command-line BLAST
### Pointing to Local BLAST database
```
export BLASTDB="$BLASTDB:/courses/bi278/Course_Materials/blastdb"
echo $BLASTDB
```
The code above points the BLASTDB package to where the database is located.
The three databases we will be working with are nt (nucleotides), swissprot or refseq_select_prot (both for proteins). BLAST databases always need to be downloaded and processed prior to use.
## Exercise 1.1: Run command line BLASTP
```
awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}'
ORS="" PROKKA_09242022.ffn > PROKKA_hypothetical.fasta
```
The command above delimits separate records into a fasta file that only contains the hypothetical proteins from the bbqs433 PROKKA annotation.
To take a subset of these proteins and test different BLASTs using a specific amount, we can run the following command.
```
awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}'
ORS="" PROKKA_hypothetical.fasta
```
This will take 0.2% of the file's record.

```
awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}'
ORS="" PROKKA_hypothetical.fasta > PROKKA_hyp_sample.fasta
```
This will take 3 hypothetical proteins and send them to their own fasta file.
```
blastp -task blastp-fast -num_threads 2 -db swissprot -evalue 1e-6 -query PROKKA_hyp_sample.fasta
```
The three options provided here are number of threads, which database to search, evalue, and the faa file to be queried. We use the swissprot database which is a much smaller curated database compared to refseq_select_prot which we will be using the majority of the time. The evalue threshold determines the sensitivity of the search. 1e-6 is considered to be a "good" BLAST search and the smaller the value is, the more stringent the search is.
In order to specify how the output of the search should look, here are two potential commands that will yield different results.
```
blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -query PROKKA_hyp_sample.fasta -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 > subset_blastp_fmt6.txt
blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -query PROKKA_hyp_sample.fasta -outfmt 3 -num_descriptions 10 -num_alignments 10 > subset_blastp_fmt3.txt
```
The first format contains all of the data for qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend sstart, send evalue, and bitscore. The format six can also return the amino acid or nucletodie sequences of the matched subjects.
*In order to find any hits, I had to sample .1% of the protein sequence fasta file as well as removing the argument for the evalue. *
Using the example output in format 6, we can search for all subject sequences matching a specific query.
```
awk '$1 == "BO395_01177" {print $2}' example_blastp_outfmt6.txt | sort | uniq > example_blastp_BO395_01177.sseqids.txt
```
This will take the subject sequence ids which are located in the second column of the output 6 format.
```
blastdbcmd -db refseq_select_prot -entry_batch example_blastp_BO395_01177.sseqids.txt
```
This will use blastdbcmd will find all entries from the targeted database that matches the sequence ids.
```
awk -F "\t" '{print $1,$2,$15,$16,$13,$14,$11}' OFS="\t" subset_blastp_fmt6.txt
```
The code above returns only the query sequence id, subject sequence id, full name of subject sequence, scientific name of subject sequence, percent positive scoring matches, query coverage, and e-value. The awk command can pull out specific columns in the specified order so referencing the blast-p documentation files and pulling out only the relevant data can help parse the results of BLAST searches.
Q2: **What are the default values of the options matrix, evalue,max_target_seqs, num_descriptions, num_alignments for each type of BLAST? Under what circumstances would you use those defaults vs. when would you consider changing them? (Use each command with the -help option or check out the BLAST+ online manual appendix; between the two you can see all the
defaults when you do this)**
For the parameters max_target_seqs and num_descriptions, only one or the other can be used. For outfmt values greater than or equal to four, num_descriptions and num_alignments are used while outfmt values less than four uses max_target_seqs. They both specify the number of hits to be returned.
The evalue corresponds to the how stringent the search will be. An evalue of 1e-6 is considered a "good" hit and anything closer to zero to this value would be considered even better. Higher values of evalue correspond to more sensitive searches meaning it is easier to find a hit but the validity is more questionable. If there is difficulty finding a hit in a protein sequence, raising the evalue may be recommended.
The matrix corresponds to the substitution matrix that assigns scores to any alignment of any two possible pairs of residues. BLAST-P uses blosum62 which is calculated from comparisons of sequences that have a pairwise identity of no more than 62%. You would adjust the threshold value higher if the sequences being compared were more similar to each other. BLAST-N does not use a substitution matrix.
| Parameter Name | BLAST-P default | BLAST-N default |
| ---------------- | --------------- | --------------- |
| num_alignments | 15 | 15 |
| num_descriptions | 25 | 25 |
| max_target_seqs | 500 | 500 |
| evalue | 10 | 10 |
| Matrix | blosum62 | no default |