# BI278 Lab 5
#### command-line BLAST
#### Olivia Schirle
#### 10/18/2022
```
ssh oeschi23@bi278 # Connect to course environment
mkdir lab_05 # Folder for this week's lab material
```
## Set Database Location
```
echo $BLASTDB
export BLASTDB="$BLASTDB:/courses/bi278/Course_Materials/blastdb" # Set an enviroment variable to tell Unix environment where to look for searchable databases
echo $BLASTDB # Check that this variable references the location of the databases
```
## 1.1 Run Command Line BLASTP
Prokka output files containing nucleotide and protein sequences: lab_04/PROKKA_09242022.faa,
/courses/bi278/Course_Materials/lab_04/bbqs433/PROKKA_09242022.ffn
.fna = nucleotide
.faa = amino acid
```
# Copy Prokka files into this weeks lab folder
cp lab_04/PROKKA_09242022.faa ./lab_05
cp /courses/bi278/Course_Materials/lab_04/bbqs433/PROKKA_09242022.ffn ./lab_05
cd lab_05
```
Using awk command to separate sequences in a multi-fasta file:
```
awk 'BEGIN {RS=">"} /pattern/ {print ">"$0}'
ORS="" PROKKA_inputfile
```
This command reads the file designated at the end by the record separator (RS=">"). It looks for matches of the pattern (ie. the word 'hypothetical') and when a record contains that pattern, it prints '>' followed by the record. It doesn't use an additional output record separator (ORS="") when it prints.
The awk command is contained in single quotes and things to do with the output format are outside of the single quotes. After the command, specify the file for awk to work on. You can print everything to your screen or use " > output_file" or " | head" to send the output to a file or only print the first few files to your screen.
### 1. Write the hypothetical proteins to their own fasta file
```
awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PROKKA_09242022.faa > PROKKA_09242022_hypothetical_protein.faa
```
### 2. Take a subset of the hypothetical proteins (3 or less) and send them to their own fasta file
```
awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" PROKKA_09242022_hypothetical_protein.faa > PROKKA_09242022_hypothetical_protein_subset.faa # Takes a percentage of a file's records (0.2% in this case) and assigns them to a new file
# Check the number of hypothetical proteins that are in the new file
awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PROKKA_09242022_hypothetical_protein_subset.faa
```
There are 2 hypotehtical proteins in the subset.
### 3. Run a protein blast (blastp)
```
blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -query PROKKA_09242022_hypothetical_protein_subset.faa
blastp -task blastp-fast -num_threads 2 -db swissprot -evalue 1e-6 -query PROKKA_09242022_hypothetical_protein_subset.faa
```
I found multiple matches in both the refseq_select_prot and swissprot databases.
### 4. Add specifications to the output of the search
Output format 6:
```
blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query PROKKA_09242022_hypothetical_protein_subset.faa > subset_blastp_fmt6.txt
cat subset_blastp_fmt6.txt # Check the contents of the file
```
Output format 6 contains the followng data in table format: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend sstart, send evalue, bitscore.
Output format 3:
```
blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt 3 -num_descriptions 10 -num_alignments 10 -query PROKKA_09242022_hypothetical_protein_subset.faa > subset_blastp_fmt3.txt
```
Table of available BLAST outut specifiers available in lab 05 handout (p. 7).
### 5. Take the subject sequence ids from second column ($2) that match my specific query of interest ($1=="BH69_03580") and save them as a list
```
awk '$1=="BH69_03580" {print $2}' subset_blastp_fmt6.txt | sort | uniq > subset_blastp_BH69_03580.sseqids.txt
```
### 6. Find entries from search target database that match the subject sequence ids using blastdbcmd
```
blastdbcmd -db refseq_select_prot -entry_batch subset_blastp_BH69_03580.sseqids.txt
```
## 1.2 Run Command Line BLASTN
## 1.3 Parse the Output of BLAST Results
### 8. Reorganize the Output File
reorganize the output file with the following columns in the following order: 1. query sequence id, 2. subject sequence id, 3. full name of subject sequence, 4. scientific name of subject sequence, 5. percent positive scoring matches, 6. query coverage, and 7. e-value.
```
awk -F "\t" '{print $1,$2,$15,$16,$13,$14,$11}' OFS="\t" subset_blastp_fmt6.txt
```
Each of the column/fields are specified with dollar-signs($) and numbers for each column.
### 9. Parse results to answer the following questions
Q2. What are the default values of the options matrix, evalue,
max_target_seqs, num_descriptions, num_alignments for each type of
BLAST? Under what circumstances would you use those defaults vs. when would
you consider changing them?
A2. For blastn, the default option for
matrix is BLOSUM62,
evalue is 10,
max_target_seqs is 500,
num_descriptions is 500,
and num_alignments is 250.
For blastp, the default option for
matrix is BLOSUM62,
evalue is 10,
max_target_seqs is 500,
num_descriptions is 500,
and num_alignments is 250.
The default BLOSUM62 matrix will likely work well in many situations, but you may want to change the matrix if you want to look at more distantly related or more closely related sequences.
The evalue threshold determines the stringency of the search. You will likely want to change the evalue to a smaller number than the default to get informative matches since e-values 1e-6 and smaller are considered good e-values for BLAST searches.
For the max_target_seqs, num_descriptions, and num_alignments options, can be used in many cases, but if you want BLAST to return more or less sequences, you would want to adjust these values. They should typically all be kept above 100 to get good/consistent results.