# BI278 Lab noteboook 5
### Command Line Blast
Command line BLAST - BLAST+:
* `blastp` - BLAST for proteins
* `blastn` - BLAST for nucleotides
Set an environment variable to tell Unix where to look for searchable databases
* Check the environment - `echo $BLASTDB`
* Did not return anything
* Tell BLAST wehre the daabases are:
`export BLASTDB= "$BLASTDB:/courses/bi278/Course_Materials/blastdb"`
* Check if it worked `echo $BLASTDB`
* Returns `:/courses/bi278/Course_Materials/blastdb
`
#### Databases
Working with:
* `nt` for nucleotides
* `swissprot` for proteins
* `refseq_select_prot` for proteins (https://www.ncbi.nlm.nih.gov/refseq/refseq_select/)
* Provided by NCBI: ftp://ftp.ncbi.nlm.nih.gov/blast/db/
* The databases have to be downladed and processed
`blastn -help` and `blastp - help` will show how BLAST+ works
BLAST+ online user manual appendix: https://www.ncbi.nlm.nih.gov/books/NBK279684/
### Run command line BLASTP
FId the two files that contain the nucleotide and protein sequences for the geens that Prokka identified for genome from lab_04:
In the lab_04 directory(copied form courses):
* `PROKKA_09242022.ffn` - nucleotide sequences
* `PROKKA_09242022.faa` - protein sequences
`mkdir lab_05`
```
awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" lab_04/PROKKA_09242022.faa | head
```
* This reads the PROKKA file by the record separator `RS=">"`
* Match a particular pattern (`/pattern/`) you give it in each record - ‘hypothetical’
* When a record contains that pattern, it then prints the symbol ‘>’ followed by the record ($0)
* When it prints, it doesn’t use an additional output record separator (ORS="")
* `| head` prints only the top few lines
* Write the 'hypothetical proteins' from the protein sequences file into a file
```
awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" lab_04/PROKKA_09242022.faa > lab_05/bh69_hypothetical_protein.faa
```
**Subset data**
`cd lab_05`
Subset a percentage of the file's record using `awk`
Print out a subset of 0.2%:
```
awk 'BEGIN {RS=">"; srand()} (rand()<=0.002) {print ">"$0}' ORS="" bh69_hypothetical_protein.faa
```
Check if 0.002% is printed out by using the grep ">" command
```
grep ">" bh69_hypothetical_protein.faa | wc -l
awk 'BEGIN {RS=">"; srand()} (rand()<=0.002) {print ">"$0}' ORS="" bh69_hypothetical_protein.faa | grep ">" | wc -l
```
Subset = 3/1214 = 0.0024
Take a subset of 3 or less hypothetical proteins (use 0.005) and send it to a file
```
awk 'BEGIN {RS=">"; srand()} (rand()<=0.005) {print ">"$0}' ORS="" bh69_hypothetical_protein.faa > bh69_hypothetical_sub.faa
```
Has 8 hypothetical proteins in it - to be safe
**Run a protein blast**
Use `swissprot` for now but the database is so small you may not get a match at all, or you won’t get comprehensive matches, so use refseq_select_prot most of the time.
```
blastp -task blastp-fast -num_threads 2 -db swissprot -evalue 1e-6 -query bh69_hypothetical_sub.faa
```
(The evalue threshold (-evalue 1e-6) determines the stringency of your search. If you don't get any matches, try removing this option)
Add the specifications to change how you want the search output to look like:
`-outfmt "6 std ppos qcovs stitle sscinames staxid" -
max_target_seqs 10`
`-outfmt 3 -num_descriptions 10 -num_alignments 10`
They limit the results you see somewhere in the middle of the database search process.
```
blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query bh69_hypothetical_sub.faa > subset_blastp_fmt3.txt
```
Limits maximum target sequences to 10
Can also do something like:
```
blastp -task blastp-fast -num_threads 2 -db
refseq_select_prot -evalue 1e-6 -outfmt 3 -num_descriptions
10 -num_alignments 10 -query faa_file >
subset_blastp_fmt3.txt
```
There can be many formats used `fmt3` futs it in format 3 and `fmt6` puts it in format 6
**Find subject sequences of a query**
* Choose a query: `BH69_01005` (`$1=="BH69_01005"`)
Subject sequence ids for the query can be found in the second column ($2) in the output format 6
* Save the subjects sequences ids as a list:
```
awk '$1=="BH69_01005" {print $2}' subset_blastp_fmt6.txt | sort | uniq > subset_blastp_BH69_01005.sseqids.txt
```
* Use the `blastdbcmd` to find the entries from the search target database that match the subject sequence ids
```
blastdbcmd -db refseq_select_prot -entry_batch subset_blastp_BH69_01005.sseqids.txt
```
### Run command line BLASTN
Run `blastn` on the nucleotide ffn file
```
blastn -task megablast -num_threads 2 -query ../lab_04/*.ffn -db nt -evalue 1e-6
```
Run nucleotide BLAST search for one query sequence that you got from `blastp` search
### Parse output of BLAST results
Using the outut format 6 file from the `blastp`
The columns are in the order: `qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos qcovs stitle sscinames staxid`
Reorganize the columns in an output file, retrieving only specif columns (`$n` for column n)
```
awk -F "\t" '{print $1,$2,$15,$16,$13,$14,$11}' OFS="\t" subset_blastp_fmt6.txt
```
**`blastp` options**
`matrix` - The scoring matrix name (default = BLOSUM62)
`evalue` - expectation value threshold for saving hits (deafult = 10)
`max_target_seqs` - Maximum number of aligned sequences to keep (default = 500)
`num_descriptions` - Number of database sequences to show one-line descriptions for (default = 500)
`num_alignments` - Number of database sequences to show alignments for (default = 250)
**`blastn` default values**
`evalue` - deafult = 10
`max_target_seqs` - default = 500
`num_descriptions` - default = 500
`num_alignments` - default = 250
* It will be useful to change these parameters depending on whether the default gives good results or not.
* One might want to reduce the evalue to find hits that are much less likely to be by chance.
* If we want a quick search to get a general sense and test the BLAST and can change the values to see what is appropriate or choose smaller values to speed up the process and see if it working.
* Depending on the sequences we are looking at, we can change the scoring matrix to account fpr evolutionary time.
* It willbe useful to change these options depending on the specifics of the search we are doing