BI278 Lab noteboook 5

# BI278 Lab noteboook 5 ### Command Line Blast Command line BLAST - BLAST+: * `blastp` - BLAST for proteins * `blastn` - BLAST for nucleotides Set an environment variable to tell Unix where to look for searchable databases * Check the environment - `echo $BLASTDB` * Did not return anything * Tell BLAST wehre the daabases are: `export BLASTDB= "$BLASTDB:/courses/bi278/Course_Materials/blastdb"` * Check if it worked `echo $BLASTDB` * Returns `:/courses/bi278/Course_Materials/blastdb ` #### Databases Working with: * `nt` for nucleotides * `swissprot` for proteins * `refseq_select_prot` for proteins (https://www.ncbi.nlm.nih.gov/refseq/refseq_select/) * Provided by NCBI: ftp://ftp.ncbi.nlm.nih.gov/blast/db/ * The databases have to be downladed and processed `blastn -help` and `blastp - help` will show how BLAST+ works BLAST+ online user manual appendix: https://www.ncbi.nlm.nih.gov/books/NBK279684/ ### Run command line BLASTP FId the two files that contain the nucleotide and protein sequences for the geens that Prokka identified for genome from lab_04: In the lab_04 directory(copied form courses): * `PROKKA_09242022.ffn` - nucleotide sequences * `PROKKA_09242022.faa` - protein sequences `mkdir lab_05` ``` awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" lab_04/PROKKA_09242022.faa | head ``` * This reads the PROKKA file by the record separator `RS=">"` * Match a particular pattern (`/pattern/`) you give it in each record - ‘hypothetical’ * When a record contains that pattern, it then prints the symbol ‘>’ followed by the record ($0) * When it prints, it doesn’t use an additional output record separator (ORS="") * `| head` prints only the top few lines * Write the 'hypothetical proteins' from the protein sequences file into a file ``` awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" lab_04/PROKKA_09242022.faa > lab_05/bh69_hypothetical_protein.faa ``` **Subset data** `cd lab_05` Subset a percentage of the file's record using `awk` Print out a subset of 0.2%: ``` awk 'BEGIN {RS=">"; srand()} (rand()<=0.002) {print ">"$0}' ORS="" bh69_hypothetical_protein.faa ``` Check if 0.002% is printed out by using the grep ">" command ``` grep ">" bh69_hypothetical_protein.faa | wc -l awk 'BEGIN {RS=">"; srand()} (rand()<=0.002) {print ">"$0}' ORS="" bh69_hypothetical_protein.faa | grep ">" | wc -l ``` Subset = 3/1214 = 0.0024 Take a subset of 3 or less hypothetical proteins (use 0.005) and send it to a file ``` awk 'BEGIN {RS=">"; srand()} (rand()<=0.005) {print ">"$0}' ORS="" bh69_hypothetical_protein.faa > bh69_hypothetical_sub.faa ``` Has 8 hypothetical proteins in it - to be safe **Run a protein blast** Use `swissprot` for now but the database is so small you may not get a match at all, or you won’t get comprehensive matches, so use refseq_select_prot most of the time. ``` blastp -task blastp-fast -num_threads 2 -db swissprot -evalue 1e-6 -query bh69_hypothetical_sub.faa ``` (The evalue threshold (-evalue 1e-6) determines the stringency of your search. If you don't get any matches, try removing this option) Add the specifications to change how you want the search output to look like: `-outfmt "6 std ppos qcovs stitle sscinames staxid" - max_target_seqs 10` `-outfmt 3 -num_descriptions 10 -num_alignments 10` They limit the results you see somewhere in the middle of the database search process. ``` blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query bh69_hypothetical_sub.faa > subset_blastp_fmt3.txt ``` Limits maximum target sequences to 10 Can also do something like: ``` blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt 3 -num_descriptions 10 -num_alignments 10 -query faa_file > subset_blastp_fmt3.txt ``` There can be many formats used `fmt3` futs it in format 3 and `fmt6` puts it in format 6 **Find subject sequences of a query** * Choose a query: `BH69_01005` (`$1=="BH69_01005"`) Subject sequence ids for the query can be found in the second column ($2) in the output format 6 * Save the subjects sequences ids as a list: ``` awk '$1=="BH69_01005" {print $2}' subset_blastp_fmt6.txt | sort | uniq > subset_blastp_BH69_01005.sseqids.txt ``` * Use the `blastdbcmd` to find the entries from the search target database that match the subject sequence ids ``` blastdbcmd -db refseq_select_prot -entry_batch subset_blastp_BH69_01005.sseqids.txt ``` ### Run command line BLASTN Run `blastn` on the nucleotide ffn file ``` blastn -task megablast -num_threads 2 -query ../lab_04/*.ffn -db nt -evalue 1e-6 ``` Run nucleotide BLAST search for one query sequence that you got from `blastp` search ### Parse output of BLAST results Using the outut format 6 file from the `blastp` The columns are in the order: `qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos qcovs stitle sscinames staxid` Reorganize the columns in an output file, retrieving only specif columns (`$n` for column n) ``` awk -F "\t" '{print $1,$2,$15,$16,$13,$14,$11}' OFS="\t" subset_blastp_fmt6.txt ``` **`blastp` options** `matrix` - The scoring matrix name (default = BLOSUM62) `evalue` - expectation value threshold for saving hits (deafult = 10) `max_target_seqs` - Maximum number of aligned sequences to keep (default = 500) `num_descriptions` - Number of database sequences to show one-line descriptions for (default = 500) `num_alignments` - Number of database sequences to show alignments for (default = 250) **`blastn` default values** `evalue` - deafult = 10 `max_target_seqs` - default = 500 `num_descriptions` - default = 500 `num_alignments` - default = 250 * It will be useful to change these parameters depending on whether the default gives good results or not. * One might want to reduce the evalue to find hits that are much less likely to be by chance. * If we want a quick search to get a general sense and test the BLAST and can change the values to see what is appropriate or choose smaller values to speed up the process and see if it working. * Depending on the sequences we are looking at, we can change the scoring matrix to account fpr evolutionary time. * It willbe useful to change these options depending on the specifics of the search we are doing