BI278 Lab 5: BLAST

# BI278 Lab 5: BLAST Goal: use command line BLAST on prokka annotations We will be using BLAST which will allow us to run a bunch of sequences through the BLAST algorithm through scripting rather than through a web-browser interface BLAST+ options that you can choose among depending on search goal Program: blastp - blastp: traditional blastp to compare a protein query to a protein database - blastp–short: blastp optimized for queries shorter than 30 residues - blastp–fast: a faster version of blastp that uses a larger word size (6 vs 2-3) Program: blastn - blastn: a faster version of blastp that uses a larger word size - blastn–short: blastn program optimized for sequences shorter than 50 bases - megablast: traditional megablast used to find very similar (e.g.intraspecies or closely related species) sequence - dc–megablast: discontinuous megablast used to find more distant (e.g. interspecies) sequences First, you need to tell BLAST+ where to look for your databases, so run ``` echo $BLASTDB ``` Databases are in the Course_Materials directory, so to tell blast where the data bases are use: ``` export BLASTDB="$BLASTDB:/courses/bi278/Course_Materials/blastdb" ``` You can then check that this worked with the echo command. *You need to use this export command every session or logging into bi278 before using BLAST unless you modify your .bash_profile.* We will be working with 3 databases: 1. nt (nucleotides) 2. swissprot (proteins) 3. refseq_select_prot (proteins) Other databases found at ftp://ftp.ncbi.nlm.nih.gov/blast/db/ You can also take your own sequences and make a database out of them. 1.1 Run command line BLASTP Check Prokka output and identify the two files that respectively contain the nucleotide and protein sequences for all the genes that Prokka identified for your genome. Both files should be in FASTA format (one of them is for your genome sequence– don't want this one) ``` cd bbqs395 #working in bbqs395 directory under lab_03 ls ``` Fasta files are .faa (protein), .ffn (nucleotide) ``` PROKKA_10042022.faa PROKKA_10042022.ffn ``` Fasta files can contain multiple sequences that each belong to a different gene, and the name of each sequence is declared ina specific way using '>'. We can use this symol to seperate (delimit) sequences (records) in a multi-fasta file. One way to do this using awk: ``` awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PROKKA_inputfile ``` This command: - reads the designated file at the end of the command by whatever you tell is to use as a record seperator - matches a particular patter you give it (like the word hypothetical) - when a record contains that pattern, it then prints the smbol '>' followed by the record ($) - when it prints, it doesn't use an additional record seperator Awk command is contained within quotes, and anything to do with output format is outside the quotes. After the command, you specify the file you want awk to work on, and then you can either have everything printed to your screen or have it sent to a file (as above using > operator). Before sending anything to a file, check output using the | operator. ``` awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PROKKA_10042022.faa | head ``` And then into PROKKA_10042022.faa.hypotheticalprotein file using ``` awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PROKKA_10042022.faa > PROKKA_10042022.faa.hypotheticalprotein ``` Professor Noh's prokka annotation files for bbqs395, bbqs433, and bhqs69 are still in /courses/bi278/Course_Materials/lab_04 1. Write hypothetical proteins to their own fasta file Now, we want to take a subset of the proteins and test different BLASTs using a few of them to see how teh software behaves before running the whole dataset. The simplest way to subset data is to just open up your file using a text editor and copy and paste however many you want into a new file. Or, you can do it in command line using awk by taking a percentage (ex 0.2% = 0.002) of a file's records ``` awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" hypothetical_inputfile ``` Check how it's working using different numbers instead of .002 .003 gives 2 hypothetical proteins 2. take a subset of your hypothetical proteins (3 or less w=total) and send them to their own fasta file ``` awk 'BEGIN {RS=">"; srand()} (rand()<=.003) {print ">"$0}' ORS="" PROKKA_10042022.faa.hypotheticalprotein > PROKKA_10042022.hypotheticalprotein.fasta ``` Subset fasta file: PROKKA_10042022.hypotheticalprotein.fasta 3. Run a protein blast blastp. This is the basic protein (amino acid) query blast to start with and adjust as you go with additional flags ``` blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -query PROKKA_10042022.hypotheticalprotein.fasta #the command wasn't working, so we had to specify the path before the database blastp -task blastp-fast -num_threads 2 -db /courses/bi278/Course_Materials/blastdb/refseq_select_prot -evalue 1e-6 -query PROKKA_10042022.hypotheticalprotein.fasta ``` If good results, send to a file with the > operator so you can work with them. *the swissprot database is much smaller manually curated database compared to refseq_select_prot. If there is a match, swissprot can give a good match, but because the database is so small, we should plan to use refseq_select_prot. **the evalue threshold (-evalue le-6) determines the stringency of the search, so if you don't get any results, try removing this option. But not that good e-values for BLAST are considered to be 1e-6 or less (including 0). This means that you can adjust your BLAST parameters to get matches, but not all matches are informative. 4. Now try adding to the search one of the specifications for how you want your search output to look: ``` -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -outfmt 3 -num_descriptions 10 -num_alignments 10 ``` for example ``` blastp -task blastp-fast -num_threads 2 -db /courses/bi278/Course_Materials/blastdb/refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 –query PROKKA_10042022.hypotheticalprotein.fasta > subset_blast_fmt6.txt blastp -task blastp-fast -num_threads 2 -db /courses/bi278/Course_Materials/blastdb/refseq_select_prot -evalue 1e-6 -outfmt 3 -num_descriptions 10 -num_alignments 10 –query PROKKA_10042022.hypotheticalprotein.fasta > subset_blast_fmt3.txt ``` ***The -max_target_seqs, -num_descriptions, and -num_alignments options limit the results you see somewhere in the middle of the database search process. We are reccomended NOT to use these options when we run BLAST for a project. Instead use either the defaults (250 or 500) and filter results after the BLAST search. Mine wasn't working so I copied Prof. Noh's files over from the lab_04 directory and started using the ``` example_blastp_outfmt3.txt example_blastp_outfmt6.txt ``` BLAST searches can take a while to run, so don't run other searches on top of it. While waiting, check out examples of output files in this week's folder named ``` example*blastp*outfmt*txt cd /courses/bi278/Course_Materials/lab_05/ ls # files are example_blastp_outfmt3.txt and example_blastp_outfmt6.txt ``` They give overlapping information in very different formats. Output 6 format often ends up being the most useful because it contains all of the following data in table format ``` "qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" ``` BLAST outputs a few more pieces of data in addition to these, and you can check these out in table 2. You can also use format 6 results to get the amino acid or nucleotide sequences of the subjects you are matching to. If I'm interested in finding all of the subjects of one of my queries BO395_01177, the subject sequence ids that match this query can be found in the output format 6 results. To get the ones that match the specific query of interest run ``` $1=="BO395_01177" #mine is WP_153076995.1 ``` 5. Then, we can take the subject sequence ids from the second colum ($2) and save them as a list: ``` awk '$1=="BO395_01177" {print $2}' example_blastp_outfmt6.txt | sort | uniq > example_blastp_BO395_01177.sseqids.txt awk '$1=="WP_153076995.1" {print $2}' example_blastp_outfmt6.txt | sort | uniq > example_blastp_WP_153076995.1.sseqids.txt ``` list: example_blastp_WP_153076995.1.sseqids.txt 6. Then, I can use the blastdbcmd to find the entries from my search target database that match the subject sequence ids: ``` blastdbcmd -db refseq_select_prot -entry_batch example_blastp_WP_153076995.1.sseqids.txt ``` It would be useful to generate a MSA from these fasta sequences (or phylogeny, etc) TABLE 2: available BLAST output format specifiers ![](https://i.imgur.com/8gRHZkQ.png) *Skip 1.2: Run command line BLASTN* 1.3 Parse the output of BLAST results Using BLAST+ can be a great way to search through a large number of sequences at a time. However, figuring out how best to use the result formating is more challenging. BLAST+ truncates subject names. If you look at the top of the output format 3 files, it will be obcious. We can now use the following command to pull out some key fields from the output format 6 file in this order: ``` qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos qcovs stitle sscinames staxid ``` If you want the columns to end up in the reorganized output file: 1.query sequence id 2.subject sequence id 3.full name of subject sequence 4.scientific name of subject sequence 5.percent positive scoring matches 6.query coverage 7.e-value 8. Run a command to retriece these fields only: ``` awk -F "\t" '{print $1,$2,$15,$16,$13,$14,$11}' OFS="\t" example_blastp_outfmt6.txt ``` Each of the columns (fields) of the table format are specified with $ and numbers for each column. awk is useful for doing a lot of table reorganization tasts but the Pandas library in Python is another common tool. 9. What are the default values of the options matrix, evalue, max_target_seqs, num_descriptions, num_alignments for each type of BLAST? Under what circumstances would you use those defaults vs. when would you consider changing them? (Use each command with the -help option or check out the BLAST+ online manual appendix; between the two you can see all the defaults when you do this) Matrix: BLOSUM62 Evalue: 10 (Expectation value (E) threshold for saving hits) Max_target_seqs: 500 (Maximum number of aligned sequences to keep) Num_descriptions: 500 (Number of database sequences to show one-line descriptions for) Num_alignments: 250 (Number of database sequences to show alignments for) You would probably consider changing the defaults when you want to change the e-value (different confidence level/threshold for results– how specific/general you want your results to be), or change the number of databases to show one-line descriptions or alignments for Note: **when moving files!** - cp bbqs395/* ~/lab_05/ - asterisc moves everything in the directory, but not the directory itself - -r before the directory that you're copying indicates that you're moving everything in the directory