Bi278 Lab 5 v2 BLAST

# Bi278 Lab 5 v2 BLAST ### By Lee Ferenc 10/17/2023 ## 1. Command-line BLAST List of `BLAST+` tasks | Program | task name | Description | | -------- | ------------- | ---------------------------------------------- | | `blastq` | `blastq` | Traditional `blastp` to compare a protein query to a protein database | | | `blastq-short` | `blastp` optimized for queries shorter than 30 residues | | | `blastq-fast` | a faster version of `blastp` that uses a larger word size (6 vs. 2-3) | | `blastn` | `blastn` | traditional `blastn` requiring an exact match of 11| | | `blastn-short` | `blastn` program optimized for sequences shorter than 50 bases | | | `megablast` | traditional `megablast` used to find very similar (e.g. intraspecies or closely related species) sequence | | | `mc-megablast` | discontinuous `megablast` used to find more distant (e.g. interspecies) sequences| ### Step 1: Check enviromental variable. ```echo $BLASTDB``` If you get nothign that means the enviromental variable `$BLASTDB` is empty. You need to set it every new `ssh` session. ### Step 2: Tell BLAST where databases are `export BLASTDB="$BLASTDB:/courses/bi278/Course_Materials/blastdb"` ### `BLAST+` Help ``` blastn -help blastp -help ``` ## 1.1 Run `blastp` ### Step 1. Write hypothetical proteins to fasta file. Fasta files can contain multiple sequences that each belong to a different gene. Use `>`, which seperates sequences, to seperate he sequences. One way: `awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PATH/PROKKA_FILE` The command: * reads the file using the sequence (record) seperator `RS=">"` * matches a particular pattern (`/pattern`) in each record, think hypothetical protein * When a record/seq contains that pattern, print `>` followed by the pattern (`$0`) * When pinting, doesn't use an addition output record separator Example code (the end | head is just to make sure it looks good): ``` awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" /home2/enfere24/lab_03/PROKKA_09262023.faa | head ``` Example code *printing to a new file*: `awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" /home2/enfere24/lab_03/PROKKA_09262023.faa > hypo_proteins.faa` ### Step 2. Take subset of the proteins (3-5) anad send them to their own fasta file. You can open them up in a text editor but another way to do this is by taking a percentage of those proteins and putting that into the `rand` function, like 002 = .2%. `awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" INPUTFILE` Example code: `awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" hypo_proteins.faa` Example printing into a new file (judge my file names all you want but i know what's in them): `awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" hypo_proteins.faa > subset_hypo_proteins_prokka_09262023.faa` I got 7 very cool proteins. ### Step 3. Run a protein blast (`blastp`) Basic protein (AA) query blast: `blastp -task blastp-fast -num_threads 2 -db PROTEIN_DB -evalue 1e-6 -query FAA_FILE` Note: All dashes (-) belong right next to each option name and te option you choose should be seperated from the option name w/ a space. * ***1*** for PROTEIN_DB you will need to choose a protein database. * "**swissprot**" database: smaller (> 0.4 million sequences), manually curated; If a match is present, can give a really good informative match but b/c small you may not get a match at all * "**refseq_select_prot**" (> 63 million sequences), will use most of the time. * ***2*** The e-value threshold (-evalue 1e-6) determines the stringency of your search. Remove if no matches. Note that good e-values for BLAST searches are considered to be those that are 1e-6 or smaller. Example with swissport: `blastp -task blastp-fast -num_threads 2 -db swissport -evalue 1e-6 -query subset_hypo_proteins_prokka_09262023.faa ` Got nothing. Output: "BLAST Database error: No alias or index file found for protein database [swissport] in search path" Example with refseq_select_prot: `blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -query subset_hypo_proteins_prokka_09262023.faa ` Took a long time ### Step 4. Add more specifics on how search output looks `-outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10` or `-outfmt 3 -num_descriptions 10 -num_alignments 10` Lab Examples: `blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query INPUTFILE > OUTPUTFILE` `blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt 3 -num_descriptions 10 -num_alignments 10 -query INPUTFILE > OUTPUTFILE` My example: `blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query subset_hypo_proteins_prokka_09262023.faa > blast_subhypo_prokka_09262023.txt` * ***3*** `-max_target_seqs`, `-num_descriptions`, and `-num_alignments` options limit the results you see. *(**Project Note**: When you run BLAST for a project, DON’T USE THESE OPTIONS because you will get different results if these numbers are too small. Use either the defaults (250 or 500 depending on which option) and filter your results after the BLAST search, or at least use a larger number (e.g. 100+).)* You can use `&` at the end of a command to run a search in the background. ### Step 5. Take subject sequence ids from second column (`$2`) and save them as a list `awk '$1=="(sequence of interest)" {print $2}' example_blastp_outfmt6.txt | sort | uniq > input_file.sseqids.txt` Example, using BHQS69_01670: `awk '$1=="BHQS69_01670" {print $2}' blast_subhypo_prokka_09262023.txt | sort | uniq > bhqs69_0170_bplast.sseqids.txt` ### Step 6. Use `blastdbcmd` to find the entries from my search target database that match the subject sequence ids: Lab Example: `blastdbcmd -db refseq_select_prot -entry_batch example_blastp_BO395_01177.sseqids.txt` Example: `blastdbcmd -db refseq_select_prot -entry_batch bhqs69_0170_bplast.sseqids.txt ` ### Table of BLAST output format specifiers (insert later but [link to table](https://hackmd.io/@snoh-bi278/BJLL_9gxp#1-command-line-BLAST)) ## 1.2 Run `blastn` Run a `blastn` query: `blastn -task megablast -num_threads 2 -query INPUTFILE -db NUCLEOTIDE_DB -evalue 1e-6` My example: ``` #get same proteins as nucleotide sequences samtools faidx /home2/enfere24/lab_03/PROKKA_09262023.ffn BHQS69_00275 BHQS69_00293 BHQS69_00525 BHQS69_01275 BHQS69_01625 BHQS69_01670 > REALsubsethypoprotein.ffn ``` I got not hits :( Nucleotide databases: * **env_nt** (93 million sequences) * **nt_prok** (8 million sequences) ## 1.3 Parse BLAST result outputs Output format 6 outputs all this info: `qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs stitle sscinames staxid` Don't need all those columns. Instead use: `awk -F "\t" '{print $1,$2,$14,$15,$11,$12,$13}' OFS="\t" example_blastp_outfmt6.txt` Retrieves the fields: * query sequence id (`$1`) * subject sequence id (`$2`) * full name of subject sequence (`$14`) * scientific name of subject sequence (`$15`) * e-value (`$11`) * bitscore (`$12`) * query coverage (`$13`) I did it: `awk -F "\t" '{print $1,$2,$14,$15,$11,$12,$13}' OFS="\t" blast_subhypo_prokka_09262023.txt > better_blast_subhypo_prokka_09262023.txt` ## Questions #### Question 1: What are the default values of the options `matrix`, `evalue`, `max_target_seqs`, `num_descriptions`, `num_alignments` for each type of BLAST? At what point in the BLAST search would these filters be applied, and under what circumstances would you consider using a value other than the default? [Link with great information that I used since -help was not very helpful.](https://www.ncbi.nlm.nih.gov/books/NBK279684/) * `matrix`, BLOSUM62 for blastp, PAM30 for blastp-short,, change your matrix type, if you wanted a different matrix (see physical notes for each and why) * `evalue`, 10.0, but change value of threshold for expected hits * `max_target_seqs`, 500, when you want more or less aligned sequences to keep * `num_descriptions`, 500 change num of database sequences * `num_alignments`, 250, when you want more or less database sequences #### Question 2: I encourage you to run `blastp` searches with a non-hypothetical protein that looks interesting and check out the results you get by changing some of the parameters above. Report back on what you find. I ran it with BHQS69_01799 toxin transport-related membrane protein. Sounded like an efflux pump and that's. ``` awk 'BEGIN {RS=">"} /toxin/ {print ">"$0}' ORS="" /home2/enfere24/lab_03/PROKKA_09262023.faa > toxin.faa awk 'BEGIN {RS=">"; srand()} (rand()<=.2) {print ">"$0}' ORS="" toxin.faa > subset_toxin.faa cat subset_toxin.faa #saw what looked to be an efflux pump samtools faidx /home2/enfere24/lab_03/PROKKA_09262023.faa BHQS69_01799 > maybeefflux.faa blastp -task blastp-fast -num_threads 2 -db swissprot -evalue 1e-6 -query maybeefflux.faa blastp -task blastp-fast -num_threads 2 -db swissprot -evalue 1e-6 -outfmt 6 -max_target_seqs 10 -query maybeefflux.faa > blast_maybeefflux.txt ``` Anyways I didn't have to go to far to see some interesting results. ``` BHQS69_01799 Q04473.1 28.197 649 458 4 56 703 54 695 1.31e-82 279 BHQS69_01799 Q933I3.1 27.427 649 463 4 56 703 51 692 1.71e-82 279 BHQS69_01799 P23702.1 28.397 655 453 7 56 705 50 693 4.02e-82 278 BHQS69_01799 Q9RCG7.1 28.025 628 442 5 78 703 76 695 6.82e-82 277 BHQS69_01799 P26760.2 27.575 631 445 6 78 705 72 693 1.01e-81 276 BHQS69_01799 Q93FH2.1 27.427 649 463 4 56 703 51 692 1.42e-81 276 BHQS69_01799 P16532.1 27.427 649 463 4 56 703 51 692 1.48e-81 276 BHQS69_01799 Q93FH6.1 27.427 649 463 4 56 703 51 692 1.50e-81 276 BHQS69_01799 Q933E0.1 27.273 649 464 4 56 703 51 692 4.03e-81 275 BHQS69_01799 P0C086.1 27.273 649 464 4 56 703 51 692 5.27e-81 275 ``` It appears all those different species have very similar homologs. I only uses swissprot but I looked and the species aren't all super related. But the small e-values indicate they are almost def. related. If this is an efflux pump then I wonder if this could've been passed through plasmids? I'm hypothesizing but efflux pumps are vital and a gene that would want to be passed on without too much change.