Lab 5: BLAST - HackMD

# Lab 5: BLAST Blast+ options; **BlastP**: blastp- traditional blastp to compare a protein query to a protein database blastp-short: blastp optimized for queries shorter than 30 residues blastp-fast: a faster version of blastp that uses a larger word size (6 vs. 2-3) **Blastn**: blastn: traditional blastn requiring an exact match of 11, to compare a nucleotide query to a nucleotide database blastn-short: blastn program optimized for sequences shorter than 50 bases megablast: traditional megablast used to find very similar sequences (e.g. intraspecies or closely related species) dc-megablast: discontinuous megablast used to find more distinct sequences (e.g. interspecies) ## 1.1 Command Line Blast 1. To get blast to runn a search from bi278, you need to tell it where to find your databases.To do this, we need to set an environmental variable called $BLASTDB.we run echo $BLASTDB i got nothing which meant that the environmental variable is empty 2. Tell Blast where these databases are: ``` export BLASTDB="$BLASTDB:/courses/bi278/Course_Materials/blastdb" ``` 3. to check if this worked try: ``` echo $BLASTDB ``` I did this and it gave me the path of the file: :/courses/bi278/Course_Materials/blastdb *BTW TO SEE HOW BLAST+ WORKS I CAN TYPE*: blastn -help blastp -help I looked at the prokka output from last labs, i went into lab_03, into bbqs395, then got the faa(amino acid) and ffn (nucleotide) file ## 1.1 Run BLASTp 1. These files contain multiple sequences that each belong to a different gene, name of each gene declared in specfic way using >. we can use this symbol to seperate sequences in a multi=fatsa files. we can do this by using: ``` awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PATH/PROKKA_FILE ``` this looked like: ``` awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" /home2/cibene26/lab_03/bbqs395/PROKKA_10032023.faa ``` (RS=">") is a record seperator (/pattern/) match to particular pattern When a record contains that pattern, then print the symbol > followed by the record with the pattern ($0) When printing, it doesn’t use an additional output record separator (ORS="") 2. We had to move all of these hypothetical proteins to a new file we do this by doing `>` this is what it looked like( new file w all hyp proteuins is after the >): ``` awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" /home2/cibene26/lab_03/bbqs395/PROKKA_10032023.faa > BWAHPPROKKA.faa ``` The entire **awk command** is contained within single quotes, and anything to do with the output format is outside these single quotes. After the command, you specify the input file you want awk to work on. Then you can either have everything printed to your screen, or have it sent to a file (as above using the > operator). Typically before sending anything to a file I will always check what the output looks like using the | operator to pipe my command to head or tail: ``` SOMECOMMAND | head ``` The absolutely simplest way to subset your data may be to just open up your file using a text editor, and copy and paste however many you want into a new file. But here is a way to do it in command line using awk by taking a percentage (for example, 0.2% = 0.002) of a file’s records: ``` awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" INPUTFILE` ``` for me this looked like: ``` awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" /home2/cibene26/lab_03/bbqs395/PROKKA_10032023.faa` ``` 2. Take a subset of your hypotheical proteins ``` awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" BWAHPPROKKA.faa >SUBHPPROKKA.faa` ``` *instead of using path you can move things to a new file this is how: the use of the `>` will send info to whatever file i name after it PROKKA_10032023.faa is all proteins SUBHPPROKKA.faa (3-5) hypothetical proteins BWAHPPROKKA.faa is all hps seperated **We will be working with 2 protein databases: swissprot or refseq_select_prot and 2 nucleotide databases env_nt and nt_prok.** 3.Run a protein blast (blastp). This is the basic protein (amino acid) query blast to start with and adjust as you go with optional flags: ``` blastp -task blastp-fast -num_threads 2 -db PROTEIN_DB -evalue 1e-6 -query FAA_FILE ``` for me this looked like: lab_05]$ ``` blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -query SUBHPPROKKA.faa ` ``` *The evalue threshold (-evalue 1e-6) determines the stringency of your search. If you don’t get any matches, try removing this option. But when you look at your results, note that good e-values for BLAST searches are considered to be those that are 1e-6 or smaller (including 0). What this means is you can adjust your BLAST parameters to get matches, but not all matches are informative.* I added these specifications to the command above( for how we want our search outputs to look); -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 could also do this but i didnt try it: -outfmt 3 -num_descriptions 10 -num_alignments 10 **Run blastp** the new command would look like: ``` blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query INPUTFILE > OUTPUTFILE ``` or ``` `blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt 3 -num_descriptions 10 -num_alignments 10 -query INPUTFILE > OUTPUTFILE` ``` # STOPPED HERE# in my case this looked like: ``` blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query SUBHPPROKKA.faa > 2SUBHPPROKKA.faa ``` 2SUBHPPROKKA.faa is a reorganized blastp ran through file 3 The `-max_target_seqs`, `-num_descriptions`, and `-num_alignments` options limit the results you see somewhere in the middle of the database search process. Will take some time When BLAST is running, it’s finding the best matches using a heuristic to search through all of these potential match subjects in the database.You can use & at the end of a command to run a search in the background Here is what the output for a different search I ran looked like. All 10 results ($2) are for the same query sequence ($1).You can also use format 6 results to get the amino acid or nucleotide sequences of the subjects you are matching to. Let’s say I’m interested in finding all the subject sequences of one of my queries BO395_01177. The subject sequence ids that match this query can be found in my output format 6 results. I just need to get the ones that match my specific query of interest ($1=="BO395_01177"). Table 2. Available BLAST output format specifiers specifier meaning qseqid Query Seq-id qgi Query GI qacc Query accesion sseqid Subject Seq-id sallseqid All subject Seq-id(s), separated by a ‘;’ sgi Subject GI sallgi All subject GIs sacc Subject accession sallacc All subject accessions qstart Start of alignment in query qend End of alignment in query sstart Start of alignment in subject send End of alignment in subject qseq Aligned part of query sequence sseq Aligned part of subject sequence evalue Expect value bitscore Bit score score Raw score length Alignment length pident Percentage of identical matches nident Number of identical matches mismatch Number of mismatches positive Number of positive-scoring matches gapopen Number of gap openings gaps Total number of gap ppos Percentage of positive-scoring matches frames Query and subject frames separated by a ‘/’ qframe Query frame sframe Subject frame btop Blast traceback operations (BTOP) staxids unique Subject Taxonomy ID(s), separated by a ‘;’(in numerical order) sscinames unique Subject Scientific Name(s), separated by a ‘;’ scomnames unique Subject Common Name(s), separated by a ‘;’ sblastnames unique Subject Blast Name(s), separated by a ‘;’ (in alphabetical order) sskingdoms unique Subject Super Kingdom(s), separated by a ‘;’ (in alphabetical order) stitle Subject Title salltitles All Subject Title(s), separated by a ‘<>’ sstrand Subject Strand qcovs Query Coverage Per Subject (for all HSPs) qcovhsp Query Coverage Per HSP ## 1.2 Run Blastn(ERROR) To call up fatsa format sequences based on their newly indexed gene ids( we are using gene ids found in subset hypothetical; protein file, my SUBHPPROKKA.faa): samtools faidx PATH/PROKKA_*faa GENEID1 GENEID2 … in my case this looked like: samtools faidx ./SUBHPPROKKA.faa bbqs295_02086 bbqs295_02612 bbqs295_03216 bbqs295_03241 > NUCPROKKA.faa gave me error, this is what i ran bc the info saved to this file: blastn -task megablast -num_threads 2 -query SUBHPPROKKA.faa.fai -db nt_prok -evalue 1e-6 say: Near line 1, there's a line that doesn't look like plausible data, but it's not marked as defline or comment. # Run Blastn 10/22 i am using my subset hypothetical protein taking those gene idsa and doing samtools on it from the ffn file and sending thsie to a new file to run blast on this is how: 1. make sure i got the nucletode sequences from the gene ids pulled from the ffn file( got ATGC for each hp gene id) ``` samtools faidx /home2/cibene26/lab_03/bbqs395/PROKKA_10032023.ffn bbqs295_02086 bbqs295_02612 bbqs295_03216 bbqs295_03241 ``` 2. link it back to the ffn to get the nucleotide file, send those nucletide sequences to a file to run blastn on ``` samtools faidx /home2/cibene26/lab_03/bbqs395/PROKKA_10032023.ffn bbqs295_02086 bbqs295_02612 bbqs295_03216 bbqs295_03241 > BTPROKKA.ffn ``` BTPPROKKA is my nucletode sequences from my hp gen ids # 1.3 Parse Blast Results Output 1.editing my results, using format 3, I ran I(could sned results fo a new file the wya i wnat themmorganized) ``` blastn -task megablast -num_threads 2 -query BTPROKKA.ffn -db nt_prok -evalue 1e-6 -outfmt 3 -num_descriptions 10 -num_alignments 10 ```