# Bi278 Lab Num 5 ### By Lee Ferenc 10/18/2022 ## Exercise 1. command-ine BLAST(+) ### Here are the blast commands: | Program | task name | Description | | -------- | ------------- | ---------------------------------------------- | | blastq | blastq | Traditional blastp to compare a protein query to a protein database | | | blastq-short | blastp optimized for queries shorter than 30 residues | | | blastq-fast | a faster version of blastp that uses a larger word size (6 vs. 2-3) | | blastn | blastn | traditional blastn requiring an exact match of 11| | | blastn-short | blastn program optimized for sequences shorter than 50 bases | | | megablast | traditional megablast used to find very similar (e.g. intraspecies or closely related species) sequence | | | mc-megablast | discontinuous megablast used to find more distant (e.g. interspecies) sequences| Search for databases: (set enviromental variable for Blast+) ``` echo @BLASTDB ``` If nothing appears set database and check: ``` export BLASTDB="$/courses/bi278/Course_Materials/blastdb" ``` ``` echo @BLASTDB ``` #### You will need to run this every session We'll work w/ 3 databases: nt (for nucleotides) and swissprot or refseq_select_prot (both for proteins). They're processed and downloaded but source: ftp://ftp.ncbi.nlm.nih.gov/blast/db/ ### 1.1 Run commandline BLAST+ To seperate our records in a file that contains multiple fasta sequenes: ``` awk 'BEGIN {RS=">"} /b/ {print ">"$0}' ORS="" PROKKA_inputfile ``` Example of creating a fasta file of hypothetical proteins output into another fasta file ``` awk 'BEGIN {RS=">"} /hypothetical/ {print ">"$0}' ORS=" " /home2/enfere24/colbyhome/Genomics/lab_04/PROKKA_09242022.faa > PROKKA_09242022Hypothetical.faa ``` the > writes the output into what you declare to the left (here it's x file under lab_05). It gives all the hypothetical proteins 1) Write your hypothetical proteins to their own fasta file. Do this for just the protein sequences. Hypothetical proteins: ``` awk 'BEGIN {RS=">"} /hypothetical/ {print ">"$0}' ORS=" " /home2/enfere24/colbyhome/Genomics/lab_04/PROKKA_09242022.faa > PROKKA_09242022Hypothetical.faa ``` Take a subset of your hypothetical proteins (3 or less total) and send them to their own fasta file. This randomizes it: ``` awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" hypothetical_inputfile ``` Mine: ``` awk 'BEGIN {RS=">"; srand()} (rand()<=.005) {print ">"$0}' ORS="" PROKKA_09242022Hypothetical.faa ``` ``` awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" PROKKA_09242022Hypothetical.faa > PROKKA_09242022HyotheticalSubset.faa ``` ##### Reminder: This has different protein number outputs so the first time I ran it had 6 proteins then 4 then 2 Run a protein blast (blastp). This is the basic protein (amino acid) query blast I want you to start with and adjust as you go with optional flags: ``` blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -query /home2/enfere24/colbyhome/Genomics/lab_05/PROKKA_09242022HypotheticalSubset.faa ``` Structure how it looks by adding one of these tags before -query ``` -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -outfmt 3 -num_descriptions 10 -num_alignments 10 #I did blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query PROKKA_09242022HypotheticalSubset.faa > subset_lab05_blast.txt ``` I can then take the subject sequence ids from the second column and save them as a list. What I did: ``` awk '$1=="BH69_00784" {print $2}' subset_lab05_blast.txt | sort | uniq > subset_lab05_blastfmt6.sseqids.txt ``` Use the blastdbcmd to find the entries from my search target database that match the subject sequence ids: ``` blastdbcmd -db refseq_select_prot -entry_batch subset_lab05_blastfmt6.sseqids.txt ``` ### 1.2 Run commandline BLAST+ In case you ever need nuclotides instead of amino acids: ``` blastn -task megablast -num_threads 2 -query ffn_file -db nt -evalue 1e-6 ``` I was unable to this cause of the database not being active but it's very similar to the first part. ### 1.3 PARSE THE OUTPUT OF BLAST RESULTS Get the fieds of: 1. query sequence id 2. subject sequence id 3. full name of subject sequence 4. scientific name of subject sequence 5. percent positive scoring matches 6. query coverage 7. e-value ``` awk -F "\t" '{print $1,$2,$15,$16,$13,$14,$11}' OFS="\t" subset_lab05_blastfmt6.sseqids.txt ``` ##### Question 1: Apparently we can't do this so I'm skipping but probably considering they are different databases. Also blastn is looking for an exact match but there are different sequences with the same amino acids so blastp is te best option. ##### Question 2: Default options and under what circustances to change them? matrix, BLOSUM62, change your matrix type evalue, 10.0 (real) (This doesn't really give a good answer) but change value of threshold for expected hits max_target_seqs, 500, when you want more or less aligned sequences to keep num_descriptions, 500 change num of database sequences num_alignments, 250, when you want more or less database sequences