BI278 Lab #5 - HackMD

# BI278 Lab #5 ### Exercise. command-line BLAST I first need to tell BLAST+ where to look for the databases. We need to set an environmental variable, to tell your Unix environment where to look for searchable databases. ``` echo $BLASTDB ``` Reminder: I will need to run this export command every session or logging onto bi278 before using BLAST unless you modify your .bash_profile. #### 1.1. RUN COMMAND LINE BLASTP First I used the following commant to print out all sequences match the patthern "hypothetical protein". ``` awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PROKKA_09242022.faa ``` Then I used head to view only the begining of the file ``` awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PROKKA_09242022.faa | head ``` Write hypothetical proteins to their own fasta file. Do this for just the protein sequences. ``` awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PROKKA_09242022.faa > hypothetical_protein.faa ``` Take a subset of hypothetical proteins (3 or less total) and send them to their own fasta file. ``` awk 'BEGIN {RS=">"; srand()} (rand()<=.0015) {print ">"$0}' ORS="" hypothetical_protein.faa > hypothetical_protein_subset.faa ``` When I check the subset file, there are two protein sequences ``` cat hypothetical_protein_subset.faa ``` >BB433_00744 hypothetical protein MSLERTVNSLSRLVALRGREADRLSAVLVGKQAQCERQRHMLERMETLKNSTVNEVASVT AKHHPALALNRADYQDALQQLIETQREALARHETEVAASRDAWSGVALKHKSLDTVLQRK QMTLQQAGIASEQKRQDQMANQAWQRRAAR >BB433_02692 hypothetical protein MKLHRLAALLPIVYFAAGVAGAVAQPAADVSYSRHPELASAQASIRQAFVQIERAQSANR DRLGDHAENAKRLLDQASWELKAAAEYANGR 3. Run a protein blast (blastp). This is the basic protein (amino acid) query blast I want you to start with and adjust as you go with optional flags: ``` blastp -task blastp-fast -num_threads 2 -db swissprot -evalue 1e-6 -query hypothetical_protein_subset.faa ``` But using the Swissport did not give me any match. So I tried again with ``` blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -query hypothetical_protein_subset.faa ``` 4. Now try adding to the search above one of these specifications for how you want your search output to look: From this I get the file *subsetblastpfmt6.txt* ``` blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt 3 -num_descriptions 10 -num_alignments 10 -query hypothetical_protein_subset.faa > subset_blastp_fmt3.txt ``` From this I get the file *subsetblastpfmt3.txt* ``` blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query hypothetical_protein_subset.faa > subset_blastp_fmt6.txt ``` 5. I can then take the subject sequence ids from the second column ($2) and save them as a list: ``` awk '$1=="BB433_00744" {print $2}' subset_blastp_fmt6.txt | sort | uniq > subset_blastp_BO395_01177.sseqids.txt ``` I check the *subset_blastp_BO395_01177.sseqids.txt* ``` cat subset_blastp_BO395_01177.sseqids.txt ``` >WP_006766955.1 WP_047195751.1 WP_116810647.1 WP_129649243.1 WP_129779102.1 WP_133587960.1 WP_153074626.1 WP_161815974.1 WP_174769863.1 WP_189460473.1 6. Then, I can use the blastdbcmd to find the entries from my search target database that match the subject sequence ids ``` blastdbcmd -db refseq_select_prot -entry_batch subset_blastp_BO395_01177.sseqids.txt ``` And I get ten results back. #### 1.2. RUN COMMAND LINE BLASTN Unforunately we cannot use the nuclotide database. So I retrieve the ids for my query: >BB433_00744 BB433_02692 And I serach the sequence in the .ffn file from last week and find the matches. ``` samtools faidx PROKKA_09242022.ffn BB433_00744 BB433_02692 ``` And I get the nuclotide sequence for these two query. #### 1.3. PARSE THE OUTPUT OF BLAST RESULTS 8. Run a command to retrieve these fields only: ``` awk -F "\t" '{print $1,$2,$15,$16,$13,$14,$11}' OFS="\t" subset_blastp_fmt6.txt ``` 9. Parse your results from the various BLAST searches you have been running to figure out your answers to the following questions. Note that you will have at most 10 results for each query sequence (unless you changed this option). ##### Q2. What are the default values of the options matrix, evalue, max_target_seqs, num_descriptions, num_alignments for each type of BLAST? Under what circumstances would you use those defaults vs. when would you consider changing them? (Use each command with the -help option or check out the BLAST+ online manual appendix; between the two you can see all the defaults when you do this) Default values: matrix: BLOSUM62 and PAM30 max_target_seqs: 500 num_descriptions: 500 num_alignments: 250 When we are testing out, like sitution today in the lab, we would want to make the search quicker. So we will make the dafault max_target_seqs smaller so that we get less matching result and a quicker search. I think that is similar with num_descriptions. When we don't necessarily need too much information from the search, we can choose a smaller num_descriptions.