# Notebook 5 Genomics Lab `export BLASTDB="$BLASTDB:/courses/bi278/Course_Materials/blastdb"` set value for BLASTDB (tell where the databse is) `echo $BLASTDB` gets the value. ### 1.1 Run command line BLASTP `awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PROKKA_inputfile` What the command above does is: 1.Read the file you designate at the end of the command by whatever you tell it to use as a record separator (RS=">") 2.Match a particular pattern (/pattern/) you give it in each record, such as the word ‘hypothetical’ 3.When a record contains that pattern, it then prints the symbol ‘>’ followed by the record ($0) 4.When it prints, it doesn’t use an additional output record separator (ORS="") Simply, it gets all records with the /pattern/ separated by ">". `awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PROKKA_inputfile > output_file_name` Ouput the records to a new file. `awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" hypothetical_inputfile` Gets a random part of the input file. `blastp -task blastp-fast -num_threads 2 -db database_name - evalue 1e-6 -query faa_file` Note: all the dashes (-) belong right next to each option name and the option you choose should be separated from the option name with a space. If you are happy with the results, you should send the results of these searches to a file with the > operator so you can work with them. The evalue threshold (-evalue 1e-6) determines the stringency of your search. If you don't get any matches, try removing this option. But when you look at your results, note that good e-values for BLAST searches are considered to be those that are 1e-6 or smaller (including 0). Add`-outfmt "6 std ppos qcovs stitle sscinames staxid" - max_target_seqs 10` and `-outfmt 3 -num_descriptions 10 -num_alignments 10`to the command above like this ``` blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query faa_file > subset_blastp_fmt6.txt blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt 3 -num_descriptions 10 -num_alignments 10 -query faa_file > subset_blastp_fmt3.txt ``` `awk '$1=="BO395_02501" {print $2}' subset_blastp_fmt6.txt | sort | uniq > subset_blastp_BO395_02501.sseqids.txt ` This line of code take the formate 6 file and get all second column data of specific query of BO395_02501. What is in the format 6 file is as following. **"qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore"** `blastdbcmd -db refseq_select_prot -entry_batch example_blastp_BO395_01177.sseqids.txt` This code above gets all entries with subject ids found from the last code. ### 1.2 BLASTN `blastn -task megablast -num_threads 2 -query ffn_file -db nt -evalue 1e-6` get the nucleotide BLAST search. Then retrieve the fasta sequence from the nucleotide sequence file for query sequence get from 1.1 with code below. `samtools faidx PROKKA_09162022.ffn BO395_02501 BO395_00780 BO398_00049` ### 1.3 PARSE OUTPUT OF BLAST RESULT `awk -F "\t" '{print $1,$2,$15,$16,$13,$14,$11}' OFS="\t" example_blastp_outfmt6.txt` Retrieves the entries indicated from entries below. `qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos qcovs stitle sscinames staxid` #### Q2 | Parameters | Deafult value BLASTN | Deafult value BLASTP | |:---------------- | -------------------- |:-------------------- | | Matrix | BLOSUM 62 | BLOSUM62 | | evalue | 10.0 | 10.0 | | max_target_seqs | 500 | 500 | | num_descriptions | 500 | 500 | | num_alignments | 250 | 250 | For matrix, I would change the value when I don't want to use the BLOSUM62 matrix. For evalue, I would change the value if I don't want to make the expect value for saving hits to be 10. For max_target_seqs, I would change the value when I want to keep less or more aligned sequences than 500. For num_descriptions, I would change it if I want descriptions to appear more frequently than one description per 500 database sequences. For num_aligments, I would change it if I want to see a different number of alignments from 250.