BI278 Lab 5 Notes

# BI278 Lab 5 Notes utilizing blast on PROKKA annotations ## blastp blastp: traditional blastp to compare a protein query to a protein database blastp-short: blastp optimized for queries shorter than 30 residues blastp-fast: a faster version of blastp that uses sa larger word size (6 vs. 2-3) ## blastn blastn: traditional blastn requiring an exact match of 11 blastn-short: blastn program optimized for sequences shorter than 50 bases megablast: traditional megablast used to find very similar (e.g. intraspecious or closely related species) sequence dc-megablast- discontinous megablast used to find more distant (e.g. interspeies) sequences. To check if there is an environmental variable to tell the UNIX environment where to look for searchable databases use the following command: `echo $BLASTDB` If nothing shows up with this command, set an environmental variable utilizing the following command: `export BLASTDB="$BLASTDB:/courses/bi278/Course_Materials/blastdb"` Use the echo command to check if this worked. You will have to do this export command every session logging onto bi278 (before using BLAST). For help, use the commands: `blastn -help` `blastp -help` ## 1.1 Run Command Line Blastp Utilize the ">" symobol to seperate sequences in a multi-fasta file (a file that contains multiple fasta sequences) using the following command: awk `BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}` ORS="" PROKKA_inputfile to check the output of any given command utilize the command `some command | head` (I used this to show the results of the awk BEGIN command) ### 1. Writing hypothetical proteins to their own fasta file. (used only for protein sequences in this lab) Utilize the command: awk 'BEGIN {RS=">"} /hypothetical protein/ {print ">"$0}' ORS="" PROKKA_09242022.ffn > hypotheticalprotein - "hypothetical protein"- could be any file name - this file will be put in the directory you are currently working from ### 2. In order to take any percentage of a file's records utilize the command: awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" hypothetical_inputfile Upon application of this command, I found issue in the usage of single quotes. If there is an error make sure you are using ', not ` - hypothetical_inputfile - was the PROKKA file in this case - "0.002"- takes 0.2 percent of the input file to send it to a seperate file, use awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" hypothetical_inputfile > newfile - "newfile"- the file it is being sent to - ">" command moves something to somewhere else ### 3. Running a protein blast. The following command is the basic protein query blast that should be used: `blastp -task blastp-fast -num_threads 2 -db PATH/database_name -evalue 1e-6 -query newfile` * The "PATH/database_name" for this lab was `/courses/bi278/Course_Materials/blastdb/refseq_select_prot` - 1e-6: determines stringency of a search. If there are no matches, try removing this option. Good e-values for BLAST searches are within 1e-6. - Initially, if you get no hits on a scan with 1e-b, take out "-evalue 1e-6" OPTIONS FOR DATABASES: swissprot (for proteins)- smaller but better refseq_select_prot (for proteins)- larger database nt (for nucleotides) ### 4. To add specifications to your search output use the commands: `-outfmt "6 std ppos qcovs stitle sscinames staxid" - max_target_seqs 10` `-outfmt 3 -num_descriptions 10 -num_alignments 10` Example (applied to this lab): `blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query faa_file > subset_blastp_fmt6.txt` `blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt 3 -num_descriptions 10 -num_alignments 10 -query faa_file > subset_blastp_fmt3.txt` - max_target_seqs, -num_descriptions and -num_alignments options limit the results you see somewhere in the middle of the database search process. DON'T USE THESE FOR future PROJECTS USING BLAST - use the default numbers: either 250 or 500 Helpful: BLAST searches may take awhile, you can use the "&" to run a search in the background, just not another BLAST search. - Output formats: most helpful are outmft3 and outmft6. - Output format 6 (outmft6) contains all of the following data in table format: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore. - This data is very helpful! Following a BLAST, find your queries by opening the file that was exported - A query should like like this "BO395_0117898" or something similar To find the subject sequence ids of one of your queries use the output format 6 results and specificy your query of interest. - Can be done by $1=="YOUR QUERY (BO395...)" ### 5. Taking the subject sequence ids and saving them as a list (see above process to specify query): Utilize the command: ``` awk '$1=="BO395_01177" {print $2}' example_blastp_outfmt6.txt | sort | uniq > example_blastp_BO395_01177.sseqids.txt ``` I believe this was all written in one line of code. ### 6. Use the blastdbcmd to find the entries from a search target database that match the subject seqeunce ids Utilize the command: ``` blastdbcmd -db refseq_select_prot -entry_batch example_blastp_BO395_01177.sseqids.txt ``` ## 1.2 Run Command Line Blast- skipped during Lab 5 The basic nucleotide blast search that should be started with and adjusted as you go: `blastn -task megablast -num_threads 2 -query ffn_file -db nt -evalue 1e-6` - "ffn_file": the other PROKKA file gotten from previous labs ### 7. Run a nucleotide BLAST search for one query sentence that you got results back from blastp search above. (you must use nt database for blastn) - this will take a long time, and will hang. ## 1.3 Parse the Output of BLAST results The format specifiers for BLAST: ![](https://i.imgur.com/bLxY54j.png) Within the output format 6 file, columns should be in the following order: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore ppos qcovs stitle sscinames staxid Given this information, if you wanted these columns to end up in the reorganized final output file: 1. query sequence id 2. subject sequence id 3. full name of subject sequence 4. scientific name of subject sequence 5. percent positive scoring matches 6. query coverage 7. e-value ### 8. you would use this command to retrieve only these fields: `awk -F "\t" '{print $1,$2,$15,$16,$13,$14,$11}' OFS="\t" example_blastp_outfmt6.txt` - "example_blastp_outfmt6.txt"- variable dependent on what you used. - This command will reorganize information from the output file. - Each of the columns of the table are specified using a dollar sign and a number.

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.