###### tags: `BI278` # Lab 3 Notebook (9/28) ## Exercise 1. Annotate a genome with Prokka #### 1.1 ANNOTATE A GENOME After entering the server with ```ssh lmdrep23@bi278``` and creating a new folder with ```mkdir lab_03``` and going to that folder ```cd lab_03``` and copying files from course materials over ```cp /courses/bi278/Course_Materials/lab_03/*.``` , I ran the following command to run a Prokka annotation of a FASTA file ``` prokka --outdir baqs70 --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_ K96243_132.gbk --locustag PAG70 --genus Paraburkholderia --species agricolaris - -strain BaQS70 /courses/bi278/Course_Materials/lab_03/b70_dnaa_irp_repolished.fa sta ``` Note: copying and pasting from lab instructions required insuring there are spaces before each -- for tags and making sure there's a space after tags and before directory locations. 1) Prokka outputs a folder in the current directory with multiple files. The .faa file gives a the amino acid sequences in the found proteins that this sequence codes for. The .gff file looks at a gene structure prediction of the sequence, identifying what is likely protein coding, etc. The txt file gives meta-information, such as the organism, number of contigs, total number of bases, etc. The .tsv file provides whether each CD is a hypothetical protein, etc. 2) As given by the txt file, the total number of CDS is 7860. I then used the following command to search for the number of CDS that are a hypothetical protein in the .tsv file ``` grep -c "hypothetical protein" PROKKA_09282020.tsv ``` The output of this command was 2703, suggesting 2703 of the 7860 CDS are hypothetical proteins. Switching out "hypothetical protein" with "ribosomal protein" reveals 57 ribosomal proteins. Repeating this with the term "tRNA" reveals 147 tRNAs (note: I repeatedly used the command ```less PROKKA_09282020.tsv``` to make sure the file uses these exact terms so the count is accurate. I also do not see any instances of these terms being used when not to indicate a CDS, such as a summary at the top of the file, etc.) 3) Since the .tsv file provides the locus tag for each instance of any type of annotation, this would be a good file to use to find the location of certain annotation types. ## Exercise 2. command-line BLAST To be able to use blast, I used the following command to clarify where the databases to be used are located. ```export BLASTDB="$BLASTDB:/courses/bi278/Course_Materials/blastdb"``` Then I used this command to check and make sure that the the right directory came up when I ask where the databases are ```echo $BLASTDB``` Indeed the output was :/courses/bi278/Course_Materials/blastdb. #### 2.1 RUN COMMAND LINE BLASTN ON A SUBSET OF UNKNOWN HYPOTHETICAL PROTEINS The Prokka output file PROKKA_09282020.ffn gives nucleotide sequences, and the .faa file gives the protein sequence (primary structure of proteins). I used the command ```awk 'BEGIN {RS=">"} /hypothetical protein/{print ""$0}' ORS=" " PROKKA_09282020.ffn ``` to separate sequences in the .ffn file. To break down this command: - The RS= determines the record separator - the /hypothetical protein/ establishes that term as the pattern. So every a record containing "hypothetical protein" is found,the program prints the record separator (> in this case) and then the record that contains the pattern. Essentially your identifying what you want to find and what you what to use to mark instances of it. Note that you can add >filename to the end of this command to put the awk output in a new file. If you are hesistant to make a new file due to uncertainty if it's right, use ```some command | head```, which shows on the screen what the beginning of the file will look like. 1. I used the above steps to write the nucleotide sequences of the hypothetical proteins to a file. I also used the following command to do the same for the protein sequences. Note that the original file contained sections for many different categories other than hypthetical proteins, but this file only included hypothetical proteins thanks to the awk command. ```awk 'BEGIN {RS=">"} /hypothetical protein/{print ""$0}' ORS=" " PROKKA_09282020.faa > proteinseq_hypotheticalproteins.faa``` Next, I can use the awk commands take a subset of the data using the following command (this takes .2% of the data) ```awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" PROKKA_09282020.faa > subset_of_records.faa``` 2. I used the following command to take my .ffn file of hypothetical proteins and export a subset of them to a new file ```awk 'BEGIN {RS="PAG"; srand()} (rand()<=.002) {print ">"$0}' ORS="" ntseq_hypotheticalproteins.ffn > subset_ntseq_hypotheticalproteins.ffn ``` The following was done for the protein files ``` awk 'BEGIN {RS="PAG"; srand()} (rand()<=.002) {print ">"$0}' ORS="" proteinseq_hypotheticalproteins.faa > subset_proteinseq_hypotheticalproteins.faa``` #### 2.2. RUN COMMAND LINE BLASTP ON A SUBSET OF UNKNOWN HYPOTHETICAL PROTEINS I used the following command to complete a nucleotide blast search. The options mainly specify the formatting of the output file so that it is readable/ easy to work with. As you can see, I write the output to a file. ```/usr/local/src/ncbi-blast-2.10.1+/bin/blastn -task megablast -query subset_ntseq_hypotheticalproteins.ffn -db nt -outfmt "6 std stitle sscinames" -max_target_seqs 5 -evalue 1e-6 > ffn_megablast_result.txt &``` Note that the & at the end allows me to use the command line while it is running I did the same for the protein sequence data with the following command. Note the use of blastp instead of megablast. ```/usr/local/src/ncbi-blast-2.10.1+/bin/blastp -task blastp-fast - num_threads 4 -query subset_proteinseq_hypotheticalproteins.faa -db refseq_protein -outfmt "6 std stitle sscinames" -max_target_seqs 5 -evalue 1e-6 > blastp_result_1b.txt ``` There are actually two possible databases to use for protein sequence data. Just replacing refseq_protein with swissprot tries this other database (I also changed the output file name in this command so I'd be able to differentiate the two) Note that the above command is a series of modifications upon the basic command for BLAST on amino acid sequences: ```/usr/local/src/ncbi-blast-2.10.1+/bin/blastp -task blastp-fast - num_threads 4 -query faa_file -db refseq_protein``` #### 2.3. PARSE THE OUTPUT OF BLAST RESULTS 4. I used the following command to see in terminal the query id, scientific name of subject sequence, full name of subject sequence, percent identity, alignment length, and e-value of each sequence in the BLAST result. I changed out the last item in the comamand for each of my BLAST output files ```awk -F "\t" '{print $1,$14,$13,$3,$4,$11}' OFS="\t" blastp_result_1b.txt``` Q4: Looking at this organized form of my output files, I can see that the blastn (the nucleotide blast search) was able to recognize more sequences than blastp (the protein blast search), perhaps making it more useful that blastp. Q5: Refseq seemed to work better than swissprot, which only yielded information for one sequence in my subset of hypothetical proteins. I also kept the -evalue 1e-6 parameter since it seemed to effectively find matches without keeping too many that were a questionable fit. The majority of the parameters refer to format, and the suggested parameters created a readable format for me which is important in analyzing data. Q6: To find the default parameters for blastn and blastp, I entered the following commands ```/usr/local/src/ncbi-blast-2.10.1+/bin/blastn -help``` ```/usr/local/src/ncbi-blast-2.10.1+/bin/blastp -help``` For blastp and blastn, the default value for outfmt is 0, meaning the data is displayed pairwise rather than tabular. Without tabs between the columns, it would be difficult to draw comparisons between sequences, which is why I would avoid the default option and use 6 as I did in this lab. Another parameter we specified was evalue. The default evalue is 10 for both blastp and blastn as well. We chose 1 x 10(^-6), which I found to have an appropriate level of stringency. This higher default would give more matches, but ones that are not necessarily accurate. This is why I would also avoid this default option.