# Lab 03

## Gene Prediction with Prokka

### Exercise 1: Annotate a genome with Prokka

We first create a new directory for this lab inside the BI278 folder of the colbyhome directory and make it our working directory.

```
mkdir /home2/kyamad23/BI278/lab_03
cd /home2/kyamad23/BI278/lab_03
```

We copy the P. bonniea bbqs433 fasta file from the Course Materials folder to our working directory.

```
cp /courses/bi278/Course_Materials/lab_03/P.bonniea_bbqs433.nanopore.fasta ./
```

We run the Prokka annotation on the fasta file P.bonniea_bbqs433.nanopore.fasta, which corresponds to the bbqs433 strain of the species P. bonniea.

```
prokka --force --outdir ./bbqs433 --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk --locustag BB433 --genus Paraburkholderia --species bonniea --strain bbqs433 PATH/P.bonniea_bbqs433.nanopore.fasta
```

Here PATH stands for the directory that holds the fasta file, which is the current directory (.) after the copy above. The `--proteins` option supplies the well-annotated genome of the related species B. pseudomallei as a first-priority source, which improves the annotation of the predicted genes. After running for ~4 minutes, the command writes the results of the Prokka annotation into the newly created directory, "bbqs433".

#### Q1: What information is in each of the different output files that Prokka generates? (focus on these: .faa, .ffn, .fna, .gff, .txt, .tsv)

The following descriptions are based on the Prokka [documentation](https://github.com/tseemann/prokka).

The faa file is a protein FASTA file. It contains the translation of each coding sequence (CDS) feature that Prokka predicted on the contigs in the fna file.

The ffn file is a nucleotide FASTA file that contains all of the predicted transcripts (i.e. coding sequences, rRNA, tRNA, tmRNA, miscellaneous RNA).

The fna file is a nucleotide FASTA file of the input contig sequences, i.e. the assembled contigs that were passed to Prokka.
The gff file contains the main annotation output in GFF3 format (tab-delimited and standardized). This file contains both sequences and annotations. To view only the annotations, we can run the following command.

```
grep -v "^#" *.gff | less -S
```

The `-v` flag inverts the match, so every line that begins with "#" (the header and directive lines) is dropped and the annotation rows appear first. The sequences sit at the end of the file, after the annotations.

The tsv file is in "tab-separated values" format and lists every feature with its locus tag, feature type, length in bp, gene, EC number, COG, and product.

The txt file contains the summary statistics of the annotation run. This includes the organism name and the number of contigs, bases, coding sequences, and CRISPR, rRNA, tRNA, and tmRNA sequences.

#### Q2: Then, explore and use the best file from above to answer each question:

##### How many CDS do you have

From the summary statistics of the annotation (txt file), we see that there are 3430 coding sequences.

##### How many tRNA do you have

From the summary statistics of the annotation (txt file), we see that there are 56 tRNA sequences.

##### How many 'ribosomal proteins' do you have

```
grep "ribosomal" *.ffn | wc -l
```

The ffn file contains all of the predicted sequences, and each FASTA header carries the product name, so the command counts the headers that mention "ribosomal". There are 66 ribosomal proteins. Note that this pattern also matches rRNA features such as "16S ribosomal RNA" in the ffn file, so the count may slightly overstate the number of proteins.

##### How many 'hypothetical proteins' do you have

```
grep "hypothetical" *.ffn | wc -l
```

There are 1464 hypothetical proteins.

#### Q3: Which file is the most useful for finding genomic locations (chromosome and base address) of any type of annotated feature (genes (CDS), tRNA, etc)?

The gff file is the master output that contains sequences, annotations, and locations. Each annotation row gives the contig name together with the start and end coordinates of the feature.

#### Q4: Compare your metrics to that of the reference genomes for these two species. Are your results similar? Does your answer impact your confidence in the sequencing quality of your new draft genome?
![](https://i.imgur.com/IrQmriL.png)

![](https://i.imgur.com/EoFO0hj.png)

![](https://i.imgur.com/7rp1fAx.png)

The sizes of the reference genomes are 4.1 Mb for P. bonniea and 4.13 Mb for P. hayleyella. The size of the P. bonniea bbqs433 genome that we derived our results from is 4.01 Mb. All of the genomes have two contigs. Using our script from the first week, we find that the GC% of our genome is 58.8%. Both the genome size and the GC% are comparable to those of the reference genomes.

The metrics of our new draft genome are close to those of the reference genomes from NCBI, so I am more confident in the sequencing quality of the newly drafted genome.
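
The contig count, genome size, and GC% used in this comparison can all be read from a single FASTA file. The block below is a minimal sketch of one way to do that with awk. It is not the script from the first week, the function name `fasta_stats` is made up for illustration, and the two small contigs are invented demo data rather than real sequence.

```shell
# Minimal sketch: contig count, total length, and GC% of a FASTA file.
# The demo file below is invented example data, not a real genome.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>contig_1
ATGCGC
GGCC
>contig_2
ATATGCGC
EOF

# fasta_stats is a hypothetical helper name used only in this sketch.
fasta_stats() {
  awk '
    /^>/ { contigs++; next }          # header line: count one contig
    {
      seq = toupper($0)               # treat soft-masked bases like the rest
      total += length(seq)            # every base on this sequence line
      gc += gsub(/[GC]/, "", seq)     # gsub returns how many G/C it replaced
    }
    END {
      pct = (total > 0) ? 100 * gc / total : 0
      printf "contigs=%d bases=%d gc_percent=%.1f\n", contigs, total, pct
    }
  ' "$1"
}

stats=$(fasta_stats "$fasta")
echo "$stats"
rm -f "$fasta"
```

On the lab data, the same function could be pointed at the Prokka contig file, for example `fasta_stats bbqs433/*.fna`, assuming the output directory holds a single fna file.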