# BI278 Lab 03
### Gene prediction with Prokka
Olivia Schirle
09/27/2022
## 1. Annotate a genome with Prokka
```
mkdir lab_03 # Make a new folder
cp /courses/bi278/Course_Materials/lab_03/P.bonniea_bbqs395.nanopore.fasta ./lab_03 # Copy one of the genomes into the folder
```
```
ls lab_03/P.bonniea_bbqs395.nanopore.fasta # Check path to genome
```
Run Prokka annotation:
```
cd lab_03
prokka --force --outdir ./bbqs395 --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk --locustag BB395 --genus Paraburkholderia --species bonniea --strain bbqs395 P.bonniea_bbqs395.nanopore.fasta
```
## Q1. What information is in each of the different output files that Prokka generates?
- .faa: protein FASTA file containing the translated coding sequences (CDS)
- .ffn: nucleotide FASTA file containing all the predicted transcripts (CDS, rRNA, tRNA, tmRNA, and misc_RNA)
- .fna: nucleotide FASTA file containing the input contig sequences
- .gff: the master annotation in GFF3 format, with both sequences and annotations
- .txt: statistics about the annotated features found
- .tsv: tab-separated file of all features (locus_tag, ftype, len_bp, gene, EC_number, COG, product)
## Q2. Then, explore and use the best file from above to answer each question:
### How many CDS do you have?
```
head bbqs395/PROKKA_09272022.txt
```
There are 3427 CDS.
### How many tRNA do you have?
```
head bbqs395/PROKKA_09272022.txt
```
There are 56 tRNA.
### How many 'ribosomal proteins' do you have?
```
grep "ribosomal protein" bbqs395/PROKKA_09272022.faa | wc -l
```
There are 53 ribosomal proteins.
### How many 'hypothetical proteins' do you have?
```
grep "hypothetical" bbqs395/PROKKA_09272022.faa | wc -l
```
There are 1461 hypothetical proteins.
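As a cross-check on the counts above, the feature types could also be tallied from the .tsv file, whose second column (ftype) names the type of each feature. The following is only a sketch: `count_features` is a hypothetical helper name, and the demo rows are made-up values rather than real output from this lab.

```shell
# count_features FILE: tally rows by feature type (column 2, ftype),
# skipping the header row of a Prokka-style .tsv file.
count_features() {
  awk -F'\t' 'NR > 1 { n[$2]++ } END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Tiny made-up .tsv (illustrative rows only, not real lab output)
demo_tsv=$(mktemp)
printf 'locus_tag\tftype\tlen_bp\tgene\tEC_number\tCOG\tproduct\n' > "$demo_tsv"
printf 'BB395_00001\tCDS\t300\t\t\t\thypothetical protein\n' >> "$demo_tsv"
printf 'BB395_00002\ttRNA\t76\t\t\t\ttRNA-Ala\n' >> "$demo_tsv"
printf 'BB395_00003\tCDS\t450\trpsL\t\t\t30S ribosomal protein S12\n' >> "$demo_tsv"

count_features "$demo_tsv"
```

On the real annotation, the same function would be called as `count_features bbqs395/PROKKA_09272022.tsv`, and the CDS and tRNA rows should agree with the .txt summary.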
## Q3. Which file is the most useful for finding genomic locations (chromosome and base address) of any type of annotated feature (genes (CDS), tRNA, etc)?
The .gff file is the most useful for finding genomic locations because each feature line records the contig, the start and end coordinates, and the strand alongside the annotation. The .tsv file is also very helpful because it presents a lot of information about all features in a more organized manner, but it does not list coordinates.
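To illustrate how locations can be pulled out of a GFF3 file, here is a minimal sketch that prints the contig, start, end, strand, and attributes for every feature of one type. It assumes the standard nine-column GFF3 layout and that the sequence section begins at a `##FASTA` line; `feature_locations` is a hypothetical helper name, and the demo file contains made-up values.

```shell
# feature_locations FILE TYPE: print contig, start, end, strand, and attributes
# for every feature of the given type in a GFF3 file.
feature_locations() {
  awk -F'\t' -v OFS='\t' -v type="$2" '
    /^##FASTA/ { exit }   # stop where the appended sequences begin
    /^#/       { next }   # skip header and comment lines
    $3 == type { print $1, $4, $5, $7, $9 }
  ' "$1"
}

# Tiny made-up GFF3 file (illustrative values only, not real lab output)
demo_gff=$(mktemp)
printf '##gff-version 3\n' > "$demo_gff"
printf 'contig_1\tProdigal\tCDS\t100\t400\t.\t+\t0\tID=BB395_00001;product=hypothetical protein\n' >> "$demo_gff"
printf 'contig_1\tAragorn\ttRNA\t500\t575\t.\t-\t.\tID=BB395_00002;product=tRNA-Ala\n' >> "$demo_gff"
printf '##FASTA\n>contig_1\nACGT\n' >> "$demo_gff"

feature_locations "$demo_gff" tRNA
```

On the real annotation, the call would be `feature_locations bbqs395/PROKKA_09272022.gff tRNA` (or `CDS`, `rRNA`, and so on).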
## Q4. Compare your metrics to that of the reference genomes for these two species. Are your results similar? Does your answer impact your confidence in the sequencing quality of your new draft genome?
The reference genome has 3,550 CDS, 56 tRNA, 59 ribosomal proteins, and 515 hypothetical proteins. My counts of CDS, tRNA, and ribosomal proteins were similar to the reference: the number of tRNAs was identical (56), and the numbers of CDS (3,427 vs. 3,550) and ribosomal proteins (53 vs. 59) were only slightly different. My count of hypothetical proteins was very different: I found 1,461, but the reference genome has only 515. This discrepancy slightly lowers my confidence, but because the CDS, tRNA, and ribosomal protein counts are so close to the reference, I am still fairly confident in the quality of the new draft genome.
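The comparison can be made more concrete by expressing each draft count as a percent difference from the reference. This is a small sketch using the counts quoted in the answer above; `pct_diff` is a hypothetical helper name.

```shell
# pct_diff: read "feature draft reference" lines on standard input and
# print the signed percent difference of the draft relative to the reference.
pct_diff() {
  awk '{ printf "%s\t%+.1f%%\n", $1, ($2 - $3) / $3 * 100 }'
}

# Counts copied from the answers in this report (draft, then reference)
pct_diff <<'EOF'
CDS 3427 3550
tRNA 56 56
ribosomal 53 59
hypothetical 1461 515
EOF
```

With these inputs the differences come out to about -3.5% for CDS, 0% for tRNA, -10.2% for ribosomal proteins, and +183.7% for hypothetical proteins, which matches the pattern described above.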