# BI278 Lab 3 – Gene prediction with Prokka
## Exercise 1 - Annotate a genome with Prokka

Online manual: https://github.com/tseemann/prokka

### 1.1 - Annotate a genome

1) Make a new folder for lab_03 in your home directory and copy one of the genomes (*.fasta) from this week's folder.
```
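cd ~                 # start in the home directory
mkdir -p lab_03      # -p: no error if the folder already exists
cp /courses/bi278/Course_Materials/lab_03/*.fasta lab_03/   # copies every .fasta in this week's folder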
```
3) Run Prokka annotation
```
# Template (replace PATH with your own folder):
prokka --force --outdir PATH/bbqs395 --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk --locustag BB395 --genus Paraburkholderia --species bonniea --strain bbqs395 PATH/P.bonniea_bbqs395.nanopore.fasta
# Command as run from the home directory:
prokka --force --outdir lab_03/bbqs395 --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk --locustag BB395 --genus Paraburkholderia --species bonniea --strain bbqs395 lab_03/P.bonniea_bbqs395.nanopore.fasta
```
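The flags in the command above can be easier to check when the paths are held in variables and the command is printed before it is run. The sketch below is a dry run only: the variable names (`outdir`, `genome`, `ref`) are made up for illustration, the reference file name is assumed to contain no space, and `echo` prints the command rather than executing Prokka.

```shell
# Hypothetical variable names; the paths are the ones used in this lab.
outdir="lab_03/bbqs395"
genome="lab_03/P.bonniea_bbqs395.nanopore.fasta"
ref="/courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk"

# --force     overwrite the output folder if it already exists
# --outdir    folder that receives all output files
# --proteins  trusted annotated proteins to search first
# --locustag  prefix for every locus_tag
# --genus, --species, --strain  organism details written into the output
dry_run=$(echo prokka --force --outdir "$outdir" --proteins "$ref" \
  --locustag BB395 --genus Paraburkholderia --species bonniea \
  --strain bbqs395 "$genome")

# Print the assembled command; remove the echo above to actually run Prokka.
echo "$dry_run"
```

Reading the printed line once before running the real command is a cheap way to catch a mistyped path.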
**Q1. What information is in each of the different output files that Prokka generates? (focus on
these: .faa, .ffn, .fna, .gff, .txt, .tsv)**
- .faa: protein FASTA file of the translated CDS (coding sequences).
- .ffn: nucleotide FASTA file of all predicted transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA).
- .fna: nucleotide FASTA file of the input contig sequences.
- .gff: the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
- .txt: summary statistics for the annotated features found.
- .tsv: tab-separated table of all features, with the columns locus_tag, ftype, len_bp, gene, EC_number, COG, product.
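
As a quick illustration of how the .tsv table can be explored, the sketch below tallies features by type. The file is a made-up three-row sample that follows the column list above (the locus tags, lengths, and products are invented for illustration); on real output the same pipeline would be pointed at the PROKKA_*.tsv file.

```shell
tmp=$(mktemp -d)

# Hypothetical sample table using the column order listed above.
printf 'locus_tag\tftype\tlen_bp\tgene\tEC_number\tCOG\tproduct\n'  > "$tmp/sample.tsv"
printf 'BB395_00001\tCDS\t300\t\t\t\thypothetical protein\n'       >> "$tmp/sample.tsv"
printf 'BB395_00002\tCDS\t825\t\t\t\t50S ribosomal protein L2\n'   >> "$tmp/sample.tsv"
printf 'BB395_00003\ttRNA\t76\t\t\t\ttRNA-Ala\n'                   >> "$tmp/sample.tsv"

# Skip the header row, keep column 2 (ftype), and count each distinct value.
tail -n +2 "$tmp/sample.tsv" | cut -f2 | sort | uniq -c

# The same count for one feature type, stored in a variable.
n_cds=$(awk -F'\t' 'NR > 1 && $2 == "CDS"' "$tmp/sample.tsv" | wc -l | tr -d ' ')
echo "CDS rows: $n_cds"

rm -r "$tmp"
```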
**Q2. Then, explore and use the best file from above to answer each question:
How many CDS do you have?**
There are 3427 CDS.
**How many tRNA do you have?**
There are 56 tRNAs.
```
# for the two questions above:
cat PROKKA_09272022.txt
```
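Rather than reading the whole summary, the two lines of interest can be pulled out with `grep`. This sketch assumes the .txt file uses a `feature: count` layout, one feature per line; the sample file holds only the two counts reported above plus an organism line, and its name is made up.

```shell
tmp=$(mktemp -d)

# Hypothetical stand-in for the Prokka .txt summary (assumed "name: value" layout).
cat > "$tmp/sample.txt" <<'EOF'
organism: Paraburkholderia bonniea bbqs395
CDS: 3427
tRNA: 56
EOF

# Anchoring with ^ keeps "tRNA:" from also matching a "tmRNA:" line.
grep -E '^(CDS|tRNA):' "$tmp/sample.txt"

# Keep just the numbers for later comparison.
n_cds=$(grep '^CDS:' "$tmp/sample.txt" | cut -d' ' -f2)
n_trna=$(grep '^tRNA:' "$tmp/sample.txt" | cut -d' ' -f2)
echo "CDS=$n_cds tRNA=$n_trna"

rm -r "$tmp"
```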
**How many 'ribosomal proteins' do you have?**
```
grep ribosomal PROKKA_09272022.faa
```
There are 3294 ribosomal proteins.
**How many ‘hypothetical proteins’ do you have?**
**Hint: try grep, then grep -c or grep | wc -l for the latter two.**
^QUESTION FOR PROF NOH: I don't understand this hint.
```
grep hypothetical PROKKA_09272022.faa
```
There are 3496 hypothetical proteins.
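
One reading of the hint: plain `grep` prints every matching line, `grep -c` prints only the number of matching lines, and piping `grep` into `wc -l` counts the printed lines, which gives the same number. The sketch below shows all three on a made-up three-sequence .faa file (the headers and sequences are invented for illustration). Because the .faa file has one header line per protein, a count obtained this way should never be larger than the number of sequences in the file.

```shell
tmp=$(mktemp -d)

# Hypothetical protein FASTA with three records.
cat > "$tmp/sample.faa" <<'EOF'
>BB395_00001 hypothetical protein
MKT
>BB395_00002 50S ribosomal protein L2
MAV
>BB395_00003 hypothetical protein
MSE
EOF

# 1) grep alone prints the matching header lines.
grep hypothetical "$tmp/sample.faa"

# 2) grep -c prints how many lines matched.
n_c=$(grep -c hypothetical "$tmp/sample.faa")

# 3) grep | wc -l counts the lines that grep printed (tr strips padding spaces).
n_wc=$(grep hypothetical "$tmp/sample.faa" | wc -l | tr -d ' ')

# Total number of sequences, for a sanity check on the counts above.
n_seqs=$(grep -c '^>' "$tmp/sample.faa")

echo "hypothetical: $n_c (grep -c), $n_wc (wc -l); sequences in file: $n_seqs"

rm -r "$tmp"
```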
**Q3. Which file is the most useful for finding genomic locations (chromosome and base address) of
any type of annotated feature (genes (CDS), tRNA, etc)?**
I think the .gff file is the most useful for finding genomic locations (chromosome and base address) of any annotated feature, because each feature line records the contig (sequence ID), the start and end coordinates, and the strand alongside the annotation. I would give "second place" to the .tsv file: it lists every feature by locus tag, but it has no coordinates.
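
To show what looking up a location in a GFF3 file involves, here is a minimal sketch with `awk`. GFF3 feature lines have nine tab-separated columns (sequence ID, source, type, start, end, score, strand, phase, attributes). The contig name, coordinates, and source labels in the sample are invented for illustration.

```shell
tmp=$(mktemp -d)

# Hypothetical two-feature GFF3 file; fields are separated by tabs.
printf '##gff-version 3\n' > "$tmp/sample.gff"
printf 'contig_1\tProdigal\tCDS\t101\t400\t.\t+\t0\tID=BB395_00001;product=hypothetical protein\n' >> "$tmp/sample.gff"
printf 'contig_1\tAragorn\ttRNA\t900\t975\t.\t-\t.\tID=BB395_00002;product=tRNA-Ala\n' >> "$tmp/sample.gff"

# Print contig, start, end, and strand for every feature of a given type.
trna_loc=$(awk -F'\t' '$3 == "tRNA" { print $1, $4, $5, $7 }' "$tmp/sample.gff")
echo "$trna_loc"

# The same lookup for a single locus tag found in column 9.
cds_loc=$(awk -F'\t' '$9 ~ /ID=BB395_00001;/ { print $1, $4, $5, $7 }' "$tmp/sample.gff")
echo "$cds_loc"

rm -r "$tmp"
```

Comment lines and any sequence lines at the end of the file contain no tabs, so the column tests skip them without extra filtering.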
**Q4. Compare your metrics to that of the reference genomes for these two species. Are your results
similar? Does your answer impact your confidence in the sequencing quality of your new draft
genome?
P. bonniea:
https://www.ncbi.nlm.nih.gov/genome/?term=paraburkholderia+bonniea
P. hayleyella:
https://www.ncbi.nlm.nih.gov/genome/?term=paraburkholderia+hayleyella**
Comparing my metrics to those of the reference genomes for these two species, I saw that they were similar. The P. bonniea reference has 3550 genes and 56 tRNAs, while the P. hayleyella reference has 3607 genes and 57 tRNAs. My annotation of the P. bonniea draft found 3427 CDS and 56 tRNAs, which is closer to the P. bonniea reference than to P. hayleyella. This increases my confidence in the sequencing quality of my new draft genome.
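
As a rough way to put a number on "similar", the sketch below computes how far the draft count falls below each reference count, using only the figures quoted above. It is an approximation under one stated assumption: the reference totals are gene counts while the draft total is a CDS count, so the two are not strictly the same quantity.

```shell
draft_cds=3427        # CDS in the draft annotation
ref_bonniea=3550      # genes in the P. bonniea reference
ref_hayleyella=3607   # genes in the P. hayleyella reference

# Percent by which the draft falls short of each reference, to one decimal place.
pct_bonniea=$(awk -v d="$draft_cds" -v r="$ref_bonniea" 'BEGIN { printf "%.1f", (r - d) / r * 100 }')
pct_hayleyella=$(awk -v d="$draft_cds" -v r="$ref_hayleyella" 'BEGIN { printf "%.1f", (r - d) / r * 100 }')

echo "below P. bonniea reference:    $pct_bonniea%"
echo "below P. hayleyella reference: $pct_hayleyella%"
```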
**Bonus Q. Hopefully you are wondering, how does Prokka find genes? You can read more about it here: https://doi.org/10.1093/bioinformatics/btu153**