# Lab 03 - Gene Prediction with Prokka
## Exercise 3. Annotate a Genome with Prokka
### 3.1 Annotate a Genome
1. Make a new folder for lab_03 in your home directory and copy one of the genomes (.fasta) from this week's folder.
```
# ssh into bi278
ssh klpast23@bi278
# make the directory in your home folder
mkdir -p ~/lab_03
# copy over a genome (run this from this week's course folder)
cp P.bonniea_bbqs395.nanopore.fasta ~/lab_03/
```
2. Know your new genome's path by checking that you can find it from your current location with ls and autocomplete.
Path: ~/lab_03/P.bonniea_bbqs395.nanopore.fasta
3. Run Prokka annotation
```
prokka --force --outdir ~/lab_03/bbqs395 \
  --proteins /courses/bi278/Course_Materials/lab_02/Burkholderia_pseudomallei_K96243_132.gbk \
  --locustag BB395 --genus Paraburkholderia --species bonniea --strain bbqs395 \
  ~/lab_03/P.bonniea_bbqs395.nanopore.fasta
```
**Q1. What are the different output files that Prokka generates?**
- .faa: Protein FASTA file of the translated CDS sequences
- .ffn: Nucleotide FASTA file of all the predicted transcripts
- .fna: Nucleotide FASTA file of the input contig sequences
- .gff: The master annotation in GFF3 format, containing both sequences and annotations
- .txt: Statistics relating to the annotated features
- .tsv: Tab-separated file of all features: locus_tag, ftype, len_bp, gene, EC_number, COG, product
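
To see which of these files a run actually produced, a small helper can loop over the extensions listed above. This is a minimal sketch: `list_prokka_outputs` is a hypothetical helper name, and the directory argument is assumed to be the `--outdir` used in step 3.

```shell
# Hypothetical helper: report which of the output types listed above
# are present in a Prokka output directory.
list_prokka_outputs() {
    local dir="$1" ext f
    for ext in faa ffn fna gff txt tsv; do
        # The glob stays unexpanded if nothing matches, so test with -e.
        for f in "$dir"/*."$ext"; do
            [ -e "$f" ] && echo "$ext: $(basename "$f")"
        done
    done
    return 0
}

# Example call (path assumed from step 3):
# list_prokka_outputs ~/lab_03/bbqs395
```
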
**Q2. Use the best file to answer each question.**
- How many CDS do you have?
```
grep -c ">" PROKKA_09272022.faa
#returns 3427
```
- How many of these are "hypothetical proteins"?
```
grep -c "hypothetical protein" PROKKA_09272022.faa
#returns 1461
```
- How many "ribosomal proteins" do you have?
```
grep -c "ribosomal" PROKKA_09272022.faa
#returns 54
```
- How many tRNA do you have?
```
head PROKKA_09272022.txt
#returns:
tRNA: 56
```
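
These counts can also be cross-checked in one pass from the .tsv file, whose second column is the feature type (ftype) according to the column list in Q1. A minimal sketch, assuming a tab-separated file with a single header line; `count_ftypes` is a hypothetical helper name, not part of Prokka.

```shell
# Hypothetical helper: tally features by type (column 2, ftype) in a Prokka .tsv.
count_ftypes() {
    # NR > 1 skips the header row; n[] keeps one counter per ftype.
    awk -F'\t' 'NR > 1 { n[$2]++ } END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Example call (file name assumed to match the run above):
# count_ftypes PROKKA_09272022.tsv
```

If the run is consistent, the CDS row should agree with the `grep -c ">"` count on the .faa file, and the tRNA row with the value in the .txt file.
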
**Q3. Which file is the most useful for finding the genomic locations (chromosome and base address) of any type of annotated feature (genes (CDS), tRNA, etc.)?**
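The .gff file. Each feature line gives the contig (sequence ID), the feature type (CDS, tRNA, etc.), the start and end coordinates, and the strand. The .tsv file lists the locus_tag, the ftype, the length in base pairs, the gene, the EC_number, the COG, and the product, but not where each feature sits in the genome.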
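
Coordinates for a given feature type can be pulled from the .gff file, where each GFF3 feature line carries the sequence ID, start, end, and strand. A minimal sketch, assuming the standard nine-column GFF3 layout; `feature_coords` is a hypothetical helper name and the .gff file name is assumed to follow the other outputs.

```shell
# Hypothetical helper: print contig, start, end, strand for one feature type.
# GFF3 feature lines have nine tab-separated columns:
# seqid, source, type, start, end, score, strand, phase, attributes.
feature_coords() {
    local gff="$1" type="$2"
    awk -F'\t' -v t="$type" '
        /^##FASTA/ { exit }   # stop at the embedded sequence section
        /^#/       { next }   # skip header and comment lines
        $3 == t    { print $1 "\t" $4 "\t" $5 "\t" $7 }
    ' "$gff"
}

# Example call (file name assumed):
# feature_coords PROKKA_09272022.gff tRNA | head
```
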
**Q4. Now compare your metrics to those of your reference genome.**
In the genomic data for Paraburkholderia bonniea, the protein count is 3,390 and the tRNA count is 56. The data also indicate 550 hypothetical proteins.
| Feature | Genomic Data | Prokka Data |
| --------------------- | ------------ | ----------- |
| tRNA | 56 | 56 |
| Ribosomal Proteins | 60 | 54 |
| Hypothetical Proteins | 515 | 1461 |
| CDS | 3,390 | 3,427 |
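
The raw counts in the table are easier to compare as a share of each annotation's CDS total. A minimal sketch of that arithmetic; `pct` is a hypothetical helper, and the inputs are the values from the table above.

```shell
# Hypothetical helper: express part/total as a percentage with one decimal place.
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

# Hypothetical proteins as a share of CDS, using the table values:
pct 1461 3427   # Prokka annotation
pct 515 3390    # reference annotation
```

With these inputs the two calls print 42.6 and 15.2, so Prokka labels a much larger share of CDS as hypothetical than the reference annotation does.
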
**Bonus Q: How does Prokka find genes?**
Prokka relies on external feature prediction tools (for example, Prodigal for coding sequences and Aragorn for tRNA) to identify the coordinates of genomic features within contigs. To predict what each gene codes for, it uses the traditional approach: compare the sequence with a large database of known sequences and transfer the annotation of the best significant match. Prokka applies this method hierarchically, starting with a smaller database, moving to larger databases, and finally to curated models of protein families.
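
The hierarchical idea can be illustrated with a toy lookup: search an ordered list of annotation tables, keep the first hit, and fall back to "hypothetical protein" when nothing matches. This is only a sketch of the search order, not how Prokka is implemented (Prokka runs sequence similarity searches, not ID lookups); `annotate_id` and the two-column `id<TAB>product` table format are hypothetical.

```shell
# Toy illustration: return the product for an ID from the first table that has it.
# Tables are passed in priority order (small trusted set first, larger sets after).
annotate_id() {
    local id="$1" db hit
    shift
    for db in "$@"; do
        # Print the product of the first exact match on column 1, then stop reading.
        hit=$(awk -F'\t' -v id="$id" '$1 == id { print $2; exit }' "$db")
        if [ -n "$hit" ]; then
            echo "$hit"
            return 0
        fi
    done
    echo "hypothetical protein"   # no table had a match
}

# Example call with hypothetical table files:
# annotate_id geneA trusted.tsv genus.tsv general.tsv
```

Because the loop returns on the first hit, an entry in an earlier table always wins over a conflicting entry in a later, larger one.
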