# Bi278 Lab 3 v2 gene prediction with `prokka`
### By Lee Ferenc 9/26/2023
## Exercise 1. Annotate a Bacterial Genome
[Here is the manual for prokka](https://github.com/tseemann/prokka#invoking-prokka).
Run Prokka with the Burkholderia pseudomallei GenBank file passed to `--proteins`, so that annotations from a closely related genome are used to improve prediction: `prokka --force --outdir PATH/STRAIN --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk --locustag STRAIN --genus Paraburkholderia --species bonniea --strain STRAIN PATH/FILENAME`
#### Example using P.hayleyella
```
readlink -f P.hayleyella_bhqs69.pacbio.fasta
prokka --force --outdir /home2/enfere24/lab_03 --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk --locustag BHQS69 --genus Paraburkholderia --species bonniea --strain BHQS69 /home2/enfere24/lab_03/P.hayleyella_bhqs69.pacbio.fasta
```
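Notice that `--outdir` takes the path of the output directory, while the last argument is the path of the input FASTA file. `readlink -f` prints the absolute path of the file so you can paste it into the command.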
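To keep the long command readable, one option is to put the pieces in shell variables and print the assembled command before running it. This is only a sketch: the variable names (`STRAIN`, `WORKDIR`, `FASTA`, `REF`, `PROKKA_CMD`) are my own, and the values are copied from the example above.

```shell
# Hypothetical variable names; the values come from the example above.
STRAIN="BHQS69"
WORKDIR="/home2/enfere24/lab_03"
FASTA="$WORKDIR/P.hayleyella_bhqs69.pacbio.fasta"
REF="/courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk"

# Assemble the command as text first so it can be checked before running.
PROKKA_CMD="prokka --force --outdir $WORKDIR --proteins $REF --locustag $STRAIN --genus Paraburkholderia --species bonniea --strain $STRAIN $FASTA"

# Dry run: print the command, then copy and paste it into the terminal to run it.
echo "$PROKKA_CMD"
```

Changing `STRAIN` and `FASTA` is then enough to reuse the same command for another genome.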
### Question 1: Different output files and what they contain (I added .faa from my last lab)
* .gff: The master annotation in .gff3 "contains both sequences and annotations. It can be viewed directly in Artemis or IGV."
* .gbk: "Standard Genbank file derived from .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence." Contains the sequences and annotations in GenBank format.
* .faa: "Protein FASTA file of the translated CDS sequences."
* .ffn: "Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)"
* .fna: "Nucleotide FASTA file of the input contig sequences."
* .txt: Text file with statistics relating to the annotated features
* .tsv: "tab-separated file of all features: locus_tag, ftype, len_bp, gene, EC_number, COG, product"
### Question 2. Respond w/ the command and the answer for each of the following questions:
#### How many CDS (coding sequences) do you have?
3584
`head PROKKA_09262023.txt`
#### How many rRNA and tRNA does it have?
12 rRNA and 63 tRNA
`head PROKKA_09262023.txt`
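As a cross-check on the `.txt` summary, the feature types could also be tallied from the `.tsv` file, since its second column is `ftype`. The sketch below runs on a tiny made-up table (not real Prokka output) that follows the column order listed in Question 1; the `DEMO_` locus tags and the file name `demo.tsv` are assumptions for illustration.

```shell
# Work in a throwaway directory so nothing real is overwritten.
tmp=$(mktemp -d)

# Made-up demo table: header row plus four features.
printf 'locus_tag\tftype\tlen_bp\tgene\tEC_number\tCOG\tproduct\n' > "$tmp/demo.tsv"
printf 'DEMO_00001\tCDS\t300\t\t\t\thypothetical protein\n' >> "$tmp/demo.tsv"
printf 'DEMO_00002\tCDS\t450\t\t\t\thypothetical protein\n' >> "$tmp/demo.tsv"
printf 'DEMO_00003\ttRNA\t76\t\t\t\ttRNA-Ala\n' >> "$tmp/demo.tsv"
printf 'DEMO_00004\trRNA\t1500\t\t\t\t16S ribosomal RNA\n' >> "$tmp/demo.tsv"

# Skip the header (NR > 1) and count how often each ftype appears in column 2.
counts=$(awk -F'\t' 'NR > 1 {n[$2]++} END {for (t in n) print t, n[t]}' "$tmp/demo.tsv" | sort)
echo "$counts"

rm -rf "$tmp"
```

On the demo table this reports 2 CDS, 1 rRNA and 1 tRNA. Pointing the `awk` line at the real `.tsv` file should give totals comparable to the `.txt` summary, though I have not checked that the two always agree.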
#### How many ‘hypothetical proteins’ does it have?
1214
`grep -c "hypothetical protein" PROKKA_09262023.faa`
#### How many genes associated with toxin does it have?
7
`grep -c "toxin" PROKKA_09262023.faa`
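One caveat with `grep -c "toxin"`: it is case-sensitive and matches substrings, so it counts "antitoxin" but misses "Toxin". The sketch below shows the difference on a small made-up `.faa` file (the `DEMO_` headers and sequences are invented for illustration, not taken from the real annotation).

```shell
# Work in a throwaway directory so nothing real is overwritten.
tmp=$(mktemp -d)

# Made-up protein FASTA with four records.
cat > "$tmp/demo.faa" <<'EOF'
>DEMO_00001 hypothetical protein
MKT
>DEMO_00002 Toxin HigB
MAA
>DEMO_00003 Antitoxin HigA
MCC
>DEMO_00004 putative toxin component
MDD
EOF

# Case-sensitive: matches "Antitoxin" and "putative toxin", misses "Toxin".
case_sensitive=$(grep -c "toxin" "$tmp/demo.faa")

# Case-insensitive: matches all three toxin-related headers.
case_insensitive=$(grep -ci "toxin" "$tmp/demo.faa")

# Case-insensitive, but leaving out antitoxins.
no_antitoxin=$(grep -i "toxin" "$tmp/demo.faa" | grep -vci "antitoxin")

echo "$case_sensitive $case_insensitive $no_antitoxin"

rm -rf "$tmp"
```

On the demo file the three counts are 2, 3 and 2. Which count is "right" depends on whether antitoxins should be included in genes associated with toxin.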
### Question 3. Which file can you use for finding genomic locations (chromosome and base pair position) of any type of annotated feature?
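I believe .gff is the best. Each feature line in a .gff file records the sequence (contig or chromosome) it sits on, the feature type, and its start and end base pair positions, so it can be used to look up the location of any annotated feature.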
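As a rough illustration of pulling locations out of a GFF3 file, the sketch below builds a tiny made-up file and prints the sequence ID, start, end and strand of every tRNA. The column order (sequence ID, source, type, start, end, score, strand, phase, attributes) is the standard GFF3 layout; the file name `demo.gff`, the `DEMO_` IDs and the coordinates are invented for the example.

```shell
# Work in a throwaway directory so nothing real is overwritten.
tmp=$(mktemp -d)

# Made-up GFF3: two features, followed by the sequence section.
printf '##gff-version 3\n' > "$tmp/demo.gff"
printf 'contig_1\tdemo\tCDS\t100\t400\t.\t+\t0\tID=DEMO_00001;product=hypothetical protein\n' >> "$tmp/demo.gff"
printf 'contig_1\tdemo\ttRNA\t900\t975\t.\t-\t.\tID=DEMO_00002;product=tRNA-Ala\n' >> "$tmp/demo.gff"
printf '##FASTA\n>contig_1\nACGT\n' >> "$tmp/demo.gff"

# Stop at the sequence section, skip comment lines, and keep rows whose
# type (column 3) is tRNA; print sequence ID, start, end, and strand.
trna_hits=$(awk -F'\t' '/^##FASTA/ {exit} !/^#/ && $3 == "tRNA" {print $1, $4, $5, $7}' "$tmp/demo.gff")
echo "$trna_hits"

rm -rf "$tmp"
```

Changing `"tRNA"` to another feature type such as `"CDS"` should work the same way on the real .gff file.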