# BI278 Lab #3 – Gene prediction with Prokka
1. Make a new lab folder and copy over one of the genomes from the course materials folder
cp /courses/bi278/Course_Materials/lab_03/*.fasta lab_03
I accidentally copied over all four genomes, so I used the rm command to remove the others and keep only the P.bonniea_bbqs395.nanopore.fasta file.
2. Confirm the genome's path using the ls command (the file is in lab_03)
3. Run PROKKA annotation
prokka --force --outdir lab_03/bbqs395 --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk --locustag BB395 --genus Paraburkholderia --species bonniea --strain bbqs395 lab_03/P.bonniea_bbqs395.nanopore.fasta
**Note:** The command that I copied in didn't work. I tried multiple variations and also used a classmate's exact command, and it still didn't work. I answered the first, third, and fourth questions on my own, but worked with Saa to complete the second.
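When a long command like the one in step 3 fails, one way to narrow things down is to check that every input path exists before calling Prokka. The sketch below is only a diagnostic aid under that assumption; `check_paths` is a hypothetical helper (not part of Prokka), and it does not claim to identify why this particular run failed. Note that a path containing a space is split into two arguments by the shell unless it is quoted.

```shell
# check_paths PATH...: report whether each path exists and is readable.
# Returns non-zero if any path is missing. Hypothetical helper for illustration.
check_paths() {
    status=0
    for p in "$@"; do
        if [ -r "$p" ]; then
            echo "ok       $p"
        else
            echo "missing  $p"
            status=1
        fi
    done
    return "$status"
}

# Demo with a temporary file and a path that does not exist.
demo_file=$(mktemp)
check_paths "$demo_file" "$demo_file.missing" || echo "fix the paths marked missing before running prokka"
rm -f "$demo_file"
```

In the lab, the same function could be given the `--proteins` file and the input FASTA from step 3, each wrapped in double quotes.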
QUESTIONS
Q1. What info is in each of the different output files that Prokka generates?
**.faa**– Protein FASTA file of the translated CDS sequences.
**.ffn**– Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)
**.fna**– Nucleotide FASTA file of the input contig sequences.
**.gff**– This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
**.txt**– Statistics relating to the annotated features found.
**.tsv**– Tab-separated file of all features: locus_tag, ftype, len_bp, gene, EC_number, COG, product
Q2.
cat PROKKA_09272002.txt
How many CDS? 3427 CDS
How many tRNA? 56 tRNAs
```
grep ribosomal PROKKA_09272002.faa
```
How many 'ribosomal proteins'? 3294
```
grep hypothetical PROKKA_09272002.faa
```
How many 'hypothetical proteins'? (hint: try grep, then grep -c or grep | wc -l for the latter two) 3496
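The hint mentions `grep -c`, which prints the number of matching lines instead of the lines themselves. As a minimal sketch of that counting step, the helper below counts only FASTA header lines (those starting with `>`) that contain a given phrase. `count_headers` is a hypothetical name, and the sample records are made up for illustration; they are not output from this lab's Prokka run. Restricting to header lines is a precaution so that only one match per protein is counted.

```shell
# count_headers PATTERN FILE: count FASTA header lines (starting with '>')
# whose description contains PATTERN. Hypothetical helper, not part of Prokka.
count_headers() {
    grep '^>' "$2" | grep -c -- "$1"
}

# Tiny made-up protein FASTA, for illustration only.
demo_faa=$(mktemp)
cat > "$demo_faa" <<'EOF'
>BB395_00001 hypothetical protein
MKT
>BB395_00002 50S ribosomal protein L2
MAV
>BB395_00003 hypothetical protein
MSE
EOF

count_headers 'ribosomal protein' "$demo_faa"      # prints 1
count_headers 'hypothetical protein' "$demo_faa"   # prints 2
rm -f "$demo_faa"
```

Using the full phrase (for example 'ribosomal protein' instead of 'ribosomal') narrows the match to product names that contain exactly that wording.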
Q3. Which file is the most useful for finding genomic locations (chromosome, base address) of any type of annotated feature?
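The file that is the most useful for finding the location of any type of annotated feature is the .gff file, because it contains both sequences and annotations (the 'master annotation'). The .tsv file also lists all of the features, but the columns listed in Q1 (locus_tag, ftype, len_bp, gene, EC_number, COG, product) do not include coordinates, so it is more useful for looking up what a feature is than where it is.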
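To illustrate how locations can be read out of a GFF3 file, here is a minimal sketch that prints the contig, start, end, and strand for every feature of one type. It relies on the standard GFF3 column order (column 1 sequence ID, column 3 feature type, columns 4 and 5 start and end, column 7 strand). `feature_locations` is a hypothetical helper, and the sample records are invented for illustration.

```shell
# feature_locations TYPE FILE: print contig, start, end, strand for each
# GFF3 feature whose type (column 3) equals TYPE. Hypothetical helper.
feature_locations() {
    awk -F'\t' -v OFS='\t' -v type="$1" '
        /^##FASTA/ { exit }      # stop before the embedded sequence section
        /^#/       { next }      # skip comment and directive lines
        $3 == type { print $1, $4, $5, $7 }
    ' "$2"
}

# Tiny made-up GFF3 file, for illustration only.
demo_gff=$(mktemp)
printf '##gff-version 3\n' > "$demo_gff"
printf 'contig_1\texample\tCDS\t101\t400\t.\t+\t0\tID=BB395_00001\n' >> "$demo_gff"
printf 'contig_1\texample\ttRNA\t900\t975\t.\t-\t.\tID=BB395_00002\n' >> "$demo_gff"
printf '##FASTA\n>contig_1\nACGT\n' >> "$demo_gff"

feature_locations tRNA "$demo_gff"
rm -f "$demo_gff"
```

The early exit at `##FASTA` matters because the .gff described in Q1 carries the sequences after the annotation lines.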
Q4. Compare metrics to that of the reference genomes for the two species. Are the results similar? Does your answer impact your confidence in the sequencing quality of your new draft genome?
The protein and tRNA counts are similar for the two reference genomes (3,390 and 56 for P. bonniea, 3,420 and 57 for P. hayleyella), which is expected because the two species are closely related. The metrics that I calculated using PROKKA were also very similar to the counts found in the database. Together, these similarities increase my confidence in the sequencing quality of the new draft genome.
P. bonniea:
https://www.ncbi.nlm.nih.gov/genome/?term=paraburkholderia+bonniea
P. hayleyella:
https://www.ncbi.nlm.nih.gov/genome/?term=paraburkholderia+hayleyella
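To put a number on "similar", the Q2 counts can be compared with the P. bonniea database counts as a percent difference. This is a rough sketch: `pct_diff` is a hypothetical helper, and it treats the Prokka CDS count and the database protein count as comparable, which is an assumption.

```shell
# pct_diff NEW REF: percent difference of NEW relative to REF, two decimals.
# Hypothetical helper for illustration.
pct_diff() {
    awk -v new="$1" -v ref="$2" 'BEGIN { printf "%.2f\n", (new - ref) / ref * 100 }'
}

pct_diff 3427 3390   # Prokka CDS count vs. database protein count: prints 1.09
pct_diff 56 56       # Prokka tRNA count vs. database tRNA count: prints 0.00
```

A difference of about one percent in CDS and none in tRNAs is consistent with the conclusion that the draft genome metrics match the reference closely.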