# BI278 Lab notebook 3
### Annotate a genome with Prokka
Online manual for Prokka - https://github.com/tseemann/prokka
`prokka` - information on prokka commands
`mkdir lab_03` - create a `lab_03` directory in the home folder
Copy `P.bonniea_bbqs395.nanopore.fasta` to the `lab_03` directory:
`cp /courses/bi278/Course_Materials/lab_03/P.bonniea_bbqs395.nanopore.fasta ./lab_03`
Run:
```
prokka --force --outdir lab_03/bbqs395 --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk --locustag BB395 --genus Paraburkholderia --species bonniea --strain bbqs395 lab_03/P.bonniea_bbqs395.nanopore.fasta
```
In this command:
* `--force` - Force overwriting existing output folder
* `--outdir` - specifies the output folder
* `--proteins` - FASTA or GBK file to use as 1st priority
* `--locustag` - Locus tag prefix (default: auto-generated)
* `--genus` - Genus name
* `--species` - Species name
* `--strain` - Strain name
The output files generated:
* `.faa` - Protein FASTA file of the translated CDS sequences
* `.ffn` - Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)
* `.fna` - Nucleotide FASTA file of the input contig sequences
* `.gff` - The master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
* `.txt` - Statistics relating to the annotated features found.
* `.tsv` - Tab-separated file of all features: locus_tag, ftype, len_bp, gene, EC_number, COG, product
`cd lab_03/bbqs395` - change into the output directory
#### Finding components
* Total CDS:
  * `grep -c ">" *.faa` = 3427
  * or `grep ">" *.faa | wc -l`
  * or `head *.txt`
* Total tRNA:
  * `head *.txt` = 56
  * or `grep "tRNA" *.ffn | wc -l` (may overcount, since any CDS whose product name mentions tRNA also matches)
* Ribosomal proteins:
  * `grep -c "ribosomal protein" *.tsv` = 53
  * or `grep "ribosomal protein" *.tsv | wc -l`
  * or `grep "ribosomal protein" *.faa | wc -l`
* Hypothetical proteins:
  * `grep "hypothetical protein" *.tsv | wc -l` = 1461
  * or `grep "hypothetical protein" *.faa | wc -l`
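
As a cross-check on the `grep` counts above, the feature types can also be tallied straight from the `ftype` column of the `.tsv` file. This is only a sketch: `count_ftypes` is a helper name made up for this notebook, and it assumes the column order listed above (locus_tag first, ftype second) with a single header row.

```shell
# Tally features by type using the ftype column (column 2) of a Prokka .tsv file.
# count_ftypes is a hypothetical helper, not part of Prokka.
count_ftypes() {
    # NR > 1 skips the header row; sort keeps the output order stable
    awk -F'\t' 'NR > 1 { n[$2]++ } END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Example, run inside lab_03/bbqs395:
#   count_ftypes *.tsv
```

Because this matches the feature type exactly instead of searching for text, a CDS with "tRNA" in its product name is counted as a CDS, not as a tRNA.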
##### Which file is the most useful for finding genomic locations (chromosome and base address) of any type of annotated feature (genes (CDS), tRNA, etc.)?
The `.gff` file is the most useful because each annotation line gives the contig name, the feature type, the start and end coordinates, and the strand, followed by attributes such as ID, locus_tag, gene, and product. It also contains the sequences themselves.
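
To pull the locations out of the `.gff` file, something like the following could work. It is a minimal sketch, assuming the standard nine-column GFF3 layout (contig in column 1, feature type in column 3, start and end in columns 4 and 5, strand in column 7, attributes in column 9) and a `##FASTA` line marking where the embedded sequences begin. `gff_locations` is a hypothetical helper name.

```shell
# Print contig, start, end, strand and attributes for one feature type in a GFF3 file.
# gff_locations is a hypothetical helper, not part of Prokka.
gff_locations() {
    # $1 = GFF file, $2 = feature type (e.g. CDS, tRNA)
    awk -F'\t' -v OFS='\t' -v type="$2" '
        /^##FASTA/ { exit }      # stop before the embedded sequences
        /^#/       { next }      # skip comment and directive lines
        $3 == type { print $1, $4, $5, $7, $9 }
    ' "$1"
}

# Example, run inside lab_03/bbqs395:
#   gff_locations *.gff tRNA | head
```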
##### Compare your metrics to that of the reference genome for *P. bonniea*. Are they similar and does it impact your confidence in the sequencing quality of your new draft genome?
https://www.ncbi.nlm.nih.gov/genome/?term=paraburkholderia+bonniea
The reference genome lists 56 tRNAs, which matches our value. The number of CDS is 3550, which is 123 more than our value of 3427, so not a huge difference. Since the values I found are close to those of the reference genome, I think the sequencing quality of the new draft genome is quite good.
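
The size of the CDS gap relative to the reference can be checked with a quick calculation. This is just a sketch, and `pct_diff` is a helper name made up here; the two counts are the ones quoted above.

```shell
# Percent by which the draft count falls short of the reference count.
# pct_diff is a hypothetical helper defined only for this notebook.
pct_diff() {
    # $1 = reference count, $2 = draft count
    awk -v ref="$1" -v draft="$2" 'BEGIN { printf "%.1f\n", 100 * (ref - draft) / ref }'
}

pct_diff 3550 3427   # CDS: reference vs. this annotation, prints 3.5
```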