# BI278 Lab notebook 3

### Annotate a genome with Prokka

Online manual for Prokka - https://github.com/tseemann/prokka

`prokka` - information on prokka commands

`mkdir lab_03` - create a lab_03 directory in the home folder

Copy P.bonniea_bbqs395.nanopore.fasta to the lab_03 directory:

`cp /courses/bi278/Course_Materials/lab_03/P.bonniea_bbqs395.nanopore.fasta ./lab_03`

Run:

```
prokka --force --outdir lab_03/bbqs395 --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk --locustag BB395 --genus Paraburkholderia --species bonniea --strain bbqs395 lab_03/P.bonniea_bbqs395.nanopore.fasta
```

In this command:

* `--force` - Force overwriting existing output folder
* `--outdir` - Specifies the output folder
* `--proteins` - FASTA or GBK file to use as 1st priority
* `--locustag` - Locus tag prefix [auto]
* `--genus` - Genus name
* `--species` - Species name
* `--strain` - Strain name

The output files generated:

* `.faa` - Protein FASTA file of the translated CDS sequences
* `.ffn` - Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)
* `.fna` - Nucleotide FASTA file of the input contig sequences
* `.gff` - The master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
* `.txt` - Statistics relating to the annotated features found.
* `.tsv` - Tab-separated file of all features: locus_tag, ftype, len_bp, gene, EC_number, COG, product

`cd lab_03/bbqs395` - change directory

#### Finding components

* Total CDS:
  * `grep ">" *.faa -c` = 3427
  * or `grep ">" *.faa | wc -l`
  * or `head *.txt`
* Total tRNA:
  * `head *.txt` = 56
  * or `grep "tRNA" *.ffn | wc -l` (this can overcount, because CDS headers whose product name contains "tRNA", such as tRNA ligases, also match)
* Ribosomal proteins:
  * `grep "ribosomal protein" *.tsv -c` = 53
  * or `grep "ribosomal protein" *.tsv | wc -l`
  * or `grep "ribosomal protein" *.faa | wc -l`
* Hypothetical proteins:
  * `grep "hypothetical protein" *.tsv | wc -l` = 1461
  * or `grep "hypothetical protein" *.faa | wc -l`

##### Which file is the most useful for finding genomic locations (chromosome and base address) of any type of annotated feature (genes (CDS), tRNA, etc.)?

The `.gff` file is the most useful. Each feature line in GFF3 gives the contig (sequence ID), the feature type, the start and end coordinates, and the strand, followed by attributes such as ID, locus_tag, gene name, and product. The file also contains the contig sequences themselves.

##### Compare your metrics to that of the reference genome for *P. bonniea*. Are they similar and does it impact your confidence in the sequencing quality of your new draft genome?

https://www.ncbi.nlm.nih.gov/genome/?term=paraburkholderia+bonniea

The reference genome lists 56 tRNA, which matches our value. The reference number of CDS is 3550, compared with our 3427, a difference of 123 (about 3.5%). Since the values I found are close to those of the reference genome, I think the sequencing quality of the new draft genome is quite good.
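
The counts under "Finding components" and the feature locations in the `.gff` file can also be pulled out with `awk` instead of separate `grep` calls. The sketch below is only one possible approach: the function names `count_ftypes` and `feature_locations` are made up for this notebook and are not part of Prokka. It assumes the `.tsv` has a header line with the feature type in column 2, and that the `.gff` follows the standard GFF3 column order (sequence ID, source, type, start, end, score, strand, phase, attributes) with the sequences after a `##FASTA` line.

```shell
# Hypothetical helper (not a Prokka command): tally features by type.
# Assumes a tab-separated file with a header line and the feature type in column 2.
count_ftypes() {
    awk -F'\t' 'NR > 1 { n[$2]++ }
                END { for (t in n) printf "%s\t%d\n", t, n[t] }' "$1" | sort
}

# Hypothetical helper (not a Prokka command): list locations of one feature type.
# Prints sequence ID, start, end, and strand for every GFF3 line whose type
# (column 3) equals the second argument. Stops reading at the ##FASTA section.
feature_locations() {
    awk -F'\t' -v OFS='\t' -v type="$2" '
        /^##FASTA/ { exit }
        !/^#/ && $3 == type { print $1, $4, $5, $7 }' "$1"
}
```

From inside `lab_03/bbqs395`, `count_ftypes *.tsv` should give one line per feature type (CDS, tRNA, and so on), and `feature_locations *.gff tRNA` should list the contig, start, end, and strand of each tRNA. Because the type is matched on its own column, product names that merely contain "tRNA" are not counted.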