# Lab 03 - Gene Prediction with Prokka

## Exercise 3. Annotate a Genome with Prokka

### 3.1 Annotate a Genome

1. Make a new folder for lab_03 in your home and copy one of the genomes (.fasta) from this week's folder.

```
# ssh into bi278
ssh klpast23@bi278
# make directory
mkdir lab_03
# copy over a file (run from this week's folder):
cp P.bonniea_bbqs395.nanopore.fasta ~/lab_03/
```

2. Know your new genome's PATH by checking that you can find it from your current location with `ls` and autocomplete.

   Path: `~/lab_03/`

3. Run the Prokka annotation.

```
prokka --force --outdir ~/lab_03/bbqs395 \
  --proteins /courses/bi278/Course_Materials/lab_02/Burkholderia_pseudomallei_K96243_132.gbk \
  --locustag BB395 --genus Paraburkholderia --species bonniea --strain bbqs395 \
  ~/lab_03/P.bonniea_bbqs395.nanopore.fasta
```

**Q1. What are the different output files that Prokka generates?**

- .faa: Protein FASTA file of the translated CDS sequences
- .ffn: Nucleotide FASTA file of all the predicted transcripts
- .fna: Nucleotide FASTA file of the input contig sequences
- .gff: The master annotation in GFF3 format, containing both sequences and annotations
- .txt: Statistics relating to the annotated features
- .tsv: Tab-separated file of all features: locus_tag, ftype, len_bp, gene, EC_number, COG, product

**Q2. Use the best file to answer each question.**

- How many CDS do you have?

```
grep -c ">" PROKKA_09272022.faa
#returns 3427
```

- How many of these are "hypothetical proteins"?

```
grep -c "hypothetical protein" PROKKA_09272022.faa
#returns 1461
```

- How many "ribosomal proteins" do you have?

```
grep -c "ribosomal" PROKKA_09272022.faa
#returns 54
```

- How many tRNA do you have?

```
head PROKKA_09272022.txt
#returns: tRNA: 56
```

**Q3. Which file is the most useful for finding the genomic locations (chromosome and base address) of any type of annotated features (genes (CDS), tRNA, etc.)?**

The .gff file. Each feature line gives the contig the feature sits on, its type (CDS, tRNA, etc.), and its start and end coordinates. The .tsv file returns the locus_tag, the ftype (including CDS and tRNA), the length in base pairs, the gene, the EC_number, the COG, and the product, but it does not include the coordinates.

**Q4. Now compare your metrics to those of your reference genome for these two species.**

In the genomic data for Paraburkholderia bonniea, the protein count is 3,390 and the tRNA count is 56. They indicate that there are 550 hypothetical proteins.

| Feature               | Genomic Data | Prokka Data |
| --------------------- | ------------ | ----------- |
| tRNA                  | 56           | 56          |
| Ribosomal Proteins    | 60           | 54          |
| Hypothetical Proteins | 515          | 1461        |
| CDS                   | 3,390        | 3,427       |

**Bonus Q: How does Prokka find genes?**

Prokka relies on external feature prediction tools to identify the coordinates of genomic features within contigs. The traditional way to predict what a gene codes for is to compare it with a large database of known sequences and transfer the annotation of the best significant match. Prokka uses this method in a hierarchical manner, starting with a smaller database, moving to larger databases, and finally to curated models of protein families.
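
As a rough sketch related to Q3: in a GFF3 file, each feature line carries the contig name in column 1, the feature type in column 3, the start and end coordinates in columns 4 and 5, and the strand in column 7, so locations can be pulled out with `awk`. The helper name `feature_locations` and the toy records below are made up for illustration; they are not part of the lab materials and not real Prokka output.

```shell
# Print contig, start, end, strand, and ID for one feature type in a GFF3 file.
# Usage: feature_locations FILE.gff TYPE   (TYPE is e.g. CDS or tRNA)
# NOTE: feature_locations is a hypothetical helper, not a Prokka command.
feature_locations() {
    awk -F'\t' -v type="$2" '
        /^##FASTA/ { exit }   # the contig sequences follow this line; stop here
        /^#/       { next }   # skip header and comment lines
        $3 == type {
            id = "NA"
            n = split($9, attrs, ";")          # column 9 holds key=value pairs
            for (i = 1; i <= n; i++)
                if (attrs[i] ~ /^ID=/) id = substr(attrs[i], 4)
            print $1 "\t" $4 "\t" $5 "\t" $7 "\t" id
        }
    ' "$1"
}

# Toy GFF3 records, invented for illustration only (not real Prokka output).
toy_gff=$(mktemp)
printf '%s\n' '##gff-version 3' > "$toy_gff"
printf 'contig_1\texample\tCDS\t100\t450\t.\t+\t0\tID=TOY_00001;product=hypothetical protein\n' >> "$toy_gff"
printf 'contig_1\texample\ttRNA\t600\t675\t.\t-\t.\tID=TOY_00002;product=tRNA-Ala\n' >> "$toy_gff"
printf '%s\n' '##FASTA' '>contig_1' 'ACGT' >> "$toy_gff"

# List the location of every tRNA in the toy file.
feature_locations "$toy_gff" tRNA
```

On the real annotation this could be called as `feature_locations ~/lab_03/bbqs395/PROKKA_09272022.gff CDS`, assuming the .gff file shares the `PROKKA_09272022` prefix of the .faa file used above.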