# bi278 lab3

Exercise 1. Annotate a genome with Prokka

1. Make a new folder for lab_03 in your home and copy one of the genomes (`*.fasta`) from this week’s folder

```
mkdir lab_03
cp /courses/bi278/Course_Materials/lab_03/P.bonniea_bbqs433.nanopore.fasta ./lab_03
```

2. Make sure you know your new genome’s PATH by checking whether you can find it from your current location with ls and autocomplete.

```
ls
```

3. Run Prokka annotation. We are using a very well-studied genome from the pathogenic Burkholderia pseudomallei as a closely related genome to improve gene prediction

```
prokka --force --outdir lab_03/bbqs433 --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk --locustag BB433 --genus Paraburkholderia --species bonniea --strain bbqs433 lab_03/P.bonniea_bbqs433.nanopore.fasta
```

## Q1

### .faa
Protein FASTA file of the translated CDS sequences.

### .ffn
Nucleotide FASTA file of all the predicted transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA). Each entry pairs a feature's nucleotide sequence with the name of its product.

### .fna
Nucleotide FASTA file of the input contig sequences.

### .gff
This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.

### .txt
It contains the name of the organism, the number of contigs, the number of bases, and the counts of CDS, CRISPR, rRNA, tRNA, and tmRNA.

### .tsv
Tab-separated file of all features: locus_tag, ftype, len_bp, gene, EC_number, COG, product

## Q2

#### How many CDS do you have?
I used the .txt file. The number of CDS is 3430.

#### How many tRNA do you have?
I used the .txt file. The number of tRNA is 56.

#### How many 'ribosomal proteins' do you have?

```
grep -c "ribosomal protein" PROKKA_09272022.tsv
```

I got 53.

#### How many ‘hypothetical proteins’ do you have?

```
grep -c "hypothetical protein" PROKKA_09272022.tsv
```

I got 1464.

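
The Q2 counts can also be cross-checked from the .tsv alone. Below is a minimal sketch, assuming the column order listed under Q1 (ftype in column 2, product in column 7). The file it builds is a tiny made-up stand-in, not real Prokka output, and the variable names (`tsv`, `ftype_counts`, `hypo`) are only for illustration.

```shell
# Build a toy stand-in for a Prokka .tsv (hypothetical rows, not real output).
tsv=$(mktemp)
printf 'locus_tag\tftype\tlen_bp\tgene\tEC_number\tCOG\tproduct\n' > "$tsv"
printf 'BB433_00001\tCDS\t1500\tdnaA\t\t\tChromosomal replication initiator protein DnaA\n' >> "$tsv"
printf 'BB433_00002\tCDS\t300\t\t\t\thypothetical protein\n' >> "$tsv"
printf 'BB433_00003\ttRNA\t76\t\t\t\ttRNA-Phe(gaa)\n' >> "$tsv"
printf 'BB433_00004\tCDS\t400\trplB\t\t\t50S ribosomal protein L2\n' >> "$tsv"

# Tally features per type from column 2 (ftype), skipping the header row.
ftype_counts=$(tail -n +2 "$tsv" | cut -f2 | sort | uniq -c | awk '{print $2"="$1}')
echo "$ftype_counts"

# Count rows whose product column (7) contains the phrase.
# grep -c on the whole line would also count a match in any other column.
hypo=$(awk -F'\t' 'NR > 1 && $7 ~ /hypothetical protein/ {n++} END {print n + 0}' "$tsv")
echo "hypothetical=$hypo"

rm -f "$tsv"
```

To use it on the real annotation, replace the toy file with the path to the .tsv in the Prokka output folder. If the per-type tallies match the numbers in the .txt file, that is a quick sanity check on the answers above.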
## Q3

#### Which file is the most useful for finding genomic locations (chromosome and base address) of any type of annotated feature (genes (CDS), tRNA, etc)?
I think it is the .gff file, because each row describes one feature (e.g. a CDS or a tRNA) together with the sequence it sits on and its start and end positions. For example:

>1 Aragorn:001002 tRNA 32338 32413 . + . ID=BB433_00027;inference=COORDINATES:profile:Aragorn:001002;locus_tag=BB433_00027;product=tRNA-Phe(gaa)

## Q4

#### Compare your metrics to that of the reference genomes for these two species. Are your results similar? Does your answer impact your confidence in the sequencing quality of your new draft genome?
I think my results are similar to the reference. For tRNA at least, my count is the same as the one listed on the website. This agreement gives me more confidence in the quality of my draft genome.
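
For Q3, the location columns of a .gff can be pulled out on the command line. This is a minimal sketch, assuming the standard GFF3 layout (column 1 sequence ID, column 3 feature type, columns 4 and 5 start and end, column 7 strand). The toy file reuses the tRNA row quoted in Q3, with the sequence ID taken to be `1` and the attributes shortened. The CDS row is made up for contrast, and the variable names are only for illustration.

```shell
# Build a toy GFF3 file: one row from Q3 plus one hypothetical CDS row.
gff=$(mktemp)
printf '##gff-version 3\n' > "$gff"
printf '1\tAragorn:001002\ttRNA\t32338\t32413\t.\t+\t.\tID=BB433_00027;locus_tag=BB433_00027;product=tRNA-Phe(gaa)\n' >> "$gff"
printf '1\texample\tCDS\t100\t1500\t.\t-\t.\tID=toy_cds;product=hypothetical protein\n' >> "$gff"

# Print sequence ID, start, end and strand for every tRNA feature.
# Comment lines start with '#'; lines without tab-separated columns
# have an empty $3, so they never match the type test.
trna_locs=$(awk -F'\t' '!/^#/ && $3 == "tRNA" {print $1, $4, $5, $7}' "$gff")
echo "$trna_locs"

# Feature length follows from the coordinates: end - start + 1.
trna_len=$(awk -F'\t' '!/^#/ && $3 == "tRNA" {print $5 - $4 + 1}' "$gff")
echo "length=$trna_len"

rm -f "$gff"
```

Changing `"tRNA"` to `"CDS"` in the awk test lists CDS locations instead, which is why one file covers every annotated feature type.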