# bi278 Lab#3 notes

## exercise 1

### exercise 1.1

Created a new directory, lab_03, with `mkdir lab_03`.

Copied the genome file from the class directory with `cp /courses/bi278/Course_Materials/lab_03/P.hayleyella_bhqs69.pacbio.fasta lab_03`.

Now that I have a genome file in my directory, I ran Prokka with `prokka --force --outdir lab_03/bhqs69 --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk --locustag BH69 --genus Paraburkholderia --species hayleyella --strain bhqs69 lab_03/P.hayleyella_bhqs69.pacbio.fasta`.

**Q1. What information is in each of the different output files that Prokka generates? (focus on these: .faa, .ffn, .fna, .gff, .txt, .tsv)**

A.

- .faa: amino acid sequences of the predicted proteins (translated CDS), not codons.
- .ffn: nucleotide sequences of the predicted features (the gene coding regions, plus rRNA and tRNA).
- .fna: nucleotide sequences of the input contigs, i.e. the whole genome sequence.
- .gff: the annotation itself. Each feature gets a line with its contig, type, start, end, and strand, and the contig sequences are appended at the end of the file, which is why it looks similar to .fna when you scroll to the bottom.
- .txt: summary of the genome (counts of contigs, bases, CDS, rRNA, tRNA, and so on).
- .tsv: tab-separated table with one row per annotated feature (locus tag, feature type, length, gene name, product).

**Q2. Then, explore and use the best file from above to answer each question:**

- How many CDS do you have?
- How many tRNA do you have?
- How many 'ribosomal proteins' do you have?
- How many 'hypothetical proteins' do you have?

**Hint: try grep, then grep -c or grep | wc -l for the latter two.**

- CDS: 3584
- tRNA: 63
- ribosomal proteins: 3663
- hypothetical proteins: 65422

The last two numbers do not seem right: both are larger than the total number of CDS (3584), so neither can be a count of proteins. These two searches need to be rerun against a file that has one line per feature, such as the .tsv.

**Q3. Which file is the most useful for finding genomic locations (chromosome and base address) of any type of annotated feature (genes (CDS), tRNA, etc)?**

The .gff file, since it lists the contig, start, end, and strand for every annotated feature. The .txt file only gives the total count for each category, without locations.

**Q4. Compare your metrics to that of the reference genomes for these two species. Are your results similar? Does your answer impact your confidence in the sequencing quality of your new draft genome?**

rRNA came out to be 12 in both. CDS is 3584 in our file vs. 3607 in the NCBI file. tRNA is 63 in ours vs. 57 in NCBI. The size is 4.12 MB in our file vs. 4.13 MB in NCBI.
The comparison does influence my confidence, and overall the numbers are close enough to the reference that I trust the sequencing quality of the draft genome.
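
To double-check the counts from Q2 and the locations from Q3, something like the sketch below could be used. It is not the method required by the lab, just one way to count one row per feature instead of every matching line. Assumptions: the function names (`count_ftype`, `count_product`, `list_locations`) are made up for illustration; the Prokka .tsv is assumed to have the feature type in column 2 and the product in the last column; and because the output file prefix is not recorded in these notes, the script simply picks the first .tsv and .gff found in `lab_03/bhqs69`.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Prokka.

# Count rows of the .tsv whose feature type (column 2, assumed "ftype") equals $2.
count_ftype() {
    awk -F'\t' -v t="$2" 'NR > 1 && $2 == t { n++ } END { print n + 0 }' "$1"
}

# Count CDS rows whose product (assumed to be the last column) contains
# the phrase $2, ignoring case.
count_product() {
    awk -F'\t' -v p="$2" \
        'NR > 1 && $2 == "CDS" && index(tolower($NF), tolower(p)) { n++ }
         END { print n + 0 }' "$1"
}

# Print contig, start, end, and strand for every feature of type $2 in a GFF3
# file, stopping at the ##FASTA section where the sequences begin.
list_locations() {
    awk -F'\t' -v t="$2" \
        '/^##FASTA/ { exit } !/^#/ && $3 == t { print $1, $4, $5, $7 }' "$1"
}

# Assumed location of the outputs: the --outdir used in the prokka command.
tsv=$(ls lab_03/bhqs69/*.tsv 2>/dev/null | head -n 1)
gff=$(ls lab_03/bhqs69/*.gff 2>/dev/null | head -n 1)

if [ -n "$tsv" ]; then
    echo "CDS:                   $(count_ftype "$tsv" CDS)"
    echo "tRNA:                  $(count_ftype "$tsv" tRNA)"
    echo "ribosomal proteins:    $(count_product "$tsv" 'ribosomal protein')"
    echo "hypothetical proteins: $(count_product "$tsv" 'hypothetical protein')"
fi

if [ -n "$gff" ]; then
    # Show the first few tRNA locations as a quick check.
    list_locations "$gff" tRNA | head -n 5
fi
```

Because each count is restricted to one row per feature, the ribosomal and hypothetical protein counts from this approach can never exceed the CDS count, which is a quick sanity check on the result.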