Bi278 Lab 3 v2 NCBI data, kmers, and prokka

# Bi278 Lab 3 v2 gene prediction with `prokka`

### By Lee Ferenc 9/26/2023

## Exercise 1. Annotate a Bacterial Genome

[Here is the manual for prokka](https://github.com/tseemann/prokka#invoking-prokka).

Run Prokka annotation with Burkholderia pseudomallei as a closely related genome to enhance prediction:

`prokka --force --outdir PATH/STRAIN --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk --locustag STRAIN --genus Paraburkholderia --species bonniea --strain STRAIN PATH/FILENAME`

#### Example using P.hayleyella

```
# Print the absolute path of the input FASTA file
readlink -f P.hayleyella_bhqs69.pacbio.fasta

# Annotate the genome, using the B. pseudomallei GenBank file as trusted proteins
prokka --force --outdir /home2/enfere24/lab_03 \
  --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk \
  --locustag BHQS69 --genus Paraburkholderia --species bonniea --strain BHQS69 \
  /home2/enfere24/lab_03/P.hayleyella_bhqs69.pacbio.fasta
```

Note that `--outdir` takes the output directory (PATH/STRAIN), so you need both the full path of the input file (which `readlink -f` prints) and the path of the output directory.

### Question 1: Different output files and what they contain (I added .faa from my last lab)

* .gff: The master annotation in GFF3 format; it "contains both sequences and annotations. It can be viewed directly in Artemis or IGV."
* .gbk: "Standard Genbank file derived from .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence." In short, a Genbank file with sequences and annotations.
* .faa: "Protein FASTA file of the translated CDS sequences."
* .ffn: "Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)"
* .fna: "Nucleotide FASTA file of the input contig sequences."
* .txt: Text file with statistics relating to the annotated features
* .tsv: "tab-separated file of all features: locus_tag, ftype, len_bp, gene, EC_number, COG, product"

### Question 2. Respond with the command and the answer for each of the following questions:

#### How many CDS (coding sequences) do you have?
3584

`head PROKKA_09262023.txt`

#### How many rRNA and tRNA does it have?

12 rRNA and 63 tRNA

`head PROKKA_09262023.txt`

#### How many ‘hypothetical proteins’ does it have?

1214

`grep -c "hypothetical protein" PROKKA_09262023.faa`

#### How many genes associated with toxin does it have?

7

`grep -c "toxin" PROKKA_09262023.faa`

### Question 3. Which file can you use for finding genomic locations (chromosome and base pair position) of any type of annotated feature?

I believe the .gff file is the best choice. It contains both the sequences and the annotations, so each annotated feature is listed with the contig it sits on and its start and end positions.
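
To make the Question 3 answer concrete, here is a minimal sketch of how feature locations could be pulled out of a GFF3 file with `awk`. The helper names `gff_locations` and `gff_count_types` are hypothetical (they are not part of Prokka), and the sketch assumes the standard nine tab-separated GFF3 columns with the sequences appended after a `##FASTA` line.

```shell
# Hypothetical helpers for reading a Prokka-style GFF3 file.
# Assumes 9 tab-separated columns: seqid, source, type, start, end,
# score, strand, phase, attributes.

# Print contig, start, end, strand, feature type, and attributes for every
# feature whose attribute column matches a pattern (case-insensitive).
gff_locations() {
    awk -F'\t' -v pat="$2" '
        /^##FASTA/ { exit }   # stop before the embedded sequences
        /^#/       { next }   # skip header and comment lines
        NF == 9 && tolower($9) ~ tolower(pat) {
            print $1, $4, $5, $7, $3, $9
        }
    ' OFS='\t' "$1"
}

# Count how many features of each type (CDS, rRNA, tRNA, ...) are annotated.
gff_count_types() {
    awk -F'\t' '
        /^##FASTA/ { exit }   # exit still runs the END block below
        /^#/       { next }
        NF == 9    { n[$3]++ }
        END        { for (t in n) print t "\t" n[t] }
    ' "$1" | sort
}
```

For example, assuming the GFF output shares the `PROKKA_09262023` prefix of the other files, `gff_locations PROKKA_09262023.gff toxin` would list the contig and base pair coordinates of each matching feature, and `gff_count_types PROKKA_09262023.gff` should give per-type counts comparable to the .txt summary. Note that a pattern like `toxin` also matches "antitoxin", both here and in the `grep -c` count above.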