# Week 3 Notes (Prokka)

Prokka is a software tool that quickly annotates bacterial, archaeal and viral genomes and produces standards-compliant output files.

Link to the online manual for Prokka: https://github.com/tseemann/prokka#invoking-prokka

The Prokka help text also appears when you type `prokka` into your terminal.

## 1. Annotate a bacterial genome

Make a new directory for my work:

`mkdir lab_03`

Copy one of the genomes (`*.fasta`) from /courses/bi278/Course_Materials/lab_03:

`cp /courses/bi278/Course_Materials/lab_03/P.bonniea_bbqs433.nanopore.fasta ~/lab_03`

Note the new genome’s PATH.

Run the Prokka annotation on the genome file using the following command.

```
prokka --force --outdir PATH/STRAIN --proteins /courses/bi278/Course_Materials/lab_03/Burkholderia_pseudomallei_K96243_132.gbk --locustag STRAIN --genus Paraburkholderia --species bonniea --strain STRAIN PATH/FILENAME
```

* Note: PATH and STRAIN together name the directory where the Prokka output is written (`--outdir PATH/STRAIN`). We are using the very well-studied genome of the pathogen Burkholderia pseudomallei, a close relative, as a reference (`--proteins`) to improve the annotation.

Noh: "Each of the output files from prokka is essentially a text file, which means that you can search for and count different patterns."

##### Q1. Take a look at the following output files that prokka generates. What information is in each type of file? Ideally, you will be able to recognize each file type for its utility (what it’s most useful for) later on.

Look at the first 10 lines of each file using `head`.

`*.gff`: The gff file first lists each scaffold or chromosome with where it begins and ends. The feature lines that follow give each annotated feature's sequence, start, end, score, strand, and attributes.
See: https://useast.ensembl.org/info/website/upload/gff.html

`*.gbk`: I didn't find a gbk file, but it should be similar to the gbf file, which contains information such as the locus, the number of base pairs, and whether the molecule is linear.

`*.ffn`: The ffn file appears to contain the nucleotide sequences of the genes.

`*.fna`: The fna file contains the nucleotide sequences of the input contigs (the whole genome), not the individual genes or their proteins.

`*.txt`: The txt file contains summary statistics for the annotation, including organism name and the number of contigs, bases, CDS, CRISPR, rRNA, tRNA, and tmRNA.

`*.tsv`: The tsv file contains one row per annotated feature, with the locus tag, ftype, length in bp, gene, EC number, COG, and product.

See this source for more help: https://www.hadriengourle.com/tutorials/annotation/#:~:text=faa%20file%20contains%20the%20protein,sequences%20of%20the%20genes%20annotated.

##### Q2. Using the appropriate files, report back the command you used and the answer for each of the following questions:

How many CDS (coding sequences) does this genome have?

The genome has 3430 CDS. I found this using `head PROKKA_09262023.txt`.

How many rRNA and tRNA does it have?

The genome has 12 rRNA and 56 tRNA. I found this using `head PROKKA_09262023.txt`.

How many ‘hypothetical proteins’ does it have?

There are 1,363 hypothetical proteins. I found this using the tsv file and the command `grep "hypothetical protein" PROKKA_09262023.tsv | wc -l`.

How many genes associated with toxin does it have?

It has 5 genes associated with a toxin. I found this using the tsv file and the command `grep "toxin" PROKKA_09262023.tsv | wc -l`.

##### Q3. Which file can you use for finding genomic locations (chromosome and base pair position) of any type of annotated feature?

The GFF3 (`.gff`) file gives the genomic location of each annotated feature, along with most of its annotation. I found this out by using the command `head PROKKA_09262023.gff`.
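
The `grep` counts in Q2 match a pattern anywhere on a line and are case-sensitive, so `grep "toxin"` also counts "Antitoxin" products and would miss a product written as "Toxin". As a rough sketch of an alternative, the counts can be taken per column with `awk`. The tiny table below is made-up demo data (the `DEMO_` locus tags and products are hypothetical, not from this genome), laid out with the same seven columns as the Prokka tsv file. To use it on real output, point `tsv` at your own `PROKKA_*.tsv` file instead.

```shell
# Build a tiny, hypothetical stand-in for a Prokka .tsv file (demo data only).
tsv=$(mktemp)
printf 'locus_tag\tftype\tlength_bp\tgene\tEC_number\tCOG\tproduct\n' > "$tsv"
printf 'DEMO_00001\tCDS\t900\t\t\t\thypothetical protein\n' >> "$tsv"
printf 'DEMO_00002\tCDS\t450\thigB\t\t\tToxin HigB\n' >> "$tsv"
printf 'DEMO_00003\ttRNA\t76\t\t\t\ttRNA-Ala(tgc)\n' >> "$tsv"
printf 'DEMO_00004\tCDS\t300\thigA\t\t\tAntitoxin HigA\n' >> "$tsv"

# Count rows whose ftype column (column 2) is exactly "CDS".
n_cds=$(awk -F'\t' '$2 == "CDS" {n++} END {print n+0}' "$tsv")

# Count rows whose product column (column 7) is exactly "hypothetical protein".
n_hypo=$(awk -F'\t' '$7 == "hypothetical protein" {n++} END {print n+0}' "$tsv")

# Count products mentioning "toxin" in any letter case (includes antitoxins).
n_toxin=$(awk -F'\t' 'tolower($7) ~ /toxin/ {n++} END {print n+0}' "$tsv")

# Same as above, but leave out products that mention "antitoxin".
n_toxin_only=$(awk -F'\t' 'tolower($7) ~ /toxin/ && tolower($7) !~ /antitoxin/ {n++} END {print n+0}' "$tsv")

echo "CDS: $n_cds"
echo "hypothetical proteins: $n_hypo"
echo "toxin-related (incl. antitoxin): $n_toxin"
echo "toxin-related (excl. antitoxin): $n_toxin_only"

rm -f "$tsv"
```

Whether antitoxins should count as "associated with toxin" is a judgment call, which is why the sketch reports both numbers.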