# BI278 Lab 02 ### NCBI data, kmers, and prokka #### Olivia Schirle #### 09/20/2022 ## 1. Download public genomic data from NCBI ### 1.1 How to Navigate NCBI for Genomic Data #### Use the link https://www.ncbi.nlm.nih.gov/bioproject/browse. #### Find a bacteria (*Yersinia pestis*) and download a genome: ##### To do this, search for the bacteria in the BioProjects, then navigate to a project with that bacteria. Select 'Genome' in the 'See Genome Information for Yersinia pestis' box at the top. Select 'genome' to download the genome. #### Copy the genome to your colbyhome and move it to a new folder in your bi278 home: ##### If a filer is not mounted, use command+K and enter smb://filer.colby.edu. ``` cp Desktop/GCF_000222975.1_ASM22297v1_genomic.fna.gz /Volumes/Personal/oeschi23/ # Copy genome file from desktop's Downloads folder to colbyhome ssh oeschi23@bi278 # ssh onto bi278 mkdir lab_02 # Make a folder for this week mv /personal/oeschi23/GCF_000222975.1_ASM22297v1_genomic.fna.gz ~/lab_02 # Move the genome file to your bi278 home gunzip GCF_000222975.1_ASM22297v1_genomic.fna.gz # Unzip file ``` ### Use shell script from last week to collect genome stats on this genome: #### Editted the script file to generalize to any file location: ``` #!/bin/bash grep -v ">" $1 | tr -d -c GCATgcat | wc -c grep -v ">" $1 | tr -d -c GCgc | wc -c ``` #### The *Yersinia pestis* A1122 genome has a total of 4658411 nucleotides in the genome and it has a %GC of 0.4763. ### 1.2 Download Raw Reads via SRA Toolkit #### First, find the SRR run number. Navigate to the "Project Data" table at the bottom of the project page, click the link next to "BioSample," and select one containing an SRA. For *Yersinia pestis*, the SRR run number is SRR6512812. The SRA instrument was Illumina HiSeq 2500. #### We are using the fastq-dump command from the SRAtoolkit, which interacts with the NCBI SRA (Sequence Read Archive) database to download the raw sequence reads. ``` fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/lab_02 SRR8142457 # Download raw sequencing reads without quality scores head -20 *.fasta # Look at the first 20 lines of the FASTA files ``` #### Download raw sequencing reads from another organism: #### For *Haemophilus influenzae*, the SRR run number is SRR8142457. ``` fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/lab_02 SRR8142457 ``` | Organism | SRA instrument | SRA run number | Genome size (bp) | Estimated size | | -------------- | --------------------- | -------------- | ---------------- | -------------- | | *P.fungorum* | Illumina NovaSeq 6000 | SRR11022347 | 9,058,983 | 9,185.185 | | *P.sprentiae* | Illumina GAA IIx | SRR3927471 | 7,829,542 | | | *P.terrae* | PacBio RS II | DRR322713 | 10,062,489 | | | *P.xenovorans* | Illumina HiSeq 2000 | SRR2889773 | 9,702,951 | | | *Y.pestis* | Illumina HiSeq 2500 | SRR6512812 | 4,607,140 | 5,110,614 | | *H.influenzae* | PacBio RS II | SRR8142457 | 1,847,700 | 2,940,735 | #### Remove extra SRA files and rename the SRA files: ``` rm SRR6512812_pass_2.fasta # remove second SRA file mv SRR6512812_pass_1.fasta Y.pestis_SRA.fasta # rename file mv SRR8142457_pass.fasta H.influenzae_SRA.fasta #rename file ``` ## 2. Count k-mers and estimate genome size based on k-mer frequencies ### 2.1 Count k-mers with jellyfish #### I chose a k-mer size of 21. #### We will be using jellyfish count and jellyfish histo. jellyfish count counts all k-mers and jellyfish histo computes the histogram of the k-mer occurences. ``` jellyfish count -t 2 -C -s 1G -m 21 -o Y.pestis.m21.count Y.pestis_SRA.fasta jellyfish histo -o Y.pestis.m21.histo Y.pestis.m21.count # Get frequency distribution across all k-mers rm Y.pestis.m21.count # Remove count file since it takes up a lot of memory less Y.pestis.m21.histo # look at the histo file q # exit out of file view ``` #### The histo file has two columns - occurrence and density. The occurrence is how often a k-mer appears in the file and the density is how many k-mers have that level of occurence. #### Write a shell script to create histo files for other SRA files: ``` #!/bin/bash jellyfish count -t 2 -C -s 1G -m 21 -o $1.m21.count $1.fasta jellyfish histo -o $1.m21.histo $1.m21.count rm $1.m21.count ``` #### Run the script and visualize the histo file: ``` sh kmer_histo.sh H.influenzae_SRA less H.influenzae_SRA.m21.histo ``` ### 2.2 Visualize k-mer Counts and Estimate Genome Size in R #### Navigate to the RStudio server: bi278.colby.edu. #### Set the working directory to the directory containing the histo files: ``` getwd() setwd("/home2/oeschi23/lab_02") ``` #### Read in the histo files: ``` Y.pestis.m21.histo <- read.table("Y.pestis.m21.histo", h=F, sep=" ") H.influenzae.m21.histo <-read.table("H.influenzae_SRA.m21.histo", h=F, sep=" ") ``` #### Check the shape of the histogram: ``` plot(Y.pestis.m21.histo, type="l") # View histogram as a line plot plot(Y.pestis.m21.histo[5:250,], type="l") # Narrow down the range to find the x-axis value at the main peak ``` #### The initial peak represents kmers resulting from sequencing errors. We ignore the initial peak and are interested in the midpoint of the other peak that represents single copy regions of the genome. #### The midpoint appears to be around 100. #### Check the data directly to determine the midpoint of the main peak: ``` Y.pestis.m21.histo[90:120,] ``` #### The midpoint for the *Y. pestis* with the highest density is at about 101. #### Estimate the genome size: #### Estimate by dividing the area under the curve by the coverage estimate. ``` nrow(Y.pestis.m21.histo) # Number of rows in the file head(Y.pestis.m21.histo) # First few rows in the file head(Y.pestis.m21.histo[,1]) # First few rows of just the first column head(Y.pestis.m21.histo[,2]) # First few rows of just the second file head(Y.pestis.m21.histo[5:nrow(Y.pestis.m21.histo),]) # Showing a few rows at top of file starting at row 5 head(Y.pestis.m21.histo[5:nrow(Y.pestis.m21.histo),1]*Y.pestis.m21.histo[5:nrow(Y.pestis.m21.histo),2]) # The values in the first column multiplied by the values in the second column starting at row 5 and showing the next few rows sum(Y.pestis.m21.histo[5:nrow(Y.pestis.m21.histo),1]*Y.pestis.m21.histo[5:nrow(Y.pestis.m21.histo),2]) # Sums together the product of the first and second columns for every row starting at row 5 sum(Y.pestis.m21.histo[5:nrow(Y.pestis.m21.histo),1]*Y.pestis.m21.histo[5:nrow(Y.pestis.m21.histo),2])/101 # Sums together the product of the first and second columns for every row starting at row 5 and divides by the highest density (101) ``` #### For *Y.pestis*, the estimate is 5,110,614. ``` plot(H.influenzae.m21.histo, type="l") # Plot the H.influenzae histogram plot(H.influenzae.m21.histo[5:250,], type="l") # Narrow the range plot(H.influenzae.m21.histo[10:50,], type="l") # Narrowed the range more since peak was still small H.influenzae.m21.histo[15:25,] ``` #### The midpoint with the highest density for the *H. influenzae* histogram is about 20. ``` sum(H.influenzae.m21.histo[5:nrow(H.influenzae.m21.histo),1]*H.influenzae.m21.histo[5:nrow(H.influenzae.m21.histo),2])/20 ``` #### For *H.influenzae*, the estimate is 2,940,735. ## 3. Annotate a genome with Prokka ### 3.1 Annotate a Genome #### Find the genome's PATHs: ``` ls /courses/bi278/Course_Materials/lab_01b # location of newly sequenced nanopore genomes ls /courses/bi278/Course_Materials/lab_02 # location of B.pseudomallei genome ``` #### Error encountered running Prokka annotation.