# Lab 02
## NCBI Data, kmers, and prokka
### Exercise 1.2: Download public NCBI genome data
I will be getting the data from *Yersina pestis* as it has a reference genome as well as a raw sequencing reads
```
cp /Downloads/GCF_000222975.1_ASM22297v1_genomic.fna /colbyhome/BI278/lab_02
cp /home/students/k/kyamad23/lab_01b/bishbashbosh.sh /colbyhome/Bi278/lab_02
```
We first download the genome .fna file from the NCBI database and copy it from our desktop into our server space. We also copy the genome file reading script we wrote last week into the same folder.
```
sh bishbashbosh.sh ./GCF_000222975.1_ASM22297v1_genomic.fna
```
By running the script as last time, we can get the GC count and the genome size from the NCBI file. We can calculate the GC % by dividing the GC count by the total genome size.
```
grep ">" ./GCF_000222975.1_ASM22297v1_genomic.fna
```
The grep command searches the designated file for the specified character in the first argument. In fna files, they give us the contig count, strain, species, and the status of the sequence.
| Name | Strain | Contig Count | Genome Size| GC % |
| -------- | -------- | ------------ | -----------| -----|
| *Yersina pestis* | A1122 | 3 | 4658411 | 47.63|
### Exercise 1.2: Download Raw Reads via SRA Toolkit
From NCBI, we choose a [study](https://www.ncbi.nlm.nih.gov/bioproject/881323) conducted by King's College London on real-time whole genome sequencing to guide patient tailored therapy of SARS-CoV-2 infection. This project has 1 SRA experiment and we download the [file](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR21607323&display=metadata) which viral RNA sample sequenced with GridION. The size of this sequence is 51.3M and the GC % is 40.6%
```
fastq-dump -X 3 -Z SRR21607323
```
fastq-dump is the first non-native Unix command that we will use. This command is from the fastq-dump package and allows us to access NCBI sequencing data remotely through the terminal. The command above will allow us to view the first three "spots" to ascertain information about the file before downloading the full file.

From the output above, we can tell that there should be three 490bp sequences per "spot". This is how data is stored by the NCBI database but in order to combine the reads, we need to run the command below.
```
fastq-dump -–split-3 --skip-technical --readids --read-filter pass --
dumpbase --clip -v --fasta default
--outdir ./ SRR21607323
sh bishbashbosh.sh SRR21607323_pass.fasta
```
The command above downloads the combined SRR id as one fasta file and outputs it to the specified directory. Running the script allows us to get the GC content and the total genome size.
We will now do the same steps on raw sequencing reads for [our bacteria](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR16693328&display=metadata).
```
fastq-dump -–split-3 --skip-technical --readids --read-filter pass --
dumpbase --clip -v --fasta default
--outdir ./ SRR16693328
```
| Organism. | SRA instrument type | SRA run number | Genome Size(bp) | Est. Size |
| ------------ | --------------------- | -------------- | --------------- | --------- |
| P. fungorum | Illumina NovaSeq 6000 | SRR11022347 | 9,058,983 | 9,185,185 |
| P. sprentiae | Illumina GA IIx | SRR3927471 | 7,829,542 | |
| P. terrae | PacBio RS II | DRR322713 | 10,062,489 | |
| Y. pestis | Illumina | SRR16693328 | | 2384033 |
### Exercise 2.1: Count k-mers and estimate genome size based on k-mer frequencies
In order to count the number of k-mers in the genome file, we run the following command from the jellyfish package:
```
jellyfish count -t 2 -C -s 1G -m 29 -o Yersinia.m29.count GCF_000222975.1_ASM22297v1_genomic.fna
jellyfish histo -o Yersinia.m29.histo Yersinia.m29.count
rm Yersinia.m29.count
```
The first command creates a file of raw counts and the second command runs on the count file and outputs a file with the frequency distribution of all k-mers.
We will turn the commands above into the following script:
```
jellyfish count -t 2 -C -s 1G -m $1 -o $2.count $3
jellyfish histo -o $2.histo $2.count
rm $2.count
```
This script, named fishy.sh, will take the k-mer size as the first argument, the name of the organism as the second argument, and the input_file as the third argument. It automates the process of taking in a fasta file, creating a count file, creating a frequency distribution across all k-mers from the count file, and removing the count file to free up memory.
### Exercise 2.2 Visualize your K-mer Counts and Estimate Genome Size in R
For exercise 2.2, we will be working in R after copying the histo files produced above into the working directory of our R studio file
```
getwd()
setwd('/home2/kyamad23')
```
The command "getwd()" will output the current working directory of Rstudio in the console and the command setwd() will set the working directory to the value passed into the argument.
```
data <- read.table("Y_ent_m29.histo", h = F, sep = " ")
plot(data, type = "l")
```
The first command will load the histo file as an R table and the second one produces the following plot:

There is a very small bump in before 1000 kmer size. This is because the data is being skewed by an intial bump that is caused by errors. By selecting the exact range of values that we want, we can get a more informative plot.
```
plot(data[5:250,], type = "l")
```

The plot above is much more informative in telling us where the main peak lies.
In order to find where the midpoint of the peak or the kmer value with the higheset density is, we can list all values on the curve with the following command:
```
data[150:180,]
```
This lists all kmer values within the range 150-180. The "highest density" occurs at kmer = 170 for this frequency distribution.
The following is the code to calculate the area under the curve in R
```
nrow(data) #2188
head(data) # first 6 values
head(data[,1]) #kmer indices of the first 6 values
head(data[,2]) # frequency of the kmers 1-6
head(data[5:nrow(data),]) # kmer values 5 and beyond
head(data[5:nrow(data),1]*data[5:nrow(data),2]) # multiplying the length of each kmer by the frequency to count up bp
sum(data[5:nrow(data),1]*data[5:nrow(data),2]) # sum of bp excluding first five kmer values
sum(data[5:nrow(data),1]*data[5:nrow(data),2])/170 # estimating the area under the curve by creating a square region around the curve and dividing by the height
```
The estimated area under the curve for the Y. enterocolitica frequency distribution was 2384033.This means that the estimated size of the genome based on the kmer frequency distribution was 2384033bp.
Unfortunately, because the bi278 server ran out of memory, I was unable to calculate the estimated sizes of two other species. Following the code above and replacing the names, it should be relatively simple to find the estimated size of a genome from the kmer frequency distribution.