# Bi278 Lab 2 - NCBI data, kmers, and prokka
## Exercise 1. Downloading publc genome data from NCBI
Go to the website below:
[https://www.ncbi.nlm.nih.gov/bioproject/browse](https://)
Search for a desirable genome, for example "*Taylorella equigenitalis*”, and filter by data type for Genome sequencing projects for fewer results. I chose based on the fact it had a publication tied to it. I choose accension **PRJNA61285.**
Click the accession and then you will see a page. Click on genome, and then chromosome level genome to download onto the computer.
##### Copy over the file to the lab folder
Make sure filer is mounted (ctrl + k and input "smb://filter.colby.edu" and connect)
After downloading, open terminal and run gunzip "filename" so
>gunzip Downloads/GCF_002288025.1_ASM228802v1_genomic.fna.gz
Then copy the file to your user (it will say permisison denied but that's false)
>cp Downloads/GCF_002288025.1_ASM228802v1_genomic.fna /Volumes/Personal/enfere24
Ssh into bi278 and copy to your sequence to lab_02:
>mkdir colbyhome/enfere24/Genomics/lab_02
>mv GCF_002288025.1_ASM228802v1_genomic.fna colbyhome/enfere24/Genomics/lab_02
>cd colbyhome/enfere24/Genomics/lab_02
I moved the GC analyzer by cd .. back as far as I could and then running:
>cp courses/bi278/Course_Materials/lab_01b/gc.sh home2/enfere24/colbyhome/Genomics/lab_02
It must be in the same folder or else it won't work
I also renamed the file
>mv lab_02/GCF_002288025.1_ASM228802v1_genomic.fna lab_02/GCF_01v.fna
>sh gc.sh GCF_01v.fna
>grep ">" GCF_01v.fna
##### This genome has a GC content of 37.52%, a total base pair count of 1666291 and 1 contig (it's complete)
## Exercise 1.2 Downloading publc genome data from NCBI**
I downloaded a SARS-CoV-2 from https://www.ncbi.nlm.nih.gov/bioproject/769411
##### The data set accession was SRX8879039, run is SRR12380022. 42.6M bases, 13.1M, GC conent of 38.1%. (I choose the same in the lab instructions as to make sure I was doing this correctly)
I redid the steps to move it into my personal lab_02 folder and cd to lab_02
>vdb-config --interactive
>#exited because it needed to run first
>fastq-dump -X 3 -Z SRR1238022
```
Read 3 spots for SRR1238022
Written 3 spots for SRR1238022
@SRR1238022.1 1 length=51
CTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGC
+SRR1238022.1 1 length=51
BC<DFFDDHHHFHJJJJJJIJJJJJJJJJJJJJJJJJJJIJHIJJJJJJIJ
@SRR1238022.2 2 length=51
CTGCCGTCTGCTGCCATCGGAGCCCAAAGCCGGGCTGTGACTGCTCAGACC
+SRR1238022.2 2 length=51
???BDD:DHDHFHGGEHGII=E@FFD;FHEDHBDBFGFFIIIIEHGGEDHE
@SRR1238022.3 3 length=51
CCTCACTCACTGTCTACTCTCCTCTCACCTCTGCAACACTGGGGACACTCA
+SRR1238022.3 3 length=51
@@+=DDFDHHHHHIJIGGGEGGGFIIGIJIIIJJJJIJIJIIIJIIJJIIE
```
In the output, I can see 3 reads of legnth 51
>fastq-dump -X 3 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -Z SRR1238022
This is confirmed with the output:
```
Read 3 spots for SRR1238022
Written 3 spots for SRR1238022
@SRR1238022.1.1 1 length=51
CTGGAGTGGAGTTTTCCTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGC
+SRR1238022.1.1 1 length=51
BC<DFFDDHHHFHJJJJJJIJJJJJJJJJJJJJJJJJJJIJHIJJJJJJIJ
@SRR1238022.2.1 2 length=51
CTGCCGTCTGCTGCCATCGGAGCCCAAAGCCGGGCTGTGACTGCTCAGACC
+SRR1238022.2.1 2 length=51
???BDD:DHDHFHGGEHGII=E@FFD;FHEDHBDBFGFFIIIIEHGGEDHE
@SRR1238022.3.1 3 length=51
CCTCACTCACTGTCTACTCTCCTCTCACCTCTGCAACACTGGGGACACTCA
+SRR1238022.3.1 3 length=51
@@+=DDFDHHHHHIJIGGGEGGGFIIGIJIIIJJJJIJIJIIIJIIJJIIE
```
Next, I'm running a random bacteria genome but T. equigenitalis is missing even though it says there is 1 result and so I am using Legionella pneumophila (im not redoing work I know well)
## Exercise 2
I choose 22
# first run jellyfish
>jellyfish count -t 2 -C -s 1G -m 22 -o t.equigenitalis.m22.count SRR12380022.fasta
>jellyfish histo -o t.equigenitalis.m22.histo t.equigenitalis.m22.count
>rm t.equigenitalis.m22.count
>cat t.equigenitalis.m22.histo #outputs an interesting thing

#I did this for legionares and will show them in r
First I went to R and then in terminal
>getwd()
Im posting this just in case
1. Open an internet browser and go to this server and log in with your Colby credentials:
bi278.colby.edu
2. Your next task is to set the working directory your histo files are in (this is similar to ls and
cd in Unix). In the Console:
getwd()
setwd("/home/colbyid/directory_name/
3. Then you can read in one by one the result files you created in the previous step. Unlike Unix, R
is unable to access these files without this step.
name <- read.table("name.m29.histo", h=F, sep=" ")
Remember the table that you just read has occurrence (or frequency, how often a k-mer appears in
that file) and density (how many k-mers show at that level of occurrence) for k-mers in your
genome.
4. We can check out the shape of this histogram.
plot(name, type="l")
There is an initial peak that represents kmers that are the result of sequencing errors. Since we
want to find the midpoint of the main peak that represents single copy regions of the genome, we
can narrow down the range of the histogram to try to figure out what the x-axis value is at the main
peak.
plot(name[5:250,], type="l")
5. Next we can check out the data directly to determine the midpoint of the main peak based on
where it was located in your plot (your range to zoom into may be different). This midpoint
number represents a rough estimate of genome coverage.
name[150:180,]
My midpoint with the “highest” density was about 154. This histogram curve is not entirely smooth,
so you might have to consider the overall trend rather than picking the numerical maximum after
the initial peak.
6. Now we want to area under the curve and divide it by this coverage estimate to get an estimate
of genome size. This is an estimation that is possible because the distribution of k-mers in a
genome theoretically follows a Poisson distribution. We are ignoring the part of the curve (the
initial peak) that contains the results of sequencing error.
This next series of commands will hopefully show you what the last line is doing.
nrow(name)
head(name)
head(name[,1]) #square brackets are used to limit [row, column]
head(name[,2])
head(name[5:nrow(name),])
head(name[5:nrow(name),1]*name[5:nrow(name),2])
sum(name[5:nrow(name),1]*name[5:nrow(name),2])
sum(name[5:nrow(name),1]*name[5:nrow(name),2])/154
I get an estimate of 9,185,185 with this calculation for P. fungorum from the SRA data. This estimate
is not necessarily accurate (you will have calculated the genome sizes for each of these datasets in
the previous lab), but is a useful ballpark number to be able to get when you don't know the
genome size for an organism.
Go ahead and finish estimation for your two SRA datasets.
###### im going to update this later but I legitamentlly did not have time to finish this today. im already struggling with mac, no laptop, and completing the rest of my work on top of this plus a serious cold/being sick where I'm short of breath all the time. I've also been up for 41 hrs STRAIGHT doing other work for other classes and this is my breaking point im sorry to if anyone actually read this, I know it's a lot. and i have more on plate I just can't finish this today Im so sorry I just need a day or two