# Notebook 2 Genomics Lab
## Get genome data of Brucella suis.
Download the genome file from NCBI and analyze it with the modified code. The results:

- GC content: 0.572513 (about 57.25%)
- Size: 3,315,175 bp
- Contig count: 2
- Strain: 1330
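
The notebook does not show the analysis code itself, so the following is only a minimal Python sketch of how GC content, size, and contig count could be computed from a FASTA file. The function name `fasta_stats` and the tiny example sequence are hypothetical and are not the lab's script or real Brucella data.

```python
def fasta_stats(fasta_text):
    """Return (gc_fraction, total_bases, contig_count) for FASTA-formatted text."""
    contigs = 0
    gc = 0
    total = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Each header line starts a new record (contig).
            contigs += 1
            continue
        seq = line.upper()
        gc += seq.count("G") + seq.count("C")
        # Assumption: every character on a sequence line counts toward the size,
        # including ambiguous bases such as N.
        total += len(seq)
    gc_fraction = gc / total if total else 0.0
    return gc_fraction, total, contigs


# Made-up example with two records (not real genome data).
example = ">contig_1\nATGCGC\nGGTA\n>contig_2\nTTAACC\n"
gc_fraction, size, contigs = fasta_stats(example)
print(f"GC fraction: {gc_fraction:.6f}, size: {size}, contigs: {contigs}")
```

For a real file, the text could be read with `open(path).read()` and passed to the same function.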
## Get raw data of Brucella suis.
After finding the record, click its sequence data number, then click the run code (the SRR number).
Use `fastq-dump` to download the run.
Use `fastq-dump -X 3 -Z SRRnumber` to print the first 3 spots to the terminal as a quick check.
Use
```
fastq-dump -X 3 --split-3 --skip-technical --readids --read-filter pass \
    --dumpbase --clip -Z SRRnumber
```
to check it split into two parts (the forward and reverse reads).
Use
`fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/lab_02 SRR#######` to download it.
Use the command below to print the first 20 lines of the downloaded data as a check.
```
head -20 PATH/*.fastq
```
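
As a complement to `head`, here is a hedged Python sketch that reads the first few records and checks their shape. It assumes standard 4-line FASTQ records (header, sequence, `+` line, quality), in which case 20 lines correspond to 5 records. The function name `read_fastq_records` and the example reads are hypothetical.

```python
from itertools import islice


def read_fastq_records(lines, limit):
    """Yield up to `limit` (header, sequence, quality) tuples from FASTQ lines."""
    it = iter(lines)
    for _ in range(limit):
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # End of input (or a truncated final record).
        header, seq, plus, qual = (s.rstrip("\n") for s in record)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header, seq, qual


# Made-up two-record example (not real SRA data).
example_lines = [
    "@read.1\n", "ACGTACGT\n", "+\n", "IIIIIIII\n",
    "@read.2\n", "GGCCTTAA\n", "+\n", "HHHHHHHH\n",
]
records = list(read_fastq_records(example_lines, 5))
for header, seq, qual in records:
    print(header, len(seq))
```

With a real file, an open file handle can be passed in place of `example_lines`.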
| Organism | SRA instrument record | SRA run number | Genome size (bp) | Estimated size (bp) |
|:-------------:|:--------------------- |:-------------- |:---------------- |:-------------- |
| P. fungorum | Illumina NovaSeq 6000 | SRR11022347 | 9,058,983 | 9,185,185 |
| P. sprentiae | Illumina GA IIx | SRR3927471 | 7,829,542 | |
| P. terrae | PacBio RS II | DRR322713 | 10,062,489 | |
| P. xenovorans | Illumina HiSeq 2000 | SRR2889773 | 9,702,951 | |
| Brucella suis | Illumina NextSeq 500 | SRR8550473 | 3,315,175 | |
## K-Mers
The k-mer size I chose is k = 21.
Use a command like the one below to produce the count file:
```
jellyfish count -t 2 -C -s 1G -m 21 -o name.m21.count input_file
```
Use a command like the one below to produce the histo file from that count file:
```
jellyfish histo -o name.m21.histo name.m21.count
```
Remove the count file once the histo file exists:
```
rm name.m21.count
```
(Because there was not enough memory to run this, only the provided histo file was used.)
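
To illustrate what the count and histo steps compute, here is a small Python sketch of canonical k-mer counting (the idea behind the `-C` flag) followed by a histogram of multiplicities. It is a toy illustration under stated assumptions, not how Jellyfish is implemented. The function names are hypothetical, and skipping k-mers that contain non-ACGT characters is an assumption of this sketch.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(reads, k):
    """Count canonical k-mers across all reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Assumption: k-mers with ambiguous bases (e.g. N) are skipped.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def kmer_histogram(counts):
    """Return sorted (multiplicity, number of distinct k-mers) pairs."""
    return sorted(Counter(counts.values()).items())


# Made-up reads with k = 3 so the result is easy to check by hand.
reads = ["ACGTAC", "GTACGT"]
counts = count_canonical_kmers(reads, 3)
print(dict(counts))
print(kmer_histogram(counts))
```

Each histogram row has the same two-column meaning as a line of the histo file: a multiplicity and the number of distinct k-mers seen that many times.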
## Using R
- `getwd()` returns the current working directory.
- `setwd("/home/colbyid/directory_name/")` is an example of how to set the working directory.
- `name <- read.table("name.m29.histo", h=F, sep=" ")` reads the specified histo file into a data frame.
- `plot(name, type="l")` plots the data as a line.
- `plot(name[5:250,], type="l")` zooms the plot in to rows 5 to 250 (x values 5 to 250 when the file starts at x = 1).
- `name[150:180,]` prints the rows for x values 150 to 180.
- `sum(name[5:nrow(name),1]*name[5:nrow(name),2])/154` gives the approximate genome size: the sum of (k-mer occurrence × number of distinct k-mers) from row 5 onward, divided by the x value of the highest peak (154 here).
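
The same genome size arithmetic can be sketched in Python. This is a hedged translation of the R one-liner, not a replacement for it. The function names, the cutoff argument `min_x`, and the synthetic histogram are hypothetical. It assumes two whitespace-separated columns per line (x value, count) and that low x values below the cutoff are sequencing errors to be ignored.

```python
def parse_histo(text):
    """Parse 'x count' lines into a list of (x, count) integer pairs."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            x, count = line.split()
            rows.append((int(x), int(count)))
    return rows


def find_peak(histo, min_x):
    """Return the x value with the highest count among rows with x >= min_x."""
    candidates = [row for row in histo if row[0] >= min_x]
    return max(candidates, key=lambda row: row[1])[0]


def estimate_genome_size(histo, min_x, peak_x):
    """Total k-mers (sum of x * count for x >= min_x) divided by the peak x."""
    total_kmers = sum(x * count for x, count in histo if x >= min_x)
    return total_kmers / peak_x


# Synthetic histogram (not real data): an error tail at low x, then a peak at x = 7.
histo_text = "1 1000\n2 200\n3 50\n4 10\n5 20\n6 60\n7 100\n8 60\n9 20\n"
histo = parse_histo(histo_text)
peak = find_peak(histo, min_x=5)
print(peak, estimate_genome_size(histo, min_x=5, peak_x=peak))
```

In the R version, the cutoff is row 5 and the peak value 154 is read off the plot by hand; `find_peak` only automates that lookup for a histogram with a single clear peak.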