
# Notebook 2 Genomics Lab

## Get genome data of Brucella suis

Download the genome file from NCBI and analyze it with the modified code. The results:

- GC content: 0.572513 (as a fraction, about 57.25%)
- Size: 3,315,175 bp
- Contig count: 2
- Strain: 1330

## Get raw data of Brucella suis

Find the organism's record in SRA, click its sequence data number, then click the run code (the SRR accession).

Use `fastq-dump` to download the reads. First, preview the first three spots on standard output with `fastq-dump -X 3 -Z SRRnumber`.

To preview the same spots split into their two parts, use

```
fastq-dump -X 3 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -Z SRRnumber
```

To download the full run, use `fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/lab_02 SRR#######`. Note that `--fasta` produces FASTA output without quality scores; omit it if FASTQ files are needed for the next step.

Check the first 20 lines of the downloaded data:

```
head -20 PATH/*.fastq
```

| Organism      | SRA instrument record | SRA run number | Genome size (bp) | Estimated size (bp) |
|:-------------:|:--------------------- |:-------------- |:---------------- |:------------------- |
| P. fungorum   | Illumina NovaSeq 6000 | SRR11022347    | 9,058,983        | 9,185,185           |
| P. sprentiae  | Illumina GA IIx       | SRR3927471     | 7,829,542        |                     |
| P. terrae     | PacBio RS II          | DRR322713      | 10,062,489       |                     |
| P. xenovorans | Illumina HiSeq 2000   | SRR2889773     | 9,702,951        |                     |
| Brucella suis | NextSeq 500           | SRR8550473     | 3,315,175        | Text                |

## K-Mers

The k I chose is 21.

Use a command like the one below to produce the count file:

```
jellyfish count -t 2 -C -s 1G -m 21 -o name.m21.count input_file
```

Use a command like the one below to produce the histo file from the count file:

```
jellyfish histo -o name.m21.histo name.m21.count
```

Remove the count file once the histo file exists:

```
rm name.m21.count
```

(Because there was not enough memory to run the count step, only the provided histo file, `name.m29.histo`, is used below.)

## Using R

- `getwd()` returns the current working directory.
- `setwd("/home/colbyid/directory_name/")` is an example of how to set the working directory.
- `name <- read.table("name.m29.histo", header=FALSE, sep=" ")` reads the specified histo file into a two-column table: column 1 is the k-mer depth (x) and column 2 is the number of k-mers at that depth (y).
- `plot(name, type="l")` plots the data as a line.
- `plot(name[5:250,], type="l")` zooms the plot in to x values between 5 and 250.
- `name[150:180,]` prints the rows for x values between 150 and 180, which helps locate the exact position of the peak.
- `sum(name[5:nrow(name),1]*name[5:nrow(name),2])/154` gives the approximate genome size: the sum of depth × count over all rows from 5 onward, divided by the x value of the peak (154 here). Rows below 5 are skipped because very low-depth k-mers are mostly sequencing errors.
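
The same estimate can also be sketched on the command line with `awk`, which is handy when R is not available. This is only a minimal sketch under a few assumptions: `estimate_genome_size` is a hypothetical helper (not part of Jellyfish), the histo file has two whitespace-separated columns (depth, count), and the main peak is taken to be the tallest point at or above the cutoff depth instead of being read off the plot by eye. The demo file is synthetic and only there to show the arithmetic.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate genome size from a k-mer histogram.
# Column 1 = k-mer depth, column 2 = number of k-mers seen at that depth.
estimate_genome_size() {
    local histo="$1"      # path to the histo file
    local min_depth="$2"  # depths below this are ignored (likely errors)
    awk -v min="$min_depth" '
        $1 >= min {
            total += $1 * $2      # depth x count, summed over usable rows
            if ($2 > best) {      # remember the tallest point (main peak)
                best = $2
                peak = $1
            }
        }
        END {
            # Print the peak depth and the size estimate (total / peak).
            if (peak > 0) printf "%d %d\n", peak, total / peak
        }
    ' "$histo"
}

# Synthetic demo data, not real sequencing output.
demo=$(mktemp)
printf '1 1000\n2 200\n3 50\n4 10\n5 20\n6 40\n7 20\n' > "$demo"
estimate_genome_size "$demo" 5   # prints: 6 80
rm -f "$demo"
```

For a real file the call would look like `estimate_genome_size name.m29.histo 5`. It is worth comparing the reported peak depth with the one seen in the plot, since a repeat-rich or contaminated sample can make the tallest point differ from the true main peak.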