
# Lab 2 Notes

### Exercise 1

#### You can use NCBI to access the public database in order to download **genomic data**
>https://www.ncbi.nlm.nih.gov/bioproject/browse

1) Find a bacterium that has both a genome and raw sequence reads available through NCBI.
2) Click on the hyperlinked name to go to the linked genome.
3) Click on the genome link in order to download it.
4) You might need to unzip the downloaded file with the command *gunzip filename.gz* before using it.
5) Move the downloaded genome data onto your personal colbyhome by using the following steps:

>- Mount the filer (**command+K** and enter **smb://filter.colby.edu**)
>
>- *cp Downloads/filename /Volumes/Personal/colbyid* - this makes a copy
>
>- *mv /personal/rakapa25/filename ~/lab02* - this moves the file
>
>But remember **there needs to be a space between the command *mv* and the file location**

#### In order to download **raw sequencing data**, do the following
>- Scroll to the bottom of the webpage for the data you have picked
>- Go to raw sequencing reads
>- Go to SRA Experiments under Sequence Data in order to access the SRR run code.

![](https://i.imgur.com/IdDdh4G.png)

In order to read in these sequences on the terminal, use the following:
>*fastq-dump -X 3 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -Z **SRR run code***

To look at the first 20 lines of data:
>*head -20 PATH/\*.fasta*

| Organism | SRA instrument (type) | SRA run number | Genome size (bp) | Estimated size (bp) |
| -------- | -------- | ------------ | --- | --- |
| P. fungorum | Illumina NovaSeq 6000 | SRR11022347 | 9,058,983 | 9,185,185 |
| P. sprentiae | Illumina GA IIx | SRR3927471 | 7,829,542 | |
| P. terrae | PacBio RS II | DRR322713 | 10,062,489 | |
| P. xenovorans | Illumina HiSeq 2000 | SRR2889773 | 9,702,951 | |
| Y. enterocolitica | Illumina MiSeq | SRR19908121 | 978,929 | 1,228,652 |

### Exercise 2

#### Count k-mers and estimate genome size based on k-mer frequencies
- Picked k-mer length 20

>https://genome.umd.edu/docs/JellyfishUserGuide.pdf for more information about jellyfish, the program we are using.

The following process should be used in order to count the k-mers in a genome.
>- *jellyfish count -t 2 -C -s 1G -m 29 -o filename.count **Run Code***
>- *jellyfish histo -o filename.histo filename.count*
>- *rm filename.count* (it's a big file and needs to be deleted)
>- *cat Taylorellaequigenitalis.m20.histo* (this prints the file)
>- The command *sh ~/lab02/K-mer.sh SRR830651* was used to run these steps; the script was written in nano to streamline the process

#### Using R to visually process the data and calculate genome size
>http://www.cookbook-r.com/ - to help learn R
>bi278.colby.edu - this is our personal BI278 RStudio

To set the working directory in R, use the command
>*setwd("/home2/rakapa25/lab02")*

In order to read information into R, use the following:
>*name <- read.table("name.m29.histo", h=F, sep=" ")*

To make a calculation for the genome size, use the following:
>*sum(name[5:nrow(name),1]\*name[5:nrow(name),2])/154*\*

###### \* Can change

![](https://i.imgur.com/WgfdaRf.png)

![](https://i.imgur.com/j1623Uh.png)
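
The genome size calculation in R can also be cross-checked from the terminal. The sketch below is a rough awk equivalent of the R expression, assuming a two-column, space-separated histogram like the one *jellyfish histo* writes (k-mer depth, then the number of distinct k-mers at that depth). The helper name `estimate_genome_size`, the toy file, and its values are made up for illustration and do not come from any organism in the table; the starting row and the divisor are passed in as arguments because they can change from one histogram to the next.

```shell
# Toy histogram (hypothetical values): column 1 = depth, column 2 = count.
toy_histo=$(mktemp)
cat > "$toy_histo" <<'EOF'
1 1000
2 200
3 50
4 10
5 20
6 40
7 20
EOF

# Hypothetical helper: sum depth*count from start_row onward, then divide by
# the chosen divisor. Mirrors
#   sum(name[start:nrow(name),1]*name[start:nrow(name),2])/divisor
# from the R step.
estimate_genome_size() {
  histo_file=$1
  start_row=$2
  divisor=$3
  awk -v start="$start_row" -v div="$divisor" \
    'NR >= start { total += $1 * $2 } END { printf "%.0f\n", total / div }' \
    "$histo_file"
}

# Rows 5-7 give 5*20 + 6*40 + 7*20 = 480, and 480 / 6 = 80.
estimate_genome_size "$toy_histo" 5 6   # prints 80
```

For a real run, the first argument would be the *.histo* file, and the other two would be the same starting row and divisor used in the R expression.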
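
The k-mer counting steps from Exercise 2 can be bundled into a small wrapper script. What follows is a hypothetical sketch, not the actual *K-mer.sh* from the lab: it assumes the input is a reads file that jellyfish can read directly, and the names `run`, `kmer_histo`, `DRY_RUN`, and `reads.fasta` are invented for illustration. With `DRY_RUN=1` it only prints the commands, so it can be tried without jellyfish installed.

```shell
#!/bin/sh
# Hypothetical wrapper around the jellyfish steps; not the original K-mer.sh.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# kmer_histo READS_FILE [K]
# Counts k-mers, writes the histogram, then deletes the large count file.
kmer_histo() {
  reads=$1
  k=${2:-29}                      # default k matches the -m 29 used in the notes
  prefix=$(basename "$reads")
  prefix=${prefix%.*}.m$k         # e.g. reads.fasta -> reads.m29
  run jellyfish count -t 2 -C -s 1G -m "$k" -o "$prefix.count" "$reads"
  run jellyfish histo -o "$prefix.histo" "$prefix.count"
  run rm "$prefix.count"
}

# Dry run: print the three commands without executing them.
DRY_RUN=1
kmer_histo reads.fasta 29
```

Setting `DRY_RUN=0` (or leaving it unset) would run the commands for real, which requires jellyfish on the PATH and an existing reads file.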