
# BI278 lab2

## Exercise 1. Download public genomic data from NCBI

I selected *Taylorella equigenitalis* as my genome and downloaded it from the NCBI website. Then I copied the downloaded file from my local directory to my personal directory in BI278.

```
cp Downloads/GCF_002288025.1_ASM228802v1_genomic.fna.gz /Volumes/Personal/xsi25
```

Then I created a new folder in my home directory and moved the genome file into it:

```
cd ~
mkdir lab_02
mv /personal/xsi25/GCF_002288025.1_ASM228802v1_genomic.fna.gz ~/lab_02
```

Then I ran my shell script from last week on this file.

My shell code:

```
#!/bin/bash
grep -v ">" $1
grep -v ">" $1 | tr -d -c ATatGCgc | wc
grep -v ">" $1 | tr -d -c GCgc | wc
```

My result:

```
Binary file GCF_002288025.1_ASM228802v1_genomic.fna.gz matches
0 1 11
0 1 6
```

These are not real base counts. The file was still gzip-compressed, so `grep` only printed the "Binary file ... matches" notice, and the 11 and 6 are the A/T/G/C and G/C letters in that notice. The file needs to be decompressed (for example with `gunzip`) before the script can count bases.

### 1.2 Download raw reads via SRA Toolkit

First I tried to access the NCBI file directly from my terminal with the following commands:

```
vdb-config --interactive
fastq-dump
```

But the external service failed, so I downloaded the FASTA file and moved it to my personal directory. Then I used the *cat* command to check the content of the file.

Later, when I moved on to the task of downloading another file, I was able to download DRR322713.

My code:

```
fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/lab_02 DRR322713
```

When I checked the sizes of the files:

> -rw-r--r--. 1 xsi25 49681 1.1G Sep 20 15:01 DRR322713_pass.fasta
> -rwxrwx---. 1 xsi25 49681 482K Sep 20 13:29 GCF_002288025.1_ASM228802v1_genomic.fna.gz
> -rw-r--r--. 1 xsi25 49681 104 Sep 20 13:54 lab02.sh
> -rwxrwx---. 1 xsi25 49681 472M Sep 20 14:31 SRR17695649.fasta

## Exercise 2. Count k-mers and estimate genome size based on k-mer frequencies

1. I chose 24-mers.
2. I chose *P. terrae* as my organism for the first run:

```
jellyfish count -t 2 -C -s 1G -m 24 -o P.terrae.m24.count DRR322713_pass.fasta
```

Because the server ran out of memory, I could not complete the last step. For the following steps, I used the file from the professor (pfung.m29.histo).

```
less pfung.m29.histo #I take a look at what is inside of this file
```

For the same reason, I could not repeat these steps with another file.

### 2.2. Visualize your k-mer counts and estimate genome size in R

First, I changed the working directory to lab_02 and read the .histo file into a data frame named pfung.

```
> setwd("/home2/xsi25/lab_02/")
> getwd()
"/home2/xsi25/lab_02"
> pfung <- read.table("pfung.m29.histo", h=F, sep=" ")
```

Then I tried the following command:

```
> plot(pfung[5:250,], type="l")
[1] 154
```

Then I estimated the size of the genome by dividing the total number of k-mers (from row 5 onward) by 154:

```
> sum(pfung[5:nrow(pfung),1]*pfung[5:nrow(pfung),2])/154
[1] 9185185
```

This matches the value given in the handout.

Exercise 3 had other issues and was not meant to be done this time.
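
The size estimate in section 2.2 is the total number of k-mers (frequency times count, summed from row 5 onward) divided by the value 154. As a rough cross-check outside R, the same arithmetic can be sketched in shell with `awk`. The function name `estimate_genome_size` and its arguments are hypothetical (not from the handout), and the sketch assumes a two-column, space-separated histogram like the `.histo` file read above.

```shell
#!/bin/bash
# Sketch: genome size from a two-column k-mer histogram
# (column 1 = k-mer frequency, column 2 = number of distinct k-mers).
# estimate_genome_size is a hypothetical helper name; the caller supplies
# the first row to include and the divisor (the peak frequency).
estimate_genome_size() {
    local histo="$1" first_row="$2" peak="$3"
    awk -v first="$first_row" -v peak="$peak" '
        NR >= first { total += $1 * $2 }        # sum frequency * count from first_row onward
        END { printf "%.0f\n", total / peak }   # divide by the peak and round to an integer
    ' "$histo"
}
```

With the numbers used above, the call would be `estimate_genome_size pfung.m29.histo 5 154`. The sketch rounds to the nearest integer, so its last digit may differ from the value R prints.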