# Lab 2 Notes

First, find a genome (preferably from the list given in the handout) and download it to the computer. The organism should have both an assembled genome and raw sequence reads available on NCBI.

Using a new terminal window with the home volume mounted, move the genome sequence from the Downloads folder to my Colby home:

>mv Downloads/GCF_002861965.1_ASM286196v1_genomic.fna.gz /Volumes/Personal/kkhtut24

Make a new directory for the new documents:

>mkdir Lab_2

Go to the new file location and move the genome into the new lab folder:

>cd /home2/kkhtut24/colbyhome
>mv GCF_002861965.1_ASM286196v1_genomic.fna.gz /home2/kkhtut24/Lab_2
>cd /home2/kkhtut24/Lab_2

Copy the shell script from the previous week into the new lab folder so it can be used to analyze the genome:

>cp /home2/kkhtut24/colbyhome/gc.sh /home2/kkhtut24/Lab_2

Use nano to open the shell script and modify it if needed. The file variable in the script is set to GCF*.fna, with * acting as a wildcard, so it will automatically locate our genome sequence. Use gunzip to unzip the genome sequence first if the file name ends with .gz. Then use sh to run the script:

>sh gc.sh

gc count: 690256, atgc count: 1675648, gc percentage: 41.1934%

For the next part, find raw sequence reads for the same bacterium on NCBI, look at the number of SRA experiments, pick an experiment, and remember its SRR number. Go back to the terminal and use:

>fastq-dump -X 3 -Z [srr number]

The raw reads are likely displayed incorrectly with these options, so fix the presentation by adding options to fastq-dump. -X sets the maximum number of spots that will be shown; --split-3 splits each spot into its proper number of reads, and any leftover unpaired read goes into its own file:

>fastq-dump -X 3 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -Z SRR9331807

Now download the raw sequence reads without the quality scores:

>fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/Lab_2 SRR9331807

Locate the raw sequence files in the Lab_2 folder and rename them so the names include the organism and source for later use:

>cd /home2/kkhtut24/Lab_2
>ls
>mv SRR9331807_pass_1.fasta InfluenzaeSRP201848
>mv SRR9331807_pass_2.fasta Influenzae2SRP201848
>ls

To evaluate the properties of k-mers, we will be using the jellyfish command. To find additional options, use:

>jellyfish [command] --help

For example:

>jellyfish count --help
>jellyfish histo --help

Run jellyfish count on the sequence file we are comparing:

>jellyfish count -t 2 -C -s 1G -m 20 -o Influenzae.m20.count /home2/kkhtut24/Lab_2/InfluenzaeSRP201848

This will create a count file in the current working directory. Next, use jellyfish histo to get the frequency distribution across all k-mers:

>jellyfish histo -o Influenzae.m20.histo Influenzae.m20.count

Remove the count file:

>rm Influenzae.m20.count

Look at the histo file using cat:

>cat /home2/kkhtut24/Lab_2/Influenzae.m20.histo

The first column is occurrence, or how many times a k-mer appears in the file. The second column is density, or how many distinct k-mers exist at that occurrence level.

Using nano, write a shell script to create the histo for the other genome we downloaded:

>nano

>#!/bin/bash
>for FILE in Ecoli1*; do
>echo "$FILE"
>jellyfish count -t 2 -C -s 1G -m 20 -o "$FILE".m20.count /home2/kkhtut24$
>jellyfish histo -o "$FILE".m20.histo "$FILE".m20.count
>rm "$FILE".m20.count
>done

Using RStudio, set the working directory to the folder with the histo files:

> getwd()
[1] "/home2/kkhtut24"
> setwd("/home2/kkhtut24/Lab_2")
> getwd()
[1] "/home2/kkhtut24/Lab_2"

Create a plot using the histo files:

> Ecoli1 <- read.table("Ecoli1SRP265716.m20.histo", h=F, sep="")
> plot(Ecoli1, type="l")
> plot(Ecoli1[5:250,], type="l")
> Ecoli1[150:180,]

The middle of the peak is around 154. Summing occurrence times density over the rows and dividing by the peak occurrence estimates the size of the genome:

> sum(Ecoli1[5:nrow(Ecoli1),1]*Ecoli1[5:nrow(Ecoli1),2])/154

The estimated size is 9,185,185.

Unfortunately, the plot for my bacterium looks similar to an exponential curve, with no midpoint or peak.
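
The genome-size estimate above (sum of occurrence times density, divided by the peak occurrence) can also be computed directly in the shell. This is only a minimal sketch, assuming the histo file has the two-column layout described earlier (occurrence, then density). The function name `estimate_genome_size` and its arguments are hypothetical helpers for illustration, not part of jellyfish. Note one difference from the R call: R skips the first four rows by row index, while this sketch filters on the occurrence value itself, which gives the same result only when the low occurrence levels are all present in the file.

```shell
#!/bin/bash
# Hypothetical helper: estimate genome size from a jellyfish histo file.
# Arguments:
#   $1 = histo file (column 1: occurrence, column 2: density)
#   $2 = occurrence value at the middle of the peak (read off the plot)
#   $3 = lowest occurrence to include (default 5, to skip likely error k-mers)
estimate_genome_size() {
    local histo="$1" peak="$2" min_occ="${3:-5}"
    awk -v peak="$peak" -v min="$min_occ" '
        # Add up occurrence * density for every row at or above the cutoff
        $1 >= min { total += $1 * $2 }
        # Divide the total k-mer count by the peak occurrence
        END { printf "%.0f\n", total / peak }
    ' "$histo"
}

# Example call (file name and peak taken from the notes above):
# estimate_genome_size Ecoli1SRP265716.m20.histo 154 5
```

If the histogram has no clear peak, as with the exponential-looking curve described above, there is no sensible value to pass as the second argument, so the estimate cannot be trusted for that dataset.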