# Lab 2 Notes
First, find a genome (preferably from the list given in the handout) and download it to the computer. The organism should have both an assembled genome and raw sequence reads available on NCBI.
Using a new terminal window (with the home volume mounted), move the genome sequence from the Downloads folder to my Colby home using
>mv Downloads/GCF_002861965.1_ASM286196v1_genomic.fna.gz /Volumes/Personal/kkhtut24
make a new directory for new documents
>mkdir Lab_2
access new file location and move genome to new lab folder
>cd /home2/kkhtut24/colbyhome
>mv GCF_002861965.1_ASM286196v1_genomic.fna.gz /home2/kkhtut24/Lab_2
>cd /home2/kkhtut24/Lab_2
copy shell script from previous week into the new lab folder for usage in analyzing genome
>cp /home2/kkhtut24/colbyhome/gc.sh /home2/kkhtut24/Lab_2
use nano to open shell script and modify for usage (if needed)
variable setting for shell script is at GCF*.fna, using the * as a wildcard, so it will automatically locate our genome sequence
use gunzip to unzip genome sequence if needed (will be needed if file ends with .gz)
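A minimal sketch of the wildcard and gunzip steps above. The file name `GCF_toy_genomic.fna` and the temporary working directory are assumptions made so the example runs on its own; with the real download, the loop would be run inside Lab_2.

```shell
#!/bin/sh
# Work in a scratch directory so the sketch does not touch real lab files.
workdir=$(mktemp -d)
cd "$workdir"

# Toy compressed file standing in for the downloaded genome (hypothetical name).
printf '>toy_sequence\nATGC\n' > GCF_toy_genomic.fna
gzip -f GCF_toy_genomic.fna

# Unzip every file matching the wildcard; the -e test skips the case
# where nothing matches and the pattern is left unexpanded.
for f in GCF*.fna.gz; do
    if [ -e "$f" ]; then
        gunzip -f "$f"
    fi
done

# The GCF*.fna wildcard used by the shell script now finds the unzipped file.
genome_files=$(ls GCF*.fna)
echo "$genome_files"
```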
use sh to run the shell script
>sh gc.sh
gc count: 690256, atgc count 1675648
gc percentage: 41.1934%
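As a check on the arithmetic, here is a rough sketch of how a GC percentage can be computed from a FASTA file. The helper `gc_percent` and the file `toy.fna` are hypothetical and are not the gc.sh script from the previous week; the sketch assumes header lines start with `>`.

```shell
#!/bin/sh
# Hypothetical helper: GC percentage of a FASTA file.
gc_percent() {
    # Drop header lines, then count G/C bases and all A/T/G/C bases (either case).
    gc=$(grep -v '^>' "$1" | tr -cd 'GCgc' | wc -c)
    atgc=$(grep -v '^>' "$1" | tr -cd 'ATGCatgc' | wc -c)
    # awk does the floating-point division; the shell only does integer math.
    awk -v gc="$gc" -v atgc="$atgc" 'BEGIN { printf "%.4f\n", 100 * gc / atgc }'
}

# Toy FASTA file (made-up sequence) so the sketch runs on its own.
printf '>toy_sequence\nATGCGC\nATAT\n' > toy.fna
toy_gc=$(gc_percent toy.fna)
echo "toy gc percentage: $toy_gc%"

# Same division applied to the counts reported above.
lab_gc=$(awk 'BEGIN { printf "%.4f\n", 100 * 690256 / 1675648 }')
echo "lab gc percentage: $lab_gc%"
```

The toy sequence has 4 G/C bases out of 10, so the first line reports 40.0000%.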
For the next part, find raw sequence reads for the same bacterium on NCBI: check the number of SRA experiments, pick an experiment, and remember the SRR number
go back to terminal, and use
>fastq-dump -X 3 -Z [srr number]
Likely, the raw reads are presented incorrectly, so we need to fix the presentation of the data by modifying the download options of fastq-dump
-X sets the maximum number of spots that will be shown; --split-3 splits each spot into its proper reads, and any leftover unpaired read goes into its own file
>fastq-dump -X 3 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -Z SRR9331807
now download the raw sequence read without the quality scores
>fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/Lab_2 SRR9331807
locate the raw sequence files in the Lab_2 folder and rename them to include the organism and source for later use
>cd /home2/kkhtut24/Lab_2
>ls
>mv SRR9331807_pass_1.fasta InfluenzaeSRP201848
>mv SRR9331807_pass_2.fasta Influenzae2SRP201848
>ls
to evaluate the properties of k-mers, we will use the jellyfish command; to see the options for a subcommand, use
>jellyfish [command] --help
for example:
>jellyfish count --help
or
>jellyfish histo --help
use jellyfish count on the genome we are comparing
> jellyfish count -t 2 -C -s 1G -m 20 -o Influenzae.m20.count /home2/kkhtut24/Lab_2/InfluenzaeSRP201848
this will create a count file in your current working directory
next, use jellyfish histo to get the frequency distribution across all k-mers
> jellyfish histo -o Influenzae.m20.histo Influenzae.m20.count
remove the count file
>rm Influenzae.m20.count
look at histo using cat
>cat /home2/kkhtut24/Lab_2/Influenzae.m20.histo
The first column is occurrence, or how many times a k-mer appears in the file, and the second column is density, or how many distinct k-mers exist at that occurrence level
using nano, write a shell script to create the histo for the other genome we downloaded
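To make the two columns concrete, here is a small sketch that reads a histo-style file with awk. The file `toy.histo` and its numbers are made up for illustration and are not output from the lab.

```shell
#!/bin/sh
# Toy histogram in the two-column layout: occurrence, then density (made-up numbers).
cat > toy.histo <<'EOF'
1 900
2 300
3 40
4 10
5 12
6 30
7 55
8 31
9 11
EOF

# Distinct k-mers: the sum of the density column.
distinct_kmers=$(awk '{ distinct += $2 } END { print distinct }' toy.histo)
# Total k-mer instances: each row contributes occurrence * density.
total_kmers=$(awk '{ total += $1 * $2 } END { print total }' toy.histo)

echo "distinct k-mers: $distinct_kmers"
echo "total k-mers: $total_kmers"
```

In this toy file, the large densities at occurrence 1 and 2 play the role of rare k-mers that mostly come from sequencing errors, while the bump around occurrence 7 plays the role of the true coverage peak.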
>nano
#!/bin/bash
for FILE in Ecoli1*; do
    echo "$FILE"
    jellyfish count -t 2 -C -s 1G -m 20 -o "$FILE".m20.count /home2/kkhtut24$
    jellyfish histo -o "$FILE".m20.histo "$FILE".m20.count
    rm "$FILE".m20.count
done
Using RStudio, set the working directory to the folder with the histo files
> getwd()
[1] "/home2/kkhtut24"
> setwd ("/home2/kkhtut24/Lab_2")
> getwd ()
[1] "/home2/kkhtut24/Lab_2"
Create a plot using the histo files
> Ecoli1 <- read.table("Ecoli1SRP265716.m20.histo", h=F, sep="")
> plot (Ecoli1, type="l")
> plot (Ecoli1 [5:250,], type="l")
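> Ecoli1[150:180,]
the midpoint of the peak is at around 154
summing occurrence times density (from row 5 onward, to skip the low-occurrence error k-mers) and dividing by the peak position estimates the size of the genome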
>sum(Ecoli1[5:nrow(Ecoli1),1]*Ecoli1[5:nrow(Ecoli1),2])/154
the estimated size is 9,185,185
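The same estimate can be sketched in the shell with awk. The file `toy_size.histo` and its numbers are made up, and the peak here is picked automatically as the row with the highest density from row 5 onward, which is an assumption; in the notes the peak was read off the plot by eye.

```shell
#!/bin/sh
# Toy histogram (made-up numbers): occurrence in column 1, density in column 2.
cat > toy_size.histo <<'EOF'
1 900
2 300
3 40
4 10
5 12
6 30
7 55
8 31
9 11
EOF

# Peak position: occurrence with the highest density, skipping the first 4 rows.
peak=$(awk 'NR >= 5 && $2 > best { best = $2; peak = $1 } END { print peak }' toy_size.histo)

# Genome size estimate: sum of occurrence * density from row 5 onward, divided by the peak.
estimate=$(awk -v peak="$peak" 'NR >= 5 { total += $1 * $2 } END { printf "%.0f\n", total / peak }' toy_size.histo)

echo "peak occurrence: $peak"
echo "estimated genome size: $estimate"
```

Skipping the first rows matters because the low-occurrence k-mers are dominated by sequencing errors and would inflate the total.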
unfortunately, the plot for my bacterium looks similar to an exponential curve, with no midpoint or peak.