Lab 2 Notes - HackMD

# Lab 2 Notes ## Intro I'm going to have to contiue editing the table from last week's notes. This will require the recollection of most opf the commands used this past week. ## 1 The aim of this section is to learn how to use an editor to make the process of coding more efficient. Editors allow you to create shell scripts which combine multiple commands into one, in order to make it easier to code. Before starting a nano its important to note that Where your script is is where you are when you opened nano. This location and where your files are need to work together. So today, our script should not be in the same directory as the genome files above. Code for an Editor: `nano` Code to Leave an Editor Window: Typed [Control + X] Using an Editor Window: First tell it this is a shell script Command to declare bash shell script: `#!/bin/bash` You need to treat the nano command like all other commands. However, when you are inputing a pathway you put the pathway and in place of the filename but $1 Example `/courses/bi278/Course_Materials/lab_01b/$1` Actually Building our Nano: Order needed: 1. get lines not the headers 2. translate ATGC 3. then word count Get the size of the Genome: `grep -v ">" /courses/bi278/Course_Materials/lab_01b/$1 | tr -d -c ATGC| wc -c` Get the Size of just C and G characters: `grep -v ">" /courses/bi278/Course_Materials/lab_01b/$1 | tr -d -c CG| wc -c` These lines works because you are isolating the lines with the actual sequnece by using `grep -v ">"` This command tells it to isolate the lines that dont include >. Next the command `tr -d -c ATGC` tells the computer to delete `-d` anything that isn't ACTG (the -c specifies the complement i.e. what you want to keep) Then By Hand Calculate the percent composition or I can use the command `awk 'BEGIN {print (CG#/Total#)}'` I can also use this Command to get the Strain and Chromosome Information `grep ">" /courses/bi278/Course_Materials/lab_01b/$1` so my completed nano code looks like this: `#!/bin/bash grep -v ">" /courses/bi278/Course_Materials/lab_01b/$1 | tr -d -c ATGC| wc -c grep -v ">" /courses/bi278/Course_Materials/lab_01b/$1 | tr -d -c CG| wc -c grep ">" /courses/bi278/Course_Materials/lab_01b/$1` Using My Nano File `sh nanoname` executes the shell. However, since we used `1$` in our pathway, we need to direct it to a new fasta file. In order to do this you use the command `sh FILENAME FASTAFILE` Using this I was able to fill the table below | Organism | Strain | Contig count | Genome size (bp) | GC% | | -------- | -------- | -------- | -------- | -------- | | *P. agricolaris* | Baqs159 | | 8721420 | 61.63% | | *P. bonniea* | Bbqs859 | | 4098182 | 58.72% | | *P. bonniea* | Bbqs395 | | 4009285|58.80% | | *P. bonniea* | Bbqs433 | | 4013203 | 58.78% | | *P. fungorum* | ATCC BAA-463 | | 9058983 | 61.75% | | *P. hayleyella* | Bhqs11 | | 4125700 | 59.24% | ## 2 #### 2.1 Downloading Data Downloading geenetic data from NCBI. Go to the following link https://www.ncbi.nlm.nih.gov/bioproject/browse When you search in here it is helpful to narrow down your search as much as possible. I'm choosing to look into "Aeromonas salmonicida" I first searched with the double quotes Then select either from [Genome sequencing] or [Raw sequence reads] using the Data Type filter, and [Monoisolate] from Scope, and [has publication] in Property. following this then click on the prj link. Then click on the genome link which is in the olive box. Next click download, and download it as a FASTA file, and selecte refseq as the source. Then you can manually upload this document into your colby home by dragging and droping. Once the document is manually uploaded create a new file for this week in your colby home, I droped it into my folder so i should see it when typing in `ls /courses/bi278/tsbroa25` Next I used this command `mkdir` to make a new folder for this weeks lab and move the document into it using the `mv` command After this I used the following commands to find that the genome of Aeromonas salmonicida has a length of 4954811 and a GC% composition of 58.65%. #### 2.2 Download raw sequencing reads with sra toolkit I had to use a different species as Aeromonas salmonicida doesn't have published raw data available. Instead I'm using Burkholderia pseudomallei. However, I'm using the other read available not the same one as Professor Noh. The number I'm using is SRR3183199 Before Downloading I am Going to configure my terminal i start by typing in `vdb-config -i` Navigating this area is bizzare, you click on the letter that correponds to the red letter on the button you want to click This software is critical to reading the NCBI raw data so it must be downloaded first. Now that I have ensured my software is set to go, I can retireve the SRR file number from NCBI and use it in my terminal to get the data I need. I Found the SRR number: SRR3183199 Next I typed in `vdb-dump --info SRR8460388` to get info on the read. After reciving info I typed in `ffasterq-dump -p --fasta --outdir ~/lab_02 SRR3183199` to download the data into my folder. I then used this command, `lab_02/SRR3183199*.fasta` which let mme look at the first few lines of the file. Notice the system automatically added ".fasta" to the end of the file. Next I used this command to check the file size: `ls ~/colbyhome/lab_02/SRR3183199.fasta -lh` if the file was more than 1gb I would've split it using the following command `split -b 1G` ## 3 ### 3.1 Jellyfish Jellyfish is the name of the program used to count Kmers in a DNA sequence. In particular we will be using the command `jellyfish count` and `jellyfish count` If you want more info on these commands go to this site https://genome.umd.edu/docs/JellyfishUserGuide.pdf I'm choosing a K-Mer size of 21 for the rest of this excercise. Use the following command to run Jellyfish to count the Kmers, to Change the size of the Kmer you want change the number after m `jellyfish count -t 2 -C -s 1G -m NUMBER -o NAME.m29.count INPUTFILE` So for my particular example i'm using `jellyfish count -t 2 -C -s 1G -m 21 -o kmertest.m21.count ~/colbyhome/lab_02/SRR3183199.fasta` You wont actually get a result from this but after you run the histo command as follows: `jellyfish histo -o kmertest.m21.count kmertest.m21.count` I messed up a bit here and actually wrote my history file over my kmer count file but typically you would want this to look like this `jellyfish histo -o kmertest.m21.histo kmertest.m21.count` After this use `cat` or `less` followed my the file name to visualize the results. From professor Noh on visualizing the results: "Your file will show in two columns: occurrence (or frequency, how often a k-mer appears in that file) and density (how many k-mers show at that level of occurrence). You’ll notice there are a lot of k-mers that occur exactly once (first line of the file). These are most likely to be the result of errors. This concept is used to remove reads from sequencing read datasets that are likely errors. Quake is one software that does this." Now we can Take this data and throw it into R ## 3.2 R Code We will be using r for visualization and some data analysis, so I need to be familiar with it. Here is a recommended webpage to get familiar with it http://www.cookbook-r.com To access r we will leave the terminal and go to bi278.colby.edu When you first enter R you need to set the working directory. You do this using the following command `getwd() setwd("/home2/COLBYID/lab_02/")` Make sure that the file you are attempting to read is in the working directory. You need to then upload the results files manually using the Name and read.table command see the example below: `NAME <- read.table("NAME.m29.histo", h=F, sep=" ")` Next rename the collumns using this colnames(KmerTest) <- c("occurrence","density") Next use this to plot the information: `plot(KmerTest[5:250,], type="l")` Note that the 5:250 is so the data at kmer=1 is cutoff since there is a very high frequency there. "Now that You have this zoom in on the peak area by adjusting the x axis You should have peak within this region. If not, play around with the occurrence interval. 5:250 is looking at occurrence (x-axis) values from 5 to 250." Next have R check the highest Occurence Density using the following code: `KmerTest[KmerTest[,2]==max(KmerTest[5:250,2]),]` "This occurrence point with the maximum density is your estimate of genome coverage. The assumption is that most k-mers should be present only once in a genome. What we have done just now is to find how often most k-mers occur in this raw sequencing dataset." Next you can take the area under the curve and devide it by the coverage estimate to get an estimated genome size. X axis position of your peak is coverage positio To do this I used this command `nrow(KmerTest) head(KmerTest) head(KmerTest[,1]) head(KmerTest[,2]) head(KmerTest[5:nrow(KmerTest),]) head(KmerTest[5:nrow(KmerTest),1] * KmerTest[5:nrow(KmerTest),2]) sum(KmerTest[5:nrow(KmerTest),1] * KmerTest[5:nrow(NAKmerTest),2]) sum(KmerTest[5:nrow(KmerTest),1] * KmerTest[5:nrow(KmerTest),2]) / COVERAGE` ### General New Things I've learned this week `-v` takes the opposite, So for example `grep ">"` shows all lines with >, and `grep -v ">"` Shows all lines without > `grep` Counts lines `wc -c` Counts the Characters Control C always gets you out of issues ``