# BI278 lab notebook 2 #
### Genomic data ###
Go to the NCBI website to browse the genomic data.
https://www.ncbi.nlm.nih.gov/bioproject/browse
Look up *Haemophilus influenzae* in the search with the Data type filters set to 'Genome sequencing ' and 'Raw sequence reads'.
* Found PRJDB7052 with Project title "Invasive non-typeable *Haemophilus influenze* in Japanese infants" which has the *Haemophilus influenzae* genome sequence.
* Click on it and click the 'see genome'

* Click on the 'genome' link to download sequences in FASTA format
* Copy the file from Downloads to colbyhome:
`cp Downloads/GCF_000931575.1_ASM93157v1_genomic.fna.gz /Volumes/Personal/tywang23`
* Then ssh into bi278 home by `ssh tywang23@bi278`
* In home(`/home2/tywang23`), make a new folder: `mkdir lab_02`
* Move the genome file into the lab_02 folder
`mv /personal/tywang23/GCF_000931575.1_ASM93157v1_genomic.fna.gz ~/lab_02`
> Since the files from lab1 are in `/home/students/t/tywang23`, the scripts need to be moved to the current `home2/tywang23` directory
> `mv /home/students/t/tywang23/lab* ~`
* Go to lab_02 directory and unzip the file by `gunzip GCF_000931575.1_ASM93157v1_genomic.fna.gz`
* Edit the GC_calc.sh in lab_01a so that the filepath taken in is:
```
grep -v ">" ./$1 | wc -c
grep -v ">" ./$1 | tr -d -c GCgc | wc -c
```
* Then run the GC_calc.sh on the genome file: `sh lab_01a/GC_calc.sh lab_02/GCF_000931575.1_ASM93157v1_genomic.fna`
> Total number of bases = 1869338
> GC bases = 704485
> GC % = 0.3769
### Download raw reads vis SRA toolkit ###
Using the NCBI search again, find a raw sequence reads file for *Haemophilus influenzae*:
* 'PRJDB9304' with project title "Whole genome sequence of *Haemophilus influenzae* serotype A in Japan"
* Click on it and find the SRA experiments number of links: it has 2 so click on it and then choose "Illumina MiSeq paired end sequencing of SAMD00204789"
* The run id is shown in the table under run : DRX233551
* Size of the file: 139Mb
* length of sequencing reads: paired-end 154 base pair DNA sequence
* Run SRA toolkit by using command `vdb-config --interactive`
* Use command `fastq-dump` for information on how to use it
* Check how the downloaded file may look like and have the first three spots printed to the screen
`fastq-dump -X 3 –Z SRRrun_number` so use `fastq-dump -X 3 –Z DRR243751`
* These show reads of length 312. Since both reads are combined and we want the pairs separated, use command
`fastq-dump -X 3 -–split-3 --skip-technical --readids --read-filter
pass --dumpbase --clip -Z DRR243751`
* The read lenghts are now 156 which is close ot the average 154 length
* Units of 4 lines at a time indicate for a single read the DNA sequence (marked by @) and individual base quality scores (marked by +)
Now download raw sequencing reads for the *Haemophilus influenzae*, without the quality scores:
```
fastq-dump -–split-3 --skip-technical --readids --read-filter pass -- dumpbase --clip -v --fasta default --outdir ~/lab_02 DRR243751
```
* Look at the firsts 20 lines of the FASTA file `head -20 lab_02/*.fasta`
Make sure the files look like how you expect it to look
Use the same steps to download raw sequencing reads for *P. sprentiae* recorded with Illumina GA IIx and run id `SRR3927471` and genome size of 7,829,542bp.
```
fastq-dump -X 3 –Z SRR3927471
fastq-dump -–split-3 --skip-technical --readids --read-filter pass -- dumpbase --clip -v --fasta default --outdir ~/lab_02 SRR3927471
```
For both organisms, there are more than one files generated. So check the file sizes using `ls -lh lab_02/` and keep the largest file which are the pass1 files.
Rename the files using the `mv` command so that the file names have the organism names, and source.
```
mv lab_02/DRR243751_pass_1.fasta lab_02/H.influenzae_sra.fasta
mv lab_02/SRR3927471_pass_1.fasta lab_02/P.sprentiae_sra.fasta
```
### Counting K-mers ###
Using a program called `jellyfish` to count the kmers in our genomes.
Choose a kmer size: 20
Change working directory: `cd lab_02`
* `jellyfish count -t 2 -C -s 1G -m 20 -o H.influenzae_sra.m20.count H.influenzae_sra.fasta`
* `jellyfish histo -o H.influenzae_sra.m20.histo H.influenzae_sra.m20.count`
* Look at the histo file using `less H.influenzae_sra.m20.histo` then type `q` to quit out of it
* There should be two columns of data with many numbers
* Then `rm H.influenzae_sra.m20.count`
* Create shell script `sra_histo.sh` to automate and do the same for the other file with the commands:
```
jellyfish count -t 2 -C -s 1G -m 20 -o $1.m20.count $1.fasta
jellyfish histo -o $1.m20.histo $1.m20.count
rm $1.m20.histo
```
* Run `sh sra_histo.sh P.sprentiae_sra`
### Visualizing kmers in R studio ###
Open the R studio server for bi278 by going to bi278.colby.edu
In the console:
`getwd()` to get the working directory
`setwd("/home2/tywang23/lab_02")` to set the working directory to the directory with the histo files
Read in the the result files created:
`H.influenzae.histo <- read.table("H.influenzae_sra.m20.histo, h=F, sep=" ")`
This creates a table named `H.influenzae.histo` in the environment
`'P.sprentiae.histo' <-read.table("P.sprentiae_sra.m20.histo", h=F, sep=" ")`
#### Estimate genome size for H.influenzae_sra.m20.histo ###
Check the shape of the histogram:
* `plot(H.influenzae.histo, type="l")`
* There is a initial peak that represents kmers that are the result of sequencing errors. We want to find the midpoint of the main peak so: `plot(H.influenzae.histo[5:250], type="l")`
* This zooms into the [5:250,] x-axis

* The peak seems to be between [40:100] on the x-axis so: `H.influenzae.histo[40:100,]` gives us the values of the density and the highest density is 59519 at x=62
Commands:
* `nrow(name)` gives the number of rows in the dataset
* `head(name)` gives the first few rows of the dataset
* `name[r,c]` limits the rows and columns to r and c
* `name[,1]` will give all the values in the rows of column 1
* `name[5:10,]` gives the values in all columns, between rows 5 and 10 (inclusive)
* `sum()` gives the sum of the values inserted to the function
Using these commands find an estimate for the genome size:
`sum(name[5:nrow(H.influenzae.histo),1]*H.influenzae.histo[5:nrow(H.influenzae.histo),2])/62` = 1861993
#### Estimate genome size for P.sprentiae_sra.m20.histo ###
Check the shape of the histogram:
* `plot(P.sprentiae.histo, type="l")`
* Try finding the peak and then `plot(P.sprentiae.histo[100:800,], type="l")`

* The peak seems to be between [200:350] on the x-axis so: `P.sprentiae.histo[200:350,]` gives us the values of the density and the highest density(55016) is at 246
* Find an estimate for the genome size:
`sum(P.sprentiae.histo[5:nrow(P.sprentiae.histo),1]*P.sprentiae.histo[5:nrow(P.sprentiae.histo),2])/246` = 1861993
|Organism|SRA instrument [type] record|SRA run number|Genome size(bp)|Estimated size|
| -------- | -------- | --- | --- | -------- |
|*P. sprentiae*|Illumina GA IIx|SRR3927471|7,829,542|8,352,846|
|*H. influenzae*|Illumina MiSeq|DRX23355|(1,830,000) |1,861,993|