BI278 lab notebook 2

# BI278 lab notebook 2 # ### Genomic data ### Go to the NCBI website to browse the genomic data. https://www.ncbi.nlm.nih.gov/bioproject/browse Look up *Haemophilus influenzae* in the search with the Data type filters set to 'Genome sequencing ' and 'Raw sequence reads'. * Found PRJDB7052 with Project title "Invasive non-typeable *Haemophilus influenze* in Japanese infants" which has the *Haemophilus influenzae* genome sequence. * Click on it and click the 'see genome' ![](https://i.imgur.com/cp5YqrH.png) * Click on the 'genome' link to download sequences in FASTA format * Copy the file from Downloads to colbyhome: `cp Downloads/GCF_000931575.1_ASM93157v1_genomic.fna.gz /Volumes/Personal/tywang23` * Then ssh into bi278 home by `ssh tywang23@bi278` * In home(`/home2/tywang23`), make a new folder: `mkdir lab_02` * Move the genome file into the lab_02 folder `mv /personal/tywang23/GCF_000931575.1_ASM93157v1_genomic.fna.gz ~/lab_02` > Since the files from lab1 are in `/home/students/t/tywang23`, the scripts need to be moved to the current `home2/tywang23` directory > `mv /home/students/t/tywang23/lab* ~` * Go to lab_02 directory and unzip the file by `gunzip GCF_000931575.1_ASM93157v1_genomic.fna.gz` * Edit the GC_calc.sh in lab_01a so that the filepath taken in is: ``` grep -v ">" ./$1 | wc -c grep -v ">" ./$1 | tr -d -c GCgc | wc -c ``` * Then run the GC_calc.sh on the genome file: `sh lab_01a/GC_calc.sh lab_02/GCF_000931575.1_ASM93157v1_genomic.fna` > Total number of bases = 1869338 > GC bases = 704485 > GC % = 0.3769 ### Download raw reads vis SRA toolkit ### Using the NCBI search again, find a raw sequence reads file for *Haemophilus influenzae*: * 'PRJDB9304' with project title "Whole genome sequence of *Haemophilus influenzae* serotype A in Japan" * Click on it and find the SRA experiments number of links: it has 2 so click on it and then choose "Illumina MiSeq paired end sequencing of SAMD00204789" * The run id is shown in the table under run : DRX233551 * Size of the file: 139Mb * length of sequencing reads: paired-end 154 base pair DNA sequence * Run SRA toolkit by using command `vdb-config --interactive` * Use command `fastq-dump` for information on how to use it * Check how the downloaded file may look like and have the first three spots printed to the screen `fastq-dump -X 3 –Z SRRrun_number` so use `fastq-dump -X 3 –Z DRR243751` * These show reads of length 312. Since both reads are combined and we want the pairs separated, use command `fastq-dump -X 3 -–split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -Z DRR243751` * The read lenghts are now 156 which is close ot the average 154 length * Units of 4 lines at a time indicate for a single read the DNA sequence (marked by @) and individual base quality scores (marked by +) Now download raw sequencing reads for the *Haemophilus influenzae*, without the quality scores: ``` fastq-dump -–split-3 --skip-technical --readids --read-filter pass -- dumpbase --clip -v --fasta default --outdir ~/lab_02 DRR243751 ``` * Look at the firsts 20 lines of the FASTA file `head -20 lab_02/*.fasta` Make sure the files look like how you expect it to look Use the same steps to download raw sequencing reads for *P. sprentiae* recorded with Illumina GA IIx and run id `SRR3927471` and genome size of 7,829,542bp. ``` fastq-dump -X 3 –Z SRR3927471 fastq-dump -–split-3 --skip-technical --readids --read-filter pass -- dumpbase --clip -v --fasta default --outdir ~/lab_02 SRR3927471 ``` For both organisms, there are more than one files generated. So check the file sizes using `ls -lh lab_02/` and keep the largest file which are the pass1 files. Rename the files using the `mv` command so that the file names have the organism names, and source. ``` mv lab_02/DRR243751_pass_1.fasta lab_02/H.influenzae_sra.fasta mv lab_02/SRR3927471_pass_1.fasta lab_02/P.sprentiae_sra.fasta ``` ### Counting K-mers ### Using a program called `jellyfish` to count the kmers in our genomes. Choose a kmer size: 20 Change working directory: `cd lab_02` * `jellyfish count -t 2 -C -s 1G -m 20 -o H.influenzae_sra.m20.count H.influenzae_sra.fasta` * `jellyfish histo -o H.influenzae_sra.m20.histo H.influenzae_sra.m20.count` * Look at the histo file using `less H.influenzae_sra.m20.histo` then type `q` to quit out of it * There should be two columns of data with many numbers * Then `rm H.influenzae_sra.m20.count` * Create shell script `sra_histo.sh` to automate and do the same for the other file with the commands: ``` jellyfish count -t 2 -C -s 1G -m 20 -o $1.m20.count $1.fasta jellyfish histo -o $1.m20.histo $1.m20.count rm $1.m20.histo ``` * Run `sh sra_histo.sh P.sprentiae_sra` ### Visualizing kmers in R studio ### Open the R studio server for bi278 by going to bi278.colby.edu In the console: `getwd()` to get the working directory `setwd("/home2/tywang23/lab_02")` to set the working directory to the directory with the histo files Read in the the result files created: `H.influenzae.histo <- read.table("H.influenzae_sra.m20.histo, h=F, sep=" ")` This creates a table named `H.influenzae.histo` in the environment `'P.sprentiae.histo' <-read.table("P.sprentiae_sra.m20.histo", h=F, sep=" ")` #### Estimate genome size for H.influenzae_sra.m20.histo ### Check the shape of the histogram: * `plot(H.influenzae.histo, type="l")` * There is a initial peak that represents kmers that are the result of sequencing errors. We want to find the midpoint of the main peak so: `plot(H.influenzae.histo[5:250], type="l")` * This zooms into the [5:250,] x-axis ![](https://i.imgur.com/NiNBeQV.png) * The peak seems to be between [40:100] on the x-axis so: `H.influenzae.histo[40:100,]` gives us the values of the density and the highest density is 59519 at x=62 Commands: * `nrow(name)` gives the number of rows in the dataset * `head(name)` gives the first few rows of the dataset * `name[r,c]` limits the rows and columns to r and c * `name[,1]` will give all the values in the rows of column 1 * `name[5:10,]` gives the values in all columns, between rows 5 and 10 (inclusive) * `sum()` gives the sum of the values inserted to the function Using these commands find an estimate for the genome size: `sum(name[5:nrow(H.influenzae.histo),1]*H.influenzae.histo[5:nrow(H.influenzae.histo),2])/62` = 1861993 #### Estimate genome size for P.sprentiae_sra.m20.histo ### Check the shape of the histogram: * `plot(P.sprentiae.histo, type="l")` * Try finding the peak and then `plot(P.sprentiae.histo[100:800,], type="l")` ![](https://i.imgur.com/vFRinFp.png) * The peak seems to be between [200:350] on the x-axis so: `P.sprentiae.histo[200:350,]` gives us the values of the density and the highest density(55016) is at 246 * Find an estimate for the genome size: `sum(P.sprentiae.histo[5:nrow(P.sprentiae.histo),1]*P.sprentiae.histo[5:nrow(P.sprentiae.histo),2])/246` = 1861993 |Organism|SRA instrument [type] record|SRA run number|Genome size(bp)|Estimated size| | -------- | -------- | --- | --- | -------- | |*P. sprentiae*|Illumina GA IIx|SRR3927471|7,829,542|8,352,846| |*H. influenzae*|Illumina MiSeq|DRX23355|(1,830,000) |1,861,993|