BI278 Lab 2: NCBI data, kmers, and prokka

# BI278 Lab 2: NCBI data, kmers, and prokka 1.11 find bacteria that has both a genome and raw sequence reads through NCBI (https://www.ncbi.nlm.nih.gov/bioproject/browse) --> Actinobacillus pleuropneumoniae strain:1022 Can search through based on type of organism, type of sequence, etc 1.12 download onto desktop 1.13 copy over to colbyhome and use in bi278 go to finder, type command+K, enter smb://filer.colby.edu and copy the file into colby open a new terminal: shell--> new window cp Downloads/filename /Volumes/Personal/cgsnow25 1.14 make a new folder in home ssh bi278 mkdir lab_02 mv /personal/cgsnow25/filename ~/lab_02 1.15 use shell script from last week to collect genome stats on new genome nano GC%_genomesize.sh change file location /home2/cgsnow25/lab02/ move to lab_02 (cd lab_02). then gunzip filename Then I moved GC%_genomesize.sh to lab_02 by using the copy command: cp GC%_genomesize.sh lab_02 Then, I ran sh GC%_genomesize.sh filename I got 2325526 (genome size) and 957813 (GC content) 1.2 Download raw reads via SRA toolkit go to https://www.ncbi.nlm.nih.gov/bioproject/?term(txid2697049%5BOrganism%3Anoexp%5D+AN Found NYU langone SARS CoV 2 data. The info that I need to download is the run ID We will use a command line program called SRAtoolkit, which is intalled in bi278 Can use SRAtoolkit to command fastq-dump (software written to interact with NCBI SRA database) fastq-dump --help #for more info fastq-dump -X 3 -Z SRR12379546 Supposed to be two 101 base pair-length sequences per "spot". in this output, they're being combined (how NCBI stores data), and we need to download the data with the pairs seperated, so we add commmands to fastq- dump fastq-dump -X 3 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -Z SRR12379546 The length before was 302, and now it's 151 (better). We had to check in order to avoid bad data (by cecking whether the traces and the metadata of a sample of the files match up) 6.Download raw sequencing reads without quality scores for the bacteria I just got a genome for, by typing command fastq-dump -X 3 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/lab_02 SRR12379546 7.take a look at the first 20 lines of the FASTA files using head head -20 PATH/*.fasta head -20 *.fasta #don't need to include path because we're already in lab_02 #check ending of .fasta to make sure it matches what you're looking for 8.also download raw sequencing reads from one of the other organisms | Organism | SRA instrument type | SRA run number | Genome size (bp) | Estimated size | | ------------- | --------------------- | -------------- | --- | --- | | P. fungorum | Illumina NovaSeq 6000 | SRR11022347 | 9,058,983 | 9,185,185 | | P. sprentiae | Illumina GA iix | SRR3927471 | 7,829,542 | | | P. terrae | PacBio RSII | DRR322713 | 10,062,489 | | | P. xenovorans | Illumina HiSeq 2000 | SRR2889773 | 9,702,951 | | | A. Pleuropeumoniae | Illumina HiSeq 2500 | SRR5931780 | 2,325,526 | | 9.If you have a second file beause of the paired end reads, get rid of it using rm rm SRR12379546_pass_1.fasta rm SRR5931780_pass_1.fasta 10.rename each file so the filename reflects the organism and the source (SRA) for your own convenience using mv mv GCF_900638445.1_57675_E01_genomic.fna a.pleuropeumoniaeSRA Excersize 2: Count k-mers and estimate genome size based on k-mer frequencies 2.1 Count kmers with jellyfish k-mers are DNA words of length k and a useful way of characterizing long sequences. the k that you select will allow different degrees of detail in k-mer profile comparisons The program that we use to count k-mers is called jellyfish. the commands are jellyfish count jellyfish histo # can type in jellyfish count --help and jellyfish histo --help to see how they work User manual under http://www.genome.umd.edu/docs/JellyfishUserGuide.pdf 1.choose k-mer size between 19 and 31 (25). Number of k-mers you can make gets quickly unwieldy, but some k-mers are too small so there won't be much difference between one genome and another. Larger k-mers are more specific but will take longer to count. 2.Run jellyfish count for a genome to compare. jellyfish count -t 2 -C -s 1G -m 20 -o name.m20.count input_file jellyfish count -t 2 -C -s 1G -m 20 -o a.pleuropeumoniaeSRA.m20.count a.pleuropeumoniaeSRA 3.now run jellyfish histo on the count file to get the frequency distribution across all k-mers (and remove a.pleuropeumoniaeSRA.m20.count because we don't need it anymore) jellyfish histo -o a.pleuropeumoniaeSRA.m20.histo a.pleuropeumoniaeSRA.m20.count rm a.pleuropeumoniaeSRA.m20.count 4.Take a look at this histo file. you have two choices: cat (will print the whole file to your screen) or less (will show you one screen and line at a time; type q to exit out) cat B.PseudomalleiIllumina.histo 5.Assuming that the code worked, write a shell script to do the same for other SRA downloaded file. Include the rm command in the script nano jellyfish count -t 2 -C -s 1G -m 20 -o $1.count $1 jellyfish histo -o $1.histo $1.count rm $1.count named: columnoccurrence.sh ** Didn't work because of memory issue 2.2 Visualize your kmer counts and estimate genome size in R R is a statistical package that people can use to create "packages" and "libraries". We will interface R using a GUI called RStudio Resource for using R: http://www.cookbook-r.com Our bi278 home is visible from the RStudio server 1.Open browser and go to this server: bi278.colby.edu 2.set the working directory your histo files are in (similar to ls and cd in Unix) in console getwd() setwd("home2/cgsnow25/lab_02") 3.You can read in one by one the result filrs from setting the working directory B.PseudomalleiIllumina.histo <- read.table("B.PseudomalleiIllumina.histo", h=F, sep=" ") Table = occurrence for k-mers in your genome 4.check out shape of histogram plot(B.PseudomalleiIllumina.histo, type="l") The initial peak represents kmers that are the result of sequencing errors. We want to find the midpoint of the main peak (represents single copy regions of the genome), so we narrow down the range of the histogram to try to figure out what the x-axis value is at the main peak plot(B.PseudomalleiIllumina.histo[5:250,], type="l") 6.The area under the curve and divide it by the coverage estimate to get an estimate of genome size. This works because the distribution of k-mers in a genome theoretically follows a Poisson distribution (we are ignoring the part of the curve that contains the results of sequencing error) This series of command will show what the last line is doing nrow(name) head(name) head(name[,1]) #square brackets are used to limit [row, column] head(name[,2]) head(name[5:nrow(name),]) head(name[5:nrow(name),1]*name[5:nrow(name),2]) sum(name[5:nrow(name),1]*name[5:nrow(name),2]) sum(name[5:nrow(name),1]*name[5:nrow(name),2])/154 I typed: sum(B.PseudomalleiIllumina.histo[5:nrow(B.PseudomalleiIllumina.histo),1]*B.PseudomalleiIllumina.histo[5:nrow(B.PseudomalleiIllumina.histo),2])/23 Chose midpoint as 23. Estimation for B. Pseudomallei is 7,800,419