BI278 Lab #2 - NCBI data, kmers, and prokka

# BI278 Lab #2 - NCBI data, kmers, and prokka Excerise 1 - Download public genomic data from NCBI 1.1 - How to navigate NCBI for genomic data - You can search thru projects to find data sets based on various attributes (thru "filters" button): - types of organisms - types of sequences (imited to various types of next gen sequencing reads) - usually "genome sequencing" or "raw sequence reads" in Data type filter - can then sort by Data Type (click on arrow in column headings) - once you find a dataset of interest (data + publication is good), you click on link - clicking "Genome" will take you to the link --> it's the fastest way to get to a genome that's already put together - second genome link ("genome") will download the chromosome level genome sequence for you onto desktop - if you want individual protein sequences, you can click on the number (usually provided in table) 1.1b - Finding a bacteria that has both a genome and raw sequence reads available thru NCBI Bacteria I picked: Burkholderia pseudomallei Registration date: 2015-10-16 Link: https://www.ncbi.nlm.nih.gov/genome/476 GC % - 4836611 Genome Size - 7085397 Remember to unzip files first! And check for the correct path! Path: /home2/sdivit25/lab_02 Unzip file: gunzip "filename" 1.2 - Download Raw Read via SRA Toolkit NCBI SARS CoV2 genome sequencing example BioProject: https://www.ncbi.nlm.nih.gov/bioproject/?term=NYU+Langone+SARS-CoV-2+Sequencing+team Then click # of links for BioSample. Link: https://www.ncbi.nlm.nih.gov/biosample?Db=biosample&DbFrom=bioproject&Cmd=Link&LinkName=bioproject_biosample&LinkReadableName=BioSample&ordinalpos=1&IdsFromResult=650245 Then search for ID # & click on it. Link: https://www.ncbi.nlm.nih.gov/sra/SRX8878313/ From there, click on "Run". Link: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR12379546&display=metadata Make sure to checksize of file & length of sequencing read. The ID you need for next part is the SRR run # (usually starts w/ "SSR"). - SSR # = under "Runs" column - for this COVID file: SRR12379546 - for the genomic (Burkholderia pseudomallei) file: SRR2134242 LINK: raw seq - https://www.ncbi.nlm.nih.gov/bioproject/PRJNA290528 SRA exp tells you that raq seq data is available: https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=290528 click the first one: https://www.ncbi.nlm.nih.gov/sra/SRX1124256[accn] GIVES YOU RUN NUMBER: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR2134242&display=metadata SRAtoolkit = commmand line program we're using to download some genomic data from NCBI - sysadmin installed this on bi278 **Command to download NCBI SRA (Sequence read archive) database: fastq-dump In order to run this^: vdb-config --interactive BUT it's imp to know exactly what that's doing so everytime you run a new software, find out more info: fastq-dump --help To check what the downloaded data may lok like before downloading the whole file: fastq-dump -X 3 -Z SRR12379546 #sequence data of the first 3 spots will #be written to our screen instead of a #file b/c of the other optional flags -X #and -Z used here ^Doing this step, you will notice: - these seqs are supposed to be 2 101bp length seqs per "spot" - BUT output --> these 2 101bp length seqs are being combined - so we want to download this data w/ the pairs separated: fastq-dump -X 3 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -Z SRR12379546 You know it worked if the length is roughly half. So before for the first fastq-dump, length was 302 and after the second fastq-dump, length was 151! Briefly: Format of FASTQ files = units of 4 lines at a time indicate for a single read the DNA sequence (marked by @) and individual base quality scores (marked by +) **Important to do these types of checks b/c there is lot of data in SRA that are not deposited/described properly. ^Easy way of avoiding these bad data is by checking whether the traces and the metadata of a sample of the files match up. Now, you have a genome for in the prev steps, so download raw sequencing data reads but w/out quality scores: fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v --fasta default --outdir ~/lab_02 SRR12379546 To look at first 20 lines of FASTA files: head -20 PATH/*.fasta #but no path needed if you're already in #that folder/directory so then: head - 20 *.fasta **CHECK ENDING OF ".fasta" to make sure it matches with what you're looking for **Practice** --> download one of other organisms other than P. fungorum (overlap w/ last week) |Organism|SRA Instrument [type] record|SRA Run Number|Genome Size (bp)|Estimated size| | -------- | -------- | --- | --- | -------- | |P. Fungorum|Illumina NovaSeq 6000|SRR11022347|9058983|9185185| |P. sprentiae|Illumina GA IIx|SRR3927471|7829542| | |**P. terrae**|PacBio RS II|DRR322713|10062489| **COULDN'T CALCULATE BC OF MEMORY ISSUE| |P. xenovorans|Illumina HiSeq 2000|SRR2889773|9702951| | |Your organism: Burkholderia pseudomallei| Illumina|SRR2134242|7085397|7800419| Link for P. terrae: https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=776439 Exercise 2 - Count k-mers & estimate genome size based on k-mer frequencies 2.1 Count Kmers w/ Jellyfish - program used to count k-mers in our genome is called jellyfish - commands you want to use: jellyfish count jellyfish histo typing " --help" for additional usage information Steps: 1) Choose a k-mer size between 19-31 My chosen size: 20 2) run "jellyfish count" for genome you will compare jellyfish count -t 2 -C -s 1G -m kmerSize -o name.m29.count input_file jellyfish count -t 2 -C -s 1G -m 20 -o B.PseudomalleiIllumina.count B.PseudomalleiIllumina.fasta 3) run "jellyfish histo" on the count file to get the frequency distribution across all kmers a) get rid of count file after checking that it worked b/c it takes up a lot of memory and you don't need it anymore jellyfish histo -o name.m29.histo name.m29.count jellyfish histo -o B.PseudomalleiIllumina.histo B.PseudomalleiIllumina.count 4) Look at histo file. 2 choices: cat ==> will print whole file to screen less ==> will show you one screen & a line at a time, type 'q' to exit **YOUR FILE WILL SHOW TWO COLUMNS OCCURENCE & DENSITY. - occurence = frequency, how often a k-mer appears in that file - density = how many kmers show at that level of occurence **UNDERSTANDING IT: There are 4^20 kmers available of size 20. So 23 254926 --> there are 254926 of the 4^20 kmers that appear 23 times - NOTICE: lot of kmers that occur only once - these are most likely to be result of errors ==> thus, used to remove reads from datasets that are likely errors - Quake is a software that does that 5) Now that your code is working: write a shell script to do the same for other SRA downloaded file. jellyfish count -t 2 -C -s 1G -m 20 -o $1.count $1 jellyfish histo -o $1.histo $1.count rm $1.count named: ColumnOcurrence.sh 2.2 - Visualize your Kmer counts & estimate genome size in R 1) open colby server: bi278.colby.edu 2) set working directory your histo files are in getwd() setwd("/home2/sdivit25/lab_02") 3) read in one by one the result files you created in the prev step name <- read.table("name.m29.histo", h=F, sep=" ") B.PseudomalleiIllumina <- read.table ("B.PseudomalleiIllumina.histo", h=F, sep=" ") 4) check out shape of histogram plot(name, type ="l") plot(B.PseudomalleiIllumina, type="l") Since, we want to find the midpoint of the main peak that represents the single copy regions of he genome, we can narrow down the range of the histogram to try to figure out what the x-axis value is at the main peak: plot(name[5:250,], type="l") plot(B.PseudomalleiIllumina[5:250,], type="l") 5) we can check out the data directly to determine the midpoint of the main peak based on where it was located in your plot. ***THIS MIDPOINT NUMBER REPRESENTS A ROUGH ESTIMATE OF GENOME COVERAGE. name[150:180,] B.PseudomalleiIllumina[0:50,] My midpoint with the highest density was about 23 (254926) 6) Now, we want to area under the curve & divide it by this coverage estimate to get an estimate of genome size. This estimation is made possible because the distribution of k-mers in a genome follows a Poisson distribution. We're ignoring the part of the curve (inital peak) that contains the results of sequencing error: sum(name[5:nrow(name), 1]*name[5:nrow(name),2])/"midpoint with highest density" sum(B.PseudomalleiIllumina[5:nrow(B.PseudomalleiIllumina), 1]*B.PseudomalleiIllumina[5:nrow(B.PseudomalleiIllumina),2])/23 Explanation: nrow(name) head(name) head(name[,1]) #square brackets are used to limit [row, column] head(name[,2]) head(name[5:nrow(name),]) head(name[5:nrow(name),1]*name[5:nrow(name),2]) sum(name[5:nrow(name),1]*name[5:nrow(name),2]) ***NOTE: Professor Noh told us that we didn't have to do Exercise 3.