BI278 Lab02 Kirsten Pastore

# BI278 Lab02 Kirsten Pastore Exercise 1. Download Public Genomic Data from NCBI 1.1 Navigate to NCBI for genomic data go to https://www.ncbi.nlm.nih.gov/bioproject/browse 1. Find a bacteria that has both a genome and raw sequence reads available through NCBI I searched "Burkholderia pseudomallei" and sorted by Genome sequencing projects I selected PRJNA300580 2. Download a genome onto your desktop clicking the genome link on the selected project, and then clicking the genome link again downloads the chromosome level genome sequence onto my desktop 3. Copy over to your colbyhome to use it in bi278 - opened a terminal window in desktop - mount filer - copy from desktop's Downloads folder to colbyhome and move to bi278 folder ``` cp Downloads/GCF_000756125.1_ASM75612v1_genomic.fna /Volumes/Personal/klpast23 ``` 4. ssh into bi278 and make a new folder for this week in your home ``` #ssh onto bi278 ssh klpast23@bi278 #make a new folder for this week in your home mkdir lab_02 # move the file into lab_02 folder mv Personal/klpast23/GCF_000756125.1_ASM75612v1_genomic.fna ~/lab_02 ``` 5. Use your shell script from last week to collect genome stats on this new genome ``` sh BPCounter.sh GCF_00756125.1ASM75612v1_genomic.fna ``` returns: NZ_CP008781.1 Burkholderia pseudomallei strain Mahidol-1106a chromosome 1, complete sequence NZ_CP008782.1 Burkholderia pseudomallei strain Mahidol-1106a chromosome 2, complete sequence 7,085,397 #total number of BP 4,836,611 #GC Count 1.2 Download Raw Reads via SRA Toolkit 1. Click on BioSample under Project Data 2. Find RunID under SRA information Run: SRR2886986 Spots: 5.4M Bases: 1.1G Size: 307.7M GC Content 66.9% Published: 2015-12-06 Experiment: SRX1406816 Library Name: MSHR8613 Platform: Illumina 3. run fastq-dunmp from command line program called SRAtoolkit ``` #find commands / help fastq-dump --help ``` note: -X means maximum spot id, filter by SPOT_GROUP -Z output to stdout 4. check downloaded data ``` fastq-dump -X 3 -Z SRR2886986 ``` 5. check the data with the pairs separated ``` fastq-dump -X 3 --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -Z SRR2886986 ``` 6. Download raw sequencing reads but without quality scores ``` fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v -- fasta default --outdir ~/lab_02 SRR2886986 ``` 7. Take a look at the first 20 lines of FASTA files using head ``` head -20 ~/lab_02/*.fastq ``` 8. Download raw sequencing reads from other organisms ``` fastq-dump --split-3 --skip-technical --readids --read-filter pass --dumpbase --clip -v -- fasta default --outdir ~/lab_02 SRR3927471 ``` | Organism | SRA Instrument | SRA Run Number | Genome Size (bp) | Estimated Size | | ----------- | --------------------- | -------------- | ---------------- | -------------- | | P. fungorum | Illumina NovaSeq 6000 | SRR11022347 | 9,058,983 |9,185.185 | | P.sprentiae | Illumina GA 11x | SRR3927471 | 7,829,542 | | | P. terrae| PacBioRS II | DRR322713 | 10,062,489 | | | P.xenovorans | Illumina HiSeq 2000 | 9,702,951 | | | | B. pseudomallei | Illumina HiSeq 2000 | SRR2886986 | 7,085,397 | 7,034,152 | 9. remove second file that appears using rm ``` rm SRR2886986_pass_2.* ``` 10. rename each file so the filename reflects the organism and the source (SRA) ``` mv SRR2886986_pass_1.fasta SRR2886986_B.pseudomallei.fasta mv SRR2886986_pass_1.fastq SRR2886986_B.pseudomallei.fastq mv SRR3927471_pass_1.fasta SRR3927471_P.sprentiae.fasta ``` Exercise 2. Count k-mers and estimate genome size based on k-mer frequenies 2.1 Count Kmers with Jellyfish 1. Chose a k-mer size that is between 19-31 k=21 2. run jellyfish count for a genome you will compare ``` jellyfish count -t 2 -C -s 1G -m 21 -o B.pseudomallei.m21.count SRR2886986_B.pseudomallei.fasta ``` 3. run jellyfish histo to get frequency distribution across all k-mers ``` jellyfish histo -o B.pseudomallei.m21.histo B.pseudomallei.m21.count #remove count file rm B.pseudomallei.m21.count ``` 4. look at histo file ``` cat B.pseudomallei.m21.histo ``` 5. write a shell script to do the same for the other SRA downloaded file ``` jellyfish count -t 2 -C -s 1G -m $2 -o $1.count $1.fasta jellyfish histo -o $1.histo $1.count rm $1.count cat $1.histo ``` 2.2 Visualize Your Kmer Counts and Estimate Genome Size in R 1. go to server: bi278.colby.edu 2. set working directory your histo files are in ``` #in the console: getwd() setwd("/home2/klpast23/lab_02") ``` 3. read in result files ``` B.pseudomallei <- read.table("SRR2886986_B.pseudomallei.histo", h=F, sep=" ") ``` 4. check out the shape of this histogram ``` plot(B.pseudomallei, type = 'l') plot(B.pseudomallei[5:250,], type = 'l') ``` 5. check out the data to determine the midpoint of the main peak ``` B.pseudomallei[25:100,] ``` midpoint with the highest density around 60 6. area under the curve & divide by the coverage estimate to get an estimate of the genome size ``` nrow(B.pseudomallei) head(B.pseudomallei) head(B.pseudomallei[,1]) head(B.pseudomallei[,2]) head(B.pseudomallei[5:nrow(B.pseudomallei),] ) head(B.pseudomallei[5:nrow(B.pseudomallei),1] * B.pseudomallei[5:nrow(B.pseudomallei),2]) sum(B.pseudomallei[5:nrow(B.pseudomallei),1] * B.pseudomallei[5:nrow(B.pseudomallei),2]) sum(B.pseudomallei[5:nrow(B.pseudomallei),1] * B.pseudomallei[5:nrow(B.pseudomallei),2]) /60 ``` returns 7,034,152 Exercise 3. Annotate a Genome with Prokka 3.1 Annotate a Genome 1. Find path for genome from B. pseudomallei 2. Run Prokka annotation ``` prokka --force --outdir bbqs395 --proteins /courses/bi278/Course_Materials/lab_02/Burkholderia_pseudomallei_K96243_132.gbk --locustag BB395 --genus Paraburkholderia --species bonniea --strain BbQS395 /courses/bi278/Course_Materials/lab_01b/P.bonniea_bbqs395.nanopore.fasta ``` Q1. What are the different output files that Prokka generates? .faa: Protein FASRA file of the translated CDS sequences .ffn: Nucleotide FASTA file of all the prediction transcripts .fna: Nucleotide FASTA file of the input contig sequences .gff: This is the master annotation in GFF3 formation, containing both sequences and annotations .txt: Statistics relating to the anotated features .tsv: Tab-separated file of all features: locus_tag, ftype, len_bp, gene, EC_number, COG, product Q2. Use the best file to answer each question - How many CDS do you have? - How many of these are "hypothetical proteins" - How many "ribosomal proteins" do you have - How many tRNA do you have Q3. Which file is the most useful for finding the genomic loations (chromosome and base address) of any type of annotations Q4. Now compare your metrics to that of your reference genome for these two species?