# Bi278 Lab 2 v2 NCBI data, kmers, and prokka ### By Lee Ferenc 9/19/2023 ## Exercise 1. Shell Script ### Writing the Shell Script Nano is where we write scriptis. You can access this in the terminal with: >`nano` When starting the nano script include `#!/bin/bash/` at the top and then write the commands. Instead of inlcuding a specific filename, use `$` To exit press CTRL + X. You will be prompted to save your file. Click y. Type in the filename you desire plus `.sh` at the end. To edit the script type in: >`nano FILENAME.sh` To exceute the script type in: >`sh FILENAME.sh` To execute the script with the file type: >`sh FILENAME.sh FASTAFILE` ### Script Table | Organsism | Strain | Contig Count | Genome Size (bp) | GC % | | -| -| - | - | - | | B. agricolaris | BaQS159 | 2 | 8721420 | 61.2% | | P. bonniea | Bbqs859 | 2 | 4098182 | 58.7% | | P. bonniea | bbqs395 | 2 | 4009285 | 58.8% | | P. bonniea | bbqs433 | 2 | 4013203 | 58.8% | | P. Fungorium | ATCC BAA-463| 4 | 9058983 | 61% | | P. Hawleyella | BhQS11 | 2 | 4125700 | 59.2% | Nano Script: I basically included the two commands from lab 1 though I was still unable to find out how to divide within nano ## Exercise 2.1 Downloading Public Genome Data from NCBI Go to the website below: [https://www.ncbi.nlm.nih.gov/bioproject/browse](https://) Search for a desirable genome, and filter by data type for Genome sequencing projects for fewer results. Click the accession and then you will see a page. Click on genome. The olive box on the right with "Genome" in blue or purple is the fastest way to get to the genome. The blue button on the bottom labelled "Download" is where you download the genome. Select "RefSeq Only" as the file source. (Filter is Genome Sequencing or Raw Sequencing, Monoisolate, has pubication) (Also make sure Filer is mounted with "smb://filer.colby.edu") ### 2.1 Downloading *Neisseria gonorrhoeae* After downloading the genome onto my desktop (PRJDB8572), I copied it over to `colbyhome`. The tutorial didn't work so I moved the donwload under personal. Then in the terminal moved until I was able to copy `/personal/enfere24/GCF_013030075.1` to `/home2/enfere24` then moved to lab_2. Then I ran my nano script Answer: Total size is 2171755 and GC content is 52.60% Fits with NCBI. ### Exercise 2.2 Downloading Raw Sequencing Reads with `sra toolit` To download straight from NCBI you can use `SRAtoolskit` and `faserq-dump`. First configure the enviroment >`vdb-config -i` Make sure remote access is enabled, type c, make sure local file-cashing is enabled. Then type t and go to tools. Check 'current directory'. Click s to save and then exit. >`fasterq-dump` >Check File Size `vdb-dump --info SRR#` >See first 8 lines `fasterq-dump -Z --concatenate-reads SRR# | head -n 8` >Download `fasterq-dump -p SRR#` >Once downloaded check: `head -n 4 SRR#_#.fastq` >Donwload directly to lab `fasterq-dump -p --fasta --outdir ~/lab_02 SRR#` ### Example of *Neisseria gonorrhoeae* SRR16847903 being downloaded >Downloaded the SRR `fasterq-dump -p --fasta --outdir ~/lab_02 SRR16847903` > Looked at its head `head lab_02/SRR16847903*.fasta` (looking good) > Renamed to a better file `mv lab_02/SRR16847903.fasta lab_02/N_gonorrhoeae_SRR16847903.fasta` >Split because the size (ls -lh) was 1.9G `split -b 1G N_gonorrhoeae_SRR16847903.fasta N_gonorrhoeae_SRR16847903` >Moved to the original fasta file and removed the other splits `mv N_gonorrhoeae_SRR16847903aa N_gonorrhoeae_SRR16847903.fasta` ## Exercise 3 Count k-mers and estimate genome size ### Exercise 3.1 `jellyfish` List of useful commands >`jellyfish count` > `jellyfish count -t 2 -C -s 1G -m NUMBER -o NAME.m29.count INPUTFILE` (number is k-mer) >`jellyfish histo` > `jellyfish histo -o NAME.m29.histo NAME.m29.count` I chose k-mer size 22. >`jellyfish count -t 2 -C -s 1G -m 22 -o N_gonorrhoeae_SRR16847903.m29.count N_gonorrhoeae_SRR16847903.fasta` >`jellyfish histo -o N_gonorrhoeae_SRR16847903.m29.histo N_gonorrhoeae_SRR16847903.m29.count` >To see I used `less N_gonorrhoeae_SRR16847903.m29.histo` ### Exercise 3.2 `R` R syntax resource: http://www.cookbook-r.com To go to R: Open an internet browser and type in https://bi278.colby.edu/ (In Rstudio console): >Get working directory `getwd()` >Set working directory `setwd("/home2/enfere24/lab_x/")` >Make a table from data`NAME <- read.table("NAME.m29.histo", h=F, sep=" ")` >Change variables `colnames(NAME) <- c("Vector 1","Vector 2")` >plot it `plot(NAME, type="l")` >Fix the the x-axis `plot(Name[x1:x2,], type="l")` >Have R report the max density: `NAME[NAME[,2]==max(NAME[x1:x2,2]),]` >Get the area under the curve and divide by coverage `sum(NAME[5:nrow(NAME),1] * NAME[5:nrow(NAME),2]) / COVERAGE` The last command can be broken down into these commands: >`nrow(NAME)` >`head(NAME)` >`head(NAME[,1])` #square brackets are used to limit [row, column] >`head(NAME[,2])` >`head(NAME[5:nrow(NAME),])` >`head(NAME[5:nrow(NAME),1] * NAME[5:nrow(NAME),2])` >`sum(NAME[5:nrow(NAME),1] * NAME[5:nrow(NAME),2])` Occurance point with maximum density is 281 with 15278. The estimate I got was 2,502,860 or 2.5 MBP. The estimated sizze was 1.5MBP from the website. In Rstudio I did: >`getwd()` >`setwd("/home2/enfere24/lab_02/")` >`N_gono <- read.table("N_gonorrhoeae_SRR16847903.m29.histo", h=F, sep=" ")` >`plot(N_gono, type="l")` >`plot(N_gono[5:500,], type="l")` >`plot(N_gono[200:300,], type="l")` >`plot(N_gono[270:300,], type="l")` >`N_gono[N_gono[,2]==max(N_gono[150:450,2]),]` >`sum(N_gono[5:nrow(N_gono),1] * N_gono[5:nrow(N_gono,2)] / 281`