**WEEK 2 NOTES**
`#!/bin/bash` - tells Unix we will be using bash
`$1` is a variable that is filled in when I run the script; in this case it was where a file was supposed to go, so the script will run on any file that I put in that position
`sh` runs a shell script
A shell script makes it easier to get genome size, contig count, and GC amount without having to retype the commands every time
nano - editor within the Unix shell
To type a script in nano: type `nano` in the terminal, enter `#!/bin/bash` (return), then type the script or command
Control+X returns to the original terminal; to go back to the script: `nano FILENAME.sh`
I could use $1 as a filler filename (test.fa) --> $1 allows the script to accept a filename as a variable
totalGC.sh - the script's file name
To run this script on a new fasta file using the new variable:
sh FILENAME.sh FASTAFILE
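The stats such a script computes can be sketched like this (a minimal sketch, not the actual totalGC.sh; the function names and the demo FASTA are made up for illustration):

```shell
# Hypothetical helpers matching what a genome-stats script might compute on $1.
genome_size()  { grep -v '^>' "$1" | tr -d '\n' | wc -c | tr -d ' '; }    # total bases
contig_count() { grep -c '^>' "$1"; }                                     # '>' header lines
gc_bases()     { grep -v '^>' "$1" | tr -cd 'GCgc' | wc -c | tr -d ' '; } # G and C bases

# Demo on a made-up two-contig FASTA:
printf '>c1\nATGC\n>c2\nGGCC\n' > demo.fa
echo "size: $(genome_size demo.fa)  contigs: $(contig_count demo.fa)  GC bases: $(gc_bases demo.fa)"
```

In a real script these commands would run directly on "$1" so the same file passed on the command line feeds every stat.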
TO COPY NCBI DATA
1. pick "genome"
2. either [Genome sequencing] or [Raw sequence reads] using the Data Type filter, [Monoisolate] from Scope, and [has publication] in Property. First sort these by Data Type and do the genome sequencing projects first
3. click on the accession link (e.g. PRJNA178)
4. Click the genome link in the olive box. The Project Data tables below tell you that this genome has two chromosome sequences and 5731 individual protein sequences associated with it. Clicking either of these numbers in the table takes you to the individual sequence entries
5. Click Download, select [RefSeq only] as File Source, File Types: Genome Sequences; the file will be named something like ncbi_dataset.zip
6. Press Download; this downloads a genome file to your desktop
7. Unzip and copy the genome file over to your colbyhome to use it in bi278 (the course Unix environment). To do this, open a finder window for your desktop. Assuming filer is mounted (use command+K and enter smb://filer.colby.edu) you can copy a file from your desktop’s Downloads folder to your colbyhome then move it to your bi278 home (~):
8. Then, ssh onto bi278. Make a new folder for this week in your home (~) and then move your file from within colbyhome (= /Personal/COLBYID in Finder) to your working directory:
mv PATH/FILENAME ~/lab_02
9. Use your shell script from above to collect genome stats on this new genome.
to copy a file into colbyhome, go to My Personal in Finder and drag it over to the terminal
`ls colbyhome` should show the place where my file is:
```
ls colbyhome/GCF_000007505.1/
```
then
```
mv colbyhome/GCF_000007505.1/GCF_000007505.1_ASM750v1_genomic.fna ~/lab_02
```
The file is now correctly moved into the lab_02 folder
**2.2 DOWNLOAD RAW SEQUENCING READS W SRA TOOLKIT**
In NCBI, going back to my BioProject search results, when I click on Data Type in the header again to flip the sorting order, I can see some projects with raw sequencing reads.
In the Project Data table, I can see how many SRA experiments are included here. This (SRA) is how NCBI stores its raw sequencing data. If I follow the link, it will show me details about this dataset, including type of sequencing (illumina), the data source (genomic), and sequencing read layout (paired [end]):
The information you need to download this data is the run ID (usually starts with SRR); click on the run ID.
Useful things to check here are the number of spots (6.9M) and the length of the sequencing reads (301 base pairs). The green bars confirm that it is paired end, with 2 reads per spot. I like to think of a spot as a spot on an Illumina chip.
Once you find a BioProject with downloadable raw sequence reads, you can use the SRAtoolkit software and its specific command fasterq-dump:
In your terminal and in bi278, type in:
vdb-config -i (will bring up a blue screen)
1. Confirm that remote access is enabled. Next, type [c] to go to the [Cache] tab
2. Confirm that local file-caching is enabled. Last, type [t] to go to [Tools]
3. Check the current directory here. Then type [s] to save and [x] to exit.
Now type in:
fasterq-dump
**SRATOOLKIT COMMANDS**
```
--seq-defline           custom defline for sequence: $ac=accession,
                        $sn=spot-name, $sg=spot-group, $si=spot-id,
                        $ri=read-id, $rl=read-length
--qual-defline          custom defline for qualities: same as seq-defline
-U|--only-unaligned     process only unaligned reads
-a|--only-aligned       process only aligned reads
-l|--row-limit          limit rowcount per thread
--disk-limit            explicitly set disk-limit
--disk-limit-tmp        explicitly set disk-limit for temp. files
--size-check            switch to control: on=perform size-check (default),
                        off=do not perform size-check,
                        only=perform size-check only
--ngc <PATH>            PATH to ngc file
-h|--help               Output brief explanation for the program.
-V|--version            Display the version of the program then quit.
-L|--log-level <level>  Logging level as number or enum string. One of
                        (fatal|sys|int|err|warn|info|debug) or (0-6).
                        Current/default is warn.
-v|--verbose            Increase the verbosity of the program status
                        messages. Use multiple times for more verbosity.
                        Negates quiet.
-q|--quiet              Turn off all status messages for the program.
                        Negated by verbose.
--option-file <file>    Read more options and parameters from the file.
```
```
vdb-dump --info SRR8460388
```
^ gives all the file info from the NCBI database
```
fasterq-dump -Z --concatenate-reads SRR8460388 | head -n 8
```
^ takes a second to run and shows the first 8 lines of output; I am seeing the DNA sequence after the headers that start with @, and the associated base quality scores after the headers that start with +.
The length shown is the length of the paired sequences combined, e.g. length=602 --> two 301 bp sequences per spot.
NCBI SRA stores paired end reads in this concatenated format, so fasterq-dump will split them up for you by default.
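The concatenated layout can be sketched with a toy "spot" in plain shell (the sequence content here is made up; real reads are of course not all A's and T's):

```shell
# Toy illustration of one concatenated spot: read 1 (301 bp) followed by read 2 (301 bp).
SPOT="$(printf 'A%.0s' $(seq 1 301); printf 'T%.0s' $(seq 1 301))"  # 602-character string
R1="$(printf '%s' "$SPOT" | cut -c1-301)"     # first mate
R2="$(printf '%s' "$SPOT" | cut -c302-602)"   # second mate
echo "spot=${#SPOT} read1=${#R1} read2=${#R2}"
```

Splitting the 602 bp spot at position 301 is conceptually what fasterq-dump does when it writes the `_1.fastq` and `_2.fastq` files.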
When downloading from NCBI SRA, the -p option lets me see how far along the download is.
I want to check the first few lines of the two files I got:
```
head -n 4 SRR8460388_1.fastq
head -n 4 SRR8460388_2.fastq
```
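A quick sanity check to go along with head: since FASTQ uses exactly 4 lines per record, line count divided by 4 gives the read count (demoed here on a tiny made-up FASTQ, not the real SRR download):

```shell
# Make a fake 2-record FASTQ (4 lines per record: header, sequence, +, qualities).
printf '@r1\nACGT\n+\nIIII\n@r2\nTTAA\n+\nIIII\n' > demo.fastq
# Number of reads = total lines / 4
echo $(( $(wc -l < demo.fastq) / 4 ))
```

Running this on both `_1.fastq` and `_2.fastq` should give matching counts for paired-end data.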
To download raw sequencing reads without quality scores: the download will take a few minutes, so if you don't use -p your command prompt will look like it is frozen. Just leave it hanging and it will complete on its own:
fasterq-dump -p --fasta --outdir ~/lab_02 SRRNUMBER
Then take a look at the first few lines of the FASTA files you’ve got using head. It’s always a good idea to do this to make sure your files look like what you expect.
head lab_02/SRRNUMBER*.fasta
(CONDITIONAL):
If you have a second file because of paired end reads for any of your SRA files, get rid of it using rm.
Rename each file so the filename reflects the organism and the source (SRA) for your own convenience using mv. mv can be used not only to move a file from directory to directory, but you can also use it to rename a file by “moving” it from one filename to another.
Check the size of your files using: ls -lh
If your raw read file is larger than 1G, then split it using this command:
split -b 1G SRRNUMBER.fasta SRRNUMBER.
Write over the original file with the split files ending with *aa like this:
`mv SRRNUMBER.aa SRRNUMBER.fasta`
Now remove the split files that end with anything other than *aa. You should now have a fasta file containing raw reads and a fasta file containing a genome sequence.
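The split-then-rename steps above can be rehearsed on a throwaway file with a small chunk size (1K here instead of 1G; the file and its contents are made up for the demo):

```shell
# Create a 3000-byte stand-in for a large reads file.
printf 'x%.0s' $(seq 1 3000) > reads.fasta
split -b 1K reads.fasta reads.   # produces reads.aa, reads.ab, reads.ac
mv reads.aa reads.fasta          # the first 1K chunk writes over the original
rm reads.a?                      # remove the leftover chunks (ab, ac)
wc -c < reads.fasta              # the kept file is now exactly 1024 bytes
```

With `-b 1G` the same pattern keeps only the first gigabyte of reads, which is the point of the step above.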
**3.1 JELLYFISH**
The program we will use to count k-mers in our genomes is called jellyfish. The commands you want to use are jellyfish count, and jellyfish histo. For each of these commands you can type in jellyfish count --help and jellyfish histo --help for additional usage information.
in a line of code, after -o is where you are naming the output file
1. Choose a k-mer size between 19 and 27. I've done k=29 here.
The number of k-mers you could make with the 4 DNA bases can get unwieldy pretty fast, but if the k-mer is too small there won't be much difference between one genome and another. Larger k-mers are more specific, but take a little longer to count.
2. Run jellyfish count for the **genome** you downloaded. Change the k-mer size by changing the number after -m. Like fasterq-dump above, this will take a few minutes and will appear to hang. Just wait for it to finish.
`jellyfish count -t 2 -C -s 1G -m NUMBER -o NAME.NUMBER.count INPUTFILE`
After -o, the NAME part would be what you renamed your genome file, e.g. bmulti_SRR#.fasta
3. Now run jellyfish histo on the count file to get the frequency distribution across all k-mers.
`jellyfish histo -o NAME.NUMBER.histo NAME.NUMBER.count`
4. To look at the histo file:
cat (will print the whole file to your screen) or less (will show you one screen-length at a time; type q to exit out).
Your file will show two columns: occurrence (or frequency, how often a k-mer appears in that file) and density (how many k-mers show at that level of occurrence). You'll notice there are a lot of k-mers that occur exactly once (first line of the file). These are most likely the result of sequencing errors.
Now we’ll take the histo file into R to plot and do some genome size estimation.
**3.2 USING R-STUDIO TO ESTIMATE GENOME SIZE FROM K-MER FREQUENCIES**
Open an internet browser and log in with Colby credentials at bi278.colby.edu
Set the working directory to where your histo files are (this is similar to pwd and cd in Unix). In the Console:
getwd()
setwd("/home2/COLBYID/lab_02/")
*autocomplete can fill out the remainder of your file name*
Then you can read in, one by one, the result files you created in the previous step. Unlike Unix, R is unable to access these files without this step:
```
NAME <- read.table("NAME.m29.histo", h=F, sep=" ")
```
Remember the table that you just read has occurrence (or frequency, how often a k-mer appears in that file) and density (how many k-mers show at that level of occurrence) for k-mers in your genome. Let’s change the variable labels on the data table:
colnames(NAME) <- c("occurrence","density")
plot(NAME, type="l") is the code to see the shape of the histogram
To find the occurrence point with maximum density:
plot(NAME[5:250,], type="l") - play around with the intervals, since we want to find the midpoint of the main peak
The main peak represents single-copy regions of the genome, so we narrow down the range of the histogram to figure out the x-axis value at that peak. [5:250,] is looking at occurrence (x-axis) values from 5 to 250.
NAME[90:120,] - to determine the midpoint of the peak based on where it was located in your plot. This midpoint number represents a rough estimate of genome coverage.
To get R to report the occurrence with the maximum density within the initial region you identified as having a peak:
NAME[NAME[,2]==max(NAME[5:250,2]),]
This occurrence point with the maximum density is your estimate of genome coverage.
```
nrow(NAME)                                        # total number of rows (observations)
head(NAME)                                        # first 6 occurrence/density pairs
head(NAME[,1])                                    # square brackets are used to limit [row, column]
head(NAME[,2])
head(NAME[5:nrow(NAME),])                         # skip the first 4 rows (error k-mers)
head(NAME[5:nrow(NAME),1] * NAME[5:nrow(NAME),2]) # occurrence * density, row by row
sum(NAME[5:nrow(NAME),1] * NAME[5:nrow(NAME),2])  # total number of k-mers counted
sum(NAME[5:nrow(NAME),1] * NAME[5:nrow(NAME),2]) / COVERAGE  # genome size estimate
```
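The same genome-size arithmetic can be checked outside R with awk on a made-up .histo file (occurrence and density pairs; the coverage of 15 is assumed for this toy data, not taken from a real run):

```shell
# Fake histo file: error k-mers at occurrence 1, then a peak around 15.
printf '1 1000\n10 30\n15 100\n20 30\n' > demo.histo
# Skip occurrences below 5 (error k-mers), sum occurrence*density, divide by coverage.
awk '$1 >= 5 { s += $1 * $2 } END { print s / 15 }' demo.histo
```

This mirrors the `sum(...)/COVERAGE` line above: (10*30 + 15*100 + 20*30) / 15.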
# **9/27/23 Review**
-type nano to open the editor where the script and file will work together
-after typing #!/bin/bash
we will type our commands (genome size, contig count, GC%) using the genome files' PATHs to access them, and with each command the PATH has to go in front of the name of the genome file being processed
-we will use $1 as a variable for the files, so this command can run on any file (as long as it's in the path)
-for these commands in nano we will create a file name, so the commands will run when the file name is called, in this order: sh, then FILENAME.sh, then the file you want processed:
**sh FILENAME.sh FASTAFILE**
e.g. sh GenomeCode.sh GCF_009455685.1_ASM945568v1_genomic.fna
# 2. DOWNLOADING PUBLIC GENOMIC DATA FROM NCBI
## 2.1 NCBI FOR GENOMIC DATA
-Link to BioProject (can find data sets by attributes such as organism, types of sequences): https://www.ncbi.nlm.nih.gov/bioproject/browse
-once you look up a bacterium, use the Data Type filter to only get [Genome Sequencing] or [Raw Reads], [Monoisolate] from Scope, and [has publication] from Property; then you get BioProjects to choose from
-we will start with a genome sequencing project first [Data Type]; once chosen, click on the accession number. This tells me how many protein sequences I have and how many chromosomes (I am doing Legionella pneumophila; Legionella antartica has 6773 protein sequences, while mine has 3 chromosomes and 3720 protein sequences); clicking on these numbers in the table will take you to individual sequence entries
-Click on the genome link in the olive box
-When you click the blue Download button, the File Source will be [RefSeq only] and File Types [Genome Sequences (FASTA)]; hit Download and the genome file will download onto your desktop
-Unzip/uncompress. To copy the genome file over to **colbyhome** to use it in bi278 (the Unix environment), open a Finder window for my desktop. Assuming filer is mounted (use command+K and enter smb://filer.colby.edu), you can copy a file from your desktop's Downloads folder to your colbyhome and then move it to your bi278 home (~)
*when I press tab, it autocompletes what I want to type - this works for paths, files, etc.*
mv ./colbyhome/GCF_011764505.1 ./
^ is how i moved my file from colbyhome to bi278 home (~)
to move a file from within colbyhome TO MY WORKING DIRECTORY/FOLDER, WHICH IS LAB 2: mv PATH/FILENAME ~/lab_202 OR mv FILENAME ~/lab_202
**pwd will give you the path name**
-My genome file was within a folder; I had to ls until I found the correct file that I was to run with the nano script. Once I found the file I moved it out into my lab_202:
mv GCF_002285905.1/ncbi_dataset/data/GCF_002285905.1/GCF_002285905.1_ASM228590v1_genomic.fna ./
where I could finally run the code. However, my nano script had the wrong path for each command (it was /courses/bi278/Course_Materials/Lab_01b when the path to my file was /home2/cibene26/lab_202), so I changed that in my nano script and it finally worked when I ran:
sh GenomeCode.sh GCF_002285905.1_ASM228590v1_genomic.fna
BTW GREEN IS A FILE, BLUE IS A FOLDER
# 2.2 Downloading Raw Sequencing Reads
-go to the NCBI database, Raw Reads in Data Type, click the accession number, then the SRA link; you need the run ID to download
-when moving a file, a / signifies a new location because files are usually nested; each / after a name starts the next level down in the path where you want the file to go
-I moved the SRR file from my colbyhome to my lab_202 folder by, while already in colbyhome, running: mv ./SRR21848900.fasta.gz ~/lab_202
-once we find downloadable raw sequence reads, we can use the SRAtoolkit software and its specific command fasterq-dump. This software was written specifically to interact with the NCBI SRA (Sequence Read Archive) database, where NCBI stores all raw sequence data
to make sure the computer is configured properly, we need to make sure the remote computer can access the web, so once in my terminal and on bi278 ([cibene26@vcacbi278 ~]$), I would enter:
vdb-config -i
where a blue screen pops up
-Confirm that remote access is enabled. Next, type [c] to go to the [Cache] tab:
-Confirm that local file-caching is enabled. Last, type [t] to go to [Tools]:
-Check current directory here. Then type [s] to save and [x] to exit.
-Now I can type in fasterq-dump, which will give me a long list of how to use this command
-For the raw reads I want to retrieve, I want to see how big the file is going to be, so I would type:
vdb-dump --info SRR8460388
--when I run this command it will take a second, and it prints my output to screen but only shows the first 8 lines:
fasterq-dump -Z --concatenate-reads SRR8460388 | head -n 8
(for me it said "disk-limit exceeded! to see limits: re-run with '-x' option." so I used -x instead of -Z and it didn't give me that same error). What I should be seeing is the DNA sequence after the headers that start with @ (@SRR8460388.1 1 length=602 and @SRR8460388.2 2 length=602) and the associated base quality scores after the headers that start with + (+SRR8460388.1 1 length=602 and +SRR8460388.2 2 length=602). Based on the trace, these sequences are supposed to be two 301 base pair-length sequences per "spot". In this output, you see that these two 301 base pair-length sequences are being combined. NCBI SRA stores paired end reads in this concatenated format, so fasterq-dump will split them up for you by default.
-now we will do a full download; the -p option lets me see a progress bar of how far along the download is:
fasterq-dump -p SRR8460388
-to check the first few lines of the two files we got (this will give us 4 lines each):
head -n 4 SRR8460388_1.fastq
head -n 4 SRR8460388_2.fastq
-The reason for doing these types of checks is that unfortunately there are a lot of data in SRA that are not deposited or described properly. The way to avoid this (bad) data is to check whether the traces and the metadata of a sample match up with what you actually get when you download the raw reads.
-I will download the raw sequencing reads using this command (this also gives me the progress bar because of the -p): fasterq-dump -p --fasta --outdir ~/lab_202 SRRNUMBER
-Then I'm going to look at the first few lines of the FASTA files I've got using head. It's always a good idea to do this to make sure your files look like what you expect.
head lab_02/SRRNUMBER*.fasta
-I am then going to change the SRR file name to the name of the organism plus the source SRR#, e.g. Rickprow_SRR#.fasta
^ how I did this: I used the mv command to rename: mv /home2/cibene26/lab_202/SRR21848900.fasta.gz Rickprow_SRR21848900.fasta
however, I was in the home directory when I did this, so this command renamed the file but also moved it to my home directory, so I moved the file back into lab_202 with: mv Rickprow_SRR21848900.fasta /home2/cibene26/lab_202
-Check the size of your files using: ls -lh
-If your raw read file is larger than 1G, then split it using this command:
split -b 1G SRRNUMBER.fasta SRRNUMBER.
-Write over the original file with the split files ending with *aa like this:
mv SRRNUMBER.aa SRRNUMBER.fasta
-Now remove the split files that end with anything other than *aa. You should now have a fasta file containing raw reads and a fasta file containing a genome sequence.
## 3. Count kmers and Estimate Genome Size
The program we will use to count k-mers in our genomes is called jellyfish. The commands you want to use are jellyfish count and jellyfish histo. Type jellyfish count --help and jellyfish histo --help for additional usage information.
1. We choose a k-mer size between 19 and 27 (I'll do k=21).
The k you select for DNA sequences will allow different degrees of detail in k-mer profile comparisons. Larger k-mers are more specific, but take a little longer to count.
*BTW I SWITCHED INTO MY LAB_202 DIRECTORY*
2. I will run jellyfish count for the genome I downloaded (input file); I can change the k-mer size by changing the number after -m:
jellyfish count -t 2 -C -s 1G -m NUMBER -o NAME.NUMBER.count INPUTFILE
in my case this looked like: jellyfish count -t 2 -C -s 1G -m 21 -o Rickprow.21.count GCF_002285905.1_ASM228590v1_genomic.fna
3. Now I will run jellyfish histo on the count file to get the k-mer distributions:
jellyfish histo -o NAME.NUMBER.histo NAME.NUMBER.count
in my case this looked like: jellyfish histo -o Rickprow.21.histo Rickprow.21.count
4. Then I took a look at my histo file; 2 ways to do this:
cat - which will print the whole file to my screen
or
less - which will show you one screen-length at a time; type q to exit out
-my file will show two different columns: *occurrence* (or frequency - how often a k-mer appears in that file) and *density* (how many k-mers show at that level of occurrence).
There are a lot of k-mers that occur exactly once (first line of the file); that is just a result of sequencing errors. Now we will take the histo file into R to plot and do some genome size estimation.
## 3.2 Estimate genome size from kmer frequencies
link to R to achieve basic tasks
1.Open R Studio
2.set the working directory to where your histo file is, to do this, in console:
getwd()
setwd("/home2/COLBYID/lab_02/")
for me, I was already in the lab2 directory so I just used the second line and put lab_202
BTW AUTOCOMPLETE WORKS HERE
3. I can read in these result files one by one; R cannot access these files without this step:
NAME <- read.table("NAME.m29.histo", h=F, sep=" ")
in my case, this looked like: Rickprow <- read.table("Rickprow.21.histo", h=F, sep=" ")
since the table just read in has occurrence and density, we are going to change the variable names on the data table by doing:
colnames(NAME) <- c("occurrence","density")
in my case this looked like:
colnames(Rickprow) <- c("occurrence","density")
4.To check out the shape of the histogram:
plot(NAME, type="l")
in my case this looked like:
plot(Rickprow, type="l")
{The initial peak at occurrence = 1 represents k-mers that are the result of sequencing errors. We want to find the midpoint of the main peak, which represents single-copy regions of the genome; in this case we can narrow down the range of the histogram to see what the x-axis value is at the main peak}
5.To narrow down the range of the graph:
plot(NAME[5:250,], type="l")
in my case this looked like:
plot(Rickprow[5:250,], type="l")
{play around with the occurrence intervals; [5:250,] is looking at occurrence values (x-axis) from 5 to 250}
6. We can check the data directly to see where the midpoint of the main peak is, based on where it was located on your plot:
NAME[90:120,] - in my case this looked like:
Rickprow[10:50,] - this midpoint number represents a rough estimate of genome coverage; with this I could determine that my occurrence point with the maximum density was 15
7. To get R to report to you the occurrence with the maximum density within the initial region I identified as having a peak:
NAME[NAME[,2]==max(NAME[5:250,2]),]
in my case this looked like:
Rickprow[Rickprow[,2]==max(Rickprow[10:50,2]),]
this also gave me the occurrence with max density at 15
{I tried plugging in 50:250 for my graph and got the list of all the observations with NA next to them; then when I put in the values 10:50 it worked, so as it is said, personal range values will vary}
-My occurrence point with the maximum density is my estimate of genome coverage. We assume that k-mers should be present only once in a genome. What we just did is find how often most k-mers occur in this raw sequencing set
*Extra commands*
nrow(NAME) - total number of observations
head(NAME) - lists the first 6 occurrences (x-values) and their density
head(NAME[,1]) #square brackets are used to limit [row, column]
head(NAME[,2])
head(NAME[5:nrow(NAME),])
head(NAME[5:nrow(NAME),1] * NAME[5:nrow(NAME),2])
sum(NAME[5:nrow(NAME),1] * NAME[5:nrow(NAME),2])
sum(NAME[5:nrow(NAME),1] * NAME[5:nrow(NAME),2]) / COVERAGE:
sum(Rickprow[5:nrow(Rickprow),1] * Rickprow[5:nrow(Rickprow),2]) / 15
which gave me the calculated genome size:
[1] 14416.4