###### tags: `BI278` # Lab 2 Notebook (9/14) ## Exercise 1: Count k-mers and estimate genome size based on k-mer frequencies ### 1.1 . DOWNLOAD SPECIFIED GENOMES FROM NCBI SRA (**practical**) I first created a new directory with the following command while in the home directory of bi278 ```mkdir lab_02``` Next I moved into this directory so that the imported files would go into this new directory ```cd lab_02``` I used the following command to import the SRA files, changing the SRA number for the different files. The 3000000 represents the 3 million spots read. ```fastq-dump -X 3000000 SRR11020398 --fasta``` I did not need to delete any files created paired end reads. I then renamed the files to reflect the organism with the following command, modified for each SRA file ``` mv SRR11020398.fasta p_fungorum_SRA.fasta``` The lab_02 therefore now has three files with SRA files for each of the three reads (Figure 1) ![](https://i.imgur.com/Ey44EGP.png) Figure 1: The files in the lab_02 directory after important and renaming the fastq files. ### 1.2. COUNT KMERS WITH JELLYFISH Next, I used jellyfish to make a count file for each of the SRA files using the following command, modifying for each file. This counts the frequency of kmers length 20 in the fastq file. ```jellyfish count -t 2 -C -s 2G -m 20 -o p_fungorum.20.count p_fungorum_SRA.fasta``` Next, I modified the following command for each of the three count files I just made to create a histogram of the frequency distribution across all k-mers. ```jellyfish histo -o p_fungorum.20.histo p_fungorum.20.count``` I was then able to see the histo file with the command ```less p_fungorum.20.histo```. It contained two columns, one showing the frequency a k-mer comes up in the file, the other showing how many other kmers come up with that frequency (Figure 2). Remember that all of these kmers are of length 20 because of the specification in early command-line entries. ![](https://i.imgur.com/ijuSyd9.png) Figure 2: The p_fungorum.20.histo file, shown with the command ```less```, that reveals the frequency distribution across all k-mers with length 20. For example, the first line demonstrates that there are 42252036 kmers of length 20 that only appear once in p_fungorum_SRA.fasta file. ### 1.3. VISUALIZE YOUR KMER COUNTS AND ESTIMATE GENOME SIZE IN R Type bi278.colby.edu to get into the login page for r for the bi278 unix environment. The login is just lmdrep23 and my password. Typing the following lines into rstudio revealed that I was in the home directory and moved me into the lab_02 directory I've been working in through terminal ``` getwd() setwd("/home/lmdrep23/lab_02/") ``` I then made each of the histo files I created in this directory into variables in RStudio to gather data from and manipulate them with the following command, modifying for the organism in each SRA file. ```p_fungorum <- read.table("p_fungorum.20.histo", h=F, sep=" ")``` I then plotted each of the files to visualize the density of each kmer frequency with the following command in RStudio, modifying for each organism (Figure 3). ``` plot(p_fungorum, xlim= c(2,200), ylim= c(0,500000), type="l", xlab= "frequency", ylab= "density", main= "p fungorum") ``` ![](https://i.imgur.com/TLS9nmk.png) Figure 3: A plot in Rstudio created with the p_fungorum.20.histo file that represents the density of kmers with each frequency, excluding the peak of kmers at frequency=1 (most likely all errors). Using the values from the table (clicking on the variable in the environment), I found that, for *P. fungorum*, the density values (represented in the table by the category V2) below 480000 represented those after the main peak. I therefore used the following commands to find the midpoint of the peak. The first makes a new, restricted data set, and the second finds the maximum density. ``` p_fungorum_mod <- p_fungorum[which(p_fungorum$V2<480000),] max(p_fungorum_mod$V2) ``` The output of this was a density of 224065. Looking back at the table revealed that this density belongs to the frequency 78. Repeating this process with the rest of the SRA files reveals variable midpoints for the different read files (Table 1) ```csvpreview {header="true"} organism, midpoint of the main peak (frequency), density at midpoint P. fungorum, 78, 224065 P. Megapolitana, 73, 289689 P. Xenovorans, 44, 325890 ``` Table 1: The 20-mer frequency with the highest density, excluding the inital peak for low frequencies, in sequence reads in SRA files for three different organisms. P. Xenovorans reveals a higher density of kmers with lower frequencies, possibly revealing lower coverage in general in this SRA file. Next, I used/modified the following command (using the restricted data set for each of the organisms) that estimates genome size represented by each SRA file by finding the area under the curve of the histogram for each organism and dividing by the coverage (the midpoint of the peak for each). (Table 2) ``` sum(p_fungorum_mod[,1]*p_fungorum_mod[,2])/78 ``` ```csvpreview {header="true"} organism, estimated genome size P. fungorum, 9945469 P. Megapolitana, 10301064 P. Xenovorans, 10303354 ``` Table 2: The estimated genome size for each organism, using the area under the main curve of the histogram for each organism (i.e. figure 3) divided by the midpoint of the peak (Table 1). ## Exercise 2. Build an alignment-free tree **practical**: In my home directory, I created an input directory and moved the files from the course material folder into the my lab_02 input directory with the following commands ``` mkdir lab_02_input cd lab_02_input cp /courses/bi278/Course_Materials/lab_02/* . ``` **practical**: The following commands then allowed me to make a separate directory for the output files in this lab and enter this directory. ``` [lmdrep23@vcacbi278 lab_02]$ cd .. [lmdrep23@vcacbi278 ~]$ mkdir lab_02_output [lmdrep23@vcacbi278 ~]$ cd lab_02_output ``` In the output directory, I executed the following command to count the kmers in the genomes in the files I just placed in the input directory. The -k 20 specifies to count kmers of size 20, and -o k20 specifies to make the output files all start with this prefix. ``` python /usr/local/src/AAF-20160831/BetaVersion/aaf_phylokmer.py –k 20 -n 2 -t 4 -o k20 -d /home/lmdrep23/lab_02_input & ``` The following command calculates a distance matrix from the kmer counts. The distance matrix file will be called k20.dist and the distance tree file will be called k20.tre ``` python /usr/local/src/AAF-20160831/BetaVersion/aaf_distance.py -i k20.dat.gz -t 4 -o k20 -f k20.wc ``` I then went back to Rstudio to be able to see this tree file. The following R commands moved me to the output directory, important the ape library, entered the tree file as a variable in the environment, and plotted the tree (Figure 4). ``` setwd("/home/lmdrep23/lab_02_output/") library(ape) tree <- read.tree("k20.tre") plot(tree) ``` ![](https://i.imgur.com/EPwnRiG.png) Figure 4: The Rstudio visualization of the tree I created through AAF Unix commands. **practical**: The following commands and output reveal the final organization in my directory of the bi278 unix environment ![](https://i.imgur.com/4z967Pz.png)