
# R Tutorial

[BI376](https://hackmd.io/@aphanotus/BI376Lab)

For the week of [14 September 2020](https://hackmd.io/@aphanotus/BI376Lab#September-14)

[Rmd](https://www.bugsinourbackyard.org/wp-content/uploads/2020/10/bi376.lab_.200914.Rmd_.txt) version of this script

## Tutorial instructions

The final project for BI376 this semester will employ morphometric analysis in [R](https://en.wikipedia.org/wiki/R_(programming_language)), a free platform for statistical computing. While you've probably encountered R in other courses, this tutorial is meant as a refresher and as a reference. Later lab protocols will provide information on how you can explore more advanced features.

While I hope you will look through all of this document, some of the material may be familiar and some may be new. At the end, there is a brief exercise asking you to send in a demonstration of your R skills.

> Everyone should use the Rstudio server dedicated to our course.
> https://bi376.colby.edu/
> Access this link on campus or from within Colby's VPN. You must log in with your regular Colby username and password.

![](https://i.imgur.com/kjLeI7E.jpg)

#### For novice R users

If you're new to R, I recommend that you step through each of the example code lines below. As you read this document, copy the R commands into the R script panel in Rstudio and execute them as you go. If you run into trouble, please contact Dave. The goal here is for you to increase your comfort level working with R. Don't give up in frustration! Ask for help.

Several other online resources provide additional guides to beginning R. Manny Gimond has a useful two-minute video introduction to Rstudio <http://mgimond.github.io/ES218/Videos/RStudio_environ1.webm>. *[Swirl](http://swirlstats.com/students.html)* is an interactive Rstudio tutorial, available at <http://swirlstats.com/students.html>.

#### For experienced R users

Everyone should at least scan through this document.
Web links will lead to more detailed background information for some terms, which you are welcome to explore (although it is not required). However, if you're quite comfortable with the concepts described here, feel free to skip to the [exercise](https://hackmd.io/@aphanotus/R_Exercise).

## Rstudio

![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/RStudio_logo_flat.svg/500px-RStudio_logo_flat.svg.png =300x)

Rstudio is a shell for R. You can download and install [Rstudio](https://rstudio.com/products/rstudio/download/#download) on your local machine. However, for work in BI376 it will be convenient to use the server set up for our course.

https://bi376.colby.edu/

This link runs Rstudio from a dedicated virtual machine (VM) that includes the packages you will need this semester. Simply log in with your Colby username and password.

The Rstudio window is divided into 4 panels that each provide different tools.

* The top left panel displays files that you can edit and save. These may be R scripts or R markdown files.
* The R command line is in the bottom left. When a command is run here, nothing is "saved" unless a value is put into an object. I'll explain that more below.
* The top right panel shows either a list of objects currently in memory, or the command history. I find this to be the least useful panel, and you can minimize it by double-clicking on the tabs in the bottom right panel.
* The bottom right panel displays lots of useful stuff, including plots generated in the command line. It also has tabs for navigating storage space (files and folders), looking at installed packages, or consulting the help pages.

Rstudio allows you to plan an elaborate command in the script panel (upper left). When you're ready to run it, keep the cursor on your command and click the `Run` button at the top of the panel or hit `command`-`enter` on the keyboard. The command will be run in the console panel.
If it doesn't do exactly what you want, you can tweak the text in the script panel. When it does what you want, you can save the script file. This provides a saved record of your analysis. If you're editing an R markdown file, rather than a plain R script, you can also include descriptive text that explains the background and interpretation of your analysis. (More on R markdown later.)

## The command line

R has a command line interface, indicated by the `>` character in the console panel. In Rstudio this is the bottom left panel. Any expression you enter here will be evaluated. Try playing around with a few simple commands. R will evaluate simple math statements, just like a calculator.

```{r}
2+3
```

```
[1] 5
```

```{r}
9^2
```

```
[1] 81
```

```{r}
log(1000)
```

```
[1] 6.907755
```

Note that there's an index in front of each answer, that `[1]`. Also notice that the `log` function uses the [natural log](https://en.wikipedia.org/wiki/Natural_logarithm). Use `log10` for [base-10 logs](https://en.wikipedia.org/wiki/Common_logarithm).

```{r}
log10(1000)
```

```
[1] 3
```

It will quickly become useful to define *objects* to save values in R. Do this using `<-`.

```{r}
a <- log10(1000)
```

Notice that you don't see the value produced by `log10(1000)` in this case. Instead of going to the output, the value has been stored in the object `a`. If you enter an object's name, its value will be displayed.

```{r}
a
```

```
[1] 3
```

It's also possible to assign values to an object using `=`. However, I recommend using `<-` instead. Why? Two equal signs (`==`) is a test of [Boolean logic](https://en.wikipedia.org/wiki/Boolean_expression), not the definition of an object (`=`). Using the arrow for definitions (`<-`) avoids this potential confusion.

```{r}
a == log10(1000)
```

```
[1] TRUE
```

```{r}
a == log(1000)
```

```
[1] FALSE
```

The values produced by these Boolean functions are logical values, not numeric values.
R has different [types](https://www.tutorialspoint.com/r/r_data_types.htm) of data. `numeric` and `logical` are two different data types. This example also illustrates what are called [reserved words](https://www.datamentor.io/r-programming/reserved-words/) in R. These are words that have special meaning. You cannot use `TRUE` as an object name, because that word has a special meaning.

Objects in R can be much more complex than a single value. The `c` function combines values into a [vector](http://www.r-tutor.com/r-introduction/vector). Similarly, you may want your data in a [matrix](http://www.r-tutor.com/r-introduction/matrix/matrix-construction).

```{r}
b <- c(1,2,3)
b
```

```
[1] 1 2 3
```

```{r}
m <- matrix(c(1,2,3,4,5,6), ncol=2)
m
```

```
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
```

## Getting data into R

New users often find getting data into R to be one of the hardest parts. Several methods exist. First, it's possible to simply define a vector that includes data.

```{r}
sample1 <- c(2,4,6,8)
variable1 <- c(84.5,47.3,18.9,57.2,44.6,50.1)
```

Rather than entering many values into the R command line one by one, you can also paste them from the clipboard. Select the values from a row or column in Excel, and copy them. Use the `scan` function. Hit enter, and R will prompt you to enter values with `1:`. Paste the values from the clipboard, and hit enter twice.

```{r}
sample1 <- scan()
```

Sharing large tables of data is often done using an open-source file format such as CSV or TSV. These are simply text files in which columns have "comma-separated values" or "tab-separated values". R can read these files directly.

```{r}
my.data <- read.csv("my.giant.datafile.csv")
```

You need to be mindful of the [path](https://en.wikipedia.org/wiki/Path_(computing)) you use to call the file and your [current working directory](https://bookdown.org/ndphillips/YaRrr/the-working-directory.html). Thankfully, Rstudio can help in managing this.
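
As a minimal sketch of how paths and the working directory interact, the chunk below writes a tiny table to a temporary CSV file and reads it back. The object names (`toy.data`, `tmp.file`, `toy.copy`) and the values are made up for illustration; `getwd`, `tempfile`, `write.csv`, `file.exists` and `read.csv` are all base R functions.

```{r}
# Show the folder R is currently working in
getwd()

# A made-up table, just for illustration
toy.data <- data.frame(id = c(2,4,6,8), value = c(84.5,47.3,18.9,57.2))

# Write it to a temporary CSV file; tempfile() returns a full path
tmp.file <- tempfile(fileext = ".csv")
write.csv(toy.data, tmp.file, row.names = FALSE)

# Confirm that the file exists at that path, then read it back in
file.exists(tmp.file)
toy.copy <- read.csv(tmp.file)
toy.copy
```

Because `tmp.file` holds a full path, this works no matter what the working directory is. A bare file name such as `"my.giant.datafile.csv"` is instead looked up relative to the folder reported by `getwd()`.
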
First, if you've never changed the working directory, and you've always done everything in the same folder from the Files panel, you can probably just ignore paths entirely.

Alternatively, you can use the Files panel to navigate through folders to find the data file you need. Once you see it, click the gear icon that says "More" and select `Set As Working Directory`. You may notice what this actually does is to run a command `setwd` in the console. You can copy this code to your script, use it, and modify it later.

Alternatively, R has a very useful function that helps you find files using your operating system's file browser.

```{r}
my.data <- read.csv(file.choose())
```

When you run this command, a file browser window will open, allowing you to search for your data file.

The examples above used `read.csv` for CSV format files. For TSV files, you'll use `read.delim`.

```{r}
my.data <- read.delim("my.giant.datafile.tsv")
```

Excel file data can be imported into R too. But there are some important considerations. Talk to Dr. A. if you're considering the need to do this. Or simply save your Excel file in a CSV or TSV format.

## Exploring tabular data in R

If you've used Excel a lot, learning to use R can be disorienting because you don't constantly get to see the table of your data. This is actually helpful, since R can easily handle datasets large enough to crash Excel. Thankfully there are several functions that let you see and manipulate a data table in R.

R comes with several [datasets](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html) built in, and we'll use one of those for this example. It contains data on the weights of several chicks raised on different diets over developmental time. You can load a dataset with the `data` function.
```{r}
data("ChickWeight")
```

Once you've loaded a data table into R, you can use several functions to find out its dimensions (`dim`), that is how many rows and columns it has, look at the column names (`names`), and take a look at the top (`head`) or bottom (`tail`) of the table.

```{r}
dim(ChickWeight)
```

```
[1] 578   4
```

This tells us the table has 578 rows and 4 columns.

```{r}
names(ChickWeight)
```

```
[1] "weight" "Time"   "Chick"  "Diet"
```

```{r}
head(ChickWeight)
```

```
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1
```

An individual column in a table can be called by name, like so.

```{r}
head(ChickWeight$weight, n = 10)
```

```
 [1]  42  51  59  64  76  93 106 125 149 171
```

The function `colnames` is similar to `names`, in that it will report the names of columns in a table. However, you can also use `colnames` to rename columns, in this way.

```{r}
colnames(ChickWeight) <- c("weight", "days", "individual", "diet")
colnames(ChickWeight)
```

```
[1] "weight"     "days"       "individual" "diet"
```

Data in columns can be used in mathematical manipulations, and this often gives you a reason to create new columns in the table.

```{r}
ChickWeight$hours <- ChickWeight$days * 24
head(ChickWeight)
```

```
  weight days individual diet hours
1     42    0          1    1     0
2     51    2          1    1    48
3     59    4          1    1    96
4     64    6          1    1   144
5     76    8          1    1   192
6     93   10          1    1   240
```

## Subsetting

R allows you to "subset" the data in a table. After the name of an object, you can use hard brackets and refer to the row and column numbers of a particular value. Below, we're referencing row 5, column 2.

```{r}
ChickWeight[5,2]
```

```
[1] 8
```

There are often multiple ways to do something in R. For example, you can combine the column name reference and the bracket subsetting. Here the column is already chosen by name, so the brackets only need the row number.

```{r}
ChickWeight$days[5]
```

```
[1] 8
```

If you leave a row or column value empty within the brackets, R will give the entire row or column that you do reference. Here, we'll reference all of row 5.
```{r}
ChickWeight[5,]
```

```
  weight days individual diet hours
5     76    8          1    1   192
```

You can also specify a vector of row or column numbers. For example, rows 3 and 5.

```{r}
ChickWeight[c(3,5),]
```

```
  weight days individual diet hours
3     59    4          1    1    96
5     76    8          1    1   192
```

You can also *exclude* a particular row or column by using a minus sign in the subset reference. Perhaps we want the table without the `hours` column:

```{r}
head(ChickWeight[,-5])
```

```
  weight days individual diet
1     42    0          1    1
2     51    2          1    1
3     59    4          1    1
4     64    6          1    1
5     76    8          1    1
6     93   10          1    1
```

Subsetting can be useful for data curation. Let's say we need to edit the weight in row 5. We can treat the subsetted table location as an object we're defining.

```{r}
ChickWeight[5,1] <- 70
head(ChickWeight)
```

```
  weight days individual diet hours
1     42    0          1    1     0
2     51    2          1    1    48
3     59    4          1    1    96
4     64    6          1    1   144
5     70    8          1    1   192
6     93   10          1    1   240
```

The value has been changed from 76 to 70, and no other changes have been made to the table.

### which

Sometimes you may not know the row and column positions of particular values in the table. This is often the case with large tables. The function `which` can be very helpful in subsetting. It takes a vector of logical values as input (usually the result of a Boolean test) and returns the positions of the elements that are `TRUE`.

As an example, let's say we want the chick weights from only the last day in the experiment. We can use the function `max` to find the maximum value of `days` and `which` to see which rows match that value, then use the output to subset the table.

```{r}
head(ChickWeight[which(ChickWeight$days == max(ChickWeight$days)),])
```

```
   weight days individual diet hours
12    205   21          1    1   504
24    215   21          2    1   504
36    202   21          3    1   504
48    157   21          4    1   504
60    223   21          5    1   504
72    157   21          6    1   504
```

## R functions

R comes equipped with a number of functions to manipulate data. In general, if there's anything you'd like to do with numbers (or text!), just Google search it and add "in R". You will probably find a way!
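
To give a sense of what is built in, here is a short sketch using a few base R summary functions that are not otherwise covered in this tutorial. It works on a fresh copy of the built-in dataset (under the made-up name `cw`, with the original column names), so it doesn't depend on any of the edits made to `ChickWeight` above.

```{r}
# Work on a fresh copy of the built-in data, with its original column names
cw <- datasets::ChickWeight

median(cw$weight)    # the middle value
sd(cw$weight)        # the standard deviation
range(cw$weight)     # the smallest and largest values
summary(cw$weight)   # several summary statistics at once

# Functions can be nested: how many different chicks are in the table?
length(unique(cw$Chick))
```
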
### `mean`

Here's a simple example: a function to find the mean.

```{r}
mean(b)
```

```
[1] 2
```

### generating random numbers

That's not super interesting, so let's introduce a function to randomly generate numbers. R has several functions that do this, depending on whether you'd like numbers drawn from a normal distribution (`rnorm`), a uniform distribution (`runif`), [etc](https://blog.revolutionanalytics.com/2009/02/how-to-choose-a-random-number-in-r.html).

```{r}
b <- runif(n = 100, min = 0, max = 10)
mean(b)
```

```
[1] 5.764308
```

So, how did I know that the `runif` function takes those arguments? Every function in R has a help page that you can access by entering the name of the command preceded by `?`. Like this: `?runif`

The help page will list the arguments that the function takes as input, and lots of other useful information.

As a short-cut, R functions will assume that arguments are entered in an expected order. So if I enter `b <- runif(100, 0, 10)` the function will assume I'm passing arguments to `n` (the sample size), `min` (the lower limit of the distribution) and `max`, in that order. This is not always something you want to do, since an important goal of coding is to **make your code understandable by other people!**

### Randomly `sample` values

The random number functions are useful, but sometimes you have an existing group of values that you want to randomly draw from. The R function `sample` exists to do this. The help page tells us this function takes the arguments `x` (the vector to draw values from), `size` (the sample size -- the number of values to draw), and `replace` (a logical value specifying whether the sampling should be done with replacement or not).

So, below let's randomly sample 3 values from the `days` column of the `ChickWeight` dataset.

```{r}
sample(x = ChickWeight$days, size = 3, replace = TRUE)
```

```
[1] 20  2  4
```

The `replace = TRUE` argument allows for the possibility of sampling the same value repeatedly.
Set `replace = FALSE` to prevent that. Also, note that the values we get are not in any kind of order. They're random!

### Sorting values

The function `sort` will rearrange values in a vector in order. The default is to put them in increasing order. If you'd like them in descending order, add the argument `decreasing = TRUE`.

```{r}
some.days <- sample(x = ChickWeight$days, size = 5, replace = TRUE)
some.days
```

```
[1] 16  6 20 14 16
```

```{r}
sort(some.days)
```

```
[1]  6 14 16 16 20
```

There is also a function, `order`, that tells you the order of values in a vector, but doesn't actually rearrange them. Each number in its output is the position of a value in the original vector: here the smallest value is in position 2, the next smallest in position 4, and so on.

```{r}
order(some.days)
```

```
[1] 2 4 1 5 3
```

## Missing data

R has a reserved word explicitly for missing data: `NA`. You can use this value as a stand-in whenever necessary. Be aware that many functions aren't immediately equipped to deal with `NA` values.

```{r}
x <- c(1,2,3,NA)
mean(x)
```

```
[1] NA
```

However, another function exists to detect `NA` values. `is.na` returns a vector of logical values specifying whether or not values are `NA`. You can use subsetting to then exclude those values. In R the `!` is used as a Boolean NOT function. In other words `!FALSE` is `TRUE`. We can put these ideas together this way.

```{r}
mean(x[!is.na(x)])
```

```
[1] 2
```

Many functions also have an optional argument `na.rm` that can tell them to remove `NA` values from their calculation.

```{r}
mean(x, na.rm = TRUE)
```

```
[1] 2
```

## Packages

R comes ready-made with lots of useful functions. This is what's known as "[base R](https://rstudio.com/wp-content/uploads/2016/05/base-r.pdf)". However, packages (also called "libraries") add more functions and datasets that extend the utility of R.

First, it's necessary to install a package from its source. A central repository called [CRAN](https://cran.r-project.org/) exists to make loading packages easy.
If you're using the course Rstudio server, first be sure that your [working directory](http://rfunction.com/archives/1001) is set to your home folder (`~`) by executing the following command.

```{r}
setwd("~")
```

Then you can install a package as shown below. In this example `RCarb` is the name of the package being installed. ([This package](https://cran.r-project.org/web/packages/RCarb/index.html) is used for dose rate modeling for carbonate-rich samples. Not something we actually need, but it illustrates the process!)

```{r}
install.packages("RCarb")
```

In general, you should be able to install any CRAN package in the BI376 Rstudio server, as long as `NeedsCompilation` is listed as `no`.

Other packages are available from [GitHub](https://github.com/), a website used by many software developers. These packages can be installed as shown below.

```{r}
devtools::install_github("aphanotus/borealis")
```

Most packages on GitHub will have [instructions](https://github.com/aphanotus/borealis#installation) on how to install the package. They should typically follow this example, where `aphanotus` is the name of the GitHub [user](https://github.com/aphanotus) and `borealis` is the name of the package.

Once a package is installed, it needs to be loaded for you to have access to the functions and data it contains. This can be done using the Packages tab in the lower right panel in Rstudio, or directly in the command console as shown below.

```{r}
library(borealis)
```

You can also invoke a package's function without loading the entire package. This was just done above when we called `devtools::install_github`. The function `install_github` is part of the package `devtools`.

## Defining functions

R also gives you the ability to [define your own functions](https://swcarpentry.github.io/r-novice-inflammation/02-func-R/). Here's a simple function to find the standard error of the mean for a vector of values.
```{r}
se <- function(x) {
  sd(x)/sqrt(length(x))
}
sample1 <- rnorm(n = 10, mean = 0, sd = 1)
se(sample1)
```

```
[1] 0.3914143
```

## Plots with base R

One of the powers of R is to easily make good graphical plots. There are also endless ways to customize graphics and make them very fancy.

The base R `plot` function is one of the simplest plotting functions in R. It requires you to pass `x` and `y` values. These should be vectors of the same length.

```{r}
a <- rnorm(n = 10, mean = 5, sd = 2)
b <- runif(n = 10, min = 0, max = 10)
plot(a,b)
```

![](https://i.imgur.com/Qp6Zhy5.png)

That's pretty simple. To [add some details](http://publish.illinois.edu/johnrgallagher/files/2015/10/BaseGraphicsCheatsheet.pdf) to the plot, the function will take additional arguments. `xlim` and `ylim` can set the bounds of the x and y axes. `xlab`, `ylab` and `main` will add titles to the axes and top of the plot.

```{r}
plot(a,b,
     xlim=c(0,10), ylim=c(0,10),
     main="example plot",
     xlab="independent variable",
     ylab="dependent variable")
```

> Notice that I've actually let the code break across multiple lines here. That often improves the readability of the code. Just be sure that you separate each of the arguments with a comma and end the function call with the closing parenthesis.

![](https://i.imgur.com/JNMXEVH.png)

You can add linear trend lines and vertical or horizontal lines using the `abline` function. This is a separate function that you call after running `plot`, and it's then layered on top of the existing plot. You can change the color of the lines using the argument `col`. R recognizes many regular [color names](https://www.r-graph-gallery.com/42-colors-names.html).

```{r}
abline(lm(b~a))
abline(v=1, col = "darkblue")
abline(h=5, col = "darkred")
```

![](https://i.imgur.com/tbkgWFI.png)

### Box plots

So far we've looked at plots of two continuous variables. Let's look at a plot involving categorical data. We'll return to the `ChickWeight` dataset.
In this case we'll want to use the `boxplot` function. This function takes the x and y variables in a [formula](https://www.datacamp.com/community/tutorials/r-formula-tutorial) notation, as in `y ~ x`. Think of the tilde as saying "as a function of". So below, we'll plot chick weight "as a function of" days of development.

```{r}
boxplot(ChickWeight$weight ~ ChickWeight$days)
```

![](https://i.imgur.com/trhKNAw.png)

This style of plot is known as Tukey's [box-whisker plot](https://www.nature.com/articles/nmeth.2813.pdf?origin=ppub). The "box" encloses 50% of the data (the inner [quartiles](https://en.wikipedia.org/wiki/Interquartile_range)). The heavy black bar is the median. The whiskers extend to the full extent of the data range -- unless there are [outliers](https://www.dsquintana.blog/labeling-boxplot-outliers/#:~:text=Identifying%20and%20labeling%20boxplot%20outliers%20in%20R&text=Typically%2C%20boxplots%20show%20the%20median,or%20datapoint%20is%20your%20outlier.), which are plotted as open circles.

### Pairs plots

In biology we are often interested in correlations. R has a great function `pairs` to quickly examine the correlations among all the variables in your dataset.

```{r}
pairs(ChickWeight[,-5])
```

![](https://i.imgur.com/xfGZFse.png)

This lets us see that there's probably a correlation between `weight` and `days`, but probably not the other factors.

An informative modification of `pairs` is available in the `borealis` package. As shown below, it gives us trend lines and correlation coefficients. More details on this function can be found in its help page, `?borealis::pairs`.

```{r}
borealis::pairs(ChickWeight)
```

![](https://i.imgur.com/WYhX47n.png)

You can do a lot with base R graphics. And it is often faster and easier to make plots that way in the early stages of an analysis, when you're just trying to explore the data.
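
For quick exploration of this kind, two base R shortcuts may help. This is a sketch rather than part of the original analysis. It uses a fresh copy of the built-in data (under the made-up name `cw`, with the original column names `Time` and `Diet`), passes the table through the `data` argument of `boxplot` so that column names don't need the `$` prefix, and uses `cor` to put a number on the correlation suggested by the pairs plot.

```{r}
# A fresh copy of the built-in data, with its original column names
cw <- datasets::ChickWeight

# Formula notation with a data argument: weight "as a function of" Diet
boxplot(weight ~ Diet, data = cw,
        xlab = "diet", ylab = "chick weight (g)")

# Pearson correlation coefficient between weight and time
cor(cw$weight, cw$Time)
```
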
## ggplot

`ggplot2` is a package that provides much [more control over the formatting of plots](https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf). It creates graphical objects that remain stored in R's environment. This has advantages for complex plots, and it lets you add "layers" to existing plots later.

Let's work with the `ChickWeight` dataset again, and look at the weights of chicks on the last day of the experiment, comparing them by the different diets. Here's a base R boxplot showing these results. Since the subsetting gets cumbersome, we'll just define a new object `chick.final` to hold the modified version of the dataset.

```{r}
# In case it wasn't done before
colnames(ChickWeight) <- c("weight", "days", "individual", "diet")
chick.final <- ChickWeight[which(ChickWeight$days == max(ChickWeight$days)),]
boxplot(chick.final$weight ~ chick.final$diet)
```

![](https://i.imgur.com/U0trFR5.png)

Here's the equivalent version using `ggplot`. Start by loading the package. Notice that we can define an object `p` that holds the plot information. Then call `p` to display the plot.

```{r}
library(ggplot2)
p <- ggplot(chick.final, aes(x=diet, y=weight)) +
  geom_boxplot()
p
```

![](https://i.imgur.com/c1w7hxr.png)

That's already a bit more aesthetically pleasing! Notice that there are two parts to the statement above. The first part, `ggplot`, takes as input the data frame we want to consider, in this case it's `chick.final`. Next we define the x and y axes as `diet` and `weight`. If you try that part alone, you'll see that it displays an empty field. That's because `ggplot` needs to be told what sort of graphical elements to generate using those x and y data. That's given by the second element, `geom_boxplot`. The parentheses are necessary, even though in this example we don't pass anything into `geom_boxplot`. However, you can give it additional information to change things like color and transparency too.
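
The two-part structure can be checked directly. The sketch below rebuilds the final-day subset from a fresh copy of the built-in data, so it uses the original column names `Diet`, `Time` and `weight`. The object names `cw`, `cw.final`, `p.empty` and `p.filled` are made up here, and the particular `fill` and `alpha` values are arbitrary choices for illustration.

```{r}
library(ggplot2)

# A fresh copy of the built-in data, subset to the last day of the experiment
cw <- datasets::ChickWeight
cw.final <- cw[cw$Time == max(cw$Time), ]

# Data and axes only: this draws an empty field
p.empty <- ggplot(cw.final, aes(x = Diet, y = weight))
p.empty

# Adding a geom tells ggplot what to draw; arguments change its appearance
p.filled <- p.empty +
  geom_boxplot(fill = "lightblue", alpha = 0.5)
p.filled
```

Storing the empty plot in its own object also shows that layers can be added to an existing plot object later.
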
### Adding layers: `geom_jitter`

One good practice in generating graphs is to show the actual data. Box plots give a good sense of the median and dispersion of the data, but it's not immediately clear what the sample size is. We can use `ggplot` to add another layer that shows individual points for the weight produced by each diet. Since it may be hard to interpret if they all line up above one another, we can plot them with a random "jitter" in the x-axis.

```{r}
p + geom_jitter()
```

![](https://i.imgur.com/5wumFp0.png)

An important note about `geom_jitter` is that it randomly moves points, and by default it does so in both the x and y directions. You can control the extent of that displacement with the `position` parameter. Setting `height = 0`, as in the later examples, keeps the y values exactly where the data put them.

```{r}
p + geom_jitter(position = position_jitter(width = 0.2))
```

![](https://i.imgur.com/MUrXPB9.png)

Of course, you can also just edit the original object definition. If the line defining your `ggplot` object is getting crowded, you can break it into multiple lines. Just be sure to copy it all into the console window to run it. Or from the script window in Rstudio, hit Run `[command]-[enter]` from one of the lines of the `ggplot` definition. But notice that each layer needs to be separated by a `+`. It's easy to forget about the `+` when your layers span multiple lines like this.

```{r}
p <- ggplot(chick.final, aes(x=diet, y=weight)) +
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2, height = 0))
```

### Using functions in graphs

You can build in functions to the data being graphed too. So if we want to look at the log of weight...

```{r}
p <- ggplot(chick.final, aes(x=diet, y=log10(weight))) +
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2, height = 0))
p
```

![](https://i.imgur.com/BRtiDz9.png)

### Formatting axes

The real appeal of `ggplot` is that it allows you to customize almost everything about a plot. The title and axes can be defined using `ggtitle`, `xlab` and `ylab`.
The y-axis label uses a trick to get the "10" in log~10~ sub-scripted.

```{r}
p <- ggplot(chick.final, aes(x=diet, y=log10(weight))) +
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2, height = 0)) +
  ggtitle("Weights of chicks based on different diets") +
  xlab("Artificial diet number") +
  ylab(expression(paste(log[10]," chick weight (g)", sep="")))
p
```

![](https://i.imgur.com/FehVAqf.png)

### Themes

Use of themes allows customization of font, size, orientation and other elements of the plot. Below I'll center the title, increase the sizes of all the fonts, and add a black border around the plot.

```{r}
p <- ggplot(chick.final, aes(x=diet, y=log10(weight))) +
  theme(
    plot.title = element_text(size=16,face="bold",hjust=0.5),
    axis.text.x = element_text(size=14,face="italic"),
    axis.text.y = element_text(size=12,face="plain"),
    axis.title.x = element_text(size=14,face="plain"),
    axis.title.y = element_text(size=14,face="plain"),
    panel.border = element_rect(fill = NA, color = "black")
  ) +
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2, height = 0)) +
  ggtitle("Weights of chicks based on different diets") +
  xlab("Artificial diet number") +
  ylab(expression(paste(log[10]," chick weight (g)", sep="")))
p
```

![](https://i.imgur.com/22MJH8m.png)

Theme definitions can get big. One helpful thing to do is to define the theme as a separate object that you then call within the `ggplot` definition. If you have multiple plots in your analysis, you can define a theme once and have all the individual `ggplot` objects call the same theme.
```{r}
large.text.theme <- theme(
  plot.title = element_text(size=16,face="bold",hjust=0.5),
  axis.text.x = element_text(size=14,face="italic"),
  axis.text.y = element_text(size=12,face="plain"),
  axis.title.x = element_text(size=14,face="plain"),
  axis.title.y = element_text(size=14,face="plain"),
  panel.border = element_rect(fill = NA, color = "black")
)

p <- ggplot(chick.final, aes(x=diet, y=log10(weight))) +
  large.text.theme +
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2, height = 0)) +
  ggtitle("Weights of chicks based on different diets") +
  xlab("Artificial diet number") +
  ylab(expression(paste(log[10]," chick weight (g)", sep="")))
```

### Preset themes

Alternatively, `ggplot` comes with several predefined themes. Some people don't like `ggplot`'s default gray background. If so, there is a theme called `theme_bw` which sticks to black and white elements.

```{r}
p <- ggplot(chick.final, aes(x=diet, y=log10(weight))) +
  theme_bw() +
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2, height = 0)) +
  ggtitle("Weights of chicks based on different diets") +
  xlab("Artificial diet number") +
  ylab(expression(paste(log[10]," chick weight (g)", sep="")))
p
```

![](https://i.imgur.com/egZNG4E.png)

### Color and Opacity

Another thing you can customize is the size, color and opacity of the layers. This is especially useful if you want to layer actual data points on top of something, like a boxplot or a violin plot. The opacity is controlled through a parameter called `alpha`, where 1 is completely opaque and 0 is completely transparent. If you're displaying lots of dots, giving them some transparency helps show where they're overlapping.
```{r}
p <- ggplot(chick.final, aes(x=diet, y=log10(weight))) +
  theme_bw() +
  geom_boxplot(colour="grey", size=1, alpha = 0.6) +
  geom_jitter(position = position_jitter(width = 0.2, height = 0),
              colour = "darkblue", size=3, alpha = 0.8) +
  ggtitle("Weights of chicks based on different diets") +
  xlab("Artificial diet number") +
  ylab(expression(paste(log[10]," chick weight (g)", sep="")))
p
```

![](https://i.imgur.com/rHnpXmO.png)

Note that I'm using the Commonwealth spellings of *colour* and *grey* in this code. `ggplot` (and a lot of R tools) were originally written in New Zealand, which uses British spellings. However, the authors conveniently included duplicate versions of all their functions using British and American spellings. So `color = "gray"` will work just as well.

### Violin plots

If you have a lot of data points, and you want a reader to easily have a sense of their distribution, you can use what's called a violin plot. The median is indicated by adding the `draw_quantiles = 0.5` argument.

```{r}
p <- ggplot(chick.final, aes(x=diet, y=log10(weight))) +
  theme_bw() +
  geom_violin(colour="grey", size=1, trim = FALSE, alpha = 0.75,
              draw_quantiles = 0.5) +
  geom_jitter(position = position_jitter(width = 0.2, height = 0),
              colour = "darkblue", size=3, alpha = 0.8) +
  ggtitle("Weights of chicks based on different diets") +
  xlab("Artificial diet number") +
  ylab(expression(paste(log[10]," chick weight (g)", sep="")))
p
```

![](https://i.imgur.com/CnSQDDU.png)

### Scatter plots

Let's return to the original chick dataset to illustrate how to plot two continuous variables against one another. That is, let's make a scatter plot using `ggplot`.

```{r}
p2 <- ggplot(ChickWeight, aes(x=days, y=log10(weight))) +
  theme_bw() +
  geom_point(size = 2, alpha = 0.5) +
  xlab("day") +
  ylab(expression(paste(log[10]," chick weight (g)", sep="")))
p2
```

![](https://i.imgur.com/HhY0naa.png)

Trend lines can also be added using the `stat_smooth` feature.
```{r}
p2 <- ggplot(ChickWeight, aes(x=days, y=log10(weight))) +
  theme_bw() +
  geom_point(size = 2, alpha = 0.5) +
  stat_smooth(method=lm) +
  xlab("day") +
  ylab(expression(paste(log[10]," chick weight (g)", sep="")))
p2
```

![](https://i.imgur.com/9h7Q1eo.png)

As with most features of ggplot, you can set the size, color and transparency of the line here by adding `size`, `color` and `alpha` arguments inside the `stat_smooth` function call. Notice that the line here also has a gray outline. This is a confidence interval the function calculates for the line. The confidence boundary can be useful, but if you want to get rid of it, just include `se = FALSE` as an argument in the `stat_smooth` function call.

### Facets

`ggplot` is particularly useful when you want to start making more complex plots. For example, the dataset includes weight measurements from chicks raised on different diets. Interpreting the results may be easier if we can see growth over time for each of the diets separately. The `facet_wrap` feature allows us to make replicate plots with data from each level of a factor, like `diet` in this example.

```{r}
p3 <- ggplot(ChickWeight, aes(x=days, y=log10(weight))) +
  theme_bw() +
  facet_wrap(.~diet) +
  geom_point(size = 2, alpha = 0.5) +
  stat_smooth(method=lm, se = FALSE) +
  xlab("day") +
  ylab(expression(paste(log[10]," chick weight (g)", sep="")))
p3
```

![](https://i.imgur.com/bQbG3Nk.png)

### Other ggplot `geoms`

Lots of other [styles of plots are possible with ggplot](https://ggplot2.tidyverse.org/reference/). Their use all follows the same basic rules we've seen above.

### Other tutorials on `ggplot`

The list below provides links to several other excellent online resources for using `ggplot`. Happy plotting!

* A simple [**cheatsheet**](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) for common `ggplot` elements
* Herb Wilson teaches extensively with R in BI223 Science and Baseball.
  He has recorded a number of instructional videos on YouTube, including this one on [**making a histogram using `ggplot`**](https://www.youtube.com/watch?v=BvykvAgSnUg&feature=youtu.be)
* A [**Harvard University workshop**](http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html) on data science walks through the use of `ggplot`, with the example of deconstructing a clear and informative plot from *The Economist*
* [**ggplot2: Elegant Graphics for Data Analysis**](http://cbbcat.net/record=b4724928) by Hadley Wickham is a book that provides a comprehensive presentation of `ggplot`'s capabilities. The complete text is available online through the CBB library consortium.

![](https://i.imgur.com/xTF35Mn.png)

> An excellent example of data visualization from [The Economist](https://www.economist.com/graphic-detail/)

## The R Exercise

After you've reviewed the background information above, follow this link for instructions on the [R Exercise](https://hackmd.io/@aphanotus/R_Exercise) to complete for class.

---

## Quick Links

- [BI376 Lab Course Page](https://hackmd.io/@aphanotus/BI376Lab)
- [BI376 Moodle](https://moodle.colby.edu/course/view.php?id=19447)
- [*Facit Saltum*](http://web.colby.edu/evodevo/)
- [HackMD](https://hackmd.io/)
