R Tutorial

BI376
For the week of 14 September 2020

Rmd version of this script

Tutorial instructions

The final project for BI376 this semester will employ morphometric analysis in R, a free platform for statistical computing. While you've probably encountered R in other courses, this tutorial is meant as a refresher and as a reference. Later lab protocols will provide information on how you can explore more advanced features.

While I hope you will look through all of this document, some of the material may be familiar and some may be new. At the end, there is a brief exercise asking you to send in a demonstration of your R skills.

Everyone should use the Rstudio server dedicated to our course. https://bi376.colby.edu/ Access this link on campus or from within Colby's VPN. You must log in with your regular Colby username and password.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

For novice R users

If you're new to R, I recommend that you step through each of the example code lines below. As you read this document, copy the R commands into the R script panel in Rstudio and execute them as you go.

If you run into trouble, please contact Dave. The goal here is for you to increase your comfort level working with R. Don't give up in frustration! Ask for help.

Several other online resources provide additional guides to beginning R. Manny Gimond has a useful two-minute video introduction to Rstudio http://mgimond.github.io/ES218/Videos/RStudio_environ1.webm. Swirl is an interactive Rstudio tutorial, available at http://swirlstats.com/students.html.

For experienced R users

Everyone should at least scan through this document. Web links will lead to more detailed background information for some terms, which you are welcome to explore (although it is not required). However, if you're quite comfortable with the concepts described here, feel free to skip to the exercise.

Rstudio

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Rstudio is a shell for R. You can download and install Rstudio on your local machine. However, for work in BI376 it will be convenient to use the server set-up for our course. https://bi376.colby.edu/ This link runs Rstudio from a dedicated virtual machine (VM) that includes the packages you will need to run this semester. Simply log in with your Colby username and password.

The Rstudio window is divided into 4 panels that each provide different tools.

The top left panel displays files that you can edit and save. These may be R scripts or R markdown files.
The R command line is in the bottom left. When a command is run here, nothing is "saved" except if a value is put into an object. - I'll explain that more below.
The top right panel shows either a list of objects currently in memory, or the command history. I find this to be the least useful panel, and you can minimize it by double-clicking on the tabs in the bottom right panel.
The bottom right panel displays lots of useful stuff, including plots generated in the command line. It also has tabs for navigating storage space (files and folders), looking at installed packages, or consulting the help pages.

Rstudio allows you to plan an elaborate command in the script panel (upper left), and when you're ready to run it, keep the cursor on your command and mouse-click the Run button at the top of the panel or hit command-enter on the keyboard. The command will be run in the console panel. If it doesn't do exactly what you want, you can tweak the text in the script panel. When it does what you want, you can save the script file. This provides a saved record of your analysis. If your editing an R markdown file, rather than a plain R script, you can also include descriptive text that explains the background and interpretation of your analysis. (More on R markdown later.)

The command line

R has a command line interface, indicated by the > character in the console panel. In Rstudio this is the bottom right panel. Any expression you enter here will be evaluated. Try playing around with a few simple commands.

R will evaluate simple math statements, just like a calculator.

2+3

[1] 5

9^2

[1] 81

log(1000)

[1] 6.907755

Note that there's an index in front of each answer, that [1]. Also notice that the log function uses the natural log. Use log10 for base-10 logs.

log10(1000)

[1] 3

It will quickly become useful to define objects to save values in R. Do this using <-.

a <- log10(1000)

Notice that you don't see the value produced by log10(1000) in this case. Instead of going to the output, the value has been stored in the object a. If you enter an object's name, its value will be displayed.

[1] 3

It's also possible to assign values to an object using =. However, I recommend using <- instead. Why? Two equal signs (==) is a test of Boolean logic, not the definition of an object (=). Using the arrow for definitions (<-) avoids this potential confusion.

a == log10(1000)

[1] TRUE

a == log(1000)

[1] FALSE

The values produced by these Boolean functions are logical values, not numeric values. R has different types of data. numeric and logical are two different data types. This example also illustrates what are called reserved words in R. These are words that have special meaning. You cannot use TRUE as an object name, because that word has a special meaning.

Objects in R can be much more complex than a single value. The c function combines values into a vector. Similarly, you may want your data in a matrix.

b <- c(1,2,3)
b

[1] 1 2 3

m <- matrix(c(1,2,3,4,5,6), ncol=2)	
m

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

Getting data into R

New users often find getting data into R to be one of the hardest parts. Several methods exist. First it's possible to simply define a vector that includes data.

sample1 <- c(2,4,6,8)  
variable1 <- c(84.5,47.3,18.9,57.2,44.6,50.1)

Rather than entering many values into the R command line one by one, you can also paste them from the clipboard. Select the values from a row or column in Excel, and copy them. Use the scan function. Hit enter, and R will prompt you to enter values with 1:. Paste the values from the clipboard, and hit enter twice.

sample1 <- scan()

Sharing large tables of data is often done using an open-source file format such as CSV or TSV. These are simply text files in which columns have "comma-separated values" or "table-separated values". R can read these files directly.

my.data <- read.csv("my.giant.datafile.csv")

You need to be mindful of the path you use to call the file and your current working directory. Thankfully, Rstudio can help in managing this. First, if you've never changed the working directory, and you've always done everything in the same folder from the Files panel, you can probably just ignore paths entirely. Alternatively, you can use the Files panel to navigate through folders to find the data file you need. Once you see it, click the gear icon that says "More" and select Set As Working Directory. You may notice what this actually does is to run a command setwd in the console. You can copy this code to your script, use it, and modify it later.

Alternatively, R has a very useful function that helps you find files using your operating system's file browser.

my.data <- read.csv(file.choose())

When you run this command, a file browser window will open, allowing you to search for your data file.

The examples above used read.csv for CSV format files. For TSV files, you'll use read.delim.

my.data <- read.delim("my.giant.datafile.tsv")

Excel file data can be imported into R too. But there are some important considerations. Talk to Dr. A. if you're considering the need to do this. Or simply save your Excel file in a CSV or TSV format.

Exploring tabular data in R

If you've used Excel a lot, learning to use R can be disorienting because you don't constantly get to see the table of your data. This is actually helpful, since R can easily handle datasets large enough to crash Excel. Thankfully there are several functions that let you see and manipulate a data table in R.

R comes with several datasets built in, and we'll use one of those for this example. It contains data on the weights of several chicks raised on different diets over developmental time. You can load a dataset with the data function.

data("ChickWeight")

Once you've loaded a data table into R, you can use several functions to find out its dimensions (dim), that is how many rows and columns it has, look at the column names (names), and take a look at the top (head) or bottom (tail) of the table.

dim(ChickWeight)

[1] 578   4

This tells us the table has 578 rows and 4 columns.

names(ChickWeight)

[1] "weight" "Time"   "Chick"  "Diet"

head(ChickWeight)

  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1

An individual column in a table can be called by name, as so.

head(ChickWeight$weight, n = 10)

[1]  42  51  59  64  76  93 106 125 149 171

The function colnames is similar to names, in that it will report out the names of columns in a table. However, you can also use colnames to rename columns, in this way.

colnames(ChickWeight) <- c("weight", "days", "individual", "diet")
colnames(ChickWeight)

[1] "weight"     "days"       "individual" "diet"

Data in columns can be used in mathematical manipulations, and this often gives you a reason to create new columns in the table.

ChickWeight$hours <-  ChickWeight$days * 24
head(ChickWeight)

  weight days individual diet hours
1     42    0          1    1     0
2     51    2          1    1    48
3     59    4          1    1    96
4     64    6          1    1   144
5     76    8          1    1   192
6     93   10          1    1   240

Subsetting

R allows you to "subset" the data in a table. After the name of an object, you can use hard brackets and refer to the row and column numbers of a particular value. Below, we're referencing row 5, column 2.

ChickWeight[5,2]

[1] 8

There are often multiple ways to do something in R. For example, you can combine the column name reference and the bracket subsetting.

ChickWeight$days[2]

[1] 8

If you leave a row or column value empty within the brackets, R will give the entire row or column that you do reference. Here, we'll reference all of row 5.

ChickWeight[5,]

  weight days individual diet hours
5     76    8          1    1   192

You can also specify a vector of row or column numbers. For example, rows 3 and 5.

ChickWeight[c(3,5),]

  weight days individual diet hours
3     59    4          1    1    96
5     76    8          1    1   192

You can also exclude a particular row or column by using a minus sign in the subset reference. Perhaps we want the table without the hours column:

head(ChickWeight[,-5])

  weight days individual diet
1     42    0          1    1
2     51    2          1    1
3     59    4          1    1
4     64    6          1    1
5     70    8          1    1
6     93   10          1    1

Subsetting can be useful for data curation. Let's say we need to edit the weight in row 5, we can treat the subsetted table location as an object we're defining.

ChickWeight[5,1] <- 70
head(ChickWeight)

  weight days individual diet hours
1     42    0          1    1     0
2     51    2          1    1    48
3     59    4          1    1    96
4     64    6          1    1   144
5     70    8          1    1   192
6     93   10          1    1   240

The value has been changed from 76 to 70, and no other changes have been made to the table.

which

Sometimes you may not know the row and column positions of particular values in the table. This is often the case with large tables. The function which can be very helpful in subsetting. It takes a Boolean function as input and returns a vector of logical values as output.

As an example, let's say we want the chick weights from only the last day in the experiment. We can use the function max to find the maximum value of days and which to see which rows match that value, then use the output to subset the table.

head(ChickWeight[which(ChickWeight$days == max(ChickWeight$days)),])

   weight days individual diet hours
12    205   21          1    1   504
24    215   21          2    1   504
36    202   21          3    1   504
48    157   21          4    1   504
60    223   21          5    1   504
72    157   21          6    1   504

R functions

R comes equipped with a number of functions to manipulate data. In general there's anything you'd like to do with numbers (or text!) just Google search it and add "in R". You will probably find a way!

`mean`

Here's a simple example: a function to find the mean.

mean(b)

[1] 2

generating random numbers

That's not super interesting, so let's introduce a function to randomly generate numbers. R has several functions that do this, depending on whether you'd like numbers drawn from a normal distribution (rnorm), a uniform distribution (runif), etc.

b <- runif(n = 100, min = 0, max = 10)
mean(b)

[1] 5.764308

So, how did I know that the runif function takes those arguments? Every function in R has a help page that you can access by entering the name of the command preceded by ?. Like this ?runif

The help page will list the arguments that the function takes as input, and lots of other useful information.

As a short-cut, R functions will assume that arguments are entered in an expected order. So if I enter b <- runif(100, 0, 10) the function will assume I'm passing arguments to n (the sample size), min (the lower limit of the distribution) and max, in that order. This is not always something you want to do, since an important goal of coding is to make your code understandable by other people!

Randomly `sample` values

The random number functions are useful, but sometimes you have an existing group of values that you want to randomly draw from. The R function sample exists to do this. The help page tell us this function take arguments x (the vector to draw values from), size (the sample size – the number of values to draw), and replace (a logical value specifying whether the sampling should be done with replacement or not). So, below let's randomly sample 3 values from the days column of the ChickWeight dataset.

sample(x = ChickWeight$days, size = 3, replace = TRUE)

[1] 20  2  4

The replace = TRUE argument allows for the possibility of sampling the same value repeatedly. Set replace = FALSE to prevent that. Also, note that the values we get are not in any kind of order. They're random!

Sorting values

The function sort will rearrange values in a vector in order. The default is to put them in increasing order. If you'd like them in descending order, add the argument decreasing = TRUE.

some.days <- sample(x = ChickWeight$days, size = 5, replace = TRUE)
some.days

[1] 16  6 20 14 16

sort(some.days)

[1]  6 14 16 16 20

There is also a function, order, that tells you the order of values in a vector, but doesn't actually rearrangement them.

order(some.days)

[1] 2 4 1 5 3

Missing data

R has a reserved word explicitly for missing data: NA. You can use this value as a stand-in whenever necessary. Be aware that many functions aren't immediately equipped to deal with NA values.

x <- c(1,2,3,NA)
mean(x)

[1] NA

However, another function exists to detect NA values. is.na returns a vector of logical values specifying whether or not values are NA. You can use subsetting to then exclude those values. In R the ! is used as a Boolean NOT function. In other words !FALSE is TRUE. We can put these ideas together this way.

mean(x[!is.na(x)])

[1] 2

Many functions also have an optional argument na.rm that can tell them to remove NA values from their calculation.

mean(x, na.rm = TRUE)

[1] 2

Packages

R comes ready-made with lots of useful functions. This is what's known as "base R". However, packages (also called "libraries") add more functions and datasets that extend the utility of R. First, it's necessary to install a package from its source. A central repository called CRAN exists to make loading packages easy. If you're using the course Rstudio server, first be sure that your working directory is set to your home folder (~) by executing the following command.

setwd("~")

Then you can install a package as shown below. In this example RCarb is the name of the package being installed. (This package is used for dose rate modeling for carbonate-rich samples. Not something we actually need, but it illustrates the process!)

install.packages("RCarb")

In general, you should be able to install any CRAN package in the BI376 Rstudio server, as long as NeedsCompilation is listed as no.

Other packages are available from GitHub, a website used by many software developers. These packages can be installed as shown below.

devtools::install_github("aphanotus/borealis")

Most packages on GitHub will have instructions on how to install the package. They should typically follow this example, where aphanotus is the name of the GitHub user and borealis is the name of the package.

Once a package is installed, it needs to be loaded for you to have access to the functions and data it contains. This can be done using the Packages tab in the lower right panel in Rstudio, or directly in the command console as shown below.

library(borealis)

You can also invoke a package's function without loading the entire package. This was just done above when we called devtools::install_github. The function install_github is part of the package devtools.

Defining functions

R also gives you the ability to define your own functions. Here's a simple function to find the standard error of the mean for a vector of values.

se <- function(x) { sd(x)/sqrt(length(x)) }
sample1 <- rnorm(n = 10, mean = 0, sd = 1)
se(sample1)

[1] 0.3914143

Plots with base R

One of the powers of R is to easily make good graphical plots. There are also endless ways to customize graphics and make them very fancy.

The base R plot function is one of the simplest plotting functions in R. It requires you to pass x and y values. These should be vectors of the same length.

a <- rnorm(n = 10, mean = 5, sd = 2)
b <- runif(n = 10, min = 0, max = 10)
plot(a,b)

That's pretty simple. To add some details to the plot, the function will take additional arguments. xlim and ylim can set the bounds of the x and y axes. xlab, ylab and main will add titles to the axes and top of the plot.

plot(a,b, xlim=c(0,10), ylim=c(0,10), 
  main="example plot",
  xlab="independent variable", 
  ylab="dependent variable")

Notice that I've actually let the code break across multiple lines here. That often improves the readability of the code. Just be sure that you separate each of the arguments with a comma and end the function call with the closing parenthesis.

You can add linear trend lines and vertical or horizontal lines using the abline function. This is a separate function that you call after running plot, and it's then layered on top of the existing plot. You can change the color of the lines using the argument col. R recognizes many regular color names.

abline(lm(b~a))
abline(v=1, col = "darkblue")
abline(h=5, col = "darkred")

Box plots

So far we've looked at plots of two continuous variables. Let's look at a plot involving categorical data. We'll return to the ChickWeight dataset. In this case we'll want to use the boxplot function. This function takes the x and y variables in a formula notation, as in y ~ x. Think of the tilde as saying "as a function of". So below, we'll plot chick weight "as a function of" days of development.

boxplot(ChickWeight$weight ~ ChickWeight$days)

This style of plot is known as Tukey's box-whisker plot. The "box" encloses 50% of the data (the inner quartiles). The heavy black bar is the median. The whiskers extend to the full extent of the data range – unless there are outliers, which are plotted as open circles.

Pairs plots

In biology we are often interested in correlations. R has a great function pairs to quickly examine the correlations among all the variables in your dataset.

pairs(ChickWeight[,-5])

This lets us see that there's probably a correlation between weight and days, but probably not the other factors.

An informative modification of pairs is available in the borealis package. As shown below, it gives us trend lines and correlation coefficients. More details on this function can be found in its help page, ?borealis::pairs.

borealis::pairs(ChickWeight)

You can do a lot with base R graphics. And it is often faster and easier to make plots that way in the early stages of an analysis, when you're just trying to explore the data.

ggplot

ggplot2 is a package that provides much more control over the formatting of plots. It creates graphical objects that remain stored in R's environment. This has advantages for complex plots, and it lets you add "layers" to existing plots later.

Let's work with the ChickWeight dataset again, and look at the weights of chicks on the last day of the experiment, comparing them by the different diets. Here's a base R boxplot showing these results. Since the subsetting gets cumbersome, we'll just define a new object chick.final to hold the modified version of the dataset.

colnames(ChickWeight) <- c("weight", "days", "individual", "diet") # In case it wasn't done before
chick.final <- ChickWeight[which(ChickWeight$days == max(ChickWeight$days)),]
boxplot(chick.final$weight ~ chick.final$diet)

Here's the equivalent version using ggplot. Start by loading the package. Notice that we can define an object p that holds the plot information. Then call p to display the plot.

library(ggplot2)
p <- ggplot(chick.final, aes(x=diet, y=weight)) + geom_boxplot()
p

That's already a bit more aesthetically pleasing!

Notice that there are two parts to the statement above. The first part, ggplot, takes as input the data frame we want to consider, in this case it's chick.final. Next we define the x and y axes as diet and weight. If you try that part alone, you'll see that it displays an empty field. That's because ggplot needs to be told what sort of graphical elements to generate using those x and y data. That's given by the second element, geom_boxplot. The parentheses are necessary, even though in this example we don't pass anything into geom_boxplot. However you can give it additional information to change things like color and transparency too.

Adding layers: `geom_jitter`

One good practice in generating graphs is to show the actual data. Box plots give a good sense of the median and dispersion of the data, but it's not immediately clear what the sample size is. We can use ggplot to add another layer that shows individual points for the weight produced by each diet. Since it may be hard to interpret if they all line up above one another, we can plot them with a random "jitter" in the x-axis.

p + geom_jitter()

An important note about geom_jitter is that it randomly moves points in categorical dimensions. You can control the exent of that displacement with the position parameter.

p + geom_jitter(position = position_jitter(width = 0.2))

Of course, you can also just edit the original object definition. If the line defining your ggplot object is getting crowded, you can break it into multiple lines. Just be sure to copy it all into the console window to run it. Or from the script window in Rstudio, hit Run [command]-[enter] from one of the lines of the ggplot definition. But notice that each layer needs to be separated by a +. It's easy to forget about the + when your layers span multiple lines like this.

p <- ggplot(chick.final, aes(x=diet, y=weight)) + 
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2, height = 0))

Using functions in graphs

You can build in functions to the data being graphed too. So if we want to look at the log of weight…

p <- ggplot(chick.final, aes(x=diet, y=log10(weight))) + 
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2, height = 0))
p

Formatting axes

The real appeal of ggplot is that it allows you to customize almost everything about a plot. The title and axes can be defined using ggtitle, xlab and ylab. The y-axis label uses a trick to get the "10" in log₁₀ sub-scripted.

p <- ggplot(chick.final, aes(x=diet, y=log10(weight))) + 
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2, height = 0)) +
  ggtitle("Weights of chicks based on different diets") +
  xlab("Artifical diet number") +
  ylab(expression(paste(log[10]," chick weight (g)", sep=""))) 
p

Themes

Use of themes allows customization of font, size, orientation and other elements of the plot. Below I'll center the title, increase the sizes of all the fonts, and add a black border around the plot

p <- ggplot(chick.final, aes(x=diet, y=log10(weight))) + 
  theme(
    plot.title = element_text(size=16,face="bold",hjust=0.5), 
    axis.text.x = element_text(size=14,face="italic"), 
    axis.text.y = element_text(size=12,face="plain"), 
    axis.title.x = element_text(size=14,face="plain"), 
    axis.title.y = element_text(size=14,face="plain"),  
    panel.border = element_rect(fill = NA, color = "black")
  ) +  
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2, height = 0)) +
  ggtitle("Weights of chicks based on different diets") +
  xlab("Artifical diet number") +
  ylab(expression(paste(log[10]," chick weight (g)", sep=""))) 
p

Theme definitions can get big. One helpful thing to do is to define the theme as a separate object that you then call within the ggplot definition. If you have multiple plots in your analysis, you can define a theme once and have all the individual ggplot objects call the same theme.

large.text.theme <- theme(
  plot.title = element_text(size=16,face="bold",hjust=0.5), 
  axis.text.x = element_text(size=14,face="italic"), 
  axis.text.y = element_text(size=12,face="plain"), 
  axis.title.x = element_text(size=14,face="plain"), 
  axis.title.y = element_text(size=14,face="plain"),  
  panel.border = element_rect(fill = NA, color = "black")
) 

p <- ggplot(chick.final, aes(x=diet, y=log10(weight))) + 
  large.text.theme +  
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2, height = 0)) +
  ggtitle("Weights of chicks based on different diets") +
  xlab("Artifical diet number") +
  ylab(expression(paste(log[10]," chick weight (g)", sep="")))

Preset themes

Alternatively, ggplot comes with several predefined themes. Some people don't like ggplot's default gray background. If so, there is a theme called theme_bw which sticks to black and white elements.

p <- ggplot(chick.final, aes(x=diet, y=log10(weight))) + 
  theme_bw() +  
  geom_boxplot() +
  geom_jitter(position = position_jitter(width = 0.2, height = 0)) +
  ggtitle("Weights of chicks based on different diets") +
  xlab("Artifical diet number") +
  ylab(expression(paste(log[10]," chick weight (g)", sep=""))) 
p

Color and Opacity

Another thing you can customize is the size, color and opacity of the layers. This is especially useful if you want to layer actual data points on top of something, like a boxplot or a violin plot. The opacity is controlled through a parameter called alpha, where 1 is completely opaque and 0 is completely transparent. If you're displaying lots of dots, giving them some transparency helps show where they're overlapping.

p <- ggplot(chick.final, aes(x=diet, y=log10(weight))) + 
  theme_bw() +  
  geom_boxplot(colour="grey", size=1, alpha = 0.6) + 
  geom_jitter(position = position_jitter(width = 0.2, height = 0), 
              colour = "darkblue", size=3, alpha = 0.8) +
  ggtitle("Weights of chicks based on different diets") +
  xlab("Artifical diet number") +
  ylab(expression(paste(log[10]," chick weight (g)", sep=""))) 
p

Note that I'm using the commonwealth spellings of colour and grey in this code. ggplot (and a lot of R tools) were originally written in New Zealand, which uses British spellings. However, the authors conveniently included duplicate versions of all their function using British and American spellings. So color = "gray" will also work just as well.

Violin plots

If you have a lot of data points, and you want a reader to easily have a sense of their distribution, you can use what's called a violin plot. The median is indicated by adding the draw_quantiles = 0.5 argument.

p <- ggplot(chick.final, aes(x=diet, y=log10(weight))) + 
  theme_bw() +  
  geom_violin(colour="grey", size=1, trim = FALSE, alpha = 0.75,
              draw_quantiles = 0.5) +  
  geom_jitter(position = position_jitter(width = 0.2, height = 0), 
              colour = "darkblue", size=3, alpha = 0.8) +
  ggtitle("Weights of chicks based on different diets") +
  xlab("Artifical diet number") +
  ylab(expression(paste(log[10]," chick weight (g)", sep=""))) 
p

Scatter plots

Let's return to the original chick dataset to illustrate how to plot two continuous variables against one another. That is, let's make a scatter plot using ggplot.

p2 <- ggplot(ChickWeight, aes(x=days, y=log10(weight))) + 
  theme_bw() +  
  geom_point(size = 2, alpha = 0.5) +  
  xlab("day") +
  ylab(expression(paste(log[10]," chick weight (g)", sep=""))) 
p2

Trend lines can also be added using the stat_smooth feature.

p2 <- ggplot(ChickWeight, aes(x=days, y=log10(weight))) + 
  theme_bw() +  
  geom_point(size = 2, alpha = 0.5) +  
  stat_smooth(method=lm) +
  xlab("day") +
  ylab(expression(paste(log[10]," chick weight (g)", sep=""))) 
p2

As with most features of ggplot, you can set the size, color and transparency of the line here by adding size, color and alpha arguments inside the stat_smooth function call.

Notice that the line here also has a gray outline. This is a confidence interval the function calculates for the line. The confidence boundary can be useful, but if you want to get rid of it, just include se = FALSE as an argument in the stat_smooth function call.

ggplot is particularly useful when you want to start making more complex plots. For example, the dataset includes weight measurements from chicks raised on different diets. Interpreting the results may be easier if we can see growth over time for each of the diets separately. The facet_wrap feature allows us to make replicate plots with data from each level of a factor, like diet in this example.

p3 <- ggplot(ChickWeight, aes(x=days, y=log10(weight))) + 
  theme_bw() +  
  facet_wrap(.~diet) +
  geom_point(size = 2, alpha = 0.5) +  
  stat_smooth(method=lm, se = FALSE) +
  xlab("day") +
  ylab(expression(paste(log[10]," chick weight (g)", sep=""))) 
p3

Other ggplot `geoms`

Lots of other styles of plots are possible with ggplot. Their use all follows the same basic rules we've seen above.

The R Exercise

After you've reviewed the background information above, follow this link to instruction on the R Exercise to complete for class.

R Tutorial

Tutorial instructions

For novice R users

For experienced R users

Rstudio

The command line

Getting data into R

Exploring tabular data in R

Subsetting

which

R functions

`mean`

generating random numbers

Randomly `sample` values

Sorting values

Missing data

Packages

Defining functions

Plots with base R

Box plots

Pairs plots

ggplot

Adding layers: `geom_jitter`

Using functions in graphs

Formatting axes

Themes

Preset themes

Color and Opacity

Violin plots

Scatter plots

Facets

Other ggplot `geoms`

Other tutorials on `ggplot`

The R Exercise

Quick Links

R Tutorial

Tutorial instructions

For novice R users

For experienced R users

Rstudio

The command line

Getting data into R

Exploring tabular data in R

Subsetting

which

R functions

mean

generating random numbers

Randomly sample values

Sorting values

Missing data

Packages

Defining functions

Plots with base R

Box plots

Pairs plots

ggplot

Adding layers: geom_jitter

Using functions in graphs

Formatting axes

Themes

Preset themes

Color and Opacity

Violin plots

Scatter plots

Facets

Other ggplot geoms

Other tutorials on ggplot

The R Exercise

Quick Links

Read more

The Comp Bio Major

The Biology Major Checklist

IMG_9348

Why bother with Covid-19 vaccination?

`mean`

Randomly `sample` values

Adding layers: `geom_jitter`

Other ggplot `geoms`

Other tutorials on `ggplot`