BI376
For the week of 14 September 2020
Rmd version of this script
The final project for BI376 this semester will employ morphometric analysis in R, a free platform for statistical computing. While you've probably encountered R in other courses, this tutorial is meant as a refresher and as a reference. Later lab protocols will provide information on how you can explore more advanced features.
While I hope you will look through all of this document, some of the material may be familiar and some may be new. At the end, there is a brief exercise asking you to send in a demonstration of your R skills.
Everyone should use the Rstudio server dedicated to our course. https://bi376.colby.edu/ Access this link on campus or from within Colby's VPN. You must log in with your regular Colby username and password.
If you're new to R, I recommend that you step through each of the example code lines below. As you read this document, copy the R commands into the R script panel in Rstudio and execute them as you go.
If you run into trouble, please contact Dave. The goal here is for you to increase your comfort level working with R. Don't give up in frustration! Ask for help.
Several other online resources provide additional guides to beginning R. Manny Gimond has a useful two-minute video introduction to Rstudio http://mgimond.github.io/ES218/Videos/RStudio_environ1.webm. Swirl is an interactive Rstudio tutorial, available at http://swirlstats.com/students.html.
Everyone should at least scan through this document. Web links will lead to more detailed background information for some terms, which you are welcome to explore (although it is not required). However, if you're quite comfortable with the concepts described here, feel free to skip to the exercise.
Rstudio is a shell for R. You can download and install Rstudio on your local machine. However, for work in BI376 it will be convenient to use the server set-up for our course. https://bi376.colby.edu/ This link runs Rstudio from a dedicated virtual machine (VM) that includes the packages you will need to run this semester. Simply log in with your Colby username and password.
The Rstudio window is divided into 4 panels that each provide different tools.
Rstudio allows you to plan an elaborate command in the script panel (upper left), and when you're ready to run it, keep the cursor on your command and mouse-click the Run
button at the top of the panel or hit command
-enter
on the keyboard. The command will be run in the console panel. If it doesn't do exactly what you want, you can tweak the text in the script panel. When it does what you want, you can save the script file. This provides a saved record of your analysis. If your editing an R markdown file, rather than a plain R script, you can also include descriptive text that explains the background and interpretation of your analysis. (More on R markdown later.)
R has a command line interface, indicated by the >
character in the console panel. In Rstudio this is the bottom right panel. Any expression you enter here will be evaluated. Try playing around with a few simple commands.
R will evaluate simple math statements, just like a calculator.
Note that there's an index in front of each answer, that [1]
. Also notice that the log
function uses the natural log. Use log10
for base-10 logs.
It will quickly become useful to define objects to save values in R. Do this using <-
.
Notice that you don't see the value produced by log10(1000)
in this case. Instead of going to the output, the value has been stored in the object a
. If you enter an object's name, its value will be displayed.
It's also possible to assign values to an object using =
. However, I recommend using <-
instead. Why? Two equal signs (==
) is a test of Boolean logic, not the definition of an object (=
). Using the arrow for definitions (<-
) avoids this potential confusion.
The values produced by these Boolean functions are logical values, not numeric values. R has different types of data. numeric
and logical
are two different data types. This example also illustrates what are called reserved words in R. These are words that have special meaning. You cannot use TRUE
as an object name, because that word has a special meaning.
Objects in R can be much more complex than a single value. The c
function combines values into a vector. Similarly, you may want your data in a matrix.
New users often find getting data into R to be one of the hardest parts. Several methods exist. First it's possible to simply define a vector that includes data.
Rather than entering many values into the R command line one by one, you can also paste them from the clipboard. Select the values from a row or column in Excel, and copy them. Use the scan
function. Hit enter, and R will prompt you to enter values with 1:
. Paste the values from the clipboard, and hit enter twice.
Sharing large tables of data is often done using an open-source file format such as CSV or TSV. These are simply text files in which columns have "comma-separated values" or "table-separated values". R can read these files directly.
You need to be mindful of the path you use to call the file and your current working directory. Thankfully, Rstudio can help in managing this. First, if you've never changed the working directory, and you've always done everything in the same folder from the Files panel, you can probably just ignore paths entirely. Alternatively, you can use the Files panel to navigate through folders to find the data file you need. Once you see it, click the gear icon that says "More" and select Set As Working Directory
. You may notice what this actually does is to run a command setwd
in the console. You can copy this code to your script, use it, and modify it later.
Alternatively, R has a very useful function that helps you find files using your operating system's file browser.
When you run this command, a file browser window will open, allowing you to search for your data file.
The examples above used read.csv
for CSV format files. For TSV files, you'll use read.delim
.
Excel file data can be imported into R too. But there are some important considerations. Talk to Dr. A. if you're considering the need to do this. Or simply save your Excel file in a CSV or TSV format.
If you've used Excel a lot, learning to use R can be disorienting because you don't constantly get to see the table of your data. This is actually helpful, since R can easily handle datasets large enough to crash Excel. Thankfully there are several functions that let you see and manipulate a data table in R.
R comes with several datasets built in, and we'll use one of those for this example. It contains data on the weights of several chicks raised on different diets over developmental time. You can load a dataset with the data
function.
Once you've loaded a data table into R, you can use several functions to find out its dimensions (dim
), that is how many rows and columns it has, look at the column names (names
), and take a look at the top (head
) or bottom (tail
) of the table.
This tells us the table has 578 rows and 4 columns.
An individual column in a table can be called by name, as so.
The function colnames
is similar to names
, in that it will report out the names of columns in a table. However, you can also use colnames
to rename columns, in this way.
Data in columns can be used in mathematical manipulations, and this often gives you a reason to create new columns in the table.
R allows you to "subset" the data in a table. After the name of an object, you can use hard brackets and refer to the row and column numbers of a particular value. Below, we're referencing row 5, column 2.
There are often multiple ways to do something in R. For example, you can combine the column name reference and the bracket subsetting.
If you leave a row or column value empty within the brackets, R will give the entire row or column that you do reference. Here, we'll reference all of row 5.
You can also specify a vector of row or column numbers. For example, rows 3 and 5.
You can also exclude a particular row or column by using a minus sign in the subset reference. Perhaps we want the table without the hours
column:
Subsetting can be useful for data curation. Let's say we need to edit the weight in row 5, we can treat the subsetted table location as an object we're defining.
The value has been changed from 76 to 70, and no other changes have been made to the table.
Sometimes you may not know the row and column positions of particular values in the table. This is often the case with large tables. The function which
can be very helpful in subsetting. It takes a Boolean function as input and returns a vector of logical values as output.
As an example, let's say we want the chick weights from only the last day in the experiment. We can use the function max
to find the maximum value of days
and which
to see which rows match that value, then use the output to subset the table.
R comes equipped with a number of functions to manipulate data. In general there's anything you'd like to do with numbers (or text!) just Google search it and add "in R". You will probably find a way!
mean
Here's a simple example: a function to find the mean.
That's not super interesting, so let's introduce a function to randomly generate numbers. R has several functions that do this, depending on whether you'd like numbers drawn from a normal distribution (rnorm
), a uniform distribution (runif
), etc.
So, how did I know that the runif
function takes those arguments? Every function in R has a help page that you can access by entering the name of the command preceded by ?
. Like this ?runif
The help page will list the arguments that the function takes as input, and lots of other useful information.
As a short-cut, R functions will assume that arguments are entered in an expected order. So if I enter b <- runif(100, 0, 10)
the function will assume I'm passing arguments to n
(the sample size), min
(the lower limit of the distribution) and max
, in that order. This is not always something you want to do, since an important goal of coding is to make your code understandable by other people!
sample
valuesThe random number functions are useful, but sometimes you have an existing group of values that you want to randomly draw from. The R function sample
exists to do this. The help page tell us this function take arguments x
(the vector to draw values from), size
(the sample size – the number of values to draw), and replace
(a logical value specifying whether the sampling should be done with replacement or not). So, below let's randomly sample 3 values from the days
column of the ChickWeight
dataset.
The replace = TRUE
argument allows for the possibility of sampling the same value repeatedly. Set replace = FALSE
to prevent that. Also, note that the values we get are not in any kind of order. They're random!
The function sort
will rearrange values in a vector in order. The default is to put them in increasing order. If you'd like them in descending order, add the argument decreasing = TRUE
.
There is also a function, order
, that tells you the order of values in a vector, but doesn't actually rearrangement them.
R has a reserved word explicitly for missing data: NA
. You can use this value as a stand-in whenever necessary. Be aware that many functions aren't immediately equipped to deal with NA
values.
However, another function exists to detect NA
values. is.na
returns a vector of logical values specifying whether or not values are NA
. You can use subsetting to then exclude those values. In R the !
is used as a Boolean NOT function. In other words !FALSE
is TRUE
. We can put these ideas together this way.
Many functions also have an optional argument na.rm
that can tell them to remove NA
values from their calculation.
R comes ready-made with lots of useful functions. This is what's known as "base R". However, packages (also called "libraries") add more functions and datasets that extend the utility of R. First, it's necessary to install a package from its source. A central repository called CRAN exists to make loading packages easy. If you're using the course Rstudio server, first be sure that your working directory is set to your home folder (~
) by executing the following command.
Then you can install a package as shown below. In this example RCarb
is the name of the package being installed. (This package is used for dose rate modeling for carbonate-rich samples. Not something we actually need, but it illustrates the process!)
In general, you should be able to install any CRAN package in the BI376 Rstudio server, as long as NeedsCompilation
is listed as no
.
Other packages are available from GitHub, a website used by many software developers. These packages can be installed as shown below.
Most packages on GitHub will have instructions on how to install the package. They should typically follow this example, where aphanotus
is the name of the GitHub user and borealis
is the name of the package.
Once a package is installed, it needs to be loaded for you to have access to the functions and data it contains. This can be done using the Packages tab in the lower right panel in Rstudio, or directly in the command console as shown below.
You can also invoke a package's function without loading the entire package. This was just done above when we called devtools::install_github
. The function install_github
is part of the package devtools
.
R also gives you the ability to define your own functions. Here's a simple function to find the standard error of the mean for a vector of values.
One of the powers of R is to easily make good graphical plots. There are also endless ways to customize graphics and make them very fancy.
The base R plot
function is one of the simplest plotting functions in R. It requires you to pass x
and y
values. These should be vectors of the same length.
That's pretty simple. To add some details to the plot, the function will take additional arguments. xlim
and ylim
can set the bounds of the x and y axes. xlab
, ylab
and main
will add titles to the axes and top of the plot.
Notice that I've actually let the code break across multiple lines here. That often improves the readability of the code. Just be sure that you separate each of the arguments with a comma and end the function call with the closing parenthesis.
You can add linear trend lines and vertical or horizontal lines using the abline
function. This is a separate function that you call after running plot
, and it's then layered on top of the existing plot. You can change the color of the lines using the argument col
. R recognizes many regular color names.
So far we've looked at plots of two continuous variables. Let's look at a plot involving categorical data. We'll return to the ChickWeight
dataset. In this case we'll want to use the boxplot
function. This function takes the x and y variables in a formula notation, as in y ~ x
. Think of the tilde as saying "as a function of". So below, we'll plot chick weight "as a function of" days of development.
This style of plot is known as Tukey's box-whisker plot. The "box" encloses 50% of the data (the inner quartiles). The heavy black bar is the median. The whiskers extend to the full extent of the data range – unless there are outliers, which are plotted as open circles.
In biology we are often interested in correlations. R has a great function pairs
to quickly examine the correlations among all the variables in your dataset.
This lets us see that there's probably a correlation between weight
and days
, but probably not the other factors.
An informative modification of pairs
is available in the borealis
package. As shown below, it gives us trend lines and correlation coefficients. More details on this function can be found in its help page, ?borealis::pairs
.
You can do a lot with base R graphics. And it is often faster and easier to make plots that way in the early stages of an analysis, when you're just trying to explore the data.
ggplot2
is a package that provides much more control over the formatting of plots. It creates graphical objects that remain stored in R's environment. This has advantages for complex plots, and it lets you add "layers" to existing plots later.
Let's work with the ChickWeight
dataset again, and look at the weights of chicks on the last day of the experiment, comparing them by the different diets. Here's a base R boxplot showing these results. Since the subsetting gets cumbersome, we'll just define a new object chick.final
to hold the modified version of the dataset.
Here's the equivalent version using ggplot
. Start by loading the package. Notice that we can define an object p
that holds the plot information. Then call p
to display the plot.
That's already a bit more aesthetically pleasing!
Notice that there are two parts to the statement above. The first part, ggplot
, takes as input the data frame we want to consider, in this case it's chick.final
. Next we define the x and y axes as diet
and weight
. If you try that part alone, you'll see that it displays an empty field. That's because ggplot
needs to be told what sort of graphical elements to generate using those x and y data. That's given by the second element, geom_boxplot
. The parentheses are necessary, even though in this example we don't pass anything into geom_boxplot
. However you can give it additional information to change things like color and transparency too.
geom_jitter
One good practice in generating graphs is to show the actual data. Box plots give a good sense of the median and dispersion of the data, but it's not immediately clear what the sample size is. We can use ggplot
to add another layer that shows individual points for the weight produced by each diet. Since it may be hard to interpret if they all line up above one another, we can plot them with a random "jitter" in the x-axis.
An important note about geom_jitter
is that it randomly moves points in categorical dimensions. You can control the exent of that displacement with the position
parameter.
Of course, you can also just edit the original object definition. If the line defining your ggplot
object is getting crowded, you can break it into multiple lines. Just be sure to copy it all into the console window to run it. Or from the script window in Rstudio, hit Run [command]-[enter]
from one of the lines of the ggplot
definition. But notice that each layer needs to be separated by a +
. It's easy to forget about the +
when your layers span multiple lines like this.
You can build in functions to the data being graphed too. So if we want to look at the log of weight…
The real appeal of ggplot
is that it allows you to customize almost everything about a plot. The title and axes can be defined using ggtitle
, xlab
and ylab
. The y-axis label uses a trick to get the "10" in log10 sub-scripted.
Use of themes allows customization of font, size, orientation and other elements of the plot. Below I'll center the title, increase the sizes of all the fonts, and add a black border around the plot
Theme definitions can get big. One helpful thing to do is to define the theme as a separate object that you then call within the ggplot
definition. If you have multiple plots in your analysis, you can define a theme once and have all the individual ggplot
objects call the same theme.
Alternatively, ggplot
comes with several predefined themes. Some people don't like ggplot
's default gray background. If so, there is a theme called theme_bw
which sticks to black and white elements.
Another thing you can customize is the size, color and opacity of the layers. This is especially useful if you want to layer actual data points on top of something, like a boxplot or a violin plot. The opacity is controlled through a parameter called alpha
, where 1 is completely opaque and 0 is completely transparent. If you're displaying lots of dots, giving them some transparency helps show where they're overlapping.
Note that I'm using the commonwealth spellings of colour and grey in this code. ggplot
(and a lot of R tools) were originally written in New Zealand, which uses British spellings. However, the authors conveniently included duplicate versions of all their function using British and American spellings. So color = "gray"
will also work just as well.
If you have a lot of data points, and you want a reader to easily have a sense of their distribution, you can use what's called a violin plot. The median is indicated by adding the draw_quantiles = 0.5
argument.
Let's return to the original chick dataset to illustrate how to plot two continuous variables against one another. That is, let's make a scatter plot using ggplot
.
Trend lines can also be added using the stat_smooth
feature.
As with most features of ggplot, you can set the size, color and transparency of the line here by adding size
, color
and alpha
arguments inside the stat_smooth
function call.
Notice that the line here also has a gray outline. This is a confidence interval the function calculates for the line. The confidence boundary can be useful, but if you want to get rid of it, just include se = FALSE
as an argument in the stat_smooth
function call.
ggplot
is particularly useful when you want to start making more complex plots. For example, the dataset includes weight measurements from chicks raised on different diets. Interpreting the results may be easier if we can see growth over time for each of the diets separately. The facet_wrap
feature allows us to make replicate plots with data from each level of a factor, like diet
in this example.
geoms
Lots of other styles of plots are possible with ggplot. Their use all follows the same basic rules we've seen above.
ggplot
The list below provides links to several other excellent online resources for using ggplot
. Happy plotting!
ggplot
elementsggplot
ggplot
, with the example of deconstructing a clear and informative plot from The Economistggplot
's capabilities. The complete text is available online through the CBB library consortium.An excellent example of data visualization from The Economist
After you've reviewed the background information above, follow this link to instruction on the R Exercise to complete for class.