Learn-r

Discussion resources

This note includes information discussed during Tuesday morning R sessions at the University of Arizona. More information about the R sessions can be found at https://jcoliver.github.io/learn-r/schedule. Posts are shown in reverse chronological order. If you have questions or comments, e-mail Jeff Oliver at jcoliver@arizona.edu.

2021-08-18

I have to impute data, but need to analyze mean-centered variables and I want to see model performance on each imputed data set. Can I do this with mice?

Indeed. The mice package is the go-to resource when imputing data is necessary. We can break the prompt into three questions and address each in turn:

  1. How can I see model fit for each imputed data set?
  2. How can I test for multicollinearity on each imputed data set?
  3. How can I run analyses on variables that require mean-centering after imputation?

The code block below provides a worked example addressing these three questions:

library(mice) # for multiple imputation
library(car)  # for variance inflation factor (measure of multicollinearity)

# Make up some imputed analyses for example (using default of 5 
# imputed data sets)
# Start by imputing missing values for nhanes data set (part of mice package)
imp <- mice(data = nhanes, print = FALSE, seed = 20210817)
# Run linear regression on each of the imputed data sets
fit <- with(imp, lm(bmi ~ chl + hyp))
# Use mice::getfit to get model for each imputed data set
f1 <- getfit(fit)

# 1. model fit for each imputed data set; here we use R-squared, but statistic 
# will vary based on model, e.g. you might want to extract AIC from models that 
# include such measures of fit

# Variable to hold R squared value for each model
r_squared <- numeric(length(f1))
# Cycle through each model and extract r-squared
for (i in 1:length(f1)) {
  # Pull out model from the ith data set
  one_model <- f1[[i]]
  # Run summary for the ith model
  model_summary <- summary(one_model)
  # Extract R squared
  r_sq <- model_summary$r.squared
  # Store in that r_squared vector
  r_squared[i] <- r_sq
}
# For fun, run summary on the R squared vector to get a sense of variation in 
# model fit
summary(r_squared)
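
# An aside beyond the original session code: mice also provides a
# pool.r.squared() function that pools R-squared across the fitted lm
# models in one call; handy as a cross-check on the loop above
pool.r.squared(fit)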

# 2. vif

# Similar approach as above where you extract each model (one for each imputed 
# data set) and run car::vif on the model

# Will hold variance inflation factor for each model; because we have a value 
# for each predictor, we'll use a list to store values
vif_mod <- vector(mode = "list", length = length(f1))
# Cycle through each model and calculate variance inflation factor
for (i in 1:length(f1)) {
  # Pull out model from the ith data set
  one_model <- f1[[i]]
  # Run car::vif on model
  model_vif <- car::vif(one_model)
  # Store in that vif_mod list
  vif_mod[[i]] <- model_vif
}
# Convert list to a data frame
vif_df <- data.frame(do.call(rbind, vif_mod))
# Print distribution of VIF for each predictor
summary(vif_df)

# 3. mean center

# Use passive imputation to create mean-centered values following imputation

# Add empty columns that will hold mean-centered values in imputed data sets
nhanes_mc <- nhanes # copying original data
nhanes_mc$bmi_mc <- NA
nhanes_mc$chl_mc <- NA
nhanes_mc$hyp_mc <- NA

# Run mice initially to create prediction and method matrices
ini <- mice(data = nhanes_mc, print = FALSE, seed = 20210817)
# We can ignore the warnings, as we aren't going to use the ini imputed 
# datasets for anything
pred <- ini$pred

# Do NOT use the mean-centered columns as predictors when imputing the 
# original variables
pred[c("bmi", "chl", "hyp"), c("bmi_mc", "chl_mc", "hyp_mc")] <- 0

# Use the method matrix to indicate how to mean center variables post-imputation
meth <- ini$meth
meth["bmi_mc"] <- "~I(bmi - mean(bmi))"
meth["chl_mc"] <- "~I(chl - mean(chl))"
meth["hyp_mc"] <- "~I(hyp - mean(hyp))"

# Run passive imputation with those pred and meth matrices
passive_imp <- mice(data = nhanes_mc, 
                    meth = meth, 
                    pred = pred, 
                    seed = 20210817,
                    print = FALSE)

# For the paranoid, check the first imputed data set to make sure the values 
# were actually mean-centered
# Extracts the first imputed data set
passive_one <- complete(data = passive_imp, 
                        action = 1)
# Should show Mean = 0.00 for bmi_mc, chl_mc, and hyp_mc
summary(passive_one)

# Run the analysis, using mean-centered variables in function call
pass_fit <- with(passive_imp, lm(bmi_mc ~ chl_mc + hyp_mc))
# Proceed as you see fit

2021-04-13

Code reviews: a very, very brief overview

What is a code review?

Checking code for readability and functionality. Just like when we have someone
read something we wrote.

Why do we do code reviews?

Improves sharability of code and identifies errors or inefficiencies. Like when
someone else reads your thesis and shows you where you meant "there" instead of
"their".

What should I look for when performing a code review?
  • Is the rationale (the "why") clear from the code and comments?
  • Are variable names informative?
  • Is the same code duplicated more than twice?
  • Could I reproduce these analyses/plots with this code and data alone?
Some guidelines for code (and other) reviews
  • Don't be (too) nitpicky.
  • Take breaks. Don't do this for more than an hour.
  • Use "I" statements
    • Wrong: You are writing cryptic code.
    • Right: It is hard for me to grasp what is going on in this code.
  • Recognize the differences between your preferences and best practices
How do I actually do a code review?
  • The best way is to use a code-sharing platform, such as GitHub, GitLab, or Bitbucket.
  • If you are not yet set up with one of those platforms, you can share files through e-mail or file sharing systems (e.g. Google Drive, Dropbox).
  • If you provide comments directly in code, try to use an identifier that is easy to search for. For example, I would mark all comments that I make with the string "TODO: JCO", e.g. # TODO: JCO rename variable.

2021-02-16

How do I merge two data frames?

So you have two data frames with differing numbers of rows, and you want to get
all the information into a single data frame. For a quick approach, the merge
function will do the trick:

# Make up data to add to iris
colors <- data.frame(Species = c("setosa", "virginica", "versicolor"),
                     Color = c("purple", "lavender", "blue"))

# Merge color data frame with iris data frame
iris_colors <- merge(x = iris, y = colors)

# Confirm it matched up colors with names appropriately with table
table(iris_colors$Species, iris_colors$Color)
#            blue lavender purple
# setosa        0        0     50
# versicolor   50        0      0
# virginica     0       50      0

# What if we want to merge data with different ID column name?
iris_names <- data.frame(Scientific = c("setosa", "virginica", "versicolor"),
                         Common = c("bristle-pointed iris", "Virginia iris", "blue flag"))

# Since there is no Species column in iris_names, we need to tell R which
# columns to use for matching rows (it guessed before)
iris_full <- merge(x = iris_colors, by.x = "Species",
                   y = iris_names, by.y = "Scientific")

# Reality check with table to see that it worked as expected
table(iris_full$Species, iris_full$Common)
#            blue flag bristle-pointed iris Virginia iris
# setosa             0                   50             0
# versicolor        50                    0             0
# virginica          0                    0            50

If you have more complex combinations, or are a fan of the tidyverse, see our
answer from 2020-08-11.

How do I transform data to "long" format? I used to use gather

The gather and spread functions have been replaced with pivot_longer and
pivot_wider, respectively. These functions work mostly the same way and the
syntax has been revised to be more explicit about what is happening. An example:

library(dplyr)    # For the pipe operator %>%
library(tidyr)    # For the pivot functions
library(ggplot2)  # For plotting data

# Transform iris data to long
iris_long <- iris %>%
  pivot_longer(cols = c(Petal.Length, Petal.Width, Sepal.Length, Sepal.Width), 
               names_to = "Flower_part", # Name of column that holds column names
               values_to = "Measurement") # Name of column that holds values

# Alternatively, can just name columns we *don't* want to pivot using the 
# negative sign
iris_long <- iris %>%
  pivot_longer(cols = -Species, 
               names_to = "Flower_part", # Name of column that holds column names
               values_to = "Measurement") # Name of column that holds values

# Create boxplots for each flower measurement
ggplot(data = iris_long, mapping = aes(x = Species, y = Measurement)) +
  geom_boxplot() +
  facet_wrap(~ Flower_part) # Make a separate plot for each flower part

# Update the facet_wrap command so each sub-plot has an independent y-axis scale
ggplot(data = iris_long, mapping = aes(x = Species, y = Measurement)) +
  geom_boxplot() +
  facet_wrap(~ Flower_part, scales = "free_y")

2020-11-24

How can I use a dynamic column name when a function uses dplyr?

So you have written a nice custom function that does some cool things with
parts of the dplyr package. Combining a call to group_by and summarize
worked great with hard-coded values:

library(dplyr)
group_stats <- function(data) {
    stats_data <- data %>%
        group_by(family) %>%
        summarize(means = mean(scores),
                  stdev = sd(scores))
    return(stats_data)
}

The function operates on a data frame, data, and returns some summary
statistics (mean and standard deviation) of the scores column, for each value
in the family column.

But what if you want to add functionality for other columns? One might intuit
(at least, I did):

library(dplyr)
group_stats <- function(data, group_col, value_col) {
    stats_data <- data %>%
        group_by(group_col) %>%
        summarize(means = mean(value_col),
                  stdev = sd(value_col))
    return(stats_data)
}

Unfortunately, this won't work. For in-depth discussion of why, you can read
about the dark art of Quasiquotation.
If you just want to make it work, this Stack Overflow
post provided an explanation that we use below. In short, you can make it work
by wrapping the column name variables in !!as.name() whenever you reference
them in tidyverse functions.

library(dplyr)
group_stats <- function(data, group_col, value_col) {
    stats_data <- data %>%
        group_by(!!as.name(group_col)) %>%
        summarize(means = mean(!!as.name(value_col)),
                  stdev = sd(!!as.name(value_col)))
    return(stats_data)
}

It might not be pretty, but it works.
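
As an aside, when the column names arrive as character strings (as they do in the example call further below), recent versions of dplyr also support the .data pronoun, which avoids the quasiquotation machinery entirely; a minimal sketch:

library(dplyr)
group_stats <- function(data, group_col, value_col) {
    stats_data <- data %>%
        group_by(.data[[group_col]]) %>%
        summarize(means = mean(.data[[value_col]]),
                  stdev = sd(.data[[value_col]]))
    return(stats_data)
}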

How do I return multiple objects from a function?

Make a list! The bento box of R, list objects can contain
just about anything. Using the function we created in the question above, we
can also add code to create a plot of values.

library(dplyr)
library(ggplot2)
group_stats <- function(data, group_col, value_col) {
    stats_data <- data %>%
        group_by(!!as.name(group_col)) %>%
        summarize(means = mean(!!as.name(value_col)),
                  stdev = sd(!!as.name(value_col)))
    stats_plot <- ggplot(data = stats_data, 
                         mapping = aes(x = !!as.name(group_col),
                                       y = means))
    return(stats_data)
}

But the function just returns the stats_data data frame (well, it's really a
tibble). How do we get it to return the plot object, too? (Aside: note how we
did not need to "quasiquote" the means column in the call to ggplot with
!!as.name(). This is because "means" is the actual name of the column, not
the name of a variable referring to the column). To return the plot, we create
a list object and return that instead:

library(dplyr)
library(ggplot2)
group_stats <- function(data, group_col, value_col) {
    stats_data <- data %>%
        group_by(!!as.name(group_col)) %>%
        summarize(means = mean(!!as.name(value_col)),
                  stdev = sd(!!as.name(value_col)))
    stats_plot <- ggplot(data = stats_data, 
                         mapping = aes(x = !!as.name(group_col),
                                       y = means))
    return_list <- list(stats_table = stats_data,
                        means_plot = stats_plot)
    return(return_list)
}

# Call the function, assigning output to variable group_summary
group_summary <- group_stats(data = df, 
                             group_col = "family", 
                             value_col = "scores")
# Access each of the elements using $ notation
# The stats table
group_summary$stats_table
# The plot
group_summary$means_plot

How do I avoid long if - else chains?

Let's say you have a continuous variable that you want to categorize into
"high", "medium", and "low" categories. We can do this with some base R
looping:

df$category <- NA # create the column before filling it in
for (i in 1:nrow(df)) {
    if (df$value[i] < 10) {
        df$category[i] <- "low"
    } else if (df$value[i] < 30) {
        df$category[i] <- "medium"
    } else {
        df$category[i] <- "high"
    }
}

Ooof. That seems onerous. Another base R approach is

df$category[df$value < 10] <- "low"
df$category[df$value >= 10 & df$value < 30] <- "medium"
df$category[df$value >= 30] <- "high"

Better, but there is a tidy way, too.

library(dplyr)
df <- df %>% 
    mutate(category = case_when(value < 10 ~ "low",
                                value < 30 ~ "medium",
                                value >= 30 ~ "high"))

2020-11-03

Can I calculate a slope for a sliding window of data?

Indeed. This "rolling" approach is implemented in a number of the functions of
the zoo package and
there is a worked example on Stack Overflow at https://stackoverflow.com/questions/41061140/how-to-calculate-the-average-slope-within-a-moving-window-in-r
(Credit: A. Rutherford). If you don't want to deal with the overhead of the zoo
package, the runner
package provides an alternative approach.
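
If you want to see the mechanics, here is a minimal sketch with zoo::rollapply() (the data frame df and the window width are made up for illustration):

library(zoo)  # for rollapply()

# Toy data: a response sampled over time
df <- data.frame(x = 1:50, y = cumsum(rnorm(50)))

# Regress y on x within each sliding 10-row window; rolling over the row
# indices keeps the bookkeeping simple
roll_slopes <- rollapply(seq_len(nrow(df)), width = 10,
                         FUN = function(rows) {
                           coef(lm(y ~ x, data = df[rows, ]))[["x"]]
                         })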

2020-10-06

How do I get started with mixed-effects models?

Whoo-boy. Right now it seems you can learn mixed-effects models by either (1)
fumbling around on the internet until you find lme code that works or (2)
taking a statistics course requiring linear algebra. For a middle ground (sort
of), take a look at the lesson at https://m-clark.github.io/mixed-models-with-R/introduction.html.
It gets a little hairy at times (just hum through parts like
b_int_student ~ N(b_intercept, τ)), but it provides a very nice one-to-one
comparison of standard linear regression to mixed-effects regression.
There is also a very nice introduction at https://www.jaredknowles.com/journal/2013/11/25/getting-started-with-mixed-effect-models-in-r
(Credit: A. Rutherford).
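
To get a taste of the syntax, here is a minimal sketch using lme4 (one popular package for these models) and its built-in sleepstudy data:

library(lme4)

# Reaction time as a function of days of sleep deprivation, with a
# random intercept and random slope for each subject
mixed_model <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(mixed_model)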

How do I do logistic regression in R?

For data that aren't too crazy, you can use the glm() function, setting the
family argument to "binomial":

logit_model <- glm(response ~ predictor, family = "binomial", data = mydata)

See the examples at https://www.statmethods.net/advstats/glm.html. If you want more of the math underlying the approach, check out the Wikipedia page.
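
As a minimal worked sketch with built-in data (predicting transmission type from car weight in mtcars):

# am is 0 (automatic) or 1 (manual), so it can serve as a binary response
logit_model <- glm(am ~ wt, family = "binomial", data = mtcars)
summary(logit_model)
# type = "response" gives predicted probabilities rather than log-odds
predict(logit_model, type = "response")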

How do I control the use of horizontal space in knitr output?

This seems a perennial challenge. One thing to check would be the kableExtra package, which also includes a workaround for tables that are too wide to fit on a screen (use a scroll box). (Credit: A. Rutherford)
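
For the scroll box route, a minimal sketch (the table and dimensions are placeholders):

library(knitr)       # for kable()
library(kableExtra)  # for kable_styling() and scroll_box()

kable(mtcars) %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "300px")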

Another option is described at Stack Overflow, which essentially suggests adding some CSS styling in the R Markdown file, just below the YAML header:

<style type="text/css">
.main-container {
  max-width: 1800px;
  margin-left: auto;
  margin-right: auto;
}
</style>

2020-09-22

I want to select a subset of columns from my dataset, but it's complicated.

The data set dat includes over 3,000 columns, and the goal is to create a subset of those columns: specifically, columns that start with "QU" followed by a number between 1 and 40, and that end with either "A1" or "A2". The challenges: some columns start with "QU" but are followed by numbers greater than 40, and some column names end with "A3", "A4", etc.; we do not want any of those. The trick is that there are multiple criteria to match, based on both the beginning and the end of the column names. Thankfully, the tidyverse includes helper functions, specifically starts_with() and ends_with(), designed for just this purpose (Credit: A. Rutherford):

dat_subset <- dat %>% 
    select(starts_with(paste0("QU", 1:40)) &
           ends_with(c("A1", "A2")))
# One caveat: a prefix like "QU4" will also match names beginning "QU41",
# "QU42", etc., so double-check the result if such columns exist

Side note: We tried a variant on this using num_range(), but just could not get it to work.

Are there any accessible style guides for writing R?

Yes. The tidyverse style guide at https://style.tidyverse.org is a good place to start; Google's R style guide is a variation built on top of it.

2020-09-15

Today's session was a introduction to some common statistical tests in R (t-tests, analysis of variance, and linear regression). The lesson can be found at https://jcoliver.github.io/learn-r/002-intro-stats.html.

In the post-hoc test results of an ANOVA, many p-values are zero. What can I do about this?

R is rounding the p-values, and in some cases this rounding results in a value of zero. When p-values are so small, it probably is not appropriate to compare them to one another. There is a rich discussion on this topic on Stack Overflow (credit: K. Busby). See the post from 2020-09-01 for more information on how to run post-hoc tests.

How can we find the effect size from an ANOVA?

A common measure is eta-squared. The effectsize package provides an eta_squared() function for fitted aov models; lsr::etaSquared() is another option.

2020-09-08

I need to sum variables from 15 different columns - is there a (tidyverse) way to do this without naming all the columns?

Ideally, the columns of interest would have a unique starting or ending to their name. For example, all the columns of interest might start with "temp_" (and no other columns would start with that string of characters). One could then use the dplyr functions select() and starts_with() to pull out only those columns of interest. So it might look something like this:

summed_data <- my_data %>%
  mutate(sum_temp = rowSums(select(., starts_with("temp_"))))

Another option is to use c_across() inside a rowwise() pipeline, which assumes the columns appear consecutively in your data (credit: K. Gonzalez):

summed_data <- my_data %>%
  rowwise() %>%
  mutate(sum_temp = sum(c_across(temp_1:temp_15))) %>%
  ungroup()

How do I visualize distributions for two categories?

The violin plot option of ggplot2 is a great place to start. It is kind of like a boxplot and a histogram combined. Some examples are available at http://www.sthda.com/english/wiki/ggplot2-violin-plot-quick-start-guide-r-software-and-data-visualization. (credit: K. Busby)
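
As a minimal sketch with the iris data (overlaying a narrow boxplot inside the violin is a common touch):

library(ggplot2)

ggplot(data = iris, mapping = aes(x = Species, y = Sepal.Length)) +
  geom_violin() +
  geom_boxplot(width = 0.1)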

Can I combine two (or more) plots in ggplot?

Indeed. Packages such as patchwork and cowplot are built for exactly this.
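
As a minimal sketch using patchwork (one of several options):

library(ggplot2)
library(patchwork)

p1 <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
p2 <- ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()

# patchwork overloads +, so adding two ggplot objects places them side by side
p1 + p2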

Can I transfer maintainership for an R package on CRAN?

Yes. The first thing to do is for the current maintainer to e-mail CRAN (from the e-mail address listed in the DESCRIPTION file) with the new maintainer's name and e-mail address. The folks at CRAN will pull some levers on their end to get the process started. A nice Stack Overflow thread goes over the process at https://stackoverflow.com/questions/39223320/transferring-maintainership-of-an-r-package-on-cran. And if you want to find out more about submitting packages to CRAN you can read the official documentation or the human-readable overview of the process at https://kbroman.org/pkg_primer/pages/cran.html. (credit: A. Rutherford)

2020-09-01

I have significant results in a one-way ANOVA, but how can I tell which groups are different from one another?

Time for some post hoc analyses! Check out the example implementation of the Tukey test at https://rpubs.com/aaronsc32/post-hoc-analysis-tukey.
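
Base R includes TukeyHSD() for exactly this; a minimal sketch with the built-in iris data:

# Fit a one-way ANOVA, then run Tukey's honest significant differences
aov_model <- aov(Sepal.Length ~ Species, data = iris)
TukeyHSD(aov_model)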

I have five predictor variables and want to find the best model; is there any way to do this automatically, rather than manually creating all 2^5 = 32 models?

The package MuMIn has a function dredge() which is designed for automated model selection https://rdrr.io/cran/MuMIn/man/dredge.html.
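
A minimal sketch (note that dredge() requires na.action = "na.fail" on the global model so every sub-model is fit to the same rows):

library(MuMIn)

# Global model containing all candidate predictors
global_model <- lm(mpg ~ wt + hp + disp, data = mtcars,
                   na.action = "na.fail")
# Fit and rank every subset of the global model
model_set <- dredge(global_model)
head(model_set)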

2020-08-25

How do I use the imputed data from mice()?

The mice package is useful for imputing data when you have missing values, but using it isn't always straightforward. There is a nice series of vignettes on using mice, starting at https://www.gerkovink.com/miceVignettes/Ad_hoc_and_mice/Ad_hoc_methods.html. The imputation process is generally performed multiple times and for information on how to use the results of multiple imputation, the documentation for the pool() function in mice provides a particularly succinct workflow (sketched in code after the list):

  1. Impute the missing data by the mice function, resulting in a multiple imputed data set (class mids);
  2. Fit the model of interest (scientific model) on each imputed data set by the with() function, resulting in an object of class mira;
  3. Pool the estimates from each model into a single set of estimates and standard errors, resulting in an object of class mipo;
  4. Optionally, compare pooled estimates from different scientific models by the D1() or D3() functions.
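
A minimal sketch of that workflow, using the nhanes data that ships with mice:

library(mice)

imp <- mice(data = nhanes, print = FALSE, seed = 123) # 1. impute (class mids)
fit <- with(imp, lm(bmi ~ chl))                       # 2. fit models (mira)
pooled <- pool(fit)                                   # 3. pool estimates (mipo)
summary(pooled)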

How do I compile packages from source if I am running macOS?

You will most likely need the Xcode program for Mac. You can download it through the App Store on your machine or from https://developer.apple.com/xcode/resources/.

Miscellaneous debris

This is not an R-related point, but if you are thinking about choosing a graduate advisor, consider which aspects of a mentor you find most important. Recent work suggests that the supportiveness of the mentor has a large impact on student satisfaction (https://www.sciencemag.org/careers/2019/04/what-matters-phd-adviser-here-s-what-research-says).

2020-08-18

Today's lesson was about using the knitr package and RMarkdown to create a single document that had styled text, code, and plots. The full lesson is available at https://jcoliver.github.io/learn-r/005-intro-knitr.html. Some great questions to come out of today's session:

How do you add comments to markdown?

Since the pound sign (#) indicates a header in markdown, you can use HTML-style commenting in markdown blocks:

<!-- this is a comment -->

What's best practice for using packages in markdown?

To make it easy for other people to re-use your code, it is probably best to load all the libraries your RMarkdown document depends on at the very beginning of the document (after the header, of course) in either the pre-existing "setup" block or another block named something like "load-dependencies". For example:

---
title: My Fancy RMarkdown document
author: Jeff Oliver
date: 2020-08-18
output: html_document
---

```{r load-dependencies}
library(ggplot2)
library(dplyr)
```

How can I hide the code from the output document?

You can keep code from appearing in the output document by setting echo = FALSE in the header of the code block. For example, using the block where we loaded dependencies, above:

```{r load-dependencies, echo = FALSE}
library(ggplot2)
library(dplyr)
```

Are there easy ways to format model output tables?

  • The knitr package includes the function kable(), which is designed to format tabular data for the appropriate output.
  • The kableExtra package takes table styling even further.
  • The broom package is designed for tidying up output of statistical analyses
  • And if you're getting real fancy with mixed models, check out broom.mixed.

2020-08-11

I have large data sets that need some serious data wrangling - how can I do this in R?

More and more often, we get to deal with data that were originally collected for another use (or collected with little consideration of future use). Rarely are those data in a shape appropriate for the analyses we would like to perform. Before starting work on the unwrangled data, I often start by creating a mock-up of what I want the data to look like. Just a little sketch on paper or a sample Excel file. By starting at the end, I have a clear target to work toward. Two tidyverse packages are particularly useful for data wrangling: dplyr (for filtering, selecting, and summarizing data) and tidyr (for reshaping data between wide and long formats).

How do I write for loops?

At the risk of sounding like a broken record, look at the Software Carpentry lesson on loops at http://swcarpentry.github.io/r-novice-gapminder/14-tidyr/index.html. When writing a for loop, make heavy use of reporting functions so you can be sure the loop is doing what you think it is supposed to be doing. For example, if I wanted to write a loop that summed the numbers 1 through 10, I could write

my_sum <- 0
for (i in 10) {
  my_sum <- my_sum + i
}
my_sum
# 10

That's not right at all, so I use the cat function in the loop to print the current value assigned to i:

my_sum <- 0
for (i in 10) {
  cat("i = ", i, "\n")
  my_sum <- my_sum + i
}

When the loop executes, it will print "i = 10" and that is it. It should print "i = 1", "i = 2", "i = 3" and so on. This reality check tells me that something is not going right with the loop because i is not taking the right values. Looking closer at my syntax, I see that I did not declare the for statement correctly. Instead of telling i to take values 1 through 10, I told it to take the value of 10. That's it. So I need to change the first line of my loop from

for (i in 10) {

to

for (i in 1:10) {

And this will now sum the integers from 1 through 10 (and print out the "i = " statements as expected). Once the loop works as expected, I recommend commenting out (or deleting) the cat function to cut down on the information printed to the screen.

Just remember, when writing loops, reality checks are your friends.

2020-08-04

R can be slow for large data sets or heavy computation. How can I speed it up?

How can I combine more than two data frames?

The join functions in dplyr would allow one to do this:

final_data <- data1 %>% 
        full_join(data2, by = "id") %>% 
        full_join(data3, by = "id")

To decide which of the four joins to use (left, right, inner, or full), there is a great explanation with visual examples at https://r4ds.had.co.nz/relational-data.html#understanding-joins. The RStudio Cheat sheet for data wrangling also provides a succinct overview of how to use the join functions:
https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf.

Thinking about a data science career? Any advice?
