This note includes information discussed during Tuesday morning R sessions at the University of Arizona. More information about the R sessions can be found at https://jcoliver.github.io/learn-r/schedule. Posts are shown in reverse chronological order. If you have questions or comments, e-mail Jeff Oliver at jcoliver@arizona.edu.
Indeed. The mice package is the go-to resource when imputing data is necessary. We can break the prompt into three questions and address each in turn:
The code block below provides a worked example addressing these three questions:
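Something along these lines (a minimal sketch; the data frame dat and the model y ~ x are stand-ins for your own data and model):

```r
library(mice)

# 1. Impute the missing values (here, five imputed data sets with default methods)
imputed <- mice(dat, m = 5, printFlag = FALSE)

# 2. Fit the model of interest to each imputed data set
fits <- with(imputed, lm(y ~ x))

# 3. Pool the results across the imputations
pooled <- pool(fits)
summary(pooled)
```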
Checking code for readability and functionality. Just like when we have someone
read something we wrote.
Improves shareability of code and identifies errors or inefficiencies. Like when
someone else reads your thesis and shows you where you meant "there" instead of
"their".
# TODO: JCO rename variable
So you have two data frames with differing numbers of rows, and you want to get
all the information into a single data frame. For a quick approach, the merge
function will do the trick:
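A quick sketch (the data frames and the shared "id" column are made up for illustration):

```r
df1 <- data.frame(id = 1:5, height = c(10, 12, 9, 14, 11))
df2 <- data.frame(id = c(1, 2, 3, 5), mass = c(2.1, 2.4, 1.9, 2.6))

# By default merge() keeps only rows with ids found in both data frames;
# all = TRUE keeps every row from both and fills the gaps with NA
combined <- merge(df1, df2, by = "id", all = TRUE)
```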
If you have more complex combinations, or are a fan of the tidyverse, see our
answer from 2020-08-11.
The gather and spread functions have been replaced with pivot_longer and
pivot_wider, respectively. These functions work mostly the same way, and the
syntax has been revised to be more explicit about what is happening. An example:
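A sketch with a made-up wide data set of yearly measurements:

```r
library(tidyr)

wide <- data.frame(site = c("A", "B"),
                   year_2019 = c(3.2, 4.1),
                   year_2020 = c(3.8, 4.5))

# Wide to long: one row for each site x year combination
long <- pivot_longer(wide,
                     cols = starts_with("year_"),
                     names_to = "year",
                     names_prefix = "year_",
                     values_to = "measurement")

# And back again, long to wide
wide_again <- pivot_wider(long,
                          names_from = "year",
                          names_prefix = "year_",
                          values_from = "measurement")
```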
So you have written a nice custom function that does some cool things with
parts of the dplyr package. Combining a call to group_by
and summarize
worked great with hard-coded values:
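A sketch of such a function (the name summarize_scores is a placeholder; the family and scores columns come from the description below):

```r
library(dplyr)

summarize_scores <- function(data) {
  stats_data <- data %>%
    group_by(family) %>%
    summarize(means = mean(scores, na.rm = TRUE),
              sds = sd(scores, na.rm = TRUE))
  return(stats_data)
}
```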
The function operates on a data frame, data
, and returns some summary
statistics (mean and standard deviation) of the scores
column, for each value
in the family
column.
But what if you want to add functionality for other columns? One might intuit
(at least, I did):
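Something like this, passing the column names as arguments (again a sketch):

```r
summarize_scores <- function(data, group_col, value_col) {
  stats_data <- data %>%
    group_by(group_col) %>%                  # dplyr looks for columns literally
    summarize(means = mean(value_col),       # named "group_col" and "value_col",
              sds = sd(value_col))           # not the columns you passed in
  return(stats_data)
}
```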
Unfortunately, this won't work. For in-depth discussion of why, you can read
about the dark art of Quasiquotation.
If you just want to make it work, this Stack Overflow
post provided an explanation that we use below. In short, you can make it work
by wrapping the column name variables in !!as.name()
whenever you reference
them in tidyverse functions.
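A sketch of the working version:

```r
summarize_scores <- function(data, group_col, value_col) {
  stats_data <- data %>%
    group_by(!!as.name(group_col)) %>%
    summarize(means = mean(!!as.name(value_col), na.rm = TRUE),
              sds = sd(!!as.name(value_col), na.rm = TRUE))
  return(stats_data)
}

# e.g. summarize_scores(data = my_data, group_col = "family", value_col = "scores")
# (my_data is a hypothetical data frame)
```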
It might not be pretty, but it works.
Make a list! I mean, a list
. The bento box of R, list objects can contain
just about anything. Using the function we created in the question above, we
can also add code to create a plot of values.
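A sketch of the function with plotting added (the ggplot call is a guess at the original plot):

```r
library(dplyr)
library(ggplot2)

summarize_scores <- function(data, group_col, value_col) {
  stats_data <- data %>%
    group_by(!!as.name(group_col)) %>%
    summarize(means = mean(!!as.name(value_col), na.rm = TRUE),
              sds = sd(!!as.name(value_col), na.rm = TRUE))
  stats_plot <- ggplot(data = stats_data,
                       mapping = aes(x = !!as.name(group_col), y = means)) +
    geom_col()
  return(stats_data)
}
```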
But the function just returns the stats_data
data frame (well, it's really a
tibble). How do we get it to return the plot object, too? (Aside: note how we
did not need to "quasiquote" the means
column in the call to ggplot with
!!as.name()
. This is because "means" is the actual name of the column, not
the name of a variable referring to the column). To return the plot, we create
a list
object and return that instead:
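A sketch of the list-returning version (same reconstructed function as above, with only the return changed):

```r
summarize_scores <- function(data, group_col, value_col) {
  stats_data <- data %>%
    group_by(!!as.name(group_col)) %>%
    summarize(means = mean(!!as.name(value_col), na.rm = TRUE),
              sds = sd(!!as.name(value_col), na.rm = TRUE))
  stats_plot <- ggplot(data = stats_data,
                       mapping = aes(x = !!as.name(group_col), y = means)) +
    geom_col()
  # Bundle both objects into a named list and return that
  return(list(stats_data = stats_data, stats_plot = stats_plot))
}
```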
if-else chains?

Let's say you have a continuous variable that you want to categorize into
"high", "medium", and "low" categories. We can do this with some base R
looping:
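A sketch, with made-up values and cutoffs:

```r
values <- c(2.4, 7.9, 5.1, 9.6, 1.3, 4.8)
categories <- character(length(values))
for (i in seq_along(values)) {
  if (values[i] < 4) {
    categories[i] <- "low"
  } else if (values[i] < 7) {
    categories[i] <- "medium"
  } else {
    categories[i] <- "high"
  }
}
```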
Ooof. That seems onerous. Another base R approach is
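One option along those lines is cut() (my reconstruction, using the same made-up cutoffs):

```r
categories <- cut(values,
                  breaks = c(-Inf, 4, 7, Inf),
                  labels = c("low", "medium", "high"))
```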
Better, but there is a tidy way, too.
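A sketch with dplyr's case_when(), one reasonable tidyverse route (cutoffs as above):

```r
library(dplyr)

dat <- data.frame(values = values)
dat <- dat %>%
  mutate(categories = case_when(values < 4 ~ "low",
                                values < 7 ~ "medium",
                                TRUE ~ "high"))
```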
Indeed. This "rolling" approach is implemented in a number of the functions of
the zoo package and
there is a worked example on Stack Overflow at https://stackoverflow.com/questions/41061140/how-to-calculate-the-average-slope-within-a-moving-window-in-r
(Credit: A. Rutherford). If you don't want to deal with the overhead of the zoo
package, the runner
package provides an alternative approach.
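For a flavor of the zoo approach, a minimal sketch of a rolling mean (the vector and window width are made up):

```r
library(zoo)

x <- c(3, 5, 2, 8, 7, 4, 6)
# Apply mean() over a moving window of three values
rolling_means <- rollapply(x, width = 3, FUN = mean, align = "right")
```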
Whoo-boy. Right now it seems you can learn mixed effects models by either (1)
fumbling around on the internet until you find lme
code that works or (2)
taking a statistics course requiring linear algebra. For a middle ground (sort
of), take a look at the lesson at https://m-clark.github.io/mixed-models-with-R/introduction.html.
It gets a little hairy at times (just hum through the more mathematical parts),
but it provides a very nice one-to-one comparison of standard linear regression
to mixed-effects regression.
There is also a very nice introduction at https://www.jaredknowles.com/journal/2013/11/25/getting-started-with-mixed-effect-models-in-r
(Credit: A. Rutherford).
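For a quick taste of the syntax, a minimal sketch with the lme4 package (using lme4 rather than nlme's lme is an assumption here; the sleepstudy data ships with lme4):

```r
library(lme4)

# Fixed effect of Days, random intercept for each Subject
fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
summary(fit)
```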
For data that aren't too crazy, you can use the glm()
function, setting the
family
argument to "binomial":
See the examples at https://www.statmethods.net/advstats/glm.html. If you want more of the math underlying the approach, check out the Wikipedia page.
This seems a perennial challenge. One thing to check would be the kableExtra package, which also includes a workaround for tables that are too wide to fit on a screen (use a scroll box). (Credit: A. Rutherford)
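A minimal sketch of the scroll-box workaround (the table and dimensions are arbitrary):

```r
library(knitr)
library(kableExtra)

kable(mtcars) %>%
  kable_styling() %>%
  scroll_box(width = "100%", height = "300px")
```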
Another option is described at Stack Overflow, which essentially suggests adding some CSS styling in the R Markdown file, just below the YAML header.
The data set dat
includes over 3,000 columns, but the goal is to create a subset of those columns. Specifically, columns that start with "QU", followed by a number between 1 and 40 and end with either "A1" or "A2". The challenges are: there are columns that start with "QU", but are followed with numbers greater than 40, but we do not want any of them. Also, some column names end with "A3", "A4", etc., and we do not want any of them either. The trick here is that there are multiple criteria to match, based on the beginning and the ending of the column names. Thankfully, there are helper functions in the tidyverse, specifically starts_with()
and ends_with()
that are designed for just this purpose (Credit: A. Rutherford):
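A sketch of the call (this assumes column names shaped like "QU1A1" through "QU40A2"; the trailing "A" in each prefix keeps, e.g., "QU4" from also matching "QU41"):

```r
library(dplyr)

dat_subset <- dat %>%
  select(starts_with(paste0("QU", 1:40, "A")) & ends_with(c("A1", "A2")))
```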
Side note: We tried a variant on this using num_range()
, but just could not get it to work.
Today's session was an introduction to some common statistical tests in R (t-tests, analysis of variance, and linear regression). The lesson can be found at https://jcoliver.github.io/learn-r/002-intro-stats.html.
R is rounding the p-values, and in some cases this rounding results in a value of zero. When p-values are so small, it probably is not appropriate to compare them to one another. There is a rich discussion on this topic on Stack Overflow (credit: K. Busby). See the post from 2020-09-01 for more information on how to run post-hoc tests.
See eta_sq()
for one approach; more information is available at https://rdrr.io/cran/effectsize/man/eta_squared.html. In short, you should be able to use this directly on the output of a call to aov()
, e.g., eta_squared(aov(Petal.Length ~ Species, data = iris), partial = TRUE, ci = 0.95)
(credit: K. Busby).

Ideally, the columns of interest would have a unique starting or ending to their name. For example, all the columns of interest might start with "temp_" (and no other columns would start with that string of characters). One could then use the dplyr functions select()
and starts_with()
to pull out only those columns of interest. So it might look something like this:
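(A sketch; the "temp_" prefix comes from the example above, and the row-wise mean is an assumed goal.)

```r
library(dplyr)

# Pull out only the columns of interest...
temperature_columns <- dat %>%
  select(starts_with("temp_"))

# ...then work with them, for example a row-wise mean
dat$temp_mean <- rowMeans(temperature_columns, na.rm = TRUE)
```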
Another option is to use c_across
, which assumes the columns appear consecutively in your data (credit: K. Gonzalez):
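A sketch, assuming the columns run consecutively from temp_01 through temp_12:

```r
library(dplyr)

dat <- dat %>%
  rowwise() %>%
  mutate(temp_mean = mean(c_across(temp_01:temp_12), na.rm = TRUE)) %>%
  ungroup()
```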
The violin plot option of ggplot2 is a great place to start. It is kind of like a boxplot and a histogram combined. Some examples are available at http://www.sthda.com/english/wiki/ggplot2-violin-plot-quick-start-guide-r-software-and-data-visualization. (credit: K. Busby)
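A quick sketch with the built-in iris data:

```r
library(ggplot2)

ggplot(data = iris, mapping = aes(x = Species, y = Petal.Length)) +
  geom_violin()
```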
Indeed.

- If you have separate plots made with ggplot, you can use the grid.arrange() function from the gridExtra package. Some additional worked examples are available at https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html. (credit: K. Busby)
- If the plots show subsets of the same data and are made with ggplot, you can use the faceting system of ggplot2. Examples are available at http://www.cookbook-r.com/Graphs/Facets_(ggplot2)/. (credit: A. Rutherford)
- Another option is the patchwork package. It literally lets you add plots together like plot1 + plot2. Learn more here. (Credit: K. Gonzalez)

Yes. The first thing to do is for the current maintainer to e-mail CRAN (from the e-mail address listed in the DESCRIPTION file) with the new maintainer's name and e-mail address. The folks at CRAN will pull some levers on their end to get the process started. A nice Stack Overflow thread goes over the process at https://stackoverflow.com/questions/39223320/transferring-maintainership-of-an-r-package-on-cran. And if you want to find out more about submitting packages to CRAN, you can read the official documentation or the human-readable overview of the process at https://kbroman.org/pkg_primer/pages/cran.html. (credit: A. Rutherford)
Time for some post hoc analyses! Check out the example implementation of the Tukey test at https://rpubs.com/aaronsc32/post-hoc-analysis-tukey.
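A minimal sketch with base R's TukeyHSD() and the built-in iris data:

```r
petal_aov <- aov(Petal.Length ~ Species, data = iris)
TukeyHSD(petal_aov)
```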
The package MuMIn has a function dredge()
, which is designed for automated model selection; see https://rdrr.io/cran/MuMIn/man/dredge.html.
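A minimal sketch with the built-in mtcars data (note that dredge() requires the global model to be fit with na.action = "na.fail"):

```r
library(MuMIn)

global_model <- lm(mpg ~ wt + hp + cyl, data = mtcars, na.action = "na.fail")
model_set <- dredge(global_model)
head(model_set)
```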
mice()?

The mice package is useful for imputing data when you have missing values, but using it isn't always straightforward. There is a nice series of vignettes on using mice, starting at https://www.gerkovink.com/miceVignettes/Ad_hoc_and_mice/Ad_hoc_methods.html. The imputation process is generally performed multiple times, and for information on how to use the results of multiple imputation, the documentation for the pool()
function in mice provides a particularly succinct workflow:

1. Impute the missing data with the mice function, resulting in a multiple imputed data set (class mids);
2. Fit the model of interest to each imputed data set with the with() function, resulting in an object of class mira;
3. Pool the estimates across imputed data sets with the pool() function; models can also be compared with the D1() or D3() functions.

You will most likely need the Xcode program for Mac. You can download it through the App Store on your machine or from https://developer.apple.com/xcode/resources/.
This is not an R-related point, but if you are thinking about choosing a graduate advisor, consider which aspects of the mentor you find most important. Recent work suggests that the supportiveness of the mentor has a large impact on student satisfaction (https://www.sciencemag.org/careers/2019/04/what-matters-phd-adviser-here-s-what-research-says).
Today's lesson was about using the knitr package and RMarkdown to create a single document that had styled text, code, and plots. The full lesson is available at https://jcoliver.github.io/learn-r/005-intro-knitr.html. Some great questions to come out of today's session:
Since the pound sign (#) indicates a header in markdown, you can use HTML-style commenting in markdown blocks:
<!-- this is a comment -->
To make it easy for other people to re-use your code, it is probably best to load all the libraries your RMarkdown document depends on at the very beginning of the document (after the header, of course) in either the pre-existing "setup" block or another block named something like "load-dependencies". e.g.
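````
```{r load-dependencies}
# Load every package the document uses (dplyr and ggplot2 are just examples)
library(dplyr)
library(ggplot2)
```
````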
You can keep code from appearing in the output document by passing echo = FALSE
at the start of the code block. For example, using the block where we loaded dependencies, above:
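````
```{r load-dependencies, echo = FALSE}
# echo = FALSE runs this code when knitting but leaves it out of the output
library(dplyr)
library(ggplot2)
```
````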
The knitr package includes the function kable()
, which is designed to format tabular data for the appropriate output.

More and more often, we get to deal with data that were originally collected for another use (or collected with little consideration of future use). Rarely are those data in a shape appropriate for the analyses we would like to perform. Before starting work on the unwrangled data, I often begin by creating a mock-up of what I want the data to look like. Just a little sketch on paper or a sample Excel file. By starting at the end, I have a clear vision of what I want the data to look like. Two tidyverse packages are particularly useful for data wrangling:
for loops?

At the risk of sounding like a broken record, look at the Software Carpentry lesson on loops at http://swcarpentry.github.io/r-novice-gapminder/14-tidyr/index.html. When writing a for
loop, make heavy use of reporting functions so you can be sure the loop is doing what you think it is supposed to be doing. For example, if I wanted to write a loop that summed the numbers 1 through 10, I could write
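```r
# (A reconstruction; the variable name total is a placeholder)
total <- 0
for (i in 10) {
  total <- total + i
}
total  # 10, not the expected 55
```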
That's not right at all, so I use the cat
function in the loop to print the current value assigned to i:
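```r
total <- 0
for (i in 10) {
  cat("i = ", i, "\n")
  total <- total + i
}
```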
When the loop executes, it will print "i = 10" and that is it. It should print "i = 1", "i = 2", "i = 3" and so on. This reality check tells me that something is not going right with the loop because i
is not taking the right values. Looking closer at my syntax, I see that I did not declare the for
statement correctly. Instead of telling i
to take values 1 through 10, I told it to take the value of 10. That's it. So I need to change the first line of my loop from
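```r
for (i in 10) {
```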
to
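```r
for (i in 1:10) {
```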
And this will now sum the integers from 1 through 10 (and print out the "i = " statements as expected). Once the loop works as expected, I recommend commenting out (or deleting) the cat
function to cut down on the information printed to the screen.
Just remember, when writing loops, reality checks are your friends.
The join
functions in dplyr would allow one to do this:
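A sketch with a pair of made-up data frames sharing an "id" column:

```r
library(dplyr)

df1 <- data.frame(id = 1:5, height = c(10, 12, 9, 14, 11))
df2 <- data.frame(id = c(1, 2, 3, 5), mass = c(2.1, 2.4, 1.9, 2.6))

# left_join() keeps every row of df1 and adds the matching columns from df2
combined <- left_join(df1, df2, by = "id")
```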
To decide which of the four joins to use (left, right, inner, or full), there is a great explanation with visual examples at https://r4ds.had.co.nz/relational-data.html#understanding-joins. The RStudio Cheat sheet for data wrangling also provides a succinct overview of how to use the join functions:
https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf.