# Learn-r ## Discussion resources This note includes information discussed during Tuesday morning R sessions at the University of Arizona. More information about the R sessions can be found at https://jcoliver.github.io/learn-r/schedule. Posts are shown in reverse chronological order. If you have questions or comments, e-mail Jeff Oliver at [jcoliver@arizona.edu](mailto:jcoliver@arizona.edu). ### 2021-08-18 #### I have to impute data, but need to analyze mean-centered variables _and_ I want to see model performance on each imputed data set. Can I do this with mice? Indeed. The [mice package](https://amices.org/mice/) is the go-to resource when imputing data is necessary. We can break the prompt into three questions, where we address each in term: 1. How can I see model fit for each imputed data set? 2. How can I test for multicollinearity on each imputed data set? 3. How can I do analyses on data that require The code block below provides a worked example addressing these three questions: {r} library(mice) # for multiple imputation library(car) # for variance inflation factor (measure of multicollinearity) # Make up some imputed analyses for example (using default of 5 # imputed data sets) # Start by imputing missing values for nhanes data set (part of mice package) imp <- mice(data = nhanes, print = FALSE, seed = 20210817) # Run linear regression on each of the imputed data sets fit <- with(imp, lm(bmi ~ chl + hyp)) # Use mice::getfit to get model for each imputed data set f1 <- getfit(fit) # 1. model fit for each imputed data set; here we use R-squared, but statistic # will vary based on model. i.e. you might want to extract AIC from models that # include such measures of fit # Variable to hold R squared value for each model r_squared <- numeric(length(f1)) # Cycle through each model and extract r-squared for (i in 1:length(f1)) { # Pull out model from the ith data set one_model <- f1[[i]] # Run summary for the ith model model_summary <- summary(one_model) # Extract R squared r_sq <- model_summary$r.squared # Store in that r_squared vector r_squared[i] <- r_sq } # For fun, run summary on the R squared vector to get a sense of variation in # model fit summary(r_squared) # 2. vif # Similar approach as above where you extract each model (one for each imputed # data set) and run car::vif on the model # Will hold variance inflation factor for each model, because we have a value # for each predictor, we'll use a list to store values vif_mod <- list(length(f1)) # Cycle through each model and calculate variance inflation factor for (i in 1:length(f1)) { # Pull out model from the ith data set one_model <- f1[[i]] # Run car::vif on model model_vif <- car::vif(one_model) # Store in that vif_mod list vif_mod[[i]] <- model_vif } # Convert list to a data frame vif_df <- data.frame(do.call(rbind, vif_mod)) # Print distribution of VIF for each predictor summary(vif_df) # 3. mean center # Use passive imputation to create mean-centered values following imputation # Add empty columns that will hold mean-centered values in imputed data sets nhanes_mc <- nhanes # copying original data nhanes_mc$bmi_mc <- NA nhanes_mc$chl_mc <- NA nhanes_mc$hyp_mc <- NA # Run mice initially to create prediction and method matrices ini <- mice(data = nhanes_mc, print = FALSE, seed = 20210817) # We can ignore the warnings, as we aren't going to use the ini imputed # datasets for anything pred <- ini$pred # Set the mean-centered values as ones we do NOT want to actively impute pred[c("bmi", "chl", "hyp"), c("bmi_mc", "chl_mc", "hyp_mc")] <- 0 # Use the method matrix to indicate how to mean center variables post-imputation meth<- ini$meth meth["bmi_mc"] <- "~I(bmi - mean(bmi))" meth["chl_mc"] <- "~I(chl - mean(chl))" meth["hyp_mc"] <- "~I(hyp - mean(hyp))" # Run passive imputation with those pred and meth matrices passive_imp <- mice(data = nhanes_mc, meth = meth, pred = pred, seed = 20210817, print = FALSE) # For the paranoid, check that first data set to make sure the values were # actually mean-centered # Extracts the first imputed data set passive_one <- complete(data = passive_imp, action = 1) # Should show Mean = 0.00 for bmi_mc, chl_mc, and hyp_mc summary(passive_one) # Run the analysis, using mean-centered variables in function call pass_fit <- with(passive_imp, lm(bmi_mc ~ chl_mc + hyp_mc)) # Proceed as you see fit  ### 2021-04-13 #### Code reviews: a very, very brief overview ##### What is a code review? Checking code for readability and functionality. Just like when we have someone read something we wrote. ##### Why do we do code reviews? Improves sharability of code and identifies errors or inefficiencies. Like when someone else reads your thesis and shows you where you meant "there" instead of "their". ##### What should I look for when performing a code review? + Is the rationale of _why_ clear from the code and comments? + Are variable names informative? + Is the same code duplicated more than twice? + Could I reproduce these analyses/plots with this code and data alone? ##### Some guidelines for code (and other) reviews + Don't be (too) nitpicky. + Take breaks. Don't do this for more than an hour. + Use "I" statements + Wrong: You are writing cryptic code. + Right: It is hard for me to grasp what is going on in this code. + Recognize the differences between your preferences and best practices ##### How do I actually _do_ a code review? + The best way is to use a code-sharing platform, such as [GitHub](https://github.com), [GitLab](https://gitlab.com), or [Bitbucket](https://bitbucket.org). + If you are not yet set up with one of those platforms, you can share files through e-mail or file sharing systems (e.g. Google Drive, Dropbox). + If you provide comments directly in code, try to use an identifier that is easy to search for. For example, I would mark all comments that I make with the string "TODO: JCO", e.g. # TODO: JCO rename varaible. ##### A couple of nice links with more code review guidelines + Nice code review guidelines are available at [https://phauer.com/2018/code-review-guidelines/#the-code-review-cheat-sheet](https://phauer.com/2018/code-review-guidelines/#the-code-review-cheat-sheet) + Preparing for and doing a code review, from reviewee and reviewer standpoint [https://engineeringblog.yelp.com/2017/11/code-review-guidelines.html](https://engineeringblog.yelp.com/2017/11/code-review-guidelines.html) ### 2021-02-16 #### _How do I merge two data frames?_ So you have two data frames with differing numbers of rows, and you want to get all the information into a single data frame. For a quick approach, the merge function will do the trick: {r} # Make up data to add to iris colors <- data.frame(Species = c("setosa", "virginica", "versicolor"), Color = c("purple", "lavender", "blue")) # Merge color data frame with iris data frame iris_colors <- merge(x = iris, y = colors) # Confirm it matched up colors with names appropriately with table table(iris_colors$Species, iris_colors$Color) # blue lavender purple # setosa 0 0 50 # versicolor 50 0 0 # virginica 0 50 0 # What if we want to merge data with different ID column name? iris_names <- data.frame(Scientific = c("setosa", "virginica", "versicolor"), Common = c("bristle-pointed iris", "Virginia iris", "blue flag")) # Since there is no Species column in iris_names, we need to tell R which # columns to use for matching rows (it guessed before) iris_full <- merge(x = iris_colors, by.x = "Species", y = iris_names, by.y = "Scientific") # Reality check with table to see that it worked as expected table(iris_full$Species, iris_full$Common) # blue flag bristle-pointed iris Virginia iris # setosa 0 50 0 # versicolor 50 0 0 # virginica 0 0 50  If you have more complex combinations, or are a fan of the tidyverse, see our answer from [2020-08-11](https://hackmd.io/puJYWNnWQM6MOZW46IF3Hg?both#2020-08-11). #### _How do I transform data to "long" format? I used to use gather..._ The gather and spread functions have been replaced with pivot_longer and pivot_wider, respectively. These functions work mostly the same way and the syntax has been revised to be more explicit about what is happening. An example: {r} library(dplyr) # For the pipe operator %>% library(tidyr) # For the pivot functions library(ggplot2) # For plotting data # Transform iris data to long iris_long <- iris %>% pivot_longer(cols = c(Petal.Length, Petal.Width, Sepal.Length, Sepal.Width), names_to = "Flower_part", # Name of column that holds column names values_to = "Measurement") # Name of column that holds values # Alternatively, can just name columns we *don't* want to pivot using the # negative sign iris_long <- iris %>% pivot_longer(cols = -Species, names_to = "Flower_part", # Name of column that holds column names values_to = "Measurement") # Name of column that holds values # Create boxplots for each flower measurement ggplot(data = iris_long, mapping = aes(x = Species, y = Measurement)) + geom_boxplot() + facet_wrap(~ Flower_part) # Make a separate plot for each flower part # Update the facet_wrap command so each sub-plot has an independent y-axis scale ggplot(data = iris_long, mapping = aes(x = Species, y = Measurement)) + geom_boxplot() + facet_wrap(~ Flower_part, scales = "free_y")  ### 2020-11-24 #### _How can I use a dynamic column name when a function uses dplyr?_ So you have written a nice custom function that does some cool things with parts of the dplyr package. Combining a call to group_by and summarize worked great with hard-coded values: {r} library(dplyr) group_stats <- function(data) { stats_data <- data %>% group_by(family) %>% summarize(means = mean(scores), stdev = sd(scores)) return(stats_data) }  The function operates on a data frame, data, and returns some summary statistics (mean and standard deviation) of the scores column, for each value in the family column. But what if you want to add functionality for other columns? One might intuit (at least, I did): {r} library(dplyr) group_stats <- function(data, group_col, value_col) { stats_data <- data %>% group_by(group_col) %>% summarize(means = mean(value_col), stdev = sd(value_col)) return(stats_data) }  Unfortunately, this won't work. For in-depth discussion of why, you can read about the dark art of [Quasiquotation](https://adv-r.hadley.nz/quasiquotation.html). If you just want to make it work, this [Stack Overflow](https://stackoverflow.com/questions/29678435/how-to-pass-dynamic-column-names-in-dplyr-into-custom-function) post provided an explanation that we use below. In short, you can make it work by wrapping the column name variables in !!as.name() whenever you reference them in tidyverse functions. {r} library(dplyr) group_stats <- function(data, group_col, value_col) { stats_data <- data %>% group_by(!!as.name(group_col)) %>% summarize(means = mean(!!as.name(value_col)), stdev = sd(!!as.name(value_col))) return(stats_data) }  It might not be pretty, but it works. #### _How do I return multiple objects from a function?_ Make a list! I mean, a list. The bento box of R, list objects can contain just about anything. Using the function we created in the question above, we can also add code to create a plot of values. {r} library(dplyr) library(ggplot2) group_stats <- function(data, group_col, value_col) { stats_data <- data %>% group_by(!!as.name(group_col)) %>% summarize(means = mean(!!as.name(value_col)), stdev = sd(!!as.name(value_col))) stats_plot <- ggplot(data = stats_data, mapping = aes(x = !!as.name(group_col), y = means)) return(stats_data) }  But the function just returns the stats_data data frame (well, it's really a tibble). How do we get it to return the plot object, too? (Aside: note how we did not need to "quasiquote" the means column in the call to ggplot with !!as.name(). This is because "means" is the actual name of the column, not the name of a variable referring to the column). To return the plot, we create a list object and return that instead: {r} library(dplyr) library(ggplot2) group_stats <- function(data, group_col, value_col) { stats_data <- data %>% group_by(!!as.name(group_col)) %>% summarize(means = mean(!!as.name(value_col)), stdev = sd(!!as.name(value_col))) stats_plot <- ggplot(data = stats_data, mapping = aes(x = !!as.name(group_col), y = means)) return_list <- list(stats_table = stats_data, means_plot = stats_plot) return(return_list) } # Call the function, assigning output to variable group_summary group_summary <- group_stats(data = df, group_col = "family", value_col = "scores") # Access each of the elements using $notation # The stats table group_summary$stats_table # The plot group_summary$means_plot  #### _How do I avoid long if - else chains?_ Let's say you have a continuous variable that you want to categorize into "high", "medium", and "low" categories. We can do this with some base R looping: {r} for (i in 1:nrow(df)) { if (df$value[i] < 10) { df$category[i] <- "low" } else if (df$value[i] < 30) { df$category[i] <- "medium" } else { df$category[i] <- "high" } }  Ooof. That seems oenerous. Another base R approach is {r} df$category[df$value < 10] <- "low" df$category[df$value >= 10 & df$value < 30] <- "medium" df$category[df$value >= 30] <- "high"  Better, but there is a tidy way, too. {r} library(dplyr) df <- df %>% mutate(category = case_when(value < 10 ~ "low", value < 30 ~ "medium", value >= 30 ~ "high"))  ### 2020-11-03 #### _Can I calculate a slope for a sliding window of data?_ Indeed. This "rolling" approach is implemented in a number of the functions of the [zoo package](https://cran.r-project.org/web/packages/zoo/index.html) and there is a worked example on Stack Overflow at [https://stackoverflow.com/questions/41061140/how-to-calculate-the-average-slope-within-a-moving-window-in-r](https://stackoverflow.com/questions/41061140/how-to-calculate-the-average-slope-within-a-moving-window-in-r) (Credit: A. Rutherford). If you don't want to deal with the overhead of the zoo package, the [runner](https://www.rdocumentation.org/packages/runner/versions/0.3.7) package provides an alternative approach. ### 2020-10-06 #### _How do I get started with mixed-effects models?_ Whoo-boy. Right now it seems you can learn mixed effects models by either (1) fumbling around on the internet until you find lme code that works or (2) take a statistics course requiring linear algebra. For a middle ground (sort of), take a look at the lesson at [https://m-clark.github.io/mixed-models-with-R/introduction.html](https://m-clark.github.io/mixed-models-with-R/introduction.html). It gets a little hairy at times (just hum through parts like$b_{int\_student} \sim N(b_{intercept}, \tau)\$), but provides a very nice one- to-one comparison of standard linear regression to mixed-effects regression. There is also a very nice introduction at [https://www.jaredknowles.com/journal/2013/11/25/getting-started-with-mixed-effect-models-in-r](https://www.jaredknowles.com/journal/2013/11/25/getting-started-with-mixed-effect-models-in-r) (Credit: A. Rutherford). #### _How do I do logistic regression in R?_ For data that aren't too crazy, you can use the glm() function, setting the family argument to "binomial": {r} logit_model <- glm(response ~ predictor, family = "binomial", data = mydata)  See the examples at [https://www.statmethods.net/advstats/glm.html](https://www.statmethods.net/advstats/glm.html). If you want more of the math underlying the approach, check out the [Wikipedia page](https://en.wikipedia.org/wiki/Logistic_regression). #### _How do I control the use of horizontal space in knitr output?_ This seems a perennial challenge. One thing to check would be the [kableExtra package](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html), which also includes a workaround for tables that are too wide to fit on a screen (use a [scroll box](https://rdrr.io/cran/kableExtra/man/scroll_box.html)). (Credit: A. Rutherford) Another option is described at [Stack Overflow](https://stackoverflow.com/questions/34906002/increase-width-of-entire-html-rmarkdown-output), which essentially suggests adding some CSS styling in the R Markdown file, just below the YAML header:  <style type="text/css"> .main-container { max-width: 1800px; margin-left: auto; margin-right: auto; } </style>  ### 2020-09-22 #### _I want to select a subset of columns from my dataset, but...it's complicated._ The data set dat includes over 3,000 columns, but the goal is to create a subset of those columns. Specifically, columns that start with "QU", followed by a number between 1 and 40 **and** end with either "A1" or "A2". The challenges are: there are columns that start with "QU", but are followed with numbers greater than 40, but we do not want any of them. Also, some column names end with "A3", "A4", etc., and we do not want any of them either. The trick here is that there are multiple criteria to match, based on the beginning and the ending of the column names. Thankfully, there are helper functions in the tidyverse, specifically starts_with() and ends_with() that are designed for just this purpose (Credit: A. Rutherford):  dat_subset <- dat %>% select(starts_with(paste0("QU", 1:40)) & ends_with(c("A1", "A2")))  _Side note_: We tried a variant on this using num_range(), but just could not get it to work. #### _Are there any accessible style guides for writing R?_ + Hadley Wickham's _Advanced R_ has a nice chapter on styles at [http://adv-r.had.co.nz/Style.html](http://adv-r.had.co.nz/Style.html) (Credit: K. Busby) + There is also the tidyverse style guide at [https://style.tidyverse.org/](https://style.tidyverse.org/) + And even R packages to help ensure your code conforms to style, including [styler](https://styler.r-lib.org/) and [lintr](https://github.com/jimhester/lintr) ### 2020-09-15 Today's session was a introduction to some common statistical tests in R (t-tests, analysis of variance, and linear regression). The lesson can be found at [https://jcoliver.github.io/learn-r/002-intro-stats.html](https://jcoliver.github.io/learn-r/002-intro-stats.html). #### _In the post-hoc test results of an ANOVA, many p-values are zero. What can I do about this?_ R is rounding the p-values, and in some cases this rounding results in a value of zero. When p-values are so small, it probably is not appropriate to compare them to one another. There is a rich discussion on this topic on [Stack Overflow](https://stackoverflow.com/questions/16470404/tukeyhsd-adjusted-p-value-is-0-0000000) (credit: K. Busby). See the post from [2020-09-01](#2020-09-01) for more information on how to run post-hoc tests. #### _How can we find the effect size from an ANOVA?_ + See the function eta_sq() for one approach. More information is available at [https://rdrr.io/cran/effectsize/man/eta_squared.html](https://rdrr.io/cran/effectsize/man/eta_squared.html). In short, you should be able to use this directly on the output of a call to aov(). e.g. eta_squared(aov(Petal.Length ~ Species, data = iris), partial = TRUE, ci = 0.95) (credit: K. Busby). + There is also a package, [sjstats](https://cran.r-project.org/web/packages/sjstats/index.html), that supports a variety of effect size tests. There is a nice R-bloggers introduction to using this package at [https://www.r-bloggers.com/effect-size-statistics-for-anova-tables-rstats/](https://www.r-bloggers.com/effect-size-statistics-for-anova-tables-rstats/) (credit: A. Rutherford). ### 2020-09-08 #### _I need to sum variables from 15 different columns - is there a (tidyverse) way to do this without naming all the columns?_ Ideally, the columns of interest would have a unique starting or ending to their name. For example, all the columns of interest might start with "temp_" (and no other columns would start with that string of characters). One could then use the dplyr functions select() and starts_with() to pull out only those columns of interest. So it might look something like this:  summed_data <- my_data %>% mutate(sum_temp = rowSums(select(., starts_with("temp_")))  Another option is to use c_across, which assumes the columns appear consecutively in your data (credit: K. Gonzalez):  summed_data <- my_data %>% mutate(sum_temp = rowSums(c_across(temp_1:temp_15)))  #### _How do I visualize distributions for two categories?_ The violin plot option of ggplot2 is a great place to start. It is kind of like a boxplot and a histogram combined. Some examples are available at [http://www.sthda.com/english/wiki/ggplot2-violin-plot-quick-start-guide-r-software-and-data-visualization](http://www.sthda.com/english/wiki/ggplot2-violin-plot-quick-start-guide-r-software-and-data-visualization). (credit: K. Busby) #### _Can I combine two (or more) plots in ggplot?_ Indeed. + If you want to combine two separate plot objects (created by different calls to ggplot, you can use the grid.arrange() function from the [gridExtra package](https://cran.r-project.org/web/packages/gridExtra/vignettes/arrangeGrob.html). Some additional worked examples are available at [https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html](https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html). (credit: K. Busby) + If you want to combine plots with a single call of ggplot, you can use the faceting system of ggplot2. Examples are available at [http://www.cookbook-r.com/Graphs/Facets_(ggplot2)/](http://www.cookbook-r.com/Graphs/Facets_(ggplot2)/). (credit: A. Rutherford) + Another great package is called patchwork. It literally lets you add plots together like plot1 + plot2. Learn more [here](https://patchwork.data-imaginist.com/). (Credit: K. Gonzalez) #### _Can I transfer maintainership for an R package on CRAN?_ Yes. The first thing to do is for the current maintainer to e-mail CRAN (from the e-mail address listed in the DESCRIPTION file) with the _new_ maintainer's name and e-mail address. The folks at CRAN will pull some levers on their end to get the process started. A nice Stack Overflow thread goes over the process at [https://stackoverflow.com/questions/39223320/transferring-maintainership-of-an-r-package-on-cran](https://stackoverflow.com/questions/39223320/transferring-maintainership-of-an-r-package-on-cran). And if you want to find out more about submitting packages to CRAN you can read the [official documentation](https://cran.r-project.org/submit.html) or the human-readable overview of the process at [https://kbroman.org/pkg_primer/pages/cran.html](https://kbroman.org/pkg_primer/pages/cran.html). (credit: A. Rutherford) ### 2020-09-01 #### _I have significant results in a one-way ANOVA, but how can I tell which groups are different from one another?_ Time for some _post hoc_ analyses! Check out the example implementation of the Tukey test at [https://rpubs.com/aaronsc32/post-hoc-analysis-tukey](https://rpubs.com/aaronsc32/post-hoc-analysis-tukey). #### _I have five predictor variables and want to find the best model; is there any way to do this automatically, rather than manually creating all 25 models?_ The package MuMIn has a function dredge() which is designed for automated model selection [https://rdrr.io/cran/MuMIn/man/dredge.html](https://rdrr.io/cran/MuMIn/man/dredge.html). ### 2020-08-25 #### _How do I use the imputed data from mice()?_ The mice package is useful for imputing data when you have missing values, but using it isn't aways straightforward. There is a nice series of vignettes on using mice, starting at [https://www.gerkovink.com/miceVignettes/Ad_hoc_and_mice/Ad_hoc_methods.html](https://www.gerkovink.com/miceVignettes/Ad_hoc_and_mice/Ad_hoc_methods.html). The imputation process is generally performed multiple times and for information on how to use the results of multiple imputation, the [documentation](https://cran.r-project.org/web/packages/mice/mice.pdf) for the pool() function in mice provides a particularly succinct workflow: 1. Impute the missing data by the mice function, resulting in a multiple imputed data set (class mids); 2. Fit the model of interest (scientific model) on each imputed data set by the with() function, resulting an object of class mira; 3. Pool the estimates from each model into a single set of estimates and standard errors, resulting in an object of class mipo; 4. Optionally, compare pooled estimates from different scientific models by the D1() or D3() functions. #### _How do I compile packages from source if I am running Mac OSX?_ You will most likely need the Xcode program for Mac. You can download it through the Apple Store on your machine or from [https://developer.apple.com/xcode/resources/](https://developer.apple.com/xcode/resources/) #### _Miscellaneous debris_ This is not an R-related point, but if you are thinking about choosing a graduate advisor, consider what aspects the the mentor you find most important. Recent work suggests that the supportiveness of the mentor has a large impact on student satisfaction ([https://www.sciencemag.org/careers/2019/04/what-matters-phd-adviser-here-s-what-research-says](https://www.sciencemag.org/careers/2019/04/what-matters-phd-adviser-here-s-what-research-says)). ### 2020-08-18 Today's lesson was about using the [knitr](https://yihui.org/knitr/) package and RMarkdown to create a single document that had styled text, code, and plots. The full lesson is available at [https://jcoliver.github.io/learn-r/005-intro-knitr.html](https://jcoliver.github.io/learn-r/005-intro-knitr.html). Some great questions to come out of today's session: #### _How do you add comments to markdown?_ Since the pound sign (#) indicates a header in markdown, you can use HTML-style commenting in markdown blocks: > <!-- this is a comment --> #### _What's best practice for using packages in markdown?_ To make it easy for other people to re-use your code, it is probably best to load all the libraries your RMarkdown document depends on at the very beginning of the document (after the header, of course) in either the pre-existing "setup" block or another block named something like "load-dependencies". e.g. ~~~ --- title: My Fancy RMarkdown document author: Jeff Oliver date: 2029-08-18 output: html_document --- {r load-dependencies} library(ggplot2) library(dplyr)  ~~~ #### _How can I hide the code from the output document?_ You can keep code from appearing in the output document by passing echo = FALSE in the start of the code block. For example, using the block where we loaded dependencies, above: {r load-dependencies, echo = FALSE} library(ggplot2) library(dplyr)  #### _Are there easy ways to format model output tables?_ + The knitr package includes the function [kable()](https://bookdown.org/yihui/rmarkdown-cookbook/kable.html), which is designed to format tabular data for the appropriate output. + The [kableExtra](https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html) package takes table styling even further. + The [broom](https://cran.r-project.org/web/packages/broom/vignettes/broom.html) package is designed for tidying up output of statistical analyses + And if you're getting real fancy with mixed models, check out [broom.mixed](https://cran.r-project.org/web/packages/broom.mixed/vignettes/broom_mixed_intro.html). #### _Miscellaneous Debris_ + RMarkdown cheat sheet [https://rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf](https://rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf) + Word templates for RMarkdown output [https://rmarkdown.rstudio.com/articles_docx.html](https://rmarkdown.rstudio.com/articles_docx.html) ### 2020-08-11 #### _I have large data sets that need some serious data wrangling - how can I do this in R?_ More and more often, we get to deal with data that were originally collected for another use (or collected with little consideration of future use). Rarely are those data in a shape appropriate for the analyses we would like to perform. Before starting work on the unwrangled data, I often start by creating a mock-up of what I _want_ the data to look like. Just a little sketch on paper or a sample Excel file. By **starting at the end**, I have a clear vision of what I want the data to look like. Two [tidyverse]() packages are particularly useful for data wrangling: + [dplyr]() is especially useful for extracting and summarizing information from data frames. The Software Carpentry lesson on the dplyr package is a great place to start: http://swcarpentry.github.io/r-novice-gapminder/13-dplyr/index.html + [tidyr]() provides ways of transforming data. Check out the Software Carpentry lesson for demonstrations of how to do this: http://swcarpentry.github.io/r-novice-gapminder/14-tidyr/index.html #### _How do I write for loops?_ At the risk of sounding like a broken record, look at the Software Carpentry lesson on loops at http://swcarpentry.github.io/r-novice-gapminder/14-tidyr/index.html. When writing a for loop, make heavy use of reporting functions so you can be sure the loop is doing what you think it is supposed to be doing. For example, if I wanted to write a loop that summed the numbers 1 through 10, I could write  my_sum <- 0 for (i in 10) { my_sum <- my_sum + i } my_sum # 11  That's not right at all, so I use the cat function in the loop to print the current value assigned to i:  my_sum <- 0 for (i in 10) { cat("i = ", i, "\n") my_sum <- my_sum + i }  When the loop executes, it will print "i = 10" and that is it. It _should_ print "i = 1", "i = 2", "i = 3" and so on. This reality check tells me that something is not going right with the loop because i is not taking the right values. Looking closer at my syntax, I see that I did not declare the for statement correctly. Instead of telling i to take values 1 through 10, I told it to take the value of 10. That's it. So I need to change the first line of my loop from  for (i in 10) {  to  for (i in 1:10) {  And this will now sum the integers from 1 through 10 (and print out the "i = " statements as expected). Once the loop works as expected, I recommend commenting out (or deleting) the cat function to cut down on the information printed to the screen. Just remember, when writing loops, reality checks are your friends. ### 2020-08-04 #### _R can be slow for large data or compute? How can I speed it up?_ + Parallel Computing in R: https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html + Using the power of vectorization in R (and how R is working in the background): https://adv-r.hadley.nz/perf-improve.html + Writing C++ code in R: https://adv-r.hadley.nz/rcpp.html + useR! 2020 Tutorial - Seamless R and C++ Integration https://vimeo.com/438283959 + Faster file loading: vroom package https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/ + For large data, consider the data.table package https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html + And if you are used to dplyr manipulations, here's a nice side-by-side syntax comparison between data.table and dplyr: https://atrebas.github.io/post/2019-03-03-datatable-dplyr/ #### _How can I combine more than two data frames?_ The join functions in dplyr would allow one to do this:  final_data <- data1 %>% full_join(data2, by = "id") %>% full_join(data3, by = "id")  To decide which of the four joins to use (left, right, inner, or full), there is a great explanation with visual examples at https://r4ds.had.co.nz/relational-data.html#understanding-joins. The RStudio Cheat sheet for data wrangling also provides a succinct overview of how to use the join functions: https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf. #### _This data science career...any advice?_ + https://www.sushmitagopalan.com/post/navigating-data-science-jobs/navigating-data-science-careers/ + http://veekaybee.github.io/2019/02/13/data-science-is-different/ #### _Miscellaneous Debris_ + Pipelines in tidymodels https://mdneuzerling.com/post/machine-learning-pipelines-with-tidymodels-and-targets/ + Column names that require those unsightly backticks or imported dates stored as funky serial numbers? Check out the janitor package + https://www.rdocumentation.org/packages/janitor/versions/2.0.1 + https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html