# R user group meeting #12

###### tags: `R` `factor()`

### Tip 1

* **Name of the tip**: Sorting character columns in a user-defined order
* **Contributor**: Lun-Hsien Chang
* **Problem to solve**: You want to sort one or more character columns in an order that you define yourself
* **Solution to the problem**: Convert the columns to sort to factors and specify the order of the levels in `factor()`
* **Limitations of the code**: You may need to convert the sorted columns from factor back to character afterwards

```r!
# Create data, assuming the first column has been sorted alphabetically
drug.use.stages <- data.frame(
  stage = c("Addiction", "Experimentation", "Regular use", "Relapse"),
  count = c(20, 40, 30, 10),
  stringsAsFactors = FALSE
)

# Custom sort order for the drug use stages
stages <- c("Experimentation", "Regular use", "Addiction", "Relapse")

# Change the column to sort from character to factor
drug.use.stages$stage <- factor(drug.use.stages$stage, levels = stages)

# Sort the data by the order of the stages vector
drug.use.stages.sorted <- drug.use.stages[with(drug.use.stages, order(stage)), ]

# View sorted data
View(drug.use.stages.sorted)
```

---

### Tip 2

* **Name of the tip**: Using passwords in R
* **Contributor**: Jeffrey Molendijk
* **Problem to solve**: To connect to a database management system (e.g. MySQL, PostgreSQL), R requires a password, which you don't want to type in your terminal directly.
* **Solution to the problem**: Use a function that prompts for your password, such as `getPass::getPass()` or `rstudioapi::askForPassword()`
* **Limitations of the code**: `getPass` works when knitting a document, whilst `rstudioapi::askForPassword("")` or `.rs.askForPassword("")` will not work.

```r!
# When I connect to PostgreSQL, R requires a password, which I don't want to type in the terminal directly.
# Instead I use the package getPass to keep my password secure (e.g. it won't be written to the history file).
# If I can use my own laptop I will do a quick live demonstration using the attached .Rmd file.
library(DBI) # provides dbConnect()

con <- dbConnect(
  RPostgres::Postgres(),
  user = "postgres",
  password = getPass::getPass("Enter database password"),
  dbname = "AUSNUT"
)
```

---

### Tip 3

* **Name of the tip**: Defining custom reference levels in an explanatory variable in a regression model
* **Contributor**: Dilki Jayasinghe
* **Problem to solve**:
* **Solution to the problem**:
* **Limitations of the code**:

```r!
library(MASS) # provides glm.nb()

# relevel() comes from the stats package (not MASS) and only works on factors.
# `ref` must match an existing level exactly, including capitalisation.
combined_data$family_moles <- relevel(combined_data$family_moles, ref = "No")
combined_data$occupational_exposure <- relevel(combined_data$occupational_exposure, ref = "Mainly Indoors")
combined_data$leisure_exposure <- relevel(combined_data$leisure_exposure, ref = "Mainly indoors")

# Negative binomial model; the releveled factors keep their new reference levels
negbinmod <- glm.nb(
  freq ~ as.factor(age_50cat) + as.factor(sex) + as.factor(innate_skin_colour) +
    as.factor(hair_colour_cat) + as.factor(eye_colour_cat) + as.factor(burns_score20_cat) +
    as.factor(occupational_exposure) + as.factor(leisure_exposure) +
    as.factor(family_moles) + as.factor(family_history_melanoma),
  data = combined_data
)
summary(negbinmod)
```

---

### Tip 4

* **Name of the tip**: Plotting mean and standard error
* **Contributor**: Dwan Vilcins
* **Problem to solve**: Create quick plots to explore trends across time or categories
* **Solution to the problem**: Using ggplot's built-in stat functions allows for quick plotting without summarising the data first
* **Limitations of the code**: Requires one categorical and one continuous variable; some limitations in adding other layers to the plot

```r!
# Load packages and data
install.packages("tidyverse") # only needed once
library(tidyverse)
data(mtcars)

# Create factors
mtcars2 <- within(mtcars, {
  cyl <- factor(cyl, labels = c("4", "6", "8"))
  vs <- factor(vs, labels = c("V", "S"))
  am <- factor(am, labels = c("automatic", "manual"))
})
glimpse(mtcars2)

# Mean and standard error plots
# (`fun` replaces the deprecated `fun.y` argument in ggplot2 >= 3.3.0)
mean_se_plot <- mtcars2 %>%
  ggplot(aes(cyl, mpg)) +
  geom_point(stat = "summary", fun = "mean") +
  geom_errorbar(stat = "summary", fun.data = "mean_se", width = 0.1) +
  ggtitle("Mean miles per gallon by number of cylinders") +
  labs(y = "Miles per gallon", x = "Number of cylinders") +
  theme_bw()
mean_se_plot
```

---

### Tip 5

* **Name of the tip**: Creating an effects plot of a model object
* **Contributor**: Dwan Vilcins
* **Problem to solve**: You may need to visualise the relationship between outcome and predictor from a model, especially for complex relationships like interactions and smoothed variables
* **Solution to the problem**: The effects package offers an easy way to plot many model objects, with flexible options
* **Limitations of the code**: The effects package requires R version 3.5 or higher

```r!
# Updating R in RGui (if your R version < 3.5); installr is Windows only
install.packages("installr")
library(installr)
updateR()

# Load package
install.packages("effects")
library(effects)

# Create a model object (mtcars2 is the data frame created in Tip 4)
mod1 <- lm(mpg ~ cyl + disp, data = mtcars2)
summary(mod1)

# Plot the effects
plot(allEffects(mod1), rug = FALSE, main = FALSE, ylab = "Miles per gallon", xlab = "Displacement")
```

---

### Tip 6

* **Name of the tip**: Making bar plots with ggplot2?
* **Contributor**: Stéphane Guillou
* **Problem to solve**:
* **Solution to the problem**:
* **Limitations of the code**:

```r!
install.packages("ggplot2")
# how to deal with long categories in plots?
library(ggplot2)

# Check the structure of the data
msleep
str(msleep)

# keep only long names from msleep
df <- msleep[nchar(msleep$name) > 22, ]

# base plot
p <- ggplot(df, aes(x = name, y = sleep_total)) + geom_col()
p
# names overlap on the plot!

# a few options:
# 1. flip coordinates
p + coord_flip()
# 2. abbreviate
p + scale_x_discrete(labels = abbreviate)
# 3. keep a substring (first 10 characters)
p + scale_x_discrete(labels = function(x) substr(x, 1, 10))
# 4. truncate (with ellipsis)
p + scale_x_discrete(labels = function(x) stringr::str_trunc(x, 12))
```

---

### Tip 7

* **Name of the tip**: Tidying messy outputs from statistical functions
* **Contributor**: Rebecca Johnston
* **Problem to solve**: When you create a model using built-in functions such as `lm` and `t.test`, or using popular packages such as `survival` and `glmnet`, the output is a total pain to summarise and manipulate, especially when you need to perform multiple tests and/or combine multiple models.
* **Solution to the problem**: Use the R package `broom` together with `tidyverse` to summarise your results in a tidy data frame! Note that `broom` has three key verbs, `tidy`, `glance` and `augment`, but I have only used `tidy` below. The results can then be used downstream by other tidy tools like `dplyr` or visualised using `ggplot2`
* **Limitations of the code**: There are models that `broom` cannot yet clean!

```r!
# Load required libraries
install.packages("tidyverse")
library("tidyverse")
library("broom")

# Load example data
data(iris)

# Create linear model for sepal length and sepal width for ALL species
# Using base R approach, use summary to obtain all results:
summary(lm(Sepal.Length ~ Sepal.Width, iris))

# Using tidyverse + broom approach:
iris %>%
  do(tidy(lm(Sepal.Length ~ Sepal.Width, .)))

# Create linear model for sepal length and sepal width PER species
# Using base R approach:
summary(lm(Sepal.Length ~ Sepal.Width, data = iris[which(iris$Species == "setosa"), ]))

# Or instead of summary, call coefficients variable directly
lm(Sepal.Length ~ Sepal.Width, data = iris[which(iris$Species == "versicolor"), ])$coefficients
lm(Sepal.Length ~ Sepal.Width, data = iris[which(iris$Species == "virginica"), ])$coefficients

# Using tidyverse + broom approach:
iris %>%
  group_by(Species) %>%
  do(tidy(lm(Sepal.Length ~ Sepal.Width, .)))
```

---

### Tip 8

* **Name of the tip**: Displaying the structure of ANY R object using str()
* **Contributor**: Ahmed Mohamed
* **Problem to solve**:
* **Solution to the problem**:
* **Limitations of the code**:

```r!
str(mtcars)

l <- list(a = 1:10, b = 5, c = mtcars)
str(l)

str(Titanic)
```

---

### Tip 9

* **Name of the tip**: Modifying rownames using pipe
* **Contributor**: Ahmed Mohamed
* **Problem to solve**:
* **Solution to the problem**:
* **Limitations of the code**:

```r!
library(dplyr) # provides %>% and mutate()

# using dplyr can remove the rownames (depending on the verb and the dplyr version)
df <- mtcars %>%
  mutate(highgear = gear > 4)
df

# We need to re-assign rownames
rownames(df) <- rownames(mtcars)

# Since every operation in R is a function call,
# this statement is equivalent to:
df <- `rownames<-`(df, rownames(mtcars))

# We can rewrite our pipe using
# the `rownames<-` function
mtcars %>%
  mutate(highgear = gear > 4) %>%
  `rownames<-`(rownames(mtcars))

# This can be used with similar assignment functions
c(1:10) %>% `names<-`(LETTERS[1:10])
```

---

### Tip 10

* **Name of the tip**: Efficient interval-based joins using data.table
* **Contributor**: Ahmed Mohamed
* **Problem to solve**:
* **Solution to the problem**:
* **Limitations of the code**:

```r!
install.packages("data.table")
library(data.table)

## simple example:
x = data.table(start = c(5, 31, 22, 16), end = c(8, 50, 25, 18), val2 = 7:10)
y = data.table(start = c(10, 20, 30), end = c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)

foverlaps(x, y, type = "any", which = TRUE) ## return overlap indices
foverlaps(x, y, type = "any") ## return overlap join
foverlaps(x, y, type = "any", mult = "first") ## returns only first match
foverlaps(x, y, type = "within") ## matches iff 'x' is within 'y'
```

---

### Tip 11

* **Name of the tip**:
* **Contributor**: Muhammad Khan
* **Problem to solve**: Here is a tip for visualising big datasets using the `sparklyr`, `SparkR` and `ggplot2` packages in R. Small datasets can be visualised with a scatter plot, but for a bigger dataset a plain scatter plot does not work.
* **Solution to the problem**:
* **Limitations of the code**:

```r!
install.packages("sparklyr")
library(sparklyr)
library(ggplot2)
install.packages("SparkR")
library(SparkR)

# `mydata` is not defined in this tip; it is assumed to be an existing
# Spark data set with columns X and Y
ggplot(collect(mydata), aes(X, Y)) +
  geom_jitter(size = 0.3, alpha = 0.5) +
  geom_smooth()
```

---

### Tip 12

* **Name of the tip**: Merging more than 2 data sets that have same-named merging key columns
* **Contributor**: Lun-Hsien Chang
* **Problem to solve**: You have multiple data sets to merge, but you don't want to merge just 2 data sets at a time
* **Solution to the problem**: Use `purrr::reduce()` and `dplyr::left_join()` jointly
* **Limitations of the code**: Expect slowness when merging large data sets

```r!
# Load packages first: tibble(), %>%, purrr and dplyr all come with the tidyverse
library(tidyverse)

# Create 4 sample data sets (tibble() replaces the deprecated data_frame())
data.1 <- tibble(i = c("a", "b", "c", "d", "e"), col.1 = 1:5)
data.2 <- tibble(i = c("b", "c", "d", "e", "f"), col.2 = 3:7)
data.3 <- tibble(i = c("c", "d", "e", "f", "g"), col.3 = 5:9)
data.4 <- tibble(i = c("d", "e", "f", "g", "h"), col.4 = LETTERS[7:11])

# Left join the 4 data sets
left.join <- list(data.1, data.2, data.3, data.4) %>%
  purrr::reduce(dplyr::left_join, by = "i")
# dim(left.join) 5 5

# Inner join the 4 data sets
inner.join <- list(data.1, data.2, data.3, data.4) %>%
  purrr::reduce(dplyr::inner_join, by = "i")
# dim(inner.join) 2 5
```

---
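
The merge in Tip 12 can also be sketched in base R, for cases where the tidyverse is not available. This is an alternative under stated assumptions, not the contributor's method: it swaps `purrr::reduce()` and the dplyr joins for `Reduce()` and `merge()`. The helper `merge_all_by_i()` and the `base.*` object names are hypothetical, introduced only for illustration.

```r
# Base R sketch of the Tip 12 merge (an alternative, not the contributor's code)

# Recreate the 4 sample data sets as plain data frames
data.1 <- data.frame(i = c("a", "b", "c", "d", "e"), col.1 = 1:5, stringsAsFactors = FALSE)
data.2 <- data.frame(i = c("b", "c", "d", "e", "f"), col.2 = 3:7, stringsAsFactors = FALSE)
data.3 <- data.frame(i = c("c", "d", "e", "f", "g"), col.3 = 5:9, stringsAsFactors = FALSE)
data.4 <- data.frame(i = c("d", "e", "f", "g", "h"), col.4 = LETTERS[7:11], stringsAsFactors = FALSE)
data.list <- list(data.1, data.2, data.3, data.4)

# Hypothetical helper: merge a list of data frames pairwise on the key column "i".
# keep.all.left = TRUE behaves like a left join, FALSE like an inner join.
merge_all_by_i <- function(data.list, keep.all.left) {
  Reduce(function(x, y) merge(x, y, by = "i", all.x = keep.all.left), data.list)
}

base.left.join <- merge_all_by_i(data.list, keep.all.left = TRUE)
base.inner.join <- merge_all_by_i(data.list, keep.all.left = FALSE)

dim(base.left.join)
dim(base.inner.join)
```

Unlike the dplyr joins, `merge()` sorts the result by the key column by default, so the row order can differ from the tidyverse version even when the contents match.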