###### tags: `R` `factor()`
# R user group meeting #12
### Tip 1
* **Name of the tip**: Sorting character columns in user-defined order
* **Contributor**: Lun-Hsien Chang
* **Problem to solve**: You want to sort one or more character columns in an order you define yourself, rather than alphabetically
* **Solution to the problem**: Convert the columns to sort into factors and specify the desired order through the `levels` argument of `factor()`
* **Limitations of the code**: The sorted columns are now factors, so you may need to convert them back to character afterwards
```r!
# Create data assuming the first column has been sorted alphabetically
drug.use.stages <- data.frame(stage = c("Addiction", "Experimentation", "Regular use", "Relapse"),
                              count = c(20, 40, 30, 10),
                              stringsAsFactors = FALSE)
# Custom sort data by drug stages
stages <- c("Experimentation","Regular use","Addiction","Relapse")
# Change the column to sort from character to factor
drug.use.stages$stage <- factor(drug.use.stages$stage, levels = stages)
# Sort the data by the stages vector
drug.use.stages.sorted <- drug.use.stages[with(drug.use.stages, order(stage)),]
# View sorted data
View(drug.use.stages.sorted)
```
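The limitation noted for this tip (the sorted column ends up as a factor) can be handled in two ways. The sketch below rebuilds the example data so it runs on its own; the names `stages.df`, `stage.order`, `sorted` and `sorted.match` are made up for illustration.

```r!
# Example data, rebuilt here so the block is self-contained
stages.df <- data.frame(stage = c("Addiction", "Experimentation", "Regular use", "Relapse"),
                        count = c(20, 40, 30, 10),
                        stringsAsFactors = FALSE)
stage.order <- c("Experimentation", "Regular use", "Addiction", "Relapse")

# Route 1: convert to factor, sort, then convert back to character
sorted <- stages.df
sorted$stage <- factor(sorted$stage, levels = stage.order)
sorted <- sorted[order(sorted$stage), ]
sorted$stage <- as.character(sorted$stage)

# Route 2: order by each value's position in stage.order;
# the column stays character throughout
sorted.match <- stages.df[order(match(stages.df$stage, stage.order)), ]
```

Route 2 avoids the type change altogether, while Route 1 is handy when the factor is also needed elsewhere (e.g. for plotting in the same order).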
---
### Tip 2
* **Name of the tip**: Using passwords in R
* **Contributor**: Jeffrey Molendijk
* **Problem to solve**: Connecting to a database management system (e.g. MySQL, PostgreSQL) from R requires a password, which you don't want to type in plain text in your script or terminal.
* **Solution to the problem**: Use a function that prompts for your password, such as `getPass::getPass()` or `rstudioapi::askForPassword()`
* **Limitations of the code**: `getPass::getPass()` works when knitting a document, whereas `rstudioapi::askForPassword("")` and `.rs.askForPassword("")` do not.
```r!
# Connecting to PostgreSQL requires a password, which you don't want to type
# directly into your script or terminal. getPass prompts for it instead, which
# keeps the password secure (e.g. it won't be written to the history file).
# If I can use my own laptop I will do a quick live demonstration using the attached .Rmd file.
con <- DBI::dbConnect(RPostgres::Postgres(),
                      user = "postgres",
                      password = getPass::getPass("Enter database password"),
                      dbname = "AUSNUT")
```
---
### Tip 3
* **Name of the tip**: Defining custom reference levels in an explanatory variable in a regression model
* **Contributor**: Dilki Jayasinghe
* **Problem to solve**:
* **Solution to the problem**:
* **Limitations of the code**:
```r!
library(MASS) # provides glm.nb()
# relevel() is in the stats package (not MASS); it moves `ref` to the first level
combined_data$family_moles <- relevel(combined_data$family_moles, ref = "No")
combined_data$occupational_exposure <- relevel(combined_data$occupational_exposure, ref = "Mainly Indoors")
combined_data$leisure_exposure <- relevel(combined_data$leisure_exposure, ref = "Mainly indoors")
# Negative binomial model; variables are looked up in `data`, so no combined_data$ prefix is needed
negbinmod <- glm.nb(freq ~ as.factor(age_50cat) + as.factor(sex) + as.factor(innate_skin_colour) +
                      as.factor(hair_colour_cat) + as.factor(eye_colour_cat) + as.factor(burns_score20_cat) +
                      as.factor(occupational_exposure) + as.factor(leisure_exposure) +
                      as.factor(family_moles) + as.factor(family_history_melanoma), data = combined_data)
summary(negbinmod)
```
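Since `combined_data` is not available here, the same idea can be tried on the `quine` data set that ships with `MASS`. This is only a sketch under that substitution; `quine2` and `fit` are illustrative names.

```r!
library(MASS) # glm.nb() and the quine data set

quine2 <- quine
levels(quine2$Eth)  # "A" "N": the first level, "A", is the default reference

# Make "N" the reference level instead
quine2$Eth <- relevel(quine2$Eth, ref = "N")

# Coefficients are now reported relative to Eth == "N"
fit <- glm.nb(Days ~ Eth + Sex, data = quine2)
names(coef(fit))    # "(Intercept)" "EthA" "SexM"
```

The coefficient name `EthA` confirms that the comparison is now "A versus N" rather than the default "N versus A".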
---
### Tip 4
* **Name of the tip**: Plotting mean and standard error
* **Contributor**: Dwan Vilcins
* **Problem to solve**: Create quick plots to explore trends across time or categories
* **Solution to the problem**: Using ggplot2's built-in stat functions allows for quick plotting without summarising the data first
* **Limitations of the code**: Requires a categorical and continuous variable; some limitations in adding other layers to the plot
```r!
# Load packages and data
install.packages("tidyverse")
library(tidyverse)
data(mtcars)
# Create factors
mtcars2 <- within(mtcars, {
cyl <- factor(cyl, labels = c("4", "6", "8"))
vs <- factor(vs, labels = c("V", "S"))
am <- factor(am, labels = c("automatic", "manual"))})
glimpse(mtcars2)
# Mean and standard error plots
mean_se_plot <- mtcars2 %>%
ggplot(aes(cyl, mpg)) +
geom_point(stat = "summary", fun = "mean") + # `fun` replaces `fun.y`, deprecated since ggplot2 3.3.0
geom_errorbar(stat = "summary", fun.data = "mean_se", width = 0.1) +
ggtitle("Mean miles per gallon by Number of cylinders") +
labs(y = "Miles per gallon", x = "Number of cylinders") +
theme_bw()
mean_se_plot
```
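For reference, the "summarise first" step that the stat functions save you from can be sketched in base R. `mean_se` reduces each group to its mean plus and minus one standard error (standard deviation divided by the square root of the group size); the helper name `mean_se_by_group` and the object `cyl_summary` are made up for this illustration.

```r!
# Mean and mean +/- 1 standard error of `values` within each level of `groups`
mean_se_by_group <- function(values, groups) {
  do.call(rbind, lapply(split(values, groups), function(x) {
    se <- sd(x) / sqrt(length(x))
    data.frame(y = mean(x), ymin = mean(x) - se, ymax = mean(x) + se)
  }))
}

# One row per number of cylinders; row names are the group labels
cyl_summary <- mean_se_by_group(mtcars$mpg, mtcars$cyl)
cyl_summary
```

These are the values that the point and error bar layers draw for each cylinder group.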
---
### Tip 5
* **Name of the tip**: Creating effects plot of a model object
* **Contributor**: Dwan Vilcins
* **Problem to solve**: You may need to visualise the relationship between outcome and predictor from a model, especially for complex relationships like interactions and smoothed variables
* **Solution to the problem**: The effects package offers an easy way to plot many model objects, with flexible options
* **Limitations of the code**: The `effects` package requires R version 3.5 or higher
```r!
# Updating R in RGui (if your R version < 3.5); note that installr works on Windows only
install.packages("installr")
library(installr)
updateR()
# load package
install.packages("effects")
library(effects)
# Create a model object (mtcars2 is the data frame with factors created in Tip 4)
mod1 <- lm(mpg ~ cyl + disp, data = mtcars2)
summary(mod1)
# Plot the effects
plot(allEffects(mod1),
rug = FALSE,
main = FALSE,
ylab = "Miles per gallon",
xlab = "Displacement")
```
---
### Tip 6
* **Name of the tip**: Making bar plots with ggplot2?
* **Contributor**: Stéphane Guillou
* **Problem to solve**:
* **Solution to the problem**:
* **Limitations of the code**:
```r!
install.packages("ggplot2")
# how to deal with long categories in plots?
library(ggplot2)
# Check the structure of the data msleep
str(msleep)
# keep only long names from msleep
df <- msleep[nchar(msleep$name) > 22,]
# base plot
p <- ggplot(df, aes(x = name, y = sleep_total)) +
geom_col()
p # names overlap on the plot!
# a few options:
# 1. flip coordinates
p + coord_flip()
# 2. abbreviate (the argument is spelled `labels`)
p + scale_x_discrete(labels = abbreviate)
# 3. take a substring
p + scale_x_discrete(labels = function(x) substr(x, 1, 10))
# 4. truncate (with ellipsis)
p + scale_x_discrete(labels = function(x) stringr::str_trunc(x, 12))
```
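The label functions passed to `scale_x_discrete()` are ordinary functions on character vectors, so they can be tried out without drawing a plot. The sketch below uses two example names and also adds a wrapping helper, `wrap_label`, which is not part of the original tip and is only a suggestion.

```r!
long_names <- c("Thirteen-lined ground squirrel", "Golden-mantled ground squirrel")

# What options 2 and 3 do to the labels
abbreviate(long_names)
substr(long_names, 1, 10)

# A further option (an assumption, not from the tip): wrap labels over several lines
wrap_label <- function(x, width = 15) {
  vapply(x, function(s) paste(strwrap(s, width = width), collapse = "\n"),
         character(1), USE.NAMES = FALSE)
}
cat(wrap_label(long_names), sep = "\n\n")
```

If the wrapped labels look right, the helper could be used in the same way as the others, e.g. `p + scale_x_discrete(labels = wrap_label)`.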
---
### Tip 7
* **Name of the tip**: Tidying messy outputs from statistical functions
* **Contributor**: Rebecca Johnston
* **Problem to solve**: When you create a model using built-in functions such as `lm` and `t.test`, or using popular packages such as `survival` and `glmnet`, the output is a total pain to summarise and manipulate, especially when you need to perform multiple tests and/or combine multiple models.
* **Solution to the problem**: Use the R package `broom` together with `tidyverse` to summarise your results in a tidy data frame! Note that `broom` uses three key verbs, `tidy`, `glance` and `augment`, but I have only used `tidy` below. The results can then be used downstream by other tidy tools like `dplyr` or visualized using `ggplot2`
* **Limitations of the code**: `broom` cannot yet tidy every type of model object
```r!
# Load required libraries
install.packages("tidyverse")
library("tidyverse")
library("broom")
# Load example data
data(iris)
# Create linear model for sepal length and sepal width for ALL species
# Using base R approach, use summary to obtain all results:
summary(lm(Sepal.Length ~ Sepal.Width, iris))
# Using tidyverse + broom approach:
iris %>%
do(tidy(lm(Sepal.Length ~ Sepal.Width, .)))
# Create linear model for sepal length and sepal width PER species
# Using base R approach:
summary(lm(Sepal.Length ~ Sepal.Width,
data = iris[which(iris$Species == "setosa"), ]))
# Or instead of summary, call coefficients variable directly
lm(Sepal.Length ~ Sepal.Width,
data = iris[which(iris$Species == "versicolor"), ])$coefficients
lm(Sepal.Length ~ Sepal.Width,
data = iris[which(iris$Species == "virginica"), ])$coefficients
# Using tidyverse + broom approach:
iris %>%
group_by(Species) %>%
do(tidy(lm(Sepal.Length ~ Sepal.Width, .)))
```
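The other two `broom` verbs mentioned in this tip work on the same model object. A minimal sketch, assuming `broom` is installed; `fit`, `model_summary` and `per_observation` are illustrative names.

```r!
library(broom)

fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# glance(): one row of model-level statistics (r.squared, AIC, ...)
model_summary <- glance(fit)
model_summary

# augment(): one row per observation, with fitted values and residuals added
per_observation <- augment(fit)
head(per_observation)
```

Roughly speaking, `tidy` describes the coefficients, `glance` the model as a whole, and `augment` the individual observations.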
---
### Tip 8
* **Name of the tip**: Displaying the structure of ANY R object using str()
* **Contributor**: Ahmed Mohamed
* **Problem to solve**:
* **Solution to the problem**:
* **Limitations of the code**:
```r!
str(mtcars)
l <- list(a=1:10, b=5, c=mtcars)
str(l)
str(Titanic)
```
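For large or deeply nested objects the full `str()` output can get long. A short sketch of some arguments that trim it (`max.level`, `list.len` and `give.attr` are documented arguments of `str()`); the objects `top_level` and `full` exist only to compare the amount of output.

```r!
l <- list(a = 1:10, b = 5, c = mtcars)

# Show only the top level of a nested object
str(l, max.level = 1)

# Show only the first few elements (here: columns) of a long list
str(mtcars, list.len = 3)

# Hide attributes in the output
str(l, give.attr = FALSE)

# The trimmed output is shorter than the full one
top_level <- capture.output(str(l, max.level = 1))
full <- capture.output(str(l))
c(trimmed = length(top_level), full = length(full))
```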
---
### Tip 9
* **Name of the tip**: Modifying rownames using pipe
* **Contributor**: Ahmed Mohamed
* **Problem to solve**:
* **Solution to the problem**:
* **Limitations of the code**:
```r!
library(dplyr)
# Depending on the dplyr version, mutate() may drop the rownames
df <- mtcars %>% mutate(highgear = gear > 4)
df
# We need to re-assign rownames
rownames(df) <- rownames(mtcars)
# Since everything in R is a function
# this statement is equivalent to:
`rownames<-`(df, rownames(mtcars))
# We can rewrite our pipe using
# `rownames<-` function
mtcars %>% mutate(highgear = gear > 4) %>%
`rownames<-`(rownames(mtcars))
# This can be used with similar assignment functions
c(1:10) %>% `names<-`(LETTERS[1:10])
```
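One detail worth checking is that calling a replacement function such as `` `rownames<-` `` directly returns a modified copy and does not change the original object, which is exactly why it fits at the end of a pipe. A base R sketch; `original_names`, `renamed` and `named` are illustrative names.

```r!
df <- head(mtcars, 3)
original_names <- rownames(df)

# Function form: returns a new data frame with the new rownames ...
renamed <- `rownames<-`(df, c("car1", "car2", "car3"))
rownames(renamed)

# ... while df itself is unchanged (unlike `rownames(df) <- value`)
rownames(df)

# For names there is a ready-made wrapper in base R: stats::setNames()
named <- setNames(1:3, c("a", "b", "c"))
named
```

`setNames()` can be dropped into a pipe in the same way as `` `names<-` ``, e.g. `1:3 %>% setNames(c("a", "b", "c"))`.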
---
### Tip 10
* **Name of the tip**: Efficient interval-based joins using data.table
* **Contributor**: Ahmed Mohamed
* **Problem to solve**:
* **Solution to the problem**:
* **Limitations of the code**:
```r!
install.packages("data.table")
library(data.table)
## simple example:
x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10)
y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)
foverlaps(x, y, type="any", which=TRUE) ## return overlap indices
foverlaps(x, y, type="any") ## return overlap join
foverlaps(x, y, type="any", mult="first") ## returns only first match
foverlaps(x, y, type="within") ## matches iff 'x' is within 'y'
```
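To see what the `type` argument means, the overlap rules can be written out in base R for the same intervals. This is for illustration only: it compares every pair of intervals and will not scale the way `foverlaps()` does. The matrix names `any_overlap` and `within_overlap` are made up.

```r!
x <- data.frame(start = c(5, 31, 22, 16), end = c(8, 50, 25, 18))
y <- data.frame(start = c(10, 20, 30), end = c(15, 35, 45))

# Rows index intervals in x, columns index intervals in y.
# type = "any": the two (closed) intervals share at least one point
any_overlap <- outer(x$start, y$end, "<=") & outer(x$end, y$start, ">=")

# type = "within": the x interval lies entirely inside the y interval
within_overlap <- outer(x$start, y$start, ">=") & outer(x$end, y$end, "<=")

# (row, col) pairs of matching intervals
which(any_overlap, arr.ind = TRUE)
which(within_overlap, arr.ind = TRUE)
```

Here the second `x` interval (31 to 50) overlaps the second and third `y` intervals, and the third `x` interval (22 to 25) lies within the second `y` interval (20 to 35).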
---
### Tip 11
* **Name of the tip**:
* **Contributor**: Muhammad Khan
* **Problem to solve**: Small datasets can be visualized with a scatter plot, but this does not work well for bigger datasets. This tip visualizes a big dataset using the `sparklyr`, `SparkR` and `ggplot2` packages in R.
* **Solution to the problem**:
* **Limitations of the code**:
```r!
install.packages("sparklyr")
library(sparklyr)
library(ggplot2)
install.packages("SparkR")
library(SparkR)
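# `mydata` is assumed to be an existing Spark DataFrame with columns X and Y;
# collect() brings it into local R memory before plotting
ggplot(collect(mydata), aes(X, Y)) +
  geom_jitter(size = 0.3, alpha = 0.5) +
  geom_smooth()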
```
---
### Tip 12
* **Name of the tip**: Merge more than 2 data sets that have same-named merging key columns
* **Contributor**: Lun-Hsien Chang
* **Problem to solve**: You have multiple data sets to merge but you don't want to merge just 2 data sets at a time
* **Solution to the problem**: Use purrr::reduce() and dplyr::left_join jointly
* **Limitations of the code**: Expect slowness in merging large data sets
```r!
# Load packages first so that tibble() and %>% are available
library(tidyverse)
# Create 4 sample data sets (tibble() replaces the deprecated data_frame())
data.1 <- tibble(i = c("a","b","c","d","e"), col.1 = 1:5)
data.2 <- tibble(i = c("b","c","d","e","f"), col.2 = 3:7)
data.3 <- tibble(i = c("c","d","e","f","g"), col.3 = 5:9)
data.4 <- tibble(i = c("d","e","f","g","h"), col.4 = LETTERS[7:11])
# Left join the 4 data sets
left.join <- list(data.1, data.2, data.3, data.4) %>%
purrr::reduce(dplyr::left_join, by = "i") # dim(left.join) 5 5
# Inner join the 4 data sets
inner.join <- list(data.1, data.2, data.3, data.4) %>%
  purrr::reduce(dplyr::inner_join, by = "i") # dim(inner.join) 2 5
```
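The same pattern also works without any packages, using `Reduce()` and `merge()` from base R. A sketch with the same sample data as plain data frames; `all.data`, `left.join.base` and `inner.join.base` are illustrative names.

```r!
data.1 <- data.frame(i = c("a","b","c","d","e"), col.1 = 1:5)
data.2 <- data.frame(i = c("b","c","d","e","f"), col.2 = 3:7)
data.3 <- data.frame(i = c("c","d","e","f","g"), col.3 = 5:9)
data.4 <- data.frame(i = c("d","e","f","g","h"), col.4 = LETTERS[7:11])
all.data <- list(data.1, data.2, data.3, data.4)

# Left join: keep every row of the accumulated left-hand table (all.x = TRUE)
left.join.base <- Reduce(function(a, b) merge(a, b, by = "i", all.x = TRUE), all.data)

# Inner join: merge() keeps only keys present in both tables by default
inner.join.base <- Reduce(function(a, b) merge(a, b, by = "i"), all.data)

dim(left.join.base)   # 5 5
dim(inner.join.base)  # 2 5
```

Note that `merge()` sorts the result by the key column, whereas the `dplyr` joins keep the row order of the left-hand table.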
---