# 2020-Data-Carpentry-Social-Science
# workshop schedule:
https://ucsdlib.github.io/2020-02-04-UCSDsocsci/
## This workshop qualifies for the Co-Curricular Record!
## https://elt.ucsd.edu/ccr/index.html
If you'd like this workshop to show up on your co-curricular record, follow the instructions in the CCR portal, then email Stephanie (slabou@ucsd.edu) two items:
1) text file with the JSON extract from our lesson on OpenRefine
2) your R script from Wednesday's R lesson
---
## This HackMD: https://hackmd.io/@U2NG/ryazym6x8
## Instructors:
**Stephanie Labou (Library) - slabou@ucsd.edu
Reid Otsuji (Library) - rotsuji@ucsd.edu
Rick McCosh (OBGYN/Reprod Sci)
Justin Shaffer (Pediatrics)**
## Collaborative Notes
**These are the collaborative notes for the workshop.**
Please type your notes here!
## Sticky notes
Blue - "i need help"
Orange - "i'm good!"
### #######Sign-In Here######
### Please sign in:
**Full Name, Affiliation (Faculty, student, post-doc, staff), department or Lab**
Reid Otsuji, Librarian, Library
Rick McCosh, Postdoc, OB/GYN + Repro. Sci
Ryan Johnson, Librarian, Library
Stephanie Labou, Librarian, Library
Leonidas Mylonakis, Alumni, History
Veronica Hoyo, Staff, ACTRI-Health Sciences
Evie Xinqi Guo, Graduate Student, Experimental Psychology
Alexandre Gomide, Visiting Scholar, GPS-UCSD
Julian Borba, Visiting Scholar, CILAS-UCSD
Peipei Zhu, post-doc, Croker Lab, Pediatrics
Danielle Fritts, Graduate Student, Visiting
Rita Kuckertz, Visiting Graduate Student
Bryan Kehr, Library IT
# ########Day 1 - Notes###########
Lesson information:
https://datacarpentry.org/spreadsheets-socialsci/
### Workshop Data download:
https://ndownloader.figshare.com/articles/6262019/versions/4
https://datacarpentry.org/openrefine-socialsci/setup.html
### Information about the SAFI Teaching Database dataset: https://datacarpentry.org/socialsci-workshop/data/
---
### For Data Management Best Practices section
**Download and unzip on your desktop:**
https://librarycarpentry.org/lc-shell/data/shell-lesson.zip
**Git Bash for Windows setup:**
https://librarycarpentry.org/lc-shell/setup.html
## Data Organization in Spreadsheets
* Leave raw data raw
* keep a record of any cleaning steps you do
**Exercise: what's wrong with the spreadsheet?**
Problems identified:
* multiple tables on a single sheet
* typos
* missing values - different values used
* working on different sheets - need to combine data
* unknown numeric values
* colored cells
* KeyID issues
* Merged cells
* use of underscores
* no dates
* notes in cells
**How to clean up:**
Consistent values
Be explicit: cleaning up will create more values in the spreadsheet. That's OK; R can handle it.
R likes `NA` for missing values
R can handle column names of various lengths, but not names that contain spaces or start with a number (use an underscore `_` in place of a space)
**Common formatting problems:**
using multiple tables and/or tabs
not filling in zeros
confusing field names
inconsistent column names
**Excel stores dates as numbers**
`Tip for dates in Excel: separate dates into columns (year, month, day)`
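To illustrate why this matters: a date that Excel stores as a number can arrive in R as a plain number. The sketch below converts one back into a date. It assumes the Windows Excel 1900 date system (origin `1899-12-30`), and the serial number is made up for illustration.

```r
# Hypothetical serial number, as it might appear in an unformatted Excel cell
excel_serial <- 43865

# Convert to a Date, assuming the Windows Excel 1900 date system
workshop_date <- as.Date(excel_serial, origin = "1899-12-30")
print(workshop_date)
```

Files saved with a different date system would need a different origin, which is one more reason to store year, month, and day in separate columns.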
## OpenRefine Lesson
To import `.zip` or `.tar` projects, use the `import project` command
**Why use OpenRefine:**
Data is often very messy. OpenRefine provides a set of tools to allow you to identify and amend the messy data.
It is important to know what you did to your data. Additionally, journals, granting agencies, and other institutions are requiring documentation of the steps you took when working with your data. With OpenRefine, you can capture all actions applied to your raw data and share them with your publication as supplemental material.
All actions are easily reversed in OpenRefine.
If you save your work it will be to a new file. OpenRefine always uses a copy of your data and does not modify your original dataset.
Data cleaning steps often need repeating with multiple files. OpenRefine keeps track of all of your actions and allows them to be applied to different datasets.
Some concepts such as clustering algorithms are quite complex, but OpenRefine makes it easy to introduce them, use them, and show their power.
For more OpenRefine information:
Search Google for `openrefine libguide`
You can find out a lot more about OpenRefine at http://openrefine.org
More about GREL:
https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions
# Data Management Best practices
Data Sharing SNAFU youtube video:
https://www.youtube.com/watch?v=66oNv_DJuPc
ICPSR (social science data repository - great for finding data, can also deposit data): https://www.icpsr.umich.edu/icpsrweb/
UCSD's research data curation group in the library: https://library.ucsd.edu/research-and-collections/data-curation/
# R
R package for non-English language text mining: https://www.bnosac.be/index.php/blog/72-natural-language-processing-for-non-english-languages-with-udpipe
# Day 2 Sign-in
**Full Name, Affiliation (Faculty, student, post-doc, staff), department or Lab**
Reid Otsuji, Librarian, Library
Rick McCosh, Postdoc, OB/GYN + Repro. Sci.
Stephanie Labou, Librarian, Library
Veronica Hoyo, Staff, ACTRI-Health Sciences
Peipei Zhu, Postdoc, Pediatrics
Leonidas Mylonakis, Alumni, History
Alexandre Gomide, Visiting Scholar, GPS-UCSD
Danielle Fritts, Visiting Graduate Student
Justin Shaffer, Postdoc, Pediatrics
Bryan Kehr, Library IT
Rita Kuckertz, Visiting Graduate Student
Evie Xinqi Guo, Graduate Student, Experimental Psychology
# Useful links:
### R Workshop Walkthrough
https://datacarpentry.org/r-socialsci/
### Dataset download
https://ndownloader.figshare.com/articles/6262019/versions/4
### Dataset overview
https://datacarpentry.org/socialsci-workshop/data/
### logical operators in R
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html
### Hadley Wickham's Tidy Data paper
https://vita.had.co.nz/papers/tidy-data.pdf
### ggplot2 themes
http://docs.ggplot2.org/current/ggtheme.html
### ggplot2 cheat sheet
https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf
# Day 2 notes:
**Assigning variables:**
Assign variables using the `<-` assignment operator
**Be thoughtful about variable naming:**
* yes: letters, numbers, period, underscore
* no: spaces, dashes
* no: starting with a number or an underscore
**Variable name styles:**
* CamelCase
* Underscore_between_words
* periods.between.words
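A small sketch of the naming rules above. The variable names and values are made up; `make.names()` is a base R function that rewrites invalid names into valid ones, so comparing its output with the input shows which candidates R would reject as written.

```r
# Three valid styles for the same idea (values are illustrative)
household_size <- 7   # underscore between words
householdSize <- 7    # camel case
household.size <- 7   # periods between words

# Candidate names, some of which break the rules above
candidate_names <- c("household_size", "2nd_visit", "wall type", "_tmp")
fixed_names <- make.names(candidate_names)

# A name that make.names() had to change is not a valid name as written
data.frame(candidate_names, fixed_names,
           changed = candidate_names != fixed_names)
```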
### Functions
Examples:
```
sqrt(9) # square root function
b <- sqrt(2) # assign a function's result to a variable
round(3.1415)
args(round) # list the arguments a function accepts
?round # open the help page for a function
```
Function errors will display in the console.
### Vectors
Examples:
```
hh_members <- c(3,7,10, 6) # c() means concatenate and is used to create vectors
hh_wall_type <- c("muddaub","burntbricks", "sunbricks")
length(hh_members)
length(hh_wall_type)
class(hh_members)
class(hh_wall_type)
```
R data types:
* logical
* integer
* numeric
* complex
* character
Tip: R is indexed beginning at 1
Tip: remember R is case sensitive - be consistent
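A quick sketch of the data types and the two tips above, using made-up values. `class()` reports the type of each value.

```r
class(TRUE)       # logical
class(3L)         # integer (the L suffix asks for an integer)
class(3.5)        # numeric
class(2+1i)       # complex
class("muddaub")  # character

wall_types <- c("muddaub", "burntbricks", "sunbricks")
wall_types[1]     # the first element: indexing starts at 1, not 0

# Names are case sensitive: wall_types exists, Wall_Types does not
exists("wall_types")
exists("Wall_Types")
```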
#### adding to vectors
```
possessions <- c("bike","radio", "tv")
possessions <- c(possessions, "cellPhone") # add to the end of the vector
possessions
```
Data structures
* vector
* list
* matrix
* array
* data frame (tibble)
```
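num_char <- c(1,2,3,"a")
class(num_char) # "character": mixing types coerces every element to character
tricky <- c(1,2,3, "4") # also character, even though "4" looks like a number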
num <- c(1,2,3)
class(num)
int <- as.integer(num)
class(int)
# as.integer()
# as.character()
# as.numeric()
```
### subsetting
```
hh_wall_type
hh_wall_type[2]
hh_wall_type[c(1,3)]
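# hh_wall_type[1,3] # error: a vector has only one dimension; wrap the positions in c() as above
more_wall_types <- hh_wall_type[c(1,2,3,3,4,1)] # position 4 does not exist, so it returns NA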
more_wall_types
```
### conditional subsetting
```
hh_members
hh_members > 5
hh_members[hh_members >5]
possessions
possessions[ possessions == "tv" | possessions == "bike"] # search for one item at a time
possessions %in% c("car", "bicycle", "motorcycles", "boat") # search for multiple items at once
```
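A self-contained sketch comparing the two approaches above, with a made-up `possessions` vector and search list. Chained `==` tests joined by `|` and a single `%in%` test pick out the same elements.

```r
possessions <- c("bike", "radio", "tv", "cellPhone")
wanted <- c("tv", "bike", "car")

# One comparison per item, joined with | (OR)
by_or <- possessions[possessions == "tv" | possessions == "bike"]

# One %in% test against the whole search list
by_in <- possessions[possessions %in% wanted]

by_or
by_in
```

Writing `possessions == wanted` would not do the same thing: `==` compares the vectors position by position and recycles the shorter one, so `%in%` is the safer choice for "is this value in the list?".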
### missing data - NA
```
rooms <- c(2,1,1,NA,4)
mean(rooms) # NA, because one of the values is missing
max(rooms) # also NA
mean(rooms, na.rm = TRUE)
args(mean)
is.na(rooms)
rooms[!is.na(rooms)]
```
# working with a dataset
### installing packages
```
install.packages("tidyverse")
library(tidyverse)
```
### loading data
```
interviews <- read_csv("SAFI_clean.csv")
View(interviews)
```
```
head(interviews)
tail(interviews)
class(interviews)
str(interviews)
nrow(interviews)
ncol(interviews)
names(interviews)
summary(interviews)
```
# subsetting
```
interviews[1,6] # output as a tibble (data frame)
interviews[1]
interviews[[1]] #output as a vector
interviews[1:6, ] #all of the columns and 1 - 6 rows
interviews[4:7, ] # subset
interviews[ , -1]
interviews["village"]
interviews[["village"]]
interviews$village # use $ to access a column as a vector, e.g. dataFrameName$columnName
# village # error: the column name alone is not an object in the workspace
```
```
# factors
respondent_floor_type <- factor(c("earth","cement","cement","earth"))
class(respondent_floor_type)
levels(respondent_floor_type)
nlevels(respondent_floor_type)
unique(respondent_floor_type)
respondent_floor_type
as.character(respondent_floor_type)
memb_assoc <- interviews$memb_assoc
memb_assoc <- as.factor(memb_assoc)
memb_assoc
plot(memb_assoc)
levels(memb_assoc)
```
# Dates
R and R packages like dates in yyyy-mm-dd format
```
install.packages("lubridate") #if you need to install the package
library(lubridate)
dates <- interviews$interview_date
interviews$day <- day(dates)
interviews$month <- month(dates)
interviews$year <- year(dates)
View(interviews)
```
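For comparison, a base R sketch of the same idea that needs no extra package. The two dates are made up, and `format()` with `%d`, `%m`, and `%Y` stands in for lubridate's `day()`, `month()`, and `year()`.

```r
# Made-up interview dates in yyyy-mm-dd format
interview_dates <- as.Date(c("2016-11-17", "2017-04-02"))

# Pull each component out as an integer
day_of_month <- as.integer(format(interview_dates, "%d"))
month_number <- as.integer(format(interview_dates, "%m"))
year_number <- as.integer(format(interview_dates, "%Y"))

data.frame(interview_dates, day_of_month, month_number, year_number)
```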
# R - PART 2
#### dplyr and tidyr
```
select(interviews, village, no_membrs, years_liv)
filter(interviews, village == "God")
```
Option 1: intermediate steps
```
interviews2 <- filter(interviews, village == "God")
interviews2
interviews_god <- select(interviews2, no_membrs, years_liv)
interviews_god
```
Option 2: nested functions
```
interviews_god <- select(filter(interviews, village == "God"), no_membrs, years_liv)
```
Option 3: pipes
```
interviews_god <- interviews %>%
  filter(village == "God") %>%
  select(no_membrs, years_liv)
interviews_god
```
Exercise solution:
```
interviews %>%
filter(memb_assoc == "yes") %>%
select(affect_conflicts, liv_count, no_meals)
```
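The same three styles (intermediate objects, nested calls, pipes) can be sketched on a plain vector without loading any package. This assumes R 4.1 or later, where the native `|>` pipe is available; the workshop itself used the `%>%` pipe from the tidyverse, and the household sizes here are made up.

```r
household_sizes <- c(3, 7, 10, 6)

# Style 1: intermediate objects
large <- household_sizes[household_sizes > 5]
mean_large_1 <- mean(large)

# Style 2: nested calls, read from the inside out
mean_large_2 <- mean(Filter(function(x) x > 5, household_sizes))

# Style 3: pipes, read from left to right
mean_large_3 <- household_sizes |>
  Filter(f = function(x) x > 5) |>
  mean()

c(mean_large_1, mean_large_2, mean_large_3)
```

All three produce the same number; the pipe version simply lists the steps in the order they happen.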
# mutate
- creating a new column
```
new_variable <- interviews %>%
mutate(people_per_room = no_membrs / rooms)
View(new_variable)
```
```
interviews %>%
filter(!is.na(memb_assoc)) %>%
mutate(people_per_room = no_membrs / rooms)
```
Exercise solution:
```
interviews %>%
mutate(total_meals = no_membrs * no_meals) %>%
select(village, total_meals)
interviews_total_meals <- interviews %>%
mutate(total_meals = no_membrs * no_meals) %>%
filter(total_meals > 20) %>%
select(village, total_meals)
View(interviews_total_meals)
```
# group_by function
Average household size by village
```
interviews %>%
group_by(village) %>%
summarize(mean_no_membrs = mean(no_membrs))
```
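To show what `group_by()` followed by `summarize()` computes, here is a base R sketch of the same split-apply-combine idea on a tiny made-up table. The column names follow the SAFI data, but the values are invented.

```r
households <- data.frame(
  village = c("God", "God", "Chirodzo", "Chirodzo", "Ruaca"),
  no_membrs = c(3, 7, 10, 6, 4)
)

# Split: one vector of household sizes per village
by_village <- split(households$no_membrs, households$village)

# Apply and combine: take the mean of each group, collected into one named vector
mean_no_membrs <- sapply(by_village, mean)
mean_no_membrs
```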
```
interviews %>%
filter(!is.na(memb_assoc)) %>%
group_by(village, memb_assoc) %>%
summarize(mean_no_membrs = mean(no_membrs))
```
```
interviews %>%
filter(!is.na(memb_assoc)) %>%
group_by(village, memb_assoc) %>%
summarize(mean_no_membrs = mean(no_membrs), min_membrs = min(no_membrs))
```
```
interviews %>%
filter(!is.na(memb_assoc)) %>%
group_by(village, memb_assoc) %>%
summarize(mean_no_membrs = mean(no_membrs), min_membrs = min(no_membrs)) %>%
arrange(desc(min_membrs))
```
```
interviews %>%
count(village)
interviews %>%
count(village, sort = TRUE)
```
Exercise solution
```
interviews %>%
group_by(village) %>%
summarize(mean_no_membrs = mean(no_membrs),
min_no_membrs = min(no_membrs),
max_no_membrs = max(no_membrs),
n = n()
)
```
### spreading
```
interviews_spread <- interviews %>%
mutate(wall_type_logical = TRUE) %>%
spread(key = respondent_wall_type, value = wall_type_logical, fill = FALSE)
```
### gathering
```
interviews_gather <- interviews_spread %>%
gather(key = respondent_wall_type, value = "wall_type_logical",
burntbricks:sunbricks)%>%
filter(wall_type_logical) %>%
select(-wall_type_logical)
# each cell should hold only a single value
```
```
interviews_items_owned <- interviews %>%
separate_rows(items_owned, sep=";") %>%
mutate(items_owned_logical = TRUE) %>%
spread(key = items_owned, value = items_owned_logical, fill = FALSE)
View(interviews_items_owned)
nrow(interviews_items_owned)
```
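A base R sketch of what the `separate_rows()` and `spread()` steps achieve: cells that hold several values separated by `;` become one TRUE/FALSE column per item. The `items_owned` values below are made up, and this illustrates the idea rather than how tidyr implements it.

```r
# Made-up answers; NA means no items were listed
items_owned <- c("bicycle;television", "radio", NA, "bicycle;radio")

# Split each cell on ";" so every item stands alone
item_list <- strsplit(items_owned, ";")

# All distinct items, with NA dropped by sort()
all_items <- sort(unique(unlist(item_list)))

# One row per respondent, one logical column per item
owned <- t(sapply(item_list, function(x) all_items %in% x))
colnames(owned) <- all_items
owned

# Number of items per respondent
rowSums(owned)
```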
```
interviews_items_owned %>%
filter(bicycle) %>%
group_by(village) %>%
count(bicycle)
```
```
interviews_items_owned %>%
mutate(number_items = rowSums(select(., bicycle:television))) %>%
group_by(village) %>%
summarize(mean_items = mean(number_items))
```
Exercise Solution part 1:
```
interviews_months_lack_food <- interviews %>%
separate_rows(months_lack_food, sep=";") %>%
mutate(months_lack_food_logical = TRUE) %>%
spread(key = months_lack_food, value = months_lack_food_logical, fill = FALSE)
```
Exercise Solution part 2:
How many months (on average) were respondents without food if they did belong to an irrigation association? What about if they didn’t?
```
interviews_months_lack_food %>%
mutate(number_months = rowSums(select(., Apr:Sept))) %>%
group_by(memb_assoc) %>%
summarize(mean_months = mean(number_months))
```
```
# This is what to export for plotting:
interviews_plotting <- interviews %>%
## spread data by items_owned
separate_rows(items_owned, sep=";") %>%
mutate(items_owned_logical = TRUE) %>%
spread(key = items_owned, value = items_owned_logical, fill = FALSE) %>%
rename(no_listed_items = `<NA>`) %>%
## spread data by months_lack_food
separate_rows(months_lack_food, sep=";") %>%
mutate(months_lack_food_logical = TRUE) %>%
spread(key = months_lack_food, value = months_lack_food_logical, fill = FALSE) %>%
## add some summary columns
mutate(number_months_lack_food = rowSums(select(., Apr:Sept))) %>%
mutate(number_items = rowSums(select(., bicycle:television)))
```
```
# saving to .csv
write_csv(interviews_plotting, path = "data_output/interviews_plotting.csv")
```
# Plotting with ggplot2
```
library(tidyverse)
```
```
ggplot(data = interviews_plotting)
```
```
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items))
```
#### scatter plot
```
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) + geom_point()
```
Assign the ggplot to a variable:
```
interviews_plot <- ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items))
```
Now you can reuse the variable and swap in different `geom_` layers:
```
interviews_plot +
geom_point()
```
https://datacarpentry.org/r-socialsci/04-ggplot2/index.html
# Code from R for dplyr and ggplot sections
#######################################################################################################################
#######################################################################################################################
# 2020.02.05
# Data Carpentries - R for Social Scientists
# Justin Shaffer
# justinparkshaffer@gmail.com
#######################################################################################################################
#######################################################################################################################
# Set working directory
#######################################################################################################################
getwd()
setwd("~/Google-Drive-UCSD/R/2020_carpentries_social_sci/")
# Install and load libraries needed for analysis
#######################################################################################################################
install.packages("tidyverse")
install.packages("tidyr")
install.packages("ggplot2")
install.packages("plyr")
install.packages("dplyr")
library(tidyverse)
library(ggplot2)
# Read in dataset
#######################################################################################################################
interviews <- read_csv("data/SAFI_clean.csv", na = "NULL")
# Inspecting dataframes
#######################################################################################################################
# Viewing
interviews
View(interviews)
# Data types
class(interviews)
# Size
dim(interviews)
nrow(interviews)
ncol(interviews)
# Content
head(interviews)
tail(interviews)
# Names
names(interviews)
# Summary
str(interviews)
summary(interviews)
# Indexing and subsetting
#######################################################################################################################
## first element in the first column of the data frame (as a tibble)
interviews[1, 1]
## first element in the 6th column (as a tibble)
interviews[1, 6]
## first column of the data frame (as a vector)
interviews[[1]]
## first column of the data frame (as a data.frame)
interviews[1]
## first three elements in the 7th column (as a tibble)
interviews[1:3, 7]
## the 3rd row of the data frame (as a data.frame)
interviews[3, ]
## The whole data frame, except the first column
interviews[, -1]
## Equivalent to head(interviews)
interviews[-c(7:131), ]
interviews["village"] # Result is a data frame
interviews[, "village"] # Result is a data frame
interviews[["village"]] # Result is a vector
interviews$village # Result is a vector
# Exercise
#######################################################################################################################
## 1.
interviews_100 <- interviews[100, ]
## 2.
# Saving `n_rows` to improve readability and reduce duplication
n_rows <- nrow(interviews)
interviews_last <- interviews[n_rows, ]
## 3.
interviews_middle <- interviews[median(1:n_rows), ]
## 4.
interviews_head <- interviews[-(7:n_rows), ]
# Factors
#######################################################################################################################
# Create variable
respondent_floor_type <- factor(c("earth", "cement", "cement", "earth"))
# Check levels
levels(respondent_floor_type)
# Display number of levels
nlevels(respondent_floor_type)
# Original order
respondent_floor_type
# After re-ordering
respondent_floor_type <- factor(respondent_floor_type, levels = c("earth", "cement"))
respondent_floor_type
levels(respondent_floor_type)
levels(respondent_floor_type)[2] <- "brick"
levels(respondent_floor_type)
respondent_floor_type
# Converting factors
#######################################################################################################################
as.character(respondent_floor_type)
# One wrong way - without a warning!
year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(year_fct)
# One solution
as.numeric(as.character(year_fct))
# The recommended solution
as.numeric(levels(year_fct))[year_fct]
# Renaming factors
#######################################################################################################################
## create a vector from the data frame column "memb_assoc"
memb_assoc <- interviews$memb_assoc
## convert it into a factor
memb_assoc <- as.factor(memb_assoc)
## let's see what it looks like
memb_assoc
plot(memb_assoc)
## Let's recreate the vector from the data frame column "memb_assoc"
memb_assoc <- interviews$memb_assoc
## replace the missing data with "undetermined"
memb_assoc[is.na(memb_assoc)] <- "undetermined"
## convert it into a factor
memb_assoc <- as.factor(memb_assoc)
## let's see what it looks like
memb_assoc
## bar plot of the number of interview respondents who were
## members of irrigation association:
plot(memb_assoc)
# Exercise
#######################################################################################################################
levels(memb_assoc)
levels(memb_assoc) <- c("No", "Undetermined", "Yes")
memb_assoc <- factor(memb_assoc, levels = c("No", "Yes", "Undetermined"))
plot(memb_assoc)
##############################################################################################################################################################################################################################################
# Learning dplyr and tidyr
#######################################################################################################################
#######################################################################################################################
## load the tidyverse
library(tidyverse)
interviews <- read_csv("data/SAFI_clean.csv", na = "NULL")
## inspect the data
interviews
## preview the data
View(interviews)
# Selecting columns and filtering rows (subsetting)
#######################################################################################################################
# select function - first argument is dataset, others are columns to keep
select(interviews, village, no_membrs, years_liv)
# filter function - choose rows based on specific criteria
filter(interviews, village == "God")
# Pipes - performing multiple functions simultaneously
#######################################################################################################################
## option1: intermediate steps
interviews2 <- filter(interviews, village == "God")
interviews2
interviews_god <- select(interviews2, no_membrs, years_liv)
interviews_god
## option2: nested functions
interviews_god <- select(filter(interviews, village == "God"), no_membrs, years_liv)
## option3: pipes
interviews_god <- interviews %>%
filter(village == "God") %>%
select(no_membrs, years_liv)
interviews_god
# tangent on logical operators
#######################################################################################################################
# https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html
# !x: not x
# AND: &, &&
# OR: |, ||
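A short sketch of the operators listed above, using made-up household values. The single forms `&` and `|` work element by element, which is what `filter()` needs; the double forms `&&` and `||` expect a single TRUE or FALSE and are typically used inside `if()`.

```r
no_membrs <- c(3, 7, 10, 6)
rooms <- c(1, 3, 1, 2)

# Element-wise AND: more than 5 members and fewer than 2 rooms
crowded <- no_membrs > 5 & rooms < 2

# Element-wise OR: either condition holds
either <- no_membrs > 5 | rooms < 2

# NOT: flip every TRUE/FALSE
not_crowded <- !crowded

# && combines two single TRUE/FALSE values, as in an if() condition
if (length(no_membrs) == length(rooms) && all(rooms > 0)) {
  people_per_room <- no_membrs / rooms
}
```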
# Exercise
#######################################################################################################################
# Using pipes, subset the interviews data to include interviews where respondents were members of an irrigation association (memb_assoc) and retain only the columns affect_conflicts, liv_count, and no_meals.
interviews %>%
filter(memb_assoc == "yes") %>%
select(affect_conflicts, liv_count, no_meals)
# Mutate - create new columns based on states in other columns
#######################################################################################################################
## What if we want to know the ratio of the number of people in a household, to the number of rooms they use to sleep
new_variable <- interviews %>%
mutate(people_per_room = no_membrs / rooms)
View(new_variable)
## Does being a member of an irrigation assoc. affect the ratio above? First, we'll remove non-responders...
interviews %>%
filter(!is.na(memb_assoc)) %>%
mutate(people_per_room = no_membrs / rooms)
# Exercise
#######################################################################################################################
## Create a new data frame from the interviews data that meets the following criteria: contains only the village column and a new column called total_meals containing a value that is equal to the total number of meals served in the household per day on average (no_membrs times no_meals). Only the rows where total_meals is greater than 20 should be shown in the final data frame. Hint: think about how the commands should be ordered to produce this data frame!
interviews %>%
mutate(total_meals = no_membrs * no_meals) %>%
select(village, total_meals)
interviews_total_meals <- interviews %>%
mutate(total_meals = no_membrs * no_meals) %>%
filter(total_meals > 20) %>%
select(village, total_meals)
View(interviews_total_meals)
# Split-apply-combine data analysis and the summarize() function
#######################################################################################################################
# Split data into groups, apply some analysis to each group, then combine the results
# Average household size, by village
interviews %>%
group_by(village) %>%
summarize(mean_no_membrs = mean(no_membrs))
# Group multiple columns
interviews %>%
group_by(village, memb_assoc) %>%
summarize(mean_no_membrs = mean(no_membrs))
# Exclude NAs from previous output using a filter step
interviews %>%
filter(!is.na(memb_assoc)) %>%
group_by(village, memb_assoc) %>%
summarize(mean_no_membrs = mean(no_membrs))
# Can summarize multiple variables at once - add minimum household size for each village group
interviews %>%
filter(!is.na(memb_assoc)) %>%
group_by(village, memb_assoc) %>%
summarize(mean_no_membrs = mean(no_membrs), min_membrs = min(no_membrs))
# Sort on min_membrs to put smallest value first
interviews %>%
filter(!is.na(memb_assoc)) %>%
group_by(village, memb_assoc) %>%
summarize(mean_no_membrs = mean(no_membrs), min_membrs = min(no_membrs)) %>%
arrange(min_membrs)
# Descending order
interviews_new <- interviews %>%
filter(!is.na(memb_assoc)) %>%
group_by(village, memb_assoc) %>%
summarize(mean_no_membrs = mean(no_membrs), min_membrs = min(no_membrs)) %>%
arrange(desc(min_membrs))
View(interviews_new)
# Counting - determine the number of observations for each factor or combination of factors
#######################################################################################################################
# Count rows of data per village
interviews %>%
count(village)
# Count and sort in decreasing order
interviews %>%
count(village, sort = TRUE)
# Exercise
#######################################################################################################################
# How many households in the survey have an average of two meals per day? Three meals per day? Are there any other numbers of meals represented?
interviews %>%
count(no_meals)
# Use group_by() and summarize() to find the mean, min, and max number of household members for each village. Also add the number of observations (hint: see ?n).
interviews %>%
group_by(village) %>%
summarize(
mean_no_membrs = mean(no_membrs),
min_no_membrs = min(no_membrs),
max_no_membrs = max(no_membrs),
n = n()
)
# Skip last exercise
# Reshaping with gather and spread
#######################################################################################################################
## What if instead of comparing records, we wanted to look at differences in households grouped by different types of housing construction materials?
# Spreading (long to wide)
## Because both the key and value parameters must come from column values, we will create a dummy column (we’ll name it wall_type_logical) to hold the value TRUE
interviews_spread <- interviews %>%
mutate(wall_type_logical = TRUE) %>%
spread(key = respondent_wall_type, value = wall_type_logical, fill = FALSE)
View(interviews_spread)
# Gathering (wide to long)
interviews_gather <- interviews_spread %>%
gather(key = respondent_wall_type, value = "wall_type_logical", burntbricks:sunbricks)
View(interviews_gather)
## Now have four rows per interview respondent - filter all but those that are TRUE
interviews_gather <- interviews_spread %>%
  gather(key = "respondent_wall_type", value = "wall_type_logical",
         burntbricks:sunbricks) %>%
  filter(wall_type_logical) %>%
  select(-wall_type_logical)
View(interviews_gather)
# Applying spread() to clean our data
#######################################################################################################################
# Tidy up raw data - some columns have multiple values in a single cell (e.g. items owned)
View(interviews)
str(interviews$items_owned)
interviews_items_owned <- interviews %>%
separate_rows(items_owned, sep=";") %>%
mutate(items_owned_logical = TRUE) %>%
spread(key = items_owned, value = items_owned_logical, fill = FALSE)
View(interviews_items_owned)
nrow(interviews_items_owned)
# Rename NA column
interviews_items_owned <- interviews_items_owned %>%
rename(no_listed_items = `<NA>`)
# Now we can summarize differently - show the number of people in each village who owned a particular item
interviews_items_owned %>%
filter(computer) %>%
group_by(village) %>%
count(computer)
# Calculate average number of items owned by respondents in each village
interviews_items_owned %>%
mutate(number_items = rowSums(select(., bicycle:television))) %>%
group_by(village) %>%
summarize(mean_items = mean(number_items))
# Exercise
#######################################################################################################################
# Create a new data frame (named interviews_months_lack_food) that has one column for each month and records TRUE or FALSE for whether each interview respondent was lacking food in that month
interviews_months_lack_food <- interviews %>%
separate_rows(months_lack_food, sep=";") %>%
mutate(months_lack_food_logical = TRUE) %>%
spread(key = months_lack_food, value = months_lack_food_logical, fill = FALSE)
View(interviews_months_lack_food)
# How many months (on average) were respondents without food if they did belong to an irrigation association? What about if they didn’t?
interviews_months_lack_food %>%
mutate(number_months = rowSums(select(., Apr:Sept))) %>%
group_by(memb_assoc) %>%
summarize(mean_months = mean(number_months))
# Exporting data
#######################################################################################################################
# Create new folder 'data_output'
# Prepare for plotting - create version where each column includes only one data value
## Use spread to expand months_lack_food and items_owned
interviews_plotting <- interviews %>%
## spread data by items_owned
separate_rows(items_owned, sep=";") %>%
mutate(items_owned_logical = TRUE) %>%
spread(key = items_owned, value = items_owned_logical, fill = FALSE) %>%
rename(no_listed_items = `<NA>`) %>%
## spread data by months_lack_food
separate_rows(months_lack_food, sep=";") %>%
mutate(months_lack_food_logical = TRUE) %>%
spread(key = months_lack_food, value = months_lack_food_logical, fill = FALSE) %>%
## add some summary columns
mutate(number_months_lack_food = rowSums(select(., Apr:Sept))) %>%
mutate(number_items = rowSums(select(., bicycle:television)))
# Save new file
write_csv(interviews_plotting, path = "data_output/interviews_plotting.csv")
##############################################################################################################################################################################################################################################
# Plotting with ggplot2
#######################################################################################################################
#######################################################################################################################
library(tidyverse)
ggplot(data = interviews_plotting)
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items))
# scatter plot - two continuous variables
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
geom_point()
# Assign plot to a variable
interviews_plot <- ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items))
# Draw the plot
interviews_plot +
geom_point()
## This is the correct syntax for adding layers
interviews_plot +
geom_point()
## This will not add the new layer and will return an error message
interviews_plot
+ geom_point()
# Building your plots iteratively
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
geom_point()
# Make points transparent
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
geom_point(alpha = 0.5)
# Jitter points
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
geom_jitter(alpha = 0.5)
# Color points
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
geom_jitter(alpha = 0.7, color = "blue")
# Color by village - new aes function to specify across subsets of points
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
geom_jitter(aes(color = village), alpha = 0.5)
# Exercise: create a scatter plot of rooms by village with respondent_wall_type shown in different colors. Is this a good way to show this type of data?
ggplot(data = interviews_plotting, aes(x = village, y = rooms)) +
geom_jitter(aes(color = respondent_wall_type), alpha = 0.5)
# Difficult to distinguish between villages in above plot
# Boxplots
#######################################################################################################################
ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = rooms)) +
geom_boxplot()
# Add points
ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = rooms)) +
geom_boxplot(alpha = 0) +
geom_jitter(alpha = 0.5, color = "tomato")
# Layer order matters: here the boxplots are drawn last, on top of the points
ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = rooms)) +
geom_jitter(alpha = 0.5, color = "tomato") +
geom_boxplot()
# Exercise
#######################################################################################################################
# Make violin plot
ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = rooms)) +
geom_violin(alpha = 0) +
geom_jitter(alpha = 0.5, color = "tomato")
# Create a boxplot for liv_count for each wall type - overlay on a jitter layer to show actual measurements
ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = liv_count)) +
geom_boxplot(alpha = 0) +
geom_jitter(alpha = 0.5)
# Add color based on irrigation association membership
ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = liv_count)) +
geom_boxplot(alpha = 0) +
geom_jitter(aes(color = memb_assoc), alpha = 0.5)
# Barplots
#######################################################################################################################
ggplot(data = interviews_plotting, aes(x = respondent_wall_type)) +
geom_bar()
# Fill bars by village counts
ggplot(data = interviews_plotting, aes(x = respondent_wall_type)) +
geom_bar(aes(fill = village))
# Side-by-side bars
ggplot(data = interviews_plotting, aes(x = respondent_wall_type)) +
geom_bar(aes(fill = village), position = "dodge")
# Create a new dataset to plot the proportion of each housing type in each village
## Also remove houses with cement walls as there is only one in the dataset
percent_wall_type <- interviews_plotting %>%
filter(respondent_wall_type != "cement") %>%
count(village, respondent_wall_type) %>%
group_by(village) %>%
mutate(percent = n / sum(n)) %>%
ungroup()
View(percent_wall_type)
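# Quick sanity check of the count -> group_by -> mutate pattern above, sketched
# on made-up data (`toy_walls` and its values are hypothetical, not the SAFI
# data): within each village the proportions should add up to 1.
library(dplyr)
toy_walls <- tibble(village = c("A", "A", "A", "B", "B"),
                    wall = c("burntbricks", "muddaub", "muddaub", "sunbricks", "sunbricks"))
toy_walls %>%
  count(village, wall) %>%           # one row per village/wall combination
  group_by(village) %>%
  mutate(percent = n / sum(n)) %>%   # share of each wall type within its village
  summarise(total = sum(percent))    # one total per village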
ggplot(percent_wall_type, aes(x = village, y = percent, fill = respondent_wall_type)) +
geom_bar(stat = "identity", position = "dodge")
# Exercise
#######################################################################################################################
## Create a bar plot showing the proportion of respondents in each village who are or are not part of an irrigation association (memb_assoc). Include only respondents who answered that question in the calculations and plot. Which village had the lowest proportion of respondents in an irrigation association?
percent_memb_assoc <- interviews_plotting %>%
filter(!is.na(memb_assoc)) %>%
count(village, memb_assoc) %>%
group_by(village) %>%
mutate(percent = n / sum(n)) %>%
ungroup()
View(percent_memb_assoc)
ggplot(percent_memb_assoc, aes(x = village, y = percent, fill = memb_assoc)) +
geom_bar(stat = "identity", position = "dodge")
# Adding Labels and Titles
#######################################################################################################################
ggplot(percent_wall_type, aes(x = village, y = percent, fill = respondent_wall_type)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Proportion of wall type by village", x = "Village", y = "Percent")
# Faceting
######################################################################################################################
ggplot(percent_wall_type, aes(x = respondent_wall_type, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title="Proportion of wall type by village",
x="Wall Type",
y="Percent") +
facet_wrap(~ village, ncol=1)
# Add white background and remove grid
ggplot(percent_wall_type, aes(x = respondent_wall_type, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title="Proportion of wall type by village",
x="Wall Type",
y="Percent") +
facet_wrap(~ village) +
theme_bw() +
theme(panel.grid = element_blank())
# Plot the proportion of respondents in each village who owned a particular item
## First calculate the percentage of people in each village who owned each item
## Household counts per village; these are the denominators hard-coded below
table(interviews$village)
percent_items <- interviews_plotting %>%
gather(items, items_owned_logical, bicycle:no_listed_items) %>%
filter(items_owned_logical) %>%
count(items, village) %>%
## add a column with the number of people in each village
mutate(people_in_village = case_when(village == "Chirodzo" ~ 39,
village == "God" ~ 43,
village == "Ruaca" ~ 49)) %>%
mutate(percent = n / people_in_village)
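# The village sizes above are typed in by hand from the table() output. One
# possible alternative, sketched on made-up data (`toy_households`,
# `village_sizes` and the values are hypothetical), is to count households per
# village and join that count back on, so the denominators follow the data.
library(dplyr)
library(tidyr)
toy_households <- tibble(village = c("A", "A", "B"),
                         bicycle = c(TRUE, FALSE, TRUE),
                         radio   = c(TRUE, TRUE, FALSE))
village_sizes <- count(toy_households, village, name = "people_in_village")
toy_households %>%
  gather(items, items_owned_logical, bicycle:radio) %>%
  filter(items_owned_logical) %>%
  count(items, village) %>%
  left_join(village_sizes, by = "village") %>%
  mutate(percent = n / people_in_village)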
ggplot(percent_items, aes(x = village, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ items) +
theme_bw() +
theme(panel.grid = element_blank())
# ggplot2 themes
######################################################################################################################
# link to ggplot2 themes:
# http://docs.ggplot2.org/current/ggtheme.html
# Exercise
# Experiment with at least two different themes. Build the previous plot using each of those themes. Which do you like best?
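# A possible starting point for this exercise, sketched on made-up data
# (`toy_items` and `toy_bars` are hypothetical names): build the plot once,
# then add a different built-in complete theme each time.
library(ggplot2)
toy_items <- data.frame(village = c("A", "B", "C"), percent = c(0.2, 0.5, 0.8))
toy_bars <- ggplot(toy_items, aes(x = village, y = percent)) +
  geom_bar(stat = "identity")
toy_bars + theme_minimal()
toy_bars + theme_classic()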
# Customization
######################################################################################################################
# ggplot2 cheat sheet
# https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf
# Change labels to be more informative
ggplot(percent_items, aes(x = village, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ items) +
labs(title = "Percent of respondents in each village who owned each item",
x = "Village",
y = "Percent of Respondents") +
theme_bw()
# Increase font size
ggplot(percent_items, aes(x = village, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ items) +
labs(title = "Percent of respondents in each village who owned each item",
x = "Village",
y = "Percent of Respondents") +
theme_bw() +
theme(text=element_text(size = 16))
# Rotate x-axis labels, and introduce newlines
ggplot(percent_items, aes(x = village, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ items) +
labs(title = "Percent of respondents in each village \n who owned each item",
x = "Village",
y = "Percent of Respondents") +
theme_bw() +
theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 45, hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
text = element_text(size = 16))
# Save theme and center plot title
grey_theme <- theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 45, hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
text = element_text(size = 16),
plot.title = element_text(hjust = 0.5))
ggplot(percent_items, aes(x = village, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ items) +
labs(title = "Percent of respondents in each village \n who owned each item",
x = "Village",
y = "Percent of Respondents") +
grey_theme
# Export plot using commands
my_plot <- ggplot(percent_items, aes(x = village, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ items) +
labs(title = "Percent of respondents in each village \n who owned each item",
x = "Village",
y = "Percent of Respondents") +
theme_bw() +
theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 45, hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
text = element_text(size = 16),
plot.title = element_text(hjust = 0.5))
# Create folder 'fig_output' first, then save the plot into it
ggsave("fig_output/items_by_village_barplot.png", my_plot, width = 15, height = 10)
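# The output folder can also be created from R before saving. A minimal sketch,
# assuming a throwaway plot (`toy_plot` and `out_dir` are made-up names) and a
# folder under tempdir() so nothing in the project is touched:
library(ggplot2)
out_dir <- file.path(tempdir(), "fig_output")
dir.create(out_dir, showWarnings = FALSE)   # no warning if the folder already exists
toy_plot <- ggplot(data.frame(x = 1:3, y = 1:3), aes(x = x, y = y)) +
  geom_point()
ggsave(file.path(out_dir, "toy_plot.png"), toy_plot, width = 15, height = 10)
file.exists(file.path(out_dir, "toy_plot.png"))   # check that the file was written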