
2020-Data-Carpentry-Social-Science

workshop schedule:

https://ucsdlib.github.io/2020-02-04-UCSDsocsci/

This workshop qualifies for the Co-Curricular Record!

https://elt.ucsd.edu/ccr/index.html

If you'd like this workshop to show up on your co-curricular record, follow the instructions in the CCR portal, then email Stephanie (slabou@ucsd.edu) two items:

  1. text file with the JSON extract from our lesson on OpenRefine
  2. your R script from Wednesday's R lesson

This HackMD: https://hackmd.io/@U2NG/ryazym6x8

Instructors:

Stephanie Labou (Library) - slabou@ucsd.edu
Reid Otsuji (Library) - rotsuji@ucsd.edu
Rick McCosh (OBGYN/Reprod Sci)
Justin Shaffer (Pediatrics)

Collaborative Notes

These are the collaborative notes for the workshop.
Please type your notes here!

Sticky notes

Blue - "i need help"
Orange - "i'm good!"

#######Sign-In Here######

Please sign in:

Full Name, Affiliation (Faculty, student, post-doc, staff), department or Lab

Reid Otsuji, Librarian, Library
Rick McCosh, Postdoc, OB/GYN + Repro. Sci
Ryan Johnson, Librarian, Library
Stephanie Labou, Librarian, Library
Leonidas Mylonakis, Alumni, History
Veronica Hoyo, Staff, ACTRI-Health Sciences
Evie Xinqi Guo, Graduate Student, Experimental Psychology
Alexandre Gomide, Visiting Scholar, GPS-UCSD
Julian Borba, Visiting Scholar, CILAS-UCSD
Peipei Zhu, post-doc, Croker Lab, Pediatrics
Danielle Fritts, Graduate Student, Visiting
Rita Kuckertz, Visiting Graduate Student
Bryan Kehr, Library IT

########Day 1 - Notes###########

Lesson information:
https://datacarpentry.org/spreadsheets-socialsci/

Workshop Data download:

https://ndownloader.figshare.com/articles/6262019/versions/4

https://datacarpentry.org/openrefine-socialsci/setup.html

Information about the SAFI Teaching Database data set: https://datacarpentry.org/socialsci-workshop/data/


For Data Management Best Practices section

Download and unzip on your desktop:
https://librarycarpentry.org/lc-shell/data/shell-lesson.zip

Git Bash for Windows setup:
https://librarycarpentry.org/lc-shell/setup.html

Data Organization in Spreadsheets

  • Leave raw data raw
  • keep a record of any cleaning steps you do

Exercise: what's wrong with the spreadsheet?

Problems identified:

  • multiple tables on a single sheet
  • typos
  • missing values - different values used
  • working on different sheets - need to combine data
  • unknown numeric values
  • colored cells
  • KeyID issues
  • Merged cells
  • use of underscores
  • no dates
  • notes in cells

How to clean up:
Use consistent values.
Be explicit - cleaning up will create more values in the spreadsheet, and that's OK: R can handle it.
R uses 'NA' for missing values.
R can handle column names of various lengths, but not names that contain spaces or begin with a number (use an underscore instead of a space: _).
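
A small sketch of what these rules look like once the spreadsheet reaches R. The inline CSV text and the example column names are made up for illustration (they are not the workshop data); read.csv() and make.names() are base R.

```r
# Hypothetical CSV text standing in for an exported spreadsheet
csv_text <- "key_id,no_membrs,rooms
1,3,2
2,7,NA
3,,1"

# Treat both the literal string "NA" and empty cells as missing values
survey <- read.csv(text = csv_text, na.strings = c("NA", ""))

is.na(survey$no_membrs)  # which rows are missing a household size
is.na(survey$rooms)      # which rows are missing a room count

# make.names() shows how R rewrites column names it does not like:
# spaces become periods, and names starting with a digit get an X prefix
fixed_names <- make.names(c("wall type", "2020_visits", "no_membrs"))
fixed_names
```

Names that already follow the rules (such as no_membrs) come back unchanged.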

Common formatting problems:
using multiple tables and/or tabs
not filling in zeros
confusing field names
inconsistent column names
Excel stores dates as numbers

Tip for dates in Excel: separate dates into columns (year, month, day)
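
If the date was split into three columns as suggested, it can be put back together in R later. A minimal sketch in base R, assuming hypothetical column names year, month and day and made-up values:

```r
# Hypothetical year/month/day columns, as they might come from a spreadsheet
visits <- data.frame(
  year  = c(2016, 2016, 2017),
  month = c(11, 12, 4),
  day   = c(17, 1, 26)
)

# ISOdate() builds a date-time from numeric parts; as.Date() drops the time
visits$interview_date <- as.Date(ISOdate(visits$year, visits$month, visits$day))

# Dates print in the unambiguous yyyy-mm-dd form
format(visits$interview_date)
```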

OpenRefine Lesson

To import .zip or .tar projects use the import project command

Why use OpenRefine:
Data is often very messy. OpenRefine provides a set of tools to allow you to identify and amend the messy data.
It is important to know what you did to your data. Additionally, journals, granting agencies, and other institutions are requiring documentation of the steps you took when working with your data. With OpenRefine, you can capture all actions applied to your raw data and share them with your publication as supplemental material.

All actions are easily reversed in OpenRefine.

If you save your work it will be to a new file. OpenRefine always uses a copy of your data and does not modify your original dataset.

Data cleaning steps often need repeating with multiple files. OpenRefine keeps track of all of your actions and allows them to be applied to different datasets.

Some concepts such as clustering algorithms are quite complex, but OpenRefine makes it easy to introduce them, use them, and show their power.

For more OpenRefine information:
search for "OpenRefine LibGuide"

You can find out a lot more about OpenRefine at http://openrefine.org

More about GREL:
https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions

Data Management Best practices

Data Sharing SNAFU youtube video:
https://www.youtube.com/watch?v=66oNv_DJuPc

ICPSR (social science data repository - great for finding data, can also deposit data): https://www.icpsr.umich.edu/icpsrweb/

UCSD's research data curation group in the library: https://library.ucsd.edu/research-and-collections/data-curation/

R

R package for non-English language text mining: https://www.bnosac.be/index.php/blog/72-natural-language-processing-for-non-english-languages-with-udpipe

Day 2 Sign-in

Full Name, Affiliation (Faculty, student, post-doc, staff), department or Lab

Reid Otsuji, Librarian, Library
Rick McCosh, Postdoc, OB/GYN + Repro. Sci.
Stephanie Labou, Librarian, Library
Veronica Hoyo, Staff, ACTRI-Health Sciences
Peipei Zhu, Postdoc, Pediatrics
Leonidas Mylonakis, Alumni, History
Alexandre Gomide, Visiting Scholar, GPS-UCSD
Danielle Fritts, Visiting Graduate Student
Justin Shaffer, Postdoc, Pediatrics
Bryan Kehr, Library IT
Rita Kuckertz, Visiting Graduate Student
Evie Xinqi Guo, Graduate Student, Experimental Psychology

Useful links:

R Workshop Walkthrough

https://datacarpentry.org/r-socialsci/

Dataset download

https://ndownloader.figshare.com/articles/6262019/versions/4

Dataset overview

https://datacarpentry.org/socialsci-workshop/data/

logical operators in R

https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html

Hadley Wickham's Tidy Data paper

https://vita.had.co.nz/papers/tidy-data.pdf

ggplot2 themes

http://docs.ggplot2.org/current/ggtheme.html

ggplot2 cheat sheet

https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf

Day 2 notes:

Assigning variables:
assign variables using the <- assignment operator

be thoughtful about variable naming:

  • yes: letters, numbers, period, underscore
  • no: spaces, dashes
  • no: starting with a number or an underscore

variable name styles:

  • CamelCase
  • Underscore_between_words
  • periods.between.words
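
One way to check the naming rules above is to ask R itself. This is only a sketch: the candidate names are invented, and it relies on the base R function make.names(), which returns a name unchanged when it is already syntactically valid.

```r
# Invented candidate names covering the "yes" and "no" cases above
candidate_names <- c("hh_members", "hh.members", "HhMembers",
                     "hh members", "hh-members", "_members", "2members")

# A name is syntactically valid if make.names() leaves it unchanged
is_valid <- candidate_names == make.names(candidate_names)

data.frame(name = candidate_names, valid = is_valid)
```

The first three names (one per style) are valid; the ones with a space, a dash, a leading underscore or a leading digit are not.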

Functions

Examples:

sqrt(2)  # square root function
b <- sqrt(2) # assign a function's result to a variable

round(3.1415)
args(round) 
?round #for function help


Function errors will display in the console.

Vectors

Examples:

hh_members <- c(3,7,10, 6) # c() means concatenate and is used to create vectors

hh_wall_type <- c("muddaub","burntbricks", "sunbricks")

length(hh_members)
length(hh_wall_type)
class(hh_members)
class(hh_wall_type)


R data types:

  • logical
  • integer
  • numeric
  • complex
  • character

Tip: R is indexed beginning at 1
Tip: remember R is case sensitive - be consistent

adding to vectors

possessions <- c("bike","radio", "tv")
possessions <- c(possessions, "cellPhone") # add to the end of the vector
possessions

Data structures

  • vector
  • list
  • matrix
  • array
  • data frame (tibble)

num_char <- c(1,2,3,"a")
class(num_char)

tricky <- c(1,2,3, "4")
num <- c(1,2,3)

class(num)

int <- as.integer(num)
class(int)

# as.integer()
# as.character()
# as.numeric()
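
A short sketch of what happens in the coercion examples above, using the same vectors. It is meant as an illustration of the rule that a vector holds only one data type, not as part of the lesson script.

```r
num_char <- c(1, 2, 3, "a")   # one character value makes the whole vector character
class(num_char)

tricky <- c(1, 2, 3, "4")
class(tricky)                 # still character, even though "4" looks numeric

# Every element of tricky can be parsed as a number, so conversion is clean
tricky_num <- as.numeric(tricky)

# "a" cannot be parsed, so it becomes NA (R also emits a warning,
# silenced here so the example runs quietly)
num_from_char <- suppressWarnings(as.numeric(num_char))
num_from_char
```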

subsetting

hh_wall_type
hh_wall_type[2]
hh_wall_type[c(1,3)]
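hh_wall_type[1,3] # error: incorrect number of dimensions (wrap the indices in c() as above)

more_wall_types <- hh_wall_type[c(1,2,3,3,4,1)] # index 4 is past the end of the vector, so it returns NA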
more_wall_types

conditional subsetting

hh_members
hh_members > 5

hh_members[hh_members >5]

possessions
possessions[ possessions == "tv" | possessions == "bike"] # search for one item at a time
possessions %in% c("car", "bicycle", "motorcycles", "boat") # search for multiple items at once
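
To see that chaining == with | and using %in% select the same elements, here is a small sketch with the possessions vector from above. The wanted vector is a name introduced only for this example.

```r
possessions <- c("bike", "radio", "tv")

# Items we are searching for (hypothetical helper vector)
wanted <- c("tv", "bike")

# One comparison per item, joined with OR
owned_by_or <- possessions[possessions == "tv" | possessions == "bike"]

# The same search written once with %in%
owned_by_in <- possessions[possessions %in% wanted]

identical(owned_by_or, owned_by_in)
```

%in% scales better: adding a tenth item to search for means one more entry in wanted instead of one more == comparison.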


missing data - NA

rooms <- c(2,1,1,NA,4)
mean(rooms)
max(rooms)
mean(rooms, na.rm = TRUE)
args(mean)

is.na(rooms)
rooms[!is.na(rooms)]
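
A brief recap of the missing-data steps above, with the results stored in variables so they can be checked. The variable names n_missing, mean_rooms and complete_rooms are introduced just for this sketch.

```r
rooms <- c(2, 1, 1, NA, 4)

# Any calculation that touches an NA returns NA by default
mean(rooms)

# Count the missing values: is.na() gives TRUE/FALSE, and TRUE counts as 1
n_missing <- sum(is.na(rooms))

# na.rm = TRUE drops the NA before the mean is computed
mean_rooms <- mean(rooms, na.rm = TRUE)

# Keep only the values that are present
complete_rooms <- rooms[!is.na(rooms)]
```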

working with a dataset

installing packages

install.packages("tidyverse")

library(tidyverse) 

loading data

interviews <- read_csv("SAFI_clean.csv")
View(interviews)
head(interviews)
tail(interviews)
class(interviews)

str(interviews) 
nrow(interviews)

ncol(interviews)
names(interviews)
summary(interviews)

subsetting

interviews[1,6] # output as a tibble (data frame)
interviews[1]
interviews[[1]] #output as a vector
interviews[1:6, ] #all of the columns and 1 - 6 rows

interviews[4:7, ] # subset
interviews[ , -1]
interviews["village"] 
interviews[["village"]]

interviews$village  # use $ to access a column, e.g. dataFrameName$columnName



# factors
respondent_floor_type <- factor(c("earth","cement","cement","earth"))
class(respondent_floor_type)
levels(respondent_floor_type)


levels(respondent_floor_type)
unique(respondent_floor_type)
respondent_floor_type

as.character(respondent_floor_type)


memb_assoc <- interviews$memb_assoc
memb_assoc <- as.factor(memb_assoc)
memb_assoc
plot(memb_assoc)
levels(memb_assoc)

Dates

R and its packages prefer dates in yyyy-mm-dd format

install.packages("lubridate") #if you need to install the package

library(lubridate)

dates <- interviews$interview_date

interviews$day <- day(dates)

interviews$month <- month(dates)
interviews$year <- year(dates)
View(interviews)

R - PART 2

Dplyr and Tidyr

select(interviews, village, no_membrs, years_liv)

filter(interviews, village == "God")

# option 1: intermediate steps
interviews2 <- filter(interviews, village == "God")
interviews2

interviews_god <- select(interviews2, no_membrs, years_liv)
interviews_god

# option 2: nested functions
interviews_god <- select(filter(interviews, village == "God"), no_membrs, years_liv)

# option 3: pipes
interviews_god <- interviews %>%
    filter(village == "God") %>%
    select(no_membrs, years_liv)
interviews_god

Exercise solution:

interviews %>%
    filter(memb_assoc == "yes") %>% 
    select(affect_conflicts, liv_count, no_meals)
    

mutate

  • creating a new column
new_variable <- interviews %>%
    mutate(people_per_room = no_membrs / rooms)
View(new_variable)
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    mutate(people_per_room = no_membrs / rooms)

Exercise solution:

interviews %>%
    mutate(total_meals = no_membrs * no_meals) %>%
    select(village, total_meals)
    
interviews_total_meals <- interviews %>%
    mutate(total_meals = no_membrs * no_meals) %>%
    filter(total_meals > 20) %>%
    select(village, total_meals)
View(interviews_total_meals)

group_by function

Average household size by village

interviews %>%
    group_by(village) %>%
    summarize(mean_no_membrs = mean(no_membrs))
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs))
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs), min_membrs = min(no_membrs))
interviews %>%
    filter(!is.na(memb_assoc)) %>%
    group_by(village, memb_assoc) %>%
    summarize(mean_no_membrs = mean(no_membrs), min_membrs = min(no_membrs)) %>%
    arrange(desc(min_membrs))


interviews %>%
    count(village)

interviews %>%
    count(village, sort = TRUE)
    

Exercise solution

interviews %>%
    group_by(village) %>%
    summarize(mean_no_membrs = mean(no_membrs),
    min_no_membrs = min(no_membrs),
    max_no_membrs = max(no_membrs),
    n = n()
    )
    

spreading

interviews_spread <- interviews %>%
    mutate(wall_type_logical = TRUE) %>%
    spread(key = respondent_wall_type, value = wall_type_logical, fill = FALSE)



gathering

interviews_gather <- interviews_spread %>%
    gather(key = respondent_wall_type, value = "wall_type_logical",
           burntbricks:sunbricks)%>%
    filter(wall_type_logical) %>%
    select(-wall_type_logical)
           
           
# each cell should hold only a single value


interviews_items_owned <- interviews %>%
    separate_rows(items_owned, sep=";") %>%
    mutate(items_owned_logical = TRUE) %>%
    spread(key = items_owned, value = items_owned_logical, fill = FALSE)
View(interviews_items_owned)
nrow(interviews_items_owned)


interviews_items_owned %>%
    filter(bicycle) %>%
    group_by(village) %>%
    count(bicycle)
interviews_items_owned %>%
    mutate(number_items = rowSums(select(., bicycle:television))) %>%
    group_by(village) %>%
    summarize(mean_items = mean(number_items))



Exercise Solution part 1:

interviews_months_lack_food <- interviews %>%
  separate_rows(months_lack_food, sep=";") %>%
  mutate(months_lack_food_logical  = TRUE) %>%
  spread(key = months_lack_food, value = months_lack_food_logical, fill = FALSE)

Exercise Solution part 2:
How many months (on average) were respondents without food if they did belong to an irrigation association? What about if they didn’t?

  interviews_months_lack_food %>%
  mutate(number_months = rowSums(select(., Apr:Sept))) %>%
  group_by(memb_assoc) %>%
  summarize(mean_months = mean(number_months))

# This is what to export for plotting:
interviews_plotting <- interviews %>%
    ## spread data by items_owned
    separate_rows(items_owned, sep=";") %>%
    mutate(items_owned_logical = TRUE) %>%
    spread(key = items_owned, value = items_owned_logical, fill = FALSE) %>%
    rename(no_listed_items = `<NA>`) %>%
    ## spread data by months_lack_food
    separate_rows(months_lack_food, sep=";") %>%
    mutate(months_lack_food_logical = TRUE) %>%
    spread(key = months_lack_food, value = months_lack_food_logical, fill = FALSE) %>%
    ## add some summary columns
    mutate(number_months_lack_food = rowSums(select(., Apr:Sept))) %>%
    mutate(number_items = rowSums(select(., bicycle:television)))


# saving to .csv
write_csv(interviews_plotting, path = "data_output/interviews_plotting.csv")

Plotting with GGPLOT2

library(tidyverse)

ggplot(data = interviews_plotting)
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items))

scatter plot

ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) + geom_point()

assign ggplot to a variable

interviews_plot <- ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items))

now you can use the variable and change geom_

interviews_plot +
    geom_point()
    

https://datacarpentry.org/r-socialsci/04-ggplot2/index.html

Code from R for dplyr and ggplot sections

#######################################################################################################################
#######################################################################################################################

2020.02.05

Data Carpentries - R for Social Scientists

Justin Shaffer

justinparkshaffer@gmail.com

#######################################################################################################################
#######################################################################################################################

Set working directory

#######################################################################################################################
getwd()
setwd("~/Google-Drive-UCSD/R/2020_carpentries_social_sci/")

Install and load libraries needed for analysis

#######################################################################################################################
install.packages("tidyverse")
install.packages("tidyr")
install.packages("ggplot2")
install.packages("plyr")
install.packages("dplyr")

library(tidyverse)
library(ggplot2)

Read in dataset

#######################################################################################################################
interviews <- read_csv("data/SAFI_clean.csv", na = "NULL")

Inspecting dataframes

#######################################################################################################################

Viewing

interviews
View(interviews)

Data types

class(interviews)

Size

dim(interviews)
nrow(interviews)
ncol(interviews)

Content

head(interviews)
tail(interviews)

Names

names(interviews)

Summary

str(interviews)
summary(interviews)

Indexing and subsetting

#######################################################################################################################

first element in the first column of the data frame (as a 1 x 1 tibble)

interviews[1, 1]

first element in the 6th column (as a 1 x 1 tibble)

interviews[1, 6]

first column of the data frame (as a vector)

interviews[[1]]

first column of the data frame (as a data.frame)

interviews[1]

first three elements in the 7th column (as a tibble)

interviews[1:3, 7]

the 3rd row of the data frame (as a data.frame)

interviews[3, ]

The whole data frame, except the first column

interviews[, -1]

Equivalent to head(interviews)

interviews[-c(7:131), ]

interviews["village"] # Result is a data frame
interviews[, "village"] # Result is a data frame
interviews[["village"]] # Result is a vector
interviews$village # Result is a vector

Exercise

#######################################################################################################################

1.

interviews_100 <- interviews[100, ]

2.

Saving n_rows to improve readability and reduce duplication

n_rows <- nrow(interviews)
interviews_last <- interviews[n_rows, ]

3.

interviews_middle <- interviews[ceiling(n_rows / 2), ]

4.

interviews_head <- interviews[-(7:n_rows), ]

Factors

#######################################################################################################################

Create variable

respondent_floor_type <- factor(c("earth", "cement", "cement", "earth"))

Check levels

levels(respondent_floor_type)

Display number of levels

nlevels(respondent_floor_type)

Original order

respondent_floor_type

After re-ordering

respondent_floor_type <- factor(respondent_floor_type, levels = c("earth", "cement"))
respondent_floor_type

levels(respondent_floor_type)

levels(respondent_floor_type)[2] <- "brick"
levels(respondent_floor_type)

respondent_floor_type

Converting factors

#######################################################################################################################

as.character(respondent_floor_type)

One wrong way - without a warning!

year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(year_fct)

One solution

as.numeric(as.character(year_fct))

The recommended solution

as.numeric(levels(year_fct))[year_fct]
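
Why the "wrong way" goes wrong: a factor stores integer codes that point into its sorted levels, and as.numeric() returns those codes rather than the labels. A small sketch using the same year_fct values (the names codes and years are introduced only here):

```r
year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))

# Levels are the unique values, sorted and stored as text
levels(year_fct)

# as.numeric() on the factor returns the position of each value in levels()
codes <- as.numeric(year_fct)
codes

# Converting the levels first, then indexing by the factor, recovers the years
years <- as.numeric(levels(year_fct))[year_fct]
years
```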

Renaming factors

#######################################################################################################################

create a vector from the data frame column "memb_assoc"

memb_assoc <- interviews$memb_assoc

convert it into a factor

memb_assoc <- as.factor(memb_assoc)

let's see what it looks like

memb_assoc

plot(memb_assoc)

Let's recreate the vector from the data frame column "memb_assoc"

memb_assoc <- interviews$memb_assoc

replace the missing data with "undetermined"

memb_assoc[is.na(memb_assoc)] <- "undetermined"

convert it into a factor

memb_assoc <- as.factor(memb_assoc)

let's see what it looks like

memb_assoc

bar plot of the number of interview respondents who were members of an irrigation association:

plot(memb_assoc)

Exercise

#######################################################################################################################
levels(memb_assoc)
levels(memb_assoc) <- c("No", "Undetermined", "Yes")
memb_assoc <- factor(memb_assoc, levels = c("No", "Yes", "Undetermined"))
plot(memb_assoc)

##############################################################################################################################################################################################################################################

Learning dplyr and tidyr

#######################################################################################################################
#######################################################################################################################

load the tidyverse

library(tidyverse)

interviews <- read_csv("data/SAFI_clean.csv", na = "NULL")

inspect the data

interviews

preview the data

View(interviews)

Selecting columns and filtering rows (subsetting)

#######################################################################################################################

select function - first argument is dataset, others are columns to keep

select(interviews, village, no_membrs, years_liv)

filter function - choose rows based on specific criteria

filter(interviews, village == "God")

Pipes - chaining multiple functions in sequence

#######################################################################################################################

option1: intermediate steps

interviews2 <- filter(interviews, village == "God")
interviews2
interviews_god <- select(interviews2, no_membrs, years_liv)
interviews_god

option2: nested functions

interviews_god <- select(filter(interviews, village == "God"), no_membrs, years_liv)

option3: pipes

interviews_god <- interviews %>%
filter(village == "God") %>%
select(no_membrs, years_liv)

interviews_god
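
To convince yourself that options 2 and 3 do the same thing, the sketch below runs both on a tiny made-up table. It assumes the dplyr package is installed; the households table and its values are invented and are not the SAFI data.

```r
library(dplyr)

# Invented stand-in for the interviews data
households <- tibble(
  village   = c("God", "God", "Chirodzo", "Ruaca"),
  no_membrs = c(3, 7, 10, 4),
  years_liv = c(4, 9, 15, 6)
)

# Nested: read from the inside out
nested <- select(filter(households, village == "God"), no_membrs, years_liv)

# Piped: read top to bottom; each step receives the previous result
piped <- households %>%
  filter(village == "God") %>%
  select(no_membrs, years_liv)

all.equal(nested, piped)
```

The pipe passes the left-hand result in as the first argument of the next function, which is why the data frame is not repeated inside filter() and select().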

tangent on logical operators

#######################################################################################################################

https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html

!x: not x

AND: &, &&

OR: |, ||
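
A quick sketch of the difference between the single and double forms, using the hh_members vector from earlier in these notes. The result names (in_range, out_of_range, first_ok) are made up for the example.

```r
hh_members <- c(3, 7, 10, 6)

# & and | work element by element and return one TRUE/FALSE per value
in_range     <- hh_members > 5 & hh_members < 10
out_of_range <- hh_members <= 5 | hh_members >= 10

# ! flips every value
not_in_range <- !in_range

# && and || are for single TRUE/FALSE values, e.g. inside if ()
first_ok <- hh_members[1] > 2 && hh_members[1] < 5
```

Inside filter(), use the single forms (& and |), because the condition is checked for every row.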

Exercise

#######################################################################################################################

Using pipes, subset the interviews data to include interviews where respondents were members of an irrigation association (memb_assoc) and retain only the columns affect_conflicts, liv_count, and no_meals.

interviews %>%
filter(memb_assoc == "yes") %>%
select(affect_conflicts, liv_count, no_meals)

Mutate - create new columns based on values in other columns

#######################################################################################################################

What if we want to know the ratio of the number of people in a household to the number of rooms they use to sleep?

new_variable <- interviews %>%
mutate(people_per_room = no_membrs / rooms)
View(new_variable)

Does being a member of an irrigation assoc. affect the ratio above? First, we'll remove non-responders

interviews %>%
filter(!is.na(memb_assoc)) %>%
mutate(people_per_room = no_membrs / rooms)

Exercise

#######################################################################################################################

Create a new data frame from the interviews data that meets the following criteria: contains only the village column and a new column called total_meals containing a value that is equal to the total number of meals served in the household per day on average (no_membrs times no_meals). Only the rows where total_meals is greater than 20 should be shown in the final data frame. Hint: think about how the commands should be ordered to produce this data frame!

interviews %>%
mutate(total_meals = no_membrs * no_meals) %>%
select(village, total_meals)

interviews_total_meals <- interviews %>%
mutate(total_meals = no_membrs * no_meals) %>%
filter(total_meals > 20) %>%
select(village, total_meals)

View(interviews_total_meals)

Split-apply-combine data analysis and the summarize() function

#######################################################################################################################

Split data into groups, apply some analysis to each group, then combine the results
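
The three steps can be spelled out one at a time in base R, which may help before seeing group_by() and summarize() do them together. This is only an illustration: the households table and its values are invented, and split() and sapply() stand in for what dplyr does internally.

```r
# Invented stand-in for the interviews data
households <- data.frame(
  village   = c("God", "God", "Chirodzo", "Chirodzo", "Ruaca"),
  no_membrs = c(3, 7, 10, 6, 4)
)

# Split: one vector of household sizes per village
groups <- split(households$no_membrs, households$village)

# Apply + combine: take the mean of each group and collect the results
group_means <- sapply(groups, mean)
group_means
```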

Average household size, by village

interviews %>%
group_by(village) %>%
summarize(mean_no_membrs = mean(no_membrs))

Group multiple columns

interviews %>%
group_by(village, memb_assoc) %>%
summarize(mean_no_membrs = mean(no_membrs))

Exclude NAs from previous output using a filter step

interviews %>%
filter(!is.na(memb_assoc)) %>%
group_by(village, memb_assoc) %>%
summarize(mean_no_membrs = mean(no_membrs))

Can summarize multiple variables at once - add minimum household size for each village group

interviews %>%
filter(!is.na(memb_assoc)) %>%
group_by(village, memb_assoc) %>%
summarize(mean_no_membrs = mean(no_membrs), min_membrs = min(no_membrs))

Sort on min_membrs to put smallest value first

interviews %>%
filter(!is.na(memb_assoc)) %>%
group_by(village, memb_assoc) %>%
summarize(mean_no_membrs = mean(no_membrs), min_membrs = min(no_membrs)) %>%
arrange(min_membrs)

Descending order

interviews_new <- interviews %>%
filter(!is.na(memb_assoc)) %>%
group_by(village, memb_assoc) %>%
summarize(mean_no_membrs = mean(no_membrs), min_membrs = min(no_membrs)) %>%
arrange(desc(min_membrs))
View(interviews_new)

Counting - determine the number of observations for each factor or combination of factors

#######################################################################################################################

Count rows of data per village

interviews %>%
count(village)

Count and sort in decreasing order

interviews %>%
count(village, sort = TRUE)

Exercise

#######################################################################################################################

How many households in the survey have an average of two meals per day? Three meals per day? Are there any other numbers of meals represented?

interviews %>%
count(no_meals)

Use group_by() and summarize() to find the mean, min, and max number of household members for each village. Also add the number of observations (hint: see ?n).

interviews %>%
group_by(village) %>%
summarize(
mean_no_membrs = mean(no_membrs),
min_no_membrs = min(no_membrs),
max_no_membrs = max(no_membrs),
n = n()
)

Skip last exercise

Reshaping with gather and spread

#######################################################################################################################

What if instead of comparing records, we wanted to look at differences in households grouped by different types of housing construction materials?

Spreading (long to wide)

Because both the key and value parameters must come from column values, we will create a dummy column (we’ll name it wall_type_logical) to hold the value TRUE

interviews_spread <- interviews %>%
mutate(wall_type_logical = TRUE) %>%
spread(key = respondent_wall_type, value = wall_type_logical, fill = FALSE)
View(interviews_spread)
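
Newer versions of tidyr mark spread() as superseded in favour of pivot_wider(). The sketch below shows the same dummy-column trick with pivot_wider() on a tiny invented table; it assumes dplyr and a recent tidyr are installed, and the households table is not the SAFI data.

```r
library(dplyr)
library(tidyr)

# Invented stand-in for the interviews data
households <- tibble(
  key_ID = 1:3,
  respondent_wall_type = c("muddaub", "burntbricks", "muddaub")
)

wide <- households %>%
  mutate(wall_type_logical = TRUE) %>%            # dummy column holding TRUE
  pivot_wider(names_from  = respondent_wall_type, # one new column per wall type
              values_from = wall_type_logical,
              values_fill = FALSE)                # FALSE where there was no match
wide
```

Each household keeps one row, with TRUE in the column for its wall type and FALSE in the others.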

Gathering (wide to long)

interviews_gather <- interviews_spread %>%
gather(key = respondent_wall_type, value = "wall_type_logical", burntbricks:sunbricks)
View(interviews_gather)

Now have four rows per interview respondent - filter all but those that are TRUE

interviews_gather <- interviews_spread %>%
gather(key = "respondent_wall_type", value = "wall_type_logical",
burntbricks:sunbricks) %>%
filter(wall_type_logical) %>%
select(-wall_type_logical)
View(interviews_gather)

Applying spread() to clean our data

#######################################################################################################################

Tidy up raw data - some columns have multiple values in a single cell (e.g. items owned)

View(interviews)
str(interviews$items_owned)

interviews_items_owned <- interviews %>%
separate_rows(items_owned, sep=";") %>%
mutate(items_owned_logical = TRUE) %>%
spread(key = items_owned, value = items_owned_logical, fill = FALSE)
View(interviews_items_owned)
nrow(interviews_items_owned)

Rename NA column

interviews_items_owned <- interviews_items_owned %>%
rename(no_listed_items = `<NA>`)

Now can summarize differently - show number of people in each village who owned a particular item

interviews_items_owned %>%
filter(computer) %>%
group_by(village) %>%
count(computer)

Calculate average number of items owned by respondents in each village

interviews_items_owned %>%
mutate(number_items = rowSums(select(., bicycle:television))) %>%
group_by(village) %>%
summarize(mean_items = mean(number_items))

Exercise

#######################################################################################################################

Create a new data frame (named interviews_months_lack_food) that has one column for each month and records TRUE or FALSE for whether each interview respondent was lacking food in that month

interviews_months_lack_food <- interviews %>%
separate_rows(months_lack_food, sep=";") %>%
mutate(months_lack_food_logical = TRUE) %>%
spread(key = months_lack_food, value = months_lack_food_logical, fill = FALSE)

View(interviews_months_lack_food)

How many months (on average) were respondents without food if they did belong to an irrigation association? What about if they didn’t?

interviews_months_lack_food %>%
mutate(number_months = rowSums(select(., Apr:Sept))) %>%
group_by(memb_assoc) %>%
summarize(mean_months = mean(number_months))

Exporting data

#######################################################################################################################

Create new folder 'data_output'

Prepare for plotting - create version where each column includes only one data value

Use spread to expand months_lack_food and items_owned

interviews_plotting <- interviews %>%
## spread data by items_owned
separate_rows(items_owned, sep=";") %>%
mutate(items_owned_logical = TRUE) %>%
spread(key = items_owned, value = items_owned_logical, fill = FALSE) %>%
rename(no_listed_items = `<NA>`) %>%
## spread data by months_lack_food
separate_rows(months_lack_food, sep=";") %>%
mutate(months_lack_food_logical = TRUE) %>%
spread(key = months_lack_food, value = months_lack_food_logical, fill = FALSE) %>%
## add some summary columns
mutate(number_months_lack_food = rowSums(select(., Apr:Sept))) %>%
mutate(number_items = rowSums(select(., bicycle:television)))

Save new file

write_csv(interviews_plotting, path = "data_output/interviews_plotting.csv")

##############################################################################################################################################################################################################################################

Plotting with ggplot2

#######################################################################################################################
#######################################################################################################################
library(tidyverse)

ggplot(data = interviews_plotting)

ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items))

scatter plot - two continuous variables

ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
geom_point()

Assign plot to a variable

interviews_plot <- ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items))

Draw the plot

interviews_plot +
geom_point()

This is the correct syntax for adding layers

interviews_plot +
geom_point()

This will not add the new layer and will return an error message

interviews_plot
+ geom_point()

Building your plots iteratively

ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
geom_point()

Make points transparent

ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
geom_point(alpha = 0.5)

Jitter points

ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
geom_jitter(alpha = 0.5)

Color points

ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
geom_jitter(alpha = 0.7, color = "blue")

Color by village - new aes function to specify across subsets of points

ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
geom_jitter(aes(color = village), alpha = 0.5)

ggplot(data = interviews_plotting, aes(x = village, y = rooms)) +
geom_jitter(aes(color = respondent_wall_type), alpha = 0.5)

Use what you just learned to create a scatter plot of rooms by village with the respondent_wall_type showing in different colors. Is this a good way to show this type of data?

ggplot(data = interviews_plotting, aes(x = village, y = rooms)) +
geom_jitter(aes(color = respondent_wall_type))

Difficult to distinguish between villages in above plot

Boxplots

#######################################################################################################################
ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = rooms)) +
geom_boxplot()

Add points

ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = rooms)) +
geom_boxplot(alpha = 0) +
geom_jitter(alpha = 0.5, color = "tomato")

ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = rooms)) +
geom_jitter(alpha = 0.5, color = "tomato") +
geom_boxplot()
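
The two plots above differ only in layer order: layers are drawn in the order they are added, so the last one sits on top. A minimal sketch on hypothetical data (toy and the helper layer_geoms are made up for illustration):

```r
library(ggplot2)

# Hypothetical toy data, not the SAFI data
toy <- data.frame(
  respondent_wall_type = rep(c("burntbricks", "muddaub"), each = 4),
  rooms = c(1, 2, 2, 3, 1, 1, 2, 4)
)
base <- ggplot(toy, aes(x = respondent_wall_type, y = rooms))

# Boxplot first, points second: points are drawn over the boxes
points_on_top <- base +
  geom_boxplot(alpha = 0) +
  geom_jitter(alpha = 0.5, color = "tomato")

# Points first, boxplot second: the boxes can hide points
box_on_top <- base +
  geom_jitter(alpha = 0.5, color = "tomato") +
  geom_boxplot()

# Illustrative helper: list the geom of each layer in drawing order
layer_geoms <- function(p) {
  vapply(p$layers, function(l) class(l$geom)[1], character(1))
}
layer_geoms(points_on_top)
layer_geoms(box_on_top)
```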

Exercise

#######################################################################################################################

Make violin plot

ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = rooms)) +
geom_violin(alpha = 0) +
geom_jitter(alpha = 0.5, color = "tomato")

Create a boxplot for liv_count for each wall type - overlay on a jitter layer to show actual measurements

ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = liv_count)) +
geom_boxplot(alpha = 0) +
geom_jitter(alpha = 0.5)

Add color based on irrigation association membership

ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = liv_count)) +
geom_boxplot(alpha = 0) +
geom_jitter(aes(color = memb_assoc), alpha = 0.5)

Barplots

#######################################################################################################################

ggplot(data = interviews_plotting, aes(x = respondent_wall_type)) +
geom_bar()

Fill bars by village counts

ggplot(data = interviews_plotting, aes(x = respondent_wall_type)) +
geom_bar(aes(fill = village))

Side-by-side bars

ggplot(data = interviews_plotting, aes(x = respondent_wall_type)) +
geom_bar(aes(fill = village), position = "dodge")

Create new datasets to plot prop. of each housing type in each village

Also remove houses with cement walls as there is only one in the dataset

percent_wall_type <- interviews_plotting %>%
filter(respondent_wall_type != "cement") %>%
count(village, respondent_wall_type) %>%
group_by(village) %>%
mutate(percent = n / sum(n)) %>%
ungroup()
View(percent_wall_type)

ggplot(percent_wall_type, aes(x = village, y = percent, fill = respondent_wall_type)) +
geom_bar(stat = "identity", position = "dodge")
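
A quick way to check the count / group_by / mutate steps is to confirm the proportions add up to 1 within each village. This sketch uses a made-up tibble (toy, with villages "A" and "B", is hypothetical):

```r
library(dplyr)

# Hypothetical toy data, not the SAFI data
toy <- tibble(
  village = c("A", "A", "A", "A", "B", "B"),
  respondent_wall_type = c("muddaub", "muddaub", "burntbricks",
                           "sunbricks", "muddaub", "burntbricks")
)

percent_toy <- toy %>%
  count(village, respondent_wall_type) %>%   # one row per village and wall type
  group_by(village) %>%                      # so sum(n) is the village total
  mutate(percent = n / sum(n)) %>%
  ungroup()

# Proportions should total 1 inside every village
totals <- percent_toy %>%
  group_by(village) %>%
  summarize(total = sum(percent))
totals
```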

Exercise

#######################################################################################################################

Create a bar plot showing the proportion of respondents in each village who are or are not part of an irrigation association (memb_assoc). Include only respondents who answered that question in the calculations and plot. Which village had the lowest proportion of respondents in an irrigation association?

percent_memb_assoc <- interviews_plotting %>%
filter(!is.na(memb_assoc)) %>%
count(village, memb_assoc) %>%
group_by(village) %>%
mutate(percent = n / sum(n)) %>%
ungroup()
View(percent_memb_assoc)

ggplot(percent_memb_assoc, aes(x = village, y = percent, fill = memb_assoc)) +
geom_bar(stat = "identity", position = "dodge")
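
To answer the "lowest proportion" question without reading it off the plot, the summary table can be filtered and sorted. A sketch on hypothetical data (toy and its "yes"/"no" values are made up; check how memb_assoc is coded in your own data):

```r
library(dplyr)

# Hypothetical toy data, not the SAFI data
toy <- tibble(
  village = c("A", "A", "A", "B", "B", "B", "B"),
  memb_assoc = c("yes", "no", NA, "yes", "no", "no", "no")
)

percent_toy <- toy %>%
  filter(!is.na(memb_assoc)) %>%   # drop respondents who did not answer
  count(village, memb_assoc) %>%
  group_by(village) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup()

# Keep the "yes" rows and take the smallest proportion
lowest <- percent_toy %>%
  filter(memb_assoc == "yes") %>%
  arrange(percent) %>%
  slice(1)
lowest
```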

Adding Labels and Titles

#######################################################################################################################

ggplot(percent_wall_type, aes(x = village, y = percent, fill = respondent_wall_type)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title="Proportion of wall type by village", x="Village", y="Percent")

Faceting

######################################################################################################################

ggplot(percent_wall_type, aes(x = respondent_wall_type, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title="Proportion of wall type by village",
x="Wall Type",
y="Percent") +
facet_wrap(~ village, ncol=1)
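
A minimal sketch of what facet_wrap(~ village, ncol = 1) does to the panel layout, on hypothetical data (toy is made up). Inspecting ggplot_build() output is only for illustration here.

```r
library(ggplot2)

# Hypothetical toy data, not the SAFI data
toy <- data.frame(
  village = rep(c("A", "B", "C"), each = 2),
  respondent_wall_type = rep(c("burntbricks", "muddaub"), times = 3),
  percent = c(0.4, 0.6, 0.7, 0.3, 0.5, 0.5)
)

p <- ggplot(toy, aes(x = respondent_wall_type, y = percent)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ village, ncol = 1)

# One panel per village, stacked in a single column
panel_layout <- ggplot_build(p)$layout$layout
panel_layout[, c("PANEL", "ROW", "COL", "village")]
```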

Add white background and remove grid

ggplot(percent_wall_type, aes(x = respondent_wall_type, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title="Proportion of wall type by village",
x="Wall Type",
y="Percent") +
facet_wrap(~ village) +
theme_bw() +
theme(panel.grid = element_blank())

Plot prop. of respondents in each village who owned a particular item

First calc. the percentage of people in each village who owned each item

table(interviews$village)

percent_items <- interviews_plotting %>%
gather(items, items_owned_logical, bicycle:no_listed_items) %>%
filter(items_owned_logical) %>%
count(items, village) %>%
# add a column with the number of people in each village
mutate(people_in_village = case_when(village == "Chirodzo" ~ 39,
village == "God" ~ 43,
village == "Ruaca" ~ 49)) %>%
mutate(percent = n / people_in_village)
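
The same gather / filter / count / case_when steps on a tiny made-up table, so each intermediate result is easy to follow (toy, its villages "A" and "B", and the village sizes 2 and 1 are hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per household, one logical column per item
toy <- tibble(
  village = c("A", "A", "B"),
  bicycle = c(TRUE, FALSE, TRUE),
  radio   = c(TRUE, TRUE, FALSE)
)

percent_items_toy <- toy %>%
  gather(items, items_owned_logical, bicycle:radio) %>%  # wide to long
  filter(items_owned_logical) %>%                        # keep owned items only
  count(items, village) %>%                              # owners per item and village
  # add a column with the number of people in each village
  mutate(people_in_village = case_when(village == "A" ~ 2,
                                       village == "B" ~ 1)) %>%
  mutate(percent = n / people_in_village)
percent_items_toy
```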

ggplot(percent_items, aes(x = village, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ items) +
theme_bw() +
theme(panel.grid = element_blank())

ggplot2 themes

######################################################################################################################

link to ggplot2 themes:

http://docs.ggplot2.org/current/ggtheme.html

Exercise

Experiment with at least two different themes. Build the previous plot using each of those themes. Which do you like best?

Customization

######################################################################################################################

ggplot2 cheat sheet

https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf

Change labels to be more informative

ggplot(percent_items, aes(x = village, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ items) +
labs(title = "Percent of respondents in each village who owned each item",
x = "Village",
y = "Percent of Respondents") +
theme_bw()

Increase font size

ggplot(percent_items, aes(x = village, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ items) +
labs(title = "Percent of respondents in each village who owned each item",
x = "Village",
y = "Percent of Respondents") +
theme_bw() +
theme(text=element_text(size = 16))

Rotate x-axis labels, and introduce newlines

ggplot(percent_items, aes(x = village, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ items) +
labs(title = "Percent of respondents in each village \n who owned each item",
x = "Village",
y = "Percent of Respondents") +
theme_bw() +
theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 45, hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
text = element_text(size = 16))

Save theme and center plot title

grey_theme <- theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 45, hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
text = element_text(size = 16),
plot.title = element_text(hjust = 0.5))

ggplot(percent_items, aes(x = village, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ items) +
labs(title = "Percent of respondents in each village \n who owned each item",
x = "Village",
y = "Percent of Respondents") +
grey_theme
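
A saved theme is an ordinary R object, so it can be added to any plot. A sketch on hypothetical data (toy and small_theme are made up). One design note: a complete theme such as theme_bw() resets earlier theme settings, so the saved tweaks are added after it.

```r
library(ggplot2)

# Hypothetical saved theme with two of the settings used above
small_theme <- theme(axis.text.x = element_text(angle = 45, hjust = 0.5, vjust = 0.5),
                     plot.title = element_text(hjust = 0.5))

# Hypothetical toy data, not the SAFI data
toy <- data.frame(village = c("A", "B"), percent = c(0.4, 0.6))

# Reuse on a bar plot (default grey background stays)
p1 <- ggplot(toy, aes(x = village, y = percent)) +
  geom_bar(stat = "identity") +
  labs(title = "First plot") +
  small_theme

# Reuse on a different plot, layered on top of theme_bw()
p2 <- ggplot(toy, aes(x = percent, y = village)) +
  geom_point() +
  labs(title = "Second plot") +
  theme_bw() +
  small_theme
```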

Export plot using commands

my_plot <- ggplot(percent_items, aes(x = village, y = percent)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ items) +
labs(title = "Percent of respondents in each village \n who owned each item",
x = "Village",
y = "Percent of Respondents") +
theme_bw() +
theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 45, hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
text = element_text(size = 16),
plot.title = element_text(hjust = 0.5))

Create folder 'fig_output'

ggsave("fig_output/items_by_village_barplot.png", my_plot, width = 15, height = 10)
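
The folder can also be created from R before saving. A minimal sketch, assuming a made-up plot and a temporary folder so nothing is left in the project (toy, toy_plot, and the file name toy_barplot.png are hypothetical):

```r
library(ggplot2)

# Hypothetical toy plot, not the SAFI data
toy <- data.frame(village = c("A", "B"), percent = c(0.4, 0.6))
toy_plot <- ggplot(toy, aes(x = village, y = percent)) +
  geom_bar(stat = "identity")

# Create the output folder if it does not exist yet
# (a temporary location here; in the workshop project it would be "fig_output")
out_dir <- file.path(tempdir(), "fig_output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# ggsave picks the file type from the extension; width and height are in inches
out_file <- file.path(out_dir, "toy_barplot.png")
ggsave(out_file, toy_plot, width = 15, height = 10)
file.exists(out_file)
```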