# 2020-Data-Carpentry-Social-Science

# workshop schedule: https://ucsdlib.github.io/2020-02-04-UCSDsocsci/

## This workshop qualifies for the Co-Curricular Record!
## https://elt.ucsd.edu/ccr/index.html

If you'd like this workshop to show up on your co-curricular record, follow the instructions in the CCR portal, then email Stephanie (slabou@ucsd.edu) two items:
1) text file with the JSON extract from our lesson on OpenRefine
2) your R script from Wednesday's R lesson

---

## This HackMD: https://hackmd.io/@U2NG/ryazym6x8

## Instructors:
**Stephanie Labou (Library) - slabou@ucsd.edu
Reid Otsuji (Library) - rotsuji@ucsd.edu
Rick McCosh (OBGYN/Reprod Sci)
Justin Shaffer (Pediatrics)**

## Collaborative Notes
**These are the collaborative notes for the workshop.** Please type your notes here!

## sticky notes
Blue - "I need help"
Orange - "I'm good!"

### #######Sign-In Here######
### Please sign in:

**Full Name, Affiliation (Faculty, student, post-doc, staff), department or Lab**

Reid Otsuji, Librarian, Library
Rick McCosh, Postdoc, OB/GYN + Repro. Sci
Ryan Johnson, Librarian, Library
Stephanie Labou, Librarian, Library
Leonidas Mylonakis, Alumni, History
Veronica Hoyo, Staff, ACTRI-Health Sciences
Evie Xinqi Guo, Graduate Student, Experimental Psychology
Alexandre Gomide, Visiting Scholar, GPS-UCSD
Julian Borba, Visiting Scholar, CILAS-UCSD
Peipei Zhu, post-doc, Croker Lab, Pediatrics
Danielle Fritts, Graduate Student, Visiting
Rita Kuckertz, Visiting Graduate Student
Bryan Kehr, Library IT

# ########Day 1 - Notes###########

Lesson information: https://datacarpentry.org/spreadsheets-socialsci/

### Workshop Data download:
https://ndownloader.figshare.com/articles/6262019/versions/4
https://datacarpentry.org/openrefine-socialsci/setup.html

### information about the SAFI Teaching Database data set:
https://datacarpentry.org/socialsci-workshop/data/

---

### For Data Management Best Practices section
**Download and unzip on your desktop:**
https://librarycarpentry.org/lc-shell/data/shell-lesson.zip

**Git Bash for Windows setup:**
https://librarycarpentry.org/lc-shell/setup.html

## Data Organization in Spreadsheets
* Leave raw data raw
* keep a record of any cleaning steps you do

**Exercise: what's wrong with the spreadsheet?**

Problems identified:
* multiple tables on a single sheet
* typos
* missing values - different values used
* working on different sheets - need to combine data
* unknown numeric values
* colored cells
* KeyID issues
* Merged cells
* use of underscores
* no dates
* notes in cells

**How to clean up:**
Consistent values
be explicit - clean up will create more values in the spreadsheet - that's ok, R can handle it.
R likes 'NA' for missing values
R can handle column names of various lengths but does not like spaces, or numbers at the beginning (use an underscore `_` instead of a space)

**Common formatting problems:**
using multiple tables and/or tabs
not filling in zeros
confusing field names
inconsistent column names

**Excel stores dates as numbers**
`Tip for dates in Excel: separate dates into columns (year, month, day)`

## OpenRefine Lesson
To import `.zip` or `.tar` projects use the `import project` command

**Why use OpenRefine:**
* Data is often very messy. OpenRefine provides a set of tools to allow you to identify and amend the messy data.
* It is important to know what you did to your data. Additionally, journals, granting agencies, and other institutions are requiring documentation of the steps you took when working with your data. With OpenRefine, you can capture all actions applied to your raw data and share them with your publication as supplemental material.
* All actions are easily reversed in OpenRefine.
* If you save your work it will be to a new file. OpenRefine always uses a copy of your data and does not modify your original dataset.
* Data cleaning steps often need repeating with multiple files. OpenRefine keeps track of all of your actions and allows them to be applied to different datasets.
* Some concepts such as clustering algorithms are quite complex, but OpenRefine makes it easy to introduce them, use them, and show their power.
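
To make the clustering idea mentioned above a bit more concrete, here is a rough R sketch of "key collision" clustering: each value is reduced to a normalized key, and values that share a key are grouped together. This is a simplified illustration only, not OpenRefine's actual implementation; the `fingerprint()` helper and the messy village spellings are made up for this example.

```r
# Simplified "fingerprint" key: lowercase, trim, drop punctuation,
# then sort and de-duplicate the whitespace-separated tokens.
# (Hypothetical helper for illustration; OpenRefine does more than this.)
fingerprint <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[[:punct:]]", "", x)
  vapply(
    strsplit(x, "[[:space:]]+"),
    function(tokens) paste(sort(unique(tokens)), collapse = " "),
    character(1)
  )
}

# Made-up messy spellings, not taken from the workshop data
villages <- c("Chirodzo", "chirodzo ", "Ruaca", "Ruaca-Nhamuenda", "ruaca")

# Values with the same key land in the same cluster
clusters <- split(villages, fingerprint(villages))
clusters
```

Each element of `clusters` holds the original spellings that collapsed to one key, which is roughly what you review when OpenRefine offers to merge a cluster.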
for more OpenRefine information: google `openrefine libguide`
You can find out a lot more about OpenRefine at http://openrefine.org
More about GREL: https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions

# Data Management Best practices
Data Sharing SNAFU youtube video: https://www.youtube.com/watch?v=66oNv_DJuPc

ICPSR (social science data repository - great for finding data, can also deposit data): https://www.icpsr.umich.edu/icpsrweb/

UCSD's research data curation group in the library: https://library.ucsd.edu/research-and-collections/data-curation/

# R
R package for non-English language text mining: https://www.bnosac.be/index.php/blog/72-natural-language-processing-for-non-english-languages-with-udpipe

# Day 2 Sign-in
**Full Name, Affiliation (Faculty, student, post-doc, staff), department or Lab**

Reid Otsuji, Librarian, Library
Rick McCosh, Postdoc, OB/GYN + Repro. Sci.
Stephanie Labou, Librarian, Library
Veronica Hoyo, Staff, ACTRI-Health Sciences
Peipei Zhu, Postdoc, Pediatrics
Leonidas Mylonakis, Alumni, History
Alexandre Gomide, Visiting Scholar, GPS-UCSD
Danielle Fritts, Visiting Graduate Student
Justin Shaffer, Postdoc, Pediatrics
Bryan Kehr, Library IT
Rita Kuckertz, Visiting Graduate Student
Evie Xinqi Guo, Graduate Student, Experimental Psychology

# Useful links:
### R Workshop Walkthrough
https://datacarpentry.org/r-socialsci/
### Dataset download
https://ndownloader.figshare.com/articles/6262019/versions/4
### Dataset overview
https://datacarpentry.org/socialsci-workshop/data/
### logical operators in R
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html
### Hadley Wickham's Tidy Data paper
https://vita.had.co.nz/papers/tidy-data.pdf
### ggplot2 themes
http://docs.ggplot2.org/current/ggtheme.html
### ggplot2 cheat sheet
https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf

# Day 2 notes:
**Assigning variables:** assign variables using the `<-` assignment operator

**be thoughtful about variable naming:**
* yes: letters, numbers, period, underscore
* no: spaces, dashes
* no: start with a number or underscore

**variable name styles:**
* CamelCase
* Underscore_between_words
* periods.between.words

### Functions
Examples:
```
sqrt(9)        # square root function
b <- sqrt(2)   # assign function output to a variable
round(3.1415)
args(round)
?round         # for function help
```
Function errors will display in the console.

### Vectors
Examples:
```
hh_members <- c(3, 7, 10, 6)   # c() means concatenate and is used to create vectors
hh_wall_type <- c("muddaub", "burntbricks", "sunbricks")
length(hh_members)
length(hh_wall_type)
class(hh_members)
class(hh_wall_type)
```
R data types:
* logical
* integer
* numeric
* complex
* character

Tip: R is indexed beginning at 1
Tip: remember R is case sensitive - be consistent

#### adding to vectors
```
possessions <- c("bike", "radio", "tv")
possessions <- c(possessions, "cellPhone")   # add to the end of the vector
possessions                                  # print the updated vector
```
Data structures
* vector
* list
* matrix
* array
* data frame (tibble)

```
num_char <- c(1, 2, 3, "a")
class(num_char)   # mixed types are coerced to character
tricky <- c(1, 2, 3, "4")
num <- c(1, 2, 3)
class(num)
int <- as.integer(num)
class(int)
# as.integer()
# as.character()
# as.numeric()
```
### subsetting
```
hh_wall_type
hh_wall_type[2]
hh_wall_type[c(1, 3)]
# hh_wall_type[1, 3]   # error: a vector has one dimension, so use c(1, 3)
more_wall_types <- hh_wall_type[c(1, 2, 3, 3, 4, 1)]   # index 4 is out of range and returns NA
more_wall_types
```
### conditional subsetting
```
hh_members
hh_members > 5
hh_members[hh_members > 5]
possessions
possessions[possessions == "tv" | possessions == "bike"]      # search for one item at a time
possessions %in% c("car", "bicycle", "motorcycles", "boat")   # search for multiple items at once
```
### missing data - NA
```
rooms <- c(2, 1, 1, NA, 4)
mean(rooms)
max(rooms)
mean(rooms, na.rm = TRUE)
args(mean)
is.na(rooms)
rooms[!is.na(rooms)]
```
# working with a dataset
### installing packages
```
install.packages("tidyverse")
library(tidyverse)
```
### loading data
```
interviews <- read_csv("SAFI_clean.csv")
View(interviews)
```
```
head(interviews)
tail(interviews)
class(interviews)
str(interviews)
nrow(interviews)
ncol(interviews)
names(interviews)
summary(interviews)
```
# subsetting
```
interviews[1, 6]          # output as a tibble (data frame)
interviews[1]
interviews[[1]]           # output as a vector
interviews[1:6, ]         # all of the columns and rows 1 - 6
interviews[4:7, ]         # subset
interviews[ , -1]
interviews["village"]
interviews[["village"]]
interviews$village        # use $ to access a column e.g. dataFrameName$columnName
```
```
# factors
respondent_floor_type <- factor(c("earth", "cement", "cement", "earth"))
class(respondent_floor_type)
levels(respondent_floor_type)
unique(respondent_floor_type)
respondent_floor_type
as.character(respondent_floor_type)
memb_assoc <- interviews$memb_assoc
memb_assoc <- as.factor(memb_assoc)
memb_assoc
plot(memb_assoc)
levels(memb_assoc)
```
# Dates
R and R packages like dates in yyyy-mm-dd format
```
install.packages("lubridate")   # if you need to install the package
library(lubridate)
dates <- interviews$interview_date
interviews$day <- day(dates)
interviews$month <- month(dates)
interviews$year <- year(dates)
View(interviews)
```
# R - PART 2
#### Dplyr and Tidyr
```
select(interviews, village, no_membrs, years_liv)
filter(interviews, village == "God")
```
```
interviews2 <- filter(interviews, village == "God")
interviews2
```
```
interviews_god <- select(interviews2, no_membrs, years_liv)
interviews_god
```
```
# nested functions
interviews_god <- select(filter(interviews, village == "God"), no_membrs, years_liv)
# pipes
interviews_god <- interviews %>%
  filter(village == "God") %>%
  select(no_membrs, years_liv)
interviews_god
```
Exercise solution:
```
interviews %>%
  filter(memb_assoc == "yes") %>%
  select(affect_conflicts, liv_count, no_meals)
```
# mutate - creating a new column
```
new_variable <- interviews %>%
  mutate(people_per_room = no_membrs / rooms)
View(new_variable)
```
```
interviews %>%
  filter(!is.na(memb_assoc)) %>%
  mutate(people_per_room = no_membrs / rooms)
```
Exercise solution:
```
interviews %>%
  mutate(total_meals = no_membrs * no_meals) %>%
  select(village, total_meals)

interviews_total_meals <- interviews %>%
  mutate(total_meals = no_membrs * no_meals) %>%
  filter(total_meals > 20) %>%
  select(village, total_meals)
View(interviews_total_meals)
```
# group_by function
Average household size by village
```
interviews %>%
  group_by(village) %>%
  summarize(mean_no_membrs = mean(no_membrs))
```
```
interviews %>%
  filter(!is.na(memb_assoc)) %>%
  group_by(village, memb_assoc) %>%
  summarize(mean_no_membrs = mean(no_membrs))
```
```
interviews %>%
  filter(!is.na(memb_assoc)) %>%
  group_by(village, memb_assoc) %>%
  summarize(mean_no_membrs = mean(no_membrs),
            min_membrs = min(no_membrs))
```
```
interviews %>%
  filter(!is.na(memb_assoc)) %>%
  group_by(village, memb_assoc) %>%
  summarize(mean_no_membrs = mean(no_membrs),
            min_membrs = min(no_membrs)) %>%
  arrange(desc(min_membrs))
```
```
interviews %>%
  count(village)
interviews %>%
  count(village, sort = TRUE)
```
Exercise solution
```
interviews %>%
  group_by(village) %>%
  summarize(
    mean_no_membrs = mean(no_membrs),
    min_no_membrs = min(no_membrs),
    max_no_membrs = max(no_membrs),
    n = n()
  )
```
### spreading
```
interviews_spread <- interviews %>%
  mutate(wall_type_logical = TRUE) %>%
  spread(key = respondent_wall_type, value = wall_type_logical, fill = FALSE)
```
### gathering
```
interviews_gather <- interviews_spread %>%
  gather(key = respondent_wall_type, value = "wall_type_logical", burntbricks:sunbricks) %>%
  filter(wall_type_logical) %>%
  select(-wall_type_logical)
# each cell should only have a single data value
```
```
interviews_items_owned <- interviews %>%
  separate_rows(items_owned, sep = ";") %>%
  mutate(items_owned_logical = TRUE) %>%
  spread(key = items_owned, value = items_owned_logical, fill = FALSE)
View(interviews_items_owned)
nrow(interviews_items_owned)
```
```
interviews_items_owned %>%
  filter(bicycle) %>%
  group_by(village) %>%
  count(bicycle)
```
```
interviews_items_owned %>%
  mutate(number_items = rowSums(select(., bicycle:television))) %>%
  group_by(village) %>%
  summarize(mean_items = mean(number_items))
```
Exercise Solution part 1:
```
interviews_months_lack_food <- interviews %>%
  separate_rows(months_lack_food, sep = ";") %>%
  mutate(months_lack_food_logical = TRUE) %>%
  spread(key = months_lack_food, value = months_lack_food_logical, fill = FALSE)
```
Exercise Solution part 2:
How many months (on average) were respondents without food if they did belong to an irrigation association? What about if they didn’t?
```
interviews_months_lack_food %>%
  mutate(number_months = rowSums(select(., Apr:Sept))) %>%
  group_by(memb_assoc) %>%
  summarize(mean_months = mean(number_months))
```
```
# This is what to export for plotting:
interviews_plotting <- interviews %>%
  ## spread data by items_owned
  separate_rows(items_owned, sep = ";") %>%
  mutate(items_owned_logical = TRUE) %>%
  spread(key = items_owned, value = items_owned_logical, fill = FALSE) %>%
  rename(no_listed_items = `<NA>`) %>%
  ## spread data by months_lack_food
  separate_rows(months_lack_food, sep = ";") %>%
  mutate(months_lack_food_logical = TRUE) %>%
  spread(key = months_lack_food, value = months_lack_food_logical, fill = FALSE) %>%
  ## add some summary columns
  mutate(number_months_lack_food = rowSums(select(., Apr:Sept))) %>%
  mutate(number_items = rowSums(select(., bicycle:television)))
```
```
# saving to .csv
write_csv(interviews_plotting, path = "data_output/interviews_plotting.csv")
```
# Plotting with GGPLOT2
```
library(tidyverse)
```
```
ggplot(data = interviews_plotting)
```
```
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items))
```
#### scatter plot
```
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
  geom_point()
```
assign ggplot to a variable
```
interviews_plot <- ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items))
```
now you can use the variable and change geom_
```
interviews_plot +
  geom_point()
```
https://datacarpentry.org/r-socialsci/04-ggplot2/index.html

# Code from R for dplyr and ggplot sections

#######################################################################################################################
#######################################################################################################################
# 2020.02.05
# Data Carpentries - R for Social Scientists
# Justin Shaffer
# justinparkshaffer@gmail.com
#######################################################################################################################
#######################################################################################################################

# Set working directory
#######################################################################################################################
getwd()
setwd("~/Google-Drive-UCSD/R/2020_carpentries_social_sci/")

# Install and load libraries needed for analysis
#######################################################################################################################
install.packages("tidyverse")
install.packages("tidyr")
install.packages("ggplot2")
install.packages("plyr")
install.packages("dplyr")
library(tidyverse)
library(ggplot2)

# Read in dataset
#######################################################################################################################
interviews <- read_csv("data/SAFI_clean.csv", na = "NULL")

# Inspecting dataframes
#######################################################################################################################
# Viewing interviews
View(interviews)
# Data types
class(interviews)
# Size
dim(interviews)
nrow(interviews)
ncol(interviews)
# Content
head(interviews)
tail(interviews)
# Names
names(interviews)
# Summary
str(interviews)
summary(interviews)

# Indexing and subsetting
#######################################################################################################################
## first element in the first column of the data frame (returned as a tibble)
interviews[1, 1]
## first element in the 6th column (returned as a tibble)
interviews[1, 6]
## first column of the data frame (as a vector)
interviews[[1]]
## first column of the data frame (as a data.frame)
interviews[1]
## first three elements in the 7th column (returned as a tibble)
interviews[1:3, 7]
## the 3rd row of the data frame (as a data.frame)
interviews[3, ]
## The whole data frame, except the first column
interviews[, -1]
## Equivalent to head(interviews)
interviews[-c(7:131), ]
interviews["village"]       # Result is a data frame
interviews[, "village"]     # Result is a data frame
interviews[["village"]]     # Result is a vector
interviews$village          # Result is a vector

# Exercise
#######################################################################################################################
## 1.
interviews_100 <- interviews[100, ]
## 2.
# Saving `n_rows` to improve readability and reduce duplication
n_rows <- nrow(interviews)
interviews_last <- interviews[n_rows, ]
## 3.
interviews_middle <- interviews[(n_rows / 2), ]
## 4.
interviews_head <- interviews[-(7:n_rows), ]

# Factors
#######################################################################################################################
# Create variable
respondent_floor_type <- factor(c("earth", "cement", "cement", "earth"))
# Check levels
levels(respondent_floor_type)
# Display number of levels
nlevels(respondent_floor_type)
# Original order
respondent_floor_type
# After re-ordering
respondent_floor_type <- factor(respondent_floor_type, levels = c("earth", "cement"))
respondent_floor_type
levels(respondent_floor_type)
levels(respondent_floor_type)[2] <- "brick"
levels(respondent_floor_type)
respondent_floor_type

# Converting factors
#######################################################################################################################
as.character(respondent_floor_type)
# One wrong way - without a warning!
year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(year_fct)
# One solution
as.numeric(as.character(year_fct))
# The recommended solution
as.numeric(levels(year_fct))[year_fct]

# Renaming factors
#######################################################################################################################
## create a vector from the data frame column "memb_assoc"
memb_assoc <- interviews$memb_assoc
## convert it into a factor
memb_assoc <- as.factor(memb_assoc)
## let's see what it looks like
memb_assoc
plot(memb_assoc)
## Let's recreate the vector from the data frame column "memb_assoc"
memb_assoc <- interviews$memb_assoc
## replace the missing data with "undetermined"
memb_assoc[is.na(memb_assoc)] <- "undetermined"
## convert it into a factor
memb_assoc <- as.factor(memb_assoc)
## let's see what it looks like
memb_assoc
## bar plot of the number of interview respondents who were
## members of irrigation association:
plot(memb_assoc)

# Exercise
#######################################################################################################################
levels(memb_assoc)
levels(memb_assoc) <- c("No", "Undetermined", "Yes")
memb_assoc <- factor(memb_assoc, levels = c("No", "Yes", "Undetermined"))
plot(memb_assoc)

##############################################################################################################################################################################################################################################
# Learning dplyr and tidyr
#######################################################################################################################
#######################################################################################################################
## load the tidyverse
library(tidyverse)
interviews <- read_csv("data/SAFI_clean.csv", na = "NULL")
## inspect the data
interviews
## preview the data
View(interviews)

# Selecting columns and filtering rows (subsetting)
#######################################################################################################################
# select function - first argument is dataset, others are columns to keep
select(interviews, village, no_membrs, years_liv)
# filter function - choose rows based on specific criteria
filter(interviews, village == "God")

# Pipes - chaining multiple functions in sequence
#######################################################################################################################
## option1: intermediate steps
interviews2 <- filter(interviews, village == "God")
interviews2
interviews_god <- select(interviews2, no_membrs, years_liv)
interviews_god
## option2: nested functions
interviews_god <- select(filter(interviews, village == "God"), no_membrs, years_liv)
## option3: pipes
interviews_god <- interviews %>%
  filter(village == "God") %>%
  select(no_membrs, years_liv)
interviews_god

# tangent on logical operators
#######################################################################################################################
# https://stat.ethz.ch/R-manual/R-devel/library/base/html/Logic.html
# !x: not x
# AND: &, &&
# OR: |, ||

# Exercise
#######################################################################################################################
# Using pipes, subset the interviews data to include interviews where respondents were members of an irrigation association (memb_assoc) and retain only the columns affect_conflicts, liv_count, and no_meals.
interviews %>%
  filter(memb_assoc == "yes") %>%
  select(affect_conflicts, liv_count, no_meals)

# Mutate - create new columns based on values in other columns
#######################################################################################################################
## What if we want to know the ratio of the number of people in a household to the number of rooms they use to sleep?
new_variable <- interviews %>%
  mutate(people_per_room = no_membrs / rooms)
View(new_variable)
## Does being a member of an irrigation assoc. affect the ratio above? First, we'll remove non-responders...
interviews %>%
  filter(!is.na(memb_assoc)) %>%
  mutate(people_per_room = no_membrs / rooms)

# Exercise
#######################################################################################################################
## Create a new data frame from the interviews data that meets the following criteria: contains only the village column and a new column called total_meals containing a value that is equal to the total number of meals served in the household per day on average (no_membrs times no_meals). Only the rows where total_meals is greater than 20 should be shown in the final data frame. Hint: think about how the commands should be ordered to produce this data frame!
interviews %>%
  mutate(total_meals = no_membrs * no_meals) %>%
  select(village, total_meals)
interviews_total_meals <- interviews %>%
  mutate(total_meals = no_membrs * no_meals) %>%
  filter(total_meals > 20) %>%
  select(village, total_meals)
View(interviews_total_meals)

# Split-apply-combine data analysis and the summarize() function
#######################################################################################################################
# Split data into groups, apply some analysis to each group, then combine the results
# Average household size, by village
interviews %>%
  group_by(village) %>%
  summarize(mean_no_membrs = mean(no_membrs))
# Group multiple columns
interviews %>%
  group_by(village, memb_assoc) %>%
  summarize(mean_no_membrs = mean(no_membrs))
# Exclude NAs from previous output using a filter step
interviews %>%
  filter(!is.na(memb_assoc)) %>%
  group_by(village, memb_assoc) %>%
  summarize(mean_no_membrs = mean(no_membrs))
# Can summarize multiple variables at once - add minimum household size for each village group
interviews %>%
  filter(!is.na(memb_assoc)) %>%
  group_by(village, memb_assoc) %>%
  summarize(mean_no_membrs = mean(no_membrs),
            min_membrs = min(no_membrs))
# Sort on min_membrs to put smallest value first
interviews %>%
  filter(!is.na(memb_assoc)) %>%
  group_by(village, memb_assoc) %>%
  summarize(mean_no_membrs = mean(no_membrs),
            min_membrs = min(no_membrs)) %>%
  arrange(min_membrs)
# Descending order
interviews_new <- interviews %>%
  filter(!is.na(memb_assoc)) %>%
  group_by(village, memb_assoc) %>%
  summarize(mean_no_membrs = mean(no_membrs),
            min_membrs = min(no_membrs)) %>%
  arrange(desc(min_membrs))
View(interviews_new)

# Counting - determine the number of observations for each factor or combination of factors
#######################################################################################################################
# Count rows of data per village
interviews %>%
  count(village)
# Count and sort in decreasing order
interviews %>%
  count(village, sort = TRUE)

# Exercise
#######################################################################################################################
# How many households in the survey have an average of two meals per day? Three meals per day? Are there any other numbers of meals represented?
interviews %>%
  count(no_meals)
# Use group_by() and summarize() to find the mean, min, and max number of household members for each village. Also add the number of observations (hint: see ?n).
interviews %>%
  group_by(village) %>%
  summarize(
    mean_no_membrs = mean(no_membrs),
    min_no_membrs = min(no_membrs),
    max_no_membrs = max(no_membrs),
    n = n()
  )
# Skip last exercise

# Reshaping with gather and spread
#######################################################################################################################
## What if instead of comparing records, we wanted to look at differences in households grouped by different types of housing construction materials?
# Spreading (long to wide)
## Because both the key and value parameters must come from column values, we will create a dummy column (we’ll name it wall_type_logical) to hold the value TRUE
interviews_spread <- interviews %>%
  mutate(wall_type_logical = TRUE) %>%
  spread(key = respondent_wall_type, value = wall_type_logical, fill = FALSE)
View(interviews_spread)
# Gathering (wide to long)
interviews_gather <- interviews_spread %>%
  gather(key = respondent_wall_type, value = "wall_type_logical", burntbricks:sunbricks)
View(interviews_gather)
## Now have four rows per interview respondent - filter all but those that are TRUE
interviews_gather <- interviews_spread %>%
  gather(key = "respondent_wall_type", value = "wall_type_logical", burntbricks:sunbricks) %>%
  filter(wall_type_logical) %>%
  select(-wall_type_logical)
View(interviews_gather)

# Applying spread() to clean our data
#######################################################################################################################
# Tidy up raw data - some columns have multiple values in a single cell (e.g. items owned)
View(interviews)
str(interviews$items_owned)
interviews_items_owned <- interviews %>%
  separate_rows(items_owned, sep = ";") %>%
  mutate(items_owned_logical = TRUE) %>%
  spread(key = items_owned, value = items_owned_logical, fill = FALSE)
View(interviews_items_owned)
nrow(interviews_items_owned)
# Rename NA column
interviews_items_owned <- interviews_items_owned %>%
  rename(no_listed_items = `<NA>`)
# Now can summarize differently - show number of people in each village who owned a particular item
interviews_items_owned %>%
  filter(computer) %>%
  group_by(village) %>%
  count(computer)
# Calculate average number of items owned by respondents in each village
interviews_items_owned %>%
  mutate(number_items = rowSums(select(., bicycle:television))) %>%
  group_by(village) %>%
  summarize(mean_items = mean(number_items))

# Exercise
#######################################################################################################################
# Create a new data frame (named interviews_months_lack_food) that has one column for each month and records TRUE or FALSE for whether each interview respondent was lacking food in that month
interviews_months_lack_food <- interviews %>%
  separate_rows(months_lack_food, sep = ";") %>%
  mutate(months_lack_food_logical = TRUE) %>%
  spread(key = months_lack_food, value = months_lack_food_logical, fill = FALSE)
View(interviews_months_lack_food)
# How many months (on average) were respondents without food if they did belong to an irrigation association? What about if they didn’t?
interviews_months_lack_food %>%
  mutate(number_months = rowSums(select(., Apr:Sept))) %>%
  group_by(memb_assoc) %>%
  summarize(mean_months = mean(number_months))

# Exporting data
#######################################################################################################################
# Create new folder 'data_output'
# Prepare for plotting - create version where each column includes only one data value
## Use spread to expand months_lack_food and items_owned
interviews_plotting <- interviews %>%
  ## spread data by items_owned
  separate_rows(items_owned, sep = ";") %>%
  mutate(items_owned_logical = TRUE) %>%
  spread(key = items_owned, value = items_owned_logical, fill = FALSE) %>%
  rename(no_listed_items = `<NA>`) %>%
  ## spread data by months_lack_food
  separate_rows(months_lack_food, sep = ";") %>%
  mutate(months_lack_food_logical = TRUE) %>%
  spread(key = months_lack_food, value = months_lack_food_logical, fill = FALSE) %>%
  ## add some summary columns
  mutate(number_months_lack_food = rowSums(select(., Apr:Sept))) %>%
  mutate(number_items = rowSums(select(., bicycle:television)))
# Save new file
write_csv(interviews_plotting, path = "data_output/interviews_plotting.csv")

##############################################################################################################################################################################################################################################
# Plotting with ggplot2
#######################################################################################################################
#######################################################################################################################
library(tidyverse)
ggplot(data = interviews_plotting)
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items))
# scatter plot - two continuous variables
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
  geom_point()
# Assign plot to a variable
interviews_plot <-
  ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items))
# Draw the plot
interviews_plot +
  geom_point()
## This is the correct syntax for adding layers
interviews_plot +
  geom_point()
## This will not add the new layer and will return an error message
## (the + must end the first line, not start the second), so it is commented out here
# interviews_plot
#   + geom_point()

# Building your plots iteratively
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
  geom_point()
# Make points transparent
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
  geom_point(alpha = 0.5)
# Jitter points
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
  geom_jitter(alpha = 0.5)
# Color points
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
  geom_jitter(alpha = 0.7, color = "blue")
# Color by village - new aes function to specify across subsets of points
ggplot(data = interviews_plotting, aes(x = no_membrs, y = number_items)) +
  geom_jitter(aes(color = village), alpha = 0.5)
ggplot(data = interviews_plotting, aes(x = village, y = rooms)) +
  geom_jitter(aes(color = respondent_wall_type), alpha = 0.5)
# Use what you just learned to create a scatter plot of rooms by village with the respondent_wall_type showing in different colors. Is this a good way to show this type of data?
ggplot(data = interviews_plotting, aes(x = village, y = rooms)) +
  geom_jitter(aes(color = respondent_wall_type))
# Difficult to distinguish between villages in above plot

# Boxplots #############################################################

ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = rooms)) +
  geom_boxplot()

# Add points
ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = rooms)) +
  geom_boxplot(alpha = 0) +
  geom_jitter(alpha = 0.5, color = "tomato")

# Layer order matters: here the boxplot is drawn on top of the points
ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = rooms)) +
  geom_jitter(alpha = 0.5, color = "tomato") +
  geom_boxplot()

# Exercise #############################################################

# Make violin plot
ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = rooms)) +
  geom_violin(alpha = 0) +
  geom_jitter(alpha = 0.5, color = "tomato")

# Create a boxplot for liv_count for each wall type - overlay on a jitter layer to show actual measurements
ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = liv_count)) +
  geom_boxplot(alpha = 0) +
  geom_jitter(alpha = 0.5)

# Add color based on irrigation association membership
## alpha is a fixed value, so it goes outside aes(); only mapped variables go inside
ggplot(data = interviews_plotting, aes(x = respondent_wall_type, y = liv_count)) +
  geom_boxplot(alpha = 0) +
  geom_jitter(aes(color = memb_assoc), alpha = 0.5)

# Barplots #############################################################

ggplot(data = interviews_plotting, aes(x = respondent_wall_type)) +
  geom_bar()

# Fill bars by village counts
ggplot(data = interviews_plotting, aes(x = respondent_wall_type)) +
  geom_bar(aes(fill = village))

# Side-by-side bars
ggplot(data = interviews_plotting, aes(x = respondent_wall_type)) +
  geom_bar(aes(fill = village), position = "dodge")

# Create new datasets to plot the proportion
# of each housing type in each village
## Also remove houses with cement walls as there is only one in the dataset
percent_wall_type <- interviews_plotting %>%
  filter(respondent_wall_type != "cement") %>%
  count(village, respondent_wall_type) %>%
  group_by(village) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup()

View(percent_wall_type)

ggplot(percent_wall_type, aes(x = village, y = percent, fill = respondent_wall_type)) +
  geom_bar(stat = "identity", position = "dodge")

# Exercise #############################################################

## Create a bar plot showing the proportion of respondents in each village who are
## or are not part of an irrigation association (memb_assoc). Include only respondents
## who answered that question in the calculations and plot.
## Which village had the lowest proportion of respondents in an irrigation association?
percent_memb_assoc <- interviews_plotting %>%
  filter(!is.na(memb_assoc)) %>%
  count(village, memb_assoc) %>%
  group_by(village) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup()

View(percent_memb_assoc)

ggplot(percent_memb_assoc, aes(x = village, y = percent, fill = memb_assoc)) +
  geom_bar(stat = "identity", position = "dodge")

# Adding Labels and Titles #############################################

## x is village in this plot, so the x-axis label is "Village"
ggplot(percent_wall_type, aes(x = village, y = percent, fill = respondent_wall_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of wall type by village",
       x = "Village",
       y = "Percent")

# Faceting #############################################################

ggplot(percent_wall_type, aes(x = respondent_wall_type, y = percent)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of wall type by village",
       x = "Wall Type",
       y = "Percent") +
  facet_wrap(~ village, ncol = 1)

# Add white background and remove grid
ggplot(percent_wall_type, aes(x = respondent_wall_type, y = percent)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of wall type by village",
       x = "Wall Type",
       y = "Percent") +
  facet_wrap(~ village) +
  theme_bw() +
  theme(panel.grid = element_blank())

# Plot the proportion of respondents in each village who owned a particular item
## First calculate
## the percentage of people in each village who owned each item

## Number of respondents per village (used for people_in_village below)
table(interviews$village)

percent_items <- interviews_plotting %>%
  gather(items, items_owned_logical, bicycle:no_listed_items) %>%
  filter(items_owned_logical) %>%
  count(items, village) %>%
  ## add a column with the number of people in each village
  mutate(people_in_village = case_when(village == "Chirodzo" ~ 39,
                                       village == "God" ~ 43,
                                       village == "Ruaca" ~ 49)) %>%
  mutate(percent = n / people_in_village)

ggplot(percent_items, aes(x = village, y = percent)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ items) +
  theme_bw() +
  theme(panel.grid = element_blank())

# ggplot2 themes #######################################################

# link to ggplot2 themes:
# http://docs.ggplot2.org/current/ggtheme.html

# Exercise
# Experiment with at least two different themes. Build the previous plot using each of those themes.
# Which do you like best?

# Customization ########################################################

# ggplot2 cheat sheet
# https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf

# Change labels to be more informative
ggplot(percent_items, aes(x = village, y = percent)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ items) +
  labs(title = "Percent of respondents in each village who owned each item",
       x = "Village",
       y = "Percent of Respondents") +
  theme_bw()

# Increase font size
ggplot(percent_items, aes(x = village, y = percent)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ items) +
  labs(title = "Percent of respondents in each village who owned each item",
       x = "Village",
       y = "Percent of Respondents") +
  theme_bw() +
  theme(text = element_text(size = 16))

# Rotate x-axis labels, and introduce newlines
ggplot(percent_items, aes(x = village, y = percent)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ items) +
  labs(title = "Percent of respondents in each village \n who owned each item",
       x = "Village",
       y = "Percent of Respondents") +
  theme_bw() +
  theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 45,
                                   hjust = 0.5, vjust = 0.5),
        axis.text.y = element_text(colour = "grey20", size = 12),
        text = element_text(size = 16))

# Save theme and center plot title
grey_theme <- theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 45,
                                               hjust = 0.5, vjust = 0.5),
                    axis.text.y = element_text(colour = "grey20", size = 12),
                    text = element_text(size = 16),
                    plot.title = element_text(hjust = 0.5))

ggplot(percent_items, aes(x = village, y = percent)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ items) +
  labs(title = "Percent of respondents in each village \n who owned each item",
       x = "Village",
       y = "Percent of Respondents") +
  grey_theme

# Export plot using commands
my_plot <- ggplot(percent_items, aes(x = village, y = percent)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ items) +
  labs(title = "Percent of respondents in each village \n who owned each item",
       x = "Village",
       y = "Percent of Respondents") +
  theme_bw() +
  theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 45,
                                   hjust = 0.5, vjust = 0.5),
        axis.text.y = element_text(colour = "grey20", size = 12),
        text = element_text(size = 16),
        plot.title = element_text(hjust = 0.5))

# Create folder 'fig_output' first, then save the plot into it
ggsave("fig_output/items_by_village_barplot.png", my_plot, width = 15, height = 10)
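
The themes exercise in these notes has no recorded answer, so here is one possible sketch. It assumes a small made-up data frame, `toy_items` (a hypothetical stand-in for `percent_items`, with invented values rather than SAFI results), so that it runs on its own. `theme_minimal()` and `theme_classic()` are just two of the complete themes that ship with ggplot2; any other pair would work the same way.

```r
library(ggplot2)

# Hypothetical stand-in for percent_items (values are invented, not SAFI results)
toy_items <- data.frame(
  village = rep(c("Chirodzo", "God", "Ruaca"), times = 2),
  items   = rep(c("bicycle", "radio"), each = 3),
  percent = c(0.4, 0.3, 0.5, 0.6, 0.7, 0.8)
)

# Build the shared layers once, then add a different complete theme to each copy
base_plot <- ggplot(toy_items, aes(x = village, y = percent)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ items)

plot_minimal <- base_plot + theme_minimal()
plot_classic <- base_plot + theme_classic()

# Print each plot to compare the two themes
plot_minimal
plot_classic
```

Storing the shared layers in `base_plot` means only the theme differs between the two versions, which makes them easier to compare. With the real data, `percent_items` would replace `toy_items`.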