# Collaborative Document. Day 3, Sept 28th
2022-09-28 R for Social Scientists
Welcome to The Workshop Collaborative Document
This Document is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.
All content is publicly available under the Creative Commons Attribution License
https://creativecommons.org/licenses/by/4.0/
----------------------------------------------------------------------------
This is the document for the 26th: [link](https://hackmd.io/@o3DWHyfCQNqBUaAA1JO-_A/HJkxiTA-o/edit)
This is the document for the 27th: [link](https://hackmd.io/@o3DWHyfCQNqBUaAA1JO-_A/H1iIiNkGj/edit)
This is the document for today: [link](https://hackmd.io/@o3DWHyfCQNqBUaAA1JO-_A/HJfCD0lGo/edit)
## 👮Code of Conduct
* Participants are expected to follow these guidelines:
* Use welcoming and inclusive language
* Be respectful of different viewpoints and experiences
* Gracefully accept constructive criticism
* Focus on what is best for the community
* Show courtesy and respect towards other community members
## ⚖️ License
All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/
## 🙋Getting help
To ask a question or get help, type in the chat window or in this document; helpers will try to help you.
## 🖥 Workshop website
The workshop website can be found [here](https://steltenpower.github.io/2022-09-26-dc-socsci-R-nlesc-dccpo-online/).
### 🛠 Setup
The general setup of the workshop can be found [here](https://datacarpentry.org/socialsci-workshop/setup-r-workshop.html).
#### For today:
- R
- RStudio
- `install.packages("tidyverse")` and `install.packages("here")`
- [SAFI_clean.csv](https://ndownloader.figshare.com/files/11492171)
## About the data
For more information about the dataset, and to download it from Figshare, check out the [Social Sciences workshop data page](http://www.datacarpentry.org/socialsci-workshop/data).
## 👩🏫👩💻🎓 Instructors
Ruud Steltenpool, Rick de Klerk
## 🧑🙋 Helpers
Rins Rutgers, Margriet Miedema
## Check-in: name (pronouns) | organization | what is your best language after English and Dutch
## Starting with data
### Importing and loading data
```r=
library(tidyverse)
library(here)
interviews <- read_csv(
    here("data", "SAFI_clean.csv"),
    na = "NULL")
```
RStudio may show warnings because the libraries have not been loaded yet.
For reproducible package management: https://rstudio.github.io/packrat/walkthrough.html
### Inspecting
```r=
View(interviews)
print(interviews)
dim(interviews)
nrow(interviews)
ncol(interviews)
head(interviews)
tail(interviews)
head(interviews, n = 9)
tail(interviews, n = 3)
names(interviews)
names(interviews)[1:3]
colnames(interviews)
str(interviews)
summary(interviews)
glimpse(interviews)
```
`print(interviews, n = 10, width = Inf)` prints all columns and 10 rows.
### Indexing
```r=
interviews[1, 1] # row, column: returns dbl 1
interviews[1, 2] # returns chr "God"
interviews[[1]]
sum(interviews[[1]])
interviews[1,] # the entire first row
interviews[2:4] # second through fourth column
interviews[2:4, ] # second through fourth row
interviews[-1] # everything except the first column
subset <- interviews[-c(2:130), 1:2] # wrap the range in c() or parentheses: -2:130 would mix negative and positive indices
interviews["village"]
interviews[1:3, "village"]
interviews[1, ]
interviews$village # vector
```
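The listing above mixes three ways of pulling data out of a tibble, and the difference is easy to miss. Here is a minimal sketch on a made-up tibble (`toy` and its values are hypothetical, not the SAFI data). Single brackets keep the tibble, while double brackets and `$` return the column as a plain vector.

```r
library(tibble)

# Hypothetical toy data, not the SAFI dataset
toy <- tibble(key_ID = 1:3, village = c("God", "God", "Chirodzo"))

single <- toy["village"]     # still a tibble, with one column
double <- toy[["village"]]   # a character vector
dollar <- toy$village        # the same character vector

is_tibble(single)            # check that single brackets kept the tibble
is.character(double)         # check that double brackets returned a vector
identical(double, dollar)    # check that [[ ]] and $ agree
```

This is why `sum(interviews[[1]])` works on the column values, whereas `interviews[1]` would still be a one-column tibble.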
**Exercise**
1. Create a tibble (interviews_100) containing only the data in row 100 of the interviews dataset.
1. Notice how nrow() gave you the number of rows in the tibble?
- Use that number to pull out just that last row in the tibble.
- Compare that with what you see as the last row using tail() to make sure it’s meeting expectations.
- Pull out that last row using nrow() instead of the row number.
- Create a new tibble (interviews_last) from that last row.
1. Using the number of rows in the interviews dataset that you found in question 2, extract the row that is in the middle of the dataset. Store the content of this middle row in an object named interviews_middle. (Hint: this dataset has an odd number of rows, so finding the middle is a bit trickier than dividing n_rows by 2. Use the median() function and what you’ve learned about sequences in R to extract the middle row!)
1. Combine nrow() with the - notation above to reproduce the behavior of head(interviews), keeping just the first through 6th rows of the interviews dataset.
```r=
## 1.
interviews_100 <- interviews[100, ]
## 2.
# Saving `n_observations` to improve readability and reduce duplication
n_observations <- nrow(interviews) # total number of rows
interviews_last <- interviews[n_observations, ]
## 3.
interviews_middle <- interviews[median(1:n_observations), ]
## 4.
interviews_head <- interviews[-(7:n_observations), ]
```
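To see why `median(1:n_observations)` finds the middle row, it can help to try it on a tiny table. The sketch below uses a hypothetical five-row tibble (`toy`, `id` and `value` are made up for illustration). For an odd number of rows, the median of the sequence is the same as `(n_rows + 1) / 2`.

```r
library(tibble)

# Hypothetical toy data with an odd number of rows
toy <- tibble(id = 1:5, value = c(10, 20, 30, 40, 50))

n_rows <- nrow(toy)
middle_index <- median(1:n_rows)   # middle of the sequence 1, 2, ..., n_rows
toy_middle <- toy[middle_index, ]  # the row at that position

# For an odd n_rows this matches (n_rows + 1) / 2
middle_index == (n_rows + 1) / 2
```

With an even number of rows the median of the sequence is no longer a whole number, so it is safer to round it explicitly (for example with `ceiling()`) before using it as a row index.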
## Factors
```r=
## Creating a factor with 2 levels:
respondent_floor_type <- factor(c("earth", "cement", "cement", "earth"))
## Showing levels:
levels(respondent_floor_type)
## Count levels:
nlevels(respondent_floor_type)
respondent_floor_type <- fct_recode(respondent_floor_type, brick = "cement")
levels(respondent_floor_type)
respondent_floor_type <- factor(respondent_floor_type, ordered = TRUE)
respondent_floor_type
text <- as.character(respondent_floor_type)
year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(year_fct) # Wrong: this returns the underlying integer codes, not the years
as.numeric(as.character(year_fct))
memb_assoc <- interviews$memb_assoc
memb_assoc
memb_assoc <- factor(memb_assoc)
memb_assoc
## Also include the NAs
memb_assoc <- interviews$memb_assoc
# if a value is NA, replace it with "undetermined"
memb_assoc[is.na(memb_assoc)] <- "undetermined"
# And convert it back to a factor to save memory.
memb_assoc <- as.factor(memb_assoc)
# Rename the levels:
memb_assoc <- fct_recode(memb_assoc, No = "no",
                         Undetermined = "undetermined", Yes = "yes")
memb_assoc <- factor(memb_assoc, levels = c("No", "Yes", "Undetermined"))
plot(memb_assoc)
```
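The year conversion above is a common stumbling block, so here is a small base R sketch of what happens. The variable names `codes`, `years` and `years_alt` are only for illustration. Factor levels are sorted, and `as.numeric()` on a factor returns the position of each value within those levels rather than the value itself.

```r
year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))

levels(year_fct)                        # "1977" "1983" "1990" "1998"

# Integer codes: the position of each value within the sorted levels
codes <- as.numeric(year_fct)           # 3 2 1 4 3

# Going through character recovers the original years
years <- as.numeric(as.character(year_fct))

# An alternative that converts the levels once and then indexes them
years_alt <- as.numeric(levels(year_fct))[year_fct]

identical(years, years_alt)             # both routes give the same years
```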
### Dates
```r=
library(lubridate)
```
## Data wrangling
```r=
## load the tidyverse
library(tidyverse)
library(here)
interviews <- read_csv(here("data", "SAFI_clean.csv"), na = "NULL")
## inspect the data
interviews
## preview the data
# view(interviews)
```
### Select and filter
```r=
select(interviews, village, no_membrs, months_lack_food) # seleCt is for Columns
select(interviews, c("village", "no_membrs", "months_lack_food"))
select(interviews, village:respondent_wall_type)
filter(interviews, village == "Chirodzo") # filteR is for Rows
filter(interviews, village == "Chirodzo",
       rooms > 1,
       no_meals > 2)
filter(interviews, village == "Chirodzo" &
       rooms > 1 &
       no_meals > 2) # does the same as the previous statement
filter(interviews, village == "Chirodzo" | village == "Ruaca") # | means OR
```
### Pipes
```r=
filter(select(interviews, village:respondent_wall_type), village == "Chirodzo" | village == "Ruaca")
# %>% is the pipe operator from the tidyverse; the native pipe |> has been part of base R since version 4.1
interviews %>%
    filter(village == "Chirodzo") %>%
    select(village:respondent_wall_type)
subset <- interviews %>%
    filter(village == "Chirodzo") %>%
    select(village:respondent_wall_type) # the result is now stored in the variable subset
```
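For simple chains like the ones above, the two pipes can be used interchangeably. Here is a minimal sketch on made-up data, assuming R 4.1 or newer for `|>`. The tibble `toy` and its values are hypothetical, and `%in%` is used in the second chain as a shorter way to write the OR condition.

```r
library(dplyr)

# Hypothetical toy data, not the SAFI dataset
toy <- tibble(
    village = c("Chirodzo", "Ruaca", "God"),
    rooms   = c(1, 3, 2)
)

# magrittr pipe from the tidyverse
with_magrittr <- toy %>%
    filter(village == "Chirodzo" | village == "Ruaca") %>%
    select(village)

# native pipe from base R (requires R >= 4.1)
with_native <- toy |>
    filter(village %in% c("Chirodzo", "Ruaca")) |>
    select(village)

identical(with_magrittr, with_native)   # both chains select the same rows
```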
**Exercise**
Using pipes, subset the `interviews` data to include interviews where respondents were members of an irrigation association (`memb_assoc`) and retain only the columns `affect_conflicts`, `liv_count`, and `no_meals`.
```r=
interviews %>%
    filter(memb_assoc == "yes") %>%
    select(affect_conflicts, liv_count, no_meals)
```
### Mutate
```r=
interviews %>%
mutate(people_per_room = no_membrs / rooms) %>%
select(no_membrs, rooms, people_per_room)
interviews %>%
filter(!is.na(memb_assoc)) %>%
mutate(people_per_room = no_membrs / rooms) %>%
select(no_membrs, rooms, people_per_room)
```
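The second chain above drops rows with a missing `memb_assoc` before computing the new column. Here is a small sketch of that pattern on made-up data (the tibble `toy` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with one missing membership value
toy <- tibble(
    memb_assoc = c("yes", NA, "no"),
    no_membrs  = c(4, 6, 9),
    rooms      = c(2, 3, 3)
)

kept <- toy %>%
    filter(!is.na(memb_assoc)) %>%                  # drop the row where memb_assoc is NA
    mutate(people_per_room = no_membrs / rooms)     # add the derived column

kept
```

Only the two rows with a known `memb_assoc` remain, and `people_per_room` is computed for those rows only.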
**Exercise**
Create a new dataframe from the `interviews` data that meets the following criteria: contains only the `village` column and a new column called `total_meals` containing a value that is equal to the total number of meals served in the household per day on average (`no_membrs` times `no_meals`). Only the rows where `total_meals` is greater than 20 should be shown in the final dataframe.
**Hint:** think about how the commands should be ordered to produce this data frame!
```r=
interviews_total_meals <- interviews %>%
mutate(total_meals = no_membrs * no_meals) %>%
filter(total_meals > 20) %>%
select(village, total_meals)
```