
---
tags: section
---

# Section: Birthdays

### Objectives

By the end of this section, you will be able to:

* use the `stringr` and `lubridate` libraries
* save a cleaned data set with a date tag and re-load it

### Setup

In order to proceed, you will need to install a few R libraries for data cleaning. Open RStudio, and then run the following commands in the console:

~~~{r}
```{r}
install.packages("stringr")
install.packages("lubridate")
```
~~~

As is usual by now, you should write your code and report your findings in an R markdown file. Insert the following setup code at the start of your file. (This section also uses the `unite` function from `tidyr`, so that package is loaded here as well; if it is not already installed from earlier work, install it the same way as the packages above.)

~~~{r}
```{r setup, include = FALSE}
library(lubridate)
library(stringr)
library(dplyr)
library(tidyr)  # provides unite(), used below
```
~~~

As you complete the section, remember to insert code chunks using ```{r}``` notation.

### Birthdays

At the beginning of the year, the TAs filled out a questionnaire about themselves. Unfortunately, we failed to specify a standard date format, so they all entered their birthdays in different ways. In this section, you will be cleaning their birthdays so they can be easily processed in R. Once this is done, you can find out super interesting things about the TA staff (not your own, but rather the TA staff from 2017), such as their average age!

Use the following line of code (which reads strings as strings, i.e., not as factors!) to load the data. Insert it towards the end of your setup chunk.

```{r}
tas <- read.csv("https://cs.brown.edu/courses/cs100/studios/data/5/birthdays.csv", stringsAsFactors = FALSE)
```

This data set is small (there are only eight TAs, and we are only including their birthdays), so you can type `tas` in the console and view the entire data frame. You’ll notice that there are only eight observations and three variables: "Name", "Birthdate", and "Birthyear". Your goal is to combine the information in "Birthdate" and "Birthyear" into a new variable, "Birthday", that stores each TA’s birthday in numeric form.

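To see what `stringsAsFactors = FALSE` changes, here is a minimal sketch on a tiny made-up table. The names and entries below are invented for illustration and are not the real TA data. It relies on the `text` argument of `read.csv`, which reads from a string instead of a file.

```r
# Made-up CSV contents, standing in for a file on disk
csv_text <- "Name,Birthdate,Birthyear\nAda,3.10,2002\nGrace,March 5th,1999"

# Strings are kept as plain character vectors
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(toy$Name)    # "character"

# Strings are converted to factors instead
toy_f <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(toy_f$Name)  # "factor"
```

Character vectors are the easier starting point here, since the string functions used later in this section operate on text.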
To get started, observe that one of the records contains no birthday information. This record is not useful, so let’s remove it from the data frame. In a new R code chunk, type:

```{r}
# Omit NAs
tas <- na.omit(tas)
```

Now, to clean the remaining data, let’s combine the "Birthdate" and "Birthyear" variables into one variable, "Birthday", and let’s standardize the format of birthday entries.

First, we need to choose a datatype for birthdays. Observe that "Birthdate" and "Birthyear" are both strings. But storing dates as strings is not that useful, because you cannot easily do arithmetic with strings, and you often want to do arithmetic with dates. For example, if you had a database in which you recorded the times you went to sleep and the times you woke up every night for a month, you might then want to compute the average number of hours of sleep you got that month. You cannot easily do that if your dates/times are stored as strings.

Luckily for us, R has a special datatype designed expressly for this purpose, called (you guessed it!) `Date`. So let’s work towards combining the information in "Birthdate" and "Birthyear" in a single `Date`.

For starters, let's use the `unite` function in the tidyr package (load it with `library(tidyr)` if you have not already) to paste the "Birthdate" and "Birthyear" columns together into a single column called "Birthday" (adding a space between the two):

```{r}
# Combine variables
united <- tas %>% unite(Birthday, Birthdate:Birthyear, sep = " ")
```

To observe the new data frame, type `united` into the console and hit enter.

Because these data are so messy (the format of each entry is different from the last), you will have to clean/standardize them manually, one at a time. There are many [standard date formats](https://en.wikipedia.org/wiki/ISO_8601). For concreteness, let’s go with `yyyymmdd`, meaning a four-digit year, followed by two digits for the month, and then two more digits for the day of the month, with no spaces or other separators between any of the digits.

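The steps so far can be sketched on a small made-up data frame. The names and entries here are invented, not the real TA data, and `toy` and `toy_united` are hypothetical variable names. The last line shows what the `yyyymmdd` target format looks like for one example date.

```r
library(tidyr)

# Made-up data: the third row has no birthday information
toy <- data.frame(
  Name      = c("Ada", "Grace", "Alan"),
  Birthdate = c("3.10", "March 5th", NA),
  Birthyear = c("2002", "1999", NA),
  stringsAsFactors = FALSE
)

# Drop every row that contains an NA
toy <- na.omit(toy)
nrow(toy)  # 2

# Paste Birthdate and Birthyear together, separated by a space
toy_united <- unite(toy, Birthday, Birthdate:Birthyear, sep = " ")
toy_united$Birthday  # "3.10 2002" "March 5th 1999"

# The yyyymmdd format for March 10th, 2002
format(as.Date("2002-03-10"), "%Y%m%d")  # "20020310"
```

Note that `unite` drops the two input columns by default, so `toy_united` ends up with just "Name" and "Birthday".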
One way to proceed is using the `str_replace` function (part of the [`stringr`](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html) package). This function takes as input a vector, the pattern to match, and what to replace the matched patterns with.

Start by making a copy of the "Birthday" variable:

```{r}
birthday <- united$Birthday
```

You will be editing the entries of this copy rather than `united$Birthday` itself. It is best to make a copy of a variable before editing it, so that you don't lose any of the original information.

We can now proceed to edit the new variable "birthday". For example:

```{r}
# Standardize birthday formats
birthday <- str_replace(birthday, "3.10 2002", "20020310")
```

Try this out, and then type `birthday` into the console again and hit enter. You should see almost the same vector as before, but with the first birthday replaced by `"20020310"`. (Note that `str_replace` treats its pattern as a regular expression, in which `.` matches any single character. That is harmless here, because the pattern still matches the entry we intend.)

Replace each birthday in turn, using `str_replace`. Check your work as you proceed by observing `birthday` again and again.

We were particular in our choice of date format, because we knew that it would be easy to convert strings in this format into `Date`s in R. Specifically, the `ymd` function in the [lubridate package](https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html) takes in a vector of strings and converts each string to a `Date`. Try it:

```{r}
birthday <- ymd(birthday)
```

Once again, type `birthday` into the console. Furthermore, type `class(birthday)` into the console. The `class` function returns the data type of its argument. As you can see, `birthday` is a vector of `Date`s. By way of comparison, also check the type (i.e., `class`) of `united$Birthday`.

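Here is a minimal sketch of the whole clean-then-convert step, run on made-up entries rather than the real TA data (`raw`, `clean`, and `dates` are hypothetical names). It wraps each pattern in stringr's `fixed()`, one optional way to make the pattern match literally instead of being read as a regular expression.

```r
library(stringr)
library(lubridate)

# Made-up messy entries, each in a different format
raw <- c("3.10 2002", "March 5th 1999", "12/25 2000")

# Replace one entry at a time; fixed() matches the text literally
clean <- str_replace(raw,   fixed("3.10 2002"),      "20020310")
clean <- str_replace(clean, fixed("March 5th 1999"), "19990305")
clean <- str_replace(clean, fixed("12/25 2000"),     "20001225")
clean  # "20020310" "19990305" "20001225"

# Convert the standardized strings to Dates
dates <- ymd(clean)
class(raw)    # "character"
class(dates)  # "Date"
```

Entries that do not match a given pattern pass through `str_replace` unchanged, which is why the replacements can be applied one after another to the same vector.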
Finally, let's update the `tas` data frame with the new "birthday" variable:

```{r}
# Replace the old vector of strings with the new vector of Dates
united$Birthday <- birthday
# Replace the old data frame with the new
tas <- united
```

Now that you have the TAs' birthdays stored as a vector of `Date`s (which are really numeric values under the hood), you can manipulate these values to compute things like each TA's age.

Add a new variable "Age" to the `tas` data frame that stores all the TAs' ages.

*Hint:* To calculate a TA's age, you can subtract their birthday from `today()`, a function provided by lubridate.

What do you notice when you subtract one date from another? In what units is the answer provided?

Now update "Age" so that it records the TAs' ages in years.

What is the average age of the TAs (in years)?

Joon was also a CS100 TA in 2017. Joon's birthday is April 10th, 1993. Let's insert Joon into the data frame.

One strategy for adding a new observation to an existing data frame in R is to first create a new data frame consisting of just that one observation, and to then use the `rbind` function to bind the two data frames together. To create a new data frame, use the `data.frame` function. And then, before binding, use the `names` function to give the variables in your new data frame the same names as the variables in the original data frame, so that they bind successfully.

Create new variables to match those in your data frame: e.g., `joon_birthday`, `joon_age`. Be sure to store Joon's birthday as a `Date`, so it will be a simple matter to calculate his age.

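As a reminder of how arithmetic on a single `Date` behaves, here is a small sketch with made-up dates (not anyone's real birthday). It uses a fixed reference day instead of `today()` so that the result does not change from day to day. Dividing by 365.25 is only one rough way to convert days to years and is an assumption of this sketch, not the required method.

```r
library(lubridate)

# Made-up dates for illustration
some_birthday <- ymd("20000101")
reference_day <- ymd("20200101")  # fixed stand-in for today()

# Subtracting one Date from another gives a time difference
gap <- reference_day - some_birthday
class(gap)       # "difftime"
units(gap)       # "days"
as.numeric(gap)  # 7305

# Rough conversion to years, assuming an average year of 365.25 days
as.numeric(gap) / 365.25  # 20
```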
Then create a new data frame consisting of a new `joon` observation:

```{r}
joon <- data.frame("Joon", joon_birthday, joon_age)
```

Before binding the two data frames, you should give them matching names:

```{r}
names(joon) <- c("Name", "Birthday", "Age")
```

Finally, use `rbind` to bind the two data frames:

```{r}
tas <- rbind(tas, joon)
```

What is the average age of the TAs, including Joon?
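The same create, rename, and bind pattern can be sketched on a made-up data frame. The people, birthdays, and ages below are invented for illustration, and `staff` and `newcomer` are hypothetical names.

```r
# Made-up existing data frame with the three columns used in this section
staff <- data.frame(
  Name     = c("Ada", "Grace"),
  Birthday = as.Date(c("2000-01-01", "1998-06-15")),
  Age      = c(20, 22),
  stringsAsFactors = FALSE
)

# A one-row data frame; its columns get automatic names at first
newcomer <- data.frame("Linus", as.Date("1999-03-10"), 21,
                       stringsAsFactors = FALSE)

# Give it the same column names so the two data frames line up
names(newcomer) <- names(staff)

# Append the new row and recompute the average
staff <- rbind(staff, newcomer)
nrow(staff)      # 3
mean(staff$Age)  # 21
```

Copying the names with `names(staff)` rather than retyping them avoids spelling mismatches, which would make `rbind` fail.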