---
tags: bym4102
---
This page is viewable and editable [in this link](https://hackmd.io/@_XewoO5-TXqsfl7SIIJhLQ/HJXk-XBL_/edit)
# Tidy data
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In **tidy data**:
1. Every column is a variable.
2. Every row is an observation.
3. Every cell is a single value.
[source](https://tidyr.tidyverse.org/articles/tidy-data.html#tidy-data)
## Tidying messy datasets
Real datasets can, and often do, violate the three precepts of tidy data in almost every way imaginable. While occasionally you do get a dataset that you can start analysing immediately, this is the exception, not the rule. This section describes the five most common problems with messy datasets, along with their remedies:
- Column headers are values, not variable names. *Example datasets*: `relig_income`, `billboard` or Johns Hopkins [coronavirus data](https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv)
- Multiple variables are stored in one column. *Example datasets*: `who`
- Variables are stored in both rows and columns. *Example datasets:* `weather` [link](https://raw.githubusercontent.com/tidyverse/tidyr/master/vignettes/weather.csv)
- Multiple types of observational units are stored in the same table. *Example datasets*: `billboard` should be two tables *song* and *ranking*
- A single observational unit is stored in multiple tables. *Use `purrr::map` to read and combine multiple files*
```r
# try this code at rstudio.cloud project
fs::dir_ls(path="data/fuel-economy/",recurse = T, glob = "*.csv") %>%
purrr::map_dfr(read_csv)
```
# Separate columns
> Please checkout "Data import" [cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-import.pdf) page 2 for `separate` and `pivot_longer/_wider` usage
Consider the table below
| Department | StudentId | Courses |
| ---------- | --------- | ----------------------- |
| Bioeng-en | 05A001 | BYM1001,BYM1005 |
| Bioeng-tr | 056004 | BYM1002 |
| Bioeng-en | 05A010 | BYM1003,BYM1005,BYM1001 |
Department column contains to separate information, department and language. Courses column is more problematic, arbitrary number of courses are listed in single cell. This column is practically useless since we can not use it for joining or other processes.
Let's tackle both problems
```r
tibble::tribble(
~Department, ~StudentId, ~Courses,
"Bioeng-en", "05A001", "BYM1001,BYM1005",
"Bioeng-tr", "056004", "BYM1002",
"Bioeng-en", "05A010", "BYM1003,BYM1005,BYM1001"
)
```
# Pivot (longer) aka `gather`
Here's visual explanation

[Image Source](https://fromthebottomoftheheap.net/2019/10/25/pivoting-tidily/)
Consider the following data (inspired by `who` data)
```r
tibble::tribble(
~continent, ~country, ~m_2020_old, ~m_2020_new, ~f_2020_old, ~f_2020_new, ~m_2019_old, ~m_2019_new, ~f_2019_old, ~f_2019_new,
"Asia", "Afghanistan", 10L, 15L, 5L, 9L, NA, NA, 2L, 3L,
"Europe", "Albania", NA, NA, NA, NA, 10L, 15L, 12L, 18L,
"Africa", "Algeria", NA, 8L, NA, 12L, NA, 9L, NA, 20L
)
```
which looks like following
| continent | country | m_2020_old | m_2020_new | f_2020_old | f_2020_new | m_2019_old | m_2019_new | f_2019_old | f_2019_new |
| --------- | ----------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| Asia | Afghanistan | 10 | 15 | 5 | 9 | NA | NA | 2 | 3 |
| Europe | Albania | NA | NA | NA | NA | 10 | 15 | 12 | 18 |
| Africa | Algeria | NA | 8 | NA | 12 | NA | 9 | NA | 20 |
## Pivot column selection
We need to indicate which columns to pivot. All variations below select same columns to pivot
```r
data %>%
pivot_longer(m_2020_old:f_2019_new, names_to="treatment", values_to = "cases")
# OR
pivot_longer(-(1:2), names_to="treatment", values_to = "cases")
# OR
pivot_longer(-c(continent,country), names_to="treatment", values_to = "cases")
# OR
pivot_longer(contains("20"), names_to="treatment", values_to = "cases")
# OR
pivot_longer(where(is.numeric), names_to="treatment", values_to = "cases")
```
`pivot_longer` has many arguments which help reshaping the data. For instance, by using various `pivot_longer` arguments we can skip using additional `separate` or `as.numeric` conversions.
```r
my_who %>%
pivot_longer(-(continent:country), names_to="treatment", values_to = "cases") %>%
separate(treatment, into=c("gender","year","method")) %>%
mutate(year=as.numeric(year))
```
is equivalent to
```r
my_who %>%
pivot_longer(-(continent:country),
names_sep = "_",
names_to=c("gender","year","method"),
names_transform = list(year=as.numeric),
values_to = "cases")
```
# Coronavirus data
Here's coronavirus data which needs pivoting.
```r
initial <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv") %>%
janitor::clean_names()
```
Can you calculate daily **new** cases? (*hint*:`dply::lag()`)
# Exercise questions
*As bonus homework: +5 points to final score. Due date: May 7th,2021 9:30. Please submit as knitted html including your comments/instructions/descriptions*
The data files are under `datacamp-data/` folder. Most of the files are in `.rds` format, which is special format for R data. You can import the data with `readRDS` function.
* netflix data:
* longest movie (not TV show) and its director
* which actor/actress worked with which director the most
* in which movies the director is also in the cast?
* medals data:
* which country won the most medals?
* who is the athlete that won most medals (list name, sport and count)
* who is the athlete that won second most "gold" medals (list name and count)
* who are the athletes that got medals in two different sports after 1976 (included) ? (list name, sport, year, medal, team)
* planet_metrics_wide.rds:
* which planets have gravity greater than 9 (included)? (list planet and its gravity)
* planet_moons_list_df.rds: (BONUS - tidyr chapter4 - Rectangling data)
* which planet has the largest moon? (list the planet and its moon name and data)
* bird_df:
* the winner bird in average user points (imagine doing this calculation in Excel)