Data wrangling: Tidy data

Learning objectives

  • Define a tibble
  • Demonstrate how vectors can be read and parsed
  • Define various data file formats and functions for importation
  • Define tidy data and its characteristics
  • Practice tidying data

Notes

The purpose of this lecture is to learn how we import/arrange data that we want to work with in R. Typically, stats classes focus on doing actual stats so you don't get "real" data. Real data isn't necessarily the prettiest thing, so in this lecture, we'll teach you how to import and clean real data!

Importing data – CSV, Excel, SPSS, SAS, RDS etc.

read.csv is a base R function, while read_csv is a tidyverse function from the readr package. read_csv is generally faster and returns a tibble instead of a plain data frame; the speed difference becomes more noticeable as files grow to millions of rows.
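
To make the difference concrete, here is a minimal sketch that reads the same file with both functions. The file contents and the names path, base_df and tidy_df are made up for illustration, and show_col_types assumes readr 2.0 or later.

```r
library(readr)

# Write a small, made-up CSV to a temporary file so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,Ana,90.5", "2,Ben,88"), path)

base_df <- read.csv(path)                          # base R: returns a data.frame
tidy_df <- read_csv(path, show_col_types = FALSE)  # readr: returns a tibble

class(base_df)
class(tidy_df)
```

On a file this small you won't notice a speed difference; the point is only that both calls take a path and that the returned objects have different classes.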

The vroom package is another option. It reads delimited files and is built with the single intention of maximizing efficiency.

You can import almost any file type into R. However, if you will be using only R, an RDS file is a fast, R-native format for storing data, and it preserves R-specific details such as factor columns. If you are working collaboratively, or want people who might not use R to be able to work with your data, consider other formats like .csv or .feather files.
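
As a rough sketch of that trade-off, the following saves one small data frame as both RDS and CSV and reads each back. The data frame scores and the temporary file paths are hypothetical; only base R functions are used.

```r
# A small, made-up data frame with a factor column
scores <- data.frame(
  id    = 1:3,
  group = factor(c("a", "b", "a")),
  score = c(90.5, 88, 79.25)
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

# RDS round trip: the object comes back with its R types intact
saveRDS(scores, rds_path)
from_rds <- readRDS(rds_path)

# CSV round trip: portable, but types are re-guessed from text on import
write.csv(scores, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)

class(from_rds$group)  # still a factor
class(from_csv$group)  # no longer a factor under R 4.0+ defaults
```

The CSV can be opened by anyone, in any tool, which is why it remains the safer choice for sharing.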

read_excel allows you to import .xls and .xlsx files into R. It comes from the readxl package and follows the same conventions as the tidyverse read functions.
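
A short sketch of read_excel in use, assuming the readxl package is installed. It relies on the example workbook that ships with readxl, so no file of your own is needed; the variable names are illustrative.

```r
library(readxl)

# readxl ships example workbooks; readxl_example() returns the path to one
xlsx_path <- readxl_example("datasets.xlsx")

excel_sheets(xlsx_path)                            # list the sheet names
iris_tbl <- read_excel(xlsx_path, sheet = "iris")  # read one sheet by name
head(iris_tbl)
```

With your own data, replace xlsx_path with the path to your workbook and pick the sheet by name or position.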

We recommend not trying to deal with SAS or SPSS data formats if possible. If you can't find your data in any other format (typically a .csv will be available), you can use the haven package. SAS and SPSS data structures do not map cleanly onto R because they attach value labels to coded variables, which is not the same thing as an R factor. haven imports these columns as labelled vectors, which you usually need to convert to factors yourself.
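
The sketch below shows what a labelled column looks like and how it can be converted, assuming the haven package is installed. The vector sex is a hypothetical stand-in for a column you might get back from read_sav(); it is built by hand here so the example runs without an SPSS file.

```r
library(haven)

# Hypothetical survey column: numeric codes with attached value labels
sex <- labelled(c(1, 2, 2, 1), labels = c(male = 1, female = 2))

class(sex)        # a labelled vector, not a factor
is.factor(sex)

# Convert codes to an ordinary R factor using the labels as levels
sex_factor <- as_factor(sex)
levels(sex_factor)
```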

Tidy Data Format

Tidy data considers the variables, observations and values of data. It isn't an end-all be-all method of arranging data, but you'll find that it helps a lot.

Rules for tidy data:

  • every column holds a single variable
  • every row holds a single observation
  • every cell holds a single value

Tidy data is a relatively new idea, so lots of data on the internet isn't arranged in a tidy format. Here, we'll learn how to turn un-tidy data into tidy data!

Pivoting, Separating, Uniting

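Pivot Longer: Used when the values of a single variable are spread across the column names (for example, one column per year). It gathers those columns into two new columns: one holding the old column names and one holding the values. This is frequently used to turn panel or time series data into tidy data.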

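# Example of Pivoting Longer

library(tidyverse)  # loads tidyr (pivot_longer, table4a) and the %>% pipe

# Gather the 1999 and 2000 columns into Year/Cases pairs
table4a %>%
    pivot_longer(cols = c(`1999`, `2000`), names_to = "Year",
                 values_to = "Cases")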

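Pivot Wider: Used when a single observation is scattered across multiple rows, with variable names stored in one column and their values in another. It spreads those names out into separate columns so that each observation fits in one row. Remember, each row should be a single observation.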

# Example of Pivoting Wider

table2 %>%
    pivot_wider(names_from = type, values_from = count)

We don't need to use quotes around type and count here because they refer to columns that already exist in the data frame. In the Pivot Longer example, Year and Cases do not exist yet, so we had to supply their names as strings.
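
The quoting rule is easier to see in a round trip on a tiny table. This is a sketch on made-up data: the tibble wide and its values are hypothetical, and the column names year and cases are chosen only for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical untidy table: one column per year
wide <- tribble(
  ~country, ~`1999`, ~`2000`,
  "A",           10,      20,
  "B",           30,      40
)

# year and cases do not exist yet, so they are given as strings
long <- wide %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year",
               values_to = "cases")

# year and cases now exist as columns, so they are referenced without quotes
back <- long %>%
  pivot_wider(names_from = year, values_from = cases)

long
back
```

Here long has one row per country and year, and back has the same shape as wide, which shows that the two pivots undo each other.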

Separating: Used to split up cells where more than one value is stored, putting each value into its own distinct column.
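
A minimal sketch of separating, using a made-up tibble in which each rate cell packs two numbers into one string. The names rates, cases and population are hypothetical. Recent tidyr versions mark separate() as superseded in favor of separate_wider_delim(), but separate() still works.

```r
library(tidyr)
library(tibble)

# Hypothetical table: the rate column stores "cases/population" in one cell
rates <- tribble(
  ~country, ~rate,
  "A",      "10/100",
  "B",      "30/200"
)

# Split rate at the slash into two columns; convert = TRUE turns the
# resulting text pieces into numbers
split_rates <- rates %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

split_rates
```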