
Data wrangling: Relational data and factors

Learning objectives

  • Introduce relational data
  • Demonstrate how tables can be linked to one another
  • Demonstrate methods in dplyr for linking and merging related tables
  • Practice joining tables
  • Define factors
  • Demonstrate how to transform and reorder factors for visualizations
  • Practice transforming factors

Notes

The primary objective of this lecture is to learn how to work with multiple dataframes and join them together.

Relational Data

Often, dataframes will share values for certain variables. Remember Soltoff's example of the superheroes and publishers dataframes: both contain information on the publisher. We want to be able to use that shared variable to join the two dataframes together.

Join Operations

Mutating Joins

Inner Join: Returns only the rows of the original (left) dataframe that have matching values in the new (right) dataframe, with the columns of both combined.

Left Join: Returns all rows in the left dataframe and merges in the shared information from the right dataframe. Left rows without a match get NA in the columns that come from the right.

Right Join: Returns all rows in the right dataframe and merges in the shared information from the left dataframe. Right rows without a match get NA in the columns that come from the left.

Full Join: Keeps all rows from the left and all rows from the right, filling in NA wherever one side has no match.
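
As a minimal sketch of the four mutating joins, the example below builds two small dataframes loosely modeled on the superheroes and publishers example. The column names and values are assumptions made up for illustration, not the lecture's actual data.

```r
library(dplyr)

# Hypothetical data, invented for illustration
superheroes <- data.frame(
  name      = c("Magneto", "Batman", "Hellboy"),
  publisher = c("Marvel", "DC", "Dark Horse Comics")
)
publishers <- data.frame(
  publisher  = c("DC", "Marvel", "Image"),
  yr_founded = c(1934, 1939, 1992)
)

# Only rows whose publisher appears in both dataframes
inner <- inner_join(superheroes, publishers, by = "publisher")

# Every superhero; yr_founded is NA where the publisher has no match
left <- left_join(superheroes, publishers, by = "publisher")

# Every publisher; name is NA where no superhero matches
right <- right_join(superheroes, publishers, by = "publisher")

# Every row from both sides, with NA filling the gaps
full <- full_join(superheroes, publishers, by = "publisher")

# Compare how many rows each join keeps
c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full))
```

Comparing the row counts is a quick way to check which join you actually need: a drop in rows after an inner join means some keys had no match.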

Filtering Joins

Semi Join: Keeps all rows in left that also exist in right. Does not add any of the columns from right.

Anti Join: Keeps all rows in left that DO NOT exist in right.
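
The two filtering joins can be sketched in the same way. The dataframes below are hypothetical and invented for illustration; the point is that the result only ever contains columns from the left dataframe.

```r
library(dplyr)

# Hypothetical data, invented for illustration
superheroes <- data.frame(
  name      = c("Magneto", "Batman", "Hellboy"),
  publisher = c("Marvel", "DC", "Dark Horse Comics")
)
publishers <- data.frame(
  publisher  = c("DC", "Marvel", "Image"),
  yr_founded = c(1934, 1939, 1992)
)

# Superheroes whose publisher appears in publishers; yr_founded is not added
semi <- semi_join(superheroes, publishers, by = "publisher")

# Superheroes whose publisher does NOT appear in publishers
anti <- anti_join(superheroes, publishers, by = "publisher")

names(semi)
anti$name
```

An anti join is often useful as a diagnostic before a mutating join, since it lists the left rows that would end up with NA values.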

Note: you can use the by argument to match key columns that do not share the same name in the two dataframes. The syntax is roughly:

left_join(left, right, by = c("Admin2" = "County"))
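
To make the by syntax concrete, here is a small sketch. The dataframes cases and population and all of their values are hypothetical; only the Admin2 and County column names come from the line above.

```r
library(dplyr)

# Hypothetical data: the key is called Admin2 on the left and County on the right
cases <- data.frame(
  Admin2    = c("Cook", "Lake"),
  confirmed = c(120, 45)
)
population <- data.frame(
  County = c("Cook", "DuPage"),
  pop    = c(1000, 2000)  # made-up numbers
)

# Match cases$Admin2 against population$County
joined <- left_join(cases, population, by = c("Admin2" = "County"))

# The key column keeps the name from the left dataframe
names(joined)
```

In the named vector passed to by, the name on the left of the = is the column in the left dataframe and the value on the right is the column in the right dataframe.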

Factors

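Factors are used to organize a string column into a fixed set of shared buckets (levels), so that each value is treated as a category rather than as free text.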

The syntax for this is:

factor(object, levels = c(_), labels = c(_))
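
A short sketch of that syntax in use, with a hypothetical vector of survey responses invented for illustration:

```r
# Hypothetical string data
responses <- c("low", "high", "medium", "low", "high")

f <- factor(
  responses,
  levels = c("low", "medium", "high"),  # the buckets, in the order you want them
  labels = c("Low", "Medium", "High")   # display names, matched to levels by position
)

levels(f)      # the buckets, in the order given above
table(f)       # counts per bucket, reported in level order
as.integer(f)  # the underlying integer codes
```

The order of levels is what matters for visualizations: plotting functions such as those in ggplot2 arrange a factor's categories by level order rather than alphabetically, so changing the order passed to levels changes the order on the plot.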