renatadiaz (@renatadiaz)

Joined on Aug 14, 2023

  • Wrangling larger-than-memory data in R
    Assumes you have a big dataset and want to be able to write familiar-ish code around it. This won't cover the different options for data storage and access in much depth (though we'll touch on them at the end, since they can help with speed and interoperability). For today, you do not need to have your data in a specialized format (like a database or Parquet files), although you certainly can use those formats with a workflow like this; we'll assume you have either one big .csv file or many small ones. The focus is on two overlapping and complementary tools that together allow you to write tidyverse-style code for big datasets: duckdb and arrow. These packages offload code evaluation to other engines (arrow's C++ library, duckdb's database engine) so that it doesn't happen in R's memory directly. Without getting into the specifics, this allows you to write and run code on really big datasets that would cause R to crash if you tried to do the same thing in standard tidyverse. A short sketch of the workflow follows this item.
  • Session 2: Attendance, Discussion Q's, Problems
    Session 1: Why are we here and how should we do it? You can type here and it will show up on the side! Attendance: put your name here to practice writing in the HackMD.