# wtf is a distributed column store (and why should I care)?

## prior art

- sam stokes talk
- charity blog posts?
- charity tweets

## rows vs. columns

you don't need to be a database engineer like Charity to understand these differences.

a simplified explanation of reading data in row-based storage

## indexing

when people talk about indexing a column in a row-based database, that means (blah)

a human needs to decide which columns to index, and it's a relatively expensive

## append-only

events are append-only (like flat logs). we don't update event data after it's been stored. this means we don't have to worry about making update tasks

rows corresponding to events also tend to be sparse: not every event has every possible field, so writing an entire row at a time isn't that natural when a bunch of cells in that row are empty.

## no pre-aggregation

## query patterns

- row-based: transactional queries where you're handling data across all columns for a particular row (e.g., a user record for logging into your web app)
- column-based: analytical queries where you're handling only a handful of columns, but every value in each of those columns (e.g., `AVG(duration_ms)`)
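to make the two query patterns a bit more concrete, here's a toy in-memory sketch in Python. it is not how any real column store lays data out on disk. every field name except `duration_ms` is made up for illustration, and the helpers (`to_columns`, `avg_row_based`, `avg_column_based`, `record_from_columns`) are hypothetical names. the same events are held once as a list of rows and once as one list per column, with `None` standing in for fields a sparse event doesn't have.

```python
# toy events; every field name except duration_ms is made up for illustration
events = [
    {"user_id": 1, "endpoint": "/login", "duration_ms": 120},
    {"user_id": 2, "endpoint": "/home", "duration_ms": 80},
    {"user_id": 1, "endpoint": "/home", "duration_ms": 100, "error": "timeout"},
]


def to_columns(rows):
    """Pivot row dicts into one list per field (None where a row lacks the field)."""
    fields = []
    for row in rows:
        for name in row:
            if name not in fields:
                fields.append(name)
    return {name: [row.get(name) for row in rows] for name in fields}


def avg_row_based(rows, field):
    """Analytical query on rows: visit every whole record to pull out one field."""
    values = [row[field] for row in rows if row.get(field) is not None]
    return sum(values) / len(values)


def avg_column_based(columns, field):
    """Analytical query on columns: read only the one column the query needs."""
    values = [v for v in columns[field] if v is not None]
    return sum(values) / len(values)


def record_from_columns(columns, index):
    """Transactional-style lookup on columns: stitch one record back together
    by visiting every column at the same position."""
    return {
        name: values[index]
        for name, values in columns.items()
        if values[index] is not None
    }


columns = to_columns(events)

print(avg_row_based(events, "duration_ms"))
print(avg_column_based(columns, "duration_ms"))
print(record_from_columns(columns, 2))
```

both averages come out the same. the difference is how much data each approach has to touch: the row version walks every full record, while the column version reads a single list. going the other way, fetching one full record is a single lookup in the row layout but means visiting every column in the column layout.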