# Loading data partitioned on ingestion time vs a timestamp/date column

Note that views in BigQuery have two roles here:

- Data deduplication
- Filtering results by partition key **or** columns

## Ingestion time partition

Also known as: load time or `_PARTITIONTIME`.

Pros:

- **Easy** to maintain
- **Consistent** partitioning across all our tables

Cons:

- We'd get **inaccurate** reports, since missing historical data is segregated into a single partition outside the sliding window of the view.
- Events overflow between dates. For example, if the last sync happened at 7pm, a new event arrives at 8pm, and the next sync happens at 7am, that event will be incorrectly stored in the next day's partition. Since our dataset is relatively small, this is likely to affect the statistical significance of our reports.

### Workaround 1.A: Convert date-sharded tables into ingestion-time partitioned tables

1. Split up each table in RDS into different shards based on their date (eg `sfjc_connection_binding_20190101`, ..., `sfjc_connection_binding_20191231`).
2. Load them into BigQuery as separate sharded tables.
3. Convert them into an ingestion-time partitioned table using the `bq partition` command.

We'll end up with a single ingestion-time partitioned table that is actually partitioned by the shard date (`sfjc_connection_binding$20190101`, ..., `sfjc_connection_binding$20191231`).

Reference: [1](https://cloud.google.com/bigquery/docs/creating-partitioned-tables#converting_date-sharded_tables_into_ingestion-time_partitioned_tables)

This is probably the best option if we still want a **consistent** partition key and to **save cost**, but it will take a lot **more engineering** cycles and will still produce **inaccurate** results.

### Workaround 1.B: Filter by column instead of partition

As mentioned in the beginning of this article, we can filter results by partition or column.
When we specify a column as a partition key and filter results on that key, BigQuery doesn't scan the entire table (and individual rows); it scans only the matching partitions. [source](https://cloud.google.com/bigquery/docs/querying-partitioned-tables)

For example, consider the table `sfjc_binding_created`, which is partitioned by the `timestamp` column. When we filter by the partition key `timestamp`, this query will process just 120.7 KB when run:

```sql
SELECT *
FROM `apps-almond.sfjc_staging.sfjc_binding_created`
WHERE timestamp
  -- 60 * 24 * 3600 seconds = 60 days, expressed in microseconds
  BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_MICROS(UNIX_MICROS(CURRENT_TIMESTAMP()) - 60 * 60 * 60 * 24 * 1000000), DAY, 'UTC')
  AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'UTC')
```

When we filter by the column `received_at`, this query will process the entire table of 1.4 MB when run:

```sql
SELECT *
FROM `apps-almond.sfjc_staging.sfjc_binding_created`
WHERE received_at
  BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_MICROS(UNIX_MICROS(CURRENT_TIMESTAMP()) - 60 * 60 * 60 * 24 * 1000000), DAY, 'UTC')
  AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'UTC')
```

We'll get **accurate** results and technically **consistent keys (although that's sort of useless)** with this approach, but it will incur **additional cost** for the full table scan. However, we will **still be within our initial cost estimation**.

## Timestamp / date partition

Pros:

- **Accurate** reports, as data is partitioned correctly based on the actual time the event occurred
- **Faster** reports, as we can _filter by partitions_ in the view instead of by column. **No full table scans**, thus **cheaper** bills.

Cons:

- **Inconsistent** partition key across tables
- **Overhead** of managing different partition keys when querying

### Workaround 2.A: Abstract the difference

Abstract the difference in partition keys using multiple common views (eg `sfjc_connection_binding_last_30_days_view`, `sfjc_connection_binding_last_60_days_view`, `sfjc_connection_binding_last_90_days_view`).
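As a sketch, one of these common views could look like the following. This combines the two roles of our views (deduplication and partition filtering) in a single definition; the view name follows the naming above, while the exact `id`/`timestamp` columns and dedup ordering are assumptions based on the tables used elsewhere in this article:

```sql
-- Hypothetical sketch of one of the common views described above.
-- Deduplicates by id (keeping the latest row per id) and filters on
-- the partition key `timestamp`, so queries against the view only
-- scan the last 30 days of partitions.
CREATE OR REPLACE VIEW `apps-almond.sfjc_staging.sfjc_connection_binding_last_30_days_view` AS
SELECT * EXCEPT (__row_number)
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS __row_number
  FROM `apps-almond.sfjc_staging.sfjc_connection_binding`
  WHERE timestamp
    BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 * 24 HOUR), DAY, 'UTC')
    AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'UTC')
)
WHERE __row_number = 1
```

Consumers would then query the view directly, without needing to know which column the underlying table is partitioned by.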
While this would handle the most common use cases, this solution is **unlikely to meet all of our unique requirements**. In those cases we can resort to explicitly filtering on the partition keys, or consider:

### Workaround 2.AA: Deduplicating service

We can run a periodic job that runs only on a single partition per table, every day. This will delete the duplicates in a partition, thus eliminating the need for _views_.

References: [1](https://medium.com/google-cloud/bigquery-deduplication-14a1206efdbb) [2](https://stackoverflow.com/questions/53650596/bigquery-deleting-duplicates-in-partitioned-table)

Let's say we want to get all deduplicated records from the past 60 days. This service will simplify our query from:

```sql
SELECT * EXCEPT (__row_number)
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS __row_number
  FROM `apps-almond.sfjc_staging.sfjc_connection_binding`
  WHERE timestamp
    BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 * 24 HOUR), DAY, 'UTC')
    AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'UTC')
)
WHERE __row_number = 1
```

to:

```sql
SELECT *
FROM `apps-almond.sfjc_staging.sfjc_connection_binding`
WHERE timestamp
  BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 * 24 HOUR), DAY, 'UTC')
  AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'UTC')
```

PS: This additional service will actually be useful and simplify our queries regardless of which option we choose: ingestion time or timestamp partitions.

## Conclusion

Clearly, the best approach here is to _partition by timestamp_, which would give us both an accurate and a cost-efficient solution, in exchange for slightly more complexity. However, considering the time limit we have for our migration, and that the squad will be a one-man army for a while, I think going with _ingestion time partitioning_ is the path of least resistance. We will have to trade off accuracy against cost with this option.

The end. Annyeonghaseyo.