# Loading data partitioned on ingestion time vs a timestamp/date column
Note that views in BigQuery serve two roles here:
- Data deduplication
- Filtering results by partition key **or** column
## Ingestion time partition
Also known as: load time or `_PARTITIONTIME`
Pros
- **Easy** to maintain
- **Consistent** partitioning across all our tables
Cons
- We'd get **inaccurate** reports: historical data loaded in bulk is segregated into a single partition, which may fall outside the sliding window of the view.
- Events **overflow** between dates. For example, if the last sync happened at 7 PM, a new event arrives at 8 PM, and the next sync happens at 7 AM, that event will be incorrectly stored in the next day's partition. Since our dataset is relatively small, this is likely to affect the statistical significance of our reports.
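For reference, querying an ingestion-time partitioned table means filtering on the `_PARTITIONTIME` pseudo-column rather than a real column (a sketch against one of our staging tables):

```sql
-- Scan only the last 60 days' worth of partitions, based on load time.
SELECT * FROM `apps-almond.sfjc_staging.sfjc_binding_created`
WHERE _PARTITIONTIME >= TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 DAY), DAY, 'UTC')
```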
### Workaround 1.A: Convert date-sharded tables into ingestion-time partitioned tables.
1. Split up each table in RDS into separate shards based on their date (eg `sfjc_connection_binding_20190101`, ... , `sfjc_connection_binding_20191231`)
2. load them into BigQuery as separate sharded tables
3. Convert them into an ingestion time partitioned table using the `bq partition` command. We'll end up with a single ingestion time partitioned table whose partitions match the original shard dates (`sfjc_connection_binding$20190101`, ... , `sfjc_connection_binding$20191231`)
Reference: [1](https://cloud.google.com/bigquery/docs/creating-partitioned-tables#converting_date-sharded_tables_into_ingestion-time_partitioned_tables)
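Step 3 can be sketched with the `bq partition` command, which collects the date-sharded tables matching a prefix into a single ingestion-time partitioned table (dataset and table names are illustrative):

```shell
# Gather all tables named sfjc_staging.sfjc_connection_binding_YYYYMMDD
# into a single daily-partitioned table; each shard's date suffix becomes
# its partition decorator (e.g. sfjc_connection_binding$20190101).
bq partition \
  --time_partitioning_type=DAY \
  sfjc_staging.sfjc_connection_binding_ \
  sfjc_staging.sfjc_connection_binding
```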
Probably the best option if we still want a **consistent** partition key and to **save cost**, but it will take a lot **more engineering** cycles and will still produce **inaccurate** results.
### Workaround 1.B: Filter by column instead of partition
As mentioned at the beginning of this article, we can filter results by partition key or by column. When we specify a column as the partition key and filter the results on that key, BigQuery doesn't scan the entire table (and individual rows); it scans only the matching partitions. [source](https://cloud.google.com/bigquery/docs/querying-partitioned-tables)
For example, consider the table `sfjc_binding_created`, which is partitioned by the `timestamp` column. When we filter by the partition key `timestamp`, this query processes just 120.7 KB when run:
```sql
SELECT * FROM `apps-almond.sfjc_staging.sfjc_binding_created`
WHERE timestamp BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 DAY), DAY, 'UTC')
AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'UTC')
```
When we filter by the column `received_at`, this query processes the entire table of 1.4 MB when run:
```sql
SELECT * FROM `apps-almond.sfjc_staging.sfjc_binding_created`
WHERE received_at BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 DAY), DAY, 'UTC')
AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'UTC')
```
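The byte estimates above can be checked without running (or paying for) the queries by using a dry run (illustrative invocation; requires the `bq` CLI and access to the project):

```shell
# Reports "running this query will process N bytes" without executing it.
bq query --dry_run --use_legacy_sql=false \
  'SELECT * FROM `apps-almond.sfjc_staging.sfjc_binding_created`
   WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 DAY)'
```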
We'll get **accurate** results and technically **consistent keys (although consistency buys us little here)** with this approach, but it will incur **additional cost** for the full table scan. However, we will still be **within our initial cost estimation**.
## Timestamp / date partition
Pros
- **Accurate** reports, as data is partitioned correctly based on the actual time the event occurred
- **Faster** reports, as we can _filter by partitions_ in the view instead of by column. **No full table scans**, thus **cheaper** bills.
Cons
- **Inconsistent** partition key across tables
- **Overhead** of managing different partition keys when querying
### Workaround 2.A: Abstract the difference
Abstract the difference in partition keys behind multiple common views (eg `sfjc_connection_binding_last_30_days_view`, `sfjc_connection_binding_last_60_days_view`,
`sfjc_connection_binding_last_90_days_view`). While this would handle the most common use cases, it's **unlikely to meet all our unique requirements**. In those cases we can fall back to explicitly filtering on the partition keys, or consider:
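One such view might look like the following (a sketch; the view name is illustrative, and it assumes `id` uniquely identifies an event):

```sql
-- Hypothetical 30-day view: hides the table-specific partition key and
-- deduplicates rows by id, keeping the latest version of each event.
CREATE OR REPLACE VIEW `apps-almond.sfjc_staging.sfjc_connection_binding_last_30_days_view` AS
SELECT * EXCEPT (__row_number) FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS __row_number
  FROM `apps-almond.sfjc_staging.sfjc_connection_binding`
  WHERE timestamp >= TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY), DAY, 'UTC')
)
WHERE __row_number = 1
```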
### Workaround 2.AA: Deduplicating service
We can run a periodic job that operates on a single partition per table, every day. It deletes duplicates within that partition, eliminating the need for _views_. References: [1](https://medium.com/google-cloud/bigquery-deduplication-14a1206efdbb) [2](https://stackoverflow.com/questions/53650596/bigquery-deleting-duplicates-in-partitioned-table)
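The daily job could be a scheduled `MERGE` that rewrites one partition with its deduplicated rows (a sketch following the pattern in the references above; assumes `id` uniquely identifies an event):

```sql
-- Rewrite yesterday's partition, keeping only the latest row per id.
-- ON FALSE forces every row into a "not matched" branch, so the statement
-- deletes the old partition contents and re-inserts the deduplicated rows
-- in a single atomic operation.
MERGE `apps-almond.sfjc_staging.sfjc_connection_binding` AS target
USING (
  SELECT * EXCEPT (__row_number) FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS __row_number
    FROM `apps-almond.sfjc_staging.sfjc_connection_binding`
    WHERE DATE(timestamp, 'UTC') = DATE_SUB(CURRENT_DATE('UTC'), INTERVAL 1 DAY)
  )
  WHERE __row_number = 1
) AS source
ON FALSE
WHEN NOT MATCHED BY SOURCE
  AND DATE(target.timestamp, 'UTC') = DATE_SUB(CURRENT_DATE('UTC'), INTERVAL 1 DAY) THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
```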
Let's say we want to get all deduplicated records from the past 60 days. This service would simplify our query from:
```sql
SELECT * EXCEPT (__row_number) FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS __row_number FROM `apps-almond.sfjc_staging.sfjc_connection_binding`
WHERE timestamp BETWEEN
  TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 DAY), DAY, 'UTC')
  AND
  TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'UTC')
)
WHERE __row_number = 1
```
to:
```sql
SELECT * FROM `apps-almond.sfjc_staging.sfjc_connection_binding`
WHERE timestamp BETWEEN
  TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 DAY), DAY, 'UTC')
  AND
  TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'UTC')
```
PS: This additional service will actually be useful and simplify our queries regardless of which option we choose: ingestion time or timestamp partitions.
## Conclusion
Clearly, the best approach here is to _partition by timestamp_, which gives us both an accurate and a cost-efficient solution in exchange for slightly more complexity.
However, considering the time limit we have for our migration and that the squad will be a one-man army for a while, I think going with _ingestion time partitioning_ is the path of least resistance. We will be trading off accuracy and cost with this option.
The end. Annyeonghaseyo.