## Our Current Approach:
1. Turn off Segment sync to RDS
2. Segment to BigQuery
3. Export RDS to BigQuery
## Problem 1: New columns
Not all columns are created during the initial Segment sync, and BigQuery will reject a load job whose data contains columns that are missing from the table schema.
Solution: We will have to add the missing columns manually before importing the historic data from RDS.
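To know which columns to add manually, we can diff the RDS export schema against the table Segment created in BigQuery. A minimal sketch of that diff; the column lists here are hypothetical placeholders, and in practice they would come from the RDS `information_schema` and the BigQuery table metadata:

```python
def missing_columns(rds_columns, bq_columns):
    """Return RDS columns that do not yet exist in BigQuery, preserving order."""
    existing = set(bq_columns)
    return [col for col in rds_columns if col not in existing]

# Hypothetical schemas for illustration only.
rds = ["id", "received_at", "user_id", "plan", "legacy_flag"]
bq = ["id", "received_at", "user_id"]

print(missing_columns(rds, bq))  # -> ['plan', 'legacy_flag']
```

Each column in the result would then need an `ALTER TABLE … ADD COLUMN` (or a schema update in the BigQuery UI) before the historic load.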
## Problem 2: Partitions
By default, Segment creates ingestion-time partitioned tables: BigQuery partitions rows by the time they are loaded, tracked in a pseudo column called `_PARTITIONTIME`. This means **if we bulk-load data into BigQuery at the same time, all of it ends up in the same partition**. We cannot re-partition a table without re-creating it.
Solution: We could either avoid partitioning entirely or use column-based partitioning. It is not clear whether Segment supports either option.
### a) Avoid partitioning
Without partitions, every query scans the whole table, so we would pay more, especially when someone runs a query that spans the entire dataset. We would also lose the per-partition long-term storage discount:
> If you have a table that is not edited for 90 consecutive days, the price of storage for that table automatically drops by 50 percent to $0.01 per GB, per month. This is the same cost as Cloud Storage Nearline.
>
> Each partition of a partitioned table is considered separately for long-term storage pricing. If a partition hasn't been modified in the last 90 days, the data in that partition is considered long term storage and is charged at the discounted price. [source](https://cloud.google.com/bigquery/docs/best-practices-storage?authuser=0#take_advantage_of_long-term_storage)
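To see what the quoted discount is worth: long-term storage is $0.01 per GB per month, and active storage is assumed here to be $0.02 per GB per month (the standard BigQuery rate at the time; verify against current pricing). A back-of-the-envelope sketch:

```python
def monthly_storage_cost(gb, days_since_last_edit):
    """Monthly BigQuery storage cost in USD for a table or partition.

    Assumes $0.02/GB active storage and the $0.01/GB long-term rate
    quoted above, which kicks in after 90 days without edits.
    """
    rate = 0.01 if days_since_last_edit >= 90 else 0.02
    return gb * rate

# A 500 GB partition untouched for 90+ days costs half as much:
print(monthly_storage_cost(500, 30))   # -> 10.0
print(monthly_storage_cost(500, 120))  # -> 5.0
```

The key point is that each partition ages into the discount independently, whereas an unpartitioned table resets its 90-day clock on every load.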
### b) Column based partition
We should partition on Segment's automatic `timestamp` field rather than the `originalTimestamp` column, since `timestamp` is corrected for clock skew. [source](https://segment.com/docs/connections/spec/common/)
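Per the linked Segment spec, the skew correction is `timestamp = receivedAt - (sentAt - originalTimestamp)`: the client-reported event time, shifted by the offset between the client clock (`sentAt`) and the server clock (`receivedAt`). A sketch of that calculation:

```python
from datetime import datetime

def corrected_timestamp(original_timestamp, sent_at, received_at):
    """Segment-style clock-skew correction:
    timestamp = receivedAt - (sentAt - originalTimestamp)."""
    client_delay = sent_at - original_timestamp  # time the event sat on the client
    return received_at - client_delay

# Client clock runs 5 minutes fast. The event happened 10 s before it was
# sent, and the server received it at 12:00:01 server time.
original = datetime(2020, 1, 1, 12, 4, 50)
sent = datetime(2020, 1, 1, 12, 5, 0)
received = datetime(2020, 1, 1, 12, 0, 1)

print(corrected_timestamp(original, sent, received))  # -> 2020-01-01 11:59:51
```

The corrected value lands 10 seconds before server receipt, ignoring the client's 5-minute skew entirely, which is exactly why it is the safer partitioning column.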
#### Limitation
> Each job operation (query or load) can affect a maximum of 4,000 partitions. Any query or load job that affects more than 4,000 partitions is rejected by BigQuery. [source](https://cloud.google.com/bigquery/quotas?authuser=0#partitioned_tables)
That means a single operation can touch only ~11 years' worth of daily-partitioned data.
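The ~11-year figure follows directly from the quota, assuming one partition per day:

```python
# A query or load job may touch at most 4,000 partitions (quoted above).
MAX_PARTITIONS_PER_JOB = 4000
DAYS_PER_YEAR = 365.25  # average, including leap years

print(round(MAX_PARTITIONS_PER_JOB / DAYS_PER_YEAR, 2))  # -> 10.95
```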
### Suggested Approach:
Assumption: Segment probably adds new columns to BigQuery automatically. This is most likely the case, but if it doesn't, we need to figure out which additional columns need to be added before step 3.
1. Turn off Segment sync to RDS
2. Export RDS data to BigQuery
3. Segment sync to BigQuery
## Problem 3: Data Duplication
Unlike in RDS, Segment only de-duplicates data in the BigQuery views.
> We only de-duplicate data in your views. [source](https://segment.com/docs/connections/storage/catalog/bigquery/#im-seeing-duplicates-in-my-tables)
> A view is a virtual table defined by a SQL query. ... All our views are set up to show information from the last 60 days. [source](https://segment.com/docs/connections/storage/catalog/bigquery/#views)
Solution: We should always query the views and avoid querying the partitioned tables directly. By default, the views only expose de-duplicated data from the past 60 days; however, we can change this window to 365 days or create different views for different analytical needs.
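A sketch of the de-duplication the views perform: keep one row per event `id`, preferring the most recently loaded copy. The field names (`id`, `loaded_at`) follow Segment's conventions, but the exact SQL behind the views may differ:

```python
def dedupe(rows):
    """Keep the most recently loaded row for each event id."""
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        if current is None or row["loaded_at"] > current["loaded_at"]:
            latest[row["id"]] = row
    return list(latest.values())

# Hypothetical rows: event "a" was loaded twice (e.g. a retried sync).
rows = [
    {"id": "a", "loaded_at": 1, "value": "first"},
    {"id": "a", "loaded_at": 2, "value": "retry"},
    {"id": "b", "loaded_at": 1, "value": "only"},
]
print(dedupe(rows))  # one row per id; "a" keeps the loaded_at=2 copy
```

Querying the raw partitioned table skips this step, which is why duplicates show up there.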