# Preface
- Problem: Our current architecture, which uses RDS as an Online Analytical Processing (OLAP) database, doesn't scale: it is slow yet way too expensive, and it requires ahead-of-time resource provisioning (we pay for provisioned resources, not the resources we actually use). We want a common platform for our data needs, not a duct-tape solution.
- Scope: This spike is limited to exploring viable alternatives in GCP. For AWS, please check out Amir's spike.
---
# BigQuery
- a serverless data warehouse; no resource (CPU, RAM, disk, bandwidth) provisioning required
- extremely easy to use; supports **SQL 2011**
- BigQuery ML: create and deploy ML models via SQL, completely serverless (see the sketch after this list)
- security (encrypted at rest, query audit trail, IAM, etc.)
- built-in caching mechanism for speed & cost savings
- native integration with Data Studio & serverless ETL via Dataflow (based on OSS Apache Beam)
- With BigQuery we don't have to build a separate data lake and data warehouse.
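A hedged sketch of the BigQuery ML bullet above: both training and batch prediction are plain SQL statements (`CREATE MODEL`, `ML.PREDICT`) run through the ordinary client. The dataset, table, and column names (`our_dataset.users`, `churned`, ...) are hypothetical.
```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression model entirely inside BigQuery.
# All dataset/table/column names here are placeholders.
TRAIN_MODEL = """
CREATE OR REPLACE MODEL `our_dataset.churn_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['churned']) AS
SELECT plan, weekly_sessions, churned
FROM `our_dataset.users`
"""
client.query(TRAIN_MODEL).result()  # blocks until training finishes

# Batch prediction is just another query over ML.PREDICT.
PREDICT = """
SELECT user_id, predicted_churned
FROM ML.PREDICT(MODEL `our_dataset.churn_model`,
                (SELECT user_id, plan, weekly_sessions FROM `our_dataset.users`))
"""
for row in client.query(PREDICT).result():
    print(row.user_id, row.predicted_churned)
```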
---
# 1. Price
### Baseline
| April 2020 | RDS | BigQuery |
| -------- | -------- | -------- |
| Storage 160GB | $19.60 | $3.00 |
| Backup 60GB | $5.85 | - |
| Instance Usage 720h | $8.50 | - |
| Query scan 1TB | - | $0 |
| **Total** | $33.95 | **$3.00** |
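The BigQuery figures follow GCP's published 2020 rates: active storage is $0.02/GB-month with the first 10 GB free each month, and the first 1 TB of query bytes scanned per month is also free (hence the $0 query row). A quick sanity check of the $3.00 storage figure, assuming the free tier is deducted:
```python
# BigQuery storage math behind the table above (2020 list prices):
# active storage $0.02/GB-month, first 10 GB free each month.
ACTIVE_RATE_PER_GB = 0.02
FREE_TIER_GB = 10

gb_stored = 160
cost = (gb_stored - FREE_TIER_GB) * ACTIVE_RATE_PER_GB
print(f"${cost:.2f}")  # -> $3.00, matching the table above
```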
---
### Optimized with Partition
| April 2020 | RDS | BigQuery |
| -------- | -------- | -------- |
| Storage 160GB | $19.60 | - |
| Active Storage 60GB | - | $1.20 |
| Cold Storage 100GB | - | $1.00 |
| Backup 60GB | $5.85 | - |
| Instance Usage 720h | $8.50 | - |
| Query scan 1TB | - | $0 |
| **Total** | $33.95 | **$2.20 (-27%)** |
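The "Cold Storage" saving comes from BigQuery's long-term storage pricing: any partition left unmodified for 90 consecutive days is automatically billed at half the active rate ($0.01/GB vs $0.02/GB), so date-partitioning the table lets old data age into the cheaper tier, and date filters also cut bytes scanned. A minimal sketch of creating such a table, with hypothetical project/table/column names:
```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table; partitioned by day on the event timestamp so that
# partitions untouched for 90 days drop to long-term ("cold") pricing.
table = bigquery.Table(
    "our-project.analytics.events",
    schema=[
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("state", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
)
client.create_table(table)
```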
---
### April 2021 x10, Baseline
*At this scale we need at least an RDS instance of type db.t3.large (2 vCPUs, 8 GiB RAM).*
| | RDS | BigQuery |
| -------- | -------- | -------- |
| Storage 1TB | $115.00 | $19.80 |
| Backup 1TB | $95.00 | - |
| Instance Usage 720h | $105.85 | - |
| Query scan 1TB | - | $0 |
| **Total** | $315.85 | **$19.80** |
---
### April 2021 x10, with Optimization
*At this scale we need at least an RDS instance of type db.t3.large (2 vCPUs, 8 GiB RAM).*
| | RDS | BigQuery |
| -------- | -------- | -------- |
| Storage 1TB | $115.00 | - |
| Active Storage 100GB | - | $2.00 |
| Cold Storage 900GB | - | $9.00 |
| Backup 1TB | $95.00 | - |
| Instance Usage 720h | $105.85 | - |
| Query scan 1TB | - | $0 |
| **Total** | $315.85 | **$11.00 (-44%)** |
---
# 2. Performance
| | Average time to execute |
| -------- | -------- |
| RDS | 77.2s |
| BigQuery | **5.5s (14x faster)** |
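The benchmark methodology isn't recorded in this spike; below is a minimal sketch of how the BigQuery side of such a comparison can be timed (the RDS side would wrap the psycopg2 snippet from section 3 in the same loop). `use_query_cache=False` keeps BigQuery's result cache from skewing the numbers; the actual analytical query is elided.
```python
import time
from statistics import mean

from google.cloud import bigquery

client = bigquery.Client()
# Disable the result cache so every run actually executes the query.
no_cache = bigquery.QueryJobConfig(use_query_cache=False)

def avg_seconds(query: str, runs: int = 10) -> float:
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        client.query(query, job_config=no_cache).result()  # block until done
        durations.append(time.monotonic() - start)
    return mean(durations)

print(avg_seconds("SELECT ..."))  # the analytical query under test (elided)
```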
---
# 3. Engineering Effort
- Everything is achievable with plain old SQL, and a bit more: you can even train your ML model in SQL!
- We'll be able to reuse most of the work we have done for Project Almond.
---
### Postgres
```python
import psycopg2

# Connection settings come from the standard PG* environment variables.
conn = psycopg2.connect()
cursor = conn.cursor()
# "users" is a placeholder table name; the original snippet omitted it.
# Note the single quotes: in Postgres, double quotes denote identifiers, not strings.
QUERY = "SELECT name FROM users WHERE state = 'TX' LIMIT 100"
cursor.execute(QUERY)
for row in cursor.fetchall():
    print(row[0])
```
---
### BigQuery
```python
from google.cloud import bigquery

# Credentials are resolved via Application Default Credentials.
client = bigquery.Client()
# "dataset.users" is a placeholder table; the original snippet omitted it.
QUERY = 'SELECT name FROM `dataset.users` WHERE state = "TX" LIMIT 100'
query_job = client.query(QUERY)
for row in query_job.result():
    print(row.name)
```
---
## Architecture
We just need to change Segment's destination from RDS to BigQuery. We **can't have both** unless we upgrade our Segment plan to "Business": our current "Teams" plan only allows one *Segment Warehouse*, and our RDS integration counts as a *Warehouse*.
Based on publicly available information, Segment does not offer such a connector for Athena, so replicating this setup in AWS might not be straightforward. Fun fact: Athena is built on Presto, which was open-sourced by Facebook.
- Current: Segment -> RDS -> Apps
- Proposed
  - without data lake: Segment -> BigQuery -> Apps
  - with data lake: Segment -> Cloud Storage -> Dataflow -> BigQuery -> Apps (sketched below)
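A minimal sketch of the Dataflow step in the data-lake variant, assuming Segment drops newline-delimited JSON into a hypothetical `gs://our-events/` bucket and that the destination table `our-project:analytics.events` already exists:
```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# All names (project, bucket, table) are hypothetical.
options = PipelineOptions(
    runner="DataflowRunner",
    project="our-project",
    region="us-central1",
    temp_location="gs://our-events/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromGCS" >> beam.io.ReadFromText("gs://our-events/segment/*.json")
        | "ParseJSON" >> beam.Map(json.loads)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "our-project:analytics.events",
            # Assume the table exists; otherwise a schema must be supplied.
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```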
---
# 4. Optimize globally, not locally
- We should strive to optimize globally, not locally. Having a data warehouse would provide us with a unified, capable platform for all our analytics needs.
> [We're] used to picking the best tool for the job by optimizing locally. ... What you need to do is start building your muscles for optimizing globally. Not in isolation of other considerations, but in conjunction with them. - Charity Majors
- provides a common platform not just for our internal analytical needs, but also for value-added services we can offer end-users on top of our existing core products (e.g. Analytics for Learning Fabric)
---
# 5. Googliness
- Tight integration with Data Studio. How tight?
- Did I mention you can train an ML model with SQL? Serverless!
# FAQ
### What about VPCs and stuff?
- Google pioneered Zero Trust Networking: [BeyondCorp](https://research.google/pubs/pub43231/).
- Analogy: a VPC is like a castle, complete with big walls and a surrounding moat. But once that perimeter is breached, an attacker has relatively easy access to the company's privileged intranet.
- Consider a common scenario: we put everything, including RDS, inside a VPC and whitelist only our office IPs. All a hacker needs to do is get onto the office WiFi.
- We'll just use RSA key pairs to authenticate between publicly reachable endpoints (see the sketch below).
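In GCP terms, the "RSA key pairs" bullet maps to service-account keys: the JSON key file wraps an RSA private key used to sign OAuth token requests. A minimal sketch, with a hypothetical key-file path:
```python
from google.cloud import bigquery

# "service-account-key.json" is a hypothetical path; the file contains an RSA
# private key that signs OAuth token requests, so no VPC or IP whitelist is
# needed to reach the public BigQuery endpoint securely.
client = bigquery.Client.from_service_account_json("service-account-key.json")

for row in client.query("SELECT 1 AS ok").result():
    print(row.ok)  # prints 1 if authentication succeeded
```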
---
### Do we really need this, now?
- We don't need this immediately, but having it now would make our jobs easier and make us more productive.