---
#slideOptions:
# transition: slide
title: Data Exercise
---
# Data exercise for Sparta Commodities
By Álvaro Muñoz Delgado
Data team mission:
> To provide a single source of truth for Unity to discover, consume and transform data into insights, decisions and products. - Unity, 2019
---
## 1. Data Storage solution
### Goals
1. Store all the data for future insights
2. Efficiency for querying
For __Goal 1__, a Data Lake is needed rather than a Data Warehouse: in a Data Lake data is never discarded, while a Data Warehouse requires more rigid schemas and data is usually discarded after processing.
Therefore, a "store everything" policy becomes much more versatile when developing new business systems (such as the Data Research ideas mentioned during the first call).
In Google Cloud Platform, the tool used as the central storage repository for the Data Lake is Cloud Storage. Since Cloud Storage costs depend on how frequently we access the data, I would suggest having two differentiated buckets:
- **Raw Data**: all the data will be stored here at first (staging area)
- **Archive Storage**: after some time, raw data will be moved here, since it's a cost-effective way of storing data that doesn't need to be accessed often (costing less than 1/10 of the regular price); a sketch of this archival rule follows this list
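One way to implement this archival policy is a Cloud Storage lifecycle rule. The sketch below (bucket name and 90-day threshold are placeholders, not decisions) changes the storage class of raw objects to Archive in place; an actual cross-bucket move would need something like a Storage Transfer job instead.

```python
# Minimal sketch: move objects in the raw-data bucket to the ARCHIVE storage
# class after 90 days. Bucket name and age threshold are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("raw-data")

# Lifecycle rule: after 90 days, transition objects to ARCHIVE
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=90)
bucket.patch()  # persist the updated lifecycle configuration
```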
Both batch and streaming data will be inserted first into the Data Lake (Cloud Storage) and then into the Data Warehouse (BigQuery) once it is processed (more in point 2. Data Flow design).
To achieve **Goal 2**, real-time data that requires low-latency queries and should scale over time will also be stored in BigTable, in addition to BigQuery.
Moreover, if we want our Data Warehouse to also be efficient and useful for current and future analytics workflows, we have to consider two key points:
- efficiency = time + cost
- plots will require querying many rows and just a few columns
Therefore, there are several reasons to choose BigQuery as the Data Warehouse:
- BigQuery uses a column-oriented storage mechanism, which handles streaming inserts well and scans many rows over just a few columns very efficiently
- BigQuery allows table optimization through clustered tables and partitioning, which translates into reduced query time
- Together, the two points above significantly reduce costs
It is important to note that clustering and partitioning are two key techniques that have to be implemented for each table. Moreover, BigQuery only allows 4,000 partitions per table, so the data structure becomes really important in order to organise all the partitions over years of information (elaborated in point 3. Data Structure).
To sum up, my proposal for the Data Storage solution is to create a Data Lake with Cloud Storage, use BigTable as the main database for live data and use BigQuery as the Data Warehouse for analytics.
**Note**: since BigTable can be quite expensive, I recommend thoroughly studying the amount of streaming data, the types of reads we need, etc. before considering it. Perhaps a scalable SQL tool like Cloud Spanner might be better in some cases, even though latency can be higher.
---
## 2. Data Flow design
### Goals
1. No data loss
2. Minimum lag for live data
On the one hand, the data flow must follow an ETL logic. While the data is processed, it will also be stored in its original format in Cloud Storage, to stay consistent with the data lake philosophy.
On the other hand, by using a tool like Pub/Sub (comparable to Apache Kafka) we are already ensuring **no data is lost**, thanks to Pub/Sub's at-least-once delivery policy. If we configure a dead-letter topic for unacknowledged or non-schema-compliant messages, we won't lose any data, and we can reprocess it later if needed.
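As a rough sketch of this setup (project, topic and subscription IDs are placeholders), a subscription with a dead-letter policy could be created as follows; note that the Pub/Sub service account also needs permission to publish to and subscribe from the dead-letter topic.

```python
# Minimal sketch: a subscription whose undeliverable messages are routed to a
# dead-letter topic after a number of delivery attempts. All IDs are placeholders.
from google.cloud import pubsub_v1

project_id = "my-project"
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(project_id, "prices")
dead_letter_topic_path = publisher.topic_path(project_id, "prices-dead-letter")
subscription_path = subscriber.subscription_path(project_id, "prices-dataflow")

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "dead_letter_policy": {
                "dead_letter_topic": dead_letter_topic_path,
                "max_delivery_attempts": 10,
            },
        }
    )
```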
The core pipeline tool in Google Cloud Platform is called Dataflow (based on Apache Beam), which allows us to:
- Use the same business logic for both streaming and batch pipelines
- Automatically scale the system horizontally to avoid bottlenecks, minimize lag and reduce overall costs
- Use the Avro format to load objects directly into BigQuery and increase performance
- Backfill and reprocess dead-letter data in batch
Nonetheless, other needs will arise as the platform evolves and might require additional logic or Cloud Functions to perform certain tasks. We can use Cloud Composer (based on Apache Airflow) to schedule and manage workflows for these new business needs.
Streaming data will be processed by Dataflow and inserted in real time into BigTable.
Another key factor for the streaming pipeline will be the **windowing** used when storing analytics data into BigQuery. Since the minimum time range stated is 5 minutes, Dataflow can use 5-minute fixed (tumbling) windows to collect all the data processed in the last 5 minutes, perform the required calculations and insert one row per key per window. This vastly reduces the amount of data finally stored in the Data Warehouse.
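A minimal Apache Beam sketch of this aggregation step is shown below. It assumes a JSON payload with `key_name`, `ts` (ISO-8601) and `price` fields, placeholder project/topic/table names, and an already-created BigQuery table matching the schema described in point 3.

```python
# Minimal sketch: read events from Pub/Sub, window them into 5-minute fixed
# windows, compute open/close/max/min per key and append one row per key and
# window to BigQuery. All resource names are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


class ToOHLCRow(beam.DoFn):
    """Turn all (ts, price) points of one key in one window into a single row."""

    def process(self, kv, window=beam.DoFn.WindowParam):
        key, points = kv
        points = sorted(points)                  # ISO timestamps sort chronologically
        prices = [price for _, price in points]
        yield {
            "key_name": key,
            "timestamp": window.start.to_utc_datetime().isoformat(),
            "open": prices[0],
            "close": prices[-1],
            "max": max(prices),
            "min": min(prices),
        }


with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/prices")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "KeyByName" >> beam.Map(lambda e: (e["key_name"], (e["ts"], float(e["price"]))))
        | "Window5m" >> beam.WindowInto(FixedWindows(5 * 60))
        | "GroupByKey" >> beam.GroupByKey()
        | "ToRow" >> beam.ParDo(ToOHLCRow())
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:analytics.prices_2021",
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```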
To sum up, the ETL pipeline must be designed using Dataflow for both batch and streaming pipelines and orchestrated by Composer when needed.
---
## 3. Data Structure
### Goal
- Fast queries
There are several considerations and key points to analyze for each of the proposed Data Storage systems.
**BigTable**
The goal of this database is to return live data as fast as possible. We will design the schema to create a row for each new event following a tall and narrow pattern: many rows and few columns.
Each row is indexed by a single row-key, so choosing the row-key is one of the most important design decisions in BigTable: it scales best when read and write operations are evenly distributed across nodes and tables. Moreover, given the need to perform range scans based on time, we have to include the timestamp in the row-key.
I recommend using a single-timestamp unserialized pattern, because we will always want to retrieve all columns but only a specified range of timestamps. Since the row-key must be unique, the timestamp and the data key have to be added to it.
One possible row-key design could be:
- **source_name#key_name#timestamp**, e.g. "blender#key1#2021-05-09-1205"
With this approach, values will be highly varied and evenly distributed across all nodes, reducing query time.
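As an illustration (instance, table and column-family names are placeholders), writing an event with this row-key and scanning a time range for one source and key could look like this:

```python
# Minimal sketch of the proposed row-key pattern in BigTable: one row per event,
# plus a lexicographic range scan over a time window for a single source/key.
# Instance, table and column-family names are placeholders.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("live-data").table("events")

# Write: row-key follows source_name#key_name#timestamp
row = table.direct_row(b"blender#key1#2021-05-09-1205")
row.set_cell("metrics", b"price", b"42.17")   # column family "metrics" must exist
row.commit()

# Read: because timestamps sort lexicographically, a key range is a time range
rows = table.read_rows(
    start_key=b"blender#key1#2021-05-09-0000",
    end_key=b"blender#key1#2021-05-09-2359",
)
for r in rows:
    cell = r.cells["metrics"][b"price"][0]
    print(r.row_key, cell.value)
```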
**BigQuery**
As the Data Warehouse, BigQuery might also be queried to perform certain analytics in almost real-time. Therefore, we have to structure it in an optimized manner.
This optimization can be achieved through the aforementioned partitioning and clustering. Therefore, let's calculate and consider different scenarios for date partitioning.
---
Taking into account that the lowest time range to plot is 5 minutes:
- Each row will register 5 minutes of data
- 1 hour: 12 rows
- 1 day: 288 rows
- 1 month: 8,640 rows
- 1 year: 103,680 rows
- 10 years: 1,036,800 rows
If we consider storing one table per year with all the readings per key (roughly 3,000 keys at these figures), we would have 311,040,000 rows per year in total, which is perfectly manageable for BigQuery.
Moreover, since BigQuery's limit is 4,000 partitions per table, daily partitioning lets a single table hold ~11 years of data (simplifying table management). Hourly partitions would only cover about five and a half months per table, and even though BigQuery handles billions of rows without problems, there is no reason to choose hourly over daily partitioning.
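A quick sanity check of these figures (the ~3,000-key count is inferred from the totals above rather than stated explicitly):

```python
# Back-of-the-envelope check of the partition math, assuming ~3,000 keys and
# the 360-day year implied by the row counts above.
ROWS_PER_DAY_PER_KEY = 24 * 60 // 5       # 288 five-minute rows per key per day
KEYS = 3_000
DAYS_PER_YEAR = 360
MAX_PARTITIONS = 4_000                    # BigQuery limit per table

rows_per_year = ROWS_PER_DAY_PER_KEY * DAYS_PER_YEAR * KEYS
print(f"{rows_per_year:,} rows per year")                                  # 311,040,000
print(f"{MAX_PARTITIONS / DAYS_PER_YEAR:.1f} years per table, daily parts") # ~11.1
print(f"{MAX_PARTITIONS / 24:.0f} days covered with hourly partitions")     # ~167 (~5.5 months)
```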
If, in addition to partitioning the tables, we also cluster them by key, we significantly reduce costs and increase the system's performance.
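A minimal sketch of creating such a yearly table, day-partitioned on the timestamp and clustered by key (project, dataset and table names are placeholders; `key_name` is the assumed key column):

```python
# Minimal sketch: a yearly table partitioned by day on the timestamp column and
# clustered by key. All resource names and the key column are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("key_name", "STRING"),
    bigquery.SchemaField("timestamp", "TIMESTAMP"),
    bigquery.SchemaField("open", "FLOAT"),
    bigquery.SchemaField("close", "FLOAT"),
    bigquery.SchemaField("max", "FLOAT"),
    bigquery.SchemaField("min", "FLOAT"),
]

table = bigquery.Table("my-project.analytics.prices_2021", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,   # one partition per day (~11 years/table)
    field="timestamp",
)
table.clustering_fields = ["key_name"]         # cluster by key to prune scanned blocks
client.create_table(table)
```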
---
**Other considerations for BigQuery**
The size of each row to be sent is:

| Field | Type | Size |
| --- | --- | --- |
| timestamp | TIMESTAMP | 8 bytes |
| open | FLOAT | 8 bytes |
| close | FLOAT | 8 bytes |
| max | FLOAT | 8 bytes |
| min | FLOAT | 8 bytes |
That is around 40 bytes of data plus 10-20 bytes of metadata, roughly 60 bytes per row. Hence, one year of rows for one key is about 103,680 × 60 bytes ≈ 6 MB.
Given that BigQuery always charges for at least 10 MB per query (even if less data is scanned), we would still be wasting around 4 MB even when querying a whole year for a single key. We could therefore combine several tables or keys in a single query for the analytics or enriched-data parts of the platform, to avoid wasting resources.
Moreover, it is important to remember that BigQuery caches query results under 128 MB, so no extra costs are charged if a user queries the same data several times. This enables different strategies: for instance, we could query all of a user's preferred keys at once the first time they log in.
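A sketch of that strategy (table and key names are placeholders): one parameterised query fetches a whole year for all of a user's preferred keys, so the 10 MB minimum is amortised across keys and the result lands in the cache for subsequent requests.

```python
# Minimal sketch: fetch a year of 5-minute rows for several keys in one query.
# Table name, date range and key list are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
preferred_keys = ["key1", "key7", "key42"]     # hypothetical per-user favourites

query = """
    SELECT key_name, `timestamp`, `open`, `close`, `max`, `min`
    FROM `my-project.analytics.prices_2021`
    WHERE `timestamp` >= TIMESTAMP("2021-01-01")
      AND `timestamp` <  TIMESTAMP("2022-01-01")   -- prunes daily partitions
      AND key_name IN UNNEST(@keys)                -- clustering prunes blocks by key
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ArrayQueryParameter("keys", "STRING", preferred_keys)]
)
rows = client.query(query, job_config=job_config).result()
```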
---
## Reference architecture
