---
title: "Implementing Layered Data Architecture In Dagster - George T. C., Lai"
tags: PyConTW2023, 2023-organize, 2023-共筆
---

# Implementing Layered Data Architecture In Dagster - George T. C., Lai
{%hackmd H6-2BguNT8iE7ZUrnoG1Tg %}
<iframe src=https://app.sli.do/event/u6FKx99uik2i9pyfJi9FAP height=450 width=100%></iframe>

> Collaborative writing starts below

[toc]

[Slide](https://onedrive.live.com/view.aspx?resid=20362B1C0A9764D3!14337&ithint=file%2cpptx&authkey=!ACnoLoD7SOrxc3o)

:::success
**Main Takeaways:**
- Concept of Layered Data Architecture
- Introduction To Data Orchestration Framework -- Dagster
- Implementation Of Layered Data Architecture Using Dagster
- Experience Sharing
:::

## Layered Data Architecture
- Some history:
    - first introduced by Databricks as the Medallion Architecture

```graphviz
digraph {
    node[shape="rect"]
    Bronze[label="Bronze (Raw Integration)"]
    Silver[label="Silver (Filtered, Cleaned, Augmented)"]
    Gold[label="Gold (Business-level Aggregates)"]
    Batch, Streaming -> Bronze -> Silver -> Gold
}
```

- Goal: improve the structure and quality of data incrementally and progressively as it flows through each stage.
    - Simplifies the transformation logic
- Problems a multilayered data structure may solve:
    - excessive focus on operations/operators rather than on the data, which is what data engineers should be paying attention to
- Stages: usually 3 or 4; there is no fixed rule, but there should not be more.
- Check quality at every layer (instead of only at the final stage)

### Snowflake's Modern Data Architecture
Four-layer data design pattern:
+ Raw Layer
    + Raw data, no transformations
    + Matches the source schema
+ Conformed Layer
    + Raw and deduped data
    + Data type standardization (e.g. dates)
+ Reference Layer
    + Business hierarchies
+ Modeled Layer
    + Integrated, cleansed, modeled data
    + Often dimensionally modeled
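
The layering idea above can be illustrated with a minimal sketch in plain Python (not Dagster, and not taken from the talk). It assumes a toy order dataset; every name here (`RAW_ORDERS`, `check`, `to_bronze`, `to_silver`, `to_gold`) is hypothetical. The point is only that each layer does one small transformation and validates its own output before the next layer consumes it.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw records, as they might arrive from a source system.
RAW_ORDERS = [
    {"order_id": "1", "customer": "alice", "amount": "10.5", "day": "2023-09-01"},
    {"order_id": "1", "customer": "alice", "amount": "10.5", "day": "2023-09-01"},  # duplicate
    {"order_id": "2", "customer": "bob", "amount": "4.0", "day": "2023-09-01"},
    {"order_id": "3", "customer": "alice", "amount": "n/a", "day": "2023-09-02"},  # bad amount
]


def check(condition, message):
    """Fail fast so that a bad layer never feeds the next one."""
    if not condition:
        raise ValueError(message)


def to_bronze(raw):
    """Bronze: land the data unchanged; only verify the expected fields exist."""
    rows = [dict(r) for r in raw]
    required = {"order_id", "customer", "amount", "day"}
    check(all(required <= r.keys() for r in rows), "bronze: missing fields")
    return rows


def to_silver(bronze):
    """Silver: deduplicate, drop unparseable rows, standardize data types."""
    seen, rows = set(), []
    for r in bronze:
        if r["order_id"] in seen:
            continue  # filtered out: duplicate order
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # filtered out: amount is not numeric
        seen.add(r["order_id"])
        rows.append({
            "order_id": int(r["order_id"]),
            "customer": r["customer"],
            "amount": amount,
            "day": date.fromisoformat(r["day"]),
        })
    check(len({r["order_id"] for r in rows}) == len(rows), "silver: duplicate ids")
    check(all(r["amount"] >= 0 for r in rows), "silver: negative amount")
    return rows


def to_gold(silver):
    """Gold: a business-level aggregate, here revenue per customer."""
    totals = defaultdict(float)
    for r in silver:
        totals[r["customer"]] += r["amount"]
    gold = dict(totals)
    reconciled = abs(sum(gold.values()) - sum(r["amount"] for r in silver)) < 1e-9
    check(reconciled, "gold: totals do not reconcile with silver")
    return gold


bronze = to_bronze(RAW_ORDERS)
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Because every layer validates its own output, a failure points at the stage that introduced the problem rather than surfacing only in the final aggregate.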

### Analytical Data In A Domain Team With Layered Data Architecture

```graphviz
digraph {
    node[shape="rect"]
    BA[label="Business Analytics"]
    BA -> Bronze, Silver, Gold
    OperationData -> Bronze
    Gold -> DataProduct -> MarketingDomain
}
```

![](https://hackmd.io/_uploads/BkxWxOlC2.png)

# Dagster
DAG = Directed Acyclic Graph

## Why Dagster -- Pros
- :thumbsup: has software-defined assets (SDA); the presenter recommends Dagster largely because of this feature (see the notes on SDA later)
    - each SDA has a unique asset key, which can be used to resolve dependencies
- data type validation (integrated with `pydantic`)
- allows you to integrate prepared data from different data teams in an intuitive, graph-like (DAG) way
- shows the lineage of data sources and ops through a built-in visualization, so you can trace them
- extensive integrations:
    - BigQuery, AWS, etc.
    - pandas, dbt, Spark, Great Expectations, Pandera, etc.
    - allows you to write support for other frameworks
- is cloud native
- computation and I/O are decoupled
    - the I/O manager and the computations can be replaced separately
- rich ecosystem

## Building Blocks of Dagster
- `Op` is the core unit of computation, such as
    - deriving a dataset from other datasets
    - this is implemented as a decorator
    - you only need something like this:
    ```python
    from dagster import op

    @op
    def func():
        return 1

    # imagine you have many functions that look like this
    ```
    - Dagster helps you resolve dependencies
- `Graph` is a set of interconnected ops or sub-graphs; it can be used in three different ways
    - Graph-backed asset
    - Directly inside a job
    - Inside another graph (as a sub-graph)
    - :exclamation: A graph cannot be run directly
- A `Job`
    - is the main unit of execution and monitoring in Dagster
    - is also implemented as a decorator, `@job`; see the notes on `Op`
- A Schedule
    - something like a `cronjob`, which runs commands at specified intervals/timestamps

### Software Defined Asset
- Asset: an object in persistent storage, such as a table, file, or persisted machine learning model
- SDA includes
    - A unique asset key
    - A set of upstream asset keys
    - An op (computation)

![](https://hackmd.io/_uploads/SyKlGdeRn.png)

## Experience Sharing
Some lessons learned from the presenter:
1. Watch your Postgres resource consumption
    - retention rate
2. You need to write code to automatically clean up excessive logs (this is an open issue, not resolved yet)
    - the presenter recommends a periodic, scheduled deletion of excessive logs (by calling the Dagster API)
3. Watch K8s resource consumption
    - all jobs will be executed in one pod :cry:
4. 409 conflict error
    - occurs before version 1.3.6
    - solution: update to the latest version, or retry
5. Most of these issues are solved by now, but the way they were discovered and solved is still a useful lesson for future problems.

Below is the part where the speaker updated or corrected the slides after the talk.

[Slides on OneDrive](https://onedrive.live.com/view.aspx?resid=20362B1C0A9764D3!14337&ithint=file%2cpptx&authkey=!ACnoLoD7SOrxc3o)