Implementing Layered Data Architecture In Dagster - George T. C., Lai

Welcome to PyCon TW 2023 Collaborative Writing

Collaborative writing workspace: https://hackmd.io/@pycontw/2023

Collaborative notes start below

Implementing Layered Data Architecture In Dagster - George T. C., Lai
- Layered Data Architecture
  - Snowflake's Modern Data Architecture
  - Analytical Data In A Domain Team With Layered Data Architecture
- Dagster

Slide

Main Takeaways:

Concept of Layered Data Architecture
Introduction To Data Orchestration Framework – Dagster
Implementation Of Layered Data Architecture Using Dagster
Experience Sharing

Layered Data Architecture

Some history:

first introduced by Databricks as the Medallion Architecture

Goal: improve the structure and quality of data incrementally and progressively as it flows through each stage.
- simplifies the transformation logic
  - a problem a multilayered data structure may solve:
    - excessive focus on operations/operators rather than on the data itself, which is what data engineers should be paying attention to
Stages: typically 3 or 4; there is no hard rule, but there should not be more.
- check data quality at every layer (instead of only at the final stage)
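
The idea of checking quality at every layer can be sketched in plain Python. This is only an illustration and not code from the talk: the layer names, the `run_layers` helper and the sample records are all made up for this example.

```python
from typing import Callable, List, Tuple

Rows = List[dict]
# Each layer pairs a transformation with a quality check on its output.
Layer = Tuple[str, Callable[[Rows], Rows], Callable[[Rows], bool]]


def run_layers(rows: Rows, layers: List[Layer]) -> Rows:
    """Pass data through each layer, validating after every step."""
    for name, transform, check in layers:
        rows = transform(rows)
        if not check(rows):
            # Fail at the layer that produced bad data, not at the very end.
            raise ValueError(f"quality check failed in layer '{name}'")
    return rows


# Hypothetical three-stage pipeline over order records.
layers: List[Layer] = [
    ("raw", lambda rows: rows, lambda rows: len(rows) > 0),
    (
        "cleaned",
        lambda rows: [r for r in rows if r["amount"] is not None],
        lambda rows: all(r["amount"] >= 0 for r in rows),
    ),
    (
        "aggregated",
        lambda rows: [{"total": sum(r["amount"] for r in rows)}],
        lambda rows: len(rows) == 1,
    ),
]

result = run_layers([{"amount": 10}, {"amount": None}, {"amount": 5}], layers)
```

Because every layer validates its own output, a bad record is reported with the name of the layer that produced it, which is easier to debug than a single check on the final table.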

Snowflake's Modern Data Architecture

Four-layer data design pattern:

Raw Layer
- Raw data, no transformations
- matches source schema
Conformed Layer
- Raw and deduped data
- data type standardization (e.g. dates)
Reference Layer
- Business hierarchies
Modeled Layer
- Integrated, cleansed, modeled data.
- Often dimensionally modeled.
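
To make the four layers more concrete, here is a rough plain-Python sketch of what each layer might do to a few order records. The records, field names and region hierarchy are invented for illustration; in practice these would be tables in a warehouse rather than Python lists.

```python
from datetime import date, datetime

# Hypothetical source rows: one duplicate and inconsistent date formats.
source_rows = [
    {"order_id": 1, "region_code": "TPE", "ordered_at": "2023-09-02", "amount": 120},
    {"order_id": 1, "region_code": "TPE", "ordered_at": "2023-09-02", "amount": 120},
    {"order_id": 2, "region_code": "KHH", "ordered_at": "2023/09/03", "amount": 80},
]


def raw_layer(rows):
    """Raw layer: keep the data exactly as the source delivers it."""
    return [dict(r) for r in rows]


def parse_date(text: str) -> date:
    """Accept the two date formats that appear in the hypothetical source."""
    for fmt in ("%Y-%m-%d", "%Y/%m/%d"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def conformed_layer(rows):
    """Conformed layer: drop duplicates and standardize data types."""
    seen, out = set(), []
    for r in rows:
        if r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        out.append({**r, "ordered_at": parse_date(r["ordered_at"])})
    return out


# Reference layer: a business hierarchy (region code -> region -> area).
reference_layer = {
    "TPE": {"region": "Taipei", "area": "North"},
    "KHH": {"region": "Kaohsiung", "area": "South"},
}


def modeled_layer(rows, reference):
    """Modeled layer: integrate conformed facts with reference data."""
    return [{**r, **reference[r["region_code"]]} for r in rows]


raw = raw_layer(source_rows)
conformed = conformed_layer(raw)
modeled = modeled_layer(conformed, reference_layer)
```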

Analytical Data In A Domain Team With Layered Data Architecture

Dagster

DAG = Directed Acyclic Graph

Why Dagster – Pros

has Software-Defined Assets (SDA); the presenter recommends Dagster largely because of this feature (see the notes on SDA later)
- each SDA has a unique asset key, which can be used to resolve dependencies
- data type validation (integrated with pydantic)
allows you to integrate prepared data from different data teams in an intuitive, graph-like (DAG) way
shows the lineage of data sources and ops through built-in visualization, so you can track them
extensive integrations:
- BigQuery, AWS, etc.
- pandas, dbt, Spark, Great Expectations, pandera, etc.
- allows you to write support for other frameworks
is cloud native
computation and I/O are decoupled
- you can replace the I/O manager and the computations separately
rich ecosystem

Building Blocks of Dagster

An Op is the core unit of computation, such as

Deriving a dataset from other datasets

this is implemented as a decorator

you only need something like this:

from dagster import op

@op
def func():
    # the function body is the op's computation
    return 1
# imagine you have many functions that look like this

dagster helps you resolve dependencies

A Graph is a set of interconnected ops or sub-graphs; it can be used in three different ways
- Graph-backed asset
- Directly inside a job
- Inside another graph (as a sub-graph)
A graph cannot be run directly
A Job
- is the main unit of execution and monitoring in Dagster
- is also implemented as a decorator (@job); see the notes on Op
A Schedule
- something like a cron job, which runs commands at specified intervals or times

Software Defined Asset

Asset: an object in persistent storage, such as a table, file, or persisted machine learning model
SDA includes
- A unique asset key
- A set of upstream asset keys
- An op (computation)

Some lessons learned from the presenter:

watch your Postgres resource consumption
- retention rate
you need to write code to automatically clean up excessive logs (this is an open issue, not resolved yet)
- the presenter recommends a periodic, scheduled deletion of excessive logs (by calling the Dagster API)
watch K8s resource consumption
- all jobs will be executed in one pod
409 conflict error
- occurs before version 1.3.6
- solution: update to the latest version or retry
Most issues are solved nowadays, but we can still learn from them how to discover and solve problems in the future.

Below are the parts the speaker updated or corrected after the talk

Slides on OneDrive
