Implementing Layered Data Architecture In Dagster - George T. C., Lai
Welcome to PyCon TW 2023 Collaborative Writing
Collaborative Writing Workplace: https://hackmd.io/@pycontw/2023
On mobile, please tap to unfold the agenda.
Collaborative writing starts below
Slide
Main Takeaways:
Concept of Layered Data Architecture
Introduction To Data Orchestration Framework: Dagster
Implementation Of Layered Data Architecture Using Dagster
Experience Sharing
Layered Data Architecture
Some history: the layered approach was first introduced by Databricks as the Medallion Architecture.
[Diagram: Batch and Streaming sources -> Bronze (Raw Integration) -> Silver (Filtered, Cleaned, Augmented) -> Gold (Business-level Aggregates)]
Goal: improve the structure and quality of data incrementally, steadily and progressively as it flows through each stage.
simplify the transformation logic
problems a multilayered data structure may solve:
excessive focus on operations/operators rather than on data, which is what data engineers should be paying attention to
Stages: 3 or 4; there is no hard rule, but there should not be more.
check quality at every layer (instead of final stage)
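To make "check quality at every layer" concrete, here is a minimal plain-Python sketch. The record fields (`user`, `amount`), the function names, and the specific checks are hypothetical and were not shown in the talk; the point is only that each stage validates its own output before handing it on.

```python
import math
from collections import defaultdict


def to_bronze(raw_rows):
    """Bronze: keep raw records untouched; only check they are records."""
    rows = list(raw_rows)
    assert all(isinstance(r, dict) for r in rows), "bronze: non-record input"
    return rows


def to_silver(bronze_rows):
    """Silver: drop incomplete rows and cast amounts to float."""
    rows = [
        {"user": r["user"], "amount": float(r["amount"])}
        for r in bronze_rows
        if r.get("user") and r.get("amount") is not None
    ]
    assert all(r["amount"] >= 0 for r in rows), "silver: negative amount"
    return rows


def to_gold(silver_rows):
    """Gold: aggregate the cleaned rows into a business-level total per user."""
    totals = defaultdict(float)
    for r in silver_rows:
        totals[r["user"]] += r["amount"]
    # Quality check: aggregation must not lose or invent any amount.
    expected = sum(r["amount"] for r in silver_rows)
    assert math.isclose(sum(totals.values()), expected), "gold: totals drifted"
    return dict(totals)


raw = [
    {"user": "amy", "amount": "10.5"},
    {"user": "bob", "amount": "3"},
    {"user": None, "amount": "99"},  # incomplete, filtered out in silver
    {"user": "amy", "amount": "4.5"},
]
gold = to_gold(to_silver(to_bronze(raw)))
print(gold)
```

Because every layer asserts its own invariants, a bad record fails at the stage that introduced or should have caught it, not at the final report.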
Snowflake's Modern Data Architecture, a four-layer data design pattern:
Raw Layer
Raw data, no transformations
matches source schema
Conformed Layer
Raw and deduped data
data type standardization (e.g., dates)
Reference Layer
Modeled Layer
Integrated, cleansed, modeled data.
Often dimensionally modeled.
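As an illustration of what the step from the Raw Layer to the Conformed Layer might involve, the sketch below standardizes dates and removes duplicates in plain Python. The field names (`order_id`, `order_date`) and the list of accepted date formats are assumptions made for this example only.

```python
from datetime import datetime

# Hypothetical date formats that different source systems might emit.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def standardize_date(value):
    """Parse a date string in any known format and return ISO 8601 text."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def conform(raw_rows):
    """Standardize data types first, then drop rows that become duplicates."""
    seen = set()
    conformed = []
    for row in raw_rows:
        clean = {
            "order_id": row["order_id"],
            "order_date": standardize_date(row["order_date"]),
        }
        key = (clean["order_id"], clean["order_date"])
        if key not in seen:
            seen.add(key)
            conformed.append(clean)
    return conformed


raw_layer = [
    {"order_id": 1, "order_date": "2023-09-02"},
    {"order_id": 1, "order_date": "02/09/2023"},  # same order, other format
    {"order_id": 2, "order_date": "20230903"},
]
conformed_layer = conform(raw_layer)
print(conformed_layer)
```

Standardizing before deduplicating matters here: the two rows for order 1 only look identical once their dates share one format.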
Analytical Data In A Domain Team With Layered Data Architecture
[Diagram: OperationData -> Bronze; Business Analytics -> Bronze, Silver, and Gold; Gold -> DataProduct -> MarketingDomain]
Dagster
DAG = Directed Acyclic Graph
Why Dagster: Pros
has software-defined assets (SDA); the presenter recommends Dagster largely because of this feature (see notes on SDA later)
each SDA has a unique asset key, which can be used to resolve dependencies
data type validation (integrated with pydantic)
allows you to integrate prepared data from different data teams in a graph-like (DAG), intuitive way
shows the lineage of data sources and ops (via built-in visualization) so you can track them
extensive integrations:
BigQuery, AWS, etc.
pandas, dbt, Spark, Great Expectations, Pandera, etc.
allows you to write support for other frameworks
is cloud native
computation and I/O are decoupled
can replace the I/O manager and computations separately
rich ecosystem
Building Blocks of Dagster
Op
is the core unit of computation, such as
deriving a dataset from other datasets
this is implemented as a decorator
you only need something like this:
    from dagster import op

    # @op turns a plain function into a Dagster op
    @op
    def func():
        return 1
Dagster helps you resolve dependencies
Graph
is a set of interconnected ops or sub-graphs; it can be used in three different ways:
as a graph-backed asset
directly inside a job
inside another graph (as a sub-graph)
A graph cannot be run directly
A Job
is the main unit of execution and monitoring in Dagster
is also implemented as a decorator, @job (see notes on Op)
A Schedule
something like a cron job, which runs commands at specified intervals/timestamps
Software Defined Asset
Asset: an object in persistent storage, such as a table, file, or persisted machine learning model
SDA includes
A unique asset key
A set of upstream asset keys
An op (computation)
Experience Sharing
some lessons learned from the presenter:
watch your PostgreSQL resource consumption
you need to write code to automatically clean up excessive logs (this is an open issue, not resolved yet)
presenter recommends a periodic, scheduled deletion of excessive logs (through calling dagster API)
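The retention logic behind such a periodic cleanup could look roughly like this. It is a plain-Python sketch: the `(run_id, finished_at)` pairs, the 30-day retention window, and the `delete_run` callable are all assumptions for illustration. How the run list is obtained and how a run is deleted through Dagster's API is deliberately left out, since the notes do not record which calls the presenter used.

```python
from datetime import datetime, timedelta, timezone


def select_expired(runs, now, keep_days=30):
    """Return the ids of runs that finished more than keep_days ago."""
    cutoff = now - timedelta(days=keep_days)
    return [run_id for run_id, finished_at in runs if finished_at < cutoff]


def purge(runs, delete_run, now=None, keep_days=30):
    """Delete expired runs through the supplied callable and report them."""
    now = now or datetime.now(timezone.utc)
    expired = select_expired(runs, now, keep_days)
    for run_id in expired:
        delete_run(run_id)  # hypothetical hook; wire it to the real API
    return expired


# Example with fake run metadata and a list standing in for the delete call.
now = datetime(2023, 9, 2, tzinfo=timezone.utc)
runs = [
    ("run-old", now - timedelta(days=45)),
    ("run-new", now - timedelta(days=1)),
]
deleted = []
purged = purge(runs, deleted.append, now=now, keep_days=30)
print(purged)
```

Wrapping `purge` in a scheduled job would give the periodic deletion the presenter recommends.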
watch K8S resource consumption
all jobs will be executed in one pod
409 conflict error
occurs in versions before 1.3.6
solution: update to latest version or retry
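If upgrading is not possible right away, the "retry" option can be sketched as a small helper with exponential backoff. This is a generic plain-Python illustration: `ConflictError` is a hypothetical stand-in for however the 409 surfaces in your environment, and the attempt count and delays are arbitrary example values.

```python
import time


class ConflictError(Exception):
    """Hypothetical stand-in for an HTTP 409 Conflict failure."""


def retry_on_conflict(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying with exponential backoff when a conflict occurs."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except ConflictError:
            if attempt == attempts:
                raise  # out of attempts, surface the error
            sleep(base_delay * 2 ** (attempt - 1))


# Example: an operation that conflicts twice and then succeeds.
calls = {"count": 0}


def flaky_launch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConflictError("409 Conflict")
    return "launched"


outcome = retry_on_conflict(flaky_launch, sleep=lambda seconds: None)
print(outcome, calls["count"])
```

The `sleep` parameter is injectable so the helper can be tested without actually waiting.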
Most of these issues are solved nowadays, but we can still learn from them how to discover and solve problems in the future.
Below is the part the speaker updated in the talk/tutorial after the speech
Slides on OneDrive