Building Self-serve Data Platform Based On Dagster, DBT and DuckDB - 賴宗智

Welcome to the MOPCON 2024 collaborative notes

Collaborative notes index: https://hackmd.io/@mopcon/2024

Start here

Modern Data Stack (MDS)

Most bloggers' point of view:

Good

Layers: ingestion, warehousing, transformation, BI.
Horizontal products and unlimited scale using cloud infrastructure. (Cost is the primary constraint on data processing.)
Low overhead investment (infra/data engineers).
United by SQL.
Fast both in iteration speed and in raw query execution time.

Bad

Data mesh is a decentralized sociotechnical approach to share, access, and manage analytical data in complex and large-scale environments.
Principles
- Domain ownership (decentralization)
- Data as a product (product thinking)
- Self-serve data platform (focused on this talk)
- Federated computational governance

(Couldn't keep up with the notes here… XD)

Pros
- Versatility (general-purpose ETL/ELT data pipelines)
- Flexibility (manage multiple database environments; adapt via data model/contract)
- Cost efficiency (via DuckDB)
- Easy to roll back for disaster recovery (GitOps)
Cons
- Need to maintain EL tasks for various sources and destinations
- Lack of built-in data lineage for dbt models

Modern Data Stack
Data Mesh
- Self-serve data platform
- Data orchestration / pipeline, workflow management
Dagster as a self-serve data platform
- Domain-agnostic components
- Domain-specific repos
WAP-based (Write-Audit-Publish) data pipeline using dbt
- with DuckDB, or
- with staging table in DWs
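
The notes do not record how the talk implemented WAP, so the following is only a minimal sketch of the staging-table variant, not the speaker's pipeline. It uses plain SQL from Python rather than dbt, and the standard-library `sqlite3` module stands in for DuckDB or a warehouse so that it runs without extra dependencies. The table names (`orders`, `orders_staging`), the columns, and the audit rules are all hypothetical.

```python
import sqlite3

# Hypothetical audit rules: each query returns the number of violating rows
# found in the staging table.
AUDIT_QUERIES = {
    "null order_id": "SELECT COUNT(*) FROM orders_staging WHERE order_id IS NULL",
    "negative amount": "SELECT COUNT(*) FROM orders_staging WHERE amount < 0",
    "duplicate order_id": (
        "SELECT COUNT(*) - COUNT(DISTINCT order_id) "
        "FROM orders_staging WHERE order_id IS NOT NULL"
    ),
}


def write_audit_publish(con, rows):
    """Run one Write-Audit-Publish cycle; return the names of failed audits.

    Expects a connection opened with isolation_level=None so that the
    transaction boundaries below are the only ones in play.
    """
    # Write: land the batch in a staging table, never in the published one.
    con.execute("DROP TABLE IF EXISTS orders_staging")
    con.execute("CREATE TABLE orders_staging (order_id INTEGER, amount REAL)")
    con.executemany("INSERT INTO orders_staging VALUES (?, ?)", rows)

    # Audit: run every check against staging only.
    failed = [
        name
        for name, query in AUDIT_QUERIES.items()
        if con.execute(query).fetchone()[0] > 0
    ]
    if failed:
        # Leave staging in place for inspection; the published table is untouched.
        return failed

    # Publish: swap staging into place inside a single transaction.
    con.execute("BEGIN")
    try:
        con.execute("DROP TABLE IF EXISTS orders")
        con.execute("ALTER TABLE orders_staging RENAME TO orders")
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")
        raise
    return failed
```

Calling `write_audit_publish(sqlite3.connect(":memory:", isolation_level=None), [(1, 10.0), (2, 5.5)])` publishes the batch and returns an empty list. A later batch that fails any audit returns the failed check names and leaves the previously published `orders` table as it was, which is the property that makes the pattern useful for consumers downstream.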