
Implementing Layered Data Architecture In Dagster - George T. C., Lai

Welcome to PyCon TW 2023 Collaborative Writing

Collaborative writing workspace: https://hackmd.io/@pycontw/2023
On mobile, tap the button at the top to expand the agenda.

Collaborative writing starts below

Slide

Main Takeaways:

  • Concept of Layered Data Architecture
  • Introduction To Data Orchestration Framework – Dagster
  • Implementation Of Layered Data Architecture Using Dagster
  • Experience Sharing

Layered Data Architecture

  • Some history:

    • was first introduced by Databricks
      • Medallion Architecture
        [Diagram: Batch and Streaming sources -> Bronze (Raw Integration) -> Silver (Filtered, Cleaned, Augmented) -> Gold (Business-level Aggregates)]
  • Goal: improve the structure and quality of data incrementally as it flows through each stage.

    • simplify the transformation logic
      • problems a multilayered data structure may solve:
        • excessive focus on operations/operators rather than on the data, which is what data engineers should be paying attention to
  • Stages: usually 3 or 4; there is no hard rule, but there should not be more.

    • check quality at every layer (instead of final stage)
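To make "check quality at every layer" concrete, here is a minimal plain-Python sketch rather than anything shown in the talk. The field names (`id`, `amount`, `day`), the sample records, and the `check` helper are all assumptions made up for illustration.

```python
from datetime import date


def check(rows, layer, predicate):
    """Fail fast if any row in a layer violates that layer's quality rule."""
    bad = [r for r in rows if not predicate(r)]
    if bad:
        raise ValueError(f"{layer}: {len(bad)} row(s) failed the quality check")
    return rows


def bronze(raw_records):
    # Bronze: keep records as ingested; only require the expected fields.
    required = {"id", "amount", "day"}
    return check(list(raw_records), "bronze", lambda r: required.issubset(r))


def silver(bronze_rows):
    # Silver: drop incomplete rows, dedupe on id, and convert to real types.
    cleaned = {}
    for r in bronze_rows:
        if r["amount"] in ("", None):
            continue
        cleaned[r["id"]] = {
            "id": r["id"],
            "amount": float(r["amount"]),
            "day": date.fromisoformat(r["day"]),
        }
    return check(list(cleaned.values()), "silver", lambda r: r["amount"] >= 0)


def gold(silver_rows):
    # Gold: business-level aggregate (revenue per day).
    totals = {}
    for r in silver_rows:
        totals[r["day"]] = totals.get(r["day"], 0.0) + r["amount"]
    rows = [{"day": d, "revenue": v} for d, v in sorted(totals.items())]
    return check(rows, "gold", lambda r: r["revenue"] >= 0)


raw = [
    {"id": 1, "amount": "10.5", "day": "2023-09-02"},
    {"id": 1, "amount": "10.5", "day": "2023-09-02"},  # duplicate record
    {"id": 2, "amount": "", "day": "2023-09-02"},  # missing amount
    {"id": 3, "amount": "4.5", "day": "2023-09-03"},
]
report = gold(silver(bronze(raw)))
```

Because each layer validates its own output, a bad record is reported at the layer where it first appears instead of surfacing only in the final aggregate.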

Snowflake's Modern Data Architecture

Four-layer data design pattern:

  • Raw Layer
    • Raw data, no transformations
    • matches source schema
  • Conformed Layer
    • Raw and deduped data
    • data type standardization (e.g. dates)
  • Reference Layer
    • Business hierarchies
  • Modeled Layer
    • Integrated, cleansed, modeled data.
    • Often dimensionally modeled.
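The four layers above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the order records, the three date formats, and the store hierarchy are invented, and a real implementation would run inside the warehouse rather than on Python lists.

```python
from datetime import datetime

# Assumed set of date formats found in the raw sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def to_iso_date(text):
    """Standardize a date string to ISO 8601, trying each assumed format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


# Raw layer: no transformations, matches the source schema.
raw = [
    {"order_id": "A1", "store": "TPE-01", "order_date": "2023-09-02"},
    {"order_id": "A1", "store": "TPE-01", "order_date": "02/09/2023"},  # same order
    {"order_id": "B7", "store": "KHH-03", "order_date": "20230903"},
]


def conform(rows):
    # Conformed layer: standardize data types, then dedupe on order_id.
    seen = {}
    for row in rows:
        row = {**row, "order_date": to_iso_date(row["order_date"])}
        seen.setdefault(row["order_id"], row)
    return list(seen.values())


# Reference layer: business hierarchy (store -> city -> region).
reference = {
    "TPE-01": {"city": "Taipei", "region": "North"},
    "KHH-03": {"city": "Kaohsiung", "region": "South"},
}


def model(conformed, hierarchy):
    # Modeled layer: integrate conformed facts with the reference hierarchy.
    return [{**row, **hierarchy[row["store"]]} for row in conformed]


modeled = model(conform(raw), reference)
```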

Analytical Data In A Domain Team With Layered Data Architecture

[Diagram: OperationData -> Bronze; Business Analytics (BA) -> Bronze, Silver, Gold; Gold -> DataProduct -> MarketingDomain]

Dagster

DAG = Directed Acyclic Graph

Why Dagster – Pros

  • has Software-Defined Assets (SDA); the presenter recommends Dagster largely because of this feature (see notes on SDA later)
    • each SDA has a unique asset key, which can be used to resolve dependencies
    • data type validation (integrated with pydantic)
  • allows you to integrate prepared data from different data teams in an intuitive, graph-like (DAG) way
  • shows the lineage of data sources and ops through built-in visualization, so you can track where data comes from
  • extensive integrations:
    • BigQuery, AWS, etc.
    • pandas, dbt, Spark, Great Expectations, Pandera, etc.
    • allows you to write support for other frameworks
  • is cloud native
  • computation and I/O are decoupled
    • the I/O manager and the computations can be replaced separately
  • rich ecosystem
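The point about decoupled computation and I/O can be illustrated with a small plain-Python sketch. It does not use Dagster's I/O manager API; the classes `InMemoryIO` and `JsonFileIO` and the helper `run_step` are hypothetical names chosen only to show the idea that the same computation runs unchanged against different storage backends.

```python
import json
import tempfile
from pathlib import Path


class InMemoryIO:
    """Hypothetical storage backend that keeps values in a dict."""

    def __init__(self):
        self._store = {}

    def save(self, key, value):
        self._store[key] = value

    def load(self, key):
        return self._store[key]


class JsonFileIO:
    """Hypothetical storage backend that writes one JSON file per key."""

    def __init__(self, root):
        self.root = Path(root)

    def save(self, key, value):
        (self.root / f"{key}.json").write_text(json.dumps(value))

    def load(self, key):
        return json.loads((self.root / f"{key}.json").read_text())


def clean_amounts(values):
    # Pure computation: it knows nothing about where data is stored.
    return [round(v, 2) for v in values if v >= 0]


def run_step(io, source_key, target_key, compute):
    # Load the input, compute, and persist the output through the backend.
    io.save(target_key, compute(io.load(source_key)))
    return io.load(target_key)


memory = InMemoryIO()
memory.save("bronze", [1.234, -5.0, 2.0])
in_memory_result = run_step(memory, "bronze", "silver", clean_amounts)

with tempfile.TemporaryDirectory() as tmp:
    files = JsonFileIO(tmp)
    files.save("bronze", [1.234, -5.0, 2.0])
    on_disk_result = run_step(files, "bronze", "silver", clean_amounts)
```

Swapping `InMemoryIO` for `JsonFileIO` changes where the data lives without touching `clean_amounts`, which is the separation the bullet above describes.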

Building Blocks of Dagster

  • Op is the core unit of computation, such as
    • deriving a dataset from other datasets
    • this is implemented as a decorator
      • you only need something like this:
      from dagster import op

      @op
      def func():
          return 1
      # imagine you have many functions that look like this

      • dagster helps you resolve dependencies
  • Graph is a set of interconnected ops or sub-graphs; it can be used in three different ways
    • as a graph-backed asset
    • directly inside a job
    • inside another graph (as a sub-graph)
    • a graph cannot be run directly
  • A Job
    • is the main unit of execution and monitoring in Dagster
    • is also implemented as a decorator @job, see notes on Op
  • A Schedule
    • something like a cron job, which runs commands at specified intervals or times

Software Defined Asset

  • Asset: an object in persistent storage, such as a table, file, or persisted machine learning model
  • SDA includes
    • A unique asset key
    • A set of upstream asset keys
    • An op (computation)
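The three parts of an SDA listed above (asset key, upstream asset keys, computation) can be modeled in plain Python to show how keys resolve dependencies. This is a toy sketch, not Dagster's implementation: `toy_asset`, `REGISTRY`, and `materialize_all` are hypothetical names, and reading upstream keys from parameter names is an assumption modeled on how Dagster's `@asset` decorator is commonly used.

```python
import inspect
from graphlib import TopologicalSorter

REGISTRY = {}  # asset key -> compute function


def toy_asset(fn):
    """Toy decorator: the function name serves as the unique asset key."""
    REGISTRY[fn.__name__] = fn
    return fn


def upstream_keys(fn):
    # Parameter names are read as the keys of the upstream assets.
    return list(inspect.signature(fn).parameters)


def materialize_all():
    # Build key -> upstream keys, then compute each asset after its inputs.
    graph = {key: set(upstream_keys(fn)) for key, fn in REGISTRY.items()}
    order = list(TopologicalSorter(graph).static_order())
    values = {}
    for key in order:
        fn = REGISTRY[key]
        values[key] = fn(*(values[dep] for dep in upstream_keys(fn)))
    return order, values


# Assets are deliberately declared out of order; the keys sort them out.
@toy_asset
def gold_revenue(silver_orders):
    return sum(o["amount"] for o in silver_orders)


@toy_asset
def silver_orders(bronze_orders):
    return [o for o in bronze_orders if o["amount"] > 0]


@toy_asset
def bronze_orders():
    return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": -1.0}, {"id": 3, "amount": 5.0}]


order, values = materialize_all()
```

Here `order` lists the bronze asset first and the gold asset last, even though the functions were declared in the opposite order.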


Experience Sharing

Some lessons learned from the presenter:

  1. watch your postgres resource consumption

    • retention rate
  2. you need to write code to automatically clean up excessive logs (this is an open issue, not resolved yet)

    • presenter recommends a periodic, scheduled deletion of excessive logs (through calling dagster API)
  3. watch K8S resource consumption

    • all jobs will be executed in one pod
  4. 409 conflict error

    • occurs in versions before 1.3.6
    • solution: update to latest version or retry
  5. Most of these issues have since been resolved, but they still show how to discover and solve problems in the future.
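For the "retry" workaround mentioned under the 409 conflict error, a generic retry-with-backoff helper might look like the sketch below. Everything here is an assumption for illustration: `ConflictError` is a hypothetical stand-in for however the 409 surfaces in your client, and `flaky_launch` only simulates a call that fails twice before succeeding.

```python
import time


class ConflictError(Exception):
    """Hypothetical stand-in for an HTTP 409 Conflict response."""


def retry_on_conflict(call, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `call()`, retrying with exponential backoff on ConflictError."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except ConflictError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_launch():
    # Simulated operation: conflicts on the first two calls, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConflictError("409 Conflict")
    return "launched"


delays = []  # record the requested delays instead of actually sleeping
outcome = retry_on_conflict(flaky_launch, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in production the default `time.sleep` applies.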

Below are the parts of the talk/tutorial that the speaker updated or corrected after the talk

Slides on OneDrive