
Ensuring Data Integrity with Validation and Pipeline Testing - Shuhsi Lin

Welcome to PyCon TW 2024 Collaborative Writing

Collaborative writing workspace: https://hackmd.io/@pycontw/2024
On mobile, tap the button above to expand the agenda list.

Collaborative notes start below

Slide: https://speakerdeck.com/sucitw/ensuring-data-integrity-with-validation-and-pipeline-testing

Data Quality

Accuracy
Completeness
Consistency
Timeliness
Uniqueness
Validity
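
As a rough illustration of how some of these dimensions can be turned into measurable scores, here is a minimal pure-Python sketch. The sample records, the field names (`order_id`, `amount`, `created_at`), and the helper functions are hypothetical and not taken from the talk.

```python
from datetime import datetime, timedelta

# Hypothetical sample records; field names are illustrative only.
records = [
    {"order_id": 1, "amount": 120.0, "created_at": datetime(2024, 9, 21, 9, 0)},
    {"order_id": 2, "amount": None, "created_at": datetime(2024, 9, 21, 9, 5)},
    {"order_id": 2, "amount": -5.0, "created_at": datetime(2024, 9, 20, 8, 0)},
    {"order_id": 3, "amount": 40.0, "created_at": datetime(2024, 9, 21, 9, 10)},
]

def completeness(rows, field):
    """Share of rows where `field` is present and not None."""
    return sum(r.get(field) is not None for r in rows) / len(rows)

def uniqueness(rows, field):
    """Share of distinct values among all values of `field`."""
    values = [r[field] for r in rows]
    return len(set(values)) / len(values)

def validity(rows, field, predicate):
    """Share of non-null values of `field` that satisfy `predicate`."""
    values = [r[field] for r in rows if r.get(field) is not None]
    return sum(predicate(v) for v in values) / len(values)

def timeliness(rows, field, now, max_age):
    """Share of rows whose timestamp is no older than `max_age`."""
    return sum(now - r[field] <= max_age for r in rows) / len(rows)

now = datetime(2024, 9, 21, 10, 0)  # fixed "current time" for reproducibility
report = {
    "completeness(amount)": completeness(records, "amount"),
    "uniqueness(order_id)": uniqueness(records, "order_id"),
    "validity(amount >= 0)": validity(records, "amount", lambda v: v >= 0),
    "timeliness(created_at)": timeliness(
        records, "created_at", now, timedelta(hours=2)
    ),
}
for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Each score is a ratio between 0 and 1, so a pipeline could compare it against a threshold agreed with data consumers. Accuracy and consistency are left out because they need a reference source or a second dataset to compare against.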

Integrity

  • Domain
  • User-defined
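
Domain integrity constrains each column to an allowed type, range, or set of values, while user-defined integrity covers business rules that go beyond a single column definition. A minimal sketch, assuming a hypothetical booking row and rule names chosen only for illustration:

```python
from datetime import date

# Domain rules: per-column constraints on type and allowed values (hypothetical).
DOMAIN_RULES = {
    "status": lambda v: v in {"pending", "confirmed", "cancelled"},
    "guests": lambda v: isinstance(v, int) and 1 <= v <= 10,
}

# User-defined rules: business logic across columns (hypothetical).
USER_DEFINED_RULES = {
    "check_out_after_check_in": lambda row: row["check_out"] > row["check_in"],
}

def integrity_violations(row):
    """Return the names of all rules that `row` violates."""
    violations = [
        f"domain:{column}"
        for column, is_valid in DOMAIN_RULES.items()
        if not is_valid(row[column])
    ]
    violations += [
        f"user_defined:{name}"
        for name, is_valid in USER_DEFINED_RULES.items()
        if not is_valid(row)
    ]
    return violations

good = {"status": "confirmed", "guests": 2,
        "check_in": date(2024, 9, 21), "check_out": date(2024, 9, 22)}
bad = {"status": "unknown", "guests": 2,
       "check_in": date(2024, 9, 22), "check_out": date(2024, 9, 21)}

print(integrity_violations(good))
print(integrity_violations(bad))
```

Keeping the two rule sets separate mirrors the distinction in the notes: domain rules can usually be derived from the schema, whereas user-defined rules have to come from people who know the business.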

Data Contracts

A contract between those who produce data and those who consume it

Four principles

  • interface
  • governance

Open Data Contract Standard

  • Example: YAML

Service-level agreement (SLA)
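
To make the idea concrete, a contract can be held as data and enforced in code. The sketch below uses a plain Python dictionary instead of a real Open Data Contract Standard YAML file; the keys (`schema`, `sla`, `max_delay_minutes`), the dataset name, and the `check_contract` helper are all hypothetical and do not follow the ODCS schema.

```python
from datetime import datetime, timedelta

# Hypothetical contract; the keys here are NOT the ODCS schema.
contract = {
    "dataset": "orders",
    "schema": {"order_id": int, "amount": float},
    "sla": {"max_delay_minutes": 60},
}

def check_contract(contract, rows, last_updated, now):
    """Return a list of human-readable contract violations."""
    problems = []
    # Interface check: every row has the agreed fields with the agreed types.
    for i, row in enumerate(rows):
        for field, expected_type in contract["schema"].items():
            if field not in row:
                problems.append(f"row {i}: missing field '{field}'")
            elif not isinstance(row[field], expected_type):
                problems.append(
                    f"row {i}: '{field}' should be {expected_type.__name__}"
                )
    # SLA check: the data must have been refreshed recently enough.
    max_delay = timedelta(minutes=contract["sla"]["max_delay_minutes"])
    if now - last_updated > max_delay:
        problems.append("SLA: data is older than the agreed delay")
    return problems

rows = [
    {"order_id": 1, "amount": 9.5},
    {"order_id": "2", "amount": 3.0},  # wrong type for order_id
    {"order_id": 3},                   # amount is missing
]
problems = check_contract(
    contract,
    rows,
    last_updated=datetime(2024, 9, 21, 8, 0),
    now=datetime(2024, 9, 21, 10, 0),
)
for problem in problems:
    print(problem)
```

In practice the contract would more likely live in a versioned YAML file shared by producer and consumer, with a check like this running on the producer side before data is published.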

Data Testing

Test not only your code but also your data.
Challenge: test cases

  • Testing
    • Code (unit tests)
    • Data
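
The difference between the two kinds of test can be sketched with the standard `unittest` module: one test exercises the transformation logic with a fixed input (a code test), and the other asserts properties of the data the pipeline produced (a data test). The `to_celsius` function and the `load_batch` stand-in are hypothetical.

```python
import unittest

def to_celsius(fahrenheit):
    """Transformation under test (hypothetical pipeline step)."""
    return (fahrenheit - 32) * 5 / 9

def load_batch():
    """Stand-in for reading the latest pipeline output (hypothetical)."""
    return [{"sensor": "a", "temp_f": 68.0}, {"sensor": "b", "temp_f": 212.0}]

class CodeTest(unittest.TestCase):
    def test_to_celsius(self):
        # Unit test: fixed input, known output.
        self.assertAlmostEqual(to_celsius(212.0), 100.0)

class DataTest(unittest.TestCase):
    def test_batch_is_plausible(self):
        # Data test: assertions about the data itself, not the code.
        batch = load_batch()
        self.assertGreater(len(batch), 0)
        for row in batch:
            self.assertIsNotNone(row["temp_f"])
            self.assertTrue(-90.0 <= to_celsius(row["temp_f"]) <= 100.0)

# Run both test cases programmatically and keep the result object.
loader = unittest.defaultTestLoader
suite = unittest.TestSuite()
suite.addTests(loader.loadTestsFromTestCase(CodeTest))
suite.addTests(loader.loadTestsFromTestCase(DataTest))
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all passed:", result.wasSuccessful())
```

The code test only needs to run when the code changes, while the data test is meant to run on every new batch, which is why writing good test cases for data is the harder part.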

Data Lifecycle

Data Validation Framework

Great Expectations (2018)

  • Python-based open-source tool
  • huge community support
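
The core idea of an expectation, a named and declarative assertion about a column that returns a result instead of raising an error, can be sketched without the library. The functions below are hypothetical stand-ins written in plain Python; they borrow the naming style of Great Expectations but are not its API, and the result shape is simplified.

```python
def expect_column_values_to_not_be_null(rows, column):
    """Stand-in expectation: no value in `column` is None."""
    unexpected = [r[column] for r in rows if r[column] is None]
    return {
        "expectation": f"not_null({column})",
        "success": not unexpected,
        "unexpected_count": len(unexpected),
    }

def expect_column_values_to_be_between(rows, column, min_value, max_value):
    """Stand-in expectation: non-null values lie in [min_value, max_value]."""
    unexpected = [
        r[column]
        for r in rows
        if r[column] is not None and not (min_value <= r[column] <= max_value)
    ]
    return {
        "expectation": f"between({column}, {min_value}, {max_value})",
        "success": not unexpected,
        "unexpected_count": len(unexpected),
    }

# Hypothetical batch with one null and one out-of-range value.
rows = [{"age": 34}, {"age": None}, {"age": 210}]

results = [
    expect_column_values_to_not_be_null(rows, "age"),
    expect_column_values_to_be_between(rows, "age", 0, 120),
]
for r in results:
    status = "PASS" if r["success"] else "FAIL"
    print(f'{status} {r["expectation"]} (unexpected: {r["unexpected_count"]})')
```

Returning a result dictionary rather than raising lets a whole suite run to completion and be summarized in one report, which is the pattern validation frameworks of this kind build on.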

dbt (data build tool)

  • software-based data tool
  • general purpose

hub for dbt
Elementary

How to choose framework

Do less, for easier adoption.

Q&A

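  1. Is this topic considered part of DataOps?
    Yes.
  2. Is it recommended to use dbt and Great Expectations together?
    As mentioned in the talk, it depends on your situation.
  3. In your experience, does running data quality tests take a lot of time?
  4. How do the tools just mentioned detect missing data?
  5. Different scenarios call for different data testing frameworks. Could you share a few example scenarios?
  6. Regarding data tests, is my understanding correct that they check the data directly in the production environment?
  7. Are there tools for checking the quality of existing data, or should one set up dbt directly?
  8. A real-world question: once "missing data" has been defined, do clients keep changing that definition in practice?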

Below are the updates or corrections the speaker made to the talk/tutorial slides after the session