Ensuring Data Integrity with Validation and Pipeline Testing - Shuhsi Lin
Welcome to PyCon TW 2024 Collaborative Writing
Collaborative Writing Workspace: https://hackmd.io/@pycontw/2024
On mobile, tap the button at the top to expand the agenda list.
Collaborative writing starts below
Slide: https://speakerdeck.com/sucitw/ensuring-data-integrity-with-validation-and-pipeline-testing
Data Quality
Accuracy
Completeness
Consistency
Timeliness
Uniqueness
Validity
Integrity
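A few of the dimensions above can be expressed as simple ratio checks. The following is only a rough sketch in plain Python: the sample records, the field names (`order_id`, `amount`, `updated_at`), the "non-negative amount" rule, and the one-day freshness window are all assumptions made for illustration and do not come from the talk.

```python
from datetime import datetime, timedelta

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"order_id": 1, "amount": 120.0, "updated_at": datetime(2024, 9, 21, 9, 0)},
    {"order_id": 2, "amount": None, "updated_at": datetime(2024, 9, 21, 9, 5)},
    {"order_id": 2, "amount": -5.0, "updated_at": datetime(2024, 9, 19, 8, 0)},
]


def non_null(rows, field):
    """Return the non-null values of `field` across all rows."""
    return [r[field] for r in rows if r.get(field) is not None]


def completeness(rows, field):
    """Fraction of rows where `field` is present and not None."""
    return len(non_null(rows, field)) / len(rows) if rows else 1.0


def uniqueness(rows, field):
    """Ratio of distinct values to total non-null values of `field`."""
    values = non_null(rows, field)
    return len(set(values)) / len(values) if values else 1.0


def validity(rows, field, is_valid):
    """Fraction of non-null values of `field` that satisfy `is_valid`."""
    values = non_null(rows, field)
    return sum(1 for v in values if is_valid(v)) / len(values) if values else 1.0


def timeliness(rows, field, now, max_age):
    """Fraction of rows whose timestamp in `field` is at most `max_age` old."""
    if not rows:
        return 1.0
    return sum(1 for r in rows if now - r[field] <= max_age) / len(rows)


# Assumed "current time" so the example is reproducible.
now = datetime(2024, 9, 21, 10, 0)

report = {
    "completeness(amount)": completeness(records, "amount"),
    "uniqueness(order_id)": uniqueness(records, "order_id"),
    "validity(amount >= 0)": validity(records, "amount", lambda v: v >= 0),
    "timeliness(updated_at)": timeliness(records, "updated_at", now, timedelta(days=1)),
}

for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Each function returns a score between 0 and 1, so a pipeline could compare it against an agreed threshold instead of failing on the first bad row.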
Data Contracts
An agreement between the party that produces the data and the party that uses it
Four principles
Open Data Contract Standard
Service-level agreement (SLA)
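To make the idea of a contract between producer and consumer more concrete, here is a minimal sketch in plain Python. It is not the Open Data Contract Standard format; the class names (`FieldSpec`, `DataContract`), the `orders` dataset, the owner name, and the 24-hour delay are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One column the producer promises to deliver (hypothetical structure)."""
    name: str
    dtype: type
    required: bool = True


@dataclass(frozen=True)
class DataContract:
    """Illustrative contract: schema plus a simple SLA value."""
    dataset: str
    owner: str
    fields: tuple
    max_delay_hours: int  # SLA term; recorded here, not enforced in this sketch

    def violations(self, record):
        """Return a list of human-readable schema violations for one record."""
        problems = []
        for spec in self.fields:
            value = record.get(spec.name)
            if value is None:
                if spec.required:
                    problems.append(f"{spec.name}: missing required field")
                continue
            if not isinstance(value, spec.dtype):
                problems.append(
                    f"{spec.name}: expected {spec.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
        return problems


# Assumed example contract agreed between a producer and its consumers.
orders_contract = DataContract(
    dataset="orders",
    owner="sales-data-team",
    fields=(
        FieldSpec("order_id", int),
        FieldSpec("amount", float),
        FieldSpec("note", str, required=False),
    ),
    max_delay_hours=24,
)

good_problems = orders_contract.violations({"order_id": 1, "amount": 9.5})
bad_problems = orders_contract.violations({"order_id": "1", "note": "rush"})

print(good_problems)
print(bad_problems)
```

Keeping the contract as data rather than as scattered checks means both sides can review the same definition, and the producer can run `violations` before publishing.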
Data Testing
Not only test your code, but also your data.
Challenge: Test Case
Data Lifecycle
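The difference between testing code and testing data can be sketched as follows. This is an illustrative example only: the `to_celsius` transformation, the sensor records, and the plausible temperature range are assumptions, not material from the talk.

```python
def to_celsius(rows):
    """Transformation under test: convert Fahrenheit readings to Celsius."""
    return [
        {"sensor": r["sensor"], "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)}
        for r in rows
    ]


def test_code():
    """Code test: fixed, hand-made input with a known expected output."""
    result = to_celsius([{"sensor": "a", "temp_f": 212.0}])
    assert result == [{"sensor": "a", "temp_c": 100.0}]


def check_data(rows):
    """Data test: checks on an actual batch; returns failure messages."""
    failures = []
    if not rows:
        failures.append("dataset is empty")
    for i, r in enumerate(rows):
        # Assumed plausible range for outdoor temperatures in Celsius.
        if not -90 <= r["temp_c"] <= 60:
            failures.append(f"row {i}: temp_c {r['temp_c']} out of range")
    return failures


# The code test passes: the logic is correct.
test_code()

# Hypothetical incoming batch: the code is fine, but one reading is bad.
batch = to_celsius([
    {"sensor": "a", "temp_f": 68.0},
    {"sensor": "b", "temp_f": 500.0},
])
failures = check_data(batch)
print(failures)
```

The code test stays green no matter what arrives in production, which is why a separate check on the data itself is needed to catch the bad reading.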
Data Validation Framework
Great Expectations (2018)
- Python-based open-source tool
- huge community support
dbt (data build tool)
- software-based data tool
- general purpose
hub for dbt
elementary
How to choose framework
do less for easier adoption
Q&A
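- Is this topic considered part of DataOps?
Yes.
- Is it recommended to use dbt and Great Expectations together?
As mentioned in the talk, it depends on your scenario.
- In your past experience, does running data quality tests take a lot of time?
- How do the tools mentioned in the talk detect missing data?
- You mentioned using different data testing frameworks for different scenarios. Could you share a few example scenarios?
- Regarding data tests, is my understanding correct that they are data checks run directly in the prod environment?
- Are there tools for checking the quality of existing data, or should one set up dbt directly?
- In real-world projects, once "missing data" has been defined, do clients keep changing the definition in practice?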
Below is the part where the speaker updated or corrected the slides after the talk