Ensuring Data Integrity with Validation and Pipeline Testing - Shuhsi Lin
Collaborative notes start below
Slide: https://speakerdeck.com/sucitw/ensuring-data-integrity-with-validation-and-pipeline-testing
Data Quality
Accuracy
Completeness
Consistency
Timeliness
Uniqueness
Validity
Integrity
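A few of the dimensions above can be expressed as simple ratios over a batch of records. The following is a minimal sketch, not something shown in the talk: the sample rows, field names, and the four helper functions are all hypothetical, and it uses only the Python standard library. Accuracy, consistency, and integrity are left out because they usually need a reference dataset or a second table to compare against.

```python
from datetime import datetime, timedelta

# Hypothetical sample records; field names are assumptions for illustration.
rows = [
    {"order_id": 1, "amount": 120.0, "updated_at": datetime(2024, 1, 2)},
    {"order_id": 2, "amount": None, "updated_at": datetime(2024, 1, 2)},
    {"order_id": 2, "amount": -5.0, "updated_at": datetime(2023, 12, 1)},
]


def completeness(rows, field):
    """Fraction of rows where the field is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(field) is not None)
    return filled / len(rows)


def uniqueness(rows, field):
    """Fraction of distinct values among the non-null values of the field."""
    values = [r[field] for r in rows if r.get(field) is not None]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def validity(rows, field, predicate):
    """Fraction of non-null values that satisfy the given rule."""
    values = [r[field] for r in rows if r.get(field) is not None]
    if not values:
        return 0.0
    return sum(1 for v in values if predicate(v)) / len(values)


def timeliness(rows, field, now, max_age):
    """Fraction of rows whose timestamp is no older than max_age."""
    if not rows:
        return 0.0
    fresh = sum(
        1 for r in rows
        if r.get(field) is not None and now - r[field] <= max_age
    )
    return fresh / len(rows)


report = {
    "completeness(amount)": completeness(rows, "amount"),
    "uniqueness(order_id)": uniqueness(rows, "order_id"),
    "validity(amount >= 0)": validity(rows, "amount", lambda v: v >= 0),
    "timeliness(updated_at)": timeliness(
        rows, "updated_at", datetime(2024, 1, 3), timedelta(days=7)
    ),
}
```

Each score is between 0 and 1, so a team could agree on a threshold per dimension and alert when a batch falls below it.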
Data Contracts
A contract between those who produce data and those who use it
Four principles
Open Data Contract Standard
Service-level agreement (SLA)
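To make the idea of a contract between producers and consumers concrete, here is a minimal sketch in plain Python. It is an assumption-laden illustration only: the `FieldSpec` and `DataContract` classes, their attributes, and the `orders` example are hypothetical, and the structure does not follow the Open Data Contract Standard file format.

```python
from dataclasses import dataclass


@dataclass
class FieldSpec:
    """One column the producer promises to deliver (hypothetical shape)."""
    name: str
    dtype: type
    required: bool = True


@dataclass
class DataContract:
    """Schema plus a simple SLA agreed between producer and consumer."""
    dataset: str
    owner: str
    fields: list
    max_delay_hours: int  # SLA: data must arrive within this many hours

    def violations(self, record):
        """Return a list of human-readable schema problems for one record."""
        problems = []
        for spec in self.fields:
            value = record.get(spec.name)
            if value is None:
                if spec.required:
                    problems.append(f"{spec.name}: missing required field")
                continue
            if not isinstance(value, spec.dtype):
                problems.append(
                    f"{spec.name}: expected {spec.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
        return problems

    def within_sla(self, delay_hours):
        """True if the observed delivery delay respects the agreed SLA."""
        return delay_hours <= self.max_delay_hours


contract = DataContract(
    dataset="orders",
    owner="sales-team",
    fields=[
        FieldSpec("order_id", int),
        FieldSpec("amount", float),
        FieldSpec("note", str, required=False),
    ],
    max_delay_hours=24,
)

good = contract.violations({"order_id": 1, "amount": 9.5})
bad = contract.violations({"order_id": "1"})
```

Here `good` is an empty list, while `bad` reports a type mismatch on `order_id` and a missing `amount`. Keeping the contract as data rather than scattered checks lets both sides review and version it.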
Data Testing
Not only test your code, but also your data.
Challenge: Test Case
Data Lifecycle
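The difference between testing code and testing data can be sketched as follows. This is a hypothetical example, not taken from the slides: `clean_orders` stands in for any pipeline step, and `check_batch` is an assumed helper that inspects the batch the step actually produced.

```python
def clean_orders(raw_rows):
    """Hypothetical pipeline step: drop rows without an id, cast amounts to float."""
    cleaned = []
    for row in raw_rows:
        if row.get("order_id") is None:
            continue
        cleaned.append(
            {"order_id": row["order_id"], "amount": float(row.get("amount") or 0)}
        )
    return cleaned


def check_batch(rows):
    """Data test: return the names of the checks that this batch fails."""
    failures = []
    if not rows:
        failures.append("batch_not_empty")
    ids = [r["order_id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("order_id_unique")
    if any(r["amount"] < 0 for r in rows):
        failures.append("amount_non_negative")
    return failures


# Code test: a fixed input with a known expected output.
assert clean_orders([{"order_id": 1, "amount": "3.50"}, {"amount": "1"}]) == [
    {"order_id": 1, "amount": 3.5}
]

# Data test: the code is correct, but this particular batch is still bad.
batch = clean_orders(
    [{"order_id": 1, "amount": "10"}, {"order_id": 1, "amount": "-2"}]
)
failed = check_batch(batch)
```

The code test passes regardless of what arrives in production, whereas `failed` lists the duplicate id and the negative amount in this batch. That gap is why the data itself needs its own checks.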
Data Validation Framework
Great Expectations (2018)
- Python based open source tool
- huge community support
dbt (data build tool)
- software-based data tool
- general purpose
hub for dbt
elementary
How to choose framework
do less for easier adoption
Q&A
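- Is this topic considered part of DataOps?
Yes.
- Is it recommended to use dbt and Great Expectations together?
As mentioned in the talk, it depends on your situation.
- In your past experience, does running data quality tests take a lot of time?
- How do the tools just mentioned detect missing data?
- You said different scenarios call for different data testing frameworks. Could you share a few example scenarios?
- Regarding data tests, is my understanding correct that they mean checking the data directly in the prod environment?
- Are there tools for checking the quality of existing data, or should we just set up dbt directly?
- A real-world question: once "missing data" has been defined, do customers keep changing that definition in practice?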
Below are the parts of the talk/tutorial slides that the speaker updated or corrected after the session