---
title: "Ensuring Data Integrity with Validation and Pipeline Testing - Shuhsi Lin"
tags: PyConTW2024, 2024-organize, 2024-共筆
---

# Ensuring Data Integrity with Validation and Pipeline Testing - Shuhsi Lin

{%hackmd NY3XkI1xQ1C9TrHQhoy9Vw %}

<iframe src=https://app.sli.do/event/8B4BwuE4pwcDXzD2PHCG3d height=450 width=100%></iframe>

> Collaborative notes start below

Slide: https://speakerdeck.com/sucitw/ensuring-data-integrity-with-validation-and-pipeline-testing

## Data Quality

* Accuracy
* Completeness
* Consistency
* Timeliness
* Uniqueness
* Validity

## Integrity

* Domain
* User-Defined

## Data Contracts

An agreement between those who produce data and those who consume it.

Four principles

* interface
* governance

[Open Data Contract Standard](https://github.com/bitol-io/open-data-contract-standard)

* Ex. YAML

Service-level agreement (SLA)

## Data Testing

Not only test your code, but also your data.

Challenge: Test Case

* Testing
    * Code (Unit test)
    * Data

Data LifeCycle

## Data Validation Framework

Great Expectations (2018)

* Python based open source tool
* huge community support

dbt (data build tool)

* software base data tool
* general purpose hub for dbt

elementary

## How to choose framework

do less for easier adoption

## Q&A

1. Is this topic considered part of DataOps?
   yes
2. Is it recommended to use dbt and Great Expectations together?
   As mentioned in the talk, it depends on your situation.
3. In your experience, does running data quality tests take a lot of time?
4. How do the tools just mentioned detect missing data?
5. You mentioned using different data testing frameworks for different scenarios. Could you share a few example scenarios?
6. For data tests, is my understanding correct that the data checks run directly in the prod environment?
7. Are there tools for checking the quality of existing data, or should one just set up dbt directly?
8. A real-world question: once "missing data" has been defined, do clients keep changing the definition in practice?

Below is the part that the speaker updated or corrected in the talk/tutorial slides after the speech
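
The Data Quality dimensions listed in these notes, and the Q&A question about detecting missing data, can be illustrated with a small sketch. This is not taken from the slides and does not use Great Expectations or dbt. It is a plain-Python approximation of what such checks do, assuming hypothetical rows with `order_id`, `amount`, and `order_date` fields. It covers three dimensions only: completeness (no missing values), uniqueness (no repeated keys), and validity (values satisfy a rule).

```python
from datetime import date

# Hypothetical sample rows; the field names are assumptions for illustration.
ROWS = [
    {"order_id": 1, "amount": 120.0, "order_date": date(2024, 9, 1)},
    {"order_id": 2, "amount": None, "order_date": date(2024, 9, 2)},
    {"order_id": 2, "amount": -5.0, "order_date": date(2024, 9, 3)},
]


def check_completeness(rows, field):
    """Return indexes of rows where `field` is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(field) is None]


def check_uniqueness(rows, field):
    """Return indexes of rows whose `field` value was already seen earlier."""
    seen = set()
    duplicates = []
    for i, row in enumerate(rows):
        value = row.get(field)
        if value in seen:
            duplicates.append(i)
        seen.add(value)
    return duplicates


def check_validity(rows, field, predicate):
    """Return indexes of rows whose non-null `field` fails `predicate`."""
    return [
        i for i, row in enumerate(rows)
        if row.get(field) is not None and not predicate(row[field])
    ]


def validate(rows):
    """Run all checks; map each check name to the failing row indexes."""
    return {
        "completeness:amount": check_completeness(rows, "amount"),
        "uniqueness:order_id": check_uniqueness(rows, "order_id"),
        "validity:amount>=0": check_validity(rows, "amount", lambda v: v >= 0),
    }


if __name__ == "__main__":
    for name, failures in validate(ROWS).items():
        status = "PASS" if not failures else f"FAIL rows={failures}"
        print(f"{name}: {status}")
```

Returning the failing row indexes rather than a single boolean keeps the report actionable. A pipeline step could fail the run when any list is non-empty, or only warn, depending on what the data contract requires.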