---
title: "Ensuring Data Integrity with Validation and Pipeline Testing - Shuhsi Lin"
tags: PyConTW2024, 2024-organize, 2024-共筆
---
# Ensuring Data Integrity with Validation and Pipeline Testing - Shuhsi Lin
{%hackmd NY3XkI1xQ1C9TrHQhoy9Vw %}
<iframe src=https://app.sli.do/event/8B4BwuE4pwcDXzD2PHCG3d height=450 width=100%></iframe>
> Collaborative notes start below
Slides: https://speakerdeck.com/sucitw/ensuring-data-integrity-with-validation-and-pipeline-testing
## Data Quality
* Accuracy
* Completeness
* Consistency
* Timeliness
* Uniqueness
* Validity
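A minimal sketch of how a few of these dimensions could be measured, assuming a small list of in-memory records. The field names (`id`, `email`, `updated_at`), the helper functions, and the "contains `@`" validity rule are all hypothetical and not from the talk. Accuracy and consistency are left out because they need a trusted reference source to compare against.

```python
from datetime import datetime, timedelta

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "updated_at": datetime(2024, 9, 21, 9, 0)},
    {"id": 2, "email": None, "updated_at": datetime(2024, 9, 21, 8, 0)},
    {"id": 2, "email": "not-an-email", "updated_at": datetime(2024, 9, 10, 8, 0)},
]

def completeness(rows, field):
    """Share of rows where `field` is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, field):
    """Share of distinct values of `field` among all rows."""
    return len({r[field] for r in rows}) / len(rows)

def validity(rows, field, is_valid):
    """Share of non-null values of `field` that pass `is_valid`."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(is_valid(v) for v in values) / len(values)

def timeliness(rows, field, now, max_age):
    """Share of rows whose timestamp is no older than `max_age`."""
    return sum(now - r[field] <= max_age for r in rows) / len(rows)

# A fixed "now" keeps the example deterministic.
now = datetime(2024, 9, 21, 12, 0)
report = {
    "completeness(email)": completeness(records, "email"),
    "uniqueness(id)": uniqueness(records, "id"),
    "validity(email)": validity(records, "email", lambda v: "@" in v),
    "timeliness(updated_at)": timeliness(records, "updated_at", now, timedelta(days=1)),
}
for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Each score is a ratio between 0 and 1, so a threshold (for example, "completeness must be at least 0.99") can turn any of them into a pass/fail check.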
## Integrity
* Domain
* User-defined
## Data Contracts
An agreement between the producers and the consumers of data.
Four principles
* interface
* governance
[Open Data Contract Standard](https://github.com/bitol-io/open-data-contract-standard)
* Example: YAML
Service-level agreement (SLA)
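To make the idea concrete, here is a rough sketch of a contract expressed as a Python dictionary and checked against incoming rows. The contract keys (`schema`, `sla`, `max_delay`), the `orders` dataset, and the `check_contract` helper are hypothetical; they do not follow the Open Data Contract Standard schema, which is written in YAML and is much richer.

```python
from datetime import datetime, timedelta

# Hypothetical contract; the keys are illustrative and are NOT the
# Open Data Contract Standard schema.
contract = {
    "dataset": "orders",
    "schema": {
        "order_id": {"type": int, "required": True},
        "amount": {"type": float, "required": True},
        "coupon": {"type": str, "required": False},
    },
    # SLA: the producer promises data no older than 24 hours.
    "sla": {"max_delay": timedelta(hours=24)},
}

def check_contract(rows, last_updated, now, contract):
    """Return a list of human-readable contract violations (empty if none)."""
    violations = []
    for i, row in enumerate(rows):
        for name, rule in contract["schema"].items():
            value = row.get(name)
            if value is None:
                if rule["required"]:
                    violations.append(f"row {i}: missing required field '{name}'")
            elif not isinstance(value, rule["type"]):
                violations.append(f"row {i}: '{name}' should be {rule['type'].__name__}")
    if now - last_updated > contract["sla"]["max_delay"]:
        violations.append("SLA: data is older than the agreed maximum delay")
    return violations

rows = [
    {"order_id": 1, "amount": 9.5, "coupon": "WELCOME"},
    {"order_id": "2", "amount": 3.0},  # wrong type for order_id
    {"order_id": 3},                   # amount is missing
]
violations = check_contract(
    rows,
    last_updated=datetime(2024, 9, 20, 8, 0),
    now=datetime(2024, 9, 21, 12, 0),
    contract=contract,
)
for v in violations:
    print(v)
```

Keeping the contract as plain data (rather than code) is what lets the producer and the consumer review and version it together.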
## Data Testing
Not only test your code, but also your data.
Challenge: test cases
* Testing
    * Code (unit tests)
    * Data
Data lifecycle
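The difference between testing code and testing data can be sketched as follows. The transformation `add_total`, the `check_data` helper, and the sample rows are hypothetical examples, not taken from the talk. The unit test uses a fixed, hand-written input, while the data test runs against whatever data actually arrives.

```python
def add_total(rows):
    """Transformation under test: add total = price * quantity to each row."""
    return [{**r, "total": r["price"] * r["quantity"]} for r in rows]

def test_add_total():
    """Code test (unit test): fixed input with a known expected output."""
    assert add_total([{"price": 2.0, "quantity": 3}]) == [
        {"price": 2.0, "quantity": 3, "total": 6.0}
    ]

def check_data(rows):
    """Data test: map each bad row index to the reason it failed."""
    errors = {}
    for i, r in enumerate(rows):
        if r.get("price") is None or r.get("quantity") is None:
            errors[i] = "missing price or quantity"
        elif r["price"] < 0 or r["quantity"] < 0:
            errors[i] = "negative price or quantity"
    return errors

# The unit test passes, so the code is correct for well-formed input...
test_add_total()

# ...but real input can still be bad, which only a data test catches.
incoming = [
    {"price": 2.0, "quantity": 3},
    {"price": None, "quantity": 1},
    {"price": 5.0, "quantity": -2},
]
errors = check_data(incoming)
clean = [r for i, r in enumerate(incoming) if i not in errors]
totals = add_total(clean)
print(errors)
print(totals)
```

In this sketch bad rows are filtered out before the transformation; a real pipeline might instead stop the run or route them to a quarantine table, depending on the agreed policy.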
## Data Validation Framework
Great Expectations (2018)
* Python-based open source tool
* huge community support
dbt (data build tool)
* software-based data tool
* general purpose
Hub for dbt
Elementary
## How to choose framework
Do less to make adoption easier.
## Q&A
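1. Is this topic considered part of DataOps?
Yes.
2. Do you recommend using dbt and Great Expectations together?
As mentioned in the talk, it depends on your situation.
3. In your experience, does running data quality tests take a lot of time?
4. How do the tools just mentioned detect missing data?
5. Different scenarios call for different data testing frameworks. Could you share a few example scenarios?
6. Regarding data tests, is my understanding correct that they mean running data checks directly in the prod environment?
7. Are there tools for assessing the quality of existing data, or should we set up dbt directly?
8. In the real world, once "missing data" has been defined, do clients keep changing the definition in practice?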
Below are the updates or corrections the speaker made to the slides after the talk.