title: "Ensuring Data Integrity with Validation and Pipeline Testing - Shuhsi Lin"

Slide: https://speakerdeck.com/sucitw/ensuring-data-integrity-with-validation-and-pipeline-testing

## Data Quality
* Accuracy
* Completeness
* Consistency
* Timeliness
* Uniqueness
* Validity
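
A minimal sketch of how a few of these dimensions could be turned into measurable scores. This is not from the talk: the sample records, field names, and helper functions are all hypothetical, and the rule "an email contains `@`" is only a stand-in for a real validity rule.

```python
from datetime import datetime, timedelta

# Hypothetical sample records; field names are illustrative only.
rows = [
    {"id": 1, "email": "a@example.com", "updated_at": datetime(2024, 1, 2)},
    {"id": 2, "email": None, "updated_at": datetime(2024, 1, 2)},
    {"id": 2, "email": "not-an-email", "updated_at": datetime(2023, 12, 1)},
]

def completeness(rows, field):
    """Share of rows where `field` is not None."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, field):
    """Share of distinct values of `field` relative to the row count."""
    return len({r[field] for r in rows}) / len(rows)

def validity(rows, field, predicate):
    """Share of non-null values of `field` that satisfy `predicate`."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(predicate(v) for v in values) / len(values) if values else 1.0

def timeliness(rows, field, now, max_age):
    """Share of rows whose `field` timestamp is within `max_age` of `now`."""
    return sum(now - r[field] <= max_age for r in rows) / len(rows)

report = {
    "completeness(email)": completeness(rows, "email"),
    "uniqueness(id)": uniqueness(rows, "id"),
    # Assumed rule for illustration: a valid email contains "@".
    "validity(email)": validity(rows, "email", lambda v: "@" in v),
    "timeliness(updated_at)": timeliness(
        rows, "updated_at", datetime(2024, 1, 3), timedelta(days=7)
    ),
}

for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Accuracy and consistency are left out because they usually need a reference source or a second dataset to compare against.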

## Integrity
* Domain
* User-Defined

## Data Contracts
A contract between the producers and the consumers of data.

Four principles
* interface
* governance

[Open Data Contract Standard](https://github.com/bitol-io/open-data-contract-standard)
* e.g. written in YAML
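
To make the idea of a producer/consumer contract concrete, here is a rough sketch of checking records against a contract. The `contract` structure below is a hypothetical, simplified stand-in and is not the Open Data Contract Standard schema; the `violations` helper and the `orders` dataset are also assumptions for illustration.

```python
# Hypothetical, simplified contract; NOT the Open Data Contract Standard schema.
contract = {
    "dataset": "orders",
    "fields": {
        "order_id": {"type": int, "required": True},
        "amount": {"type": float, "required": True},
        "note": {"type": str, "required": False},
    },
}

def violations(record, contract):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for name, spec in contract["fields"].items():
        value = record.get(name)
        if value is None:
            # Missing values are only a problem for required fields.
            if spec["required"]:
                problems.append(f"{name}: missing required field")
            continue
        if not isinstance(value, spec["type"]):
            problems.append(
                f"{name}: expected {spec['type'].__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields the producer sends but the contract does not declare.
    for name in record:
        if name not in contract["fields"]:
            problems.append(f"{name}: field not declared in contract")
    return problems

good = {"order_id": 1, "amount": 9.5}
bad = {"order_id": "1", "extra": True}

print(violations(good, contract))
print(violations(bad, contract))
```

In practice the contract would live in a shared file (for example YAML) that both the producing and the consuming team review, so a change to it is visible to both sides.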

Service-level agreement (SLA)

## Data Testing
Not only test your code, but also your data.

Challenge: Test Case
* Testing
* Code (unit test)
* Data
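
The difference between testing code and testing data can be sketched as follows. The function `normalize_country`, the `country` column, and the two-letter rule are hypothetical examples, not something shown in the talk.

```python
def normalize_country(code):
    """Transformation under test: trim and upper-case a country code."""
    return code.strip().upper()

def test_code():
    # Unit test: checks the logic on a fixed, hand-written input.
    assert normalize_country(" tw ") == "TW"

def check_data(rows):
    # Data test: checks properties of the actual values in a batch.
    failures = []
    for i, row in enumerate(rows):
        if row.get("country") is None:
            failures.append((i, "country is null"))
        elif len(row["country"]) != 2:
            failures.append((i, "country is not a 2-letter code"))
    return failures

# Hypothetical batch: the code is correct, but two rows still carry bad data.
batch = [
    {"country": normalize_country(" tw ")},
    {"country": None},
    {"country": "Taiwan"},
]

test_code()                   # passes: the transformation logic is fine
failures = check_data(batch)  # still reports problems in the data itself
print(failures)
```

The unit test passes while the data test fails, which is the point of the slide: correct code does not guarantee correct data.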

Data Lifecycle

## Data Validation Framework
Great Expectations (2018)
* Python-based open-source tool
* Huge community support

dbt (data build tool)
* Software-based data tool
* General-purpose hub for dbt

Elementary

## How to choose framework
Do less, so the framework is easier to adopt.

## Q&A
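1. Is this topic considered part of DataOps?
Yes.
2. Do you recommend using dbt and Great Expectations together?
As mentioned in the talk, it depends on your scenario.
3. In your experience, does running data quality tests take a lot of time?
4. How do the tools just mentioned detect missing data?
5. You said different scenarios call for different data testing frameworks. Could you share a few example scenarios?
6. As I understand it, a data test means running data checks directly in the prod environment. Is that right?
7. Are there tools for checking the quality of existing data, or should we just set up dbt directly?
8. A real-world question: once "missing data" has been defined, do clients keep changing the definition in practice?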