Challenges in Data Cleaning and Transformation: Mistakes, Confusion, and Solutions - Iris Chen

Welcome to the PyCon TW 2023 collaborative notes

Collaborative notes entry: https://hackmd.io/@pycontw/2023

slides

Introduction

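Task: optimize a video recommendation system
Background: the company builds recommendations for OTT platforms

There are two directions for optimization

Method 1: a stronger model -> get closer to the upper bound of model prediction
Method 2: better data quality -> raise the upper bound of model prediction

The current data quality turned out to be the bigger problem, so the work started from Method 2: improving data quality

What needs cleaning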

Data Cleaning Method

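Text pruning: remove redundant words and similar noise
Name normalization: titles and names from multiple language regions appear in different languages, so they need to be normalized
Data enrichment: bring in external data to mitigate missing values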

Data Cleaning Process

Each cleaning step has several versions of its processing method
- so there are many possible paths through the cleaning process
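
The number of paths grows multiplicatively with the versions tried at each step. A minimal sketch, assuming two versions per step (the version labels below are hypothetical and not from the talk):

```python
from itertools import product

# Hypothetical versions of each cleaning step; the labels are illustrative only.
STEP_VERSIONS = {
    "text_pruning": ["v1", "v2"],
    "name_normalization": ["v1", "v2"],
    "data_enrichment": ["v1", "v2"],
}


def enumerate_paths(step_versions):
    """Return every combination of one version per step, in step order."""
    steps = list(step_versions)
    return [
        dict(zip(steps, combo))
        for combo in product(*(step_versions[s] for s in steps))
    ]


paths = enumerate_paths(STEP_VERSIONS)
print(len(paths))  # 2 * 2 * 2 combinations
```

Even with only two versions per step, three steps already give eight paths to track, which is why experiment logging matters here.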

Text Pruning

Example: 玩具總動員 (Toy Story)
Name -> Reason for Messiness

玩具總動員 -> extra spaces
玩☆具☆總☆動☆員 -> weird punctuation
(熱門首播)玩具總動員 -> redundant words ("熱門首播" means "hot premiere")

Text pruning alone took many experiments, so there are many versions of the code and data
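
The three kinds of messiness above can be handled by a small function. This is only a sketch under assumptions: the function name `prune_title`, the set of decorative symbols, and the rule that any leading parenthesised tag is promotional are all illustrative choices, not the speaker's implementation.

```python
import re

# Matches a leading parenthesised tag, with ASCII or full-width parentheses.
# Assumption: a tag at the very start of a title is promotional text.
LEADING_TAG = re.compile(r"^\s*[(（][^)）]*[)）]\s*")
# Decorative symbols to drop; this set is an illustrative guess.
DECORATIONS = re.compile(r"[☆★]")


def prune_title(raw: str) -> str:
    """Remove a leading tag, decorative symbols, and surplus whitespace."""
    text = LEADING_TAG.sub("", raw)
    text = DECORATIONS.sub("", text)
    # str.split() with no argument also splits on the full-width space U+3000.
    return " ".join(text.split())


for messy in ["  玩具總動員 ", "玩☆具☆總☆動☆員", "(熱門首播)玩具總動員"]:
    print(prune_title(messy))
```

Each rule is deliberately narrow. A real title that legitimately starts with parentheses would be damaged by the first rule, which is the kind of trade-off that leads to many experimental versions.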

Name Normalization

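The same name is written differently across languages and countries
- e.g. 石原里美, 石原聰美 (two Chinese renderings of Satomi Ishihara)
Names may include nicknames
- e.g. 李奧納多(Leo)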
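
One simple way to handle both cases is to strip a trailing parenthesised nickname and then look the name up in an alias table. The sketch below assumes a hand-maintained table; `ALIASES`, `normalize_name`, and the choice of which spelling is canonical are hypothetical.

```python
import re

# Hypothetical alias table mapping variant spellings to one canonical form.
ALIASES = {
    "石原聰美": "石原里美",
}

# Matches a trailing parenthesised nickname, ASCII or full-width parentheses.
TRAILING_NICKNAME = re.compile(r"\s*[(（][^)）]*[)）]\s*$")


def normalize_name(raw: str) -> str:
    """Drop a trailing nickname, then map known variants to a canonical name."""
    base = TRAILING_NICKNAME.sub("", raw.strip())
    return ALIASES.get(base, base)


print(normalize_name("李奧納多(Leo)"))
print(normalize_name("石原聰美"))
```

A lookup table is easy to audit but has to be maintained by hand, so it only covers variants someone has already noticed.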

Data Enrichment

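Many fields are empty

Using external data
- can make the information more complete
- may introduce new columns
Question: which data source to use?
- IMDB or Filmarks?
Question: replace or add?
- the formats differ
- add only some columns -> which columns?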
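
The "replace or add" question can be made concrete with a small merge function. This sketch takes the "add" side: it fills only missing values and copies new columns only when they are explicitly listed. The function name `enrich`, the column names, and the sample records are assumptions for illustration.

```python
def enrich(record, external, add_columns=()):
    """Fill empty fields from an external record without overwriting data.

    Columns that exist only in the external source are copied only if
    they are named in add_columns.
    """
    enriched = dict(record)
    for key, value in external.items():
        if key in enriched:
            # Treat None and the empty string as missing (an assumption).
            if enriched[key] in (None, ""):
                enriched[key] = value
        elif key in add_columns:
            enriched[key] = value
    return enriched


# Illustrative records; the external one stands in for a source such as IMDB.
internal = {"title": "玩具總動員", "year": None, "director": ""}
external = {"title": "Toy Story", "year": 1995,
            "director": "John Lasseter", "runtime_min": 81}

print(enrich(internal, external, add_columns=("runtime_min",)))
```

Here the existing title is kept, the empty `year` and `director` are filled, and `runtime_min` is added as a new column. Switching to a "replace" policy would mean dropping the missing-value check.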

Summary

Source Selection
Method Combination
Detailed Step Experiments

Challenges

Challenge 1. Effective Data Quality Monitoring Strategies

MLflow is used to record the results of each experiment

Metrics to monitor:

Row count
Duplicate Rate
Distinct Rate
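
The notes do not define these metrics, so the sketch below picks one plausible set of definitions as an assumption: distinct rate is the number of distinct values divided by the row count, and duplicate rate is the share of rows whose value occurs more than once. Tools may define them differently.

```python
from collections import Counter


def quality_metrics(values):
    """Compute row count, duplicate rate, and distinct rate for one column."""
    row_count = len(values)
    if row_count == 0:
        return {"row_count": 0, "duplicate_rate": 0.0, "distinct_rate": 0.0}
    counts = Counter(values)
    # Rows whose value appears more than once in the column.
    duplicated_rows = sum(c for c in counts.values() if c > 1)
    return {
        "row_count": row_count,
        "duplicate_rate": duplicated_rows / row_count,
        "distinct_rate": len(counts) / row_count,
    }


print(quality_metrics(["a", "a", "b", "c"]))
```

If MLflow is the tracking tool, a dictionary like this could be passed to `mlflow.log_metrics`, which accepts a mapping of metric names to numeric values.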

Data quality tools:

data-diff
Pandas Profiling
(Speaker's top pick) PipeRider: ensures data correctness during development

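Challenge 2. Data Pipeline Management

At first, every step was written in a single file

Problems:
- the file became bloated and hard to manage
- during testing it was hard to tell which factor was affecting the recommendation results

Tools:

Airflow
Dagster
Prefect
(Speaker's top pick) DBT
- simple unit tests can be set up to check that columns look right
- Lineage Graph (similar to Airflow DAGs)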
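
DBT's built-in generic tests include `not_null` and `unique`, which are declared in the project configuration rather than written in Python. To show the idea behind such column checks, here is a plain-Python sketch; the function names and sample rows are hypothetical, and treating empty strings as missing is a choice made for this example.

```python
def check_not_null(rows, column):
    """Return the indexes of rows where the column is missing or empty."""
    return [i for i, row in enumerate(rows) if row.get(column) in (None, "")]


def check_unique(rows, column):
    """Return the values that appear more than once, in first-seen order."""
    seen = set()
    duplicates = []
    for row in rows:
        value = row.get(column)
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


rows = [{"title": "A"}, {"title": ""}, {"title": "A"}]
print(check_not_null(rows, "title"))
print(check_unique(rows, "title"))
```

An empty result from either check means the column passes; anything else points directly at the offending rows or values.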

Using DBT to standardize the ETL process

Each method (text pruning, name normalization, etc.) is independent
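
Keeping each method independent makes it possible to switch one step on or off and see its effect in isolation. The sketch below shows that idea in plain Python rather than in DBT models; the step functions and `run_pipeline` are hypothetical names.

```python
def collapse_spaces(title: str) -> str:
    """Trim the title and collapse runs of whitespace."""
    return " ".join(title.split())


def drop_stars(title: str) -> str:
    """Remove one kind of decorative symbol."""
    return title.replace("☆", "")


def run_pipeline(title, steps):
    """Apply each independent step in order."""
    for step in steps:
        title = step(title)
    return title


raw = " 玩☆具☆總☆動☆員 "
print(run_pipeline(raw, [collapse_spaces]))
print(run_pipeline(raw, [collapse_spaces, drop_stars]))
```

Comparing the two runs shows exactly what `drop_stars` contributes, which is hard to do when all steps live in one file.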

Lineage Graph (one of the DBT features the speaker recommends)

Generates a diagram from the execution order of the models
Diagram: (image not available)
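
A lineage graph is a dependency graph, and an execution order can be derived from it with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The model names and their dependencies are hypothetical, and the actual order in the speaker's project may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical lineage: each model maps to the set of models it depends on.
LINEAGE = {
    "text_pruning": {"raw_titles"},
    "name_normalization": {"text_pruning"},
    "data_enrichment": {"name_normalization"},
}

# static_order() yields every node after all of its dependencies.
execution_order = list(TopologicalSorter(LINEAGE).static_order())
print(execution_order)
```

Because this example is a single chain, only one valid order exists. With branching dependencies there can be several valid orders.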

DBT + PipeRider Compare

Shows the data before and after a change
- e.g. the change in row count
Example: the effect on the data after adding a function that removes redundant words
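
The before/after idea can be sketched as a comparison of two metric snapshots. This is not PipeRider's API; `compare_profiles` and the sample numbers are hypothetical. The output format mirrors the `6(+0)` notation used in the table in these notes.

```python
def compare_profiles(before, after):
    """Format each shared integer metric as 'after(+delta)'."""
    return {
        name: f"{after[name]}({after[name] - before[name]:+d})"
        for name in before
        if name in after
    }


# Illustrative snapshots taken before and after adding a cleaning function.
before = {"rows": 5, "columns": 6}
after = {"rows": 4, "columns": 6}
print(compare_profiles(before, after))
```

A drop in row count after a change is not necessarily wrong, but seeing it flagged prompts a check of whether the removed rows were meant to go.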

Collaboration with InfuseAI

InfuseAI is the developer of PipeRider

Order	Table	Rows	Columns (change)
2	text_pruning	4761	6(+0)
1	name normalization	4761	6(+0)
3	data_enrichment	1000	6(+0)

Conclusion

Effective Data Quality Monitoring Strategies -> PipeRider
Data Pipeline Management -> DBT

https://github.com/KKStream/dbt_imdb

Below are the parts of the slides that the speaker updated or corrected after the talk