How We Build a Recommendation System with Python at Dcard - 陳子元
Welcome to PyCon TW 2023 Collaborative Writing
Collaborative Writing Workplace: https://hackmd.io/@pycontw/2023
On mobile, tap the button above to unfold the agenda list.
Collaborative notes start below
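The recommendation system is written 100% in Python
Using Python in combination with other languages
Dcard: a social networking site in Taiwan
Dcard's recommendation system
Split into two parts -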
Why did we choose Python to build a recommendation system?
- skillset of our team members
- end-to-end data science
- Goal: everyone involved can take part in the system from start to finish
- Related book: Designing Machine Learning Systems
- data collection and preparation
- data exploration and analysis
- feature engineering
- model development
- evaluation and validation
- deployment (who trains the model and who deploys the model)
context
contextual insights
Benefit of More Context
- add a realtime feature
- e.g. member's subscription status
- using PyFlink
- add a batch updated feature
- add a new model
- add a new serving strategy
End-to-end data science leads to
- less communication overhead
- avoiding finger-pointing and blame games (?)
- faster iteration
- greater ownership (note-taker: I think the speaker mistyped "ownership" as "ownship")
challenge of
- Limitations of individuals
- Skillset
How Dcard handles it
Slow Python
History
For a long time, Python being slow was not a problem for Dcard
- Version 1: Batch prediction
  - A slow daily job is fine
  - But the experience is poor for heavy users -> Version 2
- Version 2: Near-realtime prediction
  - Batch prediction cannot satisfy heavy users
  - Publish a task once some criteria are met
    - read N posts
    - subscribe to forums
    - online in the last 20 minutes
  - Cannot immediately respond to a user's action
  - Cannot respond to the latest posts
  - Example: right after a user subscribes, the task is still in the queue, so the change does not show up on the home page immediately
- Version 3: Realtime prediction
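The Version 2 publish criteria above can be sketched as a small predicate. This is only an illustration: the `UserActivity` fields, the threshold of 10 posts, and the way the criteria are combined (recently online AND either of the other two) are assumptions, since the notes only list the criteria.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical values: the notes only say "read N posts" and "online in last 20 minutes".
READ_POSTS_THRESHOLD = 10
ONLINE_WINDOW = timedelta(minutes=20)


@dataclass
class UserActivity:
    """Hypothetical per-user activity since the last prediction task."""
    posts_read: int
    subscribed_new_forum: bool
    last_seen: datetime


def should_publish_task(activity: UserActivity, now: datetime) -> bool:
    """Return True when a near-realtime prediction task should be queued."""
    if now - activity.last_seen > ONLINE_WINDOW:
        return False  # not recently online: leave this user to the batch job
    # Assumption: either remaining criterion is enough to trigger a task.
    return (
        activity.posts_read >= READ_POSTS_THRESHOLD
        or activity.subscribed_new_forum
    )
```

Because the task goes through a queue, a user who has just subscribed still waits for the worker to pick it up, which is the limitation the notes give as the motivation for Version 3.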
How do we improve Python speed?
Don't let speed optimization become a knee-jerk reaction; know which part you want to optimize and how
"If you can't measure it, you can't improve it" - Carnegie
- Profiling & Tracing
- Tracing Result
- 80% IO
- Cache
- Feature store migration
- Preload data in background jobs
- Asynchronous data loading
- 20% Inference
Dcard's time is spent on these three things:
- Feature transform (sklearn)
- sklearn has to keep its models general-purpose -> we rewrote the package ourselves
- Custom implementation: sparse -> np.array
- remove redundant parts according to our usage
- Raw data construction
- model inference
Summary
- Speed is not the biggest problem at the start
- To improve speed, trace first to find where the problem is
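As a minimal sketch of "measure first", the standard library profiler can show where a request spends its time. The talk did not say which tools Dcard uses; `load_features`, `run_inference`, and `handle_request` are hypothetical stand-ins for an IO-bound step and a CPU-bound step.

```python
import cProfile
import io
import pstats
import time


def load_features() -> list[int]:
    """Hypothetical IO-bound step, simulated with a short sleep."""
    time.sleep(0.05)
    return list(range(1000))


def run_inference(features: list[int]) -> int:
    """Hypothetical CPU-bound step."""
    return sum(x * x for x in features)


def handle_request() -> int:
    return run_inference(load_features())


profiler = cProfile.Profile()
profiler.enable()
result = handle_request()
profiler.disable()

# Collect the report in a string so it can be logged or inspected.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)  # top 10 entries by cumulative time
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that dominate the request at the top, which is the kind of breakdown that would reveal an IO-heavy profile like the 80/20 split in the notes.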
How we organize Python code
Some general things
- PEP8
- clean code
- naming
- short functions
- SOLID (a function does only one thing)
- Unit tests
- Linter
- Formatter
- pre-commit
Dependency management
- pipenv
- always pin versions
- Currently trying Rye
- Solves the problem of local Python formats differing between machines (e.g. M1 & Intel)
Training architecture
split stages to
Benefits:
- rerunning experiments is faster
- allocate different resources
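One way split stages can make reruns faster is by reusing the output of stages whose inputs did not change. The sketch below assumes that mechanism; the notes do not say how Dcard implements it, and the stage names `prepare_data` and `train` plus the in-memory cache are hypothetical.

```python
from typing import Any, Callable

# Hypothetical in-memory store of stage outputs, keyed by (stage name, config key).
_stage_outputs: dict[tuple[str, str], Any] = {}
run_counts: dict[str, int] = {}


def run_stage(name: str, config_key: str, fn: Callable[[], Any]) -> Any:
    """Run a stage only if no output exists for this name and config."""
    key = (name, config_key)
    if key not in _stage_outputs:
        run_counts[name] = run_counts.get(name, 0) + 1
        _stage_outputs[key] = fn()
    return _stage_outputs[key]


def experiment(learning_rate: float) -> float:
    # Data preparation does not depend on the learning rate, so it is reused.
    dataset = run_stage("prepare_data", "v1", lambda: [1.0, 2.0, 3.0])
    # Toy "training": only this stage reruns when the learning rate changes.
    return run_stage(
        "train", f"lr={learning_rate}", lambda: learning_rate * sum(dataset)
    )


first = experiment(0.1)
second = experiment(0.5)
print(run_counts)
```

In a real pipeline each stage would more likely persist its output to storage and run as its own job, which is also what would let each stage request different resources.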
Configuration (written as code)
- Write the config as code
- Record feature definitions
- Feature definition
- Why not use something off the shelf? (e.g. YAML, Hydra)
- Writing our own is more flexible
- Supports inheritance
- Can use __post_init__
feature definition
configs alternatives
inherit config
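The config-as-code ideas above (feature definitions recorded in the config, inheritance, `__post_init__`) can be sketched with standard dataclasses. The class names, fields, and feature names are hypothetical, not Dcard's actual config.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Feature:
    """Hypothetical feature definition recorded alongside the config."""
    name: str
    dtype: str = "float"


@dataclass
class BaseConfig:
    model_name: str = "baseline"
    learning_rate: float = 0.1
    features: tuple[Feature, ...] = (
        Feature("post_age_hours"),
        Feature("forum_id", "category"),
    )
    # Derived in __post_init__, so it is not a constructor argument.
    feature_names: list[str] = field(init=False)

    def __post_init__(self) -> None:
        # Validation and derived values run in code, which plain YAML cannot do.
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        self.feature_names = [f.name for f in self.features]


@dataclass
class ExperimentConfig(BaseConfig):
    """An experiment overrides only what differs from the base config."""
    model_name: str = "experiment"
    learning_rate: float = 0.05
    features: tuple[Feature, ...] = BaseConfig.features + (
        Feature("is_subscriber", "bool"),
    )


base = BaseConfig()
exp = ExperimentConfig()
print(exp.feature_names)
```

The subclass inherits `__post_init__`, so the same validation and derived fields apply to every experiment config without being repeated.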
trainer vs logger
using inheritance
- Can be solved with inheritance
- Logger is dedicated to recording
- Trainer inherits from Logger -> a Logger is passed into Trainer
composition or dependency injection
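The two designs can be put side by side in a short sketch. The class and method names are hypothetical; the talk's actual Trainer and Logger interfaces are not in the notes.

```python
class Logger:
    """Dedicated to recording metrics."""

    def __init__(self) -> None:
        self.records: list[tuple[str, float]] = []

    def log(self, name: str, value: float) -> None:
        self.records.append((name, value))


class NullLogger(Logger):
    """Drops everything; handy for quick experiments and unit tests."""

    def log(self, name: str, value: float) -> None:
        pass


class InheritingTrainer(Logger):
    """Inheritance: the trainer *is* a logger, so the two are tied together."""

    def train(self, losses: list[float]) -> None:
        for step, loss in enumerate(losses):
            self.log(f"loss/{step}", loss)


class Trainer:
    """Composition / dependency injection: the trainer *has* a logger."""

    def __init__(self, logger: Logger) -> None:
        self.logger = logger

    def train(self, losses: list[float]) -> None:
        for step, loss in enumerate(losses):
            self.logger.log(f"loss/{step}", loss)


logger = Logger()
Trainer(logger).train([0.9, 0.5])
quiet = NullLogger()
Trainer(quiet).train([0.9, 0.5])
print(logger.records)
```

With injection, swapping the logging backend means passing a different object, while the inheriting version would need a new Trainer subclass for each logger.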
challenge of code review
code review for experimental codes
summary
adhering to coding
takeaway
- end to end data science
- speed is not a problem in early stage
QA
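- Sensational, sexual, and violent content is very attractive to audiences. When introducing the recommendation system, did you consider whether it would over-recommend such posts? Has the team observed whether the system mostly recommends posts that provoke negative emotions?
- If MLEs handle the whole end-to-end flow, do DEs at Dcard only take part in the very first step, collecting data into BigQuery?
  - Having DEs spend their time building these systems is more worthwhile than asking them to write queries
- If you cannot find enough people who can do the full E2E flow, is Dcard's current strategy to have some parts handled by component teams?
- Which cloud platform does Dcard mainly use (AWS, GCP, Azure)?
- Are there any profiling and tracing tools you recommend?
- Is Dcard not yet using GPUs to train models internally? If so, is that because internal experiments reached a conclusion different from the SOTA one (that deep models can outperform traditional linear or FM models)?
  - Home page recommendation uses an NN
  - Dcard's data is more like tabular data, and many papers note that NNs do not necessarily have a particular advantage on this kind of data
- If you don't want to maintain the ordering yourself, perhaps you could use a NumPy structured array
- To speed up inference, have you tried other inference engines such as TensorRT?
- How does Dcard validate the effectiveness of the recommendation system?
- Slides
- DDD or TDD?
- How should one learn DS?
- Have you considered Docker?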
Below are the slide updates or corrections the speaker made after the talk