How We Build a Recommendation System with Python at Dcard - 陳子元

---
title: "How We Build a Recommendation System with Python at Dcard - 陳子元"
tags: PyConTW2023, 2023-organize, 2023-共筆
---

# How We Build a Recommendation System with Python at Dcard - 陳子元

{%hackmd H6-2BguNT8iE7ZUrnoG1Tg %}

<iframe src=https://app.sli.do/event/qDmLwJDxPbvDLgmAWKZD86 height=450 width=100%></iframe>

> Collaborative writing starts below

The recommendation system is written 100% in Python.

Python is used in combination with other languages.

Dcard is a Taiwanese social networking site.

### Dcard's recommendation system has two parts

+ Home feed recommendation ⬅️ the topic of this talk
+ Related post recommendation

## Why did we choose Python to build a recommendation system?

- Skillset of our team members
- End-to-end data science
    - Goal: everyone involved can take part in the system from start to finish
    - Related book: [Designing Machine Learning Systems](https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/)
    - data collection and preparation
    - data exploration and analysis
    - feature engineering
    - model development
    - evaluation and validation
    - **deployment**
        * who trains the model and who deploys the model
- Context: contextual insights

### Benefit of More Context

+ add a realtime feature
    + e.g. member's subscription status
    + using PyFlink
+ add a batch updated feature
    + using Airflow
+ add a new model
    + using PyTorch
+ add a new serving strategy
    + using FastAPI

End-to-end data science leads to

- less communication overhead
- avoiding finger-pointing and blame games (?)
- faster iteration
- greater ownership (note-taker: the slide read "ownship", which I think is the speaker's typo for "ownership")

Challenges:

- Limitations of individuals
- Skillset

## How Dcard deals with it

### Slow Python

#### History

For a long time, Python being slow was not a problem for Dcard.

- Version 1: Batch prediction
    - A slow daily job is fine
    - But the experience is poor for heavy users -> Version 2
- Version 2: Near-realtime prediction
    - Batch prediction cannot satisfy heavy users
    - Publish a task after some criteria are met
        - read N posts
        - subscribe forums
        - online in last 20 minutes
    - Cannot immediately respond to a user's actions
    - Cannot respond to the latest posts
        - Example: right after subscribing, the task is still in the queue, so the change does not show up on the home feed right away
- Version 3: Realtime prediction

#### How do we improve Python speed?
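The notes in this section stress measuring before optimizing. As a minimal sketch of that idea, the example below profiles a toy request path with the standard library's `cProfile`. The talk does not name the tool Dcard used, so `cProfile` is an assumption, and `load_features`, `predict`, and `recommend` are hypothetical stand-ins rather than Dcard code.

```python
import cProfile
import io
import pstats
import time


def load_features(user_id: int) -> dict:
    """Hypothetical stand-in for an I/O-bound feature lookup."""
    time.sleep(0.02)  # simulate waiting on a feature store
    return {"user_id": user_id, "read_count": 12}


def predict(features: dict) -> float:
    """Hypothetical stand-in for CPU-bound model inference."""
    return sum(i * 1e-9 for i in range(20_000)) + features["read_count"]


def recommend(user_id: int) -> float:
    """Toy request path: fetch features, then run inference."""
    return predict(load_features(user_id))


def profile_report(user_id: int = 1) -> str:
    """Profile one call to recommend() and return the stats as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    recommend(user_id)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the slowest call paths are listed first.
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report())
```

Reading the cumulative-time column shows whether the time goes to waiting on data or to computation, which is the kind of split the tracing result in these notes reports.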
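Ways to optimize Python speed

> Don't let speed optimization become a knee-jerk reflex. Know which part you want to optimize and how.
> "If you can't measure it, you can't improve it." (Carnegie)

- Profiling & Tracing
- Tracing result
    - 80% IO
        - Cache
        - Feature store migration
        - Preload data in background jobs
        - Asynchronous data loading
    - 20% Inference

Dcard's time is spent on these three things:

1. Feature transform (sklearn)
    * sklearn has to keep its models general -> rewrote the package
        - Custom with sparse -> np.array
        - remove redundant parts according to our usage
2. Raw data construction
    * List[dict] -> DataFrame
3. Model inference

## Summary

+ At the start, speed is not the biggest problem
+ To improve speed, do tracing first to find where the problem is

## How we organize Python code

**Some general things**

* PEP8
* Clean code
    - naming
    - short functions
    - SOLID (one function does only one thing)
* Unit tests
* Linter
* Formatter
* pre-commit

**Dependency management**

+ pipenv
+ always pin versions
+ Currently trying Rye
    + Solves the problem of local Python setups differing between machines (e.g. M1 & Intel)

### Training architecture

#### Split stages

Benefits:

- rerunning experiments is faster
- allocate different resources

#### Configuration (written as code)

- Write the config as code
    - Solves the problem of not being able to find the config used for a training run afterwards
- Record feature definition
- Feature definition
- Why not use something off the shelf (e.g. YAML, Hydra)?
    - Writing our own is more flexible
    - Supports inheritance
    - Can use `__post_init__`

### Feature definition

- scikit-learn pipeline

### Config alternatives

### Inherit config

### Build pipeline & transformer

### Trainer vs logger

### Using inheritance

- Can be solved with inheritance
- Logger is dedicated to recording
- Trainer inherits from Logger -> Trainer is passed into Logger

### Composition or dependency injection

### Challenge of code review

### Code review for experimental code

### Summary: adhering to coding

### Takeaway

- end-to-end data science
- speed is not a problem in the early stage

## QA

+ Sensational, sexual, and violent content is very attractive to audiences. When introducing the recommendation system, did you consider whether it would over-recommend such posts? Has the team observed whether the system mostly recommends posts that provoke negative emotions?
    + It is indeed attractive
    + An operations team is checking whether this has an impact
+ If MLEs do the whole end-to-end flow, do DEs at Dcard only take part in the first step, collecting data into BigQuery?
    + DEs spend their time building these systems, which is more worthwhile than having them write queries
+ If you cannot find enough people who can do the whole E2E flow, is Dcard's current strategy to partly provide this and partly use component teams?
+ Which cloud platform does Dcard mainly use (AWS, GCP, Azure)?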
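    + GCP
+ Are there any profiling and tracing tools you recommend?
    + Something starting with "Open t" (the note-taker did not catch the full name)
+ Is Dcard not yet using GPUs to train models at this stage? If so, is that because internal experiments reached a different conclusion from SOTA results (that deep models can outperform traditional linear or FM models)?
    + Home feed recommendation uses an NN
    + Dcard's data is closer to tabular data, and many papers note that NNs do not necessarily have a particular advantage on this kind of data
+ If you do not want to maintain the ordering yourself, you could perhaps use numpy's [structured array](https://numpy.org/doc/stable/user/basics.rec.html)
+ To speed up inference, have you tried other inference engines such as TensorRT?
+ How does Dcard validate the effectiveness of the recommendation system?
    + User satisfaction is hard to measure
+ Slide
    + https://drive.google.com/file/d/1TRHSuSLWHQNOj8E8RuZ58RiTnNj1sXwv/view
+ DDD or TDD?
    + TDD: yes
    + DDD: not yet, under evaluation
+ How to learn DS?
+ Have you considered Docker?
    + No

Below is the part that the speaker updated after the talk/tutorial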