Scaling The E-Commerce Recommendations System - 黃耀慶(Arthur Huang)

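# Scaling The E-Commerce Recommendations System - 黃耀慶(Arthur Huang)

{%hackmd @HWDC/BJOE4qInR %}

>#### 》[Session introduction](https://hwdc.ithome.com.tw/2024/session-page/3311)
>#### 》[Fill out the session satisfaction survey | Send feedback to the hard-working speaker](https://forms.gle/yPVK5UJY2rVs7RVE7)

# Agenda
- Challenges in LINE Shopping
- Multi-stage recommender
    - Retrieval
    - Ranking
    - Re-Rank
....

# Challenges in LINE SHOPPING
- Complex Scenario
    - More than 20 types of recommendations
    - Shopping Recommendation Scenario
        - Charts
        - Point
        - Search Keyword
        - Shop
- Huge Item
    - Millions of products
    - How to quickly find the products a customer is interested in

# Multi-Stage Recommender
- 5 STEPS
    - Item Corpus (millions)
    - Retrieval (hundreds)
        - Quickly find the products the user is interested in
    - Ranking (dozens)
        - Rank based on user behavior
    - Re-rank (dozens)
        - Freshness issues
        - Ordering
    - Recommended Items

# Retrieval
**Two-Tower Model**
- Flow: User/Item Feature -> **Model Tower** -> **Dot Product** -> Product
- **Model Tower**: NNs for learning embeddings (to obtain the User/Item embeddings)
    - User Tower
    - Item Tower
- **Dot Product**:
    - Get similarity between user and item embeddings
    - Limitation: cannot use User/Item features
- Recommendation funnel
    - Interested: items the user clicked, 0.0001%
    - Not interested: random sample of the items that were not clicked (almost the entire catalog)
- Learning User-Item Embeddings
    - Target

# In-Batch Negative Sampling

# Feature Engineering
![IMG_4898](https://hackmd.io/_uploads/r1q8iX-a0.jpg)
- **Numeric Feature**
    - Normalization
    - Power Transform: transform toward a normal distribution
    - Wilson Score Interval (e.g. CTR): reduces bias from special cases in the business logic (e.g. impressions)
- **Categorical Feature**
    - One-Hot
    - Label Encoding + Embedding Layer (e.g. User ID, Item ID): addresses the sparse-matrix problem
    - Feature Hashing: hash collisions can occur, so there is a trade-off
    - Ordinal Encoding
        - S, M, L, XL, XXL -> 0, 1, 2, 3, 4
    - Frequency Encoding
- **Text Feature**
    - BERT Encoding: to obtain embeddings

# Feature Engineering - Embedding Layer
- Parameter Size = num_embeddings x embedding_dim
- Shared Embedding: sharing one embedding table can save roughly half the size
    - Reduce Param Size

# Retrieval - Inference
- Once the Item/User embeddings are available, use approximate nearest-neighbor (ANN) vector search to quickly find suitable products to recommend
- ANN index
    - Item (Offline)
        - Behavior rarely changes, so embeddings can be updated offline in batches
    - User (Online)
        - Real-time, because user behavior changes often, e.g. this time the user wants a mouse, next time XXX
- Item2Item ("You May Also Like"): take the products the user likes and compare their similarity with other products

# Ranking: Ranking based on user behavior in the module
The two-tower model cannot use user/item features, so a ranking stage is added on top of it.
- Deep Ranking Network
    - Learning the probability of a click event
- Target (Focus on Module)
    - Positive: Click
    - Negative: Impression but no click
    - Only whether the item was clicked in "this module" counts
        - e.g. 夯話題 (Hot Topics)
- Think: why can't items that were shown but not clicked be used as negatives?
    - Clicked: interested → very interested
    - Not clicked: not necessarily uninterested → interested
    - Never retrieved at all: not interested
- Scaling with PySpark
    - Basic concepts of distributed computing

# Re-rank: Ranking by Diversity, Freshness, Business Logic
- Diversity (variety in recommendations)
    - Do not show items from the same category in a sequence.
- Freshness (keeps recommendations fresh for the user; also a way to surface newly listed products that have no data yet)
    - Promote fresher items
    - e.g. YouTube reserves 5%~10% for new items; this is good for the long term, avoids echo chambers, and lets users explore new interests
- Business Logic (specific holidays / campaign periods)
    - Promotion / Holiday Campaign
        - During a holiday, raise the weight of the "corresponding products".
        - Father's Day => things a father would use + things suitable as gifts
    - Product Profit
    - Add product listing time
- A recommender system is essentially a GMV (gross merchandise value, the site's total transaction amount) amplifier

# Model Training
- Petastorm
    - In the past, when memory ran out, the answer was simply to keep adding more
    - A trade-off is needed: large-scale data processing → Petastorm dataset (Parquet)
    - How to read and shuffle large volumes of data
    - Columnar storage reduces the cost of data access and is especially good at handling sparse matrices

# MLflow
- Used together with data visualization tools

# PyTorch Lightning
- Modularize the workflow so it can be reused (template)

# Airflow
- Used for data pipeline scheduling

==Chat area==
- DS or ML <=5
- Search or Recommendation <=5

Hard to remember... it was all diagrams, so it depends on whether the speaker can share the slides.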
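
As a supplement to the Wilson Score Interval bullet under Feature Engineering: the notes do not say how the speaker computes it, so the following is only a minimal sketch of the standard formula, used here as a smoothed CTR feature. The function name `wilson_lower_bound` and the default `z = 1.96` (about 95% confidence) are assumptions for illustration, not something shown in the talk.

```python
import math


def wilson_lower_bound(clicks: int, impressions: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a click-through rate.

    Hypothetical helper: items with very few impressions get a conservative
    score instead of an extreme raw CTR such as 1/1 = 100%.
    """
    if impressions == 0:
        return 0.0  # no exposure yet, so no evidence of interest
    p = clicks / impressions
    z2 = z * z
    denominator = 1 + z2 / impressions
    centre = p + z2 / (2 * impressions)
    margin = z * math.sqrt(p * (1 - p) / impressions + z2 / (4 * impressions ** 2))
    return (centre - margin) / denominator


if __name__ == "__main__":
    # Same raw CTR (100%), very different amounts of evidence.
    print(wilson_lower_bound(1, 1))
    print(wilson_lower_bound(100, 100))
```

With this kind of feature, an item clicked once in one impression scores well below an item clicked 100 times in 100 impressions, which is one way to reduce the exposure bias the notes mention.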
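
As a supplement to the Two-Tower Model and In-Batch Negative Sampling sections: the notes leave the In-Batch Negative Sampling heading empty, so this is a sketch of one common formulation rather than the speaker's implementation. Each user's clicked item is the positive, and the other items in the same batch act as negatives under a softmax over dot-product scores. The embeddings below are made-up toy vectors standing in for tower outputs, and `in_batch_softmax_loss` is a hypothetical name.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product similarity between a user embedding and an item embedding."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(user_embs: List[Sequence[float]],
                          item_embs: List[Sequence[float]]) -> float:
    """Mean cross-entropy where item i is the positive for user i and every
    other item in the batch is treated as a negative."""
    if len(user_embs) != len(item_embs):
        raise ValueError("expected one positive item per user")
    total = 0.0
    for i, user in enumerate(user_embs):
        logits = [dot(user, item) for item in item_embs]
        # Numerically stable log-sum-exp over the whole batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(user_embs)


if __name__ == "__main__":
    users = [[1.0, 0.0], [0.0, 1.0]]      # toy user-tower outputs
    items = [[1.0, 0.0], [0.0, 1.0]]      # toy item-tower outputs, aligned with users
    print(in_batch_softmax_loss(users, items))
    print(in_batch_softmax_loss(users, items[::-1]))  # mismatched pairs score worse
```

Reusing the batch as the negative pool avoids a separate sampling pass over the catalog, at the cost of over-representing popular items among the negatives.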