Scaling The E-Commerce Recommendations System - 黃耀慶(Arthur Huang)

Welcome to the Hello World Dev Conf shared notes
Shared notes entry: https://hackmd.io/@HWDC/2024

Agenda

Challenges in LINE Shopping
Multi-stage recommender
Retrieval
Ranking
Re-Rank
…

Challenges in LINE SHOPPING

Complex Scenario
- More than 20 types of recommendations
- Shopping Recommendation Scenario
  - Charts
  - Point
  - Search Keyword
  - Shop
Huge Item Corpus
- Millions of products
- How to quickly find the products a customer is interested in

Multi-Stage Recommender

5 STEPS
- Item Corpus (Millions)
- Retrieval (hundreds)
  - Quickly find products the user is interested in
- Ranking (dozens)
  - Sort based on user behavior
- Re-rank (dozens)
  - Freshness concerns
  - Ordering
- Recommended Items
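
The funnel above can be sketched as a chain of stage functions. This is a minimal illustration only: `recommend`, the toy stages, and the cut-off sizes are hypothetical names and values, not the pipeline shown in the talk.

```python
from typing import Callable, List, Sequence


def recommend(
    corpus: Sequence[int],
    retrieve: Callable[[Sequence[int], int], List[int]],
    rank: Callable[[List[int]], List[int]],
    rerank: Callable[[List[int]], List[int]],
    n_retrieve: int = 300,
    n_final: int = 20,
) -> List[int]:
    """Narrow a large corpus in stages: retrieval -> ranking -> re-rank."""
    candidates = retrieve(corpus, n_retrieve)  # millions -> hundreds
    ranked = rank(candidates)                  # order the candidates
    return rerank(ranked)[:n_final]            # adjust order, keep dozens


# Toy stages (placeholders). Real stages would be models or rule sets.
def toy_retrieve(corpus: Sequence[int], n: int) -> List[int]:
    return list(corpus[:n])


def toy_rank(items: List[int]) -> List[int]:
    return sorted(items, reverse=True)


def toy_rerank(items: List[int]) -> List[int]:
    return items


result = recommend(range(1_000_000), toy_retrieve, toy_rank, toy_rerank)
print(len(result))
```

Each stage only sees the output of the previous one, so the expensive models run on progressively smaller candidate sets.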

Retrieval

Two-Tower Model

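Flow:
User/Item Feature -> Model Tower -> Dot Product -> Product
- Model Tower: NNs for learning embeddings (to obtain the User/Item embeddings)
  - User Tower
  - Item Tower
- Dot Product:
  - Get the similarity between the user and item embeddings
Limitation: user/item cross features cannot be used
Recommendation funnel
- Interested: items the user clicked (0.0001%)
- Not interested: items that were not clicked (almost the entire catalog), randomly sampled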
Learning User-Item Embeddings
Target

In-Batch Negative Sampling
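
One common way to write in-batch negative sampling is as a softmax loss over the dot products within a batch. The sketch below assumes that formulation (the notes do not record the exact loss used in the talk). The function names are hypothetical, and plain Python lists stand in for tensors.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(user_embs: List[Vector], item_embs: List[Vector]) -> float:
    """Row i pairs user i with the item that user clicked (item i).

    Every other item in the same batch is treated as a negative for user i.
    """
    total = 0.0
    for i, user in enumerate(user_embs):
        logits = [dot(user, item) for item in item_embs]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]  # -log softmax(logits)[i]
    return total / len(user_embs)


users = [[1.0, 0.0], [0.0, 1.0]]
aligned_items = [[1.0, 0.0], [0.0, 1.0]]   # each user matches its own item
swapped_items = [[0.0, 1.0], [1.0, 0.0]]   # each user matches the other item

good = in_batch_softmax_loss(users, aligned_items)
bad = in_batch_softmax_loss(users, swapped_items)
print(good < bad)
```

The loss is lower when each user embedding scores its own clicked item above the other items in the batch, which is what training pushes toward.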

Feature Engineering

Numeric Feature
- Normalization
- Power Transform: transforms the data toward a normal distribution
- Wilson Score Interval (e.g. CTR): reduces bias caused by special business-logic situations (e.g. impression counts)
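
As an illustration of the Wilson score idea for CTR, here is a sketch of the lower bound of the interval. The choice of the lower bound as the feature and the default `z = 1.96` (about 95% confidence) are assumptions, not details from the talk.

```python
import math


def wilson_lower_bound(clicks: int, impressions: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a click-through rate."""
    if impressions == 0:
        return 0.0
    p = clicks / impressions
    z2 = z * z
    denominator = 1 + z2 / impressions
    centre = p + z2 / (2 * impressions)
    margin = z * math.sqrt(p * (1 - p) / impressions + z2 / (4 * impressions ** 2))
    return (centre - margin) / denominator


# One click out of one impression has a raw CTR of 100%,
# but very little evidence behind it.
few = wilson_lower_bound(1, 1)
many = wilson_lower_bound(100, 200)
print(few < many)
```

An item with a single lucky click no longer outranks an item with a steady 50% CTR over many impressions.
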
Categorical Feature
- One-Hot
- Label Encoding + Embedding Layer (e.g. User ID, Item ID): solves the sparse-matrix problem
- Feature Hashing: hash collisions can occur, so there is a trade-off
- Ordinal Encoding
  - S, M, L, XL, XXL -> 0, 1, 2, 3, 4
- Frequency Encoding
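
Three of the encodings above are small enough to sketch with the standard library. The helper names and bucket count are hypothetical. MD5 is used here only because Python's built-in `hash` is randomized per process for strings.

```python
import hashlib
from collections import Counter
from typing import List


def hash_bucket(value: str, num_buckets: int) -> int:
    """Feature hashing: map any string to one of num_buckets slots.

    Fewer buckets means a smaller table but more collisions.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


# Ordinal encoding: the order of the list carries the meaning.
SIZE_ORDER = ["S", "M", "L", "XL", "XXL"]
ORDINAL = {size: rank for rank, size in enumerate(SIZE_ORDER)}


def frequency_encode(values: List[str]) -> List[float]:
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]


print(hash_bucket("item_123", 16))
print(ORDINAL["XL"])
print(frequency_encode(["a", "a", "b", "a"]))
```
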
Text Feature
- BERT Encoding: obtain the embedding

Feature Engineering

Embedding Layer
- Parameters Size = num_embeddings x embedding_dim
- Shared Embedding: sharing one embedding table can save roughly half the size
  - Reduce Param Size
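
The arithmetic behind the "roughly half" saving, as a sketch. It assumes the same ID feature feeds two places in the model (for example both towers) and would otherwise need two tables. The item count, dimension, and float32 storage are hypothetical numbers.

```python
def embedding_params(num_embeddings: int, embedding_dim: int) -> int:
    """Parameter count of one embedding table."""
    return num_embeddings * embedding_dim


# Hypothetical sizes, chosen only for illustration.
NUM_ITEMS = 1_000_000
EMBEDDING_DIM = 64
BYTES_PER_FLOAT32 = 4

one_table = embedding_params(NUM_ITEMS, EMBEDDING_DIM)
separate = 2 * one_table  # one table per place the ID is used
shared = one_table        # a single table reused in both places

print(f"separate: {separate * BYTES_PER_FLOAT32 / 1e6:.0f} MB")  # separate: 512 MB
print(f"shared:   {shared * BYTES_PER_FLOAT32 / 1e6:.0f} MB")    # shared:   256 MB
```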

Retrieval - Inference

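Obtain the Item/User embeddings, then use approximate vector search to quickly find suitable products to recommend
- ANN index
Item (Offline)
- Item behavior rarely changes, so item embeddings can be updated offline in batches
User (Online)
- Real-time, because user behavior changes often, e.g. a mouse this time, XXX next time
Item2Item ("You may also like"): take an item the user likes and compare it with other items by similarity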
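
The lookup itself can be sketched with an exact brute-force search. This is a stand-in for illustration: a real ANN index trades a little accuracy for speed and would replace `top_k` below. The item names and vectors are made up.

```python
import heapq
from typing import Dict, Iterable, List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def top_k(
    query: Vector,
    item_index: Dict[str, Vector],
    k: int,
    exclude: Iterable[str] = (),
) -> List[Tuple[str, float]]:
    """Exact nearest neighbours by dot product (an ANN index approximates this)."""
    skip = set(exclude)
    scored = (
        (item_id, dot(query, emb))
        for item_id, emb in item_index.items()
        if item_id not in skip
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Item embeddings: computed offline in batch (toy values).
item_index = {
    "mouse": [0.9, 0.1],
    "mouse_pad": [0.8, 0.2],
    "kettle": [0.1, 0.9],
}

# User embedding: computed online from recent behavior (toy value).
user_emb = [1.0, 0.0]
user_recs = top_k(user_emb, item_index, k=2)

# Item2Item: query with the embedding of an item the user liked.
similar = top_k(item_index["mouse"], item_index, k=1, exclude=["mouse"])
print(user_recs, similar)
```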

Ranking : Ranking based on user behavior in the module

The two-tower model cannot use user/item cross features, so a Ranking stage is added on top of it

Deep Ranking Network
- Learning the probability of click event
- Target(Focus on Module)
  - Positive : Click
  - Negative : Impression but no click
- What matters here is only whether the item was clicked in this particular module
- e.g. the 夯話題 ("Hot Topics") module
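
A sketch of how the per-module target could be built from logs, together with the log loss that a click-probability model typically minimizes. The log schema, the module names, and the use of binary cross-entropy are assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple

LogRow = Dict[str, object]


def build_ranking_examples(
    logs: List[LogRow], module: str
) -> List[Tuple[Tuple[str, str], int]]:
    """Keep only impressions from one module. Click -> 1, no click -> 0."""
    return [
        ((str(row["user_id"]), str(row["item_id"])), 1 if row["clicked"] else 0)
        for row in logs
        if row["module"] == module
    ]


def binary_cross_entropy(label: int, predicted: float) -> float:
    """Log loss for one example, with predicted in the open interval (0, 1)."""
    return -(label * math.log(predicted) + (1 - label) * math.log(1 - predicted))


logs = [
    {"user_id": "u1", "item_id": "i1", "module": "hot_topics", "clicked": True},
    {"user_id": "u1", "item_id": "i2", "module": "hot_topics", "clicked": False},
    {"user_id": "u1", "item_id": "i3", "module": "charts", "clicked": True},
]

examples = build_ranking_examples(logs, "hot_topics")
print(examples)
```

The click on `i3` is dropped because it happened in a different module, which matches the "focus on module" target.
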
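Question to consider: why can't items that were shown but not clicked be used as negatives (for the Retrieval model)?
- Clicked: interested → very interested
- Not clicked: not necessarily uninterested → still interested
- Never retrieved at all: not interested
Scaling with PySpark
- Basic concepts of distributed computing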

Re-rank: Ranking by Diversity, Freshness, Business Logic

Diversity (variety in recommendations)
- Do not show items from the same category in a sequence.
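
One simple way to apply this rule is a greedy pass over the ranked list. This is a sketch of one possible heuristic, not necessarily the one used in the talk. `diversify` and the sample categories are hypothetical.

```python
from typing import List, Optional, Tuple

RankedItem = Tuple[str, str]  # (item_id, category), already in ranking order


def diversify(ranked: List[RankedItem]) -> List[RankedItem]:
    """Greedy re-rank: take the best remaining item whose category differs
    from the previously placed one. Fall back to the best remaining item
    when every candidate shares that category."""
    remaining = list(ranked)
    result: List[RankedItem] = []
    while remaining:
        last_category: Optional[str] = result[-1][1] if result else None
        index = next(
            (i for i, (_, category) in enumerate(remaining) if category != last_category),
            0,
        )
        result.append(remaining.pop(index))
    return result


ranked = [("a", "shoes"), ("b", "shoes"), ("c", "bags"), ("d", "shoes")]
print([item_id for item_id, _ in diversify(ranked)])  # ['a', 'c', 'b', 'd']
```
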
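Freshness (keeps recommendations feeling fresh to the user; also a way to promote newly listed products that have no data yet)
- Promote fresher items
- e.g. YouTube: 5%~10% for new items. This is good in the long term: it avoids echo chambers and lets users explore new interests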
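
Reserving a share of slots for new items can be sketched as below. The placement of fresh items at the end of the list and the 10% default are assumptions. `mix_in_fresh` is a hypothetical helper.

```python
from typing import List


def mix_in_fresh(
    ranked: List[str],
    fresh: List[str],
    n_slots: int,
    fresh_percent: int = 10,
) -> List[str]:
    """Fill n_slots, reserving fresh_percent of them for newly listed items."""
    n_fresh = min(len(fresh), n_slots * fresh_percent // 100)
    fresh_set = set(fresh)
    # Keep the top-ranked items that are not already in the fresh pool.
    head = [item for item in ranked if item not in fresh_set][: n_slots - n_fresh]
    return head + fresh[:n_fresh]


ranked_items = [f"r{i}" for i in range(20)]
new_items = ["n0", "n1", "n2"]
mixed = mix_in_fresh(ranked_items, new_items, n_slots=20)
print(mixed[-2:])  # ['n0', 'n1']
```
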
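Business Logic (specific holidays / campaign periods)
- Promotion / Holiday Campaign
  - During a holiday, raise the weight of the corresponding products.
  - Father's Day => things fathers would use + things suitable as gifts
- Product Profit
- Add the product listing time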
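
A holiday boost can be as simple as multiplying the ranking score by a per-category weight. The sketch below assumes that multiplicative form. The categories, scores, and the 1.5 multiplier are invented for illustration.

```python
from typing import Dict


def apply_business_boost(
    scores: Dict[str, float],
    categories: Dict[str, str],
    boosts: Dict[str, float],
) -> Dict[str, float]:
    """Multiply each item's score by the boost of its category (default 1.0)."""
    return {
        item_id: score * boosts.get(categories[item_id], 1.0)
        for item_id, score in scores.items()
    }


scores = {"razor": 0.4, "lipstick": 0.5}
categories = {"razor": "shaving", "lipstick": "makeup"}
fathers_day_boosts = {"shaving": 1.5}  # hypothetical campaign weights

boosted = apply_business_boost(scores, categories, fathers_day_boosts)
print(max(boosted, key=boosted.get))  # razor
```
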
A recommender system is essentially a GMV (gross merchandise value, the site's total transaction amount) amplifier

Model Training - Petastorm

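In the past, when memory ran short, the answer was simply to keep adding more
- A trade-off is needed: large-scale data processing → Petastorm dataset (Parquet)
- How to read large volumes of data and shuffle them
- Columnar storage lowers the cost of data access and is especially good at handling sparse matrices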
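
Shuffling data that does not fit in memory is usually done with a bounded buffer over a stream of rows. The sketch below shows that general idea in plain Python. It is not the Petastorm API, and `shuffle_buffer` is a hypothetical name.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_buffer(rows: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size rows."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= buffer_size:
            # Swap a random row to the end and emit it.
            index = rng.randrange(len(buffer))
            buffer[index], buffer[-1] = buffer[-1], buffer[index]
            yield buffer.pop()
    # Drain whatever is left once the stream ends.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_buffer(range(100), buffer_size=10))
print(len(shuffled))
```

A larger buffer gives a shuffle closer to uniform at the cost of more memory, which is the trade-off mentioned above.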

MLflow

Used together with data visualization tools

PyTorch Lightning

Modularize the workflow so that it can be reused (template).

Airflow

Used for scheduling and managing data jobs

Chat area

DS or ML <=5
Search or Recommendation <=5
Hard to take notes… it's all diagrams
Let's see whether the speaker can provide the slides.
