
# Enhanced EC Recommendations: Trustworthy Validation with Large Language Models for Two-Tower Model - 陳峻廷(Dan Chen)

{%hackmd @HWDC/BJOE4qInR %}

>#### 》[Session introduction](https://hwdc.ithome.com.tw/2024/session-page/3315)
>#### 》[Fill in the session satisfaction survey | Send feedback to the speaker](https://forms.gle/nRK1NgFqS1s2qmg29)

---

[TOC]

---

# 01-What is Trustworthy

## Trustworthy
> Elements of trustworthiness

* Robustness
    - Fault-tolerant
* Fairness
    - Remove bias
    - [Coupang (Korea) case: fined for manually tuning parameters](https://www.bnext.com.tw/article/79446/coupang-korea-accusation)
* Transparency/Explainability
    - Reduce risks

## Trustworthy Recommendation
> Four perspectives

- Flow: Data Preparation -> Data Representation -> Recommendation Generation -> **Performance Evaluation**

---

# 02-Evaluation Framework
> How to Correctly Evaluate AI

## [Brickmaster](https://medium.com/twdsmeetup/brickmaster-scenario-wise-recommendation-system-engine-validation-ba3f663c930a)
> Two-stage recommendation system

- Scalable
    - Pain point: training is fast, but inference is slow
- Trustworthy
- Scenario-wise
- KPI-Oriented = Ranking
    - Each scenario has a different KPI; [CTR](https://www.wordstream.com/click-through-rate) is the most common
    - The LINE TODAY scenario uses retention, redirect (jump) rate, and similar metrics

## Evaluation Framework(1/2)

| ![FigureA](https://hackmd.io/_uploads/ry7_-UZaR.png) |
|:--:|
|*Figure A: A conceptual framework for building TRSs*|

**Stage-4** places particular emphasis on Technical Evaluation

[Ref-Trustworthy Recommender Systems: An Overview](https://arxiv.org/pdf/2208.06265)

| ![FigureB](https://hackmd.io/_uploads/HyjeG8bTR.jpg) |
|:--:|
|*Figure B - Evaluation Framework(1/2)*|

---

# 03-Offline & Online Evaluation

| ![FigureC](https://hackmd.io/_uploads/r1AzGL-6C.jpg) |
|:--:|
|*Figure C - Offline Evaluation*|

**Business**: Validating against metrics is important, but you also need to judge whether a metric is a lagging indicator or a leading indicator.
- Quick-validation metrics: CTR, CVR
- Lagging indicators: may need time to accumulate data, which makes iterative evaluation difficult

| ![FigureD](https://hackmd.io/_uploads/HkANX8baA.jpg) |
|:--:|
|*Figure D - Online Evaluation*|

1. **Setting Goal**
2. **Setting Metrics**: AA test → make sure the metrics are meaningful
3. **Decide Minimum Experimental Unit** (usually the ID)
4. **Estimate Sampling Size**, related to:
    - alpha
    - power
    - variance
    - min diff (e.g. CTR: +2%)
5. **Random Grouping** (50%, 50%)
    - Make sure the A and B groups are independent.
    - If traffic is not held fixed or is not allocated independently, the experiment may be invalid.

## A/B test
> Key points: show how your algorithms can contribute to your business

* If the experiment isn't significant
    * The recommender system itself has a problem
    * The direction or method of quantification is wrong
    * PSM
    * Sample ratio mismatch
    * [Simpson’s paradox](https://medium.com/sherry-ai/%E6%95%B8%E6%93%9A%E7%9C%9F%E7%9A%84%E6%9C%83%E8%AA%AA%E8%A9%B1-%E6%B7%BA%E8%AB%87%E8%BE%9B%E6%99%AE%E6%A3%AE%E6%82%96%E8%AB%96-22c067f6a535)
    * Novelty effect

|![image](https://hackmd.io/_uploads/BJ8J8U-6A.png)|
|:--:|
|*Figure E - Case - EC Shop Recommendation*|

---

# 04- LLM on Recommendation

## Feature engineering (Produced by Julia Intern)

* Tokenization (prompt engineering)
    * A temperature of 0.4 extracts product specifications better
* Text embedding generation

(Aside: it is always the interns who change the world)

## Evaluate embedding

- RankMe / α-ReQ metrics

A two-tower model can be built with an LLM.
[Ref01](http://arxiv.org/pdf/2305.07961)
[Ref02](https://arxiv.org/pdf/2308.16505)

## Conclusion & Challenge

- Data Quality
- Multiple-Metrics evaluation
- Conduct A/B test Experiment
- Human Perception

---

# 05-Q&A

---

==Chat area==
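
For the **Estimate Sampling Size** step in the online evaluation section, here is a minimal sketch of how alpha, power, variance, and min diff combine. It assumes a two-sided two-proportion z-test on CTR with equal-sized groups. The function name `sample_size_per_group`, the 10% baseline CTR, and reading "+2%" as 2 absolute percentage points are assumptions for illustration and are not from the talk.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, min_diff, alpha=0.05, power=0.8):
    """Approximate units needed per group for a two-sided two-proportion z-test.

    baseline_rate: control-group rate, e.g. CTR of 0.10 (assumed value)
    min_diff: smallest absolute lift worth detecting, e.g. 0.02
    alpha: significance level (false-positive rate)
    power: probability of detecting a true lift of size min_diff
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_diff
    # Critical values of the standard normal distribution.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances of the two groups.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / min_diff ** 2)


if __name__ == "__main__":
    # Hypothetical scenario: 10% baseline CTR, detect +2 percentage points.
    print(sample_size_per_group(0.10, 0.02))
```

Halving the min diff roughly quadruples the required sample, which is why the minimum detectable difference should be agreed on before the experiment starts.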