
Enhanced EC Recommendations: Trustworthy Validation with Large Language Models for Two-Tower Model - 陳峻廷(Dan Chen)

Welcome to the Hello World Dev Conf shared notes
Shared notes index: https://hackmd.io/@HWDC/2024

Session introduction

Fill out the session satisfaction survey | Send feedback to the hardworking speaker



01-What is Trustworthy

Trustworthy

Elements of Trustworthiness

Trustworthy Recommendation

Four Perspectives

  • Flow: Data Preparation -> Data Representation -> Recommendation Generation -> Performance Evaluation

02-Evaluation Framework

How to Correctly Evaluate AI

Brickmaster

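Two-stage Recommendation System

  • Scalable
    • Pain point: training is fast, but inference is slow
  • Trustworthy
  • Scenario-wise
  • KPI-Oriented = Ranking
    • KPIs differ by scenario; CTR is the most commonly used
    • The LINE TODAY scenario uses retention, redirect rate, and similar metrics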
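The two-stage idea can be sketched roughly as follows: a cheap retrieval step narrows the catalog, then a KPI-oriented ranker orders the survivors. This is only an illustration, not the speaker's system; `two_stage_recommend`, the toy item vectors, and the `predicted_ctr` table are all hypothetical.

```python
from typing import Callable

def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def two_stage_recommend(
    user_vec: list[float],
    item_vecs: dict[str, list[float]],
    rank_score: Callable[[str], float],
    n_candidates: int,
    k: int,
) -> list[str]:
    # Stage 1 (retrieval): cheap dot-product similarity over the whole catalog.
    candidates = sorted(
        item_vecs, key=lambda i: dot(user_vec, item_vecs[i]), reverse=True
    )[:n_candidates]
    # Stage 2 (ranking): a costlier, KPI-oriented score on the candidates only.
    return sorted(candidates, key=rank_score, reverse=True)[:k]

# Hypothetical toy data: 2-d item vectors and a made-up predicted-CTR table.
items = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
predicted_ctr = {"a": 0.02, "b": 0.05, "c": 0.09, "d": 0.03}

top = two_stage_recommend([1.0, 0.0], items, predicted_ctr.get, n_candidates=3, k=2)
print(top)  # ['b', 'd']
```

Item `c` has the highest predicted CTR but never reaches the ranker, because retrieval drops it first. That is the usual trade-off of limiting the expensive stage to a small candidate set.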

Evaluation Framework(1/2)

Figure A: A conceptual framework for building TRSs

Stage 4 places particular emphasis on technical evaluation.

Ref: Trustworthy Recommender Systems: An Overview

Figure B - Evaluation Framework (1/2)

03-Offline & Online Evaluation

Figure C - Offline Evaluation

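Business:
Validating against metrics is important, but you also need to judge whether a metric is a lagging indicator or a leading indicator.

  • Quick-validation metrics: CTR, CVR
  • Lagging indicators: may need time to accumulate data, which makes iterative evaluation difficult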
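As a small illustration of the quick-validation metrics, CTR and CVR can be computed directly from event counts. The counts below are made up, and CVR is assumed here to mean conversions per click; some teams define it per visit or per impression instead.

```python
def rate(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when there is no traffic."""
    return numerator / denominator if denominator else 0.0

# Hypothetical counts for one recommendation slot over one day.
impressions, clicks, orders = 10_000, 250, 10

ctr = rate(clicks, impressions)  # click-through rate: clicks per impression
cvr = rate(orders, clicks)       # conversion rate: orders per click (assumed definition)

print(f"CTR={ctr:.2%}, CVR={cvr:.2%}")
```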
Figure D - Online Evaluation
  1. Set the goal:

  2. Set the metrics: run an A/A test → make sure the metrics are meaningful

  3. Decide the minimum experimental unit (usually an ID)

  4. Estimate the sample size, which depends on:

    • alpha
    • power
    • variance
    • min diff (e.g. CTR: +2%)
  5. Random grouping (50%, 50%)

    • Make sure the A and B groups are independent.
    • If traffic is not held fixed or is not assigned independently, the experiment may not yield valid results.
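Steps 4 and 5 can be sketched as below. The sample-size function uses the standard normal-approximation formula for comparing two proportions, and the grouping function hashes the experimental unit's ID so that assignment is stable and roughly 50/50. The baseline CTR of 10%, the reading of "min diff" as an absolute difference of 2 percentage points, and the names `sample_size_per_group` and `assign_group` are all assumptions for illustration.

```python
import hashlib
import math
from statistics import NormalDist

def sample_size_per_group(baseline: float, min_diff: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size for a two-sided two-proportion test."""
    p1, p2 = baseline, baseline + min_diff
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)  # Bernoulli variance of each arm
    return math.ceil((z_alpha + z_power) ** 2 * variance / min_diff ** 2)

def assign_group(unit_id: str, experiment: str) -> str:
    """Deterministically map an ID to group A or B for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

n_per_group = sample_size_per_group(baseline=0.10, min_diff=0.02)
groups = [assign_group(f"user-{i}", "ec-shop-reco") for i in range(10_000)]
share_a = groups.count("A") / len(groups)
print(n_per_group, round(share_a, 3))
```

Salting the hash with the experiment name keeps assignments in different experiments independent of one another. A smaller minimum detectable difference raises the required sample size quickly, since it enters the formula squared.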

A/B test

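Key point: show how your algorithms can contribute to your business

  • If the experiment isn't significant:
    • The recommender system itself has a problem
    • The quantification direction or method is wrong
    • PSM (propensity score matching)
  • Sample ratio mismatch
  • Novelty effect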
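Two of the checks above can be sketched with the standard library alone: a chi-square goodness-of-fit test for sample ratio mismatch against an intended 50/50 split, and a pooled two-proportion z-test for significance. All counts are made up, and flagging SRM when the p-value is below 0.001 is a common convention assumed here, not something stated in the talk.

```python
import math

def srm_p_value(n_a: int, n_b: int) -> float:
    """Chi-square test (1 degree of freedom) against an intended 50/50 split."""
    expected = (n_a + n_b) / 2
    chi2 = (n_a - expected) ** 2 / expected + (n_b - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

def two_proportion_p_value(x_a: int, n_a: int, x_b: int, n_b: int) -> float:
    """Two-sided pooled z-test for a difference between two rates (e.g. CTR)."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (x_b / n_b - x_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical experiment: group sizes, then clicks per group.
srm_p = srm_p_value(5_000, 5_400)
if srm_p < 0.001:  # assumed threshold
    print("possible sample ratio mismatch; check assignment before reading results")

ctr_p = two_proportion_p_value(500, 10_000, 560, 10_000)
print(f"CTR difference p-value: {ctr_p:.3f}")
```

Running the SRM check first matters because a skewed split usually points to a bug in assignment or logging, in which case the significance test on the KPI cannot be trusted.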
Figure E - Case - EC Shop Recommendation

04-LLM on Recommendation

Feature engineering

(Produced by Julia Intern)

  • tokenization (prompt engineering)
    • temperature 0.4 works better for extracting product specifications
  • text embedding generation

(Aside: it is always the interns who change the world)

Evaluate embedding

  • RankMe / α-ReQ metrics
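For reference, these two metrics are usually defined as below. The notation is an assumption of these notes, not taken from the slides: Z is the N×K matrix of embeddings, σ_k(Z) are its singular values, ε is a small constant for numerical stability, and λ_j are the eigenvalues of the embedding covariance sorted in decreasing order.

```latex
\[
p_k = \frac{\sigma_k(Z)}{\lVert \sigma(Z) \rVert_1} + \epsilon,
\qquad
\mathrm{RankMe}(Z) = \exp\!\Big( -\sum_{k=1}^{\min(N,K)} p_k \log p_k \Big)
\]
\[
\lambda_j \propto j^{-\alpha}
\quad \text{(α-ReQ reports the fitted decay exponent } \alpha \text{)}
\]
```

RankMe is the exponential of the entropy of the normalized singular values, so it acts as a soft "effective rank" of the embedding matrix, and a higher value suggests the embedding uses more of its dimensions. α-ReQ instead summarizes how quickly the eigenspectrum decays.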

A two-tower model can be built with the help of an LLM.
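A very rough sketch of that idea follows, assuming each product already has a text embedding produced by an LLM. The towers are untrained stand-ins (normalization and mean pooling) rather than learned networks, and the 3-dimensional vectors and product names are invented for illustration.

```python
import math

def normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def item_tower(text_embedding: list[float]) -> list[float]:
    # Stand-in item tower: L2-normalize the LLM text embedding.
    return normalize(text_embedding)

def user_tower(history: list[list[float]]) -> list[float]:
    # Stand-in user tower: mean-pool the embeddings of items the user interacted with.
    dim = len(history[0])
    mean = [sum(vec[d] for vec in history) / len(history) for d in range(dim)]
    return normalize(mean)

def score(user_vec: list[float], item_vec: list[float]) -> float:
    # With both vectors normalized, the dot product is the cosine similarity.
    return sum(a * b for a, b in zip(user_vec, item_vec))

# Hypothetical LLM text embeddings for four products.
emb = {
    "phone_case": [0.9, 0.1, 0.0],
    "charger": [0.8, 0.2, 0.1],
    "cable": [0.85, 0.15, 0.05],
    "shirt": [0.0, 0.2, 0.9],
}

user = user_tower([emb["phone_case"], emb["charger"]])
scores = {name: score(user, item_tower(emb[name])) for name in ("cable", "shirt")}
print(scores)
```

Because the two towers only meet at the final dot product, item vectors can be computed ahead of time and indexed, leaving only the user tower to run at request time.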

Ref01
Ref02

Conclusion & Challenge

  • Data Quality
  • Multiple-Metrics evaluation
  • Conduct A/B test Experiment
  • Human Perception

05-Q&A


Chat area
