
RLHF vs DPO

RLHF

Reinforcement Learning from Human Feedback


  1. Supervised Fine-Tuning (SFT): fine-tune the pre-trained language model in a supervised fashion.
  2. Reward model training: the goal is to train a model that captures human preferences. In this stage, prompts are sampled from a prompt pool and the large language model generates multiple responses for each; human annotators rank these responses, and a reward model is trained on the rankings.
  3. Reinforcement learning with PPO (Proximal Policy Optimization): optimize the policy against the reward model (a schematic sketch of this loop follows the list).
    • Sample a new prompt from the dataset.
    • The PPO policy is initialized from the supervised (SFT) model.
    • The policy generates an output.
    • The reward model computes a reward for that output.
    • The policy is updated with PPO using the reward.
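
A schematic sketch of step 3, with stand-in functions for the policy, reward model, and PPO update (hypothetical names, not a real library API):

```python
# Minimal sketch of the RLHF-PPO loop described in step 3.
# All helpers here are hypothetical stand-ins, not a real library API.
import random

def sft_policy(prompt: str) -> str:
    """Stand-in for the SFT-initialized policy pi_theta generating a response."""
    return prompt + " ... generated response"

def reward_model(prompt: str, response: str) -> float:
    """Stand-in for the learned reward model r_phi(x, y)."""
    return random.random()

def ppo_update(policy, prompt: str, response: str, reward: float) -> None:
    """Stand-in for one PPO policy-gradient update (clipped surrogate + KL penalty)."""
    pass

prompts = ["Explain RLHF in one sentence.", "Summarize the article."]
policy = sft_policy  # the PPO policy is initialized from the SFT model

for step in range(3):
    x = random.choice(prompts)    # 1. sample a prompt
    y = policy(x)                 # 2. the policy generates an output
    r = reward_model(x, y)        # 3. the reward model scores the output
    ppo_update(policy, x, y, r)   # 4. update the policy with PPO
```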

An alternative to RLHF: DPO

Paper:
Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Why: RLHF is a complex and often unstable procedure. It involves training multiple LMs and sampling from the LM policy inside the training loop, which incurs significant computational cost.

DPO directly optimizes the language model to match human preferences with a simple classification objective, without building a reward model or using reinforcement learning.
In essence, DPO increases the relative log probability of preferred responses over dispreferred ones, while including a dynamic, per-example importance weight that prevents the model degeneration that occurs with a naive probability-ratio objective.


Preliminaries:

SFT stage

Supervised fine-tuning yields a policy $\pi^{\text{SFT}}$.

Reward modeling

Starting from the SFT model, each prompt $x$ is used to sample a pair of answers $(y_1, y_2) \sim \pi^{\text{SFT}}(y \mid x)$.

Human annotators then label

  1. $y_w$ (the preferred one)
  2. $y_l$ (the dispreferred one),

and this data is used to train a reward model $r(x, y)$.
The loss commonly used to build the reward model comes from the Bradley–Terry (BT) model, which stipulates that the human preference distribution $p^*$ can be written as

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\big(r^*(x, y_1)\big)}{\exp\big(r^*(x, y_1)\big) + \exp\big(r^*(x, y_2)\big)} \tag{1}$$

Assume a dataset sampled from $p^*$:

$$\mathcal{D} = \big\{ x^{(i)}, y_w^{(i)}, y_l^{(i)} \big\}_{i=1}^{N}$$

We then parameterize a reward model $r_\phi(x, y)$ and fit its parameters by maximum likelihood. Framing this as a binary classification problem gives the negative log-likelihood loss:

$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \big] \tag{2}$$

As in standard RLHF, $\sigma$ is the logistic (sigmoid) function. The reward model $r_\phi(x, y)$ is usually initialized from the SFT model $\pi^{\text{SFT}}(y \mid x)$, with a linear layer added on top of the transformer that produces a single scalar prediction for the reward value.
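
To make the objective concrete, here is a minimal PyTorch-style sketch of this pairwise loss; the batches of scores stand in for outputs of the scalar reward head (an illustration, not the papers' reference code):

```python
import torch
import torch.nn.functional as F

def bt_reward_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry model: -log sigmoid(r(x, y_w) - r(x, y_l))."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage: random scores stand in for r_phi(x, y_w) and r_phi(x, y_l) over a batch of 8 pairs.
loss = bt_reward_loss(torch.randn(8), torch.randn(8))
```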

  • Next, in the RLHF pipeline previously used for ChatGPT, the trained reward model provides feedback for iterative policy optimization with PPO, which can be written as:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \big] \tag{3}$$

  • Both $\pi_{\mathrm{ref}}$ and $\pi_\theta$ are initialized from $\pi^{\text{SFT}}$.
  • $\beta$ scales a KL penalty on the reward that keeps the policy being optimized, $\pi_\theta(y \mid x)$, from drifting too far from the base reference policy $\pi_{\mathrm{ref}}(y \mid x)$ (a sketch of how this penalty is typically applied per token follows this list).
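
In practice the KL term is often folded into a per-token reward that PPO then maximizes. The sketch below shows one common way to do this, subtracting β times the log-ratio at every token and adding the scalar reward-model score on the final token; the tensor names are mine and this is an assumed typical implementation, not something prescribed by the papers:

```python
import torch

def kl_shaped_rewards(score: torch.Tensor,
                      logprobs_policy: torch.Tensor,
                      logprobs_ref: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    """Per-token reward: -beta * (log pi_theta - log pi_ref), plus the reward-model
    score r_phi(x, y) added on the last token of the response."""
    shaped = -beta * (logprobs_policy - logprobs_ref)  # shape [T], one entry per response token
    shaped[-1] = shaped[-1] + score                    # trajectory-level reward at the final token
    return shaped

# Toy usage for a 5-token response.
rewards = kl_shaped_rewards(score=torch.tensor(1.7),
                            logprobs_policy=torch.randn(5),
                            logprobs_ref=torch.randn(5))
```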

DPO methodology

DPO exploits an analytical mapping from reward functions to optimal policies, which makes it possible to turn a loss over reward functions into a loss over policies.

Concretely, given a human-labeled preference dataset, DPO optimizes the policy with a simple binary cross-entropy objective, so no reward function needs to be learned during training.

Working through the derivation, the optimal solution of Equation 3 can be written as:

$$\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big) \tag{4}$$

where

$$Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)$$

The partition function $Z(x)$ depends only on $x$ and the base policy $\pi_{\mathrm{ref}}$; it normalizes the right-hand side so that $\pi_r(\cdot \mid x)$ is a valid probability distribution.

Next, take the logarithm of both sides of Equation 4 and rearrange to obtain

$$r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \tag{5}$$
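
For completeness, the algebra behind this step uses only the definitions above:

$$\log \pi_r(y \mid x) = \log \pi_{\mathrm{ref}}(y \mid x) + \tfrac{1}{\beta}\, r(x, y) - \log Z(x) \;\;\Longrightarrow\;\; r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x).$$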

With Equation 5, the reward function $r$ in Equation 2 (the reward-modeling loss) can be substituted out, which is exactly what DPO does; the $\beta \log Z(x)$ term cancels because the loss depends only on the difference of rewards for the same $x$:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma\!\Big( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big) \Big]$$

The objective is to maximize the value inside the sigmoid, i.e., to raise the probability of generating the good response and lower that of the bad response, which minimizes the loss.


The DPO gradient update:

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \sigma\big( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \big) \big[ \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big] \Big]$$

where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ is the implicit reward defined by the policy and the reference model.
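
As a concrete illustration, here is a minimal PyTorch-style sketch of the DPO loss, taking pre-computed sequence log-probabilities under the policy and the reference model (the function and argument names are mine, not the paper's reference implementation):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * [(log-ratio of y_w) - (log-ratio of y_l)]), averaged over the batch."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```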

Experimental results:


(1) On the sentiment generation task, DPO attains the highest reward at every level of KL divergence.
(2) On the summarization task, DPO beats the best PPO configuration and stays more robust across different sampling temperatures.

(3) On the single-turn dialogue task, DPO achieves the highest win rate among the compared methods.
(4) DPO also converges faster during training and trains more stably.


Using the sampling temperature that performed best on the summarization task, DPO also does better when evaluated on the CNN/DailyMail dataset.

An algorithm designed for RLHF: ReMax (a replacement for PPO)

Why: the main computational cost of RLHF comes from the third stage, and the bottleneck there is PPO.

One heavy source of overhead is the value model.


The value model is useful for estimating the policy's expected long-term return, but in PPO it is usually about the same size as the LLM itself, roughly doubling the storage requirement.

Moreover, the value model also needs storage for its gradients, activations, and optimizer states, pushing the GPU memory requirement up by nearly 4x. This makes it a major computational obstacle in RLHF.
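
As a rough, illustrative calculation (my own assumptions: a 7B-parameter value model trained with mixed-precision Adam; actual numbers depend on the setup):

```python
# Back-of-the-envelope GPU memory for a 7B value model under mixed-precision Adam.
params = 7e9
bytes_weights   = params * 2            # bf16 weights
bytes_grads     = params * 2            # bf16 gradients
bytes_optimizer = params * (4 + 4 + 4)  # fp32 master weights + two Adam moment buffers
total_gb = (bytes_weights + bytes_grads + bytes_optimizer) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~112 GB vs. ~14 GB for the weights alone
```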

Problem formulation

In the RLHF setup, the reward model assigns a scalar score $r(a_1, \ldots, a_T)$ to a complete response, and it is trained so that preferred responses receive higher scores than dispreferred (negative-preference) ones.

The optimization objective can be written as:

$$\max_\theta \; \mathbb{E}_{a_{1:T} \sim \pi_\theta} \big[ r(a_1, \ldots, a_T) \big]$$

For a specific task we introduce instructional prompts. Let $x$ denote the prompt, containing the task instruction and the task input. The LLM $\pi_\theta$ generates a response $a_{1:T}$, the reward model evaluates how well the response fits the prompt, and the goal is to maximize the reward:

$$\max_\theta \; \mathbb{E}_{x \sim \rho}\, \mathbb{E}_{a_{1:T} \sim \pi_\theta(\cdot \mid x)} \big[ r(x, a_{1:T}) \big] \tag{1}$$

Reward maximization as a sequential decision-making task
The reward-maximization problem can be recast as a Markov Decision Process (MDP) that RL can solve, where $s$ and $a$ denote the state and action, $P$ is the transition function describing the probabilistic transition from the current pair $(s, a)$ to the next state $s'$, $r$ is the reward function, and $\rho$ is the initial state distribution.

The main objective in the MDP framework is to optimize the expected long-term return, i.e., the cumulative sum of intermediate rewards:

$$\max_\theta \; \mathbb{E}_{s_1 \sim \rho}\, \mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)} \Big[ \sum_{t=1}^{T} r(s_t, a_t) \Big] \tag{2}$$

For an LLM, a state is the sequence of previously generated tokens, written $s_t = (x, a_1, \ldots, a_{t-1})$. An action selects one token from the vocabulary, and the initial state $s_1$ corresponds to the prompt $x$, consistent with $\rho$.

Although the reward model $r(x, a_{1:T})$ evaluates the whole response rather than individual tokens, the MDP framework can still be adapted by assigning a reward of 0 to every intermediate token:

$$r(s_t, a_t) = \begin{cases} r(x, a_{1:T}) & \text{if } t = T \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

Transition function

$P$ appends the current token to the history to form the next state:

$$s_{t+1} = (s_t, a_t) = (x, a_1, \ldots, a_t) \tag{4}$$

where $(s_t, a_t)$ denotes the tuple $s_t$ extended by the element $a_t$.
With this construction, an RL algorithm that solves Problem 2 (Equation 2) equivalently solves Problem 1 (Equation 1).
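
A small self-contained sketch of this construction (the names and the dummy trajectory reward are mine, purely for illustration): the state is the growing token prefix, the transition deterministically appends the chosen token, and the per-token reward is zero everywhere except the final step.

```python
from typing import List, Tuple

def transition(state: Tuple[str, ...], action: str) -> Tuple[str, ...]:
    """Deterministic transition (Eq. 4): append the chosen token to the prefix."""
    return state + (action,)

def token_reward(t: int, T: int, trajectory_reward: float) -> float:
    """Sparse reward (Eq. 3): zero for intermediate tokens, full reward at t = T."""
    return trajectory_reward if t == T else 0.0

# Toy rollout: a prompt followed by a 3-token response with trajectory reward 1.5.
state: Tuple[str, ...] = ("<prompt>",)
response: List[str] = ["The", "answer", "."]
T = len(response)
cumulative = 0.0
for t, token in enumerate(response, start=1):
    state = transition(state, token)
    cumulative += token_reward(t, T, trajectory_reward=1.5)
assert cumulative == 1.5  # the MDP return equals the trajectory-level reward
```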

A natural solution to Problem 2: PPO


However, RLHF for LLMs has three properties that PPO does not exploit (PPO's extra machinery addresses difficulties that RLHF does not have):

  1. Fast simulation
    In classical applications, obtaining the long-term return is expensive, but RLHF does not have this problem. As Equation 2 shows, for an RLHF task the long-term return is simply the trajectory reward of one LLM response, and obtaining it takes a single query to the reward model, usually no more than three seconds on modern GPUs.
  2. Deterministic transitions
    In RLHF the transitions are deterministic: as in Equation 4, given a pair $(s_t, a_t)$ the next state $s_{t+1}$ is known, and the reward $r(s_t, a_t)$ is also deterministic because it comes from a neural network.
  3. Trajectory-level rewards
    As Equation 3 shows, $r(s_t, a_t)$ is non-zero only when the whole trajectory ends at $t = T$. This means the RLHF task is close to a "single-stage" optimization problem, since all intermediate rewards are 0; consequently, the value function used in RL may not be all that useful here.

Proposed method

The classical algorithm: REINFORCE

Consider a fixed prompt $x$. With REINFORCE, the gradient of the objective in Problem 1 is

$$\nabla_\theta\, \mathbb{E}_{a_{1:T} \sim \pi_\theta(\cdot \mid x)} \big[ r(x, a_{1:T}) \big] = \mathbb{E}_{a_{1:T} \sim \pi_\theta(\cdot \mid x)} \Big[ \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid x, a_{1:t-1})\; r(x, a_{1:T}) \Big]$$

Now, given a set of prompts $(x^1, \ldots, x^N)$, a stochastic gradient estimator can be derived by rolling out the LLM $\pi_\theta$:

$$\widehat{g}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\big(a_t^i \mid x^i, a_{1:t-1}^i\big)\; r\big(x^i, a_{1:T}^i\big), \qquad a_{1:T}^i \sim \pi_\theta(\cdot \mid x^i)$$

The authors observe that REINFORCE may be well suited to RLHF: in particular, the estimate of the expected return it needs, i.e., the value of the objective in Problem 1, can be obtained with queries to the LM and the reward model.

Moreover, REINFORCE updates the policy using only $r(x, a_{1:T})$ and ignores intermediate rewards. This is normally considered a drawback, since intermediate rewards can carry rich signal, but in RLHF every intermediate reward is 0, so the drawback disappears.

However, REINFORCE does not resolve all of PPO's drawbacks. In experiments with an OPT-1.3B model it performed poorly at improving reward maximization, and the authors conjecture, based on what they observed, that the cause is the high variance of its stochastic gradients.

(Figure, two panels, not reproduced:)
  • Left: REINFORCE achieves worse reward.
  • Right: the stochastic gradient values of REINFORCE are far larger than those of ReMax.

Based on REINFORCE, the authors therefore propose another algorithm, ReMax.

To mitigate the high variance of the stochastic gradient, ReMax builds on REINFORCE by subtracting a baseline value from the gradient estimate:

$$\widetilde{g}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\big(a_t^i \mid x^i, a_{1:t-1}^i\big)\, \Big( r\big(x^i, a_{1:T}^i\big) - r\big(x^i, \bar{a}_{1:T}^i\big) \Big), \qquad \bar{a}_t^i \in \arg\max_a \pi_\theta\big(a \mid x^i, \bar{a}_{1:t-1}^i\big)$$

That is, to stabilize training, the reward score is shifted by subtracting the baseline value $r(x, \bar{a}_{1:T})$, the reward of the greedily decoded response.

The core procedure: for each prompt, sample a response and also decode one greedily, score both with the reward model, and weight the log-probability gradient by the difference of the two scores. (The paper's code listing is not reproduced here; a sketch follows.)
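
A minimal PyTorch-style sketch of one ReMax update under these assumptions: hypothetical `policy.sample`, `policy.greedy`, `policy.log_prob`, and `reward_model` interfaces (my illustration, not the paper's released code):

```python
import torch

def remax_step(policy, reward_model, optimizer, prompts):
    """One ReMax update: REINFORCE with the greedy response's reward as the baseline."""
    optimizer.zero_grad()
    total_loss = 0.0
    for x in prompts:
        y_sample = policy.sample(x)          # a_{1:T} ~ pi_theta(.|x)
        y_greedy = policy.greedy(x)          # argmax decoding, used only for the baseline
        with torch.no_grad():
            advantage = reward_model(x, y_sample) - reward_model(x, y_greedy)
        logp = policy.log_prob(x, y_sample)  # sum_t log pi_theta(a_t | x, a_{1:t-1})
        total_loss = total_loss - advantage * logp
    (total_loss / len(prompts)).backward()   # minimize -(reward - baseline) * log-prob
    optimizer.step()
```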

Experimental results

The experiments are reported along three axes (figures not reproduced here):

  • training speed
  • effectiveness
  • GPU memory usage

A deeper comparison of DPO and RLHF

Policy Optimization in RLHF: The Impact of Out-of-preference Data

Existing human-preference alignment methods fall into two broad categories:

  1. Reward-model-free methods, such as DPO.
  2. Reward-model-based methods, such as RLHF, referred to in this paper as RMB-PO (Reward-Model-Based Policy Optimization).

Problem Formulation:

The alignment problem is formulated as a contextual bandit. Let $s$ and $a$ denote the state (context) and the action. The goal is a decision policy $\pi$ that maximizes the reward:

$$\max_\pi \; \mathbb{E}_{s \sim \rho}\, \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ r(s, a) \big] \tag{1}$$

Notation: $\rho$ denotes the state distribution and $r$ is the ground-truth reward function.

As usual, the Bradley–Terry model is used to define the distribution of human preferences:

$$\mathbb{P}\big(a \succ a' \mid s\big) = \frac{\exp\big(r(s, a)\big)}{\exp\big(r(s, a)\big) + \exp\big(r(s, a')\big)}$$

where $a \succ a'$ means that $a$ is preferred over $a'$.

The reward learning objective is:

$$\min_{\hat{r}} \; -\,\mathbb{E}_{(s, a_w, a_l) \sim \mathcal{D}_{\mathrm{pref}}} \big[ \log \sigma\big( \hat{r}(s, a_w) - \hat{r}(s, a_l) \big) \big]$$

Policy objectives of the three methods:

$\pi_{\mathrm{ref}}$ is the reference policy model and $\beta > 0$ is a hyperparameter.

  • RMB-PO optimizes the policy using states sampled from the preference dataset:

    $$\max_\pi \; \mathbb{E}_{s \sim \mathcal{D}_{\mathrm{pref}}} \Big[ \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ \hat{r}(s, a) \big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid s) \big) \Big]$$

  • RMB-PO+, a variant of RMB-PO, additionally uses a new, preference-free dataset of states (denoted $\mathcal{D}_{\mathrm{new}}$ here):

    $$\max_\pi \; \mathbb{E}_{s \sim \mathcal{D}_{\mathrm{new}}} \Big[ \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ \hat{r}(s, a) \big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid s) \big) \Big]$$

    Such data is usually easy to obtain and far larger than $\mathcal{D}_{\mathrm{pref}}$.

  • DPO: uses only the preference data (a sketch contrasting the data each method uses follows this list):

    $$\min_\pi \; -\,\mathbb{E}_{(s, a_w, a_l) \sim \mathcal{D}_{\mathrm{pref}}} \Big[ \log \sigma\!\Big( \beta \log \frac{\pi(a_w \mid s)}{\pi_{\mathrm{ref}}(a_w \mid s)} - \beta \log \frac{\pi(a_l \mid s)}{\pi_{\mathrm{ref}}(a_l \mid s)} \Big) \Big]$$
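
To make the difference concrete, here is a schematic Python sketch (my own simplification, with stub policy and reward models) of which data each method's policy-update step touches: DPO consumes only the labeled preference triples, RMB-PO samples fresh actions from the current policy on preference-set states and scores them with the learned reward model, and RMB-PO+ additionally draws states from an extra preference-free prompt set.

```python
import random

preference_data = [("s1", "a_good", "a_bad"), ("s2", "a_good", "a_bad")]  # D_pref
preference_free_states = ["s3", "s4", "s5"]                               # extra prompts, no labels

def dpo_batch():
    # Only the labeled triple: no reward model, no fresh samples from the policy.
    return random.choice(preference_data)

def rmb_po_batch(policy, reward_model):
    # Same states as DPO, but fresh actions from the policy scored by the learned reward
    # (this targets the E_{a ~ pi(.|s)} estimation error).
    s, _, _ = random.choice(preference_data)
    a = policy(s)
    return s, a, reward_model(s, a)

def rmb_po_plus_batch(policy, reward_model):
    # Additionally samples states from the preference-free set
    # (this also targets the E_{s ~ rho} estimation error).
    s = random.choice(preference_free_states + [t[0] for t in preference_data])
    a = policy(s)
    return s, a, reward_model(s, a)

# Stubs standing in for a real policy and reward model.
policy = lambda s: f"response_to_{s}"
reward_model = lambda s, a: random.random()
print(dpo_batch(), rmb_po_batch(policy, reward_model), rmb_po_plus_batch(policy, reward_model))
```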

A stochastic-optimization viewpoint

Given the objectives above, the authors view all of these methods as stochastic approximations of Equation 1, with three sources of error:

  1. Reward model error, $|\hat{r}(s, a) - r(s, a)|$.
  2. Estimation error from computing the expectation $\mathbb{E}_{a \sim \pi(\cdot \mid s)}[\cdot]$ with finite samples.
  3. Estimation error from computing the expectation $\mathbb{E}_{s \sim \rho(\cdot)}[\cdot]$ with finite samples.

The first error is mainly due to the limited amount of preference data, which is costly to collect. Compared with DPO, RMB-PO aims to reduce the second error, and RMB-PO+ goes further and also reduces the third.


Experiment

linear bandit

In the linear bandit task, the reward function and the policy are parameterized as linear functions of feature maps $\phi_r$ and $\phi_\pi$, respectively (the exact definitions are given in figures not reproduced here).

Two settings are used in the experiments:

  1. No feature mismatch between the reward model and the policy model, i.e., $\phi_\pi = \phi_r$.
  2. The other setting has a feature mismatch between the two, i.e., $\phi_\pi \neq \phi_r$.


Optimality gap: the distance between the best solution found so far and a value that bounds the optimal solution; smaller is better.

Ablation study


neural bandit

In the neural bandit case, the feature representations of the reward and policy models are learned purely from the given data.


Further scaling up the size of the preference-free data does not help much.