RLHF vs DPO
RLHF
Reinforcement Learning from Human Feedback
- Supervised Fine-Tuning (SFT): fine-tune the pre-trained language model in a supervised manner.
- Reward model training: the goal is to train a model that captures human preferences. In this stage, prompts are first sampled from a prompt pool and the large language model generates several responses for each; human annotators then rank these responses, and a reward model is trained on the rankings.
- Reinforcement learning with PPO (Proximal Policy Optimization): optimize the policy against the reward model (a rough sketch of this loop follows the list).
- Sample a new prompt from the dataset.
- Initialize the PPO policy from the supervised (SFT) model.
- The policy generates an output.
- The reward model computes a reward for that output.
- Update the policy with PPO using the reward.
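As a rough illustration of this loop, below is a minimal, self-contained PyTorch sketch. The tiny linear modules stand in for the SFT-initialised policy, the frozen reference policy, and the reward model (all names and sizes are illustrative assumptions), and PPO's clipped surrogate and value function are omitted, so the update reduces to a plain policy-gradient step on a KL-penalised reward.

```python
import torch

# Toy stand-ins: in practice these are the SFT-initialised LLM policy, a frozen
# copy used as the reference, and the trained reward model.
torch.manual_seed(0)
VOCAB = 50

policy = torch.nn.Linear(VOCAB, VOCAB)        # "LLM policy", initialised from SFT
ref_policy = torch.nn.Linear(VOCAB, VOCAB)    # frozen SFT reference
ref_policy.load_state_dict(policy.state_dict())
reward_model = torch.nn.Linear(VOCAB, 1)      # produces a scalar reward
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
beta = 0.1                                    # KL penalty coefficient

for step in range(3):
    # 1. Sample a new "prompt" from the dataset (here: a random one-hot vector).
    x = torch.nn.functional.one_hot(torch.randint(VOCAB, (1,)), VOCAB).float()

    # 2. The policy generates an output (a single "token" in this toy setting).
    dist = torch.distributions.Categorical(logits=policy(x))
    y = dist.sample()

    # 3. The reward model scores the output.
    reward = reward_model(torch.nn.functional.one_hot(y, VOCAB).float()).squeeze()

    # 4. KL-penalised reward, then a policy-gradient update (PPO clipping omitted).
    ref_dist = torch.distributions.Categorical(logits=ref_policy(x))
    kl = dist.log_prob(y) - ref_dist.log_prob(y)
    advantage = (reward - beta * kl).detach()
    loss = -(dist.log_prob(y) * advantage).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```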
An alternative to RLHF: DPO
Paper:
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Why: RLHF is a complex and often unstable procedure. It involves training multiple LMs and sampling from the LM policy inside the training loop, which incurs significant computational cost.
DPO directly optimizes the language model to match human preferences with a simple classification objective, without building a reward model or using reinforcement learning.
In essence, DPO increases the relative log probability of the preferred response over the dispreferred one, while including a dynamic, per-example importance weight that prevents the model degeneration a naive probability-ratio objective would cause.
Preliminaries:
SFT stage
Fine-tune the pre-trained language model on supervised data to obtain $\pi^{\text{SFT}}$.
Reward modeling
During the SFT stage, for a prompt $x$ the model produces a pair of answers $(y_1, y_2) \sim \pi^{\text{SFT}}(y \mid x)$.
Human annotators then label
- $y_w$ (the preferred one)
- $y_l$ (the dispreferred one),
and a reward model is trained on these data.
A common choice for modelling the preferences when building the reward model is the Bradley-Terry (BT) model.
The BT model stipulates that the human preference distribution $p^*$ can be written as
$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\big(r^*(x, y_1)\big)}{\exp\big(r^*(x, y_1)\big) + \exp\big(r^*(x, y_2)\big)} \tag{1}$$
Assume a dataset of comparisons $\mathcal{D} = \big\{\big(x^{(i)}, y_w^{(i)}, y_l^{(i)}\big)\big\}_{i=1}^{N}$ sampled from $p^*$.
We then parametrize a reward model $r_\phi(x, y)$ and estimate its parameters by maximum likelihood. Framing this as a binary classification problem, the negative log-likelihood loss is:
$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big] \tag{2}$$
Here $\sigma$ is the logistic function. As in other RLHF work, $r_\phi$ is usually initialized from the SFT model $\pi^{\text{SFT}}$, with a linear layer added on top of the final transformer layer that produces a single scalar prediction for the reward value.
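A minimal PyTorch sketch of this pairwise loss, assuming the scalar rewards $r_\phi(x, y_w)$ and $r_\phi(x, y_l)$ have already been computed for a batch of preference pairs; the function name and dummy inputs are illustrative.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood: -log sigmoid(r(x, y_w) - r(x, y_l))."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Dummy scalar rewards for a batch of 8 preference pairs.
r_w = torch.randn(8, requires_grad=True)   # r_phi(x, y_w)
r_l = torch.randn(8, requires_grad=True)   # r_phi(x, y_l)
loss = reward_model_loss(r_w, r_l)
loss.backward()
```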
- Next, in the RLHF pipeline previously used for ChatGPT, the trained reward model provides feedback for iterative policy optimization. This optimization (solved with PPO) can be written as:
$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\Vert\,\pi_{\text{ref}}(y \mid x)\big] \tag{3}$$
- Both $\pi_\theta$ and $\pi_{\text{ref}}$ are initialized from $\pi^{\text{SFT}}$.
- The KL term, weighted by $\beta$, acts as a correction to the reward that prevents the policy $\pi_\theta$ from drifting too far from the base reference policy $\pi_{\text{ref}}$.
DPO methodology
DPO exploits an analytical mapping from reward functions to optimal policies, which makes it possible to turn a loss over reward functions into a loss over policies.
Concretely, given a human-labelled preference dataset, DPO optimizes the policy with a simple binary cross-entropy objective, so no reward function needs to be learned during training.
By derivation, the optimum of Eq. 3 can be expressed in the following form:
$$\pi_r(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big) \tag{4}$$
where
$$Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)$$
The partition function $Z(x)$ depends only on $x$ and the base policy $\pi_{\text{ref}}$; it normalizes the right-hand side so that it is a valid probability distribution with values in $[0, 1]$.
Taking the logarithm of both sides of Eq. 4 and rearranging with some algebra gives
$$r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) \tag{5}$$
With Eq. 5 we can replace the reward function in Eq. 2 by this policy-based expression, which is exactly what DPO does:
$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
Minimizing this loss maximizes the term inside the sigmoid, i.e., it raises the probability of generating the good response and lowers the probability of the bad response.
The DPO update (the gradient of the loss):
$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\sigma\big(\hat r_\theta(x, y_l) - \hat r_\theta(x, y_w)\big)\,\big[\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big]\Big]$$
where $\hat r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the implicit reward defined by the policy; examples whose implicit reward ranks the pair incorrectly receive a larger weight.
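A minimal PyTorch sketch of the DPO loss itself, assuming the sequence-level log probabilities (sums of token log-probs) under the policy and the frozen reference model are already available; the names and the dummy batch are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid( beta * [log-ratio of y_w  -  log-ratio of y_l] )."""
    logratio_w = policy_logp_w - ref_logp_w   # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    logratio_l = policy_logp_l - ref_logp_l   # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (logratio_w - logratio_l)).mean()

# Dummy sequence log-probabilities for a batch of 4 preference pairs.
B = 4
loss = dpo_loss(torch.randn(B, requires_grad=True), torch.randn(B, requires_grad=True),
                torch.randn(B), torch.randn(B))
loss.backward()
```

Note that the reference log-probabilities carry no gradient, matching the role of the frozen $\pi_{\text{ref}}$.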
Experimental results:
(1) On the sentiment generation task, DPO achieves the highest reward at every level of KL divergence.
(2) On the summarization task, DPO beats the best-performing PPO variant and stays more robust across different sampling temperatures.
(3) On the single-turn dialogue task, DPO achieves the highest win rate among the compared methods.
(4) During training, DPO also converges faster and its training is more stable.
Testing on the CNN/DailyMail dataset with the sampling temperature that worked best on the summarization task, DPO again performs better.
An algorithm designed for RLHF: ReMax (replacing PPO)
Why: the main computational cost of RLHF comes from the third stage, and the bottleneck there is PPO.
One particularly heavy cost comes from the value model.
The value model is useful for estimating the policy's expected long-term return, but in PPO it is typically about the same size as the LLM itself, which roughly doubles the memory requirement.
In addition, the value model must also store its gradients, activations, and optimizer states, pushing GPU memory requirements up to nearly four times as much, which makes it the main computational obstacle in RLHF.
In the RLHF setup, the reward model assigns a scalar score to a complete response. It is trained from human comparisons:
- Preference data: $(x, y_w)$
- Negative preference data: $(x, y_l)$
The corresponding training objective can be written as:
$$\max_{r}\; \mathbb{E}\Big[\log \sigma\big(r(x, y_w) - r(x, y_l)\big)\Big]$$
For a specific task we introduce instructional prompts. Let $x$ denote such a prompt, containing the task instruction and the input; the LLM generates a response $a_{1:T}$, the reward model evaluates how relevant the response is to the prompt, and the goal is to maximize this reward:
$$\max_{\theta}\; \mathbb{E}_{x \sim \rho}\,\mathbb{E}_{a_{1:T} \sim \pi_\theta(\cdot \mid x)}\big[r(x, a_{1:T})\big] \tag{1}$$
Reward Maximization as a Sequential Decision-Making Task
A reward maximization problem can be cast as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \rho)$ that RL can solve, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $P$ is the transition function describing the probabilistic transition to the next state $s_{t+1}$ given the current state $s_t$ and action $a_t$, $r$ is the reward function, and $\rho$ is the initial state distribution.
The main objective in the MDP framework is to optimize the expected long-term return, i.e., the cumulative sum of the intermediate rewards:
$$\max_{\theta}\; \mathbb{E}_{s_1 \sim \rho}\,\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\left[\sum_{t=1}^{T} r(s_t, a_t)\right] \tag{2}$$
In the LLM setting, a state is the sequence of previously generated tokens, written $s_t = (x, a_{1:t-1})$. An action $a_t$ is the choice of one token from the vocabulary set, and the initial state corresponds to the prompt $x$, so $\rho$ is aligned with the prompt distribution.
Although the reward model scores an entire response rather than individual tokens, the MDP framework can still be adapted as follows, by assigning every intermediate token a reward of 0:
$$r(s_t, a_t) = \begin{cases} 0, & t \neq T \\ r(x, a_{1:T}), & t = T \end{cases} \tag{3}$$
The transition function appends the current token to the history to form the next state:
$$s_{t+1} = P(s_t, a_t) = [s_t, a_t] = (x, a_{1:t}) \tag{4}$$
where $[s_t, a_t]$ denotes appending the element $a_t$ to the tuple $s_t$.
With this setup, an RL algorithm that solves Problem 2 (Eq. 2) equivalently solves Problem 1 (Eq. 1).
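To make the token-level view concrete, here is a small helper that unrolls one response into (state, action, reward) triples following Eqs. 3 and 4; the function and its toy inputs are illustrative, not the paper's code.

```python
from typing import List, Tuple

def token_level_rewards(prompt: List[int], response: List[int],
                        trajectory_reward: float) -> List[Tuple[Tuple[int, ...], int, float]]:
    """Unroll a response into (state, action, reward) triples: s_t = (x, a_{1:t-1}),
    the action is the next token, and the per-token reward is 0 except at the last
    token, which receives the trajectory-level reward r(x, a_{1:T})."""
    triples = []
    for t, token in enumerate(response):
        state = tuple(prompt) + tuple(response[:t])                   # s_t = (x, a_{1:t-1})
        reward = trajectory_reward if t == len(response) - 1 else 0.0
        triples.append((state, token, reward))
    return triples

# Example: prompt token ids [1, 2], response [5, 6, 7], reward-model score 0.8.
print(token_level_rewards([1, 2], [5, 6, 7], 0.8))
```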
A natural solution to Problem 2: PPO
However, RLHF for LLMs has three properties that PPO's machinery is designed around but does not actually need here (classical RL problems lack them):
- fast simulation
In classical applications, obtaining the long-term return is costly, but RLHF does not have this problem. As Eq. 2 shows, for an RLHF task the long-term return is simply the trajectory reward of a single LLM response, and obtaining it only requires one query to the LLM and the reward model, which usually takes no more than three seconds on modern GPUs.
- deterministic transitions
In RLHF the transitions are deterministic: as Eq. 4 shows, the next state is known exactly given $(s_t, a_t)$. The reward function is deterministic as well, since it comes from a neural network.
- trajectory-level rewards
As Eq. 3 shows, $r(s_t, a_t)$ is non-zero only when the whole trajectory ends at $t = T$. This means the RLHF task is close to a "single-stage" optimization problem, since the intermediate rewards are all 0; consequently, the value function used in RL may not be as useful here.
Proposed method
Classical algorithm – REINFORCE
Consider a fixed prompt $x$. With REINFORCE, the gradient of the objective in Problem 1 is
$$\nabla_\theta\, \mathbb{E}_{a_{1:T} \sim \pi_\theta(\cdot \mid x)}\big[r(x, a_{1:T})\big] = \mathbb{E}_{a_{1:T} \sim \pi_\theta(\cdot \mid x)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid x, a_{1:t-1})\, r(x, a_{1:T})\right]$$
Now, given a set of prompts $\{x^{(i)}\}_{i=1}^{N}$,
a stochastic gradient estimator can be derived by rolling out the LLM:
$$\widehat{g}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid x^{(i)}, a_{1:t-1}^{(i)}\big)\, r\big(x^{(i)}, a_{1:T}^{(i)}\big), \qquad a_{1:T}^{(i)} \sim \pi_\theta\big(\cdot \mid x^{(i)}\big)$$
The authors find that REINFORCE is potentially well suited to RLHF: in particular, the estimate of expected return that it needs, i.e., the objective value in Problem 1, can be obtained simply by querying the LM and the reward model.
Moreover, REINFORCE updates the policy using only the trajectory-level reward and ignores intermediate rewards. This is usually seen as a drawback, because intermediate rewards can carry rich signal, but in RLHF the intermediate rewards are all 0, so this drawback disappears.
However, REINFORCE does not remove every drawback of PPO: in experiments with an OPT-1.3B model it performs poorly at improving reward maximization.
The suspected cause, based on what was observed in the experiments, is the high variance of its stochastic gradients.

- Left: REINFORCE performs worse.
- Right: the stochastic gradient values of REINFORCE are far larger than those of ReMax.
The authors therefore propose another algorithm built on top of REINFORCE: ReMax.
To mitigate the high variance of the stochastic gradients, ReMax subtracts a baseline value from the REINFORCE gradient estimate:
$$\widetilde{g}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid x^{(i)}, a_{1:t-1}^{(i)}\big)\,\Big(r\big(x^{(i)}, a_{1:T}^{(i)}\big) - b_\theta\big(x^{(i)}\big)\Big)$$
where the baseline $b_\theta(x) = r(x, \bar a_{1:T})$ is the reward of the greedy response, $\bar a_t = \arg\max_a \pi_\theta(a \mid x, \bar a_{1:t-1})$.
In other words, training is stabilized by modifying the reward score through subtraction of this baseline value.
The core of the update takes only a few lines of code.
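A minimal PyTorch-style sketch of the resulting loss (not the authors' implementation), assuming the per-token log-probabilities of the sampled response and the two scalar reward-model scores, for the sampled and the greedy response, have already been computed; all names are illustrative.

```python
import torch

def remax_loss(token_logps: torch.Tensor,
               sampled_reward: torch.Tensor,
               greedy_reward: torch.Tensor) -> torch.Tensor:
    """REINFORCE with the ReMax baseline: the greedy response's reward is
    subtracted from the sampled response's reward, and the difference weights
    the sum of token log-probabilities."""
    advantage = (sampled_reward - greedy_reward).detach()   # r(x, a_{1:T}) - b_theta(x)
    return -(token_logps.sum(dim=-1) * advantage).mean()

# Dummy values: a batch of 4 responses with 16 tokens each.
token_logps = torch.randn(4, 16, requires_grad=True)   # log pi_theta(a_t | x, a_{1:t-1})
loss = remax_loss(token_logps, torch.randn(4), torch.randn(4))
loss.backward()
```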

Experimental results
A closer comparison of DPO and RLHF
Policy Optimization in RLHF: The Impact of Out-of-preference Data
Existing human preference alignment methods fall into two main categories:
- Reward-model-free methods, such as DPO
- Reward-model-based methods, such as RLHF, referred to in this paper as RMB-PO (Reward-Model-Based Policy Optimization)
The alignment problem is formulated as a contextual bandit. Let $s$ and $a$ denote the state (context) and the decision (action), respectively.
The goal is to obtain a decision policy $\pi$ that maximizes the reward:
$$\max_{\pi}\; \mathbb{E}_{s \sim \rho}\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[r(s, a)\big] \tag{1}$$
Here $\rho$ denotes the state distribution and $r$ is the ground-truth reward function.
The Bradley-Terry model is typically used to define the distribution of human preferences:
$$\mathbb{P}\big(a^1 \succ a^2 \mid s\big) = \frac{\exp\big(r(s, a^1)\big)}{\exp\big(r(s, a^1)\big) + \exp\big(r(s, a^2)\big)}$$
where $a^1 \succ a^2$ means that $a^1$ is preferred over $a^2$.
The reward learning objective is:
$$\max_{\hat r}\; \mathbb{E}_{(s,\, a^w,\, a^l) \sim \mathcal{D}_{\text{pref}}}\Big[\log \sigma\big(\hat r(s, a^w) - \hat r(s, a^l)\big)\Big]$$
The policy objectives of the three methods:
Let $\pi_{\text{ref}}$ be the reference policy model and $\beta$ a hyperparameter.
A stochastic optimization view
Based on the formulations above, the authors argue that all of these methods are stochastic approximations of Eq. 1, with three sources of error:
- reward model error
- estimation error from computing the expectation over actions $a \sim \pi(\cdot \mid s)$ with finite samples
- estimation error from computing the expectation over states $s \sim \rho$ with finite samples
The first error mainly comes from the limited amount of preference data, and such data are expensive to collect.
Compared with DPO, RMB-PO aims to reduce the second error,
and RMB-PO+ goes further and also reduces the third error (see the sketch below).
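The difference between the reward-model-based variants can be sketched as follows: both sample fresh actions from the current policy and score them with the learned reward model, and RMB-PO+ additionally feeds in out-of-preference states. The toy modules, dimensions, and the omission of the KL regularizer are simplifications for illustration only.

```python
import torch

torch.manual_seed(0)
N_ACTIONS = 8
policy = torch.nn.Linear(8, N_ACTIONS)            # toy policy over 8 discrete actions
reward_hat = torch.nn.Linear(8 + N_ACTIONS, 1)    # learned reward model r_hat(s, a)

pref_states = torch.randn(16, 8)     # states that appear in the preference data
extra_states = torch.randn(64, 8)    # out-of-preference states (no labels needed)

def rmb_po_objective(states: torch.Tensor) -> torch.Tensor:
    """Sample actions from the current policy at the given states and score them
    with the learned reward model; maximising this Monte-Carlo estimate is the
    policy-optimisation step (the KL term is omitted for brevity)."""
    dist = torch.distributions.Categorical(logits=policy(states))
    actions = dist.sample()
    feats = torch.cat([states,
                       torch.nn.functional.one_hot(actions, N_ACTIONS).float()], dim=-1)
    rewards = reward_hat(feats).squeeze(-1).detach()
    return (dist.log_prob(actions) * rewards).mean()   # REINFORCE-style estimate

obj_rmb_po = rmb_po_objective(pref_states)                                   # RMB-PO
obj_rmb_po_plus = rmb_po_objective(torch.cat([pref_states, extra_states]))   # RMB-PO+
```

DPO, by contrast, never queries a reward model: it only uses the state-action pairs that already appear in the preference data.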

Experiment
linear bandit
In the linear bandit task, the reward is defined as a linear function of the state-action features,
where the feature is represented as:

The experiments use two settings:
- no feature mismatch between the reward model and the policy model, i.e., the two share the same features;
- the other setting introduces a feature mismatch between them.


Optimality gap: the distance between the best known solution found so far and a value that bounds the best possible solution; smaller is better.
Ablation study

neural bandit
In the neural bandit case, the feature representations of the reward and policy models are learned purely from the given data.

Further scaling up the size of the preference-free (out-of-preference) data does not bring much additional benefit.