RLHF vs DPO

RLHF

Reinforcement Learning from Human Feedback

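Supervised Fine-Tuning (SFT): fine-tune the pretrained language model in a supervised manner.
Reward model training: train a model that captures human preferences. In this stage, prompts are sampled from a prompt library and the large language model generates several responses for each one. Human annotators then rank these responses, and a reward model is trained on the rankings.
Reinforcement learning with PPO (Proximal Policy Optimization): optimize the policy against the reward model.
- Sample a new prompt from the dataset.
- Initialize the PPO model from the supervised policy (SFT).
- The policy generates an output.
- The reward model computes a reward for the output.
- Update the policy with PPO using the reward.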
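
The loop in the third stage can be sketched on a toy problem. This is only an illustration under loud assumptions: the "policy" is a softmax over three canned responses, the "reward model" is a fixed table, and a plain REINFORCE-style policy-gradient step stands in for PPO's clipped update. None of the names below come from the original note.

```python
import math
import random

# Hypothetical stand-ins: three canned responses and a fixed "reward model".
RESPONSES = ["bad answer", "okay answer", "good answer"]
REWARDS = [0.1, 0.5, 1.0]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(steps=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]  # policy "initialized from SFT" (uniform here)
    baseline = sum(REWARDS) / len(REWARDS)
    for _ in range(steps):
        probs = softmax(logits)
        # The policy generates an output for the sampled prompt.
        action = rng.choices(range(len(RESPONSES)), weights=probs)[0]
        # The reward model scores the output; subtract a baseline.
        advantage = REWARDS[action] - baseline
        # Policy-gradient step (not PPO's clipped objective):
        # d log pi(action) / d logit_k = 1[k == action] - probs[k].
        for k in range(len(logits)):
            grad = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad
    return softmax(logits)

final_probs = train()
```

After training, `final_probs` puts most of its mass on the highest-reward response, which is the behaviour the reward signal is meant to induce.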

An alternative to RLHF - DPO

Paper:
Direct Preference Optimization: Your Language Model is Secretly a Reward Model

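Why: RLHF is a complex and often unstable procedure. It trains multiple LMs and samples from the LM policy inside the training loop, which incurs significant computational cost.

DPO optimizes the language model directly to match human preferences with a simple classification objective, without building a reward model or using reinforcement learning.
In essence, DPO increases the relative log probability of the preferred response over the dispreferred response, while including a dynamic, per-example importance weight that prevents the model degeneration a naive probability-ratio objective would cause.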

Preliminaries:

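SFT stage

This stage yields a model $\pi^{SFT}$.

Reward modeling

The SFT model is prompted with a prompt $x$ to produce a pair of answers $(y_1, y_2) \sim \pi^{SFT}(y \mid x)$. Human annotators then label the pair:

$y_{w}$ (preferred one)
$y_{l}$ (dispreferred one)

These labels are the training data for a reward model that approximates the latent reward $r^{*}(x, y)$ behind the preferences.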

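The loss commonly used to build the reward model comes from the Bradley-Terry (BT) model. The BT model stipulates that the human preference distribution $p^{*}$ can be written as: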
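
For reference, and assuming the note follows the notation and numbering of the DPO paper, the Bradley-Terry preference model (Equation 1) reads:

$$
p^{*}(y_1 \succ y_2 \mid x) = \frac{\exp\left(r^{*}(x, y_1)\right)}{\exp\left(r^{*}(x, y_1)\right) + \exp\left(r^{*}(x, y_2)\right)}
$$

The probability depends only on the difference of the two rewards, so it equals $\sigma\left(r^{*}(x, y_1) - r^{*}(x, y_2)\right)$ with $\sigma$ the logistic function.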

Suppose a dataset of comparisons is sampled from $p^{*}$, and a reward model $r_{\phi}(x, y)$ is parametrized. Its parameters are estimated by maximum likelihood, which turns the problem into binary classification with the negative log-likelihood loss:
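
Assuming the DPO paper's notation, with the comparison dataset written as $\mathcal{D} = \{x^{(i)}, y_w^{(i)}, y_l^{(i)}\}_{i=1}^{N}$, the negative log-likelihood loss (Equation 2) is:

$$
\mathcal{L}_R(r_{\phi}, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_{\phi}(x, y_w) - r_{\phi}(x, y_l)\right)\right]
$$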

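As in RLHF, $\sigma$ is the logistic function. The network $r_{\phi}(x, y)$ is usually initialized from the SFT model $\pi^{SFT}(y \mid x)$, with a linear layer added on top of the final transformer layer that produces a single scalar prediction for the reward value.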
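
The pairwise loss described above can be sketched in a few lines of Python. This is a minimal illustration that works directly on scalar reward scores; the function names and the example scores are hypothetical and not from the original note.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_preference_prob(r_w, r_l):
    """Bradley-Terry probability that the response scored r_w beats the one scored r_l."""
    return sigmoid(r_w - r_l)

def reward_nll(pairs):
    """Mean negative log-likelihood over (r_w, r_l) score pairs."""
    return -sum(math.log(bt_preference_prob(r_w, r_l)) for r_w, r_l in pairs) / len(pairs)

# Hypothetical scalar scores assigned to (preferred, dispreferred) responses.
good_ranking = [(2.0, 0.0), (1.5, -0.5)]   # preferred responses score higher
bad_ranking = [(0.0, 2.0), (-0.5, 1.5)]    # preferred responses score lower
```

A reward model that scores the preferred response higher gets a lower loss on `good_ranking` than on `bad_ranking`, and identical scores give a loss of $\log 2$.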

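Next, in the RLHF pipeline that ChatGPT used, the trained reward model provides feedback and the policy is improved iteratively. The objective of this RL stage (optimized with PPO) can be written as: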
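
Assuming the DPO paper's notation, the KL-constrained reward maximization objective (Equation 3) is:

$$
\max_{\pi_{\theta}} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(y \mid x)}\left[r_{\phi}(x, y)\right] - \beta\, \mathbb{D}_{KL}\left[\pi_{\theta}(y \mid x) \,\|\, \pi_{ref}(y \mid x)\right]
$$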

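Both $\pi_{ref}$ and $\pi_{\theta}$ are initialized from $\pi^{SFT}$. The parameter $\beta$ weights a correction to the reward that keeps the policy being trained, $\pi_{\theta}(y \mid x)$, from deviating too far from the base reference policy $\pi_{ref}(y \mid x)$.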

DPO methodology

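DPO uses the analytical mapping from a reward function to its optimal policy, which lets it turn a loss over reward functions into a loss over policies.

Concretely, given a dataset of human-labeled preferences, DPO optimizes the policy with a simple binary cross-entropy objective, so no reward function has to be learned during training.

By derivation, the optimal solution of Equation 3 takes the following form: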
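
Assuming the DPO paper's notation, the optimal policy (Equation 4) and its partition function are:

$$
\pi_{r}(y \mid x) = \frac{1}{Z(x)}\, \pi_{ref}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right),
\qquad
Z(x) = \sum_{y} \pi_{ref}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)
$$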

Here the partition function $Z(x)$ depends only on $x$ and the base policy $\pi_{ref}$. It normalizes the right-hand side so that it is a valid probability distribution with values in [0,1].

Taking the logarithm of both sides of Equation 4 and rearranging gives:
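
Assuming the DPO paper's notation, the reward expressed through its optimal policy (Equation 5) is:

$$
r(x, y) = \beta \log \frac{\pi_{r}(y \mid x)}{\pi_{ref}(y \mid x)} + \beta \log Z(x)
$$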

With Equation 5, the reward function $r$ in Equation 2 can be replaced, which achieves the goal of DPO:
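
Assuming the DPO paper's notation, substituting the reparametrized reward into the pairwise loss gives the DPO objective. The $\beta \log Z(x)$ terms cancel because the loss depends only on the difference of two rewards for the same prompt $x$:

$$
\mathcal{L}_{DPO}(\pi_{\theta}; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right]
$$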

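The goal is to maximize the value inside the brackets, that is, to raise the probability of generating the good response and lower the probability of the bad response, which minimizes the loss.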
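
A minimal per-example sketch of this loss, assuming the sequence log-probabilities under the policy and the reference model have already been computed. The function name, argument names, and the default `beta=0.1` are illustrative choices, not taken from the original note.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities (sums over tokens)."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    margin = reward_w - reward_l
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: margin is 0, loss is log 2.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raised the preferred response and lowered the dispreferred one.
loss_improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Only the log-probability ratios against the reference matter, so a policy that equals the reference always starts at a loss of $\log 2$.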

DPO update formula:
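
Assuming the DPO paper's notation, with the implicit reward $\hat{r}_{\theta}(x, y) = \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{ref}(y \mid x)}$, the gradient of the loss is:

$$
\nabla_{\theta} \mathcal{L}_{DPO}(\pi_{\theta}; \pi_{ref}) = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\sigma\left(\hat{r}_{\theta}(x, y_l) - \hat{r}_{\theta}(x, y_w)\right)\left[\nabla_{\theta} \log \pi_{\theta}(y_w \mid x) - \nabla_{\theta} \log \pi_{\theta}(y_l \mid x)\right]\right]
$$

The $\sigma(\cdot)$ factor is the per-example weight: it is large when the implicit reward ranks the pair incorrectly.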

Experimental results:

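(1) In the sentiment generation task, DPO achieves the highest reward at every KL divergence value.
(2) In the summarization task, DPO outperforms the best PPO configuration and is more robust across sampling temperatures.

(3) In the single-turn dialogue task, DPO achieves the highest win rate among the well-performing methods.
(4) During training, DPO also converges relatively quickly and trains stably.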

Using the sampling temperatures that performed best on the summarization task, DPO also does better when evaluated on the CNN/DailyMail dataset.

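An algorithm designed for RLHF - ReMax (replacing PPO)

Why: the main computational overhead of RLHF comes from its third stage, and the bottleneck is PPO.

One heavy part of that overhead is the value model.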

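The value model gives a useful estimate of the policy's expected long-term return, but in PPO it is usually about the same size as the LLM, which doubles the storage requirement.

The value model also has to keep its gradients, activations, and optimizer states, which increases the GPU memory requirement further by nearly four times and makes it a major computational obstacle in RLHF.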
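
A back-of-the-envelope calculation shows where a factor of about four can come from. The sketch assumes fp32 values (4 bytes each), the Adam optimizer (two moment estimates per parameter), and ignores activations entirely; the 7B parameter count is a hypothetical example, not a figure from the original note.

```python
def training_memory_gb(n_params, bytes_per_value=4):
    """Rough memory (GB) to train a model with Adam, ignoring activations."""
    weights = n_params * bytes_per_value
    gradients = weights        # one gradient value per parameter
    optimizer = 2 * weights    # Adam keeps two moment estimates per parameter
    gb = 1e9
    return {
        "weights": weights / gb,
        "gradients": gradients / gb,
        "optimizer": optimizer / gb,
        "total": (weights + gradients + optimizer) / gb,
    }

# Hypothetical 7B-parameter value model.
value_model = training_memory_gb(7_000_000_000)
```

Under these assumptions the training state takes 112 GB against 28 GB for the weights alone, that is, four times the weight memory before activations are counted.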

Problem formulation

In the RLHF setup, the reward model assigns a scalar score to a complete response:

$r(a_{1}, \dots, a_{T})$

Preference data: $r(a_{1}, \dots, a_{T})$
Negative preference data: $r(a'_{1}, \dots, a'_{T})$

The optimization objective can be expressed as follows:

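For a specific task we introduce instructional prompts. Let $x$ denote such a prompt, which contains the task instruction and the input. The LLM $\pi_{\theta}$ generates a response $a_{1:T}$, the reward model scores how well the response matches the prompt, and the goal is to maximize the reward.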
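
Written out, and assuming prompts are drawn from a distribution $\rho$ as in the MDP formulation later in this note, the reward maximization problem (referred to below as Equation 1) would read:

$$
\max_{\theta} \; \mathbb{E}_{x \sim \rho}\, \mathbb{E}_{a_{1:T} \sim \pi_{\theta}(\cdot \mid x)}\left[r(x, a_{1:T})\right]
$$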

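Reward maximization as Sequential Decision-making Task
A reward maximization problem can be cast as a Markov Decision Process (MDP) that RL can solve. Here $s$ and $a$ are the state and the action, $P$ is the transition function, which gives the probabilistic transition to the next state $s'$ from the current $s$ and $a$, $r$ is the reward function, and $\rho$ is the initial state distribution.

The main goal in the MDP framework is to optimize the expected long-term return, that is, the cumulative sum of intermediate rewards: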
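
Using the symbols defined above, and assuming a finite horizon $T$, the expected long-term return objective (referred to below as Equation 2) would read:

$$
\max_{\theta} \; \mathbb{E}_{s_{1} \sim \rho}\, \mathbb{E}_{a_{1:T} \sim \pi_{\theta}}\left[\sum_{t=1}^{T} r(s_{t}, a_{t})\right]
$$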

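In an LLM, the state is the sequence of tokens generated so far, $s_{t} = (x, a_{1}, \dots, a_{t-1})$. An action selects one token from the vocabulary. The initial state $s_{1}$ is the prompt $x$, which matches the initial state distribution $\rho$.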

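Although the reward model $r(x, a_{1:T})$ scores the whole response rather than individual tokens, the MDP framework can still be adapted as follows. The reward of every intermediate token is set to 0 and the full reward is given at the last step (Equation 3):

$$
r(s_{t}, a_{t}) = \begin{cases} 0 & \text{if } t \neq T \\ r(x, a_{1:T}) & \text{otherwise} \end{cases}
$$

The transition function $P$ appends the current token to the history to form the next state (Equation 4):

$$
P(s_{t+1} \mid s_{t}, a_{t}) = \begin{cases} 1 & \text{if } s_{t+1} = (*s_{t}, a_{t}) \\ 0 & \text{otherwise} \end{cases}
$$

where $*s_{t}$ denotes the elements of the tuple $s_{t}$.
With this, an RL algorithm that solves Problem 2 (Equation 2) equivalently solves Problem 1 (Equation 1).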
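
The equivalence can be checked on a toy example: summing the per-token rewards along a trajectory gives exactly the reward of the whole response. The reward model below is a hypothetical stand-in (it counts response tokens that occur as substrings of the prompt), and all function names are illustrative.

```python
def transition(state, action):
    """Deterministic transition: append the chosen token to the history."""
    return (*state, action)

def token_reward(state, action, horizon, reward_model):
    """Zero for intermediate tokens; the full response reward at the last step."""
    t = len(state)  # s_t = (x, a_1, ..., a_{t-1}) has t elements
    if t != horizon:
        return 0.0
    prompt, *previous = state
    return reward_model(prompt, (*previous, action))

def rollout_return(prompt, response, reward_model):
    """Sum of per-token rewards along one trajectory."""
    state, total = (prompt,), 0.0
    for action in response:
        total += token_reward(state, action, len(response), reward_model)
        state = transition(state, action)
    return total

# Hypothetical reward model: fraction of response tokens that occur in the prompt.
def toy_reward_model(prompt, response):
    return sum(tok in prompt for tok in response) / len(response)

prompt = "translate cat"
response = ("cat", "is", "chat")
```

Here `rollout_return(prompt, response, toy_reward_model)` equals `toy_reward_model(prompt, response)`, since every step before the last contributes 0.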

Natural solution to problem 2 : PPO

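However, RLHF for LLMs has three properties that PPO does not exploit (the difficulties PPO is designed for do not arise in RLHF):

fast simulation
In classical applications, obtaining the long-term return is expensive. RLHF does not have this problem: as Equation 2 shows, the long-term return in an RLHF task is just the trajectory reward of one LLM response. Obtaining it takes only a single query to the LLM and the reward model, which usually takes no more than three seconds on modern GPUs.
deterministic transitions
Transitions in RLHF are deterministic: as in Equation 4, given $(s_{t}, a_{t})$ the next state $s_{t+1}$ is known. The reward function $r(s_{t}, a_{t})$ is deterministic too, because it comes from a neural network.
trajectory-level rewards
As Equation 3 shows, $r(s_{t}, a_{t})$ is nonzero only when the whole trajectory ends at $t = T$. This makes the RLHF task close to a "single-stage" optimization problem, because the rewards of the intermediate stages are 0. The value function used in RL may therefore be less useful here.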