# Research questions about RLHF
## Brief primer on RLHF
Given an input $x$, the initial policy generates a pair of outputs $y_0, y_1$; human raters indicate which one they prefer ($y_i$ over $y_{1-i}$), and the reward model is trained with the loss:
$$loss(r_\theta)=-\mathbb{E}_{x,y_0,y_1,i}\left[\log(\sigma(r_\theta(x,y_i)-r_\theta(x,y_{1-i})))\right]$$
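As a concrete sketch (not tied to any particular codebase; `reward_model` is a hypothetical callable returning one scalar reward per example), this pairwise loss is just a logistic loss on the reward difference:
```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, x, y_preferred, y_rejected):
    """-E[log sigma(r(x, y_i) - r(x, y_{1-i}))] over a batch of comparisons.

    `reward_model(x, y)` is assumed to return a tensor of shape (batch,)
    with one scalar reward per (prompt, completion) pair.
    """
    r_pref = reward_model(x, y_preferred)
    r_rej = reward_model(x, y_rejected)
    # -log(sigmoid(d)) computed stably via logsigmoid
    return -F.logsigmoid(r_pref - r_rej).mean()
```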
The reward for the learned policy is then:
$$R(x,y, \phi)=r_\theta(x,y)-\beta\log\left[\frac{\pi_\phi^{RL}(y|x)}{\pi^{SFT}(y|x)}\right]$$
which is equivalent to the learning objective:
\begin{align}
&\arg\min_\phi -\mathbb{E}_{\substack{x\sim \mathcal{D}}}\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot|x)} \left[R(x,y,\phi)\right] \\
&=\arg\min_\phi -\mathbb{E}_{\substack{x\sim \mathcal{D}}}\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)} \left[r_\theta(x,y)-\beta\log\left[\frac{\pi_\phi^{RL}(y|x)}{\pi^{SFT}(y|x)}\right]\right] \\
&= \arg\min_\phi -\mathbb{E}_{\substack{x\sim \mathcal{D}}} \left[\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)}[r_\theta(x,y)] -\beta\,\mathrm{KL}\left(\pi_\phi^{RL}(\cdot|x) \middle\| \pi^{SFT}(\cdot|x)\right)\right]
\end{align}
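In code, the per-sample penalized reward might look like the sketch below, assuming the caller has already computed the sequence log-probabilities of each sampled $y$ under both policies:
```python
def penalized_reward(r, logp_rl, logp_sft, beta=0.1):
    """R(x, y, phi) = r_theta(x, y) - beta * (log pi_RL(y|x) - log pi_SFT(y|x)).

    All arguments are tensors of shape (batch,), computed on samples
    y ~ pi_RL(.|x); averaging the log-ratio over those samples gives a
    Monte Carlo estimate of the reverse-KL term in the objective above.
    """
    return r - beta * (logp_rl - logp_sft)
```
This penalized reward is then maximized with a standard policy-gradient method (PPO in most RLHF implementations).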
But the above objective encourages mode-seeking behavior, for a couple of reasons:
1. $r_\theta(x,y)$ can be maximized by tuning $\pi_\phi^{RL}$ to only generate $y$'s that maximize $r_\theta(x,y)$.
1. Since $\mathrm{KL}\left(\pi_\phi^{RL}(\cdot|x) \middle\| \pi^{SFT}(\cdot|x)\right)$ is an expectation over $y\sim \pi_\phi^{RL}(\cdot|x)$, this regularization term is evaluated mostly on the outputs that $\pi_\phi^{RL}$ already favors (the high-reward ones), and has much less effect on the rest of $\pi^{SFT}(\cdot|x)$'s support. Even though this dampens the mode-seeking caused by (1), it does not really constrain the distribution outside of those modes.
What if we flip the operands of the KL divergence penalty?
\begin{align}
&\arg\min_\phi -\mathbb{E}_{\substack{x\sim \mathcal{D}}} \left[\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)}[r_\theta(x,y)] -\beta\,\mathrm{KL}\left(\pi^{SFT}(\cdot|x) \middle\| \pi_\phi^{RL}(\cdot|x) \right)\right] \\
&= \arg\min_\phi -\mathbb{E}_{\substack{x\sim \mathcal{D}}} \left[\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)}[r_\theta(x,y)] -\beta\,\mathbb{E}_{y'\sim \pi^{SFT}(\cdot |x)} \log\left(\frac{\pi^{SFT}(y'|x)}{\pi_\phi^{RL}(y'|x)}\right)\right]
\end{align}
Now the KL divergence penalty applies even to lower-reward samples and is spread more evenly across the entire sample space (assuming that $\pi^{SFT}$ is less "peaky", i.e. closer to uniform, than $\pi_\phi^{RL}$ is).
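A toy numerical illustration of the asymmetry (the distributions below are made up for illustration): a peaky $\pi_\phi^{RL}$ that has dropped most of $\pi^{SFT}$'s support pays relatively little under the reverse KL but much more under the forward KL.
```python
import torch

# Hypothetical distributions over four candidate outputs for a single prompt x.
pi_sft = torch.tensor([0.40, 0.30, 0.20, 0.10])  # broad SFT distribution
pi_rl = torch.tensor([0.97, 0.01, 0.01, 0.01])   # peaky RLHF-tuned policy

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return (p * (p / q).log()).sum()

print(kl(pi_rl, pi_sft))  # reverse KL(pi_RL || pi_SFT) ~= 0.77
print(kl(pi_sft, pi_rl))  # forward KL(pi_SFT || pi_RL) ~= 1.50
```
As $\pi_\phi^{RL}$ pushes a tail output's probability toward zero, its contribution to the reverse KL vanishes (the $p\log p$ term goes to 0) while its contribution to the forward KL blows up, which is exactly the "covering" pressure we want from the flipped penalty.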
However, a couple of issues still exist with this new formulation:
1. Now the learning objective contains both $y$ and $y'$, which are sampled from two different models. In practice, we could optimize each term separately, alternating between batches (a rough sketch of this alternation appears after this list).
- Would this cause training instabilities?? We're essentially interpolating between two different distributions, but at each update we take a step toward only one of them, alternating which one at every step.
1. Reward hacking - since we are sampling from $\pi^{SFT}(\cdot |x)$ to approximate the KL divergence penalty, it is possible $\pi_\phi^{RL}$ will learn to generate $y$'s with high reward but low likelihood under the original distribution (e.g. $\pi^{SFT}(y|x)$ is low). In practice, RLHF usually involves continual re-training of the reward model via human annotations that would likely then assign a lower reward to such degenerate outputs, but this is a manual correction (whereas the original RLHF formulation would be more likely to penalize these degenerate outputs).
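Here is a rough sketch of the alternating-batch idea from point 1; every helper name (`sample`, `sequence_log_prob`, `reward_model`, `optimizer`) is a hypothetical placeholder, and the reward term uses a plain REINFORCE surrogate rather than PPO to keep the sketch short:
```python
import torch

def alternating_step(step, prompts, pi_rl, pi_sft, reward_model, optimizer, beta=0.1):
    """Alternate between the reward term and the forward-KL term across batches.

    Even steps: maximize E_{y ~ pi_RL}[r(x, y)] via a REINFORCE-style surrogate.
    Odd steps:  minimize E_{y' ~ pi_SFT}[log pi_SFT(y'|x) - log pi_RL(y'|x)], which
                (up to a constant in phi) is just maximizing the likelihood of
                SFT samples under pi_RL.
    """
    optimizer.zero_grad()
    if step % 2 == 0:
        y = pi_rl.sample(prompts)                   # y ~ pi_RL(.|x)
        logp = pi_rl.sequence_log_prob(prompts, y)  # log pi_RL(y|x)
        with torch.no_grad():
            r = reward_model(prompts, y)            # r_theta(x, y), treated as constant
        loss = -(r * logp).mean()                   # REINFORCE surrogate for E[r]
    else:
        y_sft = pi_sft.sample(prompts)              # y' ~ pi_SFT(.|x)
        logp_rl = pi_rl.sequence_log_prob(prompts, y_sft)
        loss = -beta * logp_rl.mean()               # forward-KL term, up to a constant
    loss.backward()
    optimizer.step()
```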
What if we instead used both KL divergence terms?
\begin{align}
&\arg\min_\phi -\mathbb{E}_{\substack{x\sim \mathcal{D}}} \left[\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)}[r_\theta(x,y)] -\beta\,\mathrm{KL}\left(\pi^{SFT}(\cdot|x) \middle\| \pi_\phi^{RL}(\cdot|x) \right) - \gamma\,\mathrm{KL}\left(\pi_\phi^{RL}(\cdot|x)\middle\|\pi^{SFT}(\cdot|x)\right)\right]
\end{align}
where we introduce a separate hyperparameter $\gamma$ (rather than using a symmetric divergence like JSD) so that we can tune how strongly each direction of the KL penalty is weighted.
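And a sketch of the combined objective, with the reverse-KL penalty folded into the per-sample reward (as in the standard formulation above) and the forward-KL penalty estimated on samples from the frozen $\pi^{SFT}$; the helper names are again hypothetical:
```python
import torch

def combined_loss(prompts, pi_rl, pi_sft, reward_model, beta=0.1, gamma=0.1):
    """Loss for E_{y~pi_RL}[r - gamma*(log pi_RL - log pi_SFT)] - beta*KL(pi_SFT || pi_RL)."""
    # Samples from the current policy: reward with the reverse-KL penalty folded in.
    y = pi_rl.sample(prompts)
    logp_rl_y = pi_rl.sequence_log_prob(prompts, y)
    with torch.no_grad():
        logp_sft_y = pi_sft.sequence_log_prob(prompts, y)
        adjusted_r = reward_model(prompts, y) - gamma * (logp_rl_y - logp_sft_y)
    policy_loss = -(adjusted_r * logp_rl_y).mean()      # REINFORCE surrogate

    # Samples from the frozen SFT policy: forward-KL penalty, up to a constant in phi.
    y_sft = pi_sft.sample(prompts)
    forward_kl_loss = -beta * pi_rl.sequence_log_prob(prompts, y_sft).mean()

    return policy_loss + forward_kl_loss
```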
## Current issues with RLHF
1. Can amplify undesirable distributional biases
- e.g. in Glaese et al. (https://arxiv.org/abs/2209.14375), the RLHF-tuned model is both less accurate and more biased than its pretraining-only counterpart on ambiguous questions in the [BBQ benchmark](https://arxiv.org/abs/2110.08193). The authors hypothesize that this could be because the multi-objective RL tuning made the model *less* likely to answer "I don't know" to ambiguous questions
- Perez et al. (https://arxiv.org/abs/2212.09251) find that RLHF can cause LMs to express stronger political views
1. Learning from human preferences seems to be more effective when done during pre-training than during fine-tuning (https://arxiv.org/abs/2302.08582)
- Speculative intuition: it's easier to learn a "positive" behavior from the start than to unlearn a "negative" behavior later in training. It should be possible to test (or even prove) this, though?
1. Human preferences are often not transitive or fixed. They are contextual, personal, and dynamic
- [Humans are not Boltzmann Distributions: Challenges and Opportunities for Modelling Human Feedback and Interaction in Reinforcement Learning](https://arxiv.org/abs/2206.13316)
- Even if preferences between trajectories are noiseless and transitive, there may not exist a unique optimal policy: https://proceedings.neurips.cc/paper_files/paper/2020/file/d9d3837ee7981e8c064774da6cdd98bf-Paper.pdf
1. Not much support in RLHF for representing the uncertainty of a preference
- Not sure if this is in a paper, but apparently Anthropic's Claude uses a preference model that captures how "strong" a preference is on a scale. Is this different from uncertainty though??
1. RL leads to distribution collapse
- https://arxiv.org/abs/2205.11275
- Mode collapse: https://arxiv.org/pdf/2012.01365.pdf#page=7 (section 5), https://arxiv.org/pdf/2303.17548.pdf
- Vague relationship to this paper I think? https://arxiv.org/abs/2012.11635
- If the RL objective for tuning an LM is to maximize the reward $J_{RL}(\theta)=\mathbb{E}_{x\sim\pi_\theta}r(x)$, then the optimal policy is $\pi_{\theta^*}=\delta_{x^*}$ where $x^*=\arg\max_x r(x)$. This leads to **mode collapse**!
- Using a KL penalty doesn't necessarily fix this problem:
\begin{align}
J_{KL-RL}(\theta) &= \mathbb{E}_{x\sim\pi_\theta}[r(x)]-\beta\,\mathrm{KL}\left(\pi_\theta \middle\| \pi_0\right) \\
&= \mathbb{E}_{x\sim\pi_\theta}[r_\theta'(x)] \\
&\text{where} \\
r_\theta'(x) &= r(x)+\beta\left(\log\pi_0(x) - \log\pi_\theta(x)\right)
\end{align}
It just so happens that:
\begin{align}
\pi_{KL-RL}^*(x) = \frac{1}{Z}\pi_0(x)\exp(r(x)/\beta)
\end{align}
where $Z$ is a normalizing constant. This still concentrates the distribution around high-reward examples that are also assigned high probability by $\pi_0$ (a short derivation is sketched below).
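A short sketch of why this is the optimum: the KL-regularised objective can be rewritten as a single KL divergence against that reweighted target, so it is maximized exactly when $\pi_\theta$ matches the target.
\begin{align}
J_{KL-RL}(\theta) &= \mathbb{E}_{x\sim\pi_\theta}[r(x)] - \beta\,\mathrm{KL}\left(\pi_\theta \middle\| \pi_0\right) = -\beta\,\mathbb{E}_{x\sim\pi_\theta}\left[\log\frac{\pi_\theta(x)}{\pi_0(x)\exp(r(x)/\beta)}\right] \\
&= -\beta\,\mathrm{KL}\left(\pi_\theta \middle\| \tfrac{1}{Z}\pi_0\exp(r/\beta)\right) + \beta\log Z, \quad \text{with } Z=\textstyle\sum_x \pi_0(x)\exp(r(x)/\beta),
\end{align}
and the remaining KL term is zero precisely when $\pi_\theta(x) = \frac{1}{Z}\pi_0(x)\exp(r(x)/\beta)$.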
1. Leads to poorer calibration?
- GPT-4 Technical Report (https://arxiv.org/pdf/2303.08774.pdf, page 12/Figure 8)
- Kadavath et al. seem to show that temperature adjustment can alleviate this issue?? (https://arxiv.org/pdf/2207.05221.pdf, page 11/Figure 9)
## Research questions I'm interested in
1. How does RLHF lead to miscalibration, what downstream impacts does it have, and how can it be alleviated?
- KL-regularised RL still leads to mode collapse, which may result in sharper ("peakier") probability distributions (and in turn, poor calibration)
- Kadavath et al. resolved this for multiple-choice questions by increasing the temperature (though this would result in different sampled sequences)
- But this fix doesn't apply to free-form language generation tasks -- what are the implications? Not having the right temperature for a given task can make a big difference.
- What are the downstream effects on other language generation tasks that were not accounted for in the RLHF part of training?
- What if there are multiple rounds of RLHF, or multiple objectives?
- What if we instead measured calibration in terms of the gap between the entropy of the test data versus the entropy of the model-generated text? (a la [Braverman et al.](http://proceedings.mlr.press/v119/braverman20a/braverman20a.pdf))
- For $\Pr$ the true distribution over $T$-length sequences of words and $\hat{\Pr}$ the learned distribution, we can write $\frac{1}{T}\mathrm{KL}(\Pr\|\hat{\Pr})=\frac{1}{T}\mathrm{CE}(\Pr\|\hat{\Pr})-\frac{1}{T}H(\Pr)$, where $H(\Pr)$ is the entropy of $\Pr$, defined by $H(\Pr)=\mathbb{E}_{w_{1:T}\sim\Pr}\left[\log\frac{1}{\Pr(w_{1:T})}\right]$. A rough sketch of this kind of measurement follows below.
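In the sketch, `sequence_log_prob` and `generate` are hypothetical stand-ins for the model's scoring and sampling interfaces, and the true entropy $H(\Pr)$ is approximated by the model's cross-entropy on held-out text (an upper bound):
```python
import torch

def entropy_gap(model, test_sequences, prompts):
    """Per-token cross-entropy on held-out text minus the per-token entropy
    of the model's own generations (estimated as -mean log-prob of its samples).
    A sharply mode-collapsed model tends to show a large positive gap."""
    with torch.no_grad():
        # Cross-entropy of the model on real test text, per token.
        ce = torch.stack([
            -model.sequence_log_prob(seq) / len(seq) for seq in test_sequences
        ]).mean()
        # Monte Carlo estimate of the model's own entropy rate.
        gens = [model.generate(p) for p in prompts]
        ent = torch.stack([
            -model.sequence_log_prob(g) / len(g) for g in gens
        ]).mean()
    return ce - ent
```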