# Research questions about RLHF

## Brief primer on RLHF

Given an input $x$, the initial policy generates two outputs $y_0, y_1$, human raters indicate that they prefer $y_i$ over $y_{1-i}$, and the reward model is trained with the loss:

$$\mathrm{loss}(r_\theta)=-\mathbb{E}_{x,y_0,y_1,i}\left[\log(\sigma(r_\theta(x,y_i)-r_\theta(x,y_{1-i})))\right]$$

The reward for the learned policy is then:

$$R(x,y,\phi)=r_\theta(x,y)-\beta\log\left[\frac{\pi_\phi^{RL}(y|x)}{\pi^{SFT}(y|x)}\right]$$

which is equivalent to the learning objective:

\begin{align}
&\arg\min_\phi -\mathbb{E}_{x\sim \mathcal{D}}\,\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot|x)} \left[R(x,y,\phi)\right] \\
&=\arg\min_\phi -\mathbb{E}_{x\sim \mathcal{D}}\,\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)} \left[r_\theta(x,y)-\beta\log\left[\frac{\pi_\phi^{RL}(y|x)}{\pi^{SFT}(y|x)}\right]\right] \\
&= \arg\min_\phi -\mathbb{E}_{x\sim \mathcal{D}} \left[\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)}[r_\theta(x,y)] -\beta\,\mathrm{KL}\left(\pi_\phi^{RL}(\cdot|x) \middle\| \pi^{SFT}(\cdot|x)\right)\right]
\end{align}

But the above objective encourages mode-seeking behavior, for a couple of reasons:

1. $r_\theta(x,y)$ can be maximized by tuning $\pi_\phi^{RL}$ to only generate $y$'s that maximize $r_\theta(x,y)$.
1. Since $\mathrm{KL}\left(\pi_\phi^{RL}(\cdot|x) \middle\| \pi^{SFT}(\cdot|x)\right)$ is an expectation over $y\sim \pi_\phi^{RL}(\cdot|x)$, this regularization term focuses disproportionately on high-reward outputs $y$ and has much less effect on the rest of the distribution $\pi^{SFT}(\cdot|x)$. Even though this dampens the mode-seeking caused by (1), it does not really smooth the rest of the distribution outside of the modes.

What if we flip the operands of the KL divergence penalty?

\begin{align}
&\arg\min_\phi -\mathbb{E}_{x\sim \mathcal{D}} \left[\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)}[r_\theta(x,y)] -\beta\,\mathrm{KL}\left(\pi^{SFT}(\cdot|x) \middle\| \pi_\phi^{RL}(\cdot|x) \right)\right] \\
&= \arg\min_\phi -\mathbb{E}_{x\sim \mathcal{D}} \left[\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)}[r_\theta(x,y)] -\beta\,\mathbb{E}_{y'\sim \pi^{SFT}(\cdot |x)} \log\left(\frac{\pi^{SFT}(y'|x)}{\pi_\phi^{RL}(y'|x)}\right)\right]
\end{align}

Now the KL divergence penalty is applied even to lower-reward samples and, overall, more evenly across the entire sample space (assuming that $\pi^{SFT}$ is less "peaky" and closer to the uniform distribution than $\pi_\phi^{RL}$ is).
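To make the flipped penalty concrete, here is a minimal PyTorch-style sketch of a Monte-Carlo estimate of $\mathrm{KL}\left(\pi^{SFT}(\cdot|x)\middle\|\pi_\phi^{RL}(\cdot|x)\right)$ for a single prompt. It assumes Hugging Face-style causal LMs for $\pi^{SFT}$ and $\pi_\phi^{RL}$; the function names and hyperparameters are hypothetical.

```python
import torch

def sequence_logprob(model, full_ids, gen_ids, prompt_len):
    """Sum of token log-probs that `model` assigns to the generated continuation."""
    logits = model(full_ids).logits[:, prompt_len - 1:-1, :]  # logits that predict the continuation tokens
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, gen_ids.unsqueeze(-1)).squeeze(-1).sum(-1)

def forward_kl_penalty(sft_model, rl_model, x, n_samples=4, max_new_tokens=64):
    """Monte-Carlo estimate of KL(pi_SFT || pi_RL) at prompt `x` (token ids, shape (1, prompt_len)):
    sample y' ~ pi_SFT(.|x) and average log pi_SFT(y'|x) - log pi_RL(y'|x)."""
    kl_terms = []
    for _ in range(n_samples):
        with torch.no_grad():
            y = sft_model.generate(x, do_sample=True, max_new_tokens=max_new_tokens)
            gen = y[:, x.shape[1]:]                                    # continuation tokens only
            logp_sft = sequence_logprob(sft_model, y, gen, x.shape[1])
        logp_rl = sequence_logprob(rl_model, y, gen, x.shape[1])       # gradient flows through pi_RL only
        kl_terms.append(logp_sft - logp_rl)
    return torch.stack(kl_terms).mean()
```

One convenient property: since $y'$ is drawn from $\pi^{SFT}$, which does not depend on $\phi$, the gradient of this estimate only needs to flow through $\log\pi_\phi^{RL}(y'|x)$, so no policy-gradient correction is required for this term.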
However, a couple of issues still exist with this new formulation:

1. Now the learning objective contains both $y$ and $y'$, which are sampled from two different models. In practice, we could optimize each term separately, alternating between batches.
   - Would this cause training instabilities? We're essentially interpolating between two different distributions, but picking only one distribution to take a step towards at each update and alternating distributions at each step.
1. Reward hacking: since we are sampling from $\pi^{SFT}(\cdot |x)$ to approximate the KL divergence penalty, it is possible that $\pi_\phi^{RL}$ will learn to generate $y$'s with high reward but low likelihood under the original distribution (i.e. $\pi^{SFT}(y|x)$ is low). In practice, RLHF usually involves continual re-training of the reward model on fresh human annotations; the updated reward model would likely then assign a lower reward to such degenerate outputs, but this is a manual correction (whereas the original RLHF formulation would be more likely to penalize these degenerate outputs directly).

What if we instead used both KL divergence terms?

\begin{align}
&\arg\min_\phi -\mathbb{E}_{x\sim \mathcal{D}} \left[\mathbb{E}_{y\sim \pi_\phi^{RL}(\cdot | x)}[r_\theta(x,y)] -\beta\,\mathrm{KL}\left(\pi^{SFT}(\cdot|x) \middle\| \pi_\phi^{RL}(\cdot|x) \right) - \gamma\,\mathrm{KL}\left(\pi_\phi^{RL}(\cdot|x)\middle\|\pi^{SFT}(\cdot|x)\right)\right]
\end{align}

where we introduce a separate hyperparameter $\gamma$ (instead of using the JSD) so that we can tune how strongly we weight one direction of the divergence relative to the other.
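For completeness, here is a sketch of how the two penalty terms might be combined into a single per-prompt objective, reusing `sequence_logprob` and `forward_kl_penalty` from the sketch above (again, hypothetical names; `reward_model(x, y)` is assumed to return the scalar $r_\theta(x,y)$). This only shows how the scalar would be assembled: because $y$ is sampled from $\pi_\phi^{RL}$, the reward and reverse-KL terms would in practice be optimized with a policy-gradient method such as PPO rather than back-propagated directly.

```python
import torch

def combined_objective(reward_model, sft_model, rl_model, x,
                       beta=0.1, gamma=0.1, max_new_tokens=64):
    """Single-prompt estimate of E[r] - beta*KL(RL||SFT) - gamma*KL(SFT||RL) (to be maximized)."""
    with torch.no_grad():
        y = rl_model.generate(x, do_sample=True, max_new_tokens=max_new_tokens)  # y ~ pi_RL(.|x)
        gen = y[:, x.shape[1]:]
        reward = reward_model(x, y)                                    # hypothetical: scalar r_theta(x, y)
        logp_sft = sequence_logprob(sft_model, y, gen, x.shape[1])
    logp_rl = sequence_logprob(rl_model, y, gen, x.shape[1])

    reverse_kl = logp_rl - logp_sft                                    # single-sample estimate of KL(pi_RL || pi_SFT)
    forward_kl = forward_kl_penalty(sft_model, rl_model, x)            # Monte-Carlo estimate of KL(pi_SFT || pi_RL)
    return reward - beta * reverse_kl - gamma * forward_kl
```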
## Current issues with RLHF

1. Can amplify undesirable distributional biases
   - e.g. in Glaese et al. (https://arxiv.org/abs/2209.14375), the RLHF-tuned model is both more incorrect and more biased on ambiguous questions in the [BBQ benchmark](https://arxiv.org/abs/2110.08193) than its pretraining-only counterpart. The authors hypothesize that this could be because the multi-objective RL tuning made the model *less* likely to answer "I don't know" to ambiguous questions.
   - Perez et al. (https://arxiv.org/abs/2212.09251): RLHF can cause LMs to output more extreme political views.
1. Seems to be more effective during pre-training than fine-tuning (https://arxiv.org/abs/2302.08582)
   - Speculative intuition: it is easier to learn an initial "positive" behavior than to unlearn a "negative" behavior later in training. Should be possible to prove, though?
1. Human preferences are often not transitive or fixed. They are contextual, personal, and dynamic.
   - [Humans are not Boltzmann Distributions: Challenges and Opportunities for Modelling Human Feedback and Interaction in Reinforcement Learning](https://arxiv.org/abs/2206.13316)
   - Even if preferences between trajectories are noiseless and transitive, there may not exist a unique optimal policy: https://proceedings.neurips.cc/paper_files/paper/2020/file/d9d3837ee7981e8c064774da6cdd98bf-Paper.pdf
1. Not much support in RLHF for representing the uncertainty of a preference
   - Not sure if this is in a paper, but apparently the reward model used for Anthropic's Claude provides a scale of how "strong" a preference is. Is this different from uncertainty, though?
1. RL leads to distribution collapse
   - https://arxiv.org/abs/2205.11275
   - Mode collapse: https://arxiv.org/pdf/2012.01365.pdf#page=7 (section 5), https://arxiv.org/pdf/2303.17548.pdf
   - Vague relationship to this paper, I think? https://arxiv.org/abs/2012.11635
   - If the RL objective for tuning an LM is to maximize the reward $J_{RL}(\theta)=\mathbb{E}_{x\sim\pi_\theta}[r(x)]$, then $\pi_{\theta^*}=\delta_{x^*}$ where $x^*=\arg\max_x r(x)$. This leads to **mode collapse**!
   - Using a KL penalty doesn't necessarily fix this problem:

     \begin{align}
     J_{KL\text{-}RL}(\theta) &= \mathbb{E}_{x\sim\pi_\theta}[r(x)]-\beta D_{KL}(\pi_\theta\,\|\,\pi_0) \\
     &= \mathbb{E}_{x\sim\pi_\theta}[r'_\theta(x)], \quad \text{where } r'_\theta(x) = r(x)+\beta(\log\pi_0(x) - \log\pi_\theta(x))
     \end{align}

     It just so happens that the optimal policy is

     \begin{align}
     \pi_{KL\text{-}RL}^*(x) = \frac{1}{Z}\pi_0(x)\exp(r(x)/\beta)
     \end{align}

     where $Z$ is a normalizing constant, which still concentrates the distribution around high-reward samples that are assigned high probability by $\pi_0$.
1. Leads to poorer calibration?
   - GPT-4 Technical Report (https://arxiv.org/pdf/2303.08774.pdf, page 12/Figure 8)
   - Kadavath et al. seem to show that temperature adjustment can alleviate this issue? (https://arxiv.org/pdf/2207.05221.pdf, page 11/Figure 9)

## Research questions I'm interested in

1. How does RLHF lead to miscalibration, what downstream impacts does it have, and how can it be alleviated?
   - KL-regularized RL still leads to mode collapse, which may result in "sharper" probability distributions (and in turn, poor calibration).
   - Kadavath et al. resolved this for multiple-choice questions by increasing the temperature (though this would result in different sampled sequences).
   - But this doesn't apply to language generation tasks -- what are the implications? Not having the right temperature value for a given task can make a big difference.
   - What are the downstream effects on other language generation tasks that were not accounted for in the RLHF part of training?
   - What if there are multiple rounds of RLHF, or multiple objectives?
   - What if we instead measured calibration in terms of the gap between the entropy of the test data versus the entropy of the model-generated text? (a la [Braverman et al.](http://proceedings.mlr.press/v119/braverman20a/braverman20a.pdf); a rough sketch of this measurement follows below)
     - For $Pr$ the true distribution over $T$-length sequences of words and $\hat{Pr}$ the learned distribution, we can write $\frac{1}{T}\mathrm{KL}(Pr\,\|\,\hat{Pr})=\frac{1}{T}\mathrm{CE}(Pr,\hat{Pr})-\frac{1}{T}H(Pr)$, where $H(Pr)=\mathbb{E}_{w_{1:T}\sim Pr}\left[\log\frac{1}{Pr(w_{1:T})}\right]$ is the entropy of $Pr$ and $\mathrm{CE}(Pr,\hat{Pr})=\mathbb{E}_{w_{1:T}\sim Pr}\left[\log\frac{1}{\hat{Pr}(w_{1:T})}\right]$ is the cross-entropy.
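A rough sketch of how that entropy gap could be estimated in practice, again assuming a Hugging Face-style causal LM and pre-tokenized inputs (function names and the one-sample Monte-Carlo approximation are my own simplifications): the model's per-token cross-entropy on held-out human text serves as a proxy (an upper bound) for the per-token entropy of the data, while its per-token log-loss on its own samples estimates the per-token entropy of the model-generated text.

```python
import torch

def per_token_nll(model, token_ids, start=1):
    """Average negative log-likelihood per token that `model` assigns to `token_ids`
    (shape (1, T)), scored from position `start` onward."""
    logits = model(token_ids).logits[:, :-1, :]
    logprobs = torch.log_softmax(logits, dim=-1)
    nll = -logprobs.gather(-1, token_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return nll[:, start - 1:].mean()

def entropy_gap(model, test_ids, prompt_ids, max_new_tokens=128):
    """Per-token cross-entropy on human test text minus the per-token log-loss on
    the model's own generation (one-sample estimate); a large gap suggests the
    learned distribution is much sharper than the data distribution."""
    with torch.no_grad():
        sample = model.generate(prompt_ids, do_sample=True, max_new_tokens=max_new_tokens)
        ce_test = per_token_nll(model, test_ids)                           # ~ CE(Pr, Pr_hat) / T
        h_model = per_token_nll(model, sample, start=prompt_ids.shape[1])  # ~ H(Pr_hat) / T
    return ce_test - h_model
```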