# Is RLHF More Difficult than Standard RL
## Abstract
Imagine teaching a child to play chess. One approach is to give clear rewards for winning and penalties for losing. Another is to guide them through preferences, indicating which moves are better. Intuitively, the reward-based approach seems easier because it provides explicit feedback. A simple observation about preference feedback is that preferences can always be reconstructed from reward signals. In other words, preference feedback arguably contains less information than scalar rewards, which may render the RLHF problem more challenging. A natural question to ask is: Is preference-based RL more difficult than reward-based RL? That is essentially the question this paper investigates. By reducing RLHF to reward-based RL, the authors demonstrate that existing RL algorithms can efficiently handle human preferences, challenging the assumption that RLHF is inherently more complex.
## Introduction
Reinforcement Learning (RL) is a powerful paradigm where agents learn to make decisions by interacting with an environment and receiving feedback in the form of rewards. Traditional RL relies on scalar rewards to guide the learning process. However, in many real-world applications, defining a precise reward function is challenging. This is where Reinforcement Learning from Human Feedback (RLHF) comes into play.
RLHF leverages human preferences to train agents, making it more intuitive for human users and better aligned with human values. For example, in applications like recommendation systems or image generation, human preferences can be more easily solicited than precise reward functions. Despite its appeal, RLHF poses the question: Is it more difficult than standard RL?
Existing research on preference-based RL has developed specialized algorithms to efficiently learn near-optimal policies from preferences. However, these efforts typically build new theoretical foundations for preference-based RL rather than utilizing existing techniques in standard RL. This paper addresses this gap by demonstrating that preference-based RL can be effectively reduced to reward-based RL, allowing for the application of existing RL algorithms with minimal additional complexity.
The authors propose two main approaches. For preferences derived from reward-based probabilistic models, they reduce preference-based RL to robust reward-based RL, which tolerates small errors in rewards. For arbitrary preferences, they reduce the problem to multiagent reward-based RL, finding Nash equilibria of factored Markov games. When preferences depend only on the final state, this can be further reduced to adversarial MDPs. For preferences that depend on the entire trajectory, they develop an adapted version of the optimistic Maximum Likelihood Estimation (OMLE) algorithm that handles factored Markov games under general function approximation.
Through these reductions, the authors instantiate all reward-based RL subroutines with provable algorithms and extend their theory to various models, including tabular MDPs and those with generic function approximation. They also provide guarantees when K-wise comparisons are available, showcasing the versatility and efficiency of their methods.
By leveraging existing RL techniques, this paper challenges the assumption that RLHF is inherently more complex than standard RL, providing a new perspective on integrating human preferences into RL frameworks.
## Preliminaries
We consider reinforcement learning in episodic MDPs, specified by a tuple $(H, \mathcal{S}, \mathcal{A}, \mathbb{P})$. Here $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and $H$ is the length of each episode. $\mathbb{P}$ is the transition probability function; for each $h \in[H]$ and $s, a \in \mathcal{S} \times \mathcal{A}, \mathbb{P}_h(\cdot \mid s, a)$ specifies the distribution of the next state. A trajectory $\tau \in(\mathcal{S} \times \mathcal{A})^H$ is a sequence of interactions $\left(s_1, a_1, \cdots, s_H, a_H\right)$ with the MDP. A Markov policy $\pi=\left\{\pi_h: \mathcal{S} \rightarrow \mathcal{A}\right\}_{h \in[H]}$ specifies an action distribution based on the current state, while a general policy $\pi=\left\{\pi_h:(\mathcal{S} \times \mathcal{A})^{h-1} \times \mathcal{S} \rightarrow \mathcal{A}\right\}_{h \in[H]}$ can choose a random action based on the whole history up to timestep $h$.
In RLHF, we consider two types of preferences: utility-based ones and general ones.
### Utility-based preferences
For utility-based comparisons, we assume there exists an underlying reward function $r^{\star}: \mathcal{S} \times \mathcal{A} \rightarrow[0,1]$. The value of a policy is defined as $\mathbb{E}_\pi\left[\sum_{h=1}^H r^{\star}\left(s_h, a_h\right)\right]$, i.e. the expected cumulative reward obtained by executing the policy. An optimal policy is one that maximizes $\mathbb{E}_\pi\left[\sum_{h=1}^H r^{\star}\left(s_h, a_h\right)\right]$. We say that $\pi$ is $\epsilon$-optimal if $\mathbb{E}_\pi\left[\sum_{h=1}^H r^{\star}\left(s_h, a_h\right)\right] \geq \max _{\pi^{\prime}} \mathbb{E}_{\pi^{\prime}}\left[\sum_{h=1}^H r^{\star}\left(s_h, a_h\right)\right]-\epsilon$. We can extend this setting to the case where utility is available at the trajectory level, where the reward function is defined as $r^{\star}:(\mathcal{S} \times \mathcal{A})^H \rightarrow[0, H]$ and the value of a policy is $\mathbb{E}_{\tau \sim \pi}\left[r^{\star}(\tau)\right]$.
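As a concrete illustration of these definitions, here is a minimal Python sketch (not from the paper) that estimates the value of a policy by Monte Carlo rollouts; the environment interface (`reset`, `step`) and the `policy` callable are hypothetical placeholders.

```python
def estimate_value(env, policy, horizon, n_episodes=1000):
    """Monte Carlo estimate of E_pi[ sum_h r(s_h, a_h) ] in an episodic MDP."""
    total = 0.0
    for _ in range(n_episodes):
        s = env.reset()                 # initial state s_1
        episode_return = 0.0
        for h in range(1, horizon + 1):
            a = policy(h, s)            # Markov policy: action from current state
            s, r = env.step(a)          # hypothetical: returns next state and reward
            episode_return += r
        total += episode_return
    return total / n_episodes

# pi is epsilon-optimal if estimate_value(env, pi, H) is within epsilon of the
# best achievable value (up to Monte Carlo error).
```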
> **Definition 1**:
> (Comparison oracle). A comparison oracle takes in two trajectories $\tau_1, \tau_2$ and returns
$$
o \sim \operatorname{Ber}\left(\sigma\left(r^{\star}\left(\tau_1\right)-r^{\star}\left(\tau_2\right)\right)\right),
$$
where $\sigma(\cdot)$ is a link function, and $r^{\star}(\cdot)$ is the underlying reward function.
Additionally, the comparison outcome $o=1$ indicates that $\tau_1$ is preferred over $\tau_2$, and $o=0$ indicates the opposite. The inputs $\tau_1, \tau_2$ must be actual trajectories generated by the algorithm, not synthesized ones, because human evaluators find it difficult to compare out-of-distribution samples such as random pixels.
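As a sanity check of Definition 1, the following minimal Python sketch simulates a comparison oracle under the Bradley-Terry (logistic) link; the trajectory-level reward function `r_star` is a hypothetical placeholder standing in for the unknown $r^{\star}$.

```python
import math
import random

def sigma(x):
    """Logistic link function (Bradley-Terry model)."""
    return 1.0 / (1.0 + math.exp(-x))

def comparison_oracle(tau1, tau2, r_star):
    """Return o ~ Ber(sigma(r*(tau1) - r*(tau2))).

    o == 1 indicates tau1 is preferred over tau2, o == 0 the opposite.
    """
    p = sigma(r_star(tau1) - r_star(tau2))
    return 1 if random.random() < p else 0
```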
### General preferences
For general preferences, we assume that for every trajectory pair $\tau, \tau^{\prime}$, the probability that a human evaluator prefers $\tau$ over $\tau^{\prime}$ is
$$
M\left[\tau, \tau^{\prime}\right]=\operatorname{Pr}\left[\tau \succ \tau^{\prime}\right] .
$$
Instead of defining the optimal policy in the usual sense, we consider the von Neumann winner.
> **Definition 2**:
> $\pi^{\star}$ is the von Neumann winner policy if $\left(\pi^{\star}, \pi^{\star}\right)$ is a symmetric Nash equilibrium of the constant-sum game: $$\max _\pi \min _{\pi^{\prime}} \mathbb{E}_{\tau \sim \pi, \tau^{\prime} \sim \pi^{\prime}} M\left[\tau, \tau^{\prime}\right]$$
The duality gap of the game is defined as
$$
\operatorname{DGap}\left(\pi_1, \pi_2\right):=\max _\pi \mathbb{E}_{\tau \sim \pi, \tau^{\prime} \sim \pi_2} M\left[\tau, \tau^{\prime}\right]-\min _\pi \mathbb{E}_{\tau \sim \pi_1, \tau^{\prime} \sim \pi} M\left[\tau, \tau^{\prime}\right] .
$$
We say that $\pi$ is an $\epsilon$-approximate von Neumann winner if the duality gap of $(\pi, \pi)$ is at most $\epsilon$; a small numerical sketch of these quantities follows the list below. It is known that:
* Intuitively, the von Neumann winner $\pi^{\star}$ is a randomized policy that "beats" any other policy $\pi^{\prime}$ in the sense that $\mathbb{E}_{\tau \sim \pi^{\star}, \tau^{\prime} \sim \pi^{\prime}} M\left[\tau, \tau^{\prime}\right] \geq 1 / 2$
* If the utility-based preference model holds and the transitions are deterministic, the von Neumann winner is the optimal policy
* The von Neumann winner is the only solution concept that satisfies population-consistency and composition-consistency in social choice theory (Brandl et al., 2016).
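For intuition, here is a minimal Python sketch (my own illustration, not from the paper) that computes an approximate von Neumann winner and the duality gap for a finite preference matrix via multiplicative-weights self-play; the example matrix is made up.

```python
import numpy as np

def approx_von_neumann_winner(M, n_iters=5000, eta=0.05):
    """Approximate von Neumann winner of a finite preference matrix M, where
    M[i, j] = Pr[option i is preferred over option j] and M[i, j] + M[j, i] = 1,
    computed via multiplicative-weights self-play."""
    n = M.shape[0]
    p = np.ones(n) / n          # max player's mixed strategy
    q = np.ones(n) / n          # min player's mixed strategy
    p_avg = np.zeros(n)
    for _ in range(n_iters):
        p_avg += p
        grad_p = M @ q          # payoff of each pure option against q
        grad_q = M.T @ p        # payoff conceded by each pure option against p
        p = p * np.exp(eta * grad_p)
        p /= p.sum()
        q = q * np.exp(-eta * grad_q)
        q /= q.sum()
    return p_avg / n_iters

def duality_gap(M, p1, p2):
    """DGap(p1, p2) = max_p p^T M p2 - min_q p1^T M q."""
    return (M @ p2).max() - (M.T @ p1).min()

# A made-up, non-transitive preference matrix over three trajectories.
M = np.array([[0.5, 0.6, 0.4],
              [0.4, 0.5, 0.7],
              [0.6, 0.3, 0.5]])
p = approx_von_neumann_winner(M)
print(p, duality_gap(M, p, p))   # the gap should be close to zero
```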
### Function approximation
> **Definition 3**:
> (Eluder dimension). For any function class $\mathcal{F} \subseteq(\mathcal{X} \rightarrow \mathbb{R})$, its Eluder dimension $\operatorname{dim}_{\mathrm{E}}(\mathcal{F}, \epsilon)$ is defined as the length of the longest sequence $\left\{x_1, x_2, \ldots, x_n\right\} \subseteq \mathcal{X}$ such that there exists $\epsilon^{\prime} \geq \epsilon$ so that for all $i \in[n]$, $x_i$ is $\epsilon^{\prime}$-independent of its prefix sequence $\left\{x_1, \ldots, x_{i-1}\right\}$, in the sense that there exist $f_i, g_i \in \mathcal{F}$ such that
$$
\sqrt{\sum_{j=1}^{i-1}\left(\left(f_i-g_i\right)\left(x_j\right)\right)^2} \leq \epsilon^{\prime} \text { and }\left|\left(f_i-g_i\right)\left(x_i\right)\right| \geq \epsilon^{\prime} .
$$
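To make Definition 3 concrete, here is a brute-force Python sketch (illustrative only; it checks independence at $\epsilon' = \epsilon$ rather than over all $\epsilon' \geq \epsilon$, so it returns a lower bound) that finds the longest $\epsilon$-independent sequence for a tiny finite function class.

```python
from itertools import permutations
import math

def eps_independent(x, prefix, F, eps):
    """x is eps-independent of prefix if there exist f, g in F with
    sqrt(sum_j ((f - g)(x_j))^2) <= eps and |(f - g)(x)| >= eps."""
    return any(
        math.sqrt(sum((f(z) - g(z)) ** 2 for z in prefix)) <= eps
        and abs(f(x) - g(x)) >= eps
        for f in F for g in F
    )

def eluder_dimension_lb(F, X, eps):
    """Length of the longest eps-independent sequence drawn from a tiny finite
    domain X. Independence is checked at eps' = eps only, so this is a lower
    bound on dim_E(F, eps); brute force, purely for illustration."""
    for n in range(len(X), 0, -1):
        for seq in permutations(X, n):
            if all(eps_independent(seq[i], seq[:i], F, eps) for i in range(n)):
                return n
    return 0

# Example: two constant functions differing by 1 on a three-point domain.
F = [lambda x: 0.0, lambda x: 1.0]
X = ["x1", "x2", "x3"]
print(eluder_dimension_lb(F, X, eps=0.5))   # prints 1
```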
>**Assumption 1**:
>(Realizability) $r^{\star} \in \mathcal{R}$.
## Utility-based RLHF
Instead of using scalar rewards directly, the paper proposes translating human preferences into reward signals, making it possible to run standard RL algorithms unchanged. The authors introduce a preference-to-reward interface that produces approximate reward labels and only queries the comparison oracle when the uncertainty about a trajectory's reward is significant. This approach preserves low sample complexity and keeps the number of human queries small, since the query count does not scale with the sample complexity of the underlying RL algorithm. Additionally, the interface can collect feedback in batches, streamlining the training process.
### A preference-to-reward interface
The interaction protocol of an RL algorithm with the Preference-to-Reward (P2R) Interface is shown below.

P2R maintains a confidence set of rewards $B_r$. When the RL algorithm wishes to learn the reward label of a trajectory $\tau$, P2R checks whether $B_r$ approximately agrees on the reward of $\tau$. If so, it can return a reward label with no queries to the comparison oracle; if not, it will query the comparison oracle on $\tau$ and a fixed trajectory $\tau_0$, and update the confidence set of reward functions. The details of P2R are presented in Algorithm 1.
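The following Python sketch captures my reading of the P2R logic for a finite candidate reward class; the link-function inverse `sigma_inv`, the comparison oracle `compare`, the baseline trajectory `tau0`, and the exact form of the confidence-set update are hedged assumptions rather than the paper's Algorithm 1 verbatim.

```python
class P2R:
    """Minimal sketch of a Preference-to-Reward interface for a finite candidate
    reward class. Assumed ingredients: reward_class, an iterable of functions
    tau -> [0, H]; sigma_inv, the inverse of the known link function; compare, a
    comparison oracle (tau1, tau2) -> {0, 1}; tau0, a fixed baseline trajectory;
    and hyperparameters eps0 (agreement tolerance), beta (confidence radius), and
    m (number of oracle queries per uncertain trajectory)."""

    def __init__(self, reward_class, sigma_inv, compare, tau0, eps0, beta, m):
        self.B_r = list(reward_class)     # confidence set of reward functions
        self.sigma_inv = sigma_inv
        self.compare = compare
        self.tau0 = tau0
        self.eps0, self.beta, self.m = eps0, beta, m
        self.data = []                    # (tau, estimate of r*(tau) - r*(tau0))

    def reward_label(self, tau):
        """Return an approximate (shifted) reward label for trajectory tau."""
        vals = [r(tau) - r(self.tau0) for r in self.B_r]
        if max(vals) - min(vals) <= 2 * self.eps0:
            return vals[0]                # B_r already agrees: no human query
        # Otherwise query the comparison oracle m times and invert the link.
        wins = sum(self.compare(tau, self.tau0) for _ in range(self.m))
        p_hat = min(max(wins / self.m, 1e-6), 1 - 1e-6)   # keep away from 0 and 1
        r_hat = self.sigma_inv(p_hat)
        self.data.append((tau, r_hat))
        # One plausible form of the update: keep rewards consistent with all
        # past estimates up to tolerance beta.
        self.B_r = [r for r in self.B_r
                    if all((r(t) - r(self.tau0) - y) ** 2 <= self.beta
                           for t, y in self.data)]
        return r_hat
```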

The performance of P2R depends on the RL algorithm running on top of it. In particular, we define the robustness and sample complexity of a standard reward-based RL algorithm $\mathscr{A}$ as follows.
> **Definition 4**:
>An RL algorithm $\mathscr{A}$ is $g(\epsilon)$-robust and has sample complexity $\mathcal{C}(\epsilon, \delta)$ if it can output an $\epsilon$-optimal policy using $\mathcal{C}(\epsilon, \delta)$ samples with probability at least $1-\delta$, even if the reward of each trajectory $\tau$ is perturbed by $\varepsilon(\tau)$ with $\|\varepsilon(\tau)\|_{\infty} \leq g(\epsilon)$.
### P-OMLE: Improved query complexity via white-box modification
While the P2R interface provides a clean and effortless recipe for modifying standard RL algorithms to work with preference feedback, it incurs a large query cost to the comparison oracle, e.g., cubic dependence on $d_{\overline{\mathcal{R}}}$ in Example 3. We believe that this disadvantage is caused by the black-box nature of P2R, and that better query complexities can be achieved by modifying standard RL algorithms in a white-box manner with specialized analysis. In this section, we take OMLE (Liu et al., 2022a) as an example and introduce a standalone adaptation to trajectory preference feedback with improved query complexity. We expect that other optimism-based algorithms (e.g., UCBVI-BF and GOLF) can be modified in a similar white-box manner to achieve better query complexity.
The details of the Preference-based OMLE algorithm (P-OMLE) are provided in Algorithm 2. Compared to OMLE with reward feedback (Algorithm 3), the only difference is that the reward confidence set is now computed directly from preference feedback via the log-likelihood. Denote by $V_{r, p}^\pi$ the expected cumulative reward the learner receives by following policy $\pi$ in model $(p, r)$. In the $t$-th iteration, the algorithm proceeds as follows:
* Optimistic planning: Find the policy-model tuple $(\pi, r, p)$ that maximizes the value function $V_{r, p}^\pi$;
* Data collection: Construct an exploration policy set $\Pi_{\text {exp }}\left(\pi^t\right)$ and collect trajectories by running all policies in $\Pi_{\exp }\left(\pi^t\right)$;
* Confidence set update: Update the confidence set using the updated log-likelihoods (a minimal sketch of this step follows the list):
$$
\mathcal{L}\left(r, \mathcal{D}_{\text {rwd }}\right):=\sum_{\left(\tau, \tau^{\prime}, y\right) \in \mathcal{D}_{\text {rwd }}} \ln \sigma\left(r(\tau)-r\left(\tau^{\prime}\right), y\right), \quad \mathcal{L}\left(p, \mathcal{D}_{\text {trans }}\right):=\sum_{(\pi, \tau) \in \mathcal{D}_{\text {trans }}} \ln \mathbb{P}_p^\pi(\tau)
$$
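A minimal Python sketch of the preference log-likelihood and the resulting confidence set (my interpretation of the update above, assuming a finite candidate class and interpreting $\sigma(\cdot, y)$ as the likelihood of label $y$):

```python
import math

def reward_log_likelihood(r, D_rwd, sigma):
    """L(r, D_rwd): log-likelihood of the observed comparison labels, where
    Pr[y = 1 | tau, tau'] = sigma(r(tau) - r(tau')) under the preference model."""
    total = 0.0
    for tau, tau_p, y in D_rwd:
        p = sigma(r(tau) - r(tau_p))
        total += math.log(p if y == 1 else 1.0 - p)
    return total

def reward_confidence_set(candidates, D_rwd, sigma, beta):
    """Keep every candidate reward whose log-likelihood is within beta of the
    maximum-likelihood candidate (an optimistic-MLE style confidence set)."""
    scores = [(r, reward_log_likelihood(r, D_rwd, sigma)) for r in candidates]
    best = max(s for _, s in scores)
    return [r for r, s in scores if s >= best - beta]
```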

## RLHF from General Preferences
The utility-based approach imposes strong assumptions on human preferences. Not only is the matrix $M\left[\tau, \tau^{\prime}\right]$ assumed to be exactly realizable by $\sigma\left(r(\tau)-r\left(\tau^{\prime}\right)\right)$, but $\sigma$ is also assumed to be known and to have a lower-bounded gradient. Moreover, the utility-based approach assumes transitivity: if $\operatorname{Pr}\left[\tau_1 \succ \tau_2\right] \geq 0.5$ and $\operatorname{Pr}\left[\tau_2 \succ \tau_3\right] \geq 0.5$, then $\operatorname{Pr}\left[\tau_1 \succ \tau_3\right] \geq 0.5$. These limitations of the utility-based approach motivate us to consider general preferences.
### A reduction to Markov games
**Factorized and independent Markov games (FI-MG)**. Consider a two-player zero-sum Markov game with state space $\mathcal{S}=\mathcal{S}^{(1)} \times \mathcal{S}^{(2)}$, action spaces $\mathcal{A}^{(1)}$ and $\mathcal{A}^{(2)}$ for the two players respectively, transition kernel $\left\{\mathbb{P}_h\right\}_{h \in[H]}$ and reward function $r$. We say the Markov game is factorized and independent if the transition kernel is factorized:
$$
\mathbb{P}_h\left(s_{h+1} \mid s_h, a_h\right)=\mathbb{P}_h\left(s_{h+1}^{(1)} \mid s_h^{(1)}, a_h^{(1)}\right) \times \mathbb{P}_h\left(s_{h+1}^{(2)} \mid s_h^{(2)}, a_h^{(2)}\right),
$$
where $s_h=\left(s_h^{(1)}, s_h^{(2)}\right), s_{h+1}=\left(s_{h+1}^{(1)}, s_{h+1}^{(2)}\right), a_h=\left(a_h^{(1)}, a_h^{(2)}\right) \in \mathcal{A}^{(1)} \times \mathcal{A}^{(2)}$.
The above definition implies that the Markov game can be partitioned into two MDPs, where the transition dynamics are controlled separately by each player, and are completely independent of each other. The only source of correlation between the two MDPs arises from the reward function, which is permitted to depend on the joint trajectory from both MDPs. Building on the above factorization structure, we define partial trajectory $\tau_{i, h}:=\left(s_1^{(i)}, a_1^{(i)}, \ldots, s_h^{(i)}\right)$ that consists of states of the $i$-th MDP factor and actions of the $i$-th player. Furthermore, we define a restricted policy class $\Pi_i$ that contains all policies that map a partial trajectory to a distribution in $\Delta_{\mathcal{A}_i}$, i.e.,
$$
\Pi_i:=\left\{\left\{\pi_h\right\}_{h \in[H]}: \pi_h \in\left(\left(\mathcal{S}^{(i)} \times \mathcal{A}_i\right)^{h-1} \times \mathcal{S}^{(i)} \rightarrow \Delta_{\mathcal{A}_i}\right)\right\} \text { for } i \in[2] .
$$
The goal is to learn a restricted Nash equilibrium $\left(\mu^{\star}, \nu^{\star}\right) \in \Pi_1 \times \Pi_2$ such that
$$
\mu^{\star} \in \arg \max _{\mu \in \Pi_1} \mathbb{E}_{\tau \sim \mu, \tau^{\prime} \sim \nu^{\star}}\left[r\left(\tau, \tau^{\prime}\right)\right] \quad \text { and } \quad \nu^{\star} \in \arg \min _{\nu \in \Pi_2} \mathbb{E}_{\tau \sim \mu^{\star}, \tau^{\prime} \sim \nu}\left[r\left(\tau, \tau^{\prime}\right)\right] .
$$
**Finding von Neumann winner via learning restricted Nash**. We claim that finding an approximate von Neumann winner can be reduced to learning an approximate restricted Nash equilibrium in a FI-MG. The reduction is straightforward: we simply create a Markov game that consists of two independent copies of the original MDP and control the dynamics in the $i$-th copy by the $i$-th player's actions. Such construction is clearly factorized and independent. Moreover, the restricted policy class $\Pi_i$ is equivalent to the universal policy class in the original MDP. We further define the reward function as $r\left(\tau, \tau^{\prime}\right)=M\left[\tau, \tau^{\prime}\right]$ where $M$ is the general preference function. By definition, we immediately obtain the following equivalence relation.
>**Proposition**: If $\left(\mu^{\star}, \nu^{\star}\right)$ is a restricted Nash equilibrium of the above FI-MG, then both $\mu^{\star}$ and $\nu^{\star}$ are von Neumann winners in the original problem.
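To illustrate the reduction, here is a minimal Python sketch (my own illustration under an assumed `reset`/`step` environment interface, not the paper's construction verbatim) that wraps two independent copies of an MDP simulator and a preference function into the factored two-player game.

```python
class TwoCopyPreferenceGame:
    """Minimal sketch of the two-copy construction: two independent copies of an
    episodic MDP simulator (hypothetical reset()/step(a) interface returning only
    the next state) plus a preference function M(tau, tau') in [0, 1] form a
    factored, independent two-player zero-sum game."""

    def __init__(self, make_env, preference, horizon):
        self.env1, self.env2 = make_env(), make_env()   # two independent copies
        self.M = preference
        self.H = horizon

    def play_episode(self, policy1, policy2):
        """Run both copies independently, then score them with the preference."""
        tau1, tau2 = [], []
        s1, s2 = self.env1.reset(), self.env2.reset()
        for h in range(1, self.H + 1):
            a1, a2 = policy1(h, s1), policy2(h, s2)
            tau1.append((s1, a1))
            tau2.append((s2, a2))
            s1 = self.env1.step(a1)   # copy-1 dynamics depend only on player 1
            s2 = self.env2.step(a2)   # copy-2 dynamics depend only on player 2
        # Player 1's payoff is M[tau1, tau2]; player 2 receives 1 - M[tau1, tau2].
        return tau1, tau2, self.M(tau1, tau2)
```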
The problem we now face is how to learn restricted Nash equilibria in FI-MGs. In the following sections, we present two approaches that leverage existing RL algorithms to solve this problem: (i) when the preference function depends solely on the final states of the two input trajectories, each player can independently execute an adversarial MDP algorithm; (ii) for general preference functions, a straightforward adaptation of the OMLE algorithm suffices under certain eluder-type conditions.
### Learning from final-state-based preferences via adversarial MDPs
In this section, we consider a special case where the preference depends solely on the final states of the two input trajectories, i.e., $M\left(\tau, \tau^{\prime}\right)=M\left(s_H, s_H^{\prime}\right)$. Given the previous equivalence relation between the von Neumann winner and restricted Nash in FI-MG, one natural idea is to apply no-regret learning algorithms, as it is well-known that running two copies of no-regret online learning algorithms against each other can be used to compute Nash equilibria in zero-sum normal-form games. Since this paper focuses on sequential decision making, we need no-regret learning algorithms for adversarial MDPs, which we define below.
**Adversarial MDPs**. In the adversarial MDP problem, the algorithm interacts with a series of MDPs that share the same unknown transition kernel but have adversarially chosen rewards in each episode. Formally, there exists an unknown ground-truth transition function $\mathbb{P}=\left\{\mathbb{P}_h\right\}_{h=1}^H$. At the beginning of the $k$-th episode, the algorithm chooses a policy $\pi^k$ and then the adversary picks a reward function $r^k=\left\{r_h^k\right\}_{h=1}^H$. After that, the algorithm observes a trajectory $\tau^k=\left(s_1^k, a_1^k, y_1^k, \ldots, s_H^k, a_H^k, y_H^k\right)$ sampled by executing policy $\pi^k$ in the MDP parameterized by $\mathbb{P}$ and $r^k$, where $\mathbb{E}\left[y_h^k \mid s_h^k, a_h^k\right]=r_h^k\left(s_h^k, a_h^k\right)$. We define the regret of an adversarial MDP algorithm $\mathscr{A}$ as the gap between the algorithm's cumulative expected payoff and the payoff achievable by the best fixed Markov policy:
$$
\operatorname{Regret}_K(\mathscr{A}):=\max _{\pi \in \Pi_{\text {Markov }}} \sum_{k=1}^K \mathbb{E}_\pi\left[\sum_{h=1}^H r_h^k\left(s_h, a_h\right)\right]-\sum_{k=1}^K \mathbb{E}_{\pi_k}\left[\sum_{h=1}^H r_h^k\left(s_h, a_h\right)\right] .
$$
Now we explain how to learn a von Neumann winner by running adversarial MDP algorithms. We simply create two copies of the original MDP and instantiate two adversarial MDP algorithms $\mathscr{A}_1$ and $\mathscr{A}_2$ to control each of them separately. To execute $\mathscr{A}_1$ and $\mathscr{A}_2$, we need to provide reward feedback to them in each episode $k$. Denote by $s_H^{k,(1)}$ and $s_H^{k,(2)}$ the final states that $\mathscr{A}_1$ and $\mathscr{A}_2$ observe in the $k$-th episode. We feed $y^k \sim \operatorname{Ber}\left(M\left(s_H^{k,(1)}, s_H^{k,(2)}\right)\right)$ into $\mathscr{A}_1$ and $1-y^k$ into $\mathscr{A}_2$ as their rewards at the final step, respectively, while all other steps receive zero reward feedback. The formal pseudocode is provided in Algorithm 4 (Appendix E); a minimal sketch of the self-play loop is given below. The following theorem states that as long as the invoked adversarial MDP algorithm has sublinear regret, this scheme learns an approximate von Neumann winner in a sample-efficient manner.
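Here is a minimal Python sketch of this self-play loop; the learner objects `A1`, `A2` with `choose_policy()`/`update()` methods and the `rollout` helper are hypothetical stand-ins for an actual adversarial MDP algorithm and simulator.

```python
import random

def rollout(env, policy, horizon):
    """Collect a trajectory [(s_1, a_1), ..., (s_H, a_H)] from one episode;
    the environment's step(a) is assumed to return only the next state."""
    tau, s = [], env.reset()
    for h in range(1, horizon + 1):
        a = policy(h, s)
        tau.append((s, a))
        s = env.step(a)
    return tau

def self_play_final_state(A1, A2, env1, env2, M, horizon, K):
    """Two adversarial-MDP learners A1, A2 (hypothetical choose_policy()/update()
    interface) each control one copy of the MDP; the preference between the two
    final states is converted into Bernoulli reward feedback at the last step."""
    for k in range(K):
        pi1, pi2 = A1.choose_policy(), A2.choose_policy()
        tau1, tau2 = rollout(env1, pi1, horizon), rollout(env2, pi2, horizon)
        sH1, sH2 = tau1[-1][0], tau2[-1][0]            # final states s_H, s_H'
        y = 1 if random.random() < M(sH1, sH2) else 0  # y ~ Ber(M(s_H, s_H'))
        # A1 receives reward y, A2 receives 1 - y, both at the final step only.
        A1.update(tau1, [0.0] * (horizon - 1) + [float(y)])
        A2.update(tau2, [0.0] * (horizon - 1) + [float(1 - y)])
    # With sublinear-regret learners, the uniform mixture of the policies played
    # by A1 across the K episodes is an approximate von Neumann winner.
```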
### Learning from trajectory-based preferences via optimistic MLE
In this section, we consider the more general case where the preference $M\left[\tau, \tau^{\prime}\right]$ is allowed to depend arbitrarily on the two input trajectories. Similar to the utility-based setting, we assume that the learner is provided with a preference class $\mathcal{M} \subseteq\left((\mathcal{S} \times \mathcal{A})^H \times(\mathcal{S} \times \mathcal{A})^H \rightarrow[0,1]\right)$ and a transition function class $\mathcal{P}$ a priori, which contain the ground-truth preference and transition we are interacting with. Previously, we established the reduction from learning the von Neumann winner to learning restricted Nash in FI-MG. In addition, learning restricted Nash in FI-MG is in fact a special case of learning Nash equilibria in partially observable Markov games (POMGs). As a result, we can directly adapt the existing OMLE algorithm for learning Nash in POMGs (Liu et al., 2022b) to our setting, with only minor modifications required to learn the von Neumann winner.

>**Theorem**
>Suppose Assumption 3 holds. There exist absolute constants $c_1$ and $c_2$ such that for any $(T, \delta) \in \mathbb{N} \times(0,1]$, if we choose $\beta_{\mathcal{M}}=c_1 \ln (|\mathcal{M}| T / \delta)$ and $\beta_{\mathcal{P}}=c_1 \ln (|\mathcal{P}| T / \delta)$ in Algorithm 5, then with probability at least $1-\delta$, the duality gap of the output policy of Algorithm 5 is at most
$$
\frac{4 \xi\left(d_{\mathcal{P}}, T, c_2 \beta_{\mathcal{P}},\left|\Pi_{\exp }\right|\right)}{T}+c_2 \sqrt{\frac{d_{\mathcal{M}} \beta_{\mathcal{M}}}{T}}
$$
where $d_{\mathcal{M}}=\operatorname{dim}_{\mathrm{E}}(\mathcal{M}, 1 / T)$.
## Motivating Examples
>**Example 1**:
>(Tabular MDPs). Suppose state and action spaces are finite and small. In this case $d_{\overline{\mathcal{R}}}=\widetilde{\mathcal{O}}(|\mathcal{S}||\mathcal{A}|)$. The UCBVI-BF algorithm, proposed by Azar et al. (2017), is a model-based tabular RL algorithm which uses upper-confidence bound value iteration with Bernstein-Freedman bonuses.
>**Example 2**:
>(Low Bellman-eluder dimension). Bellman-eluder dimension (Jin et al., 2021) is a complexity measure for function approximation RL with a Q-value function class $\mathcal{F}$. It can be shown that a large class of RL problems, including tabular MDPs, linear MDPs, reactive POMDPs and low Bellman rank problems, all have small Bellman-eluder dimension. Furthermore, Jin et al. (2021) designed an algorithm, named GOLF, which (i) first constructs a confidence set for the optimal Q-functions by including all the candidates with small temporal difference loss, (ii) then optimistically picks Q-estimate from the confidence set and executes its greedy policy, and (iii) repeats (i) and (ii) using the newly collected data. Assuming that $\mathcal{F}$ satisfies realizability and completeness property, GOLF is $g(\epsilon)=\Theta\left(\frac{\epsilon}{\sqrt{d_{\mathrm{BE}}} H^2}\right)$-robust with sample complexity $\mathcal{C}(\epsilon, \delta)=\widetilde{\mathcal{O}}\left(\frac{d_{\mathrm{BE}} H^4 \ln |\mathcal{F}|}{\epsilon^2}\right)$ where $d_{\mathrm{BE}}$ is the Bellman-eluder dimension of the problem.
**Trajectory-based preferences**. When the reward function is trajectory-based, we can instantiate P2R with the OMLE algorithm (Liu et al., 2022a) to solve any model-based RLHF problem of low generalized eluder dimension. In brief, OMLE is an optimism-based algorithm that maintains a model confidence set comprising the model candidates with high log-likelihood on previously collected data. In each iteration, the algorithm chooses a model estimate optimistically and executes its greedy policy to collect new data.
>**Example 3**:
>(Low generalized eluder dimension). Generalized eluder dimension (Liu et al., 2022a) is a complexity measure for function approximation RL with a transition function class $\mathcal{P}$. In Appendix C.1, we show that a simple adaptation of OMLE is $g(\epsilon)=\Theta\left(\frac{\epsilon}{\sqrt{d_{\mathcal{R}}}}\right)$-robust with sample complexity $\mathcal{C}(\epsilon, \delta)=\widetilde{\mathcal{O}}\left(\frac{H^2 d_{\mathcal{P}}\left|\Pi_{\text {exp }}\right|^2 \ln |\mathcal{P}|}{\epsilon^2}+\frac{H d_{\mathcal{R}}\left|\Pi_{\text {exp }}\right|}{\epsilon}\right)$, where $d_{\mathcal{P}}$ denotes the generalized eluder dimension of $\mathcal{P},\left|\Pi_{\text {exp }}\right|$ is a parameter in the OMLE algorithm, and $d_{\mathcal{R}}=\operatorname{dim}_{\mathrm{E}}(\mathcal{R}, \epsilon)$.
## Supporting Lemmas and Theoretical Analysis
### Theoretical analysis of P2R
We assume that $\sigma(\cdot)$ is known and satisfies the following regularity assumption.
> **Assumption 2**:
> $\sigma(0)=\frac{1}{2}$; for $x \in[-H, H], \sigma^{\prime}(x) \geq \alpha>0$.
Assumption 2 is common and is satisfied by the popular Bradley-Terry model, where $\sigma$ is the logistic function.
> **Lemma 1**:
> If $\sigma$ is unknown, then even with only two candidate link functions, the optimal policy cannot be identified.
>**Proof of Lemma 1**:
>Consider two link functions $\sigma_1(x):=\frac{1}{1+\exp (-x)}$ and
$$
\sigma_2(x):=\frac{1}{2}+\alpha_1 x+\alpha_2\left(x-\alpha_3\right) \cdot I\left[|x|>\alpha_3\right],
$$
where $\alpha_1=1, \alpha_2=-0.484, \alpha_3=0.3$. Consider an MDP with $H=2$ and initial state $s_0$. Suppose that there are three terminal states $s_1, s_2, s_3$, and that the observed trajectory preferences depend only on the terminal state in the following way:
$$
\operatorname{Pr}\left[s_1 \succ s_2\right]=0.7, \quad \operatorname{Pr}\left[s_2 \succ s_3\right]=0.8, \quad \operatorname{Pr}\left[s_1 \succ s_3\right]=0.903 .
$$
This can be explained both by $\sigma_1$ with $r^{(1)}=\left\{s_0: 0, s_1: 0.847, s_2: 0, s_3: -1.386\right\}$, and by $\sigma_2$ with $r^{(2)}=\left\{s_0: 0, s_1: 0.2, s_2: 0, s_3: -0.3\right\}$. Suppose that state $s_0$ has two actions $a$ and $b$, which lead to the distributions $\left\{0.61: s_1, 0.39: s_3\right\}$ and $\left\{1: s_2\right\}$ respectively. Then the optimal action is $b$ under $r^{(1)}$ and $a$ under $r^{(2)}$. Therefore, without knowledge of the link function, it is impossible to identify the optimal policy even with perfect knowledge of the transitions and comparison probabilities.
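The arithmetic in this proof can be verified with a short Python script (the state names and reward dictionaries below simply encode the numbers stated above):

```python
import math

def sigma1(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigma2(x, a1=1.0, a2=-0.484, a3=0.3):
    return 0.5 + a1 * x + (a2 * (x - a3) if abs(x) > a3 else 0.0)

r1 = {"s1": 0.847, "s2": 0.0, "s3": -1.386}
r2 = {"s1": 0.2,   "s2": 0.0, "s3": -0.3}

# Both (sigma1, r1) and (sigma2, r2) reproduce 0.7, 0.8, 0.903 up to rounding.
for sigma, r in [(sigma1, r1), (sigma2, r2)]:
    print(round(sigma(r["s1"] - r["s2"]), 3),
          round(sigma(r["s2"] - r["s3"]), 3),
          round(sigma(r["s1"] - r["s3"]), 3))

# Action a reaches {0.61: s1, 0.39: s3}; action b reaches s2 with probability 1.
for r in (r1, r2):
    value_a = 0.61 * r["s1"] + 0.39 * r["s3"]
    value_b = r["s2"]
    print("a" if value_a > value_b else "b")   # prints b for r1, then a for r2
```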
> **Lemma 2**:
> If $\sigma^{\prime}(\cdot)$ is not lower bounded, for instance when $\sigma(x)=\frac{1}{2}(1+\operatorname{sign}(x))$, the optimal policy is indeterminate.
> **Proof of Lemma 2**:
>Similarly, consider an MDP with $H=2$ and initial state $s_0$. Suppose that there are three terminal states $s_1, s_2, s_3$, where we observe that
$$
\operatorname{Pr}\left[s_1 \succ s_2\right]=1, \quad \operatorname{Pr}\left[s_2 \succ s_3\right]=1, \quad \operatorname{Pr}\left[s_1 \succ s_3\right]=1 .
$$
Meanwhile, state $s_0$ has two actions $a$ and $b$, leading to the distributions $\left\{0.5: s_1, 0.5: s_3\right\}$ and $\left\{1: s_2\right\}$ respectively. In this case, both $r^{(1)}=\left\{s_0: 0, s_1: 0.5, s_2: 0, s_3: -1\right\}$ and $r^{(2)}=\left\{s_0: 0, s_1: 1, s_2: 0, s_3: -0.5\right\}$ fit the comparison results perfectly. However, under $r^{(1)}$ the optimal action is $b$, while under $r^{(2)}$ the optimal action is $a$.
P2R enjoys the following theoretical guarantee when we choose $\epsilon_0=g(\epsilon) / 2$, $\beta=\frac{\epsilon_0^2}{4}$, $d_{\overline{\mathcal{R}}}=\operatorname{dim}_{\mathrm{E}}\left(\overline{\mathcal{R}}, \epsilon_0\right)$ and $m=\Theta\left(\frac{d_{\overline{\mathcal{R}}} \ln \left(d_{\overline{\mathcal{R}}} / \delta\right)}{\epsilon_0^2 \alpha^2}\right)$.
>**Theorem 1**
> Suppose Assumptions 1 and 2 hold. Let $\epsilon_0=g(\epsilon) / 2$, $d_{\overline{\mathcal{R}}}=\operatorname{dim}_{\mathrm{E}}\left(\overline{\mathcal{R}}, \epsilon_0\right)$ and $m=\Theta\left(\frac{d_{\overline{\mathcal{R}}} \ln \left(d_{\overline{\mathcal{R}}} / \delta\right)}{\epsilon_0^2 \alpha^2}\right)$. Suppose that $\mathscr{A}$ is a $g(\epsilon)$-robust RL algorithm with sample complexity $\mathcal{C}(\epsilon, \delta)$. By running $\mathscr{A}$ with the interface in Algorithm 1, we can learn an $\epsilon$-optimal policy using $\mathcal{C}(\epsilon, \delta)$ samples and $\widetilde{\mathcal{O}}\left(\frac{d_{\overline{\mathcal{R}}}^2}{\alpha^2 g(\epsilon)^2}\right)$ queries to the comparison oracle, with probability at least $1-2 \delta$.
> **Lemma B.5**:
> With $m=\Theta\left(\frac{\ln \left(1 / \delta^{\prime}\right)}{\alpha^2 \epsilon^{\prime 2}}\right)$, for each $\tau$ on which the comparison oracle is queried, with probability $1-\delta^{\prime}$,
$$
\left|\hat{r}(\tau)-\left(r^{\star}(\tau)-r^{\star}\left(\tau_0\right)\right)\right| \leq \epsilon^{\prime} .
$$

> **Lemma B.6**:
> Set $m=\Theta\left(\frac{d \ln (d / \delta)}{\epsilon_0^2 \alpha^2}\right)$ and $\beta=\frac{\epsilon_0^2}{4}$. With probability $1-\delta$, the number of samples on which the comparison oracle is queried is at most $\operatorname{dim}_{\mathrm{E}}\left(\overline{\mathcal{R}}, \epsilon_0\right)$.

> **Lemma B.7**:
> With probability $1-\delta$, $r^{\star} \in \mathcal{B}_r$ throughout the execution of Algorithm 1.

> **Lemma B.8**:
> With probability $1-\delta$, for each $\tau$ in Line 3 of Algorithm 1, the returned reward $\hat{r}$ satisfies
$$
\left|\hat{r}-\left(r^{\star}(\tau)-r^{\star}\left(\tau_0\right)\right)\right| \leq 2 \epsilon_0 .
$$
>**Proof of Theorem 1**:
>Choose $\epsilon_0:=g(\epsilon) / 2, \beta=\frac{\epsilon_0^2}{4}$ and $m=\Theta\left(\frac{d_{\overline{\mathcal{R}}} \ln \left(d_{\overline{\mathcal{R}}} / \delta\right)}{\epsilon_0^2 \alpha^2}\right)$.
By Lemma B.8 (rescaling $\delta$), with probability $1-\delta$, the reward returned by the reward interface is $g(\epsilon)$-close to $\widetilde{r}^{\star}:=r^{\star}-r^{\star}\left(\tau_0\right)$ throughout the execution of the algorithm. By the definition of sample complexity, with probability $1-\delta$, the policy returned by $\mathscr{A}$ is $\epsilon$-optimal for $\widetilde{r}^{\star}$, which implies that it is also $\epsilon$-optimal for $r^{\star}$. The number of samples (episodes) is bounded by $\mathcal{C}(\epsilon, \delta)$. Finally, by Lemma B.6, the number of queries to the comparison oracle is at most
$$
\operatorname{dim}_{\mathrm{E}}\left(\overline{\mathcal{R}}, \epsilon_0\right) \cdot m \leq \widetilde{\mathcal{O}}\left(\frac{d_{\overline{\mathcal{R}}}^2}{g^2(\epsilon) \alpha^2}\right)
$$
## Discussions and Takeaways
This paper studies RLHF via efficient reductions. For utility-based preferences, we introduce a Preference-to-Reward Interface that reduces preference-based RL to standard reward-based RL. Our results are amenable to function approximation and incur no additional sample complexity. For general preferences without underlying rewards, we reduce finding the von Neumann winner to finding a restricted Nash equilibrium in a class of Markov games. This can be solved more concretely by adversarial MDP algorithms if the preference depends solely on the final state, and by optimistic MLE for preferences that depend on the whole trajectory.
Our results demonstrate that RLHF, from both utility-based and general preferences, can be readily solved under standard assumptions and with existing algorithmic techniques from the RL theory literature. This suggests that RLHF is not much harder than standard RL in the complexity sense, and need not be more complicated in the algorithmic sense. Consequently, our findings partially answer our main query: RLHF may not be more difficult than standard RL.
Directions for future work:
* For utility-based settings, a global lower bound $\alpha$ on the gradient of the link function is currently assumed (Assumption 2). For a logistic link function, $\alpha$ can decrease exponentially with respect to $H$ and lead to large query complexity bounds. We conjecture that this assumption can be relaxed to local measures of the gradient lower bound, which could alleviate the exponential dependence on $H$.
* For non-utility-based preferences, our current approach reduces to adversarial MDPs or Markov games, which limits its applicability (especially under function approximation). It remains unclear whether it is possible to reduce finding the von Neumann winner to a standard single-player RL problem.
## References
* Wang, Y., Liu, Q., & Jin, C. (2023). Is RLHF More Difficult than Standard RL? A Theoretical Perspective. arXiv:2306.14111