<style> img { display: block; margin-left: auto; margin-right: auto; } </style>

> [Paper link](https://arxiv.org/abs/2303.00001) | [Note link](https://mathpretty.com/16439.html) | [Code link](https://github.com/minaek/reward_design_with_llms) | ICLR 2023

:::success
**Thoughts**

This study explores using prompting strategies with a large language model (LLM) to supply rewards for reinforcement learning (RL). Specifically, it investigates how an LLM can generate reward signals that align with a user's objectives, and it evaluates the approach with standard RL training techniques such as DQN and on-policy methods.
:::

## Abstract

Designing an effective reward function is a long-standing challenge in reinforcement learning. This study addresses it by replacing the hand-engineered reward with a proxy reward function obtained by prompting a large language model.

## Background

Autonomous agents are increasingly valuable because they can learn policies from human user behavior to improve control and decision-making. However, applying this technique raises two key challenges:

1. Designing effective reward functions.
2. Obtaining a large amount of labeled data.

## Method

![image](https://hackmd.io/_uploads/H1dZhY75R.png)

Reinforcement learning is modeled as a Markov Decision Process (MDP), in which an agent selects actions in each episode to maximize the cumulative reward. In this study, the MDP is defined as $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, p, \mathcal{R}, \gamma \rangle$, where $\mathcal{S}$ is the state space (representations of the utterances in the negotiation so far) and $\mathcal{A}$ is the action space (the set of all possible utterances). The function $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ is the transition probability, and $\gamma$ is the discount factor. Traditionally, the reward function $\mathcal{R}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ maps states and actions to a real number.

In this study, a large language model is used as a proxy reward function: the LLM takes in a text prompt and outputs a string. Let $A^\ast$ denote the set of all strings, so a text prompt is $\rho \in A^\ast$ and the language model is $LLM: A^\ast \rightarrow A^\ast$. A prompt $\rho$ is the concatenation of the following components:

1. $\rho_1 \in A^\ast$: a string that **details the task** at hand.
2. $\rho_2 \in A^\ast$: a user-provided string that **outlines their objectives**, either through $N$ examples or through a natural-language description of their goals.
3. $\rho_3 \in A^\ast$: a **textual representation of the states and actions** from an RL episode, produced by a parser $f: \mathcal{S} \times \mathcal{A} \rightarrow A^\ast$.
4. $\rho_4 \in A^\ast$: a **question** asking whether the RL agent's behavior, as described in $\rho_3$, satisfies the user's objectives as outlined in $\rho_2$.

Finally, they define a parser $g: A^\ast \rightarrow \{ 0, 1\}$ that maps the LLM's textual output to a binary reward signal for the RL agent.
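The prompt construction and output parsing can be sketched in a few lines. The snippet below is a minimal illustration of the idea described above, not the authors' implementation: `llm` stands in for an arbitrary GPT-3 completion call, and the concrete wording of $\rho_1$–$\rho_4$ and the yes/no parsing rule are assumptions.

```python
from typing import Callable

def build_prompt(task_description: str,   # rho_1: describes the task
                 user_objective: str,     # rho_2: N examples or a natural-language goal description
                 episode_text: str,       # rho_3: f(s, a) -- the episode rendered as text
                 question: str) -> str:   # rho_4: "Does this behavior satisfy the user's objective?"
    """Concatenate rho_1..rho_4 into the prompt rho given to the LLM."""
    return "\n\n".join([task_description, user_objective, episode_text, question])

def proxy_reward(llm: Callable[[str], str], prompt: str) -> int:
    """g: A* -> {0, 1}; map the LLM's textual answer to a binary reward."""
    answer = llm(prompt).strip().lower()
    return 1 if answer.startswith("yes") else 0
```

At the end of each training episode, the parsed episode text fills $\rho_3$, the LLM is queried once, and the resulting 0/1 value serves as the episode's reward.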
## Experiment

The evaluation is organized around three questions, each paired with a task:

1. **Few-shot prompting**: Can LLMs generate reward signals that align with user objectives based on a few provided examples?
   - **Task**: Ultimatum Game (Few-shot)

   ![image](https://hackmd.io/_uploads/H16dUiXqR.png)

2. **Zero-shot prompting**: When objectives are clearly defined, can LLMs generate reward signals consistent with those objectives without any examples?
   - **Task**: Matrix Games (Zero-shot)

   ![image](https://hackmd.io/_uploads/SkHqIjX50.png)

3. **Few-shot prompting in complex domains**: Can LLMs produce reward signals that align with user objectives from examples in more intricate, long-horizon scenarios?
   - **Task**: DEALORNODEAL (Few-shot)

   ![image](https://hackmd.io/_uploads/SkYsIsm9R.png)

Two metrics are used in the evaluation:

1. **Labeling accuracy**: the average accuracy of the reward values predicted during RL training, measured against the ground-truth reward functions (a small sketch of this computation is given at the end of this section).
2. **Agent accuracy**: the average accuracy of the trained RL agents themselves.

In the experiment, the LLM-based approach is compared against a supervised learning (SL) model. GPT-3 serves as the LLM, and training is conducted with either the DQN algorithm or on-policy RL methods.
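As an illustration of the first metric, labeling accuracy can be computed as sketched below. This is an assumption-based sketch, not the paper's evaluation code: `predicted` is taken to hold the binary rewards returned by the LLM proxy during training, and `ground_truth` the corresponding labels from the ground-truth reward function.

```python
from typing import Sequence

def labeling_accuracy(predicted: Sequence[int], ground_truth: Sequence[int]) -> float:
    """Fraction of episodes where the LLM's predicted reward matches the ground truth."""
    assert len(predicted) == len(ground_truth) and len(predicted) > 0
    matches = sum(int(p == g) for p, g in zip(predicted, ground_truth))
    return matches / len(predicted)

# Hypothetical usage: LLM-proxy rewards vs. ground-truth rewards over four episodes.
print(labeling_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```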