<style>
img {
  display: block;
  margin-left: auto;
  margin-right: auto;
}
</style>

> [Paper link](https://arxiv.org/abs/2302.06692) | [Note link](https://blog.csdn.net/qq_43510916/article/details/135839005) | [Code link](https://github.com/yuqingd/ellm) | ICML 2023

:::success
**Thoughts**

This paper proposes ELLM, which prompts a large language model with a description of the agent's current state to generate goals, guiding reinforcement learning to explore effectively in environments that lack external rewards. However, the method depends on the quality of the textual description of the environment state and on the accuracy of the language model's suggestions.
:::

## Abstract

Reinforcement learning (RL) struggles without dense, well-shaped rewards, and intrinsically motivated exploration offers only limited benefit in large environments where most novel states are irrelevant to useful tasks. ELLM (Exploring with LLMs) uses text-based background knowledge from language models to shape exploration: it rewards the agent for achieving goals suggested by a language model, guiding it toward meaningful behaviors.

## Background

Reinforcement learning thrives when rewards are frequent, but defining such rewards for complex tasks is challenging. Without external rewards, RL agents must still learn useful behaviors. **But what should they learn?**

## Method

This paper introduces a method called **Exploring with LLMs (ELLM)**, which leverages pretrained language models as a source of information about useful behaviors. ELLM queries the LLM for potential goals based on the agent's context and rewards the agent for achieving them. This approach biases exploration toward diverse, context-sensitive, and human-meaningful goals. ELLM-trained agents cover more useful behaviors during pretraining and match or outperform baselines when fine-tuned on downstream tasks.

![image](https://hackmd.io/_uploads/BkRHOHI5A.png)

### Problem Description

Here, they consider a partially observed Markov decision process defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \Omega, \mathcal{T}, \gamma, \mathcal{R})$:

- Observations $o \in \Omega$
- Environment states $s \in \mathcal{S}$
- Actions $a \in \mathcal{A}$
- Observation function $\mathcal{O}$
- Dynamics of the environment $\mathcal{T}(s^\prime \mid s, a)$
- Reward $\mathcal{R}$
- Discount factor $\gamma$

![image](https://hackmd.io/_uploads/rJ4vuHL5C.png)

ELLM uses GPT-3 to suggest suitable exploratory goals and employs SentenceBERT embeddings to measure the similarity between these goals and a description of the agent's behavior, using that similarity as the intrinsic reward (minimal sketches of both steps are included at the end of this note).

This study considers two forms of agent training:

- a **goal-conditioned** setting, where the agent is given a sentence embedding of the list of suggested goals, $\pi(a \mid o, E(g^{1:k}))$
- a **goal-free** setting, where the agent does not have access to the suggested goals, $\pi(a \mid o)$

![image](https://hackmd.io/_uploads/ry1FOHU9C.png)

## Experiment

Two hypotheses are tested:

1. Prompted pretrained LLMs can generate exploratory goals that are diverse, commonsensical, and context-sensitive.
2. Training an ELLM agent on these exploratory goals improves performance on downstream tasks compared to methods that do not use LLM priors.

They evaluate ELLM in two complex environments:

- Crafter
- Housekeep

![image](https://hackmd.io/_uploads/Sy__AHIq0.png)

For the first hypothesis:

![image](https://hackmd.io/_uploads/SJZLJLLc0.png)

For the second hypothesis:

![image](https://hackmd.io/_uploads/HJoA1IIcA.png)
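
## Sketches

To make the goal-suggestion step in the Method section concrete, here is a minimal sketch of querying a language model for exploratory goals given a captioned state. It assumes the `openai` Python client; the model name, prompt wording, and the `suggest_goals` helper are illustrative assumptions, not the paper's exact prompt or GPT-3 setup.

```python
# Sketch of the goal-suggestion query (assumed prompt and model, not the paper's exact setup).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def suggest_goals(state_caption: str, k: int = 5) -> list[str]:
    """Ask the LLM for k exploratory goals given a text description of the agent's state."""
    prompt = (
        "You are guiding an agent in a survival game.\n"
        f"Current state: {state_caption}\n"
        f"Suggest {k} short, useful goals the agent could pursue next, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for the GPT-3 model used in the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    lines = response.choices[0].message.content.strip().splitlines()
    # Strip list markers and return at most k non-empty goal strings.
    return [g.lstrip("-0123456789. ").strip() for g in lines if g.strip()][:k]


# Example usage with a hypothetical Crafter-style caption
print(suggest_goals("You see a tree, grass, and a cow. Your inventory is empty."))
```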
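The second step is the intrinsic reward. Below is a minimal sketch of an ELLM-style reward that scores a captioned transition against the suggested goals with SentenceBERT cosine similarity, assuming the `sentence-transformers` package; the encoder name, similarity threshold, and example strings are assumptions rather than the paper's exact settings.

```python
# Sketch of the ELLM-style intrinsic reward via SentenceBERT similarity (assumed settings).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the paper's SentenceBERT model


def intrinsic_reward(transition_caption: str, suggested_goals: list[str],
                     threshold: float = 0.5) -> float:
    """Max cosine similarity between the captioned transition and any suggested goal,
    zeroed out below a similarity threshold (threshold value is an assumption)."""
    caption_emb = encoder.encode([transition_caption], normalize_embeddings=True)
    goal_embs = encoder.encode(suggested_goals, normalize_embeddings=True)
    sims = goal_embs @ caption_emb[0]  # cosine similarities (embeddings are unit norm)
    best = float(np.max(sims)) if len(sims) else 0.0
    return best if best > threshold else 0.0


# Example usage with hypothetical goals and a hypothetical achievement caption
goals = ["chop a tree", "drink water", "attack a zombie"]
print(intrinsic_reward("You chopped a tree.", goals))
```

Thresholding keeps the agent from being rewarded for loosely related behaviors, while taking the maximum over goals lets it satisfy any one of the LLM's suggestions.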