## Shared answer
We thank the reviewers for their helpful feedback. Overall, reviewers appreciated our method, finding it innovative and useful (VWkY), interesting and very effective (2GW2), and able to lead to strong results (j8Na). They appreciated the depth of our empirical analyses and evaluations (wqBP, VWkY), as well as our presentation and visualizations (wqBP, VWkY, 2GW2, j8Na).
Reviewer wqBP and Reviewer VWkY expressed interest in understanding how computationally expensive Motif is. Running our code based on PagedAttention as implemented in vLLM (Kwon et al., 2023) on a node with eight A100s, the annotation of the full dataset of pairs takes about 4 GPU days when using Llama 2 13b and 7.5 GPU days when using Llama 2 70b. Given the asynchronous nature of our code, the required wall-clock time can be significantly reduced if additional resources are available. We have now included all this information in the paper in Appendix A.8.4. We believe that, together with the use of open models and our algorithm's robustness to data diversity and cardinality (see Appendix A.8.3), this makes experimenting with Motif affordable and accessible to many academic labs.
We will release our efficient implementation, together with the entire set of annotations used in our experiments, for the benefit of the research community. Note that, once the dataset is annotated, the LLM is no longer needed, and the combination of reward model training and a 1B-step RL training run can take less than 10 hours on an A100 GPU. This also means that a policy trained with Motif can be deployed to act in real time, as long as the policy architecture runs fast enough, regardless of the computational cost of running the LLM itself at annotation time. Our updated paper explicitly highlights these computational considerations.
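For illustration, here is a minimal, hypothetical sketch of what batched pair annotation with vLLM can look like. The model identifier, parallelism settings, prompt template, and answer parsing are placeholders for this sketch and are not the exact ones used in the paper (the actual prompts are given in Appendix B):

```python
from vllm import LLM, SamplingParams

# Placeholder model and parallelism settings (eight GPUs on a single node).
llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=8)
params = SamplingParams(temperature=0.0, max_tokens=16)

# Placeholder prompt; see Appendix B of the paper for the actual prompts.
TEMPLATE = (
    "You are an expert NetHack player. Which of these two messages indicates "
    "better progress?\nMessage 1: {m1}\nMessage 2: {m2}\nAnswer with 1 or 2:"
)

def annotate_pairs(pairs):
    """Batch-annotate (message_1, message_2) pairs with a preference label."""
    prompts = [TEMPLATE.format(m1=a, m2=b) for a, b in pairs]
    outputs = llm.generate(prompts, params)
    # Crude parsing for the sketch; the real implementation is more careful.
    return [1 if out.outputs[0].text.strip().startswith("1") else 2 for out in outputs]
```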
Reviewer VWkY and Reviewer j8Na were interested in comparing Motif to additional approaches using LLMs for decision-making. The first LLM-based baseline to which we compare is based on leveraging Llama 2 70b as a policy on the raw text space (similar to Wang et al., 2023). This did not lead to any performance improvement over a random baseline (we discuss this in Section F of the Appendix). The second LLM-based baseline we implemented is a version of the recently-proposed ELLM algorithm (Du et al., 2023). This implementation of ELLM closely follows the details from the paper. As intrinsic reward, it uses the cosine similarity between the representation (provided by a BERT sentence encoder) of the game messages and the "early game goal" extracted from the NetHack Wiki (the first two lines from [the early game strategy](https://nethackwiki.com/wiki/Standard_strategy#The_early_game)). Despite Motif not relying on any such information from the NetHack Wiki, it significantly outperforms ELLM in all tasks. ELLM does not provide any benefit on complex tasks: its reward function cannot, by design, exhibit the exploration and credit assignment properties of the one produced by Motif (see Section 4.2 of the paper, "Alignment with human intuition"). Note that the ELLM baseline runs are still in progress, and we can currently only show results up to 650M steps (its iteration speed is considerably slower than Motif's). We will include the full curves in the final version of the paper.
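To make the ELLM reimplementation concrete, below is a minimal sketch of its intrinsic reward computation. The specific sentence encoder and the goal text are illustrative placeholders (the actual goal consists of the first two lines of the NetHack Wiki's early-game strategy, and the description above only specifies "a BERT sentence encoder"):

```python
from sentence_transformers import SentenceTransformer, util

# Encoder choice is a placeholder; any SentenceBERT-style model would fit this sketch.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder goal text; the real one is extracted from the NetHack Wiki.
GOAL = "Explore the dungeon, fight weak monsters, and descend to gain experience."
goal_embedding = encoder.encode(GOAL, convert_to_tensor=True)

def ellm_style_reward(message: str) -> float:
    """Cosine similarity between the current game message and the fixed early-game goal."""
    if not message:  # NetHack messages are sparse; empty captions get no reward
        return 0.0
    message_embedding = encoder.encode(message, convert_to_tensor=True)
    return util.cos_sim(message_embedding, goal_embedding).item()
```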
## Reviewer wqBP (Score: 6, Confidence: 3)
Thank you for your feedback!
> Using a 70-billion LLM to generate a preference dataset from given captions is quite expensive
Thanks to our modular and asynchronous implementation (which will be publicly released), dataset annotation is not especially expensive. Please see the general response for complete computational considerations. In addition, in Figure 15 we show that the performance of our method is particularly robust to the size of the dataset of annotations: Motif is able to outperform the baselines even with a dataset that is five times smaller (i.e., ~100k annotations).
> while I understand this is out of the scope of the paper, perhaps using a large VLM to annotate frames without captions might have been more economical?
The question of whether annotations extracted from a VLM would be more efficient than the ones extracted from running an LLM on captions is interesting. Unfortunately, our experiments with current open VLM models suggest that none of them is yet able to interpret visual frames well enough to provide an effective evaluation or even accurate captions (most likely because current open models are predominantly trained on natural images). However, given the current pace of VLM research, this may change very soon. Thus, investigating the relative efficiency of these types of feedback will be an interesting avenue for future research.
In general, the question of large VLMs operating on images vs LLMs operating on captions raises interesting tradeoffs. Large VLMs are more general, since they do not assume access to captions, but they face a more challenging task, since they work with complex images rather than compressed text descriptions. From a purely computational standpoint, if captions are available and of high quality, then LLMs are likely more economical, since their inputs are smaller.
> it might be worthwhile having a baseline that gives preferences using a simpler model (say sentiment analysis) and learn the RL policy using this intrinsic reward model.
We added the results of an experiment using a sentiment analysis model as a preference model in the updated paper (Figure 8 of Appendix A.6). We use a [T5 model](https://huggingface.co/mrm8488/t5-base-finetuned-imdb-sentiment) fine-tuned for sentiment analysis and extract, for each message, a score computed as the sigmoid of the model's confidence in its positive or negative sentiment prediction. Then, for each pair in the dataset, we assign the preference to the message with the higher sentiment score. Finally, we execute reward training and RL training as with Motif.
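For concreteness, a minimal sketch of this preference assignment is given below. The way the model's confidence is extracted is simplified here for illustration (the logit margin between the "positive" and "negative" labels), and, depending on how the checkpoint was fine-tuned, an input prefix may be required:

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

MODEL = "mrm8488/t5-base-finetuned-imdb-sentiment"
tokenizer = T5Tokenizer.from_pretrained(MODEL)
model = T5ForConditionalGeneration.from_pretrained(MODEL).eval()

# Assumes the labels "positive"/"negative" start with a distinctive first token.
POS_ID = tokenizer("positive", add_special_tokens=False).input_ids[0]
NEG_ID = tokenizer("negative", add_special_tokens=False).input_ids[0]

@torch.no_grad()
def sentiment_score(message: str) -> float:
    """Sigmoid of the positive-vs-negative logit margin for the first decoded token."""
    enc = tokenizer(message, return_tensors="pt")
    start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**enc, decoder_input_ids=start).logits[0, -1]
    return torch.sigmoid(logits[POS_ID] - logits[NEG_ID]).item()

def preference(message_1: str, message_2: str) -> int:
    """Prefer the message with the higher sentiment score (returns 1 or 2)."""
    return 1 if sentiment_score(message_1) >= sentiment_score(message_2) else 2
```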
Results on the `score` task show performance close to zero, both with and without extrinsic reward. This poor performance can be easily explained: a generic sentiment analysis model cannot capture the positivity or negativity of NetHack-specific captions. For instance, killing or hitting are generally regarded as negative statements, but they become positive in the context of killing or hitting monsters in NetHack. Llama 2 understands this out-of-the-box without any fine-tuning, as demonstrated by our experiments. Also note that such a vanilla sentiment analysis model cannot be easily steered, thus forgoing the controllability offered by Motif.
To attest to Motif's strong performance, we also compared with an additional LLM-based baseline (as requested by Reviewer VWkY). The details of this additional experiment are presented in the common response above.
> the paper mentions that the agent exhibits a natural tendency to explore the environment by preferring messages that would also be intuitively preferred by humans. Is this a consequence of having a strong LLM, or is it due to the wording of the prompt?
We believe this is due to the fact that the LLM was pretrained on massive amounts of human data, and then fine-tuned on human preferences. Indeed, even when using the zero-knowledge prompt presented in Prompt 2 of Appendix B, Motif's reward function allows agents to play the game effectively even without any reward signal from the environment (see Figure 6b).
> An ablation over $\alpha_2$ been provided in the appendix, but the value of $\alpha_1$ the coefficient for the intrinsic reward is kept fixed at 0.1; could you explain the reason behind that?
The important factor when combining two terms in a reward function is the relative weight given to each of them. We decided to ablate by varying $\alpha_2$ to progressively give more weight to the extrinsic reward, compared to the intrinsic reward. This allows us to show that, while Motif already performs well in the absence of extrinsic reward (i.e., for $\alpha_2=0$), progressively increasing the importance of the reward signal coming from the environment (by increasing $\alpha_2$) correspondingly increases performance, but only up to a point. Once the relative weight given to the intrinsic reward becomes too small, performance starts degrading, as the agent essentially behaves more and more like the extrinsic-only baseline.
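As a small sketch of why only the relative weight matters, assuming the reward is a simple linear combination of the two terms (the value of $\alpha_2$ below is purely illustrative):

```python
def combined_reward(r_ext: float, r_int: float,
                    alpha_1: float = 0.1, alpha_2: float = 1.0) -> float:
    # Rescaling both coefficients by the same constant rescales the return
    # without changing which policy maximizes it, so only the ratio
    # alpha_2 / alpha_1 affects the resulting behavior.
    return alpha_1 * r_int + alpha_2 * r_ext
```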
> In Figure 6c, the score for the reworded prompt is quite low but its dungeon level keeps steadily rising compared to the default prompt.
Figure 6c shows that a rewording of the prompt in a task as complex as finding the oracle can cause a complete change in the strategy followed by the agent. In the case of the original prompt, the agent hacks the reward, solving the task without needing to go down the dungeon. When using the reworded prompt, the agent instead starts going down the dungeon to look for the oracle, and finds it in up to 7% of the episodes. The plot thus shows the sensitivity of systems like Motif to prompt variations that humans would perceive as small. In the paper, we put a strong emphasis on understanding such changes in behavior, as we believe this to be fundamental if we are to release any LLM-based agent in more realistic situations. As a side note, we corrected the y-axis label and normalization in Figure 6c in the updated paper to be consistent with the rest of the paper (from "Score" to "Success Rate").
## Reviewer VWkY
Thank you for your feedback!
> The paper could benefit from a more extensive comparison to other methods, especially those that also attempt to integrate LLMs into decision-making agents.
First of all, we refer the reviewer to Figure 7 in Appendix F, in which we show that Motif outperforms four competitive baselines, including E3B (Henaff et al., 2022) and NovelD (Zhang et al., 2021), two state-of-the-art approaches specifically created for procedurally-generated domains such as NetHack.
In the updated paper, we have additionally added a comparison to ELLM (Du et al., 2023), a recent approach for deriving reward functions from LLMs, showing that Motif's performance significantly surpasses such LLM-based baselines across all tasks. This is due to the distinctive features of Motif's intrinsic reward (e.g., its anticipatory nature), which, by design, cannot emerge from a reward function based on the cosine similarity between a goal and a caption. Our implementation is described in detail in the general answer above.
> There is a lack of discussion on the computational cost and efficiency aspects of implementing Motif.
Thanks to our modular and asynchronous implementation (which will be publicly released), dataset annotation is not especially expensive. Please see the general response for complete computational considerations. In addition, in Figure 15 we show that the performance of our method is particularly robust to the size of the dataset of annotations: Motif is able to outperform the baselines even with a dataset that is five times smaller (i.e., ~100k annotations).
> While the paper makes a strong case for Motif, it doesn't delve deeply into the limitations or potential drawbacks of relying on LLMs for intrinsic reward generation.
We believe, as the reviewer does, that addressing limitations in LLM-based work is critical: this is why a substantial fraction of our paper is devoted to analyzing limitations and pitfalls of intrinsic motivation from an LLM's feedback. In particular, we want to highlight that we dedicated a full page of the paper to demonstrating evidence for, explaining, and characterizing _misalignment by composition_, a negative phenomenon relevant to our framework, whose emergence is a current limitation of Motif. In addition, we studied the sensitivity of Motif to different prompts, showing in Figure 6c that semantically-equivalent prompts can lead, in complex tasks, to drastically different behaviors. We believe this is a limitation of current approaches based on an LLM's feedback, and hope that future work will be able to address it. Finally, we also included in Appendix H.3 a study on the impact of the diversity of the dataset through which we elicit preferences on the resulting final performance.
Please note that we are also very upfront about the fundamental assumption behind Motif, which is also a fundamental assumption behind the zero-shot application of LLMs to new tasks: that the LLM contains prior knowledge about the environment of interest. Our Introduction is centered around this assumption.
> Could the authors offer insights into why agents trained on extrinsic only perform worse than those trained on intrinsic only rewards?
Our paper already provides some insights in the "Alignment with human intuition" paragraph of Section 4.2, but we will now provide an additional perspective that can be beneficial to understand this result. By inspecting the messages preferred by the intrinsic reward function, one can quickly realize that the agent will receive from the LLM's feedback three kinds of rewards: _direct rewards_, _anticipatory rewards_ and _exploration-directed rewards_. Direct rewards (e.g., for "You kill the cave spider!") leverage the LLM's knowledge of NetHack, implying a reward very similar to the score (i.e., the extrinsic reward). Motif's reward, however, goes beyond this. Anticipatory rewards (e.g., for "You hit the cave spider!") implicitly transport credit from the future to the past, encouraging events not rewarded by the extrinsic signal and easing credit assignment. Finally, exploration-directed rewards (e.g., for "You find a hidden door.") directly encourage explorative actions that will lead the agent to discover information in the environment. Together, these three types of rewards allow the agent to maximize a proxy for the game score that is far easier to optimize than the actual game score, explaining the increased performance.
> What's the best strategy to optimally balance intrinsic and extrinsic rewards during training?
We show in Figure 10c in the Appendix that Motif is quite robust to how the two rewards are balanced. Broadly speaking, given that the nature of Motif's intrinsic reward brings it closer to a value function, future work can explore potentially more effective ways to leverage this type of intrinsic reward, for instance via potential-based reward shaping (Ng et al., 1999).
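As a sketch of one possible (not yet evaluated) instantiation of this idea: if Motif's learned message score were used as a potential function $\Phi$ over states, potential-based shaping would modify the reward as

$$r_{\text{shaped}}(s, a, s') = r_{\text{ext}}(s, a, s') + \gamma\,\Phi(s') - \Phi(s),$$

which is guaranteed to leave the optimal policy unchanged (Ng et al., 1999) while still transporting credit backwards in time.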
> Can the authors elaborate on the limitations of using LLMs for generating intrinsic rewards? Are there concerns about misalignment or ethical considerations?
As highlighted in our previous answer to the third point, the space given to our studies on misalignment and robustness in our paper is a conscious decision. We believe this constitutes a first step in establishing this as a common practice when designing new algorithms. We want to remark here, as we did in our conclusions, that "we encourage future work on similar systems to not only aim at increasing their capabilities but to accordingly deepen this type of analysis, developing conceptual, theoretical and methodological tools to align an agent’s behavior in the presence of rewards derived from an LLM’s feedback."
> How robust are agents trained with Motif against different types of adversarial attacks or when deployed in varied environments?
Our experiments on prompt sensitivity (Figures 6c, 11b, 11c) can be interpreted as being close to this kind of study, showing that seemingly small variations of a prompt can trigger large or small changes in performance and behavior, depending on the environment. Future work should explore the possibility of studying the effect of actual adversarial attacks on prompts.
## Reviewer 2GW2
Thank you for your feedback!
> Can Motif be applied to other environments beyond the NetHack Learning Environment (NLE)?
We could not investigate this question in the current paper, which is already dense with detailed analyses of the agent's behavior, the risks of using LLMs, and the possibilities for defining diverse rewards. Had we added other environments, we could not have provided such in-depth analysis.
We strongly believe that Motif is a general method, and that it can be applied to other environments with a reasonable amount of effort, as long as its main assumptions are satisfied. In particular, Motif's LLM needs to have enough knowledge about the environment, which is related to how much text about the environment is available on the Internet, and an event captioning system must be available. These assumptions are realistic in many environments, both when dealing with a physical system (e.g., a robot accompanied by a vision captioner) and with a simulated/virtual world (e.g., a commonly-played videogame or Web browsing). Additionally, one could apply the general architecture of Motif to any environment based on visual observations by simply substituting a VLM for the LLM. We believe this is an exciting direction for future work.
To give some context on our choice of environment, NetHack is a challenging and illustrative domain in which to deploy an algorithm like Motif. Captions in NetHack are non-descriptive: they do not provide a complete picture of the underlying state of the game. Moreover, these captions are sparse, appearing in only 20% of states. This means that, overall, there is a high degree of partial observability. Despite this challenge, Motif thrives and shows results that we have not previously witnessed in the literature.
We believe that if we were to apply Motif to other environments with more complete descriptions, we could see even stronger performance. This would raise important questions to be studied: what exactly is the impact of partial observability on preferences obtained from an LLM? Do more detailed captions unlock increasingly refined behaviors from the RL agent? Such important questions could be investigated by future work.
> What a RL algorithm is used for RL fine-tuning?
We use the asynchronous PPO implementation of Sample Factory (Petrenko et al., 2020). This information is available in the paper at the bottom of page 4. We chose this implementation as it is extremely fast: we can train an agent on 2B steps of interaction in only 24 hours using a single V100 GPU.
## Reviewer j8Na
Thank you for your feedback!
> I think the point of this paper is that "joint optimization of preference-based and extrinsic reward helps resolve the sparse reward problems". As the source of feedback, either humans or LLMs are OK. I think describing this as LLM's contribution might be an overstatement. [...] As a preference-based RL method, I guess there are no differences from the original paper [1].
We never claim that an LLM's feedback is inherently better than the feedback coming from humans, even though we believe assessing whether that could be the case is an interesting avenue for future work. Instead, we simply leverage an LLM's feedback because of its scalability: in just a few hours on a small-sized cluster, one can annotate thousands of pairs of game events, which would otherwise require significant amounts of labour from human NetHack experts. This scalability, leveraged also in recent work on chat agents (e.g., Constitutional AI from Bai et al., 2022), allows a method like Motif to fully leverage human common sense to bootstrap an agent's knowledge.
> In the LLM literature, [2] leverages GPT-4 to solve game environments, and [3] incorporates LLM-based rewards for RL pretraining.
In [2], the Voyager algorithm uses complex prompt engineering involving significant amounts of human knowledge and engineering (such as deliberately prompting the LLM to use the "explore" command regularly). Additionally, Voyager relies on a Javascript API to bridge the gap between the LLM's language outputs and the high-dimensional observations and actions of Minecraft. Moreover, Voyager relies on perfect state information about certain features of the game (e.g., agent position and neighbouring objects), and builds on GPT-4's strong coding abilities, which current open models likely lack. Altogether, these factors strongly limit Voyager's general applicability and reproducibility. On the other hand, Motif relies on very limited human knowledge, achieving significant performance even without any information about NetHack. Moreover, Motif is far simpler to implement, with very few, clearly separated, moving pieces, providing a robust solution for leveraging prior knowledge from large models. This makes our approach a significantly more general method that has the potential to be applied to multiple domains or to be combined with powerful large vision-language models.
We have additionally compared Motif to the ELLM approach of [3], adapted as described in the general response, showing that Motif significantly outperforms it in all NLE tasks. Please see the common response for a detailed description of the experiment. The paper also highlights in Section 5 (Related Work) the important differences between Motif and the ELLM algorithm. In particular, ELLM's reward function cannot, by design, exhibit the exploration and credit assignment properties of the one produced by Motif (see Section 4.2 of the paper, "Alignment with human intuition"). We believe those differences are key to the strong performance of Motif.
> Terminology: I'm not sure if a preference-based reward should be treated as an "intrinsic" reward. I think it is extrinsic knowledge (from humans or LLM).
As is standard, we refer to the extrinsic reward as the reward that comes from the task to be performed in the environment, whereas intrinsic rewards are provided by the algorithm itself. From that point of view, the reward provided by the LLM is intrinsic (as opposed to extrinsic) to the agent. Please note that this terminology has previously been used in the literature (see, for instance, the ELLM [3] paper).
> Which RL algorithm is used for Motif? I may miss the description in the main text.
We use the asynchronous PPO implementation of Sample Factory (Petrenko et al., 2020). This information is available in the paper at the bottom of page 4. We chose this implementation as it is extremely fast: we can train an agent on 2B steps of interaction in only 24 hours using a single V100 GPU.
> Are there any reason why employ LLaMA 2 rather than GPT-3.5 / 4?
Yes, we believe there are important and significant reasons to prefer Llama 2 over GPT-3.5 or GPT-4. GPT-3.5 and GPT-4 are subject to changes over time, require significant financial resources to be used at scale, and rely on unknown methodologies and practices. While they might provide better performance, they are problematic for rigorous scientific reproducibility, and thus significantly less preferable than Llama 2 for conducting scientific research. We explicitly made this decision for the benefit of the scientific community, and we will also release our code and dataset to ease experimenting with a method like Motif for other members of the community.
> (Minor Issue)
We thank the reviewer for spotting the typo. We corrected it in the updated version of the paper.
### Second reply to Reviewer j8Na
Thank you for your timely answer! We are glad to know that we have effectively addressed the majority of the reviewer's concerns. We now provide answers to the remaining two concerns on Contribution and Terminology.
__Contribution__
We appreciate the suggestion from the reviewer. We highlight that we already discuss this early in the paper, in our introduction, stating that "since the idea of manually providing this knowledge on a per-task basis does not scale, we ask: what if we could harness the collective high-level knowledge humanity has recorded on the Internet to endow agents with similar common sense?". To make the advantage of AI feedback even more explicit to a reader, we added a brief but precise sentence to the conclusion, saying that "[Motif] bootstraps an agent's knowledge using AI feedback as a scalable proxy for human common sense.".
__Terminology__
In our paper, we demonstrate that the reward from Motif is not _only_ "a proxy for extrinsic reward", but instead captures rich information about the future that (1) helps the agent *explore* unknown parts of the environment, (2) *discover* inherently interesting patterns in the environment and (3) achieve *creative* solutions (please see Section 4.2 "Misalignment by composition in the oracle task"). Notice that these three characteristics are part of the formal definition of intrinsic motivation of Schmidhuber, 1990. In particular, Motif's intrinsic reward "is something that is independent of external reward, although it may sometimes help to accelerate the latter" (Section V.B from Schmidhuber, 1990). Even though we do not follow the specific way in which intrinsic motivation is defined in that seminal work (i.e., through learning progress), we believe Motif adheres to the underlying principles.
To better explain what we mean, we report here an excerpt of our response to VWkY, which provides a more detailed discussion on why Motif's reward is much more than simply a replication of the extrinsic reward, and instead incorporates strong elements of intrinsic motivation:
> Our paper already provides some insights in the "Alignment with human intuition" paragraph of Section 4.2, but we will now provide an additional perspective that can be beneficial to understand this result. By inspecting the messages preferred by the intrinsic reward function, one can quickly realize that the agent will receive from the LLM's feedback three kinds of rewards: direct rewards, anticipatory rewards and exploration-directed rewards. Direct rewards (e.g., for "You kill the cave spider!") leverage the LLM's knowledge of NetHack, implying a reward very similar to the score (i.e., the extrinsic reward). Motif's reward, however, goes beyond this. Anticipatory rewards (e.g., for "You hit the cave spider!") implicitly transport credit from the future to the past, encouraging events not rewarded by the extrinsic signal and easing credit assignment. Finally, exploration-directed rewards (e.g., for "You find a hidden door.") directly encourage explorative actions that will lead the agent to discover information in the environment.
In this passage, we explicitly distinguish the three ways in which Motif helps: (1) through rewards directly related to the score, (2) through anticipatory rewards and (3) through exploration-directed rewards. Notice that if Motif only provided rewards of type (1), we could see Motif as a proxy for the score. However, (2) and (3) make it abundantly clear that Motif goes much further than that and provides intrinsic motivation for the agent to discover the environment. It is also through (2) and (3) that Motif significantly outperforms the baselines.
Finally, we would like to highlight that we are as explicit as we can be as to the nature of the reward obtained by Motif, i.e., that it is preference-based. This is stated in the title, the introduction, and throughout the paper on numerous occasions. Notice that it is also the basis for the name of our algorithm (**Motif** -> **Moti**vation from AI **f**eedback).
We are hopeful that these answers address your concerns, but please let us know if any further clarification is required.
Jürgen Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2010.