We thank all reviewers for their constructive feedback. Based on the reviewers' comments, we have improved the paper's organisation as follows:
* We have revised the manuscript and expanded the result analysis.
* We have added a subsection to the appendix with a detailed introduction to DRRN and CALM.
* We have added a subsection to the appendix to illustrate how our method differs from GALAD and why the two are not directly comparable.
Below we provide point-by-point responses to your questions. We would appreciate it if you could acknowledge our response and let us know if you have any remaining questions about our work. Thank you.
* R1
> Q1: I found the method wasn't very clearly explained, and took a while to understand with multiple re-reads. Specifically, I think it'd be useful to explain DRRN and CALM more in detail when first presented, since these are the core pieces on which MorAL is based.
A1: Thanks for your suggestion. We have refined Section 3 and added more details about DRRN and CALM to aid understanding.
> Q2: The improvements over baselines are quite small, and given there are no error bars it's hard to tell if the results are significant.
A2: Thank you for your review. We would like to clarify that our model improves significantly over the baselines: it boosts the game completion percentage by 19% while decreasing the immorality score by 5%. Earlier efforts on this benchmark utilised an LM trained on more human gameplay trajectories and a prior built on additional morally-relevant datasets (Ammanabrolu et al., 2022). In contrast, our model does not require any external data source: during RL, the agent automatically collects past successful trajectories to conduct morality learning.
We follow the experimental setup of Hendrycks et al. (2021) for better comparison.
> Q3: The closest competing algorithms (CMPS and CMRS from Hendrycks et al) are only described briefly in the introduction, but they seem like very similar to the proposed method. The overall strategy is the same (train a Q function that optimizes task reward, and modify it using a commonsense prior to making it more moral). From what I can tell, the only difference seems to be whether the correction term is used to modify the game reward or Q-value, rather than separately choosing actions using a mixture of the Q value and moral policy.
A3: Thank you for your review. Both CMPS and CMRS (Hendrycks et al., 2021) simply use a commonsense prior to determine the morality of an action and to modify its Q-value/reward. MorAL is superior to these two algorithms in the following ways:
* Self-imitation learning: We collect past successful trajectories during training to conduct self-imitation learning (SIL). Ablation studies show that SIL greatly improves game completion; a minimal sketch of this step is given at the end of this response.
* Moral-enhanced loss function: We design a dynamically scaled cross-entropy function for morality learning, which allows for a greater emphasis on the training of moral samples.
* Adaptive learning: We design multiple learning cycles for adaptive task learning and morality learning.
We have also added more details to the related work.
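To make the self-imitation step concrete, here is a minimal illustrative sketch; the class and function names, the buffer capacity, and the user-supplied `finetune_step` are our own placeholders under stated assumptions, not the actual implementation:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Trajectory:
    steps: List[Tuple[str, str]]   # (context, action) pairs from one episode
    score: float                   # game score achieved by the episode

class SuccessBuffer:
    """Keeps the highest-scoring past episodes for self-imitation learning."""
    def __init__(self, capacity: int = 50):
        self.capacity = capacity
        self.trajectories: List[Trajectory] = []

    def add(self, traj: Trajectory) -> None:
        # Retain only the best episodes seen so far.
        self.trajectories.append(traj)
        self.trajectories.sort(key=lambda t: t.score, reverse=True)
        del self.trajectories[self.capacity:]

def self_imitation_update(buffer: SuccessBuffer,
                          finetune_step: Callable[[List[Tuple[str, str]]], None],
                          epochs: int = 3) -> None:
    """Run several passes of LM fine-tuning on all buffered (context, action) pairs."""
    samples = [pair for traj in buffer.trajectories for pair in traj.steps]
    for _ in range(epochs):
        finetune_step(samples)   # user-supplied: one fine-tuning pass of the action LM
```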
> Q4: I'm not convinced by the drawbacks mentioned in the paper for the Hendrycks et al. method:
"First, adding a correction term to the game reward or Q-value will generate extra noise, particularly for game rewards that are extremely sparse. In addition, some immoral actions are necessary for progressing through the game. For instance, in the game “Zork1”, the agent must steal the lantern to reach the following location on the map, as shown in Figure 1. The trade-off between task progress and morality is a dilemma that agents may encounter while making decisions."
Specifically, it's not clear to me how the proposed MorAL algorithm materially changes the problem of "some immoral actions are necessary for progress in the game" -- it seems like both methods have some way of trading off morality and task performance, and it's unclear to me which is better. I think this paper would be much stronger if it had a better argument / analysis for why MorAL performs better than CMPS/ CMRS, and why we'd expect this to be a general finding, rather than a specific quirk of this environment.
A4: Thank you for your review. In text-based games, improving the morality of the agent often leads to lower task completion (Hendrycks et al., 2021). Although Ammanabrolu et al. (2022) demonstrate greater increases in both metrics, their improvement in task completion comes from using more human gameplay trajectories to fine-tune the LM. This study aims to enhance the morality of the agent while simultaneously increasing game completion; our algorithm addresses this trade-off through adaptive morality learning and task learning.
> Q5: "RL agents may select immoral actions unconsciously" unclear if 'unconsciously' is an appropriate word here (what would it mean for an RL agent to select actions consciously?)
A5: Thank you for your review. "Unconsciously" denotes that the agent chooses actions without being aware of their morality. We have modified this sentence to make it clearer.
> Q6: "The LM is then equipped with a conscience" I don't really like using these kinds of human analogies, since I don't think the 'conscience' described here is anything like a human conscience.
A6: Thank you for your review. We follow prior work (Hendrycks et al., 2021) in using 'conscience' to refer to the moral sense of right and wrong. This sentence has been modified in the revised manuscript.
> Q7: "To sustain the agent’s exploration, we define that within an episode, if the current steps t exceed the max length of trajectories lmax in buffer B, πT should be used instead of the mixture policy for selecting actions." So if the episodes are long enough, the agent behaves non-morally? Or is this only during training?
A7: Thank you for your review. This operation is only performed during training to enhance the agent's exploration ability. We have added more details to make this clear.
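For concreteness, the training-time rule can be sketched as follows; all names (e.g. `task_policy`, `mixture_policy`, `buffer_lengths`) are illustrative placeholders rather than the actual implementation:

```python
from typing import Callable, List

def select_action(state, t: int, buffer_lengths: List[int],
                  task_policy: Callable, mixture_policy: Callable,
                  training: bool = True):
    """Training-time rule: once the current step t exceeds the length l_max of
    the longest trajectory stored in the buffer, fall back to the pure task
    policy so the agent keeps exploring beyond its known successful episodes.
    At evaluation time the mixture policy is always used."""
    l_max = max(buffer_lengths, default=0)
    if training and t > l_max:
        return task_policy(state)      # pi_T only, for exploration
    return mixture_policy(state)       # mixture of task policy and moral policy
```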
> Q8: "The walkthrough is constructed by human experts to quickly obtain the maximum possible score while taking less immoral behaviours" Seems like the method needs human expert data to warm-start learning. Is this also a limitation of other methods?
A8: Thank you for your question. The human expert is independent of the training process. This configuration is based on prior research and allows a game to be divided into five different environments. This is clarified in the revised manuscript.
> Q9: "The framework eliminates the assumption that dense human feedback is required during training, as we only perform morality learning using a limited number of trajectories at specific stages." It seems like other methods also don't need dense human feedback, unless I'm mistaken?
A9: Thank you for your question. Prior study needs dense human feedback and requires that the morality of the action be evaluated at each step. As noted by Hendrycks et al. (2021), requiring such dense human feedback for training purposes is unrealistic in most sequential decision making environments and is thus used only for evaluation. This assumption is invalidated by the fact that our method only evaluates actions in the buffer at particular stages.
> Q10: I'm not clear why you call MorAL a 'framework', rather than simply an 'algorithm'. To me it seems more like an algorithm.
A10: Thanks for your review. We replaced “framework” with “algorithm” in the revised manuscript.
> Q11: Eq1 -- what is A? Is it the set of candidate actions?
A11: Thank you for your question. In Eq1, A denotes the set of action candidates generated by the LM.
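For intuition, action selection over this candidate set in the DRRN/CALM setup typically takes the following form; this is a sketch in our own notation ($o_t$ for the current observation/context, top-$k$ LM sampling) and may differ in detail from Eq. 1 of the paper:

```latex
\mathcal{A} = \mathrm{LM}_{\text{top-}k}(o_t), \qquad
a_t = \arg\max_{a \in \mathcal{A}} Q(o_t, a)
```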
\[1] Ammanabrolu P., Jiang L., Sap M., et al. Aligning to Social Norms and Values in Interactive Narratives. arXiv preprint arXiv:2205.01975, 2022.
\[2] Hendrycks D., Mazeika M., Zou A., et al. What Would Jiminy Cricket Do? Towards Agents That Behave Morally. arXiv preprint arXiv:2110.13136, 2021.
* R2
> Q1: Even if on average MorAL achieves the least immorality score and the highest completion percentage, for 7/15 tasks NAIL has lower immorality score and for 4/15 tasks CMPS has lower immorality score. In total for 11/15 tasks, MorAL does not achieve the least immorality score...These results are not discussed and no reasoning ... Similarly, for a total of 8/15 tasks, other prior techniques achieve a higher completion percentage ...should be discussed
A1: Thank you for your suggestion. In text-based games, morality and task completion are often in conflict. For instance, in the game “Zork3”, our model improves task completion because the agent performs additional immoral actions (i.e., more props are collected), which also raises the immorality score. The same happens with other methods. In a few games, such as "Ballyhoo", an increase in task completion can instead lead to a decrease in the immorality score, possibly because task completion increases without the agent encountering additional morally salient scenarios. We add more discussion in Section 5.5.
We explore the effectiveness of MorAL through ablation studies (Table 2). The improvement in task completion is attributed to self-imitation learning, which leverages past valuable trajectories to improve the action generator. Compared to “MorAL w/o Mixture w/o SIL”, “MorAL w/o Mixture w/o Meo” improves task completion on 14/15 games. The use of the moral policy and the proposed loss function are both credited with lowering the immorality scores.
We also plot Figure 3 to investigate the trade-off between behaving morally and task completion. Compared with its variants, MorAL yields a better trade-off, as it reduces immoral behaviours with an acceptable sacrifice in completion percentage.
> Q2: ...It is unclear what is the performance of this finetuned model on the test set of the benchmark I.e how good is the quality of this model...Similarly, the use of commonsense prior model could also be a limitation because the errors of this model would be propagated to the morality control module of MorAL. This should be made clear and what definition of morality is used should also be clarified in the paper.
A2: Thanks for your suggestion. Similar to Hendrycks et al. (2021), the commonsense prior achieves 63.4% accuracy on a challenging test set of commonsense morality questions (Hendrycks et al., 2020), which suggests that a stronger model of commonsense morality could further improve agents' performance on the Jiminy Cricket benchmark. We clarify this in Appendix D of the revised manuscript. We also add a footnote in the revised manuscript that defines morality.
\[1] Hendrycks D., Mazeika M., Zou A., et al. What Would Jiminy Cricket Do? Towards Agents That Behave Morally. arXiv preprint arXiv:2110.13136, 2021.
\[2] Hendrycks D., Burns C., Basart S., et al. Aligning AI with Shared Human Values. arXiv preprint arXiv:2008.02275, 2020.
* R3
> Q1: Lack of clarity in Section 4.3 about what is old vs new. For instance, the data buffer is already a component of CALM, yet this is not made clear in the writing and it sounds like the authors are presenting something new. If there are differences compared to the CALM buffer, this should be made clear.
A1: Thank you for your suggestion. The buffer and Moral Aligned Policy Optimization are new. We create a new buffer, in addition to CALM's memory replay buffer, to store high-quality trajectories. This is clarified in the Data Collection part of the revised paper.
> Q2: Why only 15 games from Jiminy Cricket? How were these selected? Out of the 15 games selected, there are a few where immorality increases slightly, so I don't think the authors engaged in cherry-picking. However, it would be useful to know why the remaining 10 games were not included and whether the authors plan to include them (this would make it easier for future work to compare to the MorAL method).
A2: Thank you for your question. To ensure a fair comparison, we use the 15 games from Jiminy Cricket that are also used by one of our main baselines, CALM. Specifically, we first exclude the games where our backbone model CALM makes the least progress: games such as "Infidel", "Lurking Horror", "Starcross" and "Stationfall" are difficult for CALM to complete, and according to Hendrycks et al. (2021) the average percent completion of CALM on these games is less than 1%. Therefore, we do not utilise these environments. We also remove some games with duplicate themes and genres, such as "Zork2" and "Cutthroats".
Due to the time-consuming nature of RL training, it is common to select representative games from the game suite for experimentation. In this study, 15 representative games, each with five distinct initial states, were chosen based on the rules above. We therefore consider the current experiments to be unbiased and convincing.
> Q3: The paper should include more discussion of why comparison with GALAD is infeasible. This could be a subsection of the appendix. In particular, this paper proposes modifying the action generator language model, and GALAD does something similar. This should be discussed.
A3: Thanks for your suggestion. In Ammanabrolu et al. (2022), the authors re-evaluated the Jiminy Cricket environment, keeping only annotations with relatively high annotator agreement. However, neither the method nor the environment changes have been released publicly, which makes a direct comparison infeasible. We add a subsection to the appendix to provide more details; please see Appendix C.
> Q4: It's not clear what the moral policy language model is. Is this a GRU on top of a pre-trained GPT-2 model? Or are you fine-tuning the GPT-2 model?
A4: Thank you for your question. The moral policy language model is the fine-tuned GPT-2 model; this is described in Appendix D.
> Q5: Equation 4 and its description are confusing...I understand that m(c_i, a_i) goes to zero as the probability of immorality increases, but m(c_i, a_i) is different from the scaling factor, right?...I get the overall idea that a policy network is being trained with imitation learning on successful buffer examples weighted by the conscience, but the details need to be made more clear.
A5: Thank you for your suggestion. The scaling factor c is a fixed constant during training, whereas m(c_i, a_i) dynamically weights each data sample's contribution to the loss according to its immorality; the two are indeed different. We have clarified this in the revised manuscript.
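For intuition, a dynamically scaled cross-entropy of the kind described above could be written as follows; this is only an illustrative sketch in our own notation (with $\mathcal{B}$ the trajectory buffer and $\pi_\theta$ the action generator) and is not necessarily the exact form of Eq. 4:

```latex
\mathcal{L}(\theta) = -\sum_{(c_i, a_i) \in \mathcal{B}} w_i \, \log \pi_\theta(a_i \mid c_i),
\qquad w_i = m(c_i, a_i)^{\,c}
```

Here $m(c_i, a_i) \in [0, 1]$ decreases as the predicted immorality of $a_i$ increases, so moral samples receive larger weights, and the exponent $c$ is the fixed scaling constant.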
> Q6: Figure 3 is somewhat poorly described in the text. How is the figure generated?
A6: Thank you for your review. Figure 3 shows the relationship between the completion percentage and the overall average immorality score; it illustrates the trade-off between the two, with the immorality score appearing nearly proportional to the completion percentage. We have added more details on how the figure is generated to the manuscript.
> Q7: The first paragraph of the introduction is more like related work.
A7: Thanks for your suggestion. We have refined the first paragraph: we first present the background of text-based games and the research motivation, and then introduce admissible actions and action generation, which are relevant to the morality issue.
> Q8: In Figure 1's caption, it may be helpful to the reader to mention that the house that the agent is in does not belong to them.
A8: Thanks for your suggestion. We add more information about Figure 1.
\[1] Hendrycks D., Mazeika M., Zou A., et al. What Would Jiminy Cricket Do? Towards Agents That Behave Morally. arXiv preprint arXiv:2110.13136, 2021.
\[2] Ammanabrolu P., Jiang L., Sap M., et al. Aligning to Social Norms and Values in Interactive Narratives. arXiv preprint arXiv:2205.01975, 2022.
* R4
> Q1: As my main question, I want to hear more from the authors regarding the motivation of this research direction. Object collection and combat are probably at the core of many text adventure games, or any games in general. Many games are designed in a way that the players/agents have to follow some predefined storyline, or to reach some key points, in order to proceed the story forward. I fully understand and agree with the necessity of having morality as an important evaluation dimension in sequential decision making problems, but are existing games the best place to start with?
A1: Thank you for your question. We chose text-based games to evaluate the proposed MorAL for the following reasons:
* Semantically Rich Environments: Compared with previous benchmarks, text-based games have semantically rich and complex environments. Combined with the natural storyline of these games, thousands of morally dubious scenarios are created for the agent to explore. These diverse scenarios include theft, combat, as well as caring for and helping others.
* Misalignment between Task and Morality: Agents tend to behave immorally if their training environment ignores moral concerns. This issue is more apparent in the game environment as the game task and morality are often in conflict. Thus, we aim to study this issue and propose a generic solution.
> Q2: Related to the previous question, I also want to grab the authors' thoughts on, if designing new games (let's say, text-based adventure games), what are some ways to put this morality dimension systematically in the designing process? For instance, how to explicitly model/measure the trade-off between morality and task progress, and how to evaluate morality. One possibility is, as the authors briefly discussed in the paper, to take social aspects into consideration in building tasks/games. On a related note, in [1], they proposed a set of minimal tasks that require agents to perform certain social interactions as part of the skills solving some tasks. This can potentially be used in text-based game design, in a way that the agent needs to borrow a lantern from an NPC (and later return it!).
A2: Thank you for your insightful suggestions. Due to the multiplicity of scenarios, the Jiminy Cricket benchmark only evaluates the immorality of an agent based on the frequency and severity of immoral actions. One possible enhancement would be to encourage the agent to act in a less negative and more positive manner. As you suggest, we could include additional morally salient scenarios or tasks that are independent of the game task; the agent's morality could then be evaluated by requiring it to reason about and perform certain moral actions.
> Q3: Regarding human expert immorality scores in Table 1. Are they the accumulated scores until the end of the game, or until the same step budget as given to the agent? Would it make more sense if computing human immorality scores until the step where MorAL ends each game?
A3: Thank you for your question. The human expert immorality scores in Table 1 are the accumulated scores until the end of the game. Jiminy Cricket provides walkthroughs constructed by human experts, which take fewer immoral actions to quickly achieve the game task. However, a game's storyline has numerous branches that can lead to success or failure, and at the end of an episode the agent is likely to be on a different branch, so it is difficult to identify the walkthrough step corresponding to the point where the agent stops. Thus, we follow previous work and define human expert immorality scores as the accumulated scores until the end of the game.
> Q4: To me it was a bit confusing at the beginning of reading, I thought the morality training might have led to higher task completion scores --- which is actually not the case. The task completion boost might rather come from a better action candidate generator. Because the majority of the paper discusses the effects of morality learning, the other part of the contribution (the boost on task completion scores) might not be as clear. To me they are both important and worth emphasising.
A4: Thanks for your suggestion. The task completion boost comes from self-imitation learning. We emphasise this in Section 5.6 and discuss the reason for it: improving the language model helps the agent adapt to new scenarios and thus progress further.
> Q5: In Section 5.4, the authors provide the value of λ=0.14, this seems to require some hyper parameter tuning. What was the hyper-param search space?
A5: Thank you for your question. The search space for λ is 0.1 to 0.5. We conducted the hyperparameter tuning on one game and then applied the selected value to all games.
> Q6: How much training speed does the extra modality learning phase sacrifice?
A6: Thank you for your question. In our experiments, morality learning increases training time by around 10% for the same number of episode steps. The training speed of morality learning depends on parameters such as the learning cycle length and the number of epochs. For example, in the 5 environments of the game “Zork1”, the average training time for the first 10,000 steps of the CALM model is 3h10min. Every 2,000 steps, we update the action generator for 3 epochs, which takes about an additional 20 minutes.
\[1] Kovač G., Portelas R., Hofmann K., et al. SocialAI: Benchmarking Socio-Cognitive Abilities in Deep Reinforcement Learning Agents. arXiv preprint arXiv:2107.00956, 2021.