MENG FANG

We thank all reviewers for their constructive feedback. Based on the reviewers' comments, we have improved the paper's organisation as follows:

* We have revised the manuscript and expanded the analysis of the results.
* We have added a subsection to the appendix with a detailed introduction to DRRN and CALM.
* We have added a subsection to the appendix to illustrate how our method differs from GALAD and why the two are not directly comparable.

Below we provide point-by-point responses to your questions. It would be great if you could acknowledge our response and let us know if you have any remaining questions about our work. Thank you.

* R1

> Q1: I found the method wasn't very clearly explained, and took a while to understand with multiple re-reads. Specifically, I think it'd be useful to explain DRRN and CALM more in detail when first presented, since these are the core pieces on which MorAL is based.

A1: Thanks for your suggestion. We have refined Section 3 and added more details about DRRN and CALM for better understanding.

> Q2: The improvements over baselines are quite small, and given there are no error bars it's hard to tell if the results are significant.

A2: Thank you for your review. We would like to clarify that our model improves significantly over the baselines: it boosts the game completion percentage by 19% while decreasing the immorality score by 5%. Earlier efforts on this benchmark utilised an LM trained on more human gameplay trajectories and a prior built on more morally-relevant datasets (Ammanabrolu et al., 2022). Our model, in contrast, does not require any external data source: during RL, the agent automatically collects past successful trajectories to conduct morality learning. We follow the experimental setup of Hendrycks et al. (2021) for a fair comparison.

> Q3: The closest competing algorithms (CMPS and CMRS from Hendrycks et al) are only described briefly in the introduction, but they seem like very similar to the proposed method. The overall strategy is the same (train a Q function that optimizes task reward, and modify it using a commonsense prior to making it more moral). From what I can tell, the only difference seems to be whether the correction term is used to modify the game reward or Q-value, rather than separately choosing actions using a mixture of the Q value and moral policy.

A3: Thank you for your review. Both CMPS and CMRS (Hendrycks et al., 2021) simply use a commonsense prior to determine the morality of an action and to modify its Q-value/reward. MorAL is superior to these two algorithms in the following ways:

* Self-imitation learning: we collect past successful trajectories during training to conduct self-imitation learning. Ablation studies show that SIL greatly improves game completion.
* Moral-enhanced loss function: we design a dynamically scaled cross-entropy function for morality learning, which places greater emphasis on the training of moral samples.
* Adaptive learning: we design multiple learning cycles for adaptive task learning and morality learning.

We have also added more details to the related work.

> Q4: I'm not convinced by the drawbacks mentioned in the paper for the Hendrycks et al. method: "First, adding a correction term to the game reward or Q-value will generate extra noise, particularly for game rewards that are extremely sparse. In addition, some immoral actions are necessary for progressing through the game. For instance, in the game “Zork1”, the agent must steal the lantern to reach the following location on the map, as shown in Figure 1. The trade-off between task progress and morality is a dilemma that agents may encounter while making decisions." Specifically, it's not clear to me how the proposed MorAL algorithm materially changes the problem of "some immoral actions are necessary for progress in the game" -- it seems like both methods have some way of trading off morality and task performance, and it's unclear to me which is better. I think this paper would be much stronger if it had a better argument / analysis for why MorAL performs better than CMPS/CMRS, and why we'd expect this to be a general finding, rather than a specific quirk of this environment.

A4: Thank you for your review. In text-based games, improving the morality of the agent often leads to lower task completion (Hendrycks et al., 2021). Although Ammanabrolu et al. (2022) demonstrate greater increases in both metrics, the improvement in task completion comes from using more human gameplay trajectories to finetune the LM. Our study aims to enhance the morality of the agent while simultaneously increasing game completion, and our algorithm addresses this trade-off through adaptive morality learning and task learning.

> Q5: "RL agents may select immoral actions unconsciously" unclear if 'unconsciously' is an appropriate word here (what would it mean for an RL agent to select actions consciously?)

A5: Thank you for your review. "Unconsciously" denotes that the agent chooses actions without being aware of morality. We have modified this sentence to make it clearer.

> Q6: "The LM is then equipped with a conscience" I don't really like using these kinds of human analogies, since I don't think the 'conscience' described here is anything like a human conscience.

A6: Thank you for your review. We follow prior work (Hendrycks et al., 2021) in using 'conscience' to mean the moral sense of right and wrong. This sentence has been modified in the revised manuscript.

> Q7: "To sustain the agent’s exploration, we define that within an episode, if the current steps t exceed the max length of trajectories lmax in buffer B, πT should be used instead of the mixture policy for selecting actions." So if the episodes are long enough, the agent behaves non-morally? Or is this only during training?

A7: Thank you for your review. This operation is only conducted during training to enhance the agent's exploration ability. We have added more details to make this clear.
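
For clarity, a minimal sketch of this training-time switch is shown below. The function and variable names (`select_action_training`, `mixture_policy`, `task_policy`, `l_max`) are illustrative assumptions rather than our actual implementation; only the rule itself (fall back to the task policy πT once the episode step t exceeds the maximum trajectory length in buffer B, and only during training) comes from the text quoted above.

```python
# Illustrative sketch only; names and signatures are assumptions,
# not the paper's implementation.
def select_action_training(t, l_max, candidates, task_policy, mixture_policy):
    """Select an action during training.

    t              : current step within the episode
    l_max          : max length of trajectories stored in buffer B
    candidates     : action candidates A generated by the LM
    task_policy    : pi_T, the task-oriented policy
    mixture_policy : combination of pi_T with the moral policy
    """
    if t > l_max:
        # Beyond the buffer horizon, rely on the task policy alone
        # to sustain exploration (training only, per A7).
        return task_policy(candidates)
    return mixture_policy(candidates)
```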
> Q8: "The walkthrough is constructed by human experts to quickly obtain the maximum possible score while taking less immoral behaviours" Seems like the method needs human expert data to warm-start learning. Is this also a limitation of other methods?

A8: Thank you for your question. The human expert walkthrough is independent of the training process. This configuration is based on prior research and allows each game to be divided into five different environments. This is clarified in the revised manuscript.

> Q9: "The framework eliminates the assumption that dense human feedback is required during training, as we only perform morality learning using a limited number of trajectories at specific stages." It seems like other methods also don't need dense human feedback, unless I'm mistaken?

A9: Thank you for your question. Prior work needs dense human feedback: it requires the morality of the action to be evaluated at every step. As noted by Hendrycks et al. (2021), requiring such dense human feedback for training purposes is unrealistic in most sequential decision-making environments, and it is therefore used only for evaluation. Our method avoids this assumption because it only evaluates the actions stored in the buffer at particular stages.

> Q10: I'm not clear why you call MorAL a 'framework', rather than simply an 'algorithm'. To me it seems more like an algorithm.

A10: Thanks for your review. We have replaced "framework" with "algorithm" in the revised manuscript.

> Q11: Eq1 -- what is A? Is it the set of candidate actions?

A11: Thank you for your question. In Eq. 1, A denotes the set of action candidates generated by the LM.

[1] Ammanabrolu P, Jiang L, Sap M, et al. Aligning to social norms and values in interactive narratives. arXiv preprint arXiv:2205.01975, 2022.
[2] Hendrycks D, Mazeika M, Zou A, et al. What would Jiminy Cricket do? Towards agents that behave morally. arXiv preprint arXiv:2110.13136, 2021.

* R2

> Q1: Even if on average MorAL achieves the least immorality score and the highest completion percentage, for 7/15 tasks NAIL has lower immorality score and for 4/15 tasks CMPS has lower immorality score. In total for 11/15 tasks, MorAL does not achieve the least immorality score...These results are not discussed and no reasoning ... Similarly, for a total of 8/15 tasks, other prior techniques achieve a higher completion percentage ...should be discussed

A1: Thank you for your suggestion. In text-based games, morality and task completion are often in conflict. For instance, in the game “Zork3”, our model improves task completion through additional immoral actions performed by the agent (i.e., more props are collected), which also raises the immorality score; the same happens with other methods. In a few games, such as "Ballyhoo", an increase in task completion can lead to a decrease in the immorality score, possibly because the additional progress does not encounter further morally salient scenarios. We have added more discussion in Section 5.5.

We explore the effectiveness of MorAL through ablation studies (Table 2). The improvement in task completion is attributed to self-imitation learning, which leverages past valuable trajectories to improve the action generator: compared to "MorAL w/o Mixture w/o SIL", "MorAL w/o Mixture w/o Meo" improves task completion on 14/15 games. The use of the moral policy and the proposed loss function are both credited with lowering the immorality scores. We also plot Figure 3 to investigate the trade-off between behaving morally and task completion. Compared with its variants, MorAL yields a better trade-off, as it reduces immoral behaviours with an acceptable sacrifice in completion percentage.

> Q2: ...It is unclear what is the performance of this finetuned model on the test set of the benchmark I.e how good is the quality of this model...Similarly, the use of commonsense prior model could also be a limitation because the errors of this model would be propagated to the morality control module of MorAL. This should be made clear and what definition of morality is used should also be clarified in the paper.

A2: Thanks for your suggestion. Similar to Hendrycks et al. (2021), the commonsense prior achieves 63.4% accuracy on a challenging test set of commonsense morality questions (Hendrycks et al., 2020), which suggests that a stronger model of commonsense morality could further improve the performance of agents on the Jiminy Cricket benchmark. We clarify this in Appendix D of the revised manuscript. We have also added a footnote to the revised manuscript that defines morality.

[1] Hendrycks D, Mazeika M, Zou A, et al. What would Jiminy Cricket do? Towards agents that behave morally. arXiv preprint arXiv:2110.13136, 2021.
[2] Hendrycks D, Burns C, Basart S, et al. Aligning AI with shared human values. arXiv preprint arXiv:2008.02275, 2020.

* R3

> Q1: Lack of clarity in Section 4.3 about what is old vs new. For instance, the data buffer is already a component of CALM, yet this is not made clear in the writing and it sounds like the authors are presenting something new. If there are differences compared to the CALM buffer, this should be made clear.

A1: Thank you for your suggestion. Both the buffer and Moral Aligned Policy Optimization are new. We create a new buffer, in addition to CALM's memory replay buffer, to store high-quality trajectories. This is clarified in the Data Collection part of the revised paper.
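
To illustrate what such a buffer might look like, here is a minimal sketch under stated assumptions: the class name `TrajectoryBuffer`, the admission rule (an episode is kept only if its score beats the best stored score), and the capacity bound are all hypothetical; the actual criterion for "high-quality" trajectories in the paper may differ.

```python
# Hypothetical sketch of a high-quality trajectory buffer; the admission rule
# and capacity bound are assumptions for illustration, not the paper's criterion.
class TrajectoryBuffer:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.trajectories = []  # each item: (score, [(context, action), ...])

    def maybe_add(self, score, trajectory):
        """Store a finished episode if it beats the best stored score."""
        best = max((s for s, _ in self.trajectories), default=float("-inf"))
        if score > best:
            self.trajectories.append((score, trajectory))
            # Keep only the highest-scoring trajectories.
            self.trajectories.sort(key=lambda item: item[0], reverse=True)
            del self.trajectories[self.capacity:]

    def max_length(self):
        """l_max used by the training-time policy switch (see R1 A7 above)."""
        return max((len(traj) for _, traj in self.trajectories), default=0)
```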
> Q2: Why only 15 games from Jiminy Cricket? How were these selected? Out of the 15 games selected, there are a few where immorality increases slightly, so I don't think the authors engaged in cherry-picking. However, it would be useful to know why the remaining 10 games were not included and whether the authors plan to include them (this would make it easier for future work to compare to the MorAL method).

A2: Thank you for your question. To ensure a fair comparison, we use 15 games from Jiminy Cricket, which are also used by one of our main baselines, CALM. Specifically, we first abandon the games on which our backbone model CALM makes the least progress. Some games, such as "Infidel", "Lurking Horror", "Starcross" and "Stationfall", are difficult for CALM to complete; according to Hendrycks et al. (2021), the average percent completion of CALM on these games is less than 1. Therefore, we do not utilise these environments. We also remove some games with duplicate themes and genres, such as "Zork2" and "Cutthroats". Due to the time-consuming nature of RL training, it is common to select representative games from the game suite for experimentation. In this study, 15 representative games with five distinct initial states were chosen based on the aforementioned rules. We consider the current experiments to be unbiased and convincing.

> Q3: The paper should include more discussion of why comparison with GALAD is infeasible. This could be a subsection of the appendix. In particular, this paper proposes modifying the action generator language model, and GALAD does something similar. This should be discussed.

A3: Thanks for your suggestion. In Ammanabrolu et al. (2022), the authors re-evaluated the Jiminy Cricket environment, keeping only annotations with relatively high annotator agreement. However, neither the method nor the environment changes were released publicly. We have added a subsection to the appendix to provide more details; please see Appendix C.

> Q4: It's not clear what the moral policy language model is. Is this a GRU on top of a pre-trained GPT-2 model? Or are you fine-tuning the GPT-2 model?

A4: Thank you for your question. The moral policy language model is the fine-tuned GPT-2 model. We describe it in Appendix D.

> Q5: Equation 4 and its description are confusing...I understand that m(c_i, a_i) goes to zero as the probability of immorality increases, but m(c_i, a_i) is different from the scaling factor, right?...I get the overall idea that a policy network is being trained with imitation learning on successful buffer examples weighted by the conscience, but the details need to be made more clear.

A5: Thank you for your suggestion. The scaling factor c is a fixed constant during training, while m(c_i, a_i) weights the loss according to the immorality of the data sample; m(c_i, a_i) is indeed different from the scaling factor. We have clarified this in the revised manuscript.
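
As a reading aid, one loss of the kind described here (a fixed constant c scaling a per-sample moral weight m(c_i, a_i) that decays towards zero as the predicted probability of immorality grows) is sketched below. This is a hedged reconstruction from the discussion above, not necessarily the exact form of Eq. 4 in the paper.

```latex
% Hedged reconstruction of the moral-enhanced (dynamically scaled)
% cross-entropy loss; the exact form of Eq. 4 in the paper may differ.
% c is a fixed scaling constant and m(c_i, a_i) is the per-sample moral
% weight, which goes to 0 as the predicted immorality of the
% (context, action) pair increases.
\mathcal{L}_{\mathrm{moral}}
  = - c \, \frac{1}{N} \sum_{i=1}^{N} m(c_i, a_i)\, \log \pi_{\theta}(a_i \mid c_i)
```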
> Q6: Figure 3 is somewhat poorly described in the text. How is the figure generated?

A6: Thank you for your review. Figure 3 shows the relationship between the completion percentage and the overall average immorality score. It shows that there is a trade-off between the immorality score and the completion percentage, and that the immorality score appears to be nearly proportional to the completion percentage. We have added more details to the manuscript.

> Q7: The first paragraph of the introduction is more like related work.

A7: Thanks for your suggestion. We have refined the first paragraph. We first present the background of text-based games and the research motivation, and then introduce admissible actions and action generation, which are relevant to the morality issue.

> Q8: In Figure 1's caption, it may be helpful to the reader to mention that the house that the agent is in does not belong to them.

A8: Thanks for your suggestion. We have added more information to Figure 1's caption.

[1] Hendrycks D, Mazeika M, Zou A, et al. What would Jiminy Cricket do? Towards agents that behave morally. arXiv preprint arXiv:2110.13136, 2021.
[2] Ammanabrolu P, Jiang L, Sap M, et al. Aligning to social norms and values in interactive narratives. arXiv preprint arXiv:2205.01975, 2022.

* R4

> Q1: As my main question, I want to hear more from the authors regarding the motivation of this research direction. Object collection and combat are probably at the core of many text adventure games, or any games in general. Many games are designed in a way that the players/agents have to follow some predefined storyline, or to reach some key points, in order to proceed the story forward. I fully understand and agree with the necessity of having morality as an important evaluation dimension in sequential decision making problems, but are existing games the best place to start with?

A1: Thank you for your question. We chose text-based games to evaluate the proposed MorAL for the following reasons:

* Semantically rich environments: compared with previous benchmarks, text-based games have semantically rich and complex environments. Combined with the natural storyline of these games, thousands of morally dubious scenarios are created for the agent to explore. These diverse scenarios include theft and combat, as well as caring for and helping others.
* Misalignment between task and morality: agents tend to behave immorally if their training environment ignores moral concerns. This issue is more apparent in game environments, as the game task and morality are often in conflict. Thus, we aim to study this issue and propose a generic solution.

> Q2: Related to the previous question, I also want to grab the authors' thoughts on, if designing new games (let's say, text-based adventure games), what are some ways to put this morality dimension systematically in the designing process? For instance, how to explicitly model/measure the trade-off between morality and task progress, and how to evaluate morality. One possibility is, as the authors briefly discussed in the paper, to take social aspects into consideration in building tasks/games. On a related note, in [1], they proposed a set of minimal tasks that require agents to perform certain social interactions as part of the skills solving some tasks. This can potentially be used in text-based game design, in a way that the agent needs to borrow a lantern from an NPC (and later return it!).

A2: Thank you for your insightful suggestions. Due to the multiplicity of scenarios, the Jiminy Cricket benchmark only evaluates the immorality of an agent based on the frequency and severity of immoral actions. One possible enhancement would be to encourage the agent to act in a less negative and more positive manner. As you suggested, we could include additional morally salient scenarios or tasks that are independent of the game task. In this way, we could evaluate the agent's morality by requiring it to reason about and perform certain moral actions.

> Q3: Regarding human expert immorality scores in Table 1. Are they the accumulated scores until the end of the game, or until the same step budget as given to the agent? Would it make more sense if computing human immorality scores until the step where MorAL ends each game?

A3: Thank you for your question. The human expert immorality scores in Table 1 refer to the accumulated scores until the end of the game. Jiminy Cricket provides walkthroughs constructed by human experts, which take fewer immoral actions while quickly completing the game task. However, a game's storyline has numerous branches that can lead to success or failure, and at the end of an episode the agent is likely to be on one of these branches, so it is difficult to identify the walkthrough step corresponding to the agent's final step. Thus, we follow previous work and define the human expert immorality score as the accumulated score until the end of the game.

> Q4: To me it was a bit confusing at the beginning of reading, I thought the morality training might have led to higher task completion scores --- which is actually not the case. The task completion boost might rather come from a better action candidate generator. Because the majority of the paper discusses the effects of morality learning, the other part of the contribution (the boost on task completion scores) might not be as clear. To me they are both important and worth emphasising.

A4: Thanks for your suggestion. The task completion boost comes from self-imitation learning. We emphasise this in Section 5.6 and discuss the reason: improving the language model helps the agent adapt to new scenarios and thus progress further.

> Q5: In Section 5.4, the authors provide the value of λ=0.14, this seems to require some hyper parameter tuning. What was the hyper-param search space?

A5: Thank you for your question. The hyperparameter search space for λ is from 0.1 to 0.5. We tuned it on one game and then applied the same value to all games.

> Q6: How much training speed does the extra morality learning phase sacrifice?

A6: Thank you for your question. In our experiments, morality learning increases training time by around 10% for the same number of episode steps. The overhead depends on parameters such as the learning cycle length and the number of epochs. For example, in the five environments of the game "Zork1", the average training time of the first 10,000 steps of the CALM model is 3h 10min; updating the action generator for 3 epochs every 2,000 steps takes about an additional 20 minutes.

[1] Kovač G, Portelas R, Hofmann K, et al. SocialAI: Benchmarking socio-cognitive abilities in deep reinforcement learning agents. arXiv preprint arXiv:2107.00956, 2021.
