# General response:
We thank all reviewers for their insightful comments and acknowledgment of our contributions. We highlight the major contributions of our work as follows:
1. The creation and development of a new benchmark, entailing a substantial amount of effort (reviewer hHwF). Our benchmark differs from prior benchmarks by featuring multiple different task objectives within a single episode, where agents need to collaborate to achieve maximum collaboration efficiency. It also offers a new opportunity to bring a human in as one of the collaborators (reviewer hHwF). If released, it will be a valuable resource for researchers (reviewer iv1S).
2. A novel infrastructure designed to evaluate planning and coordination capabilities in gaming interactions (reviewer Wtib) and its potential for collaborating with human users. In addition, we perform cross-domain experiments, transferring our infrastructure to Minecraft to demonstrate its generality.
3. The introduction of a novel metric, CoS (Collaboration Score), to assess collaboration capabilities with comprehensive evaluations (reviewer Wtib).
# Response to reviewer hHwF
We sincerely thank you for your time and constructive comments. Below, we provide detailed replies to your comments and hope we can resolve your major concerns.
> Unlike other well-known environments (such as Sims in Park 2023), the environment proposed by the authors is only partially collaborative. The authors do not propose any communication between agents or the possibility of transferring some tasks to other agents. All agents have the same functionality and essentially the same (complete) environment observation. This setting differs little from a single-agent one with the ability to perform several actions at one time. These limitations seriously reduce the value of the proposed environment for testing complex multi-agent scenarios with communication.
We thank you for your comments. Below, we try to address your concerns regarding the collaborative mechanism in our game.
1. Despite the centralized coordination scheme we adopted in the LLM dispatcher benchmarking, the CuisineWorld game itself supports a wide range of collaborative mechanisms, including direct communication between agents; tasks can certainly be transferred as well.
2. To provide further evidence for this point, in Sec. 5.2 we present experiments between human players and multiple LLM-powered NPCs. In this setting, the human is guided to communicate directly with the LLM agents in natural language; the human may ask for help, suggest tasks for the LLM agents, etc. The results confirm that our framework enables such intuitive communication, good collaboration, and a better human player experience. Therefore, we believe our game supports various collaborative mechanisms, on par with Sims in Park 2023.
3. Finally, on why we chose the centralized coordination scheme: in this work, we focus on task allocation and coordination **efficiency**. In particular, we aim to measure coordination efficiency **quantitatively**, rather than showcasing how collaboration on complex tasks is possible with LLMs as in Park 2023. We found this to be a very challenging task for LLMs in terms of quantitative measures. Therefore, in this work we chose to primarily investigate the centralized coordination scheme you mentioned in your review. We commit to exploring a more general, decentralized scheme, where agents do not share all states, in future work.
> One of the key limitations of the proposed infrastructure is the use of ad-hoc modules for extracting and validating actions. In fact, these are domain-specific templates that require manual configuration. This also leads to a strong attachment to a specific environment. An example of the difficulties of adapting to Minecraft further demonstrates this.
Thanks for raising this matter. We address your concerns on action extraction and validation below.
1. Action Extraction:
The action extraction process in our system is designed to be straightforward and adaptable. By prompting the LLM to produce actions in a well-defined format (e.g., wrapped between <act>...</act>), we simplify the extraction process, making it efficient through regular expressions. Although we require knowledge of each game's valid action format, we believe this assumption is reasonable, as valid actions differ per game and are easy to obtain. In addition, we could even ask the LLM itself to extract actions in place of regular expressions.
2. Action Validation:
The key is to utilize game state information: for action validation, our infrastructure relies on the inherent description of legal actions provided within each game. To facilitate this, we **simply** relay this game state information to language models like GPT-4. This enables the LLM to make informed decisions about the legality and appropriateness of actions within the specific context of the game. We believe this approach to action validation is intuitive and adaptable to new game domains with minimal effort.
3. Adaptability Concerns:
Case of Minecraft: Minecraft tasks have different task structures and action spaces compared to CuisineWorld. We demonstrated that, without many changes, the infrastructure can successfully adapt to Minecraft. Please see Sec. 7 for how we approach this.
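To make points 1 and 2 above concrete, here is a minimal sketch of tag-based extraction and of relaying legal actions to the model. The action and field names (`goto(...)`, `put(...)`, `pot_0`) are illustrative placeholders, not the actual CuisineWorld API:

```python
import re

def extract_actions(llm_output: str) -> list[str]:
    """Pull every action wrapped in <act>...</act> tags out of raw LLM text."""
    return re.findall(r"<act>(.*?)</act>", llm_output, flags=re.DOTALL)

def build_validation_context(game_state: dict, legal_actions: list[str]) -> str:
    """Relay the current state and legal actions to the LLM so it can
    propose only valid moves (field names here are illustrative)."""
    lines = ["Current game state:"]
    lines += [f"- {k}: {v}" for k, v in game_state.items()]
    lines.append("Legal actions at this step:")
    lines += [f"- {a}" for a in legal_actions]
    lines.append("Reply with one action per agent, wrapped in <act>...</act>.")
    return "\n".join(lines)

# Illustrative dispatcher reply assigning actions to two agents:
reply = "agent_1: <act>goto(pot_0)</act> agent_2: <act>put(onion_1, pot_0)</act>"
print(extract_actions(reply))  # ['goto(pot_0)', 'put(onion_1, pot_0)']
```

With a stronger model such as GPT-4, the format adherence is reliable enough that the same two-line regex transfers across games; only the list of legal actions relayed in the prompt changes per domain.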
> The prompt compiled by the authors already essentially contains an action plan in the recipe. In fact, the authors are not testing the agent's ability to plan actions (to make a chain of actions only according to the description of the initial and final state) but the ability to translate a chain of actions from one format (hierarchical in the recipe) to another (in the format of predicates of actions).
The "planning" or "recipe generation from dish name" in CuisineWorld is indeed quite simple. As shown in the examples in Figure 8 to Figure 20 in the appendices, all the recipes involve only a few steps -- we do not provide complicated dishes in the current study. However, **the most complicated part is translating these seemingly simple recipes into organized, efficient coordination among multiple agents to achieve high throughput (collaboration efficiency)**. For example, in level 9, it is not trivial to assign agents to different subtasks given that different tasks may arrive at different moments.
> The authors in (1) describe the classical conditional optimization problem, for which there are a number of standard heuristic solvers that solve the problem quite accurately. If the authors use this statement, it is necessary to refer to these solvers and comment on the possibility/non-possibility of their use.
The problem formulation described in (1) is a mixed-integer programming problem, which is NP-hard. Standard heuristic solvers require manual configuration for each setup and do not generalize across setups: each minor change, such as adding more agents, requires sophisticated modifications, e.g., redefining the constraints.
Please refer to [1] for more details.
[1] Korsah et al. A comprehensive taxonomy for multi-robot task allocation.
> In experiments with 12 subjects, quantitative parameters were not specified, according to which statistical tests were then calculated: the number of episodes, repetitions, episode time, etc.
Please kindly note that we did specify quantitative parameters in these studies; the details are given in Section 5.2.2. In particular, we used ANOVA to test the effects of different experimental conditions on collaboration performance and subjective perceptions. Tukey HSD tests were conducted on all possible pairs of experimental conditions, as described in detail in lines 479 to 483.
In addition, we reported p-values for all major statistical tests (lines 486, 490, 495, 498, 500, and 505). The number of episodes is 3 per configuration, as reported in line 968. The episode length is 60 steps.
We will clarify them in the final version.
> The statement "number of API calls but also reduces context length" in the case of a centralized scheduler is not obvious and cannot serve as a serious advantage compared to other approaches.
We explain why the centralized scheme is advantageous in this regard:
1. Text redundancy: in the decentralized setting, each agent needs to compile a full system message that includes a distinct copy of the recipes, rules, etc., while in the centralized setting only one system message is needed to describe them. This significantly reduces the total number of input tokens (within the input context) of the LLM and therefore makes the framework more affordable.
2. Communication overhead: in the decentralized setting, each agent needs to receive both its own state and the states of the other agents, possibly in text, while in the centralized setting there is no communication among agents since all states are directly observable. This also saves cost on input tokens.
3. API calls: in the decentralized setting, generating one environment step requires N API calls (N = number of agents), while in the centralized setting a single API call yields the actions for all agents (as produced by the centralized LLM dispatcher).
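The per-episode savings from the N-calls-versus-one-call difference can be sketched as a simple count. The 60-step episode length matches the figure reported in our studies; the agent count of 4 is illustrative:

```python
def api_calls_per_episode(steps: int, n_agents: int, centralized: bool) -> int:
    """LLM API calls needed to run one episode under each coordination scheme.

    Decentralized: one call per agent per step.
    Centralized: one dispatcher call per step, producing actions for all agents.
    """
    return steps if centralized else steps * n_agents

# A 60-step episode with 4 agents (agent count is illustrative):
print(api_calls_per_episode(60, 4, centralized=True))   # 60
print(api_calls_per_episode(60, 4, centralized=False))  # 240
```

The text-redundancy saving compounds this: each of the 240 decentralized calls also carries its own copy of the recipes and rules in the system message.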
> The authors mention that "Cuisine World is a game that emulates a virtual kitchen in which several robotic agents," but there is no robotic component in the actions of the agents.
It is very common in the robotics community to study planning and multi-agent task allocation problems on their own. Please refer to [1].
Additionally, our virtual environment and Minecraft do have physics constraints. For example, agents cannot pick up an object unless they are close to it, and objects that are not in the correct state cannot be transformed into a new state.
We will be clearer about which robotics components we have.
[1] Korsah et al. A comprehensive taxonomy for multi-robot task allocation.
> The authors mention Inference Knowledge that "encapsulates insights and helpful hints for the agent," but nowhere does it say what these tips are and how they are significant for the final metric.
We apologize for the confusion. The inference knowledge provides tips to the agent, for example: "food orders keep coming; you should finish as many dishes as possible". Our ablation study in Table 2 (GPT-4 (full) vs. GPT-4 w/o inference knowledge) demonstrates its effectiveness: performance drops without the inference knowledge.
# Response to reviewer Wtib
> Dependency on Feedback for GPT-4: The reliance of GPT-4 on feedback, as highlighted in Table 2, raises concerns about its robustness. The significant drop in performance without feedback indicates a potential limitation in the model's autonomy and adaptability. This dependence on feedback may affect the generalizability of the proposed approach and warrants further investigation into the model's self-sufficiency.
Our present focus is on developing a robust game interaction framework leveraging GPT-4, currently the most advanced language model available. We understand that relying on such a model comes with inherent limitations. The observed dependency on feedback is one such anticipated drawback, and we have noted its impact on the model's performance in scenarios where feedback is not available, as shown in Table 2.
Similar observations (reliance of GPT-4 on feedback) have been noted in other papers in the language agents and robotics planning communities, e.g., [1, 2, 3]. Therefore, we believe this is a general, unsolved issue.
In our future work, we commit to investigating this problem through fine-tuning weaker LLMs and Retrieval-Augmented Generation.
[1] Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents
[2] Voyager: An Open-Ended Embodied Agent with Large Language Models
[3] RePLan: Robotic Replanning with Perception and Language Models
> Sensitivity to Prompt Inputs: The sensitivity of GPT-4 to prompt inputs, as indicated in Table 2, introduces a potential vulnerability. Without sufficient inference knowledge, the model's performance drops, raising questions about its robustness in real-world scenarios where prompt inputs vary. This sensitivity might limit the applicability of the proposed methodology and calls for additional exploration of prompt handling mechanisms.
The Table 2 ablations are mainly on system messages; since these are controlled in the background, end-users will not freely interact with the LM on this part. It is indeed a problem in general, and we can mitigate this issue through agent fine-tuning [1].
[1] https://github.com/THUDM/AgentTuning
> Limitation of Close-Sourced LLMs: The use of current high-performance LLMs, which are close-sourced, lacks transparency. The absence of access to the model's internal workings and training data raises concerns about reproducibility and scrutiny. This limitation hinders the ability of the research community to independently validate and build upon the findings, impacting the transparency and openness of the research.
We fully recognize that relying on closed-source LLMs limits the transparency of our research. The inability to access the model's internal workings and training data indeed poses challenges for reproducibility and independent validation. This is a critical concern in the field, and we are mindful of its implications for the broader research community. Moreover, our infrastructure also offers a pathway for evaluating language agents, which complements recent efforts on training open-sourced agent LMs [1]. We commit to working with the rest of the community to address this reliance on closed-source LLMs.
[1] https://github.com/THUDM/AgentTuning
> Text-Game Setting and Lack of Player Perspective: The use of a text-game setting, coupled with the absence of observing players' states from their perspective (e.g., through screen observation or cameras), is a potential limitation. The inability to infer player intentions from their perspective may limit the generalizability of the proposed infrastructure to real-world gaming scenarios, where visual cues and player states play a crucial role.
Thank you for raising an important point about the limitations inherent in our text-game setting and the lack of player perspective in our research. Your insights are valuable, and we would like to address them as follows:
We agree that the absence of direct observation of players' states, such as through screen observation or cameras, is a limitation in our current setup. This lack of visual cues and player perspectives could potentially affect the generalizability of our infrastructure to more immersive, real-world gaming scenarios.
In the context of our text-game environment, essential elements like the object state, environment state, and player state are directly obtained from the game engine. This setup ensures that critical game-related information is captured accurately and consistently.
However, we acknowledge that more nuanced aspects such as player sentiment, intentions, and specific behaviors (e.g., getting stuck in certain game areas) are not readily available from the game engine. These elements are indeed crucial for a comprehensive understanding of player interaction and experience in gaming scenarios.
Integrating such complex player-specific information remains a significant challenge for current multi-modal LLMs. These models have not yet fully developed the capability to interpret and respond to such subtle and diverse player inputs effectively.
We are committed to investigating these limitations in our future work. Our aim is to explore pathways to integrate more sophisticated player perspective analysis into our infrastructure. This could involve advancements in multi-modal learning models that are better equipped to interpret and utilize visual cues and player behavior data.
We also see potential in developing methods to simulate or infer player intentions and sentiments, thereby enriching the interaction model beyond the current text-based framework.
> Ethical Considerations and Risk of Misuse: While the paper acknowledges ethical considerations and the risk of misuse, it would benefit from a more in-depth exploration of potential risks and proposed mitigation strategies. The discussion on responsible AI practices is critical, but further elaboration on specific safeguards against misuse, especially in content generation and manipulation, would enhance the ethical discourse.
Thank you for emphasizing the importance of addressing ethical considerations and the risk of misuse in our work. We appreciate your suggestion to expand upon these aspects.
On implementing safeguards, we believe the following can be done:
1) Enhanced monitoring and filtering mechanisms to identify and block malicious inputs.
2) Regular audits and updates to the system to address evolving threats and maintain robust defense mechanisms against misuse.
3) Safeguards against prompt injection, to minimize the chance of gaming interactions being manipulated into generating malicious content.
We will include these discussions in the final version.
> Impact Beyond Gaming Scenarios: The paper primarily focuses on gaming scenarios, and there can be limited impact. A broader discussion on the potential applications and generalizability of MindAgent and CuisineWorld to diverse settings beyond gaming can strengthen the paper's impact in wide range of applications.
Thanks for this suggestion. While our current research primarily explores gaming scenarios through MindAgent in CuisineWorld and Minecraft, we recognize the importance of demonstrating how these technologies can be relevant and beneficial in a variety of other contexts.
For example,
1) One potential application of our research lies in the field of robotics, specifically in coordinating robotic actions in industrial and production environments. The principles of agent coordination and decision-making developed in our gaming scenarios can be directly applicable to managing complex tasks in manufacturing or logistics, where efficient, coordinated action is crucial.
2) Another area where our findings could be influential is in collaborative software development [1]. We can draw parallels between multi-agent coordination in games and collaborative efforts in software development projects.
We are committed to further exploring these and other potential applications in future research. Our goal is to extend the utility of our findings to diverse fields where coordination, decision-making, and collaboration are key.
[1] MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
# Response to reviewer iv1S
> No major weaknesses were found in this paper, however, I would have liked to see the authors explore the process of building an LLM for this particular task. From the reported results the only models shown capable of this are closed-source. It would be interesting to see if fine-tuning open models such as Llama could potentially compete with GPT-4 & Claude in such a task.
Thank you for your positive feedback on our paper and for highlighting an important area for future exploration – the development and utilization of open-source language models in complex gaming interactions. We appreciate your suggestion.
Our primary aim in the current phase of our research has been to establish a robust language agent framework. This framework is designed to facilitate the emergence of complex interactions in gaming scenarios, leveraging the capabilities of high-performance, albeit closed-source, models like GPT-4 and Claude.
You have rightly pointed out the potential of fine-tuning open-source models like Llama to compete with their closed-source counterparts in these tasks. We acknowledge the significance of this approach, especially in terms of enhancing transparency and accessibility in AI research. In our future work, we are committed to exploring the feasibility and effectiveness of open-source models in our language agent framework through agent finetuning [1].
[1] https://github.com/THUDM/AgentTuning
> Additionally, It's hard to reason about how good the performance is without any notion of the optimal plan. What is the best obtainable CoS?
Thanks for raising this. We believe a good reference can be obtained by benchmarking human players. However, our current game console (a text-based interface) is not ideal for human players, as it is less intuitive, making it hard to collect human gameplay data. We commit to improving this and offering a reference CoS score in future work.
# Response to reviewer WvPg
> The main weakness of this paper for the current submission is that I do not believe that it is appropriate for ACL: the authors have used LLMs, it seems successfully, to support reasoning in an interesting and novel AI context. But the only "linguistic" aspect of the paper is the prompt engineering, which is described in an appendix and is not particularly sophisticaed. I would suggest revising this paper and sending it to an AI or planning venue.
We thank you for your comments. Below, we try to address your concerns regarding the scope of this paper:
We believe our paper falls into the following categories (see https://aclrollingreview.org/cfp):
- Multimodality and Language Grounding to Vision, Robotics and Beyond
- NLP applications (we choose this as our primary topic)
Effectively, the main scientific problem (at a high level) studied in our paper is **language grounding to embodied actions**. More precisely, we study a specific class of embodied actions: actions made by a group of agents with the goal of facilitating collaboration and accomplishing complex tasks **efficiently** (a.k.a. multi-agent coordination).
It is understandable that some may perceive this problem from a more general and less linguistic perspective, as the multi-agent community might approach it without a linguistic component. However, in our work, we venture into a solution using language models, which plan, coordinate, and interact with the environment purely through text. More importantly, they likely think and reason in text as well, as we provide recipes, game rules, special instructions, and feedback all in text. Such a purely text-based solution, or angle of approach, used to be fictional, but we have observed the recent paradigm shift brought by LLMs, and this study has become one of the first few works exploring this direction.
So what makes a purely text-based solution special, or more "linguistic"? Let us go back to the topic above: "...Language Grounding to Vision, Robotics and Beyond". One of the many fascinating things about language grounding is seeing how this symbolic system (language) connects to physical, dynamic, and interactive environments, and how such a connection facilitates the way we approach problems (e.g., multi-agent collaboration) in these environments, even inspiring new solutions. We are doing exactly that in this paper.
To sum up, we believe our paper fits several of the topics suggested by the *ACL CFP well. Please feel free to let us know if you have further concerns; we are more than happy to help!
> A very large amount of content is included in the appendices -- in any revision, this is also something to potentially address, as there is useful content there and, while reading the paper, I constantly found myself flipping to the end.
Thanks for your suggestion, we will adjust the paper to make it more reader friendly.
# Response to AC
Dear Area Chair M4Tr,
Thank you for your insightful meta-review of our paper "MindAgent: Emergent Gaming Interaction". We are grateful for your acknowledgment of our work's novelty and importance.
It's encouraging to know that our paper has met NAACL's standards and addressed all the concerns raised by previous round reviewers. Your suggestion to submit our paper to NAACL is much appreciated, and we assure you that we'll make the necessary improvements for the final edition.
In line with your valuable feedback, we are dedicated to meticulously addressing each point raised by the reviewers as described in the author response section. In addition, we plan to resolve the minor issues identified by the reviewers, ensuring a comprehensive and improved version of our paper.
Thank you once again for your thorough review. We value your constructive feedback as it will help to enhance the quality of our paper.