# General Response to All Reviewers
We thank all the reviewers for their insightful comments and constructive suggestions to strengthen our work. In addition to the responses to specific reviewers, we would like to highlight our contributions and clarify some major concerns here.
**1. Our Contributions**
We are glad that **all reviewers** agree that our study of LLMs' capacity for planning and communication in embodied multi-agent cooperation is **valuable** and **exciting**; that our proposed modular framework is **effective** [aixR, tHvL, scdJ] and shows the potential of LLMs for embodied AI and multi-agent cooperation [tHvL, scdJ]; that our user study is well designed [Cpp9]; and that the failure cases and limitations are well explored [aixR, tHvL, Cpp9], which makes the study more **comprehensive** and **helpful** for future research [tHvL].
**2. Clarifications**
* **[More framework details]** We add a new working example on TDW-MAT in the supplementary materials (Figure 1) to illustrate the data flow, and elaborate below on the input, output, and implementation of each module, as suggested by reviewers scdJ, aixR, and Cpp9; a minimal code sketch of the full pipeline follows the module descriptions. We will also release our code for easy reproducibility.
* The Observation Module
* Input: the raw observations provided by TDW-MAT: a 256×256 first-person RGB image, a depth image, and an instance segmentation mask
* Output: the states of the key objects, including target objects, containers, and the agents (positions, names, IDs, and, for agents, the objects they are holding), and a local occupancy map
* Implementation: build a 3D point cloud from the RGB-D image, then extract the states of the key objects and construct a local occupancy map.
* The Belief Module
* Input: the extracted information from the Observation Module and the last step's high-level plan generated by the Reasoning Module
* Output: a semantic map acting as the scene memory, a task progress recorder tracking the number of target objects already transported, the states of the agents, and the agent's action and dialogue history
* Implementation: the semantic map is constructed and updated with the local occupancy map and the key objects' states extracted at each step; the task progress recorder is initialized with all zeros and updated whenever the agent is within range of the goal position; the state of the other agent is updated whenever it is seen; the action history stores the latest 10 high-level plans executed by the agent, while the dialogue history stores the latest 3 messages sent by all agents.
* The Communication Module
* Input: the semantic map, task progress recorder, states of the agents, and action and dialogue history given by the Belief Module
* Output: a message to be sent if the agent chooses to communicate
* Implementation: we first convert the semantic map, task progress recorder, and states of the agents into a textual *State Description* using templates, then use GPT-4 to generate the message, with the concatenation of the Instruction Head, Goal Description, State Description, Action History, and Dialogue History as the prompt.
* The Reasoning Module
* Input: the same as the Communication Module, plus the message it generated
* Output: A high-level plan to execute
* Implementation: we first compile all high-level plans available in the current state into a textual *Action List*, including
* *go to room \**
* *explore current room*
* *go grasp target object/container \**
* *put holding objects into the holding container*
* *transport holding objects to the bed*
* *send a message: "\*"*
Then we concatenate all prompt components and use GPT-4 to generate a high-level plan with chain-of-thought prompting.
* The Execution Module
* Input: High-level plan generated by the Reasoning Module and the semantic map from the Belief Module
* Output: a low-level action that TDW-MAT can accept
* Implementation: to fulfill a high-level plan, the agent first uses an A*-based planner to find the shortest path from its current location to the target location, if needed, and then carries out the interaction required to complete the high-level plan.
We will also add these details to our revised manuscript.
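To make the data flow concrete, here is a minimal sketch of how the five modules are wired together in one decision step. All class and method names are illustrative placeholders, not our actual code:

```python
# Simplified per-step loop of the modular framework (illustrative names only).

class CooperativeAgent:
    def __init__(self, obs_mod, belief_mod, comm_mod, reason_mod, exec_mod):
        self.obs_mod, self.belief_mod = obs_mod, belief_mod
        self.comm_mod, self.reason_mod, self.exec_mod = comm_mod, reason_mod, exec_mod
        self.last_plan = None

    def step(self, raw_obs):
        # Observation Module: RGB-D (+ segmentation) -> key-object states
        # and a local occupancy map.
        states, local_map = self.obs_mod.process(raw_obs)
        # Belief Module: update the semantic map, task progress recorder,
        # agent states, and action/dialogue history.
        belief = self.belief_mod.update(states, local_map, self.last_plan)
        # Communication Module: draft a message via GPT-4 in case the agent
        # decides to communicate.
        message = self.comm_mod.generate(belief)
        # Reasoning Module: choose a high-level plan (possibly "send a
        # message") with chain-of-thought prompting over the Action List.
        self.last_plan = self.reason_mod.plan(belief, message)
        # Execution Module: A*-based navigation plus interaction primitives.
        return self.exec_mod.to_low_level_action(self.last_plan, belief)
```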
* **[Our framework design is general, while the implementation of each module is flexible and can be adapted to any environment]** Experiments on two different embodied challenges show that our modular framework can be adapted to different observation spaces (symbolic scene graphs on C-WAH vs. ego-centric RGB-D images on TDW-MAT) and action spaces (abstracted navigation on C-WAH vs. low-level navigation control on TDW-MAT). We would like to emphasize that our modular design and mostly **training-free** modules powered by LLMs make adaptation to new environments easier than for RL methods, which require large amounts of training data, or heuristic methods, which require substantial human effort to design delicate rules. To provide more supporting evidence, we add a new experiment on TDW-MAT where the instance segmentation mask is no longer provided as an observation. To deal with this, we fine-tune a Mask R-CNN on an image dataset pre-collected in the task environment for object detection and segmentation. The results are shown below.
Table A. Performance on TDW-MAT without GT segmentation mask.
| Method | Transport Rate | EI |
| - | - | - |
| HP | 0.40 | / |
| HP + HP | 0.55 | 21% |
| HP + LLM | 0.60 | 33% |
| LLM + LLM | 0.64 | 34% |
* **[Evaluation baselines]** As suggested by reviewers aixR and Cpp9, we add a new experiment testing the RL baseline MAT's performance on TDW-MAT. We believe the experiments in the main paper are a meaningful and fair comparison, given that the HP baseline, like our framework, requires no training and was previously the strongest.
We hope our responses below convincingly address all reviewers’ concerns.
# Response to Reviewer aixR
*Thank you for your insightful and constructive comments! We have added additional experiments and will modify our paper according to your comments.*
> Q1: LLMs are expected to be good-ish at communicating with each other in language, but this seems to be a key finding emphasized in the experiments section.
Our key finding is that LLMs can help embodied agents cooperate better, rather than that LLMs are good at communicating. Being good at chatting alone does not guarantee better cooperation, as can also be seen in the ablation study, where LLMs with no belief module perform poorly.
> Q2: Many algorithmic details missing in the main text.
Please refer to the general response for most of these details; the remaining questions are addressed below.
> Q2.1: What is the learning algorithm?
Our framework utilizes pre-trained LLMs and requires no training.
> Q2.2: On the naming of the last module
Good point! We have renamed the *Planning Module* to the *Execution Module* to avoid confusion. Thanks for the suggestion!
> Q3.1: "LLM Agents know when to request help and can respond to others’ requests”: this claim cannot only be supported by an example in Figure 3d.
Thanks for the constructive suggestion! We add more quantitative analysis to support our claims. Since automatic analysis is hard, we manually labeled all the requests in messages sent by agents and checked whether they were responded to: there are 139 requesting messages in total, 72% of which were responded to, within 6.71 steps on average. To clarify, from our observation the agents "request" not only when stuck but also, if not always, for better efficiency. There is no way to tell whether HP is "requesting" or "responding" since it has no way to communicate with its collaborator.
> Q3.2: Regarding HP+LLM and LLM+LLM have practically the same performance
Our experiments are designed to show that our embodied agents built with LLMs **are better cooperators**, whether the partner is heterogeneous (HP+LLM is better than HP+HP) or homogeneous (LLM+LLM is better than HP+HP), rather than that "LLMs cooperate better together". That argument [lines 215-216] is made only for C-WAH, and we will tone it down for better rigor in our revised manuscript.
> Q4.1: Regarding marl baselines
Thanks for providing more related work! We add a new experiment testing the RL baseline MAT's performance on TDW-MAT; the results are shown below. According to the MAT paper and code, MAT assumes shared observations among all agents, so we provide the GT observations of the other agents to the MAT agents as well.
Table B. MAT's performance on the TDW-MAT.
| Method | Transport Rate | EI |
| - | - | - |
| MAT | 0.16 | -286% |
The performance is not very good, which may be due to the increased complexity of our setting, where multiple **decentralized** embodied agents cooperate to solve a **long-horizon** task in a **largely unexplored space**. [1,2,3] also find that end-to-end RL methods struggle to finish the task due to complicated observations, long-horizon tasks, and sparse rewards, while hierarchical planning-based methods achieve better performance. We believe the experiments in the main paper are a meaningful and fair comparison, given that the HP baseline, like our framework, requires no training and was previously the strongest.
> Q4.2: Was the 5-module framework motivated by prior work in cognitive frameworks?
Thanks for providing more related work! We draw a lot of inspiration from this line of work and will add these references to our revised manuscript.
> Q5: Writing issues
Thanks for pointing this out! We will do another round of proofreading and keep the point of view consistently third-person.
> Q7: Section 3.2.3 states that “effective communication needs to solve two problems: what to send and when to send.” The section addresses “what to send” but does not mention addressing the “when to send” problem.
The Reasoning Module accounts for "when to send" together with the high-level plan generation [lines 38-39]. We will add this statement to Section 3.2.3 for better clarification.
> Q8: Line 179 says “We adopt [a prior work’s] hierarchical planner.” Authors should clarify that this is only when “HP” is written in the table, and is used in lieu of the LLM as the reasoning module.
We will clarify this in the experiments section.
> Q9: Briefly describe regression planning (line 180) in a sentence or parenthetical.
Regression planning searches backward from the subgoal for an action sequence that achieves it, as sketched below. We'll add this description and the reference [4] to the revised manuscript.
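For intuition, a minimal sketch of regression (backward) planning over symbolic facts; the actions and predicates here are illustrative placeholders, not taken from the planner we use:

```python
# Regression planning: start from the subgoal and repeatedly replace
# facts achieved by an action with that action's preconditions.
from collections import namedtuple

Action = namedtuple("Action", ["name", "preconds", "effects"])

ACTIONS = [
    Action("grasp(apple)", {"at(apple)"}, {"holding(apple)"}),
    Action("goto(apple)", set(), {"at(apple)"}),
]

def regression_plan(goal, depth=5):
    """Return an action sequence achieving `goal`, or None."""
    if not goal:
        return []
    if depth == 0:
        return None
    for act in ACTIONS:
        if act.effects & goal:  # the action achieves part of the goal
            new_goal = (goal - act.effects) | act.preconds
            rest = regression_plan(new_goal, depth - 1)
            if rest is not None:
                return rest + [act.name]
    return None

print(regression_plan({"holding(apple)"}))  # ['goto(apple)', 'grasp(apple)']
```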
> Q10: Table 1: How do the HP + LLM cooperate? One is presumably working in symbolic space and the other in language space.
Both agents take the same environment observations as input and output environment-acceptable primitive actions. Although their inner working mechanisms differ, the LLM agent can still model the HP agent's state and plan cooperatively.
> Q11: it is possible that human gets better at collaboration with more trials. Have the authors tried to control for potential side effects of this
We made sure each subject became familiar with the interface in a few pilot trials and then carried out the same number of trials under each scenario [lines 249-252].
[1] Retrospectives on the embodied ai workshop.
[2] Watch-and-help: A challenge for social perception and human-ai collaboration.
[3] The threedworld transport challenge: A visually guided task-and-motion planning benchmark towards physically realistic embodied ai.
[4] Planning as search: A quantitative approach.
*We hope that our response has addressed your concerns and turned your assessment to the positive side. If you have any more questions, please feel free to let us know during the rebuttal window. We appreciate your constructive suggestions and detailed comments. Thank you!*
Best,
Authors
# Response to Reviewer Gy2h
*We appreciate your positive and constructive comments! We have modified our paper accordingly.*
**[More details on the user study]** We recruited the subjects from the authors' social network; none had prior knowledge of the method details. The first evaluation criterion is the same as the one used in the main experiments, i.e., the average number of steps taken to finish the task [line 259]. After each trial with a baseline collaborator, we asked the subjects to rate the agent they had just cooperated with on a 7-point Likert scale based on three criteria adapted from [1] [lines 252-258]. We adapted the web interface built by [1] to allow humans to control the agents, interact with the environment, and communicate with the agents through a chat box. The subjects are instructed to finish the task as quickly as possible and are informed whether or not there is a collaborator. We will include these details in our revised manuscript.
**[Framework generalizability]** Our framework design is general, while the implementation of each module is flexible and can be adapted to any environment. Experiments on two different embodied challenges show that our modular framework can be adapted to different observation spaces (symbolic scene graphs on C-WAH vs. ego-centric RGB-D images on TDW-MAT) and action spaces (abstracted navigation on C-WAH vs. low-level navigation control on TDW-MAT). We would like to emphasize that our modular design and mostly **training-free** modules powered by LLMs make adaptation to new environments **easier** than for RL methods, which require large amounts of training data, or heuristic methods, which require substantial human effort to design delicate rules.
**[Related works]** Thank you for pointing out the related works! We will add more related work on situated dialogue to our revised manuscript.
[1] Watch-and-help: A challenge for social perception and human-ai collaboration.
*Please let us know if you have any further questions about our paper. We really appreciate your time! Thank you!*
Best,
Authors
# Response to Reviewer tHvL
*We appreciate your positive comments on our novel framework and well-designed experiments! We address your concerns in detail below.*
**[The novelty of the work]** We are the first to study LLMs' capacity for planning and communication in embodied multi-agent cooperation, which raises the new challenges of how to model the other agent's state and how to communicate for cooperative planning. To solve these challenges, we designed a novel modular framework, including a belief module to explicitly track the states of the other agents, and proposed using LLMs to enable **direct communication between agents**, which is both **effective** and **interpretable**. We believe our work underscores the potential of LLMs for embodied AI and lays the foundation for future research in multi-agent cooperation.
**[Whether the improvement is significant enough]** TDW-MAT is a rather challenging environment where common RL methods struggle, as shown in our new experiments with the MARL baseline MAT in Table B and also discussed in [1]. HP is a strong baseline on the previous benchmark that requires a lot of human-written heuristics, while our training-free and easy-to-build framework still surpasses it, transporting more objects in the given time. What's more, as shown in Table A in the general response, in our new experiments with no GT segmentation mask provided by the environment, our framework built with LLMs shows a clear improvement over HP (EI improved from 21% to 34%).
Table B. MAT's performance on the TDW-MAT.
| Method | Transport Rate | EI |
| - | - | - |
| MAT | 0.16 | -286% |
[1] The threedworld transport challenge: A visually guided task-and-motion planning benchmark towards physically realistic embodied ai.
*We hope that the additional explanations have convinced you of the merits of our work. Please do not hesitate to contact us if you have other concerns.*
*We appreciate your time! Thank you so much!*
Best,
Authors
# Response to Reviewer Cpp9
*We thank the reviewer for the constructive comments. We have modified our paper according to these comments.*
> Q1: lack of technical details and descriptions
We add a new working example on TDW-MAT in the supplementary materials (Figure 1) to illustrate the data flow and elaborate on the input, output, and implementation of each module. Please refer to the general response for more details on the framework.
> Q1.1 What are the outputs of the observation module? How is it built or trained (if it is)
The output is high-level information such as visual scene graphs, objects, relationships between objects, and other agents' locations [line 103]. When the environment provides a symbolic scene graph, the Observation Module is simply a filter extracting the objects we care about; when the environment provides ego-centric RGB-D images and instance segmentations, it first maps the RGB-D images into 3D point clouds, then extracts the positions and relationships of the objects with the help of the instance segmentations; when the environment provides ego-centric RGB-D images only, we fine-tune a Mask R-CNN on an image dataset pre-collected in the task environment for object detection and segmentation. A sketch of the RGB-D-to-point-cloud step follows.
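For concreteness, a minimal sketch of the standard pinhole-camera unprojection used to build the point cloud (the intrinsics `fx, fy, cx, cy` are illustrative parameters, not values from our setup):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Unproject a depth map of shape (H, W) into camera-frame 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Points falling inside an instance mask give that object's 3D position;
# projecting all points onto the ground plane yields the local occupancy map.
```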
> Q1.2: under the communication module, it lists the Instruction Head, Goal Description, State Description, etc, but it does not say how any of these parts are connected or what their input/output relationships are
These are the components of our designed prompt and are simply concatenated sequentially [line 128], with more details provided in App. A.3 and A.4; some actual prompts with these components contextualized are in App. C. A rough illustration is given below.
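As a rough illustration of the sequential concatenation (the function and the GPT-4 call are illustrative placeholders, not our exact code):

```python
def build_prompt(instruction_head, goal_desc, state_desc,
                 action_history, dialogue_history):
    # Each component is first rendered to text via templates,
    # then the components are joined in a fixed order.
    return "\n".join([instruction_head, goal_desc, state_desc,
                      action_history, dialogue_history])

# message = call_gpt4(build_prompt(...))  # Communication Module output
```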
> Q1.3: on the Planning Module in section 3.2.5, I have no idea what this actually means.
To fulfill a high-level plan, the agent first uses an A*-based planner to find the shortest path from its current location to the target location, if needed, and then carries out the interaction required to complete the high-level plan. A sketch of the navigation part is given below.
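For illustration, the navigation part can be implemented as standard A* search on the occupancy map (a 4-connected grid here for brevity; this is a sketch, not our exact planner):

```python
import heapq

def a_star(grid, start, goal):
    """Shortest path on a 2D occupancy grid (0 = free, 1 = occupied)."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    frontier = [(h(start), 0, start, [start])]
    visited = set()
    while frontier:
        _, cost, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return path
        if pos in visited:
            continue
        visited.add(pos)
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (pos[0] + dx, pos[1] + dy)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in visited):
                heapq.heappush(frontier,
                               (cost + 1 + h(nxt), cost + 1, nxt, path + [nxt]))
    return None  # target unreachable
```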
> Q2: the method is far to hand-coded and ad-hoc to really tell us anything useful about how this system would generalize past this specific environment
Our framework design is **general**, while the implementation of each module is **flexible** and can be **adapted** to any environment. Experiments on two different embodied challenges show that our modular framework can be adapted to different observation spaces (symbolic scene graphs on C-WAH vs. ego-centric RGB-D images on TDW-MAT) and action spaces (abstracted navigation on C-WAH vs. low-level navigation control on TDW-MAT). We would like to emphasize that our modular design and mostly **training-free** modules powered by LLMs make adaptation to new environments easier than for RL methods, which require large amounts of training data, or heuristic methods, which require substantial human effort to design delicate rules.
> Q2.1: the observation module just uses a GT segmentation mask?
To keep a fair comparison with the HP baseline, we use the GT segmentation mask as in previous works [1]. Moreover, the implementation of our framework is flexible, and any vision perception model can serve as the observation module when no GT segmentation mask is provided. To provide more supporting evidence, we add a new experiment on TDW-MAT where the instance segmentation mask is no longer provided as an observation. To deal with this, we fine-tune a Mask R-CNN on an image dataset pre-collected in the task environment for object detection and segmentation; a sketch of this setup is given below. The results are shown in Table A in the general response.
With stronger pre-trained vision perception models, the whole pipeline can be training-free even with no GT segmentation mask provided.
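A minimal sketch of the fine-tuning setup, assuming the standard torchvision Mask R-CNN API (the number of classes is a placeholder):

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 1 + 20  # background + task-relevant categories (placeholder)

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box-classification head for the task-specific categories.
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, NUM_CLASSES)

# Replace the mask head likewise, then fine-tune on images
# pre-collected in the task environment.
in_feats_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feats_mask, 256, NUM_CLASSES)
```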
> Q3: the novelty of the work
We are the first to study LLMs' capacity for planning and communication in embodied multi-agent cooperation, which raises the new challenges of how to model the other agent's state and how to communicate for cooperative planning. To solve these challenges, we designed a novel modular framework, including a belief module to explicitly track the states of the other agents, and proposed using LLMs to enable direct communication between agents, which is both **effective** and **interpretable**. We believe our work underscores the potential of LLMs for embodied AI and lays the foundation for future research in multi-agent cooperation.
> Q4: Concerning different dataset split from the original paper
Our settings are not identical; due to the high cost of evaluation involving LLMs, we sampled a smaller test set following the original benchmark's guidelines.
> Q5: the CICERO paper would be a really good comparison to the paper as it also has a language model combined with planning components. It is different in many ways, but the comparison would be useful.
We agree that a comparison with more baselines, including CICERO, would be useful. CICERO is a successful and large project that requires a large amount of domain-specific conversation data to fine-tune the language model, which is infeasible to collect in our targeted embodied environments. An effective design choice of CICERO is the combination of strategic reasoning using RL and an imitation dialogue model using LMs. We now add a new RL baseline, MAT, with results shown in Table B.
Table B. MAT's performance on the TDW-MAT.
| Method | Transport Rate | EI |
| - | - | - |
| MAT | 0.16 | -286% |
The performance is not very good, which may be due to the increased complexity of our setting, where multiple **decentralized** embodied agents cooperate to solve a **long-horizon** task in a **largely unexplored space**.
[1] The threedworld transport challenge: A visually guided task-and-motion planning benchmark towards physically realistic embodied ai.
*We hope that the additional explanations have convinced you of the merits of our work. Please do not hesitate to contact us if you have other concerns. We appreciate your time!*
Best,
Authors
# Response to Reviewer scdJ
*We appreciate your positive and constructive comments! We have added additional working examples and modified our paper according to your comments.*
1. **New working example and details about the framework** We add a new working example on TDW-MAT in the supplementary materials (Figure 1) to illustrate the data flow, and elaborate on the input, output, and implementation of each module in the general response.
*We sincerely appreciate your comments. Please feel free to let us know if you have further questions.*
Best,
Authors