# NeurIPS 2024: SMALL rebuttal
## Code link to ACs (Comment)
Dear AC,
As requested by Reviewer YTmU, we are providing an anonymized link to our code and environments for the submission. You can access them here: [https://anonymous.4open.science/r/NeurIPS7307submission](https://anonymous.4open.science/r/NeurIPS7307submission).
Best regards,
The Authors of #7307
## Global Rebuttal
We sincerely thank all reviewers for their thoughtful and constructive feedback. Your reviews have greatly improved the paper, and we are very grateful for the appreciation of the safety of our approach (BHvc, Rgzt), the innovativeness of our framework (YTmU, mHur), the comprehensiveness of our ablation study (BHvc), and the recognition of our research scenarios and potential application areas (YTmU, mHur, Rgzt).
In the rebuttal process, we have responded to each reviewer's specific questions and concerns in detail below. In addition, we will revise the paper to incorporate all other valuable suggestions and comments. Key clarifications and improvements:
1. Code availability: We have provided an anonymous GitHub repository to the Area Chair, containing implementations of both environments and the SMALL algorithm, ensuring reproducibility.
2. Baseline selection and ablation studies: We have justified our choice of baselines and conducted additional ablation studies as requested.
3. Multi-agent focus: We've explained why we focused on multi-agent settings despite potential applicability to single-agent scenarios.
4. Environmental details: We've provided more comprehensive information about observation and action spaces.
We appreciate the opportunity to clarify these points and strengthen our paper.
## Reply to Reviewer YTmU (s:4,c:4)
We thank the reviewer for the constructive comments and provide our point-wise response below.
> **Weaknesses 1:** My may concern is on the literature review, reproducibility and the experiment setting. The literature review on single-agent RL with Natural Language Constraints is missing. The literature review can be improved to show more relevant works.
**Response to W1:** We appreciate your feedback. In the related work section (lines 82-93), we discussed [1,2], which are in the setting of single-agent RL with Natural Language Constraints. If there are other domains or specific works you consider crucially relevant to ours, we are happy to include them.
> **Weaknesses 2 & Question 2:** The code is not available for SMALL. It is further inappropriate to have a paper that proposes a new benchmark without code in the submission. However I'd be happy to raise my rating if the authors could explain more on the implementation and experiment settings.
**Response to W2&Q2:** We appreciate your feedback. Although the relevant code was indeed not provided in our submission, we have forwarded an anonymous GitHub repository to the Area Chair in accordance with the conference guidelines. This includes the implementations of both environments and the SMALL algorithm. We plan to make the codebase publicly available upon publication to ensure full reproducibility and support further research in this area.
> **Weaknesses 3:** The baseline selection of the experiment should be justified. Why are MAPPO and HATRPO, which are relatively old and irrelevant to the specific problem selected?
**Response to W3:** Thank you for your question about our baseline selection. In the field of Safe MARL, MAPPO-Lagrange and HATRPO-Lagrange, proposed by [3], are well-known SOTA algorithms. We chose MAPPO and HATRPO as baselines for two main reasons. First, we followed the approach of previous work in this area [2]. Second, MAPPO and HATRPO still demonstrate SOTA performance in MARL [4,5]. By using these as baselines, we aimed to provide a reasonable comparison and show how our method performs against established algorithms in both standard and safe MARL contexts.
> **Question 1:** line 151 BERT is a encoder-only LM I believe. line 184 what does "such as" in "we utilize LLMs, such as Llama3-8B" mean? what are the exact LLM used in SMALL? Are chatGPT or Llama used as LM or you trained or fine-tuned your own LM? These details are more important and should be clearly explained in the main paper.
**Response to Q1:** We apologize for the confusion. To clarify our method: we use BERT (an encoder-only model) to encode both Textual Constraints and Texted Observations, generating the embeddings $E_l$ and $E_{o,t}^i$ respectively (line 151). Regarding line 184, we specifically use Llama3-8B to generate $v_t^i$ in Equation 5 by querying the model with a prompt and obtaining a binary response (Appendix E.3). We then multiply $v_t^i$ by $\text{dist}(E_l, E_{o,t}^i)$ as shown in Equation 5. We will make these details clearer in the next version of our paper.
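For concreteness, below is a minimal sketch of how these two signals can be combined, assuming the BERT embeddings and the Llama3-8B binary verdict have already been produced by separate helpers; the rescaling of cosine similarity to $[0,1]$ is our assumption here for illustration, since Equation 5 only requires $\text{dist}(\cdot,\cdot)$ to lie in that range.

```python
import torch
import torch.nn.functional as F

def constraint_score(E_l: torch.Tensor, E_o: torch.Tensor, v: int) -> float:
    """Combine the embedding distance with the LLM's binary verdict (Eq. 5 style).

    E_l : constraint embedding from the BERT encoder (1-D tensor).
    E_o : embedding of one agent's texted observation at step t (1-D tensor).
    v   : binary response (0 or 1) obtained by prompting Llama3-8B with the
          constraint and the observation description (Appendix E.3).
    """
    # Cosine similarity rescaled from [-1, 1] to [0, 1] -- an assumed form of
    # dist(E_l, E_o); the paper only states that the value lies in [0, 1].
    sim = F.cosine_similarity(E_l.unsqueeze(0), E_o.unsqueeze(0)).item()
    return v * 0.5 * (sim + 1.0)

# Toy usage with random embeddings standing in for BERT outputs.
score = constraint_score(torch.randn(768), torch.randn(768), v=1)
```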
---
Reference:
[1] Prakash, B., Waytowich, N.R., Ganesan, A., Oates, T. and Mohsenin, T., 2020, March. Guiding Safe Reinforcement Learning Policies Using Structured Language Constraints. In SafeAI@ AAAI (pp. 153-161).
[2] Yang, T.Y., Hu, M.Y., Chow, Y., Ramadge, P.J. and Narasimhan, K., 2021. Safe reinforcement learning with natural language constraints. Advances in Neural Information Processing Systems, 34, pp.13794-13808.
[3] Gu, S., Kuba, J.G., Chen, Y., Du, Y., Yang, L., Knoll, A. and Yang, Y., 2023. Safe multi-agent reinforcement learning for multi-robot control. Artificial Intelligence, 319, p.103905.
[4] Yu, C., Velu, A., Vinitsky, E., Gao, J., Wang, Y., Bayen, A. and Wu, Y., 2022. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35, pp.24611-24624.
[5] Zhong, Y., Kuba, J.G., Feng, X., Hu, S., Ji, J. and Yang, Y., 2024. Heterogeneous-agent reinforcement learning. Journal of Machine Learning Research, 25(32), pp.1-67. URL http://jmlr.org/papers/v25/23-0488.html
## Reply to Reviewer mHur (s:4,c:3)
We thank the reviewer for the constructive comments and provide our point-wise response below.
> **Weaknesses 1:** Does the performance gain really comes from the language?
**Response to W1:** Thank you for this insightful comment. We respectfully argue that the performance gain does come from the language processing, for two reasons:
1. Generalization: According to the experimental results and the detailed constraints in Appendix F, our method can handle a wide range of natural language constraints, not just predefined categories. Meanwhile, we randomly picked only 30 constraints for pre-training the encoder; using those alone as a dataset would not be sufficient to train a cost predictor with this level of generalization.
2. Semantic understanding: In SMALL, the language model captures semantic relationships between different constraints and states, enabling more intelligent decision-making than a simple mapping between states and random vectors would allow. This capability stems from the language model's pre-training on enormous amounts of human language.
Nevertheless, we appreciate your suggestion to test the system with random vectors as input in order to verify whether the performance comes from the language. Below we show an ablation in which a random embedding replaces the human-constraint embedding $E_l$ as input:
| | Reward | Cost |
| -------- | -------- | -------- |
| HAPPO | 12.5±2.4 | 22.4±4.0 |
| SMALL-HAPPO with Random $E_l$ Embedding | 13.9±2.7 | 34.5±5.8 |
| SMALL-HAPPO | 11.6±2.1 | 5.8±4.2 |
The slight improvement in reward for Random embedding may stem from increased exploration, as discussed in [1]. This aligns with the concept of using prediction errors as an exploration bonus, which is at the core of the RND method, where the exploration bonus is defined as the difference between a randomly initialized target network $f(x)$ and a predictor network $\hat{f}(x;\theta)$. In our ablation, replacing $E_l$ with a random embedding creates a scenario where the "prediction" task becomes essentially random, potentially leading to consistently high intrinsic rewards. This could encourage the agent to explore more broadly, sometimes stumbling upon rewarding states more frequently than a purely extrinsic reward-driven agent. However, this comes at the cost of meaningful constraint adherence, as evidenced by the significantly higher cost. This comparison shows that the language understanding component of SMALL is crucial for achieving superior performance.
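For reference, here is a minimal sketch of the RND bonus from [1] that the discussion above alludes to; the network sizes and names are illustrative and are not taken from our implementation.

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Random Network Distillation bonus [1]: the exploration bonus is the
    prediction error between a fixed, randomly initialised target network f(x)
    and a trained predictor network f_hat(x; theta)."""

    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))
        for p in self.target.parameters():  # the target network is never trained
            p.requires_grad_(False)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Bonus = || f(x) - f_hat(x; theta) ||^2, averaged over feature dims.
        return (self.target(obs) - self.predictor(obs)).pow(2).mean(dim=-1)

# Toy usage: a high bonus flags observations the predictor has not yet fit.
bonus = RNDBonus(obs_dim=100)(torch.randn(32, 100))
```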
> **Weaknesses 2:** Weakness of ablation studies
**Response to W2:** We appreciate your suggestion regarding the ablation study. Our approach of removing individual components to compare with the full algorithm is a common and widely accepted practice in the field, as seen in works like [2,3]. This method effectively demonstrates the importance of each component by observing the performance impact when it's removed.
However, we take your point about an additive approach and have conducted additional experiments to address it. It is important to note that our baselines (HAPPO, HAPPO-Language) do not include a framework for processing natural language, so simply adding components to them would not equate to our SMALL algorithm. Instead, we created a new baseline using HAPPO-Language with the SMALL framework (as shown in Figure 1) but without components (a), (b), (c), and (d). We then additively included each component, as you suggested. The results of this ablation in LaMaSafe-Goal (Ant) with the 2-agent Easy layout are as follows:
| | Reward | Cost |
| -------- | -------- | -------- |
| HAPPO-Language with SMALL w/o a,b,c,d | 12.7±3.5 | 20.1±5.0 |
| HAPPO-Language with SMALL w/o b,c,d | 11.8±2.7 | 17.4±6.6 |
| HAPPO-Language with SMALL w/o c,d | 6.8±5.7 | 8.1±1.4 |
| HAPPO-Language with SMALL w/o d | 5.1±1.5 | 4.8±1.0 |
| SMALL-HAPPO | 11.6±2.1 | 5.8±4.2 |
Here, (a) denotes fine-tuning of the language encoder, (b) the decoder, (c) the descriptor in Appendix D.3, and (d) $v^i_t$ in Eq. 5.
> **Weaknesses 3:** Typos
**Response to W3:** Thank you for pointing out those typos. We will fix them in the revision.
> **Questions 1:** Is this method can only be applied to "multi-agent" RL? I believe this method can be generally applied to single-agent RL, so I wonder the authors decided to test this idea on multi-agent settings?
**Response to Q1:** Thank you for your question. While our method could potentially be applied to single-agent RL, we specifically chose to focus on multi-agent settings for several reasons. First, multi-agent environments present unique challenges, such as inter-agent coordination and collision avoidance, which are particularly relevant when dealing with natural language constraints. Second, the complexity of interpreting and adhering to language constraints increases significantly in multi-agent scenarios, where multiple agents must understand and cooperatively follow the same instructions. Lastly, we identified a gap in the literature regarding safe multi-agent RL with natural language constraints, making this an important area for contribution. However, we agree that exploring the application of our method to single-agent RL could be a valuable direction for future research, potentially revealing interesting comparisons between single and multi-agent performance under language constraints.
> **Questions 2:** Examples of "redundant" from the original L and how can it be removed?
**Response to Q2:** Thank you for this question. Here are examples from both environments:
In LaMaSafe-Grid, an original constraint might be: "You have a pair of magic shoes to walk on lava and meadow. But you cannot swim. Be careful not to collide with other robots!" The condensed version is: "Can touch lava and grass, avoid water and collisions with other agents."
For LaMaSafe-Goal, an original constraint could be: "Robots must steer clear of any blue circles in the area. These represent dangerous zones that could damage your circuits. Also, be careful not to collide with other robots as it may cause malfunctions!" The condensed version is: "Avoid blue circular objects and collisions with other agents."
These condensed versions remove descriptive elements, focusing on the core instructions to make constraints more concise and actionable for agents. All constraints are detailed in Appendix F.
---
Reference:
[1] Burda, Y., Edwards, H., Storkey, A. and Klimov, O., 2018. Exploration by random network distillation. arXiv preprint arXiv:1810.12894.
[2] Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M. and Silver, D., 2018, April. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI conference on artificial intelligence (Vol. 32, No. 1).
[3] Ren, S., He, K., Girshick, R. and Sun, J., 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28.
## Reply to Reviewer BHvc (s:7,c:4)
We thank you for your positive support. Below, we provide a point-wise response to your questions.
> **Weaknesses 1:** I had trouble understanding Figure 4 due to the colours used in the plots for the lines. I have colour blindness and was not able to distinguish between the colours and hence could not comment on those comparison in the plots. It would be good if the authors could rectify the colours in the plots.
**Response to W1:** Thank you for pointing this out. We will apply a colorblind-friendly palette in the camera-ready version. To help clarify the results for the 4-agent Ant scenarios, we have converted the data into a table format for easier comparison:
| Algorithm | Average Accumulative Reward (Easy) (Mean ± Std) | Average Episode Cost (Easy) (Mean ± Std) | Average Accumulative Reward (Medium) (Mean ± Std) | Average Episode Cost (Medium) (Mean ± Std) |
|---------------------|-------------------------------------------------|------------------------------------------|---------------------------------------------------|--------------------------------------------|
| MAPPO | 6.5 ± 2.9 | 10.1 ± 5.7 | 4.5 ± 3.8 | 47.7 ± 22.8 |
| HAPPO | 10.2 ± 2.8 | 8.6 ± 2.3 | 4.4 ± 1.8 | 17.1 ± 3.0 |
| SMALL-MAPPO (Ours) | 5.1 ± 0.9 | 0.3 ± 0.1 | 3.8 ± 1.9 | 8.3 ± 3.7 |
| SMALL-HAPPO (Ours) | 6.8 ± 1.1 | 4.4 ± 0.8 | 4.2 ± 1.3 | 6.7 ± 4.6 |
| MAPPO-Lagrangian | 5.3 ± 0.8 | 0.8 ± 0.5 | 3.5 ± 0.5 | 2.2 ± 1.3 |
| HAPPO-Lagrangian | 7.0 ± 1.5 | 2.4 ± 1.1 | 5.2 ± 1.5 | 1.3 ± 0.5 |
> **Weaknesses 2:** The scalability experiments don’t seem to be convincing enough. The method is just tested out in scenarios with 4 agents which is not too complex for most of the SOTA MARL algorithms. It would be good to see if the model is able to infer “no-collision amongst agents themselves” constraint when the environment is too crowded and how it would affect the performance of SMALL.
**Response to W2:** Thank you for your comment. Compared to the two-agent layouts, the observation space for 4 Ant agents is already quite large at 416 dimensions (see the response to W3 below). When comparing the average performance between 2 and 4 agents, the gap is significant (compare Figure 3 with Figure 4, or see the table in the response to W1). This substantial performance difference is primarily due to the increased environmental complexity: we kept the same map size while adding two more of the complex Ant agents, which dramatically increases the difficulty of coordination and collision avoidance.
> **Weaknesses 3 and Question 1:** The paper is missing some implementational details about the environments like observation space used for the baselines, etc. (specific examples about the observation could help). Could you please give some more details about the observation descriptors used for SMALL and the other baselines? Like what is the observation space for the baselines?
**Response to W3 & Q1:** Thank you for your question. Here's a more detailed breakdown of the observation and action spaces for each environment:
| For each agent | Action Space | Observation Space |
| -------- | -------- | -------- |
| LaMaSafe-Grid | Discrete(7) | Dict('direction': Discrete(4), 'image': Box(0, 255, (7, 7, 3), uint8)) |
| LaMaSafe-Goal (Point) | Box(-1.0, 1.0, (2,), float64) | (12,) + (16 * number of agents) |
| LaMaSafe-Goal (Car) | Box(-1.0, 1.0, (2,), float64) | (24,) + (16 * number of agents) |
| LaMaSafe-Goal (Ant) | Box(-1.0, 1.0, (8,), float64) | (40,) + (16 * number of agents) |
Some important notes:
1. In LaMaSafe-Goal, each agent's observation consists of two components: the regular observation and the radar state, which depends on the number of agents.
2. The table shows the action and observation spaces for a single agent. For example, in a scenario with 4 Ant agents, the total observation space would be [40 + (16 * 4)] * 4 = 416.
3. SMALL uses the same observation length as other baselines.
We will make sure to include these details more explicitly in our revised paper.
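For concreteness, here is a small sketch of the dimension bookkeeping described in note 2 above, using the per-agent base dimensions and the 16-dimensional per-agent radar block from the table.

```python
def total_obs_dim(base_dim: int, n_agents: int, radar_per_agent: int = 16) -> int:
    """Joint observation size in LaMaSafe-Goal: each agent sees its base
    observation plus a radar block that scales with the number of agents."""
    per_agent = base_dim + radar_per_agent * n_agents
    return per_agent * n_agents

# 4 Ant agents: (40 + 16 * 4) * 4 = 416, matching the example in note 2.
assert total_obs_dim(base_dim=40, n_agents=4) == 416
```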
> **Weaknesses 4:** It seems like the natural language constraints that were used for training the triplets loss function is the same as the ones used in testing. How would the model perform when it is shown constraints that are different to the ones shown in training?
**Response to W4:** As detailed in Appendix F, we randomly sample 30 constraints for pre-training BERT using the triplet loss function to align semantic understanding. For policy training and evaluation, we use a different set of constraints that are not part of this pre-training set. Therefore, our reported results already demonstrate the model's performance on new constraints, showcasing its ability to generalize to unseen language instructions. To avoid further confusion, we will also briefly mention this important detail in the main section of the paper.
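To illustrate the pre-training step, here is a minimal sketch of a triplet objective over BERT sentence embeddings; the checkpoint name, mean pooling, cosine distance, and margin are assumptions for illustration, not the exact settings in our code. The example sentences are taken from the constraints quoted in the response to Q2 above.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    """Mean-pooled BERT embeddings for a batch of constraint sentences."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)            # (B, H)

triplet_loss = torch.nn.TripletMarginWithDistanceLoss(
    distance_function=lambda a, b: 1.0 - F.cosine_similarity(a, b), margin=0.5)

# Anchor and positive share a constraint category; the negative comes from another.
anchor   = embed(["Avoid blue circular objects and collisions with other agents."])
positive = embed(["Robots must steer clear of any blue circles in the area."])
negative = embed(["Can touch lava and grass, avoid water and collisions with other agents."])
loss = triplet_loss(anchor, positive, negative)
loss.backward()  # fine-tunes the encoder on the pre-training constraints
```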
> **Typos: Minor:** Missing space after a period on line 34 / Typo in Figure 1 (Environment)
**Response:** Thank you for pointing these out. We will fix them in the revision.
## Reply to Reviewer Rgzt (s:5,c:4)
We thank you for your positive support. Below, we provide a point-wise response to your questions.
> **Weaknesses 1:** The proposed method is for MARL. While the algorithm for dealing with MARL is standard, the innovative part for handling safe constraints does not seem specific to MARL. The authors should clarify why MARL, instead of RL, is important.
**Response to W1:** We appreciate your comment and agree that our method could potentially be applied to single-agent RL. However, the MARL context is crucial for our work for several reasons: (1) It addresses the complexity of multi-agent interactions under natural language constraints, which is fundamentally different from single-agent scenarios. (2) It demonstrates the scalability and generalization of our approach across varying team sizes. (3) It requires agents to develop a collective interpretation of constraints, a challenge unique to multi-agent settings. (4) Many real-world applications of natural language constraints (e.g., autonomous vehicle fleets, robot teams) are inherently multi-agent problems. While we acknowledge that some aspects of our approach could be applied to single-agent RL, we believe that the most interesting challenges and potential applications lie within the multi-agent domain, particularly in scenarios where coordinated, safe behavior among multiple agents is critical.
> **Weaknesses 2:** The current experiments are conducted on a scenario created by the authors. It would be more convincing to evaluate on more scenarios. If such a scenario is difficult to find from existing ones, we should question how realistic the problem is in practice.
**Response to W2:** We appreciate your comment. Our environments, LaMaSafe-Grid and LaMaSafe-Goal, are built upon established frameworks, MiniGrid and Safety-Gymnasium respectively. This approach leverages proven simulation platforms while introducing the novel aspect of natural language constraints. By using these existing scenarios and incorporating language, we address a gap in current benchmarks, which typically do not consider natural language constraints in multi-agent settings.
Our two environments cover a diverse range of complexities, from 2D to 3D, discrete to continuous action spaces, and varying levels of state space complexity. This diversity helps demonstrate the robustness and generalizability of our method. While we agree that evaluating on more scenarios would be valuable, we believe our current framework provides a solid foundation for exploring the integration of natural language constraints in safe multi-agent reinforcement learning.
> **Question 1:** In line 177 and Eq 5, why use different notation sim and dist? Shall they be the same?
**Response to Q1:** Thank you for pointing this out. Yes, they are the same; we apologize for the typo. The cosine similarity between the constraint embedding $E_l$ and the observation embeddings should be consistently denoted as $\text{dist}(E_l,\boldsymbol{E}_{o,t})\in [0,1]^{n}$, as in Eq. 5.
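For completeness, one convention consistent with this range (stated here as an assumption about the exact form, since the text above only specifies that the value lies in $[0,1]^{n}$) rescales each per-agent cosine similarity:

$$
\text{dist}(E_l,\boldsymbol{E}_{o,t}) = \left[\tfrac{1}{2}\Big(1+\cos\big(E_l,\,E_{o,t}^{i}\big)\Big)\right]_{i=1}^{n} \in [0,1]^{n}.
$$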
> **Question 2:** How are the triplets collected in order to train a constraint embedding model? What if a new scenario is given?
**Response to Q2:** Thank you for this important question. In our current implementation, we use a semi-automated approach to generate triplets. We start by collecting human-authored descriptions of specific safety requirements (e.g., avoiding blue hazard zones). We then use GPT to augment this data, generating semantically equivalent but linguistically diverse variants. Positive pairs are formed from semantically similar descriptions, while negative samples are drawn from unrelated constraint categories. This process is followed by human verification to ensure quality. For new scenarios, we can rapidly apply this workflow to generate relevant triplet datasets. While this approach may require some manual adjustment for entirely new environments, it provides a solid foundation for quick adaptation.
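As a rough illustration of this workflow, here is a minimal sketch of the triplet-assembly step, with hypothetical category names and example variants; the GPT augmentation and human verification are assumed to have happened offline.

```python
import random

# Hypothetical categories and variants: human-authored seeds plus GPT-augmented
# paraphrases, verified by humans before being used for training.
variants_by_category = {
    "avoid_blue_hazards": [
        "Robots must steer clear of any blue circles in the area.",
        "Avoid blue circular objects.",
    ],
    "avoid_water": [
        "You cannot swim, so stay out of the water.",
        "Keep away from water tiles at all times.",
    ],
}

def sample_triplet(data, rng=random):
    """Anchor and positive share a constraint category; the negative comes from
    a different, unrelated category."""
    cat_pos, cat_neg = rng.sample(list(data), 2)
    anchor, positive = rng.sample(data[cat_pos], 2)
    negative = rng.choice(data[cat_neg])
    return anchor, positive, negative

print(sample_triplet(variants_by_category))
```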