## Response to Reviewer fYN9

We sincerely thank you for your time and constructive comments. Below, we provide detailed replies to your comments and hope we can resolve your major concerns.

> the improvements may not be contributed by the proposed new ideas.

Our ablation studies in Figure 3 show that the proposed ideas of (1) **end-to-end training** and (2) **neural guidance** indeed bring significant improvements for all tasks:

- By comparing the proposed "INSIGHT" model with the variant "Fixed" (without end-to-end training) in Figure 3, we can see that end-to-end training brings a significant improvement of over 85% Normalized Return across all tasks, by allowing the model to refine structured states with reward signals.
- By comparing "INSIGHT" and "w/o NG" (without neural guidance) in Figure 3, we can see that neural guidance improves the performance on four tasks, particularly on Pong and BeamRider.

These ablation studies indicate that knowledge distillation from vision foundation models alone is not sufficient to achieve these improvements; the proposed end-to-end NS-RL training and neural guidance play critical roles in improving the performance. We kindly refer the reviewer to the ablation studies in Section 4.2 for more analyses, which validate the effectiveness of the proposed new ideas.

<!-- As discussed in Section 1 (line 50) and Section 2 (line 107), conventional NSRL approaches suffer from the inability to refine their structured states with reward signals. We would like to point out that the proposed INSIGHT framework serves as a general solution to this important but unsolved problem. In addition to (1) distilling knowledge from foundation models, we also (2) refine structured states with rewards and (3) propose a new neural guidance scheme for training symbolic policies. We have included variants of INSIGHT for analyzing the effect of each of them, namely "w/o Pretrain" for (1), "Fixed" for (2), and "w/o NG" for (3), in Figure 3, Table 3, Figure A1, and Table A7. Our findings can be summarized as follows.
* Distilling knowledge from foundation models indeed improves task performance.
* Refining structured states with rewards improves both task performance and coordinate prediction, and it is the most critical idea.
* The proposed neural guidance scheme also improves both task performance and coordinate prediction.
Therefore, all three ideas contribute to the improvement of INSIGHT when compared to conventional NSRL approaches. -->

> The experiments only conducted on a single benchmark.

The focus of NSRL is to improve the transparency of decision-making agents, especially for producing interpretable and verifiable policies. The Atari games are considered very challenging for current NSRL approaches and have been widely used in recent NSRL works [1,2,3]. In the meantime, we acknowledge that more results on realistic tasks are important for validating the effectiveness of the proposed method, so we conducted preliminary experiments on MetaDrive [4], a challenging environment for autonomous driving in complex scenes.

| | INSIGHT | Neural | Coor-Neural | Random |
| -------- | -------- | -------- | -------- | -------- |
| Success Rate | 0.37±0.21 | 0.17±0.12 | 0.07±0.05 | 0±0 |

The table above compares our method with the neural baselines in terms of success rate when using 1M samples from the environment. As shown, INSIGHT achieves a success rate of 0.37 with only 1M training samples, surpassing the neural baselines.
Due to the preparation time and the complexity of the environment, the training results for 2M and 5M samples will be released later.

[1] Interpretable and Explainable Logical Policies via Neurally Guided Symbolic Abstraction, NeurIPS 2023.
[2] Symbolic visual reinforcement learning: A scalable framework with object-level abstraction and differentiable expression search, arXiv 2022.
[3] Discovering symbolic policies with deep reinforcement learning, ICML 2021.
[4] MetaDrive: Composing diverse driving scenarios for generalizable reinforcement learning, TPAMI 2022.

> missing references on NS methods.

Thank you for the pointers to these important references; we will cite them in the paper revision.

---

### Discussion Stage

| Success Rate (samples \ method) | INSIGHT | Neural | Coor-Neural | Random |
| -------- | -------- | -------- | -------- | -------- |
| 1M | 0.37±0.21 | 0.17±0.12 | 0.07±0.05 | 0±0 |
| 2M | 0.42±0.13 | 0.19±0.06 | 0.21±0.14 | 0±0 |
| 5M | 0.51±0.09 | 0.49±0.11 | 0.41±0.12 | 0±0 |

## Response to Reviewer 964p

We sincerely thank you for your time and constructive comments. Below, we provide detailed replies to your questions and hope we can resolve your major concerns.

> Q1: The approach of using FastSAM followed by DeAot to generate segmentation for the entire episode only fits scenarios where the object of interest appears in the first frame and continues to appear in all subsequent frames without the emergence of other objects of interest. This condition is typically only met in very simple toy environments.

We would like to clarify that we have addressed the issue of newly emerged objects by **periodically re-evaluating FastSAM every tenth frame to include newly emerging objects** (see the sketch after our reply to Q2). This enables DeAot to effectively track these new objects as they appear. We have explained this practice in Lines 561-565 of Appendix A.1, and we will revise the main paper to clarify this point.

It is worth noting that object emergence is very common in Atari games, such as BeamRider and Enduro. Consequently, our model's ability to handle newly emerging objects is crucial for its success in these dynamic gaming environments.

<!-- The combination of FastSAM and DeAot can handle the emergence of objects. As described in A.1, when processing a list of images, we apply the FastSAM model at a frequency of ten frames to segment objects, and the DeAot model is capable of capturing new objects that emerge during the interval. In fact, object emergence is very common in Atari games (e.g., BeamRider and Enduro), and INSIGHT can match the performance of neural agents on them, which shows INSIGHT is capable of handling the emergence of objects. Thus, this is not a limitation of the construction of the frame-symbol dataset. -->

> Q2: Employing a pre-trained agent for sampling in each task does not work well for environments with multiple scenes within an episode, as these pre-trained agents cannot access scenes that appear later. A typical example of such an environment is the Atari game Montezuma's Revenge.

For the tasks we studied in this work, we find that using scenes sampled from a pretrained agent is good enough for learning the visual perception module. For future extensions of our model to environments like Montezuma's Revenge, this issue can be solved by sampling new scenes online and adding them to the frame-symbol dataset during policy learning.
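To make the procedure from our replies to Q1 and Q2 concrete, the following is a minimal sketch of how the frame-symbol dataset is assembled. `segment_frame` and `Tracker` are hypothetical placeholders for the FastSAM and DeAoT interfaces; only the ten-frame re-segmentation period and the use of normalized object centers follow Appendix A.1.

```python
import numpy as np

RESEG_PERIOD = 10  # re-run FastSAM every tenth frame (Appendix A.1)

def segment_frame(frame):
    """Hypothetical FastSAM wrapper: returns a list of binary masks,
    ordered by the model's confidence scores."""
    raise NotImplementedError

class Tracker:
    """Hypothetical DeAoT wrapper that propagates object masks across frames
    and can absorb newly detected objects."""
    def add_objects(self, frame, masks): ...
    def propagate(self, frame): ...   # returns {object_id: mask}

def build_frame_symbol_dataset(frames):
    tracker = Tracker()
    dataset = []
    for t, frame in enumerate(frames):
        if t % RESEG_PERIOD == 0:
            # Periodic FastSAM pass: newly emerged objects are handed to DeAoT.
            tracker.add_objects(frame, segment_frame(frame))
        masks = tracker.propagate(frame)
        h, w = frame.shape[:2]
        symbols = {}
        for obj_id, mask in masks.items():
            ys, xs = np.nonzero(mask)
            if len(xs) == 0:                     # object missing in this frame
                symbols[obj_id] = (0.0, 0.0)
            else:                                # normalized center coordinates
                symbols[obj_id] = (xs.mean() / w, ys.mean() / h)
        dataset.append(symbols)
    return dataset
```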
<!-- We acknowledge that a fixed frame-symbol dataset sampled by pre-trained agents might miss some rare visual observations, but the experimental results show that it suffices to match the performance of neural agents on many tasks. The infamous difficulty of Montezuma's Revenge stems from the challenge of exploration [1], which needs special treatments such as the Go-Explore algorithm [2]. Although effective exploration is not a claim of our paper, note that both FastSAM and DeAot are scalable enough for on-the-fly usage. As a result, the frame-symbol dataset can be built incrementally during policy learning, which enables agents to explore effectively. We leave this as future work, since the focus of this paper is to present a general framework for end-to-end visual NSRL rather than effective exploration. -->

> Weakness #2: In order to transform image inputs into symbolic representations, the method described in this paper sacrifices a significant amount of information, which is indispensable in most practical applications, as illustrated by the following questions 3 and 4.

> Q3: The article uses the center as the coordinate for objects, essentially ignoring the object's shape and status. However, in practical visual scenes required for various manipulation tasks (represented by Meta-world), both shape and status are crucial factors.

The principle behind using object coordinates is that, for interpretable decision-making, the state representations themselves should be interpretable. This is a common choice in recent NSRL works [1,2]. Additional attributes such as the bounding boxes and colors of objects are directly available, but in our preliminary experiments they did not enhance task performance on Atari tasks.

We agree that object coordinates alone are not enough for manipulation tasks. Similar to using FastSAM and DeAot for object segmentation and tracking, for manipulation tasks we can employ models for affordance prediction to detect areas to contact, and then record the coordinates of the area centers. Thus, rather than a weakness, choosing object coordinates is a design choice for interpretable control, and it can be straightforwardly combined with other object information.

[1] Interpretable and Explainable Logical Policies via Neurally Guided Symbolic Abstraction, NeurIPS 2023.
[2] Symbolic visual reinforcement learning: A scalable framework with object-level abstraction and differentiable expression search, arXiv 2022.

> Q4: The authors recognize that "the object coordinates x are subject to limited expressiveness," and therefore, they use a neural actor $\pi_{neural}$ as the strategy for interacting with the environment, with the expectation of distilling it into the EQL actor. However, the input $w_{perception}$ to $\pi_{neural}$, which is the outcome of the training on Learning Object Coordinates, also loses a significant amount of information, such as shape, color, state, etc.

As described in the right column of line 183, the neural actor takes as input the hidden representations produced by the visual encoder, rather than the predicted object coordinates. The hidden representations also include other information such as shape and color. We will revise the paper to clarify this point.

> Weakness #3: For a simple game like Pong, the explanation in Figure 5 seems extremely complex and difficult to read. I think this is not the goal that interpretability aims to achieve; at least, it's not human-friendly.
The explanations in Figure 5 look lengthy because we instruct the language model to cover the details of the input symbolic policies, which is necessary for tackling the hallucination of language models. This is a design choice for generating credible explanations, rather than a weakness. When conciseness is preferred over completeness, it is straightforward to generate concise explanations from the current explanations using language models.

> line 139, "objets" -> "objects"

Thank you for your careful reading; we will correct this in the revision.

> Q5: If **w_perception** is also trained during the policy learning phase, are the training labels obtained in real time through FastSAM and DeAot? If so, I would like to know the proportion of computational resources consumed by calling these vision foundation models relative to the total training consumption.

No. We use a fixed frame-symbol dataset in which the training labels are pre-extracted through FastSAM and DeAot, rather than obtained in real time (a schematic sketch of this pretraining step is given at the end of this response).

> Q6: I'm curious about the objects you focused on in these nine environments because, in my past attempts, SAM wasn't very sensitive to non-realistic game environments, especially some of the very small objects. I believe it's very necessary to include the FastSAM segmentation results of the first frame of each game, as well as the DeAot segmentation results, in the appendix.

We include some examples of coordinate predictions in Figure A3 of the Appendix, and we will add the segmentation results of FastSAM and DeAot to the appendix in the paper revision.

To improve the segmentation performance of FastSAM and DeAot, we resize the images to 1024 × 1024 before segmentation, as mentioned in Line 565 of Appendix A.1. The coordinates are normalized to [0,1] so that we can resize the images to 84 × 84 during policy learning. To further enhance segmentation, we also increase the `points_per_side` and `crop_n_layers` parameters.

> I understand the importance and difficulty of neuro-symbolic reinforcement learning. My main concern with this paper is its applicability. If the method could be applied to more realistic environments, such as the autonomous driving simulation environment CARLA or the navigation environment Habitat, I am willing to engage in discussion and increase the score.

Indeed, results on more realistic tasks can strengthen this paper, so we are actively working on experiments in the MetaDrive environment [3], a newer and faster simulator for autonomous driving than CARLA. The preliminary results are as follows:

| | INSIGHT (ours) | Neural | Coor-Neural | Random |
| -------- | -------- | -------- | -------- | -------- |
| Success Rate | 0.37±0.21 | 0.17±0.12 | 0.07±0.05 | 0±0 |

The table above compares our method with the neural baselines in terms of success rate after training for 1 million steps. Our proposed method outperforms the neural baselines by a large margin in this environment. We are still training these models and will share the results for 2M and 5M steps later.

[3] MetaDrive: Composing diverse driving scenarios for generalizable reinforcement learning, TPAMI 2022.

We hope our response can address your concerns, and please let us know if there is any further question.
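To complement the reply to Q5, the snippet below sketches how the visual perception module could be pretrained on the fixed frame-symbol dataset. The module name, layer sizes, and squared-error coordinate loss are illustrative assumptions; only the sigmoid existence head, the four stacked 84 × 84 frames, the 256-object limit, and the use of a fixed, pre-extracted dataset follow the descriptions in this response.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerceptionNet(nn.Module):
    """Illustrative CNN with a coordinate head and an existence head."""
    def __init__(self, max_objects=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),   # 4 stacked frames
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 64 * 7 * 7                           # for 84x84 inputs
        self.coord_head = nn.Linear(feat_dim, max_objects * 2)  # (x, y) per object
        self.exist_head = nn.Linear(feat_dim, max_objects)      # existence logits

    def forward(self, obs):
        h = self.encoder(obs)
        coords = torch.sigmoid(self.coord_head(h))      # assumed squashed to [0, 1]
        return h, coords, self.exist_head(h)

def pretrain_step(net, optimizer, obs, target_coords, target_exists):
    """One supervised step on a batch from the fixed frame-symbol dataset.
    target_coords: (B, 256, 2) float; target_exists: (B, 256) float 0/1 flags."""
    _, coords, exist_logits = net(obs)
    coords = coords.view(*target_coords.shape)
    # Assumption: regress coordinates only for objects present in the frame.
    mask = target_exists.unsqueeze(-1)
    loss_coord = ((coords - target_coords) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    # Existence loss: binary cross-entropy on the sigmoid of the existence logits.
    loss_exist = F.binary_cross_entropy_with_logits(exist_logits, target_exists)
    loss = loss_coord + loss_exist
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```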
---

### 1. Concerns about the pretrained CNN

Indeed, the hidden representations from the pretrained CNN may also suffer from potential information loss. However, this is precisely where the importance of end-to-end training comes into play. Through this process, the CNN can be fine-tuned using reward signals. If attributes such as shape, color, and state are pivotal for effective gameplay, the end-to-end training phase becomes helpful: it allows the pretrained features to be refined by incorporating feedback from the reward signals. In essence, end-to-end training ensures that the CNN learns to prioritize and incorporate all relevant features essential for optimal performance in the given task.

Concerning the visual perception performance for Pong, Table 3 shows that the pretrained visual perception (the "Fixed" column) performs relatively well on Pong compared with other games. During the online policy learning phase, the visual perception is fine-tuned with additional scenes via reward signals, which further enhances its performance.

### 2. The breakdown of the time allocation for each component and the concern about the increased computation time for extending the dataset during online training

The estimated training times for each component, using an NVIDIA RTX 3090 GPU, are as follows:

1. **Pretraining the visual perception**: 2-3 hours.
2. **Policy learning**: this phase involves fine-tuning the visual perception and is divided into two scenarios:
   - When fine-tuning the visual perception, it takes about 7-8 hours.
   - If the visual perception is fixed (not fine-tuned), it takes around 6 hours.

It is important to note that training $\pi_\text{neural}$ and distilling $\pi_\text{neural}$ into $\pi_\text{EQL}$ occur concurrently during the policy learning phase.

Regarding the concern about the computational overhead from expanding the dataset online, it is manageable for two key reasons:

1. Dataset expansion is infrequent, only necessary when new object categories are detected in the environment.
2. Running FastSAM and DeAot on a new image takes approximately 14 ms and happens in a separate process, thereby not interrupting the main policy learning workflow.

This approach ensures that the online dataset expansion does not significantly impact the overall computation time.

### 3. Questions about the frame-symbol dataset

> the files named id_new.png and id_seg.png

- The file named `<id>_new.png` identifies the new objects that the DeAoT model begins to track starting from the `<id>`-th image in the sequence.
- The file named `<id>_seg.png` displays the segmentation results generated by the FastSAM model when applied to the `<id>`-th image, illustrating how objects are differentiated within that frame.

> the ordering of the five objects

- The objects are ordered by their confidence scores from FastSAM, from the highest to the lowest, ensuring that the ordering is consistent and not manually configured.

> how is the identity of predicted objects fixed in environments like BeamRider and Enduro?

- To maintain consistent object identities in each environment, all episodes are combined into a single video sequence. The DeAoT model then tracks objects across this unified sequence, ensuring that each object's identity remains stable and consistent.

> how are missing objects accounted for in these cases?

- When an object is missing from a frame, it is marked as 'non-existent' in the dataset, and its coordinates are set to (0,0). This provides a systematic way to account for absent objects without disrupting the overall tracking and identification process.
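To make these conventions concrete, the short sketch below shows how a frame's symbols can be packed into the fixed-size input consumed by the EQL actor (256 object slots × 2 coordinates × 4 stacked frames = 2048, as detailed under "The input dimension for EQL" further below). The slot assignment by integer object id is an illustrative assumption.

```python
import numpy as np

MAX_OBJECTS = 256   # maximum object count used in all experiments
FRAME_STACK = 4     # number of consecutive frames

def frame_to_slots(symbols):
    """symbols: {object_id: (x, y)} with ids ordered by FastSAM confidence.
    Missing objects are simply absent from the dict and default to (0, 0)."""
    slots = np.zeros((MAX_OBJECTS, 2), dtype=np.float32)
    for obj_id, (x, y) in symbols.items():
        if obj_id < MAX_OBJECTS:
            slots[obj_id] = (x, y)
    return slots

def eql_input(frame_symbols):
    """Stack the last FRAME_STACK frames into one 256 * 2 * 4 = 2048-dim vector."""
    assert len(frame_symbols) == FRAME_STACK
    return np.concatenate([frame_to_slots(s).ravel() for s in frame_symbols])
```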
> The frame-symbol datasets for all Atari environments and MetaDrive.

The datasets for all the specified environments, including MetaDrive, are accessible via this [link](https://drive.google.com/file/d/1GH3G2CoqnUGmJETjUJmdGZrQINZ2p6VY/view?usp=sharing). These datasets, along with our codebase, will be released for the sake of reproducibility and further research.

### 4. Questions about the codebase

Thank you for bringing these issues to our attention. The file `cleanrl/sam_track/demo.ipynb` mentioned in the readme is a typo; you should refer to the `cleanrl/sam_track/demo.py` script instead. As for the file `cleanrl/ppo_atari_eal.py` mentioned in line 16, it is another typo: the correct file is `cleanrl/ppo_atari_eval.py`.

For pretraining the CNN, we use the `cleanrl/train_cnn.py` script. Meanwhile, the training of the policies $\pi_\text{neural}$ and $\pi_\text{EQL}$ is handled by the `cleanrl/train_policy_atari.py` script.

### 5. The input dimension for EQL

The input dimension for EQL is determined without assuming prior knowledge of the number of objects in the environment. To accommodate various scenarios, we set a maximum object count of 256 for all experiments, a threshold that is sufficient for all Atari games. Each object is represented by two coordinates (x and y), and to capture temporal dynamics, we include four consecutive frames. Consequently, the input dimension for the EQL network is 256 × 2 × 4 = 2048. For any instance where the actual number of objects is less than 256, the remaining input slots are filled with the coordinates (0,0) to maintain a consistent input dimension.

We thank you for your feedback and questions, and we will improve our paper and codebase according to this valuable feedback. Meanwhile, we look forward to hearing whether your concerns have been fully addressed by our response. Feel free to let us know if there is any further question or discussion!

## Response to Reviewer uqeU

We sincerely appreciate your time and the positive feedback on our work, especially your advice on improving the clarity and accessibility of this paper. Below, we first respond to the questions and limitations raised in your review, and then discuss your constructive suggestions.

> Q1. Why are visual observation less involved in Pong?

In the right column of line 306, we state that "visual observation [is] less involved in Pong". This is because in Pong the background is fixed and there are only three moving objects (two paddles and a ball), which is simpler than the other environments.

> Q2. Will you open source your code for future comparisons?

We have included our source code in the supplementary materials. We will refine the code base and make it public after the review period.

> Q3. Is there any specific reason why you use the human scores from (Wilson et al., 2018), instead of one of [4] or [5]?

Since CGP was released more recently than [4] or [5], we think its human scores might be closer to the current human level.

---

**Your Suggestions/Comments on Writing**

> Authors could motivate ... I provide references for this below.
>
> You can further highlight the importance of ... can help tackling many RL related issues

Thank you for providing examples that can motivate interpretable RL. We will modify the introduction section accordingly.

> Furthermore, it would be interesting to have a discussion on the level of transparency of ..., that can rely on large language model to improve transparency.
Yes, as mentioned in the third paragraph of Section 1, intrinsic transparency does not imply accessibility to non-expert users, and the two should be considered different criteria for interpretable RL approaches. We believe this issue has been overlooked by the NSRL community and warrants further discussion. Logic-based methods (like NUDGE) can characterize reasoning procedures (e.g., has_key -> open_door) directly because of their innate ability to express relations, so from this perspective we agree that they are more transparent than methods based on mathematical equations. At the same time, users need prior knowledge to understand the semantics of logical expressions, which can be a problem in practice.

> There are implication on using (potentially hallucinating) LLMs, that might be worth discussing.

In fact, the process of concept grounding mentioned in Section 3.3 is designed to tackle the hallucination of language models. Our prompts for soliciting explanations are also designed to force language models to focus on the input symbolic policy. Nevertheless, we agree that a separate section on this issue will strengthen the paper.

> I would advise to write in bold both at the start of the captions and ....
> In 4.2, instead of « Task performance », write , ....
> For all table, put in bold only the results ...
> You can give details about ...

Thank you for your suggestions on improving the paper writing. We will revise the paper accordingly.

> The paragraph starting l.420 might benefit from ..., as its correlated with the position of the ball.
> In your Fig. 1, the explanation ..., but still is able to point out the importance of the enemy's position for the policy.

Indeed, by leveraging language explanations it is more straightforward to spot weird agent behaviors (e.g., the enemy position being weirdly important for the policy). This is an advantage of soliciting language explanations using language models. Thank you for suggesting related work that supports this finding.

> There should be a discussion on .... (Put e.g. Fig 4 in appendix if space is needed).

We agree that the alignment issue needs discussion and can be interesting future work.

> You extract the bounding boxes of ..., using this.

Thank you for suggesting this meaningful evaluation of the image-symbol dataset.

> For the loss L_exists, you haven't introduced pi,j. I guess that it's the prediction of ci,j. This should be introduced.

We are sorry that we missed this point. $p_{i,j}$ is the probability obtained by passing the logits output by the existence layer of the CNN through the sigmoid function, i.e., $p_{i,j} = \sigma(z_{i,j})$, and it serves as the prediction of $c_{i,j}$.

> In algorithm 1:
> - what does x_i and x_i+1 stand for? It should be written inside the algorithm?

These actually have the same meaning as `iteration == n`.

> - for iteration=n do, is this meant to underline that you only execute it after the neural policy optimisation or is it a typo?

This is a typo. What we intended to write is "for iteration = 1, 2, …, n do".

**For Figures**

> In Figure 1:
> - what is y_agent,1 to y_agent,4? Can you show these values extracted from the image (or maybe the sequence of images)

$y_{agent,1}$ to $y_{agent,4}$ represent the vertical coordinates of the agent in four consecutive frames.

> - From what are the policy interpretation and decision explanation extracted? The image and actions? The figure could make the whole process clearer

Policy interpretations are extracted using the learned symbolic policies accompanied by descriptions of the task. Decision explanations require the symbolic expression, the action at the corresponding moment, and the gradient of the action with respect to the input.
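As an illustration of the gradient component mentioned above, the snippet below sketches how the sensitivity of a chosen action to the input coordinates can be computed with automatic differentiation. `eql_policy` is a placeholder for the learned EQL actor; this is a sketch rather than the exact procedure used in the paper.

```python
import torch

def action_input_gradient(eql_policy, coords, action):
    """Gradient of the selected action's logit w.r.t. the input coordinates.

    eql_policy: callable mapping a (2048,) coordinate vector to action logits
                (placeholder for the learned EQL actor).
    coords:     torch.Tensor of shape (2048,), the structured state.
    action:     int, the action taken at this time step.
    """
    coords = coords.clone().detach().requires_grad_(True)
    logits = eql_policy(coords)
    logits[action].backward()
    # Large-magnitude entries indicate coordinates that most influenced the action.
    return coords.grad
```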
> Figure 2:
> - The visual perception module does not correspond to the explanations given in 3.2...Which one does INSIGHT use?

The visual perception module connects its hidden-layer output to the neural actor, while the coordinates output by an FC layer are sent to EQL for interpretability.

---

Yes, we will include the discussion on misalignment and the evaluation on the dataset in the revised paper.

| Env \ Method | INSIGHT | SPACE+MOC (w/o OC) | SPACE+MOC |
| -------- | -------- | -------- | -------- |
| Pong | 97.5 | 91.5 ±0.1 | 87.4 ±0.9 |

For evaluation, as shown in the table above, we provide the F-score results of INSIGHT and MOC on the Pong dataset generated by OCAtari. INSIGHT's score is significantly higher than MOC's and close to 1, serving as a preliminary demonstration of the effectiveness of our dataset segmentation.

Regarding the misalignment, we offer a possible evaluation process: for a task *E*, one can first generate a natural language explanation *N* for the policy expression *P*. Then, another language model is used to select actions based on *N* and the structured states. If this LLM scores high, misalignment situations (for example, in the Pong environment, the agent decides to move up following the formula while the LLM believes it should move down) are unlikely to occur.

---

Response to Reviewer uqeU:

Yes, we will incorporate the discussion on the misalignment and the evaluation using the OCAtari dataset in our revised manuscript.

The misalignment issue may stem from the hallucinations of LLMs, which sometimes leverage their prior knowledge to fabricate explanations not anchored in the policy expressions. To address this issue, we have meticulously designed our prompts, as outlined in Table A11 of the Appendix. For example, Rule 1 allows the use of prior knowledge by LLMs but insists that explanations remain rooted in the policy's mathematical underpinnings. Rule 2 dictates that LLMs' policy explanations should be based on logits and action probabilities. Rule 7 demands a comprehensive breakdown of every term in the policy's formula within the explanations. Our findings suggest that such prompt engineering significantly curtails LLM hallucinations, enhancing the alignment between generated explanations and the actual policy actions. These insights into misalignment will be elaborated upon in the revised paper.

Here are our initial findings (F-score) from the Pong environment within the OCAtari dataset, compared with the results of **SPACE+MOC**, as reported in the MOC paper [1]:

| Environment | INSIGHT | SPACE+MOC (w/o OC) | SPACE+MOC |
| -------- | -------- | -------- | -------- |
| Pong | 97.5 | 91.5 | 87.4 |

These results highlight that our method, INSIGHT, achieves an impressive F-score of 97.5%, significantly outperforming **SPACE+MOC**. We intend to extend these evaluations to additional environments and will include the comprehensive results in our paper.

[1] Delfosse, Quentin, et al. "Boosting object representation learning via motion and object continuity." Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2023.

## Response to Reviewer M12T

We sincerely appreciate your time and valuable feedback. Below, we offer detailed responses to your questions and hope we can address your concerns effectively.

> The framework assumes ...
> Is it capable of relational (first-order) representation or limited to propositional representations?

EQL is capable of expressing relational logic. For instance, to represent the spatial relationship between objects, one can use the difference between their coordinates.

> The assumption of .... How would it work in environments where non-coordinate information plays a vital role in action decisions?
> Is the proposed framework generalizable to environments where other features (except coordinates) play a central part in action decisions?

The use of object coordinates as structured states is common in the recent literature on Atari tasks [1,2], and our experimental results support the effectiveness of this design choice. Meanwhile, it is straightforward to extend INSIGHT to use other information. For example, we can encode categorical features as one-hot vectors, and we can exploit object affordances through the locations of interactable areas in manipulation tasks.

> The demonstration of the generated textual explanations is limited to .....
> Is it possible to assess the quality of generated explanations beyond examples, e.g. with qualitative results or elaborated arguments?

The issue of hallucinations (or the quality of generated explanations) in LLMs remains an open question. A potential evaluation protocol is as follows. For a task *E*, one can first generate a natural language explanation *N* for the policy expression *P*. Then one uses another language model to select actions based on *N* and the structured states. If the cumulative reward received by this language model matches that of *P*, one may say that *N* is aligned with *P*.

> The title says end-to-end, but actually the proposed algorithm requires a pre-trained perception model. This fact may not be aligned well with the term end-to-end.

This is a misunderstanding. By end-to-end, we mean that the perception module and the policy networks are trained jointly during policy learning. The perception module is pretrained before policy learning for better performance.

> The search space is not clearly stated. ..., it would be better to clarify what required inputs to the proposed entire system in the paper, including inductive biases.

We have provided the set of activation functions in Table A2. We will further clarify this in the main text to avoid ambiguity.

> Would it scale to environments where agents need to perform multi-hop reasoning with objects' attributes and relations?

Yes, as previously mentioned, EQL can represent the relationships between objects. For multi-hop reasoning, simply stacking several EQL layers can represent complex, multilayered relationships between different objects (see the sketch at the end of this response).

> What are inductive biases required by the framework? How strong they are compared to other neuro-symbolic RL approaches?

Our inductive bias lies in the activation functions used by EQL. We assume that common symbols such as arithmetic operations, squaring, and cubing grant the EQL network sufficient expressive capability. This is a modest assumption given that deep neural networks inherently possess significant expressive power. Compared to NUDGE, which requires manually defined rules, and DiffSES, which relies on perfectly extracted coordinates from a visual perception module, our method exhibits the least inductive bias.
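To make the points about relational features and multi-hop reasoning concrete, below is a minimal sketch of an EQL-style layer. The particular activation bank (identity, square, cube, pairwise product), the layer widths, and the action count are illustrative assumptions; the full set of activation functions used by INSIGHT is the one listed in Table A2.

```python
import torch
import torch.nn as nn

class EQLLayer(nn.Module):
    """One EQL-style layer: a linear map followed by a fixed bank of symbolic
    activations (illustrative subset: identity, square, cube, pairwise product)."""
    def __init__(self, in_dim, units_per_op=8):
        super().__init__()
        u = units_per_op
        self.linear = nn.Linear(in_dim, 5 * u)   # 3 unary banks + 2 product inputs
        self.u = u

    def forward(self, x):
        z = self.linear(x)
        a, b, c, d, e = torch.split(z, self.u, dim=-1)
        return torch.cat([a, b ** 2, c ** 3, d * e], dim=-1)   # output dim = 4 * u

# Relational features such as coordinate differences (x_i - x_j) are expressible
# by the linear map; stacking layers composes them for multi-hop reasoning.
policy = nn.Sequential(
    EQLLayer(2048, units_per_op=8),   # structured state: 256 objects x 2 coords x 4 frames
    EQLLayer(32, units_per_op=8),     # a second layer enables multi-hop composition
    nn.Linear(32, 6),                 # action logits (e.g., 6 actions in Pong)
)
```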
[1] Interpretable and Explainable Logical Policies via Neurally Guided Symbolic Abstraction, NeurIPS 2023.
[2] Symbolic visual reinforcement learning: A scalable framework with object-level abstraction and differentiable expression search, arXiv 2022.

---

> I am not suggesting solving hallucinations on LLMs but improving the evaluation method....
