## Summary of our discussion with Reviewers htJq, qQsj, and vQKv
We appreciate the engaging discussion of our paper with Reviewers htJq, qQsj, and vQKv. All of them raised insightful points that strengthened our paper, improving its clarity, exposition, and rigor.
Since the discussion thread is long, we summarize below the discussions that followed the initial reviews and our follow-up responses.
**Reviewer htJq**
- All concerns (both from the initial review and follow-up questions) have been **addressed** and **acknowledged**.
- In general, the reviewer finds the paper promising and **suggested acceptance** once the presentation concerns were addressed. We have addressed all presentation concerns raised by Reviewer htJq by:
- revising the main contributions, abstract, parts of Section 3, Section 4, Section 5.4, and Section 6.
- More specifically, after the initial review, Reviewer htJq helped improve our paper's exposition by (1) simplifying the explanation of behavioral and instant incentives, (2) precisely stating our contributions, and (3) framing our results in a way that better aligns with the stated contributions.
- Post-initial-review presentation questions have been **acknowledged** by the reviewer.
**Reviewer qQsj**
- All technical concerns (both from the initial review and follow-up questions) have been **addressed** and **acknowledged**.
- Agrees with the other reviewers that the paper makes a good contribution and has strong results. Primarily had issues with the narrative and writing:
- the narrative in the introduction (motivating our approach without inaccurately demoting CTDE),
- the lack of a proper discussion of limitations (generalization, high-dimensional inputs).
- We have addressed these by:
- rewriting the introduction in Section 1,
- adding a limitations discussion to Section 6 that covers, among other things, generalizability and high-dimensional input states.
- We also expanded the empirical analysis of centralized and decentralized methods in Section 5.4, along with a justification of the design choice of our incentives.
**Reviewer vQKv**:
- All concerns (initial review only, no follow-up concerns) have been **addressed** and **acknowledged**.
- Noted our responses to the other reviewers regarding (1) clarity and (2) CTDE vs. DTDE, and found them sufficient.
- Finds our paper very solid and raised their recommendation to **Strong Accept**.
<!-- mentioned that we have addressed many technical comments, including: (1) performing a more rigorous testing phase, with standard statistical tests, on frozen models, (2) including an ablation study on a centralized version of our approach, and (3) performing weight sharing over inference modules. Reviewer qQsj pointed out a few limitations that were not addressed in our initial submission, including (1) challenges our approach may face when working with high-dimensional states and (2) the generalization problem when encountering agents with unknown strategies. We appreciate that Reviewer qQsj helped us raise these limitations in our discussion and highlight potential directions for our future work. Meanwhile, the discussion with Reviewer qQsj over CTDE vs. DTDE helped us rephrase our problem setting and make it more rigorous. We appreciate Reviewer qQsj's effort in making our paper more rigorous and solid. -->
<!-- acknowledged the interesting and promising approach of our paper, especially in the domain of autonomous driving. Reviewer htJq praised our effort in clarifying our approach, experimental domains, and pseudocode. Their follow-up questions and comments helped us improve our paper, including (1) formulating the definition of the behavioral and instant incentives used in our approach with clear examples, (2) sharpening the contributions claimed in our paper by emphasizing behavior-awareness in trajectory prediction and decision-making, and (3) framing our results discussion and highlighting future research on the representational choices of the inference models. All of these helped us improve the exposition and positioning of our paper. -->
<!-- ## Rebuttal for Reviewer htJq (2nd round)
> First, can you articulate how your results justify the added complexity of using the behavioral/instant incentive? I can see the experimental results and there is some additional analysis. I would like to hear why the improvements you see between, e.g., IPLAN-GAT and IPLAN justify the additional complexity of IPLAN.
Thanks for the question! The added incentive modules not only yield novel insights into decentralized MARL for heterogeneous autonomous driving, which we believe would be broadly beneficial to the decentralized MARL research community, but also do **not** incur high computational complexity compared to the base controller.
Specifically, we are happy to report that:
- Each of the GAT and BM modules occupy only about 10% of the size of the core controller (which is basically iPlan without the two incentive modules). We can provide specific parameter sizes if necessary.
- In terms of inference time, both the core controller and the added incentive modules are of the same order.
- Finally, in terms of training time (using the hyperparameters in the paper), iPlan takes 4.897 days whereas iPlan-GAT takes 4.042 days. Note that the training time also reflects the complexity of the simulator environment.
Let us know if you'd like us to analyze complexity in any other way.
> Second, I think there is room for improvement in explaining and motivating the hierarchical incentives used here. For example, in describing the instant incentive, you say that collision avoidance becomes more important in heavy traffic. I expect a lot of readers will object to that reasoning because collision avoidance is always important --- it is simply easier to do in light traffic so one can drive faster. I guess that it is helpful because you don't have the right features. I.e., the preference over velocity is a simplification of a much more complicated set of preferences so it is easy for the inference to allow it to adapt.
We apologize for the confusion. Let us try again.
Perhaps the most concrete way to explain the two incentives is to first differentiate between the objectives of the two incentives.
- **Behavior incentive**: Given the observations from the previous few seconds, the behavior incentive performs high-level decision-making similar to action planning, asking "*What's the most likely action for this driver to take next?*". The answer is encoded via $\hat\beta^t_i$. This tells an agent when it can speed up in sparse or empty traffic or should slow down in dense traffic. It also recognizes conservative drivers and the possible need to overtake. Therefore, this incentive is able to reason about aggressive versus conservative drivers.
- **Instant incentive**: Instant incentive then asks "*How should I execute this maneuver using my controller so that I'm safe and still on track towards my goal?*". Instant incentive measures classical efficiency metrics defined in robotics literature such as collision avoidance (safety), distance from goal, and smoothness.
Having gained an idea of what each incentive is responsible for, here's a toy example. Suppose Alice is driving behind Bob. Alice is a relatively more assertive and confident driver than Bob, who is driving very slowly. Now, Alice's *behavior incentive* is tracking both Alice's and Bob's driving for the past few minutes and, after observing for a short while, will tell her to overtake Bob. At this point, her behavior incentive will inform her *instant incentive*, which will modify her trajectory and show her exactly how (i.e., what controls) to execute the overtake maneuver safely, as opposed to having her stuck behind Bob.
Another way to look at it is that the instant incentive is akin to motion forecasting whereas the behavior incentive is akin to high-level decision-making. Then, we can say that the behavior incentive biases the motion forecasting in a behavior-aware manner such that it is better suited for heterogeneous traffic. For evidence, note that in more homogeneous traffic, iPlan has a similar success rate (68.44) to iPlan-GAT (no behavior incentive), whereas in chaotic traffic, the success rate drops significantly for iPlan-GAT compared to iPlan (61.88 versus 67.81), indicating that behavior modeling is needed to survive in more heterogeneous, chaotic traffic.
> Finally, I do still notice a few typos and grammatical errors --- some additional editing is needed for the camera ready. Perhaps consider using Grammarly or a similar tool to check for missing words or incorrect verb tenses (e.g., l.165 "Behavioral incentive captures these inherent tendencies," should either say "The behavioral incentive captures" or "Behavioral incentives capture").
We are currently fixing these. Will update this response once complete. Thanks!
When observing another vehicle's states, one can differentiate the incentives driving that vehicle in two ways:
- **Behavior incentive**: Given the observations from the previous few seconds, the behavior incentive captures the action preference of this vehicle in the current circumstance, answering "*What's the most likely action for this driver to take next?*". The answer is encoded via $\hat\beta^t_i$. This value is an indicator of aggressive or conservative behavior.
- **Instant incentive**: Instant incentive then asks "*How should I execute this maneuver using my controller so that I'm safe and still on track towards my goal?*". Instant incentive measures classical efficiency metrics defined in robotics literature such as collision avoidance (safety), distance from goal, and smoothness.
Having gained an idea of what each incentive is responsible for, here's a toy example. Suppose Alice is driving behind Bob, and a neural observer is tracking the two and generates the behavior and instant incentives for both. Alice is a relatively more assertive and confident driver than Bob, who is driving very slowly.
Now, the neural observer's *behavior incentive inference module* has been tracking both Alice's and Bob's driving for the past few minutes and builds an estimate of the behavior pattern of both. When Alice gets closer to Bob, the neural observer's
*behavior incentive inference module* finds that Alice is likely to overtake Bob, given her behavior pattern. At this point, the behavior incentive informs the *instant incentive*, which modifies the trajectory prediction for Alice by incorporating her behavior pattern, predicting a trajectory in which Alice safely executes the overtake maneuver over Bob rather than slowing down and keeping a safe distance.
So the behavior incentive acts like a bias term over the instant incentive inference, which actually performs the trajectory prediction. In this way, agents account for the effect of behaviors in trajectory prediction and thus make more accurate predictions.
-->
## Rebuttal for Reviewer htJq (3rd round)
Thank you for clarifying your follow-up question. Below, we briefly discuss the additional tunable hyperparameters, additional code, and design choices.
However, it should be noted that our main contribution is a novel, practically working, and efficient joint trajectory and intent prediction algorithm using MARL for autonomous driving in heterogeneous traffic.
In general, it is well known that training even simple MARL algorithms is hard. Yet, our approach extends MARL-based trajectory planning research in autonomous driving to harder domains (heterogeneous traffic) under minimalistic assumptions (decentralized training, no weight sharing, a variable number of agents, etc.).
Considering that even getting decentralized MARL algorithms to converge effectively in simpler environments is a challenge, the fact that our combined approach not only trains well but also outperforms many state-of-the-art baselines is, in our opinion, a significant achievement.
In summary, thinking of our contribution in terms of just the improved percentage points is highly reductive. Our work is a significant push in the research landscape of decentralized MARL and autonomous driving in heterogeneous traffic.
- **more interacting parts of the system:** There are no additional interacting parts of the system. All three modules (controller, behavioral and instant incentive inference) use the same form of inputs, which come from the ego agent's observations of opponents. The only extra complexity here is the observation wrapper that processes the observations, which is shared by the episode batch creator. We use the same observation wrapper to convert the initial observation from the environment into the model input for all baselines in our paper.
- **more hyperparameters to tune:** Yes, there are a few extra hyperparameters introduced by the behavioral and instant incentive inference modules; an illustrative configuration sketch is given after this list. We have included details of these hyperparameters in Appendix C of our paper. We also present some additional experimental results on tuning these hyperparameters in Appendix E.
- **Behavioral incentive inference module**:
- the hidden state dimension of the encoder and decoder,
- the dimension of the behavioral incentive,
- the learning rate of the behavioral incentive inference module,
- the coefficient for the soft update policy,
- the length of the historical observation sequence,
- the dropout rate.
- **Instant incentive inference module**:
- the hidden state dimension of the GAT and recurrent layers,
- the batch size for sampling timesteps from the episode batch during training,
- the learning rate of the instant incentive inference module,
- the length of the trajectory prediction horizon,
- the dropout rate.
- **more code to maintain:** Yes, both the behavioral and instant incentive inference modules are defined separately, in separate files; a minimal skeleton of the two modules is sketched after this list. The behavioral incentive inference module has its own training and execution code, built on an autoencoder network structure. Similarly, the instant incentive inference module has its own training and execution code, built on GAT and recurrent network structures.
- **more design choices to make:** Yes, we also explored some alternative designs for the inference modules, such as different network structures and a hard update policy in the behavioral module. We have included these results in Appendix D; they show that our current design performs better.
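For concreteness, the extra hyperparameters listed above could be grouped as in the following minimal sketch. All names and default values here are illustrative placeholders, not the actual settings of our implementation (those are reported in Appendix C).

```python
from dataclasses import dataclass

@dataclass
class BehavioralIncentiveConfig:
    # Hidden state dimension of the encoder/decoder and the dimension of the inferred incentive.
    hidden_dim: int = 64
    incentive_dim: int = 8
    lr: float = 1e-4               # learning rate of the behavioral incentive inference module
    soft_update_tau: float = 0.01  # coefficient for the soft update policy
    history_len: int = 10          # length of the historical observation sequence
    dropout: float = 0.1

@dataclass
class InstantIncentiveConfig:
    # Hidden state dimension shared by the GAT and recurrent layers.
    hidden_dim: int = 64
    batch_size: int = 32           # timesteps sampled from the episode batch per training step
    lr: float = 1e-4               # learning rate of the instant incentive inference module
    prediction_len: int = 5        # length of the trajectory prediction horizon
    dropout: float = 0.1
```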
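Likewise, here is a minimal sketch of how the two inference modules could be structured, assuming PyTorch and the `GATConv` layer from `torch_geometric`; the class names, layer choices, and output dimensions are hypothetical simplifications rather than our exact architecture, and the `soft_update` helper illustrates the soft-versus-hard update design choice mentioned above (a hard update corresponds to `tau = 1.0`).

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv  # illustrative choice of GAT implementation

class BehavioralIncentiveInference(nn.Module):
    """Autoencoder over the historical observation sequence; the latent code is beta_hat."""
    def __init__(self, obs_dim, hidden_dim, incentive_dim, dropout):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.to_incentive = nn.Linear(hidden_dim, incentive_dim)
        self.decoder = nn.Sequential(
            nn.Linear(incentive_dim, hidden_dim), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(hidden_dim, obs_dim),
        )

    def forward(self, obs_history):          # obs_history: (batch, history_len, obs_dim)
        _, h = self.encoder(obs_history)
        beta_hat = self.to_incentive(h[-1])  # behavioral incentive estimate
        recon = self.decoder(beta_hat)       # reconstruction target used for training
        return beta_hat, recon

class InstantIncentiveInference(nn.Module):
    """GAT over the current observations of opponents, followed by a recurrent layer."""
    def __init__(self, obs_dim, hidden_dim, dropout):
        super().__init__()
        self.gat = GATConv(obs_dim, hidden_dim, dropout=dropout)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)  # e.g., a predicted (x, y) waypoint per step

    def forward(self, node_obs, edge_index, hidden=None):
        x = torch.relu(self.gat(node_obs, edge_index))
        out, hidden = self.rnn(x.unsqueeze(1), hidden)
        return self.head(out.squeeze(1)), hidden

def soft_update(target, online, tau=0.01):
    """Polyak averaging of the target network; tau = 1.0 recovers the hard update variant."""
    for t, o in zip(target.parameters(), online.parameters()):
        t.data.mul_(1.0 - tau).add_(tau * o.data)
```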
___
## Rebuttal for Reviewer htJq (3rd round)
Thank you for clarifying your follow-up question about methodological complexity, and our apologies for the previous confusion.
Regarding the methodological complexity concerns you raised:
- **more interacting parts of the system:** There are no additional interacting parts of the system. All three modules (controller, behavioral and instant incentive inference) use the same form of inputs, which come from the ego agent's observations of opponents, and there are no additional interactions among agents or between agents and the environment. As described in Section 4 of the paper and Fig. 1, the behavioral incentive inference uses the sequence of historical observations as input, the instant incentive inference uses the current observations of opponents together with the behavioral incentives (themselves derived from the observations), and the controller combines the current observation with the behavioral and instant incentives (both derived from the observations of opponents) as its input (a short wiring sketch follows this list). The extra complexity here is the observation wrapper that processes the observations, which is shared by the episode batch creator. Notably, we use the same observation wrapper to convert the initial observation from the environment into the model input for all baselines in our paper.
- **more hyperparameters to tune:** Yes, there are a few extra hyperparameters introduced by the behavioral and instant incentive inference modules. We have included detailed hyperparameters in Appendix C of our paper. We also present some additional experimental results on tuning these hyperparameters in Appendix E.
- **Behavioral incentive inference module**: the hidden state dimension of the encoder and decoder, the dimension of the behavioral incentive, the learning rate of the behavioral incentive inference module, the coefficient for the soft update policy, the length of the historical observation sequence, and the dropout rate.
- **Instant incentive inference module**: the hidden state dimension of the GAT and recurrent layers, the batch size for sampling timesteps from the episode batch during training, the learning rate of the instant incentive inference module, the length of the trajectory prediction horizon, and the dropout rate.
- **more code to maintain:** Yes, both the behavioral and instant incentive inference modules are defined separately. The behavioral incentive inference module has its own training and execution code, built on an autoencoder network structure. Similarly, the instant incentive inference module has its own training and execution code, built on GAT and recurrent network structures.
- **more design choices to make:** Yes, we also explored some alternative designs for the inference modules, such as different network structures and a hard update policy in the behavioral module. We have included these results in Appendix D; they show that our current design performs better.
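To make the dataflow above concrete, here is an illustrative (not verbatim) wiring of one ego-agent step; the function and argument names are hypothetical, and the modules are passed in as generic callables.

```python
import torch

def ego_step(obs_history, current_obs, edge_index, behav_module, instant_module, controller):
    """Illustrative wiring of the dataflow described above; all names are hypothetical."""
    # 1) Behavioral incentive inference: sequence of historical observations -> beta_hat.
    beta_hat = behav_module(obs_history)
    # 2) Instant incentive inference: current opponent observations plus behavioral incentives.
    instant_incentive = instant_module(torch.cat([current_obs, beta_hat], dim=-1), edge_index)
    # 3) Controller: current observation combined with both incentives.
    action = controller(torch.cat([current_obs, beta_hat, instant_incentive], dim=-1))
    return action
```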
Given our hyperparameter tuning, design-choice exploration, and ablation studies, we find that our current design of behavioral and instant incentive inference achieves better performance than the alternatives we explored, while the extra complexity added to the backbone code (IPPO) remains modest.
___
## Rebuttal for Reviewer htJq (4th round)
Thank you for your follow-up question regarding our module design and the contributions claimed in our paper. We apologize for the confusion here.
We appreciate your suggestion regarding the contributions we claim. We agree it would be better to rephrase the second contribution in our paper as: