## Summary of our discussion with Reviewers htJq, qQsj, and vQKv
We appreciate the engaging discussion of our paper with Reviewers htJq, qQsj, and vQKv. All of them raised insightful points that strengthened our paper, improving its clarity, exposition, and rigor.
Since the discussion thread is long, we summarize below the discussions that followed the initial reviews and our follow-up responses.
**Reviewer htJq**
- All concerns (both from the initial review and follow-up questions) have been **addressed** and **acknowledged**.
- In general, the reviewer finds the paper promising and **suggested acceptance** once the presentation concerns were addressed. We have addressed all presentation concerns raised by Reviewer htJq by:
- revising the main contributions, abstract, parts of Section 3, Section 4, Section 5.4, and Section 6.
- More specifically, after the initial review, Reviewer htJq helped improve our paper's exposition by (1) simplifying the explanation of behavioral and instant incentives, (2) precisely stating our contributions, and (3) framing our results in a way that better aligns with the stated contributions.
- Post-initial-review presentation questions have been **acknowledged** by the reviewer.
**Reviewer qQsj**
- All technical concerns (both from the initial review and follow-up questions) have been **addressed** and **acknowledged**.
- Agrees with the other reviewers that the paper makes a good contribution and has strong results. Primarily had issues with the narrative and writing:
- the narrative in the introduction (motivating our approach without inaccurately demoting CTDE),
- the lack of a proper discussion of limitations (generalization, high-dimensional inputs).
- We have addressed these by:
- rewriting the introduction in Section 1,
- adding a limitations discussion to Section 6 that covers, among other things, generalizability and high-dimensional input states.
- We also expanded the empirical analysis of centralized and decentralized methods in Section 5.4, along with a justification of the design choice of our incentives.
**Reviewer vQKv**:
- All concerns (initial review only, no follow-up concerns) have been **addressed** and **acknowledged**.
- Noted our responses to the other reviewers regarding (1) clarity and (2) CTDE vs. DTDE, and found them sufficient.
- Finds our paper very solid and raised their recommendation to **Strong Accept**.
<!-- mentioned that we have addressed many technical comments, including: (1) performing a more rigorous testing phase, with standard statistical tests, on frozen models, (2) including an ablation study on a centralized version of our approach, and (3) performing weight sharing over inference modules. Reviewer qQsj pointed out a few limitations that were not addressed in our initial submission, including (1) challenges our approach may face when working with high-dimensional states and (2) the generalization problem when encountering agents with unknown strategies. We appreciate that Reviewer qQsj helped us raise these limitations in our discussion and highlight potential directions for our future work. Meanwhile, the discussion with Reviewer qQsj over CTDE vs. DTDE helped us rephrase our problem setting and make it more rigorous. We appreciate Reviewer qQsj's effort in making our paper more rigorous and solid. -->
<!-- acknowledged the interesting and promising approach of our paper, especially in the domain of autonomous driving. Reviewer htJq praised our effort in clarifying our approach, experimental domains, and pseudocode. Their follow-up questions and comments helped us improve our paper, including (1) formulating the definition of the behavioral and instant incentives used in our approach with clear examples, (2) sharpening the contributions claimed in our paper by emphasizing behavior-awareness in trajectory prediction and decision-making, and (3) framing our results discussion and highlighting future research on the representational choices of the inference models. All of these helped us improve the exposition and positioning of our paper. -->
<!-- ## Rebuttal for Reviewer htJq (2nd round)
> First, can you articulate how your results justify the added complexity of using the behavioral/instant incentive? I can see the experimental results and there is some additional analysis. I would like to hear why the improvements you see between, e.g., IPLAN-GAT and IPLAN justify the additional complexity of IPLAN.
Thanks for the question! The added incentive modules not only yield novel insights into decentralized MARL for heterogeneous autonomous driving, which we believe would be broadly beneficial to the decentralized MARL research community, but also do **not** incur high computational complexity compared to the base controller.
Specifically, we are happy to report that:
- Each of the GAT and BM modules occupy only about 10% of the size of the core controller (which is basically iPlan without the two incentive modules). We can provide specific parameter sizes if necessary.
- In terms of inference time, both the core controller and the added incentive modules are of the same order.
- Finally, in terms of training time (using the hyperparameters in the paper), iPlan takes 4.897 days whereas iPlan-GAT takes 4.042 days. Note that the training time also reflects the complexity of the simulator environment.
Let us know if you'd like us to analyze complexity in any other way.
> Second, I think there is room for improvement in explaining and motivating the hierarchical incentives used here. For example, in describing the instant incentive, you say that collision avoidance becomes more important in heavy traffic. I expect a lot of readers will object to that reasoning because collision avoidance is always important --- it is simply easier to do in light traffic so one can drive faster. I guess that it is helpful because you don't have the right features. I.e., the preference over velocity is a simplification of a much more complicated set of preferences so it is easy for the inference to allow it to adapt.
We apologize for the confusion. Let us try again.
Perhaps the most concrete way to explain the two incentives is to first differentiate between the objectives of the two incentives.
- **Behavior incentive**: Given the observations from the previous few seconds, the behavior incentive performs high-level decision-making similar to action planning, asking "*What's the most likely action for this driver to take next?*". The answer is encoded via $\hat\beta^t_i$. This tells an agent when it can speed up in sparse or empty traffic or should slow down in dense traffic. It also recognizes conservative drivers and the possible need to overtake. Therefore, this incentive is able to reason about aggressive versus conservative drivers.
- **Instant incentive**: Instant incentive then asks "*How should I execute this maneuver using my controller so that I'm safe and still on track towards my goal?*". Instant incentive measures classical efficiency metrics defined in robotics literature such as collision avoidance (safety), distance from goal, and smoothness.
Having gained an idea of what each incentive is responsible for, here's a toy example. Suppose Alice is driving behind Bob. Alice is a relatively more assertive and confident driver than Bob, who is driving very slowly. Now, Alice's *behavior incentive* is tracking both Alice's and Bob's driving for the past few minutes and, after observing for a short while, will tell her to overtake Bob. At this point, her behavior incentive will inform her *instant incentive*, which will modify her trajectory and show her exactly how (i.e., what controls) to execute the overtake maneuver safely, as opposed to having her stuck behind Bob.
Another way to look at it is that the instant incentive is akin to motion forecasting whereas the behavior incentive is akin to high-level decision-making. Then, we can say that the behavior incentive biases the motion forecasting in a behavior-aware manner such that it is better suited for heterogeneous traffic. For evidence, note that in more homogeneous traffic, iPlan has a similar success rate (68.44) to iPlan-GAT (no behavior incentive), whereas in chaotic traffic, the success rate drops significantly for iPlan-GAT compared to iPlan (61.88 versus 67.81), indicating that behavior modeling is needed to survive in more heterogeneous, chaotic traffic.
> Finally, I do still notice a few typos and grammatical errors --- some additional editing is needed for the camera ready. Perhaps consider using Grammarly or a similar tool to check for missing words or incorrect verb tenses (e.g., l.165 "Behavioral incentive captures these inherent tendencies," should either say "The behavioral incentive captures" or "Behavioral incentives capture").
We are currently fixing these. Will update this response once complete. Thanks!
When observing another vehicle's states, one can differentiate the incentives driving that vehicle in two ways:
- **Behavior incentive**: Given the observations from the previous few seconds, the behavior incentive captures the action preference of this vehicle in the current circumstance, answering "*What's the most likely action for this driver to take next?*". The answer is encoded via $\hat\beta^t_i$. This value is an indicator of aggressive or conservative behavior.
- **Instant incentive**: Instant incentive then asks "*How should I execute this maneuver using my controller so that I'm safe and still on track towards my goal?*". Instant incentive measures classical efficiency metrics defined in robotics literature such as collision avoidance (safety), distance from goal, and smoothness.
Having gained an idea of what each incentive is responsible for, here's a toy example. Suppose Alice is driving behind Bob, and a neural observer is tracking the two and generates the behavior and instant incentives for both. Alice is a relatively more assertive and confident driver than Bob, who is driving very slowly.
Now, the neural observer's *behavior incentive inference module* has been tracking both Alice's and Bob's driving for the past few minutes and builds an estimate of the behavior pattern of both. When Alice gets closer to Bob, the neural observer's
*behavior incentive inference module* finds that Alice is likely to overtake Bob, given her behavior pattern. At this point, the behavior incentive informs the *instant incentive*, which modifies the trajectory prediction for Alice by incorporating her behavior pattern, predicting a trajectory in which Alice safely executes the overtake maneuver over Bob rather than slowing down and keeping a safe distance.
So the behavior incentive acts like a bias term over the instant incentive inference, which actually performs the trajectory prediction. In this way, agents account for the effect of behaviors in trajectory prediction and thus make more accurate predictions.
-->
## Rebuttal for Reviewer htJq (3rd round)
Thank you for clarifying your follow-up question. Below, we briefly discuss the additional tunable hyperparameters, additional code, and design choices.
However, it should be noted that our main contribution is a novel, practically working, and efficient joint trajectory and intent prediction algorithm using MARL for autonomous driving in heterogeneous traffic.
In general, it is well known that training even simple MARL algorithms is hard. Yet, our approach extends MARL-based trajectory planning research in autonomous driving to harder domains (heterogeneous traffic) under minimalistic assumptions (decentralized training, no weight sharing, a variable number of agents, etc.).
Considering that even getting decentralized MARL algorithms to converge effectively in simpler environments is a challenge, the fact that our combined approach not only trains well but also outperforms many state-of-the-art baselines is, in our opinion, a significant achievement.
In summary, thinking of our contribution in terms of just the improved percentage points is highly reductive. Our work is a significant push in the research landscape of decentralized MARL and autonomous driving in heterogeneous traffic.
- **more interacting parts of the system:** There are no additional interacting parts of the system. All three modules (controller, behavioral and instant incentive inference) use the same form of inputs, which come from the ego agent's observations of opponents. The only extra complexity here is the observation wrapper that processes the observations, which is shared by the episode batch creator. We use the same observation wrapper to convert the initial observation from the environment into the model input for all baselines in our paper.
- **more hyperparameters to tune:** Yes, there are a few extra hyperparameters introduced by the behavioral and instant incentive inference modules; an illustrative configuration sketch is given after this list. We have included details of these hyperparameters in Appendix C of our paper. We also present some additional experimental results on tuning these hyperparameters in Appendix E.
- **Behavioral incentive inference module**:
- the hidden state dimension of the encoder and decoder,
- the dimension of the behavioral incentive,
- the learning rate of the behavioral incentive inference module,
- the coefficient for the soft update policy,
- the length of the historical observation sequence,
- the dropout rate.
- **Instant incentive inference module**:
- the hidden state dimension of the GAT and recurrent layers,
- the batch size for sampling timesteps from the episode batch during training,
- the learning rate of the instant incentive inference module,
- the length of the trajectory prediction horizon,
- the dropout rate.
- **more code to maintain:** Yes, both the behavioral and instant incentive inference modules are defined separately, in separate files; a minimal skeleton of the two modules is sketched after this list. The behavioral incentive inference module has its own training and execution code, built on an autoencoder network structure. Similarly, the instant incentive inference module has its own training and execution code, built on GAT and recurrent network structures.
- **more design choices to make:** Yes, we also explored some alternative designs for the inference modules, such as different network structures and a hard update policy in the behavioral module. We have included these results in Appendix D; they show that our current design performs better.
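For concreteness, the extra hyperparameters listed above could be grouped as in the following minimal sketch. All names and default values here are illustrative placeholders, not the actual settings of our implementation (those are reported in Appendix C).

```python
from dataclasses import dataclass

@dataclass
class BehavioralIncentiveConfig:
    # Hidden state dimension of the encoder/decoder and the dimension of the inferred incentive.
    hidden_dim: int = 64
    incentive_dim: int = 8
    lr: float = 1e-4               # learning rate of the behavioral incentive inference module
    soft_update_tau: float = 0.01  # coefficient for the soft update policy
    history_len: int = 10          # length of the historical observation sequence
    dropout: float = 0.1

@dataclass
class InstantIncentiveConfig:
    # Hidden state dimension shared by the GAT and recurrent layers.
    hidden_dim: int = 64
    batch_size: int = 32           # timesteps sampled from the episode batch per training step
    lr: float = 1e-4               # learning rate of the instant incentive inference module
    prediction_len: int = 5        # length of the trajectory prediction horizon
    dropout: float = 0.1
```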
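Likewise, here is a minimal sketch of how the two inference modules could be structured, assuming PyTorch and the `GATConv` layer from `torch_geometric`; the class names, layer choices, and output dimensions are hypothetical simplifications rather than our exact architecture, and the `soft_update` helper illustrates the soft-versus-hard update design choice mentioned above (a hard update corresponds to `tau = 1.0`).

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv  # illustrative choice of GAT implementation

class BehavioralIncentiveInference(nn.Module):
    """Autoencoder over the historical observation sequence; the latent code is beta_hat."""
    def __init__(self, obs_dim, hidden_dim, incentive_dim, dropout):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.to_incentive = nn.Linear(hidden_dim, incentive_dim)
        self.decoder = nn.Sequential(
            nn.Linear(incentive_dim, hidden_dim), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(hidden_dim, obs_dim),
        )

    def forward(self, obs_history):          # obs_history: (batch, history_len, obs_dim)
        _, h = self.encoder(obs_history)
        beta_hat = self.to_incentive(h[-1])  # behavioral incentive estimate
        recon = self.decoder(beta_hat)       # reconstruction target used for training
        return beta_hat, recon

class InstantIncentiveInference(nn.Module):
    """GAT over the current observations of opponents, followed by a recurrent layer."""
    def __init__(self, obs_dim, hidden_dim, dropout):
        super().__init__()
        self.gat = GATConv(obs_dim, hidden_dim, dropout=dropout)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)  # e.g., a predicted (x, y) waypoint per step

    def forward(self, node_obs, edge_index, hidden=None):
        x = torch.relu(self.gat(node_obs, edge_index))
        out, hidden = self.rnn(x.unsqueeze(1), hidden)
        return self.head(out.squeeze(1)), hidden

def soft_update(target, online, tau=0.01):
    """Polyak averaging of the target network; tau = 1.0 recovers the hard update variant."""
    for t, o in zip(target.parameters(), online.parameters()):
        t.data.mul_(1.0 - tau).add_(tau * o.data)
```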
___
## Rebuttal for Reviewer htJq (3rd round)
Thank you for clarifying your follow-up question about methodological complexity, and our apologies for the previous confusion.
Regarding the methodological complexity concerns you raised:
- **more interacting parts of the system:** There are no additional interacting parts of the system. All three modules (controller, behavioral and instant incentive inference) use the same form of inputs, which come from the ego agent's observations of opponents, and there are no additional interactions among agents or between agents and the environment. As described in Section 4 of the paper and Fig. 1, the behavioral incentive inference uses the sequence of historical observations as input, the instant incentive inference uses the current observations of opponents together with the behavioral incentives (themselves derived from the observations), and the controller combines the current observation with the behavioral and instant incentives (both derived from the observations of opponents) as its input (a short wiring sketch follows this list). The extra complexity here is the observation wrapper that processes the observations, which is shared by the episode batch creator. Notably, we use the same observation wrapper to convert the initial observation from the environment into the model input for all baselines in our paper.
- **more hyperparameters to tune:** Yes, there are a few extra hyperparameters introduced by the behavioral and instant incentive inference modules. We have included detailed hyperparameters in Appendix C of our paper. We also present some additional experimental results on tuning these hyperparameters in Appendix E.
- **Behavioral incentive inference module**: the hidden state dimension of the encoder and decoder, the dimension of the behavioral incentive, the learning rate of the behavioral incentive inference module, the coefficient for the soft update policy, the length of the historical observation sequence, and the dropout rate.
- **Instant incentive inference module**: the hidden state dimension of the GAT and recurrent layers, the batch size for sampling timesteps from the episode batch during training, the learning rate of the instant incentive inference module, the length of the trajectory prediction horizon, and the dropout rate.
- **more code to maintain:** Yes, both the behavioral and instant incentive inference modules are defined separately. The behavioral incentive inference module has its own training and execution code, built on an autoencoder network structure. Similarly, the instant incentive inference module has its own training and execution code, built on GAT and recurrent network structures.
- **more design choices to make:** Yes, we also explored some alternative designs for the inference modules, such as different network structures and a hard update policy in the behavioral module. We have included these results in Appendix D; they show that our current design performs better.
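To make the dataflow above concrete, here is an illustrative (not verbatim) wiring of one ego-agent step; the function and argument names are hypothetical, and the modules are passed in as generic callables.

```python
import torch

def ego_step(obs_history, current_obs, edge_index, behav_module, instant_module, controller):
    """Illustrative wiring of the dataflow described above; all names are hypothetical."""
    # 1) Behavioral incentive inference: sequence of historical observations -> beta_hat.
    beta_hat = behav_module(obs_history)
    # 2) Instant incentive inference: current opponent observations plus behavioral incentives.
    instant_incentive = instant_module(torch.cat([current_obs, beta_hat], dim=-1), edge_index)
    # 3) Controller: current observation combined with both incentives.
    action = controller(torch.cat([current_obs, beta_hat, instant_incentive], dim=-1))
    return action
```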
Given our hyperparameter tuning, design-choice exploration, and ablation studies, we find that our current design of behavioral and instant incentive inference achieves better performance than the alternatives we explored, while the extra complexity added to the backbone code (IPPO) remains modest.
___
## Rebuttal for Reviewer htJq (4th round)
Thank you for your follow-up question regarding our module design and the contributions claimed in our paper. We apologize for the confusion here.
We appreciate your suggestion regarding the contributions we claim. We agree it would be better to rephrase the second contribution in our paper as: