# Rebuttal for Ada-NAV Paper
-----------------------------------------------------------------
## After Reviewer roEF's Second Comments
> A study of the effects of hyperparameters is typically standard in any paper in which new hyperparameters are introduced.
We agree. In the attached rebuttal pdf, please refer to Rebuttal Figure 3, which shows the effects of varying $t_d$.
> Perhaps my point would have been better stated as "the transformation appears poorly justified". However, given that no alternative transformation schemes are suggested it seems to me that the primary practical contribution is this specific transformation.
Thank you for the clarification. We agree that additional evidence is needed to show that other forms of the transformation work. In the attached rebuttal pdf, please refer to Figure 4, where we also try an exponential transformation that shows even better results. Furthermore, your comment has helped us refine our proposed scheme. We will propose in the paper to specifically use **monotone functions between policy entropy and trajectory length**. The linear interpolation in Eq. (7) and exponential functions (described further in the pdf) are such functions that can leverage the positive correlation. For the exponential function, we replace $t_d$ with a user-specified $\alpha$ value that controls the rate of change; $t_d$ plays a similar role in Eq. (7). We will update the paper with more experiments showing that a monotone function with an appropriate rate-of-change parameter $\alpha$ increases sample efficiency.
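To make the family of monotone transformations concrete for the reviewer, here is a minimal Python sketch of the linear interpolation in the spirit of Eq. (7) together with an exponential variant; the exact exponential form and the default $\alpha$ below are illustrative assumptions rather than the precise formulation in the rebuttal pdf.

```python
import math

def linear_length(h_cur, h_init, t_i, t_d):
    """Linear interpolation in the spirit of Eq. (7): length grows as entropy drops."""
    frac = max(0.0, (h_init - h_cur) / h_init)      # fraction of initial entropy lost
    return max(1, round(t_i + (t_d - t_i) * frac))

def exponential_length(h_cur, h_init, t_i, t_d, alpha=5.0):
    """Exponential variant: a user-specified alpha controls the rate of change."""
    frac = max(0.0, (h_init - h_cur) / h_init)
    scale = (math.exp(alpha * frac) - 1.0) / (math.exp(alpha) - 1.0)
    return max(1, round(t_i + (t_d - t_i) * scale))

# Both maps are monotone: a lower current entropy yields a longer trajectory.
h_init = math.log2(5)                               # uniform policy over 5 actions
for h in (2.32, 2.0, 1.0, 0.2):
    print(linear_length(h, h_init, 16, 2100), exponential_length(h, h_init, 16, 2100))
```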
## Response to Reviewer roEF [Weak Reject; Confidence 3]
<!-- **General Response:** -->
We would like to express our gratitude to the reviewer for dedicating their valuable time and effort to evaluating our manuscript; the feedback has allowed us to strengthen it. We have thoroughly responded to the reviewer's inquiries below.
> **Weakness 1:** The proposed heuristic for trajectory length (linear interpolation based on entropy) seems poorly-justified, particularly in the minimum- and maximum-length parameters.
**Response to Weakness 1:** We respectfully disagree that the transformation is poorly justified, and we emphasize that the specific transformation is not our main point. While we agree the transformation itself may seem heuristic, our main focus in this paper is to highlight and utilize the positive correlation between policy entropy and spectral gap. We used Eq. (7) to show that there exists a transformation that can leverage this insight to adaptively change the trajectory length as policy entropy decreases.
We do not propose that Ada-NAV has to be this specific transformation. Rather, Ada-NAV is the higher-level concept of increasing the trajectory length as policy entropy decreases. Future work can delve into a more nuanced and/or robust transformation. Our goal was to show that using the positive correlation to adaptively change the trajectory length leads to more sample-efficient training.
We will update the document to emphasize that the core contribution of the paper is to show that the positive correlation between policy entropy and spectral gap can be leveraged to improve sample-efficiency, not the specific transformation used.
In regards to the minimum- and maximum-length parameters, please see our response to Question 2.
> **Weakness 2:** The proposed method is only evaluated on REINFORCE; the authors should consider applying it to PPO (and to use PPO as a baseline). Considering an off-policy baseline using SAC or TD3 would also be encouraged.
**Response to Weakness 2:** We thank the reviewer for the suggestion, but this is not the main goal of our work. We are not trying to compare different policy gradient methods or even propose a new one. Ada-NAV is not meant to be a new policy gradient algorithm that can be compared against another such as PPO on its own. Rather, it can be implemented within a policy learning algorithm and will lead to better sample efficiency than using the same algorithm with a constant or random trajectory length.
> **Weakness 3:** The difference between Ada-NAV and vanilla REINFORCE with maximum length seems to be somewhat minor in terms of episode return in Fig. 4 but the difference seems very large in Tab. 1; experiments don't seem to be run across multiple seeds which makes it difficult to discern whether the effect is statistically significant.
**Response to Weakness 3:** In Fig. 4, we present the policy convergence performance of different methods against ours. However, Table 1 presents navigation results for the policies trained for a fixed sample budget. Hence, the comparison methods may not have fully converged within the fixed sample budget. This leads to significantly poorer performance during navigation as highlighted in Table 1. Please note that certain comparison methods might achieve better navigation performance if we train with a higher sample budget. However, our objective in this tabular comparison is to highlight the sample efficiency of our algorithm using a fixed sample budget.
Secondly, thank you for mentioning the seeding. We realize that we did not state the specific number of seeds. We ran each experiment for 5 trials, each seeded differently. The line plots throughout the experiments have shaded regions around the solid lines to show variance across trials, and the bar plots have error bars to show variance as well. We will be sure to state that we ran each experiment for 5 trials.
>**Weakness 4:** Experimental details seem sparse; in particular I don't see any discussion of policy parametrization (neural network? tabular? something else?) or of any other tricks that might improve performance (do you use a baseline or another advantage estimation algorithm to reduce variance?)
**Response to Weakness 4:** As described in Appendix Section 8.2, we use a tabular linear approximator for the Actor-Critic algorithms in the 2D gridworld navigation experiments. For the robotic simulation experiments, we use neural network policies with REINFORCE.
We will update the main body to refer to Section 8.2 for more experimental setup details. Thank you.
> **Question 1:** What is the reward function used for the navigation task?
**Response to Question 1:** Thank you for the clarifying question as we agree this can be emphasized better in the main body.
In Section 5, for the 2D navigation experiments, the agent receives a reward of +1 upon reaching the goal and 0 otherwise. This reward structure is used to show Ada-NAV's ability to handle sparse rewards.
In Appendix Section 8.2, we use three reward terms $r_{dist}$, $r_{head}$, and $r_{elev}$ to construct the reward functions for the even ($r_{even} = r_{dist} + r_{head}$) and uneven ($r_{uneven} = r_{dist} + r_{head} + r_{elev}$) terrain navigation scenarios. $r_{dist}$ and $r_{head}$ quantify the current distance and heading angle toward the goal: the robot receives higher rewards when it is closer to the goal and heading in the goal direction. Similarly, $r_{elev}$ reflects the current terrain elevation based on the robot's orientation; high roll and pitch angles of the robot indicate highly uneven terrain. Please refer to Section 8.2 in the Appendix for the mathematical formulation of these reward functions.
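Purely as an illustration of how these terms combine, consider the sketch below; the functional forms and scales are hypothetical placeholders, and the exact formulation is the one given in Appendix Section 8.2.

```python
import math

def reward_terms(dist_to_goal, heading_error, roll, pitch):
    """Hypothetical shapes: closer / better-aligned -> higher reward; high roll/pitch -> penalty."""
    r_dist = -dist_to_goal                  # higher reward when closer to the goal
    r_head = math.cos(heading_error)        # higher reward when heading toward the goal
    r_elev = -(abs(roll) + abs(pitch))      # high roll/pitch (uneven terrain) is penalized
    return r_dist, r_head, r_elev

def r_even(dist_to_goal, heading_error, roll, pitch):
    r_dist, r_head, _ = reward_terms(dist_to_goal, heading_error, roll, pitch)
    return r_dist + r_head                  # even terrain: r_even = r_dist + r_head

def r_uneven(dist_to_goal, heading_error, roll, pitch):
    r_dist, r_head, r_elev = reward_terms(dist_to_goal, heading_error, roll, pitch)
    return r_dist + r_head + r_elev         # uneven terrain adds the elevation term
```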
We will be sure to update the main body of the document to point to Section 8.2 for this crucial information. Thank you.
> **Question 2:** Is there any principled way to select the parameters $t_i, t_d$? A study of the affect of these parameters on performance would be appreciated.
**Response to Question 2:** We agree that the selection of $t_i$ and $t_d$ can be examined further, but it is out of the scope of this paper. Our key contribution is showing that growing the trajectory length from $t_i$ to $t_d$, whatever they may be, yields training performance greater than using a constant $t_i$ and comparable to using a constant $t_d$, while being more sample-efficient than $t_d$. We were not aiming to identify which specific values maximize performance, as we have already shown the sample efficiency of a trajectory length that grows based on policy entropy.
> **Question 3:** I would expect that the proposed algorithm improves over max-length REINFORCE early in training due to reduced variance at shorter horizons, and they would perform similarly later in training due to having the same horizon at policy convergence, but many of the results seem to run counter to this expectation. Do you have an explanation for why this is the case?
**Response to Question 3:** This is a good point. Due to more frequent gradient updates early in training, Ada-NAV REINFORCE is able to explore the policy space more and converge to a better policy than constant max-length REINFORCE. Therefore, by the time the trajectory length increases, Ada-NAV REINFORCE is exploiting a better policy.
> **Question 4:** Is there any reason to expect that elevation cost would be lower for a better control algorithm conditioned upon the other metrics being the same? I understand that it is an important metric for evaluating real-world performance but I don't see how it should result from a better optimizer, which is the crux of the proposal in this paper.
**Response to Question 4:**
We agree that our work does not propose a better navigation algorithm for uneven terrains. However, we incorporated the elevation cost as a metric to highlight that a navigation policy resulting from a better optimizer can demonstrate comparable or better navigation performance than baseline control algorithms such as ego-graph.
> **Question 5:** The success rate (/total return) and trajectory length metrics seem to not align well: if the presented algorithm is just a "better optimizer" as presented, policies with similar returns to vanilla-REINFORCE would be expected to have similar results in other metrics. Rather, it seems like specifically in the navigation case the rollout length scheduling seems to bias the resulting policy towards shorter paths. Is this correct?
**Response to Question 5:**
We agree that our work proposes a better optimizer for navigation policy training. Ideally, if we trained for a sufficient number of samples, all the policies trained with vanilla REINFORCE would demonstrate similar navigation behavior. However, we restrict ourselves to a fixed sample budget during policy training. Hence, the policies other than Ada-NAV's are not fully converged due to sample inefficiency, which results in significant performance degradation during the navigation evaluations.
## Response to Reviewer EMyG [Strong Reject, confidence 5]
<!-- **General Response:** -->
We are thankful to the reviewer for dedicating their valuable time and effort to evaluating our manuscript. We have thoroughly responded to the reviewer's inquiries below.
>**Weakness 1:** The idea of using adaptive rollout length is not particularly as novel as the authors made it out to be. For example [a] proposes a meta-learning algorithm to adapt the rollout horizon. The approach of [a] also seems to be more principled. Similarly, [b] also handles a similar problem. However, [b] might be a very recent result that may not be in the public before the proposed submission. However, [a] has been there for some time.
**Response to Weakness 1:** Thank you for the comment, but we respectfully disagree with the reviewer. We do not claim novelty merely for using an adaptive rollout; instead, our key insight is the positive correlation between policy entropy and mixing time in model-free RL, and the adjustment of rollout length accordingly. To our knowledge, we are the first to highlight and utilize this insight in trajectory length design.
In this work, we are more concerned with testing against other methods that deal with non-stationarity (induced by the changing policy), i.e., shifting Markov chains. That motivation is why we chose to compare against Multi-level Actor-Critic (MAC). Because the meta-level policy of [a] does not adjust the trajectory length for the purpose of adapting to mixing time, we felt that comparing against constant and multi-level sampling is sufficient.
The aspect of dealing with unknown mixing time and adapting the trajectory length accordingly is very important, and the reviewer seems to have missed it. We will try to highlight and clarify it in the final version.
We will be sure, however, to mention [a] in our related works in the final version.
>**Weakness 2/Questions for the Rebuttal:** The core result/idea seems to be overly simplified. The connection between mixing time and entropy is interesting but the resulting equation 7 does not seem too convincing. For example, the choice of initial policy entropy as the max entropy H if there is a guaranteed monotonic decrease in the policy entropy with RL iteration. If entropy increases, then eqn.7 might return 0. How will those practical issues affect the overall performance? Will it destabilize the RL? What forces H_c to be less than initial policy entropy? Are there any guarantees on that? Have the authors encountered a problem where eqn 7 just gives 0? How will the learning recover if such a situation occurs?
**Response to Weakness 2/Questions for the Rebuttal:**
We believe that a simpler presentation of the core ideas is a positive point for the readability and reach of the paper. Also, ***the question raised by the reviewer seems invalid.*** In reinforcement learning, as learning progresses with more interactions with the environment, the RL agent becomes more certain of which action to take, so policy entropy tends to stay the same or decrease. We have never experimentally run into a situation where the policy entropy increased from the previous episode, so the scenario of entropy increasing above the initial policy entropy and the trajectory length going to 0 is unrealistic. For example, please see rebuttal Figure 1 in the Google doc provided at the following link.
https://docs.google.com/document/d/1GIPFKHsUTQs3BFRtlt_vktpFnjFWLdXBlMH5fBF8B7U/edit?usp=sharing
It plots the decrease in policy entropy over training episodes across 5 trials for the 2D gridworld environment with 16 walls; the variation between trials is very low. Furthermore, to guarantee mathematically that we never enter the regime mentioned by the reviewer, we can always initialize the policy with a uniform distribution, which has the highest entropy.
In regards to Eq. (7) in general, we want to emphasize that our key insight in the paper is the positive correlation between policy entropy and spectral gap. Ada-NAV is the higher-level idea of using an adaptive trajectory length based on policy entropy; Eq. (7) does not necessarily have to be the transformation used to connect the trajectory length to policy entropy. We needed to show that there exists some transformation that can leverage this insight to improve the sample efficiency of a policy gradient algorithm. So while further work can be done to derive a more robust or nuanced transformation, Eq. (7) sufficiently utilizes the observed positive correlation to improve sample efficiency.
>**Weakness 3:** The paper needs a comparison with more closely relevant baselines like [a] and [b] to validate their approach. Moreover, the results at this moment are very confusing. For example, I don't understand, why on flat terrain with no obstacle the success rate of REINFORCE is so low. If it is flat terrain with no obstacles, a simple pure-pursuit type of controller can efficiently navigate the robot to the goal. Something like a basic MPC with a non-holonomic motion model will do even better.
**Response to Weakness 3:**
In regards to relevant baselines, please see our Response to Weakness 1.
Regarding the flat terrain scenario, please note that we incorporate the sparse rewards explained in Appendix Section 8.2 to train our navigation policies. Obtaining successful navigation policies under sparse reward settings is far less trivial than with the dense rewards generally used in the literature. Hence, all the REINFORCE-based policies, including Ada-NAV's, demonstrate relatively lower success rates in flat terrain scenarios. Moreover, since we evaluate the navigation performance at a fixed sample budget, we observe a significant decrease in the success rates of the other methods. We agree that if a policy is trained with a dense reward function, the goal-reaching task under flat terrain conditions should perform better with trivial policy training effort. We also agree that a basic MPC such as DWA achieves a 100% success rate in this even terrain scenario. However, our main objective in this navigation performance comparison table is to highlight how other methods underperform under a fixed sample budget and sparse reward settings, while Ada-NAV achieves relatively better navigation performance.
>**Weakness 4:** I don't think the DWA baseline is adding any value since it is not designed for outdoor navigation. Specifically, how do you compute the cost map for outdoor navigation for DWA?
**Response to Weakness 4:**
We agree that DWA is not designed for outdoor navigation, especially for uneven terrains. We primarily use DWA as a baseline for even terrain navigation since such even terrain scenarios do not require any terrain estimations. DWA does not use a cost map for navigation. Instead, it uses a cost function to optimize the trajectories such that the robot can reach a goal faster.
On the other hand, we use ego-graph as the baseline for uneven terrain navigation. We still report results for DWA in uneven scenarios to highlight the difficulty of the navigation task compared to even terrains.
## AFTER THE REBUTTAL
>Let's assume that entropy always decreases during training from the very first iteration. (although I put some references later that counter this assumption). In this case, eqn 7 is a heuristic that says to increase the rollout horizon as the training progresses. But the authors have not presented any theoretical guarantee that this increasing schedule always works. In fact, Reference [a] that I pointed out earlier, judicially explains that during training roll-out would have to be increased/decreased or kept the same. Thus, this brings me to my original contention that the authors have reduced a complex problem to an unjustifiably trivial form. Simplicity is always good but not at the expense of trivializing the problem statement itself.
**Response:**
***We respectfully disagree with the reviewer.*** We are not assuming that the entropy of the policy decreases; this is what we observe and show to hold empirically. Even in one of the references mentioned by the reviewer, [d], the authors state that
"*To prevent policies from becoming deterministic too quickly, researchers use entropy regularization*",
which clearly states that to stop policies from becoming deterministic too quickly (which is connected to entropy reduction), researchers use entropy regularization. Entropy regularization is nothing but an attempt to increase the entropy and hence induce exploration.
We remark that in this work, we are not focusing on the entropy regularization. ***Our method provides an alternative approach by modulating the frequency/noisiness of policy gradient updates through variable trajectory lengths.*** This actually controls the rate at which policy becomes deterministic.
***In summary, our main focus is NOT to show or guarantee that policy entropy decreases.*** Our focus is to show that the trajectory length can be based on policy entropy. In all of our experiments the policy entropy went down, but pointing to cases where policy entropy increases does not take away from Eq. (7) or the fundamental correlation. If we had experiments where policy entropy increased, Eq. (7) would simply decrease the trajectory length. We can always set the policy initialization to be uniform to avoid the trajectory length becoming 0. Please look at the table provided below, which shows different trajectory length values for different entropies using Eq. (7) with a uniform initialization.
Let $t_i = 16$, $t_d = 2100$, and assume a uniform initial policy for all states with a discrete action space of 5 possible actions, so $H_i = \log_2 5 \approx 2.3219$.
|Iteration| $H_c$ | $t_c$ |
|----|-------------|-------------|
| 1| 2.32 | 18 |
| 2| 2.31 (decreased) | 27 (longer trajectory) |
| 3| 2.30 (decreased) | 36 (longer trajectory) |
| 4|2.32 (increased) | 18 (shorter trajectory) |
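For concreteness, the table above can be reproduced with the sketch below, assuming Eq. (7) is the linear interpolation $t_c = t_i + (t_d - t_i)(H_i - H_c)/H_i$ rounded to the nearest integer.

```python
import math

t_i, t_d = 16, 2100
H_i = math.log2(5)            # uniform policy over 5 actions, H_i ≈ 2.3219

def traj_length(H_c):
    # Eq. (7) as a linear interpolation; min() guards against H_c exceeding H_i,
    # which cannot happen with a uniform initialization.
    return round(t_i + (t_d - t_i) * (H_i - min(H_c, H_i)) / H_i)

for H_c in (2.32, 2.31, 2.30, 2.32):
    print(H_c, traj_length(H_c))   # 18, 27, 36, 18 -- matches the table
```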
If the solution is to change the wording of the paper from "adaptively increase trajectory length based on policy entropy" to "adaptively change trajectory length based on policy entropy", then that is fine because our results and theoretical justification are still valid.
***Comparison to [a] is not reasonable.*** The reviewer has referred to [a], which clearly utilizes a different metric of uncertainty (based on model estimation error) than ours. We are using policy entropy as a proxy for the inherent uncertainty (through spectral gap/mixing time) in the environment, NOT the uncertainty of our model of the environment (as in [a]). Therefore, decreases/increases in model estimation error are not really connected to decreases/increases in policy entropy.
Furthermore, the authors of [a] state in their motivation:
"*First, we note that for model-based RL to be more efficient than model-free RL, rollouts to unfamiliar states must be accurate enough such that the subsequently derived value estimates are more accurate than the estimates obtained by replaying familiar experience and relying on value-function generalization alone. However, the model itself loses accuracy in regions further away from familiar states and moreover, prediction errors begin compounding at each step along a rollout. As a result, shorter rollouts are more accurate but provide little gain in efficiency, while longer rollouts improve efficiency in the short term but ultimately cause convergence to a suboptimal policy (Holland et al., 2018)."*
This motivation is the same reasoning as ours: when the agent is more confident in its actions (low entropy), the trajectory length should increase. When it's less confident, the trajectory length should decrease.
***Our key difference from [a]*** is how confidence is measured, as detailed above. We contribute an empirical connection between policy entropy and spectral gap/mixing time, and we leverage it to control the trajectory length in a model-free manner. [a] learns a model of the environment and adjusts the trajectory length via a learned meta-policy that depends on the model's accuracy. We did not trivialize a complex problem; rather, we address the same problem as [a] and provide a simpler solution that does not require learning a model of the environment.
> "In regards to Eq 7 as general, we want to emphasize that our key insight in the paper is the positive correlation between policy entropy and spectral gap." The core idea is invalid as the rollout would either need to be increased/decreased or kept the same. As [a] shows, it is not a single-direction adjustment.
**Response:** ***We respectfully disagree with the reviewer.*** Just because a case occurs where entropy increases does not mean the positive correlation with the spectral gap goes away. Eq. (7) will decrease the trajectory length accordingly if need be; in our experiments, however, this was never needed.
In [a], the authors learn a meta-policy that controls the trajectory length and is based on the environment model's accuracy. If that accuracy increases and decreases throughout training, so too will the trajectory length. ***However, expecting policy entropy to always follow this behavior is unjustified.***
> Regarding increasing entropy. Often to prevent convergence to a poor policy, the entropy needs to be increased during training. For example, look at [d] which proposes entropy-regularized learning. Depending on the weightage of the entropy term, the entropy can increase beyond the value of the initialized policy. To be honest, I have not seen any paper, that claims that entropy monotonically decreases with training right from the very first iteration, something that eqn.7 needs. Along these lines also please look at [e]
**Response:** The argument made by the reviewer in this comment is **wrong**. We do not claim, nor do we want to prove, that entropy always monotonically decreases. This is an observation from our experiments, and we show that it holds empirically. Nor do we claim a theoretical reason for why it should increase.
Also, the reviewer has mentioned [d], which clearly utilizes entropy regularization to force the entropy to increase. This supports our claim: if the entropy were increasing on its own, there would be no need to enforce it via regularization. This is exactly what is reflected in the following statement from [d]:
"*To prevent policies from becoming deterministic too quickly, researchers use entropy regularization*".
Also, even in [e, Equation 1], for max-entropy RL, the authors add the entropy of the policy as a regularizer, again to induce exploration by increasing policy entropy. This again keeps the policy from converging to a deterministic policy too quickly.
We are proposing a novel way to directly control this rate without any entropy regularization. However, if a max-entropy regularizer were used with our approach, Eq. (7) can handle increases in the policy entropy, and a uniform initialization ensures the trajectory length never goes to 0.
>The authors claim that "DWA does not use a cost map for navigation". This is not true. DWA uses a cost map (for example see ROS navigation). In regular navigation, the cost-map is simply an occupancy map. But DWA can be augmented with a traversability map. But I don't think the authors have provided any traversability map to the DWA.
**Response:** Thank you for the comment. We believe that the use of an occupancy map in the DWA implementation of the ROS navigation package led to this confusion. We agree that the DWA implementation in the ROS navigation package uses an occupancy map generated from a 2D LiDAR scan to obtain the admissible velocity space. Hence, we understand the reviewer's argument about the use of cost maps in DWA and the possibility of incorporating a traversability map.
However, we would like to point out that the original DWA algorithm presented in Fox et al. does not necessarily require any cost map for planning. As we highlighted in our previous response, DWA uses a cost function to calculate the optimal action to reach a goal while avoiding obstacles. We summarize the original DWA formulation below for clarification.
DWA represents the robot's actions as linear and angular velocity pairs $(v,\omega)$. Let $V_s = [0, v_{max}] \times [-\omega_{max}, \omega_{max}]$ be the space of all possible robot velocities based on the maximum velocity limits. DWA considers two constraints to obtain dynamically feasible and collision-free velocities: (1) The dynamic window $V_d$ contains the reachable velocities during the next $\Delta t$ time interval based on acceleration constraints; (2) The admissible velocity space $V_a$ includes the collision-free velocities. The resulting velocity space $V_r = V_s \cap V_d \cap V_a$ is utilized to calculate the optimal velocity pair $(v^*,\omega^*)$ by minimizing the objective function below,
$\mathcal{G}(v,\omega) = \gamma_1 \cdot head(v,\omega) + \gamma_2 \cdot obs(v,\omega) + \gamma_3 \cdot vel(v,\omega).$
Here, $head(.)$, $obs(.)$, and $vel(.)$ are the cost functions to quantify:
1. Goal reaching cost $head(v,\omega)$;
2. Obstacle cost $obs(v,\omega)$ represents distance to the closest obstacle;
3. Velocity cost $vel(v,\omega)$.
Hence, the optimal velocity pair is calculated by minimizing the function $\mathcal{G}(v,\omega)$, i.e., $(v^*,\omega^*) = \arg\min_{(v,\omega) \in V_r} \mathcal{G}(v,\omega)$.
To obtain the admissible velocity space $V_a$, each velocity pair in $V_s$ is evaluated to check whether the velocities lead to a collision with an obstacle. This is usually done by checking the distance to the obstacles detected by a 2D LiDAR after extrapolating each velocity pair for $\Delta t$ time. Hence, the original algorithm uses the distance vector from a 2D LiDAR scan to obtain the admissible velocity space $V_a$.
In contrast, the ROS navigation package includes default packages (e.g., costmap_2d) to generate occupancy maps from the 2D LiDAR scan data. Hence, the DWA implementation in ROS navigation uses that occupancy map as a cost map to calculate the admissible velocity space $V_a$. However, the original implementation does not explicitly require such a cost map for planning. Rather it requires the distance information about the obstacles around the robot to generate $V_a$.
We believe that these differences between DWA implementations led to the confusion. We hope the above explanation clarifies that DWA predominantly uses a cost function, and that a cost map can be used in certain implementations to generate the admissible velocity space $V_a$.
We leverage the baseline DWA algorithm described above for our comparisons. Hence, we only use the obstacle distances from a 2D LiDAR in our implementation. Further, we do not incorporate any additional traversability maps.
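To make the summary above concrete, the following is a minimal sketch of one DWA decision step in the spirit of Fox et al.; the discretization, cost shapes, safety radius, and weights $\gamma_i$ are illustrative assumptions rather than our exact implementation.

```python
import math
import numpy as np

def dwa_step(pose, vel, goal, lidar_points, v_max=1.0, w_max=1.0,
             a_v=0.5, a_w=1.0, dt=0.5, gammas=(1.0, 1.0, 0.2), safe_dist=0.3):
    """One DWA step: search V_r = V_s ∩ V_d ∩ V_a for the minimum-cost (v, w)."""
    x, y, theta = pose
    v0, w0 = vel
    best, best_cost = None, float("inf")
    # Dynamic window V_d (velocities reachable under acceleration limits) intersected with V_s.
    for v in np.linspace(max(0.0, v0 - a_v * dt), min(v_max, v0 + a_v * dt), 11):
        for w in np.linspace(max(-w_max, w0 - a_w * dt), min(w_max, w0 + a_w * dt), 21):
            # Extrapolate the pose over dt for this velocity pair.
            nx = x + v * math.cos(theta) * dt
            ny = y + v * math.sin(theta) * dt
            nth = theta + w * dt
            # Admissible set V_a: the distance to the closest LiDAR return decides
            # collision-freeness, i.e., only obstacle distances are needed, not a cost map.
            clearance = min((math.hypot(px - nx, py - ny) for px, py in lidar_points),
                            default=float("inf"))
            if clearance < safe_dist:
                continue
            head_err = (math.atan2(goal[1] - ny, goal[0] - nx) - nth + math.pi) % (2 * math.pi) - math.pi
            cost = (gammas[0] * abs(head_err)      # head(v, w): misalignment with the goal
                    + gammas[1] / clearance        # obs(v, w): near obstacles cost more
                    + gammas[2] * (v_max - v))     # vel(v, w): prefer faster motion
            if cost < best_cost:
                best, best_cost = (v, w), cost
    return best  # (v*, w*), or None if no admissible velocity exists
```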
**Reference:**
D. Fox, W. Burgard, and S. Thrun, “The dynamic window approach to collision avoidance,” IEEE Robotics & Automation Magazine, vol. 4, no. 1, pp. 23–33, 1997.
## Response to Reviewer UTMD [Weak Reject, confidence 4]
We express our gratitude to the reviewer for taking the time to review our manuscript. We sincerely appreciate the feedback and provide detailed responses to all questions below.
> **Question 1**: The correspondence between the entropy in equation (7) and the episode length seems heuristic. Given that only a positive correlation between the policy's entropy and the spectral gap is confirmed, this transformation seems rough and lacks applicability.
**Response to Question 1:** Thank you for the comment. While we agree the transformation itself may seem heuristic, our main focus in this paper is the positive correlation between policy entropy and spectral gap. We used Eq. (7) to show that there exists some transformation that can leverage this insight to adaptively change the trajectory length as policy entropy decreases. We do not propose that Ada-NAV has to be this specific transformation; rather, Ada-NAV is the higher-level concept of increasing the trajectory length as policy entropy decreases. Future work can delve into a more nuanced and/or robust transformation; our goal was to show that using the positive correlation to adaptively change the trajectory length leads to more sample-efficient training.
We will update the document to emphasize that the core contribution of the paper is the positive correlation between policy entropy and spectral gap that can be leveraged for sample-efficiency, not the specific transformation.
> **Question 2:** How much computational cost is involved in determining the trajectory length?
**Response to Question 2:** In Eq. (7), the only quantity that changes throughout training to determine the new trajectory length is the current policy entropy. The number of operations for calculating the average entropy is $O(|\mathcal{S}||\mathcal{A}|)$.
To be specific, calculating the average Shannon entropy for a state space of 625 states and an action space of 5 actions takes about $9 \times 10^{-3}$ seconds.
After calculating the current average entropy, we plug it into Eq. (7) to get the new trajectory length.
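As a concrete illustration of this cost, here is a minimal sketch for a tabular policy; the vectorized implementation and the uniform-policy example are for illustration only, sized to match the 625-state, 5-action case above.

```python
import numpy as np

def average_policy_entropy(policy):
    """Average Shannon entropy (in bits) of a tabular policy of shape (|S|, |A|): O(|S||A|) work."""
    p = np.clip(policy, 1e-12, 1.0)              # avoid log(0) for (near-)deterministic rows
    per_state = -(p * np.log2(p)).sum(axis=1)    # entropy of each state's action distribution
    return per_state.mean()

pi = np.full((625, 5), 1.0 / 5)                  # uniform policy: 625 states, 5 actions
print(average_policy_entropy(pi))                # ≈ 2.3219 = log2(5)
```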
> **Question 3:** I believe the relationship between policy learning in non-stationary environments and the length of exploration episodes used for learning is a crucial aspect in understanding the contributions of this paper. However, I feel the explanation of this relationship in the paper seems insufficient.
**Response to Question 3:** Thank you for the question. We agree that non-stationarity is a crucial aspect of the paper, and we realize that our lack of a clear definition of non-stationarity can lead to misunderstanding.
***By non-stationarity***, we do not mean changing environment dynamics. Rather, we are referring to the change in the induced Markov chain as learning progresses. With the environment dynamics $\mathbb{P}(s'|a,s)$ held constant, the change in the parameterized policy drives the change in the induced Markov chain. Our key insight is that the induced Markov chain gradually mixes more slowly over training. We base this key insight on the empirical link between policy entropy and spectral gap in Figure 2(b).
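To spell this out, the sketch below forms the induced chain $P_\pi(s' \mid s) = \sum_a \pi(a \mid s)\,\mathbb{P}(s'|a,s)$ for fixed dynamics and measures its spectral gap; the random dynamics and sizes here are placeholders, since in our setting only the policy (and hence the induced chain) changes during training.

```python
import numpy as np

def induced_chain(P, pi):
    """P: (|S|, |A|, |S|) fixed environment dynamics; pi: (|S|, |A|) policy.
    Returns the induced Markov chain P_pi(s'|s) = sum_a pi(a|s) P(s'|a,s)."""
    return np.einsum("sa,sat->st", pi, P)

def spectral_gap(P_pi):
    """1 - |lambda_2|: the gap between the two largest eigenvalue magnitudes of the chain."""
    mags = np.sort(np.abs(np.linalg.eigvals(P_pi)))[::-1]
    return 1.0 - mags[1]

# Placeholder dynamics: the environment stays fixed; only pi changes as training progresses.
rng = np.random.default_rng(0)
P = rng.random((20, 5, 20))
P /= P.sum(axis=2, keepdims=True)
pi_uniform = np.full((20, 5), 0.2)
print(spectral_gap(induced_chain(P, pi_uniform)))
```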
We will make this definition clear in the introduction.
> **Question 4:** Regarding the non-stationarity of the experimental environment, it seems that the number of walls determines the non-stationarity in the simulation, but the method of transformation is unclear. In the robot navigation task what kind of environmental nonstationarity do the authors assume for robot navigation tasks?
**Response to Question 4:** Thank you again, as this question points out a confusing part of our experiment description. As mentioned in the response to Question 3, non-stationarity does not refer to a changing environment. To be clear, in our experiments we do **not** change the environment; we do not add walls over time. We have one set of experiments where the number of walls is fixed at 4 and another set where there are 16 walls. In the Appendix, we also have an experiment with no walls. In Section 5, when we say "Increasing Obstacles in 2D gridworld", we mean that between experiments we add more walls to show the benefits of Ada-NAV in multiple environments with different complexities. Similarly, for the robotic navigation task, "Even and Uneven Terrain Navigation in Robotic Simulations" means that we have different static environments with varying levels of terrain evenness to show Ada-NAV's benefits across all of them.
We will update the document to emphasize that the environment dynamics stay constant during training.
> **Question 5:** How does the trajectory length change with the learning process of the policy by the Ada-NAV in the experiment?
**Response to Question 5:** As the algorithm learns from environment interactions, the policy entropy decreases and the trajectory length increases: after every episode we recalculate the policy entropy and use Eq. (7) to determine the next episode's trajectory length. For example, please see rebuttal Figure 2 in the Google doc provided at the following link.
https://docs.google.com/document/d/1GIPFKHsUTQs3BFRtlt_vktpFnjFWLdXBlMH5fBF8B7U/edit?usp=sharing
It plots the increase in trajectory length over training episodes across 5 trials using Ada-NAV for the 2D gridworld environment with 16 walls; the variation between trials is very low.