## Rebuttal for ICML submission 2787: Global Optimality without Mixing Time Oracles in Average-reward RL: Multi-level Actor-Critic

| Reviewer | Before Rebuttal Score | After Rebuttal Score | Responded to our rebuttal | Comment |
|----------|-----------------------|----------------------|---------------------------|---------|
| xqbb | 5 | 6 | Yes | Appreciated our contributions |
| UhK8 | 4 | 5 | Yes | Appreciated our theoretical and experimental results and technical novelties |
| z6wk | 5 | 5 | Yes | Appreciated our contributions and believes the problem is important to the community |
| BqHm | 4 | 4 | No | Appreciated our theoretical and experimental results |

**Summary of Reviews and Response**

**General comments and Review Highlights:** We sincerely appreciate all reviewers for their valuable feedback and insightful questions. We are particularly encouraged by the following: Reviewer z6wk believes that the problem *"is well-motivated and the problem being studied is interesting and important to the community"*. Reviewers z6wk, BqHm, and xqbb appreciate the rigorous theoretical analysis provided in the paper. Furthermore, all the reviewers note that our work alleviates the restrictive oracle assumption on mixing time, and they highlight the experimental evidence showing the sample efficiency of MAC. We provided thorough responses to each reviewer's concerns during the rebuttal, which resulted in two of the reviewers who engaged with us (UhK8, xqbb) increasing their scores (final scores: 6, 5, 5, 4).

**In summary,** we aim to close the gap between theoretical analysis and practical feasibility by analyzing the global convergence of MAC. We provide the tightest known bound on mixing time while also relaxing its oracle assumption, and we analyze the algorithm with a general parameterized policy.

---

**Comment by Reviewer xqbb**

> I thank the authors for their response which clarified my doubts. Regarding Equation 53 in the Appendix, I understand that the correct passages are: apply the square root on both sides thus getting $T^{1/4}$ and then multiplying $T^{1/2}$ by which would then turn the order to $T^{1/4}* T^{1/2} = T^{3/4}$ appearing at the denominator. If indeed this passage is not as described and the dependence on time is instead of the order of $T^{1/8}$ as the authors state, then Corollary 1 will no longer be true since the dominant term with respect to time will be $T^{-1/8}$. Do the authors agree on this or am I missing something?

**Response:** We thank you for these further inquiries and apologize for the oversight. You are correct: for Equation 53, $T^{1/4} \cdot T^{1/2} = T^{3/4}$.
Thus $T^{3/4}$, rather than $T^{1/8}$, should appear in the denominator of Eq 53. We provide the corrected Equation 53 below:

\begin{align} \mathcal{O}\left({\frac{1}{\sqrt{T}}}\right)\sqrt{\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\bigg[\Vert h_t\Vert^2\bigg]} \leq \widetilde{\mathcal{O}}\left({ \frac{\sqrt{\tau_{mix} T_{\max}} \log T_{\max}}{T^{\frac{3}{4}}}}\right) + \mathcal{O}\left({\frac{\sqrt{\log(T_{max}) T_{max}} \mathcal{E}^{critic}_{app}}{\sqrt{T}}}\right). \tag{53} \end{align}

Next, we remark that the bound in Eq 54 still holds. We use Eq 26 in Lemma 4 and take the square root:

\begin{align} \frac{1}{T} \sum_{t=1}^T \mathbb{E} \left[ ||{ \nabla J(\theta_t) }||^2 \right]\leq \mathcal{O}\left({ \mathcal{E}^{critic}_{app} }\right)+\widetilde{\mathcal{O}}\left({ \frac{\tau_{mix} \log T_{\max}}{\sqrt{T}}}\right) +\widetilde{\mathcal{O}}\left({ { \frac{\tau_{mix}\log T_{\max}}{T_{\max}}} } \right). \tag{26} \end{align}

\begin{align} \sqrt{\frac{1}{T} \sum_{t=1}^T \mathbb{E} \left[ ||{ \nabla J(\theta_t) }||^2 \right]} \leq \mathcal{O}\left({ \sqrt{\mathcal{E}^{critic}_{app}} }\right) + \widetilde{\mathcal{O}}\left({ \frac{\sqrt{\tau_{mix} \log T_{\max}}}{{T^{\frac{1}{4}}}}}\right) + \widetilde{\mathcal{O}}\left({ { \frac{\sqrt{\tau_{mix}\log T_{\max}}}{\sqrt{T_{\max}}}}}\right). \tag{54} \end{align}

The global convergence of MAC depends on adding these two bounds, as shown in Equation 47, which we repeat below:

\begin{align} J^{*}-\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\Vert J(\theta_t)\Vert \leq \sqrt{\mathcal{E}^{actor}_{app}} + \mathcal{O}\left({\frac{1}{\sqrt{T}}}\right)\sqrt{\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\bigg[\| h_t\|^2\bigg]} + \sqrt{\frac{1}{T} \sum_{t=1}^T \mathbb{E} \left[ \|{ \nabla J(\theta_t) }\|^2 \right]} \tag{47} \end{align}

Plugging Eq 54 and the corrected Eq 53 into Eq 47 leads to the overall corrected final bound:

\begin{align} J^{*}-\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\Vert J(\theta_t)\Vert \leq&\ \sqrt{\mathcal{E}^{actor}_{app}}+ \widetilde{\mathcal{O}}\left({ \frac{\sqrt{\tau_{mix} T_{\max}} \log T_{\max}}{T^{\frac{3}{4}}}}\right) + \mathcal{O}\left({\frac{\sqrt{\log(T_{max}) T_{max}} \mathcal{E}^{critic}_{app}}{\sqrt{T}}}\right) \\ &+ \widetilde{\mathcal{O}}\left({ \frac{\sqrt{\tau_{mix} \log T_{\max}}}{{T^{\frac{1}{4}}}}}\right) + \widetilde{\mathcal{O}}\left({ { \frac{\sqrt{\tau_{mix}\log T_{\max}}}{\sqrt{T_{\max}}}}}\right). \end{align}

In the second term on the right-hand side of the above inequality, we now have $T^{-3/4}$ instead of $T^{-1/8}$; we note that $T^{-1/4}$ remains the dominating term, matching the convergence rate of PPGAE in [2]. We thank the reviewer for pointing out this mistake and helping us improve our bound.

***In regards to the inquiry about Corollary 1***, after the correction to Equation 53 above, the bound is still $\mathcal{O}\left(T^{-1/4}\right)$, as the additional assumptions $\mathcal{E}(t) = \mathcal{E_{app}^{critic}}(t) = 0$ do not affect the bound of Equation 54. We are happy to provide additional details if required.
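As a quick sanity check on the corrected Eq 53, the exponent arithmetic for its first term works out as follows: the $\widetilde{\mathcal{O}}\big(\tau_{mix}(\log T_{\max})^2 T_{\max}/T^{1/2}\big)$ contribution to $\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\big[\Vert h_t\Vert^2\big]$ (which comes from bounding $\mathcal{E}(t)$ in Eq 50) contributes $T^{1/4}$ under the square root, and the $\mathcal{O}(1/\sqrt{T})$ prefactor supplies the remaining $T^{1/2}$:

\begin{align} \mathcal{O}\left(\frac{1}{\sqrt{T}}\right)\sqrt{\widetilde{\mathcal{O}}\left(\frac{\tau_{mix}(\log T_{\max})^2 T_{\max}}{T^{\frac{1}{2}}}\right)} = \widetilde{\mathcal{O}}\left(\frac{\sqrt{\tau_{mix} T_{\max}}\,\log T_{\max}}{T^{\frac{1}{2}}\, T^{\frac{1}{4}}}\right) = \widetilde{\mathcal{O}}\left(\frac{\sqrt{\tau_{mix} T_{\max}}\,\log T_{\max}}{T^{\frac{3}{4}}}\right). \end{align}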
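In brief, regarding Corollary 1: under its assumptions $\mathcal{E}(t) = \mathcal{E}^{critic}_{app}(t) = 0$, the $T_{\max}$-dependent contributions to $\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\big[\Vert h_t\Vert^2\big]$ vanish and the bound reduces to $\widetilde{\mathcal{O}}(\tau_{mix}\log T_{\max})$, so the second term of Eq 47 becomes

\begin{align} \mathcal{O}\left(\frac{1}{\sqrt{T}}\right)\sqrt{\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\bigg[\Vert h_t\Vert^2\bigg]} \leq \mathcal{O}\left(\frac{1}{\sqrt{T}}\right)\sqrt{\widetilde{\mathcal{O}}\left(\tau_{mix}\log T_{\max}\right)} = \widetilde{\mathcal{O}}\left(\sqrt{\frac{\tau_{mix}\log T_{\max}}{T}}\right), \end{align}

and the dominating term is still the $T^{-1/4}$ term coming from Eq 54.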
---

**Summary of core contributions:** **We take this opportunity to restate and clarify the contributions of our work. Our major focus is to close the gap between theoretical analysis and practical algorithms for average-reward reinforcement learning.** We note that for any theoretical analysis of an algorithm to be closer to practical settings, it should satisfy the following criteria:

**Global Optimality** - It is important to establish the global optimality of an algorithm because establishing only first-order stationary convergence may lead to suboptimal solutions in terms of practical relevance, essentially because the algorithm may converge to saddle points.
**Tightest Dependence on Mixing Time** - An algorithm should converge in a sample-efficient manner to mitigate the cost of data collection. Thus, improving the sample complexity bound with respect to the mixing time dependence is crucial for average-reward RL.

**Mixing Time Assumptions Removed** - Relaxing the oracle assumption on mixing time, which in practice is hard to calculate or estimate, makes implementing an algorithm more feasible.

**Practical Trajectory Length** - Trajectory length is a crucial aspect of RL algorithms, as it is the number of continuous samples processed for a gradient update. To implement an RL algorithm, a practical minimum trajectory length is needed, which, as we will see, is not provided in prior works such as Bai et al. 2023.

**General Parameterized Policy** - General policy parameterization, rather than linear or tabular parameterization, makes the analysis more applicable to real-world RL implementations that rely on neural network parameterizations for deep RL.

This table highlights the contributions of our work compared to prior work on global optimality (Bai et al.) and mixing time (Dorfman et al. 2022, Suttle et al. 2023) with regard to the above five criteria.

| Ref | Global Optimality | Tightest Known Dependence on Mixing Time | Mixing Time Assumptions Removed | Practical Trajectory Length | General Parameterized Policy |
|:-------------------:|:-----------------:|:-----------------------------------:|:---------------------------------:|:----------------------------:|:--------------------------------:|
| Dorfman et al. 2022 | <span style="color:red">No</span> | <span style="color:red">No</span> | <span style="color:green">Yes</span> | <span style="color:green">Yes</span> | <span style="color:red">No</span> |
| Suttle et al. 2023 | <span style="color:red">No</span> | <span style="color:red">No</span> | <span style="color:green">Yes</span> | <span style="color:green">Yes</span> | <span style="color:green">Yes</span> |
| Bai et al. 2023 | <span style="color:green">Yes</span> | <span style="color:red">No</span> | <span style="color:red">No</span> | <span style="color:red">No</span> | <span style="color:green">Yes</span> |
| **This Work** | <span style="color:green">**Yes**</span> | <span style="color:green">**Yes**</span> | <span style="color:green">**Yes**</span> | <span style="color:green">**Yes**</span> | <span style="color:green">**Yes**</span> |

We also provide additional experiments with larger scale and complexity, and comparisons to more baselines as suggested by the reviewers, in this document <span style="color:blue">[here](https://drive.google.com/file/d/1yh5S5FHU_-AymOwlOCW0BLbUDhN4jWcz/view?usp=sharing)</span>.

-----------------------------------------------------------------

## Response to Reviewer z6Wk [Score 5, confidence 3]

We are thankful to the reviewer for dedicating their valuable time and effort to evaluating our work, which has allowed us to strengthen the paper. We respond to the reviewer's inquiries in detail as follows.

> **Weakness 1:** The comparison of empirical studies can be further improved.
> Expanding the experimental analysis to include comparisons with additional existing algorithms would provide a more comprehensive assessment of MAC's performance and generalizability.

**Response to Weakness 1:** ***We have now added additional experiments (larger scale with more baselines).*** Thank you for the comment. As the reviewer suggested, we kindly refer you to the document <span style="color:blue">[here](https://drive.google.com/file/d/1yh5S5FHU_-AymOwlOCW0BLbUDhN4jWcz/view?usp=sharing)</span>, where we provide additional experiments comparing MAC's performance with additional existing algorithms such as Vanilla Actor-Critic (VAC) and REINFORCE. To expand upon the experimental validation, we have now added two experiments with larger grid sizes (10x10 and 15x15) in Section A. We also provide, as suggested by Reviewer BqHm, additional experiments in Section B comparing MAC against these baselines in more complicated environments. We remark that in the additional experiments, we consistently show a higher success rate for MAC compared to the new baselines with fixed trajectory lengths. These results reinforce the need for a trajectory length scheme that calibrates for mixing time to help modulate the noisy gradients introduced by the burn-in samples processed before reaching the stationary distribution.

## Response to Reviewer UhK8 [Score 4, confidence 2]

We express our gratitude to the reviewer for taking the time to review our manuscript. We sincerely appreciate the feedback and provide detailed responses to all questions below. Throughout our response we would like to label these references for convenience:

[1] Suttle et al., Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level Monte Carlo actor-critic. ICML, 2023.

[2] Bai, Q., Mondal, W. U., and Aggarwal, V. Regret analysis of policy gradient algorithm for infinite horizon average reward Markov decision processes. In AAAI Conference on Artificial Intelligence, 2024.

> **Weakness/Question 1:** The difference from previous work [1] is less clear, and therefore the significance of contribution is less clear.

**Response to Weakness 1:** We thank the reviewer for the comment and apologize if the contributions were not clear from the current write-up.
We take this opportunity to clarify our contributions and highlight the differences from [1] as follows.

1. **We provide global convergence results, while [1] has only local convergence results.** While [1] introduces MAC and establishes its convergence to first-order stationarity, we establish its convergence to global optimality.

2. **Improvement in state-of-the-art convergence rates.** In addition to establishing global optimality, our results improve on those in [2] by ***achieving the tightest known dependence on mixing time while removing the restrictive assumption of requiring oracle knowledge of the mixing time***. This relaxation is a first for the literature on policy gradient methods for average-reward MDPs, as prior works (e.g., [2]) rely on oracle knowledge of the mixing time for gradient estimation. By capitalizing on MAC's multi-level gradient estimator, however, we show that this reliance is not necessary for global convergence.

3. **We have corrected the proofs of Lemma 3 and 4 in [1].** Lemma 3 and Lemma 4 of our manuscript correspond to Theorems 4.7 and 4.8 in [1]. As pointed out in the comment following Eq. (26) in our paper, the proofs provided for Lemmas 3 and 4 in [1] are incorrect, so we cannot directly utilize the analysis developed in [1]. The proofs in [1] lead to bounds that are looser in terms of $T_{max}$ than what is stated in Lemmas 3 and 4; the looser bounds both arise from how the error in reward tracking is bounded in Theorem D.1 of [1]. Without correcting the proof to align with the stated bounds, we cannot recover the $O(T^{-\frac{1}{4}})$ rate in Corollary 1, as PPGAE does in [2]. Hence, we redid the analysis and provide it in Appendix D.

4. **Closing the Gap between Theory and Practice: first work to alleviate the mixing time oracle assumption.** One major contribution of our work is removing the requirement of a mixing time oracle, which is a limitation of every other existing method/analysis in the literature. Our core technical insight is to utilize the idea of Multi-Level Monte Carlo (MLMC) gradient estimation, which does not require knowing the mixing time in advance. This results in the MAC algorithm, which is more practical than the existing state-of-the-art PPGAE algorithm because PPGAE's advantage estimation for the policy gradient calculation requires the mixing time.
5. **Discussion on Trajectory Length Feasibility:** We also provide a discussion on the practicality of the number of samples needed to implement MAC and compare it against the infeasible trajectory length requirement of PPGAE in Section 4.

6. **Detailed experimental comparison.** We would like to highlight that the existing literature on average-reward global convergence ([2]) does not include detailed empirical evidence to support its claims. We included empirical evidence and have now added additional experiments in this document <span style="color:blue">[here](https://drive.google.com/file/d/1yh5S5FHU_-AymOwlOCW0BLbUDhN4jWcz/view?usp=sharing)</span> to compare with baselines such as Vanilla Actor-Critic and REINFORCE.

> **Question 1:** What is the technical novelty in the analysis of global convergence?

**Response to Question 1:** Thank you for the question and for giving us the opportunity to clarify. We highlight the technical novelties of our work as follows.

### Technical Novelties

In terms of technical contribution to the analysis, we remark that we adapt the analysis of [2] to alleviate any requirement on the mixing time of the Markov process, which requires an adaptive trajectory length and an attenuating step size. We utilize this setting in Equation 29 in Lemma 6 of Section 4.2 as our method for analyzing the global optimality of MAC. We will make this clearer in the list of contributions in the introduction. Our Equation 29 corresponds to Equation 28 in Lemma 5 of [2], which assumes a constant stepsize $\alpha$. Analyzing PPGAE from [2] within this adapted framework in Eq 29 raises two issues:
* **The existing literature relies on a constant step size:** The third term of the framework in [2] involves the squared norm of the estimated gradient, $||w_k||^2$. Generally, during training this squared norm should approach zero as the gradients converge. [2] is able to use their Lemma 3 to bound this term; however, this assumes that the step size is constant. A constant stepsize leads to high variance and sometimes no convergence, as it causes overstepping during the optimization process, so a diminishing stepsize is more practical to combat these issues. In this work, by capitalizing on the Adagrad stepsize of MAC, we can use our Equation 27 in Lemma 5 to bound the summation, as we do in Eq 32 on line 343, repeated below:
$\frac{1}{2}\sum_{t=1}^{T}\alpha_t\Vert h_t\Vert^2 \leq \sqrt{\sum_{t=1}^{T}\Vert h_t\Vert^2}.$
We can then bound the RHS of the above inequality by Eq 23 in Lemma 2 on line 280.

* **Higher Variance due to Uniform Weighting of Update Directions:** Another issue arises from the KL divergence between the optimal policy $\pi^*$ and the policy at update $k$, $\pi_k$. In the framework in Lemma 5 of [2], the difference between $KL(\pi^* | \pi_k)$ and $KL(\pi^* | \pi_{k+1})$ is used to measure the difference between consecutive update directions. In [2], this difference, $KL(\pi^* | \pi_k) - KL(\pi^* | \pi_{k+1})$, is given the same weighting, $\frac{1}{\alpha}$, for all $k$ in their Eq 28. Intuitively, this means that the policy update direction at the beginning of training, when you are farther away from $\pi^*$, is treated as being as important as when you are approaching $\pi^*$. Generally, due to the higher amount of uncertainty in the model at the beginning of training, this equal weighting assumption is not practical and leads to higher variance. We address this issue in our work by utilizing multi-level Monte Carlo updates, as shown in the bound displayed after this list.
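Concretely (line 348 of the manuscript), since the Adagrad stepsizes satisfy $\alpha_T \leq \alpha_t$, the resulting non-telescoping KL divergence summation can be bounded as

\begin{align} \frac{1}{T}\sum_{t=1}^{T}\frac{1}{\alpha_t}\mathbf{E}_{s\sim d^{\pi^*}}\zeta_t \leq \frac{1}{T}\sum_{t=1}^{T}\frac{1}{\alpha_T}\mathbf{E}_{s\sim d^{\pi^*}}\zeta_t = \frac{\mathbf{E}_{s\sim d^{\pi^*}}\big[KL(\pi^*(\cdot\vert s)\Vert\pi_{\theta_1}(\cdot\vert s))\big]}{T\alpha_T}, \end{align}

where $\alpha_T = \frac{\alpha_T'}{\sqrt{\sum_{t=1}^{T}\Vert h_t\Vert^2}}$ as in Equation 33.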
**Furthermore, the utilization of Lemma 5 for Theorem 1 in [2] is not applicable to using our Lemma 6 for our Theorem 1:** We would also like to highlight that we could **not** apply Lemma 6 in a manner analogous to how [2] used their Lemma 5 for PPGAE, due to a difference between the established preliminary lemmas for PPGAE and MAC. For PPGAE, [2] provide Lemma 3, which establishes a bound on the gradient estimation error, $\mathbf{E}\left[||\omega_k-\nabla_{\theta}J(\theta_k)||^2\right]$, where $w_k$ is the policy gradient estimator at update $k$ in [2]. In their work, they bound the difference between the estimated gradient and the optimal estimate, $||w_k - w_k^*||$, and the norm-squared of the estimated gradient, $||w_k||^2$, on the RHS of Eq 28 in terms of $\mathbf{E}\left[||\omega_k-\nabla_{\theta}J(\theta_k)||^2\right]$. They then utilize Lemma 3 to obtain the final global convergence bound for PPGAE. From [1], we do not have such a bound for the gradient estimation of the MLMC estimator of MAC. Rather, we have bounds on the norm-squared of the MLMC policy gradient estimator, $||h^{MLMC}_t||^2$, and on the convergence rate of the policy gradient, $||\nabla_{\theta}J(\theta_t)||^2$, stated in Lemma 2 and Lemma 4, respectively, of our manuscript. To utilize these lemmas, we thus provide an upper bound on the difference between the estimated MLMC gradient and the optimal MLMC estimate, $||h_t^{MLMC} - h_t^{*MLMC}||$, in terms of $||h^{MLMC}_t||^2$ and $||\nabla_{\theta}J(\theta_t)||^2$, shown in Eq 31, with more details provided in our full proof of Theorem 1 in Appendix B. Using this novel bound in Eq 31, we can then use Lemmas 2 and 4 to establish the global optimality of MAC.

------------------------------------------

## Response to Reviewer BqHm [Score 4, confidence 4]

We thank the reviewer for their time in reviewing our manuscript. Throughout our response we would like to label these references for convenience:

[1] Suttle et al., Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level Monte Carlo actor-critic. ICML, 2023.

[2] Bai, Q., Mondal, W. U., and Aggarwal, V. Regret analysis of policy gradient algorithm for infinite horizon average reward Markov decision processes. In AAAI Conference on Artificial Intelligence, 2024.

[3] Dorfman, R. and Levy, K. Y. Adapting to mixing time in stochastic optimization with Markovian data. In Proceedings of the 39th International Conference on Machine Learning, Jul 2022.

[4] Liu, Y., Zhang, K., Basar, T., and Yin, W. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33:7624–7636, 2020.

> **Weakness 1:** The authors did not raise any new problems or methods in this paper, making it appear as just a supplement to the MAC method.
> To be specific, some important lemmas listed in this paper, such as Lemma 3 and Lemma 4, have been raised or are only some incremental for the results in previous works [Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic, Adapting to mixing time in stochastic optimization with Markovian data] and the proof process of this paper's main theoretical result is a commonly used convergence proof method in optimization problems, which lacks innovation somehow.

**Response to Weakness 1:** We thank you for the comment. We apologize if our framing gives the impression that we simply merged the concepts of the two papers. Below we highlight more clearly the problem we raise, our contributions, and the technical novelties needed to achieve our results.

### New Problem We Raise

**We raise the need to ground theoretical global convergence guarantees in practicality for average-reward RL**: Although [1] established the first-order stationary convergence of MAC, it is not trivial to simply derive global convergence, especially in the average-reward setting. Most prior literature on global convergence is for discounted rewards and relies on a gradient domination assumption that turns the optimization into a strongly convex problem. However, gradient domination is not available in the average-reward setting, so we cannot rely on similar machinery. Recently, [4] provided a general framework in the discounted setting that allows one to prove global convergence from the first-order stationary convergence of a PG algorithm, and [2] adapted it to the average-reward MDP and introduced PPGAE to establish the first average-reward global convergence result with general parameterization. **However, [2] makes restrictive assumptions in terms of an oracle mixing time and trajectory length selection that render their global convergence analysis impractical for real-world use. We aim to close this gap by proving the global convergence of MAC.** More details on how we do so can be found under the technical novelties below.

### Contributions
1. **We provide global convergence results, while [1] has only local convergence results.** While [1] introduces MAC and establishes its convergence to first-order stationarity, we establish its convergence to global optimality.

2. **Improvement in state-of-the-art convergence rates.** In addition to establishing global optimality, our results improve on those in [2] by ***achieving the tightest known dependence on mixing time while removing the restrictive assumption of requiring oracle knowledge of the mixing time***. This relaxation is a first for the literature on policy gradient methods for average-reward MDPs, as prior works (e.g., [2]) rely on oracle knowledge of the mixing time for gradient estimation. By capitalizing on MAC's multi-level gradient estimator, however, we show that this reliance is not necessary for global convergence.

3. **We have corrected the proofs of Lemma 3 and 4 in [1].** Lemma 3 and Lemma 4 of our manuscript correspond to Theorems 4.7 and 4.8 in [1]. As pointed out in the comment following Eq. (26) in our paper, the proofs provided for Lemmas 3 and 4 in [1] are incorrect, so we cannot directly utilize the analysis developed in [1]. The proofs in [1] lead to bounds that are looser in terms of $T_{max}$ than what is stated in Lemmas 3 and 4; the looser bounds both arise from how the error in reward tracking is bounded in Theorem D.1 of [1]. Without correcting the proof to align with the stated bounds, we cannot recover the $O(T^{-\frac{1}{4}})$ rate in Corollary 1, as PPGAE does in [2]. Hence, we redid the analysis and provide it in Appendix D.

4. **Closing the Gap between Theory and Practice: first work to alleviate the mixing time oracle assumption.** One major contribution of our work is removing the requirement of a mixing time oracle, which is a limitation of every other existing method/analysis in the literature. Our core technical insight is to utilize the idea of Multi-Level Monte Carlo (MLMC) gradient estimation, which does not require knowing the mixing time in advance. This results in the MAC algorithm, which is more practical than the existing state-of-the-art PPGAE algorithm because PPGAE's advantage estimation for the policy gradient calculation requires the mixing time.
5. **Discussion on Trajectory Length Feasibility:** We also provide a discussion on the practicality of the number of samples needed to implement MAC and compare it against the infeasible trajectory length requirement of PPGAE in Section 4.

6. **Detailed experimental comparison.** We would like to highlight that the existing literature on average-reward global convergence ([2]) does not include detailed empirical evidence to support its claims. We included empirical evidence and have now added additional experiments in this document <span style="color:blue">[here](https://drive.google.com/file/d/1yh5S5FHU_-AymOwlOCW0BLbUDhN4jWcz/view?usp=sharing)</span> to compare with baselines such as Vanilla Actor-Critic and REINFORCE.

### Technical Novelties

In terms of technical contribution to the analysis, we remark that we adapt the analysis of [2] to alleviate any requirement on the mixing time of the Markov process, which requires an adaptive trajectory length and an attenuating step size. We utilize this setting in Equation 29 in Lemma 6 of Section 4.2 as our method for analyzing the global optimality of MAC. We will make this clearer in the list of contributions in the introduction. Our Equation 29 corresponds to Equation 28 in Lemma 5 of [2], which assumes a constant stepsize $\alpha$. Analyzing PPGAE from [2] within this adapted framework in Eq 29 raises two issues:

* **The existing literature relies on a constant step size:** The third term of the framework in [2] involves the squared norm of the estimated gradient, $||w_k||^2$. Generally, during training this squared norm should approach zero as the gradients converge. [2] is able to use their Lemma 3 to bound this term; however, this assumes that the step size is constant. A constant stepsize leads to high variance and sometimes no convergence, as it causes overstepping during the optimization process, so a diminishing stepsize is more practical to combat these issues. In this work, by capitalizing on the Adagrad stepsize of MAC, we can use our Equation 27 in Lemma 5 to bound the summation, as we do in Eq 32 on line 343, repeated below:
$\frac{1}{2}\sum_{t=1}^{T}\alpha_t\Vert h_t\Vert^2 \leq \sqrt{\sum_{t=1}^{T}\Vert h_t\Vert^2}.$
We can then bound the RHS of the above inequality by Eq 23 in Lemma 2 on line 280.

* **Higher Variance due to Uniform Weighting of Update Directions:** Another issue arises from the KL divergence between the optimal policy $\pi^*$ and the policy at update $k$, $\pi_k$. In the framework in Lemma 5 of [2], the difference between $KL(\pi^* | \pi_k)$ and $KL(\pi^* | \pi_{k+1})$ is used to measure the difference between consecutive update directions. In [2], this difference, $KL(\pi^* | \pi_k) - KL(\pi^* | \pi_{k+1})$, is given the same weighting, $\frac{1}{\alpha}$, for all $k$ in their Eq 28. Intuitively, this means that the policy update direction at the beginning of training, when you are farther away from $\pi^*$, is treated as being as important as when you are approaching $\pi^*$.
Generally, due to the higher amount of uncertainty in the model at the beginning of training, this equal weighting assumption is not practical and can lead to higher variance. Furthermore, when summing their Eq 28 over all $k$, the constant $\frac{1}{\alpha}$ can be taken out of the summation; the summation is thus telescoping and can be simplified to just $KL(\pi^* | \pi_{1})$, which is a constant that can be disregarded in their global convergence analysis. However, with a non-constant stepsize, our KL divergence summation is not a telescoping sum and cannot be simplified algebraically to a constant as in [2]. Thus, further mechanics, which PPGAE lacks, are needed to accommodate the non-telescoping KL divergence summation. In our work, on line 348, we use the fact that in Adagrad $\alpha_T \leq \alpha_t$ to bound this summation and reduce variance.

**Furthermore, the utilization of Lemma 5 for Theorem 1 in [2] is not applicable to using our Lemma 6 for our Theorem 1:** We would also like to highlight that we could **not** apply Lemma 6 in a manner analogous to how [2] used their Lemma 5 for PPGAE, due to a difference between the established preliminary lemmas for PPGAE and MAC. For PPGAE, [2] provide Lemma 3, which establishes a bound on the gradient estimation error, $\mathbf{E}\left[||\omega_k-\nabla_{\theta}J(\theta_k)||^2\right]$, where $w_k$ is the policy gradient estimator at update $k$ in [2]. In their work, they bound the difference between the estimated gradient and the optimal estimate, $||w_k - w_k^*||$, and the norm-squared of the estimated gradient, $||w_k||^2$, on the RHS of Eq 28 in terms of $\mathbf{E}\left[||\omega_k-\nabla_{\theta}J(\theta_k)||^2\right]$. They then utilize Lemma 3 to obtain the final global convergence bound for PPGAE. From [1], we do not have such a bound for the gradient estimation of the MLMC estimator of MAC. Rather, we have bounds on the norm-squared of the MLMC policy gradient estimator, $||h^{MLMC}_t||^2$, and on the convergence rate of the policy gradient, $||\nabla_{\theta}J(\theta_t)||^2$, stated in Lemma 2 and Lemma 4, respectively, of our manuscript. To utilize these lemmas, we thus provide an upper bound on the difference between the estimated MLMC gradient and the optimal MLMC estimate, $||h_t^{MLMC} - h_t^{*MLMC}||$, in terms of $||h^{MLMC}_t||^2$ and $||\nabla_{\theta}J(\theta_t)||^2$, shown in Eq 31, with more details provided in our full proof of Theorem 1 in Appendix B. Using this novel bound in Eq 31, we can then use Lemmas 2 and 4 to establish the global optimality of MAC.

> **Weakness 2:** The mixing use of symbols for update times T and trajectory length T leads to ambiguity in the paper (for example, in Theorem 1).
**Response to Weakness 2:** Thank you for pointing out this ambiguity. We will update the manuscript so that the update is indexed by $K$ to remove the confusion with the trajectory length $T$.

> **Weakness 3:** The bound shown in Theorem 1 could be very loose when selecting a very large $T_{max}$.

**Response to Weakness 3:** Thank you for the comment. When $T$ is large, the middle two terms in Theorem 1 become negligible compared to the last term, which itself also shrinks as $T_{max}$ grows, reducing the bias.

> **Weakness 4:** Too many assumptions are made. Except the 4 Assumptions explicitly listed, there are at least two additional assumptions are made in the Lemmas' statements.

**Response to Weakness 4:** Thank you for raising these concerns. The assumptions in Lemma 4, that $J(\theta)$ is $L$-smooth and $\sup_{\theta} | J(\theta) | \leq M$, are actually consequences of Assumption 2. In regards to Lemma 3, we believe that the use of "Assume" when setting the stepsize is not the most appropriate term, and we will change it to "consider".

> **Weakness 5:** The experimental section is too simplistic and additional experiments should be provided to validate the viewpoints presented in the paper.

**Response to Weakness 5:** ***We have now added additional experiments (larger scale with more baselines).*** Thank you for the comment. We kindly refer you to the document <span style="color:blue">[here](https://drive.google.com/file/d/1yh5S5FHU_-AymOwlOCW0BLbUDhN4jWcz/view?usp=sharing)</span>, where we provide additional experiments comparing MAC's performance with additional existing algorithms such as Vanilla Actor-Critic (VAC) and REINFORCE. To expand upon the experimental validation, we have now added two experiments with larger grid sizes (10x10 and 15x15) in Section A. We also provide additional experiments in Section B comparing MAC against these baselines in more complicated environments. We remark that in the additional experiments, we consistently show a higher success rate for MAC compared to the new baselines with fixed trajectory lengths. These results reinforce the need for a trajectory length scheme that calibrates for mixing time to help modulate the noisy gradients introduced by the burn-in samples processed before reaching the stationary distribution.
However, the Chernoff bound to hold in Equation 28, it requires the sum of iid random variables from the distribution m(s) or h(s) which violates in the non-iid case, since in that case the samples are correlated. For example refer to the definition of Chernoff Bound in Theorem 4 [2] where the probability is defined over the sum of random variables. Hence, we cannot directly use Equation 28 and need to leverage a different form as shown in Equation 36 in the Proof of Theorem 2. -->

## Response to Reviewer xqbb [Score 5, confidence 2]

We thank the reviewer for their valuable time in assessing our manuscript and for recognizing its novel contributions. We have addressed the reviewer's queries in the responses below. Throughout our response, we use the following reference labels for convenience:

[1] Suttle et al. Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level Monte Carlo actor-critic. ICML, 2023.

[2] Bai, Q., Mondal, W. U., and Aggarwal, V. Regret analysis of policy gradient algorithm for infinite horizon average reward Markov decision processes. In AAAI Conference on Artificial Intelligence, 2024.

> **Weakness 1:** The main contribution of the work is to extend the theoretical results used for PPGAE to the MAC algorithm. From the other side, this can also be seen as a weakness as the reported results are obtained without large modifications from the original ones appearing in the two works related to the MAC algorithm ([Suttle et al. 2023]) and the PPGAE algorithm (Bai et al. 2024).

**Response to Weakness 1:** We thank the reviewer for the comment and take this opportunity to clarify our contributions, highlighting the differences from [1] and [2] and our technical novelties as follows.

### Contributions

1. **We provide global convergence results while [1] has only local convergence results.** While [1] introduces MAC and establishes its convergence to first-order stationarity, we establish its convergence to global optimality.

2. **Improvement over state-of-the-art convergence rates.** In addition to establishing global optimality, our results improve on those in [2] by ***achieving the tightest known dependence on mixing time while removing the restrictive assumption of requiring oracle knowledge of mixing time***. This relaxation is a first for the literature on policy gradient methods for average-reward MDPs, as prior works (e.g., [2]) rely on oracle knowledge of mixing time for gradient estimation. By capitalizing on MAC's multi-level gradient estimator, however, we show that this reliance is not necessary for global convergence.

<!-- The error leads to bounds that are looser in terms of $T_{max}$ than what is stated in Lemma 3 and 4. Without correcting the proof to align with the stated bounds we can not recover in Corollary 1 the $O(T^{-\frac{1}{4}})$ as PPGAE does in [2]. -->

3. **We have corrected the proofs of Lemma 3 and 4 in [1].** Lemma 3 and Lemma 4 of our manuscript correspond to Theorems 4.7 and 4.8 in [1]. As pointed out in the comment following eq. (26) in our paper, the proofs provided for Lemma 3 and 4 in [1] are incorrect. Hence, we cannot directly utilize the analysis developed in [1]. The proofs in [1] lead to bounds that are looser in terms of $T_{max}$ than what is stated in Lemmas 3 and 4. Both looser bounds arise from how the error in reward tracking is bounded in Theorem D.1 of [1].
Without correcting the proofs to align with the stated bounds, we cannot recover the $O(T^{-\frac{1}{4}})$ rate in Corollary 1 that PPGAE achieves in [2]. Hence, we redid the analysis to ensure the stated bounds hold and provide it in Appendix D.

<!-- 3. **We have corrected the Proof of Lemma 3 and 4 in [1].** As pointed out in the comment following eq. (26) in our paper, the proofs provided of Lemma 3 and 4 of [1] are incorrect. Hence, we cannot directly utilize the analysis developed in [1]. We note that both the proofs of Lemma 3 and 4 rely on the reward tracking convergence established in Theorem D.1 in [1] which contains a term $\tilde{O}{\left( \sqrt{ \frac{\tau_{mix}\log T_{\max}}{T_{\max}}} \right)}$ that should actually absorb the $\tilde{O}{\left( { \frac{\tau_{mix}\log T_{\max}}{T_{\max}}} \right)}$ in Lemma 3 and 4. In the present work, we have corrected the proof of Theorem D.1 provided in [1] to remove the square root and have redone the corresponding portion of the analysis. More details can be found in Appendix D. We also remark that in the main body we refer to Theorem D.1 as Lemma D.1 and we will correct it. -->

4. **Closing the Gap between Theory and Practice: first work to alleviate the mixing time oracle assumption.** One major contribution of our work is removing the requirement of a mixing time oracle, which is a limitation of every other existing method/analysis in the literature. Our core technical insight is to utilize the idea of Multi-Level Monte Carlo (MLMC) gradient estimation, which does not require knowing the mixing time in advance. This makes MAC a more practical algorithm than the existing state-of-the-art PPGAE algorithm, because PPGAE's advantage estimation for the policy gradient calculation requires knowledge of the mixing time.

<!-- Due to MAC being a variant of actor-critic, we do not require an advantage estimation algorithm and can instead rely on the learnable critic function for the policy gradient calculation. To utilize a critic function in the global convergence analysis, we also incorporated the convergence rate of the MLMC critic function, given in Lemma 3, in our analysis. -->

<!-- [**[MENTION HERE THE CORE TECHNICAL IDEA BY WHICH WE ARE ABLE TO DO THAT]**] -->

5. **Discussion on Trajectory Length Feasibility:** We also provide a discussion on the practicality of the number of samples needed to implement MAC and compare it against the infeasible trajectory length requirement of PPGAE in Section 4.

6. **Detailed experimental comparison.** We would like to highlight that the existing literature on average-reward global convergence ([2]) does not include detailed empirical evidence to support its claims. We include empirical evidence and have now added additional experiments in this document <span style="color:blue">[here](https://drive.google.com/file/d/1yh5S5FHU_-AymOwlOCW0BLbUDhN4jWcz/view?usp=sharing)</span> comparing against different baselines such as Vanilla Actor-Critic and REINFORCE.

### Technical Novelties

In terms of the technical contribution to the analysis, we remark that we adapt the analysis of [2] to alleviate any requirement on the mixing time of the Markov process, which requires an adaptive trajectory length and an attenuating stepsize. We utilize this setting in Equation 29 of Lemma 6 in Section 4.2 as our method of analyzing the global optimality of MAC. We will make this clearer in the list of contributions in the introduction. Our Equation 29 corresponds to Equation 28 in Lemma 5 of [2], which assumes a constant stepsize $\alpha$.
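To make the role of the constant stepsize concrete, here is a brief worked sketch of the standard telescoping argument, stated only in terms of quantities already introduced above (an illustration, not a restatement of the exact bound in [2]). With a fixed weight $\frac{1}{\alpha}$, the KL-difference terms cancel when summed over the $K$ updates:

$$\sum_{k=1}^{K} \frac{1}{\alpha}\left( KL(\pi^* | \pi_k) - KL(\pi^* | \pi_{k+1}) \right) = \frac{1}{\alpha}\left( KL(\pi^* | \pi_1) - KL(\pi^* | \pi_{K+1}) \right) \leq \frac{1}{\alpha} KL(\pi^* | \pi_1),$$

where the last step uses the non-negativity of the KL divergence. With a time-varying stepsize $\alpha_t$, the weights $\frac{1}{\alpha_t}$ differ across terms and no such cancellation occurs; this is exactly the issue discussed in the second point below, where we instead rely on the Adagrad property $\alpha_T \leq \alpha_t$.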
Analyzing PPGAE from [2] with this adapted framework in Eq 29 raises two issues:

* **The existing literature relies on a constant stepsize:** The third term of the framework in [2] is in terms of the squared norm of the estimated gradient, $||w_k||^2$. Generally, during training the squared norm should approach zero as the gradients converge. [2] is able to use their Lemma 3 to bound this term; however, this assumes a constant stepsize. A constant stepsize can lead to high variance and sometimes non-convergence, as it causes overstepping during optimization, so a diminishing stepsize is more practical. In this work, by capitalizing on the Adagrad stepsize of MAC, we can use our Equation 27 in Lemma 5 to bound the summation, as we do in Eq 32 on line 343, repeated below:
$$\frac{1}{2}\sum_{t=1}^{T}\alpha_t\Vert h_t\Vert^2 \leq \sqrt{\sum_{t=1}^{T}\Vert h_t\Vert^2}.$$
We can then bound the RHS of the above inequality using Eq 23 in Lemma 2 on line 280.

* **Higher Variance due to Uniform Weighting of Update Directions:** Another issue arises from the KL divergence between the optimal policy $\pi^*$ and the policy at update $k$, $\pi_k$. In the framework in Lemma 5 of [2], the difference between $KL(\pi^* | \pi_k)$ and $KL(\pi^* | \pi_{k+1})$ is used to measure the difference between consecutive update directions. In [2], this difference, $KL(\pi^* | \pi_k) - KL(\pi^* | \pi_{k+1})$, is given the same weighting, $\frac{1}{\alpha}$, for all $k$ in their Eq 28. Intuitively, this means that the policy update direction at the beginning of training, when the policy is far from $\pi^*$, is treated as equally important as when the policy is approaching $\pi^*$. Generally, due to the higher model uncertainty present at the beginning of training, this equal weighting is not practical and can lead to higher variance. Furthermore, when summing their Eq 28 over all $k$, the constant $\frac{1}{\alpha}$ can be taken out of the summation. The summation is thus telescoping and simplifies to just $KL(\pi^* | \pi_{1})$, a constant that can be disregarded in their global convergence analysis. However, with a non-constant stepsize, our KL divergence summation is not a telescoping sum and cannot be simplified algebraically to a constant as in [2]. Thus, further machinery, which the PPGAE analysis lacks, is needed to accommodate the non-telescoping KL divergence summation. In our work, on line 348, we use the fact that in Adagrad $\alpha_T \leq \alpha_t$ to bound this summation and reduce variance.

**Furthermore, the way Lemma 5 is utilized for Theorem 1 in [2] is not applicable when using our Lemma 6 for our Theorem 1:** We would also like to highlight that we could **not** apply Lemma 6 in a manner analogous to how [2] used their Lemma 5 for PPGAE, due to a difference between the established preliminary lemmas for PPGAE and MAC. In [2], they provide Lemma 3 for PPGAE, which establishes a bound for the gradient estimation error, $\mathbf{E}\left[||w_k-\nabla_{\theta}J(\theta_k)||^2\right]$, where $w_k$ is the policy gradient estimator at update $k$ in [2]. In their work, they bound the difference between the estimated gradient and the optimal estimation, $||w_k - w_k^*||$, and the norm-squared of the estimated gradient, $||w_k||^2$, on the RHS of Eq 28 in terms of $\mathbf{E}\left[||w_k-\nabla_{\theta}J(\theta_k)||^2\right]$.
They then utilize Lemma 3 to obtain the final global convergence bound for PPGAE. From [1], we do not have such a bound for the gradient estimation error of the MLMC estimator for MAC. Rather, we have bounds for the norm-squared of the MLMC policy gradient estimator, $||h^{MLMC}_t||^2$, and for the convergence rate of the policy gradient, $||\nabla_{\theta}J(\theta_t)||^2$, stated in Lemma 2 and Lemma 4 of our manuscript, respectively. To utilize these Lemmas, we provide an upper bound for the difference between the estimated MLMC gradient and the optimal MLMC estimation, $||h_t^{MLMC} - h_t^{*MLMC}||$, in terms of $||h^{MLMC}_t||^2$ and $||\nabla_{\theta}J(\theta_t)||^2$, shown in Eq 31, with more details provided in our full proof of Theorem 1 in Appendix B. Using this novel bound in Eq 31, we can then use Lemmas 2 and 4 to establish the global optimality of MAC.

<!-- This adapted framework is more complex than just simply conditioning the stepsize $\alpha$ by $t$ because the KL divergence summation is no longer a telescoping sum as in Bai et. al. Later on in the global convergence analysis, the use of Adagrad stepsize helps alleviate this issue. We will make that more clear in our analysis. We also highlight that -->

> **Overall Typo/Presentation Weaknesses**: Regarding the presentation of the work, it appears that there are many parts that are less clear, many sentences are not syntactically correct and a lot of typos can be spotted while reading.

**Response to Overall Typo/Presentation Weaknesses:** We greatly appreciate the thorough review of the manuscript and the effort in pointing out the flaws in our presentation. We address each example below by providing a clearer revision.

> **Weakness 2:** Furthermore, by definition H, even if mixing time and hitting time are known, the minimum T for K > 1 is practically infeasible as we will explain in Section 4" in lines 211-214;

**Response to Weakness 2:** We will revise it as follows: "Furthermore, even if mixing time and hitting time are known, by definition of the epoch length, H, the minimum sample budget, T, required for the number of episodes, K, to be at least one is practically infeasible, as we will explain in Section 4."

> **Weakness 3:** "into terms of $||\nabla J(\theta_t)||^2$, which MAC has an bound for established by ..." in lines 324-325;

**Response to Weakness 3:** We will change it to: "into terms of $||\nabla J(\theta_t)||^2$, the local convergence rate of MAC. We can then use the convergence rate bound established by (Suttle et al.), which we state in Lemma 4."

> **Weakness 4:** "where we able to change remove the square root", in line 326;

**Response to Weakness 4:** We will update it to: "where we are able to remove the square root".

> **Weakness 5:** Constant "R" should be replaced in Assumption 2.2 using "K" instead,

**Response to Weakness 5:** Thank you for pointing this out. We instead intend to replace "K" with "R" in the assumption statement, because later in the analysis, such as in Equation 32, we use "R".

> **Weakness 6:** The formulation of (actor update) in Equation 15 seems wrong. $\eta_t$ should probably be replaced by $\alpha_t$.

**Response to Weakness 6:** Yes, you are correct. Thank you for pointing this out.

> **Weakness 7:** Furthermore, since I would suggest adding more comments in the sketch of the proof of Theorem 1, like for example introducing quantity $\alpha_T'$. It is also preferable to not refer directly to Equations appearing in the Appendix, as done for Equation (38).
**Response to Weakness 7:** We appreciate this feedback. For clarity, $\alpha_T'$ comes from setting $t = T$ in the definition of $\alpha_t'$ introduced in Lemma 3; we will explicitly state this when introducing $\alpha_T'$. We will also update the manuscript so that equations referenced in the main body are themselves within the main body; "Equation 38" on line 330 can be replaced with "Equation 29".

> **Question 1:** $K_t$ in line 251 is not defined. Is it a typo for $J_t$?

**Response to Question 1:** Thank you for pointing this out. Yes, it is indeed a typo. After sampling $J_t$ from the geometric distribution, $2^{J_t}$ becomes the trajectory length (a minimal illustrative sketch of this sampling scheme is provided after our response to Question 4 below).

> **Question 2:** The formulation of (actor update) in Equation 15 seems wrong. Besides the wrong $\eta_t$, there is a further term that multiplies but already contains the TD term in its defintion. Is it correct or is there a redundant term?

**Response to Question 2:** Indeed, the $\eta_t$ should be an $\alpha_t$, and the TD term in the $\theta$ update in Eq 15 is redundant.

> **Question 3:** In Equation 53 in the Appendix, the term $T^\frac{1}{8}$ seems wrong, should it be $T^\frac{3}{4}$?

**Response to Question 3:** Thank you for the comment. We obtain $T^\frac{1}{8}$ in the first term of the RHS of Equation 53 from Equation 52 through two operations. First, by taking the square root of both sides we go from $T^\frac{1}{2} \to T^\frac{1}{4}$. Next, by multiplying both sides by $\frac{1}{\sqrt{T}}$, we arrive at $T^\frac{1}{4} \to T^\frac{1}{8}$. We can split these two operations into their own steps in the analysis for clarity.

> **Question 4:** On the experimental part, the MAC approach seems to have better performance but much higher variance than PPGAE. Can the author comment on this?

**Response to Question 4:** Thank you for the comment. We kindly refer to Section C of this document <span style="color:blue">[here](https://drive.google.com/file/d/1yh5S5FHU_-AymOwlOCW0BLbUDhN4jWcz/view?usp=sharing)</span>, where we show that the variance of MAC was in part due to the small number of trials. In the main body, for the 5x5 sparse grid, we ran each algorithm for only 5 trials. Figure 5a of Section C shows less variance in MAC's performance than Figure 5b, as 5a shows the learning curve over 100 trials while 5b uses only 20 trials. We also note that the average learning curve of MAC achieves higher success in 5a with 100 trials than in 5b with 20 trials. In comparison to PPGAE and the other baselines, we believe their lower variance across trials suggests that each baseline consistently converges to a policy that rarely reaches the goal. MAC, however, consistently finds policies that reach the goal more often.
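To make the trajectory-length sampling mentioned in our response to Question 1 concrete, below is a minimal illustrative Python sketch of an MLMC-style gradient estimator. This is not our implementation: the function names (`mlmc_gradient`, `sample_grad`) are hypothetical placeholders, and the Geom(1/2) level distribution and the exact form of the level correction are assumptions in the spirit of the MLMC estimator of [1]; only the $2^{J_t}$ rollout length and the $T_{max}$ truncation are taken from the discussion above.

```python
import numpy as np

def mlmc_gradient(sample_grad, theta, T_max, rng):
    """Illustrative MLMC-style policy-gradient estimator (sketch only, not our implementation).

    `sample_grad(theta, n)` is a hypothetical callback that rolls out n consecutive
    environment steps under the policy parameters `theta` and returns the n per-step
    gradient estimates as an (n, d) array.
    """
    # Sample the level J_t from a geometric distribution (assumed Geom(1/2) here);
    # the corresponding rollout length is 2^{J_t}.
    J = rng.geometric(p=0.5)
    n = 2 ** J
    if n > T_max:
        # Truncation: if the rollout would exceed T_max, keep only the single-sample
        # base estimate, which keeps the expected rollout length bounded.
        return sample_grad(theta, 1)[0]
    grads = sample_grad(theta, n)            # 2^J per-step gradient estimates
    g_base = grads[0]                        # level-0 estimate (first sample)
    g_full = grads.mean(axis=0)              # average over all 2^J samples
    g_half = grads[: n // 2].mean(axis=0)    # average over the first 2^{J-1} samples
    # Base level plus the reweighted difference of consecutive levels.
    return g_base + (2 ** J) * (g_full - g_half)
```

The point relevant to Question 1 is that the rollout length is the random quantity $2^{J_t}$ (capped via $T_{max}$), so no mixing-time oracle is needed to choose it.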
