## Rebuttal for ICML submission 2787: Global Optimality without Mixing Time Oracles in Average-reward RL: Multi-level Actor-Critic

| Reviewer | Before Rebuttal Score | After Rebuttal Score | Responded to our rebuttal | Comment |
|----------|-----------------------|----------------------|---------------------------|---------|
| xqbb | 5 | 6 | Yes | Appreciated our contributions |
| UhK8 | 4 | 5 | Yes | Appreciated our theoretical and experimental results and technical novelties |
| z6wk | 5 | 5 | Yes | Appreciated our contributions and believes the problem is important to the community |
| BqHm | 4 | 4 | No | Appreciated our theoretical and experimental results |

**Summary of Reviews and Response**

**General comments and Review Highlights:** We sincerely appreciate all reviewers for their valuable feedback and insightful questions. We are particularly encouraged by the following: Reviewer z6wk believes that the problem *"is well-motivated and the problem being studied is interesting and important to the community"*. Reviewers z6wk, BqHm, and xqbb appreciate the rigorous theoretical analysis provided in the paper. Furthermore, all the reviewers note that our work alleviates the restrictive oracle assumption on mixing time, and they highlight the experimental evidence showing the sample efficiency of MAC. We provided thorough responses to each reviewer's concerns during the rebuttal, which resulted in two of the reviewers who engaged with us (UhK8, xqbb) increasing their scores (final scores: 6, 5, 5, 4).

**In summary,** we aim to close the gap between theoretical analysis and practical feasibility by analyzing the global convergence of MAC. We provide the tightest known bound on mixing time while also relaxing its oracle assumption, and we analyze the algorithm with a general parameterized policy.

---

**Comment by Reviewer xqbb**

> I thank the authors for their response which clarified my doubts. Regarding Equation 53 in the Appendix, I understand that the correct passages are: apply the square root on both sides thus getting $T^{1/4}$ and then multiplying $T^{1/2}$ by which would then turn the order to $T^{1/4}* T^{1/2} = T^{3/4}$ appearing at the denominator. If indeed this passage is not as described and the dependence on time is instead of the order of $T^{1/8}$ as the authors state, then Corollary 1 will no longer be true since the dominant term with respect to time will be $T^{-1/8}$. Do the authors agree on this or am I missing something?

**Response:** We thank you for these further inquiries and apologize for the oversight. You are correct: for Equation 53, $T^{1/4} \cdot T^{1/2} = T^{3/4}$.
Thus $T^{3/4}$, rather than $T^{1/8}$, should appear in the denominator of Eq 53. We provide the corrected Equation 53 below:

\begin{align} \mathcal{O}\left({\frac{1}{\sqrt{T}}}\right)\sqrt{\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\bigg[\Vert h_t\Vert^2\bigg]} \leq \widetilde{\mathcal{O}}\left({ \frac{\sqrt{\tau_{mix} T_{\max}} \log T_{\max}}{T^{\frac{3}{4}}}}\right) + \mathcal{O}\left({\frac{\sqrt{\log(T_{max}) T_{max}} \mathcal{E}^{critic}_{app}}{\sqrt{T}}}\right). \tag{53} \end{align}

Next, we remark that the bound in Eq 54 still holds. We use Eq 26 in Lemma 4 and take the square root:

\begin{align} \frac{1}{T} \sum_{t=1}^T \mathbb{E} \left[ ||{ \nabla J(\theta_t) }||^2 \right]\leq \mathcal{O}\left({ \mathcal{E}^{critic}_{app} }\right)+\widetilde{\mathcal{O}}\left({ \frac{\tau_{mix} \log T_{\max}}{\sqrt{T}}}\right) +\widetilde{\mathcal{O}}\left({ { \frac{\tau_{mix}\log T_{\max}}{T_{\max}}} } \right). \tag{26} \end{align}

\begin{align} \sqrt{\frac{1}{T} \sum_{t=1}^T \mathbb{E} \left[ ||{ \nabla J(\theta_t) }||^2 \right]} \leq \mathcal{O}\left({ \sqrt{\mathcal{E}^{critic}_{app}} }\right) + \widetilde{\mathcal{O}}\left({ \frac{\sqrt{\tau_{mix} \log T_{\max}}}{{T^{\frac{1}{4}}}}}\right) + \widetilde{\mathcal{O}}\left({ { \frac{\sqrt{\tau_{mix}\log T_{\max}}}{\sqrt{T_{\max}}}}}\right). \tag{54} \end{align}

The global convergence of MAC depends on adding these two bounds, as shown in Equation 47, which we repeat below:

\begin{align} J^{*}-\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\Vert J(\theta_t)\Vert \leq \sqrt{\mathcal{E}^{actor}_{app}} + \mathcal{O}\left({\frac{1}{\sqrt{T}}}\right)\sqrt{\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\bigg[\| h_t\|^2\bigg]} + \sqrt{\frac{1}{T} \sum_{t=1}^T \mathbb{E} \left[ \|{ \nabla J(\theta_t) }\|^2 \right]} \tag{47} \end{align}

Plugging Eq 54 and the corrected Eq 53 into Eq 47 leads to the overall corrected final bound:

\begin{align} J^{*}-\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\Vert J(\theta_t)\Vert \leq&\ \sqrt{\mathcal{E}^{actor}_{app}}+ \widetilde{\mathcal{O}}\left({ \frac{\sqrt{\tau_{mix} T_{\max}} \log T_{\max}}{T^{\frac{3}{4}}}}\right) + \mathcal{O}\left({\frac{\sqrt{\log(T_{max}) T_{max}} \mathcal{E}^{critic}_{app}}{\sqrt{T}}}\right) \\ &+ \widetilde{\mathcal{O}}\left({ \frac{\sqrt{\tau_{mix} \log T_{\max}}}{{T^{\frac{1}{4}}}}}\right) + \widetilde{\mathcal{O}}\left({ { \frac{\sqrt{\tau_{mix}\log T_{\max}}}{\sqrt{T_{\max}}}}}\right). \end{align}

In the second term on the right-hand side of the above inequality, we now have $T^{-3/4}$ instead of $T^{-1/8}$; we note that $T^{-1/4}$ remains the dominating term, matching the convergence rate of PPGAE in [2]. We thank the reviewer for pointing out this mistake and helping us improve our bound.

***In regards to the inquiry about Corollary 1***, after the correction to Equation 53 above, the bound is still $\mathcal{O}\left(T^{-1/4}\right)$, as the additional assumptions $\mathcal{E}(t) = \mathcal{E_{app}^{critic}}(t) = 0$ do not affect the bound of Equation 54. We are happy to provide additional details if required.
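As a quick sanity check on the corrected Eq 53, the exponent arithmetic for its first term works out as follows: the $\widetilde{\mathcal{O}}\big(\tau_{mix}(\log T_{\max})^2 T_{\max}/T^{1/2}\big)$ contribution to $\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\big[\Vert h_t\Vert^2\big]$ (which comes from bounding $\mathcal{E}(t)$ in Eq 50) contributes $T^{1/4}$ under the square root, and the $\mathcal{O}(1/\sqrt{T})$ prefactor supplies the remaining $T^{1/2}$:

\begin{align} \mathcal{O}\left(\frac{1}{\sqrt{T}}\right)\sqrt{\widetilde{\mathcal{O}}\left(\frac{\tau_{mix}(\log T_{\max})^2 T_{\max}}{T^{\frac{1}{2}}}\right)} = \widetilde{\mathcal{O}}\left(\frac{\sqrt{\tau_{mix} T_{\max}}\,\log T_{\max}}{T^{\frac{1}{2}}\, T^{\frac{1}{4}}}\right) = \widetilde{\mathcal{O}}\left(\frac{\sqrt{\tau_{mix} T_{\max}}\,\log T_{\max}}{T^{\frac{3}{4}}}\right). \end{align}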
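In brief, regarding Corollary 1: under its assumptions $\mathcal{E}(t) = \mathcal{E}^{critic}_{app}(t) = 0$, the $T_{\max}$-dependent contributions to $\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\big[\Vert h_t\Vert^2\big]$ vanish and the bound reduces to $\widetilde{\mathcal{O}}(\tau_{mix}\log T_{\max})$, so the second term of Eq 47 becomes

\begin{align} \mathcal{O}\left(\frac{1}{\sqrt{T}}\right)\sqrt{\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\bigg[\Vert h_t\Vert^2\bigg]} \leq \mathcal{O}\left(\frac{1}{\sqrt{T}}\right)\sqrt{\widetilde{\mathcal{O}}\left(\tau_{mix}\log T_{\max}\right)} = \widetilde{\mathcal{O}}\left(\sqrt{\frac{\tau_{mix}\log T_{\max}}{T}}\right), \end{align}

and the dominating term is still the $T^{-1/4}$ term coming from Eq 54.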
---

**Summary of core contributions:** **We take this opportunity to restate and clarify the contributions of our work. Our major focus is to close the gap between theoretical analysis and practical algorithms for average-reward reinforcement learning.** We note that for any theoretical analysis of an algorithm to be closer to practical settings, it should satisfy the following criteria:

**Global Optimality** - It is important to establish the global optimality of an algorithm because establishing only first-order stationary convergence may lead to suboptimal solutions in terms of practical relevance, essentially because the algorithm may converge to saddle points.
**Tightest Dependence on Mixing Time** - An algorithm should converge in a sample-efficient manner to mitigate the cost of data collection. Thus, improving the sample complexity bound with respect to the mixing time dependence is crucial for average-reward RL.

**Mixing Time Assumptions Removed** - Relaxing the oracle assumption on mixing time, which in practice is hard to calculate or estimate, makes implementing an algorithm more feasible.

**Practical Trajectory Length** - Trajectory length is a crucial aspect of RL algorithms, as it is the number of continuous samples processed for a gradient update. To implement an RL algorithm, a practical minimum trajectory length is needed, which, as we will see, is not provided in prior works such as Bai et al. 2023.

**General Parameterized Policy** - General policy parameterization, rather than linear or tabular parameterization, makes the analysis more applicable to real-world RL implementations that rely on neural network parameterizations for deep RL.

This table highlights the contributions of our work compared to prior work on global optimality (Bai et al.) and mixing time (Dorfman et al. 2022, Suttle et al. 2023) with regard to the above five criteria.

| Ref | Global Optimality | Tightest Known Dependence on Mixing Time | Mixing Time Assumptions Removed | Practical Trajectory Length | General Parameterized Policy |
|:-------------------:|:-----------------:|:-----------------------------------:|:---------------------------------:|:----------------------------:|:--------------------------------:|
| Dorfman et al. 2022 | <span style="color:red">No</span> | <span style="color:red">No</span> | <span style="color:green">Yes</span> | <span style="color:green">Yes</span> | <span style="color:red">No</span> |
| Suttle et al. 2023 | <span style="color:red">No</span> | <span style="color:red">No</span> | <span style="color:green">Yes</span> | <span style="color:green">Yes</span> | <span style="color:green">Yes</span> |
| Bai et al. 2023 | <span style="color:green">Yes</span> | <span style="color:red">No</span> | <span style="color:red">No</span> | <span style="color:red">No</span> | <span style="color:green">Yes</span> |
| **This Work** | <span style="color:green">**Yes**</span> | <span style="color:green">**Yes**</span> | <span style="color:green">**Yes**</span> | <span style="color:green">**Yes**</span> | <span style="color:green">**Yes**</span> |

We also provide additional experiments with larger scale and complexity, and comparisons to more baselines as suggested by the reviewers, in this document <span style="color:blue">[here](https://drive.google.com/file/d/1yh5S5FHU_-AymOwlOCW0BLbUDhN4jWcz/view?usp=sharing)</span>.

-----------------------------------------------------------------

## Response to Reviewer z6Wk [Score 5, confidence 3]

We are thankful to the reviewer for dedicating their valuable time and effort to evaluating our work, which has allowed us to strengthen the paper. We respond to the reviewer's inquiries in detail as follows.

> **Weakness 1:** The comparison of empirical studies can be further improved.
> Expanding the experimental analysis to include comparisons with additional existing algorithms would provide a more comprehensive assessment of MAC's performance and generalizability.

**Response to Weakness 1:** ***We have now added additional experiments (larger scale with more baselines).*** Thank you for the comment. As the reviewer suggested, we kindly refer you to the document <span style="color:blue">[here](https://drive.google.com/file/d/1yh5S5FHU_-AymOwlOCW0BLbUDhN4jWcz/view?usp=sharing)</span>, where we provide additional experiments comparing MAC's performance with additional existing algorithms such as Vanilla Actor-Critic (VAC) and REINFORCE. To expand upon the experimental validation, we have now added two experiments with larger grid sizes (10x10 and 15x15) in Section A. We also provide, as suggested by Reviewer BqHm, additional experiments in Section B comparing MAC against these baselines in more complicated environments. We remark that in the additional experiments, we consistently show a higher success rate for MAC compared to the new baselines with fixed trajectory lengths. These results reinforce the need for a trajectory length scheme that calibrates for mixing time to help modulate the noisy gradients introduced by the burn-in samples processed before reaching the stationary distribution.

## Response to Reviewer UhK8 [Score 4, confidence 2]

We express our gratitude to the reviewer for taking the time to review our manuscript. We sincerely appreciate the feedback and provide detailed responses to all questions below. Throughout our response we would like to label these references for convenience:

[1] Suttle et al., Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level Monte Carlo actor-critic. ICML, 2023.

[2] Bai, Q., Mondal, W. U., and Aggarwal, V. Regret analysis of policy gradient algorithm for infinite horizon average reward Markov decision processes. In AAAI Conference on Artificial Intelligence, 2024.

> **Weakness/Question 1:** The difference from previous work [1] is less clear, and therefore the significance of contribution is less clear.

**Response to Weakness 1:** We thank the reviewer for the comment and apologize if the contributions were not clear from the current write-up.
We take this opportunity to clarify our contributions and highlight the differences from [1] as follows.

1. **We provide global convergence results, while [1] has only local convergence results.** While [1] introduces MAC and establishes its convergence to first-order stationarity, we establish its convergence to global optimality.

2. **Improvement in state-of-the-art convergence rates.** In addition to establishing global optimality, our results improve on those in [2] by ***achieving the tightest known dependence on mixing time while removing the restrictive assumption of requiring oracle knowledge of the mixing time***. This relaxation is a first for the literature on policy gradient methods for average-reward MDPs, as prior works (e.g., [2]) rely on oracle knowledge of the mixing time for gradient estimation. By capitalizing on MAC's multi-level gradient estimator, however, we show that this reliance is not necessary for global convergence.

3. **We have corrected the proofs of Lemma 3 and 4 in [1].** Lemma 3 and Lemma 4 of our manuscript correspond to Theorems 4.7 and 4.8 in [1]. As pointed out in the comment following Eq. (26) in our paper, the proofs provided for Lemmas 3 and 4 in [1] are incorrect, so we cannot directly utilize the analysis developed in [1]. The proofs in [1] lead to bounds that are looser in terms of $T_{max}$ than what is stated in Lemmas 3 and 4; the looser bounds both arise from how the error in reward tracking is bounded in Theorem D.1 of [1]. Without correcting the proof to align with the stated bounds, we cannot recover the $O(T^{-\frac{1}{4}})$ rate in Corollary 1, as PPGAE does in [2]. Hence, we redid the analysis and provide it in Appendix D.

4. **Closing the Gap between Theory and Practice: first work to alleviate the mixing time oracle assumption.** One major contribution of our work is removing the requirement of a mixing time oracle, which is a limitation of every other existing method/analysis in the literature. Our core technical insight is to utilize the idea of Multi-Level Monte Carlo (MLMC) gradient estimation, which does not require knowing the mixing time in advance. This results in the MAC algorithm, which is more practical than the existing state-of-the-art PPGAE algorithm because PPGAE's advantage estimation for the policy gradient calculation requires the mixing time.
5. **Discussion on Trajectory Length Feasibility:** We also provide a discussion on the practicality of the number of samples needed to implement MAC and compare it against the infeasible trajectory length requirement of PPGAE in Section 4.

6. **Detailed experimental comparison.** We would like to highlight that the existing literature on average-reward global convergence ([2]) does not include detailed empirical evidence to support its claims. We included empirical evidence and have now added additional experiments in this document <span style="color:blue">[here](https://drive.google.com/file/d/1yh5S5FHU_-AymOwlOCW0BLbUDhN4jWcz/view?usp=sharing)</span> to compare with baselines such as Vanilla Actor-Critic and REINFORCE.

> **Question 1:** What is the technical novelty in the analysis of global convergence?

**Response to Question 1:** Thank you for the question and for giving us the opportunity to clarify. We highlight the technical novelties of our work as follows.

### Technical Novelties

In terms of technical contribution to the analysis, we remark that we adapt the analysis of [2] to alleviate any requirement on the mixing time of the Markov process, which requires an adaptive trajectory length and an attenuating step size. We utilize this setting in Equation 29 in Lemma 6 of Section 4.2 as our method for analyzing the global optimality of MAC. We will make this clearer in the list of contributions in the introduction. Our Equation 29 corresponds to Equation 28 in Lemma 5 of [2], which assumes a constant stepsize $\alpha$. Analyzing PPGAE from [2] within this adapted framework in Eq 29 raises two issues:
* **The existing literature relies on a constant step size:** The third term of the framework in [2] involves the squared norm of the estimated gradient, $||w_k||^2$. Generally, during training this squared norm should approach zero as the gradients converge. [2] is able to use their Lemma 3 to bound this term; however, this assumes that the step size is constant. A constant stepsize leads to high variance and sometimes no convergence, as it causes overstepping during the optimization process, so a diminishing stepsize is more practical to combat these issues. In this work, by capitalizing on the Adagrad stepsize of MAC, we can use our Equation 27 in Lemma 5 to bound the summation, as we do in Eq 32 on line 343, repeated below:
$\frac{1}{2}\sum_{t=1}^{T}\alpha_t\Vert h_t\Vert^2 \leq \sqrt{\sum_{t=1}^{T}\Vert h_t\Vert^2}.$
We can then bound the RHS of the above inequality by Eq 23 in Lemma 2 on line 280.

* **Higher Variance due to Uniform Weighting of Update Directions:** Another issue arises from the KL divergence between the optimal policy $\pi^*$ and the policy at update $k$, $\pi_k$. In the framework in Lemma 5 of [2], the difference between $KL(\pi^* | \pi_k)$ and $KL(\pi^* | \pi_{k+1})$ is used to measure the difference between consecutive update directions. In [2], this difference, $KL(\pi^* | \pi_k) - KL(\pi^* | \pi_{k+1})$, is given the same weighting, $\frac{1}{\alpha}$, for all $k$ in their Eq 28. Intuitively, this means that the policy update direction at the beginning of training, when you are farther away from $\pi^*$, is treated as being as important as when you are approaching $\pi^*$. Generally, due to the higher amount of uncertainty in the model at the beginning of training, this equal weighting assumption is not practical and leads to higher variance. We address this issue in our work by utilizing multi-level Monte Carlo updates, as shown in the bound displayed after this list.
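Concretely (line 348 of the manuscript), since the Adagrad stepsizes satisfy $\alpha_T \leq \alpha_t$, the resulting non-telescoping KL divergence summation can be bounded as

\begin{align} \frac{1}{T}\sum_{t=1}^{T}\frac{1}{\alpha_t}\mathbf{E}_{s\sim d^{\pi^*}}\zeta_t \leq \frac{1}{T}\sum_{t=1}^{T}\frac{1}{\alpha_T}\mathbf{E}_{s\sim d^{\pi^*}}\zeta_t = \frac{\mathbf{E}_{s\sim d^{\pi^*}}\big[KL(\pi^*(\cdot\vert s)\Vert\pi_{\theta_1}(\cdot\vert s))\big]}{T\alpha_T}, \end{align}

where $\alpha_T = \frac{\alpha_T'}{\sqrt{\sum_{t=1}^{T}\Vert h_t\Vert^2}}$ as in Equation 33.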
**Furthermore, the utilization of Lemma 5 for Theorem 1 in [2] is not applicable to using our Lemma 6 for our Theorem 1:** We would also like to highlight that we could **not** apply Lemma 6 in a manner analogous to how [2] used their Lemma 5 for PPGAE, due to a difference between the established preliminary lemmas for PPGAE and MAC. For PPGAE, [2] provide Lemma 3, which establishes a bound on the gradient estimation error, $\mathbf{E}\left[||\omega_k-\nabla_{\theta}J(\theta_k)||^2\right]$, where $w_k$ is the policy gradient estimator at update $k$ in [2]. In their work, they bound the difference between the estimated gradient and the optimal estimate, $||w_k - w_k^*||$, and the norm-squared of the estimated gradient, $||w_k||^2$, on the RHS of Eq 28 in terms of $\mathbf{E}\left[||\omega_k-\nabla_{\theta}J(\theta_k)||^2\right]$. They then utilize Lemma 3 to obtain the final global convergence bound for PPGAE. From [1], we do not have such a bound for the gradient estimation of the MLMC estimator of MAC. Rather, we have bounds on the norm-squared of the MLMC policy gradient estimator, $||h^{MLMC}_t||^2$, and on the convergence rate of the policy gradient, $||\nabla_{\theta}J(\theta_t)||^2$, stated in Lemma 2 and Lemma 4, respectively, of our manuscript. To utilize these lemmas, we thus provide an upper bound on the difference between the estimated MLMC gradient and the optimal MLMC estimate, $||h_t^{MLMC} - h_t^{*MLMC}||$, in terms of $||h^{MLMC}_t||^2$ and $||\nabla_{\theta}J(\theta_t)||^2$, shown in Eq 31, with more details provided in our full proof of Theorem 1 in Appendix B. Using this novel bound in Eq 31, we can then use Lemmas 2 and 4 to establish the global optimality of MAC.

------------------------------------------

## Response to Reviewer BqHm [Score 4, confidence 4]

We thank the reviewer for their time in reviewing our manuscript. Throughout our response we would like to label these references for convenience:

[1] Suttle et al., Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level Monte Carlo actor-critic. ICML, 2023.

[2] Bai, Q., Mondal, W. U., and Aggarwal, V. Regret analysis of policy gradient algorithm for infinite horizon average reward Markov decision processes. In AAAI Conference on Artificial Intelligence, 2024.

[3] Dorfman, R. and Levy, K. Y. Adapting to mixing time in stochastic optimization with Markovian data. In Proceedings of the 39th International Conference on Machine Learning, Jul 2022.

[4] Liu, Y., Zhang, K., Basar, T., and Yin, W. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33:7624–7636, 2020.

> **Weakness 1:** The authors did not raise any new problems or methods in this paper, making it appear as just a supplement to the MAC method.
> To be specific, some important lemmas listed in this paper, such as Lemma 3 and Lemma 4, have been raised or are only some incremental for the results in previous works [Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic, Adapting to mixing time in stochastic optimization with Markovian data] and the proof process of this paper's main theoretical result is a commonly used convergence proof method in optimization problems, which lacks innovation somehow.

**Response to Weakness 1:** We thank you for the comment. We apologize if our framing gives the impression that we simply merged the concepts of the two papers. Below we highlight more clearly the problem we raise, our contributions, and the technical novelties needed to achieve our results.

### New Problem We Raise

**We raise the need to ground theoretical global convergence guarantees in practicality for average-reward RL**: Although [1] established the first-order stationary convergence of MAC, it is not trivial to simply derive global convergence, especially in the average-reward setting. Most prior literature on global convergence is for discounted rewards and relies on a gradient domination assumption that turns the optimization into a strongly convex problem. However, gradient domination is not available in the average-reward setting, so we cannot rely on similar machinery. Recently, [4] provided a general framework in the discounted setting that allows one to prove global convergence from the first-order stationary convergence of a PG algorithm, and [2] adapted it to the average-reward MDP and introduced PPGAE to establish the first average-reward global convergence result with general parameterization. **However, [2] makes restrictive assumptions in terms of an oracle mixing time and trajectory length selection that render their global convergence analysis impractical for real-world use. We aim to close this gap by proving the global convergence of MAC.** More details on how we do so can be found under the technical novelties below.

### Contributions
1. **We provide global convergence results, while [1] has only local convergence results.** While [1] introduces MAC and establishes its convergence to first-order stationarity, we establish its convergence to global optimality.

2. **Improvement in state-of-the-art convergence rates.** In addition to establishing global optimality, our results improve on those in [2] by ***achieving the tightest known dependence on mixing time while removing the restrictive assumption of requiring oracle knowledge of the mixing time***. This relaxation is a first for the literature on policy gradient methods for average-reward MDPs, as prior works (e.g., [2]) rely on oracle knowledge of the mixing time for gradient estimation. By capitalizing on MAC's multi-level gradient estimator, however, we show that this reliance is not necessary for global convergence.

3. **We have corrected the proofs of Lemma 3 and 4 in [1].** Lemma 3 and Lemma 4 of our manuscript correspond to Theorems 4.7 and 4.8 in [1]. As pointed out in the comment following Eq. (26) in our paper, the proofs provided for Lemmas 3 and 4 in [1] are incorrect, so we cannot directly utilize the analysis developed in [1]. The proofs in [1] lead to bounds that are looser in terms of $T_{max}$ than what is stated in Lemmas 3 and 4; the looser bounds both arise from how the error in reward tracking is bounded in Theorem D.1 of [1]. Without correcting the proof to align with the stated bounds, we cannot recover the $O(T^{-\frac{1}{4}})$ rate in Corollary 1, as PPGAE does in [2]. Hence, we redid the analysis and provide it in Appendix D.

4. **Closing the Gap between Theory and Practice: first work to alleviate the mixing time oracle assumption.** One major contribution of our work is removing the requirement of a mixing time oracle, which is a limitation of every other existing method/analysis in the literature. Our core technical insight is to utilize the idea of Multi-Level Monte Carlo (MLMC) gradient estimation, which does not require knowing the mixing time in advance. This results in the MAC algorithm, which is more practical than the existing state-of-the-art PPGAE algorithm because PPGAE's advantage estimation for the policy gradient calculation requires the mixing time.
5. **Discussion on Trajectory Length Feasibility:** We also provide a discussion on the practicality of the number of samples needed to implement MAC and compare it against the infeasible trajectory length requirement of PPGAE in Section 4.

6. **Detailed experimental comparison.** We would like to highlight that the existing literature on average-reward global convergence ([2]) does not include detailed empirical evidence to support its claims. We included empirical evidence and have now added additional experiments in this document <span style="color:blue">[here](https://drive.google.com/file/d/1yh5S5FHU_-AymOwlOCW0BLbUDhN4jWcz/view?usp=sharing)</span> to compare with baselines such as Vanilla Actor-Critic and REINFORCE.

### Technical Novelties

In terms of technical contribution to the analysis, we remark that we adapt the analysis of [2] to alleviate any requirement on the mixing time of the Markov process, which requires an adaptive trajectory length and an attenuating step size. We utilize this setting in Equation 29 in Lemma 6 of Section 4.2 as our method for analyzing the global optimality of MAC. We will make this clearer in the list of contributions in the introduction. Our Equation 29 corresponds to Equation 28 in Lemma 5 of [2], which assumes a constant stepsize $\alpha$. Analyzing PPGAE from [2] within this adapted framework in Eq 29 raises two issues:

* **The existing literature relies on a constant step size:** The third term of the framework in [2] involves the squared norm of the estimated gradient, $||w_k||^2$. Generally, during training this squared norm should approach zero as the gradients converge. [2] is able to use their Lemma 3 to bound this term; however, this assumes that the step size is constant. A constant stepsize leads to high variance and sometimes no convergence, as it causes overstepping during the optimization process, so a diminishing stepsize is more practical to combat these issues. In this work, by capitalizing on the Adagrad stepsize of MAC, we can use our Equation 27 in Lemma 5 to bound the summation, as we do in Eq 32 on line 343, repeated below:
$\frac{1}{2}\sum_{t=1}^{T}\alpha_t\Vert h_t\Vert^2 \leq \sqrt{\sum_{t=1}^{T}\Vert h_t\Vert^2}.$
We can then bound the RHS of the above inequality by Eq 23 in Lemma 2 on line 280.

* **Higher Variance due to Uniform Weighting of Update Directions:** Another issue arises from the KL divergence between the optimal policy $\pi^*$ and the policy at update $k$, $\pi_k$. In the framework in Lemma 5 of [2], the difference between $KL(\pi^* | \pi_k)$ and $KL(\pi^* | \pi_{k+1})$ is used to measure the difference between consecutive update directions. In [2], this difference, $KL(\pi^* | \pi_k) - KL(\pi^* | \pi_{k+1})$, is given the same weighting, $\frac{1}{\alpha}$, for all $k$ in their Eq 28. Intuitively, this means that the policy update direction at the beginning of training, when you are farther away from $\pi^*$, is treated as being as important as when you are approaching $\pi^*$.
Generally, due to the higher amount of uncertainty in the model at the beginning of training, this equal weighting assumption is not practical and can lead to higher variance. Furthermore, when summing their Eq 28 over all $k$, the constant $\frac{1}{\alpha}$ can be taken out of the summation; the summation is thus telescoping and can be simplified to just $KL(\pi^* | \pi_{1})$, which is a constant that can be disregarded in their global convergence analysis. However, with a non-constant stepsize, our KL divergence summation is not a telescoping sum and cannot be simplified algebraically to a constant as in [2]. Thus, further mechanics, which PPGAE lacks, are needed to accommodate the non-telescoping KL divergence summation. In our work, on line 348, we use the fact that in Adagrad $\alpha_T \leq \alpha_t$ to bound this summation and reduce variance.

**Furthermore, the utilization of Lemma 5 for Theorem 1 in [2] is not applicable to using our Lemma 6 for our Theorem 1:** We would also like to highlight that we could **not** apply Lemma 6 in a manner analogous to how [2] used their Lemma 5 for PPGAE, due to a difference between the established preliminary lemmas for PPGAE and MAC. For PPGAE, [2] provide Lemma 3, which establishes a bound on the gradient estimation error, $\mathbf{E}\left[||\omega_k-\nabla_{\theta}J(\theta_k)||^2\right]$, where $w_k$ is the policy gradient estimator at update $k$ in [2]. In their work, they bound the difference between the estimated gradient and the optimal estimate, $||w_k - w_k^*||$, and the norm-squared of the estimated gradient, $||w_k||^2$, on the RHS of Eq 28 in terms of $\mathbf{E}\left[||\omega_k-\nabla_{\theta}J(\theta_k)||^2\right]$. They then utilize Lemma 3 to obtain the final global convergence bound for PPGAE. From [1], we do not have such a bound for the gradient estimation of the MLMC estimator of MAC. Rather, we have bounds on the norm-squared of the MLMC policy gradient estimator, $||h^{MLMC}_t||^2$, and on the convergence rate of the policy gradient, $||\nabla_{\theta}J(\theta_t)||^2$, stated in Lemma 2 and Lemma 4, respectively, of our manuscript. To utilize these lemmas, we thus provide an upper bound on the difference between the estimated MLMC gradient and the optimal MLMC estimate, $||h_t^{MLMC} - h_t^{*MLMC}||$, in terms of $||h^{MLMC}_t||^2$ and $||\nabla_{\theta}J(\theta_t)||^2$, shown in Eq 31, with more details provided in our full proof of Theorem 1 in Appendix B. Using this novel bound in Eq 31, we can then use Lemmas 2 and 4 to establish the global optimality of MAC.

> **Weakness 2:** The mixing use of symbols for update times T and trajectory length T leads to ambiguity in the paper (for example, in Theorem 1).
**Response to Weakness 2:** Thank you for pointing out this ambiguity. We will update the manuscript so that the update is indexed by $K$ to remove the confusion with the trajectory length $T$.

> **Weakness 3:** The bound shown in Theorem 1 could be very loose when selecting a very large $T_{max}$.

**Response to Weakness 3:** Thank you for the comment. When $T$ is large, the middle two terms in Theorem 1 become negligible compared to the last term, which itself also shrinks as $T_{max}$ grows, reducing the bias.

> **Weakness 4:** Too many assumptions are made. Except the 4 Assumptions explicitly listed, there are at least two additional assumptions are made in the Lemmas' statements.

**Response to Weakness 4:** Thank you for raising these concerns. The assumptions in Lemma 4, that $J(\theta)$ is $L$-smooth and $\sup_{\theta} | J(\theta) | \leq M$, are actually consequences of Assumption 2. In regards to Lemma 3, we believe that the use of "Assume" when setting the stepsize is not the most appropriate term, and we will change it to "consider".

> **Weakness 5:** The experimental section is too simplistic and additional experiments should be provided to validate the viewpoints presented in the paper.

**Response to Weakness 5:** ***We have now added additional experiments (larger scale with more baselines).*** Thank you for the comment. We kindly refer you to the document <span style="color:blue">[here](https://drive.google.com/file/d/1yh5S5FHU_-AymOwlOCW0BLbUDhN4jWcz/view?usp=sharing)</span>, where we provide additional experiments comparing MAC's performance with additional existing algorithms such as Vanilla Actor-Critic (VAC) and REINFORCE. To expand upon the experimental validation, we have now added two experiments with larger grid sizes (10x10 and 15x15) in Section A. We also provide additional experiments in Section B comparing MAC against these baselines in more complicated environments. We remark that in the additional experiments, we consistently show a higher success rate for MAC compared to the new baselines with fixed trajectory lengths. These results reinforce the need for a trajectory length scheme that calibrates for mixing time to help modulate the noisy gradients introduced by the burn-in samples processed before reaching the stationary distribution.
However, the Chernoff bound to hold in Equation 28, it requires the sum of iid random variables from the distribution m(s) or h(s) which violates in the non-iid case, since in that case the samples are correlated. For example refer to the definition of Chernoff Bound in Theorem 4 [2] where the probability is defined over the sum of random variables. Hence, we cannot directly use Equation 28 and need to leverage a different form as shown in Equation 36 in the Proof of Theorem 2. -->

## Response to Reviewer xqbb [Score 5, confidence 2]

We thank the reviewer for their valuable time in assessing our manuscript and for recognizing its novel contributions. We have addressed the reviewer's queries in the responses below. Throughout our response, we use the following reference labels for convenience:

[1] Suttle et al. Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level Monte Carlo actor-critic. ICML, 2023.

[2] Bai, Q., Mondal, W. U., and Aggarwal, V. Regret analysis of policy gradient algorithm for infinite horizon average reward Markov decision processes. In AAAI Conference on Artificial Intelligence, 2024.

> **Weakness 1:** The main contribution of the work is to extend the theoretical results used for PPGAE to the MAC algorithm. From the other side, this can also be seen as a weakness as the reported results are obtained without large modifications from the original ones appearing in the two works related to the MAC algorithm ([Suttle et al. 2023]) and the PPGAE algorithm (Bai et al. 2024).

**Response to Weakness 1:** We thank the reviewer for the comment and take this opportunity to clarify our contributions, highlighting the differences from [1] and [2] and our technical novelties as follows.

### Contributions

1. **We provide global convergence results while [1] has only local convergence results.** While [1] introduces MAC and establishes its convergence to first-order stationarity, we establish its convergence to global optimality.

2. **Improvement over state-of-the-art convergence rates.** In addition to establishing global optimality, our results improve on those in [2] by ***achieving the tightest known dependence on mixing time while removing the restrictive assumption of requiring oracle knowledge of mixing time***. This relaxation is a first for the literature on policy gradient methods for average-reward MDPs, as prior works (e.g., [2]) rely on oracle knowledge of mixing time for gradient estimation. By capitalizing on MAC's multi-level gradient estimator, however, we show that this reliance is not necessary for global convergence.

<!-- The error leads to bounds that are looser in terms of $T_{max}$ than what is stated in Lemma 3 and 4. Without correcting the proof to align with the stated bounds we can not recover in Corollary 1 the $O(T^{-\frac{1}{4}})$ as PPGAE does in [2]. -->

3. **We have corrected the proofs of Lemma 3 and 4 in [1].** Lemma 3 and Lemma 4 of our manuscript correspond to Theorems 4.7 and 4.8 in [1]. As pointed out in the comment following eq. (26) in our paper, the proofs provided for Lemma 3 and 4 in [1] are incorrect. Hence, we cannot directly utilize the analysis developed in [1]. The proofs in [1] lead to bounds that are looser in terms of $T_{max}$ than what is stated in Lemmas 3 and 4. Both looser bounds arise from how the error in reward tracking is bounded in Theorem D.1 of [1].
Without correcting the proofs to align with the stated bounds, we cannot recover the $O(T^{-\frac{1}{4}})$ rate in Corollary 1 that PPGAE achieves in [2]. Hence, we redid the analysis to ensure the stated bounds hold and provide it in Appendix D.

<!-- 3. **We have corrected the Proof of Lemma 3 and 4 in [1].** As pointed out in the comment following eq. (26) in our paper, the proofs provided of Lemma 3 and 4 of [1] are incorrect. Hence, we cannot directly utilize the analysis developed in [1]. We note that both the proofs of Lemma 3 and 4 rely on the reward tracking convergence established in Theorem D.1 in [1] which contains a term $\tilde{O}{\left( \sqrt{ \frac{\tau_{mix}\log T_{\max}}{T_{\max}}} \right)}$ that should actually absorb the $\tilde{O}{\left( { \frac{\tau_{mix}\log T_{\max}}{T_{\max}}} \right)}$ in Lemma 3 and 4. In the present work, we have corrected the proof of Theorem D.1 provided in [1] to remove the square root and have redone the corresponding portion of the analysis. More details can be found in Appendix D. We also remark that in the main body we refer to Theorem D.1 as Lemma D.1 and we will correct it. -->

4. **Closing the Gap between Theory and Practice: first work to alleviate the mixing time oracle assumption.** One major contribution of our work is removing the requirement of a mixing time oracle, which is a limitation of every other existing method/analysis in the literature. Our core technical insight is to utilize the idea of Multi-Level Monte Carlo (MLMC) gradient estimation, which does not require knowing the mixing time in advance. This makes MAC a more practical algorithm than the existing state-of-the-art PPGAE algorithm, because PPGAE's advantage estimation for the policy gradient calculation requires knowledge of the mixing time.

<!-- Due to MAC being a variant of actor-critic, we do not require an advantage estimation algorithm and can instead rely on the learnable critic function for the policy gradient calculation. To utilize a critic function in the global convergence analysis, we also incorporated the convergence rate of the MLMC critic function, given in Lemma 3, in our analysis. -->

<!-- [**[MENTION HERE THE CORE TECHNICAL IDEA BY WHICH WE ARE ABLE TO DO THAT]**] -->

5. **Discussion on Trajectory Length Feasibility:** We also provide a discussion on the practicality of the number of samples needed to implement MAC and compare it against the infeasible trajectory length requirement of PPGAE in Section 4.

6. **Detailed experimental comparison.** We would like to highlight that the existing literature on average-reward global convergence ([2]) does not include detailed empirical evidence to support its claims. We include empirical evidence and have now added additional experiments in this document <span style="color:blue">[here](https://drive.google.com/file/d/1yh5S5FHU_-AymOwlOCW0BLbUDhN4jWcz/view?usp=sharing)</span> comparing against different baselines such as Vanilla Actor-Critic and REINFORCE.

### Technical Novelties

In terms of the technical contribution to the analysis, we remark that we adapt the analysis of [2] to alleviate any requirement on the mixing time of the Markov process, which requires an adaptive trajectory length and an attenuating stepsize. We utilize this setting in Equation 29 of Lemma 6 in Section 4.2 as our method of analyzing the global optimality of MAC. We will make this clearer in the list of contributions in the introduction. Our Equation 29 corresponds to Equation 28 in Lemma 5 of [2], which assumes a constant stepsize $\alpha$.
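To make the role of the constant stepsize concrete, here is a brief worked sketch of the standard telescoping argument, stated only in terms of quantities already introduced above (an illustration, not a restatement of the exact bound in [2]). With a fixed weight $\frac{1}{\alpha}$, the KL-difference terms cancel when summed over the $K$ updates:

$$\sum_{k=1}^{K} \frac{1}{\alpha}\left( KL(\pi^* | \pi_k) - KL(\pi^* | \pi_{k+1}) \right) = \frac{1}{\alpha}\left( KL(\pi^* | \pi_1) - KL(\pi^* | \pi_{K+1}) \right) \leq \frac{1}{\alpha} KL(\pi^* | \pi_1),$$

where the last step uses the non-negativity of the KL divergence. With a time-varying stepsize $\alpha_t$, the weights $\frac{1}{\alpha_t}$ differ across terms and no such cancellation occurs; this is exactly the issue discussed in the second point below, where we instead rely on the Adagrad property $\alpha_T \leq \alpha_t$.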
Analyzing PPGAE from [2] with this adapted framework in Eq 29 raises two issues:

* **The existing literature relies on a constant stepsize:** The third term of the framework in [2] is in terms of the squared norm of the estimated gradient, $||w_k||^2$. Generally, during training the squared norm should approach zero as the gradients converge. [2] is able to use their Lemma 3 to bound this term; however, this assumes a constant stepsize. A constant stepsize can lead to high variance and sometimes non-convergence, as it causes overstepping during optimization, so a diminishing stepsize is more practical. In this work, by capitalizing on the Adagrad stepsize of MAC, we can use our Equation 27 in Lemma 5 to bound the summation, as we do in Eq 32 on line 343, repeated below:
$$\frac{1}{2}\sum_{t=1}^{T}\alpha_t\Vert h_t\Vert^2 \leq \sqrt{\sum_{t=1}^{T}\Vert h_t\Vert^2}.$$
We can then bound the RHS of the above inequality using Eq 23 in Lemma 2 on line 280.

* **Higher Variance due to Uniform Weighting of Update Directions:** Another issue arises from the KL divergence between the optimal policy $\pi^*$ and the policy at update $k$, $\pi_k$. In the framework in Lemma 5 of [2], the difference between $KL(\pi^* | \pi_k)$ and $KL(\pi^* | \pi_{k+1})$ is used to measure the difference between consecutive update directions. In [2], this difference, $KL(\pi^* | \pi_k) - KL(\pi^* | \pi_{k+1})$, is given the same weighting, $\frac{1}{\alpha}$, for all $k$ in their Eq 28. Intuitively, this means that the policy update direction at the beginning of training, when the policy is far from $\pi^*$, is treated as equally important as when the policy is approaching $\pi^*$. Generally, due to the higher model uncertainty present at the beginning of training, this equal weighting is not practical and can lead to higher variance. Furthermore, when summing their Eq 28 over all $k$, the constant $\frac{1}{\alpha}$ can be taken out of the summation. The summation is thus telescoping and simplifies to just $KL(\pi^* | \pi_{1})$, a constant that can be disregarded in their global convergence analysis. However, with a non-constant stepsize, our KL divergence summation is not a telescoping sum and cannot be simplified algebraically to a constant as in [2]. Thus, further machinery, which the PPGAE analysis lacks, is needed to accommodate the non-telescoping KL divergence summation. In our work, on line 348, we use the fact that in Adagrad $\alpha_T \leq \alpha_t$ to bound this summation and reduce variance.

**Furthermore, the way Lemma 5 is utilized for Theorem 1 in [2] is not applicable when using our Lemma 6 for our Theorem 1:** We would also like to highlight that we could **not** apply Lemma 6 in a manner analogous to how [2] used their Lemma 5 for PPGAE, due to a difference between the established preliminary lemmas for PPGAE and MAC. In [2], they provide Lemma 3 for PPGAE, which establishes a bound for the gradient estimation error, $\mathbf{E}\left[||w_k-\nabla_{\theta}J(\theta_k)||^2\right]$, where $w_k$ is the policy gradient estimator at update $k$ in [2]. In their work, they bound the difference between the estimated gradient and the optimal estimation, $||w_k - w_k^*||$, and the norm-squared of the estimated gradient, $||w_k||^2$, on the RHS of Eq 28 in terms of $\mathbf{E}\left[||w_k-\nabla_{\theta}J(\theta_k)||^2\right]$.
They then utilize Lemma 3 to obtain the final global convergence bound for PPGAE. From [1], we do not have such a bound for the gradient estimation error of the MLMC estimator for MAC. Rather, we have bounds for the norm-squared of the MLMC policy gradient estimator, $||h^{MLMC}_t||^2$, and for the convergence rate of the policy gradient, $||\nabla_{\theta}J(\theta_t)||^2$, stated in Lemma 2 and Lemma 4 of our manuscript, respectively. To utilize these Lemmas, we provide an upper bound for the difference between the estimated MLMC gradient and the optimal MLMC estimation, $||h_t^{MLMC} - h_t^{*MLMC}||$, in terms of $||h^{MLMC}_t||^2$ and $||\nabla_{\theta}J(\theta_t)||^2$, shown in Eq 31, with more details provided in our full proof of Theorem 1 in Appendix B. Using this novel bound in Eq 31, we can then use Lemmas 2 and 4 to establish the global optimality of MAC.

<!-- This adapted framework is more complex than just simply conditioning the stepsize $\alpha$ by $t$ because the KL divergence summation is no longer a telescoping sum as in Bai et. al. Later on in the global convergence analysis, the use of Adagrad stepsize helps alleviate this issue. We will make that more clear in our analysis. We also highlight that -->

> **Overall Typo/Presentation Weaknesses**: Regarding the presentation of the work, it appears that there are many parts that are less clear, many sentences are not syntactically correct and a lot of typos can be spotted while reading.

**Response to Overall Typo/Presentation Weaknesses:** We greatly appreciate the thorough review of the manuscript and the effort in pointing out the flaws in our presentation. We address each example below by providing a clearer revision.

> **Weakness 2:** Furthermore, by definition H, even if mixing time and hitting time are known, the minimum T for K > 1 is practically infeasible as we will explain in Section 4" in lines 211-214;

**Response to Weakness 2:** We will revise it as follows: "Furthermore, even if mixing time and hitting time are known, by definition of the epoch length, H, the minimum sample budget, T, required for the number of episodes, K, to be at least one is practically infeasible, as we will explain in Section 4."

> **Weakness 3:** "into terms of $||\nabla J(\theta_t)||^2$, which MAC has an bound for established by ..." in lines 324-325;

**Response to Weakness 3:** We will change it to: "into terms of $||\nabla J(\theta_t)||^2$, the local convergence rate of MAC. We can then use the convergence rate bound established by (Suttle et al.), which we state in Lemma 4."

> **Weakness 4:** "where we able to change remove the square root", in line 326;

**Response to Weakness 4:** We will update it to: "where we are able to remove the square root".

> **Weakness 5:** Constant "R" should be replaced in Assumption 2.2 using "K" instead,

**Response to Weakness 5:** Thank you for pointing this out. We instead intend to replace "K" with "R" in the assumption statement, because later in the analysis, such as in Equation 32, we use "R".

> **Weakness 6:** The formulation of (actor update) in Equation 15 seems wrong. $\eta_t$ should probably be replaced by $\alpha_t$.

**Response to Weakness 6:** Yes, you are correct. Thank you for pointing this out.

> **Weakness 7:** Furthermore, since I would suggest adding more comments in the sketch of the proof of Theorem 1, like for example introducing quantity $\alpha_T'$. It is also preferable to not refer directly to Equations appearing in the Appendix, as done for Equation (38).
**Response to Weakness 7:** We appreciate this feedback. For clarity, $\alpha_T'$ comes from setting $t = T$ in the definition of $\alpha_t'$ introduced in Lemma 3; we will explicitly state this when introducing $\alpha_T'$. We will also update the manuscript so that equations referenced in the main body are themselves within the main body; "Equation 38" on line 330 can be replaced with "Equation 29".

> **Question 1:** $K_t$ in line 251 is not defined. Is it a typo for $J_t$?

**Response to Question 1:** Thank you for pointing this out. Yes, it is indeed a typo. After sampling $J_t$ from the geometric distribution, $2^{J_t}$ becomes the trajectory length (a minimal illustrative sketch of this sampling scheme is provided after our response to Question 4 below).

> **Question 2:** The formulation of (actor update) in Equation 15 seems wrong. Besides the wrong $\eta_t$, there is a further term that multiplies but already contains the TD term in its defintion. Is it correct or is there a redundant term?

**Response to Question 2:** Indeed, the $\eta_t$ should be an $\alpha_t$, and the TD term in the $\theta$ update in Eq 15 is redundant.

> **Question 3:** In Equation 53 in the Appendix, the term $T^\frac{1}{8}$ seems wrong, should it be $T^\frac{3}{4}$?

**Response to Question 3:** Thank you for the comment. We obtain $T^\frac{1}{8}$ in the first term of the RHS of Equation 53 from Equation 52 through two operations. First, by taking the square root of both sides we go from $T^\frac{1}{2} \to T^\frac{1}{4}$. Next, by multiplying both sides by $\frac{1}{\sqrt{T}}$, we arrive at $T^\frac{1}{4} \to T^\frac{1}{8}$. We can split these two operations into their own steps in the analysis for clarity.

> **Question 4:** On the experimental part, the MAC approach seems to have better performance but much higher variance than PPGAE. Can the author comment on this?

**Response to Question 4:** Thank you for the comment. We kindly refer to Section C of this document <span style="color:blue">[here](https://drive.google.com/file/d/1yh5S5FHU_-AymOwlOCW0BLbUDhN4jWcz/view?usp=sharing)</span>, where we show that the variance of MAC was in part due to the small number of trials. In the main body, for the 5x5 sparse grid, we ran each algorithm for only 5 trials. Figure 5a of Section C shows less variance in MAC's performance than Figure 5b, as 5a shows the learning curve over 100 trials while 5b uses only 20 trials. We also note that the average learning curve of MAC achieves higher success in 5a with 100 trials than in 5b with 20 trials. In comparison to PPGAE and the other baselines, we believe their lower variance across trials suggests that each baseline consistently converges to a policy that rarely reaches the goal. MAC, however, consistently finds policies that reach the goal more often.
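To make the trajectory-length sampling mentioned in our response to Question 1 concrete, below is a minimal illustrative Python sketch of an MLMC-style gradient estimator. This is not our implementation: the function names (`mlmc_gradient`, `sample_grad`) are hypothetical placeholders, and the Geom(1/2) level distribution and the exact form of the level correction are assumptions in the spirit of the MLMC estimator of [1]; only the $2^{J_t}$ rollout length and the $T_{max}$ truncation are taken from the discussion above.

```python
import numpy as np

def mlmc_gradient(sample_grad, theta, T_max, rng):
    """Illustrative MLMC-style policy-gradient estimator (sketch only, not our implementation).

    `sample_grad(theta, n)` is a hypothetical callback that rolls out n consecutive
    environment steps under the policy parameters `theta` and returns the n per-step
    gradient estimates as an (n, d) array.
    """
    # Sample the level J_t from a geometric distribution (assumed Geom(1/2) here);
    # the corresponding rollout length is 2^{J_t}.
    J = rng.geometric(p=0.5)
    n = 2 ** J
    if n > T_max:
        # Truncation: if the rollout would exceed T_max, keep only the single-sample
        # base estimate, which keeps the expected rollout length bounded.
        return sample_grad(theta, 1)[0]
    grads = sample_grad(theta, n)            # 2^J per-step gradient estimates
    g_base = grads[0]                        # level-0 estimate (first sample)
    g_full = grads.mean(axis=0)              # average over all 2^J samples
    g_half = grads[: n // 2].mean(axis=0)    # average over the first 2^{J-1} samples
    # Base level plus the reweighted difference of consecutive levels.
    return g_base + (2 ** J) * (g_full - g_half)
```

The point relevant to Question 1 is that the rollout length is the random quantity $2^{J_t}$ (capped via $T_{max}$), so no mixing-time oracle is needed to choose it.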
