# Final author summary
Dear Area Chairs and Reviewers,
We are sincerely grateful for the thoughtful reviews and constructive discussion. The process materially improved our paper and sharpened the presentation of our main ideas.
What we heard:
Reviewers highlighted the practical relevance of the single-timescale setting, the improvement to the $\tilde{O}(\epsilon^{-3})$ sample complexity over the prior $O(\epsilon^{-4})$, and the simplicity/generalizability of the Lyapunov/ODE-tracking perspective to multiple other use cases.
Main concerns (writing/clarity):
The primary remaining issue is presentation and organization (tables, references, definitions, and clearer emphasis on the core technical innovation).
Our commitments (final version):
We will modify the introduction and technical presentation to foreground the core innovation (Lyapunov term and ODE-tracking), as per reviewer guidance.
We will perform a thorough editing pass: fixing duplicated/mis-linked tables, adding missing definitions/dimensions, and cleaning up references and appendix navigation.
We appreciate the reviewers’ engagement: ZGMW raised their score while urging stronger presentation; ZXNM moved to a weak accept contingent on clearer exposition; EEab and FxH5 confirmed satisfaction and kept their scores.
Thank you again for a rigorous and helpful process. We are confident the final revision will fully address the writing concerns and make the contribution easier to follow.
# NeurIPS AC Rebuttal
# Reviewer FxH5, rating 5, confidence 4
We thank the reviewer for the encouraging feedback. We appreciate the time and effort spent on reviewing this paper. We are glad that the reviewer finds that the paper provides a state-of-the-art complexity result for a challenging and more practical single-timescale algorithm.
We answer the questions below.
***Q1) The paper is restricted to tabular setups. This means not applicable to environments such as robotics.***
> The reviewer is right to observe that the current setting, as it is, is not applicable to large-state problems such as robotics. The biggest roadblock is the square-root state dependence of the Gradient Domination Lemma (GDL) 2.1. However, we agree that extending this lemma to continuous states (with embeddings or an approximate version) is an interesting future direction, which would also lead to a non-tabular extension of the exact-gradient convergence in Xiao et al. 2022 and Mei et al. 2022. Consequently, a non-tabular extension of our actor-critic complexity would follow with minor modifications (accounting for function-approximation errors, etc.). We thank the reviewer for this question, which we shall pursue in future work.
***Q2) In certain situations, using multiple samples to estimate the critic can be beneficial such as DDPG. A discussion of these would be good.***
> We thank the reviewer for this insightful question. There are many works on double-loop actor-critic (using many samples for the critic per actor update), and our work is at the other extreme (using a single sample for the critic per actor update). It might be that the sweet spot lies somewhere in between these two extremes, in general or in specific situations such as DDPG. We will add a discussion on this in the final version and explore it further as a future direction.
***Q3) The authors claim that the analysis may be extended to bilevel RL setup. More discussion on that would be good.***
> Our work suggests a high-level approach for bi-level problems such as min-max optimization, robust MDPs, and two- (or multi-)agent RL:
>> Step 1) Formulate the interdependent recursions, as in equation (3), for the involved quantities.
>> Step 2) Identify the Lyapunov function as done in Line 274, which may be the most creative part of the solution.
>> Step 3) Solve the Lyapunov recursion as done in Lemma 4.4 to get the bound.
> Our Lemma 4.4 can be extended to handle more general recursions, which we omitted for clarity. There are also several insights we developed during the process, such as the fact that the recursion typically tracks its associated ODE and that the appropriate learning rates can be found using 'power matching'. We shall gladly include a section (and a more general ODE-tracking lemma) on this in the appendix of the final version; a toy numerical sketch of Steps 1-3 is given below.
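As a purely illustrative toy (not the recursions of equation (3); all constants and recursion forms here are hypothetical), the sketch below runs a pair of coupled error recursions, combines them into a Lyapunov function as in Step 2, and checks numerically that the resulting Lyapunov recursion contracts as in Step 3.

```python
# Toy coupled recursions (hypothetical, for illustration only):
#   a_{k+1} = (1 - eta) * a_k + eta * b_k    (slow error, perturbed by the fast one)
#   b_{k+1} = (1 - beta) * b_k               (fast error, contracts on its own)
# Step 2: Lyapunov candidate V_k = a_k + lam * b_k with lam = eta / (beta - eta).
# Step 3: then V_{k+1} = (1 - eta) * V_k, so V_k (hence a_k and b_k) decays geometrically.

eta, beta = 0.05, 0.2                # step sizes, with beta > eta
lam = eta / (beta - eta)             # coupling weight that makes the Lyapunov recursion close
a, b = 1.0, 1.0                      # initial errors
V = a + lam * b

for k in range(200):
    a, b = (1 - eta) * a + eta * b, (1 - beta) * b
    V_next = a + lam * b
    assert V_next <= (1 - eta) * V + 1e-12   # the combined recursion contracts at every step
    V = V_next

print(f"after 200 steps: a = {a:.2e}, b = {b:.2e}, V = {V:.2e}")
```

In the actual bi-level problems the recursions would instead come from the respective update rules, and Step 3 would use the ODE-tracking lemma rather than the simple geometric contraction of this toy.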
***Q4) Is there any path to extend the analyses to infinite state/action spaces?***
> As we discussed in Q1, the key step in extending the analysis to infinite state-action spaces is to obtain a Gradient Domination Lemma (an exact or approximate version), as this is the only part with explicit state-space dependence.
> The use of a sufficient-increase lemma for function approximation is standard in the literature. One would then obtain the recursion and solve it, with minor modifications to the existing analysis.
> Loosely speaking, the $\sqrt{S}$ in GDL Lemma 2.1 corresponds to the diameter of the policy space $\max_{\pi,\pi'\in\Pi}\lVert\pi -\pi'\rVert_2$. In the function-approximation case, it might be possible to replace this with the diameter of the parameter space $\max_{\theta,\theta'\in\Theta}\lVert\theta -\theta'\rVert_2$. However, this requires careful study, which we leave for future work.
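For instance, under the direct (tabular simplex) parameterization, a back-of-the-envelope calculation (not a statement from the paper) already gives a $\sqrt{S}$-type diameter: each per-state difference of two distributions has $\ell_2$ norm at most $\sqrt{2}$ (attained by two distinct deterministic policies), so
$$\max_{\pi,\pi'\in\Pi}\lVert\pi-\pi'\rVert_2 = \max_{\pi,\pi'\in\Pi}\sqrt{\sum_{s}\lVert\pi(\cdot\mid s)-\pi'(\cdot\mid s)\rVert_2^2} \leq \sqrt{2S},$$
which matches the $\sqrt{S}$ scaling mentioned above.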
***Q5) What challenges do you anticipate if any in extending your analysis to bilevel RL setup?***
> We briefly outlined a possible approach for bi-level problems in Q3. We believe the first step (obtaining the recursions) should be straightforward, if possibly laborious. Identifying the Lyapunov function might be the trickiest part of all. The recursion can then be solved similarly to our ODE-tracking lemma.
> However, if finding a Lyapunov function turns out to be difficult or impossible, it is fortunately possible to bypass the Lyapunov step by solving the multiple recursions directly via an extended ODE-tracking lemma, applying the same methodology we used to solve the Lyapunov recursion several times over. This was our original approach before we stumbled upon this elegant Lyapunov function. We will gladly include a section on this in the appendix of the final version.
# Reviewer EEab, rating 5, confidence 4
We thank the reviewer for the positive and thoughtful feedback.
We are glad that the reviewer finds our focus on the single-timescale actor-critic setting both practical and meaningful. We appreciate the recognition of our improved sample complexity and the Lyapunov-based analysis technique. We agree with the reviewer that this approach can inspire further work in similar settings.
***Q1) The 2nd and 3rd rows of Table 1 look the same.***
> Thanks for pointing this out. We will merge the two in the final version.
***Q2) In equation (3), the $z_k$ at the end of the first line seems redundant.***
> Thank you very much for noticing. This is indeed a typo which we will correct.
***Q3) The dimension of variables is obscure. Something like $A^k \in \mathbb{R}^{|S||A|}$ might make it clearer.***
> Thanks for the suggestion. We will add the dimensions of all the variables in the final version.
***Q4) Some references look ill-formatted. For example, it looks like lines 342-347 are referring to the same paper.***
> Thanks for pointing it out. We will fix this in the final version.
***Q5) How do the analysis techniques compare to the analysis in previous works, and why can the previous analysis not achieve the rate obtained in this paper?***
> The weakness in the previous analysis of Chen and Zhao [2024] is that they treat the policy optimization objective as a general smooth non-convex function. Moreover, they take the standard approach of bounding the average of the squared gradient norm. They do not use the gradient domination structure; one may plug the inequality in at the end, but that yields an inferior rate. The core innovation of our paper is to exploit this structure when bounding the iteration-wise drift of the actor.
> Exploiting this additional structure required us to develop a novel methodology to solve the interdependent recursions, which yields the improved result.
> In summary, our analysis is tailored to RL: it efficiently exploits the gradient domination structure, in contrast to the standard smooth-optimization techniques used in the previous paper. A schematic comparison of the two routes is sketched below.
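Schematically, with constants and logarithmic factors suppressed (and with the rate of the prior route inferred only from its stated $O(\epsilon^{-4})$ sample complexity), the two analyses compare as follows:
$$\text{prior route:}\quad \frac{1}{K}\sum_{k<K} E\big\lVert\nabla J^{\pi_{\theta_k}}\big\rVert_2^2 \lesssim \frac{1}{\sqrt{K}} \;\Rightarrow\; \min_{k<K}\big(J^* - J^{\pi_{\theta_k}}\big) \lesssim K^{-1/4} \text{ (via the GDL at the end)} \;\Rightarrow\; O(\epsilon^{-4}) \text{ samples, best iterate},$$
$$\text{our route (GDL inside every step):}\quad J^* - J^{\pi_{\theta_K}} \lesssim K^{-1/3} \;\Rightarrow\; \tilde{O}(\epsilon^{-3}) \text{ samples, last iterate}.$$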
***Q6) Is there a sample-complexity lower bound for this kind of actor-critic algorithm?***
> The sample-complexity lower bound for RL is $\Omega(\epsilon^{-2})$ (Auer et al., 2008). However, we are not aware of a lower bound specific to single-timescale actor-critic algorithms; nonetheless, it is commonly believed to be the same.
# Reviewer ZGMW, rating 3, confidence 2
We thank the reviewer for the encouraging review. We are glad that the reviewer finds the key ideas and analysis to be novel, fairly simple, and insightful. We are also happy to hear that the presentation of the setting and the positioning within prior work were clear.
***Q1) The sample-complexity does not match the lower bound, whose dependence is $O(\varepsilon^{-2})$.***
> The reviewer is right that our $O(\epsilon^{-3})$ complexity does not match the lower bound of $O(\epsilon^{-2})$. However, our work improves upon the existing complexity of $O(\epsilon^{-4})$, significantly narrowing the gap, and is the state-of-the-art result to the best of our knowledge.
***Q2) The dependence on quantities other than $\varepsilon$ are ignored. In particular, the dependence on the effective horizon $1/(1-\gamma)$ and the size of the state-action space $SA$, which are hidden within constants, appears like it could be quite poor.***
> All previous works reported the complexity only in $\epsilon$; hence we followed the same convention for clarity. Nonetheless, we can easily obtain the coefficients, as outlined below.
> From Lemma C.1, we have
$$a_k \leq c_l^{-\frac{2}{3}}\Bigl(\max\{c_\eta, 2c_l^2u^3_0\}\Bigr)^{\frac{1}{3}}\Bigl(\frac{1}{\frac{1}{\alpha^3_0}+ 2k}\Bigr)^{\frac{1}{3}}
\leq \frac{\max\{2^{-\frac{1}{3}}c_l^{-\frac{2}{3}}c_\eta^{\frac{1}{3}}, u_0\}}{ k^{\frac{1}{3}}}.$$
Substituting the values of $c_l$ and $c_\eta$ from Table 3, we get $c_l^{-\frac{2}{3}}c_\eta^{\frac{1}{3}} = O(\frac{S^{\frac{4}{3}}A^{\frac{2}{3}}C_{PL}^{\frac{4}{3}}}{c^{\frac{4}{3}}\lambda^{\frac{2}{3}}(1-\gamma)^{\frac{16}{3}}})$ and $u_0=O(\frac{SA}{(1-\gamma)^2})$; hence the complexity is
$$a_k \leq C\frac{S^{\frac{4}{3}}A^{\frac{2}{3}}C_{PL}^{\frac{4}{3}}}{\lambda^{\frac{2}{3}}(1-\gamma)^{\frac{16}{3}}k^{\frac{1}{3}}},$$
where $C$ is some numerical constant. For comparison, the iteration complexity for the exact-gradient case is $O(\frac{SC^2_{PL}}{c^2(1-\gamma)^6k})$, as shown in Theorem 4 of Mei et al. 2022, which can be improved to $O(\frac{SC^2_{PL}}{c^2(1-\gamma)^3k})$ using our ODE-tracking method (we will add this in the final version).
> Notably, the actor-critic dependence is slightly better in the mismatch coefficient $C_{PL}$ (yes, we double-checked), and only slightly worse in the state space and horizon. We thank the reviewer for this question; we shall include this discussion in the final version.
***Q2) Presentation issues, typos, ... And lines 268-272 are not clear.***
> We appreciate the reviewer's feedback on the presentation; we will correct these presentation issues and improve the clarity.
> Lines 268-272 highlight the challenge in ensuring a monotone decrease of the critic error in equation (3). That is, the gradient norm $y_k$ is lower bounded by $a_k$ (GDL). This lower bound on $y_k$ is helpful for upper bounding $a_{k+1}$ in the actor recursion of equation (3). However, to upper bound the critic error $z^2_{k+1}$ in the critic recursion of equation (3), we need an upper bound on $y_k$, which we do not have. We will revise this paragraph for clarity.
***Q3) In the discussion following Lemma 4.1, it is not clear to me why the last term “accounts for the variance in the stochastic update of the policy” since in the analysis this term arises from the smoothness of the value and not from the stochasticity of the update ?***
> Yes, the reviewer is correct to observe that Lemma 4.1 arises from the smoothness of the return. However, looking closely at its proof in Lemma B.1, the last term is $$c_3\eta_k^2 = \frac{L}{2}E[\lVert \theta_{k+1}-\theta_k\rVert^2],$$
which is the variance (second moment, to be precise) of the updates. We thank the reviewer for noticing this; we will change the phrase to "accounts for the variance (second moment) of the updates" in the updated version.
***Q4) You mention that you believe that a bias analysis may lead to the optimal $O(\varepsilon^{-2})$ sample-complexity. Do you believe this is achievable without any modification to the algorithm ? Is there perhaps a slight adaptation of the algorithm that could improve your variance analysis ?***
> This is an intriguing question. From our recursion analysis, we believe the complexity we obtained is difficult to improve (that is, the ODE-tracking method solves the given recursion tightly). We also considered using batches, but this did not help improve the complexity. However, we believe that if we could re-use samples, the total number of new samples required would be less than $O(\epsilon^{-3})$. This is interesting from the perspective of combining offline and online RL, which we leave to future work.
# Reviewer ZXNM, rating 2, confidence 4
We thank the reviewer for the review, and we are very happy that the reviewer finds our result and analysis highly significant and interesting, and worthy of being known in the RL theory community.
***Q1) The weakness in the analysis from Chen and Zhao [2024] is that they treat the policy optimization objective as a general smooth non-convex function without using the gradient domination structure (or one may plug in the inequality at the end). In my understanding, the core innovation of the current paper is to exploit this structure when bounding the iteration-wise drift of the actor. A more thorough discussion of this aspect should be the actual focus of the presentation.***
> We thank the reviewer for this insightful observation. Indeed, the reviewer is correct to note that we use the GDL (Gradient Domination Lemma) structure from the very start, and then derive and solve the recursion. This is in contrast to Chen and Zhao [2024], who use standard smooth-optimization techniques to upper bound the gradient norm via a telescoping sum; when the GDL is plugged in at the end, it yields only a best-iterate guarantee, in addition to an inferior rate. We will add this discussion to the introduction, the main text, and the discussion section.
***Q2) Can the authors clarify if I correctly understand the innovation of the paper in my comments above (in Strengths & Weaknesses)? Specifically, the limitation in prior works like Chen and Zhao [2024] is that they disregard the gradient domination structure and treat the policy optimization objective as a standard smooth non-convex function, but the current paper leverages the structure in the analysis of the actor. From this perspective, the key novelty of the paper lies in Lemma 4.1. To my knowledge, Lemma 4.2 is a standard result (namely, showing the approximate iteration-wise critic convergence up to gradient norm of the actor). The fact that Lemma 4.2 is standard does not undermine the contribution of the paper. For clarity, I strongly encourage the authors to be precise and transparent in communicating to the audience which parts of the analysis are novel and which rely on established results. If this is indeed the case, the authors should consider substantially revising the paper to place more emphasis on Lemma 4.1 and its analysis in the main paper.***
> Yes, the reviewer correctly understands the novelty and techniques of the paper. Indeed, Lemma 4.2 follows from standard techniques and can be found in the literature. We welcome the suggestion and will place more emphasis on Lemma 4.1 in the final version.
***Q3) The authors state "Additionally, when this ODE tracking method is applied to the recursion for the exact gradient case, it produces better results compared to existing bounds, such as those in Mei et al. (2022)." in line 52-54, but never establish anything along this line.***
> We apologize for this omission. By applying our techniques, we can significantly improve the dependence on the MDP parameters compared to Mei et al. (2022). A brief explanation follows:
>
>In the exact-gradient case, from the sufficient-increase lemma with step size $\eta_k =\frac{1}{L}$, where $L = \frac{8}{(1-\gamma)^3}$ is the smoothness coefficient, we have (with $a_k$ denoting the sub-optimality)
$$a_k - a_{k+1} \geq \frac{\lVert\nabla J^{\pi_{\theta_k}}\rVert^2_2}{2L},$$
and from the GDL, we have
$$a_k\leq \frac{\sqrt{S}C_{PL}}{c}\lVert\nabla J^{\pi_{\theta_k}}\rVert_2.$$ Combining both, we get the recursion
$$a_k - a_{k+1}\geq \frac{c^2}{2LSC^2_{PL}}a_k^2.$$
From the ODE-tracking lemma, it can be shown that
$$a_k \leq \frac{1}{\frac{1}{a_0} + \frac{c^2}{2LSC^2_{PL}}k}=\frac{1}{\frac{1}{a_0} + \frac{c^2(1-\gamma)^3}{16SC^2_{PL}}k}, \qquad \forall k\geq 0.$$
Note that the bound in Mei et al. 2022 is $a_k \leq \frac{16SC^2_{PL}}{c^2(1-\gamma)^6k}$. Our result improves upon Mei et al. 2022 in two aspects:
>> 1) We have $a_k \leq \frac{16SC^2_{PL}}{c^2(1-\gamma)^3k}$, which improves the horizon dependence by a power of three.
>> 2) Note that $\frac{16SC^2_{PL}}{c^2(1-\gamma)^6}$ is a very large constant, and the sub-optimality $a_k$ can never exceed $\frac{2}{1-\gamma}$. Hence, for the initial iterates the bound $\frac{16SC^2_{PL}}{c^2(1-\gamma)^6k}$ exceeds the highest possible sub-optimality, providing no meaningful result. Our bound, on the other hand, never exceeds $a_0$ and is therefore informative from the very first iterate (a small numerical illustration follows below).
We will include this discussion, together with a proof of the ODE-tracking lemma for this case, in the appendix of the final version.
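To make point 2) concrete, the following small sketch compares the two bound shapes; the parameter values below are hypothetical and chosen purely for illustration, not taken from the paper.

```python
# Hypothetical parameter values, for illustration only (not from the paper).
gamma, S, C_PL, c = 0.99, 100, 10.0, 1.0
a0 = 2 / (1 - gamma)                                     # largest possible sub-optimality

C_ours = c**2 * (1 - gamma)**3 / (16 * S * C_PL**2)      # coefficient in our ODE-tracking bound
mei_const = 16 * S * C_PL**2 / (c**2 * (1 - gamma)**6)   # constant in the Mei et al. (2022) bound

for k in [1, 10, 10**6]:
    ours = 1 / (1 / a0 + C_ours * k)   # never exceeds a0, informative from k = 1
    mei = mei_const / k                # exceeds a0 (vacuous) until k is extremely large
    print(f"k={k:>8}: ours={ours:.3e}, Mei et al.={mei:.3e}, cap={a0:.1f}")
```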
***Q4) In line 119-121, the authors make the comment "Yuan et al. (2022) too establishes global convergence with sample complexity of $\mathcal{O}(\epsilon^{-3})$, however, it requires an additional structural assumption on the problem which is highly restrictive." The sentence is written in a way that makes the audience want to know the exact assumption. Please provide more detail here.***
> Yuan et al. (2022) have various results under various assumptions for stochastic policy gradient with batch size $m$ and horizon $H$. Their method has only an actor parameter, no critic.
> ***Corollary 4.11.*** The simplest result (with only the ABC Assumption 3.3) gives a global-convergence sample complexity of $O(\epsilon^{-6})$ with horizon length $H = \frac{\log(\epsilon^{-1})}{1-\gamma}$.
>***Corollary 3.7.*** In addition to the ABC Assumption 3.3, this result requires Assumption 3.6 with slackness $\epsilon'$ for horizon $H$.
>> Limiting case: Note that $\epsilon'$ is zero for the infinite horizon, and in this case stochastic policy gradient requires $O(\epsilon^{-3})$ iterations to get $O(\epsilon)$-close to the globally optimal policy. Hence, the sample complexity is essentially infinite, as trajectories are rolled out to infinite length at each iteration.
>> Normal case: For $\epsilon' \neq 0$, the policy is only guaranteed to converge to sub-optimality $O(\epsilon + \epsilon')$ in $O(\epsilon^{-1}(\epsilon')^{-2})$ iterations. That is, the algorithm cannot be guaranteed to converge to an arbitrarily small sub-optimality.
> To summarize, the (stochastic policy gradient) algorithm in Yuan et al. (2022) is substantially different from ours (single-timescale actor-critic). Furthermore, our sample complexity is better and/or requires fewer assumptions. We shall make this more precise in the final version.
***Q5) In line 150-151, "However, the return ... leading to the following result." -- the gradient domination lemma is not a direct consequence of $J$ being smooth.***
> Thank you very much. This line is misplaced here. We will remove it in the updated version.
***Q6) I suggest that the authors re-organize the appendix and add proper linkage between lemmas/propositions and their proofs, to make it easier to the audience to locate the proofs.***
> Thank you very much for the suggestion. We will add an outline of the appendix and clearly mark the technical results against their main text counterparts in the updated version.
***Q7) Something nice about the result in this paper is that the convergence is established on the last iterate. If we follow the analysis in Chen and Zhao [2024] and plug in the gradient domination condition at the end, the convergence in value function space will be on the best iterate (in addition to having an inferior rate). The authors should consider mentioning this additional advantage of their analysis.***
> Thanks for this suggestion. We will add this point after our main result, Theorem 3.3.