**Summary of Rebuttal Discussions:**
**General comments:** We thank all reviewers for their valuable feedback and insightful questions. We are particularly encouraged by the following: they consider our treatment of slow-mixing problems in RL to be well-motivated and a timely contribution to the literature (9QDQ, PeXZ, bQie, VQNT); they appreciate the novelty of our proposed MAC algorithm and its explicit dependence on mixing time in the analysis (9QDQ, PeXZ, bQie, VQNT); and they find our paper well-written (9QDQ, VYuy, bQie, VQNT). We provided thorough responses to each reviewer's concerns during the rebuttal, which resulted in two (PeXZ, bQie) of the three reviewers who engaged with us raising their scores (final scores 6 7 5 5 6).
**Review highlights:** Reviewer bQie *"really strongly agree[s] with the motivation of this paper"* and states post-rebuttal that *"this work is quite solid"* and *"would make a nice contribution to this conference"*. Reviewer PeXZ appreciates that we provide *"new results, explicitly characterizing the algorithm's dependence on the mixing time, which is never considered before for actor-critic algorithms."* Reviewer VQNT believes that the paper *"is overall well-written"* with a presentation in the main body that *"helps understand the high-level ideas,"* which are *"generally interesting for the theoretical community."* The remaining reviewers think the paper is *"clearly written"* and addresses *"an important open question"* (9QDQ), and note that our work improves over existing results (VYuy).
**Summary of core contributions:**
* Existing analyses of infinite-horizon, average-reward actor-critic algorithms assume fast-mixing properties or oracle access to mixing times, which may not hold in practice. To address this shortcoming, we propose a novel multi-level Monte Carlo actor-critic algorithm, MAC, that does not rely on oracle knowledge of mixing times or fast-mixing properties.
* Furthermore, our novel analysis of MAC explicitly characterizes its dependence on mixing times, which has not been previously addressed in the actor-critic literature.
* We also provide empirical results on gridworld problems illustrating that MAC enjoys benefits over vanilla methods.
* In summary, this work provides a timely contribution to an important unresolved problem in the literature that has both theoretical and practical ramifications, and is thus of interest to the theoretical RL and broader RL communities alike.
-----------------------------------------------------------------
## Response to Reviewer 9QDQ [Score 6, Confidence 4]
<!-- **General Response:** -->
We would like to express our gratitude to the reviewer for dedicating their valuable time and effort to evaluating our manuscript, which has allowed us to strengthen it. We deeply appreciate the insightful feedback, and we respond thoroughly to the reviewer's inquiries below.
> **Question 1:** One question I have is about the relationship between $T_{max}$ and $\tau_{mix}$. On page 6, it is assumed that $T_{max} \geq \tau_{mix}^t$. Does this condition implicitly require some knowledge of $\tau_{mix}$? Second, I am a bit confused about the remark on trade-off between $\tau_{mix}$ and $T_{max}$. Aren't they just proportional? Also, it seems one can simplify the results in theorems 4.6, 4.7 and 4.8 by simply taking $T_{max} = \max_t \tau_{mix}^{\theta_t}$.
**Response to Question 1:** For purposes of the analysis, we do indeed assume that $T_{max} \geq \tau_{mix}^{\theta_t}$ at each $t$. Unlike [Dorfman & Levy, 2022], which lets $T_{max} = T$ grow over time, fixing a finite value of $T_{max}$ allows us to explicitly quantify the effect of using a specific value of $T_{max}$ in Theorems 4.6, 4.7, and 4.8.
In our experiments, we simply choose a $T_{max}$ that seems "large enough" and that we hope is larger than $\tau_{mix} := \max_t \tau_{mix}^{\theta_t}$. If we instead let $T_{max} = T$ as in [Dorfman & Levy, 2022], then we recover the rate $\widetilde{O}(\tau_{mix}^2 \epsilon^{-2})$ while remaining completely oblivious to $\tau_{mix}$. The reason we keep the explicit dependence on $T_{max}$ in our results is to provide insight into how the error due to the choice of $T_{max}$ and the worst-case mixing time $\tau_{mix}$ dies off as $T_{max} > \tau_{mix}$ grows large. This also clarifies the "trade-off" between $\tau_{mix}$ and our choice of $T_{max}$: for large $\tau_{mix}$, we should choose $T_{max}$ so that the term $O\left( \frac{\tau_{mix}\log T_{max}}{T_{max}} \right)$ appearing in equation (23) is small.
> **Question 2:** A related question I have is that if $T_{max} \geq \tau_{mix}^t$, can the algorithm be simplified by just waiting for $\tau_{mix}$ time steps between collecting data? That is, collect data at $t$ and $t + \tau$, where $\tau > \tau_{mix}$. Then, $T_{mix}$ seems to be not required. I might be missing something here.
**Response to Question 2:** This is a good point, but we note that $T_{max}$ is an upper bound on the number of samples used in the MLMC estimator (notice $T_{max}$ in equation (13)). Since we are using $2^{J_t}$ samples with $J_t \sim \text{Geom}(1/2)$ in the MLMC estimator from equation (13) in the paper, we actually only require around $\log T_{max}$ samples in expectation. Waiting $\tau > \tau_{mix}$ steps between collecting data would thus waste much more data than following our MLMC sampling procedure.
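For concreteness, the short calculation behind this count: since $P(J_t = j) = 2^{-j}$ for $J_t \sim \text{Geom}(1/2)$, and the cap $2^{J_t} \leq T_{max}$ restricts the effective exponent to $j \leq j_{max} = \lfloor \log T_{max} \rfloor$, the expected number of samples per round is
\begin{align}
1 + \sum_{j=1}^{j_{max}} P(J_t = j)\, 2^j = 1 + \sum_{j=1}^{j_{max}} 2^{-j} \cdot 2^j = 1 + j_{max} = O(\log T_{max}).
\end{align}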
> **Question 3:** There seems to be some typos in the actor update in (11). Step size for the actor update are denoted by $\alpha_t$ throughout the paper? Also, $h_t$ includes the term $\delta^{\pi_{\theta_t}}$? Same for (13).
**Response to Question 3:** Thanks to the reviewer for pointing this out. The actor update should instead be $\theta_{t+1} = \theta_t + \alpha_t h_t$, where $\alpha_t$ is the stepsize and $h_t$ is the policy gradient estimate.
> **Question 4:** For theorem 4.7, is there a specific step size selection rule of $\beta_t$? It does not seem to be spelled out. Also, in the statement of theorem 4.6, $\eta_t^*$ seems undefined.
**Response to Question 4:** Thank you for raising this issue for clarification. In Theorem D.5 (the detailed formal version of Theorem 4.7 in the appendix, page 24), we take $\beta_t = \gamma_t = (1 + t)^{-\nu}$. We will add this to the statement of Theorem 4.7 in the main body. Regarding $\eta_t^*$, it is defined to be $\eta_t^* = J(\theta_t)$ on line 752 in the appendix. We will add this to Theorem 4.6. Thanks for catching these gaps -- we will clarify them in the final version.
## Response to Reviewer VYuy [Score 5, confidence 2]
<!-- **General Response:** -->
We are thankful to the reviewer for dedicating their valuable time and effort to evaluating our manuscript, which has allowed us to strengthen it. We respond thoroughly to the reviewer's inquiries below.
> **Reviewer comments:** This paper uses vanilla AC, which is known to be very hard to train. And indeed, in figure 1), the variance is very high, just from the figure itself, I don't seem to get convincing comparison. I was wondering if this could be done is a proper way by using modern AC algorithms like TD3 or SAC as authors have already mentioned.
Also, it'd be interesting to run the proposed algorithm on Mujoco benchmark tasks.
**Response to Reviewer comments:** We agree that a thorough comparison of a neural network version of MAC with state-of-the-art deep RL methods like TD3, SAC, or PPO on benchmark problems like MuJoCo tasks is highly desirable. In order to achieve a fair comparison, however, we would first need to develop a "MAC-ified" version of a SOTA method like PPO, where the standard gradient estimators are replaced with MAC-style estimators, then compare this version of MAC to standard PPO, TD3, SAC, etc., on a wide range of benchmark tasks. This is a very interesting and potentially impactful future direction, yet we believe it is beyond the scope of the present work.
***For the present paper***, ***we have investigated the theoretical underpinnings of MAC***, and our experiments are provided as a proof of concept to support our theoretical findings. Nonetheless, we have begun experimentally investigating MAC on the Frozen Lake and Taxi OpenAI gym environments and will include whatever additional results we are able to obtain in the final version.
## Response to Reviewer PeXZ [Score 4, confidence 5]
We express our gratitude to the reviewer for taking the time to review our manuscript and recognizing the importance of our contribution in characterizing the actor-critic convergence rates in terms of mixing time. We sincerely appreciate the feedback and provide detailed responses to all questions below.
> **Question 1.1:** It seems the optimal rate requires $T_{max} = T^{1/2}$ and also the results implicitly assumes $T_{max} > \tau_{mix}$. If so, the theorems should explicitly state the assumption that $T_{max} > \tau_{mix}$.
>
> **Question 1.2:** Note that [Xu et al. 2020] (which achieves $\epsilon^2$) also has a similar minibatch size $B + T^{1/2}$ to estimate the updates. It is very likely that if [Xu et al. 2020]'s standard estimator is replaced with the MLMC estimator, then the same sample complexity $\tau^2 / \epsilon^2$ can be recovered. Do you agree with this comment?
**Response to Question 1.1:** For purposes of the analysis, we assume that $T_{max} \geq \tau_{mix}^{\theta_t}$ at each $t$. This assumption is made at line 319 in the left-hand column, as well as in the statements of Lemmas B.3, C.2, and D.4 in the appendix. Unlike [Dorfman & Levy, 2022], which lets $T_{max} = T$ grow over time, we fix a finite value of $T_{max}$ both in practice and in our analysis. This allows us to explicitly quantify the effect of using a specific value of $T_{max}$ in Theorems 4.6 (cf. Theorem C.4 in the appendix), 4.7 (cf. Theorems D.1 and D.5 in the appendix), and 4.8 (cf. Section E). If we let $T_{max} = T$ as in [Dorfman & Levy, 2022], then we recover $\widetilde{O}(\tau_{mix}^2 \epsilon^{-2})$.
**Response to Question 1.2:** We agree that modifying the algorithms and analysis of [Xu et al., 2020] to use MAC-style MLMC estimators may be possible, but it would be challenging and would constitute a significant contribution instead of a straightforward extension, for the following reasons:
* ***Batch size-dependent bias analyses.*** One of the primary contributions in [Xu et al., 2020] is the use of mini-batch TD learning and mini-batch AC to obtain improved sample complexity. A significant novelty in their analysis is a new method for characterizing the bias involved in using mini-batches for both the critic and the actor. This analysis depends in a crucial way on the batch sizes, which for the critic is $M = \Theta(\epsilon^{-1})$ and for the actor is $B = \Omega(\epsilon^{-1})$ (see Theorems 1 and 2 in [Xu et al., 2020]). The MLMC estimators used in MAC, on the other hand, use stochastic rollout lengths (i.e., batch sizes) of $M = B = 2^{\min(j_{max}, J_t)}$ samples, where $J_t \sim \text{Geom}(1/2)$ and $j_{max} = \lfloor \log T_{max} \rfloor$ (see, e.g., equation (13)). Since $\mathbb{E}[M] = 1 + \sum_{j=1}^{j_{max}} P(J_t = j) 2^j = O(\log T_{max})$, this means batch sizes for both the actor and critic are only $O(\log T_{max})$, in expectation. When we let $T_{max} = T$ as in [Dorfman & Levy, 2022], this means $M = B = O(\log (1 / \epsilon))$. In order to combine MLMC estimators with [Xu et al., 2020], this significant difference between batch sizes would need to be reconciled with their batch size-dependent bias analyses.
* ***Dependence on transition kernel Lipschitzness and uniform ergodicity.*** A crucial difference between the MAC analysis and previous analyses lies in how MAC handles Markovian sampling. Previous works, including [Xu et al., 2020], follow [Zou et al., 2019] in combining Lipschitz properties of the induced transition kernels and uniform ergodicity of the policy-induced Markov chains to handle bias and variance arising from the Markovian sampling. In [Xu et al., 2020], for example, the results that depend on Assumption 2 and Lemma 3 to eventually tease out the exponential decay term $\kappa \rho^{i-j}$ around equation (25) rely on this style of analysis. Our novel MAC analysis, on the other hand, replaces this dependence with the easier-to-use MLMC first and second moment properties described in Lemmas B.3, C.2, and D.4 of our appendix. This innovation has the added bonus of explicitly revealing mixing time dependence. In order to incorporate MAC-type estimators in [Xu et al., 2020]'s analysis, similar MLMC-style results would need to be proven, and all the analysis stemming from their Assumption 2 and Lemma 3 would need to be reworked accordingly. This would be a nontrivial extension.
> **Question 2:** If $j \sim \text{Geom}(1/2)$, then $\mathbb{E}[2^j] = + \infty$. Does this mean the algorithm will sample infinite times in expectation at each round? If $T_{max}$ is used to limit this situation, then it is at best not properly described in Algorithm 1. Currently, Algorithm 1 does not even make sense due to its sample complexity going to infinity. My understanding is that is used to limit the expected rollout steps to $\log(T_{max})$.
**Response to Question 2:** We apologize for the confusion. Please note the presence of the constraint $2^{J_t} \leq T_{max}$ in equation (13) (referred to in step 9 of Algorithm 1). This means that the maximum rollout length is capped at $T_{max}$. In addition, the maximum possible value that the exponent $j$ in the MLMC estimator (13) can take is $j_{max} = \lfloor \log T_{max} \rfloor$. As outlined in our response to ***Question 1***, this means that only $O(\log T_{max})$ samples are actually needed, in expectation. We will make this fact clearer in Algorithm 1 in the revised version of our paper.
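To illustrate the capped sampling step, here is a minimal Python sketch of our reading of step 9 of Algorithm 1 (the helper name and interface are ours for illustration, not the paper's implementation):

```python
import numpy as np

def mlmc_rollout_length(T_max: int, rng: np.random.Generator) -> int:
    """Sample the rollout length for one round (step 9 of Algorithm 1).

    J_t ~ Geom(1/2); the cap 2^{J_t} <= T_max from equation (13)
    bounds the rollout length by T_max.
    """
    j_max = int(np.floor(np.log2(T_max)))
    J_t = rng.geometric(0.5)  # P(J_t = j) = 2^{-j}, j = 1, 2, ...
    return 2 ** min(J_t, j_max)

rng = np.random.default_rng(0)
lengths = [mlmc_rollout_length(T_max=1024, rng=rng) for _ in range(100_000)]
print(np.mean(lengths))  # approx. log2(1024) + 1 = 11, far below T_max = 1024
```

The empirical mean is close to $\log_2 T_{max} + 1$, confirming the $O(\log T_{max})$ expected sample count.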
> **Question 3:** Is Algorithm 1 designed to use one whole trajectory $s_t^{i+1} = s_{t+1}^0$, or restarting at each step $s_t^0 \sim \mu_0$? It seems this algorithm requires resampling at each round $T$.
**Response to Question 3:** Algorithm 1 uses a single trajectory, so $s_{t+1}^0 = s_t^{i+1}$ for the final $i$ at each step $t$. We will add a corresponding line and explicitly clarify this in Algorithm 1.
> **Question 4:** In terms of the convergence rate, previous works like [Wu et al. 2020][Xu et al. 2020] consider the per-sample convergence rate. This paper instead allows sample $T_{max}$ times and counts that as one round. What if they are compared in the same per-sample criterion? Assuming $T'$ is the number of total samples, it looks like the actor's convergence rate becomes roughly $\frac{\sqrt{\log T_{max}}}{\sqrt{T'}} + \tau \frac{\log T_{max}}{T_{max}}$. I think this kind of result should instead be presented to make a fair comparison.
**Response to Question 4:** This is a good point, which is explicitly addressed by Theorems C.4, D.1, D.5, and Section E in the appendix. In particular, the proof of Theorem 4.8 given in Section E provides the following actor bound:
\begin{align}
\frac{1}{T} \sum_{t=1}^T \mathbb{E} \left[ || \nabla J(\theta_t) ||^2 \right] \leq \widetilde{O} \left( \max_{t \in [T]} \tau_{mix}^{\theta_t} \log T_{\max} \right) \cdot O\left( \frac{1}{\sqrt{T}} \right) + \widetilde{O} \left( \sqrt{\max_{t \in [T]} \tau_{mix}^{\theta_t} \frac{\log T_{\max}}{T_{\max}}} \right) + O\left( \mathcal{E}_{app} \right).
\end{align}
We believe that this bound is what the reviewer is asking for. The statements of Theorems 4.6, 4.7, and 4.8 in the main body are simplified versions of the results given in Theorems C.4, D.1, D.5, and Section E. We will make this connection clear in the revised manuscript.
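For intuition on the per-sample view (a back-of-the-envelope conversion rather than a new result): each round of MAC consumes only $O(\log T_{\max})$ samples in expectation, so the total sample count is $T' = \Theta(T \log T_{\max})$ and
\begin{align}
O\left( \frac{1}{\sqrt{T}} \right) = O\left( \sqrt{\frac{\log T_{\max}}{T'}} \right),
\end{align}
which exhibits the $\sqrt{\log T_{\max}} / \sqrt{T'}$ scaling the reviewer anticipates for the first term.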
> **Question 5:** What if $T$ is not known ahead of time? Some of the previous works don't require the knowledge of the time horizon and can run for arbitrary time.
**Response to Question 5:** We respectfully note that our algorithm and analysis hold for arbitrary $T$; knowledge of the time horizon is not required.
> **Question 6:** For the Adagrad-like learning rate, if you solely apply this to [Wu et al. 2020], can their result be further improved to $O(\epsilon^{-2})$?
**Response to Question 6:** This may be possible, but the resulting analysis would remain tied to the exponentially fast mixing assumption (Assumption 4.2 in [Wu et al. 2020]). Avoiding the need for this assumption is our primary motivation for using MLMC.
> **Minor questions or typos:**
> * At line 92, "which is alleviates"
> * At Line 215, it is still not clear to me what is the exact form of $f_t^{MLMC}$ and $g_t^{MLMC}$. Can you write them down explicitly?
**Response to minor questions or typos:** Thanks for spotting the typo on line 92; we will fix it. We will include explicit definitions of $f_t^{MLMC}$ and $g_t^{MLMC}$ in the final version. For quick reference, here are their explicit expressions:
\begin{equation}
f_t^{MLMC} = f_t^0 +
\begin{cases}
2^{J_t}(f_t^{J_t} - f_t^{J_t - 1}),& \text{if } 2^{J_t}\leq T_{\max}\\
0, & \text{otherwise}
\end{cases}
\end{equation}
with $f_t^j = \frac{1}{2^j}\sum_{i=1}^{2^j} f(\eta_t; s_t^i, a_t^i) = \frac{1}{2^j}\sum_{i=1}^{2^j} \left( r(s_t^i, a_t^i) - \eta_t \right)$.
\begin{equation}
g_t^{MLMC} = g_t^0 +
\begin{cases}
2^{J_t}(g_t^{J_t} - g_t^{J_t - 1}),& \text{if } 2^{J_t}\leq T_{\max}\\
0, & \text{otherwise}
\end{cases}
\end{equation}
with $g_t^j = \frac{1}{2^j}\sum_{i=1}^{2^j} g(\eta_t; s_t^i, a_t^i) = \frac{1}{2^j}\sum_{i=1}^{2^j} \left( r(s_t^i,a_t^i)-\eta_t+ \langle\phi(s_{t+1}^i)-\phi(s_{t}^i),\omega_t\rangle \right) \phi(s_t^i)$, where $\phi$ is the feature map described at line 191 and $\omega_t$ is the critic parameter vector.
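For additional concreteness, here is a minimal Python sketch of $g_t^{MLMC}$ under the definitions above (the `rollout` and `phi` interface is our illustrative assumption, not the paper's implementation):

```python
import numpy as np

def g_mlmc(eta, omega, phi, rollout, J_t, T_max):
    """Sketch of the critic MLMC estimator g_t^{MLMC} defined above.

    rollout: transitions (s, a, r, s_next) of length 2^{min(J_t, j_max)}
    taken from the single running trajectory; phi: feature map s -> R^d.
    """
    def g_avg(n):  # g_t^j with 2^j = n: average of per-sample TD terms
        terms = [(r - eta + (phi(s_next) - phi(s)) @ omega) * phi(s)
                 for (s, _a, r, s_next) in rollout[:n]]
        return np.mean(terms, axis=0)

    g = g_avg(1)                      # g_t^0
    if 2 ** J_t <= T_max:             # cap from equation (13)
        g += (2 ** J_t) * (g_avg(2 ** J_t) - g_avg(2 ** (J_t - 1)))
    return g
```

($f_t^{MLMC}$ is analogous, with the per-sample term $r - \eta_t$ in place of the TD term.)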
## Response to Reviewer bQie [Score 5, confidence 4]
We would like to express our gratitude to the reviewer for dedicating their valuable time and effort to reviewing our manuscript and helping us to strengthen it. We deeply appreciate the feedback provided, and we offer thorough responses to all the reviewer's questions below.
> **Question 1:** Given that the analysis in section 4.4 breaks free of the dependence on T_max in establishing the main theoretical results of this paper, why does the approach and experiments take T_max as a hyperparameter than must be set by the algorithm designer? How does setting T_max relate to setting an upper bound on the mixing time as discussed in "Unknown mixing times in apprenticeship and reinforcement learning" by Zahavy et al.? Is setting T_max really an improvement over providing knowledge of the mixing time to the algorithm given how related this quantity is?
**Response to Question 1:** This is an excellent point, and we thank the reviewer for raising it. The ***short answer*** is: as in [Dorfman & Levy, 2022], we can let $T_{max} = T$ grow as $T$. We briefly touched on this in the Remark following Proposition 4.5 around line 290. It is important to note that Theorems 4.6, 4.7, and 4.8 (and their detailed analogues in the appendix, which retain additional information on the effects of $T_{max}$ and problem-dependent constants) all still hold with $T_{max}$ replaced by $T$. When $T_{max} = T$, the assumption $T_{max} \geq \tau_{mix}^{\theta_t}$ is no longer required at each $t$, making our results oblivious to mixing time. We will make this crystal clear in the revision.
***Longer answer:*** We note that setting $T_{max} = T$ increases sample complexity by a factor of $O(\log T)$ (this is already reflected in the result bounds) due to the geometric sampling $J_t \sim \text{Geom}(1/2)$ in equation (13), but allows us to be completely oblivious to mixing time. In practice, of course, we cannot let $T_{max}$ grow indefinitely. This is our motivation for considering fixed $T_{max}$. Our results (especially Theorems C.4, D.1, D.5, and Section E in the appendix) explicitly quantify the complexity implications of using specific values of $T_{max}$. In our experiments we simply chose $T_{max}$ through trial and error, but an interesting idea is to choose $T_{max}$ adaptively by estimating an upper bound on the mixing times encountered during training. The introduction of [Zahavy et al., 2020] points to useful techniques for estimating such upper bounds -- we thank the reviewer for bringing this work to our attention, and we will mention this promising direction for future work in the revision.
> **Question 2:** In the caption of table 1 the authors write "To our knowledge, this is the first AC algorithm with an explicit optimal dependence on the underlying mixing time" and then in the conclusion they write "As a limitation, our current dependence on mixing time is not the sharpest possible". These comments seem to be contradictory. Can you further clarify the optimality of your dependence on the mixing time? I was looking at the paper "Towards Tight Bounds on the Sample Complexity of Average-reward MDPs" by Jin and Sidford (2021), which seems to imply that their is a lower bound on sample complexity that is linear in the mixing time, but I suppose it is not known whether or not this lower bound is tight or whether the squared dependence is unavoidable?
**Response to Question 2:** We thank the reviewer for pointing this out and concede that our statement "this is the first AC algorithm with an explicit optimal dependence on the underlying mixing time" is imprecise. We do not mean to claim that our dependence on mixing time is the best possible; we believe that whether the squared dependence is improvable remains an open question. Rather, our dependence improves upon the most comparable prior results.
***Our claim:*** What we do claim is that, to the best of our knowledge, our work provides the first AC algorithm that simultaneously exhibits explicit dependence on mixing time and does not require the exponentially fast mixing assumption. We will update the Table 1 caption accordingly in the revised version of our manuscript.
> **Question 3:** I know that the MLMC approach is previously considered by Dorfman and Levy, but could you explain a bit about the motivation for what happens i.e. the sampling from the geometric distribution and subsequent aggregation? I think readers would be interested to know how important each step is for the theory to work out and what pieces could potentially be done differently.
**Response to Question 3:** An initial motivation for using geometric sampling in the MLMC estimator (i.e., $J_t \sim \text{Geom}(1/2)$ in equation (18)) is that it allows us to obtain an unbiased gradient estimate averaged over $O(T_{max})$ samples using only $O(\log T_{max})$ samples, in expectation. This is reflected in equation (19) from Proposition 4.5 in our paper, which says $\mathbb{E}_{t-1} \left[ l_t^{MLMC} \right] = \mathbb{E}_{t-1} \left[ l_t^{j_{\max}} \right]$, for general MLMC estimators $l^{MLMC}_t$, and also in (30) in the appendix. The reason that only an expected $O(\log T_{max})$ samples are required is that, though $J_t \sim \text{Geom}(1/2)$, the maximum value $j$ is allowed to take in the MLMC estimator (13) is $j_{max} = \lfloor \log T_{max} \rfloor$, implying that the expected number of samples used is actually $\mathcal{O}\Big(\sum_{j=1}^{j_{max}} P(J_t = j) 2^j\Big) = \mathcal{O}(\log T_{max})$.
Importantly for our case, the structure of the MLMC estimator also allows us to quantify the effect of mixing time on its squared norm, as in equation (20), which gives the bound $\mathbb{E} \left[ || l_t^{MLMC} ||^2 \right] \leq \widetilde{O}( G_L^2 \tau_{mix}^{\theta_t} \log T_{\max} )$ (also see Lemma B.3 in the appendix). Using these features of our MLMC estimators, we can accommodate Markovian sampling in our actor-critic error analysis (cf. Lemmas C.2 and D.4 in the appendix) without resorting to the exponentially fast mixing assumption. This is a critical innovation in our analysis, as mentioned in the last paragraph of section 4.1.
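For completeness, the unbiasedness in equation (19) follows from a short telescoping calculation standard for MLMC estimators: writing $\bar{l}^{\,j} := \mathbb{E}_{t-1}[l_t^j]$ and using $P(J_t = j) = 2^{-j}$,
\begin{align}
\mathbb{E}_{t-1}\left[ l_t^{MLMC} \right] = \bar{l}^{\,0} + \sum_{j=1}^{j_{max}} 2^{-j} \cdot 2^{j} \left( \bar{l}^{\,j} - \bar{l}^{\,j-1} \right) = \bar{l}^{\,0} + \left( \bar{l}^{\,j_{max}} - \bar{l}^{\,0} \right) = \bar{l}^{\,j_{max}}.
\end{align}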
> **Question 4:** The authors write "As previously mentioned, the stability of vanilla average reward actor-critic can only be ensured under the exponentially fast mixing condition, which can preclude cases with reward sparsity or large state spaces." Can you explain why this is? The connection with reward sparsity and large state spaces and mixing times seems a bit indirect. Why not consider settings with high lower bounds on the mixing time as in Riemer et al.? Or maybe in your empirical settings you can at least measure the mixing times to provide more transparency about where gains are coming from?
**Response to Question 4:** Regarding the connection between large state spaces and mixing times, the intuition is that mixing time should grow as the size of the state space increases for the simple reason that there is more state space to explore. Section 7.1 in [Levin & Peres, 2017] gives results and examples showing that mixing time can be lower bounded by increasing functions of the state space size, while [Riemer et al., 2021] show that mixing times grow polynomially in the state space size for certain classes of problems.
We agree it would be better to include mixing time information for the environments we consider, and we have already begun investigating this for our gridworld environments, as well as the Frozen Lake and Taxi OpenAI gym environments. We can include additional experiments investigating MAC's gains over vanilla AC in the final version.
> **Question 5:** Additionally during your experiments for the baseline actor-critic wouldn't it be more fair to compare to adagrad as well given how it is apparently key to your analysis? Also are the experiments in linear function approximation as in the theory? Can you comment on how these insights would hold up in the non-linear neural network setting?
**Response to Question 5:** We agree that adding comparisons of MAC with vanilla AC equipped with AdaGrad is a good idea. We can include this in the final version. Since the primary purpose of our experiments was to provide a proof of concept and support our theoretical results, we used linear function approximation for the critic and a softmax policy parameterization for the actor.
***Non-linear neural network setting:*** We remark that our work does not explicitly address the well-known gap in the actor-critic literature between the linear critic setting (which is theoretically tractable) and the highly nonlinear neural network approximations used in practice. Nonetheless, as long as the critic that we use provides good value function approximations -- i.e., the nonlinear analogue of the worst-case approximation error $\mathcal{E}_{app}$ in equation (23) is small -- and enjoys basic continuity properties, the benefits that MAC enjoys in slow-mixing settings will be independent of our choice of approximation architecture. This is because the core of MLMC is the sampling procedure and estimator, not the specific properties of the function estimated.
## Response to Reviewer VQNT [Score 6, confidence 2]
We thank the reviewer for their valuable time in assessing our manuscript and for recognizing its novel contributions. We are grateful for the valuable input offered, and we address the reviewer's queries in the responses below.
> **Question 1:** Could the authors provide some further discussions about the assumption on linear function approximation? One point the authors make is that many RL environments can exhibit a slower than exponential mixing rate due to high dimensionality for example. If there doesn't exist a good linear approximation, the final errors could be large anyway. On the other hand, suppose there is a good linear approximation, could this make the problem "effectively low-dimensional" (and perhaps also help with mixing?) I guess the assumption here is typical for a theoretical analysis. A detailed analysis is not needed but I would appreciate some general comments.
**Response to Question 1:** We assume linear function approximation for the critic. This helps to reduce the dimensionality of the critic parameter.
Regarding the comment on slower mixing due to high dimensionality, we meant that for larger state and action spaces, mixing times tend to be longer, as detailed in [Riemer et al., 2023]. Even with a good linear approximation, which would help us estimate the critic parameter, the mixing time would still be defined in terms of the original state space. In other words, the linear function approximation applies to the critic step and does not affect the mixing time definition given in Definition 2.1 on page 3 of our paper.
> **Additional remarks:** Few typos: line 93, "which is alleviates prior" line203, update in (10)?
**Response to additional remarks:** We thank the reviewer for these comments and will rectify these typos in the final version of our manuscript.