# Reviewer uT14
**Reply to comment on *Originality***:
We agree that the idea of how to obtain zero constraint violation is not new; the technique is well known in the optimization literature [Mahdavi et al., 2012; Akhtar et al., 2021], and hence it is not proposed but merely utilized in [Liu et al., 2021], just as in our work. The key contribution of this paper, however, lies in the algorithm development and convergence analysis for the specific setting of parametrized constrained reinforcement learning. The proof for the parameterized constrained setup is novel and improves the sample complexity from $\mathcal{O}(1/\epsilon^6)$ to $\mathcal{O}(1/\epsilon^4)$. So, even without the zero-constraint-violation result, our sample complexity result for CMDPs significantly **improves the state of the art**, which we believe is an important contribution. We further contrast our contributions with the other mentioned works as follows.
(1) Our paper considers an infinite-horizon discounted-reward setup, and we propose a model-free policy-based algorithm. In contrast, Liu et al. [2021] propose a model-based algorithm for the episodic setting. Thus, **their results/approach/proof do not carry over to our setting**. Furthermore, the zero-constraint-violation part of both works utilizes the idea from [Mahdavi et al., 2012], and thus neither follows from the other.
(2) To bound the gap between the optimal value functions of the original and conservative problems, we convert the problem into the dual domain and then bound the gap there, whereas Liu et al. [2021] use a mixed policy to bound the gap in the primal domain only. The two benefits of the dual domain are: (i) the value function is linear w.r.t. the occupancy measure, and (ii) any convex combination of occupancy measures is still a valid occupancy measure (see the sketch after this list). Thus, the **proof technique is significantly different** in how the conservative idea is used to achieve zero constraint violation. Further, we remark that the proof technique of [Liu et al., 2021] cannot be applied to our parameterized setting, because the mixed policy of $\pi^*$ and $\pi^0$ might not correspond to any admissible parametrization, which breaks Lemma C.1 in their paper. Thus, the details of the proof differ substantially from prior works.
(3) We note that [Liu et al., 2021] show an $\mathcal{O}(1)$ violation result, which is not a zero-violation result. Our results provide a proof of exactly zero violation, which is stronger.
(4) Because the value function is linear w.r.t. the occupancy measure, we handle the problem in the dual domain. However, due to the parametrization, strong duality does not hold: Eq. 3.6 in [Altman, 1999] may lead to a policy that cannot be parameterized (see the sketch below). Thus, Assumption 6 is essential to prove that the primal value and the dual value are close (Lemma 5). The proof of this part (Lemma 5), and how it is used to bound the gap between the value functions of the original and conservative problems (Lemma 6), are novel in our work.
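To make points (2) and (4) above concrete, below is a minimal sketch in standard CMDP notation; the symbols ($J_r$, $J_c$, $\nu_\pi$, $\Pi$, $\Pi_\Theta$) are ours for illustration and may differ slightly from the paper's.

```latex
% (i) With the discounted occupancy measure
%       \nu_\pi(s,a) = (1-\gamma) \sum_{t \ge 0} \gamma^t \Pr(s_t = s, a_t = a \mid \pi),
% the value function is linear in \nu_\pi:
\begin{equation*}
  J_r(\pi) \;=\; \frac{1}{1-\gamma} \sum_{s,a} \nu_\pi(s,a)\, r(s,a).
\end{equation*}
% (ii) Occupancy measures are characterized by linear (Bellman-flow) constraints, so any
% convex combination \alpha \nu_{\pi_1} + (1-\alpha)\nu_{\pi_2} is again a valid occupancy
% measure; a mixture taken directly in the parameter space of \pi_\theta has no such guarantee.
%
% For point (4): over the class \Pi of all stochastic policies, strong duality holds
% under Slater's condition (Altman, 1999):
\begin{equation*}
  \max_{\pi \in \Pi} \; \min_{\lambda \ge 0} \; \big[ J_r(\pi) + \lambda J_c(\pi) \big]
  \;=\;
  \min_{\lambda \ge 0} \; \max_{\pi \in \Pi} \; \big[ J_r(\pi) + \lambda J_c(\pi) \big],
\end{equation*}
% but for a parameterized class \Pi_\Theta = \{\pi_\theta\} \subsetneq \Pi, the inner
% maximizer for a fixed \lambda may lie outside \Pi_\Theta, so the equality is replaced by
% a gap; Assumption 6 is what allows Lemma 5 to bound this gap.
```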
Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: Online convex optimization with long term constraints. Journal of Machine Learning Research, 13(1):2503-2528, 2012.
Eitan Altman. Constrained Markov decision processes: stochastic modeling. Routledge, 1999.
Zeeshan Akhtar, Amrit Singh Bedi, and Ketan Rajawat. Conservative stochastic optimization with expectation constraints. IEEE Transactions on Signal Processing, 69:3190–3205, 2021.
**Reply to comment on *Quality*:**
(1) The final bound in our work does not depend on the size of the state space and has a $\mathcal{O}(\log|\mathcal{A}|)$ dependence on the action space, which is the same as in [Ding et al., 2021]. However, due to the zero-violation result, our algorithm has a $\mathcal{O}(1/\varphi^4)$ dependence in the sample complexity, which we will add in the revision.
(2) We used the same parameters as in [Ding et al., 2021] for the algorithms, and thus it is a fair comparison.
(3) The key focus of our work is the theoretical analysis of the NPG-based algorithm for general parameterization with zero constraint violation; the experiments are only intended to validate this point. Thus, we remark that comparing against multiple algorithms would not change the key point of the experiments, which is to show that the proposed approach converges. We used the exact same experimental setup as in [Ding et al., 2021], which is also in the parameterized setting. With the available code, the experiments can be directly run on larger state spaces, but this would not change the final conclusions.
**Reply to comment on *Significance*:**
(1) As mentioned above, the main contribution of this paper is showing how to achieve zero constraint violation under general parameterization with NPG, which required significant changes to the proofs. Further, Lemma 5 and Lemma 6 are novel contributions of this work.
(2) We agree that we adopt part of the analysis from [Liu et al., 2020] to improve the dependence from $\mathcal{O}(1/\epsilon^6)$ to $\mathcal{O}(1/\epsilon^4)$ (see contribution point 2). However, there are still differences compared to [Liu et al., 2020] due to the presence of constraints; some of these differences appear in Appendix C in the first-order stationarity proof. Finally, in the absence of constraints, our analysis recovers the same $\mathcal{O}(1/\epsilon^3)$ result as [Liu et al., 2020].
(3) & (4) We would like to emphasize that this is still the first result that gives an $\mathcal{O}(1/\epsilon^4)$ bound for NPG with general parametrization. This result improves the state of the art for parametrized CMDPs, and thus we believe it is an important contribution to the literature. The problem has been studied before, but careful analysis is needed to combine the existing approaches into a complete state-of-the-art result.
**Reply to comment on *Clarity*:**
(1) This is a typo; the output should be $\theta^k$, $k\in[0,K-1]$.
(2) There is a typo in Assumption 4: $\sigma_\lambda$ should be $\Lambda$.
(3) To better introduce the two items, we will move the statement of Lemma 10 to the main text in the revision.
(4) Thanks for the suggestion, we will make the change accordingly in the revision.
Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance reduced) policy gradient and natural policy gradient methods. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 7624–7636. Curran Associates, Inc., 2020.
# Reviewer PGZh
**Reply to *Weaknesses*:**
(1) We agree that the Slater condition and a uniformly bounded dual variable may not hold in practice. However, such assumptions are necessary and common in theoretical papers on this topic. Please refer to Remark 2 for more intuition on the bounded Lagrange multiplier; a short sketch of the standard argument is given after this list.
(2) $\epsilon$ is the accuracy we would like the algorithm to achieve. In practice, we can choose it to be, e.g., $0.1$ or $0.01$. Once $\epsilon$ is fixed, the selection of $K$ and $\kappa$ is given in Theorem 1. Because $\kappa$ decreases as $K$ increases, we can increase $K$ to achieve better accuracy.
(3) We can show that $K$ needs to have an $\mathcal{O}(1/\varphi^4)$ dependence. Such a high dependence comes from the dependence of the sample complexity on $K$, which results from the general parametrization. We will explicitly mention this in the final version.
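For additional intuition on item (1) above, the following is the textbook argument (for the unparameterized CMDP, in our own illustrative notation, with Slater slack $\varphi$ and rewards in $[0,1]$) for why Slater's condition yields a bounded optimal dual variable:

```latex
% Let \bar\pi be a Slater policy with J_c(\bar\pi) \ge \varphi > 0, and let \lambda^* \ge 0
% be an optimal dual variable of  max_\pi J_r(\pi)  s.t.  J_c(\pi) \ge 0.  Then
\begin{equation*}
  d(\lambda^*) \;=\; \max_{\pi} \big[ J_r(\pi) + \lambda^* J_c(\pi) \big]
  \;\ge\; J_r(\bar\pi) + \lambda^* \varphi ,
\end{equation*}
% and strong duality for the unparameterized CMDP gives d(\lambda^*) = J_r(\pi^*), so
\begin{equation*}
  \lambda^* \;\le\; \frac{J_r(\pi^*) - J_r(\bar\pi)}{\varphi}
  \;\le\; \frac{1}{(1-\gamma)\,\varphi}.
\end{equation*}
```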
**Reply to *Questions*:**
(1) & (2) We apologize for the confusion. Our algorithm does not need a generative model. The table and the statement of Algorithm 2 have an error, which we will rectify in the revision. Note that in the algorithm there is no call to any generative model; we only need to sample multiple trajectories from the CMDP, which does not require a generative model, and thus it is a fair comparison to CRPO. We will show the dependence on $1-\gamma$ in the revision and will add more discussion of the assumptions in the footnote.
# Reviewer vwUt
**Reply to Weakness 1 and Question 1:** We agree that the conservative idea for zero constraint violation is well known in optimization [Mahdavi et al., 2012]. Thus, the idea is not new to [Liu et al., 2021]; it has been in the literature and is used in many recent works. Despite the idea of zero constraint violation existing in the literature, the key contribution of this paper is the algorithm development and the proof for the specific setting of parametrized constrained reinforcement learning. The proof modifications required by the parameterized setup are novel, and they improve the sample complexity to the new state of the art for CMDPs. Even without the zero-constraint-violation result, the sample complexity for CMDPs significantly improves the state-of-the-art results, and we believe this is a significant contribution to the literature on CMDPs in the parameterized setup. For more detailed comparisons with the state of the art, please see our comments on originality in the reply to Reviewer uT14. Bounding the primal-dual gap (Lemma 5) and using it to bound the gap between the value functions of the original and conservative problems (Lemma 6) are novel in our work. A brief sketch of the conservative-constraint idea is given below.
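To make the conservative-constraint idea concrete, here is a minimal sketch; the notation and the choice of the tightening margin are illustrative rather than the paper's exact construction.

```latex
% Tighten the constraint of the original problem by a margin \kappa > 0:
\begin{align*}
  \text{original:}\quad     & \max_{\theta} \; J_r(\pi_\theta) \quad \text{s.t.} \quad J_c(\pi_\theta) \ge 0, \\
  \text{conservative:}\quad & \max_{\theta} \; J_r(\pi_\theta) \quad \text{s.t.} \quad J_c(\pi_\theta) \ge \kappa .
\end{align*}
% If the algorithm solves the conservative problem with constraint error at most \kappa
% (e.g., \kappa chosen on the order of the optimization/statistical error), then
% J_c(\hat\pi_\theta) \ge \kappa - \kappa = 0, i.e., zero constraint violation, while the
% reward loss caused by the tightening is what Lemmas 5 and 6 bound in the dual domain.
```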
**Reply to Weakness 2 and Question 2:** Assumption 6 is needed only for the zero-violation analysis. If we remove this assumption, we still obtain $\mathcal{O}(1/\epsilon^4)$ sample complexity and $\mathcal{O}(\epsilon)$ violation, so the results remain better than the current state of the art. We will mention this in the revision. Based on this, we note that Assumption 6 does not affect the fairness of the comparison.
**Regarding Assumption 4**, this is needed to bound the two terms in Lemma 10, Equation (69). Although [Ding et al., 2021] do not use this assumption, they make a different assumption (their Assumption 4) to alleviate the problem of bounding $\|\omega^t\|$. As for the CRPO paper, since it uses a primal method and introduces a two-layer neural network, it is hard to compare under the same set of assumptions. Moreover, CRPO also makes its Assumptions 1-4, which our paper does not use.