# Risk-POMDP Rebuttal (Draft)
# Second Round of Rebuttal
### Reviewer's Opinion ###
$\textbf{Rating: 4/10, Borderline reject. Confidence: 3/5}$
$\textbf{Soundness: 3/4, good}$
$\textbf{Presentation: 3/4, good}$
$\textbf{Contribution: 3/4, good}$
Comment:
Thank you for your comprehensive reply. I will raise my score accordingly after the following concerns (related to your rebuttal) are addressed.
W1: The entropy risk measure definitely makes sense to me, but I believe the readability can be improved significantly if you provide a more detailed explanation around its definition (e.g. the explanation and the connection mentioned in the rebuttal).
W2: To clarify, my concern here is that hindsight observability is a comparably stronger assumption than the weakly revealing condition. For example, a POMDP model with observability can be regarded into a weakly revealing POMDP model (see [6, Section 5.3], which defines revealing POMDPs with a few known core action sequences), by including the hindsight observation into the last-step observation. Of course, such a conversion can result in suboptimal regret, but the point is that hindsight can be a strong assumption theoretically; it can also be fairly strong practically, if the latent state corresponds to sensitive personal information.
Furthermore, I believe the remark "their regret bounds become vacuous when the assumptions are not satisfied. To make our formulation more pragmatic, we do not follow this direction" is very unfair. Basically, any regret bound can be vacuous when the structural condition of the algorithm is violated (e.g. an algorithm for MDP is deemed to fail when the environment is non-Markov).
W4: The reason why I mention the adaptability of existing algorithm framework here is that, it is strongly related to the theoretical significance of the results of this paper. If the previous results can be immediately applied, then it is probably harder for the bounds in this paper to be interesting. For example, it seems to me that for risk-sensitive revealing POMDP, the analysis in [2, Appendix F, G] [5, Appendix E] only need to be slightly modified (using your linearization lemma, Lemma F.14). Therefore, I guess it is more suitable to briefly discuss whether these general frameworks can be applied, or whether they may provide unsatisfactory results.
[2] Liu, Q., Chung, A., Szepesvari, C., and Jin, C. When is partially observable reinforcement learning not scary? In Conference on Learning Theory, pp. 5175–5220. PMLR, 2022.
[5] Fan Chen, Yu Bai and Song Mei. Partially observable RL with B-stability: Unified structural condition and sharp sample efficient algorithms. arXiv preprint arXiv:2209.14990, 2022.
[6] Liu, Q., Netrapalli, P., Szepesvari, C., and Jin, C. Optimistic MLE: A generic model-based algorithm for partially observable sequential decision making. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pp. 363–376, 2023.
### Rebuttal ###
Thank you for your prompt reply! We will address your remaining concerns in the following discussion.
> W1: The entropy risk measure definitely makes sense to me, but I believe the readability can be improved significantly if you provide a more detailed explanation around its definition (e.g. the explanation and the connection mentioned in the rebuttal).
We appreciate your suggestion to improve the readability. During our revision, we will follow your advice and supplement the main text with the detailed explanation mentioned in our rebuttal.
> W2: To clarify, my concern here is that hindsight observability is a comparably stronger assumption than the weakly revealing condition. For example, a POMDP model with observability can be regarded into a weakly revealing POMDP model (see [6, Section 5.3], which defines revealing POMDPs with a few known core action sequences), by including the hindsight observation into the last-step observation. Of course, such a conversion can result in suboptimal regret, but the point is that hindsight can be a strong assumption theoretically; it can also be fairly strong practically, if the latent state corresponds to sensitive personal information.
> Furthermore, I believe the remark "their regret bounds become vacuous when the assumptions are not satisfied. To make our formulation more pragmatic, we do not follow this direction" is very unfair. Basically, any regret bound can be vacuous when the structural condition of the algorithm is violated (e.g. an algorithm for MDP is deemed to fail when the environment is non-Markov).
Thank you for the insights regarding hindsight observability. We have focused on this setting due to its direct relevance to various real-world scenarios, as discussed in the third paragraph of Page 3, in Section E (Page 31), and in our prior response.
We understand your concern about the strength of the hindsight observability assumption. In our revision, we will provide a thorough comparison with other structural assumptions. At the same time, adapting current algorithms for revealing POMDPs to risk-sensitive environments presents significant challenges, as we discuss below. We believe that our BVVI algorithm, tailored to this novel setup, represents solid progress in extending existing POMDP solutions to the risk-sensitive setting.
We have also taken your feedback on our remark about regret bounds to heart. In our revision, we will replace it with a more precise discussion of how regret guarantees depend on the underlying structural conditions. We are grateful for your valuable response, which significantly helps us improve our presentation.
> W4: The reason why I mention the adaptability of existing algorithm framework here is that, it is strongly related to the theoretical significance of the results of this paper. If the previous results can be immediately applied, then it is probably harder for the bounds in this paper to be interesting. For example, it seems to me that for risk-sensitive revealing POMDP, the analysis in [2, Appendix F, G] [5, Appendix E] only need to be slightly modified (using your linearization lemma, Lemma F.14). Therefore, I guess it is more suitable to briefly discuss whether these general frameworks can be applied, or whether they may provide unsatisfactory results.
We appreciate your inquiry into the adaptability of existing algorithm frameworks. While the algorithms in [2] and [5] are indeed pioneering works for risk-neutral POMDPs, it is not straightforward to directly apply these algorithms to the risk-sensitive setting. Here, we highlight the challenges in a direct adaptation and explain what sets our approach apart.
1. **Formulation**: The OMLE-based analyses in [2] and [5] are predicated on the assumption that the emission matrices have a minimum positive singular value $\alpha$. Our work diverges by not imposing such structural constraints on the probability kernels. This difference in formulation allows our regret bound to remain independent of $\alpha$ and enlarges the scope of applicability to instances where the emission kernels lack full column rank.
2. **Planning**: The OMLE framework relies on an optimization oracle $\pi^k = \arg\max_{\theta,\pi}V_{\theta}^{\pi}$ to solve the planning task, which is tailored to risk-neutral objectives. However, previous studies [1][3][4] suggest that risk-neutral planners cannot be directly transferred to risk-sensitive settings (see the toy sketch after this list). Our algorithm BVVI, detailed on Page 4, instead implements a computationally tractable risk-sensitive planner (Lines 5-17 of Algorithm 1) in place of this optimization oracle.
3. **Learning**: In OMLE, learning involves performing maximum likelihood estimation over a large confidence set, which can be computationally prohibitive and hard to implement (as also discussed in [7][8]). In contrast, our approach employs a tractable sample-mean estimator to learn the environment dynamics, resulting in the first practical algorithm specifically designed for risk-sensitive POMDPs. This shift in learning methodology also necessitates a distinct suite of proof techniques, which we elaborate on in Section D (Page 23).
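To make point 2 above concrete, below is a minimal toy sketch written by us for this response (it is *not* BVVI, and the toy MDP and all names are hypothetical) of a standard entropic-risk Bellman backup on a small, fully observed tabular MDP, contrasted with the risk-neutral backup. Even in this simplest case, the backup replaces the plain expectation with an exponential certainty equivalent, which is why a risk-neutral planning oracle cannot be reused verbatim.

```python
import numpy as np

def value_iteration(P, r, gamma_risk=0.0):
    """Finite-horizon value iteration on a tabular MDP.

    P[h, s, a, s'] : transition kernel,  r[h, s, a] : reward in [0, 1].
    gamma_risk = 0 gives the usual risk-neutral backup; otherwise we use an
    entropic-risk (exponential) Bellman backup of the form
        V_h(s) = max_a { r_h(s, a) + (1/gamma) log E_{s'}[ exp(gamma * V_{h+1}(s')) ] }.
    """
    H, S, A, _ = P.shape
    V = np.zeros(S)
    for h in reversed(range(H)):
        if gamma_risk == 0.0:
            cont = P[h] @ V                                             # plain expectation
        else:
            cont = np.log(P[h] @ np.exp(gamma_risk * V)) / gamma_risk   # certainty equivalent
        V = (r[h] + cont).max(axis=1)
    return V

# Hypothetical 2-state, 2-action toy MDP: action 1 is a risky bet, action 0 is safer.
H, S, A = 3, 2, 2
P = np.zeros((H, S, A, S))
P[:, :, 0, :] = [0.9, 0.1]   # action 0: reach the rewarding state 1 w.p. 0.1
P[:, :, 1, :] = [0.5, 0.5]   # action 1: reach the rewarding state 1 w.p. 0.5
r = np.zeros((H, S, A))
r[:, 1, :] = 1.0             # reward 1 for being in state 1

for g in (0.0, -5.0, +5.0):
    print(f"gamma = {g:+.1f}:  initial values = {value_iteration(P, r, g)}")
```

Running this shows that the risk-neutral and risk-sensitive value functions (and, in general, the greedy policies) already differ in a two-state MDP; in the partially observed case, the backup must additionally be carried out over risk-sensitive beliefs, which is where the $\beta$-vectors in our submission are used.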
We acknowledge the importance of a thorough comparative analysis of various POMDP approaches. In our revised manuscript, we will elaborate on the limitations of extending existing general frameworks to the risk-sensitive domain. We appreciate your thoughtful comments, which have played a crucial role in refining our work and will continue to guide our revisions.
We hope our response addresses your remaining concerns; if so, we would appreciate it if you could kindly consider raising your score. Should you have any further questions, we are very willing to provide further assistance. Thank you very much for the support!
<!--
6. **Technique**: It may be possible to integrate the linearization lemma into the OMLE framework. However, a simple adaptation could lead to largely suboptimal results. For example, derivations in the risk-neutral setting that rely on crude bounds (e.g., Eq. (27) in [5]) need serious revision to optimize the dependency on $e^{\gamma H}$; without careful treatment, a naive adaptation could easily lead to exponentially looser bounds [8]. In our work, apart from the linearization lemmas, various other techniques (e.g., Lemma F.5, Lemma F.10) work in concert in the non-linear analysis of Section D.21. These technical considerations ensure that our regret is optimal with respect to the factor $\gamma H$, so that the upper bound not only accommodates all $\gamma\in \mathbb{R}\setminus \{0\}$ but also improves upon precedent results when degenerating to the risk-neutral setting as $\gamma \to 0$.
-->
<!-- 1. Representation methods for risk-neutral POMDPs may NOT be suitable in the risk-sensitive setting. For example, it has been settled that the belief-state representation of POMDPs cannot faithfully and sufficiently capture the risk awareness of the agent (see the discussion on page 28 and [1][2][3]), so we adopted risk-sensitive beliefs to fill this gap. Similarly, OMLE is built upon a closely related representation method, the linear PSR. It remains an open problem whether it also fails, like the beliefs, in the face of a non-linear learning goal.
2. Fundamental results in POMDP studies need heavy revision in the risk-sensitive setup. The value functions and Bellman equations are defined in a completely different way. We also need various techniques other than the linearization lemma to handle non-linearity. For example, we cannot directly convert the regret to a probability error as in Eq. (27) of [5]. Naively adopting our linearization lemma with such a crude bound would only yield largely suboptimal regret. Moreover, such a naive adaptation cannot quantitatively reveal how the risk level affects the regret, as we did. We discussed these issues in Section 2 (page 2), Appendix B (page 12), and Remark B.28 (page 21).
3. The proofs in the OMLE paper fail to reflect the distribution twist characteristic of risk-sensitive rewards. [!!!!!!!!!!! NOT FINISHED]
4. in the risk-neutral setting. However, since even the dynamic programming equations are very different in the risk-sensitive POMDP setup (see [1][2][3]), we seriously doubt whether this algorithm is still applicable. Our contribution is that we analyzed this issue in detail and derived the correct planner through lines 4-30 on page 25. Moreover, OMLE does not discuss the existence and properties of the optimal policy in the risk-sensitive setup, while we provide a theoretical investigation in Theorem B.23 (page 22). More importantly, our planner plans in a single model, without the $\max_{\widehat{\theta}}$ step, and thus our proof does not rely on the realizability assumption of OMLE.
-->
[1] Baras, J. S. and James, M. R. Robust and risk-sensitive output feedback control for finite state machines and hidden Markov models. J. Math. Systems, Estimation & Control, 7:371–374, 1997. URL http://spigot.anu.edu.au/people/mat/home.html.
[2] Liu, Q., Chung, A., Szepesvari, C., and Jin, C. When is partially observable reinforcement learning not scary? In Conference on Learning Theory, pp. 5175–5220. PMLR, 2022.
[2] Bauerle, N. and Rieder, U. More risk-sensitive Markov decision processes. Mathematics of Operations Research, 39(1):105–120, 2014.
[3] Cavazos-Cadena, R. and Hernández-Hernández, D. Successive approximations in partially observable controlled Markov chains with risk-sensitive average criterion. Stochastics: An International Journal of Probability and Stochastic Processes, 77(6):537–568, 2005.
[4] James, M. R., Baras, J. S., and Elliott, R. J. Risk-sensitive control and dynamic games for partially observed discrete-time nonlinear systems. IEEE Transactions on Automatic Control, 39:780–792, 1994. ISSN 1558-2523. doi: 10.1109/9.286253.
[5] Fan Chen, Yu Bai and Song Mei. Partially observable RL with B-stability: Unified structural condition and sharp sample efficient algorithms. arXiv preprint arXiv:2209.14990, 2022.
[6] Liu, Q., Netrapalli, P., Szepesvari, C., and Jin, C. Optimistic MLE: A generic model-based algorithm for partially observable sequential decision making. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pp. 363–376, 2023.
[7] Golowich, N., Moitra, A., and Rohatgi, D. Learning in observable pomdps, without computationally intractable oracles. Advances in Neural Information Processing Systems, 6 2022b. URL http://arxiv.org/abs/2206.03446.
[8] Zhang, H., Ren, T., Xiao, C., Schuurmans, D., and Dai, B. Efficient reinforcement learning from partial observability. arXiv preprint arXiv:2311.12244, 2024.
<!--
[7] Zhang, H., Ren, T., Xiao, C., Schuurmans, D., & Dai, B. (2023). Provable Representation with Efficient Planning for Partially Observable Reinforcement Learning. arXiv preprint arXiv:2311.12244.
[8] Fei, Y., Yang, Z., Chen, Y., and Wang, Z. Exponential Bellman equation and improved regret bounds for risksensitive reinforcement learning. Advances in Neural Information Processing Systems, 34:20436–20446, 2021a -->
# First Round of Rebuttal
## Reviewer o2PN
$\textbf{Rating: 7/10 Accept. Confidence: 3/5}$
$\textbf{Soundness: 3/4}$
$\textbf{Presentation: 3/4}$
$\textbf{Contribution: 3/4}$
### Review opinion
**Strengths And Weaknesses:**
The paper is clear and well-written. The authors clearly explain how their work extends the risk-neutral setting and show how it is possible to resort to it starting from the risk-sensitive scenario. Furthermore, they explain the need for beta vector in the risk-sensitive scenario, which shares similar ideas with the more common alpha vector that is used in the risk-neutral POMDP settings and highlight their differences with it.
I do not see any specific weakness in the presented work.
**Questions**
The authors show that the result improves over the one of Lee et al.2023 that concerns the risk-neutral setting. Since their regret approaches the lower bound but does not reach it, I was wondering whether the authors have any intuition about this aspect. In particular, do the authors think that the result can be improved? If the answer is yes, do they have any idea on how to tackle it? I acknowledge that this is a minor point but it would be interesting to discuss this aspect.
**Limitations:**
The authors addressed the limitations of the work.
### **Rebuttal**
Thank you for your thorough review and encouraging feedback!
> Question: The authors show that the result improves over the one of Lee et al.2023 that concerns the risk-neutral setting. Since their regret approaches the lower bound but does not reach it, I was wondering whether the authors have any intuition about this aspect. In particular, do the authors think that the result can be improved? If the answer is yes, do they have any idea on how to tackle it? I acknowledge that this is a minor point but it would be interesting to discuss this aspect.
The main advancement lies in the design of a more refined bonus term, as detailed in Eqs. (4) and (5) of our submission. By leveraging the empirical models of the transitions and emissions, Eqs. (60) and (61) establish a tighter concentration bound via Hoeffding inequalities than the one in [1]. Accordingly, we believe there is potential for further refinement if one can tighten Eqs. (37) and (39) in [1] by similarly utilizing the empirical models.
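For intuition (this is only a generic schematic of ours, not the exact expressions in Eqs. (4)-(5) or (60)-(61), and the notation $b_h^k$, $N_h^k$ is generic rather than the paper's), Hoeffding-style bonuses built from empirical counts typically take the form
$b_h^k(\cdot) \;\propto\; \text{(range of the estimated quantity)} \times \sqrt{\frac{\log(\text{number of estimated quantities}\cdot K/\delta)}{\max\{1,\ N_h^k(\cdot)\}}},$
where $N_h^k(\cdot)$ denotes a visitation count. A tighter regret bound then follows from shrinking the effective range factor or refining the union bound inside the logarithm, which is the sense in which we believe Eqs. (37) and (39) of [1] could be tightened by using the empirical models.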
References:
[1] Lee, J., Agarwal, A., Dann, C., \& Zhang, T. (2023, July). Learning in pomdps is sample-efficient with hindsight observability. In *International Conference on Machine Learning* (pp. 18733-18773). PMLR.
## Reviewer wTo3 ##
### Reviewer's Opinion
$\textbf{Rating: 5/10 Borderline accept. Confidence: 2/4}$
$\textbf{Soundness: 3/4, good}$
$\textbf{Presentation: 3/4, good}$
$\textbf{Contribution: 2/4, fair}$
**Strengths:**
The paper integrates hindsight observations under the POMDP framework and risk-sensitive reinforcement learning, introducing beta vectors as a novel analytical tool and the unique bonus function for risk-sensitive exploration.
Despite the complexity of the subject matter, the paper is well-structured and articulately written.
The paper presents the first regret analysis in this context, achieving results that reflect the agent's risk awareness and history-dependency.
The theoretical result matches existing upper bounds in degenerate cases.
The introduction of the beta vector looks novel to me.
**Weaknesses:**
The absence of numerical experiments in the paper is a notable weakness. Theoretical proofs of efficiency and regret bounds are important, but empirical evidence demonstrating how the algorithm performs with actual data is equally critical. Without numerical experiments, it's challenging to compare the proposed algorithm's performance against existing methods under real-world or simulated conditions.
The paper's notation, particularly the use of the function "o" in various fonts to represent different objects, is not the most user-friendly. Readers may struggle to keep track of the distinctions between similar-looking symbols, potentially leading to misunderstandings.
**Questions:**
The paper does not provide explicit details on the computational complexity of the proposed algorithm. It might be valuable to briefly discuss this aspect in the paper?
**Limitations:**
The authors adequately addressed the limitations from my point of view.
### Rebuttal ###
We would like to thank the reviewer for the positive assessment of the technical soundness and novelty of the problem we study, and for the detailed comments on the paper's presentation.
**Experiments:** We appreciate your valuable suggestion to include numerical experiments. Consistent with prior theoretical RL studies such as [1], [2], and [3], our focus is on the mathematical foundations of this new problem setup. We plan to incorporate experiments in our revision.
**Notation:** The notation in our article follows the conventions of a line of works on POMDPs, such as [4], [5], and [6]. We will further improve the presentation in our revision, and we thank you for your feedback.
**Computational complexity:** The computational issues are detailed in Remark C.1, Section C (Page 23), and Section E (Page 31) of our Appendix. As noted in Remark C.1, the history-dependency of the POMDP compels the agent to traverse the entire history space during policy evaluation, leading to high computational complexity.
To the best of our knowledge, this computational bottleneck is inevitable for general POMDP solvers [7]. Optimizing both the statistical and computational complexities of general POMDP algorithms is an exciting avenue for future research.
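As a rough back-of-the-envelope count (ours, for illustration only): the number of action-observation histories of length $h$ grows on the order of $(OA)^{h}$, so evaluating a fully history-dependent policy requires processing on the order of $\sum_{h=1}^{H}(OA)^{h-1} = \Theta\big((OA)^{H-1}\big)$ histories, i.e., a cost exponential in the horizon $H$. This is the bottleneck referred to in Remark C.1 and is shared by general POMDP planning [7].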
We are grateful for your constructive suggestions, which significantly guide our improvements. We hope our response addresses your concerns; if so, we would appreciate it if you could kindly consider raising your score. We will also be happy to answer any further questions you may have. Thank you very much!
References:
[1] Jin, C., Allen-Zhu, Z., Bubeck, S., & Jordan, M. I. (2018). Is Q-learning provably efficient?. Advances in neural information processing systems, 31.
[2] Jin, C., Yang, Z., Wang, Z., & Jordan, M. I. (2020, July). Provably efficient reinforcement learning with linear function approximation. In Conference on learning theory (pp. 2137-2143). PMLR.
[3] Shani, L., Efroni, Y., & Mannor, S. (2020, April). Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 04, pp. 5668-5675).
[4] Liu, Q., Chung, A., Szepesvári, C., \& Jin, C. (2022, June). When is partially observable reinforcement learning not scary?. In Conference on Learning Theory (pp. 5175-5220). PMLR.
[5] Cai, Q., Yang, Z., \& Wang, Z. (2022, June). Reinforcement learning from partial observation: Linear function approximation with provable sample efficiency. In International Conference on Machine Learning (pp. 2485-2522). PMLR.
[6] Zhan, W., Uehara, M., Sun, W., \& Lee, J. D. (2022). Pac reinforcement learning for predictive state representations. arXiv preprint arXiv:2207.05738.
[7] Christos H Papadimitriou and John N Tsitsiklis. The complexity of Markov decision processes. Mathematics of operations research, 12(3):441–450, 1987.
## Reviewer oVgC ##
### Reviewer's Opinion ###
$\textbf{Rating: 7/10: Accept. Confidence: 3/5}$
$\textbf{Soundness: 3/4, good}$
$\textbf{Presentation: 2/4, fair}$
$\textbf{Contribution: 3/4, good}$
**Strengths And Weaknesses:**
Originality and Significance:
the theoretical problem setup of risk-sensitive RL in POMDP with hindsight observability is new and interesting given the literature of (i) risk-neutral RL in POMDP with hindsight observability, (ii) risk-sensitive RL with full observations, and (iii) risk-sensitive control with partial observations (known transition/emission).
The authors utilizes the concepts of risk beliefs and conjugate beliefs in risk-sensitive control with partial observations to derive a new set of value functions/Bellman equations for risk-sensitive POMDPs. To simplify the representation of value functions and theoretical analysis, they further propose the beta vector which generalizes the alpha-vector in risk-neutral POMDP literature. On top of that, a UCB-style algorithm is developed with a polynomial regret upper bound showing its sample efficiency. The regret upper bound, when reduced to setups of (i) or (ii), either matches or improves the existing bounds therein.
Quality and Clarity:
The algorithm is well developed and the regret analysis is sound and solid. The comparison with related literature (i) and (ii) (see originality) is also clear. The handling of the risk-sensitive beliefs for the paper's setup is complete and self-content.
Still there are some minor suggestions that I think could help improve the presentation and the readability, see my questions below.
**Questions:**
Even though the authors discuss the regret/sample complexity lower bounds in reduced setups (i) risk-neutral RL in POMDP with hindsight observability and (ii) risk-sensitive RL with full observations, it would be better if a direct lower bound for the new problem setup is derived and presented, which can help make the work on this new setting more self-content. Would it be possible to derive and discuss a regret/sample complexity lower bound for this new setting?
Following Question 1. Since the algorithm uses a simple Hoeffding-style UCB bonus design, the dependence on and seems not tight (just compare with the lower bound for risk-neutral tabular MDPs). Would it be possible to derive tighter bounds by Bernstein-style bonus design? But definitely this should serve as a future work.
The paper introduces the notions of beliefs and the novel beta vectors in Section 7, after the algorithm section which heavily depends on these concepts. Would it be possible to first introduce these notations and then propose the algorithm? I think this makes the paper more readable for those who are not familiar with these concepts and also help to better convey the idea behind your algorithm design.
**Limitations**
See the above Question section.
### Rebuttal ###
Thank you for your thorough review and encouraging feedback!
**Lower Bound:** Due to the novelty and complexity of our problem setting, constructing a matching lower bound remains open. We fully agree with the reviewer on the significance of this issue, and we will try to tackle it in our future study.
**Bernstein-type bonus:** Thank you for the interesting suggestion! We believe the adoption of Bernstein-type bonus could likely improve our results. We plan to investigate it in our future research.
**Notation:** Thank you for your kind suggestions on the order of presentation. We will be sure to improve the presentation following your advice in our revision.
## Reviewer zUwP ##
### Reviewer's Opinion ###
$\textbf{Rating: 4/10, Borderline reject. Confidence: 3/5}$
$\textbf{Soundness: 3/4, good}$
$\textbf{Presentation: 3/4, good}$
$\textbf{Contribution: 3/4, good}$
**Strength:**
The proposed algorithm adopts several novel mechanisms to tackle partial observability, which may worth future investigations.
**Weakness:**
The motivation of investigating risk-sensitive RL is evident, but the formulation of risk-sensitive objective (1) is not well-motivated, e.g. why this formulation captures the essence of risk-sensitivity, and whether there are other formulations.
The structural assumption of the hindsight observation is also not well-motivated, given that there are various structural conditions for sample-efficient learning in POMDP (e.g. revealing condition, decodability, stable PSR, etc.). In particular, the assumption of hindsight observation is much stronger than the revealing condition
Due to the partial observability, both the description and the analysis of Algorithm BVVI are very sophisticated. The paper mentions that for fully-observable RL, Algorithm BVVI degenerates into a simpler algorithm with better guarantees. Perhaps it is better to sketch the obtained algorithm, so that the reader can better understand Algorithm BVVI itself and how it tackles the challenge of partial observability.
It is not that obvious such a sophisticated approach is necessary for the risk-sensitive RL. A natural question is that whether the (conceptually) simpler general-purpose algorithms can also achieve similar guarantees, e.g. E2D [1], OMLE [2, 4, 5, 6], or MOPS [3, 4]. Using the linearization lemma (Lemma F.14) and the analysis in [4, 5, 6], it seems that if we adopt the risk-sensitive objective as the value function in OMLE / MOPS / E2D, then they can achieve similar risk-sensitive regret guarantees for a broader class of decision-making problems (not restricted to POMDP with hindsight observation). Of course, the obtained regret bounds possibly are worse by some polynomial factors on (S, A, O, H), but (1) the analysis can be easier to understand, (2) potentially by doing so we can also obtain similar guarantees without assuming hindsight observation (e.g. revealing POMDP), and (3) the current upper bounds' dependency on (S, A, O, H) may also be loose.
**Questions:**
Q: The current regret upper bound scales with $e^{|\gamma| H}$. In my understanding, the H factor comes from the fact that the cumulative reward can vary from 0 to H. Is it possible to replace H by some parameter that characterizes the sub-Gaussian norm of the cumulative rewards? Such a parameter can be regarded as a variance proxy, which is possibly more interpretable in the context of risk-sensitive RL and can be much smaller than H in certain problem instances.
[1] Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
[2] Liu, Q., Chung, A., Szepesvari, C., and Jin, C. When is partially observable reinforcement learning not scary? In Conference on Learning Theory, pp. 5175–5220. PMLR, 2022.
[3] Alekh Agarwal and Tong Zhang. Model-based rl with optimistic posterior sampling: Structural conditions and sample complexity. arXiv preprint arXiv:2206.07659, 2022.
[4] Fan Chen, Song Mei, and Yu Bai. Unified algorithms for rl with decision-estimation coefficients: No-regret, pac, and reward-free learning. arXiv preprint arXiv:2209.11745, 2022.
[5] Fan Chen, Yu Bai and Song Mei. Partially observable RL with B-stability: Unified structural condition and sharp sample efficient algorithms. arXiv preprint arXiv:2209.14990, 2022.
[6] Liu, Q., Netrapalli, P., Szepesvari, C., and Jin, C. Optimistic MLE: A generic model-based algorithm for partially observable sequential decision making. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pp. 363–376, 2023.
### Rebuttal ###
Thank you for your thoughtful review. We would like to address your concerns in what follows.
### Weakness
> W1: The motivation of investigating risk-sensitive RL is evident, but the formulation of risk-sensitive objective (1) is not well-motivated, e.g. why this formulation captures the essence of risk-sensitivity, and whether there are other formulations.
The formulation we have chosen, the entropic risk measure, is a widely recognized and general approach in previous literature [1][2][3][4][5], as mentioned in Remark 3.1 on Page 3.
The connection between the exponential function and risk awareness can be illustrated by the following expansion around $\gamma = 0$:
$\frac{1}{\gamma}\ln \mathbb{E}\,e^{\gamma R}=\mathbb{E}R+\frac{\gamma}{2}\operatorname{Var}(R)+O(\gamma^2),$
where $R=\sum_{h=1}^H r_h$ is the accumulated reward. This expansion makes explicit that the entropic risk measure augments the expected return with a variance term weighted by the risk parameter $\gamma$, thereby capturing the agent's sensitivity to uncertainty in the rewards.
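As a quick numerical illustration (ours; the reward distribution and risk levels are arbitrary choices, not from the paper), the following snippet compares the entropic risk measure with its second-order mean-variance approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the accumulated reward R = sum_h r_h; any bounded distribution works.
R = rng.uniform(0.0, 5.0, size=1_000_000)

for gamma in (-2.0, -0.5, 0.5, 2.0):
    entropic = np.log(np.mean(np.exp(gamma * R))) / gamma   # (1/gamma) ln E[e^{gamma R}]
    approx = R.mean() + 0.5 * gamma * R.var()               # E[R] + (gamma/2) Var[R]
    print(f"gamma = {gamma:+.1f}:  entropic = {entropic:.3f},  mean-variance approx = {approx:.3f}")
```

For small $|\gamma|$ the two values nearly coincide, while for larger $|\gamma|$ the exact measure departs from the quadratic approximation; negative $\gamma$ penalizes variance (risk-averse) and positive $\gamma$ rewards it (risk-seeking).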
There exist various other risk measures in the risk-sensitive RL literature, such as Conditional Value-at-Risk (CVaR) [13], convex risk measures [14], and Lipschitz dynamic risk measures [15]. Investigating these measures in the context of risk-sensitive POMDPs is an exciting topic for future work.
> W2: The structural assumption of the hindsight observation is also not well-motivated, given that there are various structural conditions for sample-efficient learning in POMDP (e.g. revealing condition, decodability, stable PSR, etc.). In particular, the assumption of hindsight observation is much stronger than the revealing condition
The choice to consider hindsight observations is grounded in their practicality and the support from a wide range of empirical and theoretical studies, as detailed in the third paragraph of Page 3.
Additionally, in Section E (Page 31), we directly compare hindsight observability with other structural assumptions, and in the second paragraph of Section 2 (Page 2) we explicitly explain our decision not to adopt those alternatives.
Moreover, RL with hindsight observability, also known as asymmetric RL, has been extensively studied empirically, including in [6], [7], [8], and [9]. Recent theoretical works such as [10], [11], and [12] have also started to investigate this practical and general setting.
> W3: Due to the partial observability, both the description and the analysis of Algorithm BVVI are very sophisticated. The paper mentions that for fully-observable RL, Algorithm BVVI degenerates into a simpler algorithm with better guarantees. Perhaps it is better to sketch the obtained algorithm, so that the reader can better understand Algorithm BVVI itself and how it tackles the challenge of partial observability.
We dedicate Section B (Pages 12 to 21) to developing the mathematical machinery for addressing partial observability under the entropic risk measure. We introduce a new set of value functions, distinct from those in the MDP framework, as outlined in Sections 4 and 5 (Pages 3-4). This novel representation lays the foundation for the design of Algorithm BVVI.
We appreciate your suggestion to include a sketch of the degenerate algorithm for the simpler fully observable setting. We will take your advice into consideration during our revision to enhance the clarity of our presentation.
> W4: It is not that obvious such a sophisticated approach is necessary for the risk-sensitive RL.
Our work provides the first rigorous analysis on the sample complexity of learning risk-sensitive POMDPs. The theoretical analysis and interpretation of our results can be found in Section D.4 on Page 29, where we thoroughly analyze the implications of our findings.
We appreciate your suggestion to explore the adaptability of existing algorithmic frameworks from risk-neutral to risk-sensitive POMDPs. This indeed presents an interesting avenue for future research. It would be valuable to assess whether these frameworks can be effectively tailored to accommodate risk-sensitive objectives and to deliver closed-form regret bounds akin to those in our study, thereby clarifying how the complexity depends on the risk parameter $\gamma$.
### Question
> The current regret upper bound scales with $e^{|\gamma| H}$. In my understanding, the H factor comes from the fact that the cumulative reward can vary from 0 to H. Is it possible to replace H by some parameter that characterizes the sub-Gaussian norm of the cumulative rewards?
Previous work (Section F of [1]) has proved that, in the fully observable setting, risk-sensitive RL under the entropic risk measure suffers a regret lower bound of $\Omega((e^{\gamma H/2}-1)\sqrt{T}/\gamma)$. Consequently, we believe that the factor $e^{|\gamma| H}$ is generally unavoidable for risk-sensitive RL problems. We discuss this issue in Section 8.3 of our submission (Page 8).
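As a consistency check (a short calculation of ours, not a claim from [1]): in the risk-neutral limit,
$\lim_{\gamma\to 0}\frac{e^{\gamma H/2}-1}{\gamma}=\frac{H}{2},$
so the lower bound $\Omega\big((e^{\gamma H/2}-1)\sqrt{T}/\gamma\big)$ degenerates to $\Omega(H\sqrt{T})$ up to constants, recovering a polynomial dependence on $H$. For any fixed $\gamma\neq 0$, however, the bound grows exponentially in $\gamma H$, which is why we expect the $e^{|\gamma| H}$ factor in our upper bound to be intrinsic rather than an artifact of the analysis.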
We agree with the reviewer that exploring a parameter characterizing the sub-Gaussian norm of the cumulative rewards is very interesting. We will consider this direction in future investigations.
We appreciate your constructive suggestions, which have played a significant role in guiding our improvements. We hope our response addresses your concerns; if so, we would appreciate it if you could kindly consider raising your score. Should you have any additional questions, we are more than happy to provide further assistance. Thank you very much for the support!
**References:**
[1] Fei, Y., Yang, Z., Chen, Y., Wang, Z., \& Xie, Q. (2020). Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret. Advances in Neural Information Processing Systems, 33, 22384-22395.
[2] Liang, H., \& Luo, Z. Q. (2023). Regret bounds for risk-sensitive reinforcement learning with Lipschitz dynamic risk measures. arXiv preprint arXiv:2306.02399.
[3] Lin Hau, J., Petrik, M., \& Ghavamzadeh, M. (2023). Entropic risk optimization in discounted MDPs. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research 206:47-76. https://proceedings.mlr.press/v206/lin-hau23a.html
[4] Fei, Y., Yang, Z., \& Wang, Z. (2021). Risk-sensitive reinforcement learning with function approximation: A debiasing approach. In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research 139:3198-3207. https://proceedings.mlr.press/v139/fei21a.html
[5] Bäuerle, N., \& Rieder, U. (2014). More risk-sensitive Markov decision processes. Mathematics of Operations Research, 39(1), 105-120.
[6] Baisero, A., Daley, B., \& Amato, C. (2022, August). Asymmetric DQN for partially observable reinforcement learning. In Uncertainty in Artificial Intelligence (pp. 107-117). PMLR.
[7] Pinto, L., Andrychowicz, M., Welinder, P., Zaremba, W., & Abbeel, P. (2017). Asymmetric actor critic for image-based robot learning. arXiv preprint arXiv:1710.06542.
[8] Chen, D., Zhou, B., Koltun, V., & Krähenbühl, P. (2020, May). Learning by cheating. In Conference on Robot Learning (pp. 66-75). PMLR.
[9] Ross, S., Gordon, G., & Bagnell, D. (2011, June). A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics (pp. 627-635). JMLR Workshop and Conference Proceedings.
[10] Lee, J., Agarwal, A., Dann, C., & Zhang, T. (2023, July). Learning in pomdps is sample-efficient with hindsight observability. In International Conference on Machine Learning (pp. 18733-18773). PMLR.
[11] Sinclair, S. R., Frujeri, F. V., Cheng, C. A., Marshall, L., Barbalho, H. D. O., Li, J., ... & Swaminathan, A. (2023, July). Hindsight learning for mdps with exogenous inputs. In International Conference on Machine Learning (pp. 31877-31914). PMLR.
[12] Shi, M., Liang, Y., & Shroff, N. (2023). Theoretical Hardness and Tractability of POMDPs in RL with Partial Hindsight State Information. arXiv preprint arXiv:2306.08762.
[13] Du, Y., Wang, S., & Huang, L. (2022). Provably efficient risk-sensitive reinforcement learning: Iterated cvar and worst path. arXiv preprint arXiv:2206.02678.
[14] Coache, A., & Jaimungal, S. (2024). Reinforcement learning with dynamic convex risk measures. Mathematical Finance, 34(2), 557-587.
[15] Liang, H., & Luo, Z. Q. (2023). Regret bounds for risk-sensitive reinforcement learning with lipschitz dynamic risk measures. arXiv preprint arXiv:2306.02399.