# Rebuttal for Kernel graph MDP

## Common Responses

We thank all the reviewers for their insightful feedback. In the following, we first provide responses to several common comments raised by the reviewers. Then, we provide a point-by-point response to each individual comment made by the reviewers right after their respective reviews.

#### Q1: The comparison between our model and factored MDPs.

* **Different structural formulation.** We want to clarify that our model and the factored MDP model are fundamentally different. Specifically, in factored MDPs, it is assumed that the transition kernel and the reward function are both factorizable, with a matching factored set [1]. Under this assumption, a factored MDP can be perfectly decomposed into multiple independent smaller MDPs. In contrast, we only assume a more general bipartite graph-based transition kernel, which is not necessarily factorizable. Even for a fully decomposable kernel graph, our model is still different because we do not make assumptions about the reward function.
* **New technical challenges and our contributions.** Since a factored MDP can be perfectly decomposed into multiple independent smaller MDPs, applying vanilla single-agent RL algorithms to optimize each small MDP suffices for optimizing the overall factored MDP. In our case, since it is in general impossible to perfectly decompose the overall MDP into smaller ones, our algorithm focuses on strategically decomposing and aggregating the kernel graph and sampling the sub-kernel-level transitions, which is the main algorithmic contribution of this work. Technically, to estimate and control the errors that arise due to kernel decomposition and aggregation, we need to develop a more sophisticated variance analysis and concentration techniques for sub-kernels.

>[1] Osband, I. and Van Roy, B. Near-optimal reinforcement learning in factored MDPs. Advances in Neural Information Processing Systems, 27, 2014.

#### Q2: How to get the graph decomposition?

The choice of the decomposition scheme determines the trade-off between the asymptotic bias and the sample complexity, as demonstrated by our convergence bounds and our numerical simulations. Since we are the first to consider such a bipartite graph structure and to develop a customized algorithm, how to choose such a decomposition scheme optimally was not the focus of this work. That being said, we briefly elaborate on several promising future directions for achieving this goal theoretically and empirically.

1. **Theoretically**, to characterize the trade-off and provide an optimal decomposition scheme, we first need to establish a lower bound on the iterates that matches the upper bound. Based on that, we can optimize the two terms on the right-hand side of the bound to obtain the optimal decomposition scheme.
2. **Empirically**, suppose that we have some prior knowledge of the problem, that is, we are able to identify how strong the transition dependence is for different graph edges from the physical model of the problem (which is often possible in practical applications). Then, we can decompose the graph based on such prior knowledge of the transition dependence.
3. **If we do not have any prior knowledge of the problem**, a possible direction is to design an adaptive decomposition algorithm that dynamically chooses the best decomposition (see the sketch after this list). Specifically, we start with the full decomposition, i.e., the number of decomposed components $K$ equals the number of state dimensions. This is the case where the sample complexity is lowest but the error due to decomposition is largest. When the estimation converges but the required accuracy is not achieved, we aggregate some components and obtain a new decomposition with fewer components, which leads to a higher sample complexity and a lower bias. We iterate this process until the required accuracy is reached.
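The following minimal Python sketch illustrates this adaptive loop. The helpers `estimate_and_evaluate` and `aggregate` are hypothetical placeholders for the estimation routine and the component-merging rule; they are not functions from the paper.

```python
# Hypothetical sketch of the adaptive decomposition loop described in point 3 above.
def adaptive_decomposition(state_dims, target_accuracy, estimate_and_evaluate, aggregate):
    """Start from the full decomposition (one component per state dimension)
    and merge components until the required accuracy is reached."""
    components = [[d] for d in state_dims]         # lowest sample complexity, largest bias
    while True:
        error = estimate_and_evaluate(components)  # run the estimator on this decomposition
        if error <= target_accuracy or len(components) == 1:
            return components, error
        # Merge components with strong cross-dependence: higher sample cost, lower bias.
        components = aggregate(components)
```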
#### Q3: Could you provide a concrete example that is not a factored MDP but has a bipartite graph-based transition kernel? I don't understand the reward function part. In general, your state space is combinatorial; how could you design any kind of efficient algorithm without making assumptions on the reward function? In line 13 of Algorithm 1, you state "sample each state-action pair once to obtain reward function r(·, ·)". This step requires iterating over the exponentially large state and action space. I thought the goal was to avoid this.

We thank the reviewer for the feedback.

1. **An example of a bipartite graph-based transition kernel without factored structure**: Consider an MDP with a two-dimensional state $s = (s_1, s_2)$ and a one-dimensional action $a$. For the transition dependence, substate $s_1$'s transition depends on $s_1$ and $s_2$, and substate $s_2$'s transition depends on $s_2$ and $a$. In the factored MDP setting, substates $s_1$ and $s_2$ have independent transitions, and the overall transition probability $P(s'|s,a)$ satisfies
    \begin{align*}
    P(s'|s,a) &= P(s'_1,s'_2|s_1,s_2,a) \\
    &= P(s'_1|s_1,s_2,a)P(s'_2|s_1,s_2,a)\\
    &= P(s'_1|s_1,s_2)P(s'_2|s_2,a).
    \end{align*}
    The second equality follows from the factored probability assumption of factored MDPs [1], and the third equality follows from the partial dependence. However, in our model, since we do not make the factored probability assumption, the second equality does not hold because the transitions of $s_1$ and $s_2$ are not independent. This dependence is due to the fact that $s_1$'s and $s_2$'s transitions are jointly influenced by the common substate $s_2$. Therefore, our kernel graph-based transition model makes weaker assumptions and thus has a wider range of applicability.
2. **Influence of the reward function structure**:
    * **When the reward function has certain structures,** we can avoid sampling all state-action pairs and the exponentially large state-action space dependence. For example, consider the MDP mentioned in point 1 and suppose that the reward function has a low-dimensional structure:
    \begin{align*}
    r(s,a) &= r(s_1,s_2,a) = r_1(s_1,s_2)+r_2(s_1,a)+r_3(s_2,a),
    \end{align*}
    where $r_1$, $r_2$, and $r_3$ are low-dimensional sub-reward components. In this case, we only need to sample all state-action pairs of the components $r_1$, $r_2$, and $r_3$ to recover the overall reward function $r$. The induced sample complexity is $|S_1||S_2|+|S_1||A|+|S_2||A|$, which is much smaller than $|S||A|=|S_1||S_2||A|$ thanks to the reward function structure (see the quick arithmetic check after this response). Note that this is only one example of a reward function structure; the number of sub-reward components and their combination form (e.g., here $r = r_1 + r_2 + r_3$, but we allow any combination function $f$ with $r = f(r_1, r_2, r_3)$) can be much more general. Specifically, the induced sample complexity $D_r$ in the general case satisfies
    \begin{align*}
    D_{r} = \sum_{i=1}^H D_r^i,
    \end{align*}
    where $H$ denotes the number of sub-reward components and $D_r^i$ denotes the size of the state-action space of component $r_i$. Since the sub-reward components depend only partially on the state (action) dimensions, the sample complexity $D_r$ can be exponentially smaller than $|S||A|$. A detailed description is provided in Appendix F of our paper.
    * **Even when the reward function has no structure,** the sample complexity induced by estimating the unknown reward function is often minor compared with that induced by estimating the unknown transition kernel. As the reviewer mentioned, we need to sample all state-action pairs in this case. However, if the reward function is deterministic, we only need to sample each state-action pair once, with sample complexity $|S||A|$, which is much smaller than the minimax sample complexity $\frac{|S||A|}{(1-\gamma)^3\epsilon^2}$ because there is no dependence on the effective horizon $1/(1-\gamma)$ and, more importantly, on the accuracy $\epsilon$. Even when the reward function is stochastic, the induced sample complexity is only $\frac{|S||A|}{(1-\gamma)^2\epsilon^2}$ by the law of large numbers, which is still smaller than the minimax sample complexity. Essentially, the sample complexity induced by estimating the reward function is dominated by that induced by estimating the transition probabilities, which is also why most of the existing literature does not consider the stochasticity of the reward function and focuses on the transition probabilities [2][3].

>[1] Osband, I. and Van Roy, B. Near-optimal reinforcement learning in factored MDPs. Advances in Neural Information Processing Systems, 27, 2014.
>[2] Agarwal, Alekh, Sham Kakade, and Lin F. Yang. "Model-based reinforcement learning with a generative model is minimax optimal." Conference on Learning Theory. PMLR, 2020.
>[3] Gheshlaghi Azar M, Munos R, Kappen H J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model[J]. Machine learning, 2013, 91: 325-349.
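To make the counting above concrete, here is a quick arithmetic check with illustrative sizes of our own choosing (not taken from the paper), say $|S_1|=|S_2|=10$ and $|A|=5$:

```python
# Illustrative reward-sampling counts (assumed sizes, not from the paper).
S1, S2, A = 10, 10, 5

structured = S1 * S2 + S1 * A + S2 * A   # sample r1, r2, r3 component by component
unstructured = S1 * S2 * A               # sample every (s, a) pair of the full reward

print(structured, unstructured)          # 200 vs. 500; the gap grows with dimension
```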
#### Q4: Regarding the discussion on the reward function, all of your arguments apply to FMDP as well. So they are not arguments that help clarify your contribution w.r.t. FMDP. Regarding the first example, are you suggesting that the next $s_1$ and $s_2$ are dependent given the current $s_1, s_2, a$, so the second equality you mentioned does not hold? And I think your assumption is only made on the marginal distribution of $s_1$. Note that I am talking about conditional dependence here, so everything is conditioned on a specific current $s_1, s_2, a$. In this case, $s_1$ and $s_2$ are dependent only when there is a confounder, for example $U$, and both $s_1$ and $s_2$ depend on $U$. If this is the case, this has to be shown in your Figure 1. Figure 1 clearly indicates that $s_1$ and $s_2$ are conditionally independent. Could you make this more concrete? For example, consider binary $s_1$, $s_2$ and $a$ and write out the full conditional probability $P(s_1^{t+1}, s^{t+1}_2 \mid s_1^{t}, s_2^{t}, a^{t})$. Then tell me how this probability does not satisfy the FMDP assumption while satisfying yours.

We thank the reviewer for the follow-up question and for engaging in discussion with us.
* **Regarding the reward function.** We agree with the reviewer that the reward function assumption is not the key difference between our model and the FMDP, and we apologize for the confusion. Instead, the key difference between our model and the FMDP lies in the transition kernel structure, which is also the unknown component of the MDP that usually dominates the sample complexity [1][2]; please refer to the concrete example below. This difference in the transition kernel structure is what requires us to develop new techniques, as detailed in the second bullet point of Common Responses Q1.
* **A concrete example that is not an FMDP but has a bipartite graph-based transition kernel.** As suggested by the reviewer, we provide a concrete example with binary state values to illustrate that the FMDP assumption is not satisfied in our setting. For any $t\geq 0$, consider binary substates $s_1^t$, $s_2^t$ and action $a^t$, where the next substates $s^{t+1}_1$ and $s^{t+1}_2$ depend only on the current substate $s^t_2$, i.e.,
\begin{align*}
P(s_1^{t+1},s_2^{t+1}|s_1^{t},s_2^{t},a^t) = P(s_1^{t+1},s_2^{t+1}|s_2^{t}).
\end{align*}
Next, we provide the full transition probabilities. For any $s_1^t$ and $a^t$, let
$$ P(s_1^{t+1}=0,s_2^{t+1}=0|s_2^{t}=0,s_1^{t},a^t) = 1/3, $$
$$ P(s_1^{t+1}=0,s_2^{t+1}=1|s_2^{t}=0,s_1^{t},a^t) = 0, $$
$$ P(s_1^{t+1}=1,s_2^{t+1}=0|s_2^{t}=0,s_1^{t},a^t) = 0, $$
$$ P(s_1^{t+1}=1,s_2^{t+1}=1|s_2^{t}=0,s_1^{t},a^t) = 2/3, $$
$$ P(s_1^{t+1}=0,s_2^{t+1}=0|s_2^{t}=1,s_1^{t},a^t) = 0, $$
$$ P(s_1^{t+1}=0,s_2^{t+1}=1|s_2^{t}=1,s_1^{t},a^t) = 2/3, $$
$$ P(s_1^{t+1}=1,s_2^{t+1}=0|s_2^{t}=1,s_1^{t},a^t) = 1/3, $$
$$ P(s_1^{t+1}=1,s_2^{t+1}=1|s_2^{t}=1,s_1^{t},a^t) = 0. $$
Based on the above equations, we calculate the marginal transition probabilities as follows: for any $s_1^t$ and $a^t$,
$$ P(s_1^{t+1}=0|s_2^{t}=0) =P(s_1^{t+1}=0|s_2^{t}=0,s_1^{t},a^t) = 1/3, $$
$$ P(s_1^{t+1}=1|s_2^{t}=0) =P(s_1^{t+1}=1|s_2^{t}=0,s_1^{t},a^t) = 2/3, $$
$$ P(s_1^{t+1}=0|s_2^{t}=1) =P(s_1^{t+1}=0|s_2^{t}=1,s_1^{t},a^t) = 2/3, $$
$$ P(s_1^{t+1}=1|s_2^{t}=1) =P(s_1^{t+1}=1|s_2^{t}=1,s_1^{t},a^t) = 1/3, $$
$$ P(s_2^{t+1}=0|s_2^{t}=0) =P(s_2^{t+1}=0|s_2^{t}=0,s_1^{t},a^t) = 1/3, $$
$$ P(s_2^{t+1}=1|s_2^{t}=0) =P(s_2^{t+1}=1|s_2^{t}=0,s_1^{t},a^t) = 2/3, $$
$$ P(s_2^{t+1}=0|s_2^{t}=1) =P(s_2^{t+1}=0|s_2^{t}=1,s_1^{t},a^t) = 1/3, $$
$$ P(s_2^{t+1}=1|s_2^{t}=1) =P(s_2^{t+1}=1|s_2^{t}=1,s_1^{t},a^t) = 2/3. $$
Note that in this example, we have $P(s_1^{t+1}|s_2^{t},s_1^{t},a^t) = P(s_1^{t+1}|s_2^{t})$ and $P(s_2^{t+1}|s_2^{t},s_1^{t},a^t) = P(s_2^{t+1}|s_2^{t})$ for any $s_1^{t},a^t$, thus satisfying our model assumption. However, the factored probability assumption does not hold because
\begin{align*}
P(s_1^{t+1}=0|s_2^{t}=0)P(s_2^{t+1}=0|s_2^{t}=0) = 1/9 \neq P(s_1^{t+1}=0,s_2^{t+1}=0|s_2^{t}=0).
\end{align*}
In essence, the FMDP simultaneously assumes (1) the partial transition dependence and (2) the factored probability, in particular, that the joint probability equals the product of the marginal probabilities. In contrast, we only assume (1) the partial transition dependence, which is the key modeling difference.

>[1] Agarwal, Alekh, Sham Kakade, and Lin F. Yang. "Model-based reinforcement learning with a generative model is minimax optimal." Conference on Learning Theory. PMLR, 2020.
>[2] Gheshlaghi Azar M, Munos R, Kappen H J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model[J]. Machine learning, 2013, 91: 325-349.
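As a quick numerical sanity check of this example (illustrative code of our own, not part of the paper), one can verify that the product of the marginals differs from the joint:

```python
from fractions import Fraction as F

# Joint next-state distribution conditioned on s2_t = 0, taken from the table above.
joint = {(0, 0): F(1, 3), (0, 1): F(0), (1, 0): F(0), (1, 1): F(2, 3)}

# Marginals of s1_{t+1} and s2_{t+1} given s2_t = 0.
p_s1 = {v: sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)}
p_s2 = {v: sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)}

print(p_s1[0] * p_s2[0], joint[(0, 0)])   # 1/9 vs. 1/3: the kernel is not factorizable
```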
#### Q5: Thanks for the clarification. Now I understand that you do not assume that $s_1$ and $s_2$ are conditionally independent. There is a confounder; let us call it $U$. You will have to make this point very clear in Figure 1. Figure 1 as a DAG clearly indicates that $s_1$ and $s_2$ are conditionally independent, because conditioning on the current state and action, the causal path between $s_1$ and $s_2$ is blocked. I am still confused by how you can estimate the overall transition probability based on the marginal ones when you don't assume conditional independence. It seems that in equation (8) you compute the overall transition based on the marginal ones with the Kronecker product operator. Doesn't this assume conditional independence? Could you use the same concrete example and explain to me how you can reconstruct the overall transition from the marginal ones? Let us assume you have a perfect estimation of the marginal transitions.

**Regarding the kernel graph representation**: We thank the reviewer for proposing an alternative viewpoint on our model. Indeed, using the confounding effect to explain the dependence between the substates is a good perspective for understanding the difference between our model and the FMDP.

**Using marginal probabilities to recover full probabilities**: When there is a confounding variable that impacts two substates (as in the illustrative example provided in our last response), our decomposition scheme will NOT decompose those two substates (in contrast to the FMDP). Specifically, as illustrated in Section 4.1.1 of our submission, our decomposition scheme is based on the division of **disconnected** substates. Next, we use the MDP model provided in our last response to further illustrate this point. For any $t\geq 0$, consider binary substates $s_1^t$, $s_2^t$ and action $a^t$, where the next substates $s^{t+1}_1$ and $s^{t+1}_2$ depend only on the current substate $s^t_2$. We can see that $s^t_2$ influences both $s^{t+1}_1$ and $s^{t+1}_2$. Moreover, since the product of the marginal transition probabilities does not equal the overall transition probability (see our previous response), there exists a confounding variable $U^t_2$ whose realization depends on $s^t_2$ and which influences both $s^{t+1}_1$ and $s^{t+1}_2$. Please refer to the following anonymous link for a pictorial illustration: https://docs.google.com/document/d/e/2PACX-1vSfTQ_hG0VG5eBHXGAQ5P81h8coOcSyS5hn3ruP2wCqp0eBEQbhmqJ_vB8gVnWQzgjgPXstw8bS__Ss/pub

If all state/action variables $s^t_1$, $s^t_2$, $a^t$ and the confounder $U^t_2$ are known, then $s_1^{t+1}$ and $s_2^{t+1}$ are conditionally independent by default [1], i.e.,
$$ P(s_1^{t+1},s_2^{t+1}|s_1^t,s_2^t,a^t,U^t_2) = P(s_1^{t+1}|s_1^t,s_2^t,a^t,U^t_2)P(s_2^{t+1}|s_1^t,s_2^t,a^t,U^t_2). $$
However, since $U_2^t$ is unknown, the conditional independence does not hold due to the confounding effect:
$$ P(s_1^{t+1},s_2^{t+1}|s_1^t,s_2^t,a^t) \neq P(s_1^{t+1}|s_1^t,s_2^t,a^t)P(s_2^{t+1}|s_1^t,s_2^t,a^t) = P(s_1^{t+1}|s_2^{t})P(s_2^{t+1}|s_2^{t}). $$
Therefore, for this example, the FMDP decomposes $s_1^{t+1}$ and $s_2^{t+1}$ and obtains an inexact transition kernel. Our connected-component decomposition, since it is based on the division of disconnected substates (cf. Section 4.2 in our paper), will not decompose these two substates but instead directly estimates the joint distribution $P(s_1^{t+1},s_2^{t+1}|s_2^t)$. Since the confounding effect exists only among substates that are connected, our connected-component decomposition avoids all confounders, and thus the substates in different components are conditionally independent by default [1]. Therefore, the issue of estimating the overall transition probability from marginal transition probabilities does not arise (see the numerical sketch at the end of this response). Compared with the FMDP, since we take the confounding effect into account, we obtain a more accurate and comprehensive description of the model structure.

**Highlighting our contributions --- our proposed decomposition scheme is highly flexible**: A notable feature of our decomposition and analysis framework is that it allows for decomposition errors, which makes the framework more flexible. To see this, consider an MDP with a $100$-dimensional state space $s=(s_1,s_2,\ldots,s_{100})$. Suppose that the substates can be divided into $10$ groups, $(s_1,\cdots,s_{10})$, $(s_{11},\cdots,s_{20})$, $\cdots$, $(s_{91},\cdots,s_{100})$, such that within each group the interconnections are strong, while between different groups there are only weak transition dependences. In this case, the graph is not fully decomposable (and also not factorizable). However, since our analysis allows for decomposition errors, it might still be better to artificially decompose the MDP into $10$ components so that the sample complexity is exponentially reduced while the asymptotic error remains relatively small due to the weak dependence between the components. We have presented numerical simulations (cf. Figure 5) to demonstrate this effect.

>[1] Pearl J. Models, reasoning and inference[J]. Cambridge, UK: Cambridge University Press, 2000, 19(2): 3.
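To make the reconstruction step concrete, here is a minimal numerical sketch of our own (the sizes and values are illustrative assumptions, and the code is not the paper's equation (8) verbatim): within a connected component the joint sub-kernel is estimated directly, and sub-kernels of disconnected components are combined with a Kronecker product.

```python
import numpy as np

# Component 1: joint kernel over (s1, s2), which share the confounding parent s2_t.
# Rows index the current (s1, s2) in the order (0,0), (0,1), (1,0), (1,1); the
# transition depends only on s2_t, so rows with the same s2_t are identical.
P1 = np.array([[1/3, 0.0, 0.0, 2/3],
               [0.0, 2/3, 1/3, 0.0],
               [1/3, 0.0, 0.0, 2/3],
               [0.0, 2/3, 1/3, 0.0]])

# Component 2: a hypothetical extra substate s3 that is disconnected from (s1, s2).
P2 = np.array([[0.9, 0.1],
               [0.2, 0.8]])

# Disconnected components share no parents, hence no confounders, so the overall
# kernel factorizes across components and is recovered by a Kronecker product.
P_full = np.kron(P1, P2)
assert np.allclose(P_full.sum(axis=1), 1.0)   # every row is still a distribution
```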
## Reviewer fzvc

We would like to express our gratitude to the reviewer for their insightful feedback. We are glad that the reviewer recognizes the novelty of our contributions, the theoretical and empirical results that show the advantages of our approach, as well as the clarity of the whole paper. We provide our responses to the questions below.

#### Q1: State-action space is rather small in the experiments.

For the three cases in the experiments, the sizes of the state-action spaces are $625$, $320$, and $810$, which we believe are not small and are sufficient to numerically verify the theoretical findings (for a theoretical work).
That being said, since the advantage of our method becomes clearer as the size of the state-action space grows, we have included additional experiments with larger state-action spaces here:

>https://docs.google.com/document/d/e/2PACX-1vRKjDMB4WecEpznmJfVZoDalryT8eLS8gJlQPBJB-VbOwrQBA0JUWG2STEaNLOFgg9lIsFpe6UsT_4F/pub

#### Q2: How to get the graph decomposition?

Please see our response to the second common comment at the beginning of this page for more details.

## Reviewer 9p8M

We gratefully thank the reviewer for recognizing our contributions to the modeling, the strong theoretical guarantees, and the clarity of presentation. We provide our responses below.

#### Q1: In terms of weakness, I think the setup itself -- a generative model is available for sampling from any state-action pair -- has limited applications in practice.

The generative model is a fundamental and frequently adopted sampling model [1][2][3] for theoretical analysis in RL. Since this is the first work that considers such a bipartite graph structure, we use the generative-model setup to theoretically examine the merit of our proposed algorithm. That being said, our result can potentially be extended to the case where we have a single trajectory of Markovian samples, as long as the Markov chain is uniformly ergodic. The key challenge in this case is to extend the analysis from i.i.d. sampling to Markovian sampling. Given that there are several existing results [4] that provide the tools to handle this challenge (based on a mixing-time argument), we believe the generative model assumption can potentially be relaxed. We will include a discussion of our generative model assumption in the next version of this work.

>[1] Kearns, M., Mansour, Y., and Ng, A. Y. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine learning, 49:193–208, 2002.
>[2] Kakade, Sham Machandranath. On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003.
>[3] Azar, Mohammad Gheshlaghi, Rémi Munos, and Hilbert Kappen. "On the Sample Complexity of Reinforcement Learning with a Generative Model." In International Conference on Machine Learning. 2012.
>[4] Li, G., Wei, Y., Chi, Y., Gu, Y., \& Chen, Y. (2020). Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Advances in neural information processing systems, 33, 7031-7043.

#### Q2: Identifying a meaningful decomposition is challenging.

Please see our response to the second common comment at the beginning of this page for more details.

#### Q3: In the fully decomposable case, is the setup equivalent to a factored MDP?

We want to clarify that even for a fully decomposable kernel graph, our model is different from the factored MDP because we make no assumptions on the reward function or the matching factored set. We have provided a more detailed discussion of the comparison between our model and the factored MDP in our response to the first common comment at the beginning of this page.

#### Q4: Questions on the warm-up model --- it seems not realistic and won't benefit a lot from decomposition.

There are two reasons that motivate us to present the warm-up model.

1. Although very simple, the substate modeling and decomposable structures can be found in a wide variety of real-world decision-making problems, such as electric vehicle charging, building thermal control, and electricity market trading [1].
2. The 3-state model (which consists of independent substates, Markovian substates, and dependent substates) can be viewed as an illustrative example that provides important first-glance intuition: the independent substate does not influence the sample complexity; the Markovian substate introduces a sample complexity that is linear in the size of the Markovian substate space; and the dependent substate introduces a sample complexity that is linear in the size of the dependent substate-action space.

To address the reviewer's comment, we will add another example that does not have such a fully decomposable structure to the warm-up section in the next version of our paper.

>[1] Yeh, C., Li, V., Datta, R., Arroyo, J., Christianson, N., Zhang, C., ... \& Wierman, A. (2024). SustainGym: Reinforcement Learning Environments for Sustainable Energy Systems. Advances in Neural Information Processing Systems, 36.

## Reviewer rr4w

#### Q1: What's the difference between this work and factored MDPs, which also have a decomposition process? What's the relation between the structure of this work and the more general low Bellman rank assumption?

The modeling assumptions of our work and of factored MDPs are, in fact, fundamentally different; please see our response to the first common comment at the beginning of this page for more details.

Regarding works that make low Bellman-rank assumptions, they are not directly comparable to our results. Specifically, the Bellman rank was introduced to characterize the limitation of function approximation in RL [1]. There is another line of work studying low-rank MDPs [2][3][4], which is likewise a framework for studying the limitation of function approximation. Specifically, in a low-rank MDP, it is assumed that the transition probability matrix can be factored into the product of two low-rank matrices, similar to using linear function approximation for the value function. In contrast, our model focuses on a possibly decomposable transition kernel based on the Kronecker product, which targets a quite different type of structure.
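To illustrate the distinction loosely (this is our own toy contrast with assumed sizes, not a construction from either line of work): a low-rank MDP posits a matrix-product factorization of the kernel, whereas our decomposition posits a Kronecker-product structure across components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Low-rank MDP style structure: P is (approximately) a product of two thin matrices.
SA, S_next, d = 6, 6, 2
Phi, Mu = rng.random((SA, d)), rng.random((d, S_next))
P_lowrank = Phi @ Mu                       # rank at most d (unnormalized, shape illustration only)

# Kernel-graph style structure: P is a Kronecker product of small sub-kernels.
P_a = rng.dirichlet(np.ones(2), size=2)    # 2x2 row-stochastic sub-kernel
P_b = rng.dirichlet(np.ones(3), size=3)    # 3x3 row-stochastic sub-kernel
P_kron = np.kron(P_a, P_b)                 # 6x6 kernel with component-wise structure
```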
In the next version of our paper, we will properly cite the works that study MDPs with low Bellman rank and discuss the relation.

>[1] Jiang, Nan, et al. "Contextual decision processes with low bellman rank are pac-learnable." International Conference on Machine Learning. PMLR, 2017.
>[2] Huang, A., Chen, J., & Jiang, N. (2023, July). Reinforcement learning in low-rank mdps with density features. In International Conference on Machine Learning (pp. 13710-13752). PMLR.
>[3] Agarwal, Alekh, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. "Flambe: Structural complexity and representation learning of low rank mdps." Advances in neural information processing systems 33 (2020): 20095-20107.
>[4] Mhammedi, Zak, Adam Block, Dylan J. Foster, and Alexander Rakhlin. "Efficient model-free exploration in low-rank mdps." Advances in Neural Information Processing Systems 36 (2024).

#### Q2: Is the main contribution caused by the generative model assumption?

We want to clarify that the improvement in our sample complexity is not caused by the generative-model assumption. In fact, the generative-model assumption is a standard assumption in the theoretical analysis of RL and was also imposed in [1], which develops the minimax lower bound. Our key contribution is to identify bipartite graph structures in MDPs and, in turn, develop a customized algorithm with a novel sampling scheme that achieves an improved sample complexity compared with the minimax lower bound.

#### Q3: What is the advantage of the proposed method compared to value function approximation methods?

Function approximation (FA) and our method are not mutually exclusive. Specifically, after decomposing the transition kernel into multiple small-scale sub-kernels, one can use FA to approximate the optimal value function (or Q-function) for the small-scale transition kernels. It is an interesting future direction to theoretically (and also empirically) verify the performance of our algorithm combined with FA.

#### Q4: What is the computational complexity of Algorithm 1?

In our algorithm, after obtaining an estimated model (including the transition kernel and the reward) using the sampling scheme designed around the bipartite graph structure, we use standard value iteration to find an optimal policy. Since value iteration converges geometrically fast [2], it is often not the performance bottleneck. In addition, one can incorporate function approximation into such a value iteration method, as suggested by the reviewer, to further overcome the curse of dimensionality.
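For reference, a minimal sketch of this planning step on an estimated tabular model (the array shapes and names are our illustrative assumptions, not the paper's Algorithm 1):

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-8):
    """P: (S, A, S) estimated transition kernel, r: (S, A) estimated rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = r + gamma * P.dot(V)              # Bellman backup, shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:   # contraction => geometric convergence
            return V_new, Q.argmax(axis=1)    # optimal values and a greedy policy
        V = V_new
```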
We will add the above discussion regarding the computational complexity to the next version of this work.

>[1] Azar, Mohammad Gheshlaghi, Rémi Munos, and Hilbert Kappen. "On the Sample Complexity of Reinforcement Learning with a Generative Model." In International Conference on Machine Learning. 2012.
>[2] Sutton, R. S., \& Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

#### Q5: The simulation should compare with the algorithms designed for factored MDPs as well if the authors argue that this is a different setting.

As explained in our response to the first common comment at the beginning of this page, the model assumptions of the factored MDP and of our setting are fundamentally different, even in the decomposable kernel graph case. We did not numerically compare our algorithm to theirs because we feel it would be an unfair comparison, as the algorithms require different model assumptions to work. Specifically, the factored MDP requires a much stronger assumption on the matching factored transition kernel and reward function. If we directly apply their customized algorithm to our non-factored model, forcing a factorization will introduce additional errors, making the comparison unfair. That being said, to address the reviewer's concern, we will add corresponding numerical comparisons in the next version of our paper.

#### Q6: The paper should have a more detailed comparison between the proposed setting and the other structural assumptions in the literature. No apparent societal impact.

In the next version of this work, we will provide a more detailed comparison with works that study factored MDPs, weakly coupled MDPs [1], pseudo-stochastic state MDPs [2], networked MDPs [3][4], confounded MDPs [5], and low Bellman rank MDPs [6][7].

>[1] Meuleau, N., Hauskrecht, M., Kim, K.-E., Peshkin, L., Kaelbling, L. P., Dean, T. L., and Boutilier, C. Solving very large weakly coupled Markov decision processes. In AAAI/IAAI, pp. 165–172, 1998.
>[2] Wei, H., Liu, X., Wang, W., and Ying, L. Sample efficient reinforcement learning in mixed systems through augmented samples and its applications to queueing networks. Preprint arXiv:2305.16483, 2023.
>[3] Zhang, Y., Qu, G., Xu, P., Lin, Y., Chen, Z., and Wierman, A. Global convergence of localized policy iteration in networked multi-agent reinforcement learning. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7(1):1–51, 2023.
>[4] Qu, G., Lin, Y., Wierman, A., and Li, N. Scalable multi-agent reinforcement learning for networked systems with average reward. Advances in Neural Information Processing Systems, 33:2074–2086, 2020.
>[5] Zhang, J. and Bareinboim, E. Markov decision processes with unobserved confounders: A causal approach. Technical Report R-23, Purdue AI Lab, 2016.
>[6] Jiang, Nan, et al. "Contextual decision processes with low bellman rank are pac-learnable." International Conference on Machine Learning. PMLR, 2017.
>[7] Huang, A., Chen, J., & Jiang, N. (2023, July). Reinforcement learning in low-rank mdps with density features. In International Conference on Machine Learning (pp. 13710-13752). PMLR.

## Reviewer 6KFY

#### Q1: The knowledge of the dependency graph is in general hard to obtain in high-dimensional real-world applications with, e.g., visual/sensor input. How can we obtain such a dependency graph in practice? If the dependency graph is not available, what can we do?
We have provided a detailed discussion regarding how to obtain a good decomposition scheme in our response to the second common comment raised by the reviewers at the beginning of this page. In summary, for many real-world applications (especially those that have physical models), one way to obtain the transition dependence structure is to use prior knowledge of the problem. For example, in power system control, the grid demand is independent of the power generator dispatch actions [1]; in robotic control problems, the environmental factors (like weather, wind speed, and ground conditions) are not influenced by the robot's state and action, and the control signal to one robot will not influence the other robots [2]. When prior knowledge of the problem is inaccessible, we have discussed a potential adaptive decomposition scheme (see the second common response at the beginning of this page). In addition, it is possible to use existing approaches to detect the transition dependence (e.g., Causal Discovery [3] and Variable Influence Structure Analysis [4] in RL), but this is beyond the scope of the current paper.

It is worth noting that our results allow for errors in the decomposition scheme. Such an error term eventually appears as the asymptotically non-vanishing term on the right-hand side of our convergence bound. Therefore, we can start from a fully connected bipartite graph, eliminate edges as much as we can (by learning from the data), and obtain an approximation of the dependence graph (a toy sketch of grouping substates into components from such a graph is given at the end of this response).

We agree with the reviewer that developing generic practical methods to estimate the dependence graph is important. That being said, since this is the first work that considers the bipartite graph structure and develops customized algorithms with improved sample complexity guarantees, developing practically efficient estimation methods is a promising future direction but is out of the scope of this work.

>[1] Chowdhury, Badrul H., and Saifur Rahman. "A review of recent advances in economic dispatch." IEEE Transactions on Power Systems 5, no. 4 (1990): 1248-1259.
>[2] Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11), 1238-1274.
>[3] Hu, X., Zhang, R., Tang, K., Guo, J., Yi, Q., Chen, R., ... \& Chen, Y. (2022). Causality-driven hierarchical structure discovery for reinforcement learning. Advances in Neural Information Processing Systems, 35, 20064-20076.
>[4] Jonsson, A., \& Barto, A. (2006). Causal Graph Based Decomposition of Factored MDPs. Journal of Machine Learning Research, 7(11).
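As a toy illustration of the grouping step mentioned above (our own sketch; `depends` is an assumed representation that maps each next-step substate to the current substates/actions it depends on, and is not the paper's notation):

```python
def connected_components(depends):
    """Group substates that share any dependency into one component."""
    components = []
    for child, parents in depends.items():
        nodes = {child, *parents}
        overlapping = [c for c in components if c & nodes]
        for c in overlapping:             # merge every component that touches this edge set
            components.remove(c)
            nodes |= c
        components.append(nodes)
    return components

# s1 and s2 share the parent s2, so they end up in one component; s3 stands alone.
print(connected_components({"s1": {"s2"}, "s2": {"s2"}, "s3": {"s3"}}))
# e.g. [{'s1', 's2'}, {'s3'}]
```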
#### Q2: How does the proposed model compare with the factored MDP?

Our setting is fundamentally different from the factored MDP. We have provided a detailed discussion in response to the first common comment raised by the reviewers at the beginning of this page.
