# Rebuttal for Kernel graph MDP

## Common Responses

We thank all the reviewers for their insightful feedback. In the following, we first provide responses to several common comments raised by the reviewers. Then, we provide a point-by-point response to each individual comment made by the reviewers right after their respective reviews.

#### Q1: The comparison between our model and factored MDPs.

* **Different structural formulation.** We want to clarify that our model and the factored MDP model are fundamentally different. Specifically, in factored MDPs, it is assumed that the transition kernel and the reward function are both factorizable, with a matching factored set [1]. Under this assumption, a factored MDP can be perfectly decomposed into multiple independent smaller MDPs. In contrast, we only assume a more general bipartite graph-based transition kernel, which is not necessarily factorizable. Even for a fully decomposable kernel graph, our model is still different because we do not make assumptions about the reward function.
* **New technical challenges and our contributions.** Since a factored MDP can be perfectly decomposed into multiple independent smaller MDPs, applying vanilla single-agent RL algorithms to optimize each small MDP suffices for optimizing the overall factored MDP. In our case, since it is in general impossible to perfectly decompose the overall MDP into smaller ones, our algorithm focuses on strategically decomposing and aggregating the kernel graph and sampling the sub-kernel-level transitions, which is the main algorithmic contribution of this work. Technically, to estimate and control the errors that arise due to kernel decomposition and aggregation, we need to develop a more sophisticated variance analysis and concentration techniques for sub-kernels.

>[1] Osband, I. and Van Roy, B. Near-optimal reinforcement learning in factored MDPs. Advances in Neural Information Processing Systems, 27, 2014.

#### Q2: How to get the graph decomposition?

The choice of the decomposition scheme determines the trade-off between the asymptotic bias and the sample complexity, as demonstrated by our convergence bounds and our numerical simulations. Since we are the first to consider such a bipartite graph structure and to develop a customized algorithm, how to choose such a decomposition scheme optimally was not the focus of this work. That being said, we briefly elaborate on several promising future directions for achieving this goal theoretically and empirically.

1. **Theoretically**, to characterize the trade-off and provide an optimal decomposition scheme, we first need to establish a lower bound on the iterates that matches the upper bound. Based on that, we can optimize the two terms on the right-hand side of the bound to obtain the optimal decomposition scheme.
2. **Empirically**, suppose that we have some prior knowledge of the problem, that is, we are able to identify how strong the transition dependence is for different graph edges from the physical model of the problem (which is often possible in practical applications). Then, we can decompose the graph based on such prior knowledge of the transition dependence.
3. **If we do not have any prior knowledge of the problem**, a possible direction is to design an adaptive decomposition algorithm that dynamically chooses the best decomposition (see the sketch after this list). Specifically, we start with the full decomposition, i.e., the number of decomposed components $K$ equals the number of state dimensions. This is the case where the sample complexity is lowest but the error due to decomposition is largest. When the estimation converges but the required accuracy is not achieved, we aggregate some components and obtain a new decomposition with fewer components, which leads to a higher sample complexity and a lower bias. We iterate this process until the required accuracy is reached.
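The following minimal Python sketch illustrates this adaptive loop. The helpers `estimate_and_evaluate` and `aggregate` are hypothetical placeholders for the estimation routine and the component-merging rule; they are not functions from the paper.

```python
# Hypothetical sketch of the adaptive decomposition loop described in point 3 above.
def adaptive_decomposition(state_dims, target_accuracy, estimate_and_evaluate, aggregate):
    """Start from the full decomposition (one component per state dimension)
    and merge components until the required accuracy is reached."""
    components = [[d] for d in state_dims]         # lowest sample complexity, largest bias
    while True:
        error = estimate_and_evaluate(components)  # run the estimator on this decomposition
        if error <= target_accuracy or len(components) == 1:
            return components, error
        # Merge components with strong cross-dependence: higher sample cost, lower bias.
        components = aggregate(components)
```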
#### Q3: Could you provide a concrete example that is not a factored MDP but has a bipartite graph-based transition kernel? I don't understand the reward function part. In general, your state space is combinatorial; how could you design any kind of efficient algorithm without making assumptions on the reward function? In line 13 of Algorithm 1, you state "sample each state-action pair once to obtain reward function r(·, ·)". This step requires iterating over the exponentially large state and action space. I thought the goal was to avoid this.

We thank the reviewer for the feedback.

1. **An example of a bipartite graph-based transition kernel without factored structure**: Consider an MDP with a two-dimensional state $s = (s_1, s_2)$ and a one-dimensional action $a$. For the transition dependence, substate $s_1$'s transition depends on $s_1$ and $s_2$, and substate $s_2$'s transition depends on $s_2$ and $a$. In the factored MDP setting, substates $s_1$ and $s_2$ have independent transitions, and the overall transition probability $P(s'|s,a)$ satisfies
    \begin{align*}
    P(s'|s,a) &= P(s'_1,s'_2|s_1,s_2,a) \\
    &= P(s'_1|s_1,s_2,a)P(s'_2|s_1,s_2,a)\\
    &= P(s'_1|s_1,s_2)P(s'_2|s_2,a).
    \end{align*}
    The second equality follows from the factored probability assumption of factored MDPs [1], and the third equality follows from the partial dependence. However, in our model, since we do not make the factored probability assumption, the second equality does not hold because the transitions of $s_1$ and $s_2$ are not independent. This dependence is due to the fact that $s_1$'s and $s_2$'s transitions are jointly influenced by the common substate $s_2$. Therefore, our kernel graph-based transition model makes weaker assumptions and thus has a wider range of applicability.
2. **Influence of the reward function structure**:
    * **When the reward function has certain structures,** we can avoid sampling all state-action pairs and the exponentially large state-action space dependence. For example, consider the MDP mentioned in point 1 and suppose that the reward function has a low-dimensional structure:
    \begin{align*}
    r(s,a) &= r(s_1,s_2,a) = r_1(s_1,s_2)+r_2(s_1,a)+r_3(s_2,a),
    \end{align*}
    where $r_1$, $r_2$, and $r_3$ are low-dimensional sub-reward components. In this case, we only need to sample all state-action pairs of the components $r_1$, $r_2$, and $r_3$ to recover the overall reward function $r$. The induced sample complexity is $|S_1||S_2|+|S_1||A|+|S_2||A|$, which is much smaller than $|S||A|=|S_1||S_2||A|$ thanks to the reward function structure (see the quick arithmetic check after this response). Note that this is only one example of a reward function structure; the number of sub-reward components and their combination form (e.g., here $r = r_1 + r_2 + r_3$, but we allow any combination function $f$ with $r = f(r_1, r_2, r_3)$) can be much more general. Specifically, the induced sample complexity $D_r$ in the general case satisfies
    \begin{align*}
    D_{r} = \sum_{i=1}^H D_r^i,
    \end{align*}
    where $H$ denotes the number of sub-reward components and $D_r^i$ denotes the size of the state-action space of component $r_i$. Since the sub-reward components depend only partially on the state (action) dimensions, the sample complexity $D_r$ can be exponentially smaller than $|S||A|$. A detailed description is provided in Appendix F of our paper.
    * **Even when the reward function has no structure,** the sample complexity induced by estimating the unknown reward function is often minor compared with that induced by estimating the unknown transition kernel. As the reviewer mentioned, we need to sample all state-action pairs in this case. However, if the reward function is deterministic, we only need to sample each state-action pair once, with sample complexity $|S||A|$, which is much smaller than the minimax sample complexity $\frac{|S||A|}{(1-\gamma)^3\epsilon^2}$ because there is no dependence on the effective horizon $1/(1-\gamma)$ and, more importantly, on the accuracy $\epsilon$. Even when the reward function is stochastic, the induced sample complexity is only $\frac{|S||A|}{(1-\gamma)^2\epsilon^2}$ by the law of large numbers, which is still smaller than the minimax sample complexity. Essentially, the sample complexity induced by estimating the reward function is dominated by that induced by estimating the transition probabilities, which is also why most of the existing literature does not consider the stochasticity of the reward function and focuses on the transition probabilities [2][3].

>[1] Osband, I. and Van Roy, B. Near-optimal reinforcement learning in factored MDPs. Advances in Neural Information Processing Systems, 27, 2014.
>[2] Agarwal, Alekh, Sham Kakade, and Lin F. Yang. "Model-based reinforcement learning with a generative model is minimax optimal." Conference on Learning Theory. PMLR, 2020.
>[3] Gheshlaghi Azar M, Munos R, Kappen H J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model[J]. Machine learning, 2013, 91: 325-349.
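To make the counting above concrete, here is a quick arithmetic check with illustrative sizes of our own choosing (not taken from the paper), say $|S_1|=|S_2|=10$ and $|A|=5$:

```python
# Illustrative reward-sampling counts (assumed sizes, not from the paper).
S1, S2, A = 10, 10, 5

structured = S1 * S2 + S1 * A + S2 * A   # sample r1, r2, r3 component by component
unstructured = S1 * S2 * A               # sample every (s, a) pair of the full reward

print(structured, unstructured)          # 200 vs. 500; the gap grows with dimension
```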
#### Q4: Regarding the discussion on the reward function, all of your arguments apply to FMDP as well. So they are not arguments that help clarify your contribution w.r.t. FMDP. Regarding the first example, are you suggesting that the next $s_1$ and $s_2$ are dependent given the current $s_1, s_2, a$, so the second equality you mentioned does not hold? And I think your assumption is only made on the marginal distribution of $s_1$. Note that I am talking about conditional dependence here, so everything is conditioned on a specific current $s_1, s_2, a$. In this case, $s_1$ and $s_2$ are dependent only when there is a confounder, for example $U$, and both $s_1$ and $s_2$ depend on $U$. If this is the case, this has to be shown in your Figure 1. Figure 1 clearly indicates that $s_1$ and $s_2$ are conditionally independent. Could you make this more concrete? For example, consider binary $s_1$, $s_2$ and $a$ and write out the full conditional probability $P(s_1^{t+1}, s^{t+1}_2 \mid s_1^{t}, s_2^{t}, a^{t})$. Then tell me how this probability does not satisfy the FMDP assumption while satisfying yours.

We thank the reviewer for the follow-up question and for engaging in discussion with us.
* **Regarding the reward function.** We agree with the reviewer that the reward function assumption is not the key difference between our model and the FMDP, and we apologize for the confusion. Instead, the key difference between our model and the FMDP lies in the transition kernel structure, which is also the unknown component of the MDP that usually dominates the sample complexity [1][2]; please refer to the concrete example below. This difference in the transition kernel structure is what requires us to develop new techniques, as detailed in the second bullet point of Common Responses Q1.
* **A concrete example that is not an FMDP but has a bipartite graph-based transition kernel.** As suggested by the reviewer, we provide a concrete example with binary state values to illustrate that the FMDP assumption is not satisfied in our setting. For any $t\geq 0$, consider binary substates $s_1^t$, $s_2^t$ and action $a^t$, where the next substates $s^{t+1}_1$ and $s^{t+1}_2$ depend only on the current substate $s^t_2$, i.e.,
\begin{align*}
P(s_1^{t+1},s_2^{t+1}|s_1^{t},s_2^{t},a^t) = P(s_1^{t+1},s_2^{t+1}|s_2^{t}).
\end{align*}
Next, we provide the full transition probabilities. For any $s_1^t$ and $a^t$, let
$$ P(s_1^{t+1}=0,s_2^{t+1}=0|s_2^{t}=0,s_1^{t},a^t) = 1/3, $$
$$ P(s_1^{t+1}=0,s_2^{t+1}=1|s_2^{t}=0,s_1^{t},a^t) = 0, $$
$$ P(s_1^{t+1}=1,s_2^{t+1}=0|s_2^{t}=0,s_1^{t},a^t) = 0, $$
$$ P(s_1^{t+1}=1,s_2^{t+1}=1|s_2^{t}=0,s_1^{t},a^t) = 2/3, $$
$$ P(s_1^{t+1}=0,s_2^{t+1}=0|s_2^{t}=1,s_1^{t},a^t) = 0, $$
$$ P(s_1^{t+1}=0,s_2^{t+1}=1|s_2^{t}=1,s_1^{t},a^t) = 2/3, $$
$$ P(s_1^{t+1}=1,s_2^{t+1}=0|s_2^{t}=1,s_1^{t},a^t) = 1/3, $$
$$ P(s_1^{t+1}=1,s_2^{t+1}=1|s_2^{t}=1,s_1^{t},a^t) = 0. $$
Based on the above equations, we calculate the marginal transition probabilities as follows: for any $s_1^t$ and $a^t$,
$$ P(s_1^{t+1}=0|s_2^{t}=0) =P(s_1^{t+1}=0|s_2^{t}=0,s_1^{t},a^t) = 1/3, $$
$$ P(s_1^{t+1}=1|s_2^{t}=0) =P(s_1^{t+1}=1|s_2^{t}=0,s_1^{t},a^t) = 2/3, $$
$$ P(s_1^{t+1}=0|s_2^{t}=1) =P(s_1^{t+1}=0|s_2^{t}=1,s_1^{t},a^t) = 2/3, $$
$$ P(s_1^{t+1}=1|s_2^{t}=1) =P(s_1^{t+1}=1|s_2^{t}=1,s_1^{t},a^t) = 1/3, $$
$$ P(s_2^{t+1}=0|s_2^{t}=0) =P(s_2^{t+1}=0|s_2^{t}=0,s_1^{t},a^t) = 1/3, $$
$$ P(s_2^{t+1}=1|s_2^{t}=0) =P(s_2^{t+1}=1|s_2^{t}=0,s_1^{t},a^t) = 2/3, $$
$$ P(s_2^{t+1}=0|s_2^{t}=1) =P(s_2^{t+1}=0|s_2^{t}=1,s_1^{t},a^t) = 1/3, $$
$$ P(s_2^{t+1}=1|s_2^{t}=1) =P(s_2^{t+1}=1|s_2^{t}=1,s_1^{t},a^t) = 2/3. $$
Note that in this example, we have $P(s_1^{t+1}|s_2^{t},s_1^{t},a^t) = P(s_1^{t+1}|s_2^{t})$ and $P(s_2^{t+1}|s_2^{t},s_1^{t},a^t) = P(s_2^{t+1}|s_2^{t})$ for any $s_1^{t},a^t$, thus satisfying our model assumption. However, the factored probability assumption does not hold because
\begin{align*}
P(s_1^{t+1}=0|s_2^{t}=0)P(s_2^{t+1}=0|s_2^{t}=0) = 1/9 \neq P(s_1^{t+1}=0,s_2^{t+1}=0|s_2^{t}=0).
\end{align*}
In essence, the FMDP simultaneously assumes (1) the partial transition dependence and (2) the factored probability, in particular, that the joint probability equals the product of the marginal probabilities. In contrast, we only assume (1) the partial transition dependence, which is the key modeling difference.

>[1] Agarwal, Alekh, Sham Kakade, and Lin F. Yang. "Model-based reinforcement learning with a generative model is minimax optimal." Conference on Learning Theory. PMLR, 2020.
>[2] Gheshlaghi Azar M, Munos R, Kappen H J. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model[J]. Machine learning, 2013, 91: 325-349.
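As a quick numerical sanity check of this example (illustrative code of our own, not part of the paper), one can verify that the product of the marginals differs from the joint:

```python
from fractions import Fraction as F

# Joint next-state distribution conditioned on s2_t = 0, taken from the table above.
joint = {(0, 0): F(1, 3), (0, 1): F(0), (1, 0): F(0), (1, 1): F(2, 3)}

# Marginals of s1_{t+1} and s2_{t+1} given s2_t = 0.
p_s1 = {v: sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)}
p_s2 = {v: sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)}

print(p_s1[0] * p_s2[0], joint[(0, 0)])   # 1/9 vs. 1/3: the kernel is not factorizable
```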
#### Q5: Thanks for the clarification. Now I understand that you do not assume that $s_1$ and $s_2$ are conditionally independent. There is a confounder; let us call it $U$. You will have to make this point very clear in Figure 1. Figure 1 as a DAG clearly indicates that $s_1$ and $s_2$ are conditionally independent, because conditioning on the current state and action, the causal path between $s_1$ and $s_2$ is blocked. I am still confused by how you can estimate the overall transition probability based on the marginal ones when you don't assume conditional independence. It seems that in equation (8) you compute the overall transition based on the marginal ones with the Kronecker product operator. Doesn't this assume conditional independence? Could you use the same concrete example and explain to me how you can reconstruct the overall transition from the marginal ones? Let us assume you have a perfect estimation of the marginal transitions.

**Regarding the kernel graph representation**: We thank the reviewer for proposing an alternative viewpoint on our model. Indeed, using the confounding effect to explain the dependence between the substates is a good perspective for understanding the difference between our model and the FMDP.

**Using marginal probabilities to recover full probabilities**: When there is a confounding variable that impacts two substates (as in the illustrative example provided in our last response), our decomposition scheme will NOT decompose those two substates (in contrast to the FMDP). Specifically, as illustrated in Section 4.1.1 of our submission, our decomposition scheme is based on the division of **disconnected** substates. Next, we use the MDP model provided in our last response to further illustrate this point. For any $t\geq 0$, consider binary substates $s_1^t$, $s_2^t$ and action $a^t$, where the next substates $s^{t+1}_1$ and $s^{t+1}_2$ depend only on the current substate $s^t_2$. We can see that $s^t_2$ influences both $s^{t+1}_1$ and $s^{t+1}_2$. Moreover, since the product of the marginal transition probabilities does not equal the overall transition probability (see our previous response), there exists a confounding variable $U^t_2$ whose realization depends on $s^t_2$ and which influences both $s^{t+1}_1$ and $s^{t+1}_2$. Please refer to the following anonymous link for a pictorial illustration: https://docs.google.com/document/d/e/2PACX-1vSfTQ_hG0VG5eBHXGAQ5P81h8coOcSyS5hn3ruP2wCqp0eBEQbhmqJ_vB8gVnWQzgjgPXstw8bS__Ss/pub

If all state/action variables $s^t_1$, $s^t_2$, $a^t$ and the confounder $U^t_2$ are known, then $s_1^{t+1}$ and $s_2^{t+1}$ are conditionally independent by default [1], i.e.,
$$ P(s_1^{t+1},s_2^{t+1}|s_1^t,s_2^t,a^t,U^t_2) = P(s_1^{t+1}|s_1^t,s_2^t,a^t,U^t_2)P(s_2^{t+1}|s_1^t,s_2^t,a^t,U^t_2). $$
However, since $U_2^t$ is unknown, the conditional independence does not hold due to the confounding effect:
$$ P(s_1^{t+1},s_2^{t+1}|s_1^t,s_2^t,a^t) \neq P(s_1^{t+1}|s_1^t,s_2^t,a^t)P(s_2^{t+1}|s_1^t,s_2^t,a^t) = P(s_1^{t+1}|s_2^{t})P(s_2^{t+1}|s_2^{t}). $$
Therefore, for this example, the FMDP decomposes $s_1^{t+1}$ and $s_2^{t+1}$ and obtains an inexact transition kernel. Our connected-component decomposition, since it is based on the division of disconnected substates (cf. Section 4.2 in our paper), will not decompose these two substates but instead directly estimates the joint distribution $P(s_1^{t+1},s_2^{t+1}|s_2^t)$. Since the confounding effect exists only among substates that are connected, our connected-component decomposition avoids all confounders, and thus the substates in different components are conditionally independent by default [1]. Therefore, the issue of estimating the overall transition probability from marginal transition probabilities does not arise (see the numerical sketch at the end of this response). Compared with the FMDP, since we take the confounding effect into account, we obtain a more accurate and comprehensive description of the model structure.

**Highlighting our contributions --- our proposed decomposition scheme is highly flexible**: A notable feature of our decomposition and analysis framework is that it allows for decomposition errors, which makes the framework more flexible. To see this, consider an MDP with a $100$-dimensional state space $s=(s_1,s_2,\ldots,s_{100})$. Suppose that the substates can be divided into $10$ groups, $(s_1,\cdots,s_{10})$, $(s_{11},\cdots,s_{20})$, $\cdots$, $(s_{91},\cdots,s_{100})$, such that within each group the interconnections are strong, while between different groups there are only weak transition dependences. In this case, the graph is not fully decomposable (and also not factorizable). However, since our analysis allows for decomposition errors, it might still be better to artificially decompose the MDP into $10$ components so that the sample complexity is exponentially reduced while the asymptotic error remains relatively small due to the weak dependence between the components. We have presented numerical simulations (cf. Figure 5) to demonstrate this effect.

>[1] Pearl J. Models, reasoning and inference[J]. Cambridge, UK: Cambridge University Press, 2000, 19(2): 3.
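To make the reconstruction step concrete, here is a minimal numerical sketch of our own (the sizes and values are illustrative assumptions, and the code is not the paper's equation (8) verbatim): within a connected component the joint sub-kernel is estimated directly, and sub-kernels of disconnected components are combined with a Kronecker product.

```python
import numpy as np

# Component 1: joint kernel over (s1, s2), which share the confounding parent s2_t.
# Rows index the current (s1, s2) in the order (0,0), (0,1), (1,0), (1,1); the
# transition depends only on s2_t, so rows with the same s2_t are identical.
P1 = np.array([[1/3, 0.0, 0.0, 2/3],
               [0.0, 2/3, 1/3, 0.0],
               [1/3, 0.0, 0.0, 2/3],
               [0.0, 2/3, 1/3, 0.0]])

# Component 2: a hypothetical extra substate s3 that is disconnected from (s1, s2).
P2 = np.array([[0.9, 0.1],
               [0.2, 0.8]])

# Disconnected components share no parents, hence no confounders, so the overall
# kernel factorizes across components and is recovered by a Kronecker product.
P_full = np.kron(P1, P2)
assert np.allclose(P_full.sum(axis=1), 1.0)   # every row is still a distribution
```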
## Reviewer fzvc

We would like to express our gratitude to the reviewer for their insightful feedback. We are glad that the reviewer recognizes the novelty of our contributions, the theoretical and empirical results that show the advantages of our approach, as well as the clarity of the whole paper. We provide our responses to the questions below.

#### Q1: State-action space is rather small in the experiments.

For the three cases in the experiments, the sizes of the state-action spaces are $625$, $320$, and $810$, which we believe are not small and are sufficient to numerically verify the theoretical findings (for a theoretical work).
That being said, since the advantage of our method becomes clearer as the size of the state-action space grows, we have included additional experiments with larger state-action spaces here:

>https://docs.google.com/document/d/e/2PACX-1vRKjDMB4WecEpznmJfVZoDalryT8eLS8gJlQPBJB-VbOwrQBA0JUWG2STEaNLOFgg9lIsFpe6UsT_4F/pub

#### Q2: How to get the graph decomposition?

Please see our response to the second common comment at the beginning of this page for more details.

## Reviewer 9p8M

We gratefully thank the reviewer for recognizing our contributions to the modeling, the strong theoretical guarantees, and the clarity of presentation. We provide our responses below.

#### Q1: In terms of weakness, I think the setup itself -- a generative model is available for sampling from any state-action pair -- has limited applications in practice.

The generative model is a fundamental and frequently adopted sampling model [1][2][3] for theoretical analysis in RL. Since this is the first work that considers such a bipartite graph structure, we use the generative-model setup to theoretically examine the merit of our proposed algorithm. That being said, our result can potentially be extended to the case where we have a single trajectory of Markovian samples, as long as the Markov chain is uniformly ergodic. The key challenge in this case is to extend the analysis from i.i.d. sampling to Markovian sampling. Given that there are several existing results [4] that provide the tools to handle this challenge (based on a mixing-time argument), we believe the generative model assumption can potentially be relaxed. We will include a discussion of our generative model assumption in the next version of this work.

>[1] Kearns, M., Mansour, Y., and Ng, A. Y. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine learning, 49:193–208, 2002.
>[2] Kakade, Sham Machandranath. On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003.
>[3] Azar, Mohammad Gheshlaghi, Rémi Munos, and Hilbert Kappen. "On the Sample Complexity of Reinforcement Learning with a Generative Model." In International Conference on Machine Learning. 2012.
>[4] Li, G., Wei, Y., Chi, Y., Gu, Y., \& Chen, Y. (2020). Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Advances in neural information processing systems, 33, 7031-7043.

#### Q2: Identifying a meaningful decomposition is challenging.

Please see our response to the second common comment at the beginning of this page for more details.

#### Q3: In the fully decomposable case, is the setup equivalent to a factored MDP?

We want to clarify that even for a fully decomposable kernel graph, our model is different from the factored MDP because we make no assumptions on the reward function or the matching factored set. We have provided a more detailed discussion of the comparison between our model and the factored MDP in our response to the first common comment at the beginning of this page.

#### Q4: Questions on the warm-up model --- it seems not realistic and won't benefit a lot from decomposition.

There are two reasons that motivate us to present the warm-up model.

1. Although very simple, the substate modeling and decomposable structures can be found in a wide variety of real-world decision-making problems, such as electric vehicle charging, building thermal control, and electricity market trading [1].
2. The 3-state model (which consists of independent substates, Markovian substates, and dependent substates) can be viewed as an illustrative example that provides important first-glance intuition: the independent substate does not influence the sample complexity; the Markovian substate introduces a sample complexity that is linear in the size of the Markovian substate space; and the dependent substate introduces a sample complexity that is linear in the size of the dependent substate-action space.

To address the reviewer's comment, we will add another example that does not have such a fully decomposable structure to the warm-up section in the next version of our paper.

>[1] Yeh, C., Li, V., Datta, R., Arroyo, J., Christianson, N., Zhang, C., ... \& Wierman, A. (2024). SustainGym: Reinforcement Learning Environments for Sustainable Energy Systems. Advances in Neural Information Processing Systems, 36.

## Reviewer rr4w

#### Q1: What's the difference between this work and factored MDPs, which also have a decomposition process? What's the relation between the structure of this work and the more general low Bellman rank assumption?

The modeling assumptions of our work and of factored MDPs are, in fact, fundamentally different; please see our response to the first common comment at the beginning of this page for more details.

Regarding works that make low Bellman-rank assumptions, they are not directly comparable to our results. Specifically, the Bellman rank was introduced to characterize the limitation of function approximation in RL [1]. There is another line of work studying low-rank MDPs [2][3][4], which is likewise a framework for studying the limitation of function approximation. Specifically, in a low-rank MDP, it is assumed that the transition probability matrix can be factored into the product of two low-rank matrices, similar to using linear function approximation for the value function. In contrast, our model focuses on a possibly decomposable transition kernel based on the Kronecker product, which targets a quite different type of structure.
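To illustrate the distinction loosely (this is our own toy contrast with assumed sizes, not a construction from either line of work): a low-rank MDP posits a matrix-product factorization of the kernel, whereas our decomposition posits a Kronecker-product structure across components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Low-rank MDP style structure: P is (approximately) a product of two thin matrices.
SA, S_next, d = 6, 6, 2
Phi, Mu = rng.random((SA, d)), rng.random((d, S_next))
P_lowrank = Phi @ Mu                       # rank at most d (unnormalized, shape illustration only)

# Kernel-graph style structure: P is a Kronecker product of small sub-kernels.
P_a = rng.dirichlet(np.ones(2), size=2)    # 2x2 row-stochastic sub-kernel
P_b = rng.dirichlet(np.ones(3), size=3)    # 3x3 row-stochastic sub-kernel
P_kron = np.kron(P_a, P_b)                 # 6x6 kernel with component-wise structure
```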
In the next version of our paper, we will properly cite the works that study MDPs with low Bellman rank and discuss the relation.

>[1] Jiang, Nan, et al. "Contextual decision processes with low bellman rank are pac-learnable." International Conference on Machine Learning. PMLR, 2017.
>[2] Huang, A., Chen, J., & Jiang, N. (2023, July). Reinforcement learning in low-rank mdps with density features. In International Conference on Machine Learning (pp. 13710-13752). PMLR.
>[3] Agarwal, Alekh, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. "Flambe: Structural complexity and representation learning of low rank mdps." Advances in neural information processing systems 33 (2020): 20095-20107.
>[4] Mhammedi, Zak, Adam Block, Dylan J. Foster, and Alexander Rakhlin. "Efficient model-free exploration in low-rank mdps." Advances in Neural Information Processing Systems 36 (2024).

#### Q2: Is the main contribution caused by the generative model assumption?

We want to clarify that the improvement in our sample complexity is not caused by the generative-model assumption. In fact, the generative-model assumption is a standard assumption in the theoretical analysis of RL and was also imposed in [1], which develops the minimax lower bound. Our key contribution is to identify bipartite graph structures in MDPs and, in turn, develop a customized algorithm with a novel sampling scheme that achieves an improved sample complexity compared with the minimax lower bound.

#### Q3: What is the advantage of the proposed method compared to value function approximation methods?

Function approximation (FA) and our method are not mutually exclusive. Specifically, after decomposing the transition kernel into multiple small-scale sub-kernels, one can use FA to approximate the optimal value function (or Q-function) for the small-scale transition kernels. It is an interesting future direction to theoretically (and also empirically) verify the performance of our algorithm combined with FA.

#### Q4: What is the computational complexity of Algorithm 1?

In our algorithm, after obtaining an estimated model (including the transition kernel and the reward) using the sampling scheme designed around the bipartite graph structure, we use standard value iteration to find an optimal policy. Since value iteration converges geometrically fast [2], it is often not the performance bottleneck. In addition, one can incorporate function approximation into such a value iteration method, as suggested by the reviewer, to further overcome the curse of dimensionality.
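For reference, a minimal sketch of this planning step on an estimated tabular model (the array shapes and names are our illustrative assumptions, not the paper's Algorithm 1):

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-8):
    """P: (S, A, S) estimated transition kernel, r: (S, A) estimated rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = r + gamma * P.dot(V)              # Bellman backup, shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:   # contraction => geometric convergence
            return V_new, Q.argmax(axis=1)    # optimal values and a greedy policy
        V = V_new
```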
We will add the above discussion regarding the computational complexity to the next version of this work.

>[1] Azar, Mohammad Gheshlaghi, Rémi Munos, and Hilbert Kappen. "On the Sample Complexity of Reinforcement Learning with a Generative Model." In International Conference on Machine Learning. 2012.
>[2] Sutton, R. S., \& Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.

#### Q5: The simulation should compare with the algorithms designed for factored MDPs as well if the authors argue that this is a different setting.

As explained in our response to the first common comment at the beginning of this page, the model assumptions of the factored MDP and of our setting are fundamentally different, even in the decomposable kernel graph case. We did not numerically compare our algorithm to theirs because we feel it would be an unfair comparison, as the algorithms require different model assumptions to work. Specifically, the factored MDP requires a much stronger assumption on the matching factored transition kernel and reward function. If we directly apply their customized algorithm to our non-factored model, forcing a factorization will introduce additional errors, making the comparison unfair. That being said, to address the reviewer's concern, we will add corresponding numerical comparisons in the next version of our paper.

#### Q6: The paper should have a more detailed comparison between the proposed setting and the other structural assumptions in the literature. No apparent societal impact.

In the next version of this work, we will provide a more detailed comparison with works that study factored MDPs, weakly coupled MDPs [1], pseudo-stochastic state MDPs [2], networked MDPs [3][4], confounded MDPs [5], and low Bellman rank MDPs [6][7].

>[1] Meuleau, N., Hauskrecht, M., Kim, K.-E., Peshkin, L., Kaelbling, L. P., Dean, T. L., and Boutilier, C. Solving very large weakly coupled Markov decision processes. In AAAI/IAAI, pp. 165–172, 1998.
>[2] Wei, H., Liu, X., Wang, W., and Ying, L. Sample efficient reinforcement learning in mixed systems through augmented samples and its applications to queueing networks. Preprint arXiv:2305.16483, 2023.
>[3] Zhang, Y., Qu, G., Xu, P., Lin, Y., Chen, Z., and Wierman, A. Global convergence of localized policy iteration in networked multi-agent reinforcement learning. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7(1):1–51, 2023.
>[4] Qu, G., Lin, Y., Wierman, A., and Li, N. Scalable multi-agent reinforcement learning for networked systems with average reward. Advances in Neural Information Processing Systems, 33:2074–2086, 2020.
>[5] Zhang, J. and Bareinboim, E. Markov decision processes with unobserved confounders: A causal approach. Technical Report R-23, Purdue AI Lab, 2016.
>[6] Jiang, Nan, et al. "Contextual decision processes with low bellman rank are pac-learnable." International Conference on Machine Learning. PMLR, 2017.
>[7] Huang, A., Chen, J., & Jiang, N. (2023, July). Reinforcement learning in low-rank mdps with density features. In International Conference on Machine Learning (pp. 13710-13752). PMLR.

## Reviewer 6KFY

#### Q1: The knowledge of the dependency graph is in general hard to obtain in high-dimensional real-world applications with, e.g., visual/sensor input. How can we obtain such a dependency graph in practice? If the dependency graph is not available, what can we do?
We have provided a detailed discussion regarding how to obtain a good decomposition scheme in our response to the second common comment raised by the reviewers at the beginning of this page. In summary, for many real-world applications (especially those that have physical models), one way to obtain the transition dependence structure is to use prior knowledge of the problem. For example, in power system control, the grid demand is independent of the power generator dispatch actions [1]; in robotic control problems, the environmental factors (like weather, wind speed, and ground conditions) are not influenced by the robot's state and action, and the control signal to one robot will not influence the other robots [2]. When prior knowledge of the problem is inaccessible, we have discussed a potential adaptive decomposition scheme (see the second common response at the beginning of this page). In addition, it is possible to use existing approaches to detect the transition dependence (e.g., Causal Discovery [3] and Variable Influence Structure Analysis [4] in RL), but this is beyond the scope of the current paper.

It is worth noting that our results allow for errors in the decomposition scheme. Such an error term eventually appears as the asymptotically non-vanishing term on the right-hand side of our convergence bound. Therefore, we can start from a fully connected bipartite graph, eliminate edges as much as we can (by learning from the data), and obtain an approximation of the dependence graph (a toy sketch of grouping substates into components from such a graph is given at the end of this response).

We agree with the reviewer that developing generic practical methods to estimate the dependence graph is important. That being said, since this is the first work that considers the bipartite graph structure and develops customized algorithms with improved sample complexity guarantees, developing practically efficient estimation methods is a promising future direction but is out of the scope of this work.

>[1] Chowdhury, Badrul H., and Saifur Rahman. "A review of recent advances in economic dispatch." IEEE Transactions on Power Systems 5, no. 4 (1990): 1248-1259.
>[2] Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11), 1238-1274.
>[3] Hu, X., Zhang, R., Tang, K., Guo, J., Yi, Q., Chen, R., ... \& Chen, Y. (2022). Causality-driven hierarchical structure discovery for reinforcement learning. Advances in Neural Information Processing Systems, 35, 20064-20076.
>[4] Jonsson, A., \& Barto, A. (2006). Causal Graph Based Decomposition of Factored MDPs. Journal of Machine Learning Research, 7(11).
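As a toy illustration of the grouping step mentioned above (our own sketch; `depends` is an assumed representation that maps each next-step substate to the current substates/actions it depends on, and is not the paper's notation):

```python
def connected_components(depends):
    """Group substates that share any dependency into one component."""
    components = []
    for child, parents in depends.items():
        nodes = {child, *parents}
        overlapping = [c for c in components if c & nodes]
        for c in overlapping:             # merge every component that touches this edge set
            components.remove(c)
            nodes |= c
        components.append(nodes)
    return components

# s1 and s2 share the parent s2, so they end up in one component; s3 stands alone.
print(connected_components({"s1": {"s2"}, "s2": {"s2"}, "s3": {"s3"}}))
# e.g. [{'s1', 's2'}, {'s3'}]
```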
#### Q2: How does the proposed model compare with the factored MDP?

Our setting is fundamentally different from the factored MDP. We have provided a detailed discussion in response to the first common comment raised by the reviewers at the beginning of this page.
