# Multi-Agent Reinforcement Learning (2/3): Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning

This blog is based on the paper *"Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning"* by Kuba et al., available at https://arxiv.org/pdf/2109.11251.pdf. It is the second blog in the series "Multi-Agent Reinforcement Learning". You can read its prequel at https://hackmd.io/rkNojzNzQzWXlU0HoaPOrg.

## The Limitations of Multi-Agent Policy Gradients

We are already familiar with **MARL**, in which independent agents want to optimize their parameterized policies $\pi^i$ to maximize

$$\mathcal{J}(\boldsymbol{\pi}) = \mathbb{E}_{s\sim \rho^{0:\infty}_{\boldsymbol{\pi}},\, \boldsymbol{a}_{0:\infty}\sim \boldsymbol{\pi} }\Big[ \sum_{t=0}^{\infty}\gamma^t r(s_t, \boldsymbol{a}_t)\Big].$$

Following the traditional deep RL approach, we would model every agent's policy with a neural network $\theta^i$ as $\pi^i_{\theta^i}$, and learn it with stochastic gradient ascent:

$$\theta^i \gets \theta^i + \alpha \nabla_{\theta^i}\mathcal{J}(\boldsymbol{\pi}_{\boldsymbol{\theta}}).$$

Such a policy-gradient (PG) approach is already known to be problematic in single-agent RL (and in machine learning in general), because the step taken by a gradient update can be too large and harm performance. In multi-agent PG (MAPG) the problem is even more severe, because the update directions pointed to by the agents' gradients can conflict...

What?! The gradient should always lead up the reward, right?! Not exactly. Bear in mind that in MARL every agent follows its OWN gradient, which tells that agent what *it* can do to improve the joint performance. When all agents try to do this at the same time, the joint update may be a disaster.

![](https://i.imgur.com/SNckxAw.jpg)

It's like driving a car with your friend when the road forks around a tree. Assuming you do nothing, your friend pulls the wheel right to avoid a crash. However, being mindful, you want to avoid the tree by turning left. You can't, though, because the force applied by your friend is stopping you. The wheel remains still...

![](https://i.imgur.com/ZIXFRg8.jpg)

Therefore, to make decisions that are beneficial for the whole team, the agents must ***collaborate***. Unfortunately, name a method, be it MADDPG, IPPO, or MAPPO: all of them make the agents mind only themselves and follow their own gradients. Hence, we still have no clue how to assure performance improvement in MARL... until now 😁
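To make the failure mode above concrete, here is a minimal sketch of the naive independent policy-gradient update, in which every agent simultaneously ascends its own gradient of the joint objective. The policy modules and the differentiable return estimator are hypothetical placeholders, not the authors' code.

```python
import torch

# Naive multi-agent policy gradient: each agent follows its OWN gradient of the
# joint return estimate; nothing coordinates the simultaneous updates.
# `policies` is a hypothetical list of torch.nn.Module policies (one per agent);
# `estimate_joint_return` is a hypothetical differentiable estimator of J(pi_theta).
def independent_pg_step(policies, batch, estimate_joint_return, lr=3e-4):
    objective = estimate_joint_return(policies, batch)       # scalar surrogate of J
    grads = [torch.autograd.grad(objective, p.parameters(), retain_graph=True)
             for p in policies]                               # grad_{theta^i} J for each i
    with torch.no_grad():
        for policy, grad in zip(policies, grads):
            for param, g in zip(policy.parameters(), grad):
                param.add_(lr * g)                            # theta^i += alpha * grad_i
```

Each agent's step looks sensible in isolation, yet nothing prevents the combined update from being harmful, which is exactly the "two hands on one wheel" problem.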
## Multi-Agent Trust Region Learning

In single-agent RL, trust region learning enables stable updates and policy improvement: at every iteration $k$, the new policy $\pi_{k+1}$ increases the return,

$$\mathcal{J}(\pi_{k+1}) \geq \mathcal{J}(\pi_k).$$

For the reasons described above, simply applying trust region learning to MARL fails: even if a trust-region update guarantees improvement for one agent, the agents' simultaneous updates can still damage the whole team. Today, however, you will see the new *multi-agent trust region learning*, which implements cooperation and leads to joint policy improvement 🎉. Its key ingredients are novel multi-agent functions, which describe the contribution of subsets of agents to the joint return.

### Multi-Agent Advantage Decomposition

First, the *multi-agent state-action value function* for an arbitrary ordered agent subset $i_{1:m} = \{i_1, \dots, i_m\}$ is defined as

$$Q_{\boldsymbol{\pi}}^{i_{1:m}}(s, \boldsymbol{a}^{i_{1:m}}) \triangleq \mathbb{E}_{\boldsymbol{a}^{-i_{1:m}}\sim\boldsymbol{\pi}^{-i_{1:m}}}\big[Q_{\boldsymbol{\pi}}(s, \boldsymbol{a}^{i_{1:m}}, \boldsymbol{a}^{-i_{1:m}} ) \big].$$

Simply speaking, this function says what the average return is if agents $i_{1:m}$ take the joint action $\boldsymbol{a}^{i_{1:m}}$ at state $s$. On top of it we can define the *multi-agent advantage function*

$$A_{\boldsymbol{\pi}}^{i_{1:m}}(s, \boldsymbol{a}^{j_{1:k}}, \boldsymbol{a}^{i_{1:m}}) = Q^{j_{1:k}, i_{1:m}}_{\boldsymbol{\pi}}(s, \boldsymbol{a}^{j_{1:k}}, \boldsymbol{a}^{i_{1:m}}) - Q^{j_{1:k}}_{\boldsymbol{\pi}}(s, \boldsymbol{a}^{j_{1:k}}).$$

This function compares the quality of the joint action $\boldsymbol{a}^{i_{1:m}}$ of agents $i_{1:m}$, taken given the joint action $\boldsymbol{a}^{j_{1:k}}$ of agents $j_{1:k}$, against the average. Just think how useful it would be if $i_{1:m}$ could know $\boldsymbol{a}^{j_{1:k}}$ and the multi-agent advantage. They could then "react" cleverly by choosing a joint action $\boldsymbol{a}^{i_{1:m}}$ with a large multi-agent advantage...

Actually, there is a lemma which describes the awesome consequence of such a scenario, known as the ***Multi-Agent Advantage Decomposition Lemma***: for any ordered subset $i_{1:m}$ of agents,

$$A^{i_{1:m}}_{\boldsymbol{\pi}}(s, \boldsymbol{a}^{i_{1:m}}) = \sum\limits_{j=1}^{m}A^{i_j}_{\boldsymbol{\pi}}(s, \boldsymbol{a}^{i_{1:j-1}}, a^{i_j}).$$

Isn't it amazing 😀?! If every agent $i_j$ knew what agents $i_{1:j-1}$ do, then it could react with an action $a^{i_j}_*$ that maximizes its own multi-agent advantage (whose maximum is never negative).

![](https://i.imgur.com/Ue5arsO.png)

Then, by setting $m=n$, the lemma assures that the joint advantage will be positive! And if the number of agents is large, it should actually be "very" positive! Let's go and see how to build a learning algorithm around this idea.
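Before we do, here is a tiny numerical sanity check of the decomposition lemma on a two-agent, single-state example. The tabular values and policies are made up for illustration; this is a sketch, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one state, two agents with 3 and 4 actions, independent policies.
n1, n2 = 3, 4
Q = rng.normal(size=(n1, n2))       # joint values Q_pi(s, a1, a2), arbitrary here
pi1 = rng.dirichlet(np.ones(n1))    # agent 1's policy pi^1(.|s)
pi2 = rng.dirichlet(np.ones(n2))    # agent 2's policy pi^2(.|s)

# Multi-agent Q-functions: average out the actions of agents outside the subset.
V  = pi1 @ Q @ pi2                  # empty subset: V_pi(s)
Q1 = Q @ pi2                        # Q_pi^{1}(s, a1) = E_{a2~pi2}[Q_pi(s, a1, a2)]

# Multi-agent advantages.
A_joint = Q - V                     # A_pi^{1,2}(s, a1, a2)
A1 = Q1 - V                         # A_pi^{1}(s, a1)
A2 = Q - Q1[:, None]                # A_pi^{2}(s, a1, a2): agent 2 "reacts" to a1

# Multi-Agent Advantage Decomposition Lemma: A^{1,2} = A^{1} + A^{2}.
assert np.allclose(A_joint, A1[:, None] + A2)
print("decomposition holds for every joint action (a1, a2)")
```

The identity is exact because the terms telescope: each agent's multi-agent advantage adds its own contribution on top of what the preceding agents have already secured.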
### Monotonic Improvement

To learn joint policies which perform well throughout the whole game, we must look at a bigger picture than a single state and action: we consider their marginal distributions. Suppose our agents follow a joint policy $\boldsymbol{\pi}=(\pi^1, \dots, \pi^n)$ and decide to learn according to some order $i_{1:n}$. Suppose that $i_1, \dots, i_{m-1}$ have already made their updates to new policies $(\bar{\pi}^{i_1}, \dots, \bar{\pi}^{i_{m-1}}) = \boldsymbol{\bar{\pi}}^{i_{1:m-1}}$. Then, for any candidate policy $\hat{\pi}^{i_m}$, we define the surrogate return

$$L_{\boldsymbol{\pi}}^{i_{1:m}}(\boldsymbol{\bar{\pi}}^{i_{1:m-1}}, \hat{\pi}^{i_m}) = \mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}},\, \boldsymbol{a}^{i_{1:m-1}}\sim \boldsymbol{\bar{\pi}}^{i_{1:m-1}},\, a^{i_m}\sim \hat{\pi}^{i_m}}\big[ A_{\boldsymbol{\pi}}^{i_m}(s, \boldsymbol{a}^{i_{1:m-1}}, a^{i_m}) \big].$$

This definition is just a small step beyond the multi-agent advantage: here, agent $i_m$ reacts to the others not with a specific action, but with a whole policy $\hat{\pi}^{i_m}$.

Fortunately, in this "bigger" setting another decomposition lemma holds. Define $C = \frac{4\gamma \max_{s, \boldsymbol{a}} |A_{\boldsymbol{\pi}}(s, \boldsymbol{a})|}{(1-\gamma)^2}$. Then

$$\mathcal{J}(\boldsymbol{\bar{\pi}}) \geq \mathcal{J}(\boldsymbol{\pi}) + \sum\limits_{m=1}^{n}\Big[ L_{\boldsymbol{\pi}}^{i_{1:m}}(\boldsymbol{\bar{\pi}}^{i_{1:m-1}}, \bar{\pi}^{i_m}) - C D_{\text{KL}}^{\text{max}}(\pi^{i_m}, \bar{\pi}^{i_m})\Big].$$

This lemma provides a lower bound on the performance of the new joint policy, a lower bound that is decomposed among the agents. Such a decomposition allows the agents to improve, one by one, the guarantee on the performance of the next joint policy. It is tailored for the *sequential update scheme*: the agents update their policies in turn by solving

$$\bar{\pi}^{i_m} = \operatorname*{argmax}_{\hat{\pi}^{i_m}} \ L_{\boldsymbol{\pi}}^{i_{1:m}}(\boldsymbol{\bar{\pi}}^{i_{1:m-1}}, \hat{\pi}^{i_m}) - C D_{\text{KL}}^{\text{max}}(\pi^{i_m}, \hat{\pi}^{i_m}).$$

As the maximizing update is at least as good as no update (for which the above objective equals zero), every agent guarantees that

$$L_{\boldsymbol{\pi}}^{i_{1:m}}(\boldsymbol{\bar{\pi}}^{i_{1:m-1}}, \bar{\pi}^{i_m}) - C D_{\text{KL}}^{\text{max}}(\pi^{i_m}, \bar{\pi}^{i_m}) \geq 0.$$

Together with the decomposition inequality above, agents following this protocol achieve the **monotonic improvement property**:

$$\mathcal{J}(\boldsymbol{\bar{\pi}}) \geq \mathcal{J}(\boldsymbol{\pi}).$$

Hurrah!!! 🧨🎆 We've done it! We figured out how to make the agents improve the joint return, all with a small-update guarantee thanks to the $D_{\text{KL}}^{\text{max}}(\pi^{i_m}, \bar{\pi}^{i_m})$ penalty.

You may wonder: *if any order of updates in the sequential update scheme guarantees monotonic improvement, which order should we use?* That is a good question, and it is closely related to another one: *what would the agents' behaviour look like at convergence to the optimal joint policy?* Both are studied in the paper.

Unfortunately, just as with trust region learning in single-agent RL, we cannot implement this protocol exactly in games with large state spaces. But don't worry! We can approximate it and boost it with neural networks, as we describe below.
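Schematically, the exact protocol is just a loop over the agents in which each one solves its penalised maximization while conditioning on the updates made before it. In the sketch below, `agent.update` stands for a hypothetical routine that performs this maximization; it illustrates the scheme, not the paper's implementation.

```python
# Sequential update scheme (schematic sketch).
# `agents` follow some order i_1, ..., i_n; `rollout_data` was collected under the
# old joint policy pi. Each `agent.update` is assumed to (approximately) maximise
#   L^{i_{1:m}}_pi(updated_so_far, pi_hat) - C * D_KL^max(pi_old^{i_m}, pi_hat),
# which is non-negative at the optimum, so the bound on J can only improve.
def sequential_update(agents, rollout_data):
    updated_so_far = []                 # \bar{pi}^{i_1}, ..., \bar{pi}^{i_{m-1}}
    for agent in agents:
        new_policy = agent.update(rollout_data, preceding=updated_so_far)
        updated_so_far.append(new_policy)
    return updated_so_far               # the new joint policy \bar{pi}
```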
## Deep Algorithms

Now you will learn about (and, hopefully, get excited by) the new state-of-the-art deep MARL algorithms: *Heterogeneous-Agent Trust Region Policy Optimization* (**HATRPO**) and *Heterogeneous-Agent Proximal Policy Optimization* (**HAPPO**).

### HATRPO

What we'd like to do is make every agent $i_m$, one after another, solve

$$\bar{\pi}^{i_m} = \operatorname*{argmax}_{\hat{\pi}^{i_m}} \ L_{\boldsymbol{\pi}}^{i_{1:m}}(\boldsymbol{\bar{\pi}}^{i_{1:m-1}}, \hat{\pi}^{i_m}) - C D_{\text{KL}}^{\text{max}}(\pi^{i_m}, \hat{\pi}^{i_m}).$$

This is very similar to the objective of single-agent trust region learning, $L_{\pi}(\hat{\pi}) - C D_{\text{KL}}^{\text{max}}(\pi, \hat{\pi})$, which, in the case of neural network policies, is solved approximately via the constrained objective of TRPO:

$$\begin{aligned} \bar{\pi} &= \operatorname*{argmax}_{\hat{\pi}} \ L_{\pi}(\hat{\pi}), \ \text{s.t.} \ \overline{D}_{\text{KL}}(\pi, \hat{\pi}) \leq \delta \\ &= \operatorname*{argmax}_{\hat{\pi}} \ \mathbb{E}_{s\sim\rho_{\pi},\, a\sim\pi}\Big[ \frac{\hat{\pi}(a|s)}{\pi(a|s)} A_{\pi}(s, a)\Big], \ \text{s.t.} \ \overline{D}_{\text{KL}}(\pi, \hat{\pi}) \leq \delta, \end{aligned}$$

where $\delta$ is a small constraint parameter controlling the size of the update. The key to making this objective solvable is that the distribution over which the expectation is taken comes entirely from the "old" policy $\pi$, which lets us estimate the objective from sampled data and differentiate it automatically.

Unfortunately, our multi-agent trust-region objective (written out),

$$\mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}},\, \boldsymbol{a}^{i_{1:m-1}}\sim\boldsymbol{\bar{\pi}}^{i_{1:m-1}},\, a^{i_m}\sim\hat{\pi}^{i_m}}\big[ A^{i_m}_{\boldsymbol{\pi}}(s, \boldsymbol{a}^{i_{1:m-1}}, a^{i_m}) \big], \ \text{s.t.} \ \overline{D}_{\text{KL}}(\pi^{i_m}, \hat{\pi}^{i_m}) \leq \delta,$$

involves not only the old and the candidate policies $\pi^{i_m}$ and $\hat{\pi}^{i_m}$, but also the just-updated joint policy of agents $i_{1:m-1}$. 😭 So hard!!!

Not really! We can use importance sampling to make estimation of this objective feasible. Recall that the agents $i_{1:m-1}$ have already made their update. Then $i_m$ is free to compute the ratio

$$\frac{ \hat{\pi}^{i_m}(a^{i_m}|s)}{ \pi^{i_m}(a^{i_m}|s)} \cdot \frac{ \boldsymbol{\bar{\pi}}^{i_{1:m-1}}(\boldsymbol{a}^{i_{1:m-1}}|s) }{ \boldsymbol{\pi}^{i_{1:m-1}}(\boldsymbol{a}^{i_{1:m-1}}|s) }.$$

So let the agent compute it and plug in what it has, namely data coming from the old joint policy $\boldsymbol{\pi}$. It turns out that the multi-agent trust-region objective can be equivalently written as

$$\mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}},\, \boldsymbol{a}\sim\boldsymbol{\pi} }\Big[ \frac{ \hat{\pi}^{i_m}(a^{i_m}|s)}{ \pi^{i_m}(a^{i_m}|s)} \cdot \frac{ \boldsymbol{\bar{\pi}}^{i_{1:m-1}}(\boldsymbol{a}^{i_{1:m-1}}|s) }{ \boldsymbol{\pi}^{i_{1:m-1}}(\boldsymbol{a}^{i_{1:m-1}}|s) } A_{\boldsymbol{\pi}}(s, \boldsymbol{a}) \Big], \ \text{s.t.} \ \overline{D}_{\text{KL}}(\pi^{i_m}, \hat{\pi}^{i_m}) \leq \delta.$$

Yes, $A_{\boldsymbol{\pi}}(s, \boldsymbol{a})$ is the joint advantage function. The agents don't have to train special critics to compute the multi-agent advantage; all they have to do is maintain a joint advantage estimator, such as GAE. Furthermore, for brevity, we can write

$$M_{\boldsymbol{\pi}}(s, \boldsymbol{a}) = \frac{ \boldsymbol{\bar{\pi}}^{i_{1:m-1}}(\boldsymbol{a}^{i_{1:m-1}}|s) }{ \boldsymbol{\pi}^{i_{1:m-1}}(\boldsymbol{a}^{i_{1:m-1}}|s) } A_{\boldsymbol{\pi}}(s, \boldsymbol{a}),$$

which transforms our problem into the well-known TRPO objective! 🎆🎇🎈

$$\mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}},\, \boldsymbol{a}\sim\boldsymbol{\pi} }\Big[ \frac{ \hat{\pi}^{i_m}(a^{i_m}|s)}{ \pi^{i_m}(a^{i_m}|s)} M_{\boldsymbol{\pi}}(s, \boldsymbol{a}) \Big], \ \text{s.t.} \ \overline{D}_{\text{KL}}(\pi^{i_m}, \hat{\pi}^{i_m}) \leq \delta.$$
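In code, the estimator of this objective only needs log-probabilities and the joint advantage. The sketch below shows how it could be computed for a batch collected under the old joint policy; the tensor names and shapes are assumptions for illustration, not the authors' implementation.

```python
import torch

# Estimate agent i_m's importance-sampled surrogate from a batch gathered under pi.
# logp_updated / logp_old_others : log \bar{pi}^{i_1:m-1}(a^{i_1:m-1}|s) and
#                                  log pi^{i_1:m-1}(a^{i_1:m-1}|s), summed over agents
# logp_new_im  / logp_old_im     : log prob of a^{i_m} under the candidate and old policy
# adv                            : joint advantage estimates A_pi(s, a), e.g. from GAE
def hatrpo_surrogate(logp_new_im, logp_old_im, logp_updated, logp_old_others, adv):
    m = torch.exp(logp_updated - logp_old_others) * adv   # M_pi(s, a)
    ratio = torch.exp(logp_new_im - logp_old_im)          # pi_hat^{i_m} / pi^{i_m}
    return (ratio * m).mean()   # maximise subject to the average-KL constraint <= delta
```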
Having turned the multi-agent trust-region problem into the objective above, agent $i_m$ (with a neural network policy parameterised by $\theta^{i_m}$) can maximise it by performing a TRPO step:

$$\theta^{i_m}_{\text{new}} = \theta^{i_m}_{\text{old}} + \alpha^j \sqrt{ \frac{2\delta}{\boldsymbol{g}^{i_m\top}(\boldsymbol{H}^{i_m})^{-1} \boldsymbol{g}^{i_m}}}\, (\boldsymbol{H}^{i_m})^{-1} \boldsymbol{g}^{i_m}.$$

Here, just like in TRPO, $\boldsymbol{g}^{i_m}$ is the gradient of $i_m$'s objective, and the matrix

$$\boldsymbol{H}^{i_m} = \nabla^{2}_{\theta^{i_m}}\overline{D}_{\text{KL}}\big( \pi^{i_m}_{\theta^{i_m}_{\text{old}}}, \pi^{i_m}_{\theta^{i_m}}\big)$$

is the Hessian of the average KL-divergence from the old policy. As you may know from TRPO, $\alpha^j$ is a backtracking coefficient raised to the power $j$; we choose the smallest $j\in\mathbb{N}$ for which the update improves the objective estimated from the data. In this way, we enforce an update that is as good as we can get from the available data 🤩.

### HAPPO

You may not like the second-order differentiation that HATRPO uses: it is harder to code up and more computationally expensive, and sometimes we just want to implement and run an algorithm quickly. For such cases, we also developed an implementation of multi-agent trust-region learning based on proximal policy optimization (PPO). As the constrained HATRPO objective has the same algebraic form as TRPO's, it can be optimized with the *clip objective*

$$\mathbb{E}_{s\sim\rho_{\boldsymbol{\pi}},\, \boldsymbol{a}\sim\boldsymbol{\pi} }\Big[ \min\Big( \frac{ \hat{\pi}^{i_m}(a^{i_m}|s)}{ \pi^{i_m}(a^{i_m}|s)} M_{\boldsymbol{\pi}}(s, \boldsymbol{a}),\ \text{clip}\Big(\frac{ \hat{\pi}^{i_m}(a^{i_m}|s)}{ \pi^{i_m}(a^{i_m}|s)}, 1\pm \epsilon \Big) M_{\boldsymbol{\pi}}(s, \boldsymbol{a}) \Big)\Big].$$

The clip operator replaces the policy ratio with $1-\epsilon$ or $1+\epsilon$ whenever the ratio leaves the interval $[1-\epsilon, 1+\epsilon]$ from below or above; otherwise, the ratio remains unchanged. For example, $\text{clip}(1.2, 1\pm 0.1) = 1.1$ and $\text{clip}(0.8, 1\pm 0.1) = 0.9$. This discourages large policy updates. The clip objective is differentiable with respect to the policy parameters, so all we have to do is initialise $\theta^{i_m} = \theta^{i_m}_{\text{old}}$ and repeat a few times:

$$\boldsymbol{g}^{i_m}_{\text{HAPPO}} \gets \nabla_{\theta^{i_m}}L^{\text{HAPPO}}(\theta^{i_m}), \qquad \theta^{i_m} \gets \theta^{i_m} + \alpha \boldsymbol{g}^{i_m}_{\text{HAPPO}}.$$

So quick and convenient! 😍
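A minimal sketch of this inner loop for one agent is given below. The policy is assumed to return a `torch.distributions` object, `m` is $M_{\boldsymbol{\pi}}(s,\boldsymbol{a})$ precomputed from the joint advantage and the preceding agents' ratio, and the optimiser and epoch count are illustrative choices rather than values prescribed by the paper.

```python
import torch

# One agent's HAPPO update: a few steps of gradient ascent on the clip objective.
def happo_update(policy, optimizer, obs, actions, logp_old_im, m, eps=0.2, epochs=10):
    for _ in range(epochs):
        logp = policy(obs).log_prob(actions)              # log pi_theta^{i_m}(a|s)
        ratio = torch.exp(logp - logp_old_im)             # pi_theta^{i_m} / pi_old^{i_m}
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        loss = -torch.min(ratio * m, clipped * m).mean()  # minimise the negated objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

After agent $i_m$ finishes, its new policy's ratio is folded into $M_{\boldsymbol{\pi}}$ for the next agent in the sequence, which is exactly the importance-sampling trick described above.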
### Domination (verified empirically)

And how do these algorithms perform? HAPPO by itself outperforms SOTA methods like MAPPO, IPPO, and MADDPG. HATRPO, however, completely dominates all of them (including HAPPO) and establishes the new SOTA! 💥 Here are some plots from Multi-Agent MuJoCo, the hardest MARL benchmark.

![](https://i.imgur.com/Uo32wJN.png)

So the key takeaway of multi-agent trust-region learning is that a large number of agents does not have to imply conflicts in learning. Quite the opposite: a large team of learners willing to cooperate can get very far 🗻! Are you wondering how they can do it safely? Read our next article!

Thanks for reading this article; I (Kuba) am really happy about your interest in MARL, and so are my co-authors: Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, and Yaodong Yang.
