# Response to All Reviewers (a.k.a. Letter to the AC)

We thank all the reviewers for their very constructive comments.

We thank **Reviewer m6sU** for recognizing the novelty of our work: <font color="green">*"The paper provides a novel perspective -- the preconditioning/adaptive perspective -- on in-context learning in transformers, which could potentially lead to new ways of understanding and improving these models. Given the growing interest in understanding ICL from a theoretical perspective, the "preconditioning gradient descent" can be served as an important building block. The analysis tools developed in this paper could be helpful for future research on understanding ICL."*</font>

We also thank **Reviewer k8UL** for appreciating the value of our approach: <font color="green">*"This is a very good paper. I was very happy to read it... I really liked the approach that the authors have taken. It is quite a neat idea to consider sparsity driven constraints... The results resolve some very important questions towards understanding ICL. The paper is very well written and a joy to read... This paper serves as a crucial step towards the goal of understanding ICL, [which] is one of the central problems towards demystifying LLMs."*</font>

Last but not least, we thank **Reviewer vRHd** for appreciating the analysis in our paper: <font color="green">*"The paper provides an interesting analysis that can be useful to understand the intriguing properties of in-context learning... this paper can be useful as a stepping stone for any researcher that wants to further study intriguing properties of in-context learning."*</font>

We begin our response with a summary of our main contributions, as echoed by the reviewers above.

**Primary technical achievements:**

- Inspiring previous works have shown that by setting the weights of a transformer carefully, one can perform ICL for the task of linear regression. We follow up on these works and show that when the transformer is trained over random instances of linear regression, **the global minima (and stationary points) correspond to interesting algorithms**, such as "preconditioned" gradient descent (see the illustrative sketch after this list).
- At a more technical level, as endorsed by the reviewers, we propose analytical techniques to analyze the training loss of the transformer architecture, which led to the follow-up/concurrent works [2] and [3].
- We would also like to underscore the relevance of this work to the community by briefly mentioning concurrent works that were posted shortly after our submission. First, Zhang et al. [2] have shown that under a random initialization, gradient descent converges to the global minimum characterized in this work, contributing to our understanding of training transformers. Moreover, Mahankali et al. [3] establish a similar set of results to ours for the single-layer case. We believe that the question we attempt to answer in this work is very timely and relevant to the community, and will hopefully spark further interest.
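To make the preconditioning viewpoint in the first bullet concrete, here is a minimal illustrative sketch; the notation ($n$, $\hat{L}$, $A$, $\eta$) is generic shorthand and is not meant to reproduce the paper's exact equations. Given in-context examples $(x_1, y_1), \dots, (x_n, y_n)$, consider the least-squares objective and a gradient step with a matrix step size:

$$
\hat{L}(w) = \frac{1}{2n}\sum_{i=1}^{n}\big(y_i - \langle w, x_i\rangle\big)^2,
\qquad
w^{k+1} = w^{k} - A\,\nabla \hat{L}\big(w^{k}\big),
$$

where a positive definite $A$ plays the role of a preconditioner, and plain gradient descent corresponds to $A = \eta I$. Informally, our results show that the attention weights at the global minima implement updates of this form, with $A$ depending on the input covariance.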
We next address a shared concern raised by the reviewers:

**Linear Attention:** As noted by the reviewers, the setting in this paper is theoretical: we analyze transformers with linear attention and no MLP layers. Nevertheless, such networks are expressive enough for solving linear regression, and they have become the focus of theoretical studies in various recent works [1,2,3]. In particular, Lemma 2 proves that these networks are expressive enough to implement a family of optimization methods.

In fact, our results extend---to an extent---to the softmax attention case. In a nutshell, we observe both theoretically and empirically that the softmax case is similar to linear attention (in line with the conclusion of [Appendix A.9] in von Oswald et al.). In particular, under two-headed softmax attention, we observe that the learned algorithm is identical to that of linear attention. Without going into gory details, the key intuition is the linearization trick $\frac{1}{2}(e^x - e^{-x}) \approx x$; a sketch of how the two heads combine is given after the list below. Indeed, in our experiments, we observe that the weights of the two attention heads have approximately opposite signs. We will add this discussion in our final version.

- Already this simple setting suffices to learn interesting algorithms.
- The expressivity is less than that of the model with MLP...
- von Oswald's two attention heads can achieve the same solution as linear attention in this setting... then why not focus on linear attention?
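As a rough illustration of the linearization trick above (our own shorthand, not the paper's notation): let $a$ denote a pre-softmax attention score, and suppose the two heads carry scores of opposite sign, $a$ and $-a$. Ignoring the softmax normalization, their combination satisfies

$$
\tfrac{1}{2}\big(e^{a} - e^{-a}\big) = \sinh(a) = a + \tfrac{a^{3}}{3!} + \cdots \;\approx\; a
\quad \text{for small } a,
$$

so, to first order, the sum of two softmax heads whose key-query weights differ only in sign behaves like a single linear-attention head. This matches the empirical observation above that the two trained heads have weights of approximately opposite sign.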
**Additional Experiments:**

- ICL experiments: number of ICL samples vs. test loss.

---

**References.**

[1] Von Oswald, Johannes, et al. "Transformers learn in-context by gradient descent." International Conference on Machine Learning. PMLR, 2023.
[2] Zhang, Ruiqi, Spencer Frei, and Peter L. Bartlett. "Trained Transformers Learn Linear Models In-Context." arXiv preprint arXiv:2306.09927 (2023).
[3] Mahankali, Arvind, Tatsunori B. Hashimoto, and Tengyu Ma. "One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention." arXiv preprint arXiv:2307.03576 (2023).

---

# Response to m6sU (7 / Confidence 4)

As mentioned in the common response, we conducted the suggested experiment and have shared the results. Thank you for the suggestion.

As noted by von Oswald et al., the network analyzed in this paper is sufficiently expressive to implement gradient descent on the least-squares objective. Hence, one can zero out all parameters of the MLP block and skip this block via the residual connection to implement gradient descent on ridge regression (a short sketch follows below). Thus, in this work we omit the MLP block; this approach was also taken in other recent works in the literature [1,2,3]. Note that ICL of kernel regression requires MLP blocks, as shown in [1,4,5]. Extending our results to kernel regression is an interesting topic for future work. We will add a detailed discussion of this in our final version, as per your suggestion.
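For completeness, here is a one-line sketch of the "zero out the MLP and skip it" point, in generic notation (our paraphrase, not an equation from the paper). With a residual connection around the MLP sub-block, setting all of its parameters to zero turns the sub-block into the identity map:

$$
Z \;\longmapsto\; Z + \mathrm{MLP}(Z) = Z + 0 = Z
\qquad \text{when all MLP parameters are set to } 0,
$$

so each layer reduces to its attention part, and an attention-plus-MLP architecture can realize any algorithm that the attention-only architecture realizes in this setting.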
Thank you for catching the typos; we will correct them in the final version.

Thank you once again for taking the time to review our paper. We sincerely hope that our response to your concerns, as well as the overall response to the other reviewers' concerns, helps assuage your concerns and lets you view this paper in a more favorable light.

**References.**

[1] Von Oswald, Johannes, et al. "Transformers learn in-context by gradient descent." International Conference on Machine Learning. PMLR, 2023.
[2] Zhang, Ruiqi, Spencer Frei, and Peter L. Bartlett. "Trained Transformers Learn Linear Models In-Context." arXiv preprint arXiv:2306.09927 (2023).
[3] Mahankali, Arvind, Tatsunori B. Hashimoto, and Tengyu Ma. "One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention." arXiv preprint arXiv:2307.03576 (2023).
[4] Garg, Shivam, et al. "What can transformers learn in-context? A case study of simple function classes." Advances in Neural Information Processing Systems 35 (2022).
[5] Akyürek, Ekin, et al. "What learning algorithm is in-context learning? Investigations with linear models." arXiv preprint (2022).

---

# Response to k8UL (7 / Confidence 4)

For the linear attention comment, we provide an extended response in the common response section. Thanks for pointing it out!

Regarding the parametrization: there are experimental and theoretical results confirming that optimizing the reparameterized network recovers the same solution characterized in this paper, namely $W_k W_q^\top$ converges to the $Q^*$ characterized in Theorems 1, 2, and 3 when $W_q$ and $W_k$ are square matrices. Experiments in [1] show this convergence for single-layer and multi-layer linear attention. Theoretically, [2] recently proves the global convergence of GD to the optimal solution when $W_q = W_k$ for a single attention layer.

Regarding the L-BFGS comment, we conjecture that various algorithms converge to the same solution characterized in the theorems. This is proven for SGD on single-layer attention by [1]. Furthermore, [1] includes experimental results for AdamW on multi-layer attention without the sparsity constraint.

Moreover, regarding the sparsity pattern, [1] conjectures that similar results hold without the sparsity constraint, based on experimental observations. In fact, in our Section 5, we did our best to relax this technical sparsity assumption required for the theoretical proofs. We will clarify this in the paper.

Thank you once again for taking the time to review our paper. We sincerely hope that our response to your concerns, as well as the overall response to the other reviewers' concerns, helps assuage your concerns and lets you view this paper in a more favorable light.

**References**

[1] Von Oswald, Johannes, et al. "Transformers learn in-context by gradient descent." International Conference on Machine Learning, 2023.
[2] Zhang, Ruiqi, Spencer Frei, and Peter L. Bartlett. "Trained Transformers Learn Linear Models In-Context." arXiv preprint arXiv:2306.09927 (2023).

---

# Response to iy8f (5 / Confidence 2)

Thank you for your constructive comments! We address your comments one by one below.

> Compared to the previous work (von Oswald et al., 2022), the novelty is insufficient except the introduction of non-isotropic data distribution.

Our results in Theorems 1, 2, and 3 are novel even for isotropic inputs. Remarkably, [von Oswald et al., 2022] proves that the recurrence of linear attentions is expressive enough to implement gradient descent. However, such an expressivity result **does not imply anything about whether trained transformers also attain such properties**. Here, we theoretically address the conjecture of [von Oswald et al., 2022], which postulates that the optimization of parameters leads to the implementation of gradient descent.

We establish the connection to the data distribution used for training to motivate the importance of going beyond expressivity. When the data distribution is not isotropic, the network is still expressive enough to implement gradient descent, but the optimal (single-layer) transformer is not the one corresponding to implementing gradient descent. This is proved in Theorem 1. Thus, it is important to analyze the training landscape to characterize the outcome of training.

> This paper simplifies the problem a lot by treating all components as linear. However, for a standard transformer, the softmax and activation functions are two key components and introduce nonlinearity. A discussion on the relationship between linear and standard transformers is missing.

Our problem setting is taken from [1], since we are building on the empirical observations in [1]. While [2] and [3] share the same setting and analyze only a single attention layer, we provide an analysis for multi-layer attention.

- **[MLP block with non-linear activations]** We argue that MLP blocks are not necessary for learning linear models. Zeroing out the weights of the MLP blocks allows skipping these blocks over the residual connections. To implement gradient descent on ridge regression, one can omit the MLP blocks and rely only on the attention.
- **[Softmax]** As noted in the general response, we observe both theoretically and empirically that the softmax case is similar to linear attention (in line with the conclusion of [Appendix A.9] in von Oswald et al.). For two-headed softmax attention, we observe that the learned algorithm is identical to that of linear attention. The key intuition is the linearization trick $\frac{1}{2}(e^x - e^{-x}) \approx x$. We observed that the weights of the two attention heads have approximately opposite signs, confirming that this approximation is effective. We will add this discussion in our final version.

> Although simulations are provided in this paper, it is not sufficient. It would be interesting to know: How performance varies with the number of layers? How a -layer linear transformer performs compared with a direct gradient descent approach with different step sizes or a standard transformer (during training or after sufficiently trained)? ...

To address your concern, we conducted more experiments and share them in the common response. We would also like to highlight that there have already been several **empirical works** in the literature [1,4,5] showing various properties of the models learned for in-context learning. Hence, our main focus was to provide a "**theoretical footing**" for their interesting empirical observations. In particular, we kindly refer the reviewer to the mentioned empirical works for further empirical properties.

> Although $w^*$ is well-defined in the paper, it is not clear how $w$ relates to the model parameters or data.

$w^*$ is a random vector, independent of the data distribution. Since the model is trained over random instances, the parameters of the optimal model are independent of any individual random $w^*$ and depend only on its distribution. We establish the connection between the distribution of $w^*$ and the optimal model in Theorem 1.
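For readers unfamiliar with this setup, the following is a generic rendering of the ICL linear-regression data model from [1,4,5] (the distribution symbols $\mathcal{D}_x$, $\mathcal{D}_w$ are our shorthand, not the paper's notation):

$$
w^{*} \sim \mathcal{D}_{w}, \qquad
x_1, \dots, x_n, x_{\mathrm{query}} \sim \mathcal{D}_{x}, \qquad
y_i = \langle w^{*}, x_i \rangle,
$$

where the transformer receives the prompt $(x_1, y_1, \dots, x_n, y_n, x_{\mathrm{query}})$ and is trained, over fresh random draws of $w^{*}$ and the inputs, to predict $\langle w^{*}, x_{\mathrm{query}}\rangle$. Because the training loss averages over these draws, the trained parameters depend on $w^{*}$ only through its distribution, which is the dependence that Theorem 1 characterizes.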
> Related work is absent in this paper.

We reviewed the related works in the introduction. To address your concern, we will add a broader scope of related works in the final version. We will also include any other related works that the reviewer recommends.

> Presuming you are studying least-squares loss, Eq. (5) may miss a square symbol.

> There seem to be notation typos: in Eq. (9) and in Lemma 2.

Thank you very much for catching these typos and for your comments.

Thank you once again for taking the time to review our paper. We sincerely hope that our response to your concerns, as well as the overall response to the other reviewers' concerns, helps assuage your concerns and lets you view this paper in a more favorable light.

**References**

[1] Von Oswald, Johannes, et al. "Transformers learn in-context by gradient descent." International Conference on Machine Learning. PMLR, 2023.
[2] Zhang, Ruiqi, Spencer Frei, and Peter L. Bartlett. "Trained Transformers Learn Linear Models In-Context." arXiv preprint arXiv:2306.09927 (2023).
[3] Mahankali, Arvind, Tatsunori B. Hashimoto, and Tengyu Ma. "One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention." arXiv preprint arXiv:2307.03576 (2023).
[4] Garg, Shivam, et al. "What can transformers learn in-context? A case study of simple function classes." Advances in Neural Information Processing Systems 35 (2022).
[5] Akyürek, Ekin, et al. "What learning algorithm is in-context learning? Investigations with linear models." arXiv preprint (2022).

---

# Response to vRHd (6 / Confidence 4)

> Generally, the paper restricts itself to a very narrow setting of a specific data that is centered at zero, linear model generating the labels, linear transformer, specific sparsity of the parameter matrix.

We agree with the reviewer that our setting is theoretical. Yet, we note that our setting is not specific to this work and is used in other theoretical studies:

- [1,2,3] use linear attention, since linear attention is expressive enough to implement gradient descent on linear regression.
- [1-5] use centered data and generated labels.
- [1,2,3,5] focus on linear models.

In Section 5, we relax the sparsity constraint. As noted by reviewer `k8UL`: "It is quite a neat idea to consider sparsity driven constraints that permit tractability." Without the minimal sparsity constraint in Eq. 10, the analysis becomes very difficult.

> The transformers that they study are linear with a single head only. While I do appreciate the value of the analysis and I understand that this is just a step of many, it would be great to include a short paragraph outlining what the authors think about generalization of the proposed approach when the assumptions are removed.

- **Multi-head attention.** Here, we show that learning multi-head linear attention reduces to learning a single-head attention. Recall the notation $\text{Attn}_{P,Q}(Z)$ in Eq. 2. Due to the linearity of the attention in $P$ and $Q$, we have
  $$\sum_{i} \text{Attn}_{P_i,Q_i}(Z) = \text{Attn}_{\sum_i P_i,\,\sum_{i} Q_i}(Z).$$
  Thus the reparameterization $P' := \sum_{i} P_i$ and $Q' := \sum_{i} Q_i$ casts the problem as learning a single-head attention.
- **Softmax.** As noted in the general response, we observe both theoretically and empirically that the softmax case is similar to linear attention (in line with the conclusion of [Appendix A.9] in von Oswald et al.). For two-headed softmax attention, we observe that the learned algorithm is identical to that of linear attention. The key intuition is the linearization trick $\frac{1}{2}(e^x - e^{-x}) \approx x$. We observed that the weights of the two attention heads have approximately opposite signs, confirming that this approximation is effective. We will add this discussion in our final version.

<font color="green">*Responding to this is easy, and we should do it inline here, and say where in the paper we talk about this.*</font>

> Also, given the series of the assumptions, I would make sure that they are stated early on in the abstract and introduction.

In the abstract, we narrow our scope to "linear transformers trained over random instances of linear regression". The introduction further clarifies: "Akin to recent work, we too focus on the setting of linear regression encoded via ICL, for which we train a transformer architecture with multiple layers of single-head self-attention without softmax". Linear regression encoded via ICL is introduced in [1,4,5], and it encompasses the setting of generated labels, linear regression, centered input data, and a specific encoding of the regression samples. We will gladly add any further details that are needed.
> I also think that the experimental validation can be slightly improved by considering different choice of layers, dimension of x, conditioning factor of the input etc. It would also be great to have an empirical plot demonstrating how close A is to the Gram matrix as the number of points increase (the results from the end of section 3)

<font color="green">*Here it is best to agree with the reviewer and mention something about more experiments, either in the appendix or ones that we may already be adding to the rebuttal.*</font>

**Questions:**

> 1. The equations for the gradient do not immediately pop out from Eq. 7. I would encourage the authors to expose the equations for the gradient to the reader (maybe after Eq. 5) so that the connection between the gradient and the preconditioner would become more apparent.

We will incorporate your comment to improve readability. Thank you!

> 2. What are the properties of A in sections 3 and 4? Can a transformer find a positive definite matrix through training?

In experiments, we observe that gradient descent converges to a positive definite matrix which depends on the covariance matrix of the inputs (characterized in Theorems 4 and 5). We take steps towards proving this observation by showing that the parameters characterized in Theorems 4 and 5 are stationary points of gradient descent. An interesting follow-up paper proves the convergence of gradient descent to the global optimum (in Theorem 1) for a single-head attention [2]. Yet, the convergence of gradient descent remains an open problem for multi-head attention.

> 3. For the evaluation of the Theorem 4 and 5, it is a bit hard to interpret the numbers, since the used distance metric is close to zero, but not exactly zero. The obtained value of ~0.1-0.2 is quite hard to interpret, apart from the fact that it is smaller than the distance with respect to the identity.

We agree with the reviewer. In Figures 2 and 4, we visualize the outputs (after training) to illustrate the diagonal dominance of the matrices, which matches the results of Theorems 4 and 5. The error can be reduced by using a larger training set.

> Also, the metric that the authors are using involves the minima over the space of scalars. It would be interesting if the authors can actually provide the scalars from Theorem 4 that they have found empirically. Do these scalars remain the same over multiple restarts of the algorithm?

> How come the same matrix is used for both data and the weights (theorem 4 and 5)? It doesn't sound very realistic that the data and the weights use the same covariance matrix.

In Theorem 1, we use different covariance matrices for the data and $w^*$. However, the analysis for multi-layer attention is significantly more difficult when the covariance matrices are independent. Notably, [3] also uses the same covariance structure as that of Theorems 4 and 5.

> Would be great to add more information to Section 4.1 about adaptive coordinate-wise step sizes. If I understood correctly, Theorem 3 talks about the existence of the global minimizer under certain conditions. How exactly does adaptive coordinate-wise step sizes help find this solution?

Here, we do not analyze the convergence of adaptive coordinate-wise step sizes to the global optimum characterized in Theorem 3. Instead, we analyze the structure of a global minimizer of the training objective and provide an algorithmic interpretation for it.
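To unpack the phrase "adaptive coordinate-wise step sizes" (generic notation, not a restatement of Theorem 3, with $\hat{L}$ as in the sketch in the common response above): a coordinate-wise step size is a diagonal preconditioner, so each coordinate of the iterate gets its own learning rate,

$$
w^{k+1} = w^{k} - D\,\nabla \hat{L}\big(w^{k}\big), \qquad
D = \mathrm{diag}(\eta_1, \dots, \eta_d),
$$

i.e., coordinate $j$ is updated as $w^{k+1}_j = w^{k}_j - \eta_j \big[\nabla \hat{L}(w^{k})\big]_j$; in our reading, these per-coordinate step sizes come from the trained weights rather than being fixed a priori. Theorem 3 characterizes the structure of a global minimizer and gives it this algorithmic interpretation; it does not claim that a particular optimizer converges to that minimizer.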
> I would encourage the authors to add a simple table at the end of the introduction clearly stating the assumptions that each section has. As written, it is quite hard to understand, especially when the notation changes from section to section.

This is a good idea and will improve readability. We will add a table at the end of the introduction outlining the assumptions and main results of the different sections.

Small comments:

> How come the authors choose L-BFGS as an optimizer? It is a fairly specialized algorithm that is not often used these days. What properties of L-BFGS are desirable in the settings that authors used it for? Can the same results be achieved with SGD or Adam?

L-BFGS enjoys a considerably faster convergence rate compared to Adam in our setting. [1] has experiments with AdamW without the sparsity constraints of Section 5. [2] proves the global convergence of gradient descent for a single-head attention. We will compare the convergence rates of these algorithms to justify our choice of optimizer.

> It would be important to highlight limitations of the analysis. E.g. same covariance matrix, single-head linear attention...

In the discussion, we included limitations of our analysis, where we talked about the linearity of the attention. We will add further limitations such as the covariance matrix structure and the single-head analysis. We will fix the typos.

Thank you once again for taking the time to review our paper. We sincerely hope that our response to your concerns, as well as the overall response to the other reviewers' concerns, helps assuage your concerns and lets you view this paper in a more favorable light.

**References**

[1] Von Oswald, Johannes, et al. "Transformers learn in-context by gradient descent." ICML 2023.
[2] Zhang, Ruiqi, Spencer Frei, and Peter L. Bartlett. "Trained Transformers Learn Linear Models In-Context." arXiv preprint arXiv:2306.09927 (2023).
[3] Mahankali, Arvind, et al. "One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention." arXiv preprint arXiv:2307.03576 (2023).
[4] Garg, Shivam, et al. "What can transformers learn in-context? A case study of simple function classes." NeurIPS 2022.
[5] Akyürek, Ekin, et al. "What learning algorithm is in-context learning? Investigations with linear models." arXiv preprint (2022).
[6] Schlag, Imanol, Kazuki Irie, and Jürgen Schmidhuber. "Linear transformers are secretly fast weight programmers." ICML 2021.
