# Matrix Profile

## Target Architecture

Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz

- L1d Cache: 3.4 MiB
- L1i Cache: 2.3 MiB
- L2 Cache: 90 MiB
- L3 Cache: 108 MiB
- Number of AVX-512 FMA units: 2

All measurements below were taken with the following input configuration:

- Time series length: 262159
- Sublen: 10

### Baseline

The baseline measurements were taken using the icpx compiler with the following arguments:

```
icpx -O3 -mtune=icelake-server
```

Time taken:

```
45.41 seconds
```

Perf:

```
4.24 Giga Floating Point Operations Per Second
0.25 Cycles Per Instruction (per Logical Processor)
3.94 Instructions Per Cycle (per Logical Processor)
1.24 Floating Point Operations Per Cycle
```

### SIMD - User-Defined Reduction in OpenMP

The `#pragma omp simd` vectorization is fairly straightforward; because of the row dependencies, a reduction is needed to avoid race conditions. Because we want to preserve the index along with the maximum, we implemented a custom reduction on a struct type containing mp and mpi. The comparison for the row max is done on that local variable, which is then reduced. The relevant code snippet is shown below.

```cpp
// holds the running row maximum and its index
struct max_mp {
    double mp;
    int mpi;
};

max_mp localMp{ -1, -1 };
#pragma omp declare reduction(argMax : struct max_mp : omp_out = omp_in.mp > omp_out.mp ? omp_in : omp_out)

// main computation loops
// TODO loop tiling and parallelization
for (int i = 0; i < mlen - sublen; i++) {
    // TODO loop tiling and simd parallelization
    localMp.mp = mp[i];
    localMp.mpi = mpi[i];
    #pragma omp simd reduction(argMax:localMp)
    for (int j = i + sublen; j < mlen; j++) {
        // streaming dot product
        if (i != 0)
            QT[j - i - sublen] += df[i] * dg[j] + df[j] * dg[i];
        double cr = QT[j - i - sublen] * norm[i] * norm[j];
        // updating the nearest neighbors information
        if (cr > localMp.mp) {
            localMp.mp = cr;
            localMp.mpi = j;
        }
        if (cr > mp[j]) {
            mp[j] = cr;
            mpi[j] = i;
        }
    }
    mp[i] = localMp.mp;
    mpi[i] = localMp.mpi;
}
```

#### Performance evaluation

The OMP SIMD measurements were taken using the gcc compiler (icpx/icpc did not work well with custom reductions) with the following arguments:

```
g++ -fopenmp -O3
```

```
44.76 seconds
```

```
4.49 Giga Floating Point Operations Per Second
0.32 Cycles Per Instruction (per Logical Processor)
3.17 Instructions Per Cycle (per Logical Processor)
1.35 Floating Point Operations Per Cycle
```

### AVX-512 Intrinsics

#### Steps

The kernel contains the following computations in the inner loop that can be vectorized:

- **Row initialization**: The first row is calculated using the SIMD version of the serial code (QT_v is zeroed before the window loop).

  ```
  QT_v = _mm512_add_pd(QT_v, _mm512_mul_pd(_mm512_sub_pd(ts_j, mu_start), _mm512_sub_pd(ts_i_j, mu_i)));
  ```

- **Correlation calculation**: The correlation is calculated using the SIMD version of the serial code.

  ```
  ...
  QT_v = _mm512_add_pd(QT_v, _mm512_add_pd(_mm512_mul_pd(df_i_v, dg_j_v), _mm512_mul_pd(df_j_v, dg_i_v)));
  __m512d cr_v = _mm512_mul_pd(QT_v, _mm512_mul_pd(norm_i_v, norm_j_v));
  ```

- **Reduction across j**: We need to reduce max(cr_v) along j and do a manual horizontal reduction on max_cr_v at the end of the j SIMD loop (see the sketch after this list).

  ```
  result = _mm512_cmp_pd_mask(cr_v, max_cr_v, _CMP_GT_OS);
  max_cr_v = _mm512_mask_blend_pd(result, max_cr_v, cr_v);
  max_index = _mm512_mask_blend_epi64(result, max_index, indices_j_v);
  ```

- **Reduction within j**: We also need to compare cr_v with mp[j] within the j loop and reduce accordingly. The resulting operation is straightforward.

  ```
  result = _mm512_cmp_pd_mask(cr_v, mp_j_v, _CMP_GT_OS);
  _mm512_mask_store_pd(&mp[j], result, cr_v);
  _mm512_mask_store_epi64(&mpi[j], result, index_i_v);
  ```
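A minimal sketch of the manual horizontal reduction at the end of the j SIMD loop, assuming the lane indices were tracked in max_index as above (the helper name `reduce_row_max` and the store-and-scan approach are illustrative):

```cpp
#include <immintrin.h>
#include <cstdint>

// Spill the 8 lane maxima and their j indices, then scan them scalar.
static inline void reduce_row_max(__m512d max_cr_v, __m512i max_index,
                                  double &row_max, int64_t &row_max_idx) {
    alignas(64) double cr[8];
    alignas(64) int64_t idx[8];
    _mm512_store_pd(cr, max_cr_v);       // per-lane running maxima
    _mm512_store_epi64(idx, max_index);  // the matching j indices
    for (int k = 0; k < 8; k++) {        // scalar scan over the 8 lanes
        if (cr[k] > row_max) {
            row_max = cr[k];
            row_max_idx = idx[k];
        }
    }
}
```

With only eight lanes, the scalar scan adds negligible cost compared to the inner loop.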
#### Compiler

We decided to use icpx for the intrinsics version with the following parameters:

```
-O3 -xCORE-AVX512 -mtune=icelake-server
```

#### Best Runtime

```
49.71 seconds
```

The intrinsics version is slower than the 256-bit auto-vectorized baseline. We speculate that this could be due to the heavy nature of 512-bit FMA instructions being run on a single thread. The results obtained from the baseline and OMP SIMD versions are very close to the intrinsics version. This could be because the Icelake CPU has only 2 AVX-512 FMA units, which bottlenecks the number of AVX-512 instructions that can run in parallel.

#### Performance Measurements

- Perf report breakdown (vfma* and vmulpd instructions contribute a high percentage of the runtime):

  ```
  13,69 │ vfmadd213pd (%rdx,%rcx,8),%zmm6,%zmm7
  19,08 │ vfmadd231pd (%r14,%rcx,8),%zmm5,%zmm7
   0,00 │ vpbroadcastq %rsi,%zmm8
  19,46 │ vmulpd (%r10,%rcx,8),%zmm3,%zmm9
   0,01 │ vmovupd %zmm7,(%rdx,%rcx,8)
   4,69 │ vmulpd %zmm7,%zmm9,%zmm7
  11,64 │ vcmpltpd %zmm7,%zmm1,%k1
   1,18 │ vpaddq %zmm0,%zmm8,%zmm2{%k1}
   8,12 │ vcmpgtpd (%r12,%rcx,8),%zmm7,%k1
  ```

The following stats are collected from Intel core/uncore events using pmu-tools (https://github.com/andikleen/pmu-tools), which gives us access to more counters than the ones available in perf and provides a top-level breakdown of the performance counters. The relevant metrics we obtained:

- GFLOPs

  ```
  5.14 Giga Floating Point Operations Per Second
  ```

  GFLOPs is a derived metric: the number of floating-point operations divided by the time taken. We have higher GFLOPs than the baseline due to the wider 512-bit instructions.

- Memory counters

  ```
  Backend_Bound.Memory_Bound.L3_Bound % Stalls 61.4
  Info.Bottleneck Memory_Latency Scaled_Slots 39.36
  ```

  61.4% of backend-bound slots were stalled on memory, which means we are memory intensive. The total pipeline cost of memory-latency-related bottlenecks (external memory and off-core caches) is not very high. This is a metric derived from the core/uncore DRAM_Bound, L1_Bound, PMM_Bound, Store_Bound, and Hit_Latency events (described in https://github.com/andikleen/pmu-tools/blob/master/clx_server_ratios.py).

- Miscellaneous

  ```
  1.88 Cycles Per Instruction (per Logical Processor)
  1.56 Floating Point Operations Per Cycle
  ```

  We spend more time per instruction than the baseline and OMP SIMD versions because of the 512-bit SIMD instructions, but we do more floating-point operations per cycle.

#### Memory Bandwidth

We calculate the memory bandwidth using the Intel uncore event `iMC.MEM_BW_TOTAL`. This is derived from (CAS_COUNT.RD * 64) + (CAS_COUNT.WR * 64); CAS stands for _Column Address Select_. This metric is socket bound and thus contains readings for the different channels in that socket. pmu-tools allows us to monitor the counter over a period of time:

```
python pmu-tools/ucevent/ucevent.py iMC.MEM_BW_TOTAL --scale GB
```

In the intrinsics version, since we are using only one thread and processing blocks of data, the memory bandwidth peaks at **0.19 GB/s** at the beginning of the run and decreases steadily as more data becomes available in the caches and less needs to be fetched from main memory.
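Since each CAS transaction transfers one 64-byte cache line, the derivation above can be written as a rate (the restatement is ours; $\Delta t$ denotes the sampling interval over which the counters are read):

$$
\text{MEM\_BW\_TOTAL} = \frac{64 \cdot (\text{CAS\_COUNT.RD} + \text{CAS\_COUNT.WR})}{\Delta t}
$$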
### Multi-threading - OpenMP

Before we explain our parallelization, we should explain the dependencies of the matrix profile algorithm. The most important part is that if the rolling dot product is used, each row depends on the computational results of the previous row: the QT array is updated according to the sliding-window principle, i.e. $QT_{i,j} = QT_{i-1,j-1} + df[i] \cdot dg[j] + df[j] \cdot dg[i]$. For example, in the first row QT[0] holds the QT part of the correlation for indices i=0 and j=3 (sublen=3), and in the second row that value is reused to compute QT for i=1 and j=4 (again in QT[0]). This means the diagonal elements going from the top left to the bottom right depend on their predecessors. Another important point is synchronization of the max comparison and the writing of the values into the mp and mpi arrays; this is discussed in more detail in the tiling section.

#### Tiling

To properly distribute the triangular workload onto our 72 hardware threads on ice2, we need to tile it. A simple automatic distribution of the index ranges would work for a rectangular workload, but not for a triangle. As the assignment document suggests, the triangle can be tiled into smaller triangles, some with the same orientation and some inverted. This is shown in the figure below.

<figure>
<img src="tiling_triangles.png" alt="figure" style="width:400px">
<figcaption>Figure: The triangular tiling</figcaption>
</figure>

It is also possible to join each inverted triangle with an adjacent one to form parallelograms. This makes the programming logic simpler, makes better use of the whole range of the QT array, removes overhead, and may increase cache locality. The figure below shows this tiling.

<figure>
<img src="tiling_para.png" alt="figure" style="width:400px">
<figcaption>Figure: The parallelogram tiling</figcaption>
</figure>

Now to the number of tiles and their distribution to threads. Each thread should ideally get the same amount of work, and a near-perfect distribution can be achieved with this kind of tiling. The right side still consists of triangles, which are almost exactly half the work of a parallelogram (one diagonal less), so one thread should take either two triangles or one parallelogram. The formula for the number of work units is related to the Gaussian sum: $1 + 2 + \dots + n + \frac{n+1}{2}$, where n is an odd integer corresponding to the number of parallelograms in the first row (or tiles - 1). As an example, for 8 threads an even distribution is possible with n=3: the first row has 3 parallelograms, the second 2, the third 1; that is 6 parallelograms, and on the side there are 4 triangles, the equivalent of 2 parallelograms, giving 8 work units overall. n must be odd, because otherwise there are half work units: for n=2 there are 3 parallelograms and 3 triangles, equaling 4.5 work units, which cannot be distributed equally. For our case of 72 threads we have a perfect distribution for n=11: $1 + \dots + 11 + \frac{11+1}{2} = 72$. This gives each thread a close-to-equal workload, with 66 parallelograms taken by 66 threads and 12 triangles taken by 6 threads; the closed form is worked out below.
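For clarity, the closed form of the work-unit count, which follows directly from the Gaussian sum above:

$$
W(n) = \sum_{k=1}^{n} k + \frac{n+1}{2} = \frac{n(n+1)}{2} + \frac{n+1}{2} = \frac{(n+1)^2}{2},
\qquad W(11) = \frac{12^2}{2} = 72 .
$$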
We implemented the tiling by handling the triangle tiles as a branch within the same code; the fragment below shows the structure.

```cpp
for (int i = 0; i < mlen - sublen; i += tileSize) {
    for (int j = i + sublen; j < mlen; j += tileSize) {
        // reset the QT buffer for this tile
        for (int k = 0; k < tileSize; k++) {
            QT_tile[k] = 0;
        }
        // initialize the tile's first diagonal with full dot products
        for (int k = j; k < j + tileSize; k++) {
            for (int l = 0; l < sublen; l++) {
                //if (k - sublen >= 0)
                QT_tile[k - j] += (ts[i + l] - mu[i]) * (ts[k + l] - mu[k]);
            }
        }
        int jj_border;
        for (int ii = i; ii < i + tileSize; ii++) {
            jj_border = j + ii - i + tileSize;
            if (jj_border > mlen)
                jj_border = mlen;
            for (int jj = j + ii - i; jj < jj_border; jj++) {
                // streaming dot product
                if (ii != i)
                    QT_tile[jj - j - (ii - i)] += df[ii] * dg[jj] + df[jj] * dg[ii];
                double cr = QT_tile[jj - j - (ii - i)] * norm[ii] * norm[jj];
                // updating the nearest neighbors information
                if (cr > mp_local[ii]) {
                    mp_local[ii] = cr;
                    mpi_local[ii] = jj;
                }
                if (cr > mp_local[jj]) {
                    mp_local[jj] = cr;
                    mpi_local[jj] = ii;
                }
            }
        }
    }
}
```

If mlen - sublen is not divisible by n+1, the tiling leaves a remainder, which we also handle. The figure below shows an example with mlen - sublen = 14, which leaves a remainder of 2 with 3 tiles in the first row (n=2). We first handle the upper-right triangle, the green tile, with the same logic as usual (triangle bounds, not parallelogram); the red columns are handled with the parallelogram logic, but treated as mirrored: in this case the red columns are processed as two rows, with the QT buffer sized accordingly. There are minor differences in mp when handling the remainder, because the order of operations is different (rounding errors, which we checked closely), but mpi is always the same. Without any remainder we get exactly the same results as the sequential code. A remainder has a negligible impact on performance, which is why it is not parallelized, although it could be.

<figure>
<img src="remainder.png" alt="figure" style="width:400px">
<figcaption>Figure: The remainder handling</figcaption>
</figure>

#### Parallelization

To parallelize our implementation we need to watch out for the dependencies between the tiles; the figure below shows which ranges of mp/mpi have collisions.

<figure>
<img src="collisions.png" alt="figure" style="width:100%">
<figcaption>Figure: The mp/mpi ranges touched by neighboring tiles</figcaption>
</figure>

MP1-3 show local mp buffers; the ranges touched by tiles 1-3 are colored in their respective colors. The i range is the same for all 3 tiles, since they are horizontal neighbors. Due to the symmetry of the triangle, the j comparisons touch different ranges of mp. As we can see, there would be collisions if the tiles worked on the same mp/mpi. That is why each thread has its own local mp/mpi (72 copies for our 72 threads). At first we had an array of structs and handled it in a custom reduction, but to make it more compatible with our intrinsics code we switched to a critical section at the end of the computations, serving as a handmade reduction (a sketch of this merge step follows the parallelization fragment below). This did not decrease the performance, as it is only done once per thread at the very end.
```cpp
#pragma omp parallel num_threads(numThreads)
{
    double* QT_tile = (double*)malloc(tileSize * sizeof(double));
    double* mp_local = (double*)malloc(mlen_tiled * sizeof(double));
    int* mpi_local = (int*)malloc(mlen_tiled * sizeof(int));
    // initialize the full local buffers
    for (int k = 0; k < mlen_tiled; k++) {
        mp_local[k] = -1;
        mpi_local[k] = -1;
    }
    #pragma omp for collapse(2) schedule(dynamic)
    for (int i = 0; i < mlen - sublen; i += tileSize) {
        for (int j = i + sublen; j < mlen; j += tileSize) {
```

The code fragment above shows the main part of the parallelization: we have local mp, mpi, and QT buffers for each thread, and the tiles are distributed with one tile per thread, or two tiles for some of the threads (this is handled by schedule(dynamic)). We also tried two parallel sections, where the 66 parallelograms get 66 threads and the 12 triangles get 6 threads (possible with omp_set_nested(1)), but that code was slower. Since each thread has its own local buffers for mp/mpi/QT, no NUMA considerations are needed (first touch). We also tried thread-local df, dg, and norm buffers, but that slowed down execution. Pinning is done with `export OMP_PROC_BIND=close` and `export OMP_PLACES=cores`, which gives much better performance for smaller lengths than no binding.
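The critical-section merge itself is not shown in the fragment above; a minimal sketch of how it could look at the end of the parallel region (the global mp/mpi names and the merge bounds are assumed from the earlier fragments):

```cpp
    // ... end of the tile loops inside the parallel region shown above.
    // Each thread folds its local nearest-neighbor buffers into the
    // shared mp/mpi arrays exactly once, serialized by the critical section.
    #pragma omp critical
    {
        for (int k = 0; k < mlen_tiled; k++) {
            if (mp_local[k] > mp[k]) {   // keep the larger correlation
                mp[k] = mp_local[k];
                mpi[k] = mpi_local[k];   // and its index
            }
        }
    }
    // release the per-thread buffers
    free(QT_tile);
    free(mp_local);
    free(mpi_local);
}
```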
### Parallel Performance

The parallel performance was not affected by the inner-loop vectorization: auto-vectorization, OMP SIMD, and intrinsics performed nearly the same, as in the sequential case.

icpx:

```
-O3 -xCORE-AVX512 -mtune=icelake-server -qopenmp
```

```
Main Kernel took 0.752386 seconds
0.96 seconds (whole program)
```

#### Performance Measurements

- Perf report breakdown (the AVX-512 FMA and CMP instructions consumed the most time):

  ```
  11,63 │ vmovupd (%rbx,%rcx,8),%zmm7
  10,27 │ vfmadd213pd (%r8,%rcx,8),%zmm5,%zmm7
   0,01 │ mov 0x1f8(%rsp),%rbx
  20,42 │ vfmadd231pd (%rbx,%rcx,8),%zmm4,%zmm7
   0,01 │ lea (%r11,%rcx,1),%ebx
        │ movslq %ebx,%rbx
   7,52 │ vmulpd (%rdx,%rcx,8),%zmm2,%zmm8
   0,01 │ vmovupd %zmm7,(%r12,%rbx,8)
   7,12 │ vmulpd %zmm7,%zmm8,%zmm7
  20,59 │ vcmpltpd %zmm7,%zmm0,%k1   (compiler switches the '>' to a '<' comparison)
   2,10 │ vmaxpd %zmm0,%zmm7,%zmm0
  ```

As before, the following stats are collected from Intel core/uncore events using pmu-tools (https://github.com/andikleen/pmu-tools). The Top-Down Characterization is a hierarchical organization of event-based metrics that identifies the dominant performance bottlenecks in an application. We also used the manufacturer's guide on performance counters from http://www.intel.com/content/dam/www/public/us/en/documents/design-guides/xeon-e5-2600-uncore-guide.pdf. The interesting metrics we obtained:

- GFLOPs

  ```
  81.54 Giga Floating Point Operations Per Second
  ```

  We achieve 81 GFLOPs with the tiled, parallel, intrinsics version for the 262159-length time series.

- Backend-bound slots

  ```
  BE/Mem Backend_Bound.Memory_Bound % Slots 32.1 [ 1.2%]
  BE/Core Backend_Bound.Core_Bound % Slots 35.3 [ 1.2%]
  ```

  The memory-bound and core-bound slots are roughly equally split, due to the tiling across threads.

- Miscellaneous

  ```
  1.06 Cycles Per Instruction (per Logical Processor)
  0.94 Instructions Per Cycle (per Logical Processor)
  2.38 Floating Point Operations Per Cycle
  ```

  We accomplished 2.38 floating-point operations per cycle, which is higher than both the intrinsics and OMP SIMD versions.

#### Memory Bandwidth

As in the intrinsics section, we monitor the `iMC.MEM_BW_TOTAL` uncore event over the course of the run:

```
python ../../../pmu-tools/ucevent/ucevent.py iMC.MEM_BW_TOTAL --scale GB
```

#### Different Thread Configurations

Number of elements in time series: 262159, sub length: 10.

| Threads | Time |
| --- | --- |
| 8 | 9.06 seconds |
| 32 | 2.04 seconds |
| 72 | 0.96 seconds |

The speedup achieved by our implementation is roughly linear in the number of threads: going from 8 to 72 threads (9x) yields a 9.06 / 0.96 ≈ 9.4x speedup.

### Results for different data sizes

The timings are for the whole program.

| Length | GFLOPs | Time |
| --- | --- | --- |
| 262159 | 81.54 | 0.96 seconds |
| 524299 | 139.41 | 6.82 seconds |
| 1048591 | 123.50 | 32.309 seconds |

### Memory Bandwidth Calculation

TODO:
