# Optimizing AWS F1 DMA

When implementing a system involving both software and hardware components, unforeseen issues always seem to occur at the interfaces. Today we will look at one such pain point that occurs when transferring large amounts of data from a software application running on a typical x86 host machine to an FPGA for accelerated computation. Specifically, we look at an issue we faced at [Ingonyama](https://github.com/ingonyama-zk) while implementing an FPGA-accelerated Multi-Scalar-Multiplication (MSM) application on AWS F1 instances.

## The Architecture

First, let's step back and take a high-level view of the problem and the basic architecture of the system.

Multi-Scalar-Multiplication is one of the major computational bottlenecks at the core of many Zero-Knowledge-Proof systems. Essentially, an MSM operation is defined as:

$$\sum_{i=0}^{n} k_i P_i$$

where $P_i$ are points on some elliptic curve and $k_i$ is simply a scalar integer. For this post we will assume we are dealing with 32-byte scalars and 64-byte point representations. So somewhere in our software we have at runtime a vector of scalars ($n \times 32$ bytes) and a vector of points ($n \times 64$ bytes), which need to be transferred to the FPGA for computation.

On the hardware side, we have implemented in the FPGA an MSM computation core which has two streaming ports, each connected to a FIFO/queue that sinks incoming data transfers from the host.

![](https://i.imgur.com/GizLjh4.png)

The AWS shell presents us with a single 512-bit AXI4 memory-mapped bus on which we will receive data from the host. This bus is connected to two memory-mapped-to-stream converters through an AXI switch, and each converter is assigned an address range on the bus. Any write transactions that hit within their respective address ranges will have the transaction data packets pushed into the corresponding FIFO.

The AWS F1 shell implements a DMA engine that is compliant with the Xilinx-provided (XDMA) drivers. The purpose of a DMA is to offload from the CPU the responsibility of transferring large amounts of data; in this case, transferring data from host memory to the FPGA device. Using CPU cores to transfer data usually has low latency, but it is very inefficient for large amounts of data and can result in low performance, so we would much rather use a DMA for the task. It is important to note that DMA operations typically have high latency, as they are two-step processes: the CPU must first configure the DMA engine, and then the DMA goes and fetches the data, rather than the CPU simply writing directly to the FIFO/memory on the FPGA. The advantage of DMA is the ability to transfer at high throughput.

Now that we have an idea of how the hardware is connected, let's look at the DMA driver.

![](https://i.imgur.com/4WdEotz.png)

Xilinx/AWS provides a Linux driver running in kernel space that manages the low-level configuration of the DMA hardware on the FPGA. When loaded, the driver creates character-based device files (cdevs) under the `/dev/` directory, e.g. `/dev/xdma0_user` and `/dev/xdma0_h2c_0`. The user application can interact with the FPGA through these device files as if reading and writing to a standard file stored on disk.
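For instance, opening a host-to-card channel from Rust is ordinary file I/O. Here is a minimal sketch (the helper name `open_h2c_channel` is ours, not part of the driver; the device path is the one the driver creates above):

```rust
use std::fs::{File, OpenOptions};
use std::io;

/// Open an XDMA host-to-card channel as a plain file handle.
/// The device node is created by the XDMA driver when it loads.
fn open_h2c_channel(index: usize) -> io::Result<File> {
    OpenOptions::new()
        .write(true)
        .open(format!("/dev/xdma0_h2c_{}", index))
}
```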
Let's see how this looks from a Rust-based user application's perspective:

```rust
use std::fs::File;
use std::os::unix::fs::FileExt;

fn stream_data_to_msm_core(h2c: File, scalars: &[u8], points: &[u8]) {
    let scalars_fifo_baseaddr = 0x1000_0000;
    let points_fifo_baseaddr = 0x2000_0000;

    // Write all scalar bytes to the scalars FIFO
    h2c.write_all_at(scalars, scalars_fifo_baseaddr).unwrap();

    // Write all point bytes to the points FIFO
    h2c.write_all_at(points, points_fifo_baseaddr).unwrap();
}
```

Simple! Except... this will fail if we transfer any significant number of scalars or points (more than the size of the FIFOs). The transfer will hang and eventually time out. Why?

## The Problem

Let's say our hardware FIFOs are 1K entries deep. Since our example code is single-threaded, we would need to finish transferring all scalars before we start transferring points. We would write the first 1K scalars and then be deadlocked, as the MSM core cannot consume any scalars before also receiving the corresponding points! We could consider multi-threading this function, however that opens up further problems we will detail later. Let's keep the discussion single-threaded for now.

So let's refactor the code to solve the deadlock by splitting the scalars and points vectors into an equal number of chunks. With a chunk size of 1024 we can avoid the deadlock, as we will never overflow either FIFO before switching to writing to the other one. The code will look something like:

```rust
use std::fs::File;
use std::os::unix::fs::FileExt;

fn stream_data_to_msm_core(h2c: File, scalars: &[u8], points: &[u8]) {
    let scalars_fifo_baseaddr = 0x1000_0000;
    let points_fifo_baseaddr = 0x2000_0000;
    let chunk_size = 1024;

    // Break up the big data transfer into chunks.
    // There are an equal number of point and scalar chunks.
    let scalar_chunks = scalars.chunks(32 * chunk_size);
    let point_chunks = points.chunks(64 * chunk_size);

    // Alternate between the two FIFOs, one chunk at a time
    for (scalar_chunk, point_chunk) in scalar_chunks.zip(point_chunks) {
        h2c.write_all_at(scalar_chunk, scalars_fifo_baseaddr).unwrap();
        h2c.write_all_at(point_chunk, points_fifo_baseaddr).unwrap();
    }
}
```

Simple! Except... while we are now able to transfer all the data to the MSM core and after a while get the correct result back, performance is poor. Why?

## The Rabbit Hole

To get a better understanding, and to keep track of the optimizations we will attempt, let's do some benchmarking. Let's assume the size of the scalars and points vectors is $2^{20}$ elements each. Let's put a timer around `stream_data_to_msm_core` and run it 200 times.
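The harness itself is nothing special; a minimal sketch of the kind of loop we used (`bench` is a hypothetical helper, and the buffer setup and statistics computation are elided):

```rust
use std::fs::File;
use std::time::Instant;

/// Hypothetical harness: run the transfer 200 times and collect
/// per-iteration wall-clock times in milliseconds.
fn bench(h2c: &File, scalars: &[u8], points: &[u8]) -> Vec<u128> {
    let mut samples_ms = Vec::with_capacity(200);
    for _ in 0..200 {
        let start = Instant::now();
        // `try_clone` duplicates the file handle, since the function takes it by value
        stream_data_to_msm_core(h2c.try_clone().unwrap(), scalars, points);
        samples_ms.push(start.elapsed().as_millis());
    }
    samples_ms // average/min/max/std-dev are computed from these samples
}
```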
We get the following results:

| FIFO_DEPTH | CHUNK_SIZE | AVG (ms) | MIN (ms) | MAX (ms) | STD_DEV (ms) |
| ---------- | ---------- | -------- | -------- | -------- | ------------ |
| 1024       | 1024       | 198.49   | 174      | 253      | 11.3         |

Note that the MSM core consumes the data from the FIFOs at its own rate; as far as we are concerned, the computation is finished as soon as we finish writing the last chunk.

Ok, so our baseline is 198.49 ms average computation time. From our internal simulations, we are sure that there is a lot of performance left on the table here. What's happening?

### DMA Callback Servicing: Interrupt vs Polling

One of the first suggestions found in the XDMA driver FAQ for addressing poor performance is to try loading the driver in **polling mode**. What does this mean? To understand this we have to dive into the DMA driver internals.

When the user application invokes the `write(buf, size)` system call, it hands the XDMA driver a pointer to the buffer/array in memory and the number of bytes to write. Of course, the user application only knows about virtual memory addresses and not physical addresses. Conversely, the DMA engine on the FPGA only knows about physical addresses. So the driver must translate the virtual memory pointer into physical memory pointers. Recall from Operating Systems 101 that the virtual memory space a user application sees as contiguous is not necessarily contiguous in physical memory. Typically the operating system organizes the memory into 4KB pages of contiguous memory.

![](https://i.imgur.com/fPajNDd.png)

The DMA driver must give the DMA engine a list of **descriptors** which specify where to **gather** each piece of data from. When the DMA engine is done processing every descriptor, it normally raises an **interrupt** to the host operating system to say it's ready for the next transfer. By default, Linux does not provide any guarantees on how quickly it will service this interrupt and wake up the DMA driver. It will typically meander around doing other chores before finally getting to the interrupt. Furthermore, we incur an additional context switch by waking up the DMA driver. This creates a non-deterministic gap between the end of the previous DMA transfer and the start of the next one. By turning on polling mode, the driver never goes to sleep: it continuously asks the DMA engine whether it's done and immediately sends the next transfer when it receives the go-ahead.

![](https://i.imgur.com/jfXiYFU.png)

So let's try it by loading the driver with:

```bash
sudo insmod xdma.ko poll_mode=1
```

Rerunning the test we get:

| FIFO_DEPTH | CHUNK_SIZE | MODE | AVG (ms)  | MIN (ms) | MAX (ms) | STD_DEV (ms) |
| ---------- | ---------- | ---- | --------- | -------- | -------- | ------------ |
| 1024       | 1024       | INT  | 198.49    | 174      | 253      | 11.3         |
| 1024       | 1024       | POLL | **165.1** | 144      | 333      | 32.98        |

Nice! We are now at 165.1 ms, 83% of the baseline time. But we know we can still do better.

NOTE: We will talk about the increase in the variance later.

### Page Alignment

Because our data is not contiguous in physical memory, a DMA descriptor cannot ask the DMA engine to cross a 4KB boundary, as it might otherwise access a totally unrelated memory region. Therefore, if our chunk of 1024 scalars (32768 bytes) does not start on a 4KB boundary, the DMA engine must process at least one additional descriptor in every transfer.

![](https://i.imgur.com/eensJ9j.png)

We can fix this by forcing the allocation of a 4KB-aligned buffer. We can do this in Rust like so:

```rust
use std::mem;

#[repr(C, align(4096))]
struct Align4K([u8; 4096]);

/// Allocate a Vec<u8> whose backing storage is 4KB-aligned.
unsafe fn aligned_vec(n_bytes: usize) -> Vec<u8> {
    let n_units = (n_bytes / mem::size_of::<Align4K>()) + 1;
    let mut aligned: Vec<Align4K> = Vec::with_capacity(n_units);
    let ptr = aligned.as_mut_ptr();
    let len_units = aligned.len();
    let cap_units = aligned.capacity();
    mem::forget(aligned);
    Vec::from_raw_parts(
        ptr as *mut u8,
        len_units * mem::size_of::<Align4K>(),
        cap_units * mem::size_of::<Align4K>(),
    )
}
```

When constructing our original scalars and points vectors, we call this `aligned_vec` function to create our aligned buffers.
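For illustration, a sketch of how the aligned allocation might be used when building the input buffers (the element count and zero-fill here are placeholders for the real data):

```rust
// Hypothetical setup for 2^20 elements (32-byte scalars, 64-byte points).
let n = 1 << 20;

// The returned Vec is empty, but its capacity is already large enough and
// 4KB-aligned, so growing it within capacity preserves the alignment.
let mut scalars = unsafe { aligned_vec(n * 32) };
scalars.resize(n * 32, 0u8);
let mut points = unsafe { aligned_vec(n * 64) };
points.resize(n * 64, 0u8);

// ... copy the actual scalar and point bytes into these buffers ...
```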
Now rerunning the benchmark we get:

| FIFO_DEPTH | CHUNK_SIZE | MODE | ALIGN | AVG (ms)   | MIN (ms) | MAX (ms) | STD_DEV (ms) |
| ---------- | ---------- | ---- | ----- | ---------- | -------- | -------- | ------------ |
| 1024       | 1024       | INT  |       | 198.49     | 174      | 253      | 11.3         |
| 1024       | 1024       | INT  | X     | 165.46     | 160      | 189      | 5.3          |
| 1024       | 1024       | POLL |       | 165.1      | 144      | 333      | 32.98        |
| 1024       | 1024       | POLL | X     | **156.57** | 134      | 291      | 34.8         |

Nice! We see an improvement in both interrupt and poll mode. However, we know that we can do even better.

### Chunk Sizing

Let's think about the FIFOs again. If we have a FIFO depth of 1K entries and we are writing to them in 1K-element chunks, consider the following scenario:

- Both FIFOs are half full, and the host tries to write another chunk of 1024 points.
- The transfer is stalled until all 1024 points have been written.
- The only way this can happen is if all 512 existing point/scalar pairs have been consumed by the MSM core.
- So the scalars FIFO will necessarily be starving by the time we initiate another transfer.
- Due to the high latency of a DMA write, this means we might be wasting hundreds or thousands of clock cycles between when we finish writing points and when new scalars arrive at the scalars FIFO.

![](https://i.imgur.com/WcgYAiq.png)

So this basically means that the MSM core cannot consume any data whenever the host is not actively transferring. Not good! Let's try lowering the chunk size and see if that helps:

| FIFO_DEPTH | CHUNK_SIZE | MODE | ALIGN | AVG (ms)   | MIN (ms) | MAX (ms) | STD_DEV (ms) |
| ---------- | ---------- | ---- | ----- | ---------- | -------- | -------- | ------------ |
| 1024       | 256        | INT  |       | 385.48     | 243      | 502      | 74.44        |
| 1024       | 256        | INT  | X     | 308.43     | 224      | 388      | 51.63        |
| 1024       | 512        | INT  |       | 219.25     | 140      | 303      | 39.79        |
| 1024       | 512        | INT  | X     | 160.06     | 125      | 206      | 23.62        |
| 1024       | 1024       | INT  |       | 198.49     | 174      | 253      | 11.3         |
| 1024       | 1024       | INT  | X     | 165.46     | 160      | 189      | 5.3          |
| 1024       | 256        | POLL |       | 157.36     | 136      | 302      | 32.54        |
| 1024       | 256        | POLL | X     | **139.54** | 125      | 275      | 27.72        |
| 1024       | 512        | POLL |       | 155.46     | 136      | 314      | 31.29        |
| 1024       | 512        | POLL | X     | 141.26     | 125      | 285      | 31.07        |
| 1024       | 1024       | POLL |       | 165.1      | 144      | 333      | 32.98        |
| 1024       | 1024       | POLL | X     | 156.57     | 134      | 291      | 34.8         |

Nice! Now our best time is 139.54 ms! But we know we can still do better!

### FIFO Sizing

Finally, let's try increasing the size of the FIFOs! We saved this for last, as we'd rather not use extra resources on the FPGA if possible. Increasing the FIFO sizes to 4K entries and rerunning the benchmark, we get:

| FIFO_DEPTH | CHUNK_SIZE | MODE | ALIGN | AVG (ms)   | MIN (ms) | MAX (ms) | STD_DEV (ms) |
| ---------- | ---------- | ---- | ----- | ---------- | -------- | -------- | ------------ |
| 4096       | 256        | INT  |       | 356.6      | 249      | 494      | 50.95        |
| 4096       | 256        | INT  | X     | 306.09     | 229      | 432      | 54.81        |
| 4096       | 512        | INT  |       | 197.13     | 140      | 260      | 25.41        |
| 4096       | 512        | INT  | X     | 173.53     | 125      | 248      | 17.75        |
| 4096       | 1024       | INT  |       | 152.85     | 136      | 188      | 9.59         |
| 4096       | 1024       | INT  | X     | 124.75     | 124      | 144      | 2.3          |
| 4096       | 2048       | INT  |       | 151.17     | 135      | 178      | 9.49         |
| 4096       | 2048       | INT  | X     | **124.09** | 124      | 138      | 0.72         |
| 4096       | 4096       | INT  |       | 160.27     | 146      | 188      | 7.3          |
| 4096       | 4096       | INT  | X     | 136.25     | 136      | 153      | 0.86         |
| 4096       | 256        | POLL |       | 152.16     | 135      | 306      | 30.26        |
| 4096       | 256        | POLL | X     | 138.21     | 124      | 268      | 26.62        |
| 4096       | 512        | POLL |       | 155.93     | 136      | 302      | 31.67        |
| 4096       | 512        | POLL | X     | 146.36     | 124      | 278      | 32.69        |
| 4096       | 1024       | POLL |       | 156.49     | 133      | 307      | 33.14        |
| 4096       | 1024       | POLL | X     | 140.35     | 124      | 281      | 29.65        |
| 4096       | 2048       | POLL |       | 157.29     | 134      | 305      | 34.3         |
| 4096       | 2048       | POLL | X     | 141.56     | 124      | 270      | 31.31        |
| 4096       | 4096       | POLL |       | 160        | 139      | 303      | 32           |
| 4096       | 4096       | POLL | X     | 145.17     | 129      | 282      | 28.87        |

Phew, that's a lot of permutations. Nice! Our best time is now 124.09 ms, a huge improvement over our baseline of 198.49 ms.

## The Zombie
Hmmm... there is still one more question to be answered. Why is the variance so high in general, and especially in polling mode? And why does interrupt mode now seem to outperform polling mode? Theoretically, we should expect polling mode to have **lower** variance... Let's open up htop and check what the processor is doing.

![](https://i.imgur.com/FAkLRfH.jpg)

OH MY... Why are the CPUs floored at 100%? And what are all those **vscode** processes? 928 hours of uptime? It looks like we have a zombie vscode outbreak. Turns out this was the cause of our weird variance. After killing all those vscode processes and rerunning the benchmark:

| FIFO_DEPTH | CHUNK_SIZE | MODE | ALIGN | AVG (ms)   | MIN (ms) | MAX (ms) | STD_DEV (ms) |
| ---------- | ---------- | ---- | ----- | ---------- | -------- | -------- | ------------ |
| 1024       | 256        | INT  | X     | 275.39     | 218.71   | 286.2    | 5.19         |
| 1024       | 512        | INT  | X     | 145.82     | 144.02   | 147.52   | 0.8          |
| 1024       | 1024       | INT  | X     | 158.72     | 158.33   | 160.11   | 0.24         |
| 1024       | 256        | POLL | X     | **125.03** | 125.02   | 125.01   | 0.01         |
| 1024       | 512        | POLL | X     | **125.03** | 125.02   | 125.05   | 0            |
| 1024       | 1024       | POLL | X     | 130.74     | 130.72   | 130.85   | 0.03         |
| 4096       | 256        | INT  | X     | 274.47     | 270.49   | 278.47   | 1.65         |
| 4096       | 512        | INT  | X     | 146.61     | 143.3    | 149.02   | 0.77         |
| 4096       | 1024       | INT  | X     | **124.9**  | 124.85   | 124.98   | 0.02         |
| 4096       | 2048       | INT  | X     | **124.86** | 124.85   | 124.91   | 0.01         |
| 4096       | 4096       | INT  | X     | 138.77     | 134.19   | 141.28   | 1.98         |
| 4096       | 256        | POLL | X     | **124.82** | 124.81   | 124.84   | 0            |
| 4096       | 512        | POLL | X     | **124.81** | 124.8    | 124.83   | 0            |
| 4096       | 1024       | POLL | X     | **124.81** | 124.81   | 124.83   | 0            |
| 4096       | 2048       | POLL | X     | **124.82** | 124.82   | 124.85   | 0            |
| 4096       | 4096       | POLL | X     | 126.82     | 126.79   | 126.98   | 0.02         |

Ok, this is now looking much better, and we have a much easier time reaching good speeds. The silver lining of the zombie outbreak is that we also have data on what happens to our DMA transfers when the CPU is under heavy load, which in a live production machine will likely be the case!

NOTE: In the interest of time, we only include the page-aligned benchmarks.

## Concluding Thoughts

Even though we did most of our benchmarks during a vscode zombie plague, each optimization we applied did genuinely improve performance; the differences were just artificially amplified. Now we know to overprovision our FIFOs to 4K entries, even though that would be overkill when the CPU is mostly idle. Poll mode is great when we have CPU to spare, but interrupt mode pulls ahead under load.

Above all, we need a scientific approach to debugging performance issues, as there can be many overlapping underlying causes. Understanding what's happening under the hood at each layer, and applying optimizations in isolation while observing their relative impact, is essential for getting to the bottom of an issue while learning a lot along the way!

## Appendix: A Multi-threaded Alternative

So far we have assumed a single-threaded host application. However, the DMA on AWS F1 has four separate host-to-card channels that we could take advantage of: `xdma0_h2c_0`, `xdma0_h2c_1`, `xdma0_h2c_2`, and `xdma0_h2c_3`. So we can spawn two separate threads and write the chunks in parallel!
We can change the code to something like this (using scoped threads so the chunks and the barrier can be borrowed directly):

```rust
use std::fs::File;
use std::os::unix::fs::FileExt;
use std::sync::Barrier;
use std::thread;

fn stream_data_to_msm_core(h2c_0: File, h2c_1: File, scalars: &[u8], points: &[u8]) {
    let scalars_fifo_baseaddr = 0x1000_0000;
    let points_fifo_baseaddr = 0x2000_0000;
    // use 512 for a 1024-entry FIFO or 2048 for a 4096-entry FIFO
    let chunk_size = 512;

    // There are an equal number of point and scalar chunks
    let scalar_chunks = scalars.chunks(32 * chunk_size);
    let point_chunks = points.chunks(64 * chunk_size);

    // Both threads must rendezvous here after each chunk
    let barrier = Barrier::new(2);

    thread::scope(|s| {
        let scalars_thread = s.spawn(|| {
            for chunk in scalar_chunks {
                h2c_0.write_all_at(chunk, scalars_fifo_baseaddr).unwrap();
                // wait for points_thread to also reach this point
                barrier.wait();
            }
        });

        let points_thread = s.spawn(|| {
            for chunk in point_chunks {
                h2c_1.write_all_at(chunk, points_fifo_baseaddr).unwrap();
                // wait for scalars_thread to also reach this point
                barrier.wait();
            }
        });

        scalars_thread.join().unwrap();
        points_thread.join().unwrap();
    });
}
```

Due to the non-deterministic nature of multi-threaded execution, we need to synchronize the two threads after each chunk is written to prevent the same deadlock situation we had in the single-threaded version. For example, due to sub-optimal scheduling by the OS, we may end up in a situation where the points thread has finished writing a chunk before the scalars thread has even started. Therefore, we have the threads wait at a barrier until both have finished writing before continuing. For similar reasons, we must still keep the chunk size below the full size of the FIFO. In this fashion, hopefully, both FIFOs will be filled in parallel (most of the time).
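Tying it together, a hypothetical caller could look like this, reusing the `open_h2c_channel` sketch from earlier (channel indices and error handling are illustrative):

```rust
use std::io;

// Hypothetical glue code: open two of the four H2C channels and stream the
// MSM inputs through them. `open_h2c_channel` is the sketch from above.
fn run_msm_transfer(scalars: &[u8], points: &[u8]) -> io::Result<()> {
    let h2c_0 = open_h2c_channel(0)?;
    let h2c_1 = open_h2c_channel(1)?;
    stream_data_to_msm_core(h2c_0, h2c_1, scalars, points);
    Ok(())
}
```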
