羅習五
# Questions for authors’ response

**1. @A Why do you not present strong scaling measurements?**

We believe the main factors affecting the performance of ccNUMA-aware algorithms are the NUMA factor and the way cores transfer data to each other. The NUMA factor is the ratio of the longest to the shortest core-to-core latency. Figure 2 in our supplementary material demonstrates the performance of different algorithms under different NUMA factors using a simulator. Although the results of this simple simulator do not exactly match the experimental results, the experiment still helps us understand that RON performs better under lower NUMA factors. Under extremely high NUMA factors, RON's high fairness often results in longer communication paths (such as crossing sockets), which in turn results in higher latency. Once we have a multi-processor platform, we plan to design ways to integrate RON with other lock algorithms, such as C-TKT-RON, which will combine ticket locks with RON.

We can infer scalability from the number of threads in Figures 5 and 7, since the experimental conditions of the micro-benchmark are the same. Points numbered 1 to 6 in both figures represent the same RON experiment. Therefore, we combine the data from Figure 5 and Figure 7 to obtain Figure X, which demonstrates the excellent scalability of RON with respect to the number of threads.

![](https://i.imgur.com/jMkXYJd.png)
![](https://i.imgur.com/GslDPQB.png)
![](https://i.imgur.com/eXgIO0z.png)

**2. @B.1, @E How does it behave in low-contention scenarios? What is the primary downside of RON compared to Plock?**

RON remains correct even if there are no threads on a core. Fig. 5a indicates that RON slightly outperforms Plock under low contention, while Plock is better than RON under extremely low contention. Rachid Guerraoui et al.'s paper [39] suggests that no single lock algorithm can perform well in both high- and low-contention scenarios.
We are currently attempting to dynamically switch spinlock algorithms; please refer to response 23.

![](https://i.imgur.com/CViJ3VM.png)

**3. @B.2, B.3 Why is this fair in cases where the number of threads that contend on a CS are unbalanced? Neither RON-plock nor RON-ticket can achieve long-term fairness and bounded waiting?**

In our supplementary material, we provide proofs that RON-ticket and RON-plock satisfy bounded waiting and probabilistic fairness, respectively. RON-plock's probability of entering the critical section is related to the CPU time a thread receives, which matches programmers' expectations, since CPU time is related to priority in Linux. RON-plock satisfies long-term fairness unless a thread is continuously unable to obtain CPU time; if a task cannot obtain CPU time at all, whether it acquires the lock is moot.

![](https://i.imgur.com/x7Gd0Ui.png)

**4. @C How does RON perform compared to other algorithms in some scenarios? ...RON may be arbitrarily worse than FIFO.**

RON's algorithm is a heuristic, so it cannot guarantee the optimal solution. However, our experiments show that it performs very well in practice, much as quicksort performs well despite its worst-case time complexity of O(N^2).

**5. @D Does Algorithm 1 prevent cmp_xchg from causing high coherence invalidations?**

The cmp_xchg used in line 14 is implemented with test and test-and-set. InUse is only true when no one is waiting to enter the critical section (i.e., under very low contention).

![](https://i.imgur.com/fh1A2QH.png)

**6. @D How do you envision the TSP order getting set in practice since it is CPU-specific?**

The interaction between cores involves not only latency but also connectivity. Therefore, as the number of cores continues to increase, CPU manufacturers may need to provide more precise data to assist software optimization.

# Other Questions

**7. @A Why were only a few applications selected from the outrageously out-of-date SPLASH2 and Parsec?**

For comparison with other algorithms, we implemented RON on the LiTL framework [39]. We selected benchmarks based on Rachid Guerraoui et al.'s classification [39] of parallel applications. RON performs worse than Plock in the case of Ferret.

![](https://i.imgur.com/ndQWxXu.png)

**8. @A The introduction of blocking (isn't this a spinlock?) and the use of undefined identifiers (e.g., order).**

Thank you for pointing out the 'blocking' issue; our description of this part was not clear enough. In RON-ticket, a running thread knows whether it is the next qualified thread (WaitArray[TSP_ID].grant − l_ticket ≠ 1) to enter the critical section on its core. If it is not the next qualified thread, it can relinquish the CPU to avoid unnecessary spinning. We will improve the description in Section 5.2. Thank you also for pointing out the subtle but important error: "order" is a typo and should be "TSP_ID."

**9. @A, @B Programmers may need to use RON-ticket or RON-plock by default, as they cannot predict oversubscription at compile time. It might make it hard to adopt this approach.**

A programmer only needs to choose between RON-ticket and RON-plock. Except for the quantitative analysis, all experiments in the paper use either RON-plock or RON-ticket. We will explain the differences in the usage of the three algorithms in the paper.

![](https://i.imgur.com/3biwGrN.png)

**10. @A Nits/Questions/Comments**

Thank you for your suggestions. We will work on improving these three issues.

**11. @C Why is minimizing handover cost the "right" metric?**

As far as we know, when two cores access the same data, the data exchanged is one cache line (64 B). Therefore, at this fixed transfer size, lower latency means higher bandwidth.

**12. @C Different arrival patterns (even in some random order) and performance of RON across such patterns.**

Whenever multiple threads are waiting to enter the critical section simultaneously, RON follows the TSP order rather than the arrival order. The supplementary material includes an experiment demonstrating the effectiveness of the heuristic algorithm: RON without the TSP order, referred to as non-RON, uses randomly generated circular paths. Fig. 5 shows that RON outperforms non-RON by a significant margin, indicating that this heuristic, although simple, is indeed effective.

![](https://i.imgur.com/FaQaW3E.png)

**13. @D Doesn't atomic_fetch_sub(&nWait, 1) trigger an invalidation and cache line fetch on every CPU per lock release under heavy contention?**

Although nWait looks similar to the global grant in the ticket lock, it is used differently. In the ticket lock, each CPU must update the global grant of each core, as every waiting core continuously checks whether the global grant equals its local ticket. In RON-ticket, threads only touch nWait when entering and leaving the critical section, and no thread constantly polls its value. Waiting cores in RON-ticket check WaitArray[TSP_ID].grant; since each core has its own TSP_ID, accessing WaitArray[TSP_ID].grant performs like a thread-local variable. Because nWait is not heavily accessed or polled, it does not become a performance bottleneck.

![](https://i.imgur.com/C9Ega93.png)

**14. @D While it seems unlock is crafty about invalidations, it is still linear in N under low contention.**

The size of WaitArray grows with the number of cores, which may make the unlock() operation inefficient. We have optimized unlock() in the following ways:

1. Because adjacent cores are likely to have adjacent TSP_IDs and may share an L3 cache, aligning WaitArray to the cache line size improves efficiency.
2. Add a prefetch instruction.
3. When a lock is not hot, i.e., its WaitArray is rarely modified, a cache hit is almost guaranteed.

As you mentioned, in the worst-case scenario RON cannot match Plock.

**15. @D Why is it insufficient for lock() to set WaitArray[x]=1, cmp_xchg InUse, and if unsuccessful, poll until WaitArray[x] == 0?**

Please refer to the following diagram. Thread_10 holds the lock and executes line 18, with the for-loop variable i at 13. At the same time, Thread_12 executes line 10. Since Thread_10's i is already 13, it cannot detect Thread_12's attempt to enter the critical section; Thread_10 will set InUse and leave. If Thread_12 only checks WaitArray[10] in the while loop, it may never enter the critical section.

![](https://i.imgur.com/q4pTjKw.png)

**16. @D In Fig. 4, is that the core-to-core handoff latency for cores within a CCX?**

Yes. We will revise the paper according to your suggestions.

**17. @D Why in the benchmark do you use 100B transfers instead of 8B or 16B transfers?**

The length of a data exchange between cores is usually the cache line size (64 B); 100 B has no special meaning.

**18. @D Die 0 cores are a better binning than the other... some experimental validation is needed to show that these incongruities are important for the lock to exploit.**

Yes, the faster performance of Die0 causes the unfairness mentioned in Figure 6. This phenomenon can be seen more clearly in Figure 8 of the supplementary material: Die0 and Die2 (which is closer to Die0) obtain the lock more frequently.

![](https://i.imgur.com/54Yq3Tp.png)

**19. @D Section 6.2 says the difference between cores in some cases can be 20.6x!!! If this is true, then I buy the paper's argument about this.**

Usain Bolt can run 100 meters in 9.7 seconds, while it takes me 10 seconds. Although my performance is 97% of Usain Bolt's, if we were to race 1,000 times, I would not stand a chance of winning.
This is analogous to faster cores, which always react more quickly than slower cores and can therefore acquire the lock almost every time.

**20. @D Continually query the value of the 'grant', which consumes limited NoC bandwidth?**

Each core has its own TSP_ID, so WaitArray[TSP_ID].grant exists only in that core's local cache.

![](https://i.imgur.com/jNVSiJU.png)

**21. @D If you repeat the run, you get the same CV values for each lock type?**

Sorry for the confusion. The CV values for each lock are different.

**22. @D What do worst-case unlock latencies look like for RON in pathological cases?**

The worst-case scenario occurs when only a few threads are waiting to enter the critical section.

**23. @D Some are high contention and some are low, and they change over time. Now, should you use something like Plock or something like RON?**

Your point is very insightful. Our papers currently in progress are: 1) a RON variant that addresses the reader-writer problem, and 2) a method that enables dynamic switching of algorithms. The switching method involves: 1) creating a switch_thread whose task is to dynamically switch algorithms, 2) using function pointers, and 3) ensuring that switching happens at a "safe time." When the switch_thread decides to switch algorithms, it competes for the lock; when it becomes the owner of the lock, that point in time is the "safe time." You have given us much valuable input. Thank you very much.

**24. @E ...to balance throughput and fairness. However, I didn't see this idea explored very deeply.**

Thank you for your guidance. We will conduct research in the direction you suggested.

**25. @E The evaluation results are good but not great.**

The potential for improvement in ccNUMA-aware algorithms depends on the hardware. The NUMA factor is the longest latency divided by the shortest latency. In our experiments on the AMD 2990WX, the NUMA factor is 3.1.
In the case of the CLOF paper [48], the NUMA factors of the two platforms used in the experiments were 12.18 and 7.04, respectively. On platforms with a larger NUMA factor, more complex algorithms can be used and a larger potential for improvement exists. We believe that RON performs well on a platform with a NUMA factor of 3.1.

[48] https://dl.acm.org/doi/10.1145/3477132.3483557
