# Questions for authors’ response
**1. @A Why do you not present strong scaling measurements?**
We believe that the main factors affecting the performance of ccNUMA-aware algorithms are the NUMA factor and the way cores transfer data to one another. The NUMA factor is the ratio of the longest to the shortest latency. Figure 2 in our supplementary material shows the performance of different algorithms under different NUMA factors using a simulator.
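In symbols (our restatement of the definition above):

$$\text{NUMA factor} = \frac{\max_{i \ne j}\,\mathrm{latency}(i, j)}{\min_{i \ne j}\,\mathrm{latency}(i, j)}$$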
Although the results of this simple simulator do not exactly match the experimental results, the experiment still helps explain why RON performs better under lower NUMA factors. Under extremely high NUMA factors, RON's high fairness often produces longer communication paths (such as crossing sockets), resulting in higher latency. Once we have a multi-processor platform, we plan to design ways to integrate RON with other lock algorithms, such as C-TKT-RON, which would combine ticket locks with RON.
Scalability with respect to the number of threads can be inferred from Figures 5 and 7, since both micro-benchmarks were conducted under the same experimental conditions. Points numbered 1 to 6 in both figures represent the same RON experiment. We therefore combine the data from Figure 5 and Figure 7 to obtain Figure X, which demonstrates the excellent scalability of RON with respect to the number of threads.

**2. @B.1@E How does it behave in low-contention scenarios? What is the primary downside of RON compared to Plock?**
RON remains correct even if there are no threads on a core. Fig. 5(a) indicates that RON slightly outperforms Plock under low contention, while Plock is better than RON under extremely low contention. Guerraoui et al. [39] suggest that no single lock algorithm can perform well in both high- and low-contention scenarios. We are currently working on dynamically switching between spinlock algorithms; please refer to our response to Question 23.

**3. @B.2,B.3 Why is this fair in cases where the number of threads that contend on a CS are unbalanced? Neither RON-plock nor RON-ticket can achieve longterm fairness and bounded waiting?**
In our supplementary material, we provide proofs that RON-ticket satisfies bounded waiting and that RON-plock satisfies probabilistic fairness. RON-plock's probability of entering the critical section is related to the CPU time a thread receives, which matches programmers' expectations, since CPU time is tied to priority in Linux. RON-plock satisfies long-term fairness unless a thread continuously fails to obtain CPU time; and if a task cannot obtain CPU time at all, whether it acquires the lock is immaterial.

**4. @C How does RON perform compared to other algorithms in some scenarios? ...RON may be arbitrarily worse than FIFO.**
RON's ordering is computed by a heuristic, so it cannot guarantee the optimal solution. According to our experiments, however, it performs very well in practice, much as quicksort performs well despite its worst-case time complexity of O(N^2).
**5. @D Does Algorithm-1 prevent cmp_xchg from causing high coherence invalidations?**
The cmp_xchg used in line 14 is implemented with the test and test-and-set pattern. InUse is only true when no thread is waiting to enter the critical section (i.e., under very low contention).
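For clarity, here is a minimal sketch of the test and test-and-set pattern we mean; `InUse` follows the paper's name, while `try_set_inuse` and the exact predicate are illustrative assumptions rather than Algorithm 1's actual code:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool InUse = false;

/* Test and test-and-set: read the flag with a plain load first, and
 * issue the invalidating compare-exchange only when that read suggests
 * it can succeed, so failed attempts do not bounce the cache line. */
static bool try_set_inuse(void)
{
    bool expected = false;

    /* Cheap test: a failed check leaves the line in shared state. */
    if (atomic_load_explicit(&InUse, memory_order_relaxed))
        return false;

    /* Only now the test-and-set, which invalidates other copies. */
    return atomic_compare_exchange_strong(&InUse, &expected, true);
}
```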

**6. @D How do you envision the TSP order getting set in practice since it is CPU-specific?**
The interaction between cores involves not only latency but also connectivity. Therefore, as the number of cores continues to increase, CPU manufacturers may need to provide more precise data to assist software optimization.
# Other Questions
**7. @A Why were only a few applications selected from the outrageously out of date SPLASH2 and Parsec?**
For comparison with other algorithms, we implemented RON based on the LiTL framework [39], and we selected benchmarks based on Guerraoui et al.'s classification [39] of parallel applications. RON performs worse than Plock in the case of Ferret.

**8. @A the introduction of blocking (isn't this a spinlock?) and the use of undefined identifiers (e.g. order).**
Thank you for pointing out the 'blocking' issue; our description of this part was not clear enough.
In RON-ticket, a running thread knows whether it is the next qualified thread (WaitArray[TSP_ID].grant − l_ticket ≠ 1) to enter the critical section on this core. If the thread is not the next qualified one, it can relinquish the CPU to avoid unnecessary spinning. We will improve the description in Section 5.2.
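As a minimal sketch of this decision (the paper's names WaitArray, grant, and l_ticket are kept; wait_slot_t and the surrounding structure are our illustrative assumptions, and the critical-section entry test itself is elided):

```c
#include <sched.h>
#include <stdatomic.h>

typedef struct { atomic_uint grant; } wait_slot_t;

/* While this thread is not the next qualified one on its core
 * (grant - l_ticket != 1 in the paper's notation), relinquish the CPU
 * instead of spinning, so that under oversubscription a runnable
 * thread (possibly the lock holder) can make progress. */
static void wait_or_yield(wait_slot_t *slot, unsigned l_ticket)
{
    while (atomic_load_explicit(&slot->grant, memory_order_acquire)
               - l_ticket != 1)
        sched_yield();
}
```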
Thank you for pointing out this subtle but important error: 'order' is a typo and should be 'TSP_ID'.
**9. @A,@B Programmers may need to use RON-ticket or RON-plock by default, as they cannot predict oversubscription at compile time. It might make it hard to adopt this approach.**
A programmer only needs to choose between RON-ticket and RON-plock. Except for the quantitative analysis, all experiments in the paper use either RON-plock or RON-ticket. We will explain the differences in the usage of the three algorithms in the paper.

**10. @A Nits/Questions/Comments**
Thank you for your suggestion. We will work on improving these three issues.
**11. @C Why is minimizing handover cost the “right” metric?**
As far as we know, when two cores access the same data, the unit of exchange is the cache line (64 B). Therefore, at the same transfer size (64 B), lower latency means higher bandwidth.
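Put as a formula (the 100 ns handoff latency below is a hypothetical value for illustration):

$$\text{handover bandwidth} = \frac{64\,\mathrm{B}}{\text{handoff latency}}, \qquad \text{e.g.}\ \frac{64\,\mathrm{B}}{100\,\mathrm{ns}} = 0.64\,\mathrm{GB/s}$$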
**12. @C different arrival patterns (even in some random order) and performance of RON across such patterns.**
Whenever multiple threads are waiting to enter the critical section simultaneously, RON will follow the TSP order instead of the arrival order.
In the supplementary material, we include an experiment demonstrating the effectiveness of the heuristic algorithm. RON without the TSP order, referred to as non-RON, uses randomly generated circular paths. As Fig. 5 shows, RON outperforms non-RON by a significant margin, indicating that this heuristic, although simple, is indeed effective.

**13. @D Doesn't atomic_fetch_sub(&nWait, 1) trigger an invalidation and cache line fetch on every CPU per lock release under heavy contention?**
Although nWait appears similar to the global grant in the ticket lock, they are used differently. In the ticket lock, each release updates the global grant, and every waiting core continuously checks whether the global grant equals its local ticket.
In RON-ticket, threads only update nWait when entering and leaving the critical section, and no thread constantly polls its value; waiting cores instead check waitArray[TSP_ID].grant. Since each core has its own TSP_ID, accessing waitArray[TSP_ID].grant performs like a thread-local variable.
Because nWait is neither heavily accessed nor polled, it does not become a performance bottleneck.
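The contrast, sketched minimally (the paper's names waitArray and TSP_ID are kept; NCORES, the slot layout, and the simplified grant == l_ticket entry test are our assumptions; see Section 5.2 for the exact predicate):

```c
#include <stdatomic.h>

#define NCORES 64    /* illustrative core count */

static struct { atomic_uint grant; } waitArray[NCORES];

/* Classic ticket lock: every waiter polls ONE shared grant word, so
 * each release invalidates that cache line on every waiting core. */
static void ticket_wait(atomic_uint *global_grant, unsigned my_ticket)
{
    while (atomic_load_explicit(global_grant, memory_order_acquire)
           != my_ticket)
        ;
}

/* RON-ticket: each core polls only its own slot, which stays in the
 * local cache until the releasing core writes exactly that slot. */
static void ron_wait(unsigned tsp_id, unsigned l_ticket)
{
    while (atomic_load_explicit(&waitArray[tsp_id].grant,
                                memory_order_acquire) != l_ticket)
        ;
}
```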

**14. @D While is seems unlock is crafty about invalidations, it is still linear in N under low contention.**
The size of WaitArray grows with the number of cores, which may make the unlock() operation inefficient. We have optimized unlock() in the following ways (see the sketch after this list).
1. Because adjacent cores are likely to have adjacent TSP_IDs and may share an L3 cache, aligning WaitArray entries to the cache-line size improves efficiency.
2. We add prefetch instructions.
3. When a lock is not hot, i.e., its WaitArray is rarely modified, cache hits are almost guaranteed.
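A hedged sketch of optimizations 1 and 2 (the `waiting` flag, `next_waiter`, and NCORES are our illustrative assumptions about the unlock() scan, not the paper's exact code):

```c
#include <stdatomic.h>

#define NCORES 64    /* illustrative */

/* Optimization 1: one cache line per entry, so the unlock() scan does
 * not false-share with a waiter's local spin loop. */
typedef struct {
    atomic_uint grant;     /* polled by the waiter on this core */
    atomic_uint waiting;   /* hypothetical "wants the lock" flag */
    char pad[64 - 2 * sizeof(atomic_uint)];
} __attribute__((aligned(64))) wait_slot_t;

static wait_slot_t WaitArray[NCORES];

/* Optimization 2: prefetch the next entry while testing the current
 * one, hiding the miss latency of a cold scan. When the lock is not
 * hot (point 3 above), these lines are rarely invalidated, so the
 * scan hits in cache anyway. */
static int next_waiter(int from)
{
    for (int i = from; i < NCORES; i++) {
        __builtin_prefetch(&WaitArray[(i + 1) % NCORES], 0, 1);
        if (atomic_load_explicit(&WaitArray[i].waiting,
                                 memory_order_relaxed))
            return i;    /* hand the lock to this core next */
    }
    return -1;           /* no waiter found */
}
```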
As you mentioned, in this worst-case scenario RON cannot match Plock.
**15. @D why it is insufficient for lock() to set WaitArray[x]=1, cmp_xchg InUse, and if unsuccessful, poll until WaitArray[x] == 0?**
Consider the following interleaving. Thread_10 holds the lock and executes line 18, with the for-loop variable i at 13. At the same time, Thread_12 executes line 10. Since Thread_10's i is already 13, it cannot detect Thread_12's attempt to enter the critical section; Thread_10 sets InUse and leaves. If Thread_12 only checked WaitArray[10] in its while loop, it might never enter the critical section.

**16. @D In Fig.4, is that the core-to-core handoff latency for cores within a CCX?**
Yes. We will revise the paper according to your suggestion.
**17. @D Why in the benchmark do you use 100B transfers instead of 8B or 16B transfers?**
Data exchanged between cores usually moves at cache-line granularity (64 B); the 100 B size has no special meaning.
**18. @D die 0 cores are a better binning than the other... some experimental validation is needed to show that these incongrencies are important for the lock to exploit**
Yes, the faster performance of Die0 causes the unfairness mentioned in Figure 6. This phenomenon can be seen more clearly in Figure 8 of the supplementary material: Die0 and Die2 (which is closer to Die0) obtain the lock more frequently.

**19. @D Section 6.2 says the difference between cores in some cases can be 20.6x!!! If this is true, then I buy the paper’s argument about this.**
Usain Bolt can run 100 meters in 9.7 seconds, while it takes me 10 seconds. Although my performance is 97% of his, if we raced 1,000 times, I would not stand a chance of winning. Analogously, faster cores always react more quickly than slower cores, enabling them to acquire the lock almost every time.
**20. @D continually query the value of the ‘grant’, which consumes limited NoC bandwidth?**
This is because each core has its own TSP_ID, so WaitArray[TSP_ID].grant resides only in that core's local cache; polling it generates no NoC traffic until the grant is actually written.

**21. @D If you repeat the run, you get the same CV values for each lock type?**
Sorry for the confusion. The CV values for each lock are different.
**22. @D What do worst case unlock latencies look like for RON in pathological cases?**
Yes. The worst-case scenario occurs when only a few threads are waiting to enter the critical section.
**23. @D some are high contention and some are low and which are changing over time. Now, should you use something like Plock or something like RON?**
Your point of view is very insightful. Our papers currently in progress are: 1) a RON variant that addresses the reader-writer problem, and 2) a method that enables dynamic switching of lock algorithms.
The switching method involves: 1) creating a switch_thread whose task is to dynamically switch algorithms, 2) using function pointers, and 3) ensuring that the switch happens at a "safe time." When switch_thread decides to switch algorithms, it competes for the lock; the moment it becomes the owner of the lock is the "safe time." A minimal sketch follows.
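This sketch illustrates only the safe-time idea; lock_ops, plock_ops, ron_ops, and the shared state are hypothetical names, and real switching must also handle threads already queued inside the old algorithm, which the sketch does not show:

```c
#include <stdatomic.h>

struct lock_ops {
    void (*lock)(void *state);
    void (*unlock)(void *state);
};

extern const struct lock_ops plock_ops, ron_ops;  /* assumed backends */

static _Atomic(const struct lock_ops *) current_ops = &plock_ops;

/* Every thread acquires through the current function pointers. */
static void generic_lock(void *state)
{
    atomic_load(&current_ops)->lock(state);
}

/* switch_thread competes for the lock like any other thread; the
 * moment it owns the lock is the "safe time": no thread is in the
 * critical section, so the entry points can be swapped. */
static void switch_algorithm(void *state, const struct lock_ops *next)
{
    const struct lock_ops *cur = atomic_load(&current_ops);
    cur->lock(state);                   /* become the owner: safe time */
    atomic_store(&current_ops, next);   /* later acquisitions use next */
    cur->unlock(state);                 /* release via the old backend */
}
```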
You have provided us with so much valuable input. Thank you very much.
**24. @E ...to balance throughput and fairness. However, I didn’t see this idea explored very deeply.**
Thank you for your guidance. We will conduct research in the direction you suggested.
**25. @E The evaluations results are good but not great.**
The potential for improvement of ccNUMA-aware algorithms depends on the hardware. The NUMA factor is the longest latency divided by the shortest latency. On the AMD 2990WX used in our experiments, the NUMA factor is 3.1; in the CLOF paper [48], the NUMA factors of the two experimental platforms were 12.18 and 7.04, respectively. On platforms with a larger NUMA factor, more complex algorithms can be used, and a larger potential for improvement exists. We believe that RON performs well on a platform with a NUMA factor of only 3.1.
[48] https://dl.acm.org/doi/10.1145/3477132.3483557