# 2022q1 Homework6 (quiz5/6)
contributed by < `idoleat` >

## ToDo
- [x] Concurrency lecture notes (may need a closer second pass, paired with the actual quiz5/6 cases)
- [x] Hazard pointer
- [ ] Highlighted ToDos
- [ ] Correctness section in *A Pragmatic Implementation of Non-Blocking Linked-Lists*
    - [x] first pass
- [ ] explain lfhashmap
- [x] Read KVM and RCU materials
- [ ] point out implementation pitfalls
- [ ] make lfhashmap accept other types as hash keys
- [ ] Read [lfht](https://github.com/ksandstr/lfht) and its paper
- [ ] Test [atomic_hash](https://github.com/divfor/atomic_hash) at a larger scale

## Hazard pointer in [quiz5](https://hackmd.io/@sysprog/linux2022-quiz5)

In the video [Implementing Hazard Pointers in Rust](https://youtu.be/fvcbyCYdR10?t=328), Jon Gjengset mentioned that Facebook made a C++ [standardization proposal](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1121r3.pdf) for hazard pointers.

![](https://i.imgur.com/oRDwnC4.png)

> **3. Hazard Pointers**
> ...
> The key rule of the hazard pointers method is **that a retired object can be reclaimed only after it is determined that no hazard pointers have been pointing continuously to it from a time before its retirement.**
> ...
> **3.4. Pros and Cons**
> The main advantages of the hazard pointers method are that:
> 1. The number of removed objects that are not yet reclaimed is bounded.
> 2. Readers do not interfere with each other or with writers.
> 3. Cache friendly access patterns.
> 4. Constant time complexity for traversal and expected amortized constant time for determination of safe reclamation per retired object.

Comparison of Deferred Reclamation Methods:

![](https://i.imgur.com/ejmAXQY.png)

The implementation complexity chart provided by the folly documentation: // Cannot find it :(
Judging from the header files of RCU and the hazard pointer, though, RCU seems to be more complicated.

optimistic concurrency

### Implementation explanation and defects

**Test subjects:** A global config consisting of 3 unsigned ints, protected by a domain.
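A minimal sketch of that test subject, assuming hypothetical names (`config_t` and `shared_config` are mine, not necessarily the quiz code's identifiers):

```c
#include <stdatomic.h>

/* Hypothetical sketch: a global config of three unsigned ints.
 * A writer installs a new copy by swapping this pointer atomically;
 * readers must guard the pointer with a hazard pointer before
 * dereferencing it. */
typedef struct {
    unsigned v1, v2, v3;
} config_t;

static _Atomic(config_t *) shared_config;
```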
A domain has two lists:
* Pointer list: pointers in use
* Retired list: pointers that can be freed

Atomic operations are needed to guarantee safe reads and writes.

**Tester:** Several reader threads are created to read for n iterations. Several writer threads are created afterward to write for n/2 iterations.

**In a reader thread:** add a pointer to the config to the hazard pointer list so it is guarded (what if that fails?), then drop it (remove it from the list to unguard it).

**In a writer thread:** allocate a new config and swap the old one into the retired list, or deallocate it directly if possible.

**Linked list implementation:** notice that `list_remove` only swaps `nullptr` into the node, leaving slots for `list_insert_or_append()` to reuse. `free()` does something similar, in that it does not necessarily return memory to the OS right away, but it is still a library call that may end up in a system call. Not sure how much we can benefit from this. How to test it?

I need to keep this level of awareness while writing multithreaded code.

:::spoiler notes
There are several warnings. Linux kernel 5.18 has enabled `-Werror`, and so do we.

Hazard pointers are not necessarily wait-free. It depends on the underlying linked list: if the linked list is lock-free, the scheme is lock-free; if the linked list is wait-free, it is wait-free.
:::

#### Defects
* `list_free()` forgot to free `__hp.ptr`
* need to call `cleanup` (see below)
    * cleanup: clean up all pointers that have not been deallocated yet.
    * cleanup_ptr: deallocate a specific pointer or put it in the retired list.
* `DEFER` flag is not set to `1` by default

#### Questions
:::spoiler
* Why is this kind of design good for single-writer multi-reader? How about MWMR? If it is not good for MWMR, why not?
    * ==ToDo. Check the original paper.==
* Can `__ATOMIC_SEQ_CST` be changed to other memory orders? Is a memory barrier needed?
    * ==ToDo==
    * Actually I am still not clear on when to use each memory order variant. The [wiki](https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync) introduces them but seems to lack practical usage examples.
    * Never mind, I figured it out — well, half of it.
    * What's the difference between the variants with `n` and without `n`?
* Inspect when to use atomic operations
    * when accessing shared resources
    * calloc
* Is pthread_create() blocking or non-blocking?
    * Blocking. It creates the next thread only after the previous one has been created.
* How to ensure reader threads and writer threads will overlap for testing hazard pointers? Set a barrier? ==Need a better test==
    * Recompiling every time just to adjust a constant is annoying as well. Make it take arguments from the command line. (On the other hand, with compiled-in constants you cannot forget or mix up which parameters you just used.)
    * Every time I write these notes I realize how badly my knowledge falls short, and have to go back and catch up :baby_bottle:
* I don't understand the meaning of lines 157~165
    * The situation is that you could possibly load a value from that pointer, but the pointer got removed from the pointer list or retired afterward.
    * It seems to be the corner case we have to take care of. Don't we have a universal solution?
* Why `uintptr_t` instead of `void *`?
* When does list_insert happen instead of list_append?
    * when a `nullptr` has been swapped in
* Why is drop needed?
    * `drop()` removes the given pointer from the pointer list, so it is paired with `load()`, which adds the pointer to the list.
    * It's unsafe to rely on the programmer to remember to `drop` after `load`. Make it a callback or something?
        * Unlikely. If `drop` were a callback of `load`, the operations between the original `load` and `drop` would have to move inside `load`. Only when `load` and `drop` are tightly coupled can we do so.
    * About the `__builtin_unreachable()` in `drop`, @sternacht09 has done a [detailed study](https://hackmd.io/@sternacht09/ry1fD9C79) on it
* Need to call the unused **cleanup()** to deallocate
    * ==When? How often? Check the original paper please==
* Need to check memory safety (1 found).
    * ==ToDo==
* A better interface for using hazard pointers (ref: folly) would improve readability, reusability and maintainability
    * It may not be needed in this case, but you can check out the interface of folly's [hazptr.h](https://github.com/facebook/folly/blob/b72a70a35ada45ac286eb60028c0ecf5f6725665/folly/synchronization/Hazptr.h).
* The hazard pointer here does not seem to be the same as the one in the [paper](https://erdani.org/publications/cuj-2004-12.pdf)
    * Of course. It's not intended to be a full-blown implementation, just a simplified proof of concept.
* `flag` in swap should be `1`, or the thread will keep spinning until it is able to free, in which case the retired list is never used.
    * Modified.
:::

### Better testing

Material: 《[A Pragmatic Implementation of Non-Blocking Linked Lists](https://www.cl.cam.ac.uk/research/srg/netos/papers/2001-caslists.pdf)》

Most parts have been summarized in [Lock-free programming: on correctness](https://hackmd.io/@sysprog/concurrency/https%3A%2F%2Fhackmd.io%2F%40sysprog%2Fconcurrency-lockfree#%E9%97%9C%E6%96%BC%E6%AD%A3%E7%A2%BA%E6%80%A7=). Some more to add here:

* Assumptions
    * preceded operations: ++A precedes B++ means ++A is Sequenced-before B++
    * concurrent operations: no real-time ordering (the opposite of preceded operations; no gap of arbitrary length in between)
    * sequential history: every invocation is immediately followed by its response
* Correctness requirement: linearizability ([explanation](https://www.youtube.com/watch?v=noUNH3jDLC0) by Martin Kleppmann)
    * a concurrent history is equivalent to some sequential history (~~happens-before~~)
    * the sequential history is in real-time order
    * **Linearizability means that operations appear to take effect atomically at some point between their invocation and response.**

:::spoiler
Linearizability in the video: multiple replicas act like a single replica.

happens-before does not insist on a real-time gap in between; operations can even overlap, as long as the result is visible.

But [Herb Sutter says](https://youtu.be/CmxkPChOcvw?t=1676) linearizability only concerns the result, so operations can still overlap — then how is that different from happens-before?
Not being linearizable is also fine; it just means correctness has to be proven some other way. If it is linearizable, it is certainly correct, because it behaves exactly as if everything happened sequentially.
:::

<br>

`Search` is invoked in Insert, Delete and Find.
Insert, Delete and Find all use CAS to wrap up their operations.

* 3 ways of proof
    * proof of **linearizability and progress**
    * model checking
    * examining traces from particular program runs

Are there other ways of proof besides linearizability? Are all correct concurrent operations linearizable? Yes — otherwise the order of operations would differ from the expected one, and so would the result. How about letting a later operation finish first, while it stays aware that it is the later one, and applying some compensation? *If the results conflict, the compensation must be reverted and redone (or may even be unredoable!); if not, it's fine. That only works when the result is commutative, in which case an auto-merge is performed. If we accept this assumption (allowing a later op to finish first), I recommend always writing to a copy, leveraging immutability, then merging when the time comes.*

:::spoiler
Semaphores feel rather rare in practice? Could fine-grained locks be written back into the course material?

Herb: Measurement! DO MEASUREMENT!

[The Computer Science of Concurrency: The Early Years](https://lamport.azurewebsites.net/pubs/turing.pdf)
:::

#### How to apply it here?

"All the reads happen-before reclaim."
Linearize an update operation as "`write` => `read`+ => `reclaim`".

I found a recent paper ([DHash](https://www.semanticscholar.org/paper/DHash%3A-Dynamic-Hash-Tables-with-Non-blocking-Wang-Liu/099c1385ee58d4653a3f63eff0a6683b2da9a708)) in TPDS with a good (more detailed and integrated) proof of linearizability.

Write a better test: put the threads to sleep until everything is ready, then start them (see classmate 謝旻恩's notes for one way to write this). The start times need to be planned so that all the possibilities are covered. (Try to draw a state machine) (model checking?)
* I think it is more convenient to write such tests with coroutines, whose execution order can be controlled.

one situation:
```
0 time                                          inf
---------------------------------------------->
cfg_A: write => read+ => reclaim
cfg_B:   write => read+ => reclaim
cfg_C:     write => read+ => reclaim
cfg_D:       write => read+ => reclaim
cfg_E:         write => read+ => reclaim
cfg_F:           write => read+ => reclaim
```

### Thread Sanitizer

> note: `-g` is good for GDB but may crash other debuggers. ([ref](https://gcc.gnu.org/onlinedocs/gcc-4.8.3/gcc/Debugging-Options.html#Debugging-Options=))

### Compare with RCU

:::spoiler preface
In [folly/hazptr.h](https://github.com/facebook/folly/blob/b72a70a35ada45ac286eb60028c0ecf5f6725665/folly/synchronization/Hazptr.h) and [folly/rcu.h](https://github.com/facebook/folly/blob/main/folly/synchronization/Rcu.h), the most significant difference is that RCU scales better, since it protects all memory inside its critical sections, whereas hazptr is object specific. Not sure what the concern is in Linux, since I haven't understood it well yet.

I think it's better to introduce RCU by walking through its history, so that we can understand why we need it. Otherwise it just looks like yet another way of doing synchronization. [This video](https://www.youtube.com/watch?v=rxQ5K9lo034) does it well IMO.
:::

A hazard pointer needs to traverse objects to retry reclamation: reader overhead.

> RCU is unusual in being a purely social-engineering approach.

So are almost all consensus algorithms.

By benchmark, RCU is slightly better than hazptr. The retire list in hazptr could be cache-unfriendly (e.g. 100+ cores sharing one list). RCU is not?

#### Some Thoughts:

While reading the RCU materials, I came up with a thought (and [presented it on a stream](https://youtu.be/Ccf4Fb39ecA?t=365)): **Circular Copy-Update for multi-reader**, <small>possibly with multi-writer</small>.

Condition: multiple readers read $a$ and one writer writes it.
(Michael just reminded me that it's a ring buffer and I totally forgot that :face_palm:)

```graphviz
digraph {
    rankdir=LR;
    a0->a1;
    a1->a2;
    a2->a0;
}
```

It can have more than 3 nodes, or merely 2.

* Pros
    * readers are forced to update every round (eventual consistency)
    * No lock: reads and writes can happen at the same time
    * No `free()`: no dangling pointers or double frees (memory pool)
* Cons
    * For multiple writers
        * it might need one circular list per writer, and still needs a way to synchronize the writers.
        * Every writer is actually a reader by default
        * There's only one writer. Others need to tell that writer what to write, with a timestamp (the naive way; Kafka transactions)
* Someone must have had this thought before (a variant of RCU?) (an MPMC queue?)
    * [Lockless ring buffer](https://www.kernel.org/doc/html/latest/trace/ring-buffer-design.html)
    * [Circular Buffer](https://www.kernel.org/doc/html/latest/core-api/circular-buffers.html) by Paul E. McKenney
    * Haven't read those two yet

FAQ:
* Which problem are you solving?
    * What would happen in the original problem?
    * Atomic write
    * rwlock?
* How do you notify the readers that there is a new value to read in the next node?

> It sounds like an efficient local data structure. Combined with the global operations in CRDT, this could be an efficient implementation of the `merge` operation in CRDT since it's lockless
> [name= Michael L. Perry]

### Compare with the one in folly and the one in the original paper

Usage difference: in [folly](https://github.com/facebook/folly/blob/b72a70a35ada45ac286eb60028c0ecf5f6725665/folly/synchronization/Hazptr.h), an object inherits from `hazptr_obj_base` to be protected.

## [Quiz6](https://hackmd.io/@sysprog/linux2022-quiz6)

### Implementation Explanation

calloc? malloc? Allocation success is not checked.

What items should be tested when testing a program, besides running the various sanitizers? Does Linux have a test suite?

### Implementation pitfalls

---

Are there any resources showing how locks get used in the wrong ways?
Searching for best practices is impractical, but how to avoid pitfalls is definitely worth knowing.

---

## Conjectures

* No mutable shared resources, no data races/dependencies, no locks needed. With no shared data there are no data races, which also makes the work easier to partition.
    * Recurrence
* Exclusive say over data (like ownership in Rust)
* As in functional programming, allocate a fresh block of memory whenever data is modified
    * [Immutable Architecture](https://www.immutablearchitecture.com/)
* Weak consistency: CRDT[\[1\]](https://www.semanticscholar.org/topic/Conflict-free-replicated-data-type/626525)[\[2\]](https://crdt.tech/), aiming for eventual consistency. Also a kind of optimistic concurrency.
    * Video - [CRDTs: The hard parts](https://www.youtube.com/watch?v=PMVBuMK_pJY)
* Writing code in a functional or recursive style, where each function maintains its local variables in its own scope, can avoid shared data.
* **Need actual cases to prove the concept**

## Multiverse<small> [(ref)](https://www.youtube.com/watch?v=b7rZO2ACP3A)</small>

Cut a sequential program's algorithm into chunks whose unit is an atomic operation. Every worker holds a copy, and each worker is assigned one or several chunks to execute. A result that depends on another worker is first treated as a promise, which is resolved once the actual value arrives.

A worker is an abstraction over a POSIX thread, a process, a physical core, or a networked machine (node).

"Atomic operations" here means a sequence of instructions that must be executed in one go and cannot be divided, e.g. loading a value from memory into a register, modifying it, and storing it back to memory. It also covers code that is better executed on one worker and should not be split, such as loops, because loop variables usually need temporal and spatial locality. The algorithm decides where to cut based on these situations.

### Goal

Find a new memory model that avoids data races/dependencies by not sharing resources, so that a program can be split into several blocks that execute concurrently. The current idea is to do concurrency at the WASM instruction level: find suitable split points among the WASM instructions and insert the instructions necessary to achieve this. The minimal unit of partitioning is

### Why [WASM](https://webassembly.github.io/spec/core/intro/index.html)?

1. Independent of hardware architectures — it is a virtual ISA — and the toolchains for compiling mainstream languages to WASM are fairly mature, especially Rust's
2. [Sequentially executed instructions](https://webassembly.github.io/spec/core/intro/overview.html#concepts=) (no reordering) (to be confirmed)

> **Instructions:**
> The computational model of WebAssembly is based on a stack machine.
> **Code consists of sequences of instructions that are executed in order.** Instructions manipulate values on an implicit operand stack and fall into two main categories. Simple instructions perform basic operations on data. They pop arguments from the operand stack and push results back to it. Control instructions alter control flow. Control flow is structured, meaning it is expressed with well-nested constructs such as blocks, loops, and conditionals. Branches can only target such constructs.

3. Instructions can be organized into functions calling each other
    * Enabling a functional style at the instruction level?
4. By design it can already be split into multiple modules, for transfer, reference, and other operations

### Proof of concept

Need actual cases. I hope the application of the concept could be universal.
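Since the notes above ask for actual cases, here is a minimal toy case of the "no mutable shared resources, no locks needed" conjecture, with names of my own choosing: each thread sums its own disjoint slice into a thread-private accumulator, and the merge happens sequentially after `pthread_join`, so no synchronization on the data is ever required.

```c
#include <pthread.h>

#define N_WORKERS 2

/* Each worker gets a private slice descriptor: a disjoint view of the
 * input plus its own accumulator. No two threads touch the same memory. */
struct slice {
    const int *base;
    int n;
    long sum;
};

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    for (int i = 0; i < s->n; i++)
        s->sum += s->base[i]; /* writes only thread-private memory */
    return NULL;
}

long parallel_sum(const int *data, int len)
{
    pthread_t tid[N_WORKERS];
    struct slice part[N_WORKERS];
    int chunk = len / N_WORKERS; /* assumes len divides evenly */

    for (int w = 0; w < N_WORKERS; w++) {
        part[w] = (struct slice){ data + w * chunk, chunk, 0 };
        pthread_create(&tid[w], NULL, sum_slice, &part[w]);
    }

    long total = 0;
    for (int w = 0; w < N_WORKERS; w++) {
        pthread_join(tid[w], NULL);
        total += part[w].sum; /* merge happens sequentially, after join */
    }
    return total;
}
```

The merge step here is trivially commutative, which is exactly the property the linearizability discussion above says makes auto-merging safe.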