mimalloc 實作機制和改善

# mimalloc 實作機制和改善 [GitHub](https://github.com/microsoft/mimalloc) [論文](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/mimalloc-tr-v1.pdf) [學習共筆](https://hackmd.io/@hPMCWajOS-ORQdEEAQ04-w/SkJltC61H) ## 背景知識 * video: [std::allocator Is to Allocation what std::vector Is to Vexation](https://www.youtube.com/watch?v=LIb3L4vKZ7U) * [Brief introduction to common allocators](https://hackmd.io/@afcidk/BkBHqaEZB) * video: [Introduction to Lock-free Programming](https://www.youtube.com/watch?v=RWCadBJ6wTk) * 檢驗認知: 用 C11 atomics 改寫演說中示範的程式碼 * 何謂 ABA？如何避免 ABA？ * [重點整理](https://hackmd.io/@afcidk/rJ5Cxpbbr) * video: [Lock-Free Programming (or, Juggling Razor Blades), Part I](https://www.youtube.com/watch?v=c1gO9aB9nbs) / [Lock-Free Programming (or, Juggling Razor Blades), Part II"](https://www.youtube.com/watch?v=CmxkPChOcvw) * Mike Acton 的 [Recent sketches on concurrency, data design and performance](https://cellperformance.beyond3d.com/articles/2009/08/roundup-recent-sketches-on-concurrency-data-design-and-performance.html) * [What a C programmer should know about memory](https://marek.vavrusa.com/memory/) * [Key Concepts in Chrome Memory](https://chromium.googlesource.com/chromium/src/+/HEAD/docs/memory/key_concepts.md) ## mimalloc 原理在 mimalloc 中一共進行 4 次 Free List 的 Sharding。接下來我們會分別介紹這 4 個 Free List的Sharding 的位置及其為什麼要進行 Free List 的 sharding。 - [ ] 在 mimalloc 頁中進行的 Free List 的 sharding 在其他的記憶體配置的實作中，一般會為每類大小的物件保留一個 Free List，隨著對主記憶體不斷地分配與釋放，這個 list 中的物件可能遍佈於整個地址空間中，因此主記憶體配置的 locality 較差。而在 mimalloc 中，透過將主記憶體分頁，每個頁負責一個特定大小的記憶體配置，每個頁有個 Free List，因此主記憶體配置配的空間 locality 較好。 ![](https://i.imgur.com/ilxMZ6e.png) > 其他記憶體配置器的 Free List ![](https://i.imgur.com/uP3z6HB.png) > mimalloc 的 Free List - [ ] Local Free List mimalloc希望能夠限制記憶體配置與記憶體釋放在最差情況下的時間消耗，如果上層應用需要釋放一個非常大的結構體，其可能需要遞迴地釋放一些子結構體，因而可能帶來非常大的時間消耗。因而在 koka 與 lean 程式語言中，其執行時期系統使用 reference counter 來追蹤各種資料，然後保留一個延遲釋放的隊列(deferred decrement list)，當每次釋放的記憶體 block 超過一定數量後，就將剩餘需要釋放的記憶體 block 加入該隊列，在之後需要釋放的時候再進行釋放。那麼下一個需要確定的問題是什麼時候再去從該隊列中釋放記憶體。從延遲釋放隊列中繼續進行釋放的時機最好應當是記憶體配置器需要更多空間的時候，因此這需要記憶體配置器與上層應用的協作。在 mimalloc 中，其提供一個 callback，當進行了一定次數記憶體的配置與釋放後，會主動呼叫該 callback 來通知上層應用。mimalloc 在實現時檢測當前進行記憶體配置的頁的 Free List 是否為空，如果為空則呼叫該 callback，但為了避免用於一直不斷的配置與釋放記憶體，導致 Free List 一直不為空，而導致 callback 一直無法執行到。因此 mimalloc 將 Free List 第 2 次進行 Sharding，將其分為 Free List 與 Local Free List。當記憶體在進行配置時，會從對應頁的 Free List 中取得記憶體 block ，而釋放時會將記憶體 block 加入 Local Free List 中，因而在進行一定次數的記憶體配置後，Free List 必定為空，此時可以進行 deferred free 的 callback 的呼叫。 - [ ] Thread Free List 在 mimalloc 中每個 heap 都是個 Thread Local 的變數，而每次進行記憶體配置時，都會自這個 Thread Local 的 heap 中進行記憶體配置，而釋放時即可能從該執行緒中釋放也可能從其他執行緒中進行釋放。如果進行記憶體釋放的執行緒是該 heap 的擁有者，則其釋放的記憶體會加入到對應頁的 Local Free List 中，而由於還可能有其他的執行緒來釋放這些記憶體，因此 mimalloc 第 3 次進行 Free List 的 sharding，將 Local Free List分為 Local Free List 與 Thread Free List。在進行記憶體的釋放時，如果釋放的執行緒為記憶體 block 對應 heap 的擁有著，則將其加入 Local Free List，否則利用 CAS 操作將其加入 Thread Free List 中。mimalloc 經由這次分割來保證 heap 的所有者執行緒在自己的 heap 上進行記憶體的釋放是 lock-free，從而提升效能表現。 - [ ] Full List 第 4 次的 Free List 的 sharding 其實來自 mimalloc 自身實作，其記憶體配置的概念程式碼如下。由於在 mimalloc 中每個 heap 中都有一個陣列 pages，該陣列中每個元素都是一個由相應大小的頁組成的 queue；同時還有一個 pages_direct 的陣列，該陣列中每個元素對應一個記憶體 block 的大小類型，每個元素均為指向負責對應大小記憶體 block 配置的頁的指標。因此 mimalloc 在進行記憶體配置時會首先從該陣列指向的頁中嘗試進行配置，若配置失敗則呼叫 malloc_generic，在該函式中會走訪 pages 陣列中對應大小的 queue，此時若對應的 queue 中有很多頁均是滿的，且 queue 很長，則每次配置都會進行 queue 的走訪，致使效能的損失。 ```cpp void* malloc_small( size_t n ) { heap_t* heap = tlb; page_t* page = heap->pages_direct[(n+7)>>3]; block_t* block = page->free; if (block==NULL) return malloc_generic(heap,n); page->free = block->next; page->used++; return block; } ``` 因此 mimalloc 構建了一個 Full List，將所有已沒有空閒空間的頁放入該 queue 中，僅當該頁中有一些空閒空間被釋放後才會將其放回 pages 對應的 queue 中。而在由於記憶體的釋放可能由對應 heap 的擁有者執行緒進行也可能由其他執行緒進行，因此需要一定的方式提醒對應的堆該頁已有空閒 block，同時為了避免使用 lock 導致的開銷，mimalloc 透過加入一個 Thread Delayed Free List，如果一個頁處於 Full List 中，那麼在釋放時會將記憶體 block 加入Thread Delayed Free List 中，該 queue 會在呼叫 malloc_generic 時進行檢測與清除(由於是 Thread Local 的 heap，因此僅可能是擁有者來進行)，因此此時僅需通過 atomic 操作即可完成。那麼還有一個問題是當釋放記憶體時，其執行緒程如何知道是將記憶體 block 加入 Thread Free List 中還是 Thread Delayed Free List 中。mimalloc 使用設定 NORMAL, DELAYED, DELAYING 三種狀態來完成該操作。 ## locality and fragmentation [locality](https://en.wikipedia.org/wiki/Locality_of_reference) and [fragmentation](https://en.wikipedia.org/wiki/Fragmentation_(computing)) are different concepts. The [Mesh](https://arxiv.org/pdf/1902.04738.pdf) strategy is designed explicitly to reduce fragmentation. The Mesh allocator allocates memory randomly within a given span of memory (span = one or more pages). It then combines ("meshes") two memory spans into one such that the Robson worst case bound for fragmentation is broken with high probability, reducing memory use up to 39% in some cases. > "Robson showed that all such allocators can suffer from catastrophic memory fragmentation. This increase in memory consumption can be as high as the log of the ratio between the largest and smallest object sizes allocated. For example, for an application that allocates 16-byte and 128KB objects, it is possible for it to consume 13× more memory than required." (Citation as it appears in the original.)"" The mimalloc strategy, on the other hand, is designed explicitly to maximize locality, exploiting the advantages of the hardware cache architecture to increase speed, e.g. by 14% when compared to jemalloc. Mesh's random allocation seems to me to be fundamentally at odds with maximizing cache locality. On the other hand, the mimalloc paper describes a security-enhanced version of mimalloc, smimalloc, which has a similar feature: "The initial free list in a page is initialized randomly such that there is no predictable allocation pattern (to protect against heap feng shui (Sotirov, 2007)). Also, on a full list, the secure allocator will sometimes extend instead of using the local free list to increase randomness further." from [Reddit](https://www.reddit.com/r/rust/comments/c3qc9z/mimalloc_a_compact_general_purpose_allocator_with/) * [Mesh allocator / GitHub](https://github.com/plasma-umass/Mesh) ## hooks for a monotonic heartbeat and deferred freeing a feature of mimalloc where the application can register a callback function (via mi_register_deferred_free()) that is guaranteed to be called regularly after a bounded number of allocations by a thread. Every time mimalloc calls the user-defined "deferred free" callback, a counter (which they call the heartbeat count) is incremented, and the current value of the heartbeat count is given to the callback. They use this in the Lean theorem prover to terminate a thread that has been running too long. Ordinarily, one could use wall-clock time (and in fact, many uses of theorem provers such as why3 use wall-clock time), but because each machine has different hardware, wall-clock time is not deterministic, whereas termination after the heartbeat count exceeds a certain value is deterministic assuming the thread allocates deterministically. --- ## 測試與驗證 ```shell $ mkdir -p out/debug $ cmake -DCMAKE_BUILD_TYPE=Debug -DMI_CHECK_FULL=ON ../.. $ make $ LD_PRELOAD=./libmimalloc-debug.so ./cfrac 175451865205073170563711388363274837927895 ``` 預期可得: ``` mimalloc: process init: 0x7f319c41db80 mimalloc: debug level : 3 mimalloc: option 'secure': 0 mimalloc: option 'pool_commit': 0 mimalloc: option 'page_reset': 0 mimalloc: option 'show_stats': 0 heap stats: peak total freed unit count normal 2: 110.1 kb 69.2 mb 69.2 mb 16 b 4.5 m ok normal 4: 126.7 kb 517.7 mb 517.7 mb 32 b 17.0 m ok normal 6: 367.3 kb 1.2 mb 1.2 mb 48 b 25.7 k ok normal 8: 64 b 128 b 128 b 64 b 2 ok normal 17: 1.2 kb 1.2 kb 1.2 kb 320 b 4 ok normal 25: 1.2 kb 1.2 kb 1.2 kb 1.2 kb 1 ok normal 28: 2.0 kb 4.0 kb 4.0 kb 2.0 kb 2 ok normal 29: 2.5 kb 7.5 kb 5.0 kb 2.5 kb 3 not all freed! normal 32: 4.0 kb 4.0 kb 0 b 4.0 kb 1 not all freed! ``` --- ## 可能的改進機會 * [Skip List or doubly Linked List](https://hackmd.io/@hPMCWajOS-ORQdEEAQ04-w/HkkQaRB-H) * 學習 [libdivine](https://github.com/ridiculousfish/libdivide) 的 branchfree 手法，避免迴圈中的除法操作 :::info File `src/heap.c` :: `mi_heap_area_visit_blocks` ::: ```cpp size_t free_count = 0; for (mi_block_t* block = page->free; block != NULL; block = mi_block_next(page,block)) { free_count++; mi_assert_internal((uint8_t*)block >= pstart && (uint8_t*)block < (pstart + psize)); size_t offset = (uint8_t*)block - pstart; mi_assert_internal(offset % page->block_size == 0); size_t blockidx = offset / page->block_size; // Todo: avoid division? mi_assert_internal( blockidx < MI_MAX_BLOCKS); size_t bitidx = (blockidx / sizeof(uintptr_t)); size_t bit = blockidx - (bitidx * sizeof(uintptr_t)); free_map[bitidx] |= ((uintptr_t)1 << bit); } ``` * 原本 `mimalloc` 著眼於小型記憶體物件的配置，而 `jemalloc` 對於 [Improve scalability of huge allocation](https://github.com/jemalloc/jemalloc/issues/189) 著墨更多 * Linux 核心的 Transparent Huge Pages (THP) * 提供 test suite * 引入 PGO (profile guided optimization) 和 LTO (link-time optimization) * [A Fresh Look At The PGO Performance With GCC 8](https://www.phoronix.com/scan.php?page=article&item=gcc-82-pgo) * [Optimizing large applications](https://www.ucw.cz/~hubicka/slides/labs2013.pdf) * 實作 [mallopt](http://man7.org/linux/man-pages/man3/mallopt.3.html) 的支援 --- ## 相關資訊 * [libleak](https://github.com/WuBingzheng/libleak): detects memory leak by hooking memory functions (e.g. malloc) by LD_PRELOAD.