# mimalloc

> All tests below use version 2.0.1.
> - `dev`: development branch for mimalloc v1.
> - `dev-slice`: development branch for mimalloc v2 with a new algorithm for managing internal mimalloc pages.

## Installation

Build the project following the instructions on [GitHub](https://github.com/microsoft/mimalloc):

```shell
mkdir -p out/release
cd out/release
cmake ../..
make
```

Then install it system-wide with `sudo make install`, which prints:

```cmake
Install the project...
-- Install configuration: "Release"
-- Installing: /usr/local/lib/mimalloc-2.0/libmimalloc.so.2.0
-- Installing: /usr/local/lib/mimalloc-2.0/libmimalloc.so
-- Installing: /usr/local/lib/mimalloc-2.0/cmake/mimalloc.cmake
-- Installing: /usr/local/lib/mimalloc-2.0/cmake/mimalloc-release.cmake
-- Installing: /usr/local/lib/mimalloc-2.0/libmimalloc.a
-- Installing: /usr/local/lib/mimalloc-2.0/include/mimalloc.h
-- Installing: /usr/local/lib/mimalloc-2.0/include/mimalloc-override.h
-- Installing: /usr/local/lib/mimalloc-2.0/include/mimalloc-new-delete.h
-- Installing: /usr/local/lib/mimalloc-2.0/cmake/mimalloc-config.cmake
-- Installing: /usr/local/lib/mimalloc-2.0/cmake/mimalloc-config-version.cmake
-- Symbolic link: /usr/local/lib/libmimalloc.so -> mimalloc-2.0/libmimalloc.so.2.0
-- Installing: /usr/local/lib/mimalloc-2.0/mimalloc.o
```

> However, according to the description in README.md:
>
> `sudo make install` (install the library and header files in `/usr/local/lib` and `/usr/local/include`)

the header files should be placed in `/usr/local/include`, not in `/usr/local/lib/mimalloc-${version}/include`.

The following code can be found in `CMakeLists.txt`:

```cmake
if (MI_INSTALL_TOPLEVEL)
  set(mi_install_dir "${CMAKE_INSTALL_PREFIX}")
else()
  set(mi_install_dir "${CMAKE_INSTALL_PREFIX}/lib/mimalloc-${mi_version}")
endif()
```

On Unix systems `CMAKE_INSTALL_PREFIX` defaults to `/usr/local`, so when `MI_INSTALL_TOPLEVEL` is unset (the default), CMake points the install path at `/usr/local/lib/mimalloc-${mi_version}` and never places the header files into `/usr/local/include`, which breaks the `#include` directives in user programs.

[Issue 223](https://github.com/microsoft/mimalloc/issues/223) discusses this problem as well, but the developer felt that a top-level install would cause version confusion, namely that
`/usr/local/include/mimalloc.h` would carry no version information. This could, however, be handled with extra logic in `CMakeLists.txt`, as in [elbaro](https://github.com/microsoft/mimalloc/issues/223#issuecomment-631230861)'s comment:

>...
>.h - it's good to have versioned headers if you want, but it should be under `/usr/include/`, not `/usr/lib/`.
>
>I recommend to have versioned filenames and also unversioned symlink.
>
>e.g.
>
>```c
>#include <mimalloc.h> // symlink
>#include <mimalloc-1.5/mimalloc.h> // if want a specific version
>#include <mimalloc/mimalloc-1.5.h> // or
>#include <mimalloc.1.5.h> // or
>```

According to the [Linux Filesystem Hierarchy](https://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/usr.html):

>`/usr/include`
>The directory for 'header files', needed for compiling user space source code.

so `/usr/local/include` is indeed where header files should go.

The installed layout looks like this:

```
/usr/local/lib
├── libmimalloc.so -> mimalloc-2.0/libmimalloc.so.2.0
├── mimalloc-2.0
│   ├── cmake
│   │   ├── mimalloc.cmake
│   │   ├── mimalloc-config.cmake
│   │   ├── mimalloc-config-version.cmake
│   │   └── mimalloc-release.cmake
│   ├── include
│   │   ├── mimalloc.h
│   │   ├── mimalloc-new-delete.h
│   │   └── mimalloc-override.h
│   ├── libmimalloc.a
│   ├── libmimalloc.so -> libmimalloc.so.2.0
│   ├── libmimalloc.so.2.0
│   └── mimalloc.o
```

`/usr/local/lib/libmimalloc.so` is a [symbolic link](https://en.wikipedia.org/wiki/Symbolic_link) to the actual library `libmimalloc.so.2.0` inside `mimalloc-2.0`. When linking, the gcc linker flag `-lmimalloc` searches the library paths for a file named `libmimalloc.so`.

At run time, although `libmimalloc.so` was built with its [soname](https://en.wikipedia.org/wiki/Soname) set to `libmimalloc.so.2.0`, which can be checked with:

```shell
/usr/local/lib$ objdump -p mimalloc-2.0/libmimalloc.so.2.0 | grep SONAME
  SONAME               libmimalloc.so.2.0
```

[`ld.so`](https://man7.org/linux/man-pages/man8/ld.so.8.html) does not search the shared libraries under `/usr/local/lib`, so the file cannot be found:

```shell
$ ldd a.out
        linux-vdso.so.1 (0x00007ffcbfb5f000)
        libmimalloc.so.2.0 => not found
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f436e14e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f436e367000)
```

The following approaches can be tried:

1. Specify an `rpath` so the loader knows the location: `gcc -Wl,-rpath,/usr/local/lib/mimalloc-2.0/ myprogram.c -lmimalloc`
2. Preload the library before running the program: `LD_PRELOAD=/usr/local/lib/libmimalloc.so ./a.out`
3. Add the `/usr/local/lib/mimalloc-2.0/` path to `/etc/ld.so.conf` and refresh the cache with `ldconfig`
4. Move `/usr/local/lib/mimalloc-2.0/libmimalloc.so.2.0` directly into `/usr/local/lib` and recreate the `/usr/local/lib/libmimalloc.so` symbolic link.

:::danger
By default Ubuntu does not add `/usr/local/lib` to the linker's search path; it has to be added manually.
:::

:::warning
1. The soname is identical to the real name, and seemingly unused?
2. The correct location for the `.so` files
> Preparing to submit a pull request!
:::

:::info
Fixed on 5/22: [issue](https://github.com/microsoft/mimalloc/issues/399), [commit](https://github.com/microsoft/mimalloc/commit/e2c095fad2a1d506712be0b22616f97cc927c05b)
:::

## Internals

> [mimalloc source code analysis](https://hackmd.io/@hankluo6/mimalloc_source)

## mimalloc-bench

> environment: Linux ip-172-31-53-217 5.4.0-1049-aws #51-Ubuntu SMP Wed May 12 21:13:51 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux

| allocator   | build | description                   |
| ----------- |:-----:| ----------------------------- |
| jemalloc    | :o:   |                               |
| tcmalloc    | :o:   |                               |
| mimalloc    | :o:   |                               |
| TBB malloc  | :o:   |                               |
| hoard       | :x:   | inlining failed               |
| mesh        | :o:   |                               |
| nomesh      | :o:   |                               |
| supermalloc | :x:   | need to use `RTM` instruction |
| snmalloc    | :o:   |                               |
| rpmalloc    | :o:   |                               |
| scalloc     | :x:   | need `-m64` flag              |

:::warning
mesh and nomesh need extra script changes before they build. [[reference](https://github.com/plasma-umass/Mesh/issues/50)]
:::

Building the bench suite needs adjustments for the Arm architecture: adding `-DAPPLE` makes the generated `CMakeLists.txt` skip [alloc-test](https://github.com/daanx/mimalloc-bench/tree/master/bench/alloc-test), which requires SIMD instructions:

```diff=480
phase "build benchmarks"

mkdir -p out/bench
cd out/bench
-cmake ../../bench
+cmake ../../bench -DAPPLE
make
cd ../..
```

lean builds with multiple CPUs by default, which can get stuck (cause unknown); building with a single CPU works:

```diff=399
pushd $devdir
if test -d lean; then
  echo "$devdir/lean already exists; no need to git clone"
else
  git clone https://github.com/leanprover/lean
fi
cd lean
git checkout v3.4.1
mkdir -p out/release
cd out/release
env CC=gcc CXX="g++ -Wno-exceptions" cmake ../../src -DCUSTOM_ALLOCATORS=OFF
-make -j $procs
+make
popd
```

---

### Profiling `mimalloc` hotspots with [gprof](https://sourceware.org/binutils/docs/gprof/)

Modify `CMakeLists.txt`:

```diff
@@ -174,6 +174,11 @@ if(CMAKE_C_COMPILER_ID MATCHES "AppleClang|Clang|GNU")
   endif()
 endif()

+# Enforce profiling build
+# https://sourceware.org/binutils/docs/gprof/Compiling.html
+list(APPEND mi_cflags -g -pg)
+link_libraries("-pg")
+
 if(CMAKE_C_COMPILER_ID MATCHES "Intel")
   list(APPEND mi_cflags -Wall -fvisibility=hidden)
 endif()
@@ -362,7 +367,7 @@ if (MI_BUILD_TESTS)
     target_compile_definitions(mimalloc-test-stress PRIVATE ${mi_defines})
     target_compile_options(mimalloc-test-stress PRIVATE ${mi_cflags})
     target_include_directories(mimalloc-test-stress PRIVATE include)
-    target_link_libraries(mimalloc-test-stress PRIVATE mimalloc ${mi_libraries})
+    target_link_libraries(mimalloc-test-stress PRIVATE mimalloc-static ${mi_libraries})

     enable_testing()
     add_test(test_api, mimalloc-test-api)
```

Switch to the `out/release` directory and rebuild the project:

```shell
$ cd out/release
$ make clean all
```

Run `mimalloc-test-stress` and wait patiently for it to finish:

```shell
$ ./mimalloc-test-stress
```

The current directory should now contain an extra file, `gmon.out`, which [gprof](https://sourceware.org/binutils/docs/gprof/) can analyze:

```shell
$ gprof mimalloc-test-stress | less
```

Sample output (excerpt):

```
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 67.42      0.91     0.91      221     4.12     6.05  stress
 16.30      1.13     0.22   593350     0.00     0.00  alloc_items
  5.93      1.21     0.08   649070     0.00     0.00  mi_calloc
  1.48      1.23     0.02  2240319     0.00     0.00  free
  1.48      1.25     0.02   130442     0.00     0.00  _mi_free_block_mt
  1.48      1.27     0.02   106897     0.00     0.00  mi_free_generic
  1.48      1.29     0.02    46815     0.00     0.00  _mi_page_free_collect
  1.48      1.31     0.02    22973     0.00     0.00  _mi_segment_page_alloc
  0.74      1.32     0.01   653771     0.00     0.00  malloc_size
  0.74      1.33     0.01    30876     0.00     0.00  _mi_malloc_generic
  0.74      1.34     0.01      633     0.02     0.02  mi_segment_init
```

### Profiling with perf

:::warning
:warning: The `-pg` compile option above is not needed here; rebuild accordingly
:::

Run and collect profiling data:

```shell
$ perf record --call-graph dwarf -- ./mimalloc-test-stress
```

:::info
:information_source: `sudo` may be required
:::

Analyze the recorded profile:

```shell
$ perf report -g graph --no-children
```

Needs further investigation:

* `mi_page_free_list_extend` (`src/page.c`)
* `malloc_usable_size` $\to$ `mi_usable_size` (`src/alloc.c`)
* `mi_page_queue_find_free_ex` (`src/page.c`)
* `_mi_page_free_collect` (`src/page.c`)

## Testing with the memory allocation behavior of real Android applications

[allocbench](https://github.com/kdrag0n/allocbench) is a set of tools to benchmark memory allocators realistically using techniques from Android.

Modify `reply.cc`, changing the `DEBUG` value on line 15 to `false`, then compile:

```shell
$ g++ -o reply reply.cpp
```

Unpack the trace files:

```shell
$ pushd traces
$ tar Jxvf android.tar.xz
$ popd
```

Modify `run.sh`: after `declare -A allocators`, keep only `[mimalloc]` and `[glibc]`, and fix the path to `libmimalloc.so`.

Run:

```shell
$ ./run.sh
```

Expected output:

```
===========================
Allocator: glibc

====================
Trace: angry_birds2
Loading events...
Loaded 10080312 events.
Running events...
Run 0: 637.656 ms
Run 1: 457.265 ms
Run 2: 525.347 ms
Run 3: 574.06 ms
```

---

## A `madvise(MADV_DONTNEED)` use case

After `free()` is called, the specified memory region is reclaimed and moved onto a free list; if the region has been returned to the system, accessing that address again triggers a segmentation fault. If instead we only want to release the *contents* of a memory range, we can use `madvise(addr, len, MADV_DONTNEED)`: the memory backing `addr` is released, but the address range itself is not reclaimed by the Linux kernel onto a free list. In other words, accessing the address does not cause a segmentation fault, so the address can be reused.

The test program:

```cpp
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    printf("++++ before malloc:\n");
    system("free -h");

    long len = 1024 * 1024 * 1024; // 1 GB
    char *buffer;
    posix_memalign((void **) &buffer, sysconf(_SC_PAGESIZE), len);
    memset(buffer, 'M', len);
    printf("buffer[0]: %c\n", buffer[0]);

    printf("\n\n");
    printf("++++ after malloc:\n");
    system("free -h");

    int ret = madvise(buffer, len, MADV_DONTNEED);
    if (ret == -1) {
        printf("madvise ret: %d\n", ret);
        printf("err: %s\n", strerror(errno));
    }

    printf("\n\n");
    printf("++++ after madvise:\n");
    system("free -h");
    printf("buffer[0]: %d\n", buffer[0]);
}
```

Sample output:

```
++++ before malloc:
              total        used        free      shared  buff/cache   available
Mem:          110Gi       4.0Gi        40Gi       5.0Mi        65Gi       105Gi
Swap:          64Gi          0B        64Gi
buffer[0]: M


++++ after malloc:
              total        used        free      shared  buff/cache   available
Mem:          110Gi       5.0Gi        39Gi       5.0Mi        65Gi       104Gi
Swap:          64Gi          0B        64Gi


++++ after madvise:
              total        used        free      shared  buff/cache   available
Mem:          110Gi       4.0Gi        40Gi       5.0Mi        65Gi       105Gi
Swap:          64Gi          0B        64Gi
buffer[0]: 0
```

Watch the `used` column:

* Before the `malloc`, the system uses `4.0` GiB
* After the `malloc`, the system uses `5.0` GiB, i.e. `1.0` GiB more
* After the `madvise`, usage returns to `4.0` GiB, the same as before the `malloc`

Note: `posix_memalign((void **)&buffer, sysconf(_SC_PAGESIZE), len)` allocates page-aligned memory; this is to suit `madvise`, which only accepts page-aligned addresses.

---

## Improvement opportunities

### `getcpu`

To support NUMA, mimalloc frequently issues the [getcpu](https://man7.org/linux/man-pages/man2/getcpu.2.html) system call. This can be accelerated with restartable sequences (rseq); see [The 5-year journey to bring restartable sequences to
Linux](https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/)

Google's TCMalloc also uses rseq; see:
* [Restartable Sequence Mechanism for TCMalloc](https://google.github.io/tcmalloc/rseq.html)
* [TCMalloc : Thread-Caching Malloc](https://google.github.io/tcmalloc/design.html)

mimalloc [#315](https://github.com/microsoft/mimalloc/issues/315) discusses adopting rseq, but no code exists yet (==a good opportunity==)

rseq currently only exposes the CPU id, while mimalloc calls `getcpu` precisely when it needs the NUMA node id, so the CPU id has to be mapped to a node id separately. According to the [NUMA man page](https://linux.die.net/man/3/numa), `numa_node_of_cpu` returns the node id, as in:

```c
// compile with -lnuma
#include <numa.h>
...
cpu = __rseq_abi.cpu_id;
node = numa_node_of_cpu(cpu);
```

The following measures four ways of obtaining the CPU id: glibc's [getcpu](https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/getcpu.c;h=143ebde144a0930f5528ad1dc51e744147097125;hb=refs/heads/release/2.33/master), [getcpu](https://github.com/nlynch-mentor/vdsotest/blob/4ad733fb25364e4afaf4060e57c2429a93b686c5/src/util.c#L153) called directly through the vDSO, the [getcpu](https://man7.org/linux/man-pages/man2/syscall.2.html) system call, and the [CPU id](https://github.com/torvalds/linux/blob/7426cedc7dad67bf3c71ea6cc29ab7822e1a453f/include/uapi/linux/rseq.h#L62) read via rseq:

![](https://i.imgur.com/gVSyy0Q.png)

(the syscall bar takes so long it extends beyond the chart)

The rseq version is the fastest, matching the measurements in [The 5-year journey to bring restartable sequences to Linux](https://www.efficios.com/blog/2019/02/08/linux-restartable-sequences/). glibc's `getcpu` is an extra wrapper over the vDSO (on x86-64), so the two take roughly the same time; glibc's deeper call chain adds a slight overhead.

Timings after adding `numa_node_of_cpu` on top of rseq:

![](https://i.imgur.com/7o5gOXr.png)

With the extra `numa_node_of_cpu` call, rseq becomes slower than glibc's and the vDSO's `getcpu`; the latter two fetch the node id in the same call and need no extra lookup, so they come out ahead.

mimalloc's raw `syscall` can therefore be changed to the vDSO-based call.

> Submitted [PR#424: Use much faster getcpu() via vDSO](https://github.com/microsoft/mimalloc/pull/424)

### `rseq`

Idea: add a segment pool above the segment layer.

```graphviz
graph "" {
  n018 ;
  n018 [label="CPU0"] ;
  n018 -- n035 ;
  n035 -- n019
  n035 [label="pool"] ;
  n019 [label="segment"] ;
  n035 -- n022 ;
  n022 [label="segment"] ;
  n019
  -- n020 ;
  n020 [label="page"] ;
  n019 -- n021 ;
  n021 [label="page"] ;
  n028 [label="page"] ;
  n029 [label="page"] ;
  n022 -- n028
  n022 -- n029
  n023 ;
  n023 [label="CPU1"] ;
  n023 -- n036 ;
  n036 -- n024 ;
  n036 [label="pool"] ;
  n024 [label="segment"] ;
  n036 -- n027 ;
  n027 [label="segment"] ;
  n024 -- n025 ;
  n025 [label="page"] ;
  n024 -- n026 ;
  n026 [label="page"] ;
  n030 [label="page"] ;
  n031 [label="page"] ;
  n027 -- n031
  n027 -- n030
}
```

This should make it possible to remove the atomic operations between segments?

### Using `__builtin_prefetch` on linked lists

mimalloc's `src/page.c`, `src/page-queue.c`, and `src/segment.c` use linked lists, which can be sped up with the CPU's prefetch instructions; see:

* [On lists, cache, algorithms, and microarchitecture](https://paweldziepak.dev/2019/05/02/on-lists-cache-algorithms-and-microarchitecture/)
* [Make your programs run faster by better using the data cache](https://johnysswlab.com/make-your-programs-run-faster-by-better-using-the-data-cache/)

> ```shell
> $ git clone https://github.com/ibogosavljevic/johnysswlab
> $ cd johnysswlab/2020-05-datacaching
> $ make linked_list_runtimes
> $ make linked_list_cache_misses
> ```
> :notes: Use [Cachegrind](https://valgrind.org/docs/manual/cg-manual.html) to analyze the cache benefit

TCMalloc's source code also uses this trick; see [tcmalloc/tcmalloc/internal/linked_list.h](https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linked_list.h):

```cpp
inline void* SLL_Pop(void** list) {
  void* result = *list;
  void* next = SLL_Next(*list);
  *list = next;
  // Prefetching NULL leads to a DTLB miss, thus only prefetch when 'next'
  // is not NULL.
#if defined(__GNUC__)
  if (next) {
    __builtin_prefetch(next, 0, 3);
  }
#endif
  return result;
}
```

Simply adding `__builtin_prefetch` to the free list had no measurable effect on the cache.

:::info
TODO
1. The lists may be so short that prefetching does not matter
2. The hardware prefetcher may already be doing this?
ref: [The problem with prefetch](https://lwn.net/Articles/444336/)
:::

### Reducing page faults with [MADV_POPULATE](https://lwn.net/Articles/846501/)

Taking SQLite as an example, the performance impact of [introducing MAP_POPULATE](https://github.com/zidootech/zidoo-kodi-15.1/blob/cb5019a89613c5d9a1df93a2e38d8ce619dbad4c/tools/depends/target/sqlite3/sqlite3.c.patch):

|        | Before | After |
|:------:|:------:|:-----:|
| Mean   | 2.550  | 2.375 |
| StdDev | 0.021  | 0.043 |

> The time taken to issue "SELECT * FROM tvshowview" in seconds

[isoalloc](https://github.com/struct/isoalloc) also uses `MAP_POPULATE`.

Current mimalloc measurement:

```shell
$ perf stat -e faults ./mimalloc-test-stress
```

Sample output:

```
           335,144      faults

       1.331217221 seconds time elapsed

      16.206853000 seconds user
       2.566793000 seconds sys
```

mimalloc allocates memory mainly through the `mi_unix_mmap` function; try adding `MAP_POPULATE` to it:

```diff
static void* mi_unix_mmap(void* addr, size_t size, size_t try_alignment, int protect_flags, bool large_only, bool allow_large, bool* is_large) {
  void* p = NULL;
  #if !defined(MAP_ANONYMOUS)
  #define MAP_ANONYMOUS  MAP_ANON
  #endif
  #if !defined(MAP_NORESERVE)
  #define MAP_NORESERVE  0
  #endif
- int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
+ int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_POPULATE;
  ...
  if ((large_only || use_large_os_page(size, try_alignment)) && allow_large) {
    static _Atomic(uintptr_t) large_page_try_ok; // = 0;
    uintptr_t try_ok = mi_atomic_load_acquire(&large_page_try_ok);
    if (!large_only && try_ok > 0) {
      // If the OS is not configured for large OS pages, or the user does not have
      // enough permission, the `mmap` will always fail (but it might also fail for other reasons).
      // Therefore, once a large page allocation failed, we don't try again for `large_page_try_ok` times
      // to avoid too many failing calls to mmap.
      mi_atomic_cas_strong_acq_rel(&large_page_try_ok, &try_ok, try_ok - 1);
    }
    else {
      int lflags = flags & ~MAP_NORESERVE;  // using NORESERVE on huge pages seems to fail on Linux
+     lflags = lflags & ~MAP_POPULATE;      // don't use MAP_POPULATE on huge pages
  ...
```

Tested on mimalloc-bench:

:::spoiler mimalloc v1

| | major-faults | major-faults (populate) | minor-faults | minor-faults (populate) |
|:--------------:|:------------------:|:-----------------:|:------------------:|:-----------------:|
| cfrac | 0 | 0 | 369 | 65706 |
| espresso | 0 | 0 | 684 | 65693 |
| barnes | 0 | 0 | 15612 | 185511 |
| leanN | 3 | 0 | 122108 | 197477 |
| alloc-test1 | 1 | 0 | 2566 | 65787 |
| alloc-testN | 0 | 0 | 2971 | 65812 |
| larsonN | 1 | 0 | 17958 | 68939 |
| sh6benchN | 0 | 0 | 53165 | 65731 |
| sh8benchN | 0 | 0 | 31688 | 87237 |
| xmalloc-testN | 1 | 0 | 16094 | 65724 |
| cache-scratch1 | 0 | 0 | 230 | 65758 |
| cache-scratchN | 0 | 0 | 271 | 65773 |
| mstressN | 0 | 0 | 187768 | 328171 |
| rptestN | 1 | 0 | 41804 | 65814 |

:::

:::spoiler mimalloc v2

| | major-faults | major-faults (populate) | minor-faults | minor-faults (populate) |
|:--------------:|:------------------:|:-----------------:|:------------------:|:-----------------:|
| cfrac | 0 | 0 | 371 | 2218 |
| espresso | 0 | 0 | 789 | 2206 |
| barnes | 0 | 0 | 15618 | 168386 |
| leanN | 0 | 0 | 125934 | 153145 |
| alloc-test1 | 0 | 0 | 2575 | 6403 |
| alloc-testN | 0 | 0 | 3002 | 16677 |
| larsonN | 0 | 0 | 19182 | 27134 |
| sh6benchN | 0 | 0 | 53427 | 57567 |
| sh8benchN | 0 | 0 | 38250 | 75987 |
| xmalloc-testN | 0 | 0 | 19217 | 20696 |
| cache-scratch1 | 0 | 0 | 240 | 4326 |
| cache-scratchN | 0 | 0 | 283 | 16643 |
| mstressN | 0 | 0 | 268846 | 534053 |
| rptestN | 0 | 0 | 51005 | 94541 |

:::

In mimalloc v1.x the number of major page faults decreases, but minor page faults increase along with it; the same happens in v2.x.

mimalloc-bench sets `MIMALLOC_EAGER_COMMIT_DELAY=0`; restoring `MIMALLOC_EAGER_COMMIT_DELAY` to its default value of 1 brings the minor page fault counts back to normal.

Testing mimalloc-test-stress with `MAP_POPULATE`, however, minor page faults drop sharply. (Perhaps an option could be added so the user can decide whether to enable `MAP_POPULATE`.)

```shell
perf stat -e page-faults env MIMALLOC_EAGER_COMMIT_DELAY={0,1} ./mimalloc-test-stress 32 100 50
```

| eager_commit | populate | page-faults |
|:------------:|:--------:|:-----------:|
| 0 | No | 2,955,325 |
| 1 | No | 2,943,560 |
| 0 | Yes | 68,053 |
| 1 | Yes | 94,554 |

> [IsoAlloc Performance](https://github.com/struct/isoalloc/blob/master/PERFORMANCE.md) notes:
>
> All bitmaps pages allocated with mmap are passed to the madvise syscall with the advice arguments `MADV_WILLNEED` and `MADV_SEQUENTIAL`. All user pages allocated with mmap are passed to the madvise syscall with the advice arguments `MADV_WILLNEED` and `MADV_RANDOM`. By default both of these mappings are created with `MAP_POPULATE` which instructs the kernel to pre-populate the page tables which reduces page faults and results in better performance. ... The performance of short lived programs will benefit from `PRE_POPULATE_PAGES` being disabled.

### Reducing memory with `MADV_DONTNEED`

Following the guard page implementation in [isoalloc](https://github.com/struct/isoalloc/blob/fc22dcafa13dcda6b9be4561b03857d9b9fe87e8/src/iso_alloc.c#L267): since a guard page's contents are never used by the program, they can be evicted from memory with `MADV_DONTNEED`. The corresponding part of mimalloc is in `mi_segment_init`:

```c
// set up guard pages
size_t guard_slices = 0;
if (MI_SECURE>0) {
  // in secure mode, we set up a protected page in between the segment info
  // and the page data
  ...
  _mi_os_protect((uint8_t*)segment + mi_segment_info_size(segment) - os_pagesize, os_pagesize);
  ...
  _mi_os_protect(end, os_pagesize);
  ...
}
```

(In mimalloc 1.x this lives in `mi_segment_protect` instead.)

Add `MADV_DONTNEED` in `mi_os_protectx`:

```diff
// Protect a region in memory to be not accessible.
static bool mi_os_protectx(void* addr, size_t size, bool protect) {
  ...
+#ifdef MADV_DONTNEED
+  int madv_err = madvise(start, csize, MADV_DONTNEED);
+  if (madv_err != 0) {
+    _mi_warning_message("madvise error: start: %p, csize: 0x%x, err: %i\n", start, csize, madv_err);
+  }
+#endif
  return (err == 0);
}
```

## TODO

* huge page (hugetlbpage)
    * https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
* NUMA
    * mi_reserve_huge_os_pages_interleave
    * http://www.cc.ntu.edu.tw/chinese/epaper/0015/20101220_1508.htm
    * http://www.aspphp.online/shujuku/sqlserversjk/gysqlserver/201701/19789.html
* free list
    * CSAPP Chapter 9
    * skip list
    * XOR linked list
* PRNG
    * src/random.c
    * ChaCha20
* madvise
    * "Accessing this memory with proper alignment will minimize CPU cache flushes."