mimalloc

All tests below were run with version 2.0.1.

dev: development branch for mimalloc v1.

dev-slice: development branch for mimalloc v2 with a new algorithm for managing internal mimalloc pages.

Installation

Build the project following the instructions on GitHub:

mkdir -p out/release
cd out/release
cmake ../..
make

Then install it system-wide with sudo make install, which prints:

Install the project...
-- Install configuration: "Release"
-- Installing: /usr/local/lib/mimalloc-2.0/libmimalloc.so.2.0
-- Installing: /usr/local/lib/mimalloc-2.0/libmimalloc.so
-- Installing: /usr/local/lib/mimalloc-2.0/cmake/mimalloc.cmake
-- Installing: /usr/local/lib/mimalloc-2.0/cmake/mimalloc-release.cmake
-- Installing: /usr/local/lib/mimalloc-2.0/libmimalloc.a
-- Installing: /usr/local/lib/mimalloc-2.0/include/mimalloc.h
-- Installing: /usr/local/lib/mimalloc-2.0/include/mimalloc-override.h
-- Installing: /usr/local/lib/mimalloc-2.0/include/mimalloc-new-delete.h
-- Installing: /usr/local/lib/mimalloc-2.0/cmake/mimalloc-config.cmake
-- Installing: /usr/local/lib/mimalloc-2.0/cmake/mimalloc-config-version.cmake
-- Symbolic link: /usr/local/lib/libmimalloc.so -> mimalloc-2.0/libmimalloc.so.2.0
-- Installing: /usr/local/lib/mimalloc-2.0/mimalloc.o

However, README.md states:

sudo make install (install the library and header files in /usr/local/lib and /usr/local/include)

That is, the header files should be placed in /usr/local/include, not in /usr/local/lib/mimalloc-${version}/include.

The following code can be found in CMakeLists.txt:

if (MI_INSTALL_TOPLEVEL)
  set(mi_install_dir "${CMAKE_INSTALL_PREFIX}")
else()
  set(mi_install_dir "${CMAKE_INSTALL_PREFIX}/lib/mimalloc-${mi_version}")
endif()

CMAKE_INSTALL_PREFIX defaults to /usr/local on Unix systems, so when MI_INSTALL_TOPLEVEL is not set (the default), cmake sets the install path to /usr/local/lib/mimalloc-${mi_version} and does not put the header files in /usr/local/include. As a result, header includes in user programs fail to resolve.

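Issue 223 also discusses this problem. The developer argued that a top-level install would cause version confusion, since /usr/local/include/mimalloc.h carries no version. This could be handled with some extra logic in CMakeLists.txt, as elbaro's comment suggests: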

…
.h - it's good to have versioned headers if you want, but it should be under /usr/include/, not /usr/lib/.

I recommend to have versioned filenames and also unversioned symlink.

e.g.
#include <mimalloc.h>  // symlink
#include <mimalloc-1.5/mimalloc.h> // if want a specific version
#include <mimalloc/mimalloc-1.5.h> // or
#include <mimalloc.1.5.h> // or

According to the Linux Filesystem Hierarchy:

/usr/include
The directory for 'header files', needed for compiling user space source code.

By the same reasoning, /usr/local/include is where locally installed header files should go.

After installation, the layout looks like this:

/usr/local/lib
├── libmimalloc.so -> mimalloc-2.0/libmimalloc.so.2.0
├── mimalloc-2.0
│   ├── cmake
│   │   ├── mimalloc.cmake
│   │   ├── mimalloc-config.cmake
│   │   ├── mimalloc-config-version.cmake
│   │   └── mimalloc-release.cmake
│   ├── include
│   │   ├── mimalloc.h
│   │   ├── mimalloc-new-delete.h
│   │   └── mimalloc-override.h
│   ├── libmimalloc.a
│   ├── libmimalloc.so -> libmimalloc.so.2.0
│   ├── libmimalloc.so.2.0
│   └── mimalloc.o

Here /usr/local/lib/libmimalloc.so is a symbolic link to the actual library libmimalloc.so.2.0 inside mimalloc-2.0. When linking against the library, the gcc linker option -lmimalloc searches the library path for a libmimalloc.so file to link against.

At run time, the loader looks the library up by its soname. libmimalloc.so was built with the soname libmimalloc.so.2.0, which can be checked with the following command:

$> /usr/local/lib$ objdump -p mimalloc-2.0/libmimalloc.so.2.0 | grep SONAME
   SONAME               libmimalloc.so.2.0

However, the file with that name lives in /usr/local/lib/mimalloc-2.0, which ld.so does not search, so the library is not found:

$> ldd a.out 
        linux-vdso.so.1 (0x00007ffcbfb5f000)
        libmimalloc.so.2.0 => not found
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f436e14e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f436e367000)

The following approaches can be tried:

- Specify an rpath so the dynamic linker knows the location: gcc -Wl,-rpath,/usr/local/lib/mimalloc-2.0/ myprogram.c -lmimalloc
- Preload the library before running the program: LD_PRELOAD=/usr/local/lib/libmimalloc.so ./a.out
- Add /usr/local/lib/mimalloc-2.0/ to /etc/ld.so.conf and refresh the cache with ldconfig
- Move /usr/local/lib/mimalloc-2.0/libmimalloc.so.2.0 directly into /usr/local/lib and recreate the /usr/local/lib/libmimalloc.so symbolic link

Ubuntu may not include /usr/local/lib in the dynamic linker's search path by default, in which case it must be added manually.

The soname and the real name are identical, and the soname does not seem to be used?
Correct location for the .so file

Preparing to submit a pull request!

Fixed on 5/22:
issue, commit

How it works

mimalloc source code analysis

mimalloc-bench

environment: Linux ip-172-31-53-217 5.4.0-1049-aws #51-Ubuntu SMP Wed May 12 21:13:51 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux

allocator	build	description
jemalloc	:o:
tcmalloc	:o:
mimalloc	:o:
TBB malloc	:o:
hoard	:x:	inlining failed
mesh	:o:
nomesh	:o:
supermalloc	:x:	need to use `RTM` instruction
snmalloc	:o:
rpmalloc	:o:
scalloc	:x:	need `-m64` flag

mesh and nomesh need additional changes to the script before they build.
[reference]

Building the benchmarks requires targeting the Arm architecture explicitly. Adding -DAPPLE makes the generated build skip alloc-test, which relies on SIMD instructions:

phase "build benchmarks"

  mkdir -p out/bench
  cd out/bench
- cmake ../../bench
+ cmake ../../bench -DAPPLE
  make
  cd ../..

lean is built on multiple CPUs by default, which hangs (cause unknown); building on a single CPU works:

  pushd $devdir
  if test -d lean; then
    echo "$devdir/lean already exists; no need to git clone"
  else
    git clone https://github.com/leanprover/lean
  fi
  cd lean
  git checkout v3.4.1
  mkdir -p out/release
  cd out/release
  env CC=gcc CXX="g++ -Wno-exceptions" cmake ../../src -DCUSTOM_ALLOCATORS=OFF
- make -j $procs
+ make
  popd

Profiling `mimalloc` hot spots with gprof

Modify CMakeLists.txt:

@@ -174,6 +174,11 @@ if(CMAKE_C_COMPILER_ID MATCHES "AppleClang|Clang|GNU")
   endif()
 endif()
 
+# Enforce profiling build
+# https://sourceware.org/binutils/docs/gprof/Compiling.html
+list(APPEND mi_cflags -g -pg)
+link_libraries("-pg")
+
 if(CMAKE_C_COMPILER_ID MATCHES "Intel")
   list(APPEND mi_cflags -Wall -fvisibility=hidden)
 endif()
@@ -362,7 +367,7 @@ if (MI_BUILD_TESTS)
   target_compile_definitions(mimalloc-test-stress PRIVATE ${mi_defines})
   target_compile_options(mimalloc-test-stress PRIVATE ${mi_cflags})
   target_include_directories(mimalloc-test-stress PRIVATE include)
-  target_link_libraries(mimalloc-test-stress PRIVATE mimalloc ${mi_libraries})
+  target_link_libraries(mimalloc-test-stress PRIVATE mimalloc-static ${mi_libraries})
 
   enable_testing()
   add_test(test_api, mimalloc-test-api)

Switch to the out/release directory and rebuild the project:

$ cd out/release
$ make clean all

Run mimalloc-test-stress and wait patiently for it to finish:

$ ./mimalloc-test-stress

The current directory should now contain a new file, gmon.out, which gprof can then analyze:

$ gprof mimalloc-test-stress | less

Sample output (partial):

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 67.42      0.91     0.91      221     4.12     6.05  stress
 16.30      1.13     0.22   593350     0.00     0.00  alloc_items
  5.93      1.21     0.08   649070     0.00     0.00  mi_calloc
  1.48      1.23     0.02  2240319     0.00     0.00  free
  1.48      1.25     0.02   130442     0.00     0.00  _mi_free_block_mt
  1.48      1.27     0.02   106897     0.00     0.00  mi_free_generic
  1.48      1.29     0.02    46815     0.00     0.00  _mi_page_free_collect
  1.48      1.31     0.02    22973     0.00     0.00  _mi_segment_page_alloc
  0.74      1.32     0.01   653771     0.00     0.00  malloc_size
  0.74      1.33     0.01    30876     0.00     0.00  _mi_malloc_generic
  0.74      1.34     0.01      633     0.02     0.02  mi_segment_init

Profiling with perf

:warning: The -pg compile option above is not needed here; rebuild as appropriate.

Run the program and collect data:

$ perf record --call-graph dwarf -- ./mimalloc-test-stress

:information_source: sudo may be required

Analyze the data collected by perf:

$ perf report -g graph --no-children

Worth investigating further:

mi_page_free_list_extend (src/page.c)
malloc_usable_size
$\to$ mi_usable_size (src/alloc.c)
mi_page_queue_find_free_ex (src/page.c)
_mi_page_free_collect (src/page.c)

Testing with the memory allocation behavior of real Android applications

allocbench is a set of tools to benchmark memory allocators realistically using techniques from Android.

Edit reply.cc, change the value of DEBUG on line 15 to false, then compile:

$ g++ -o reply reply.cpp

Extract the trace files:

$ pushd traces
$ tar Jxvf android.tar.xz
$ popd

Edit run.sh: after declare -A allocators, keep only [mimalloc] and [glibc], and fix the path to libmimalloc.so.

Run:

$ ./run.sh

Expected output:

===========================
Allocator: glibc
====================
Trace: angry_birds2
Loading events...
Loaded 10080312 events.

Running events...
  Run 0: 637.656 ms
  Run 1: 457.265 ms
  Run 2: 525.347 ms
  Run 3: 574.06 ms

`madvise(MADV_DONTNEED)` use case

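After free() is called, the given memory is reclaimed by the allocator and placed on its free list; accessing that address afterwards is undefined behavior and may trigger a segmentation fault. If we only want to release the contents of a memory range, we can use madvise(addr, len, MADV_DONTNEED) instead: the physical memory behind addr is released, but the address range itself stays mapped rather than being reclaimed by the Linux kernel and put on a free list. Accessing the address afterwards therefore does not cause a segmentation fault, so the address can be reused.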

The test program:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <sys/mman.h>
#include <unistd.h>

int main()
{
    printf("++++ before malloc:\n");
    system("free -h");

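    size_t len = 1024UL * 1024 * 1024;  // 1 GiB
    char *buffer;

    // madvise() needs a page-aligned address, so allocate page-aligned memory
    if (posix_memalign((void **) &buffer, sysconf(_SC_PAGESIZE), len) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }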

    memset(buffer, 'M', len);
    printf("buffer[0]: %c\n", buffer[0]);

    printf("\n\n");
    printf("++++ after malloc:\n");
    system("free -h");

    int ret = madvise(buffer, len, MADV_DONTNEED);
    if (ret == -1) {
        printf("madvise ret: %d\n", ret);
        printf("err: %s\n", strerror(errno));
    }

    printf("\n\n");
    printf("++++ after madvise:\n");
    system("free -h");

    printf("buffer[0]: %d\n", buffer[0]);
}

Sample output:

++++ before malloc:
              total        used        free      shared  buff/cache   available
Mem:          110Gi       4.0Gi        40Gi       5.0Mi        65Gi       105Gi
Swap:          64Gi          0B        64Gi
buffer[0]: M


++++ after malloc:
              total        used        free      shared  buff/cache   available
Mem:          110Gi       5.0Gi        39Gi       5.0Mi        65Gi       104Gi
Swap:          64Gi          0B        64Gi


++++ after madvise:
              total        used        free      shared  buff/cache   available
Mem:          110Gi       4.0Gi        40Gi       5.0Mi        65Gi       105Gi
Swap:          64Gi          0B        64Gi
buffer[0]: 0

Look at the used column:

- Before malloc is called, the system uses 4.0 GiB
- After malloc, the system uses 5.0 GiB, i.e. 1.0 GiB more
- After madvise, usage returns to 4.0 GiB, the same as before malloc was called

Note: the memory allocated by posix_memalign((void **)&buffer, sysconf(_SC_PAGESIZE), len) is page-aligned. This is required by madvise, which only accepts page-aligned addresses.
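The behavior described in this section can also be checked on a single page. The following is a minimal sketch, not part of the original test: it assumes Linux, uses mmap directly instead of posix_memalign (mmap always returns page-aligned memory), and the function name dontneed_zeroes_page is made up for illustration.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the first byte of a private anonymous
 * mapping reads back as 0 after madvise(MADV_DONTNEED), 0 if it does not,
 * and -1 on error.
 */
int dontneed_zeroes_page(void)
{
    size_t len = (size_t) sysconf(_SC_PAGESIZE);

    /* mmap returns page-aligned memory, as madvise requires */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 'M', len);    /* make the page resident and dirty */

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    /* The address is still mapped: this read faults in a fresh zero page */
    int zeroed = (p[0] == 0);

    munmap(p, len);
    return zeroed;
}
```

On Linux, a private anonymous mapping is expected to read back as zero here, which matches the final buffer[0]: 0 line in the output above.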

Opportunities for improvement

`getcpu`

To support NUMA, mimalloc frequently invokes the getcpu system call. This can be sped up with restartable sequences (rseq); see The 5-year journey to bring restartable sequences to Linux.

TCMalloc, developed by Google, also uses rseq.

mimalloc #315 discusses introducing rseq, but there is no code yet (a good opportunity).

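rseq currently only exposes the cpu id, whereas mimalloc calls getcpu when it needs the NUMA node id, so the cpu id must additionally be mapped to a node id. According to the NUMA man page, numa_node_of_cpu returns the node id, as in the following snippet: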

// compile with -lnuma

#include <numa.h>
...
cpu = __rseq_abi.cpu_id;
node = numa_node_of_cpu(cpu);

The following compares the performance of four ways to obtain the cpu id: glibc's getcpu, getcpu called directly through the vDSO, getcpu invoked as a system call, and the cpu id read from rseq:

(The syscall variant takes so long that it falls outside the range of the chart.)

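The rseq version is the fastest, consistent with the experimental data in The 5-year journey to bring restartable sequences to Linux. glibc's getcpu is an extra layer of wrapping around the vDSO (on x86-64), so the two take about the same time; the glibc version is slightly slower because of the additional function-call layers.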

Timing comparison after adding numa_node_of_cpu to rseq:

rseq turns out slower than glibc's and the vDSO's getcpu because it has to call numa_node_of_cpu. The latter two obtain the node id in the same call, so they need no extra lookup and come out ahead.
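For reference, getcpu returns the cpu id and the node id together, which is why no separate numa_node_of_cpu lookup is needed on that path. The following is a minimal sketch of the slowest variant in the comparison, the raw system call, plus a simple timing loop. It assumes Linux; the helper names cpu_and_node and getcpu_ns_per_call are made up for illustration and are not the benchmark code used for the measurements in this note.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: fetch the current CPU and NUMA node with one raw
 * getcpu system call. Returns 0 on success, -1 on failure. */
int cpu_and_node(unsigned *cpu, unsigned *node)
{
    return (int) syscall(SYS_getcpu, cpu, node, NULL);
}

/* Hypothetical helper: average cost of one call in nanoseconds,
 * measured over `iters` iterations. */
double getcpu_ns_per_call(long iters)
{
    struct timespec t0, t1;
    unsigned cpu = 0, node = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        cpu_and_node(&cpu, &node);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double) iters;
}
```

Swapping the body of cpu_and_node for the glibc wrapper or a vDSO call would give the other variants; the absolute numbers depend on the machine and kernel.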

The syscall in mimalloc can therefore be replaced with a call through the vDSO.

Submitted as PR#424: Use much faster getcpu() via vDSO

`rseq`

Idea: add a segment pool on top of segments.

That way the atomic operations between segments could perhaps be removed?

Using `__builtin_prefetch` on linked lists

mimalloc's src/page.c, src/page-queue.c, and src/segment.c use linked lists, which can be sped up with the CPU's prefetch instructions; see:

On lists, cache, algorithms, and microarchitecture

Make your programs run faster by better using the data cache

$ git clone https://github.com/ibogosavljevic/johnysswlab
$ cd johnysswlab/2020-05-datacaching
$ make linked_list_runtimes
$ make linked_list_cache_misses

:notes: Use Cachegrind to analyze the cache benefit

The TCMalloc source code also uses this technique; see tcmalloc/tcmalloc/internal/linked_list.h:

inline void* SLL_Pop(void** list) {
  void* result = *list;
  void* next = SLL_Next(*list);
  *list = next;
  // Prefetching NULL leads to a DTLB miss, thus only prefetch when 'next'
  // is not NULL.
#if defined(__GNUC__)
  if (next) {
    __builtin_prefetch(next, 0, 3);
  }
#endif
  return result;
}

Simply adding __builtin_prefetch to the free list had no effect on cache behavior.
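For a list traversal rather than a pop, the same idea can be sketched as follows. This is an illustrative example only, not mimalloc code: struct node and sum_list_prefetch are hypothetical names, and whether the prefetch helps depends on list length and on the hardware prefetcher.

```c
#include <stddef.h>

/* Hypothetical list node, for illustration only */
struct node {
    struct node *next;
    long value;
};

/* Sum a singly linked list, prefetching the next node while the
 * current one is processed. */
long sum_list_prefetch(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
#if defined(__GNUC__)
        /* Only prefetch a non-NULL successor, as in the TCMalloc snippet */
        if (n->next)
            __builtin_prefetch(n->next, 0 /* read */, 3 /* high locality */);
#endif
        sum += n->value;
    }
    return sum;
}
```

With very little work per node, as here, the prefetch is issued only just before the node is needed, which may be one reason a naive prefetch shows no measurable gain.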

TODO

Perhaps the lists are short, so prefetching gains little
Does the hardware already prefetch?

ref: The problem with prefetch

Reducing page faults with MAP_POPULATE

Taking SQLite as an example, the performance impact of introducing MAP_POPULATE:

	Before	After
Mean	2.550	2.375
StdDev	0.021	0.043

The time taken to issue "SELECT * FROM tvshowview" in seconds

isoalloc also uses MAP_POPULATE

Current mimalloc measurement:

$ perf stat -e faults ./mimalloc-test-stress

Sample output:

           33,5144      faults
       1.331217221 seconds time elapsed
      16.206853000 seconds user
       2.566793000 seconds sys

mimalloc allocates memory mainly through the mi_unix_mmap function; try adding MAP_POPULATE to it:

static void* mi_unix_mmap(void* addr, size_t size, size_t try_alignment, int protect_flags, bool large_only, bool allow_large, bool* is_large) {
    void* p = NULL;
    #if !defined(MAP_ANONYMOUS)
    #define MAP_ANONYMOUS  MAP_ANON
    #endif
    #if !defined(MAP_NORESERVE)
    #define MAP_NORESERVE  0
    #endif
-   int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
+   int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_POPULATE;
    ...
    if ((large_only || use_large_os_page(size, try_alignment)) && allow_large) {
        static _Atomic(uintptr_t) large_page_try_ok; // = 0;
        uintptr_t try_ok = mi_atomic_load_acquire(&large_page_try_ok);
        if (!large_only && try_ok > 0) {
            // If the OS is not configured for large OS pages, or the user does not have
            // enough permission, the `mmap` will always fail (but it might also fail for other reasons).
            // Therefore, once a large page allocation failed, we don't try again for `large_page_try_ok` times
            // to avoid too many failing calls to mmap.
            mi_atomic_cas_strong_acq_rel(&large_page_try_ok, &try_ok, try_ok - 1);
        }
        else {
            int lflags = flags & ~MAP_NORESERVE;  // using NORESERVE on huge pages seems to fail on Linux
+           lflags = lflags & ~MAP_POPULATE; // don't use MAP_POPULATE on huge pages
    ...

Tests on mimalloc-bench:

mimalloc v1

	major-faults	minor-faults	minor-faults (populate)
cfrac	0	369	65706
espresso	0	684	65693
barnes	0	15612	185511
leanN	3	122108	197477
alloc-test1	1	2566	65787
alloc-testN	0	2971	65812
larsonN	1	17958	68939
sh6benchN	0	53165	65731
sh8benchN	0	31688	87237
xmalloc-testN	1	16094	65724
cache-scratch1	0	230	65758
cache-scratchN	0	271	65773
mstressN	0	187768	328171
rptestN	1	41804	65814

mimalloc v2

	minor-faults	minor-faults (populate)
cfrac	371	2218
espresso	789	2206
barnes	15618	168386
leanN	125934	153145
alloc-test1	2575	6403
alloc-testN	3002	16677
larsonN	19182	27134
sh6benchN	53427	57567
sh8benchN	38250	75987
xmalloc-testN	19217	20696
cache-scratch1	240	4326
cache-scratchN	283	16643
mstressN	268846	534053
rptestN	51005	94541

In mimalloc v1.x the number of major page faults drops, but minor page faults increase at the same time; the same happens in v2.x.

mimalloc-bench sets MIMALLOC_EAGER_COMMIT_DELAY=0. If MIMALLOC_EAGER_COMMIT_DELAY is changed back to its default value of 1, the minor page fault counts return to normal.

When mimalloc-test-stress is tested with MAP_POPULATE, however, minor page faults drop substantially. (Perhaps an option could be added to let the user decide whether to enable MAP_POPULATE.)

perf stat -e page-faults env MIMALLOC_EAGER_COMMIT_DELAY={0,1} ./mimalloc-test-stress 32 100 50

eager_commit	populate	page-faults
0	No	2,955,325
1	No	2,943,560
0	Yes	68,053
1	Yes	94,554
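To see the effect of MAP_POPULATE in isolation from mimalloc, the minor faults caused by first-touching a mapping can be counted with getrusage. The following is a minimal sketch assuming Linux; faults_while_touching is a made-up helper name, and the numbers it returns are not comparable to the benchmark tables above.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `pages` anonymous pages, optionally with
 * MAP_POPULATE, write one byte to each page, and return how many minor
 * faults the writes caused (-1 on error).
 */
long faults_while_touching(size_t pages, int populate)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    size_t len = pages * page;
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_POPULATE
    if (populate)
        flags |= MAP_POPULATE;   /* pre-fault the pages inside mmap() */
#endif

    volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    for (size_t i = 0; i < pages; i++)
        p[i * page] = 1;         /* first touch of each page */
    getrusage(RUSAGE_SELF, &after);

    munmap((void *) p, len);
    return after.ru_minflt - before.ru_minflt;
}
```

Calling it with populate set to 0 and then 1 on the same page count would be expected to show far fewer faults in the second case, because the faults are paid up front inside mmap instead. That up-front cost is wasted for pages the program never touches, which fits the IsoAlloc remark that short-lived programs benefit from disabling pre-population.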

IsoAlloc Performance notes:

All bitmaps pages allocated with mmap are passed to the madvise syscall with the advice arguments MADV_WILLNEED and MADV_SEQUENTIAL. All user pages allocated with mmap are passed to the madvise syscall with the advice arguments MADV_WILLNEED and MADV_RANDOM. By default both of these mappings are created with MAP_POPULATE which instructs the kernel to pre-populate the page tables which reduces page faults and results in better performance. … The performance of short lived programs will benefit from PRE_POPULATE_PAGES being disabled.

Reducing memory usage with `MADV_DONTNEED`

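This follows the guard page implementation in isoalloc: since the contents of a guard page are never used by the program, MADV_DONTNEED can be used to evict it from memory. The mimalloc counterpart of the guard page is in mi_segment_init: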

// set up guard pages
size_t guard_slices = 0;
if (MI_SECURE>0) {
    // in secure mode, we set up a protected page in between the segment info
    // and the page data
    ...
    _mi_os_protect((uint8_t*)segment + mi_segment_info_size(segment) - os_pagesize, os_pagesize);
    ...
    _mi_os_protect(end, os_pagesize);
    ...
  }

(in mimalloc 1.x this is in mi_segment_protect)

Add MADV_DONTNEED in mi_os_protectx:

// Protect a region in memory to be not accessible.
static  bool mi_os_protectx(void* addr, size_t size, bool protect) {
    ...
+#ifdef MADV_DONTNEED
+   int madv_err = madvise(start, csize, MADV_DONTNEED);
+   if (madv_err != 0) {
+       _mi_warning_message("madvise error: start: %p, csize: 0x%x, err: %i\n", start, csize, madv_err);
+   }
+#endif
    return (err == 0);
}
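Outside of mimalloc, the same pattern can be sketched in a few lines. This is an illustrative example assuming Linux, not the mimalloc or isoalloc implementation; alloc_with_guard is a hypothetical name.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `pages` usable pages followed by one guard page.
 * The guard page is made inaccessible and its physical backing is dropped,
 * since its contents are never used. Returns the usable region, or NULL on
 * failure.
 */
void *alloc_with_guard(size_t pages)
{
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    size_t len = (pages + 1) * page;

    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    char *guard = base + pages * page;
    if (mprotect(guard, page, PROT_NONE) != 0 ||
        madvise(guard, page, MADV_DONTNEED) != 0) {
        munmap(base, len);
        return NULL;
    }
    return base;
}
```

A freshly mapped guard page that was never touched has no physical backing to begin with, so the saving matters mainly when the page was written before being turned into a guard page, for example when memory is recycled.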

TODO

huge page (hugetlbpage)
- https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
NUMA
- mi_reserve_huge_os_pages_interleave
- http://www.cc.ntu.edu.tw/chinese/epaper/0015/20101220_1508.htm
- http://www.aspphp.online/shujuku/sqlserversjk/gysqlserver/201701/19789.html
free list
- CSAPP 9
- skip list
- XOR linked list
PRNG
- src/random.c
- ChaCha20
madvise
- "Accessing this memory with proper alignment will minimize CPU cache flushes."
