mimalloc

Everything below was tested with version 2.0.1.

  • dev: development branch for mimalloc v1.
  • dev-slice: development branch for mimalloc v2 with a new algorithm for managing internal mimalloc pages.

Installation

Build the project following the instructions on GitHub:

mkdir -p out/release
cd out/release
cmake ../..
make

Then install it system-wide with sudo make install, which prints:

Install the project...
-- Install configuration: "Release"
-- Installing: /usr/local/lib/mimalloc-2.0/libmimalloc.so.2.0
-- Installing: /usr/local/lib/mimalloc-2.0/libmimalloc.so
-- Installing: /usr/local/lib/mimalloc-2.0/cmake/mimalloc.cmake
-- Installing: /usr/local/lib/mimalloc-2.0/cmake/mimalloc-release.cmake
-- Installing: /usr/local/lib/mimalloc-2.0/libmimalloc.a
-- Installing: /usr/local/lib/mimalloc-2.0/include/mimalloc.h
-- Installing: /usr/local/lib/mimalloc-2.0/include/mimalloc-override.h
-- Installing: /usr/local/lib/mimalloc-2.0/include/mimalloc-new-delete.h
-- Installing: /usr/local/lib/mimalloc-2.0/cmake/mimalloc-config.cmake
-- Installing: /usr/local/lib/mimalloc-2.0/cmake/mimalloc-config-version.cmake
-- Symbolic link: /usr/local/lib/libmimalloc.so -> mimalloc-2.0/libmimalloc.so.2.0
-- Installing: /usr/local/lib/mimalloc-2.0/mimalloc.o

However, according to the description in README.md:

sudo make install (install the library and header files in /usr/local/lib and /usr/local/include)

the header files should be placed in /usr/local/include, not in /usr/local/lib/mimalloc-${version}/include.

The following code can be found in CMakeLists.txt:

if (MI_INSTALL_TOPLEVEL)
  set(mi_install_dir "${CMAKE_INSTALL_PREFIX}")
else()
  set(mi_install_dir "${CMAKE_INSTALL_PREFIX}/lib/mimalloc-${mi_version}")
endif()

CMAKE_INSTALL_PREFIX defaults to /usr/local on Unix systems, so when MI_INSTALL_TOPLEVEL is not set (the default), cmake sets the install path to /usr/local/lib/mimalloc-${mi_version} and never places the header files in /usr/local/include, which breaks the #include of the headers in user programs.
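
For example, a minimal user program that only includes the header (mi_malloc and mi_free are part of mimalloc's public API) fails to compile against the default install, because <mimalloc.h> is not on the compiler's include path:

// test-mimalloc.c: minimal program using the mimalloc API
#include <stdio.h>
#include <mimalloc.h>   // not found under the default install layout

int main(void)
{
    void *p = mi_malloc(64);        // allocate 64 bytes through mimalloc
    printf("allocated at %p\n", p);
    mi_free(p);
    return 0;
}

With the layout shown above it only builds when the versioned directory is passed explicitly, e.g. gcc test-mimalloc.c -I/usr/local/lib/mimalloc-2.0/include -L/usr/local/lib -lmimalloc.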

Issue 223 also discusses this problem, but the developer felt it would cause version confusion, since /usr/local/include/mimalloc.h would carry no version information. That could, however, be handled with extra logic in CMakeLists.txt, as elbaro's comment suggests:


.h - it's good to have versioned headers if you want, but it should be under /usr/include/, not /usr/lib/.

I recommend to have versioned filenames and also unversioned symlink.

e.g.

#include <mimalloc.h>  // symlink
#include <mimalloc-1.5/mimalloc.h> // if want a specific version
#include <mimalloc/mimalloc-1.5.h> // or
#include <mimalloc.1.5.h> // or

According to the Linux Filesystem Hierarchy description:

/usr/include
The directory for 'header files', needed for compiling user space source code.

so /usr/local/include is indeed where header files belong.

The installed layout looks like this:

/usr/local/lib
├── libmimalloc.so -> mimalloc-2.0/libmimalloc.so.2.0
├── mimalloc-2.0
│   ├── cmake
│   │   ├── mimalloc.cmake
│   │   ├── mimalloc-config.cmake
│   │   ├── mimalloc-config-version.cmake
│   │   └── mimalloc-release.cmake
│   ├── include
│   │   ├── mimalloc.h
│   │   ├── mimalloc-new-delete.h
│   │   └── mimalloc-override.h
│   ├── libmimalloc.a
│   ├── libmimalloc.so -> libmimalloc.so.2.0
│   ├── libmimalloc.so.2.0
│   └── mimalloc.o

Here /usr/local/lib/libmimalloc.so is a symbolic link to the actual library libmimalloc.so.2.0 inside mimalloc-2.0/. At link time, the gcc linker option -lmimalloc searches the library path for a file named libmimalloc.so to link against.

At run time, however, even though libmimalloc.so was built with its soname set to libmimalloc.so.2.0, which can be checked with the following command:

$> /usr/local/lib$ objdump -p mimalloc-2.0/libmimalloc.so.2.0 | grep SONAME
   SONAME               libmimalloc.so.2.0

ld.so still cannot locate the library: the file named by that soname only exists under /usr/local/lib/mimalloc-2.0/, which is not in the loader's search path, so the program fails to find it:

$> ldd a.out 
        linux-vdso.so.1 (0x00007ffcbfb5f000)
        libmimalloc.so.2.0 => not found
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f436e14e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f436e367000)

The following approaches can be tried:

  1. Specify an rpath so the dynamic loader knows where to look: gcc -Wl,-rpath,/usr/local/lib/mimalloc-2.0/ myprogram.c -lmimalloc
  2. Preload the library when running the program: LD_PRELOAD=/usr/local/lib/libmimalloc.so ./a.out
  3. Add /usr/local/lib/mimalloc-2.0/ to /etc/ld.so.conf and refresh the cache with ldconfig
  4. Move /usr/local/lib/mimalloc-2.0/libmimalloc.so.2.0 directly into /usr/local/lib and re-create the /usr/local/lib/libmimalloc.so symbolic link.

Ubuntu does not include /usr/local/lib/mimalloc-2.0/ in the dynamic loader's search path by default; it has to be added manually.

  1. The soname and the real name are identical, and it does not seem to be used?
  2. Where the .so files should actually be placed

Time to submit a pull request!

Corrected on 5/22:
issue, commit

How mimalloc works

mimalloc source code analysis

mimalloc-bench

environment: Linux ip-172-31-53-217 5.4.0-1049-aws #51-Ubuntu SMP Wed May 12 21:13:51 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux

allocator        description
jemalloc
tcmalloc
mimalloc
TBB malloc
hoard            inlining failed
mesh
nomesh
supermalloc      need to use RTM instruction
snmalloc
rpmalloc
scalloc          need -m64 flag

(the per-allocator build-output screenshots are omitted)

mesh and nomesh need extra changes to the script before they build.
[reference]

Building the benchmarks needs extra handling for the Arm architecture: passing -DAPPLE makes the generated CMakeLists.txt skip building alloc-test, which requires the SIMD instruction set:

phase "build benchmarks" mkdir -p out/bench cd out/bench - cmake ../../bench + cmake ../../bench -DAPPLE make cd ../..

lean builds with multiple CPUs by default, which gets stuck (cause unknown); switching to a single CPU avoids the problem:

pushd $devdir
if test -d lean; then
  echo "$devdir/lean already exists; no need to git clone"
else
  git clone https://github.com/leanprover/lean
fi
cd lean
git checkout v3.4.1
mkdir -p out/release
cd out/release
env CC=gcc CXX="g++ -Wno-exceptions" cmake ../../src -DCUSTOM_ALLOCATORS=OFF
- make -j $procs
+ make
popd

Profiling mimalloc hotspots with gprof

Modify CMakeLists.txt:

@@ -174,6 +174,11 @@ if(CMAKE_C_COMPILER_ID MATCHES "AppleClang|Clang|GNU")
   endif()
 endif()
 
+# Enforce profiling build
+# https://sourceware.org/binutils/docs/gprof/Compiling.html
+list(APPEND mi_cflags -g -pg)
+link_libraries("-pg")
+
 if(CMAKE_C_COMPILER_ID MATCHES "Intel")
   list(APPEND mi_cflags -Wall -fvisibility=hidden)
 endif()
@@ -362,7 +367,7 @@ if (MI_BUILD_TESTS)
   target_compile_definitions(mimalloc-test-stress PRIVATE ${mi_defines})
   target_compile_options(mimalloc-test-stress PRIVATE ${mi_cflags})
   target_include_directories(mimalloc-test-stress PRIVATE include)
-  target_link_libraries(mimalloc-test-stress PRIVATE mimalloc ${mi_libraries})
+  target_link_libraries(mimalloc-test-stress PRIVATE mimalloc-static ${mi_libraries})
 
   enable_testing()
   add_test(test_api, mimalloc-test-api)

Switch to the out/release directory and rebuild the project:

$ cd out/release
$ make clean all

Run mimalloc-test-stress and wait patiently for it to finish:

$ ./mimalloc-test-stress

A new file gmon.out should now appear in the current directory; analyze it with gprof:

$ gprof mimalloc-test-stress | less

Sample output (excerpt):

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 67.42      0.91     0.91      221     4.12     6.05  stress
 16.30      1.13     0.22   593350     0.00     0.00  alloc_items
  5.93      1.21     0.08   649070     0.00     0.00  mi_calloc
  1.48      1.23     0.02  2240319     0.00     0.00  free
  1.48      1.25     0.02   130442     0.00     0.00  _mi_free_block_mt
  1.48      1.27     0.02   106897     0.00     0.00  mi_free_generic
  1.48      1.29     0.02    46815     0.00     0.00  _mi_page_free_collect
  1.48      1.31     0.02    22973     0.00     0.00  _mi_segment_page_alloc
  0.74      1.32     0.01   653771     0.00     0.00  malloc_size
  0.74      1.33     0.01    30876     0.00     0.00  _mi_malloc_generic
  0.74      1.34     0.01      633     0.02     0.02  mi_segment_init

Profiling with perf

The -pg compile option above is not needed here; rebuild the project as appropriate.

Run the program and collect profiling data:

$ perf record --call-graph dwarf -- ./mimalloc-test-stress

This may need to be run with sudo.

Analyze the profile collected by perf:

$ perf report -g graph --no-children

Functions that warrant further investigation:

  • mi_page_free_list_extend (src/page.c)
  • malloc_usable_size
    mi_usable_size (src/alloc.c)
  • mi_page_queue_find_free_ex (src/page.c)
  • _mi_page_free_collect (src/page.c)

Testing with the memory-allocation behavior of real Android applications

allocbench is a set of tools to benchmark memory allocators realistically using techniques from Android.

Modify reply.cc by changing the DEBUG value on line 15 to false, then compile:

$ g++ -o reply reply.cpp

Unpack the trace files:

$ pushd traces
$ tar Jxvf android.tar.xz
$ popd

Modify run.sh: after declare -A allocators, keep only the [mimalloc] and [glibc] entries, and fix the path to libmimalloc.so.

Run it:

$ ./run.sh

Expected output:

===========================
Allocator: glibc
====================
Trace: angry_birds2
Loading events...
Loaded 10080312 events.

Running events...
  Run 0: 637.656 ms
  Run 1: 457.265 ms
  Run 2: 525.347 ms
  Run 3: 574.06 ms

A madvise(MADV_DONTNEED) use case

After free() is called, the memory is reclaimed and placed on a free list; if the allocator later returns those pages to the kernel, accessing the address again can trigger a segmentation fault. If we only want to release the contents of a range of memory, we can call madvise(addr, len, MADV_DONTNEED): the kernel may free the physical pages backing addr, but the virtual mapping itself stays valid, so a later access does not cause a segmentation fault (for a private anonymous mapping it reads back zero-filled pages), and the address range can be reused.

Test program:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <sys/mman.h>
#include <unistd.h>

int main()
{
    printf("++++ before malloc:\n");
    system("free -h");

    long len = 1024 * 1024 * 1024;  // 1 GB                                                                                                                   
    char *buffer;

    posix_memalign((void **) &buffer, sysconf(_SC_PAGESIZE), len);

    memset(buffer, 'M', len);
    printf("buffer[0]: %c\n", buffer[0]);

    printf("\n\n");
    printf("++++ after malloc:\n");
    system("free -h");

    int ret = madvise(buffer, len, MADV_DONTNEED);
    if (ret == -1) {
        printf("madvise ret: %d\n", ret);
        printf("err: %s\n", strerror(errno));
    }

    printf("\n\n");
    printf("++++ after madvise:\n");
    system("free -h");

    printf("buffer[0]: %d\n", buffer[0]);
}

Sample output:

++++ before malloc:
              total        used        free      shared  buff/cache   available
Mem:          110Gi       4.0Gi        40Gi       5.0Mi        65Gi       105Gi
Swap:          64Gi          0B        64Gi
buffer[0]: M


++++ after malloc:
              total        used        free      shared  buff/cache   available
Mem:          110Gi       5.0Gi        39Gi       5.0Mi        65Gi       104Gi
Swap:          64Gi          0B        64Gi


++++ after madvise:
              total        used        free      shared  buff/cache   available
Mem:          110Gi       4.0Gi        40Gi       5.0Mi        65Gi       105Gi
Swap:          64Gi          0B        64Gi
buffer[0]: 0

Look at the used column:

  • Before malloc is called, the system uses 4.0 GiB
  • After malloc (and the memset that touches the pages), usage rises to 5.0 GiB, i.e. 1.0 GiB more
  • After madvise, usage falls back to 4.0 GiB, the same as before the malloc

Note: posix_memalign((void **)&buffer, sysconf(_SC_PAGESIZE), len) allocates page-aligned memory. This is required because madvise only accepts page-aligned addresses.
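
As a quick illustration of that constraint (a standalone snippet, not part of the test program above), madvise on an address that is not page-aligned fails with EINVAL:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);   // page-aligned mapping
    if (p == MAP_FAILED)
        return 1;

    // Deliberately pass a misaligned address: the kernel rejects it.
    if (madvise(p + 1, page, MADV_DONTNEED) == -1)
        printf("unaligned madvise: %s\n", strerror(errno));   // Invalid argument

    // The page-aligned call succeeds.
    if (madvise(p, page, MADV_DONTNEED) == 0)
        printf("aligned madvise: ok\n");

    munmap(p, 2 * page);
    return 0;
}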


Improvement opportunities

getcpu

To support NUMA, mimalloc calls the getcpu system call frequently; this can be sped up with restartable sequences (rseq), see The 5-year journey to bring restartable sequences to Linux.

TCMalloc, developed by Google, also uses rseq; see:

mimalloc #315 discusses adopting rseq, but no code has been written yet (a good opportunity).

rseq currently only exposes the cpu id, while mimalloc calls getcpu in order to obtain the NUMA node id, so the cpu id has to be mapped to a node id separately. According to the NUMA man page, numa_node_of_cpu can be used to get the node id, as in the following code:

// compile with -lnuma

#include <numa.h>
...
cpu = __rseq_abi.cpu_id;
node = numa_node_of_cpu(cpu);

The following compares the performance of four ways of obtaining the cpu id: glibc's getcpu, getcpu called directly through the vDSO, getcpu via the raw system call, and the cpu id read from rseq.


(the syscall version takes too long and falls outside the chart's range)

The rseq version is the fastest, consistent with the measurements in The 5-year journey to bring restartable sequences to Linux. glibc's getcpu is just an extra wrapper around the vDSO call (on x86-64), so the two are about equally fast, with glibc slightly slower because of the additional layers of function calls.
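
For reference, a rough sketch of this kind of micro-benchmark (assuming glibc ≥ 2.29 for getcpu(); the direct-vDSO and rseq variants are omitted here because they need extra setup, and sched_getcpu() is shown instead):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define N 1000000

static void bench(const char *name, int (*fn)(void))
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%-22s %8.1f ns/call\n", name, ns / N);
}

static int via_glibc(void)   { unsigned cpu, node; return getcpu(&cpu, &node); }
static int via_sched(void)   { return sched_getcpu(); }
static int via_syscall(void) { unsigned cpu, node; return (int)syscall(SYS_getcpu, &cpu, &node, NULL); }

int main(void)
{
    bench("getcpu (glibc)", via_glibc);
    bench("sched_getcpu", via_sched);
    bench("getcpu (raw syscall)", via_syscall);
    return 0;
}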

Timing comparison of rseq plus numa_node_of_cpu:

With the extra numa_node_of_cpu call, the rseq approach becomes slower than glibc's and the vDSO's getcpu; the latter two return the node id in the same call, so they need no additional lookup and come out ahead.

The raw syscall used in mimalloc can therefore be switched to the vDSO-based getcpu call.

Submitted PR#424: Use much faster getcpu() via vDSO

rseq

Idea: add a segment pool layer above the segments.

CPU0 ── pool ──┬── segment ──┬── page
               │             └── page
               └── segment ──┬── page
                             └── page

CPU1 ── pool ──┬── segment ──┬── page
               │             └── page
               └── segment ──┬── page
                             └── page

This way the atomic operations between segments could presumably be eliminated?
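
A very rough sketch of the shape this idea could take (every name here is hypothetical; nothing like this exists in mimalloc today):

// Hypothetical per-CPU segment pool: each CPU owns one pool, the pool owns
// segments, and each segment owns its pages. Segments would only be handed
// out and returned through their own CPU's pool, so the atomic operations
// currently needed for cross-segment interaction could shrink to the pool
// boundary.
typedef struct hyp_page_s {
    struct hyp_page_s *next;
} hyp_page_t;

typedef struct hyp_segment_s {
    struct hyp_segment_s *next;
    hyp_page_t           *pages;     // pages carved out of this segment
} hyp_segment_t;

typedef struct hyp_segment_pool_s {
    int            cpu;              // owning CPU
    hyp_segment_t *segments;         // segments cached for this CPU
} hyp_segment_pool_t;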

Using __builtin_prefetch on linked lists

mimalloc's src/page.c, src/page-queue.c, and src/segment.c use linked lists, which could be sped up with the CPU's prefetch instruction; see:

TCMalloc's source code also uses this trick, see tcmalloc/tcmalloc/internal/linked_list.h:

inline void* SLL_Pop(void** list) {
  void* result = *list;
  void* next = SLL_Next(*list);
  *list = next;
  // Prefetching NULL leads to a DTLB miss, thus only prefetch when 'next'
  // is not NULL.
#if defined(__GNUC__)
  if (next) {
    __builtin_prefetch(next, 0, 3);
  }
#endif
  return result;
}

Simply adding __builtin_prefetch to the free list had no measurable effect on cache behavior.
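
For reference, the kind of change that was tried looks roughly like the following (a simplified sketch on a plain singly linked list; mimalloc's real free list encodes its next pointers, so this is not its actual code):

// Pop from a singly linked free list and prefetch the successor,
// mirroring the tcmalloc SLL_Pop idea above.
typedef struct block_s {
    struct block_s *next;
} block_t;

static inline block_t *freelist_pop(block_t **list)
{
    block_t *b = *list;
    if (b != NULL) {
        block_t *next = b->next;
        *list = next;
#if defined(__GNUC__)
        if (next != NULL)                      // prefetching NULL would just cause a DTLB miss
            __builtin_prefetch(next, 0, 3);    // read access, keep in all cache levels
#endif
    }
    return b;
}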

TODO

  1. The lists may be so short that prefetching barely matters
  2. The hardware prefetcher may already be doing this?

ref: The problem with prefetch

Reducing page faults with MAP_POPULATE

Using SQLite as an example, the performance impact of introducing MAP_POPULATE:

          Before    After
Mean       2.550    2.375
StdDev     0.021    0.043

The time taken to issue "SELECT * FROM tvshowview" in seconds

isoalloc also uses MAP_POPULATE.

Current test with mimalloc:

$ perf stat -e faults ./mimalloc-test-stress

Sample output:

           335,144      faults
       1.331217221 seconds time elapsed
      16.206853000 seconds user
       2.566793000 seconds sys

The main allocation path in mimalloc is the mi_unix_mmap function; try adding MAP_POPULATE there:

static void* mi_unix_mmap(void* addr, size_t size, size_t try_alignment, int protect_flags, bool large_only, bool allow_large, bool* is_large) {
    void* p = NULL;
    #if !defined(MAP_ANONYMOUS)
    #define MAP_ANONYMOUS  MAP_ANON
    #endif
    #if !defined(MAP_NORESERVE)
    #define MAP_NORESERVE  0
    #endif
-   int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
+   int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE | MAP_POPULATE;
    ...
    if ((large_only || use_large_os_page(size, try_alignment)) && allow_large) {
        static _Atomic(uintptr_t) large_page_try_ok; // = 0;
        uintptr_t try_ok = mi_atomic_load_acquire(&large_page_try_ok);
        if (!large_only && try_ok > 0) {
            // If the OS is not configured for large OS pages, or the user does not have
            // enough permission, the `mmap` will always fail (but it might also fail for other reasons).
            // Therefore, once a large page allocation failed, we don't try again for `large_page_try_ok` times
            // to avoid too many failing calls to mmap.
            mi_atomic_cas_strong_acq_rel(&large_page_try_ok, &try_ok, try_ok - 1);
        }
        else {
            int lflags = flags & ~MAP_NORESERVE;  // using NORESERVE on huge pages seems to fail on Linux
+           lflags = lflags & ~MAP_POPULATE; // don't use MAP_POPULATE on huge pages
    ...

Tested on mimalloc-bench:

mimalloc v1

benchmark        major-faults  major-faults (populate)  minor-faults  minor-faults (populate)
cfrac            0             0                        369           65706
espresso         0             0                        684           65693
barnes           0             0                        15612         185511
leanN            3             0                        122108        197477
alloc-test1      1             0                        2566          65787
alloc-testN      0             0                        2971          65812
larsonN          1             0                        17958         68939
sh6benchN        0             0                        53165         65731
sh8benchN        0             0                        31688         87237
xmalloc-testN    1             0                        16094         65724
cache-scratch1   0             0                        230           65758
cache-scratchN   0             0                        271           65773
mstressN         0             0                        187768        328171
rptestN          1             0                        41804         65814

mimalloc v2

benchmark        major-faults  major-faults (populate)  minor-faults  minor-faults (populate)
cfrac            0             0                        371           2218
espresso         0             0                        789           2206
barnes           0             0                        15618         168386
leanN            0             0                        125934        153145
alloc-test1      0             0                        2575          6403
alloc-testN      0             0                        3002          16677
larsonN          0             0                        19182         27134
sh6benchN        0             0                        53427         57567
sh8benchN        0             0                        38250         75987
xmalloc-testN    0             0                        19217         20696
cache-scratch1   0             0                        240           4326
cache-scratchN   0             0                        283           16643
mstressN         0             0                        268846        534053
rptestN          0             0                        51005         94541

In mimalloc v1.x the number of major page faults decreases, but minor page faults increase along with it; the same pattern appears in v2.x.

mimalloc-bench sets MIMALLOC_EAGER_COMMIT_DELAY=0; if MIMALLOC_EAGER_COMMIT_DELAY is restored to its default value of 1, the minor page fault counts return to normal.

Running mimalloc-test-stress with MAP_POPULATE, however, cuts minor page faults dramatically. (Perhaps an option could be added to let the user decide whether to enable MAP_POPULATE.)

perf stat -e page-faults env MIMALLOC_EAGER_COMMIT_DELAY={0,1} ./mimalloc-test-stress 32 100 50

eager_commit   populate   page-faults
0              No         2,955,325
1              No         2,943,560
0              Yes        68,053
1              Yes        94,554
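
If such a user-facing switch were added, one possible shape is sketched below (MIMALLOC_MAP_POPULATE is a hypothetical option name, not something mimalloc currently provides):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

// Hypothetical: let an environment variable decide whether freshly
// mmap'ed regions ask the kernel to pre-populate their page tables.
static int use_map_populate(void)
{
    const char *v = getenv("MIMALLOC_MAP_POPULATE");   // hypothetical option
    return v != NULL && strcmp(v, "1") == 0;
}

static int pick_mmap_flags(void)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
#ifdef MAP_POPULATE
    if (use_map_populate())
        flags |= MAP_POPULATE;
#endif
    return flags;
}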

The IsoAlloc Performance document mentions:

All bitmaps pages allocated with mmap are passed to the madvise syscall with the advice arguments MADV_WILLNEED and MADV_SEQUENTIAL. All user pages allocated with mmap are passed to the madvise syscall with the advice arguments MADV_WILLNEED and MADV_RANDOM. By default both of these mappings are created with MAP_POPULATE which instructs the kernel to pre-populate the page tables which reduces page faults and results in better performance. The performance of short lived programs will benefit from PRE_POPULATE_PAGES being disabled.
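
A minimal sketch of the scheme described in that quote (not IsoAlloc's actual code): map a region, optionally pre-populate it, and hand the kernel access-pattern hints up front:

#include <stddef.h>
#include <sys/mman.h>

// Map `size` bytes of anonymous memory, pre-populating the page tables
// when MAP_POPULATE is available, and advise the kernel that the pages
// will be needed soon and accessed in a random pattern.
static void *map_user_pages(size_t size)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_POPULATE
    flags |= MAP_POPULATE;
#endif
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    madvise(p, size, MADV_WILLNEED);
    madvise(p, size, MADV_RANDOM);
    return p;
}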

Reducing memory usage with MADV_DONTNEED

Following the guard-page implementation in isoalloc: since a guard page's contents are never accessed by the program, MADV_DONTNEED can be used to evict them from memory. The corresponding code in mimalloc is the guard-page setup inside mi_segment_init:

// set up guard pages
size_t guard_slices = 0;
if (MI_SECURE>0) {
    // in secure mode, we set up a protected page in between the segment info
    // and the page data
    ...
    _mi_os_protect((uint8_t*)segment + mi_segment_info_size(segment) - os_pagesize, os_pagesize);
    ...
    _mi_os_protect(end, os_pagesize);
    ...
  }

(in mimalloc 1.x this is done in mi_segment_protect instead)

mi_os_protectx 中添加 MADV_DONTNEED

// Protect a region in memory to be not accessible.
static  bool mi_os_protectx(void* addr, size_t size, bool protect) {
    ...
+#ifdef MADV_DONTNEED
+   int madv_err = madvise(start, csize, MADV_DONTNEED);
+   if (madv_err != 0) {
+       _mi_warning_message("madvise error: start: %p, csize: 0x%x, err: %i\n", start, csize, madv_err);
+   }
+#endif
    return (err == 0);
}

TODO