Entropy Memcached Evaluation Results
===

###### tags: `contiguitas` `research`

Results are available here; each tab corresponds to one configuration:
https://docs.google.com/spreadsheets/d/1mC1wLibg6uouCWoHcHUG6YWODrs7njj4ybENikdcefE/edit?usp=sharing

Ideas I have not tried yet:

* [ ] Sequential access in the benchmark
* [ ] Pipelining the benchmark
* [ ] Compare with 64G results

## General Information

* Environment:
    * A compiled mainline Linux kernel
    * Swapping is disabled. The memcached server's memory allowance must therefore be smaller than system memory; I set it to 240G.
    * The port is set to 8888 instead of the default (11211) to avoid any extra traffic.
    * Background threads are turned off.
    * `perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses --pid=$(pidof memcached)` is started right before running the client and is killed right after the client finishes.

The server configuration is the same across all trials. (Shell comments cannot follow a `\` line continuation, so the option descriptions are listed above the command rather than inline.)

```shell!=
# -p 8888   : listen on port 8888
# -c 80000  : allow at most 80000 concurrent clients
# -m 245760 : allow 240G of memory
# -M        : return an error on memory exhaustion instead of evicting
# -t 8      : 8 worker threads
# -o ...    : disable the background LRU threads
taskset -c 0-7 \
    /u2/kaiwenx/memcached-1.6.17/memcached \
    -p 8888 \
    -c 80000 \
    -m 245760 \
    -M \
    -t 8 \
    -o no_lru_maintainer,no_lru_crawler
```

## `memcached-t1-c1-gaussian`

The results are in the `memcached-t1-c1-gaussian` tab. This configuration has a single client sending all commands to the server sequentially. Reason for using a Gaussian key distribution: when memory usage is concentrated, a system with huge pages benefits more from TLB hits.
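To make the huge-page intuition concrete, here is a back-of-envelope TLB-reach estimate. The dTLB entry count is an assumption (a common L2 dTLB size on x86 servers), not a measured value for this machine:

```python
# Rough TLB-reach estimate. DTLB_ENTRIES is an ASSUMED figure (a common
# L2 dTLB size); the real value is CPU-specific.
DTLB_ENTRIES = 1536
PAGE_4K = 4 * 1024
PAGE_2M = 2 * 1024 * 1024

reach_4k = DTLB_ENTRIES * PAGE_4K   # memory covered without huge pages
reach_2m = DTLB_ENTRIES * PAGE_2M   # memory covered with 2M huge pages

# Value bytes stored by this configuration: 360M keys x 512-byte values
dataset = 360_000_000 * 512

print(f"4K-page TLB reach: {reach_4k // 2**20} MiB")
print(f"2M-page TLB reach: {reach_2m // 2**30} GiB")
print(f"dataset (values only): {dataset // 2**30} GiB")
```

With 4K pages the TLB covers only a few MiB of a ~170 GiB dataset, so a Gaussian pattern that concentrates accesses in a small hot set is exactly the case where huge pages can keep the working set within TLB reach.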
Contend (populates every key once before the measured run):

```shell!=
taskset -c 8-15 memtier_benchmark \
    -p 8888 \
    -P memcache_binary \
    -n 'allkeys' \
    -c 500 \
    -t 8 \
    --pipeline=100 \
    --ratio=1:0 \
    --data-size-pattern=R \
    --data-size=512 \
    --key-maximum=360000000 \
    --key-pattern=P:P
```

Client:

```shell!=
taskset -c 8-15 memtier_benchmark \
    -n 5000000 \
    -p 8888 \
    -P memcache_binary \
    -c 1 \
    -t 1 \
    --ratio=0:1 \
    --data-size=512 \
    --key-maximum=360000000 \
    --key-pattern=G:G
```

## `memcached-t1-c1-uniform`

Same as `memcached-t1-c1-gaussian`, except that the benchmark's key accesses are distributed uniformly.

Contend (populates every key once before the measured run):

```shell!=
taskset -c 8-15 memtier_benchmark \
    -p 8888 \
    -P memcache_binary \
    -n 'allkeys' \
    -c 500 \
    -t 8 \
    --pipeline=100 \
    --ratio=1:0 \
    --data-size-pattern=R \
    --data-size=512 \
    --key-maximum=360000000 \
    --key-pattern=P:P
```

Client:

```shell!=
taskset -c 8-15 memtier_benchmark \
    -n 5000000 \
    -p 8888 \
    -P memcache_binary \
    -c 1 \
    -t 1 \
    --ratio=0:1 \
    --data-size=512 \
    --key-maximum=360000000 \
    --key-pattern=R:R
```

## `memcached-t1-c1-gaussian-obj4096`

Using larger objects (larger than 4KB) forces a system without huge pages to use multiple pages to store a single object, wasting TLB space.

Contend (populates every key once before the measured run):

```shell!=
taskset -c 8-15 memtier_benchmark \
    -p 8888 \
    -P memcache_binary \
    -n 'allkeys' \
    -c 500 \
    -t 8 \
    --pipeline=100 \
    --ratio=1:0 \
    --data-size-pattern=R \
    --data-size=4096 \
    --key-maximum=50000000 \
    --key-pattern=P:P
```

Client:

```shell!=
taskset -c 8-15 memtier_benchmark \
    -n 5000000 \
    -p 8888 \
    -P memcache_binary \
    -c 1 \
    -t 1 \
    --ratio=0:1 \
    --data-size=4096 \
    --key-maximum=50000000 \
    --key-pattern=G:G
```

## `memcached-t8-c100-gaussian-obj4096`

Concurrent accesses are closer to a production workload and may incur more TLB misses.
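Both `obj4096` configurations rely on a single object spanning more than one base page. A quick sanity check; the per-item overhead is an assumed, illustrative number (memcached's item header plus the key), not taken from the source:

```python
import math

PAGE_4K = 4096
VALUE_SIZE = 4096
ITEM_OVERHEAD = 80  # ASSUMED: item header + key bytes, illustrative only

# A 4096-byte value plus any overhead cannot fit in one 4K page, so each
# access touches at least two dTLB entries on a system without huge pages.
pages_per_item = math.ceil((VALUE_SIZE + ITEM_OVERHEAD) / PAGE_4K)
print(pages_per_item)

# Total value bytes for key-maximum=50000000 at 4096 bytes each
dataset = 50_000_000 * 4096
print(dataset // 2**30, "GiB")
```

So the obj4096 runs double the TLB pressure per access on 4K pages while keeping the total dataset (~190 GiB of values) under the 240G memory allowance.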
Contend (populates every key once before the measured run):

```shell!=
taskset -c 8-15 memtier_benchmark \
    -p 8888 \
    -P memcache_binary \
    -n 'allkeys' \
    -c 500 \
    -t 8 \
    --pipeline=100 \
    --ratio=1:0 \
    --data-size-pattern=R \
    --data-size=4096 \
    --key-maximum=50000000 \
    --key-pattern=P:P
```

Client:

```shell!=
taskset -c 8-15 memtier_benchmark \
    -n 100000 \
    -p 8888 \
    -P memcache_binary \
    -c 100 \
    -t 8 \
    --ratio=0:1 \
    --data-size=4096 \
    --key-maximum=50000000 \
    --key-pattern=G:G
```
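For scale: memtier's `-n` is requests per client, so the total issued by this client is `n × clients × threads`. The traffic estimate below counts value payload bytes only (protocol headers ignored), as an approximation:

```python
# memtier_benchmark with -n 100000, -c 100, -t 8:
# -n is requests PER CLIENT, so total = n * clients * threads.
requests = 100_000 * 100 * 8
value_bytes = requests * 4096  # 4096-byte values; headers not counted

print(f"{requests:,} GET requests")
print(f"~{value_bytes // 2**30} GiB of values read")
```

Eighty million GETs over a 50M-key Gaussian distribution means most keys in the hot region are read repeatedly, which is what makes the dTLB hit/miss counters from `perf stat` meaningful here.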