
# NUMA research note

## April 7

### [BWAP](https://arxiv.org/pdf/2003.03304.pdf)

Given a NUMA system and a parallel application running on a set of worker nodes, the goal of BWAP is to devise and enforce an efficient interleaving of the application’s pages across the NUMA nodes. Since application threads access different memory nodes through potentially diverse bandwidths (BWs), BWAP assigns different weights to different nodes. A node’s weight denotes the fraction of pages mapped to the node.

- Machine A is a 4-socket AMD Opteron Processor 6272, with 8 memory nodes, 8 cores per node, 64GB DRAM, running Linux 4.17.
- Machine B is a 2-socket Intel Xeon CPU E5-2660 v4.
- Scenarios
    - co-scheduled
        - 2 benchmarks share the same NUMA machine.
            - benchmark A is not memory-intensive
                - place its pages locally, for locality
            - benchmark B is memory-intensive
                - BWAP scatters its pages across all nodes
                - try not to degrade A
        - run swaptions
            - prices a portfolio of options (swaptions)
            - RMS + Monte Carlo (MC) simulation
            - Refer to [Benchmarking Modern Multiprocessors](https://parsec.cs.princeton.edu/publications/bienia11benchmarking.pdf)
    - stand-alone
        - 1 benchmark
        - the machine is entirely available to application A

### NUMA topology-aware experiment

- Control group:
    - Existing NUMA-aware placement policies
        - Linux’s default policy: first-touch
        - uniform interleaving across workers
        - Carrefour
        - Asymsched
        - uniform interleaving across all nodes
        - autonuma
        - nonuniform interleaving across workers
        - BWAP
- Experimental group:
    - Memory Broker
- Experiment scenarios:
    - Single application
    - Multiple applications
        - Plentiful resources
            - Like the co-scheduled scenario in BWAP
            - How the Broker makes an application aware of other applications' resources so that it does not encroach on them.
        - Scarce resources
            - How the Broker helps an application gather resources.
        - Normal resources and multiple benchmarks
            - How the Broker helps multiple applications work with each other.
- What to run
    - Application with high memory consumption
        - Swaptions
    - Application with heavy memory sharing
        - Communication
    - Application with highly unbalanced memory use
        - MySQL
- Data to collect:
    - Throughput, run time
    - Memory consumption
    - Remote accesses
    - Cache misses, swap
    - Page replication

### Memory Broker concept

- Carrefour (Linux) 2013 (robust)
    - Shared memory uniformly interleaved across workers
- Asymsched 2015
    - Places threads and memory based on the asymmetric interconnect
- BWAP 2020
    - Shared memory interleaved across workers based on the asymmetric interconnect
- Memory Broker
- node1 topology

```bash
$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 48240 MB
node 0 free: 313 MB
node 1 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 2 size: 64482 MB
node 2 free: 170 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10
```

## March 10

[Purgeable Memory Allocations for Linux](https://nullprogram.com/blog/2019/12/29/)

[Qthreads/libnuma.c](https://github.com/Qthreads/qthreads/blob/master/src/affinity/libnuma.c)
- Implementation based on numa.h
- `qt_affinity_gendists`: computes the distances between nodes

[Linux NUMA balancer vs load balancer](https://www.phoronix.com/scan.php?page=news_item&px=Linux-NUMA-Reconcile-Balance-V4)

[Replication](https://lwn.net/Articles/223056/)

Choose a different methodology depending on how much data has been collected.

- [carrefour](http://fabiengaud.net/resources/dashti13traffic-slides.pdf)
    - balance memory pressure on the interconnect and the memory controllers (MC)
    - locality

|Data volume|High uncertainty|Medium uncertainty|Low uncertainty|
|-----|--------|--------|--------|
|Low|Linear program|depends|Regression analysis|
|Medium|depends|depends|Regression analysis|
|High|min regret|min regret|min regret|

- [visualization](https://github.com/sonicyang/wastedcores/tree/master/tools/visualizations_4.4)
- [conventional way](http://www.cs.kent.edu/~jmaletic/cs69995-PC/papers/Moreta07.pdf)

Observe memory benchmarks.

###### tags: `NUMA`
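
The BWAP summary under April 7 says a node's weight is the fraction of pages mapped to that node. As a rough illustration of what weighted interleaving means, the sketch below spreads page indices over nodes in proportion to integer weights. It uses smooth weighted round-robin as a stand-in, not the placement algorithm from the BWAP paper, and the function name `weighted_interleave` and the example weights are hypothetical.

```python
from collections import Counter


def weighted_interleave(num_pages, weights):
    """Map page indices 0..num_pages-1 to NUMA nodes in proportion to `weights`.

    `weights` maps node id -> positive integer weight; a node's share of the
    pages is weight / sum(weights). Smooth weighted round-robin is used as an
    illustrative stand-in, NOT the algorithm from the BWAP paper.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    credit = {node: 0 for node in weights}
    placement = []
    for _ in range(num_pages):
        # Every node earns credit equal to its weight on each step.
        for node, weight in weights.items():
            credit[node] += weight
        # The node with the most credit receives the page and pays `total`.
        chosen = max(credit, key=credit.get)
        credit[chosen] -= total
        placement.append(chosen)
    return placement


# Hypothetical weights: nodes 0 and 2 each get 40% of pages, nodes 1 and 3 get 10%.
weights = {0: 4, 1: 1, 2: 4, 3: 1}
placement = weighted_interleave(1000, weights)
counts = Counter(placement)
print(dict(counts))
```

With integer weights, each full cycle of `sum(weights)` pages gives every node exactly its weight in pages, and consecutive pages still alternate between nodes rather than filling one node at a time. On Linux, enforcing such a mapping would go through a memory policy or page-migration interface, which this sketch does not attempt.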
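
The `numactl --hardware` listing for node1 under April 7 shows that nodes 1 and 3 have CPUs but 0 MB of memory, so threads running there always access remote memory. The sketch below parses that output to find the memoryless nodes and, for each, the closest nodes that do hold memory. The sample text is copied from the note; the helper names `parse_numactl_hardware` and `nearest_memory_nodes` are hypothetical, and the parser assumes node ids are contiguous from 0 and appear in column order in the distance table.

```python
import re

# Output copied from the node1 listing in this note.
SAMPLE = """\
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 48240 MB
node 0 free: 313 MB
node 1 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 2 size: 64482 MB
node 2 free: 170 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10
"""


def parse_numactl_hardware(text):
    """Return (sizes, distances): node -> size in MB, node -> list of distances."""
    sizes, distances = {}, {}
    for line in text.splitlines():
        size = re.match(r"node (\d+) size: (\d+) MB", line)
        row = re.match(r"\s*(\d+):\s+(.+)", line)  # distance rows such as "  0:  10  16"
        if size:
            sizes[int(size.group(1))] = int(size.group(2))
        elif row:
            distances[int(row.group(1))] = [int(d) for d in row.group(2).split()]
    return sizes, distances


def nearest_memory_nodes(node, sizes, distances):
    """Return the memory-holding nodes at minimum distance from `node`."""
    candidates = [n for n, mb in sizes.items() if mb > 0]
    # Assumes node ids are contiguous from 0, so an id doubles as a column index.
    best = min(distances[node][n] for n in candidates)
    return [n for n in candidates if distances[node][n] == best]


sizes, distances = parse_numactl_hardware(SAMPLE)
memoryless = [n for n, mb in sizes.items() if mb == 0]
for n in memoryless:
    print(f"node {n} has no local memory; nearest memory nodes: "
          f"{nearest_memory_nodes(n, sizes, distances)}")
```

On this machine every remote distance is 16, so both memory nodes tie for each memoryless node. The distance table alone therefore cannot rank them, and a measured-bandwidth signal of the kind BWAP uses would be needed to choose between them.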