### Overall
NUMA is a property of the hardware topology; users cannot change it. NUMA logically partitions the hardware: a set of CPUs, memory, NICs, and GPUs placed in the same group are closest to each other and communicate fastest. Each CPU is typically paired with one memory region, and that region is the closest (fastest) memory for that CPU.
---
Feeding these outputs to an LLM with a prompt such as
`Based on this output, what should I watch out for? Which settings could speed things up?`
is a quick way to surface points worth checking.
:::info
### `lscpu`
```
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
NUMA node0 CPU(s): 0-15,128-143
NUMA node1 CPU(s): 16-31,144-159
NUMA node2 CPU(s): 32-47,160-175
NUMA node3 CPU(s): 48-63,176-191
NUMA node4 CPU(s): 64-79,192-207
NUMA node5 CPU(s): 80-95,208-223
NUMA node6 CPU(s): 96-111,224-239
NUMA node7 CPU(s): 112-127,240-255
```
:::
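To act on the `lscpu` output programmatically (e.g. from a job launcher), the per-node CPU lists can be parsed into sets. A small sketch, assuming the exact `NUMA nodeX CPU(s):` line format shown above:

```python
import re

# Parse "NUMA nodeX CPU(s): 0-15,128-143" lines from `lscpu` output
# into {node_id: set_of_cpu_ids}, so ranks can be pinned per node.
def parse_numa_cpus(lscpu_output):
    nodes = {}
    for line in lscpu_output.splitlines():
        m = re.match(r"NUMA node(\d+) CPU\(s\):\s*(\S+)", line.strip())
        if not m:
            continue
        cpus = set()
        for part in m.group(2).split(","):
            lo, _, hi = part.partition("-")        # "0-15" or a single id
            cpus.update(range(int(lo), int(hi or lo) + 1))
        nodes[int(m.group(1))] = cpus
    return nodes

sample = "NUMA node0 CPU(s): 0-15,128-143\nNUMA node1 CPU(s): 16-31,144-159"
print(sorted(parse_numa_cpus(sample)[1])[:3])  # [16, 17, 18]
```

Note that each node's list pairs a low range with a high range (e.g. `0-15` and `128-143`): those are the two hyperthreads of the same 16 physical cores.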
:::info
### `hwloc-ls`
The key objects in the output are `Package L#X`, `Group0 L#X`, and `NUMANode L#X`:
```
hwloc-ls
hwloc-ls
Machine (504GB total)
  Package L#0
    Group0 L#0
      NUMANode L#0 (P#0 63GB)
      HostBridge
        PCIBridge
          PCIBridge
            PCI 62:00.0 (VGA)
    Group0 L#1
      NUMANode L#1 (P#1 63GB)
      L3 L#0 (32MB) + L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#16)
        PU L#1 (P#144)
      HostBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 43:00.0 (InfiniBand)
                Net "ib0"
                OpenFabrics "mlx5_0"
            PCIBridge
              PCI 44:00.0 (3D)
            PCIBridge
              PCI 45:00.0 (SAS)
    Group0 L#2
      NUMANode L#2 (P#2 63GB)
    Group0 L#3
      NUMANode L#3 (P#3 63GB)
      HostBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 03:00.0 (3D)
            PCIBridge
              PCI 05:00.0 (SAS)
  Package L#1
    Group0 L#4
      NUMANode L#4 (P#4 63GB)
      HostBridge
        PCIBridge
          PCI e1:00.0 (Ethernet)
            Net "enp225s0f0"
          PCI e1:00.1 (Ethernet)
            Net "enp225s0f1"
    Group0 L#5
      NUMANode L#5 (P#5 63GB)
      HostBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI c4:00.0 (3D)
            PCIBridge
              PCI c5:00.0 (SAS)
    Group0 L#6
      NUMANode L#6 (P#6 63GB)
    Group0 L#7
      NUMANode L#7 (P#7 63GB)
      HostBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 83:00.0 (InfiniBand)
                Net "ib1"
                OpenFabrics "mlx5_1"
            PCIBridge
              PCI 84:00.0 (3D)
            PCIBridge
              PCI 85:00.0 (SAS)
```
Package: one CPU socket. This machine has two sockets, Package L#0 and Package L#1.
Group: a logical group that hwloc builds automatically, representing one region within a socket (its NUMA node plus the associated PCIe devices, caches, and cores).
3D and VGA (Video Graphics Array) devices are both GPUs, but only the VGA one can drive a display.
SAS (Serial Attached SCSI) controllers handle disk I/O.
:::
:::info
### `nvidia-smi topo -m`
This is the output when requesting 1 GPU. It shows that the Slurm resource request shapes what `nvidia-smi` can see: Slurm restricts which devices are visible to the job (typically via cgroup device controls).
```
GPU0 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PIX SYS 16,144 1 N/A
NIC0 PIX X SYS
NIC1 SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
```
This is the output when requesting 4 GPUs:
```
GPU0 GPU1 GPU2 GPU3 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV4 NV4 NV4 SYS SYS 48-63,176-191 3 N/A
GPU1 NV4 X NV4 NV4 PIX SYS 16-31,144-159 1 N/A
GPU2 NV4 NV4 X NV4 SYS PIX 112-127,240-255 7 N/A
GPU3 NV4 NV4 NV4 X SYS SYS 80-95,208-223 5 N/A
NIC0 SYS PIX SYS SYS X SYS
NIC1 SYS SYS PIX SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
```
:::
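The link-type legend defines an ordering from closest (`PIX`) to farthest (`SYS`), which can be used to pick the nearest NIC for each GPU, e.g. when assigning an HCA per rank for NCCL/MPI traffic. A sketch, with the 4-GPU matrix above hard-coded as sample data:

```python
# Rank nvidia-smi link types from closest to farthest; lower is better.
LINK_COST = {"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}

def best_nic(nic_links):
    """nic_links: {"NIC0": "PIX", "NIC1": "SYS"} -> name of the closest NIC."""
    return min(nic_links, key=lambda nic: LINK_COST[nic_links[nic]])

# GPU-to-NIC links copied from the 4-GPU matrix above.
topo = {
    "GPU0": {"NIC0": "SYS", "NIC1": "SYS"},
    "GPU1": {"NIC0": "PIX", "NIC1": "SYS"},
    "GPU2": {"NIC0": "SYS", "NIC1": "PIX"},
    "GPU3": {"NIC0": "SYS", "NIC1": "SYS"},
}
print(best_nic(topo["GPU1"]))  # NIC0
```

GPU1 and GPU2 each sit one PCIe bridge away from a NIC (`PIX`); GPU0 and GPU3 have no nearby NIC, so their traffic must cross the socket interconnect either way.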
:::info
### `numactl --hardware`
```
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 0 size: 64307 MB
node 0 free: 62961 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
node 1 size: 64447 MB
node 1 free: 61775 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
node 2 size: 64502 MB
node 2 free: 62525 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 3 size: 64490 MB
node 3 free: 63291 MB
node 4 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
node 4 size: 64502 MB
node 4 free: 61003 MB
node 5 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
node 5 size: 64502 MB
node 5 free: 299 MB
node 6 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
node 6 size: 64502 MB
node 6 free: 26719 MB
node 7 cpus: 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 7 size: 64491 MB
node 7 free: 37930 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 12 12 12 32 32 32 32
1: 12 10 12 12 32 32 32 32
2: 12 12 10 12 32 32 32 32
3: 12 12 12 10 32 32 32 32
4: 32 32 32 32 10 12 12 12
5: 32 32 32 32 12 10 12 12
6: 32 32 32 32 12 12 10 12
7: 32 32 32 32 12 12 12 10
```
:::
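The distance matrix encodes the two-socket layout: 10 is local, 12 is a sibling node on the same socket, and 32 means crossing the socket interconnect. A sketch that recovers the socket grouping from the matrix above:

```python
# Distance matrix copied from `numactl --hardware` above.
# 10 = local, 12 = same socket, 32 = across the socket interconnect.
DIST = [
    [10, 12, 12, 12, 32, 32, 32, 32],
    [12, 10, 12, 12, 32, 32, 32, 32],
    [12, 12, 10, 12, 32, 32, 32, 32],
    [12, 12, 12, 10, 32, 32, 32, 32],
    [32, 32, 32, 32, 10, 12, 12, 12],
    [32, 32, 32, 32, 12, 10, 12, 12],
    [32, 32, 32, 32, 12, 12, 10, 12],
    [32, 32, 32, 32, 12, 12, 12, 10],
]

def near_nodes(node, threshold=12):
    """Nodes (including `node` itself) reachable at or below `threshold`."""
    return [j for j, d in enumerate(DIST[node]) if d <= threshold]

print(near_nodes(1))  # [0, 1, 2, 3] -- the first socket
```

If a process cannot fit on its local node, spilling to a distance-12 sibling is far cheaper than crossing to a distance-32 node on the other socket.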
:::info
### `ibstat`
```
ibstat
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.41.1000
Hardware version: 0
Node GUID: 0x0800380300bbb2e0
System image GUID: 0x0800380300bbb2e0
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 3521
LMC: 0
SM lid: 125
Capability mask: 0xa651e848
Port GUID: 0x0800380300bbb2e0
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4123
Number of ports: 1
Firmware version: 20.41.1000
Hardware version: 0
Node GUID: 0x0800380300bbb270
System image GUID: 0x0800380300bbb270
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 3444
LMC: 0
SM lid: 125
Capability mask: 0xa651e848
Port GUID: 0x0800380300bbb270
Link layer: InfiniBand
```
:::
Other commands worth running: `dmidecode --type 17` (installed memory modules: size, speed, slot) and `ethtool ens3` (NIC link speed and settings; substitute your interface name).
---
### What have we learned, and how should resources be allocated?
8 NUMA nodes, each with 32 logical CPUs (16 physical cores × 2 hyperthreads) and ~63 GB of memory, plus the inter-node distances.
4 GPUs and 2 InfiniBand NICs.
Also note from `numactl --hardware` that node 5 has almost no free memory (299 MB), so binding memory there would fail or spill to other nodes.
For GPUs, the two best locations are NUMA node 1 and NUMA node 7, each of which also hosts an IB NIC.
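Putting the pieces together, each rank can be pinned to the NUMA node affine to its GPU. A sketch that builds such a launch command; the GPU-to-node map comes from the `nvidia-smi topo -m` output above, while `./my_app` is a placeholder for the actual workload:

```python
# GPU -> affine NUMA node, taken from the 4-GPU `nvidia-smi topo -m` output.
GPU_NUMA = {0: 3, 1: 1, 2: 7, 3: 5}

def launch_cmd(rank):
    """Per-rank command: expose one GPU, bind CPU and memory to its node."""
    node = GPU_NUMA[rank]
    return [
        "env", f"CUDA_VISIBLE_DEVICES={rank}",
        "numactl", f"--cpunodebind={node}", f"--membind={node}",
        "./my_app",
    ]

print(" ".join(launch_cmd(1)))
```

Under Slurm, `--gpu-bind` and `--cpu-bind` options can achieve the same effect without a wrapper script.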

### NUMA matters
A wrong NUMA configuration, e.g. a process running on one socket while its memory or GPU sits on the other, can increase run time substantially.
### Requires sudo
```
sudo dmidecode
```
Queries the motherboard model and other DMI/SMBIOS information.