### Overall

NUMA describes the hardware topology; users cannot change it. NUMA logically groups hardware — for example, a set of CPUs, memory, NICs, and GPUs placed in one group are closest to one another and communicate fastest. A CPU is usually paired with one memory region, and that region is the closest (fastest) memory for that CPU.

---

Feeding the outputs below to an LLM with a prompt like `Based on this output, what should I pay attention to? Which settings could speed things up?` is a quick way to surface points worth noting.

:::info
### `lscpu`

```
On-line CPU(s) list:   0-255
Thread(s) per core:    2
Core(s) per socket:    64
Socket(s):             2
NUMA node(s):          8
NUMA node0 CPU(s):     0-15,128-143
NUMA node1 CPU(s):     16-31,144-159
NUMA node2 CPU(s):     32-47,160-175
NUMA node3 CPU(s):     48-63,176-191
NUMA node4 CPU(s):     64-79,192-207
NUMA node5 CPU(s):     80-95,208-223
NUMA node6 CPU(s):     96-111,224-239
NUMA node7 CPU(s):     112-127,240-255
```
:::

:::info
### `hwloc-ls`

The key objects in the output are `Package L#X`, `Group0 L#X`, and `NUMANode L#X` (indentation below is reconstructed).

```
hwloc-ls
Machine (504GB total)
  Package L#0
    Group0 L#0
      NUMANode L#0 (P#0 63GB)
      HostBridge
        PCIBridge
          PCIBridge
            PCI 62:00.0 (VGA)
    Group0 L#1
      NUMANode L#1 (P#1 63GB)
      L3 L#0 (32MB) + L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#16)
        PU L#1 (P#144)
      HostBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 43:00.0 (InfiniBand)
                Net "ib0"
                OpenFabrics "mlx5_0"
            PCIBridge
              PCI 44:00.0 (3D)
          PCIBridge
            PCI 45:00.0 (SAS)
    Group0 L#2
      NUMANode L#2 (P#2 63GB)
    Group0 L#3
      NUMANode L#3 (P#3 63GB)
      HostBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 03:00.0 (3D)
            PCIBridge
              PCI 05:00.0 (SAS)
  Package L#1
    Group0 L#4
      NUMANode L#4 (P#4 63GB)
      HostBridge
        PCIBridge
          PCI e1:00.0 (Ethernet)
            Net "enp225s0f0"
          PCI e1:00.1 (Ethernet)
            Net "enp225s0f1"
    Group0 L#5
      NUMANode L#5 (P#5 63GB)
      HostBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI c4:00.0 (3D)
            PCIBridge
              PCI c5:00.0 (SAS)
    Group0 L#6
      NUMANode L#6 (P#6 63GB)
    Group0 L#7
      NUMANode L#7 (P#7 63GB)
      HostBridge
        PCIBridge
          PCIBridge
            PCIBridge
              PCI 83:00.0 (InfiniBand)
                Net "ib1"
                OpenFabrics "mlx5_1"
            PCIBridge
              PCI 84:00.0 (3D)
          PCIBridge
            PCI 85:00.0 (SAS)
```

- `Package`: a CPU socket. There are two sockets here, `Package L#0` and `Package L#1`.
- `Group`: a logical group that hwloc builds automatically — a region under one socket (NUMA node + PCIe + caches + cores), grouped for readability.
- `3D` and `VGA` (Video Graphics Array) devices are both GPUs, but only the VGA one can drive a display.
- `SAS` is Serial Attached SCSI, i.e., the disk controller.
:::

:::info
### `nvidia-smi topo -m`

Output when requesting 1 GPU. The Slurm resource request evidently limits what `nvidia-smi` can see (Slurm typically enforces this with cgroup device isolation).

```
        GPU0    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     SYS     16,144          1               N/A
NIC0    PIX      X      SYS
NIC1    SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
```

Output when requesting 4 GPUs (legend omitted; identical to the one above):

```
        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    CPU Affinity     NUMA Affinity   GPU NUMA ID
GPU0     X      NV4     NV4     NV4     SYS     SYS     48-63,176-191    3               N/A
GPU1    NV4      X      NV4     NV4     PIX     SYS     16-31,144-159    1               N/A
GPU2    NV4     NV4      X      NV4     SYS     PIX     112-127,240-255  7               N/A
GPU3    NV4     NV4     NV4      X      SYS     SYS     80-95,208-223    5               N/A
NIC0    SYS     PIX     SYS     SYS      X      SYS
NIC1    SYS     SYS     PIX     SYS     SYS      X
```
:::

:::info
### `numactl --hardware`

```
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 0 size: 64307 MB
node 0 free: 62961 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
node 1 size: 64447 MB
node 1 free: 61775 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
node 2 size: 64502 MB
node 2 free: 62525 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 3 size: 64490 MB
node 3 free: 63291 MB
node 4 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
node 4 size: 64502 MB
node 4 free: 61003 MB
node 5 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
node 5 size: 64502 MB
node 5 free: 299 MB
node 6 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
node 6 size: 64502 MB
node 6 free: 26719 MB
node 7 cpus: 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 7 size: 64491 MB
node 7 free: 37930 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  12  12  12  32  32  32  32
  1:  12  10  12  12  32  32  32  32
  2:  12  12  10  12  32  32  32  32
  3:  12  12  12  10  32  32  32  32
  4:  32  32  32  32  10  12  12  12
  5:  32  32  32  32  12  10  12  12
  6:  32  32  32  32  12  12  10  12
  7:  32  32  32  32  12  12  12  10
```
:::

:::info
### `ibstat`

```
ibstat
CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.41.1000
        Hardware version: 0
        Node GUID: 0x0800380300bbb2e0
        System image GUID: 0x0800380300bbb2e0
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 3521
                LMC: 0
                SM lid: 125
                Capability mask: 0xa651e848
                Port GUID: 0x0800380300bbb2e0
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.41.1000
        Hardware version: 0
        Node GUID: 0x0800380300bbb270
        System image GUID: 0x0800380300bbb270
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 3444
                LMC: 0
                SM lid: 125
                Capability mask: 0xa651e848
                Port GUID: 0x0800380300bbb270
                Link layer: InfiniBand
```
:::

Other commands worth running:

- `dmidecode --type 17` — memory module (DIMM) details
- `ethtool ens3` — Ethernet link speed and settings

---

### What did we learn, and how should we allocate?

- 8 NUMA nodes, each with 16 physical cores (32 logical CPUs) and ~63 GB of memory, plus the inter-node distance table.
- 4 GPUs and 2 IB NICs.
- For GPUs, the two best-placed ones sit on NUMA 1 and NUMA 7, each next to an IB NIC (`PIX` in the topo matrix).

![image](https://hackmd.io/_uploads/S1CBLnexle.png)

### NUMA matters

A wrong NUMA binding can increase runtime dramatically: in the distance table above, remote (cross-socket) access has distance 32 versus 10 for local memory.

### Requires sudo

```
sudo dmidecode
```

Shows the motherboard model and other hardware details.
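The `node distances` table from `numactl --hardware` can be turned into a memory-spill preference order directly. A minimal sketch in Python — the matrix values are hardcoded from the output above, and `nearest_peers` is an illustrative helper, not a real API:

```python
# Distance matrix from the `numactl --hardware` output above (hardcoded).
DISTANCES = [
    [10, 12, 12, 12, 32, 32, 32, 32],
    [12, 10, 12, 12, 32, 32, 32, 32],
    [12, 12, 10, 12, 32, 32, 32, 32],
    [12, 12, 12, 10, 32, 32, 32, 32],
    [32, 32, 32, 32, 10, 12, 12, 12],
    [32, 32, 32, 32, 12, 10, 12, 12],
    [32, 32, 32, 32, 12, 12, 10, 12],
    [32, 32, 32, 32, 12, 12, 12, 10],
]

def nearest_peers(node: int) -> list[int]:
    """Return the other NUMA nodes sorted by distance from `node`, nearest first."""
    row = DISTANCES[node]
    return sorted((n for n in range(len(row)) if n != node), key=lambda n: row[n])

# Node 1 (where the 1-GPU job landed) should spill to same-socket nodes
# 0, 2, 3 (distance 12) before crossing the socket to 4-7 (distance 32).
print(nearest_peers(1))  # → [0, 2, 3, 4, 5, 6, 7]
```

This is only a reading aid for the table; real allocation policy would come from `numactl` or libnuma rather than a hand-rolled matrix.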
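The CPU lists printed by `lscpu` and `nvidia-smi topo -m` (e.g. `16-31,144-159`) are easy to parse, which helps when pinning a worker to its GPU's NUMA node. A sketch — `parse_cpu_list` is a hypothetical helper, and the actual pinning call is left commented out because it only succeeds when those CPU IDs exist on the running machine:

```python
import os  # needed for the (commented) sched_setaffinity call below

def parse_cpu_list(spec: str) -> set[int]:
    """Parse an lscpu/nvidia-smi style CPU list such as '16-31,144-159'."""
    cpus: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

# NUMA node1's CPUs from the `lscpu` output above: 16 cores x 2 threads.
node1 = parse_cpu_list("16-31,144-159")
assert len(node1) == 32

# Pin the current process to node1's CPUs (Linux only). This handles the
# CPU-affinity half of NUMA binding; memory policy still needs numactl/libnuma.
# os.sched_setaffinity(0, node1)
```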
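Putting it together, a per-GPU launcher can bind each worker's CPUs and memory to the GPU's local NUMA node, using the "NUMA Affinity" column from the 4-GPU `nvidia-smi topo -m` output. A sketch under those assumptions — `launch_cmd` and `./train` are illustrative names, not part of any tool:

```python
# GPU -> NUMA node mapping, hardcoded from the 4-GPU topo output above.
GPU_NUMA = {0: 3, 1: 1, 2: 7, 3: 5}

def launch_cmd(gpu: int, prog: str) -> str:
    """Build a command that runs `prog` on `gpu` with CPUs and memory
    bound to that GPU's local NUMA node via numactl."""
    node = GPU_NUMA[gpu]
    return (f"CUDA_VISIBLE_DEVICES={gpu} "
            f"numactl --cpunodebind={node} --membind={node} {prog}")

print(launch_cmd(1, "./train"))
# → CUDA_VISIBLE_DEVICES=1 numactl --cpunodebind=1 --membind=1 ./train
```

GPU1 (node 1) and GPU2 (node 7) are the ones sitting `PIX` to an IB NIC, so communication-heavy ranks benefit most from this binding.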