# Enabling Topology Manager CPU Pinning on RKE2

Topology Manager is a kubelet component. It defines an interface, Hint Providers, through which other kubelet components send and receive topology information. Topology Manager then combines those hints with the `topology-manager-policy` configured on the kubelet to work out the best placement. Put simply, Topology Manager defines a policy that tells the kubelet whether it should try to place containers on a single NUMA node.

1. `none`
   The default `topology-manager-policy`. Setting it to `none` behaves the same as not configuring anything.
2. `best-effort`
   The kubelet tries to keep each container's resources on a single NUMA node, but if that is not possible it still admits the pod and lets the container span NUMA nodes.
3. `restricted`
   Containers must be placed on a single NUMA node unless the requested resources exceed what one NUMA node can provide. In that case the allocation spans NUMA nodes and the pod is still allowed to start. If the request would fit on one NUMA node but cannot be satisfied that way, the pod is rejected. For example, assume a machine with two NUMA nodes and 6 CPUs per node:
   - requests and limits CPU → 6 cores: the container is confined to one NUMA node.
   - requests and limits CPU → 7 to 12 cores: the container is confined to the two NUMA nodes.
4. `single-numa-node`
   The strictest setting. All containers in the pod must fit on the same NUMA node. If that cannot be achieved, the kubelet rejects the pod and it does not start.

## Enabling NUMA on PVE

* Note that the VM needs 2 sockets.

![image](https://hackmd.io/_uploads/By2XtAYbee.png)

* If you see `NUMA node(s): 2`, the machine has two NUMA nodes.

```
$ lscpu | grep "NUMA node(s)"
NUMA node(s):                         2
```

* Check which CPUs and how much memory belong to each NUMA node.

```
$ numactl --hardware
available: 2 nodes (0-1)               # the machine has 2 NUMA nodes, numbered 0 and 1
node 0 cpus: 0 1 2 3 4 5 6 7           # node 0 owns CPUs 0 to 7
node 0 size: 5874 MB
node 0 free: 4855 MB
node 1 cpus: 8 9 10 11 12 13 14 15     # node 1 owns CPUs 8 to 15
node 1 size: 5735 MB
node 1 free: 4461 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
```

## Hands-on: creating an RKE2 cluster through Rancher

Set the following kubelet arguments:

* Feature gates: enable `TopologyManagerPolicyOptions` and `CPUManagerPolicyOptions`.
* `topology-manager-policy=single-numa-node` : requires the resources a pod is assigned to come from a single NUMA node. With this configuration that covers the pinned CPUs. Memory is only aligned if the Memory Manager is also set to its static policy, which this setup does not do.
* `cpu-manager-policy=static` : containers in Guaranteed pods that request a whole number of CPUs are pinned to exclusive CPU cores. For Topology Manager to align containers to NUMA nodes correctly, the pod must be in the Guaranteed QoS class.
* `kube-reserved=cpu=1,memory=2048Mi` : reserves 1 CPU core and 2048Mi of memory for Kubernetes system daemons such as the kubelet and the container runtime.
* `system-reserved=cpu=1,memory=2048Mi` : reserves 1 CPU core and 2048Mi of memory for the operating system itself. The static CPU Manager policy requires a non-zero CPU reservation, which these two settings provide.
* `topology-manager-scope=pod` : treats the pod as a whole and assigns the entire pod (all of its containers) to a single NUMA node.

```
feature-gates=TopologyManagerPolicyOptions=true,CPUManagerPolicyOptions=true
topology-manager-policy=single-numa-node
cpu-manager-policy=static
kube-reserved=cpu=1,memory=2048Mi
system-reserved=cpu=1,memory=2048Mi
topology-manager-scope=pod
```

> The `TopologyManagerPolicyOptions` feature gate is already enabled by default in Kubernetes 1.32, so `feature-gates=TopologyManagerPolicyOptions=true` is redundant there.
> The `CPUManagerPolicyOptions` feature gate is already enabled by default in 1.33.

![image](https://hackmd.io/_uploads/HkmCocGMgx.png)

* Check the state of the RKE2 cluster.

```
$ kubectl get no
NAME      STATUS   ROLES                              AGE    VERSION
rke2-m1   Ready    control-plane,etcd,master,worker   179m   v1.32.4+rke2r1
rke2-w1   Ready    worker                             172m   v1.32.4+rke2r1
rke2-w2   Ready    worker                             102m   v1.32.4+rke2r1
```

* Check that the kubelet arguments were applied.

```
$ kubectl get --raw "/api/v1/nodes/rke2-m1/proxy/configz" | jq .
{
  "kubeletconfig": {
    "cpuManagerPolicy": "static",
    "topologyManagerPolicy": "single-numa-node",
    "featureGates": {
      "CPUManagerPolicyOptions": true,
      "TopologyManagerPolicyOptions": true
    },
    "systemReserved": {
      "cpu": "1",
      "memory": "2048Mi"
    },
    "kubeReserved": {
      "cpu": "1",
      "memory": "2048Mi"
    },
    ......
```

## Quality of Service

* In Kubernetes, a pod's Quality of Service (QoS) class is one of Guaranteed, Burstable, or BestEffort. The class is derived automatically from the CPU and memory resources set on the pod's containers, and it affects which pods are evicted first when the node runs short of resources.
  - Guaranteed : every container sets requests equal to limits for both CPU and memory. Highest priority, least likely to be evicted under node pressure.
  - Burstable : at least one container sets a request or a limit, but the pod does not meet the Guaranteed criteria. Medium priority. It may be evicted under node pressure, but only after BestEffort pods.
  - BestEffort : no container sets any requests or limits. Lowest priority, evicted first.

## Verifying that containers run on a single NUMA node

### Cgroup v1 environment

* Check the cgroup version. `tmpfs` means the node uses cgroup v1.

```
$ stat -fc %T /sys/fs/cgroup/
tmpfs
```

* Create a pod in the Guaranteed class that requests exactly 1 CPU core.

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - image: nginx
        imagePullPolicy: IfNotPresent
        name: container-0
        resources:
          limits:
            cpu: 1
            memory: 512Mi
          requests:
            cpu: 1
            memory: 512Mi
      nodeName: rke2-w2 # change this to a node name in your environment
```

* Find the pod's UID and confirm that its QoS class is Guaranteed.

```
$ kubectl get pod
NAME                    READY   STATUS    RESTARTS   AGE
test-84b5cbb8db-r4frd   1/1     Running   0          3m

$ kubectl describe pod test-84b5cbb8db-r4frd | grep QoS
QoS Class:                   Guaranteed

$ kubectl get pod test-84b5cbb8db-r4frd -o jsonpath="{.metadata.uid}"
d23f2473-c831-4c2d-a948-96ca09abf068
```

* `kubepods-podd23f2473_c831_4c2d_a948_96ca09abf068.slice` is the pod that was just created. You can confirm this by matching it against the pod UID.

```
# on cgroup v1 the cpuset hierarchy is in this directory
$ cd /sys/fs/cgroup/cpuset/kubepods.slice

$ ls -l
total 0
-rw-r--r-- 1 root root 0 May 20 16:57 cgroup.clone_children
-rw-r--r-- 1 root root 0 May 20 16:57 cgroup.procs
-rw-r--r-- 1 root root 0 May 20 16:57 cpuset.cpu_exclusive
-rw-r--r-- 1 root root 0 May 20 16:45 cpuset.cpus
-r--r--r-- 1 root root 0 May 20 16:57 cpuset.effective_cpus
-r--r--r-- 1 root root 0 May 20 16:57 cpuset.effective_mems
-rw-r--r-- 1 root root 0 May 20 16:57 cpuset.mem_exclusive
-rw-r--r-- 1 root root 0 May 20 16:57 cpuset.mem_hardwall
-rw-r--r-- 1 root root 0 May 20 16:57 cpuset.memory_migrate
-r--r--r-- 1 root root 0 May 20 16:57 cpuset.memory_pressure
-rw-r--r-- 1 root root 0 May 20 16:57 cpuset.memory_spread_page
-rw-r--r-- 1 root root 0 May 20 16:57 cpuset.memory_spread_slab
-rw-r--r-- 1 root root 0 May 20 16:45 cpuset.mems
-rw-r--r-- 1 root root 0 May 20 16:57 cpuset.sched_load_balance
-rw-r--r-- 1 root root 0 May 20 16:57 cpuset.sched_relax_domain_level
drwxr-xr-x 3 root root 0 May 20 16:45 kubepods-besteffort.slice
drwxr-xr-x 4 root root 0 May 20 16:45 kubepods-burstable.slice
drwxr-xr-x 4 root root 0 May 20 16:47 kubepods-podd23f2473_c831_4c2d_a948_96ca09abf068.slice
-rw-r--r-- 1 root root 0 May 20 16:57 notify_on_release
-rw-r--r-- 1 root root 0 May 20 16:57 tasks
```

* Use crictl to find the container's `cgroupsPath`, then read `cpuset.cpus` to see which core it runs on.

```
$ crictl ps -a | grep container-0
e792f799393ef       a830707172e80       20 seconds ago      Running             container-0                       0                   1bb236e4bcd45       test-7f8fbb578f-9fld4                                   default

$ crictl inspect e792f799393ef | grep cgroupsPath
      "cgroupsPath": "kubepods-podd23f2473_c831_4c2d_a948_96ca09abf068.slice:cri-containerd:168972ba3c4a808cfcf4c028b2690940a158e728347b6da47897b99928dbb2d1",

# enter the directory that cgroupsPath points to
$ cd kubepods-podd23f2473_c831_4c2d_a948_96ca09abf068.slice/cri-containerd-168972ba3c4a808cfcf4c028b2690940a158e728347b6da47897b99928dbb2d1.scope/

$ ls -l
total 0
-rw-r--r-- 1 root root 0 May 20 18:21 cgroup.clone_children
-rw-r--r-- 1 root root 0 May 20 18:16 cgroup.procs
-rw-r--r-- 1 root root 0 May 20 18:21 cpuset.cpu_exclusive
-rw-r--r-- 1 root root 0 May 20 18:16 cpuset.cpus
-r--r--r-- 1 root root 0 May 20 18:21 cpuset.effective_cpus
-r--r--r-- 1 root root 0 May 20 18:21 cpuset.effective_mems
-rw-r--r-- 1 root root 0 May 20 18:21 cpuset.mem_exclusive
-rw-r--r-- 1 root root 0 May 20 18:21 cpuset.mem_hardwall
-rw-r--r-- 1 root root 0 May 20 18:21 cpuset.memory_migrate
-r--r--r-- 1 root root 0 May 20 18:21 cpuset.memory_pressure
-rw-r--r-- 1 root root 0 May 20 18:21 cpuset.memory_spread_page
-rw-r--r-- 1 root root 0 May 20 18:21 cpuset.memory_spread_slab
-rw-r--r-- 1 root root 0 May 20 18:16 cpuset.mems
-rw-r--r-- 1 root root 0 May 20 18:21 cpuset.sched_load_balance
-rw-r--r-- 1 root root 0 May 20 18:21 cpuset.sched_relax_domain_level
-rw-r--r-- 1 root root 0 May 20 18:21 notify_on_release
-rw-r--r-- 1 root root 0 May 20 18:21 tasks
```

* `cpuset.cpus` contains `2`, so the container is restricted to CPU core 2. CPU 2 belongs to NUMA node 0, so the container is pinned to NUMA node 0.
* Without the Topology Manager CPU pinning setup, `cpuset.cpus` would show a range of CPUs, and the container could end up running across NUMA nodes.

```
$ cat cpuset.cpus
2

$ numactl --hardware
available: 2 nodes (0-1)               # the machine has 2 NUMA nodes, numbered 0 and 1
node 0 cpus: 0 1 2 3 4 5 6 7           # node 0 owns CPUs 0 to 7
node 0 size: 5874 MB
node 0 free: 4855 MB
node 1 cpus: 8 9 10 11 12 13 14 15     # node 1 owns CPUs 8 to 15
node 1 size: 5735 MB
node 1 free: 4461 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
```

### Cgroup v2 environment

* Check the cgroup version. `cgroup2fs` means the node uses cgroup v2.

```
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
```

* Create a pod in the Guaranteed class that requests exactly 1 CPU core.

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - image: nginx
        imagePullPolicy: IfNotPresent
        name: container-0
        resources:
          limits:
            cpu: 1
            memory: 512Mi
          requests:
            cpu: 1
            memory: 512Mi
      nodeName: rke2-m1 # change this to a node name in your environment
```

* Find the pod's UID and confirm that its QoS class is Guaranteed.

```
$ kubectl get pod
NAME                    READY   STATUS    RESTARTS   AGE
test-84b5cbb8db-v5jlj   1/1     Running   0          13s

$ kubectl describe pod test-84b5cbb8db-v5jlj | grep QoS
QoS Class:                   Guaranteed

$ kubectl get pod test-84b5cbb8db-v5jlj -o jsonpath="{.metadata.uid}"
5b35383e-e6ab-4e25-bdb5-f44c8beef4c7
```

* `kubepods-pod5b35383e_e6ab_4e25_bdb5_f44c8beef4c7.slice` is the pod that was just created. You can confirm this by matching it against the pod UID.

```
# on cgroup v2 the hierarchy is in this directory
$ cd /sys/fs/cgroup/kubepods.slice/

$ ls -l
total 0
......
drwxr-xr-x 12 root root 0 May 20 15:35 kubepods-besteffort.slice
drwxr-xr-x 11 root root 0 May 20 15:32 kubepods-burstable.slice
drwxr-xr-x  4 root root 0 May 20 18:15 kubepods-pod51251f5d_8fc7_424e_9756_90d49dcc2c35.slice
drwxr-xr-x  4 root root 0 May 20 18:45 kubepods-pod5b35383e_e6ab_4e25_bdb5_f44c8beef4c7.slice
```

* Use crictl to find the container's `cgroupsPath`, then read `cpuset.cpus` to see which core it runs on.

```
$ crictl ps -a | grep container-0
a85108cfc1c7c       a830707172e80       2 hours ago         Running             container-0                0                   23feaee82f382       test-84b5cbb8db-v5jlj                                   default

$ crictl inspect a85108cfc1c7c | grep cgroupsPath
      "cgroupsPath": "kubepods-pod5b35383e_e6ab_4e25_bdb5_f44c8beef4c7.slice:cri-containerd:a85108cfc1c7cdf4a082cb8fffd6427f5e56b6812e4f216fd64d2e86097ee1fe",

# enter the directory that cgroupsPath points to
$ cd kubepods-pod5b35383e_e6ab_4e25_bdb5_f44c8beef4c7.slice/cri-containerd-a85108cfc1c7cdf4a082cb8fffd6427f5e56b6812e4f216fd64d2e86097ee1fe.scope/
```

* `cpuset.cpus` contains `2`, so the container is restricted to CPU core 2. CPU 2 belongs to NUMA node 0, so the container is pinned to NUMA node 0.

```
$ cat cpuset.cpus
2

$ numactl --hardware
available: 2 nodes (0-1)               # the machine has 2 NUMA nodes, numbered 0 and 1
node 0 cpus: 0 1 2 3 4 5 6 7           # node 0 owns CPUs 0 to 7
node 0 size: 5874 MB
node 0 free: 4855 MB
node 1 cpus: 8 9 10 11 12 13 14 15     # node 1 owns CPUs 8 to 15
node 1 size: 5735 MB
node 1 free: 4461 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
```

## References

https://medium.com/gemini-open-cloud/kubernetes%E6%90%AD%E9%85%8Dnuma%E5%B8%B6%E4%BD%A0%E9%A3%9B-e71193bba996#b13c

https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/

https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/#topology-manager-policy-options
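
The manual check used in both cgroup sections (read `cpuset.cpus`, then look the CPU up in the `numactl --hardware` output) could also be scripted. The following is only a minimal sketch and not part of the original procedure. The function names `expand_cpulist` and `numa_nodes_for_cpuset` are made up for illustration, and it assumes the kernel exposes each NUMA node's CPUs in `/sys/devices/system/node/node*/cpulist`.

```shell
#!/usr/bin/env bash
# Sketch: report which NUMA node(s) a cpuset such as "2" or "0-3,8" touches.
# Both function names are hypothetical helpers, not an existing tool.

# Expand a kernel CPU list like "0-3,8" into "0 1 2 3 8".
expand_cpulist() {
  local part lo hi cpu
  local -a parts out=()
  IFS=',' read -r -a parts <<< "$1"
  for part in "${parts[@]}"; do
    lo=${part%-*}   # text before the dash, or the whole value if there is none
    hi=${part#*-}   # text after the dash, or the whole value if there is none
    for ((cpu = lo; cpu <= hi; cpu++)); do
      out+=("$cpu")
    done
  done
  echo "${out[*]}"
}

# Print the IDs of all NUMA nodes that own at least one CPU of the given cpuset.
# $1: cpuset string, $2: sysfs node directory (assumed default shown below).
numa_nodes_for_cpuset() {
  local cpuset=$1 node_root=${2:-/sys/devices/system/node}
  local node_dir node_id node_cpus cpu
  local -a hits=()
  for node_dir in "$node_root"/node[0-9]*; do
    [ -r "$node_dir/cpulist" ] || continue
    node_id=${node_dir##*/node}
    node_cpus=" $(expand_cpulist "$(cat "$node_dir/cpulist")") "
    for cpu in $(expand_cpulist "$cpuset"); do
      if [[ $node_cpus == *" $cpu "* ]]; then
        hits+=("$node_id")
        break
      fi
    done
  done
  echo "${hits[*]}"
}

# If a container cgroup directory is passed as the first argument, check it.
if [ "$#" -ge 1 ] && [ -r "$1/cpuset.cpus" ]; then
  numa_nodes_for_cpuset "$(cat "$1/cpuset.cpus")"
fi
```

Saved under a hypothetical name such as `numa-check.sh` and run with the container's `.scope` directory as its argument, it should print a single node ID when the container is pinned to one NUMA node, and several IDs when the cpuset spans nodes.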