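# Kubernetes CPU throttling

## Background

How cgroups limit a container's CPU usage in a Kubernetes environment, and what triggers CPU throttling.

## How does K8s apply CPU resource limits?

When the Kubelet starts a container in a Pod, it passes the CPU and memory requests and limits configured for that container in Kubernetes to the underlying container runtime (such as Containerd or CRI-O). On Linux, the container runtime typically configures cgroups in the kernel to apply and enforce the defined resource limits.

The Kubelet, through the container runtime (for example containerd, which writes the cgroup settings), relies on the Linux kernel CFS (Completely Fair Scheduler) to limit how much CPU time a workload can use. The key points are:

* CPU usage is accounted in periods of 100ms.
* The CPU limit determines the maximum CPU time the container may use within each accounting period (100ms).
* If the container's CPU time reaches that cap within the current period, CPU throttling starts and the container can only resume in the next period.
* The conversion is:
    - a limit of 1 CPU = 100ms of CPU time per 100ms
    - a limit of 0.2 CPU = 20ms of CPU time per 100ms
    - a limit of 2.5 CPUs = 250ms of CPU time per 100ms
* If the application in the container is multi-threaded, then on a multi-core node the CPU time of all threads is added together, and the total is still bounded by the CFS quota.

> cgroups provide the framework for resource limiting and isolation, and CFS performs the actual CPU time allocation and scheduling within that framework. So when a workload reaches its CPU limit, what gets restricted is its CPU time: the application is delayed and takes longer to finish its work.

### CFS Bandwidth Control unit conversion

```
CPU Limit = 100m = 0.1 core = at most 10ms of CPU time per 100ms CFS period
```

* What unit is `100m`?
    - `m` stands for millicores; 1000m = 1 CPU core.

```
1000m = 1 core (one full CPU)
500m  = 0.5 core
100m  = 0.1 core
1m    = 0.001 core
```

* Why does `100m` translate to "only `10ms` per `100ms` CFS period"?
    - This is how Linux CFS (Completely Fair Scheduler) implements it. CFS uses two parameters to enforce a CPU limit:

```
cpu.cfs_period_us = 100,000 microseconds = 100ms  ← fixed period (K8s default)
cpu.cfs_quota_us  =  10,000 microseconds =  10ms  ← CPU time usable in each period
```

* How is the quota calculated?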
```
quota = period × limit
      = 100ms × 100m/1000m
      = 100ms × 0.1
      = 10ms
```

* So a CPU limit of `100m` means: within each `100ms` period, this cgroup may accumulate at most 10ms of CPU execution time. Once that is used up the cgroup is throttled until the next period resets the quota.

### A concrete example: the application needs 150m but the limit is 100m. What happens?

* Convert to CFS periods
    - limit = 100m = only 10ms usable per 100ms period.
    - demand = 150m = 15ms needed per 100ms period.

```
Period 1 (0ms ~ 100ms)
├─ 0ms          : starts running
├─ 10ms         : quota exhausted → the kernel throttles the container
├─ 10ms ~ 100ms : the container is throttled and cannot run (waits 90ms)
└─ 100ms        : a new period starts, throttling is lifted

Period 2 (100ms ~ 200ms)
├─ 100ms : resumes running (5ms of work left)
├─ 105ms : task complete ✓
└─ the remaining 5ms of quota goes unused
```

Result: work that needs only 15ms of CPU time takes 105ms of wall-clock time, 7 times as long.

![image](https://hackmd.io/_uploads/B1hIkBUcZe.png)

## Quality of Service

* In Kubernetes, a Pod's Quality of Service (QoS) class is one of three: Guaranteed, Burstable, or BestEffort. The class is derived automatically from the CPU and memory resources set on the Pod's containers, and it affects the Pod's priority under resource contention, in particular the eviction order.
    - Guaranteed: every container sets requests = limits (CPU and memory). Highest priority; least likely to be evicted when the node is under pressure.
    - Burstable: at least one container sets a request or limit, but the Pod does not meet the Guaranteed criteria (requests and limits are not all equal). Medium priority; may be evicted under node pressure, but only after BestEffort Pods.
    - BestEffort: no requests or limits are set at all. Lowest priority; the first to be evicted.

![image](https://hackmd.io/_uploads/BkHbhNGhZx.png)

## Verifying on cgroup v1

* Check the cgroup version; `tmpfs` means the node uses v1.

```
$ stat -fc %T /sys/fs/cgroup/
tmpfs
```

* Create a Pod with the Guaranteed QoS class

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - image: nginx
        imagePullPolicy: IfNotPresent
        name: container-0
        resources:
          limits:
            cpu: 100m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 512Mi
      nodeName: rke2-w2 # change to a node name in your own environment
```

* Check the Pod UID

```
$ kubectl get pod
NAME                    READY   STATUS    RESTARTS   AGE
test-57bdb449bf-h6fcz   1/1     Running   0          8m22s

$ kubectl describe pod test-57bdb449bf-h6fcz | grep QoS
QoS Class:                   Guaranteed

$ kubectl get pod test-57bdb449bf-h6fcz -o jsonpath="{.metadata.uid}"
5bac8518-d4bd-4088-b72d-46c993a9b2cd
```

* On the node that runs the Pod, go to `/sys/fs/cgroup/cpu/kubepods.slice/`. A Guaranteed Pod gets its cgroup directory directly under this path, while Burstable and BestEffort Pods live under the `kubepods-burstable.slice` and `kubepods-besteffort.slice` subdirectories.

```
$ cd /sys/fs/cgroup/cpu/kubepods.slice/

# kubepods-pod5bac8518_d4bd_4088_b72d_46c993a9b2cd.slice is the pod we just created
# the id suffix is the pod UID; it can be matched to the pod with kubectl get po -oyaml
$ ls -l
total 0
-rw-r--r--  1 root root 0 May 23 08:34 cgroup.clone_children
-rw-r--r--  1 root root 0 May 23 08:34 cgroup.procs
-rw-r--r--  1 root root 0 May 23 09:07 cpu.cfs_burst_us
-rw-r--r--  1 root root 0 May 23 08:34 cpu.cfs_period_us
-rw-r--r--  1 root root 0 May 23 08:34 cpu.cfs_quota_us
-rw-r--r--  1 root root 0 May 23 09:07 cpu.idle
-rw-r--r--  1 root root 0 May 23 08:34 cpu.shares
-r--r--r--  1 root root 0 May 23 08:34 cpu.stat
-r--r--r--  1 root root 0 May 23 08:34 cpuacct.stat
-rw-r--r--  1 root root 0 May 23 08:34 cpuacct.usage
-r--r--r--  1 root root 0 May 23 08:34 cpuacct.usage_all
-r--r--r--  1 root root 0 May 23 08:34 cpuacct.usage_percpu
-r--r--r--  1 root root 0 May 23 09:07 cpuacct.usage_percpu_sys
-r--r--r--  1 root root 0 May 23 09:07 cpuacct.usage_percpu_user
-r--r--r--  1 root root 0 May 23 09:07 cpuacct.usage_sys
-r--r--r--  1 root root 0 May 23 09:07 cpuacct.usage_user
drwxr-xr-x 11 root root 0 May 23 08:34 kubepods-besteffort.slice
drwxr-xr-x  6 root root 0 May 23 08:34 kubepods-burstable.slice
drwxr-xr-x  4 root root 0 May 23 08:49 kubepods-pod5bac8518_d4bd_4088_b72d_46c993a9b2cd.slice
-rw-r--r--  1 root root 0 May 23 09:07 notify_on_release
-rw-r--r--  1 root root 0 May 23 09:07 tasks
```

```
$ cd kubepods-pod5bac8518_d4bd_4088_b72d_46c993a9b2cd.slice
$ ls -l
total 0
-rw-r--r-- 1 root root 0 May 23 08:49 cgroup.clone_children
-rw-r--r-- 1 root root 0 May 23 08:49 cgroup.procs
-rw-r--r-- 1 root root 0 May 23 09:07 cpu.cfs_burst_us
-rw-r--r-- 1 root root 0 May 23 08:49 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 May 23 08:49 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 May 23 09:07 cpu.idle
-rw-r--r-- 1 root root 0 May 23 08:49 cpu.shares
-r--r--r-- 1 root root 0 May 23 08:49 cpu.stat
-r--r--r-- 1 root root 0 May 23 08:49 cpuacct.stat
-rw-r--r-- 1 root root 0 May 23 08:49 cpuacct.usage
-r--r--r-- 1 root root 0 May 23 08:49 cpuacct.usage_all
-r--r--r-- 1 root root 0 May 23 08:49 cpuacct.usage_percpu
-r--r--r-- 1 root root 0 May 23 09:07 cpuacct.usage_percpu_sys
-r--r--r-- 1 root root 0 May 23 09:07 cpuacct.usage_percpu_user
-r--r--r-- 1 root root 0 May 23 09:07 cpuacct.usage_sys
-r--r--r-- 1 root root 0 May 23 09:07 cpuacct.usage_user
drwxr-xr-x 2 root root 0 May 23 08:49 cri-containerd-d6fbe13e17995b57615f45fc0c8386b651ecb51aa17a60c754eacf5330963518.scope
drwxr-xr-x 2 root root 0 May 23 08:49 cri-containerd-f3a4beafc13f8c2b832d47110a5da9b6c0da6336c6a44992c1bff6d059402e99.scope
-rw-r--r-- 1 root root 0 May 23 09:07 notify_on_release
-rw-r--r-- 1 root root 0 May 23 09:07 tasks
```

* Use crictl to find the `cgroupsPath` of the container.

```
$ crictl ps -a | grep container-0
d6fbe13e17995       a830707172e80       18 minutes ago      Running             container-0       0       f3a4beafc13f8       test-57bdb449bf-h6fcz       default

$ crictl inspect d6fbe13e17995 | grep cgroupsPath
    "cgroupsPath": "kubepods-pod5bac8518_d4bd_4088_b72d_46c993a9b2cd.slice:cri-containerd:d6fbe13e17995b57615f45fc0c8386b651ecb51aa17a60c754eacf5330963518",

$ cd cri-containerd-d6fbe13e17995b57615f45fc0c8386b651ecb51aa17a60c754eacf5330963518.scope
```

* `cpu.cfs_period_us`: length of the period over which CPU time is replenished. The default is 100000 (100 milliseconds).
* `cpu.cfs_quota_us`: maximum CPU time this cgroup may use within one period. Once it is used up the cgroup is throttled.
* `nr_throttled`: number of periods in which the cgroup was throttled.
* `throttled_time`: total time spent throttled, in nanoseconds (ns).

```
$ ls -l
total 0
-rw-r--r-- 1 root root 0 May 22 13:53 cgroup.clone_children
-rw-r--r-- 1 root root 0 May 22 13:53 cgroup.procs
-rw-r--r-- 1 root root 0 May 22 13:55 cpu.cfs_burst_us
-rw-r--r-- 1 root root 0 May 22 13:53 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 May 22 13:53 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 May 22 13:55 cpu.idle
-rw-r--r-- 1 root root 0 May 22 13:53 cpu.shares
-r--r--r-- 1 root root 0 May 22 13:53 cpu.stat
-r--r--r-- 1 root root 0 May 22 13:53 cpuacct.stat
-rw-r--r-- 1 root root 0 May 22 13:53 cpuacct.usage
-r--r--r-- 1 root root 0 May 22 13:53 cpuacct.usage_all
-r--r--r-- 1 root root 0 May 22 13:53 cpuacct.usage_percpu
-r--r--r-- 1 root root 0 May 22 13:55 cpuacct.usage_percpu_sys
-r--r--r-- 1 root root 0 May 22 13:55 cpuacct.usage_percpu_user
-r--r--r-- 1 root root 0 May 22 13:55 cpuacct.usage_sys
-r--r--r-- 1 root root 0 May 22 13:55 cpuacct.usage_user
-rw-r--r-- 1 root root 0 May 22 13:55 notify_on_release
-rw-r--r-- 1 root root 0 May 22 13:55 tasks

# the default CPU period is 100 ms
$ cat cpu.cfs_period_us
100000

# 10000 corresponds to a limit of 100m CPU
$ cat cpu.cfs_quota_us
10000

# 419816941 ns = 419.816941 ms
$ cat cpu.stat
nr_periods 7
nr_throttled 3
throttled_time 419816941
```

* Run a CPU stress test inside the test pod

```
$ kubectl get pod
NAME                    READY   STATUS    RESTARTS   AGE
test-57bdb449bf-8kxvj   1/1     Running   0          3m31s

$ kubectl exec test-57bdb449bf-8kxvj -- timeout 240 yes >/dev/null &
```

* The `nr_throttled` count has gone up.

```
$ cat cpu.stat
nr_periods 255
nr_throttled 221
throttled_time 42323283399
```

* The PromQL expression below shows that the pod is being throttled. Any value above 0 means throttling occurred; a value of 1 means the container was throttled in nearly every period.

```
sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace)
/
sum(increase(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace)
```

![image](https://hackmd.io/_uploads/BkGOaH9exg.png)

## Verifying on cgroup v2

* Check the cgroup version; `cgroup2fs` means the node uses v2.

```
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
```

* Create a Pod with the Guaranteed QoS class

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - image: nginx
        imagePullPolicy: IfNotPresent
        name: container-0
        resources:
          limits:
            cpu: 100m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 512Mi
      nodeName: rke2-m1 # change to a node name in your own environment
```

* Check the Pod UID

```
$ kubectl get pod
NAME                   READY   STATUS    RESTARTS   AGE
test-f84769dfd-ngpj6   1/1     Running   0          59s

$ kubectl describe pod test-f84769dfd-ngpj6 | grep QoS
QoS Class:                   Guaranteed

$ kubectl get pod test-f84769dfd-ngpj6 -o jsonpath="{.metadata.uid}"
3bd1d53b-35e4-4ee4-a73d-24a3645354bd
```

* On the node that runs the Pod, go to `/sys/fs/cgroup/kubepods.slice/`. A Guaranteed Pod gets its cgroup directory directly under this path, while Burstable and BestEffort Pods live under the `kubepods-burstable.slice` and `kubepods-besteffort.slice` subdirectories.

```
$ cd /sys/fs/cgroup/kubepods.slice/

# kubepods-pod3bd1d53b_35e4_4ee4_a73d_24a3645354bd.slice is the pod we just created
$ ls -l /sys/fs/cgroup/kubepods.slice/
total 0
-r--r--r--  1 root root 0 May 22 11:24 cgroup.controllers
-r--r--r--  1 root root 0 May 22 11:24 cgroup.events
-rw-r--r--  1 root root 0 May 22 11:27 cgroup.freeze
--w-------  1 root root 0 May 22 14:30 cgroup.kill
-rw-r--r--  1 root root 0 May 22 14:30 cgroup.max.depth
-rw-r--r--  1 root root 0 May 22 14:30 cgroup.max.descendants
-rw-r--r--  1 root root 0 May 22 11:24 cgroup.procs
-r--r--r--  1 root root 0 May 22 14:30 cgroup.stat
-rw-r--r--  1 root root 0 May 22 11:25 cgroup.subtree_control
-rw-r--r--  1 root root 0 May 22 14:30 cgroup.threads
-rw-r--r--  1 root root 0 May 22 11:24 cgroup.type
-rw-r--r--  1 root root 0 May 22 11:24 cpu.idle
-rw-r--r--  1 root root 0 May 22 11:24 cpu.max
-rw-r--r--  1 root root 0 May 22 14:30 cpu.max.burst
-r--r--r--  1 root root 0 May 22 11:24 cpu.stat
-r--r--r--  1 root root 0 May 22 14:30 cpu.stat.local
-rw-r--r--  1 root root 0 May 22 11:24 cpu.weight
-rw-r--r--  1 root root 0 May 22 14:30 cpu.weight.nice
-rw-r--r--  1 root root 0 May 22 11:24 cpuset.cpus
-r--r--r--  1 root root 0 May 22 11:24 cpuset.cpus.effective
-rw-r--r--  1 root root 0 May 22 14:30 cpuset.cpus.partition
-rw-r--r--  1 root root 0 May 22 11:24 cpuset.mems
-r--r--r--  1 root root 0 May 22 14:30 cpuset.mems.effective
-r--r--r--  1 root root 0 May 22 14:30 hugetlb.1GB.current
-r--r--r--  1 root root 0 May 22 14:30 hugetlb.1GB.events
-r--r--r--  1 root root 0 May 22 14:30 hugetlb.1GB.events.local
-rw-r--r--  1 root root 0 May 22 11:24 hugetlb.1GB.max
-r--r--r--  1 root root 0 May 22 14:30 hugetlb.1GB.numa_stat
-r--r--r--  1 root root 0 May 22 14:30 hugetlb.1GB.rsvd.current
-rw-r--r--  1 root root 0 May 22 11:24 hugetlb.1GB.rsvd.max
-r--r--r--  1 root root 0 May 22 14:30 hugetlb.2MB.current
-r--r--r--  1 root root 0 May 22 14:30 hugetlb.2MB.events
-r--r--r--  1 root root 0 May 22 14:30 hugetlb.2MB.events.local
-rw-r--r--  1 root root 0 May 22 11:24 hugetlb.2MB.max
-r--r--r--  1 root root 0 May 22 14:30 hugetlb.2MB.numa_stat
-r--r--r--  1 root root 0 May 22 11:25 hugetlb.2MB.rsvd.current
-rw-r--r--  1 root root 0 May 22 11:24 hugetlb.2MB.rsvd.max
-rw-r--r--  1 root root 0 May 22 11:24 io.bfq.weight
-rw-r--r--  1 root root 0 May 22 14:30 io.latency
-rw-r--r--  1 root root 0 May 22 14:30 io.max
-r--r--r--  1 root root 0 May 22 11:24 io.stat
-rw-r--r--  1 root root 0 May 22 11:24 io.weight
drwxr-xr-x 16 root root 0 May 22 13:58 kubepods-besteffort.slice
drwxr-xr-x 11 root root 0 May 22 11:27 kubepods-burstable.slice
drwxr-xr-x  4 root root 0 May 23 09:10 kubepods-pod3bd1d53b_35e4_4ee4_a73d_24a3645354bd.slice
drwxr-xr-x  4 root root 0 May 23 08:34 kubepods-pod51251f5d_8fc7_424e_9756_90d49dcc2c35.slice
......
```

```
$ cd kubepods-pod3bd1d53b_35e4_4ee4_a73d_24a3645354bd.slice
$ ls -l
total 0
-r--r--r-- 1 root root 0 May 23 09:10 cgroup.controllers
-r--r--r-- 1 root root 0 May 23 09:10 cgroup.events
-rw-r--r-- 1 root root 0 May 23 09:11 cgroup.freeze
--w------- 1 root root 0 May 23 09:11 cgroup.kill
-rw-r--r-- 1 root root 0 May 23 09:11 cgroup.max.depth
-rw-r--r-- 1 root root 0 May 23 09:11 cgroup.max.descendants
-rw-r--r-- 1 root root 0 May 23 09:10 cgroup.procs
-r--r--r-- 1 root root 0 May 23 09:11 cgroup.stat
-rw-r--r-- 1 root root 0 May 23 09:10 cgroup.subtree_control
-rw-r--r-- 1 root root 0 May 23 09:11 cgroup.threads
-rw-r--r-- 1 root root 0 May 23 09:10 cgroup.type
-rw-r--r-- 1 root root 0 May 23 09:10 cpu.idle
-rw-r--r-- 1 root root 0 May 23 09:10 cpu.max
-rw-r--r-- 1 root root 0 May 23 09:11 cpu.max.burst
-r--r--r-- 1 root root 0 May 23 09:10 cpu.stat
-r--r--r-- 1 root root 0 May 23 09:11 cpu.stat.local
-rw-r--r-- 1 root root 0 May 23 09:10 cpu.weight
-rw-r--r-- 1 root root 0 May 23 09:11 cpu.weight.nice
-rw-r--r-- 1 root root 0 May 23 09:10 cpuset.cpus
-r--r--r-- 1 root root 0 May 23 09:10 cpuset.cpus.effective
-rw-r--r-- 1 root root 0 May 23 09:11 cpuset.cpus.partition
-rw-r--r-- 1 root root 0 May 23 09:10 cpuset.mems
-r--r--r-- 1 root root 0 May 23 09:11 cpuset.mems.effective
drwxr-xr-x 2 root root 0 May 23 09:10 cri-containerd-32dc3b88a5f2175809059f08bb2fd9e155fe6ef5a3266030094a4873435afc8f.scope
drwxr-xr-x 2 root root 0 May 23 09:10 cri-containerd-e266625d0bfec4cd12819866d2e929cf8b4540670342ca3da77a32e336c5f2cc.scope
......
```

* Use crictl to find the `cgroupsPath` of the container.

```
$ crictl ps -a | grep container-0
e266625d0bfec       a830707172e80       About a minute ago      Running             container-0       0       32dc3b88a5f21       test-f84769dfd-ngpj6       default

$ crictl inspect e266625d0bfec | grep cgroupsPath
    "cgroupsPath": "kubepods-pod3bd1d53b_35e4_4ee4_a73d_24a3645354bd.slice:cri-containerd:e266625d0bfec4cd12819866d2e929cf8b4540670342ca3da77a32e336c5f2cc",

$ cd cri-containerd-e266625d0bfec4cd12819866d2e929cf8b4540670342ca3da77a32e336c5f2cc.scope
```

* On cgroup v2 the CPU limit is taken from `cpu.max`, whose format is `cfs_quota_us cfs_period_us`.
* `usage_usec`: total CPU time used by this cgroup (microseconds), covering both user and kernel space.
* `user_usec`: CPU time spent in user space (microseconds), for example the application's own computation.
* `system_usec`: CPU time spent in kernel space (microseconds), for example system calls and network I/O.
* `nr_periods`: total number of periods accounted under `cpu.max`. Each period is one enforcement interval (100ms by default).
* `nr_throttled`: number of periods in which the cgroup was throttled for exceeding `cpu.max`. The higher it is, the more often the cgroup runs short of CPU.
* `throttled_usec`: total time spent throttled (microseconds).

```
# the first value, 10000, corresponds to a limit of 100m CPU
# the second value, 100000, is the default CPU period of 100 ms
$ cat cpu.max
10000 100000

$ cat cpu.stat
usage_usec 42485
user_usec 15935
system_usec 26550
core_sched.force_idle_usec 0
nr_periods 7
nr_throttled 4
throttled_usec 627048
nr_bursts 0
burst_usec 0
```

* Run a CPU stress test inside the test pod.

```
$ kubectl get po
NAME                   READY   STATUS    RESTARTS   AGE
test-746865dbb-c8tsf   1/1     Running   0          22m

$ kubectl exec test-746865dbb-c8tsf -- timeout 240 yes >/dev/null &
```

* Both `nr_throttled` and `throttled_usec` have gone up, which means this pod is currently being throttled.

```
$ cat cpu.stat
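usage_usec 1211910
user_usec 84596
system_usec 1127314
core_sched.force_idle_usec 0
nr_periods 125
nr_throttled 119
throttled_usec 16374085
nr_bursts 0
burst_usec 0
```

* The PromQL expression below shows that the pod is being throttled. Any value above 0 means throttling occurred; a value of 1 means the container was throttled in nearly every period.

```
sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace)
/
sum(increase(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace)
```

![image](https://hackmd.io/_uploads/rJERpkjele.png)

## Can K8s avoid CPU throttling?

### Yes, in the following two ways

1. Raise the CPU limit
    * Container resources in Kubernetes are split into:
        - cpu request: the minimum CPU the container is guaranteed.
        - cpu limit: the maximum CPU the container may use. If the limit is set too low while the application actually needs more, the Linux kernel's CFS (Completely Fair Scheduler) throttles the container once the quota is exceeded, which degrades performance or makes the application unstable.
2. Check whether the application's thread count is appropriate
    * If the application starts too many threads, especially in multi-threaded or asynchronous designs, this can cause:
        - Many threads competing for CPU at the same time, producing a sharp short-term spike in CPU usage.
        - Every thread gets scheduled, but under the CPU limit they cannot all make progress → CFS throttling.

#### Example: the container's CPU limit is 100m, the work is split across 4 threads running in parallel, and each thread needs 5ms of pure CPU time. What happens?

![image](https://hackmd.io/_uploads/S1sqZSf2-l.png)

1. Convert the limit to a CFS quota: within each 100ms period the container may use 10ms of CPU time.
2. The 4 threads run at the same time and draw from the same 10ms quota pool. After each has run for 2.5ms the pool is exhausted (4 × 2.5ms = 10ms), so each thread has actually executed for 2.5ms.
3. Each thread needs 5ms to finish one run. Only 2.5ms was executed in the first period and the rest is left for the second period, so the application takes `102.5ms` in total to finish.

#### For the same task, running on a single thread actually performs better, so reviewing the application's thread count matters

![image](https://hackmd.io/_uploads/Hya_7Hfh-l.png)

## Optimizing CPU efficiency

Another factor that affects container CPU performance is how efficiently the CPU is used. In Kubernetes, the CPU resources used by containers are shared. Although a CPU request reserves resources for a container, the vCPUs behind that request are not pinned to specific physical cores. As a result, the threads running in one container may be scheduled onto different CPU cores at different times.

For example, an application written in Java is typically multi-threaded (the JVM itself runs background threads such as GC and JIT compilation). Suppose it uses 4 threads to process work in parallel. Which CPUs those threads land on changes during execution, as shown below:

![image](https://hackmd.io/_uploads/BkGD7DFbll.png)

### What is NUMA (Non-Uniform Memory Access)?

Before using the Topology Manager to improve container CPU efficiency in K8s, we first need to understand NUMA. NUMA is a design that lets multiple CPUs access memory more efficiently, and it is common on multi-core, high-end servers and workstations.

![image](https://hackmd.io/_uploads/Sk9Y4wKbxg.png)

To spread CPU load, the system may run an application across different NUMA nodes. Whenever a thread has to read or write memory that belongs to another NUMA node, the access goes through the Interconnect, and every trip across the Interconnect adds to the application's total execution time.

Imagine an office with two teams (2 CPUs), each with a bookshelf next to it (RAM). Team members usually take material from the bookshelf beside them (fast), but sometimes need to walk to the other team's bookshelf (slower).

* This is the core idea of the NUMA architecture:
    - Each CPU has its own faster memory (called local memory).
    - It can also use another CPU's memory, only more slowly (called remote memory).
* Under NUMA, local memory access is considerably faster than non-local memory access. In the figure above, a core in Node1 reaches Node1's memory much faster than it reaches Node2's memory through the Interconnect. To use resources as efficiently as possible, they should be concentrated on the same NUMA node wherever possible.

If you see `NUMA node(s): 2`, the machine has two NUMA nodes.

```
$ lscpu | grep "NUMA node(s)"
NUMA node(s):                         2
```

Show the CPUs and memory that belong to each NUMA node:

```
$ numactl --hardware
available: 2 nodes (0-1)           # the machine has 2 NUMA nodes, numbered 0 and 1
node 0 cpus: 0 1 2 3 4 5 6 7       # node 0 owns CPUs 0~7
node 0 size: 5874 MB
node 0 free: 4855 MB
node 1 cpus: 8 9 10 11 12 13 14 15 # node 1 owns CPUs 8~15
node 1 size: 5735 MB
node 1 free: 4461 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
```

### Enabling Topology Manager CPU pinning on RKE2

See [this note](https://hackmd.io/URf7odnWStepSLUxWhSScQ).

## References

* https://hackmd.io/@QI-AN/S1Z2FYUyxx
* https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu#sect-cfs
* https://www.hwchiu.com/docs/2023/container-vm#how-to-avoid
* https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/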
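
## Appendix: a sketch of the quota arithmetic

The quota conversion and the two throttling walk-throughs in this note (150m of demand under a 100m limit, and 4 threads sharing a 100m limit) can be reproduced with a few lines of Python. This is only a minimal sketch under simplifying assumptions: every thread is runnable the whole time, each thread has its own core, the threads drain one shared quota pool evenly, and CFS burst is not used. The function names `millicores_to_quota_us`, `wall_time_ms`, and `throttle_ratio` are made up for illustration and are not part of Kubernetes or the kernel.

```python
"""Illustrative sketch of CFS quota arithmetic (hypothetical helper names)."""

DEFAULT_PERIOD_US = 100_000  # default cpu.cfs_period_us: 100ms


def millicores_to_quota_us(limit_m: int, period_us: int = DEFAULT_PERIOD_US) -> int:
    """quota = period x limit, with the limit expressed in millicores."""
    return period_us * limit_m // 1000


def wall_time_ms(limit_m: int, per_thread_cpu_ms: float, threads: int = 1,
                 period_ms: float = 100.0) -> float:
    """Wall-clock time until every thread has received per_thread_cpu_ms of CPU.

    Assumed model: all threads run concurrently on separate cores and share
    one quota pool evenly; no burst and no other contention.
    """
    if limit_m <= 0 or threads <= 0:
        raise ValueError("limit_m and threads must be positive")
    quota_ms = period_ms * limit_m / 1000
    # A single thread can never use more CPU time in a period than the period itself.
    budget_ms = min(quota_ms / threads, period_ms)
    remaining_ms = per_thread_cpu_ms
    full_periods = 0
    while remaining_ms > budget_ms:
        remaining_ms -= budget_ms  # pool exhausted: throttled until the next period
        full_periods += 1
    return full_periods * period_ms + remaining_ms


def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of periods that were throttled, from the content of cpu.stat."""
    stats = dict(line.split() for line in cpu_stat_text.strip().splitlines())
    return int(stats["nr_throttled"]) / int(stats["nr_periods"])


if __name__ == "__main__":
    print(millicores_to_quota_us(100))      # quota for a 100m limit, in microseconds
    print(wall_time_ms(100, 15))            # 15ms of work under a 100m limit
    print(wall_time_ms(100, 5, threads=4))  # four threads, 5ms of CPU each
    print(wall_time_ms(100, 5, threads=1))  # one thread, 5ms of CPU
    print(throttle_ratio("nr_periods 255\nnr_throttled 221\nthrottled_time 42323283399"))
```

Under these assumptions the sketch gives 10000 microseconds of quota for 100m, 105ms for the 150m example, and 102.5ms for the 4-thread example, matching the walk-throughs earlier in the note; a single thread needing 5ms finishes in 5ms because it never exhausts the 10ms pool. `throttle_ratio` computes, for one snapshot of `cpu.stat`, the same throttled-periods ratio that the PromQL expression reports over a time window.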