
Autoscaling
===

###### tags: `Kubernetes`, `k8s`, `app`, `slurm`, `SlinkyProject`, `Autoscaling`, `KEDA`, `HPA`, `VPA`, `Scale out`, `Scale in`

[TOC]

## Prerequisites
- [[Application] KEDA deployment and testing tutorial](/iU44zPyST9O5rD7OQYdQYQ)

## Intro
- [Autoscaling](https://github.com/SlinkyProject/slurm-operator/blob/main/docs/autoscaling.md)

## How to fill in `keda-slurm-compute-e900.yaml`?

- ### `keda-slurm-compute-e900.yaml`
  ```yaml
  apiVersion: keda.sh/v1alpha1
  kind: ScaledObject
  metadata:
    name: scale-e900
    namespace: slurm
  spec:
    scaleTargetRef:
      apiVersion: slinky.slurm.net/v1alpha1
      kind: NodeSet
      name: slurm-compute-e900
    idleReplicaCount: 0
    minReplicaCount: 1
    maxReplicaCount: 3
    triggers:
      - type: prometheus
        metricType: Value
        metadata:
          serverAddress: http://monitor-kube-prometheus-st-prometheus.monitor:9090
          query: slurm_partition_pending_jobs{partition="radar"}
          threshold: '5'
  ```

- ### Check that `scaleTargetRef` is correct
  > ```yaml
  > scaleTargetRef:
  >   apiVersion: slinky.slurm.net/v1alpha1
  >   kind: NodeSet
  >   name: slurm-compute-e900
  > ```

  - ### Check that the scale target exists
    Target: `-n slurm NodeSet/slurm-compute-e900`
    ```bash
    $ kubectl -n slurm get NodeSet
    # same as:
    $ kubectl -n slurm get nss
    NAME                     REPLICAS   UPDATED   READY   AGE
    slurm-compute-e900                                    20d
    slurm-compute-esc8000a                                38d
    slurm-compute-esc8000b   1          1         1       38d

    $ kubectl -n slurm get NodeSet/slurm-compute-e900
    NAME                 REPLICAS   UPDATED   READY   AGE
    slurm-compute-e900                                20d
    ```

  - ### apiVersion
    ```
    $ kubectl -n slurm get NodeSet/slurm-compute-e900 -o yaml | grep apiVersion:
    apiVersion: slinky.slurm.net/v1alpha1
    ```

- ### Expected replica counts
  > ```yaml
  > idleReplicaCount: 0
  > minReplicaCount: 1
  > maxReplicaCount: 3
  > ```

  - **`minReplicaCount` & `maxReplicaCount`**
    When there is activity on the trigger (at least one pending job), KEDA scales the NodeSet to `minReplicaCount`, then lets the HPA adjust the replica count according to load, up to `maxReplicaCount` or back down to `minReplicaCount`.
  - **`idleReplicaCount`**
    When the trigger shows no activity for a configurable amount of time, KEDA scales the NodeSet to `idleReplicaCount`.
  - Official explanation
    > When there is activity on the trigger (at least one pending job), KEDA will scale the NodeSet to `minReplicaCount` and then let HPA handle scaling up to `maxReplicaCount` or back down to `minReplicaCount`. When there is no activity on the trigger after a configurable amount of time, KEDA will scale the NodeSet to `idleReplicaCount`. See the [KEDA](https://keda.sh/docs/) documentation on [`idleReplicaCount`](https://keda.sh/docs/concepts/scaling-deployments/#idlereplicacount) for more examples.

- ### Check the `triggers` settings
  > ```yaml
  > triggers:
  >   - type: prometheus
  >     metricType: Value
  >     metadata:
  >       serverAddress: http://prometheus-kube-prometheus-prometheus.prometheus:9090
  >       query: slurm_partition_pending_jobs{partition="radar"}
  >       threshold: '5'
  > ```

  The `serverAddress` in this excerpt differs from the one in the file above (`http://monitor-kube-prometheus-st-prometheus.monitor:9090`). The steps below work out which Service name and namespace are correct for this cluster.

  - ### The KEDA scaler used here is `prometheus`

  - ### Is the Prometheus service reachable?

    - ### Is the Prometheus chart repo already added to helm?
      ```
      $ helm repo list | grep -i prometheus
      prometheus-community    https://prometheus-community.github.io/helm-charts
      ```

    - ### Which namespace is Prometheus installed in?
      ```
      $ helm list -A | grep -Ei "name|prometheus"
      NAME      NAMESPACE   REVISION   UPDATED                                   STATUS     CHART                          APP VERSION
      monitor   monitor     1          2025-07-16 07:05:53.714709202 +0000 UTC   deployed   kube-prometheus-stack-70.7.0   v0.81.0
      ```

    - ### List the Prometheus services
      ```
      $ kubectl -n monitor get svc
      NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
      alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   19d
      monitor-grafana                           ClusterIP   10.108.0.168     <none>        80/TCP                       19d
      monitor-grafana-headless                  ClusterIP   None             <none>        9094/TCP                     19d
      monitor-kube-prometheus-st-alertmanager   ClusterIP   10.108.233.120   <none>        9093/TCP,8080/TCP            19d
      monitor-kube-prometheus-st-operator       ClusterIP   10.97.240.59     <none>        443/TCP                      19d
      monitor-kube-prometheus-st-prometheus     ClusterIP   10.101.246.225   <none>        9090/TCP,8080/TCP            19d
      monitor-kube-state-metrics                ClusterIP   10.102.45.71     <none>        8080/TCP                     19d
      monitor-prometheus-node-exporter          ClusterIP   10.96.132.130    <none>        9100/TCP                     19d
      prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     19d
      ```

      - **service**
        - name: `monitor-kube-prometheus-st-prometheus`
        - port: 9090 (used for the queries below; the Service also exposes 8080)

      - **DNS formats**
        - **`<service>.<namespace>:<port>`**
          `monitor-kube-prometheus-st-prometheus.monitor:9090`
        - **`<service>.<namespace>.svc:<port>`**
          `monitor-kube-prometheus-st-prometheus.monitor.svc:9090`
        - **`<service>.<namespace>.svc.cluster.local:<port>`**
          `monitor-kube-prometheus-st-prometheus.monitor.svc.cluster.local:9090`

      - **From any pod in `-n slurm`, check `/etc/resolv.conf`**
        ```
        $ cat /etc/resolv.conf
        search slurm.svc.cluster.local svc.cluster.local cluster.local
        nameserver 10.96.0.10
        options ndots:5
        ```

        | Name used | Search suffix that matches | Resulting FQDN |
        |-----------|----------------------------|----------------|
        | `monitor-kube-prometheus-st-prometheus.monitor` | `svc.cluster.local` | `monitor-kube-prometheus-st-prometheus.monitor.svc.cluster.local` |
        | `monitor-kube-prometheus-st-prometheus.monitor.svc` | `cluster.local` | `monitor-kube-prometheus-st-prometheus.monitor.svc.cluster.local` |
        | `monitor-kube-prometheus-st-prometheus.monitor.svc.cluster.local` | (none) | `monitor-kube-prometheus-st-prometheus.monitor.svc.cluster.local` |

      - **Can the service actually be reached?**
        ```bash
        # Create a throwaway test pod
        $ kubectl -n slurm run -it --rm dnstools --image=infoblox/dnstools

        # Or enter an existing pod:
        $ kubectl -n slurm exec -it pod/dnstools -- sh

        # ---
        # Inside the pod
        $ nslookup monitor-kube-prometheus-st-prometheus.monitor

        $ curl "http://monitor-kube-prometheus-st-prometheus.monitor:9090" -i
        HTTP/1.1 302 Found
        Content-Type: text/html; charset=utf-8
        Location: /query
        Date: Tue, 05 Aug 2025 04:15:10 GMT
        Content-Length: 29

        <a href="/query">Found</a>.

        # ---
        # Once connectivity is confirmed, try some queries:

        # 1. Query Prometheus's own timestamp
        $ curl "http://monitor-kube-prometheus-st-prometheus.monitor:9090/api/v1/query?query=time()"

        # 2. Query the built-in `up` metric
        #    `up` shows whether each scrape target is up (1 = healthy, 0 = failing).
        $ curl "http://monitor-kube-prometheus-st-prometheus.monitor:9090/api/v1/query?query=up"

        # 3. Query a metric exposed by an exporter (here: node exporter)
        $ curl "http://monitor-kube-prometheus-st-prometheus.monitor:9090/api/v1/query?query=node_cpu_seconds_total"

        # 4. List all metric names
        $ curl "http://monitor-kube-prometheus-st-prometheus.monitor:9090/api/v1/label/__name__/values"
        ```

- ### Apply and verify
  ```bash
  $ kubectl -n slurm apply -f keda-slurm-compute-e900.yaml
  ```
  ```
  $ kubectl -n slurm get all
  ...
  NAME                                                      REFERENCE                    TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
  horizontalpodautoscaler.autoscaling/keda-hpa-scale-e900   NodeSet/slurm-compute-e900   <unknown>/5   1         3         0          25h
  ```
  - `TARGETS` shows `<unknown>`. Does this mean the HPA has no value for the metric yet, for example because `slurm_partition_pending_jobs` has not been scraped into Prometheus? Note that `REPLICAS` is 0 here and `ACTIVE` is `False` in the output that follows; the HPA does not evaluate metrics while its target has 0 replicas, which may also explain `<unknown>`.

  ```
  $ kubectl -n slurm get scaledobjects
  NAME         SCALETARGETKIND                     SCALETARGETNAME      MIN   MAX   READY   ACTIVE   FALLBACK   PAUSED    TRIGGERS     AUTHENTICATIONS   AGE
  scale-e900   slinky.slurm.net/v1alpha1.NodeSet   slurm-compute-e900   1     3     True    False    False      Unknown   prometheus                     25h
  ```

## Discussion

### Explaining HPA and VPA
- ### Question
  https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
  Horizontal Pod Autoscaling

  https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler/
  Vertical Pod Autoscaler

  Explain these in simple terms.

- ### gpt-4.1 (2025/08/05)
  Sure! Here is a simple explanation of the **Horizontal Pod Autoscaler (HPA)** and the **Vertical Pod Autoscaler (VPA)**:

  ---

  ### Horizontal Pod Autoscaler (HPA)

  [Official documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)

  #### What it does
  * **HPA automatically adjusts the number of Pods based on CPU utilization, memory usage, or other custom metrics.**
  * When load increases, HPA adds Pods; when load drops, it removes Pods.

  #### When to use it
  * When your service needs more instances to share traffic at peak times and should save resources when traffic is low.

  #### How it works
  1. Configure an HPA and specify the metric to watch (for example, CPU utilization).
  2. The Kubernetes controller continuously monitors that metric.
  3. If the metric rises above or falls below your target, Kubernetes adjusts the number of Pods automatically.

  ---

  ### Vertical Pod Autoscaler (VPA)

  [GitHub link](https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler/)

  #### What it does
  * **VPA automatically adjusts the size of a Pod's CPU/memory requests (not the number of Pods) based on actual resource usage.**
  * When load grows, it assigns more resources to the Pod; when load shrinks, it assigns fewer, improving resource utilization.

  #### When to use it
  * When your application's resource needs change with load, but you do not want to keep editing the Deployment's resource settings by hand.

  #### How it works
  1. Deploy the VPA components and apply a VPA resource.
  2. VPA continuously monitors the actual resource usage of each Pod.
  3. When needed, it recommends or automatically applies new resource requests (limits are scaled proportionally); Pods may need to be restarted for the new values to take effect.

  ---

  ### Summary comparison

  |                      | HPA                                        | VPA                                                            |
  | -------------------- | ------------------------------------------ | -------------------------------------------------------------- |
  | What is adjusted?    | Number of Pods                             | Resources per Pod (CPU/memory)                                 |
  | Purpose              | Horizontal scaling (Scale out/in)          | Vertical scaling (Scale up/down)                               |
  | Good fit for         | Services with large swings in traffic      | Workloads whose resource needs vary and are tedious to tune by hand |
  | Using them together  | Can be combined with VPA, with limitations | Can be combined with HPA, but not on the same CPU/memory metric |

  ---

  If you would like example configurations, or want to know more about using the two together, just let me know!

---

### Horizontal scaling (Scale out / Scale in): which one increases the count, and which one decreases it?
- ### gpt-4.1 (2025/08/05)
  Here is the simplest way to explain what "**Scale out**" and "**Scale in**" mean:

  ---

  ### Horizontal scaling (Scale out / Scale in)

  * **Scale out**: **increase** the *number* of resources
    * For example: add more Pods, servers, or hosts (usually machines of the same spec).
    * With the Kubernetes HPA, this means **more Pods**, to handle more traffic or work.
    * In Chinese this is often called 橫向擴展 (horizontal expansion).

  * **Scale in**: **decrease** the *number* of resources
    * For example: shut down or delete surplus Pods or servers.
    * With the Kubernetes HPA, this means **fewer Pods**, to save resources.
    * In Chinese this is often called 橫向縮減 (horizontal reduction).

  ---

  ### Mnemonic

  * **out = count goes up** (the count spreads "outward")
  * **in = count goes down** (the count shrinks "inward")

  ---

  ### Illustrated example

  Suppose there are 2 Pods to begin with:

  * After **Scale out** → 4 Pods

    ```
    Before:    PodA, PodB
    Scale out: PodA, PodB, PodC, PodD
    ```
  * After **Scale in** → 1 Pod

    ```
    Before:   PodA, PodB
    Scale in: PodA
    ```

  ---

  ### Additional note

  * **Scale up / Scale down (vertical scaling)** means **making the resources of a single machine (or Pod) larger or smaller**, for example going from 1G of RAM to 2G.

  ---

  If it is still unclear, tell me your scenario or where you are stuck, and I can explain with a more everyday example!

{%hackmd vaaMgNRPS4KGJDSFG0ZE0w %}
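
To connect the scale out / scale in discussion back to the `ScaledObject` in this note (`threshold: '5'`, `minReplicaCount: 1`, `maxReplicaCount: 3`), here is a rough sketch of the replica calculation described in the Kubernetes HPA documentation: `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, clamped to the min/max range. The function name `desired_replicas` is a hypothetical helper, not part of `kubectl` or KEDA. The sketch assumes integer inputs and ignores the HPA's tolerance and stabilization behavior, as well as KEDA's own handling of scaling between `idleReplicaCount` and `minReplicaCount`.

```bash
# Hypothetical helper (not a kubectl/KEDA command): estimate the replica count
# the HPA would ask for, using the formula from the Kubernetes HPA docs:
#   desired = ceil(current * metric / target), clamped to [min, max]
# Assumptions: integer inputs; tolerance and stabilization windows are ignored.
# Usage: desired_replicas <current> <metric> <target> <min> <max>
desired_replicas() (
  current=$1 metric=$2 target=$3 min=$4 max=$5
  # Integer ceiling division: ceil(a / b) == (a + b - 1) / b for a >= 0, b > 0
  desired=$(( (current * metric + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
)

# 1 replica, 12 pending jobs, target 5: ceil(12/5) = 3  -> scale out
desired_replicas 1 12 5 1 3   # prints 3
# 3 replicas, 2 pending jobs, target 5: ceil(6/5) = 2   -> scale in
desired_replicas 3 2 5 1 3    # prints 2
```

The two calls illustrate scale out (1 to 3 replicas) and scale in (3 to 2 replicas) for the `radar` partition settings used above; the numbers of pending jobs are made-up inputs.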