[debug][v1.0.0-rc1] slinky x KEDA
===
###### tags: `Slurm / SlinkyProject / debug`
###### tags: `Slurm`, `SlinkyProject`, `Kubernetes`, `k8s`, `app`, `KEDA`, `v1.0.0-rc1`
<br>
[TOC]
<br>
## Notes
### 1. slurm-exporter phased out in v1.0.0-rc1
- As stated explicitly in [CHANGELOG-1.0.md:95](https://github.com/SlinkyProject/slurm-operator/blob/main/CHANGELOG/CHANGELOG-1.0.md):
> Replaced slurm-exporter with a serviceMonitor that scrapes slurmctld directly.
- slurm-exporter has been removed; metrics are now scraped directly from slurmctld (a quick check follows).
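A quick way to confirm this on a running cluster (a minimal sketch; the `slurm` namespace/release name and the presence of the Prometheus Operator CRDs are assumptions):
```bash
# The old slurm-exporter Deployment/Service should no longer exist
kubectl -n slurm get deploy,svc | grep -i exporter || echo "no exporter found"

# Instead, a ServiceMonitor targeting slurmctld should be present
# (requires controller.metrics.serviceMonitor.enabled=true, see section 2)
kubectl -n slurm get servicemonitors.monitoring.coreos.com
```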
<br>
### 2. How does v1.0.0-rc1 export metrics?
- The new approach uses the **metrics endpoints built into slurmctld**:
- ### Enabling metrics
Set the following when installing the chart:
```sh
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
--set 'controller.metrics.enabled=true' \
--set 'controller.metrics.serviceMonitor.enabled=true' \
--namespace slurm --create-namespace
```
- ### Available metrics endpoints
According to [helm/slurm/values.yaml:275-278](https://github.com/SlinkyProject/slurm-operator/blob/main/helm/slurm/values.yaml#L275-L278) and [internal/builder/controller_servicemonitor.go:55-60](https://github.com/SlinkyProject/slurm-operator/blob/main/internal/builder/controller_servicemonitor.go#L55-L60), four endpoints are provided by default:
- `/metrics/jobs`
- `/metrics/nodes`
- `/metrics/partitions`
- `/metrics/scheduler`
- ### How to rediscover the metrics endpoints if you forget the URLs
- **STEP1**: look up the controller IP & port
```
$ kubectl -n slurm get pod,svc -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/slurm-controller-0 3/3 Running 0 3d15h 192.168.0.218 stage-kube01 <none> <none>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/slurm-controller ClusterIP 10.107.212.234 <none> 6817/TCP 3d15h app.kubernetes.io/instance=slurm,app.kubernetes.io/name=slurmctld
```
- 192.168.0.218:6817
- 10.107.212.234:6817
- **STEP2**: fetch the endpoint index with `curl <ip>:<port>`
```
$ curl 192.168.0.218:6817
slurmctld index of endpoints:
'/readyz': check slurmctld is servicing RPCs
'/livez': check slurmctld is running
'/healthz': check slurmctld is running
'/metrics': print available metric endpoints
```
```
$ curl 10.107.212.234:6817
slurmctld index of endpoints:
'/readyz': check slurmctld is servicing RPCs
'/livez': check slurmctld is running
'/healthz': check slurmctld is running
'/metrics': print available metric endpoints
```
- **STEP3**: list the metrics endpoints with `curl <ip>:<port>/metrics`
```
$ curl 192.168.0.218:6817/metrics
slurmctld index of metrics endpoints:
'/metrics/jobs': get job metrics
'/metrics/nodes': get node metrics
'/metrics/partitions': get partition metrics
'/metrics/jobs-users-accts': get user and account jobs metrics
'/metrics/scheduler': get scheduler metrics
```
- **STEP4**: pick a metrics source: jobs, nodes, partitions, jobs-users-accts, scheduler
```
$ curl -s 192.168.0.218:6817/metrics/partitions | head
# HELP slurm_partitions Total number of partitions
# TYPE slurm_partitions gauge
slurm_partitions 5
# HELP slurm_partition_jobs Number of jobs in this partition
# TYPE slurm_partition_jobs gauge
slurm_partition_jobs{partition="slinky"} 0
slurm_partition_jobs{partition="all"} 0
slurm_partition_jobs{partition="book"} 0
slurm_partition_jobs{partition="tainan"} 0
slurm_partition_jobs{partition="tp"} 7
```
- ### Updated ScaledObject example
> Reference: https://github.com/SlinkyProject/slurm-operator/blob/main/docs/usage/autoscaling.md#keda-scaledobject
```yaml=
# config-slurm-worker-241-gpu1080-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: slurm-worker-gpu1080-scaler
namespace: slurm
spec:
scaleTargetRef:
apiVersion: slinky.slurm.net/v1beta1 # (1) changed to v1beta1
kind: NodeSet
name: slurm-worker-gpu1080
idleReplicaCount: 0
minReplicaCount: 1
maxReplicaCount: 3
triggers:
- type: prometheus
metricType: Value
metadata:
# $> kubectl run curl-test -it --rm --image=curlimages/curl -- sh
# $> nslookup kube-prometheus-stack-prometheus.monitoring.svc.cluster.local
# $> curl http://kube-prometheus-stack-prometheus.monitoring:9090/api/v1/query?query=up
serverAddress: http://kube-prometheus-stack-prometheus.monitoring:9090 # depends on the environment
query: slurm_partition_jobs_pending{partition="tp"} # (2) drop the _total suffix
threshold: '5'
```
- ### Key changes
1. **apiVersion**: `v1alpha1` → `v1beta1`
2. **Metric source**: slurm-exporter → served directly by slurmctld
3. **Metric name**: `slurm_partition_jobs_pending_total` → `slurm_partition_jobs_pending`
- ### Viewing in Grafana

- ### Querying the endpoints directly
```
$ curl -s http://192.168.0.218:6817/metrics/partitions | grep slurm_partition_jobs_pending
# HELP slurm_partition_jobs_pending Number of jobs in Pending state
# TYPE slurm_partition_jobs_pending gauge
slurm_partition_jobs_pending{partition="slinky"} 0
slurm_partition_jobs_pending{partition="all"} 0
slurm_partition_jobs_pending{partition="book"} 0
slurm_partition_jobs_pending{partition="tainan"} 0
slurm_partition_jobs_pending{partition="tp"} 0
```
- `slurm_partition_jobs_pending_total` no longer exists (see the Prometheus check below)
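The new metric name can also be confirmed from Prometheus itself, which is what the ScaledObject ultimately queries. A minimal sketch, assuming the kube-prometheus-stack service name used in the ScaledObject above and that `jq` is available:
```bash
# Port-forward Prometheus locally (service name/namespace depend on the environment)
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090 &
sleep 2

# The metric is exposed without the _total suffix
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=slurm_partition_jobs_pending{partition="tp"}' | jq '.data.result'
```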
<br>
---
<br>
## Diagnosing the problem: `"error": "the server could not find the requested resource (get nodesets.meta.k8s.io slurm-worker-gpu1080)"`
### keda/vendor/k8s.io/client-go/discovery/fake/discovery.go
https://github.com/kedacore/keda/blob/release/v2.17/vendor/k8s.io/client-go/discovery/fake/discovery.go#L63
### Diagnostic commands
- ### Step1: ScaledObject configuration and status, and the matching KEDA error messages
```bash
# 1. Check whether the NodeSet exists
$ kubectl -n slurm get nodeset slurm-worker-gpu1080
NAME REPLICAS UPDATED READY AGE
slurm-worker-gpu1080
# 2. Inspect the ScaledObject status in detail
kubectl -n slurm describe scaledobject slurm-worker-gpu1080-scaler
# 3. Dump the full ScaledObject YAML
kubectl -n slurm get scaledobject slurm-worker-gpu1080-scaler -o yaml
# 4. Check the KEDA operator logs
kubectl -n keda logs -l app=keda-operator --tail=50
```
- ### Step2: NodeSet CRD version, and whether it supports the scale subresource
```bash
# 1. Check the CRD versions
$ kubectl get crd nodesets.slinky.slurm.net -o jsonpath='{.spec.versions[*].name}'
v1beta1
# 2. Check whether the scale subresource is defined
$ kubectl get crd nodesets.slinky.slurm.net -o jsonpath='{.spec.versions[?(@.name=="v1beta1")].subresources}' | jq
{
"scale": {
"labelSelectorPath": ".status.selector",
"specReplicasPath": ".spec.replicas",
"statusReplicasPath": ".status.replicas"
},
"status": {}
}
# or
# $ kubectl get crd nodesets.slinky.slurm.net -o yaml | grep -A 10 subresources
# 3. If the CRD lacks the scale subresource, reinstall or upgrade the CRD
#helm upgrade slurm oci://ghcr.io/slinkyproject/charts/slurm \
# --namespace slurm \
# --reuse-values
```
- ### Step3: confirm the scale subresource is reachable
```bash
# 1. Test the scale API directly
$ kubectl get --raw "/apis/slinky.slurm.net/v1beta1/namespaces/slurm/nodesets/slurm-worker-gpu1080/scale" | jq
{
"kind": "Scale",
"apiVersion": "autoscaling/v1",
"metadata": {
"name": "slurm-worker-gpu1080",
"namespace": "slurm",
"uid": "83e0e85e-8758-4157-b0c5-de928faead60",
"resourceVersion": "25964227",
"creationTimestamp": "2025-11-21T10:31:39Z"
},
"spec": {},
"status": {
"replicas": 0,
"selector": "app.kubernetes.io/instance=slurm-worker-gpu1080,app.kubernetes.io/name=slurmd"
}
}
# 2. Check the KEDA version (v2.10+ is required for full custom-CRD scaling support)
$ kubectl -n keda get deploy keda-operator -o jsonpath='{.spec.template.spec.containers[0].image}'
ghcr.io/kedacore/keda:2.17.2
```
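To see exactly what KEDA's discovery-based resolver sees, the group/version discovery document can be dumped directly; the resolver (see the deep dive below) looks for a resource literally named `nodesets/scale` in it. A sketch, assuming `jq` is installed:
```bash
# Dump the discovery document for slinky.slurm.net/v1beta1 and keep only scale subresources.
# This mirrors what discoveryScaleResolver.ScaleForResource() reads via ServerResourcesForGroupVersion().
kubectl get --raw /apis/slinky.slurm.net/v1beta1 | jq '.resources[] | select(.name | endswith("/scale"))'
```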
<br>
### Diagnosis
- ### If the CRD already has the scale subresource, the problem may be:
- the CRD's `.spec.replicas` or `.status.replicas` field paths are misconfigured
- the NodeSet CR instance is missing these fields (see the jsonpath check at the end of this section)
- a stale KEDA discovery-client cache (the KEDA operator needs a restart)
- ### Deep dive: root cause
Tracing through the code, the key points are:
### **Full flow of the error**
1. **KEDA calls, at `scaledobject_controller.go:409`**:
```go
scale, errScale = (r.ScaleClient).Scales(scaledObject.Namespace).Get(ctx, gr, scaledObject.Spec.ScaleTargetRef.Name, metav1.GetOptions{})
```
where `gr` is `schema.GroupResource{Group: "slinky.slurm.net", Resource: "nodesets"}`
2. **The scale client calls ScaleForResource at `client.go:187`**:
```go
desiredGVK, err := c.client.scaleKindResolver.ScaleForResource(gvr)
```
- here `gvr` is `slinky.slurm.net/v1beta1/nodesets`
- GVK = group, version, kind
- GVR = group, version, resource
3. **discoveryScaleResolver looks up the scale subresource at `util.go:63-88`**:
```go
func (r *discoveryScaleResolver) ScaleForResource(inputRes schema.GroupVersionResource) (scaleVersion schema.GroupVersionKind, err error) {
groupVerResources, err := r.discoveryClient.ServerResourcesForGroupVersion(inputRes.GroupVersion().String())
for _, resource := range groupVerResources.APIResources {
resourceParts := strings.SplitN(resource.Name, "/", 2)
if len(resourceParts) != 2 || resourceParts[0] != inputRes.Resource || resourceParts[1] != "scale" {
continue
}
scaleGV := inputRes.GroupVersion()
if resource.Group != "" && resource.Version != "" {
scaleGV = schema.GroupVersion{
Group: resource.Group,
Version: resource.Version,
}
}
return scaleGV.WithKind(resource.Kind), nil
}
return schema.GroupVersionKind{}, fmt.Errorf("could not find scale subresource for %s in discovery information", inputRes.String())
}
```
4. **When no scale subresource is found, an error is returned**; the message shows `meta.k8s.io` because **the resource path in the error message is formatted incorrectly**
### **The real cause**
`meta.k8s.io` appears in the error message because:
1. **KEDA's discovery client did not correctly discover the NodeSet scale subresource**
2. **Possible reasons**:
- the KEDA operator cached stale API discovery information at startup
- the CRD's scale subresource was only added after the KEDA operator started
- **the discovery cache was never refreshed** <--
### **Fix**
Follow these steps:
```bash
# 1. Restart the KEDA operator to clear the discovery cache
kubectl rollout restart deployment keda-operator -n keda
# 2. Wait for the rollout to finish
kubectl rollout status deployment keda-operator -n keda
# 3. Delete and recreate the ScaledObject (optional, but recommended)
kubectl delete -f config-slurm-worker-241-gpu1080-scaler.yaml
kubectl apply -f config-slurm-worker-241-gpu1080-scaler.yaml
# 4. Tail the logs to confirm the issue is resolved
kubectl -n keda logs -l app=keda-operator -f
```
### **Why does `meta.k8s.io` appear?**
This is an artifact of Kubernetes error reporting: when the discovery client cannot find the scale subresource for a given GVR, the error message is formatted with `meta.k8s.io` as the fallback API group. It is only a side effect of error formatting; **KEDA did not actually query the `meta.k8s.io` API group**.
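To rule out the first two possible causes listed above (misconfigured or missing field paths), the fields the scale subresource points at can be read straight off the CR. A minimal sketch, using the NodeSet name from this note:
```bash
# Print the fields referenced by the scale subresource:
# .spec.replicas, .status.replicas and .status.selector
kubectl -n slurm get nodeset slurm-worker-gpu1080 \
  -o jsonpath='{.spec.replicas}{"\n"}{.status.replicas}{"\n"}{.status.selector}{"\n"}'
```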
<br>
### Steps to reproduce the KEDA cache issue:
**Key factor**: the apiVersion switches between v1alpha1 and v1beta1
Step1: clear the KEDA cache (rollout restart)
Step2: deploy the v1alpha1 release of slinky (v0.4.1)
Step3: deploy the v1alpha1 worker scaler -> READY: true
Step4: remove the v1alpha1 slinky release & worker scaler
Step5: deploy the v1beta1 release of slinky (v1.0.0-rc1)
Step6: deploy the v1beta1 worker scaler -> READY: false
(deleting the scaler and re-applying it leaves READY at false)
Step7: clear the KEDA cache (rollout restart) -> worker scaler READY: true
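The READY state in each step was read from the ScaledObject. A sketch of the commands involved, reusing the names from earlier in this note:
```bash
# Clear the KEDA discovery cache between apiVersion switches
kubectl -n keda rollout restart deployment keda-operator

# Re-apply the scaler and watch its READY column
kubectl apply -f config-slurm-worker-241-gpu1080-scaler.yaml
kubectl -n slurm get scaledobject slurm-worker-gpu1080-scaler -w
```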
<br>
### 2025/11/26 - Is this related to commit 6a7301a?
> https://github.com/alvidofaisal/keda/commit/6a7301a5f71c9a136592621a6172734e350d40b9
> Fix(scaling): Correct API group resolution for cluster-scoped CRDs
> This commit addresses an issue where KEDA would incorrectly attempt to query the `meta.k8s.io` API group when scaling cluster-scoped Custom Resources (CRs) that have a /scale subresource.
>
> The primary change involves updating `pkg/k8s/scaleclient.go` to use `scale.NewFixedScaleKindResolver`. This allows KEDA to directly use the `apiVersion` and `kind` specified in the `ScaledObject`'s `spec.scaleTargetRef` to resolve the API group, ensuring correct interaction with cluster-scoped CRDs.
>
> A new test case, `testClusterScopedCRDScale`, has been added to `tests/internals/subresource_scale/subresource_scale_test.go` to specifically verify the scaling of cluster-scoped CRDs. This test includes:
> - Definition of a cluster-scoped CRD (`ClusterScaler`) with a `/scale` subresource.
> - Creation of a `ScaledObject` targeting an instance of `ClusterScaler`.
> - Verification of scale-out and scale-in operations.
> - Checks to ensure no errors related to incorrect API group querying appear in KEDA operator logs.
>
> **Note:** Due to environmental limitations (out of disk space) I encountered during the development process, I couldn't complete the full build and end-to-end testing (including the new test case). I recommend you manually verify and further test in a stable environment.
>
> Fixes kedacore#6798
<br>
### 2025/11/26 - Tested with keda v2.18.1 (latest); the problem persists

<br>
{%hackmd vaaMgNRPS4KGJDSFG0ZE0w %}