Tutorial: connecting slurm-exporter to Prometheus
===

###### tags: `Kubernetes`, `k8s`, `app`, `slurm`, `SlinkyProject`, `Autoscaling`, `KEDA`, `HPA`, `VPA`, `Scale out`, `Scale in`, `ServiceMonitor`

<br>

[TOC]

<br>

:::success
## 🎯 Goal
**Let Prometheus scrape the slurm-exporter `/metrics` endpoint automatically through a ServiceMonitor, and verify the whole chain end to end.**
:::

:::success
## 🎯 Integration flow

```
Prometheus (kube-prometheus-stack)
        │
        ▼
ServiceMonitor (tells Prometheus where the slurm-exporter targets live)
        │
        ▼
Service (slurm-exporter, port=metrics)
        │
        ▼
Deployment/Pod (slurm-exporter, serves /metrics)
```
:::

---

## 1. Deploy Prometheus (using kube-prometheus-stack)

If your cluster does not have Prometheus yet, the easiest route is to install **kube-prometheus-stack** with Helm:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# install into the monitor namespace
helm upgrade -i kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitor --create-namespace
```

The installation gives you:

* the Prometheus CR (through which the operator watches ServiceMonitor/PodMonitor objects)
* Grafana
* Alertmanager

### Verify

```bash
kubectl -n monitor get pods
```

- All `kube-prometheus-stack-*` pods should be Running.
- **Example run**

```
$ kubectl -n monitor get pods
NAME                                                        READY   STATUS    RESTARTS   AGE
alertmanager-kube-prometheus-stack-alertmanager-0           2/2     Running   0          6d1h
kube-prometheus-stack-grafana-cbd794898-wb72l               3/3     Running   0          6d1h
kube-prometheus-stack-kube-state-metrics-577d4b4c4f-mpvnb   1/1     Running   0          6d1h
kube-prometheus-stack-operator-7c84ff6bb7-x7nc2             1/1     Running   0          6d1h
kube-prometheus-stack-prometheus-node-exporter-rt678        1/1     Running   0          7h38m
prometheus-kube-prometheus-stack-prometheus-0               2/2     Running   0          6d1h
```

<br>

---

## 2. Deploy slurm-exporter

> https://github.com/SlinkyProject/slurm-operator/blob/main/helm/slurm/values.yaml#L269-L279

Enable the exporter in `values.yaml`:

> - slurm:v0.3.0 -> disabled by default
> - slurm:v0.3.1 -> enabled by default

```yaml=
slurm-exporter:
  enabled: true
  exporter:
    enabled: true
    secretName: "slurm-token-exporter"
```

Helm renders a Deployment + Service (but **no ServiceMonitor** — you create that yourself later).

```
# pod
NAME                                 READY   STATUS    RESTARTS   AGE
pod/slurm-exporter-cf8944f49-mzcbp   1/1     Running   0          6h31m

# service
NAME                     TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/slurm-exporter   ClusterIP   None         <none>        8080/TCP   26h

# deployment
NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/slurm-exporter   1/1     1            1           26h

# replicaset
NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/slurm-exporter-cf8944f49   1         1         1       26h
```

- The Deployment manages the ReplicaSet.
- The ReplicaSet manages the Pod; the Service is not owned by the ReplicaSet — it routes to the Pod via its label selector.

### Verify

```bash
kubectl -n slurm get deploy slurm-exporter
kubectl -n slurm get svc slurm-exporter -o yaml
```

- **Check**
    - The Pod is Running and the container port is named `metrics` (a jsonpath one-liner for this check follows the example output below).
    - The Service exposes port 8080 and that port is named `metrics`.
- **Example run**

```bash=
$ kubectl -n slurm get deploy slurm-exporter
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
slurm-exporter   1/1     1            1           26h
```

```yaml=
$ kubectl -n slurm get svc slurm-exporter -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: slurm
    meta.helm.sh/release-namespace: slurm
  creationTimestamp: "2025-09-08T08:01:38Z"
  labels:
    app.kubernetes.io/component: slurm-exporter
    app.kubernetes.io/instance: slurm-exporter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: slurm-exporter
    app.kubernetes.io/version: "25.05"
    helm.sh/chart: slurm-exporter-0.3.1
  name: slurm-exporter
  namespace: slurm
  resourceVersion: "6379719"
  uid: 79c2541b-625b-455f-9f64-04949fd6ad79
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: metrics      # <-- the Service exposes port 8080 and the port is named `metrics`
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app.kubernetes.io/instance: slurm-exporter
    app.kubernetes.io/name: slurm-exporter
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
```
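The **Check** list above also asks for the *container* port to be named `metrics`. A minimal way to confirm that from the Deployment spec (a sketch; the jsonpath output formatting varies slightly across kubectl versions):

```bash
# Print the container port definitions of the slurm-exporter Deployment;
# the entry should show name "metrics" and containerPort 8080.
kubectl -n slurm get deploy slurm-exporter \
  -o jsonpath='{.spec.template.spec.containers[*].ports}{"\n"}'
```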
<br>

### Connectivity test

> We have already confirmed there is a Service `slurm-exporter` whose port 8080 is named `metrics`.

To hit the exporter's `/metrics` endpoint with `curl`, you have a few options:

---

:::danger
### :warning: Testing from inside the slurm-exporter Pod does NOT work!

`exec`-ing straight into the Pod fails, because the container image ships without `sh`, `bash`, `curl`, `wget`, etc.:

```bash
# find the pod name
$ kubectl -n slurm get pod -l app.kubernetes.io/name=slurm-exporter

# try to get a shell in the container
$ kubectl -n slurm exec -it <pod-name> -- sh
error: Internal error occurred: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "9c5b0711edffcefae50032e3277a8f24456ddcb824a5344fc6c0b59cc2d8d690": OCI runtime exec failed: exec failed: unable to start container process: exec:
```
:::

- ### [Method 1] Test from **another Pod in the same namespace**

The Service is named `slurm-exporter` and lives in the `slurm` namespace, so any Pod in the same namespace can reach it directly:

- ### Using a busybox image

```bash
# one-liner: spin up a throwaway pod
kubectl -n slurm run tmp-metric-dumper \
  -it --rm --image=busybox:1.36 --restart=Never -- sh

# inside busybox
wget -qO- http://slurm-exporter:8080/metrics
```
- `-qO-` is equivalent to `-q -O-` / `-q -O -`
    - `-q`: quiet, suppress the progress output
    - `-O FILE`: save to FILE (`-` means stdout)

- ### Using a curl image

```bash
kubectl -n slurm run tmp-metric-dumper \
  -it --rm --image=curlimages/curl --restart=Never -- \
  curl http://slurm-exporter:8080/metrics
```
- Or interactively in a shell:

```bash
kubectl -n slurm run tmp-metric-dumper \
  -it --rm --image=curlimages/curl --restart=Never -- sh

# interactive shell
~ $ curl http://slurm-exporter:8080/metrics | grep partition
```

- ### [Note] All of the following hostnames work
    - `http://slurm-exporter:8080/metrics`
    - `http://slurm-exporter.slurm:8080/metrics`
    - `http://slurm-exporter.slurm.svc:8080/metrics`
    - `http://slurm-exporter.slurm.svc.cluster.local:8080/metrics` (the Service's fully qualified domain name, **FQDN**)

---

- ### [Method 2] Test from another namespace

You must use the namespace-qualified DNS name:

```
http://slurm-exporter.slurm.svc.cluster.local:8080/metrics
```

Example: from the `default` namespace, fetch `/metrics` served in the `slurm` namespace:

```bash
kubectl -n default run tmp-metric-dumper \
  -it --rm --image=curlimages/curl --restart=Never -- \
  curl http://slurm-exporter.slurm.svc.cluster.local:8080/metrics
```

---

- ### [Method 3] Test from your local machine

This Service is `ClusterIP: None` (a headless service), so it is only reachable from inside the cluster.
To curl it from outside (your desktop or laptop), set up a port-forward first (a scripted variant follows below):

```bash
kubectl port-forward -n slurm svc/slurm-exporter 8080:8080
```

Then, in another terminal on your machine:

```bash
curl http://localhost:8080/metrics
```
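If you prefer to script this check instead of keeping two terminals open, here is a minimal sketch (same namespace/Service names as above; the `sleep` is just a crude wait for the tunnel to come up):

```bash
# Start the tunnel in the background, scrape once, then tear the tunnel down.
kubectl -n slurm port-forward svc/slurm-exporter 8080:metrics >/dev/null 2>&1 &
PF_PID=$!
sleep 2                                   # give the tunnel a moment to establish
curl -s http://127.0.0.1:8080/metrics | grep '^slurm_jobs_total'
kill "$PF_PID"
```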
---

### Concept diagram

```
[your program / browser]
        │
        │  connects to 127.0.0.1:8080 on your machine
        ▼
[local `kubectl port-forward` process]
        │
        │  opens a TLS/SPDY tunnel to the K8s API server
        ▼
[Kubernetes API Server]
        │
        │  relays the "portforward" stream to the kubelet on the target node
        ▼
[kubelet (on the target Node)]
        │
        │  feeds the data into the Pod's containerPort (= targetPort)
        ▼
[Pod: slurm-exporter-xxxxx container]
```

---

### How `kubectl port-forward` works in detail:

- ### Concept
    > Connections to local `127.0.0.1:8080` are forwarded, through the API server, to one Pod behind `svc/slurm-exporter` in the `slurm` namespace, on the Pod port (`targetPort`) that the Service's `8080` maps to. ([Kubernetes][1])

- ### Details
    - `kubectl port-forward` takes traffic arriving at **local `127.0.0.1:8080`** and **forwards it to the Service `slurm-exporter` in the `slurm` namespace**. `kubectl` **automatically picks one Pod** behind that Service to build the tunnel. ([Kubernetes][1])
    - The right-hand `8080` is a **Service port**; `kubectl` follows the Service's `port -> targetPort` mapping and delivers the traffic to the **actual container port on the Pod**. In other words, **even if the Service maps `port: 8080` to `targetPort: 9090`, connecting to local `8080` still lands on the Pod's `9090`**. ([Kubernetes][2])

- ### Tips
    - By default it listens only on the loopback address and is not exposed externally; add `--address 0.0.0.0` to listen on other interfaces (mind the security implications). ([Kubernetes][2])
    - If the Service has several ports, prefer the **port name**, e.g. `kubectl port-forward -n slurm svc/slurm-exporter 8080:metrics` (it forwards to the `targetPort` behind that port name). ([Kubernetes][2])
    - If the local port and the Service port are the same, the command can be shortened to
        ```
        kubectl port-forward -n slurm svc/slurm-exporter 8080
        ```
        or, using `metrics` to refer to 8080:
        ```
        kubectl port-forward -n slurm svc/slurm-exporter metrics
        ```
    - `$ kubectl -n slurm get service/slurm-exporter -o yaml`
        ```yaml
        spec:
          clusterIP: None
          ...
          ports:
          - name: metrics      # <-- port name
            port: 8080
            protocol: TCP
            targetPort: 8080
        ```

[1]: https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/ "Use Port Forwarding to Access Applications in a Cluster | Kubernetes"
[2]: https://kubernetes.io/docs/reference/kubectl/generated/kubectl_port-forward/ "kubectl port-forward | Kubernetes"

---

👉 Use **Method 1** first (same namespace) to confirm the exporter itself serves metrics correctly,
👉 then use **Method 2** or **Method 3** to confirm that the Service and the external access path also work.

<br>

---

<br>

## 3. Create the ServiceMonitor CR

> Tell Prometheus:
> - my scrape source is `-n slurm`, `svc/slurm-exporter`, `port: metrics`
> - scrape every 5 seconds
> - the default scrape path is `/metrics`

Create a new YAML file `servicemonitor-slurm-exporter.yaml`:

```yaml=
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: slurm-exporter
  namespace: slurm
  labels:
    release: kube-prometheus-stack   # 👈 Prometheus uses this label to pick ServiceMonitors (the critical part)
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: slurm-exporter
      app.kubernetes.io/instance: slurm-exporter
  namespaceSelector:
    matchNames:
      - slurm
  endpoints:
    - port: metrics
      interval: 5s
```

- ### Option A: create the ServiceMonitor:

```bash
kubectl apply -f servicemonitor-slurm-exporter.yaml
```

- ### Option B: apply the label to an existing ServiceMonitor on the fly:

```bash
kubectl -n slurm label servicemonitor/slurm-exporter \
  release=kube-prometheus-stack --overwrite
```

- ### Verify the ServiceMonitor

```bash
kubectl -n slurm get servicemonitor slurm-exporter -o yaml
```

Confirm that the selector matches the Service's labels.

- ### The settings, explained

- **metadata.namespace: `slurm`**
  The ServiceMonitor object itself lives in the `slurm` namespace.

- **metadata.labels.release: `kube-prometheus-stack` (very important)**
  By default, `kube-prometheus-stack` only picks up ServiceMonitors that carry a `release` label matching its own release name.
  ➜ If your Prometheus Helm release is not named `kube-prometheus-stack`, change this label to **that release name**, otherwise Prometheus will ignore the ServiceMonitor. (An optional Helm value that relaxes this rule is sketched after the check below.)

- #### How do I check which release label Prometheus uses to select ServiceMonitors?

    ```
    $ NAMESPACE=monitor

    $ kubectl -n ${NAMESPACE} get prometheus
    NAME                                    VERSION   DESIRED   READY   RECONCILED   AVAILABLE   AGE
    monitor-kube-prometheus-st-prometheus   v3.3.0    1         1       True         True        37d

    $ kubectl -n ${NAMESPACE} get prometheus -o yaml

    $ kubectl -n ${NAMESPACE} get prometheus -o yaml | yq '.items[0].spec.serviceMonitorSelector' | yq
    matchLabels:
      release: monitor

    $ kubectl -n ${NAMESPACE} get prometheus -o yaml | yq '.items[0].spec.serviceMonitorNamespaceSelector'
    {}
    ```
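If you would rather have Prometheus pick up ServiceMonitors regardless of the `release` label, kube-prometheus-stack exposes a value for that. A hedged sketch (release/namespace names as installed in section 1; double-check the value name against your chart version before relying on it):

```bash
# Let the operator-managed Prometheus select ServiceMonitors without requiring
# the `release: <helm-release>` label (see the kube-prometheus-stack values.yaml).
helm upgrade -i kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitor \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
```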
- **spec.selector.matchLabels**
  The **Service (note: the Service, not the Pod)** to be scraped must carry both labels (an AND condition):
    - `app.kubernetes.io/name=slurm-exporter`
    - `app.kubernetes.io/instance=slurm-exporter`

  ➜ In other words, your **`Service/slurm-exporter`** needs these labels for the ServiceMonitor to select it.

  **How to check**:

    ```yaml
    $ kubectl -n slurm get svc/slurm-exporter -o yaml | grep labels -A6
      labels:
        app.kubernetes.io/component: slurm-exporter
        app.kubernetes.io/instance: slurm-exporter    # <--
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: slurm-exporter        # <--
        app.kubernetes.io/version: "25.05"
        helm.sh/chart: slurm-exporter-0.3.1
    ```

- **spec.namespaceSelector.matchNames: `slurm`**
  Only look for matching **Services** in the `slurm` namespace.
  (To monitor across namespaces, switch to `any: true` or list several entries under `matchNames`.)

- **spec.endpoints**
    - `port: metrics`: use the **Service** port **named `metrics`** (`spec.ports[].name: metrics`). Prometheus scrapes the Pod's `/metrics` through the `targetPort` behind that port.
    - `interval: 5s`: scrape every 5 seconds.
    - The default scrape path is **`/metrics`**.
      With a **ServiceMonitor**, if you specify nothing else, the scrape config generated by the Prometheus Operator uses the path `/metrics`. To use a different path (say `/data`), add `path` under `endpoints`:

    ```yaml
    spec:
      ...
      endpoints:
        - port: metrics     # points at the Service port name
          path: /data       # ← custom path (default is /metrics)
          interval: 5s
    ```

    * This `path` only affects the scrape job created by **this ServiceMonitor**; the legacy `prometheus.io/path` annotation belongs to annotation-based scraping and has nothing to do with ServiceMonitors.
    * For special needs you can also rewrite the target's `__metrics_path__` with `relabelings`, but using `path` directly is the clearest option.

- ### [Dry run] With the information in the ServiceMonitor, can Prometheus actually find the target?
    > i.e. how Prometheus resolves a ServiceMonitor into scrape targets (a combined sketch follows the two lookups below):

- ### Lookup 1: which Service

```
$ kubectl -n slurm get service \
    -l app.kubernetes.io/instance=slurm-exporter,app.kubernetes.io/name=slurm-exporter
```

```
# -l, --selector=''
# the selector can also be split across multiple -l flags
$ kubectl -n slurm get service \
    -l app.kubernetes.io/instance=slurm-exporter \
    -l app.kubernetes.io/name=slurm-exporter

NAME             TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
slurm-exporter   ClusterIP   None         <none>        8080/TCP   6h14m
```

- ### Lookup 2: which port

```
$ kubectl -n slurm get service \
    -l app.kubernetes.io/instance=slurm-exporter \
    -l app.kubernetes.io/name=slurm-exporter \
    -o yaml \
    | yq '.items[].spec.ports[] | select(.name == "metrics")'
```
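To close the loop, the sketch below (hypothetical pod name `tmp-scrape`; same throwaway-curl-pod pattern as in the connectivity tests) resolves the endpoint IP and port that Prometheus will end up scraping, and fetches it once from inside the cluster:

```bash
# Resolve the Pod IP:port behind the Service, then scrape it the way Prometheus would.
EP=$(kubectl -n slurm get endpoints slurm-exporter \
  -o jsonpath='{.subsets[0].addresses[0].ip}:{.subsets[0].ports[0].port}')
echo "scraping ${EP}"

# the Pod IP is only reachable from inside the cluster, so curl it from a throwaway pod
kubectl -n slurm run tmp-scrape -i --rm --restart=Never --image=curlimages/curl -- \
  curl -s "http://${EP}/metrics" | head -5
```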
---

### One-sentence summary

> This ServiceMonitor makes the kube-prometheus-stack Prometheus instance scrape `/metrics`, every 5 seconds, from the Pods behind the **Service (port name `metrics`)** that carries the specified labels in the **`slurm` namespace**.

---

### Quick self-check (the most common pitfalls)

1. **Do the labels match?**

```bash
kubectl -n slurm get svc/slurm-exporter -o jsonpath-as-json='{.metadata.labels}'
```

2. **Is the Service port named `metrics`?**

```bash
kubectl -n slurm get svc/slurm-exporter -o jsonpath='{.spec.ports[*].name}'
```

3. **Will Prometheus pick up this ServiceMonitor? (does the release name match?)**
   (replace `kube-prometheus-stack` with your actual Helm release name)

```bash
kubectl get servicemonitors --all-namespaces -l release=kube-prometheus-stack
```

<br>

---

<br>

## 4. Verify the /metrics output

> (covered in detail in the connectivity-test section above)

### Method A: port-forward to your machine

```bash
kubectl -n slurm port-forward svc/slurm-exporter 8080:8080

# in another terminal
curl http://localhost:8080/metrics | head -20
```

### Method B: a throwaway test Pod

```bash
kubectl -n slurm run -it curl-test \
  --rm --restart=Never --image=curlimages/curl -- \
  curl http://slurm-exporter:8080/metrics
```

<br>

---

<br>

## 5. Verify the Service → Pod routing

```bash
kubectl -n slurm get endpoints/slurm-exporter -o yaml
```

- You should see the Pod IP and port 8080.
- **Example run**

```bash
$ kubectl -n slurm get endpoints/slurm-exporter -o yaml
```

```yaml=
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2025-09-10T02:51:40Z"
  creationTimestamp: "2025-09-10T02:49:49Z"
  labels:
    app.kubernetes.io/component: slurm-exporter
    app.kubernetes.io/instance: slurm-exporter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: slurm-exporter
    app.kubernetes.io/version: "25.05"
    helm.sh/chart: slurm-exporter-0.3.1
    service.kubernetes.io/headless: ""
  name: slurm-exporter
  namespace: slurm
  resourceVersion: "6838327"
  uid: 3b162b0f-63e3-4cec-b022-d3225490a459
subsets:
- addresses:
  - ip: 192.168.0.89
    nodeName: stage-kube01
    targetRef:
      kind: Pod
      name: slurm-exporter-cf8944f49-pd62b
      namespace: slurm
      uid: 2bf47fca-85a5-4a2a-949e-35589897df66
  ports:
  - name: metrics
    port: 8080
    protocol: TCP
```

<br>

---

<br>

## 6. Verify that Prometheus is actually receiving `/metrics`

- ### 1. Open the Prometheus UI (via port-forward or an ingress):

```bash
kubectl -n monitor port-forward svc/kube-prometheus-stack-prometheus 9090:9090
```

Then open `http://localhost:9090` in a browser.

- **Screenshot**
  ![](https://hackmd.io/_uploads/BytsXpJsgg.png)

- ### 2. Go to **Status → Targets**; you should see

```
serviceMonitor/slurm/slurm-exporter/0
```

with state `UP`. (An API-based equivalent of this check is shown after step 3.)

- **Screenshot**
    - Status -> Target health
      ![](https://hackmd.io/_uploads/SyK-N6Jsgx.png)
    - In [Select scrape pool], type `serviceMonitor/slurm/slurm-exporter/0` to filter
      ![](https://hackmd.io/_uploads/B1pa4pJsex.png)

- ### 3. Query an exporter metric:

```promql
slurm_jobs_total
```

If it returns a value, the whole chain works.

- **Screenshot**
  ![](https://hackmd.io/_uploads/SkHpLaJjee.png)

**Filtered result**:

```
slurm_jobs_total{
  container="metrics",
  endpoint="metrics",
  instance="192.168.0.89:8080",
  job="slurm-exporter",
  namespace="slurm",
  pod="slurm-exporter-cf8944f49-pd62b",
  service="slurm-exporter"
}
```
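As an alternative to clicking through Status → Targets, you can ask the Prometheus HTTP API for the same information. A minimal sketch (assumes the `port-forward` to `localhost:9090` from step 1 is still running and that `jq` is installed locally):

```bash
# Show health, scrape URL and last error of the slurm-exporter scrape targets.
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[]
        | select(.scrapePool == "serviceMonitor/slurm/slurm-exporter/0")
        | {scrapeUrl, health, lastError}'
```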
- ### 4. Run the query above with `curl`

```bash
$ curl -s http://localhost:9090/api/v1/query?query=slurm_jobs_total | jq

# -s, --silent: silent mode -> suppress the progress meter
# -------------------------------------------------------------------------------
#   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#                                  Dload  Upload   Total   Spent    Left  Speed
# 100   578  100   578    0     0  99982      0 --:--:-- --:--:-- --:--:--  112k
# -------------------------------------------------------------------------------
```

```json=
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "slurm_jobs_total",
          "container": "metrics",
          "endpoint": "metrics",
          "instance": "192.168.0.89:8080",
          "job": "slurm-exporter",
          "namespace": "slurm",
          "pod": "slurm-exporter-cf8944f49-pd62b",
          "service": "slurm-exporter"
        },
        "value": [
          1757563263.28,
          "0"
        ]
      }
    ]
  }
}
```

- #### If there are several results, narrow the query down further: `namespace=slurm`

```
curl -s 'http://localhost:9090/api/v1/query?query=slurm_jobs_total{namespace="slurm"}' | jq
```

- #### **Result (an error message)**

```
{
  "status": "error",
  "errorType": "bad_data",
  "error": "invalid parameter \"query\": 1:26: parse error: unexpected \"=\""
}
```

- #### The label matcher must be URL-encoded

> https://meyerweb.com/eric/tools/dencoder/
> before encoding: `{namespace="slurm"}`
> after encoding: `%7Bnamespace%3D%22slurm%22%7D`

- #### The corrected curl command:

```
$ curl -s http://localhost:9090/api/v1/query?query=slurm_jobs_total%7Bnamespace%3D%22slurm%22%7D | jq
```

```json=
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "slurm_jobs_total",
          "container": "metrics",
          "endpoint": "metrics",
          "instance": "192.168.0.89:8080",
          "job": "slurm-exporter",
          "namespace": "slurm",
          "pod": "slurm-exporter-cf8944f49-pd62b",
          "service": "slurm-exporter"
        },
        "value": [
          1757563884.951,
          "0"
        ]
      }
    ]
  }
}
```

- #### Let curl encode the data part for you

```bash
$ curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=slurm_jobs_total{namespace="slurm"}' | jq

# --data-urlencode can be swapped for --data / -d here
```

- With `-X POST`, same as above:

```
$ curl -X POST -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=slurm_jobs_total{namespace="slurm"}' | jq
```

- With `-X GET`, add the `-G` flag:
  > `-G`: append the data to the URL as a query string (`?a=b&c=d`) instead of putting it in the HTTP body

```
$ curl -X GET -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=slurm_jobs_total{namespace="slurm"}' | jq
```

- #### Final command

```bash
$ curl -s 'http://localhost:9090/api/v1/query' \
    -d 'query=slurm_jobs_total{namespace="slurm"}' | jq
```

- ### 5. More examples

    - **Exclude a namespace**:
    ```
    slurm_jobs_total{namespace!="xxx-slurm"}
    ```
    - **Several namespaces (regex)**:
    ```
    slurm_jobs_total{namespace=~"slurm|prod-slurm"}
    ```

<br>

---

<br>

## 7. Troubleshooting

- **The ServiceMonitor has no effect**
  → Check the Prometheus `serviceMonitorSelector`; by default it requires the label `release=kube-prometheus-stack`.
- **`/metrics` cannot be fetched**
  → Look at the Pod log first:
    ```bash
    kubectl -n slurm logs deploy/slurm-exporter
    ```
- **Endpoints is empty**
  → The Service selector labels do not match the Pod.
- **Targets shows DOWN**
  → Check whether a NetworkPolicy blocks Prometheus in the `monitor` namespace from reaching the Service in the `slurm` namespace.

<br>

---
---

<br>

## Appendix: what `/metrics` looks like

> Sample output from `http://slurm-exporter:8080/metrics`

```ini=
# HELP go_gc_duration_seconds A summary of the wall-time pause (stop-the-world) duration in garbage collection cycles.
# TYPE go_gc_duration_seconds summary go_gc_duration_seconds{quantile="0"} 0.000158618 go_gc_duration_seconds{quantile="0.25"} 0.000293336 go_gc_duration_seconds{quantile="0.5"} 0.000321119 go_gc_duration_seconds{quantile="0.75"} 0.000362432 go_gc_duration_seconds{quantile="1"} 0.007996102 go_gc_duration_seconds_sum 0.368574099 go_gc_duration_seconds_count 1153 # HELP go_gc_gogc_percent Heap size target percentage configured by the user, otherwise 100. This value is set by the GOGC environment variable, and the runtime/debug.SetGCPercent function. Sourced from /gc/gogc:percent. # TYPE go_gc_gogc_percent gauge go_gc_gogc_percent 100 # HELP go_gc_gomemlimit_bytes Go runtime memory limit configured by the user, otherwise math.MaxInt64. This value is set by the GOMEMLIMIT environment variable, and the runtime/debug.SetMemoryLimit function. Sourced from /gc/gomemlimit:bytes. # TYPE go_gc_gomemlimit_bytes gauge go_gc_gomemlimit_bytes 9.223372036854776e+18 # HELP go_goroutines Number of goroutines that currently exist. # TYPE go_goroutines gauge go_goroutines 25 # HELP go_info Information about the Go environment. # TYPE go_info gauge go_info{version="go1.24.5"} 1 # HELP go_memstats_alloc_bytes Number of bytes allocated in heap and currently in use. Equals to /memory/classes/heap/objects:bytes. # TYPE go_memstats_alloc_bytes gauge go_memstats_alloc_bytes 8.16408e+06 # HELP go_memstats_alloc_bytes_total Total number of bytes allocated in heap until now, even if released already. Equals to /gc/heap/allocs:bytes. # TYPE go_memstats_alloc_bytes_total counter go_memstats_alloc_bytes_total 6.231564864e+09 # HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table. Equals to /memory/classes/profiling/buckets:bytes. # TYPE go_memstats_buck_hash_sys_bytes gauge go_memstats_buck_hash_sys_bytes 1.752257e+06 # HELP go_memstats_frees_total Total number of heap objects frees. Equals to /gc/heap/frees:objects + /gc/heap/tiny/allocs:objects. # TYPE go_memstats_frees_total counter go_memstats_frees_total 3.6451928e+07 # HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata. Equals to /memory/classes/metadata/other:bytes. # TYPE go_memstats_gc_sys_bytes gauge go_memstats_gc_sys_bytes 3.53308e+06 # HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and currently in use, same as go_memstats_alloc_bytes. Equals to /memory/classes/heap/objects:bytes. # TYPE go_memstats_heap_alloc_bytes gauge go_memstats_heap_alloc_bytes 8.16408e+06 # HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used. Equals to /memory/classes/heap/released:bytes + /memory/classes/heap/free:bytes. # TYPE go_memstats_heap_idle_bytes gauge go_memstats_heap_idle_bytes 1.908736e+07 # HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use. Equals to /memory/classes/heap/objects:bytes + /memory/classes/heap/unused:bytes # TYPE go_memstats_heap_inuse_bytes gauge go_memstats_heap_inuse_bytes 1.1845632e+07 # HELP go_memstats_heap_objects Number of currently allocated objects. Equals to /gc/heap/objects:objects. # TYPE go_memstats_heap_objects gauge go_memstats_heap_objects 18082 # HELP go_memstats_heap_released_bytes Number of heap bytes released to OS. Equals to /memory/classes/heap/released:bytes. # TYPE go_memstats_heap_released_bytes gauge go_memstats_heap_released_bytes 1.1870208e+07 # HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system. 
Equals to /memory/classes/heap/objects:bytes + /memory/classes/heap/unused:bytes + /memory/classes/heap/released:bytes + /memory/classes/heap/free:bytes. # TYPE go_memstats_heap_sys_bytes gauge go_memstats_heap_sys_bytes 3.0932992e+07 # HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection. # TYPE go_memstats_last_gc_time_seconds gauge go_memstats_last_gc_time_seconds 1.7574158133290594e+09 # HELP go_memstats_mallocs_total Total number of heap objects allocated, both live and gc-ed. Semantically a counter version for go_memstats_heap_objects gauge. Equals to /gc/heap/allocs:objects + /gc/heap/tiny/allocs:objects. # TYPE go_memstats_mallocs_total counter go_memstats_mallocs_total 3.647001e+07 # HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures. Equals to /memory/classes/metadata/mcache/inuse:bytes. # TYPE go_memstats_mcache_inuse_bytes gauge go_memstats_mcache_inuse_bytes 106304 # HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system. Equals to /memory/classes/metadata/mcache/inuse:bytes + /memory/classes/metadata/mcache/free:bytes. # TYPE go_memstats_mcache_sys_bytes gauge go_memstats_mcache_sys_bytes 109928 # HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures. Equals to /memory/classes/metadata/mspan/inuse:bytes. # TYPE go_memstats_mspan_inuse_bytes gauge go_memstats_mspan_inuse_bytes 428320 # HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system. Equals to /memory/classes/metadata/mspan/inuse:bytes + /memory/classes/metadata/mspan/free:bytes. # TYPE go_memstats_mspan_sys_bytes gauge go_memstats_mspan_sys_bytes 522240 # HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place. Equals to /gc/heap/goal:bytes. # TYPE go_memstats_next_gc_bytes gauge go_memstats_next_gc_bytes 1.6529442e+07 # HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations. Equals to /memory/classes/other:bytes. # TYPE go_memstats_other_sys_bytes gauge go_memstats_other_sys_bytes 6.293959e+06 # HELP go_memstats_stack_inuse_bytes Number of bytes obtained from system for stack allocator in non-CGO environments. Equals to /memory/classes/heap/stacks:bytes. # TYPE go_memstats_stack_inuse_bytes gauge go_memstats_stack_inuse_bytes 2.62144e+06 # HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator. Equals to /memory/classes/heap/stacks:bytes + /memory/classes/os-stacks:bytes. # TYPE go_memstats_stack_sys_bytes gauge go_memstats_stack_sys_bytes 2.62144e+06 # HELP go_memstats_sys_bytes Number of bytes obtained from system. Equals to /memory/classes/total:byte. # TYPE go_memstats_sys_bytes gauge go_memstats_sys_bytes 4.5765896e+07 # HELP go_sched_gomaxprocs_threads The current runtime.GOMAXPROCS setting, or the number of operating system threads that can execute user-level Go code simultaneously. Sourced from /sched/gomaxprocs:threads. # TYPE go_sched_gomaxprocs_threads gauge go_sched_gomaxprocs_threads 88 # HELP go_threads Number of OS threads created. # TYPE go_threads gauge go_threads 32 # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds. # TYPE process_cpu_seconds_total counter process_cpu_seconds_total 119.73 # HELP process_max_fds Maximum number of open file descriptors. 
# TYPE process_max_fds gauge process_max_fds 1.048576e+06 # HELP process_network_receive_bytes_total Number of bytes received by the process over the network. # TYPE process_network_receive_bytes_total counter process_network_receive_bytes_total 1.22111852e+08 # HELP process_network_transmit_bytes_total Number of bytes sent by the process over the network. # TYPE process_network_transmit_bytes_total counter process_network_transmit_bytes_total 4.1687486e+07 # HELP process_open_fds Number of open file descriptors. # TYPE process_open_fds gauge process_open_fds 11 # HELP process_resident_memory_bytes Resident memory size in bytes. # TYPE process_resident_memory_bytes gauge process_resident_memory_bytes 5.1982336e+07 # HELP process_start_time_seconds Start time of the process since unix epoch in seconds. # TYPE process_start_time_seconds gauge process_start_time_seconds 1.75739091263e+09 # HELP process_virtual_memory_bytes Virtual memory size in bytes. # TYPE process_virtual_memory_bytes gauge process_virtual_memory_bytes 1.30664448e+09 # HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes. # TYPE process_virtual_memory_max_bytes gauge process_virtual_memory_max_bytes 1.8446744073709552e+19 # HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served. # TYPE promhttp_metric_handler_requests_in_flight gauge promhttp_metric_handler_requests_in_flight 1 # HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code. # TYPE promhttp_metric_handler_requests_total counter promhttp_metric_handler_requests_total{code="200"} 5012 promhttp_metric_handler_requests_total{code="500"} 0 promhttp_metric_handler_requests_total{code="503"} 0 # HELP slurm_bfscheduler_active_bool Backfill scheduler currently running # TYPE slurm_bfscheduler_active_bool gauge slurm_bfscheduler_active_bool 0 # HELP slurm_bfscheduler_backfilledhetjobs_total Number of heterogeneous job components started through backfilling since last Slurm start # TYPE slurm_bfscheduler_backfilledhetjobs_total gauge slurm_bfscheduler_backfilledhetjobs_total 0 # HELP slurm_bfscheduler_backfilledjobs_total Number of jobs started through backfilling since last slurm start # TYPE slurm_bfscheduler_backfilledjobs_total gauge slurm_bfscheduler_backfilledjobs_total 2 # HELP slurm_bfscheduler_cycle_max_seconds Execution time in microseconds of longest backfill scheduling cycle # TYPE slurm_bfscheduler_cycle_max_seconds gauge slurm_bfscheduler_cycle_max_seconds 1688 # HELP slurm_bfscheduler_cycle_mean_seconds Mean time in microseconds of backfilling scheduling cycles since last reset # TYPE slurm_bfscheduler_cycle_mean_seconds gauge slurm_bfscheduler_cycle_mean_seconds 922 # HELP slurm_bfscheduler_cycle_seconds Execution time in microseconds of last backfill scheduling cycle # TYPE slurm_bfscheduler_cycle_seconds gauge slurm_bfscheduler_cycle_seconds 1111 # HELP slurm_bfscheduler_cycle_sum_seconds Total time in microseconds of backfilling scheduling cycles since last reset # TYPE slurm_bfscheduler_cycle_sum_seconds gauge slurm_bfscheduler_cycle_sum_seconds 41498 # HELP slurm_bfscheduler_cycle_total Number of backfill scheduling cycles since last reset # TYPE slurm_bfscheduler_cycle_total gauge slurm_bfscheduler_cycle_total 45 # HELP slurm_bfscheduler_depth_mean_total Mean number of eligible to run jobs processed during all backfilling scheduling cycles since last reset # TYPE slurm_bfscheduler_depth_mean_total gauge slurm_bfscheduler_depth_mean_total 2 # HELP 
slurm_bfscheduler_depth_sum_total Total number of jobs processed during all backfilling scheduling cycles since last reset # TYPE slurm_bfscheduler_depth_sum_total gauge slurm_bfscheduler_depth_sum_total 109 # HELP slurm_bfscheduler_depth_try_total The subset of Depth Mean that the backfill scheduler attempted to schedule # TYPE slurm_bfscheduler_depth_try_total gauge slurm_bfscheduler_depth_try_total 2 # HELP slurm_bfscheduler_depth_trysum_total Subset of bf_depth_sum that the backfill scheduler attempted to schedule # TYPE slurm_bfscheduler_depth_trysum_total gauge slurm_bfscheduler_depth_trysum_total 109 # HELP slurm_bfscheduler_endjobqueue_total Reached RPC limit # TYPE slurm_bfscheduler_endjobqueue_total gauge slurm_bfscheduler_endjobqueue_total 45 # HELP slurm_bfscheduler_lastbackfilledjobs_total Number of jobs started through backfilling since last reset # TYPE slurm_bfscheduler_lastbackfilledjobs_total gauge slurm_bfscheduler_lastbackfilledjobs_total 2 # HELP slurm_bfscheduler_lastcycle_timestamp When the last backfill scheduling cycle happened (UNIX timestamp) # TYPE slurm_bfscheduler_lastcycle_timestamp gauge slurm_bfscheduler_lastcycle_timestamp 1.757409677e+09 # HELP slurm_bfscheduler_lastdepth_total Number of processed jobs during last backfilling scheduling cycle # TYPE slurm_bfscheduler_lastdepth_total gauge slurm_bfscheduler_lastdepth_total 3 # HELP slurm_bfscheduler_lastdepthtry_total Number of processed jobs during last backfilling scheduling cycle that had a chance to start using available resources # TYPE slurm_bfscheduler_lastdepthtry_total gauge slurm_bfscheduler_lastdepthtry_total 3 # HELP slurm_bfscheduler_maxjobstart_total Reached number of jobs allowed to be tested # TYPE slurm_bfscheduler_maxjobstart_total gauge slurm_bfscheduler_maxjobstart_total 0 # HELP slurm_bfscheduler_maxjobtest_total Reached end of queue # TYPE slurm_bfscheduler_maxjobtest_total gauge slurm_bfscheduler_maxjobtest_total 0 # HELP slurm_bfscheduler_maxtime_total Blocked on licenses # TYPE slurm_bfscheduler_maxtime_total gauge slurm_bfscheduler_maxtime_total 0 # HELP slurm_bfscheduler_nodespace_total Reached table size limit # TYPE slurm_bfscheduler_nodespace_total gauge slurm_bfscheduler_nodespace_total 0 # HELP slurm_bfscheduler_queue_mean_total Mean number of jobs pending to be processed by backfilling algorithm # TYPE slurm_bfscheduler_queue_mean_total gauge slurm_bfscheduler_queue_mean_total 2 # HELP slurm_bfscheduler_queue_sum_total Total number of jobs pending to be processed by backfilling algorithm since last reset # TYPE slurm_bfscheduler_queue_sum_total gauge slurm_bfscheduler_queue_sum_total 109 # HELP slurm_bfscheduler_queue_total Number of jobs pending to be processed by backfilling algorithm # TYPE slurm_bfscheduler_queue_total gauge slurm_bfscheduler_queue_total 3 # HELP slurm_bfscheduler_statechanged_total Reached maximum allowed scheduler time # TYPE slurm_bfscheduler_statechanged_total gauge slurm_bfscheduler_statechanged_total 0 # HELP slurm_bfscheduler_table_total Number of different time slots tested by the backfill scheduler in its last iteration # TYPE slurm_bfscheduler_table_total gauge slurm_bfscheduler_table_total 6 # HELP slurm_bfscheduler_tablemean_total Mean number of different time slots tested by the backfill scheduler # TYPE slurm_bfscheduler_tablemean_total gauge slurm_bfscheduler_tablemean_total 2 # HELP slurm_bfscheduler_tablesum_total Total number of different time slots tested by the backfill scheduler # TYPE slurm_bfscheduler_tablesum_total gauge 
slurm_bfscheduler_tablesum_total 204 # HELP slurm_jobs_bootfail_total Number of jobs in BootFail state # TYPE slurm_jobs_bootfail_total gauge slurm_jobs_bootfail_total 0 # HELP slurm_jobs_cancelled_total Number of jobs in Cancelled state # TYPE slurm_jobs_cancelled_total gauge slurm_jobs_cancelled_total 0 # HELP slurm_jobs_completed_total Number of jobs in Completed state # TYPE slurm_jobs_completed_total gauge slurm_jobs_completed_total 0 # HELP slurm_jobs_completing_total Number of jobs with Completing flag # TYPE slurm_jobs_completing_total gauge slurm_jobs_completing_total 0 # HELP slurm_jobs_configuring_total Number of jobs with Configuring flag # TYPE slurm_jobs_configuring_total gauge slurm_jobs_configuring_total 0 # HELP slurm_jobs_cpus_alloc_total Number of Allocated CPUs among jobs # TYPE slurm_jobs_cpus_alloc_total gauge slurm_jobs_cpus_alloc_total 0 # HELP slurm_jobs_deadline_total Number of jobs in Deadline state # TYPE slurm_jobs_deadline_total gauge slurm_jobs_deadline_total 0 # HELP slurm_jobs_failed_total Number of jobs in Failed state # TYPE slurm_jobs_failed_total gauge slurm_jobs_failed_total 0 # HELP slurm_jobs_hold_total Number of jobs with Hold flag # TYPE slurm_jobs_hold_total gauge slurm_jobs_hold_total 0 # HELP slurm_jobs_memory_alloc_bytes Amount of Allocated Memory (MB) among jobs # TYPE slurm_jobs_memory_alloc_bytes gauge slurm_jobs_memory_alloc_bytes 0 # HELP slurm_jobs_nodefail_total Number of jobs in NodeFail state # TYPE slurm_jobs_nodefail_total gauge slurm_jobs_nodefail_total 0 # HELP slurm_jobs_outofmemory_total Number of jobs in OutOfMemory state # TYPE slurm_jobs_outofmemory_total gauge slurm_jobs_outofmemory_total 0 # HELP slurm_jobs_pending_total Number of jobs in Pending state # TYPE slurm_jobs_pending_total gauge slurm_jobs_pending_total 0 # HELP slurm_jobs_powerupnode_total Number of jobs with PowerUpNode flag # TYPE slurm_jobs_powerupnode_total gauge slurm_jobs_powerupnode_total 0 # HELP slurm_jobs_preempted_total Number of jobs in Preempted state # TYPE slurm_jobs_preempted_total gauge slurm_jobs_preempted_total 0 # HELP slurm_jobs_running_total Number of jobs in Running state # TYPE slurm_jobs_running_total gauge slurm_jobs_running_total 0 # HELP slurm_jobs_stageout_total Number of jobs with StageOut flag # TYPE slurm_jobs_stageout_total gauge slurm_jobs_stageout_total 0 # HELP slurm_jobs_suspended_total Number of jobs in Suspended state # TYPE slurm_jobs_suspended_total gauge slurm_jobs_suspended_total 0 # HELP slurm_jobs_timeout_total Number of jobs in Timeout state # TYPE slurm_jobs_timeout_total gauge slurm_jobs_timeout_total 0 # HELP slurm_jobs_total Total number of jobs # TYPE slurm_jobs_total gauge slurm_jobs_total 0 # HELP slurm_nodes_allocated_total Number of nodes in Allocated state # TYPE slurm_nodes_allocated_total gauge slurm_nodes_allocated_total 0 # HELP slurm_nodes_completing_total Number of nodes with Completing flag # TYPE slurm_nodes_completing_total gauge slurm_nodes_completing_total 0 # HELP slurm_nodes_down_total Number of nodes in Down state # TYPE slurm_nodes_down_total gauge slurm_nodes_down_total 0 # HELP slurm_nodes_drain_total Number of nodes with Drain flag # TYPE slurm_nodes_drain_total gauge slurm_nodes_drain_total 0 # HELP slurm_nodes_error_total Number of nodes in Error state # TYPE slurm_nodes_error_total gauge slurm_nodes_error_total 0 # HELP slurm_nodes_fail_total Number of nodes with Fail flag # TYPE slurm_nodes_fail_total gauge slurm_nodes_fail_total 0 # HELP slurm_nodes_future_total Number of nodes in 
Future state # TYPE slurm_nodes_future_total gauge slurm_nodes_future_total 0 # HELP slurm_nodes_idle_total Number of nodes in Idle state # TYPE slurm_nodes_idle_total gauge slurm_nodes_idle_total 0 # HELP slurm_nodes_maintenance_total Number of nodes with Maintenance flag # TYPE slurm_nodes_maintenance_total gauge slurm_nodes_maintenance_total 0 # HELP slurm_nodes_mixed_total Number of nodes in Mixed state # TYPE slurm_nodes_mixed_total gauge slurm_nodes_mixed_total 0 # HELP slurm_nodes_notresponding_total Number of nodes with NotResponding flag # TYPE slurm_nodes_notresponding_total gauge slurm_nodes_notresponding_total 0 # HELP slurm_nodes_planned_total Number of nodes with Planned flag # TYPE slurm_nodes_planned_total gauge slurm_nodes_planned_total 0 # HELP slurm_nodes_rebootrequested_total Number of nodes with RebootRequested flag # TYPE slurm_nodes_rebootrequested_total gauge slurm_nodes_rebootrequested_total 0 # HELP slurm_nodes_reserved_total Number of nodes with Reserved flag # TYPE slurm_nodes_reserved_total gauge slurm_nodes_reserved_total 0 # HELP slurm_nodes_total Total number of nodes # TYPE slurm_nodes_total gauge slurm_nodes_total 0 # HELP slurm_nodes_unknown_total Number of nodes in Unknown state # TYPE slurm_nodes_unknown_total gauge slurm_nodes_unknown_total 0 # HELP slurm_partition_jobs_bootfail_total Number of jobs in BootFail state in the partition # TYPE slurm_partition_jobs_bootfail_total gauge slurm_partition_jobs_bootfail_total{partition="tp"} 0 # HELP slurm_partition_jobs_cancelled_total Number of jobs in Cancelled state in the partition # TYPE slurm_partition_jobs_cancelled_total gauge slurm_partition_jobs_cancelled_total{partition="tp"} 0 # HELP slurm_partition_jobs_completed_total Number of jobs in Completed state in the partition # TYPE slurm_partition_jobs_completed_total gauge slurm_partition_jobs_completed_total{partition="tp"} 0 # HELP slurm_partition_jobs_completing_total Number of jobs with Completing flag in the partition # TYPE slurm_partition_jobs_completing_total gauge slurm_partition_jobs_completing_total{partition="tp"} 0 # HELP slurm_partition_jobs_configuring_total Number of jobs with Configuring flag in the partition # TYPE slurm_partition_jobs_configuring_total gauge slurm_partition_jobs_configuring_total{partition="tp"} 0 # HELP slurm_partition_jobs_cpus_alloc_total Number of Allocated CPUs among jobs in the partition # TYPE slurm_partition_jobs_cpus_alloc_total gauge slurm_partition_jobs_cpus_alloc_total{partition="tp"} 0 # HELP slurm_partition_jobs_deadline_total Number of jobs in Deadline state in the partition # TYPE slurm_partition_jobs_deadline_total gauge slurm_partition_jobs_deadline_total{partition="tp"} 0 # HELP slurm_partition_jobs_failed_total Number of jobs in Failed state in the partition # TYPE slurm_partition_jobs_failed_total gauge slurm_partition_jobs_failed_total{partition="tp"} 0 # HELP slurm_partition_jobs_hold_total Number of jobs with Hold flag in the partition # TYPE slurm_partition_jobs_hold_total gauge slurm_partition_jobs_hold_total{partition="tp"} 0 # HELP slurm_partition_jobs_memory_alloc_bytes Amount of Allocated Memory (MB) among jobs in the partition # TYPE slurm_partition_jobs_memory_alloc_bytes gauge slurm_partition_jobs_memory_alloc_bytes{partition="tp"} 0 # HELP slurm_partition_jobs_nodefail_total Number of jobs in NodeFail state in the partition # TYPE slurm_partition_jobs_nodefail_total gauge slurm_partition_jobs_nodefail_total{partition="tp"} 0 # HELP slurm_partition_jobs_outofmemory_total Number of 
jobs in OutOfMemory state in the partition # TYPE slurm_partition_jobs_outofmemory_total gauge slurm_partition_jobs_outofmemory_total{partition="tp"} 0 # HELP slurm_partition_jobs_pending_maxnodecount_total Largest number of nodes required among pending jobs in the partition # TYPE slurm_partition_jobs_pending_maxnodecount_total gauge slurm_partition_jobs_pending_maxnodecount_total{partition="tp"} 0 # HELP slurm_partition_jobs_pending_total Number of jobs in Pending state in the partition # TYPE slurm_partition_jobs_pending_total gauge slurm_partition_jobs_pending_total{partition="tp"} 0 # HELP slurm_partition_jobs_powerupnode_total Number of jobs with PowerUpNode flag in the partition # TYPE slurm_partition_jobs_powerupnode_total gauge slurm_partition_jobs_powerupnode_total{partition="tp"} 0 # HELP slurm_partition_jobs_preempted_total Number of jobs in Preempted state in the partition # TYPE slurm_partition_jobs_preempted_total gauge slurm_partition_jobs_preempted_total{partition="tp"} 0 # HELP slurm_partition_jobs_running_total Number of jobs in Running state in the partition # TYPE slurm_partition_jobs_running_total gauge slurm_partition_jobs_running_total{partition="tp"} 0 # HELP slurm_partition_jobs_stageout_total Number of jobs with StageOut flag in the partition # TYPE slurm_partition_jobs_stageout_total gauge slurm_partition_jobs_stageout_total{partition="tp"} 0 # HELP slurm_partition_jobs_suspended_total Number of jobs in Suspended state in the partition # TYPE slurm_partition_jobs_suspended_total gauge slurm_partition_jobs_suspended_total{partition="tp"} 0 # HELP slurm_partition_jobs_timeout_total Number of jobs in Timeout state in the partition # TYPE slurm_partition_jobs_timeout_total gauge slurm_partition_jobs_timeout_total{partition="tp"} 0 # HELP slurm_partition_jobs_total Total number of jobs in the partition # TYPE slurm_partition_jobs_total gauge slurm_partition_jobs_total{partition="tp"} 0 # HELP slurm_partition_nodes_allocated_total Number of nodes in Allocated state # TYPE slurm_partition_nodes_allocated_total gauge slurm_partition_nodes_allocated_total{partition="tp"} 0 # HELP slurm_partition_nodes_completing_total Number of nodes with Completing flag # TYPE slurm_partition_nodes_completing_total gauge slurm_partition_nodes_completing_total{partition="tp"} 0 # HELP slurm_partition_nodes_cpus_alloc_total Number of Allocated CPUs on the node # TYPE slurm_partition_nodes_cpus_alloc_total gauge slurm_partition_nodes_cpus_alloc_total{partition="tp"} 0 # HELP slurm_partition_nodes_cpus_effective_total Total number of effective CPUs on the node, excludes CoreSpec # TYPE slurm_partition_nodes_cpus_effective_total gauge slurm_partition_nodes_cpus_effective_total{partition="tp"} 0 # HELP slurm_partition_nodes_cpus_idle_total Number of Idle CPUs on the node # TYPE slurm_partition_nodes_cpus_idle_total gauge slurm_partition_nodes_cpus_idle_total{partition="tp"} 0 # HELP slurm_partition_nodes_cpus_total Total number of CPUs on the node # TYPE slurm_partition_nodes_cpus_total gauge slurm_partition_nodes_cpus_total{partition="tp"} 0 # HELP slurm_partition_nodes_down_total Number of nodes in Down state # TYPE slurm_partition_nodes_down_total gauge slurm_partition_nodes_down_total{partition="tp"} 0 # HELP slurm_partition_nodes_drain_total Number of nodes with Drain flag # TYPE slurm_partition_nodes_drain_total gauge slurm_partition_nodes_drain_total{partition="tp"} 0 # HELP slurm_partition_nodes_error_total Number of nodes in Error state # TYPE slurm_partition_nodes_error_total gauge 
slurm_partition_nodes_error_total{partition="tp"} 0 # HELP slurm_partition_nodes_fail_total Number of nodes with Fail flag # TYPE slurm_partition_nodes_fail_total gauge slurm_partition_nodes_fail_total{partition="tp"} 0 # HELP slurm_partition_nodes_future_total Number of nodes in Future state # TYPE slurm_partition_nodes_future_total gauge slurm_partition_nodes_future_total{partition="tp"} 0 # HELP slurm_partition_nodes_idle_total Number of nodes in Idle state # TYPE slurm_partition_nodes_idle_total gauge slurm_partition_nodes_idle_total{partition="tp"} 0 # HELP slurm_partition_nodes_maintenance_total Number of nodes with Maintenance flag # TYPE slurm_partition_nodes_maintenance_total gauge slurm_partition_nodes_maintenance_total{partition="tp"} 0 # HELP slurm_partition_nodes_memory_alloc_bytes Amount of Allocated Memory (MB) on the node # TYPE slurm_partition_nodes_memory_alloc_bytes gauge slurm_partition_nodes_memory_alloc_bytes{partition="tp"} 0 # HELP slurm_partition_nodes_memory_bytes Total amount of Memory (MB) on the node # TYPE slurm_partition_nodes_memory_bytes gauge slurm_partition_nodes_memory_bytes{partition="tp"} 0 # HELP slurm_partition_nodes_memory_effective_bytes Total amount of effective Memory (MB) on the node, excludes MemSpec # TYPE slurm_partition_nodes_memory_effective_bytes gauge slurm_partition_nodes_memory_effective_bytes{partition="tp"} 0 # HELP slurm_partition_nodes_memory_free_bytes Amount of Free Memory (MB) on the node # TYPE slurm_partition_nodes_memory_free_bytes gauge slurm_partition_nodes_memory_free_bytes{partition="tp"} 0 # HELP slurm_partition_nodes_mixed_total Number of nodes in Mixed state # TYPE slurm_partition_nodes_mixed_total gauge slurm_partition_nodes_mixed_total{partition="tp"} 0 # HELP slurm_partition_nodes_notresponding_total Number of nodes with NotResponding flag # TYPE slurm_partition_nodes_notresponding_total gauge slurm_partition_nodes_notresponding_total{partition="tp"} 0 # HELP slurm_partition_nodes_planned_total Number of nodes with Planned flag # TYPE slurm_partition_nodes_planned_total gauge slurm_partition_nodes_planned_total{partition="tp"} 0 # HELP slurm_partition_nodes_rebootrequested_total Number of nodes with RebootRequested flag # TYPE slurm_partition_nodes_rebootrequested_total gauge slurm_partition_nodes_rebootrequested_total{partition="tp"} 0 # HELP slurm_partition_nodes_reserved_total Number of nodes with Reserved flag # TYPE slurm_partition_nodes_reserved_total gauge slurm_partition_nodes_reserved_total{partition="tp"} 0 # HELP slurm_partition_nodes_total Total number of slurm nodes # TYPE slurm_partition_nodes_total gauge slurm_partition_nodes_total{partition="tp"} 0 # HELP slurm_partition_nodes_unknown_total Number of nodes in Unknown state # TYPE slurm_partition_nodes_unknown_total gauge slurm_partition_nodes_unknown_total{partition="tp"} 0 # HELP slurm_scheduler_agent_queue_total Number of enqueued outgoing RPC requests in an internal retry list # TYPE slurm_scheduler_agent_queue_total gauge slurm_scheduler_agent_queue_total 0 # HELP slurm_scheduler_agent_thread_total Total number of active threads created by all agent threads # TYPE slurm_scheduler_agent_thread_total gauge slurm_scheduler_agent_thread_total 0 # HELP slurm_scheduler_agent_total Number of agent threads # TYPE slurm_scheduler_agent_total gauge slurm_scheduler_agent_total 0 # HELP slurm_scheduler_cycle_depth_mean_total Mean of the number of jobs processed in a scheduling # TYPE slurm_scheduler_cycle_depth_mean_total gauge 
slurm_scheduler_cycle_depth_mean_total 0 # HELP slurm_scheduler_cycle_depth_total Total number of jobs processed in scheduling cycles # TYPE slurm_scheduler_cycle_depth_total gauge slurm_scheduler_cycle_depth_total 131 # HELP slurm_scheduler_cycle_last_seconds Time in microseconds for last scheduling cycle # TYPE slurm_scheduler_cycle_last_seconds gauge slurm_scheduler_cycle_last_seconds 100 # HELP slurm_scheduler_cycle_max_seconds Max time of any scheduling cycle in microseconds since last reset # TYPE slurm_scheduler_cycle_max_seconds gauge slurm_scheduler_cycle_max_seconds 19940 # HELP slurm_scheduler_cycle_mean_seconds Mean time in microseconds for all scheduling cycles since last reset # TYPE slurm_scheduler_cycle_mean_seconds gauge slurm_scheduler_cycle_mean_seconds 208 # HELP slurm_scheduler_cycle_perminute_total Number of scheduling executions per minute # TYPE slurm_scheduler_cycle_perminute_total gauge slurm_scheduler_cycle_perminute_total 1 # HELP slurm_scheduler_cycle_sum_seconds_total Total run time in microseconds for all scheduling cycles since last reset # TYPE slurm_scheduler_cycle_sum_seconds_total gauge slurm_scheduler_cycle_sum_seconds_total 98826 # HELP slurm_scheduler_cycle_total Number of scheduling cycles since last reset # TYPE slurm_scheduler_cycle_total gauge slurm_scheduler_cycle_total 474 # HELP slurm_scheduler_dbdagentqueue_total Number of messages for SlurmDBD that are queued # TYPE slurm_scheduler_dbdagentqueue_total gauge slurm_scheduler_dbdagentqueue_total 0 # HELP slurm_scheduler_defaultqueuedepth_total Reached number of jobs allowed to be tested # TYPE slurm_scheduler_defaultqueuedepth_total gauge slurm_scheduler_defaultqueuedepth_total 0 # HELP slurm_scheduler_endjobqueue_total Reached end of queue # TYPE slurm_scheduler_endjobqueue_total gauge slurm_scheduler_endjobqueue_total 474 # HELP slurm_scheduler_jobs_canceled_total Number of jobs canceled since the last reset # TYPE slurm_scheduler_jobs_canceled_total gauge slurm_scheduler_jobs_canceled_total 12 # HELP slurm_scheduler_jobs_completed_total Number of jobs completed since last reset # TYPE slurm_scheduler_jobs_completed_total gauge slurm_scheduler_jobs_completed_total 15 # HELP slurm_scheduler_jobs_failed_total Number of jobs failed due to slurmd or other internal issues since last reset # TYPE slurm_scheduler_jobs_failed_total gauge slurm_scheduler_jobs_failed_total 0 # HELP slurm_scheduler_jobs_pending_total Number of jobs pending at the time of listed in job_state_ts # TYPE slurm_scheduler_jobs_pending_total gauge slurm_scheduler_jobs_pending_total 0 # HELP slurm_scheduler_jobs_running_total Number of jobs running at the time of listed in job_state_ts # TYPE slurm_scheduler_jobs_running_total gauge slurm_scheduler_jobs_running_total 0 # HELP slurm_scheduler_jobs_started_total Number of jobs started since last reset # TYPE slurm_scheduler_jobs_started_total gauge slurm_scheduler_jobs_started_total 19 # HELP slurm_scheduler_jobs_stats_timestamp When the job state counts were gathered (UNIX timestamp) # TYPE slurm_scheduler_jobs_stats_timestamp gauge slurm_scheduler_jobs_stats_timestamp 1.757415788e+09 # HELP slurm_scheduler_jobs_submitted_total Number of jobs submitted since last reset # TYPE slurm_scheduler_jobs_submitted_total gauge slurm_scheduler_jobs_submitted_total 27 # HELP slurm_scheduler_licenses_total Blocked on licenses # TYPE slurm_scheduler_licenses_total gauge slurm_scheduler_licenses_total 0 # HELP slurm_scheduler_maxjobstart_total Reached number of jobs allowed to start # TYPE 
slurm_scheduler_maxjobstart_total gauge slurm_scheduler_maxjobstart_total 0 # HELP slurm_scheduler_maxrpc_total Reached RPC limit # TYPE slurm_scheduler_maxrpc_total gauge slurm_scheduler_maxrpc_total 0 # HELP slurm_scheduler_maxschedtime_total Reached maximum allowed scheduler time # TYPE slurm_scheduler_maxschedtime_total gauge slurm_scheduler_maxschedtime_total 0 # HELP slurm_scheduler_queue_total Number of jobs pending in queue # TYPE slurm_scheduler_queue_total gauge slurm_scheduler_queue_total 0 # HELP slurm_scheduler_thread_total Number of current active slurmctld threads # TYPE slurm_scheduler_thread_total gauge slurm_scheduler_thread_total 1 ``` <br> {%hackmd vaaMgNRPS4KGJDSFG0ZE0w %}