Tutorial: Connecting slurm-exporter to Prometheus
===
###### tags: `Kubernetes`, `k8s`, `app`, `slurm`, `SlinkyProject`, `Autoscaling`, `KEDA`, `HPA`, `VPA`, `Scale out`, `Scale in`, `ServiceMonitor`
<br>
[TOC]
<br>
:::success
## 🎯 Goal
**Enable Prometheus to automatically scrape slurm-exporter's `/metrics` through a ServiceMonitor, and verify the entire pipeline.**
:::
:::success
## 🎯 Integration flow diagram
```
Prometheus (kube-prometheus-stack)
│
▼
ServiceMonitor (tells Prometheus where the slurm-exporter data source is)
│
▼
Service (slurm-exporter, port=metrics)
│
▼
Deployment/Pod (slurm-exporter, serves /metrics)
```
:::
---
## 1. Deploy Prometheus (using kube-prometheus-stack as the example)
If your cluster does not have Prometheus yet, the simplest option is to install **kube-prometheus-stack** with Helm:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install into the monitor namespace
helm upgrade -i kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace monitor --create-namespace
```
After installation you get:
* Prometheus CR (the operator watches ServiceMonitor/PodMonitor objects)
* Grafana
* Alertmanager
### Verification
```bash
kubectl -n monitor get pods
```
- You should see the `kube-prometheus-stack-*` pods up and running.
- **Example output**
```
$ kubectl -n monitor get pods
NAME READY STATUS RESTARTS AGE
alertmanager-kube-prometheus-stack-alertmanager-0 2/2 Running 0 6d1h
kube-prometheus-stack-grafana-cbd794898-wb72l 3/3 Running 0 6d1h
kube-prometheus-stack-kube-state-metrics-577d4b4c4f-mpvnb 1/1 Running 0 6d1h
kube-prometheus-stack-operator-7c84ff6bb7-x7nc2 1/1 Running 0 6d1h
kube-prometheus-stack-prometheus-node-exporter-rt678 1/1 Running 0 7h38m
prometheus-kube-prometheus-stack-prometheus-0 2/2 Running 0 6d1h
```
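- Optionally, also confirm that the Prometheus Operator CRDs are in place (the `ServiceMonitor` CRD is needed in step 3). A quick sanity-check sketch:
  ```bash
  # kube-prometheus-stack installs the monitoring.coreos.com CRDs
  kubectl get crd | grep monitoring.coreos.com
  ```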
<br>
---
## 2. Deploy slurm-exporter
> https://github.com/SlinkyProject/slurm-operator/blob/main/helm/slurm/values.yaml#L269-L279
Enable the exporter in `values.yaml`:
> - slurm:v0.3.0 -> disabled by default
> - slurm:v0.3.1 -> enabled by default
```yaml=
slurm-exporter:
enabled: true
exporter:
enabled: true
secretName: "slurm-token-exporter"
```
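To apply the change, upgrade the Slurm release with the updated values. A minimal sketch, assuming the release is named `slurm` in the `slurm` namespace and is installed from the SlinkyProject OCI chart; adjust the chart reference, release name, and values file to your environment:
```bash
helm upgrade -i slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --namespace slurm -f values.yaml
```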
Helm creates the Deployment + Service (but not a ServiceMonitor; you have to create that yourself).
```
# pod
NAME READY STATUS RESTARTS AGE
pod/slurm-exporter-cf8944f49-mzcbp 1/1 Running 0 6h31m
# service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/slurm-exporter ClusterIP None <none> 8080/TCP 26h
# deployment
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/slurm-exporter 1/1 1 1 26h
# replicaset
NAME DESIRED CURRENT READY AGE
replicaset.apps/slurm-exporter-cf8944f49 1 1 1 26h
```
- The Deployment manages the ReplicaSet, which in turn manages the Pods.
- The Service is not part of that ownership chain; it selects the Pods by label and routes traffic to them.
### Verification
```bash
kubectl -n slurm get deploy slurm-exporter
kubectl -n slurm get svc slurm-exporter -o yaml
```
- **Checks**
  - The Pod is Running and the container port is named `metrics`.
  - The Service has port 8080 and the port is named `metrics`.
- **Example output**
```bash=
$ kubectl -n slurm get deploy slurm-exporter
NAME READY UP-TO-DATE AVAILABLE AGE
slurm-exporter 1/1 1 1 26h
```
```yaml=
$ kubectl -n slurm get svc slurm-exporter -o yaml
apiVersion: v1
kind: Service
metadata:
annotations:
meta.helm.sh/release-name: slurm
meta.helm.sh/release-namespace: slurm
creationTimestamp: "2025-09-08T08:01:38Z"
labels:
app.kubernetes.io/component: slurm-exporter
app.kubernetes.io/instance: slurm-exporter
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: slurm-exporter
app.kubernetes.io/version: "25.05"
helm.sh/chart: slurm-exporter-0.3.1
name: slurm-exporter
namespace: slurm
resourceVersion: "6379719"
uid: 79c2541b-625b-455f-9f64-04949fd6ad79
spec:
clusterIP: None
clusterIPs:
- None
internalTrafficPolicy: Cluster
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
ports:
  - name: metrics # <-- the Service has port 8080 and the port is named `metrics`
port: 8080
protocol: TCP
targetPort: 8080
selector:
app.kubernetes.io/instance: slurm-exporter
app.kubernetes.io/name: slurm-exporter
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
```
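- To check the container port name on the Pod side as well, a small `jsonpath` sketch (it assumes the exporter Deployment declares its port under `.spec.template.spec.containers[].ports`):
  ```bash
  kubectl -n slurm get deploy slurm-exporter \
    -o jsonpath='{.spec.template.spec.containers[*].ports[*].name}{"\n"}'
  # per the checks above, this should print: metrics
  ```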
<br>
### Connectivity test
> We have confirmed the Service `slurm-exporter` exists, with port 8080 named `metrics`.
To test the exporter's `/metrics` endpoint with `curl`, you can do the following:
---
:::danger
### :warning: Testing from inside the slurm-exporter Pod does not work!
If you `exec` straight into the Pod, the container image does not include `sh`, `bash`, `curl`, `wget`, etc.:
```bash
# Find the pod name
$ kubectl -n slurm get pod -l app.kubernetes.io/name=slurm-exporter
# Exec into the container
$ kubectl -n slurm exec -it <pod-name> -- sh
error: Internal error occurred: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "9c5b0711edffcefae50032e3277a8f24456ddcb824a5344fc6c0b59cc2d8d690": OCI runtime exec failed: exec failed: unable to start container process: exec:
```
:::
- ### [Method 1] Test from **another Pod in the same namespace**
  Since the Service is named `slurm-exporter` and lives in the `slurm` namespace,
  any Pod in the same namespace can test it directly:
    - ### Using the busybox image
```bash
# One-liner: create a temporary pod
kubectl -n slurm run tmp-metric-dumper \
-it --rm --image=busybox:1.36 --restart=Never -- sh
# Inside busybox
wget -qO- http://slurm-exporter:8080/metrics
```
        - Equivalent to `-q -O-` or `-q -O -`
        - `-q`: quiet, do not print the progress bar
        - `-O FILE`: save to FILE (`'-'` means stdout)
    - ### Using the curl image
```bash
kubectl -n slurm run tmp-metric-dumper \
-it --rm --image=curlimages/curl --restart=Never -- \
curl http://slurm-exporter:8080/metrics
```
        - Or run it interactively in a shell
```bash
kubectl -n slurm run tmp-metric-dumper \
-it --rm --image=curlimages/curl --restart=Never -- sh
# interactive shell mode
~ $ curl http://slurm-exporter:8080/metrics | grep partition
```
- ### [Note] Any of the following domains will work
- `http://slurm-exporter:8080/metrics`
- `http://slurm-exporter.slurm:8080/metrics`
- `http://slurm-exporter.slurm.svc:8080/metrics`
- `http://slurm-exporter.slurm.svc.cluster.local:8080/metrics`
      (the **FQDN**, fully qualified domain name, of the Service)
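    - To confirm these names resolve from inside the cluster, a quick sketch using busybox's built-in `nslookup` (since the Service is headless, the name resolves directly to the Pod IP):
      ```bash
      kubectl -n slurm run tmp-dns-check \
        -it --rm --image=busybox:1.36 --restart=Never -- \
        nslookup slurm-exporter.slurm.svc.cluster.local
      ```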
---
- ### [Method 2] Test from another namespace
  You must include the namespace in the DNS name:
```
http://slurm-exporter.slurm.svc.cluster.local:8080/metrics
```
  Example: from the `default` namespace, access `/metrics` in the `slurm` namespace
```bash
kubectl -n default run tmp-metric-dumper \
-it --rm --image=curlimages/curl --restart=Never -- \
curl http://slurm-exporter.slurm.svc.cluster.local:8080/metrics
```
---
- ### [Method 3] Test from your local machine
  This Service is `ClusterIP: None` (a headless Service), so it can only be reached from inside the cluster.
  To curl it from outside (e.g. your desktop or laptop), you must first set up a port-forward:
```bash
kubectl port-forward -n slurm svc/slurm-exporter 8080:8080
```
  Then run on your local machine:
```bash
curl http://localhost:8080/metrics
```
---
### Conceptual flow
```
[Your program / browser]
        │
        │  connects to local 127.0.0.1:8080
        ▼
[local kubectl port-forward process]
        │
        │  establishes a TLS/SPDY tunnel with the K8s API Server
        ▼
[Kubernetes API Server]
        │
        │  forwards the "portforward" stream to the kubelet on the target node
        ▼
[kubelet (on the target Node)]
        │
        │  pipes the data into the Pod's matching containerPort (= targetPort)
        ▼
[Pod: slurm-exporter-xxxxx container]
```
---
### How `kubectl port-forward` works, in detail:
- ### Concept
  > Connections to local `127.0.0.1:8080` are forwarded, via the API server, to one of the Pods behind `svc/slurm-exporter` in the `slurm` namespace, landing on the Pod port (`targetPort`) that the Service's `8080` maps to. ([Kubernetes][1])
- ### Details
  - The `kubectl port-forward` command forwards traffic from **local `127.0.0.1:8080`** to the **Service `slurm-exporter` in the `slurm` namespace**. `kubectl` **automatically picks one Pod** behind that Service and builds the tunnel to it. ([Kubernetes][1])
  - The `8080` on the right refers to **one of the Service's ports**; `kubectl` follows the Service's `port -> targetPort` mapping and forwards the traffic to **the Pod's actual container port**. In other words, **even if the Service maps `port: 8080` to `targetPort: 9090`, connecting to local `8080` still lands on the Pod's `9090`**. ([Kubernetes][2])
- ### Tips
  - By default it listens only on the loopback address and is not exposed externally;
    to listen on other interfaces, add `--address 0.0.0.0` (mind the security implications). ([Kubernetes][2])
  - If the Service has multiple ports, prefer specifying the **port name**, e.g.
    `kubectl port-forward -n slurm svc/slurm-exporter 8080:metrics` (it forwards to the `targetPort` behind that port name). ([Kubernetes][2])
  - If the local port and the Service port are the same, the command can be further shortened to
```
kubectl port-forward -n slurm svc/slurm-exporter 8080
```
    Or use `metrics` to refer to 8080
```
kubectl port-forward -n slurm svc/slurm-exporter metrics
```
- `$ kubectl -n slurm get service/slurm-exporter -o yaml`
```yaml
spec:
clusterIP: None
...
ports:
  - name: metrics # <-- the port name
port: 8080
protocol: TCP
targetPort: 8080
```
[1]: https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/ "Use Port Forwarding to Access Applications in a Cluster | Kubernetes"
[2]: https://kubernetes.io/docs/reference/kubectl/generated/kubectl_port-forward/ "kubectl port-forward | Kubernetes"
---
👉 Start with **Method 1** (inside the same namespace) to confirm the exporter itself serves metrics correctly,
👉 then use **Method 2** or **Method 3** to confirm the Service and the external access path work.
<br>
---
<br>
## 3. Create the ServiceMonitor CR
> Tell Prometheus:
> - the data source is `-n slurm`, `svc/slurm-exporter`, `port: metrics`
> - scrape every 5 seconds
> - the default scrape path is `/metrics`
Create a new YAML file `servicemonitor-slurm-exporter.yaml`:
```yaml=
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: slurm-exporter
namespace: slurm
labels:
    release: kube-prometheus-stack # 👈 Prometheus uses this label to pick up ServiceMonitors (the critical part)
spec:
selector:
matchLabels:
app.kubernetes.io/name: slurm-exporter
app.kubernetes.io/instance: slurm-exporter
namespaceSelector:
matchNames:
- slurm
endpoints:
- port: metrics
interval: 5s
```
- ### Option A: create the ServiceMonitor:
```bash
kubectl apply -f servicemonitor-slurm-exporter.yaml
```
- ### Option B: apply the label to an existing ServiceMonitor on the fly:
```bash
kubectl -n slurm label servicemonitor/slurm-exporter \
release=kube-prometheus-stack --overwrite
```
- ### Verify the ServiceMonitor
```bash
kubectl -n slurm get servicemonitor slurm-exporter -o yaml
```
  Confirm the selector matches the Service's labels.
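  A quick way to compare the two sides (a sketch using `jsonpath`):
  ```bash
  # ServiceMonitor side: the labels it is looking for
  kubectl -n slurm get servicemonitor slurm-exporter -o jsonpath='{.spec.selector.matchLabels}{"\n"}'
  # Service side: the labels it actually carries
  kubectl -n slurm get svc slurm-exporter -o jsonpath='{.metadata.labels}{"\n"}'
  ```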
- ### Detailed explanation of the settings
  - **metadata.namespace: `slurm`**
    The ServiceMonitor object itself lives in the `slurm` namespace.
  - **metadata.labels.release: `kube-prometheus-stack` (critical)**
    By default, `kube-prometheus-stack` only picks up ServiceMonitors that carry a `release` label with the same name.
    ➜ If your Prometheus Helm release is not named `kube-prometheus-stack`, change this value to **that release name**, otherwise Prometheus will ignore the ServiceMonitor.
    - #### How to confirm which `release` label Prometheus uses to select ServiceMonitors?
```
$ NAMESPACE=monitor
$ kubectl -n ${NAMESPACE} get prometheus
NAME VERSION DESIRED READY RECONCILED AVAILABLE AGE
monitor-kube-prometheus-st-prometheus v3.3.0 1 1 True True 37d
$ kubectl -n ${NAMESPACE} get prometheus -o yaml
$ kubectl -n ${NAMESPACE} get prometheus -o yaml | yq '.items[0].spec.serviceMonitorSelector' | yq
matchLabels:
release: monitor
$ kubectl -n ${NAMESPACE} get prometheus -o yaml | yq '.items[0].spec.serviceMonitorNamespaceSelector'
{}
```
  - **spec.selector.matchLabels**
    The **Service to be scraped (note: the Service, not the Pod)** must carry both of these labels (AND condition):
    - `app.kubernetes.io/name=slurm-exporter`
    - `app.kubernetes.io/instance=slurm-exporter`
    ➜ In other words, your **`Service/slurm-exporter`** must carry these labels for the ServiceMonitor to select it.
    **How to check**:
```yaml
$ kubectl -n slurm get svc/slurm-exporter -o yaml | grep labels -A6
labels:
app.kubernetes.io/component: slurm-exporter
app.kubernetes.io/instance: slurm-exporter # <--
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: slurm-exporter # <--
app.kubernetes.io/version: "25.05"
helm.sh/chart: slurm-exporter-0.3.1
```
  - **spec.namespaceSelector.matchNames: `slurm`**
    Only look for matching **Services** in the `slurm` namespace.
    (For cross-namespace monitoring, switch to `any: true` or list multiple `matchNames`.)
  - **spec.endpoints**
    - `port: metrics`: use the **Service** port **named `metrics`** (`spec.ports[].name: metrics`). Prometheus scrapes the Pod's `/metrics` through the `targetPort` behind that port.
    - `interval: 5s`: scrape every 5 seconds (the scrape frequency).
    - The default scrape path is **`/metrics`**.
      With a **ServiceMonitor**, if you do not specify otherwise, the scrape config generated by the Prometheus Operator uses `/metrics` as the path. To change it (e.g. to `/data`), add `path` under `endpoints`:
```yaml
spec:
...
endpoints:
  - port: metrics     # points to the Service's port name
    path: /data       # ← custom path (default is /metrics)
interval: 5s
```
      * This `path` only affects the **ServiceMonitor** scrape job; the legacy `prometheus.io/path` annotation is for annotation-based scraping and has nothing to do with ServiceMonitors.
      * For special needs you can also rewrite the target's `__metrics_path__` with `relabelings`, but using `path` directly is usually the clearest option.
- ### [Dry run] Verify that Prometheus can find the target described by the ServiceMonitor
  > i.e. how Prometheus resolves the scrape target from the ServiceMonitor:
    - ### Target 1: which Service
```
$ kubectl -n slurm get service \
-l app.kubernetes.io/instance=slurm-exporter,app.kubernetes.io/name=slurm-exporter
```
```
# -l, --selector=''
# the selector can be split into multiple -l flags
$ kubectl -n slurm get service \
-l app.kubernetes.io/instance=slurm-exporter \
-l app.kubernetes.io/name=slurm-exporter
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
slurm-exporter ClusterIP None <none> 8080/TCP 6h14m
```
    - ### Target 2: which port
```
$ kubectl -n slurm get service \
-l app.kubernetes.io/instance=slurm-exporter \
-l app.kubernetes.io/name=slurm-exporter \
-o yaml \
| yq '.items[].spec.ports[] | select(.name == "metrics")'
```
---
### One-sentence summary
> This ServiceMonitor tells the kube-prometheus-stack Prometheus instance to scrape `/metrics`, every 5 seconds, from the Pods behind the **Service (port name `metrics`)** in the **`slurm` namespace** that carries the specified labels.
---
### Quick self-check (the most common pitfalls)
1. **Do the labels match?**
```bash
kubectl -n slurm get svc/slurm-exporter -o jsonpath-as-json='{.metadata.labels}'
```
2. **Is the Service port named `metrics`?**
```bash
kubectl -n slurm get svc/slurm-exporter -o jsonpath='{.spec.ports[*].name}'
```
3. **Will Prometheus pick up this ServiceMonitor? (does the release name match?)**
   (Replace `kube-prometheus-stack` with your actual Helm release name)
```bash
kubectl get servicemonitors --all-namespaces -l release=kube-prometheus-stack
```
<br>
---
<br>
## 4. Verify the /metrics output
> (already covered in detail in the previous section)
### Method A: port-forward to your local machine
```bash
kubectl -n slurm port-forward svc/slurm-exporter 8080:8080
curl http://localhost:8080/metrics | head -20
```
### Method B: test from a temporary Pod
```bash
kubectl -n slurm run -it curl-test \
--rm --restart=Never --image=curlimages/curl -- \
curl http://slurm-exporter:8080/metrics
```
<br>
---
<br>
## 5. Verify Service → Pod routing
```bash
kubectl -n slurm get endpoints/slurm-exporter -o yaml
```
- You should see the Pod IP and port 8080.
- **Example output**
```bash
$ kubectl -n slurm get endpoints/slurm-exporter -o yaml
```
```yaml=
apiVersion: v1
kind: Endpoints
metadata:
annotations:
endpoints.kubernetes.io/last-change-trigger-time: "2025-09-10T02:51:40Z"
creationTimestamp: "2025-09-10T02:49:49Z"
labels:
app.kubernetes.io/component: slurm-exporter
app.kubernetes.io/instance: slurm-exporter
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: slurm-exporter
app.kubernetes.io/version: "25.05"
helm.sh/chart: slurm-exporter-0.3.1
service.kubernetes.io/headless: ""
name: slurm-exporter
namespace: slurm
resourceVersion: "6838327"
uid: 3b162b0f-63e3-4cec-b022-d3225490a459
subsets:
- addresses:
- ip: 192.168.0.89
nodeName: stage-kube01
targetRef:
kind: Pod
name: slurm-exporter-cf8944f49-pd62b
namespace: slurm
uid: 2bf47fca-85a5-4a2a-949e-35589897df66
ports:
- name: metrics
port: 8080
protocol: TCP
```
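- On newer clusters the same routing information is also available from EndpointSlices; as an optional cross-check (a sketch relying on the standard `kubernetes.io/service-name` label):
  ```bash
  kubectl -n slurm get endpointslice -l kubernetes.io/service-name=slurm-exporter -o wide
  ```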
<br>
---
<br>
## 6. Verify that Prometheus is receiving the `/metrics` data
- ### 1. Open the Prometheus UI (via port-forward or ingress):
```bash
kubectl -n monitor port-forward svc/kube-prometheus-stack-prometheus 9090:9090
```
  Then open `http://localhost:9090` in your browser.
- **Screenshot**

- ### 2. Go to **Status → Targets**; you should see:
```
serviceMonitor/slurm/slurm-exporter/0
```
  and its state is `UP`.
- **Screenshot**
- Status -> Target health

    - In [Select scrape pool], type `serviceMonitor/slurm/slurm-exporter/0` to filter

- ### 3. Query an exporter metric:
```promql
slurm_jobs_total
```
  If it returns a value, the whole chain works.
- **Screenshot**

  **Filtered result**:
```
slurm_jobs_total{
container="metrics",
endpoint="metrics",
instance="192.168.0.89:8080",
job="slurm-exporter",
namespace="slurm",
pod="slurm-exporter-cf8944f49-pd62b",
service="slurm-exporter"
}
```
- ### 4. Query the example above with `curl`
```bash
$ curl -s http://localhost:9090/api/v1/query?query=slurm_jobs_total | jq
# -s, --silent: silent mode -> do not print progress
# -------------------------------------------------------------------------------
# % Total % Received % Xferd Average Speed Time Time Time Current
# Dload Upload Total Spent Left Speed
# 100 578 100 578 0 0 99982 0 --:--:-- --:--:-- --:--:-- 112k
# -------------------------------------------------------------------------------
```
```json=
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{
"metric": {
"__name__": "slurm_jobs_total",
"container": "metrics",
"endpoint": "metrics",
"instance": "192.168.0.89:8080",
"job": "slurm-exporter",
"namespace": "slurm",
"pod": "slurm-exporter-cf8944f49-pd62b",
"service": "slurm-exporter"
},
"value": [
1757563263.28,
"0"
]
}
]
}
}
```
    - #### If there are multiple results, narrow the query further: `namespace=slurm`
```
curl -s 'http://localhost:9090/api/v1/query?query=slurm_jobs_total{namespace="slurm"}' | jq
```
    - #### **Result (error message)**
```
{
"status": "error",
"errorType": "bad_data",
"error": "invalid parameter \"query\": 1:26: parse error: unexpected \"=\""
}
```
    - #### The label matcher needs to be URL-encoded
      > https://meyerweb.com/eric/tools/dencoder/
      > Before encoding: `{namespace="slurm"}`
      > After encoding: `%7Bnamespace%3D%22slurm%22%7D`
    - #### Corrected curl command:
```
$ curl -s http://localhost:9090/api/v1/query?query=slurm_jobs_total%7Bnamespace%3D%22slurm%22%7D | jq
```
```json=
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{
"metric": {
"__name__": "slurm_jobs_total",
"container": "metrics",
"endpoint": "metrics",
"instance": "192.168.0.89:8080",
"job": "slurm-exporter",
"namespace": "slurm",
"pod": "slurm-exporter-cf8944f49-pd62b",
"service": "slurm-exporter"
},
"value": [
1757563884.951,
"0"
]
}
]
}
}
```
    - #### Let curl URL-encode the data automatically
```bash
$ curl -s 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=slurm_jobs_total{namespace="slurm"}' | jq
# --data-urlencode can be replaced with --data / -d here
```
        - With `-X POST`, same as above
```
$ curl -X POST -s 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=slurm_jobs_total{namespace="slurm"}' | jq
```
        - With `-X GET`, you also need the `-G` flag
          > `-G`: appends the data to the URL as a query string (`?a=b&c=d`) instead of sending it in the HTTP body
```
$ curl -X GET -sG 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=slurm_jobs_total{namespace="slurm"}' | jq
```
    - #### Final command
```bash
$ curl -s 'http://localhost:9090/api/v1/query' \
-d 'query=slurm_jobs_total{namespace="slurm"}' | jq
```
- ### 5. More examples
    - **Exclude a namespace**:
```
slurm_jobs_total{namespace!="xxx-slurm"}
```
    - **Multiple namespaces (regex)**:
```
slurm_jobs_total{namespace=~"slurm|prod-slurm"}
```
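    - **Pending jobs per partition** (the kind of signal that is useful later for autoscaling); a sketch that reuses the `curl` form from step 4, with the metric name taken from the appendix below:
      ```bash
      curl -s 'http://localhost:9090/api/v1/query' \
        --data-urlencode 'query=sum by (partition) (slurm_partition_jobs_pending_total)' | jq '.data.result'
      ```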
<br>
---
<br>
## 7. Troubleshooting common issues
- **The ServiceMonitor has no effect**
  → Check the Prometheus `serviceMonitorSelector`; by default it requires the label `release=kube-prometheus-stack`.
- **`/metrics` cannot be fetched**
  → Check the Pod log first:
```bash
kubectl -n slurm logs deploy/slurm-exporter
```
- **Endpoints is empty**
  → The Service selector labels do not match any Pod.
- **Targets show DOWN**
  → Check whether a NetworkPolicy is blocking Prometheus in the `monitor` namespace from reaching the Service in the `slurm` namespace (see the check below).
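  A quick way to see whether any NetworkPolicy objects exist in the namespaces involved (a sketch):
  ```bash
  kubectl -n slurm get networkpolicy
  kubectl -n monitor get networkpolicy
  ```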
<br>
---
---
<br>
## Appendix: what the `/metrics` output looks like
> Sample output from `http://slurm-exporter:8080/metrics`
```ini=
# HELP go_gc_duration_seconds A summary of the wall-time pause (stop-the-world) duration in garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000158618
go_gc_duration_seconds{quantile="0.25"} 0.000293336
go_gc_duration_seconds{quantile="0.5"} 0.000321119
go_gc_duration_seconds{quantile="0.75"} 0.000362432
go_gc_duration_seconds{quantile="1"} 0.007996102
go_gc_duration_seconds_sum 0.368574099
go_gc_duration_seconds_count 1153
# HELP go_gc_gogc_percent Heap size target percentage configured by the user, otherwise 100. This value is set by the GOGC environment variable, and the runtime/debug.SetGCPercent function. Sourced from /gc/gogc:percent.
# TYPE go_gc_gogc_percent gauge
go_gc_gogc_percent 100
# HELP go_gc_gomemlimit_bytes Go runtime memory limit configured by the user, otherwise math.MaxInt64. This value is set by the GOMEMLIMIT environment variable, and the runtime/debug.SetMemoryLimit function. Sourced from /gc/gomemlimit:bytes.
# TYPE go_gc_gomemlimit_bytes gauge
go_gc_gomemlimit_bytes 9.223372036854776e+18
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 25
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.24.5"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated in heap and currently in use. Equals to /memory/classes/heap/objects:bytes.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 8.16408e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated in heap until now, even if released already. Equals to /gc/heap/allocs:bytes.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 6.231564864e+09
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table. Equals to /memory/classes/profiling/buckets:bytes.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.752257e+06
# HELP go_memstats_frees_total Total number of heap objects frees. Equals to /gc/heap/frees:objects + /gc/heap/tiny/allocs:objects.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 3.6451928e+07
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata. Equals to /memory/classes/metadata/other:bytes.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 3.53308e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and currently in use, same as go_memstats_alloc_bytes. Equals to /memory/classes/heap/objects:bytes.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 8.16408e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used. Equals to /memory/classes/heap/released:bytes + /memory/classes/heap/free:bytes.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 1.908736e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use. Equals to /memory/classes/heap/objects:bytes + /memory/classes/heap/unused:bytes
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.1845632e+07
# HELP go_memstats_heap_objects Number of currently allocated objects. Equals to /gc/heap/objects:objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 18082
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS. Equals to /memory/classes/heap/released:bytes.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 1.1870208e+07
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system. Equals to /memory/classes/heap/objects:bytes + /memory/classes/heap/unused:bytes + /memory/classes/heap/released:bytes + /memory/classes/heap/free:bytes.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 3.0932992e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.7574158133290594e+09
# HELP go_memstats_mallocs_total Total number of heap objects allocated, both live and gc-ed. Semantically a counter version for go_memstats_heap_objects gauge. Equals to /gc/heap/allocs:objects + /gc/heap/tiny/allocs:objects.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 3.647001e+07
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures. Equals to /memory/classes/metadata/mcache/inuse:bytes.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 106304
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system. Equals to /memory/classes/metadata/mcache/inuse:bytes + /memory/classes/metadata/mcache/free:bytes.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 109928
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures. Equals to /memory/classes/metadata/mspan/inuse:bytes.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 428320
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system. Equals to /memory/classes/metadata/mspan/inuse:bytes + /memory/classes/metadata/mspan/free:bytes.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 522240
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place. Equals to /gc/heap/goal:bytes.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 1.6529442e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations. Equals to /memory/classes/other:bytes.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 6.293959e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes obtained from system for stack allocator in non-CGO environments. Equals to /memory/classes/heap/stacks:bytes.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 2.62144e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator. Equals to /memory/classes/heap/stacks:bytes + /memory/classes/os-stacks:bytes.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 2.62144e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system. Equals to /memory/classes/total:byte.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 4.5765896e+07
# HELP go_sched_gomaxprocs_threads The current runtime.GOMAXPROCS setting, or the number of operating system threads that can execute user-level Go code simultaneously. Sourced from /sched/gomaxprocs:threads.
# TYPE go_sched_gomaxprocs_threads gauge
go_sched_gomaxprocs_threads 88
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 32
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 119.73
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_network_receive_bytes_total Number of bytes received by the process over the network.
# TYPE process_network_receive_bytes_total counter
process_network_receive_bytes_total 1.22111852e+08
# HELP process_network_transmit_bytes_total Number of bytes sent by the process over the network.
# TYPE process_network_transmit_bytes_total counter
process_network_transmit_bytes_total 4.1687486e+07
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 11
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 5.1982336e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.75739091263e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.30664448e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 5012
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
# HELP slurm_bfscheduler_active_bool Backfill scheduler currently running
# TYPE slurm_bfscheduler_active_bool gauge
slurm_bfscheduler_active_bool 0
# HELP slurm_bfscheduler_backfilledhetjobs_total Number of heterogeneous job components started through backfilling since last Slurm start
# TYPE slurm_bfscheduler_backfilledhetjobs_total gauge
slurm_bfscheduler_backfilledhetjobs_total 0
# HELP slurm_bfscheduler_backfilledjobs_total Number of jobs started through backfilling since last slurm start
# TYPE slurm_bfscheduler_backfilledjobs_total gauge
slurm_bfscheduler_backfilledjobs_total 2
# HELP slurm_bfscheduler_cycle_max_seconds Execution time in microseconds of longest backfill scheduling cycle
# TYPE slurm_bfscheduler_cycle_max_seconds gauge
slurm_bfscheduler_cycle_max_seconds 1688
# HELP slurm_bfscheduler_cycle_mean_seconds Mean time in microseconds of backfilling scheduling cycles since last reset
# TYPE slurm_bfscheduler_cycle_mean_seconds gauge
slurm_bfscheduler_cycle_mean_seconds 922
# HELP slurm_bfscheduler_cycle_seconds Execution time in microseconds of last backfill scheduling cycle
# TYPE slurm_bfscheduler_cycle_seconds gauge
slurm_bfscheduler_cycle_seconds 1111
# HELP slurm_bfscheduler_cycle_sum_seconds Total time in microseconds of backfilling scheduling cycles since last reset
# TYPE slurm_bfscheduler_cycle_sum_seconds gauge
slurm_bfscheduler_cycle_sum_seconds 41498
# HELP slurm_bfscheduler_cycle_total Number of backfill scheduling cycles since last reset
# TYPE slurm_bfscheduler_cycle_total gauge
slurm_bfscheduler_cycle_total 45
# HELP slurm_bfscheduler_depth_mean_total Mean number of eligible to run jobs processed during all backfilling scheduling cycles since last reset
# TYPE slurm_bfscheduler_depth_mean_total gauge
slurm_bfscheduler_depth_mean_total 2
# HELP slurm_bfscheduler_depth_sum_total Total number of jobs processed during all backfilling scheduling cycles since last reset
# TYPE slurm_bfscheduler_depth_sum_total gauge
slurm_bfscheduler_depth_sum_total 109
# HELP slurm_bfscheduler_depth_try_total The subset of Depth Mean that the backfill scheduler attempted to schedule
# TYPE slurm_bfscheduler_depth_try_total gauge
slurm_bfscheduler_depth_try_total 2
# HELP slurm_bfscheduler_depth_trysum_total Subset of bf_depth_sum that the backfill scheduler attempted to schedule
# TYPE slurm_bfscheduler_depth_trysum_total gauge
slurm_bfscheduler_depth_trysum_total 109
# HELP slurm_bfscheduler_endjobqueue_total Reached RPC limit
# TYPE slurm_bfscheduler_endjobqueue_total gauge
slurm_bfscheduler_endjobqueue_total 45
# HELP slurm_bfscheduler_lastbackfilledjobs_total Number of jobs started through backfilling since last reset
# TYPE slurm_bfscheduler_lastbackfilledjobs_total gauge
slurm_bfscheduler_lastbackfilledjobs_total 2
# HELP slurm_bfscheduler_lastcycle_timestamp When the last backfill scheduling cycle happened (UNIX timestamp)
# TYPE slurm_bfscheduler_lastcycle_timestamp gauge
slurm_bfscheduler_lastcycle_timestamp 1.757409677e+09
# HELP slurm_bfscheduler_lastdepth_total Number of processed jobs during last backfilling scheduling cycle
# TYPE slurm_bfscheduler_lastdepth_total gauge
slurm_bfscheduler_lastdepth_total 3
# HELP slurm_bfscheduler_lastdepthtry_total Number of processed jobs during last backfilling scheduling cycle that had a chance to start using available resources
# TYPE slurm_bfscheduler_lastdepthtry_total gauge
slurm_bfscheduler_lastdepthtry_total 3
# HELP slurm_bfscheduler_maxjobstart_total Reached number of jobs allowed to be tested
# TYPE slurm_bfscheduler_maxjobstart_total gauge
slurm_bfscheduler_maxjobstart_total 0
# HELP slurm_bfscheduler_maxjobtest_total Reached end of queue
# TYPE slurm_bfscheduler_maxjobtest_total gauge
slurm_bfscheduler_maxjobtest_total 0
# HELP slurm_bfscheduler_maxtime_total Blocked on licenses
# TYPE slurm_bfscheduler_maxtime_total gauge
slurm_bfscheduler_maxtime_total 0
# HELP slurm_bfscheduler_nodespace_total Reached table size limit
# TYPE slurm_bfscheduler_nodespace_total gauge
slurm_bfscheduler_nodespace_total 0
# HELP slurm_bfscheduler_queue_mean_total Mean number of jobs pending to be processed by backfilling algorithm
# TYPE slurm_bfscheduler_queue_mean_total gauge
slurm_bfscheduler_queue_mean_total 2
# HELP slurm_bfscheduler_queue_sum_total Total number of jobs pending to be processed by backfilling algorithm since last reset
# TYPE slurm_bfscheduler_queue_sum_total gauge
slurm_bfscheduler_queue_sum_total 109
# HELP slurm_bfscheduler_queue_total Number of jobs pending to be processed by backfilling algorithm
# TYPE slurm_bfscheduler_queue_total gauge
slurm_bfscheduler_queue_total 3
# HELP slurm_bfscheduler_statechanged_total Reached maximum allowed scheduler time
# TYPE slurm_bfscheduler_statechanged_total gauge
slurm_bfscheduler_statechanged_total 0
# HELP slurm_bfscheduler_table_total Number of different time slots tested by the backfill scheduler in its last iteration
# TYPE slurm_bfscheduler_table_total gauge
slurm_bfscheduler_table_total 6
# HELP slurm_bfscheduler_tablemean_total Mean number of different time slots tested by the backfill scheduler
# TYPE slurm_bfscheduler_tablemean_total gauge
slurm_bfscheduler_tablemean_total 2
# HELP slurm_bfscheduler_tablesum_total Total number of different time slots tested by the backfill scheduler
# TYPE slurm_bfscheduler_tablesum_total gauge
slurm_bfscheduler_tablesum_total 204
# HELP slurm_jobs_bootfail_total Number of jobs in BootFail state
# TYPE slurm_jobs_bootfail_total gauge
slurm_jobs_bootfail_total 0
# HELP slurm_jobs_cancelled_total Number of jobs in Cancelled state
# TYPE slurm_jobs_cancelled_total gauge
slurm_jobs_cancelled_total 0
# HELP slurm_jobs_completed_total Number of jobs in Completed state
# TYPE slurm_jobs_completed_total gauge
slurm_jobs_completed_total 0
# HELP slurm_jobs_completing_total Number of jobs with Completing flag
# TYPE slurm_jobs_completing_total gauge
slurm_jobs_completing_total 0
# HELP slurm_jobs_configuring_total Number of jobs with Configuring flag
# TYPE slurm_jobs_configuring_total gauge
slurm_jobs_configuring_total 0
# HELP slurm_jobs_cpus_alloc_total Number of Allocated CPUs among jobs
# TYPE slurm_jobs_cpus_alloc_total gauge
slurm_jobs_cpus_alloc_total 0
# HELP slurm_jobs_deadline_total Number of jobs in Deadline state
# TYPE slurm_jobs_deadline_total gauge
slurm_jobs_deadline_total 0
# HELP slurm_jobs_failed_total Number of jobs in Failed state
# TYPE slurm_jobs_failed_total gauge
slurm_jobs_failed_total 0
# HELP slurm_jobs_hold_total Number of jobs with Hold flag
# TYPE slurm_jobs_hold_total gauge
slurm_jobs_hold_total 0
# HELP slurm_jobs_memory_alloc_bytes Amount of Allocated Memory (MB) among jobs
# TYPE slurm_jobs_memory_alloc_bytes gauge
slurm_jobs_memory_alloc_bytes 0
# HELP slurm_jobs_nodefail_total Number of jobs in NodeFail state
# TYPE slurm_jobs_nodefail_total gauge
slurm_jobs_nodefail_total 0
# HELP slurm_jobs_outofmemory_total Number of jobs in OutOfMemory state
# TYPE slurm_jobs_outofmemory_total gauge
slurm_jobs_outofmemory_total 0
# HELP slurm_jobs_pending_total Number of jobs in Pending state
# TYPE slurm_jobs_pending_total gauge
slurm_jobs_pending_total 0
# HELP slurm_jobs_powerupnode_total Number of jobs with PowerUpNode flag
# TYPE slurm_jobs_powerupnode_total gauge
slurm_jobs_powerupnode_total 0
# HELP slurm_jobs_preempted_total Number of jobs in Preempted state
# TYPE slurm_jobs_preempted_total gauge
slurm_jobs_preempted_total 0
# HELP slurm_jobs_running_total Number of jobs in Running state
# TYPE slurm_jobs_running_total gauge
slurm_jobs_running_total 0
# HELP slurm_jobs_stageout_total Number of jobs with StageOut flag
# TYPE slurm_jobs_stageout_total gauge
slurm_jobs_stageout_total 0
# HELP slurm_jobs_suspended_total Number of jobs in Suspended state
# TYPE slurm_jobs_suspended_total gauge
slurm_jobs_suspended_total 0
# HELP slurm_jobs_timeout_total Number of jobs in Timeout state
# TYPE slurm_jobs_timeout_total gauge
slurm_jobs_timeout_total 0
# HELP slurm_jobs_total Total number of jobs
# TYPE slurm_jobs_total gauge
slurm_jobs_total 0
# HELP slurm_nodes_allocated_total Number of nodes in Allocated state
# TYPE slurm_nodes_allocated_total gauge
slurm_nodes_allocated_total 0
# HELP slurm_nodes_completing_total Number of nodes with Completing flag
# TYPE slurm_nodes_completing_total gauge
slurm_nodes_completing_total 0
# HELP slurm_nodes_down_total Number of nodes in Down state
# TYPE slurm_nodes_down_total gauge
slurm_nodes_down_total 0
# HELP slurm_nodes_drain_total Number of nodes with Drain flag
# TYPE slurm_nodes_drain_total gauge
slurm_nodes_drain_total 0
# HELP slurm_nodes_error_total Number of nodes in Error state
# TYPE slurm_nodes_error_total gauge
slurm_nodes_error_total 0
# HELP slurm_nodes_fail_total Number of nodes with Fail flag
# TYPE slurm_nodes_fail_total gauge
slurm_nodes_fail_total 0
# HELP slurm_nodes_future_total Number of nodes in Future state
# TYPE slurm_nodes_future_total gauge
slurm_nodes_future_total 0
# HELP slurm_nodes_idle_total Number of nodes in Idle state
# TYPE slurm_nodes_idle_total gauge
slurm_nodes_idle_total 0
# HELP slurm_nodes_maintenance_total Number of nodes with Maintenance flag
# TYPE slurm_nodes_maintenance_total gauge
slurm_nodes_maintenance_total 0
# HELP slurm_nodes_mixed_total Number of nodes in Mixed state
# TYPE slurm_nodes_mixed_total gauge
slurm_nodes_mixed_total 0
# HELP slurm_nodes_notresponding_total Number of nodes with NotResponding flag
# TYPE slurm_nodes_notresponding_total gauge
slurm_nodes_notresponding_total 0
# HELP slurm_nodes_planned_total Number of nodes with Planned flag
# TYPE slurm_nodes_planned_total gauge
slurm_nodes_planned_total 0
# HELP slurm_nodes_rebootrequested_total Number of nodes with RebootRequested flag
# TYPE slurm_nodes_rebootrequested_total gauge
slurm_nodes_rebootrequested_total 0
# HELP slurm_nodes_reserved_total Number of nodes with Reserved flag
# TYPE slurm_nodes_reserved_total gauge
slurm_nodes_reserved_total 0
# HELP slurm_nodes_total Total number of nodes
# TYPE slurm_nodes_total gauge
slurm_nodes_total 0
# HELP slurm_nodes_unknown_total Number of nodes in Unknown state
# TYPE slurm_nodes_unknown_total gauge
slurm_nodes_unknown_total 0
# HELP slurm_partition_jobs_bootfail_total Number of jobs in BootFail state in the partition
# TYPE slurm_partition_jobs_bootfail_total gauge
slurm_partition_jobs_bootfail_total{partition="tp"} 0
# HELP slurm_partition_jobs_cancelled_total Number of jobs in Cancelled state in the partition
# TYPE slurm_partition_jobs_cancelled_total gauge
slurm_partition_jobs_cancelled_total{partition="tp"} 0
# HELP slurm_partition_jobs_completed_total Number of jobs in Completed state in the partition
# TYPE slurm_partition_jobs_completed_total gauge
slurm_partition_jobs_completed_total{partition="tp"} 0
# HELP slurm_partition_jobs_completing_total Number of jobs with Completing flag in the partition
# TYPE slurm_partition_jobs_completing_total gauge
slurm_partition_jobs_completing_total{partition="tp"} 0
# HELP slurm_partition_jobs_configuring_total Number of jobs with Configuring flag in the partition
# TYPE slurm_partition_jobs_configuring_total gauge
slurm_partition_jobs_configuring_total{partition="tp"} 0
# HELP slurm_partition_jobs_cpus_alloc_total Number of Allocated CPUs among jobs in the partition
# TYPE slurm_partition_jobs_cpus_alloc_total gauge
slurm_partition_jobs_cpus_alloc_total{partition="tp"} 0
# HELP slurm_partition_jobs_deadline_total Number of jobs in Deadline state in the partition
# TYPE slurm_partition_jobs_deadline_total gauge
slurm_partition_jobs_deadline_total{partition="tp"} 0
# HELP slurm_partition_jobs_failed_total Number of jobs in Failed state in the partition
# TYPE slurm_partition_jobs_failed_total gauge
slurm_partition_jobs_failed_total{partition="tp"} 0
# HELP slurm_partition_jobs_hold_total Number of jobs with Hold flag in the partition
# TYPE slurm_partition_jobs_hold_total gauge
slurm_partition_jobs_hold_total{partition="tp"} 0
# HELP slurm_partition_jobs_memory_alloc_bytes Amount of Allocated Memory (MB) among jobs in the partition
# TYPE slurm_partition_jobs_memory_alloc_bytes gauge
slurm_partition_jobs_memory_alloc_bytes{partition="tp"} 0
# HELP slurm_partition_jobs_nodefail_total Number of jobs in NodeFail state in the partition
# TYPE slurm_partition_jobs_nodefail_total gauge
slurm_partition_jobs_nodefail_total{partition="tp"} 0
# HELP slurm_partition_jobs_outofmemory_total Number of jobs in OutOfMemory state in the partition
# TYPE slurm_partition_jobs_outofmemory_total gauge
slurm_partition_jobs_outofmemory_total{partition="tp"} 0
# HELP slurm_partition_jobs_pending_maxnodecount_total Largest number of nodes required among pending jobs in the partition
# TYPE slurm_partition_jobs_pending_maxnodecount_total gauge
slurm_partition_jobs_pending_maxnodecount_total{partition="tp"} 0
# HELP slurm_partition_jobs_pending_total Number of jobs in Pending state in the partition
# TYPE slurm_partition_jobs_pending_total gauge
slurm_partition_jobs_pending_total{partition="tp"} 0
# HELP slurm_partition_jobs_powerupnode_total Number of jobs with PowerUpNode flag in the partition
# TYPE slurm_partition_jobs_powerupnode_total gauge
slurm_partition_jobs_powerupnode_total{partition="tp"} 0
# HELP slurm_partition_jobs_preempted_total Number of jobs in Preempted state in the partition
# TYPE slurm_partition_jobs_preempted_total gauge
slurm_partition_jobs_preempted_total{partition="tp"} 0
# HELP slurm_partition_jobs_running_total Number of jobs in Running state in the partition
# TYPE slurm_partition_jobs_running_total gauge
slurm_partition_jobs_running_total{partition="tp"} 0
# HELP slurm_partition_jobs_stageout_total Number of jobs with StageOut flag in the partition
# TYPE slurm_partition_jobs_stageout_total gauge
slurm_partition_jobs_stageout_total{partition="tp"} 0
# HELP slurm_partition_jobs_suspended_total Number of jobs in Suspended state in the partition
# TYPE slurm_partition_jobs_suspended_total gauge
slurm_partition_jobs_suspended_total{partition="tp"} 0
# HELP slurm_partition_jobs_timeout_total Number of jobs in Timeout state in the partition
# TYPE slurm_partition_jobs_timeout_total gauge
slurm_partition_jobs_timeout_total{partition="tp"} 0
# HELP slurm_partition_jobs_total Total number of jobs in the partition
# TYPE slurm_partition_jobs_total gauge
slurm_partition_jobs_total{partition="tp"} 0
# HELP slurm_partition_nodes_allocated_total Number of nodes in Allocated state
# TYPE slurm_partition_nodes_allocated_total gauge
slurm_partition_nodes_allocated_total{partition="tp"} 0
# HELP slurm_partition_nodes_completing_total Number of nodes with Completing flag
# TYPE slurm_partition_nodes_completing_total gauge
slurm_partition_nodes_completing_total{partition="tp"} 0
# HELP slurm_partition_nodes_cpus_alloc_total Number of Allocated CPUs on the node
# TYPE slurm_partition_nodes_cpus_alloc_total gauge
slurm_partition_nodes_cpus_alloc_total{partition="tp"} 0
# HELP slurm_partition_nodes_cpus_effective_total Total number of effective CPUs on the node, excludes CoreSpec
# TYPE slurm_partition_nodes_cpus_effective_total gauge
slurm_partition_nodes_cpus_effective_total{partition="tp"} 0
# HELP slurm_partition_nodes_cpus_idle_total Number of Idle CPUs on the node
# TYPE slurm_partition_nodes_cpus_idle_total gauge
slurm_partition_nodes_cpus_idle_total{partition="tp"} 0
# HELP slurm_partition_nodes_cpus_total Total number of CPUs on the node
# TYPE slurm_partition_nodes_cpus_total gauge
slurm_partition_nodes_cpus_total{partition="tp"} 0
# HELP slurm_partition_nodes_down_total Number of nodes in Down state
# TYPE slurm_partition_nodes_down_total gauge
slurm_partition_nodes_down_total{partition="tp"} 0
# HELP slurm_partition_nodes_drain_total Number of nodes with Drain flag
# TYPE slurm_partition_nodes_drain_total gauge
slurm_partition_nodes_drain_total{partition="tp"} 0
# HELP slurm_partition_nodes_error_total Number of nodes in Error state
# TYPE slurm_partition_nodes_error_total gauge
slurm_partition_nodes_error_total{partition="tp"} 0
# HELP slurm_partition_nodes_fail_total Number of nodes with Fail flag
# TYPE slurm_partition_nodes_fail_total gauge
slurm_partition_nodes_fail_total{partition="tp"} 0
# HELP slurm_partition_nodes_future_total Number of nodes in Future state
# TYPE slurm_partition_nodes_future_total gauge
slurm_partition_nodes_future_total{partition="tp"} 0
# HELP slurm_partition_nodes_idle_total Number of nodes in Idle state
# TYPE slurm_partition_nodes_idle_total gauge
slurm_partition_nodes_idle_total{partition="tp"} 0
# HELP slurm_partition_nodes_maintenance_total Number of nodes with Maintenance flag
# TYPE slurm_partition_nodes_maintenance_total gauge
slurm_partition_nodes_maintenance_total{partition="tp"} 0
# HELP slurm_partition_nodes_memory_alloc_bytes Amount of Allocated Memory (MB) on the node
# TYPE slurm_partition_nodes_memory_alloc_bytes gauge
slurm_partition_nodes_memory_alloc_bytes{partition="tp"} 0
# HELP slurm_partition_nodes_memory_bytes Total amount of Memory (MB) on the node
# TYPE slurm_partition_nodes_memory_bytes gauge
slurm_partition_nodes_memory_bytes{partition="tp"} 0
# HELP slurm_partition_nodes_memory_effective_bytes Total amount of effective Memory (MB) on the node, excludes MemSpec
# TYPE slurm_partition_nodes_memory_effective_bytes gauge
slurm_partition_nodes_memory_effective_bytes{partition="tp"} 0
# HELP slurm_partition_nodes_memory_free_bytes Amount of Free Memory (MB) on the node
# TYPE slurm_partition_nodes_memory_free_bytes gauge
slurm_partition_nodes_memory_free_bytes{partition="tp"} 0
# HELP slurm_partition_nodes_mixed_total Number of nodes in Mixed state
# TYPE slurm_partition_nodes_mixed_total gauge
slurm_partition_nodes_mixed_total{partition="tp"} 0
# HELP slurm_partition_nodes_notresponding_total Number of nodes with NotResponding flag
# TYPE slurm_partition_nodes_notresponding_total gauge
slurm_partition_nodes_notresponding_total{partition="tp"} 0
# HELP slurm_partition_nodes_planned_total Number of nodes with Planned flag
# TYPE slurm_partition_nodes_planned_total gauge
slurm_partition_nodes_planned_total{partition="tp"} 0
# HELP slurm_partition_nodes_rebootrequested_total Number of nodes with RebootRequested flag
# TYPE slurm_partition_nodes_rebootrequested_total gauge
slurm_partition_nodes_rebootrequested_total{partition="tp"} 0
# HELP slurm_partition_nodes_reserved_total Number of nodes with Reserved flag
# TYPE slurm_partition_nodes_reserved_total gauge
slurm_partition_nodes_reserved_total{partition="tp"} 0
# HELP slurm_partition_nodes_total Total number of slurm nodes
# TYPE slurm_partition_nodes_total gauge
slurm_partition_nodes_total{partition="tp"} 0
# HELP slurm_partition_nodes_unknown_total Number of nodes in Unknown state
# TYPE slurm_partition_nodes_unknown_total gauge
slurm_partition_nodes_unknown_total{partition="tp"} 0
# HELP slurm_scheduler_agent_queue_total Number of enqueued outgoing RPC requests in an internal retry list
# TYPE slurm_scheduler_agent_queue_total gauge
slurm_scheduler_agent_queue_total 0
# HELP slurm_scheduler_agent_thread_total Total number of active threads created by all agent threads
# TYPE slurm_scheduler_agent_thread_total gauge
slurm_scheduler_agent_thread_total 0
# HELP slurm_scheduler_agent_total Number of agent threads
# TYPE slurm_scheduler_agent_total gauge
slurm_scheduler_agent_total 0
# HELP slurm_scheduler_cycle_depth_mean_total Mean of the number of jobs processed in a scheduling
# TYPE slurm_scheduler_cycle_depth_mean_total gauge
slurm_scheduler_cycle_depth_mean_total 0
# HELP slurm_scheduler_cycle_depth_total Total number of jobs processed in scheduling cycles
# TYPE slurm_scheduler_cycle_depth_total gauge
slurm_scheduler_cycle_depth_total 131
# HELP slurm_scheduler_cycle_last_seconds Time in microseconds for last scheduling cycle
# TYPE slurm_scheduler_cycle_last_seconds gauge
slurm_scheduler_cycle_last_seconds 100
# HELP slurm_scheduler_cycle_max_seconds Max time of any scheduling cycle in microseconds since last reset
# TYPE slurm_scheduler_cycle_max_seconds gauge
slurm_scheduler_cycle_max_seconds 19940
# HELP slurm_scheduler_cycle_mean_seconds Mean time in microseconds for all scheduling cycles since last reset
# TYPE slurm_scheduler_cycle_mean_seconds gauge
slurm_scheduler_cycle_mean_seconds 208
# HELP slurm_scheduler_cycle_perminute_total Number of scheduling executions per minute
# TYPE slurm_scheduler_cycle_perminute_total gauge
slurm_scheduler_cycle_perminute_total 1
# HELP slurm_scheduler_cycle_sum_seconds_total Total run time in microseconds for all scheduling cycles since last reset
# TYPE slurm_scheduler_cycle_sum_seconds_total gauge
slurm_scheduler_cycle_sum_seconds_total 98826
# HELP slurm_scheduler_cycle_total Number of scheduling cycles since last reset
# TYPE slurm_scheduler_cycle_total gauge
slurm_scheduler_cycle_total 474
# HELP slurm_scheduler_dbdagentqueue_total Number of messages for SlurmDBD that are queued
# TYPE slurm_scheduler_dbdagentqueue_total gauge
slurm_scheduler_dbdagentqueue_total 0
# HELP slurm_scheduler_defaultqueuedepth_total Reached number of jobs allowed to be tested
# TYPE slurm_scheduler_defaultqueuedepth_total gauge
slurm_scheduler_defaultqueuedepth_total 0
# HELP slurm_scheduler_endjobqueue_total Reached end of queue
# TYPE slurm_scheduler_endjobqueue_total gauge
slurm_scheduler_endjobqueue_total 474
# HELP slurm_scheduler_jobs_canceled_total Number of jobs canceled since the last reset
# TYPE slurm_scheduler_jobs_canceled_total gauge
slurm_scheduler_jobs_canceled_total 12
# HELP slurm_scheduler_jobs_completed_total Number of jobs completed since last reset
# TYPE slurm_scheduler_jobs_completed_total gauge
slurm_scheduler_jobs_completed_total 15
# HELP slurm_scheduler_jobs_failed_total Number of jobs failed due to slurmd or other internal issues since last reset
# TYPE slurm_scheduler_jobs_failed_total gauge
slurm_scheduler_jobs_failed_total 0
# HELP slurm_scheduler_jobs_pending_total Number of jobs pending at the time of listed in job_state_ts
# TYPE slurm_scheduler_jobs_pending_total gauge
slurm_scheduler_jobs_pending_total 0
# HELP slurm_scheduler_jobs_running_total Number of jobs running at the time of listed in job_state_ts
# TYPE slurm_scheduler_jobs_running_total gauge
slurm_scheduler_jobs_running_total 0
# HELP slurm_scheduler_jobs_started_total Number of jobs started since last reset
# TYPE slurm_scheduler_jobs_started_total gauge
slurm_scheduler_jobs_started_total 19
# HELP slurm_scheduler_jobs_stats_timestamp When the job state counts were gathered (UNIX timestamp)
# TYPE slurm_scheduler_jobs_stats_timestamp gauge
slurm_scheduler_jobs_stats_timestamp 1.757415788e+09
# HELP slurm_scheduler_jobs_submitted_total Number of jobs submitted since last reset
# TYPE slurm_scheduler_jobs_submitted_total gauge
slurm_scheduler_jobs_submitted_total 27
# HELP slurm_scheduler_licenses_total Blocked on licenses
# TYPE slurm_scheduler_licenses_total gauge
slurm_scheduler_licenses_total 0
# HELP slurm_scheduler_maxjobstart_total Reached number of jobs allowed to start
# TYPE slurm_scheduler_maxjobstart_total gauge
slurm_scheduler_maxjobstart_total 0
# HELP slurm_scheduler_maxrpc_total Reached RPC limit
# TYPE slurm_scheduler_maxrpc_total gauge
slurm_scheduler_maxrpc_total 0
# HELP slurm_scheduler_maxschedtime_total Reached maximum allowed scheduler time
# TYPE slurm_scheduler_maxschedtime_total gauge
slurm_scheduler_maxschedtime_total 0
# HELP slurm_scheduler_queue_total Number of jobs pending in queue
# TYPE slurm_scheduler_queue_total gauge
slurm_scheduler_queue_total 0
# HELP slurm_scheduler_thread_total Number of current active slurmctld threads
# TYPE slurm_scheduler_thread_total gauge
slurm_scheduler_thread_total 1
```
<br>
{%hackmd vaaMgNRPS4KGJDSFG0ZE0w %}