slurm-cluster-minimal.yaml
===
###### tags: `K8s / app / Slurm Operator`
###### tags: `Kubernetes`, `k8s`, `app`, `Slurm`, `Operator`, `Soperator`, `SlurmCluster`

## 2025/06/10

### slurm-cluster-minimal.yaml
```yaml=
apiVersion: slurm.nebius.ai/v1
kind: SlurmCluster
metadata:
  name: soperator # slurm-cluster-minimal
  namespace: soperator
spec:
  # Cluster type (CPU-only is not currently supported)
  clusterType: gpu
  # Recommended: set this to the same version as the Helm chart
  crVersion: "1.19.0"

  # 1. Node filter: select only nodes that have GPUs
  k8sNodeFilters:
    - name: gpu-filter
      nodeSelector:
        nvidia.com/gpu.present: "true"
      #affinity:
      #  nodeAffinity:
      #    requiredDuringSchedulingIgnoredDuringExecution:
      #      nodeSelectorTerms:
      #        - matchExpressions:
      #            - key: nvidia.com/gpu.present
      #              operator: In
      #              values:
      #                - "true"

  # 2. Periodic health check: NCCL benchmark
  periodicChecks:
    ncclBenchmark:
      image: "ghcr.io/nebius/soperator/nccl_benchmark:1.19.0-jammy-slurm24.05.5"
      # Runs every 3 hours by default
      schedule: "0 */3 * * *"
      k8sNodeFilterName: gpu-filter

  ## error: rendering worker NCCL topology ConfigMap: unknown NCCLType
  ncclSettings:
    topologyType: auto
  #ncclSettings:
  #  topologyType: custom
  #  topologyData: |
  #    <topology>
  #      <system>
  #        <node id="0" />
  #        <node id="1" />
  #      </system>
  #    </topology>

  # 3. Populate the jail filesystem snapshot
  populateJail:
    image: "ghcr.io/nebius/soperator/populate_jail:1.19.0-jammy-slurm24.05.5"
    k8sNodeFilterName: gpu-filter
    #jailSnapshotVolume:
    #  dataSource:
    #    kind: PersistentVolumeClaim
    #    name: jail-pvc

  # 4. Name of the Secret that holds the SSHD keys
  secrets:
    sshdKeysName: sshd-keys

  # Globally define the PVCs as VolumeSources
  volumeSources:
    - name: jail
      persistentVolumeClaim:
        claimName: jail-pvc
    - name: jail-root
      emptyDir: {}
    - name: spool
      persistentVolumeClaim:
        claimName: spool-pvc

  # 5. Settings for each Slurm node type
  slurmNodes:
    controller:
      size: 1
      k8sNodeFilterName: gpu-filter
      munge:
        image: "ghcr.io/nebius/soperator/munge:1.19.0-jammy-slurm24.05.5"
      slurmctld:
        image: "ghcr.io/nebius/soperator/controller_slurmctld:1.19.0-jammy-slurm24.05.5"
        port: 6817
      volumes:
        jail:
          volumeSourceName: jail
        spool:
          volumeSourceName: spool

    worker:
      size: 1
      k8sNodeFilterName: gpu-filter
      munge:
        image: "ghcr.io/nebius/soperator/munge:1.19.0-jammy-slurm24.05.5"
      slurmd:
        image: "ghcr.io/nebius/soperator/worker_slurmd:1.19.0-jammy-slurm24.05.5"
        port: 6818
        resources:
          cpu: "500m"
          memory: "1Gi"
      volumes:
        jail:
          volumeSourceName: jail
        jailSubMounts:
          - name: jail-root
            mountPath: "/"
            volumeSourceName: jail-root
        spool:
          volumeSourceName: spool

    login:
      size: 1
      k8sNodeFilterName: gpu-filter
      munge:
        image: "ghcr.io/nebius/soperator/munge:1.19.0-jammy-slurm24.05.5"
      sshRootPublicKeys:
        - <host_key_1>
        - <host_key_2>
      sshd:
        image: "ghcr.io/nebius/soperator/login_sshd:1.19.0-jammy-slurm24.05.5"
        port: 2222
      sshdServiceType: NodePort
      sshdServiceNodePort: 31080
      volumes:
        jail:
          volumeSourceName: jail
        jailSubMounts:
          - name: jail-root
            mountPath: "/"
            volumeSourceName: jail-root

    # Prometheus exporter node
    exporter:
      size: 1
      k8sNodeFilterName: gpu-filter
      munge:
        image: "ghcr.io/nebius/soperator/munge:1.19.0-jammy-slurm24.05.5"
      exporter:
        image: "ghcr.io/nebius/soperator/exporter:1.19.0-jammy-slurm24.05.5"
      volumes:
        jail:
          volumeSourceName: jail

    rest:
      size: 1
      k8sNodeFilterName: gpu-filter
      rest:
        image: "ghcr.io/nebius/soperator/slurmrestd:1.19.0-jammy-slurm24.05.5"

    # Accounting (by default, no connection to an external DB)
    accounting:
      size: 1
      k8sNodeFilterName: gpu-filter
      enabled: false
      munge:
        image: "ghcr.io/nebius/soperator/munge:1.19.0-jammy-slurm24.05.5"
```

<br>

## jail-pv+pvc.yaml
```yaml=
# kubectl delete pv jail-pv
apiVersion: v1
kind: PersistentVolume
metadata:
  name: jail-pv
  # PersistentVolumes are cluster-scoped; the namespace field does not apply to them
  namespace: soperator
spec:
  storageClassName: ""
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    # Placeholder: replace with an absolute path on the node (hostPath does not expand ~)
    path: {~/k8s/volume/soperator/jail}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jail-pvc
  namespace: soperator
spec:
  storageClassName: ""
  resources:
    requests:
      storage: 5Gi
  accessModes:
    - ReadWriteOnce
```

<br>

## spool-pv+pvc.yaml
```yaml=
apiVersion: v1
kind: PersistentVolume
metadata:
  name: spool-pv
  # PersistentVolumes are cluster-scoped; the namespace field does not apply to them
  namespace: soperator
spec:
  storageClassName: ""
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    # Placeholder: replace with an absolute path on the node (hostPath does not expand ~)
    path: {~/k8s/volume/soperator/spool}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spool-pvc
  namespace: soperator
spec:
  storageClassName: ""
  resources:
    requests:
      storage: 5Gi
  accessModes:
    - ReadWriteOnce
```
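
<br>

Both PVs above share the same capacity, access mode, and empty `storageClassName`, so either claim could in principle bind to either volume. A minimal sketch of one way to pin the pairing, assuming the PV and PVC names used above, is to set `spec.volumeName` on each claim. This is a suggested variation rather than part of the original manifests, and only the jail claim is shown; the spool claim would follow the same pattern with `spool-pv`.

```yaml=
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jail-pvc
  namespace: soperator
spec:
  storageClassName: ""
  # Assumption: pin this claim to the PV named jail-pv so it cannot bind to spool-pv
  volumeName: jail-pv
  resources:
    requests:
      storage: 5Gi
  accessModes:
    - ReadWriteOnce
```

After applying, `kubectl get pvc -n soperator` lists each claim together with the volume it is bound to, which is a quick way to confirm the pairing.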