[Resource Management] Slurm QoS / freeze & unfreeze
===
###### tags: `Slurm / Resource Management`
###### tags: `Slurm`, `HPC`, `Linux`, `Slinky`, `SlinkyProject`, `Account`, `Accounting`, `SlurmDBD`, `QoS`, `frozen`
<br>
[TOC]
<br>
## freeze & unfreeze config
```yaml=
sync:
  frozen_handling:
    # Enable frozen-state handling during synchronization.
    enabled: true
    # Name of the dedicated QOS for frozen associations.
    frozen_qos_name: frozen
    # Definition of the frozen QOS.
    frozen_qos_config:
      # Allow zero running jobs in aggregate for this QOS.
      grp_jobs: 0
      # Allow zero pending or running jobs in aggregate for this QOS.
      grp_submit: 0
      # Keep the QOS priority neutral.
      priority: 0
      # Reject jobs immediately at submission time when QOS limits
      # are violated; without this flag, jobs may be accepted first
      # and then remain stuck in pending state.
      # flags: DenyOnLimit # for QOS Max/Min limits
    # Automatically create the frozen QOS if missing.
    auto_create_frozen_qos: true
    # Parameters applied when the target is frozen.
    frozen_params:
      # Set "frozen" as the default QOS.
      default_qos: frozen
      # Allow only the "frozen" QOS.
      qos: frozen
    # Parameters applied when the target is unfrozen.
    unfrozen_params:
      # Restore the normal default QOS.
      default_qos: normal
      # Restore the inherited QOS list.
      qos: ""
# Usages for frozen_handling:
#
# 1. Create frozen QOS.
# sacctmgr -i add qos frozen set \
# GrpJobs=0 \
# GrpSubmit=0 \
# Priority=0 \
# Flags=DenyOnLimit
#
# 2. Update frozen QOS if it already exists.
# sacctmgr -i modify qos frozen set \
# GrpJobs=0 \
# GrpSubmit=0 \
# Priority=0 \
# Flags=DenyOnLimit
#
# 3. Freeze a user association.
# sacctmgr -i modify user where name=<USER> account=<ACCOUNT> set \
# DefaultQOS=frozen \
# QosLevel=frozen
#
# 4. Unfreeze a user association.
# sacctmgr -i modify user where name=<USER> account=<ACCOUNT> set \
# DefaultQOS=-1 \
# QosLevel=''
#
# 5. Freeze an account association.
# sacctmgr -i modify account where name=<ACCOUNT> set \
# DefaultQOS=frozen \
# QosLevel=frozen
#
# 6. Unfreeze an account association.
# sacctmgr -i modify account where name=<ACCOUNT> set \
# DefaultQOS=-1 \
# QosLevel=''
```
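The manual commands listed in the comments above (usages 3 to 6) can also be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named `build_freeze_command` that is not part of any existing tool; it only builds the `sacctmgr` argument list and does not execute anything.

```python
from typing import List, Optional


def build_freeze_command(account: str,
                         user: Optional[str] = None,
                         frozen: bool = True,
                         frozen_qos: str = "frozen") -> List[str]:
    """Build the sacctmgr argv that freezes or unfreezes an association.

    Hypothetical helper for illustration; it mirrors usages 3-6 above.
    """
    if user is not None:
        # User association: select by user name and account.
        where = ["user", "where", f"name={user}", f"account={account}"]
    else:
        # Account association: select by account name only.
        where = ["account", "where", f"name={account}"]

    if frozen:
        params = [f"DefaultQOS={frozen_qos}", f"QosLevel={frozen_qos}"]
    else:
        # -1 clears DefaultQOS. "QosLevel=" is the argv form of the shell's
        # QosLevel='' and clears the local QOS list, so both values fall
        # back to the parent association.
        params = ["DefaultQOS=-1", "QosLevel="]

    return ["sacctmgr", "-i", "modify", *where, "set", *params]
```

The list form is meant to be passed to something like `subprocess.run` without a shell, which is why the empty `QosLevel` value carries no quotes.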
### Pitfalls
- ### 0. The correct mental model:
  - `QosLevel=''` -> `Qos=''` -> falls back to the parent's setting
  - `DefaultQOS=-1` -> falls back to the parent's setting
- ### 1. Current state
```
root@slurm-login-slinky-599d7fc57b-d9kd2:/tmp# sacctmgr show account science --as
Account Descr Org Cluster ParentName User Share Priority GrpJobs GrpNodes GrpCPUs GrpMem GrpSubmit GrpWall GrpCPUMins MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS
---------- -------------------- -------------------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- ------- --------- ----------- ----------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
science science science slurm_slu+ root 1 normal
science science science slurm_slu+ cehmistry 1 normal
science science science slurm_slu+ chemistry 1 normal
science science science slurm_slu+ physics 1 normal
root@slurm-login-slinky-599d7fc57b-d9kd2:/tmp# sacctmgr show user where defaultaccount=science --as
User Def Acct Admin Cluster Account Partition Share Priority MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
cehmistry science None slurm_slu+ science 1 normal
chemistry science None slurm_slu+ science 1 normal
physics science None slurm_slu+ science 1 normal
```
- ### 2. Steps
```
# freeze
sacctmgr -i modify user \
where name=chemistry account=science set \
DefaultQOS=frozen \
QosLevel=frozen
# view
sacctmgr show account science --as
sacctmgr show user where defaultaccount=science --as
# unfreeze
sacctmgr -i modify user \
where name=chemistry account=science set \
DefaultQOS=normal \
QosLevel=''
sacctmgr: error:
These associations don't have access to their default qos.
Please give them access before they the default can be set to this.
DefQOS = normal C = slurm_slurm A = science U = chemistry
Modified account associations...
Error with request: This request would make it so some associations would not have access to their default qos.
```
- ### The correct way to unfreeze a user
```
# unfreeze
sacctmgr -i modify user \
where name=chemistry account=science set \
DefaultQOS=-1 \
QosLevel=''
```
- Change `DefaultQOS=normal` to `DefaultQOS=-1`
- ### `us.anthropic.claude-sonnet-4-5-20250929-v1:0` (2026/03/11)
Based on the source code and the official documentation:
### Root cause
The relevant code is in `src/sacctmgr/association_functions.c:236-254`:
```c
if (atoi(value) == -1)
    assoc->def_qos_id = -1;
else
    assoc->def_qos_id = str_2_slurmdb_qos(g_qos_list, value);
if (assoc->def_qos_id == NO_VAL) {
    fprintf(stderr,
            "You gave a bad default qos '%s'. "
            "Use 'list qos' to get complete list.\n",
            value);
    exit_code = 1;
}
```
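When you pass an empty string `''`:
1. `atoi("")` returns `0` (not `-1`)
2. The program then looks for a QOS whose name is the empty string
3. None is found, so `NO_VAL` is returned and the error message is triggered
### Fix
**Use `-1` instead of an empty string to clear DefaultQos**: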
```bash
sacctmgr modify user chemistry set DefaultQos=-1
```
This is stated explicitly in the `sacctmgr` man page (doc/man/man1/sacctmgr.1:381):
> **DefaultQOS**=<default_qos>
>
> To clear an existing value, set a new value of **-1**.

After running it, the DefaultQos of user chemistry is cleared and becomes empty (inherited from the parent, or no default at all).
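
To make the `-1` versus empty-string behaviour concrete, here is a simplified Python model of the branch quoted above. It is a sketch, not the Slurm implementation: `c_atoi`, `parse_default_qos`, `CLEAR`, and the `qos_ids` dict are illustrative names, a plain dict lookup stands in for `str_2_slurmdb_qos`, and `None` stands in for `NO_VAL`.

```python
import re
from typing import Dict, Optional

CLEAR = -1      # stands in for def_qos_id = -1 ("clear the value")
NO_VAL = None   # stands in for Slurm's NO_VAL ("lookup failed")


def c_atoi(text: str) -> int:
    """Approximate C atoi(): parse a leading integer, else return 0."""
    match = re.match(r"\s*([+-]?\d+)", text)
    return int(match.group(1)) if match else 0


def parse_default_qos(value: str, qos_ids: Dict[str, int]) -> Optional[int]:
    """Simplified model of the DefaultQOS branch quoted above."""
    if c_atoi(value) == -1:
        return CLEAR
    # Assumption: an exact-name dict lookup models str_2_slurmdb_qos();
    # unknown names (including the empty string) yield NO_VAL.
    return qos_ids.get(value, NO_VAL)
```

With `qos_ids = {"normal": 1, "frozen": 2}`, the value `"-1"` takes the clear path, while `""` falls through to the lookup and fails, which corresponds to the "bad default qos" error.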
<br>
## Troubleshooting
### How to display an association without the values inherited from its parent?
- ### Solution: use WOPLimits
```
sacctmgr show assoc WOPLimits format=cluster,account,user,qoslevel
```
- ### [WOPLimits](https://slurm.schedmd.com/sacctmgr.html#OPT_WOPLimits)
Display information without hierarchical parent limits (i.e. will only display limits where they are set instead of propagating them from the parent).
- **WOPLimits**: Without Parent Limits
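- ### What WOPLimits does:
**It does not expand the limits / QOS inherited from upper levels into the output; it shows only what is set on this level itself.**
You can think of it as:
- **Without `WOPLimits`**
You see the **final, computed result including inheritance**.
- **With `WOPLimits`**
You see **what was actually set on this level**; values passed down from the parent are not expanded.
- ### When to add `WOPLimits`
When you want to find out:
* whether the QOS of a user/account is **set locally** or **inherited from above**
* why an association appears to have a certain QOS
* which level added a QOS with `+` or excluded it with `-`
- ### Effect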
- ### account associations
```
$ sacctmgr show assoc account=science format=cluster,account,user,qoslevel
Cluster Account User QOS
---------- ---------- ---------- --------------------
slurm_slu+ science frozen
slurm_slu+ science chemistry frozen
slurm_slu+ science physics frozen
$ sacctmgr show assoc account=science WOPLimits format=cluster,account,user,qoslevel
Cluster Account User QOS
---------- ---------- ---------- --------------------
slurm_slu+ science frozen
slurm_slu+ science chemistry
slurm_slu+ science physics
```
- ### account-user associations
```
$ sacctmgr show assoc account=science user=chemistry format=cluster,account,user,qoslevel
Cluster Account User QOS
---------- ---------- ---------- --------------------
slurm_slu+ science chemistry frozen
$ sacctmgr show assoc account=science user=chemistry WOPLimits format=cluster,account,user,qoslevel
Cluster Account User QOS
---------- ---------- ---------- --------------------
slurm_slu+ science chemistry
```
- ### account-level
```
$ sacctmgr show account science --as user='' format=cluster,account,user,qoslevel
Cluster Account User QOS
---------- ---------- ---------- --------------------
slurm_slu+ science frozen
$ sacctmgr show account science --as user='' WOPLimits format=cluster,account,user,qoslevel
Cluster Account User QOS
---------- ---------- ---------- --------------------
slurm_slu+ science frozen
$ sacctmgr show assoc account=science user='' format=cluster,account,user,qoslevel
Cluster Account User QOS
---------- ---------- ---------- --------------------
slurm_slu+ science frozen
$ sacctmgr show assoc account=science user='' WOPLimits format=cluster,account,user,qoslevel
Cluster Account User QOS
---------- ---------- ---------- --------------------
slurm_slu+ science frozen
```
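The comparison above can be automated by diffing the two listings. The sketch below assumes the output was captured in parsable form, for example with `sacctmgr -nP show assoc account=science format=cluster,account,user,qoslevel`, once with and once without `WOPLimits`, so that each line reads `cluster|account|user|qos`. The function names `parse_assoc_qos` and `qos_origin` are hypothetical.

```python
from typing import Dict, Tuple

Key = Tuple[str, str, str]  # (cluster, account, user)


def parse_assoc_qos(output: str) -> Dict[Key, str]:
    """Parse `cluster|account|user|qos` lines into a dict keyed by association."""
    rows: Dict[Key, str] = {}
    for line in output.strip().splitlines():
        cluster, account, user, qos = line.split("|")
        rows[(cluster, account, user)] = qos
    return rows


def qos_origin(effective: str, local: str) -> Dict[Key, str]:
    """Label each association's QOS as 'local', 'inherited', or 'none'.

    effective: listing without WOPLimits (inheritance expanded)
    local:     listing with WOPLimits (only locally set values)
    """
    eff, loc = parse_assoc_qos(effective), parse_assoc_qos(local)
    origin: Dict[Key, str] = {}
    for key, qos in eff.items():
        if loc.get(key):
            origin[key] = "local"       # set on this association itself
        elif qos:
            origin[key] = "inherited"   # only visible after inheritance
        else:
            origin[key] = "none"        # no QOS at all
    return origin
```

Applied to the account associations listing above, the account row would be labelled `local` and the `chemistry` and `physics` rows `inherited`.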
<br>
---
## Discussion
### How to correctly match QOS and DefaultQOS?
- ### Question
```python=
def is_account_frozen(self, account_name: str) -> bool:
    """
    Check if an account is frozen
    Args:
        account_name: Account name
    Returns:
        True if account is frozen
    """
    if not self.frozen_enabled:
        return False
    try:
        output = self._run_command([
            'show', 'account', account_name,
            f'cluster={self.cluster_name}',
            '-P', '--associations',
            'format=Account,QOS,DefaultQOS'
        ])
        lines = output.strip().split('\n')
        if len(lines) > 1 and lines[1].strip():
            parts = lines[1].split('|')
            if len(parts) >= 3:
                qos = parts[1]
                default_qos = parts[2]
                # Check if QoS or DefaultQOS contains frozen QoS name
                return (
                    self.frozen_qos_name in qos or
                    self.frozen_qos_name in default_qos
                )
        return False
    except subprocess.CalledProcessError:
        return False
```
Based on the codebase, point out what is wrong with the QoS check.
- ### `us.anthropic.claude-sonnet-4-5-20250929-v1:0` (2026/03/23)
The key function is `sacctmgr_print_default_qos` in the Slurm source, which formats the DefaultQOS output.
From `/tmp/mycodeagent/slurm-v25.11.4-20260320-1774247128950/src/sacctmgr/association_functions.c`, lines 536-551:
```c
extern void sacctmgr_print_default_qos(uint32_t def_qos_id,
                                       print_field_t *field, bool last)
{
    char *tmp_char = NULL, *print_acct = NULL;
    if (!g_qos_list)
        g_qos_list = slurmdb_qos_get(db_conn, NULL);
    if (def_qos_id != NO_VAL) {
        tmp_char = slurmdb_qos_str(g_qos_list, def_qos_id); // ← a single QoS name
        if (!tmp_char)
            tmp_char = print_acct =
                xstrdup_printf("UNKN-%u", def_qos_id);
    } else
        tmp_char = print_acct = xstrdup("");
For comparison, the handling of the `QOS` field (lines 741-748):
```c
case PRINT_QOS:
    if (!g_qos_list)
        g_qos_list = slurmdb_qos_get(db_conn, NULL);
    tmp_char = get_qos_complete_str(g_qos_list, assoc->qos_list); // ← a QoS list
    field->print_routine(field, tmp_char, last);
    xfree(tmp_char);
    break;
```
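According to the Slurm source:
1. **DefaultQOS**:
   - Internal storage: `uint32_t def_qos_id` (a single QoS ID)
   - Output format: **a single QoS name** (e.g. `"normal"`)
   - **Never a comma-separated string**
2. **QOS**:
   - Internal storage: `list_t *qos_list` (a QoS list)
   - Output format: **a comma-separated QoS list** (e.g. `"normal,high,urgent"`)
### Analysis of the original code
Although `DefaultQOS` is a single value, the original code is still wrong: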
```python
# ❌ Problematic code
return (
    self.frozen_qos_name in qos or        # qos is a comma-separated list
    self.frozen_qos_name in default_qos   # default_qos is a single value
)
```
### Both fields can produce false positives:
1. **QOS field (list)**: must be split on commas
```python
frozen_qos_name = "frozen"
qos = "unfrozen,normal"
# "frozen" in "unfrozen,normal" → True ❌ false positive!
```
2. **DefaultQOS field (single value)**: even for a single value, substring matching is still a problem
```python
frozen_qos_name = "test"
default_qos = "latest"
# "test" in "latest" → True ❌ false positive!
```
### ✅ Correct approach
```python
def is_account_frozen(self, account_name: str) -> bool:
    """Check if an account is frozen"""
    if not self.frozen_enabled:
        return False
    try:
        output = self._run_command([
            'show', 'account', account_name,
            f'cluster={self.cluster_name}',
            '-P', '--associations',
            'format=Account,QOS,DefaultQOS'
        ])
        lines = output.strip().split('\n')
        if len(lines) > 1 and lines[1].strip():
            parts = lines[1].split('|')
            if len(parts) >= 3:
                qos = parts[1]
                default_qos = parts[2]
                # QOS field: split on commas (it is a list)
                qos_list = [q.strip() for q in qos.split(',') if q.strip()]
                # DefaultQOS field: compare directly (it is a single value)
                return (
                    self.frozen_qos_name in qos_list or
                    self.frozen_qos_name == default_qos.strip()  # ← exact match
                )
        return False
    except subprocess.CalledProcessError:
        return False
```
### Key fixes
1. **QOS**: split on commas and test list membership → `in qos_list`
2. **DefaultQOS**: exact string comparison → `== default_qos.strip()`
This avoids the false positives caused by substring matching.
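
The two fixes can be isolated into a small pure function that is easy to check without a running cluster. This is a minimal sketch; `matches_frozen` is a hypothetical name, and its inputs are assumed to be the raw `QOS` and `DefaultQOS` fields of one line of `-P` output.

```python
def matches_frozen(qos_field: str, default_qos_field: str,
                   frozen_qos_name: str = "frozen") -> bool:
    """Exact-match check on one association's QOS and DefaultQOS fields."""
    # QOS is a comma-separated list: split it and compare whole names.
    qos_list = [q.strip() for q in qos_field.split(",") if q.strip()]
    # DefaultQOS is a single name: compare it as a whole string.
    return (frozen_qos_name in qos_list
            or frozen_qos_name == default_qos_field.strip())


# Substring look-alikes no longer match.
assert not matches_frozen("unfrozen,normal", "normal")
assert not matches_frozen("latest", "latest", frozen_qos_name="test")
# Genuine matches in either field still do.
assert matches_frozen("normal,frozen", "normal")
assert matches_frozen("", "frozen")
```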
<br>
---
### project, user, and association each have their own frozen flag; when does association.frozen=true occur?
- ### Question
IAM data:
```
{
  "memberships": [
    {
      "membershipId": "...",
      "projectId": "...",
      "user": {
        "userId": "...",
        "frozen": false,
        ...
      },
      "frozen": false,
      ...
    }
  ],
  "total": 1
}
```
project, user, and association each have their own frozen flag; under what circumstances does association.frozen=true occur?
- ### gpt-5.4 (2026/03/23)
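Judging from this data structure, `association.frozen=true` **most likely means that this particular "user ↔ project" association has been disabled/frozen**, while **the user itself is not necessarily disabled** and **neither is the project**.
In other words, the three `frozen` flags probably belong to three different levels:
* `user.frozen`: the user account as a whole is frozen
* `project.frozen`: the project as a whole is frozen
* `association.frozen`: this user's membership in this project is frozen
### When `association.frozen=true` tends to appear
The most common meanings are:
1. **The user is not suspended but has been removed from a project**
   * For example:
     * The `TJ Tsai` account is still valid
     * `project A` is still valid
     * But `TJ Tsai` no longer belongs to `project A`
   * To preserve history, the membership row is not deleted; the association is marked frozen instead
2. **Someone's access to one project should be paused temporarily**
   * The whole account is not blocked
   * The user is only barred from this project for now
   * In that case, freezing the association rather than the user is the sensible choice
3. **An external sync source revoked the association**
   * For example, the membership was imported from an external group / HR / LDAP / IAM sync
   * Later the upstream data shows that the person no longer belongs to the project
   * The system may mark the association frozen instead of deleting it, which helps with audit / rollback
4. **A role or permission binding became invalid, but the membership history should be kept**
   * For example: a permission policy change, a project transfer, or an expiry
   * The system keeps the association but no longer lets it take effect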
---
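### How to think about it
You can think of it as:
```text
user.frozen = this person cannot use anything
project.frozen = this project cannot be used by anyone
association.frozen = this person can no longer use this project
```
A typical authorization check in practice therefore looks like: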
```text
effective_access =
!user.frozen &&
!project.frozen &&
!association.frozen
```
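The same rule as a runnable sketch. The `FrozenFlags` dataclass and its field names are hypothetical and only model the three flags discussed here; they are not taken from the IAM API.

```python
from dataclasses import dataclass


@dataclass
class FrozenFlags:
    """Frozen flags at the three levels (hypothetical model)."""
    user_frozen: bool = False
    project_frozen: bool = False
    association_frozen: bool = False


def effective_access(flags: FrozenFlags) -> bool:
    """Access is granted only when none of the three levels is frozen."""
    return not (flags.user_frozen
                or flags.project_frozen
                or flags.association_frozen)


# A frozen membership blocks access even if user and project are active.
assert effective_access(FrozenFlags()) is True
assert effective_access(FrozenFlags(association_frozen=True)) is False
```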
---
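### Cases that do not necessarily set association.frozen=true
The following two do not necessarily propagate to the association:
* **`user.frozen=true`**
  * Some systems freeze only the user and do not flip every association to frozen one by one
* **`project.frozen=true`**
  * Some systems freeze only the project and do not touch the memberships either

In other words, `association.frozen=true` is usually not derived automatically from a frozen user/project; it is **the state of this membership itself**.

---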
### How I would read this JSON
This record looks like:
* `user`: TJ Tsai
* `projectId`: some project
* `membershipId`: the primary key of this membership/association
* `tenantRole`: `TENANT_MEMBER`

So `association.frozen=true` can pretty much be read as:
> **TJ Tsai still exists as a user, but this membership between TJ Tsai and the project has been disabled.**
---
### The most practical guess
If your system is designed to "remove a member but keep the history",
then `association.frozen=true` is **very likely a soft delete / inactive membership**.

---
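### How to verify whether your system works this way
The most direct way is to run three tests:
1. **Disable only the user**
   * Check whether the API returns just `user.frozen=true` or also flips `association.frozen`
2. **Disable only the project**
   * Check whether the association becomes frozen as well
3. **Remove the user from the project without deleting the underlying record**
   * If `association.frozen=true` shows up at this point,
   * it is almost certainly a membership-level disable / soft delete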
<br>
{%hackmd vaaMgNRPS4KGJDSFG0ZE0w %}