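[Enroot] Slurm ↔ Pyxis plugin ↔ Enroot ↔ OCI/Docker
===
###### tags: `SlinkyProject / SPANK`
###### tags: `Kubernetes`, `k8s`, `app`, `slurm`, `SlinkyProject`, `SPANK`, `Slurm Plug-in`, `Pyxis`, `Enroot`

<br>

[TOC]

<br>

## Core concepts
Slurm supports Enroot so that users can run containerized jobs safely on an HPC cluster without being granted Docker privileges, while keeping the environment reproducible.

- ### How is "Pyxis" pronounced?
  > by gpt-5.2 (2025/12/24)

  The most common pronunciations of "Pyxis" are:

  * **Common English reading**: **/ˈpɪk.sɪs/**, which sounds like **"PICK-sis"** (roughly 「**皮克西斯**」 in Mandarin)
  * **An alternative some people use**: **/ˈpaɪk.sɪs/**, like **"PIKE-sis"** (roughly 「**派克西斯**」)

  In the context of **Pyxis for Slurm (the container plugin)**, the community mostly uses the **"PICK-sis"** reading.

- ### How is "enroot" pronounced?
  > by gpt-5.2 (2025/12/24)

  "enroot" is usually read as its two literal parts:

  * **Common English reading**: **/ɛnˈruːt/**, which sounds like **"en-ROOT"** (stress on ROOT, roughly 「恩-魯特」 in Mandarin)

  Some people use a more casual reading:

  * **"EN-root"** (reading "en" more like the spelled-out letters EN), but **en-ROOT** is still the most common.

<br>

---

## Configuration

- ### Running `$> enroot create xxx.sqsh` inside a worker pod
  - **[Key]** Reads the worker pod's environment variable `ENROOT_DATA_PATH` (pointing to a path in the current environment)

- ### Running `$> enroot create xxx.sqsh` inside a login pod
  - **[Key]** Reads the login pod's environment variable `ENROOT_DATA_PATH` (pointing to a path in the current environment)

- ### Running `$> srun --container-image=xxx ...` inside a login pod
  > (i.e.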
the job is handed off from the login pod to a worker pod for execution)
  - **[Key]** Reads the login pod's environment variable `XDG_DATA_HOME` (which must point to a path on the worker pod)
    - Setting `ENROOT_DATA_PATH` has no effect
    - If `XDG_DATA_HOME` is not set, the default value `~/.local/share` is used directly
      https://github.com/NVIDIA/enroot/blob/main/enroot.in#L73
      - `config::export()` logic
        https://github.com/NVIDIA/enroot/blob/main/enroot.in#L27-L29
        - If the variable is already set, its value is kept
        - If the variable is not set, the default value is used
  - :warning: Notes
    - Presetting the `ENROOT_*` / `XDG_*` variables directly in the worker pod --> no speed-up
      - The job does not see the worker pod's environment variables (so the two environments are never merged)
      - When `srun` runs, it carries all environment variables of the login pod into the job by default (there is no need to pass `--export=ALL` explicitly)

<br>

## /etc/enroot/enroot.conf
### Default contents
```
$ cat /etc/enroot/enroot.conf
#ENROOT_LIBRARY_PATH /usr/lib/enroot
#ENROOT_SYSCONF_PATH /etc/enroot
#ENROOT_RUNTIME_PATH ${XDG_RUNTIME_DIR}/enroot
#ENROOT_CONFIG_PATH ${XDG_CONFIG_HOME}/enroot
#ENROOT_CACHE_PATH ${XDG_CACHE_HOME}/enroot
#ENROOT_DATA_PATH ${XDG_DATA_HOME}/enroot
#ENROOT_TEMP_PATH ${TMPDIR:-/tmp}

# Gzip program used to uncompress digest layers.
#ENROOT_GZIP_PROGRAM gzip

# Options passed to zstd to compress digest layers.
#ENROOT_ZSTD_OPTIONS -1

# Options passed to mksquashfs to produce container images.
#ENROOT_SQUASH_OPTIONS -comp lzo -noD

# Make the container root filesystem writable by default.
#ENROOT_ROOTFS_WRITABLE no

# Remap the current user to root inside containers by default.
#ENROOT_REMAP_ROOT no

# Maximum number of processors to use for parallel tasks (0 means unlimited).
#ENROOT_MAX_PROCESSORS $(nproc)

# Maximum number of concurrent connections (0 means unlimited).
#ENROOT_MAX_CONNECTIONS 10

# Maximum time in seconds to wait for connections establishment (0 means unlimited).
#ENROOT_CONNECT_TIMEOUT 30

# Maximum time in seconds to wait for network operations to complete (0 means unlimited).
#ENROOT_TRANSFER_TIMEOUT 0

# Number of times network operations should be retried.
#ENROOT_TRANSFER_RETRIES 0

# Use a login shell to run the container initialization.
#ENROOT_LOGIN_SHELL yes

# Allow root to retain his superuser privileges inside containers.
#ENROOT_ALLOW_SUPERUSER no

# Use HTTP for outgoing requests instead of HTTPS (UNSECURE!).
#ENROOT_ALLOW_HTTP no

# Include user-specific configuration inside bundles by default.
#ENROOT_BUNDLE_ALL no

# Generate an embedded checksum inside bundles by default.
#ENROOT_BUNDLE_CHECKSUM no

# Mount the current user's home directory by default.
#ENROOT_MOUNT_HOME no

# Restrict /dev inside the container to a minimal set of devices.
#ENROOT_RESTRICT_DEV no

# Always use --force on command invocations.
#ENROOT_FORCE_OVERRIDE no

# SSL certificates settings:
#SSL_CERT_DIR
#SSL_CERT_FILE

# Proxy settings:
#all_proxy
#no_proxy
#http_proxy
#https_proxy
```

<br>

---

### Specifying the path `/run/enroot/`: what goes wrong?
> - enroot import -> Permission denied

- ### enroot.conf
  ```
  $ cat /etc/enroot/enroot.conf
  ENROOT_RUNTIME_PATH /run/enroot/${UID}/run
  ENROOT_CONFIG_PATH /run/enroot/${UID}/config
  ENROOT_CACHE_PATH /run/enroot/${UID}/cache
  ENROOT_DATA_PATH /run/enroot/${UID}/data
  ENROOT_TEMP_PATH /run/enroot/${UID}
  ```

- ### Running `enroot import` -> Permission denied
  ```
  $ enroot import docker://ubuntu:24.04
  mkdir: cannot create directory ‘/run/enroot/122599’: Permission denied
  mkdir: cannot create directory ‘/run/enroot/122599’: Permission denied
  mkdir: cannot create directory ‘/run/enroot/122599’: Permission denied
  ```
  - Permissions
    ```
    $ ls -ls /run/ | grep enroot
    0 drwxr-xr-x 2 root root 10 Dec 16 22:07 enroot
    $ ls -ls /run/enroot
    total 0
    $ touch /run/enroot/new_file
    touch: cannot touch '/run/enroot/new_file': Permission denied
    ```
    - The directory is owned by `root:root` -> a regular user cannot write to it, which causes Permission denied

- ### Adding the required ENROOT environment variables
  ```
  export ENROOT_CACHE_PATH=/tmp
  export ENROOT_TEMP_PATH=/tmp
  export ENROOT_DATA_PATH=/tmp
  export ENROOT_RUNTIME_PATH=/tmp
  $ enroot import docker://ubuntu:24.04
  [INFO] Querying registry for permission grant
  [INFO] Authenticating with user: <anonymous>
  [INFO] Authentication succeeded
  [INFO] Fetching image manifest list
  [INFO] Fetching image manifest
  [INFO] Downloading 1 missing layers...
  parallel: Error: $XDG_CACHE_HOME can only contain [-a-z0-9_+,.%:/= ].
  ```

- ### The value of `$XDG_CACHE_HOME` must not be email-style (containing `@`)
  The default `~/.cache` resolves under `/home/tj_tsai@asus.com`, and `@` is outside the allowed character set, so point the variable at a path without `@`:
  ```
  $ export XDG_CACHE_HOME=/tmp/tj_tsai
  $ enroot import docker://ubuntu:24.04
  [INFO] Querying registry for permission grant
  [INFO] Authenticating with user: <anonymous>
  [INFO] Authentication succeeded
  [INFO] Fetching image manifest list
  [INFO] Fetching image manifest
  [INFO] Downloading 1 missing layers...
  100% 1:0=2s 20043066d3d5c78b45520c5707319835ac7d1f3d7f0dded0138ea0897d6a3188
  [INFO] Extracting image layers...
  100% 1:0=0s 20043066d3d5c78b45520c5707319835ac7d1f3d7f0dded0138ea0897d6a3188
  [INFO] Converting whiteouts...
  100% 1:0=0s 20043066d3d5c78b45520c5707319835ac7d1f3d7f0dded0138ea0897d6a3188
  [INFO] Creating squashfs filesystem...
  Parallel mksquashfs: Using 2 processors
  Creating 4.0 filesystem on /home/tj_tsai@asus.com/ubuntu+24.04.sqsh, block size 131072.
  [===========================================================================================================================================================================|] 2897/2897 100%
  Exportable Squashfs 4.0 filesystem, lzo compressed, data block size 131072
      uncompressed data, compressed metadata, compressed fragments,
      compressed xattrs, compressed ids
      duplicates are removed
  Filesystem size 57496.10 Kbytes (56.15 Mbytes)
      75.16% of uncompressed filesystem size (76499.73 Kbytes)
  Inode table size 41358 bytes (40.39 Kbytes)
      36.26% of uncompressed inode table size (114044 bytes)
  Directory table size 34900 bytes (34.08 Kbytes)
      50.49% of uncompressed directory table size (69122 bytes)
  Number of duplicate files found 131
  Number of inodes 3444
  Number of files 2587
  Number of fragments 272
  Number of symbolic links 197
  Number of device nodes 0
  Number of fifo nodes 0
  Number of socket nodes 0
  Number of directories 660
  Number of hard-links 2
  Number of ids (unique uids + gids) 1
  Number of uids 1
      root (0)
  Number of gids 1
      root (0)
  ```

- ### Starting the ubuntu container
  ```
  $ enroot start ubuntu+24.04.sqsh
  ```

<br>

---
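
The character restriction above can be checked before calling `enroot` or `srun`. The following is a minimal sketch in POSIX shell, assuming the character class printed in the error message is taken literally (lowercase only), which may be stricter than what `parallel` actually enforces. `is_parallel_safe_path` and `pick_cache_home` are hypothetical helper names and are not part of enroot or GNU parallel.

```shell
#!/bin/sh
# Hypothetical helper (not part of enroot or GNU parallel).
# Succeeds when $1 is non-empty and contains only characters from the
# class shown in the error message: [-a-z0-9_+,.%:/= ]
is_parallel_safe_path() {
    # Delete every allowed character; whatever remains is disallowed.
    leftover=$(printf '%s' "$1" | LC_ALL=C tr -d 'a-z0-9_+,.%:/= -')
    [ -n "$1" ] && [ -z "$leftover" ]
}

# Hypothetical helper: print $1 if it is safe, otherwise print a
# UID-based fallback under /tmp (an assumption made for this sketch).
pick_cache_home() {
    if is_parallel_safe_path "$1"; then
        printf '%s\n' "$1"
    else
        printf '/tmp/%s/cache\n' "$(id -u)"
    fi
}

# A path without '@' is returned unchanged.
pick_cache_home "/tmp/tj_tsai"
# A path containing '@' is replaced by /tmp/<numeric uid>/cache.
pick_cache_home "/home/tj_tsai@asus.com/.cache"
```

One possible use is `export XDG_CACHE_HOME="$(pick_cache_home "${XDG_CACHE_HOME:-$HOME/.cache}")"` before `srun`. The fallback uses the numeric UID because, unlike the email-style user name in the examples above, it never contains `@`.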
### Specifying the path `/tmp/`: what goes wrong?
> enroot import / create / start work normally (after the two fixes below)

- ### enroot.conf
  ```
  $ cat /etc/enroot/enroot.conf
  ENROOT_RUNTIME_PATH /tmp/${UID}/run
  ENROOT_CONFIG_PATH /tmp/${UID}/config
  ENROOT_CACHE_PATH /tmp/${UID}/cache
  ENROOT_DATA_PATH /tmp/${UID}/data
  ENROOT_TEMP_PATH /tmp/${UID}/tmp
  ```

- ### Running `enroot import`
  ```
  $ enroot import docker://ubuntu:24.04
  mktemp: failed to create directory via template ‘/tmp/122599/tmp/enroot.XXXXXXXXXX’: No such file or directory
  ```
  - The enroot tool does not create the tmp directory automatically; it only creates cache / data / run
  - Fix: point `ENROOT_TEMP_PATH` at `/tmp/${UID}`

<br>

- ### enroot.conf (corrected)
  ```
  $ cat /etc/enroot/enroot.conf
  ENROOT_RUNTIME_PATH /tmp/${UID}/run
  ENROOT_CONFIG_PATH /tmp/${UID}/config
  ENROOT_CACHE_PATH /tmp/${UID}/cache
  ENROOT_DATA_PATH /tmp/${UID}/data
  ENROOT_TEMP_PATH /tmp/${UID}
  ```

- ### Running `enroot import`
  ```
  $ enroot import docker://ubuntu:24.04
  [INFO] Querying registry for permission grant
  [INFO] Authenticating with user: <anonymous>
  [INFO] Authentication succeeded
  [INFO] Fetching image manifest list
  [INFO] Fetching image manifest
  [INFO] Downloading 1 missing layers...
  parallel: Error: $XDG_CACHE_HOME can only contain [-a-z0-9_+,.%:/= ].
  I have no name!@c2m4-0:~$ export $XDG_CACHE_HOME=/tmp
  bash: export: `=/tmp': not a valid identifier
  ```
  - The failed `export` above has a stray `$` in front of the variable name; the name must be written without it

- ### The value of `$XDG_CACHE_HOME` must not be email-style (containing `@`)
  ```
  $ export XDG_CACHE_HOME=/tmp/tj_tsai
  $ enroot import docker://ubuntu:24.04
  (omitted; same as the earlier example)
  ```

<br>

---

### Specifying the path `/tmp/enroot`: what goes wrong?
> enroot import / create / start work normally

- ### enroot.conf
  ```
  $ cat /etc/enroot/enroot.conf
  ENROOT_RUNTIME_PATH /tmp/enroot/${UID}/run
  ENROOT_CONFIG_PATH /tmp/enroot/${UID}/config
  ENROOT_CACHE_PATH /tmp/enroot/${UID}/cache
  ENROOT_DATA_PATH /tmp/enroot/${UID}/data
  ENROOT_TEMP_PATH /tmp/enroot/${UID}
  ```

- ### enroot operations work normally
  ```
  $ enroot import docker://ubuntu:24.04
  $ enroot create ubuntu+24.04.sqsh
  $ enroot start ubuntu+24.04.sqsh
  ```

- ### srun operation
  ```
  $ srun --container-image=docker://ubuntu:24.04 --pty bash
  ```
  - However, `/tmp/enroot` ends up owned by a single user (mode `drwx------`), so only that user can enter it
    ```
    root@c2m4-0:/tmp# ls -ls
    total 8
    0 drwx------ 3 122599 834009 60 Dec 24 04:04 enroot
    ```
  - If `/tmp/enroot` is set to `root:root` instead, users cannot enter it and fail with Permission denied
    ```
    $ srun -p cpu-set -C c4m16 --container-image=docker://ubuntu:24.04 --pty bash
    pyxis: importing docker image: docker://ubuntu:24.04
    [2025-12-19T09:09:49] error: pyxis: child 826 failed with error code: 1
    [2025-12-19T09:09:49] error: pyxis: failed to import docker image
    [2025-12-19T09:09:49] error: pyxis: printing enroot log file:
    [2025-12-19T09:09:49] error: pyxis: mkdir: cannot create directory ‘/tmp/enroot’: Permission denied
    [2025-12-19T09:09:49] error: pyxis: mkdir: cannot create directory ‘/tmp/enroot’: Permission denied
    [2025-12-19T09:09:49] error: pyxis: mkdir: cannot create directory ‘/tmp/enroot’: Permission denied
    [2025-12-19T09:09:49] error: pyxis: couldn't start container
    [2025-12-19T09:09:49] error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
    [2025-12-19T09:09:49] error: Failed to invoke spank plugin stack
    [2025-12-19T09:09:49] error: pyxis: child 837 failed with error code: 1
    srun: error: c4m16-0: task 0: Exited with exit code 1
    ```
  - Conclusion: do not add an intermediate sub-folder to the path; place `${UID}/` directly under the base directory (as in `/tmp/${UID}/`)

<br>

---

### Summary:
- Use an emptyDir (a temporary disk directory for the Pod, located on the current node)
  - Mount the host SSD (better supported)
  - Avoid writing to the container filesystem (this reduces layer writes)
    Keeping heavy I/O on the emptyDir avoids polluting the image layers and is usually faster than a network disk.

<br>

---

<br>

## Discussion
### The main purpose of Slurm's Enroot support is to let users run containers on HPC, right?
### `parallel: Error: $XDG_CACHE_HOME can only contain [-a-z0-9_+,.%:/= ].`
- ### Question
  ```
  tj_tsai@asus.com@slurm-login-slinky-77778dd5d6-v9xd6:~$ srun -p cpu-set -C c4m16 --container-image=docker://ubuntu:24.04 --pty bash
  pyxis: importing docker image: docker://ubuntu:24.04
  [2025-12-19T09:55:24] error: pyxis: child 140 failed with error code: 255
  [2025-12-19T09:55:24] error: pyxis: failed to import docker image
  [2025-12-19T09:55:24] error: pyxis: printing enroot log file:
  [2025-12-19T09:55:24] error: pyxis: [INFO] Querying registry for permission grant
  [2025-12-19T09:55:24] error: pyxis: [INFO] Authenticating with user: <anonymous>
  [2025-12-19T09:55:24] error: pyxis: [INFO] Authentication succeeded
  [2025-12-19T09:55:24] error: pyxis: [INFO] Fetching image manifest list
  [2025-12-19T09:55:24] error: pyxis: [INFO] Fetching image manifest
  [2025-12-19T09:55:24] error: pyxis: [INFO] Downloading 1 missing layers...
  [2025-12-19T09:55:24] error: pyxis: parallel: Error: $XDG_CACHE_HOME can only contain [-a-z0-9_+,.%:/= ].
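  [2025-12-19T09:55:24] error: pyxis: couldn't start container
  [2025-12-19T09:55:24] error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
  [2025-12-19T09:55:24] error: Failed to invoke spank plugin stack
  [2025-12-19T09:55:24] error: pyxis: child 224 failed with error code: 1
  srun: error: c4m16-0: task 0: Exited with exit code 1
  ```

- ### us.anthropic.claude-sonnet-4-5-20250929-v1:0 (2025/12/19)

The error message shows that `$XDG_CACHE_HOME` contains a disallowed character (only `[-a-z0-9_+,.%:/= ]` is allowed). This is usually because the path contains the `@` symbol (from your user name `tj_tsai@asus.com`).

### Solutions
There are several ways to handle this:

#### **Option 1: Temporarily unset XDG_CACHE_HOME (recommended)**
Before running `srun`:
```bash
unset XDG_CACHE_HOME
srun -p cpu-set -C c4m16 --container-image=docker://ubuntu:24.04 --pty bash
```
- **Test result**: still fails (with the variable unset, enroot falls back to `~/.cache`, which here lies under `/home/tj_tsai@asus.com` and still contains `@`)

<br>

#### **Option 2: Set it to an allowed path**
```bash
export XDG_CACHE_HOME=/tmp/cache
srun -p cpu-set -C c4m16 --container-image=docker://ubuntu:24.04 --pty bash
```
- **Test result**: succeeds

<br>

#### **Option 3: Clear the environment variable on the srun command line**
```bash
env -u XDG_CACHE_HOME srun -p cpu-set -C c4m16 --container-image=docker://ubuntu:24.04 --pty bash
```
- **Test result**: same as Option 1, still fails

<br>

#### **Option 4: Permanent change (edit the shell startup file)**
Add the following to `~/.bashrc` or `~/.profile`:
```bash
# Use a simple cache path for Slurm/Pyxis.
# The numeric UID is used instead of $USER, because $USER may itself contain '@' (e.g. tj_tsai@asus.com).
export XDG_CACHE_HOME=/tmp/$(id -u)/cache
```
Then reload the settings:
```bash
source ~/.bashrc
```

---

### **Suggested approach**
**The simplest and quickest way** is Option 1 or Option 3, handling the environment variable directly when running `srun`.

If you use containers frequently, Option 4 is suggested: change the startup file permanently so the variable does not have to be handled by hand each time.

- **Note**: per the test results above, Options 1 and 3 still fail in this environment, so use Option 2, or Option 4 with a path that contains no `@`.

<br>

---

<br>

### Correct settings for the data, cache, and config directories
- ### Question
  ```
  # slurm-login-slinky-76bff6fd56-2bs9f:/tmp$
  $ mkdir enroot-data
  $ mkdir enroot-cache
  $ mkdir enroot-config
  $ export ENROOT_DATA_PATH=/tmp/enroot-data
  $ export ENROOT_CACHE_PATH=/tmp/enroot-cache
  $ export ENROOT_CONFIG_PATH=/tmp/enroot-config
  $ mkdir xyz
  $ srun --export=ALL,XDG_CACHE_HOME=/tmp/xyz --container-image=docker://ubuntu:24.04 --container-mounts=/work:/mnt/work --pty bash
  pyxis: importing docker image: docker://ubuntu:24.04
  pyxis: imported docker image: docker://ubuntu:24.04
  groups: cannot find name for group ID 834009
  I have no name!@c2m4-0:/$
  ```
  Even with the environment variables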
  `ENROOT_*` set, why does the container root filesystem still appear under `~/`?
  ```
  $ tree -a
  .
  ├── .bash_history
  ├── .cache
  │   └── motd.legal-displayed
  ├── .local
  │   └── share
  │       └── enroot
  │           └── pyxis_12.0
  │               ├── bin -> usr/bin
  │               ├── boot
  │               ├── dev
  │               ├── etc
  ...
  ```

- ### `us.anthropic.claude-sonnet-4-5-20250929-v1:0` (2025/12/19)
  > codebase: https://github.com/NVIDIA/enroot
  > <!-- http://10.78.26.44:30000/deepwiki/enroot/-/issues/1 -->

### **Problem analysis**
I found the root cause of the problem. The code analysis shows the following.

#### **Core issue**
The `config::export` function at lines 27-29 of `enroot.in`:
```bash
config::export() {
    [ -n "${3-}" ] && export "$1=$2" || export "$1=${!1:-$2}"
}
```

The logic of this function is:
- If the variable is **already set**, its **value is kept**
- If the variable is **not set**, the default value is used
- If a non-empty third argument is passed, the given value is exported unconditionally

#### **Order of execution**
At lines 73-83:
```bash
config::export XDG_DATA_HOME "${HOME:-$(echo ~)}/.local/share"
config::export XDG_CONFIG_HOME "${HOME:-$(echo ~)}/.config"
config::export XDG_CACHE_HOME "${HOME:-$(echo ~)}/.cache"
...
config::export ENROOT_CONFIG_PATH "${XDG_CONFIG_HOME}/enroot"
config::export ENROOT_CACHE_PATH "${XDG_CACHE_HOME}/enroot"
config::export ENROOT_DATA_PATH "${XDG_DATA_HOME}/enroot"
```

**Key points**:
1. ✅ You set `ENROOT_DATA_PATH`, `ENROOT_CACHE_PATH`, `ENROOT_CONFIG_PATH`
2. ❌ But the defaults of these variables **depend on** `XDG_DATA_HOME`, `XDG_CONFIG_HOME`, `XDG_CACHE_HOME`
3. ❌ When enroot runs, it **first sets the XDG_* variables to their default values**
4.
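   ❌ It then uses those XDG_* variables to **recompute** the ENROOT_* paths

> Note: this last step does not follow from `config::export` as quoted above. `${!1:-$2}` keeps an `ENROOT_*` value that is already exported, so the function by itself would not recompute it. What was observed in this environment is only that setting `XDG_DATA_HOME` took effect under `srun`, while setting `ENROOT_DATA_PATH` did not.

### **Solutions**

#### **Option 1: Also set the XDG_* environment variables (recommended)**
```bash
export XDG_DATA_HOME=/tmp/enroot-data-parent
export XDG_CACHE_HOME=/tmp/enroot-cache-parent
export XDG_CONFIG_HOME=/tmp/enroot-config-parent
export ENROOT_DATA_PATH=/tmp/enroot-data
export ENROOT_CACHE_PATH=/tmp/enroot-cache
export ENROOT_CONFIG_PATH=/tmp/enroot-config

# Then run srun.
# Note: --export below overrides XDG_CACHE_HOME with /tmp/xyz for the job.
srun --export=ALL,XDG_CACHE_HOME=/tmp/xyz \
    --container-image=docker://ubuntu:24.04 \
    --container-mounts=/work:/mnt/work \
    --pty bash
```

#### **Option 2: Edit the system configuration file (requires administrator privileges)**
Edit `/etc/enroot/enroot.conf` (or the corresponding installation path), using the same `NAME value` form as the default file shown earlier:
```bash
ENROOT_CONFIG_PATH /tmp/enroot-config
ENROOT_CACHE_PATH /tmp/enroot-cache
ENROOT_DATA_PATH /tmp/enroot-data
```

#### **Option 3: Use a per-user configuration file**
Create the configuration at `~/.config/enroot/enroot.conf`:
```bash
mkdir -p ~/.config/enroot
cat > ~/.config/enroot/enroot.conf << 'EOF'
ENROOT_CONFIG_PATH /tmp/enroot-config
ENROOT_CACHE_PATH /tmp/enroot-cache
ENROOT_DATA_PATH /tmp/enroot-data
EOF
```

### **Why are files created under ~/.local/share?**
Because:
1. `XDG_DATA_HOME` is not set, so it defaults to `~/.local/share`
2. `ENROOT_DATA_PATH` is computed as `${XDG_DATA_HOME}/enroot` = `~/.local/share/enroot`
3. When Pyxis calls enroot, it creates the container under this default path

**Option 1 is recommended**, because it gives full control over all paths, independent of the XDG defaults.

<br>

{%hackmd vaaMgNRPS4KGJDSFG0ZE0w %}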
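
The keep-or-default behavior of `config::export` discussed above can be tried out in isolation. The sketch below is a POSIX-sh re-expression of the quoted bash one-liner, not enroot's own code: `config_export` and `resolve_data_path` are hypothetical names, and `/home/demo` is a made-up home directory. It only reproduces the function's semantics and does not model what Pyxis or `enroot.conf` do to the environment, so it cannot by itself explain why `ENROOT_DATA_PATH` was ignored under `srun`.

```shell
#!/bin/sh
# Hypothetical POSIX-sh equivalent of the quoted config::export.
# $1 = variable name, $2 = default value, $3 = non-empty to force overwrite
config_export() {
    if [ -n "${3-}" ]; then
        export "$1=$2"
    else
        # Read the current value of the variable named by $1, if any.
        eval "current=\${$1-}"
        export "$1=${current:-$2}"
    fi
}

# Resolve the data path in the same order as the quoted lines, for a
# given home directory. The ( ... ) body runs in a subshell, so the
# exports do not leak into the caller.
resolve_data_path() (
    config_export XDG_DATA_HOME "$1/.local/share"
    config_export ENROOT_DATA_PATH "${XDG_DATA_HOME}/enroot"
    printf '%s\n' "${ENROOT_DATA_PATH}"
)

# Case 1: nothing preset, both defaults apply.
( unset XDG_DATA_HOME ENROOT_DATA_PATH; resolve_data_path /home/demo )
# prints /home/demo/.local/share/enroot

# Case 2: only XDG_DATA_HOME preset, the data path follows it.
( unset ENROOT_DATA_PATH; XDG_DATA_HOME=/tmp/xdg; resolve_data_path /home/demo )
# prints /tmp/xdg/enroot

# Case 3: ENROOT_DATA_PATH preset, this function alone keeps it.
( ENROOT_DATA_PATH=/tmp/enroot-data; resolve_data_path /home/demo )
# prints /tmp/enroot-data
```

Case 1 matches the `~/.local/share/enroot` location seen in the `tree` output, and case 2 matches the observation that setting `XDG_DATA_HOME` moves the container storage.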