
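NVIDIA Tesla Driver Installation Notes

Introduction

  • Tesla-series GPUs are typically used in data centers, HPC, and other large-cluster environments.
    NVIDIA maintains separate documentation for Tesla drivers, which follow their own support lifecycle.
  • In production, avoid installing the driver from the CUDA installer package; download it from the NVIDIA driver page instead.
    • This guidance appears in the CUDA release notes (search for "Tesla").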

Check the GPU Model

  • List the NVIDIA devices on the system (10DE is NVIDIA's PCI vendor ID) and confirm that the GPU is a Tesla model
lspci -d 10DE: | grep -i tesla
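
When the same check has to run across many nodes, it can help to turn it into a pass/fail test. The following is only a minimal sketch: `count_tesla` and `has_tesla` are hypothetical helper names, not part of any NVIDIA tooling, and they simply read `lspci` output from stdin.

```shell
# Hypothetical helper: reads `lspci` output on stdin and prints how many
# lines mention "tesla" (case-insensitive).
count_tesla() {
    # grep -c prints 0 but exits 1 when nothing matches; `|| true` keeps
    # the function usable under `set -e`.
    grep -ci 'tesla' || true
}

# Hypothetical helper: succeeds only if at least one Tesla device is listed.
has_tesla() {
    [ "$(count_tesla)" -gt 0 ]
}
```

A possible use is `lspci -d 10DE: | has_tesla || echo "no Tesla GPU found"`.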

Consult the Release Notes

Choose a Tesla Driver Version

  • See the NVIDIA Tesla Driver Lifecycle documentation
  • A Long Term Support Branch (LTSB) is recommended; at the time of writing, that is the R450 series
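
To guard against picking a driver from the wrong branch, a script can derive the branch name from the version string. This is a rough sketch that assumes the branch name is simply "R" followed by the major version number (so 450.119.04 belongs to R450); `driver_branch` and `on_branch` are hypothetical helper names.

```shell
# Hypothetical helper: map a driver version such as 450.119.04 to its
# branch name (R450), assuming branch = "R" + major version number.
driver_branch() {
    printf 'R%s\n' "${1%%.*}"
}

# Hypothetical helper: succeeds if the version belongs to the given branch.
on_branch() {
    [ "$(driver_branch "$1")" = "$2" ]
}
```

For example, `on_branch 450.119.04 R450 || echo "unexpected branch"` stays silent for the version used in these notes.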

Download the Tesla Driver

  • Download link: https://www.nvidia.com/Download/Find.aspx
  • Choosing Chinese (Traditional) as the language yields a download link whose hostname starts with tw.

Place the Driver Installer in a Local Yum Repository

  • When operating a large cluster, set up a local yum repository inside the internal network. It keeps the driver version consistent across the cluster and prevents accidental driver upgrades caused by operator error.
  • The steps below assume a local yum repository already exists; they download the driver rpm and place it in the designated path.
  • On the local yum repository server
# Download the chosen driver version
curl -R -O https://tw.download.nvidia.com/tesla/450.119.04/nvidia-driver-local-repo-rhel7-450.119.04-1.0-1.x86_64.rpm
# Unpack the rpm
rpm2cpio nvidia-driver-local-repo-rhel7-450.119.04-1.0-1.x86_64.rpm | cpio -idv
# Move the unpacked directory to the path served by the repository
mv ./var/nvidia-driver-local-repo-rhel7-450.119.04 nv_rpms_450.119.04
  • Client yum configuration: /etc/yum.repos.d/nvidia-local.repo
[nv-450.119.04]
name=yum repository for nv_rpms_450.119.04
baseurl=http://your_repo_server_ip_and_path/nv_rpms_450.119.04
enabled=1
gpgcheck=0
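
Because the version string appears in the section name, the description, and the base URL, generating the client file from a single variable may reduce copy-and-paste mistakes when the driver is bumped. The sketch below is one possible approach; `render_repo_file` is a hypothetical helper, and `REPO_BASE` reuses the placeholder host from the configuration above.

```shell
NV_VER=450.119.04                               # driver version used in these notes
REPO_BASE=http://your_repo_server_ip_and_path   # placeholder, replace with the real server

# Hypothetical helper: print the client-side .repo definition for one version.
# $1 = driver version, $2 = base URL of the repository server
render_repo_file() {
    ver=$1
    base=$2
    cat <<EOF
[nv-${ver}]
name=yum repository for nv_rpms_${ver}
baseurl=${base}/nv_rpms_${ver}
enabled=1
gpgcheck=0
EOF
}
```

On a client this could be used as `render_repo_file "$NV_VER" "$REPO_BASE" > /etc/yum.repos.d/nvidia-local.repo`.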

Install the Driver

  • Installation steps
# Clear the yum cache
yum clean expire-cache
# Install the driver
yum install -y nvidia-driver-latest-dkms
# If a graphical interface (X Window) is needed, also run this step
yum install cuda-drivers

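Enable the Service and Verify

  • After installing the driver, enabling the nvidia-persistenced service is recommended
# Start nvidia-persistenced
systemctl start nvidia-persistenced
# Start nvidia-persistenced automatically at boot
systemctl enable nvidia-persistenced
# Check the driver version
cat /proc/driver/nvidia/version
# Check that the basic command runs and displays output correctly
nvidia-smi
# If anything looks wrong, a reboot may be needed
systemctl reboot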

ref : https://stackoverflow.com/a/13127714
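
The version check can also be automated so that a node reports a mismatch instead of relying on someone reading the output. This is a minimal sketch, assuming the first dotted number in /proc/driver/nvidia/version is the driver version; `loaded_driver_version` and `check_driver_version` are hypothetical helper names.

```shell
# Hypothetical helper: read the contents of /proc/driver/nvidia/version on
# stdin and print the first token that looks like a driver version.
loaded_driver_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Hypothetical helper: compare the loaded version (read from stdin) with the
# expected one given as $1; returns non-zero on mismatch.
check_driver_version() {
    expected=$1
    actual=$(loaded_driver_version)
    if [ "$actual" = "$expected" ]; then
        echo "OK: driver $actual"
    else
        echo "MISMATCH: expected $expected, got ${actual:-none}"
        return 1
    fi
}
```

A possible invocation is `check_driver_version 450.119.04 < /proc/driver/nvidia/version`. The same version string can also be queried with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.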

Inspect the RPM Scripts

  • Inspect the rpm pre/post scripts to understand what extra actions run when the driver rpms are installed
  • For example:
    • Adds the nvidia-persistenced user (nvidia-persistenced-latest*.rpm)
    • Adds kernel command-line arguments (nvidia-driver-latest-*.rpm)
# Inspect the pre/post scripts of the rpms to be installed
NV_VER=450.119.04
rpm -qp --scripts nvidia-persistenced-latest-dkms-$NV_VER*.rpm
rpm -qp --scripts nvidia-driver-latest-dkms-$NV_VER*.rpm
Inspect the nvidia-persistenced-latest-dkms scripts
  • rpm -qp --scripts nvidia-persistenced-latest-dkms-450.119.04-1.el7.x86_64.rpm
  • output
preinstall scriptlet (using /bin/sh):
getent group nvidia-persistenced >/dev/null || groupadd -r nvidia-persistenced
getent passwd nvidia-persistenced >/dev/null || \
    useradd -r -g nvidia-persistenced -d /var/run/nvidia-persistenced -s /sbin/nologin \
    -c "NVIDIA persistent software state" nvidia-persistenced
exit 0
postinstall scriptlet (using /bin/sh):
if [ $1 -eq 1 ] ; then
    # Initial installation
    systemctl preset nvidia-persistenced.service >/dev/null 2>&1 || :
fi
preuninstall scriptlet (using /bin/sh):
if [ $1 -eq 0 ] ; then
    # Package removal, not upgrade
    systemctl --no-reload disable nvidia-persistenced.service > /dev/null 2>&1 || :
    systemctl stop nvidia-persistenced.service > /dev/null 2>&1 || :
fi
postuninstall scriptlet (using /bin/sh):
systemctl daemon-reload >/dev/null 2>&1 || :
if [ $1 -ge 1 ] ; then
    # Package upgrade, not uninstall
    systemctl try-restart nvidia-persistenced.service >/dev/null 2>&1 || :
fi
Inspect the nvidia-driver-latest-dkms scripts
  • rpm -qp --scripts nvidia-driver-latest-dkms-450.119.04-1.el7.x86_64.rpm
  • output
postinstall scriptlet (using /bin/sh):
/usr/sbin/grubby --update-kernel=ALL --args='nouveau.modeset=0 rd.driver.blacklist=nouveau' &>/dev/null
. /etc/default/grub
if [ -z "${GRUB_CMDLINE_LINUX}" ]; then
  echo GRUB_CMDLINE_LINUX="nouveau.modeset=0 rd.driver.blacklist=nouveau" >> /etc/default/grub
else
  for param in nouveau.modeset=0 rd.driver.blacklist=nouveau; do
    echo ${GRUB_CMDLINE_LINUX} | grep -q $param
    [ $? -eq 1 ] && GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} ${param}"
  done
  sed -i -e "s|^GRUB_CMDLINE_LINUX=.*|GRUB_CMDLINE_LINUX=\"${GRUB_CMDLINE_LINUX}\"|g" /etc/default/grub
fi
if [ "$1" -eq "2" ]; then
  # Remove no longer needed options
  /usr/sbin/grubby --update-kernel=ALL --remove-args='nomodeset gfxpayload=vga=normal' &>/dev/null
  for param in nomodeset gfxpayload=vga=normal; do
    sed -i -e "s|$param ||g" /etc/default/grub
  done
fi || :
preuninstall scriptlet (using /bin/sh):
if [ "$1" -eq "0" ]; then
  /usr/sbin/grubby --update-kernel=ALL --remove-args='nouveau.modeset=0 rd.driver.blacklist=nouveau' &>/dev/null
  for param in nouveau.modeset=0 rd.driver.blacklist=nouveau; do
    sed -i -e "s|$param ||g" /etc/default/grub
  done
fi ||:
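
Since the postinstall scriptlet adds `nouveau.modeset=0` and `rd.driver.blacklist=nouveau` to the kernel arguments, one way to confirm they took effect is to look for them in the command line of the running kernel. The sketch below is only an illustration; `missing_nouveau_args` is a hypothetical helper name, and the arguments show up in /proc/cmdline only after the node has rebooted into the updated boot entry.

```shell
# Hypothetical helper: given a kernel command line as one string ($1), print
# each nouveau-related argument from the postinstall scriptlet that is absent.
missing_nouveau_args() {
    cmdline=$1
    for param in nouveau.modeset=0 rd.driver.blacklist=nouveau; do
        case " $cmdline " in
            *" $param "*) ;;        # present as a whole word, nothing to report
            *) echo "$param" ;;     # absent, report it
        esac
    done
}
```

Running `missing_nouveau_args "$(cat /proc/cmdline)"` prints nothing when both arguments are present.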

Ansible


This work is licensed under a CC BY-NC-SA 4.0