
Research and Improvement of NVIDIA's Open-Source GPU Kernel Modules

NVIDIA/open-gpu-kernel-modules

Expected Goal

  1. Use static analysis tools (e.g. cppcheck, sparse [i.e. make C=1], clang static-analyzer) to detect potential defects in the NVIDIA/open-gpu-kernel-modules code, and submit the corresponding pull requests
  2. Confirm on RTX hardware that the modified driver loads correctly
  3. Explore possible lock improvements, e.g.: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia/src/kernel/core/locks_common.c

Developing Environment

$ uname -a
Linux 5.13.0-44-generic #49~20.04.1-Ubuntu SMP 
Wed May 18 18:44:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Module test

Check driver status

  • The NVIDIA modules currently installed on the system should be under this directory
/lib/modules/$(uname -r)/kernel/drivers/video
  • Shows the nvidia module the system currently resolves; the printed license field tells you which flavor is installed
modinfo nvidia
  • NVIDIA System Management Interface, used to check the current GPU status
nvidia-smi
  • Check the current GPU driver version; the following output indicates the proprietary version
cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  [...]
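The check above can be wrapped in a small helper. This is only a sketch: `nv_driver_flavor` is a hypothetical name, and it assumes the open flavor's banner contains "Open Kernel Module", as in the sample output shown later under Load Nvidia module.

```shell
# Classify the driver flavor from the first line of /proc/driver/nvidia/version.
# Assumption: the open flavor prints "Open Kernel Module" in its banner and the
# proprietary flavor prints "Kernel Module" without "Open".
nv_driver_flavor() {
    case "$1" in
        *"Open Kernel Module"*) echo open ;;
        *"Kernel Module"*)      echo proprietary ;;
        *)                      echo unknown ;;
    esac
}

# Usage on a machine with the driver loaded:
#   nv_driver_flavor "$(head -n 1 /proc/driver/nvidia/version)"
```

The order of the patterns matters: the more specific "Open Kernel Module" has to be tested first, because the proprietary pattern would also match it.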

Backup in advance

If you are worried about overwriting the original GPU driver modules, back up the following files first; they are the original NVIDIA driver. Running make modules_install later overwrites these files.

  • Replace the kernel version with the output of uname -r
/lib/modules/5.13.0-44-generic/kernel/drivers/video/nvidia-drm.ko
/lib/modules/5.13.0-44-generic/kernel/drivers/video/nvidia-modeset.ko
/lib/modules/5.13.0-44-generic/kernel/drivers/video/nvidia-peermem.ko
/lib/modules/5.13.0-44-generic/kernel/drivers/video/nvidia-uvm.ko
/lib/modules/5.13.0-44-generic/kernel/drivers/video/nvidia.ko
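A minimal sketch of the backup step, assuming the five modules live in the directory listed above. The function names and the default backup directory (`$HOME/nvidia-ko-backup`) are hypothetical choices, not part of the driver tooling.

```shell
# Print the five module paths for a kernel release (default: the running kernel).
nv_module_paths() {
    release="${1:-$(uname -r)}"
    for m in nvidia-drm nvidia-modeset nvidia-peermem nvidia-uvm nvidia; do
        printf '/lib/modules/%s/kernel/drivers/video/%s.ko\n' "$release" "$m"
    done
}

# Copy every module that exists into a backup directory.
# The default destination is a hypothetical location; pass another one as $1.
nv_backup_modules() {
    dest="${1:-$HOME/nvidia-ko-backup}"
    mkdir -p "$dest" || return 1
    nv_module_paths | while IFS= read -r path; do
        if [ -f "$path" ]; then
            cp -v "$path" "$dest/"
        fi
    done
}

# Usage:
#   nv_backup_modules            # back up to $HOME/nvidia-ko-backup
#   nv_backup_modules /tmp/ko    # back up somewhere else
```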

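I recommend downloading the .run file for the original GPU driver version from the official site first, so that when switching back to the proprietary version later it can be used to resolve incompatibilities with the gsp.bin firmware or related components.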

DOWNLOAD NVIDIA DRIVERS


Before building

Chapter 44. Open Linux Kernel Modules

The proprietary and open flavors are mutually exclusive: only one of them can be installed on a system at a time.

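  • First remove the NVIDIA driver from the system with nvidia-uninstall
  • Then download the .run file matching the open GPU kernel module version from DOWNLOAD NVIDIA DRIVERS, and run the following command to install the matching gsp.bin firmware
    sh ./NVIDIA-Linux-[...].run --no-kernel-modules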

During testing I changed the machine's proprietary driver to the same version as the open driver, so I ran into fewer compatibility issues afterwards.


Build module

First fetch the source from NVIDIA/open-gpu-kernel-modules

make modules -j `nproc`

Warning & Error when building

At a glance, most of these cannot be changed on our side; I have not investigated them in depth yet.

/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia/nv-dma.c: In function ‘nv_dma_use_map_resource’:
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia/nv-dma.c:783:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  783 |     const struct dma_map_ops *ops = get_dma_ops(dma_dev->dev);
      |     ^~~~~
In file included from ./include/linux/kernel.h:15,
                 from ./include/linux/list.h:9,
                 from ./include/linux/preempt.h:11,
                 from ./include/linux/spinlock.h:51,
                 from /home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/common/inc/nv-lock.h:29,
                 from /home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/common/inc/nv-linux.h:32,
                 from /home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia/nv-vm.c:26:
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia/nv-vm.c: In function ‘nv_get_max_sysmem_address’:
./include/linux/minmax.h:20:28: warning: comparison of distinct pointer types lacks a cast
   20 |  (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
      |                            ^~
./include/linux/minmax.h:26:4: note: in expansion of macro ‘__typecheck’
   26 |   (__typecheck(x, y) && __no_side_effects(x, y))
      |    ^~~~~~~~~~~
./include/linux/minmax.h:36:24: note: in expansion of macro ‘__safe_cmp’
   36 |  __builtin_choose_expr(__safe_cmp(x, y), \
      |                        ^~~~~~~~~~
./include/linux/minmax.h:52:19: note: in expansion of macro ‘__careful_cmp’
   52 | #define max(x, y) __careful_cmp(x, y, >)
      |                   ^~~~~~~~~~~~~
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia/nv-vm.c:225:26: note: in expansion of macro ‘max’
  225 |         global_max_pfn = max(global_max_pfn, node_end_pfn(node_id));
      |                          ^~~
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-uvm/uvm_pmm_gpu.c: In function ‘uvm_pmm_gpu_alloc_kernel’:
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-uvm/uvm_pmm_gpu.c:645:16: warning: unused variable ‘gpu’ [-Wunused-variable]
  645 |     uvm_gpu_t *gpu = uvm_pmm_to_gpu(pmm);
      |                ^~~
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-drm/nvidia-drm-crtc.c: In function ‘cursor_plane_req_config_update’:
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-drm/nvidia-drm-crtc.c:88:32: warning: unused variable ‘nv_drm_plane_state’ [-Wunused-variable]
   88 |     struct nv_drm_plane_state *nv_drm_plane_state =
      |                                ^~~~~~~~~~~~~~~~~~
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-drm/nvidia-drm-crtc.c:87:27: warning: unused variable ‘nv_dev’ [-Wunused-variable]
   87 |     struct nv_drm_device *nv_dev = to_nv_device(plane->dev);
      |                           ^~~~~~
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-drm/nvidia-drm-crtc.c: In function ‘plane_req_config_update’:
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-drm/nvidia-drm-crtc.c:189:9: warning: unused variable ‘ret’ [-Wunused-variable]
  189 |     int ret = 0;
      |         ^~~
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-drm/nvidia-drm-crtc.c: In function ‘nv_drm_plane_atomic_set_property’:
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-drm/nvidia-drm-crtc.c:504:32: warning: unused variable ‘nv_drm_plane_state’ [-Wunused-variable]
  504 |     struct nv_drm_plane_state *nv_drm_plane_state =
      |                                ^~~~~~~~~~~~~~~~~~
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-drm/nvidia-drm-crtc.c: In function ‘nv_drm_enumerate_crtcs_and_planes’:
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-drm/nvidia-drm-crtc.c:1148:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
 1148 |             struct drm_plane *overlay_plane =
      |             ^~~~~~
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-drm/nvidia-drm-modeset.c: In function ‘__will_generate_flip_event’:
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-drm/nvidia-drm-modeset.c:98:10: warning: unused variable ‘overlay_event’ [-Wunused-variable]
   98 |     bool overlay_event = false;
      |          ^~~~~~~~~~~~~
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-drm/nvidia-drm-modeset.c:97:10: warning: unused variable ‘primary_event’ [-Wunused-variable]
   97 |     bool primary_event = false;
      |          ^~~~~~~~~~~~~
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-drm/nvidia-drm-modeset.c:96:23: warning: unused variable ‘primary_plane’ [-Wunused-variable]
   96 |     struct drm_plane *primary_plane = crtc->primary;
      |                       ^~~~~~~~~~~~~
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-peermem/nvidia-peermem.c: In function ‘nv_mem_client_init’:
/home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-peermem/nvidia-peermem.c:445:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  445 |     int status = 0;
      |     ^~~
Skipping BTF generation for /home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-peermem.ko due to unavailability of vmlinux
Skipping BTF generation for /home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-modeset.ko due to unavailability of vmlinux
Skipping BTF generation for /home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-drm.ko due to unavailability of vmlinux
Skipping BTF generation for /home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia.ko due to unavailability of vmlinux
Skipping BTF generation for /home/ccs100203/Desktop/Coding/open-gpu-kernel-modules/kernel-open/nvidia-uvm.ko due to unavailability of vmlinux
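To get an overview of a log like the one above, the warnings can be grouped by their GCC flag. This is a rough sketch: `summarize_warnings` is a hypothetical helper, and it only counts warnings that print a `[-W...]` tag, so the "comparison of distinct pointer types" warning, which carries no tag in this log, is not included.

```shell
# Count GCC warnings per -W flag in a build log read from stdin.
# Warnings without a trailing [-W...] tag are not counted.
summarize_warnings() {
    grep -o '\[-W[^]]*]' | sort | uniq -c | sort -rn
}

# Usage (build.log is an arbitrary file name):
#   make modules -j "$(nproc)" 2>&1 | tee build.log | summarize_warnings
```

For the log above this would mostly list -Wunused-variable and -Wdeclaration-after-statement, which matches the impression that few distinct kinds of warning are involved.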


Copy module to the destination directory

Root permission is required.

sudo make modules_install -j `nproc`

If you encounter a module dependency issue

Run the following command after make modules_install

sudo depmod

If you encounter an SSL error when installing

SSL error while signing modules #38
How to resolve SSL error during make modules_install command?
No OpenSSL sign-file signing_key.pem leads to error while loading kernel modules

  • If you see the following error message:
At main.c:160:
- SSL error:02001002:system library:fopen:No such file or directory: crypto/bio/bss_file.c:69
- SSL error:2006D080:BIO routines:BIO_new_file:no such file: crypto/bio/bss_file.c:76
sign-file: ./certs/signing_key.pem: No such file or directory

this means there is no signing key under /lib/modules/$(uname -r)/build/certs. Generate the .pem file with the following command and place it in that directory

  • command
openssl req -new -nodes -utf8 -sha512 -days 36500 -batch -x509 \
-config x509.genkey -outform DER -out signing_key.x509 -keyout signing_key.pem
  • x509.genkey file
[ req ]
default_bits = 4096
distinguished_name = req_distinguished_name
prompt = no
string_mask = utf8only
x509_extensions = myexts

[ req_distinguished_name ]
CN = Modules

[ myexts ]
basicConstraints=critical,CA:FALSE
keyUsage=digitalSignature
subjectKeyIdentifier=hash
authorityKeyIdentifier=keyid
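
The two pieces above can be combined into one script. This is a sketch under assumptions: the function names are hypothetical, the key is generated in a working directory of your choice, and the final copy into the certs directory is only printed so the destination can be reviewed first. Copying the .x509 file alongside the .pem is my assumption; the note itself only mentions the .pem file.

```shell
# Write the x509.genkey configuration shown above into the given file.
write_genkey() {
    cat > "$1" <<'EOF'
[ req ]
default_bits = 4096
distinguished_name = req_distinguished_name
prompt = no
string_mask = utf8only
x509_extensions = myexts

[ req_distinguished_name ]
CN = Modules

[ myexts ]
basicConstraints=critical,CA:FALSE
keyUsage=digitalSignature
subjectKeyIdentifier=hash
authorityKeyIdentifier=keyid
EOF
}

# Generate signing_key.pem and signing_key.x509 in a working directory,
# then print (not run) the copy command for the kernel build tree.
make_signing_key() {
    workdir="${1:-.}"
    write_genkey "$workdir/x509.genkey" || return 1
    ( cd "$workdir" && openssl req -new -nodes -utf8 -sha512 -days 36500 -batch -x509 \
        -config x509.genkey -outform DER -out signing_key.x509 -keyout signing_key.pem ) || return 1
    echo "sudo cp $workdir/signing_key.pem $workdir/signing_key.x509 /lib/modules/$(uname -r)/build/certs/"
}

# Usage:
#   make_signing_key /tmp/modsign
```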

Unload current Nvidia module

"Module nvidia is in use" but there are no processes running on the GPU
Switch to Console in Ubuntu 18.04 - How to Leave GUI?
Virtual console

gdm3: X display manager

What is gdm3, kdm, lightdm? How to install and remove them?

A module can only be unloaded when no process is using it, so first switch to a TTY (without GUI) and stop gdm3.

  • switch to TTYn
# switch to TTY3~6
Ctrl+Alt+F3~F6

# get back to the GUI
Ctrl+Alt+F1 # lock screen
Ctrl+Alt+F2 # logged-in user's desktop
  • stop gdm3 (GUI)
sudo service gdm3 stop

Unload module

sudo modprobe -r nvidia_uvm
sudo modprobe -r nvidia_drm
sudo modprobe -r nvidia_modeset
sudo modprobe -r nvidia

Use lsmod | grep nvidia to confirm the modules were removed
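
The unload sequence can be scripted so that it skips modules that are not loaded and stops at the first failure. A minimal sketch, assuming the order used above (dependents first, nvidia last); the function names are hypothetical.

```shell
# Unload order taken from the commands above: dependents first, nvidia last.
nv_unload_order() {
    printf '%s\n' nvidia_uvm nvidia_drm nvidia_modeset nvidia
}

# Remove each module that is currently loaded; stop at the first failure.
# A loaded module has a directory under /sys/module.
nv_unload_all() {
    for mod in $(nv_unload_order); do
        if [ -d "/sys/module/$mod" ]; then
            sudo modprobe -r "$mod" || { echo "failed to unload $mod" >&2; return 1; }
        fi
    done
    # Succeeds only if no nvidia module is left in lsmod.
    ! lsmod | grep -q '^nvidia'
}

# Usage (from a TTY, after stopping gdm3):
#   nv_unload_all && echo "all nvidia modules unloaded"
```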


Load Nvidia module

Chapter 44. Open Linux Kernel Modules
Most features of the Linux GPU driver are supported with the open flavor of kernel modules, including CUDA, Vulkan, OpenGL, OptiX, and X11. However, in the current release, some display and graphics features (notably: G-SYNC, Quadro Sync, SLI, Stereo, rotation in X11, and YUV 4:2:0 on Turing), as well as power management, and NVIDIA virtual GPU (vGPU), are not yet supported. These features will be added in upcoming driver releases.

  • GeForce and Workstation GPUs are currently supported at alpha quality and lack some features, so loading the module on them requires the following command (as of 2022/06/07)
sudo modprobe nvidia NVreg_OpenRmEnableUnsupportedGpus=1
  • For other GPU families, the module can be loaded directly with
sudo modprobe nvidia 
  • After loading, use these commands to check that the driver works
$ nvidia-smi
NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7 [...]

$ modinfo nvidia | grep license
license:        Dual MIT/GPL

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 [...]
  • If loading succeeds, start gdm3 (GUI) again and switch back to the GUI
sudo service gdm3 start
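
Passing the parameter on every modprobe is easy to forget. One option is to record it as a module option in modprobe.d, and to check the license string after loading. This is a sketch: the helper names and the file name `/etc/modprobe.d/nvidia-open.conf` are hypothetical choices; only the parameter name and the license strings come from the outputs in this note.

```shell
# Print the modprobe.d line that applies the parameter on every load.
nv_open_options_line() {
    echo 'options nvidia NVreg_OpenRmEnableUnsupportedGpus=1'
}

# Succeed only if the given license string is the open flavor's license.
nv_is_open_license() {
    [ "$1" = "Dual MIT/GPL" ]
}

# Usage (the .conf file name is a hypothetical choice):
#   nv_open_options_line | sudo tee /etc/modprobe.d/nvidia-open.conf
#   sudo modprobe nvidia
#   nv_is_open_license "$(modinfo -l nvidia)" && echo "open flavor installed"
```

Note that modinfo reads the module file on disk, so the license check tells you which file modprobe would pick, not necessarily what is loaded right now; /proc/driver/nvidia/version reflects the loaded module.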

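If you encounter an unknown symbol issue

If loading the nvidia modules with modprobe or insmod fails and dmesg shows the following messages, in my case the fix was to install the matching gsp.bin firmware and user-space components; see Before building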

[  469.336699] nvidia_modeset: Unknown symbol nvidia_register_module (err -2)
[  469.336742] nvidia_modeset: Unknown symbol nvidia_get_rm_ops (err -2)
[  469.336777] nvidia_modeset: Unknown symbol nvidia_unregister_module (err -2)
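
When dmesg is long, it helps to reduce these lines to a list of "module symbol" pairs. A small sketch; `unknown_symbols` is a hypothetical helper that assumes the message format shown above.

```shell
# Print "module symbol" for every "Unknown symbol" line read from stdin.
# Assumes the format "<module>: Unknown symbol <name> (err N)".
unknown_symbols() {
    awk '/Unknown symbol/ {
        for (i = 2; i < NF; i++) {
            if ($i == "Unknown" && $(i + 1) == "symbol") {
                mod = $(i - 1)
                sub(/:$/, "", mod)      # strip the trailing colon
                print mod, $(i + 2)
            }
        }
    }'
}

# Usage:
#   sudo dmesg | unknown_symbols | sort -u
#   modinfo -F depends nvidia-modeset   # modules it expects to be loaded first
```

The err -2 in these messages is -ENOENT: the kernel could not find the symbol among the modules loaded at that moment.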

If you have a dkms issue

Dynamic Kernel Module Support

dkms status shows the current dkms state

My driver was originally version 470 (proprietary) and registered with dkms, so after manually updating the version, dkms status shows the following warning

nvidia, 470.129.06, 5.13.0-44-generic, x86_64: installed 
(WARNING! Diff between built and installed module!) 

How do I uninstall dkms modules if there are two of them?

I removed the nvidia driver registered for 470.129.06, 5.13.0-44-generic directly from dkms

sudo dkms remove nvidia/470.129.06 -k 5.13.0-44-generic
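
The remove command can be derived from the dkms status line instead of being typed by hand. A sketch only: `dkms_remove_cmds` is a hypothetical helper that assumes the comma-separated status format shown above (other dkms versions may print a different layout), and it prints the commands rather than running them.

```shell
# Turn "module, version, kernel, arch: state" lines into dkms remove commands.
# Lines without at least three ", "-separated fields (e.g. the WARNING line) are skipped.
dkms_remove_cmds() {
    awk -F', ' 'NF >= 3 { print "sudo dkms remove " $1 "/" $2 " -k " $3 }'
}

# Usage:
#   dkms status | grep '^nvidia' | dkms_remove_cmds
```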

If the module does not work after loading

Unable to determine the device handle for GPU 0000:D8:00.0: Not Found #259

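If the module loads but the GUI does not come up after starting gdm3, check whether nvidia-smi works. The following error means the required parameter was not passed when loading the module; see the explanation under Load Nvidia module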

Unable to determine the device handle for GPU 0000:08:00.0: Not Found

TODO: add an introduction to the nouveau driver
Check how widely nouveau is used today: whether Linux systems still use it, or have mostly switched to NVIDIA's proprietary driver


Static Analyzer

Sparse

dev-tools/sparse
sparse’s documentation
Sparse: a look under the hood

  • What is Sparse

Sparse, the semantic parser, provides a compiler frontend capable of parsing most of ANSI C as well as many GCC extensions, and a collection of sample compiler backends, including a static analyzer also called sparse.

Cppcheck

Cppcheck
Cppcheck manual

  • What is Cppcheck

Cppcheck is a static analysis tool for C/C++ code. It provides unique code analysis to detect bugs and focuses on detecting undefined behaviour and dangerous coding constructs.

  • Cppcheck version
$ cppcheck --version 
Cppcheck 1.90

Check kernel-open/

The kernel interface layer

No error messages

cppcheck kernel-open/ --force

Check src/

The OS-agnostic code

The check focuses on src/nvidia/src/, the OS-agnostic code for nvidia.ko

cppcheck src/nvidia/src/ --force > cpp_out 2> cpp_error

How does vmIndex work

In src/nvidia/src/kernel/diagnostics/gpu_acct.c, I took vmIndex to be an errno-like variable used to report errors, but I could not find anywhere in the project that modifies it

How does vmIndex work in src/nvidia/src/kernel/diagnostics/gpu_acct.c #313

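After discussing with NVIDIA developers in the discussion forum, it was confirmed that NVIDIA has not open-sourced its other drivers, such as vGPU, so this repo contains some random {} blocks like this one; they exist to stay compatible with the other proprietary drivers.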

Some messages are cppcheck false positives

  • e.g. Syntax Error: AST broken
    cppcheck 1.90 AST broken issue
  • In src/nvidia/src/kernel/gpu/gpu.c, gpu/gpu_child_list.h consists of many definitions that expand according to how GPU_CHILD is defined, and together they sum up the number of instances. So this is also a cppcheck false positive
struct ChildList {
    char children[ 0 +
        #include "gpu/gpu_child_list.h"
    ];
};
All error messages for src/
src/nvidia/src/kernel/diagnostics/gpu_acct.c:737:37: error: Uninitialized variable: pLiveDS [uninitvar]
    status = gpuacctLookupProcEntry(pLiveDS, searchPid, &pEntry);
                                    ^
src/nvidia/src/kernel/diagnostics/gpu_acct.c:737:46: error: Uninitialized variable: searchPid [uninitvar]
    status = gpuacctLookupProcEntry(pLiveDS, searchPid, &pEntry);
                                             ^
src/nvidia/src/kernel/diagnostics/gpu_acct.c:1057:25: error: Uninitialized variable: pList [uninitvar]
    NV_ASSERT_OR_RETURN(pList != NULL, NV_ERR_INVALID_STATE);
                        ^
src/nvidia/src/kernel/gpu/bus/arch/maxwell/kern_bus_gm107.c:2336:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/bus/kern_bus_ctrl.c:78:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/device_ctrl.c:326:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/disp/disp_channel.c:628:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/fifo/arch/maxwell/kernel_channel_gm107.c:408:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/fifo/kernel_channel.c:1923:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/fifo/kernel_fifo_ctrl.c:614:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/gpu.c:745:26: error: syntax error: +] [syntaxError]
        char children[ 0 +
                         ^
src/nvidia/src/kernel/gpu/gr/fecs_event_list.c:229:9: error: There is an unknown macro here somewhere. Configuration is required. If NV_ASSERT_OK_OR_ELSE is a macro then please configure it. [unknownMacro]
        NV_ASSERT_OK_OR_ELSE(status,
        ^
src/nvidia/src/kernel/gpu/gr/kernel_graphics.c:2137:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/gsp/kernel_gsp.c:685:67: error: Expression 'nvStatus!=NV_WARN_MORE_PROCESSING_REQUIRED,nvStatus=NV_ERR_GENERIC' depends on order of evaluation of side effects [unknownEvaluationOrder]
    NV_ASSERT_OR_ELSE(nvStatus != NV_WARN_MORE_PROCESSING_REQUIRED,
                                                                  ^
src/nvidia/src/kernel/gpu/intr/intr.c:105:13: error: There is an unknown macro here somewhere. Configuration is required. If NV_ASSERT_OK_OR_ELSE is a macro then please configure it. [unknownMacro]
            NV_ASSERT_OK_OR_ELSE(status, intrGetPendingStall_HAL(pGpu, pIntr, &pendingEngines, NULL /* threadstate */), return);
            ^
src/nvidia/src/kernel/gpu/mem_mgr/arch/maxwell/virt_mem_allocator_gm107.c:1138:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/mem_mgr/context_dma.c:310:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/mem_mgr/dma.c:976:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/mem_mgr/mem_ctrl.c:319:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/mem_mgr/mem_desc.c:4061:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/mem_mgr/mem_mgr_ctrl.c:185:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/mem_mgr/method_notification.c:369:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/gpu/mem_mgr/phys_mem_allocator/addrtree.c:1056:42: error: syntax error [syntaxError]
            "rangeStart %% pageSize == 0", /*do nothing*/);
                                         ^
src/nvidia/src/kernel/gpu/mem_mgr/phys_mem_allocator/regmap.c:973:42: error: syntax error [syntaxError]
            "rangeStart %% pageSize == 0", /*do nothing*/);
                                         ^
src/nvidia/src/kernel/gpu/mmu/kern_gmmu.c:647:5: error: There is an unknown macro here somewhere. Configuration is required. If FOR_EACH_INDEX_IN_MASK_END is a macro then please configure it. [unknownMacro]
    FOR_EACH_INDEX_IN_MASK_END
    ^
src/nvidia/src/kernel/gpu/nvlink/kernel_nvlink.c:121:5: error: There is an unknown macro here somewhere. Configuration is required. If FOR_EACH_INDEX_IN_MASK_END is a macro then please configure it. [unknownMacro]
    FOR_EACH_INDEX_IN_MASK_END
    ^
src/nvidia/src/kernel/mem_mgr/gpu_vaspace.c:2923:5: error: There is an unknown macro here somewhere. Configuration is required. If FOR_EACH_GPU_IN_MASK_UC_END is a macro then please configure it. [unknownMacro]
    FOR_EACH_GPU_IN_MASK_UC_END
    ^
src/nvidia/src/kernel/mem_mgr/pool_alloc.c:668:45: warning: Possible null pointer dereference: pPageHandle [nullPointer]
                pPhysicalAddresses[index] = pPageHandle->address;
                                            ^
src/nvidia/src/kernel/mem_mgr/pool_alloc.c:666:35: note: Assignment 'pPageHandle=NULL', assigned value is 0
                    pPageHandle = NULL;
                                  ^
src/nvidia/src/kernel/mem_mgr/pool_alloc.c:668:45: note: Null pointer dereference
                pPhysicalAddresses[index] = pPageHandle->address;
                                            ^
src/nvidia/src/kernel/mem_mgr/pool_alloc.c:689:47: warning: Possible null pointer dereference: pPageHandle [nullPointer]
        memdescDescribe(pMemDesc, ADDR_FBMEM, pPageHandle->address, pMemDesc->Size);
                                              ^
src/nvidia/src/kernel/mem_mgr/pool_alloc.c:686:27: note: Assignment 'pPageHandle=NULL', assigned value is 0
            pPageHandle = NULL;
                          ^
src/nvidia/src/kernel/mem_mgr/pool_alloc.c:689:47: note: Null pointer dereference
        memdescDescribe(pMemDesc, ADDR_FBMEM, pPageHandle->address, pMemDesc->Size);
                                              ^
src/nvidia/src/kernel/mem_mgr/video_mem.c:437:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
    SLI_LOOP_END
    ^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:709:56: warning: Possible null pointer dereference: pGSCI [nullPointer]
        portMemCopy(&boardSkuNum, sizeof(boardSkuNum), pGSCI->SKUInfo.projectSKU, sizeof(pGSCI->SKUInfo.projectSKU));
                                                       ^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:692:34: note: Assignment 'pGSCI=NULL', assigned value is 0
    GspStaticConfigInfo *pGSCI = NULL;
                                 ^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:709:56: note: Null pointer dereference
        portMemCopy(&boardSkuNum, sizeof(boardSkuNum), pGSCI->SKUInfo.projectSKU, sizeof(pGSCI->SKUInfo.projectSKU));
                                                       ^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:710:58: warning: Possible null pointer dereference: pGSCI [nullPointer]
        portMemCopy(&boardProjNum, sizeof(boardProjNum), pGSCI->SKUInfo.project, sizeof(pGSCI->SKUInfo.project));
                                                         ^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:692:34: note: Assignment 'pGSCI=NULL', assigned value is 0
    GspStaticConfigInfo *pGSCI = NULL;
                                 ^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:710:58: note: Null pointer dereference
        portMemCopy(&boardProjNum, sizeof(boardProjNum), pGSCI->SKUInfo.project, sizeof(pGSCI->SKUInfo.project));
                                                         ^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:711:21: warning: Possible null pointer dereference: pGSCI [nullPointer]
        subVendor = pGSCI->vbiosSubVendor;
                    ^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:692:34: note: Assignment 'pGSCI=NULL', assigned value is 0
    GspStaticConfigInfo *pGSCI = NULL;
                                 ^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:711:21: note: Null pointer dereference
        subVendor = pGSCI->vbiosSubVendor;
                    ^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:712:21: warning: Possible null pointer dereference: pGSCI [nullPointer]
        subDevice = pGSCI->vbiosSubDevice;
                    ^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:692:34: note: Assignment 'pGSCI=NULL', assigned value is 0
    GspStaticConfigInfo *pGSCI = NULL;
                                 ^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:712:21: note: Null pointer dereference
        subDevice = pGSCI->vbiosSubDevice;
                    ^
src/nvidia/src/kernel/rmapi/alloc_free.c:753:9: error: Address of local auto-variable assigned to a function parameter. [autoVariables]
        pRmAllocParams->pRightsRequired = &rightsRequired;
        ^
src/nvidia/src/lib/base_utils.c:119:34: error: Syntax Error: AST broken, ternary operator lacks ':'. [internalAstError]
    return (bit < numElements*32 ? (NvBool) !!(pBitField[bit/32] & NVBIT(bit%32)) : NV_FALSE);
                                 ^
src/nvidia/src/libraries/resserv/src/rs_server.c:2142:0: error: failed to expand 'CLIENT_ENCODEHANDLE_INTERNAL', Wrong number of parameters for macro 'CLIENT_ENCODEHANDLE_INTERNAL'. [preprocessorErrorDirective]
            ? CLIENT_ENCODEHANDLE_INTERNAL(clientIndex)
^
src/nvidia/src/libraries/tls/tls.c:424:59: error: Uninitialized variable: pThreadEntry [uninitvar]
        pThreadEntry = _tlsIsrEntriesFind((NvU64)(NvUPtr)&pThreadEntry);
                                                          ^
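
Most entries in the listing above repeat the same few check ids, so a per-id count makes triage easier. A rough sketch: `summarize_cppcheck` is a hypothetical helper that assumes cppcheck's default text output as captured in cpp_error, where each finding ends with its id in square brackets.

```shell
# Count cppcheck findings per check id from its stderr output on stdin.
# Only lines with a severity (error:/warning:) are counted; "note:" lines are skipped.
summarize_cppcheck() {
    grep -E ': (error|warning): ' | grep -o '\[[A-Za-z]*]$' | sort | uniq -c | sort -rn
}

# Usage, with the file written by the cppcheck command above:
#   summarize_cppcheck < cpp_error
```

In the listing above, unknownMacro dominates; those entries reflect macros cppcheck could not expand rather than defects, which leaves the uninitvar, nullPointer and autoVariables findings as the ones worth reading first.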


Possible lock improvements

TODO


Reference

Terminology

Ask questions about the codebase here #148

RM stands for "Resource Manager"

https://download.nvidia.com/XFree86/Linux-x86_64/396.51/README/minimumrequirements.html

UVM stands for Unified Virtual Memory


OLD RECORD / MEMO / OBSOLETE

The driver originally on the machine

  • filename: /lib/modules/5.13.0-44-generic/updates/dkms/nvidia.ko
  • version: 470.129.06
  • NVIDIA driver metapackage from nvidia-driver-470 (proprietary)
  • path: /lib/modules/5.13.0-44-generic/kernel/drivers/video

But I switched versions partway through testing


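2022/05/28 Attempt to load the open-source GPU driver

Still unable to load the open-source GPU driver; here is what happened:

First I unloaded the machine's existing nvidia driver (working from a TTY, since gdm3 was already stopped), then ran make modules and make modules_install

  • Running the following commands in /lib/modules/5.13.0-44-generic/kernel/drivers/video shows that nvidia.ko cannot be loaded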
$ ls
backlight  fbdev  nvidia-drm.ko  nvidia.ko  nvidia-modeset.ko  nvidia-peermem.ko  nvidia-uvm.ko  vgastate.ko
$ sudo modprobe nvidia.ko 
modprobe: FATAL: Module nvidia.ko not found in directory /lib/modules/5.13.0-44-generic
  • Running it as below does load a module, but the loaded driver's license is NVIDIA rather than Dual MIT/GPL, which means the machine's original driver was loaded
$ sudo modprobe nvidia
$ modinfo -l nvidia
NVIDIA
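  • Using sudo insmod nvidia.ko and dmesg to look at the errors shows many unknown symbols, which indicates a dependency problem. I tried changing the load order among nvidia, nvidia_modeset, nvidia_drm and nvidia_uvm, but could not load any of the modules
  • After some research I tried to solve it with the following command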
sudo depmod
  • In theory this rebuilds the kernel module dependency information, but starting over gave the same problem, so sudo depmod clearly did not solve it; I am stuck here for now
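
Two details explain part of the behavior above. modprobe takes a module name and looks it up in the dependency index built by depmod, so modprobe nvidia.ko fails even when the file is in the current directory; insmod takes a file path and does not resolve dependencies. The sketch below uses a hypothetical helper to derive the name from a file. As for the original driver being loaded instead of the new one, a likely explanation (an assumption, not verified on this machine) is that the original module lived under updates/dkms and depmod's search order preferred that directory; modinfo -n shows which file wins.

```shell
# Convert a module file name or path into the name modprobe expects.
module_name() {
    basename "$1" .ko
}

# Usage:
#   sudo modprobe "$(module_name nvidia.ko)"   # same as: sudo modprobe nvidia
#   modinfo -n nvidia                          # prints the file modprobe would load
```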

2022/06/01~02

  • Installing my self-built modules hits the following unknown symbol errors
[  469.336699] nvidia_modeset: Unknown symbol nvidia_register_module (err -2)
[  469.336742] nvidia_modeset: Unknown symbol nvidia_get_rm_ops (err -2)
[  469.336777] nvidia_modeset: Unknown symbol nvidia_unregister_module (err -2)
  • The machine's original nvidia driver is registered with dkms, so dkms status shows nvidia, 470.129.06, 5.13.0-44-generic, x86_64: installed

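  • Re-reading the repo, I found that the following command is needed to install the matching gsp.bin firmware and user-space NVIDIA GPU driver components
    sh ./NVIDIA-Linux-[...].run --no-kernel-modules
    So I looked up the .run file for the matching version on the official DOWNLOAD DRIVERS page and ran the command above (open-gpu was at version 515.48.07 at the time)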

  • In theory I should now have had symbols matching 515.48.07, but my own modules still failed to load.

  • Then nvidia-smi gave the following error

    Failed to initialize NVML: Driver/library version mismatch
    
  • Since I was not sure whether updating dkms would fix it, I upgraded the machine's driver directly to the proprietary 515.48.07 version, i.e. ran the .run file without --no-kernel-modules (not yet sure whether this is OK)

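    • The installer asks whether to register with dkms; I chose no at first
    • I wanted to avoid registering twice, or would it update the registration for me?
    • Or should I remove the original nvidia modules from dkms directly (this is what I did)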
  • After rebooting, dkms status shows

    nvidia, 470.129.06, 5.13.0-44-generic, x86_64: installed
    (WARNING! Diff between built and installed module!)
    
  • Running nvidia-smi

    NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7
    
  • Running modinfo nvidia | head

    filename:       /lib/modules/5.13.0-44-generic/kernel/drivers/video/nvidia.ko
    firmware:       nvidia/515.48.07/gsp.bin
    alias:          char-major-195-*
    version:        515.48.07
    supported:      external
    license:        NVIDIA
    
  • So although dkms still has the old version registered, the driver on the system has been updated to the proprietary 515.48.07 version

  • Next I tried loading my own modules and it succeeded, but the GUI did not work after starting gdm3. The message seen during modprobe:

    NVRM cpuidInfoAMD: Unrecognized AMD processor in cpuidInfoAMD
    
  • nvidia-smi reports the following problem

    Unable to determine the device handle for GPU 0000:08:00.0: Not Found
    
  • So I switched the machine back to the proprietary 515.48.07 driver (license: NVIDIA) to keep it usable for now

  • The dkms entry nvidia, 470.129.06, 5.13.0-44-generic has now been removed

    How do I uninstall dkms modules if there are two of them?
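
Several of the problems in this log come down to versions that do not agree, such as the "Driver/library version mismatch" error above. A quick sanity check is to compare the version of the loaded module with the one installed on disk. This is a sketch: `nv_version_status` is a hypothetical helper, and reading /sys/module/nvidia/version assumes the module exports its version there.

```shell
# Compare the version of the loaded module ($1) with the one installed on disk ($2).
nv_version_status() {
    loaded="$1"
    installed="$2"
    if [ -z "$loaded" ]; then
        echo "not loaded"
    elif [ "$loaded" = "$installed" ]; then
        echo "match $loaded"
    else
        echo "mismatch loaded=$loaded installed=$installed"
    fi
}

# Usage:
#   nv_version_status "$(cat /sys/module/nvidia/version 2>/dev/null)" \
#                     "$(modinfo -F version nvidia)"
```

A mismatch here suggests that an older module is still loaded and needs to be unloaded (or the machine rebooted) before the newly installed one takes effect.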


2022/06/06~07

Searching the GitHub repo for Unable to determine the device handle for GPU 0000:08:00.0: Not Found shows that someone had already filed an issue for it

Unable to determine the device handle for GPU 0000:D8:00.0: Not Found #259

The maintainers added the problem I hit to the README in the 515.48.07 commit

Chapter 44. Open Linux Kernel Modules

  • Fix: pass an extra parameter when loading
modprobe nvidia NVreg_OpenRmEnableUnsupportedGpus=1