Studying and improving NVIDIA's open-source GPU kernel modules
NVIDIA/open-gpu-kernel-modules
Expected Goal
- Use static analysis tools (such as cppcheck, sparse [i.e. make C=1], and the clang static analyzer) to detect potential defects in the NVIDIA/open-gpu-kernel-modules code, and submit the corresponding pull requests
- Confirm on RTX hardware that the modified driver loads correctly
- Explore possible lock improvements, for example: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/main/src/nvidia/src/kernel/core/locks_common.c
Developing Environment
Module test
Check driver status
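- The nvidia modules currently installed on the system should be under /lib/modules/$(uname -r)/kernel/drivers/video
- modinfo shows which nvidia module the system picks up; the license line it prints tells you which flavor is currently installed
- NVIDIA System Management Interface (nvidia-smi) shows the current GPU state
- Check the current GPU driver version; the version message tells you whether it is the proprietary version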
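The checks above can be gathered into a small helper. This is a sketch rather than the commands originally used in these notes: the function names are made up for illustration, and the module directory is assumed to be the one recorded later in these notes.

```shell
# Hypothetical helpers; the names are illustrative, not part of any NVIDIA tool.

# Classify the nvidia.ko that modinfo resolves to by its license field.
# The open flavor declares a dual MIT/GPL license, the proprietary one "NVIDIA".
nvidia_flavor() {
    license=$(modinfo -F license nvidia 2>/dev/null) || return 1
    case "$license" in
        *MIT/GPL*) echo "open ($license)" ;;
        NVIDIA)    echo "proprietary ($license)" ;;
        *)         echo "unknown ($license)" ;;
    esac
}

# Run the individual status checks in order.
show_driver_status() {
    # Assumed install location of the modules for the running kernel.
    ls "/lib/modules/$(uname -r)/kernel/drivers/video"
    modinfo nvidia | head               # license and version of the installed module
    nvidia-smi                          # GPU state as seen by the loaded driver
    cat /proc/driver/nvidia/version     # version string of the loaded driver
}
```

Calling nvidia_flavor before and after installing a self-built module is a quick way to see whether modinfo now resolves to the open flavor.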
Backup in advance
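If you are worried about overwriting the original GPU driver modules, back them up first (on my machine, the nvidia*.ko files under /lib/modules/5.13.0-44-generic/kernel/drivers/video). These files are the original NVIDIA driver, and a later make modules_install overwrites them.
- Replace the kernel version in the path with the output of uname -r on your system
I personally recommend downloading the .run installer for the original GPU driver version from the official site in advance. When you later switch back to the proprietary version, this file resolves incompatibilities with the gsp.bin firmware or related components.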
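A minimal backup sketch follows. The function name, the default backup directory, and the nvidia*.ko* glob are assumptions for illustration; adjust the source directory if your modules live elsewhere (for example under updates/dkms).

```shell
# Hypothetical helper: copy every nvidia*.ko* file from the module directory
# of the running kernel into a backup directory.
backup_nvidia_modules() {
    src=${1:-/lib/modules/$(uname -r)/kernel/drivers/video}
    dst=${2:-$HOME/nvidia-module-backup}
    mkdir -p "$dst" || return 1
    for f in "$src"/nvidia*.ko*; do
        [ -e "$f" ] || continue          # glob matched nothing
        cp -p "$f" "$dst"/ || return 1   # -p keeps mode and timestamps
    done
}
```

To restore, copy the files back into the module directory as root and run depmod again.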
DOWNLOAD NVIDIA DRIVERS
Before building
Chapter 44. Open Linux Kernel Modules
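The proprietary and open flavors are mutually exclusive: only one of them can be installed on a system at a time. Therefore:
- First remove the nvidia driver from the system with nvidia-uninstall
- Then download the .run file that matches the open GPU kernel module version from DOWNLOAD NVIDIA DRIVERS, and run sh ./NVIDIA-Linux-[...].run --no-kernel-modules to install the matching gsp.bin firmware
During my tests I changed the machine's original proprietary driver to the same version as the open driver, so I ran into fewer compatibility problems afterwards.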
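Since version mismatches were the main source of trouble, a small check can compare the installer version with the installed module. This is only a sketch: both function names are hypothetical, and it assumes the installer keeps the usual NVIDIA-Linux-<arch>-<version>.run file name.

```shell
# Extract the driver version from a .run installer name such as
# NVIDIA-Linux-x86_64-515.48.07.run (prints nothing if the name does not match).
run_file_version() {
    basename "$1" | sed -n 's/^NVIDIA-Linux-[^-]*-\([0-9][0-9.]*\)\.run$/\1/p'
}

# Succeed only if the installer version equals the version of the nvidia.ko
# that modinfo resolves to.
installer_matches_module() {
    v=$(run_file_version "$1")
    [ -n "$v" ] && [ "$v" = "$(modinfo -F version nvidia)" ]
}
```

For example, installer_matches_module ./NVIDIA-Linux-x86_64-515.48.07.run succeeds only when the installed module also reports 515.48.07.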
Build module
First fetch the source from NVIDIA/open-gpu-kernel-modules
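Fetching and building can be sketched as below. The function name and the default checkout directory are assumptions; the make modules invocation follows the repository README.

```shell
# Hypothetical helper: clone the repository if needed, then build the modules.
build_open_modules() {
    repo_dir=${1:-open-gpu-kernel-modules}
    [ -d "$repo_dir" ] ||
        git clone https://github.com/NVIDIA/open-gpu-kernel-modules.git "$repo_dir" ||
        return 1
    # Build all kernel modules in parallel, one job per CPU.
    make -C "$repo_dir" modules -j"$(nproc)"
}
```

The build needs the headers of the running kernel, which make looks up under /lib/modules/$(uname -r)/build by default.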
Warning & Error when building
At first glance most of these cannot be changed; I have not looked into them in depth yet.
Copy module to the destination directory
Root permission is required.
If you encounter a kernel module dependency issue
After running make modules_install, you can run sudo depmod to regenerate the module dependency information.
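The install step and the dependency refresh can be combined as in this sketch. The function name and default directory are assumptions for illustration.

```shell
# Hypothetical helper: install the freshly built modules, then refresh
# the module dependency database.
install_open_modules() {
    repo_dir=${1:-open-gpu-kernel-modules}
    # Writing under /lib/modules requires root.
    sudo make -C "$repo_dir" modules_install -j"$(nproc)" || return 1
    # Rebuild modules.dep so modprobe can resolve the installed modules.
    sudo depmod -a
}
```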
If you encounter an SSL error when installing
SSL error while signing modules #38
How to resolve SSL error during make … modules_install command?
No OpenSSL sign-file signing_key.pem leads to error while loading kernel modules
This means there is no signing key under /lib/modules/$(uname -r)/build/certs. Generate a .pem file and place it in that directory.
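One way to create the missing key pair is sketched below. The function name, key size, validity period, and subject are assumptions, not values from the linked threads; the file names signing_key.pem and signing_key.x509 are the ones the kernel's module signing step looks for in certs/.

```shell
# Hypothetical helper: generate a self-signed module signing key pair in the
# given directory (default: the certs directory of the running kernel's build tree).
generate_signing_key() {
    certs_dir=${1:-/lib/modules/$(uname -r)/build/certs}
    openssl req -new -x509 -newkey rsa:2048 -nodes -days 36500 \
        -subj "/CN=Local module signing key/" \
        -keyout "$certs_dir/signing_key.pem" \
        -outform DER -out "$certs_dir/signing_key.x509"
}
```

Writing into the default directory needs root, so either run this from a root shell or generate the files in a scratch directory and copy them over with sudo.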
Unload current Nvidia module
"Module nvidia is in use" but there are no processes running on the GPU
Switch to Console in Ubuntu 18.04 - How to Leave GUI?
Virtual console
gdm3: X display manager
What is gdm3, kdm, lightdm? How to install and remove them?
The current nvidia module can only be unloaded when no process is using it, so first switch to a tty (without GUI) and stop gdm3.
Unload module
Use lsmod | grep nvidia to confirm that the modules were removed.
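The unload sequence can be sketched as follows, to be run from a text console after leaving the GUI. The function names are hypothetical, and the list of modules is taken from the four modules mentioned in these notes.

```shell
# Succeeds if any nvidia* module is still listed by lsmod.
nvidia_modules_loaded() {
    lsmod | grep -q '^nvidia'
}

# Hypothetical helper: stop the display manager, then remove the modules.
unload_nvidia() {
    sudo systemctl stop gdm3 || return 1
    # Dependents go first: nvidia_drm -> nvidia_modeset -> nvidia, nvidia_uvm -> nvidia.
    for m in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        if lsmod | grep -q "^$m "; then
            sudo rmmod "$m" || return 1
        fi
    done
    # Report success only if nothing nvidia-related is left.
    ! nvidia_modules_loaded
}
```

If rmmod still reports that a module is in use, some process is holding the device open and has to be stopped first.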
Load Nvidia module
Chapter 44. Open Linux Kernel Modules
Most features of the Linux GPU driver are supported with the open flavor of kernel modules, including CUDA, Vulkan, OpenGL, OptiX, and X11. However, in the current release, some display and graphics features (notably: G-SYNC, Quadro Sync, SLI, Stereo, rotation in X11, and YUV 4:2:0 on Turing), as well as power management, and NVIDIA virtual GPU (vGPU), are not yet supported. These features will be added in upcoming driver releases.
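- For GeForce and Workstation GPUs, support is currently alpha-quality and some features are missing, so the module has to be loaded with an extra module parameter (2022/06/07)
- After loading, check whether the driver works properly, for example with nvidia-smi and modinfo nvidia | head
- If loading succeeds, start gdm3 again and switch back to the GUI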
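The load sequence can be sketched as below. The function name is hypothetical; NVreg_OpenRmEnableUnsupportedGpus=1 is the nvidia.ko parameter the repository README documents for GeForce and Workstation GPUs.

```shell
# Hypothetical helper: load the open modules, verify, then bring the GUI back.
load_open_nvidia() {
    # README-documented parameter that enables GeForce/Workstation GPUs.
    sudo modprobe nvidia NVreg_OpenRmEnableUnsupportedGpus=1 || return 1
    # nvidia_drm pulls in nvidia_modeset through its module dependencies.
    sudo modprobe nvidia_drm || return 1
    sudo modprobe nvidia_uvm || return 1
    # Only restart the display manager if the driver answers.
    nvidia-smi || return 1
    sudo systemctl start gdm3
}
```

To avoid typing the parameter every time, the same option can be placed in a configuration file under /etc/modprobe.d/ as an "options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" line.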
If you encounter an unknown symbol issue
If loading the nvidia module with modprobe or insmod fails and dmesg shows unknown symbol messages, the matching gsp.bin firmware needs to be installed; see Before building
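To see which symbols are missing, the kernel log can be filtered as in this sketch. The function name is hypothetical; the pattern assumes the kernel's usual "Unknown symbol <name> (err <n>)" wording.

```shell
# Read kernel log text on stdin and print each unresolved symbol once.
unknown_symbols() {
    sed -n 's/.*Unknown symbol \([^ ]*\) (err .*/\1/p' | sort -u
}
```

Typical use is sudo dmesg | unknown_symbols right after a failed modprobe or insmod.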
If you have a dkms issue
Dynamic Kernel Module Support
dkms status shows the current state of dkms.
My driver was originally version 470 (proprietary) and was registered with dkms, so after I updated the version by hand, dkms status shows the following warning
How do I uninstall dkms modules if there are two of them?
I simply removed the nvidia driver registered for 470.129.06, 5.13.0-44-generic from dkms
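Listing and removing the stale registration can be sketched as below. Both function names are hypothetical, and the parser assumes the "nvidia, <version>, <kernel>, <arch>: installed" status format recorded in these notes; other dkms releases print a slightly different layout.

```shell
# Print "nvidia/<version> <kernel>" for every nvidia entry in dkms status.
dkms_nvidia_entries() {
    dkms status | awk -F'[,:] *' '$1 == "nvidia" { print $1 "/" $2 " " $3 }'
}

# Remove one nvidia registration for one kernel: remove_dkms_nvidia <version> <kernel>
remove_dkms_nvidia() {
    sudo dkms remove "nvidia/$1" -k "$2"
}
```

For the entry mentioned above this would be remove_dkms_nvidia 470.129.06 5.13.0-44-generic.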
If the module does not work after loading
Unable to determine the device handle for GPU 0000:D8:00.0: Not Found #259
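If the module loads successfully but the GUI does not come up after starting gdm3, check whether nvidia-smi works. If it reports the error from the issue above (Unable to determine the device handle for GPU ...: Not Found), the required parameter was not passed when loading the module; see the explanation in Load Nvidia module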
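Supplement: introduction to the nouveau driver
Find out how widely the nouveau driver is used today: whether current Linux systems still use it, or whether they have all switched to the proprietary version released by NVIDIA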
Static Analyzer
Sparse
dev-tools/sparse
sparse’s documentation
Sparse: a look under the hood
Sparse, the semantic parser, provides a compiler frontend capable of parsing most of ANSI C as well as many GCC extensions, and a collection of sample compiler backends, including a static analyzer also called sparse.
Cppcheck
Cppcheck
Cppcheck manual
Cppcheck is a static analysis tool for C/C++ code. It provides unique code analysis to detect bugs and focuses on detecting undefined behaviour and dangerous coding constructs.
Check kernel-open/
The kernel interface layer
No error messages were reported.
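A run like this can be reproduced with the sketch below. The function name and report file name are made up, and --enable=warning is an assumption about the options used; it is chosen because the report for src/ contains findings of severity warning.

```shell
# Hypothetical helper: run cppcheck on a directory, keep the report,
# and print the number of error/warning findings.
run_cppcheck() {
    target=$1
    report=${2:-cppcheck-report.txt}
    # cppcheck prints findings on stderr; --quiet suppresses progress output.
    cppcheck --enable=warning --quiet "$target" 2> "$report"
    # grep -c exits non-zero when the count is 0, which is not an error here.
    grep -cE ': (error|warning): ' "$report" || true
}
```

Running run_cppcheck kernel-open/ from the repository root should print 0 when there is nothing to report.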
Check src/
The OS-agnostic code
The check focuses on src/nvidia/src/, which is the OS-agnostic code for nvidia.ko
How does vmIndex work
In src/nvidia/src/kernel/diagnostics/gpu_acct.c, I assumed vmIndex was an errno-like variable used to report errors, but I could not find any place in the project that modifies it
How does vmIndex work in src/nvidia/src/kernel/diagnostics/gpu_acct.c #313
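After discussing this with NVIDIA developers in the discussion forum, I confirmed that it happens because NVIDIA has not open-sourced the rest of the driver, such as vGPU. The repo therefore contains some random {} blocks like this one, which are there to stay compatible with the remaining proprietary driver code.
Some of the messages are cppcheck false positives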
src/
All error messages
src/nvidia/src/kernel/diagnostics/gpu_acct.c:737:37: error: Uninitialized variable: pLiveDS [uninitvar]
status = gpuacctLookupProcEntry(pLiveDS, searchPid, &pEntry);
^
src/nvidia/src/kernel/diagnostics/gpu_acct.c:737:46: error: Uninitialized variable: searchPid [uninitvar]
status = gpuacctLookupProcEntry(pLiveDS, searchPid, &pEntry);
^
src/nvidia/src/kernel/diagnostics/gpu_acct.c:1057:25: error: Uninitialized variable: pList [uninitvar]
NV_ASSERT_OR_RETURN(pList != NULL, NV_ERR_INVALID_STATE);
^
src/nvidia/src/kernel/gpu/bus/arch/maxwell/kern_bus_gm107.c:2336:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/bus/kern_bus_ctrl.c:78:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/device_ctrl.c:326:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/disp/disp_channel.c:628:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/fifo/arch/maxwell/kernel_channel_gm107.c:408:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/fifo/kernel_channel.c:1923:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/fifo/kernel_fifo_ctrl.c:614:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/gpu.c:745:26: error: syntax error: +] [syntaxError]
char children[ 0 +
^
src/nvidia/src/kernel/gpu/gr/fecs_event_list.c:229:9: error: There is an unknown macro here somewhere. Configuration is required. If NV_ASSERT_OK_OR_ELSE is a macro then please configure it. [unknownMacro]
NV_ASSERT_OK_OR_ELSE(status,
^
src/nvidia/src/kernel/gpu/gr/kernel_graphics.c:2137:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/gsp/kernel_gsp.c:685:67: error: Expression 'nvStatus!=NV_WARN_MORE_PROCESSING_REQUIRED,nvStatus=NV_ERR_GENERIC' depends on order of evaluation of side effects [unknownEvaluationOrder]
NV_ASSERT_OR_ELSE(nvStatus != NV_WARN_MORE_PROCESSING_REQUIRED,
^
src/nvidia/src/kernel/gpu/intr/intr.c:105:13: error: There is an unknown macro here somewhere. Configuration is required. If NV_ASSERT_OK_OR_ELSE is a macro then please configure it. [unknownMacro]
NV_ASSERT_OK_OR_ELSE(status, intrGetPendingStall_HAL(pGpu, pIntr, &pendingEngines, NULL ), return);
^
src/nvidia/src/kernel/gpu/mem_mgr/arch/maxwell/virt_mem_allocator_gm107.c:1138:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/mem_mgr/context_dma.c:310:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/mem_mgr/dma.c:976:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/mem_mgr/mem_ctrl.c:319:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/mem_mgr/mem_desc.c:4061:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/mem_mgr/mem_mgr_ctrl.c:185:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/mem_mgr/method_notification.c:369:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/gpu/mem_mgr/phys_mem_allocator/addrtree.c:1056:42: error: syntax error [syntaxError]
"rangeStart %% pageSize == 0", );
^
src/nvidia/src/kernel/gpu/mem_mgr/phys_mem_allocator/regmap.c:973:42: error: syntax error [syntaxError]
"rangeStart %% pageSize == 0", );
^
src/nvidia/src/kernel/gpu/mmu/kern_gmmu.c:647:5: error: There is an unknown macro here somewhere. Configuration is required. If FOR_EACH_INDEX_IN_MASK_END is a macro then please configure it. [unknownMacro]
FOR_EACH_INDEX_IN_MASK_END
^
src/nvidia/src/kernel/gpu/nvlink/kernel_nvlink.c:121:5: error: There is an unknown macro here somewhere. Configuration is required. If FOR_EACH_INDEX_IN_MASK_END is a macro then please configure it. [unknownMacro]
FOR_EACH_INDEX_IN_MASK_END
^
src/nvidia/src/kernel/mem_mgr/gpu_vaspace.c:2923:5: error: There is an unknown macro here somewhere. Configuration is required. If FOR_EACH_GPU_IN_MASK_UC_END is a macro then please configure it. [unknownMacro]
FOR_EACH_GPU_IN_MASK_UC_END
^
src/nvidia/src/kernel/mem_mgr/pool_alloc.c:668:45: warning: Possible null pointer dereference: pPageHandle [nullPointer]
pPhysicalAddresses[index] = pPageHandle->address;
^
src/nvidia/src/kernel/mem_mgr/pool_alloc.c:666:35: note: Assignment 'pPageHandle=NULL', assigned value is 0
pPageHandle = NULL;
^
src/nvidia/src/kernel/mem_mgr/pool_alloc.c:668:45: note: Null pointer dereference
pPhysicalAddresses[index] = pPageHandle->address;
^
src/nvidia/src/kernel/mem_mgr/pool_alloc.c:689:47: warning: Possible null pointer dereference: pPageHandle [nullPointer]
memdescDescribe(pMemDesc, ADDR_FBMEM, pPageHandle->address, pMemDesc->Size);
^
src/nvidia/src/kernel/mem_mgr/pool_alloc.c:686:27: note: Assignment 'pPageHandle=NULL', assigned value is 0
pPageHandle = NULL;
^
src/nvidia/src/kernel/mem_mgr/pool_alloc.c:689:47: note: Null pointer dereference
memdescDescribe(pMemDesc, ADDR_FBMEM, pPageHandle->address, pMemDesc->Size);
^
src/nvidia/src/kernel/mem_mgr/video_mem.c:437:5: error: There is an unknown macro here somewhere. Configuration is required. If SLI_LOOP_END is a macro then please configure it. [unknownMacro]
SLI_LOOP_END
^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:709:56: warning: Possible null pointer dereference: pGSCI [nullPointer]
portMemCopy(&boardSkuNum, sizeof(boardSkuNum), pGSCI->SKUInfo.projectSKU, sizeof(pGSCI->SKUInfo.projectSKU));
^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:692:34: note: Assignment 'pGSCI=NULL', assigned value is 0
GspStaticConfigInfo *pGSCI = NULL;
^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:709:56: note: Null pointer dereference
portMemCopy(&boardSkuNum, sizeof(boardSkuNum), pGSCI->SKUInfo.projectSKU, sizeof(pGSCI->SKUInfo.projectSKU));
^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:710:58: warning: Possible null pointer dereference: pGSCI [nullPointer]
portMemCopy(&boardProjNum, sizeof(boardProjNum), pGSCI->SKUInfo.project, sizeof(pGSCI->SKUInfo.project));
^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:692:34: note: Assignment 'pGSCI=NULL', assigned value is 0
GspStaticConfigInfo *pGSCI = NULL;
^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:710:58: note: Null pointer dereference
portMemCopy(&boardProjNum, sizeof(boardProjNum), pGSCI->SKUInfo.project, sizeof(pGSCI->SKUInfo.project));
^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:711:21: warning: Possible null pointer dereference: pGSCI [nullPointer]
subVendor = pGSCI->vbiosSubVendor;
^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:692:34: note: Assignment 'pGSCI=NULL', assigned value is 0
GspStaticConfigInfo *pGSCI = NULL;
^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:711:21: note: Null pointer dereference
subVendor = pGSCI->vbiosSubVendor;
^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:712:21: warning: Possible null pointer dereference: pGSCI [nullPointer]
subDevice = pGSCI->vbiosSubDevice;
^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:692:34: note: Assignment 'pGSCI=NULL', assigned value is 0
GspStaticConfigInfo *pGSCI = NULL;
^
src/nvidia/src/kernel/power/gpu_boost_mgr.c:712:21: note: Null pointer dereference
subDevice = pGSCI->vbiosSubDevice;
^
src/nvidia/src/kernel/rmapi/alloc_free.c:753:9: error: Address of local auto-variable assigned to a function parameter. [autoVariables]
pRmAllocParams->pRightsRequired = &rightsRequired;
^
src/nvidia/src/lib/base_utils.c:119:34: error: Syntax Error: AST broken, ternary operator lacks ':'. [internalAstError]
return (bit < numElements*32 ? (NvBool) !!(pBitField[bit/32] & NVBIT(bit%32)) : NV_FALSE);
^
src/nvidia/src/libraries/resserv/src/rs_server.c:2142:0: error: failed to expand 'CLIENT_ENCODEHANDLE_INTERNAL', Wrong number of parameters for macro 'CLIENT_ENCODEHANDLE_INTERNAL'. [preprocessorErrorDirective]
? CLIENT_ENCODEHANDLE_INTERNAL(clientIndex)
^
src/nvidia/src/libraries/tls/tls.c:424:59: error: Uninitialized variable: pThreadEntry [uninitvar]
pThreadEntry = _tlsIsrEntriesFind((NvU64)(NvUPtr)&pThreadEntry);
^
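To get an overview of a report like the one above, the findings can be grouped by check id. This is a sketch with a hypothetical function name; it assumes each finding line ends with the id in square brackets, as in the listing.

```shell
# Print "<check id>: <count>" for every error/warning in a saved cppcheck
# report, most frequent first. Context lines (notes, code, carets) are ignored.
summarize_cppcheck() {
    sed -nE 's/.*: (error|warning): .*\[([A-Za-z_]+)\]$/\2/p' "$1" |
        sort | uniq -c | sort -rn | awk '{ print $2 ": " $1 }'
}
```

In the listing above the bulk of the findings are unknownMacro entries, which mean that cppcheck could not expand a macro rather than that it found a defect in the code.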
Possible lock improvements
TODO
Reference
Terminology
RM stands for "Resource Manager"
https://download.nvidia.com/XFree86/Linux-x86_64/396.51/README/minimumrequirements.html
UVM stands for Unified Virtual Memory
OLD RECORD / MEMO / OBSOLETE
The driver originally installed on the machine
filename: /lib/modules/5.13.0-44-generic/updates/dkms/nvidia.ko
- version: 470.129.06
- NVIDIA driver metapackage from nvidia-driver-470 (proprietary)
- path:
/lib/modules/5.13.0-44-generic/kernel/drivers/video
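However, I switched versions partway through testing
2022/05/28 Attempt to load the open-source GPU driver
I still cannot load the open-source GPU driver. This is what happens:
First I unload the machine's existing nvidia driver (from a tty, since gdm3 is already stopped), then I run make modules and make modules_install. After that:
- Running the following command in /lib/modules/5.13.0-44-generic/kernel/drivers/video fails to load nvidia.ko
- Running it the following way does load a module, but the loaded driver's license is NVIDIA rather than Dual MIT/GPL, which means the machine's original driver was loaded
- Checking the system error messages with sudo insmod nvidia.ko and dmesg shows many unknown symbol entries, which points to a dependency problem. I tried changing the load order, rotating among nvidia, nvidia_modeset, nvidia_drm, and nvidia_uvm, but could not load any of the modules
- After some research I tried to solve it with the following command
- In theory this resets the dependencies among the Linux kernel modules, but starting over I still hit the same problem, so sudo depmod evidently did not solve it. I am stuck here for now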
2022/06/01~02
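- Installing my self-compiled module runs into the following unknown symbol problems
- The machine's original nvidia driver was registered with dkms, so dkms status shows nvidia, 470.129.06, 5.13.0-44-generic, x86_64: installed.
- Rechecking the repo, I found that the following command has to be run to install the matching gsp.bin firmware and user-space NVIDIA GPU driver components:
sh ./NVIDIA-Linux-[...].run --no-kernel-modules
So I looked for the .run file of the matching version on the official DOWNLOAD DRIVERS page and ran the command above (the open-gpu version was 515.48.07 at the time)
- In theory I should now have the symbols matching 515.48.07, but my own module still failed to load.
- Next, nvidia-smi gave the following error message
- Since I was not sure whether updating dkms would fix it, I upgraded the machine's driver directly to the proprietary 515.48.07, i.e. ran the .run file without --no-kernel-modules (not yet sure whether this is OK)
  - The installer asks whether to register with dkms; I chose no for now
  - The idea was to avoid a duplicate registration, or would it have updated the entry for me?
  - Or should I remove the original nvidia modules from dkms directly (this is what I did)
- After rebooting, running dkms status again shows
- Running nvidia-smi
- Running modinfo nvidia | head
- This means that although the version registered in dkms is still the old one, the driver in the system has been updated to the proprietary 515.48.07
- Next I tried loading my own module. It loaded, but the GUI was unusable after starting gdm3. Messages seen during modprobe
- nvidia-smi reported the following problem
- So I switched the machine's driver back to the proprietary 515.48.07 (license: NVIDIA) to keep the machine usable for now
- I have since removed the nvidia, 470.129.06, 5.13.0-44-generic entry from dkms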
How do I uninstall dkms modules if there are two of them?
2022/06/06~07
Searching the GitHub repo for Unable to determine the device handle for GPU 0000:08:00.0: Not Found, I found that someone had already filed an issue about it
Unable to determine the device handle for GPU 0000:D8:00.0: Not Found #259
The maintainer added the problem I ran into to the README in the 515.48.07 commit
Chapter 44. Open Linux Kernel Modules