
# Harvester GPU Passthrough Setup

## Nested virtualization setup

* Skip this step if Harvester runs on a physical machine.
* Configure vIOMMU
  ![image](https://hackmd.io/_uploads/H1oGHBVVkg.png)
* On PVE, attach the PCI device to the Harvester VM
  ![image](https://hackmd.io/_uploads/SkQH-7PNyg.png)

## Environment check

* Check the environment on the Harvester node: the GPU is visible, IOMMU is enabled on the kernel command line, the device is bound to `vfio-pci`, and it has an IOMMU group.

```
$ lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2805 (rev a1)

$ cat /proc/cmdline
BOOT_IMAGE=(loop0)/boot/vmlinuz console=tty1 root=LABEL=COS_STATE cos-img/filename=/cOS/active.img panic=0 net.ifnames=1 rd.cos.oemlabel=COS_OEM rd.cos.mount=LABEL=COS_OEM:/oem rd.cos.mount=LABEL=COS_PERSISTENT:/usr/local rd.cos.oemtimeout=120 audit=1 audit_backlog_limit=8192 intel_iommu=on amd_iommu=on iommu=pt

$ lspci -vvs 01:00.0
01:00.0 VGA compatible controller: NVIDIA Corporation Device 2805 (rev a1) (prog-if 00 [VGA controller])
    Subsystem: Micro-Star International Co., Ltd. [MSI] Device 5174
    Physical Slot: 0
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 16
    Region 0: Memory at f9000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at 383800000000 (64-bit, prefetchable) [size=16G]
    Region 3: Memory at 383c00000000 (64-bit, prefetchable) [size=32M]
    Region 5: I/O ports at 5000 [size=128]
    Expansion ROM at fa000000 [disabled] [size=512K]
    Capabilities: [60] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000 Data: 0000
    Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
        DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <16us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Via message
            AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
            AtomicOpsCtl: ReqEn-
        LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
            EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
    Capabilities: [b4] Vendor Specific Information: Len=14 <?>
    Capabilities: [100 v1] Virtual Channel
        Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb: Fixed- WRR32- WRR64- WRR128-
        Ctrl: ArbSelect=Fixed
        Status: InProgress-
        VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
            Status: NegoPending- InProgress-
    Capabilities: [250 v1] Latency Tolerance Reporting
        Max snoop latency: 0ns
        Max no snoop latency: 0ns
    Capabilities: [128 v1] Power Budgeting <?>
    Capabilities: [420 v2] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr- BadTLP+ BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
            MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
        HeaderLog: 00000000 00000000 00000000 00000000
    Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [bb0 v1] #15
    Kernel driver in use: vfio-pci

$ find /sys/kernel/iommu_groups/ -type l | grep "0000:01:00.0"
/sys/kernel/iommu_groups/11/devices/0000:01:00.0
```

* Verify that the NVIDIA device sits in an IOMMU group of its own. It must not share the group with any other device.

```
$ vim iommu.sh
#!/bin/bash
# List every IOMMU group and the PCI devices it contains.
shopt -s nullglob
for g in $(find /sys/kernel/iommu_groups/* -maxdepth 0 -type d | sort -V); do
    echo "IOMMU Group ${g##*/}:"
    for d in "$g"/devices/*; do
        echo -e "\t$(lspci -nns "${d##*/}")"
    done
done

$ bash iommu.sh
......
IOMMU Group 11:
        01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2805] (rev a1)
```

## Enable pcidevices-controller

![image](https://hackmd.io/_uploads/HJJNS7r4ke.png)

## Advanced > PCI Devices

* Enable passthrough for the NVIDIA device
  ![image](https://hackmd.io/_uploads/SJDFGQwNke.png)

```
$ kubectl get pcideviceclaim
NAME              ADDRESS        NODE NAME   USER NAME   KERNEL DRIVER TO UNBIND   PASSTHROUGH ENABLED
hvx-1-000001000   0000:01:00.0   hvx-1       admin                                 true
```

## Create an Ubuntu VM

* Pin the VM to the node that holds the GPU and select the device you just enabled
  ![image](https://hackmd.io/_uploads/rJo0G7D4yl.png)
  ![image](https://hackmd.io/_uploads/rJclJw4VJe.png)

## Install the NVIDIA driver on Ubuntu

* Log in to the Ubuntu VM and check that the NVIDIA device is visible

```
$ lspci | grep -i nvidia
0a:00.0 VGA compatible controller: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti 16GB] (rev a1)
```

* Install the driver

```
$ wget https://tw.download.nvidia.com/XFree86/Linux-x86_64/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run
$ apt update
$ apt install gcc make
$ chmod +x NVIDIA-Linux-x86_64-550.54.14.run   # the downloaded file is not executable by default
$ ./NVIDIA-Linux-x86_64-550.54.14.run
```

* If `nvidia-smi` still fails after the installation, the following steps may fix it
    * Reference: https://medium.com/@yt.chen/nvidia-smi-%E9%80%A3%E4%B8%8D%E5%88%B0-driver-%E7%9A%84%E8%87%AA%E6%95%91%E6%96%B9%E6%B3%95-69cbed16171d

```
$ ls /usr/src | grep nvidia
nvidia-550.54.14
$ apt-get install dkms
$ dkms install -m nvidia -v 550.54.14
$ apt-get install linux-headers-6.8.0-49-generic
$ apt upgrade
$ reboot
```

* After the reboot, `nvidia-smi` shows the GPU

```
$ nvidia-smi
Wed Dec 11 14:36:56 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:0A:00.0 Off |                  N/A |
| 33%   34C    P0             29W /  165W |       0MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

![image](https://hackmd.io/_uploads/BJdsXXPVkx.png)

## Troubleshooting

* If the following error appears when the VM starts, other devices in IOMMU group 9 are not bound to `vfio-pci`, so the VM cannot start. This is why the passed-through NVIDIA device must be in an IOMMU group of its own.

```
{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"ubuntu","namespace":"default","pos":"server.go:202","reason":"virError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2024-12-09T14:01:18.841359Z qemu-system-x86_64: -device {\"driver\":\"vfio-pci\",\"host\":\"0000:06:10.0\",\"id\":\"ua-hostdevice-hvx-1-000006100\",\"bus\":\"pci.11\",\"addr\":\"0x1\"}: vfio 0000:06:10.0: group 9 is not viable\nPlease ensure all devices within the iommu_group are bound to their vfio bus driver.')","timestamp":"2024-12-09T14:01:19.043300Z","uid":"aff1fd41-f342-4ae7-8bd1-5fc3ab47226b"}
```

* If the VM stays unscheduled with the messages below, re-enable the pcidevices-controller feature

```
$ kubectl get vm ubuntu -oyaml | grep -i message
    message: virt-launcher pod has not yet been scheduled
    message: 'failed to render launch manifest: HostDevice nvidia.com/AD106_GEFORCE_RTX_4060_TI_16GB
```

```
virt-launcher pod has not yet been scheduled when pci passthrough is enable for gpu
```

* Reference: https://github.com/harvester/harvester/issues/4160#issuecomment-2450515323

## References

* https://github.com/harvester/harvester/issues/3833#issuecomment-1524900503
* https://docs.harvesterhci.io/v1.4/advanced/vgpusupport/
* https://gist.github.com/kralicky/0f9994526eac7ddc1808bcbfea6a8444
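
A supplementary sketch for the "group is not viable" error in the troubleshooting section: before attaching a device, it can help to list every device that shares its IOMMU group and see which driver each one is bound to. The function name `check_iommu_group` and the `SYSFS_ROOT` variable are made up for this note (they are not part of Harvester or the kernel), and the rule applied here is a simplification: a device with no driver is treated as acceptable, while any driver other than `vfio-pci` is flagged, even though the kernel may tolerate some of them.

```shell
#!/bin/bash
# Sketch: show the driver of every device in the IOMMU group of one PCI
# address and return non-zero if any of them is bound to something other
# than vfio-pci.
# SYSFS_ROOT is a hypothetical override used only to point the function at
# a fake tree for testing; on a real node it defaults to /sys.

check_iommu_group() {
    local addr="$1"                          # full address, e.g. 0000:01:00.0
    local root="${SYSFS_ROOT:-/sys}"
    local group_link="$root/bus/pci/devices/$addr/iommu_group"

    if [ ! -e "$group_link" ]; then
        echo "no IOMMU group found for $addr" >&2
        return 2
    fi

    local rc=0 d name driver
    for d in "$group_link"/devices/*; do
        [ -e "$d" ] || continue
        name="${d##*/}"
        if [ -e "$d/driver" ]; then
            # The driver entry is a symlink; its target's last path
            # component is the driver name.
            driver="$(basename "$(readlink -f "$d/driver")")"
        else
            driver="none"
        fi
        echo "$name driver=$driver"
        # Simplified rule: unbound is fine, anything else must be vfio-pci.
        if [ "$driver" != "none" ] && [ "$driver" != "vfio-pci" ]; then
            rc=1
        fi
    done
    return "$rc"
}

# Run only when an address is passed on the command line.
if [ "$#" -gt 0 ]; then
    check_iommu_group "$1"
fi
```

Saved under a name of your choice and run as `bash <script> 0000:06:10.0` on the Harvester node, it prints one line per device in the group. A non-zero exit status means at least one device is still held by another driver, which matches the situation described in the error above.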