
Ubuntu 20.04 + Nvidia Software 470

Ubuntu 20.04 Driver Failure

The NVIDIA out-of-box driver fails to load on Ubuntu 20.04 when the system has NVIDIA GPGPUs.

Workaround Link

Applying the workaround results in a halt during boot-up.


Enable virtualization setting

Confirm that the following BIOS settings are enabled on your server platform (Dell 750XA server):
VT-D/IOMMU
SR-IOV
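
Once the BIOS settings are enabled, the running kernel normally exposes IOMMU groups under /sys/kernel/iommu_groups. The sketch below counts them; a count of zero suggests the IOMMU is not active. The function name `iommu_group_count` and its optional root argument are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: count IOMMU groups under a sysfs directory.
# The first argument defaults to /sys/kernel/iommu_groups.
iommu_group_count() {
    local root="${1:-/sys/kernel/iommu_groups}"
    local n=0
    local d
    for d in "$root"/*/; do
        # An unmatched glob stays literal and fails the -d test.
        [ -d "$d" ] && n=$((n + 1))
    done
    echo "$n"
}
```

Running `iommu_group_count` with no argument on the host prints the number of groups found.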


Disable nouveau driver

Ref Link

sudo vi /etc/modprobe.d/blacklist-nouveau.conf

blacklist nouveau
options nouveau modeset=0

sudo update-initramfs -u
sudo reboot
lsmod | grep nouveau
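
After the reboot, `lsmod | grep nouveau` should print nothing. As a minimal sketch, the same check can be written so that it reports a clear verdict; the function names `nouveau_loaded` and `report_nouveau` are hypothetical, and both read `lsmod`-style text on standard input.

```shell
# Hypothetical helper: succeeds (exit 0) only if a module named exactly
# "nouveau" appears in the first column of the lsmod-style text on stdin.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Hypothetical helper: prints a one-line verdict for the text on stdin.
report_nouveau() {
    if nouveau_loaded; then
        echo "nouveau is still loaded"
    else
        echo "nouveau is not loaded"
    fi
}
```

On the host this would be used as `lsmod | report_nouveau`. Matching on the first column avoids a false positive from modules that merely list nouveau as a user.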

Prerequisite: run the Ansible playbook

HKFLAIR GIT


Install Nvidia Software & Assign to KVM

Ref Link
Install the software package downloaded from NVIDIA; it is located in the host driver folder of the download.

chmod +x ./nvidia-vgpu-ubuntu-470.141.05_amd64.deb
sudo apt install ./nvidia-vgpu-ubuntu-470.141.05_amd64.deb

Check the status and PCI info after installation.

lspci | grep NVIDIA

(KVMHOST1)
virsh nodedev-list --cap pci | grep 65_00_0
virsh nodedev-list --cap pci | grep ca_00_0
virsh nodedev-list --cap pci | grep 17_00_0
virsh nodedev-list --cap pci | grep e3_00_0
virsh nodedev-dumpxml pci_0000_65_00_0 | egrep 'domain|bus|slot|function'
virsh nodedev-dumpxml pci_0000_ca_00_0 | egrep 'domain|bus|slot|function'
virsh nodedev-dumpxml pci_0000_17_00_0 | egrep 'domain|bus|slot|function'
virsh nodedev-dumpxml pci_0000_e3_00_0 | egrep 'domain|bus|slot|function'
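
The libvirt node device names used above follow the pattern `pci_<domain>_<bus>_<slot>_<function>`. The sketch below derives that name from a PCI address; `pci_to_nodedev` is a hypothetical helper, and it assumes PCI domain 0000 when the address is given without a domain, as in `lspci` output.

```shell
# Hypothetical helper: turn a PCI address such as 0000:65:00.0 (or 65:00.0)
# into the libvirt node device name pci_0000_65_00_0.
pci_to_nodedev() {
    local addr="$1"
    case "$addr" in
        *:*:*.*) ;;                      # domain already present
        *:*.*)   addr="0000:$addr" ;;    # assumption: default domain 0000
        *) echo "unrecognized PCI address: $addr" >&2; return 1 ;;
    esac
    # Replace ':' and '.' with '_' and add the "pci_" prefix.
    printf 'pci_%s\n' "$(printf '%s' "$addr" | tr ':.' '__')"
}
```

The result can be passed to `virsh nodedev-dumpxml`, for example `virsh nodedev-dumpxml "$(pci_to_nodedev 65:00.0)"`.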

Enable virtual functions and check the available vGPU instances

Here domain, bus, slot and function are the PCI address values of the GPU, as reported in the previous step.

/usr/lib/nvidia/sriov-manage -e domain:bus:slot.function
ls -l /sys/bus/pci/devices/domain\:bus\:slot.function/ | grep virtfn

/usr/lib/nvidia/sriov-manage -e 0000:41:00.0
ls -l /sys/bus/pci/devices/0000\:41\:00.0/ | grep virtfn
cd /sys/class/mdev_bus/0000\:41\:00.4/mdev_supported_types
grep -l "A100" nvidia-*/name
cat nvidia-692/available_instances

(KVMHOST1)
/usr/lib/nvidia/sriov-manage -e 00:17:00.0
/usr/lib/nvidia/sriov-manage -e 00:65:00.0
/usr/lib/nvidia/sriov-manage -e 00:ca:00.0
/usr/lib/nvidia/sriov-manage -e 00:e3:00.0
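
Reading `available_instances` one virtual function at a time gets tedious with several GPUs. As a rough sketch, the loop below prints the value for every device under the mdev bus directory for one vGPU type. The function name `list_available_instances` and its optional root argument are hypothetical; the sysfs layout is the one used in the commands above.

```shell
# Hypothetical helper: for each device under an mdev_bus directory, print
# "<device> <available_instances>" for the given vGPU type.
# Usage: list_available_instances TYPE [ROOT]
list_available_instances() {
    local type="$1"
    local root="${2:-/sys/class/mdev_bus}"
    local f
    local dev
    for f in "$root"/*/mdev_supported_types/"$type"/available_instances; do
        [ -r "$f" ] || continue       # skip an unmatched glob
        dev="${f#"$root"/}"           # strip the root prefix
        dev="${dev%%/*}"              # keep only the device directory name
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}
```

For example, `list_available_instances nvidia-692` lists every virtual function that supports that type together with its remaining instance count.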

Assign UUID to vGPU profile

uuidgen
echo "uuid" > nvidia-700/create
mdevctl define --auto --uuid uuid

(KVMHost1 GB)
root@kvmhost1:/sys/class/mdev_bus/0000:ca:02.0/mdev_supported_types# cat nvidia-700/description
num_heads=1, frl_config=60, framebuffer=**20480M**, max_resolution=4096x2160, max_instance=3

![](https://i.imgur.com/2qrokbx.png)

Check vGPU creation

ls -l /sys/bus/mdev/devices/
mdevctl list
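
With many vGPUs defined, a per-GPU summary is easier to read than the raw list. The sketch below assumes the `mdevctl list` line format that appears in this note (UUID, parent device, type, state) and groups the entries by PCI domain and bus. The function name `count_vgpus_per_bus` is hypothetical.

```shell
# Hypothetical helper: read `mdevctl list`-style lines on stdin and print
# "<domain>:<bus> <count>" for each parent GPU, sorted by address.
count_vgpus_per_bus() {
    awk 'NF >= 3 {
             split($2, a, ":")       # a[1]=domain, a[2]=bus, a[3]=slot.function
             count[a[1] ":" a[2]]++
         }
         END { for (k in count) print k, count[k] }' | sort
}
```

On the host this would be used as `mdevctl list | count_vgpus_per_bus`.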

Run custom script sriov-manage provided by NVIDIA vGPU software and list out vGPU

/usr/lib/nvidia/sriov-manage -e 65:00.0
/usr/lib/nvidia/sriov-manage -e ca:00.0
ls -l /sys/bus/pci/devices/0000\:65\:00.0/ | grep virtfn
ls -l /sys/bus/pci/devices/0000\:ca\:00.0/ | grep virtfn
cd /sys/class/mdev_bus/0000\:65\:00.4/mdev_supported_types/
grep -l "A100" nvidia-*/name

Generate a correctly formatted universally unique identifier (UUID) for the vGPU and check the result

uuidgen
echo "4f28ef3b-6710-449d-9df0-84cf3ca29308" > nvidia-700/create
mdevctl define --auto --uuid 4f28ef3b-6710-449d-9df0-84cf3ca29308
ls -l /sys/bus/mdev/devices/
mdevctl list
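
Because the UUID is pasted by hand into both the `create` file and the `mdevctl` command, a quick format check before use can catch copy errors. This is a minimal sketch; `is_uuid` is a hypothetical helper that only checks the 8-4-4-4-12 hexadecimal layout, not whether the UUID is already in use.

```shell
# Hypothetical helper: succeeds only if the argument has the canonical
# 8-4-4-4-12 hexadecimal UUID layout.
is_uuid() {
    printf '%s\n' "$1" | grep -Eq \
        '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
}
```

For example, `is_uuid "$(uuidgen)" && echo ok` should print `ok`, while a truncated value is rejected.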

(KVMHost1)
:::info
root@kvmhost1:~# uuidgen
root@kvmhost1:/sys/bus/pci/devices/0000:65:00.5/mdev_supported_types# mdevctl list
888fd5a6-8628-4790-8eb3-3a58a75e3ed7 0000:65:00.7 nvidia-696 (defined)
5d8f8ef6-af21-43d2-9293-16412ed10553 0000:e3:00.6 nvidia-696 (defined)
d95ac5a7-f869-4547-bb87-9b66b138c9bb 0000:ca:00.6 nvidia-696 (defined)
c061fb82-6cd7-4d31-a603-e926ae97534e 0000:65:00.6 nvidia-696 (defined)
406ba918-3eae-42c5-8dac-54213d58a790 0000:17:00.5 nvidia-696 (defined)
c369486d-99ae-4900-824f-4ebf711fca97 0000:ca:00.5 nvidia-696 (defined)
3b654341-6b4a-46c0-abae-c42b01f60518 0000:ca:00.4 nvidia-696 (defined)
85f634ee-7ebf-4195-be59-a2aacaccba85 0000:65:00.4 nvidia-696 (defined)
1ea90a44-1c43-4d42-9f4f-3838a069acdd 0000:e3:00.7 nvidia-696 (defined)
73360179-570e-4b3b-943f-c0a63e0d3c1d 0000:e3:00.4 nvidia-696 (defined)
dc5c4ca7-3a3e-4db7-86dd-b7971a2246b9 0000:17:00.4 nvidia-696 (defined)
24a697fe-8f51-4e38-b1bb-7c1e62ce660b 0000:ca:00.7 nvidia-696 (defined)
4ae7ac78-03e1-4ee8-9b52-2ad3c68c9156 0000:65:00.5 nvidia-696 (defined)
83e024ea-58ea-4b09-880d-09752524ed2e 0000:17:00.7 nvidia-696 (defined)
d881f3d5-2d2e-4098-b126-8ae296b1a1e0 0000:17:00.6 nvidia-696 (defined)
b377287e-100a-498c-9669-7bc74a2434cb 0000:e3:00.5 nvidia-696 (defined)

root@kvmhost1:~#
:::

(KVMHost2)
:::info
/sys/class/mdev_bus/0000:65:00.4/mdev_supported_types/nvidia-696
4f28ef3b-6710-449d-9df0-84cf3ca29308
/sys/class/mdev_bus/0000:65:00.5/mdev_supported_types/nvidia-696
44cc85a1-5f56-4f82-b71d-4a36050f2942
/sys/class/mdev_bus/0000:65:00.6/mdev_supported_types/nvidia-696
77e0cffb-b685-4496-a4be-12ffa30ffcfb
/sys/class/mdev_bus/0000:65:00.7/mdev_supported_types/nvidia-696
b06ebd67-f9eb-4ab3-b62d-f5f3762b9011
:::
Add vGPU to VM Guest

virsh shutdown test-1
virsh edit test-1

In the editor, add the following element inside the <devices> section:
  <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
    <source>
      <address uuid='4f28ef3b-6710-449d-9df0-84cf3ca29308'/>
    </source>
  </hostdev>
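
When several guests each need their own vGPU, the same element can be generated from a UUID instead of being typed by hand. A minimal sketch follows; `hostdev_xml` is a hypothetical helper that only prints the element shown above with the given UUID filled in.

```shell
# Hypothetical helper: print the mdev <hostdev> element for one vGPU UUID.
hostdev_xml() {
    cat <<EOF
<hostdev mode='subsystem' type='mdev' model='vfio-pci'>
  <source>
    <address uuid='$1'/>
  </source>
</hostdev>
EOF
}
```

The output can be pasted into `virsh edit`, or saved to a file such as `vgpu.xml` (a hypothetical name) and, as one possible alternative, attached with `virsh attach-device test-1 vgpu.xml --config`.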

Debug for support

nvidia-bug-report.sh