Background

When training deep learning models, we often use NVIDIA GPUs on a server to accelerate training, and we work inside Docker containers to avoid messing up the server.

However, we can run into many problems when building the environment. One of them is CUDA version compatibility.

Hierarchy

There are three layers of abstraction when using an NVIDIA GPU (CUDA) to train a model:

  • nvidia runtime (libcudart.so, cuda-toolkit, nvcc)
  • nvidia driver
    • nvidia user-mode driver (libcuda.so)
    • nvidia kernel-mode driver (nvidia.ko)

To run a program, the versions of these three layers must be compatible with each other!

Check versions

Runtime

There are two ways to check the CUDA runtime version.

  1. nvcc -V

nvcc is the CUDA compiler driver shipped with the CUDA toolkit. It compiles CUDA source code into code that runs on the GPU, so the release it reports tells us which toolkit version is installed.

In this example, nvcc -V reports runtime version 10.2.89.
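
The version can also be read programmatically. The following is a minimal Python sketch, assuming nvcc prints a line of the form "release X.Y, VX.Y.Z"; the helper names `parse_nvcc_release` and `nvcc_release` are made up for this illustration, and the sample line only reuses the version numbers from this article.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Return (release, full_version) found in `nvcc -V` output, or None."""
    match = re.search(r"release (\d+\.\d+), V(\d+\.\d+\.\d+)", text)
    return match.groups() if match else None


def nvcc_release():
    """Run `nvcc -V` when nvcc is on PATH; return None when it is not installed."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "-V"], capture_output=True, text=True, check=False)
    return parse_nvcc_release(result.stdout)


# Illustrative sample line (assumed format, version numbers from this article).
sample = "Cuda compilation tools, release 10.2, V10.2.89"
print(parse_nvcc_release(sample))  # ('10.2', '10.2.89')
```

On a machine with the toolkit installed, `nvcc_release()` would return the same kind of tuple for the local installation.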

  2. Find the version of libcudart.so

libcudart.so is the CUDA runtime library that programs link against at run time.

Note that libcudart.so is a symbolic link that refers to a file with a specific version.

In this example, it refers to CUDA runtime version 10.2.89.
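
Following the link by hand can be replaced by a small helper. This is a sketch under assumptions: the function name `library_version` is hypothetical, and the demo builds a throwaway directory with an invented link chain (libcudart.so -> libcudart.so.10.2 -> libcudart.so.10.2.89) instead of touching a real CUDA installation.

```python
import os
import re
import tempfile


def library_version(link_path):
    """Follow symlinks from link_path and return the version suffix of the real file."""
    real_name = os.path.basename(os.path.realpath(link_path))
    match = re.search(r"\.so\.(\d+(?:\.\d+)*)$", real_name)
    return match.group(1) if match else None


# Demo with a stand-in directory; the file names mimic the example in this article.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "libcudart.so.10.2.89"), "w").close()
    os.symlink("libcudart.so.10.2.89", os.path.join(demo_dir, "libcudart.so.10.2"))
    os.symlink("libcudart.so.10.2", os.path.join(demo_dir, "libcudart.so"))
    demo_version = library_version(os.path.join(demo_dir, "libcudart.so"))

print(demo_version)  # 10.2.89
```

On a real system you would pass the actual path of libcudart.so, which depends on where the toolkit was installed. The same helper works for any versioned shared library, including libcuda.so.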

Driver

As with the runtime version, there are two ways to check the driver version.

For the driver, however, there are two versions to consider: user-mode and kernel-mode. For now, just assume they are the same.

  1. Find libcuda.so

We are interested in libcuda.so, which is also a symbolic link.

In this example, the link resolves to driver version 470.182.03.

  2. nvidia-smi

nvidia-smi is a command-line utility that is installed together with the NVIDIA driver.

The top row of its output shows Driver Version: 470.182.03.

The simplest version compatibility

The top row of the nvidia-smi output also shows CUDA Version: 11.4 on the right. It tells us that, in the simplest case, the current driver version is meant to be paired with CUDA runtime version 11.4.

Such an exact match sounds strict and inflexible, right? Fortunately, NVIDIA provides several mechanisms that relax the requirement:
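
Both fields can be pulled out of the nvidia-smi header with a pair of regular expressions. This is a minimal sketch: `parse_smi_header` is a hypothetical helper, and the sample row is typed in by hand using the versions from this article, on the assumption that the header uses the labels "Driver Version:" and "CUDA Version:".

```python
import re


def parse_smi_header(text):
    """Extract (driver_version, cuda_version) from nvidia-smi's header row."""
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return (
        driver.group(1) if driver else None,
        cuda.group(1) if cuda else None,
    )


# Hand-written sample row (assumed layout, versions from this article).
header = "| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |"
print(parse_smi_header(header))  # ('470.182.03', '11.4')
```

If only the driver version is needed, `nvidia-smi --query-gpu=driver_version --format=csv,noheader` prints it without the table, which avoids parsing the header at all.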

  • Backward Compatibility
  • Minor Version Compatibility
  • Forward Compatibility

Backward Compatibility

Backward compatibility allows a given driver version to run older CUDA runtime versions.

In the example above, driver version 470.182.03 could only be paired with runtime version 11.4. With backward compatibility, driver version 470.182.03 is also compatible with any runtime version <= 11.4!
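
The rule boils down to a version comparison. The sketch below uses hypothetical helper names (`as_tuple`, `backward_compatible`) and compares only the major and minor components; it models backward compatibility alone and ignores the other two mechanisms.

```python
def as_tuple(version):
    """Turn a dotted version string such as '11.4' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def backward_compatible(driver_cuda_version, runtime_version):
    """True when the runtime is no newer than the CUDA version the driver reports."""
    return as_tuple(runtime_version)[:2] <= as_tuple(driver_cuda_version)[:2]


# Driver 470.182.03 reports CUDA Version 11.4 in this article's example.
print(backward_compatible("11.4", "10.2.89"))  # True
print(backward_compatible("11.4", "11.4"))     # True
print(backward_compatible("11.4", "11.6"))     # False
```

The last case is rejected by this rule alone; whether it can still run is decided by minor version compatibility.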

Minor Version Compatibility

Backward compatibility implies that to use a given CUDA runtime version, we need a driver whose version is new enough.

For CUDA runtime versions before 11, each minor version has its own minimum driver version.

From CUDA 11 onwards, all minor versions within a major version share the same minimum driver version.
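
A per-major-version threshold is easy to express as a lookup table. In the sketch below, the helper names are hypothetical and the minimum Linux driver versions (450.80.02 for CUDA 11.x, 525.60.13 for CUDA 12.x) are my reading of NVIDIA's CUDA Compatibility documentation; treat them as assumptions and check the current table before relying on them.

```python
# Assumed minimum Linux driver per CUDA major version; verify against NVIDIA's table.
MIN_DRIVER = {11: "450.80.02", 12: "525.60.13"}


def as_tuple(version):
    """Turn a dotted version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


def minor_version_compatible(driver_version, runtime_version):
    """True/False for CUDA majors in MIN_DRIVER, None when the major is not covered."""
    minimum = MIN_DRIVER.get(as_tuple(runtime_version)[0])
    if minimum is None:
        return None
    return as_tuple(driver_version) >= as_tuple(minimum)


print(minor_version_compatible("470.182.03", "11.6"))  # True
print(minor_version_compatible("440.33.01", "11.6"))   # False
print(minor_version_compatible("470.182.03", "10.2"))  # None
```

With these assumed thresholds, a 470.182.03 driver passes the check for any 11.x runtime, including ones newer than the 11.4 shown by nvidia-smi. CUDA 10 and earlier return None because they need a per-minor-version table that this sketch does not include.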

Forward Compatibility

This mechanism is mostly intended for data center GPUs (e.g. the Tesla brand), so ordinary users rarely need it.

However, we might encounter an error like Error 804: forward compatibility was attempted on non supported HW when building the environment. So let's take a look at the mechanism and see why this can happen.

For data centers, upgrading the driver is troublesome because it requires rigorous testing, yet they still need a new enough driver to use a newer runtime toolkit.

So NVIDIA provides a more convenient solution: upgrading only the user-mode driver is enough to use a newer runtime toolkit!

In the previous sections, we assumed that the kernel-mode and user-mode drivers have the same version. With forward compatibility, the two driver versions can differ.

Error 804: forward compatibility was attempted on non supported HW

This error occurs when the system detects that the user-mode and kernel-mode driver versions differ on a GPU that does not support forward compatibility.

In the following example, we have kernel-mode driver version 470.182.03.

And the user-mode driver version is 510.108.03.
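
A mismatch like this can be detected by comparing the two versions directly. The sketch below assumes the kernel-mode version is read from /proc/driver/nvidia/version, the file where the NVIDIA kernel module reports itself on Linux; the helper names are hypothetical, and the sample line is abbreviated and hand-written (a real line carries extra build information after the version).

```python
import re


def kernel_module_version(proc_text):
    """Extract the kernel-mode driver version from /proc/driver/nvidia/version text."""
    match = re.search(r"Kernel Module\s+(?:for \S+\s+)?(\d+(?:\.\d+)+)", proc_text)
    return match.group(1) if match else None


def drivers_mismatch(kernel_version, user_version):
    """True when the kernel-mode and user-mode driver versions differ."""
    return kernel_version != user_version


# Abbreviated, hand-written sample using the versions from this article.
proc_sample = "NVRM version: NVIDIA UNIX x86_64 Kernel Module  470.182.03"
kernel_version = kernel_module_version(proc_sample)
user_version = "510.108.03"  # taken from the version suffix of libcuda.so

print(kernel_version)                                  # 470.182.03
print(drivers_mismatch(kernel_version, user_version))  # True
```

On a GPU without forward compatibility support, a `True` result here is the situation that leads to Error 804.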

Why does this happen?

The container image has its own libcuda.so, and the host machine has its own libcuda.so.

When a container starts, nvidia-docker selects the best libcuda.so so that we can use the latest CUDA runtime toolkit. However, pairing that newer user-mode driver with the older kernel-mode driver requires forward compatibility, which is not supported on our GPU, so the error occurs!

How to solve it?

In the previous example, we have kernel-mode driver version 470.182.03 and user-mode driver version 510.108.03.

We only need to change the user-mode driver to the same version as the kernel-mode driver (i.e. both 470.182.03).

Change into the directory where libcuda.so is stored.

Rebuild the symbolic link so that it points to the libcuda.so file whose version matches the kernel-mode driver.
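
The relinking step could look like the sketch below. It is illustrative only: `repoint_symlink` is a hypothetical helper, the link name libcuda.so.1 and the versioned file names are assumptions modelled on this article's versions, and the demo works in a temporary directory rather than a real library directory (where root privileges would normally be required).

```python
import os
import tempfile


def repoint_symlink(link_path, target_name):
    """Replace link_path with a symlink to target_name in the same directory."""
    directory = os.path.dirname(os.path.abspath(link_path))
    if not os.path.exists(os.path.join(directory, target_name)):
        raise FileNotFoundError(f"{target_name} not found in {directory}")
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target_name, tmp_link)
    os.replace(tmp_link, link_path)  # rename over the old link in one step


# Demo: a stand-in directory with two driver versions and a link to the newer one.
with tempfile.TemporaryDirectory() as demo_dir:
    for version in ("470.182.03", "510.108.03"):
        open(os.path.join(demo_dir, f"libcuda.so.{version}"), "w").close()
    link = os.path.join(demo_dir, "libcuda.so.1")
    os.symlink("libcuda.so.510.108.03", link)
    repoint_symlink(link, "libcuda.so.470.182.03")
    new_target = os.readlink(link)

print(new_target)  # libcuda.so.470.182.03
```

Creating the new link under a temporary name and renaming it over the old one means there is no moment at which the link is missing. In a shell, `ln -sf` achieves a similar result.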

And voila

Check nvidia-smi again.

Notice that the CUDA Version has changed from 11.6 to 11.4. We are still able to run programs compiled with CUDA runtime 11.6, thanks to minor version compatibility.

Conclusion

In this article, we introduced the hierarchy of CUDA layers, their version compatibility requirements, and three mechanisms that relax those requirements.
