Enabling GPU-Accelerated TensorFlow on Windows

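Using the native Windows environment

Applicable environments:

  • Python for Windows
    • Python 3.8.x (tested with 3.8.6 and 3.8.7)
    • Python 3.9.0 and 3.9.1 are not yet supported by tensorflow; the tensorflow installation cannot complete on them
    • Python virtual environments created with venv, the module built into Python for Windows
    • Python virtual environments created by anything other than Anaconda for Windows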
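The constraints above can be checked from inside the interpreter before anything is installed. This is a minimal sketch, assuming the 3.8.x restriction reported in this note still applies to the tensorflow release you intend to install; the helper names are illustrative and not part of any library, and the conda check is only a heuristic.

```python
import os
import sys


def is_supported_python(version_info=sys.version_info):
    """Return True for Python 3.8.x, the only series this note reports as tested."""
    return tuple(version_info[:2]) == (3, 8)


def in_venv():
    """Return True when running inside a virtual environment created by venv."""
    # venv points sys.prefix at the environment and keeps the real
    # installation in sys.base_prefix.
    return sys.prefix != sys.base_prefix


def in_conda():
    """Heuristic: conda environments contain a conda-meta directory under sys.prefix."""
    return os.path.isdir(os.path.join(sys.prefix, "conda-meta"))


if __name__ == "__main__":
    print("Python 3.8.x:", is_supported_python())
    print("venv active: ", in_venv())
    print("conda env:   ", in_conda())
```

Run it with the interpreter of the environment you plan to use; a conda environment should follow the Anaconda section instead.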
  • Installation steps
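  1. Install the Microsoft Visual C++ development tools and runtime

    • Runtime: "Microsoft Visual C++ 2015-2019 Redistributable" (included with Visual Studio 2019)
    • Development tools: Microsoft Visual Studio (the Community edition is sufficient)
      The CUDA Toolkit ships a C++ extension for Visual Studio. When installing Visual Studio, select the "Desktop development with C++" workload.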

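    During installation, select **Desktop development with C++**

    If you plan to develop C++ programs with the CUDA Toolkit, do not install only "Build Tools for Visual Studio" (MSBuild). A standalone MSBuild is installed to a different path from the MSBuild bundled with Visual C++, and merging the two afterwards is difficult if you are not familiar with them.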

  2. Install the tensorflow Python package
    In the environment where you need it, install with pip install.

    ​​​​pip install tensorflow
    

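    Anaconda and Miniconda users should not install with pip install; use conda install instead.
    See the section "Using the Anaconda for Windows environment".

    Notes on tensorflow 2.x:

    1. Keras has been integrated into tensorflow; at the time of writing, the original standalone keras is no longer developed or supported (no bug fixes).
    2. The CPU and GPU builds have been merged, so installing tensorflow alone is enough; tensorflow-gpu is not needed.
    3. The order of this step does not matter: install now, or skip it and install when you verify the setup.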
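Since tensorflow 2.x makes tensorflow-gpu and the standalone keras unnecessary, it can help to see which of the three distributions are actually present in the environment. A minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the function names are made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("tensorflow", "tensorflow-gpu", "keras")):
    """Map each distribution name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name:15} {version or 'not installed'}")
```

For a tensorflow 2.x setup as described here, only the tensorflow row would be expected to show a version.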
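  3. Install the NVIDIA display driver
    Download and install the GPU driver for your NVIDIA GPU model.
    URL: https://www.nvidia.com.tw/Download/index.aspx?lang=tw

    The NVIDIA CUDA Toolkit installer already bundles a reasonably recent, compatible NVIDIA display driver, so on Windows this step is usually optional. (You can decide at the end whether to install it, or install a newer version later.)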

CUDA Toolkit and Compatible Driver Versions

| CUDA Toolkit | Linux x86_64 driver version | Windows x86_64 driver version |
|---|---|---|
| CUDA 11.2.0 GA | >= 460.27.04 | >= 460.89 |
| CUDA 11.1.1 Update 1 | >= 455.32 | >= 456.81 |
| CUDA 11.1 GA | >= 455.23 | >= 456.38 |
| CUDA 11.0.3 Update 1 | >= 450.51.06 | >= 451.82 |
| CUDA 11.0.2 GA | >= 450.51.05 | >= 451.48 |
| CUDA 11.0.1 RC | >= 450.36.06 | >= 451.22 |
| CUDA 10.2.89 | >= 440.33 | >= 441.22 |
| CUDA 10.1 (10.1.105 general release, and updates) | >= 418.39 | >= 418.96 |
| CUDA 10.0.130 | >= 410.48 | >= 411.31 |
| CUDA 9.2 (9.2.148 Update 1) | >= 396.37 | >= 398.26 |
| CUDA 9.2 (9.2.88) | >= 396.26 | >= 397.44 |
| CUDA 9.1 (9.1.85) | >= 390.46 | >= 391.29 |
| CUDA 9.0 (9.0.76) | >= 384.81 | >= 385.54 |
| CUDA 8.0 (8.0.61 GA2) | >= 375.26 | >= 376.51 |
| CUDA 8.0 (8.0.44) | >= 367.48 | >= 369.30 |
| CUDA 7.5 (7.5.16) | >= 352.31 | >= 353.66 |
| CUDA 7.0 (7.0.28) | >= 346.46 | >= 347.62 |
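The table above can be turned into a quick compatibility check. The sketch below copies only the Windows column for the newer releases and compares version strings numerically; the dictionary keys and function names are chosen for this example.

```python
# Minimum Windows x86_64 driver per CUDA release, copied from the table above.
MIN_WINDOWS_DRIVER = {
    "11.2.0": "460.89",
    "11.1.1": "456.81",
    "11.1": "456.38",
    "11.0.3": "451.82",
    "11.0.2": "451.48",
    "10.2.89": "441.22",
    "10.1": "418.96",
    "10.0.130": "411.31",
}


def parse(version):
    """Turn '460.89' into (460, 89) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_is_sufficient(cuda_release, installed_driver):
    """Return True if the installed driver meets the minimum for the CUDA release."""
    minimum = MIN_WINDOWS_DRIVER[cuda_release]  # KeyError for releases not listed
    return parse(installed_driver) >= parse(minimum)
```

On a machine where the driver is already installed, `nvidia-smi --query-gpu=driver_version --format=csv,noheader` prints a version string that can be passed as installed_driver.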
  4. Install the NVIDIA CUDA Toolkit

    Version pairing for the TensorFlow GPU builds on Windows

    | TensorFlow version | Python version | Compiler | Build tools | cuDNN | CUDA |
    |---|---|---|---|---|---|
    | tensorflow_gpu-2.4.0 | 3.6-3.8 | MSVC 2019 | Bazel 3.1.0 | 8.0 | 11.0 |
    | tensorflow_gpu-2.3.0 | 3.5-3.8 | MSVC 2019 | Bazel 3.1.0 | 7.6 | 10.1 |
    | tensorflow_gpu-2.2.0 | 3.5-3.8 | MSVC 2019 | Bazel 2.0.0 | 7.6 | 10.1 |
    | tensorflow_gpu-2.1.0 | 3.5-3.7 | MSVC 2019 | Bazel 0.27.1-0.29.1 | 7.6 | 10.1 |
    | tensorflow_gpu-2.0.0 | 3.5-3.7 | MSVC 2017 | Bazel 0.26.1 | 7.4 | 10 |
    | tensorflow_gpu-1.15.0 | 3.5-3.7 | MSVC 2017 | Bazel 0.26.1 | 7.4 | 10 |
    | tensorflow_gpu-1.14.0 | 3.5-3.7 | MSVC 2017 | Bazel 0.24.1-0.25.2 | 7.4 | 10 |
    | tensorflow_gpu-1.13.0 | 3.5-3.7 | MSVC 2015 update 3 | Bazel 0.19.0-0.21.0 | 7.4 | 10 |
    | tensorflow_gpu-1.12.0 | 3.5-3.6 | MSVC 2015 update 3 | Bazel 0.15.0 | 7.2 | 9.0 |
    | tensorflow_gpu-1.11.0 | 3.5-3.6 | MSVC 2015 update 3 | Bazel 0.15.0 | 7 | 9 |
    | tensorflow_gpu-1.10.0 | 3.5-3.6 | MSVC 2015 update 3 | Cmake v3.6.3 | 7 | 9 |
    | tensorflow_gpu-1.9.0 | 3.5-3.6 | MSVC 2015 update 3 | Cmake v3.6.3 | 7 | 9 |
    | tensorflow_gpu-1.8.0 | 3.5-3.6 | MSVC 2015 update 3 | Cmake v3.6.3 | 7 | 9 |
    | tensorflow_gpu-1.7.0 | 3.5-3.6 | MSVC 2015 update 3 | Cmake v3.6.3 | 7 | 9 |
    | tensorflow_gpu-1.6.0 | 3.5-3.6 | MSVC 2015 update 3 | Cmake v3.6.3 | 7 | 9 |
    | tensorflow_gpu-1.5.0 | 3.5-3.6 | MSVC 2015 update 3 | Cmake v3.6.3 | 7 | 9 |
    | tensorflow_gpu-1.4.0 | 3.5-3.6 | MSVC 2015 update 3 | Cmake v3.6.3 | 6 | 8 |
    | tensorflow_gpu-1.3.0 | 3.5-3.6 | MSVC 2015 update 3 | Cmake v3.6.3 | 6 | 8 |
    | tensorflow_gpu-1.2.0 | 3.5-3.6 | MSVC 2015 update 3 | Cmake v3.6.3 | 5.1 | 8 |
    | tensorflow_gpu-1.1.0 | 3.5 | MSVC 2015 update 3 | Cmake v3.6.3 | 5.1 | 8 |
    | tensorflow_gpu-1.0.0 | 3.5 | MSVC 2015 update 3 | Cmake v3.6.3 | 5.1 | 8 |
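To decide which CUDA Toolkit and cuDNN to download, the pairing table above can be consulted programmatically. A small sketch covering only the most recent rows; it assumes that patch releases (for example 2.3.1) share the requirements of the x.y.0 row, which the table itself does not state.

```python
# (cuDNN, CUDA) per TensorFlow release line, copied from the table above.
REQUIREMENTS = {
    "2.4": ("8.0", "11.0"),
    "2.3": ("7.6", "10.1"),
    "2.2": ("7.6", "10.1"),
    "2.1": ("7.6", "10.1"),
    "2.0": ("7.4", "10"),
    "1.15": ("7.4", "10"),
}


def required_cuda_stack(tf_version):
    """Return (cuDNN, CUDA) for a TensorFlow version, or None if it is not listed."""
    # Assumption: only major.minor matters, so '2.4.0' and '2.4.1' map to '2.4'.
    major_minor = ".".join(tf_version.split(".")[:2])
    return REQUIREMENTS.get(major_minor)


if __name__ == "__main__":
    print(required_cuda_stack("2.4.0"))
```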
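  5. Install the NVIDIA cuDNN SDK
    Download URL: https://developer.nvidia.com/CUDNN

    • You need to fill in some additional information before downloading.
    • Download the Windows version that matches your CUDA Toolkit.
    • Extract the ZIP file and copy its three subdirectories into the matching subdirectories of the CUDA Toolkit.
      • bin → bin
      • lib → lib
      • include → include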
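The copy step above can also be scripted. This is a sketch under the assumption that the extracted archive has bin, lib and include directly under one folder; both directory arguments are placeholders you supply, and nothing here is an official NVIDIA procedure.

```python
import shutil
from pathlib import Path


def merge_cudnn(cudnn_root, cuda_root, subdirs=("bin", "lib", "include")):
    """Copy the cuDNN subdirectories into the matching CUDA Toolkit subdirectories."""
    copied = []
    for name in subdirs:
        src = Path(cudnn_root) / name
        dst = Path(cuda_root) / name
        if not src.is_dir():
            continue  # this archive layout has no such subdirectory
        # dirs_exist_ok (Python 3.8+) merges into the existing CUDA directory
        # instead of failing because it already exists.
        shutil.copytree(src, dst, dirs_exist_ok=True)
        copied.append(name)
    return copied
```

Writing under C:\Program Files normally needs an elevated prompt, so a call such as merge_cudnn(extracted_folder, cuda_install_folder) would have to run as administrator.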
  6. Check the environment variables

    • CUDA_PATH is set to the correct installation path.
    • CUDA_PATH_Vxx_y is set to the correct installation path.
    • Reboot (required)
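The two variables above can be inspected from Python after the reboot. A minimal sketch; the pattern assumes the versioned variable looks like CUDA_PATH_V11_1 (the Vxx_y form named above), and the function names are illustrative.

```python
import os
import re


def cuda_path_variables(environ=None):
    """Return CUDA_PATH and every CUDA_PATH_Vxx_y variable found in the environment."""
    environ = os.environ if environ is None else environ
    pattern = re.compile(r"^CUDA_PATH(_V\d+_\d+)?$", re.IGNORECASE)
    return {name: value for name, value in environ.items() if pattern.match(name)}


def missing_directories(variables):
    """Return the names of variables whose value is not an existing directory."""
    return [name for name, path in variables.items() if not os.path.isdir(path)]


if __name__ == "__main__":
    found = cuda_path_variables()
    print("found:", found or "no CUDA_PATH variables set")
    print("pointing at missing directories:", missing_directories(found))
```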
  7. Verify the installation

    • Open python or a Jupyter notebook and enter the following code:

      ​​​​​​​​from tensorflow.python.client import device_lib
      ​​​​​​​​device_lib.list_local_devices()
      
    • If a warning like the following appears (the timestamp is followed by W instead of I)

      ​​​​​​​​W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found
      
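      • Workaround 1: using the file name (cusolver64_10.dll) as a guide, look in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\bin (or the directory for your version) for the similarly named file cusolver64_11.dll and rename it to cusolver64_10.dll (copying it works too).
      • Workaround 2: download cusolver64_10.dll from the link and place it in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\bin
      • Ref: tensorflow bug reports #44291, #44159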
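Workaround 1 can be done with a short script instead of by hand. The sketch below copies rather than renames, so the original cusolver64_11.dll stays in place; the default file names come from the warning above, and the directory is whatever bin path matches your CUDA version.

```python
import shutil
from pathlib import Path


def alias_dll(bin_dir, available="cusolver64_11.dll", wanted="cusolver64_10.dll"):
    """Copy `available` to `wanted` inside bin_dir unless `wanted` already exists."""
    source = Path(bin_dir) / available
    target = Path(bin_dir) / wanted
    if target.exists():
        return False  # the DLL TensorFlow asks for is already there
    if not source.exists():
        raise FileNotFoundError(source)
    shutil.copy2(source, target)
    return True
```

As with the cuDNN copy, writing into the CUDA bin directory usually requires an administrator prompt.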

    Example python output (if GPU appears at the bottom, the setup is correct):

    ​​​​(vGPU) C:\Works\OpenCV>python
    ​​​​Python 3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)] on win32
    ​​​​Type "help", "copyright", "credits" or "license" for more information.
    ​​​​>>> from tensorflow.python.client import device_lib
    ​​​​2021-01-20 10:20:34.772345: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
    ​​​​>>>
    ​​​​>>> device_lib.list_local_devices()
    ​​​​2021-01-20 10:20:40.590963: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
    ​​​​To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    ​​​​2021-01-20 10:20:40.594219: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
    ​​​​2021-01-20 10:20:41.256771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
    ​​​​pciBusID: 0000:2d:00.0 name: GeForce MX330 computeCapability: 6.1
    ​​​​coreClock: 1.594GHz coreCount: 3 deviceMemorySize: 2.00GiB deviceMemoryBandwidth: 52.21GiB/s
    ​​​​2021-01-20 10:20:41.257120: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
    ​​​​2021-01-20 10:20:41.264409: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
    ​​​​2021-01-20 10:20:41.264607: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
    ​​​​2021-01-20 10:20:41.268513: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
    ​​​​2021-01-20 10:20:41.270228: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
    ​​​​2021-01-20 10:20:41.278077: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
    ​​​​2021-01-20 10:20:41.282008: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
    ​​​​2021-01-20 10:20:41.282499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
    ​​​​2021-01-20 10:20:42.142988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
    ​​​​2021-01-20 10:20:42.143236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
    ​​​​2021-01-20 10:20:42.143350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
    ​​​​2021-01-20 10:20:42.144514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 1342 MB memory) -> physical GPU (device: 0, name: GeForce MX330, pci bus id: 0000:2d:00.0, compute capability: 6.1)
    ​​​​2021-01-20 10:20:42.146561: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
    ​​​​[name: "/device:CPU:0"
    ​​​​device_type: "CPU"
    ​​​​memory_limit: 268435456
    ​​​​locality {
    ​​​​}
    ​​​​incarnation: 5028390217096499099
    ​​​​, name: "/device:GPU:0"
    ​​​​device_type: "GPU"
    ​​​​memory_limit: 1408043828
    ​​​​locality {
    ​​​​bus_id: 1
    ​​​​links {
    ​​​​}
    ​​​​}
    ​​​​incarnation: 11065352565844271525
    ​​​​physical_device_desc: "device: 0, name: GeForce MX330, pci bus id: 0000:2d:00.0, compute capability: 6.1"
    ​​​​]
    ​​​​>>> exit()
    
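A shorter check than reading the full device list is to ask TensorFlow only for the GPUs it can see. This sketch assumes tensorflow 2.1 or later, where tf.config.list_physical_devices is available; it returns None instead of failing when tensorflow is not installed.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # An empty list means TensorFlow loaded but found no usable GPU.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("tensorflow is not installed in this environment")
    elif gpus:
        print("GPU(s) visible to TensorFlow:", gpus)
    else:
        print("TensorFlow is installed but sees no GPU")
```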


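Using the Anaconda for Windows environment

  • The Anaconda environments that currently support GPU are listed below

| No. | Python | TensorFlow | CUDA Toolkit | cuDNN |
|---|---|---|---|---|
| 1 | 3.6 | 1.9 | 9.0 | 7.0.3 ~ 7.6.5 |
| 2 | 3.6 | 1.13~1.15 | 10.0.130 | 7.3.0 ~ 7.6.5 |
| 3 | 3.7 | 1.13~1.15 | 10.0.130 | 7.3.0 ~ 7.6.5 |
| 4 | 3.7 | 2.1 | 10.1.243 | 7.5.0 ~ 7.6.5 |
| 5 | 3.8 | NA | NA | NA |

  • Installation commands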
    ​​​​conda install tensorflow-gpu==1.15
    ​​​​# or
    ​​​​conda install tensorflow==2.1
    
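    Check that the packages listed during installation include cudatoolkit and cudnn
  • Do not install with pip install; it does not include cudatoolkit and cudnn
  • Verify the installation
    • Open python or a Jupyter notebook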
    ​​​​from tensorflow.python.client import device_lib
    ​​​​device_lib.list_local_devices()
    
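Whether conda actually pulled in cudatoolkit and cudnn can also be checked after the fact from the output of `conda list --json`. A sketch, assuming that output is a JSON array of objects with "name" and "version" fields; the function name is made up for this example.

```python
import json


def cuda_packages(conda_list_json, wanted=("cudatoolkit", "cudnn")):
    """Map each wanted package to its version in `conda list --json` output (None if absent)."""
    installed = {entry["name"]: entry["version"] for entry in json.loads(conda_list_json)}
    return {name: installed.get(name) for name in wanted}


if __name__ == "__main__":
    # Hand-written sample in the assumed format, not real conda output.
    sample = '[{"name": "cudatoolkit", "version": "10.1.243"}, {"name": "tensorflow", "version": "2.1.0"}]'
    print(cuda_packages(sample))
```

A None value for either package suggests the environment was installed without the GPU dependencies.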


WSL2 + docker environment

Starting with Docker Desktop 3.1.0, WSL 2 GPU Paravirtualization (GPU-PV) is supported on NVIDIA GPUs.

System requirements

  • Docker Desktop upgraded to version 3.1.0 (or later)

  • Windows 10 Insider version (build 20150 or later). Currently version 2004, builds 21292 and 21301 (not the already released 20H2)


    Current Windows 10 Insider version (2021/01/21)


    Current Windows 10 Insider version (2021/01/23)

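  • Beta display drivers from NVIDIA supporting WSL 2 GPU Paravirtualization

    • You must register as an NVIDIA developer and log in to download them
    • The current GeForce version is 465.21
  • Update the WSL 2 Linux kernel to the latest version.
    Open a cmd window as administrator and update with wsl --update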

    ​​​​wsl --update
    

    After the update, WSL must be restarted (and so must docker, since docker also has two WSL images).

    ​​​​wsl --shutdown 
    

    Rebooting the PC is the simpler option.

  • Make sure the WSL 2 backend option is enabled in Docker Desktop


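    Enabling the WSL 2 backend

    In addition, to use the docker (and docker-compose) commands inside a Linux distro you do not need to install them there; enable WSL Integration globally, or enable it individually on the distros that need it.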


    Enabling WSL Integration

Incidentally, the docker and docker-compose commands also work in the git-bash environment of Git for Windows.
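The two numeric requirements listed above (Windows build 20150 or later, Docker Desktop 3.1.0 or later) are easy to check in code. A minimal sketch; it assumes the Docker Desktop version is a plain dotted string such as "3.1.0", and the function names are illustrative.

```python
def parse_version(text):
    """Turn '3.1.0' into (3, 1, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def gpu_pv_problems(windows_build, docker_desktop_version):
    """Return a list of unmet requirements; an empty list means both are satisfied."""
    problems = []
    if windows_build < 20150:
        problems.append(f"Windows build {windows_build} is older than 20150")
    if parse_version(docker_desktop_version) < (3, 1, 0):
        problems.append(f"Docker Desktop {docker_desktop_version} is older than 3.1.0")
    return problems


if __name__ == "__main__":
    print(gpu_pv_problems(21292, "3.1.0"))  # values taken from this note
```

On Windows, sys.getwindowsversion().build returns the build number to pass as windows_build. The driver and WSL kernel requirements are not covered by this check.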

Verification

  • Run the following in a cmd console window
    ​​​​docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
    
    Naturally, you can also run it inside a WSL distro.
    ​​​​wsl bash
    
    Once inside (I have tried Ubuntu 20.04, CentOS 7 and CentOS 8, all OK), run the same command:
    ​​​​docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
    
  • The output looks like this:
    ​​​​Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
    ​​​​        -fullscreen       (run n-body simulation in fullscreen mode)
    ​​​​        -fp64             (use double precision floating point values for simulation)
    ​​​​        -hostmem          (stores simulation data in host memory)
    ​​​​        -benchmark        (run benchmark to measure performance)
    ​​​​        -numbodies=<N>    (number of bodies (>= 1) to run in simulation)
    ​​​​        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
    ​​​​        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
    ​​​​        -compare          (compares simulation results running once on the default GPU and once on the CPU)
    ​​​​        -cpu              (run n-body simulation on the CPU)
    ​​​​        -tipsy=<file.bin> (load a tipsy model file for simulation)
    
    ​​​​NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
    
    ​​​​> Windowed mode
    ​​​​> Simulation data stored in video memory
    ​​​​> Single precision floating point simulation
    ​​​​> 1 Devices used for simulation
    ​​​​GPU Device 0: "GeForce MX330" with compute capability 6.1
    
    ​​​​> Compute 6.1 CUDA device: [GeForce MX330]
    ​​​​3072 bodies, total time for 10 iterations: 2.427 ms
    ​​​​= 38.888 billion interactions per second
    ​​​​= 777.752 single-precision GFLOP/s at 20 flops per interaction
    

Check that your GPU appears in the information at the end (GeForce MX330 in the example above).
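If the benchmark output is captured to a string, the GPU name and throughput can be pulled out with two regular expressions. A sketch based only on the single-precision output shown above; a run with -fp64 would presumably print a different label and is not handled here.

```python
import re


def parse_nbody(output):
    """Return (gpu_name, gflops) from nbody benchmark output; None for missing parts."""
    device = re.search(r'GPU Device \d+: "([^"]+)"', output)
    gflops = re.search(r"=\s*([\d.]+) single-precision GFLOP/s", output)
    return (
        device.group(1) if device else None,
        float(gflops.group(1)) if gflops else None,
    )


if __name__ == "__main__":
    sample = (
        'GPU Device 0: "GeForce MX330" with compute capability 6.1\n'
        "= 777.752 single-precision GFLOP/s at 20 flops per interaction\n"
    )
    print(parse_nbody(sample))
```

A result of (None, None) means neither line was found, which usually indicates the container did not get access to the GPU.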

The WSL2 + docker environment does not work with the tensorflow/tensorflow docker image published by the tensorflow site; download the docker images you need from https://ngc.nvidia.com/catalog/containers instead.
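Whichever image you pick from the NGC catalog, the docker invocation keeps the same shape as the verification command above. The sketch below only builds the argument list; the image shown is the cuda-sample image already used in this note, and the name of any other NGC image has to come from the catalog.

```python
import shlex


def gpu_docker_command(image, *container_args):
    """Build a `docker run` argument list that exposes all GPUs to the container."""
    return ["docker", "run", "--rm", "-it", "--gpus=all", image, *container_args]


if __name__ == "__main__":
    cmd = gpu_docker_command(
        "nvcr.io/nvidia/k8s/cuda-sample:nbody", "nbody", "-gpu", "-benchmark"
    )
    # shlex.join (Python 3.8+) renders the list as a copy-pasteable command line.
    print(shlex.join(cmd))
```

The list can be handed to subprocess.run as-is when the command should be launched from Python rather than printed.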



Environments not yet supported

  • VM environments
  • VM + docker environments

"VM" above refers to VirtualBox or VMware Workstation Pro.

tags: Windows, WSL2, tensorflow, GPU acceleration, python, Python for Windows, docker for windows