# Ubuntu 20.04 Torch-GPU Installation Notes

## Common conda virtual environment commands

1. Create a new environment (the Python version must be specified)
    ```
    conda create --name {env_name} python={version}
    ```
2. Activate an environment
    ```
    conda activate {env_name}
    ```
3. Deactivate the current environment
    ```
    conda deactivate
    ```
4. List all existing virtual environments
    ```
    conda info --envs
    ```
5. Remove an environment
    ```
    conda env remove --name {env_name}
    ```

## Remote access: NoMachine

Why it was installed: TeamViewer uses too much memory.

## Remote access: TeamViewer

1. Install TeamViewer
    * Download the latest TeamViewer package
        ```
        wget https://download.teamviewer.com/download/linux/teamviewer_amd64.deb
        ```
    * Install the TeamViewer package
        ```
        sudo apt install ./teamviewer_amd64.deb
        ```
    * TeamViewer can then be found and launched from the applications menu
2. Configure the TeamViewer host
    * Log in with a free account first
    * Go to Extras > Options and, under Security, set random password to disable (no random password)
      ![image](https://hackmd.io/_uploads/SJsEx9DNR.png)
      ![image](https://hackmd.io/_uploads/H1hYxcD4C.png)
    * Click apply

## PyTorch GPU installation

### Part 1: Reinstall the NVIDIA driver

1. [Remove the existing NVIDIA driver and PPA](https://forums.developer.nvidia.com/t/newly-installed-drivers-are-not-found-when-nvidia-smi-is-called/82686/6)
    ```bash
    sudo apt-get remove --purge '^nvidia-.*'
    sudo apt-get --purge remove "*cublas*" "cuda*"
    sudo apt-get --purge remove "*nvidia*"
    sudo add-apt-repository --remove ppa:graphics-drivers/ppa
    sudo rm /etc/X11/xorg.conf
    sudo apt autoremove
    sudo reboot
    ```
    > To remove only CUDA and keep the NVIDIA driver:
    > 1. If `cuda-uninstaller` exists under `/usr/local/cuda-{version}/bin/`, run it from that directory:
    >     ```
    >     sudo ./cuda-uninstaller
    >     ```
    > 2. If `cuda-uninstaller` cannot be found, remove CUDA with apt:
    >     ```
    >     sudo apt-get remove nvidia-cuda-toolkit
    >     sudo apt-get remove --auto-remove nvidia-cuda-toolkit
    >     sudo apt-get purge nvidia-cuda-toolkit
    >     # or
    >     sudo apt-get purge --auto-remove nvidia-cuda-toolkit
    >     ```
    > 3. Whichever method is used, make sure that no other CUDA directory is left under `/usr/local/`. (If one remains, [remove](https://www.nc.com.tw/modules/answer/question/43) it with `rm`.)
2. List the supported GPU driver versions
    ```
    ubuntu-drivers devices
    # or
    ubuntu-drivers list
    ```
    ![Screenshot from 2024-05-30 23-18-38](https://hackmd.io/_uploads/BJS1YfINC.png)
    Choose the NVIDIA driver version to install (the recommended version is usually fine).
3. 
    Install nvidia-driver and reboot
    ```
    sudo apt-get update
    sudo apt upgrade
    sudo apt install nvidia-driver-{VERSION_NUMBER}
    sudo reboot
    ```
    The driver version number must be one of the supported versions listed in the previous step.
4. Check the NVIDIA driver version
    ```
    nvidia-smi
    ```

## Part 2: Install CUDA

1. Go to the [NVIDIA CUDA Toolkit Archive](https://developer.nvidia.com/cuda-toolkit-archive) and pick the CUDA version to install. See the [NVIDIA driver and CUDA version compatibility table](https://docs.nvidia.com/deploy/cuda-compatibility/index.html).
2. Install with the runfile (local) installer [do not let it install over the nvidia-driver from Part 1]
    ![Screenshot from 2024-06-01 16-42-56](https://hackmd.io/_uploads/B1mXyv_VA.png)

:::danger
Do not use the deb installer (it installs the NVIDIA driver again and overwrites the version chosen earlier).
![Screenshot from 2024-05-30 22-44-50](https://hackmd.io/_uploads/r1-bZGINC.png)
:::

3. Add CUDA to the environment variables
    * Open the hidden `.bashrc` file
        ```
        gedit ~/.bashrc
        ```
        ![image](https://hackmd.io/_uploads/SJrevSINC.png)
    * Append the following environment variables at the end of the file (this is the approach that finally worked)
        ```
        export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
        ```
        > Another way to set the environment variables:
        > ```
        > export CUDA_HOME=/usr/local/cuda-11.3
        > export PATH=$PATH:$CUDA_HOME/bin
        > export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-11.3/lib64
        > ```
        > Replace `cuda-11.3` with the name of the CUDA folder under `/usr/local/`.
    * Reload the configuration in the terminal:
        ```
        source ~/.bashrc
        ```
4. Check the CUDA version
    ```
    nvcc -V
    ```
    ![Screenshot from 2024-06-01 19-31-06](https://hackmd.io/_uploads/ryR98Y_VC.png)
5. Run `nvidia-smi` again and check the driver version and the CUDA version (normally no reboot is needed)
    ```
    nvidia-smi
    ```
    ![Screenshot from 2024-06-01 19-05-55](https://hackmd.io/_uploads/rJk3eY_V0.png)
    > Error message: failed to initialize NVML: Driver/library version mismatch
    > check https://www.gpu-mart.com/blog/failed-to-initialize-nvml-driver-library-version-mismatch

## Part 3: Install cuDNN (two installation methods)

PyTorch can be installed first; install cuDNN after confirming that PyTorch works.

Read the terminal output carefully while installing cuDNN and deal with any error as soon as it appears.

### Method 1: Extract the archive (succeeded)

1. Go to the [cuDNN archive](https://developer.nvidia.com/rdp/cudnn-archive) and pick the cuDNN build that matches the installed CUDA (download the tar archive; a cuDNN build that exactly matches the CUDA version is strongly recommended)
    ![Screenshot 2024-05-31 023233](https://hackmd.io/_uploads/BkTjLrUNR.png)
2. 
    Extract the archive (GNU tar detects the compression format automatically)
    ```
    tar -xvf ${CUDNN_TAR_FILE}
    ```
3. Manually copy the files into the CUDA directories (the version number must match)
    ```
    sudo cp -P cuda/include/cudnn*.h /usr/local/cuda-11.4/include
    sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-11.4/lib64/
    sudo chmod a+r /usr/local/cuda-11.4/lib64/libcudnn*
    ```

### Method 2: Install from deb packages (failed)

1. Go to the [cuDNN archive](https://developer.nvidia.com/rdp/cudnn-archive) and pick the cuDNN build that matches the installed CUDA (download the 3 deb installers for that version)
2. Install the packages with `sudo dpkg -i`, for example:
    ```
    sudo dpkg -i cudnn-local-repo-cross-sbsa-ubuntu2004-8.9.7.29_1.0-1_all.deb
    sudo dpkg -i cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_arm64.deb
    sudo dpkg -i cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb
    ```
    ![Screenshot from 2024-06-01 19-51-57](https://hackmd.io/_uploads/ByhOit_VC.png)

### Test cuDNN

* Copy the cuDNN samples to a working directory, here HOME (the sample path differs between cuDNN versions, so adjust it accordingly). For example:
    ```
    cp -r /usr/src/cudnn_samples_v8 $HOME
    ```
    ![Screenshot from 2024-06-01 19-53-07](https://hackmd.io/_uploads/HJ02stOE0.png)
* Enter the mnistCUDNN directory under that path
    ```
    cd $HOME/cudnn_samples_v8/mnistCUDNN/
    ```
* Build the mnistCUDNN sample
    ```
    make clean && make
    ```
* Run the built sample (from inside the mnistCUDNN folder)
    ```
    ./mnistCUDNN
    ```
    ![Screenshot from 2024-06-01 20-28-25](https://hackmd.io/_uploads/BygWVqOVA.png)
    The installation counts as successful only when the output reports that the test passed.

:::warning
Sometimes cuDNN comes without the sample files; in that case TensorFlow can be used for the test:
```python
import tensorflow as tf

# An empty list means TensorFlow could not register any GPU.
print("Available GPUs:", tf.config.list_physical_devices('GPU'))
```
Check whether a GPU is detected; if one is, cuDNN is usable:
![Screenshot from 2024-10-26 14-55-55](https://hackmd.io/_uploads/S1NnMz5lkg.png)
:::

## Part 4: Install PyTorch

1. Create a new Python virtual environment, then go to the PyTorch website ([current version](https://pytorch.org/) or [previous version](https://pytorch.org/get-started/previous-versions/))
2. Install a PyTorch version that is compatible with the installed CUDA version
3. Test whether PyTorch can use the GPU
    ```python
    import torch

    # True means PyTorch can see a usable CUDA device.
    print(torch.cuda.is_available())
    ```
    ![Screenshot from 2024-06-01 19-17-47](https://hackmd.io/_uploads/SyIoQY_EA.png)

## Part 5: Other Ubuntu 20.04 installations

### Anaconda

1. Make sure `curl` is installed
    ```
    sudo apt install curl
    ```
2. 
    Download the Anaconda installer and verify that its checksum matches the one published on the official site
    ```
    curl -O https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
    sha256sum Anaconda3-2021.05-Linux-x86_64.sh
    ```
    ![image](https://hackmd.io/_uploads/SyLmAEHUkg.png)
3. Start the installation
    ```
    bash Anaconda3-2021.05-Linux-x86_64.sh
    ```
    * Press ENTER to read the license agreement
      ![image](https://hackmd.io/_uploads/rkt50ErLJe.png)
    * Accept the license agreement
      ![image](https://hackmd.io/_uploads/r1_nRVSL1e.png)
    * Install to the default location
      ![image](https://hackmd.io/_uploads/HJWRC4SUye.png)
    * Finish the installation
      ![image](https://hackmd.io/_uploads/SJLflSrUye.png)
4. Add Anaconda to the environment variables
    ```
    vi ~/.bashrc
    ```
    * Append the following as the last line of the file and save: `export PATH="$HOME/anaconda3/bin:$PATH"`
    * Reload after saving: `source ~/.bashrc`
    * Check the conda version: `conda --version`
      ![image](https://hackmd.io/_uploads/HJJObSSIJg.png)

### C++/Libtorch

1. Download libtorch from the [PyTorch website](https://pytorch.org/get-started/locally/) (I downloaded the [CPU version](https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-2.3.1%2Bcpu.zip))
    ![Screenshot from 2024-07-05 19-28-13](https://hackmd.io/_uploads/H1cJK8rwC.png)
    > Oddly, the builds for most CUDA versions can be found through the link below, but there is no link for the CUDA 11.4 that I installed.
    > https://download.pytorch.org/libtorch/cu116 (change the version number after `cu`)
2. Extract the downloaded archive and place the extracted `libtorch` folder under your home directory (do not put it in a system directory; the permission issues are a hassle)
    ![Screenshot from 2024-07-05 19-42-36](https://hackmd.io/_uploads/SkDS2IBDA.png)
3. Create a new folder containing two files
    * example-app.cpp
        ```
        #include <torch/torch.h>
        #include <iostream>

        int main() {
          // Create a 2x3 tensor of uniform random values and print it.
          torch::Tensor tensor = torch::rand({2, 3});
          std::cout << tensor << std::endl;
        }
        ```
    * CMakeLists.txt
        ```
        cmake_minimum_required(VERSION 3.15)
        project(test_pytorch)
        set(CMAKE_CXX_STANDARD 17)  # setting this to C++11 may cause errors
        # CMake does not expand "~", so the path is built from the HOME environment variable.
        set(Torch_DIR "$ENV{HOME}/libtorch/share/cmake/Torch")  # TODO: adjust to your libtorch location
        # Add libtorch to CMAKE_PREFIX_PATH
        list(APPEND CMAKE_PREFIX_PATH "$ENV{HOME}/libtorch")  # TODO: adjust to your libtorch location
        find_package(Torch REQUIRED)
        message(STATUS "Torch library status:")
        message(STATUS " include path: ${TORCH_INCLUDE_DIRS}")
        message(STATUS " lib path : ${TORCH_LIBRARIES} ")
        add_executable(example-app example-app.cpp)
        target_link_libraries(example-app "${TORCH_LIBRARIES}")
        ```
4. 
    Change to the directory that contains the cpp and CMakeLists.txt files above, then run the following commands to generate the CMake build files and compile
    ```
    mkdir build
    cd build
    cmake ..
    make
    ```
    ![Screenshot from 2024-07-05 19-48-19](https://hackmd.io/_uploads/r1s5TIBvR.png)
    The build succeeded.

## Part 6: Installation history

### Old computer: RTX 2070

Current status: **last updated on 20240823**

* cuda version 11.8.89: `nvcc --version`
  ![Screenshot from 2024-08-23 18-52-46](https://hackmd.io/_uploads/Sk6G91LjA.png)
* nvidia driver version 470.256.02: `nvidia-smi`
  ![Screenshot from 2024-08-23 18-53-26](https://hackmd.io/_uploads/HyEL9kLoA.png)
* python version 3.8.19: `python --version`
  ![Screenshot from 2024-08-23 18-54-25](https://hackmd.io/_uploads/rkRO9kIoA.png)
* pytorch version 2.0.0+cu118: `print(torch.__version__)`
  ![Screenshot from 2024-08-23 18-56-01](https://hackmd.io/_uploads/ByKA5kIoR.png)
* cudnn version 8.2.4

History:

* 20240530 [CUDA 11.6](https://developer.nvidia.com/cuda-11-6-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local) (For R3LIVE python, <font color = "red">FAIL</font>: mismatch)
* 20240531 [CUDA 11.4](https://developer.nvidia.com/cuda-11-4-1-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local) (For R3LIVE python, <font color = "red">FAIL</font>: mismatch)
* 20240601 [CUDA 12.0](https://developer.nvidia.com/cuda-12-0-1-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local) (For R3LIVE python, runfile <font color = "#19A519">**SUCCESS**</font> [but the CUDA version shown by `nvidia-smi` is 11.4] -> <font color = "red">FAIL</font> for pytorch: `torch.cuda.is_available()` returned `False`)
* 20240601 [CUDA 11.4](https://developer.nvidia.com/cuda-11-4-1-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=runfile_local) (For R3LIVE python, runfile <font color = "#19A519">**SUCCESS**</font>)
* 20240601 CUDA 12.0 + nvidia-driver-470 (highest CUDA version 11.4) + 
[cuDNN](https://developer.nvidia.com/cudnn-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local) (For R3LIVE python, <font color = "red">FAIL</font>)
* 20240601 nvidia-driver-470.239.06 (`sudo apt install nvidia-driver-470`) (highest CUDA version 11.4) + [CUDA 11.4.0](https://developer.nvidia.com/cuda-11-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=runfile_local) + python3.8 + [pytorch2.2.2 for cuda-11.8](https://hackmd.io/_uploads/Hynz8Fu4R.png) (`conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=11.8 -c pytorch -c nvidia`) (For R3LIVE python, <font color = "#19A519">**SUCCESS**</font>) [cuDNN not yet installed]
* 20240601 nvidia-driver-470.239.06 (`sudo apt install nvidia-driver-470`) (highest CUDA version 11.4) + [CUDA 11.4.0](https://developer.nvidia.com/cuda-11-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=runfile_local) + python3.8 + [pytorch2.2.2 for cuda-11.8](https://hackmd.io/_uploads/Hynz8Fu4R.png) (`conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=11.8 -c pytorch -c nvidia`) + cuDNN v8.9.7 (December 5th, 2023), for CUDA 11.x (<font color = "red">FAIL</font>)
* ==20240601 nvidia-driver-470.239.06 (`sudo apt install nvidia-driver-470`) (highest CUDA version 11.4) + [CUDA 11.4.0](https://developer.nvidia.com/cuda-11-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=runfile_local) + python3.8 + [pytorch2.2.2 for cuda-11.8](https://hackmd.io/_uploads/Hynz8Fu4R.png) (`conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=11.8 -c pytorch -c nvidia`) + [cuDNN v8.2.4 (September 2nd, 2021), for CUDA 11.4](https://hackmd.io/_uploads/SySMS9OVA.png) (final <font color = "#19A519">**SUCCESS**</font>)==

### New computer: RTX 3060

Current status: last updated on 20240823

* cuda 
version 11.8.89: `nvcc --version`
  ![Screenshot from 2024-08-23 18-23-04](https://hackmd.io/_uploads/S1GXQyIj0.png)
* nvidia driver version 535.183.01: `nvidia-smi`
  ![Screenshot from 2024-08-23 18-25-15](https://hackmd.io/_uploads/ryWsQkLsC.png)
* python version 3.8.19: `python --version`
  ![Screenshot from 2024-08-23 18-29-28](https://hackmd.io/_uploads/S1Qj4JIjA.png)
* pytorch version 2.0.0+cu118: `print(torch.__version__)`
  ![Screenshot from 2024-08-23 18-37-26](https://hackmd.io/_uploads/H1RFUJLj0.png)
  ![Screenshot from 2024-10-26 22-18-01](https://hackmd.io/_uploads/Sk8NcO9gkl.png)
* cudnn: v8.9.6

History:

* 20240926 nvidia-driver-535.183.01 + CUDA 11.8.89 + cuDNN v8.9.6 (November 1st, 2023), for CUDA 11.x [deb install] (<font color = "red">FAIL</font>)
* 20240926 nvidia-driver-535.183.01 + CUDA 11.8.89 + [cuDNN v8.9.6 (November 1st, 2023), for CUDA 11.x [tar unzip]](https://hackmd.io/_uploads/BJ1g7M9eyl.png) (final <font color = "#19A519">**SUCCESS**</font>)
* ==20240926 nvidia-driver-535.183.01 + pytorch 2.4.1+cu118 for cuda 11-8 (`pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`) + CUDA 11.8.89 + [cuDNN v8.9.6 (November 1st, 2023), for CUDA 11.x [tar unzip]](https://hackmd.io/_uploads/BJ1g7M9eyl.png) (final <font color = "#19A519">**SUCCESS**</font>)==
* ==20241101 nvidia-driver-535.183.01 + pytorch 2.0.0+cu118 for cuda 11-8 (`pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118`) + CUDA 11.8.89 + cuDNN v8.9.6 (November 1st, 2023), for CUDA 11.x [tar unzip] (final **SUCCESS**)==

### NCHC (國網) machine

Current status: last updated on 20241019

* cuda version 11.5.119: `nvcc --version`
  ![image](https://hackmd.io/_uploads/rk0Z1z-gJe.png)
* nvidia driver version 550.90.07: `nvidia-smi`
  ![image](https://hackmd.io/_uploads/S1QNyzblkg.png)
* python version 3.10.12: `python --version`
  ![image](https://hackmd.io/_uploads/HJ4Pyf-g1x.png)
* pytorch version 1.11.0+cu115: 
`print(torch.__version__)`
  ![image](https://hackmd.io/_uploads/rk0wffZlJe.png)

History:

* ==20241019 nvidia-driver-550.90.07 (installed by NCHC) + CUDA 11.5.119 (installed by NCHC) + python3.10.12 + [pytorch1.11.0 for cuda-11.5](https://hackmd.io/_uploads/Hynz8Fu4R.png) (`pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu115`) (final <font color = "#19A519">**SUCCESS**</font>)==

### CVIT Lab computer

Current status: last updated on 20250317

* cuda version 12.0.76: `nvcc --version`
  ![image](https://hackmd.io/_uploads/Sy_MTVSIyg.png)
* nvidia driver version 525.85.05: `nvidia-smi`
  ![image](https://hackmd.io/_uploads/BJ-BTVHIJx.png)
* python version: `python --version`
  ![image](https://hackmd.io/_uploads/Skz6ElbFJx.png)
* pytorch version: `print(torch.__version__)`
  ![image](https://hackmd.io/_uploads/Sy8eBxbFJg.png)

History:

* 20250205 nvidia-driver-525.85.05 + CUDA 12.0.76 + python 3.8.5 + pytorch2.4.1 for cuda-11.8 (`pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118`) (final <font color = "#19A519">**SUCCESS**</font>)
* ==20250205 nvidia-driver-525.85.05 + CUDA 12.0.76 + python 3.8.5 + pytorch2.0.0 for cuda-11.8 (`pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118`) (final <font color = "#19A519">**SUCCESS**</font>)==
* ==20250317 [LOCAL] nvidia-driver-525.85.05 + CUDA 12.0.76 (system) + local CUDA 11.3.0 (`sudo sh cuda_11.3.0_465.19.01_linux.run --toolkit --silent --override --installpath=/usr/local/cuda-11.3`) + python 3.8.5 + pytorch1.12.1 for cuda-11.3 (`pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113`) (final <font color = "#19A519">**SUCCESS**</font>)==

:::info
After installing a local copy of CUDA, run the following commands to switch to that version (and switch back when done):
```
export PATH=/usr/local/cuda-11.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64:$LD_LIBRARY_PATH
```
:::

### CVIT Lab RTX5090

Current status: last updated on 20250317

History:

* 20250317 `pip3 
install torch torchvision torchaudio`

# References

* [cuda vs nvidia driver : version compare](https://docs.nvidia.com/deploy/cuda-compatibility/index.html)
* [teamviewer install](https://linuxize.com/post/how-to-install-teamviewer-on-ubuntu-20-04/)
* [completely remove nvidia driver](https://askubuntu.com/questions/206283/how-can-i-uninstall-a-nvidia-driver-completely)
* [nvidia related install (material consulted earlier when setting up dual monitors)](https://hackmd.io/@NyCNJlzoT4ujXvXGi_HlfA/rk0IppWqK)
* [Which pytorch version >2.0.1 support cuda 11.4](https://discuss.pytorch.org/t/which-pytorch-version-2-0-1-support-cuda-11-4/190446)
* [Pytorch Installation CUDA 11.4](https://discuss.pytorch.org/t/pytorch-installation-cuda-11-4/200160)
* [How to remove CUDA completely from Linux?](https://netraneupane.medium.com/how-to-remove-cuda-completely-from-linux-72a9b0edca53)
* [CUDA compile error on Linux: fatal error: cudnn_version.h: No such file or directory](https://blog.csdn.net/ning_yi/article/details/119385309)
* [NVIDIA Driver, CUDA 11.4, cuDNN v8.2.4 installation on Ubuntu 20.04](https://github.com/ashutoshIITK/install_cuda_cudnn_ubuntu_20)
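
Looking back at the installation history, the combinations that worked all had a PyTorch wheel whose CUDA major version did not exceed the highest CUDA version shown by `nvidia-smi` (for example a cu118 wheel on a driver that reports 11.4), while CUDA 12.0 on the same driver failed. The sketch below turns that observation into a quick pre-install check. It is only a rule of thumb under that assumption, not NVIDIA's or PyTorch's official compatibility logic, and the function names `wheel_cuda_version` and `wheel_likely_loads` are made up for this note.

```python
import re


def wheel_cuda_version(torch_version):
    """Parse the '+cuXYZ' suffix of a torch version string.

    Returns (major, minor), or None for builds without a CUDA suffix
    (for example '2.3.1+cpu').
    """
    match = re.search(r"\+cu(\d+)$", torch_version)
    if match is None:
        return None
    digits = match.group(1)
    # The last digit is the minor version: 'cu118' -> (11, 8), 'cu92' -> (9, 2).
    return int(digits[:-1]), int(digits[-1])


def wheel_likely_loads(torch_version, driver_max_cuda):
    """Heuristic (an assumption drawn from the history above): a CUDA wheel is
    likely usable when its CUDA major version is not newer than the highest
    CUDA version reported by nvidia-smi."""
    wheel = wheel_cuda_version(torch_version)
    if wheel is None:
        return False  # CPU-only build: the GPU will not be used at all
    driver_major = int(driver_max_cuda.split(".")[0])
    return wheel[0] <= driver_major


if __name__ == "__main__":
    # Pairs taken from the history: wheel version, CUDA version shown by nvidia-smi.
    print(wheel_likely_loads("2.0.0+cu118", "11.4"))
    print(wheel_likely_loads("1.12.1+cu113", "12.0"))
```

The final word is still `torch.cuda.is_available()` inside the target environment; this check only helps rule out obviously mismatched wheels before downloading them.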