# Ubuntu 20.04 Torch-GPU Installation Notes
## Common conda virtual environment commands
1. Create a new environment (specify the Python version)
```
conda create --name {env_name} python={version}
```
2. Activate an environment
```
conda activate {env_name}
```
3. Deactivate the current environment
```
conda deactivate
```
4. List all existing virtual environments
```
conda info --envs
```
5. Remove an environment
```
conda env remove --name {env_name}
```
## Remote access: NoMachine
Reason for installing: TeamViewer uses too much memory while running.
## Remote access: TeamViewer
1. Install TeamViewer
* Download the latest TeamViewer package
```
wget https://download.teamviewer.com/download/linux/teamviewer_amd64.deb
```
* Install the downloaded package
```
sudo apt install ./teamviewer_amd64.deb
```
* TeamViewer can then be found and launched from the applications menu
2. Configure the TeamViewer host
* Sign in with a free account first
* Go to Extras > Options and, under Security, set the random password option to disable (no random password)
* Click Apply
## PyTorch GPU installation
## I. Reinstall the NVIDIA driver
1. [Remove the existing NVIDIA driver and PPA](https://forums.developer.nvidia.com/t/newly-installed-drivers-are-not-found-when-nvidia-smi-is-called/82686/6)
```cmd
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get --purge remove "*cublas*" "cuda*"
sudo apt-get --purge remove "*nvidia*"
sudo add-apt-repository --remove ppa:graphics-drivers/ppa
sudo rm /etc/X11/xorg.conf
sudo apt autoremove
sudo reboot
```
> To remove only CUDA and keep the NVIDIA driver:
> 1. If `cuda-uninstaller` exists under `/usr/local/cuda-{version}/bin/`, run it from that folder:
> ```
> sudo ./cuda-uninstaller
> ```
> 2. If `cuda-uninstaller` cannot be found, remove the toolkit with apt instead:
> ```
> sudo apt-get remove nvidia-cuda-toolkit
> sudo apt-get remove --auto-remove nvidia-cuda-toolkit
> sudo apt-get purge nvidia-cuda-toolkit
> # or
> sudo apt-get purge --auto-remove nvidia-cuda-toolkit
> ```
> 3. Whichever method is used, make sure no other CUDA folder is left under `/usr/local/` afterwards. (If any remain, [remove](https://www.nc.com.tw/modules/answer/question/43) them with the `rm` command.)
2. List the supported GPU driver versions
```
ubuntu-drivers devices
# or
ubuntu-drivers list
```

Pick the NVIDIA driver version to install (the one marked `recommended` is usually a safe choice).
3. Install the NVIDIA driver and reboot
```
sudo apt-get update
sudo apt upgrade
sudo apt install nvidia-driver-{VERSION_NUMBER}
sudo reboot
```
Replace `{VERSION_NUMBER}` with one of the supported driver versions listed in the previous step.
4. Check the NVIDIA driver version
```
nvidia-smi
```
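
The driver check above can also be scripted, which is handy when recording versions in the history section. The sketch below is a minimal example, assuming `nvidia-smi` is on the `PATH`; the helper names `parse_driver_versions` and `query_driver_versions` are made up for illustration and are not part of any NVIDIA tool.

```python
import shutil
import subprocess


def parse_driver_versions(csv_text):
    """Return one driver version string per GPU from nvidia-smi CSV output."""
    return [line.strip() for line in csv_text.splitlines() if line.strip()]


def query_driver_versions():
    """Run nvidia-smi and return the driver versions, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools not installed (or not on PATH)
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # e.g. driver/library mismatch; inspect nvidia-smi manually
    return parse_driver_versions(result.stdout)


print(query_driver_versions())
```

An empty list means the driver could not be queried, in which case running `nvidia-smi` by hand shows the actual error.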
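## II. Install CUDA
1. Go to the [NVIDIA CUDA Toolkit Archive](https://developer.nvidia.com/cuda-toolkit-archive) and pick the CUDA version to install
[NVIDIA driver and CUDA version compatibility table](https://docs.nvidia.com/deploy/cuda-compatibility/index.html)
2. Install with the runfile (local) installer [do not let it install over the NVIDIA driver installed earlier]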

:::danger
Do not use the deb installer: it reinstalls the NVIDIA driver and overwrites the version chosen earlier.
:::

3. Add CUDA to the environment variables
* Open the hidden `.bashrc` file
```
gedit ~/.bashrc
```

* Append the following environment variables to the end of the file (this is the approach that finally worked)
```
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
```
> An alternative way to set the environment variables:
> ```
> export CUDA_HOME=/usr/local/cuda-11.3
> export PATH=$PATH:$CUDA_HOME/bin
> export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-11.3/lib64
> ```
> Replace `cuda-11.3` with the name of the CUDA folder under `/usr/local/`.
* Reload the shell configuration:
```
source ~/.bashrc
```
4. Check the CUDA version
```
nvcc -V
```

5. Run `nvidia-smi` again to check the driver version and the CUDA version (a reboot is normally not needed)
```
nvidia-smi
```

> If `nvidia-smi` fails with `Failed to initialize NVML: Driver/library version mismatch`, see https://www.gpu-mart.com/blog/failed-to-initialize-nvml-driver-library-version-mismatch
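
The two version checks above (`nvcc -V` for the toolkit, `nvidia-smi` for the driver) can be compared in a few lines of Python. This is only a sketch: the function names are hypothetical, and the sample strings are shortened, illustrative stand-ins for real tool output. It flags the case where the installed toolkit is newer than the CUDA version the driver reports, which matches the failed CUDA 12.0 + driver 470 attempt recorded in the history.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc -V` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_smi_cuda_version(smi_output):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of `nvidia-smi`."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_newer_than_driver(nvcc_output, smi_output):
    """True when the toolkit release is newer than what the driver reports."""
    toolkit = parse_nvcc_release(nvcc_output)
    driver = parse_smi_cuda_version(smi_output)
    if toolkit is None or driver is None:
        raise ValueError("could not find a CUDA version in the given text")
    return toolkit > driver  # tuples compare major first, then minor


# Illustrative sample text, shortened from typical tool output.
sample_nvcc = "Cuda compilation tools, release 12.0, V12.0.76"
sample_smi = "| NVIDIA-SMI 470.239.06   Driver Version: 470.239.06   CUDA Version: 11.4 |"
print(toolkit_newer_than_driver(sample_nvcc, sample_smi))  # True
```

A `True` result is a hint to either install an older toolkit or a newer driver, not a definitive compatibility verdict; the official compatibility table linked above remains the reference.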
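## III. Install cuDNN (two installation methods)
cuDNN can wait: install PyTorch first, confirm that it works, and then come back to this step.
While installing cuDNN, read the terminal output carefully and deal with any error as soon as it appears.
### Method 1: extract the tar archive (worked)
1. Go to the [cuDNN archive](https://developer.nvidia.com/rdp/cudnn-archive) and pick the cuDNN build that corresponds to the installed CUDA version (download the tar archive; a cuDNN build that exactly matches the CUDA version is strongly recommended)

2. Extract the archive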
```
tar -xvf "${CUDNN_TAR_FILE}"
```
3. Manually copy the relevant files into the CUDA folder (the version number in the path must match the installed CUDA)
```
sudo cp -P cuda/include/cudnn*.h /usr/local/cuda-11.4/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-11.4/lib64/
sudo chmod a+r /usr/local/cuda-11.4/lib64/libcudnn*
```
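
After copying the files, the installed cuDNN version can be read back from `cudnn_version.h`, which defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL` in cuDNN 8. The following is a small sketch under that assumption; the helper names are hypothetical and the header path is the `cuda-11.4` path used in the copy commands above.

```python
import re
from pathlib import Path


def parse_cudnn_version(header_text):
    """Return 'MAJOR.MINOR.PATCHLEVEL' from the text of cudnn_version.h, or None."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^\s*#define\s+{name}\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            return None  # header layout differs from what this sketch expects
        parts.append(match.group(1))
    return ".".join(parts)


def read_cudnn_version(header_path):
    """Read the header at header_path; return None if the file does not exist."""
    path = Path(header_path)
    if not path.is_file():
        return None
    return parse_cudnn_version(path.read_text(errors="ignore"))


# Path taken from the copy commands above; adjust the CUDA version as needed.
print(read_cudnn_version("/usr/local/cuda-11.4/include/cudnn_version.h"))
```

A result of `None` suggests the header was not copied to the expected folder, which is worth fixing before moving on to the sample test.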
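### Method 2: run the deb installers (failed)
1. Go to the [cuDNN archive](https://developer.nvidia.com/rdp/cudnn-archive) and pick the cuDNN build that corresponds to the installed CUDA version (download the 3 deb installers for that version)
2. Install them with `sudo dpkg -i`
For example: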
```
sudo dpkg -i cudnn-local-repo-cross-sbsa-ubuntu2004-8.9.7.29_1.0-1_all.deb
sudo dpkg -i cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_arm64.deb
sudo dpkg -i cudnn-local-repo-ubuntu2004-8.9.7.29_1.0-1_amd64.deb
```

### Test cuDNN
* Copy the cuDNN samples to a working directory, here `$HOME` (the sample path differs between cuDNN versions, so adjust it accordingly)
For example:
```
cp -r /usr/src/cudnn_samples_v8 "$HOME"
```

* Change into the mnistCUDNN folder
```
cd $HOME/cudnn_samples_v8/mnistCUDNN/
```
* Build the mnistCUDNN sample
```
make clean && make
```
* Run the built binary (from inside the mnistCUDNN folder)
```
./mnistCUDNN
```

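Only a run that reports the test as passed counts as a success.
:::warning
Sometimes cuDNN comes without the sample files; in that case TensorFlow can be used for the check:
```python
import tensorflow as tf

# An empty list means TensorFlow did not detect any GPU.
print("Available GPUs:", tf.config.list_physical_devices('GPU'))
```
If a GPU is listed, it was detected successfully, which these notes take as a sign that cuDNN is usable.

:::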
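## IV. Install PyTorch
1. Create a new Python virtual environment, then go to the PyTorch website ([current version](https://pytorch.org/) or [previous versions](https://pytorch.org/get-started/previous-versions/))
2. Install a PyTorch build that is compatible with the installed CUDA version
3. Check that PyTorch can use the GPU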
```python
import torch
print(torch.cuda.is_available())  # True means PyTorch can use the GPU
```
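
For the history records further down, it helps to capture more than a single `True`/`False`. The sketch below collects the PyTorch release, its build tag (for example `cu118`), the CUDA version it was built for, and the detected device. The function names are hypothetical; `torch.version.cuda` and `torch.cuda.get_device_name` are existing PyTorch attributes. The script returns `None` instead of failing when PyTorch is not installed in the active environment.

```python
import importlib.util


def split_torch_version(version):
    """Split e.g. '2.0.0+cu118' into ('2.0.0', 'cu118'); tag is None if absent."""
    release, _, tag = version.partition("+")
    return release, (tag or None)


def torch_gpu_report():
    """Return a dict describing the installed PyTorch build, or None if missing."""
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed in this environment
    import torch

    release, tag = split_torch_version(torch.__version__)
    available = torch.cuda.is_available()
    return {
        "release": release,
        "build_tag": tag,
        "built_for_cuda": torch.version.cuda,
        "cuda_available": available,
        "device": torch.cuda.get_device_name(0) if available else None,
    }


print(torch_gpu_report())
```

If `cuda_available` is `False` while `built_for_cuda` is set, the usual suspects in these notes are a driver that is too old for the build or a CPU-only wheel installed by mistake.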

## V. Other Ubuntu 20.04 installs
### Anaconda
1. Make sure `curl` is installed
```
sudo apt install curl
```
2. Download the Anaconda installer and verify that its checksum matches the one published on the official site
```
curl -O https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh
sha256sum Anaconda3-2021.05-Linux-x86_64.sh
```

3. Run the installer
```
bash Anaconda3-2021.05-Linux-x86_64.sh
```
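* Press ENTER to read the license terms
* Accept the license terms
* Install to the default location
* Finish the installation
4. Add Anaconda to the environment variables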
```
vi ~/.bashrc
```
* Append the following line to the end of the file and save: `export PATH="$HOME/anaconda3/bin:$PATH"`
* Reload the file after saving: `source ~/.bashrc`
* Check the conda version: `conda --version`
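
The checksum comparison in step 2 is easy to get wrong by eye. As a sketch, the same check can be done in Python with the standard `hashlib` module; `sha256_of_file` and `checksum_matches` are hypothetical helper names, and the expected hash has to be copied from the official Anaconda site (it is deliberately not filled in here).

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Compare the file's digest with the published one, ignoring case and spaces."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (paste the hash published for this installer in place of the placeholder):
# checksum_matches("Anaconda3-2021.05-Linux-x86_64.sh", "<published sha256>")
```

Only run the installer when the comparison returns `True`; a mismatch usually means an incomplete download.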

### C++/Libtorch
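1. Download libtorch from the [PyTorch website](https://pytorch.org/get-started/locally/) (I downloaded the [CPU version](https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-2.3.1%2Bcpu.zip))

> Oddly, builds for most CUDA versions can be found through the link pattern below, but there is none for the CUDA 11.4 that I installed.
> https://download.pytorch.org/libtorch/cu116 (change the version number after `cu`)
2. Extract the downloaded archive and place the extracted `libtorch` folder under the home directory (not in a system folder, where permissions become a hassle)

3. Create a new folder containing two files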
* example-app.cpp
```
#include <torch/torch.h>
#include <iostream>

int main() {
  // Create a 2x3 tensor of uniform random values and print it.
  torch::Tensor tensor = torch::rand({2, 3});
  std::cout << tensor << std::endl;
}
```
* CMakeLists.txt
```
cmake_minimum_required(VERSION 3.15)
project(test_pytorch)
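set(CMAKE_CXX_STANDARD 17) # setting this to C++11 may cause errors
set(Torch_DIR "$ENV{HOME}/libtorch/share/cmake/Torch") # TODO: adjust to the libtorch location
# Add libtorch to CMAKE_PREFIX_PATH
list(APPEND CMAKE_PREFIX_PATH "$ENV{HOME}/libtorch") # TODO: adjust to the libtorch location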
find_package(Torch REQUIRED)
message(STATUS "Torch library status:")
message(STATUS " include path: ${TORCH_INCLUDE_DIRS}")
message(STATUS " lib path : ${TORCH_LIBRARIES} ")
add_executable(example-app example-app.cpp)
target_link_libraries(example-app "${TORCH_LIBRARIES}")
```
4. Change into the folder that holds `example-app.cpp` and `CMakeLists.txt`, then run the following commands to generate the build files and compile
```
mkdir build
cd build
cmake ..
make
```

The build succeeded.
## VI. Installation history
### Old PC: RTX 2070
Current state: **last updated on 20240823**
* cuda version 11.8.89: `nvcc --version`

* nvidia driver version 470.256.02: `nvidia-smi`

* python version 3.8.19: `python --version`

* pytorch version 2.0.0+cu118: `print(torch.__version__)`

* cudnn version 8.2.4

History:
* 20240530 [CUDA 11.6](https://developer.nvidia.com/cuda-11-6-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local) (For R3LIVE python, <font color = "red">FAIL</font>: mismatch)
* 20240531 [CUDA 11.4](https://developer.nvidia.com/cuda-11-4-1-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local) (For R3LIVE python, <font color = "red">FAIL</font>: mismatch)
* 20240601 [CUDA 12.0](https://developer.nvidia.com/cuda-12-0-1-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local) (For R3LIVE python, runfile <font color = "#19A519">**SUCCESS**</font> [but the CUDA version shown by `nvidia-smi` is 11.4] -> <font color = "red">FAIL</font> for pytorch: `torch.cuda.is_available() = False`)
* 20240601 [CUDA 11.4](https://developer.nvidia.com/cuda-11-4-1-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=runfile_local) (For R3LIVE python, runfile <font color = "#19A519">**SUCCESS**</font>)
* 20240601 CUDA 12.0 + nvidia-driver-470 (highest CUDA version 11.4) + [cuDNN](https://developer.nvidia.com/cudnn-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local) (For R3LIVE python, <font color = "red">FAIL</font>)
* 20240601 nvidia-driver-470.239.06 (`sudo apt install nvidia-driver-470`) (highest CUDA version 11.4) + [CUDA 11.4.0](https://developer.nvidia.com/cuda-11-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=runfile_local) + python3.8 + [pytorch2.2.2 for cuda-11.8](https://hackmd.io/_uploads/Hynz8Fu4R.png) (`conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=11.8 -c pytorch -c nvidia`) (For R3LIVE python, <font color = "#19A519">**SUCCESS**</font>) [cuDNN not yet installed]
* 20240601 nvidia-driver-470.239.06 (`sudo apt install nvidia-driver-470`) (highest CUDA version 11.4) + [CUDA 11.4.0](https://developer.nvidia.com/cuda-11-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=runfile_local) + python3.8 + [pytorch2.2.2 for cuda-11.8](https://hackmd.io/_uploads/Hynz8Fu4R.png) (`conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=11.8 -c pytorch -c nvidia`) + cuDNN v8.9.7 (December 5th, 2023), for CUDA 11.x (<font color = "red">FAIL</font>)
* ==20240601 nvidia-driver-470.239.06 (`sudo apt install nvidia-driver-470`) (highest CUDA version 11.4) + [CUDA 11.4.0](https://developer.nvidia.com/cuda-11-4-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=runfile_local) + python3.8 + [pytorch2.2.2 for cuda-11.8](https://hackmd.io/_uploads/Hynz8Fu4R.png) (`conda install pytorch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 pytorch-cuda=11.8 -c pytorch -c nvidia`) + [cuDNN v8.2.4 (September 2nd, 2021), for CUDA 11.4](https://hackmd.io/_uploads/SySMS9OVA.png) (final <font color = "#19A519">**SUCCESS**</font>)==
### New PC: RTX 3060
Current state: last updated on 20240823
* cuda version 11.8.89: `nvcc --version`

* nvidia driver version 535.183.01: `nvidia-smi`

* python version 3.8.19: `python --version`

* pytorch version 2.0.0+cu118: `print(torch.__version__)`


* cudnn: v8.9.6

History:
* 20240926 nvidia-driver-535.183.01 + CUDA 11.8.89 + cuDNN v8.9.6 (November 1st, 2023), for CUDA 11.x [deb install] (<font color = "red">FAIL</font>)
* 20240926 nvidia-driver-535.183.01 + CUDA 11.8.89 + [cuDNN v8.9.6 (November 1st, 2023), for CUDA 11.x [tar unzip] ](https://hackmd.io/_uploads/BJ1g7M9eyl.png) (final <font color = "#19A519">**SUCCESS**</font>)
* ==20240926 nvidia-driver-535.183.01 + pytorch 2.4.1+cu118 for cuda 11-8(`pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`) + CUDA 11.8.89 + [cuDNN v8.9.6 (November 1st, 2023), for CUDA 11.x [tar unzip] ](https://hackmd.io/_uploads/BJ1g7M9eyl.png) (final <font color = "#19A519">**SUCCESS**</font>)==
* ==20241101 nvidia-driver-535.183.01 + pytorch 2.0.0+cu118 for cuda 11-8 (`pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118`) + CUDA 11.8.89 + cuDNN v8.9.6 (November 1st, 2023), for CUDA 11.x [tar unzip] (final **SUCCESS**)==
### NCHC (國網) machine
Current state: last updated on 20241019
* cuda version 11.5.119: `nvcc --version`

* nvidia driver version 550.90.07: `nvidia-smi`

* python version 3.10.12: `python --version`

* pytorch version 1.11.0+cu115: `print(torch.__version__)`

History:
* ==20241019 nvidia-driver-550.90.07 (installed by NCHC) + CUDA 11.5.119 (installed by NCHC) + python3.10.12 + [pytorch1.11.0 for cuda-11.5](https://hackmd.io/_uploads/Hynz8Fu4R.png) (`pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu115`) (final <font color = "#19A519">**SUCCESS**</font>)==
### CVIT Lab machine
Current state: last updated on 20250317
* cuda version 12.0.76: `nvcc --version`

* nvidia driver version 525.85.05: `nvidia-smi`

* python version: `python --version`

* pytorch version: `print(torch.__version__)`

History:
* 20250205 nvidia-driver-525.85.05 + CUDA 12.0.76 + python 3.8.5 + pytorch2.4.1 for cuda-11.8 (`pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu118`) (final <font color = "#19A519">**SUCCESS**</font>)
* ==20250205 nvidia-driver-525.85.05 + CUDA 12.0.76 + python 3.8.5 + pytorch2.0.0 for cuda-11.8 (`pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118`) (final <font color = "#19A519">**SUCCESS**</font>)==
* ==20250317 [LOCAL] nvidia-driver-525.85.05 + CUDA 12.0.76 + local CUDA 11.3.0 (`sudo sh cuda_11.3.0_465.19.01_linux.run --toolkit --silent --override --installpath=/usr/local/cuda-11.3`) + python 3.8.5 + pytorch1.12.1 for cuda-11.3 (`pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113`) (final <font color = "#19A519">**SUCCESS**</font>)==
:::info
After installing a local CUDA version, run the following commands to switch to it (and switch back when done):
```
export PATH=/usr/local/cuda-11.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64:$LD_LIBRARY_PATH
```
:::
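
One way to avoid having to switch back is to apply the local CUDA paths to a single child process only. The sketch below builds such an environment in Python; `cuda_env` is a hypothetical helper, and `/usr/local/cuda-11.3` is the install path from the runfile command above. The shell that launches the script keeps its original `PATH` and `LD_LIBRARY_PATH`.

```python
import os


def cuda_env(cuda_root, base_env=None):
    """Return a copy of the environment with cuda_root's bin and lib64 listed first."""
    env = dict(os.environ if base_env is None else base_env)
    bin_dir = os.path.join(cuda_root, "bin")
    lib_dir = os.path.join(cuda_root, "lib64")
    # Skip empty entries so no stray separator ends up in the search paths.
    env["PATH"] = os.pathsep.join(p for p in (bin_dir, env.get("PATH", "")) if p)
    env["LD_LIBRARY_PATH"] = os.pathsep.join(
        p for p in (lib_dir, env.get("LD_LIBRARY_PATH", "")) if p
    )
    return env


# Small demonstration with a made-up base environment.
env = cuda_env("/usr/local/cuda-11.3", {"PATH": "/usr/bin"})
print(env["PATH"])
print(env["LD_LIBRARY_PATH"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run`, for example `subprocess.run(["nvcc", "-V"], env=cuda_env("/usr/local/cuda-11.3"))`, so only that command sees the local CUDA 11.3.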
### CVIT Lab RTX5090
Current state: last updated on 20250317
History:
* 20250317 `pip3 install torch torchvision torchaudio`
# References
* [cuda vs nvidia driver : version compare](https://docs.nvidia.com/deploy/cuda-compatibility/index.html)
* [teamviewer install](https://linuxize.com/post/how-to-install-teamviewer-on-ubuntu-20-04/)
* [completely remove nvidia driver](https://askubuntu.com/questions/206283/how-can-i-uninstall-a-nvidia-driver-completely)
* [nvidia related install (consulted earlier when setting up dual monitors)](https://hackmd.io/@NyCNJlzoT4ujXvXGi_HlfA/rk0IppWqK)
* [Which pytorch version >2.0.1 support cuda 11.4](https://discuss.pytorch.org/t/which-pytorch-version-2-0-1-support-cuda-11-4/190446)
* [Pytorch Installation CUDA 11.4](https://discuss.pytorch.org/t/pytorch-installation-cuda-11-4/200160)
* [How to remove CUDA completely from Linux?](https://netraneupane.medium.com/how-to-remove-cuda-completely-from-linux-72a9b0edca53)
* [linux下CUDA编译报错:fatal error: cudnn_version.h: No such file or directory](https://blog.csdn.net/ning_yi/article/details/119385309)
* [NVIDIA Driver, CUDA 11.4, cuDNN v8.2.4 installation on Ubuntu 20.04](https://github.com/ashutoshIITK/install_cuda_cudnn_ubuntu_20)