# 2022.02 VAE Model Training Environment Setup
###### tags: `Training_Log` `Environment` `Log`
## Environment and Background
### Model Version
- [VSC](https://github.com/YunghuiHsu/Moth_Project/blob/main/Moth_thermal/vsc/utils/networks.py)
- The autoencoder variant is [Variational Sparse Coding](https://github.com/ftonolini45/Variational_Sparse_Coding)
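### Checking the CUDA and cuDNN Versions
- [Checking CUDA and cuDNN versions on Linux and Windows](https://www.cnblogs.com/wuliytTaotao/p/11453265.html)
- Check CUDA
  - `nvcc --version` or `cat /usr/local/cuda/version.txt` (newer CUDA 11.x toolkits ship `version.json` instead of `version.txt`)
- Check cuDNN
  - `cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2` (for cuDNN 8 and later, the version macros are in `cudnn_version.h`, so grep that file instead)
- With PyTorch: `python -m torch.utils.collect_env`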
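The same information can also be read from inside Python. The following is a minimal sketch, assuming PyTorch is importable in the active environment; the helper name `report_versions` is made up for illustration. It reports what PyTorch was built against, which is not necessarily what `nvcc` reports.

```python
def report_versions():
    """Collect the versions PyTorch itself reports, if PyTorch is importable."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        return {"torch": None}
    return {
        "torch": torch.__version__,                    # e.g. 1.8.0+cu111
        "cuda_build": torch.version.cuda,              # CUDA used to build PyTorch
        "cudnn": torch.backends.cudnn.version(),       # integer such as 8101, or None
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(f"{key}: {value}")
```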
### Basic Concepts: CUDA Runtime vs. Driver
- ==[[CUDA] Why do nvcc and nvidia-smi show different versions?](https://www.jianshu.com/p/eb5335708f2a)==
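In short, `nvidia-smi` shows the highest CUDA version the installed driver supports, while `nvcc` shows the version of the installed toolkit (runtime). A newer driver can generally run an older runtime, but not the other way round. The sketch below encodes only that rule of thumb; the function names are hypothetical and minor-version compatibility details are ignored.

```python
def parse_version(text):
    """Turn a version string such as '11.4' or '9.0.176' into (major, minor)."""
    parts = text.strip().split(".")
    return int(parts[0]), int(parts[1])


def driver_supports(driver_cuda, runtime_cuda):
    """True if the driver's maximum CUDA version is at least the runtime's."""
    return parse_version(driver_cuda) >= parse_version(runtime_cuda)


# Driver reporting CUDA 11.4 with a PyTorch build for CUDA 11.1
print(driver_supports("11.4", "11.1"))
# A hypothetical driver limited to CUDA 9.0 with the same build
print(driver_supports("9.0", "11.1"))
```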
#### torch.utils.collect_env
:::spoiler torch.utils.collect_env
```bash=
jovyan@jupyter$ python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.8.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX 1080 Ti
GPU 1: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 470.57.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.8.0+cu111
[pip3] torchaudio==0.8.0
[pip3] torchvision==0.9.0+cu111
[conda] cudatoolkit 10.0.130 0 anaconda
[conda] numpy 1.21.2 py37h31617e3_0 conda-forge
[conda] torch 1.8.0+cu111 pypi_0 pypi
[conda] torchaudio 0.8.0 pypi_0 pypi
[conda] torchvision 0.9.0+cu111 pypi_0 pypi
```
:::
---
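## Problem Summary
### cuDNN operations raise an error on as gpu with PyTorch 1.8 and Python 3.7
- as gpu environment: pytorch1.8.0+cu111, python3.7
- CUDA version is 11.2 (higher than the officially recommended 11.1); the cuDNN version could not be found
- `torch.utils.collect_env` output
  - CUDA used to build PyTorch: 11.1
  - CUDA runtime version: Could not collect
  - libcudnn.so.8.1.1
- Installed cudatoolkit=10.2 (note that the `collect_env` output above lists `cudatoolkit 10.0.130`).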
```bash=
user$ nvcc --version
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
user$ cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
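# /usr/local/cuda/include contains no cudnn.h
# `whereis cudnn.h` locates the file in /usr/include/
# grep finds no CUDNN_MAJOR there, but that does not prove cuDNN is missing:
# since cuDNN 8 the version macros are defined in cudnn_version.h, not cudnn.h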
```
- Officially recommended install command
```bash=
# CUDA 11.1
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
```
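For the failed `CUDNN_MAJOR` lookup above, a small parser makes the check repeatable. This is a sketch only: the helper name is made up, and the path `/usr/include/cudnn_version.h` in the usage note is an assumption based on where `cudnn.h` was found.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from cuDNN header text, or None if absent."""
    found = {}
    for name in ("MAJOR", "MINOR", "PATCHLEVEL"):
        pattern = r"^[ \t]*#define[ \t]+CUDNN_%s[ \t]+(\d+)" % name
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            # The macro is not defined in this header.
            return None
        found[name] = int(match.group(1))
    return found["MAJOR"], found["MINOR"], found["PATCHLEVEL"]


sample = "#define CUDNN_MAJOR 8\n#define CUDNN_MINOR 1\n#define CUDNN_PATCHLEVEL 1\n"
print(parse_cudnn_version(sample))
```

To run it on a real header, read the file first, for example `parse_cudnn_version(open("/usr/include/cudnn_version.h").read())`. A `None` result for `cudnn.h` combined with a version tuple for `cudnn_version.h` would indicate cuDNN 8 or later.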
### Error Message
- RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
>You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
- The test snippet suggested in the error message runs normally and does not trigger the error.
#### Suggested Test Snippet
:::spoiler
```python=
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([42, 128, 32, 32], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(128, 128, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x7fb8aa8b07d0
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 42, 128, 32, 32,
strideA = 131072, 1024, 32, 1,
output: TensorDescriptor 0x7fb8e50989a0
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 42, 128, 32, 32,
strideA = 131072, 1024, 32, 1,
weight: FilterDescriptor 0x7fb8e49ac9f0
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 128, 128, 3, 3,
Pointer addresses:
input: 0x7fb5b9800000
output: 0x7fb5b6e00000
weight: 0x7fb496d11c00
Additional pointer addresses:
grad_output: 0x7fb5b6e00000
grad_weight: 0x7fb496d11c00
Backward filter algorithm: 5
```
:::
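#### Suspected Cause
- The CUDA version PyTorch was built with does not match the CUDA runtime in the environment, which causes a conflict (CUDA version mismatch)
  - CUDA used to build PyTorch: 11.1
  - CUDA runtime version: Could not collect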
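The comparison above can be automated on saved `collect_env` output. This is a rough sketch with hypothetical function names; it compares major versions only, which is a simplification, and treats a runtime of "Could not collect" as unknown rather than as a mismatch.

```python
import re


def cuda_versions(report):
    """Extract the build and runtime CUDA versions from collect_env output."""
    def field(label):
        pattern = r"^%s:[ \t]*(.+)$" % re.escape(label)
        match = re.search(pattern, report, re.MULTILINE)
        return match.group(1).strip() if match else None
    return field("CUDA used to build PyTorch"), field("CUDA runtime version")


def describe(report):
    """Classify the relationship between build and runtime CUDA versions."""
    build, runtime = cuda_versions(report)
    if not build or not runtime:
        return "unknown"
    if not (build[:1].isdigit() and runtime[:1].isdigit()):
        # e.g. 'Could not collect' or 'N/A'
        return "unknown"
    if build.split(".")[0] == runtime.split(".")[0]:
        return "same major version"
    return "major version mismatch"


failing = "CUDA used to build PyTorch: 11.1\nCUDA runtime version: Could not collect\n"
working = "CUDA used to build PyTorch: 9.2\nCUDA runtime version: 9.0.176\n"
print(describe(failing))
print(describe(working))
```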
---
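## Solutions
### Manually installing cudatoolkit=11.1 (failed due to version conflicts)
- The conflicting version could not be uninstalled.
### Upgrading directly to PyTorch 1.8.1 (failed)
- The `RuntimeError: cuDNN error` still occurs.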
```bash=
# pip: CUDA 11.1 wheels
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
# conda: cudatoolkit 11.3
conda install pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.1 cudatoolkit=11.3 -c pytorch -c conda-forge
```
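### Starting from a PyTorch 1.4 + CUDA 9.0 + Python 3.6 environment, then upgrading to 1.7.1 (succeeded)
- The model was previously trained with 1.7.x, and loading the `.pth` file under 1.4 raises an error, so PyTorch was upgraded to the highest version that supports CUDA 9.
#### Installing PyTorch 1.7.1
- Check the CUDA version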
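The log does not record the exact loading error. One plausible cause (an assumption, not confirmed here) is that PyTorch 1.6 and later save checkpoints in a zip-based format by default, which earlier releases may not be able to read. A quick way to see which format a file uses, sketched with a hypothetical helper name and file name:

```python
import zipfile


def is_zip_checkpoint(path):
    """True if the file at `path` is a zip archive (the PyTorch 1.6+ default format)."""
    return zipfile.is_zipfile(path)


if __name__ == "__main__":
    import sys
    # Usage: python check_ckpt.py model.pth  (script and file names are placeholders)
    for ckpt in sys.argv[1:]:
        print(ckpt, "zip-based" if is_zip_checkpoint(ckpt) else "legacy or other format")
```

If the file is zip-based and the old PyTorch has to stay, re-saving from a newer PyTorch with `torch.save(obj, path, _use_new_zipfile_serialization=False)` is an alternative to upgrading.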
```bash=
# bash terminal
jovyan@user:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
```
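- Install with pip rather than conda to avoid the flood of conflicts raised during dependency resolution
- Choose the CUDA 9.2 build (the environment provided by as gpu is CUDA 9.0)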
```bash=
# CUDA 9.2
pip install torch==1.7.1+cu92 torchvision==0.8.2+cu92 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
```
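- After installation, check the environment and the cuDNN version
  - CUDA used to build PyTorch: 9.2 and CUDA runtime version: 9.0.176 are both 9.x
  - cudatoolkit: the 9.2 build was requested, but 10.0 is what is actually installed
- Driver information from nvidia-smi
  - ` Driver Version: 470.57.02 CUDA Version: 11.4`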
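The `+cu92` and `+cu111` suffixes in the pip commands in this log follow a simple pattern: `cu` followed by the CUDA major and minor version digits. A sketch that rebuilds the install command from version numbers is shown below. The function names are made up, and not every version combination is actually published, so the generated command is only a starting point to check against the official list.

```python
def wheel_tag(cuda_version):
    """Map a CUDA version such as '9.2' to the wheel suffix 'cu92'."""
    major, minor = cuda_version.split(".")[:2]
    return "cu" + major + minor


def pip_command(torch_ver, vision_ver, audio_ver, cuda_version):
    """Build the pip install line for a given set of versions."""
    tag = wheel_tag(cuda_version)
    return (
        "pip install torch=={t}+{tag} torchvision=={v}+{tag} torchaudio=={a} "
        "-f https://download.pytorch.org/whl/torch_stable.html"
    ).format(t=torch_ver, v=vision_ver, a=audio_ver, tag=tag)


print(pip_command("1.7.1", "0.8.2", "0.7.2", "9.2"))
```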
##### Inspecting the environment with `python -m torch.utils.collect_env`
:::spoiler torch.utils.collect_env
```bash=
jovyan@user:~$ python3 -m torch.utils.collect_env
Is debug build: False
CUDA used to build PyTorch: 9.2
OS: Ubuntu 16.04.5 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX 1080 Ti
GPU 1: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 470.57.02
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.2.1
Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] torch==1.7.1+cu92
[pip] torchaudio==0.7.2
[pip] torchvision==0.8.2+cu92
[conda] cudatoolkit 10.0.130 0 anaconda
[conda] mkl 2020.0 166 defaults
[conda] numpy 1.18.1 py36h7314795_1 conda-forge
[conda] torch 1.7.1+cu92 pypi_0 pypi
[conda] torchaudio 0.7.2 pypi_0 pypi
[conda] torchvision 0.8.2+cu92 pypi_0 pypi
```
:::
##### Problem Analysis
- After installation, training the model raised this error:
  - `` ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.22' not found (required by /opt/conda/lib/python3.6/site-packages/scipy/fft/_pocketfft/pypocketfft.cpython-36m-x86_64-linux-gnu.so) ``
- ~~Appears on `import cv`; related to the opencv version~~
  - The opencv version is 3.5.x
- scipy version issue
  - [https://zhuanlan.zhihu.com/p/283537696](https://zhuanlan.zhihu.com/p/283537696)
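To confirm that the system `libstdc++.so.6` really lacks `GLIBCXX_3.4.22`, the version strings embedded in the library can be listed, much like `strings libstdc++.so.6 | grep GLIBCXX`. The sketch below uses hypothetical function names and scans raw bytes; the sample data is made up for illustration.

```python
import re


def glibcxx_versions(data):
    """Return the sorted GLIBCXX version tuples found in raw library bytes."""
    found = set(re.findall(rb"GLIBCXX_(\d+(?:\.\d+)+)", data))
    return sorted(tuple(int(part) for part in v.decode().split(".")) for v in found)


def provides(data, wanted):
    """True if the library bytes contain the symbol version GLIBCXX_<wanted>."""
    return tuple(int(part) for part in wanted.split(".")) in glibcxx_versions(data)


# Made-up sample bytes standing in for a library that stops at 3.4.21
sample = b"\x00GLIBCXX_3.4\x00GLIBCXX_3.4.20\x00GLIBCXX_3.4.21\x00"
print(glibcxx_versions(sample))
print(provides(sample, "3.4.22"))
```

On the real system, pass the bytes of the file named in the error message, for example `provides(open("/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "rb").read(), "3.4.22")`.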
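##### Fix (resolved)
- Change the scipy version
  - The default was 1.4.x
  - Force-remove the conda-installed scipy, then install it with pip
    `conda remove --force scipy`
    `pip install scipy`
  - pip automatically installed 1.5.4, which resolved the problem
- ~~Install opencv directly with pip~~
  - Installing it with conda causes version conflicts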
```bash=
pip install opencv-python
```