Running PyTorch/TensorFlow in Docker: only the NVIDIA driver needs to be installed
===
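Using a GPU on Linux requires installing many components: the NVIDIA driver, CUDA, cuDNN, conda, Python, PyTorch, torchvision, and so on. A version mismatch in any one of them breaks the setup, and on a desktop Ubuntu with a graphical interface the driver often breaks and X fails to start.
To avoid this hassle, why not use Docker? Install only the NVIDIA driver on the host and let Docker handle everything else.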
# 1. Install the NVIDIA driver
* First add the NVIDIA graphics drivers PPA
```
sudo add-apt-repository ppa:graphics-drivers/ppa
```
* If apt reports a missing key, add NVIDIA's GPG key first.
```
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
```
* Update the package sources
```
sudo apt-get update
```
* Install the driver
```
sudo apt-get install nvidia-430
```
* Check that the installation succeeded
```
$ nvidia-smi
Mon Jul 29 15:06:39 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 30%   41C    P0    58W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:41:00.0 Off |                  N/A |
| 36%   44C    P0     1W / 250W |      0MiB / 11011MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
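The full `nvidia-smi` table is fine for a person to read, but awkward to check from a script. A minimal sketch of a script-friendly check follows; `check_driver` is a hypothetical helper name, and the list of queried fields is just one reasonable choice.

```shell
# check_driver: print one CSV line per GPU, or explain why that is not possible.
# (check_driver is a made-up helper name, not part of any NVIDIA tool.)
check_driver() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: the NVIDIA driver is probably not installed" >&2
        return 1
    fi
    # --query-gpu and --format=csv are standard nvidia-smi options
    nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv
}
```

Calling `check_driver` after the installation should list both cards with driver version 430.26 if everything went well.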
# 2. Install Docker Community Edition
**Environment**
* Ubuntu 16.04
* NVIDIA GPU
```
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
```
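The `usermod -aG docker` step only takes effect in new login sessions, so `docker` may still ask for `sudo` until you log out and back in. A small sketch to confirm that it works without `sudo`; the helper names are hypothetical, and `hello-world` is Docker's public test image.

```shell
# True when the current session already carries the docker group.
in_docker_group() {
    id -nG | tr ' ' '\n' | grep -qx docker
}

# Run the hello-world image without sudo, or explain what is still missing.
docker_smoke_test() {
    if [ "$(id -u)" -eq 0 ] || in_docker_group; then
        docker run --rm hello-world
    else
        echo "not in the docker group yet: log out and back in, or run 'newgrp docker'"
    fi
}
```

Run `docker_smoke_test` in a fresh terminal after installing Docker.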
**Check the `docker` version**
```
$ docker version
Client:
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        481bc77
 Built:             Sat May 4 02:35:27 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.0
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.5
  Git commit:       aeac949
  Built:            Wed Jul 17 18:14:42 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.5
  GitCommit:        bb71b10fd8f58240ca47fbb579b9d1028eea7c84
 runc:
  Version:          1.0.0-rc6+dev
  GitCommit:        2b18fe1d885ee5083ef9f0838fee39b62d653e30
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
```
# 3. Install `nvidia-docker`
**Install `nvidia-docker` first**
```
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```
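Once Docker has restarted, it is worth confirming that a container can actually see the GPUs. A minimal sketch, assuming Docker 19.03 or later (which has the `--gpus` flag) and that the `nvidia/cuda:9.0-base` image is still available to pull; `gpu_smoke_test` is a hypothetical helper name.

```shell
# gpu_smoke_test [IMAGE]: run nvidia-smi inside a throwaway container.
# The default image is an assumption; pass any CUDA-enabled image you have.
gpu_smoke_test() {
    image=${1:-nvidia/cuda:9.0-base}
    docker run --rm --gpus all "$image" nvidia-smi
}
```

If `gpu_smoke_test` prints the same GPU table as on the host, the container runtime is wired up correctly.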
**Check the `nvidia-docker` version**
```
$ nvidia-docker version
NVIDIA Docker: 2.0.3
Client:
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        481bc77
 Built:             Sat May 4 02:35:27 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.0
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.5
  Git commit:       aeac949
  Built:            Wed Jul 17 18:14:42 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.5
  GitCommit:        bb71b10fd8f58240ca47fbb579b9d1028eea7c84
 runc:
  Version:          1.0.0-rc6+dev
  GitCommit:        2b18fe1d885ee5083ef9f0838fee39b62d653e30
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
```
# 4. Pull a CUDA 10.0 Docker image with all packages preinstalled
```
docker pull moeidb/aigo:cu10.0-dnn7.6-gpu-pytorch-cv-19.06
```
Check that the download succeeded:
```
$ docker images
REPOSITORY    TAG                               IMAGE ID       CREATED       SIZE
moeidb/aigo   cu10.0-dnn7.6-gpu-pytorch-19.06   492bce9e825f   3 weeks ago   17GB
nvidia/cuda   9.0-base
```
Check the Python version inside the container:
```
$ docker run --rm moeidb/aigo:cu10.0-dnn7.6-gpu-pytorch-19.06 python3 --version
Python 3.7.3
```
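The Python version alone does not tell you whether PyTorch can reach the GPU. A hedged sketch of a further check, assuming the image ships `torch` (its tag suggests so) and that Docker 19.03's `--gpus` flag is available; `nvidia-docker run` should behave equivalently. `torch_cuda_check` is a hypothetical helper name.

```shell
# torch_cuda_check [IMAGE]: print the PyTorch version and whether CUDA is usable.
# Assumes the image contains python3 with torch installed.
torch_cuda_check() {
    image=${1:-moeidb/aigo:cu10.0-dnn7.6-gpu-pytorch-19.06}
    docker run --rm --gpus all "$image" \
        python3 -c 'import torch; print(torch.__version__, torch.cuda.is_available())'
}
```

The second value printed by `torch_cuda_check` should be `True` when the driver, the container runtime, and the image's CUDA build all line up.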
# 5. Create a script that launches Jupyter Lab
* Create a script file `startj.sh` with the following content
```
#!/bin/bash
# Host port on which Jupyter Lab will be reachable
host_port=9999
# Start the container in the background and capture its ID
container_id=$(nvidia-docker run --rm -d --ipc=host -p ${host_port}:8888 -v "$PWD":/workspace moeidb/aigo:cu10.0-dnn7.6-gpu-pytorch-cv-19.06)
# Wait a moment for the service inside the container to start
sleep 2
# Extract the Jupyter Lab token from the container logs (first match only)
notebook_token=$(docker logs ${container_id} 2>&1 | grep -oP "LabApp.*token=\K\S+" | head -n 1)
# Print the URL for connecting to Jupyter Lab
printf "Open a browser and connect to:\n"
printf "http://127.0.0.1:%s/?token=%s\n" "${host_port}" "${notebook_token}"
```
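* Make the script executable: `chmod +x startj.sh`
* Run `./startj.sh`; it prints a URL, which is the address of Jupyter Lab
* After you open it, Jupyter Lab appears as shown in the figure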
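The fixed `sleep` in `startj.sh` can be too short on a slow machine, in which case the printed URL has an empty token. One possible alternative is to poll the container logs until the token shows up. The sketch below is only an illustration: `wait_for_token` is a hypothetical helper, and it assumes the log line contains `token=` followed by an alphanumeric string.

```shell
# wait_for_token CONTAINER_ID [TIMEOUT_SECONDS]
# Polls `docker logs` once per second and prints the first token it finds.
wait_for_token() {
    container_id=$1
    timeout=${2:-30}
    i=0
    while [ "$i" -lt "$timeout" ]; do
        # Keep only the alphanumeric run that follows "token=" on a matching line
        token=$(docker logs "$container_id" 2>&1 \
            | sed -n 's/.*token=\([0-9A-Za-z]*\).*/\1/p' \
            | head -n 1)
        if [ -n "$token" ]; then
            printf '%s\n' "$token"
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    echo "no Jupyter token found after ${timeout}s" >&2
    return 1
}
```

In the script this could replace the `sleep` and the `grep` pipeline, for example `notebook_token=$(wait_for_token "${container_id}" 30)`.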

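# 6. Notes
* Once the container is running, files written to this directory (mounted at `/workspace`) are owned by `root`, so be careful.
* Every time the container is restarted, anything installed from Jupyter Lab with `!pip install` has to be installed again.
* Working on these files from a regular local conda environment will run into problems; change their ownership back to your own user first.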
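To hand root-owned files back to your own user, something along these lines should work. It is a sketch with hypothetical helper names; `sudo` is needed because the files belong to `root`.

```shell
# List files under a directory that are owned by root (default: current directory).
list_root_owned() {
    find "${1:-.}" -user root
}

# Give everything under the directory back to the invoking user and group.
reclaim_ownership() {
    sudo chown -R "$(id -u):$(id -g)" "${1:-.}"
}
```

Run `list_root_owned .` in the project directory to see what the container created, then `reclaim_ownership .` before going back to the local conda environment.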
###### tags: `pytorch` `docker` `tensorflow`