
Running PyTorch/TensorFlow in Docker: the only thing to install is the NVIDIA driver

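Using a GPU on Linux means installing a long list of components: the NVIDIA driver, CUDA, cuDNN, conda, Python, PyTorch, torchvision, and so on. A version mismatch in any one of them breaks the whole setup. On Ubuntu with a graphical desktop, the driver also tends to get knocked out, leaving X unable to start.

To avoid this trouble, why not use Docker? Install only the NVIDIA driver on the host system and leave everything else to Docker.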

1. Install the NVIDIA driver

  • First add the NVIDIA graphics-drivers PPA
sudo add-apt-repository ppa:graphics-drivers/ppa
  • If you get a missing-key error, add NVIDIA's key first.
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
  • Update the package sources
sudo apt-get update
  • Install the driver
sudo apt-get install nvidia-430
  • Check that the installation succeeded
$ nvidia-smi
Mon Jul 29 15:06:39 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 30%   41C    P0    58W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:41:00.0 Off |                  N/A |
| 36%   44C    P0     1W / 250W |      0MiB / 11011MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
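
To check the driver from a script rather than by eye, the version can be pulled out of the banner that nvidia-smi prints. The following is only a minimal sketch: `driver_version_from_banner` is a hypothetical helper name, and it assumes the banner keeps the `Driver Version: X.Y` layout shown in the output above.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the driver
# version found in the "Driver Version: X.Y" banner field.
driver_version_from_banner() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

Used as `nvidia-smi | driver_version_from_banner`, this prints just the version number. nvidia-smi can also report it directly with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, which avoids parsing the banner.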

2. Install Docker Community Edition

Environment

  • Ubuntu 16.04
  • NVIDIA GPU
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
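
The group change made by `usermod -aG` only applies to new login sessions, so `docker` may still require `sudo` until you log out and back in. As a rough sketch, the current session can be checked like this. `has_group` and `docker_group_status` are hypothetical helper names, not part of Docker.

```shell
# Hypothetical helper: succeed if group $2 appears in the space-separated
# group list $1 (the format printed by `id -nG`).
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report whether the current login session already has the docker group.
docker_group_status() {
    if has_group "$(id -nG)" docker; then
        echo "docker group active"
    else
        echo "log out and back in to activate the docker group"
    fi
}
```

Calling `docker_group_status` right after the install will usually print the second message. After logging in again it should report the group as active.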

Check the Docker version

$ docker version
Client:
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        481bc77
 Built:             Sat May  4 02:35:27 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.0
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.5
  Git commit:       aeac949
  Built:            Wed Jul 17 18:14:42 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.5
  GitCommit:        bb71b10fd8f58240ca47fbb579b9d1028eea7c84
 runc:
  Version:          1.0.0-rc6+dev
  GitCommit:        2b18fe1d885ee5083ef9f0838fee39b62d653e30
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

3. Install nvidia-docker

First install nvidia-docker:

# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
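
How GPUs are passed to a container depends on the Docker client version: the `--gpus` option of `docker run` exists from Docker 19.03 onward, while older setups go through the `nvidia` runtime that the nvidia-docker2 package registers. The sketch below picks the option from a version string. `gpu_flag` is a hypothetical helper, and the `--runtime=nvidia` branch assumes that runtime is actually registered on the host.

```shell
# Hypothetical helper: print the docker-run GPU option suited to a Docker
# client version string such as "18.09.6" or "19.03.0".
gpu_flag() {
    major=${1%%.*}          # text before the first dot
    rest=${1#*.}
    minor=${rest%%.*}       # text between the first and second dot
    minor=${minor#0}        # drop a leading zero so "09" is read as 9
    minor=${minor:-0}       # a bare "0" becomes empty above, so restore it
    if [ "$major" -gt 19 ] || { [ "$major" -eq 19 ] && [ "$minor" -ge 3 ]; }; then
        echo "--gpus all"
    else
        echo "--runtime=nvidia"   # assumption: the nvidia runtime is registered
    fi
}
```

A smoke test could then look like `docker run --rm $(gpu_flag "$(docker version --format '{{.Client.Version}}')") nvidia/cuda:9.0-base nvidia-smi`, assuming that image tag is available locally. The command substitution is left unquoted on purpose so that `--gpus all` is split into two arguments.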

Check the nvidia-docker version

$ nvidia-docker version
NVIDIA Docker: 2.0.3
Client:
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        481bc77
 Built:             Sat May  4 02:35:27 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.0
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.5
  Git commit:       aeac949
  Built:            Wed Jul 17 18:14:42 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.5
  GitCommit:        bb71b10fd8f58240ca47fbb579b9d1028eea7c84
 runc:
  Version:          1.0.0-rc6+dev
  GitCommit:        2b18fe1d885ee5083ef9f0838fee39b62d653e30
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

4. Pull a CUDA 10.0 Docker image with all packages preinstalled

docker pull moeidb/aigo:cu10.0-dnn7.6-gpu-pytorch-cv-19.06

Check that the download succeeded:

$ docker images
REPOSITORY          TAG                               IMAGE ID            CREATED             SIZE
moeidb/aigo         cu10.0-dnn7.6-gpu-pytorch-19.06   492bce9e825f        3 weeks ago         17GB
nvidia/cuda         9.0-base

Check the Python version inside the container

$ docker run --rm moeidb/aigo:cu10.0-dnn7.6-gpu-pytorch-19.06 python3 --version
Python 3.7.3
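
The Python version alone does not show whether PyTorch inside the container can reach the GPU. A quick way to find out is to call `torch.cuda.is_available()` in a throwaway container. This is a sketch under assumptions: `torch_gpu_check` and the `RUNNER` variable are hypothetical names introduced here, and the image is assumed to have `torch` importable from `python3`.

```shell
# Hypothetical helper: ask PyTorch inside image $1 whether it can see a GPU.
# RUNNER defaults to nvidia-docker; set RUNNER=echo to print the arguments
# instead of starting a container.
torch_gpu_check() {
    ${RUNNER:-nvidia-docker} run --rm "$1" \
        python3 -c 'import torch; print(torch.cuda.is_available())'
}
```

For example, `torch_gpu_check moeidb/aigo:cu10.0-dnn7.6-gpu-pytorch-cv-19.06` should print `True` when the driver and nvidia-docker are wired up correctly, and `False` otherwise.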

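5. Create a script that starts Jupyter Lab

  • Create a script named startj.sh with the following contents
#!/bin/bash
# Host port on which Jupyter Lab will be reachable
host_port=9999

# Start the container in the background and capture its ID
container_id=$(nvidia-docker run --rm -d --ipc=host -p ${host_port}:8888 -v "$PWD":/workspace moeidb/aigo:cu10.0-dnn7.6-gpu-pytorch-cv-19.06)

# Wait a moment for the service inside the container to start
sleep 2

# Extract the Jupyter Lab token from the container logs (first match only)
notebook_token=$(docker logs ${container_id} 2>&1 | grep -P "(LabApp.*token=).*" | head -n 1 | cut -d"=" -f 2)

# Print the URL for connecting to the Jupyter Lab service
printf "Open a browser and connect to:\n
        http://127.0.0.1:${host_port}/?token=${notebook_token}\n"
  • Make the script executable: chmod +x startj.sh
  • Run ./startj.sh. It prints a URL, which is the Jupyter Lab address.
  • Open that URL to reach the Jupyter Lab interface.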
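
A fixed two-second sleep can be too short on a slow machine, in which case the token comes back empty. One alternative is to poll the container logs until the token shows up. The sketch below is an illustration, not part of the original script: `extract_token` and `wait_for_token` are hypothetical helpers, and the pattern assumes the token is a hexadecimal string following `token=` in the log.

```shell
# Hypothetical helper: read Jupyter log text on stdin and print the first
# token found (assumes the token is a run of hex digits after "token=").
extract_token() {
    grep -o 'token=[0-9a-f]*' | head -n 1 | cut -d= -f2
}

# Hypothetical helper: poll `docker logs` for container $1 until a token
# appears, trying at most $2 times (default 10), one second apart.
wait_for_token() {
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        token=$(docker logs "$1" 2>&1 | extract_token)
        if [ -n "$token" ]; then
            echo "$token"
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

In startj.sh, the `sleep` and `grep` lines could then be replaced with `notebook_token=$(wait_for_token "$container_id")`, checking the exit status to detect a container that never started Jupyter Lab.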

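6. Notes

  • Once the container is running, files written to this directory from inside it are owned by root, so take care.
  • The container is started with --rm, so after every restart anything installed from Jupyter Lab with !pip install has to be installed again.
  • Working with these files from a regular conda environment on the host will run into problems. Change their ownership back to your own user first.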
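
To see which files the container left behind under another owner, something like the following can be run on the host. It is a small sketch: `not_mine` is a hypothetical helper name, and it simply lists everything under a directory that the current user does not own.

```shell
# Hypothetical helper: list entries under directory $1 (default: current
# directory) that are not owned by the current user.
not_mine() {
    find "${1:-.}" ! -user "$(id -u)"
}
```

If `not_mine .` prints anything in the workspace directory, ownership can be taken back with `sudo chown -R "$(id -u):$(id -g)" .` before using the files from a host-side environment.
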
tags: pytorch docker tensorflow