Running PyTorch/TensorFlow with Docker: Only the NVIDIA Driver Needs to Be Installed
===

Using a GPU on Linux requires installing many components: the NVIDIA driver, CUDA, cuDNN, conda, Python, PyTorch, torchvision, and so on. A version problem in any one of them breaks the whole stack, and on Ubuntu with a graphical desktop the driver frequently breaks so that X fails to start at all.

To avoid this hassle, why not use Docker? Install only the NVIDIA driver on the host system and leave everything else to Docker.

# 1. Install the NVIDIA driver

* Add the NVIDIA graphics drivers PPA first

```
sudo add-apt-repository ppa:graphics-drivers/ppa
```

* If you get an error about a missing key, add the NVIDIA key first.

```
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
```

* Update the package sources

```
sudo apt-get update
```

* Install the driver

```
sudo apt-get install nvidia-430
```

* Check that the installation succeeded

```
$ nvidia-smi
Mon Jul 29 15:06:39 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 30%   41C    P0    58W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:41:00.0 Off |                  N/A |
| 36%   44C    P0     1W / 250W |      0MiB / 11011MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

# 2. Install `docker community`

**Environment**

* Ubuntu 16.04
* NVIDIA GPU

```
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
```

Log out and back in afterwards so that the `docker` group membership takes effect.

**Check the `docker` version**

```
$ docker version
Client:
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        481bc77
 Built:             Sat May  4 02:35:27 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.0
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.5
  Git commit:       aeac949
  Built:            Wed Jul 17 18:14:42 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.5
  GitCommit:        bb71b10fd8f58240ca47fbb579b9d1028eea7c84
 runc:
  Version:          1.0.0-rc6+dev
  GitCommit:        2b18fe1d885ee5083ef9f0838fee39b62d653e30
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
```

# 3. Install `nvidia-docker`

**Install `nvidia-docker` first**

```
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```

Note: the `nvidia-docker` command used in the rest of this note is provided by the `nvidia-docker2` package. With only `nvidia-container-toolkit` installed, Docker 19.03 or later (both client and server) exposes GPUs through `docker run --gpus all` instead.

**Check the `nvidia-docker` version**

```
$ nvidia-docker version
NVIDIA Docker: 2.0.3
Client:
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        481bc77
 Built:             Sat May  4 02:35:27 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.0
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.5
  Git commit:       aeac949
  Built:            Wed Jul 17 18:14:42 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.5
  GitCommit:        bb71b10fd8f58240ca47fbb579b9d1028eea7c84
 runc:
  Version:          1.0.0-rc6+dev
  GitCommit:        2b18fe1d885ee5083ef9f0838fee39b62d653e30
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
```

# 4. Pull a CUDA 10.0 docker image with all packages preinstalled

```
docker pull moeidb/aigo:cu10.0-dnn7.6-gpu-pytorch-cv-19.06
```

Check that the download succeeded:

```
$ docker images
REPOSITORY          TAG                               IMAGE ID            CREATED             SIZE
moeidb/aigo         cu10.0-dnn7.6-gpu-pytorch-19.06   492bce9e825f        3 weeks ago         17GB
nvidia/cuda         9.0-base
```

* Check the Python version inside the container

```
$ docker run --rm moeidb/aigo:cu10.0-dnn7.6-gpu-pytorch-19.06 python3 --version
Python 3.7.3
```

# 5. Create a script that starts Jupyter Lab

* Create a script named `startj.sh` with the following content

```
#!/bin/bash
# Host port on which Jupyter Lab will be reachable
host_port=9999

# Start the container and capture its ID
container_id=$(nvidia-docker run --rm -d --ipc=host -p ${host_port}:8888 -v "$PWD":/workspace moeidb/aigo:cu10.0-dnn7.6-gpu-pytorch-cv-19.06)

# Give the service inside the container a moment to start
sleep 2

# Extract the Jupyter Lab token from the container logs (first match only)
notebook_token=$(docker logs ${container_id} 2>&1 | grep -P "(LabApp.*token=).*" | head -n 1 | cut -d"=" -f 2)

# Print the URL for connecting to Jupyter Lab
printf "Open a browser and connect to:\n http://127.0.0.1:${host_port}/?token=${notebook_token}\n"
```

* Make the script executable: `chmod +x startj.sh`
* Run `./startj.sh`; it prints a URL, which is the address of Jupyter Lab
* Once connected, the page looks like this:

![](https://i.imgur.com/bLGgQoG.jpg)

# 6. Notes

* Once the container is running, files in this directory end up owned by `root`; keep this in mind.
* Because the container is started with `--rm`, anything installed from Jupyter Lab with `!pip install` must be reinstalled after every restart.
* Working on these files from a regular local conda environment will cause problems; change them back to your own user's ownership first.

###### tags: `pytorch` `docker` `tensorflow`
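
The fixed `sleep` in `startj.sh` assumes Jupyter Lab has written its token to the log within two seconds, which may not hold on a slow disk or a cold image. A minimal sketch of a polling alternative follows. The helper names `extract_token` and `wait_for_token`, the 30-second default timeout, and the assumption that the token is a plain alphanumeric string are illustrative choices, not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and print the first
# value that follows "token=" (assumed to be alphanumeric).
extract_token() {
    grep -oE 'token=[[:alnum:]]+' | head -n 1 | cut -d '=' -f 2
}

# Hypothetical helper: poll `docker logs` once per second until a token
# appears, or give up after the timeout (second argument, default 30).
# The body runs in a subshell so its variables do not leak to the caller.
wait_for_token() (
    container_id=$1
    timeout=${2:-30}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        token=$(docker logs "$container_id" 2>&1 | extract_token)
        if [ -n "$token" ]; then
            printf '%s\n' "$token"
            exit 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    exit 1
)
```

In `startj.sh`, the `sleep` and `grep` lines could then be replaced by `notebook_token=$(wait_for_token "${container_id}")`, with the script checking the exit status before printing the URL.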