# Installing GPU-enabled Docker on Ubuntu

These notes were written on Ubuntu 20.04.1.
Installation test date: 2024/10/05

The content is divided into four parts:
1. Install Docker
2. Install the NVIDIA driver
3. Install the NVIDIA Container Toolkit
4. Example: installing PyTorch with CUDA support

:::success
GPU support in Docker used to be provided by installing [nvidia-docker](https://github.com/NVIDIA/nvidia-docker), but NVIDIA no longer maintains or supports that repository.
NVIDIA now recommends [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) as the replacement for nvidia-docker and provides detailed [official installation documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html).
:::

## 1. Install Docker

Docker official documentation: [Install Docker Engine on Ubuntu](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository)

This section installs Docker from the `apt` repository.

1. First remove all existing Docker packages and data.
   ```bash
   sudo apt-get autoremove -y --purge docker-engine docker docker.io docker-ce docker-compose-plugin docker-buildx-plugin
   sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-ce-rootless-extras

   # Remove images, containers, volumes, or custom configuration files
   sudo rm -rf /var/lib/docker
   sudo rm -rf /var/lib/containerd
   sudo rm -rf /etc/docker
   sudo rm -f /etc/apparmor.d/docker
   sudo groupdel docker
   sudo rm -rf /var/run/docker.sock
   sudo rm -rf /var/run/docker.pid
   # Note: this also deletes keys that other apt repositories store here
   sudo rm -rf /etc/apt/keyrings
   ```
2. Update the apt package index and install the prerequisite packages.
   ```bash
   $ sudo apt-get update
   $ sudo apt-get install \
       ca-certificates \
       curl \
       gnupg \
       lsb-release
   ```
3. Create the directory `/etc/apt/keyrings` and set its permissions. `-m 0755` sets the mode of the created directory, and `-d` tells `install` to create a directory.
   ```bash
   $ sudo install -m 0755 -d /etc/apt/keyrings
   ```
4. Use `curl` to download Docker's GPG key from the given URL and save it to the path given by the `-o` option.
   ```bash
   $ sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
   ```
5. Change the permissions of `/etc/apt/keyrings/docker.asc` so that every user can read the file.
   ```bash
   $ sudo chmod a+r /etc/apt/keyrings/docker.asc
   ```
6. 
   Add the Docker repository to the system's apt sources (a new file under `/etc/apt/sources.list.d/`), so that Docker packages can be downloaded and installed from Docker's official repository.
   * `|` is the pipe operator: it passes the output of the previous command to the next command as input.
   * `tee` writes its input to both the terminal and the given file. Here it writes the output of `echo` to `/etc/apt/sources.list.d/docker.list`. The written line looks like this, for example: `deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu focal stable`.
   * `> /dev/null` redirects the standard output of `tee` to /dev/null, which discards the terminal output.
   ```bash
   # Add the repository to Apt sources:
   $ echo \
     "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
     $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
     sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
   ```
7. Refresh the package index so that apt sees the new repository, then install Docker.
   ```bash
   $ sudo apt-get update
   $ sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
   ```
8. Verify that the installation succeeded.
   ```bash
   $ sudo docker run hello-world
   Unable to find image 'hello-world:latest' locally
   latest: Pulling from library/hello-world
   c1ec31eb5944: Pull complete
   Digest: sha256:91fb4b041da273d5a3273b6d587d62d518300a6ad268b28628f74997b93171b2
   Status: Downloaded newer image for hello-world:latest

   Hello from Docker!
   This message shows that your installation appears to be working correctly.

   To generate this message, Docker took the following steps:
    1. The Docker client contacted the Docker daemon.
    2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
       (amd64)
    3. The Docker daemon created a new container from that image which runs the
       executable that produces the output you are currently reading.
    4. The Docker daemon streamed that output to the Docker client, which sent it
       to your terminal.

   To try something more ambitious, you can run an Ubuntu container with:
    $ docker run -it ubuntu bash

   Share images, automate workflows, and more with a free Docker ID:
    https://hub.docker.com/

   For more examples and ideas, visit:
    https://docs.docker.com/get-started/
   ```
9. 
   Manage Docker as a non-root user (optional). See [Linux post-installation steps for Docker Engine](https://docs.docker.com/engine/install/linux-postinstall/).
   > The Docker daemon binds to a Unix socket, not a TCP port. By default it's the root user that owns the Unix socket, and other users can only access it using sudo. The Docker daemon always runs as the root user.

   By default every `docker` command must be prefixed with `sudo`; otherwise the client cannot connect to the Docker daemon.
   If you do not want to type `sudo` every time, run the following commands.
   ```bash
   # Create the `docker` group.
   $ sudo groupadd docker
   # Add your user to the docker group.
   $ sudo usermod -aG docker $USER
   # Log out and log back in so that your group membership is re-evaluated,
   # or run the following to activate the change in the current shell.
   $ newgrp docker
   # Verify that you can run docker commands without sudo.
   $ docker run hello-world
   ```
   If you already ran Docker CLI commands with `sudo` before this point, the following warning appears.
   ```bash
   WARNING: Error loading config file: /home/user/.docker/config.json - stat /home/user/.docker/config.json: permission denied
   ```
   In that case, run the following commands to fix the ownership and permissions of `~/.docker/`.
   ```bash
   $ sudo chown "$USER":"$USER" /home/"$USER"/.docker -R
   $ sudo chmod g+rwx "$HOME/.docker" -R
   ```
   The following error may also appear even though `$USER` is already listed by `groups`.
   ```bash
   $ docker run hello-world
   docker: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Head "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied.
   ```
   Check whether `~/.docker/config.json` contains the setting `"currentContext": "desktop-linux",`. If it does, delete that setting, reboot, and then repeat the steps above that add `$USER` to the `docker` group.

## 2. Install the NVIDIA driver

Many articles already cover this, so it is not repeated here. See the article below or search online for others.
* [Ubuntu 20.04 安裝深度學習環境(Nvidia驅動、CUDA 10+CuDNN 7.6.5)](https://hackmd.io/@RinKu1998/B1MpzO3sD)

## 3. Install the NVIDIA Container Toolkit

NVIDIA official documentation: [Installing the NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

1. Remove any existing key and source list.
   ```bash
   $ sudo rm -f /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
   $ sudo rm -f /etc/apt/sources.list.d/nvidia-container-toolkit.list
   ```
2. 
   Download and add the GPG key, then add the NVIDIA Container Toolkit package source.
   :::info
   This command may fail. Check your proxy settings, or retry until it succeeds.
   :::
   ```bash
   $ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
     && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
       sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
       sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
   ```
   Check that /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg is not empty. It is a binary keyring, so its contents look like garbage when printed.
   ```bash
   $ cat /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
   ```
   Check that /etc/apt/sources.list.d/nvidia-container-toolkit.list has the following contents.
   ```bash
   $ cat /etc/apt/sources.list.d/nvidia-container-toolkit.list
   deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
   #deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/experimental/deb/$(ARCH) /
   ```
3. Update the package index.
   ```bash
   $ sudo apt-get update
   ```
   The following messages may appear during the update. In this test they could be ignored; if the install in the next step cannot find the package, re-check the key and list file from step 2.
   ```bash
   E: The repository 'https://nvidia.github.io/libnvidia-container/stable/deb/amd64 Release' no longer has a Release file.
   N: Updating from such a repository can't be done securely, and is therefore disabled by default.
   N: See apt-secure(8) manpage for repository creation and user configuration details.
   ```
4. Install nvidia-container-toolkit.
   ```bash
   $ sudo apt-get install -y nvidia-container-toolkit
   ```
5. Configure Docker's container runtime.
   ```bash
   # Configuring Docker
   $ sudo nvidia-ctk runtime configure --runtime=docker
   ```
6. Restart Docker. The `nvidia-ctk` command modifies `/etc/docker/daemon.json` on the host, so Docker must be restarted for the change to take effect.
   ```bash
   $ sudo systemctl restart docker
   ```
7. 
   Test whether the NVIDIA GPU driver and the NVIDIA Container Toolkit were installed successfully. The installation counts as successful only if the output shown in the image below appears.
   ```bash
   $ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
   ```
   ![image](https://hackmd.io/_uploads/HyDWNu00R.png)

## 4. Example: installing PyTorch with CUDA support

The Dockerfile is as follows.
```dockerfile
FROM pytorch/pytorch:2.4.1-cuda11.8-cudnn9-runtime

# Install vim non-interactively
RUN apt-get update && \
    apt-get install -y vim

ENV CODE_DIR=/home/user
WORKDIR ${CODE_DIR}
```

Build the image.
```bash
$ docker build -t jeepway/pytorch-gpu:latest .
```

Create a container and open bash.
```bash
$ docker run -it --rm --gpus=all jeepway/pytorch-gpu:latest /bin/bash
```

Inside the container's bash, run `python` and check whether `torch.cuda` is available. If it prints True, the setup succeeded.
```bash
root@2k294esk26q2:/home/user# python
>>> import torch
>>> torch.cuda.is_available()
True
```

## References
* [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html)
* [實作在 Docker 環境中使用 GPU](https://unimimic.github.io/posts/docker-gpu/)
* [Install Docker Engine on Ubuntu](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository)
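
## Optional: checking the runtime registration

Step 6 of section 3 depends on `nvidia-ctk` having modified `/etc/docker/daemon.json`. As a rough sanity check before restarting Docker, the sketch below searches that file for a runtime entry named `nvidia`. It assumes the entry is keyed by the literal string `"nvidia"`, which is consistent with the `--runtime=nvidia` flag in the test command but is not spelled out in these notes. `has_nvidia_runtime` and `report_nvidia_runtime` are hypothetical helper names, not part of Docker or the NVIDIA tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of Docker or the NVIDIA Container Toolkit.
# Succeeds if the given Docker daemon config mentions a runtime named "nvidia".
# Assumption: the runtime is keyed by the literal string "nvidia" in the JSON.
has_nvidia_runtime() {
    local config="${1:-/etc/docker/daemon.json}"
    # The file must exist and be readable before it can be searched.
    [ -r "$config" ] || return 1
    # Plain text search: a heuristic, not a real JSON parse.
    grep -q '"nvidia"' "$config"
}

# Print a one-line summary for the given (or default) config path.
report_nvidia_runtime() {
    local config="${1:-/etc/docker/daemon.json}"
    if has_nvidia_runtime "$config"; then
        echo "nvidia runtime found in $config"
    else
        echo "nvidia runtime not found in $config"
    fi
}
```

Calling `report_nvidia_runtime` with no arguments checks `/etc/docker/daemon.json`. Because this is only a text search, any other occurrence of `"nvidia"` in the file would also match, so the `nvidia-smi` container test in section 3 remains the real verification.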