設備規格
nvidia-smi
>>> Command 'nvidia-smi' not found, but can be installed with:
安裝的教學可以參考 [Linux] Ubuntu 安裝、移除 NVIDIA 顯示卡驅動程式(Driver)教學
sudo add-apt-repository ppa:graphics-drivers
查詢可以安裝哪個版本的驅動
sudo apt-get update
sudo apt-cache search nvidia-driver-*
我是安裝 nvidia-driver-535
,理論 install 完要自己下指令去 reboot,但是不知道為何螢幕直接黑掉,但是重開後就裝好了…
sudo apt-get install nvidia-driver-535
# sudo reboot
要出現這個才算安裝成功
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 On | N/A |
| 0% 51C P8 19W / 170W | 504MiB / 12288MiB | 5% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1130 G /usr/lib/xorg/Xorg 35MiB |
| 0 N/A N/A 24470 G /usr/lib/xorg/Xorg 149MiB |
| 0 N/A N/A 24633 G /usr/bin/gnome-shell 34MiB |
| 0 N/A N/A 26275 G /usr/lib/firefox/firefox 258MiB |
| 0 N/A N/A 42730 G gnome-control-center 2MiB |
+---------------------------------------------------------------------------------------+
要在 docker 裡面執行 GPU 的話要安裝:
原則上按照 docker 官方的安裝即可,如果有出現任何 ERROR 請到下面的 ERROR 區查看解法。這邊就不提供安裝 Docker Desktop 的教學,因為好像安裝了 Docker Desktop 就沒辦法在 Container 裡面使用 GPU,參考自 nvidia-docker 。
for pkg in docker.io docker-doc docker-compose podman-docker containerd runc; do sudo apt-get remove $pkg; done
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
# Add the repository to Apt sources:
echo \
"deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
"$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
安裝完後在終端機打上
docker
應該會出現一大串,像是下面這樣,這樣就算安裝成功
Usage: docker [OPTIONS] COMMAND
A self-sufficient runtime for containers
Common Commands:
run Create and run a new container from an image
exec Execute a command in a running container
ps List containers
build Build an image from a Dockerfile
pull Download an image from a registry
push Upload an image to a registry
images List images
login Log in to a registry
logout Log out from a registry
search Search Docker Hub for images
version Show the Docker version information
info Display system-wide information
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
&& \
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
執行 docker image
sudo docker run -it --gpus all pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime bash
進入 docker container 的終端機,類似下面的畫面後輸入 python
後就會進入 python 的 console
依序輸入下面,如果都有正常顯示就是裝成功了
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name())
docker.socket: Failed with result 'service-start-limit-hit'
如果打這個指令會出現上面的 ERROR 的解法
sudo systemctl restart docker
解決方案參考自網路上的方法,將檔案名稱更改即可
daemon.json
–>daemon.conf
sudo mv /etc/docker/daemon.json /etc/docker/daemon.conf