---
tags: NVIDIA, NVIDIA TESLA A30
---
# NVIDIA A30 burn-in environment building and usage SOP
###### tags: NVIDIA TESLA A30
## HW equipment
Motherboard: CB-1932 with NVIDIA TESLA A30 x 2
CPU: Intel® Xeon® CPU @ GHz x 2
RAM: 16GB SODIMM x
OS: Ubuntu 20.04 LTS Desktop, kernel 5.4.0-42 (UEFI)
Docker: 20.10.8
Cuda: 11.4
NVIDIA TensorRT docker image version: 20.11-py3
NVIDIA TensorFlow docker image version: 20.11-tf2-py3
## environment building SOP
※ Execute all commands with root privileges
### step.1 set the CLI as the default interface
We installed the Ubuntu 20.04 Desktop version because switching the default interface to the CLI is easier than upgrading the Linux kernel. Set the default boot target with either:
```bash=
sudo systemctl set-default multi-user.target
```
or
```bash=
sudo systemctl set-default runlevel3.target
```
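Either target boots the system to a text console. As a quick sanity check, you can confirm the change took effect:
```bash=
systemctl get-default   # should print multi-user.target (or its alias runlevel3.target)
```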
### step.2 install NVIDIA drivers
Download the latest stable driver from the official NVIDIA website: https://www.nvidia.com/zh-tw/geforce/drivers/
After downloading the .run file:
```bash
chmod 777 NVIDIA-Linux-x86_64-470.57.02.run # the filename may differ depending on the driver version you downloaded
apt install gcc make
./NVIDIA-Linux-x86_64-470.57.02.run # again, adjust the filename to the version you downloaded
# When the installer raises warnings, choose the "continue installation" option each time
reboot
```
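If the installer errors out because the open-source nouveau driver is loaded, one common fix (the standard NVIDIA procedure; file paths assume stock Ubuntu) is to blacklist nouveau and rebuild the initramfs before rerunning the installer:
```bash=
# Blacklist the nouveau driver so it no longer claims the GPUs at boot
cat > /etc/modprobe.d/blacklist-nouveau.conf <<EOF
blacklist nouveau
options nouveau modeset=0
EOF
update-initramfs -u
reboot
```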
After rebooting, run
```bash=
nvidia-smi
```
If the driver installed successfully, `nvidia-smi` prints a table listing both A30 GPUs.

### step.3 install docker
```bash=
sudo apt-get remove docker docker-engine docker.io containerd runc
sudo apt-get update
sudo apt-get install \
apt-transport-https \
ca-certificates \
curl \
gnupg-agent \
software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
```
Make sure the key fingerprint matches:
9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
```bash=11
sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
```
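Before adding GPU support, you can make sure the Docker engine itself works with Docker's standard smoke test:
```bash=
sudo docker run --rm hello-world   # prints a greeting if the engine and registry access are OK
```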
### step.4 install nvidia container toolkit
```bash=
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo usermod -aG docker $USER
sudo systemctl daemon-reload
sudo systemctl restart docker
```
After Docker restarts, verify that containers can see the GPUs:
```bash=
docker run --gpus all nvidia/cuda:9.0-base nvidia-smi
```
If everything is set up correctly, the container prints the same `nvidia-smi` table as the host.

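The `9.0-base` tag is quite old; if it is no longer available on Docker Hub, a newer base tag matching the installed CUDA 11.4 should behave the same way (the exact tag below is an assumption; check the available `nvidia/cuda` tags):
```bash=
docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi
```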
### step.5 pull the docker image and build the burn-in environment (TensorFlow)
#### (1) download docker image
```bash=
docker pull nvcr.io/nvidia/tensorflow:20.11-tf2-py3
```
* The docker image only needs to be pulled once
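You can confirm the image landed locally by listing it:
```bash=
docker images nvcr.io/nvidia/tensorflow   # the 20.11-tf2-py3 tag should appear
```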
#### (2) run the docker image on specific GPUs: `--gpus '"device=0"'` targets the first GPU and `--gpus '"device=1"'` the second
```bash
docker run --gpus '"device=0"' -it --name tf_2011_gpu0_no1 -w /workspace/nvidia-examples/cnn nvcr.io/nvidia/tensorflow:20.11-tf2-py3
docker run --gpus '"device=0"' -it --name tf_2011_gpu0_no2 -w /workspace/nvidia-examples/cnn nvcr.io/nvidia/tensorflow:20.11-tf2-py3
docker run --gpus '"device=1"' -it --name tf_2011_gpu1_no1 -w /workspace/nvidia-examples/cnn nvcr.io/nvidia/tensorflow:20.11-tf2-py3
docker run --gpus '"device=1"' -it --name tf_2011_gpu1_no2 -w /workspace/nvidia-examples/cnn nvcr.io/nvidia/tensorflow:20.11-tf2-py3
```
※ We used to run one container per GPU, which was enough to reach full load, but the A30 has more headroom, so we run two containers per GPU, four containers in total (the system is equipped with two A30 GPUs). Since every container is interactive, run each `docker run` in its own terminal.
* The CB-1932 is equipped with two NVIDIA A30s. `--gpus '"device=0"'` or `--gpus '"device=1"'` assigns GPU 0 or GPU 1 to that container as its resource; `--gpus all` assigns every GPU to it. For other questions, search for the keyword: docker run --gpus
* `-it`: `-i` (interactive) keeps STDIN open even while we are not attached to the container; `-t` allocates a pseudo-TTY for it
* `--rm`: the container is removed automatically when we leave it
* `-v` (volume): mounts a host path into the container so the host and container can exchange files (see the sketch after this list)
* `-w` (workdir): the working directory you land in after entering the container
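The commands above do not use `-v`; here is a sketch of how a host directory could be mounted so that files written inside the container survive its removal (the `./logs` host path and the `/workspace/logs` mount point are assumptions for illustration, not part of the original SOP):
```bash=
# Hypothetical example: mount ./logs from the host into the container
mkdir -p ./logs
docker run --gpus '"device=0"' -it --rm \
    -v $(pwd)/logs:/workspace/logs \
    -w /workspace/nvidia-examples/cnn \
    nvcr.io/nvidia/tensorflow:20.11-tf2-py3
```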
Each `docker run` above drops us into a TensorFlow container.
#### (3) write the burn-in script inside each of the four containers
```bash=
# =====it is in tensorflow container=====
cd /workspace/nvidia-examples/cnn
vi ./burn.sh
# =====it is in tensorflow container=====
```
In burn.sh:
```bash=
#!/bin/bash
# =====it is in tensorflow container=====
# loop indefinitely; each pass trains 28,800 batches
for ((i = 1; i > 0; i++))
do
mpirun --allow-run-as-root -np 1 --mca btl ^openib python -u ./resnet.py --batch_size 128 --num_iter 28800 --precision fp16 --iter_unit batch
done
# =====it is in tensorflow container=====
```
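If you also want a persistent record of each pass, a small variant of burn.sh can tee the output into per-pass log files (the `/workspace/logs` path is an assumption; combine it with the `-v` mount sketched earlier if logs should survive the container):
```bash=
#!/bin/bash
# burn.sh variant that also writes one log file per pass
mkdir -p /workspace/logs
for ((i = 1; i > 0; i++))
do
    mpirun --allow-run-as-root -np 1 --mca btl ^openib \
        python -u ./resnet.py --batch_size 128 --num_iter 28800 \
        --precision fp16 --iter_unit batch 2>&1 | tee /workspace/logs/pass_${i}.log
done
```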
After that, make the script executable and exit the container:
```bash=
# =====it is in tensorflow container=====
chmod 777 ./burn.sh
exit
# =====it is in tensorflow container=====
```
This returns you to the host terminal.
#### (4) write host-side scripts to reattach to the burn-in containers easily
Because burn.sh lives inside the containers created in step (2), these scripts restart and reattach to those containers with `docker start -ai`; a fresh `docker run --rm` with the same `--name` would collide with the existing container and would start from a clean image without burn.sh.
```bash=
# =====it is in host=====
vim ./burn_gpu0_no1.sh
# =====it is in host=====
```
in burn_gpu0_no1.sh
```bash=
#!/bin/bash
# =====it is in host=====
docker start -ai tf_2011_gpu0_no1
# =====it is in host=====
```
```bash=
# =====it is in host=====
vim ./burn_gpu0_no2.sh
# =====it is in host=====
```
in burn_gpu0_no2.sh
```bash=
#!/bin/bash
# =====it is in host=====
docker start -ai tf_2011_gpu0_no2
# =====it is in host=====
```
```bash=
# =====it is in host=====
vim ./burn_gpu1_no1.sh
# =====it is in host=====
```
in burn_gpu1_no1.sh
```bash=
#!/bin/bash
# =====it is in host=====
docker start -ai tf_2011_gpu1_no1
# =====it is in host=====
```
```bash=
# =====it is in host=====
vim ./burn_gpu1_no2.sh
# =====it is in host=====
```
in burn_gpu1_no2.sh
```bash=
#!/bin/bash
# =====it is in host=====
docker start -ai tf_2011_gpu1_no2
# =====it is in host=====
```
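Since the four scripts differ only in the GPU index and container number, they can also be generated in one pass (a convenience sketch, equivalent to writing them by hand as above):
```bash=
#!/bin/bash
# Generate burn_gpu{0,1}_no{1,2}.sh; each script reattaches to its container
for gpu in 0 1; do
    for no in 1 2; do
        cat > ./burn_gpu${gpu}_no${no}.sh <<EOF
#!/bin/bash
docker start -ai tf_2011_gpu${gpu}_no${no}
EOF
        chmod +x ./burn_gpu${gpu}_no${no}.sh
    done
done
```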
### burnIn method
```bash=
sudo su # to enter root privilege
cd /home/aewin
chmod 777 ./burn_gpu0_no1.sh
chmod 777 ./burn_gpu0_no2.sh
chmod 777 ./burn_gpu1_no1.sh
chmod 777 ./burn_gpu1_no2.sh
# run each of the following scripts in its own terminal
./burn_gpu0_no1.sh
./burn_gpu0_no2.sh
./burn_gpu1_no1.sh
./burn_gpu1_no2.sh
```
Each script attaches you to its TensorFlow container; inside each of the four containers, run
```bash=
cd /workspace/nvidia-examples/cnn
./burn.sh
```
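To confirm both A30s stay at full load during the burn-in, you can watch utilization from an extra host terminal:
```bash=
watch -n 1 nvidia-smi   # GPU-Util should sit near 100% on both GPUs while burn.sh runs
```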
## Sources
1. https://github.com/YeeHong/CB-1921A_with_NVIDIA-T4_benchmark
1. Benchmark_SOP_T4 Benchmark Guide.docx
1. NGC-Ready-Validated-Server-Cookbook-ubuntu-18.04-v1.4.1-2020-02-03 v1.docx
1. Measuring_Training_and_Inferencing_Performance_on_NVIDIA_AI_Platforms-nv.pdf
1. https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow
1. https://docs.docker.com/engine/install/ubuntu/
1. https://medium.com/@grady1006/ubuntu18-04%E5%AE%89%E8%A3%9Ddocker%E5%92%8Cnvidia-docker-%E4%BD%BF%E7%94%A8%E5%A4%96%E6%8E%A5%E9%A1%AF%E5%8D%A1-1e3c404c517d
1. https://ngc.nvidia.com/catalog/containers/nvidia:tensorrt
1. https://blog.wu-boy.com/2019/10/three-ways-to-setup-docker-user-and-group/
1. https://docs.docker.com/config/containers/resource_constraints/
1. https://blog.csdn.net/Flying_sfeng/article/details/103343813