Deep Learning Environment Setup (NTUST)

# Deep Learning Environment Setup (NTUST)
###### tags: `doc/guide`
#### Chi-Hung Weng @ NTUST 2022/12/15

---

[TOC]

---

:::warning
:warning: I assume you already know how to use:
1. SSH (to control the workstation from the command line)
2. FileZilla (to upload and download files)

I don't have time to teach you how to set up SSH or FileZilla. Both should be simple enough to set up by yourself, or by following the instructions given by the workstation administrator. If you still have questions, feel free to reach me during or after the lecture.
:::

:::info
To use SSH as a client:
1. If you have macOS or Linux: fine, an SSH client is built in. Open a terminal and type:
    ```bash=
    ssh -p [port_number] [your_server_ip_address]
    ```
    to log in (if password login is enabled). Further SSH usage information: [[link]](https://linux.vbird.org/linux_server/centos6/0310telnetssh.php)
2. If you are using Windows: also cool :smile:! You can install Windows Subsystem for Linux v2 (aka WSL2). I've tested WSL2 on each of the following Windows versions: Win10, Win11, WinServer 2022. Here are some links:
    * [[WSL2 Install]](https://www.omgubuntu.co.uk/how-to-install-wsl2-on-windows-10)
    * [[WSL with Docker]](https://docs.docker.com/desktop/windows/wsl/)
:::

:::info
About downloading / uploading files: [[FileZilla]](https://filezilla-project.org) is an easy-to-use GUI client. It supports resuming interrupted uploads and transfers files securely over SSH. Use FileZilla to upload and download files!
:::

## Course Materials
* [[slides]](https://www.dropbox.com/s/d6m84xy8jrqznc6/20221215_ntust_kuberun_and_dl.pdf?dl=1)
* [[code example]](https://www.dropbox.com/s/s432lcm6sjw5gro/ntuht_dl_20221215.tar.gz?dl=1)

## Reference
### Docker Command Line Examples
#### Pull the image
1. Go to https://hub.docker.com/ or https://catalog.ngc.nvidia.com/ to look for the images you want.
2. To pull an image, execute the following command:
    ```bash=
    docker pull tensorflow/tensorflow:2.10.1-gpu-jupyter
    ```

#### Run the image -> Containers
1. Run a TensorFlow image
    ```bash=
    # Run a TensorFlow container
    # Using 2 GPUs (GPU0 & GPU1); the nested quotes keep the comma inside a single --gpus value
    docker run -it --rm --gpus '"device=0,1"' -p 8888:8888 --name tf -v $HOME:/workspace --shm-size 32gb tensorflow/tensorflow:2.10.1-gpu-jupyter bash
    ```
2. Run a DIGITS image
    ```bash=
    # Run a container that hosts the DIGITS App
    # This is a daemon that listens on port 5000
    docker run -d --gpus=all -p 5000:5000 --name digits -v $HOME:/workspace nvidia/digits

    # If you'd like to ENTER the DIGITS container, execute the following:
    docker exec -it digits bash
    ```

#### Build your own image from an existing image
```bash=
mkdir build_tf # create a folder

# Put the instruction file (Dockerfile) into the created folder
echo '# Add `seaborn, polars` to the Google TensorFlow image
FROM tensorflow/tensorflow:2.10.1-gpu-jupyter
LABEL maintainer="Chi-Hung Weng <chihung@honghutech.com>"
RUN apt update && apt install -y vim
RUN pip3 install seaborn polars
' > build_tf/Dockerfile

# Enter the folder and build the image
cd build_tf && docker build -t="me/my_tf" .

# Run the image
docker run -it --rm --gpus '"device=0,1"' -p 8888:8888 --name my_tf -v $HOME:/workspace --shm-size 32gb me/my_tf bash
```

:::warning
:warning: Avoid creating a file in one layer and then deleting it in a later layer. A Dockerfile example:
```dockerfile=
# Append `seaborn, polars` to the Google TensorFlow image
FROM tensorflow/tensorflow:2.10.1-gpu-jupyter
LABEL maintainer="Chi-Hung Weng <chihung@honghutech.com>"
RUN apt update && apt install -y vim
RUN pip3 install seaborn polars
RUN touch aaa.txt
RUN rm aaa.txt
```
Guess what? `aaa.txt` is still stored in the Docker image! Each `RUN` creates a new layer, and the later `rm` only hides the file; it does not remove it from the layer that created it.
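
The layer behaviour described above can be sketched with a toy model in Python. This is only an illustration under simplifying assumptions, not Docker's real storage format: each layer is a dictionary of files it adds and names it deletes, and the file name `big_file.bin` and its 500 MB size are hypothetical. The sketch shows that a deletion in a later layer hides the file from the container but does not shrink what the image stores, while creating and deleting inside one step leaves nothing behind.

```python
# Toy model of image layers (an assumption for illustration, NOT Docker's format).
# Each layer records the files it adds (name -> size in bytes) and the names it deletes.

MB = 1024 * 1024


def visible_files(layers):
    """Merge layers bottom-up to get the files a container would see."""
    merged = {}
    for layer in layers:
        merged.update(layer["add"])
        for name in layer["delete"]:
            merged.pop(name, None)
    return merged


def image_size(layers):
    """Bytes stored in the image: every layer keeps the files it added."""
    return sum(sum(layer["add"].values()) for layer in layers)


# Hypothetical build: create a large file in one step, delete it in the next.
two_steps = [
    {"add": {"big_file.bin": 500 * MB}, "delete": []},  # like: RUN <create file>
    {"add": {}, "delete": ["big_file.bin"]},            # like: RUN rm <file>
]

# Same work inside a single step: the file never reaches a finished layer.
one_step = [
    {"add": {}, "delete": []},                          # like: RUN <create> && rm <file>
]

print(visible_files(two_steps))     # {}   (the file is hidden from the container)
print(image_size(two_steps) // MB)  # 500  (but still stored in the image)
print(image_size(one_step) // MB)   # 0
```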

Some people even do this in the Dockerfile:
```dockerfile=
FROM tensorflow/tensorflow:2.10.1-gpu-jupyter
LABEL maintainer="Chi-Hung Weng <chihung@honghutech.com>"
RUN wget http://xxx.xxx/cool_stuff.whl # 500MB large installation file!
RUN pip3 install cool_stuff.whl # install
RUN rm cool_stuff.whl # remove the installation file
```
Don't do that! Do the following instead:
```dockerfile=
FROM tensorflow/tensorflow:2.10.1-gpu-jupyter
LABEL maintainer="Chi-Hung Weng <chihung@honghutech.com>"
# i.e. create, install, remove temp files,
# all within one layer!
RUN wget http://xxx.xxx/cool_stuff.whl && \
    pip3 install cool_stuff.whl && \
    rm cool_stuff.whl
```
:::

#### Remove containers and images
```bash=
docker stop digits   # stop a container
docker rm digits     # remove a container
docker rmi me/my_tf  # remove a docker image
```

### Basic Linux
* vBird: https://linux.vbird.org/linux_basic/centos7/#part3

### Docker image providers (for Deep Learning Dev Environment)
* NGC TensorFlow: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow/tags
* NGC PyTorch: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
* Official TensorFlow: https://hub.docker.com/r/tensorflow/tensorflow/
* Official PyTorch: https://hub.docker.com/r/pytorch/pytorch

### Learn Docker
* a nice book: https://philipzheng.gitbook.io/docker_practice/
* official docs for reference: https://docs.docker.com

More:
* Rootless Docker (https://docs.docker.com/engine/security/rootless/): enhances security. You only need to care about this if you are an admin. If you use KubeRun, you don't have to bother with it at all.
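
After starting a container from one of the images listed above, it is worth confirming that TensorFlow can be imported and can see the GPUs. The KubeRun job in the next section runs a script called `check_tf_ver.py`, whose contents are not shown in this guide. The following is only a guess at what such a check might look like; the function name `check_tensorflow` and the shape of the returned report are assumptions made for illustration.

```python
def check_tensorflow():
    """Report whether TensorFlow is importable and how many GPUs it sees.

    Hypothetical helper: a sketch, not the actual check_tf_ver.py script.
    """
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return {"installed": False, "version": None, "gpus": 0}

    gpus = tf.config.list_physical_devices("GPU")
    return {"installed": True, "version": tf.__version__, "gpus": len(gpus)}


if __name__ == "__main__":
    print(check_tensorflow())
```

Inside the `tensorflow/tensorflow:2.10.1-gpu-jupyter` container started with two GPUs, one would expect the report to show version `2.10.1` and two GPUs; a GPU count of zero usually means the container was started without the `--gpus` option.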
### KubeRun - Job running mode
```bash=
python3 /home/ubuntu/check_tf_ver.py && \
nvidia-smi
```
![](https://i.imgur.com/DuUm3WR.png)
![](https://i.imgur.com/LQEEiZM.png)

### Learn Deep Learning
* Find state-of-the-art model implementations (papers that come with code :arrow_right: :100:): https://paperswithcode.com
* Understand CNN basics (:100: a must-read; make sure you understand every line): https://cs231n.github.io
* Learn Deep Learning basics (Wow! Learning by doing! :100:): https://zh.d2l.ai
* Machine Learning basic theory (:100: for those who want to understand the general foundations of Machine Learning): https://cs229.stanford.edu/notes2022fall/main_notes.pdf

Some more:
* Data Parallelism (train a model using multiple GPUs across multiple workstations / servers)
    * Uber Horovod: https://github.com/horovod/horovod (works with TensorFlow and PyTorch, highly scalable :100:)
    * or you can use PyTorch's `DistributedDataParallel` (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)
    * or TensorFlow's `MirroredStrategy` (https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy), which covers multiple GPUs on one machine; `MultiWorkerMirroredStrategy` extends the same idea to multiple machines

---
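
The idea behind the data-parallel tools listed above can be sketched in a few lines of plain Python. This is a toy illustration, not how Horovod, `DistributedDataParallel`, or `MirroredStrategy` are implemented: the one-parameter model `y = w * x`, the data, and the function names are all assumptions. Each "worker" computes a gradient on its own shard of the batch, and averaging those gradients (the role an all-reduce plays in real frameworks) gives the same result as computing the gradient on the full batch, provided the shards have equal size.

```python
def gradient(w, xs, ys):
    """d/dw of the mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_gradient(w, xs, ys, num_workers):
    """Split the batch evenly, compute one gradient per worker, then average.

    Assumes len(xs) is divisible by num_workers so that all shards are equal.
    """
    shard = len(xs) // num_workers
    grads = [
        gradient(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    return sum(grads) / num_workers  # stands in for the all-reduce step


# Hypothetical data: the true relationship is y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = gradient(w, xs, ys)
parallel = data_parallel_gradient(w, xs, ys, num_workers=2)
print(full, parallel)  # -22.5 -22.5
```

Because both paths yield the same gradient, adding workers changes how fast a step is computed, not what the step is. In practice the frameworks also handle communication between GPUs and machines, which this sketch leaves out entirely.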