# tcg-disc-wiki

This `.md` file contains some basic information about the `tcg-disc` workstation.

### Hardware description

* CPU: Intel i9-10980XE (18 cores / 36 threads with hyperthreading, 3.0 GHz base, 4.6 GHz turbo)
* GPU: Nvidia GeForce RTX 3090 (24 GB, 10496 CUDA cores, 1.70 GHz boost)
* RAM: 256 GB DDR4 3000 MHz
* NVMe: Samsung SSD 970 Evo Plus M.2 500 GB
* HDD: IronWolf NAS 2 TB, 64 MB cache, 5900 rpm
* Operating system: Fedora 33 Server

### Connecting to the workstation

To connect to the workstation you should open an `ssh` tunnel using the syntax:

```
ssh -f -N -L <PORT>:192.168.16.123:22 <USER>@147.162.63.10 -p 7000 -oKexAlgorithms=+diffie-hellman-group1-sha1
```

in which `<PORT>` represents a local port to be dedicated to the tunnel (choose one from 2000 to 2080) and `<USER>` must be a username authorized to access the gate server (if you get an error message, please look at the `Troubleshooting` section).

Once the `ssh` tunnel has been opened you should be able to log onto the machine using the command:

```
ssh -X tcg@localhost -p <PORT>
```

**WARNING:** Do not use the gate as an intermediate server when copying files. Transfer data through the `ssh` tunnel instead (e.g. `scp -P <PORT> <file> tcg@localhost:<remote_path>`).

### Troubleshooting

In some Linux distributions the first attempt to open the ssh tunnel may fail with the following message:

```
Unable to negotiate with 147.162.63.10 port 7000: no matching host key type found. Their offer: ssh-rsa,ssh-dss
```

To solve this issue, open (or create) the file

```
~/.ssh/config
```

and add the following lines to it:

```
Host 147.162.63.10
    KexAlgorithms +diffie-hellman-group1-sha1
    HostKeyAlgorithms +ssh-dss
    PubkeyAcceptedKeyTypes +ssh-dss
```

### Quick guide to Environment Modules

To load software without manually exporting the required environment variables you can use [Environment Modules](http://modules.sourceforge.net/).

To list the available modules use the command:

```
module avail
```

To get a brief description (if implemented) of what a module does you can use:

```
module whatis <name_of_the_module>
```

More detailed information can be displayed (if implemented) using the command:

```
module help <name_of_the_module>
```

To load or unload a module you can use the commands:

```
module load <name_of_the_module>
module unload <name_of_the_module>
```

To check the currently loaded modules you can use:

```
module list
```

### Quick guide to Anaconda 3

If you need a `python3` virtual environment you can use the [Anaconda 3](https://www.anaconda.com/products/individual) package, which is available by default without loading any module.

To activate or deactivate a virtual environment you can use the commands:

```
conda activate <environment_name>
conda deactivate
```

To list the installed packages you can use the command:

```
conda list
```

To list the available environments you can use the command:

```
conda env list
```

Please refer to the conda [reference page](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) for more information about environment management.

### Quick guide to GPU management

To monitor the GPU status you can use the `nvidia-smi` utility.
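If you want to keep an eye on the GPU over time, a minimal sketch (not part of the original setup notes; it assumes the standard `watch` utility and the stock `nvidia-smi` query flags) is:

```
# refresh the full nvidia-smi report every 2 seconds
watch -n 2 nvidia-smi

# or log a few selected fields as CSV, sampling every 5 seconds
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,temperature.gpu --format=csv -l 5
```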
If everything works correctly, a plain `nvidia-smi` call should print something similar to the following output:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:65:00.0 Off |                  N/A |
|  0%   54C    P0    42W / 350W |      0MiB / 24265MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

If a job running on the machine is using GPU acceleration, it should appear in the `Processes` section.

##### Compiling GPU accelerated codes

The current machine configuration offers two independent CUDA 11.2 releases: one is loaded through the `cuda/cuda-11.2` module, the other through one of the NVIDIA [HPC SDK](https://developer.nvidia.com/hpc-sdk) modules (`nvhpc-byo-compiler/21.1`, `nvhpc-nompi/21.1` or `nvhpc/21.1`). The `cuda/cuda-11.2` release has been compiled together with the `libcudnn` (cuDNN) library and should be used for Machine Learning applications. The HPC SDK modules load the NVIDIA/PGI compilers, which can be used to compile [OpenACC](https://www.openacc.org/) codes.

To compile C/C++ CUDA codes you can use the base syntax:

```
nvcc --gpu-architecture=sm_86 <your_code.cu> -o <compiled_code.exe>
```

To compile OpenACC accelerated C++ codes you can try the base syntax (the RTX 3090 has compute capability 8.6, hence `cc86`):

```
nvc++ -acc -gpu=cc86 -Minfo=all <your_code.cpp> -o <compiled_code.exe>
```

### Quick guide to Task Spooler

To organize multiple jobs you can use the [Task Spooler](http://manpages.ubuntu.com/manpages/xenial/man1/tsp.1.html) job scheduler. To check the queue status simply call the `tsp` program. A job can be submitted using the syntax:

```
tsp <job>
```

If the job writes to the terminal, its output is captured in the temporary file listed under the `output` column of the queue. If you wish to redirect the output yourself you can use the syntax:

```
tsp sh -c "<job>"
```

for example:

```
tsp sh -c "echo $PATH >> MyPathEnvVar.txt"
```

To kill a running job you can use the syntax:

```
tsp -k <tsp_job_id>
```

To remove a queued job you can use the syntax:

```
tsp -r <tsp_job_id>
```

**WARNING: Task Spooler has no control over the resources jobs actually use; make sure concurrent jobs do not oversubscribe the CPU, the GPU or the memory.**

**WARNING: Task Spooler has no per-user protection; be careful when killing or removing jobs.**

### Quick guide to the mail delivery system

A mail delivery system has been configured on the machine using the [mutt](http://www.mutt.org/) mail client.
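As a quick sanity check of the delivery setup, you can send yourself a short test message directly from the command line; a minimal sketch, with `my_mail@something.com` as a placeholder address to replace with your own:

```
# mutt reads the message body from stdin when used non-interactively
echo "Test message from tcg-disc" | mutt -s "tcg-disc mail test" my_mail@something.com
```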
If you wish to get an e-mail notification when your job has been completed you can add the following line to the job script (note that recent versions of `mutt` require the `-a` attachment flag to come last, followed by `--` and the recipient):

```
echo | mutt -s "<mail-subject>" -a <file_to_attach> -- <destination-email>
```

A simple example of an input file with email notification follows:

```
#!/bin/bash

module load cuda/cuda-11.2
nvcc --gpu-architecture=sm_86 main.cu -o main.exe
./main.exe >> output.txt
echo | mutt -s "Hurray, job done" -a output.txt -- my_mail@something.com
module unload cuda/cuda-11.2
rm main.exe
```
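Putting the pieces together, a job script like the one above can be queued through Task Spooler; a minimal sketch, assuming the script has been saved as the (hypothetical) file `run_job.sh`:

```
# make the job script executable, then queue it with Task Spooler
chmod +x run_job.sh
tsp ./run_job.sh
```

The job then shows up in the `tsp` queue like any other submission and, thanks to the `mutt` line in the script, the output file is mailed to you when the run completes.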