# Perceval, mighty knight of the GPUs
###### tags: `cluster` `perceval`
## To do
* Fix WoL
* Add a queuing system for GPU and CPU use
## Hardware
* 2× Intel Xeon Gold 6234: 8 cores/16 threads, 3.3 GHz (4.0 GHz turbo), 24.75 MB cache
* 4× 32 GB DDR4-2933
* Intel S4510 240 GB SSD, SATA 6 Gb/s, 3D TLC, 2.5", 7.0 mm, up to 2 DWPD
* Toshiba 3.5" 8 TB SATA 6 Gb/s, 7.2K RPM, 128 MB cache, 512e
* 4× PNY GeForce RTX 2080 Ti, 11 GB GDDR6, 1.35 GHz, 4352 cores
* Supermicro SuperServer Tower 7049GP-TRT
> Note: 3-year on-site warranty, until 11/06/2023.
## Network
* IPv4: 129.175.80.34
* Block: 129.175.0.0/16
* DNS name: theo-perceval.lps.u-psud.fr
* MAC: ac:1f:6b:ac:a0:4e
* Subnet mask: 255.255.255.0
* Supports Wake On LAN
## OS
Ubuntu 20.04.1 LTS (Focal Fossa)
## Storage
### Partitioning
240GB SSD:
* 189GB ext4 partition for `/`
* 923MB ext4 partition for `/boot`
* 30.5GB of swap
8TB mechanical drive:
* 7.3TB xfs partition for `/home`
### `/home/share` directory
All users can read and write in `/home/share`.
#### Details
All users should be part of the `users` group. `/home/share` is simply owned by the group and has `770` permissions.
Users created with `adduser` are automatically assigned to the `users` group, and their home directory is created with `700` permissions so that only they (and root) can access it (see the Administration section).
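For reference, here is a minimal sketch of how such a shared directory can be set up (the `users` group already exists by default on Ubuntu):

```bash
# Create the shared directory, hand it to the `users` group,
# and grant read/write/execute to owner and group only.
sudo mkdir -p /home/share
sudo chgrp users /home/share
sudo chmod 770 /home/share
```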
### Quotas
#### User side
Quotas are enforced on the `/home` filesystem. Their primary purpose is to avoid crashing everybody's jobs if one user inadvertently writes too much data in a short amount of time (because of a bad choice of dump parameters, for instance). If you need more space on a daily basis, ask the admins to increase your quota.
The soft limit is set by default to 500GB. A user can remain at most 7 days above the soft limit before file creation is denied. A warning is issued at login when the soft limit is reached.
The hard limit is set by default to 600GB. The hard limit cannot be exceeded: disk writing is denied.
A user can see their current disk usage with
xfs_quota -c 'quota -h'
The `Blocks` column displays the currently used space and the soft and hard limits are displayed in the two subsequent columns.
#### Admin side
Since `/home` resides on an xfs filesystem, quotas are administered with `xfs_quota`.
Without any argument, `xfs_quota` runs in interactive mode.
For administration tasks, it must be run in "expert mode" with `xfs_quota -x`.
While in the interactive session, quotas can be modified using
limit bsoft=<soft limit> bhard=<hard limit> <username>
A summary of users' disk usage can be obtained with `report`.
Commands run in interactive mode can also be passed on the command line using the `-c` option of `xfs_quota`.
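For example (the username is illustrative; the default limits quoted above are 500GB soft / 600GB hard):

```bash
# Summary of all users' disk usage and limits on /home
sudo xfs_quota -x -c 'report -h' /home

# Set the block quota limits for a given user
sudo xfs_quota -x -c 'limit bsoft=500g bhard=600g some_user' /home
```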
The default limits are only applied when a user is created with `adduser`. If a user is created with `useradd`, the limits must be set manually with `xfs_quota` (see the Administration section).
## Access
### SSH
Perceval can be reached over ssh from the lab's network using
ssh <your username>@theo-perceval.lps.u-psud.fr
On the first connection attempt, the server will offer its public key for later authentication. Perceval's public keys have the following fingerprints:
SHA256:WaNdylk85aA6xP+gT9GruTAoiHu4ajYSmfP93xgJQxk (ED25519)
SHA256:5fx/46SdnuW+djOFCIcG8etMybN6nzw895jsuSvAQvw (RSA)
Once a key is accepted, you will be asked for your password.
To avoid typing the full name every time, a section can be added on the client machine in `~/.ssh/config`:
Host perceval
Hostname theo-perceval.lps.u-psud.fr
User <your username>
Now you only need to type `ssh perceval` to connect to the machine. Your password will still be required though.
To avoid typing the password, you can use public key authentication. **Perceval only accepts `ed25519` and `rsa` type public keys.** If you don't have a key pair already, you can generate one with:
ssh-keygen -t ed25519
ssh-keygen -t rsa -b 4096
Then, send your public key to Perceval using
ssh-copy-id perceval
You can now connect to Perceval by typing `ssh perceval` without being prompted for your password. Note that the generated private keys (`id_rsa` or `id_ed25519`) grant access to your account on every machine that knows about them, so they should be **kept secret**.
### Fail2ban
Fail2ban is running on the machine to prevent brute force attacks on ssh passwords.
If a client enters a bad password more than 5 times within a 10-minute window, its IP address is banned for the next 10 minutes.
The status of the jail can be checked by privileged users using
fail2ban-client status sshd
Fail2ban logs can be found in `/var/log/fail2ban.log`.
To unban an IP address before the ban expires, run
fail2ban-client set sshd unbanip IP_ADDRESS_TO_UNBAN
### Wake On LAN
FIXME
The machine supports Wake On LAN. To power it up from a Linux machine connected to the lab's internal network, use for instance:
wakeonlan -i 129.175.80.34 ac:1f:6b:ac:a0:4e
## Administration
The root account is disabled and all administration is done via `sudo`.
Current sudoers are:
* Susana
* Etienne
* Frank
Feel free to ask if you need to set up a new user, or anything else!
### User creation with `adduser`
Prefer `adduser` over `useradd` for user creation: the former calls the latter but also permits more flexible administration.
In particular, the script `/usr/local/sbin/adduser.local` is run after each user creation. Commands it contains so far:
* Force user to change their password on first login
* Apply default quota limits (see Storage section)
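For illustration, here is a minimal sketch of what such a script could look like; the authoritative version is the file on Perceval itself, and the exact commands may differ:

```bash
#!/bin/bash
# /usr/local/sbin/adduser.local is invoked by adduser with the arguments:
#   username uid gid home-directory
USERNAME="$1"

# Force a password change on first login
chage -d 0 "$USERNAME"

# Apply the default quota limits on /home (500GB soft / 600GB hard)
xfs_quota -x -c "limit bsoft=500g bhard=600g $USERNAME" /home
```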
Some default values for new users can be set in `/etc/adduser.conf`. Modifications from default:
* `USERGROUPS=no`: disables the creation of one group per user at user creation
* `USERS_GID=100`: new users are created with the primary group `users` (gid 100)
* `DIR_MODE=0700`: home directories are created accessible only to their owner
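The corresponding lines in `/etc/adduser.conf` therefore read:

```
USERGROUPS=no
USERS_GID=100
DIR_MODE=0700
```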
If `useradd` is used instead of `adduser`, the configurations above are not performed (default user group, home permissions and quota).
### User deletion with `deluser`
Users can be deleted with `deluser`. To remove the home directory and mail spool properly, `--remove-home` can be added.
A message might be displayed warning that the group `users` is now empty. This is not actually the case: `users` is simply the *primary* group of all users, so its members' usernames do not appear in `/etc/group`.
## Software
This section collects technical details about some of the installed software.
### Python
The system-wide python installation is Anaconda (`/opt/anaconda/bin` added to `PATH`).
`python` defaults to `python3.8`.
`python2` is also present.
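To check which interpreter is picked up (the exact 3.8 patch release may differ):

```bash
which python        # should print /opt/anaconda/bin/python
python --version    # reports a Python 3.8.x release
```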
### LAMMPS
#### UPDATE: 28/05/23
A newer version of LAMMPS is now installed, and linked from `/usr/local/bin`. It is the stable version of LAMMPS downloaded on 28/05/2023. Only the `build_most_gpu` build described below is currently compiled, but adding other builds should be straightforward if anyone wants them. Benchmarks have not been rerun.
#### Older text:
The currently built version of LAMMPS is that of 3 March 2020.
It can be called from anywhere as `lmp`.
The built version contains the USER-OPT, USER-OMP, USER-INTEL and GPU accelerator packages. See the LAMMPS documentation for the proper way to run with each accelerator package: https://lammps.sandia.gov/doc/Speed_packages.html
FFTW is used for FFT calculations.
The full list of available packages and options can be obtained by running `lmp -help`.
There are no general rules for obtaining optimal performance. For each problem, tests must be run, trying each accelerator package with various options.
Taking advantage of Perceval's GPUs, though, should make things significantly faster!
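As an illustration (the input script name `in.lj` is hypothetical), a run using the USER-OMP accelerated styles with 2 MPI tasks and 2 OpenMP threads per task would look like:

```bash
# 2 MPI tasks, OMP-accelerated styles enabled, 2 OpenMP threads per task
mpirun -np 2 lmp -sf omp -pk omp 2 -in in.lj
```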
#### Build details
We take advantage of the new `cmake` build system to build LAMMPS with various optimisation packages in separate subdirectories of `/opt/lammps-3Mar20`.
In each build directory, a bash file contains the `cmake` call with the options used for that build (a representative sketch is given at the end of this subsection).
* `build_minimal`: minimal build setting, just checking that essential libraries are present.
* `build_most`: build including most packages, excluding the ones that require external libraries. Includes threaded FFTW support. No OpenMP threading support.
* `build_most_omp`: same as `build_most`, but also includes OpenMP support.
On systems using gcc 9.x, OpenMP is disabled by default. This is not properly documented on the LAMMPS website. To enable OpenMP, `-D BUILD_OMP=yes` must be used, and the script `src/USER-OMP/hack_openmp_for_pgi_gcc9.sh` must be run to modify the source files of the USER-OMP package according to the new OpenMP semantics. Contrary to what is written in the LAMMPS documentation, `-D LAMMPS_OMP_COMPAT=4` has no effect; the script must be used. The correct steps are documented here: https://github.com/lammps/lammps/issues/1482
The directory `src/USER-OMP` was saved to `archive/USER-OMP_unmodified` before running the hack script.
* `build_most_intel`: same as `build_most_omp` but also includes USER-INTEL package that enables specific vectorisation optimisations on Intel processors.
Like USER-OMP, the package contains OpenMP directives that are not compatible with gcc 9.x. Hence, I copied the hack script from USER-OMP and ran it on USER-INTEL as well. The unmodified sources are saved in `archive/USER-INTEL_unmodified`.
Better results might be achieved with an Intel compiler.
* `build_most_gpu`: same as `build_most_intel` but also includes the GPU package, which allows offloading some calculations to GPUs.
The oldest GPU architecture supported by the package is no longer supported by CUDA 11, leading to compilation errors. A patch, not mentioned in the LAMMPS documentation, must be applied to fix this. It is documented here: https://github.com/lammps/lammps/issues/2343. The affected file `GPU.cmake` has been saved unchanged in `archive/`.
The package was built with double precision. The default is "mixed", which uses single precision for pair forces but double-precision accumulators. I traded speed for precision here.
The binary built in `build_most_gpu`, which contains all accelerator packages discussed here, is linked into `/usr/local/bin`, which is in `PATH`, so LAMMPS can be called by everybody as `lmp`.
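For orientation, here is an approximate reconstruction of the `cmake` call behind `build_most_gpu`; the authoritative command, with the exact options used, is the bash file saved in that directory:

```bash
# Approximate sketch only: see the bash file in /opt/lammps-3Mar20/build_most_gpu
# for the options that were actually used.
cd /opt/lammps-3Mar20/build_most_gpu
cmake -C ../cmake/presets/most.cmake \
      -D FFT=FFTW3 \
      -D BUILD_OMP=yes \
      -D PKG_USER-INTEL=yes \
      -D PKG_GPU=yes -D GPU_API=cuda -D GPU_PREC=double \
      ../cmake
make -j 16
```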
#### Benchmarks
Benchmarks are run in separate subdirectories of `/opt/lammps-3Mar20/my_bench/`. 3 problems are used for benchmarking:
* LJ: atomic fluid, Lennard-Jones potential with 2.5 sigma cutoff (55 neighbors per atom), NVE integration, 32k particles.
* EAM: metallic solid, Cu EAM potential with 4.95 Angstrom cutoff (45 neighbors per atom), NVE integration, 32k particles.
* Rhodo: rhodopsin protein in solvated lipid bilayer, CHARMM force field with a 10 Angstrom LJ cutoff (440 neighbors per atom), particle-particle particle-mesh (PPPM) for long-range Coulombics, NPT integration, 32k particles.
Reported values are the loop times in seconds.
##### FFT
Using rhodo benchmark script.
~5% speed gain from using an FFTW build with threading and SIMD instruction-set support. Not great, but free.
|Build|Comments|Time (serial)|Time (4 MPI tasks)|
|:---:|:---:|:---:|:---:|
|most_noFFTW|Using KISS provided FFT library|21.62s|5.65s|
|most_FFTWnothreads|Using FFTW built without threading support|21.41s|5.63s|
|most_FFTWthreads|Using FFTW built with threading and SIMD support|20.71s|5.51s|
All subsequent builds use FFTW with threading and SIMD support for FFT calculations.
The following benchmarks investigate the impact of various LAMMPS accelerator packages.
##### CPU acceleration
Some packages allow optimisation for CPU calculations and threading support using OpenMP.
* USER-OMP: threading and various optimisations.
* USER-OPT: generic optimisations (no threading).
* USER-INTEL: intel-processors specific optimisations. Plays nicely with threading support offered by USER-OMP.
* KOKKOS: part of the framework deals with CPU and threading optimisations. The main purpose of Kokkos is portability, but it usually performs (much) worse than the USER packages. Therefore, I left it aside.
MPI parallelises the code by defining domains in the simulation box and assigning each domain to an MPI task. This greatly reduces the need for communication and allows good scaling in large systems.
OpenMP threading on the other hand parallelises calculation at the particle loop level which works in a similar way in systems of all sizes.
Tuning the number of MPI tasks (via the `-np [val]` option of `mpirun`) and of OpenMP threads per task (via the `-pk omp [val]` option of USER-OMP) is required to get the most out of the USER-OPT and USER-INTEL optimisations.
Perceval's Intel Xeon cores support hyperthreading (HT), which allows each core to run 2 threads in parallel without additional cost.
The efficiency of optimisation is problem specific. Different packages and parameters must be tested for each specific model. The benchmarks below can serve as guidelines.
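A hedged sketch of such a scan (the input script name and the ranges are illustrative):

```bash
# Scan a few MPI-task / OpenMP-thread combinations and extract the loop time
for ntasks in 1 2 4 8; do
  for nthreads in 1 2; do
    echo "== ${ntasks} MPI tasks, ${nthreads} OMP threads/task =="
    mpirun -np "$ntasks" lmp -sf omp -pk omp "$nthreads" -in in.my_model \
      | grep "Loop time"
  done
done
```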
###### vanilla
Using no settings other than those of the "most" preset.
The corresponding build is in `build_most`.
||lj|eam|rhodo|
|:---:|:---:|:---:|:---:|
|**1 MPI task**|1.466|3.837|20.739|
|**2 MPI tasks**|0.764|1.966|10.645|
###### most_omp
The USER-OMP package is included in the "most" preset. Here, we enable its optimised styles with `-sf omp`, and use threading provided by OpenMP.
The corresponding build is in `build_most_omp`.
||lj|eam|rhodo|
|:---:|:---:|:---:|:---:|
|**1 MPI task, 1 OMP thread/task**|1.297|3.604|19.112|
|**1 MPI task, 2 OMP threads/task**|0.974|2.766|15.089|
|**2 MPI tasks, 1 OMP thread/task**|0.683|1.845|9.786|
|**2 MPI tasks, 2 OMP threads/task**|0.524|1.448|7.773|
6-12% speedup with respect to vanilla build.
###### most_opt
The USER-OPT package provides templates to avoid branching and optimised styles. It is included in the "most" preset and enabled with `-sf opt`. It has no influence on threading.
The corresponding build is `build_most_omp`, as above; we simply run it with different flags.
||lj|eam|rhodo|
|:---:|:---:|:---:|:---:|
|**1 MPI task**|1.337|3.474|19.000|
|**2 MPI tasks**|0.682|1.778|9.681|
Performs slightly worse than the USER-OMP optimisations on the lj benchmark but slightly better on eam and rhodo.
###### most_intel
These benchmarks are run with USER-INTEL package enabled via `-sf intel` combined with threading provided by USER-OMP.
There is a lot of possible tweaking for this package. The build in `build_most_intel` uses the default settings.
||lj|eam|rhodo|
|:---:|:---:|:---:|:---:|
|**1 MPI task, 1 OMP thread/task**|1.258|3.450|29.31|
|**1 MPI task, 2 OMP threads/task**|1.003|2.433|22.89|
|**2 MPI tasks, 1 OMP thread/task**|0.661|1.760|14.97|
|**2 MPI tasks, 2 OMP threads/task**|0.518|1.253|11.83|
No improvement over most_omp for lj benchmark.
~7% speedup with respect to most_omp without threading, ~12% with threading, on the eam benchmark.
Terrible performance on the rhodo benchmark!
##### GPU acceleration
GPU acceleration can be achieved via the GPU package or KOKKOS. The latter is focused on portability and is less efficient. Only the GPU package was built.
Tests are required to tune the number of GPUs and the number of MPI tasks per GPU. There must be at least one CPU core per GPU.
###### most_gpu
The build used in this benchmark is `build_most_gpu`. It contains all accelerator packages discussed above plus the GPU package. The benchmarks were run with GPU offloading activated via `-sf gpu`, using a single GPU with 1 or 2 MPI tasks and 1 OpenMP thread per task.
||lj|eam|rhodo|
|:---:|:---:|:---:|:---:|
|**1 MPI task, 1 OMP thread/task, 1 GPU**|0.1584|0.2700|2.2626|
|**2 MPI tasks, 1 OMP thread/task, 1 GPU**|0.1297|0.2161|2.2047|
Bringing in the GPU increases speed by a factor of ~10. The GPU becomes the bottleneck: adding an extra MPI task only brings a marginal speed-up.
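As an illustration (the input script name is hypothetical), offloading to a single GPU with 2 MPI tasks can be requested with:

```bash
# 2 MPI tasks sharing 1 GPU, GPU-accelerated styles enabled via the gpu suffix
mpirun -np 2 lmp -sf gpu -pk gpu 1 -in in.my_model
```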
### FFTW
FFTW 3.3.8 is installed, with threading support and vectorisation enabled.
#### Build details
The current version was built with the following options:
* `--enable-shared`: build the libraries in both shared and static versions. Shared libraries are necessary for LAMMPS to compile with threaded FFTs. For some reason, if FFTW is built without threading support, shared libraries are not required.
* `--enable-threads --enable-openmp`: build with both built-in and OpenMP threading support. LAMMPS requires threading via OpenMP.
* `--enable-mpi`: self-explanatory.
* `--enable-sse2 --enable-avx --enable-avx2 --enable-avx512`: allow use of SIMD (vector) CPU instructions if available. The selected ones are the instruction sets supported by Perceval's Intel Xeon Gold cores.
The line with all options used for building FFTW is saved in `/opt/fftw-3.3.8/mybuild.sh`.
The installed version successfully passed `make check`.
FFTW binaries, documentation, etc. have been put into standard places by `make install`. They can be removed properly at any time with `make uninstall`.
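Putting the pieces together, the build essentially amounted to the following (reconstructed from the options above; the authoritative line is the one saved in `mybuild.sh`):

```bash
cd /opt/fftw-3.3.8
./configure --enable-shared --enable-threads --enable-openmp --enable-mpi \
            --enable-sse2 --enable-avx --enable-avx2 --enable-avx512
make
make check          # all tests passed on Perceval
sudo make install   # can be undone later with `sudo make uninstall`
```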
### BorgBackup
BorgBackup is installed and can be used to make backups directly to the NAS.
See the NAS wiki for an explanation and example scripts of Borg usage.
## Graphics
### Cuda
CUDA v12 is installed on Perceval.
### Xorg
Using `nvidia-smi` showed that Xorg was using the GPUs (~1% of each).
Apart from the 4 Nvidia GPUs, `lshw` reports another VGA-compatible device: an AST1150. I could not find the device on the vendor's website (ASPEED Technology), and I am not sure it is a proper graphics card. The vendor provides server equipment, and for other products I have seen forum threads stating that the card only includes minimal graphics capabilities, essentially for terminal display.
To try to prevent Xorg from using the GPUs, I added a file `/etc/X11/xorg.conf` following [this thread](https://askubuntu.com/questions/869496/force-xorg-to-use-cpu-not-gpu). Since we don't have Intel integrated graphics, I tried modifying the sections, guessing the Driver item in particular. It turns out that this is not correct. Upon restart (with `systemctl restart display-manager`), Xorg encounters problems, which are logged in `/var/log/Xorg.x.log`. With the config file `/etc/X11/xorg.conf` in place:
* Xorg does not appear anymore in `nvidia-smi`
* ssh X forwarding seems to work
* Frank seems to still be able to use the GPUs with python
* When plugging a screen directly into the machine, tty1 with the graphical interface does not work
If we are happy with it, we can probably leave it like that for the moment. Apparently, Xorg also interfaces with the GPUs for some applications, so its broken configuration might cause problems later on. Also, I have no idea where Xorg is now running.
To revert the changes, remove `/etc/X11/xorg.conf` and restart Xorg with `systemctl restart display-manager`.