# Why use HOOMD on DelftBlue

1. If your system is big enough, consider running your simulations on the DelftBlue HPC cluster. The basic idea is to exploit the large number of CPUs on DelftBlue to parallelize and speed up your simulations.
2. If your simulation script is set up properly, you do not need to change anything in it. What changes is the way you launch it.

# Installing HOOMD with MPI enabled

The easiest and recommended way to install HOOMD is through conda, but that package does not provide GPU and MPI (CPU parallelization) support. Parallelization can significantly reduce simulation times, so it is worth compiling the package from source. For the most up-to-date build instructions, refer to https://github.com/glotzerlab/hoomd-blue/blob/trunk-patch/BUILDING.rst

## CPU Parallelization (No GPU)

To compile HOOMD with CPU parallelization support, follow these steps:

1. Request an interactive session:

```shell
srun --job-name="build hoomd" --partition=compute --account=research-as-bn \
    --time=01:00:00 --ntasks=1 --cpus-per-task=4 --mem-per-cpu=4GB --pty bash
```

2. Load the necessary modules:

```shell
module load 2024r1
module load python
module load openmpi
module load cmake
module load intel/oneapi
module load tbb
module load py-numpy
module load py-pip
```

3. Prepare the environment:

```shell
git clone --recursive https://github.com/glotzerlab/hoomd-blue
python3 -m venv hoomd-venv
source hoomd-venv/bin/activate
export CMAKE_PREFIX_PATH=$VIRTUAL_ENV
python3 -m pip install gsd
python3 hoomd-blue/install-prereq-headers.py -v -y
```

If the last command fails to download pybind11, you probably need to change the pybind11 URL in the script to https://github.com/pybind/pybind11/archive/v2.13.6.tar.gz

4. Compile from source:

```shell
# Replace the /scratch/psillano/tools/hoomd-venv paths with the location of your own virtual environment
cmake -B build/hoomd -S hoomd-blue \
    -D ENABLE_MPI=on -D ENABLE_TBB=on \
    -DPYTHON_EXECUTABLE=/scratch/psillano/tools/hoomd-venv/bin/python \
    -DCMAKE_INSTALL_PREFIX=/scratch/psillano/tools/hoomd-venv
cmake --build build/hoomd -j4
cmake --install build/hoomd
```

This installs the HOOMD Python package in the virtual environment. To use HOOMD, always activate your Python virtual environment first:

```shell
source hoomd-venv/bin/activate
```

### Testing Installation

You can check that everything works correctly with the command below, which should print `True` if MPI support was compiled in:

```
python -c "import hoomd; print(hoomd.version.mpi_enabled)"
```

For more examples and information on MPI, check [hoomd-examples](https://github.com/glotzerlab/hoomd-examples/tree/trunk). For benchmarks, check [hoomd-benchmarks](https://github.com/glotzerlab/hoomd-benchmarks).

## Using HOOMD with MPI Support

To use HOOMD with MPI support, you can either submit an sbatch file or use an interactive session.

### Submitting a job with an sbatch file

Create a new file `run_job.sh`, copy-paste the following, and edit where needed:

```
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --time=00:05:00
#SBATCH --ntasks=4           # Number of MPI tasks (one CPU each)
#SBATCH --cpus-per-task=1    # Threads per task, changing it has no effect on HOOMD simulations
#SBATCH --mem-per-cpu=300M   # Memory per CPU
#SBATCH --account=research-as-bn
#SBATCH --output=out.%j
#SBATCH --error=err.%j
#SBATCH --mail-user=<your email account>
#SBATCH --mail-type=ALL      ## you can also set BEGIN/END

# Load modules (they should match the software stack you built HOOMD with):
module load 2022r2
module load openmpi/4.1.1
module load py-numpy
module load intel/oneapi
module load tbb/2021.7.0

source /home/psillano/scratch/tools/hoomd-venv/bin/activate

srun -n $SLURM_NTASKS python3 your_hoomd_script.py
```

Launch the job with `sbatch run_job.sh`.

### Using an interactive session

To use an interactive session, run in the terminal:

```sh
module load 2022r2
module load openmpi/4.1.1
module load py-numpy
module load intel/oneapi
module load tbb/2021.7.0

source /home/psillano/scratch/tools/hoomd-venv/bin/activate

srun --job-name="run hoomd" --partition=compute --account=research-as-bn \
    --time=00:02:00 --ntasks=4 --cpus-per-task=1 --mem-per-cpu=400MB \
    --pty python3 lj_performance.py
```

### How to choose the number of CPUs

#### Very brief theory

The basic idea of MPI is to split your simulation box into domains and assign each of them to a single CPU. Each CPU then needs to compute forces, velocities, etc. only for a subset of your particles, which ideally leads to a speedup that is linear in the number of processors: with 8 CPUs you would get an 8x speedup with respect to the single-CPU baseline.

That is the theory. In practice it is not easy to get such a linear speedup, because the domains have to exchange information with each other. That is why **you need to benchmark** your simulation to find the best number of CPUs.

#### Setting the number of CPUs

The following lines from the sbatch file are the ones related to the number of CPUs:

```
#SBATCH --ntasks=4 # Number of requested CPUs
...
...
srun -n $SLURM_NTASKS python3 your_hoomd_script.py
```

`$SLURM_NTASKS` is an environment variable equal to the value of `--ntasks` defined above. This way your HOOMD script uses **all** the allocated CPUs.
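
To decide which value of `--ntasks` is worth requesting, it helps to turn the measured simulation speed (TPS, time steps per second) into a speedup and a parallel efficiency. The following is only a minimal sketch and is not part of HOOMD: `scaling_report` is a hypothetical helper name, and the TPS numbers are the ones from the quick benchmark table on this page, so replace them with your own measurements.

```python
def scaling_report(tps_by_ranks):
    """Return (ranks, tps, speedup, efficiency) rows relative to the 1-rank run.

    tps_by_ranks maps the number of MPI ranks to the measured TPS.
    The serial run (1 rank) must be included, since it is the baseline.
    """
    serial_tps = tps_by_ranks[1]
    rows = []
    for ranks in sorted(tps_by_ranks):
        tps = tps_by_ranks[ranks]
        speedup = tps / serial_tps      # how many times faster than serial
        efficiency = speedup / ranks    # 1.0 would be ideal linear scaling
        rows.append((ranks, tps, speedup, efficiency))
    return rows


# Measured TPS per number of MPI ranks (values taken from this page's table).
measured = {1: 15, 24: 224, 48: 495}

for ranks, tps, speedup, efficiency in scaling_report(measured):
    print(f"{ranks:>3} ranks: {tps:>4} TPS, "
          f"speedup {speedup:5.1f}x, efficiency {efficiency:4.0%}")
```

An efficiency well below 100% means that a good part of the CPUs you requested is spent on communication rather than on your particles, so a smaller request (and a shorter queue time) may be the better deal.
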

There is a small difference between the number of CPUs and the number of MPI tasks, but it is not central for our use (if you are interested, check the references).

For example, you can quickly test your simulation script and check your speedup factor (assuming your script prints the simulation speed somewhere):

```
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --time=00:05:00
#SBATCH --ntasks=20          # Number of requested CPUs
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=300M   # Memory per CPU
#SBATCH --account=research-as-bn
#SBATCH --output=out.%j
#SBATCH --error=err.%j
#SBATCH --mail-user=<your email account>
#SBATCH --mail-type=ALL      ## you can also set BEGIN/END

# Load modules:
module load 2022r2
module load openmpi/4.1.1
module load py-numpy
module load intel/oneapi
module load tbb/2021.7.0

source /home/psillano/scratch/tools/hoomd-venv/bin/activate

# Serial job - 1 CPU
srun -n 1 python3 your_hoomd_script.py

# Parallel job - 20 CPUs
srun -n 20 python3 your_hoomd_script.py
```

- Always compare your parallel run with a serial job (1 CPU only).
- If you are a benchmark nerd, you can check [GitHub - glotzerlab/hoomd-benchmarks: A collection of benchmarks for HOOMD-blue](https://github.com/glotzerlab/hoomd-benchmarks).

These are some very quick results I got with this benchmark helper:

| MPI processes | TPS (N = 50k) | Speedup |
|---------------|---------------|---------|
| 1 (serial)    | 15            | 1       |
| 24            | 224           | 15      |
| 48            | 495           | 33      |

So you can already see that the speedup is below the theoretical linear scaling, but you still get a lot of improvement!

# FAQ

### Why it is not always a good idea to use GPUs on DelftBlue

- The number of GPUs on DelftBlue is limited, so you may spend a long time in the queue, and the GPU speedup is then not worth the wait.
- Your system needs to be quite big (more than 10k particles) to make full use of the GPU.
- Sometimes it is useful to request fewer resources. There is always a trade-off between the speedup and the queuing time: the more resources you ask for, the longer you wait in the queue.

### If you want to try HOOMD MPI with GPU on your computer

You can also use GPUs on DelftBlue, but requesting them can lead to longer queuing times, so it is often not worth it. If instead you have a laptop/desktop with a good GPU, it could be worth a try. You need to take care of all the dependencies of HOOMD:

`$ <your favorite package-manager> install cmake eigen git python numpy pybind11 openmpi`

You will also need the NVIDIA CUDA Toolkit (at least 9.0). Change the cmake configuration command to

`cmake -B build/ -S . -D ENABLE_MPI=on -D ENABLE_TBB=on -D CMAKE_INSTALL_PREFIX="." -D ENABLE_GPU=on`

Follow the installation steps from step 3 and test your installation with:

```
python -c "import hoomd; print(hoomd.device.GPU())"
```

### Why you don't set the number of threads

HOOMD also has an option to use more than one thread (multithreading), but it is used only for HPMC (hard particle Monte Carlo) simulations, so setting more than 1 thread with the sbatch option `--cpus-per-task` will not give any speedup!

# Useful references

- [DelftBlue documentation](https://doc.dhpc.tudelft.nl/delftblue/)
- More info on [how to choose the number of MPI tasks and threads](https://www.ibm.com/docs/en/pessl/5.5?topic=performance-selecting-mpi-tasks-computational-threads)
- [Common SLURM environment variables](https://docs.hpc.shef.ac.uk/en/latest/referenceinfo/scheduler/SLURM/SLURM-environment-variables.html)
- From the mailing list: [TBB threads do not give more performance](https://groups.google.com/g/hoomd-users/c/Vdn0FVselkY/m/DaJ9PHeMAAAJ)
- Check my next post ;)