# EESSI test suite sync meetings

## Planning

- every 2 weeks on Thursday at 14:00 CE(S)T
- next meetings:
    - Thu 14 Mar'24 14:00 CET

---

## Meeting (2024-02-29)

- Disable running CUDA-capable modules on CPU [#110](https://github.com/EESSI/test-suite/pull/110)
    - Merged
    - Now, for Snellius GPU config [here](https://github.com/EESSI/test-suite/blob/c8e917ca7ca5f2a290d60ae8d5cd75420ad2147f/config/surf_snellius.py#L102), it will NOT run OSU on host buffers. I don't think this is desired, I think we should have OSU _not_ use the hook...
- Always request GPUs on partitions that require it [#116](https://github.com/EESSI/test-suite/pull/116)
    - Reviewed, works
    - Small open request on prepending `check_always_request_gpu`
    - Merged!

### TODO before 0.2.0 release:

- [this](https://github.com/EESSI/test-suite/issues/101) can be closed now?
    - Closed!
- Remove 0.2.0 tag [here](https://github.com/EESSI/test-suite/issues/107)?
- Sam will try to clean up the code style before the 0.2.0
- Release when?

### Next steps?

- Satish: OpenFOAM test
    - Example in the OpenFOAM demo is not part of OpenFOAM 11 anymore, but we could still copy it from an older version and ship it with the test suite
    - software-layer demo is also broken for this reason
    - Formatting has changed for Form dictionary in OpenFOAM 11 compared to previous versions
    - After formatting adaptations: getting segfaults for the software-layer demo (motorbike case)
- ESPResSo (ask Xin?)
- PyTorch (Caspar)
- LAMMPS (Lara)
    - will reach out to Tilen Potisk
    - Multiple options:
        - suggestions from Tilen (`bench/in.lj` and `bench/in.rhodo`)
        - from Alan (from an HPC Carpentry lesson: https://www.hpc-carpentry.org/tuning_lammps/04-lammps-bottlenecks/index.html#case-study-rhodopsin-system)
        - Meluxina has an internal LAMMPS test that Kenneth asked them to share
- Sam: Cleanup! :)
- Test json-export format from ReFrame
- CP2K (Sam)
- Add `ALWAYS_REQUEST_GPU` to `surf_snellius.py` config (Caspar)
- Add CUDA-capable OSU to EESSI (Caspar)
- Add the `--report-file` option to daily runs on Vega / Karolina, or put it in the config file (?) (Caspar)
- Create release notes for 0.2.0 (Lara)

## Meeting (2024-02-14)

- Unify test names `EESSI_<SOFTWARE_NAME>` [#108](https://github.com/EESSI/test-suite/pull/108)
    - Merged
- Filter invalid scales in ReFrame config [#111](https://github.com/EESSI/test-suite/pull/111)
    - Requires ReFrame config files to be adapted and set ALL scales as features (unless you want to exclude them)
    - Merged
    - Related docs PR [#156](https://github.com/EESSI/docs/pull/156)
        - Merged
- Software layer bot config added [#112](https://github.com/EESSI/test-suite/pull/112)
    - Merged
    - Removed again in [#113](https://github.com/EESSI/test-suite/pull/113)
        - Since it relates to `software-layer` CI, it makes more sense to keep it there. Now part of [#467](https://github.com/EESSI/software-layer/pull/467).
        - Merged
- `common_eessi_init` function now works for both pilot and production repos [#114](https://github.com/EESSI/test-suite/pull/114)
    - Merged
- Disable running CUDA-capable modules on CPU [#110](https://github.com/EESSI/test-suite/pull/110)
    - Hook now defines that if it is a CUDA module and the requested device type is CPU, the partition should have _both_ the CPU and GPU features
    - Should we do this at the hook level (i.e. apply to all software?) Or only for OSU in particular?
        - E.g. GROMACS, TensorFlow or PyTorch built against GPU would run just fine.
    - Let's discuss [this comment](https://github.com/EESSI/test-suite/pull/110/files#r1489764397)
- Always request GPUs on partitions that require it [#116](https://github.com/EESSI/test-suite/pull/116)
    - Satish: did you plan to review?
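
The rule noted for #110 above (a CUDA module requested on device type CPU needs a partition with _both_ the CPU and GPU features) can be sketched as below. This is only an illustration of the logic as written in these notes: the function names and the feature strings `cpu` / `gpu` are hypothetical and are not taken from the test suite's actual hook code.

```python
# Illustrative sketch only: names and feature strings are assumptions,
# not the EESSI test suite's real hook API.

def required_partition_features(device_type, is_cuda_module):
    """Return the set of partition features a test would need."""
    if device_type not in ("cpu", "gpu"):
        raise ValueError(f"unknown device type: {device_type!r}")
    features = {device_type}
    # Rule from the notes: a CUDA module running a CPU test still needs
    # a partition that also offers the GPU feature.
    if is_cuda_module and device_type == "cpu":
        features.add("gpu")
    return features


def partition_is_valid(partition_features, device_type, is_cuda_module):
    """True if the partition offers every feature the test needs."""
    needed = required_partition_features(device_type, is_cuda_module)
    return needed <= set(partition_features)


if __name__ == "__main__":
    # A pure CPU partition is filtered out for a CUDA module ...
    print(partition_is_valid(["cpu"], "cpu", is_cuda_module=True))
    # ... while a partition exposing both features is kept.
    print(partition_is_valid(["cpu", "gpu"], "cpu", is_cuda_module=True))
```

Applying this at hook level affects every CUDA-built module, which is exactly the open question above: software such as GROMACS that only loads CUDA at runtime would be excluded from pure CPU partitions as well.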
### TODO before 0.2.0 release:

- common_config update to use EESSI_CVMFS_REPO, so that it can be used with `software.eessi.io` => Done!
- GROMACS PR to software-layer => what's the status?
    - GROMACS might take a longer time to actually make it into the production repo.
    - The only relation with the test suite is that we would like to swap the CI for the test suite to use `software.eessi.io`, but can't as long as GROMACS is missing there
    - We don't want to wait for this, it is not essential to have that CI swap before the 0.2.0 release
- OSU test LD_LIBRARY_PATH removed and implement hook for filtering CUDA module based tests on pure CPU nodes => [#110](https://github.com/EESSI/test-suite/pull/110)
- Sam will try to clean up the code style before the 0.2.0
- Let's aim for a release on 29/02/2024

### Next steps?

- Satish: OpenFOAM test
    - Example in the OpenFOAM demo is not part of OpenFOAM 11 anymore, but we could still copy it from an older version and ship it with the test suite
    - software-layer demo is also broken for this reason
- ESPResSo (ask Xin?)
- PyTorch
- LAMMPS (Lara)
    - will reach out to Tilen Potisk
- Sam: Cleanup! :)
- Test json-export format from ReFrame
- CP2K

## Meeting (2024-02-01)

- test-suite on Hortense
    - Issue [2970](https://github.com/reframe-hpc/reframe/issues/2970) was opened in ReFrame 24 Aug by @vkarak.
    - Will be addressed by @boegel and Lara.
    - Linked to the following issue [68](https://github.com/EESSI/test-suite/issues/68) in the test-suite.
    - Lara submits for every partition but will be discussed further.
    - Lara will make more clear in [this issue](https://github.com/reframe-hpc/reframe/issues/2970#issuecomment-1920987171) that passing `--partition` to sbatch on the command line actually solved our issue, and ask whether ReFrame can do this (based on the `--access` configuration) and/or whether it could be configured separately
- OSU tests
    - It's merged!
    - Three updates to be made:
        - LD_LIBRARY_PATH removed (see below) [Satish]
        - Docs [Lara]
        - Changing name of the test: OSU_Microbenchmark_EESSI [Kenneth], TensorFlow_EESSI [Kenneth] => https://github.com/EESSI/test-suite/pull/108
- CUDA modules on pure CPU partitions [#101](https://github.com/EESSI/test-suite/issues/101)
    - Isn't that _test_ specific? GROMACS works fine, because it does a dlopen in the code path. Executables that are dynamically linked to CUDA cannot be run on non-GPU nodes.
    - Yes, but the _easiest_ is just to say 'we never use CUDA modules on pure CPU nodes'. And it's a very clear rule.
    - Satish will implement and roll back the `LD_LIBRARY_PATH` in OSU test
- bot now picks up on `bot/test.sh` and `bot/check-test.sh` script in target repo
    - How do we proceed? Who?
    - We will just run TF & OSU, 1 node (or 2 cores in case of OSU pt to pt), irrespective of which software was installed
    - Next step will be to filter on _relevant_ tests that are related to the actual change in the software-layer PR.
    - Caspar will have a look at this
- Test suite doc improvements from Xin [here](https://github.com/EESSI/docs/pull/143)
    - Needs another review? => Lara will have another look
- Filter out incompatible scales [#100](https://github.com/EESSI/test-suite/issues/100)
    - Good idea. Any idea how? Who can do it?
    - Caspar will take a stab at this
- Discuss ReFrame meeting yesterday
    - Two options:
        - Option 1: We use the perflog mechanism that's already there, and _add_ a field to indicate if the result should be used as reference.
            - Challenge: what if you upgrade your system? You'll have to alter the field that indicates if results are used as reference and put all those to
        - Option 2: Have ReFrame export/add performance numbers from a run to a database (e.g. passing `reframe --export-references=<my_sql_database>`), together with the test hash + system + partition. Then, have ReFrame read those performance numbers (or an average) from a query on that SQL database (`reframe --use-reference=<my_sql_database>`)
- Kenneth will create an issue to update the common_config so that it picks up on `EESSI_CVMFS_REPO` to select the right repository (based on the current environment)
    - => https://github.com/EESSI/test-suite/issues/107
- Satish is working on OpenFOAM + ESPResSo
    - Have a look at the example for fixtures, it also contains examples of how to reuse the stage dir in the dependent tests
- Sam will look into the httpjson perflog handler
    - [docs](https://reframe-hpc.readthedocs.io/en/stable/config_reference.html) see `logging.handlers_perflog.type`

TODO before 0.2.0 release:

- common_config update to use EESSI_CVMFS_REPO, so that it can be used with `software.eessi.io`
- GROMACS PR to software-layer
- OSU test LD_LIBRARY_PATH removed and implement hook for filtering CUDA module based tests on pure CPU nodes

## Meeting (2024-01-18)

- test-suite on Hortense
    - Issue [2970](https://github.com/reframe-hpc/reframe/issues/2970) was opened in ReFrame 24 Aug by @vkarak.
    - Will be addressed by @boegel and Lara.
    - Linked to the following issue [68](https://github.com/EESSI/test-suite/issues/68) in the test-suite.
    - Lara submits for every partition but will be discussed further.
- Merged [#96](https://github.com/EESSI/test-suite/pull/96) which adds `--mem` to configuration files
    - This was done for each partition and can be done commonly for all partitions.
    - Caspar: Agreed, should be in some common config. Who will do it? How?
        - Maybe a `options: eessi.testsuite.common_config.get_common_options()` would be enough?
- OSU tests
    - Sam reviewed, comments need to be checked by Satish
        - https://github.com/EESSI/test-suite/pull/54#discussion_r1451741808
    - Running CUDA modules on the pure CPU nodes using stubs:
        - Currently, CUDA module generating pure `cpu` test will fail on `cpu` nodes.
        - Currently, remove the `cpu` tests from CUDA modules.
        - GROMACS CUDA module runs on CPU devices without complaining, whereas OSU crashes.
        - Should we not allow running CUDA modules on CPU nodes at all?
        - Currently not a blocker, but open an issue.
    - 32 GB of memory for point to point tests is too much.
        - Contact OSU for checking this and also better error reporting.
        - Currently not a blocker, but open an issue.
        - Play with this option: `-M, --mem-limit SIZE` (set per process maximum memory consumption to SIZE bytes)
    - Install CUDA OSU module, talk to Snellius system admins and get an update on Caspar's request.
    - Lara tested on Hortense CPU, had issues on GPU but those seemed not specific to OSU.
    - Merge now including collectives and figure out the problems later.
    - Hand the test-suite to other partners.
- bot now picks up on `bot/test.sh` and `bot/check-test.sh` script in target repo
    - currently as part of the build phase, in build environment
    - bot is ready but not doing anything for now: OSU and TensorFlow good candidates.
    - GROMACS tests have been failing.
- Xin tested docs to see if it was clear how to run (tested on Snellius)
    - Some issues, you have to be quite careful in what you do
    - Xin will create issue / PR with suggestions for improvement
    - PR is opened and WIP.
        - https://github.com/EESSI/docs/pull/143
- MultiXscale deliverable finished and is online.
- goals for next weeks
    - Sam/Satish: finish OSU PR
    - Sam
        - CUDA samples
        - maybe port over test from VUB test suite to EESSI test suite
    - Kenneth:
        - maybe look into GROMACS CI test
    - Xin:
        - docs
        - Espresso test
    - Satish
        - Espresso test along with Xin.
        - OpenFOAM test (https://github.com/eessi/eessi-demo)
    - fix GROMACS CI test when there's too many cores
        - skip if there's too many cores available per node
        - print message that there's too many cores available, give useful suggestion

## Meeting (2023-12-06)

- test-suite on Hortense
    - Issue [2970](https://github.com/reframe-hpc/reframe/issues/2970) was opened in ReFrame 24 Aug by @vkarak.
    - Linked to the following issue [68](https://github.com/EESSI/test-suite/issues/68) in the test-suite.
    - I did some testing to see if adding more parameters in the job submission command would work.
        - On Tier-2 at UGent this bypasses the SLURM settings that are set and allows you to submit to the right cluster by specifying `--clusters` in the command
        - On Hortense however passing `--partition` in the command does not override the set environment and the job is submitted to the wrong partition
    - For testing the EESSI software stack, one solution could be to just purge the sticky modules before running the ReFrame command _or_ unset `SBATCH_PARTITION`
    - For testing the local software stack, you might want to keep the modules and _only_ unset `SBATCH_PARTITION`
        - E.g. `0 0 * * * unset SBATCH_PARTITION; EESSI_CI_SYSTEM_NAME=<cluster_name> /path/to/test-suite/CI/run_reframe_wrapper.sh`
        - or `0 0 * * * module purge --force; EESSI_CI_SYSTEM_NAME=<cluster_name> /path/to/test-suite/CI/run_reframe_wrapper.sh`
- Merged [#96](https://github.com/EESSI/test-suite/pull/96) which adds `--mem` to configuration files
    - Will be used in OSU test
- OSU tests
    - Sam reviewed, comments need to be checked by Satish
        - E.g. `device_type` should be a parameter, since a CUDA based module can/should also do CPU based tests
    - Lara tested on Hortense CPU, had issues on GPU but those seemed not specific to OSU
    - We will keep collectives 'as is' (though we change the obvious things that will also be changed for pt2pt)
- Merged CI config [PR #93](https://github.com/EESSI/test-suite/pull/93)
- Merged "split off `assign_default_num_cpus_per_node` into its own hook" ([PR #95](https://github.com/EESSI/test-suite/pull/95))
- bot now picks up on `bot/test.sh` and `bot/check-test.sh` script in target repo
    - currently as part of the build phase, in build environment
- Xin tested docs to see if it was clear how to run (tested on Snellius)
    - Some issues, you have to be quite careful in what you do
    - Xin will create issue / PR with suggestions for improvement
    - can split up Installation & Configuration page into 3 pages:
        - Installation
        - Basic configuration:
            - minimal required things to get EESSI test suite working
            - template configuration file to copy-paste
            - step-by-step guide
            - default configuration that can be updated
                - via ReFrame configuration hierarchy support?
                - via `default_configuration` that is imported into ReFrame configuration file?
        - Advanced configuration
- goals for next weeks
    - Caspar/Kenneth/Lara: finish MultiXscale deliverable
    - Sam/Satish: finish OSU PR
    - Sam
        - CUDA samples
        - maybe look into GROMACS CI test
        - maybe port over test from VUB test suite to EESSI test suite
    - Xin:
        - docs
        - Espresso test
    - fix GROMACS CI test when there's too many cores
        - skip if there's too many cores available per node
        - print message that there's too many cores available, give useful suggestion

## Meeting (2023-11-22)

- attending: Caspar, Satish, Xin, Kenneth, Lara, Sam
- merged PRs
    - add scales `1_cpn_2_nodes` and `1_cpn_4_nodes` ([PR #94](https://github.com/EESSI/test-suite/pull/94))
- OSU test ([PR #54](https://github.com/EESSI/test-suite/pull/54))
    - Sam's suggestions have been processed (but not pushed to PR yet)
    - `cpus-per-task` is missing when using `2_nodes`
        - will be fixed by new `assign_default_num_cpus_per_node` hook introduced in PR #95 when that's called from OSU test
    - also need to specify `--mem` to make sure enough memory is available
        - config files need to be updated to specify how to request amount of memory
        - would also be useful to know max mem per node from config file (nice to have)
        - required to fix test failure on Hortense because default memory per core is not sufficient
        - also requires update to docs
        - => follow-up PR after merging PR #54
    - Sam did not look at collective test yet
- Updated CI driving scripts ([PR #93](https://github.com/EESSI/test-suite/pull/93))
    - issue on AWS cluster with 2-node tests
        - it's not overlapping partitions
        - probably just a configuration issue with the clusters
        - can easily create additional nodes to rule out problem with CI config
    - Caspar also has a config ready for Karolina EuroHPC system
        - not tested on GPU partition (since we don't have access)
        - Caspar had to ask to also install CVMFS on login nodes
- split off `assign_default_num_cpus_per_node` into its own hook ([PR #95](https://github.com/EESSI/test-suite/pull/95))
    - only code restructuring, no functional changes
    - dry run in CI confirms nothing got broken
    - will be useful for OSU PR
- GPU test
    - GPU support in EESSI is coming soon (tm), see https://hackmd.io/CT1cMyT5SIShddFFTMqqjA?view
    - OSU test in ReFrame hpctestlib also supports GPU
    - smoke test: run CUDA sample binary
    - GROMACS
        - using `--cpu-only` + `--gpu-only` options supported by ReFrame ([issue #83](https://github.com/EESSI/test-suite/issues/83))
            - we set `gpus_per_nodes` after setup phase, so ReFrame is not aware of it
            - feature request for ReFrame to filter generated tests after we set `gpus_per_nodes`?
            - `--cpu-only` and `--gpu-only` are being deprecated in favor of `-E` option
- plan a meeting with Vasileios/ReFrame
    - interested in joining: Caspar, Kenneth, Sam, Satish
    - on making tests more portable w.r.t. performance checks
    - could ReFrame always log perf data in machine-readable format (JSON?)
- extras in configuration file
    - currently only "equals" comparison is possible
    - would be nice to do >= check (for memory for example)
        - so tests that require too much memory would not be generated
    - other alternative is to skip tests based on available memory specified in config file (after tests were generated)
    - should open issue on this?
- Xin is planning to play with test suite
    - run test suite on SURF's Spider cluster
- OpenFOAM test
- also (relevant for MultiXscale)
    - Espresso
        - https://espressomd.github.io/doc/introduction.html#sample-scripts
    - LAMMPS
        - https://github.com/lammps/lammps/tree/develop/examples
        - have good contacts for this in the project
- GROMACS test fails when there's too many cores per node
    - should skip CI on 192-core Genoa nodes on SURF

-------

## Previous meetings

- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2024-01-18)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-12-06)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-11-22)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-11-08)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-10-19)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-10-04)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-09-20)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-09-06)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-08-25)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-08-09)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-software-testing-(27%E2%80%9007%E2%80%902023)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-06-28)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-06-15)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-05-31)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-05-17)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-04-20)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-03-30)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-03-10) (incl. 2023-02-23)
- https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-test-suite-(2023-02-09)