# VSC ReFrame meeting 2022-02-10
## Attendees
- Sam Moors (VUB)
- Kenneth Hoste (UGent)
- Franky Backeljauw (Antwerp)
- Michele Pugno (Antwerp)
- Robin Verschoren (Antwerp)
- Maxime Van den Bossche (Leuven)
- Steven Vandenbrande (Leuven)
## Agenda
## Notes
- currently separate systems for `hydra`, `hortense`, ...
- could also have a common `vsc` system
- could also be tags instead?
    - `vsc` tag for tests that should work anywhere
    - site-specific tags for tests that don't work everywhere (yet): `ugent`, `vub`, `kul`, ...
    - `mpi`, `single_node`, `gpu`? (a tagging sketch follows these notes)
- current tests work for:
    - VUB (Sam)
    - Hortense (Kenneth)
    - KUL (Steven)
        - default launcher in ReFrame config: mpirun
        - KUL tests assume this
    - UAntwerpen (Michele)
- launcher: site-specific or same one everywhere
    - mpirun (user-focused/Torque) vs srun (Slurm)
    - allowing a site-specific launcher (selected via a regex on `vsc:*` tags?)
        - `vsc:torque`, `vsc:slurm`
    - identify site via `$VSC_INSTITUTE` environment variable (see the launcher sketch after these notes)
- pin a common ReFrame version: 3.10.1 (upstream developers are reckless with breaking changes)
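A minimal sketch of the tagging idea, assuming ReFrame 3.10-style builtins; the tag names are the ones proposed above:

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class HelloVscTest(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'echo'
    executable_opts = ['hello', 'VSC']
    # 'vsc' marks tests that should work anywhere; site tags
    # ('ugent', 'vub', ...) would only go on tests that are not portable yet
    tags = {'vsc', 'single_node'}

    @sanity_function
    def assert_output(self):
        return sn.assert_found(r'hello VSC', self.stdout)
```

Selecting tests by tag would then be e.g. `reframe -c tests/ -t vsc -r`.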
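And a sketch of the site-specific launcher idea, keyed on `$VSC_INSTITUTE`. The mapping below is an assumption based on the launcher notes in the next section; UAntwerpen differs per cluster, and UGent's `mympirun` would need a custom launcher registered with ReFrame, so it is left out here:

```python
import os

import reframe as rfm
import reframe.utility.sanity as sn
from reframe.core.backends import getlauncher

# Assumed per-site launcher preferences (illustrative, not an agreed standard)
SITE_LAUNCHER = {
    'brussel': 'srun',
    'leuven': 'mpirun',
    'antwerpen': 'mpirun',   # actually differs per cluster (Vaughan vs Leibniz)
}


@rfm.simple_test
class SiteLauncherHostname(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'hostname'
    num_tasks = 2
    tags = {'vsc', 'mpi'}

    @run_before('run')
    def select_launcher(self):
        # $VSC_INSTITUTE identifies the site (antwerpen/brussel/gent/leuven)
        site = os.environ.get('VSC_INSTITUTE', '')
        launcher = SITE_LAUNCHER.get(site)
        if launcher is not None:
            self.job.launcher = getlauncher(launcher)()

    @sanity_function
    def assert_ran(self):
        return sn.assert_found(r'\S+', self.stdout)
```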
## Conformity checks
* as agreed in the CUE (Common User Environment) list (see list at ...)
* presence and validity of `VSC_` environment variables (a sketch follows this list)
* presence of system tools like `singularity`, ... + version
* availability + path of the different shared filesystems (home/data)
    * important for e.g. Globus
* local testing (Ghent VSC account on Ghent system) vs cross-site testing
* how do users check storage quota (tools?)
    * UAntwerpen uses the `myquota` command
    * UGent Tier-2: via the account page
    * Tier-1 Hortense scratch:
* ideas
    * availability of common software modules?
        * ReFrame module
        * EESSI stack?
        * toolchains
    * MPI launcher?
        * srun (VUB, UA @ Vaughan)
        * mpirun (KUL, UA @ Leibniz)
        * mympirun (UGent)
    * testing the VSC network (connecting to other VSC sites, ...)
        * perfSONAR project (VSC project @ UA), `iperf`
        * should not be part of the ReFrame test suite
        * continuous performance monitoring
        * connectivity + performance
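A sketch of the `VSC_` variable check, parameterized over a hypothetical subset of the CUE list (the full list is still to be shared, see the goals below):

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class VscEnvVarDefined(rfm.RunOnlyRegressionTest):
    # illustrative subset; the real list comes from the CUE document
    envvar = parameter(['VSC_HOME', 'VSC_DATA', 'VSC_SCRATCH', 'VSC_INSTITUTE'])
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'printenv'
    tags = {'vsc', 'cue'}

    @run_after('init')
    def set_executable_opts(self):
        self.executable_opts = [self.envvar]

    @sanity_function
    def assert_defined(self):
        # printenv prints nothing (and exits non-zero) if the variable is
        # undefined, so requiring a non-empty value covers both cases
        return sn.assert_found(r'\S+', self.stdout)
```

Checking that path variables point to existing paths would go in a separate test class, as noted in the goals below.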
## Cluster health checks
* submitting simple jobs to different partitions, queues (in order of importance)
    * single-core, multi-core, multi-node?, single-GPU?
    * different schedulers might be a problem, e.g. Torque vs Slurm
* a job that tests itself and the env variables of the executing instance
    * test the node file or equivalent env variable (see the sketch after this list)
* tests verify that recommendations in the docs work (and keep working)
* also list jobs, delete jobs, ...
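A sketch of such a self-inspecting job: it echoes the Slurm node list, falling back to the Torque node file, so the same test covers both schedulers:

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class NodeListSelfTest(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'echo'
    # Slurm sets SLURM_JOB_NODELIST; Torque provides a node file instead
    executable_opts = ['${SLURM_JOB_NODELIST:-$(cat $PBS_NODEFILE 2>/dev/null)}']
    num_tasks = 1
    tags = {'vsc', 'single_node'}

    @sanity_function
    def assert_nodelist_nonempty(self):
        return sn.assert_found(r'\S+', self.stdout)
```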
## Cluster performance tests
* CPU tests
    * HPL (LINPACK)
    * c-ray (ray tracing)
    * BLAS-Tester
* memory tests
    * STREAM (see the sketch after this list)
* shared storage tests
    * IOR
* network tests
    * OSU Micro-Benchmarks (latency, bandwidth)
    * basic MPI tests (hello world, ring, ...)
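A sketch of a STREAM check that reports a performance metric, assuming a `stream.c` in the test's `src/` directory; the compiler flags and metric name are illustrative:

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class StreamTest(rfm.RegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    build_system = 'SingleSource'
    sourcepath = 'stream.c'   # expected in the test's src/ directory
    num_tasks = 1
    tags = {'vsc', 'micro'}

    @run_before('compile')
    def set_cflags(self):
        self.build_system.cflags = ['-O3', '-fopenmp']

    @sanity_function
    def assert_validated(self):
        return sn.assert_found(r'Solution Validates', self.stdout)

    @performance_function('MB/s')
    def triad_bandwidth(self):
        # STREAM prints e.g. 'Triad:   12345.6  ...'
        return sn.extractsingle(r'Triad:\s+(\S+)', self.stdout, 1, float)
```

Performance values logged this way end up in ReFrame's perflogs, which ties into the data-collection question below.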
## Application tests (commonly used software)
* CP2K
* GROMACS
* Python, numpy (see the sketch after this list)
* R, Bioconductor
* TensorFlow
* OpenFOAM
* collect data about:
    * functionality, verification of results, performance
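As an example of collecting functionality and performance from a single application test, a numpy sketch; the module name `SciPy-bundle` and the matrix size are assumptions:

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class NumpyDotTest(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    modules = ['SciPy-bundle']   # assumed module providing numpy
    executable = 'python'
    executable_opts = [
        '-c',
        '"import numpy, time; '
        'a = numpy.random.rand(2048, 2048); '
        't0 = time.time(); numpy.dot(a, a); '
        "print('dot_time:', time.time() - t0)\""
    ]
    tags = {'vsc', 'apps'}

    @sanity_function
    def assert_ran(self):
        # functionality: the import worked and the product was computed
        return sn.assert_found(r'dot_time:', self.stdout)

    @performance_function('s')
    def dot_time(self):
        # performance: wall time of the matrix product
        return sn.extractsingle(r'dot_time:\s+(\S+)', self.stdout, 1, float)
```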
## Security issues
- https://github.com/wpoely86/vsc-security
## Links
* CSCS: https://github.com/eth-cscs/reframe/tree/master/hpctestlib
* HPC-VUB: https://github.com/vub-hpc/reframe-tests
* StackHPC: https://github.com/stackhpc/hpc-tests
* HPC-UGent: https://github.ugent.be/hpcugent/vsc-testing/tree/master/ReFrame (private internal repo currently)
* HPC-KUL: (not public yet?)
* SURF (not public yet?)
* Univ. of Birmingham (not public yet?)
* non-ReFrame repos:
    * https://github.com/EESSI/eessi-demo
## How to run/check the tests?
* common repo: https://github.com/vscentrum/vsc-test-suite
* run from any site, automatically fan out to all clusters?
    * dedicated credits account on Leuven systems + Hortense?
* how to collect/present the data?
    * currently ReFrame only logs performance data
        * about to change, see https://github.com/eth-cscs/reframe/issues/2394
        * workaround: report sanity results as fake performance data?
    * send logs to an ELK stack?
        * Graylog + Grafana?
    * push results back into the GitHub repo? (easier)
        * otherwise, how do we access a running server with the log manager?
* run weekly/monthly?
    * distinguish between large and small tests?
* dealing with different scheduler frontends (Torque, Slurm)
    * using tags, create system partitions with a common prefix (config sketch below)
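A minimal sketch of the partition-prefix idea in a ReFrame configuration file; the system name, hostnames, and partition details are placeholders:

```python
# Sketch for config/settings.py: every site defines partitions with a
# common 'vsc-' prefix so the same tests can target them uniformly.
site_configuration = {
    'systems': [
        {
            'name': 'hydra',
            'descr': 'VUB Tier-2 cluster (placeholder values)',
            'hostnames': [r'login\d+\.hydra\..*'],
            'partitions': [
                {
                    'name': 'vsc-cpu',
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'environs': ['builtin'],
                },
                {
                    'name': 'vsc-gpu',
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'access': ['--partition=gpu'],
                    'environs': ['builtin'],
                },
            ],
        },
    ],
}
```

A Torque-based site would keep the same partition names but set `'scheduler': 'torque'` and e.g. `'launcher': 'mpirun'`.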
## Goals by next meeting
- next meeting: Thu 10 Mar 2022 - 14:00
- folder structure (`reframe -R -r`), create a `tests` folder with: (Michele)
    - `run.sh --tags xxx --site yyy`
    - `tests`
        - `common.py`
        - `constants.py`
            - `UGENT = 'ugent'`
        - `cue`
            - `common.py`
            - `env.py`
        - `micro`
            - `mpi`
                - `common.py`
                - `hello.py`
        - `apps`
            - `python`
                - `numpy.py`
            - `openfoam`
                - `motorbike.py`
- CUE tests
    - env (Kenneth, Franky)
        - see CUE list
        - is it defined?
        - is the value correct?
        - do path variables point to existing paths? (separate test class in the implementation)
    - tools (Sam, Michele)
        - is the command available?
        - check version (range, greater than or equal to a minimum; see the version-check sketch at the end of this section)
    - shared FSs (Robin, Sam)
        - `/home`
        - `/data/<site>/<account>`
        - `/scratch/...`
    - MPI hello world (Steven, Kenneth)
- Franky: share the CUE list of env variables + tools
- Michele: script to run all the tests
```
# paths to the site's ReFrame installation and the test suite
export BIN_DIR=/apps/antwerpen/reframe/versions/current/bin
export TESTS_DIR=/apps/antwerpen/reframe/testsuite
export RFM_CONFIG_FILE=$TESTS_DIR/config/settings.py
# recursively discover (-R) and run (-r) all tests under $TESTS_DIR
$BIN_DIR/reframe -v --prefix $TESTS_DIR --perflogdir $TESTS_DIR/perflogs \
    -s $TESTS_DIR/stage -o $TESTS_DIR/output -c $TESTS_DIR \
    -R -r --performance-report
```
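For the tools check, a version test could look like the sketch below; the expected output format is the usual `singularity version X.Y.Z`, and the minimum version is a made-up example:

```python
import reframe as rfm
import reframe.utility.sanity as sn

MIN_VERSION = (3, 8, 0)   # illustrative minimum, not an agreed value


@rfm.simple_test
class SingularityVersionCheck(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'singularity'
    executable_opts = ['--version']
    tags = {'vsc', 'cue'}

    @sanity_function
    def assert_min_version(self):
        # output looks like 'singularity version 3.8.7-1.el8'
        ver = sn.extractsingle(r'version\s+(\d+\.\d+\.\d+)', self.stdout, 1)
        found = tuple(int(x) for x in ver.evaluate().split('.'))
        return sn.assert_ge(found, MIN_VERSION)
```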