# VSC ReFrame meeting 2022-02-10

## Attendees

- Sam Moors (VUB)
- Kenneth Hoste (UGent)
- Franky Backeljauw (Antwerp)
- Michele Pugno (Antwerp)
- Robin Verschoren (Antwerp)
- Maxime Van den Bossche (Leuven)
- Steven Vandenbrande (Leuven)

## Agenda

## Notes

- currently separate systems for `hydra`, `hortense`, ...
- could also have a common `vsc` system
- could also be tags instead? (see the sketch after this section)
  - `vsc` tag for tests that should work anywhere
  - site-specific tags for tests that don't work everywhere (yet): `ugent`, `vub`, `kul`, ...
  - `mpi`, `single_node`, `gpu`?
- current tests work for:
  - VUB (Sam)
  - Hortense (Kenneth)
  - KUL (Steven)
    - default launcher in ReFrame config: mpirun
    - KUL tests assume this in their tests
  - UAntwerpen (Michele)
- launcher: site-specific or the same one everywhere?
  - mpirun (user-focused/Torque) vs srun (Slurm)
  - allowing a site-specific launcher (using an RE `vsc:*` tag??)
    - `vsc:torque`, `vsc:slurm`
  - identify the site via the `$VSC_INSTITUTE` environment variable
- common ReFrame version: 3.10.1 (developers are reckless)
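A rough illustration of the tag-based option (a sketch only; the test itself, the tag values, and deriving the site tag from `$VSC_INSTITUTE` are assumptions taken from the discussion above, not an agreed design):

```python
import os

import reframe as rfm
import reframe.utility.sanity as sn

# Site tag derived from $VSC_INSTITUTE (assumed values such as
# 'antwerpen', 'brussel', 'gent', 'leuven')
SITE = os.environ.get('VSC_INSTITUTE', 'unknown')


@rfm.simple_test
class HostnameCheck(rfm.RunOnlyRegressionTest):
    descr = 'Trivial check that should pass at any VSC site'
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'hostname'
    # 'vsc' marks tests expected to work everywhere; the site tag allows
    # selecting site-specific subsets with `reframe -t <tag>`
    tags = {'vsc', SITE}

    @sanity_function
    def assert_nonempty_output(self):
        return sn.assert_found(r'\S+', self.stdout)
```

Portable tests would then be selected with `reframe -t vsc -r`, and site-only tests with e.g. `reframe -t ugent -r`.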
## Conformity checks

* agreed in CUE (see list at ...)
* presence and validity of `VSC_` environment variables
* presence of system tools like `singularity`, ... + version
* availability + path of the different shared filesystems (home/data)
  * important for e.g. Globus
* local testing (Ghent VSC account on Ghent system) vs cross-site testing
* how do users check their storage quota (tools?)
  * UAntwerpen uses the `myquota` command
  * UGent Tier-2: via the accountpage
  * Tier-1 Hortense scratch:
* ideas
  * availability of common software modules?
    * ReFrame module
    * EESSI stack?
    * toolchains
  * MPI launcher?
    * srun (VUB, UA @ Vaughan)
    * mpirun (KUL, UA @ Leibniz)
    * mympirun (UGent)
  * testing the VSC network (connecting to other VSC sites, ...)
    * perfSONAR project (VSC project @ UA), `iperf`
    * should not be part of the ReFrame test suite
  * continuous performance monitoring
    * connectivity + performance

## Cluster health checks

* submitting simple jobs to different partitions, queues (in order of importance)
  * single-core, multi-core, multi-node?, single-GPU?
  * different schedulers might be a problem, e.g. Torque vs Slurm
* a job that tests itself and the env variables of the executing instance
  * test the node file or equivalent env variable
* tests that verify that recommendations in the docs work (and keep working)
* also list jobs, delete jobs, ...

## Cluster performance tests

* CPU tests
  * HPL (LINPACK)
  * c-ray (ray tracing)
  * BLAS-Tester
* memory tests
  * STREAM
* shared storage tests
  * IOR
* network tests
  * OSU Micro-Benchmarks (latency, bandwidth)
  * basic MPI tests (hello world, ring, ...)

## Application tests (commonly used software)

* CP2K
* GROMACS
* Python, numpy
* R, Bioconductor
* TensorFlow
* OpenFOAM
* collect data about:
  * functionality, verification of results, performance

## Security issues

- https://github.com/wpoely86/vsc-security

## Links

* CSCS: https://github.com/eth-cscs/reframe/tree/master/hpctestlib
* HPC-VUB: https://github.com/vub-hpc/reframe-tests
* StackHPC: https://github.com/stackhpc/hpc-tests
* HPC-UGent: https://github.ugent.be/hpcugent/vsc-testing/tree/master/ReFrame (private internal repo currently)
* HPC-KUL: (not public yet?)
* SURF: (not public yet?)
* Univ. of Birmingham: (not public yet?)
* non-ReFrame repos:
  * https://github.com/EESSI/eessi-demo

## How to run/check the tests?

* common repo: https://github.com/vscentrum/vsc-test-suite
* run from any site, automatically spawn to all clusters?
  * dedicated credits account on Leuven systems + Hortense?
* how to collect/present the data?
  * currently ReFrame only logs perf data
    * about to change, see https://github.com/eth-cscs/reframe/issues/2394
    * fake sanity output carrying performance data as a workaround?
  * send logs to an ELK stack?
    * Graylog + Grafana?
  * push it back into the GitHub repo? (easier)
    * otherwise, how do we access a running server with a log manager?
* run weekly/monthly?
  * difference between large and small tests?
* dealing with different scheduler frontends (Torque, Slurm)
  * using tags, create system partitions with a common prefix

## Goals by next meeting

- next meeting: Thu 10 Mar 2022 - 14:00
- folder structure (`reframe -R -r`), create a `tests` folder with: (Michele)
  - `run.sh --tags xxx --site yyy`
  - `tests`
    - `common.py`
    - `constants.py`
      - `UGENT = 'ugent'`
    - `cue`
      - `common.py`
      - `env.py`
    - `micro`
      - `mpi`
        - `common.py`
        - `hello.py`
    - `apps`
      - `python`
        - `numpy.py`
      - `openfoam`
        - `motorbike.py`
- CUE tests
  - env (Kenneth, Franky) - see CUE list (a minimal sketch follows at the end of these notes)
    - is it defined?
    - is the value correct?
    - do path variables point to existing paths? (different test class in the implementation)
  - tools (Sam, Michele)
    - is the command available?
    - check the version (range, greater than or equal to)
- shared FSs (Robin, Sam)
  - `/home`
  - `/data/<site>/<account>`
  - `/scratch/...`
- MPI hello world (Steven, Kenneth) (sketch at the end of these notes)
- Franky: share the list of env variables + tools for CUE
- script from Michele to run all the tests:

```
export BIN_DIR=/apps/antwerpen/reframe/versions/current/bin
export TESTS_DIR=/apps/antwerpen/reframe/testsuite
export RFM_CONFIG_FILE=$TESTS_DIR/config/settings.py

# recursively (-R) discover all checks (-c) under $TESTS_DIR and run them (-r),
# keeping stage, output and perflog directories inside the test-suite tree
$BIN_DIR/reframe -v \
    --prefix $TESTS_DIR \
    --perflogdir $TESTS_DIR/perflogs \
    -s $TESTS_DIR/stage \
    -o $TESTS_DIR/output \
    -c $TESTS_DIR \
    -R -r --performance-report
```
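For the CUE env goal, a minimal sketch of what `cue/env.py` could look like, against the ReFrame 3.10.1 agreed on above; the variable list is a placeholder until Franky shares the agreed CUE list, and the value-correctness and path-existence checks mentioned above would go in separate test classes:

```python
import reframe as rfm
import reframe.utility.sanity as sn

# Placeholder list; to be replaced by the agreed CUE list of variables
REQUIRED_VARS = ['VSC_INSTITUTE', 'VSC_HOME', 'VSC_DATA', 'VSC_SCRATCH']


@rfm.simple_test
class VscEnvVarsCheck(rfm.RunOnlyRegressionTest):
    descr = 'Check that the agreed VSC_* environment variables are defined'
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'env'
    tags = {'vsc', 'cue'}

    @sanity_function
    def assert_vars_defined(self):
        # `env` prints one VAR=value line per variable; require a
        # non-empty value for every expected variable
        return sn.all([
            sn.assert_found(rf'^{var}=\S+', self.stdout,
                            msg=f'{var} is not defined')
            for var in REQUIRED_VARS
        ])
```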
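Similarly, a possible starting point for the MPI hello world task (`micro/mpi/hello.py` in the proposed layout). This assumes an `mpi_hello.c` next to the test that prints one `Hello from rank N` line per rank, and a programming environment whose C compiler is an MPI wrapper; which launcher ends up in front of it (mpirun, srun, mympirun) is left to the site's ReFrame configuration:

```python
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class MpiHelloCheck(rfm.RegressionTest):
    descr = 'MPI hello world spanning two nodes'
    valid_systems = ['*']
    valid_prog_environs = ['*']
    # assumed source file that prints 'Hello from rank N' on every rank
    sourcepath = 'mpi_hello.c'
    num_tasks = 2
    num_tasks_per_node = 1
    tags = {'vsc', 'mpi'}

    @sanity_function
    def assert_all_ranks_reported(self):
        # every rank must report in exactly once
        return sn.assert_eq(
            sn.count(sn.findall(r'Hello from rank \d+', self.stdout)),
            self.num_tasks)
```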