# EasyBuild PyTorch working group

## Goal

How can we streamline process to support new PyTorch versions?

## Action points

*(based on discussion on 20251009 - by KH)*

- **[Kenneth]** implement support for `--test-step-mode` in framework
  - `skip`, `minimal`, `basic` (default), `full` as possible values?
  - `--test-step-mode=skip` as replacement for `--skip-test-step`?
- **[Alexander]** make PyTorch easyblock aware of it
  - at least for `basic` mode
  - current PyTorch test step would then correspond to `full`
- **[Alex?]** define clear policy w.r.t. merging PyTorch easyconfig PRs
  - minimal set of test installation that should pass
  - break up process to add support for new PyTorch versions
    - experimental easyconfig (passing basic test step)
    - mature easyconfig (passing full test step)
    - report problems in issue, follow up in subsequent PRs (adding more patch files, etc.)
- **[Kenneth]** implement support for experimental easyconfigs
  - `experimental` easyconfig parameter that can be set to `True`
  - don't include those easyconfig files in EasyBuild release?
  - auto-add `-EXPERIMENTAL` to module name + auto-hide module file
  - for PyTorch: passing `basic` test step is sufficient to merge as experimental easyconfig file
- **[Loris? Alex?]** evaluate performance benefits of from-source PyTorch installation
  - vs container images
  - vs `pip install torch` installation

## Meetings

### 20251212

*(Fri 12 Dec'25 14:00 CET)*

attendees:

- Kenneth Hoste (HPC-UGent, Belgium)
- Alex Domingo (Vrije Universiteit Brussel, Belgium)
- Lara Peeters (HPC-UGent, Belgium)
- Emmanuel Kiefer (LuxProvide, Luxembourg)
- Alexander Grund, a.k.a.
  Flamefire (ZIH, TU Dresden, Germany)

#### Notes

- PyTorch 2.7.1 test suite
  ```
  == FAILED: Installation ended unsuccessfully: An error was raised during test step: 'Too many failed tests (8), maximum allowed is 6' (took 45 hours 20 mins 28 secs)
  == Results of the build can be found in the log file(s) /tmp/eb-znw1uvam/easybuild-PyTorch-2.7.1-20251207.140221.MdHXg.log
  WARNING: 8 test failures, 0 test errors (out of 262242):
  distributed/_composable/fsdp/test_fully_shard_state_dict (1 failed, 1 passed, 4 skipped, 0 errors)
  distributed/test_store (1 failed, 32 passed, 0 skipped, 0 errors)
  dynamo/test_error_messages (1 failed, 33 passed, 0 skipped, 0 errors)
  inductor/test_cpu_select_algorithm (2 failed, 22 passed, 1577 skipped, 0 errors)
  inductor/test_select_algorithm (2 failed, 17 passed, 0 skipped, 0 errors)
  test_testing (1 failed, 1997 passed, 67 skipped, 0 errors)
  ```
  - test build took ~45h, 8 failing tests out of 262k (~0.003%)
  - filter out individual tests that take too long (see EasyBuild log)
  - PyTorch devs shard the tests across different systems
- core test set
  - tests that catch problems with compatibility of dependency versions
    - CUTLASS, Triton
- cut up development phase
  - initially: only core tests + "important" (groups of) tests that really caught problems for us in the past
    - tagged as experimental
    - should also be triggered for stuff that depends on this PyTorch
      ```python
      name = 'torchvision'
      dependencies = [
          # PyTorch/2.7.1-foss-2025a-CUDA-13
          # $EASYBUILD_EXPERIMENTAL_EASYCONFIGS=PyTorch,torchvision
          ('PyTorch', '2.7.1'),
      ]
      experimental = True  # implies hidden module
      ```
    - makes it easier to install stuff on top of it (which may also surface problems)
    - experimental modules should be hidden from users
      - module load message to signal to users that this module is experimental
      - 'experimental' value for moduleclass
    - still include in EasyBuild release
- policy for "good enough to be experimental"
  - failing tests in full test suite (<1% failing tests)?
  - check # failing tests per test suite (max. 1% failing per test suite?)
    - example:
      ```
      Failed suites (12):
      distributed/tensor/test_attention (2/6)
      distributed/test_c10d_functional_native (1/32)
      distributed/test_c10d_nccl (1/6)
      dynamo/test_misc (2/596)
      dynamo/test_structured_trace (3/29)
      inductor/test_cutlass_backend (145/152)
      inductor/test_fp8 (12/336)
      inductor/test_max_autotune (26/182)
      inductor/test_pad_mm (1/19)
      inductor/test_torchinductor_strided_blocks (26/297)
      inductor/test_triton_kernels (2/361)
      profiler/test_profiler (1/81)
      ```
  - lowest bar: pass `core` tests
    - cfr. feedback in https://github.com/pytorch/pytorch/issues/167721
  - patch out failing tests that we're happy to ignore (based on comparison with upstream build)
    - how many tests? how long would it take?
- test systems
  - `jsc-zen3` bot (A100)
  - should cover at least 2 GPU generations => also cover an H100 system
- fully vetted:
  - more thorough testing: `eb --test-mode=full`
  - => goal is to make it non-experimental, after checking details on failing tests
- Alexander uses pip-installed PyTorch (from upstream) to check whether tests work upstream
  - Python (+ CUDA) module to set up Python venv and pip install in it
    - used to run tests with upstream build of PyTorch
  - `cmd.sh` interactive shell is used to determine shell environment
  - used to determine whether a failing test is worth looking into, or not (do-not-care tests, only fails for specific platform, etc.)
- PyTorch CI takes open `DISABLED test_xxx` issues into account
  - see for example https://github.com/pytorch/pytorch/issues/170259
- PyTorch shows string like `python test/distributed/tensor/test_attention.py RingFlexAttentionTest.test_ring_flex_attention_document_mask` to easily run particular test in isolation
- communication with PyTorch devs
  - see request for input on minimal test suite: https://github.com/pytorch/pytorch/issues/167721
  - (Kenneth) try to get in touch with PyTorch devs, set up a call
- additional links w.r.t.
  PyTorch testing & benchmarking (via Xavier B.)
  - https://hud.pytorch.org/
  - https://github.com/pytorch/pytorch/wiki/Using-hud.pytorch.org
  - https://hud.pytorch.org/benchmark/benchmark_list

---

### 20251009 - kickoff meeting

attendees:

- Alex Domingo (Vrije Universiteit Brussel, Belgium)
- Loris Ercole (CECAM)
- Alexander Grund, a.k.a. Flamefire (ZIH, TU Dresden, Germany)
- Kenneth Hoste (HPC-UGent, Belgium)
- Adam Huffman (University of Oxford, UK)
- Emmanuel Kiefer (LuxProvide, Luxembourg)
- Jure Pečar (EMBL, Germany)
- Lara Peeters (HPC-UGent, Belgium)
- Jörg Saßmannshausen (Imperial College London, UK)

#### Summary of current situation

- PyTorch test suite is a PITA
  - it's taking forever...
    - can be over 24h on older systems (AMD Rome @ HPC-UGent)
    - @ TU Dresden: over 33h on 64-core AMD Rome with 6x H100
  - handful of flaky tests (sometimes even hanging tests)
  - failing tests shouldn't be ignored, because they can indicate real problems
    - <insert link to example where a GCC vectorizer bug was found>
    - can point to problem with wrong dependency version (or to a bug in a dependency being used)
- growing complexity in build
  - cfr. Triton (which by itself is a small nightmare)
  - depends on *specific* versions of NCCL & cuDNN
- PyTorch easyblock is quite complex
  - [easyblock PR #3803](https://github.com/easybuilders/easybuild-easyblocks/pull/3803) helps a bit by adding CI for `get_test_results` function used by PyTorch easyblock

#### Impact

- lots of wasted time
  - failing tests => not completing installations
  - work to try and fix tests
- no recent PyTorch versions supported by EasyBuild

#### Ideas/questions

- does it really make sense to run full PyTorch test suite to verify an installation?
  - something more lightweight?
  - => "integration test" scripts found by Alex
  - run more reasonable battery of tests by default
    - support for running full test suite for those that want to
  - only "end-to-end" tests, maybe in combination with small part of PyTorch test suite
  - many tests in PyTorch test suite are:
    - only testing a niche feature
    - flaky/poorly written/make assumptions
  - it's pretty easy to run PyTorch test suite for an existing PyTorch installation
  - PyTorch test suite consists of a bunch of groups of tests
    - we could identify ones that are reasonable to run in default mode
    - try to focus on core features
- framework feature to opt-in to running full PyTorch test suite
  - `eb --test-step-mode=intense PyTorch.eb`
  - `--skip-test-step` could (eventually) be replaced with `--test-step-mode=skip`
  - list set of tests in separate file like `test-pytorch-2.7-intense.yml`
    - `eb --test-step-input=PyTorch:/tmp/test-pytorch-2.7-intense.yml`
    - ```python
      test_step_input = {
          'basic': 'basic.yml',
          'intense': 'intense.yml',
      }
      ```
- support for specifying a max. time for test step
  - `eb --test-step-max-time=1h PyTorch.eb`
  - where would we get reasonably accurate info on this?
    - depends on hardware, available resources, PyTorch version, etc.
  - collecting timing info for tests would be helpful, so each site can figure out for themselves which excessively long tests to skip
- maybe update PyTorch easyblock to allow:
  - `python -m easybuild.easyblocks.pytorch run-test-suite`
- set clear target to include new PyTorch easyconfig in test suite
  - at least 2 (common) GPU generations
    - jsc-zen3 test bot w/ A100
    - H100 somewhere?
  - don't block PR when a couple of PyTorch tests fail for some people
- streamline process to get PyTorch easyconfigs merged
  - PyTorch is becoming a common dependency, so lots of other easyconfig PRs are being blocked...
  - clear policy on what should be achieved before merge would help
  - try to get more people up to speed on how to maintain PyTorch easyblock/easyconfigs
- have a way to quickly merge updated PyTorch easyconfig in repo, but not included in EasyBuild release
  - separate `experimental/easyconfigs` folder?
    - only let EasyBuild pick up on it when it's told to be allowed
      - `eb --use-experimental-easyconfigs`
  - also `-EXPERIMENTAL` as `versionsuffix`?
    - add automatically to module name being installed?
    - also make it a hidden module file?
  - `experimental = True` in easyconfig file
    - don't include these in EasyBuild release
    - also auto-add `-EXPERIMENTAL` to install path?
- initially use pre-built wheels for PyTorch
  - from PyPI? from NVIDIA (for CUDA-aware installs)?
  - `PyTorch` wrapper to allow for in-place update from wheel to from-source installation?
    - only really works for pure Python packages that only do `import torch`
  - Alex' experiment with `torchvision`
    - playing with wheels vs from-source installations of PyTorch/torchvision
    - using pre-built wheel of PyTorch with torchvision from-source on top works, but you then can't swap to a from-source PyTorch
      - only affects stuff that link to `libtorch.so`
      - so swapping PyTorch install with pre-built wheel with from-source built implies also reinstalling torchvision & co
    - bundle stuff together that links to PyTorch library
      - `PyTorch-bundle-PyPI` (wheel installs) vs `PyTorch-bundle-EasyBuild` (from source builds)
      - how can we identify things that link to PyTorch library?
  - using pre-built wheel didn't show any significant performance degradation
- how well does our from-source installation perform vs prebuilt binary wheels?
  - especially on GPU
  - depends on which CUDA kernels are being used, which GPU is being used
  - "pure PyTorch" scripts that Alex found can be useful here
    - https://github.com/pytorch/examples
    - https://github.com/pytorch/benchmark
  - can we get input from PyTorch developers?
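
On the question of how to identify things that link to the PyTorch library: one possible approach is to inspect the `NEEDED` entries in the dynamic section of each shared object under an installation prefix. The sketch below is only an illustration, assuming `readelf` (from binutils) is available on the system. The function names and the `libtorch` prefix match are hypothetical choices made for this example; none of this is existing EasyBuild functionality.

```python
import re
import subprocess
from pathlib import Path

# Matches lines printed by `readelf -d`, for example:
#  0x0000000000000001 (NEEDED)   Shared library: [libtorch.so]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[(?P<lib>[^\]]+)\]")


def needed_libraries(readelf_output):
    """Return the DT_NEEDED library names found in `readelf -d` output."""
    return [m.group("lib") for m in NEEDED_RE.finditer(readelf_output)]


def links_to_torch(readelf_output, prefixes=("libtorch",)):
    """True if any needed library name starts with one of the prefixes.

    The default prefix is an assumption for this sketch; it covers names
    like libtorch.so, libtorch_cpu.so and libtorch_cuda.so.
    """
    return any(lib.startswith(prefixes) for lib in needed_libraries(readelf_output))


def find_torch_linked_files(install_dir):
    """Scan an installation directory for shared objects linking to PyTorch."""
    hits = []
    for path in sorted(Path(install_dir).rglob("*.so*")):
        if not path.is_file():
            continue
        res = subprocess.run(
            ["readelf", "-d", str(path)],
            capture_output=True, text=True, check=False,
        )
        # Non-ELF files make readelf fail; those are simply skipped.
        if res.returncode == 0 and links_to_torch(res.stdout):
            hits.append(path)
    return hits
```

Running `find_torch_linked_files` on the installation directory of, say, torchvision would list the extension modules that need rebuilding when a wheel-based PyTorch is swapped for a from-source build. Pure Python packages that only do `import torch` would produce no hits.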

#### Notes

- issue with having to use other libuv @ LuxProvide
  - not using internal libuv in tensorpipe
- for specific problems/questions: open issue/discuss in Slack
- is there a way to figure out the versions of dependencies expected by PyTorch?
  - yes, sort of, see comments in recent PyTorch PRs like [easyconfigs PR #23923](https://github.com/easybuilders/easybuild-easyconfigs/pull/23923)
  - can we extract versions that were used from pre-built wheels?
    - maybe a range of versions?

#### Action points

- (Kenneth/Alex) implement support for `--test-step-mode` in framework
  - `skip`, `minimal`, `basic` (default), `full` as possible values?
  - `--test-step-mode=skip` as replacement for `--skip-test-step`?
- (Alexander) make PyTorch easyblock aware of it
  - at least for `basic` mode
  - current PyTorch test step would then correspond to `full`
- define clear policy w.r.t. merging PyTorch easyconfig PRs
  - minimal set of test installations that should pass
  - break up process to add support for new PyTorch versions
    - experimental easyconfig (passing basic test step)
    - mature easyconfig (passing full test step)
    - report problems in issue, follow up in subsequent PRs (adding more patch files, etc.)
- implement support for experimental easyconfigs
  - `experimental` easyconfig parameter that can be set to `True`
  - don't include those easyconfig files in EasyBuild release?
  - auto-add `-EXPERIMENTAL` to module name + auto-hide module file
  - for PyTorch: passing `basic` test step is sufficient to merge as experimental easyconfig file
- evaluate performance benefits of from-source PyTorch installation
  - vs container images
  - vs `pip install torch` installation

#### Next meeting

- Thu 13 Nov'25 15:00 CET
  - OK for Xavier, Alex, Jörg (TBC: Alexander)

---
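
The merge policy idea from the 20251212 notes (max. 1% failing tests per test suite) could be checked mechanically against a "Failed suites" report. The sketch below is a minimal illustration, assuming report lines of the form `suite/name (failed/total)` as in the example from that meeting. The function names and the default threshold are hypothetical, not part of the PyTorch easyblock.

```python
import re

# One report line looks like: inductor/test_fp8 (12/336)
SUITE_RE = re.compile(r"^\s*(?P<suite>\S+)\s+\((?P<failed>\d+)/(?P<total>\d+)\)\s*$")


def parse_failed_suites(report):
    """Map suite name -> (failed, total); lines in another format are ignored."""
    suites = {}
    for line in report.splitlines():
        match = SUITE_RE.match(line)
        if match:
            suites[match.group("suite")] = (
                int(match.group("failed")),
                int(match.group("total")),
            )
    return suites


def suites_over_threshold(suites, max_fraction=0.01):
    """Return suite name -> failure fraction for suites above the threshold."""
    over = {}
    for name, (failed, total) in suites.items():
        if total and failed / total > max_fraction:
            over[name] = failed / total
    return over
```

Note that with a 1% threshold, a single failing test already trips the check for any suite with fewer than 100 tests (for example `profiler/test_profiler` at 1/81), so the policy may need a minimum suite size or an absolute allowance alongside the percentage.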