# EasyBuild PyTorch working group
## Goal
How can we streamline the process of supporting new PyTorch versions?
## Action points
*(based on discussion on 20251009 - by KH)*
- **[Kenneth]** implement support for `--test-step-mode` in framework
- `skip`, `minimal`, `basic` (default), `full` as possible values?
- `--test-step-mode=skip` as replacement for `--skip-test-step`?
- **[Alexander]** make PyTorch easyblock aware of it
- at least for `basic` mode
- current PyTorch test step would then correspond to `full`
- **[Alex?]** define clear policy w.r.t. merging PyTorch easyconfig PRs
- minimal set of test installations that should pass
- break up process to add support for new PyTorch versions
- experimental easyconfig (passing basic test step)
- mature easyconfig (passing full test step)
- report problems in issue, follow up in subsequent PRs (adding more patch files, etc.)
- **[Kenneth]** implement support for experimental easyconfigs
- `experimental` easyconfig parameter that can be set to `True`
- don't include those easyconfig files in EasyBuild release?
- auto-add `-EXPERIMENTAL` to module name + auto-hide module file
- for PyTorch: passing `basic` test step is sufficient to merge as experimental easyconfig file
- **[Loris? Alex?]** evaluate performance benefits of from-source PyTorch installation
- vs container images
- vs `pip install torch` installation
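
A minimal sketch of how the proposed `--test-step-mode` values could be resolved, including the idea of treating `--skip-test-step` as an alias for `skip`. The function name `resolve_test_step_mode` and its arguments are hypothetical; nothing like this exists in the EasyBuild framework yet.

```python
# Proposed values from the action points above; 'basic' is the suggested default.
TEST_STEP_MODES = ("skip", "minimal", "basic", "full")
DEFAULT_TEST_STEP_MODE = "basic"


def resolve_test_step_mode(mode=None, skip_test_step=False):
    """Return the effective test step mode (hypothetical helper, not an EasyBuild API)."""
    if skip_test_step:
        # the existing --skip-test-step option would map onto the new 'skip' mode
        if mode not in (None, "skip"):
            raise ValueError(f"--skip-test-step conflicts with test step mode {mode!r}")
        return "skip"
    if mode is None:
        return DEFAULT_TEST_STEP_MODE
    if mode not in TEST_STEP_MODES:
        raise ValueError(f"unknown test step mode {mode!r}, expected one of {TEST_STEP_MODES}")
    return mode


print(resolve_test_step_mode())                     # default mode
print(resolve_test_step_mode(skip_test_step=True))  # legacy option
print(resolve_test_step_mode("full"))               # current PyTorch test step
```

An easyblock could then branch on the returned value, with the current PyTorch test step corresponding to `full`.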
## Meetings
### 20251212
*(Fri 12 Dec'25 14:00 CET)*
attendees:
- Kenneth Hoste (HPC-UGent, Belgium)
- Alex Domingo (Vrije Universiteit Brussel, Belgium)
- Lara Peeters (HPC-UGent, Belgium)
- Emmanuel Kiefer (LuxProvide, Luxembourg)
- Alexander Grund, a.k.a. Flamefire (ZIH, TU Dresden, Germany)
#### Notes
- PyTorch 2.7.1 test suite
```
== FAILED: Installation ended unsuccessfully: An error was raised during test step: 'Too many failed tests (8), maximum allowed is 6' (took 45 hours 20 mins 28 secs)
== Results of the build can be found in the log file(s) /tmp/eb-znw1uvam/easybuild-PyTorch-2.7.1-20251207.140221.MdHXg.log
WARNING: 8 test failures, 0 test errors (out of 262242):
distributed/_composable/fsdp/test_fully_shard_state_dict (1 failed, 1 passed, 4 skipped, 0 errors)
distributed/test_store (1 failed, 32 passed, 0 skipped, 0 errors)
dynamo/test_error_messages (1 failed, 33 passed, 0 skipped, 0 errors)
inductor/test_cpu_select_algorithm (2 failed, 22 passed, 1577 skipped, 0 errors)
inductor/test_select_algorithm (2 failed, 17 passed, 0 skipped, 0 errors)
test_testing (1 failed, 1997 passed, 67 skipped, 0 errors)
```
- test build took ~45h, 8 failing tests out of 262k (~0.003%)
- filter out individual tests that take too long (see EasyBuild log)
- PyTorch devs shard the tests across different systems
- core test set
- tests that catch problems with compatibility of dependency versions
- CUTLASS, Triton
- cut up development phase
- initially: only core tests + "important" (groups of) tests that really caught problems for us in the past
- tagged as experimental
- should also be triggered for stuff that depends on this PyTorch
```python
name = 'torchvision'
dependencies = [
# PyTorch/2.7.1-foss-2025a-CUDA-13
# $EASYBUILD_EXPERIMENTAL_EASYCONFIGS=PyTorch,torchvision
('PyTorch', '2.7.1')
]
experimental = True # implies hidden module
```
- makes it easier to install stuff on top of it (which may also surface problems)
- experimental modules should be hidden from users
- module load message to signal to users that this module is experimental
- 'experimental' value for moduleclass
- still include in EasyBuild release
- policy for "good enough to be experimental"
- failing tests in full test suite (<1% failing tests)?
- check # failing tests per test suite (max. 1% failing per test suite?)
- example:
```
Failed suites (12):
distributed/tensor/test_attention (2/6)
distributed/test_c10d_functional_native (1/32)
distributed/test_c10d_nccl (1/6)
dynamo/test_misc (2/596)
dynamo/test_structured_trace (3/29)
inductor/test_cutlass_backend (145/152)
inductor/test_fp8 (12/336)
inductor/test_max_autotune (26/182)
inductor/test_pad_mm (1/19)
inductor/test_torchinductor_strided_blocks (26/297)
inductor/test_triton_kernels (2/361)
profiler/test_profiler (1/81)
```
- lowest bar: pass `core` tests
- cfr. feedback in https://github.com/pytorch/pytorch/issues/167721
- patch out failing tests that we're happy to ignore (based on comparison with upstream build)
- how many tests? how long would it take?
- test systems
- `jsc-zen3` bot (A100)
- should cover at least 2 GPU generations => also cover an H100 system
- fully vetted:
- more thorough testing (`eb --test-step-mode=full`)
- => goal is to make it non-experimental, after checking details on failing tests
- Alexander uses a pip-installed PyTorch (from upstream) to check whether tests work upstream
- Python (+ CUDA) module to set up Python venv and pip install in it
- used to run tests with upstream build of PyTorch
- `cmd.sh` interactive shell is used to determine shell environment
- used to determine whether failing test is worth looking into, or not (do-not-care tests, only fails for specific platform, etc.)
- PyTorch CI takes open `DISABLED test_xxx` issues into account
- see for example https://github.com/pytorch/pytorch/issues/170259
- PyTorch shows string like `python test/distributed/tensor/test_attention.py RingFlexAttentionTest.test_ring_flex_attention_document_mask` to easily run particular test in isolation
- communication with PyTorch devs
- see request for input on minimal test suite: https://github.com/pytorch/pytorch/issues/167721
- (Kenneth) try to get in touch with PyTorch devs, set up a call
- additional links w.r.t. PyTorch testing & benchmarking (via Xavier B.)
- https://hud.pytorch.org/
- https://github.com/pytorch/pytorch/wiki/Using-hud.pytorch.org
- https://hud.pytorch.org/benchmark/benchmark_list
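
The "max. 1% failing per test suite" idea from the policy discussion can be tried out on the failed-suites example from these notes. The sketch below assumes the `suite (failed/total)` line format shown in that example; the function name `suites_over_threshold` and the 1% default are illustrative only, not an agreed policy or an existing easyblock function.

```python
import re

# Matches lines such as "inductor/test_fp8 (12/336)":
# suite name, number of failed tests, total number of tests in that suite.
SUITE_LINE = re.compile(r"^\s*(\S+)\s+\((\d+)/(\d+)\)\s*$")

# Failed-suites example copied from the notes above.
EXAMPLE = """\
Failed suites (12):
distributed/tensor/test_attention (2/6)
distributed/test_c10d_functional_native (1/32)
distributed/test_c10d_nccl (1/6)
dynamo/test_misc (2/596)
dynamo/test_structured_trace (3/29)
inductor/test_cutlass_backend (145/152)
inductor/test_fp8 (12/336)
inductor/test_max_autotune (26/182)
inductor/test_pad_mm (1/19)
inductor/test_torchinductor_strided_blocks (26/297)
inductor/test_triton_kernels (2/361)
profiler/test_profiler (1/81)
"""


def suites_over_threshold(summary, max_fail_ratio=0.01):
    """Return (suite, failed, total, ratio) tuples for suites above max_fail_ratio."""
    flagged = []
    for line in summary.splitlines():
        match = SUITE_LINE.match(line)
        if not match:
            continue  # skips the "Failed suites (12):" header and blank lines
        suite, failed, total = match.group(1), int(match.group(2)), int(match.group(3))
        ratio = failed / total
        if ratio > max_fail_ratio:
            flagged.append((suite, failed, total, ratio))
    # worst offenders first
    return sorted(flagged, key=lambda item: item[3], reverse=True)


for suite, failed, total, ratio in suites_over_threshold(EXAMPLE):
    print(f"{suite}: {failed}/{total} ({ratio:.1%})")
```

On this example, 10 of the 12 suites exceed 1%, including small suites where a single failure is enough (1/6), so a per-suite threshold may need a minimum suite size or an absolute failure count next to the ratio.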
---
### 20251009 - kickoff meeting
attendees:
- Alex Domingo (Vrije Universiteit Brussel, Belgium)
- Loris Ercole (CECAM)
- Alexander Grund, a.k.a. Flamefire (ZIH, TU Dresden, Germany)
- Kenneth Hoste (HPC-UGent, Belgium)
- Adam Huffman (University of Oxford, UK)
- Emmanuel Kiefer (LuxProvide, Luxembourg)
- Jure Pečar (EMBL, Germany)
- Lara Peeters (HPC-UGent, Belgium)
- Jörg Saßmannshausen (Imperial College London, UK)
#### Summary of current situation
- PyTorch test suite is a PITA
- it's taking forever...
- can be over 24h on older systems (AMD Rome @ HPC-UGent)
- @ TU Dresden: over 33h on 64-core AMD Rome with 6x H100
- handful of flaky tests (sometimes even hanging tests)
- failing tests shouldn't be ignored, because they can indicate real problems
- <insert link to example where a GCC vectorizer bug was found>
- can point to problem with wrong dependency version (or to a bug in a dependency being used)
- growing complexity in build
- cfr. Triton (which by itself is a small nightmare)
- depends on *specific* versions of NCCL & cuDNN
- PyTorch easyblock is quite complex
- [easyblock PR #3803](https://github.com/easybuilders/easybuild-easyblocks/pull/3803) helps a bit by adding CI for `get_test_results` function used by PyTorch easyblock
#### Impact
- lots of wasted time
- failing tests => not completing installations
- work to try and fix tests
- no recent PyTorch versions supported by EasyBuild
#### Ideas/questions
- does it really make sense to run full PyTorch test suite to verify an installation?
- something more lightweight?
- => "integration test" scripts found by Alex
- run more reasonable battery of tests by default
- support for running full test suite for those that want to
- only "end-to-end" tests, maybe in combination with small part of PyTorch test suite
- many tests in PyTorch test suite are:
- only testing a niche feature
- flaky/poorly written/make assumptions
- it's pretty easy to run PyTorch test suite for an existing PyTorch installation
- PyTorch test suite consists of a bunch of groups of tests
- we could identify ones that are reasonable to run in default mode
- try to focus on core features
- framework feature to opt-in to running full PyTorch test suite
- `eb --test-step-mode=intense PyTorch.eb`
- `--skip-test-step` could (eventually) be replaced with `--test-step-mode=skip`
- list set of tests in separate file like `test-pytorch-2.7-intense.yml`
- `eb --test-step-input=PyTorch:/tmp/test-pytorch-2.7-intense.yml`
- ```python
test_step_input = {
'basic': 'basic.yml',
'intense': 'intense.yml',
}
```
- support for specifying a max. time for test step
- `eb --test-step-max-time=1h PyTorch.eb`
- where would we get reasonably accurate info on this?
- depends on hardware, available resources, PyTorch version, etc.
- collecting timing info for tests would be helpful, so each site can figure out for themselves which excessively long tests to skip
- maybe update PyTorch easyblock to allow:
- `python -m easybuild.easyblocks.pytorch run-test-suite`
- set clear target to include new PyTorch easyconfig in test suite
- at least 2 (common) GPU generations
- jsc-zen3 test bot w/ A100
- H100 somewhere?
- don't block PR when a couple of PyTorch tests fail for some people
- streamline process to get PyTorch easyconfigs merged
- PyTorch is becoming a common dependency, so lots of other easyconfig PRs are being blocked...
- clear policy on what should be achieved before merge would help
- try to get more people up to speed on how to maintain PyTorch easyblock/easyconfigs
- have a way to quickly merge updated PyTorch easyconfig in repo, but not included in EasyBuild release
- separate `experimental/easyconfigs` folder?
- only let EasyBuild pick these up when explicitly told to
- `eb --use-experimental-easyconfigs`
- also `-EXPERIMENTAL` as `versionsuffix`?
- add automatically to module name being installed?
- also make it a hidden module file?
- `experimental = True` in easyconfig file
- don't include these in EasyBuild release
- also auto-add `-EXPERIMENTAL` to install path?
- initially use pre-built wheels for PyTorch
- from PyPI? from NVIDIA (for CUDA-aware installs)?
- `PyTorch` wrapper to allow for in-place update from wheel to from-source installation?
- only really works for pure Python packages that only do `import torch`
- Alex' experiment with `torchvision`
- playing with wheels vs from-source installations of PyTorch/torchvision
- using pre-built wheel of PyTorch with torchvision from-source on top works, but you then can't swap to a from-source PyTorch
- only affects stuff that links to `libtorch.so`
- so swapping a PyTorch install from a pre-built wheel for a from-source build implies also reinstalling torchvision & co
- bundle stuff together that links to PyTorch library
- `PyTorch-bundle-PyPI` (wheel installs) vs `PyTorch-bundle-EasyBuild` (from source builds)
- how can we identify things that link to PyTorch library?
- using pre-built wheel didn't show any significant performance degradation
- how well does our from-source installation perform vs prebuilt binary wheels?
- especially on GPU
- depends on which CUDA kernels are being used, which GPU is being used
- "pure PyTorch" scripts that Alex found can be useful here
- https://github.com/pytorch/examples
- https://github.com/pytorch/benchmark
- can we get input from PyTorch developers?
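
As a rough illustration of the timing idea above (collect per-suite timings, then let each site decide which excessively long suites to skip), the sketch below splits suites by a time limit. It applies the limit per suite rather than to the whole test step, and the `1h`-style value format, the helper names, and the timings are all assumptions made for this example; they are not measured values or existing EasyBuild options.

```python
import re

# Hypothetical format for a time limit: an integer followed by s, m or h (e.g. "1h").
UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_max_time(value):
    """Convert a string like '90m' or '1h' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if not match:
        raise ValueError(f"unsupported time limit: {value!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def split_by_duration(durations, max_time):
    """Split test suites into (to_run, to_skip) using a per-suite time limit."""
    limit = parse_max_time(max_time)
    to_run = sorted(name for name, secs in durations.items() if secs <= limit)
    to_skip = sorted(name for name, secs in durations.items() if secs > limit)
    return to_run, to_skip


# Made-up timings in seconds, for illustration only (not measured values).
timings = {
    "distributed/test_store": 600,
    "inductor/test_select_algorithm": 5400,
    "test_testing": 3600,
}
to_run, to_skip = split_by_duration(timings, "1h")
print("run:", to_run)
print("skip:", to_skip)
```

Real timings would have to come from each site's own EasyBuild logs, since they depend on hardware, available resources, and PyTorch version.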
#### Notes
- issue with having to use other libuv @ LuxProvide
- not using internal libuv in tensorpipe
- for specific problems/questions: open issue/discuss in Slack
- is there a way to figure out the versions of dependencies expected by PyTorch?
- yes, sort of, see comments in recent PyTorch PRs like [easyconfigs PR #23923](https://github.com/easybuilders/easybuild-easyconfigs/pull/23923)
- can we extract versions that were used from pre-built wheels?
- maybe a range of versions?
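
One possible starting point for extracting versions from a pre-built wheel is the package metadata of a pip-installed `torch`: `importlib.metadata.requires()` returns the declared requirement strings, and exact pins (`==`) can be filtered out of those. This is only a sketch: the helper names are hypothetical, it only sees dependencies declared as Python packages, and it says nothing about libraries bundled inside the wheel itself.

```python
import re
from importlib import metadata

# Matches requirement strings with an exact pin, e.g. 'somepkg==1.2.3; <markers>'.
PINNED = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([^\s;,]+)")


def pinned_requirements(requirements):
    """Map package name to version for requirement strings that use an exact '==' pin."""
    pins = {}
    for req in requirements or []:  # metadata.requires() may return None
        match = PINNED.match(req)
        if match:
            pins[match.group(1)] = match.group(2)
    return pins


def pinned_requirements_of(dist_name):
    """Look up exact pins for an installed distribution (raises PackageNotFoundError if absent)."""
    return pinned_requirements(metadata.requires(dist_name))


# Placeholder requirement strings, for illustration only (not taken from a real wheel).
sample = [
    'examplelib==1.2.3; platform_system == "Linux"',
    "otherlib>=2.0",
    "thirdlib == 4.5",
]
print(pinned_requirements(sample))
```

In a venv with an upstream wheel installed, `pinned_requirements_of("torch")` would list whatever exact pins that wheel declares; requirements given as ranges are skipped by this sketch and would need separate handling.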
#### Action points
- (Kenneth/Alex) implement support for `--test-step-mode` in framework
- `skip`, `minimal`, `basic` (default), `full` as possible values?
- `--test-step-mode=skip` as replacement for `--skip-test-step`?
- (Alexander) make PyTorch easyblock aware of it
- at least for `basic` mode
- current PyTorch test step would then correspond to `full`
- define clear policy w.r.t. merging PyTorch easyconfig PRs
- minimal set of test installations that should pass
- break up process to add support for new PyTorch versions
- experimental easyconfig (passing basic test step)
- mature easyconfig (passing full test step)
- report problems in issue, follow up in subsequent PRs (adding more patch files, etc.)
- implement support for experimental easyconfigs
- `experimental` easyconfig parameter that can be set to `True`
- don't include those easyconfig files in EasyBuild release?
- auto-add `-EXPERIMENTAL` to module name + auto-hide module file
- for PyTorch: passing `basic` test step is sufficient to merge as experimental easyconfig file
- evaluate performance benefits of from-source PyTorch installation
- vs container images
- vs `pip install torch` installation
#### Next meeting
- Thu 13 Nov'25 15:00 CET
- OK for Xavier, Alex, Jörg (TBC: Alexander)
---