# Software Installations on Rocky 9

This document: https://hackmd.io/@UPPMAX/rocky9sw

Weekly meeting on Mondays at 13:15.

[AE document about Pelle installations](https://hackmd.io/PMixbGE_TE2mgjkJ1KbMbw)

## Hardware

- `pelle*` up 10-00:00:0, 123 idle, `p[1-115,201-206,251-252]`
- `fat` up 10-00:00:0, 2 idle, `p[251-252]`, 96 cores (hyperthreads), 2 & 3 TB
- `gpu` up 10-00:00:0, 6 idle, `p[201-206]`
  - 201-204: 64 cores (hyperthreads), 377 GB, 10 L40
  - 205-206: 64 cores (hyperthreads), 377 GB, 2 H100

## Test users

- Jira: [UPP-709](https://jira.its.uu.se/browse/UPP-709)

## Meeting 2025-09-22

Björn and Pär

- Slurm
  - memory is charged for in a smart way
- Discuss the documentation about CPU hours next week
- fair-share details will be discussed in the KUL group this week

## Meeting 2025-09-01

- partition names
- courses: when do they move to Pelle?
- BC investigates, based on the allocation group, which software is needed
- statistics on interactive use
  - how often?
  - how long allocations?
  - can be checked with sacct ...
- a timeline is needed

## Meeting 2025-06-09

- Attendees: Björn, Diana, Jerker, Martin
- pelle1 and pelle2 can be logged in to
- `/sw/arch` mounted to `/sw/nodetypes/el9_zen4`
- Diana is building UCX and OpenMPI for Ethernet
- Letting users in: still to be done
  - TOTP up
  - test project? Marcus?
- ThinLinc?
  - not configured
- OpenOnDemand?
  - via web
  - not like COSMOS/Dardel via ThinLinc

## Meeting 2025-06-02

- Attendees: Björn, Martin, Pavlin, Pär
- Installed with eb:
  - Python bare
  - matlab/2023b
  - julia bare
  - comsol/6.2
- Installed in generic:
  - pixi - *for demonstration purposes only*
    - `/sw/generic/pixi-tools/setup/tools-list.txt`
    - env: samtools-1.21
    - env: bwa-0.7.19
  - uv - *for demonstration purposes only*
    - `/sw/generic/uv-tools/setup/tools-list.txt`
- MPI problem with `rdma`
  - fix to be done on the EB side
  - ask in the community? BC or Diana?
  - jira

## Meeting 2025-05-19

- Attendees: Björn, Douglas, Martin, Pavlin, Pär
- Björn gets errors when submitting Slurm jobs. Pär found that user name lookup does not work on the Slurm controller server, and will investigate.
- Several tools are known not to work on Pelle (projinfo, uquota, ...); there is no `uppmax` module.
- Pär: MODULEPATH is empty when logging in; let us know when you know what it should be set to.
- Pär: Jerker would like an estimate of when modules/software can be ready enough to allow a few test users in.
- The `/sw/arch` directory has been available since last week (`cat /sw/arch/.arch` to check which architecture is used).
- Diana is working on `/sw/arch/eb`, initially for el9_zen4.
- Björn and Douglas requested a few system packages to install (I don't have the list for the nodes). Martin will handle this.
- Discussions about what software should be installed where.
  - MATLAB and similar go into `/sw/generic`. We should investigate using EB for these installations.

## Meeting 2025-05-12

- Attendees: Björn C, Diana, Martin, Pär, Douglas
- `module use` and `module load` work with modules installed via EasyBuild, for example from `/sw/EasyBuild/rackham/modules/all`
- .tcl/.lua modules from `/sw/mf` give an error upon load: `"/etc/lmod/cluster.tcl": no such file or directory`
  - This is because our .tcl files include `/sw/mf/common/includes/functions.tcl`, which tries to `source /etc/lmod/cluster.tcl`, but that file is not available on Pelle
  - this is related to legacy modules
  - `module load legacy`? - wait a bit!
- some static libraries in the ordinary Pelle tree
- Slurm is up, but it seems each node only has one core
  - configuration from Tuesday onwards (Martin) - wait a bit
  - `sinfo -R`
- SW tree!
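Not part of the meeting notes, just a short sketch of the working versus failing cases described above, assuming Lmod is initialized in the shell; the module name in the `module load` line is a placeholder, not a module known to exist in that tree.

```bash
# Add the EasyBuild-generated module tree mentioned above to MODULEPATH:
module use /sw/EasyBuild/rackham/modules/all

# List what became visible and load something from it
# (the module name below is a placeholder):
module avail
module load MATLAB/2023b   # hypothetical example name

# Loading a legacy .tcl module from /sw/mf instead fails on Pelle with
#   "/etc/lmod/cluster.tcl": no such file or directory
# because /sw/mf/common/includes/functions.tcl tries to source that file.
```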
### Directory structure (Can this be settled?)

/sw/
- nodetypes/
  - el9_zen4/ -> /sw/arch
  - el9_haswell/ -> /sw/arch
  - el9_westmere/ -> /sw/arch (s229, 4TB) (not needed?)
  - ~~el9_broadwell/~~ -> /sw/arch (not needed? current rackham nodes)
  - ~~el9_nehalem/ (s230, 2TB) -> /sw/arch~~
- eb_sources/
- eb_recipes/
- **generic/** stays in the top level
  - this is only MATLAB, containers etc.

### Content in nodetypes folders

- haswell/: many (but not all?) tools
- zen4/: as many tools as possible
- westmere/: rather empty (Kraken etc.)
  - instead x86_64

Suggested directory structure for `sw`:

```bash=
[user@rackham5 SW-Directory-Structure]$ tree /sw/
└── nodetypes/
    └── el9_broadwell/
        ├── eb4
        ├── eb5
        │   ├── eb5venv
        │   ├── modules
        │   └── software
        │       └── some-software
        │           └── some-version
        │               ├── bin
        │               ├── easybuild   # contains .log files generated by EasyBuild
        │               └── lib64
        ├── local/
        │   ├── modules
        │   └── software
        │       └── some-software
        │           └── some-version
        │               ├── bin
        │               ├── lib64
        │               └── localbuild  # contains README files of manual installations
        └── spack/                      # if needed
    └── el9_nehalem/
        ├── eb4
        ├── ...
        ├── eb5
        ├── ...
        ├── local
        ├── ...
        └── spack                       # if needed
        ├── ...
    └── el9_haswell/
        ├── ...
    └── .../
        ├── ...
└── generic        # software that does not depend on node type
    ├── local
    ├── ...
└── eb_sources     # source files for EasyBuild
└── eb_recipes     # official development branch of easybuild-easyconfigs
└── apps           # legacy software from rackham/snowy/etc
└── bioinfo        # legacy software from rackham/snowy/etc
└── comp           # legacy software from rackham/snowy/etc
└── parallel       # legacy software from rackham/snowy/etc
└── mf             # legacy software module files
```

## Meeting 2025-04-07

- Attendees: Björn C, Diana, Martin, Douglas
- Pelle
  - pip at the system level - can be done soon!
- Status
  - Gorilla has network problems
- ToDos
  - lmod - wait for Gorilla or not?
  - Slurm configuration takes time
  - what to call the GPU nodes?
- Which programs are still installed by sysexps?
  - MATLAB
- Nice tool: nodetype.sh

Architectures:

* `el9_broadwell`
* `el9_haswell`
  * r1071 (irham nodes 1001-1075 have el9, the rest el7)
* `el9_westmere` (for `s229`, 4TB)
  * needed
  * a small possibility to upgrade a new node to roughly 4 TB, at a penalty from the vendor
* `el9_nehalem` (for `s230`, 2TB)
  * not needed
* `el9_zen4` (?) (for the new AMD CPUs)
* `el9_icelake` (?) (Bianca 64-core nodes)

## Meeting 2025-03-31

- Attendees: Björn C
- Status
  - Rockham
    - GPU 1050/1052 down
  - Pelle login
    - pelle1: 48 cores
    - no hyperthreading
  - For AE
    - pelle3-sw: can write to sw
    - pelle3-ae: cannot write to sw
- Think about Bianca installations when Pelle is in use
  - keep rackham5 for the correct architecture?

## Meeting 2025-03-03

- 1050/52 have GPUs: working

## Meeting 2025-02-10

- Status
  - Perhaps the focus will shift to Pelle -->
    - rackham1-3 will probably not be upgraded to Rocky9
    - rackham4 can still be used for testing by us and by users
  - 1052 has a GPU but is drained
  - lmod/EB
    - lmod 8.7.56 does not work with EB4 (a fix may come for EB5)
      - not yet available for the Rocky9 update
    - EB5 is not official yet
- ToDos
  - Martin
    - [x] updates Rocky9 on the login node
    - [ ] wake up 1052
  - Pär
    - [ ] archspec

## Meeting 2025-02-03

- Status
  - interactive
    - works
    - graphics work
    - devcore not set up
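Not from the notes: a minimal sketch of how the interactive/graphics status above could be checked, assuming Slurm's built-in X11 forwarding is enabled on the test nodes; the project name and time limit are placeholders.

```bash
# Hypothetical smoke test of graphics in a job:
salloc -A some-project -n 1 -t 15:00 --x11   # project name is a placeholder
srun xclock        # any X client; a window should appear if X11 in salloc works

# or, once the UPPMAX interactive wrapper is in place:
interactive -A some-project -n 1 -t 15:00
```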
## Meeting 2025-01-27

- People: Diana, Martin, Pär, Björn, Pavlin
- Status
  - salloc/sbatch work without needing a reservation!
  - interactive not working yet
  - X11 in salloc does not work
  - the regular R module does not work
- ToDos:
  - [x] Martin checks the X11 installation
  - devcore
  - interactive (problems with write permissions to scratch)

## Meeting 2025-01-20

- People: Diana, Björn, Pär, Martin
- Status:
  - nothing new
  - [x] compute nodes need to be woken up after the maintenance

## Meeting 2025-01-13

- Status
  - Check available nodes: `sinfo -T rocky9`
    - `r[252,1001-1003,1005-1072]`
  - Not open yet for users

## Meeting 2024-10-21

How **must** bindmounts work? We wish to have only one architecture-specific bindmount per node, so how do we do this? Must there be an intervening directory to make this easier?

`/sw -> /sw/el9-broadwell` **NO**: the rest of `/sw` is not available with this scheme.

Preferred:

- `/sw/arch -> /sw/architectures/el9_broadwell`
- `/sw/arch -> /sw/architectures/el9_haswell`

Suggested directory structure for `sw`:

```bash=
[user@rackham5 SW-Directory-Structure]$ tree /sw/
└── architectures/
    └── el9_broadwell/
        ├── eb4
        ├── eb5
        │   ├── eb5venv
        │   ├── modules
        │   └── software
        │       └── some-software
        │           └── some-version
        │               ├── bin
        │               ├── easybuild   # contains .log files generated by EasyBuild
        │               └── lib64
        ├── local/
        │   ├── modules
        │   └── software
        │       └── some-software
        │           └── some-version
        │               ├── bin
        │               ├── lib64
        │               └── localbuild  # contains README files of manual installations
        └── spack/                      # if needed
    └── el9_nehalem/
        ├── eb4
        ├── ...
        ├── eb5
        ├── ...
        ├── local
        ├── ...
        └── spack                       # if needed
        ├── ...
    └── el9_haswell/
        ├── ...
    └── .../
        ├── ...
└── eb_sources     # source files for EasyBuild
└── eb_recipes     # official development branch of easybuild-easyconfigs
└── apps           # legacy software from rackham/snowy/etc
└── bioinfo        # legacy software from rackham/snowy/etc
└── comp           # legacy software from rackham/snowy/etc
└── parallel       # legacy software from rackham/snowy/etc
└── mf             # legacy software module files
```

Architectures (determined via `install-methods/nodetype.sh`):

* `el9_broadwell`
* `el9_haswell`
* `el9_westmere` (for `s229`, 4TB)
* `el9_nehalem` (for `s230`, 2TB)
* `el9_zen4` (?) (for the new AMD CPUs)
* `el9_icelake` (?) (Bianca 64-core nodes)

Legacy software currently available:

* added as additional directories under the `/sw/el9_archspec` directories
* these are **hardlinked** directories to the legacy `/sw/` directories
  * `/sw/el9_broadwell/apps` is a **hardlink** to `/sw/apps`
* the `mf` directory allows using current `MODULEPATH` values from e.g. rackham
* the eb5, eb4, spack etc. `MODULEPATH` values come first
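A sketch of what the preferred scheme above could look like on a single node, assuming a plain bind mount is used (the actual provisioning mechanism is not decided here):

```bash
# One architecture-specific bind mount per node, exposing the matching subtree
# at the fixed path /sw/arch. On an el9_broadwell node:
mount --bind /sw/architectures/el9_broadwell /sw/arch

# or as a persistent /etc/fstab entry:
# /sw/architectures/el9_broadwell  /sw/arch  none  bind  0  0

cat /sw/arch/.arch   # check which architecture tree ends up at /sw/arch
```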
## Meeting 2024-10-14

Participants: MA, BC, DS, DI, JvB

### EasyBuild updates

- the `rpath` option still leads to module dependencies
  - moving deps to builddeps is not a good option
  - alternative 1, which DS is working on: have a "bio" toolchain that contains commonly used dependencies for bio software
  - alternative 2: add a hook when writing the module file

## Meeting 2024-10-07

Participants: MA, BC, PL

### Status

- We are not ready to open up for test users
- Douglas (not here): EB installations are progressing
- Martin: Problems with slow disks on one of the new Rocky9 compute nodes. Other nodes work fine!
- Pär got the path to the EB `broadwell-rocky9` module tree:
  - `/sw/EasyBuild/broadwell-rocky9/modules`

### Plans

- Pär: Will make `lmod` modules available at login on rackham4 (as on Rackham-CentOS)

## Meeting 2024-09-23

Participants: DS, BC, JB, AH, MA, DI, PL

### Status

- system tools are installed now
  - e.g. gcc, make, cmake, autoconf etc.

### Plans

- Use new toolchain(s), >= 2023 (?)
  - primarily use a predetermined toolchain
    - if it is too old, change it in the eb file
  - for old tools, rather use compilation flags to take care of older code
- Douglas does the installations in the coming week

### More nodes to build on

- possibly

### Other

- Domus support until April 2025

## Meeting 2024-09-16

Participants: BC, DI, DS, JB, PL, MA, AH

### Status

- Prioritization of this
  - deadlines from the System group
    - vulnerabilities come closer
  - **Alt 1**: Main suggestion from Jerker
    - Ready by the **October** maintenance to release **rackham4** for users
    - Ready by the **November** maintenance to be released on the **whole Rackham**
  - **Alt 2**:
    - Release Rocky9 and the new software tree on the whole Rackham, **not in two steps**
  - **Alt 3**:
    - no way in the universe that this will work
- Pär has started with Slurm but has not configured it yet
  - almost done
  - how many nodes connected to rocky9?
    - decide the exact number later
    - good to have many, so that people want to test it!
    - on the order of 50?
- What have we learned so far regarding
  - reusing
    - binaries?
      - most of them work
      - check if there are problems
    - compiled C/Fortran programs
      - some of them work?
      - ideally all should be recompiled
    - just copy-paste to the new file tree?
- Environment variables
  - CLUSTER
    - we need to differentiate RACKHAM from "ROCKHAM"
    - a variable to say that the module from rocky9 should be used
    - Examples:
      - broadwell-centos7
      - broadwell-rocky9
      - haswell-rocky9
      - sandybridge-centos7-TeslaT4
  - syntax to find out the arch
    - facter
    - archspec??
    - `./nodetype.sh` → `sandybridge_redhat7_tesla_t4`
      - https://github.com/UPPMAX/install-methods/blob/main/nodetype.sh
    - etc.
  - look at other centres for inspiration
    - HPC2N
      - `ls /cvmfs/ebsw.hpc2n.umu.se/`
      - Output:
        - `amd64_ubuntu1604_bdw/ amd64_ubuntu1604_skx/ amd64_ubuntu2004_skx/ amd64_ubuntu2204_skx/ common/ amd64_ubuntu1604_bdz/ amd64_ubuntu1804_skx@ amd64_ubuntu2004_zen2/ amd64_ubuntu2204_zen2/ x86_64_centos7_zen2/`
      - Good! Let's use that kind of output!
        - perhaps in another order, e.g. **el9** at the end, like for RPMs
          - (RedHat family v.9)
    - https://www.eessi.io/docs/software_layer/cpu_targets/
    - https://serverfault.com/questions/952825/how-to-detect-nvidia-gpu-with-puppet
      - script
  - architecture (at least for new clusters)

Commands we use to detect architecture and GPU model:

- `archspec cpu` for the architecture, installed as a module
- `nvidia-smi --query-gpu=name --format=csv,noheader` for the GPU model
- See also https://serverfault.com/questions/952825/how-to-detect-nvidia-gpu-with-puppet

### Finding architecture-like things

```
[rackham5: ~/github-sync/local/install-methods] $ ./nodetype.sh
el7_broadwell
[r1001: ~/github-sync/local/install-methods] $ ./nodetype.sh
el9_haswell
[s185: ~/github-sync/local/install-methods] $ ./nodetype.sh
el7_sandybridge_tesla_t4
```
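A rough sketch (not the real `nodetype.sh` from UPPMAX/install-methods) of how a node-type string like the ones above could be composed from the detection commands listed in these notes; it assumes `archspec` and, on GPU nodes, `nvidia-smi` are on PATH.

```bash
#!/bin/bash
# Sketch only: the real logic lives in
# https://github.com/UPPMAX/install-methods/blob/main/nodetype.sh
set -euo pipefail

# OS generation, e.g. "el9" on Rocky Linux 9, "el7" on CentOS 7
os="el$(. /etc/os-release && echo "${VERSION_ID%%.*}")"

# CPU microarchitecture, e.g. "broadwell", "haswell", "zen4"
cpu="$(archspec cpu)"

nodetype="${os}_${cpu}"

# Append the GPU model if one is present, e.g. "_tesla_t4"
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n1 \
           | tr '[:upper:]' '[:lower:]' | tr ' ' '_')"
    [ -n "$gpu" ] && nodetype="${nodetype}_${gpu}"
fi

echo "$nodetype"   # e.g. el9_haswell or el7_sandybridge_tesla_t4
```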
### Discuss+DECIDE

- Alternatives and prioritizations, given the now clear deadlines
  - see above
  - we go for alternative 1
- Mount points
  - OK
- EB tree
  - /sw/Easybuild/broadwell-rocky9
  - etc.
  - alternatives
    - keep, and use a new one on Pelle/Maja
    - **same as Pelle/Maja** YES
      - perhaps some changes for the new systems
- Module tree
  - still `/sw/mf/..` for Rocky9
  - Pelle/Maja: we'll see
  - alternatives
    - **keep (almost)** and use a new one on Pelle/Maja YES
      - different on Pelle (somewhat)
    - same as Pelle/Maja
  - non-EB?? how to combine
    - same procedure as before to use mf files, and run post-install scripts? YES?
      - almost
      - makeroom can still be used
- Software file tree
  - /sw/apps + /sw/bioinfo, /sw/EB/...
    1. non-EB: basically the same
    2. new tree for EB
  - alternatives
    - A. keep, and use a new one on Pelle/Maja: YES
      - keep the old one for reference
    - B. same as Pelle/Maja: NO
  - EB builds are automatically categorized
    - what about non-EB?
      - keep the file-tree structure for those?
      - just adding architecture
- Gorilla is the main "drive"
  - Domus may be used to some extent, JUST in the beginning

### Who does what?

- EB builds
- "manual" installations
- ??copy over some

### Order of things? Change if necessary!

1. lmod and EB working
2. GCCcore builds (then each software build is faster)
3. Make lists of tools to install and sort into EB/non-EB
4. Install

### System installations

- compilers etc. (basic), via yum
  - gcc 11.4
  - g++ 11.4
  - gfortran 11.4
  - cmake 3.26.5
  - make 4.2
  - autotools
    - libtool 2.4.16 (2014, OK)
    - automake 1.12
    - autoconf 2.69
    - m4 1.4.19
  - gawk 5.1.0
  - perl 5.32
- archspec, both system-wide and as a module
- Also listed here: https://jira.its.uu.se/browse/SYS-1734
- Also: https://jira.its.uu.se/browse/UPP-684
- UPP-681: https://jira.its.uu.se/projects/UPP/issues/UPP-681?filter=allopenissues
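For illustration only, a hedged sketch of pulling in the basic packages listed above with dnf on Rocky 9; the exact package names may differ from what the system group actually installs.

```bash
# Illustrative only: package names are Rocky 9 / EL9 defaults and may not match
# exactly what was installed on the nodes.
sudo dnf install -y \
    gcc gcc-c++ gcc-gfortran \
    cmake make \
    libtool automake autoconf m4 \
    gawk perl

# archspec was requested both system-wide and as a module; system-wide it could,
# for example, be installed from PyPI:
sudo python3 -m pip install archspec
```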
### ToDos until 23 Sep

- BC
- DI
- DS
- think of an arch description for the arch variable/directories
- ...

## Meeting 2024-09-02

Participants: Douglas, Diana, Björn, Pär

- Pär installs lmod
  - configuration still left to do

## Meeting 2024-05-27

Participants: Douglas, Diana, Martin A, Pär, Björn C

- update from Åke (HPC2N) via Martin A
  - they have started looking into EESSI
  - Åke agreed to give a presentation for UPPMAX
  - still unclear if or how we'd integrate EESSI on our clusters
- on-demand solution for RStudio, Jupyter, ...?
- some of the performance issues on Bianca could be improved by moving the software from castor to cygnus
- Doug is working on a list of software that there are EasyBuild recipes for
- packages that may pose security vulnerabilities: should they be installed system-wide or via EasyBuild?
  - the latter is possible, but we'll need to make sure we reinstall software that depends on these

## Meeting 2024-05-20

Participants: Diana, Pavlin, Jonas, Jayant, Martin A, Pär, Douglas

- EESSI seems like a good idea, but
  - security concerns because of unknown binaries
  - do they impose some deployment strategy, e.g. arch-specific mounts?
  - HTTP caching proxies and other system support are required for performance
  - System Experts looking further: Martin A first, then Pär
  - Application Experts looking further: Pavlin, then Douglas
- Martin A willingly hands off compiler/MPI installations
- how to handle Intel oneAPI builds/RPMs still needs an answer

## Meeting 2024-05-13

Participants: Diana, Pavlin, Pär, Andreas, Douglas

- Current and future hardware architecture
  - future: https://jira.its.uu.se/browse/SYS-1734 (see ...final.pdf document)
    - summary: Zen4 48-core CPUs, AVX-512 support
    - L40s/H100 new GPUs
  - current architectures to consider
    - snowy7 CPU (moving forward, for 2TB and 4TB nodes; Sandy Bridge for thin nodes; Westmere, even older, for huge nodes, no AVX2)
    - rackham7 CPU (Broadwell)
    - rackham9 CPU
    - irham9 CPU (Irma and Bianca have identical CPUs, Haswell, tiny ISA deficiencies relative to Broadwell)
    - miarka9? this one has e.g. AVX-512 (Cascade Lake, also in Miarka-style Dis nodes)
    - Bianca new 512s 9 (Ice Lake, more advanced AVX-512, maybe similar to Zen4 in ISA support)
    - A100 GPU nodes 9 (Zen2 (?))
  - where do we have separate GPU trees? and how do we deploy?
- System-wide package installations
  - what should be included?
    - htop, image and PDF viewers, browser/firefox, R core???
- Structure of the software tree
  - the current CLUSTER split is below the version (/sw/bioinfo/TOOL/VERSION/CLUSTER), so that by default the same compilation is used on all clusters
  - EasyBuild and other tools can benefit from greater usage of -march=native and other hardware-specific optimisations
  - forget CLUSTER, switch to ARCH, and have the full software tree split at a high level (e.g. /sw/ARCH/TOOL/VERSION)
  - when one version (e.g., a static binary) works across all clusters, a "shallow copy" can be made using hardlinks (which might be more trouble than it's worth; see the sketch at the end of this section)
  - this interacts with deployment of the sw tree on the nodes; how would the sysexperts like to do that?
- Installation methods: "manual" installs / EasyBuild / Spack
  - we/the AEs would like to rely more on EasyBuild installations
  - with this in mind, it may work better if AEs take over installation of the compilers and accessories (GCC, Intel, OpenMPI, ...)
  - still some open EasyBuild questions to answer
    - which version do we start with? EasyBuild 5 is in "beta" and probably has better RPATH support
      - we need good RPATH support
    - versioning without toolchain suffixes
    - Python virtualenv support of some kind, `EBPYTHONPREFIXES` or something similar
    - how to version with patches and commits consistently
    - how to handle toolchain sunsetting in EB5 (only the last 6 toolchains / last 3 years)
- Test nodes:
  - `r1001` - irham node - Rocky9 installation
  - `r1002` - irham node - packages/libs were updated on top of the old file structure
  - what are the differences between the two?

Summary:

- packages to be installed for a usable system
  - firefox, eog/image viewer, evince/PDF viewer, htop
  - R-core may not require as many packages as previously thought
- how to provision /sw to nodes
  - split the tree on ARCH, not on CLUSTER
  - do we also split on GPU?
  - we should consider providing software via EESSI
    - and build our system on top of this using its hierarchy
    - certain assumptions may dictate what we choose to do
  - a main provisioning question: do nodes of different architectures mount different /sw trees, *or* do all nodes have access to the full tree and use `$ARCH` or something like it to access the appropriate subtree?
    - the former makes certain things easier, e.g., the path to a tool is always the same, but it resolves to the appropriate architecture on each node
    - the latter makes it easier to debug build and other issues
  - if we want to reduce duplication using, e.g., hard links between different trees, will these work if they span mount points?
    - providing a stable hierarchy could circumvent the need for this, as of course would not using architecture-specific mounts
- Martin A not available, so discussions of compiler/MPI builds and aspects of /sw provisioning are postponed until he is available
- Set up a recurring weekly rocky9/sw meeting on Mondays at 13:00
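An illustration of the "shallow copy" idea mentioned above, using hypothetical paths that follow the proposed /sw/ARCH/TOOL/VERSION layout; hard links only work within a single filesystem, so this assumes the architecture trees share a mount.

```bash
# A statically linked tool built once can be "shallow copied" into another
# architecture's tree with hard links instead of duplicating the files
# (paths are hypothetical examples):
mkdir -p /sw/el9_haswell/sometool
cp -al /sw/el9_broadwell/sometool/1.0 /sw/el9_haswell/sometool/1.0

# Both paths now point at the same inodes, so no extra disk space is used:
ls -li /sw/el9_broadwell/sometool/1.0/bin /sw/el9_haswell/sometool/1.0/bin

# Caveat: hard links cannot span filesystems, so this only works if both
# architecture trees live on the same mount.
```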