# Getting Scientific Software Installed
**Please flesh out the table of contents per session by Thu 28 August at the latest!**
### Detailed agenda
#### 09:00-10:00 Welcome
- informal, coffee, say hi, etc.
- start presentations at 10:00
#### 10:00-10:15 Introduction [speaker: Kenneth, sparring: Alex]
- `[1 slide]` general points: goal of the training, agenda, practical org, ...
- `[3 slides]` what's important to take into account when installing scientific software?
- software + dependencies (libraries)
- briefly explain that software can use code from multiple "places" (libraries)
      - write in Python, with the heavy lifting done by code written in C++/Fortran/Rust that was compiled into libraries and is controlled from the Python code
- tblite example
- interpreted vs compiled software + interpreted backed by compiled libraries
- computers only execute binary code
- source code needs to be "translated" to binary code
    - interpretation on the fly (JIT) at runtime
- compiling to binary executable
- binary programs use hardware instructions
- generic binary with broad compatibility
- does it actually run? (`Illegal instruction`/SIGILL errors)
- x86_64 binary doesn't run on Arm CPUs, and vice versa (except when emulator software is sitting in between)
    - some specific instructions may not be supported by the hardware platform you're running on (e.g. AVX-512 instructions on AMD Zen3, which doesn't support them)
- binary optimized for your hardware, good performance (`-march=native` compiler option)
- impact of vector instructions like AVX-512 or SVE
- "tuning" according to hardware characteristics: cache size, # functional units, etc.
- hardware platform
- CPU vs GPU
- CUDA runtime as requirement for running software on NVIDIA GPUs
  - CPU microarchitectures (optimized software installations)
- CPU-specific builds, $VSC_* environment variables
- `[1 slide]` what will you find in this training:
- different use cases for different levels of expertise
1. use software already available, installed by somebody else
- central modules, EESSI, existing containers images
2. install software on your own, incl. bringing pre-built binaries to the cluster
- conda+mamba+pixi, language pkg mgrs (Python venv/R/Julia/Rust+cargo)
3. building/compiling software from source code
- EasyBuild/Spack, manual build
#### 10:15-10:30 Central modules [speaker: Kenneth, sparring: Alex]
- basics of environment modules, why we use them, ...
- different implementations: Lmod vs Tcl-based Environment Modules
- shell environment
- how software is found (and dependencies are found): $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, ...
- central software stack
- software installations optimized for CPU of partition/system you are using
- cpu_rome vs cpu_milan on Tier-1 Hortense
- access via environment modules tool to "activate" software in your shell session/job
- concept of toolchains
- toolchain components
- "generations" of software installations
- one toolchain at a time, compatibility
- incl. hands-on demo (5min)
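A module tool is essentially a structured way of manipulating the shell environment; the mechanism behind "how software is found" can be illustrated with plain shell (hypothetical install prefix; real modules also adjust `$LD_LIBRARY_PATH`, `$PYTHONPATH`, ...):

```shell
# fake "installation" of a tool under a prefix, then make it findable via $PATH
mkdir -p "$HOME/demo-software/bin"
printf '#!/bin/sh\necho "demo 1.0"\n' > "$HOME/demo-software/bin/demo"
chmod +x "$HOME/demo-software/bin/demo"
export PATH="$HOME/demo-software/bin:$PATH"   # roughly what `module load` does
demo              # now found via $PATH
command -v demo   # shows which executable is picked up
```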
#### 10:30-11:00 EESSI [speaker: Lara, sparring: Sam]
- VM in AWS with `sudo` for hands-on demo of native installation
- single-node without CernVM-FS: access via container vs cvmfsexec
#### 11:00-11:30 Containers (using) [speaker: Sam, sparring: Kenneth]
- What are containers? (container vs container images)
- OCI image format
- Why Apptainer and not Docker on HPC systems
- Singularity(RC) vs Apptainer
- there are alternatives: podman, Sarus, CharlieCloud, ...
- When to use containers
- existing container image that has everything you need (examples: NGC - https://catalog.ngc.nvidia.com/containers, biocontainers, ...)
- exact same software on different systems
  - quickly testing software before getting it installed system-wide as an optimized software module
- software with very complex, custom build procedures
  - software that is too old to be built with the available toolchains
- software that has lots of deps with version restrictions that are incompatible with the available toolchains (often conda software)
- software that was built for a different distribution of Linux than the host OS.
- reproducing an environment to run a workflow created by someone else.
- When not to use containers
- isolation from host system can be a bad thing
- doesn't work with JupyterLab in OnDemand web portal
- containers + (multi-node) MPI can be tricky
- compatible MPI installation required on host
- https://permedcoe.github.io/mpi-in-container/#mpi-inside-a-container
- container image may be large (multi-GB)
- x86_64 vs Arm CPU systems
- Performance
  - container images are usually built generically, without architecture-specific optimizations => sacrificing performance for "mobility of compute"
- Trust/security
- do you know what's in the container images you are using, how the software was built/configured/patched/etc.
  - isolating the container from the outside environment (`apptainer exec -C`) may require bind mounts (`-B`)
- when importing from docker hub: always use images from the official organization
- GPU use case
- NVIDIA
- `--nv` option to `apptainer` to make GPUs available inside container
- NGC container registry
- AMD GPUs (LUMI)
- container image must be prepared for use on AMD GPUs (no CUDA support)
- hands-on with Apptainer
- relevant environment variables to configure Apptainer: control cache/tmpdir
- `apptainer run`, `apptainer exec`, `apptainer shell`
- using https://hub.docker.com/r/dkukhl/tblite ?
- `apptainer shell docker://dkukhl/tblite`
- unclear what's in there, how to use it...
- use self-built container image for tblite
- refer to https://docs.vscentrum.be/compute/software/containers.html for more info/details
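For the hands-on, the cache/tmpdir configuration mentioned above could be a ready-made snippet (config sketch; the exact paths under `$VSC_SCRATCH` are an assumption):

```
export APPTAINER_CACHEDIR=$VSC_SCRATCH/apptainer/cache
export APPTAINER_TMPDIR=$VSC_SCRATCH/apptainer/tmp
```

This keeps multi-GB image caches and temporary build files out of the (small) home directory.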
#### 11:30-12:00 Conda+mamba+pixi [speaker: Maxime, sparring: Alex]
- Conda: a package and environment manager running on all major operating systems and hardware platforms
Main ideas:
- programming language-agnostic
- support for having multiple independent environments
- taking care of all (incl. OS) dependencies (except for glibc)
- no root privileges required
- preference for precompiled Conda packages
Common Conda package repositories ('channels'):
- 'default' channels like 'main' and 'r' (note: subject to Anaconda's Terms of Service)
- 'community' channels like 'conda-forge' and 'bioconda'
  Different implementations: Anaconda, Miniconda, Miniforge, Micromamba, Pixi
- Example environment creation and activation:
Setting up Miniforge (note: conda-forge = set as only channel by default)
  ```
  $ module load Miniforge3/25.3.0-3
  # If you haven't done so already, ensure that Conda/Mamba is not using your $VSC_HOME, for example:
  $ conda config --append envs_dirs $VSC_DATA/conda_envs
  $ conda config --append pkgs_dirs $VSC_SCRATCH/conda_pkgs
  $ cat ~/.condarc  # to verify these settings
  # Creating an environment:
  $ conda create -n myfirstenv tblite=0.3.0 "libblas=*=*openblas"  # alternatively: 'mamba create ...'
  $ source activate myfirstenv
  $ tblite --version
  # Switching to MKL (see also https://conda-forge.org/docs/maintainer/knowledge_base/#switching-blas-implementation):
  $ conda install "libblas=*=*mkl"
  # Deactivation (note: 'conda deactivate' takes no environment name):
  $ conda deactivate
  # Analyzing dependencies (using conda-tree installed in a separate environment, using an input YAML file):
  $ printf "dependencies:\n - conda-tree=1.1.1\n" > environment.yaml
  $ conda env create -n mysecondenv -f environment.yaml
  $ source activate mysecondenv
  $ conda-tree -n myfirstenv deptree
  # Cleaning package caches (which can grow to tens of GBs):
  $ conda clean --all
  ```
- Pitfalls:
* Cannot be used together with (centrally installed) modules
* Potential lack of microarchitecture optimizations
(see also https://conda-forge.org/docs/maintainer/knowledge_base/#microarch)
* Reserve the 'base' environment for conda-related packages such as conda-build or conda-tree
(see also https://www.anaconda.com/docs/tools/working-with-conda/environments)
* Comparatively high storage demands (large volumes and number of files)
Containerizing helps to reduce the latter
* Potential difficulties in reproducing environments
    ```
    $ conda env export > env.yaml
    $ conda env create -n mynewenv -f env.yaml
    # For other platforms or (if relevant) other μarchs, add --no-builds to the export command
    $ conda env export --from-history  # can be useful to only see what you requested
    ```
* 'default' channels like 'main' and 'r' are subject to Anaconda's Terms of Service
If you are an academic user, the license permits you to freely use packages from these channels.
But when leaving the university (or if you're from industry to begin with), these channels require a paid license.
- Other Conda package managers:
  * Micromamba: single executable file; potentially faster
    https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html
    ```
    $ "${SHELL}" <(curl -L micro.mamba.pm/install.sh)
    Micromamba binary folder? [~/.local/bin]
    Init shell (bash)? [Y/n] n
    Configure conda-forge? [Y/n] Y
    # you can initialize your shell later by running: micromamba shell init
    $ export PATH="${HOME}/.local/bin:${PATH}"  # you can add this to your ~/.bashrc (note: '~' does not expand inside quotes)
    # manually configure your envs_dirs and pkgs_dirs in ~/.condarc ("micromamba config --append" won't work)
    ```
  * Miniconda: what we used to document previously, but it carries certain disadvantages
    (easy to install packages in the base environment by mistake; easy to run into problems due to the 'default' channels).
    If you want to use Miniconda, preferably use a centrally installed Miniconda3 module.
    Using the local installer results in duplicating the same ~25k files (~250 MB?) for each user.
* Pixi: project- rather than environment-centric; more emphasis on speed; more reproducible environments via lockfiles
<similar tblite example as above, with pixi>
#### 12:00-13:00 Lunch Break
#### 13:00-13:30 Language package managers: Python's pip+venv [speaker: Steven, sparring: Robin]
- Welcome to (Python package management) hell!
- Motivate the need for isolated environments
- Motivate the need for automatic dependency resolution
- When to use pip+venv (referring to earlier presentations)
- Conda+mamba+pixi => provide nearly complete isolation, resulting in high storage demands
- Central module installation => burden for VSC support teams, not feasible to install all Python packages
- pip+venv allows extending Python packages that are centrally installed (e.g. `NumPy`) with your own packages
- How to use pip+venv
- Creating a venv:
```
cd $VSC_DATA
    python -m venv my_first_venv
source my_first_venv/bin/activate
```
- Installing packages with pip; de facto standard index is [PyPI](https://pypi.org/) (the Python Package Index); installing a package is in most cases as simple as `pip install <package_name>`;
- Reproducing an environment via a requirements file; note importance of fixing versions
- The [vsc-venv](https://docs.hpc.ugent.be/Linux/setting_up_python_virtual_environments/#vsc-venv-python-virtual-environment-wrapper-script) project: combination with modules and architecture specifics
```
module load vsc-venv
echo "tblite == 0.5.0" > requirements.txt
echo "SciPy-bundle/2024.05-gfbf-2024a" > modules.txt
source vsc-venv --activate -r requirements.txt --modules modules.txt
```
- Alternatives
- Many package managers are available
  - [uv](https://docs.astral.sh/uv/guides/install-python/): a fast Python package manager and installer written in Rust; a drop-in replacement for pip with much faster dependency resolution
- [poetry](https://python-poetry.org/): Python packaging and dependency management made easy
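The pip+venv workflow above, end to end (sketch; assumes a `python3` that includes the standard-library `venv` module):

```shell
python3 -m venv my_first_venv   # create the isolated environment (e.g. under $VSC_DATA)
. my_first_venv/bin/activate    # activate: puts my_first_venv/bin first in $PATH
python -c 'import sys; print(sys.prefix)'   # now points inside my_first_venv
pip freeze > requirements.txt   # pin exact versions so the environment can be reproduced
deactivate                      # restore the original shell environment
```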
#### 13:30-14:00 Language package managers: R/Julia/Rust/Java [speaker: Alex, sparring: Kenneth]
- single-slide with high-level overview for each of R/Julia/Rust/Java
- packages/libraries/crates
- compare with Python (packages) & pip/venv
- `[15min incl. demo]` R (packages/libraries)
- How to find available R packages in modules?
- The module `R` is a basic installation with a minimum set of R packages just to allow development in R
- The module `R-bundle-CRAN` provides an extensive collection of R packages from the CRAN repository. Packages cover all scientific domains.
- The module `R-bundle-Bioconductor` provides an extensive collection of R packages from the Bioconductor repository. Packages cover the bio sciences.
    - Some R packages are provided as dedicated modules. This happens for R packages that need a large collection of libraries (e.g. INLA, Seurat) or for R packages that need specific non-R dependencies (e.g. RPostgreSQL)
    - R packages usually are *extensions* of these bundle modules:
```
      $ module spider ggplot2
--------------------------------
ggplot2:
--------------------------------
Versions:
ggplot2/3.4.4 (E)
ggplot2/3.5.1 (E)
Names marked by a trailing (E) are extensions provided by another module.
--------------------------------
For detailed information about a specific "ggplot2" package (including how to load the modules) use the module's full name.
Note that names that have a trailing (E) are extensions provided by other modules.
For example:
$ module spider ggplot2/3.5.1
--------------------------------
$ module spider ggplot2/3.5.1
--------------------------------
ggplot2: ggplot2/3.5.1 (E)
--------------------------------
This extension is provided by the following modules. To access the extension you must load one of the following modules. Note that any module names in parentheses show the module location in the software hierarchy.
R-bundle-CRAN/2024.11-foss-2024a
```
- How to install your own R packages?
- simple case where `pkg_name` will be downloaded from a remote repo, usually CRAN
```
install.packages("ggplot2")
```
- control what you install: specify version by using `install_version`
- install a specific version
```
library(remotes)
install_version("ggplot2", version = "3.5.1")
```
- from a specific repository
```
install.packages("ggplot2", repos = "http://R-Forge.R-project.org", dependencies = TRUE)
```
- control what you install: dependencies...
- from a repository in GitHub
```
library(devtools)
install_github("tidyverse/ggplot2")
```
- control what you install: check version
- from a local file
```
install.packages("/path/to/ggplot2-3.5.1.zip", repos = NULL, type = "source")
```
    Caveats: Some R packages might involve compiled code in other languages (e.g. C). In those cases, the resulting package might only work on systems with the same CPU architecture as the one used for the installation.
- Location of installed R packages
    R will check multiple directories for packages/libraries and will try to install new packages in the top directory reported by the function `.libPaths()`:
```
> .libPaths()
[1] "/user/brussel/101/vsc10122/R/x86_64-pc-linux-gnu-library/4.4"
[2] "/apps/brussel/RL8/skylake-ib/software/R-bundle-CRAN/2024.11-foss-2024a"
    [3] "/apps/brussel/RL8/skylake-ib/software/R/4.4.2-gfbf-2024a/lib64/R/library"
```
    1. Your personal R library. Can be changed with the environment variable `R_LIBS_USER`. By default located under `~/R/<platform>-library/<R version>`.
2. Site libraries added by module files. Read-only.
3. Main R library of the active R interpreter. Read-only.
All these libraries are specific to the active version of R in use. If you install some package in your personal R library and switch to a different R module with a different version, you will have to install it again.
- RStudio Projects
RStudio Projects provide a self-contained, organized environment in R. Each project has its own working directory, workspace and history which helps to avoid conflicts between different projects.
    The **vsc-Rproject** tool helps you create RStudio Projects: it allows compiling extensions in a more portable way and simplifies setting up your R environment (including selecting the correct R module). Once activated, commands such as `install.packages()` or `devtools::install_github()` will yield portable installations in a well-defined, project-specific location.
1. Load the vsc-Rproject module
```
$ module load vsc-Rproject
```
2. Define the version of R and any other modules as base of your project in a `modules.txt` file
```
$ echo "R/4.4.2-gfbf-2024a" > $VSC_DATA/modules.txt
$ echo "R-bundle-Bioconductor/3.20-foss-2024a-R-4.4.2" >> $VSC_DATA/modules.txt
```
3. Create your R project
```
$ vsc-rproject create MyProject --modules="$VSC_DATA/modules.txt"
```
4. Activate your R project
```
$ vsc-rproject activate MyProject
```
    R projects made with **vsc-Rproject** are self-contained in their own folder, which by default is located in your `$VSC_DATA`. This makes it easy to move your project between clusters, as long as the modules used to create the project are also available there.
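    If you prefer plain R over vsc-Rproject, the personal R library can be relocated off `$HOME` via the `R_LIBS_USER` environment variable (config sketch; the exact path is an assumption, `%p`/`%v` are R's placeholders for platform and R version):
    ```
    export R_LIBS_USER=$VSC_DATA/R/%p-library/%v
    ```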
- `[5min]` Julia
- How to find available Julia packages in modules?
- The module `Julia` provides a standard installation of Julia without any extra packages.
- Some specific Julia packages can be provided with their own modules (e.g. IJulia)
- How to install your own Julia packages?
The best approach to install Julia packages is by making Julia environments. These are self-contained collections of packages that can be activated on-the-fly for each different project and easily shared with colleagues.
1. Make a new Julia environment
```
$ mkdir MyProject
$ julia
julia> using Pkg
julia> Pkg.activate("MyProject")
Activating new project at `/user/brussel/101/vsc10122/julia/MyProject`
julia>
```
2. Add software to your Julia environment
```
julia> Pkg.add("TightBindingToolkit")
Resolving package versions...
Installed StridedViews ─────────── v0.4.1
Installed PyPlot ───────────────── v2.11.6
Installed TupleTools ───────────── v1.6.0
[...]
Updating `/user/brussel/101/vsc10122/julia/MyProject/Project.toml`
Updating `/user/brussel/101/vsc10122/julia/MyProject/Manifest.toml`
[...]
Precompiling project...
186 dependencies successfully precompiled in 107 seconds. 36 already precompiled.
```
3. Start using the new software
```
julia> using TightBindingToolkit
```
Caveats:
    * The packages installed do not go into the environment folder, but into your Julia *depot* in your home directory (`~/.julia`). Move your personal Julia depot to scratch storage to avoid filling up your home directory:
```
$ mkdir -p ~/.julia
$ mv -i ~/.julia $VSC_SCRATCH/julia
$ ln -s $VSC_SCRATCH/julia ~/.julia
```
- `[2min]` Rust (crates)
- `cargo install`
- control what gets installed
- `Cargo.lock`
- simple Python/R/Julia example for hands-on exercise (pick one)
#### 14:00-14:30 Manual installation [speaker: Alex, sparring: Robin]
- `cmake` + `make` example
- constraints: which CPU you're targeting, link to the OS, all manual steps (harder to reproduce)
- `buildenv` modules
- compilers (GCC/LLVM/Intel/NVHPC/...), toolchains (MPI/BLAS/LAPACK/...), CMake, Makefile, environment ($PATH, ...)
- required dependencies
- from scratch vs on top of modules
- CPU-specific builds
- link to other VSC trainings
#### 14:30-15:00 Coffee break
#### 15:00 building Apptainer containers [speaker: Sam, sparring: Robin]
- interactively from Docker image
- without modifications: create SIF image directly
`apptainer build ubuntu-24.04.sif docker://ubuntu:24.04`
- with modifications (interactive): first create sandbox
```
apptainer build --sandbox mysandbox docker://ubuntu:24.04
apptainer shell --writable --fakeroot mysandbox
# (make changes, install extra packages)
apptainer build mycontainer.sif mysandbox
```
- using Apptainer definition file
- `apptainer build --fakeroot mycontainer.sif mydefinition.def`
- show definition file for our `tblite` container image
- from Dockerfile via docker image
1. create/download Dockerfile
1. create Docker image with Docker or Podman
    - needs a machine with docker/podman (on Windows: WSL)
1. create tarball from Docker image
`docker save -o my_docker_image.tar my_docker_image:my_tag`
1. convert tarball to apptainer image
`apptainer build my_apptainer_image.sif docker-archive:my_docker_image.tar`
https://apptainer.org/docs/user/main/docker_and_oci.html#containers-in-docker-archive-files
- or translate Dockerfile to Apptainer definition file
- GPU-enabled containers
- start from prebuilt CUDA (NGC) or ROCM images as a base
- generic containers vs architecture-specific containers
- same considerations apply as for native builds
- MPI-enabled containers
- compatibility with host MPI
- HPC container wrapper
https://github.com/CSCfi/hpc-container-wrapper
creating containers for conda/pip packages
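For the definition-file part, a minimal example could be shown first (illustrative sketch only, NOT the actual `tblite` definition file):

```
Bootstrap: docker
From: ubuntu:24.04

%post
    apt-get update && apt-get install -y --no-install-recommends python3

%runscript
    exec python3 "$@"
```

Built with `apptainer build --fakeroot minimal.sif minimal.def`.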
#### 15:30 EasyBuild+Spack [speaker: Kenneth, sparring: Steven]
- (30min => 10-15 slides max + hands-on demo)
- steep learning curve to use these tools (especially if something doesn't work as hoped)
- when to use these tools?
- automate installation procedure
- missing installations in central modules
- test stuff supported but not installed centrally yet (EasyBuild)
- updating existing easyconfig
- EasyBuild
- what is it + important aspects
- configuration (`$VSC_OS_LOCAL/$VSC_ARCH_LOCAL$VSC_ARCH_SUFFIX`)
- basic usage
- on top of existing modules
- on top of EESSI (https://www.eessi.io/docs/using_eessi/building_on_eessi/)
- Spack
- what is it + important aspects
- configuration: control where stuff goes (`~/.spack`)
- basic usage
- binary cache (https://cache.spack.io)
- compare with EESSI
- main differences between EasyBuild & Spack
- support by VSC support teams: limited for Spack
- use cases
- combo with existing modules (EasyBuild)
- with Spack, external packages (but untested with EB installations)
  - using an existing easyconfig vs updating an easyconfig vs writing a new one
- hands-on demo
- same software with EasyBuild + Spack
- pre-install compiler with Spack
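The EasyBuild configuration could be sketched with its environment variables (config sketch; the per-architecture prefix below is an assumption modelled on the VSC variables mentioned above):

```
export EASYBUILD_PREFIX=$VSC_DATA/easybuild/$VSC_OS_LOCAL/$VSC_ARCH_LOCAL$VSC_ARCH_SUFFIX
module use $EASYBUILD_PREFIX/modules/all   # make self-installed modules visible
```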
#### 16:00-16:15 Creating patch files ??? [Kenneth]
#### 16:15-16:30 Conclusions [Kenneth?]
- comparison matrix over various tools + various aspects
- like in https://archive.fosdem.org/2018/schedule/event/installing_software_for_scientists/attachments/slides/2437/export/events/attachments/installing_software_for_scientists/slides/2437/20180204_installing_software_for_scientists.pdf (see slide 31)
- aspects to compare:
- supported software
- how-easy-to-use on your laptop/PC (Linux/macOS/Windows)
- user experience
- performance of installed software
- reproducibility
- time-to-result
- hands-on session (per VSC site)
#### 16:30-17:00 Q&A