Archive of LUMI Coffee Break 2023

This is the archive document of the past sessions of the public LUMI coffee break that happened in 2023.
The archived questions and answers of 2022 sessions can be found here: https://hackmd.io/@lust/coffeearchive2022

Link to document for the next coffee break: https://md.sigma2.no/lust-coffee-break

Past sessions:

2023-11-29 13:00-13:45 CET

LUMI Updates

  • Integration of the third phase hardware, as well as a software stack upgrade: 512 new LUMI-C nodes (65,536 cores), improved interconnect capability, Cray Programming Environment 23.09, together with upgrades of the Cray Operating System and the system firmware to improve system stability and performance.

Presentation

  • Features and information about the newly released web access portal to LUMI: www.lumi.csc.fi

Questions & Answers

  1. Can I have a tmux session persist after logout or ssh disconnect? I know there are plugins such as tmux-resurrect to save and restore sessions, but there are cases where I might be running a task in a tmux session which I would like to be able to resume or reconnect to after a logout. I understand some of the issues one can raise around this approach, but there are other less abusive uses of tmux, like simply storing a working environment (incl. environment variables), which I find very useful. Hopefully you can help answer, or maybe suggest tools to restore the full working environment of several sessions. Thanks!

    • The tmux executable is on the system but we do not actively support it. The resources on the login nodes are limited and if everybody leaves things running, we'd run out of resources.

      Personally I have bash functions in my login environment that I use to initialise specific session types (modules and environment variables).
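
      As an illustration, a minimal sketch of such a function in ~/.bashrc (the module names and paths below are only placeholders, not a recommendation for a specific stack):

          # define a named work environment that can be re-created after every login
          init_ml_session () {
              module load LUMI/23.09 partition/G                   # placeholder stack/partition
              export MY_PROJECT_DIR=/project/project_465000XXX    # placeholder project directory
              cd "$MY_PROJECT_DIR"
          }

      Calling init_ml_session after each login rebuilds the environment instead of keeping a stale session around.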

      As the environment on LUMI sometimes changes it is also not a good idea to store it unless you understand very well the problems that can occur.

      Not refreshing the environment also has other disadvantages. E.g., we've seen corrupt Lmod environments and logging out and in helps to clean things up. Or we change things on the system that are then not picked up, and if users keep using the same session for weeks and then submit tickets it's really not the first thing we think about as the cause of the problem.

  2. We would like to use the Gaussian quantum chemistry software on LUMI (I'm sure there are more users who would be interested in using it). Gaussian doesn't support a "bring your own license" approach, and they strictly determine the location where it is allowed to be used. Gaussian customer support clarified that our license is only valid at the licensed location (e.g. university campus). They also mentioned that CSC has a supercomputer center license, which allows external academic users access to the binary code, but this license is not valid for LUMI. CSC last year expressed an interest in obtaining a Gaussian license for LUMI as well, but so far there have been no steps toward that. Can you clarify if and when the Gaussian software will be available on LUMI?

    • LUST will not invest in the license. If CSC wants to for their users, they can, but we don't speak for CSC. It is not a very scalable code (its LINDA parallelisation technology is basically technology from the early '90s that does not exploit modern fast interconnects well) and unless you have a source code license, it may not even run on LUMI. We have very bad experiences with some other codes already that come as binaries. The interconnect can be a problem, but people should also realise that the compute nodes only run a subset of SUSE Linux as some daemons are disabled (and we know software that doesn't work or has limited functionality because of this). Software that can use the AMD GPUs has a higher priority for our central budget. They only support some NVIDIA GPUs and no AMD GPUs.

      For a system whose prime focus is the development of exascale technologies, their license, which forbids comparisons with other codes, is also not attractive.

  3. I'm already in contact with support about this, so sorry if this is a repetition. I'm trying to get some multi-node PyTorch code to run using torchrun but for some reason it fails with NCCL (connection) errors. The code works on a single node and I earlier had a variant that (sometimes) worked with multiple nodes, but irregularly failed. Support has tried PyTorch examples for multi-node code which seemed to work, but the code I have still fails. The code in

    • You are talking here to the same people who handle the tickets, so we really cannot say much more here.

      The message I got from AMD is that torchrun is actually not the ideal way to run PyTorch on LUMI. When they built the container for LUMI, they started PyTorch via Python itself.

      • OK, but which way should be used then? Handling it all manually, when there is a wrapper in place that is supposed to take care of exactly these multi-node issues?

      The script I have seen uses srun outside the container with each container starting a Python process with access to 1 GPU.

      • I also tried one srun now within an allocation. Same issue.

      Basically, the script I got is

      #!/bin/bash -e

      wd=$(pwd)
      jobid=$(squeue --me | head -2 | tail -n1 | awk '{print $1}')


      #
      # Example assume allocation was created, e.g.:
      # N=1 ; salloc -p standard-g  --threads-per-core 1 --exclusive -N $N --gpus $((N*8)) -t 4:00:00 --mem 0
      #

      set -x

      SIF=/appl/local/containers/sif-images/lumi-pytorch-rocm-5.6.1-python-3.10-pytorch-v2.1.0.sif

      # Utility script to detect the master node
      cat > $wd/get-master.py << EOF
      import argparse
      def get_parser():
          parser = argparse.ArgumentParser(description="Extract master node name from Slurm node list",
                  formatter_class=argparse.ArgumentDefaultsHelpFormatter)
          parser.add_argument("nodelist", help="Slurm nodelist")
          return parser


      if __name__ == '__main__':
          parser = get_parser()
          args = parser.parse_args()

          first_nodelist = args.nodelist.split(',')[0]

          if '[' in first_nodelist:
              a = first_nodelist.split('[')
              first_node = a[0] + a[1].split('-')[0]

          else:
              first_node = first_nodelist

          print(first_node)
      EOF

      rm -rf $wd/run-me.sh
      cat > $wd/run-me.sh << EOF
      #!/bin/bash -e

      # Make sure GPUs are up
      if [ \$SLURM_LOCALID -eq 0 ] ; then
          rocm-smi
      fi
      sleep 2

      export MIOPEN_USER_DB_PATH="/tmp/$(whoami)-miopen-cache-\$SLURM_NODEID"
      export MIOPEN_CUSTOM_CACHE_DIR=\$MIOPEN_USER_DB_PATH

      # Set MIOpen cache to a temporary folder.
      if [ \$SLURM_LOCALID -eq 0 ] ; then
          rm -rf \$MIOPEN_USER_DB_PATH
          mkdir -p \$MIOPEN_USER_DB_PATH
      fi
      sleep 2

      # Report affinity
      echo "Rank \$SLURM_PROCID --> \$(taskset -p \$\$)"


      # Start conda environment inside the container
      \$WITH_CONDA

      # Set interfaces to be used by RCCL.
      export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3

      # Set environment for the app
      export MASTER_ADDR=\$(python /workdir/get-master.py "\$SLURM_NODELIST")
      export MASTER_PORT=29500
      export WORLD_SIZE=\$SLURM_NPROCS
      export RANK=\$SLURM_PROCID
      export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID

      # Run app
      cd /workdir/mnist
      python -u mnist_DDP.py --gpu --modelpath /workdir/mnist/model

      EOF
      chmod +x $wd/run-me.sh

      c=fe
      MYMASKS="0x${c}000000000000,0x${c}00000000000000,0x${c}0000,0x${c}000000,0x${c},0x${c}00,0x${c}00000000,0x${c}0000000000"

      Nodes=4
      srun --jobid=$jobid -N $((Nodes)) -n $((Nodes*8)) --gpus $((Nodes*8)) --cpu-bind=mask_cpu:$MYMASKS \
        singularity exec \
          -B /var/spool/slurmd \
          -B /opt/cray \
          -B /usr/lib64/libcxi.so.1 \
          -B /usr/lib64/libjansson.so.4 \
          -B $wd:/workdir \
          $SIF /workdir/run-me.sh
      

      that a colleague of mine tested.

      It is also the basis of what we have tried to pack in a wrapper module for the PyTorch containers we got from AMD.

  4. Are there plans to add other IDEs like PyCharm along with VSCode?

    • You can install the PyCharm VS Code plugin, but be careful that files are not all installed in your home directory (there is a small file quota there)
  5. How can you run PyTorch from Open OnDemand?

  6. Will you look at supporting the Mojo programming language in the future, once the language has developed to a more mature level?

    • It is unfortunately very difficult for us to support many packages and software stacks as we are a very small team. Instead we provide a simple way for you to install packages yourself using EasyBuild: https://docs.lumi-supercomputer.eu/software/installing/easybuild/ (see the sketch after this list)

    • Once there is a Mojo easyconfig available you can install it easily. Maybe ask the developers to create one, or send us a support request and we can have a look.

    • [Kurt] Well actually, I notice it is something for AI software with Python, so it would have to go into those containers. And you can probably just install it with pip on top of one of the existing containers.
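
    As a rough sketch of that user EasyBuild workflow (the easyconfig name below is purely hypothetical; none exists for Mojo yet):

        export EBU_USER_PREFIX=/project/project_465000XXX/EasyBuild   # placeholder project
        module load LUMI/23.09 partition/C EasyBuild-user
        eb SomePackage-1.0-cpeGNU-23.09.eb -r                          # hypothetical easyconfig name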

  7. Regarding Jupyter, why is it required to load a virtual environment to use Python packages instead of using system-wide installations?

    • LUMI is a multi-user system. There is no configuration that is good for everybody, so something system-wide does not make sense. Moreover, another restriction that limits what we can do system-wide is that Python distributes packages as many small files, which puts a severe load on the file system, so we prefer Python in containers.

      Virtual environments actually also became important simply because there are so many package conflicts in Python.

    • You can also install your Python environment easily with the container wrapper: https://docs.lumi-supercomputer.eu/software/installing/container-wrapper/ . It encapsulates everything nicely in a container and puts no strain on the LUMI file system.

  8. Comment: mlflow may be added to web interface, in addition to tensorboard. This could be useful also for pytorch users.

2023-10-25 13:00-13:45 CEST

LUMI Updates

  • Integration of the third phase hardware, as well as a software stack upgrade: 512 new LUMI-C nodes (65,536 cores), improved interconnect capability, Cray Programming Environment 23.09, together with upgrades of the Cray Operating System and the system firmware to improve system stability and performance. Planned completion date: 6 November 2023
  • LUMI user survey: a comprehensive survey will be sent to all LUMI users in November. Results of this survey will be presented to LUMI users in December. We encourage all of you to fill it in as accurately as possible. Thank you.
  • Hackathon - Kraków, Poland - 27.11-1.12 preceded by a profiling course on 22.11.
  • Preliminary courses next year:

Questions & Answers

  1. I am interested in profiling the memory performance of MPI applications. Is there a tool to measure the memory bandwidth or cycles lost waiting for memory accesses such as the one provided by PAPI. I know that this PAPI performance counter is currently not available for zen3 processors but is there any other tool that can measure memory performance? I have used CrayPat but I haven't found anything relevant.

  2. We are experiencing extremely low GPU performance with multi-node training. We assume that the performance bottleneck occurs due to slow inter-node communication as a result of not using the aws-ofi-rccl plugin that takes advantage of the Slingshot 11 interconnect. Do you think this plugin can considerably affect the inter-node communication speed?
    Are the instructions related to multi-gpu training https://docs.lumi-supercomputer.eu/software/packages/pytorch/ still up-to-date, or are there any additional steps required for a proper installation of aws-ofi-rccl?
    We followed the recommendations from https://docs.lumi-supercomputer.eu/software/packages/pytorch/ and built a Singularity container based on the official rocm/pytorch image. The difference is that we installed extra Python packages inside the container rather than creating a virtual environment in the host directory outside the container. The singularity-bindings module and the aws-ofi-rccl plugin are also installed; however, based on the training logs, the script does not find aws-ofi-rccl and uses the internal RCCL implementation instead. Despite the fact that singularity-bindings should define all the necessary env variables (and the path to the aws-ofi-rccl lib file itself exists inside the container), RCCL still does not use the plugin.

    • Yes, it can seriously affect performance

      AMD has prepared some containers with PyTorch with built-in AWS OFI RCCL plugin. We're working on making these available in a more user-friendly way, but they basically work as described in the demo in the last 4-day course and are in /appl/local/containers. Expect them to be broken though after the current maintenance as the OFI plugin nearly always breaks even after minor system updates. It will take some time to repair.

      We're also still looking for a way to more easily extend these containers.

    Thank you. Did you use some specific version of the official rocm/pytorch container, or did you install the plugin inside the container manually? We used the most recent version of rocm/pytorch, but did not find the plugin (nor the libfabric required by the plugin) installed in the container.

    • AMD made this container for us so we don't have the full information. As far as we know they used an available binary for PyTorch, but the AWS plugin is compiled in the container, though on a system that has the Cray PE installed, as it links to specific libraries outside the container, including the libfabric library on LUMI. That libfabric library is (a) a proprietary version as it contains the code for the Slingshot 11 NIC and (b) should match other components in the network stack, so it would not be a good idea to put it in the container itself. Hence there is no way you can redo this yourself, as I doubt you have access to a system with the Cray PE that would give you sufficient rights to build Docker containers, but I assume you should be able to build upon their container.
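
    To verify which network backend RCCL ends up using in runs like the ones discussed above, a common check (assuming the standard NCCL debug variables, which RCCL also honours) is:

        export NCCL_DEBUG=INFO
        export NCCL_DEBUG_SUBSYS=INIT,NET
        # run the job as usual, then inspect the Slurm output file:
        grep -iE "ofi|cxi|Socket" slurm-*.out

    Lines mentioning the OFI plugin or the cxi provider suggest the aws-ofi-rccl plugin was picked up; a fallback to the socket backend indicates it was not.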

2023-09-27 13:00-13:45 CEST

Presentation

LUMI Updates

  • LUMI-G will be reserved for HPL benchmarks from Friday, 29 September, 16:00 EEST (15:00 CEST) to Tuesday, 3 October, 08:00 EEST (07:00 CEST).
    A reservation will be created to ensure no jobs are running when the break starts. Shorter jobs may run if they can finish before the reservation begins. Please note that LUMI-C is not included in this reservation and will remain in use.
  • Hackathon: Optimizing for AMD GPUs, 27.11.2023-1.12.2023, Kraków, Poland https://www.lumi-supercomputer.eu/events/hackathon-nov23/

Questions & Answers

  1. Q: Is it recommended/possible to use multiprecision in my tensorflow models during training?

    Answer: That's a question to ask a TensorFlow specialist. It requires domain knowledge to answer. We are a small team and cannot possibly have expertise in all applications and how they behave on LUMI.

    However, the AMD GPUs are relatively strong in FP32 and very strong in FP64 but not so strong in lower precision formats so it may not pay off as much as you would expect from some other GPUs.

    Comment on the answer: it is also a question of GPU type; with NVIDIA, the command tf.keras.mixed_precision.set_global_policy("mixed_float16") works transparently (and I am not a specialist either)

  2. Q: For cotainr, where is the image stored (on which quota does it go)?

    Answer: That depends on where you install it. We recommend that the image is stored in your project folder. The image will only be a single file.

  3. Q: For my conda installation on LUMI, I followed the instructions provided on the LUMI container wrapper doc page, unlike the cotainr build mentioned today. It seems like it did build on the file system. So should I do it again differently?
    The commands I used were:

    $ module load LUMI
    $ module load lumi-container-wrapper
    $ mkdir MyEnv
    $ conda-containerize new --prefix MyEnv env.yaml
    $ which python
    {my project path}/MyEnv/bin/python
    

    Answer: It does put some things on the file system like wrapper scripts but the main installation is done in a SquashFS file that will be among those files. But the container wrapper does, e.g., create wrapper scripts for everything in the bin directory in the container so that you don't need to use singularity commands to start something in the container.
    Comment: You can use cotainr as an alternative to the LUMI container wrapper. Please take a look at the LUMI docs page Installing Python Packages for more details about the differences.

  4. What does the --system option imply?

  5. Does the --system option install ROCm GPU-optimized BLAS/LAPACK releases when selecting lumi-g?

    Answer: The system flag defines a base image for your container. For LUMI-G it will include the full ROCm stack. You can use --base-image, if you want to specify your own base image.
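
    A minimal sketch of both variants (the file names are placeholders):

        module load LUMI cotainr
        # build on top of the LUMI-G base image (includes the ROCm stack)
        cotainr build my_container.sif --system=lumi-g --conda-env=my_env.yml
        # or point to a base image of your own choosing
        cotainr build my_container.sif --base-image=docker://ubuntu:22.04 --conda-env=my_env.yml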

  6. Is there a command similar to --post-install for cotainr that is present in the lumi-container-wrapper?

    • The --post-install command allows commands to be executed inside the container after creating the squashfs file.
    • --post-install is not available in cotainr and for best practice you should rebuild the container with the Python file.
  7. Q: Being new to containers in general, is it possible to have my "core" image built with cotainr, and when running it, pip install new packages to use in the container for one certain project? Thank you.

    Answer Containers are read-only once created so pip install would do an installation outside the container in the default Python installation directories.

  8. I use a conda env for the ML part of my code but I also have Cray compilers and tools to use with this. What are your suggestions for such mixed requirements?

    Answer I don't think there is a generic answer to that. The problem is that if you start to mix software compiled with different compilers, you can have conflicts between the run-time libraries of the compilers. Ideally you'd use the same compilers as those ML parts were built with, but that's something you don't always know. Unfortunately, compiling our own PyTorch and TensorFlow to ensure compatibility is outside what LUST can do given the size of the team, and is something that AMD advised us not to do as it is very difficult to get right.

  9. As an addition to the question about post-install: Singularity has the option to make a "sandbox" image, so that you are able to install Linux packages in the shell after creation. Wouldn't this be an easy addition that doesn't make it too complicated for the basic user? A sandbox option.

    Answer Cotainr actually uses the sandbox feature to build the container. But it is not good practice to build containers that way, as you may lose reproducibility since you no longer have a definition file that covers the whole container.

  10. Does cotainr also work with pipenv?

    Answer Currently only Conda is supported, but the documentation does show a way to add pip packages to the environment.

  11. I am running an R code on LUMI-C using the small partition. How can I efficiently allocate a whole node in order to cut billing units? Are there any specific commands to adjust the minimum number of GB per core to be allocated?

    Answer It is possible to allocate a full node in the small partition by using the #SBATCH --exclusive flag in Slurm, but you might as well run in the standard partition, which allocates a full node by default. The same goes for memory: there are flags to specify the amount of memory per core, per rank, per GPU, etc. in Slurm (please see the sbatch man page).
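
    As an illustration, a batch script header along these lines should give you a full node in small (the account and values are placeholders):

        #!/bin/bash
        #SBATCH --account=project_465000XXX   # placeholder project
        #SBATCH --partition=small
        #SBATCH --exclusive        # reserve the whole node
        #SBATCH --mem=0            # request all memory available on the node
        #SBATCH --ntasks=128
        #SBATCH --time=01:00:00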

  12. Is it easy to add something like "module load" in the cotainr build script, to start with a known group of packages? Thank you.

    Answer That doesn't make much sense in the spirit of containers as containers try to offer a self-contained environment. The primary use of modules is to support multiple versions of one package which is not of use in containers.

    Packages also are not taken from the system, but currently from the Conda repositories and for containers in general from different software repositories.

  13. Does the LUMI container image include the Cray compiler? And if yes, could this container be used on our PC?

    Answer The Cray compiler is NOT public domain but licensed software so you cannot use it outside of LUMI or other systems that have a license for it. There actually exists a containerised version of the PE but it is given only to users who have signed heavy legal documents in a specific project.

    So the Cray compiler is also not contained in the base images of cotainr and never will be unless HPE open-sources the PE, which they will not do anytime soon as it is one of the selling points of their systems.

  14. With a singularity container built with cotainr based on a conda env, is it possible to add packages to the environment after the container is built?

    Answer Please see the answers above.

  15. So is there a container with GNU/LLVM + ROCm for building Fortran code for LUMI on our PC?

    Answer Why not simply install the GNU compilers or ROCm + LLVM (the AMD compilers are LLVM-based) on your PC? ROCm in a container would still need an AMD GPU and driver installed in the PC if you want to test the software and not only compile it. In fact, not even all compilations would work, as software installation scripts sometimes need the GPU to be present because they try to automatically check the type etc.

    Comment The point was to have a good starting point that User Support has already tested.

    Reply Testing is not absolute anyway as a container still depends on the underlying hardware, OS kernel and drivers. It is a misconception that containers are fully portable. And having the software in a container can make development harder as you will always be mixing things in the container with things outside it.

    Moreover, we don't have the resources to test such things thoroughly. It is already not possible to thoroughly test everything that runs on the system, let alone that we can test if things would also work on other systems.

    Comment Thanks, bye

  16. Do cotainr images have access to all Lustre filesystems?

    Answer Yes, but you need to do the binding as explained in our singularity documentation.

2023-08-30 13:00-13:45 CEST

Presentation

  • Recent changes to the LUMI system configuration
  • Spack package manager

LUMI Updates

Future LUMI Events

1-day LUMI intro course

Short introduction to the LUMI architecture and setup. It will include lessons about the hardware architecture, compiling, using software and running jobs efficiently. After the course you will be able to work efficiently on both the CPU (LUMI-C) as well as GPU partition (LUMI-G).
https://www.lumi-supercomputer.eu/events/lumi-intro-course-21sep23/
Registration deadline: 15. September 2023, 16:00 CEST

Advanced comprehensive 4-day LUMI course

3.-6.10. Hybrid (Warsaw and online)
Includes lessons about compiling and using software, programming models (HIP and OpenMP offload), porting, executing jobs, and optimizing applications to run on AMD MI250X.
https://www.lumi-supercomputer.eu/events/general-lumi-course-oct2023/
Registration deadline: 25. September 2023, 16:00 CEST

Hackathon: Optimizing for AMD GPUs

27.11.-1.12. On site in Kraków
The hackathon is a way for you to get free help to optimize your software for the AMD GPUs available on LUMI-G!
https://www.lumi-supercomputer.eu/events/hackathon-nov23/
Registration deadline: 25. September 2023, 17:00 CEST

Questions & Answers

  1. Q: Regarding the installation of Python packages, I was trying to install mpi4jax, a package for using MPI in jax (it's used by some codes I want to run), and encountered an error when trying to install it using the command pip3 install mpi4jax==0.3.12 --no-build-isolation. Before trying the installation, I load the modules etc. by

export EBU_USER_PREFIX=/project/project_XXXXXXXXX/EasyBuild
module load LUMI/22.08
module load partition/G
module load jax/0.4.1-cpeCray-22.08-rocm5.3
module load wheel
source venv_for_mpi_test/bin/activate

At the end of the installation, I then get the error "error: command '/opt/cray/pe/craype/2.7.8/bin/cc' failed: No such file or directory" (this directory with 2.7.8 really does not exist). Is it somehow possible to tell pip to use another version (in the folder there seem to be versions 2.7.17, 2.7.18, 2.7.20 and default)? [In case this is more suitable for a support ticket, I can ask there instead.]

Answer: It is in fact practical to submit a support ticket. I wonder if it would work with the more recent LUMI/22.12 software stack.

Thanks, I'll write a support ticket.

  1. Q: I do not know all the limitations, but if you create many Spack instances in a project you can hit the quota on the number of files, right?

    Answer: Yes, that is true. For this reason we provide a "central" Spack installation with the most common software pieces already available (a so-called upstream Spack instance). If you use the spack module then you don't need to install everything in your own directory.
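
    As a rough sketch of that workflow (the exact module version may differ, so check module avail spack, and the package name is just an example):

        module load spack
        spack install somepackage     # hypothetical package; reuses the central upstream instance where possible
        spack load somepackage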

2023-06-28 13:00-13:45 CEST

Presentation

  • AI workloads on LUMI (René Løwe Jacobsen)

LUMI Updates

  • Cray operating system + Slingshot upgrade July 11-13 (to be confirmed)
  • Maintenance break planned from 15. August (installation of Phase 3 HW)

Trainings

Questions & Answers

  1. Q: Is there a possibility to share paths between projects?
    For example /projappl/project_xxxxxxxxx are missing o+x, so I can't share e.g. /projappl/project_xxxxxxxxx/public/useful_file

    Answer: It is not possible to share paths between different projects as they are likely to have different members/PIs.

    The only way to do so is to have you added to both projects.

  2. Q: Why are login nodes and interactive nodes sometimes very slow? Sometimes users are spawning a dozen processes on login nodes.

    Answer: Most of the time it's not users that are causing the slowdown but an ongoing filesystem issue. It's a problem we've had for some time now. Regarding users spawning a lot of processes, it's probably during compilation, which is generally fine. There is a huge chance that the "users" are in fact LUMI user support team members :)

    Some file system problems may also be due to user behaviour though, in particular users accessing lots of small files and putting a high load on the metadata servers. Things like colouring in ls or tab completion in directories with lots of files also cause a slow down due to excess metadata operations.

  3. Q: On this page https://github.com/Lumi-supercomputer/ml-examples/tree/main/tensorflow/hvd, options for installing Horovod are described. The first one uses cray-python and tensorflow-rocm from pip. But it does not explain what module needs to be loaded for Horovod to function, given that the environment lacks a running (GPU-aware) mpirun executable, which Horovod requires. If OpenMPI is loaded first, the horovod package will not compile with NCCL support. Knowing which packages other than cray-python (and OpenMPI?) are necessary when installing horovod with pip and not a Docker image would be helpful.

    Answer:

    • Please note that that is not LUMI's official documentation.

    • The idea there is to install horovod within the container. That's why it doesn't need any system modules loaded. The image already has a horovod installation, but somehow it doesn't work for us. We are only replacing it.

      OpenMPI is used only as a launcher while all the communication is done via RCCL.

      Nevertheless, since you are having issues with those instructions, we will have a look to see if anything needs to be changed. We should probably update it to the latest image.

  4. Q: On the same page, this script https://github.com/Lumi-supercomputer/ml-examples/blob/main/tensorflow/hvd/run.sh loads many modules and sets environment variables. But they are not explained, which makes it difficult to experiment with new Docker images. Likewise, it is not explained why some areas are mapped into the Singularity image (e.g., app). Where is this information available?

    Answer:

  5. Q: Are there X2Go-ish services available on LUMI? Didn't get any results searching the docs.

    Answer: We provide a VNC server that can be used with any VNC client or directly in your browser via noVNC. In the future this will be replaced with Open OnDemand, but no date is set for that yet. The workflow is as follows:

    $ module load lumi-vnc
    $ start-vnc
    ... will print info on how to create the SSH tunnel and
    ... how to connect
    $ your-gui-app
    $ stop-vnc
    
    • module spider lumi-vnc also displays additional usage information
  6. Q: Is Emacs available in some module? (module spider emacs returned nothing.)

    Answer: No, there is no emacs module. Emacs is installed on the login nodes but not present on the compute nodes.

  7. Q: How can the package Accelerate from Huggingface be loaded, working with Pytorch2 and AMD infrastructure?

    Answer: (Christian) One option is to use cotainr to build a Singularity/Apptainer container from a conda/pip environment. On LUMI you can module load LUMI, module load cotainr to get access to cotainr. You may use this (somewhat outdated) PyTorch+ROCm example as a starting point. Modify the conda environment YAML file to your needs, i.e. include the torch 2.0 and accelerate pip packages. You probably want to use a ROCm 5.3 based version for best compatibility with LUMI, i.e. use --extra-index-url https://download.pytorch.org/whl/rocm5.3. If you try this and it doesn't work, please submit a ticket to the LUMI user support team via https://lumi-supercomputer.eu/user-support/need-help/.

  8. Q: What is the best way of transferring data between /scratch and /flash for large files (200GB)? I have tried cp and rsync with IO errors (but it may be my file that is corrupted).

    Answer: We currently do experience a lot of I/O errors though. The larger the data transfer, the more likely one will bump into it. Our sysadmins need more information though to be able to find where the problem originates. The server on which it occurs, file systems involved and a rather precise time window are essential information as otherwise it is impossible to find things back in the log files of the server/node and of the file servers. So far we have not seen any recent cases where this was caused by file corruption. Usually the copy worked eventually.

    We have an EasyBuild recipe for mpiFileUtils. The dcp command can vastly improve the performance of a file copy from one file system to another. For example, copying 200 GB of data to the flash file system takes 12 min 20 s with the regular cp, while dcp with 4 processes takes 2 min 4 s. It can be even faster if you use multiple nodes (1 process per node): 48 s.
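
    A sketch of such a copy with dcp, run from inside an allocation (the paths are placeholders, and the module name is the one produced by the EasyBuild recipe, so check module spider mpifileutils):

        module load LUMI partition/C
        module load mpifileutils
        srun -N 4 -n 4 dcp /scratch/project_465000XXX/mydata /flash/project_465000XXX/mydata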

    Another alternative may be to try to do it via the LUMI-O object storage as each transfer then only involves a single LUSTRE file system which would halve the chance of a failure because of Lustre. I have noticed that rclone can complete failed transfers efficiently but I am not sure if it can do so for parts of files or if it always starts from the start of a file for a failed transfer. The latter would be of no help if the transfer of a single large file fails.

  9. Q: Is the Zoom link working for anyone else?

    Answer: Most of us unfortunately have problems joining as well. - Thanks!

    • At least 16 people are connected to https://cscfi.zoom.us/j/65727034273?pwd=VEdtY2trVUVKTEhxajZMbFhETWV2Zz09
    • is that the correct Zoom meeting? AFAICT nothing is happening there
    • Today we are using https://uit.zoom.us/j/66674562260?pwd=ckx3UnZrTEU3aS9rYmlIMks3dG5tQT09 (Passcode: 036094) not the CSC one
      • this link does not work for me (Duma@ecmwf + 5): "An unknown error occurred", and following through the web browser asks for a meeting ID and password
      • tried to extract password from the url did not work
        • same for me (Duma) - new error "Something went wrong"
      • Same here, does not work for me (Caspar@SURF +1)
      • "The server encountered an internal error and was unable to process your request. Error Code:103008" appears for me, when joining via browser & posted password (Taavi R. + 5)
      • This error code seems to be a Zoom server overload issue https://developers.zoom.us/docs/meeting-sdk/windows/error-codes/
      • We can confirm as well that the Zoom doesn't work in any configuration (TalTech)
      • Also doesn't work for me, neither zoom nor browser, passcode doesn't help it (bcsj@jyu)
    • We have a short 10min presentation about AI workloads which will be recorded and uploaded later.
  10. What is the Zoom Meeting Passcode?

    • 036094 but doesn't help either
    • can not join the meeting
  11. Our project is stuck on the fact that MPI_Improbe doesn't work on the Slingshot libfabric. There is the same problem on Frontier, and from today until Friday our DIRAC team will work on this during the official Frontier hackathon with HPE people [Steve Abbott and others]. We'll try to build an efficient workaround and we'll also make an inefficient check with TCP. Previously we discussed this with Kurt Lust. Could we get in touch on this during the hackathon in a somewhat more interactive way, maybe even someone from LUMI joining our team's hackathon Slack thread?
    LUMI #1919
    https://docs.nersc.gov/current/#ongoing-issues
    https://www.olcf.ornl.gov/frontier-hackathon-june-2023/

    Answer I'm not sure we have people with enough experience in that field available who can free themselves that quickly. Some of us are on vacation already so others are very busy with tickets. It is also more an L3 support thing which currently isn't really our task.

    Q: Thanks. Of course, I did not mean LUMI support joining the effort directly, but more read how it developed and maybe adding an idea what we could try, as anyone on LUMI using this feature will have the same problem. We're using it as part of ExaTensor library. So should I instead put a summary in the ticket after the hackathon is over?

    A: Maybe. I am actually not so sure it is used that much, as there are more communication networks that require libfabric and don't support UCX, so you'd think that if that call were used that much it would have been added already. But sooner or later we'll run into another issue with this.

  12. Q: I would be interested in best practices for setting up containerized virtual environments/condas. I have been following the docs (https://docs.lumi-supercomputer.eu/software/installing/container-wrapper/) but since every conda container requires a lot of space and time to set up, this slows down my workflow a lot, especially when it comes to using other people's repos (e.g. for baselines). Is there a way to stack or combine conda containers, so I don't have a dozen installations of torch? (I can't connect to zoom but will listen to the recording later)

    Answer: You should be able to work with overlays or the bind command on containers, if you want to provide some data or libraries, which are not available in the image you are using.

    (Christian) I am not sure though, that the container-wrapper supports overlays and binds. You may want to ask this question on the container-wrapper GitHub issue tracker.

  13. Q: Is GPU-GPU direct remote memory access available and tested?

    Answer: This is the kind of question that should be asked during a course when HPE and AMD staff are available. And what do you mean with direct remote memory access? Within the node accessing the memory of another GPU? Or one-sided MPI or a similar distributed memory protocol?

  14. Q: A question about the 22.08 / 22.12 / 23.03 stacks. Should it, in principle, be possible to use all of them, or are the newer ones lacking some modules? For, e.g., using rocm dockers. (Sorry if I am violating the QA format with this addendum, but I was not thinking about mixing, but that, it looked like to me, aws-ofi-rccl, for instance, was not available >22.08, and if that means the above versions can't be used.)

    Answer They cannot be mixed easily. In fact, different programming environments cannot be mixed easily either. This is because of likely incompatibilities between runtime libraries.

    22.12 is currently far less complete than 22.08, which is basically because we lack development time as we are way too busy with tickets. For 23.03 we only develop recipes with the Cray compilers at the moment because otherwise we would be doing exactly the same anyway as with 22.12. But the Cray compiler does contain an important bug fix.

    Recipes need to be ported from 22.08 to 22.12. Sometimes that is trivial, sometimes it is not. I suspect the one for aws-ofi-rccl will need some revision as it interfaces precisely with some of the components that have seen more serious changes in the last update.

    It is better to submit a ticket for that so that work can be assigned and tracked, but I know the person who made and especially tested the previous aws-ofi-rccl recipe is very busy.

  15. Is there a best practice for installing both JAX and PyTorch in the same environment? Unfortunately JAX-rocm and CONDA seem to be not working so far.

    Answer:

    • I guess you are already using container images that have JAX or PyTorch. If that's the case, I would recommend getting the JAX container and installing PyTorch there, following the instructions on its home page. JAX is kind of tough to install and at least until recently there were no PyPI binaries for it. PyTorch, on the other hand, can be installed easily with pip.

      You don't need to create a new container image. Just create a virtual environment within the container using a host directory and pip-install PyTorch there (a sketch of this follows after these answers).

    • (Christian) Alternatively, you can try to use cotainr to build a Singularity/Apptainer container from a conda/pip environment which includes both JAX and PyTorch. On LUMI you can module load LUMI, module load cotainr to get access to cotainr. You may use this (somewhat outdated) PyTorch+ROCm example as a starting point. Modify the conda environment YAML file to your needs, i.e. include the torch, jax, and jaxlib-rocm pip packages. You probably want to use a ROCm 5.3 based version for best compatibility with LUMI, i.e. use --extra-index-url https://download.pytorch.org/whl/rocm5.3. I have only tested PyTorch and JAX independently using this approach, so I can't promise it will work when both are added to the same container. If you try this and it doesn't work, please submit a ticket to the LUMI user support team via https://lumi-supercomputer.eu/user-support/need-help/.
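
    A rough sketch of the first approach (a virtual environment in a host directory, used inside the JAX container; the image name and package versions are placeholders):

        # start a shell in the JAX container, binding the current directory
        singularity shell -B $PWD:/workdir my-jax-rocm-container.sif
        # inside the container:
        python -m venv --system-site-packages /workdir/myenv
        source /workdir/myenv/bin/activate
        pip install torch --extra-index-url https://download.pytorch.org/whl/rocm5.3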

  16. Is there any plan on adding docker/podman in addition to singularity/apptainer?

    Answer: Docker and podman are not supported on LUMI. But containers can be converted to singularity containers using the singularity build command.

    Docker in particular requires rights that will never be handed to regular users. It can only be run safely with respect to other users on cloud environments that also use virtualisation so that they can safely give certain root rights to users. Not sure about podman but (1) LUMI is not a container cloud so containers are not our primary focus and (2) we already have our hands full supporting what we have. Containers are very hard to support on LUMI as people wrongly assume containers are portable. They are not: they interface with hardware, kernel and kernel extensions and drivers and if those are different from those assumed when building the container, the container may fail.
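
    As an illustration of the singularity build conversion mentioned above (the image name is only an example):

        singularity build pytorch_rocm.sif docker://rocm/pytorch:latest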

  17. Not about containers, but I am experiencing very long queue times for my jobs submitted to the standard CPU partition, like over 15 hours for a job that should only run for about 1 hour. Is this normal?

    Answer Yes, LUMI is getting a bit busy for multiple reasons. One is that we are currently suffering from a lot of broken nodes that are waiting for repair, so the size of LUMI-C is reduced at the moment. Another one is likely that the machine was strongly oversubscribed as users were not using their compute time (so more projects were allocated to compensate), while now people with projects close to terminating start to compute a lot. Moreover, if you are submitting a very large job in terms of number of nodes, the scheduler has to collect those nodes, which takes time. On the other hand, while collecting such nodes the scheduler will start short jobs that need fewer nodes if it expects that it will not need those nodes shortly, as it will still take too much time to gather the nodes for the bigger job. This is called backfill.

  18. Q: I had issues joining the zoom, where will the recorded presentation be uploaded ? Thanks.
    Answer Only the presentation itself, and only if recording it did not fail due to Zoom problems. It will likely appear somewhere on the LUMI training materials web site.

  19. Q: I notice that aws-ofi-rccl automatically replaces the Lmod "PrgEnv-cray/8.3.3" with "cpeGNU/22.08" when I load it. Could you explain this behavior, and whether the GNU environment is necessary to use (also in terms of Python version) when using Singularity containers?

2023-05-24 13:00-13:45 CEST

Presentation

  • LUMI-O (Jørn Dietze)

Questions & Answers

  1. Q: Does the LUMI Support team have any experience with using SLATE on LUMI? (https://icl.utk.edu/slate/) It seems like this was more or less specifically developed with an eye towards modern clusters including Frontier, which is like the big brother of LUMI as I understand it

    Answer: If it is not in the LUMI Software Library we likely haven't. We are a level 1 and 2 support team, not a level 3 support team, nor are we researchers in HPC, so we don't get nor have time to experiment with such libraries beyond trying to install them if they are requested more often.

  2. Q: Is there a way in .bashrc to distinguish between a terminal I opened and a slurm job being submitted?

    Answer: You can check for the existence of some SLURM-specific variables. I check for SLURM_JOB_ID to change my prompt so that if I am in an salloc session I can see so from the prompt. I'm not sure what would be the best way to distinguish between an salloc session and a batch session.
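
    A minimal sketch of such a check in ~/.bashrc:

        # change the prompt when the shell runs inside a Slurm allocation
        if [ -n "$SLURM_JOB_ID" ]; then
            PS1="[job $SLURM_JOB_ID] $PS1"
        fi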

  3. Q: We have been advised to store, for example for ML training, data to /scratch on LUMI. Should LUMI-O be preferred, or not?

    Answer: Depends on your software and dataset. Scratch can certainly get a higher bandwidth but only on large files. And LUMI-O is not mounted as regular storage but requires explicit support from your code to get the data, e.g., using toolboxes that can talk the S3 protocol.

  4. Q: What are typical use cases for LUMI-O?

    Answer: I wish we knew ourselves as we have hardly had any training at all ourselves on object storage. But definitely:

    • Longer-term storage for your project for data that is not always needed. But LUMI-O storage will also disappear 3 months after the end of your project! It is billed at half the rate of storage space in /project and /scratch and is permanent for the duration of the project just as /project.
    • Uploading data to LUMI can be speedier via LUMI-O. It just seems that rclone is a lot better at uploading over high-latency connections than sftp.
    • Some software can talk directly to object storage and use data from that storage. If all the promotion about object storage is right, then this could be very good if you have lots of small files (these would then be objects in the object storage). I have seen bioinformatics software that can directly access object storage.
  5. Q: What software is running on the backend? (MinIO/ Ceph Rados Gateway?)

    Answer: Again, we don't know ourselves as we didn't get much information ourselves. But definitely Ceph. It is based on what CSC did for the Allas object storage system but does not have all features of Allas.

  6. Q: Could you show us how to upload data to LUMI from a private S3 bucket?

    Answer: I don't know what is the best way to go directly from one object storage system to another.

    The web site that you use to create the key for use from LUMI also has a field where you can get snippets of configuration files to access the LUMI-O storage from outside LUMI, which you can then put in your local configuration files. Names are different though: to access LUMI-O via rclone from your local PC the remotes will be named lumi-46XXXXXXX-o and lumi-46XXXXXXX-pub. The advantage of this is that you can actually have configurations for different projects simultaneously.
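
    For example, with a remote configured from such a snippet on your local machine (the bucket and directory names are placeholders):

        # upload a local directory to a bucket on LUMI-O
        rclone copy ./mydata lumi-46XXXXXXX-o:my-bucket/mydata
        # list the contents of the bucket
        rclone ls lumi-46XXXXXXX-o:my-bucket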

  7. Q: Quotas on LUMI-O vs LUMI-P/LUMI-F. Do I need to apply for LUMI-O quota specifically, or is it "automatic" (but with different "cost" (TBhrs))

    Answer: Every project gets storage on LUMI-O automatically. And just as for /scratch, /project and /appl, only what you actually use at a given time is billed from your storage billing units, at half the rate of /scratch and /project and 5% of the rate for /flash.

  8. Q: What is special about LUMI-O? Object storage? What is the difference from other LUMI parts?

    Answer: Object storage is completely different from regular file systems. Internally it works in a completely different way, and towards the computer it is not mounted as a directory system but via tools and libraries that talk directly to the object file system using, e.g., the S3 protocols.

  9. Q: What is the underlying architecture/implementation of LUMI-O (e.g. Ceph object store or some vendor specific solution)

    Answer: Standard HPE servers with a Ceph setup done by CSC.

  10. Q: What is the latency on the quota data update? It still reports 0 on lumidata.eu after uploading 2GB and waiting a couple of minutes.

    Answer: Quota reporting is still broken

  11. Q: Is the portal on lumidata.eu based on some open source project or is it customed made for LUMI?

    Answer: I guess so, but we don't have information about what is used, nor access to the servers to look around. Very few people have access to those components of LUMI, to keep them as secure as possible.

  12. Q: Does using LUMI-O consume the TiB hours allocated to the project?

    Answer: Yes, at half the rate of /scratch and /project.

  13. Q: Out of curiosity: If you don't know about the architecture, then who does know?

    Answer: Some people at CSC. There is a strict wall between sysadmins and support team and we only see the application-oriented and user-oriented side of the system but have no access to system management etc. We basically work from a regular userid without elevated privileges though we are member of one (or soon two) projects whose files are visible to every user. The software stack and soon some training materials are offered to users through regular projects.

  14. Object storage? What is the difference between LUMI-O and other LUMI parts?

  15. What are the use cases? Real-world use cases for object storage on LUMI?

    See question 4.

  16. Q: So you recommended to store small files there instead of /scratch. Does the same apply for large files? Thanks!

    Answer: For large files Lustre will be much faster, especially if your code uses MPI I/O under the hood and if tuning parameters such as the chunk size and number of object servers used (see our courses!) are set properly. We've seen cases where Lustre was >100 times faster than the fastest result one of my colleagues got from LUMI-O.

  17. Q: have you answered 15th ?

  18. Q:THANKS

  19. are there Quantum tutorials/lessons on LUMI?

    • I am aware of at least one: TREX Quantum Monte Carlo for chemistry on LUMI

    Yes, but what about Quantum applications on LUMI?

    • Quantum Chemistry or Quantum Computing?

      And it is not our task to provide trainings in specific research domains. Research communities should do that.

  20. Q:Are there more PyTorch tutorials/trainings apart from https://docs.lumi-supercomputer.eu/software/packages/pytorch/

    Answer: Same as in the previous question, user communities are responsible for such trainings in the EuroHPC model, not us. We are way too small to have expertise in all specific research domains or all applications. Even if you would only take the major applications it is still way too much.

    If you'd ask each research domain that could use LUMI well which 2 or 3 codes in their domain we should offer trainings for, you'd end up with very different opinions about the most important packages and a list of 100 or more packages.

  21. Q: Maybe not so much a question: I installed SLATE using Spack earlier this week, but I've been struggling to figure out the right environment to use when including SLATE in my C++ code. I suppose the best might be to write a ticket to support if I get completely stuck?

    Answer: The answer will depend on the options that you used to install the software with Spack. Basically you will have to recreate the environment that Spack used. And that starts with the compiler for the CPU parts that was used as that will determine which compiler module should be loaded. After that there is no choice anymore for the right MPI library as there is only one per compiler (at least if you take the default versions). For hip you'll need the amd module if all other code was also compiled with the AMD ROCm compilers, or rocm if the other code was compiled with the GNU or Cray compilers. One of these also needs to be loaded to have proper GPU support in MPI which I would expect to be essential for SLATE.

    The Spack setup tries to work moduleless and calls the compilers directly adding options itself, but that is indeed an extra complication afterwards if you want to manually compile additional code that uses the Spack-installed software.

2023-01-04 13:00-13:45 CET

Updates

  • Extended beta testing of LUMI-G until acceptance
  • Pilot projects end this Monday Jan. 9
  • Training
    • Introduction to LUMI-G hardware and programming environment - Jan. 11 (fully booked)
    • 4-day LUMI training Feb. 14-17

Questions & Answers

  1. Is it possible to access the LUMI filesystem through sshfs? I tried this with version 3.7.1 with the use of the ssh key pair, but the connection was refused.
    Answer:

    I can't reproduce the problem with the same SSHFS version. Can you please send a ticket so that we can look in more details into the issue?

  2. Shortened question: Was it an intentional design decision to set up Lumi-G nodes with more GPU memory (512 GB) than system memory (480 GB accessible to users, minus TMP etc.)?
    Full issue description:
    Lumi-G nodes have 512 GB system memory of which 480 GB can be requested by the user. The four MI-250X have 4*128 GB = 512 GB GPU memory together. I am running a code where data is read in in the CPU code, and then transferred to the GPUs to be worked on in parallel. The result are some thousand large matrices that are returned to the CPU code for some postprocessing, and then a smaller amount of data is written to disk. My full problem set needs about 1 TB of memory on the GPUs, but it can be easily split up to fit into two calls to the GPUs. But actually, I have to split the problem into three calls so that I don’t run out of system memory when transferring the matrices back to the CPU, considering that the read in data and TMP reduce the 480 GB available memory further. Three calls to the GPUs cause some overhead that I could avoid if I was able to make full use of the GPU memory at once.
    Long Question: Is this a very unique challenge that nobody else runs into, or is the amount of system vs. GPU memory something that would have been considered in depth in the design phase of LUMI? It seems like it would be rather cheap to design the nodes with at least some more system than GPU memory, compared to the opposite.

    Answer:

    The main reason only 480 GB RAM is available is that some memory is, in practice, reserved for the operating system on the node. You can never really use 100% of the node memory for your application without using swapping (and getting really bad performance as a result). Regarding the node design: it would have been really expensive to expand the memory further on the GPU nodes. The nodes only have 1 CPU socket and 8 DIMM slots, so they are already using 64 GiB DIMMs to reach 512 GiB capacity. The next step would have been 128 GiB DDR4 DIMMs. They do exist but are very expensive and would likely have run outside the specifications of the DDR4 standard. Note that LUMI does have some CPU-only nodes with 4 TB of RAM (2 TB per socket) and these are built with 128 GB DIMMs, but that comes with a bandwidth penalty likely not only caused by using two physical DIMMs per channel but also by the structure of a single physical DIMM itself, which has more ranks per DIMM than usual.

    In general, distributing a data set over several compute nodes is always a tricky problem. I would not necessarily look at it as a failure and/or suboptimal resource use if you only use e.g. 120 GiB out of 128 GiB of available GPU memory, as in your case. It is also often the case that artificially increasing the data set, or array size, to fit the best data distribution over several nodes (e.g. by avoiding prime factors in the array dimensions) leads to increased performance.

    Also note that your analysis of having 4x128 GB on the GPUs is not entirely correct. The MI250 GPU is a dual die design with each die having 64 GB, and there is already a big bandwidth restriction between two dies. Which is also why LUMI reports 8 GPUs instead of 4. The GPUs themselves should be treated as a NUMA system with 8 GPUs with 64GB each and very severe bandwidth restrictions between those rather than as a 4x128GB system. So the data distribution should be done with this in mind, i.e. targeting max 64 GiB per "chunk" of data.

    LUMI may not be the best machine for your research. LUMI is designed as a GPU-first system where the CPU plays a secondary role. In fact, the SlingShot interconnect cards are directly connected to the GPU and not to the CPU and have better access to GPU memory than to CPU memory. LUMI is also designed in the first place for distributed memory applications while your research seems to rather require a large shared memory system. Systems like the NVIDIA DGX series may be a better match for you. Certainly the DGX H100 offers more bandwidth between GPUs than LUMI can offer and both the A100 and H100 offer a larger CPU memory size (2TB). For 1TB of GPU memory in a shared memory fashion and without using multiple nodes you'll likely have to wait for one or two more generations of GPU systems.

  3. Is an update to ROCm version planned near future? The current version is outdated for recent PyTorch builds (PyTorch 2.0 is on its way).

    Answer:

    Some day it will be updated, and we are looking into whether we can do an install of 5.3 outside system space, which still seems to work with the drivers we have on the system. But we don't know when, for various reasons. If "near future" means the next few weeks then don't hold your breath.

    1. The system is still in the hands of HPE as it has not yet been accepted, so they decide about the configuration, which is currently what they need for their benchmarks.
    2. Even when the system is in our hands, any ROCm version that requires a newer release of the drivers than we have will have to wait for an update of the OS. HPE Cray have an OS update ready, but after our experiences with the previous update, where we were among the first to install it, we are not in a hurry to install this one.
    3. It is also work that the sysadmins have to do, not the support team, and they are currently even more overloaded than LUST is. Right now it is more important to get the system stable than to start changing things before fully understanding its current state.
    4. LUMI is a multi-user system. If newer drivers break older versions of ROCm, users with software that still relies on them have as much reason to complain as you have now. Not every software developer can keep up with the rapidly appearing ROCm versions, and reproducibility requirements, which are becoming important in modern science, also require that you can use the same binaries for a while.
    5. ROCm has a number of hard-coded paths which make it difficult to have multiple fully working ROCm versions on the system. That is why we are looking into whether we can compile the ROCm stack ourselves, but this is a highly non-trivial undertaking.

    If ROCm 5.3 is all you need: users have reported success installing it, together with the software that uses it, in a container, which also solves the problems with the paths as you then have proper control over /opt/rocm. We've had at least one pilot project that used PyTorch that way.
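
    As an illustration, here is a minimal, hypothetical sketch of that container approach (the image tag, project number and partition are placeholders, whether the --rocm flag is available depends on the singularity version, and this is not an officially supported recipe):

    ```bash
    # Pull a ROCm build of PyTorch as a singularity image; /opt/rocm then lives
    # inside the container, avoiding the hard-coded path issues mentioned above.
    singularity pull pytorch_rocm.sif docker://rocm/pytorch:latest

    # Quick GPU check on a LUMI-G node; --rocm asks singularity to expose the
    # host GPU devices and driver libraries inside the container.
    srun --account=project_465000XXX --partition=small-g --gpus=1 --time=00:10:00 \
        singularity exec --rocm pytorch_rocm.sif \
        python3 -c "import torch; print(torch.cuda.is_available())"
    ```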

  4. For containers, would apptainer be available on LUMI instead of singularity?

    Answer:

    We take whatever comes with the OS and is supported by HPE Cray.

    It looks like only one version of singularity can be installed at a time, and since apptainer is so close to singularity, installing apptainer next to singularity may well also be impossible.

    If the version of apptainer supported by Spack works on LUMI, you could always try that one.
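
    If you want to experiment with that route, a rough, untested sketch via Spack on LUMI (the module name and whether the apptainer package builds cleanly here are assumptions):

    ```bash
    # Rough, untested sketch: build apptainer in your own space through Spack.
    module load spack          # module name assumed; see the Spack instructions in the LUMI docs
    spack install apptainer    # apptainer is a regular Spack package
    spack load apptainer
    apptainer --version
    ```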

  5. I am planning to attend the LUMI-G training next Wednesday. How do I log in correctly on the HPC?

    Answer:

    The invitation and instructions on joining the training project will be sent by email tomorrow to the people who registered.

    Note that the training was also announced as a training for people who already have some experience with the HPE Cray environment, and preferably with LUMI, as the one-day training cannot go over all introductory details; that's why we have the four-day course in February or March (likely February, as was said). So it would be better to have a look at the LUMI documentation before the training.

  6. Does HIP work on LUMI-G? We have some software formerly developed for NVIDIA and would like to port it to LUMI.

    Answer:

    Yes. It is one of the two main programming models on LUMI (with OpenMP offload).
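
    As a rough sketch of what the porting workflow can look like (module names, partition and account are assumptions; gfx90a is the architecture of the MI250X):

    ```bash
    # Hedged sketch: translate CUDA sources to HIP and build them for the MI250X.
    module load CrayEnv rocm                      # module names assumed
    hipify-perl my_kernel.cu > my_kernel.hip.cpp  # rewrites CUDA API calls to their HIP equivalents
    hipcc --offload-arch=gfx90a -O2 my_kernel.hip.cpp -o my_kernel
    srun --account=project_465000XXX --partition=small-g --gpus=1 ./my_kernel
    ```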

  7. Q: I have successfully tested GROMACS 2022.3 (GPU version) as a singularity container from AMD. Is there an EasyBuild recipe for the GPU version? I have found only CPU ones.

    Answer:

    There is a draft version here: https://github.com/Lumi-supercomputer/LUMI-EasyBuild-contrib/blob/gromacs-gpu/easybuild/easyconfigs/g/GROMACS/GROMACS-2022-cpeAMD-22.08-GPU.eb

    Please be advised that it will likely not give similar performance, as it is (1) based on SYCL rather than HIP directly and (2) uses MPI for multi-node runs instead of thread-MPI, which works on a single node only.

    The other thing is that the AMD container uses a HIP port of GROMACS which the GROMACS developers do not want to support. They have chosen to use hipSYCL instead. LUST is currently experimenting with the SYCL version.
    AG: Thanks.

  8. Q: What is the chance of me and my colleague getting into the Jan 11 LUMI-G training when we have been on the waiting list since Monday? It would be extremely helpful for our project to be able to participate.

    Answer:

    There is a chance, but I don't know yet how many users are on the waiting list, and we have to see how many will write that they are not joining.
    I'm looking into the size of the waiting list and will send out the invitation emails with the course information tomorrow.

2023-01-25 13:00 13:45 CET

Updates

Questions & Answers

  1. Q: The nodes do not have internet access right now. Is there a plan to enable that? It would be very useful to be able to log job progress directly from the node to a cloud service.

    Answer: It is still not complete for the C nodes. It will be working once we get to rebooting all C nodes (hopefully this week). The G nodes should all be OK.

  2. Q: Is there a possibility to use Midnight Commander (or some other file manager) on LUMI?

    Answer: It is possible to use mc. But I guess you are asking whether we can install it for you? There is currently no module available unless you load the spack module first, but you can use the mc binary directly: /appl/lumi/spack/22.08/0.19.0/opt/spack/mc-4.8.28-fsariox/bin/mc.
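
    For convenience you can point an alias at that binary, for example:

    ```bash
    # Use the Spack-built mc binary directly; add the alias to ~/.bashrc if you use it often.
    alias mc='/appl/lumi/spack/22.08/0.19.0/opt/spack/mc-4.8.28-fsariox/bin/mc'
    mc
    ```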

  3. Q: MATLAB is installed locally through a login node, but batch job submission failed due to a license problem; is there a way to solve the issue?

    Answer: MATLAB is provided by LUMI for the visualization nodes. Users should connect to their own license server if they really want to use it on the compute nodes; we do not have enough licenses to have users consuming multiple licenses simultaneously.

    The license server currently cannot be contacted from the compute nodes because of the lacking internet access; see the first question for that.

  4. Q: I am trying to run my PyTorch project on LUMI, and for that I need to compile some CUDA code (mask2former ops). I have a working Tykky environment for Puhti but I'm not sure whether it's possible to compile mask2former for AMD GPUs. I need some assistance in porting my setup from Puhti to LUMI.

    Answer: If it is pure Python code that runs on top of PyTorch, it will likely run, as PyTorch is ported. But if it contains CUDA kernels in one way or another, then it will not run, as CUDA is only supported on NVIDIA GPUs. CUDA is an NVIDIA proprietary model; if you use code that needs CUDA, your only options are either to port the code or to look for a system with NVIDIA GPUs.
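
    A quick, hedged way to check what you are dealing with (the directory name is taken from the question; the Python environment is whatever PyTorch setup you end up using):

    ```bash
    # Any .cu/.cuh sources mean custom CUDA kernels that would need porting to HIP.
    find mask2former -name '*.cu' -o -name '*.cuh'
    # A ROCm build of PyTorch reports the AMD GPUs through the regular torch.cuda API.
    python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
    ```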

  5. Q: There is a Slurm extension which reports energy usage of the runs. Is there any plan to enable this extension? It would be nice to see how much energy is consumed for each run. (Link to slurm extension: https://slurm.schedmd.com/cray.html#admin_guide)

    Answer: Not clear if that can come soon. So far a standard Slurm installation that comes directly from HPE is used, and sysadmins are not really willing to install additional plugins in that.

  6. Q: Are there some obvious performance pitfalls for PyTorch on LUMI-G nodes? I did a very quick benchmark of one of my models on a single MI250X GPU on LUMI vs a single NVIDIA A100 GPU. According to the pure specs the MI250X should have higher performance, but in my case the A100 was noticeably faster (3 minutes vs 4 minutes for 10 epochs). I haven't had time yet to look into it much myself, so I am wondering if others have also noticed something like this or whether there is an issue with my code.

    Answer: Are you comparing one MI250X GCD vs an A100? The specs can be deceiving. Yes, from a compute-power perspective you might expect the MI250X to be faster, but you also have to consider the rest of the specification, and in particular the memory bandwidth. The per-GCD memory bandwidth of the MI250X is 80% of the bandwidth of the A100.
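
    (Roughly, going by the public datasheet numbers: the MI250X package has about 3.2 TB/s of HBM bandwidth, i.e. about 1.6 TB/s per GCD, while an 80 GB A100 offers about 2.0 TB/s, and 1.6 / 2.0 ≈ 80%.)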

  7. Q: EGL / off-screen rendering on LUMI-G. (I've sent a ticket, but not received even the number for it. Title: "EGL for mujoco/dm_control")
    In this context: https://github.com/deepmind/dm_control#rendering
    But the main question is: "Does AMD/MI200 support off-screen rendering at all?"

    Answer: MI2XX are compute GPUs. They cannot be used for rendering; in fact, everything related to rendering has been removed from these cards. LUMI-D will be available in the future for such use cases (NVIDIA A40).

    • Rendering is part of the training loop of the neural networks: one generates images for the neural network. Certainly this doesn't work out through LUMI-D.
    • Can you be a bit more explicit about "removed from these cards"? Do you know any more technical details?

    AMD pro render:

  8. Q: I'm using COMSOL Multiphysics on LUMI with an Aalto license. Is there any possibility that LUMI will have its own COMSOL licenses for the users?

    Answer: The basic principle for software that is not of sufficiently broad interest to a large user community is that users bring their own license. We do have a software budget, but it is extremely small compared to everything for which users would like us to pay the licenses. We've had requests that could easily consume our whole budget with a single package.

2023-02-22 13:00 13:45 CET

Updates

  • Enforcement of Lustre quotas (storage & max files) and storage billing units from March 6.
  • LUMI system update most likely in March. LUMI users and RAs will be informed as soon as dates are confirmed.
Trainings

Questions & Answers

  1. Q: How to use a container on LUMI-G with array jobs? Could you show some example scripts?

    • Answer: This question is very vague. What exactly do you mean? Starting software in an array job is no different from starting software in a regular job; a minimal sketch of an array job script using a container is given below.
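
      A minimal, hypothetical sketch of such an array job (account, partition, image and input naming are placeholders, not a recommendation):

      ```bash
      #!/bin/bash
      #SBATCH --account=project_465000XXX
      #SBATCH --partition=small-g
      #SBATCH --gpus-per-node=1
      #SBATCH --time=00:30:00
      #SBATCH --array=1-10

      # Each array task picks its own input file based on its task ID;
      # the container is started exactly as it would be in a non-array job.
      srun singularity exec --rocm my_image.sif \
          python3 my_script.py --input "input_${SLURM_ARRAY_TASK_ID}.dat"
      ```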
  2. Q: How to use GUI/rendering on LUMI-D?

    • Via lumi-vnc (VNC server). And it may be best for now to install software in a container. We're working on a lumi-vnc version with better support for vglrun as there seem to be some problems with it at the moment.

      As with all software in the central software stack, I suggest having a look at the help of the module (module help lumi-vnc); in some cases there is also more information in the LUMI Software Library.

      LUMI-G is a much larger part of the investment in LUMI so getting that partition to be used optimally has a higher priority.
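
      A minimal sketch based on the advice above:

      ```bash
      # Load the VNC module and read its help text, which explains how to start
      # the VNC server and how to connect to it (e.g. through an SSH tunnel).
      module load lumi-vnc
      module help lumi-vnc
      ```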

  3. Q: Which network interface should be used for inter-node communication? I assume not any of the hsn* interfaces, as those seem to be used for the GPU interconnect, but there is only one other interface besides that (nmn) and that doesn't appear to work (I get back "no route to host").
    Answer:

    • Only the HSN adapters should be used! Those are the ones that connect to the Slingshot fabric, regardless of whether the data comes from the GPUs or not.
  4. Q: Is there already some experience with setting up a WRF-Chem installation on LUMI? I have seen that WRF is available via EasyBuild. Is it possible to set some switches during installation in order to install WRF-Chem specifically?
    Answer: WRF-4.3.3-cpeGNU-22.08-chem.eb has WRF_CHEM=1 set.

  5. Q: How do you use EasyBuild correctly?

    • Answer: Start from the LUMI documentation: https://docs.lumi-supercomputer.eu/software/installing/easybuild/ (a hedged sketch of the basic workflow is given at the end of this item).
      Is there something more specific you'd like to have clarified?

    • Q: Does it work in the sense that software packages can be installed and removed again easily, like eb -remove PACKAGE?

    • A: There is no remove operation in EasyBuild, as it is very difficult to do correctly. A package may be a dependency of another package, so removing it may break other packages. On the other hand, removing a package may leave the packages it depends on unused, so should those then also be removed? It is actually impossible to do the complete analysis in an automated way, as even the module system cannot detect all dependencies: manual module use statements may add software that depends on what is already installed but is not found by the module system.

      The talk about software in the last LUMI course can be found here.
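
      As referenced above, a hedged sketch of the user-level workflow described in the LUMI docs (the prefix, stack version and recipe name are placeholders):

      ```bash
      # Optional: put user installations in a project directory shared with your team.
      export EBU_USER_PREFIX=/project/project_465000XXX/EasyBuild
      # Load a LUMI stack, a partition module and the user-level EasyBuild module.
      module load LUMI/22.08 partition/C EasyBuild-user
      eb --search GROMACS                # look for available recipes
      eb SomeRecipe-cpeGNU-22.08.eb -r   # hypothetical recipe name; -r also installs dependencies
      module avail                       # the newly installed modules show up in the same LUMI stack
      ```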

  6. Q: What is the best way to use jupyter-notebook and keep the kernel alive for substantial applications?
    When tunneling jupyter-notebook through a local port, the kernel dies after loading the dataset and executing a training run with TensorFlow. It may be linked to the memory limit.
    I tried tunneling jupyter-notebook from an interactive shell with additional resources (salloc) but it did not solve my issue.
    Answer:

    • You should be able to start a notebook as part of your submission script. Make sure it listens on the IP 0.0.0.0 (the --ip option of the notebook); then you should be able to connect from the login nodes by checking the target compute node (e.g. nid007566), or tunnel there through the login nodes. For the memory limitation, requesting all the memory on the node with --mem=0 in your submission script might help. A hedged sketch of such a job script and the matching SSH tunnel is given below.
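
      A hedged sketch of what that can look like (account, partition, resources, port and node name are placeholders):

      ```bash
      #!/bin/bash
      #SBATCH --account=project_465000XXX
      #SBATCH --partition=small
      #SBATCH --ntasks=1
      #SBATCH --cpus-per-task=16
      #SBATCH --mem=0                 # ask for all memory on the node
      #SBATCH --time=02:00:00

      # Listen on all interfaces so the notebook is reachable from the login nodes.
      jupyter-notebook --no-browser --ip=0.0.0.0 --port=8888
      ```

      Then, from your own machine, tunnel through a login node to the compute node the job runs on (squeue --me shows it, e.g. nid007566) and open http://localhost:8888 with the token printed in the Slurm output file:

      ```bash
      ssh -N -L 8888:nid007566:8888 yourusername@lumi.csc.fi
      ```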

2023-03-29 13:00 13:45 CET

Updates

  • LUMI system upgrade is progressing well.

Trainings

  • 2023-04-13 Profiling course
  • 2023-05-09 LUMI Intro Course
  • 2023-05-16 LUMI Intro Course
  • 2023-05-30 - 2023-06-02 General LUMI Training (hybrid, Tallinn)
  • 2023-10-03 - 2023-10-06 General LUMI Training (hybrid, Warsaw)

Questions & Answers

  1. Q: Is there any way to predict how the update will affect EasyBuild-installed software (I am thinking about VASP)?

    Answer: Everything built in LUMI/22.08 will hopefully still work, and we have a trick with LD_LIBRARY_PATH to easily get the libraries that really came with that version of the PE. For 21.12 and 22.06: the modules will likely be hidden, and we will use an Lmod feature to map the missing compiler modules onto newer ones. However, users who really want to run will still be able to run (that trick did work after the previous update, when 21.08 was removed from the system). We'll likely remove all remains of 21.08 though, to clean up a bit.

  2. Q: My users are experiencing trouble connecting through SSH keys during the update, but this will not be a problem afterwards, right?

    Answer: It shouldn't be. But if you get messages about host keys that have changed, that means that the node is still booting and not yet meant to be up for regular users. There is an interval in which the maintenance message is no longer shown but the system is not ready yet.

2023-04-26 13:00 13:45 CEST

Questions & Answers

  1. Q: Has anyone successfully run AlphaFold on LUMI-G?
    Answer: Sam (AMD): not yet, but I am gathering information on that. It has been successfully run on Instinct GPUs at other sites.

  2. Q: So should I try re-uploading a new SSH key? I'm a user from Poland and I have access through the PLGrid/Puhuri portal. I understand, thank you for your help. I will send a ticket in the following days.

    Answer: There are no passwords on LUMI, only SSH keys, so it means something is wrong with the ssh key.

    You should also check on the client side that you are using the right key. And if you are a user from Finland: did you accept the conditions for LUMI? Before doing that you have no access.

    It is better to send in a ticket so that we have the necessary information to check whether the key was indeed entered correctly. You should use MyAccessID to upload the key, with the same credentials that you used when accepting the LUMI invite. But there are so many possible causes that we cannot just answer this here without more data.

End of archive