# Archive of LUMI Coffee Break 2024

This is the archive document of the past sessions of the public LUMI coffee break.

The archived questions and answers of the 2023 sessions can be found here: https://hackmd.io/@lust/coffeearchive2023
The archived questions and answers of the 2022 sessions can be found here: https://hackmd.io/@lust/coffeearchive2022

:::info
Link to document for the __next coffee break__: https://md.sigma2.no/lust-coffee-break
:::

[TOC]

## Past sessions:

### Special after upgrade seminar: 2024-10-02 13:00 -- 14:00 CEST

#### LUMI Updates

- Trainings
    - Advanced 4-day LUMI course -- 28.-31.10. in Amsterdam
        - Course on compiling and using software, programming models (HIP and OpenMP offload), porting, executing jobs, and optimizing applications to run on AMD MI250X
        - Registration: https://www.lumi-supercomputer.eu/events/advanced-lumi-course-2024/
    - Moving your AI training jobs to LUMI: A Hands-On Workshop -- 26.-27.11. in Ostrava
        - Workshop on efficient usage of LUMI-G for AI workloads and on scaling from one to multiple GPUs
        - Registration: https://www.lumi-supercomputer.eu/events/lumi-ai-workshop-nov2024/
- Training material (slides, exercises and recordings)
    - 2-day Getting started with AI on LUMI workshop: https://lumi-supercomputer.github.io/LUMI-training-materials/ai-20240529/
    - 2-day Performance analysis and optimization: https://lumi-supercomputer.github.io/LUMI-training-materials/paow-20240611/

#### Questions & Answers

1. Any knowledge when EasyBuild recipes for PyTorch with ROCm 6 are coming?
    - Software like PyTorch should be installed in containers, as that software consists of so many small files, which is not a good match for Lustre.
    - Note that we have our own EasyBuild recipes and don't support those that come with EasyBuild. Intel gives multiple problems on LUMI and AMD, and the foss toolchain has an MPI implementation that is hard to get to work well on the LUMI CPU nodes and near impossible on the GPU nodes (apart from the fact that EasyBuild is still working on ROCm support).
    - They will not come. We offer PyTorch via containers using a build that AMD says is appropriate for LUMI and don't have the personpower to create our own builds or offer all software in many different ways. Moreover, as said above, PyTorch belongs in a container and not on the Lustre file system due to the performance issues caused by the many small files in most Python installations. You're talking about a multiple person-month effort for a single version of a single package, judging by how long it takes the current maintainer in EasyBuild to keep track of PyTorch with the GNU compiler for NVIDIA.
    - There is probably a misunderstanding in the question. When are the CONTAINERS used in EasyBuild recipes such as https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/PyTorch-2.2.2-rocm-5.6.1-python-3.10-vllm-0.4.0.post1-singularity-20240617/ coming, which can be extended and used according to the needs of the Lustre file system?
        - So far Kurt Lust made those recipes. It was a personal idea of his, as the people teaching the AI course preferred a different approach. But Kurt is rather busy this month with the Hackathon in Brussels and the course in Amsterdam, so no promises when he will have time to look into it, unfortunately. Good to know that these modules are actually appreciated; we will try to put some time into it.
2. We observe about a 30% slowdown in GPU MPI jobs compared to the previous LUMI system. Is this expected? Now we use CC; previously we used hipcc, but we were not able to make it work after the update.
    - No. Most people report equal speed to a modest increase for GPU software.
    - Did you have the rocm module loaded? With the AMD compilers (amd/6.0.3 module now, amd/5.2.3 before the update) you didn't need to load the rocm module, but now you do for some ROCm functionality and GPU-aware MPI. That could in fact explain why hipcc didn't work (a small compile sketch is given below this question).
    - I've observed this same behaviour and already reported it; I find a 50% slowdown with ELPA.
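    A minimal, hedged sketch of what a GPU-aware MPI + HIP build could look like after the upgrade; the programming environment, target module and flags below are assumptions and need to be adapted to your application:
    ```bash
    # assumptions: PrgEnv-amd and the gfx90a target; hipcc is an alternative to the CC wrapper
    module load CrayEnv
    module load PrgEnv-amd craype-accel-amd-gfx90a rocm   # rocm now has to be loaded explicitly
    export MPICH_GPU_SUPPORT_ENABLED=1                    # enable GPU-aware Cray MPICH at run time
    CC -x hip -o myapp main.cpp
    ```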
3. Are you planning to build software that so far is only present as an EasyBuild recipe? E.g. ParaView: it is a long build, and it would be easier if "normal" prebuilt modules were provided for it.
    - ParaView is offered as a container in the web interface to LUMI, or at least, will be again as soon as the NVIDIA visualisation nodes are operational again.
    - We don't provide pre-built modules for software that is hard for us to test or that may need different configurations to please everybody. A central installation is problematic to manage, as you can only add to it and not remove from it, since you don't know when people are using a package. So stuff that is broken for some but works for others sticks on the system forever. We follow the approach that is used more and more in the big USA centres as well, i.e., not much centrally installed software, for more flexibility in managing the stack and repairing broken installations. After all, if a package is in a project, the members of the project only need to talk to each other to find out when it can be updated.
    - I see and it makes sense. Thanks!
4. Is there an estimate when ROCm 6.2 and the new profiling tools will be available on LUMI?
    - No. The person who built these modules has left the team, so someone will have to take over and investigate.
    - I believe ROCm 6.2.1 is available through a custom module.
        - Interesting! Please tell us how to load it.
        - Still in testing. Not yet ready.
        - `module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules`
        - `module load rocm/6.2.1`
5. Will George's slides please be available for download after the webinar?
    - We'll ask him, and a recording should also be available in the next few days.
6. Will modules like singularity-filesystems etc. become available by default, or will we keep having to use `module use /appl/local/training/modules/AI-20240529`?
    - We've already needed several specialised versions of it for users, so no. There is no single "fits all" configuration of this module.
    - Unfortunately, the central software stacks on LUMI have been designed in a way that prevents us from providing these modules as part of those stacks. We are looking at alternative ways to provide something similar, but there is no timeline at this point, unfortunately.
7. We have recently attempted to transition training LLMs from NVIDIA-based supercomputers to LUMI-G. The setup is based around PyTorch, along with some packages compiled from source using hipify and hipcc, wrapped in a Singularity container. However, we observe a slowdown of over 200%, along with increased memory requirements for GPUs. Are there any tips or obvious mistakes users make when managing such transitions? (A100-40GB, bfloat16)
    - You can find training material (recordings, slides) from the last AI workshop here: 2-day Getting started with AI on LUMI workshop: https://lumi-supercomputer.github.io/LUMI-training-materials/ai-20240529/
    - We will have another (updated) AI workshop in November; maybe that might be interesting for you: https://www.lumi-supercomputer.eu/events/lumi-ai-workshop-nov2024/
    - Otherwise you can also open a ticket describing your problem and we will have a look.
    - You may need to review how RCCL is initialized.
        - Thanks!
    - Additional question: Is the material accurate also post update?
        - Most of the material for the AI workshop is still accurate, but some of it needs minor updates, e.g., to versions of modules.
8. Is there training material for porting CUDA kernels to ROCm-compatible ones?
    - https://fs.hlrs.de/projects/par/events/2024/GPU-AMD/day2/07.%20Porting%20Applications%20to%20HIP%20-%20Local.pdf#:~:text=%E2%80%A2%20Hipify-perl%20and%20hipify-clang%20can%20both
9. After the upgrade, the structure below (hip-python) fails with the error `hipErrorInvalidValue`, while it works fine with rocm-5.2.3:
    `Copy = hip.hip_Memcpy2D(**copy_upload)`
    `err = hip.hipMemcpyParam2DAsync(Copy, stream)`
    - Probably we need a reproducer.
    - Is this a runtime error from HIP?
    - I am not sure about hip-python, but what is this command of hipMemcpy2D?
10. What is the method to hand over (large) (collections of) files to the LUMI support team, now that `/tmp/*` is mangled?
    - You can use the LUMI web interface to create a LUMI-O bucket and share it with us; use private buckets only!
11. If we make dirs/files world-readable/writable under /tmp, will LUMI support accept it and overwrite/remove it once they take it over?
    - It really would need a different filesystem structure of users/projects to have a nice solution.

### 2024-09-25 13:00 -- 14:00 CEST

#### LUMI Updates

- Trainings
    - Advanced 4-day LUMI course -- 28.-31.10. in Amsterdam
        - Course on compiling and using software, programming models (HIP and OpenMP offload), porting, executing jobs, and optimizing applications to run on AMD MI250X
        - Registration: https://www.lumi-supercomputer.eu/events/advanced-lumi-course-2024/
    - Moving your AI training jobs to LUMI: A Hands-On Workshop -- 26.-27.11. in Ostrava
        - Registration will be online soon. Check https://www.lumi-supercomputer.eu/events
- Training material (slides, exercises and recordings)
    - 2-day Getting started with AI on LUMI workshop: https://lumi-supercomputer.github.io/LUMI-training-materials/ai-20240529/
    - 2-day Performance analysis and optimization: https://lumi-supercomputer.github.io/LUMI-training-materials/paow-20240611/
- Maintenance
    - After upgrade information webinar: 2.10.

#### Questions & Answers

1. Is there a plan or a possibility in the near future to mount the Sigma2 servers (e.g. NIRD) on LUMI? This would spare a lot of duplication of data in cases where we can afford some delay for reading the files.
    - No, this is not planned and I'm quite sure it will not happen. It is just not practical for performance as well as security reasons.
2. How is the (driver) update schedule for LUMI looking in the future? Will there be more frequent updates than so far, or should we plan to stick with ROCm 6.0-6.1 for a long time? Having at least ROCm 6.1 or newer would be nice (even if it is just AMD CLR 6.1).
    - Probably we will stay on the ROCm 6.0 driver for some time, but we are working on making newer ROCm versions available through modules. Also note that the driver supports 2 minor versions up and down (5.7, 5.8, 6.0, 6.1, 6.2). We will also provide containers (see the sketch below).
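    A hypothetical sketch of trying a newer ROCm user space from one of the container images shipped on LUMI; the image name/version, project and partition below are placeholders, so list the directory first to see what actually exists:
    ```bash
    # the image name/version and project are placeholders -- check what is available
    ls /appl/local/containers/sif-images/
    srun -A project_465000XXX -p small-g --gpus=1 -t 0:10:00 \
        singularity exec /appl/local/containers/sif-images/lumi-rocm-rocm-6.0.3.sif rocminfo
    ```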
3. After the update, the attribute `hipDeviceAttributeDirectManagedMemAccessFromHost` gave a different result. What caused the change in that flag? (Through the AMD CLR and ROCR source code, this change seems to be related to the new `rocminfo` entry `Coherent Host Access`, which is enabled on the compute nodes; but what does it signify?)
    - Sam (AMD): I was not aware of any changes to the attributes. Let me investigate and I'll post my conclusions here.
        - The issue I have is that I don't have a system anymore where I could check what one would read before. I would expect this to be zero, as the management granularity would be a page.
    - (side note from the questioner, if it helps: Frontier also has that attribute enabled on the compute nodes; however not for the MI210 GPUs on the login nodes there)
4. Is there a way to get job notification emails on LUMI? I have tried with `#SBATCH --mail-type=ALL` and my email address such as `#SBATCH --mail-user=user@email.com`, but I do not get any notification emails.
    - No, unfortunately, it is not enabled as there have been problems at CSC before.
    - There are also no plans to enable Slurm email notifications. It is not that suitable with the architecture of LUMI based on isolated services, and there has also been abuse of this functionality on other systems causing email spam.
5. I've noticed that `LUMI/24.03` now includes `PrgEnv-nvhpc/8.5.0` and `PrgEnv-nvidia/8.5.0`, but does not include `PrgEnv-amd/8.5.0`. For the NVHPC / NVIDIA variants, I expect that this is just a small issue and that they're not intended to be there. My question is whether `PrgEnv-amd/8.5.0` is a compiler environment which is (or will be) supported. This may influence our testing for our software and which installations we provide via EasyBuild, for example. It _does_ exist when just logging into LUMI (using CrayEnv).
    - We will check up on that. `PrgEnv-amd` should be available.
    - Actually it is installed but is hidden (it shouldn't be though); you can load it.
        - You're right, `PrgEnv-amd` is simply hidden. `cpeAMD` is actually missing though (on `partition/L`; it's there on `partition/G`).
        - The latter is normal. As it is irrelevant in `partition/L` and `partition/C`, it is not even installed in those partitions.
6. In the changelog for `CCE/17.0.1`, the following change is mentioned: "Added OMPT device tracing support to enable profiling". However, when trying to actually use it, I ran into two issues. First, I was unable to use the OpenMP Tools Interface at all without linking the tool into the application, though I would expect that setting `OMP_TOOL_LIBRARIES` should be sufficient. Secondly, trying to register an event to get events from the GPU failed. What is the state of CCE/17.0.1 in that regard? I know that ROCm has support (with some limitations we've reported to them over the months / years), but we're interested in supporting Cray compilers as well.
    - (Alfio) I will check it. As far as I can see, the functionality is used with perftools (check `man perftools`); the level of support in CCE is still not clear. I suggest opening a ticket with a reproducer and we will follow up.
        - Will do :+1:
7. Will this session be recorded? How can I get the recording if it's available?
    - No, this session will not be recorded, but the answers will be written down here. The presentation of the session next week about the changes to the system will be recorded.
8. When will LUMI/24.03 support lumi-container-wrapper? I find that I cannot load it after loading the module. Also, when running `module spider XXX`, the documentation says to load a few modules, but I still can't load the module I want after following the documentation, for example when loading CMake or Ninja.
    - We installed it last Friday, so it should be available now after loading `LUMI/24.03`. A small usage sketch is given below.
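    A minimal sketch of what using lumi-container-wrapper looks like once the module loads; the environment file name and install prefix are just examples, and the full workflow is in the LUMI documentation:
    ```bash
    module load LUMI/24.03 lumi-container-wrapper
    # wrap a conda environment described in env.yml into a containerised installation
    conda-containerize new --prefix ./my_env env.yml
    export PATH="$PWD/my_env/bin:$PATH"   # the wrapped executables now behave like normal commands
    ```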
9. Hello, I've recently got a LUMI project and I was wondering if there is any way to know the amount of GPU and CPU hours used by every project member.
    - Unfortunately not, we don't have the resources at the moment to develop something like that. You can get some numbers from `sreport`, but it's not straightforward.
10. How can I compile an application with HIP support? Are there any changes to the previous method? It doesn't seem to be working for my application.
    - You should be able to continue using the same tools. What has changed is the location of the header files. There had been a deprecation warning since ROCm 5.3, and in ROCm 6 the deprecated headers were removed. If you can share the exact error you are seeing, we might be able to help more.
11. Re: billing calculations: It seems natural that LUMI would provide a utility into which we feed the intended resources (as specified for Slurm) and what falls out is the number of billing units that will be used up. Is there such a tool yet?
    - It is trickier than it appears. There are more intricacies, as the billing units depend also on the subsystem used and on the exact parameters being used in the script, especially with regard to the core-to-memory ratio; see also https://docs.lumi-supercomputer.eu/runjobs/lumi_env/billing/. The sysadmins have access to all the databases and systems necessary, but we as the support team are restricted (and the admins are way too busy keeping the machine running).

### 2024-06-26 13:00 -- 14:00 CEST

#### LUMI Updates

- Trainings
    - LUMI Hackathon -- 14.-18.10. in Brussels
        - Optimize your code with help of experts from HPE, AMD and LUST
    - Advanced 4-day LUMI course -- 28.-31.10. in Amsterdam & online
        - Course on compiling and using software, programming models (HIP and OpenMP offload), porting, executing jobs, and optimizing applications to run on AMD MI250X
- Training material (slides, exercises and recordings)
    - 2-day Getting started with AI on LUMI workshop: https://lumi-supercomputer.github.io/LUMI-training-materials/ai-20240529/
    - 2-day Performance analysis and optimization: https://lumi-supercomputer.github.io/LUMI-training-materials/paow-20240611/
- Maintenance
    - [Updated information on planned breaks and notifications of issues](https://www.lumi-supercomputer.eu/lumi-service-status/)

#### Questions & Answers

1. Hello, I am Nitik from Aalto. The ticket no. is [LUMI #4441] PyTorch (NVidia and AMD GPU problems). I have built a PyTorch env on LUMI, but I am facing issues when I use it to train the ML model. The error is as follows:
    ```
    terminate called after throwing an instance of 'c10::DistBackendError'
    what(): [Rank 10] NCCL watchdog thread terminated with exception: [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=56, OpType=ALLREDUCE, NumelIn=632976, NumelOut=632976, Timeout(ms)=600000) ran for 600509 milliseconds before timing out.
    Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x15404ac7cea7 in /opt/miniconda3/envs/pytorch/lib/python3.10/site-packages/torch/lib/libc10.so)
    ```
    It should be noted that I am training an ensemble of models in serial in one job using multiple nodes and multiple GPUs; it crashes because of the above-mentioned error and generates a "core" file. Any help would be appreciated. Thank you!
    - What does setting the following environment variables show in the logs?
      `export NCCL_DEBUG=INFO`
      `export NCCL_DEBUG_SUBSYS=INIT,COLL`
2. Hello, I'm Will from UKAEA (UK), just reaching out for some help getting a code running on AMD GPUs (I'm pretty new to this). The code already runs on Frontier in the US, so it "should" be straightforward, but I'm pretty new to GPU, so I'm getting some issues.
    - Could you please specify which code you are trying to use?
        - Sure. It's a code called CGYRO (plasma turbulence code). It is Fortran + ROCm.
          srun -n2 -c1 --gpus-per-task=1 --partition=standard-g --gpu-bind=closest /users/wihornsby/gacode/cgyro/src/cgyro 0
          GTL_DEBUG: [0] hsa_init (in gtlt_hsa_ops.c at line 453): HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
          sys-2 : UNRECOVERABLE error on system request
          No such file or directory
          Encountered during an OPEN of unit 1
          Fortran unit 1 is not connected
    - Try to remove --gpu-bind=closest; does it run?
        - Ah sorry, the error when I run that srun command is:
          srun: error: AssocMaxSubmitJobLimit
          srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
    - Could you specify the project? (-A ...) (you can check your projects via the `lumi-workspaces` command)
        - Ah yes, sorry, forgot that. Now we get (I should note I'm in an interactive session):
          ```
          srun: job 7549378 queued and waiting for resources
          srun: job 7549378 has been allocated resources
          sys-2 : UNRECOVERABLE error on system request
          No such file or directory
          Encountered during an OPEN of unit 1
          Fortran unit 1 is not connected
          srun: error: nid005391: task 0: Exited with exit code 2
          srun: launch/slurm: _step_signal: Terminating StepId=7549378.0
          slurmstepd: error: *** STEP 7549378.0 ON nid005391 CANCELLED AT 2024-06-26T14:16:52 ***
          srun: error: nid005391: task 1: Terminated
          srun: Force Terminated StepId=7549378.0
          ```
    - This is an application error; it cannot find a file. Can you check which file that is? Unit 1 is a bit strange...
        - Yes, seems like it. I'll have to ask the devs, but this is just a standard regression test script they supply, so it "should" work. Thanks for the help!
    - Let me have a look.
        - Found it, it's looking for a file 'input.cgyro.gen', which is present in the directory. So this is strange.
    - Could you open the file? I mean, are the permissions right? You can write a simple program to make sure it works.
        - Seems like it opens in a quick Fortran code I wrote.
    - Uhm... no idea at this point, open a ticket on the LUMI system. I would also try to change the unit (1000?). Not sure what the problem can be.
        - Yeah, I'll have to dig into the source code a bit. Thank you for the help anyway.
    - There is a path parameter which may be set incorrectly. But this will take more time. (Two quick checks are sketched below.)
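    Two quick, generic checks that could help narrow down an open error like the one above (purely illustrative; `input.cgyro.gen` is simply the file name mentioned in the conversation):
    ```bash
    # is the input file there and readable from the directory the job runs in?
    ls -l input.cgyro.gen
    # do the srun tasks actually start in the directory you expect?
    srun -n 1 pwd
    ```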
3. Hello! I'm Feliks from Aalto University. I've been using IBM quantum computers in my research and would now like to try Helmi. Earlier this month I was at the Introduction to quantum computing and FiQCI course to learn how to use LUMI. However, that course used a predefined course environment with all the necessary Python libraries etc. Now I'm struggling with how to access the Python libraries I need when trying to run quantum circuits on Helmi. My main question is what's the recommended way to run a notebook in JupyterLab on LUMI so that I can access Qiskit + any other Python libraries I might need. Also, do I need to run the notebook on LUMI in order to access Helmi, or is there a way to run it locally on my laptop and still access Helmi, like I can do with IBM machines?
    - ~~How to run Jupyter on LUMI and connect to it via SSH from your local machine (running Jupyter there in the browser): https://stackoverflow.com/questions/69244218/how-to-run-a-jupyter-notebook-through-a-remote-server-on-local-machine~~
    - Docs on how to run Qiskit on LUMI/Helmi: https://docs.csc.fi/computing/quantum-computing/helmi/running-on-helmi/
    - Course page on how to extend environments (also have a look at the rest of the course): https://lumi-supercomputer.github.io/LUMI-training-materials/ai-20240529/extra_07_VirtualEnvironments/
4. Hi there. I am Fabian from the Technical University of Denmark. I am running a Computer Vision task using the following image: PyTorch/2.1.0-rocm-5.6.1-python-3.10-singularity-20231123. I have a DDP setup on a single node, with at least 2 GPUs. When profiling my code, it seems like the DataLoader is the clear bottleneck. The data I am loading is quite heavy, but I am also wondering if this is related to the CPU/GPU bindings or the modules that I am loading when submitting the job. I hoped that one of you could have a brief look at my code and tell me if I am doing it correctly. I am loading the data into cache, so the loading time goes down significantly during training. However, it still takes around 40% of the step time. Maybe it's an entirely different problem? Alternatively, maybe you can comment on how to set the preferred number of workers on LUMI-G?
    - In general, we recommend using 1 process/rank/task mapped to each of the (up to) 8 GPUs (GCDs) and using up to 7 DataLoader workers with `--cpus-per-task=7` per process/rank/task (a job-script sketch is given at the end of this session's Q&A).
    - GPU binding course page: https://lumi-supercomputer.github.io/LUMI-training-materials/paow-20240611/M_1_04_ApplicationPlacement/
    - ROCm profiler: https://lumi-supercomputer.github.io/LUMI-training-materials/paow-20240611/M_2_01_AMD_tools_1/
    - AI workshop: https://lumi-supercomputer.github.io/LUMI-training-materials/ai-20240529/
5. Are there any monitoring tools that work out of the box? W&B (Weights & Biases) does not seem to find the correct GPU metrics for AMD.
    - Which metrics are you looking for? I believe this is [the list of metrics that W&B can extract from AMD GPUs](https://docs.wandb.ai/guides/app/features/system-metrics#amd-gpu) - which are fewer than it can extract from an NVIDIA GPU...
    - AI workshop: https://lumi-supercomputer.github.io/LUMI-training-materials/ai-20240529/
    - ROCm profiler: https://lumi-supercomputer.github.io/LUMI-training-materials/paow-20240611/M_2_01_AMD_tools_1/
    - PyTorch installation: https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/#installation-with-easybuild
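    A minimal job-script sketch of the "one rank per GCD, 7 DataLoader workers" recommendation from question 4 above; the account and script names are placeholders, and for the exact CPU/GPU binding masks see the GPU binding course page linked there:
    ```bash
    #!/bin/bash
    #SBATCH --account=project_465000XXX   # placeholder
    #SBATCH --partition=standard-g
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=8           # one rank per GCD
    #SBATCH --gpus-per-node=8
    #SBATCH --cpus-per-task=7             # up to 7 DataLoader workers per rank
    #SBATCH --time=01:00:00

    srun python train.py   # set num_workers=7 in the DataLoader of the training script
    ```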
### 2024-05-22 13:00 -- 14:00 CEST

#### LUMI Updates

- Trainings
    - _Performance analysis and optimization_
        - 2-day workshop about finding and fixing performance bottlenecks. Participants are encouraged to bring their own workflows
        - 11.--12.6. in Oslo (Norway)
        - [Register here](https://www.lumi-supercomputer.eu/events/performance-analysis-and-optimization-workshop-2024/) until 30.5.2024
- Maintenance
    - [Updated information on planned breaks and notifications of issues](https://www.lumi-supercomputer.eu/lumi-service-status/)

#### Questions & Answers

1. How about a weekly coffee break? One month is too long a time.
    - Tickets remain the main mechanism to solve problems. People expect too much from what we can solve during a coffee break.
    - And we have lots of other work as well. It requires a significant amount of resources from our side to do these coffee breaks with a sufficient number of people with various specialists present.
2. PyTorch is phasing out support for ROCm < 6.0; will this affect anything on LUMI?
    - (Christian) Short answer: When LUMI receives system upgrades with a newer version of ROCm, you will be able to run newer versions of PyTorch. On the other hand, older versions of PyTorch will become unsupported on LUMI following system upgrades. The LUMI system upgrade cadence is determined by many factors, including which versions of the most relevant applications we can support. The next system upgrade is scheduled for August/September 2024 and will hopefully provide ROCm 6.0.
    - More detailed answer: To have AMD GPU support, you need ROCm-compatible versions of:
        1. The AMDGPU/KFD Linux kernel driver - installed by the LUMI system administrators as part of the system upgrades
        2. The ROCm user space components - installed in the system space (default version), in the LUMI software stack, or included in a container (base) image
        3. The application (PyTorch) built with ROCm support - installed in the container you use for e.g. PyTorch
    - (George) After the upgrade LUMI will support up to ROCm 6.1 (though likely on the 5.7 driver).
    - AMD only provides a +/- 2 ROCm release "tested compatibility" claim. As of May 2024, the AMDGPU/KFD driver on LUMI, aligned with ROCm 5.2, has "tested compatibility" with ROCm versions 5.0, 5.1, 5.2, 5.3, and 5.4. However, ROCm 5.5 and 5.6 also work for most use cases. To use ROCm >= 5.7, we need a newer AMDGPU/KFD driver on LUMI. Using anything built for ROCm >= 5.7 before the next system upgrade is very likely going to break. Historically, we have seen 4-5 ROCm releases a year, roughly corresponding to a "tested compatibility" window of about +/- half a year. Thus, older versions of ROCm quickly become unsupported on LUMI following system upgrades. Applications are typically built for a range of ROCm releases, providing a larger (application) compatibility window, but you might not be able to run very old or the most recent versions of PyTorch on LUMI.
    - Moreover, it is not only about AI software. There are other users on the system who also have their software requirements. The MPI library, e.g., also needs to be compatible with the ROCm version being used, so we need a compatible version of the programming environment as well before ROCm can be upgraded.
3. What is the outlook for working with sensitive data on LUMI? This is definitely holding our group's AI work back at the moment...
    - (Kurt) It depends on the level of protection you want, but not very good. It is difficult to create an environment that complies with the laws of one country. Doing so according to the laws of all 11 consortium countries, or even worse, all EuroHPC members, is near impossible.
4. Is it possible to link the web interface to containers on LUMI? So people can use Singularity containers through the browser?
    - (Christian) We don't support this (yet). It also depends on which application you would like to run in a container through the web interface. If you are looking for a way to run JupyterLab with your containerized environment, and you like to live on the edge, the following should allow you to use a container with the web interface:
        1. Make sure the `jupyterlab` and `nbclassic` packages are installed in the container.
        2. Create an executable bash script that runs your containerized Python. For the LUMI PyTorch container (which doesn't include Jupyter, making it a really bad example, but anyway...), this may look something like:
           ```bash
           #!/bin/bash
           module load LUMI/23.09
           module load PyTorch/2.2.0
           singularity exec $SIF bash -c '$WITH_CONDA; python "$@"' -- "$@"
           ```
        3. In the web interface, when launching JupyterLab, for Python select "custom" and provide the output of `realpath <your_bash_script_for_the_containerized_python>` in the "Path to Python" text field.
5. We're having node failures occur during our LLM training runs with 1024 nodes on LUMI standard-g. Every time a node failure happens, our priority gets reset and we effectively end up at the back of the queue. Is there something that can be done about either the node failures or the priority issue?
    - If you run for 24 hours on 1000 nodes, it is rather unlikely not to have a node failure. We currently have around one node failure per 1000 nodes per day. The only thing you can do is to develop software that can resist a node failure and continue on the remaining nodes. Already around 2010, computer scientists were pointing out that this would be a necessity to run on an exascale computer, and even though the current exascale machines look very different from the predictions of 15 years ago, it turns out to be all too true.
    - And we've discussed this with the sysadmins also. Apart from that, resetting the priority is standard behaviour of Slurm; changing it is dangerous, as some crashes are caused by the behaviour of the software itself, triggering hardware problems or bugs in the OS and drivers. If you restart immediately on the remaining good nodes and then on some other ones, the result would be that you slowly take down the machine.
    - Maybe this framework could also be useful: https://pytorch.org/docs/stable/elastic/run.html. Some additional comments:
        - I think it would require "overbooking" on GPUs so that it can replace failing ones, so running the jobs would become more expensive in terms of GPU-hours.
        - "Worker failures are handled gracefully by restarting all workers.", so you'd still need to checkpoint fairly often to not lose progress, but you'd avoid requeueing and maybe some initial setup overhead compared to restarting the Slurm job completely.
    - Can Slurm be prevented from killing the whole job when a node failure is encountered? Otherwise, even if the job can internally handle failing nodes, it won't do any good if Slurm then kills the whole thing...
    - I am not sure about the status, but check for fault tolerance for LLMs? Something like https://arxiv.org/abs/2310.12670, but I have not tried it.
6. We've had multiple cases of weird OOM errors when running training jobs on standard-g. The errors happen with about 30-40 GiB allocated (not very close to the 64 GiB limit) and the error message claims there is ~20 TiB of memory free. Here's an example:
    ```
    torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 108.00 MiB (GPU 0; 63.98 GiB total capacity; 29.40 GiB already allocated; 21822.04 GiB free; 29.50 GiB reserved in total by PyTorch)
    ```
    - With ROCm > 5.5 there are known problems also with memory reporting on the current drivers, but I'm not sure if even these kinds of numbers can be unreliable. Basically only ROCm versions up to 5.4 are guaranteed to work properly on LUMI at the moment. E.g., the 20 TiB free is definitely a result of that incompatibility between newer ROCm versions and the drivers; we've seen that number with TensorFlow also. (A quick way to cross-check what the driver itself reports is sketched below.)
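    Not a fix, but a way to cross-check what the driver itself reports while the job is running (the `--jobid` is of course your own job; the `--showmeminfo` output format may differ between rocm-smi versions):
    ```bash
    # open a shell inside the running job and ask the driver directly
    srun --jobid <yourjobid> --interactive --pty bash
    rocm-smi --showmeminfo vram
    ```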
### 2024-04-17 13:00 -- 14:00 CEST

#### LUMI Updates

- Trainings
    - _Supercomputing with LUMI_
        - 2-day introduction into the specifics of LUMI
        - 2.-3.5. at SURF in Amsterdam (The Netherlands) & online
        - [Information and registration](https://www.lumi-supercomputer.eu/events/supercomputing-with-lumi-may2024/)
    - _Moving your AI training jobs to LUMI: A Hands-On Workshop_
        - 2-day workshop about running AI training efficiently on LUMI
        - 29.--30.5. in Copenhagen (Denmark) & streaming of the lectures (no hands-ons)
        - [Information and registration](https://www.lumi-supercomputer.eu/events/lumi-ai-workshop-may2024/)
    - _Performance analysis and optimization_
        - 2-day workshop about finding and fixing performance bottlenecks. Participants are encouraged to bring their own workflows
        - 11.--12.6. in Oslo (Norway)
- Maintenance
    - [Updated information on planned breaks and notifications of issues](https://www.lumi-supercomputer.eu/lumi-service-status/)
    - Cooling distribution unit flush
        - To solve issues with corrosion and insufficient cooling of some nodes
        - Planned finish: 21.4.
    - Lustre file system upgrade
        - To fix a bug which could otherwise lead to data loss
        - Planned finish: 21.4.

#### Breakout rooms

Join freely any Zoom breakout room you think is interesting for you
- Room 1: AMD
- Room 2: HPE
- Room 3: Getting started on LUMI (Join if you are new on LUMI and have questions)

#### Questions & Answers

1. Hello dear colleagues. I'd like to ask you a question about alternative BLAS libraries. The default libsci in the LUMI programming environment causes numerical problems with the program JDFTx, corresponding to BLAS routines. I ran tests several times and these problems occur from time to time. The same happens with PrgEnv-cray. I suspect that the problem is some race condition because it occurs intermittently; libsci uses threaded BLAS. I tried to build libflame from source but without success. I tried to install OpenBLAS, but without success. How can I build some non-threaded optimized BLAS library from source? I'll create a ticket with more details.
    - (Alfio, HPE): Creating a ticket is the best solution to provide more support. There is a bunch of variables you can try to disable specific HPE optimizations: CRAYBLAS_LEVEL1_LEGACY, CRAYBLAS_LEVEL2_LEGACY, CRAYBLAS_LEVEL3_LEGACY (see the sketch below). About a libflame installation, you can check the LUMI EasyBuild package (https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/l/libFLAME/).
    - Ok, thank you very much.
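    A sketch of where those variables would be set; whether they need a specific value is not documented here, so treat the `=1` as an assumption and test one level at a time:
    ```bash
    # try falling back to the legacy (less aggressively optimised) CrayBLAS kernels
    export CRAYBLAS_LEVEL3_LEGACY=1   # start with level 3 (GEMM etc.), then levels 2 and 1 if needed
    # export CRAYBLAS_LEVEL2_LEGACY=1
    # export CRAYBLAS_LEVEL1_LEGACY=1
    srun <your usual options> ./jdftx <your input>   # placeholder executable/arguments
    ```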
2. I am a postdoc at Aalto University. I use Puhti at CSC and Triton at Aalto regularly for my work. I am here to understand how LUMI can help me in addition to the resources I have already been using. I would like to use Gaussian together with another Python package. This will need around 200 CPUs per simulation. I am currently preparing a proposal to AKA for the special call for LUMI usage. I would like to get some help regarding my proposal as well. So, my requirements are a bit general...
    - (Emmanuel, LUST): starting with a CSC allocation https://www.lumi-supercomputer.eu/get-started-2021/users-in-finland/ and LUST assistance https://www.lumi-supercomputer.eu/user-support/need-help/
    - Molpro: Needs UCX for MPI, so prebuilt binaries will not work. Compiling from sources should be possible if you have access to the sources.
    - ORCA: Seems to work with our Open MPI 4.1.6, but it will produce a warning when running on a single node.
3. Which build system can you advise for compiling static binaries, preferably with musl-libc? The aim is to be able to build programs with multiple dependencies once and avoid messing with the installation of dependencies in the future.
    - (Alfio, HPE): some time ago I tried https://github.com/intoli/exodus...
    - Thanks a lot!!!

### 2024-03-27 13:00 -- 14:00 CET

#### LUMI Updates

- Trainings
    - Advanced general 4-day LUMI course
        - 23.--26.4. in Espoo (Finland)
        - [Course page and registration](https://www.lumi-supercomputer.eu/events/comprehensive-general-lumi-course/)
    - _LUMI Intro_
        - 2-3.5 at SURF in Amsterdam (The Netherlands)
    - _Getting started with AI on LUMI_ workshop
        - 29.--30.5. in Copenhagen (Denmark)

#### Breakout rooms

Join freely any Zoom breakout room you think is interesting for you
- Room 1: AMD
- Room 2: HPE
- Room 3: Getting started on LUMI (Join if you are new on LUMI and have questions)

#### Questions & Answers

1. Where can I get beginner information on LUMI? I'm comfortable on a local computer and server, but LUMI is new to me. I have an account but don't know where to begin to move my programs and data.
    - (Emmanuel) The LUMI documentation is available here: https://docs.lumi-supercomputer.eu/ The LUMI User Support Team is also organizing LUMI intro courses (the next one is planned at SURF in Amsterdam 2-3.5 and will be available online as well). Please note that the documentation and these events are not meant for HPC beginners. Depending on where you are located, you probably have access to different HPC courses organised by your institution or competence center.
    - (Kurt) Materials from past trainings are also available (or documented if they are on the system itself but not downloadable due to restrictions) on the [LUMI Training Materials website](https://lumi-supercomputer.github.io/LUMI-training-materials/). This includes recordings of the presentations, slides, and in some cases additional notes.
2. The billing policy is not always clear to me. For example, if I have the option `#SBATCH --exclusive` or `#SBATCH --mem=0`, I am not sure how much it will cost. It would be very convenient to have a report at the end of each run giving the number of GPU-hours or CPU-hours used.
    - (Kurt) The reason why this is not currently done is because the actual billing is not done by Slurm but implemented by a script that runs periodically, as the actual "price" is computed based on multiple Slurm parameters about the job and not computed by Slurm itself. The billing policy is best understood from the principle "you are billed for all resources that others cannot use because of your job", which is only fair. E.g., if you ask for exclusive node use, either by using one of the "allocated per node" partitions or by adding `--exclusive` on one of the "allocatable by resource" partitions, it is only fair that you get billed for all resources of the node, i.e., 128 core-hours per hour for the CPU nodes or 4 GPU-hours per hour for the GPU nodes. The same holds when you request all memory on a node, even if you would only be using a single core or GPU, as you make the whole node unavailable for other users. But the same principle is also applied if you use a disproportionate amount of a certain resource compared to the other resources. E.g., if you use 50% of the cores or 50% of the CPU RAM in a GPU node yet request only 1 GCD (1/2 of an MI250X GPU), you'd still be charged for using half a GPU node, i.e., 2 GPU-hours per hour, as in essence 2 full MI250X's cannot be used efficiently anymore or without restrictions by other users. (See also the consumption-checking sketch below.)
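    Not a per-job cost report, but project-level consumption of billing units can be checked with the `lumi-allocations` tool; a minimal sketch, where the project name is a placeholder and the exact options may differ between versions:
    ```bash
    # show used and remaining CPU/GPU/storage billing units for your projects
    lumi-allocations
    # or restrict the output to a single project (placeholder name)
    lumi-allocations -p project_465000XXX
    ```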
3. It seems to me that the Keras function `fit` has a memory leak or increasing memory usage with the epochs. Can I print the GPU memory usage after each epoch?
    - (Kurt) I don't think you can, as memory reporting with the HIP API is broken when using ROCm 5.6 (and likely also ROCm 5.5) on the 5.2 driver that we currently have on LUMI. I don't know if there is any way to get a reliable number.
    - (Sam, AMD) As Kurt says, the driver is a bit dated, which precludes the use of some APIs. You can try https://www.tensorflow.org/api_docs/python/tf/config/experimental/get_memory_usage to get your usage and see if you get sensible results. The GPU total memory info can be corrupted due to the driver incompatibility, but the "used" memory might still be reported correctly. Another option is to have an interactive session `srun --jobid <yourjobid> --interactive --pty bash` where you can run `rocm-smi` while your job is running to see what the driver reports.
4. Is there any information related to the latency of launching kernels on the GPUs on LUMI-G? In our application (materials science, DFT), compared to running on NVIDIA GPUs, we found out that the overall runtime of functions is slower on MI250X but the individual kernels (as shown in the tracing software: Perfetto & NSIGHT) are faster. I'm pasting below a screenshot in case someone has investigated this already. It's produced via rocprof, using a very simple code that launches empty kernels with different grid/block sizes with user markers (shown in the bottom panel).
    ![](https://md.sigma2.no/uploads/2a492107-d073-4056-b3a9-d263b988d68e.png)
    - (Kurt) It is probably best to ask this question directly in the AMD room during the coffee break. We have also observed that starting kernels on AMD can be more expensive than on NVIDIA hardware. It's not clear how much of this is due to hardware limitations and how much of this can be improved through future drivers and ROCm versions (our driver is very old). And we have no precise data.
    - (Sam, AMD) Let's discuss this in the coffee break. I'd say that 10 us to start a GPU kernel is kind of expected. So the moment kernel timing starts to be under 10 us, the latency starts being exposed. I don't currently have numbers for what the expectations on NVIDIA should be.
    - (George, AMD) I do not see what is on the left; do you have more kernel calls? The first call to a kernel takes more time.
5. I mainly use Python. My batch script has
    ```
    module purge
    module use /appl/local/csc/modulefiles
    module load pytorch/2.0
    ```
    in it. Do I have to do this a second time?
    - (Kurt) You have to do this in every shell where you want to use it. Since it is in a batch script, it should be OK for the lifetime of the shell in the batch job. Your next batch job of course needs the same lines again in its job script (see the job-script sketch below). But note that you are using modules from a local software stack provided by CSC, so if there are problems with this version of PyTorch itself, support comes from CSC and not from LUST, as we are not at all involved with building those modules.
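    For completeness, a sketch of what those lines look like in the context of a full job script; the account, resources and script name are placeholders:
    ```bash
    #!/bin/bash
    #SBATCH --account=project_465000XXX   # placeholder
    #SBATCH --partition=small-g
    #SBATCH --gpus=1
    #SBATCH --time=01:00:00

    module purge
    module use /appl/local/csc/modulefiles
    module load pytorch/2.0

    srun python my_script.py
    ```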
6. With Score-P v8.4 being released, which includes improved support for HPE Cray systems, we thought about contributing an updated EasyBuild recipe (with the respective EasyBlock) for users on LUMI. In general, Score-P v8.x might be interesting as it includes support for HIP. So far I've created easyconfigs for our dependencies (CubeLib, CubeWriter, OTF2, OPARI2) and am trying to get Score-P built as well. I've got a few questions though:
    - Is there interest in general to provide an updated version of Score-P in your environment, or at least accessible for user installations?
    - My current idea would be to provide configs for LUMI-L (Cray, GNU, AOCC). Is there any interest for Cray-amd / AMD?
    - Right now, CubeGUI (which is used to examine the profiles) is missing, since it requires Qt. Building Qt5 just for this application seems to be overkill. A workaround would be to offer the [CubeGUI AppImage](https://apps.fz-juelich.de/scalasca/releases/cube/4.8/dist/CubeGUI-4.8.2.AppImage) to users. Would this be a feasible solution?

    *Answer from LUST*
    - Any contribution is certainly welcome. It would go in the contributed repository from which users can install software themselves, as we can neither guarantee timely updates when something on the system changes nor guarantee that we can actively support the package. It has not been done because of a lack of time on our side with the number of people we have who can do this work.
    - Not sure about the second question, but given that it is an important tool in the EuroHPC ecosystem, where some projects want to focus on European tools, I expect it is only a matter of time before it will be requested.
    - Getting Qt5 to build and work correctly can be extremely hard on some systems, so we haven't embarked on it either, and any other solution can be considered. We're now more and more containerising GUI software and making it available via the web interface.
7. I am a new user on LUMI. My supervisor added me to her grant and I have finished the whole registration process, but I do not know what my SSH key to join the LUMI cluster is. Could you please guide me? Thanks.
    - The whole process is [documented extensively in the LUMI documentation](https://docs.lumi-supercomputer.eu/firststeps/). With the number of users on the system compared to the number of support people, we cannot guide everybody individually through the procedure but have to refer users to the documentation first (and to trainings in the [LUMI training archive](https://lumi-supercomputer.github.io/LUMI-training-materials/), but this specific topic is not treated in the courses).
    - You should update your profile at https://mms.myaccessid.org/profile/ with the public part of your SSH key.
8. Hi, I have a problem accessing my project. I can only access one project, and whenever I want to access it in VS Code, it asks me to type a password, which I don't understand at all.
    - This is a problem with the way you use VS Code on your PC and not an issue of LUMI. You need to correctly configure remote access in VS Code to use your SSH key.
        - Sorry, but even within the LUMI platform, I cannot see the project.
    - What do you mean? Open OnDemand? Are you even sure you are a member of the project?
    - Do you see it in the output of the `groups` command on the command line or in the output of `lumi-workspaces`? Also, you may have to start the VS Code app in the web interface in the correct project to see it by default.
        - Just checked, actually this is not needed; I can go to the directories of all my projects if I use VS Code in the right way in Open OnDemand.
9. I would be interested in a training on high-level deep learning Python libraries like Keras/TF. It would help to get the good practices regarding these libraries on LUMI (especially for medium-to-large size applications).
    - We will have an AI course at the end of May in Copenhagen but are still preparing that. We are not a support team for specific applications, so we cannot give you guidance for each and every application. But some common rules are:
        - Those packages should be used from containers, as the LUMI file system doesn't like lots of small files, and in particular if you are then doing distributed training it would slow down the file system for everybody.
        - Be careful with everything that requires communication. Whether you use something MPI-based or something RCCL-based, they have to use the libfabric library from the system if you want high performance, which basically means for MPI that you'll likely need to build the container in a way that it uses the Cray MPICH implementation, and for RCCL, that you need a plugin. This is also the reason why we offer some pre-built containers that are documented in the [LUMI Software Library](https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs), your starting point for all searches about a very specific package.
        - Datasets should use appropriate formats and not be stored as millions of small files on the system, as again this puts a very high strain on the file system.
    - We're working on some more documentation for PyTorch, as that is by far the most used AI package on LUMI, so we focus the few efforts that we can do ourselves in that domain on PyTorch first (being computer experts and not having domain experts for each and every science domain on LUMI in the LUST team).

### 2024-02-28 13:00 -- 14:00 CET

#### LUMI Updates

- Trainings
    - Advanced general 4-day LUMI course
        - 23.--26.4. in Espoo (Finland)
    - _Getting started with AI on LUMI_ workshop
        - 29.--30.5. in Copenhagen (Denmark)

#### Breakout rooms

Join freely any Zoom breakout room you think is interesting for you
- Room 1: AMD
- Room 2: HPE
- Room 3: Getting started on LUMI (Join if you are new on LUMI and have questions)

#### Questions & Answers

1. Slurm has been lagging consistently for me over the past few weeks. It takes a while to respond to commands. Has anybody else noticed this?
    - (Emmanuel): Can you be more specific? Which Slurm commands have been lagging? I am trying to figure out if it is a Slurm issue or a filesystem issue.
    - (OP) Actually I found out that it is not a Slurm issue but a Python script I started using to color the output of my squeue alias, to apply different colors depending on job state, etc. Without the script it is working just fine, sorry! Perhaps I will look for a Bash alternative, unless you can suggest an efficient way of invoking the script. The script works with pipe redirection; right now I use a command similar to `squeue --(options) | script.py`. It takes a few seconds each time on LUMI, and on another cluster it is basically instant.
    - (Kurt) Python on a big Lustre filesystem is not a good idea if your Python package has a lot of dependencies, as it will access lots of small files when starting (and maybe even when running, if for some reason Python decides not to cache code of dependencies of dependencies). That may explain why it is slower than on some other machines. Of course the length of the queue on LUMI compared to that other machine can also be an issue. My experience is that Lua, combined with trying to put everything in as few files as possible, is the better scripting tool in such cases.
    - (OP) Did it with a cascade of greps, works like a charm (one possible variant is sketched below).
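    One possible shape of such a "cascade of greps" (purely illustrative; the alias name, colours and job states are arbitrary choices):
    ```bash
    # colour job states in squeue output without invoking Python
    alias myq='squeue --me \
      | GREP_COLORS="mt=01;32" grep --color=always -E "RUNNING|$" \
      | GREP_COLORS="mt=01;33" grep --color=always -E "PENDING|$"'
    ```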
2. My code keeps failing without a specific reason (process terminated) when I start large jobs with many nodes. The same code works fine with fewer nodes.
    - (Emmanuel): We would need more information from your side to be able to help you. Is it related to an existing ticket? Which application are you running? Batch script? Error file? I would recommend you to first open a ticket; then you can discuss during the coffee break with the dedicated people from LUST, HPE, or AMD (depending on the origin of the issue).
3. I am running Quantum ESPRESSO and it is working perfectly, but a specific executable (projwfc.x) seems to be missing. Perhaps it is a small issue. Thanks for the support.
    - That is entirely normal. Quantum ESPRESSO can be built in many different configurations. Moreover, a decent installation manual is missing. We are in the first place computer specialists, and though many of us have a past in research, we cannot cover all research fields nor all tools, not even the major tools, in the small team that we have. As a result, we often have to guess which build options are most relevant, unless we get enough info on that in the ticket that triggered the development of the EasyBuild recipe. So we look into installation tools that are used elsewhere, basically at the way CSCS builds on Cray, as they have a long tradition with Cray systems, and the way EasyBuild and Spack do it, and build those configurations (and in this case we mostly based our installation on the one at CSCS, as they have close ties with the EuroHPC Centre of Excellence that develops Quantum ESPRESSO). The tool you need is in a different module of Quantum ESPRESSO that is not built that way, so you will have to adapt the EasyBuild recipe for your specific needs. It looks like the only change you might need is to copy the EasyConfig `QuantumESPRESSO-7.2-cpeGNU-22.12.eb` and change the line
      ```
      buildopts = "all epw"
      ```
      into
      ```
      buildopts = "all epw pp"
      ```
      if the short manual on postprocessing tools that I found for an older version of Quantum ESPRESSO is still relevant (a sketch of that workflow is given below). We make our full installation recipes available both on GitHub and via the [LUMI Software Library](https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/) so that users can actually check how the package is installed and what it does and does not contain, and so that they can customise a build recipe to their needs without having to start from scratch.
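    A sketch of the usual copy-edit-rebuild workflow with the user EasyBuild setup; the stack version, partition and resulting module name are assumptions, and the full procedure is in the LUMI documentation on installing software with EasyBuild:
    ```bash
    module load LUMI/22.12 partition/C EasyBuild-user
    # take a private copy of the recipe into the current directory and adapt it
    eb --copy-ec QuantumESPRESSO-7.2-cpeGNU-22.12.eb .
    # edit ./QuantumESPRESSO-7.2-cpeGNU-22.12.eb so that: buildopts = "all epw pp"
    eb ./QuantumESPRESSO-7.2-cpeGNU-22.12.eb -r
    module load QuantumESPRESSO/7.2-cpeGNU-22.12   # module name/version may differ
    ```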
4. We are trying to run some simulations that write a lot of temporary data to disk. Since there are no locally mounted scratch disks on the individual nodes, and to reduce I/O overhead, we want to run the simulation in the memory of the individual nodes instead. Usually we would simply ssh into each node and create and execute the necessary files from there, but since LUMI does not support this, is there a better way of doing this? I was looking to allocate multiple nodes in a job, and then periodically ssh into the nodes.
    - See the [LUMI documentation](https://docs.lumi-supercomputer.eu/), on the ["Interactive jobs" page](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/interactive/), section ["Using `srun` to check running jobs"](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/interactive/#using-srun-to-check-running-jobs), for the replacement of using ssh to get into the compute node. The example commands place the resulting shell inside the job, so it is restricted by the cores and memory allocated to the job and cannot eat into resources allocated to a different job on the node. Using `ssh` to go into the compute nodes is not supported because ssh is not job-aware. While Slurm does have a system to ensure that you cannot ssh into a node on which you have no job running, it cannot properly restrict the processes that you start in that shell to using only your resources. And doing something different on the exclusive partitions and the shared partitions would only cause confusion.
    - I'm not sure if something like [mpiFileUtils](https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/m/mpiFileUtils/) can also be of help in your data management problems.
    - Thank you for the answer, it seems to be the right approach. However, allocating the resources seems confusing to me. Let's say I have a script requiring 8 cores with 1 task each that I want to run on a specific node. If I run srun with only 1 task, the code executed cannot use multiple cores/tasks. If I run srun with N tasks, it appears to execute the code N times?
5. A Singularity container with PyTorch specifically for LUMI has been provided by AMD (https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/). However, as far as I understand, it is not possible to install new packages into the container once it's constructed? If so, what is the suggested workflow for using this 'recommended' container instead of building our own using the cotainr building tool which was presented last year? Or is the intention that these two resources should somehow be combined?
    - (Christian) Cotainr may be used to build a container from one of the base images also used in https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/, i.e. you can use one of the `lumi-rocm-rocm-X.Y.Z.sif` images under `/appl/local/containers/sif-images` on LUMI as a base image with cotainr - combining the two approaches you mention, as detailed in [this example](https://github.com/DeiC-HPC/cotainr/tree/lumi_sfantao_pytorch_example/examples/LUMI/sfantao_pytorch) (a minimal command-line sketch is also given below this question). Unfortunately, right now this example fails due to [a bug in Singularity](https://github.com/DeiC-HPC/cotainr/issues/52). We believe we have a fix for this, which will hopefully *soon* be implemented on LUMI. Until then, a workaround is to use the latest main branch GitHub version of cotainr and use the tar archive of `lumi-rocm-rocm-X.Y.Z` as a base image, as detailed in that bug report.
    - (Christian) Alternatively, you may try to create an overlay on top of the existing PyTorch container from https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/ as detailed in https://md.sigma2.no/CBC_GEUCTCSvdvhpxAChkQ?view. Note that this involves installing pip packages into an existing conda environment already containing pip packages, which is [discouraged by the conda developers](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#pip-in-env) - your mileage may vary. If using this approach, you might have problems with reproducibility, since I don't think you have any guarantees of being able to create the exact same conda/pip environment if you do it twice (with some time in between). You also have to make sure to keep track of the overlay(s) yourself - or consider [embedding them](https://docs.sylabs.io/guides/3.5/user-guide/persistent_overlays.html#overlay-embedded-in-sif).
    - (Christian) Finally, you may also use the container from https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/ as a bootstrap argument in a Singularity definition file, which you may use to build a custom container on your own hardware. This is not really something LUST can support, but it is an option if you are used to building Singularity containers yourself on your own hardware.
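    A minimal command-line sketch of the cotainr route mentioned above; the module to load, the ROCm version in the base image name, and the environment file are assumptions, so check which images and which cotainr version are available on the system:
    ```bash
    module load CrayEnv cotainr
    # build a container from a conda/pip environment file on top of a LUMI ROCm base image
    cotainr build my_pytorch.sif \
      --base-image=/appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif \
      --conda-env=my_pytorch_env.yml
    ```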
6. I'm having trouble getting PyTorch's nn.Conv1d to work with AMD ROCm. See https://pastebin.com/DRH8si8N for a minimal example that breaks when using ROCm but works fine on CPU. In general, I have not had issues with any other features of PyTorch (using Python 3.12.2 and torch 2.2.0+rocm5.7).
    - ROCm 5.7 is known to not fully work on LUMI at the moment. In fact, even in 5.6 memory reporting was already broken on the driver we currently have on LUMI, and your error message points in the direction of trying to use precisely those functions. We can only give some guarantees up to ROCm 5.4. Any newer version of ROCm has some problems on LUMI currently, which will remain until after the next update. It is already certain that we will then be able to support newer versions of ROCm, but it is not clear how far we will be able to go (though it looks like the driver we get then should be new enough to support ROCm 5.7). As there is still uncertainty about the availability of the software that we want to install in the next update, and as it will first be tested on a small test system, no firm date has been set for the next update.
7. I have a general question on the compiler installation/upgrade procedures followed by HPE. I am working on two machines (LUMI & another MI250X machine installed by HPE) for porting my code to AMD GPUs. I am using CCE/15.0.1 on both machines for compiling my Fortran+OpenMP code. On LUMI, my code compiles without any issues with cce/15.0.1, but on the other machine the same compiler version cce/15.0.1 shows some compile-time issues when it tries to compile some Fortran interfaces to C++ functions. In both cases, the source code and compiler module (cce/15.0.1) are the same. I am wondering whether the two compiler modules can be different even when the major and minor versions are the same (as shown with the installed modules). Is it a general practice to add patches to compiler modules to upgrade them while sometimes keeping the version numbering the same? In general, I am trying to understand how the same compiler modules on two different machines can behave differently with the same source code. Is there a way for a general user to verify that compiler modules with the same version are exactly the same? Any ideas on this would help me advance with my debugging.
    - Other libraries on the system can be different and cause the problems. CCE 15 is supported on both SUSE 15 SP4 and SP5, but they may be packaged differently. There may also be differences between the modules that HPE distributes for SUSE and for Red Hat derived systems.
8. A very practical question regarding the AMD profiling tool rocprof. By running the command `srun -u -n1 rocprof --hip-trace --roctx-trace MY_EXEC ARGS` I can get a json file that contains the HIP API calls, the GPU kernel executions and my own marked regions, which is the reason `--roctx-trace` is used. The json file can be visualized at chrome://tracing/ in the Google Chrome browser. I can get a lot of information from the visualizer, but it seems that the gridSize and blockSize of the executed kernels are missing. Is there a way to get this information? For comparison, NVIDIA's corresponding trace visualizer, Nsight Systems, provides this information when inspecting a GPU kernel, as shown in the screenshot below: ![](https://md.sigma2.no/uploads/41669904-4c74-42f8-a4ad-f26a73ce3c79.png) To be very specific, I'm interested in getting the `grid: <<<x,y,z>>>` and `block: <<<a,b,c>>>`, but via rocprof. The kernel configuration parameters, such as the above, are determined automatically based on the sizes of the arrays.
9. When we use our own singularity container based on the official pytorch image, we cannot load the aws-ofi plugin.
    - Too vague to answer what the issue is. But it is entirely possible, e.g., that there is a conflict between the runtime libraries of the compilers used for the software in the container and the AWS OFI plugin if the latter is built with our EasyConfig and compilers on LUMI. Sometimes there is no other solution than to compile it by hand with the compilers in the container. It is because of these difficulties that we provide containers ourselves.

    We don't get any errors when we load the aws-ofi plugin, but our distributed training job is slow and we see from the NCCL info that the plugin is not loaded.
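    - A rough, untested sketch of how to check from the NCCL log whether the plugin is actually picked up (the exact log strings vary between NCCL versions; the container name and `train.py` are hypothetical placeholders):

      ```bash
      # Ask NCCL to report which network transport it selects
      export NCCL_DEBUG=INFO
      export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3   # Slingshot interfaces on LUMI

      srun singularity exec my_pytorch_container.sif \
          python train.py 2>&1 | grep -iE "NCCL INFO (NET|Using)"
      # Lines mentioning the OFI/libfabric transport indicate the aws-ofi plugin is in use;
      # a fallback to "NET/Socket" means it was not picked up.
      ```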
10. When we use AMD jax containers, the GPUs don't appear, and we can't import the jax library properly.
    - Again too vague to answer. How do you request the resources for the job and how do you start the container? The answer may lie there instead. For example, you need to explicitly request GPU resources, otherwise they will not be visible to your job.

    None of the containers here seem to work: https://hub.docker.com/r/rocm/jax As for running, we have tried both an interactive session where we get a single node and sbatch. rocm-smi shows the presence of all GPUs inside the container. However, jax.devices() returns only CPU devices. With the exact same setup I do not face these issues with PyTorch containers. I am happy to provide any more specific details that can help with the debugging. (Tarun)
    - Oh, I misunderstood your question. You're using the docker containers and not ours (which are still hidden in /appl/local/containers/sif-images as we still need to do the necessary EasyBuild packaging to make them easier to use). We cannot support the docker containers and they are not guaranteed to be compatible with LUMI - rather the contrary - which is why we provide our own containers for AI packages. They are also built for a different communication network. We can get help from the people who built the containers that we offer in `/appl/local/containers` if something does not work, but we are in no way involved in how the docker containers are built, so we really cannot tell at all whether they could run on LUMI and why or why not.
11. How can we use a recent version of ROCm, like ROCm 6.0, in a singularity container? We have tried recent ones, but it doesn't work.
    - (Peter) I refer to the answer to question 6 above. ROCm 6 will likely not work with our current drivers.
12. I would like to have a general comment on the current status of the AOCC compilers in comparison with the Cray CCE compilers. Is Cray CCE still the recommended one, or have the AMD compilers caught up in terms of OpenMP performance on GPUs?
    - Is this for GPU or CPU?

    GPUs!
    - The Cray compiler on the system is based on a newer version of Clang and LLVM than either the AOCC or ROCm compilers that we currently have on the system. Certainly for CPU we still expect that to be the case after the next system update, as Cray is now rather closely tracking Clang and LLVM. For Fortran the story is different, as Cray has its own Fortran frontend; AOCC uses classic flang and my impression is that ROCm uses new flang. For GPU the Cray compiler depends on some parts of ROCm, so the answer is less clear.
13. I wish to run an AI/ML inference pipeline which would use multiprocessing on the CPU side, with each CPU core bound to one GPU. How does one choose the Slurm and srun invocation for this (P.S. I use tensorflow to load the ML model)?
    - Ask in the AMD room, but I guess it depends on how your application chooses the GPUs, and then the trick is to use a CPU bind mask whose first element is a core in the CCD connected to the GPU that the software would choose for the first process, etc.
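    - As an untested illustration of that binding trick on a full LUMI-G node (please verify the core list against the current LUMI documentation; `my_inference.py` is a hypothetical script):

      ```bash
      #!/bin/bash
      #SBATCH --partition=standard-g
      #SBATCH --nodes=1
      #SBATCH --ntasks-per-node=8
      #SBATCH --gpus-per-node=8

      # One core per GCD, in GCD order 0-7, following the GCD-to-CCD layout of a
      # LUMI-G node (the first core of each CCD is reserved by low-noise mode).
      CPU_BIND="map_cpu:49,57,17,25,1,9,33,41"

      srun --cpu-bind=${CPU_BIND} python my_inference.py
      ```

      Each rank can then restrict itself to "its" GPU, for instance via `ROCR_VISIBLE_DEVICES=$SLURM_LOCALID` as sketched further down in this document.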
14. Related support ticket #3848: Is there any chance that newer Cray CPEs can be provided in some way? We're interested in testing fixes mentioned in the patch notes of CPE 23.12 related to the OpenMP Tools Interface, which is still broken with CPE 23.09 / CCE 16.0.6. Newer ROCm versions are quite interesting for us as well, but this was already answered in previous questions.
    - Please don't ask questions here that we are already answering in tickets. This is very confusing and double work for everybody. For legal reasons, I cannot even give here the answer that we want to give.
15. What are the recommended monitoring and process management tools? I have tried to start htop but I was not able to load the appropriate module. I have also tried nmon; it is not bad, but it does not show all 256 "processors" on the nodes (I mean 128 physical cores + 128 hardware threads).
    - You need to use either CrayEnv or the LUMI stack before you can load the module that contains `htop`.

    Thank you, I use the following commands

    ```
    #!/bin/bash
    #PARTITION L - large memory and login node
    module purge
    module load LUMI/23.09 partition/L
    module load craype-x86-rome
    module load libfabric/1.15.2.0
    module load craype-network-ofi
    module load perftools-base/23.09.0
    module load xpmem/2.5.2-2.4_3.50__gd0f7936.shasta
    module load cce/16.0.1
    module load craype/2.7.23
    module load cray-dsmml/0.2.2
    module load cray-mpich/8.1.27
    module load cray-libsci/23.09.1.1
    module load PrgEnv-cray/8.4.0
    module load GSL/2.7.1-cpeCray-23.09-sequential
    module load cray-libsci_acc/23.09.1.1
    module load cray-fftw/3.3.10.5
    module load libxc
    ```

    then `module load systools/23.09` as a dependency of htop, and then I try `module load htop`. It emits the following message: Lmod has detected the following error: These module(s) or extension(s) exist but cannot be loaded as requested: "htop". Try: "module spider htop" to see how to load the module(s).
    - There is no module called `htop`. The `htop` command is provided by the `systools` module, and if you carefully read the output of `module spider htop`, that is precisely what it tells you. We don't want a separate module for every little command on LUMI, as that creates too many small files for the file system to deal with. So all those little tools are bundled in modules.
    - Also, you're loading way too many modules by hand. Everything from `craype-x86-rome` up to and including `PrgEnv-cray` is reloaded or unloaded as soon as you load that GSL module, as that one loads the `cpeCray/23.09` module, which basically loads the programming environment as it was used to build the GSL module.

    Actually, what I want to achieve is to load exactly the same environment each time, in order to avoid breaking things after updates or whatsoever. Did I understand correctly that I reload some modules several times? How can I achieve the same environment in an optimal way?
    - To inspect thread pinning and process binding you can use the `lumi-CPEtools` module and execute the `hybrid_check` tool instead of your target executable; it will print the exact thread pinning and process binding.

    Thank you very much.
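    - As a minimal sketch of the shorter route to `htop` (module versions as mentioned above for the LUMI/23.09 stack):

      ```bash
      module purge
      module load LUMI/23.09 partition/L   # or simply: module load CrayEnv
      module load systools/23.09           # bundles htop and other small tools
      htop
      ```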
16. What is the recommended console manager? Is there a midnight commander installed? I use a self-built, fully static mc, but it cannot edit files, possibly due to missing nls or other language files.
    - We do not use these tools ourselves, so we don't have a good recommendation. `mc` tends to be tricky to compile and install, we tried it before. We have it installed in the latest spack module. Try `module load spack/23.09` and then `module load mc/4.8.27-gcc-7o5`. Completely untested, though.

    Ok, thank you very much. It is much better than reinstalling it myself. I have tried `export SPACK_USER_PREFIX=$PROJ/spack`, `module load spack/23.09`, `module load mc/4.8.27-gcc-7o5`. But there is no module `mc/4.8.27-gcc-7o5`. Should I install mc through spack? Did I understand correctly?
    - I checked myself, independently from the person who gave the previous part of the answer, and it works for me (at least on the login nodes where I tried). Are you sure you didn't make a typo? What does `module avail mc` say after loading the `spack/23.09` module?

    When I try `module load spack/23.09` it emits this:

    ```
    Lmod has detected the following error: Please set $SPACK_USER_PREFIX to where
    you want to install Spack packages. We recommend using your project persistent
    storage for this. E.g. /project/project_<project-number>/spack

    While processing the following module(s):
        Module fullname  Module Filename
        ---------------  ---------------
        spack/23.09      /appl/lumi/modules/SoftwareStack/spack/23.09.lua
    ```

    Then I try `module avail mc` and it emits `No module(s) or extension(s) found!`. But `module spider mc` finds several modules: mc/4.8.26-gcc-xu, mc/4.8.27-gcc-gkn, mc/4.8.27-gcc-z2a and mc/4.8.28-gcc-fs, plus other possible matches: libxdmcp, libxdmcp-1.1.4-gcc-7.5.0-u4e2ln2, mc-4.8.27-gcc-7.5.0-gkngylg, termcap, termcap-1.3.1-gcc-7.5.0-s4oijlb. Finally I have managed to load mc using these commands: `export SPACK_USER_PREFIX=$PROJ/spack`, `module load spack/23.03-2`, `module load mc-4.8.27-gcc-7.5.0-gkngylg`. I have also tried spack/22.08 with mc/4.8.26-gcc-xu. Both preinstalled variants work well for editing, but mouse clicking doesn't work. My statically built mc doesn't edit well, but mouse clicking works. Still some room for improvement. Thank you very much for your hints. They were really helpful.

### 2024-01-31 13:00 -- 14:00 CET

#### LUMI Updates

- LUMI intro course: Thursday, 8.2.2024, online
    - 1-day course about using LUMI effectively. Requires some HPC experience.
    - Info and registration: https://www.lumi-supercomputer.eu/events/lumi-intro-course-feb08/

#### Presentation 15min

- [HyperQueue](https://it4innovations.github.io/hyperqueue/stable/): A tool to simplify the execution of large workflows on HPC clusters by allowing you to execute a large number of tasks without having to manually submit jobs to Slurm.

#### Breakout rooms

Join freely any room you think is interesting for you
- Room 1: AMD
- Room 2: HPE
- Room 3: Getting started on LUMI (join if you are new on LUMI and have questions)
- Room 4: HyperQueue

#### Questions & Answers

1. Any progress regarding how to work with sensitive data on LUMI?
    - "An architecture draft will be reviewed in week 5. Implementation is planned for Spring/Summer 2024."
2. I am porting a code to run on LUMI-G, and encountered a strange data transfer issue (CPU-GPU) which I can't understand. The code is calling "hiprandGenerateUniformDouble" at this point, and out of 8 MPI processes only rank 0 is able to show the device-generated random numbers on the host after updating them from the device. The rest of the ranks fail (memory access fault message with the CRAY_ACC_DEBUG variable) while updating data back to the host from their respective devices. The data transfer is managed by OpenMP pragmas. I have verified (with omp_get_default_device() & hipGetDevice()) that all MPI ranks are indeed running on their own devices. I have a short test ready to quickly go through the issue. Would it be possible for someone to have a look at this issue with me during this coffee break session? Thanks
    - It is not obvious to me what might be causing this. What comes to mind is a mismatch between the device IDs set in the OpenMP runtime and in the hipRAND handler. To narrow down the search space I'd make sure that each rank only sees a single GPU with ROCR_VISIBLE_DEVICES. For instance one can use ROCR_VISIBLE_DEVICES=$SLURM_LOCALID. I'll (Sam from AMD) be in the coffee break and we can take a closer look.
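    - A minimal, untested sketch of that suggestion, wrapping the executable so that each rank only sees the GCD matching its local rank (`./my_app` is a hypothetical executable name):

      ```bash
      # select_gpu wrapper: restrict each rank to "its" GCD before exec'ing the application
      cat > select_gpu << 'EOF'
      #!/bin/bash
      export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
      exec "$@"
      EOF
      chmod +x select_gpu

      srun --nodes=1 --ntasks-per-node=8 --gpus-per-node=8 ./select_gpu ./my_app
      ```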
For instance one can use: ROCR_VISIBLE_DEVICES=$SLURM_LOCALID. I'll (Sam from AMD) be in the coffee break and we can take a closer look. 3. We have recently undertaken the task of porting URANOS, a Computational Fluid Dynamics code, to AMD GPUs. While the code using the OpenACC standard. it was predominantly optimized for NVIDIA machines, so we have encountered some performance challenges on AMD cards. We are reaching out to inquire whether there are individuals within the LUMI staff who can share some pieces of kwnoledge in optimizing code performance specifically for AMD GPUs. We would greatly appreciate any assistance or guidance - We may need HPE people to also chime in here. My experience with OpenACC comes from very flat codes. Here, the performance implications are a mix of runtime overheads and kernel performance. The former can be assessed with a trace of the GPU activity and the later can be done with a comparison of kernel execution time with other vendors. I've seen the Cray Compiler OpenACC runtime being a bit conservative on how to control dependencies with some redundant runtime calls that can be lifted. Other things might come from register pressure and some device specific tunning (loop tilling for example). The register pressure is connected with the setting of launch bounds - unfortunatelly setting the number threads is not sufficient and a thread limit clause needs to be used instead. Tilling requires change a bit the code. We can discuss further during the coffe break. 4. We try to understand why we don't the performance we exepct from the GPUs on LUMI-G but our software is too complicated to trace itself. So I'm looking for much simpler examples, to measure individual functionallities, such as data transfers, FFTs, bandwidth, etc. Is there a repository of simple to complex examples for GPU execution on LUMI-G? - Not sure if it will cover everything needed but AMD has some examples used for training: https://github.com/amd/HPCTrainingExamples. There are also the AMD blog notes that can help with some trimmed down examples https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-readme/. These are not really benchmarks and not meant for performance assessment but could be helpful for testing along those lines. 5. How does HyperQueue compare to other workflow managers such as Nomad (by Hashicorp)? :::info End of archive :::