# Archive of LUMI Coffee Break 2024

This is the archive document of the past sessions of the public LUMI coffee break.

The archived questions and answers of the 2023 sessions can be found here: https://hackmd.io/@lust/coffeearchive2023
The archived questions and answers of the 2022 sessions can be found here: https://hackmd.io/@lust/coffeearchive2022

:::info
Link to the document for the __next coffee break__: https://md.sigma2.no/lust-coffee-break
:::

[TOC]

## Past sessions:

### 2024-04-17 13:00 -- 14:00 CEST

#### LUMI Updates

- Trainings
    - _Supercomputing with LUMI_
        - 2-day introduction to the specifics of LUMI
        - 2.--3.5. at SURF in Amsterdam (The Netherlands) & online
        - [Information and registration](https://www.lumi-supercomputer.eu/events/supercomputing-with-lumi-may2024/)
    - _Moving your AI training jobs to LUMI: A Hands-On Workshop_
        - 2-day workshop about running AI training efficiently on LUMI
        - 29.--30.5. in Copenhagen (Denmark) & streaming of the lectures (no hands-ons)
        - [Information and registration](https://www.lumi-supercomputer.eu/events/lumi-ai-workshop-may2024/)
    - _Performance analysis and optimization_
        - 2-day workshop about finding and fixing performance bottlenecks. Participants are encouraged to bring their own workflows.
        - 11.--12.6. in Oslo (Norway)
- Maintenance
    - [Updated information on planned breaks and notifications of issues](https://www.lumi-supercomputer.eu/lumi-service-status/)
    - Cooling distribution unit flush
        - To solve issues with corrosion and insufficient cooling of some nodes
        - Planned finish: 21.4.
    - Lustre file system upgrade
        - To fix a bug which could otherwise lead to data loss
        - Planned finish: 21.4.

#### Breakout rooms

Join freely any Zoom breakout room you think is interesting for you

- Room 1: AMD
- Room 2: HPE
- Room 3: Getting started on LUMI (Join if you are new on LUMI and have questions)

#### Questions & Answers

1. Hello dear colleagues. I'd like to ask you a question about alternative BLAS libraries. The default libsci in the LUMI programming environment causes numerical problems with the program JDFTx, related to BLAS routines. I ran tests several times and these problems occur only occasionally. The same happens with PrgEnv-cray. I suspect that the problem is some race condition because it appears intermittently; libsci uses threaded BLAS. I tried to build libflame from source but without success. I also tried to install OpenBLAS, again without success. How can I build some non-threaded, optimized BLAS library from source? I'll create a ticket with more details.

    (Alfio, HPE): Creating a ticket is the best solution so that we can provide more support. There is a set of variables you can try to disable specific HPE optimizations: CRAYBLAS_LEVEL1_LEGACY, CRAYBLAS_LEVEL2_LEGACY, CRAYBLAS_LEVEL3_LEGACY. About the libFLAME installation, you can check the LUMI EasyBuild package (https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/l/libFLAME/).

    Ok, thank you very much.
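    A minimal sketch of how one might test along these lines, assuming the issue is tied to the OpenMP-threaded libsci and that the `CRAYBLAS_*_LEGACY` variables mentioned above simply need to be set (their exact accepted values are not documented here, so treat this as a starting point rather than a recipe):

    ```bash
    # Sketch: run with single-threaded BLAS and the legacy Cray BLAS kernels.
    export OMP_NUM_THREADS=1            # cray-libsci is OpenMP-threaded; one thread gives a serial BLAS path
    export CRAYBLAS_LEVEL1_LEGACY=1     # variables suggested above; exact values/semantics to be confirmed by HPE
    export CRAYBLAS_LEVEL2_LEGACY=1
    export CRAYBLAS_LEVEL3_LEGACY=1
    srun ./jdftx ...                    # hypothetical executable and arguments
    ```

    If the failures disappear with one thread but return with several, that points to the suspected threading/race issue rather than to the BLAS results themselves.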
2. I am a postdoc at Aalto University. I use Puhti at CSC and Triton at Aalto regularly for my work. I am here to understand how LUMI can help me in addition to the resources I have already been using. I would like to use Gaussian together with another Python package. This will need around 200 CPUs per simulation. I am currently preparing a proposal to AKA for the special call for LUMI usage. I would like to get some help regarding my proposal as well. So, my requirements are a bit general......

    (Emmanuel, LUST): Start with a CSC allocation (https://www.lumi-supercomputer.eu/get-started-2021/users-in-finland/) and LUST assistance (https://www.lumi-supercomputer.eu/user-support/need-help/).

    Molpro: Needs UCX for MPI, so prebuilt binaries will not work. Compiling from sources should be possible if you have access to the sources.

    ORCA: Seems to work with our Open MPI 4.1.6, but it will produce a warning when running on a single node.

3. Which build system can you advise for compiling static binaries, preferably with musl libc? The aim is to be able to build programs with multiple dependencies once and avoid messing with the installation of dependencies in the future.

    (Alfio, HPE): Some time ago I tried https://github.com/intoli/exodus...

    Thanks a lot!!!

### 2024-03-27 13:00 -- 14:00 CET

#### LUMI Updates

- Trainings
    - Advanced general 4-day LUMI course
        - 23.--26.4. in Espoo (Finland)
        - [Course page and registration](https://www.lumi-supercomputer.eu/events/comprehensive-general-lumi-course/)
    - _LUMI Intro_
        - 2.--3.5. at SURF in Amsterdam (The Netherlands)
    - _Getting started with AI on LUMI_ workshop
        - 29.--30.5. in Copenhagen (Denmark)

#### Breakout rooms

Join freely any Zoom breakout room you think is interesting for you

- Room 1: AMD
- Room 2: HPE
- Room 3: Getting started on LUMI (Join if you are new on LUMI and have questions)

#### Questions & Answers

1. Where can I get beginner-level information on LUMI? I'm comfortable on a local computer and server, but LUMI is new to me. I have an account but don't know where to begin to move my programs and data.

    - (Emmanuel) The LUMI documentation is available here: https://docs.lumi-supercomputer.eu/ The LUMI User Support Team also organizes LUMI intro courses (the next one is planned at SURF in Amsterdam on 2.--3.5. and will be available online as well). Please note that neither the documentation nor these events are meant for HPC beginners. Depending on where you are located, you probably have access to different HPC courses organised by your institution or competence centre.
    - (Kurt) Materials from past trainings are also available (or documented, if they are on the system itself but not downloadable due to restrictions) on the [LUMI Training Materials website](https://lumi-supercomputer.github.io/LUMI-training-materials/). This includes recordings of the presentations, slides, and in some cases additional notes.

2. The billing policy is not always clear to me. For example, if I use the option `#SBATCH --exclusive` or `#SBATCH --mem=0`, I am not sure how much it will cost. It would be very convenient to have a report at the end of each run giving the number of GPU-hours or core-hours used.

    - (Kurt) The reason why this is not currently done is that the actual billing is not done by Slurm but implemented by a script that runs periodically, as the actual "price" is computed from multiple Slurm parameters of the job and is not computed by Slurm itself. The billing policy is best understood from the principle "you are billed for all resources that others cannot use because of your job", which is only fair. E.g., if you ask for exclusive node use, either by using one of the "allocated per node" partitions or by adding `--exclusive` on one of the "allocatable by resource" partitions, it is only fair that you get billed for all resources of the node, i.e., 128 core-hours per hour for the CPU nodes or 4 GPU-hours per hour for the GPU nodes. The same holds when you request all memory on a node, even if you would only be using a single core or GPU, as you make the whole node unavailable for other users. The same principle also applies if you use a disproportionate amount of a certain resource compared to the other resources. E.g., if you use 50% of the cores or 50% of the CPU RAM in a GPU node yet request only 1 GCD (half of an MI250X GPU), you would still be charged for using half a GPU node, i.e., 2 GPU-hours per hour, as in essence 2 full MI250Xs can no longer be used efficiently, or without restrictions, by other users.
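    A sketch of how you can reconstruct this yourself after a job has finished, assuming the standard Slurm accounting fields are enough for your purpose; the authoritative numbers remain the ones deducted from the project, which the `lumi-allocations` helper reports per project (if it is available in your environment):

    ```bash
    # Sketch: what did a finished job allocate, and for how long?
    # The billed units then follow from the principle above, e.g. an exclusive
    # CPU-node job is billed 128 core-hours per hour of elapsed time.
    sacct -j <jobid> --format=JobID,Partition,Elapsed,AllocNodes,AllocTRES%80,State

    # Per-project consumption of CPU/GPU billing units:
    lumi-allocations
    ```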
3. It seems to me that the Keras function `fit` has a memory leak or increases memory usage with each epoch. Can I print the GPU memory usage after each epoch?

    - (Kurt) I don't think you can, as memory reporting with the HIP API is broken when using ROCm 5.6 (and likely also ROCm 5.5) on the 5.2 driver that we currently have on LUMI. I don't know if there is any way to get a reliable number.
    - (Sam, AMD) As Kurt says, the driver is a bit dated, which precludes the use of some APIs. You can try https://www.tensorflow.org/api_docs/python/tf/config/experimental/get_memory_usage to get your usage and see if you get sensible results. The GPU total memory info can be corrupted due to the driver incompatibility, but the "used" memory might still be reported correctly. Another option is to open an interactive session (`srun --jobid <yourjobid> --interactive --pty bash`) in which you can run `rocm-smi` while your job is running to see what the driver reports.

4. Is there any information related to the latency of launching kernels on the GPUs on LUMI-G? In our application (materials science, DFT), compared to running on NVIDIA GPUs, we found that the overall runtime of functions is slower on MI250X but the individual kernels (as shown in the tracing software: Perfetto & NSIGHT) are faster. I'm pasting a screenshot below in case someone has investigated this already. It was produced via rocprof, using a very simple code that launches empty kernels with different grid/block sizes, with user markers (shown in the bottom panel).

    ![](https://md.sigma2.no/uploads/2a492107-d073-4056-b3a9-d263b988d68e.png)

    - (Kurt) It is probably best to ask this question directly in the AMD room during the coffee break. We have also observed that starting kernels on AMD can be more expensive than on NVIDIA hardware. It is not clear how much of this is due to hardware limitations and how much can be improved through future drivers and ROCm versions (our driver is very old), and we have no precise data.
    - (Sam, AMD) Let's discuss this in the coffee break. I'd say that 10 µs to start a GPU kernel is kind of expected, so the moment kernel timings start to be under 10 µs, the latency starts being exposed. I don't have current numbers for what the expectations on NVIDIA should be.
    - (George, AMD) I do not see what is on the left; do you have more kernel calls? The first call to a kernel takes more time.

5. I mainly use Python. My batch script has

    ```
    module purge
    module use /appl/local/csc/modulefiles
    module load pytorch/2.0
    ```

    in it. Do I have to do this a second time?

    - (Kurt) You have to do this in every shell in which you want to use it. Since it is in a batch script, it is valid for the lifetime of the shell in the batch job. Your next batch job of course needs the same lines again in its job script. But note that you are using modules from a local software stack provided by CSC, so if there are problems with this version of PyTorch itself, support comes from CSC and not from LUST, as we are not at all involved in building those modules.

6. With Score-P v8.4 being released, which includes improved support for HPE Cray systems, we thought about contributing an updated EasyBuild recipe (with the respective EasyBlock) for users on LUMI. In general, Score-P v8.x might be interesting as it includes support for HIP. So far I've created easyconfigs for our dependencies (CubeLib, CubeWriter, OTF2, OPARI2) and am trying to get Score-P built as well. I've got a few questions though:
    - Is there interest in general to provide an updated version of Score-P in your environment, or at least accessible for user installations?
    - My current idea would be to provide configs for LUMI-L (Cray, GNU, AOCC). Is there any interest in Cray-amd / AMD?
    - Right now, CubeGUI (which is used to examine the profiles) is missing, since it requires Qt. Building Qt5 just for this application seems to be overkill. A workaround would be to offer the [CubeGUI AppImage](https://apps.fz-juelich.de/scalasca/releases/cube/4.8/dist/CubeGUI-4.8.2.AppImage) to users. Would this be a feasible solution?

    *Answer from LUST*
    - Any contribution is certainly welcome. It would go into the contributed repository, from which users can install software themselves, as we can neither guarantee timely updates when something on the system changes nor guarantee that we can actively support the package. It has not been done yet because of a lack of time on our side, given the number of people we have who can do this work.
    - Not sure about the second question, but given that it is an important tool in the EuroHPC ecosystem, where some projects want to focus on European tools, I expect it is only a matter of time before it will be requested.
    - Getting Qt5 to build and work correctly can be extremely hard on some systems, so we haven't embarked on that either; any other solution can be considered. We're now more and more containerising GUI software and making it available via the web interface.

7. I am a new user on LUMI. My supervisor added me to her grant and I have finished the whole registration process, but I do not know what my SSH key for joining the LUMI cluster is. Could you please guide me? Thanks.

    - The whole process is [documented extensively in the LUMI documentation](https://docs.lumi-supercomputer.eu/firststeps/). With the number of users on the system compared to the number of support people, we cannot guide everybody individually through the procedure but have to refer users to the documentation first (and to trainings in the [LUMI training archive](https://lumi-supercomputer.github.io/LUMI-training-materials/), although this specific topic is not treated in the courses).
    - You should update your profile at https://mms.myaccessid.org/profile/ with the public part of your SSH key.

8. Hi, I have a problem accessing my project. I can only access one project, and whenever I want to access it in VS Code, it asks me to type a password, which I don't understand at all.

    - This is a problem with the way you use VS Code on your PC and not an issue of LUMI. You need to correctly configure remote access in VS Code to use your SSH key.

    Sorry, but even on the LUMI platform I cannot see the project.

    - What do you mean? Open OnDemand? Are you even sure you are a member of the project? Do you see it in the output of the `groups` command on the command line, or in the output of `lumi-workspaces`? Also, you may have to start the VS Code app in the web interface in the correct project to see it by default.
    - Just checked; actually this is not needed, I can go to the directories of all my projects if I use VS Code in the right way in Open OnDemand.

9. I would be interested in a training on high-level deep learning Python libraries like Keras/TF. It would help to get the good practices regarding these libraries when used on LUMI (especially for medium-to-large size applications).

    - We will have an AI course at the end of May in Copenhagen but are still preparing that. We are not a support team for specific applications, so we cannot give you guidance for each and every application. But some common rules are:
      - Those packages should be used from containers, as the LUMI file system doesn't like lots of small files, and in particular if you are then doing distributed training it would slow down the file system for everybody.
      - Be careful with everything that requires communication. Whether you use something MPI-based or something RCCL-based, they have to use the libfabric library from the system if you want high performance, which basically means for MPI that you will likely need to build the container in a way that it uses the Cray MPICH implementation, and for RCCL that you need a plugin. This is also the reason why we offer some pre-built containers that are documented in the [LUMI Software Library](https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs), your starting point for all searches about a specific package.
      - Datasets should use appropriate formats and not be stored as millions of small files on the system, as again this puts a very high strain on the file system.

      We're working on some more documentation for PyTorch, as that is by far the most used AI package on LUMI, so we focus the few efforts that we can make ourselves in that domain on PyTorch first (being computer experts and not having domain experts for each and every science domain on LUMI in the LUST team).

### 2024-02-28 13:00 -- 14:00 CET

#### LUMI Updates

- Trainings
    - Advanced general 4-day LUMI course
        - 23.--26.4. in Espoo (Finland)
    - _Getting started with AI on LUMI_ workshop
        - 29.--30.5. in Copenhagen (Denmark)

#### Breakout rooms

Join freely any breakout room you think is interesting for you

- Room 1: AMD
- Room 2: HPE
- Room 3: Getting started on LUMI (Join if you are new on LUMI and have questions)

#### Questions & Answers

1. Slurm has been lagging consistently for me over the past few weeks; it takes a while to respond to commands. Has anybody else noticed this?

    - (Emmanuel) Can you be more specific? Which Slurm commands have been lagging? I'm trying to figure out whether it is a Slurm issue or a filesystem issue.
    - (OP) Actually I found out that it is not a Slurm issue but a Python script I started using to color the output of my `squeue` alias, applying different colors depending on job state, etc. Without the script it works just fine, sorry! Perhaps I will look for a Bash alternative, unless you can suggest an efficient way of invoking the script. The script works with pipe redirection; right now I use a command similar to `squeue --(options) | script.py`. It takes a few seconds each time on LUMI, while on another cluster it is basically instant.
    - (Kurt) Python on a big Lustre filesystem is not a good idea if your Python package has a lot of dependencies, as it will access lots of small files when starting (and maybe even when running, if for some reason Python decides not to cache code of dependencies of dependencies). That may explain why it is slower than on some other machines. Of course, the length of the queue on LUMI compared to that other machine can also be an issue. My experience is that Lua, combined with trying to put everything in as few files as possible, is the better scripting tool in such cases.
    - (OP) Did it with a cascade of greps, works like a charm.

2. My code keeps failing without a specific reason (process terminated) when I start large jobs with many nodes. The same code works fine with fewer nodes.

    - (Emmanuel) We would need more information from your side to be able to help you. Is it related to an existing ticket? Which application are you running? Batch script? Error file? I would recommend that you first open a ticket; you can then discuss it during the coffee break with the dedicated people from LUST, HPE, or AMD (depending on the origin of the issue).

3. I am running Quantum ESPRESSO and it is working perfectly, but some specific executables (projwfc.x) seem to be missing. Perhaps it is a small issue. Thanks for the support.

    - That is entirely normal. Quantum ESPRESSO can be built in many different configurations and, moreover, a decent installation manual is missing. We are in the first place computer specialists, and though many of us have a past in research, we cannot cover all research fields, nor all tools, not even the major ones, with the small team that we have. As a result, we often have to guess which build options are most relevant, unless we get enough information in the ticket that triggered the development of the EasyBuild recipe. So we look at installation tools that are used elsewhere, basically at the way CSCS builds on Cray (as they have a long tradition with Cray systems) and at the way EasyBuild and Spack do it, and build those configurations (in this case we mostly based our installation on the one at CSCS, as they have close ties with the EuroHPC Centre of Excellence that develops Quantum ESPRESSO). The tool you need is in a different module of Quantum ESPRESSO that is not built that way, so you will have to adapt the EasyBuild recipe to your specific needs. It looks like the only change you might need is to copy the EasyConfig `QuantumESPRESSO-7.2-cpeGNU-22.12.eb` and change the line

      ```
      buildopts = "all epw"
      ```

      into

      ```
      buildopts = "all epw pp"
      ```

      if the short manual on postprocessing tools that I found for an older version of Quantum ESPRESSO is still relevant. We make our full installation recipes available both on GitHub and via the [LUMI Software Library](https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/), so that users can actually check how a package is installed, what it contains and does not contain, and so that they can customise a build recipe to their needs without having to start from scratch.
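    For illustration, a minimal sketch of the user-level EasyBuild workflow that such a change would involve, assuming the usual `EasyBuild-user` setup described in the LUMI documentation (stack version, project number and the `--copy-ec` step are illustrative, not a verified recipe):

    ```bash
    # Sketch: rebuild Quantum ESPRESSO with the extra "pp" build target in your own space.
    export EBU_USER_PREFIX=/project/project_465XXXXXX/EasyBuild   # hypothetical project directory
    module load LUMI/22.12 partition/C EasyBuild-user

    # Take a local copy of the existing recipe and edit buildopts = "all epw" -> "all epw pp"
    eb --copy-ec QuantumESPRESSO-7.2-cpeGNU-22.12.eb .
    # ... edit ./QuantumESPRESSO-7.2-cpeGNU-22.12.eb ...
    eb ./QuantumESPRESSO-7.2-cpeGNU-22.12.eb -r
    ```

    After the build finishes, the new module should show up in the user software stack the next time you load `LUMI/22.12 partition/C` in a fresh shell.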
4. We are trying to run some simulations that write a lot of temporary data to disk. Since there are no locally mounted scratch disks on the individual nodes, and to reduce I/O overhead, we want to run the simulation in the memory of the individual nodes instead. Usually we would simply ssh into each node and create and execute the necessary files from there, but since LUMI does not support this, is there a better way of doing this? I was looking at allocating multiple nodes in a job and then periodically ssh'ing into the nodes.

    - See the [LUMI documentation](https://docs.lumi-supercomputer.eu/), on the ["Interactive jobs" page](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/interactive/), section ["Using `srun` to check running jobs"](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/interactive/#using-srun-to-check-running-jobs), for the replacement of using ssh to get onto a compute node. The example commands place the resulting shell inside the job, so it is restricted to the cores and memory allocated to the job and cannot eat into resources allocated to a different job on the node. Using `ssh` to go into the compute nodes is not supported because ssh is not job-aware. While Slurm does have a mechanism to ensure that you cannot ssh into a node where you have no running job, it cannot properly restrict the processes that you start in that shell to only your resources. And doing something different on the exclusive partitions and the shared partitions would only cause confusion.
    - I'm not sure if something like [mpiFileUtils](https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/m/mpiFileUtils/) can also be of help with your data management problems.

    Thank you for the answer, it seems to be the right approach. However, allocating the resources seems confusing to me. Let's say I have a script requiring 8 cores with 1 task each that I want to run on a specific node. If I run srun with only 1 task, the executed code cannot use multiple cores/tasks. If I run srun with N tasks, it appears to execute the code N times?
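    A minimal sketch of how these pieces can fit together, assuming the temporaries can go to the in-memory `/tmp` of the (diskless) compute nodes and that one multi-threaded process per node is what is wanted; note that what you write to `/tmp` counts against the memory available to the job on that node (program name, sizes and the project number are illustrative):

    ```bash
    #!/bin/bash
    #SBATCH --account=project_465XXXXXX   # hypothetical project
    #SBATCH --partition=small
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=8
    #SBATCH --mem=32G                     # room for the program plus what it writes to /tmp
    #SBATCH --time=01:00:00

    # srun starts the program once per task, so with --ntasks-per-node=1 a single
    # (e.g. multi-threaded) instance runs per node, each with 8 cores.
    srun --ntasks=${SLURM_NNODES} --cpus-per-task=${SLURM_CPUS_PER_TASK} \
        ./my_simulation --tmpdir /tmp     # hypothetical program and option
    ```

    While the job is running, an interactive shell inside the allocation (the supported replacement for ssh) can be opened from a login node with `srun --jobid <jobid> --interactive --pty bash`.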
5. A Singularity container with PyTorch specifically for LUMI has been provided by AMD (https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/). However, as far as I understand, it is not possible to install new packages into the container once it has been constructed? If so, what is the suggested workflow for using this 'recommended' container instead of building our own using the cotainr building tool which was presented last year? Or is the intention that these two resources should somehow be combined?

    - (Christian) Cotainr may be used to build a container from one of the base images also used in https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/, i.e. you can use one of the `lumi-rocm-rocm-X.Y.Z.sif` images under `/appl/local/containers/sif-images` on LUMI as a base image with cotainr - combining the two approaches you mention, as detailed in [this example](https://github.com/DeiC-HPC/cotainr/tree/lumi_sfantao_pytorch_example/examples/LUMI/sfantao_pytorch). Unfortunately, right now this example fails due to [a bug in Singularity](https://github.com/DeiC-HPC/cotainr/issues/52). We believe we have a fix for this, which will hopefully *soon* be implemented on LUMI. Until then, a workaround is to use the latest main-branch GitHub version of cotainr and use the tar archive of `lumi-rocm-rocm-X.Y.Z` as a base image, as detailed in that bug report.
    - (Christian) Alternatively, you may try to create an overlay on top of the existing PyTorch container from https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/ as detailed in https://md.sigma2.no/CBC_GEUCTCSvdvhpxAChkQ?view. Note that this involves installing pip packages into an existing conda environment that already contains pip packages, which is [discouraged by the conda developers](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#pip-in-env) - your mileage may vary. If using this approach, you might have problems with reproducibility, since I don't think you have any guarantee of being able to create the exact same conda/pip environment if you do it twice (with some time in between). You also have to make sure to keep track of the overlay(s) yourself - or consider [embedding them](https://docs.sylabs.io/guides/3.5/user-guide/persistent_overlays.html#overlay-embedded-in-sif).
    - (Christian) Finally, you may also use the container from https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/ as a bootstrap argument in a Singularity definition file, which you may use to build a custom container on your own hardware. This is not really something LUST can support, but it is an option if you are used to building Singularity containers yourself on your own hardware.
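    For the first option, a minimal sketch of what such a cotainr build could look like, assuming the mentioned Singularity bug has been fixed (or the tar-archive workaround from the bug report is used) and with illustrative image and package names - check `/appl/local/containers/sif-images` and the cotainr documentation for the exact names and flags:

    ```bash
    # Sketch: build a container on top of a LUMI ROCm base image and add your own
    # conda/pip packages on top of it.
    module load CrayEnv cotainr

    cat > py_env.yml << 'EOF'
    name: py_env
    channels:
      - conda-forge
    dependencies:
      - python=3.10
      - pip
      - pip:
        - my-extra-package        # hypothetical additional package
    EOF

    cotainr build my_container.sif \
        --base-image=/appl/local/containers/sif-images/lumi-rocm-rocm-X.Y.Z.sif \
        --conda-env=py_env.yml
    ```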
6. I'm having trouble getting PyTorch's nn.Conv1d to work with AMD ROCm. See https://pastebin.com/DRH8si8N for a minimal example that breaks when using ROCm but works fine on CPU. In general, I have not had issues with any other feature of PyTorch (using Python 3.12.2 and torch 2.2.0+rocm5.7).

    - ROCm 5.7 is known not to work fully on LUMI at the moment. In fact, even in 5.6 memory reporting was already broken on the driver we currently have on LUMI, and your error message points in the direction of trying to use precisely those functions. We can only give some guarantees up to ROCm 5.4. Any newer version of ROCm has some problems on LUMI currently, which will remain until after the next update. It is already certain that we will then be able to support newer versions of ROCm, but it is not yet clear how far we will be able to go (it looks like the driver we will get should be new enough to support ROCm 5.7). As there is still uncertainty about the availability of the software that we want to install in the next update, and as it will first be tested on a small test system, no firm date has been set for the next update.

7. I have a general question on the compiler installation/upgrade procedures followed by HPE. I am working on two machines (LUMI and another MI250X machine installed by HPE) for porting my code to AMD GPUs. I am using CCE 15.0.1 on both machines for compiling my Fortran+OpenMP code. On LUMI, my code compiles without any issues with cce/15.0.1, but on the other machine the same compiler version cce/15.0.1 shows some compile-time issues when it tries to compile some Fortran interfaces to C++ functions. In both cases, the source code and the compiler module (cce/15.0.1) are the same. I am wondering whether the two compiler modules can be different even when the major and minor versions are the same (as shown for the installed modules). Is it general practice to add patches to compiler modules to upgrade them while sometimes keeping the version numbering the same? In general, I am trying to understand how the same compiler module on two different machines can behave differently with the same source code. Is there a way for a general user to verify that compiler modules having the same version are exactly the same? Any ideas on this would help me advance with my debugging.

    - Other libraries on the system can be different and cause the problems. CCE 15 is supported on both SUSE 15 SP4 and SP5, but it may be packaged differently. There may also be differences between the modules that HPE distributes for SUSE and for Red Hat derived systems. We've also yet to see any recent CPE release without packaging errors somewhere, so it may even be just a packaging error on the HPE side and a different version of the installed package on your side due to a different OS, OS version or cluster management environment. So the CCE 15.0.1 modules for various systems will certainly be based on the same sources at HPE, but may be built and packaged differently for different OS variants and system management software, or may have different packaging errors. I've noticed there tend to be different release announcements even based on the system management software you're running (HPE currently has two environments), with different packages for both, so it may very well be that a packaging error does not impact our version (for CSM) but does impact the version for the other management environment. I don't know which machine you are talking about, but I know that we have a commercial package that already works on a cluster at KTH in Sweden but not on LUMI, even though these systems are extremely similar, and it is likely due to an interaction with some library outside (in that case) the MPI module that is different. Welcome to the joys of shared libraries, where an executable or a single library alone doesn't tell the whole story anymore... (Well, shared libraries have their benefits too, of course.) Incidentally, KTH is using the other system management environment, so the problem we observe with that package may entirely come down to library differences between the two environments. The question is way too vague, though, to give a more concrete suggestion of what could be going on, and since the problem is on "the other machine", it is really more of a support issue for their support team.

8. A very practical question regarding the AMD profiling tool rocprof. By running the command `srun -u -n1 rocprof --hip-trace --roctx-trace MY_EXEC ARGS` I can get a JSON file that contains HIP API calls, GPU kernel executions and my own marked regions, which is the reason `--roctx-trace` is used. The JSON file can be visualized with chrome://tracing/ in the Google Chrome browser. I can get a lot of information from the visualizer, but it seems that the gridSize and blockSize of the executed kernels are missing. Is there a way to get this information? For comparison, NVIDIA's corresponding trace visualizer, NSIGHT Systems, provides this information when inspecting a GPU kernel, as shown in the screenshot below:

    ![](https://md.sigma2.no/uploads/41669904-4c74-42f8-a4ad-f26a73ce3c79.png)

    To be very specific, I'm interested in getting the `grid: <<<x,y,z>>>` and `block: <<<a,b,c>>>`, but via rocprof. The kernel configuration parameters, such as the above, are determined automatically based on the sizes of the arrays.

9. When we use our own Singularity container built from the official PyTorch image, we cannot load the aws-ofi plugin.

    - Too vague to tell what the issue is. But it is entirely possible, e.g., that there is a conflict between the runtime libraries of the compilers used for the software in the container and the AWS OFI plugin if the latter is built with our EasyConfig and compilers on LUMI. Sometimes there is no other solution than to compile it by hand with the compilers in the container. It is because of these difficulties that we provide containers ourselves.

    We don't get any errors when we load the aws-ofi plugin, but our distributed training job is slow and we see from the NCCL info that the plugin is not loaded.
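    A minimal sketch of how one can make RCCL show which network backend it actually picked, assuming the usual NCCL/RCCL debug variables (exact log strings differ between versions, so treat the grep patterns as examples only):

    ```bash
    # Sketch: ask RCCL to log its network selection during the training job.
    export NCCL_DEBUG=INFO
    export NCCL_DEBUG_SUBSYS=INIT,NET                 # limit the output to init/network messages
    export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3     # restrict to the Slingshot interfaces

    srun ... singularity exec my_container.sif python train.py 2>&1 | tee nccl.log   # hypothetical job line

    # With the aws-ofi-rccl plugin active, the log should mention the libfabric/cxi provider:
    grep -i "NET/OFI" nccl.log
    # If only "NET/Socket" shows up, RCCL fell back to TCP sockets and the plugin was not picked up,
    # e.g. because librccl-net.so is not on the library path inside the container.
    ```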
10. When we use the AMD JAX containers, the GPUs don't appear and we can't import the jax library properly.

    - Again too vague to answer. How do you request the resources for the job and how do you start the container? The answer may be there instead. For example, you need to explicitly request GPU resources, otherwise they will not be visible to your job.

    None of the containers here seem to work: https://hub.docker.com/r/rocm/jax As for running, we have tried both an interactive session where we get a single node as well as sbatch. rocm-smi shows the presence of all GPUs inside the container. However, `jax.devices()` returns only CPU devices. With the exact same setup I do not face these issues with PyTorch containers. I am happy to provide any more specific details that can help with the debugging. (Tarun)

    - Oh, I misunderstood your question. You're using the Docker containers and not ours (which are still hidden in /appl/local/containers/sif-images, as we still need to do the necessary EasyBuild packaging to make them easier to use). We cannot support the Docker containers and they are not guaranteed to be compatible with LUMI - rather the contrary - which is why we provide our own containers for AI packages. They are also built for a different communication network. We can get help from the people who built the containers that we offer in `/appl/local/containers` if something does not work, but we are in no way involved in the way the Docker containers are built, so we really cannot tell at all whether they could run on LUMI, and why or why not.

11. How can we use a recent version of ROCm, like ROCm 6.0, in a Singularity container? We have tried to use recent ones, but it doesn't work.

    - (Peter) I refer to the answer to question 6 above. ROCm 6 will likely not work with our current drivers.

12. I would like to have a general comment on the current status of the AOCC compilers in comparison with the Cray CCE compilers. Is Cray CCE still the recommended one, or have the AMD compilers caught up in terms of OpenMP performance? On GPUs!

    - Is this for GPU or CPU?

    GPUs!

    - The Cray compiler on the system is based on a newer version of Clang and LLVM than either the AOCC or ROCm compilers that we currently have on the system. Certainly for CPU we still expect that to be the case after the next system update, as Cray is now rather closely tracking Clang and LLVM. For Fortran the story is different, as Cray has its own Fortran frontend; AOCC uses classic Flang, and my impression is that ROCm uses the new Flang. For GPU, the Cray compiler depends on some parts of ROCm, so the answer is less clear.

13. I wish to run an AI/ML inference pipeline that would use multiprocessing on the CPU side, with each CPU core bound to one GPU. How does one choose the Slurm and srun invocation for this (P.S. I use TensorFlow to load the ML model)?

    - Ask in the AMD room, but I guess it depends on how your application chooses the GPUs, and then the trick is to use a CPU bind mask with, as first element, a core in the CCD connected to the GPU that the software would choose for the first process, etc.
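    A minimal sketch of the kind of binding meant here, assuming one task per GCD and a core-to-GCD map taken from the examples in the LUMI documentation (the core numbers below should be double-checked against the current docs; the script name and project are illustrative):

    ```bash
    #!/bin/bash
    #SBATCH --account=project_465XXXXXX   # hypothetical project
    #SBATCH --partition=standard-g
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=8
    #SBATCH --gpus-per-node=8
    #SBATCH --time=00:30:00

    # One core per task, each core chosen from the CCD that is directly linked to
    # the GCD the task is meant to use (GCDs 0..7 in order).
    CPU_BIND="map_cpu:49,57,17,25,1,9,33,41"

    srun --cpu-bind=${CPU_BIND} python inference.py   # hypothetical inference script
    ```

    Each task then still has to select "its" GCD, e.g. by setting `ROCR_VISIBLE_DEVICES` from `SLURM_LOCALID` in a small wrapper script before starting the application.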
14. Related support ticket #3848: Is there any chance that newer Cray CPEs can be provided in some way? We're interested in testing fixes mentioned in the patch notes of CPE 23.12 related to the OpenMP Tools Interface, which is still broken with CPE 23.09 / CCE 16.0.6. Newer ROCm versions are quite interesting for us as well, but this was already answered in previous questions.

    - Please don't ask questions here that we are already answering in tickets. This is very confusing and double work for everybody. I cannot even give the answer here that we want to give, for legal reasons.

15. What is the recommended monitoring and process manager? I have tried to start htop, but I was not able to load an appropriate module. I have also tried nmon; it is not bad, but it does not show all 256 "processors" on the nodes, i.e. 128 physical cores + 128 hardware threads.

    - You need to use either CrayEnv or the LUMI stack before you can load the module that contains `htop`.

    Thank you, I use the following commands:

    ```
    #!/bin/bash
    #PARTITION L - large memory and login node
    module purge
    module load LUMI/23.09 partition/L
    module load craype-x86-rome
    module load libfabric/1.15.2.0
    module load craype-network-ofi
    module load perftools-base/23.09.0
    module load xpmem/2.5.2-2.4_3.50__gd0f7936.shasta
    module load cce/16.0.1
    module load craype/2.7.23
    module load cray-dsmml/0.2.2
    module load cray-mpich/8.1.27
    module load cray-libsci/23.09.1.1
    module load PrgEnv-cray/8.4.0
    module load GSL/2.7.1-cpeCray-23.09-sequential
    module load cray-libsci_acc/23.09.1.1
    module load cray-fftw/3.3.10.5
    module load libxc
    ```

    then `module load systools/23.09` as a dependency of htop, and then I try loading htop with `module load htop`. It emits the following message:

    ```
    Lmod has detected the following error: These module(s) or extension(s) exist but cannot
    be loaded as requested: "htop"
       Try: "module spider htop" to see how to load the module(s).
    ```

    - There is no module called `htop`. The `htop` command is provided by the `systools` module, and if you carefully read the output you get when you do `module spider htop`, that is precisely what it tells you. We don't want a separate module for every little command on LUMI, as that would create too many small files for the file system to deal with, so all those little tools are bundled in modules.
    - Also, you're loading way too many modules by hand. Everything from `craype-x86-rome` up to and including `PrgEnv-cray` is loaded again, or unloaded, as soon as you load that GSL module, as that one loads the `cpeCray/23.09` module that basically loads the programming environment as it was used to build the GSL module.

    Actually, what I want to achieve is to load exactly the same environment each time, in order to avoid breaking things after updates or whatsoever. Did I understand correctly that I reload some modules several times? How can I achieve the same environment in an optimal way?

    - To inspect thread pinning and process binding you can load the `lumi-CPEtools` module and execute the `hybrid_check` tool instead of your target executable; it will print the exact thread pinning and process binding.

    Thank you very much.

16. What is the recommended console manager? Is Midnight Commander installed? I use a self-built, fully static mc, but it cannot edit files, possibly due to missing NLS or other language files.

    - We do not use these tools ourselves, so we don't have a good recommendation. `mc` tends to be tricky to compile and install; we have tried it before. We have it installed in the latest Spack module. Try `module load spack/23.09` and then `module load mc/4.8.27-gcc-7o5`. Completely untested, though.

    Ok, thank you very much. It is much better than reinstalling it myself. I have tried

    ```
    export SPACK_USER_PREFIX=$PROJ/spack
    module load spack/23.09
    module load mc/4.8.27-gcc-7o5
    ```

    but there is no module `mc/4.8.27-gcc-7o5`. Should I install mc through Spack? Did I understand that correctly?

    - I checked myself, independently from the person who gave the previous part of the answer, and it works for me (at least on the login nodes, where I tried). Are you sure you didn't make a typo? What does `module avail mc` say after loading the `spack/23.09` module?

    When I try `module load spack/23.09` it emits this:

    ```
    Lmod has detected the following error: Please set $SPACK_USER_PREFIX to where you want
    to install Spack packages. We recommend using your project persistent storage for this.
    E.g. /project/project_<project-number>/spack
    While processing the following module(s):
        Module fullname  Module Filename
        ---------------  ---------------
        spack/23.09      /appl/lumi/modules/SoftwareStack/spack/23.09.lua
    ```

    Then I try `module avail mc` and it emits `No module(s) or extension(s) found!`. But `module spider mc` finds several modules:

    ```
    mc:
        Versions:
            mc/4.8.26-gcc-xu
            mc/4.8.27-gcc-gkn
            mc/4.8.27-gcc-z2a
            mc/4.8.28-gcc-fs
        Other possible modules matches:
            libxdmcp  libxdmcp-1.1.4-gcc-7.5.0-u4e2ln2  mc-4.8.27-gcc-7.5.0-gkngylg
            termcap  termcap-1.3.1-gcc-7.5.0-s4oijlb
    ```

    Finally I managed to load mc using these commands:

    ```
    export SPACK_USER_PREFIX=$PROJ/spack
    module load spack/23.03-2
    module load mc-4.8.27-gcc-7.5.0-gkngylg
    ```

    I have also tried spack/22.08 with mc/4.8.26-gcc-xu. Both preinstalled variants work well for editing; however, mouse clicking doesn't work. My statically built mc doesn't edit well, but mouse clicking works. Still some room for improvement. Thank you very much for your hints, they were really helpful.

### 2024-01-31 13:00 -- 14:00 CET

#### LUMI Updates

- LUMI intro course: Thursday, 8.2.2024, online
    - 1-day course about using LUMI effectively. Requires some HPC experience.
    - Info and registration: https://www.lumi-supercomputer.eu/events/lumi-intro-course-feb08/

#### Presentation (15 min)

- [HyperQueue](https://it4innovations.github.io/hyperqueue/stable/): A tool to simplify the execution of large workflows on HPC clusters by allowing you to execute a large number of tasks without having to manually submit jobs to Slurm.

#### Breakout rooms

Join freely any breakout room you think is interesting for you

- Room 1: AMD
- Room 2: HPE
- Room 3: Getting started on LUMI (Join if you are new on LUMI and have questions)
- Room 4: HyperQueue

#### Questions & Answers

1. Any progress regarding how to work with sensitive data on LUMI?

    - "An architecture draft will be reviewed in week 5. Implementation is planned for spring/summer 2024."

2. I am porting a code to run on LUMI-G and encountered a strange data transfer issue (CPU-GPU) which I can't understand. The code calls `hiprandGenerateUniformDouble` at this point, and out of 8 MPI processes only rank 0 is able to show the device-generated random numbers on the host after updating them from the device. The rest of the ranks fail (memory access fault message with the CRAY_ACC_DEBUG variable) while updating the data back to the host from their respective devices. The data transfer is managed by OpenMP pragmas. I have verified (with `omp_get_default_device()` and `hipGetDevice()`) that all MPI ranks are indeed running on their own devices. I have a short test ready to quickly go through the issue. Would it be possible for someone to have a look at this issue with me during this coffee break session? Thanks.

    - It is not obvious to me what might be causing this. What comes to mind is a mismatch between the device IDs set in the OpenMP runtime and in the hipRAND handler. To narrow down the search space, I'd make sure that each rank only sees a single GPU with ROCR_VISIBLE_DEVICES. For instance, one can use ROCR_VISIBLE_DEVICES=$SLURM_LOCALID. I'll (Sam from AMD) be in the coffee break and we can take a closer look.
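    A minimal sketch of that suggestion, following the wrapper pattern used in the LUMI documentation (binary name and task counts are illustrative):

    ```bash
    # Sketch: a wrapper so that every MPI rank only sees its own GCD.
    cat << 'EOF' > select_gpu
    #!/bin/bash
    export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID   # node-local rank 0..7 picks one GCD
    exec "$@"
    EOF
    chmod +x select_gpu

    srun --ntasks-per-node=8 --gpus-per-node=8 ./select_gpu ./my_hip_openmp_app   # hypothetical binary
    ```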
3. We have recently undertaken the task of porting URANOS, a computational fluid dynamics code, to AMD GPUs. While the code uses the OpenACC standard, it was predominantly optimized for NVIDIA machines, so we have encountered some performance challenges on AMD cards. We are reaching out to inquire whether there are individuals within the LUMI staff who can share some knowledge about optimizing code performance specifically for AMD GPUs. We would greatly appreciate any assistance or guidance.

    - We may need HPE people to also chime in here. My experience with OpenACC comes from very flat codes. Here, the performance implications are a mix of runtime overheads and kernel performance. The former can be assessed with a trace of the GPU activity, and the latter with a comparison of kernel execution times with other vendors. I've seen the Cray compiler's OpenACC runtime being a bit conservative in how it controls dependencies, with some redundant runtime calls that can be lifted. Other issues might come from register pressure and some device-specific tuning (loop tiling, for example). The register pressure is connected to the setting of launch bounds - unfortunately, setting the number of threads is not sufficient and a thread limit clause needs to be used instead. Tiling requires changing the code a bit. We can discuss further during the coffee break.

4. We are trying to understand why we don't get the performance we expect from the GPUs on LUMI-G, but our software is too complicated to trace itself. So I'm looking for much simpler examples to measure individual functionalities, such as data transfers, FFTs, bandwidth, etc. Is there a repository of simple to complex examples for GPU execution on LUMI-G?

    - Not sure if it will cover everything needed, but AMD has some examples used for training: https://github.com/amd/HPCTrainingExamples. There are also the AMD lab notes blog posts that can help with some trimmed-down examples: https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-readme/. These are not really benchmarks and are not meant for performance assessment, but they could be helpful for testing along those lines.

5. How does HyperQueue compare to other workflow managers such as Nomad (by HashiCorp)?

:::info
End of archive
:::