This is the archive document of the past sessions of the public LUMI coffee break.
The archived questions and answers of 2024 sessions can be found here: https://hackmd.io/@lust/coffeearchive2024
The archived questions and answers of 2023 sessions can be found here: https://hackmd.io/@hpc/coffeearchive2023
The archived questions and answers of 2022 sessions can be found here: https://hackmd.io/@hpc/coffeearchive2022
Link to document for the next coffee break: https://md.sigma2.no/lust-coffee-break
Presentation by Gregor Decristoforo about the LUMI AI guide
I use the PyTorch module and create a virtual environment on top of it for additional packages. The virtual environment contains around 1500 files. Is this a large enough number to cause slow execution of my code or strain on the Lustre file system? If it does affect speed, is the effect limited to the module-importing phase? Once training begins, should we expect normal performance?
When I create an interactive session for VSCode through the web browser, I load the pytorch/2.4 module. However, when I try to select the Python interpreter in the VSCode session, I encounter an "invalid Python interpreter" error. Ticket number is LUMI #5981.
I'm in the optimization stage of my project and found that my `rocblas_<T>gemm` routines deliver poor performance on a single die of an MI250X compared to how the same code performs on an NVIDIA A100 GPU. To my knowledge, LUMI's GPUs should surpass NVIDIA's by at least a factor of 2x, especially in FP64 operations.
A close examination using rocblas-bench with square matrix multiplication has revealed inconsistent behavior in the GEMM routines, for example the `rocblas_dgemm` routine:
In this figure, two performance behaviors are found: the highest performance is obtained with arrays filled with integer values randomly picked from a limited range, and a lower performance when the data is initialized with values generated from trigonometric functions. In other words, a performance degradation is observed on `dgemm` with the only difference being the array initialization.
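For reference, a minimal sketch of the kind of `rocblas-bench` comparison described above, assuming `rocblas-bench` is available via the `rocm` module; the matrix sizes are placeholders and the flag names follow the rocBLAS benchmark client documentation (`--initialization` selects the data-initialization scheme) and should be checked against the installed version:

```bash
# Load ROCm so that rocblas-bench is on the PATH (exact module/version may differ)
module load rocm

# DGEMM with matrices filled with random integer values (the fast case)
rocblas-bench -f gemm -r f64_r -m 8192 -n 8192 -k 8192 \
              --transposeA N --transposeB N --initialization rand_int

# The same DGEMM, but initialized with trigonometric values (the slow case)
rocblas-bench -f gemm -r f64_r -m 8192 -n 8192 -k 8192 \
              --transposeA N --transposeB N --initialization trig_float
```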
Q1: Is this the expected behavior? And if not, how can we get maximum performance with arbitrary input data?
LUMI support and the ROCm/Tensile project on GitHub (see issue #2118).
These questions are outside of what LUST can do: we are L1-L2 support, not L3. Discuss this directly with AMD, who are usually present in the coffee break, in a break-out room. The hackathons also serve to help with optimizing code.
But your idea about the speed of an MI250X die is wrong. There is more to performance than the peak numbers: there is also the cost of starting up a kernel, and the memory bandwidth isn't higher than on an A100. So in practice we're happy if a single die can keep up with an A100.
And if you really want to understand what's happening, you should not speculate about dynamic power management (which shouldn't have a huge influence, as we're talking about relatively small variations in clock speed), but use proper profiling to see where the code gets stuck. The performance degradation may come from a totally different source than you expect…
WarpX with GPU-aware MPI: out-of-memory (OOM) on the host.
I have difficulties installing a conda environment (flash-attention 2) for training LLMs. I wonder if there is a hands-on video about the installation process. The LUMI environment is different from the systems at CSC.
Directly, with you taking care of all bindings and all necessary environment variables, and it looks very convenient. Now I have access to the container and there seems to be flash-attn2 in it. I (Mehti, Aalto University) am getting an error related to a missing OS library called libgthread. It is related to the opencv-python library (using the headless version didn't help) and needs to be present at the system level. Users can't add this library themselves because of permission restrictions. I also discussed this with the Triton support at my university; they told me that you should install this library. Using a different container version didn't help (I followed the guideline here: https://github.com/mskhan793/LUMI-AI-Guide10032025/blob/main/setting-up-environment/README.md#installing-additional-python-packages-in-a-container). Some useful links that refer to this library:
Have you tried an unprivileged proot build? That way you can extend a prebuilt container with system libraries. Check out this section of our training course: https://lumi-supercomputer.github.io/LUMI-training-materials/2day-20240502/09_Containers/#extending-the-container-with-cotainr
Please take our trainings, where we explain why we cannot install everything on the system: we (a) are a large and scalable HPC infrastructure, not a workstation, and (b) have to serve multiple users with conflicting expectations. Such things should be done in containers, using the "unprivileged proot build" process, and we have tutorials and demos about it in our latest training materials. Have a look at the software stack talk from our latest course.
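As a minimal sketch of the "unprivileged proot build" process mentioned above, assuming proot is made available through the `systools` module (as in the linked training materials) and that the base container uses a Debian/Ubuntu-style package manager (swap `apt-get` for `zypper` or `dnf` otherwise); all file and project names are placeholders:

```bash
# proot enables unprivileged singularity builds; it is provided by the systools module
module load CrayEnv systools

# Definition file that extends an existing container image with a system library
cat > extend.def << 'EOF'
Bootstrap: localimage
From: /project/project_XXXXXXXXX/containers/my-pytorch-container.sif

%post
    # libgthread is part of glib, so install the glib runtime package
    apt-get update && apt-get install -y libglib2.0-0
EOF

# Build the extended image; singularity falls back to proot when run unprivileged
singularity build my-pytorch-container-extended.sif extend.def
```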
Is there an imagenet-1k dataset on the LUMI machine? I know that there is a mini-dataset, but I wasn't able to find the full imagenet-1k dataset with 1000 labels.
- `salloc`, then `gdb4hpc`: no rocm-device found ❌
- `salloc`, then `rocgdb`: no rocm-device found ❌
- `salloc`, then `srun --interactive --pty bash`, then `gdb4hpc`: no rocm-device found ❌
- `salloc`, then `srun --interactive --pty bash`, then `rocgdb`: this one works! ✅ (see the sketch below)
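A minimal sketch of the combination that worked, assuming an interactive allocation on a GPU partition; the partition, account, time limit and binary name are placeholders. The likely reason the other combinations fail is that `salloc` alone leaves you with a shell on the login node, where no GPU is visible, whereas `srun --interactive --pty bash` gives you a shell on the allocated compute node:

```bash
# Request an interactive GPU allocation (placeholder partition/account/time)
salloc --partition=dev-g --nodes=1 --gpus-per-node=1 --time=00:30:00 --account=project_XXXXXXXXX

# Get a shell on the allocated compute node, where the GPU is actually visible
srun --interactive --pty bash

# rocgdb ships with ROCm
module load rocm
rocgdb ./my_gpu_app
```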
Trainings

Maintenance breaks
FATAL: container creation failed: mount /proc/self/fd/7->/var/lib/singularity/mnt/session/data-images/0 error: while mounting image /proc/self/fd/7: failed to find loop device: could not attach image file to loop device: no loop devices available
(probably as described here: https://github.com/sylabs/singularity/issues/1499. It happened a few times without any obvious reason, on both the small and standard CPU-only partitions.)
python3: preload-me.cpp:64: int hipMemGetInfo(size_t *, size_t *): Assertion `count == 8 && "The hipMemGetInfo implementation is assuming the 8 GPUs in a LUMI node are visible to each process"' failed. I am getting this when trying to run a torch distributed job on 4 out of 8 GPUs. If I run on all 8 GPUs, there is no issue.
I need guidance on collecting MPI statistics for an MPI+HIP application, and information about the availability of any newer ROCm versions with containers; I am currently using 6.0.3.
A question related to `rocprof`. I used to use it on LUMI-G before the major stack upgrade (summer 2024). After ROCm was updated, I never managed to get useful traces. More specifically, I was using:
`srun -u -n1 rocprof --hip-trace --roctx-trace ./my_exec`
Then I found that there is a new version, rocprofv3 (https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/how-to/using-rocprofv3.html), but it is only available from rocm/6.2.2, which, according to the message shown after `module load rocm/6.2.2`, is not installed "properly".
My question is: has anyone managed to find a standard way to make `rocprof` work with roctx traces in a non-hacky way?
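For comparison, a minimal sketch of what the equivalent rocprofv3 invocation could look like, assuming `rocm/6.2.2` is loaded; the flag names (`--hip-trace`, `--marker-trace` for roctx markers) and the `--` separator before the application are assumptions based on the linked documentation and should be verified against the installed version:

```bash
# Load the newer ROCm that ships rocprofv3 (reported above as not installed "properly")
module load rocm/6.2.2

# HIP API trace plus roctx marker trace of the application
srun -u -n1 rocprofv3 --hip-trace --marker-trace -- ./my_exec
```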
Hello, I don't know if this is the right place to ask, but when I try to connect to my LUMI project with MyAccessID I get the message below, while other people from my lab did not have this issue:
The digital identity provided by your organization when you log in to this service does not meet the required levels for assuring your identity. In the near future, this will become a mandatory requirement for you to continue to access LUMI HPC resources. We are working on solutions for the level of assurance of your identity, and to speed up this process it would be helpful if you could send the information below to the indicated IT support for your organization. If your organization cannot meet the requirements, you will be provided with an alternative to assure your identity and continued access. If so, this will be communicated to you in good time before the requirement is in effect.
LUMI is participating in a research project with the HPC group of Prof. Florina Ciorba at the Department of Mathematics and Computer Science of the University of Basel. This work will contribute to the PhD work of Thomas Jakobsche focusing on Monitoring and Operational Data Analytics (MODA) and how to better characterize user workloads and avoid inefficient resource usage on HPC systems.
Part of his work is to develop tooling that records some metadata from the applications being run, such as which modules are loaded and which binaries are executed, with the goal of helping the LUMI user support team detect misconfigurations and inefficient resource usage. The tooling should be entirely transparent to any user codes and have a negligible impact on application runtime. We have thus far tested these tools internally and are now at a state where we want to try them out with a broader scope of real user codes, both to further confirm that they do not cause any issues with user codes and to gauge how well we can characterize actual user workloads. Currently the tools do not work with anything that is containerized, meaning that if you run mainly containerized applications you cannot participate.
Participating in the testing is really simple: you load a module and run whatever you normally run with the module loaded. You can contribute just a few runs or keep the module loaded for longer periods. If the module causes any issues with the programs you run, or a significant performance impact, please reach out to the LUMI service desk with a description of what you ran and how you ran it.
To load the module on LUMI use:
or
Submitting jobs with the module loaded or loading it inside the batch job should both have the same result. In practice, what the module does is add the fingerprinting library to any applications you run; once you unload the module, the library will no longer be added to any new runs.
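A minimal sketch of the second option, loading the module inside the batch job; the partition, account and application are placeholders, and `lumi-fingerprint` is a hypothetical module name standing in for the actual fingerprinting module:

```bash
#!/bin/bash
#SBATCH --account=project_XXXXXXXXX   # placeholder project
#SBATCH --partition=small             # placeholder partition
#SBATCH --time=00:30:00

# Hypothetical module name; replace it with the actual fingerprinting module
module load lumi-fingerprint

# Run whatever you normally run; the fingerprinting library is added automatically
srun ./my_application
```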
We appreciate any runs you can do with the module loaded, from simple single binaries to more complex workflows. If you have further questions please contact the LUMI service desk.
EPICURE (members of EuroHPC JU projects only!)
Trainings
Is there a tool on LUMI to generate a report of a job, showing CPU/GPU usage, like LLView at the Jülich Supercomputing Centre? Or is the "Application fingerprinting" above something like this?
There is no graphical tool, but much of the data, except the consumed billing units, is available from the `sacct` and `sreport` commands in Slurm. The tool you mention seems to show mostly data monitored by Slurm with some additional data.
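As a minimal sketch of the kind of information Slurm's accounting can give you (the job ID and project name are placeholders):

```bash
# Per-job and per-step resource usage recorded by Slurm accounting
sacct -j 1234567 --format=JobID,JobName,Partition,AllocCPUS,Elapsed,TotalCPU,MaxRSS,State

# Aggregated usage of a project (account) over a time window
sreport cluster AccountUtilizationByUser Accounts=project_XXXXXXXXX Start=2025-01-01 End=2025-02-01
```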
The billing formula is too complex to compute with Slurm and is not evaluated on LUMI itself but on a separate CSC server, after which the information on whether the project can keep access to the queues is fed back to LUMI.
If you want very detailed behaviour info of a job, you need to use profilers etc.
Application fingerprinting is something very different. It is a tool to give information to the cluster managers on which applications are used and not a tool to study in detail how each application or job step behaves.
It seems the mail-related features in sbatch scripts do not work. Is there a plan to implement this in the future?
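For reference, these are the standard Slurm mail directives the question presumably refers to (the original post does not show the script, so the address is a placeholder); on a cluster without a configured mail program these options have no effect:

```bash
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --time=00:10:00
# Ask Slurm to send e-mail when the job ends or fails
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your.name@example.org   # placeholder address

srun ./my_application
```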
End of archive