# Notes JG

## 2024-07-11

- Manta tool: https://github.com/eth-cscs/manta
- Nowadays distributed data and HPC capabilities
- Multi-tenancy (IaaS)
- Manta allows for upscaling and downscaling clusters
- Ansible playbooks are also used for Alps
- SCITAS from EPFL is working on public cloud resources, e.g. vCluster
- WeatherGenerator led by ECMWF -> AI-based weather forecast models; TerraDT -> digital twin for the Earth's cryosphere
- Podman container technology
- GP: Make sure that we support SimFS (understand what it does first; Python driver for ICON)
- WP10: GP: See if this can be integrated with C2SM (the more we integrate, the higher the chances we will be included)
- WAN circuit; SWITCH provides an L2 service
- uenv vs. Sarus containers
  - The first use case of containers is not to bring your own tech stack, but isolation.
  - `uenv`, instead, is really meant for somebody to generate their own OS / config / tech stack
  - ICON is very hard to containerize (the build is currently very brittle); C2SM and ECMWF are using `uenv` for that instead

### Breakout room

- Dry-run feature for our W&C workflow PoC
  - Submit longer-running shell jobs, which are also compute-heavy, to check actual usage on the cluster
  - And / or via FirecREST on Daint
- ICON from [[Michael Jaehn]] is on Daint
- In August, FirecREST should be working on the W&C cluster
- 3 clusters on Alps: one for W&C, one for ML, one for …
- If we don't have access to ICON by the end of August, we need somebody else to help us out
- They have ARM64 processors, so you have to build your container using ARM64 binaries rather than the x86 micro-architecture
- CUDA is very performance-portable
- Podman is being used on Alps
- Will Sawyer from Exclaim should be the person to reach out to -> Rico will reach out to him
- Pulling an image is easy -> 1 line; making a runnable instance out of it is not easy -> one needs to mount certain volumes, activate hooks, etc.
- "Containers as first-class elements": https://confluence.cscs.ch/display/KB/Container+Engine
- From an imperative (bash commands) to a declarative paradigm ([environment definition file (EDF)](https://confluence.cscs.ch/display/KB/Container+Engine) -> based on TOML)
  - `[srun|sbatch] --environment=your_environment.toml --pty bash`
- [FourCastNet](https://docs.nvidia.com/deeplearning/modulus/modulus-sym/user_guide/neural_operators/fourcastnet.html): AI model for W&C: https://github.com/NVlabs/FourCastNet
- JFrog
- From [Elsa Germann](https://www.psi.ch/de/awi/people/elsa-sylvia-germann): people to contact for the ICON build / image: Jonas Jucker, Dominic Hofer
- https://hub.docker.com/r/apache/spark
- Connecting to the compute nodes through a jump via ela will still always be possible
- So far, monitoring via AiiDA has been running on the login node
- For monitoring, a general solution via FirecREST: streaming of reduced data (e.g. a subset of a big output data file) via some binary provided by the user (see the sketch after this list)
- In situ visualization with FirecREST?
  - https://www.paraview.org
- In FirecREST, one can run whatever they want on the compute node through the run script anyway
- With `srun --overlap`, one can run another job directly on the same node within the same resource allocation
- DM4: Streaming
  - Use cases:
    - EXCLAIM workflow where they obtain historical weather data, which is used for a PyTorch workflow (AI/ML production) -> Anurag is the relevant contact person
    - MeteoSwiss accessing production data
  - FDB: https://github.com/ecmwf/fdb
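A minimal sketch of what the FirecREST-based monitoring above could look like from the client side, using the pyfirecrest Python package: submit the compute-heavy job, poll it, and pull back only the reduced subset. Method names follow pyfirecrest v1, but exact signatures vary between versions, and the URLs, machine name, credentials, and paths are all placeholders.

```python
# Sketch of FirecREST-based monitoring (pyfirecrest v1 method names;
# signatures vary across versions; URLs, machine name, credentials,
# and paths below are placeholders).
import time

import firecrest as f7t

auth = f7t.ClientCredentialsAuth(
    client_id="my-client",
    client_secret="my-secret",
    token_uri="https://auth.cscs.ch/.../token",  # placeholder
)
client = f7t.Firecrest(
    firecrest_url="https://firecrest.cscs.ch",  # placeholder
    authorization=auth,
)

# Submit the compute-heavy batch script (e.g. the dry-run shell job).
job = client.submit("daint", script_local_path="dry_run_job.sh")

# Poll until the job has left the queue.
while client.poll("daint", jobs=[job["jobid"]])[0]["state"] in ("PENDING", "RUNNING"):
    time.sleep(30)

# Fetch only the reduced subset written by the user-provided binary,
# instead of the full output file.
client.simple_download(
    "daint",
    source_path="/scratch/.../reduced_subset.nc",  # placeholder path
    target_path="reduced_subset.nc",
)
```

Polling and downloading a reduced file this way would keep the monitoring on the client side, instead of running it on the login node as AiiDA does so far.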
## 2024-07-12

- DM4: Simple ICON workflow where input files come from an API / utility (Polytope) that retrieves large data from tape; data taken from ECMWF in Bologna (see the sketch at the end of these notes)
  - Write to MeteoSwiss: how do we fetch data via Polytope; which command, which API?
  - With Alex -> demonstrator to get data from them; how do they actually use that in a workflow?
  - Data access part: do we need some credentials, etc.? Security concerns
  - Call a job that uses their API / utility to fetch the data, and then use that in our workflow. So far a dummy workflow; as a next step it could later also be used in the actual ICON workflow
  - How can AiiDA use the large-scale, fast incoming data?
    - First step: getting data to disk, rather than live streaming
  - First the slow API from ECMWF. Data coming from Alps2 to Alps (part of DM4; Jerome is the point of contact)
- See if `FLEXPART` workflows can also be run via the new YAML syntax from [[Matthieu Leclair]] and our WorkGraph-based engine
- CSCS and `autosubmit` will get funding for a workflow project soon -> interest in getting a collaboration going
  - Understand the `autosubmit` syntax to check if we can parse it into our YAML format with Matthieu
  - Summarize all the meetings with `autosubmit`, `ecFlow`, etc.
- Rico's PoC doesn't work for the second run, not because it doesn't work with the restart_file, but because it doesn't find some other file
- The next all-hands meeting should be around December
- The data transfer issue (Jerome) is different from the tape issue -> two different concerns
  - Use case 1: fetch large global data to run a smaller simulation
  - Use case 2: obtain historical data from tape -> e.g. for training ML models
  - Jerome: main person for the tape API; Kristian just uses the API to obtain the data
- Person from Exclaim to give a top-level overview of what they do with the data -> how to get, e.g., 1 GB rather than 1 TB of useful data and put it on scratch
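For the first step of the DM4 use case (getting ECMWF data to disk rather than live streaming), a minimal sketch using the polytope-client Python package; the service address, collection name, and MARS-style request keys are assumptions to be confirmed with MeteoSwiss / ECMWF.

```python
# Sketch: fetch ECMWF data to disk via Polytope before feeding it into
# the (dummy, later ICON) workflow. Assumes the polytope-client Python
# package; the address, collection name, and MARS-style request below
# are placeholders to confirm with MeteoSwiss / ECMWF.
from polytope.api import Client

# Credentials are normally picked up from the Polytope config file or
# environment variables; adjust to whatever ECMWF provides for us.
client = Client(address="polytope.ecmwf.int")

# MARS-style request describing which fields to pull from the archive/tape.
request = {
    "class": "od",
    "stream": "oper",
    "type": "an",
    "date": "20240701",
    "time": "00",
    "levtype": "sfc",
    "param": "165.128/166.128",  # e.g. 10 m u/v wind components
}

# Blocks until the retrieval finishes and the GRIB file is on local disk;
# the file can then be registered as an input of the AiiDA / WorkGraph run.
client.retrieve("ecmwf-mars", request, "input_data.grib")
```

Retrieval-to-disk first keeps the dummy workflow simple; the streaming part of DM4 could replace the download step later.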