# Notes JG
## 2024-07-11
- Manta tool https://github.com/eth-cscs/manta
- Nowadays: distributed data combined with HPC capabilities
- Multi-tenancy (IaaS)
- Manta allows for upscaling and downscaling clusters
- Ansible playbooks also used for Alps
- SCITAS from EPFL working on public cloud resources, e.g. vCluster
- WeatherGenerator led by ECMWF -> AI-based weather forecast models; TerraDT -> digital twin for the Earth's cryosphere
- Podman container technology
- GP: Make sure that we support SimFS (understand what it does first; Python driver for ICON)
- WP10: GP: See if this can be integrated with C2SM (the more we integrate, the higher the chances we will be included)
- WAN circuit; SWITCH provides L2 service
- uenv vs. Sarus containers
- First use case of containers is not to bring your own tech stack, but isolation.
- `uenv`, in contrast, is really meant for somebody to generate their own OS / config / tech stack
- ICON is very hard to containerize (builds are currently very brittle); C2SM and ECMWF are using `uenv` for it instead
### Breakout room
- Dry-run feature for our W&C workflow PoC
- Submit longer-running, compute-heavy shell jobs to check actual usage on the cluster (see the sketch below)
- And / or via FirecREST on Daint
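- A minimal sketch of such a throwaway, compute-heavy probe job (assuming a generic Slurm setup; job name, sizes, and the availability of `stress-ng` are placeholders/assumptions):
```bash
#!/bin/bash
#SBATCH --job-name=usage-probe
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4

# Burn CPU for ~20 minutes so the job shows up as real load on the cluster;
# stress-ng is assumed to be installed (a shell busy loop works as a fallback)
srun stress-ng --cpu "$SLURM_NTASKS" --timeout 20m
```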
- ICON from [[Michael Jaehn]] is on Daint
- In August FirecREST should be working on W&C cluster
- 3 clusters on Alps: one for W&C, one for ML, one for …
- If we don't have access to ICON by the end of August, we need somebody else to help us out
- They have ARM64 processors, so you have to build your container from ARM64 binaries rather than for the x86 microarchitecture
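- A hedged sketch of cross-building an ARM64 image from an x86 machine (assumes Docker with `buildx` and QEMU binfmt emulation; registry and tag are placeholders):
```bash
# One-time: register QEMU emulators so arm64 build steps can run on the x86 host
docker run --privileged --rm tonistiigi/binfmt --install arm64

# Cross-build the image for linux/arm64 and push it to a registry
docker buildx build --platform linux/arm64 \
  -t registry.example.com/icon-arm64:latest --push .
```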
- CUDA is very performance-portable
- Podman is being used on Alps
- Will Sawyer from EXCLAIM should be the person to reach out to -> Rico will reach out to him
- Pulling image is easy -> 1 line; making a runnable instance out of it is not easy -> one needs to mount certain volumes, activate hooks, etc.
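- To illustrate the gap, with Podman (since that is what Alps uses) and the `apache/spark` image linked further down; mount paths and env vars are illustrative assumptions:
```bash
# Pulling: one line
podman pull docker.io/apache/spark

# Making a useful runnable instance: volumes, env vars, workdir, ... all explicit
podman run --rm -it \
  -v /scratch/$USER:/scratch \
  -e SPARK_LOCAL_DIRS=/scratch \
  -w /opt/spark \
  docker.io/apache/spark /bin/bash
```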
- "Containers as first-class elements": https://confluence.cscs.ch/display/KB/Container+Engine
- From imperative (bash commands) to declarative ([environment definition file (EDF)](https://confluence.cscs.ch/display/KB/Container+Engine) -> based on TOML) paradigm
- `srun --environment=your_environment.toml --pty bash` (the `--environment` flag works with `sbatch` as well; `--pty` is `srun`-only)
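- A rough example of the declarative flow (EDF field names recalled from the CSCS KB page linked above; verify against the docs before use):
```bash
# Declare the environment once in a TOML EDF ...
cat > your_environment.toml <<'EOF'
image = "docker.io/apache/spark:latest"
mounts = ["/scratch:/scratch"]
workdir = "/scratch"
EOF

# ... then hand it to Slurm instead of scripting container calls imperatively
srun --environment=your_environment.toml --pty bash
```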
- [FourCastNet](https://docs.nvidia.com/deeplearning/modulus/modulus-sym/user_guide/neural_operators/fourcastnet.html): AI model for W&C https://github.com/NVlabs/FourCastNet
- JFrog
- From [Elsa Germann](https://www.psi.ch/de/awi/people/elsa-sylvia-germann) people to contact for ICON build / image: Jonas Jucker, Dominic Hofer
- https://hub.docker.com/r/apache/spark
- Connecting to the compute nodes through a jump via ela will still always be possible
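- E.g. (username and target host are placeholders; `ela.cscs.ch` is the CSCS jump host):
```bash
# Jump through ela to reach a machine on the inside
ssh -J username@ela.cscs.ch username@daint.cscs.ch
```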
- So far, monitoring via AiiDA has been running on the login node
- For monitoring, a general solution via FirecREST: streaming of reduced data (e.g. a subset of a big output file) produced by some user-provided binary
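- A sketch of what such a user-provided reduction binary could be (assumes NetCDF output and NCO's `ncks` on the system; variable and file names are hypothetical):
```bash
#!/bin/bash
# Cut a small, streamable subset out of the big output file:
# one variable, first 10 time steps
ncks -v temperature -d time,0,9 big_output.nc subset.nc

# Write the reduced data to stdout so the caller (e.g. via FirecREST) can stream it
cat subset.nc
```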
- In situ visualization with FirecREST?
- https://www.paraview.org
- One can run whatever they want on the compute node through FirecREST anyway, via the run script
- With `srun --overlap` one can run another job step directly on the same node within the same resource allocation
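- E.g., to attach an interactive step to a running allocation (job ID is a placeholder; needs a reasonably recent Slurm):
```bash
# Run an extra step on the node(s) of an existing job, sharing its resources
srun --overlap --jobid=123456 --pty top
```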
- DM4: Streaming
- Use cases:
- EXCLAIM workflow where they obtain historical weather data, which is used for a PyTorch workflow (AI/ML production) -> Anurag is the relevant contact person
- MeteoSwiss accessing production data
- FDB: https://github.com/ecmwf/fdb
## 2024-07-12
- DM4: Simple ICON workflow where input files come from an API / utility (Polytope) that retrieves large data from tape; data taken from ECMWF in Bologna
- Write to MeteoSwiss: How do we fetch data from Polytope? Which command, which API?
- With Alex -> demonstrator to get data from them; how do they actually use that in a workflow?
- Data access part: Do we need credentials, etc.? Any security concerns?
- Call a job that uses their API/utility to fetch data, then use that data in our workflow. For now a dummy workflow; as a next step it could later be used in the actual ICON workflow
- How can AiiDA use the large-scale, fast-incoming data?
- First step: getting data to disk rather than live streaming
- First the slow API from ECMWF; data coming from Alps2 to Alps (part of DM4; Jerome is the point of contact)
- See if `FLEXPART` workflows can also be run via the new YAML syntax from [[Matthieu Leclair]] and our WorkGraph-based engine
- CSCS and `autosubmit` will get funding for a workflow project soon -> interest in getting a collaboration going
- Understand the `autosubmit` syntax to check whether we can parse it into our YAML format with Matthieu
- Summarize all the meetings with `autosubmit`, `ecFlow`, etc.
- Rico's PoC doesn't work for the second run: not because of the `restart_file`, but because it doesn't find some other file
- Next all-hands meeting should be around December
- Data transfer issue (Jerome) different from tape issue -> Two different concerns
- Use case 1: Fetch large global data to run smaller simulation
- Use case 2: Obtain historical data from the tape -> E.g. for training ML models
- Jerome: Main person for tape API; Kristian just uses the API to obtain the data
- Person from EXCLAIM to give a top-level overview of what they do with the data -> how to get e.g. 1 GB rather than 1 TB of useful data and put it on scratch