# Discussion on RabbitMQ replacement - 2022-06-21

###### tags: `design`

Participants:
* Simon
* Giovanni
* Martin
* Chris
* Kristjan
* Jusong
* Francisco

Discussions:
- Original RMQ AEP: https://github.com/aiidateam/AEP/pull/30
- Presentation by Chris, "AiiDA process scheduler", see slides
- Simon: [from the K8s docs](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#what-kubernetes-is-not), they say that k8s is not "really" an orchestration system, which would imply "first do A, then do B". Instead, you describe a desired state and the system does its best to reach that state.
- Simon: seems similar to [Dask](https://docs.dask.org/en/stable/) - maybe good to check it first?
- Chris: yes, good idea, I'll look into it
- Martin: he checked when writing the kiwipy paper. Dask does *not* have persistence, and there is no way of restarting (unless you put a mechanism on top).
- Simon: are there efforts in the community trying to address the same problem? We are probably not the first.
  - Potentially relevant:
    - https://github.com/insitro/redun
    - https://distributed.dask.org/en/stable/
    - https://mesos.apache.org/
    - [Recent review paper of task scheduling systems](https://onlinelibrary.wiley.com/doi/full/10.1002/spe.3047)
- Martin: the AEP already gets a bit technical; maybe good to focus first on high-level requirements
  - The system should be responsive: some tasks, such as CIF handling, can take < 1 s.
    ("Event based?")
  - Delivery guarantees: things may happen more than once, or zero-or-one times; we need to make a choice on the promises we can make

Requirements:
- Single source of truth
- The implementation should have a modest 'burden of maintenance' (understandable by multiple people), also in the long term, not only for the initial implementation
- Easy to test (test-driven), to facilitate debugging
- "Accessible logging": if the workers or the daemon are misbehaving, there should be a place where the user can check what they are doing, with good enough information to debug by themselves the problems that depend on them (example: environment issues)
- It should be possible to use an 'unmodified' PostgreSQL instance
- A new user who installs AiiDA on a computer should be able to run up to a few (~3-4) profiles in parallel, with reasonable performance (~1000+ concurrent processes) on a standard powerful workstation
  - Technical note: e.g. there is a maximum number of connections (100 by default) to a given PostgreSQL DB; this limit is actually shared among all running AiiDA instances (and each will connect multiple times)
- Orchestrate the running of many AiiDA workflows ("scalability")
  - No overhead from already-run processes
  - Still keep some debugging information accessible (find a good balance with the amount of info, maybe a user-selectable level of debugging)
  - Minimal requirement: 10'000+ concurrent running processes, ideally 100'000+ (concurrent from the point of view of the AiiDA user: they can submit all of them and AiiDA can deal with them efficiently)
  - A workflow may have many children to run per step (and each child is a new process)
  - A calculation may run for hours/days
- All these workflows need to run to completion ("robustness")
- Want to run in background process(es)
  - However, it should be possible (e.g.
    for developers, for small processes, or for testing) to run directly without a daemon
- Spawn multiple (loosely coupled) processes, running multiple (asynchronous) workflows
  - Needs to manage these background processes and be responsive to crashes
- It should be possible to reboot the AiiDA computer without losing the state of the running processes
  - Should be as resilient as possible to random failures/reboots (the user will forget to turn off AiiDA before rebooting)
  - Needs to be resistant to process terminations (i.e. persistent state)
- "Responsiveness":
  - Needs to be responsive to process terminations (e.g. quick reaction to a child's completion, to continue/complete work)
  - AiiDA should guarantee that, e.g., if a job is submitted (in the sense of a process being submitted to AiiDA) and the submission is acknowledged to the user, its execution is guaranteed (and no expensive operations are run more than once); if it is not possible to give this guarantee, the user should know how to recover/continue. If there is an error during submission, it should not happen that a leaked job runs in the background but is invisible to AiiDA.
    - Requirement: guarantee that a job is never submitted twice to a scheduler
    - If something is not submitted (e.g. reboot of the machine while submitting?), the error should be clearly visible to the user and should be easy to recover from (e.g. the process is paused, and the user can run `verdi process play` to retry), without losing the whole workflow
  - Define states, with steps that bring the system from state A to state B. We then need to discuss what happens when something fails in between: the process should never remain in an intermediate state; should it revert to state A?
  - If a recoverable error happens in the middle of a long-running workflow, don't lose the whole state but allow recovery from the last "OK" state.
- Performance:
  - Needs to not put undue stress on the storage backend (i.e.
    don’t short-poll)
  - Needs to use minimal CPU when idle
  - Needs to use minimal RAM when idle (e.g. limit the amount of data kept in memory while waiting for a job to complete)
  - (Nice to have, but not critical) autoscaling of the number of workers between a min and a max; this is probably not critical if users can see a hint of what to do (e.g. `verdi process list` currently hints when the number of workers needs to be increased)
- (Nice to have) Avoid forcing the user to install system-level services (except PostgreSQL, i.e. reusing the storage backend)
- (Nice to have) Support a 'single instance, multiple users' scenario (e.g. a single AiiDA instance on a server, with everyone connecting to the same profile). E.g.: can workers connect to the same storage but run on different machines? (e.g. some workers on the supercomputer, to avoid issues with 2FA/MFA?). Maybe not necessarily multiple users, but the same user on different machines?
  - It should be trivial to create a "pool" of workers in various environments (compare to the Dask approach).
- Allow optionally not going through the daemon for small processes (e.g. small calcfunctions/calcjobs) (keep the current functionality).
- Provide a timeout for e.g. killing/stopping a process, or at least give an estimate of how many more processes the daemon still needs to act on before fully terminating.
  - Requirement: reduce the time to kill a workflow with many children to the minimum possible? Maybe not always possible; it depends on the scheduler whether you need to kill the job remotely. So at least give information on the expected time.
- (Nice to have) A "hook" to increase the complexity of the orchestration, e.g. limiting the number of jobs being submitted to a certain scheduler (replacing, at least in part, aiida-hyperqueue and aiida-submission-controller?) [there are open issues on aiida-core]
- (Nice to have) Allow the submission of processes (and process actions, such as kill etc.)
  at any time, once there is a valid AiiDA installation and a connection to the storage, without a daemon running (e.g. by updating the persistent state, which is later read by the orchestrator)
  - What about CPU-intensive steps?
  - At least this should work even if blocking, but it should not create issues (timeouts, ...)
- (TBD) Allow the execution of containerized applications [details on an example to add, CSA]

Keep current features:
- Ensure the current pooling of connections (SSH) and of calls to `squeue` and similar on the supercomputer, to avoid flooding the supercomputer
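The PostgreSQL connection-limit concern from the requirements above can be made concrete with a back-of-the-envelope check. All the per-worker numbers here are illustrative assumptions for discussion, not measured AiiDA values:

```python
# Back-of-the-envelope check of the PostgreSQL connection budget.
# All per-worker/per-shell numbers are illustrative assumptions, not measurements.

MAX_CONNECTIONS = 100   # PostgreSQL default ('SHOW max_connections;')
RESERVED = 3            # 'superuser_reserved_connections' default
CONNS_PER_WORKER = 2    # assumed: e.g. one ORM session plus one extra connection
CONNS_PER_SHELL = 1     # assumed: each interactive verdi shell/script


def fits_in_budget(n_profiles: int, workers_per_profile: int, n_shells: int) -> bool:
    """Return True if the assumed connection usage stays within the shared limit."""
    needed = (n_profiles * workers_per_profile * CONNS_PER_WORKER
              + n_shells * CONNS_PER_SHELL)
    return needed <= MAX_CONNECTIONS - RESERVED


# 4 profiles with 10 workers each plus 5 open shells: 4*10*2 + 5 = 85 -> fits
print(fits_in_budget(4, 10, 5))   # True
# 4 profiles with 15 workers each: 4*15*2 + 5 = 125 -> exceeds the budget
print(fits_in_budget(4, 15, 5))   # False
```

This makes it clear why the "~3-4 profiles with reasonable performance" requirement interacts with the "unmodified Postgres instance" requirement: the default limit is shared across all profiles and workers.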
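The "state A to state B, never remain in an intermediate state" requirement above could be sketched as a checkpointed state machine. This is only a sketch under assumed names; the real orchestrator would persist to the storage backend rather than a JSON file:

```python
import json
import tempfile
from pathlib import Path


class PersistentStateMachine:
    """Sketch: a transition either completes and is checkpointed, or the
    process stays at the last checkpointed ("OK") state and can be retried.
    Hypothetical class; not the actual AiiDA/plumpy API."""

    def __init__(self, checkpoint: Path, initial: str = "created"):
        self.checkpoint = checkpoint
        if checkpoint.exists():
            # After a reboot, resume from the last persisted state.
            self.state = json.loads(checkpoint.read_text())["state"]
        else:
            self.state = initial
            self._persist()

    def _persist(self) -> None:
        # Write-then-rename so the checkpoint file is updated atomically.
        tmp = self.checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps({"state": self.state}))
        tmp.replace(self.checkpoint)

    def transition(self, target: str, step) -> None:
        """Run `step`; only on success move to `target` and checkpoint.
        On failure the state is unchanged, i.e. we revert to state A."""
        try:
            step()
        except Exception:
            return  # stay at the last OK state; the user can retry/play
        self.state = target
        self._persist()


ckpt = Path(tempfile.mkdtemp()) / "proc-1.json"
sm = PersistentStateMachine(ckpt)
sm.transition("submitted", lambda: None)   # step succeeds -> checkpointed
sm.transition("running", lambda: 1 / 0)    # step fails -> stays "submitted"
print(sm.state)                            # submitted

# A "rebooted" instance reads the last checkpoint back.
sm2 = PersistentStateMachine(ckpt)
print(sm2.state)                           # submitted
```

The atomic write-then-rename is the key design point: a crash mid-checkpoint leaves either the old state or the new one on disk, never a half-written intermediate.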
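The "don't short-poll" performance requirement is usually met with geometrically growing poll intervals. A minimal sketch, where the base, factor, and cap values are illustrative assumptions rather than actual AiiDA settings:

```python
def backoff_intervals(base: float = 5.0, factor: float = 2.0,
                      cap: float = 300.0, n: int = 6):
    """Yield n polling intervals growing geometrically up to a cap.
    Illustrative sketch of avoiding short-polling the storage backend."""
    interval = base
    for _ in range(n):
        yield interval
        interval = min(interval * factor, cap)


print(list(backoff_intervals()))  # [5.0, 10.0, 20.0, 40.0, 80.0, 160.0]
```

Backoff alone is a workaround, though: the "Event based?" note earlier suggests the preferred design is to react to notifications rather than poll at all.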
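The last point, pooling calls to `squeue` and similar, amounts to coalescing polls so that many waiting jobs share one real remote call per interval. A sketch with hypothetical names (AiiDA's actual transport/scheduler API differs):

```python
class ThrottledScheduler:
    """Sketch of coalescing scheduler polls: at most one real `squeue`-like
    call per `min_interval` seconds, shared by all jobs on that computer."""

    def __init__(self, min_interval: float, fetch):
        self.min_interval = min_interval
        self.fetch = fetch        # callable performing the real remote call
        self._last_time = None
        self._cached = None
        self.real_calls = 0       # counter, for illustration only

    def job_states(self, now: float):
        """Return (possibly cached) job states; `now` is injected for testability."""
        if self._last_time is None or now - self._last_time >= self.min_interval:
            self._cached = self.fetch()
            self._last_time = now
            self.real_calls += 1
        return self._cached


sched = ThrottledScheduler(min_interval=30.0, fetch=lambda: {"job-1": "RUNNING"})
for t in (0.0, 5.0, 10.0, 31.0):  # four polls within ~31 seconds
    sched.job_states(now=t)
print(sched.real_calls)           # 2  (real calls only at t=0 and t=31)
```

Combined with SSH connection pooling, this keeps the load on the supercomputer bounded regardless of how many concurrent jobs AiiDA is watching.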