# Job runner pipeline
[Draft v1](https://iiasahub.sharepoint.com/:w:/s/ene/ES4PElaE8VNEixPNF6dYjTABNOHWyK1xoBNcDURt5zetFA?e=mwsVdY&wdLOR=c5EB0B8E6-F1DD-7F48-9046-BBD7CB634744)
## Introduction
In the IXMP server landscape, some jobs/tasks need to be executed asynchronously.
This document collects the requirements for a job execution system that should allow executing jobs of various types and reporting their execution status and history. Types of jobs are (to be extended):
- import timeseries data from IAMC-format Excel files submitted by Scenario Explorer users
- generation of database snapshots (with relatively high demands on execution time and data volume)
- (layer import for geolayers, currently on hold)
- import metadata
- solve Scenarios
From the user's perspective, submitting a job should blend seamlessly into the UI, e.g. different parts of the UI may allow submitting jobs and displaying jobs of a certain type.
#### Higher-level features
1. multiple versions of each job type
2. multi-stage workflows for jobs
3. job scheduling with prioritization
4. advanced API to handle the job queue
## Terminology
**Workflow** is a custom script/program to be executed as part of a job (that is counter-intuitive: workflows usually consist of several stages or sub-tasks, so let's find a better word)
comment by DH: agree, we should rename workflow.py as part of a restructuring of the repo (which I would like to be involved in)
**Execution profile** is a set of configuration parameters (backend API URL, database credentials, etc.) that a job uses.
**Stage** is a separate part of a job that has defined input and output and can be executed independently. A stage can have dependencies on other stages (see the sketch below).
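To make the terminology concrete, here is a minimal sketch of how these concepts could relate to each other; all class and field names are hypothetical and only illustrate the definitions above.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionProfile:
    """Configuration a job runs against (illustrative fields only)."""
    name: str                 # e.g. the SE instance this profile belongs to
    backend_api_url: str
    database_credentials: str


@dataclass
class Stage:
    """Independently executable part of a job with defined input and output."""
    name: str
    depends_on: list[str] = field(default_factory=list)  # names of prerequisite stages


@dataclass
class Job:
    """A submitted unit of work; the "workflow" is the script executed for it."""
    job_type: str                       # e.g. "import-timeseries", "snapshot"
    profile: ExecutionProfile
    stages: list[Stage] = field(default_factory=list)
    status: str = "submitted"           # submitted -> started -> finished / failed
```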
## User requirements
- ability to submit new jobs
- users can provide job parameters (e.g. job type, input file, ...), with the available parameters depending on the job type (see the sketch after this list)
- ability to see a list of started/completed jobs
- view execution log and output of individual jobs
- get notified via e-mail when a job finishes/fails
- ability to define/customize workflows by contributing to the source code
comment by DH: this is not part of the requirements for the pipeline - any contribution would happen via a PR to a repo which is used/imported by the job execution
- ability to run/test a workflow offline (before submitting changes to VCS)
comment by DH: again, not really part of the requirement here
- ability to re-run a previously submitted (finished) job
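As an illustration of these user requirements, an interaction with a (not yet existing) job API could look roughly like the sketch below; the base URL, endpoint paths, parameter names and response fields are assumptions, not an agreed interface.

```python
import requests

BASE_URL = "https://example.invalid/job-api"  # placeholder, no such service exists yet

# Submit a new job: a job type plus type-specific parameters and an input file.
with open("scenario_data.xlsx", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/jobs",
        data={"type": "import-timeseries", "parameters": '{"instance": "se-public"}'},
        files={"file": f},
    )
job_id = resp.json()["id"]

# List finished jobs and inspect the log of a single job.
finished = requests.get(f"{BASE_URL}/jobs", params={"status": "finished"}).json()
log = requests.get(f"{BASE_URL}/jobs/{job_id}/log").text

# Re-run a previously submitted (finished) job.
requests.post(f"{BASE_URL}/jobs/{job_id}/rerun")
```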
## Technical requirements
#### Workflow requirements
- support potentially conflicting requirements of the execution environment (versions of Python/R dependencies, system packages etc)
- keep all log output of the job script/workflow (from the moment the job status is set to "started" until it is set to "finished") and persist it at the end
- ability to support multiple execution profiles to handle jobs from different "environments" (SE instances)
- persist temporary data for a pre-configured time to support troubleshooting (e.g. do not remove containers or clean up job directories immediately)
- ability to re-run a single stage of a multi-stage job (e.g. stage 1: create a new scenario version and import source data; stage 2: add climate assessment results to the scenario version created in stage 1); see the sketch after this list
- backdoor import: to be defined/clarified; what is the actual requirement?
comment by DH: This is not really a requirement for the job execution. It refers to the use case that a user adds/edits/removes data or meta indicators via the Python ixmp API
- avoid code duplication across workflow scripts (e.g. use a utility Python package)
comment by DH: not really a requirement of the job runner pipeline, but related to the structure of the repos called by the processes
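To illustrate the multi-stage and execution-profile requirements above, a two-stage workflow (matching the example in the re-run requirement) might be declared roughly as follows; the structure and all names are a sketch, not a proposed format.

```python
# Hypothetical declaration of a two-stage workflow (names are illustrative).
WORKFLOW = {
    "name": "import-with-climate-assessment",
    "stages": [
        {
            # stage 1: create a new scenario version and import source data
            "name": "import-source-data",
            "depends_on": [],
        },
        {
            # stage 2: add climate assessment results to the version from stage 1
            "name": "add-climate-assessment",
            "depends_on": ["import-source-data"],
        },
    ],
}


def run_stage(workflow: dict, stage_name: str, profile: dict) -> None:
    """Re-run a single stage against a given execution profile.

    Outputs of earlier stages are assumed to be persisted, so a later or
    failed stage can be executed in isolation.
    """
    stage = next(s for s in workflow["stages"] if s["name"] == stage_name)
    print(f"running stage {stage['name']} with profile {profile.get('name')}")
```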
#### Job runner requirements
- jobs can consist of multiple sub-workflows that are visible separately in SE (still only one e-mail notification after execution)
- job input: a file and other (JSON) parameters
- job output: status (success/error), logs, file(s), descriptive error message (?)
- job types: a subset of the supported types can be configured per SE instance
- scheduler: for interaction with the job queue, prefer push over polling (reduces server load, starts jobs immediately)
- versioning: ability to run an old version of the pipeline at "any" time in the future
comment by DH: not sure that this is a hard requirement (if it makes the implementation or user interface more complicated, probably better to drop it)
- updates: updating a workflow shouldn't stop jobs or require restarting the execution pipeline (maybe just add a new version of a workflow instead)
- resilience: gracefully handle network and other outages (e.g. lack of connectivity, a subset of file-system-related issues, etc.)
- monitoring: detect interruptions of the job runner so that they are noticed early
- scalability: ability to increase/decrease system capacity "on the fly" (number of jobs executed in parallel)
- allow limiting to one worker per instance
- resource planning: ability to define resource requirements/limits for a specific job type (e.g. CPU/memory limits); see the configuration sketch after this list
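A per-instance runner configuration covering several of the points above (enabled job types, worker limit, resource planning, retention of temporary data) could look like the following sketch; all keys and values are illustrative assumptions, not an agreed format.

```python
# Illustrative runner configuration for a single SE instance (all keys are assumptions).
RUNNER_CONFIG = {
    "instance": "se-public",
    # subset of supported job types enabled for this instance
    "job_types": ["import-timeseries", "import-meta", "snapshot"],
    # scalability / "one worker per instance" case
    "max_parallel_jobs": 1,
    # per-job-type resource limits
    "resources": {
        "snapshot": {"cpu": 2, "memory_gb": 8},
        "import-timeseries": {"cpu": 1, "memory_gb": 2},
    },
    # keep job directories/containers around for troubleshooting
    "keep_job_data_days": 7,
}
```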
#### API requirements
- ability to retrieve a subset of jobs by type, status, submit/execution time, user, etc. (see the sketch after this list)
- ability to persist and retrieve logs
- ability to update the job status, recording the time of the change
- (optional) ability to submit partial log updates (e.g. to build job progress reporting on top of them)
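Below is a minimal sketch of the API surface implied by these requirements, using an in-memory store as a stand-in for the database; the function names and signatures are assumptions for illustration only.

```python
from __future__ import annotations

from datetime import datetime, timezone

# In-memory stand-ins; a real implementation would persist to the database.
JOBS: dict[int, dict] = {}
LOGS: dict[int, list[str]] = {}


def list_jobs(job_type: str | None = None, status: str | None = None,
              user: str | None = None) -> list[dict]:
    """Retrieve a subset of jobs filtered by type, status and user."""
    return [
        job for job in JOBS.values()
        if (job_type is None or job["type"] == job_type)
        and (status is None or job["status"] == status)
        and (user is None or job["user"] == user)
    ]


def update_status(job_id: int, status: str) -> None:
    """Update the job status, recording the time of the change."""
    JOBS[job_id]["status"] = status
    JOBS[job_id]["status_changed_at"] = datetime.now(timezone.utc)


def append_log(job_id: int, lines: list[str]) -> None:
    """Submit a partial log update; progress reporting could build on this."""
    LOGS.setdefault(job_id, []).extend(lines)
```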
## TODOs
- More coherent naming (workflow, job, task, pipeline, stages, ...)
- What are hard requirements, what could be omitted?
- Discuss/propose an architecture; depending on the actual requirements we could e.g.:
  - use 3rd-party software (a job execution server/framework)
  - apply architecture options like microservices / serverless functions
  - develop a custom solution (potentially using the above patterns or software)
- Apply time constraints to the implementation (keep existing jobs running; set a date by which the system must be fully functional)
- Define what relates to workflow requirements and what to job runner requirements, as many currently relate to both (maybe due to trying to be too abstract)