# analysis processing - terminology

related docs:

* this: https://hackmd.io/@rig/HJosMHf_K
* use-case drafts: https://hackmd.io/@rig/rJcMZpUPt

## A pipeline:

```yaml
# my pipeline configuration.yaml
environment:
  define: stuff
  that: "i/need.csv"
pipeline:
  - "my_job1":  # (specified as a process of steps)
      - step1:  # is a function call
          arg1: 'thisfile.stuff'
          arg2: 'that_col'
      - step2:
          input: ~  # take previous step's output
          arg2: 5
          arg3: 'yes'
          arg_n: foo
      - step_s:
          input: ~
  - "my_job2":
      - again:
          some: steps
          what: {is, this}
```

(a loader sketch for this config appears at the end of this doc)

## Terminology (based on GitLab CI)

pipeline
: - list of **jobs** defined in *a single file*
  - is executed in order (in v1 only running the full pipeline is supported)
  - however, the loader could add support for caching internally (i.e. write the cache to disk after the first run, read the existing cache on subsequent runs)

job
: - consists of concrete steps like loading data, filtering, aggregating, ...
  - within a job:
    - the first step does not get an automatic input (it must e.g. load data itself)
    - all other steps: output of step N = main input of step N+1 (extra inputs or variables must be configured)

step (or whatever we call it ...)
: part of a job

context
: a dict containing the output of the last step of each job that has already finished, i.e.
  - job names must be unique
  - jobs have access to the final output of previous jobs

(these rules are illustrated by the executor sketch at the end of this doc)

> Note on persistence to disk: this does not happen automatically but through an extra step in jobs

## Terminology (based on process graphs)

a suggestion, with details to be worked out

job
: * is defined by a _job description_
  * implements and has at least 1 _process_
  * produces 1+ _job results_
  * [ ] do we need a _job context_?
  * [ ] batch-able?
  * [ ] are we looking for a config format to describe a job configuration, a process, or both?

job result
: * is **non-intermediate**, which means persisted (on disk)

process
: * is described by a [process graph](https://en.wikipedia.org/wiki/Process_graph) in the form of a _process configuration_
  * has several _process steps_
  * produces _process results_ in its steps

process step
: * [ ] can itself be a _process_ (do we need that?)
  * basically one of our `Reader`, `Writer`, `Transformer`, `Filter`, `Selector`, `Aggregator` entities (see the entity sketch at the end of this doc)
  * produces _process results_

process result
: * **intermediate**, except for the `Writer`'s
  * [ ] dies with the end of the process (clarify the distinction between job and process here, see _job context_)
  * resides in a _process context_

process context
: * holds the configuration to execute the _process steps_
  * holds or refers to an intermediate result container ("cache")
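## Sketches (non-normative)

As a first sketch, one way a loader could turn the single-key mappings from the pipeline YAML above into a plain `{job_name: [(step_name, args), ...]}` structure. Only `yaml.safe_load` is an existing API; the `normalize` helper and the shape it produces are assumptions, not a decided format.

```python
import yaml

# A trimmed copy of the pipeline config from the top of this doc.
CONFIG = """
pipeline:
  - "my_job1":
      - step1:
          arg1: 'thisfile.stuff'
          arg2: 'that_col'
      - step2:
          input: ~
          arg2: 5
"""

def normalize(config: dict) -> dict[str, list[tuple[str, dict]]]:
    """Flatten the single-key mappings into {job_name: [(step_name, args), ...]}."""
    jobs: dict[str, list[tuple[str, dict]]] = {}
    for job_entry in config["pipeline"]:        # each entry: {job_name: [steps]}
        (job_name, steps), = job_entry.items()
        jobs[job_name] = [
            (step_name, args or {})
            for step_entry in steps             # each entry: {step_name: {args}}
            for step_name, args in step_entry.items()
        ]
    return jobs

print(normalize(yaml.safe_load(CONFIG)))
# {'my_job1': [('step1', {'arg1': 'thisfile.stuff', 'arg2': 'that_col'}),
#              ('step2', {'input': None, 'arg2': 5})]}
```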
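A second sketch illustrates the GitLab-CI-based rules: the first step of a job gets no automatic input, each later step receives the previous step's output, and every finished job publishes its final output into a shared context dict. All names here (`STEP_REGISTRY`, `run_job`, `run_pipeline`, the demo steps) are hypothetical; it consumes the job structure produced by `normalize` above.

```python
from typing import Any, Callable

# Hypothetical registry mapping step names to callables(input, **args) -> output.
STEP_REGISTRY: dict[str, Callable[..., Any]] = {
    "load": lambda _input, path: list(range(3)),            # stand-in for loading `path`
    "scale": lambda prev, factor: [x * factor for x in prev],
    "total": lambda prev: sum(prev),
}

def run_job(steps: list[tuple[str, dict]], context: dict[str, Any]) -> Any:
    """Run one job: output of step N is the main input of step N+1."""
    result = None                       # the first step gets no automatic input
    for step_name, args in steps:
        # Extra inputs could be pulled from `context` here (not shown).
        result = STEP_REGISTRY[step_name](result, **args)
    return result

def run_pipeline(jobs: dict[str, list[tuple[str, dict]]]) -> dict[str, Any]:
    """Run jobs in order; the context maps each finished job's name to its final output."""
    context: dict[str, Any] = {}
    for job_name, steps in jobs.items():    # unique job names -> unique context keys
        context[job_name] = run_job(steps, context)
    return context

jobs = {
    "my_job1": [("load", {"path": "i/need.csv"}), ("scale", {"factor": 5})],
    "my_job2": [("load", {"path": "i/need.csv"}), ("total", {})],
}
print(run_pipeline(jobs))  # {'my_job1': [0, 5, 10], 'my_job2': 3}
```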
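Finally, a rough sketch of the process-graph vocabulary: each process step is one of the entity kinds listed above, intermediate process results live in a process context, and only the `Writer` persists anything. The `run(ctx)` method and the `ProcessContext` fields are assumptions, and the open checkboxes above (job context, nested processes) are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ProcessContext:
    """Holds the step configuration and the intermediate result container ("cache")."""
    config: dict[str, Any]
    cache: dict[str, Any] = field(default_factory=dict)

class Reader:
    """Process step producing an intermediate process result."""
    def run(self, ctx: ProcessContext) -> None:
        ctx.cache["raw"] = ["a", "bb", "ccc"]   # stand-in for reading ctx.config["source"]

class Transformer:
    def run(self, ctx: ProcessContext) -> None:
        ctx.cache["lengths"] = [len(x) for x in ctx.cache["raw"]]

class Writer:
    """The one step whose result is non-intermediate, i.e. persisted."""
    def run(self, ctx: ProcessContext) -> None:
        print("persisting:", ctx.cache["lengths"])  # stand-in for writing to disk

ctx = ProcessContext(config={"source": "i/need.csv"})
for step in (Reader(), Transformer(), Writer()):
    step.run(ctx)
# the intermediate results in ctx.cache die with the process (job results don't)
```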