# analysis processing - terminology
related docs:
* this: https://hackmd.io/@rig/HJosMHf_K
* use-case drafts: https://hackmd.io/@rig/rJcMZpUPt
## A pipeline:
```yaml
# my pipeline configuration.yaml
environment:
  define: stuff
  that: "i/need.csv"
pipeline:
  - "my_job1":            # a job (specified as a sequence of steps)
      - step1:            # each step is a function call
          arg1: 'thisfile.stuff'
          arg2: 'that_col'
      - step2:
          input: ~        # take the previous step's output
          arg2: 5
          arg3: 'yes'
          arg_n: foo
      - step_s:
          input: ~
  - "my_job2":
      - again:
          some: steps
          what: {is, this}
```
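As a rough illustration of how a loader might consume such a file (a minimal sketch, assuming PyYAML; `load_pipeline` and the returned structure are hypothetical, not a settled API):

```python
# minimal loader sketch -- assumes PyYAML; all names here are hypothetical
import yaml

def load_pipeline(path):
    """Read a pipeline configuration file, return (environment, jobs)."""
    with open(path) as f:
        config = yaml.safe_load(f)
    environment = config.get("environment", {})
    jobs = []
    for entry in config["pipeline"]:
        # each pipeline entry is a one-key mapping: {job_name: [step, ...]}
        (job_name, steps), = entry.items()
        jobs.append((job_name, steps))
    return environment, jobs
```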
## Terminology (based on GitLab CI)
pipeline
: - a list of **jobs** defined in *a single file*
  - is executed in order (in v1, only running the full pipeline is supported)
  - however, the loader could add caching support internally (i.e. write the cache to disk after the first run and read the existing cache on the second run)
job
: - consists of concrete steps such as loading data, filtering, aggregating, ...
  - within a job:
    - the first step does not get an automatic input (it must e.g. load the data itself)
    - for all other steps: the output of step N is the main input of step N+1 (extra inputs or variables must be configured explicitly; see the sketch after this list)
step (or whatever we end up calling it ...)
: part of a job
context
: a dict containing the output of the last step of each job that has already finished, i.e.
  - job names must be unique
  - jobs have access to the final outputs of previous jobs
> Note on persistence to disk: it does not happen automatically but via an extra step within a job
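Putting the above together, a minimal sketch of the execution semantics (assuming the `(job_name, steps)` pairs from the hypothetical loader above, plus a registry mapping step names to callables):

```python
# execution-semantics sketch (hypothetical names, not a settled API)
def run_pipeline(jobs, step_registry):
    context = {}  # job name -> final output of that job
    for job_name, steps in jobs:
        assert job_name not in context, "job names must be unique"
        output = None
        for i, step in enumerate(steps):
            (step_name, kwargs), = step.items()
            kwargs = dict(kwargs or {})
            if i > 0:
                # output of step N is the main input of step N+1;
                # this fills the "input: ~" placeholder from the config
                kwargs["input"] = output
            output = step_registry[step_name](**kwargs)
        # later jobs can read the final outputs of finished jobs
        context[job_name] = output
    return context
```

Persistence to disk would then simply be another registered step (e.g. a writer) rather than something the runner does implicitly.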
## Terminology (based on process graphs)
a suggestion, with details to be worked out:
job
: * is defined by a _job description_
  * implements and has at least 1 _process_
  * produces 1+ _job results_
  * [ ] do we need a _job context_?
  * [ ] batch-able?
  * [ ] are we looking for a config format to describe a job configuration, a process, or both?
job result
: * is **non-intermediate**, which means it is persisted (on disk)
process
: * is described by a [process graph](https://en.wikipedia.org/wiki/Process_graph) in the form of a _process configuration_
  * has several _process steps_
  * produces _process results_ in steps
process step
: * [ ] can be a _process_ (do we need that?)
  * basically one of our `Reader`, `Writer`, `Transformer`, `Filter`, `Selector`, `Aggregator` entities
  * produces _process results_
process result
: * is **intermediate**
    * except for the `Writer`'s result
  * [ ] dies with the end of the process (clarify the distinction between job and process here, see _job context_)
  * resides in a _process context_
process context
: * holds the configuration to execute the _process steps_
  * holds or refers to an intermediate result container ("cache"); see the type sketch below
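To make the suggested terms a bit more tangible, a type sketch (all class and field names are assumptions for illustration, not a settled design):

```python
# type sketch for the suggested terminology -- all names are assumptions
from dataclasses import dataclass, field
from typing import Any, Callable, List

@dataclass
class ProcessResult:
    value: Any
    intermediate: bool = True   # only a Writer's result is non-intermediate

@dataclass
class ProcessContext:
    config: dict                                 # configuration for the process steps
    cache: dict = field(default_factory=dict)    # intermediate result container

@dataclass
class ProcessStep:
    entity: Callable[..., Any]  # one of Reader/Writer/Transformer/Filter/Selector/Aggregator

    def run(self, **kwargs) -> ProcessResult:
        return ProcessResult(self.entity(**kwargs))

@dataclass
class Process:
    steps: List[ProcessStep]    # the process graph, simplified here to a linear list
    context: ProcessContext

@dataclass
class Job:
    description: dict                            # the job description
    processes: List[Process]                     # at least one process
    results: list = field(default_factory=list)  # 1+ persisted job results
```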