# New SWH configuration scheme
https://forge.softwareheritage.org/T1410
Better config system that does not rely on implicit configurations.
## Use Cases
> [name=tenma] Aren't production and docker environments the same?
> [name=douardda] nope, since docker makes use of entrypoint scripts whereas prod uses systemd unit files, but they are pretty similar in most aspects.
### Production deployment
- celery workers ("tasks")
- RPC servers (via gunicorn)
- inter-communication between services (e.g. swh-web -> swh-storage, swh-scheduler -> swh-storage, etc.)
- replayer / backfiller / etc.
### Docker
- celery workers ("tasks")
- RPC servers (via gunicorn)
- inter-communication between services (e.g. swh-web -> swh-storage, swh-scheduler -> swh-storage, etc.)
- replayer / backfiller / etc.
### Tests
- easy to hack/specify any configuration used
- consistent loading with the same runtime code (i.e. no test-specific "if" branches)
### REPL
> [name=douardda] not sure what this use case really is about, can it be seen as part of the "cli tools" below? Personally I rarely need more than a `s = get_storage(cls='memory')` in a shell, so...
> Also I definitely do NOT want that typing `s = get_storage()` in a shell or a script silently uses a config file somewhere (be it from the SWH_CONFIG_FILENAME or a default one).
> [name=ardumont] Indeed, sounds reasonable. That means, using the factory without any parameters should default to use in-memory implementations.
### cli tools
- end user: auth, web-api, swh cli tools (scanner, scheduler, ...)
- developer using cli tools to interact with either the production, the staging or a docker-deployed stack (e.g. using the `swh scheduler` command to manage tasks, `swh loader`, `swh lister`, ...)
- sysadmin/automation: to migrate django apps (webapp, deposit) through the cli...
## Rationale
### Current features
> [name=douardda] it's not strictly necessary to keep this "current features" section IMHO
> [name=tenma] maybe, but it reminds us what we may or may not want
The current configuration system is a utility library implementing diverse strategies of config loading and parsing, including the following functionalities:
- either absolute paths or file relative to some config directory location
- brittle abstraction from the config format: a whitelist of file extensions, but no failure otherwise
- brittle abstraction from config directory location: resilient but not strict
- priority loading from multiple default file paths
- no clear API distinction between loading and parsing/validating
- directional config merging
- partial non-recursive config key type validation+conversion
- both a mixin and a static class, where the config is a class attribute shared by all user code
### Wanted features
- consistent config definition and processing across the SWH codebase
- one API that fits all cases and is used everywhere
- priority loading with defined mechanics:
  - priority descending: CLI option > envvar > default path (only for interactive usage?)
  - merge with component-specific default config
- directional config merge: merge specific definition with a default one
- namespaced by distinct roles, so that one fully qualified config key can be used by different components, and the same unqualified key may exist for different roles, for example:
  - "web-api/token" can be used by webclient and scanner
  - key "token" could exist in namespace/role web-api or whatever-api: different fully-qualified keys for the same unqualified key
- should have a straightforward API, possibly declarative, so that user code can plug config definitions in a single step (decorator, mixin/trait, factory attribute, etc.)
> [name=douardda] The declarative part seems pretty attractive, but we currently often use configuration items as constructor argument of classes, how does this fit with the declarative aspect?
> [name=tenma] this. It couples config with constructor signatures, which leads to difficulties when renaming: you have to rename the config element and all occurrences in constructors' signatures simultaneously. We may want this, but it is the first time I see this kind of coupling; I couldn't rename args in the BufferedProxyStorage constructor because of this.
- configuration as attributes of the target class, to get proper doc/typing/validators, either flat or as a Config object parametrized by the class (either the class object or the config `cls` literal)
- may or may not want: decouple config keys from component constructors arguments (easy if Config is in another object) so that config keys and class attributes can evolve independently
- config is loaded on entrypoints (cli, celery task, gunicorn wsgi), not by each component (Loader, Lister, ...)
> [name=ardumont] possibly wrap instantiation of other components like the factories get_storage, get_objstorage, get_indexer_storage, get_journal_client, get_journal_writer, etc. do
- process generic options like config-file in the toplevel command
## Early concrete elements
### Format and location
File format: YAML
Default config:
- Separate file: local or global?
- Python code/Docstring
Specific config:
- Separate file of chosen file format
Environment variables:
- ? specific envvar like click auto envvars (e.g. SWH_SCANNER_CONFIG_FILE)
- global envvar (e.g. current SWH_CONFIG_FILENAME)
> [name=tenma] I would prefer SWH_CONFIG_(FILE)PATH to SWH_CONFIG_FILENAME, to be clear that it is not a basename (we may want to) but won't argue much.
> [name=ardumont] yes, PATH sounds better than NAME (it's a detail that can be taken care of later when everything else is centralized)
### Library
swh.core.config
- load/read config which assumes config can be loaded and parsed (avoid duplicating click behavior)
- check that config can be loaded and parsed
- priority loading: CLI option > envvar > default path (only for interactive usage?)
Either given with a CLI switch or an envvar, else a hardcoded default path; see the sketch below.
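A minimal sketch of that resolution order (the helper name and default path below are hypothetical, for illustration only):
```python
import os
from typing import Optional

# hypothetical default location, for illustration only
DEFAULT_CONFIG_PATH = "~/.config/swh/global.yml"


def resolve_config_path(cli_option: Optional[str] = None) -> str:
    """Pick the configuration file path with the priority:
    CLI option > SWH_CONFIG_FILENAME envvar > hardcoded default."""
    if cli_option:
        return cli_option
    envvar = os.environ.get("SWH_CONFIG_FILENAME")
    if envvar:
        return envvar
    return os.path.expanduser(DEFAULT_CONFIG_PATH)
```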
### Usage
See example from scanner CLI.
---
## Current situation
> [name=douardda] this section probably needs to be moved somewhere else.
These are examples of config files as currently used (we focus here on the configuration itself, not on where these files are loaded from).
Most of the configuration files use the form:
```
<swhcomponent>:
  cls: <select the implementation to use>
  args:
    <dict of args passed to the class constructor>
```
Also most (?) CLI tools for swh packages use the same pattern: the config file loading mechanism is handled in the main click group for that package (e.g. in `swh.dataset.cli.data_cli_group` for `swh.dataset`, or `swh.storage.cli.storage` for the storage, etc.)
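That recurring pattern roughly looks like the following sketch (simplified; the real groups differ per package, and some also fall back to `SWH_CONFIG_FILENAME`):
```python
import click

from swh.core import config


@click.group(name="storage")
@click.option(
    "--config-file", "-C", default=None,
    type=click.Path(exists=True, dir_okay=False),
    help="Path to the YAML configuration file",
)
@click.pass_context
def cli_group(ctx, config_file):
    """The main group loads the config once and stores the resulting
    dict in the click context, where subcommands pick it up."""
    ctx.ensure_object(dict)
    ctx.obj["config"] = config.read(config_file) if config_file else {}
```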
### objstorage
The generic config for an objstorage looks like:
```
objstorage:
  # typ. used as swh.objstorage.factory.get_objstorage() kwargs
  cls: pathslicing
  args:
    root: /srv/softwareheritage/objects
    slicing: 0:5
```
Here we have the main config entry (how to access the underlying objstorage backend), followed by one (or more) configuration items for the objstorage RPC server (for which one needs to read the code to know which options are accepted).
#### rpc-server
The config is checked in `swh.objstorage.api.server.make_app` with some validation in `swh.objstorage.api.server.validate_config`.
It also accepts a `client_max_size` top-level argument, which is the only "extra" config parameter supported (used in `make_app`).
#### WSGI/gunicorn
When started via gunicorn:
`swh.objstorage.api.server:make_app_from_configfile()`
This function checks the presence of SWH_CONFIG_FILENAME, loads the config file, validates it (`validate_config`), then calls `make_app`.
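Schematically, and assuming the helpers behave as described above (this is a paraphrase of the flow, not the actual implementation):
```python
import os

from swh.core import config
from swh.objstorage.api.server import make_app, validate_config


def make_app_from_configfile():
    """Approximate flow: read SWH_CONFIG_FILENAME, validate the result,
    then build the RPC application."""
    config_path = os.environ.get("SWH_CONFIG_FILENAME")
    if not config_path:
        raise EnvironmentError("SWH_CONFIG_FILENAME is not set")
    cfg = config.read(config_path)
    validate_config(cfg)  # complains if the configuration is invalid
    return make_app(cfg)
```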
#### replayer
The objstorage replayer needs 2 objstorage configurations (src and dst) and a journal_client one, e.g.:
```
objstorage_src:
  cls: remote
  args:
    url: http://storage0.euwest.azure.internal.softwareheritage.org:5003
    max_retries: 5
    pool_connections: 100
    pool_maxsize: 200
objstorage_dst:
  cls: remote
  args:
    url: http://objstorage:5003
journal_client:
  cls: kafka
  brokers:
    - kafka1
    - kafka2
    - kafka3
  group_id: test-content-replayer-x-change-me
```
The `journal_client` config item is directly used as an argument of the `swh.journal.client.get_journal_client()` factory.
### storage
```
storage:
  cls: local
  args:
    db: postgresql:///?service=swh-storage
    objstorage:
      cls: remote
      args:
        url: http://swh-objstorage:5003/
    journal_writer:
      cls: kafka
      args:
        brokers:
          - kafka
        prefix: swh.journal.objects
        client_id: swh.storage.master
```
Here we have the same config scheme for the main underlying (storage) backend.
Besides the configuration of the underlying storage access, there can also be the configuration for the linked objstorage and journal_writer.
The former is passed directly to the `swh.storage.objstorage.ObjStorage` class, which is a thin layer above the real `swh.objstorage.ObjStorage` class (instantiated via `get_objstorage()`).
The latter is directly used as an argument of the `swh.storage.writer.JournalWriter` class.
Also note that the instantiation of the objstorage and journal writer is done in each storage backend (it is not a generic behavior in `get_storage()`).
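A condensed illustration of that coupling (a hypothetical simplified constructor, not the actual `swh.storage` code):
```python
from swh.storage.objstorage import ObjStorage
from swh.storage.writer import JournalWriter


class StorageSketch:
    """Illustration only: each backend wires up its own objstorage and
    journal writer from the nested configuration entries."""

    def __init__(self, db, objstorage, journal_writer=None):
        self.db = db  # e.g. "postgresql:///?service=swh-storage"
        # nested config dicts are handed to the thin wrapper classes
        self.objstorage = ObjStorage(objstorage)
        self.journal_writer = JournalWriter(journal_writer)
```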
#### rpc-serve
Same as general case + inject the `check_config` flag from cli options if needed.
#### WSGI/gunicorn
`swh.storage.api.server:make_app_from_configfile()`
#### replayer
This tool needs 2 entries: the destination storage (same config as above) + the journal client config (`journal_client`), like:
```
storage:
  cls: remote
  args:
    url: http://storage:5002/
    max_retries: 5
    pool_connections: 100
    pool_maxsize: 200
journal_client:
  cls: kafka
  brokers:
    - kafka-broker:9094
  group_id: test-graph-replayer-XX5
  object_types:
    - content
    - skipped_content
```
The `journal_client` config item is directly used as an argument of the `swh.journal.client.get_journal_client()` factory.
#### backfiller
The backfiller uses a "low-level" config scheme, because it needs a direct access to the database:
```
brokers:
- broker1
- ...
storage_dbconn: postgresql://db
prefix: swh.journal.objects
client_id: <UUID>
```
The config validation is performed within the `JournalBackfiller` class.
### dataset
In `swh.dataset`, loaded config is directly passed to `GraphEdgeExporter` via `export_edges` and `sort_graph_nodes`.
For the `GraphEdgeExporter`, these config values are actually the `**kwargs` of `ParallelExport.process` plus the `remove_pull_requests` flag extracted from the `config` dict in `process_messages()`.
This `ParallelExporter` uses a single config entry, `journal`, the configuration of a journal client.
For the `sort_graph_nodes`, config values are:
- `sort_buffer_size`
- `disk_buffer_dir`
### deposit
:::info
The main `click` group of `swh.deposit` does **not** load the configuration file.
:::
However, it provides a `swh.deposit.config.APIConfig` class that loads the configuration from the `SWH_CONFIG_FILENAME` file.
The generic implementation expects a `scheduler` entry, and has default values for `max_upload_size` and `checks`.
:::info
The current config file for the deposit service in docker looks like:
```
scheduler:
  # used by the deposit RPC server
  cls: remote
  args:
    url: http://swh-scheduler:5008
# deposit server now writes to the metadata storage (storage)
storage_metadata:
  cls: remote
  args:
    url: http://swh-storage:5002/
storage:
  cls: remote
  url: http://swh-storage:5002/
  # needed ^ for the old migration script (we cannot remove it or init fails)
allowed_hosts:
  # used in "production" django settings (server)
  - '*'
private:
  # used in "production" django settings (server)
  secret_key: prod-in-docker
  db:
    host: swh-deposit-db
    port: 5432
    name: swh-deposit
    user: postgres
    password: testpassword
media_root: /tmp/swh-deposit/uploads
extraction_dir: "/tmp/swh-deposit/archive/"
# used by swh.deposit.api.private.deposit_read.APIReadArchives()
```
> [name=douardda] I'm not sure how all these config entries are used, and by which piece of code.
> [name=ardumont] clarified the parts not explained, dropped the obsolete ones
> [name=ardumont] by cleaning up, i saw a discrepancy about the storage_metadata key, fixed.
> [name=ardumont] it's one entangled configuration file used by all deposit modules api, the "private" api and the workers. Each using a subset combination of those... To actually see what's used by what now, better look at the production configuration instead.
:::
#### client tools
The `swh.deposit.cli.client` clis do not explicitly implement configuration loading from a file; instead, every configuration option is given as a cli option.
**However**, some classes instantiated from there do support loading a config file from the `SWH_CONFIG_FILENAME` environment variable.
Config entries for a deposit client are:
- `url`
- `auth` (a dict with `username` and `password` entries)
> [name=ardumont] "some classes instanciated from there do support loading"> True. But it's not used within that particular cli context.
> [name=ardumont] That part is now covered with integration tests (no more mock) so modification on that part should be simpler
#### admin tools
The `swh.deposit.cli.admin.admin` click group does implement the config file loading pattern (actually the loading itself is implemented in the `setup_django_for()` function).
This function loads the django configuration from `swh.deposit.settings.<platform>` (with `<platform>` in `["development", "production", "testing"]`), and sets the `SWH_CONFIG_FILENAME` environment variable to the given `config_file` argument.
> [name=ardumont] That's some not pretty stuff that will hopefully get simplified with this spec ;)
> [name=ardumont] That part is now covered with tests so modification will be simpler as well
#### celery worker
The deposit provides one celery worker task (`CheckDepositTsk`) which loads its configuration exclusively from `SWH_CONFIG_FILENAME`. The only config entry used is the `deposit` server connection information.
#### RPC server
The deposit server uses the standard django configuration scheme, but the selected config module is managed by `swh.deposit.config.setup_django_for()`.
A tricky thing is the `swh.deposit.settings.production` django settings module, since it does load the `SWH_CONFIG_FILENAME` config file (but NOT in the `development` nor `testing` flavors).
In `production` mode, it expects the configuration to have:
- `scheduler`
- `private` (credentials for the admin pages of the deposit),
- `allowed_hosts` (optional)
- `storage`
- `extraction_dir`
::: warning
> [name=douardda] not sure I have all deposit config options/usages
> [name=ardumont] in doubt, look at the [puppet manifest configuration](https://forge.softwareheritage.org/source/puppet-swh-site/browse/production/data/defaults.yaml$1687-1701)
> [name=ardumont] all deposit usages are there
> [name=ardumont] as far as my understanding about django goes, this is indeed the standard way of configuring django (I dropped the `(?)`p).
:::
### graph
The main click group of `swh.graph` does load the config file, but it does not fall back to `SWH_CONFIG_FILENAME` if no config file is given as a cli option argument.
Supported configuration values are declared/checked in the `swh.graph.config` module.
There is no main "graph" section or namespace in the config file, so all config entries are expected at file's top-level:
- batch_size
- max_ram
- java_tool_options
- java
- classpath
- tmp_dir
- logback
- graph.compress (for the compress tool)
### indexer
The main click group of `swh.indexer` does load the config file, but it does not fall back to `SWH_CONFIG_FILENAME` if no config file is given as a cli option argument.
For the indexer storage, a standard `swh.indexer.storage.get_indexer_storage()` factory function is provided, and is generally called with arguments from the `indexer_storage` configuration entry.
#### schedule
The `swh.indexer.cli.schedule` command uses the config entries:
- indexer_storage
- scheduler
- storage
#### journal_client
The `swh.indexer.cli.journal_client` command (which listens to the journal to fire new indexing tasks) uses the config entries:
- scheduler
The connection to the kafka broker is handled only by command line option arguments.
#### RPC server
When started using the `swh indexer rpc-serve` command, it expects a config file name as a required argument. Configuration entries are:
- indexer_storage
#### WSGI/gunicorn
When started as a WSGI app, the configuration is loaded from the `SWH_CONFIG_FILENAME` environment variable (in `make_app_from_configfile`).
### journal
The journal can be used from the producer side (e.g. a storage's journal writer) or the consumer side.
The `swh.journal.client.get_journal_client(cls, **kwargs)` factory function is generally used to get a journal client connection with arguments directly from `journal_client` (or `journal`) configuration entry.
The `swh.journal.writer.get_journal_writer(cls, **kwargs)` factory function is used to get a producer journal connection, with arguments directly from `journal_writer` configuration entry (generally it's the subentry of the "main"`storage` config entry, as seen above in the storage config example.)
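In practice this boils down to something like the following sketch (illustrative values; the exact keyword arguments depend on the chosen backend):
```python
from swh.journal.client import get_journal_client
from swh.journal.writer import get_journal_writer

config = {
    "journal_client": {
        "cls": "kafka",
        "brokers": ["kafka1"],
        "group_id": "test-content-replayer",
    },
    "journal_writer": {
        "cls": "kafka",
        "brokers": ["kafka"],
        "prefix": "swh.journal.objects",
        "client_id": "swh.storage.master",
    },
}

# consumer side: the whole `journal_client` entry is handed to the factory
client = get_journal_client(**config["journal_client"])

# producer side: the `journal_writer` entry (usually nested under `storage`)
writer = get_journal_writer(**config["journal_writer"])
```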
### loaders
Loaders are mostly celery workers. There is a cli tool to synchronously execute a loading.
When run as a celery worker task, the configuration loading mechanism is detailed in the scheduler section below.
When executed directly, via `swh loader run`, the loader class is instantiated directly, thus it is the responsibility of the latter to load a configuration file. This is normally done by using the `swh.core.config.load_from_envvar` helper, as sketched below.
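Schematically (a sketch; `SomeLoader` and its default config below are made up for illustration):
```python
from swh.core.config import load_from_envvar

# made-up defaults, merged with whatever SWH_CONFIG_FILENAME provides
DEFAULT_CONFIG = {
    "max_content_size": 100 * 1024 * 1024,
    "save_data": False,
}


class SomeLoader:
    """Illustration: when instantiated directly (e.g. by `swh loader run`),
    the loader pulls its own configuration from SWH_CONFIG_FILENAME."""

    def __init__(self):
        self.config = load_from_envvar(DEFAULT_CONFIG)
```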
### listers
The main lister cli group does handle the loading of the config file, including falling back to the `SWH_CONFIG_FILENAME` if not given as command line argument.
Expected config options are:
- lister
- priority (optional)
- policy (optional)
- any other option accepted as config option by the lister class, if any
The `swh lister run` command also instantiates a lister class. The base implementation supports the configuration options:
- cache_dir (unclear if this can be overloaded by a config file)
- cache_responses (same)
- scheduler
- lister
- credentials (for listers inheriting from ListerHttpTransport)
- url (same)
- per_page (bitbucket)
- loading_task_policy (npm)
When used via a celery worker, the standard celery worker config loading mechanism is used (see the scheduler section below).
### scanner
The scanner's cli implements its own strategy for finding the configuration file to load (including looking at the `SWH_CONFIG_FILENAME` variable). It only needs connection information to the public web API:
- url
- auth-token (optional)
### scheduler
The scheduler consists of several parts.
#### celery
Every piece of code that involves loading the celery stack of `swh.scheduler`, i.e. that imports the `swh.scheduler.celery_backend.config` module, will load the configuration file from `SWH_CONFIG_FILENAME`, in which at least a `celery` section is expected.
Celery workers are registered from the `swh.workers` pkg_resources entry point as well as the `celery.task_modules` configuration entry.
The main celery app singleton is then configured from a hard-coded default config dict merged with the `celery` configuration loaded from the configuration file.
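Conceptually this merge amounts to something like the following (plain dict merge shown for illustration; the actual logic and default values live in `swh.scheduler.celery_backend.config`):
```python
import os

from swh.core import config

# hard-coded defaults (illustrative values only)
DEFAULT_CELERY_CONFIG = {
    "task_serializer": "msgpack",
    "result_serializer": "msgpack",
}

cfg = config.read(os.environ["SWH_CONFIG_FILENAME"])
celery_config = {**DEFAULT_CELERY_CONFIG, **cfg.get("celery", {})}
# app.conf.update(celery_config)  # applied to the celery app singleton
```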
#### celery workers
Celery workers are started by the standard celery command (`python -m celery worker`) using `swh.scheduler.celery_backend.config.app` as celery app, so the configuration loading mechanism is the default celery one described above, and the only way to specify the configuration file to load is via the `SWH_CONFIG_FILENAME` variable.
#### cli tools
The main click group does implement the `--config-file` option, and uses the `swh.core.config.read()` function. So this main config file loading mechanism **does not** fall back to the `SWH_CONFIG_FILENAME` variable.
At this level, the only expected config entry is `scheduler` (connection to the underlying scheduler service).
Additional config entries for cli commands:
- `runner`:
  - `celery`
- `listener`:
  - `celery`
- `rpc-serve`:
  - any flask configuration option
- `celery-monitor`:
  - `celery`
- `archive`:
  - any option accepted by `swh.scheduler.backend_es.ElasticSearchBackend`
#### WSGI/gunicorn
The loading of the WSGI app normally uses the `swh.scheduler.api.server.make_app_from_configfile()` function, which takes care of loading the config file from `SWH_CONFIG_FILENAME` with no fallback to a default path.
The loaded config is added to the main flask `app` object, so any flask-related config option is possible (at configuration's top-level.)
### search
#### cli tools
`swh.search` main cli group does implement the `--config-file` option (using `swh.core.config.read()` to load the file).
Config options by cli command:
- `initialize`:
  - `search`
- `journal-client objects`:
  - `journal`
  - `search`
- `rpc-serve`:
  This expects a config file name given as a (mandatory) argument (in "addition" to the general `--config-file` option). This configuration is then used to configure the flask-based RPC server.
  - `search`
  - any flask config option
#### WSGI
The creation of the WSGI app is normally done using `swh.search.api.server.make_app_from_configfile`, which uses the `SWH_CONFIG_FILENAME` variable as (only) way of setting the config file.
### vault
#### cli tools
There is no support for the `--config-file` option in the main click group, but the cli only provides one command (`rpc-serve`), which does support this option.
The configuration file is loaded in `swh.vault.api.server.make_app_from_configfile()`, and the main RPC server is `aiohttp` based.
### web
Django-based stuff.
### web client
No config file loading for now (?)
---
## Configuration loading mechanisms
Config file loading function used
| package | command | loading function | called from | config path
| -- | -- | -- | -- | --
| dataset | `swh dataset ...` | `swh.core.config.read()` | `swh.dataset.cli.dataset_cli_group()` | `--config-file`
| deposit | `swh deposit client` |
| deposit | `swh deposit admin` | `config.load_named_config()` | | `--config-file`, `SWH_CONFIG_FILENAME`
| deposit | HTTP application | `config.read_raw_config()` | | `SWH_CONFIG_FILENAME` via `DJANGO_SETTINGS_MODULE`
| graph | `swh graph ...` | `swh.core.config.read()` | `swh.graph.cli.graph_cli_group()` | `--config-file`
| graph | WSGI app | ?? | | ??
| indexer | `swh indexer ...` | `swh.core.config.read()` | `swh.indexer.cli.indexer_cli_group()` | `--config-file`
| indexer | `swh indexer rpc-serve` | `swh.core.config.read()` | `swh.indexer.storage.api.server.load_and_check_config()` | `config-path` (via `s.i.cli.rpc_server()`)
| indexer | WSGI app | `swh.core.config.read()` | `swh.indexer.storage.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME` (via `s.i.s.a.s.make_app_from_configfile()`)
| icinga | `swh icinga_plugins ...` | |
| lister | `swh lister ...` | `swh.core.config.read()` | `swh.lister.cli.lister()` | `--config-file`, `SWH_CONFIG_FILENAME` (via `w.l.cli.lister()`)
| lister | celery worker | `config.load_from_envvar()` | `swh.lister.core.simple_lister.ListerBase()` | `SWH_CONFIG_FILENAME`, `<SWH_CONFIG_DIRECTORIES>/lister_<name>.<ext>`
| loader.package | `swh loader ...` | `config.load_from_envvar()` | `swh.loader.package.loader.PackageLoader()` | `SWH_CONFIG_FILENAME`
| loader.core | `swh loader ...` | `config.load_from_envvar()` | `swh.loader.core.loader.BaseLoader()` | `SWH_CONFIG_FILENAME`
| objstorage | `swh objstorage ...` | `swh.core.config.read()` | `swh.objstorage.cli.objstorage_cli_group()` | `--config-file`, `SWH_CONFIG_FILENAME` (via `s.o.cli.objstorage_cli_group()`)
| objstorage | WSGI app | `swh.core.config.read()` | `swh.objstorage.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME` (via `s.o.api.server.make_app_from_configfile()`)
| scanner | `swh scanner ...` | `config.read_raw_config()` | `swh.scanner.cli.scanner()` | `--config-file`, `SWH_CONFIG_FILENAME`, `~/.config/swh/global.yml`
| scheduler | `swh scheduler ...` | `swh.core.config.read()` | `swh.scheduler.cli.cli()` | `--config-file`,
| scheduler | WSGI app | `swh.core.config.read()` | `swh.scheduler.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME`
| scheduler | celery worker | `swh.core.config.load_named_config()` | `swh.scheduler.celery_backend.config` | `SWH_CONFIG_FILENAME`, `<swh.core.config.SWH_CONFIG_DIRECTORIES>/worker/<name>.<ext>`, `<s.c.c.SWH_CONFIG_DIRECTORIES>/worker.<ext>`
| search | `swh search ...` | `swh.core.config.read()` | `swh.search.cli.search_cli_group()` | `--config-file`
| search | `swh search rpc-server` | `swh.core.config.read()` | `swh.search.api.server.load_and_check_config()` | `config-path`
| search | WSGI app | `swh.core.config.read()` | `swh.search.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME` (from `s.s.api.server.make_app_from_configfile()`)
| storage | `swh storage ...` | `swh.core.config.read()` | `swh.storage.cli.storage()` | `--config-file`, `SWH_CONFIG_FILENAME` (from `s.s.cli.storage()`)
| storage | WSGI app | `swh.core.config.read()` | `swh.storage.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME` (from `s.s.api.server.make_app_from_configfile()`)
| vault | `swh rpc-serve` | `swh.core.config.read()` or `swh.core.config.load_named_config()`| `swh.vault.api.server.make_app_from_configfile()` | `--config-file`, `SWH_CONFIG_FILENAME` (from `s.v.api.server.make_app_from_configfile()`), `<swh.core.config.SWH_CONFIG_DIRECTORIES>/vault/server.<ext>`
| vault | celery worker | `swh.core.config.read()` or `swh.core.config.load_named_config()` | `swh.vault.cookers.get_cooker()` | `SWH_CONFIG_FILENAME` (from `...get_cooker()`), `<s.c.c.SWH_CONFIG_DIRECTORIES>/vault/cooker.<ext>` (from `get_cooker()`)
| web | XXX | |
| web-client | | None |
# Synthesis of discussion @tenma/@ardumont 2020-10-08
## Impacts
- only swh modules source code to migrate
- docker/production/staging should run as before once the changes are deployed (docker-compose.yml, puppet manifests untouched)
## Definitions
- component: a swh base component (e.g. Loader, Lister, Indexer, Storage, ObjStorage, Scheduler, etc...)
- entrypoint: a swh component orchestrator (celery worker "task", cli, wsgi, ...)
## Possible plan
- each component repository declares an `swh.<component>.config` module (like what we declare today for tasks in `swh.<component>.tasks`)
- the module declares a typed Config object
- the typed Config object is in charge of declaring config keys, with default values, and (gradually) validating the configuration, failing to instantiate if misconfigured (possible implementation: @attr as swh.model.model does; see the sketch after this list)
- each entrypoint is in charge of instantiating the configuration object
- each entrypoint is in charge of injecting the Config object to the component
- each component must take one specific typed Config as a constructor parameter.
- existing code loading the configuration out of an environment variable is removed
- existing code validating the configuration, if any, is removed
- the merge policy about loading from an environment variable or from a cli flag or whatever else is delegated to a function in swh.core.config
- corollary: the pseudo-typed code in swh.core.config which kind of validated the types must be dropped (I think it's dead code anyway)
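To make the "typed Config object" idea a bit more concrete, here is a minimal sketch using `attrs` (as `swh.model.model` does); the class name and fields are invented for illustration:
```python
import attr


@attr.s
class LoaderGitConfig:
    """Hypothetical swh.loader.git.config Config object: declares the keys,
    their types and defaults, and fails at instantiation if misconfigured."""

    storage = attr.ib(type=dict)  # e.g. {"cls": "remote", "url": ...}
    max_content_size = attr.ib(type=int, default=100 * 1024 * 1024)
    save_data = attr.ib(type=bool, default=False)
    save_data_path = attr.ib(type=str, default="")

    @storage.validator
    def check_storage(self, attribute, value):
        if "cls" not in value:
            raise ValueError("storage configuration needs a 'cls' key")


# an entrypoint would build it from the loaded YAML and inject it, e.g.:
# cfg = LoaderGitConfig(**config_dict["loader-git"])
# loader = GitLoader(config=cfg)
```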
## Pre-requisites
The earlier described plan should respect the following:
- separation of concern (doing one thing and doing it well: merge policy, loading config, validating, running...)
- api unification between entrypoints and tests (consistency): all entrypoints respect the same pattern of instantiating, configuring and injecting
- fail early if misconfigured
## Out of scope
- (Global Inversion of control) A component injector in charge of instantiating, configuring and injecting between objects (ala Spring Framework)
## Feedback on the proposal
### olasd
I *strongly* like that this is going towards having fewer dicts being thrown around in our code around object instantiation in favor of stronger typed objects. I also think this can be implemented in a DRY way by parsing the signature of the classes that are being instantiated.
I'm not sure that the backwards compatibility for production is *such* a strong requirement, even though it would definitely be nicer.
> [name=ardumont] "strong requirement": it is not but please let's change one thing at a time. It will definitely help to not touch that part as a first step (especially if that breaks). And when we know the migration worked (implementation underneath changed, all deployed, everything ok as before), then we can change the config values incrementally with small changes.
#### Configuration sharing
This proposal doesn't seem to be solving the concern of shared configuration across so-called components; Let me use a concrete example:
- the `WebClient` (or whatever its name is) class in swh.web.client takes a `token` parameter for authentication, and a `base_url` for its setup
- the `SwhFuse` component uses a `WebClient`
- the `SwhScanner` component also uses a `WebClient`
- the `swh fuse` command takes a config file with its own parameters, as well as parameters for a web client
- the `swh scanner` command takes a config file with parameters for a web client; I expect most of its other parameters come from the CLI directly
In the proposal it's not clear to me how the following would happen:
- `swh.web.client` declares its configuration schema in `swh.web.client.config`
- `swh.fuse` and `swh.scanner` do the same in their own `config` modules
- the `swh fuse` and `swh scanner` entrypoints parse a configuration file; from the output, they instantiate a `SwhFuse` / `SwhScanner`.
Now that I've written all of this, I guess this could be solved by having a way for the `swh.fuse.config` and `swh.scanner.config` modules to declare that they're expecting a `swh.web.client.config` at the toplevel of their configuration file (rather than in a nested way like the `get_storage` factories currently work). Did you have a different idea?
> [name=tenma]
This is my point about namespaces/roles/contexts, in my initial suggestions (that we skimmed over in the last discussion and did not include). It would be a way to both share and distinguish config keys. My proposition avoids any kind of hierarchy with arbitrary depth like composition (owning another config) and inheritance (subclassing another config), and is more similar to naming systems that use tags/prefixes. But it would imply that as a team we define those namespaces/roles/contexts ahead of time (we should choose a definitive name for this, I prefer role).
Example:
- `web-api/token`: token in the context of web-api
- `ext-service/token`: token in the context of external service
- both `swh.web.client.config` and `swh.scanner.config` could reference `web-api/token`
The injector in entrypoints, reading the config definitions of the components it instantiates, would see which config keys are shared, and could create a merged config definition from this (if we want to).
But this choice would require adapting config definitions that we can leave untouched for now.
> [name=ardumont] my initial answer was lost... I replied something about configuration composition at the time so what olasd suggested but i think we moved away from this anyways... (see the next paragraphs)
### douardda
We may see the problem we are trying to solve as:
- what do we want the configuration files to look like in any of the currently known use cases? (and how far are we with what we currently have?)
- how do we want to declare these configuration structures (typing, default values, etc.)?
- where should we instantiate/load these configuration structures?
---
# Synthesis of discussion @tenma/@douardda/@olasd 2020-10-12
## Taste
Examples that are not accurate but demonstrate syntax and functionalities.
Example config (global, mixing unrelated components) declarations:
```
storage:
  uffizi:
    cls: remote
    url: ...
  staging:
    cls: remote
    url: ...
  local:
    cls: pgsql
    conninfo: ...
    objstorage: <objstorage.foo>
  ingestion:
    cls: pipeline
    steps:
      - cls: filter
      - cls: buffer
      - <storage.uffizi>
objstorage:
  foo:
    cls: pathslicing
objstorage-replayer:
  to_S3:
    cls: ...
    src: <objstorage.local>
    dst: <objstorage.S3>
web-client:
  swh:
    token: ...
    url: ...
```
This demonstrates:
- general syntax, analogous to the current one
- one more level to distinguish multiple instances of components, which may all be instantiated or be alternatives of each other
- package must be the package of a swh service
- same key name (leaf) can exist in different packages/instances
- reference syntax to reference qualified instances
- TODO support reference to a key? Would be <web-client.swh.url> in above example
> [name=ardumont] Unclear on the scope of that configuration sample.
> - Is it a sample configuration for one service serving one module (matching what we already have)?
> - Or is it one global configuration file for all modules, thus defining all production combinations for all services? Then each service (storage, loader-git, etc.) picks the information it needs from that file?
Example CLI usage:
```
swh storage rpc-serve --storage=local
SWH_STORAGE=local swh storage rpc-serve
```
> [name=ardumont] is that dedicated to one service? SWH_STORAGE for `swh storage`, SWH_SCHEDULER for `swh scheduler` and so on and so forth...
> or is the following possible as well?
> ```
> SWH_STORAGE=local swh loader run git https://...
> ```
> which would make the storage to use a local one within the scope of the
> loading?
> [name=tenma] here `local` corresponds to an instance. The injector will instantiate this instance and fill in the references to it in the config.
> maybe the option/envvar would be more qualified like X.Y.storage=instance.
> We did not discuss it in detail. More about this idea with olasd or douardda.
> [name=ardumont] i still do not get the scope of the sample ¯\_(ツ)_/¯ (i tried to clarify my questions ^)
## Configuration model
### Configuration statement syntax
3 levels (real names to be defined): package, instance, attribute (key:value)
Examples of existing "package" identifiers:
- storage
- objstorage
- objstorage-replayer
- indexer-storage
- journal
- ...
Anonymous instances in the config file (for use as a value) are out of scope for now.
> [name=ardumont] well, my understanding so far would mean that such anonymous instances (if any) should no longer exist and be attached to some other level, level "package" then.
> [name=tenma] yes, the idea was to have something regular without a shorthand anonymous form, at least for now. Then an instance must be defined at the 2nd level in a package. If many instances are needed, it may become tedious to declare each one this way when they are mostly used once; in that case it would not be very difficult to add a syntax for anonymous instance definitions.
Package corresponds to an existing identifier, the one of the SWH Python package that defines the components to configure. It is used to resolve components types (search for type `remote` in `storage`) and group components of the same service/package.
/!\ what about components that are not of the main type of the package? Would need inclusion into the map and factory.
Ex: non-storage component in storage package?
> [name=ardumont] What about currently unspecified "package" identifiers for loaders, listers, ...; is the following good enough?
> - loader-npm
> - lister-cran
> - ...
> [name=tenma] right, for those we would need a top-level mapping I think...
> I do not know the whole type hierarchy...
> would the "package" be loader and the type "npm", or the package "loader-npm" and the type "task-?". To be defined...
The `cls` (alt: `type`) key is a type identifier and denotes what kind of object schema to use.
Other keys are instance keys conforming to the schema referenced by `cls`; more concretely, the class constructor arguments.
`<` and `>` are reference markers and denote a reference to a qualified instance or key. The choice of marker is not definitive. The YAML reference feature (`&`/`*` syntax) is rejected because we want to keep references in our model and thus not have them processed outside our control.
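A purely illustrative way such references could be resolved while processing the config tree (helper name invented):
```python
from typing import Any, Dict


def resolve_reference(config: Dict[str, Any], value: Any) -> Any:
    """If `value` looks like "<type.instance>", replace it with the
    referenced instance definition from the config tree (sketch only)."""
    if isinstance(value, str) and value.startswith("<") and value.endswith(">"):
        tid, iid = value[1:-1].split(".", 1)
        return config[tid][iid]
    return value


config = {
    "objstorage": {"foo": {"cls": "pathslicing"}},
    "storage": {"local": {"cls": "pgsql", "objstorage": "<objstorage.foo>"}},
}
assert resolve_reference(config, "<objstorage.foo>") == {"cls": "pathslicing"}
```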
### Configuration declaration
Packages are defined statically. Most probably in core.config. Could be "discovered", but we may prefer whitelisting supported ones.
> [name=ardumont] I'm under the impression that, for discoverability, we could plug that part into the "registry mechanism" already in place for the module "tasks".
> Adding a config key to the output of the `register` function there, or something.
> swh.`<module>`.__init__ defines something like (extended with "config" here):
>
> ```
> def register() -> Mapping[str, Any]:
>     """Register the current worker module's definition (tasks, loader, config, ...)"""
>     from .loader import SomeLoader
>
>     return {
>         "task_modules": [f"{__name__}.tasks"],
>         "loader": SomeLoader,
>         "config": [f"{__name__}.config"],  # <- or something
>     }
> ```
>
> Then in the setup.py of the module:
>
> ```
> setup(
>     ...
>     entry_points="""
>         ...
>         [swh.workers]
>         ...
>         lister.bitbucket=swh.lister.bitbucket:register
>         ...
>         loader.someloader=swh.loader.some:register
>         ...
> ```
> And some other parts in swh.core is reading that code to actually declare the tasks.
Packages are associated with a factory function to instantiate instances as `InstanceType(**keys)`, and optionally a (class_identifier_string: Python_class) map/register.
Example: Storage package
```
map = {
    "remote": RemoteStorage,
    "local": LocalStorage,
    "buffer": BufferedStorage,
    ...
}

def get_storage(*args, **kwargs):
    ...
```
> [name=ardumont] I gather it works for loaders, listers, indexers as well with for example, git loaders:
> ```
> map = {
>     "remote": GitLoader,
>     "disk": GitLoaderFromDisk,
>     "archive": GitLoaderFromArchive,
> }
>
> def get_loader(*args, ...):
>     ...
> ```
>
> bitbucket listers:
>
> ```
> map = {
>     "full": FullBitBucketLister,
>     "range": RangeBitBucketLister,
>     "incremental": IncrementalBitBucketLister,
> }
> ```
>
> indexers:
>
> ```
> map = {
>     "mimetype": ContentMimetypePartition,
>     "fossology-license": ContentFossologyLicensePartition,
>     "origin-metadata": OriginMetadata,
>     ...
> }
> ```
Component constructor defines config keys, types and defaults.
No static definitions that would need maintaining, but documentation autogenerated from constructors.
### Endpoint usage
Config file is necessary to specify a graph of instances.
Config file parameter specified as a string path, either absolute or relative. Current name: config-file.
Config file can be specified as CLI option (`--<config-file>`) or envvar (`SWH_<CONFIG_FILE>`).
Reference to an instance can be specified inline (ref syntax), CLI option (`--<key>-instance`), envvar (`SWH_<key>_INSTANCE`).
Other keys can be specified as CLI option (`--<key>`), envvar (`SWH_<key>`).
> [name=ardumont] It'd be good then to specify the merge policy now...
### Library API
Example:
`core.config.get_component_from_config(config=cfg_contents, type="storage", id="uffizi")`
> [name=ardumont]
What's `cfg_contents` in the sample, a global unified configuration of all the config combination?
I'd be interested in a concrete code sample of the instantiation of a module with that code, to clarify a bit ;)
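An attempt at a concrete sketch answering the question above, assuming `cfg_contents` is the whole parsed configuration tree and using the proposed (not yet existing) `get_component_from_config` API:
```python
import yaml

# proposed API from this document, not an existing function yet
from swh.core.config import get_component_from_config

with open("/etc/softwareheritage/storage.yml") as f:
    cfg_contents = yaml.safe_load(f)  # the whole config tree, every instance

# picks cfg_contents["storage"]["uffizi"], resolves its references,
# and returns a ready-to-use storage object
storage = get_component_from_config(config=cfg_contents, type="storage", id="uffizi")
```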
### Algorithms
To be defined.
Parsing (YAML library), config resolving (reference processing), component resolving, instantiating.
Restricting to N levels eases implementation.
---
## Example configurations and usages
Note that the actual name of keys is completely up to bikeshedding; specifically, dashes versus underscores versus dots is completely up in the air.
### Shared configuration for command line tools
#### Configuration file `~/.config/swh/default.yml` (default configuration path for user-facing cli tools)
```yaml=
web-client:
  default:
    # FIXME: single implementation => do we need a cls?
    base-url: https://archive.softwareheritage.org/api/1/
    token: foo-bar-baz
  docker:
    base-url: http://localhost:5080/api/1/
    token: test-token
scanner:
  default:
    # FIXME: single implementation => do we need a cls?
    web-client: <web-client.default>
    scanner-param: foo
  docker:
    web-client: <web-client.docker>
    scanner-param: bar
fuse:
  default:
    # FIXME: single implementation => do we need a cls?
    web-client: <web-client.default>
```
#### Command-line calls
* scan the current directory against the archive in docker
* (cli flag) `swh scanner --scanner-instance=docker scan .`
* (equivalent env var) `SWH_SCANNER_INSTANCE=docker swh scanner scan .`
* actual python calls in the cli endpoint
```python=
scanner_instance = "docker" # from cli flag or envvar
config_path = "~/.config/swh/default.yml" # from defaults of the cli endpoint
# basic yaml parse, no interpolation of recursive entries
config_dict = swh.config.read_config(config_path)
scanner = swh.config.get_component_from_config(config_dict, type="scanner", instance=scanner_instance)
# this would get `config_dict["scanner"]["docker"]`, then notice that one of the values has the special syntax "<web-client.docker>".
# The entry would be replaced by a call to:
# web_client_instance = swh.config.get_component_from_config(config_dict, type="web-client", instance="docker")
# Finally, the scanner would be instantiated with:
# get_scanner(web_client=web_client_instance, scanner_param="bar")
scanner.scan(".")
```
* mount a fuse filesystem
* `swh fs mount ~/foo swh:1:rev:bar`
* actual python calls in the cli endpoint
```python=
fuse_instance = "default" # default value of cli flag / envvar
config_path = "~/.config/swh/default.yml" # from defaults of the cli endpoint
# basic yaml parse, no interpolation of recursive entries
config_dict = swh.config.read_config(config_path)
fuse = swh.config.get_component_from_config(config_dict, type="fuse", instance=fuse_instance)
# this would get `config_dict["fuse"][fuse_instance]`, then notice that one of the values has the special syntax "<web-client.default>".
# The entry would be replaced by a call to:
# web_client_instance = swh.config.get_component_from_config(config_dict, type="web-client", instance="default")
# Finally, the scanner would be instantiated with:
# get_fuse(web_client=web_client_instance)
fuse.mount("~/foo", "swh:1:rev:bar")
```
### objstorage replayer
#### Config file
```yaml
objstorage:
  local:
    cls: pathslicing
    root: /srv/softwareheritage/objects
    slicing: "0:2/2:5"
  s3:
    cls: s3
    s3-param: foo
journal-client:
  default:
    # single implem: no cls needed
    brokers:
      - kafka
    prefix: swh.journal.objects
    client-param: blablabla
  docker:
    brokers:
      - kafka.swh-dev.docker
    ...
# needed for second cli usecase
objstorage-replayer:
  default:
    src: <objstorage.local>
    dst: <objstorage.s3>
    journal-client: <journal-client.default>
```
#### Cli call
Default behavior (single call to `swh.core.config.get_component_from_config(config, 'objstorage-replayer', instance_name)`)
* `swh objstorage replayer --from-instance default`
* `swh objstorage replayer` (uses config from default instance)
Nice to have (multiple, manual, calls to `get_component_from_config` in the cli entry point)
* `swh objstorage replayer --src local --dst s3`
* `swh objstorage replayer --src local --dst s3 --journal-client docker`
<!-- -->
* @douardda's proposal: on-the-fly generation of instance config via syntactic sugar
`swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3`
* @tenma's proposal: dynamic handling of cli options wrt schema
`swh objstorage replayer --objstorage-replayer.src=objstorage.local --objstorage-replayer.journal-client=journal-client.default`
- generic, too verbose, arbitrary attribute setting not needed
### Single-task celery worker from systemd
#### Configuration file `/etc/softwareheritage/loader_git.yml`
```yaml=
# component configuration
storage:
  default:
    cls: pipeline
    steps:
      - cls: buffer
        min_batch_size:
          content: 1000
          content_bytes: 52428800
          directory: 1000
          revision: 1000
          release: 1000
      - cls: filter
      - cls: remote
        url: http://uffizi.internal.softwareheritage.org:5002/

# other proposal:
storage:
  default:
    cls: pipeline
    steps:
      - cls: buffer
        min_batch_size:
          content: 1000
          content_bytes: 52428800
          directory: 1000
          revision: 1000
          release: 1000
      - cls: filter
      - <storage.uffizi>
  uffizi:
    cls: remote
    url: http://uffizi.internal.softwareheritage.org:5002/

# impossible currently:
storage:
  filter:
    cls: filter
    # missing storage: argument
  default:
    cls: pipeline
    steps:
      - cls: buffer
        min_batch_size:
          content: 1000
          content_bytes: 52428800
          directory: 1000
          revision: 1000
          release: 1000
      - <storage.filter>  # get_storage(cls="filter") => fail
        # ? OR {cls: filter, storage: <storage.uffizi>}
      - <storage.uffizi>
  uffizi:
    cls: remote
    url: http://uffizi.internal.softwareheritage.org:5002/

loader-git:
  default:
    cls: remote
    storage: <storage.default>
    max_content_size: 104857600
    save_data: false
    save_data_path: "/srv/storage/space/data/sharded_packfiles"

# Not a swh component
# no type id, only instance id
celery:
  task_broker: amqp://...
  task_queues:
    - swh.loader.git.tasks.UpdateGitRepository
    - swh.loader.git.tasks.LoadDiskGitRepository
    - swh.loader.git.tasks.UncompressAndLoadDiskGitRepository
```
#### (expanded) systemd unit '/etc/systemd/swh-worker@loader_git.service'
```ini
[Unit]
Description=Software Heritage Worker (loader_git)
After=network.target
[Service]
User=swhworker
Group=swhworker
Type=simple
# Celery
Environment=CONCURRENCY=6
Environment=MAX_TASKS_PER_CHILD=100
# Logging
Environment=SWH_LOG_TARGET=journal
Environment=LOGLEVEL=info
# Sentry
Environment=SWH_SENTRY_DSN=https://...
Environment=SWH_SENTRY_ENVIRONMENT=production
Environment=SWH_MAIN_PACKAGE=swh.loader.git
# Config
Environment=SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml
ExecStart=/usr/bin/python3 -m celery worker -n loader_git@${CELERY_HOSTNAME} --app=swh.scheduler.celery_backend.config.app --pool=prefork --events --concurrency=${CONCURRENCY} --maxtasksperchild=${MAX_TASKS_PER_CHILD} -Ofair --loglevel=${LOGLEVEL} --without-gossip --without-mingle --without-heartbeat
KillMode=process
KillSignal=SIGTERM
TimeoutStopSec=15m
OOMPolicy=kill
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
```
#### Instantiation flow
* The celery cli loads the "app" object set in the cli: `swh.scheduler.celery_backend.config.app`
* This module loads the configuration file set in `SWH_CONFIG_FILENAME` to a dict (plausibly a singleton)
* This module loads celery task modules from the swh.workers entrypoint
* This module initializes the celery broker and queues from the `celery` key of the config dict
* celery task code
```python=
@shared_task(name="foo.bar")
def load_git(url):
    config_dict = swh.core.config.load_from_envvar()
    loader = swh.core.config.get_component_from_config(
        config=config_dict,
        type="loader-git",
        id=os.environ.get("SWH_LOADER_GIT_INSTANCE", "default"),
    )
    return loader.load(url=url)
```
---
## Alternatives to the proposed structure of package/cls in config
> [name=tenma]
The instance aspect is both powerful and easy to understand.
Need to discuss the type and arguments aspects which are inconsistent (needed for some objects but not others).
`package` and `cls` are type identifiers. Keys are parameters and can be defined as anything that is neither a type nor an instance.
### Drop the first level and include in type
```
uffizi:
  cls: storage.remote
  url: ...
local:
  cls: storage.pgsql
  conninfo: ...
  objstorage: <foo>
ingestion:
  cls: storage.pipeline
  steps:
    - cls: filter
    - cls: buffer
    - <uffizi>
foo:
  cls: objstorage.pathslicing
```
More generic; it does not impose grouping components of the same "package", but instance names may then need to be more descriptive.
### Do not restrict on swh service components
Q: do we only define instances of swh service components in the config file?
Using the more generic notion of role/namespace vs the current specific notion of package offers the flexibility of having names that do not refer to a swh component but to any data structure (with `type` having to be fully qualified, e.g. `model.model.Origin`), which can be shared by multiple instances (it would represent a data structure that does not belong specifically to a given swh service package).
It would need a register of data structures that can be used, like the other proposals.
For example, for a core/model/graph/tools/other_library data structure, do we want to be able to specify something along the lines of:
```
model:
  orig1:
    cls: Origin
    url: ...
  swhid1:
    cls: SWHID
    swhid: ...
  node1:
    cls: MerkleNode
    data: ...
somelib:
  datastruct1:
    cls:
```
---
## Key points to the specification
- configuration declaration syntax
- relation between definitions
- external API (CLI options, environment variables)
- internal API (core.config library)
- instantiation of components and configuration loading in entrypoints
- precise scope and impacts
---
# Synthesis of the meeting 2020-10-21
Participants: @tenma, @douardda, @olasd, @ardumont
Reporter: @tenma
Q = question
R = remark
OOS = Out of scope
Dates indicate the chronology of the report.
***Before starting to report the concepts tackled through the meeting, some points about terminology. The terminology was difficult to choose while writing this synthesis, so it is not completely consistent. This section tries to give a basis for discussion.
## Terminology
### Initial remarks (2020-10-22)
The term `type identifier` (`TID` for short) will be used in place of `package` from now on for the 1st-level names.
`package` represented both the SWH Python package name and the base type of SWH components available in this package, in my initial, partial view of the subject.
Now that it has been shown that there is no 1-to-1 mapping between SWH component names and SWH packages, the name `package` is no longer accurate.
The more generic `type identifier` reflects the flexibility we introduce with a top-level register of component types referenceable(?) in configuration.
The 2nd level maps an `instance identifier` (`IID`) to an instance definition mapping.
`TID`s map to actual objects of some type, whereas `IID`s exist only in the configuration system.
### Terminology discussion preparation (2020-10-30)
Need to choose name for every concept of the system.
- configuration language:
  - config tree: tree containing all the definitions of a config file
  - config object: any level of the config tree
  - config dictionary: collection of items, under any identifier
  - item/attribute = key + value
  - identifier to type, used in depth level 1; e.g. "storage"
  - identifier to instance, used in depth level max-1; e.g. "uffizi", "celery"
  - instance object
  - singleton object
  - reference object
- programming constructs manipulated through this system:
  - objects = components|singletons
  - swh components (unit comprising config items)
  - swh services (unit comprising components)
  - external components
***Here really starts the report written as of 2020-10-22.
Examples written/updated during the meeting were not copied here.
## Single implementations of type
The `cls` attribute of instances specifies the implementation/alternative/flavour to use.
It used to be required along with `args`, because all configurable SWH components were polymorphic.
Components that have no such feature need no `cls` attribute in their configuration.
R: An indirection layer such as a factory may be defined for such components in order to keep consistency and allow polymorphism if needed later. Alternatively, for polymorphic components, rather than a factory that needs to be known by user code, an abstract base class constructor could abstract this indirection layer away.
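A sketch of that indirection idea for a single-implementation component (all names below are hypothetical):
```python
from typing import Any, Dict, Type


class WebClient:
    """Single-implementation component: no `cls` needed in its config."""

    def __init__(self, base_url: str, token: str = ""):
        self.base_url = base_url
        self.token = token


# keeping a factory anyway preserves consistency with polymorphic
# components and leaves room for alternative implementations later
_CLASSES: Dict[str, Type[WebClient]] = {"default": WebClient}


def get_web_client(cls: str = "default", **kwargs: Any) -> WebClient:
    return _CLASSES[cls](**kwargs)


client = get_web_client(base_url="https://archive.softwareheritage.org/api/1/")
```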
## Default instances
One instance in a package/namespace can be labeled as `default`.
It will be selected when an instance of a given component type is requested but no IID is given.
Q: could the default instance be implied when instantiating with no IID?
## Singleton object definitions
For the sake of both clarity and reuse, ad-hoc configuration objects can be defined at top level and be referenced. These are not SWH or external components.
Those definitions are composed of an identifier (equivalent to an IID) at the 1st level and a YAML object, possibly recursive, at further levels.
They are instantiated as schemaless dictionaries/lists.
Q: How to allow them in the syntax and differentiate them from schemaful definitions?
Q: must such a singleton object be referenced in at least one place in the file it is part of, or would it otherwise be ignored?
## Top-level component register
A register of components allowed in configurations is to be implemented in core.config.
It will consist of a `(TID : qualified_constructor)` mapping.
These entries will not be hardcoded in this mapping, but registered at import time from the package that defines the components.
`qualified_constructor` must be Python absolute import syntax for the callable which is either a factory function or a class object.
It may be in quoted form (string) or actual object form. String form avoids imports, but no static check of existence can be performed. Given that registering is done in the package responsible for defining the component, object form is chosen.
Components that can be registered may be any SWH service component which is public (= has a Python object API).
In practice, only one main component per SWH service encapsulates all configuration for the components used in the service:
- API servers
- service workers
Q: use object or quoted form for `qualified_constructor` ?
Q: what about these config objects: journal related like journal-writer, most under deposit, celery-related...
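A rough sketch of what such a register could look like (the registration helper and its usage are hypothetical):
```python
from typing import Callable, Dict

# TID -> qualified constructor (factory function or class), in object form
COMPONENT_REGISTER: Dict[str, Callable] = {}


def register_component(tid: str, constructor: Callable) -> None:
    """Called at import time by the package defining the component."""
    COMPONENT_REGISTER[tid] = constructor


# e.g. in swh.storage, at registration time:
# from swh.storage import get_storage
# register_component("storage", get_storage)
```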
## CLI parametrization of configuration loading
A CLI option may be passed to specify an instance ID (only at 2nd level?) when several alternatives are provided in the configuration.
Such option must be declared statically in CLI code.
A CLI option may be passed to override an attribute in the configuration.
Such option must be declared statically in CLI code.
OOS: extend usage to any IID in instance mapping
OOS: dynamic handling of any such options for any attribute
### Example propositions
Default behavior (single, manual, component instantiation)
* `swh objstorage replayer --from-instance default`
* `swh objstorage replayer` (uses config from default instance)
Nice to have (multiple, manual, component instantiations)
* `swh objstorage replayer --src local --dst s3`
* `swh objstorage replayer --src local --dst s3 --journal-client docker`
<!-- -->
* @douardda's proposal: on-the-fly generation of instance config via syntactic sugar
`swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3`
## Library API (proposition, 2020-10-23)
Moved to the specification below.
## Emerging problems (2020-10-23)
Q: How to allow and handle both typed and untyped (ad-hoc) objects?
Q: how to identify what is an instance in the definitions? Will it require special handling wrt referencing mechanics (source and destination).
Q: do we want to support reference to external config definitions or require autonomy? = always 1 file for a whole service, or composition of partial definitions?
In the former case, similar sections in multiple config files will need synchronized updates. One possibility is having each SWH component keep its own config file and composing them on demand (using puppet?) into a standalone file for prod/tests. But it is not easy to have multiple instances this way. Fill a template file on demand for each service?
---
# Specification outline
## Synopsis
## General terminology/concepts
## Scope/use cases
## Rationale: existing, limitations, wanted
## Specific terminology/concepts
## Language description
## Library
## Client code
## Environment
## Out-of-scope, rejected ideas
## Limitations
## Impacts
## Implementation plan: library, use, tests, prod
---
# Specification 2020-11-02
(Writing sections breadth-first: deeper at each iteration)
Notations:
[opt=id]: concept subject to acceptance or removal, with identifier for easier reference. Usage similar to a feature flag.
[alt]: alternative to any surrounding [alt] statement
[rem]: remark
[OOS]: out-of-scope remark or idea
[rej]: rejected remark or idea
[Q]: question to be answered
## Synopsis
The configuration system evolved partially with use cases. Initial design decisions applied to all use cases turned out to be both too hard to reason about/unstable for production and too inflexible for cli or testing.
## General terminology and concepts
For the purpose of this specification.
- Component: a unit comprising data and/or functions, which provides functionality through an interface and has associated dependencies
- SWH component: a component consisting of a Python class or module, that provides a functionality specific to one or more SWH services. The closed set of SWH components can appear in configuration definitions.
- SWH service: collection of SWH components. Corresponds roughly to the docker services developed by SWH. Includes API, worker, and journal services.
## Scope/use cases
All SWH services and components.
Environments:
- Production service
- CLI
- Testing
Configuration needs:
- system service: systemd service vs docker + shell script
- server: gunicorn vs Flask/aiohttp/Django devel server
- CLI entrypoint
- server application: app_from_configfile vs django config
- worker:
- component: constructor/factory
Configuration sources:
- environment: CLI parameter, CLI path, envvar parameter, envvar path, input stream
- code: literal
## Rationale
A.
- implicit/hard-to-follow loading: configuration may be loaded automatically through a number of ways, which is good for CLI cases but not for prod cases
- dependency on environment: must be able to instantiate a component using only ad-hoc configuration, for testing purposes
-> different APIs for different use cases, all compatible (return config)
B.
- composition coupling: every owner component must know how to instantiate an owned component
- heterogeneity: configuration loading (CLI) or instantiation (component factory) is implemented differently everywhere and could be abstracted away
-> dependency injection, component library API
C.
- should be able to specify alternative configurations for one component constructed ahead of time, and choose it at runtime/loadtime
- should be able to factor common configuration out
- uniform, complete, concise: the configuration could theoretically be centralized in one file, which would give a clear overview of the configuration and of the interactions between all the components
-> instances, references, singletons
## Specific terminology and concepts of the proposition
Basis for discussing terms: [terminology proposition](https://hackmd.io/8hxTL4XMQoO2RVKtqFqM2g?both#Terminology-discussion-preparation-2020-10-30)
Used in this specification:
TID = type ID
IID = instance ID
AID = attribute ID
QID = qualified ID
ID = any of the above
ad-hoc object = singleton
"attribute ID" = path to the "key" of an attribute
## Language description
### Target example
```yaml=
storage:
  default:
    cls: buffer
    min_batch_size:
      content: 1000
    storage: <storage.filtered-uffizi>
  filtered-uffizi:
    cls: filter
    storage: <storage.uffizi>
  uffizi:
    cls: remote
    url: http://uffizi.internal.softwareheritage.org:5002/
loader-git:
  default:
    cls: remote
    storage: <storage.default>
    save_data_path: "/srv/storage/space/data/sharded_packfiles"
random-component:
  default:
    cls: foo
    celery: *celery
# Not a component: no type id, only instance id
_:
  celery: &celery
    task_broker: amqp://...
    task_queues:
      - swh.loader.git.tasks.UpdateGitRepository
      - swh.loader.git.tasks.LoadDiskGitRepository
```
### Syntactic overview
Based on YAML:
- restricted to YAML primitive types (includes dicts and lists)
- restricted on document structure (see grammar below)
- replace the YAML reference system with ours (if not hookable)
3 levels of depth: type, instance, attribute
Instance definitions are composed of an ID and a mapping to attributes.
1 instance <-> N attributes.
Component type definitions are composed of an ID and a mapping to instances.
1 type <-> N instances.
These instances are variants of the component: same type but different constructions.
Singletons are instances defined outside type definitions, so they live at top-level and have no type.
:::warning
This model is syntactically complicated; here are alternatives to make it more regular:
[alt=typePrefix] no type level, so only 2 levels; type and instance identifiers are merged as "type.instance".
[alt=typeAttr] move type identifiers to the attribute level, as a special attribute "type"
[alt=singletonType] use a dummy type for singletons: "singletons"
:::
References can be made to an object defined somewhere else in the tree, using a qualified identifier.
Legal forms are defined to be from an attribute value to an instance identifier.
[opt=refkey] Legal forms also include references from an attribute value to an attribute identifier. OOS
[opt=recattr] There may be recursion from an attribute value to an instance value definition. This allows anonymous definition of an instance object in an attribute. OOS
### Grammar
WARNING: hopefully consistent grammar mixup. May be offensive to purists.
Some definitions have alternatives noted with `|=`.
```python=
ID ~= PCRE([A-Za-z0-9_-]+) # Could be stricter, e.g. snake_case
ID = TID | IID | AID
QID = (TID ".")? IID ("." AID)* # opt: refkey, skey
|= TID "." IID
ref = "<" QID ">"
attribute_value = YAML_object | ref
|= YAML_object | ref | attributes # opt: reckey
attributes = YAML_dict(AID, attribute_value)
instances = YAML_dict(IID, attributes)
singleton = YAML_object # no opt: sref, skey
|= YAML_dict(AID, attribute_value)
config_tree = YAML_dict(ID, YAML_dict) # loose typing
|= YAML_dict((TID, instances) | (IID, singleton))
```
Alternative definition of identifiers (always qualified):
```python=
ID ~= /[A-Za-z0-9_-]+/
TID = ID
IID = (TID ".")? ID
AID = IID "." ID+
QID = TID | IID | AID
```
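For illustration, a minimal parsing sketch of the stricter `TID.IID(.AID)*` form, following the character class above; the helper name `parse_qid` is not part of the spec:
```python=
import re

ID_RE = r"[A-Za-z0-9_-]+"
QID_RE = re.compile(
    rf"^(?P<tid>{ID_RE})\.(?P<iid>{ID_RE})(?:\.(?P<aid>{ID_RE}(?:\.{ID_RE})*))?$"
)

def parse_qid(qid: str):
    """Split a qualified identifier into (TID, IID, optional AID path)."""
    match = QID_RE.match(qid)
    if match is None:
        raise ValueError(f"invalid QID: {qid!r}")
    return match.group("tid"), match.group("iid"), match.group("aid")
```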
### Identifier
Identifier is abbreviated ID.
Type ID is abbreviated TID.
Singleton ID is equivalent to Instance ID, abbreviated IID.
(Attribute) Key is identified by Attribute ID, abbreviated AID.
Qualified ID is abbreviated QID.
QID is a sequence of ID of the form (TID, IID) for component instances or (IID) for singletons. Its string form joins each field with ".".
[opt=refkey] May have form (TID, IID, AID) to reference component instance attributes. Useful either to reference another attribute of current instance, or any other attribute, except those defined in singletons.
[opt=reckey] May have form (TID, IID, AID*) to reference recursive component instance attributes.
[opt=skey] May have form (IID, AID*) to reference singleton attributes (sequence of AID because recursive).
[opt=sref] singleton attributes may reference any attribute.
### Attribute
An attribute is a (key, value) pair; the set of attributes forms an instance dictionary.
An attribute value is either a YAML object or a reference.
[opt=recattr] An attribute value may also be an instance dictionary.
Attribute level is any level under instance level, recursive or not.
### Reference
A reference is syntactically defined as a qualified identifier enclosed in chevrons. Its source is an attribute value and its target is the object identified by the QID it carries.
The reference is deleted when it is resolved by the reference resolution routine.
### Type
Python type of a component to be instantiated and configured.
It is referred to indirectly through a TID in a configuration definition, and through a component constructor in the component type register.
### Instance
Specific instantiation of a component, distinguished from the others by the set of attributes used to initialize it.
All identified instances of a type must be specified in the instance level of a configuration definition.
[opt=deftinst] An instance identified as "default" is instantiated if a TID but no IID is provided to the instantiation routine.
[opt=subinst] Instances may be referenced in an attribute value and be recognized as instance declarations, i.e. be instantiated and initialized rather than just passed as-is to the constructor.
[opt=anoninst] Anonymous instances may be defined in an attribute value, and be recognized as instance declarations.
-> [Q] That would need a type declaration, which is not yet handled in the spec, to identify it as an instance and be able to instantiate it. One inflexible option is to require the parent AID (which is the child IID) to be a TID, thus restricting its name to a known type ID. The other option is to specify the TID in a dedicated child attribute named something like `type` or `TID`.
### Singleton
Singleton objects are syntactically similar to instances.
Unless otherwise stated, the same rules apply.
They do not correspond to a predefined type, so they have no schema or attached semantics.
They are instantiated as a dict tree.
## Library
### Register
The component type register, abbreviated register, is a `(TID, qualified_constructor)` mapping, defined in the configuration library.
It is used by the component resolution routine to resolve type identifiers to Python type constructors.
Entries in this mapping are to be registered through the component registration library routine. This registration may happen anywhere provided it is executed at loading/import time. It is advised to register the component in the package that defines it.
`qualified_constructor` must be Python absolute import syntax for the object creating callable, which is either a factory function or a class.
It may be defined:
[alt] in quoted form (string).
[alt] in class object form.
[rem] The string form avoids an import, but no static existence check can be performed. If registration is done in the package responsible for defining the component, the object form is preferable.
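A minimal sketch of what the register and registration routine could look like; the names `_REGISTRY` and `register_component` are illustrative, not the final API:
```python=
from typing import Callable, Dict, Union

# (TID, qualified_constructor) mapping; illustrative name.
_REGISTRY: Dict[str, Union[str, Callable]] = {}

def register_component(tid: str, constructor: Union[str, Callable]) -> None:
    """Associate a type ID with its constructor, in object or quoted form."""
    _REGISTRY[tid] = constructor

# object form, e.g. at import time in the package defining the component:
#   from swh.storage import get_storage
#   register_component("storage", get_storage)
# quoted form, no import needed but no static existence check:
#   register_component("storage", "swh.storage.get_storage")
```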
Components that can be registered may be any SWH service component, SWH support component or external component that is public (i.e. has a Python object API).
In practice, only one main component per SWH service encapsulates all configuration for the components used in the service:
- API servers
- service workers
[Q] what about these config objects: journal-related ones like journal-writer, most of those under deposit, celery-related ones...
### Type implementations
This section is informational.
A component type may have multiple implementations.
There is no specific support for it in this system, but as this concept may appear in configuration, related considerations may be worth noting.
[rem] The component type of an instance may be abstract, in which case a concrete type must be determined by the component constructor.
A specific attribute of instances specifies implementation or flavour to use. It is commonly identified as `cls`, but could be `impl` or `flavor`.
It used to be required along with `args`, which is now deprecated, because all configurable SWH components were polymorphic.
Components that have no such feature need no `cls` attribute in their configuration.
Alternatively, polymorphic components may be instantiated without `cls`, in which case a default implementation will be used.
[rem] An indirection layer such as a factory may be defined for monomorphic components in order to keep consistency and allow polymorphism later if needed. Alternatively, for all components, an abstract base class constructor would abstract this indirection layer away; unlike a factory, it is derivable from the component type by user code.
### Instantiation
Instantiating is the process through which a concrete object is constructed from a model and data describing its (initial) state. In the context of this system, a Python object is created by calling its constructor with the set of attributes associated to a particular instance in a configuration definition.
The input is a QID identifying an instance and a configuration tree (dictionary) containing the instance and its dependencies (reference targets).
The output is a component instance.
The process is composed of the following steps in order.
1. The instance dictionary, containing an attribute set, is fetched by QID from the configuration.
2. Resolve references.
   1. Identify all reference definitions in the instance dictionary.
   2. Resolve each to its reference target object, which may be atomic or composed.
   3. Replace the reference source by the resolved object.
3. [opt=subinst] Interpret and compose instances.
   1. Identify all component instance definitions in the instance dictionary.
   2. [opt=anoninst] Identify anonymous instance definitions.
   3. Instantiate each instance.
   4. Replace each definition by the instantiated object.
4. The component type of the instance is resolved, from the TID contained in the QID, to a component constructor.
5. The component constructor is called, passing the updated instance dictionary as arguments.
[opt=subinst] Identifying instance definitions requires a TID/IID.
[Q] As parent (`ID: {instance attrs}`) or child (`{ID: ..., instance attrs}`)?
[opt=deftinst] An instance identified as "default" is instantiated if a TID but no IID is provided to the instantiation routine.
Instances must be instantiated only once and reused at each reference source.
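A minimal sketch of this process, wired to the routine names used in the API overview below (`get_instance`, `resolve_references`, `resolve_component`, `instantiate_component`); ordering and signatures are simplified and the [opt=subinst] step is elided:
```python=
def create_component(config: dict, tid: str, iid: str = "default"):
    # resolve all <TID.IID> references in the configuration tree
    resolved = resolve_references(config)
    # fetch the instance dictionary (attribute set) by QID
    instance_config = get_instance(resolved, (tid, iid))
    # [opt=subinst] nested instance definitions would be instantiated here
    # resolve the TID to a constructor via the register, then call it
    constructor = resolve_component(tid)
    return instantiate_component(instance_config, constructor)
```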
### Interpretation
(Validation/Conversion/Interpretation)
This section is informational.
Interpretation of attributes beyond stated above is out of scope and left to the component constructors to do.
Standard Python typing available in constructors may be used as the basis for the validation of configuration data.
Validity of structure, value and existence may be checked.
Conversions may also be performed.
[opt=validate] The library provides generic validation primitives and a validation routine based on a data model specification object.
### Loading
(Loading/Defaults/Merging)
Loading is the process of fetching data from a storage medium into a memory space which is easily accessible to the processing system. In the context of this system, this data is then read and converted into a Python object.
The loading source may be: an I/O file abstraction (whatever its backing source), an operating system path to such a file abstraction, or such a path resolvable from an environment variable or a configuration file ID.
Only a Python dictionary is accepted as the holder of this data once loaded. A default configuration definition, either a dictionary literal or a loaded configuration, can be specified, in which case every attribute absent from the loaded configuration will be set to its default value.
### API overview
Library should be imported as `config` everywhere for clarity and uniformity (e.g. `import swh.core.config as config` or `from swh.core import config`).
[rem] Existing routine `merge_configs` should be moved to another module as `merge_dicts`.
Configuration object: mapping
WARNING:
In the following examples, names subject to change.
Code is inspired by Python, but abstracted to focus on typing.
`DeriveType` simply denotes a type derived from an existing one, with no consideration of compatibility with the base type or any other.
### Loading API
[rem] Should choose term among `load`, `read`, `from`, `by`, `config`
Example names: `read_config`, `load_envvar`
```python=
Config = DeriveType(Mapping) # Tree. Allow only mapping in config definition top-level
ConfigFileID = DeriveType(str) # opt=fileid
Envvar = DeriveType(str)
File = io.IOBase
Path = os.PathLike
load: (Union[File,Path,Envvar,ConfigFileID], defaults:Config?) -> (Config)
load_from_file: (File, defaults:Config?) -> (Config)
load_from_path: (Path, defaults:Config?) -> (Config)
load_from_envvar: (Envvar, defaults:Config?) -> (Config)
load_from_name: (ConfigFileID, defaults:Config?) -> (Config) # opt=fileid
```
no default configs
Loads the source as a YAML tree and converts it to a Python recursive mapping.
[opt=fileid] may use an ID to reference files independently of their path or extension in the loading mechanism. This is sugar that existed before but may no longer be wanted. OOS
[Q] Where to check for loadable path? In loading routines or user code? May duplicate behavior.
[Q] Should envvar be hardcoded in library or default? Same for default path.
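A minimal sketch of the loading routines above, assuming PyYAML; the optional `defaults` handling and the `ConfigFileID` variant are left out, and the envvar name is only illustrative given the open question above:
```python=
import io
import os
from typing import Union

import yaml

def load_from_file(f: io.IOBase) -> dict:
    # load the YAML tree and convert it to a Python recursive mapping
    return yaml.safe_load(f) or {}

def load_from_path(path: Union[str, os.PathLike]) -> dict:
    with open(path) as f:
        return load_from_file(f)

def load_from_envvar(envvar: str = "SWH_CONFIG_FILENAME") -> dict:
    # which envvar to use is still an open question; this one is an example
    return load_from_path(os.environ[envvar])
```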
### Instantiation API
OOS: every function but `create_component`
[rem] Should choose amongst `get`, `read`, `from_config`, `instantiate`, `component`, `instance`, `iid`.
Example names: `get_component_from_config`, `instantiate_from_config`, `create_component`, `read_instance`, `get_from_id`
```python=
TID = DeriveType(str)
IID = DeriveType(str)
QID = (TID, IID) # Simplified form, may be Sequence(ID)
Component = DeriveType(type)
ComponentConstructor = DeriveType(Callable) # Either type or function
InstanceConfig = DeriveType(Config)
```
```python=
create_component: (Config, QID) -> (Component)
create_component: (InstanceConfig, TID) -> (Component)
```
Returns an instantiated component identified by QID.
Uses `get_obj`, `resolve_references`, `resolve_component`, `instantiate_component`.
```python=
get_obj: (Config, QID) -> (Config)
get_instance: (Config, QID) -> (InstanceConfig)
```
Returns a config object (subtree) of the configuration identified by the config ID. May be used to get either all instances or a single instance, depending on whether the config ID has an instance ID part.
```python=
resolve_references: (Config) -> (Config)
resolve_reference: (Config, QID) -> (Config)
```
Replaces reference source with the object identified by reference target.
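A minimal resolution sketch: it walks the configuration tree and replaces every `<TID.IID>` string by the referenced instance dictionary; singleton references and cycle detection are not handled, and the names are illustrative:
```python=
import re

REF_RE = re.compile(r"^<(?P<qid>[A-Za-z0-9_.-]+)>$")

def resolve_references(config: dict) -> dict:
    def resolve(value):
        if isinstance(value, str):
            match = REF_RE.match(value)
            if match:
                tid, _, iid = match.group("qid").partition(".")
                # recurse so that chained references are resolved too
                return resolve(config[tid][iid])
            return value
        if isinstance(value, dict):
            return {key: resolve(val) for key, val in value.items()}
        if isinstance(value, list):
            return [resolve(item) for item in value]
        return value
    return resolve(config)
```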
```python=
find_instances: (InstanceConfig) -> (Set(InstanceConfig))
```
[opt=subinst] Finds all instance definitions nested in this instance definition and returns them.
[Q] How to identify instances?
[Q] Should it recurse into nested definitions or just one level?
```python=
resolve_component: (TID) -> (ComponentConstructor)
```
Looks up the core.config register to get the constructor for a TID.
```python=
instantiate_component: (InstanceConfig, ComponentConstructor) -> (Component)
```
Instantiates a component using the given constructor and an instance configuration mapping.
### Validation API
[opt=validate]
This section proposes a framework for validating instance definitions in a fairly lightweight and flexible way, for use by component constructors or injectors.
```python=
check: (Config) -> (Boolean)
check_definitions: (Config) -> (Boolean)
check_component: (InstanceConfig, ModelSpec) -> (Boolean)
generate_spec_from_signature: (ComponentConstructor) -> (ModelSpec)
```
`check`: validate both language and instances.
`check_definitions`: validate whole definition against language spec.
`check_component`: validate instance definition against component spec.
This is a template function which is parametrized by a user-specified spec.
#### Model specification
```python=
AttrKey ~= String("[A-Za-z0-9_\-]+")
AttrVal = YAML_object
# Path in the instance configuration dictionary
Path ~= String("([A-Za-z0-9_\-]+/)+")
# Wrapper to convert falsey values or exceptions to False, otherwise True
ensure_boolean: Booleanish -> Boolean
# Generic and context-sensitive signatures for flexibility
value_check: ((AttrVal) | (AttrVal, InstanceConfig)) -> Booleanish
# If not optional existence check should succeed, else not performed.
optional_check: ((AttrVal, InstanceConfig) -> Booleanish) | Booleanish
# Checks whether attr exists at one of given paths, or anywhere if no path.
# No reason to have user customise existence check.
existence_check: (AttrVal, Set(Path), InstanceConfig) -> Boolean
# Here is the model specification
# Kwargs: best I found for a typed mapping where every item is optional
AttrProperties = Kwargs(value_check, optional_check, Set(Path))
# None for no checks on attribute
ModelSpec = Mapping(AttrKey, AttrProperties | None)
```
`check_component` verifies that all properties of every attribute hold in the instance definition, based on a user-defined model specification. The model specification can leverage primitive check functions and user-defined check functions. Supported checks are value checks and existence-in-tree-structure checks, which are distinguished for expressiveness.
The model specification lists each (unqualified) attribute that may exist in the configuration definition, along with the attribute properties that must hold.
An attribute may or may not be optional, i.e. whether validation should fail on its absence, based on the boolean value of `optional_check`. `optional_check` may be a callable that determines whether the attribute is optional based on the configuration context and returns a booleanish value, or directly a booleanish value. It is run in a wrapper which converts falsey values or exceptions to `False`, and anything else to `True`. A required attribute is checked for existence at a given set of paths in the tree if any, or anywhere in the tree otherwise. An optional attribute is not checked for existence but still for a legal value.
The value check may be any callable that accepts either a single value, or a value and the configuration context (instance definition), and returns a booleanish value, handled as above. This makes it possible to use many existing functions or object constructors to do the validation, e.g. `int`, `re.match`, `isinstance(Protocol)` or a function verifying that a relation to another attribute in the definition holds.
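As an illustration, a hypothetical model specification for a remote-storage-like component; the attribute names are examples, not an actual SWH component spec:
```python=
import re

hypothetical_storage_spec = {
    # required attribute, value validated by a regex match
    "url": dict(value_check=lambda value: re.match(r"^https?://", value)),
    # optional attribute, value must be convertible to int
    "timeout": dict(value_check=int, optional_check=True),
    # required only when cls is "remote", using the configuration context
    "auth_token": dict(optional_check=lambda value, cfg: cfg.get("cls") != "remote"),
    # no checks at all on this attribute
    "extra": None,
}

# check_component(instance_config, hypothetical_storage_spec) -> bool
```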
#### Helper for specification generation
`generate_spec_from_signature`: generates a model specification where annotations are used as `value_check` functions wherever possible, arguments are optional or not depending on the existence of a default value, and the path set contains only the tree root. A mapping from types to validators is used to validate the most common types; others will only be checked by `isinstance`. This is a helper function to generate a spec draft ahead of time, which must be corrected and stored alongside the corresponding constructor, as it cannot be guaranteed to function properly in all cases.
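A minimal sketch of such a generator using the standard `inspect` module; the output layout mirrors the model specification above:
```python=
import inspect

def generate_spec_from_signature(constructor):
    spec = {}
    for name, param in inspect.signature(constructor).parameters.items():
        if name == "self":
            continue
        annotation = param.annotation
        spec[name] = {
            # annotation used as value_check wherever possible
            "value_check": None if annotation is inspect.Parameter.empty else annotation,
            # optional when a default value exists
            "optional_check": param.default is not inspect.Parameter.empty,
            # path set containing only the tree root
            "paths": {"/"},
        }
    return spec
```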
Components with multiple implementations:
Operations based on function signatures, like validation but also instantiation, need a way to map the `cls` argument to the concrete type and constructor signature.
A solution to automatically use the right constructor is to implement single dispatch and overloading on the main constructor. Every method may still call the main one, but must have a signature compatible with that of the concrete class constructor selected by `cls`.
See also ["Library/Type implementations"](#Type-implementations) remark about abstract constructors.
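For illustration, a factory dispatching on `cls` for a hypothetical polymorphic component; the class names are placeholders, mirroring the existing `get_*` factory pattern rather than prescribing it:
```python=
class PathSlicingObjStorage:
    """Placeholder concrete implementation, for illustration only."""
    def __init__(self, **kwargs):
        self.config = kwargs

class S3ObjStorage:
    """Placeholder concrete implementation, for illustration only."""
    def __init__(self, **kwargs):
        self.config = kwargs

def get_objstorage(cls: str = "pathslicing", **kwargs):
    """Map the `cls` attribute to the matching concrete constructor."""
    implementations = {"pathslicing": PathSlicingObjStorage, "s3": S3ObjStorage}
    try:
        constructor = implementations[cls]
    except KeyError:
        raise ValueError(f"unknown objstorage class: {cls!r}")
    return constructor(**kwargs)
```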
## Client code (need contributions)
Demonstration of features in every use cases.
CLI, WSGI, worker, task, daemon, testing
### CLI entrypoint
* scan the current directory against the archive in docker
* (cli flag) `swh scanner --scanner-instance=docker scan .`
* (equivalent env var) `SWH_SCANNER_INSTANCE=docker swh scanner scan .`
* actual python calls in the cli endpoint
```python=
import swh.core.config as config

scanner_instance = "docker"  # from cli flag or envvar
config_path = "~/.config/swh/default.yml"  # from cli flag or envvar or CLI default or core.config default
config_dict = config.load(config_path)
scanner = config.create_component(
    config_dict,
    config.QID(type="scanner", instance=scanner_instance),
)
scanner.scan(".")
```
### API Server entrypoint
rpc-serve, WSGI app
```python=
app_instance = None  # module-level cache

def make_app_from_configfile() -> StorageServerApp:  # Or any other module App
    global app_instance
    if not app_instance:
        config_dict = config.load_from_envvar()
        rpc_instance = os.environ.get("SWH_STORAGE_RPC_INSTANCE", "default")
        app_instance = config.create_component(
            config_dict,
            config.QID(type="storage-rpc", instance=rpc_instance),
        )
        if not check_component(app_instance, "storage-rpc"):
            # raise ConfigurationError or something?
            raise ValueError("invalid storage-rpc configuration")
    return app_instance
```
> [name=ardumont] Completed the snippet ^ (unsure about it)
### Celery task entrypoint
Celery task code
```python=
@shared_task(name="foo.bar")
def load_git(url):
    config_dict = config.load_from_envvar()
    loader_instance = os.environ.get("SWH_LOADER_GIT_INSTANCE", "default")
    loader = config.create_component(
        config_dict,
        config.QID(type="loader-git", instance=loader_instance),
    )
    return loader.load(url=url)
```
> [name=ardumont] we moved away from passing parameters to the `load` function. The url parameter is to be passed to the constructor of the loader (same goes for lister, etc...)
### Testing / REPL
Example test
```python=
import pytest

import swh.core.config as config


@pytest.fixture
def config_dict():
    return {...}


def test_config(config_dict):
    type_ID = "objstorage"
    instance_ID = "test_1"
    instance = config.create_component(
        config_dict,
        config.QID(type=type_ID, instance=instance_ID),
    )
    ...


@pytest.fixture
def config_path(datadir):
    return f"{datadir}/other.yml"


def test_config2(config_path):
    config_dict = config.load_from_path(config_path)
    instance = config.create_component(
        config_dict,
        config.QID(type="objstorage", instance="test_1"),
    )
    ...
```
## Environment
The environment parameters comprise any dependency of the configuration system that is external to the code.
This includes: configuration directory, configuration file, environment variables and command-line parameters.
### Configuration directory
SWH configuration directory: `SWH_CONFIG_HOME=$HOME/.config/swh`
### Configuration file
YAML file with a `.yml` extension, containing only the configuration data.
Default if none is specified to the generic loading routine: `$SWH_CONFIG_HOME/default.yml`.
[opt=conffileid] a configuration file id corresponding to the basename of a configuration file (without extension).
-> [Q] then only from `$SWH_CONFIG_HOME` or have a register?
### Core configuration file parameter
This feature is to be built into SWH core library.
Specify the path to the configuration file to use for a whole service:
path_part = `path` | `file`
Environment variable: `SWH_CONFIG_<PATH_PART>`
CLI option: `swh --config-<path_part>`
[rem] "path" is a more precise term than "file".
### Specific configuration parameters
A CLI option may be passed to specify an instance ID (only at 2nd level) when several alternatives are provided in the configuration.
Such an option must be declared statically in CLI code.
Specify the instance configuration to use for a given component, using instance ID:
id_part = `instance` | `id` | `iid` | `cid`
`SWH_<COMP>_<ID_PART>` `--<comp>-<id_part>`
[rem] Any variant containing "id" is more precise than simply "instance".
A CLI option may be passed to override an attribute in the configuration.
Such an option must be declared statically in CLI code.
Specify any other predefined configuration option:
`SWH_<COMP>_<OPTION>` `--<comp>-<option>`
[OOS] dynamic handling of any such options for any attribute, similar to what `click` permits.
### Configuration priority
CLI has precedence over envvars.
Environment parameters have precedence over whole definitions (from file or code) and whole definitions have precedence over defaults, per-attribute.
This follows the principle that the particular takes precedence over the general.
CLI param > envvar param > CLI file > envvar file > default file > defaults literal
These precedence rules must be implemented in entrypoint client code, with the help of the library loading API. Only part of them may be implemented, the minimum being to accept a whole definition through either code or an envvar.
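A minimal sketch of how an entrypoint could apply part of this chain; option and envvar names are illustrative instances of the patterns described above:
```python=
import os

DEFAULT_PATH = os.path.expanduser("~/.config/swh/default.yml")

def resolve_config_path(cli_path=None):
    """CLI file > envvar file > default file."""
    return cli_path or os.environ.get("SWH_CONFIG_PATH") or DEFAULT_PATH

def resolve_instance_id(cli_instance=None, envvar="SWH_STORAGE_INSTANCE"):
    """CLI param > envvar param > "default" instance."""
    return cli_instance or os.environ.get(envvar) or "default"
```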
### Example environment specifications (need contributions)
Using this objstorage replayer configuration file:
```yaml
objstorage:
  local:
    cls: pathslicing
    root: /srv/softwareheritage/objects
    slicing: "0:2/2:5"
  s3:
    cls: s3
    s3-param: foo
journal-client:
  default:
    # single implem: no cls needed
    brokers:
      - kafka
    prefix: swh.journal.objects
    client-param: blablabla
  docker:
    brokers:
      - kafka.swh-dev.docker
    ...
objstorage-replayer:
  default:
    src: <objstorage.local>
    dst: <objstorage.s3>
    journal-client: <journal-client.default>
```
CLI usage:
Specify no instance, use default instance config:
* `swh objstorage replayer`
Specify instance:
* `swh objstorage replayer --from-instance default`
Specify nested instances (opt=subinst):
* `swh objstorage replayer --src local --dst s3`
* `swh objstorage replayer --src local --dst s3 --journal-client docker`
CLI options to be defined statically.
#### Other proposals
@douardda's proposal: on-the-fly generation of instance config via syntactic sugar
`swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3`
[OOS] @tenma's proposal: dynamic handling of cli options wrt schema
`swh objstorage replayer --objstorage-replayer.src=objstorage.local --objstorage-replayer.journal-client=journal-client.default`
- generic, verbose, arbitrary attribute setting
## Limitations (need contributions)
Depends on chosen functionalities.
## OOS, rejected ideas (need contributions)
Depends on chosen functionalities.
## Impacts (need contributions)
- core.config library
- main constructor/factory of each SWH component: type mapping or dynamic dispatch, use of library APIs for validating
- entrypoints: use of library APIs for loading and instantiating
- configuration files format
- environment variables and cli calls in production+docker environments
## Implementation plan: library, ops code, tests, prod (need contributions)
(Proposition)
Prepare for easy switch and rollback by creating configuration copies conforming to the new system, and code conforming to the new system in separate branches.
- implement the whole library in the same file as before
- migrate tests (at any moment)
- prepare new config files, and service definitions that use them
- migrate services one by one following SWH dependencies:
- add needed declarations along with constructors
- entrypoint loading, instantiating and injecting (if opt=subinst)
- remove deprecated code
# Synthesis of the meeting 2020-11-24
Participants: tenma, douardda, olasd, ardumont
Language:
- use "_" tid for all singletons. Then QID become regular (IID, TID)
- USE subinst: instance references may appear arbitrary deep in instance
- OOS anoninst: every instance defined at 2nd level
- OOS refkey, reckey, skey, sref: no reference to keys and singletons, may use YAML ref syntax
APIs:
- OOS conffileid
- specify only the public API, no instantiation plumbing, KISS
- loading API: no merging, so remove defaults
- instantiation API: only distinguish component/singleton
- instantiation API: use instance methods? use keywords for QID
- validation library for later, now only constructor validation
Implementation:
- register: qualified_constructor should be a type object, not a str
- register: populate from setuptools declarations (see the sketch below)
references to constructor + documentation_builder (@olasd)
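A possible sketch for populating the register from setuptools declarations; the entry point group name `swh.config.components` is hypothetical:
```python=
import pkg_resources

def populate_register() -> dict:
    register = {}
    for entry_point in pkg_resources.iter_entry_points("swh.config.components"):
        # entry_point.name is the TID, entry_point.load() the constructor object
        register[entry_point.name] = entry_point.load()
    return register
```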
Environment:
- comprehensive environment handling is good, but should be opt-in
Notes about implementation:
- confirmed that anoninst needs an inline type declaration (but OOS)
- how to prepare definitions for instantiation?
  Replace the reference by the instance in the definition (duplication?)
  The QID could be inserted in the definition as a key, and added to an instance register; that would make handling more regular
Still open:
- rename singleton to ad-hoc object?
- forgot to choose terms in APIs
- forgot to cover usage of external components
- factory constructors instead of factory functions: not convinced
-> allow polymorphism, type and callable are associated
-> factories reimplement single dispatch, which is built into classes
Conclusion:
- no feedback on spec itself, as it is not ready (cleaned)
- make clean spec in Sphinx and create diff
- finish library draft and create diff (P878)
- @olasd for details on populating register and docs through setuptools