New SWH configuration scheme

https://forge.softwareheritage.org/T1410

A better config system that does not rely on implicit configurations.

Use Cases

tenma Aren't production and docker environments the same?
douardda Nope, since Docker makes use of entrypoint scripts whereas production uses systemd unit files, but they are pretty similar in most aspects.

Production deployment

  • celery workers ("tasks")
  • RPC servers (via gunicorn)
  • inter-communication between services (e.g. swh-web -> swh-storage, swh-scheduler -> swh-storage, etc.)
  • replayer / backfiller / etc.

Docker

  • celery workers ("tasks")
  • RPC servers (via gunicorn)
  • inter-communication between services (e.g. swh-web -> swh-storage, swh-scheduler -> swh-storage, etc.)
  • replayer / backfiller / etc.

Tests

  • easy to hack/specify any configuration used
  • consistent loading with the same runtime code (i.e. no test-specific "if"s)

REPL

douardda not sure what this use case really is about; can it be seen as part of the "cli tools" below? Personally I rarely need more than a s = get_storage(cls='memory') in a shell.
Also I definitely do NOT want typing s = get_storage() in a shell or a script to silently use a config file somewhere (be it from SWH_CONFIG_FILENAME or a default one).
ardumont Indeed, sounds reasonable. That means using the factory without any parameters should default to the in-memory implementations.
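The agreed behavior could be sketched as follows (all class names and the registry below are illustrative stand-ins, not the real swh.storage API):

```python
# Sketch of the agreed REPL behavior: calling the factory with no
# arguments falls back to an in-memory implementation, and never
# silently reads SWH_CONFIG_FILENAME or any default config file.

class InMemoryStorage:
    """Toy in-memory backend used as the no-configuration default."""

class RemoteStorage:
    def __init__(self, url):
        self.url = url

_BACKENDS = {"memory": InMemoryStorage, "remote": RemoteStorage}

def get_storage(cls="memory", **kwargs):
    # No config file lookup here: either explicit cls/kwargs, or memory.
    return _BACKENDS[cls](**kwargs)
```

In a shell, get_storage() would then return an in-memory storage, and any other backend must be spelled out explicitly, e.g. get_storage(cls="remote", url="http://storage:5002").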

cli tools

  • end user: auth, web-api, swh cli tools (scanner, scheduler, etc.)
  • developer using cli tools to interact with either the production, the staging or a docker-deployed stack (e.g. using the swh scheduler command to manage tasks, swh loader, swh lister, etc.)
  • sysadmin/automation: to migrate django apps (webapp, deposit) through the cli

Rationale

Current features

douardda it's not strictly necessary to keep this "current features" section IMHO
tenma maybe, but it reminds us what we may or may not want

The current configuration system is a utility library implementing diverse strategies for config loading and parsing, including the following functionalities:

  • either absolute paths or file relative to some config directory location
  • brittle abstraction over the config format: file extension whitelist, but no failure otherwise
  • brittle abstraction over the config directory location: resilient but not strict
  • priority loading from multiple default file paths
  • no clear API distinction between loading and parsing/validating
  • directional config merging
  • partial non-recursive config key type validation+conversion
  • both a mixin and a static class, where the config is a class attribute shared by all user code

Wanted features

  • consistent config definition and processing across the SWH codebase

    • one API that fits all cases and is used everywhere
  • priority loading with defined mechanics:

    • priority descending: CLI option > envvar > default path (only for interactive usage?)
    • merge with component-specific default config
  • directional config merge: merge specific definition with a default one

  • namespaced by distinct roles, so that one fully qualified config key can be used by different components, and the same unqualified key may exist for different roles, for example:

    • "web-api/token" can be used by webclient and scanner
    • key "token" could exist in namespace/role web-api or whatever-api, different fully-qualified keys for same unqualified key
  • should have a straightforward API, possibly declarative, so that user code can plug config definitions in a single step (decorator, mixin/trait, factory attribute, etc.)

douardda The declarative part seems pretty attractive, but we currently often use configuration items as constructor argument of classes, how does this fit with the declarative aspect?
tenma this. It couples config with constructor signatures, which leads to difficulties when renaming: one must rename the config element and all occurrences in constructors' signatures simultaneously. We may want this, but it is the first time I see this kind of coupling; I couldn't rename args in the BufferedProxyStorage constructor because of this.

  • configuration as attributes of the target class, to have proper doc/typing/validators; either flat, or a Config object parametrized by class (either the class object or a config cls literal)

  • may or may not want: decouple config keys from component constructors arguments (easy if Config is in another object) so that config keys and class attributes can evolve independently

  • config is loaded at entrypoints (cli, celery task, gunicorn wsgi), not by each component (Loader, Lister, etc.)

ardumont possibly wrap instantiation of other components as the factories get_storage, get_objstorage, get_indexer_storage, get_journal_client, get_journal_writer, etc. do

  • process generic options like --config-file in the toplevel command
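As an illustration of the "directional config merge" wanted above, here is a minimal sketch (a hypothetical helper, not the actual swh.core.config code): values from the specific config override the component defaults, recursing into nested dicts.

```python
# Hypothetical sketch of directional config merging: the specific
# config wins over the default one, nested dicts are merged key by key.

def merge_configs(default, specific):
    merged = dict(default)
    for key, value in specific.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so defaults not overridden at this level survive.
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged
```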

Early concrete elements

Format and location

File format: YAML

Default config:

  • Separate file: local or global?
  • Python code/Docstring

Specific config:

  • Separate file of chosen file format

Environment variables:

  • ? specific envvar like click auto envvars (e.g. SWH_SCANNER_CONFIG_FILE)
  • global envvar (e.g. current SWH_CONFIG_FILENAME)

tenma I would prefer SWH_CONFIG_(FILE)PATH over SWH_CONFIG_FILENAME, to make clear that it is not a basename (we may want that), but I won't argue much.
ardumont yes, PATH sounds better than NAME (it's a detail that can be taken care of later when everything else is centralized)

Library

swh.core.config

  • load/read config which assumes config can be loaded and parsed (avoid duplicating click behavior)
  • check that config can be loaded and parsed
  • priority loading : CLI option > envvar > default path (only for interactive usage?)

Either run with a switch or an envvar, else a hardcoded default path.
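A minimal sketch of this priority resolution (the function name and default path are illustrative, not an existing swh.core.config API):

```python
import os

# Illustrative default; the actual default path is to be decided.
DEFAULT_CONFIG_PATH = "~/.config/swh/global.yml"

def resolve_config_path(cli_option=None, environ=None):
    """Pick the config file path with descending priority:
    CLI option > SWH_CONFIG_FILENAME envvar > hardcoded default path."""
    environ = os.environ if environ is None else environ
    if cli_option:
        return cli_option
    if environ.get("SWH_CONFIG_FILENAME"):
        return environ["SWH_CONFIG_FILENAME"]
    return os.path.expanduser(DEFAULT_CONFIG_PATH)
```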

Usage

See example from scanner CLI.


Current situation

douardda this section probably needs to be moved somewhere else.

These are examples of config files as currently used (we focus here on the configuration itself, not on where these files are loaded from).

Most of the configuration files use the form:

<swhcomponent>:
   cls: <select the implementation to use>
   args:
      <dict of args passed to the class constructor>

Also, most (?) CLI tools for swh packages use the same pattern: the config file loading mechanism is handled in the main click group for that package (e.g. in swh.dataset.cli.dataset_cli_group for swh.dataset, or swh.storage.cli.storage for the storage, etc.)
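A minimal sketch of how this cls/args pattern maps to instantiation (the registry and classes below are illustrative stand-ins for the real swh factories):

```python
# Toy classes standing in for real objstorage backends.
class PathSlicingObjStorage:
    def __init__(self, root, slicing):
        self.root, self.slicing = root, slicing

class RemoteObjStorage:
    def __init__(self, url):
        self.url = url

# cls selects the implementation, args are the constructor kwargs.
OBJSTORAGE_CLASSES = {
    "pathslicing": PathSlicingObjStorage,
    "remote": RemoteObjStorage,
}

def get_objstorage(cls, args):
    return OBJSTORAGE_CLASSES[cls](**args)

config = {
    "objstorage": {
        "cls": "pathslicing",
        "args": {"root": "/srv/softwareheritage/objects", "slicing": "0:5"},
    }
}
objstorage = get_objstorage(**config["objstorage"])
```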

objstorage

The generic config for an objstorage looks like:

objstorage:  
  # typ. used as swh.objstorage.factory.get_objstorage() kwargs
  cls: pathslicing
  args:
    root: /srv/softwareheritage/objects
    slicing: 0:5

In which we have the main config entry (how to access the underlying objstorage backend), then one or more configuration items for the objstorage RPC server (for which one needs to read the code to know which options are accepted).

rpc-server

The config is checked in swh.objstorage.api.server.make_app with some validation in swh.objstorage.api.server.validate_config.

It also accepts a client_max_size top-level entry, which is the only "extra" config parameter supported (used in make_app).

WSGI/gunicorn

When started via gunicorn:

swh.objstorage.api.server:make_app_from_configfile()

This function takes care of the presence of SWH_CONFIG_FILENAME, loads the config file, validates it (validate_config), then calls make_app.
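The pattern can be sketched as follows (simplified: JSON instead of YAML to keep the sketch dependency-free, and validate_config/make_app reduced to illustrative stubs):

```python
import json
import os

def validate_config(cfg):
    # Illustrative check, standing in for the real validate_config.
    if "objstorage" not in cfg:
        raise ValueError("missing 'objstorage' configuration entry")
    return cfg

def make_app(cfg):
    # Stand-in for the real app factory.
    return {"app_config": cfg}

def make_app_from_configfile():
    # The real code reads a YAML file; JSON is used here only to keep
    # this sketch stdlib-only.
    path = os.environ["SWH_CONFIG_FILENAME"]  # KeyError if unset: fail early
    with open(path) as f:
        cfg = json.load(f)
    return make_app(validate_config(cfg))
```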

replayer

The objstorage replayer needs 2 objstorage configurations (src and dst) and a journal_client one, e.g.:

objstorage_src:
  cls: remote
  args:
    url: http://storage0.euwest.azure.internal.softwareheritage.org:5003
    max_retries: 5
    pool_connections: 100
    pool_maxsize: 200

objstorage_dst:
  cls: remote
  args:
    url: http://objstorage:5003

journal_client:
  cls: kafka
  brokers:
    - kafka1
    - kafka2
    - kafka3
  group_id: test-content-replayer-x-change-me

The journal_client config item is directly used as argument of the swh.journal.client.get_journal_client() factory.

storage

storage:
  cls: local
  args:
    db: postgresql:///?service=swh-storage
    objstorage:
      cls: remote
      args:
        url: http://swh-objstorage:5003/
    journal_writer:
      cls: kafka
      args:
        brokers:
          - kafka
        prefix: swh.journal.objects
        client_id: swh.storage.master

In which we have the same config system for the main underlying (storage) backend.

Besides the configuration of the underlying storage access, there can also be the configuration for the linked objstorage and journal_writer.

The former is passed directly to the swh.storage.objstorage.ObjStorage class, which is a thin layer above the real swh.objstorage.ObjStorage class (instantiated via get_objstorage()).

The latter is directly used as argument of the swh.storage.writer.JournalWriter class.

Also note that the instantiation of the objstorage and journal writer is done in each storage backend (it is not a generic behavior in get_storage()).

rpc-serve

Same as general case + inject the check_config flag from cli options if needed.

WSGI/gunicorn

swh.storage.api.server:make_app_from_configfile()

replayer

This tool needs 2 entries: the destination storage (same config as above) + the journal client config (journal_client), like:

storage:
  cls: remote
  args:
    url: http://storage:5002/
    max_retries: 5
    pool_connections: 100
    pool_maxsize: 200

journal_client:
  cls: kafka
  brokers:
    - kafka-broker:9094
  group_id: test-graph-replayer-XX5
  object_types:
    - content
    - skipped_content

The journal_client config item is directly used as argument of the swh.journal.client.get_journal_client() factory.

backfiller

The backfiller uses a "low-level" config scheme, because it needs direct access to the database:

brokers:
  - broker1
  - ...
storage_dbconn: postgresql://db
prefix: swh.journal.objects
client_id: <UUID>

The config validation is performed within the JournalBackfiller class.

dataset

In swh.dataset, the loaded config is directly passed to GraphEdgeExporter via export_edges and sort_graph_nodes.

For the GraphEdgeExporter, these config values are actually the **kwargs of ParallelExporter.process plus the remove_pull_requests flag extracted from the config dict in process_messages().
This ParallelExporter uses a single config entry, journal, the configuration of a journal client.

For the sort_graph_nodes, config values are:

  • sort_buffer_size
  • disk_buffer_dir

deposit

The main click group of swh.deposit does not load the configuration file.

However, it provides a swh.deposit.config.APIConfig class that loads the configuration from the SWH_CONFIG_FILENAME file.
The generic implementation expects a scheduler entry, and has default values for max_upload_size and checks.

The current config file for the deposit service in docker looks like:

scheduler:
  # used by the deposit RPC server
  cls: remote
  args:
    url: http://swh-scheduler:5008

# deposit server now writes to the metadata storage (storage)
storage_metadata:
  cls: remote
  args:
    url: http://swh-storage:5002/

storage:
  cls: remote
  url: http://swh-storage:5002/
# needed ^ for the old migration script (we cannot remove it or init fails)

allowed_hosts:
  # used in "production" django settings (server)
  - '*'

private:
  # used in "production" django settings (server)
  secret_key: prod-in-docker
  db:
    host: swh-deposit-db
    port: 5432
    name: swh-deposit
    user: postgres
    password: testpassword
  media_root: /tmp/swh-deposit/uploads

extraction_dir: "/tmp/swh-deposit/archive/"
# used by swh.deposit.api.private.deposit_read.APIReadArchives()

douardda I'm not sure how all these config entries are used, and by which piece of code.
ardumont clarified the parts not explained, dropped the obsolete ones
ardumont by cleaning up, i saw a discrepancy about the storage_metadata key, fixed.
ardumont it's one entangled configuration file used by all deposit modules: the api, the "private" api and the workers, each using a subset combination of those. To actually see what's used by what now, better look at the production configuration instead.

client tools

The swh.deposit.cli.client clis do not explicitly implement configuration loading from a file; instead, every configuration option is given as a cli option.

However, some classes instantiated from there do support loading a config file from the SWH_CONFIG_FILENAME environment variable.

Config entries for a deposit client are:

  • url
  • auth (a dict with username and password entries)

ardumont "some classes instanciated from there do support loading"> True. But it's not used within that particular cli context.
ardumont That part is now covered with integration tests (no more mock) so modification on that part should be simpler

admin tools

The swh.deposit.cli.admin.admin click group does implement the config file loading pattern (the loading itself is actually implemented in the setup_django_for() function).

This function loads the django configuration from swh.deposit.settings.<platform> (with <platform> in ["development", "production", "testing"]), and sets the SWH_CONFIG_FILENAME environment variable to the given config_file argument.

ardumont That's some not pretty stuff that will hopefully get simplified with this spec ;)
ardumont That part is now covered with tests so modification will be simpler as well

celery worker

The deposit provides one celery worker task (CheckDepositTsk), which loads its configuration exclusively from SWH_CONFIG_FILENAME. The only config entry used is the deposit server connection information.

RPC server

The deposit server uses the standard django configuration scheme, but the selected config module is managed by swh.deposit.config.setup_django_for().

A tricky thing is the swh.deposit.settings.production django settings module, since it does load the SWH_CONFIG_FILENAME config file (but NOT in the development or testing flavors).

In production mode, it expects the configuration to have:

  • scheduler
  • private (credentials for the admin pages of the deposit),
  • allowed_hosts (optional)
  • storage
  • extraction_dir

douardda not sure I have all deposit config options/usages
ardumont in doubt, look at the puppet manifest configuration
ardumont all deposit usages are there
ardumont as far as my understanding about django goes, this is indeed the standard way of configuring django (I dropped the (?)p).

graph

The main click group of swh.graph does load the config file, but it does not fall back to SWH_CONFIG_FILENAME if no config file is given as a cli option argument.

Supported configuration values are declared/checked in the swh.graph.config module.

There is no main "graph" section or namespace in the config file, so all config entries are expected at file's top-level:

  • batch_size
  • max_ram
  • java_tool_options
  • java
  • classpath
  • tmp_dir
  • logback
  • graph.compress (for the compress tool)

indexer

The main click group of swh.indexer does load the config file, but it does not fall back to SWH_CONFIG_FILENAME if no config file is given as a cli option argument.

For the indexer storage, a standard swh.indexer.storage.get_indexer_storage() factory function is provided, and is generally called with arguments from the indexer_storage configuration entry.

schedule

The swh.indexer.cli.schedule command uses the config entries:

  • indexer_storage
  • scheduler
  • storage

journal_client

The swh.indexer.cli.journal_client command (which listens to the journal to fire new indexing tasks) uses the config entries:

  • scheduler

The connection to the kafka broker is handled only by command line option arguments.

RPC server

When started using the swh indexer rpc-serve command, it expects a config file name as a required argument. Configuration entries are:

  • indexer_storage

WSGI/gunicorn

When started as a WSGI app, the configuration is loaded from the SWH_CONFIG_FILENAME environment variable (in make_app_from_configfile).

journal

The journal can be used from the producer side (e.g. a storage's journal writer) or the consumer side.

The swh.journal.client.get_journal_client(cls, **kwargs) factory function is generally used to get a journal client connection with arguments directly from journal_client (or journal) configuration entry.

The swh.journal.writer.get_journal_writer(cls, **kwargs) factory function is used to get a producer journal connection, with arguments directly from the journal_writer configuration entry (generally a subentry of the "main" storage config entry, as seen above in the storage config example).

loaders

Loaders are mostly celery workers. There is a cli tool to synchronously execute a loading.

When run as a celery worker task, the configuration loading mechanism is detailed in the scheduler section below.

When executed directly, via swh loader run, the loader class is instantiated directly, so it is the responsibility of the latter to load a configuration file. This is normally done using the swh.core.config.load_from_envvar helper.
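A sketch of load_from_envvar's behavior as described here (simplified: JSON instead of YAML and a flat merge, to stay stdlib-only; the DEFAULT_CONFIG keys are illustrative, not the real loader defaults):

```python
import json
import os

# Illustrative component defaults, standing in for a loader's
# DEFAULT_CONFIG class attribute.
DEFAULT_CONFIG = {
    "save_data": False,
    "max_content_size": 100 * 1024 * 1024,
}

def load_from_envvar(default_config=None):
    """Load the file pointed at by SWH_CONFIG_FILENAME and overlay it
    on the given defaults. The real implementation reads YAML."""
    config = dict(default_config or {})
    path = os.environ.get("SWH_CONFIG_FILENAME")
    if path:
        with open(path) as f:
            config.update(json.load(f))
    return config
```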

listers

The main lister cli group does handle the loading of the config file, including falling back to the SWH_CONFIG_FILENAME if not given as command line argument.

Expected config options are:

  • lister
  • priority (optional)
  • policy (optional)
  • any other option accepted as config option by the lister class, if any

The swh lister run command also instantiates a lister class. The base implementation supports the configuration options:

  • cache_dir (unclear if this can be overloaded by a config file)
  • cache_responses (same)
  • scheduler
  • lister
  • credentials (for listers inheriting from ListerHttpTransport)
  • url (same)
  • per_page (bitbucket)
  • loading_task_policy (npm)

When used via a celery worker, standard celery worker config loading mechanism is used (see the scheduler below).

scanner

The scanner's cli implements its own strategy for finding the configuration file to load (including looking at the SWH_CONFIG_FILENAME variable). It only needs connection information for the public web API:

  • url
  • auth-token (optional)

scheduler

The scheduler consists of several parts.

celery

Every piece of code that involves loading the celery stack of swh.scheduler (i.e. that imports the swh.scheduler.celery_backend.config module) will load the configuration file from SWH_CONFIG_FILENAME, in which at least a celery section is expected.

Celery workers are registered from the swh.workers pkg_resources entry point as well as the celery.task_modules configuration entry.

The main celery app singleton is then configured from a hard-coded default config dict merged with the celery configuration loaded from the configuration file.

celery workers

Celery workers are started by the standard celery command (python -m celery worker) using swh.scheduler.celery_backend.config.app as celery app, so the configuration loading mechanism is the default celery one described above, and the only way to specify the configuration file to load is via the SWH_CONFIG_FILENAME variable.

cli tools

The main click group does implement the --config-file option, and uses the swh.core.config.read() function, so this main config file loading mechanism does not fall back to the SWH_CONFIG_FILENAME variable.

At this level, the only expected config entry is scheduler (connection to the underlying scheduler service).

Additional config entries for cli commands:

  • runner:
    • celery
  • listener:
    • celery
  • rpc-serve:
    • any flask configuration option
  • celery-monitor:
    • celery
  • archive:
    • any option accepted by swh.scheduler.backend_es.ElasticSearchBackend

WSGI/gunicorn

The loading of the WSGI app normally uses the swh.scheduler.api.server.make_app_from_configfile() function that takes care of loading the config file from the SWH_CONFIG_FILENAME with no fall back to a default path.

The loaded config is added to the main flask app object, so any flask-related config option is possible (at configuration's top-level.)

search

cli tools

swh.search main cli group does implement the --config-file option (using swh.core.config.read() to load the file).

Config options by cli command:

  • initialize:
    • search
  • journal-client objects:
    • journal
    • search
  • rpc-serve:
    This expects a config file name given as a (mandatory) argument (in "addition" to the general --config-file option). This configuration is then used to configure the flask-based RPC server.
    • search
    • any flask config option

WSGI

The creation of the WSGI app is normally done using swh.search.api.server.make_app_from_configfile, which uses the SWH_CONFIG_FILENAME variable as (only) way of setting the config file.

vault

cli tools

There is no support for the --config-file option in the main click group, but the cli only provides one command (rpc-serve), which does support this option.

The configuration file is loaded in swh.vault.api.server.make_app_from_configfile(), and the main RPC server is aiohttp based.

web

Django-based stuff.

web client

No config file loading for now (?)


Configuration loading mechanisms

Config file loading function used

package | command | loading function | called from | config path
dataset | swh dataset ... | swh.core.config.read() | swh.dataset.cli.dataset_cli_group() | --config-file
deposit | swh deposit client | | |
deposit | swh deposit admin | config.load_named_config() | | --config-file, SWH_CONFIG_FILENAME
deposit | HTTP application | config.read_raw_config() | | SWH_CONFIG_FILENAME via DJANGO_SETTINGS_MODULE
graph | swh graph ... | swh.core.config.read() | swh.graph.cli.graph_cli_group() | --config-file
graph | WSGI app | ?? | ?? |
indexer | swh indexer ... | swh.core.config.read() | swh.indexer.cli.indexer_cli_group() | --config-file
indexer | swh indexer rpc-serve | swh.core.config.read() | swh.indexer.storage.api.server.load_and_check_config() | config-path (via s.i.cli.rpc_server())
indexer | WSGI app | swh.core.config.read() | swh.indexer.storage.api.server.load_and_check_config() | SWH_CONFIG_FILENAME (via s.i.s.a.s.make_app_from_configfile())
icinga | swh icinga_plugins ... | | |
lister | swh lister ... | swh.core.config.read() | swh.lister.cli.lister() | --config-file, SWH_CONFIG_FILENAME (via s.l.cli.lister())
lister | celery worker | config.load_from_envvar() | swh.lister.core.simple_lister.ListerBase() | SWH_CONFIG_FILENAME, <SWH_CONFIG_DIRECTORIES>/lister_<name>.<ext>
loader.package | swh loader ... | config.load_from_envvar() | swh.loader.package.loader.PackageLoader() | SWH_CONFIG_FILENAME
loader.core | swh loader ... | config.load_from_envvar() | swh.loader.core.loader.BaseLoader() | SWH_CONFIG_FILENAME
objstorage | swh objstorage ... | swh.core.config.read() | swh.objstorage.cli.objstorage_cli_group() | --config-file, SWH_CONFIG_FILENAME (via s.o.cli.objstorage_cli_group())
objstorage | WSGI app | swh.core.config.read() | swh.objstorage.api.server.load_and_check_config() | SWH_CONFIG_FILENAME (via s.o.api.server.make_app_from_configfile())
scanner | swh scanner ... | config.read_raw_config() | swh.scanner.cli.scanner() | --config-file, SWH_CONFIG_FILENAME, ~/.config/swh/global.yml
scheduler | swh scheduler ... | swh.core.config.read() | swh.scheduler.cli.cli() | --config-file
scheduler | WSGI app | swh.core.config.read() | swh.scheduler.api.server.load_and_check_config() | SWH_CONFIG_FILENAME
scheduler | celery worker | swh.core.config.load_named_config() | swh.scheduler.celery_backend.config | swh.scheduler.celery_backend
search | swh search ... | swh.core.config.read() | swh.search.cli.search_cli_group() | --config-file
search | swh search rpc-server | swh.core.config.read() | swh.search.api.server.load_and_check_config() | config-path
search | WSGI app | swh.core.config.read() | swh.search.api.server.load_and_check_config() | SWH_CONFIG_FILENAME (from s.s.api.server.make_app_from_configfile())
storage | swh storage ... | swh.core.config.read() | swh.storage.cli.storage() | --config-file, SWH_CONFIG_FILENAME (from s.s.cli.storage())
storage | WSGI app | swh.core.config.read() | swh.storage.api.server.load_and_check_config() | SWH_CONFIG_FILENAME (from s.s.api.server.make_app_from_configfile())
vault | swh rpc-serve | swh.core.config.read() or swh.core.config.load_named_config() | swh.vault.api.server.make_app_from_configfile() | --config-file, SWH_CONFIG_FILENAME (from s.v.api.server.make_app_from_configfile()), <swh.core.config.SWH_CONFIG_DIRECTORIES>/vault/server.<ext>
vault | celery worker | swh.core.config.read() or swh.core.config.load_named_config() | swh.vault.cookers.get_cooker() | SWH_CONFIG_FILENAME (from ...get_cooker()), <s.c.c.SWH_CONFIG_DIRECTORIES>/vault/cooker.<ext> (from get_cooker())
web | XXX | | |
web-client | None | | |

Synthesis of discussion @tenma/@ardumont 2020-10-08

Impacts

  • only swh modules source code to migrate
  • docker/production/staging should run as before once the changes are deployed (docker-compose.yml, puppet manifests untouched)

Definitions

  • component: a swh base component (e.g. Loader, Lister, Indexer, Storage, ObjStorage, Scheduler, etc)
  • entrypoint: an swh component orchestrator (celery worker "task", cli, wsgi, etc.)

Possible plan

  • each component repository declares a swh.<component>.config module (like what we declare today for tasks in swh.<component>.tasks)
  • the module declares a typed Config object
  • the typed Config object is in charge of declaring config keys with default values, and of (gradually) validating the configuration, failing to instantiate if misconfigured (possible implementation: attrs, as swh.model.model does)
  • each entrypoint is in charge of instantiating the configuration object
  • each entrypoint is in charge of injecting the Config object to the component
  • each component must take one specific typed Config as a constructor parameter.
    • existing code loading the configuration out of an environment variable is removed
    • existing code validating the configuration, if any, is removed
  • the merge policy about loading from an environment variable or from a cli flag or whatever else is delegated to a function in swh.core.config
    • corollary: the pseudo-typed code in swh.core.config which kind of validated the types must be dropped (I think it's dead code anyway)
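The typed Config object of the plan above could look like the following sketch. It uses stdlib dataclasses rather than the attrs suggested in the plan, and the StorageConfig name and keys are purely illustrative:

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class StorageConfig:
    """Hypothetical typed Config for a storage component: declares
    config keys with defaults, and fails at instantiation time when
    misconfigured (fail early)."""
    cls: str = "memory"
    url: str = ""
    max_retries: int = 3

    def __post_init__(self):
        # Gradual validation: check declared types, then invariants.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name}: expected {f.type.__name__}, "
                    f"got {type(value).__name__}"
                )
        if self.cls == "remote" and not self.url:
            raise ValueError("remote storage requires a url")
```

An entrypoint would instantiate such an object from the merged configuration and inject it into the component's constructor, which takes it as its single typed parameter.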

Pre-requisites

The plan described above should respect the following:

  • separation of concern (doing one thing and doing it well: merge policy, loading config, validating, running)
  • api unification between entrypoints and tests (consistency): all entrypoints respect the same pattern of instantiating, configuring and injecting
  • fail early if misconfigured

Out of scope

  • (Global Inversion of control) A component injector in charge of instantiating, configuring and injecting between objects (ala Spring Framework)

Feedback on the proposal

olasd

I strongly like that this is going towards having fewer dicts being thrown around in our code around object instantiation in favor of stronger typed objects. I also think this can be implemented in a DRY way by parsing the signature of the classes that are being instantiated.

I'm not sure that the backwards compatibility for production is such a strong requirement, even though it would definitely be nicer.

ardumont "strong requirement": it is not but please let's change one thing at a time. It will definitely help to not touch that part as a first step (especially if that breaks). And when we know the migration worked (implementation underneath changed, all deployed, everything ok as before), then we can change the config values incrementally with small changes.

Configuration sharing

This proposal doesn't seem to be solving the concern of shared configuration across so-called components; Let me use a concrete example:
- the WebClient (or whatever its name is) class in swh.web.client takes a token parameter for authentication, and a base_url for its setup
- the SwhFuse component uses a WebClient
- the SwhScanner component also uses a WebClient
- the swh fuse command takes a config file with its own parameters, as well as parameters for a web client
- the swh scanner command takes a config file with parameters for a web client; I expect most of its other parameters come from the CLI directly

In the proposal it's not clear to me how the following would happen:
- swh.web.client declares its configuration schema in swh.web.client.config
- swh.fuse and swh.scanner do the same in their own config modules
- the swh fuse and swh scanner entrypoints parse a configuration file; from the output, they instantiate a SwhFuse / SwhScanner.

Now that I've written all of this, I guess this could be solved by having a way for the swh.fuse.config and swh.scanner.config modules to declare that they're expecting a swh.web.client.config at the toplevel of their configuration file (rather than in a nested way like the get_storage factories currently work). Did you have a different idea?

tenma
This is my point about namespaces/roles/contexts, in my initial suggestions (which we skimmed in the last discussion and did not include). It would be a way to both share and distinguish config keys. My proposition avoids any kind of hierarchy with arbitrary depth, like composition (owning another config) or inheritance (subclassing another config), and is more similar to naming systems that use tags/prefixes. But it would imply that as a team we define those namespaces/roles/contexts ahead of time (we should choose a definitive name for this; I prefer role).
Example:
- web-api/token: token in the context of web-api
- ext-service/token: token in the context of external service
- both swh.web.client.config and swh.scanner.config could reference web-api/token
The injector in entrypoints, reading the config definitions of the components it instantiates, would see which config keys are shared, and could create a merged config definition from this (if we want to).
But this choice would require adapting config definitions that we can leave untouched for now.
ardumont my initial answer was lost; I replied something about configuration composition at the time, so what olasd suggested, but I think we moved away from this anyway (see the next paragraphs)

douardda

We may see the problem we are trying to solve as:

  • what do we want the configuration files to look like in any of the currently known use cases? (and how far are we with what we currently have?)
  • how do we want to declare these configuration structures (typing, default values, etc.)?
  • where should we instantiate/load these configuration structures?

Synthesis of discussion @tenma/@douardda/@olasd 2020-10-12

Taste

Examples that are not accurate but demonstrate syntax and functionalities.

Example config (global, mixing unrelated components) declarations:

storage:
    uffizi:
        cls: remote
        url: ...
    staging:
        cls: remote
        url: ...
    local:
        cls: pgsql
        conninfo: ...
        objstorage: <objstorage.foo>
    ingestion:
        cls: pipeline
        steps:
            - cls: filter
            - cls: buffer
            - <storage.uffizi>
objstorage:
    foo:
        cls: pathslicing
objstorage-replayer:
    to_S3:
        cls: ...
        src: <objstorage.local>
        dst: <objstorage.S3>
web-client:
    swh:
        token: ...
        url: ...

This demonstrates:

  • general syntax, analogous to the current one
  • one more level to distinguish multiple instances of components, which may all be instantiated or be alternatives to each other
  • package must be the package of a swh service
  • same key name (leaf) can exist in different packages/instances
  • reference syntax to reference qualified instances
  • TODO support reference to a key? Would be <web-client.swh.url> in above example
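A sketch of how the <package.instance> reference syntax could be resolved (hypothetical implementation; cycle detection and the key-level references of the TODO above are left out):

```python
import re

# A reference is a whole string of the form <package.instance>.
REF_RE = re.compile(r"^<(?P<package>[\w-]+)\.(?P<instance>[\w-]+)>$")

def resolve_refs(config):
    """Replace each <package.instance> reference string in the 3-level
    package/instance/attribute config with the instance it points at."""
    def resolve(value):
        if isinstance(value, str):
            m = REF_RE.match(value)
            if m:
                target = config[m.group("package")][m.group("instance")]
                return resolve_dict(target)
            return value
        if isinstance(value, dict):
            return resolve_dict(value)
        if isinstance(value, list):
            return [resolve(v) for v in value]
        return value

    def resolve_dict(d):
        return {k: resolve(v) for k, v in d.items()}

    return resolve_dict(config)
```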

ardumont Unclear on the scope of that configuration sample.
- Is it a sample configuration for one service serving one module (matching what we already have)?
- Or is it one global configuration file for all modules, thus defining all
production combinations for all services? Then each service (storage, loader-git,
etc.) picks the information it needs from that file?

Example CLI usage:

swh storage rpc-serve --storage=local
SWH_STORAGE=local swh storage rpc-serve

ardumont is that dedicated to one service? SWH_STORAGE for swh storage, SWH_SCHEDULER for swh scheduler and so on and so forth
or is the following possible as well?

SWH_STORAGE=local swh loader run git https://...

which would make the storage to use a local one within the scope of the
loading?

tenma here local corresponds to an instance. The injector will instantiate this instance and fill in the references to it in the config.
maybe the option/envvar would be more qualified, like X.Y.storage=instance.
We did not discuss it in detail. More about this idea with olasd or douardda.
ardumont i still do not get the scope of the sample ¯\_(ツ)_/¯ (i tried to clarify my questions ^)

Configuration model

Configuration statement syntax

3 levels (real names to be defined): package, instance, attribute (key:value)

Examples of existing "package" identifiers:

  • storage
  • objstorage
  • objstorage-replayer
  • indexer-storage
  • journal

Anonymous instances in the config file (for use as a value) are out of scope for now.

ardumont well, my understanding so far would mean that such anonymous instances (if any) should no longer exist and be attached to some other level, the "package" level then.
tenma yes, the idea was to have something regular without a shorthand anonymous form, at least for now. An instance must then be defined at the 2nd level in a package. If many instances are needed, declaring each this way may become tedious when most are used only once; in that case it would not be very difficult to add a syntax for anonymous instance definitions.

Package corresponds to an existing identifier, the one of the SWH Python package that defines the components to configure. It is used to resolve component types (e.g. search for type remote in storage) and to group components of the same service/package.
/!\ what about components that are not of the main type of the package? They would need inclusion into the map and factory.
Ex: a non-storage component in the storage package?

ardumont What about "package" identifiers that do not currently exist,
loaders, listers, etc. Is the following good enough?

  • loader-npm
  • lister-cran

  • tenma right, for those we would need a top-level mapping I think.
    I do not know the whole type hierarchy:
    would the "package" be loader and the type "npm", or the package "loader-npm" and the type "task-?"? To be defined.

The cls key (alt: type) is a type identifier and denotes what kind of object schema to use.

Other keys are instance keys conforming to the schema referenced by cls; more concretely, the class constructor arguments.
< and > are reference markers and denote a reference to a qualified instance or key. The choice of marker is not definitive. The YAML reference feature (syntax &/*) is rejected because we want to keep references in our model and thus not have them processed outside our control.

Configuration declaration

Packages are defined statically, most probably in core.config. They could be "discovered", but we may prefer whitelisting supported ones.

ardumont I'm under the impression that, for the discoverability, we could plug that part to the "registry mechanism" already in place for the module "tasks".
Adding a config key in the output of the register function there or something.

swh.<module>.init defines something like (extended with "config" here):

from typing import Any, Mapping

def register() -> Mapping[str, Any]:
    """Register the current worker module's definition (tasks, loader, config, ...)"""
    from .loader import SomeLoader

    return {
        "task_modules": [f"{__name__}.tasks"],
        "loader": SomeLoader,
        "config": [f"{__name__}.config"],  # <- or something
    }

Then in the setup.py of the module:

setup(
   ...
   entry_points="""
   ...
       [swh.workers]
       ...
       lister.bitbucket=swh.lister.bitbucket:register
       ...
       loader.someloader=swh.loader.some:register
       ...

And some other part in swh.core reads that code to actually declare the tasks.
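
If the registry route were taken, config discovery could reuse that entry-point machinery. A sketch, assuming register() gained a hypothetical "config" key, and using the stdlib importlib.metadata here (the real mechanism in swh.core may differ):

```python
# Sketch only: collect per-package config declarations from the
# "swh.workers" entry points, assuming register() returns a "config" key.
from importlib.metadata import entry_points

def load_config_registry():
    registry = {}
    try:
        eps = entry_points(group="swh.workers")        # Python 3.10+
    except TypeError:
        eps = entry_points().get("swh.workers", [])    # older Pythons
    for entry_point in eps:
        register = entry_point.load()   # e.g. swh.lister.bitbucket:register
        description = register()        # the mapping shown above
        if "config" in description:     # hypothetical key
            registry[entry_point.name] = description["config"]
    return registry
```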

Packages are associated with a factory function to instantiate instances as InstanceType(**keys), and optionally with a (class_identifier_string : Python_class) map/register.

Example: Storage package

map = {
    "remote": RemoteStorage,
    "local": LocalStorage,
    "buffer": BufferedStorage,
    ...
}
def get_storage(*args, **kwargs):
    ...

ardumont I gather it works for loaders, listers and indexers as well; for example, git loaders:

map = {
    "remote": GitLoader,
    "disk": GitLoaderFromDisk,
    "archive": GitLoaderFromArchive,
}
def get_loader(*args, **kwargs):
    ...

bitbucket listers:

map = {
    "full": FullBitBucketLister,
    "range": RangeBitBucketLister,
    "incremental": IncrementalBitBucketLister,
}

indexers:

map = {
    "mimetype": ContentMimetypePartition,
    "fossology-license": ContentFossologyLicensePartition,
    "origin-metadata": OriginMetadata,
    ...
}

Component constructor defines config keys, types and defaults.

No static definitions that would need maintaining, but documentation autogenerated from constructors.
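
Such autogeneration could rely on standard introspection. A sketch, with PathSlicingObjStorage as an invented stand-in constructor (not the real class):

```python
# Sketch: deriving config keys, types and defaults from a component
# constructor via introspection. PathSlicingObjStorage is a stand-in.
import inspect

class PathSlicingObjStorage:
    def __init__(self, root: str, slicing: str = "0:2/2:5"):
        self.root, self.slicing = root, slicing

def config_schema(constructor):
    """Map each constructor argument to its (annotation, default) pair."""
    signature = inspect.signature(constructor)
    return {
        name: (param.annotation, param.default)
        for name, param in signature.parameters.items()
        if name != "self"
    }

schema = config_schema(PathSlicingObjStorage)
# schema lists "root" and "slicing" with their types and defaults
```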

Endpoint usage

A config file is necessary to specify a graph of instances.
The config file parameter is specified as a string path, either absolute or relative. Current name: config-file.
The config file can be specified as a CLI option (--<config-file>) or envvar (SWH_<CONFIG_FILE>).

Reference to an instance can be specified inline (ref syntax), CLI option (--<key>-instance), envvar (SWH_<key>_INSTANCE).

Other keys can be specified as CLI option (--<key>), envvar (SWH_<key>).

ardumont It'd be good then to specify the merge policy now
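
To seed that discussion, here is one possible policy, stated as an assumption rather than a decision: later sources (e.g. CLI option over envvar over config file) override earlier ones, with mappings merged recursively. A sketch:

```python
# Sketch of one possible merge policy (assumed, not decided):
# later sources win, dicts are merged recursively, scalars are overridden.
def merge(base, override):
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

file_cfg = {"storage": {"default": {"cls": "remote", "url": "http://uffizi:5002/"}}}
env_cfg = {"storage": {"default": {"url": "http://staging:5002/"}}}
cli_cfg = {"storage": {"default": {"cls": "memory"}}}

# precedence: config file < envvar < CLI option
merged = merge(merge(file_cfg, env_cfg), cli_cfg)
```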

Library API

Example:
core.config.get_component_from_config(config=cfg_contents, type="storage", id="uffizi")

ardumont
What's cfg_contents in the sample, a global unified configuration of all the config combinations?
I'd be interested in a concrete code sample of the instantiation of a module with that code, to clarify a bit ;)
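
Pending a real answer, a purely hypothetical sketch of what such a call could do, reading cfg_contents as a global unified tree (an assumption); FACTORIES, MemoryStorage and the config contents are invented for illustration:

```python
# Hypothetical sketch of get_component_from_config over a global config tree.
class MemoryStorage:
    def __init__(self, **kwargs):
        self.config = kwargs

# stand-in for the per-package factory register
FACTORIES = {"storage": MemoryStorage}

def get_component_from_config(config, type, id):
    """Look up config[type][id] and hand its attributes to the type's factory."""
    attributes = dict(config[type][id])
    return FACTORIES[type](**attributes)

cfg_contents = {"storage": {"uffizi": {"cls": "memory", "journal_writer": None}}}
storage = get_component_from_config(cfg_contents, type="storage", id="uffizi")
```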

Algorithms

To be defined.
Parsing (YAML library), config resolving (reference processing), component resolving, instantiating.
Restricting to N levels eases implementation.
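
A sketch of how the resolving step could chain, reference resolution only (parsing, component resolution, cycle and depth checking omitted); all names are placeholders:

```python
# Sketch: recursively resolve "<tid.iid>" references in a config tree.
# No cycle detection or depth restriction in this illustration.
def resolve(value, tree):
    if isinstance(value, str) and value.startswith("<") and value.endswith(">"):
        tid, iid = value[1:-1].split(".", 1)
        return resolve_instance(tree, tid, iid)
    if isinstance(value, list):
        return [resolve(item, tree) for item in value]
    if isinstance(value, dict):
        return {key: resolve(item, tree) for key, item in value.items()}
    return value

def resolve_instance(tree, tid, iid):
    """Return the instance definition with all its references resolved."""
    return {key: resolve(item, tree) for key, item in tree[tid][iid].items()}

tree = {
    "objstorage": {"foo": {"cls": "pathslicing"}},
    "storage": {
        "local": {"cls": "pgsql", "objstorage": "<objstorage.foo>"},
        "ingestion": {"cls": "pipeline",
                      "steps": [{"cls": "filter"}, "<storage.local>"]},
    },
}
ingestion = resolve_instance(tree, "storage", "ingestion")
```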


Example configurations and usages

Note that the actual name of keys is completely up to bikeshedding; specifically, dashes versus underscores versus dots is completely up in the air.

Shared configuration for command line tools

Configuration file ~/.config/swh/default.yml (default configuration path for user-facing cli tools)

web-client:
  default:
    # FIXME: single implementation => do we need a cls?
    base-url: https://archive.softwareheritage.org/api/1/
    token: foo-bar-baz
  docker:
    base-url: http://localhost:5080/api/1/
    token: test-token
scanner:
  default:
    # FIXME: single implementation => do we need a cls?
    web-client: <web-client.default>
    scanner-param: foo
  docker:
    web-client: <web-client.docker>
    scanner-param: bar
fuse:
  default:
    # FIXME: single implementation => do we need a cls?
    web-client: <web-client.default>

Command-line calls

  • scan the current directory against the archive in docker
    • (cli flag) swh scanner --scanner-instance=docker scan .
    • (equivalent env var) SWH_SCANNER_INSTANCE=docker swh scanner scan .
    • actual python calls in the cli endpoint
scanner_instance = "docker"  # from cli flag or envvar
config_path = "~/.config/swh/default.yml"  # from defaults of the cli endpoint
# basic yaml parse, no interpolation of recursive entries
config_dict = swh.config.read_config(config_path)
scanner = swh.config.get_component_from_config(
    config_dict, type="scanner", instance=scanner_instance
)
# this would get `config_dict["scanner"]["docker"]`, then notice that one of the
# values has the special syntax "<web-client.docker>".
# The entry would be replaced by a call to:
# web_client_instance = swh.config.get_component_from_config(
#     config_dict, type="web-client", instance="docker")
# Finally, the scanner would be instantiated with:
# get_scanner(web_client=web_client_instance, scanner_param="bar")
scanner.scan(".")
  • mount a fuse filesystem
    • swh fs mount ~/foo swh:1:rev:bar
    • actual python calls in the cli endpoint
fuse_instance = "default"  # default value of cli flag / envvar
config_path = "~/.config/swh/default.yml"  # from defaults of the cli endpoint
# basic yaml parse, no interpolation of recursive entries
config_dict = swh.config.read_config(config_path)
fuse = swh.config.get_component_from_config(
    config_dict, type="fuse", instance=fuse_instance
)
# this would get `config_dict["fuse"][fuse_instance]`, then notice that one of the
# values has the special syntax "<web-client.default>".
# The entry would be replaced by a call to:
# web_client_instance = swh.config.get_component_from_config(
#     config_dict, type="web-client", instance="default")
# Finally, the fuse instance would be instantiated with:
# get_fuse(web_client=web_client_instance)
fuse.mount("~/foo", "swh:1:rev:bar")

objstorage replayer

Config file

objstorage:
  local:
    cls: pathslicing
    root: /srv/softwareheritage/objects
    slicing: "0:2/2:5"
  s3:
    cls: s3
    s3-param: foo

journal-client:
  default:
    # single implem: no cls needed
    brokers:
      - kafka
    prefix: swh.journal.objects
    client-param: blablabla
  docker:
    brokers:
      - kafka.swh-dev.docker
    ...

# needed for second cli usecase
objstorage-replayer:
  default:
    src: <objstorage.local>
    dst: <objstorage.s3>
    journal-client: <journal-client.default>

Cli call

Default behavior (single call to swh.core.config.get_component_from_config(config, 'objstorage-replayer', instance_name))

  • swh objstorage replayer --from-instance default
  • swh objstorage replayer (uses config from default instance)

Nice to have (multiple, manual, calls to get_component_from_config in the cli entry point)

  • swh objstorage replayer --src local --dst s3
  • swh objstorage replayer --src local --dst s3 --journal-client docker
  • @douardda's proposal: on-the-fly generation of instance config via syntactic sugar
    swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3

  • @tenma's proposal: dynamic handling of cli options wrt schema
    swh objstorage replayer --objstorage-replayer.src=objstorage.local --objstorage-replayer.journal-client=journal-client.default

    • generic, too verbose, arbitrary attribute setting not needed

Single-task celery worker from systemd

Configuration file /etc/softwareheritage/loader_git.yml

# component configuration
storage:
  default:
    cls: pipeline
    steps:
      - cls: buffer
        min_batch_size:
          content: 1000
          content_bytes: 52428800
          directory: 1000
          revision: 1000
          release: 1000
      - cls: filter
      - cls: remote
        url: http://uffizi.internal.softwareheritage.org:5002/

# other proposal:
storage:
  default:
    cls: pipeline
    steps:
      - cls: buffer
        min_batch_size:
          content: 1000
          content_bytes: 52428800
          directory: 1000
          revision: 1000
          release: 1000
      - cls: filter
      - <storage.uffizi>
  uffizi:
    cls: remote
    url: http://uffizi.internal.softwareheritage.org:5002/

# impossible currently:
storage:
  filter:
    cls: filter  # missing storage: argument
  default:
    cls: pipeline
    steps:
      - cls: buffer
        min_batch_size:
          content: 1000
          content_bytes: 52428800
          directory: 1000
          revision: 1000
          release: 1000
      - <storage.filter>  # get_storage(cls="filter") => fail
      # ? OR {cls: filter, storage: <storage.uffizi>}
      - <storage.uffizi>
  uffizi:
    cls: remote
    url: http://uffizi.internal.softwareheritage.org:5002/

loader-git:
  default:
    cls: remote
    storage: <storage.default>
    max_content_size: 104857600
    save_data: false
    save_data_path: "/srv/storage/space/data/sharded_packfiles"

# Not a swh component
# no type id, only instance id
celery:
  task_broker: amqp://...
  task_queues:
    - swh.loader.git.tasks.UpdateGitRepository
    - swh.loader.git.tasks.LoadDiskGitRepository
    - swh.loader.git.tasks.UncompressAndLoadDiskGitRepository

(expanded) systemd unit '/etc/systemd/swh-worker@loader_git.service'

[Unit]
Description=Software Heritage Worker (loader_git)
After=network.target

[Service]
User=swhworker
Group=swhworker

Type=simple

# Celery
Environment=CONCURRENCY=6
Environment=MAX_TASKS_PER_CHILD=100

# Logging
Environment=SWH_LOG_TARGET=journal
Environment=LOGLEVEL=info

# Sentry
Environment=SWH_SENTRY_DSN=https://...
Environment=SWH_SENTRY_ENVIRONMENT=production
Environment=SWH_MAIN_PACKAGE=swh.loader.git

# Config
Environment=SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml

ExecStart=/usr/bin/python3 -m celery worker -n loader_git@${CELERY_HOSTNAME} --app=swh.scheduler.celery_backend.config.app --pool=prefork --events --concurrency=${CONCURRENCY} --maxtasksperchild=${MAX_TASKS_PER_CHILD} -Ofair --loglevel=${LOGLEVEL} --without-gossip --without-mingle --without-heartbeat

KillMode=process
KillSignal=SIGTERM
TimeoutStopSec=15m

OOMPolicy=kill

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Instantiation flow

  • The celery cli loads the "app" object set in the cli: swh.scheduler.celery_backend.config.app

    • This module loads the configuration file set in SWH_CONFIG_FILENAME to a dict (plausibly a singleton)
    • This module loads celery task modules from the swh.workers entrypoint
    • This module initializes the celery broker and queues from the celery key of the config dict
  • celery task code

@shared_task(name="foo.bar")
def load_git(url):
    config_dict = swh.core.config.load_from_envvar()
    loader = swh.core.config.get_component_from_config(
        config=config_dict,
        type="loader-git",
        id=os.environ.get("SWH_LOADER_GIT_INSTANCE", "default"),
    )
    return loader.load(url=url)

Alternatives to the proposed structure of package/cls in config

tenma
The instance aspect is both powerful and easy to understand.
Need to discuss the type and argument aspects, which are inconsistent (needed for some objects but not others).
package and cls are type identifiers. Keys are parameters and can be defined as anything that is neither a type nor an instance.

Drop the first level and include in type

uffizi:
    cls: storage.remote
    url: ...
local:
    cls: storage.pgsql
    conninfo: ...
    objstorage: <foo>
ingestion:
    cls: storage.pipeline
    steps:
        - cls: filter
        - cls: buffer
        - <uffizi>
foo:
    cls: objstorage.pathslicing

More generic; it does not impose grouping of components of the same "package", but instance names may then need to be more descriptive.

Do not restrict on swh service components

Q: do we only define instances of swh service components in the config file?

Using the more generic notion of role/namespace vs the current specific notion of package offers the flexibility of having names that do not refer to a swh component but to any datastructure (with the type having to be fully qualified, e.g. model.model.Origin), and that can be shared by multiple instances (representing a datastructure that does not belong specifically to a given swh service package).
It would need a register of datastructures that can be used, like the other proposals.
For example, for a core/model/graph/tools/other_library datastructure, do we want to be able to specify something along the lines of:

model:
    orig1:
        cls: Origin
        url: ...
    swhid1:
        cls: SWHID
        swhid: ...
    node1:
        cls: MerkleNode
        data: ...
somelib:
    datastruct1:
        cls:

Key points to the specification

  • configuration declaration syntax
  • relation between definitions
  • external API (CLI options, environment variables)
  • internal API (core.config library)
  • instantiation of components and configuration loading in entrypoints
  • precise scope and impacts

Synthesis of the meeting 2020-10-21

Participants: @tenma, @douardda, @olasd, @ardumont
Reporter: @tenma

Q = question
R = remark
OOS = Out of scope
Dates indicate the chronology of the report.

***Before starting to report the concepts tackled through the meeting, some points about terminology. The terminology was difficult to choose while writing this synthesis, so it is not completely consistent. This section tries to give a basis for discussion.

Terminology

Initial remarks (2020-10-22)

The term type identifier (TID for short) will be used in place of package from now on for the 1st-level names.
package represented both the SWH Python package name and the base type of SWH components available in this package, in my initial, partial view of the subject.
Now that it has been shown that there is no 1-to-1 mapping between SWH component names and SWH packages, the name package is no longer accurate.
The more generic type identifier reflects the flexibility we introduce with a top-level register of component types referenceable(?) in configuration.

The 2nd level maps instance identifier (IID) to instance definition mapping.

TIDs map to actual objects of some type, whereas IIDs exist only in the configuration system.

Terminology discussion preparation (2020-10-30)

Need to choose name for every concept of the system.

  • configuration language:
    • config tree: tree containing all the definitions of a config file
    • config object: any level of the config tree
    • config dictionary: collection of items, under any identifier
    • item/attribute = key + value
    • identifier to type, used in depth level 1
      e.g. "storage"
    • identifier to instance, used in depth level max-1 ;
      e.g. "uffizi", "celery"
    • instance object
    • singleton object
    • reference object
  • programming constructs manipulated through this system:
    • objects = components|singletons
    • swh components (unit comprising config items)
    • swh services (unit comprising components)
    • external components

***Here really starts the report written as of 2020-10-22.
Examples written/updated during the meeting were not copied here.

Single implementations of type

The cls attribute of instances specifies the implementation/alternative/flavour to use.
It used to be required, along with args, because all configurable SWH components were polymorphic.
Components that have no such feature need no cls attribute in their configuration.

R: An indirection layer such as a factory may be defined for such components in order to keep consistency and allow polymorphism if needed later. Alternatively, for polymorphic components, an abstract base class constructor would abstract this indirection layer away, better than a factory which needs to be known by user code.

Default instances

One instance in a package/namespace can be labeled as default.
It will be selected when an instance of a given component type is requested but no IID is given.

Q: could the default instance be implied when instantiating with no IID?

Singleton object definitions

For the sake of both clarity and reuse, ad-hoc configuration objects can be defined at top level and be referenced. These are not SWH or external components.
Those definitions are composed of an identifier (equivalent to an IID) at the 1st level and a YAML object, possibly recursive, at further levels.
They are instantiated as schemaless dictionaries/lists.

Q: How to allow them in syntax and differentiate them from schemaful definitions?

Q: must such a singleton object be referenced in at least one place in the file it is part of, or would it otherwise be ignored?

Top-level component register

A register of components allowed in configurations is to be implemented in core.config.
It will consist of a (TID : qualified_constructor) mapping.
These entries will not be hardcoded in this mapping, but registered at import time from the package that defines the components.

qualified_constructor must be Python absolute import syntax for the callable, which is either a factory function or a class object.
It may be in quoted form (string) or actual object form. The string form avoids imports, but no static check of existence can be performed. Given that registering is done in the package responsible for defining the component, the object form is chosen.

Components that can be registered may be any SWH service component which is public (= has a Python object API).
In practice, only one main component per SWH service encapsulates all configuration for the components used in the service:

  • API servers
  • service workers

Q: use object or quoted form for qualified_constructor ?

Q: what about these config objects: journal related like journal-writer, most under deposit, celery-related

CLI parametrization of configuration loading

A CLI option may be passed to specify an instance ID (only at 2nd level?) when several alternatives are provided in the configuration.
Such option must be declared statically in CLI code.

A CLI option may be passed to override an attribute in the configuration.
Such option must be declared statically in CLI code.

OOS: extend usage to any IID in instance mapping
OOS: dynamic handling of any such options for any attribute

Example propositions

Default behavior (single, manual, component instanciation)

  • swh objstorage replayer --from-instance default
  • swh objstorage replayer (uses config from default instance)

Nice to have (multiple, manual, component instanciations)

  • swh objstorage replayer --src local --dst s3
  • swh objstorage replayer --src local --dst s3 --journal-client docker
  • @douardda's proposal: on-the-fly generation of instance config via syntactic sugar
    swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3

Library API (proposition, 2020-10-23)

Moved to the specification below.

Emerging problems (2020-10-23)

Q: How to allow and handle both typed and untyped (ad-hoc) objects?

Q: how to identify what is an instance in the definitions? Will it require special handling wrt referencing mechanics (source and destination)?

Q: do we want to support reference to external config definitions or require autonomy? = always 1 file for a whole service, or composition of partial definitions?
In the former case, similar sections in multiple config files will need synchronized updates. One possibility is having each SWH component own its config file and composing them on demand (using puppet?) into a standalone file for prod/tests. But it is not easy to have multiple instances this way. Fill a template file on demand for each service?


Specification outline

Synopsis

General terminology/concepts

Scope/use cases

Rationale: existing, limitations, wanted

Specific terminology/concepts

Language description

Library

Client code

Environment

Out-of-scope, rejected ideas

Limitations

Impacts

Implementation plan: library, use, tests, prod


Specification 2020-11-02

(Writing sections breadth-first: deeper at each iteration)

Notations:
[opt=id]: concept subject to acceptance or removal, with identifier for easier reference. Usage similar to a feature flag.
[alt]: alternative to any surrounding [alt] statement
[rem]: remark
[OOS]: out-of-scope remark or idea
[rej]: rejected remark or idea
[Q]: question to be answered

Synopsis

The configuration system evolved partially with use cases. Initial design decisions, applied to all use cases, turned out to be both too hard to reason about/unstable for production and too inflexible for CLI or testing.

General terminology and concepts

For the purpose of this specification.

  • Component: a unit comprising data and/or functions, which provides functionality through an interface and has associated dependencies
  • SWH component: a component consisting of a Python class or module, that provides a functionality specific to one or more SWH services. The closed set of SWH components can appear in configuration definitions.
  • SWH service: collection of SWH components. Corresponds roughly to the docker services developed by SWH. Includes API, worker, and journal services.

Scope/use cases

All SWH services and components.

Environments:

  • Production service
  • CLI
  • Testing

Configuration needs:

  • system service: systemd service vs docker + shell script

  • server: gunicorn vs Flask/aiohttp/Django devel server

  • CLI entrypoint

  • server application: app_from_configfile vs django config

  • worker:

  • component: constructor/factory

Configuration sources:

  • environment: CLI parameter, CLI path, envvar parameter, envvar path, input stream
  • code: literal

Rationale

A.

  • implicit/hard-to-follow loading: configuration may be loaded through a number of ways automatically, good for cli cases but not for prod cases
  • dependency on environment: must be able to instantiate a component using only ad-hoc configuration, for testing purposes
    -> different APIs for different use cases, all compatible (return config)

B.

  • composition coupling: every owner component must know how to instantiate an owned component
  • heterogeneity: configuration loading (CLI) or instantiating (component factory) is implemented differently everywhere; it could be abstracted away
    -> dependency injection, component library API

C.

  • should be able to specify alternative configurations for one component constructed ahead of time, and choose it at runtime/loadtime
  • should be able to factor common configuration out
  • uniform, complete, concise: the configuration could theoretically be centralized in one file, which would give a clear overview of the configuration and the interaction between all the components
    -> instances, references, singletons

Specific terminology and concepts of the proposition

Basis for discussing terms: terminology proposition

Used in this specification:

TID = type ID
IID = instance ID
AID = attribute ID
QID = qualified ID
ID = any of the above

ad-hoc object = singleton

"attribute ID" = path to the "key" of an attribute

Language description

Target example

storage:
  default:
    cls: buffer
    min_batch_size:
      content: 1000
    storage: <storage.filtered-uffizi>
  filtered-uffizi:
    cls: filter
    storage: <storage.uffizi>
  uffizi:
    cls: remote
    url: http://uffizi.internal.softwareheritage.org:5002/
loader-git:
  default:
    cls: remote
    storage: <storage.default>
    save_data_path: "/srv/storage/space/data/sharded_packfiles"
random-component:
  default:
    cls: foo
    celery: *celery
# Not a component: no type id, only instance id
_:
  celery: &celery
    task_broker: amqp://...
    task_queues:
      - swh.loader.git.tasks.UpdateGitRepository
      - swh.loader.git.tasks.LoadDiskGitRepository

Syntactic overview

Based on YAML:

  • restricted to YAML primitive types (includes dicts and lists)
  • restricted on document structure (see grammar below)
  • replaced YAML reference system with ours (if not hookable)

3 levels of depth: type, instance, attribute
Instance definitions are composed of an ID and a mapping to attributes.
1 instance <-> N attributes.
Component type definitions are composed of an ID and a mapping to instances.
1 type <-> N instances.
These instances are variants of the component: same type but different constructions.
Singletons are instances defined outside type definitions, so they live at top-level and have no type.

This model is syntactically complicated; here are alternatives to make it regular:
[alt=typePrefix] no type level, so only 2 levels; type and instance identifiers are merged as "type.instance".
[alt=typeAttr] move type identifiers to the attribute level, as a special attribute "type"
[alt=singletonType] use a dummy type for singletons: "singletons"

References can be made to an object defined somewhere else in the tree, using a qualified identifier.
Legal forms are defined to be from an attribute value to an instance identifier.
[opt=refkey] Legal forms also include from an attribute value to an attribute identifier. OOS

[opt=recattr] There may be recursion from an attribute value to an instance value definition. This allows anonymous definition of an instance object in an attribute. OOS

Grammar

WARNING: hopefully consistent grammar mixup. May be offending to purists.
Some definitions have alternatives noted with |=.

ID ~= PCRE([A-Za-z0-9_-]+)  # Could be stricter, e.g. snake_case
ID = TID | IID | AID
QID = (TID ".")? IID ("." AID)*  # opt: refkey, skey
   |= TID "." IID
ref = "<" QID ">"
attribute_value = YAML_object | ref
               |= YAML_object | ref | attributes  # opt: reckey
attributes = YAML_dict(AID, attribute_value)
instances = YAML_dict(IID, attributes)
singleton = YAML_object  # no opt: sref, skey
         |= YAML_dict(AID, attribute_value)
config_tree = YAML_dict(ID, YAML_dict)  # loose typing
           |= YAML_dict((TID, instances) | (IID, singleton))

Alternative definition of identifiers (always qualified):

ID ~= /[A-Za-z0-9_-]+/
TID = ID
IID = (TID ".")? ID
AID = IID "." ID+
QID = TID | IID | AID
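
For illustration, the identifier and reference shapes above can be expressed as Python regular expressions (the chevron marker is the proposed, non-definitive one):

```python
# Sketch: the ID grammar and the chevron reference marker as regexes.
import re

ID = r"[A-Za-z0-9_-]+"
# QID = (TID ".")? IID ("." AID)*  -- i.e. one or more dot-separated IDs
QID = rf"{ID}(?:\.{ID})*"
REF = re.compile(rf"^<({QID})>$")

# qualified instance reference
assert REF.match("<storage.uffizi>").group(1) == "storage.uffizi"
# [opt=refkey] reference down to an attribute
assert REF.match("<web-client.swh.url>").group(1) == "web-client.swh.url"
# plain values are not references
assert REF.match("cls: remote") is None
```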

Identifier

Identifier is abbreviated ID.
Type ID is abbreviated TID.
Singleton ID is equivalent to Instance ID, abbreviated IID.
(Attribute) Key is identified by Attribute ID, abbreviated AID.
Qualified ID is abbreviated QID.

QID is a sequence of IDs of the form (TID, IID) for component instances or (IID) for singletons. Its string form joins each field with ".".
[opt=refkey] May have the form (TID, IID, AID) to reference component instance attributes. Useful either to reference another attribute of the current instance, or any other attribute, except those defined in singletons.
[opt=reckey] May have the form (TID, IID, AID*) to reference recursive component instance attributes.
[opt=skey] May have the form (IID, AID*) to reference singleton attributes (a sequence of AIDs because recursive).
[opt=sref] Singleton attributes may reference any attribute.

Attribute

An attribute is a (key, value) pair; the set of attributes forms an instance dictionary.
An attribute value is either a YAML object or a reference.
[opt=recattr] An attribute value may also be an instance dictionary.

Attribute level is any level under instance level, recursive or not.

Reference

A reference is syntactically defined as a qualified identifier enclosed in chevrons. Its source is an attribute value and its target is the object identified by the QID it contains.
The reference is deleted when it is resolved by the reference resolution routine.

Type

Python type of a component to be instantiated and configured.
It is referred to indirectly through a TID in a configuration definition, and through a component constructor in the component type register.

Instance

Specific instantiation of a component, distinguished from the others by the set of attributes used to initialize it.

All identified instances of a type must be specified in the instance level of a configuration definition.

[opt=deftinst] An instance identified as "default" is instantiated if a TID but no IID is provided to the instantiation routine.

[opt=subinst] Instances may be referenced in an attribute value and be recognized as instance declarations; i.e. be instantiated and initialized and not just passed as is to the constructor.

[opt=anoninst] Anonymous instances may be defined in an attribute value, and be recognized as instance declarations.
-> [Q] that would need a type declaration, which is not yet handled in the spec, to identify it as an instance and be able to instantiate it. One inflexible option is to have the parent AID, being the child IID, be a TID, thus restricting its name to a known type ID. The other option is specifying the TID in a dedicated child attribute with a name similar to type or TID.

Singleton

Singleton objects are syntactically similar to instances.
Unless otherwise stated, the same rules apply.
They do not correspond to a predefined type, so they have no schema or attached semantics.
They are instantiated as a dict tree.

Library

Register

The component type register, abbreviated register, is a (TID, qualified_constructor) mapping, defined in the configuration library.
It is used by the component resolution routine to resolve type identifiers to Python type constructors.

Entries in this mapping are to be registered through the component registration library routine. This registration may happen anywhere provided it is executed at loading/import time. It is advised to register the component in the package that defines it.

qualified_constructor must be Python absolute import syntax for the object creating callable, which is either a factory function or a class.
It may be defined:
[alt] in quoted form (string).
[alt] in class object form.
[rem] The string form avoids an import, but no static existence check can be performed. If the registration is done in the package responsible for defining the component, the object form is preferable.

Components that may be registered are any SWH service component, SWH support component or external component that is public (i.e. has a Python object API).
In practice, a single main component per SWH service encapsulates all configuration for the components used in that service:

  • API servers
  • service workers

[Q] what about these config objects: journal related like journal-writer, most under deposit, celery-related

Type implementations

This section is informational.

A component type may have multiple implementations.
There is no specific support for it in this system, but as this concept may appear in configuration, related considerations may be worth noting.

[rem] The component type of an instance may be abstract, in which case a concrete type must be determined by the component constructor.

A specific attribute of an instance specifies the implementation or flavour to use. It is commonly identified as cls, but could be impl or flavor.
It used to be required, along with the now-deprecated args, because all configurable SWH components were polymorphic.
Components that have no such feature need no cls attribute in their configuration.
Alternatively, polymorphic components may be instantiated without cls, in which case a default implementation is used.

[rem] An indirection layer such as a factory may be defined for monomorphic components, to keep consistency and allow polymorphism later if needed. Alternatively, for all components, an abstract base class constructor would abstract this indirection layer away; it is preferable to a factory, which is not derivable from the component type by user code.

Instantiation

Instantiating is the process through which a concrete object is constructed from a model and data describing its (initial) state. In the context of this system, a Python object is created by calling its constructor with the set of attributes associated to a particular instance in a configuration definition.

The input is a QID identifying an instance and a configuration tree (dictionary) containing the instance and its dependencies (reference targets).
The output is a component instance.
The process is composed of the following steps in order.

  1. The instance dictionary, containing an attribute set, is fetched from the configuration by QID.
  2. Resolve references.
    1. Identify all references definitions in the instance dictionary.
    2. Resolve each to the reference target object, which may be atomic or composed.
    3. Replace the reference source by the resolved object.
  3. [opt=subinst] Interpret and compose instances.
    1. Identify all component instance definitions in the instance dictionary.
    2. [opt=anoninst] Identify anonymous instance definitions.
    3. Instantiate each instance.
    4. Replace each definition by the instantiated object.
  4. The component type of the instance is resolved from the TID contained in the QID, to a component constructor.
  5. The component constructor is called, passing the updated instance dictionary as arguments.
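The steps above can be sketched as follows, omitting reference resolution and sub-instance composition (steps 2-3) for brevity. `create_component` and the register argument are illustrative assumptions, not the final API.

```python
# Hypothetical sketch of the instantiation routine (steps 2-3 elided).
def create_component(config, qid, register):
    tid, iid = qid                        # QID = (TID, IID)
    attrs = dict(config[tid][iid])        # step 1: fetch instance dict by QID
    # steps 2-3: resolve references, compose sub-instances (not shown)
    constructor = register[tid]           # step 4: resolve TID to a constructor
    return constructor(**attrs)           # step 5: call with the attribute set

# Illustrative component and configuration, mirroring the objstorage example
# later in this document.
class PathSlicingObjStorage:
    def __init__(self, cls=None, root=None, slicing=None):
        self.root, self.slicing = root, slicing

config = {"objstorage": {"local": {"root": "/srv/objects", "slicing": "0:2/2:5"}}}
obj = create_component(config, ("objstorage", "local"),
                       {"objstorage": PathSlicingObjStorage})
assert obj.root == "/srv/objects"
```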

[opt=subinst] Identifying instance definitions requires a TID/IID.
[Q] As parent (ID: {instance attrs}) or child ({ID: ..., instance attrs})?

[opt=deftinst] An instance identified as "default" is instantiated if a TID but no IID is provided to the instantiation routine.

Instances must be instantiated only once and reused at each reference source.

Interpretation

(Validation/Conversion/Interpretation)

This section is informational.

Interpretation of attributes beyond what is stated above is out of scope and left to the component constructors.

Standard Python typing available in constructors may be used as the basis for validating configuration data.
Validity of structure, value and existence may be checked.
Conversions may also be performed.

[opt=validate] The library provides generic validation primitives and a validation routine based on a data model specification object.

Loading

(Loading/Defaults/Merging)

Loading is the process of fetching data from a storage medium into a memory space which is easily accessible to the processing system. In the context of this system, this data is then read and converted into a Python object.

The loading source may be: an I/O file abstraction (whatever its backing source), an operating system path to such a file abstraction, or such a path resolvable from an environment variable or a configuration file ID.

Only a Python dictionary is accepted as the holder of this data once loaded. A default configuration definition, either as a dictionary literal or a loaded configuration, can be specified, in which case every attribute absent from the loaded configuration is set to its default value.
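The per-attribute fallback to defaults could be sketched as a recursive dictionary merge. The function name `merge_with_defaults` is an illustrative assumption (the existing routine is called merge_configs, which a later remark proposes renaming).

```python
# Hypothetical sketch: per-attribute merge of a loaded configuration over
# a defaults definition, as described above.
def merge_with_defaults(loaded: dict, defaults: dict) -> dict:
    merged = dict(defaults)
    for key, value in loaded.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so absent nested attributes also fall back to defaults.
            merged[key] = merge_with_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged

defaults = {"storage": {"default": {"cls": "memory", "journal_writer": None}}}
loaded = {"storage": {"default": {"cls": "remote"}}}
merged = merge_with_defaults(loaded, defaults)
assert merged["storage"]["default"] == {"cls": "remote", "journal_writer": None}
```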

API overview

The library should be imported as config everywhere for clarity and uniformity (e.g. import swh.core.config as config or from swh.core import config).
[rem] Existing routine merge_configs should be moved to another module as merge_dicts.

Configuration object: mapping

WARNING:
In the following examples, names are subject to change.
Code is inspired by Python, but abstracted to focus on typing.
DeriveType simply denotes a type derived from an existing one, with no consideration of compatibility with the base type or any other.

Loading API

[rem] Should choose term among load, read, from, by, config
Example names: read_config, load_envvar

Config = DeriveType(Mapping)  # Tree. Allow only mapping in config definition top-level
ConfigFileID = DeriveType(str)  # opt=fileid
Envvar = DeriveType(str)
File = io.IOBase
Path = os.PathLike

load: (Union[File, Path, Envvar, ConfigFileID], defaults: Config?) -> (Config)
load_from_file: (File, defaults: Config?) -> (Config)
load_from_path: (Path, defaults: Config?) -> (Config)
load_from_envvar: (Envvar, defaults: Config?) -> (Config)
load_from_name: (ConfigFileID, defaults: Config?) -> (Config)  # opt=fileid

no default configs

Loads the source as a YAML tree and converts it to a Python recursive mapping.

[opt=fileid] may use an ID to reference files independently of their path or extension in the loading mechanism. This is sugar that existed but may no longer be wanted. OOS

[Q] Where to check for loadable path? In loading routines or user code? May duplicate behavior.
[Q] Should envvar be hardcoded in library or default? Same for default path.

Instantiation API

OOS: every function but create component

[rem] Should choose amongst get, read, from_config, instantiate, component, instance, iid.
Example names: get_component_from_config, instantiate_from_config, create_component, read_instance, get_from_id

TID = DeriveType(str)
IID = DeriveType(str)
QID = (TID, IID)  # Simplified form, may be Sequence(ID)
Component = DeriveType(type)
ComponentConstructor = DeriveType(Callable)  # Either type or function
InstanceConfig = DeriveType(Config)

create_component: (Config, QID) -> (Component)
create_component: (InstanceConfig, TID) -> (Component)

Returns an instantiated component identified by QID.
Uses get_obj, resolve_references, resolve_component, instantiate_component.

get_obj: (Config, QID) -> (Config)
get_instance: (Config, QID) -> (InstanceConfig)

Returns a config object (subtree) of the config identified by the config ID. May be used to get either all instances or one instance, depending on whether the config ID has an instance ID part.

resolve_references: (Config) -> (Config)
resolve_reference: (Config, QID) -> (Config)

Replaces reference source with the object identified by reference target.
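A sketch of how reference resolution might work, assuming the `<type.instance>` string syntax used in the example configuration later in this document. That syntax and the function body are illustrative assumptions, not a settled design.

```python
# Hypothetical sketch: replace each reference source (a "<tid.iid>" string)
# with the object found at the reference target in the configuration tree.
def resolve_references(config: dict) -> dict:
    def resolve(node):
        if isinstance(node, str) and node.startswith("<") and node.endswith(">"):
            tid, iid = node[1:-1].split(".", 1)  # reference target QID
            return resolve(config[tid][iid])     # substitute the target subtree
        if isinstance(node, dict):
            return {key: resolve(value) for key, value in node.items()}
        return node
    return resolve(config)

config = {
    "objstorage": {"local": {"cls": "pathslicing", "root": "/srv/objects"}},
    "objstorage-replayer": {"default": {"src": "<objstorage.local>"}},
}
resolved = resolve_references(config)
assert resolved["objstorage-replayer"]["default"]["src"]["cls"] == "pathslicing"
```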

find_instances: (InstanceConfig) -> (Set(InstanceConfig))

[opt=subinst] Finds all instance definitions nested in this instance definition and returns a list of them.
[Q] How to identify instances?
[Q] Should it recurse into nested definitions or just one level?

resolve_component: (TID) -> (ComponentConstructor)

Looks up the core.config register to get the constructor from the TID.

instantiate_component: (InstanceConfig, ComponentConstructor) -> (Component)

Instantiates a component using the given constructor and an instance configuration mapping.

Validation API

[opt=validate]

This section proposes a framework for validating instance definitions in a fairly lightweight and flexible way, for use by component constructors or injectors.

check: (Config) -> (Boolean)
check_definitions: (Config) -> (Boolean)
check_component: (InstanceConfig, ModelSpec) -> (Boolean)
generate_spec_from_signature: (ComponentConstructor) -> (ModelSpec)

check: validate both language and instances.
check_definitions: validate whole definition against language spec.
check_component: validate instance definition against component spec.
This is a template function which is parametrized by user-specified spec.

Model specification

AttrKey ~= String("[A-Za-z0-9_\-]+")
AttrVal = YAML_object

# Path in the instance configuration dictionary
Path ~= String("([A-Za-z0-9_\-]+/)+")

# Wrapper to convert falsey values or exceptions to False, otherwise True
ensure_boolean: Booleanish -> Boolean

# Generic and context-sensitive signatures for flexibility
value_check: ((AttrVal) | (AttrVal, InstanceConfig)) -> Booleanish

# If not optional, the existence check should succeed; else it is not performed.
optional_check: ((AttrVal, InstanceConfig) -> Booleanish) | Booleanish

# Checks whether attr exists at one of given paths, or anywhere if no path.
# No reason to have user customise existence check.
existence_check: (AttrVal, Set(Path), InstanceConfig) -> Boolean

# Here is the model specification
# Kwargs: best I found for a typed mapping where every item is optional
AttrProperties = Kwargs(value_check, optional_check, Set(Path))
# None for no checks on attribute
ModelSpec = Mapping(AttrKey, AttrProperties | None)

check_component verifies that all properties of every attribute hold in the instance definition, based on a user-defined model specification. The model specification can leverage primitive check functions and user-defined check functions. Supported checks are value checks and existence-in-tree-structure checks, which are distinguished for expressiveness.

The model specification lists each (unqualified) attribute that may exist in the configuration definition, along with the attribute properties that must hold.

An attribute may or may not be optional, i.e. validation fails on its absence or not, based on the boolean value of optional_check. optional_check may be a callable that determines whether the attribute is optional based on the configuration context and returns a booleanish value, or it may be a booleanish value itself. It is run in a wrapper which converts falsey values or exceptions to False, and anything else to True. A required attribute is checked for existence at the given set of paths in the tree if any, or anywhere in the tree otherwise. An optional attribute is not checked for existence, but still for a legal value.

The value check may be any callable that accepts either a single value, or a value and the configuration context (instance definition), and returns a booleanish value, handled as above. This makes it possible to use many existing functions or object constructors to do the validation, e.g. int, re.match, isinstance(Protocol), or a function verifying that a relation to another attribute in the definition holds.
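A lightweight sketch of check_component against a minimal model specification, covering only the optional flag and value checks described above (path-based existence checks omitted). The dict-based AttrProperties encoding is an illustrative assumption.

```python
# Hypothetical sketch of check_component: falsey results or exceptions from
# a check are converted to False by the ensure_boolean wrapper.
def ensure_boolean(func, *args):
    try:
        return bool(func(*args))
    except Exception:
        return False

def check_component(instance: dict, spec: dict) -> bool:
    for attr, props in spec.items():
        props = props or {}  # None means no checks on the attribute
        if attr not in instance:
            if props.get("optional", False):
                continue
            return False  # required attribute missing
        value_check = props.get("value_check")
        if value_check and not ensure_boolean(value_check, instance[attr]):
            return False
    return True

# Illustrative spec mirroring the pathslicing objstorage example
spec = {
    "root": {"value_check": lambda v: isinstance(v, str)},
    "slicing": {"value_check": lambda v: ":" in v},
    "compression": {"optional": True},
}
assert check_component({"root": "/srv/objects", "slicing": "0:2/2:5"}, spec)
assert not check_component({"root": 42, "slicing": "0:2/2:5"}, spec)
```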

Helper for specification generation

generate_spec_from_signature: generates a model specification where annotations are used as value_check functions wherever possible, arguments are optional or not depending on the existence of a default value, and the path set contains only the tree root. A mapping from types to validators is used to validate the most common types; others are only checked by isinstance. This is a helper function to generate a spec draft ahead of time, which must be corrected and stored along the corresponding constructor, as it cannot be guaranteed to function properly in all cases.

Components with multiple implementations:
Operations based on function signatures, such as validation but also instantiation, need a way to map the cls argument to the concrete type and constructor signature.
A solution to automatically use the right constructor is to implement single dispatch and overloading on the main constructor. Every method may still call the main one, but must have a signature compatible with that of the concrete class constructor, based on cls.

See also "Library/Type implementations" remark about abstract constructors.

Client code (need contributions)

Demonstration of features in every use case.

CLI, WSGI, worker, task, daemon, testing

CLI entrypoint

  • scan the current directory against the archive in docker
    • (cli flag) swh scanner --scanner-instance=docker scan .
    • (equivalent env var) SWH_SCANNER_INSTANCE=docker swh scanner scan .
    • actual python calls in the cli endpoint
import swh.core.config as config

scanner_instance = "docker"  # from cli flag or envvar
config_path = "~/.config/swh/default.yml"  # from cli flag, envvar, CLI default or core.config default
config_dict = config.load(config_path)
scanner = config.create_component(
    config_dict, config.QID(type="scanner", instance=scanner_instance)
)
scanner.scan(".")

API Server entrypoint

rpc-serve, WSGI app

def make_app_from_configfile() -> StorageServerApp:  # Or any other module App
    global app_instance
    if not app_instance:
        config_dict = config.load_from_envvar()
        rpc_instance = os.environ.get("SWH_STORAGE_RPC_INSTANCE", "default")
        app_instance = config.create_component(
            config_dict, config.QID(type="storage-rpc", instance=rpc_instance)
        )
        if not check_component(app_instance, "storage-rpc"):
            ...  # raise ConfigurationError or something?
    return app_instance

ardumont Completed the snippet ^ (unsure about it)

Celery task entrypoint

Celery task code

@shared_task(name="foo.bar")
def load_git(url):
    config_dict = config.load_from_envvar()
    loader_instance = os.environ.get("SWH_LOADER_GIT_INSTANCE", "default")
    loader = config.create_component(
        config_dict, config.QID(type="loader-git", instance=loader_instance)
    )
    return loader.load(url=url)

ardumont we moved away from passing parameters to the load function. The url parameter is to be passed along the constructor of the loader (same goes for lister, etc)

Testing / REPL

Example test

import swh.core.config as config

@pytest.fixture
def config_dict():
    return {...}

def test_config(config_dict):
    type_ID = "objstorage"
    instance_ID = "test_1"
    instance = config.create_component(
        config_dict, config.QID(type=type_ID, instance=instance_ID)
    )
    ...

@pytest.fixture
def config_path(datadir):
    return f"{datadir}/other.yml"

def test_config2(config_path):
    config_dict = config.load_from_path(config_path)
    instance = config.create_component(
        config_dict, config.QID(type=type_ID, instance=instance_ID)
    )
    ...

Environment

The environment parameters comprise any dependency of the configuration system external to the code.
This includes: configuration directory, configuration file, environment variable and commandline parameters.

Configuration directory

SWH configuration directory: SWH_CONFIG_HOME=$HOME/.config/swh

Configuration file

A YAML file with a .yml extension containing only the configuration data.
Default if none is specified to the generic loading routine: $SWH_CONFIG_HOME/default.yml.
[opt=conffileid] a configuration file id corresponding to the basename of a configuration file (without extension).
-> [Q] then only from $SWH_CONFIG_HOME or have a register?

Core configuration file parameter

This feature is to be built into SWH core library.
Specify the path to the configuration file to use for a whole service:
path_part = path | file
Environment variable: SWH_CONFIG_<PATH_PART>
CLI option: swh --config-<path_part>

[rem] "path" is a more precise term than "file".

Specific configuration parameters

A CLI option may be passed to specify an instance ID (only at 2nd level) when several alternatives are provided in the configuration.
Such option must be declared statically in CLI code.

Specify the instance configuration to use for a given component, using instance ID:
id_part = instance | id | iid | cid
SWH_<COMP>_<ID_PART> --<comp>-<id_part>

[rem] Any variant containing "id" is more precise than simply "instance".

A CLI option may be passed to override an attribute in the configuration.
Such option must be declared statically in CLI code.

Specify any other predefined configuration option:
SWH_<COMP>_<OPTION> --<comp>-<option>

[OOS] dynamic handling of any such options for any attribute, similar to what click permits.

Configuration priority

CLI has precedence over envvars.
Environment parameters have precedence over whole definitions (from file or code) and whole definitions have precedence over defaults, per-attribute.
This follows the principle that the particular takes precedence over the general.

CLI param > envvar param > CLI file > envvar file > default file > defaults literal

These precedence rules must be implemented in entrypoint client code, with the help of the library loading API. Only part of them may be implemented, the minimum being to accept a whole definition through either code or an envvar.
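The file-location part of this chain could be sketched as follows, assuming the SWH_CONFIG_FILENAME environment variable mentioned earlier and the default path from the Environment section. The `config_path` helper is an illustrative assumption, not library API.

```python
# Hypothetical sketch of the precedence chain for locating the configuration
# file: CLI option > environment variable > default path.
import os

def config_path(cli_path=None,
                envvar="SWH_CONFIG_FILENAME",
                default="~/.config/swh/default.yml"):
    # The particular takes precedence over the general.
    if cli_path:
        return cli_path
    if os.environ.get(envvar):
        return os.environ[envvar]
    return os.path.expanduser(default)

os.environ.pop("SWH_CONFIG_FILENAME", None)
assert config_path("/tmp/override.yml") == "/tmp/override.yml"
os.environ["SWH_CONFIG_FILENAME"] = "/etc/swh/config.yml"
assert config_path() == "/etc/swh/config.yml"
```

Per-attribute overrides (CLI param > envvar param > file contents > defaults) would then be applied on top of the loaded definition, in the same spirit.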

Example environment specifications (need contributions)

Using this objstorage replayer configuration file:

objstorage:
  local:
    cls: pathslicing
    root: /srv/softwareheritage/objects
    slicing: "0:2/2:5"
  s3:
    cls: s3
    s3-param: foo

journal-client:
  default:
    # single implem: no cls needed
    brokers:
      - kafka
    prefix: swh.journal.objects
    client-param: blablabla
  docker:
    brokers:
      - kafka.swh-dev.docker
    ...

objstorage-replayer:
  default:
    src: <objstorage.local>
    dst: <objstorage.s3>
    journal-client: <journal-client.default>

CLI usage:

Specify no instance, use default instance config:

  • swh objstorage replayer

Specify instance:

  • swh objstorage replayer --from-instance default

Specify nested instances (opt=subinst):

  • swh objstorage replayer --src local --dst s3
  • swh objstorage replayer --src local --dst s3 --journal-client docker

CLI options are to be defined statically.

Other proposals

@douardda's proposal: on-the-fly generation of instance config via syntactic sugar
swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3

[OOS] @tenma's proposal: dynamic handling of cli options wrt schema
swh objstorage replayer --objstorage-replayer.src=objstorage.local --objstorage-replayer.journal-client=journal-client.default
- generic, verbose, arbitrary attribute setting

Limitations (need contributions)

Depends on chosen functionalities.

OOS, rejected ideas (need contributions)

Depends on chosen functionalities.

Impacts (need contributions)

  • core.config library
  • main constructor/factory of each SWH component: type mapping or dynamic dispatch, use of library APIs for validation
  • entrypoints: use of library APIs for loading and instantiating
  • configuration files format
  • environment variables and cli calls in production+docker environments

Implementation plan: library, ops code, tests, prod (need contributions)

(Proposition)

Prepare for easy switch and rollback by creating configuration copies conforming to the new system, and code conforming to the new system in separate branches.

  • implement all library in the same file as before
  • migrate tests (at any moment)
  • prepare new config files, and service definitions that use them
  • migrate services one by one following SWH dependencies:
    • add needed declarations along with constructors
    • entrypoint loading, instantiating and injecting (if opt=subinst)
  • remove deprecated code

Synthesis of the meeting 2020-11-24

Participants: tenma, douardda, olasd, ardumont

Language:

  • use the "_" TID for all singletons. Then QID becomes a regular (TID, IID)
  • USE subinst: instance references may appear arbitrarily deep in an instance
  • OOS anoninst: every instance defined at 2nd level
  • OOS refkey, reckey, skey, sref: no reference to keys and singletons, may use YAML ref syntax

APIs:

  • OOS conffileid
  • specify only the public API, no instantiation plumbing, KISS
  • loading API: no merging, so remove defaults
  • instantiation API: only the component/singleton distinction
  • instantiation API: use instance methods? use keywords for QID
  • validation library for later, now only constructor validation

Implementation:

  • register: qualified_constructor should be a type object, not a str
  • register: populate from setuptools declarations
    references to constructor + documentation_builder (@olasd)

Environment:

  • comprehensive environment handling is good, but should be opt-in

Notes about implementation:

  • confirmed that anoninst needs an inline type declaration (but OOS)
  • how to prepare definitions for instantiation?
    Replace the reference by the instance in the definition (duplication?)
    The QID could be inserted in the definition as a key, and added to an instance register; that would make handling more regular

Still open:

  • rename singleton to ad-hoc object?
  • forgot to choose terms in APIs
  • forgot to cover usage of external components
  • factory constructors instead of factory functions: not convinced
    -> allow polymorphism, type and callable are associated
    -> factories reimplement single dispatch, which is built into classes

Conclusion:

  • no feedback on spec itself, as it is not ready (cleaned)
  • make clean spec in Sphinx and create diff
  • finish library draft and create diff (P878)
  • @olasd for details on populating register and docs through setuptools