# New SWH configuration scheme

https://forge.softwareheritage.org/T1410

Better config system that does not rely on implicit configurations.

## Use Cases

> [name=tenma] Aren't production and docker environments the same?
> [name=douardda] nope, since docker makes use of entrypoint scripts whereas prod uses systemd unit files, but they are pretty similar in most aspects.

### Production deployment

- celery workers ("tasks")
- RPC servers (via gunicorn)
- inter-communication between services (e.g. swh-web -> swh-storage, swh-scheduler -> swh-storage, etc.)
- replayer / backfiller / etc.

### Docker

- celery workers ("tasks")
- RPC servers (via gunicorn)
- inter-communication between services (e.g. swh-web -> swh-storage, swh-scheduler -> swh-storage, etc.)
- replayer / backfiller / etc.

### Tests

- easy to hack/specify any configuration used
- consistent loading with the same runtime code (aka no test-specific "if" branches)

### REPL

> [name=douardda] not sure what this use case really is about; can it be seen as part of the "cli tools" below? Personally I rarely need more than a `s = get_storage(cls='memory')` in a shell, so...
> Also I definitely do NOT want typing `s = get_storage()` in a shell or a script to silently use a config file somewhere (be it from the SWH_CONFIG_FILENAME or a default one).
> [name=ardumont] Indeed, sounds reasonable. That means using the factory without any parameters should default to in-memory implementations.

### cli tools

- end user: auth, web-api, swh cli tools (scanner, scheduler, ...)
- developer using cli tools to interact with either the production, the staging or a docker-deployed stack (e.g. using the `swh scheduler` command to manage tasks, `swh loader`, `swh lister`, ...)
- sysadmin/automation: to migrate django apps (webapp, deposit) through the cli...

## Rationale

### Current features

> [name=douardda] it's not strictly necessary to keep this "current features" section IMHO
> [name=tenma] maybe, but it reminds us what we may or may not want

The current configuration system is a utility library implementing diverse strategies of config loading and parsing, including the following functionalities:

- either absolute paths or files relative to some config directory location
- brittle abstraction from the config format: file extension whitelist, but no failure otherwise
- brittle abstraction from the config directory location: resilient but not strict
- priority loading from multiple default file paths
- no clear API distinction between loading and parsing/validating
- directional config merging
- partial non-recursive config key type validation+conversion
- both a mixin and a static class, where the config is a class attribute shared by all user code

### Wanted features

- consistent config definition and processing across the SWH codebase
- one API that fits all cases and is used everywhere
- priority loading with defined mechanics:
  - priority descending: CLI option > envvar > default path (only for interactive usage?)
  - merge with component-specific default config
- directional config merge: merge a specific definition with a default one
- namespaced by distinct roles, so that one fully qualified config key can be used by different components, and the same unqualified key may exist for different roles, for example:
  - "web-api/token" can be used by webclient and scanner
  - key "token" could exist in namespace/role web-api or whatever-api: different fully-qualified keys for the same unqualified key
- should have a straightforward API, possibly declarative, so that user code can plug config definitions in a single step (decorator, mixin/trait, factory attribute, etc.)

> [name=douardda] The declarative part seems pretty attractive, but we currently often use configuration items as constructor arguments of classes; how does this fit with the declarative aspect?
> [name=tenma] this. it couples config with constructor signatures, which leads to difficulties when renaming: rename the config element and all occurrences in constructors' signatures simultaneously. We can want this but it is the first time I see this kind of coupling; I couldn't rename args in the BufferedProxyStorage constructor because of this.

- configuration as attributes of the target class, to have proper doc/typing/validators; either flat or a Config object parametrized by class (either class object or config `cls` literal)
- may or may not want: decouple config keys from component constructor arguments (easy if Config is in another object) so that config keys and class attributes can evolve independently
- config is loaded in entrypoints (cli, celery task, gunicorn wsgi), not by each component (Loader, Lister, ...)
  > [name=ardumont] possibly wrap instantiation of other components as the factories get_storage, get_objstorage, get_indexer_storage, get_journal_client, get_journal_writer, etc. do
- process generic options like config-file in the top-level command

## Early concrete elements

### Format and location

File format: YAML

Default config:
- Separate file: local or global?
- Python code/Docstring

Specific config:
- Separate file of chosen file format

Environment variables:
- ? specific envvar like click auto envvars (e.g. SWH_SCANNER_CONFIG_FILE)
- global envvar (e.g. current SWH_CONFIG_FILENAME)

> [name=tenma] I would prefer SWH_CONFIG_(FILE)PATH to SWH_CONFIG_FILENAME, to be clear that it is not a basename (we may want that), but won't argue much.
> [name=ardumont] yes, PATH sounds better than NAME (it's a detail that can be taken care of later when everything else is centralized)

### Library

swh.core.config

- load/read config, which assumes config can be loaded and parsed (avoid duplicating click behavior)
- check that config can be loaded and parsed
- priority loading: CLI option > envvar > default path (only for interactive usage?). Either run with a switch or an envvar, else a hardcoded default path

### Usage

See example from scanner CLI.

---

## Current situation

> [name=douardda] this section probably needs to be moved somewhere else.

These are examples of config files as currently used (we focus here on the configuration itself, not on where these files are loaded from).

Most of the configuration files use the form:

```
<swhcomponent>:
  cls: <select the implementation to use>
  args: <dict of args passed to the class constructor>
```

Also most (?) CLI tools for swh packages use the same pattern: the config file loading mechanism is handled in the main click group for that package (e.g.
in `swh.dataset.cli.data_cli_group` for `swh.dataset`, or `swh.storage.cli.storage` for the storage, etc.)

### objstorage

The generic config for an objstorage looks like:

```
objstorage:  # typ. used as swh.objstorage.factory.get_objstorage() kwargs
  cls: pathslicing
  args:
    root: /srv/softwareheritage/objects
    slicing: 0:5
```

In which we have the main config entry: how to access the underlying objstorage backend, then one (or more) configuration items for the objstorage RPC server (for which one needs to read the code to know what options are accepted).

#### rpc-server

The config is checked in `swh.objstorage.api.server.make_app` with some validation in `swh.objstorage.api.server.validate_config`. It also accepts a `client_max_size` top-level argument, which is the only "extra" config parameter supported (used in `make_app`).

#### WSGI/gunicorn

When started via gunicorn: `swh.objstorage.api.server:make_app_from_configfile()`

This function takes care of the presence of the SWH_CONFIG_FILENAME, loads the config file, validates it (`validate_config`), then calls `make_app`.

#### replayer

The objstorage replayer needs 2 objstorage configurations (src and dst) and a journal_client one, e.g.:

```
objstorage_src:
  cls: remote
  args:
    url: http://storage0.euwest.azure.internal.softwareheritage.org:5003
    max_retries: 5
    pool_connections: 100
    pool_maxsize: 200
objstorage_dst:
  cls: remote
  args:
    url: http://objstorage:5003
journal_client:
  cls: kafka
  brokers:
    - kafka1
    - kafka2
    - kafka3
  group_id: test-content-replayer-x-change-me
```

The `journal_client` config item is directly used as argument of the `swh.journal.client.get_journal_client()` factory.

### storage

```
storage:
  cls: local
  args:
    db: postgresql:///?service=swh-storage
    objstorage:
      cls: remote
      args:
        url: http://swh-objstorage:5003/
    journal_writer:
      cls: kafka
      args:
        brokers:
          - kafka
        prefix: swh.journal.objects
        client_id: swh.storage.master
```

In which we have the same config system for the main underlying (storage) backend. Besides the configuration of the underlying storage access, there can also be the configuration for the linked objstorage and journal_writer. The former is passed directly to the `swh.storage.objstorage.ObjStorage` class, which is a thin layer above the real `swh.objstorage.ObjStorage` class (instantiated via `get_objstorage()`). The latter is directly used as argument of the `swh.storage.writer.JournalWriter` class. Also note that the instantiation of the objstorage and journal writer is done in each storage backend (it's not a generic behavior of `get_storage()`).

#### rpc-serve

Same as the general case + inject the `check_config` flag from cli options if needed.

#### WSGI/gunicorn

`swh.storage.api.server:make_app_from_configfile()`

#### replayer

This tool needs 2 entries: the destination storage (same config as above) + the journal client config (`journal_client`), like:

```
storage:
  cls: remote
  args:
    url: http://storage:5002/
    max_retries: 5
    pool_connections: 100
    pool_maxsize: 200
journal_client:
  cls: kafka
  brokers:
    - kafka-broker:9094
  group_id: test-graph-replayer-XX5
  object_types:
    - content
    - skipped_content
```

The `journal_client` config item is directly used as argument of the `swh.journal.client.get_journal_client()` factory.

#### backfiller

The backfiller uses a "low-level" config scheme, because it needs direct access to the database:

```
brokers:
  - broker1
  - ...
storage_dbconn: postgresql://db
prefix: swh.journal.objects
client_id: <UUID>
```

The config validation is performed within the `JournalBackfiller` class.
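For orientation, a minimal sketch of how a file following the `cls`/`args` pattern above typically reaches a factory (the file path is illustrative, and the exact `args` handling varies across swh versions):

```python
# Minimal sketch, assuming the cls/args layout shown above; the file path is
# hypothetical and factory signatures differ between swh versions.
import yaml

from swh.storage import get_storage

with open("/etc/softwareheritage/storage.yml") as f:  # hypothetical path
    config = yaml.safe_load(f)

storage_cfg = config["storage"]
# the factory dispatches on "cls" and forwards the remaining items to the
# selected backend's constructor
storage = get_storage(storage_cfg["cls"], **storage_cfg["args"])
```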
### dataset

In `swh.dataset`, the loaded config is directly passed to `GraphEdgeExporter` via `export_edges` and `sort_graph_nodes`.

For the `GraphEdgeExporter`, these config values are actually the `**kwargs` of `ParallelExport.process` plus the `remove_pull_requests` flag extracted from the `config` dict in `process_messages()`. This `ParallelExporter` uses a single config entry, `journal`, the configuration of a journal client.

For `sort_graph_nodes`, config values are:

- `sort_buffer_size`
- `disk_buffer_dir`

### deposit

:::info
The main `click` group of `swh.deposit` does **not** load the configuration file.
:::

However, it provides a `swh.deposit.config.APIConfig` class that loads the configuration from the `SWH_CONFIG_FILENAME` file. The generic implementation expects a `scheduler` entry, and has default values for `max_upload_size` and `checks`.

:::info
The current config file for the deposit service in docker looks like:

```
scheduler:  # used by the deposit RPC server
  cls: remote
  args:
    url: http://swh-scheduler:5008
# deposit server now writes to the metadata storage (storage)
storage_metadata:
  cls: remote
  args:
    url: http://swh-storage:5002/
storage:
  cls: remote
  url: http://swh-storage:5002/
# needed ^ for the old migration script (we cannot remove it or init fails)
allowed_hosts:  # used in "production" django settings (server)
  - '*'
private:  # used in "production" django settings (server)
  secret_key: prod-in-docker
  db:
    host: swh-deposit-db
    port: 5432
    name: swh-deposit
    user: postgres
    password: testpassword
media_root: /tmp/swh-deposit/uploads
extraction_dir: "/tmp/swh-deposit/archive/"  # used by swh.deposit.api.private.deposit_read.APIReadArchives()
```

> [name=douardda] I'm not sure how all these config entries are used, and by which piece of code.
> [name=ardumont] clarified the parts not explained, dropped the obsolete ones
> [name=ardumont] by cleaning up, i saw a discrepancy about the storage_metadata key, fixed.
> [name=ardumont] it's one entangled configuration file used by all deposit modules: the api, the "private" api and the workers, each using a subset combination of those... To actually see what's used by what now, better look at the production configuration instead.
:::

#### client tools

The `swh.deposit.cli.client` clis do not explicitly implement configuration loading from a file; instead every configuration option is given as a cli option. **However**, some classes instantiated from there do support loading a config file from the `SWH_CONFIG_FILENAME` environment variable.

Config entries for a deposit client are:

- `url`
- `auth` (a dict with `username` and `password` entries)

> [name=ardumont] "some classes instantiated from there do support loading": True. But it's not used within that particular cli context.
> [name=ardumont] That part is now covered with integration tests (no more mock) so modifications on that part should be simpler

#### admin tools

The `swh.deposit.cli.admin.admin` click group does implement the config file loading pattern (actually the loading itself is implemented in the `setup_django_for()` function). This function loads the django configuration from `swh.deposit.settings.<platform>` (with `<platform>` in `["development", "production", "testing"]`), and sets the `SWH_CONFIG_FILENAME` environment variable to the `config_file` argument given.
> [name=ardumont] That's some not pretty stuff that will hopefully get simplified with this spec ;)
> [name=ardumont] That part is now covered with tests so modifications will be simpler as well

#### celery worker

The deposit provides one celery worker task (`CheckDepositTsk`) which loads its configuration exclusively from `SWH_CONFIG_FILENAME`. The only config entry used is the `deposit` server connection information.

#### RPC server

The deposit server uses the standard django configuration scheme, but the selected config module is managed by `swh.deposit.config.setup_django_for()`.

A tricky thing is the `swh.deposit.settings.production` django settings module, since it does load the `SWH_CONFIG_FILENAME` config file (but NOT in the `development` nor `testing` flavors).

In `production` mode, it expects the configuration to have:

- `scheduler`
- `private` (credentials for the admin pages of the deposit)
- `allowed_hosts` (optional)
- `storage`
- `extraction_dir`

:::warning
> [name=douardda] not sure I have all deposit config options/usages
> [name=ardumont] in doubt, look at the [puppet manifest configuration](https://forge.softwareheritage.org/source/puppet-swh-site/browse/production/data/defaults.yaml$1687-1701)
> [name=ardumont] all deposit usages are there
> [name=ardumont] as far as my understanding of django goes, this is indeed the standard way of configuring django (I dropped the `(?)`).
:::

### graph

The main click group of `swh.graph` does load the config file, but it does not fall back to `SWH_CONFIG_FILENAME` if no config file is given as a cli option argument.

Supported configuration values are declared/checked in the `swh.graph.config` module. There is no main "graph" section or namespace in the config file, so all config entries are expected at the file's top level:

- batch_size
- max_ram
- java_tool_options
- java
- classpath
- tmp_dir
- logback
- graph.compress (for the compress tool)

### indexer

The main click group of `swh.indexer` does load the config file, but it does not fall back to `SWH_CONFIG_FILENAME` if no config file is given as a cli option argument.

For the indexer storage, a standard `swh.indexer.storage.get_indexer_storage()` factory function is provided, and is generally called with arguments from the `indexer_storage` configuration entry.

#### schedule

The `swh.indexer.cli.schedule` command uses the config entries:

- indexer_storage
- scheduler
- storage

#### journal_client

The `swh.indexer.cli.journal_client` command (listens to the journal to fire new indexing tasks) uses the config entries:

- scheduler

The connection to the kafka broker is handled only by command line option arguments.

#### RPC server

When started using the `swh indexer rpc-serve` command, it expects a config file name as a required argument. Configuration entries are:

- indexer_storage

#### WSGI/gunicorn

When started as a WSGI app, the configuration is loaded from the `SWH_CONFIG_FILENAME` environment variable (in `make_app_from_configfile`).

### journal

The journal can be used from the producer side (e.g. a storage's journal writer) or the consumer side.

The `swh.journal.client.get_journal_client(cls, **kwargs)` factory function is generally used to get a journal client connection, with arguments directly from the `journal_client` (or `journal`) configuration entry.
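As a small illustration of the consumer side (only `get_journal_client` is the real entry point; the file path and surrounding glue are ours):

```python
# Sketch: feeding the "journal_client" entry of a replayer-style config to the
# factory. The path and glue code are illustrative.
import yaml

from swh.journal.client import get_journal_client

with open("/etc/softwareheritage/replayer.yml") as f:  # hypothetical path
    config = yaml.safe_load(f)

journal_cfg = dict(config["journal_client"])
cls = journal_cfg.pop("cls")                     # e.g. "kafka"
client = get_journal_client(cls, **journal_cfg)  # brokers, group_id, ...
```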
The `swh.journal.writer.get_journal_writer(cls, **kwargs)` factory function is used to get a producer journal connection, with arguments directly from the `journal_writer` configuration entry (generally it's a subentry of the "main" `storage` config entry, as seen above in the storage config example).

### loaders

Loaders are mostly celery workers. There is a cli tool to synchronously execute a loading.

When run as a celery worker task, the configuration loading mechanism is detailed in the scheduler section below.

When executed directly, via `swh loader run`, the loader class is instantiated directly, thus it's the responsibility of the latter to load a configuration file. This is normally done by using the `swh.core.config.load_from_envvar` class method.

### listers

The main lister cli group does handle the loading of the config file, including falling back to `SWH_CONFIG_FILENAME` if not given as a command line argument.

Expected config options are:

- lister
- priority (optional)
- policy (optional)
- any other option accepted as a config option by the lister class, if any

The `swh lister run` command also instantiates a lister class. The base implementation supports the configuration options:

- cache_dir (unclear if this can be overloaded by a config file)
- cache_responses (same)
- scheduler
- lister
- credentials (for listers inheriting from ListerHttpTransport)
- url (same)
- per_page (bitbucket)
- loading_task_policy (npm)

When used via a celery worker, the standard celery worker config loading mechanism is used (see the scheduler below).

### scanner

The scanner's cli implements its own strategy for finding the configuration file to load (including looking at the `SWH_CONFIG_FILENAME` variable).

It only needs connection information for the public web API:

- url
- auth-token (optional)

### scheduler

The scheduler consists of several parts.

#### celery

Every piece of code that involves loading the celery stack of `swh.scheduler`, i.e. that imports the `swh.scheduler.celery_backend.config` module, will load the configuration file from `SWH_CONFIG_FILENAME`, in which at least a `celery` section is expected.

Celery workers are registered from the `swh.workers` pkg_resources entry point as well as the `celery.task_modules` configuration entry. The main celery app singleton is then configured from a hardcoded default config dict merged with the `celery` configuration loaded from the configuration file.

#### celery workers

Celery workers are started by the standard celery command (`python -m celery worker`) using `swh.scheduler.celery_backend.config.app` as the celery app, so the configuration loading mechanism is the default celery one described above, and the only way to specify the configuration file to load is via the `SWH_CONFIG_FILENAME` variable.

#### cli tools

The main click group does implement the `--config-file` option, and uses the `swh.core.config.read()` function. So this main config file loading mechanism **does not** fall back to the `SWH_CONFIG_FILENAME` variable.

At this level, the only expected config entry is `scheduler` (connection to the underlying scheduler service).
Additional config entries for cli commands:

- `runner`:
  - `celery`
- `listener`:
  - `celery`
- `rpc-serve`:
  - any flask configuration option
- `celery-monitor`:
  - `celery`
- `archive`:
  - any option accepted by `swh.scheduler.backend_es.ElasticSearchBackend`

#### WSGI/gunicorn

The loading of the WSGI app normally uses the `swh.scheduler.api.server.make_app_from_configfile()` function, which takes care of loading the config file from `SWH_CONFIG_FILENAME` with no fallback to a default path. The loaded config is added to the main flask `app` object, so any flask-related config option is possible (at the configuration's top level).

### search

#### cli tools

The `swh.search` main cli group does implement the `--config-file` option (using `swh.core.config.read()` to load the file).

Config options by cli command:

- `initialize`:
  - `search`
- `journal-client objects`:
  - `journal`
  - `search`
- `rpc-serve`: expects a config file name given as a (mandatory) argument (in "addition" to the general `--config-file` option). This configuration is then used to configure the flask-based RPC server.
  - `search`
  - any flask config option

#### WSGI

The creation of the WSGI app is normally done using `swh.search.api.server.make_app_from_configfile`, which uses the `SWH_CONFIG_FILENAME` variable as the (only) way of setting the config file.

### vault

#### cli tools

There is no support for the `--config-file` option in the main click group, but the cli only provides one command (`rpc-serve`), which does support this option. The configuration file is loaded in `swh.vault.api.server.make_app_from_configfile()`, and the main RPC server is `aiohttp` based.

### web

Django-based stuff.

### web client

No config file loading for now (?)

---

## Configuration loading mechanisms

Config file loading functions used:

| package | command | loading function | called from | config path |
| -- | -- | -- | -- | -- |
| dataset | `swh dataset ...` | `swh.core.config.read()` | `swh.dataset.cli.dataset_cli_group()` | `--config-file` |
| deposit | `swh deposit client` | | | |
| deposit | `swh deposit admin` | `config.load_named_config()` | | `--config-file`, `SWH_CONFIG_FILENAME` |
| deposit | HTTP application | `config.read_raw_config()` | | `SWH_CONFIG_FILENAME` via `DJANGO_SETTINGS_MODULE` |
| graph | `swh graph ...` | `swh.core.config.read()` | `swh.graph.cli.graph_cli_group()` | `--config-file` |
| graph | WSGI app | ?? | | ?? |
| indexer | `swh indexer ...` | `swh.core.config.read()` | `swh.indexer.cli.indexer_cli_group()` | `--config-file` |
| indexer | `swh indexer rpc-serve` | `swh.core.config.read()` | `swh.indexer.storage.api.server.load_and_check_config()` | `config-path` (via `s.i.cli.rpc_server()`) |
| indexer | WSGI app | `swh.core.config.read()` | `swh.indexer.storage.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME` (via `s.i.s.a.s.make_app_from_configfile()`) |
| icinga | `swh icinga_plugins ...` | | | |
| lister | `swh lister ...` | `swh.core.config.read()` | `swh.lister.cli.lister()` | `--config-file`, `SWH_CONFIG_FILENAME` (via `w.l.cli.lister()`) |
| lister | celery worker | `config.load_from_envvar()` | `swh.lister.core.simple_lister.ListerBase()` | `SWH_CONFIG_FILENAME`, `<SWH_CONFIG_DIRECTORIES>/lister_<name>.<ext>` |
| loader.package | `swh loader ...` | `config.load_from_envvar()` | `swh.loader.package.loader.PackageLoader()` | `SWH_CONFIG_FILENAME` |
| loader.core | `swh loader ...` | `config.load_from_envvar()` | `swh.loader.core.loader.BaseLoader()` | `SWH_CONFIG_FILENAME` |
| objstorage | `swh objstorage ...` | `swh.core.config.read()` | `swh.objstorage.cli.objstorage_cli_group()` | `--config-file`, `SWH_CONFIG_FILENAME` (via `s.o.cli.objstorage_cli_group()`) |
| objstorage | WSGI app | `swh.core.config.read()` | `swh.objstorage.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME` (via `s.o.api.server.make_app_from_configfile()`) |
| scanner | `swh scanner ...` | `config.read_raw_config()` | `swh.scanner.cli.scanner()` | `--config-file`, `SWH_CONFIG_FILENAME`, `~/.config/swh/global.yml` |
| scheduler | `swh scheduler ...` | `swh.core.config.read()` | `swh.scheduler.cli.cli()` | `--config-file` |
| scheduler | WSGI app | `swh.core.config.read()` | `swh.scheduler.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME` |
| scheduler | celery worker | `swh.core.config.load_named_config()` | `swh.scheduler.celery_backend.config` | `SWH_CONFIG_FILENAME`, `<swh.core.config.SWH_CONFIG_DIRECTORIES>/worker/<name>.<ext>`, `<s.c.c.SWH_CONFIG_DIRECTORIES>/worker.<ext>` |
| search | `swh search ...` | `swh.core.config.read()` | `swh.search.cli.search_cli_group()` | `--config-file` |
| search | `swh search rpc-server` | `swh.core.config.read()` | `swh.search.api.server.load_and_check_config()` | `config-path` |
| search | WSGI app | `swh.core.config.read()` | `swh.search.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME` (from `s.s.api.server.make_app_from_configfile()`) |
| storage | `swh storage ...` | `swh.core.config.read()` | `swh.storage.cli.storage()` | `--config-file`, `SWH_CONFIG_FILENAME` (from `s.s.cli.storage()`) |
| storage | WSGI app | `swh.core.config.read()` | `swh.storage.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME` (from `s.s.api.server.make_app_from_configfile()`) |
| vault | `swh rpc-serve` | `swh.core.config.read()` or `swh.core.config.load_named_config()` | `swh.vault.api.server.make_app_from_configfile()` | `--config-file`, `SWH_CONFIG_FILENAME` (from `s.v.api.server.make_app_from_configfile()`), `<swh.core.config.SWH_CONFIG_DIRECTORIES>/vault/server.<ext>` |
| vault | celery worker | `swh.core.config.read()` or `swh.core.config.load_named_config()` | `swh.vault.cookers.get_cooker()` | `SWH_CONFIG_FILENAME` (from `...get_cooker()`), `<s.c.c.SWH_CONFIG_DIRECTORIES>/vault/cooker.<ext>` (from `get_cooker()`) |
| web | XXX | | | |
| web-client | | None | | |

# Synthesis of discussion @tenma/@ardumont 2020-10-08

## Impacts

- only swh modules' source code to migrate
- docker/production/staging should run as before once the changes are deployed (docker-compose.yml, puppet manifests untouched)

## Definitions

- component: an SWH base component (e.g. Loader, Lister, Indexer, Storage, ObjStorage, Scheduler, etc.)
- entrypoint: an SWH component orchestrator (celery worker "task", cli, wsgi, ...)

## Possible plan

- each component repository declares an `swh.<component>.config` module (like what we declare today for tasks in `swh.<component>.tasks`)
- the module declares a typed object Config
- the typed object Config is in charge of declaring config keys, with default values, and (gradually) validating the configuration; it fails to instantiate if misconfigured (possible implementation: @attr as swh.model.model does)
- each entrypoint is in charge of instantiating the configuration object
- each entrypoint is in charge of injecting the Config object into the component
- each component must take one specific typed Config as a constructor parameter
- existing code loading the configuration out of an environment variable is removed
- existing code validating the configuration, if any, is removed
- the merge policy about loading from an environment variable, a cli flag, or whatever else is delegated to a function in swh.core.config
- corollary: the pseudo-typed code in swh.core.config which kinda validated the types must be dropped (i think it's dead code anyway)

## Pre-requisites

The plan described above should respect the following:

- separation of concerns (doing one thing and doing it well: merge policy, loading config, validating, running...)
- api unification between entrypoints and tests (consistency): all entrypoints respect the same pattern of instantiating, configuring and injecting
- fail early if misconfigured

## Out of scope

- (Global Inversion of Control) A component injector in charge of instantiating, configuring and injecting between objects (à la Spring Framework)

## Feedback on the proposal

### olasd

I *strongly* like that this is going towards having fewer dicts being thrown around in our code around object instantiation, in favor of more strongly typed objects. I also think this can be implemented in a DRY way by parsing the signature of the classes that are being instantiated.

I'm not sure that the backwards compatibility for production is *such* a strong requirement, even though it would definitely be nicer.

> [name=ardumont] "strong requirement": it is not, but please let's change one thing at a time. It will definitely help to not touch that part as a first step (especially if that breaks). And when we know the migration worked (implementation underneath changed, all deployed, everything ok as before), then we can change the config values incrementally with small changes.
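(Aside on the plan above: a minimal sketch, assuming the attrs-based approach it mentions; the class, fields and allowed values are hypothetical, not an agreed-upon API.)

```python
# Hypothetical sketch of a typed Config object ("@attr as swh.model.model
# does"); names and allowed values are illustrative only.
import attr


@attr.s
class StorageConfig:
    cls = attr.ib(type=str, validator=attr.validators.in_({"remote", "local", "memory"}))
    url = attr.ib(type=str, default="")


# fails early (ValueError from the validator) if misconfigured:
cfg = StorageConfig(cls="remote", url="http://uffizi:5002/")
```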
#### Configuration sharing

This proposal doesn't seem to be solving the concern of shared configuration across so-called components. Let me use a concrete example:

- the `WebClient` (or whatever its name is) class in swh.web.client takes a `token` parameter for authentication, and a `base_url` for its setup
- the `SwhFuse` component uses a `WebClient`
- the `SwhScanner` component also uses a `WebClient`
- the `swh fuse` command takes a config file with its own parameters, as well as parameters for a web client
- the `swh scanner` command takes a config file with parameters for a web client; I expect most of its other parameters come from the CLI directly

In the proposal it's not clear to me how the following would happen:

- `swh.web.client` declares its configuration schema in `swh.web.client.config`
- `swh.fuse` and `swh.scanner` do the same in their own `config` modules
- the `swh fuse` and `swh scanner` entrypoints parse a configuration file; from the output, they instantiate a `SwhFuse` / `SwhScanner`.

Now that I've written all of this, I guess this could be solved by having a way for the `swh.fuse.config` and `swh.scanner.config` modules to declare that they're expecting a `swh.web.client.config` at the toplevel of their configuration file (rather than in a nested way like the `get_storage` factories currently work). Did you have a different idea?

> [name=tenma] This is my point about namespaces/roles/contexts, in my initial suggestions (that we skimmed over in the last discussion and did not include). It would be a way to both share and distinguish config keys. My proposition avoids some kind of hierarchy with arbitrary depth, like composition (owning another config) and inheritance (subclassing another config), and is more similar to naming systems that use tags/prefixes. But it would imply that as a team we define those namespaces/roles/contexts ahead of time (we should choose a definitive name for this, I prefer role). Example:
> - `web-api/token`: token in the context of web-api
> - `ext-service/token`: token in the context of an external service
> - both `swh.web.client.config` and `swh.scanner.config` could reference `web-api/token`
>
> The injector in entrypoints, reading the config definitions of the components it instantiates, would see which config keys are shared, and could create a merged config definition from this (if we want to). But this choice would require adapting config definitions, which we can leave untouched for now.

> [name=ardumont] my initial answer was lost... I replied something about configuration composition at the time, so what olasd suggested, but i think we moved away from this anyways... (see the next paragraphs)

### douardda

We may see the problem we are trying to solve as:

- what do we want the configuration files to look like in any of the currently known use cases? (and how far are we with what we currently have?)
- how do we want to declare these configuration structures (typing, default values, etc.)?
- where should we instantiate/load these configuration structures?

---

# Synthesis of discussion @tenma/@douardda/@olasd 2020-10-12

## Taste

Examples that are not accurate but demonstrate syntax and functionality.

Example config (global, mixing unrelated components) declarations:

```
storage:
  uffizi:
    cls: remote
    url: ...
  staging:
    cls: remote
    url: ...
  local:
    cls: pgsql
    conninfo: ...
    objstorage: <objstorage.foo>
  ingestion:
    cls: pipeline
    steps:
      - cls: filter
      - cls: buffer
      - <storage.uffizi>

objstorage:
  foo:
    cls: pathslicing

objstorage-replayer:
  to_S3:
    cls: ...
    src: <objstorage.local>
    dst: <objstorage.S3>

web-client:
  swh:
    token: ...
    url: ...
```

This demonstrates:

- general syntax, analogous to the current one
- one more level to distinguish multiple instances of components, which may all be instantiated or be alternatives of each other
- package must be the package of an SWH service
- the same key name (leaf) can exist in different packages/instances
- reference syntax to reference qualified instances
- TODO: support references to a key? Would be <web-client.swh.url> in the above example

> [name=ardumont] Unclear on the scope of that configuration sample.
> - Is it a sample configuration for one service serving one module (matching what we already have)?
> - Or is it one global configuration file for all modules, thus defining all production combinations for all services? Then each service (storage, loader-git, etc...) picks the information it needs from that file?

Example CLI usage:

```
swh storage rpc-serve --storage=local
SWH_STORAGE=local swh storage rpc-serve
```

> [name=ardumont] is that dedicated to one service? SWH_STORAGE for `swh storage`, SWH_SCHEDULER for `swh scheduler` and so on and so forth...
> or is the following possible as well?
> ```
> SWH_STORAGE=local swh loader run git https://...
> ```
> which would make the storage to use a local one within the scope of the loading?
> [name=tenma] here `local` corresponds to an instance. The injector will instantiate this instance and fill the references to it in the config.
> maybe the option/envvar would be more qualified, like X.Y.storage=instance.
> We did not discuss it in detail. More about this idea with olasd or douardda.
> [name=ardumont] i still do not get the scope of the sample ¯\_(ツ)_/¯ (i tried to clarify my questions ^)

## Configuration model

### Configuration statement syntax

3 levels (real names to be defined): package, instance, attribute (key:value)

Examples of existing "package" identifiers:

- storage
- objstorage
- objstorage-replayer
- indexer-storage
- journal
- ...

Anonymous instances in the config file (for use as a value) are out of scope for now.

> [name=ardumont] well, my understanding so far would mean that such anonymous instances (if any) should no longer exist and be attached to some other level, level "package" then.
> [name=tenma] yes, the idea was to have something regular without a shorthand anonymous form, at least for now. An instance must then be defined at the 2nd level in a package. If many instances are needed, it may become tedious to declare each this way when they are mostly used once; then it is not very difficult to add a syntax for anonymous instance definitions.

Package corresponds to an existing identifier, the one of the SWH Python package that defines the components to configure. It is used to resolve component types (search for type `remote` in `storage`) and group components of the same service/package.

/!\ what about components that are not of the main type of the package? Would need inclusion into the map and factory. Ex: non-storage component in storage package?

> [name=ardumont] What about currently nonexistent "package" identifiers, loaders, listers, ...; is the following good enough?
> - loader-npm
> - lister-cran
> - ...
>
> [name=tenma] right, for those we would need a top-level mapping I think...
> I do not know the whole type hierarchy...
> would the "package" be "loader" and the type "npm", or the package "loader-npm" and the type "task-?". To be defined...
The `cls` (alt: `type`) key is a type identifier and denotes what kind of object schema to use. Other keys are instance keys conforming to the schema referenced by `cls`; more concretely, the class constructor arguments.

`<` and `>` are reference markers and denote a reference to a qualified instance or key. The choice of markers is not definitive. The YAML reference feature (syntax `&`/`*`) is rejected because we want to keep references in our model and thus not have them processed outside our control.

### Configuration declaration

Packages are defined statically, most probably in core.config. They could be "discovered", but we may prefer whitelisting supported ones.

> [name=ardumont] I'm under the impression that, for the discoverability, we could plug that part into the "registry mechanism" already in place for the module "tasks". Adding a config key in the output of the `register` function there or something.
> swh.`<module>`.__init__ defines something like (extended with "config" here):
>
> ```
> def register() -> Mapping[str, Any]:
>     """Register the current worker module's definition (tasks, loader, config, ...)"""
>     from .loader import SomeLoader
>
>     return {
>         "task_modules": [f"{__name__}.tasks"],
>         "loader": SomeLoader,
>         "config": [f"{__name__}.config"],  # <- or something
>     }
> ```
>
> Then in the setup.py of the module:
>
> ```
> setup(
>     ...
>     entry_points="""
>         ...
>         [swh.workers]
>         ...
>         lister.bitbucket=swh.lister.bitbucket:register
>         ...
>         loader.someloader=swh.loader.some:register
>         ...
> ```
>
> And some other part of swh.core reads that code to actually declare the tasks.

Packages are associated with a factory function to instantiate instances as `InstanceType(**keys)`, and optionally a (class_identifier_string : Python_class) map/register.

Example: Storage package

```
map = {
    "remote": RemoteStorage,
    "local": LocalStorage,
    "buffer": BufferedStorage,
    ...
}

def get_storage(*args, **kwargs):
    ...
```

> [name=ardumont] I gather it works for loaders, listers, indexers as well, with for example, git loaders:
> ```
> map = {
>     "remote": GitLoader,
>     "disk": GitLoaderFromDisk,
>     "archive": GitLoaderFromArchive,
> }
>
> def get_loader(*args, ...):
>     ...
> ```
>
> bitbucket listers:
>
> ```
> map = {
>     "full": FullBitBucketLister,
>     "range": RangeBitBucketLister,
>     "incremental": IncrementalBitBucketLister,
> }
> ```
>
> indexers:
>
> ```
> map = {
>     "mimetype": ContentMimetypePartition,
>     "fossology-license": ContentFossologyLicensePartition,
>     "origin-metadata": OriginMetadata,
>     ...
> }
> ```

Component constructors define config keys, types and defaults. No static definitions that would need maintaining, but documentation autogenerated from constructors.

### Endpoint usage

A config file is necessary to specify a graph of instances.

The config file parameter is specified as a string path, either absolute or relative. Current name: config-file. The config file can be specified as a CLI option (`--<config-file>`) or envvar (`SWH_<CONFIG_FILE>`).

A reference to an instance can be specified inline (ref syntax), as a CLI option (`--<key>-instance`), or as an envvar (`SWH_<key>_INSTANCE`). Other keys can be specified as a CLI option (`--<key>`) or envvar (`SWH_<key>`).

> [name=ardumont] It'd be good then to specify the merge policy now...

### Library API

Example: `core.config.get_component_from_config(config=cfg_contents, type="storage", id="uffizi")`

> [name=ardumont] What's `cfg_contents` in the sample, a global unified configuration of all the config combination?
> I'd be interested in a concrete code sample of the instantiation of a module with that code, to clarify a bit ;)

### Algorithms

To be defined. Parsing (YAML library), config resolving (reference processing), component resolving, instantiating. Restricting to N levels eases implementation.

---

## Example configurations and usages

Note that the actual names of keys are completely up to bikeshedding; specifically, dashes versus underscores versus dots is completely up in the air.

### Shared configuration for command line tools

#### Configuration file

`~/.config/swh/default.yml` (default configuration path for user-facing cli tools)

```yaml=
web-client:
  default:  # FIXME: single implementation => do we need a cls?
    base-url: https://archive.softwareheritage.org/api/1/
    token: foo-bar-baz
  docker:
    base-url: http://localhost:5080/api/1/
    token: test-token

scanner:
  default:  # FIXME: single implementation => do we need a cls?
    web-client: <web-client.default>
    scanner-param: foo
  docker:
    web-client: <web-client.docker>
    scanner-param: bar

fuse:
  default:  # FIXME: single implementation => do we need a cls?
    web-client: <web-client.default>
```

#### Command-line calls

* scan the current directory against the archive in docker
  * (cli flag) `swh scanner --scanner-instance=docker scan .`
  * (equivalent env var) `SWH_SCANNER_INSTANCE=docker swh scanner scan .`
  * actual python calls in the cli endpoint

```python=
scanner_instance = "docker"  # from cli flag or envvar
config_path = "~/.config/swh/default.yml"  # from defaults of the cli endpoint

# basic yaml parse, no interpolation of recursive entries
config_dict = swh.config.read_config(config_path)

scanner = swh.config.get_component_from_config(config_dict, type="scanner", instance=scanner_instance)
# this would get `config_dict["scanner"]["docker"]`, then notice that one of
# the values has the special syntax "<web-client.docker>".
# The entry would be replaced by a call to:
#   web_client_instance = swh.config.get_component_from_config(config_dict, type="web-client", instance="docker")
# Finally, the scanner would be instantiated with:
#   get_scanner(web_client=web_client_instance, scanner_param="bar")

scanner.scan(".")
```

* mount a fuse filesystem
  * `swh fs mount ~/foo swh:1:rev:bar`
  * actual python calls in the cli endpoint

```python=
fuse_instance = "default"  # default value of cli flag / envvar
config_path = "~/.config/swh/default.yml"  # from defaults of the cli endpoint

# basic yaml parse, no interpolation of recursive entries
config_dict = swh.config.read_config(config_path)

fuse = swh.config.get_component_from_config(config_dict, type="fuse", instance=fuse_instance)
# this would get `config_dict["fuse"][fuse_instance]`, then notice that one of
# the values has the special syntax "<web-client.default>".
# The entry would be replaced by a call to:
#   web_client_instance = swh.config.get_component_from_config(config_dict, type="web-client", instance="default")
# Finally, the fuse component would be instantiated with:
#   get_fuse(web_client=web_client_instance)

fuse.mount("~/foo", "swh:1:rev:bar")
```

### objstorage replayer

#### Config file

```yaml
objstorage:
  local:
    cls: pathslicing
    root: /srv/softwareheritage/objects
    slicing: "0:2/2:5"
  s3:
    cls: s3
    s3-param: foo

journal-client:
  default:  # single implem: no cls needed
    brokers:
      - kafka
    prefix: swh.journal.objects
    client-param: blablabla
  docker:
    brokers:
      - kafka.swh-dev.docker
    ...
# needed for the second cli usecase
objstorage-replayer:
  default:
    src: <objstorage.local>
    dst: <objstorage.s3>
    journal-client: <journal-client.default>
```

#### Cli call

Default behavior (single call to `swh.core.config.get_component_from_config(config, 'objstorage-replayer', instance_name)`):

* `swh objstorage replayer --from-instance default`
* `swh objstorage replayer` (uses config from the default instance)

Nice to have (multiple, manual, calls to `get_component_from_config` in the cli entry point):

* `swh objstorage replayer --src local --dst s3`
* `swh objstorage replayer --src local --dst s3 --journal-client docker`

<!-- -->

* @douardda's proposal: on-the-fly generation of instance config via syntactic sugar
  `swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3`
* @tenma's proposal: dynamic handling of cli options wrt schema
  `swh objstorage replayer --objstorage-replayer.src=objstorage.local --objstorage-replayer.journal-client=journal-client.default`
  - generic, too verbose, arbitrary attribute setting not needed

### Single-task celery worker from systemd

#### Configuration file

`/etc/softwareheritage/loader_git.yml`

```yaml=
# component configuration
storage:
  default:
    cls: pipeline
    steps:
      - cls: buffer
        min_batch_size:
          content: 1000
          content_bytes: 52428800
          directory: 1000
          revision: 1000
          release: 1000
      - cls: filter
      - cls: remote
        url: http://uffizi.internal.softwareheritage.org:5002/

# other proposal:
storage:
  default:
    cls: pipeline
    steps:
      - cls: buffer
        min_batch_size:
          content: 1000
          content_bytes: 52428800
          directory: 1000
          revision: 1000
          release: 1000
      - cls: filter
      - <storage.uffizi>
  uffizi:
    cls: remote
    url: http://uffizi.internal.softwareheritage.org:5002/

# impossible currently:
storage:
  filter:
    cls: filter  # missing storage: argument
  default:
    cls: pipeline
    steps:
      - cls: buffer
        min_batch_size:
          content: 1000
          content_bytes: 52428800
          directory: 1000
          revision: 1000
          release: 1000
      - <storage.filter>  # get_storage(cls="filter") => fail
        # ? OR {cls: filter, storage: <storage.uffizi>}
      - <storage.uffizi>
  uffizi:
    cls: remote
    url: http://uffizi.internal.softwareheritage.org:5002/

loader-git:
  default:
    cls: remote
    storage: <storage.default>
    max_content_size: 104857600
    save_data: false
    save_data_path: "/srv/storage/space/data/sharded_packfiles"

# Not a swh component
# no type id, only instance id
celery:
  task_broker: amqp://...
  task_queues:
    - swh.loader.git.tasks.UpdateGitRepository
    - swh.loader.git.tasks.LoadDiskGitRepository
    - swh.loader.git.tasks.UncompressAndLoadDiskGitRepository
```

#### (expanded) systemd unit `/etc/systemd/swh-worker@loader_git.service`

```ini
[Unit]
Description=Software Heritage Worker (loader_git)
After=network.target

[Service]
User=swhworker
Group=swhworker
Type=simple

# Celery
Environment=CONCURRENCY=6
Environment=MAX_TASKS_PER_CHILD=100

# Logging
Environment=SWH_LOG_TARGET=journal
Environment=LOGLEVEL=info

# Sentry
Environment=SWH_SENTRY_DSN=https://...
Environment=SWH_SENTRY_ENVIRONMENT=production
Environment=SWH_MAIN_PACKAGE=swh.loader.git

# Config
Environment=SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml

ExecStart=/usr/bin/python3 -m celery worker -n loader_git@${CELERY_HOSTNAME} --app=swh.scheduler.celery_backend.config.app --pool=prefork --events --concurrency=${CONCURRENCY} --maxtasksperchild=${MAX_TASKS_PER_CHILD} -Ofair --loglevel=${LOGLEVEL} --without-gossip --without-mingle --without-heartbeat
KillMode=process
KillSignal=SIGTERM
TimeoutStopSec=15m
OOMPolicy=kill
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

#### Instantiation flow

* The celery cli loads the "app" object set in the cli: `swh.scheduler.celery_backend.config.app`
* This module loads the configuration file set in `SWH_CONFIG_FILENAME` into a dict (plausibly a singleton)
* This module loads celery task modules from the swh.workers entrypoint
* This module initializes the celery broker and queues from the `celery` key of the config dict
* celery task code:

```python=
@shared_task(name="foo.bar")
def load_git(url):
    config_dict = swh.core.config.load_from_envvar()
    loader = swh.core.config.get_component_from_config(
        config=config_dict,
        type="loader-git",
        id=os.environ.get("SWH_LOADER_GIT_INSTANCE", "default"),
    )
    return loader.load(url=url)
```

---

## Alternatives to the proposed structure of package/cls in config

> [name=tenma] The instance aspect is both powerful and easy to understand. We need to discuss the type and argument aspects, which are inconsistent (needed for some objects but not others).

`package` and `cls` are type identifiers. Keys are parameters; they can be defined as anything that is neither a type nor an instance.

### Drop the first level and include it in the type

```
uffizi:
  cls: storage.remote
  url: ...
local:
  cls: storage.pgsql
  conninfo: ...
  objstorage: <foo>
ingestion:
  cls: storage.pipeline
  steps:
    - cls: filter
    - cls: buffer
    - <uffizi>
foo:
  cls: objstorage.pathslicing
```

More generic, does not impose grouping of components of the same "package"; instance names may then need to be more descriptive.

### Do not restrict to swh service components

Q: do we only define instances of swh service components in the config file?

Using the more generic notion of role/namespace vs the current specific notion of package offers the flexibility of having names that do not refer to an swh component but to any datastructure (with `type` having to be fully qualified, e.g. `model.model.Origin`), and of being shared by multiple instances (it would represent a datastructure that does not belong specifically to a given swh service package). It would need a register of usable datastructures, like the other proposals.

For example, for core/model/graph/tools/other_library datastructures, do we want to be able to specify something along the lines of:

```
model:
  orig1:
    cls: Origin
    url: ...
  swhid1:
    cls: SWHID
    swhid: ...
  node1:
    cls: MerkleNode
    data: ...
somelib:
  datastruct1:
    cls:
```

---

## Key points of the specification

- configuration declaration syntax
- relations between definitions
- external API (CLI options, environment variables)
- internal API (core.config library)
- instantiation of components and configuration loading in entrypoints
- precise scope and impacts

---

# Synthesis of the meeting 2020-10-21

Participants: @tenma, @douardda, @olasd, @ardumont
Reporter: @tenma

Q = question
R = remark
OOS = Out of scope

Dates indicate the chronology of the report.

***Before starting to report the concepts tackled through the meeting, some points about terminology.
The terminology was difficult to choose while writing this synthesis, so it is not completely consistent. This section tries to give a basis for discussion.

## Terminology

### Initial remarks (2020-10-22)

The term `type identifier` (`TID` for short) will be used in place of `package` from now on for the 1st-level names. `package` represented both the SWH Python package name and the base type of SWH components available in this package, in my initial, partial view of the subject. Now that it has been shown that there is no 1-to-1 mapping between SWH component names and SWH packages, the name `package` is no longer accurate. The more generic `type identifier` reflects the flexibility we introduce with a top-level register of component types referenceable in configuration.

The 2nd level maps an `instance identifier` (`IID`) to an instance definition mapping.

`TID`s map to actual objects of some type, whereas `IID`s exist only in the configuration system.

### Terminology discussion preparation (2020-10-30)

We need to choose a name for every concept of the system.

- configuration language:
  - config tree: tree containing all the definitions of a config file
  - config object: any level of the config tree
  - config dictionary: collection of items, under any identifier
  - item/attribute = key + value
  - identifier to type, used at depth level 1, e.g. "storage"
  - identifier to instance, used at depth level max-1; e.g. "uffizi", "celery"
  - instance object
  - singleton object
  - reference object
- programming constructs manipulated through this system:
  - objects = components|singletons
  - swh components (unit comprising config items)
  - swh services (unit comprising components)
  - external components

***Here really starts the report, written as of 2020-10-22. Examples written/updated during the meeting were not copied here.

## Single implementations of a type

The `cls` attribute of instances specifies the implementation/alternative/flavour to use. It used to be required along with `args`, because all configurable SWH components were polymorphic. Components that have no such feature need no `cls` attribute in their configuration.

R: An indirection layer such as a factory may be defined for such components in order to keep consistency and allow polymorphism if needed later. Alternatively, for polymorphic components, better than a factory which needs to be known by user code, an abstract base class constructor would abstract this indirection layer away.

## Default instances

One instance in a package/namespace can be labeled as `default`. It will be selected when an instance of a given component type is requested but no IID is given.

Q: could the default instance be implied when instantiating with no IID?

## Singleton object definitions

For the sakes of both clarity and reuse, ad-hoc configuration objects can be defined at the top level and be referenced. These are not SWH or external components. Those definitions are composed of an identifier (equivalent to an IID) at the 1st level and a YAML object, possibly recursive, at further levels. They are instantiated as schemaless dictionaries/lists.

Q: How to allow them in the syntax and differentiate them from schemaful definitions?

Q: must such a singleton object be referenced in at least one place in the file it is part of, or else be ignored?

## Top-level component register

A register of the components allowed in configurations is to be implemented in core.config. It will consist of a `(TID : qualified_constructor)` mapping.
These entries will not be hardcoded in this mapping, but registered at import time from the package that defines the components.

`qualified_constructor` must be Python absolute import syntax for the callable, which is either a factory function or a class object. It may be in quoted form (string) or actual object form. The string form avoids the import, but no static check of existence can be performed. Given that registering is done in the package responsible for defining the component, the object form is chosen.

Components that can be registered may be any SWH service component which is public (= has a Python object API). In practice, only one main component per SWH service encapsulates all configuration for the components used in the service:

- API servers
- service workers

Q: use object or quoted form for `qualified_constructor`?

Q: what about these config objects: journal-related like journal-writer, most under deposit, celery-related...

## CLI parametrization of configuration loading

A CLI option may be passed to specify an instance ID (only at the 2nd level?) when several alternatives are provided in the configuration. Such an option must be declared statically in CLI code.

A CLI option may be passed to override an attribute in the configuration. Such an option must be declared statically in CLI code.

OOS: extend usage to any IID in the instance mapping
OOS: dynamic handling of any such options for any attribute

### Example propositions

Default behavior (single, manual, component instantiation):

* `swh objstorage replayer --from-instance default`
* `swh objstorage replayer` (uses config from the default instance)

Nice to have (multiple, manual, component instantiations):

* `swh objstorage replayer --src local --dst s3`
* `swh objstorage replayer --src local --dst s3 --journal-client docker`

<!-- -->

* @douardda's proposal: on-the-fly generation of instance config via syntactic sugar
  `swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3`

## Library API (proposition, 2020-10-23)

Moved to the specification below.

## Emerging problems (2020-10-23)

Q: How to allow and handle both typed and untyped (ad-hoc) objects?

Q: how to identify what is an instance in the definitions? Will it require special handling wrt referencing mechanics (source and destination)?

Q: do we want to support references to external config definitions, or require autonomy? = always 1 file for a whole service, or composition of partial definitions? In the former case, similar sections in multiple config files will need synchronized updates. One possibility is having each SWH component keep its own config file and compose them on demand (using puppet?) into a standalone file for prod/tests. But it is not easy to have multiple instances this way. Fill a template file on demand for each service?

---

# Specification outline

## Synopsis
## General terminology/concepts
## Scope/use cases
## Rationale: existing, limitations, wanted
## Specific terminology/concepts
## Language description
## Library
## Client code
## Environment
## Out-of-scope, rejected ideas
## Limitations
## Impacts
## Implementation plan: library, use, tests, prod

---

# Specification 2020-11-02

(Writing sections breadth-first: deeper at each iteration)

Notations:

- [opt=id]: concept subject to acceptance or removal, with an identifier for easier reference. Usage similar to a feature flag.
## Library API (proposition, 2020-10-23)

Moved to the specification below.

## Emerging problems (2020-10-23)

Q: How to allow and handle both typed and untyped (ad-hoc) objects?

Q: How to identify what is an instance in the definitions? Will it require special handling wrt referencing mechanics (source and destination)?

Q: Do we want to support references to external config definitions, or require autonomy? That is: always 1 file for a whole service, or composition of partial definitions? In the former case, similar sections in multiple config files will need synchronized updates. One possibility is for each SWH component to have its own config file and to compose them on demand (using puppet?) into a standalone file for prod/tests. But it is not easy to have multiple instances this way. Fill a template file on demand for each service?

---

# Specification outline

## Synopsis
## General terminology/concepts
## Scope/use cases
## Rationale: existing, limitations, wanted
## Specific terminology/concepts
## Language description
## Library
## Client code
## Environment
## Out-of-scope, rejected ideas
## Limitations
## Impacts
## Implementation plan: library, use, tests, prod

---

# Specification 2020-11-02

(Writing sections breadth-first: deeper at each iteration)

Notations:

- [opt=id]: concept subject to acceptance or removal, with an identifier for easier reference. Usage similar to a feature flag.
- [alt]: alternative to any surrounding [alt] statement
- [rem]: remark
- [OOS]: out-of-scope remark or idea
- [rej]: rejected remark or idea
- [Q]: question to be answered

## Synopsis

The configuration system evolved partially with use cases. Initial design decisions applied to all use cases turned out to be both too hard to reason about/unstable for production and too inflexible for CLI or testing.

## General terminology and concepts

For the purpose of this specification:

- Component: a unit comprising data and/or functions, which provides functionality through an interface and has associated dependencies.
- SWH component: a component consisting of a Python class or module, that provides a functionality specific to one or more SWH services. The closed set of SWH components can appear in configuration definitions.
- SWH service: a collection of SWH components. Corresponds roughly to the docker services developed by SWH. Includes API, worker and journal services.

## Scope/use cases

All SWH services and components.

Environments:
- Production service
- CLI
- Testing

Configuration needs:
- system service: systemd service vs docker + shell script
- server: gunicorn vs Flask/aiohttp/Django devel server
- CLI entrypoint
- server application: app_from_configfile vs django config
- worker:
- component: constructor/factory

Configuration sources:
- environment: CLI parameter, CLI path, envvar parameter, envvar path, input stream
- code: literal

## Rationale

A.
- implicit/hard-to-follow loading: configuration may be loaded automatically through a number of ways, good for CLI cases but not for prod cases
- dependency on environment: one must be able to instantiate a component using only ad-hoc configuration, for testing purposes

-> different APIs for different use cases, all compatible (return config)

B.
- composition coupling: every owner component must know how to instantiate an owned component
- heterogeneity: configuration loading (CLI) or instantiating (component factory) is implemented differently everywhere; it could be abstracted away

-> dependency injection, component library API

C.
- it should be possible to specify alternative configurations for one component ahead of time, and choose one at runtime/loadtime
- it should be possible to factor common configuration out
- uniform, complete, concise: the configuration could theoretically be centralized in one file, which would give a clear overview of the configuration and the interactions between all the components

-> instances, references, singletons

## Specific terminology and concepts of the proposition

Basis for discussing terms: [terminology proposition](https://hackmd.io/8hxTL4XMQoO2RVKtqFqM2g?both#Terminology-discussion-preparation-2020-10-30)

Used in this specification:

- TID = type ID
- IID = instance ID
- AID = attribute ID
- QID = qualified ID
- ID = any of the above
- ad-hoc object = singleton
- "attribute ID" = path to the "key" of an attribute
## Language description

### Target example

```yaml=
storage:
  default:
    cls: buffer
    min_batch_size:
      content: 1000
    storage: <storage.filtered-uffizi>
  filtered-uffizi:
    cls: filter
    storage: <storage.uffizi>
  uffizi:
    cls: remote
    url: http://uffizi.internal.softwareheritage.org:5002/

loader-git:
  default:
    cls: remote
    storage: <storage.default>
    save_data_path: "/srv/storage/space/data/sharded_packfiles"

random-component:
  default:
    cls: foo
    celery: *celery

# Not a component: no type id, only instance id
_:
  celery: &celery
    task_broker: amqp://...
    task_queues:
      - swh.loader.git.tasks.UpdateGitRepository
      - swh.loader.git.tasks.LoadDiskGitRepository
```

### Syntactic overview

Based on YAML:
- restricted to YAML primitive types (includes dicts and lists)
- restricted on document structure (see grammar below)
- replaces the YAML reference system with ours (if not hookable)

3 levels of depth: type, instance, attribute.

Instance definitions are composed of an ID and a mapping to attributes. 1 instance <-> N attributes.

Component type definitions are composed of an ID and a mapping to instances. 1 type <-> N instances. These instances are variants of the component: same type but different constructions.

Singletons are instances defined outside type definitions, so they live at top level and have no type.

:::warning
This model is syntactically complicated; here are alternatives to make it regular:

[alt=typePrefix] no type level, so only 2 levels; type and instance identifiers are merged as "type.instance".
[alt=typeAttr] move type identifiers to the attribute level, as a special attribute "type"
[alt=singletonType] use a dummy type for singletons: "singletons"
:::

References can be made to an object defined somewhere else in the tree, using a qualified identifier. Legal forms are defined to be from an attribute value to an instance identifier.

[opt=refkey] Legal forms also include from an attribute value to an attribute identifier. OOS

[opt=recattr] There may be recursion from an attribute value to an instance value definition. This allows anonymous definition of an instance object in an attribute. OOS

### Grammar

WARNING: hopefully consistent grammar mixup. May be offending to purists.

Some definitions have alternatives noted with `|=`.

```python=
ID ~= PCRE([A-Za-z0-9_-]+)  # Could be stricter, e.g. snake_case
ID = TID | IID | AID
QID = (TID ".")? IID ("." AID)*  # opt: refkey, skey
   |= TID "." IID
ref = "<" QID ">"
attribute_value = YAML_object | ref
               |= YAML_object | ref | attributes  # opt: reckey
attributes = YAML_dict(AID, attribute_value)
instances = YAML_dict(IID, attributes)
singleton = YAML_object  # no opt: sref, skey
         |= YAML_dict(AID, attribute_value)
config_tree = YAML_dict(ID, YAML_dict)  # loose typing
           |= YAML_dict((TID, instances) | (IID, singleton))
```

Alternative definition of identifiers (always qualified):

```python=
ID ~= /[A-Za-z0-9_-]+/
TID = ID
IID = (TID ".")? ID
AID = IID "." ID+
QID = TID | IID | AID
```
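As a sanity check of the grammar and the reference form above, here is a sketch that parses `<QID>` reference sources and replaces them by their targets in a config tree; `REF_RE` and `resolve` are hypothetical helpers, not part of the proposal:

```python=
import re
import yaml

# Matches a reference such as <storage.uffizi> (dot-joined QID form).
REF_RE = re.compile(r"^<([A-Za-z0-9_-]+(?:\.[A-Za-z0-9_-]+)*)>$")

def resolve(tree, node):
    # Recursively replace reference sources with their target objects
    # (no cycle detection in this sketch).
    if isinstance(node, str):
        m = REF_RE.match(node)
        if m:
            target = tree
            for part in m.group(1).split("."):
                target = target[part]
            return resolve(tree, target)
        return node
    if isinstance(node, dict):
        return {k: resolve(tree, v) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve(tree, v) for v in node]
    return node

tree = yaml.safe_load("""
storage:
  uffizi:
    cls: remote
    url: http://uffizi.internal.softwareheritage.org:5002/
  filtered-uffizi:
    cls: filter
    storage: <storage.uffizi>
""")
print(resolve(tree, tree)["storage"]["filtered-uffizi"]["storage"]["cls"])  # remote
```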
### Identifier

Identifier is abbreviated ID. Type ID is abbreviated TID. Singleton ID is equivalent to Instance ID, abbreviated IID. An (attribute) key is identified by an Attribute ID, abbreviated AID. Qualified ID is abbreviated QID.

A QID is a sequence of IDs of the form (TID, IID) for component instances or (IID) for singletons. Its string form joins each field with ".".

[opt=refkey] May have form (TID, IID, AID) to reference component instance attributes. Useful either to reference another attribute of the current instance, or any other attribute, except those defined in singletons.

[opt=reckey] May have form (TID, IID, AID*) to reference recursive component instance attributes.

[opt=skey] May have form (IID, AID*) to reference singleton attributes (a sequence of AIDs because recursive).

[opt=sref] Singleton attributes may reference any attribute.

### Attribute

An attribute is a (key, value) pair whose set forms an instance dictionary. An attribute value is either a YAML object or a reference.

[opt=recattr] An attribute value may also be an instance dictionary.

The attribute level is any level under the instance level, recursive or not.

### Reference

A reference is syntactically defined as a qualified identifier enclosed in chevrons. Its source is an attribute value and its target is the object identified by the QID it owns. The reference is deleted when it is resolved by the reference resolution routine.

### Type

The Python type of a component to be instantiated and configured. It is referred to indirectly through a TID in a configuration definition, and through a component constructor in the component type register.

### Instance

A specific instantiation of a component, distinguished from the others by the set of attributes used to initialize it. All identified instances of a type must be specified at the instance level of a configuration definition.

[opt=deftinst] An instance identified as "default" is instantiated if a TID but no IID is provided to the instantiation routine.

[opt=subinst] Instances may be referenced in an attribute value, and be recognized as instance declarations; i.e. be instantiated and initialized and not just passed as-is to the constructor.

[opt=anoninst] Anonymous instances may be defined in an attribute value, and be recognized as instance declarations.
-> [Q] That would need a type declaration, which is yet to be handled in the spec, to identify it as an instance and be able to instantiate it. One inflexible option is to have the parent AID, being the child IID, be a TID, thus restricting its name to a known type ID. The other option is specifying the TID in a dedicated child attribute with a name similar to `type` or `TID`.

### Singleton

Singleton objects are syntactically similar to instances. Unless otherwise stated, the same rules apply. They do not correspond to a predefined type, so they have no schema or attached semantics. They are instantiated as a dict tree.

## Library

### Register

The component type register, abbreviated register, is a `(TID, qualified_constructor)` mapping, defined in the configuration library. It is used by the component resolution routine to resolve type identifiers to Python type constructors.

Entries in this mapping are to be registered through the component registration library routine. This registration may happen anywhere provided it is executed at loading/import time. It is advised to register the component in the package that defines it.

`qualified_constructor` must be Python absolute import syntax for the object-creating callable, which is either a factory function or a class. It may be defined:

[alt] in quoted form (string).
[alt] in class object form.

[rem] The string form avoids an import, but no static check of existence can be performed. If the registering is done in the package responsible for defining the component, the object form is preferable.

Components that can be registered may be any SWH service component, SWH support component or external component which is public (= has a Python object API). In practice, only one main component per SWH service encapsulates all configuration for the components used in the service:
- API servers
- service workers

[Q] What about these config objects: journal-related like journal-writer, most under deposit, celery-related...
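A minimal sketch of the register and registration routine described above; the names (`REGISTER`, `register_component`) and the example component are assumptions, not the actual core.config API:

```python=
from typing import Callable, Dict

# (TID -> qualified_constructor) mapping, living in the config library.
REGISTER: Dict[str, Callable] = {}

def register_component(tid: str, constructor: Callable) -> None:
    # Object form: the constructor is passed as a Python object, so its
    # existence is checked statically when the defining package is imported.
    if tid in REGISTER:
        raise ValueError(f"component type {tid!r} already registered")
    REGISTER[tid] = constructor

# In the package defining the component, executed at import time, e.g.:
class GitLoader:  # hypothetical component
    def __init__(self, storage, save_data_path=None):
        self.storage = storage
        self.save_data_path = save_data_path

register_component("loader-git", GitLoader)
```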
### Type implementations

This section is informational.

A component type may have multiple implementations. There is no specific support for this in the system, but as the concept may appear in configuration, related considerations are worth noting.

[rem] The component type of an instance may be abstract, in which case a concrete type must be determined by the component constructor.

A specific attribute of instances specifies the implementation or flavour to use. It is commonly identified as `cls`, but could be `impl` or `flavor`. It used to be required along with `args`, which is now deprecated, because all configurable SWH components were polymorphic. Components that have no such feature need no `cls` attribute in their configuration. Alternatively, polymorphic components may be instantiated without `cls`, in which case a default implementation will be used.

[rem] An indirection layer such as a factory may be defined for monomorphic components in order to keep consistency and allow polymorphism if needed later. Alternatively, for all components, better than a factory which is not derivable from the component type by user code, an abstract base class constructor would abstract this indirection layer away.

### Instantiation

Instantiating is the process through which a concrete object is constructed from a model and data describing its (initial) state. In the context of this system, a Python object is created by calling its constructor with the set of attributes associated to a particular instance in a configuration definition.

The input is a QID identifying an instance and a configuration tree (dictionary) containing the instance and its dependencies (reference targets). The output is a component instance. The process is composed of the following steps, in order:

1. The instance dictionary, containing an attribute set, is fetched by QID from the configuration.
2. Resolve references.
    1. Identify all reference definitions in the instance dictionary.
    2. Resolve each to the reference target object, which may be atomic or composed.
    3. Replace the reference source by the resolved object.
3. [opt=subinst] Interpret and compose instances.
    1. Identify all component instance definitions in the instance dictionary.
    2. [opt=anoninst] Identify anonymous instance definitions.
    3. Instantiate each instance.
    4. Replace each definition by the instantiated object.
4. The component type of the instance is resolved, from the TID contained in the QID, to a component constructor.
5. The component constructor is called, passing the updated instance dictionary as arguments.

[opt=subinst] Identifying instance definitions requires a TID/IID. [Q] As parent (`ID: {instance attrs}`) or child (`{ID: ..., instance attrs}`)?

[opt=deftinst] An instance identified as "default" is instantiated if a TID but no IID is provided to the instantiation routine.

Instances must be instantiated only once and used at each reference source.
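The steps above could look like the following sketch, reusing the hypothetical `resolve` and `REGISTER` from the earlier sketches; step 3 ([opt=subinst]) is omitted for brevity:

```python=
def create_component(tree, tid, iid=None):
    # Step 1: fetch the instance dictionary by QID
    # ([opt=deftinst]: fall back to "default" when no IID is given).
    # Step 2: resolve references with the hypothetical resolve() helper.
    instance = resolve(tree, tree[tid][iid or "default"])
    # Step 4: resolve the TID to a component constructor via the register.
    constructor = REGISTER[tid]
    # Step 5: call the constructor with the instance attributes
    # (here `cls` is simply dropped; dispatch on `cls` is discussed
    # in "Type implementations" above).
    return constructor(**{k: v for k, v in instance.items() if k != "cls"})
```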
### Interpretation (Validation/Conversion/Interpretation)

This section is informational.

Interpretation of attributes beyond what is stated above is out of scope and left to the component constructors. Standard Python typing available in constructors may be used as the basis for the validation of configuration data. Validity of structure, value and existence may be checked. Conversions may also be performed.

[opt=validate] The library provides generic validation primitives and a validation routine based on a data model specification object.

### Loading (Loading/Defaults/Merging)

Loading is the process of fetching data from a storage medium into a memory space easily accessible to the processing system. In the context of this system, this data is then read and converted into a Python object.

The loading source may be: an I/O file abstraction (whatever its backing source), an operating system path to such a file abstraction, or such a path resolvable from an environment variable or a configuration file ID. Only a Python dictionary is accepted as the holder of this data once loaded.

A default configuration definition, either as a dictionary literal or a loaded configuration, can be specified, in which case every attribute absent from the loaded configuration will be set from the default one.

### API overview

The library should be imported as `config` everywhere, for clarity and uniformity (e.g. `import swh.core.config as config` or `from swh.core import config`).

[rem] The existing routine `merge_configs` should be moved to another module as `merge_dicts`.

Configuration object: mapping

WARNING: In the following examples, names are subject to change. Code is inspired by Python, but abstracted to focus on typing. `DeriveType` simply denotes a type derived from an existing one, with no consideration of compatibility with the base type or anything else.

### Loading API

[rem] Should choose a term among `load`, `read`, `from`, `by`, `config`. Example names: `read_config`, `load_envvar`.

```python=
Config = DeriveType(Mapping)  # Tree. Allow only mapping at config definition top level
ConfigFileID = DeriveType(str)  # opt=fileid
Envvar = DeriveType(str)
File = io.IOBase
Path = os.PathLike

load: (Union[File, Path, Envvar, ConfigFileID], defaults: Config?) -> (Config)
load_from_file: (File, defaults: Config?) -> (Config)
load_from_path: (Path, defaults: Config?) -> (Config)
load_from_envvar: (Envvar, defaults: Config?) -> (Config)
load_from_name: (ConfigFileID, defaults: Config?) -> (Config)  # opt=fileid
```

no default configs

Loads as a YAML tree and converts it to a Python recursive mapping.

[opt=fileid] May use an ID to reference files independently of their path or extension in the loading mechanism. This is sugar that existed but may be no longer wanted. OOS

[Q] Where to check for a loadable path? In the loading routines or in user code? May duplicate behavior.

[Q] Should the envvar be hardcoded in the library or be a default? Same for the default path.
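A minimal sketch of the loading routines under the signatures above; the envvar name follows the environment section's `SWH_CONFIG_<PATH_PART>` convention, and the shallow defaults handling is an illustrative simplification (the meeting synthesis below drops defaults merging entirely):

```python=
import os
import yaml

def load_from_path(path, defaults=None):
    # Load a YAML file into a recursive Python mapping, filling in
    # top-level attributes absent from the loaded configuration.
    with open(path) as f:
        config = yaml.safe_load(f) or {}
    if defaults:
        for key, value in defaults.items():
            config.setdefault(key, value)  # shallow; spec implies per-attribute merge
    return config

def load_from_envvar(envvar="SWH_CONFIG_PATH", defaults=None):
    # Resolve the path from an environment variable, then load it.
    return load_from_path(os.environ[envvar], defaults)
```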
### Instantiation API

OOS: every function but create component.

[rem] Should choose amongst `get`, `read`, `from_config`, `instantiate`, `component`, `instance`, `iid`. Example names: `get_component_from_config`, `instantiate_from_config`, `create_component`, `read_instance`, `get_from_id`.

```python=
TID = DeriveType(str)
IID = DeriveType(str)
QID = (TID, IID)  # Simplified form, may be Sequence(ID)
Component = DeriveType(type)
ComponentConstructor = DeriveType(Callable)  # Either type or function
InstanceConfig = DeriveType(Config)
```

```python=
create_component: (Config, QID) -> (Component)
create_component: (InstanceConfig, TID) -> (Component)
```

Returns an instantiated component identified by QID. Uses `get_obj`, `resolve_references`, `resolve_component`, `instantiate_component`.

```python=
get_obj: (Config, QID) -> (Config)
get_instance: (Config, QID) -> (InstanceConfig)
```

Returns a config object (subtree) of the config identified by the config ID. May be used for getting either all instances or one instance, depending on whether the config ID has an instance ID part.

```python=
resolve_references: (Config) -> (Config)
resolve_reference: (Config, QID) -> (Config)
```

Replaces each reference source with the object identified by the reference target.

```python=
find_instances: (InstanceConfig) -> (Set(InstanceConfig))
```

[opt=subinst] Finds all instance definitions nested in this instance definition and returns them.

[Q] How to identify instances?
[Q] Should it recurse into nested definitions or just one level?

```python=
resolve_component: (TID) -> (ComponentConstructor)
```

Looks up the core.config register to get the constructor from the TID.

```python=
instantiate_component: (InstanceConfig, ComponentConstructor) -> (Component)
```

Instantiates a component using the given constructor and an instance configuration mapping.

### Validation API

[opt=validate] This section proposes a framework for validating instance definitions in a fairly lightweight and flexible way, for use by component constructors or injectors.

```python=
check: (Config) -> (Boolean)
check_definitions: (Config) -> (Boolean)
check_component: (InstanceConfig, ModelSpec) -> (Boolean)
generate_spec_from_signature: (ComponentConstructor) -> (ModelSpec)
```

`check`: validates both language and instances.
`check_definitions`: validates the whole definition against the language spec.
`check_component`: validates an instance definition against a component spec. This is a template function which is parametrized by a user-specified spec.

#### Model specification

```python=
AttrKey ~= String("[A-Za-z0-9_\-]+")
AttrVal = YAML_object

# Path in the instance configuration dictionary
Path ~= String("([A-Za-z0-9_\-]+/)+")

# Wrapper to convert falsey values or exceptions to False, otherwise True
ensure_boolean: Booleanish -> Boolean

# Generic and context-sensitive signatures for flexibility
value_check: ((AttrVal) | (AttrVal, InstanceConfig)) -> Booleanish

# If not optional, the existence check must succeed; otherwise it is not performed.
optional_check: ((AttrVal, InstanceConfig) -> Booleanish) | Booleanish

# Checks whether attr exists at one of the given paths, or anywhere if no path.
# No reason to have the user customise the existence check.
existence_check: (AttrVal, Set(Path), InstanceConfig) -> Boolean

# Here is the model specification.
# Kwargs: best I found for a typed mapping where every item is optional
AttrProperties = Kwargs(value_check, optional_check, Set(Path))

# None for no checks on attribute
ModelSpec = Mapping(AttrKey, AttrProperties | None)
```

`check_component` verifies that all properties of every attribute hold in the instance definition, based on a user-defined model specification. The model specification can leverage primitive check functions and user-defined check functions. The supported checks are value checks and existence-in-tree-structure checks, which are distinguished for expressiveness.

The model specification lists each (unqualified) attribute that may exist in the configuration definition, along with the attribute properties that must hold.

An attribute may or may not be optional, i.e. validation should or should not fail on absence, based on the boolean value of the `optional_check`. `optional_check` may be a callable that determines whether the attribute is optional based on the configuration context and returns a booleanish value, or may itself be a booleanish value. It is run in a wrapper which converts falsey values or exceptions to `False`, and anything else to `True`. A required attribute is checked for existence, based on a set of paths in the tree if any, or for existence anywhere in the tree. An optional attribute is not checked for existence, but still for a legal value.

The value check may be any callable that accepts either a single value, or a value and the configuration context (instance definition), and returns a booleanish value, handled as above. This makes it possible to use many existing functions or object constructors to do the validation, e.g. `int`, `re.match`, `isinstance(Protocol)` or a function verifying that a relation to another attribute in the definition holds.
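A sketch of how a model specification and `check_component` might fit together; the layout follows the proposal above, but the component spec, the default behaviours and the simplifications (no path handling, single-argument value checks) are assumptions:

```python=
# Hypothetical model spec for a remote-objstorage-like instance definition.
def is_url(value):
    return isinstance(value, str) and value.startswith(("http://", "https://"))

MODEL_SPEC = {
    "cls": {"value_check": lambda v: v in ("remote", "memory")},
    "url": {"value_check": is_url,
            # Only required for the remote implementation.
            "optional_check": lambda v, cfg: cfg.get("cls") != "remote"},
}

def check_component(instance_config, model_spec):
    # Simplified: checks only presence and value, not paths in the tree.
    for key, props in model_spec.items():
        props = props or {}
        optional = props.get("optional_check", True)
        if callable(optional):
            # ensure_boolean-style wrapper: falsey or raising means False.
            try:
                optional = bool(optional(instance_config.get(key), instance_config))
            except Exception:
                optional = False
        if key not in instance_config:
            if not optional:
                return False
            continue
        value_check = props.get("value_check")
        if value_check:
            try:
                if not value_check(instance_config[key]):
                    return False
            except Exception:
                return False
    return True

assert check_component({"cls": "remote", "url": "http://example.org"}, MODEL_SPEC)
assert not check_component({"cls": "remote"}, MODEL_SPEC)  # required url missing
```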
#### Helper for specification generation

`generate_spec_from_signature`: generates a model specification where annotations are used as `value_check` functions wherever possible, arguments are optional or not depending on the existence of a default value, and the path set contains only the tree root. A mapping from types to validators is used to validate the most common types; others will only be checked by `isinstance`. This is a helper function to generate a spec draft ahead of time, which must be corrected and stored along with the corresponding constructor, as it cannot be guaranteed to function properly in all cases.

Components with multiple implementations: operations based on function signatures, like validation but also instantiation, need a way to map the `cls` argument to the concrete type and constructor signature. A solution to automatically use the right constructor is to implement single dispatch and overloading on the main constructor. Every method may still call the main one, but must have a signature compatible with the one of the concrete class constructor, based on `cls`. See also the ["Library/Type implementations"](#Type-implementations) remark about abstract constructors.

## Client code (need contributions)

Demonstration of the features in every use case: CLI, WSGI, worker, task, daemon, testing.

### CLI entrypoint

* scan the current directory against the archive in docker
* (cli flag) `swh scanner --scanner-instance=docker scan .`
* (equivalent env var) `SWH_SCANNER_INSTANCE=docker swh scanner scan .`
* actual Python calls in the CLI endpoint:

```python=
import swh.core.config as config

scanner_instance = "docker"  # from cli flag or envvar
# from cli flag, envvar, CLI default or core.config default:
config_path = "~/.config/swh/default.yml"
config_dict = config.load(config_path)
scanner = config.create_component(
    config_dict, config.QID(type="scanner", instance=scanner_instance)
)
scanner.scan(".")
```

### API Server entrypoint

rpc-serve, WSGI app

```python=
def make_app_from_configfile() -> StorageServerApp:  # Or any other module App
    global app_instance
    if not app_instance:
        config_dict = config.load_from_envvar()
        rpc_instance = os.environ.get("SWH_STORAGE_RPC_INSTANCE", "default")
        app_instance = config.create_component(
            config_dict, config.QID(type="storage-rpc", instance=rpc_instance)
        )
        if not check_component(app_instance, "storage-rpc"):
            raise ValueError("invalid storage-rpc config")  # ConfigurationError?
    return app_instance
```

> [name=ardumont] Completed the snippet ^ (unsure about it)

### Celery task entrypoint

Celery task code

```python=
@shared_task(name="foo.bar")
def load_git(url):
    config_dict = config.load_from_envvar()
    loader_instance = os.environ.get("SWH_LOADER_GIT_INSTANCE", "default")
    loader = config.create_component(
        config_dict, config.QID(type="loader-git", instance=loader_instance)
    )
    return loader.load(url=url)
```

> [name=ardumont] we moved away from passing parameters to the `load` function. The url parameter is to be passed along to the constructor of the loader (same goes for lister, etc...)
### Testing / REPL

Example test

```python=
import swh.core.config as config

@pytest.fixture
def config_dict():
    return {...}

def test_config(config_dict):
    type_ID = "objstorage"
    instance_ID = "test_1"
    instance = config.create_component(
        config_dict, config.QID(type=type_ID, instance=instance_ID)
    )
    ...

@pytest.fixture
def config_path(datadir):
    return f"{datadir}/other.yml"

def test_config2(config_path):
    config_dict = config.load_from_path(config_path)
    type_ID = "objstorage"
    instance_ID = "test_1"
    instance = config.create_component(
        config_dict, config.QID(type=type_ID, instance=instance_ID)
    )
    ...
```

## Environment

The environment parameters comprise any dependency of the configuration system external to the code. This includes: configuration directory, configuration file, environment variables and command-line parameters.

### Configuration directory

SWH configuration directory: `SWH_CONFIG_HOME=$HOME/.config/swh`

### Configuration file

A YAML file with a .yml extension containing only the configuration data. Default if none is specified to the generic loading routine: `$SWH_CONFIG_HOME/default.yml`.

[opt=conffileid] a configuration file ID corresponding to the basename of a configuration file (without extension).
-> [Q] then only from `$SWH_CONFIG_HOME`, or have a register?

### Core configuration file parameter

This feature is to be built into the SWH core library.

Specify the path to the configuration file to use for a whole service:

path_part = `path` | `file`

Environment variable: `SWH_CONFIG_<PATH_PART>`
CLI option: `swh --config-<path_part>`

[rem] "path" is a more precise term than "file".

### Specific configuration parameters

A CLI option may be passed to specify an instance ID (only at the 2nd level) when several alternatives are provided in the configuration. Such an option must be declared statically in CLI code.

Specify the instance configuration to use for a given component, using an instance ID:

id_part = `instance` | `id` | `iid` | `cid`

`SWH_<COMP>_<ID_PART>`
`--<comp>-<id_part>`

[rem] Any variant containing "id" is more precise than simply "instance".

A CLI option may be passed to override an attribute in the configuration. Such an option must be declared statically in CLI code.

Specify any other predefined configuration option:

`SWH_<COMP>_<OPTION>`
`--<comp>-<option>`

[OOS] dynamic handling of any such options for any attribute, similar to what `click` permits.

### Configuration priority

CLI has precedence over envvars. Environment parameters have precedence over whole definitions (from file or code), and whole definitions have precedence over defaults, per attribute. This follows the principle that the particular takes precedence over the general.

CLI param > envvar param > CLI file > envvar file > default file > defaults literal

These precedence rules must be implemented in entrypoint client code, with the help of the library loading API. Only part of this may be implemented, the minimum being accepting a whole definition through either code or envvar.
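A minimal sketch of how an entrypoint could apply part of this precedence chain when choosing a configuration file; the option and variable names follow this section's conventions, everything else is illustrative:

```python=
import os

def choose_config_path(cli_path=None):
    # CLI file > envvar file > default file, per the precedence chain above.
    if cli_path:  # e.g. from `swh --config-path`
        return cli_path
    env_path = os.environ.get("SWH_CONFIG_PATH")
    if env_path:
        return env_path
    config_home = os.environ.get(
        "SWH_CONFIG_HOME", os.path.expanduser("~/.config/swh")
    )
    return os.path.join(config_home, "default.yml")
```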
### Example environment specifications (need contributions)

Using this objstorage replayer configuration file:

```yaml
objstorage:
  local:
    cls: pathslicing
    root: /srv/softwareheritage/objects
    slicing: "0:2/2:5"
  s3:
    cls: s3
    s3-param: foo

journal-client:
  default:  # single implem: no cls needed
    brokers:
      - kafka
    prefix: swh.journal.objects
    client-param: blablabla
  docker:
    brokers:
      - kafka.swh-dev.docker
    ...

objstorage-replayer:
  default:
    src: <objstorage.local>
    dst: <objstorage.s3>
    journal-client: <journal-client.default>
```

CLI usage:

Specify no instance, use the default instance config:
* `swh objstorage replayer`

Specify an instance:
* `swh objstorage replayer --from-instance default`

Specify nested instances (opt=subinst):
* `swh objstorage replayer --src local --dst s3`
* `swh objstorage replayer --src local --dst s3 --journal-client docker`

CLI options are to be defined statically.

#### Other proposals

@douardda's proposal: on-the-fly generation of instance config via syntactic sugar
`swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3`

[OOS] @tenma's proposal: dynamic handling of CLI options wrt schema
`swh objstorage replayer --objstorage-replayer.src=objstorage.local --objstorage-replayer.journal-client=journal-client.default`
- generic, verbose, arbitrary attribute setting

## Limitations (need contributions)

Depends on the chosen functionalities.

## OOS, rejected ideas (need contributions)

Depends on the chosen functionalities.

## Impacts (need contributions)

- core.config library
- main constructor/factory of each SWH component: type mapping or dynamic dispatch, use of library APIs for validating
- entrypoints: use of library APIs for loading and instantiating
- configuration files format
- environment variables and CLI calls in production+docker environments

## Implementation plan: library, ops code, tests, prod (need contributions)

(Proposition) Prepare for an easy switch and rollback by creating configuration copies conforming to the new system, and code conforming to the new system, in separate branches.

- implement the whole library in the same file as before
- migrate tests (at any moment)
- prepare new config files, and service definitions that use them
- migrate services one by one following SWH dependencies:
    - add needed declarations along with constructors
    - entrypoint loading, instantiating and injecting (if opt=subinst)
- remove deprecated code

# Synthesis of the meeting 2020-11-24

Participants: tenma, douardda, olasd, ardumont

Language:
- use the "_" TID for all singletons. Then QIDs become regular (TID, IID)
- USE subinst: instance references may appear arbitrarily deep in an instance
- OOS anoninst: every instance is defined at the 2nd level
- OOS refkey, reckey, skey, sref: no references to keys and singletons; YAML ref syntax may be used

APIs:
- OOS conffileid
- specify only the public API, no instantiation plumbing, KISS
- loading API: no merging, so remove defaults
- instantiation API: only distinguish component/singleton
- instantiation API: use instance methods? use keywords for QID
- validation library for later; for now only constructor validation

Implementation:
- register: qualified_constructor is to be a type object, not a str
- register: populate from setuptools declarations referencing constructor + documentation_builder (@olasd)

Environment:
- comprehensive environment handling is good, but should be opt-in

Notes about implementation:
- confirmed that anoninst needs an inline type declaration (but OOS)
- how to prepare definitions for instantiation? Replace the reference by the instance in the definition (duplication?). The QID could be inserted in the definition as a key, and added to an instance register; that would make handling more regular
Still open:
- rename singleton to ad-hoc object?
- forgot to choose terms in the APIs
- forgot to cover usage of external components
- factory constructors instead of factory functions: not convinced
  -> they allow polymorphism; type and callable are associated
  -> factories reimplement single dispatch, which is built into classes

Conclusion:
- no feedback on the spec itself, as it is not ready (cleaned)
- make a clean spec in Sphinx and create a diff
- finish the library draft and create a diff (P878)
- @olasd for details on populating the register and docs through setuptools
