New SWH configuration scheme
https://forge.softwareheritage.org/T1410
Better config system that does not rely on implicit configurations.
Use Cases
Production deployment
Docker
Tests
REPL
cli tools (`swh scheduler` command to manage tasks, `swh loader`, `swh lister`, …)
Rationale
Current features
The current configuration system is a utility library implementing diverse strategies of config loading and parsing, including the following functionalities:
Wanted features
consistent config definition and processing across the SWH codebase
priority loading with defined mechanics:
directional config merge: merge specific definition with a default one
namespaced by distinct roles, so that one fully qualified config key can be used by different components, and a same unqualified key may exist for different roles, for example:
should have a straightforward API, possibly declarative, so that user code can plug config definitions in a single step (decorator, mixin/trait, factory attribute, etc.)
configuration as attributes of target class to have proper doc/typing/validators, either flat or Config object parametrized by class (either class object or config `cls` literal)
may or may not want: decouple config keys from component constructors arguments (easy if Config is in another object) so that config keys and class attributes can evolve independently
config is loaded on entrypoints (cli, celery task, gunicorn wsgi), not by each component (Loader, Lister, …)
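The "directional config merge" wanted above could be sketched as follows. This is a minimal illustration and not the `swh.core.config` implementation; the function name `merge_config` and the sample keys are assumptions made for the example.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_config(default: Dict[str, Any], specific: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which `specific` overrides `default`, key by key.

    Nested dicts are merged recursively; any other value from `specific`
    replaces the default one. Neither input is modified.
    """
    merged = deepcopy(default)
    for key, value in specific.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical default and specific definitions.
default = {"storage": {"cls": "remote", "url": "http://localhost:5002"}, "debug": False}
specific = {"storage": {"url": "http://storage.example:5002"}}
merged = merge_config(default, specific)
```

The merge is directional: only the specific definition can override, and keys absent from it keep their default value.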
Early concrete elements
Format and location
File format: YAML
Default config:
Specific config:
Environment variables:
Library
swh.core.config
Either run with a switch or an envvar, else hardcoded default path
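A possible reading of this lookup order, as a sketch only. The helper name `resolve_config_path` is hypothetical; the envvar name `SWH_CONFIG_FILENAME` and the path `~/.config/swh/global.yml` are the ones mentioned elsewhere in this document, and using that path as the hardcoded default is an assumption.

```python
import os
from typing import Mapping, Optional

# Assumption: the default path is the one listed for the scanner in this document.
DEFAULT_CONFIG_PATH = "~/.config/swh/global.yml"


def resolve_config_path(
    cli_value: Optional[str],
    environ: Mapping[str, str] = os.environ,
    default: str = DEFAULT_CONFIG_PATH,
) -> str:
    """Pick the config file path: CLI switch first, then envvar, then default."""
    if cli_value:
        return cli_value
    env_value = environ.get("SWH_CONFIG_FILENAME")
    if env_value:
        return env_value
    return os.path.expanduser(default)
```

Passing the environment as a parameter keeps the function easy to exercise in tests without touching the real process environment.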
Usage
See example from scanner CLI.
Current situation
These are examples of config files as currently used (we focus here on the configuration itself, not on where these files are loaded from).
Most of the configuration files use the form:
Also most (?) CLI tools for swh packages use the same pattern: the config file loading mechanism is handled in the main click group for that package (e.g. in `swh.dataset.cli.data_cli_group` for `swh.dataset`, or `swh.storage.cli.storage` for the storage, etc.)
objstorage
The generic config for an objstorage looks like:
In which we have the main config entry: how to access the underlying objstorage backend, then one (or more) configuration items for the objstorage RPC server (for which one needs to read the code to know what possible options are accepted).
rpc-server
The config is checked in `swh.objstorage.api.server.make_app` with some validation in `swh.objstorage.api.server.validate_config`.
It also accepts a `client_max_size` top-level argument, which is the only "extra" config parameter supported (used in `make_app`).
WSGI/gunicorn
When started via gunicorn:
swh.objstorage.api.server:make_app_from_configfile()
This function checks for the presence of `SWH_CONFIG_FILENAME`, loads the config file, validates it (`validate_config`), then calls `make_app`.
replayer
The objstorage replayer needs 2 objstorage configurations (src and dst) and a journal_client one, e.g.:
The `journal_client` config item is directly used as argument of the `swh.journal.client.get_journal_client()` factory.
storage
In which we have the same config system for the main underlying (storage) backend.
Besides the configuration of the underlying storage access, there can also be the configuration for the linked objstorage and journal_writer.
The former is passed directly to the `swh.storage.objstorage.ObjStorage` class, which is a thin layer above the real `swh.objstorage.ObjStorage` class (instantiated via `get_objstorage()`).
The latter is directly used as argument of the `swh.storage.writer.JournalWriter` class.
Also note that the instantiation of the objstorage and journal writer is done in each storage backend (it's not a generic behavior in `get_storage()`).
rpc-serve
Same as general case + inject the `check_config` flag from cli options if needed.
WSGI/gunicorn
swh.storage.api.server:make_app_from_configfile()
replayer
This tool needs 2 entries: the destination storage (same config as above) + the journal client config (`journal_client`), like:
The `journal_client` config item is directly used as argument of the `swh.journal.client.get_journal_client()` factory.
backfiller
The backfiller uses a "low-level" config scheme, because it needs a direct access to the database:
The config validation is performed within the `JournalBackfiller` class.
dataset
In `swh.dataset`, loaded config is directly passed to `GraphEdgeExporter` via `export_edges` and `sort_graph_nodes`.
For the `GraphEdgeExporter`, these config values are actually the `**kwargs` of `ParallelExporter.process` plus the `remove_pull_requests` flag extracted from the `config` dict in `process_messages()`.
This `ParallelExporter` uses a single config entry, `journal`, the configuration of a journal client.
For the `sort_graph_nodes`, config values are:
`sort_buffer_size`
`disk_buffer_dir`
deposit
The main `click` group of `swh.deposit` does not load the configuration file.
However, it provides a `swh.deposit.config.APIConfig` class that loads the configuration from the `SWH_CONFIG_FILENAME` file.
The generic implementation expects a `scheduler` entry, and has default values for `max_upload_size` and `checks`.
.The current config file for the deposit service in docker looks like:
client tools
The `swh.deposit.cli.client` clis do not explicitly implement configuration loading from a file; instead every configuration option is given as a cli option.
However, some classes instantiated from there do support loading a config file from the `SWH_CONFIG_FILENAME` environment variable.
Config entries for a deposit client are:
`url`
`auth` (a dict with `username` and `password` entries)
admin tools
The `swh.deposit.cli.admin.admin` click group does implement the config file loading pattern (actually the loading itself is implemented in the `setup_django_for()` function).
This function loads the django configuration from `swh.deposit.settings.<platform>` (with `<platform>` in `["development", "production", "testing"]`), and sets the `SWH_CONFIG_FILENAME` environment variable to the `config_file` argument given.
celery worker
The deposit provides one celery worker task (`CheckDepositTsk`) which loads its configuration exclusively from `SWH_CONFIG_FILENAME`. The only config entry used is the `deposit` server connection information.
RPC server
The deposit server uses the standard django configuration scheme, but the selected config module is managed by `swh.deposit.config.setup_django_for()`.
A tricky thing is the `swh.deposit.settings.production` django settings module, since it does load the `SWH_CONFIG_FILENAME` config file (but NOT in the `development` nor `testing` flavors).
In `production` mode, it expects the configuration to have:
`scheduler`
`private` (credentials for the admin pages of the deposit)
`allowed_hosts` (optional)
`storage`
`extraction_dir`
graph
The main click group of `swh.graph` does load the config file, but it does not fall back to `SWH_CONFIG_FILENAME` if no config file is given as cli option argument.
Supported configuration values are declared/checked in the `swh.graph.config` module.
There is no main "graph" section or namespace in the config file, so all config entries are expected at the file's top level:
indexer
The main click group of `swh.indexer` does load the config file, but it does not fall back to `SWH_CONFIG_FILENAME` if no config file is given as cli option argument.
For the indexer storage, a standard `swh.indexer.storage.get_indexer_storage()` factory function is provided, and is generally called with arguments from the `indexer_storage` configuration entry.
schedule
The `swh.indexer.cli.schedule` command uses the config entries:
journal_client
The `swh.indexer.cli.journal_client` command (listens to the journal to fire new indexing tasks) uses the config entries:
The connection to the kafka broker is handled only by command line option arguments.
RPC server
When started using the `swh indexer rpc-serve` command, it expects a config file name as required argument. Configuration entries are:
WSGI/gunicorn
When started as a WSGI app, the configuration is loaded from the `SWH_CONFIG_FILENAME` environment variable (in `make_app_from_configfile`).
journal
The journal can be used from the producer side (e.g. a storage's journal writer) or the consumer side.
The `swh.journal.client.get_journal_client(cls, **kwargs)` factory function is generally used to get a journal client connection with arguments directly from the `journal_client` (or `journal`) configuration entry.
The `swh.journal.writer.get_journal_writer(cls, **kwargs)` factory function is used to get a producer journal connection, with arguments directly from the `journal_writer` configuration entry (generally it's the subentry of the "main" `storage` config entry, as seen above in the storage config example).
loaders
Loaders are mostly celery workers. There is a cli tool to synchronously execute a loading.
When run as a celery worker task, the configuration loading mechanism is detailed in the scheduler section below.
When executed directly, via `swh loader run`, the loader class is instantiated directly, thus it's the responsibility of the latter to load a configuration file. This is normally done by using the `swh.core.config.load_from_envvar` class method.
listers
The main lister cli group does handle the loading of the config file, including falling back to the `SWH_CONFIG_FILENAME` if not given as command line argument.
Expected config options are:
The `swh lister run` command also instantiates a lister class. The base implementation supports the configuration options:
When used via a celery worker, the standard celery worker config loading mechanism is used (see the scheduler below).
scanner
The scanner's cli implements its own strategy for finding the configuration file to load (including looking at the `SWH_CONFIG_FILENAME` variable). It only needs connection information to the public web API:
scheduler
The scheduler consists of several parts.
celery
Every piece of code that involves loading the celery stack of `swh.scheduler`, i.e. that imports the `swh.scheduler.celery_backend.config` module, will load the configuration file from the `SWH_CONFIG_FILENAME`, in which at least a `celery` section is expected.
Celery workers are registered from the `swh.workers` pkg_resources entry point as well as the `celery.task_modules` configuration entry.
The main celery app singleton is then configured from a hardcoded default config dict merged with the `celery` configuration loaded from the configuration file.
celery workers
Celery workers are started by the standard celery command (`python -m celery worker`) using `swh.scheduler.celery_backend.config.app` as celery app, so the configuration loading mechanism is the default celery one described above, and the only way to specify the configuration file to load is via the `SWH_CONFIG_FILENAME` variable.
cli tools
The main click group does implement the `--config-file` option, and uses the `swh.core.config.read()` function. So this main config file loading mechanism does not fall back to the `SWH_CONFIG_FILENAME` variable.
At this level, the only expected config entry is `scheduler` (connection to the underlying scheduler service).
Additional config entries for cli commands:
`runner`: `celery`
`listener`: `celery`
`rpc-serve`:
`celery-monitor`: `celery`
`archive`: `swh.scheduler.backend_es.ElasticSearchBackend`
WSGI/gunicorn
The loading of the WSGI app normally uses the `swh.scheduler.api.server.make_app_from_configfile()` function that takes care of loading the config file from the `SWH_CONFIG_FILENAME` with no fall back to a default path.
The loaded config is added to the main flask `app` object, so any flask-related config option is possible (at the configuration's top level).
search
cli tools
The `swh.search` main cli group does implement the `--config-file` option (using `swh.core.config.read()` to load the file).
Config options by cli command:
`initialize`: `search`
`journal-client objects`: `journal`, `search`
`rpc-serve`: `search`. This expects a config file name given as (mandatory) argument (in "addition" to the general `--config-file` option). This configuration is then used to configure the flask-based RPC server.
WSGI
The creation of the WSGI app is normally done using `swh.search.api.server.make_app_from_configfile`, which uses the `SWH_CONFIG_FILENAME` variable as the (only) way of setting the config file.
vault
cli tools
There is no support for the `--config-file` option in the main click group, but the cli only provides one command (`rpc-serve`), which does support this option.
The configuration file is loaded in `swh.vault.api.server.make_app_from_configfile()`, and the main RPC server is `aiohttp` based.
web
Django-based stuff.
web client
No config file loading for now (?)
Configuration loading mechanisms
| Entry point | Config file loading function used | Loaded in | Config file location |
|---|---|---|---|
| `swh dataset ...` | `swh.core.config.read()` | `swh.dataset.cli.dataset_cli_group()` | `--config-file` |
| `swh deposit client` | | | |
| `swh deposit admin` | `config.load_named_config()` | | `--config-file`, `SWH_CONFIG_FILENAME` |
| | `config.read_raw_config()` | | `SWH_CONFIG_FILENAME` via `DJANGO_SETTINGS_MODULE` |
| `swh graph ...` | `swh.core.config.read()` | `swh.graph.cli.graph_cli_group()` | `--config-file` |
| `swh indexer ...` | `swh.core.config.read()` | `swh.indexer.cli.indexer_cli_group()` | `--config-file` |
| `swh indexer rpc-serve` | `swh.core.config.read()` | `swh.indexer.storage.api.server.load_and_check_config()` | `config-path` (via `s.i.cli.rpc_server()`) |
| | `swh.core.config.read()` | `swh.indexer.storage.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME` (via `s.i.s.a.s.make_app_from_configfile()`) |
| `swh icinga_plugins ...` | | | |
| `swh lister ...` | `swh.core.config.read()` | `swh.lister.cli.lister()` | `--config-file`, `SWH_CONFIG_FILENAME` (via `s.l.cli.lister()`) |
| | `config.load_from_envvar()` | `swh.lister.core.simple_lister.ListerBase()` | `SWH_CONFIG_FILENAME`, `<SWH_CONFIG_DIRECTORIES>/lister_<name>.<ext>` |
| `swh loader ...` | `config.load_from_envvar()` | `swh.loader.package.loader.PackageLoader()` | `SWH_CONFIG_FILENAME` |
| `swh loader ...` | `config.load_from_envvar()` | `swh.loader.core.loader.BaseLoader()` | `SWH_CONFIG_FILENAME` |
| `swh objstorage ...` | `swh.core.config.read()` | `swh.objstorage.cli.objstorage_cli_group()` | `--config-file`, `SWH_CONFIG_FILENAME` (via `s.o.cli.objstorage_cli_group()`) |
| | `swh.core.config.read()` | `swh.objstorage.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME` (via `s.o.api.server.make_app_from_configfile()`) |
| `swh scanner ...` | `config.read_raw_config()` | `swh.scanner.cli.scanner()` | `--config-file`, `SWH_CONFIG_FILENAME`, `~/.config/swh/global.yml` |
| `swh scheduler ...` | `swh.core.config.read()` | `swh.scheduler.cli.cli()` | `--config-file` |
| | `swh.core.config.read()` | `swh.scheduler.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME` |
| `swh.scheduler.celery_backend.config` | `swh.core.config.load_named_config()` | `swh.scheduler.celery_backend` | |
| `swh search ...` | `swh.core.config.read()` | `swh.search.cli.search_cli_group()` | `--config-file` |
| `swh search rpc-server` | `swh.core.config.read()` | `swh.search.api.server.load_and_check_config()` | `config-path` |
| | `swh.core.config.read()` | `swh.search.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME` (from `s.s.api.server.make_app_from_configfile()`) |
| `swh storage ...` | `swh.core.config.read()` | `swh.storage.cli.storage()` | `--config-file`, `SWH_CONFIG_FILENAME` (from `s.s.cli.storage()`) |
| | `swh.core.config.read()` | `swh.storage.api.server.load_and_check_config()` | `SWH_CONFIG_FILENAME` (from `s.s.api.server.make_app_from_configfile()`) |
| `swh vault rpc-serve` | `swh.core.config.read()` or `swh.core.config.load_named_config()` | `swh.vault.api.server.make_app_from_configfile()` | `--config-file`, `SWH_CONFIG_FILENAME` (from `s.v.api.server.make_app_from_configfile()`), `<swh.core.config.SWH_CONFIG_DIRECTORIES>/vault/server.<ext>` |
| | `swh.core.config.read()` or `swh.core.config.load_named_config()` | `swh.vault.cookers.get_cooker()` | `SWH_CONFIG_FILENAME` (from `...get_cooker()`), `<s.c.c.SWH_CONFIG_DIRECTORIES>/vault/cooker.<ext>` (from `get_cooker()`) |

Synthesis of discussion @tenma/@ardumont 2020-10-08
Impacts
Definitions
Possible plan
`swh.<component>.config` module (like what we declare today for tasks in `swh.<component>.tasks`)
Pre-requisites
The earlier described plan should respect the following:
Out of scope
Feedback on the proposal
olasd
I strongly like that this is going towards having fewer dicts being thrown around in our code around object instantiation in favor of stronger typed objects. I also think this can be implemented in a DRY way by parsing the signature of the classes that are being instantiated.
I'm not sure that the backwards compatibility for production is such a strong requirement, even though it would definitely be nicer.
Configuration sharing
This proposal doesn't seem to be solving the concern of shared configuration across so-called components; Let me use a concrete example:
- the `WebClient` (or whatever its name is) class in swh.web.client takes a `token` parameter for authentication, and a `base_url` for its setup
- the `SwhFuse` component uses a `WebClient`
- the `SwhScanner` component also uses a `WebClient`
- the `swh fuse` command takes a config file with its own parameters, as well as parameters for a web client
- the `swh scanner` command takes a config file with parameters for a web client; I expect most of its other parameters come from the CLI directly

In the proposal it's not clear to me how the following would happen:
- `swh.web.client` declares its configuration schema in `swh.web.client.config`
- `swh.fuse` and `swh.scanner` do the same in their own `config` modules
- the `swh fuse` and `swh scanner` entrypoints parse a configuration file; from the output, they instantiate a `SwhFuse` / `SwhScanner`.

Now that I've written all of this, I guess this could be solved by having a way for the `swh.fuse.config` and `swh.scanner.config` modules to declare that they're expecting a `swh.web.client.config` at the toplevel of their configuration file (rather than in a nested way like the `get_storage` factories currently work). Did you have a different idea?
douardda
We may see the problem we are trying to solve as:
Synthesis of discussion @tenma/@douardda/@olasd 2020-10-12
Taste
Examples that are not accurate but demonstrate syntax and functionalities.
Example config (global, mixing unrelated components) declarations:
This demonstrates:
Example CLI usage:
Configuration model
Configuration statement syntax
3 levels (real names to be defined): package, instance, attribute (key:value)
Examples of existing "package" identifiers:
Anonymous instances in config file (for use as a value) is out of scope for now.
Package corresponds to an existing identifier, the one of the SWH Python package that defines the components to configure. It is used to resolve component types (search for type `remote` in `storage`) and group components of the same service/package.
/!\ what about components that are not of the main type of the package? Would need inclusion into the map and factory.
Ex: non-storage component in storage package?
The `cls` (alt: `type`) key is a type identifier and denotes what kind of object schema to use.
Other keys are instance keys conforming to the schema referenced by `cls`, more concretely, the class constructor arguments.
`<` and `>` are reference markers and denote a reference to a qualified instance or key. The choice of marker is not definitive. The YAML reference feature (syntax `&`/`*`) is rejected because we want to keep references in our model and thus not have them processed outside our control.
Configuration declaration
Packages are defined statically. Most probably in core.config. Could be "discovered", but we may prefer whitelisting supported ones.
Packages are associated with a factory function to instantiate instances as `InstanceType(**keys)`, and optionally a (class_identifier_string : Python_class) map/register.
Example: Storage package
Component constructor defines config keys, types and defaults.
No static definitions that would need maintaining, but documentation autogenerated from constructors.
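Deriving keys, types and defaults from a constructor can be done with the standard `inspect` module. The sketch below is only an illustration: `ExampleStorage` and `config_schema` are hypothetical names, not existing SWH code.

```python
import inspect
from typing import Any, Dict


class ExampleStorage:
    """Hypothetical component, standing in for a real SWH class."""

    def __init__(self, url: str, timeout: int = 30, check_config: bool = False):
        self.url = url
        self.timeout = timeout
        self.check_config = check_config


def config_schema(component: type) -> Dict[str, Dict[str, Any]]:
    """Describe the config keys of a component from its constructor signature."""
    schema: Dict[str, Dict[str, Any]] = {}
    # inspect.signature() on a class reports __init__ parameters without `self`.
    for name, param in inspect.signature(component).parameters.items():
        required = param.default is inspect.Parameter.empty
        schema[name] = {
            "type": param.annotation,
            "required": required,
            "default": None if required else param.default,
        }
    return schema


schema = config_schema(ExampleStorage)
```

Such a description could feed both validation of instance keys and the autogenerated documentation mentioned above.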
Endpoint usage
Config file is necessary to specify a graph of instances.
Config file parameter specified as a string path, either absolute or relative. Current name: config-file.
Config file can be specified as CLI option (`--<config-file>`) or envvar (`SWH_<CONFIG_FILE>`).
Reference to an instance can be specified inline (ref syntax), CLI option (`--<key>-instance`), envvar (`SWH_<key>_INSTANCE`).
Other keys can be specified as CLI option (`--<key>`), envvar (`SWH_<key>`).
Library API
Example:
core.config.get_component_from_config(config=cfg_contents, type="storage", id="uffizi")
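One way such a function could behave, as a sketch under assumptions: the `FACTORIES` register, the `ExampleRemoteStorage` class and the `example_get_storage` factory are all invented for the illustration, and the config layout (type, then instance, then keys) follows the three-level model described above.

```python
from typing import Any, Callable, Dict

# Hypothetical register mapping a component type to its factory.
FACTORIES: Dict[str, Callable[..., Any]] = {}


class ExampleRemoteStorage:
    """Stand-in component; not an actual SWH class."""

    def __init__(self, url: str):
        self.url = url


def example_get_storage(cls: str, **kwargs: Any) -> Any:
    """Stand-in factory dispatching on the `cls` key."""
    implementations = {"remote": ExampleRemoteStorage}
    return implementations[cls](**kwargs)


FACTORIES["storage"] = example_get_storage


def get_component_from_config(config: Dict[str, Any], type: str, id: str) -> Any:
    """Instantiate instance `id` of component type `type` from a loaded config.

    The parameter names shadow builtins on purpose, to match the call
    shown in the text.
    """
    if type not in FACTORIES:
        raise KeyError(f"unknown component type: {type!r}")
    try:
        instance_keys = config[type][id]
    except KeyError:
        raise KeyError(f"no instance {id!r} of type {type!r} in config") from None
    return FACTORIES[type](**instance_keys)


# Hypothetical config contents, as they could look once parsed from YAML.
cfg_contents = {"storage": {"uffizi": {"cls": "remote", "url": "http://uffizi.example:5002/"}}}
storage = get_component_from_config(config=cfg_contents, type="storage", id="uffizi")
```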
Algorithms
To be defined.
Parsing (YAML library), config resolving (reference processing), component resolving, instantiating.
Restriction to N levels eases implementation.
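Since the algorithms are still to be defined, the following is only a sketch of what the reference-processing step might look like. It assumes references are written as a string `<type.instance>` in an attribute value, and that the config has already been parsed to nested dicts; the sample keys and values are hypothetical.

```python
import re
from typing import Any, Dict, Tuple

# Assumption: a reference is a whole attribute value of the form "<tid.iid>".
REF = re.compile(r"^<([\w-]+)\.([\w-]+)>$")


def resolve_instance(
    config: Dict[str, Any], tid: str, iid: str, _stack: Tuple[Tuple[str, str], ...] = ()
) -> Dict[str, Any]:
    """Return the definition of instance (tid, iid) with references expanded."""
    if (tid, iid) in _stack:
        raise ValueError(f"circular reference involving {tid}.{iid}")
    try:
        definition = config[tid][iid]
    except KeyError:
        raise KeyError(f"unknown instance {tid}.{iid}") from None
    resolved: Dict[str, Any] = {}
    for key, value in definition.items():
        match = REF.match(value) if isinstance(value, str) else None
        if match:
            # Replace the reference by the (recursively resolved) target definition.
            resolved[key] = resolve_instance(
                config, match.group(1), match.group(2), _stack + ((tid, iid),)
            )
        else:
            resolved[key] = value
    return resolved


config = {
    "objstorage": {"local": {"cls": "pathslicing", "root": "/srv/softwareheritage/objects"}},
    "storage": {"uffizi": {"cls": "local", "objstorage": "<objstorage.local>"}},
}
resolved = resolve_instance(config, "storage", "uffizi")
```

Restricting references to the attribute-value-to-instance form keeps the resolver to a single recursive function with a cycle check.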
Example configurations and usages
Note that the actual name of keys is completely up to bikeshedding; specifically, dashes versus underscores versus dots is completely up in the air.
Shared configuration for command line tools
Configuration file `~/.config/swh/default.yml` (default configuration path for user-facing cli tools)
Command-line calls
swh scanner --scanner-instance=docker scan .
SWH_SCANNER_INSTANCE=docker swh scanner scan .
swh fs mount ~/foo swh:1:rev:bar
objstorage replayer
Config file
Cli call
Default behavior (single call to `swh.core.config.get_component_from_config(config, 'objstorage-replayer', instance_name)`)
swh objstorage replayer --from-instance default
swh objstorage replayer (uses config from default instance)
Nice to have (multiple, manual, calls to `get_component_from_config` in the cli entry point)
swh objstorage replayer --src local --dst s3
swh objstorage replayer --src local --dst s3 --journal-client docker
@douardda's proposal: on-the-fly generation of instance config via syntactic sugar
swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3
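How such a URL could be turned into an instance definition is not specified here. A minimal sketch, assuming the scheme maps to `cls` and each query parameter to a flat instance key (the helper name `instance_from_url` is hypothetical):

```python
from typing import Dict
from urllib.parse import parse_qsl, urlsplit


def instance_from_url(url: str) -> Dict[str, str]:
    """Build an instance definition from a `<cls>://?key=value&...` URL."""
    parts = urlsplit(url)
    if not parts.scheme:
        raise ValueError(f"missing scheme (component class) in {url!r}")
    definition = {"cls": parts.scheme}
    # Query parameters become flat instance keys; values stay strings.
    definition.update(parse_qsl(parts.query))
    return definition


definition = instance_from_url(
    "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5"
)
```

All values come out as strings, so any typed key would still need conversion against the component schema.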
@tenma's proposal: dynamic handling of cli options wrt schema
swh objstorage replayer --objstorage-replayer.src=objstorage.local --objstorage-replayer.journal-client=journal-client.default
Single-task celery worker from systemd
Configuration file `/etc/softwareheritage/loader_git.yml`
(expanded) systemd unit '/etc/systemd/swh-worker@loader_git.service'
Instantiation flow
The celery cli loads the "app" object set in the cli: `swh.scheduler.celery_backend.config.app`
`SWH_CONFIG_FILENAME` to a dict (plausibly a singleton)
`celery` key of the config dict
celery task code
Alternatives to the proposed structure of package/cls in config
Drop the first level and include in type
More generic, does not impose grouping of components of same "package", instance names then may need to be more descriptive.
Do not restrict on swh service components
Q: do we only define instances of swh service components in the config file?
Using the more generic notion of role/namespace vs the current specific notion of package offers the flexibility of having names that do not refer to a swh component but to any data structure (with `type` having to be fully qualified, e.g. `model.model.Origin`), and be shared by multiple instances (would represent a data structure that does not belong specifically to a given swh service package).
It would need a register of data structures that can be used, like the other proposals.
For example, for a core/model/graph/tools/other_library data structure, do we want to be able to specify something along the lines of:
Key points to the specification
Synthesis of the meeting 2020-10-21
Participants: @tenma, @douardda, @olasd, @ardumont
Reporter: @tenma
Q = question
R = remark
OOS = Out of scope
Dates indicate the chronology of the report.
***Before starting to report the concepts tackled through the meeting, some points about terminology. The terminology was difficult to choose while writing this synthesis, so it is not completely consistent. This section tries to give a basis for discussion.
Terminology
Initial remarks (2020-10-22)
The term `type identifier` (`TID` for short) will be used in place of `package` from now on for the 1st-level names.
`package` represented both the SWH Python package name and the base type of SWH components available in this package, in my initial, partial view of the subject.
Now that it has been shown that there is no 1-to-1 mapping between SWH component names and SWH packages, the name `package` is no longer accurate.
The more generic `type identifier` reflects the flexibility we introduce with a top-level register of component types referenceable(?) in configuration.
The 2nd level is an `instance identifier` (`IID`) to instance definition mapping.
`TID` map to actual objects of some type, whereas `IID` exist only in the configuration system.
Terminology discussion preparation (2020-10-30)
Need to choose name for every concept of the system.
e.g. "storage"
e.g. "uffizi", "celery"
***Here really starts the report written as of 2020-10-22.
Examples written/updated during the meeting were not copied here.
Single implementations of type
The `cls` attribute of instances specifies the implementation/alternative/flavour to use.
It used to be required along with `args`, because all configurable SWH components were polymorphic.
Components that have no such feature need no `cls` attribute in their configuration.
R: An indirection layer such as a factory may be defined for such components in order to keep consistency and allow polymorphism if needed later. Alternatively, for polymorphic components, better than a factory which needs to be known by user code, an abstract base class constructor would abstract this indirection layer away.
Default instances
One instance in a package/namespace can be labeled as `default`.
It will be selected when an instance of a given component type is requested but no IID is given.
Q: could default instance be implied when instantiating with no IID?
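The selection rule for default instances can be illustrated as follows. This sketch assumes the label is simply the instance name `default`, which the text does not settle; `select_instance` and the sample instances are hypothetical.

```python
from typing import Any, Dict, Optional

# Assumption: the default instance is the one whose IID is literally "default".
DEFAULT_IID = "default"


def select_instance(
    instances: Dict[str, Dict[str, Any]], iid: Optional[str] = None
) -> Dict[str, Any]:
    """Return the definition for `iid`, or the default instance if none is given."""
    wanted = DEFAULT_IID if iid is None else iid
    if wanted not in instances:
        raise KeyError(f"no instance named {wanted!r}")
    return instances[wanted]


# Hypothetical instances of one component type.
objstorages = {"default": {"cls": "memory"}, "local": {"cls": "pathslicing"}}
```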
Singleton object definitions
For the sake of both clarity and reuse, ad-hoc configuration objects can be defined at top level and be referenced. These are not SWH or external components.
Those definitions are composed of an identifier (equivalent to an IID) on 1st level and a YAML object, possibly recursive, in further levels.
They are instantiated as schemaless dictionaries/lists.
Q: How to allow them in syntax and differentiate them from schemaful definitions?
Q: must such a singleton object be referenced in at least one place in the file it is part of, or would it otherwise be ignored?
Top-level component register
A register of components allowed in configurations is to be implemented in core.config.
It will consist of a `(TID : qualified_constructor)` mapping.
These entries will not be hardcoded in this mapping, but registered at import time from the package that defines the components.
`qualified_constructor` must be Python absolute import syntax for the callable, which is either a factory function or a class object.
It may be in quoted form (string) or actual object form. String form avoids an import, but no static check of existence may be performed. Given that registering is done in the package responsible for defining the component, object form is chosen.
Components that can be registered may be any SWH service component which is public (= has a Python object API).
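A register filled at import time, in object form, might be sketched with a decorator. All names here (`register_component`, `get_constructor`, `ExampleObjStorage`) are hypothetical and only illustrate the `(TID : qualified_constructor)` mapping.

```python
from typing import Any, Callable, Dict

# (TID : constructor) mapping; in the proposal this would live in core.config.
_REGISTER: Dict[str, Callable[..., Any]] = {}


def register_component(tid: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator registering a class or factory under a type identifier."""

    def decorator(constructor: Callable[..., Any]) -> Callable[..., Any]:
        if tid in _REGISTER:
            raise ValueError(f"type identifier {tid!r} is already registered")
        _REGISTER[tid] = constructor
        return constructor

    return decorator


def get_constructor(tid: str) -> Callable[..., Any]:
    """Look up the constructor registered for a type identifier."""
    return _REGISTER[tid]


# Registration happens when the defining module is imported.
@register_component("objstorage")
class ExampleObjStorage:
    """Stand-in component; not an actual SWH class."""

    def __init__(self, root: str):
        self.root = root


instance = get_constructor("objstorage")(root="/srv/objects")
```

With object form, a misspelled constructor fails at import time of the defining package rather than at configuration loading time.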
In practice, only one main component by SWH service encapsulates all configuration for the components used in the service:
Q: use object or quoted form for `qualified_constructor`?
Q: what about these config objects: journal related like journal-writer, most under deposit, celery-related…
CLI parametrization of configuration loading
A CLI option may be passed to specify an instance ID (only at 2nd level?) when several alternatives are provided in the configuration.
Such option must be declared statically in CLI code.
A CLI option may be passed to override an attribute in the configuration.
Such option must be declared statically in CLI code.
OOS: extend usage to any IID in instance mapping
OOS: dynamic handling of any such options for any attribute
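Statically declared instance-selection options could look like the sketch below. It uses `argparse` from the standard library as a stand-in for click so that it runs without dependencies; the option names follow the replayer examples in this document, while the config layout and the fallback to an instance named `default` are assumptions.

```python
import argparse
from typing import Any, Dict, List


def build_parser() -> argparse.ArgumentParser:
    """Declare, statically, the options that select instance IDs."""
    parser = argparse.ArgumentParser(prog="replayer")
    parser.add_argument("--src", default="default", help="source objstorage IID")
    parser.add_argument("--dst", default="default", help="destination objstorage IID")
    parser.add_argument("--journal-client", default="default", help="journal client IID")
    return parser


def select_instances(config: Dict[str, Any], argv: List[str]) -> Dict[str, Any]:
    """Map parsed CLI options to instance definitions from the config."""
    args = build_parser().parse_args(argv)
    return {
        "src": config["objstorage"][args.src],
        "dst": config["objstorage"][args.dst],
        "journal_client": config["journal-client"][args.journal_client],
    }


# Hypothetical config, already parsed from YAML.
config = {
    "objstorage": {"local": {"cls": "pathslicing"}, "s3": {"cls": "s3"}},
    "journal-client": {"default": {"group_id": "replayer"}, "docker": {"group_id": "test"}},
}
selected = select_instances(config, ["--src", "local", "--dst", "s3"])
```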
Example propositions
Default behavior (single, manual, component instantiation)
swh objstorage replayer --from-instance default
swh objstorage replayer (uses config from default instance)
Nice to have (multiple, manual, component instantiations)
swh objstorage replayer --src local --dst s3
swh objstorage replayer --src local --dst s3 --journal-client docker
swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3
Library API (proposition, 2020-10-23)
Moved to the specification below.
Emerging problems (2020-10-23)
Q: How to allow and handle both typed and untyped (ad-hoc) objects?
Q: how to identify what is an instance in the definitions? Will it require special handling wrt referencing mechanics (source and destination).
Q: do we want to support reference to external config definitions or require autonomy? = always 1 file for a whole service, or composition of partial definitions?
In the former, similar sections in multiple config files will need synchronization of update. One possibility is having each SWH component its config file and compose on-demand (using puppet?) in a standalone file for prod/tests. But not easy to have multiple instances this way. Fill on-demand a template file for each service?
Specification outline
Synopsis
General terminology/concepts
Scope/use cases
Rationale: existing, limitations, wanted
Specific terminology/concepts
Language description
Library
Client code
Environment
Out-of-scope, rejected ideas
Limitations
Impacts
Implementation plan: library, use, tests, prod
Specification 2020-11-02
(Writing sections breadth-first: deeper at each iteration)
Notations:
[opt=id]: concept subject to acceptance or removal, with identifier for easier reference. Usage similar to a feature flag.
[alt]: alternative to any surrounding [alt] statement
[rem]: remark
[OOS]: out-of-scope remark or idea
[rej]: rejected remark or idea
[Q]: question to be answered
Synopsis
The configuration system evolved partially with use cases. Initial design decisions applied to all use cases turned out to be both too hard to reason about/unstable for production and too inflexible for cli or testing.
General terminology and concepts
For the purpose of this specification.
Scope/use cases
All SWH services and components.
Environments:
Configuration needs:
system service: systemd service vs docker + shell script
server: gunicorn vs Flask/aiohttp/Django devel server
CLI entrypoint
server application: app_from_configfile vs django config
worker:
component: constructor/factory
Configuration sources:
Rationale
A.
-> different APIs for different use cases, all compatible (return config)
B.
-> dependency injection, component library API
C.
-> instances, references, singletons
Specific terminology and concepts of the proposition
Basis for discussing terms: terminology proposition
Used in this specification:
TID = type ID
IID = instance ID
AID = attribute ID
QID = qualified ID
ID = any of the above
ad-hoc object = singleton
"attribute ID" = path to the "key" of an attribute
Language description
Target example
Syntactic overview
Based on YAML:
3 levels of depth: type, instance, attribute
Instance definitions are composed of an ID and a mapping to attributes.
1 instance <-> N attributes.
Component type definitions are composed of an ID and a mapping to instances.
1 type <-> N instances.
These instances are variants of the component: same type but different constructions.
Singletons are instances defined outside type definitions, so they live at top-level and have no type.
This model is syntactically complicated, here are alternatives to make it regular:
[alt=typePrefix] no type level, so only 2 levels; type and instance identifiers are merged as "type.instance".
[alt=typeAttr] move type identifiers to the attribute level, as a special attribute "type"
[alt=singletonType] use a dummy type for singletons: "singletons"
References can be made to an object defined somewhere else in the tree, using a qualified identifier.
Legal forms are defined to be from an attribute value to an instance identifier.
[opt=refkey] Legal forms also include from an attribute value to an attribute identifier. OOS
[opt=recattr] There may be recursion from attribute value to instance value definition. This allows anonymous definition of instance object in an attribute. OOS
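As an illustration only, the three-level layout described above can be written as the Python mapping a YAML loader would produce. All type, instance and attribute names below are hypothetical, and the chevron reference form anticipates the Reference section.

```python
# Hypothetical configuration tree: type -> instance -> attribute.
config = {
    # Type level: TID "objstorage" maps to its instances.
    "objstorage": {
        # Instance level: two variants of the same type, each an attribute mapping.
        "local": {"cls": "pathslicing", "root": "/srv/objects"},
        "s3": {"cls": "cloud", "bucket": "example-bucket"},
    },
    "storage": {
        "default": {
            # Attribute value referencing an instance through its QID "TID.IID".
            "objstorage": "<objstorage.local>",
        },
    },
    # Singleton: an instance defined at top level, outside any type definition.
    "logging": {"level": "INFO"},
}
```

Note that the top level mixes types and singletons, which is the irregularity the [alt] options above try to remove.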
Grammar
WARNING: hopefully consistent grammar mixup. May be offensive to purists.
Some definitions have alternatives noted with |=.
Alternative definition of identifiers (always qualified):
Identifier
Identifier is abbreviated ID.
Type ID is abbreviated TID.
Singleton ID is equivalent to Instance ID, abbreviated IID.
(Attribute) Key is identified by Attribute ID, abbreviated AID.
Qualified ID is abbreviated QID.
QID is a sequence of ID of the form (TID, IID) for component instances or (IID) for singletons. Its string form joins each field with ".".
[opt=refkey] May have form (TID, IID, AID) to reference component instance attributes. Useful either to reference another attribute of current instance, or any other attribute, except those defined in singletons.
[opt=reckey] May have form (TID, IID, AID*) to reference recursive component instance attributes.
[opt=skey] May have form (IID, AID*) to reference singleton attributes (sequence of AID because recursive).
[opt=sref] singleton attributes may reference any attribute.
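A minimal sketch of the string form of a QID, covering only the two base forms, (TID, IID) and (IID), and none of the [opt] extensions. The function names parse_qid and format_qid are assumptions made for this example.

```python
from typing import Tuple


def parse_qid(qid: str) -> Tuple[str, ...]:
    """Split the string form of a QID into its ID fields.

    Accepts "TID.IID" (component instance) or "IID" (singleton).
    """
    fields = tuple(qid.split("."))
    if not 1 <= len(fields) <= 2 or not all(fields):
        raise ValueError(f"invalid QID: {qid!r}")
    return fields


def format_qid(fields: Tuple[str, ...]) -> str:
    """Join ID fields with "." to get the string form of the QID."""
    return ".".join(fields)
```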
Attribute
An attribute is a (key, value) pair; the set of attributes forms an instance dictionary.
Attribute value is either a YAML object or a reference.
[opt=recattr] Attribute value may also be an instance dictionary.
Attribute level is any level under instance level, recursive or not.
Reference
A reference is syntactically defined as a qualified identifier enclosed in chevrons. Its source is an attribute value and its target is the object identified by the QID it owns.
The reference is deleted when it is resolved by the reference resolution routine.
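The syntactic definition above could be recognized as follows. This is a sketch under the assumption that a reference occupies the whole attribute value and that a QID contains no whitespace or chevrons; reference_target is a hypothetical helper name.

```python
import re
from typing import Any, Optional

# A reference is a QID enclosed in chevrons, e.g. "<objstorage.local>".
REFERENCE = re.compile(r"<([^<>\s]+)>")


def reference_target(value: Any) -> Optional[str]:
    """Return the target QID if value is a reference, else None."""
    if not isinstance(value, str):
        return None
    match = REFERENCE.fullmatch(value)
    return match.group(1) if match else None
```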
Type
Python type of a component to be instantiated and configured.
It is referred to indirectly through a TID in a configuration definition, and through a component constructor in the component type register.
Instance
Specific instantiation of a component, distinguished from the others by the set of attributes used to initialize it.
All identified instances of a type must be specified in the instance level of a configuration definition.
[opt=deftinst] An instance identified as "default" is instantiated if a TID but no IID is provided to the instantiation routine.
[opt=subinst] Instances may be referenced in an attribute value, and be recognized as instance declarations; i.e. be instantiated and initialized and not just passed as is to the constructor.
[opt=anoninst] Anonymous instances may be defined in an attribute value, and be recognized as instance declarations.
-> [Q] that would need a type declaration, which is not yet handled in the spec, to identify it as an instance and be able to instantiate it. One inflexible option is to require the parent AID, which is the child IID, to be a TID, thus restricting its name to a known type ID. The other option is to specify the TID in a dedicated child attribute with a name similar to type or TID.
Singleton
Singleton objects are syntactically similar to instances.
Unless otherwise stated, the same rules apply.
They do not correspond to a predefined type, so they have no schema or attached semantics.
They are instantiated as a dict tree.
Library
Register
The component type register, abbreviated register, is a (TID, qualified_constructor) mapping, defined in the configuration library.
It is used by the component resolution routine to resolve type identifiers to Python type constructors.
Entries in this mapping are to be registered through the component registration library routine. This registration may happen anywhere provided it is executed at loading/import time. It is advised to register the component in the package that defines it.
qualified_constructor must be given in Python absolute import syntax for the object-creating callable, which is either a factory function or a class.
It may be defined:
[alt] in quoted form (string).
[alt] in class object form.
[rem] String form avoids an import, but no static check of existence can be performed. If the registration is done in the package responsible for defining the component, object form is preferable.
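A possible shape for the register, supporting both [alt] forms. This is only a sketch: register_component and the module-level _REGISTER are hypothetical names, and the signature of resolve_component is an assumption. String entries are imported lazily, at resolution time.

```python
import importlib
from typing import Any, Callable, Dict, Union

# (TID, qualified_constructor) mapping; values are either a dotted path or a callable.
_REGISTER: Dict[str, Union[str, Callable[..., Any]]] = {}


def register_component(tid: str, qualified_constructor: Union[str, Callable[..., Any]]) -> None:
    """Associate a type ID with a constructor, in string or object form."""
    _REGISTER[tid] = qualified_constructor


def resolve_component(tid: str) -> Callable[..., Any]:
    """Return the constructor registered for tid, importing it if given as a string."""
    constructor = _REGISTER[tid]
    if isinstance(constructor, str):
        # "package.module.Name" -> import package.module, then fetch Name.
        module_name, _, attr = constructor.rpartition(".")
        constructor = getattr(importlib.import_module(module_name), attr)
    return constructor
```

With the string form, a misspelled path is only detected when resolve_component is called, which is the drawback noted in the remark above.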
Components that can be registered may be any SWH service component, SWH support component or external component, which is public (= has a Python object API).
In practice, only one main component by SWH service encapsulates all configuration for the components used in the service:
[Q] what about these config objects: journal related like journal-writer, most under deposit, celery-related…
Type implementations
This section is informational.
A component type may have multiple implementations.
There is no specific support for it in this system, but as this concept may appear in configuration, related considerations may be worth noting.
[rem] The component type of an instance may be abstract, in which case a concrete type must be determined by the component constructor.
A specific attribute of instances specifies the implementation or flavour to use. It is commonly identified as cls, but could be impl or flavor.
It used to be required along with args, which is now deprecated, because all configurable SWH components were polymorphic.
Components that have no such feature need no cls attribute in their configuration.
Alternatively, polymorphic components may be instantiated without cls, in which case a default implementation will be used.
[rem] an indirection layer such as a factory may be defined for monomorphic components in order to keep consistency and allow polymorphism if needed later. Alternatively, for all components, better than a factory which is not derivable from the component type by user code, an abstract base class constructor would abstract this indirection layer away.
Instantiation
Instantiating is the process through which a concrete object is constructed from a model and data describing its (initial) state. In the context of this system, a Python object is created through calling its constructor with the set of attributes associated to a particular instance in a configuration definition.
The input is a QID identifying an instance and a configuration tree (dictionary) containing the instance and its dependencies (reference targets).
The output is a component instance.
The process is composed of the following steps in order.
[opt=subinst] Identifying instance definitions requires a TID/IID.
[Q] As parent (ID: {instance attrs}) or child ({ID: ..., instance attrs})?
[opt=deftinst] An instance identified as "default" is instantiated if a TID but no IID is provided to the instantiation routine.
Instances must be instantiated only once and used at each reference source.
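The "instantiated only once" rule can be sketched with a small cache keyed by QID. InstanceCache is a hypothetical name, and the way the constructor and attributes are obtained is left out here.

```python
from typing import Any, Callable, Dict, Mapping


class InstanceCache:
    """Keep one live object per QID so every reference source shares it."""

    def __init__(self) -> None:
        self._instances: Dict[str, Any] = {}

    def get_or_create(
        self, qid: str, constructor: Callable[..., Any], attributes: Mapping[str, Any]
    ) -> Any:
        # Construct on first request only; later requests reuse the same object.
        if qid not in self._instances:
            self._instances[qid] = constructor(**attributes)
        return self._instances[qid]
```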
Interpretation
(Validation/Conversion/Interpretation)
This section is informational.
Interpretation of attributes beyond stated above is out of scope and left to the component constructors to do.
Standard Python typing available in constructors may be used as the basis for the validation of configuration data.
Validity of structure, value and existence may be checked.
Conversions may also be performed.
[opt=validate] The library provides generic validation primitives and a validation routine based on a data model specification object.
Loading
(Loading/Defaults/Merging)
Loading is the process of fetching data from a storage medium into a memory space which is easily accessible to the processing system. In the context of this system, this data is then read and converted into a Python object.
Loading source may be: an I/O file abstraction (whatever its backing source), or an operating system path to such file abstraction, or such path resolvable from an environment variable or a configuration file ID.
Only a Python dictionary is accepted as the holder of this data once loaded. A default configuration definition, either as a dictionary literal or a loaded configuration, can be specified, in which case every attribute absent from the loaded configuration is set to its default value.
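A minimal sketch of how defaults could fill in absent attributes. The name apply_defaults is hypothetical, and merging nested mappings recursively is an assumption; the text above only states that absent attributes take their default value.

```python
from typing import Any, Dict, Mapping


def apply_defaults(loaded: Mapping[str, Any], defaults: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a new dict where loaded values win and defaults fill the gaps."""
    merged: Dict[str, Any] = dict(defaults)
    for key, value in loaded.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            # Both sides are mappings: merge one level deeper (assumed behavior).
            merged[key] = apply_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged
```

Neither input is modified, so the same defaults literal can be reused for several loaded definitions.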
API overview
The library should be imported as config everywhere for clarity and uniformity (e.g. import swh.core.config as config or from swh.core import config).
[rem] Existing routine merge_configs should be moved to another module as merge_dicts.
Configuration object: mapping
WARNING:
In the following examples, names are subject to change.
Code is inspired by Python, but abstracted to focus on typing.
DeriveType simply denotes a type derived from an existing one, with no consideration of compatibility with the base type or any other.
Loading API
[rem] Should choose term among load, read, from, by, config.
Example names: read_config, load_envvar
no default configs
Loads as YAML tree and converts to Python recursive mapping.
[opt=fileid] may use ID to reference files independently of their path or extension, in loading mechanism. This is sugar that existed but may be no longer wanted. OOS
[Q] Where to check for loadable path? In loading routines or user code? May duplicate behavior.
[Q] Should envvar be hardcoded in library or default? Same for default path.
Instantiation API
OOS: every function but create component
[rem] Should choose amongst get, read, from_config, instantiate, component, instance, iid.
Example names: get_component_from_config, instantiate_from_config, create_component, read_instance, get_from_id
Returns an instantiated component identified by QID.
Uses get_obj, resolve_references, resolve_component, instantiate_component.
Returns a config object (subtree) of the config identified by the config ID. May be used both for getting all instances or one instance depending on whether the config ID has an instance ID part.
Replaces reference source with the object identified by reference target.
[opt=subinst] Finds all instance definitions nested in this instance definition and returns a list of them.
[Q] How to identify instances?
[Q] Should it recurse into nested definitions or just one level?
Looks up the core.config register to get the constructor from the TID.
Instantiates a component using the given constructor and an instance configuration mapping.
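The routines described above could compose as follows. This is a sketch only: the function names come from this section, but every signature, the register passed as a plain dict, and the Storage class are assumptions made for the example. The [opt=subinst] step is left out.

```python
from typing import Any, Callable, Dict, Mapping


def get_obj(config: Mapping[str, Any], config_id: str) -> Any:
    """Return the subtree identified by a dotted config ID (TID or TID.IID)."""
    obj: Any = config
    for field in config_id.split("."):
        obj = obj[field]
    return obj


def resolve_references(definition: Mapping[str, Any], config: Mapping[str, Any]) -> Dict[str, Any]:
    """Replace each "<QID>" attribute value with the object it targets."""
    resolved = {}
    for key, value in definition.items():
        if isinstance(value, str) and value.startswith("<") and value.endswith(">"):
            value = get_obj(config, value[1:-1])
        resolved[key] = value
    return resolved


def resolve_component(register: Mapping[str, Callable[..., Any]], tid: str) -> Callable[..., Any]:
    """Look up the constructor registered for a TID."""
    return register[tid]


def instantiate_component(constructor: Callable[..., Any], attributes: Mapping[str, Any]) -> Any:
    """Call the constructor with the instance attributes as keyword arguments."""
    return constructor(**attributes)


def create_component(config: Mapping[str, Any], qid: str, register: Mapping[str, Callable[..., Any]]) -> Any:
    """Return the instantiated component identified by a "TID.IID" QID."""
    tid = qid.split(".")[0]
    definition = get_obj(config, qid)
    attributes = resolve_references(definition, config)
    return instantiate_component(resolve_component(register, tid), attributes)


class Storage:  # hypothetical component, for illustration only
    def __init__(self, objstorage: Mapping[str, Any], batch_size: int = 10) -> None:
        self.objstorage = objstorage
        self.batch_size = batch_size


config = {
    "objstorage": {"local": {"root": "/tmp/objects"}},
    "storage": {"default": {"objstorage": "<objstorage.local>", "batch_size": 5}},
}
storage = create_component(config, "storage.default", {"storage": Storage})
```

Here the reference is replaced by the target definition as is; turning it into a live instance would be the job of [opt=subinst].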
Validation API
[opt=validate]
This section proposes a framework for validating instance definitions in a fairly lightweight and flexible way, for use by component constructors or injectors.
check: validate both language and instances.
check_definitions: validate whole definition against language spec.
check_component: validate instance definition against component spec.
This is a template function which is parametrized by user-specified spec.
Model specification
check_component verifies that all properties of every attribute hold in the instance definition, based on a user-defined model specification. The model specification can leverage primitive check functions and user-defined check functions. Supported checks are value checks and existence-in-tree-structure checks, which are distinguished for expressiveness.
The model specification lists each (unqualified) attribute that may exist in the configuration definition, along with attribute properties that must hold.
An attribute may or may not be optional, meaning whether validation should fail on absence, based on the boolean value of the optional_check.
optional_check may be a callable that must determine whether the attribute is optional based on the configuration context and return a booleanish value, or be a booleanish value. It is run in a wrapper which converts falsy values or exceptions to False, and anything else to True. A required attribute is checked for existence based on a set of paths in the tree if any, or existence anywhere in the tree. An optional attribute is not checked for existence but is still checked for legal value.
The value check may be any callable that either accepts a single value, or a value and the configuration context (instance definition), and returns a booleanish value, handled as above. This makes it possible to use many existing functions or object constructors to do the validation, e.g. int, re.match, isinstance(Protocol) or a function verifying that a relation to another attribute in the definition is valid.
Helper for specification generation
generate_spec_from_signature: generates a model specification where annotations are used as value_check functions wherever possible, arguments are optional or not depending on the existence of a default value, and the path set contains only the tree root. A mapping from types to validators is used to validate most common types; others will only be checked by isinstance. This is a helper function to generate a spec draft ahead of time, which must be corrected and stored along with the corresponding constructor, as it cannot be guaranteed to function properly in all cases.
Components with multiple implementations:
Operations based on function signatures, like validation but also instantiation, need a way to map the cls argument to the concrete type and constructor signature.
A solution to automatically use the right constructor is to implement single dispatch and overloading on the main constructor. Every method may still call the main one, but must have a signature compatible with that of the concrete class constructor, based on cls.
See also "Library/Type implementations" remark about abstract constructors.
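To make the validation model of this section more concrete, here is a rough sketch of the booleanish wrapper, check_component and generate_spec_from_signature. The spec layout (a dict with optional_check and value_check keys), the helper name as_check and the Replayer class are assumptions. Path sets, the two-argument form of value checks and the type-to-validator mapping are omitted, so annotations are only checked with isinstance.

```python
import inspect
from typing import Any, Callable, Dict, List, Mapping


def as_check(check: Any) -> Callable[..., bool]:
    """Wrap a callable or plain value: falsy results and exceptions give False."""
    def wrapped(*args: Any) -> bool:
        try:
            return bool(check(*args) if callable(check) else check)
        except Exception:
            return False
    return wrapped


def generate_spec_from_signature(constructor: Callable[..., Any]) -> Dict[str, Dict[str, Any]]:
    """Draft a model specification from a constructor signature."""
    spec: Dict[str, Dict[str, Any]] = {}
    for name, param in inspect.signature(constructor).parameters.items():
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue
        annotation = param.annotation
        if annotation is not inspect.Parameter.empty and isinstance(annotation, type):
            value_check = lambda value, expected=annotation: isinstance(value, expected)
        else:
            value_check = lambda value: True  # no usable annotation: accept anything
        spec[name] = {
            # An argument with a default value is optional.
            "optional_check": param.default is not inspect.Parameter.empty,
            "value_check": value_check,
        }
    return spec


def check_component(definition: Mapping[str, Any], spec: Mapping[str, Mapping[str, Any]]) -> List[str]:
    """Return one message per attribute that violates the specification."""
    errors = []
    for name, rules in spec.items():
        if name not in definition:
            if not as_check(rules["optional_check"])(definition):
                errors.append(f"missing required attribute: {name}")
        elif not as_check(rules["value_check"])(definition[name]):
            errors.append(f"illegal value for attribute: {name}")
    return errors


class Replayer:  # hypothetical component, for illustration only
    def __init__(self, src: str, batch_size: int = 100) -> None:
        self.src, self.batch_size = src, batch_size


spec = generate_spec_from_signature(Replayer)
print(check_component({"src": "local"}, spec))
print(check_component({"batch_size": "many"}, spec))
```

One consequence of the booleanish rule is worth keeping in mind: a constructor such as int used directly as a value check rejects the value 0, because int(0) is falsy.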
Client code (need contributions)
Demonstration of features in every use case.
CLI, WSGI, worker, task, daemon, testing
CLI entrypoint
swh scanner --scanner-instance=docker scan .
SWH_SCANNER_INSTANCE=docker swh scanner scan .
API Server entrypoint
rpc-serve, WSGI app
Celery task entrypoint
Celery task code
Testing / REPL
Example test
Environment
The environment parameters comprise any dependency of the configuration system that is external to the code.
This includes: configuration directory, configuration file, environment variable and command-line parameters.
Configuration directory
SWH configuration directory:
SWH_CONFIG_HOME=$HOME/.config/swh
Configuration file
YAML file with the .yml extension containing only the configuration data.
Default if none is specified to the generic loading routine: $SWH_CONFIG/default.yml.
[opt=conffileid] a configuration file id corresponding to the basename of a configuration file (without extension).
-> [Q] then only from $SWH_CONFIG_HOME or have a register?
Core configuration file parameter
This feature is to be built into SWH core library.
Specify the path to the configuration file to use for a whole service:
path_part = path | file
Environment variable: SWH_CONFIG_<PATH_PART>
CLI option: swh --config-<path_part>
[rem] "path" is a more precise term than "file".
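A sketch of how an entrypoint could resolve the configuration file parameter, assuming path_part = path (so --config-path and SWH_CONFIG_PATH) and giving the CLI option precedence over the environment variable. It uses argparse from the standard library purely for illustration; the function name config_path is hypothetical.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def config_path(argv: Sequence[str], environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the configuration path from the CLI option, else from the envvar."""
    parser = argparse.ArgumentParser(prog="swh", add_help=False)
    parser.add_argument("--config-path", default=None)
    # Ignore every other argument: subcommands parse them later.
    args, _remaining = parser.parse_known_args(list(argv))
    if args.config_path is not None:
        return args.config_path
    return environ.get("SWH_CONFIG_PATH")
```

Returning None when neither source is set leaves the choice of a default file to the caller.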
Specific configuration parameters
A CLI option may be passed to specify an instance ID (only at 2nd level) when several alternatives are provided in the configuration.
Such option must be declared statically in CLI code.
Specify the instance configuration to use for a given component, using instance ID:
id_part = instance | id | iid | cid
SWH_<COMP>_<ID_PART>
--<comp>-<id_part>
[rem] Any variant containing "id" is more precise than simply "instance".
A CLI option may be passed to override an attribute in the configuration.
Such option must be declared statically in CLI code.
Specify any other predefined configuration option:
SWH_<COMP>_<OPTION>
--<comp>-<option>
[OOS] dynamic handling of any such options for any attribute, similar to what click permits.
Configuration priority
CLI has precedence over envvars.
Environment parameters have precedence over whole definitions (from file or code) and whole definitions have precedence over defaults, per-attribute.
This follows the principle that the particular takes precedence over the general.
CLI param > envvar param > CLI file > envvar file > default file > defaults literal
These precedence rules must be implemented in entrypoint client code, with help of the library loading API. Only part of them may be implemented, the minimum being accepting a whole definition through either code or envvar.
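The precedence chain above can be illustrated with collections.ChainMap, which searches its mappings from first to last. This sketch assumes each source has already been reduced to a flat attribute mapping for a single instance and that precedence applies per attribute; effective_attributes and its parameter names are hypothetical.

```python
from collections import ChainMap
from typing import Any, Dict, Mapping


def effective_attributes(
    cli_params: Mapping[str, Any],
    envvar_params: Mapping[str, Any],
    cli_file: Mapping[str, Any],
    envvar_file: Mapping[str, Any],
    default_file: Mapping[str, Any],
    defaults_literal: Mapping[str, Any],
) -> Dict[str, Any]:
    """Merge sources so that, per attribute, the leftmost source defining it wins."""
    return dict(
        ChainMap(
            dict(cli_params),
            dict(envvar_params),
            dict(cli_file),
            dict(envvar_file),
            dict(default_file),
            dict(defaults_literal),
        )
    )
```

An entrypoint implementing only part of the chain can pass empty mappings for the sources it does not support.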
Example environment specifications (need contributions)
Using this objstorage replayer configuration file
CLI usage:
Specify no instance, use default instance config:
swh objstorage replayer
Specify instance:
swh objstorage replayer --from-instance default
Specify nested instances (opt=subinst):
swh objstorage replayer --src local --dst s3
swh objstorage replayer --src local --dst s3 --journal-client docker
CLI options to be defined statically.
Other proposals
@douardda's proposal: on-the-fly generation of instance config via syntactic sugar
swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3
[OOS] @tenma's proposal: dynamic handling of cli options wrt schema
swh objstorage replayer --objstorage-replayer.src=objstorage.local --objstorage-replayer.journal-client=journal-client.default
- generic, verbose, arbitrary attribute setting
Limitations (need contributions)
Depends on chosen functionalities.
OOS, rejected ideas (need contributions)
Depends on chosen functionalities.
Impacts (need contributions)
Implementation plan: library, ops code, tests, prod (need contributions)
(Proposition)
Prepare for easy switch and rollback by creating configuration copies conforming to the new system, and code conforming to the new system in separate branches.
Synthesis of the meeting 2020-11-24
Participants: tenma, douardda, olasd, ardumont
Language:
APIs:
Implementation:
references to constructor + documentation_builder (@olasd)
Environment:
Notes about implementation:
Replace reference by instance in definition (duplication?)
qid could be inserted in the definition as key, and added to instance register; that would make handling more regular
Still open:
-> allow polymorphism, type and callable are associated
-> factories reimpl single dispatch which is builtin in classes
Conclusion: