https://forge.softwareheritage.org/T1410
Better config system that does not rely on implicit configurations.
tenma Aren't production and docker environments the same?
douardda nope since docker makes use of entrypoint scripts whereas prod uses systemd unit files, but they are pretty similar in most aspects.
douardda not sure what this use case really is about; can it be seen as part of the "cli tools" below? Personally I rarely need more than a
s = get_storage(cls='memory')
in a shell, so…
Also I definitely do NOT want that typing s = get_storage()
in a shell or a script silently uses a config file somewhere (be it from the SWH_CONFIG_FILENAME or a default one).
ardumont Indeed, sounds reasonable. That means using the factory without any parameters should default to the in-memory implementations.
swh scheduler command to manage tasks, swh loader, swh lister, …)
douardda it's not strictly necessary to keep this "current features" section IMHO
tenma maybe, but it reminds us what we may or may not want
The current configuration system is a utility library implementing diverse strategies of config loading and parsing, including the following functionalities:
consistent config definition and processing across the SWH codebase
priority loading with defined mechanics:
directional config merge: merge specific definition with a default one
namespaced by distinct roles, so that one fully qualified config key can be used by different components, and a same unqualified key may exist for different roles, for example:
should have a straightforward API, possibly declarative, so that user code can plug config definitions in a single step (decorator, mixin/trait, factory attribute, etc.)
douardda The declarative part seems pretty attractive, but we currently often use configuration items as constructor arguments of classes; how does this fit with the declarative aspect?
tenma this. it couples config with constructor signatures, which leads to difficulties when renaming: rename the config element and all occurrences in constructors' signatures simultaneously. We may want this, but it is the first time I see this kind of coupling; I couldn't rename args in the BufferedProxyStorage constructor because of this.
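As an illustration of the "directional config merge" listed in the current features above, a minimal sketch (hypothetical helper, not the actual swh.core.config API): values from the specific config override the default one, recursing into nested dicts.

```python
# Minimal sketch of a directional merge: values from the specific config
# override the default one, recursing into nested dicts. Hypothetical
# helper, not the actual swh.core.config API.
def merge_configs(default, specific):
    merged = dict(default)
    for key, value in specific.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged

default = {"storage": {"cls": "memory"}, "batch_size": 100}
specific = {"storage": {"cls": "remote", "url": "http://storage:5002"}}
merged = merge_configs(default, specific)
```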
configuration as attributes of the target class to have proper doc/typing/validators, either flat or a Config object parametrized by class (either class object or config cls literal)
may or may not want: decouple config keys from component constructors arguments (easy if Config is in another object) so that config keys and class attributes can evolve independently
config is loaded on entrypoints (cli, celery task, gunicorn wsgi), not by each component (Loader, Lister, …)
ardumont possibly wrap instantiation of other components as the factories get_storage, get_objstorage, get_indexer_storage, get_journal_client, get_journal_writer, etc… do
File format: YAML
Default config:
Specific config:
Environment variables:
tenma I would prefer SWH_CONFIG_(FILE)PATH to SWH_CONFIG_FILENAME, to be clear that it is not a basename (we may want to), but I won't argue much.
ardumont yes, PATH sounds better than NAME (it's a detail that can be taken care of later when everything else is centralized)
swh.core.config
Either run with a switch or an envvar, else a hardcoded default path
See example from scanner CLI.
douardda this section probably needs to be moved somewhere else.
These are examples of config files as currently used (we focus here on the configuration itself, not on where these files are loaded from).
Most of the configuration files use the form:
<swhcomponent>:
cls: <select the implementation to use>
args:
<dict of args passed to the class constructor>
Also most (?) CLI tools for swh packages use the same pattern: the config file loading mechanism is handled in the main click group for that package (e.g. in swh.dataset.cli.data_cli_group for swh.dataset, or swh.storage.cli.storage for the storage, etc.)
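A hedged sketch of how this cls/args pattern is typically consumed: "cls" selects an implementation from a class map, and "args" become the constructor's keyword arguments. The registry and get_component below are illustrative stand-ins, not the actual swh factories (which live in e.g. swh.objstorage.factory.get_objstorage).

```python
# Illustrative sketch of consuming a "<swhcomponent>: cls/args" entry.
config = {
    "objstorage": {
        "cls": "pathslicing",
        "args": {"root": "/srv/softwareheritage/objects", "slicing": "0:5"},
    },
}

def get_component(section):
    # stand-in for a real "cls identifier -> Python class" map; here the
    # selected "class" is just dict, so the constructor echoes its args
    registry = {"pathslicing": dict}
    return registry[section["cls"]](**section["args"])

component = get_component(config["objstorage"])
```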
The generic config for an objstorage looks like:
objstorage:
# typ. used as swh.objstorage.factory.get_objstorage() kwargs
cls: pathslicing
args:
root: /srv/softwareheritage/objects
slicing: 0:5
In which we have the main config entry: how to access the underlying objstorage backend, then one (or more) configuration items for the objstorage RPC server (for which one needs to read the code to know which options are accepted).
The config is checked in swh.objstorage.api.server.make_app, with some validation in swh.objstorage.api.server.validate_config.
It also accepts a client_max_size top-level argument, which is the only "extra" config parameter supported (used in make_app).
When started via gunicorn:
swh.objstorage.api.server:make_app_from_configfile()
This function takes care of the presence of the SWH_CONFIG_FILENAME, loads the config file, validates it (validate_config), then calls make_app.
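The flow just described can be sketched as follows. All helpers here are stand-ins (a real implementation would read the YAML via swh.core.config and use the module's own validate_config/make_app); only the control flow is the point.

```python
import os

def load_yaml(path):
    # stand-in for reading and parsing the YAML file at `path`
    return {"objstorage": {"cls": "memory", "args": {}}}

def validate_config(config):
    # stand-in validation: raise on a missing main entry
    if "objstorage" not in config:
        raise ValueError("missing 'objstorage' entry")
    return config

def make_app(config):
    # stand-in for building the WSGI app from the validated config
    return ("app", config)

def make_app_from_configfile(environ=os.environ):
    path = environ.get("SWH_CONFIG_FILENAME")
    if path is None:
        raise EnvironmentError("SWH_CONFIG_FILENAME is not set")
    return make_app(validate_config(load_yaml(path)))

# hypothetical path, for illustration only
app = make_app_from_configfile({"SWH_CONFIG_FILENAME": "/etc/swh/objstorage.yml"})
```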
The objstorage replayer needs 2 objstorage configurations (src and dst) and a journal_client one, e.g.:
objstorage_src:
cls: remote
args:
url: http://storage0.euwest.azure.internal.softwareheritage.org:5003
max_retries: 5
pool_connections: 100
pool_maxsize: 200
objstorage_dst:
cls: remote
args:
url: http://objstorage:5003
journal_client:
cls: kafka
brokers:
- kafka1
- kafka2
- kafka3
group_id: test-content-replayer-x-change-me
The journal_client config item is directly used as argument of the swh.journal.client.get_journal_client() factory.
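That hand-off amounts to splatting the journal_client entry into the factory call. A sketch, with a stand-in factory that merely records its arguments (the real one is swh.journal.client.get_journal_client):

```python
# The journal_client entry from the config file becomes the factory's
# keyword arguments; get_journal_client here is a recording stand-in.
config = {
    "journal_client": {
        "cls": "kafka",
        "brokers": ["kafka1", "kafka2", "kafka3"],
        "group_id": "test-content-replayer-x-change-me",
    },
}

def get_journal_client(cls, **kwargs):
    return (cls, kwargs)  # a real factory would instantiate a client class

client = get_journal_client(**config["journal_client"])
```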
storage:
cls: local
args:
db: postgresql:///?service=swh-storage
objstorage:
cls: remote
args:
url: http://swh-objstorage:5003/
journal_writer:
cls: kafka
args:
brokers:
- kafka
prefix: swh.journal.objects
client_id: swh.storage.master
In which we have the same config system for the main underlying (storage) backend.
Besides the configuration of the underlying storage access, there can also be the configuration for the linked objstorage and journal_writer.
The former is passed directly to the swh.storage.objstorage.ObjStorage class, which is a thin layer above the real swh.objstorage.ObjStorage class (instantiated via get_objstorage()).
The latter is directly used as argument of the swh.storage.writer.JournalWriter class.
Also note that the instantiation of the objstorage and journal writer is done in each storage backend (it's not a generic behavior in get_storage()).
Same as the general case + inject the check_config flag from cli options if needed.
swh.storage.api.server:make_app_from_configfile()
This tool needs 2 entries: the destination storage (same config as above) + the journal client config (journal_client), like:
storage:
cls: remote
args:
url: http://storage:5002/
max_retries: 5
pool_connections: 100
pool_maxsize: 200
journal_client:
cls: kafka
brokers:
- kafka-broker:9094
group_id: test-graph-replayer-XX5
object_types:
- content
- skipped_content
The journal_client config item is directly used as argument of the swh.journal.client.get_journal_client() factory.
The backfiller uses a "low-level" config scheme, because it needs direct access to the database:
brokers:
- broker1
- ...
storage_dbconn: postgresql://db
prefix: swh.journal.objects
client_id: <UUID>
The config validation is performed within the JournalBackfiller class.
In swh.dataset, the loaded config is directly passed to GraphEdgeExporter via export_edges and sort_graph_nodes.
For the GraphEdgeExporter, these config values are actually the **kwargs of ParallelExporter.process, plus the remove_pull_requests flag extracted from the config dict in process_messages().
This ParallelExporter uses a single config entry, journal, the configuration of a journal client.
For sort_graph_nodes, the config values are:
sort_buffer_size
disk_buffer_dir
The main click group of swh.deposit does not load the configuration file.
However, it provides a swh.deposit.config.APIConfig class that loads the configuration from the SWH_CONFIG_FILENAME file.
The generic implementation expects a scheduler entry, and has default values for max_upload_size and checks.
The current config file for the deposit service in docker looks like:
scheduler:
# used by the deposit RPC server
cls: remote
args:
url: http://swh-scheduler:5008
# deposit server now writes to the metadata storage (storage)
storage_metadata:
cls: remote
args:
url: http://swh-storage:5002/
storage:
cls: remote
url: http://swh-storage:5002/
# needed ^ for the old migration script (we cannot remove it or init fails)
allowed_hosts:
# used in "production" django settings (server)
- '*'
private:
# used in "production" django settings (server)
secret_key: prod-in-docker
db:
host: swh-deposit-db
port: 5432
name: swh-deposit
user: postgres
password: testpassword
media_root: /tmp/swh-deposit/uploads
extraction_dir: "/tmp/swh-deposit/archive/"
# used by swh.deposit.api.private.deposit_read.APIReadArchives()
douardda I'm not sure how all these config entries are used, and by which piece of code.
ardumont clarified the parts not explained, dropped the obsolete ones
ardumont by cleaning up, i saw a discrepancy about the storage_metadata key, fixed.
ardumont it's one entangled configuration file used by all deposit modules: the api, the "private" api and the workers, each using a subset combination of those… To actually see what's used by what now, better look at the production configuration instead.
The swh.deposit.cli.client clis do not explicitly implement configuration loading from a file; instead, every configuration option is given as a cli option.
However, some classes instantiated from there do support loading a config file from the SWH_CONFIG_FILENAME environment variable.
Config entries for a deposit client are:
url
auth (a dict with username and password entries)
ardumont "some classes instantiated from there do support loading" > True. But it's not used within that particular cli context.
ardumont That part is now covered with integration tests (no more mock) so modification on that part should be simpler
The swh.deposit.cli.admin.admin click group does implement the config file loading pattern (actually the loading itself is implemented in the setup_django_for() function).
This function loads the django configuration from swh.deposit.settings.<platform> (with <platform> in ["development", "production", "testing"]), and sets the SWH_CONFIG_FILENAME environment variable to the config_file argument given.
ardumont That's some not pretty stuff that will hopefully get simplified with this spec ;)
ardumont That part is now covered with tests so modification will be simpler as well
The deposit provides one celery worker task (CheckDepositTsk) which loads its configuration exclusively from SWH_CONFIG_FILENAME. The only config entry used is the deposit server connection information.
The deposit server uses the standard django configuration scheme, but the selected config module is managed by swh.deposit.config.setup_django_for().
A tricky thing is the swh.deposit.settings.production django settings module, since it does load the SWH_CONFIG_FILENAME config file (but NOT in the development nor testing flavors).
In production mode, it expects the configuration to have:
scheduler
private (credentials for the admin pages of the deposit)
allowed_hosts (optional)
storage
extraction_dir
douardda not sure I have all deposit config options/usages
ardumont in doubt, look at the puppet manifest configuration
ardumont all deposit usages are there
ardumont as far as my understanding of django goes, this is indeed the standard way of configuring django (I dropped the (?) part).
The main click group of swh.graph does load the config file, but it does not fall back to SWH_CONFIG_FILENAME if no config file is given as cli option argument.
Supported configuration values are declared/checked in the swh.graph.config module.
There is no main "graph" section or namespace in the config file, so all config entries are expected at the file's top level:
The main click group of swh.indexer does load the config file, but it does not fall back to SWH_CONFIG_FILENAME if no config file is given as cli option argument.
For the indexer storage, a standard swh.indexer.storage.get_indexer_storage() factory function is provided, and is generally called with arguments from the indexer_storage configuration entry.
The swh.indexer.cli.schedule command uses the config entries:
The swh.indexer.cli.journal_client command (listens to the journal to fire new indexing tasks) uses the config entries:
The connection to the kafka broker is handled only by command line option arguments.
When started using the swh indexer rpc-serve command, it expects a config file name as required argument. Configuration entries are:
When started as a WSGI app, the configuration is loaded from the SWH_CONFIG_FILENAME environment variable (in make_app_from_configfile).
The journal can be used from the producer side (e.g. a storage's journal writer) or the consumer side.
The swh.journal.client.get_journal_client(cls, **kwargs) factory function is generally used to get a journal client connection, with arguments taken directly from the journal_client (or journal) configuration entry.
The swh.journal.writer.get_journal_writer(cls, **kwargs) factory function is used to get a producer journal connection, with arguments taken directly from the journal_writer configuration entry (generally it's a subentry of the "main" storage config entry, as seen above in the storage config example).
Loaders are mostly celery workers. There is a cli tool to synchronously execute a loading.
When run as a celery worker task, the configuration loading mechanism is detailed in the scheduler section below.
When executed directly, via swh loader run, the loader class is instantiated directly, so it is the responsibility of the latter to load a configuration file. This is normally done using the swh.core.config.load_from_envvar class method.
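A sketch of that load_from_envvar pattern: read SWH_CONFIG_FILENAME, parse the file it points to, and merge the result over a default config. The real helper lives in swh.core.config; read_yaml and the file path below are illustrative stand-ins.

```python
import os

def read_yaml(path):
    # stand-in for actually parsing the YAML file at `path`
    return {"storage": {"cls": "remote", "url": "http://storage:5002"}}

def load_from_envvar(default_config=None, environ=os.environ):
    # start from the defaults, then overlay the file's contents if the
    # environment variable points at a config file
    config = dict(default_config or {})
    path = environ.get("SWH_CONFIG_FILENAME")
    if path:
        config.update(read_yaml(path))
    return config

cfg = load_from_envvar(
    {"max_content_size": 100_000_000},
    environ={"SWH_CONFIG_FILENAME": "/etc/swh/loader.yml"},  # hypothetical path
)
```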
The main lister cli group does handle the loading of the config file, including falling back to SWH_CONFIG_FILENAME if not given as a command line argument.
Expected config options are:
The swh lister run command also instantiates a lister class. The base implementation supports the configuration options:
When used via a celery worker, standard celery worker config loading mechanism is used (see the scheduler below).
The scanner's cli implements its own strategy for finding the configuration file to load (including looking at the SWH_CONFIG_FILENAME variable). It only needs connection information for the public web API:
The scheduler consists of several parts.
Every piece of code that involves loading the celery stack of swh.scheduler, i.e. that imports the swh.scheduler.celery_backend.config module, will load the configuration file from SWH_CONFIG_FILENAME, in which at least a celery section is expected.
Celery workers are registered from the swh.workers pkg_resources entry point as well as the celery.task_modules configuration entry.
The main celery app singleton is then configured from a hardcoded default config dict merged with the celery configuration loaded from the configuration file.
Celery workers are started by the standard celery command (python -m celery worker) using swh.scheduler.celery_backend.config.app as the celery app, so the configuration loading mechanism is the default celery one described above, and the only way to specify the configuration file to load is via the SWH_CONFIG_FILENAME variable.
The main click group does implement the --config-file option, and uses the swh.core.config.read() function. So this main config file loading mechanism does not fall back to the SWH_CONFIG_FILENAME variable.
At this level, the only expected config entry is scheduler (connection to the underlying scheduler service).
Additional config entries for cli commands:
runner: celery
listener: celery
rpc-serve:
celery-monitor: celery
archive: swh.scheduler.backend_es.ElasticSearchBackend
The loading of the WSGI app normally uses the swh.scheduler.api.server.make_app_from_configfile() function, which takes care of loading the config file from SWH_CONFIG_FILENAME, with no fallback to a default path.
The loaded config is added to the main flask app object, so any flask-related config option is possible (at the configuration's top level).
The swh.search main cli group does implement the --config-file option (using swh.core.config.read() to load the file).
Config options by cli command:
initialize: search
journal-client objects: journal, search
rpc-serve: search (--config-file option). This configuration is then used to configure the flask-based RPC server.
The creation of the WSGI app is normally done using swh.search.api.server.make_app_from_configfile, which uses the SWH_CONFIG_FILENAME variable as the (only) way of setting the config file.
There is no support for the --config-file option in the main click group, but the cli only provides one command (rpc-serve), which does support this option.
The configuration file is loaded in swh.vault.api.server.make_app_from_configfile(), and the main RPC server is aiohttp based.
Django-based stuff.
No config file loading for now (?)
Config file loading function used
package | command | loading function | called from | config path
---|---|---|---|---
dataset | swh dataset ... | swh.core.config.read() | swh.dataset.cli.dataset_cli_group() | --config-file
deposit | swh deposit client | | |
deposit | swh deposit admin | config.load_named_config() | | --config-file, SWH_CONFIG_FILENAME
deposit | HTTP application | config.read_raw_config() | | SWH_CONFIG_FILENAME via DJANGO_SETTINGS_MODULE
graph | swh graph ... | swh.core.config.read() | swh.graph.cli.graph_cli_group() | --config-file
graph | WSGI app | ?? | ?? |
indexer | swh indexer ... | swh.core.config.read() | swh.indexer.cli.indexer_cli_group() | --config-file
indexer | swh indexer rpc-serve | swh.core.config.read() | swh.indexer.storage.api.server.load_and_check_config() | config-path (via s.i.cli.rpc_server())
indexer | WSGI app | swh.core.config.read() | swh.indexer.storage.api.server.load_and_check_config() | SWH_CONFIG_FILENAME (via s.i.s.a.s.make_app_from_configfile())
icinga | swh icinga_plugins ... | | |
lister | swh lister ... | swh.core.config.read() | swh.lister.cli.lister() | --config-file, SWH_CONFIG_FILENAME (via s.l.cli.lister())
lister | celery worker | config.load_from_envvar() | swh.lister.core.simple_lister.ListerBase() | SWH_CONFIG_FILENAME, <SWH_CONFIG_DIRECTORIES>/lister_<name>.<ext>
loader.package | swh loader ... | config.load_from_envvar() | swh.loader.package.loader.PackageLoader() | SWH_CONFIG_FILENAME
loader.core | swh loader ... | config.load_from_envvar() | swh.loader.core.loader.BaseLoader() | SWH_CONFIG_FILENAME
objstorage | swh objstorage ... | swh.core.config.read() | swh.objstorage.cli.objstorage_cli_group() | --config-file, SWH_CONFIG_FILENAME (via s.o.cli.objstorage_cli_group())
objstorage | WSGI app | swh.core.config.read() | swh.objstorage.api.server.load_and_check_config() | SWH_CONFIG_FILENAME (via s.o.api.server.make_app_from_configfile())
scanner | swh scanner ... | config.read_raw_config() | swh.scanner.cli.scanner() | --config-file, SWH_CONFIG_FILENAME, ~/.config/swh/global.yml
scheduler | swh scheduler ... | swh.core.config.read() | swh.scheduler.cli.cli() | --config-file
scheduler | WSGI app | swh.core.config.read() | swh.scheduler.api.server.load_and_check_config() | SWH_CONFIG_FILENAME
scheduler | celery worker | swh.scheduler.celery_backend.config | swh.core.config.load_named_config() | swh.scheduler.celery_backend
search | swh search ... | swh.core.config.read() | swh.search.cli.search_cli_group() | --config-file
search | swh search rpc-server | swh.core.config.read() | swh.search.api.server.load_and_check_config() | config-path
search | WSGI app | swh.core.config.read() | swh.search.api.server.load_and_check_config() | SWH_CONFIG_FILENAME (from s.s.api.server.make_app_from_configfile())
storage | swh storage ... | swh.core.config.read() | swh.storage.cli.storage() | --config-file, SWH_CONFIG_FILENAME (from s.s.cli.storage())
storage | WSGI app | swh.core.config.read() | swh.storage.api.server.load_and_check_config() | SWH_CONFIG_FILENAME (from s.s.api.server.make_app_from_configfile())
vault | swh rpc-serve | swh.core.config.read() or swh.core.config.load_named_config() | swh.vault.api.server.make_app_from_configfile() | --config-file, SWH_CONFIG_FILENAME (from s.v.api.server.make_app_from_configfile()), <swh.core.config.SWH_CONFIG_DIRECTORIES>/vault/server.<ext>
vault | celery worker | swh.core.config.read() or swh.core.config.load_named_config() | swh.vault.cookers.get_cooker() | SWH_CONFIG_FILENAME (from ...get_cooker()), <s.c.c.SWH_CONFIG_DIRECTORIES>/vault/cooker.<ext> (from get_cooker())
web | XXX | | |
web-client | None | | |
swh.<component>.config module (like what we declare today for tasks in swh.<component>.tasks)
The earlier described plan should respect the following:
I strongly like that this is going towards having fewer dicts thrown around in our code during object instantiation, in favor of strongly-typed objects. I also think this can be implemented in a DRY way by parsing the signature of the classes that are being instantiated.
I'm not sure that the backwards compatibility for production is such a strong requirement, even though it would definitely be nicer.
ardumont "strong requirement": it is not but please let's change one thing at a time. It will definitely help to not touch that part as a first step (especially if that breaks). And when we know the migration worked (implementation underneath changed, all deployed, everything ok as before), then we can change the config values incrementally with small changes.
This proposal doesn't seem to solve the concern of shared configuration across so-called components; let me use a concrete example:
- the WebClient (or whatever its name is) class in swh.web.client takes a token parameter for authentication, and a base_url for its setup
- the SwhFuse component uses a WebClient
- the SwhScanner component also uses a WebClient
- the swh fuse command takes a config file with its own parameters, as well as parameters for a web client
- the swh scanner command takes a config file with parameters for a web client; I expect most of its other parameters come from the CLI directly
In the proposal it's not clear to me how the following would happen:
- swh.web.client declares its configuration schema in swh.web.client.config
- swh.fuse and swh.scanner do the same in their own config modules
- the swh fuse and swh scanner entrypoints parse a configuration file; from the output, they instantiate a SwhFuse / SwhScanner.
Now that I've written all of this, I guess this could be solved by having a way for the swh.fuse.config and swh.scanner.config modules to declare that they're expecting a swh.web.client.config at the toplevel of their configuration file (rather than in a nested way like the get_storage factories currently work). Did you have a different idea?
tenma
This is my point about namespaces/roles/contexts in my initial suggestions (which we skimmed in the last discussion and did not include). It would be a way to both share and distinguish config keys. My proposition avoids any kind of hierarchy with arbitrary depth, like composition (owning another config) or inheritance (subclassing another config), and is more similar to naming systems that use tags/prefixes. But it would imply that, as a team, we define those namespaces/roles/contexts ahead of time (we should choose a definitive name for this; I prefer role).
Example:
- web-api/token: token in the context of the web API
- ext-service/token: token in the context of an external service
- both swh.web.client.config and swh.scanner.config could reference web-api/token
The injector in entrypoints, reading the config definitions of the components it instantiates, would see which config keys are shared, and could create a merged config definition from them (if we want to).
But this choice would require adapting config definitions that we can leave untouched for now.
ardumont my initial answer was lost… I replied something about configuration composition at the time, i.e. what olasd suggested, but I think we moved away from this anyways… (see the next paragraphs)
We may see the problem we are trying to solve as:
The following examples are not accurate but demonstrate syntax and functionalities.
Example config (global, mixing unrelated components) declarations:
storage:
uffizi:
cls: remote
url: ...
staging:
cls: remote
url: ...
local:
cls: pgsql
conninfo: ...
objstorage: <objstorage.foo>
ingestion:
cls: pipeline
steps:
- cls: filter
- cls: buffer
- <storage.uffizi>
objstorage:
foo:
cls: pathslicing
objstorage-replayer:
to_S3:
cls: ...
src: <objstorage.local>
dst: <objstorage.S3>
web-client:
swh:
token: ...
url: ...
This demonstrates:
ardumont Unclear on the scope of that configuration sample.
- Is it a sample configuration for one service serving one module (matching what we already have)?
- Or is it one global configuration file for all modules, thus defining all production combinations for all services? Then each service (storage, loader-git, etc…) picks the information it needs from that file?
Example CLI usage:
swh storage rpc-serve --storage=local
SWH_STORAGE=local swh storage rpc-serve
ardumont is that dedicated to one service? SWH_STORAGE for swh storage, SWH_SCHEDULER for swh scheduler, and so on and so forth…
or is the following possible as well? SWH_STORAGE=local swh loader run git https://... which would make the loader use the local storage within the scope of the loading?
tenma here local corresponds to an instance. The injector will instantiate this instance and fill in the references to it in the config.
Maybe the option/envvar would be more qualified, like X.Y.storage=instance.
We did not discuss it in detail. More about this idea with olasd or douardda.
ardumont i still do not get the scope of the sample ¯\_(ツ)_/¯ (i tried to clarify my questions ^)
3 levels (real names to be defined): package, instance, attribute (key:value)
Examples of existing "package" identifiers:
Anonymous instances in config file (for use as a value) is out of scope for now.
ardumont well, my understanding so far would mean that such anonymous instances (if any) should no longer exist and be attached to some other level, the "package" level then.
tenma yes, the idea was to have something regular without a shorthand anonymous form, at least for now. An instance must then be defined at the 2nd level, in a package. If many instances are needed, it may become tedious to declare each this way when they are mostly used once; in that case it would not be very difficult to add a syntax for anonymous instance definitions.
Package corresponds to an existing identifier, the one of the SWH Python package that defines the components to configure. It is used to resolve component types (search for type remote in storage) and to group components of the same service/package.
/!\ what about components that are not of the main type of the package? Would need inclusion into the map and factory.
Ex: non-storage component in storage package?
ardumont What about currently non-existing "package" identifiers, loaders, listers, …; is the following good enough?
- loader-npm
- lister-cran
- …
tenma right, for those we would need a top-level mapping I think… I do not know the whole type hierarchy… would the "package" be loader and the type "npm", or the package "loader-npm" and the type "task-?". To be defined…
The cls (alt: type) key is a type identifier and denotes what kind of object schema to use.
Other keys are instance keys conforming to the schema referenced by cls; more concretely, the class constructor arguments.
< and > are reference markers and denote a reference to a qualified instance or key. The choice of markers is not definitive. The YAML reference feature (syntax &/*) is rejected because we want to keep references in our model and thus not have them processed outside of our control.
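A hypothetical sketch of resolving these "<package.instance>" markers in an already-parsed config dict; the marker syntax follows the proposal above, and nothing here is an existing swh API.

```python
import re

# matches "<package.instance>" reference markers
REF = re.compile(r"^<([\w-]+)\.([\w-]+)>$")

def resolve(config, value):
    match = REF.match(value) if isinstance(value, str) else None
    if match is None:
        return value  # plain value, nothing to resolve
    package, instance = match.groups()
    target = config[package][instance]
    # resolve references nested inside the target as well
    return {k: resolve(config, v) for k, v in target.items()}

config = {
    "objstorage": {"local": {"cls": "pathslicing", "root": "/srv/objects"}},
    "objstorage-replayer": {"default": {"src": "<objstorage.local>"}},
}
replayer = {k: resolve(config, v)
            for k, v in config["objstorage-replayer"]["default"].items()}
```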
Packages are defined statically, most probably in core.config. They could be "discovered", but we may prefer whitelisting supported ones.
ardumont I'm under the impression that, for discoverability, we could plug that part into the "registry mechanism" already in place for the module "tasks". Adding a config key in the output of the register function there, or something.
swh.<module>.__init__ defines something like (extended with "config" here):

def register() -> Mapping[str, Any]:
    """Register the current worker module's definition (tasks, loader, config, ...)"""
    from .loader import SomeLoader
    return {
        "task_modules": [f"{__name__}.tasks"],
        "loader": SomeLoader,
        "config": [f"{__name__}.config"],  # <- or something
    }
Then in the setup.py of the module:
setup(
    ...
    entry_points="""
        [swh.workers]
        lister.bitbucket=swh.lister.bitbucket:register
        loader.someloader=swh.loader.some:register
    ...
And some other parts in swh.core is reading that code to actually declare the tasks.
Packages are associated with a factory function to instantiate instances as InstanceType(**keys), and optionally a (class_identifier_string : Python_class) map/register.
Example: Storage package
map = {
"remote": RemoteStorage,
"local": LocalStorage,
"buffer": BufferedStorage,
...
}
def get_storage(cls, **kwargs):
    # look the requested implementation up in the map; the remaining
    # config keys become the constructor's arguments
    return map[cls](**kwargs)
ardumont I gather it works for loaders, listers, indexers as well, with for example, git loaders:

map = {
    "remote": GitLoader,
    "disk": GitLoaderFromDisk,
    "archive": GitLoaderFromArchive,
}

def get_loader(*args, ...):
    ...

bitbucket listers:

map = {
    "full": FullBitBucketLister,
    "range": RangeBitBucketLister,
    "incremental": IncrementalBitBucketLister,
}

indexers:

map = {
    "mimetype": ContentMimetypePartition,
    "fossology-license": ContentFossologyLicensePartition,
    "origin-metadata": OriginMetadata,
    ...
}
Component constructor defines config keys, types and defaults.
No static definitions that would need maintaining, but documentation autogenerated from constructors.
Config file is necessary to specify a graph of instances.
Config file parameter specified as a string path, either absolute or relative. Current name: config-file.
Config file can be specified as a CLI option (--<config-file>) or an envvar (SWH_<CONFIG_FILE>).
Reference to an instance can be specified inline (ref syntax), as a CLI option (--<key>-instance), or an envvar (SWH_<key>_INSTANCE).
Other keys can be specified as a CLI option (--<key>) or an envvar (SWH_<key>).
ardumont It'd be good then to specify the merge policy now…
Example:
core.config.get_component_from_config(config=cfg_contents, type="storage", id="uffizi")
ardumont
What's cfg_contents in the sample, a global unified configuration of all the config combinations?
I'd be interested in a concrete code sample of the instantiation of a module with that code, to clarify a bit ;)
To be defined.
Parsing (YAML library), config resolving (reference processing), component resolving, instantiating.
Restriction to N levels eases implementation.
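Those steps (parse, resolve references, resolve the component, instantiate) can be sketched end to end. All names follow the proposal above and are hypothetical, not an existing swh API; factories here just record what they were called with.

```python
import re

# hypothetical registry: package name -> factory, standing in for the real
# get_storage / get_objstorage / ... factories
FACTORIES = {
    "web-client": lambda keys: ("web-client", keys),
    "scanner": lambda keys: ("scanner", keys),
}

REF = re.compile(r"^<([\w-]+)\.([\w-]+)>$")  # "<package.instance>" marker

def get_component_from_config(config, type, instance):
    keys = {}
    for key, value in config[type][instance].items():
        match = REF.match(value) if isinstance(value, str) else None
        if match:
            # a reference: recursively instantiate the target component
            value = get_component_from_config(config, *match.groups())
        keys[key] = value
    # a real factory would receive these as keyword arguments
    return FACTORIES[type](keys)

config = {
    "web-client": {"docker": {"token": "test-token"}},
    "scanner": {"docker": {"web-client": "<web-client.docker>"}},
}
scanner = get_component_from_config(config, "scanner", "docker")
```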
Note that the actual name of keys is completely up to bikeshedding; specifically, dashes versus underscores versus dots is completely up in the air.
~/.config/swh/default.yml
(default configuration path for user-facing cli tools)
web-client:
default:
# FIXME: single implementation => do we need a cls?
base-url: https://archive.softwareheritage.org/api/1/
token: foo-bar-baz
docker:
base-url: http://localhost:5080/api/1/
token: test-token
scanner:
default:
# FIXME: single implementation => do we need a cls?
web-client: <web-client.default>
scanner-param: foo
docker:
web-client: <web-client.docker>
scanner-param: bar
fuse:
default:
# FIXME: single implementation => do we need a cls?
web-client: <web-client.default>
swh scanner --scanner-instance=docker scan .
SWH_SCANNER_INSTANCE=docker swh scanner scan .
scanner_instance = "docker" # from cli flag or envvar
config_path = "~/.config/swh/default.yml" # from defaults of the cli endpoint
# basic yaml parse, no interpolation of recursive entries
config_dict = swh.config.read_config(config_path)
scanner = swh.config.get_component_from_config(config_dict, type="scanner", instance=scanner_instance)
# this would get `config_dict["scanner"]["docker"]`, then notice that one of the values has the special syntax "<web-client.docker>".
# The entry would be replaced by a call to:
# web_client_instance = swh.config.get_component_from_config(config_dict, type="web-client", instance="docker")
# Finally, the scanner would be instantiated with:
# get_scanner(web_client=web_client_instance, scanner_param="bar")
scanner.scan(".")
swh fs mount ~/foo swh:1:rev:bar
fuse_instance = "default" # default value of cli flag / envvar
config_path = "~/.config/swh/default.yml" # from defaults of the cli endpoint
# basic yaml parse, no interpolation of recursive entries
config_dict = swh.config.read_config(config_path)
fuse = swh.config.get_component_from_config(config_dict, type="fuse", instance=fuse_instance)
# this would get `config_dict["fuse"][fuse_instance]`, then notice that one of the values has the special syntax "<web-client.default>".
# The entry would be replaced by a call to:
# web_client_instance = swh.config.get_component_from_config(config_dict, type="web-client", instance="default")
# Finally, the fuse component would be instantiated with:
# get_fuse(web_client=web_client_instance)
fuse.mount("~/foo", "swh:1:rev:bar")
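The two walkthroughs above can be condensed into a runnable sketch. Everything named here is hypothetical and proposal-level: the get_component_from_config signature, the <type.instance> reference syntax, and the constructor registry; tuples stand in for real component objects.

```python
import re

# Hypothetical constructor registry; real components would register themselves.
REGISTRY = {
    "web-client": lambda attrs: ("WebClient", attrs),
    "scanner": lambda attrs: ("Scanner", attrs),
}

# "<type.instance>" reference syntax, as in the examples above
REF_RE = re.compile(r"^<(?P<type>[\w-]+)\.(?P<instance>[\w-]+)>$")

def get_component_from_config(config, type, instance):
    # Step 1: pick the instance definition, e.g. config["scanner"]["docker"]
    attrs = dict(config[type][instance])
    # Step 2: replace <type.instance> references by recursively built components
    for key, value in attrs.items():
        if isinstance(value, str) and (m := REF_RE.match(value)):
            attrs[key] = get_component_from_config(
                config, m.group("type"), m.group("instance"))
    # Step 3: call the registered constructor with the resolved attributes
    return REGISTRY[type](attrs)

config = {
    "web-client": {
        "docker": {"base-url": "http://localhost:5080/api/1/",
                   "token": "test-token"},
    },
    "scanner": {
        "docker": {"web-client": "<web-client.docker>", "scanner-param": "bar"},
    },
}

scanner = get_component_from_config(config, type="scanner", instance="docker")
```

The instance definition is resolved before instantiation, so the scanner constructor receives an already-built web-client component rather than the raw reference string.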
objstorage:
local:
cls: pathslicing
root: /srv/softwareheritage/objects
slicing: "0:2/2:5"
s3:
cls: s3
s3-param: foo
journal-client:
default:
# single implem: no cls needed
brokers:
- kafka
prefix: swh.journal.objects
client-param: blablabla
docker:
brokers:
- kafka.swh-dev.docker
...
# needed for second cli usecase
objstorage-replayer:
default:
src: <objstorage.local>
dst: <objstorage.s3>
journal-client: <journal-client.default>
Default behavior (single call to swh.core.config.get_component_from_config(config, 'objstorage-replayer', instance_name)
)
swh objstorage replayer --from-instance default
swh objstorage replayer
(uses config from default instance)
Nice to have (multiple, manual, calls to get_component_from_config
in the cli entry point)
swh objstorage replayer --src local --dst s3
swh objstorage replayer --src local --dst s3 --journal-client docker
@douardda's proposal: on-the-fly generation of instance config via syntactic sugar
swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3
@tenma's proposal: dynamic handling of cli options wrt schema
swh objstorage replayer --objstorage-replayer.src=objstorage.local --objstorage-replayer.journal-client=journal-client.default
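douardda's URL-style syntactic sugar could be implemented with the standard library alone. This is a sketch under the assumption that the URL scheme maps to cls and the query parameters to instance attributes; the function name is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qsl

def instance_config_from_url(url):
    """Turn an URL-style spec such as
    "pathslicing://?root=/srv/...&slicing=0:2/2:5" into an on-the-fly
    instance configuration dict: scheme -> cls, query params -> attributes."""
    parts = urlsplit(url)
    config = {"cls": parts.scheme}
    config.update(parse_qsl(parts.query))
    return config

cfg = instance_config_from_url(
    "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5")
# cfg == {"cls": "pathslicing", "root": "/srv/softwareheritage/objects",
#         "slicing": "0:2/2:5"}
```

The resulting dict has the same shape as a regular instance definition, so the rest of the instantiation machinery would not need to know the config came from the command line.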
/etc/softwareheritage/loader_git.yml
# component configuration
storage:
default:
cls: pipeline
steps:
- cls: buffer
min_batch_size:
content: 1000
content_bytes: 52428800
directory: 1000
revision: 1000
release: 1000
- cls: filter
- cls: remote
url: http://uffizi.internal.softwareheritage.org:5002/
# other proposal:
storage:
default:
cls: pipeline
steps:
- cls: buffer
min_batch_size:
content: 1000
content_bytes: 52428800
directory: 1000
revision: 1000
release: 1000
- cls: filter
- <storage.uffizi>
uffizi:
cls: remote
url: http://uffizi.internal.softwareheritage.org:5002/
# impossible currently:
storage:
filter:
cls: filter
# missing storage: argument
default:
cls: pipeline
steps:
- cls: buffer
min_batch_size:
content: 1000
content_bytes: 52428800
directory: 1000
revision: 1000
release: 1000
- <storage.filter> # get_storage(cls="filter") => fail
# ? OR {cls: filter, storage: <storage.uffizi>}
- <storage.uffizi>
uffizi:
cls: remote
url: http://uffizi.internal.softwareheritage.org:5002/
loader-git:
default:
cls: remote
storage: <storage.default>
max_content_size: 104857600
save_data: false
save_data_path: "/srv/storage/space/data/sharded_packfiles"
# Not a swh component
# no type id, only instance id
celery:
task_broker: amqp://...
task_queues:
- swh.loader.git.tasks.UpdateGitRepository
- swh.loader.git.tasks.LoadDiskGitRepository
- swh.loader.git.tasks.UncompressAndLoadDiskGitRepository
[Unit]
Description=Software Heritage Worker (loader_git)
After=network.target
[Service]
User=swhworker
Group=swhworker
Type=simple
# Celery
Environment=CONCURRENCY=6
Environment=MAX_TASKS_PER_CHILD=100
# Logging
Environment=SWH_LOG_TARGET=journal
Environment=LOGLEVEL=info
# Sentry
Environment=SWH_SENTRY_DSN=https://...
Environment=SWH_SENTRY_ENVIRONMENT=production
Environment=SWH_MAIN_PACKAGE=swh.loader.git
# Config
Environment=SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_git.yml
ExecStart=/usr/bin/python3 -m celery worker -n loader_git@${CELERY_HOSTNAME} --app=swh.scheduler.celery_backend.config.app --pool=prefork --events --concurrency=${CONCURRENCY} --maxtasksperchild=${MAX_TASKS_PER_CHILD} -Ofair --loglevel=${LOGLEVEL} --without-gossip --without-mingle --without-heartbeat
KillMode=process
KillSignal=SIGTERM
TimeoutStopSec=15m
OOMPolicy=kill
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
The celery cli loads the "app" object set in the cli: swh.scheduler.celery_backend.config.app
It loads SWH_CONFIG_FILENAME
into a dict (plausibly a singleton), then uses the celery
key of the config dict.
celery task code
@shared_task(name="foo.bar")
def load_git(url):
    config_dict = swh.core.config.load_from_envvar()
    loader = swh.core.config.get_component_from_config(
        config=config_dict,
        type="loader-git",
        id=os.environ.get("SWH_LOADER_GIT_INSTANCE", "default"),
    )
    return loader.load(url=url)
tenma
The instance aspect is both powerful and easy to understand.
Need to discuss the type and arguments aspects which are inconsistent (needed for some objects but not others).
package
and cls
are type identifiers. Keys are parameters; they can be defined as anything that is neither a type nor an instance.
uffizi:
cls: storage.remote
url: ...
local:
cls: storage.pgsql
conninfo: ...
objstorage: <foo>
ingestion:
cls: storage.pipeline
steps:
- cls: filter
- cls: buffer
- <uffizi>
foo:
cls: objstorage.pathslicing
More generic: it does not impose grouping of components of the same "package"; instance names may then need to be more descriptive.
Q: do we only define instances of swh service components in the config file?
Using the more generic notion of role/namespace vs the current specific notion of package offers the flexibility of having names that do not refer to a swh component but to any datastructure (with type
having to be fully qualified, e.g. model.model.Origin
), and of being shared by multiple instances (representing a datastructure that does not belong specifically to a given swh service package).
It would need a register of datastructures that can be used, like the other proposals.
For example, for core/model/graph/tools/other_library datastructures, do we want to be able to specify something along the lines of:
model:
orig1:
cls: Origin
url: ...
swhid1:
cls: SWHID
swhid: ...
node1:
cls: MerkleNode
data: ...
somelib:
datastruct1:
cls:
Participants: @tenma, @douardda, @olasd, @ardumont
Reporter: @tenma
Q = question
R = remark
OOS = Out of scope
Dates indicate the chronology of the report.
***Before starting to report the concepts tackled through the meeting, some points about terminology. The terminology was difficult to choose while writing this synthesis, so it is not completely consistent. This section tries to give a basis for discussion.
The term type identifier
(TID
for short) will be used in place of package
from now on for the 1st-level names.
package
represented both the SWH Python package name and the base type of SWH components available in this package, in my initial, partial view of the subject.
Now that it has been shown that there is no 1-to-1 mapping between SWH component names and SWH packages, the name package
is no longer accurate.
The more generic type identifier
reflects the flexibility we introduce with a top-level register of component types referenceable(?) in configuration.
The 2nd level maps instance identifier
(IID
) to instance definition mapping.
TIDs map to actual objects of some type, whereas IIDs exist only in the configuration system.
Need to choose name for every concept of the system.
***Here really starts the report written as of 2020-10-22.
Examples written/updated during the meeting were not copied here.
cls
attribute to instances specifies implementation/alternative/flavour to use.
It used to be required along with args
, because all configurable SWH components were polymorphic.
Components that have no such feature need no cls
attribute in their configuration.
R: An indirection layer such as a factory may be defined for such components in order to keep consistency and allow polymorphism if needed later. Alternatively, for polymorphic components, an abstract base class constructor would be better than a factory that needs to be known by user code, as it would abstract this indirection layer away.
One instance in a package/namespace can be labeled as default
.
It will be selected when an instance of a given component type is requested but no IID is given.
Q: could the default instance be implied when instantiating with no IID?
For both sakes of clarity and reuse, ad-hoc configuration objects can be defined at top level and be referenced. These are not SWH or external components.
Those definitions are composed of an identifier (equivalent to an IID) on the 1st level and a YAML object, possibly recursive, in further levels.
They are instantiated as schemaless dictionaries/lists.
Q: How to allow them in syntax and differentiate them from schemaful definitions?
Q: must such a singleton object be referenced in at least one place in the file it is part of, or would it otherwise be ignored?
A register of components allowed in configurations is to be implemented in core.config.
It will consist of a (TID : qualified_constructor)
mapping.
These entries will not be hardcoded in this mapping, but registered at import time from the package that defines the components.
qualified_constructor
must be Python absolute import syntax for the callable which is either a factory function or a class object.
It may be in quoted form (string) or actual object form. String form avoids imports, but no static check of existence can be performed. Given that registration is done in the package responsible for defining the component, object form is chosen.
Components that can be registered may be any SWH service component which is public (= has a Python object API).
In practice, only one main component per SWH service encapsulates all the configuration for the components used in the service:
Q: use object or quoted form for qualified_constructor
?
Q: what about these config objects: journal related like journal-writer, most under deposit, celery-related…
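The register described above can be sketched in a few lines. The function names and the in-module dict are assumptions of this proposal, not existing swh.core.config API; dict stands in for a real component factory.

```python
# Sketch of the (TID : qualified_constructor) register described above.
_COMPONENT_REGISTER = {}

def register_component(type_id, constructor):
    """Meant to be called at import time by the package defining the
    component. Object form is used: passing the constructor itself allows
    a static existence check, unlike a quoted import path."""
    if type_id in _COMPONENT_REGISTER:
        raise ValueError(f"component type {type_id!r} already registered")
    _COMPONENT_REGISTER[type_id] = constructor

def resolve_component(type_id):
    """Used by the instantiation routine to map a TID to its constructor."""
    return _COMPONENT_REGISTER[type_id]

# e.g. swh/storage/__init__.py could call
# register_component("storage", get_storage); dict stands in here.
register_component("storage", dict)
storage = resolve_component("storage")(cls="memory")
```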
A CLI option may be passed to specify an instance ID (only at 2nd level?) when several alternatives are provided in the configuration.
Such option must be declared statically in CLI code.
A CLI option may be passed to override an attribute in the configuration.
Such option must be declared statically in CLI code.
OOS: extend usage to any IID in instance mapping
OOS: dynamic handling of any such options for any attribute
Default behavior (single, manual, component instantiation)
swh objstorage replayer --from-instance default
swh objstorage replayer
(uses config from default instance)
Nice to have (multiple, manual, component instantiations)
swh objstorage replayer --src local --dst s3
swh objstorage replayer --src local --dst s3 --journal-client docker
swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3
Moved to the specification below.
Q: How to allow and handle both typed and untyped (ad-hoc) objects?
Q: how to identify what is an instance in the definitions? Will it require special handling wrt referencing mechanics (source and destination)?
Q: do we want to support reference to external config definitions or require autonomy? = always 1 file for a whole service, or composition of partial definitions?
In the former case, similar sections in multiple config files will need synchronized updates. One possibility is to give each SWH component its own config file and compose them on demand (using puppet?) into a standalone file for prod/tests. But it is not easy to have multiple instances this way. Fill a template file on demand for each service?
(Writing sections breadth-first: deeper at each iteration)
Notations:
[opt=id]: concept subject to acceptance or removal, with identifier for easier reference. Usage similar to a feature flag.
[alt]: alternative to any surrounding [alt] statement
[rem]: remark
[OOS]: out-of-scope remark or idea
[rej]: rejected remark or idea
[Q]: question to be answered
The configuration system only partially evolved with its use cases. Initial design decisions, applied uniformly to all use cases, turned out to be both too hard to reason about and too unstable for production, and too inflexible for CLI tools or testing.
For the purpose of this specification.
All SWH services and components.
Environments:
Configuration needs:
system service: systemd service vs docker + shell script
server: gunicorn vs Flask/aiohttp/Django devel server
CLI entrypoint
server application: app_from_configfile vs django config
worker:
component: constructor/factory
Configuration sources:
A.
B.
C.
Basis for discussing terms: terminology proposition
Used in this specification:
TID = type ID
IID = instance ID
AID = attribute ID
QID = qualified ID
ID = any of the above
ad-hoc object = singleton
"attribute ID" = path to the "key" of an attribute
storage:
default:
cls: buffer
min_batch_size:
content: 1000
storage: <storage.filtered-uffizi>
filtered-uffizi:
cls: filter
storage: <storage.uffizi>
uffizi:
cls: remote
url: http://uffizi.internal.softwareheritage.org:5002/
loader-git:
default:
cls: remote
storage: <storage.default>
save_data_path: "/srv/storage/space/data/sharded_packfiles"
random-component:
default:
cls: foo
celery: *celery
# Not a component: no type id, only instance id
_:
celery: &celery
task_broker: amqp://...
task_queues:
- swh.loader.git.tasks.UpdateGitRepository
- swh.loader.git.tasks.LoadDiskGitRepository
Based on YAML:
3 levels of depth: type, instance, attribute
Instance definitions are composed of an ID and a mapping to attributes.
1 instance <-> N attributes.
Component type definitions are composed of an ID and a mapping to instances.
1 type <-> N instances.
These instances are variants of the component: same type but different constructions.
Singletons are instances defined outside type definitions, so they live at top-level and have no type.
This model is syntactically complicated, here are alternatives to make it regular:
[alt=typePrefix] no type level, so only 2 levels; type and instance identifiers are merged as "type.instance".
[alt=typeAttr] move type identifiers to the attribute level, as a special attribute "type"
[alt=singletonType] use a dummy type for singletons: "singletons"
References can be made to an object defined somewhere else in the tree, using a qualified identifier.
Legal forms are defined to be from an attribute value to an instance identifier.
[opt=refkey] Legal forms also includes from attribute value attribute identifier. OOS
[opt=recattr] There may be recursion from attribute value to instance value definition. This allows anonymous definition of instance object in an attribute. OOS
WARNING: hopefully consistent grammar mixup. May be offending to purists.
Some definitions have alternatives noted with |=
.
ID ~= PCRE([A-Za-z0-9_-]+) # Could be stricter, e.g. snake_case
ID = TID | IID | AID
QID = (TID ".")? IID ("." AID)* # opt: refkey, skey
|= TID "." IID
ref = "<" QID ">"
attribute_value = YAML_object | ref
|= YAML_object | ref | attributes # opt: reckey
attributes = YAML_dict(AID, attribute_value)
instances = YAML_dict(IID, attributes)
singleton = YAML_object # no opt: sref, skey
|= YAML_dict(AID, attribute_value)
config_tree = YAML_dict(ID, YAML_dict) # loose typing
|= YAML_dict((TID, instances) | (IID, singleton))
Alternative definition of identifiers (always qualified):
ID ~= /[A-Za-z0-9_-]+/
TID = ID
IID = (TID ".")? ID
AID = IID "." ID+
QID = TID | IID | AID
Identifier is abbreviated ID.
Type ID is abbreviated TID.
Singleton ID is equivalent to Instance ID, abbreviated IID.
(Attribute) Key is identified by Attribute ID, abbreviated AID.
Qualified ID is abbreviated QID.
QID is a sequence of IDs of the form (TID, IID) for component instances or (IID) for singletons. Its string form joins the fields with ".".
[opt=refkey] May have form (TID, IID, AID) to reference component instance attributes. Useful either to reference another attribute of current instance, or any other attribute, except those defined in singletons.
[opt=reckey] May have form (TID, IID, AID*) to reference recursive component instance attributes.
[opt=skey] May have form (IID, AID*) to reference singleton attributes (sequence of AID because recursive).
[opt=sref] singleton attributes may reference any attribute.
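The core grammar above (without the [opt] extensions) is small enough to sketch as regexes. The character class and the "type.instance or bare instance" QID shape follow the spec; helper names are illustrative.

```python
import re

ID = r"[A-Za-z0-9_-]+"
# QID: "type.instance" for component instances, bare "instance" for singletons
QID_RE = re.compile(rf"^(?:(?P<tid>{ID})\.)?(?P<iid>{ID})$")
# ref: a QID enclosed in chevrons, used as an attribute value
REF_RE = re.compile(rf"^<({ID}(?:\.{ID})?)>$")

def parse_qid(text):
    """Split a QID into (TID, IID); TID is None for singletons."""
    m = QID_RE.match(text)
    if m is None:
        raise ValueError(f"invalid QID: {text!r}")
    return (m.group("tid"), m.group("iid"))

qid = parse_qid("storage.uffizi")    # ("storage", "uffizi")
singleton = parse_qid("celery")      # (None, "celery")
is_ref = bool(REF_RE.match("<storage.uffizi>"))
```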
Attribute is a (key, value) pair whose set forms an instance dictionary.
Attribute value is either a YAML object or a reference.
[opt=recattr] Attribute value may also be an instance dictionary.
Attribute level is any level under instance level, recursive or not.
A reference is syntactically defined as a qualified identifier enclosed in chevrons. Its source is an attribute value and its target is the object identified by the QID it contains.
The reference is deleted when it is resolved by the reference resolution routine.
Python type of a component to be instantiated and configured.
It is referred to indirectly through a TID in a configuration definition, and through a component constructor in the component type register.
Specific instantiation of a component, distinguished from the others by the set of attributes used to initialize it.
All identified instances of a type must be specified in the instance level of a configuration definition.
[opt=deftinst] An instance identified as "default" is instantiated if a TID but no IID is provided to the instantiation routine.
[opt=subinst] Instances may be referenced in an attribute value and be recognized as instance declarations; i.e. be instantiated and initialized, not just passed as is to the constructor.
[opt=anoninst] Anonymous instances may be defined in an attribute value, and be recognized as instance declarations.
-> [Q] that would need a type declaration, which is not yet handled in the spec, to identify it as an instance and be able to instantiate it. One inflexible option is to have the parent AID, being the child IID, be a TID, thus restricting its name to a known type ID. The other option is specifying the TID in a dedicated child attribute with a name similar to type
or TID
.
Singleton objects are syntactically similar to instances.
Unless otherwise stated, the same rules apply.
They do not correspond to a predefined type, so they have no schema or attached semantics.
They are instantiated as a dict tree.
The component type register, abbreviated register, is a (TID, qualified_constructor)
mapping, defined in the configuration library.
It is used by the component resolution routine to resolve type identifiers to Python type constructors.
Entries in this mapping are to be registered through the component registration library routine. This registration may happen anywhere provided it is executed at loading/import time. It is advised to register the component in the package that defines it.
qualified_constructor
must be Python absolute import syntax for the object creating callable, which is either a factory function or a class.
It may be defined:
[alt] in quoted form (string).
[alt] in class object form.
[rem] String form avoids imports, but no static check of existence can be performed. If registration is done in the package responsible for defining the component, object form is preferable.
Components that can be registered may be any SWH service component, SWH support component or external component, which is public (= has a Python object API).
In practice, only one main component per SWH service encapsulates all the configuration for the components used in the service:
[Q] what about these config objects: journal related like journal-writer, most under deposit, celery-related…
This section is informational.
A component type may have multiple implementations.
There is no specific support for it in this system, but as this concept may appear in configuration, related considerations may be worth noting.
[rem] The component type of an instance may be abstract, in which case a concrete type must be determined by the component constructor.
A specific attribute of instances specifies implementation or flavour to use. It is commonly identified as cls
, but could be impl
or flavor
.
It used to be required along with args
, which is now deprecated, because all configurable SWH components were polymorphic.
Components that have no such feature need no cls
attribute in their configuration.
Alternatively, polymorphic components may be instantiated without cls
, in which case a default implementation will be used.
[rem] An indirection layer such as a factory may be defined for monomorphic components in order to keep consistency and allow polymorphism if needed later. Alternatively, for all components, an abstract base class constructor would be better than a factory that is not derivable from the component type by user code, as it would abstract this indirection layer away.
Instantiating is the process through which a concrete object is constructed from a model and data describing its (initial) state. In the context of this system, a Python object is created through calling its constructor with the set of attributes associated to a particular instance in a configuration definition.
The input is a QID identifying an instance and a configuration tree (dictionary) containing the instance and its dependencies (reference targets).
The output is a component instance.
The process is composed of the following steps in order.
[opt=subinst] Identifying instance definitions requires a TID/IID.
[Q] As parent (ID: {instance attrs}
) or child ({ID: ..., instance attrs}
)?
[opt=deftinst] An instance identified as "default" is instantiated if a TID but no IID is provided to the instantiation routine.
Instances must be instantiated only once and used at each reference source.
(Validation/Conversion/Interpretation)
This section is informational.
Interpretation of attributes beyond stated above is out of scope and left to the component constructors to do.
Standard Python typing available in constructors may be used as the basis for the validation of configuration data.
Validity of structure, value and existence may be checked.
Conversions may also be performed.
[opt=validate] The library provides generic validation primitives and a validation routine based on a data model specification object.
(Loading/Defaults/Merging)
Loading is the process of fetching data from a storage medium into a memory space which is easily accessible to the processing system. In the context of this system, this data is then read and converted into a Python object.
The loading source may be: an I/O file abstraction (whatever its backing source), an operating system path to such a file abstraction, or such a path resolvable from an environment variable or a configuration file ID.
Only a Python dictionary is accepted as the holder of this data once loaded. A default configuration definition, either as a dictionary literal or a loaded configuration, can be specified, in which case every attribute absent from the loaded configuration will be set to the default one.
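The defaults semantics stated above can be sketched as a recursive merge. This is a sketch of the stated behavior only, assuming nested dicts and "loaded wins over default"; it is not the existing merge_configs routine.

```python
def merge_with_defaults(config, defaults):
    """Recursively fill attributes absent from the loaded configuration
    with the corresponding defaults; loaded values win on conflict."""
    merged = dict(defaults)
    for key, value in config.items():
        if (key in merged and isinstance(merged[key], dict)
                and isinstance(value, dict)):
            merged[key] = merge_with_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged

defaults = {"storage": {"default": {"cls": "memory"}}, "log-level": "info"}
loaded = {"storage": {"default": {"cls": "remote",
                                  "url": "http://uffizi:5002/"}}}
cfg = merge_with_defaults(loaded, defaults)
# cfg["log-level"] == "info"; cfg["storage"]["default"]["cls"] == "remote"
```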
Library should be imported as config
everywhere for clarity and uniformity (e.g. import swh.core.config as config
or from swh.core import config
).
[rem] Existing routine merge_configs
should be moved to another module as merge_dicts
.
Configuration object: mapping
WARNING:
In the following examples, names subject to change.
Code is inspired by Python, but abstracted to focus on typing.
DeriveType
denotes simply a type derived from an existing one, with no consideration of compatibility with the base type or any other.
[rem] Should choose term among load
, read
, from
, by
, config
Example names: read_config
, load_envvar
Config = DeriveType(Mapping) # Tree. Allow only mapping in config definition top-level
ConfigFileID = DeriveType(str) # opt=fileid
Envvar = DeriveType(str)
File = io.IOBase
Path = os.PathLike
load: (Union[File,Path,Envvar,ConfigFileID], defaults:Config?) -> (Config)
load_from_file: (File, defaults:Config?) -> (Config)
load_from_path: (Path, defaults:Config?) -> (Config)
load_from_envvar: (Envvar, defaults:Config?) -> (Config)
load_from_name: (ConfigFileID, defaults:Config?) -> (Config) # opt=fileid
no default configs
Loads as a YAML tree and converts it to a recursive Python mapping.
[opt=fileid] may use ID to reference files independently of their path or extension, in loading mechanism. This is sugar that existed but may be no longer wanted. OOS
[Q] Where to check for loadable path? In loading routines or user code? May duplicate behavior.
[Q] Should envvar be hardcoded in library or default? Same for default path.
OOS: every function but create component
[rem] Should choose amongst get
, read
, from_config
, instantiate
, component
, instance
, iid
.
Example names: get_component_from_config
, instantiate_from_config
, create_component
, read_instance
, get_from_id
TID = DeriveType(str)
IID = DeriveType(str)
QID = (TID, IID) # Simplified form, may be Sequence(ID)
Component = DeriveType(type)
ComponentConstructor = DeriveType(Callable) # Either type or function
InstanceConfig = DeriveType(Config)
create_component: (Config, QID) -> (Component)
create_component: (InstanceConfig, TID) -> (Component)
Returns an instantiated component identified by QID.
Uses get_obj
, resolve_references
, resolve_component
, instantiate_component
.
get_obj: (Config, QID) -> (Config)
get_instance: (Config, QID) -> (InstanceConfig)
Returns a config object (subtree) of the config identified by the config ID. May be used to get either all instances or one instance, depending on whether the config ID has an instance ID part.
resolve_references: (Config) -> (Config)
resolve_reference: (Config, QID) -> (Config)
Replaces reference source with the object identified by reference target.
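A possible tree-walk implementation of resolve_references, sketched under the assumptions that QIDs are dotted paths resolvable from the tree root and that there are no reference cycles; names follow the signatures above but are not existing API.

```python
import re

# reference source: "<QID>" attribute value
REF_RE = re.compile(r"^<([\w.-]+)>$")

def resolve_references(config):
    """Walk the configuration tree, replacing every reference source by
    the object its QID points to."""
    def lookup(qid):
        node = config
        for part in qid.split("."):
            node = node[part]
        return resolve(node)  # targets may themselves contain references

    def resolve(node):
        if isinstance(node, str):
            m = REF_RE.match(node)
            return lookup(m.group(1)) if m else node
        if isinstance(node, dict):
            return {key: resolve(value) for key, value in node.items()}
        if isinstance(node, list):
            return [resolve(item) for item in node]
        return node

    return resolve(config)

cfg = resolve_references({
    "storage": {
        "uffizi": {"cls": "remote", "url": "http://uffizi:5002/"},
        "default": {"cls": "filter", "storage": "<storage.uffizi>"},
    },
})
```

A production version would additionally detect cycles and ensure each instance is materialized only once, per the instantiation rules above.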
find_instances: (InstanceConfig) -> (Set(InstanceConfig))
[opt=subinst] Finds all instance definitions nested in this instance definition and returns a list of them.
[Q] How to identify instances?
[Q] Should it recurse into nested definitions or just one level?
resolve_component: (TID) -> (ComponentConstructor)
Looks up the core.config register to get the constructor from the TID.
instantiate_component: (InstanceConfig, ComponentConstructor) -> (Component)
Instantiates a component using the given constructor and an instance configuration mapping.
[opt=validate]
This section proposes a framework for validating instance definitions in a fairly lightweight and flexible way, for use by component constructors or injectors.
check: (Config) -> (Boolean)
check_definitions: (Config) -> (Boolean)
check_component: (InstanceConfig, ModelSpec) -> (Boolean)
generate_spec_from_signature: (ComponentConstructor) -> (ModelSpec)
check
: validate both language and instances.
check_definitions
: validate whole definition against language spec.
check_component
: validate instance definition against component spec.
This is a template function which is parametrized by user-specified spec.
AttrKey ~= String("[A-Za-z0-9_\-]+")
AttrVal = YAML_object
# Path in the instance configuration dictionary
Path ~= String("([A-Za-z0-9_\-]+/)+")
# Wrapper to convert falsey values or exceptions to False, otherwise True
ensure_boolean: Booleanish -> Boolean
# Generic and context-sensitive signatures for flexibility
value_check: ((AttrVal) | (AttrVal, InstanceConfig)) -> Booleanish
# If not optional existence check should succeed, else not performed.
optional_check: (AttrVal, InstanceConfig) -> Booleanish) | Booleanish
# Checks whether attr exists at one of given paths, or anywhere if no path.
# No reason to have user customise existence check.
existence_check: (AttrVal, Set(Path), InstanceConfig) -> Boolean
# Here is the model specification
# Kwargs: best I found for a typed mapping where every item is optional
AttrProperties = Kwargs(value_check, optional_check, Set(Path))
# None for no checks on attribute
ModelSpec = Mapping(AttrKey, AttrProperties | None)
check_component
verifies that all properties of every attribute hold in the instance definition, based on a user-defined model specification. The model specification can leverage primitive check functions and user-defined check functions. Supported checks are value checks and existence-in-tree-structure checks, which are distinguished for expressiveness.
The model specification lists each (unqualified) attribute that may exist in the configuration definition, along with attribute properties that must hold.
An attribute may or may not be optional, meaning whether validation should fail on absence, based on the boolean value of the optional_check
. optional_check
may be a callable that must determine whether the attribute is optional based on the configuration context and return a booleanish value, or be a booleanish value. It is run in a wrapper which converts falsey values or exceptions to False
, and anything else to True
. A required attribute is checked for existence at one of a given set of paths in the tree, or anywhere if no path is given. An optional attribute is then not checked for existence, but still for legal value.
The value check may be any callable that either accepts a single value, or a value and the configuration context (instance definition), and return a booleanish value, handled as above. This makes it possible to use many existing functions or object constructors to do the validation, e.g. int
, re.match
, isinstance(Protocol)
or a function verifying a relation to another attribute in the definition is valid.
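The check machinery described above can be sketched as follows. This is a simplification, not the proposed final API: the path-set existence check is reduced to top-level presence, and optional_check callables receive only the instance config.

```python
def run_check(check, *args):
    """Wrapper converting falsey results or exceptions to False,
    anything else to True. Accepts a callable or a plain value."""
    try:
        result = check(*args) if callable(check) else check
        return bool(result)
    except Exception:
        return False

def check_component(instance_config, model_spec):
    """Validate an instance definition against a model spec mapping each
    attribute key to its properties (or None for "no checks")."""
    for key, props in model_spec.items():
        props = props or {}
        optional = run_check(props.get("optional_check", False),
                             instance_config)
        if key not in instance_config:
            if not optional:
                return False        # required attribute is missing
            continue                # optional and absent: skip value check
        value_check = props.get("value_check")
        if value_check is not None:
            if not run_check(value_check, instance_config[key]):
                return False
    return True

# Hypothetical spec for a "storage" instance definition
spec = {
    "cls": {"value_check": lambda v: isinstance(v, str)},
    "url": {"value_check": lambda v: isinstance(v, str),
            "optional_check": True},
}
ok = check_component({"cls": "remote", "url": "http://uffizi:5002/"}, spec)
```

Because value checks are plain callables wrapped by run_check, existing functions such as int or re.match can be plugged in directly, as the text suggests.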
generate_spec_from_signature
: generate a model specification where annotations are used as value_check
functions wherever possible, arguments are optional or not depending on the existence of a default value, and the path set contains only the tree root. A mapping from types to validators is used to validate most common types; others will only be checked by isinstance
. This is a helper function to generate a spec draft ahead of time, which must be corrected and stored alongside the corresponding constructor, as it cannot be guaranteed to function properly in all cases.
Components with multiple implementations:
Operations based on function signatures, like validation but also instantiation, need a way to map the cls
argument to the concrete type and constructor signature.
A solution to automatically use the right constructor is to implement single dispatch and overloading on the main constructor. Every method may still call the main one, but must have a signature compatible with that of the concrete class constructor, based on cls
.
See also "Library/Type implementations" remark about abstract constructors.
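A minimal sketch of dispatching the main constructor on the cls attribute. The class names and the in-module mapping are illustrative assumptions modeled on the get_storage(cls=...) idiom mentioned earlier, not the real swh.storage API.

```python
class RemoteStorage:
    def __init__(self, url):
        self.url = url

class MemoryStorage:
    def __init__(self):
        self.objects = {}

# cls value -> concrete constructor
_STORAGE_IMPLEMENTATIONS = {"remote": RemoteStorage, "memory": MemoryStorage}

def get_storage(cls="memory", **attrs):
    """Main constructor: dispatch on `cls` and pass the remaining attributes
    to the concrete constructor, whose signature must accept them."""
    try:
        constructor = _STORAGE_IMPLEMENTATIONS[cls]
    except KeyError:
        raise ValueError(f"unknown storage cls: {cls!r}") from None
    return constructor(**attrs)

s = get_storage(cls="remote", url="http://uffizi:5002/")
```

For signature-based validation, inspect.signature(_STORAGE_IMPLEMENTATIONS[cls]) yields the concrete signature to check the remaining configuration attributes against.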
Demonstration of features in every use case.
CLI, WSGI, worker, task, daemon, testing
swh scanner --scanner-instance=docker scan .
SWH_SCANNER_INSTANCE=docker swh scanner scan .
import swh.core.config as config
scanner_instance = "docker" # from cli flag or envvar
config_path = "~/.config/swh/default.yml" # from cli flag or envvar or CLI default or core.config default
config_dict = config.load(config_path)
scanner = config.create_component(
config_dict,
config.QID(type="scanner", instance=scanner_instance)
)
scanner.scan(".")
rpc-serve, WSGI app
def make_app_from_configfile() -> StorageServerApp:  # Or any other module App
    global app_instance
    if not app_instance:
        config_dict = config.load_from_envvar()
        rpc_instance = os.environ.get("SWH_STORAGE_RPC_INSTANCE", "default")
        app_instance = config.create_component(
            config_dict,
            config.QID(type="storage-rpc", instance=rpc_instance)
        )
        if not check_component(app_instance, "storage-rpc"):
            raise ConfigurationError  # or something?
    return app_instance
ardumont Completed the snippet ^ (unsure about it)
Celery task code
@shared_task(name="foo.bar")
def load_git(url):
    config_dict = config.load_from_envvar()
    loader_instance = os.environ.get("SWH_LOADER_GIT_INSTANCE", "default")
    loader = config.create_component(
        config_dict,
        config.QID(type="loader-git", instance=loader_instance)
    )
    # alternative form:
    # config_dict.create_component(type="loader-git", instance=loader_instance)
    return loader.load(url=url)
ardumont we moved away from passing parameters to the load function. The url parameter is to be passed to the constructor of the loader (same goes for lister, etc…)
Example test
import swh.core.config as config

@pytest.fixture
def config_dict():
    return {...}

def test_config(config_dict):
    type_ID = "objstorage"
    instance_ID = "test_1"
    instance = config.create_component(
        config_dict,
        config.QID(type=type_ID, instance=instance_ID),
    )
    ...

@pytest.fixture
def config_path(datadir):
    return f"{datadir}/other.yml"

def test_config2(config_path):
    config_dict = config.load_from_path(config_path)
    type_ID = "objstorage"
    instance_ID = "test_1"
    instance = config.create_component(
        config_dict,
        config.QID(type=type_ID, instance=instance_ID),
    )
    ...
The environment parameters comprise every dependency of the configuration system that is external to the code.
This includes: configuration directory, configuration file, environment variables and command-line parameters.
SWH configuration directory: SWH_CONFIG_HOME=$HOME/.config/swh
YAML file with the .yml extension, containing only the configuration data.
Default if none is specified to the generic loading routine: $SWH_CONFIG_HOME/default.yml.
[opt=conffileid] a configuration file id corresponding to the basename of a configuration file (without extension).
-> [Q] then load only from $SWH_CONFIG_HOME, or have a register?
This feature is to be built into SWH core library.
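Resolving a conffileid against $SWH_CONFIG_HOME only (the first alternative in the question above) could look like this; resolve_config_path is an assumed name, not an existing swh.core.config function:

```python
# Hypothetical sketch: map a configuration file id (basename without
# extension) to a path under SWH_CONFIG_HOME, falling back to the
# default directory from the section above.
import os

def resolve_config_path(conffileid: str) -> str:
    config_home = os.environ.get(
        "SWH_CONFIG_HOME",
        os.path.join(os.path.expanduser("~"), ".config", "swh"),
    )
    return os.path.join(config_home, f"{conffileid}.yml")
```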
Specify the path to the configuration file to use for a whole service:
path_part = path
| file
Environment variable: SWH_CONFIG_<PATH_PART>
CLI option: swh --config-<path_part>
[rem] "path" is a more precise term than "file".
A CLI option may be passed to specify an instance ID (only at 2nd level) when several alternatives are provided in the configuration.
Such option must be declared statically in CLI code.
Specify the instance configuration to use for a given component, using instance ID:
id_part = instance
| id
| iid
| cid
SWH_<COMP>_<ID_PART>
--<comp>-<id_part>
[rem] Any variant containing "id" is more precise than simply "instance".
A CLI option may be passed to override an attribute in the configuration.
Such option must be declared statically in CLI code.
Specify any other predefined configuration option:
SWH_<COMP>_<OPTION>
--<comp>-<option>
[OOS] dynamic handling of any such options for any attribute, similar to what click permits.
CLI has precedence over envvars.
Environment parameters have precedence over whole definitions (from file or code) and whole definitions have precedence over defaults, per-attribute.
This follows the principle that the particular takes precedence over the general.
CLI param > envvar param > CLI file > envvar file > default file > defaults literal
These precedence rules must be implemented in entrypoint client code, with the help of the library loading API. Only part of the chain may be implemented, the minimum being accepting a whole definition through either code or an envvar.
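The per-attribute precedence chain can be sketched as a first-match lookup over sources ordered most-specific first (CLI param, envvar param, CLI file, envvar file, default file, then literal defaults); purely illustrative:

```python
# Sketch: resolve one attribute across an ordered list of config
# sources; the first source that defines the attribute wins,
# falling back to a literal default.
from typing import Any, Dict, Optional, Sequence

def resolve(attribute: str,
            sources: Sequence[Optional[Dict[str, Any]]],
            default: Any = None) -> Any:
    for source in sources:  # most specific (CLI) comes first
        if source is not None and attribute in source:
            return source[attribute]
    return default
```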
Using this objstorage replayer configuration file:
objstorage:
  local:
    cls: pathslicing
    root: /srv/softwareheritage/objects
    slicing: "0:2/2:5"
  s3:
    cls: s3
    s3-param: foo
journal-client:
  default:
    # single implem: no cls needed
    brokers:
      - kafka
    prefix: swh.journal.objects
    client-param: blablabla
  docker:
    brokers:
      - kafka.swh-dev.docker
    ...
objstorage-replayer:
  default:
    src: <objstorage.local>
    dst: <objstorage.s3>
    journal-client: <journal-client.default>
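The &lt;type.instance&gt; cross-references in the file above could be resolved with a small recursive helper; hypothetical, not part of swh.core.config:

```python
# Sketch: replace a ``<type.instance>`` reference string by the
# configuration dict it points to, recursing so that chained
# references and nested dicts are resolved too.
import re
from typing import Any, Dict

_REF = re.compile(r"^<(?P<type>[\w-]+)\.(?P<instance>[\w-]+)>$")

def resolve_refs(config: Dict[str, Any], value: Any) -> Any:
    if isinstance(value, str):
        match = _REF.match(value)
        if match:
            target = config[match["type"]][match["instance"]]
            return resolve_refs(config, target)
    if isinstance(value, dict):
        return {k: resolve_refs(config, v) for k, v in value.items()}
    return value
```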
CLI usage:
Specify no instance, use default instance config:
swh objstorage replayer
Specify instance:
swh objstorage replayer --from-instance default
Specify nested instances (opt=subinst):
swh objstorage replayer --src local --dst s3
swh objstorage replayer --src local --dst s3 --journal-client docker
CLI options to be defined statically.
@douardda's proposal: on-the-fly generation of instance config via syntactic sugar
swh objstorage replayer --src "pathslicing://?root=/srv/softwareheritage/objects&slicing=0:2/2:5" --dst s3
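This URL form could be turned into an on-the-fly instance config with standard urllib parsing; instance_config_from_url is a hypothetical helper name:

```python
# Sketch: parse ``cls://?key=value&...`` syntactic sugar into an
# instance configuration dict (scheme becomes ``cls``, query
# parameters become attributes).
from typing import Any, Dict
from urllib.parse import parse_qsl, urlsplit

def instance_config_from_url(url: str) -> Dict[str, Any]:
    parts = urlsplit(url)
    config: Dict[str, Any] = {"cls": parts.scheme}
    config.update(parse_qsl(parts.query))
    return config
```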
[OOS] @tenma's proposal: dynamic handling of cli options wrt schema
swh objstorage replayer --objstorage-replayer.src=objstorage.local --objstorage-replayer.journal-client=journal-client.default
- generic, verbose, arbitrary attribute setting
Depends on chosen functionalities.
Depends on chosen functionalities.
(Proposition)
Prepare for an easy switch and rollback by creating configuration copies that conform to the new system, and keeping the code conforming to the new system in separate branches.
Participants: tenma, douardda, olasd, ardumont
Language:
APIs:
Implementation:
Environment:
Notes about implementation:
Still open:
Conclusion: