Pulp3 Import/Export Design
Introduction/problem statement
There is a use-case for extracting repository-versions out of a running
instance of Pulp into a file that can be transferred to another Pulp
instance and imported. This is not the Pulp-to-Pulp sync case; the
assumption is that the receiving Pulp instance is network-isolated. This
design aims to describe the requirements, limitations, and a possible
implementation for this functionality.
Additional Documents
Definitions
- Upstream - the Pulp instance whose repository-version(s) we want
to export
- Downstream - the Pulp instance that will be importing those
repository-version(s)
- ModelResource - entity that understands how to map the metadata
for a specific Model owned/controlled by a plugin to an exportable
file-format (e.g., CSV or JSON) (see django-import-export)
- Exporter - A resource that exports content from Pulp for a
variety of different use cases
- PulpExporter - A kind-of Exporter, that is specifically used
to export data from an Upstream for consumption by a Downstream
- PulpExport - a specific instantiation/run of a PulpExporter
- Export file - tarfile containing database metadata and
content-artifacts for a (set of) repository-versions, generated
during execution of an Export
- PulpImporter - A resource that accepts an Upstream
PulpExporter export file, and manages the process of importing
the data and artifacts included
- PulpImport - a specific instantiation/run of a PulpImporter
- Repository-mapping - a configuration file that provides the
ability to map an Upstream repository, to a Downstream
repository, into which the Upstream’s repository-version should be
imported by a PulpImporter
- Import order - for complicated repository-types, managing
relationships requires that models be imported in order. Plugins are
responsible for specifying the import-order of the
ModelResources they own
Workflow
“First time” setup and processing
- Upstream user (UU) defines a PulpExporter for a set of repositories
- UU does a dry-run of an Export for that Exporter
- UU executes an Export for that Exporter
- UU hands the resulting export-file to a Downstream User (DU)
- DU defines a PulpImporter
- DU does a dry-run of an Import, using the provided
file, to generate a proposed repository-mapping between the Upstream
repositories referenced in the export-file, and available Downstream
repositories.
- DU modifies the output of the dry-run to correctly match Upstream
and Downstream repositories
- DU modifies the PulpImporter defined previously with the provided mapping
- DU executes an Import for that PulpImporter with the provided export-file
“Steady state”/regular processing
- UU executes an Export for that Exporter
- UU hands the resulting export-file to a Downstream User (DU)
- DU executes an Import for that Importer with the provided export-file
Requirements
- [DONE] Need to export Model info (i.e. data from the database)
- [DONE] Need to export artifacts (i.e. files from the filesystem)
- [DONE] Need to be able to handle multiple repository-versions at once
- On export, need to be able to specify two versions and export only
the ‘diff’ between them
- On import, need to be able to recognize a diff, and be able to
specify a ‘base version’ on top of which that diff should be played
- [DONE] A given artifact should only be in the export once, even if it
exists in multiple repositories
- [DONE] Need to be able to import into downstream repositories with
arbitrary names (because downstream cannot be assumed to be
identical to upstream)
- [DONE] Need to be able to handle any repository type
- Need to respond gracefully if a given repository type does not
support export/import
- [DONE] Need to minimize the amount of work for plugin authors (to reduce
implementation friction)
- Need to minimize the amount of work on the user (to reduce
acceptance friction)
- [DONE] Need to allow for complicated scenarios, while making the simple
“export version-N from repo-R and import it into the same repo
elsewhere” case simple
- Need to provide a “dry-run” capability for export that will produce
a description of what would be exported and a first-order
approximation of the required disk space
- Need to provide a “dry-run” capability on import in order to ensure
that the provided export file and repo-mapping “make sense”
- Optional: export takes only a repository-id, and produces an export
for the ‘latest’ repository-version
- [DONE] Need to provide a history to the user to see information about the
last time they exported
Existing documents/discussion
- Initial pulp issue 5096
- Current epic 6134 with design
meeting minutes for this issue
PulpExporter
PulpExporter is a type of Exporter that can create a PulpExport instance,
which knows how to create a file that can be imported back into Pulp.
Users can configure a PulpExporter with settings and view a history for
past Exports created by a PulpExporter.
A PulpExporter can rely on the last Export it created to know which
repository-version(s) were exported and, for incremental exports, use
those version(s) automatically as the base version(s).
Fields
- name - a unique string to identify the exporter
- path - the location on the filesystem of a DIRECTORY where export
files will be written
- repositories - list of repos whose current-versions will be exported
- last_export - [read-only] UUID of the last Export instantiated by the PulpExporter,
which contains the list of base repository-versions to export
against. If there is no last-export UUID, this is the first export
and a full export will happen.
History
A user can view the history for a particular exporter. This consists of
the list of PulpExports created by this PulpExporter. Exports include:
- date and time when the export occurred
- specific RepositoryVersions that were exported
- parameters that were used when exporting
- filename the export-file was written to
- SHA256 checksum of the resulting export-file
REST API (pulpcore)
API for Exporting
Pulpcore will provide a CRUDL endpoint at /pulp/api/v3/exporters/core/pulp/
for working with PulpExporters.
There will also be an endpoint for viewing the Exports instantiated
by/from a specific PulpExporter at /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/
.
API Specifics - working with PulpExporters
POST /pulp/api/v3/exporters/core/pulp/
- CREATE an Exporter
- name - human-readable identifier for this Exporter
- repositories - list of repositories whose current/latest
version should be exported
- path to the DIRECTORY where resulting tarfile will live
- Assumption: either you get the complete tarfile, or, if
Something Went Wrong, you get an error and no file
- Required with no default - if we let Pulp ‘guess’ a location, we risk
running Pulp itself out of storage
- Must be at/under one of the directories in the ALLOWED_EXPORT_PATHS
setting, or the create will fail
- [TBD] publications (optional, phase-2) to be exported
- [TBD] distributions (optional, phase-3) to be exported
GET /pulp/api/v3/exporters/core/pulp/
- LIST of all current Exporters
GET /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/
- READ specifics
about an Exporter
PUT /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/
- completely UPDATE
an Exporter
- name - human-readable identifier for this Exporter
- repositories - list of repositories whose current/latest
version should be exported, OR
- path - directory where resulting tarfile will live
- [TBD] publications (optional, phase-2) to be exported (?)
- [TBD] distributions (optional, phase-3) to be exported(?)
PATCH /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/
- partially UPDATE
an Exporter
DELETE /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/
- DELETE an
Exporter
API Specifics - working with Exports
POST /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/
- CREATE AND EXECUTE an Export (see the sketch at the end of this section)
- full - if true (the default), ignore any previous Exports for this
Exporter and generate a full export. If false, generate an
incremental export, using the last existing Export as the base-version.
- versions - list of repository-versions to export.
Must match the full set of Repositories being managed by the Exporter.
- start_versions - list of repository-versions to start from when doing an
incremental export. Must match the full set of Repositories being managed by the Exporter.
- chunk_size - specify a chunk-size for the export-file.
The export-file will be split into chunks of no more than chunk_size.
- [TBD] dry-run - if provided, produce output describing what would
have been produced, without actually writing anything to disk
GET /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/
- LIST
of all Exports for the specific PulpExporter
GET /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/<export-uuid>/
-
READ specifics about an Export
DELETE /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/<export-uuid>/
-
DELETE an Export record
There is no way to UPDATE an Export. It is an historical record.
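A minimal sketch of the create-and-execute call described above, for a full, chunked export. The host, credentials, exporter UUID, and the chunk_size value format are placeholders/assumptions.

```python
import requests

AUTH = ("admin", "password")                                           # placeholder
EXPORTER_HREF = "/pulp/api/v3/exporters/core/pulp/<exporter-uuid>/"    # placeholder

response = requests.post(
    "https://upstream.example.com" + EXPORTER_HREF + "exports/",
    auth=AUTH,
    json={
        "full": True,          # full export; full=False would make it incremental
        "chunk_size": "5GB",   # optional; value format is illustrative
    },
)
response.raise_for_status()
# The export runs server-side; the resulting tarfile (or chunks) lands in the
# Exporter's configured path on the Upstream filesystem.
```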
API for Importing
Pulpcore will provide a CRUD endpoint for PulpImporters at
/pulp/api/v3/importers/core/pulp/
.
There will also be an endpoint for viewing the import history of a
PulpImporter at /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/
.
API Specifics - working with PulpImporters
POST /pulp/api/v3/importers/core/pulp/
- CREATE a PulpImporter
- repository to import the repository version into
- What if we let the repo be created by the repo-version?
- Plugin authors would need to have modelresources for
their repository metadata
- How to handle incrementals?
- What if the repo already exists?
- What if the repo isn’t of the same type as the export?!?
(whether we specify it or not)
- Wait, multiple repos in export - now what?
- OR,
- repository-mapping - configuration-file that maps
upstream-repositories to the downstream-repository they should
import-into
GET /pulp/api/v3/importers/core/pulp/
- LIST of all current Importers
GET /pulp/api/v3/importers/core/pulp/<importer-uuid>/
- READ specifics
about an Importer
PATCH /pulp/api/v3/importers/core/pulp/<importer-uuid>/
- partial
UPDATE an Importer (subset of fields being updated)
PUT /pulp/api/v3/importers/core/pulp/<importer-uuid>/
- completely
UPDATE an Importer (all fields required)
DELETE /pulp/api/v3/importers/core/pulp/<importer-uuid>/
- DELETE an
Importer
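A minimal sketch of creating a PulpImporter that carries a repository-mapping, per the design above. The field name repo_mapping and its shape (Upstream repository name mapped to Downstream repository name), as well as the name field, host, and credentials, are assumptions for illustration.

```python
import requests

AUTH = ("admin", "password")                 # placeholder credentials
BASE = "https://downstream.example.com"      # placeholder Downstream host

response = requests.post(
    f"{BASE}/pulp/api/v3/importers/core/pulp/",
    auth=AUTH,
    json={
        "name": "nightly-importer",          # assumed identifying field
        # Maps Upstream repository names to the Downstream repositories
        # their repository-versions should be imported into.
        "repo_mapping": {"upstream-file-repo": "downstream-file-repo"},
    },
)
response.raise_for_status()
importer = response.json()
```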
API Specifics - working with Imports
POST /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/
-
CREATE AND EXECUTE an Import
- file (points to the output of an Export)
- [TBD] dry-run - if provided, produce output describing what would
have been imported and to where, without actually making any
changes
- Useful for validating the repository-mapping
GET /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/
- LIST
of all Imports for the specific Importer
GET /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/<import-uuid>/
-
READ specifics about an Import
DELETE /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/<import-uuid>/
-
DELETE an Import record
There is no way to UPDATE an Import. It is an historical record.
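A minimal sketch of the create-and-execute Import call described above, pointing at an export-file that has been copied onto the Downstream machine. The host, credentials, importer UUID, and file path are placeholders.

```python
import requests

AUTH = ("admin", "password")                                           # placeholder
IMPORTER_HREF = "/pulp/api/v3/importers/core/pulp/<importer-uuid>/"    # placeholder

response = requests.post(
    "https://downstream.example.com" + IMPORTER_HREF + "imports/",
    auth=AUTH,
    # 'file' points at the Upstream export-file, per the parameter described above.
    json={"file": "/var/lib/pulp/imports/export-from-upstream.tar.gz"},
)
response.raise_for_status()
```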
Plugin API (for pulpcore to call)
- Plugin author is responsible for a
<plugin>.app.modelresource.py
file, so that
pulpcore can find it based on repo-type. This file will contain:
- All of that plugin’s ModelResource subclasses
- A specification of the order in which they must be imported as
IMPORT_ORDER
- see pulp_rpm for an example
- Pulpcore must include error handling for the case where a plugin does not
support export/import
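A minimal sketch of what such a <plugin>.app.modelresource.py module might look like, using django-import-export's ModelResource as described in the next section. The plugin name, models, and fields are hypothetical; see pulp_rpm for a real example.

```python
from import_export import resources

from pulp_widget.app.models import Widget, WidgetRepository  # hypothetical plugin models


class WidgetRepositoryResource(resources.ModelResource):
    """Maps the plugin's repository metadata to the exportable format."""

    class Meta:
        model = WidgetRepository
        fields = ("pulp_id", "name", "description")


class WidgetResource(resources.ModelResource):
    """Maps the plugin's content metadata to the exportable format."""

    class Meta:
        model = Widget
        fields = ("pulp_id", "name", "version", "digest")


# pulpcore reads this to know the order in which the resources must be imported.
IMPORT_ORDER = [WidgetRepositoryResource, WidgetResource]
```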
Infrastructure approach
This functionality builds on the Django import/export
framework
Plugin ModelResource requirements
- ModelResource for each object/entity to export/import
- lives at
<plugin>.app.modelresource
- Set of ModelResources for each exportable/importable Model
that plugin is responsible for
- Which fields in those Models are export/import targets
- Order in which entities are/must be imported
- Can ModelResource declare a set of dependencies/prereqs?
- YES - use IMPORT_ORDER to list the order ModelResources need to be imported in
- Do we have any possibility for circular dependencies?
- Investigate use of before/after hooks provided - can we handle this complex case using them?
- Large amount of work for low return
- Can maybe use before_import() and before_import_row()
Pulpcore ModelResource requirements
- Core_artifact
- Core_contentartifact
- Core_content
Directory Layout
The export file needs to:
- Identify which export this was generated by (from filename)
- Identify the repository-version(s) that were exported (RepositoryVersionResource.json)
- Hold the data export for each ModelResource for each repository-version (content of Repository directories)
- Hold the artifacts for the exported repository-version(s) (artifacts/ directory)
- Identify versions of pulpcore and any plugins that were used to generate the export (version.json file)
Data export
- Database information layout is defined by django export rules (one
json per ModelResource).
- For multiple repository-versions, each version will be its own
directory (named by repo-version-uuid)
Artifacts
Artifacts exist in their own directory structure:
artifact/<first-2-digits-of-sha256>/<sha256-filenames>
Example layout
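An illustrative layout, assembled from the description above; directory and file names are approximate, not normative.

```
export-20200401.tar.gz
├── versions.json                     # pulpcore/plugin versions used for the export
├── RepositoryVersionResource.json    # identifies the exported repository-version(s)
├── <repo-version-uuid-1>/
│   ├── <ModelResource-1>.json        # one json file per ModelResource
│   ├── <ModelResource-2>.json
│   └── ...
├── <repo-version-uuid-2>/
│   └── ...
└── artifact/
    ├── ab/
    │   └── ab1c23...<full-sha256>    # artifact files named by sha256
    └── f0/
        └── f04d56...<full-sha256>
```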
Incremental Export
Incremental export is handled by making a POST /pulp/api/v3/exporters/core/pulp/<uuid>/exports/
call and specifying full=False. In this case, the PulpExporter will generate
an incremental dump based on the repo-version(s) listed in the latest Export
instantiated by this PulpExporter, and the current-version of all the
associated repositories.
If you specify versions= as well, the PulpExporter will generate an incremental
dump of everything that happened after last_export, up to the specified
versions (as opposed to current-latest).
If you specify start_versions=, the PulpExporter will generate an incremental
dump of everything that happened since the specified start_versions, up to
versions= (if specified) or last_export (if no versions= is given).
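A minimal sketch of an incremental export call using these parameters; the host, credentials, exporter UUID, and version hrefs are placeholders.

```python
import requests

AUTH = ("admin", "password")                                        # placeholder
EXPORTS_URL = (
    "https://upstream.example.com"
    "/pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/"     # placeholder uuid
)

response = requests.post(
    EXPORTS_URL,
    auth=AUTH,
    json={
        "full": False,  # incremental: base defaults to the versions in the last Export
        # Optional overrides described above (values are placeholder hrefs):
        "start_versions": ["/pulp/api/v3/repositories/file/file/<repo-uuid>/versions/3/"],
        "versions": ["/pulp/api/v3/repositories/file/file/<repo-uuid>/versions/7/"],
    },
)
response.raise_for_status()
```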
Edge cases/caveats
- How do we handle the case when a user wants to ‘unimport’
repository-version X?
- Export-full on upstream, specifying the desired end-state repository-version, and import on downstream
Open Questions
- Currently, Pulp has the concept of Exporters (filesystem, rsync,
etc.) which are implemented as Master/Detail. This was done to
accommodate the fact that some plugins will need to export
publications while others might export repository versions. Do we
divorce the concept of import/export and Exporters? Or bring the
Exporters in line with import/export by looking into having core
handle Exporters?
- On import, order can matter. Plugins will need to be able to define
which entities need to be imported in which order. Problem: how can
we handle circular relations?
- How are we going to handle publications/distributions? Esp in the
presence of incrementals?
- Can users update exporters? What happens if I change the repos or
the last_repository_versions?
- The last repo-version is now implied by the Exporter’s last Export.
- The current-version of the repos-in-exporter will be compared to the
exported-repo-versions in the most-recent Export. New repos will
get a full export, removed repos won’t be exported. On the
import side, new repos will get new versions, and removed
repos will stop getting new repo-versions at import time.
- OR - import-time can compare last-successful-import repos to
incoming repos, and delete repos that aren’t incoming? That
sounds like a bad idea.
- Problem: How does this interact with versions= and start_versions= validation?
Should we even allow this, or should we recommend "create a new Exporter instead"?
- What does dry-run look like?
- How does the user create the repo-to-repo mapping used at import
time?
- Have an export include a file which has the mapping
- Have dry run return the mapping
- [DONE] How will backwards incompatible changes in the exported data format be handled over time? e.g. I exported this data a loooong time ago and now the system I'm importing into expects a newer, different format?
- idea 1: Tie the data format to the exported system's pulpcore version number. Put that version number into the exported data somehow. Then have systems importing know the oldest version they support and refuse to import older exports.
- idea 2: Same idea as (1) except use its own numbering scheme, probably semver-based
- Current Status: versions.json is included in the export, with core and plugin versions for all
plugins involved in the export. Import checks for an exact match to versions.json,
and errors if there are diffs.
- [DONE] Are we confident this design will allow us to later/easily add the ability to break up the data format over N, RAR files, isos, etc? Do we have an idea of the plan there even if we're not implementing it now?
- Current status: If chunk_size= is specified, the resulting tarfile is split
into pieces of size <= chunk_size.
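For illustration only, the versions.json mentioned in the version-checking item above might contain something like the following; the field layout and version numbers are assumptions, not part of this design.

```
{
  "core": "3.3.0",
  "file": "0.3.0",
  "rpm": "3.3.2"
}
```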