# Pulp3 Import/Export Design
## Introduction/problem statement
There is a use-case for extracting repository-versions out of a running
instance of Pulp into a **file** that can be transferred to another Pulp
instance and imported. This is not the Pulp-to-Pulp sync case; the
assumption is that the receiving Pulp instance is network-isolated. This
design aims to describe the requirements, limitations, and a possible
implementation for this functionality.
### Additional Documents
* [low level design document](https://hackmd.io/@ggainey/importexport_lowlevel_naturalkeys)
* [Out of memory issues](https://hackmd.io/@ggainey/pie_memory_problem)
### Definitions
- **Upstream** - the Pulp instance whose repository-version(s) we want
to export
- **Downstream** - the Pulp instance that will be importing those
repository-version(s)
- **ModelResource** - entity that understands how to map the metadata
for a specific Model owned/controlled by a plugin to an exportable
file-format (e.g., CSV or JSON) (see `django-import-export`)
- **Exporter** - A resource that exports content from Pulp for a
variety of different use cases
- **PulpExporter** - A kind-of *Exporter*, that is specifically used
to export data from an *Upstream* for consumption by a *Downstream*
- **PulpExport** - a specific instantiation/run of a *PulpExporter*
- **Export file** - tarfile containing database metadata and
content-artifacts for a (set of) repository-versions, generated
during execution of an *Export*
- **PulpImporter** - A resource that accepts an *Upstream*
*PulpExporter* *export file*, and manages the process of importing
the data and artifacts included
- **PulpImport** - a specific instantiation/run of a *PulpImporter*
- **Repository-mapping** - a configuration file that provides the
ability to map an *Upstream* repository, to a *Downstream*
repository, into which the *Upstream’s* repository-version should be
imported by a *PulpImporter*
- **Import order** - for complicated repository-types, managing
relationships requires that models be imported in order. Plugins are
responsible for specifying the *import-order* of the
*ModelResources* they own
### Workflow
#### “First time” setup and processing
1. Upstream user (UU) defines a PulpExporter for a set of repositories
1. UU does a dry-run of an Export for that Exporter
1. UU executes an Export for that Exporter
1. UU hands the resulting export-file to a Downstream User (DU)
1. DU defines a PulpImporter
1. DU does a dry-run of an Import, using the provided
file, to generate a proposed repository-mapping between the Upstream
repositories referenced in the export-file, and available Downstream
repositories.
1. DU modifies the output of the dry-run to correctly match Upstream
and Downstream repositories
1. DU modifies the PulpImporter defined previously with the provided mapping
1. DU executes an Import for that PulpImporter with the provided export-file
#### “Steady state”/regular processing
1. UU executes an Export for that Exporter
1. UU hands the resulting export-file to a Downstream User (DU)
1. DU executes an Import for that Importer with the provided export-file
### Requirements
- [DONE] Need to export Model info (i.e. data from the database)
- [DONE] Need to export artifacts (i.e. files from the filesystem)
- [DONE] Need to be able to handle multiple repository-versions at once
- On export, need to be able to specify two versions and export only
the ‘diff’ between them
- On import, need to be able to recognize a diff, and be able to
specify a ‘base version’ on top of which that diff should be played
- [DONE] A given artifact should only be in the export once, even if it
exists in multiple repositories
- [DONE] Need to be able to import into downstream repositories with
arbitrary names (because downstream cannot be assumed to be
identical to upstream)
- [DONE] Need to be able to handle any repository type
- Need to respond gracefully if a given repository type does not
support export/import
- [DONE] Need to minimize the amount of work for plugin authors (to reduce
implementation friction)
- Need to minimize the amount of work on the user (to reduce
acceptance friction)
- [DONE] Need to allow for complicated scenarios, while making the simple
“export version-N from repo-R and import it into the same repo
elsewhere” case simple
- Need to provide a “dry-run” capability for export that will produce
a description of what would be exported and a first-order
approximation of the required disk space
- Need to provide a “dry-run” capability on import in order to ensure
that the provided export file and repo-mapping “make sense”
- Optional: export takes only a repository-id, and produces an export
for the ‘latest’ repository-version
- [DONE] Need to provide a history to the user to see information about the
last time they exported
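
The "a given artifact should only be in the export once" requirement amounts to de-duplicating by checksum across repositories. A minimal sketch, assuming each repository can list the SHA256 digests of its artifacts; the function name and input shape are hypothetical and not part of pulpcore:

```python
from typing import Dict, Iterable, List


def unique_artifacts(artifacts_by_repo: Dict[str, Iterable[str]]) -> List[str]:
    """Return every artifact digest exactly once, in first-seen order."""
    seen = set()
    ordered = []
    for digests in artifacts_by_repo.values():
        for digest in digests:
            if digest not in seen:
                seen.add(digest)
                ordered.append(digest)
    return ordered


# "bb22" is shared by both repositories but should be exported once.
shared = unique_artifacts({"repo-a": ["aa11", "bb22"], "repo-b": ["bb22", "cc33"]})
print(shared)
```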
### Existing documents/discussion
- Initial pulp issue [*5096*](https://pulp.plan.io/issues/5096)
- Current epic [*6134*](https://pulp.plan.io/issues/6134) with design
meeting minutes for this issue
## PulpExporter
PulpExporter is a type of Exporter that can create a PulpExport instance,
which knows how to create a file that can be imported back into Pulp.
Users can configure a PulpExporter with settings and view a history for
past Exports created by a PulpExporter.
A PulpExporter can rely on the last Export it created to know which
repository version(s) were exported and, for incremental exports, use
those version(s) automatically as the base version(s).
### Fields
- **name** - a unique string to identify the exporter
- **path** - the location on the filesystem of a DIRECTORY where exports should be
written
- **repositories** - list of repos whose current-versions will be exported
- **last_export** - [read-only] UUID of last Export instantiated by the PulpExporter,
which will contain the list of base repository versions to export
against. If there is no last-export-UUID, this is the first-time-in
and a full export will happen.
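
The fields above can be pictured as a small record. This is only an illustrative sketch using a plain dataclass, not the Django model pulpcore would define; the class name and the `next_export_is_full` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from uuid import UUID


@dataclass
class PulpExporterSketch:
    name: str                      # unique identifier for the exporter
    path: str                      # directory the export-file is written to
    repositories: List[str] = field(default_factory=list)
    last_export: Optional[UUID] = None  # read-only in the real API

    def next_export_is_full(self) -> bool:
        # With no previous Export there is no base version to diff against,
        # so the first run is a full export.
        return self.last_export is None


exporter = PulpExporterSketch(name="nightly", path="/srv/exports", repositories=["repo-a"])
print(exporter.next_export_is_full())
```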
### History
A user can view the history for a particular exporter. This consists of
the list of *PulpExports* created by this *PulpExporter*. Exports include:
* date and time when the export occurred
* specific RepositoryVersions that were exported
* parameters that were used when exporting
* filename the export-file was written to
* SHA256 checksum of the resulting export-file
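
The SHA256 checksum recorded in the history can be computed without loading the whole export-file into memory. A minimal sketch using only the standard library; the function name and block size are illustrative choices:

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, block_size: int = 1024 * 1024) -> str:
    """Stream the file in fixed-size blocks and return its hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Demonstrate on a throwaway file standing in for an export-file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example export bytes")
checksum = sha256_of_file(tmp.name)
os.remove(tmp.name)
print(checksum)
```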
## REST API (pulpcore)
### API for Exporting
Pulpcore will provide a CRUDL endpoint at `/pulp/api/v3/exporters/core/pulp/`
for working with *PulpExporters*.
There will also be an endpoint for viewing the *Exports* instantiated
by/from a specific *PulpExporter* at `/pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/`.
#### API Specifics - working with PulpExporters
- `POST /pulp/api/v3/exporters/core/pulp/` - **CREATE** an Exporter
- **name** - human-readable identifier for this Exporter
- **repositories** - list of repositories whose current/latest
version should be exported
- **path** to the **DIRECTORY** where resulting tarfile will live
- Assumption: either you get the complete tarfile, or, if
Something Went Wrong, you get an error and no file
- Required with no default - if we let pulp ‘guess’, we can
run pulp itself out of storage
- Must be at/under one of the directories in the ALLOWED_EXPORT_PATHS
setting, or the create will fail
- [TBD] **publications** (optional, phase-2) to be exported
- [TBD] **distributions** (optional, phase-3) to be exported
- `GET /pulp/api/v3/exporters/core/pulp/` - **LIST** of all current Exporters
- `GET /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/` - **READ** specifics
about an Exporter
- `PUT /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/` - completely **UPDATE**
an Exporter
- **name** - human-readable identifier for this Exporter
- **repositories** - list of repositories whose current/latest
version should be exported
- **path** - directory where resulting tarfile will live
- [TBD] **publications** (optional, phase-2) to be exported (?)
- [TBD] **distributions** (optional, phase-3) to be exported(?)
- `PATCH /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/` - partially **UPDATE**
an Exporter
- `DELETE /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/` - **DELETE** an
Exporter
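
The rule that **path** must be at or under one of the `ALLOWED_EXPORT_PATHS` directories could be checked along these lines. This is a sketch under the assumption that the setting is a list of directory strings; the helper name is hypothetical and pulpcore's actual validation may differ.

```python
import os
from typing import Iterable


def is_allowed_export_path(path: str, allowed_paths: Iterable[str]) -> bool:
    """True if path equals an allowed directory or lies beneath one."""
    candidate = os.path.realpath(path)  # also collapses ".." segments
    for allowed in allowed_paths:
        root = os.path.realpath(allowed)
        # commonpath compares whole path components, so "/srv/exports-other"
        # does not count as being under "/srv/exports".
        if os.path.commonpath([candidate, root]) == root:
            return True
    return False


allowed = ["/srv/exports"]
print(is_allowed_export_path("/srv/exports/nightly", allowed))
print(is_allowed_export_path("/srv/exports/../secret", allowed))
```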
#### API Specifics - working with Exports
- `POST /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/` -
**CREATE AND EXECUTE** an Export
- **full** - if true, ignore any previous Exports for this
Exporter and generate a full export. If false,
generate an incremental export, using the last existing Export
as the base-version. Default is 'True'
- **versions** - list of repository-versions to export.
Must match the full set of Repositories being managed by the Exporter.
- **start_versions** - list of repository-versions to start from when doing an
incremental export. Must match the full set of Repositories being managed by the Exporter.
- **chunk_size** - specify a chunk-size for an export-file.
Export-file will be cut up into chunks of no more than (chunk_size).
- [TBD] **dry-run** - if provided, produce output describing what would
have been produced, without actually writing anything to disk
- `GET /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/` - **LIST**
of all Exports for the specific PulpExporter
- `GET
/pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/<export-uuid>/` -
**READ** specifics about an Export
- `DELETE
/pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/<export-uuid>/` -
**DELETE** an Export record
There is no way to **UPDATE** an Export. It is an historical record.
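
The constraint that **versions** and **start_versions** "must match the full set of Repositories being managed by the Exporter" can be sketched as a set comparison. The `(repository, version_number)` tuple shape and the function name are assumptions made for illustration:

```python
from typing import Iterable, Set, Tuple


def check_versions(
    exporter_repos: Iterable[str],
    versions: Iterable[Tuple[str, int]],
) -> Tuple[Set[str], Set[str]]:
    """Return (missing, unexpected) repositories for a versions list."""
    managed = set(exporter_repos)
    supplied = {repo for repo, _number in versions}
    missing = managed - supplied      # managed by the Exporter, no version given
    unexpected = supplied - managed   # version given for an unmanaged repository
    return missing, unexpected


missing, unexpected = check_versions(
    ["repo-a", "repo-b"],
    [("repo-a", 3), ("repo-c", 1)],
)
print(missing, unexpected)
```

A request would be valid only when both returned sets are empty.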
### API for Importing
Pulpcore will provide a CRUD endpoint for PulpImporters at
`/pulp/api/v3/importers/core/pulp/`.
There will also be an endpoint for viewing the import history of a
PulpImporter at `/pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/`.
#### API Specifics - working with PulpImporters
- `POST /pulp/api/v3/importers/core/pulp/` - **CREATE** a PulpImporter
- **repository** to import the repository version into
- What if we let the repo be created by the repo-version?
- Plugin authors would need to have modelresources for
their repository metadata
- How to handle incrementals?
- What if the repo already exists?
- What if the repo isn’t of the same type as the export?!?
(whether we specify it or not)
- Wait, multiple repos in export - now what?
- **OR**,
- **repository-mapping** - configuration-file that maps
upstream-repositories to the downstream-repository they should
import-into
- `GET /pulp/api/v3/importers/core/pulp/` - **LIST** of all current Importers
- `GET /pulp/api/v3/importers/core/pulp/<importer-uuid>/` - **READ** specifics
about an Importer
- `PATCH /pulp/api/v3/importers/core/pulp/<importer-uuid>/` - partial
**UPDATE** an Importer (subset of fields being updated)
- `PUT /pulp/api/v3/importers/core/pulp/<importer-uuid>/` - completely
**UPDATE** an Importer (all fields required)
- `DELETE /pulp/api/v3/importers/core/pulp/<importer-uuid>/` - **DELETE** an
Importer
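
One possible shape for the **repository-mapping** is a JSON object keyed by Upstream repository name. This sketch assumes that shape, and assumes an unmapped repository falls back to a Downstream repository of the same name (the simple "same repo elsewhere" case from the requirements); neither is settled by this design.

```python
import json
from typing import Dict


def resolve_destination(upstream_name: str, repo_mapping: Dict[str, str]) -> str:
    """Pick the Downstream repository an Upstream repository imports into."""
    # Assumption: repositories without an entry keep their Upstream name.
    return repo_mapping.get(upstream_name, upstream_name)


# Hypothetical mapping file content.
mapping = json.loads('{"rhel8-baseos": "dmz-rhel8-baseos"}')
print(resolve_destination("rhel8-baseos", mapping))
print(resolve_destination("zoo", mapping))
```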
#### API Specifics - working with Imports
- `POST /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/` -
**CREATE** AND **EXECUTE** an Import
- **file** (points to the output of an Export)
- [TBD] **dry-run** - if provided, produce output describing what would
have been imported and to where, without actually making any
changes
- Useful for validating the repository-mapping
- `GET /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/` - **LIST**
of all Imports for the specific Importer
- `GET
/pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/<import-uuid>/` -
**READ** specifics about an Import
- `DELETE
/pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/<import-uuid>/` -
**DELETE** an Import record
There is no way to **UPDATE** an Import. It is an historical record.
## Plugin API (for pulpcore to call)
- Plugin author is responsible for a `<plugin>.app.modelresource.py` file, so that
pulpcore can find it based on repo-type. This file will contain:
- All of that plugin’s ModelResource subclasses
- A specification of the order in which they must be imported as `IMPORT_ORDER`
- see [pulp_rpm](https://github.com/pulp/pulp_rpm/blob/master/pulp_rpm/app/modelresource.py) for an example
- Pulpcore must include error handling when the plugin doesn't do
export/import
## Infrastructure approach
This functionality builds on the [*Django import/export
framework*](https://django-import-export.readthedocs.io/en/latest/getting_started.html#exporting-data)
### Plugin ModelResource requirements
- ModelResource for each object/entity to export/import
- lives at `<plugin>.app.modelresource`
- Set of **ModelResources** for each exportable/importable Model
that plugin is responsible for
- Which **fields** in those Models are export/import targets
- Order in which entities are/must be imported
- Can ModelResource declare a set of dependencies/prereqs?
- YES - use IMPORT_ORDER to list the order ModelResources need to be imported in
- Do we have any possibility for circular dependencies?
- Investigate use of before/after hooks provided - can we handle this complex case using them?
- Large amount of work for low return
- Can maybe use `before_import()` and `before_import_row()`
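
For the circular-dependency question, one way to explore it is to describe the prerequisites of each ModelResource and let a topological sort either produce an order or report a cycle. This is a different technique from the hand-written `IMPORT_ORDER` list described above, shown only as a sketch; it uses the standard-library `graphlib` module (Python 3.9+), and the dependency data below is illustrative.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def import_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """dependencies maps a resource name to the names it must follow."""
    return list(TopologicalSorter(dependencies).static_order())


def has_circular_dependency(dependencies: Dict[str, Set[str]]) -> bool:
    try:
        import_order(dependencies)
    except CycleError:
        return True
    return False


# Illustrative prerequisites: ContentArtifact rows refer to Content and Artifact.
deps = {
    "ArtifactResource": set(),
    "ContentResource": set(),
    "ContentArtifactResource": {"ArtifactResource", "ContentResource"},
}
order = import_order(deps)
print(order)
print(has_circular_dependency({"A": {"B"}, "B": {"A"}}))
```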
### Pulpcore ModelResource requirements
- Core\_artifact
- Core\_contentartifact
- Core\_content
## File format ("HOW is it all exported")
### Directory Layout
The export file needs to:
- Identify which export this was generated by (from filename)
- Identify the repository-version(s) that are generated (RepositoryVersionResource.json)
- Hold the data export for each modelresource for each repository-version (content of Repository directories)
- Hold the artifacts for the exported repository-version(s) (artifact/ directory)
- Identify versions of pulpcore and any plugins that were used to generate the export (version.json file)
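
The version information exists so that an import can be refused when Upstream and Downstream differ (the Open Questions section describes an exact-match check). A minimal sketch of such a comparison, assuming the file reduces to a mapping of component name to version string; that shape and the function name are hypothetical:

```python
from typing import Dict, Optional, Tuple


def version_mismatches(
    exported: Dict[str, str],
    installed: Dict[str, str],
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each mismatched component to (exported_version, installed_version)."""
    mismatches = {}
    for component, version in exported.items():
        found = installed.get(component)  # None when the plugin is not installed
        if found != version:
            mismatches[component] = (version, found)
    return mismatches


diffs = version_mismatches(
    {"pulpcore": "3.3.0", "pulp_file": "0.3.0"},
    {"pulpcore": "3.3.0", "pulp_file": "0.2.0"},
)
print(diffs)
```

An import would proceed only when the returned mapping is empty.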
#### Data export
- Database information layout is defined by django export rules (one
json per ModelResource).
- For multiple repository-versions, each version will be its own
directory (named by repo-version-uuid)
#### Artifacts
Artifacts exist in their own directory structure:
`artifact/<first-2-digits-of-sha256>/<sha256-filenames>`
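
A sketch of how a digest maps onto that structure. Judging from the example layout in this document, the file name appears to be the remaining 62 characters of the digest after the two-character directory; that reading, and the function name, are assumptions.

```python
import posixpath


def artifact_export_path(sha256_hex: str) -> str:
    """Build the in-tarfile path for an artifact from its SHA256 hex digest."""
    if len(sha256_hex) != 64:
        raise ValueError("expected a 64-character SHA256 hex digest")
    # First two hex digits name the directory, the rest names the file.
    return posixpath.join("artifact", sha256_hex[:2], sha256_hex[2:])


digest = "0277b750a47e23de8a391ca5ccfa00e8dbb19419814a229d31f1d7ba382783d2"
print(artifact_export_path(digest))
```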
#### Example layout
```
tar xvzf export-bfcc1def-f96c-4448-b585-0a3ee2f5651c-20200406_1639.tarfile.gz
├── artifact
│ ├── 02
│ │ ├── 77b750a47e23de8a391ca5ccfa00e8dbb19419814a229d31f1d7ba382783d2
│ │ └── 9b6b0ee32ae77e2d4ef16fdd4d9ff271195077b84d0ffcb59f393bbae4520f
│ ├── 14
│ │ └── 39c3e561ae7ece3c5833a6962ea1e758c9fc08f6afb4c1fe0939278070d516
│ ├── 27
│ │ └── 066e7572fc068a2d96956962c8c1234bd8724a2e875faf8afc9bf83d75b63e
│ ├── 2e
│ │ └── d4cb8647fb22113a36e12d6517ea1b91b1c1c61c3460a63da951af4b357fad
│ ├── 32
│ │ └── 8c2300ddd7a91265480020322806407ee46bba06fe446ba5c5020f857d0db0
│ ├── 38
│ │ └── 87adfb7f9c05f66fdba6c5f56ec4de9dc582b38d134d2870ee2859d17ff4ed
│ ├── 39
│ │ └── a74d75f3d41435de6f8e7e3219ebd97bca67f359a601aae9eeaef6a3827ef8
│ ├── 3b
│ │ └── ead1283c3241f099d284f78259cbb4d5dfd957b3d89c447e2bf3aa5b5c6faf
│ ├── 42
│ │ ├── 6ad7b4d71fde3b2c6caf0127862d841348d5d8b51ca2d2dc7d8f9100be04c4
│ │ └── bcdff1fa507e9dcde127689b443da75f8ec06486caece235fa492d26025b8b
│ ├── 4c
│ │ └── 0fd86853fe97d38d03905313501482fcbce511031a9d3c9adabcf01afb4d5e
│ ├── 55
│ │ └── 9d66be52e52263831f952de4dce7a4331ca8c3965dc7bd9d774e84acf0c3c3
│ ├── 57
│ │ └── fb2a45f839deb52a37f8c816e04c16557c393393bff3d2bc84fe851ec7163a
│ ├── 60
│ │ └── 9f12688c71fc65b5e6b9d919c18d31857270c7680d1f9c2fb3d9851a0e045e
│ ├── 65
│ │ └── 79ce096b6fcaf2b47981b7840578f31a58420d33dc0fa2f5fe5dc86d8dfa8e
│ ├── 71
│ │ └── 1df5946d0aa22e599b17295d102d4457a0c899d1293499897e1282f6c15c7b
│ ├── 78
│ │ └── 6aeaa5f329c4be42e581ecb8dcbdc426cae7f4365cbb168e8121d564437e6d
│ ├── 80
│ │ └── c7cfeb3d159a45e79801392cbbf0779869ca9e82302428a8346d31e85395bc
│ ├── 88
│ │ └── aff49c2658261e523724eaa9e2cd38772f9bcc70986aae30503d83f699612e
│ ├── 89
│ │ └── 5ba3c4847f2760a645f1224ad2a3f5d4da4aa289998c81598747582127c754
│ ├── 9b
│ │ └── b29691f097fe67a4dfebbafd1c895e5b41e13f623daaffc5a5a95dde1fabea
│ ├── a3
│ │ └── 77c9b3099d3a29729bdc8e4c059d0850c19415e38220e7e469880bf9e9b9ff
│ ├── af
│ │ └── 1f188d9e2523b24302b5b91036039777bd2df6b23313aff89cc475eb3e4b07
│ ├── b0
│ │ └── a3cc6b250c03719626af7c41d337e1187a5f0b779fce6005b3a45fac09f490
│ ├── b5
│ │ └── f7a0f30e2d8eb907b9c9ee8e2237d87db7054eddadfd17f8c5ce169530f28d
│ ├── b6
│ │ └── 7ee9aba5b21244948e651c61b47c81816e0b96c08deb0732c92197ea9c0d36
│ ├── bc
│ │ └── a4539394799d6ee9d863465b76ea12f51f95e7d87dee1558dd8430ee75bb7c
│ ├── be
│ │ └── 96d9b80dbd539050f5459bb24a134bd719030c4fad4f01761ef1959384bc02
│ ├── bf
│ │ └── d98c5543feb484581714123e8e7f52af9263d6794e77c723894619d75a5034
│ ├── c0
│ │ └── 6a997a99f84f54a05935f67aae9fddf0e9eed7cadcd04464670da439bfa48c
│ ├── c2
│ │ └── 8c806da973e7b568995d6ac4c800aa3464ad91e93c338383bad6cd9c175c66
│ ├── c5
│ │ └── fe0911fbe98294054558107e3dfce6fd1f9928ea469020de60abedea835ae2
│ ├── d3
│ │ └── 5ae5a97f28280f3e435fb5174938818739530071568d4da4351655b05c2bf7
│ ├── db
│ │ └── 4b1affbf78bff20720eacab4558c37cf0406b8a63b1bffb80ae878246b086a
│ ├── e4
│ │ └── bed3c8f84903f0b5c4dd35579b280bd440c541f245f6fffb1fcdd79fc9a43e
│ └── ef
│ ├── d7c18b5f8762cc72c5a0c17150c881e8cf4ccf20df5bf2bd77c98c4bc56238
│ └── f95a379774c7622037bcb7ff347271dc727ae4ddce9b16017477b58488aa3e
├── pulpcore.app.modelresource.ArtifactResource.json
├── pulpcore.app.modelresource.RepositoryResource.json
├── repository-4aeeecdb-9589-4520-92f6-90bb171702c7_1
│ ├── pulpcore.app.modelresource.ContentArtifactResource.json
│ ├── pulpcore.app.modelresource.ContentResource.json
│ ├── pulpcore.app.modelresource.RepositoryVersionResource.json
│ └── pulp_file.app.modelresource.FileContentResource.json
├── repository-c1796efe-9e62-4bf3-9ebe-c03f0518809a_1
│ ├── pulpcore.app.modelresource.ContentArtifactResource.json
│ ├── pulpcore.app.modelresource.ContentResource.json
│ └── pulpcore.app.modelresource.RepositoryVersionResource.json
└── version.json
```
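
Since the export is identified from its filename, a consumer could parse the name shown in the example. The pattern below is inferred from that single example (`export-<uuid>-<YYYYMMDD_HHMM>.tarfile.gz`) and is an assumption, not a documented naming contract:

```python
import re
from datetime import datetime
from typing import Tuple
from uuid import UUID

# Pattern inferred from the one example filename in this document.
_EXPORT_NAME = re.compile(
    r"^export-(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
    r"-(?P<stamp>\d{8}_\d{4})\.tarfile\.gz$"
)


def parse_export_filename(filename: str) -> Tuple[UUID, datetime]:
    """Split an export filename into its UUID and creation timestamp."""
    match = _EXPORT_NAME.match(filename)
    if match is None:
        raise ValueError(f"not an export filename: {filename!r}")
    created = datetime.strptime(match["stamp"], "%Y%m%d_%H%M")
    return UUID(match["uuid"]), created


export_id, created = parse_export_filename(
    "export-bfcc1def-f96c-4448-b585-0a3ee2f5651c-20200406_1639.tarfile.gz"
)
print(export_id, created)
```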
## Incremental Export
Incremental export is handled by making a `POST
/pulp/api/v3/exporters/core/pulp/<uuid>/exports/` call and specifying
`full=False`. In this case, the PulpExporter will generate an incremental
dump based on the repo-version(s) listed in the latest Export
instantiated by this PulpExporter, and the current-version of all the
associated repositories.
If you specify `versions=` as well, PulpExporter will generate an incremental dump
of everything that happened after `last_export`, up to the specified versions (as opposed to current-latest).
If you specify `start_versions=`, PulpExporter will generate an incremental dump
of everything that happened since the specified start_versions, up to `versions=` (if specified) or `last_export` (if no `versions=`).
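
Conceptually, an incremental dump is the difference between the content of a base repository-version and an end repository-version. A minimal sketch, assuming each version can be reduced to a set of content identifiers; the function name and the decision to report removals separately are illustrative, not the implemented behavior:

```python
from typing import Set, Tuple


def incremental_diff(base_content: Set[str], end_content: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (added, removed) content between a base and an end version."""
    added = end_content - base_content     # must be shipped in the export
    removed = base_content - end_content   # no longer present in the end version
    return added, removed


added, removed = incremental_diff({"pkg-1", "pkg-2"}, {"pkg-2", "pkg-3"})
print(added, removed)
```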
## Edge cases/caveats
- How do we handle the case when a user wants to ‘unimport’ repo-version
X?
- Export-full on upstream, specifying the desired end-state repository-version, and import on downstream
## Open Questions
- ~~Currently, Pulp has the concept of Exporters (filesystem, rsync,
etc) which are implemented as Master/Detail. This was done to
accommodate the fact that some plugins will need to export
publications while others might export repository versions. Do we
divorce the concept of import/export and Exporters? Or bring the
Exporters inline with import/export by looking into having core
handle Exporters?~~
- On import, order can matter. Plugins will need to be able to define
which entities need to be imported in which order. **Problem: how can
we handle circular relations?**
- How are we going to handle publications/distributions? Esp in the
presence of incrementals?
- Can users update exporters? What happens if I change the repos or
the last_repository_versions?
- Last-repo-version is now implied by last-Exporter.
- current-version of repos-in-exporter, will be compared to
exported-repo-versions in most-recent Export. New repos will
get a full-export, removed-repos won’t be exported. On
import-side, new repos will get new versions, and removed
repos will stop getting new repo-versions at import time.
- **OR** - import-time can compare last-successful-import repos, to
incoming-repos, and delete repos that aren’t incoming? That
sounds like a bad idea.
- **Problem**: How does this interact with versions= and start_versions= validation?
Should we even allow this, or should we recommend "create a new Exporter instead"?
- What does dry-run look like?
- How does the user create the repo-to-repo mapping used at import
time?
- Have an export include a file which has the mapping
- Have dry run return the mapping
- [DONE] How will backwards incompatible changes in the exported data format be handled over time? e.g. I exported this data a loooong time ago and now the system I'm importing into expects a newer, different format?
- idea 1: Tie the data format to the exported system's pulpcore version number. Put that version number into the exported data somehow. Then have systems importing know the oldest version they support and refuse to import older exports.
- idea 2: Same idea as (1) except use its own numbering scheme, probably semver based
- **Current Status**: versions.json included in export, with core and plugin versions for all
plugins involved in the export. Import checks for **exact match** to versions.json on import,
and errors if there are diffs.
- [DONE] Are we confident this design will allow us to later/easily add the ability to break up the data format over N, RAR files, isos, etc? Do we have an idea of the plan there even if we're not implementing it now?
- **Current status**: If `chunk_size=` is specified, resulting tarfile is split into pieces of
size <= chunk_size.
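
The chunking behavior can be sketched as reading the tarfile in `chunk_size` slices and writing each slice to its own file. The numeric suffix naming is hypothetical, and a real implementation would stream within each chunk rather than hold a whole chunk in memory:

```python
import os
import tempfile
from typing import List


def split_into_chunks(source_path: str, chunk_size: int) -> List[str]:
    """Write source_path as numbered pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    paths = []
    with open(source_path, "rb") as source:
        index = 0
        while True:
            data = source.read(chunk_size)
            if not data:
                break
            chunk_path = f"{source_path}.{index:04d}"  # hypothetical naming
            with open(chunk_path, "wb") as chunk:
                chunk.write(data)
            paths.append(chunk_path)
            index += 1
    return paths


# Demonstrate on a 10-byte stand-in for an export tarfile.
with tempfile.TemporaryDirectory() as workdir:
    export_path = os.path.join(workdir, "export.tar.gz")
    with open(export_path, "wb") as handle:
        handle.write(b"0123456789")
    pieces = split_into_chunks(export_path, chunk_size=4)
    sizes = [os.path.getsize(piece) for piece in pieces]
    rejoined = b""
    for piece in pieces:
        with open(piece, "rb") as handle:
            rejoined += handle.read()
print(sizes)
```

Concatenating the pieces in order reproduces the original file, which is what the receiving side would need to do before importing.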
###### tags: `import/export`