# Pulp3 Import/Export Design

## Introduction/problem statement

There is a use-case for extracting repository-versions out of a running instance of Pulp into a **file** that can be transferred to another Pulp instance and imported. This is not the Pulp-to-Pulp sync case; the assumption is that the receiving Pulp instance is network-isolated.

This design aims to describe the requirements, limitations, and a possible implementation for this functionality.

### Additional Documents

* [low level design document](https://hackmd.io/@ggainey/importexport_lowlevel_naturalkeys)
* [Out of memory issues](https://hackmd.io/@ggainey/pie_memory_problem)

### Definitions

- **Upstream** - the Pulp instance whose repository-version(s) we want to export
- **Downstream** - the Pulp instance that will be importing those repository-version(s)
- **ModelResource** - entity that understands how to map the metadata for a specific Model owned/controlled by a plugin to an exportable file-format (e.g., CSV or JSON) (see `django-import-export`)
- **Exporter** - A resource that exports content from Pulp for a variety of different use cases
- **PulpExporter** - A kind of *Exporter* that is specifically used to export data from an *Upstream* for consumption by a *Downstream*
- **PulpExport** - a specific instantiation/run of a *PulpExporter*
- **Export file** - tarfile containing database metadata and content-artifacts for a (set of) repository-versions, generated during execution of an *Export*
- **PulpImporter** - A resource that accepts an *Upstream* *PulpExporter* *export file*, and manages the process of importing the data and artifacts included
- **PulpImport** - a specific instantiation/run of a *PulpImporter*
- **Repository-mapping** - a configuration file that provides the ability to map an *Upstream* repository to a *Downstream* repository, into which the *Upstream’s* repository-version should be imported by a *PulpImporter*
- **Import order** - for complicated repository-types, managing relationships requires that models be imported in order. Plugins are responsible for specifying the *import-order* of the *ModelResources* they own

### Workflow

#### “First time” setup and processing

1. Upstream user (UU) defines a PulpExporter for a set of repositories
1. UU does a dry-run of an Export for that Exporter
1. UU executes an Export for that Exporter
1. UU hands the resulting export-file to a Downstream User (DU)
1. DU defines a PulpImporter
1. DU does a dry-run of an Import, using the provided file, to generate a proposed repository-mapping between the Upstream repositories referenced in the export-file, and available Downstream repositories.
1. DU modifies the output of the dry-run to correctly match Upstream and Downstream repositories
1. DU modifies the PulpImporter defined previously with the provided mapping
1. DU executes an Import for that PulpImporter with the provided export-file

#### “Steady state”/regular processing

1. UU executes an Export for that Exporter
1. UU hands the resulting export-file to a Downstream User (DU)
1. DU executes an Import for that Importer with the provided export-file

### Requirements

- [DONE] Need to export Model info (i.e. data from the database)
- [DONE] Need to export artifacts (i.e. files from the filesystem)
- [DONE] Need to be able to handle multiple repository-versions at once
- On export, need to be able to specify two versions and export only the ‘diff’ between them
- On import, need to be able to recognize a diff, and be able to specify a ‘base version’ on top of which that diff should be played
- [DONE] A given artifact should only be in the export once, even if it exists in multiple repositories
- [DONE] Need to be able to import into downstream repositories with arbitrary names (because downstream cannot be assumed to be identical to upstream)
- [DONE] Need to be able to handle any repository type
- Need to respond gracefully if a given repository type does not support export/import
- [DONE] Need to minimize the amount of work for plugin authors (to reduce implementation friction)
- Need to minimize the amount of work on the user (to reduce acceptance friction)
- [DONE] Need to allow for complicated scenarios, while making the simple “export version-N from repo-R and import it into the same repo elsewhere” case simple
- Need to provide a “dry-run” capability for export that will produce a description of what would be exported and a first-order approximation of the required disk space
- Need to provide a “dry-run” capability on import in order to ensure that the provided export file and repo-mapping “make sense”
- Optional: export takes only a repository-id, and produces an export for the ‘latest’ repository-version
- [DONE] Need to provide a history to the user to see information about the last time they exported

### Existing documents/discussion

- Initial pulp issue [*5096*](https://pulp.plan.io/issues/5096)
- Current epic [*6134*](https://pulp.plan.io/issues/6134) with design meeting minutes for this issue

## PulpExporter

PulpExporter is a type of Exporter that can create a PulpExport instance, which knows how to create a file that can be imported back into Pulp.
Users can configure a PulpExporter with settings and view a history for past Exports created by a PulpExporter. A PulpExporter can rely on the last Export it created to know the version(s) of the repo(s) that were exported, and for incremental exports, use those version(s) automatically as the base version(s).

### Fields

- **name** - a unique string to identify the exporter
- **path** - the location on the filesystem of a DIRECTORY where export-files should be written
- **repositories** - list of repos whose current-versions will be exported, **OR**
- **last_export** - [read-only] UUID of last Export instantiated by the PulpExporter, which will contain the list of base repository versions to export against. If there is no last-export-UUID, this is the first-time-in and a full export will happen.

### History

A user can view the history for a particular exporter. This consists of the list of *PulpExports* created by this *PulpExporter*. Exports include:

* date and time when the export occurred
* specific RepositoryVersions that were exported
* parameters that were used when exporting
* filename the export-file was written to
* SHA256 checksum of the resulting export-file

## REST API (pulpcore)

### API for Exporting

Pulpcore will provide a CRUDL endpoint at `/pulp/api/v3/exporters/core/pulp/` for working with *PulpExporters*. There will also be an endpoint for viewing the *Exports* instantiated by/from a specific *PulpExporter* at `/pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/`.
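As a rough illustration of how a client might address these two endpoints, the sketch below only builds the request pieces and sends nothing over the network. The helper names, the example host, and the assumption that `repositories` is passed as a list of repository hrefs are all hypothetical; only the endpoint paths and the `name`/`path`/`repositories` fields come from this design.

```python
import json
from urllib.parse import urljoin

# Endpoint path taken from this design document.
EXPORTERS_PATH = "/pulp/api/v3/exporters/core/pulp/"


def exporter_create_request(base_url, name, path, repositories):
    """Return (method, url, json_body) for creating a PulpExporter.

    Hypothetical helper: the body carries the name/path/repositories
    fields; passing repositories as hrefs is an assumption.
    """
    body = {"name": name, "path": path, "repositories": list(repositories)}
    return "POST", urljoin(base_url, EXPORTERS_PATH), json.dumps(body, sort_keys=True)


def exports_url(base_url, exporter_uuid):
    """Return the URL listing the Exports of one PulpExporter (hypothetical helper)."""
    return urljoin(base_url, f"{EXPORTERS_PATH}{exporter_uuid}/exports/")


if __name__ == "__main__":
    method, url, body = exporter_create_request(
        "https://pulp.example.com",  # hypothetical host
        "nightly",
        "/srv/exports",
        ["/pulp/api/v3/repositories/file/file/1234/"],  # hypothetical href
    )
    print(method, url)
    print(body)
    print(exports_url("https://pulp.example.com", "abcd"))
```

The resulting tuple can be handed to whichever HTTP client is in use; authentication and error handling are deliberately left out.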
#### API Specifics - working with PulpExporters

- `POST /pulp/api/v3/exporters/core/pulp/` - **CREATE** an Exporter
    - **name** - human-readable identifier for this Exporter
    - **repositories** - list of repositories whose current/latest version should be exported
    - **path** to the **DIRECTORY** where resulting tarfile will live
        - Assumption: either you get the complete tarfile, or, if Something Went Wrong, you get an error and no file
        - Required with no default - if we let pulp ‘guess’, we can run pulp itself out of storage
        - Must be at/under one of the directories in the `ALLOWED_EXPORT_PATHS` setting, or the create will fail
    - [TBD] **publications** (optional, phase-2) to be exported
    - [TBD] **distributions** (optional, phase-3) to be exported
- `GET /pulp/api/v3/exporters/core/pulp/` - **LIST** of all current Exporters
- `GET /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/` - **READ** specifics about an Exporter
- `PUT /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/` - completely **UPDATE** an Exporter
    - **name** - human-readable identifier for this Exporter
    - **repositories** - list of repositories whose current/latest version should be exported, **OR**
    - **path** - directory where resulting tarfile will live
    - [TBD] **publications** (optional, phase-2) to be exported (?)
    - [TBD] **distributions** (optional, phase-3) to be exported (?)
- `PATCH /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/` - partially **UPDATE** an Exporter
- `DELETE /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/` - **DELETE** an Exporter

#### API Specifics - working with Exports

- `POST /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/` - **CREATE AND EXECUTE** an Export
    - **full** - if true, ignore any previous Exports for this Exporter and generate a full export. If false, generate an incremental export, using the last existing Export as the base-version. Defaults to true.
    - **versions** - list of repository-versions to export. Must match the full set of Repositories being managed by the Exporter.
    - **start_versions** - list of repository-versions to start from when doing an incremental export. Must match the full set of Repositories being managed by the Exporter.
    - **chunk_size** - specify a chunk-size for an export-file. The export-file will be cut up into chunks no larger than `chunk_size`.
    - [TBD] **dry-run** - if provided, produce output describing what would have been produced, without actually writing anything to disk
- `GET /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/` - **LIST** of all Exports for the specific PulpExporter
- `GET /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/<export-uuid>/` - **READ** specifics about an Export
- `DELETE /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/<export-uuid>/` - **DELETE** an Export record

There is no way to **UPDATE** an Export. It is an historical record.

### API for Importing

Pulpcore will provide a CRUD endpoint for PulpImporters at `/pulp/api/v3/importers/core/pulp/`. There will also be an endpoint for viewing the import history of a PulpImporter at `/pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/`.

#### API Specifics - working with PulpImporters

- `POST /pulp/api/v3/importers/core/pulp/` - **CREATE** a PulpImporter
    - **repository** to import the repository version into
        - What if we let the repo be created by the repo-version?
        - Plugin authors would need to have modelresources for their repository metadata
        - How to handle incrementals?
        - What if the repo already exists?
        - What if the repo isn’t of the same type as the export?!? (whether we specify it or not)
        - Wait, multiple repos in export - now what?
    - **OR**,
    - **repository-mapping** - configuration-file that maps upstream-repositories to the downstream-repository they should import-into
- `GET /pulp/api/v3/importers/core/pulp/` - **LIST** of all current Importers
- `GET /pulp/api/v3/importers/core/pulp/<importer-uuid>/` - **READ** specifics about an Importer
- `PATCH /pulp/api/v3/importers/core/pulp/<importer-uuid>/` - partially **UPDATE** an Importer (subset of fields being updated)
- `PUT /pulp/api/v3/importers/core/pulp/<importer-uuid>/` - completely **UPDATE** an Importer (all fields required)
- `DELETE /pulp/api/v3/importers/core/pulp/<importer-uuid>/` - **DELETE** an Importer

#### API Specifics - working with Imports

- `POST /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/` - **CREATE** AND **EXECUTE** an Import
    - **file** (points to the output of an Export)
    - [TBD] **dry-run** - if provided, produce output describing what would have been imported and to where, without actually making any changes
        - Useful for validating the repository-mapping
- `GET /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/` - **LIST** of all Imports for the specific Importer
- `GET /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/<import-uuid>/` - **READ** specifics about an Import
- `DELETE /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/<import-uuid>/` - **DELETE** an Import record

There is no way to **UPDATE** an Import. It is an historical record.

## Plugin API (for pulpcore to call)

- Plugin author is responsible for a `<plugin>.app.modelresource.py` file, so that pulpcore can find it based on repo-type. This file will contain:
    - All of that plugin’s ModelResource subclasses
    - A specification of the order in which they must be imported as `IMPORT_ORDER`
    - see [pulp_rpm](https://github.com/pulp/pulp_rpm/blob/master/pulp_rpm/app/modelresource.py) for an example
- Pulpcore must include error handling when the plugin doesn't do export/import

## Infrastructure approach

This functionality builds on the [*Django import/export framework*](https://django-import-export.readthedocs.io/en/latest/getting_started.html#exporting-data)

### Plugin ModelResource requirements

- ModelResource for each object/entity to export/import - lives at `<plugin>.app.modelresource`
    - Set of **ModelResources** for each exportable/importable Model that plugin is responsible for
    - Which **fields** in those Models are export/import targets
- Order in which entities are/must be imported
    - Can ModelResource declare a set of dependencies/prereqs?
        - YES - use `IMPORT_ORDER` to list the order ModelResources need to be imported in
    - Do we have any possibility for circular dependencies?
        - Investigate use of before/after hooks provided - can we handle this complex case using them?
            - Large amount of work for low return
            - Can maybe use `before_import()` and `before_import_row()`

### Pulpcore ModelResource requirements

- Core\_artifact
- Core\_contentartifact
- Core\_content

## File format ("HOW is it all exported")

### Directory Layout

The export file needs to:

- Identify which export this was generated by (from filename)
- Identify the repository-version(s) that are generated (RepositoryVersionResource.json)
- Hold the data export for each modelresource for each repository-version (content of Repository directories)
- Hold the artifacts for the exported repository-version(s) (`artifact/` directory)
- Identify versions of pulpcore and any plugins that were used to generate the export (version.json file)

#### Data export

- Database information layout is defined by django export rules (one json per ModelResource).
- For multiple repository-versions, each version will be its own directory (named by repo-version-uuid)

#### Artifacts

Artifacts exist in their own directory structure:

`artifact/<first-2-digits-of-sha256>/<sha256-filenames>`

#### Example layout

```
tar xvzf export-bfcc1def-f96c-4448-b585-0a3ee2f5651c-20200406_1639.tarfile.gz
├── artifact
│   ├── 02
│   │   ├── 77b750a47e23de8a391ca5ccfa00e8dbb19419814a229d31f1d7ba382783d2
│   │   └── 9b6b0ee32ae77e2d4ef16fdd4d9ff271195077b84d0ffcb59f393bbae4520f
│   ├── 14
│   │   └── 39c3e561ae7ece3c5833a6962ea1e758c9fc08f6afb4c1fe0939278070d516
│   ├── 27
│   │   └── 066e7572fc068a2d96956962c8c1234bd8724a2e875faf8afc9bf83d75b63e
│   ├── 2e
│   │   └── d4cb8647fb22113a36e12d6517ea1b91b1c1c61c3460a63da951af4b357fad
│   ├── 32
│   │   └── 8c2300ddd7a91265480020322806407ee46bba06fe446ba5c5020f857d0db0
│   ├── 38
│   │   └── 87adfb7f9c05f66fdba6c5f56ec4de9dc582b38d134d2870ee2859d17ff4ed
│   ├── 39
│   │   └── a74d75f3d41435de6f8e7e3219ebd97bca67f359a601aae9eeaef6a3827ef8
│   ├── 3b
│   │   └── ead1283c3241f099d284f78259cbb4d5dfd957b3d89c447e2bf3aa5b5c6faf
│   ├── 42
│   │   ├── 6ad7b4d71fde3b2c6caf0127862d841348d5d8b51ca2d2dc7d8f9100be04c4
│   │   └── bcdff1fa507e9dcde127689b443da75f8ec06486caece235fa492d26025b8b
│   ├── 4c
│   │   └── 0fd86853fe97d38d03905313501482fcbce511031a9d3c9adabcf01afb4d5e
│   ├── 55
│   │   └── 9d66be52e52263831f952de4dce7a4331ca8c3965dc7bd9d774e84acf0c3c3
│   ├── 57
│   │   └── fb2a45f839deb52a37f8c816e04c16557c393393bff3d2bc84fe851ec7163a
│   ├── 60
│   │   └── 9f12688c71fc65b5e6b9d919c18d31857270c7680d1f9c2fb3d9851a0e045e
│   ├── 65
│   │   └── 79ce096b6fcaf2b47981b7840578f31a58420d33dc0fa2f5fe5dc86d8dfa8e
│   ├── 71
│   │   └── 1df5946d0aa22e599b17295d102d4457a0c899d1293499897e1282f6c15c7b
│   ├── 78
│   │   └── 6aeaa5f329c4be42e581ecb8dcbdc426cae7f4365cbb168e8121d564437e6d
│   ├── 80
│   │   └── c7cfeb3d159a45e79801392cbbf0779869ca9e82302428a8346d31e85395bc
│   ├── 88
│   │   └── aff49c2658261e523724eaa9e2cd38772f9bcc70986aae30503d83f699612e
│   ├── 89
│   │   └── 5ba3c4847f2760a645f1224ad2a3f5d4da4aa289998c81598747582127c754
│   ├── 9b
│   │   └── b29691f097fe67a4dfebbafd1c895e5b41e13f623daaffc5a5a95dde1fabea
│   ├── a3
│   │   └── 77c9b3099d3a29729bdc8e4c059d0850c19415e38220e7e469880bf9e9b9ff
│   ├── af
│   │   └── 1f188d9e2523b24302b5b91036039777bd2df6b23313aff89cc475eb3e4b07
│   ├── b0
│   │   └── a3cc6b250c03719626af7c41d337e1187a5f0b779fce6005b3a45fac09f490
│   ├── b5
│   │   └── f7a0f30e2d8eb907b9c9ee8e2237d87db7054eddadfd17f8c5ce169530f28d
│   ├── b6
│   │   └── 7ee9aba5b21244948e651c61b47c81816e0b96c08deb0732c92197ea9c0d36
│   ├── bc
│   │   └── a4539394799d6ee9d863465b76ea12f51f95e7d87dee1558dd8430ee75bb7c
│   ├── be
│   │   └── 96d9b80dbd539050f5459bb24a134bd719030c4fad4f01761ef1959384bc02
│   ├── bf
│   │   └── d98c5543feb484581714123e8e7f52af9263d6794e77c723894619d75a5034
│   ├── c0
│   │   └── 6a997a99f84f54a05935f67aae9fddf0e9eed7cadcd04464670da439bfa48c
│   ├── c2
│   │   └── 8c806da973e7b568995d6ac4c800aa3464ad91e93c338383bad6cd9c175c66
│   ├── c5
│   │   └── fe0911fbe98294054558107e3dfce6fd1f9928ea469020de60abedea835ae2
│   ├── d3
│   │   └── 5ae5a97f28280f3e435fb5174938818739530071568d4da4351655b05c2bf7
│   ├── db
│   │   └── 4b1affbf78bff20720eacab4558c37cf0406b8a63b1bffb80ae878246b086a
│   ├── e4
│   │   └── bed3c8f84903f0b5c4dd35579b280bd440c541f245f6fffb1fcdd79fc9a43e
│   └── ef
│       ├── d7c18b5f8762cc72c5a0c17150c881e8cf4ccf20df5bf2bd77c98c4bc56238
│       └── f95a379774c7622037bcb7ff347271dc727ae4ddce9b16017477b58488aa3e
├── pulpcore.app.modelresource.ArtifactResource.json
├── pulpcore.app.modelresource.RepositoryResource.json
├── repository-4aeeecdb-9589-4520-92f6-90bb171702c7_1
│   ├── pulpcore.app.modelresource.ContentArtifactResource.json
│   ├── pulpcore.app.modelresource.ContentResource.json
│   ├── pulpcore.app.modelresource.RepositoryVersionResource.json
│   └── pulp_file.app.modelresource.FileContentResource.json
├── repository-c1796efe-9e62-4bf3-9ebe-c03f0518809a_1
│   ├── pulpcore.app.modelresource.ContentArtifactResource.json
│   ├── pulpcore.app.modelresource.ContentResource.json
│   └── pulpcore.app.modelresource.RepositoryVersionResource.json
└── version.json
```

## Incremental Export

Incremental export is handled by making a `POST /pulp/api/v3/exporters/core/pulp/<uuid>/exports/` call and specifying `full=False`. In this case, the PulpExporter will generate an incremental dump based on the repo-version(s) listed in the latest Export instantiated by this PulpExporter, and the current-version of all the associated repositories.
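The default `full=False` behaviour can be pictured as pairing, per repository, the version recorded in the last Export with the repository's current version. The sketch below is only an illustration under stated assumptions: the function name, the use of plain integers for versions, and the repository keys are hypothetical. Treating a repository that is missing from the last Export as having no base (and so exported in full) follows the idea raised under Open Questions, not a settled rule.

```python
from typing import Dict, Optional, Tuple


def incremental_ranges(
    last_export: Dict[str, int],
    current: Dict[str, int],
) -> Dict[str, Tuple[Optional[int], int]]:
    """Map each repository to a (base_version, end_version) pair.

    Hypothetical helper. base_version is the version recorded in the last
    Export; None means the repository was not in that Export, so there is
    nothing to diff against. Repositories no longer managed by the
    exporter (absent from `current`) are simply not included.
    """
    return {repo: (last_export.get(repo), latest) for repo, latest in current.items()}


if __name__ == "__main__":
    # Hypothetical repository names and version numbers.
    previous = {"repo-a": 1, "repo-b": 3, "repo-old": 7}
    now = {"repo-a": 4, "repo-b": 3, "repo-c": 1}
    for repo, (base, end) in incremental_ranges(previous, now).items():
        print(repo, base, end)
```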
If you specify `versions=` as well, PulpExporter will generate an incremental dump of everything that happened after `last_export`, up to the specified versions (as opposed to current-latest).

If you specify `start_versions=`, PulpExporter will generate an incremental dump of everything that happened since the specified start_versions, up to `versions=` (if specified) or `last_export` (if no versions=).

## Edge cases/caveats

- How do we handle the case when a user wants to ‘unimport’ repo-version X?
    - Export-full on upstream, specifying the desired end-state repository-version, and import on downstream

## Open Questions

- ~~Currently, Pulp has the concept of Exporters (filesystem, rsync, etc) which are implemented as Master/Detail. This was done to accommodate the fact that some plugins will need to export publications while others might export repository versions. Do we divorce the concept of import/export and Exporters? Or bring the Exporters inline with import/export by looking into having core handle Exporters?~~
- On import, order can matter. Plugins will need to be able to define which entities need to be imported in which order. **Problem: how can we handle circular relations?**
- How are we going to handle publications/distributions? Esp in the presence of incrementals?
- Can users update exporters? What happens if I change the repos or the last_repository_versions?
    - Last-repo-version is now implied by last-Export.
    - The current-version of repos-in-exporter will be compared to exported-repo-versions in the most-recent Export. New repos will get a full-export, removed-repos won’t be exported. On import-side, new repos will get new versions, and removed repos will stop getting new repo-versions at import time.
    - **OR** - import-time can compare last-successful-import repos to incoming-repos, and delete repos that aren’t incoming? That sounds like a bad idea.
    - **Problem**: How does this interact with versions= and start_versions= validation? Should we even allow this, or should we recommend "create a new Exporter instead"?
- What does dry-run look like?
- How does the user create the repo-to-repo mapping used at import time?
    - Have an export include a file which has the mapping
    - Have dry run return the mapping
- [DONE] How will backwards incompatible changes in the exported data format be handled over time? e.g. I exported this data a loooong time ago and now the system I'm importing into expects a newer, different format?
    - idea 1: Tie the data format to the exported system's pulpcore version number. Put that version number into the exported data somehow. Then have systems importing know the oldest version they support and refuse to import older exports.
    - idea 2: Same idea as (1) except use its own numbering scheme, probably semver based
    - **Current Status**: versions.json included in export, with core and plugin versions for all plugins involved in the export. Import checks for **exact match** to versions.json on import, and errors if there are diffs.
- [DONE] Are we confident this design will allow us to later/easily add the ability to break up the data format over N, RAR files, isos, etc? Do we have an idea of the plan there even if we're not implementing it now?
    - **Current status**: If `chunk_size=` is specified, resulting tarfile is split into pieces of size <= chunk_size.

###### tags: `import/export`