
Pulp3 Import/Export Design

Introduction/problem statement

There is a use-case for extracting repository-versions out of a running
instance of Pulp into a file that can be transferred to another Pulp
instance and imported. This is not the Pulp-to-Pulp sync case; the
assumption is that the receiving Pulp instance is network-isolated. This
design aims to describe the requirements, limitations, and a possible
implementation for this functionality.

Additional Documents

Definitions

  • Upstream - the Pulp instance whose repository-version(s) we want
    to export
  • Downstream - the Pulp instance that will be importing those
    repository-version(s)
  • ModelResource - entity that understands how to map the metadata
    for a specific Model owned/controlled by a plugin to an exportable
    file-format (e.g., CSV or JSON) (see django-import-export)
  • Exporter - A resource that exports content from Pulp for a
    variety of different use cases
  • PulpExporter - A kind-of Exporter, that is specifically used
    to export data from an Upstream for consumption by a Downstream
  • PulpExport - a specific instantiation/run of a PulpExporter
  • Export file - tarfile containing database metadata and
    content-artifacts for a (set of) repository-versions, generated
    during execution of an Export
  • PulpImporter - A resource that accepts an Upstream
    PulpExporter export file, and manages the process of importing
    the data and artifacts included
  • PulpImport - a specific instantiation/run of a PulpImporter
  • Repository-mapping - a configuration file that provides the
    ability to map an Upstream repository, to a Downstream
    repository, into which the Upstream’s repository-version should be
    imported by a PulpImporter
  • Import order - for complicated repository-types, managing
    relationships requires that models be imported in order. Plugins are
    responsible for specifying the import-order of the
    ModelResources they own

Workflow

“First time” setup and processing

  1. Upstream user (UU) defines a PulpExporter for a set of repositories
  2. UU does a dry-run of an Export for that Exporter
  3. UU executes an Export for that Exporter
  4. UU hands the resulting export-file to a Downstream User (DU)
  5. DU defines a PulpImporter
  6. DU does a dry-run of an Import, using the provided
    file, to generate a proposed repository-mapping between the Upstream
    repositories referenced in the export-file, and available Downstream
    repositories.
  7. DU modifies the output of the dry-run to correctly match Upstream
    and Downstream repositories
  8. DU modifies the PulpImporter defined previously with the provided mapping
  9. DU executes an Import for that PulpImporter with the provided export-file

“Steady state”/regular processing

  1. UU executes an Export for that Exporter
  2. UU hands the resulting export-file to a Downstream User (DU)
  3. DU executes an Import for that Importer with the provided export-file

Requirements

  • [DONE] Need to export Model info (i.e. data from the database)
  • [DONE] Need to export artifacts (i.e. files from the filesystem)
  • [DONE] Need to be able to handle multiple repository-versions at once
  • On export, need to be able to specify two versions and export only
    the ‘diff’ between them
  • On import, need to be able to recognize a diff, and be able to
    specify a ‘base version’ on top of which that diff should be played
  • [DONE] A given artifact should only be in the export once, even if it
    exists in multiple repositories
  • [DONE] Need to be able to import into downstream repositories with
    arbitrary names (because downstream cannot be assumed to be
    identical to upstream)
  • [DONE] Need to be able to handle any repository type
  • Need to respond gracefully if a given repository type does not
    support export/import
  • [DONE] Need to minimize the amount of work for plugin authors (to reduce
    implementation friction)
  • Need to minimize the amount of work on the user (to reduce
    acceptance friction)
  • [DONE] Need to allow for complicated scenarios, while making the simple
    “export version-N from repo-R and import it into the same repo
    elsewhere” case simple
  • Need to provide a “dry-run” capability for export that will produce
    a description of what would be exported and a first-order
    approximation of the required disk space
  • Need to provide a “dry-run” capability on import in order to ensure
    that the provided export file and repo-mapping “make sense”
  • Optional: export takes only a repository-id, and produces an export
    for the ‘latest’ repository-version
  • [DONE] Need to provide a history to the user to see information about the
    last time they exported
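
As an illustration of the "artifact appears only once" and "dry-run disk-space estimate" requirements above, here is a minimal sketch. The function name `estimate_export_size` and the input shape (repository name mapped to `(sha256, size)` pairs) are assumptions made for this example, not part of the design.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical input shape: repository name -> iterable of (sha256, size_in_bytes).
RepoArtifacts = Dict[str, Iterable[Tuple[str, int]]]


def estimate_export_size(repos: RepoArtifacts) -> Tuple[int, int]:
    """Return (unique_artifact_count, total_bytes), counting each sha256 once."""
    unique: Dict[str, int] = {}
    for artifacts in repos.values():
        for sha256, size in artifacts:
            # An artifact shared by several repositories is only counted once.
            unique.setdefault(sha256, size)
    return len(unique), sum(unique.values())


sample = {
    "repo-a": [("aa11", 100), ("bb22", 250)],
    "repo-b": [("bb22", 250), ("cc33", 50)],
}
count, total = estimate_export_size(sample)
```

This only approximates artifact payload; the size of the exported database metadata is not included.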

Existing documents/discussion

  • Initial pulp issue 5096
  • Current epic 6134 with design
    meeting minutes for this issue

PulpExporter

PulpExporter is a type of Exporter that can create a PulpExport instance,
which knows how to create a file that can be imported back into Pulp.
Users can configure a PulpExporter with settings and view a history for
past Exports created by a PulpExporter.

A PulpExporter can rely on the last Export it created to know the
version(s) of the repo(s) that were exported, and for incremental exports, use
those version(s) automatically as the base version(s).

Fields

  • name - a unique string to identify the exporter
  • path - the location on the filesystem of a DIRECTORY where export-files
    should be written
  • repositories - list of repos whose current-versions will be exported, OR
  • last_export - [read-only] UUID of last Export instantiated by the PulpExporter,
    which will contain the list of base repository versions to export
    against. If there is no last-export-UUID, this is the first-time-in
    and a full export will happen.
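
The fields above could be modelled roughly as follows. This is a plain-Python sketch, not the actual Django model; the class name `PulpExporterSketch` and the helper `next_export_is_full` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from uuid import UUID, uuid4


@dataclass
class PulpExporterSketch:
    name: str                                   # unique identifier
    path: str                                   # directory that export-files are written to
    repositories: List[str] = field(default_factory=list)
    last_export: Optional[UUID] = None          # read-only in the design; set by an export run

    def next_export_is_full(self, full_requested: bool) -> bool:
        """With no previous export there is no base to diff against, so export everything."""
        return full_requested or self.last_export is None


exporter = PulpExporterSketch(name="nightly", path="/exports/nightly", repositories=["rpms"])
first_run_is_full = exporter.next_export_is_full(full_requested=False)
```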

History

A user can view the history for a particular exporter. This consists of
the list of PulpExports created by this PulpExporter. Exports include:

  • date and time when the export occurred
  • specific RepositoryVersions that were exported
  • parameters that were used when exporting
  • filename the export-file was written to
  • SHA256 checksum of the resulting export-file
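
The SHA256 checksum recorded in the history could be computed along these lines. This is a sketch using only the standard library; the function name `sha256_of_file` and the block size are assumptions.

```python
import hashlib


def sha256_of_file(path: str, block_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size blocks so large export-files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

A Downstream user can run the same function on the received file and compare the result with the value reported by the Upstream.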

REST API (pulpcore)

API for Exporting

Pulpcore will provide a CRUDL endpoint at /pulp/api/v3/exporters/core/pulp/
for working with PulpExporters.

There will also be an endpoint for viewing the Exports instantiated
by/from a specific PulpExporter at /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/.

API Specifics - working with PulpExporters

  • POST /pulp/api/v3/exporters/core/pulp/ - CREATE an Exporter
    • name - human-readable identifier for this Exporter
    • repositories - list of repositories whose current/latest
      version should be exported
    • path - the DIRECTORY where the resulting tarfile will live
      • Assumption: either you get the complete tarfile, or, if
        Something Went Wrong, you get an error and no file
      • Required with no default - if we let pulp ‘guess’, we can
        run pulp itself out of storage
      • Must be at/under one of the directories in the ALLOWED_EXPORT_PATHS
        setting, or the create will fail
    • [TBD] publications (optional, phase-2) to be exported
    • [TBD] distributions (optional, phase-3) to be exported
  • GET /pulp/api/v3/exporters/core/pulp/ - LIST of all current Exporters
  • GET /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/ - READ specifics
    about an Exporter
  • PUT /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/ - completely UPDATE
    an Exporter
    • name - human-readable identifier for this Exporter
    • repositories - list of repositories whose current/latest
      version should be exported, OR
    • path - directory where resulting tarfile will live
    • [TBD] publications (optional, phase-2) to be exported (?)
    • [TBD] distributions (optional, phase-3) to be exported(?)
  • PATCH /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/ - partially UPDATE
    an Exporter
  • DELETE /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/ - DELETE an
    Exporter
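
The rule that path must be at or under one of the ALLOWED_EXPORT_PATHS directories could be checked roughly like this. The function name `is_allowed_export_path` is hypothetical, and the sketch assumes POSIX-style absolute paths.

```python
import os
from typing import Iterable


def is_allowed_export_path(path: str, allowed_export_paths: Iterable[str]) -> bool:
    """True if `path` is at or under one of the allowed directories."""
    # realpath() resolves symlinks and ".." so the check cannot be bypassed that way.
    candidate = os.path.realpath(path)
    for allowed in allowed_export_paths:
        root = os.path.realpath(allowed)
        if os.path.commonpath([candidate, root]) == root:
            return True
    return False
```

Comparing with `commonpath` rather than a string prefix avoids accepting a sibling such as `/exports-other` when only `/exports` is allowed.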

API Specifics - working with Exports

  • POST /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/ -
    CREATE AND EXECUTE an Export
    • full - if true, ignore any previous Exports for this
      Exporter and generate a full export. If false,
      generate an incremental export, using the last existing Export
      as the base-version. Default is 'True'
    • versions - list of repository-versions to export.
      Must match the full set of Repositories being managed by the Exporter.
    • start_versions - list of repository-versions to start from when doing an
      incremental export. Must match the full set of Repositories being managed by the Exporter.
    • chunk_size - specify a chunk-size for an export-file.
      Export-file will be cut up into chunks of no more than (chunk_size).
    • [TBD] dry-run - if provided, produce output describing what would
      have been produced, without actually writing anything to disk
  • GET /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/ - LIST
    of all Exports for the specific PulpExporter
  • GET /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/<export-uuid>/ -
    READ specifics about an Export
  • DELETE /pulp/api/v3/exporters/core/pulp/<exporter-uuid>/exports/<export-uuid>/ -
    DELETE an Export record

There is no way to UPDATE an Export. It is an historical record.
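
To illustrate the chunk_size parameter described above, here is a minimal in-memory sketch of cutting data into pieces of no more than chunk_size bytes. The function name `iter_chunks` is an assumption; a real implementation would presumably stream from disk rather than hold the export in memory.

```python
from typing import Iterator


def iter_chunks(data: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield consecutive pieces of `data`, each at most `chunk_size` bytes long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


pieces = list(iter_chunks(b"abcdefghij", 4))
# Concatenating the pieces in order restores the original export-file.
restored = b"".join(pieces)
```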

API for Importing

Pulpcore will provide a CRUD endpoint for PulpImporters at
/pulp/api/v3/importers/core/pulp/.

There will also be an endpoint for viewing the import history of a
PulpImporter at /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/.

API Specifics - working with PulpImporters

  • POST /pulp/api/v3/importers/core/pulp/ - CREATE a PulpImporter
    • repository to import the repository version into
      • What if we let the repo be created by the repo-version?
        • Plugin authors would need to have modelresources for
          their repository metadata
        • How to handle incrementals?
        • What if the repo already exists?
        • What if the repo isn’t of the same type as the export?!?
          (whether we specify it or not)
        • Wait, multiple repos in export - now what?
      • OR,
    • repository-mapping - configuration-file that maps
      upstream-repositories to the downstream-repository they should
      import-into
  • GET /pulp/api/v3/importers/core/pulp/ - LIST of all current Importers
  • GET /pulp/api/v3/importers/core/pulp/<importer-uuid>/ - READ specifics
    about an Importer
  • PATCH /pulp/api/v3/importers/core/pulp/<importer-uuid>/ - partial
    UPDATE an Importer (subset of fields being updated)
  • PUT /pulp/api/v3/importers/core/pulp/<importer-uuid>/ - completely
    UPDATE an Importer (all fields required)
  • DELETE /pulp/api/v3/importers/core/pulp/<importer-uuid>/ - DELETE an
    Importer
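
One possible shape for the repository-mapping is a JSON object keyed by Upstream repository name. The sketch below assumes that shape and assumes that unmapped repositories fall back to a Downstream repository of the same name, which would keep the simple case simple; neither assumption is fixed by this design.

```python
import json
from typing import Dict


def resolve_downstream(upstream_name: str, mapping: Dict[str, str]) -> str:
    """Map an Upstream repository name to a Downstream one, defaulting to the same name."""
    return mapping.get(upstream_name, upstream_name)


# Hypothetical mapping content, as it might be stored on a PulpImporter.
mapping = json.loads('{"upstream-rpms": "mirror-rpms"}')
mapped = resolve_downstream("upstream-rpms", mapping)
unmapped = resolve_downstream("files", mapping)
```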

API Specifics - working with Imports

  • POST /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/ -
    CREATE AND EXECUTE an Import
    • file (points to the output of an Export)
    • [TBD] dry-run - if provided, produce output describing what would
      have been imported and to where, without actually making any
      changes
      • Useful for validating the repository-mapping
  • GET /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/ - LIST
    of all Imports for the specific Importer
  • GET /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/<import-uuid>/ -
    READ specifics about an Import
  • DELETE /pulp/api/v3/importers/core/pulp/<importer-uuid>/imports/<import-uuid>/ -
    DELETE an Import record

There is no way to UPDATE an Import. It is an historical record.
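
A dry-run import could validate the repository-mapping before anything is changed. The following sketch shows one way to do that; the function name `validate_mapping` and the rule that two Upstream repositories may not share a Downstream target are assumptions for illustration.

```python
from typing import Dict, Iterable, List


def validate_mapping(
    upstream_repos: Iterable[str],
    mapping: Dict[str, str],
    downstream_repos: Iterable[str],
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the mapping looks sane."""
    available = set(downstream_repos)
    targets: Dict[str, str] = {}
    problems: List[str] = []
    for upstream in upstream_repos:
        downstream = mapping.get(upstream, upstream)
        if downstream not in available:
            problems.append(f"{upstream}: downstream repository '{downstream}' does not exist")
        elif downstream in targets:
            problems.append(f"{upstream}: '{downstream}' is already the target of {targets[downstream]}")
        else:
            targets[downstream] = upstream
    return problems
```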

Plugin API (for pulpcore to call)

  • Plugin author is responsible for a <plugin>.app.modelresource.py file, so that
    pulpcore can find it based on repo-type. This file will contain:
    • All of that plugin’s ModelResource subclasses
    • A specification of the order in which they must be imported as IMPORT_ORDER
    • see pulp_rpm for an example
  • Pulpcore must include error handling when the plugin doesn't do
    export/import
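
The lookup and error handling described above might look roughly like this on the pulpcore side. The function name `load_import_order` is hypothetical, and the demonstration registers a stand-in module instead of a real plugin; strings stand in for the ModelResource classes a real IMPORT_ORDER would list.

```python
import importlib
import sys
import types
from typing import List, Optional


def load_import_order(plugin: str) -> Optional[List[str]]:
    """Return the plugin's IMPORT_ORDER, or None if it does not support export/import."""
    try:
        module = importlib.import_module(f"{plugin}.app.modelresource")
    except ImportError:
        return None
    return getattr(module, "IMPORT_ORDER", None)


# Stand-in for a plugin's modelresource.py, registered only for this demonstration.
fake = types.ModuleType("fake_plugin.app.modelresource")
fake.IMPORT_ORDER = ["ParentResource", "ChildResource"]
sys.modules["fake_plugin.app.modelresource"] = fake

supported = load_import_order("fake_plugin")
unsupported = load_import_order("plugin_without_modelresource")
```

Returning None lets the caller report "this repository type does not support export/import" instead of failing with a traceback.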

Infrastructure approach

This functionality builds on the django-import-export
framework.

Plugin ModelResource requirements

  • ModelResource for each object/entity to export/import
    • lives at <plugin>.app.modelresource
    • Set of ModelResources for each exportable/importable Model
      that plugin is responsible for
    • Which fields in those Models are export/import targets
    • Order in which entities are/must be imported
      • Can ModelResource declare a set of dependencies/prereqs?
        • YES - use IMPORT_ORDER to list the order ModelResources need to be imported in
      • Do we have any possibility for circular dependencies?
      • Investigate use of before/after hooks provided - can we handle this complex case using them?
        • Large amount of work for low return
        • Can maybe use before_import() and before_import_row()

Pulpcore ModelResource requirements

  • Core_artifact
  • Core_contentartifact
  • Core_content

File format ("HOW is it all exported")

Directory Layout

The export file needs to:

  • Identify which export this was generated by (from filename)
  • Identify the repository-version(s) that are generated (RepositoryVersionResource.json)
  • Hold the data export for each modelresource for each repository-version (content of Repository directories)
  • Hold the artifacts for the exported repository-version(s) (artifact/ directory)
  • Identify versions of pulpcore and any plugins that were used to generate the export (version.json file)

Data export

  • Database information layout is defined by django export rules (one
    json per ModelResource).
  • For multiple repository-versions, each version will be its own
    directory (named by repo-version-uuid)
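
A rough sketch of the "one json per ModelResource, one directory per repository-version" layout, built as a gzipped tarfile in memory with the standard library. The function name `build_export_tar` and the input shapes are assumptions; the real export would obtain the JSON from django-import-export rather than from plain dicts.

```python
import io
import json
import tarfile
from typing import Dict, List


def build_export_tar(per_version: Dict[str, Dict[str, List[dict]]], versions: dict) -> bytes:
    """per_version maps a directory name to {resource name -> rows to serialize}."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:

        def add_json(name: str, payload) -> None:
            data = json.dumps(payload).encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

        add_json("version.json", versions)
        for directory, resources in per_version.items():
            for resource_name, rows in resources.items():
                add_json(f"{directory}/{resource_name}.json", rows)
    return buffer.getvalue()
```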

Artifacts

Artifacts exist in their own directory structure:

artifact/<first-2-digits-of-sha256>/<sha256-filenames>

Example layout

tar xvzf export-bfcc1def-f96c-4448-b585-0a3ee2f5651c-20200406_1639.tarfile.gz
├── artifact
│   ├── 02
│   │   ├── 77b750a47e23de8a391ca5ccfa00e8dbb19419814a229d31f1d7ba382783d2
│   │   └── 9b6b0ee32ae77e2d4ef16fdd4d9ff271195077b84d0ffcb59f393bbae4520f
│   ├── 14
│   │   └── 39c3e561ae7ece3c5833a6962ea1e758c9fc08f6afb4c1fe0939278070d516
│   ├── 27
│   │   └── 066e7572fc068a2d96956962c8c1234bd8724a2e875faf8afc9bf83d75b63e
│   ├── 2e
│   │   └── d4cb8647fb22113a36e12d6517ea1b91b1c1c61c3460a63da951af4b357fad
│   ├── 32
│   │   └── 8c2300ddd7a91265480020322806407ee46bba06fe446ba5c5020f857d0db0
│   ├── 38
│   │   └── 87adfb7f9c05f66fdba6c5f56ec4de9dc582b38d134d2870ee2859d17ff4ed
│   ├── 39
│   │   └── a74d75f3d41435de6f8e7e3219ebd97bca67f359a601aae9eeaef6a3827ef8
│   ├── 3b
│   │   └── ead1283c3241f099d284f78259cbb4d5dfd957b3d89c447e2bf3aa5b5c6faf
│   ├── 42
│   │   ├── 6ad7b4d71fde3b2c6caf0127862d841348d5d8b51ca2d2dc7d8f9100be04c4
│   │   └── bcdff1fa507e9dcde127689b443da75f8ec06486caece235fa492d26025b8b
│   ├── 4c
│   │   └── 0fd86853fe97d38d03905313501482fcbce511031a9d3c9adabcf01afb4d5e
│   ├── 55
│   │   └── 9d66be52e52263831f952de4dce7a4331ca8c3965dc7bd9d774e84acf0c3c3
│   ├── 57
│   │   └── fb2a45f839deb52a37f8c816e04c16557c393393bff3d2bc84fe851ec7163a
│   ├── 60
│   │   └── 9f12688c71fc65b5e6b9d919c18d31857270c7680d1f9c2fb3d9851a0e045e
│   ├── 65
│   │   └── 79ce096b6fcaf2b47981b7840578f31a58420d33dc0fa2f5fe5dc86d8dfa8e
│   ├── 71
│   │   └── 1df5946d0aa22e599b17295d102d4457a0c899d1293499897e1282f6c15c7b
│   ├── 78
│   │   └── 6aeaa5f329c4be42e581ecb8dcbdc426cae7f4365cbb168e8121d564437e6d
│   ├── 80
│   │   └── c7cfeb3d159a45e79801392cbbf0779869ca9e82302428a8346d31e85395bc
│   ├── 88
│   │   └── aff49c2658261e523724eaa9e2cd38772f9bcc70986aae30503d83f699612e
│   ├── 89
│   │   └── 5ba3c4847f2760a645f1224ad2a3f5d4da4aa289998c81598747582127c754
│   ├── 9b
│   │   └── b29691f097fe67a4dfebbafd1c895e5b41e13f623daaffc5a5a95dde1fabea
│   ├── a3
│   │   └── 77c9b3099d3a29729bdc8e4c059d0850c19415e38220e7e469880bf9e9b9ff
│   ├── af
│   │   └── 1f188d9e2523b24302b5b91036039777bd2df6b23313aff89cc475eb3e4b07
│   ├── b0
│   │   └── a3cc6b250c03719626af7c41d337e1187a5f0b779fce6005b3a45fac09f490
│   ├── b5
│   │   └── f7a0f30e2d8eb907b9c9ee8e2237d87db7054eddadfd17f8c5ce169530f28d
│   ├── b6
│   │   └── 7ee9aba5b21244948e651c61b47c81816e0b96c08deb0732c92197ea9c0d36
│   ├── bc
│   │   └── a4539394799d6ee9d863465b76ea12f51f95e7d87dee1558dd8430ee75bb7c
│   ├── be
│   │   └── 96d9b80dbd539050f5459bb24a134bd719030c4fad4f01761ef1959384bc02
│   ├── bf
│   │   └── d98c5543feb484581714123e8e7f52af9263d6794e77c723894619d75a5034
│   ├── c0
│   │   └── 6a997a99f84f54a05935f67aae9fddf0e9eed7cadcd04464670da439bfa48c
│   ├── c2
│   │   └── 8c806da973e7b568995d6ac4c800aa3464ad91e93c338383bad6cd9c175c66
│   ├── c5
│   │   └── fe0911fbe98294054558107e3dfce6fd1f9928ea469020de60abedea835ae2
│   ├── d3
│   │   └── 5ae5a97f28280f3e435fb5174938818739530071568d4da4351655b05c2bf7
│   ├── db
│   │   └── 4b1affbf78bff20720eacab4558c37cf0406b8a63b1bffb80ae878246b086a
│   ├── e4
│   │   └── bed3c8f84903f0b5c4dd35579b280bd440c541f245f6fffb1fcdd79fc9a43e
│   └── ef
│       ├── d7c18b5f8762cc72c5a0c17150c881e8cf4ccf20df5bf2bd77c98c4bc56238
│       └── f95a379774c7622037bcb7ff347271dc727ae4ddce9b16017477b58488aa3e
├── pulpcore.app.modelresource.ArtifactResource.json
├── pulpcore.app.modelresource.RepositoryResource.json
├── repository-4aeeecdb-9589-4520-92f6-90bb171702c7_1
│   ├── pulpcore.app.modelresource.ContentArtifactResource.json
│   ├── pulpcore.app.modelresource.ContentResource.json
│   ├── pulpcore.app.modelresource.RepositoryVersionResource.json
│   └── pulp_file.app.modelresource.FileContentResource.json
├── repository-c1796efe-9e62-4bf3-9ebe-c03f0518809a_1
│   ├── pulpcore.app.modelresource.ContentArtifactResource.json
│   ├── pulpcore.app.modelresource.ContentResource.json
│   └── pulpcore.app.modelresource.RepositoryVersionResource.json
└── version.json
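
The artifact paths in the layout above can be derived from a digest as sketched below. This assumes, as the example suggests, that the two-character directory name plus the filename together make up the full 64-character sha256; the function name `artifact_relative_path` is hypothetical.

```python
import posixpath


def artifact_relative_path(sha256: str) -> str:
    """Return the path of an artifact inside the export, relative to the tarfile root."""
    if len(sha256) != 64 or any(c not in "0123456789abcdef" for c in sha256):
        raise ValueError("expected a 64-character lowercase hex sha256 digest")
    return posixpath.join("artifact", sha256[:2], sha256[2:])


path = artifact_relative_path(
    "0277b750a47e23de8a391ca5ccfa00e8dbb19419814a229d31f1d7ba382783d2"
)
```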

Incremental Export

Incremental export is handled by making a POST /pulp/api/v3/exporters/core/pulp/<uuid>/exports/ call and specifying
full=False. In this case, the PulpExporter will generate an incremental
dump based on the repo-version(s) listed in the latest Export
instantiated by this PulpExporter, and the current-version of all the
associated repositories.

If you specify versions= as well, PulpExporter will generate an incremental dump
of everything that happened after last_export, up to the specified versions (as opposed to current-latest).

If you specify start_versions=, PulpExporter will generate an incremental dump
of everything that happened since the specified start_versions, up to versions= (if specified) or last_export (if no versions=).
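
The interaction of full, versions= and start_versions= can be sketched as a function that resolves a (start, end) pair per repository. The names and the dict-of-version-numbers shape are hypothetical. The sketch also assumes that the end point defaults to the current version whenever versions= is not given, which is only one reading of the text above, and that a repository with no known start version gets a full export.

```python
from typing import Dict, Optional, Tuple

Versions = Dict[str, int]  # repository name -> version number (hypothetical shape)


def resolve_export_range(
    current: Versions,
    last_export: Optional[Versions] = None,
    full: bool = True,
    versions: Optional[Versions] = None,
    start_versions: Optional[Versions] = None,
) -> Dict[str, Tuple[Optional[int], int]]:
    """Return repository -> (start, end); a start of None means a full export."""
    for label, given in (("versions", versions), ("start_versions", start_versions)):
        if given is not None and set(given) != set(current):
            raise ValueError(f"{label} must cover exactly the exporter's repositories")
    end = versions if versions is not None else current
    if full:
        start: Versions = {}
    elif start_versions is not None:
        start = start_versions
    else:
        start = last_export or {}
    return {repo: (start.get(repo), end[repo]) for repo in current}
```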

Edge cases/caveats

  • How do we handle the case when a user wants to ‘unimport’ repo-version
    X?
    • Export-full on upstream, specifying the desired end-state repository-version, and import on downstream

Open Questions

  • Currently, Pulp has the concept of Exporters (filesystem, rsync,
    etc) which are implemented as Master/Detail. This was done to
    accommodate the fact that some plugins will need to export
    publications while others might export repository versions. Do we
    divorce the concept of import/export and Exporters? Or bring the
    Exporters in line with import/export by looking into having core
    handle Exporters?
  • On import, order can matter. Plugins will need to be able to define
    which entities need to be imported in which order. Problem: how can
    we handle circular relations?
  • How are we going to handle publications/distributions? Esp in the
    presence of incrementals?
  • Can users update exporters? What happens if I change the repos or
    the last_repository_versions?
    • Last-repo-version is now implied by the Exporter's last_export.
    • The current-version of repos-in-exporter will be compared to
      exported-repo-versions in most-recent Export. New repos will
      get a full-export, removed-repos won’t be exported. On
      import-side, new repos will get new versions, and removed
      repos will stop getting new repo-versions at import time.
    • OR - import-time can compare last-successful-import repos, to
      incoming-repos, and delete repos that aren’t incoming? That
      sounds like a bad idea.
    • Problem: How does this interact with versions= and start_versions= validation?
      Should we even allow this, or should we recommend "create a new Exporter instead"?
  • What does dry-run look like?
  • How does the user create the repo-to-repo mapping used at import
    time?
    • Have an export include a file which has the mapping
    • Have dry run return the mapping
  • [DONE] How will backwards incompatible changes in the exported data format be handled over time? e.g. I exported this data a loooong time ago and now the system I'm importing into expects a newer, different format?
    • idea 1: Tie the data format to the exported system's pulpcore version number. Put that version number into the exported data somehow. Then have systems importing know the oldest version they support and refuse to import older exports.
    • idea 2: Same idea as (1) except use its own numbering scheme, probably semver based
    • Current Status: version.json included in export, with core and plugin versions for all
      plugins involved in the export. Import checks for exact match to version.json on import,
      and errors if there are diffs.
  • [DONE] Are we confident this design will allow us to later/easily add the ability to break up the data format over N, RAR files, isos, etc? Do we have an idea of the plan there even if we're not implementing it now?
    • Current status: If chunk_size= is specified, the resulting tarfile is split into pieces of
      size <= chunk_size.
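
The exact-match version check described in the data-format question above could be sketched as follows. The function name `version_mismatches` and the flat component-to-version dict are assumptions; the actual structure of the version file is not specified in this document.

```python
from typing import Dict, List


def version_mismatches(exported: Dict[str, str], installed: Dict[str, str]) -> List[str]:
    """List every component whose exported version differs from the Downstream's."""
    problems: List[str] = []
    for component, version in sorted(exported.items()):
        found = installed.get(component)
        if found != version:
            problems.append(
                f"{component}: export has {version}, downstream has {found or 'not installed'}"
            )
    return problems
```

An import would proceed only when the returned list is empty.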
tags: import/export