# Pulp 2to3 Migration plugin overview

## Key concepts

- **Migration plan** defines what should be migrated from pulp2 and where in pulp3 it should go.
- **Pre-migration** stage is responsible for making a "snapshot" of the pulp2 data needed for migration and storing it in the pulp3 db.
- **Migration** stage is responsible for creating all the necessary records in pulp3 db and moving artifacts into the pulp3 storage.
- *[not relevant for Satellite]* Migration can happen on a **per-plugin basis**, aka one plugin is still working with pulp2 while another is migrated and fully working with pulp3.
- Migration can be **re-run multiple times** and, if not reset, it will always be **incremental**.

## Configuration

https://docs.pulpproject.org/pulp_2to3_migration/configuration.html

The migration plugin provides [default settings](https://github.com/pulp/pulp-2to3-migration/blob/a713b455d699df68a99b853c6d57c2260ce5b21a/pulp_2to3_migration/app/settings.py) which are applied on top of the pulpcore ones.

## REST API

https://docs.pulpproject.org/pulp_2to3_migration/restapi.html

* migration plan
  * CRD for migration plan
  * run migration based on migration plan
  * reset migration based on a specific migration plan
* view a mapping of pulp2 content to pulp3 content
* view a mapping of a pulp2 repo to a pulp3 repo/version/remote/publication/distribution

## Tasks

- migrate_from_pulp2
  * creates a task group and adds itself to it
  * [performs pre-migration and migration](https://github.com/pulp/pulp-2to3-migration/blob/e243fcc21b16fcc4589027f8339bbef472bedb85/pulp_2to3_migration/app/tasks/migrate.py#L76-L82)

  Pre-migration and content migration happen in one task. This avoids overloading the system with heavy parallel tasks and keeps the data correct (e.g. some content depends on other content). Supporting incremental re-runs is also less error-prone when everything happens in one task.
  The last step of the task is responsible for creating repository versions, publications and distributions.
  This happens after all content is migrated. This step triggers many tasks which run in parallel. There are as many tasks as there are repositories in pulp3(!), which are defined in the migration plan.
- reset_pulp3_data

  Removes data from pulp 3 db for the plugins specified in a migration plan so the migration can be started from scratch and not be incremental. Does not remove artifacts. It is a long task because resetting happens on a per-plugin basis, carefully picking and removing only the pulp 3 data relevant to a specific plugin.
  Potential improvement: a separate code path to remove all pulp 3 data, which should be much faster than the per-plugin one.

## migrate_from_pulp2 task flow

- validate the migration plan if asked
- perform pre-migration, order of the steps matters (This is the main load on MongoDB in terms of reading data. Most reads are simple: either all records are requested or a specific one by its pk. One exception is Errata content pre-migration, which uses an aggregation pipeline and can create some load on MongoDB.)
  * pre-migrate everything apart from content (repositories, importers, distributors, repo-content relations)
    * pre-migrate only new and changed resources on the re-run if possible
  * pre-migrate content (all content for specified plugins, even if some repositories are not being migrated)
    * only newly added or changed content since the last run
  * analyze/mark outdated resources in pre-migration tables (e.g. something got removed in pulp2 between the runs)
  * analyze/remove potentially conflicting resources in pulp3 which are outdated (e.g. publications/distributions because something changed in pulp2 for those repos)
- perform migration, order of the steps matters (Migration only deals with pre-migrated data in pulp3 db, there should be no load on MongoDB at this point.)
  * migrate repositories (literally just create Repository objects in pulp3, no content is migrated at this stage)
  * migrate importers (just create Remotes in pulp3)
  * migrate content
    * the heaviest and longest part
    * uses stages API
    * creates hardlinks for downloaded content in pulp3 storage
    * creates Artifacts, Content, RemoteArtifacts, ... everything needed for content
    * relates pulp2 and pulp3 content
  * create repository versions, publications and distributions
    * triggers multiple tasks to run in parallel, one per pulp3 repo
    * adds content to repositories, publishes and distributes created repo versions
    * relates pulp2 repository and pulp3 repository/version/publication/distribution

## Katello aspects

Katello task which triggers migration performs the following:

- generate migration plan + maybe some katello-only stuff
- **create migration plan in Pulp and trigger pulp's migrate_from_pulp2 task**
- after the pulp task is done, katello imports the data it needs to index all the pulp3 data properly + maybe some katello-only stuff
  * they are using mappings (pulp2repositories/ and pulp2content/ endpoints) to get all data they need about pulp2 and pulp3 relations
  * generate the list of missing or corrupted content

## Details

### Migration Plan

A migration plan defines which plugin(s) to migrate to pulp3. It also allows specifying:

* which pulp2 repo(s) migrate into which repository in pulp 3
  * it can be a list of pulp2 repos for one pulp3 repo. That means each pulp2 repo will become a separate repository version of that pulp3 repo. Order is not guaranteed.
* which pulp2 importer to migrate and to which pulp3 repo it should be mapped
  * for Katello, aka pulp-2to3-migration 0.11.z, the mapping is for the `pulp2repositories/` endpoint only, there is no db relation between a repo and a remote. In 0.12+, that db relationship is there.
  * there can be only one pulp2 importer specified per one pulp3(!) repo
  * the same pulp2 importer can be used for multiple repos
* which pulp2 distributor to migrate and for which migrated pulp2 repo it should be used
  * there can be multiple pulp2 distributors for one pulp2 repo (e.g. to publish/distribute the same repo version under different paths)
  * each pulp2 distributor can be specified only once in the migration plan (pulp2 distributor defines a base_path, so if specified multiple times, it would cause a path overlap error)
* which signing service to map to a pulp3 repo
  * it does not cause a signing service to be created. Signing services need to be created upfront in pulp3.
  * currently supported only by pulp_deb and in 0.15+

Migration plan schema is defined [here](https://github.com/pulp/pulp-2to3-migration/blob/main/pulp_2to3_migration/app/json_schema.py). There can be a simple or a complex plan. See [examples in the docs](https://docs.pulpproject.org/pulp_2to3_migration/migration_plan.html).

* simple plan
  * only plugins are specified, all repos are migrated
  * each pulp2 repo is migrated into a separate pulp3 repo, same goes for importers and distributors
  * under the hood, the simple plan is converted to a complex one before the actual migration starts
* complex plan (used by Katello)
  * only repositories/importers/distributors specified in the plan are migrated
  * all content is migrated regardless, just some of it might be orphaned in pulp3 if its repo is not in the plan

A migration plan can be validated, in which case the correctness of the pulp2 resources it refers to is checked. The main idea is to protect from typos, especially when a migration plan is generated externally.

### Incremental migration of repositories/importers/distributors

At the pre-migration stage, it is decided what needs to be migrated and it is marked accordingly. The migration stage just migrates everything marked for that. There is no extra logic at this stage around what to migrate and what not.
Each resource (pulp2 repo, importer, distributor) has flags in the pre-migration table: `is_migrated` and `not_in_plan`. `not_in_plan` set to `True` covers cases when a resource is not specified in the plan or is no longer in pulp2. Only resources with `is_migrated=False and not_in_plan=False` will be migrated.

Because resources are mutable, it is not safe to rely on timestamps only. [A more complicated logic](https://github.com/pulp/pulp-2to3-migration/blob/main/pulp_2to3_migration/app/pre_migration.py#L414-L427) is involved to re-migrate only changed resources and not all of them every time.

In addition to that, there is a need to track if a migration plan changed between the runs. It is possible that resources themselves have not changed but relations have, e.g. an importer or a distributor is still the same but in the new migration plan it is used for a different repository. To track such changes, there is [a separate model/table RepoSetup](https://github.com/pulp/pulp-2to3-migration/blob/main/pulp_2to3_migration/app/models/base.py#L351).

### Incremental migration of content

Content is pre-migrated in order from oldest to newest, so if interrupted, pre-migration can continue based on the timestamp of the last pre-migrated content. To allow this optimization, all content for a specified plugin needs to be pre-migrated, even if the repository it belongs to is not specified in the plan.

Same as with the resources, there is no decision logic present at the migration stage. It is the pre-migration stage which takes care of marking or cleaning up.

There is special handling for a few content types which can be mutable: RPM Errata, RPM modules, RPM module-defaults, Docker Tag. It is checked if there are any changed items, and those are marked for future migration, aka the relation between pulp2 and pulp3 content is removed. The migration stage migrates only pulp2 content which does not have a relation to its pulp3 analog.
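
The selection rules described in the two sections above can be illustrated with a minimal sketch. It uses plain dataclasses rather than the plugin's real Django models. The class and function names are hypothetical, and `pulp3_content` being `None` is only an assumed way to represent "no relation to pulp3 content yet".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PreMigratedResource:
    """Hypothetical stand-in for a pre-migrated repo, importer or distributor row."""
    name: str
    is_migrated: bool = False
    not_in_plan: bool = False


@dataclass
class PreMigratedContent:
    """Hypothetical stand-in for a pre-migrated content row."""
    pulp2_id: str
    pulp3_content: Optional[str] = None  # None means no pulp2-to-pulp3 relation yet


def resources_to_migrate(resources: List[PreMigratedResource]) -> List[PreMigratedResource]:
    # Only resources with is_migrated=False and not_in_plan=False are migrated.
    return [r for r in resources if not r.is_migrated and not r.not_in_plan]


def content_to_migrate(content: List[PreMigratedContent]) -> List[PreMigratedContent]:
    # Only content without a relation to pulp3 content is migrated.
    return [c for c in content if c.pulp3_content is None]


def mark_for_remigration(unit: PreMigratedContent) -> None:
    # Pre-migration noticed a change in a mutable unit: drop the relation
    # so that the migration stage picks the unit up again.
    unit.pulp3_content = None


resources = [
    PreMigratedResource("repo-a", is_migrated=True),   # already migrated
    PreMigratedResource("repo-b"),                     # new or changed
    PreMigratedResource("repo-c", not_in_plan=True),   # not in plan or gone from pulp2
]
content = [
    PreMigratedContent("rpm-1", pulp3_content="pulp3-rpm-1"),
    PreMigratedContent("erratum-1", pulp3_content="pulp3-erratum-1"),
    PreMigratedContent("rpm-2"),
]
mark_for_remigration(content[1])  # the erratum changed in pulp2 between runs

selected_resources = [r.name for r in resources_to_migrate(resources)]
selected_content = [c.pulp2_id for c in content_to_migrate(content)]
print(selected_resources)
print(selected_content)
```

The point of the sketch is that all decisions are made while marking, at pre-migration time. The selection functions that stand in for the migration stage are simple filters with no knowledge of why something was marked.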
Unfortunately, there is no way to track lazy_catalog changes, so pre-migration of all entries is attempted on every run (bulk_create with ignore), while those already migrated stay migrated. So it is still incremental, but with more load on the db and likely a bunch of duplicate key errors in the postgres logs.

## Code related comments

### Layout

The layout is quite straightforward, the only directory worth mentioning is [`plugin/`](https://github.com/pulp/pulp-2to3-migration/tree/main/pulp_2to3_migration/app/plugin/). It contains:

- files related to plugin API for the migration plugin
- directories, one per plugin (there are no separate github repositories for plugins)

There is [a short page with guidelines for plugin writers](https://docs.pulpproject.org/pulp_2to3_migration/plugin_writers_guide.html) in the docs, about what needs to be done to add a new content plugin to the migration.

### Naming

* Classes which deal with pulp2 (mongodb) are a copy-paste from pulp2 with adjustments to work with python3. The general-use classes are in a separate `pulp2` directory. In addition, each plugin has a dedicated python module for pulp2 data. **This code must never change pulp2 data**, only read it.
* Models with names like `Pulp2XXXXX` have a corresponding table in pulp3 db and are meant for storing pre-migrated pulp2 content.
* Classes with names like `Pulp2to3XXXXX` are meant to be subclassed by plugins. Those subclasses can store extra info pre-migrated from pulp2 and also provide methods for pre-migrating and migrating.

### Migration pipeline

Content is migrated using stages API. Here is [an example of the pipeline](https://github.com/pulp/pulp-2to3-migration/blob/main/pulp_2to3_migration/app/plugin/content.py#L74-L81). Each plugin can have its own. The first stage is the most complicated logic-wise, and its main part is used by all plugins.
The `RelatePulp2to3Content` stage has to be the last one, because the relation between pulp2 and pulp3 content is used as a flag that content has been migrated.

### Tests

CI has a special setup because it needs mongodb. Everything is done in customizable scripts, so one can apply updates from plugin_template without a problem.

To migrate to pulp3, only access to mongodb and the filesystem is needed. A working pulp2 is not required. For that reason, tests deal with snapshots of mongodb and the corresponding /var/lib/pulp state. Snapshots are located in a separate github repository https://github.com/pulp/pulp-2to3-migration-test-fixtures/. You can use a snapshot of your choice in a test [this way](https://github.com/pulp/pulp-2to3-migration/blob/main/pulp_2to3_migration/tests/functional/test_migration_plan_changes.py#L77). To clean up the db between tests, [tables are truncated](https://github.com/pulp/pulp-2to3-migration/blob/a713b455d699df68a99b853c6d57c2260ce5b21a/pulp_2to3_migration/tests/functional/constants.py#L36).

There are tests for file/rpm/deb plugins, there are no tests for docker plugin.