
Pulp Sub-repository discussion session

Slot: Thursday, November 9 (day #4) 9:00am EST, 15:00 CET

Speaker: Quirin Pamp
GitHub/Matrix: quba42
Company: ATIX
Role: pulp_deb maintainer

Problem Statement

  • Some content types may have meaningful subsets within a given repository.
  • The Pulp plugin must record not only what repository a particular content unit is stored in (handled by pulpcore using RepositoryContent), but also what subset or grouping within that repository.
    • Exception: Perhaps there is some natural way to query for the subset?
      • e.g., by label, by some subfield, etc.
  • e.g.: APT repo component or package index; RPM modulestream; RPM kickstart-tree repos

Additional Considerations

  • Various actions might target the subset rather than the repository as a whole, whereas pulpcore functionality almost always operates at the repository version level.
  • There is (is there?) a hierarchical organization/grouping implied

How pulp_deb handles this to date

  • Today pulp_deb uses PackageReleaseComponent (PRC) content to record which Package is present in what ReleaseComponent (within a particular repository version).
    • Essentially just a table with two foreign keys.
    • Also inherits from Content
    • No associated artifacts
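
To make the shape of a PRC concrete, here is a minimal sketch using plain Python dataclasses. It is not the actual pulp_deb code: the real models are Django models built on pulpcore's Content, and the field names below (`name`, `version`, `architecture`, `distribution`, `component`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    # Field names are illustrative, not the real pulp_deb schema.
    name: str
    version: str
    architecture: str


@dataclass(frozen=True)
class ReleaseComponent:
    distribution: str
    component: str


@dataclass(frozen=True)
class PackageReleaseComponent:
    # Essentially two foreign keys; no artifacts are attached.
    package: Package
    release_component: ReleaseComponent


hello = Package("hello", "2.10-2", "amd64")
main = ReleaseComponent("stable", "main")

# "This package is in this ReleaseComponent."
prc = PackageReleaseComponent(hello, main)
```

The PRC carries no data of its own beyond the two references, which is why it behaves like a join table that happens to be tracked as content.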

Problems and limitations

  • Doubles the amount of content (per repository version)
  • For each Package + PRC combination we record the following:
    • pulpcore RepositoryContent: This package is in this repository version
    • pulpcore RepositoryContent: This PRC is in this repository version
    • pulp_deb PackageReleaseComponent: This package is in this ReleaseComponent
  • Handling repo version consistency, e.g.: What happens with a PRC when a package is removed from the repository?
  • This exposes what should be pulp_deb plumbing to users (who should not have to care about understanding what a PackageReleaseComponent is).

=> This has proven to be the main obstacle for improving pulp_deb efficiency.
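
The doubling and the consistency question above can be sketched with plain sets. This is a toy model under stated assumptions, not pulpcore code: packages are represented by name only, a PRC is a `(package, component)` tuple, and the helper names are hypothetical.

```python
def count_repository_content(packages, prcs):
    """One RepositoryContent record per content unit in the version."""
    return len(packages) + len(prcs)


def dangling_prcs(packages, prcs):
    """PRCs whose package is no longer part of the repository version."""
    return {prc for prc in prcs if prc[0] not in packages}


# Hypothetical repository version: two packages, each in one component.
packages = {"hello", "curl"}
prcs = {("hello", "main"), ("curl", "main")}

# Two packages need four RepositoryContent records.
total = count_repository_content(packages, prcs)

# Removing only the package leaves its PRC behind unless the plugin
# (or the user) removes it as well.
packages_after_removal = packages - {"curl"}
leftover = dangling_prcs(packages_after_removal, prcs)
```

In this sketch `leftover` contains the PRC for `curl`, which is the kind of inconsistency the plugin has to guard against on every content change.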

How a sub-repository might work

  • Each sub-repository represents one ReleaseComponent.
  • Packages are contained in the sub-repositories.
  • The parent repository does not contain any Packages directly.
  • Instead, it contains a ReleaseComponent that points at a particular sub-repository version.
  • Both parent and sub-repository are independently versioned.
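
One way to picture the proposal above is the following sketch. It uses frozen dataclasses rather than real pulpcore models, and every class and field name except ReleaseComponent is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubRepositoryVersion:
    sub_repository: str   # hypothetical identifier of the sub-repository
    number: int
    packages: frozenset   # package names, for brevity


@dataclass(frozen=True)
class ReleaseComponent:
    distribution: str
    component: str
    sub_repository_version: SubRepositoryVersion


@dataclass(frozen=True)
class ParentRepositoryVersion:
    number: int
    release_components: frozenset  # contains no Packages directly


def packages_in(parent_version):
    """Resolve all packages reachable through the parent's components."""
    return set().union(
        *(rc.sub_repository_version.packages
          for rc in parent_version.release_components)
    )


main_v1 = SubRepositoryVersion("stable-main", 1, frozenset({"hello"}))
parent_v1 = ParentRepositoryVersion(
    1, frozenset({ReleaseComponent("stable", "main", main_v1)})
)

# Adding a package creates a new sub-repository version...
main_v2 = SubRepositoryVersion("stable-main", 2, main_v1.packages | {"curl"})
# ...and the parent only sees it once a new parent version references it.
parent_v2 = ParentRepositoryVersion(
    2, frozenset({ReleaseComponent("stable", "main", main_v2)})
)
```

Because the ReleaseComponent references a specific sub-repository version rather than the sub-repository itself, `parent_v1` keeps resolving to the same packages after `main_v2` is created.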

How would this solve the problem?

  • We would be back down to one pulpcore RepositoryContent record per Package in the APT repo.
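
As a back-of-the-envelope comparison, the sketch below counts package-related RepositoryContent records under both layouts. The package counts are made-up numbers, the function names are hypothetical, and records for ReleaseComponents and other content types are deliberately ignored.

```python
def records_with_prcs(packages_per_component):
    # One record for each Package plus one for each PRC.
    return 2 * sum(packages_per_component.values())


def records_with_sub_repositories(packages_per_component):
    # One record per Package, held in its component's sub-repository.
    return sum(packages_per_component.values())


# Made-up example: packages per component in one APT repository.
example = {"main": 1000, "contrib": 50}

current = records_with_prcs(example)
proposed = records_with_sub_repositories(example)
```

For this example the current layout needs 2100 records and the sub-repository layout 1050.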

Open questions and challenges

  • Can multiple parent repositories reference the same sub-repository? (I think not)
    • concurrency/deadlocks have entered the room
  • How do users interact with the sub-repositories?
    • For example: Does the user "add a package" to the sub-repository and then ask the parent repository to update its sub-repository version reference? Or is there an API to "add a package" directly to the parent repository, which then does everything needed under the hood? Both?
    • user needs to know distribution and component-in-that-distro, at create time
      • prob should be user-required, no defaults
  • How to handle orphaned sub-repositories? On delete cascade?
  • Handling repo uniqueness constraints would get more complicated.
  • What happens when a sync fails for one sub-repository, but the other sub-repositories have already saved a new version?
  • Into the APT repo weeds: To optimize sync, it would be better to have one sub-repository per package index rather than one per release component. However, this strikes me as not very intuitive, and then there is the case of architecture=all indices.

Discussion

  • How feasible would it be to introduce a "Sub-repository" concept into Pulp?
  • Is there demand for something like this outside of pulp_deb?
  • Pitfalls, concerns, alternatives?

Alternative: Just have the reference on the Package model

  • The Package model could have a component reference on it.
  • This would duplicate packages in the DB when the same package exists in different components (but not the artifacts).
  • Con: Feels like we are not "cutting nature at the joints"
  • Pro: Might be simpler than the sub-repository approach and also gets rid of PackageReleaseComponents
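
The trade-off in this alternative can be illustrated with a small sketch. It again uses plain dataclasses rather than the real pulp_deb models; the `Artifact` class, the field names, and the placeholder checksum are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    sha256: str


@dataclass(frozen=True)
class Package:
    name: str
    version: str
    component: str      # the component reference lives on the Package
    artifact: Artifact


# Placeholder checksum, not a real digest.
deb = Artifact("abc123")

# The same package in two components becomes two Package rows...
rows = [
    Package("hello", "2.10-2", "main", deb),
    Package("hello", "2.10-2", "contrib", deb),
]

# ...that still share a single artifact.
distinct_packages = len(set(rows))
distinct_artifacts = len({p.artifact for p in rows})
```

Here `distinct_packages` is 2 while `distinct_artifacts` is 1, which matches the point above that package records are duplicated in the DB but the artifacts are not.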

Notes

  • is "repository" the right word to use for this grouping concept?
    • it's internal-only, so does it matter?
    • "human beings" need to understand this architecture, and "repository" comes with semantics that we may/do not want/need here
  • what is the/a "natural key" of such a grouping?
  • what about immutability of repo-version, when a contained-object might change?
    • if the sub-repo version is treated as Content that belongs to a real repository, that should preserve the "version must be immutable" guarantee