[Back to PulpCon2023 Schedule](https://hackmd.io/@pulp/pulpcon_2023)

# Pulp Sub-repository discussion session

Slot: Thursday, November 9 (day #4) 9:00am EST, 15:00 CET
Speaker: Quirin Pamp
GitHub/Matrix: quba42
Company: [ATIX](https://atix.de/en/)
Role: pulp_deb maintainer

## Problem Statement

* Some content types may have meaningful subsets within a given repository.
* The Pulp plugin must record not only which repository a particular content unit is stored in (handled by pulpcore using `RepositoryContent`), but also which subset or grouping within that repository it belongs to.
    * Exception: Perhaps there is some natural way to query for the subset?
        * e.g., by label, by some subfield, etc.
* e.g.: APT repo component or package index; RPM modulestream; RPM kickstart-tree repos

### Additional Considerations

* Various actions might target the subset rather than the repository as a whole, whereas pulpcore functionality almost always operates at the repository version level.
* There is (is there?) a hierarchical organization/grouping implied.

## How pulp_deb handles this to date

* Today pulp_deb uses `PackageReleaseComponent` (PRC) content to record which `Package` is present in which `ReleaseComponent` (within a particular repository version).
    * Essentially just a table with two foreign keys.
    * Also inherits from `Content`
    * No associated artifacts

### Problems and limitations

* Doubles the amount of content (per repository version)
    * For each Package + PRC combination we record the following:
        * pulpcore `RepositoryContent`: This package is in this repository version
        * pulpcore `RepositoryContent`: This PRC is in this repository version
        * pulp_deb `PackageReleaseComponent`: This package is in this `ReleaseComponent`
* Handling repo version consistency, e.g.: What happens with a PRC when a package is removed from the repository?
* This exposes what should be pulp_deb plumbing to users (who should not have to care about understanding what a `PackageReleaseComponent` is).
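
The bookkeeping listed above can be illustrated with a rough sketch. It uses plain Python dataclasses as simplified stand-ins, so it is not the actual pulpcore or pulp_deb Django code; `RepoVersion`, `add_package_to_component`, `record_counts`, and `dangling_prcs` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Package:
    name: str


@dataclass(frozen=True)
class ReleaseComponent:
    name: str


@dataclass(frozen=True)
class PackageReleaseComponent:
    """Stand-in for PRC content: essentially two foreign keys."""
    package: Package
    release_component: ReleaseComponent


@dataclass
class RepoVersion:
    """Hypothetical stand-in for one repository version.

    Each entry in `content` plays the role of one RepositoryContent row.
    """
    content: set = field(default_factory=set)


def add_package_to_component(version, package, component):
    """Record that `package` is in `component` within this version."""
    version.content.add(package)    # row for the package
    version.content.add(component)  # row for the release component
    # Row for the PRC, which is itself an extra content unit.
    version.content.add(PackageReleaseComponent(package, component))


def record_counts(version):
    """Return (package rows, PRC rows) for this version."""
    packages = sum(isinstance(c, Package) for c in version.content)
    prcs = sum(isinstance(c, PackageReleaseComponent) for c in version.content)
    return packages, prcs


def dangling_prcs(version):
    """Return PRCs whose package is no longer part of the version."""
    return [
        c for c in version.content
        if isinstance(c, PackageReleaseComponent)
        and c.package not in version.content
    ]


version = RepoVersion()
main = ReleaseComponent("main")
for name in ("foo", "bar", "baz"):
    add_package_to_component(version, Package(name), main)

print(record_counts(version))

# Removing only the package leaves its PRC behind, which is the
# consistency question raised in the list above.
version.content.discard(Package("foo"))
print(len(dangling_prcs(version)))
```

In this toy model every package membership costs one package row plus one PRC row, and nothing ties the lifetime of a PRC to the package it refers to.
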

=> This has proven to be the main obstacle for improving pulp_deb efficiency.

## How a sub-repository might work

* Each sub-repository represents one `ReleaseComponent`.
* `Packages` are contained in the sub-repositories.
* The parent repository does not contain any `Packages` directly.
* Instead, it contains a `ReleaseComponent` that points at a particular sub-repository version.
* Both parent and sub-repository are independently versioned.

### How would this solve the problem?

* We would be back down to one pulpcore `RepositoryContent` record per `Package` in the APT repo.

### Open questions and challenges

* Can multiple parent repositories reference the same sub-repository? (I think not)
    * concurrency/deadlocks have entered the room...
* How do users interact with the sub-repositories?
    * For example: Does the user "add a package" to the sub-repository, and then ask the parent repository to update its sub-repository version reference, or is there an API to "add a package" directly to the parent repository, which then performs everything that is needed under the hood? Both?
    * user needs to know distribution and component-in-that-distro at create time
        * probably should be user-required, no defaults
* How to handle orphaned sub-repositories? On delete cascade?
* Handling repo uniqueness constraints would get more complicated.
* What happens when a sync fails for one sub-repository, but the other sub-repositories have already saved a new version?
* Into the APT repo weeds: To optimize sync, it would be better to have one sub-repository per package index rather than one per release component. However, this strikes me as not very intuitive, and then there is the case of architecture=all indices...

## Discussion

* How feasible would it be to introduce a "Sub-repository" concept into Pulp?
* Is there demand for something like this outside of pulp_deb?
* Pitfalls, concerns, alternatives?
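
To make the sub-repository proposal a little more concrete for discussion, here is a minimal sketch of the two-step interaction (change the sub-repository, then update the parent's reference). Everything in it is an assumption made for illustration: `SubRepoVersion`, `ParentRepoVersion`, `add_package`, and `update_reference` are hypothetical names, and nothing like this exists in pulpcore today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubRepoVersion:
    """Immutable snapshot of one sub-repository (one release component)."""
    component: str
    number: int
    packages: frozenset


@dataclass(frozen=True)
class ParentRepoVersion:
    """Immutable snapshot of the parent repository.

    It holds no packages directly, only references to sub-repository versions.
    """
    number: int
    components: tuple  # tuple of SubRepoVersion


def add_package(sub_version, package):
    """Return a new sub-repository version that also contains `package`."""
    return SubRepoVersion(
        component=sub_version.component,
        number=sub_version.number + 1,
        packages=sub_version.packages | {package},
    )


def update_reference(parent_version, new_sub_version):
    """Return a new parent version pointing at `new_sub_version`."""
    others = tuple(
        c for c in parent_version.components
        if c.component != new_sub_version.component
    )
    return ParentRepoVersion(
        number=parent_version.number + 1,
        components=others + (new_sub_version,),
    )


main_v1 = SubRepoVersion("main", 1, frozenset({"foo"}))
parent_v1 = ParentRepoVersion(1, (main_v1,))

# Step 1: the sub-repository gets a new version.
main_v2 = add_package(main_v1, "bar")
# Step 2: the parent gets a new version referencing it.
parent_v2 = update_reference(parent_v1, main_v2)

# The old parent version still points at the old sub-repository version.
print(sorted(parent_v1.components[0].packages))
print(sorted(parent_v2.components[0].packages))
```

Because both snapshot types are frozen, an existing parent version never changes when a sub-repository moves on. The sketch sidesteps the harder questions listed above, such as partial sync failures and concurrent updates.
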

### Alternative: Just have the reference on the Package model

* The `Package` model could have a component reference on it.
* This would duplicate packages in the DB when the same package exists in different components (but not the artifacts).
* Con: Feels like we are not "cutting nature at the joints"
* Pro: Might be simpler than the sub-repository approach and also gets rid of PackageReleaseComponents

### Notes

* is "repository" the right word to use for this grouping concept?
    * it's internal-only, so does it matter?
    * "human beings" need to understand this architecture, and "repository" comes with semantics that we may/do not want/need here
* what is the/a "natural key" of such a grouping?
* what about immutability of repo-version, when a contained-object might change?
    * if the sub-repo-version is identified as Content that belongs to a Real Repository, that should maintain "version must be immutable"
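
For comparison, the alternative discussed earlier (a component reference directly on the `Package` model) might look roughly like the sketch below. The field names and the `Artifact` stand-in are assumptions for illustration only and do not reflect the real pulp_deb models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """Stand-in for the stored .deb file."""
    sha256: str


@dataclass(frozen=True)
class Package:
    name: str
    version: str
    component: str  # hypothetical component reference on the package itself
    artifact: Artifact


deb = Artifact(sha256="placeholder-digest")

# The same package in two components becomes two package records...
in_main = Package("foo", "1.0", "main", deb)
in_contrib = Package("foo", "1.0", "contrib", deb)

repo_version_content = {in_main, in_contrib}
# ...but both records share a single artifact.
artifacts = {p.artifact for p in repo_version_content}

print(len(repo_version_content), len(artifacts))
```
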