[Back to PulpCon2023 Schedule](https://hackmd.io/@pulp/pulpcon_2023)
# Pulp Sub-repository discussion session
Slot: Thursday, November 9 (day #4) 9:00am EST, 15:00 CET
Speaker: Quirin Pamp
GitHub/Matrix: quba42
Company: [ATIX](https://atix.de/en/)
Role: pulp_deb maintainer
## Problem Statement
* Some content types may have meaningful subsets within a given repository.
* The Pulp plugin must record not only what repository a particular content unit is stored in (handled by pulpcore using `RepositoryContent`), but also what subset or grouping within that repository.
* Exception: Perhaps there is some natural way to query for the subset?
* e.g., by label, by some subfield, etc.
* e.g.: APT repo component or package index; RPM modulestream; RPM kickstart-tree repos
### Additional Considerations
* Various actions might target the subset rather than the repository as a whole, whereas pulpcore functionality almost always operates at the repository version level.
* There is (is there?) a hierarchical organization/grouping implied
## How pulp_deb handles this to date
* Today pulp_deb uses `PackageReleaseComponent` (PRC) content to record which `Package` is present in which `ReleaseComponent` (within a particular repository version).
    * Essentially just a table with two foreign keys.
    * Also inherits from `Content`
    * No associated artifacts
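
To make the "table with two foreign keys" concrete, here is a minimal sketch using plain dataclasses. These classes are hypothetical stand-ins, not the real pulp_deb Django models, and the package and component values are illustrative only. A repository version is modelled as a simple set of content units.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the pulp_deb models; not the real Django classes.
@dataclass(frozen=True)
class Package:
    name: str
    version: str


@dataclass(frozen=True)
class ReleaseComponent:
    distribution: str
    component: str


@dataclass(frozen=True)
class PackageReleaseComponent:
    # The two "foreign keys": which package sits in which component.
    package: Package
    release_component: ReleaseComponent


def packages_in_component(content, component):
    """Return the packages linked to `component` through PRC units in `content`."""
    return {
        unit.package
        for unit in content
        if isinstance(unit, PackageReleaseComponent)
        and unit.release_component == component
    }


# Illustrative values only.
curl = Package("curl", "7.88.1-10")
main = ReleaseComponent("bookworm", "main")
prc = PackageReleaseComponent(curl, main)

# The repository version has to hold the package, the component AND the PRC.
repo_version_content = {curl, main, prc}
```

Note that finding the packages of a component requires going through the PRC units; the package itself carries no component information.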
### Problems and limitations
* Doubles the amount of content (per repository version)
* For each Package + PRC combination we record the following:
    * pulpcore `RepositoryContent`: This package is in this repository version
    * pulpcore `RepositoryContent`: This PRC is in this repository version
    * pulp_deb `PackageReleaseComponent`: This package is in this `ReleaseComponent`
* Handling repo version consistency, e.g.: What happens with a PRC when a package is removed from the repository?
* This exposes what should be pulp_deb plumbing to users (who should not have to care about understanding what a `PackageReleaseComponent` is).
=> This has proven to be the main obstacle for improving pulp_deb efficiency.
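
The consistency problem can be illustrated with a small sketch. It assumes a repository version is just a set of content units and uses hypothetical, simplified classes (not pulp_deb code). Removing only the package leaves its PRC behind, so every content-modifying operation has to check for such leftovers.

```python
from dataclasses import dataclass


# Hypothetical, simplified stand-ins for pulp_deb content units.
@dataclass(frozen=True)
class Package:
    name: str


@dataclass(frozen=True)
class PackageReleaseComponent:
    package: Package
    component: str


def dangling_prcs(content):
    """Return PRC units whose package is missing from the same content set."""
    packages = {unit for unit in content if isinstance(unit, Package)}
    return {
        unit
        for unit in content
        if isinstance(unit, PackageReleaseComponent)
        and unit.package not in packages
    }


curl = Package("curl")
prc = PackageReleaseComponent(curl, "main")

version_1 = {curl, prc}
# Removing only the package leaves its PRC behind in the next version.
version_2 = version_1 - {curl}

print(dangling_prcs(version_1))
print(dangling_prcs(version_2))
```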
## How a sub-repository might work
* Each sub-repository represents one `ReleaseComponent`.
* `Packages` are contained in the sub-repositories.
* The parent repository does not contain any `Packages` directly.
* Instead, it contains a `ReleaseComponent` that points at a particular sub-repository version.
* Both parent and sub-repository are independently versioned.
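
A rough sketch of the proposed structure follows. All class and method names are hypothetical; nothing here is existing pulpcore or pulp_deb API, and packages are reduced to plain strings. It shows a parent version pinning one specific sub-repository version while the sub-repository continues to gain versions of its own.

```python
from dataclasses import dataclass, field


# Hypothetical names throughout; none of this is existing pulpcore or pulp_deb API.
@dataclass
class SubRepository:
    """Holds the packages of one release component, with its own version history."""
    name: str
    versions: list = field(default_factory=lambda: [frozenset()])

    def new_version(self, add=(), remove=()):
        content = (self.versions[-1] | frozenset(add)) - frozenset(remove)
        self.versions.append(content)
        return len(self.versions) - 1  # number of the version just created


@dataclass(frozen=True)
class ReleaseComponent:
    """Content unit in the parent that pins one sub-repository version."""
    distribution: str
    component: str
    sub_repository: str
    sub_version: int


@dataclass
class ParentRepository:
    name: str
    versions: list = field(default_factory=lambda: [frozenset()])

    def new_version(self, release_components):
        self.versions.append(frozenset(release_components))
        return len(self.versions) - 1


main_sub = SubRepository("bookworm-main")
v1 = main_sub.new_version(add={"curl"})

apt_repo = ParentRepository("debian")
p1 = apt_repo.new_version({ReleaseComponent("bookworm", "main", main_sub.name, v1)})

# A later sub-repository version does not touch the existing parent version.
v2 = main_sub.new_version(add={"wget"})
pinned = next(iter(apt_repo.versions[p1]))
```

In this sketch the parent only changes when a new parent version with an updated `ReleaseComponent` is created, which is one possible reading of "independently versioned".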
### How would this solve the problem?
* We would be back down to one pulpcore `RepositoryContent` record per `Package` in the APT repo.
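
As back-of-the-envelope arithmetic, the saving can be sketched as follows. This assumes each package belongs to exactly one component and counts only the per-package `RepositoryContent` rows (the handful of rows for the `ReleaseComponent` units themselves are ignored). The function names and the package count are illustrative.

```python
def repository_content_rows_today(num_packages: int) -> int:
    # One row for each Package plus one row for each PackageReleaseComponent.
    return 2 * num_packages


def repository_content_rows_sub_repo(num_packages: int) -> int:
    # One row per Package, held in the sub-repository version.
    return num_packages


n = 60_000  # illustrative package count, not a measured figure
print(repository_content_rows_today(n), repository_content_rows_sub_repo(n))
```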
### Open questions and challenges
* Can multiple parent repositories reference the same sub-repository? (I think not)
* concurrency/deadlocks have entered the room...
* How do users interact with the sub-repositories?
* For example: Does the user "add a package" to the subrepository, and then ask the parent repository to update its sub-repository version reference, or is there an API to "add a package" directly to the parent repository which then performs everything that is needed under the hood? Both?
* The user needs to know the distribution and the component within that distribution at creation time.
* These should probably be required from the user, with no defaults.
* How to handle orphaned sub-repositories? On delete cascade?
* Handling repo uniqueness constraints would get more complicated.
* What happens when a sync fails for one sub-repository, but the other sub-repositories have already saved a new version?
* Into the APT repo weeds: To optimize sync, it would be better to have one sub-repository per package index, rather than one per release component. However, this strikes me as not very intuitive, and then there is the case of architecture=all indices...
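
The two interaction styles raised in the "How do users interact" question can be sketched side by side. This is only an illustration under assumed names (`SubRepo`, `ParentRepo`, `update_reference`, `add_package` are all hypothetical), with packages reduced to strings; it ignores locking, failures and uniqueness constraints entirely.

```python
class SubRepo:
    """Hypothetical sub-repository with its own version history."""

    def __init__(self):
        self.versions = [frozenset()]

    def add(self, package):
        self.versions.append(self.versions[-1] | {package})
        return len(self.versions) - 1


class ParentRepo:
    """Hypothetical parent whose versions pin sub-repository versions."""

    def __init__(self):
        self.sub_repos = {}   # (distribution, component) -> SubRepo
        self.versions = [{}]  # each version maps that key to a pinned sub version

    def update_reference(self, key, sub_version):
        pinned = dict(self.versions[-1])
        pinned[key] = sub_version
        self.versions.append(pinned)
        return len(self.versions) - 1

    def add_package(self, package, distribution, component):
        # Convenience path: distribution and component are required, no defaults.
        key = (distribution, component)
        sub = self.sub_repos.setdefault(key, SubRepo())
        return self.update_reference(key, sub.add(package))


repo = ParentRepo()
key = ("bookworm", "main")

# Style 1: two explicit steps, sub-repository first, then the parent reference.
repo.sub_repos[key] = SubRepo()
sub_version = repo.sub_repos[key].add("curl")
repo.update_reference(key, sub_version)

# Style 2: one call on the parent that does both steps under the hood.
latest = repo.add_package("wget", "bookworm", "main")
```

Offering both would mean the low-level calls stay available for plumbing, while most users only ever see the parent-level call.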
## Discussion
* How feasible would it be to introduce a "Sub-repository" concept into Pulp?
* Is there demand for something like this outside of pulp_deb?
* Pitfalls, concerns, alternatives?
### Alternative: Just have the reference on the Package model
* The `Package` model could have a component reference on it.
* This would duplicate packages in the DB when the same package exists in different components (but not the artifacts).
* Con: Feels like we are not "cutting nature at the joints"
* Pro: Might be simpler than the sub-repository approach and also gets rid of PackageReleaseComponents
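
A minimal sketch of this alternative, again with hypothetical simplified classes rather than the real pulp_deb models: the same package in two components becomes two `Package` rows, but both rows point at a single artifact. The digest string is a placeholder.

```python
from dataclasses import dataclass


# Hypothetical simplification: the real pulp_deb Package model has many more fields.
@dataclass(frozen=True)
class Artifact:
    sha256: str


@dataclass(frozen=True)
class Package:
    name: str
    version: str
    component: str      # the proposed component reference
    artifact: Artifact  # the stored file, shared between rows


deb = Artifact(sha256="placeholder-digest")

# The same package in two components: two Package rows, one artifact.
rows = {
    Package("curl", "7.88.1-10", "main", deb),
    Package("curl", "7.88.1-10", "contrib", deb),
}
distinct_artifacts = {row.artifact for row in rows}
print(len(rows), len(distinct_artifacts))
```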
### Notes
* is "repository" the right word to use for this grouping concept?
* it's internal-only, so does it matter?
* "human beings" need to understand this architecture, and "repository" comes with semantics that we may/do not want/need here
* what is the/a "natural key" of such a grouping?
* what about immutability of repo-version, when a contained-object might change?
* if the sub-repo-version is identified as Content, that belongs-to a Real Repository, that should maintain "version must be immutable"