# Sync Memory Minimization Strategies
## Problem Statement
There are a number of RPM repositories "in the wild" which cause practical issues when synced into Pulp.
https://github.com/pulp/pulp_rpm/issues/4086
Generally, the problem is that the filelist (or, on some occasions, changelog) metadata is extremely voluminous for some packages, there are many (hundreds or thousands of) copies of that package in the repository, and those packages end up being processed concurrently during the sync pipeline.
### Workaround
One way to potentially avoid this issue on a per-repository basis is to use the `retain_package_versions` option to limit how many versions of any given package are synced. If one or several packages are particularly large, you may then only have a handful of copies of that package instead of dozens or hundreds.
This has the additional benefit of improving sync times, since fewer packages are processed, fewer packages are downloaded, less disk space is required, etc.
For many users this is a good idea regardless of whether any of the other ideas listed below are implemented, due to those aforementioned properties.
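As an illustration (not a prescription), here is a minimal sketch of setting the option through Pulp's REST API with Python's `requests`; the host, credentials, and repository href below are placeholders, not real values.

```python
import requests

# Placeholder values: substitute your Pulp host, credentials, and repository href.
BASE_URL = "https://pulp.example.com"
REPO_HREF = "/pulp/api/v3/repositories/rpm/rpm/0f0f0f0f-.../"  # hypothetical href
AUTH = ("admin", "password")

# Keep only the newest version of each package when syncing/retaining content.
response = requests.patch(
    BASE_URL + REPO_HREF,
    json={"retain_package_versions": 1},
    auth=AUTH,
)
response.raise_for_status()
print(response.json())
```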
### Solution Idea #1 (general)
Automatically and dynamically adjust the pipeline batch size based on RSS (resident set size), applied globally.
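As a rough, hypothetical sketch of what such a heuristic could look like (the constants and function names are illustrative, not pulpcore API), a batcher could read the worker's current RSS and shrink the batch size under memory pressure:

```python
import os

# Illustrative thresholds; real values would need tuning and discussion.
MAX_RSS_BYTES = 2 * 1024**3   # start backing off around ~2 GiB resident
MIN_BATCH = 50
DEFAULT_BATCH = 500
PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")


def current_rss_bytes() -> int:
    """Current resident set size, read from /proc/self/statm (Linux only)."""
    with open("/proc/self/statm") as statm:
        return int(statm.read().split()[1]) * PAGE_SIZE


def next_batch_size(previous: int = DEFAULT_BATCH) -> int:
    """Halve the batch size under memory pressure, otherwise grow back slowly."""
    if current_rss_bytes() > MAX_RSS_BYTES:
        return max(MIN_BATCH, previous // 2)
    return min(DEFAULT_BATCH, previous + previous // 4)
```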
Pros:
* very flexible, might apply across plugins (even though the problem mostly affects certain plugins)
* does not require a schema change
Cons:
* possibly very complex, may require a large refactoring of the sync pipeline
* would potentially slow down the sync process substantially
* does not reduce the total amount of database traffic
### Solution Idea #2 (general)
Give plugins more direct control over the pipeline, so that each plugin can implement its own batch-management strategy.
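Purely as a sketch of the shape such a hook might take (none of these names exist in pulpcore today), a plugin could supply a batch policy object that decides when a batch is "full":

```python
from typing import Iterable, Iterator, List, Protocol, TypeVar

T = TypeVar("T")


class BatchPolicy(Protocol):
    """Hypothetical hook a plugin could provide to control batching."""

    def should_flush(self, batch: List[T], candidate: T) -> bool: ...


class SizeAwarePolicy:
    """Example policy: flush on item count or on estimated in-memory size."""

    def __init__(self, max_items: int = 500, max_bytes: int = 64 * 1024**2):
        self.max_items = max_items
        self.max_bytes = max_bytes

    def should_flush(self, batch, candidate) -> bool:
        # "filelist" is a hypothetical attribute holding the raw metadata blob.
        estimated = sum(len(getattr(item, "filelist", "")) for item in batch)
        return len(batch) >= self.max_items or estimated >= self.max_bytes


def batched(items: Iterable[T], policy: BatchPolicy) -> Iterator[List[T]]:
    """Group a stream of content items into batches per the plugin-supplied policy."""
    batch: List[T] = []
    for item in items:
        if batch and policy.should_flush(batch, item):
            yield batch
            batch = []
        batch.append(item)
    if batch:
        yield batch
```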
Pros:
* no single global heuristic; plugins could experiment more, or use better heuristics than mere RSS
Cons:
* probably just as complex as #1, still probably requires a large refactoring of the sync pipeline
* would potentially slow down the sync process substantially
* potentially requires a different heuristic per plugin
* does not reduce the total amount of database traffic
### Solution Idea #3 (rpm-specific)
Store the filelists in a more compact form, e.g. as a pair of path root (stored separately) and filename. Often there are many thousands of files in the same directory (icons, manifest files, assets in general), which means many thousands of copies of the same directory prefix.
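A minimal sketch of this kind of compaction, assuming nothing about the actual schema (the function names are illustrative): each path is stored as an index into a shared directory table plus its basename.

```python
import posixpath
from typing import Dict, List, Tuple


def split_filelist(paths: List[str]) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Store each path as (directory_index, basename) plus a shared directory table."""
    dirs: Dict[str, int] = {}
    entries: List[Tuple[int, str]] = []
    for path in paths:
        dirname, basename = posixpath.split(path)
        index = dirs.setdefault(dirname, len(dirs))
        entries.append((index, basename))
    dir_table = sorted(dirs, key=dirs.get)  # ordered by first appearance
    return dir_table, entries


def join_filelist(dir_table: List[str], entries: List[Tuple[int, str]]) -> List[str]:
    """Reconstruct the original paths for serialization/publish."""
    return [posixpath.join(dir_table[i], name) for i, name in entries]


# Thousands of icons under one directory collapse to a single stored prefix.
paths = [f"/usr/share/icons/theme/{n}.png" for n in range(3)] + ["/etc/app.conf"]
table, packed = split_filelist(paths)
assert join_filelist(table, packed) == paths
```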
Pros:
* reduces database traffic required
Cons:
* requires schema change
* more complex; requires extra work to pack and unpack the data during serialization, sync, and publish
### Solution Idea #4 (rpm-specific)
Store the filelist in a (literally) compressed form, e.g. as a zstd- or lz4-compressed JSON or XML string.
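A minimal sketch, assuming the third-party `zstandard` package and JSON encoding (lz4 would work similarly): the compressed blob would live in a binary column and be expanded again at serialization/publish time.

```python
import json
from typing import List

import zstandard  # third-party package; lz4 would work similarly


def pack_filelist(paths: List[str]) -> bytes:
    """Compress the JSON-encoded filelist for storage in a binary column."""
    raw = json.dumps(paths).encode("utf-8")
    return zstandard.ZstdCompressor(level=3).compress(raw)


def unpack_filelist(blob: bytes) -> List[str]:
    """Decompress back to the original list for serialization/publish."""
    raw = zstandard.ZstdDecompressor().decompress(blob)
    return json.loads(raw.decode("utf-8"))


paths = [f"/usr/share/doc/pkg/file-{n}.txt" for n in range(1000)]
blob = pack_filelist(paths)
assert unpack_filelist(blob) == paths
print(f"{len(json.dumps(paths))} bytes of JSON -> {len(blob)} compressed bytes")
```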
Pros:
* relatively simple
* reduces database traffic required
Cons:
* requires schema change
* serializing becomes more difficult
### Solution Idea #5 (rpm-specific)
Perform a "lossy" saving of the filelist, perhaps behind a per-repository option. The vast majority of files listed in these cases are not useful for depsolving or lookup purposes and could be "trimmed" using some heuristic. primary.xml already contains a restricted subset of the files precisely to avoid needing to reference filelists.xml in common cases - this is the same principle. Artifactory has an option to disable filelists generation entirely to save publish time.
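For illustration only, a sketch of one possible trimming heuristic, modeled loosely on the subset that createrepo-style tools traditionally keep in primary.xml (paths containing a bin/ directory, anything under /etc/, and /usr/lib/sendmail); the exact rule a real implementation should use would need discussion.

```python
from typing import List


def is_primary_file(path: str) -> bool:
    """Approximation of the classic 'primary' filelist filter used by createrepo-style tools."""
    return "bin/" in path or path.startswith("/etc/") or path == "/usr/lib/sendmail"


def trim_filelist(paths: List[str]) -> List[str]:
    """Keep only the files most likely to matter for depsolving / file-based Requires."""
    return [p for p in paths if is_primary_file(p)]


paths = [
    "/usr/bin/tool",
    "/etc/tool.conf",
    "/usr/share/icons/hicolor/16x16/apps/tool.png",  # trimmed
    "/usr/share/doc/tool/README",                    # trimmed
]
assert trim_filelist(paths) == ["/usr/bin/tool", "/etc/tool.conf"]
```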
Pros:
* simple
* reduces database traffic required
* doesn't require any schema changes
* has tangential benefits such as smaller repodata and less expensive serialization
Cons:
* there is no bulletproof way of knowing ahead of time whether a particular file is required for depsolving purposes, and "no take-backsies" because:
* content is immutable and globally unique within a domain - syncing / uploading a package once with the restricted filelist means you cannot simply change it later
* we kinda have the same problem in other contexts though, so this isn't "new", just a new case where the same applies