# Orphan cleanup

Problem: orphan cleanup is a tasking singleton. It waits until all other tasks have halted before running, locks All The Things, and then takes a non-trivial amount of time to run.

Solution: Have orphan cleanup run asynchronously and without any locks. This will allow any worker to work on it in parallel with other Pulp task types.

Challenge: Handle the race between another task associating orphaned content with a repository version and orphan cleanup deleting that orphan.

Brainstorming ideas:

1. Do not remove newly orphaned content; only remove orphaned content older than some threshold (for example, 5-10 minutes).
    * We only know "recently created", not "recently orphaned".
2. Mark artifact/content as 'protected'.
    * If an artifact never becomes content, it stays marked as protected and can never be cleaned up.
    * Similarly, if content never gets associated with a repo, it stays marked as protected and can never be cleaned up.

## Where do we experience the race between content association and removal?

1. Sync Pipeline
2. New content creation during upload
3. General plugin code
4. `modify` endpoints that validate in the viewsets but associate in the tasks

## Idea #1 (was dropped in favour of idea #2)

Have orphan cleanup only remove Artifacts or Content older than X seconds.

1. Sync pipeline
    * When adding content to the repo version, if content has gone missing, re-send it to the pipeline.
        * Note: make sure not to lose the original relations in the extra_data on the DC.
    * When creating content out of an artifact, if the artifact is no longer available, re-send it through the pipeline.
2. New content creation during upload
    * For one-shot upload: it is handled in 1 transaction. Not every one-shot upload adds newly created content to the repo version. In that case, if the newly created content does not get added to the repo within X seconds (TBD), it risks being cleaned up. Ask the user to re-upload from scratch.
    * For non one-shot upload: have orphan cleanup only remove Artifacts or Content older than X seconds. There is still a window for a race condition; in that case, just fail and ask the user to re-upload from scratch.
    * What about content that already exists in Pulp when the upload is performed? While it is being re-used during the upload, it can get cleaned up at any moment.
3. General plugin code
    * Use transactions.
4. `modify` endpoints that validate in the viewsets but associate in the tasks
    * Content specified in `add_content_hrefs` might get removed in the meantime and no longer be available, so the task will fail.

# Proposal (Idea #2)

Add a new field `timestamp_of_interest` on the artifact and content models:

- This field should be set at artifact/content creation time.
- This field should be updated whenever we work with an existing artifact/content.

It is expected that the artifact will be added to the content (unorphaning the artifact) within X seconds/minutes (TBD) of the `timestamp_of_interest`. Similarly, content is expected to be added to a repo version (unorphaning the content) within X seconds/minutes (TBD).

Sync Pipeline
=============

The `timestamp_of_interest` will be set in the sync pipeline in the `QueryExistingArtifacts` and `QueryExistingContents` stages.

When querying for existing artifacts/content and they are found, set the timestamp. If committing this transaction fails because orphan cleanup committed a transaction that removed the objects whose `timestamp_of_interest` was being set, retry the exact same transaction. The second time, the removed objects will not be found, and the pipeline will proceed normally to re-download/re-create the artifact/content in the later stages.
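
The touch-and-retry behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions: `FakeStore`, `ConflictError`, and `query_existing` are hypothetical names, and the in-memory dictionary stands in for a database transaction. It is not Pulp's actual stage code.

```python
from datetime import datetime, timezone


class ConflictError(Exception):
    """Hypothetical stand-in for a commit that fails due to a concurrent delete."""


class FakeStore:
    """In-memory stand-in for the artifact/content table (not Pulp's API)."""

    def __init__(self, rows, vanish_on_first_commit=()):
        self.rows = dict(rows)  # key -> timestamp_of_interest
        self._vanish = set(vanish_on_first_commit)

    def touch(self, keys, now):
        """Set timestamp_of_interest on rows that exist and return their keys."""
        found = [k for k in keys if k in self.rows]
        if self._vanish & set(found):
            # Simulate orphan cleanup committing its delete before we commit.
            for k in self._vanish:
                self.rows.pop(k, None)
            self._vanish = set()
            raise ConflictError("rows removed by concurrent orphan cleanup")
        for k in found:
            self.rows[k] = now
        return found


def query_existing(store, keys, now, retries=1):
    """Touch existing rows, re-running the same query after a commit conflict."""
    for attempt in range(retries + 1):
        try:
            return store.touch(keys, now)
        except ConflictError:
            if attempt == retries:
                raise
    return []


now = datetime(2021, 1, 1, tzinfo=timezone.utc)
store = FakeStore({"a": None, "b": None}, vanish_on_first_commit={"b"})
# "b" is deleted by the simulated cleanup, so only "a" is found on the retry;
# "b" and "c" would continue through the pipeline to be re-created.
found = query_existing(store, ["a", "b", "c"], now)
```

On the retry the removed row is not found, so the caller treats it like any other missing object.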

For newly created artifacts and content we need to set this timestamp as well. This way they will be marked as planned to be used (that is, "I plan to add this artifact to the content" or "I plan to add this content to the repo").

Upload
======

* For one-shot upload: handle artifact and content creation in 1 transaction.
* For non one-shot upload: set the timestamp during content/artifact creation or, if the artifact/content already exists, update the timestamp. It is expected that users will make their second call to associate the orphaned artifact within X seconds/minutes (TBD) of `timestamp_of_interest`.

Modify
======

* Set the timestamp on the content specified in `add_content_hrefs` to prevent orphan cleanup from removing it in the middle of the running task.

Orphan cleanup
==============

Will be able to run in parallel without locks.

Remove Content that:

- has no membership (does not belong to any repo version)
- has a `timestamp_of_interest` older than X seconds (was marked to be used X seconds ago but still does not belong to any repo version)

Remove Artifact that:

- has no membership (does not belong to any content)
- has a `timestamp_of_interest` older than X seconds (was marked to be used X seconds ago but still does not belong to any content)
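
The two removal rules above reduce to the same predicate: no membership, and a `timestamp_of_interest` older than the grace period. A minimal sketch in plain Python follows. The `Item` class, the `removable` function, and the `grace` parameter are hypothetical names chosen for illustration; a real implementation would presumably express this as a database query.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Item:
    """Hypothetical stand-in for an Artifact or Content row."""

    name: str
    # Number of repo versions (for content) or content units (for artifacts).
    memberships: int
    timestamp_of_interest: datetime


def removable(items, now, grace):
    """Return items with no membership whose timestamp is older than the grace period."""
    cutoff = now - grace
    return [
        item
        for item in items
        if item.memberships == 0 and item.timestamp_of_interest < cutoff
    ]


now = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
items = [
    Item("old-orphan", 0, now - timedelta(hours=1)),
    Item("fresh-orphan", 0, now - timedelta(minutes=1)),
    Item("old-member", 1, now - timedelta(hours=1)),
]
# Assumed grace period of 10 minutes; the document leaves X as TBD.
to_remove = removable(items, now, grace=timedelta(minutes=10))
```

Only the orphan whose timestamp is older than the grace period is selected. The recently touched orphan and the item that still has a membership are kept.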