# Orphan cleanup
Problem:
Orphan cleanup is a tasking singleton: it waits until all other tasks have halted before running, locks everything, and then takes a non-trivial amount of time to run.
Solution:
Have the orphan cleanup run asynchronously and without any locks. This will allow any worker to work on it in parallel with other Pulp task types.
Challenge:
Handle the race between another task associating orphaned content with a repository version and orphan cleanup deleting that same orphan.
Brainstorming Ideas:
1. do not remove newly orphaned content; only remove orphaned content that is at least 5-10 minutes old (for example)
* we only know "recently created", not "recently orphaned"
2. Mark artifact/content as 'protected'
* what if an artifact never becomes content? It will stay marked as protected and it won't be possible to clean it up
* similarly for content: what if content never gets associated with a repo? It will stay marked as protected and it won't be possible to clean it up
## Where do we experience the race between content association and removal?
1. Sync Pipeline
2. New content creation during upload
3. General plugin code
4. `modify` endpoints that validate in the viewsets but associate in the tasks
## Idea #1 (was dropped in favour of idea #2)
Have orphan cleanup only remove Artifacts or Content older than X seconds
1. Sync pipeline
* when adding content to the repo version, if content has gone missing, re-send it to the pipeline
* note - make sure not to lose the original relations in the extra_data on the DC
* when creating content out of an artifact, if the artifact is no longer available, re-send it through the pipeline
2. New content creation during upload
* if one-shot upload - it is handled in 1 transaction. Not every one-shot upload adds the newly created content to a repo version. In that case, if the newly created content does not get added to a repo within X seconds (TBD), it risks being cleaned up. Ask the user to re-upload from scratch.
* for non-one-shot upload, have orphan cleanup only remove Artifacts or Content older than X seconds. There is still a window for a race condition - in that case just fail and ask the user to re-upload from scratch
* what about content that already exists in Pulp when the upload is performed? While being re-used during the upload, it can get cleaned up at any moment
3. General plugin code
* use transactions
4. modify endpoints that validate in the viewsets but associate in the tasks
* content specified in the add_content_hrefs might get removed in the meantime and become unavailable, in which case the task will fail
# Proposal (Idea #2)
Add a new field `timestamp_of_interest` on the artifact and content models
- This field should be set at the artifact/content creation time
- This field should be updated whenever we work with the existing artifact/content
It is expected that the artifact will be added to the content (unorphaning the artifact) within X seconds/minutes (TBD) of the `timestamp_of_interest`. Similarly, content is expected to be added to a repo version (unorphaning the content) within X seconds/minutes (TBD).
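
As a rough illustration of the field's lifecycle, the sketch below uses a plain dataclass instead of the real Pulp models. The names `Tracked`, `touch`, `is_protected` and the 10-minute `ORPHAN_PROTECTION` value are assumptions made for this example only; the proposal leaves X undecided.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical grace period; the proposal leaves the real value of X open (TBD).
ORPHAN_PROTECTION = timedelta(minutes=10)


def utcnow() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Tracked:
    """Stand-in for an Artifact or Content row; not the real Pulp model."""

    name: str
    # Set at creation time, as the proposal requires.
    timestamp_of_interest: datetime = field(default_factory=utcnow)

    def touch(self, now: Optional[datetime] = None) -> None:
        """Record that a task is working with this existing object."""
        self.timestamp_of_interest = now or utcnow()

    def is_protected(self, now: Optional[datetime] = None) -> bool:
        """True while the object is still within the grace period."""
        return (now or utcnow()) - self.timestamp_of_interest < ORPHAN_PROTECTION


# An artifact created an hour ago is past the grace period until it is touched.
old = Tracked("artifact-1", timestamp_of_interest=utcnow() - timedelta(hours=1))
was_protected = old.is_protected()
old.touch()
is_protected_now = old.is_protected()
```

Touching an object only refreshes the timestamp; it does not mark the object as permanently protected, which is what avoids the "never cleaned up" problem from the brainstorming list.
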
Sync Pipeline
=============
The `timestamp_of_interest` will be set in the sync pipeline in the `QueryExistingArtifacts` and `QueryExistingContents` stages.
When querying for existing artifacts/content and they are found, set the timestamp. If committing this transaction fails because orphan cleanup committed a transaction that removed objects whose `timestamp_of_interest` was being set, retry the exact same transaction. The second time, the removed objects will not be found, and the pipeline will proceed normally to re-download/re-create the artifact/content in later stages.
For newly created artifacts and content we need to set this timestamp as well. This way they will be marked as 'planned to be used' (i.e. I plan to add this artifact to the content, or I plan to add this content to the repo).
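
The retry behaviour described above can be sketched with an in-memory dict standing in for the database. Everything here is hypothetical: `RowVanished`, `touch_existing`, `touch_existing_with_retry` and the `concurrent_cleanup` hook exist only for this illustration. How the failure actually surfaces depends on the database and its isolation level, which this sketch does not model.

```python
class RowVanished(Exception):
    """Hypothetical error: a row being updated was deleted concurrently."""


def touch_existing(store, keys, now, concurrent_cleanup=None):
    """One 'transaction': find existing rows and set their timestamp."""
    found = [k for k in keys if k in store]
    if concurrent_cleanup is not None:
        concurrent_cleanup(store)  # simulates orphan cleanup committing first
    vanished = [k for k in found if k not in store]
    if vanished:
        raise RowVanished(vanished)
    for k in found:
        store[k] = now
    return found


def touch_existing_with_retry(store, keys, now, concurrent_cleanup=None, attempts=2):
    """Retry the exact same transaction if rows vanished underneath it."""
    for _ in range(attempts):
        try:
            return touch_existing(store, keys, now, concurrent_cleanup)
        except RowVanished:
            continue
    raise RuntimeError("rows kept vanishing; giving up")


# key -> timestamp_of_interest (plain integers keep the example short)
store = {"a1": 100, "a2": 100}
found = touch_existing_with_retry(
    store,
    ["a1", "a2", "a3"],
    now=200,
    concurrent_cleanup=lambda s: s.pop("a1", None),
)
# found == ["a2"]; "a1" and "a3" are treated as new by the later stages.
```

On the second attempt the deleted row is simply not found, so it is handled like any other artifact or content that does not exist yet.
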
Upload
======
* if one-shot upload - handle artifact and content creation in 1 transaction.
* for non-one-shot upload, set the timestamp during content/artifact creation or, if the artifact/content already exist, update the timestamp. It is expected that users will make their second call to associate the orphaned artifact within X seconds/minutes (TBD) of the `timestamp_of_interest`.
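
A minimal sketch of the "create or reuse, then set the timestamp" step for a non-one-shot upload, assuming artifacts are identified by their SHA-256 digest. The `upload_artifact` function and the dict-based store are hypothetical and only illustrate the rule.

```python
import hashlib


def upload_artifact(store, data, now):
    """Create the artifact for `data`, or reuse an existing one, and touch it.

    store: dict mapping sha256 hex digest -> timestamp_of_interest
    Returns (digest, created).
    """
    digest = hashlib.sha256(data).hexdigest()
    created = digest not in store
    store[digest] = now  # set on creation, refreshed on reuse
    return digest, created


artifacts = {}
first_digest, first_created = upload_artifact(artifacts, b"payload", now=100)
second_digest, second_created = upload_artifact(artifacts, b"payload", now=500)
# The second upload reuses the artifact and moves its timestamp to 500.
```

Because the reused artifact gets a fresh timestamp, the user's follow-up call has the full X-second window again, even if the artifact had been sitting orphaned for a long time.
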
Modify
======
* set the timestamp on the content specified in the add_content_hrefs to prevent orphan cleanup from removing it in the middle of the running task.
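
One way to picture this is a viewset-side helper that validates the hrefs and refreshes their timestamps before the task is dispatched. `validate_and_touch` and the href strings are made up for this sketch and are not part of the Pulp API.

```python
def validate_and_touch(content, add_content_hrefs, now):
    """Check that every href exists, then refresh its timestamp.

    content: dict mapping href -> timestamp_of_interest
    """
    missing = [href for href in add_content_hrefs if href not in content]
    if missing:
        raise ValueError(f"content not found: {missing}")
    for href in add_content_hrefs:
        content[href] = now
    return list(add_content_hrefs)


content = {"href-1": 0, "href-2": 0}
accepted = validate_and_touch(content, ["href-1"], now=50)
# Only "href-1" is refreshed; "href-2" keeps its old timestamp.
```

This only helps if the task starts and finishes the association within X seconds of the validation, which is the same expectation the proposal states for the other workflows.
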
Orphan cleanup
==============
Orphan cleanup will be able to run in parallel with other tasks, without locks.
Remove Content that meets both conditions:
- has no membership (does not belong to any repo version)
- has a `timestamp_of_interest` older than X seconds (was marked to be used more than X seconds ago but still does not belong to any repo version)
Remove Artifact that meets both conditions:
- has no membership (does not belong to any content)
- has a `timestamp_of_interest` older than X seconds (was marked to be used more than X seconds ago but still does not belong to any content)
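
The removal rules above can be sketched as a single filter. The `removable` function, the dict/set data layout and the 10-minute `ORPHAN_PROTECTION` value are assumptions for illustration; a real implementation would express the same two conditions as a database query.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical value for X; the proposal leaves it TBD.
ORPHAN_PROTECTION = timedelta(minutes=10)


def removable(rows, memberships, now, protection=ORPHAN_PROTECTION):
    """Return ids that have no membership and are past the grace period.

    rows: dict mapping id -> timestamp_of_interest
    memberships: set of ids that belong to at least one parent
                 (a repo version for content, a content unit for artifacts)
    """
    cutoff = now - protection
    return sorted(
        row_id
        for row_id, ts in rows.items()
        if row_id not in memberships and ts < cutoff
    )


now = datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = {
    "c1": now - timedelta(hours=1),    # old, but in a repo version: kept
    "c2": now - timedelta(hours=1),    # old and orphaned: removed
    "c3": now - timedelta(minutes=5),  # orphaned, but recently touched: kept
}
to_remove = removable(rows, memberships={"c1"}, now=now)
# to_remove == ["c2"]
```
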