# Tasking System Redesign

POC: https://github.com/pulp/pulpcore/pull/1261

## Locking Mechanism

1. Locks will be done using [Postgresql Advisory Locks](https://www.postgresql.org/docs/9.4/explicit-locking.html)
2. Strings will be deterministically converted to BigInt since Advisory Locks use integers
    * Collisions in that conversion might impact performance (throughput), but not correctness (as long as the same task does not attempt to fetch the same lock twice)

## User Requirements

1. Tasks are processed such that tasks sharing a resource are handled First Come First Serve (FCFS).
2. Synchronous application code may attempt to acquire some locks (non-blocking) and simply fail if they are unavailable. It cannot, however, jump the resource queue (i.e. it must not use a resource that is already assigned to a queued task).
3. Workers can be reserved for short-running tasks.
4. A high number of workers can be bad for database performance. Maybe workers can be grouped with one frontend to perform the distributed find-the-next-task algorithm.

**Observation:** Given user requirement 1, a waiting task that is ready to be worked on will not lose that state again, because it has called shotgun on all its requested resources.

## Algorithm for each Worker

1. Finding the next task the worker can process
    * https://github.com/mdellweg/steakhouse/blob/master/steakhouse/grill.py#L38
    * Find the oldest task in Waiting state that does not require a resource that is already locked (not sufficient to satisfy user requirement 1)

~~2. Locking the resources for that task
Lock on resources one at a time, in a deterministic order such that all workers will attempt to lock in the same order~~

2. Locking will be handled on the task level. Each process will only ever manipulate one task and therefore only ever hold one lock. The resources are protected transitively by requirement (1).

**Note:** Maybe it is sufficient to lock on the task alone, since the algorithm for finding the next workable task ensures that no worker will even attempt to take a task if its resources are used by an earlier task.

**Idea:** When looking at the running tasks anyway, we can attempt to clean up stale / abandoned tasks on the way. "A task in state RUNNING without a lock is considered invalid." This would move the task cleanup away from a worker_cleanup.

## Algorithm Example

Tasks with Resources (Spices):

* T1 Pepper <-- user requested first
* T2 Salt
* T3 Salt, Pepper
* T4 Salt, Cumin
* T5 Cumin <-- user requested last

## Two worker example

* W1 - start T1; lock Pepper
* W2 - try to start T1 -> fail
* W2 - start T2; lock Salt
* W2 - finish T2; unlock Salt
* W2 - waiting
* W1 - finish T1; unlock Pepper
* W1 - start T3; lock Salt, Pepper
* W2 - waiting
* W1 - finish T3; unlock Salt, Pepper
* W2 - start T4; lock Salt, Cumin
* W1 - waiting
* W2 - finish T4; unlock Salt, Cumin
* W1 - start T5; lock Cumin
* W2 - nothing left -> hibernate
* W1 - finish T5; unlock Cumin
* W1 - nothing left -> hibernate

## Idea: Example with shared and exclusive locks

* T1 Pepper:exclusive
* T2 Salt:shared
* T3 Salt:shared, Pepper:shared
* T4 Salt:exclusive, Cumin:exclusive
* T5 Cumin:shared
* T6 Cumin:shared, Pepper:exclusive

## Two worker example

* W1 - start T1; lock Pepper:exclusive
* W2 - try to start T1 -> fail
* W2 - start T2; lock Salt:shared
* W2 - finish T2; unlock Salt
* W2 - waiting
* W1 - finish T1; unlock Pepper
* W1 - start T3; lock Salt:shared, Pepper:shared
* W2 - waiting
* W1 - finish T3; unlock Salt, Pepper
* W2 - start T4; lock Salt:exclusive, Cumin:exclusive
* W1 - waiting
* W2 - finish T4; unlock Salt, Cumin
* W1 - start T5; lock Cumin:shared
* W2 - start T6; lock Cumin:shared, Pepper:exclusive
* W1 - finish T5; unlock Cumin
* W1 - waiting
* W2 - finish T6; unlock Cumin, Pepper
* W2 - waiting

## Human Work to Do

1. Extend the Task table to include:
    * Task args and kwargs
    * Resource Locks to be reserved
2. Write a new worker entry point based on click. It needs to:
    * heartbeat
        * Can advisory locks help here too?
    * record its worker entry in the db just like RQ did
    * watch the other workers in case they disappear, e.g. due to OOM, and cancel any unfinished tasks
3. Implement the identify_next_task() code
4. Implement the locking function

## Blog Post Outline

### Benefits Overview

* Queued tasks are no longer canceled when workers stop or die
* Pulp is now fully highly available
* Simpler architecture with no resource manager
* Throughput scales with workers <------- show one graph
* Increased reliability
    * advisory locks auto-cleanup avoiding deadlock
    * resolves issues from data being synchronized between PostgreSQL and Redis

### How to Use

* Disabled by default
* In 3.14 enable this setting, and start your workers this way...
* Can go back to the old style
* Switch with downtime; drain task queues.
* TBD when the new style will become the default, but it will become the default
* TBD when the old style will be removed, but it will be removed

### Extra Info

Identify that Redis is no longer used for tasking but is being kept in the architecture, likely to speed up content serving caching

###### tags: `Tasking System`
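
The string-to-BigInt conversion and the find-the-next-task rule described above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions, not the POC's implementation: the `Task` dataclass, the `lock_key` helper and the choice of SHA-256 are hypothetical, and only the name `identify_next_task` comes from this note. It models exclusive resources only and leaves out the shared/exclusive idea.

```python
import hashlib
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


def lock_key(name: str) -> int:
    """Map a string to a signed 64-bit integer, deterministically.

    Assumption: SHA-256 is just one possible hash. The result fits a
    PostgreSQL bigint, which is what advisory lock functions such as
    pg_try_advisory_lock() accept.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


@dataclass
class Task:
    # Hypothetical stand-in for a row of the Task table.
    name: str
    resources: FrozenSet[str]
    state: str = "waiting"  # "waiting", "running" or "done"


def identify_next_task(tasks: List[Task]) -> Optional[Task]:
    """Return the oldest waiting task that is ready, or None.

    `tasks` must be ordered oldest first. A task is ready when none of
    its resources is claimed by an earlier task that is still running
    or waiting, which is what enforces FCFS per resource.
    """
    claimed = set()
    for task in tasks:
        if task.state == "done":
            continue
        if task.state == "waiting" and not (task.resources & claimed):
            return task
        # Running tasks and blocked waiting tasks both claim their
        # resources, so later tasks cannot jump the queue.
        claimed |= task.resources
    return None
```

Run against the spice example, this picks T1 and then T2 for the two workers, and returns `None` while T1 is still running even after T2 has finished, because T3 has already claimed Salt and Pepper ahead of T4 and T5.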