# Queue Overload
> Tough on queue overload, tough on the causes of queue overload
> **Tony Blair**
## Navigation
1. [Problem](https://hackmd.io/@jwdunne/HJXyhNY4h)
2. [Observability](https://hackmd.io/@jwdunne/S1pJ1CgHn)
3. [Testing](https://hackmd.io/@jwdunne/H1zKkAeSn)
4. [Throughput optimisation opportunities](https://hackmd.io/@jwdunne/H1h2k0xH3)
5. [Backpressure](https://hackmd.io/@jwdunne/B1WZeCeBh)
6. [Load shedding](https://hackmd.io/@jwdunne/BJB4MReH2)
7. [Autoscaling](https://hackmd.io/@jwdunne/Bkw_zAxHn)
## Problem
“Queue overload” is a recurring problem. During queue overload, the queue is so long that there are significant delays in completing jobs.
Users sometimes notice this: it seems as if their actions have not “persisted”, or that their form submissions have not “come through” to Leadflo. This is mitigated, however, by using dedicated, prioritised queues for form submissions and user-driven actions.
This can also delay sending out communications, automated or otherwise—a key feature of Leadflo.
Although an overloaded queue is seldom noticed by end users, and Leadflo continues to function, recovering from overload does demand time: manual load shedding and manual scaling of workers.
This manual approach isn't scalable because it demands:
- Permissive DB access
- Good knowledge of what can be shed
- Knowledge of worker scaling limits
### Why the queue overloads
The queue overloads when the job queued rate (lambda, $\lambda$) is greater than or equal to the per-worker job completion rate (mu, $\mu$) multiplied by the number of workers $c$. When $\lambda \geq c\mu$, the backlog grows without bound.
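As a rough illustration (the figures below are invented, not measured), utilisation is the ratio of the job queued rate to the total completion rate:

$$
\rho = \frac{\lambda}{c\mu}
$$

When $\rho \geq 1$, the backlog grows at roughly $\lambda - c\mu$ jobs per unit time. For example, if jobs are queued at $\lambda = 120$ jobs/minute and $c = 2$ workers each complete $\mu = 50$ jobs/minute, then $\rho = 120/100 = 1.2$ and the backlog grows by around 20 jobs every minute until either $\lambda$ falls or $c\mu$ rises.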
This decomposes the problem into two categories of approach:
- Increasing job completion rate
- Decreasing job queued rate
#### Increasing job completion rate
Increasing job completion rate involves risky pieces of work:
1. Introducing auto-scaling of worker nodes (which depends on DB connection pooling)
2. Optimising the running times of problematic jobs
3. Optimising memory/CPU usage of problematic jobs
##### Increasing workers
Increasing the number of workers is the most effective way to combat queue overload.
This is best illustrated by shopping in Tesco Express: in a busy period, with only one assistant on checkout, the queue swells. The assistant signals for a second assistant. As soon as the second assistant begins serving, the queue dissolves.
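To put rough numbers on this (an illustrative calculation assuming an M/M/c queue, i.e. Poisson arrivals and exponential service times; the figures are invented), the average time a job waits in the queue with a single worker is

$$
W_q = \frac{\rho}{\mu - \lambda}, \qquad \rho = \frac{\lambda}{c\mu}
$$

At $\lambda = 9$ jobs/minute and $\mu = 10$ jobs/minute, one worker runs at $\rho = 0.9$ and jobs wait $0.9/(10-9) = 0.9$ minutes (54 seconds) on average. Add a second worker and utilisation halves to $0.45$; the Erlang C formula then gives a probability of queueing of about $0.28$ and an average wait of $0.28/(2 \cdot 10 - 9) \approx 0.025$ minutes, roughly 1.5 seconds. Halving utilisation cuts the wait by a factor of about 35, not 2, which is why the queue seems to “dissolve”.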
Unfortunately, we are limited by the DB connection limit: each worker node holds a DB connection per client. Removing that limit means large, risky infrastructural changes around critical resources. We can, however, get auto-scaling working within the limit by using the current desired count as the maximum capacity, with a minimum capacity of 1.
AWS auto-scaling automates increasing and decreasing the number of workers to hold a target value of $\lambda / c\mu$ by tracking a custom CloudWatch metric. We would set a maximum capacity to put a ceiling on cost, which we can then adjust manually as we scale.
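A minimal sketch of the metric side follows. The `putMetricData` call is the real AWS SDK for PHP API; the namespace, metric name, region and the rate-measuring helpers are assumptions, not existing Leadflo code.

```php
<?php
// Sketch: publish queue utilisation (lambda / c*mu) as a custom CloudWatch
// metric for target-tracking auto-scaling.

require 'vendor/autoload.php';

use Aws\CloudWatch\CloudWatchClient;

$cloudWatch = new CloudWatchClient([
    'region'  => 'eu-west-2', // assumption
    'version' => 'latest',
]);

// Hypothetical measurements taken over the last minute.
$lambda = measureJobQueuedRate();     // jobs queued per second
$mu     = measureJobCompletionRate(); // jobs completed per second, per worker
$c      = currentWorkerCount();       // running worker tasks

$utilisation = $lambda / max(1e-9, $c * $mu);

$cloudWatch->putMetricData([
    'Namespace'  => 'Leadflo/Queue',        // assumption
    'MetricData' => [[
        'MetricName' => 'QueueUtilisation', // assumption
        'Value'      => $utilisation,
        'Unit'       => 'None',
    ]],
]);
```

The target-tracking policy would then scale the worker service to hold `QueueUtilisation` near a chosen value below 1.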
##### Optimising problematic jobs
The other option is to improve the performance of the jobs themselves.
When a long-running job occupies a worker, it reduces the number of available workers. Because reduced worker availability is the biggest cause of queue overload, it is better to favour thousands of small jobs that each complete in milliseconds than one larger job that runs for minutes.
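A minimal sketch of that chunking pattern, assuming a hypothetical `Queue` interface and job name (not existing Leadflo code):

```php
<?php
// Sketch of favouring many small jobs over one large one.

interface Queue
{
    /** Push a job payload onto the queue. */
    public function push(string $jobClass, array $payload): void;
}

/**
 * Rather than one job that emails every lead of a campaign in a single
 * minutes-long run, enqueue one small job per chunk of lead IDs. Each chunk
 * job completes quickly and frees its worker for other queued work.
 *
 * @param string[] $leadIds
 */
function enqueueCampaignEmailChunks(Queue $queue, string $campaignId, array $leadIds): void
{
    foreach (array_chunk($leadIds, 500) as $chunk) {
        $queue->push('SendCampaignEmailChunk', [
            'campaignId' => $campaignId,
            'leadIds'    => $chunk,
        ]);
    }
}
```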
AWS kills worker tasks that use too much memory/CPU, which again decreases the availability of workers, and there is a long lead time in booting a replacement. This can combine with our retry policy into a perfect storm: the offending job is retried up to 10 times, killing workers over and over.
Optimising the performance of jobs demands a case-by-case approach of identifying and then optimising problematic jobs, prioritised by impact.
#### Decreasing job queued rate
Another way of attacking the problem is to _decrease_ the job queued rate. We would do this by:
- Introducing backpressure
- Shedding load
This is less risky and cheaper but less effective. It is, therefore, suitable as a “frontline treatment” ahead of and complementing auto-scaling.
##### Backpressure
Backpressure slows down job producers.
For example, every integration webhook notification emits an event. Backpressure would add a delay before emitting the event when the queue is approaching overload, and a more aggressive delay during overload.
A naive implementation would increase API response times. But PHP-FPM allows us to do this _after_ the response is sent to the browser.
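A minimal sketch of what that could look like in a PHP-FPM webhook endpoint; `fastcgi_finish_request()` is the real PHP-FPM function for flushing the response early, while the thresholds and the `queueDepth()` / `emitWebhookEvent()` helpers are assumptions:

```php
<?php
// Sketch of backpressure in a webhook endpoint running under PHP-FPM. The
// response goes out first, then the process sleeps before emitting the
// event, so the caller never sees the delay.

http_response_code(202);
header('Content-Type: application/json');
echo json_encode(['status' => 'accepted']);

// Real PHP-FPM function: flushes the response and closes the connection
// while this script keeps running.
if (function_exists('fastcgi_finish_request')) {
    fastcgi_finish_request();
}

$depth = queueDepth(); // hypothetical: current number of queued jobs

if ($depth > 50_000) {        // overloaded: back off hard
    sleep(5);
} elseif ($depth > 10_000) {  // approaching overload: back off gently
    sleep(1);
}

emitWebhookEvent($_POST);     // hypothetical: emits the event that queues the job
```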
Another example is increasing the interval at which automation workflows are triggered (currently every 5 minutes). During overload, this fixed schedule causes a “pile on” of jobs, which today are shed manually until the overload subsides.
##### Load shedding
Load shedding drops jobs that we can live without.
For example, we may be able to cope for a short time without some broadcast events. These are high-volume messages and, again, can lead to a pile-on effect.
Other jobs may be dropped if they already exist on the queue.
We could aggregate monitoring counters as a sum and disregard all but the most recent monitoring gauges. `StartProcessingWorkflows` and `ProcessWorkflow` do not need to be queued at all if an equivalent job already exists on the queue for that client.
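A sketch of one way to implement that deduplication with Redis (phpredis); the key naming, the 300-second safety TTL and the decision to dedupe at enqueue time are assumptions:

```php
<?php
// Sketch: skip queueing a ProcessWorkflow job when one is already queued for
// the same client, using a Redis "SET ... NX EX" flag. The worker would DEL
// the key when it picks the job up (not shown).

function shouldQueueProcessWorkflow(\Redis $redis, string $clientId): bool
{
    // Returns true only if the key did not already exist, i.e. no equivalent
    // job is currently on the queue for this client.
    return $redis->set(
        "queue:pending:process-workflow:{$clientId}",
        '1',
        ['nx', 'ex' => 300] // safety TTL so a lost job cannot block for ever
    ) === true;
}
```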
##### Job failures
We also treat all job failures the same, i.e. retry them up to 10 times with jittered backoff.
This is futile for permanent failures, and it's not ideal for transient failures in times of overload either. In addition, permanent failures feed a separate but related problem of dead-letter queue overload; some permanent failures we can do nothing about and should simply drop.
For example, a network timeout is a transient failure where retries are useful. A PHP type error due to a regression is a permanent failure which should not be retried until we have fixed the regression. Finally, a form submission that is missing all required information is a permanent failure that we should never retry, because the information is lost; only the record of the failure is useful for identifying why the information was missing.
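A sketch of how failures could be classified so the retry policy can differ per class; the `RetryDecision` cases and exception types are illustrative assumptions, except `TypeError`, which is PHP's built-in type-error class:

```php
<?php
// Sketch: classify a failed job so the queue can retry, park or drop it.

// Hypothetical exception types used for illustration.
final class NetworkTimeoutException extends \RuntimeException {}
final class MissingRequiredFieldsException extends \InvalidArgumentException {}

enum RetryDecision
{
    case Retry; // transient: retry with jittered backoff
    case Park;  // permanent until a code fix: park on the dead-letter queue
    case Drop;  // permanent and unrecoverable: record the failure, then drop
}

function classifyFailure(\Throwable $e): RetryDecision
{
    return match (true) {
        // Transient infrastructure failure, e.g. a network timeout.
        $e instanceof NetworkTimeoutException => RetryDecision::Retry,

        // A regression in our own code; retrying cannot help until we deploy a fix.
        $e instanceof \TypeError => RetryDecision::Park,

        // Input that can never be recovered, e.g. a form submission missing
        // all required fields. Only the record of the failure is useful.
        $e instanceof MissingRequiredFieldsException => RetryDecision::Drop,

        // Default to retrying: better to waste a few retries than lose a job.
        default => RetryDecision::Retry,
    };
}
```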
## Goals
These goals depend on the introduction of DB connection pooling:
- Eliminate the need for manual scaling of workers
- Introduce auto-scaling of worker tasks
These goals can be achieved without DB connection pooling:
- Eliminate the need for manual load shedding
- Introduce alerts around overloading and overloaded queue states
- Introduce automated backpressure in times of overload
- Introduce automated load shedding and throughput optimisation both in general and in times of overload
## Solution & Implementation
I suggest we implement this in a number of phases, and the work below is broken down accordingly:
1. [Observability](https://hackmd.io/@jwdunne/S1pJ1CgHn)
2. [Testing](https://hackmd.io/@jwdunne/H1zKkAeSn)
3. [Throughput optimisation opportunities](https://hackmd.io/@jwdunne/H1h2k0xH3)
4. [Backpressure](https://hackmd.io/@jwdunne/B1WZeCeBh)
5. [Load shedding](https://hackmd.io/@jwdunne/BJB4MReH2)
6. [Autoscaling](https://hackmd.io/@jwdunne/Bkw_zAxHn)
Autoscaling comes last because, in its current form (constrained by the DB connection limit), it cannot provide much benefit during queue overload.
## References
### Theory
1. [Little's Law](https://en.wikipedia.org/wiki/Little%27s_law)
2. [Server Utilisation & Queuing](https://www.johndcook.com/blog/2009/01/30/server-utilization-joel-on-queuing/)
3. [Exponential Effect of Increasing Workers](https://www.johndcook.com/blog/2008/10/21/what-happens-when-you-add-a-new-teller/)
### Practice
1. [Queues Don't Fix Overload HN Thread](https://news.ycombinator.com/item?id=8632043)
2. [AWS on Avoiding Insurmountable Queue Backlogs](https://aws.amazon.com/builders-library/avoiding-insurmountable-queue-backlogs/)
3. [AWS on Load Shedding](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/)
4. [Queues](https://www.marionzualo.com/queues/)
5. [How to Implement Backpressure and Load Shedding](https://www.marionzualo.com/2019/05/30/how-to-implement-backpressure-and-load-shedding/)
6. [Handling Overload](https://ferd.ca/handling-overload.html)
### Autoscaling
1. [Terraform Autoscaling](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/appautoscaling_target#ecs-service-autoscaling)
2. [AWS Docs on Autoscaling](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html#service-auto-scaling-deployments)
3. [Application Autoscaling: Target Tracking](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-target-tracking.html)
4. [Publishing Custom CloudWatch Metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html)
5. [Example Terraform ECS App Autoscaling](https://github.com/techservicesillinois/terraform-aws-ecs-service/blob/main/autoscale.tf)
6. [Terraform ECS App Autoscaling Tutorial](https://towardsaws.com/aws-ecs-service-autoscaling-terraform-included-d4b46997742b)