# Queue Overload - Autoscaling
## Navigation
1. [Problem](https://hackmd.io/@jwdunne/HJXyhNY4h)
2. [Observability](https://hackmd.io/@jwdunne/S1pJ1CgHn)
3. [Testing](https://hackmd.io/@jwdunne/H1zKkAeSn)
4. [Throughput optimisation opportunities](https://hackmd.io/@jwdunne/H1h2k0xH3)
5. [Backpressure](https://hackmd.io/@jwdunne/B1WZeCeBh)
6. [Load shedding](https://hackmd.io/@jwdunne/BJB4MReH2)
7. [Autoscaling](https://hackmd.io/@jwdunne/Bkw_zAxHn)
## Solution
> This depends on a working solution for DB connection pooling to be fully effective. We can, however, get it working within the same capacity we have now.
We can configure the ECS service to autoscale based on a custom metric and a target value. AWS allows us to:
1. Set a min and max capacity, to prevent runaway costs
2. Set a scale-out cooldown period, so it doesn't trigger a runaway scale-out
3. Set a scale-in cooldown period, so it doesn't scale in too early and let the problem return
We should be conservative with these values to begin with and iterate.
This is also the general approach to autoscaling any ECS service. If we later need to do this for API workers, this is how we would do it.
#### Variables
To control autoscaling across environments (it'd be undesirable in staging long-term, but useful for testing), we should define a set of variables controlling autoscaling behaviour.
```terraform
variable "autoscaling_worker" {
  description = "Controls autoscaling parameters for workers"

  type = object({
    min_capacity               = number
    max_capacity               = number
    target                     = number
    scale_in_cooldown_seconds  = number
    scale_out_cooldown_seconds = number
  })
}
```
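As a starting point, values like the following would be conservative (the numbers here are illustrative assumptions to iterate on, not recommendations):
```terraform
# Hypothetical production starting values - iterate once we have real data.
autoscaling_worker = {
  min_capacity               = 2   # current baseline capacity
  max_capacity               = 6   # caps runaway costs
  target                     = 70  # queued/completed ratio to track, in percent
  scale_in_cooldown_seconds  = 600 # scale in slowly so the problem doesn't return
  scale_out_cooldown_seconds = 120 # scale out cautiously to avoid runaway scale-out
}
```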
#### Set up a custom CloudWatch metric
We just need to send a metric to CloudWatch from the application:
```php
$cloudWatch->putMetricData([
    'MetricData' => [
        [
            'MetricName' => 'JobQueuedCompletedRatio',
            'Value' => $queueLoad->ratio(),
            // Match the unit declared in the scaling policy below.
            'Unit' => 'Percent',
        ],
    ],
    'Namespace' => 'Leadflo/Queue',
]);
```
We should sample this for every 5% (configurable) of jobs queued, to minimise the cost of sending the data to AWS whilst still giving CloudWatch reasonable data.
We can do this in a listener that listens for both `JobQueued` and `JobCompleted` events. These would be synchronous listeners, since we cannot rely on the queue. That may incur some latency, but under FPM we can dispatch the events after the response has been sent to the browser (e.g. via `fastcgi_finish_request`), which should mitigate this. A sketch follows:
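Here is a minimal sketch of such a listener, assuming a `QueueLoad` service exposing the `ratio()` method used above and a hypothetical `queue.metric_sample_rate` config value:
```php
<?php

use Aws\CloudWatch\CloudWatchClient;

// Sketch only: class name, wiring, and config key are assumptions.
final class ReportQueueLoad
{
    public function __construct(
        private CloudWatchClient $cloudWatch,
        private QueueLoad $queueLoad // assumed service tracking queued/completed counts
    ) {
    }

    // Registered synchronously for both JobQueued and JobCompleted events.
    public function handle(object $event): void
    {
        // Sample ~5% of events (configurable) to keep CloudWatch API costs down.
        if (random_int(1, 100) > config('queue.metric_sample_rate', 5)) {
            return;
        }

        $this->cloudWatch->putMetricData([
            'MetricData' => [
                [
                    'MetricName' => 'JobQueuedCompletedRatio',
                    'Value' => $this->queueLoad->ratio(),
                    'Unit' => 'Percent',
                ],
            ],
            'Namespace' => 'Leadflo/Queue',
        ]);
    }
}
```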
#### Set up a scalable target
This configures a target for application autoscaling.
```terraform
resource "aws_appautoscaling_target" "worker" {
max_capacity = var.autoscaling_worker.min_capacity
min_capacity = var.autoscaling_worker.max_capacity
resource_id = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.leadflo_worker.name}"
scalable_dimension = "ecs:service:DesiredCount"
service_namespace = "ecs"
}
```
The variables allow us to keep staging pinned at a flat capacity of 1, effectively disabling autoscaling there.
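For example, staging's variable could pin capacity like this (illustrative values):
```terraform
# Staging: min == max pins the worker count, effectively disabling autoscaling.
autoscaling_worker = {
  min_capacity               = 1
  max_capacity               = 1
  target                     = 100 # irrelevant while min == max
  scale_in_cooldown_seconds  = 0
  scale_out_cooldown_seconds = 0
}
```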
#### Set up an auto-scaling policy
```terraform
resource "aws_appautoscaling_policy" "worker" {
name = "worker"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.worker.resource_id
scalable_dimension = aws_appautoscaling_target.worker.scalable_dimension
service_namespace = aws_appautoscaling_target.worker.service_namespace
target_tracking_scaling_policy_configuration {
target_value = var.autoscaling_worker.target
scale_in_cooldown = var.autoscaling_worker.scale_in_cooldown_seconds
scale_out_cooldown = var.autoscaling_worker.scale_out_cooldown_seconds
customized_metric_specification {
namespace = "Leadflo/Queue"
metric_name = "JobQueuedCompletedRatio"
statistic = "Average"
unit = "Percent"
}
}
}
```
## Implementation
- Implement an IAM role for sending metrics to CloudWatch (see the sketch after this list)
- Implement a Terraform variable for configuring worker autoscaling
- Configure a scalable target and auto-scaling policy
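As a sketch of the IAM piece, the worker task role needs `cloudwatch:PutMetricData`; the role reference and naming below are assumptions:
```terraform
# Hypothetical policy attachment - assumes an existing worker task role.
resource "aws_iam_role_policy" "worker_put_metrics" {
  name = "worker-put-metrics"
  role = aws_iam_role.leadflo_worker_task.id # assumed existing task role

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["cloudwatch:PutMetricData"]
      # PutMetricData does not support resource-level permissions,
      # but we can restrict it to our namespace with a condition.
      Resource = "*"
      Condition = {
        StringEquals = { "cloudwatch:namespace" = "Leadflo/Queue" }
      }
    }]
  })
}
```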