# Queue Overload - Testing

## Navigation

1. [Problem](https://hackmd.io/@jwdunne/HJXyhNY4h)
2. [Observability](https://hackmd.io/@jwdunne/S1pJ1CgHn)
3. [Testing](https://hackmd.io/@jwdunne/H1zKkAeSn)
4. [Throughput optimisation opportunities](https://hackmd.io/@jwdunne/H1h2k0xH3)
5. [Backpressure](https://hackmd.io/@jwdunne/B1WZeCeBh)
6. [Load shedding](https://hackmd.io/@jwdunne/BJB4MReH2)
7. [Autoscaling](https://hackmd.io/@jwdunne/Bkw_zAxHn)

## Solution

Simulating queue overload as a "fire drill" would test our mechanisms, taking inspiration from Netflix's Chaos Monkey. This could be performed once a month and/or as and when required.

A "busywork machine" could generate $N$ jobs at a variable rate over a fixed timespan, each occupying a worker for a variable amount of time. This should be enough to trigger:

* Alerts
* Overloading mechanisms
* Overloaded mechanisms
* Auto-scaling

We would set a configurable upper limit on how long the process generates jobs so that it doesn't cause hours of disruption.

One busywork job could be designed to occupy a worker indefinitely, or cause it to terminate.

This would tell us whether our mechanisms are working, whether they need fine-tuning, or whether we have missed implementing them on new code.

## Interface

This should be a CLI command:

```bash
php artisan leadflo:busywork [--overload] [--timeout=seconds] [--kill-worker]
```

By default, it runs until the queue reaches an 'overloading' state or 15 minutes have elapsed, whichever comes first. With the optional `--overload` option, it instead runs until the queue reaches an 'overloaded' state. The `--timeout` option sets a longer or shorter timeout. The `--kill-worker` option kills a worker instead of occupying it for a length of time.

This command will:

1. Dispatch one command that is intended to occupy a single worker for the timeout duration (or kill it outright)
2. Continuously dispatch commands that each occupy a worker for a random one to four seconds, until the queue reaches the desired state

Between iterations, the command will wait a random number of seconds between 0 and a maximum set by the job queued rate measured at the start of the process.

## Commands

Each "busywork" command would accept the time started and the timeout value.

There will be an `OccupyWorker` command with a boolean property `kill` that is `false` by default. By default, it will occupy a worker until the timeout. If `kill` is `true`, the command will generate a string so big that it causes the OOM killer to kill the worker.

There will also be a `SwarmWorker` command. By default, this will occupy a worker for a specified time `workTimeout`. If the current time is greater than `globalTimeout + startTime`, the job will be ignored.

## Implementation

- Implement `OccupyWorker` command and receiver
- Implement `SwarmWorker` command and receiver
- Implement `leadflo:busywork` CLI command
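
The timing rules described above (the `globalTimeout + startTime` expiry check, the one-to-four-second work duration, and the random wait between dispatches) can be sketched as a small simulation. This is a language-neutral illustration in Python, not the project's Laravel code: every name here (`is_expired`, `swarm_work_seconds`, `dispatch_delay_seconds`, `plan_drill`, `max_wait`) is hypothetical, durations are assumed to be fractional seconds, `max_wait` stands in for whatever maximum is derived from the job queued rate, and the queue-state check that ends a real drill early is left out.

```python
import random


def is_expired(now: float, start_time: float, global_timeout: float) -> bool:
    """Return True once the current time is past start_time + global_timeout."""
    return now > start_time + global_timeout


def swarm_work_seconds(rng: random.Random) -> float:
    """Pick how long one swarm job occupies a worker: one to four seconds."""
    return rng.uniform(1.0, 4.0)


def dispatch_delay_seconds(rng: random.Random, max_wait: float) -> float:
    """Pick the pause between dispatches: between 0 and max_wait seconds."""
    return rng.uniform(0.0, max_wait)


def plan_drill(start_time: float, global_timeout: float, max_wait: float, seed=None):
    """Simulate the dispatch loop and return (dispatch_time, work_seconds) pairs.

    Nothing is dispatched and nothing sleeps: time is advanced arithmetically
    so the schedule can be inspected. Assumption: the loop stops only on the
    timeout, since queue state is not modelled here.
    """
    if max_wait <= 0:
        # A zero wait would never advance the simulated clock.
        raise ValueError("max_wait must be positive")

    rng = random.Random(seed)
    now = start_time
    plan = []
    while not is_expired(now, start_time, global_timeout):
        plan.append((now, swarm_work_seconds(rng)))
        now += dispatch_delay_seconds(rng, max_wait)
    return plan
```

Calling `plan_drill(start_time=0.0, global_timeout=900.0, max_wait=2.0)` would produce a schedule for the default 15-minute drill. A worker-side `SwarmWorker` receiver could apply the same `is_expired` check before doing any work, so that jobs still queued after the drill ends are skipped rather than prolonging the disruption.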