# Worker Pull Model for Provers

## Push Model (current system)

Currently, the worker pushes prover tasks to an available prover in a [ProverPool](https://github.com/NebraZKP/worker-ts/blob/develop/src/utils/proverPool.ts). The drawbacks of this approach are twofold:

1. The set of provers is fixed when the `worker` process is instantiated. Moreover, the worker must know fine-grained details about the provers, such as IP addresses, ports, etc.
2. New provers cannot be (easily) dynamically added to auto-scale proving capacity.

![image](https://hackmd.io/_uploads/r1B1LEJS0.png)

## Pull Model (overview)

To remedy these issues, we propose a pull model in which the `worker` does not need to know the set of provers a priori. Rather, `prover_task_queue`s sit between the `worker` and the `provers`. The `worker` pushes prover tasks onto these queues; provers pull jobs at their convenience and respond with proofs.

![image](https://hackmd.io/_uploads/rkPmpNJS0.png)

## RabbitMQ RPC Pattern Overview

RabbitMQ is a message broker: it allows publishers to send messages to a queue and allows consumers to consume them. In our system, the worker is the publisher and the various provers are the consumers. We will use the pattern from the [RPC tutorial](https://www.rabbitmq.com/tutorials/tutorial-six-javascript) on RabbitMQ's website.

In the RPC pattern:

- Clients (publishers) make requests to servers (consumers) by pushing (enqueuing) tasks onto an `rpc_queue`.
- Each request has a unique `correlation_id` that identifies it.
- Servers then pull (dequeue) tasks from the queue, complete them, and push the response back to an anonymous, exclusive callback queue *on the client*.

What this means is that each request has its own callback queue, which waits for a response with a particular `correlation_id`. The `rpc_queue`, on the other hand, is shared amongst clients and servers, so multiple workers can fulfill tasks from a single `rpc_queue`.

![image](https://hackmd.io/_uploads/HyTXlBkrA.png)

## Using RabbitMQ in our Worker System

Let's take generating `outer` proofs as a running example ([code](https://github.com/NebraZKP/worker-ts/blob/develop/src/aggregation-pipeline/outerProofGenerator.ts)).

Currently, the push-based system works as follows:

- `outer` prover tasks are pushed onto the `OuterProofParams` async queue.
- When the `process` function is called on an element of this async queue, an async request is made to the `outerProverPool` via:

  ```typescript!
  await this.proverPool.request(outerProverInput)
  ```

- Under the hood, this request queries for an available prover, pushes a task to it, and waits for the response.

The new pull-based system will work as follows:

- When the `process` function is called, a prover task request will be pushed onto the `prover_rpc_queue` (see the discussion of the RabbitMQ RPC pattern in the section above).
- Any `outer` prover can pull a prover task, complete it, and push the result back to the corresponding callback queue.
- The `process` function awaits until the callback queue for this request gets a response. Once it does, it continues with its business logic. (A sketch of this flow is given below, after the list of advantages.)

Advantages of this approach are:

- The `worker` needs to know absolutely nothing about the `outer` provers (IP addresses, ports, etc.). It just needs to know the endpoint of the `prover_rpc_queue`.
- `outer` provers can be spun up at any point (even after the `worker` process starts) and start pulling tasks from the `prover_rpc_queue`.
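To make the worker-side request path concrete, here is a minimal sketch using the `amqplib` client from the RabbitMQ tutorial. The `ProverRequest` type and the `requestProof` function are illustrative placeholders rather than actual `worker-ts` code; only the `prover_rpc_queue` name and the `correlationId`/`replyTo` mechanics come from the design above.

```typescript
import * as amqp from "amqplib";
import { randomUUID } from "crypto";

// Hypothetical request shape; the real outer prover input lives in worker-ts.
type ProverRequest = { taskId: string; input: unknown };

const RPC_QUEUE = "prover_rpc_queue";

// Publish a prover task onto the RPC queue and await the proof on an
// anonymous, exclusive callback queue (the RabbitMQ RPC pattern).
async function requestProof(
  conn: Awaited<ReturnType<typeof amqp.connect>>,
  request: ProverRequest
): Promise<Buffer> {
  const channel = await conn.createChannel();
  await channel.assertQueue(RPC_QUEUE, { durable: true });

  // Callback queue owned by this worker process; deleted when the connection closes.
  const { queue: callbackQueue } = await channel.assertQueue("", { exclusive: true });
  const correlationId = randomUUID();

  return new Promise<Buffer>((resolve) => {
    // Resolve only when the response carries our correlationId.
    channel.consume(
      callbackQueue,
      (msg) => {
        if (msg && msg.properties.correlationId === correlationId) {
          resolve(msg.content);
          channel.close();
        }
      },
      { noAck: true }
    );

    // Enqueue the task; any idle prover may pick it up. `persistent: true`
    // asks RabbitMQ to write the message to disk (see the durability section below).
    channel.sendToQueue(RPC_QUEUE, Buffer.from(JSON.stringify(request)), {
      correlationId,
      replyTo: callbackQueue,
      persistent: true,
    });
  });
}
```

In the real `process` function, the resolved buffer would be deserialized into the proof and the existing business logic would continue from there.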
Each task on the `prover_rpc_queue` comes with a `replyTo` field, which contains info about the callback queue to which the response needs to be pushed.

![image](https://hackmd.io/_uploads/SykFeByS0.png)

Note that we shaded the callback queues blue because they exist within the worker process.

## Robustness to Worker Restarts

If the `worker` restarts, there will still be jobs on the `prover_rpc_queue`, and responses to these tasks may confuse the restarted worker. To naively address this, we can run a script that simply purges the queue (see https://stackoverflow.com/questions/5313027/how-do-i-delete-all-messages-from-a-single-queue-using-the-cli).

## Robustness to Provers Dying and RabbitMQ Process Going Down

RabbitMQ has the concepts of message acknowledgement and of durability/persistence. The first allows a task to be returned to the queue if the consumer working on it dies before acknowledging it. The second allows queues and tasks to be saved to disk so that, if the RabbitMQ process itself dies, tasks on the queue are not lost.

![image](https://hackmd.io/_uploads/H1XlbH1rR.png)
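To illustrate how these two features map onto the prover side, below is a sketch of a prover consumer using `amqplib`. The `runProver` entry point and the `generateOuterProof` callback are hypothetical stand-ins for the real prover logic; the durable queue, `prefetch(1)`, and manual `ack` correspond to the acknowledgement/durability behavior described above.

```typescript
import * as amqp from "amqplib";

const RPC_QUEUE = "prover_rpc_queue";

// Hypothetical prover entry point; `generateOuterProof` stands in for the real prover.
async function runProver(
  rabbitUrl: string,
  generateOuterProof: (input: Buffer) => Promise<Buffer>
): Promise<void> {
  const conn = await amqp.connect(rabbitUrl);
  const channel = await conn.createChannel();

  // `durable: true`: the queue definition survives a RabbitMQ restart.
  await channel.assertQueue(RPC_QUEUE, { durable: true });
  // Hand this prover at most one unacknowledged task at a time.
  await channel.prefetch(1);

  await channel.consume(
    RPC_QUEUE,
    async (msg) => {
      if (msg === null) return;
      const proof = await generateOuterProof(msg.content);

      // Push the proof to the worker's callback queue named in `replyTo`,
      // echoing the request's correlationId so the worker can match it.
      channel.sendToQueue(msg.properties.replyTo, proof, {
        correlationId: msg.properties.correlationId,
      });

      // Ack only after replying: if this prover dies mid-task, the unacked
      // message is returned to the queue and redelivered to another prover.
      channel.ack(msg);
    },
    { noAck: false }
  );
}
```

For queued tasks themselves to survive a broker restart, the queue must be declared `durable` (as above) and the worker must publish with `persistent: true` (as in the earlier worker-side sketch); the exclusive callback queues are intentionally transient.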