# Big Data Project 2023 @ PES University - Distributed Load Testing System
## Goal
Design and build a distributed load-testing system that coordinates multiple
driver nodes to run a highly concurrent, high-throughput load test on a web
server. This system will use Kafka as its communication service.
## Application Architecture

- The orchestrator node and driver node are implemented as separate processes
(you do _not_ need to provision separate VMs for each)
- The Orchestrator node and the Driver node must communicate via Kafka (see
[topic and message descriptor](#json-message-and-topic-descriptor-format))
- The Orchestrator node and driver node must roughly have the above-described
components.
## Final Deliverables
### Functionality
- The system must use Kafka as a single point of communication, publishing and
subscribing roughly to the below topics. If you feel like there can be an
improvement in the format (within reasonable bounds) you are free to implement
it.
- All nodes take the Kafka IP Address and the Orchestrator node IP Address as
command-line arguments.
- All Nodes must possess a unique ID that they use to register themselves, and
be capable of generating unique IDs as necessary.
- The system must support the following two types of load tests (see the Go
sketch at the end of this subsection)
- **Tsunami testing**: The user must be able to set a delay interval between
each request, and the system must maintain this gap between each request on
each node
- **Avalanche testing**: All requests are sent as soon as they are ready in
first-come first-serve order
- ~~The user must provide a target throughput per driver node (X requests per
second) and the implementation must respect that.~~ The user must provide a
total number of requests to be run per driver node in the test
configuration. The test stops on each driver node once the responses to all
of these requests have been received.
- If you have already implemented features for a target throughput, that
will be considered as well, and you will get marks for it.
- There is no time bound for when tests can stop. It depends purely on when
all responses are received.
- Load tests stop once the fixed number of requests per driver node has been
run. There is no time bound, as this cannot be controlled by the load-testing
tool; it depends instead on the implementation of the Target Server.
- The system must support observability
- The Orchestrator node must know how many requests each driver node has sent,
updated at a maximum interval of one second
- The Orchestrator node must be able to show a dashboard with aggregated {min,
max, mean, median, mode} response latency across all (driver) nodes, and for
each (driver) node
- Both the driver node and the orchestrator node must have a _metrics store_.
It is up to you to choose how this is implemented, but all your metrics from
testing must reside here.
- The system must be scalable
- It should be possible to run a test with a minimum of three (one
orchestrator, two driver) and a maximum of nine (one orchestrator, eight
driver) nodes.
- It should be possible to change the number of driver nodes between tests.
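As a reference point for the two test types above, here is a minimal sketch in
Go (the suggested implementation language). The `sendRequest` helper, the
function names, and the hard-coded target URL are illustrative assumptions and
not part of the specification; the Tsunami delay corresponds to the
`test_message_delay` value from the test configuration.

```go
// Illustrative dispatch loops for the two test types. sendRequest, the
// function names, and the /ping target are assumptions for this sketch.
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// sendRequest issues one GET to the target's /ping endpoint and returns the latency.
func sendRequest(target string) (time.Duration, error) {
	start := time.Now()
	resp, err := http.Get(target + "/ping")
	if err != nil {
		return 0, err
	}
	resp.Body.Close()
	return time.Since(start), nil
}

// runTsunami maintains a fixed delay between consecutive requests on this node.
func runTsunami(target string, count int, delay time.Duration) []time.Duration {
	latencies := make([]time.Duration, 0, count)
	for i := 0; i < count; i++ {
		if lat, err := sendRequest(target); err == nil {
			latencies = append(latencies, lat)
		}
		time.Sleep(delay) // the user-configured gap between requests
	}
	return latencies
}

// runAvalanche fires every request as soon as it is ready.
func runAvalanche(target string, count int) []time.Duration {
	latencies := make([]time.Duration, count)
	var wg sync.WaitGroup
	for i := 0; i < count; i++ {
		wg.Add(1)
		go func(idx int) {
			defer wg.Done()
			if lat, err := sendRequest(target); err == nil {
				latencies[idx] = lat // each goroutine writes only its own slot
			}
		}(i)
	}
	wg.Wait()
	return latencies
}

func main() {
	fmt.Println(len(runTsunami("http://localhost:8080", 10, 100*time.Millisecond)))
	fmt.Println(len(runAvalanche("http://localhost:8080", 10)))
}
```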
### Code
1. An implementation of an Orchestrator Node
- Exposes a REST API to view and control different tests
- Can trigger a load test
- Can report statistics for a current load test
- Implement a Runtime controller that
- Handles Heartbeats from the Driver Nodes (sent after the Driver Nodes
have been connected and until they are disconnected)
- Is responsible for co-ordinating between driver nodes to trigger load
tests
- Receives metrics from the driver nodes and stores them in the metrics
store
2. An implementation of a Driver node
- Sends requests to the target webserver as indicated by the Orchestrator
node
- Records statistics for {mean, median, min, max} response time
- Sends said results back to the Orchestrator node
- Implements a communication layer that talks to the central Kafka instance,
implemented around a Kafka Client.
3. An implementation of a target HTTP server
- An endpoint to test (`/ping`)
- A metrics endpoint to see the number of requests sent and responses sent
(`/metrics`)
You can find a sample implementation of a target server here, which you can use
for your initial testing. This comes with both the `/ping` and `/metrics`
endpoints baked in -
[https://github.com/anirudhRowjee/bd2023-load-testing-server](https://github.com/anirudhRowjee/bd2023-load-testing-server)
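If you prefer to write your own target server, a minimal Go version with the
same two endpoints could look like the sketch below. Only the `/ping` and
`/metrics` endpoints are specified; the counter names in the `/metrics`
response are illustrative.

```go
// Minimal stand-in for the target server. The JSON field names returned by
// /metrics are an assumption, not a required format.
package main

import (
	"encoding/json"
	"net/http"
	"sync/atomic"
)

var requestsReceived, responsesSent uint64

func pingHandler(w http.ResponseWriter, r *http.Request) {
	atomic.AddUint64(&requestsReceived, 1)
	w.Write([]byte("pong"))
	atomic.AddUint64(&responsesSent, 1)
}

func metricsHandler(w http.ResponseWriter, r *http.Request) {
	json.NewEncoder(w).Encode(map[string]uint64{
		"requests_received": atomic.LoadUint64(&requestsReceived),
		"responses_sent":    atomic.LoadUint64(&responsesSent),
	})
}

func main() {
	http.HandleFunc("/ping", pingHandler)
	http.HandleFunc("/metrics", metricsHandler)
	http.ListenAndServe(":8080", nil)
}
```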
### Implementation Guidelines
- The preferred language of implementation is Golang as it is the industry's
choice for building distributed systems. However, you are also free to use
other languages such as Python, Java, C# and C++ to do the same. Marks will be
awarded based on your implementation, thought process, architecture, and
effort, not on the language.
- The concept of Heartbeat has been covered in the course.
- Feel free to use any available distributed system primitives (_not_ full
applications such as RabbitMQ) to further ease your development process.
- The JSON Message Format has been provided for your reference, and is intended
to help you get started with implementation. You are free to modify it as you
see fit, so long as it does not prevent you from implementing a part of the
project properly.
#### Weekly Guidelines
1. Week 1 Target: Single Orchestrator, Two Drivers with
- Avalanche and Tsunami testing coverage
2. Week 2 Target: Single Orchestrator, Two Drivers
- All features mentioned in the week 1 target
- metrics reporting
3. Week 3 Target: Single Orchestrator with eight drivers
- All features mentioned in week 2 target
- Metrics dashboard and CLI or GUI Control interface
- A maximum of one second latency for dashboard / metrics update from the
orchestrator node
### JSON Message and Topic Descriptor Format
All Nodes must register themselves through the `register` topic
```json
// Register message format
{
    "node_id": "<NODE ID HERE>",
    "node_IP": "<NODE IP ADDRESS (with port) HERE>",
    "message_type": "DRIVER_NODE_REGISTER"
}
```
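As an illustration, a driver node could publish this message roughly as
follows. This sketch assumes the `github.com/segmentio/kafka-go` client and
hard-coded example values; any Kafka client library and your own node-ID
scheme would work equally well.

```go
// Sketch of a driver node publishing its register message.
package main

import (
	"context"
	"encoding/json"

	"github.com/segmentio/kafka-go"
)

// RegisterMessage mirrors the register topic format from the spec.
type RegisterMessage struct {
	NodeID      string `json:"node_id"`
	NodeIP      string `json:"node_IP"`
	MessageType string `json:"message_type"`
}

func register(kafkaAddr, nodeID, nodeIP string) error {
	w := &kafka.Writer{
		Addr:  kafka.TCP(kafkaAddr),
		Topic: "register",
	}
	defer w.Close()

	payload, _ := json.Marshal(RegisterMessage{
		NodeID:      nodeID,
		NodeIP:      nodeIP,
		MessageType: "DRIVER_NODE_REGISTER",
	})
	return w.WriteMessages(context.Background(), kafka.Message{Value: payload})
}

func main() {
	// kafkaAddr would come from the command-line arguments described above.
	_ = register("localhost:9092", "driver-1", "10.0.0.2:9090")
}
```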
The Orchestrator node must publish a message to the `test_config` topic to send
out test configuration to all driver nodes
```json
// testconfig message format
{
    "test_id": "<RANDOMLY GENERATED UNIQUE TEST ID>",
    "test_type": "<AVALANCHE|TSUNAMI>",
    "test_message_delay": "<0 | CUSTOM_DELAY (only applicable in the case of Tsunami testing)>",
    "message_count_per_driver": "<A NUMBER>"
}
```
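On the driver side, this configuration can be mapped onto a struct such as the
hypothetical `TestConfig` below. The format above shows every value as a
string, so the numeric fields are kept as strings here and parsed where they
are used; sending them as JSON numbers instead is a reasonable modification.

```go
// TestConfig mirrors the test_config message format from the spec.
package main

import (
	"encoding/json"
	"fmt"
)

type TestConfig struct {
	TestID                string `json:"test_id"`
	TestType              string `json:"test_type"` // "AVALANCHE" or "TSUNAMI"
	TestMessageDelay      string `json:"test_message_delay"`
	MessageCountPerDriver string `json:"message_count_per_driver"`
}

func main() {
	raw := []byte(`{"test_id":"t-42","test_type":"TSUNAMI","test_message_delay":"100","message_count_per_driver":"1000"}`)
	var cfg TestConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```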
The Orchestrator node must publish a message to the `trigger` topic to begin the
load test, and as soon as the drivers receive this message, they have to begin
the load test
```json
// trigger message format
{
    "test_id": "<RANDOMLY GENERATED UNIQUE TEST ID>",
    "trigger": "YES"
}
```
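A driver might block on the `trigger` topic roughly as sketched below, again
assuming `github.com/segmentio/kafka-go`; `startLoadTest` is a placeholder for
the node's own test runner.

```go
// Sketch of a driver waiting for the trigger message before starting a test.
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/segmentio/kafka-go"
)

type TriggerMessage struct {
	TestID  string `json:"test_id"`
	Trigger string `json:"trigger"`
}

func waitForTrigger(kafkaAddr, nodeID string, startLoadTest func(testID string)) {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{kafkaAddr},
		Topic:   "trigger",
		GroupID: nodeID, // a distinct group per driver, so every driver sees every trigger
	})
	defer r.Close()

	for {
		msg, err := r.ReadMessage(context.Background())
		if err != nil {
			log.Println("trigger read failed:", err)
			return
		}
		var t TriggerMessage
		if err := json.Unmarshal(msg.Value, &t); err == nil && t.Trigger == "YES" {
			startLoadTest(t.TestID) // begin the load test immediately
		}
	}
}

func main() {
	waitForTrigger("localhost:9092", "driver-1", func(testID string) {
		log.Println("starting load test", testID)
	})
}
```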
The driver nodes must publish their aggregate metrics to the `metrics` topic at
a regular interval while the load test is running
```json
// metrics message format
{
    "node_id": "<NODE ID HERE>",
    "test_id": "<TEST ID>",
    "report_id": "<RANDOMLY GENERATED ID FOR EACH METRICS MESSAGE>",
    // latencies in ms
    "metrics": {
        "mean_latency": "",
        "median_latency": "",
        "min_latency": "",
        "max_latency": ""
    }
}
```
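Computing these aggregates on the driver is straightforward; the sketch below
derives {mean, median, min, max} from a slice of recorded latencies (the mode
required by the dashboard can be computed the same way on the orchestrator
side). Whether you serialize the values as strings, as shown in the format
above, or as numbers is up to you.

```go
// Aggregating per-request latencies into the metrics report fields.
package main

import (
	"fmt"
	"sort"
	"time"
)

type Metrics struct {
	MeanLatency   float64 `json:"mean_latency"`
	MedianLatency float64 `json:"median_latency"`
	MinLatency    float64 `json:"min_latency"`
	MaxLatency    float64 `json:"max_latency"`
}

// aggregate converts raw latencies into the report fields, in milliseconds.
func aggregate(latencies []time.Duration) Metrics {
	n := len(latencies)
	if n == 0 {
		return Metrics{}
	}
	ms := make([]float64, n)
	for i, l := range latencies {
		ms[i] = float64(l.Milliseconds())
	}
	sort.Float64s(ms)

	var sum float64
	for _, v := range ms {
		sum += v
	}
	median := ms[n/2]
	if n%2 == 0 {
		median = (ms[n/2-1] + ms[n/2]) / 2
	}
	return Metrics{
		MeanLatency:   sum / float64(n),
		MedianLatency: median,
		MinLatency:    ms[0],
		MaxLatency:    ms[n-1],
	}
}

func main() {
	sample := []time.Duration{12 * time.Millisecond, 8 * time.Millisecond, 20 * time.Millisecond}
	fmt.Printf("%+v\n", aggregate(sample))
}
```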
The driver nodes must also publish their heartbeats to the `heartbeat` topic at
a regular interval while the load test is running
```json
// heartbeat message format
{
    "node_id": "<NODE ID HERE>",
    "heartbeat": "YES",
    "timestamp": "<Heartbeat Timestamp>"
}
```
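A heartbeat publisher can be a single goroutine driven by a ticker, as in the
sketch below (again assuming `github.com/segmentio/kafka-go`). The one-second
interval is only an example; the spec asks for a regular interval.

```go
// Sketch of a driver publishing heartbeats at a fixed interval.
package main

import (
	"context"
	"encoding/json"
	"time"

	"github.com/segmentio/kafka-go"
)

type Heartbeat struct {
	NodeID    string `json:"node_id"`
	Heartbeat string `json:"heartbeat"`
	Timestamp string `json:"timestamp"`
}

func publishHeartbeats(kafkaAddr, nodeID string, stop <-chan struct{}) {
	w := &kafka.Writer{Addr: kafka.TCP(kafkaAddr), Topic: "heartbeat"}
	defer w.Close()

	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			payload, _ := json.Marshal(Heartbeat{
				NodeID:    nodeID,
				Heartbeat: "YES",
				Timestamp: time.Now().Format(time.RFC3339),
			})
			w.WriteMessages(context.Background(), kafka.Message{Value: payload})
		}
	}
}

func main() {
	stop := make(chan struct{})
	go publishHeartbeats("localhost:9092", "driver-1", stop)
	time.Sleep(5 * time.Second) // run for a few heartbeats, then exit
	close(stop)
}
```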