# Task 2 - Kafka

## **Story Background**

You are a developer for an online coding platform that hosts various DSA problems and competitions for students, developers, and anyone preparing for interviews. The platform generates a continuous stream of events, and you are tasked with building a producer-consumer pipeline with Kafka to support the various operations and computations the platform needs.

## **Dataset Format and Explanation**

The rows in the dataset are of 3 types, based on the event that they describe:

1. `problem`: Describes the event when a user has made a submission for a specific problem, with additional fields. The schema for this format is:

    :::info
    problem user_id problem_id category difficulty submission_id status language runtime
    :::

    An example row of this format:

    ```
    problem u_1 p_1 Arrays Medium s_001 Passed Python 300
    ```

2. `competition`: Describes the event when a user has made a submission for a problem as part of a competition, with additional fields. The schema for this format is:

    :::info
    competition comp_id user_id comp_problem_id category difficulty comp_submission_id status language runtime time_taken
    :::

    An example row of this format:

    ```
    competition c_1 u_1 cp_1 Trees Hard cs_1 Failed C 50 22
    ```

3. `solution`: Describes the event when a user has made their correct/passed submission public for other users to see. The schema for this format is:

    :::info
    solution user_id problem_id submission_id upvotes
    :::

    An example row of this format:

    ```
    solution u_1 p_1 s_1 230
    ```

### Additional Details about the dataset

1. The rows will be shuffled, as they form a stream of events.
2. The unit of measurement for `runtime` is milliseconds (ms), and for `time_taken` it is minutes.
3. The problems and solutions in the `competition` format are independent from those in the `problem` format.
4. If a submission is present in the `solution` format, it will also be present in the `problem` format with status `Passed`.
5. In the `competition` format, the same `comp_problem_id` can be used for multiple competitions.

## **Problem Statement**

Using the dataset provided to you, generate output files for 3 different clients based on their requirements.

**Client 1:** Wants to know the most frequently used programming language and the most difficult category to solve.

**Client 2:** Wants to calculate every user's elo rating on the platform.

**Client 3:** Wants to know which user is contributing to the community the most (highest total upvotes).
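Since all three event types arrive interleaved on one stream, it helps to map each row to its schema before doing anything else. Below is a minimal parsing sketch, assuming fields are whitespace-separated as in the examples above; the `SCHEMAS` table and the `parse_event` helper are illustrative names, not part of the assignment spec:

```python
# Field names for each event type, matching the schemas above.
SCHEMAS = {
    "problem": ["user_id", "problem_id", "category", "difficulty",
                "submission_id", "status", "language", "runtime"],
    "competition": ["comp_id", "user_id", "comp_problem_id", "category",
                    "difficulty", "comp_submission_id", "status",
                    "language", "runtime", "time_taken"],
    "solution": ["user_id", "problem_id", "submission_id", "upvotes"],
}

def parse_event(line):
    """Split one whitespace-separated row into (event_type, field dict)."""
    event_type, *values = line.strip().split()
    return event_type, dict(zip(SCHEMAS[event_type], values))

print(parse_event("problem u_1 p_1 Arrays Medium s_001 Passed Python 300"))
# ('problem', {'user_id': 'u_1', 'problem_id': 'p_1', 'category': 'Arrays',
#  'difficulty': 'Medium', 'submission_id': 's_001', 'status': 'Passed',
#  'language': 'Python', 'runtime': '300'})
```

Note that all values come back as strings; numeric fields such as `runtime` would still need an explicit `int()` conversion wherever they are used in arithmetic.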
## Description

Sample dataset:

```
competition c_0003 u_012 cp_00004 Stacks&Queues Hard cs_00003 Failed Rust 1994 46
problem u_019 p_00006 DP Hard s_00005 Passed C 1408
problem u_016 p_00008 Trees Hard s_00003 Failed Java 1443
problem u_009 p_00015 Graphs Hard s_00004 Passed Go 120
solution u_019 p_00006 s_00005 511
problem u_017 p_00009 Heaps Hard s_00002 Failed Go 821
solution u_009 p_00015 s_00004 180
competition c_0003 u_018 cp_00008 Heaps Easy cs_00005 Passed Go 411 7
competition c_0001 u_005 cp_00004 Stacks&Queues Hard cs_00004 Passed Ruby 756 25
competition c_0002 u_002 cp_00010 Greedy Hard cs_00001 TLE Python 2440 32
competition c_0002 u_003 cp_00010 Greedy Hard cs_00002 TLE Python 2076 18
problem u_020 p_00016 Stacks&Queues Easy s_00001 TLE C++ 2805
EOF
```

### Client 1

Needs information on 2 things:

* The most frequently used programming language, i.e., the language with the most submissions made.
* The most difficult category to solve, i.e., the category with the smallest ratio of `Passed/Total` submissions.

Sample output and format after running `kafka-consumer1.py`, for client 1. Print the `json.dumps` of the dictionary in the following format:

```
{
    "most_used_language": [
        "Go"
    ],
    "most_difficult_category": [
        "Heaps",
        "Stacks&Queues",
        "Trees"
    ]
}
```

#### Notes for client 1:

1. The client is only interested in the submissions made in the `problem` format.
2. If there is a tie for either part of the question, then return all the results in a list sorted lexicographically (similar to the above output, where 3 categories are tied for `most_difficult_category`).

### Client 2

Needs each user's elo rating according to the coding platform's scoring system.

Sample output and format after running `kafka-consumer2.py`, for client 2. Print the `json.dumps` of the dictionary in the following format:

```
{
    "u_002": 1210,
    "u_003": 1211,
    "u_005": 1245,
    "u_009": 1315,
    "u_012": 1195,
    "u_016": 1197,
    "u_017": 1202,
    "u_018": 1233,
    "u_019": 1239,
    "u_020": 1205
}
```

#### Notes for client 2:

1. The result is sorted by `user_id` lexicographically.
2. `Round down` the final user elo rating to the nearest whole number.
3. Both `competition` and `problem` submissions are considered for the elo rating of a user.
4. Formula for the user elo rating, to be computed for every submission that is encountered (a sketch of this update appears after the client descriptions below):

    ```
    New_Elo = Current_Elo + Submission_Points
    Submission_Points = K * (Status_Score * Difficulty_Score) + Runtime_bonus
    K = 32 (constant scaling factor)
    Status_Score = 1 if Passed, 0.2 if TLE, -0.3 if Failed
    Difficulty_Score = 1 if Hard, 0.7 if Medium, 0.3 if Easy
    Runtime_bonus = 10000/runtime
    Initial elo rating for all users = 1200
    ```

### Client 3

Needs information on:

* The user with the most contribution to the community, i.e., the user with the most total upvotes.

Sample output and format after running `kafka-consumer3.py`, for client 3. Print the `json.dumps` of the dictionary in the following format:

```
{
    "best_contributor": [
        "u_019"
    ]
}
```

#### Notes for client 3:

1. If the `best_contributor` is tied on total upvotes, then list all the `user_id`s sorted lexicographically.
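As referenced in the Client 2 notes, here is a minimal sketch of the per-submission elo update, directly transcribing the formula above; the helper name and the running `elo` dict are illustrative, not a required structure:

```python
import math

K = 32  # constant scaling factor from the formula above
STATUS_SCORE = {"Passed": 1, "TLE": 0.2, "Failed": -0.3}
DIFFICULTY_SCORE = {"Hard": 1, "Medium": 0.7, "Easy": 0.3}

elo = {}  # user_id -> running rating; everyone starts at 1200

def apply_submission(user_id, status, difficulty, runtime):
    """Update a user's rating for one problem or competition submission."""
    points = K * (STATUS_SCORE[status] * DIFFICULTY_SCORE[difficulty]) + 10000 / runtime
    elo[user_id] = elo.get(user_id, 1200) + points

# Example: u_002's only submission in the sample dataset
apply_submission("u_002", "TLE", "Hard", 2440)

# After all events are consumed, round down and sort by user_id:
result = {u: math.floor(r) for u, r in sorted(elo.items())}
print(result)  # {'u_002': 1210}, matching the sample output
```

Checking against the sample: `32 * (0.2 * 1) + 10000/2440 ≈ 6.4 + 4.098 = 10.498`, so u_002 ends at `1200 + 10.498`, which rounds down to 1210 as shown in the expected output.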
## **Tips to solve the Assignment**

The logic for the producer and consumers is up to you. The input to the producer file will be streamed through standard input. You are required to use the `kafka-python` library to solve the problem statement.

You should have four files: one for the producer and three for the three different consumers. The producer should be named `kafka-producer.py` and the consumers should be named `kafka-consumer1.py`, `kafka-consumer2.py`, and `kafka-consumer3.py`.

It is recommended that you use three topics to solve the assignment. All three topic names will be passed as command line arguments to all four files, and you may make use of them as required. There is no constraint on how you use these 3 topics.

To test your code, run the consumer files first in three separate terminals, and then the producer file in a fourth separate terminal:

```
./kafka-consumer1.py topicName1 topicName2 topicName3 > output1.json
./kafka-consumer2.py topicName1 topicName2 topicName3 > output2.json
./kafka-consumer3.py topicName1 topicName2 topicName3 > output3.json
cat sample_dataset.txt | ./kafka-producer.py topicName1 topicName2 topicName3
```

## Important

1. The topic names **should not be hardcoded**. The three topic names will be passed as command line arguments to both the producer and consumer files.
2. There is a special line at the end of the input file containing `EOF`. This lets the consumers know when to stop reading from the topic, so you can gracefully stop the producer and the 3 consumers. You must not include this line in the output.
3. Usage of direct file interaction commands such as Python's `open()` **is NOT allowed**. You must use Kafka's producer and consumer APIs to solve the problem statement.
4. Only the **`kafka`**, **`sys`**, and **`json`** modules are allowed.
5. Print the `result` dictionary using the following command only: **`print(json.dumps(result, indent = 4))`**

## Dataset 🔗

You can find the datasets and expected outputs [here](https://drive.google.com/drive/folders/1F__LAtM4TznhWpHyy-Mv2_9OGduqSOju?usp=sharing).
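For illustration, here is a minimal `kafka-producer.py` sketch that fits the constraints above. The broker address `localhost:9092` and the one-topic-per-event-type routing are assumptions for this sketch, not requirements of the assignment:

```python
#!/usr/bin/env python3
import sys
from kafka import KafkaProducer

# The three topic names come from the command line; nothing is hardcoded.
topic1, topic2, topic3 = sys.argv[1], sys.argv[2], sys.argv[3]
# One possible routing scheme: one topic per event type.
topic_for = {"problem": topic1, "competition": topic2, "solution": topic3}

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",     # assumed local broker
    value_serializer=lambda v: v.encode(),  # send each row as plain UTF-8
)

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    if line == "EOF":
        # Forward the sentinel to every topic so each consumer can stop.
        for t in (topic1, topic2, topic3):
            producer.send(t, line)
        break
    producer.send(topic_for[line.split()[0]], line)

producer.flush()  # make sure everything is delivered before exiting
```

A consumer would mirror this: read the topic names from `sys.argv`, subscribe with `KafkaConsumer`, aggregate messages until it sees the `EOF` sentinel, then emit its result with `print(json.dumps(result, indent = 4))`.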