# UE22AM251B Big Data Assignment
## Objectives and Outcomes
The project should showcase your ability to design and implement big data workflows.
## Software/Languages
<!-- * Storage: HBase, Cassandra -->
<!-- * Workflow Manager: Oozie -->
* Data Streaming: Kafka
* Scripting: Python
* Environment: Oracle VirtualBox or VMware
* Job runner: Hadoop
## Overview
In the first half of this assignment, you are required to build a data streaming workflow using Kafka that produces streaming trade data to an HDFS sink. In the second half, you will use the same HDFS file as the source for the two Hadoop jobs outlined in Tasks 1 and 2.
## Submission Guidelines
Submit a **pdf** report containing all code/scripts, a detailed explanation of the workflow, the configurations used to set up and execute the assignment, and screenshots of the same **(make sure your team name is visible in the screenshots: in terminal output, as a comment in code, or as a directory name in the execution path)**.
<b>Teams submitting plagiarised reports will be flagged and marks will be deducted.</b>
Deliverable: your_team_name.pdf
## Task Specifications
<!-- You will be working with an ecommerce dataset in which each row is an event log of user activity on a particular ecommerce website.
A user is identified with a `user_id` and `user_session`, and the type of event recorder is either view, cart or purchase.
"view" implies the user just viewed a particular product, whereas "cart" means he added the product to his cart or shopping basket, and "purchase" implies he purchased it.
A `product` is an item, identified with `product_id`, belonging to a `category_id` and `category_code`, of a particular `brand`, and has a particular `price` as well.
-->
You will be working with Binance's WebSocket API, which returns values representing different aspects of a trade event on the Binance exchange.
The following example retrieves real-time trade updates for a specific trading pair (BTC-USDT in this example) using the Binance WebSocket API.
#### Endpoint
```wss://stream.binance.com:9443/ws/btcusdt@trade```
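For a quick sanity check, here is a minimal sketch of subscribing to this endpoint from Python. It assumes the third-party `websockets` package is installed (`pip install websockets`); the number of events read and the fields printed are illustrative only.
```
# Minimal sketch: subscribe to the btcusdt@trade stream and print a few raw events.
# Assumes the third-party `websockets` package is installed (pip install websockets).
import asyncio
import json

import websockets

STREAM_URL = "wss://stream.binance.com:9443/ws/btcusdt@trade"

async def watch_trades(n_events=5):
    async with websockets.connect(STREAM_URL) as ws:
        for _ in range(n_events):
            event = json.loads(await ws.recv())  # one JSON trade event per message
            print(event["s"], event["p"], event["q"])

if __name__ == "__main__":
    asyncio.run(watch_trades())
```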
#### Schema
```
1. "e": Event type (String)
The type of event, in this case, "trade".
2. "E": Event time (Integer)
The timestamp of the event in milliseconds.
3. "s": Symbol (String)
The trading pair symbol, e.g., "BTCUSDT" for Bitcoin to USDT.
4. "t": Trade ID (Integer)
The unique identifier for the trade event.
5. "p": Price (Float)
The price at which the trade occurred.
6. "q": Quantity (Float)
The quantity of the asset traded.
7. "b": Buyer order ID (Integer)
The unique identifier for the buyer's order.
8. "a": Seller order ID (Integer)
The unique identifier for the seller's order.
9. "T": Trade time (Integer)
The timestamp of the trade in milliseconds.
10. "m": Is the buyer the market maker? (Boolean)
Indicates whether the buyer is the market maker.
11. "M": Ignore (system specific, not relevant to user) (Boolean/Null)
```
#### Request Params
`symbol`
(btcusdt in this example)
#### Sample response
```
{
"e": "trade",
"E": 1712176569334,
"s": "BTCUSDT",
"t": 3530509543,
"p": "65613.39000000",
"q": "0.00543000",
"b": 26232302040,
"a": 26232302223,
"T": 1712176569333,
"m": true,
"M": true
}
```
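Note that `p` and `q` arrive as JSON strings (as in the sample above), so they must be cast to float before any arithmetic. A short illustrative sketch of parsing one trade message and computing its volume:
```
# Sketch: parse one trade event and compute its volume (price * quantity).
# "p" and "q" are string-encoded in the payload, so cast them to float first.
import json

raw = ('{"e": "trade", "E": 1712176569334, "s": "BTCUSDT", "t": 3530509543, '
       '"p": "65613.39000000", "q": "0.00543000", "b": 26232302040, '
       '"a": 26232302223, "T": 1712176569333, "m": true, "M": true}')

event = json.loads(raw)
volume = float(event["p"]) * float(event["q"])
print(event["s"], volume, event["m"])  # -> BTCUSDT ~356.28 True
```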
<!-- 1. Flume/Kafka source and sink (consumer) scripts.
2. Kafka connector script to HDFS.
3. Hadoop mapper & reducer scipts or HIVE DDL and DML scripts to execute tasks 1 and 2.
4. An Oozie workflow XML file that defines the sequence of actions, including Hadoop MapReduce job Hive action execution, and handling of failures.
5. A comprehensive documentation detailing the workflow design, configurations, and scripts, along with instructions for setting up and running the workflow. -->
## Kafka
You will need two Python scripts to stream data from Binance's WebSocket API, process it, and write it to HDFS.
Kafka acts as the message broker here: a producer script publishes messages (trade events from Binance's WebSocket API) to a Kafka topic (btcusdt_trades, for example), and a consumer script consumes those messages from the same topic and writes them to HDFS.
Once this step is complete, execute the tasks below using the same HDFS file as the data source for Hadoop.
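For reference, a minimal sketch of the two scripts is shown below. It assumes the third-party kafka-python, websockets, and hdfs (WebHDFS) packages, a Kafka broker on localhost:9092, and WebHDFS on http://localhost:9870; the topic name, HDFS path, and user are illustrative and should be adapted to your setup.
```
# producer.py - sketch: read trade events from Binance's WebSocket API and
# publish them to a Kafka topic. Assumes kafka-python and websockets are
# installed and a broker is reachable at localhost:9092.
import asyncio
import json

import websockets
from kafka import KafkaProducer

STREAM_URL = "wss://stream.binance.com:9443/ws/btcusdt@trade"
TOPIC = "btcusdt_trades"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

async def stream_trades():
    async with websockets.connect(STREAM_URL) as ws:
        while True:
            event = json.loads(await ws.recv())
            producer.send(TOPIC, value=event)  # publish each trade event to Kafka

if __name__ == "__main__":
    asyncio.run(stream_trades())
```
The consumer reads from the same topic and appends one JSON line per trade to a file in HDFS (here via the WebHDFS client; any HDFS write mechanism is acceptable).
```
# consumer.py - sketch: consume trade events from Kafka and append them to HDFS
# as one JSON line per trade. Assumes kafka-python and the `hdfs` package, with
# WebHDFS reachable at http://localhost:9870 (adjust host/port/user/path).
import json

from hdfs import InsecureClient
from kafka import KafkaConsumer

TOPIC = "btcusdt_trades"
HDFS_PATH = "/user/hadoop/binance/trades.json"  # illustrative target path

client = InsecureClient("http://localhost:9870", user="hadoop")
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Create the target file on the first write, then append one line per message.
file_exists = client.status(HDFS_PATH, strict=False) is not None
for message in consumer:
    line = json.dumps(message.value) + "\n"
    client.write(HDFS_PATH, data=line, encoding="utf-8", append=file_exists)
    file_exists = True
```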
## Hadoop
Get the output for the following queries by running Hadoop jobs. (The implementation is up to you: Hive or MapReduce.)
### Task 1
<!-- Identify the popular brands based on cart and purchase history across user sessions. List out the top 10 brands separated by newline in a file called ```srn_hadoop_task1.txt```. -->
Determine the percentage of trades where the buyer is the market maker for each trading pair.
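One possible approach is a Hadoop Streaming job with a Python mapper and reducer, sketched below under the assumption that the HDFS file contains one JSON trade event per line; script names and output format are illustrative.
```
#!/usr/bin/env python3
# task1_mapper.py - emits: <symbol> \t <1 if the buyer is the market maker else 0>
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        event = json.loads(line)
    except ValueError:
        continue  # skip malformed records
    print(f"{event['s']}\t{1 if event['m'] else 0}")
```
```
#!/usr/bin/env python3
# task1_reducer.py - input arrives sorted by symbol; prints <symbol> \t <% maker trades>
import sys

def emit(symbol, total, maker):
    if symbol is not None and total > 0:
        print(f"{symbol}\t{100.0 * maker / total:.2f}")

current_symbol, total, maker = None, 0, 0
for line in sys.stdin:
    symbol, flag = line.rstrip("\n").split("\t")
    if symbol != current_symbol:
        emit(current_symbol, total, maker)
        current_symbol, total, maker = symbol, 0, 0
    total += 1
    maker += int(flag)
emit(current_symbol, total, maker)
```
These can be submitted with the Hadoop Streaming jar using its -files, -mapper, -reducer, -input, and -output options.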
### Task 2
<!-- Identify user sessions with high cart abandonment. Note that those user sessions that end with a purchase do not result in cart abondonment. List out 10000 such user sessions separated by newline in a file called ```srn_hadoop_task2.txt```. -->
Find the top 5 trading pairs with the highest trade volume.
Here, a trading pair consists of a buyer and a seller (identified by their order IDs, `b` and `a`). Your job is to group identical (buyer, seller) pairs and report the top 5 pairs with the highest cumulative volume (price * quantity).
If you do not find any repeated pairs, return the top 5 trades with the highest individual (non-cumulative) volume.
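A similar Hadoop Streaming sketch works here, keyed on the (buyer order ID, seller order ID) pair; it assumes a single reducer (e.g. -numReduceTasks 1) so the global top 5 can be computed in one place. Script names are illustrative.
```
#!/usr/bin/env python3
# task2_mapper.py - emits: <buyer_id>,<seller_id> \t <price * quantity>
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        event = json.loads(line)
    except ValueError:
        continue  # skip malformed records
    volume = float(event["p"]) * float(event["q"])
    print(f"{event['b']},{event['a']}\t{volume}")
```
```
#!/usr/bin/env python3
# task2_reducer.py - single reducer; input arrives sorted by the buyer,seller key.
# Sums volume per pair and keeps the 5 pairs with the highest cumulative volume.
import heapq
import sys

top5 = []  # min-heap of (cumulative_volume, pair)

def finalize(pair, volume):
    if pair is None:
        return
    heapq.heappush(top5, (volume, pair))
    if len(top5) > 5:
        heapq.heappop(top5)  # drop the smallest so only the top 5 remain

current_pair, cum_volume = None, 0.0
for line in sys.stdin:
    pair, volume = line.rstrip("\n").split("\t")
    if pair != current_pair:
        finalize(current_pair, cum_volume)
        current_pair, cum_volume = pair, 0.0
    cum_volume += float(volume)
finalize(current_pair, cum_volume)

for volume, pair in sorted(top5, reverse=True):
    print(f"{pair}\t{volume}")
```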
<!-- There will be many such same buyer and seller pairings. -->
<!-- ### Task 3
Analyze the distribution of trade timestamps to identify any patterns or anomalies. -->
## Links
1. https://github.com/binance/binance-spot-api-docs/blob/master/web-socket-streams.md#trade-streams
<!-- 1. https://www.youtube.com/watch?v=34HDFiAIcyY
2. https://docs.google.com/document/d/1dGYF5wpAIb1MZ1EgDY86VGbkmVgPq5H3QU22u0LwpF4/edit
3. https://binance-docs.github.io/apidocs/voptions/en/#option-mark-price
-->