# BGD702: Hadoop Cluster Deployment and PageRank Implementation

> The most complete branch of the project is the `ansible-terraform` branch.

**Student:** Hamze Ghalebi, MS Big Data, Telecom Paris
**Instructor:** Prof. Marc Jeanmougin
**Course:** BGD702
**Institution:** Telecom Paris

**Project Overview:** This report covers the deployment of a Hadoop cluster and the implementation of a simplified PageRank algorithm. It highlights the challenges encountered, particularly in network management, access control, and debugging, and discusses the learnings and outcomes of this hands-on experience.

---

### 1. Hadoop Cluster Deployment

1. **Initial Efforts**:
   - My initial effort focused on deploying a Hadoop cluster.
   - I aimed to split and process data efficiently using Hadoop's distributed system.

2. **Deployment Approach**:
   - Following suggestions, I used Ansible for the Hadoop cluster deployment, based on [this guide](https://www.linkedin.com/pulse/configure-hadoop-starting-cluster-services-using-ansible-phatate) and [this one](https://www.techsupportpk.com/2020/03/how-to-create-multi-node-hadoop-cluster-centos-rhel.html).
   - Due to connection issues with the Telecom school network, I opted to create VMs on Azure for testing the project.
   - The Hadoop NameNode runs on a VM with the IP `20.52.245.231`.
   - The NameNode can be accessed for review over SSH: `ssh -i ~/.ssh/hadoop.pem hamze@20.52.245.231`.
   - The private key has been provided separately at `keys/hadoop.pem`.

### 2. PageRank Implementation on Hadoop

- **Data File**:
  - As the data source, I used a French Wikipedia dump, available [here](https://dumps.wikimedia.org/frwiki/20231101/frwiki-20231101-pages-articles-multistream.xml.bz2).
- **Simplified PageRank Algorithm**:
  - A simplified version of PageRank was implemented: pages are scored by their number of incoming links, with every link carrying equal weight (a standalone sketch of this scoring is shown just before the adapted scripts below).
  - Iterative scoring based on link importance was not included in this simplified version.
- **MapReduce Implementation**:
  - Two mapper scripts were developed:

    1. **Link Collection Mapper**: Extracts all internal links using simple pattern detection.

       ```python
       import re

       def find_links_in_text_file(file_path):
           link_pattern = re.compile(r'\[\[(.*?)\]\]')  # Regular expression for [[Link]]
           links = []
           with open(file_path, 'r', encoding='utf-8') as file:
               for line in file:
                   found_links = link_pattern.findall(line)
                   links.extend(found_links)
           return links
       ```

    2. **Counting Mapper**: Counts the occurrences of each page link throughout the dataset.

       ```python
       def count_phrases_and_write_to_file(input_file_path, output_file_path):
           # Count the occurrences of each phrase
           phrase_counts = {}
           with open(input_file_path, 'r', encoding='utf-8') as file:
               for line in file:
                   phrase = line.strip()  # Remove any leading/trailing whitespace
                   if phrase:
                       phrase_counts[phrase] = phrase_counts.get(phrase, 0) + 1

           # Write the unique phrases and their counts to the output file
           with open(output_file_path, 'w', encoding='utf-8') as file:
               for phrase, count in phrase_counts.items():
                   file.write(f"{phrase}\t{count}\n")

           return "File created successfully."
       ```

  - The reducer script primarily handled data passing and aggregation.
- **Adapting Python Scripts for MapReduce**:
  - The Python scripts were adapted to fit into the MapReduce framework, as shown below.
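Before the adapted scripts themselves, the following is a minimal, non-distributed sketch of the simplified scoring described above. It is illustrative only and not necessarily identical to `src/score_pages.py` in the repository; it simply treats every `[[Link]]` occurrence as one equally weighted incoming link.

```python
# Illustrative sketch of the simplified scoring (not the exact code of
# src/score_pages.py): every [[Link]] occurrence is one equally weighted
# incoming link, and a page's score is its number of incoming links.
import re
from collections import Counter

LINK_PATTERN = re.compile(r'\[\[(.*?)\]\]')

def score_pages(wikitext: str) -> Counter:
    """Map each linked page to the number of links pointing to it."""
    return Counter(LINK_PATTERN.findall(wikitext))

if __name__ == "__main__":
    sample = "Voir [[Paris]], [[France]] et encore [[Paris]]."
    for page, score in score_pages(sample).most_common():
        print(f"{page}\t{score}")
```

The MapReduce version splits this same logic in two: a mapper that emits one `link\t1` pair per occurrence, and a reducer that sums the counts per link, as shown in the adapted scripts below.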
### mapper_links.py

```python
#!/usr/bin/env python
import sys
import re

link_pattern = re.compile(r'\[\[(.*?)\]\]')

def main():
    for line in sys.stdin:
        links = link_pattern.findall(line)
        for link in links:
            print(f"{link}\t1")

if __name__ == "__main__":
    main()
```

### mapper_phrases.py

```python
#!/usr/bin/env python
import sys

def main():
    for line in sys.stdin:
        phrase = line.strip()
        if phrase:
            print(f"{phrase}\t1")

if __name__ == "__main__":
    main()
```

### reducer_phrases.py

```python
#!/usr/bin/env python
import sys

def main():
    current_phrase = None
    current_count = 0
    for line in sys.stdin:
        phrase, count = line.strip().split('\t')
        count = int(count)
        if current_phrase == phrase:
            current_count += count
        else:
            if current_phrase:
                print(f"{current_phrase}\t{current_count}")
            current_phrase = phrase
            current_count = count
    # Emit the last group once the input is exhausted
    if current_phrase:
        print(f"{current_phrase}\t{current_count}")

if __name__ == "__main__":
    main()
```

### 3. Automation and Scripting

- **Shell Scripting for Automation**: A shell script (`job2.sh`) was developed to automate the process of running the MapReduce jobs.

```bash
#!/bin/bash

# Hadoop Configuration
HADOOP_STREAMING_JAR=/usr/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar
HDFS_USER_PATH="/data/wikipedia"
LOCAL_DATA_PATH="./data"

# Unique identifier for output directories
TIMESTAMP=$(date +%Y%m%d%H%M%S)

# XML File Paths
LOCAL_XML_FILE="./data/chunk_99.xml"
HDFS_XML_PATH="$HDFS_USER_PATH/xml_data"

# MapReduce Job Scripts
MAPPER_LINKS="./mapper_links.py"
REDUCER_LINKS="./reducer_links.py"
MAPPER_PHRASES="./mapper_phrases.py"
REDUCER_PHRASES="./reducer_phrases.py"

# Output Directories
LINKS_OUTPUT="$HDFS_USER_PATH/links_output_$TIMESTAMP"
PHRASES_OUTPUT="$HDFS_USER_PATH/phrases_output_$TIMESTAMP"
LOCAL_OUTPUT_DIR="./output"

# Step 1: Upload XML to HDFS
echo "Uploading XML file to HDFS..."
hadoop fs -mkdir -p $HDFS_XML_PATH
hadoop fs -put -f $LOCAL_XML_FILE $HDFS_XML_PATH/

# Step 2: Run MapReduce Job for Extracting Links
echo "Running MapReduce Job for Extracting Links..."
hadoop jar $HADOOP_STREAMING_JAR \
    -files $MAPPER_LINKS,$REDUCER_LINKS \
    -mapper mapper_links.py \
    -reducer reducer_links.py \
    -input $HDFS_XML_PATH/* \
    -output $LINKS_OUTPUT

# Step 3: Run MapReduce Job for Counting Phrases
echo "Running MapReduce Job for Counting Phrases..."
hadoop jar $HADOOP_STREAMING_JAR \
    -files $MAPPER_PHRASES,$REDUCER_PHRASES \
    -mapper mapper_phrases.py \
    -reducer reducer_phrases.py \
    -input $LINKS_OUTPUT/* \
    -output $PHRASES_OUTPUT

# Check if MapReduce jobs completed successfully before retrieving output
if hadoop fs -test -d $PHRASES_OUTPUT; then
    echo "Retrieving the output..."
    hadoop fs -get $PHRASES_OUTPUT/* $LOCAL_OUTPUT_DIR
else
    echo "MapReduce job did not produce any output. Skipping retrieval."
fi

echo "Pipeline execution completed successfully."
```

- This script handled tasks such as uploading data to HDFS, running the MapReduce jobs, and retrieving outputs.
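Debugging streaming jobs directly on the cluster is slow, so it can help to check a mapper/reducer pair locally by emulating Hadoop Streaming's `mapper | sort | reducer` contract. The harness below is a hypothetical helper, not part of the submitted project, and it assumes the scripts above sit in the current directory.

```python
#!/usr/bin/env python
"""Hypothetical local smoke test (not part of the submitted project):
emulates Hadoop Streaming's `mapper | sort | reducer` contract so the
scripts above can be checked without a running cluster."""
import subprocess
import sys

def run_streaming_stage(input_text, mapper, reducer):
    # Mapper stage: reads lines on stdin, emits tab-separated key/value pairs.
    mapped = subprocess.run(
        [sys.executable, mapper], input=input_text,
        capture_output=True, text=True, check=True,
    ).stdout
    # Hadoop sorts the mapper output by key before handing it to the reducer.
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    # Reducer stage: aggregates consecutive lines that share the same key.
    return subprocess.run(
        [sys.executable, reducer], input=shuffled,
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    # Pretend these lines are link names produced by the first (link-extraction) job.
    sample_links = "Paris\nFrance\nParis\n"
    result = run_streaming_stage(sample_links, "mapper_phrases.py", "reducer_phrases.py")
    print(result, end="")  # Expected: "France\t1" and "Paris\t2"
```

Keeping the mappers and reducers restricted to stdin/stdout is what makes it possible to run the same scripts unchanged both locally and through the Hadoop streaming jar.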
![SCR-20231119-rqfi](https://hackmd.io/_uploads/Hya8U4O46.png)
![SCR-20231119-rpxq](https://hackmd.io/_uploads/HyaIU4_Va.png)

The following is an excerpt of the output from one of my runs:

```text
Projet:Sport/Évaluation|Sport 417068
États-Unis 269315
France 253828
Paris 228533
P:CS|Correction syntaxique 197693
Projet:Football/Évaluation|Football 153383
France|français 121899
Projet:Cinéma/Évaluation|Cinéma 108224
espèce 99784
États-Unis|américain 97270
Seconde Guerre mondiale 93123
Londres 92919
Italie 90004
Parti socialiste (France)|PS 85198
Allemagne 83505
Projet:Italie/Évaluation|Italie 82451
New York 80522
Canada 79309
ceinture d'astéroïdes|ceinture principale 78705
Famille (biologie)|famille 73212
Espagne 71713
Japon 70874
Projet:France/Évaluation|France 70429
Projet:États-Unis/Évaluation|États-Unis 65415
Angleterre 63898
Royaume-Uni 63502
football 63474
Californie 59721
Projet:Histoire militaire/Évaluation|Histoire militaire 59189
Belgique 58784
Rome 57734
Première Guerre mondiale 57579
Divers droite|DVD 57196
Pologne 56469
Wikipédia:Bot/Requêtes/2023/03#Retirer la puce devant {{Liens}}|Modèles ne fonctionnant pas dans les listes 56368
Institut national de la statistique et des études économiques|Insee 55889
Québec 52289
Le Monde 51290
Suisse 48091
France|française 46333
Australie 46139
2008 45852
Jour julien|JJ 45513
2007 45449
anglais 45230
2006 44788
astéroïde 44569
Union pour un mouvement populaire|UMP 44345
Centre des planètes mineures 44091
Brésil 44084
périhélie 44075
```

### 4. Challenges and Resolutions

* **Main Challenges**:
  * **Network Management**: Navigating and managing network configurations within the Hadoop cluster presented significant challenges, particularly in ensuring seamless connectivity and data flow between nodes.
  * **Access Control Adjustments**: Modifying access permissions for different users, especially resolving HDFS permission issues, was critical to maintaining secure and efficient data operations.
  * **Debugging Hadoop Implementations**: Identifying and resolving issues within the Hadoop ecosystem, including YARN resource management and job scheduling, required meticulous debugging and problem-solving.
* **Learning Outcomes**:
  * Acquired hands-on experience in Hadoop cluster setup and management.
  * Gained a comprehensive understanding of the MapReduce paradigm and its application.
  * Learned the significance and techniques of automating processes in a distributed computing setup.
  * Developed proficiency in network management, access control modifications, and debugging within a Hadoop framework.

## Conclusion

This project has been an invaluable learning experience in the complexities of Hadoop and the broader landscape of big data processing and distributed computing. While I gained a conceptual understanding of clustering and its significance in handling large-scale data, the deeper learning came from the hands-on challenges encountered, especially in the intricacies of infrastructure, networking, and system management.

Working through the technical details of setting up and managing a Hadoop cluster brought to light the critical importance of a robust and well-configured network infrastructure in big data work. The process of debugging, adjusting access permissions, and ensuring seamless communication between the various components of the Hadoop ecosystem significantly enhanced my technical skills.
This practical experience not only solidified my understanding of Hadoop's operational mechanics but also provided valuable insight into the systematic approach required for effective cluster management.

The project has underscored the fact that while conceptual knowledge provides a foundation, real depth of understanding is achieved through direct engagement with the system's technical aspects. The hands-on experience with Hadoop's infrastructure and network management has been particularly enlightening, contributing significantly to my professional growth in the field of big data.

---

# Project Structure Documentation

## Overview

This document outlines the structure of the Hadoop cluster deployment project. It describes each directory and its contents, for easier navigation and understanding of the project's components.

### Directory Structure

#### `ansible/`

This directory contains all the Ansible configurations for deploying the Hadoop cluster. Ansible is used to automate the deployment and configuration processes.

#### `tf/`

This directory contains all the Terraform configurations for deploying the Hadoop cluster. Terraform is used for defining and provisioning the underlying infrastructure.

- `host`: This file contains the IP addresses of the hosts in the cluster.

#### `src/`

This folder holds all the application code necessary for the Hadoop cluster's operation.

- **Files and Subdirectories**:
  - `src/linksdetecter.py`: the link detection algorithm.
  - `src/pipline.py`: prototype pipeline for the link detection algorithm.
  - `src/score_pages.py`: simplified PageRank scoring applied to the detected links.
  - `src/sort.sh`: sorts the output of the link detection algorithm.

#### `scr4hadoop/`

This directory contains the application code refactored specifically for MapReduce operations on the Hadoop cluster.

- **Files and Subdirectories**:
  - `scr4hadoop/project2/mapper_links.py`: mapper for the link detection algorithm.
  - `scr4hadoop/project2/re`: reducer for the link detection algorithm.
  - `scr4hadoop/project2/mapper_phrases.py`: mapper for the phrase counting step.
  - `scr4hadoop/project2/reducer_phrases.py`: reducer that calculates page scores based on the simplified PageRank algorithm.

#### `docs/final-report/report.md`

---