# BGD702: Hadoop Cluster Deployment and PageRank Implementation
> The most complete branch of the project is the `ansible-terraform` branch.
**Student:**
Hamze Ghalebi
MS Big Data
Telecom Paris
**Instructor:**
Prof. Marc Jeanmougin
**Course:**
BGD702
**Institution:**
Telecom Paris
**Project Overview:**
This report covers the deployment of a Hadoop cluster and the implementation of a simplified PageRank algorithm. It highlights the challenges encountered, particularly in network management, access control, and debugging, and discusses the lessons learned and the outcomes of this hands-on experience.
---
### 1. Hadoop Cluster Deployment
1. **Initial Efforts**:
- My initial effort focused on deploying a Hadoop cluster.
- The aim was to split and process data efficiently using Hadoop's distributed storage (HDFS) and MapReduce.
2. **Deployment Approach**:
- Following suggestions, I used Ansible for the Hadoop cluster deployment, drawing on [this guide](https://www.linkedin.com/pulse/configure-hadoop-starting-cluster-services-using-ansible-phatate) and [this one](https://www.techsupportpk.com/2020/03/how-to-create-multi-node-hadoop-cluster-centos-rhel.html).
- Due to connection issues with Telecom Ecole, I opted to create VMs on Azure for testing the project.
- The Hadoop NameNode is set up on a VM with the following IP: `20.52.245.231`.
- Access to the NameNode for review can be gained through the following SSH connection: `ssh -i ~/.ssh/hadoop.pem hamze@20.52.245.231`.
- The private key for access has been provided separately at `keys/hadoop.pem`.
### 2. PageRank Implementation on Hadoop
- **Data File**:
- For the data source, I used the French Wikipedia dump (frwiki, 2023-11-01, multistream XML), which can be downloaded [here](https://dumps.wikimedia.org/frwiki/20231101/frwiki-20231101-pages-articles-multistream.xml.bz2).
- **Simplified PageRank Algorithm**:
- A simplified version of PageRank was implemented: each page is scored by its number of incoming links, with every link carrying equal weight.
- Iterative propagation of scores (the full PageRank recurrence) was not included in this simplified version; a minimal sketch of the scoring idea is shown below.
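For illustration only (this is not the exact code in `src/score_pages.py`, just a minimal sketch under the assumption that the link graph fits in memory), the simplified scoring boils down to counting incoming links:

```python
from collections import Counter

def score_pages(outgoing_links):
    """Simplified PageRank: a page's score is its number of incoming links,
    each link counted with equal weight and no iterative propagation."""
    scores = Counter()
    for targets in outgoing_links.values():
        for target in targets:
            scores[target] += 1
    return scores

# Toy link graph: A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(score_pages(graph).most_common())  # [('C', 2), ('B', 1), ('A', 1)]
```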
- **MapReduce Implementation**:
- Two Mapper scripts were developed:
1. **Link Collection Mapper**: Extracts all internal wiki links (`[[...]]`) using a simple regular expression.
```python
import re

def find_links_in_text_file(file_path):
    link_pattern = re.compile(r'\[\[(.*?)\]\]')  # Regular expression for [[Link]]
    links = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            found_links = link_pattern.findall(line)
            links.extend(found_links)
    return links
```
2. **Counting Mapper**: Counts the occurrences of each page link throughout the dataset.
```python
def count_phrases_and_write_to_file(input_file_path, output_file_path):
    # Count the occurrences of each phrase
    phrase_counts = {}
    with open(input_file_path, 'r', encoding='utf-8') as file:
        for line in file:
            phrase = line.strip()  # Remove any leading/trailing whitespace
            if phrase:
                phrase_counts[phrase] = phrase_counts.get(phrase, 0) + 1
    # Write the unique phrases and their counts to the output file
    with open(output_file_path, 'w', encoding='utf-8') as file:
        for phrase, count in phrase_counts.items():
            file.write(f"{phrase}\t{count}\n")
    return "File created successfully."
```
- The Reducer scripts primarily handled data passing and aggregation of the counts emitted by the mappers; a hedged sketch of the link-stage reducer is given below.
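The link-stage reducer (`reducer_links.py`, referenced by the job script in the next section) is not reproduced in this report. As a hedged sketch only, a simple pass-through reducer for Hadoop Streaming could look like the following, emitting just the link names so that the second job can count their occurrences:

```python
#!/usr/bin/env python
# Hypothetical sketch of a pass-through reducer (not the actual reducer_links.py).
# The mapper emits lines of the form "link\t1"; this reducer forwards only the
# link name, one per line, for the second (counting) MapReduce job.
import sys

def main():
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        link = line.split('\t')[0]  # drop the trailing "1"
        print(link)

if __name__ == "__main__":
    main()
```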
- **Adapting Python Scripts for MapReduce**:
- The Python scripts were adapted to read from `stdin` and write tab-separated key/value pairs to `stdout`, so that they could run under Hadoop Streaming.
### mapper_links.py
```python
#!/usr/bin/env python
import sys
import re

link_pattern = re.compile(r'\[\[(.*?)\]\]')

def main():
    for line in sys.stdin:
        links = link_pattern.findall(line)
        for link in links:
            print(f"{link}\t1")

if __name__ == "__main__":
    main()
```
### mapper_phrases.py
```python
#!/usr/bin/env python
import sys

def main():
    for line in sys.stdin:
        phrase = line.strip()
        if phrase:
            print(f"{phrase}\t1")

if __name__ == "__main__":
    main()
```
### reducer_phrases.py
```python
#!/usr/bin/env python
import sys

def main():
    current_phrase = None
    current_count = 0
    for line in sys.stdin:
        phrase, count = line.strip().split('\t')
        count = int(count)
        if current_phrase == phrase:
            current_count += count
        else:
            if current_phrase:
                print(f"{current_phrase}\t{current_count}")
            current_phrase = phrase
            current_count = count
    # Flush the last phrase
    if current_phrase:
        print(f"{current_phrase}\t{current_count}")

if __name__ == "__main__":
    main()
```
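Note that `reducer_phrases.py` relies on Hadoop Streaming's shuffle phase: the framework sorts the mapper output by key before it reaches the reducer, which is why comparing each incoming key against `current_phrase` is sufficient to aggregate all counts for a given page.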
### 3. Automation and Scripting
- **Shell Scripting for Automation**:
- A shell script, `job2.sh`, was developed to automate running the MapReduce jobs end to end.
```bash
#!/bin/bash
# Hadoop Configuration
HADOOP_STREAMING_JAR=/usr/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar
HDFS_USER_PATH="/data/wikipedia"
LOCAL_DATA_PATH="./data"
# Unique identifier for output directories
TIMESTAMP=$(date +%Y%m%d%H%M%S)
# XML File Paths
LOCAL_XML_FILE="./data/chunk_99.xml"
HDFS_XML_PATH="$HDFS_USER_PATH/xml_data"
# MapReduce Job Scripts
MAPPER_LINKS="./mapper_links.py"
REDUCER_LINKS="./reducer_links.py"
MAPPER_PHRASES="./mapper_phrases.py"
REDUCER_PHRASES="./reducer_phrases.py"
# Output Directories
LINKS_OUTPUT="$HDFS_USER_PATH/links_output_$TIMESTAMP"
PHRASES_OUTPUT="$HDFS_USER_PATH/phrases_output_$TIMESTAMP"
LOCAL_OUTPUT_DIR="./output"
# Step 1: Upload XML to HDFS
echo "Uploading XML file to HDFS..."
hadoop fs -mkdir -p $HDFS_XML_PATH
hadoop fs -put -f $LOCAL_XML_FILE $HDFS_XML_PATH/
# Step 2: Run MapReduce Job for Extracting Links
echo "Running MapReduce Job for Extracting Links..."
hadoop jar $HADOOP_STREAMING_JAR \
-files $MAPPER_LINKS,$REDUCER_LINKS -mapper mapper_links.py \
-reducer reducer_links.py \
-input $HDFS_XML_PATH/* -output $LINKS_OUTPUT
# Step 3: Run MapReduce Job for Counting Phrases
echo "Running MapReduce Job for Counting Phrases..."
hadoop jar $HADOOP_STREAMING_JAR \
-files $MAPPER_PHRASES,$REDUCER_PHRASES -mapper mapper_phrases.py \
-reducer reducer_phrases.py \
-input $LINKS_OUTPUT/* -output $PHRASES_OUTPUT
# Check if MapReduce jobs completed successfully before retrieving output
if hadoop fs -test -d $PHRASES_OUTPUT; then
    echo "Retrieving the output..."
    hadoop fs -get $PHRASES_OUTPUT/* $LOCAL_OUTPUT_DIR
else
    echo "MapReduce job did not produce any output. Skipping retrieval."
fi
echo "Pipeline execution completed successfully."
```
- This script handled tasks such as uploading data to HDFS, running the MapReduce jobs, and retrieving outputs.
The following is sample output from one of my runs (page title and incoming-link count):
```text
Projet:Sport/Évaluation|Sport 417068
États-Unis 269315
France 253828
Paris 228533
P:CS|Correction syntaxique 197693
Projet:Football/Évaluation|Football 153383
France|français 121899
Projet:Cinéma/Évaluation|Cinéma 108224
espèce 99784
États-Unis|américain 97270
Seconde Guerre mondiale 93123
Londres 92919
Italie 90004
Parti socialiste (France)|PS 85198
Allemagne 83505
Projet:Italie/Évaluation|Italie 82451
New York 80522
Canada 79309
ceinture d'astéroïdes|ceinture principale 78705
Famille (biologie)|famille 73212
Espagne 71713
Japon 70874
Projet:France/Évaluation|France 70429
Projet:États-Unis/Évaluation|États-Unis 65415
Angleterre 63898
Royaume-Uni 63502
football 63474
Californie 59721
Projet:Histoire militaire/Évaluation|Histoire militaire 59189
Belgique 58784
Rome 57734
Première Guerre mondiale 57579
Divers droite|DVD 57196
Pologne 56469
Wikipédia:Bot/Requêtes/2023/03#Retirer la puce devant {{Liens}}|Modèles ne fonctionnant pas dans les listes 56368
Institut national de la statistique et des études économiques|Insee 55889
Québec 52289
Le Monde 51290
Suisse 48091
France|française 46333
Australie 46139
2008 45852
Jour julien|JJ 45513
2007 45449
anglais 45230
2006 44788
astéroïde 44569
Union pour un mouvement populaire|UMP 44345
Centre des planètes mineures 44091
Brésil 44084
périhélie 44075
```
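The listing above is ordered by descending count. As an illustrative sketch only (the project's `src/sort.sh` handles this step; its exact commands are not reproduced here), the reducer output in `page<TAB>count` format could be ranked in Python as follows:

```python
#!/usr/bin/env python
# Hypothetical helper: sort "page\tcount" records by descending count.
import sys

def rank(lines):
    pairs = []
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        page, count = line.rsplit('\t', 1)
        pairs.append((page, int(count)))
    return sorted(pairs, key=lambda p: p[1], reverse=True)

if __name__ == "__main__":
    for page, count in rank(sys.stdin):
        print(f"{page}\t{count}")
```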
### 4. Challenges and Resolutions
* **Main Challenges**:
  * **Network Management**: Navigating and managing network configurations within the Hadoop cluster presented significant challenges, particularly in ensuring seamless connectivity and data flow between nodes.
  * **Access Control Adjustments**: Modifying access permissions for different users, especially resolving HDFS permission issues, was critical to maintaining secure and efficient data operations.
  * **Debugging Hadoop Implementations**: Identifying and resolving issues within the Hadoop ecosystem, including YARN resource management and job scheduling, required meticulous debugging and problem-solving skills.
* **Learning Outcomes**:
  * Acquired hands-on experience in Hadoop cluster setup and management.
  * Gained a comprehensive understanding of the MapReduce paradigm and its application.
  * Learned the significance and techniques of automating processes in a distributed computing setup.
  * Developed proficiency in network management, access control modifications, and debugging within a Hadoop framework.
## Conclusion
Embarking on this project has been an invaluable learning expedition into the complexities of Hadoop and the broader landscape of big data processing and distributed computing. While I gained a conceptual understanding of clustering and its significance in handling large-scale data, the more profound learning emerged from the hands-on challenges encountered, especially in the intricacies of infrastructure, networking, and system management.
Navigating through the technical details of setting up and managing a Hadoop cluster brought to light the critical importance of a robust and well-configured network infrastructure in the realm of big data. The process of debugging, adjusting access permissions, and ensuring seamless communication between various components of the Hadoop ecosystem significantly enhanced my technical acumen. This practical experience has not only solidified my understanding of Hadoop's operational mechanics but also provided me with valuable insights into the systematic approach required for effective cluster management.
This project has underscored the fact that while conceptual knowledge provides a foundation, the real depth of understanding is achieved through direct engagement with the system's technical aspects. The hands-on experience with Hadoop's infrastructure and network system management has been particularly enlightening, contributing significantly to my professional growth in the field of big data.
---
# Project Structure Documentation
## Overview
This document outlines the structure of our Hadoop Cluster Deployment project. It provides a detailed description of each directory and its contents, ensuring easy navigation and understanding of the project's components.
### Directory Structure
#### `ansible/`
This directory contains all the Ansible configurations for deploying the Hadoop cluster. Ansible is used for automating the deployment and configuration processes.
#### `tf/`
Here, you'll find all Terraform configurations for deploying the Hadoop cluster. Terraform is used for defining and provisioning the underlying infrastructure.
- `host`: This file contains the IP addresses of the hosts in the cluster.
#### `src/`
This folder holds all the application code necessary for the Hadoop cluster's operation.
- **Files and Subdirectories**:
- `src/linksdetecter.py`: contains the code for the link detection algorithm.
- `src/pipline.py`: prototype pipeline for the link detection workflow.
- `src/score_pages.py`: simplified PageRank scoring of pages based on the detected links.
- `src/sort.sh`: sorts the output of the link detection algorithm.
#### `scr4hadoop/`
This directory contains all the application code specifically refactored for MapReduce operations in the Hadoop cluster.
- **Files and Subdirectories**:
- `scr4hadoop/project2/mapper_links.py`: mapper for the link detection step.
- `scr4hadoop/project2/re`: reducer for the link detection step.
- `scr4hadoop/project2/mapper_phrases.py`: mapper for the phrase (link occurrence) counting step.
- `scr4hadoop/project2/reducer_phrases.py`: reducer that aggregates the phrase counts to score pages according to the simplified PageRank algorithm.
#### `docs/final-report/report.md`
The final project report (this document).
---