# Hadoop Final Project
contributed by <`bauuuu1021`>
## Environment Setup
* References:
* [single node](https://www.edureka.co/blog/install-hadoop-single-node-hadoop-cluster)
* [multi nodes-1](https://www.edureka.co/blog/setting-up-a-multi-node-cluster-in-hadoop-2-x/)
* [multi nodes-2](https://sparkbyexamples.com/hadoop/apache-hadoop-installation/)
* [import/export virtual machine](https://askubuntu.com/questions/588426/how-to-export-and-import-virtualbox-vm-images)
* [boot failed due to inodes corrupted](https://askubuntu.com/questions/651577/dev-sda1-inodes-that-were-part-of-a-corrupted-orphan-linked-list-found)
* ==Don't forget to change the VM's network adapter to **NAT Network** and restart==
### New Multi-Node
* Follow this [tutorial](https://sparkbyexamples.com/hadoop/apache-hadoop-installation/) but skip steps 1-3 and 1-4
* [Master] Append IPs into `/etc/hosts`
* for example:
```
192.168.1.100 master
192.168.1.141 datanode1
192.168.1.113 datanode2
192.168.1.118 datanode3
```
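Once the entries are in place, you can check that each hostname resolves before moving on. A quick sketch (`getent` reads `/etc/hosts` directly; the hostnames are the ones from the example above):

```shell
# Check that every node name resolves; a "NOT in /etc/hosts" line
# means the corresponding entry is missing or mistyped.
for host in master datanode1 datanode2 datanode3; do
  if getent hosts "$host" >/dev/null; then
    echo "$host resolves"
  else
    echo "$host NOT in /etc/hosts"
  fi
done
```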
* [Master] ssh key setup
* generate
```shell
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
```
* copy to other machines
```shell
scp .ssh/authorized_keys datanode1:/home/ubuntu/.ssh/authorized_keys
scp .ssh/authorized_keys datanode2:/home/ubuntu/.ssh/authorized_keys
scp .ssh/authorized_keys datanode3:/home/ubuntu/.ssh/authorized_keys
```
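The three copies can also be generated in a loop. A sketch, assuming the same `ubuntu` user on every datanode; the `echo` keeps it a dry run — remove it to actually copy:

```shell
# Dry run: print the scp command for each datanode; drop 'echo' to execute.
for node in datanode1 datanode2 datanode3; do
  echo scp ~/.ssh/authorized_keys "$node:/home/ubuntu/.ssh/authorized_keys"
done
```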
:::success
For multi-homed machines:
* Append the following to `hdfs-site.xml`
```
<property>
<name>dfs.namenode.rpc-bind-host</name>
<value>0.0.0.0</value>
</property>
```
* [**Don't forget**] Create the `/usr/local/hadoop/hdfs/data` directory and change its owner:
```
$ sudo mkdir -p /usr/local/hadoop/hdfs/data
$ sudo chown ubuntu:ubuntu /usr/local/hadoop/hdfs/data
```
:::
### Ubuntu
```shell
sudo apt-get install ssh
sudo apt-get install rsync
```
### Java
Check whether Java is already installed
```shell
java -version
```
If not, install the `openjdk-8-jdk` package
```
sudo apt install openjdk-8-jdk
```
Find the installation path
```shell
update-alternatives --list java
```
If the output is `/usr/..../jre/bin/java`, then {path of java you found} is `/usr/..../jre`, i.e. the path with the trailing `/bin/java` removed.
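That trimming can be done with shell parameter expansion. A sketch; the example path below is the one used later in this guide — substitute whatever `update-alternatives` printed on your machine:

```shell
# Strip the trailing /bin/java from the path reported by
# update-alternatives to obtain the Java home directory.
JAVA_BIN="/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java"  # example value
JAVA_HOME="${JAVA_BIN%/bin/java}"
echo "$JAVA_HOME"
```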
Add environment variable
```shell
sudo vi ~/.bashrc
```
and append the following lines
```config
export JAVA_HOME={path of java you found}
export PATH=$PATH:$JAVA_HOME/bin
```
### Hadoop
==Earlier Hadoop versions may no longer be available for download; the [latest version](http://apache.stu.edu.tw/hadoop/common/) is recommended.== 3.2.0 was the latest version at the time of writing; substitute a newer release if one has since been published.
>Hadoop 2.6 and earlier support Java 6; Hadoop 2.7 and later support only Java 7; starting with Hadoop 3.0, only Java 8 is supported.
```shell
sudo wget https://archive.apache.org/dist/hadoop/core/hadoop-3.1.3/hadoop-3.1.3.tar.gz
sudo tar -zxvf hadoop-3.1.3.tar.gz
```
* [Download](https://archive.apache.org/dist/hadoop/core/)
Add environment variable
```shell
sudo vi ~/.bashrc
```
and append the following
```shell
export HADOOP_HOME="/home/ubuntu/hadoop"
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"
export PATH=$PATH:$JAVA_HOME/bin
```
Save the file, then reload it:
```
source ~/.bashrc
```
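A quick way to confirm the variables took effect in the current shell (a `<unset>` in the output means `~/.bashrc` was not sourced):

```shell
# Print each variable, falling back to <unset> so the check never errors out.
echo "HADOOP_HOME=${HADOOP_HOME:-<unset>}"
echo "JAVA_HOME=${JAVA_HOME:-<unset>}"
```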
:::warning
When I tried to run `hadoop`, the following error appeared.
```shell
bauuuu1021@x555:~$ hadoop version
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /usr/local/hadoop/logs/fairscheduler-statedump.log (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
...
at org.apache.hadoop.util.VersionInfo.<clinit>(VersionInfo.java:37)
Hadoop 3.2.0
Source code repository https://github.com/apache/hadoop.git -r e97acb3bd8f3befd27418996fa5d4b50bf2e17bf
Compiled by sunilg on 2019-01-08T06:08Z
Compiled with protoc 2.5.0
From source with checksum d3f0795ed0d9dc378e2c785d3668f39
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.0.jar
```
The problem was fixed by creating the missing log directory
```shell
sudo mkdir /usr/local/hadoop/logs/
```
and changing its access mode
```shell
sudo chmod -R 777 /usr/local/hadoop/logs
```
:::
Finally, verify the Java installation:
```shell
bauuuu1021@x555:~$ java -version
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
```
### Hello world(?)
Testing `Standalone` mode (the other modes are `Pseudo-Distributed` and `Fully-Distributed`)
```shell
cd ~
mkdir input
cp $HADOOP_HOME/etc/hadoop/*.xml input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar grep input output 'dfs[a-z.]+'
cat output/*
```
The expected output is
```shell
1 dfsadmin
```
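For intuition about what the example job computes, here is a rough plain-shell approximation (not the MapReduce job itself), run on a one-line sample instead of the real config files:

```shell
# Extract every string matching dfs[a-z.]+ and count distinct matches,
# which is essentially what the example grep job does across the input.
printf '<name>dfs.replication</name>\n<value>1</value>\n' \
  | grep -oE 'dfs[a-z.]+' | sort | uniq -c
```

This should print a count of 1 for `dfs.replication`.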
## Pseudo-Distributed
* modify hadoop-env.sh
```shell
sudo vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
```
find `export JAVA_HOME=...` and replace with
```config
export JAVA_HOME={path to java that you found}
```
* modify the setting of HDFS
* core-site.xml
```shell
sudo vi $HADOOP_HOME/etc/hadoop/core-site.xml
```
replace
```config
<configuration>
</configuration>
```
with
```config
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
```
* hdfs-site.xml
```shell
sudo vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
```
The default replication factor is 3, but pseudo-distributed mode only needs 1:
```config
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
```
* Yarn
* mapred-site.xml
```shell
sudo vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
```
change it to
```config
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
</configuration>
```
* yarn-site.xml
```shell
sudo vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
```
replace
```config
<configuration>
<!-- Site specific YARN configuration properties -->
</configuration>
```
with
```config
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
```
>(This step may be skippable in this mode)
>* Add a `hadoop` user and set up ssh
```shell
sudo useradd hadoop
```
set up the ssh keys
```shell
sudo ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
sudo cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
sudo chmod 0600 ~/.ssh/authorized_keys
```
* Format HDFS
```shell
hdfs namenode -format
```
Start the NameNode and YARN
```shell
cd $HADOOP_HOME/sbin
./start-dfs.sh
./start-yarn.sh
```
:::warning
A problem appeared on the first run
```shell
bauuuu1021@X555:~$ start-dfs.sh
Starting namenodes on [localhost]
localhost: Load key "/home/bauuuu1021/.ssh/id_rsa": Permission denied
localhost: bauuuu1021@localhost: Permission denied (publickey,password).
Starting datanodes
localhost: Load key "/home/bauuuu1021/.ssh/id_rsa": Permission denied
localhost: bauuuu1021@localhost: Permission denied (publickey,password).
Starting secondary namenodes [X555]
X555: Load key "/home/bauuuu1021/.ssh/id_rsa": Permission denied
X555: bauuuu1021@x555: Permission denied (publickey,password).
```
solved by fixing the key's permissions; note that ssh refuses private keys that are readable by others, so use `600` rather than `777` (and `chown` the file back to your user if it was generated with `sudo`)
```shell
chmod 600 /home/bauuuu1021/.ssh/id_rsa
```
:::
turn off namenode and yarn
```shell
cd $HADOOP_HOME/sbin
./stop-dfs.sh
./stop-yarn.sh
```
### HDFS operation
* [reference](https://ithelp.ithome.com.tw/articles/10191018)
Start the daemons first
```shell
cd $HADOOP_HOME/sbin
./start-dfs.sh
./start-yarn.sh
```
Create a testing file
```shell
echo "Hello World" >> test.txt
```
Put the testing file into HDFS
```shell
hadoop fs -put test.txt /
```
Check
```shell
hadoop fs -ls /
```
### Sample project
Before that, we need to make some modifications
* mapred-site.xml
```shell
cd $HADOOP_HOME/etc/hadoop
sudo vi mapred-site.xml
```
insert the following settings
```config
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
```
* bashrc
```shell
sudo vi ~/.bashrc
```
Add the following line **after** `$JAVA_HOME`
```config
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
```
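One caveat worth checking: `tools.jar` ships with the JDK, not the JRE, so if `JAVA_HOME` points at a `.../jre` directory the jar may live one level up. A sketch of the check (the path below is hypothetical):

```shell
# Verify tools.jar actually exists at the configured location; if JAVA_HOME
# ends in /jre, also try the parent JDK directory.
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"   # hypothetical value
if [ -f "$JAVA_HOME/lib/tools.jar" ]; then
  echo "tools.jar: $JAVA_HOME/lib/tools.jar"
elif [ -f "${JAVA_HOME%/jre}/lib/tools.jar" ]; then
  echo "tools.jar: ${JAVA_HOME%/jre}/lib/tools.jar"
else
  echo "tools.jar not found; point HADOOP_CLASSPATH at a JDK"
fi
```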
After the setting above, you may execute the [testing project](https://ithelp.ithome.com.tw/articles/10191235) without problem.
:::info
Don't forget to start the NameNode and YARN first (e.g. after a reboot)
```shell
cd $HADOOP_HOME/sbin
./start-dfs.sh
./start-yarn.sh
```
:::
:::warning
A problem you might hit
```shell
bauuuu1021@X555:/usr/local/hadoop/etc/hadoop$ hadoop fs -ls
ls: Call From X555/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
```
try (note that reformatting the NameNode erases existing HDFS data)
```shell
stop-all.sh
hadoop namenode -format
start-all.sh
```
[reference](https://stackoverflow.com/questions/28661285/hadoop-cluster-setup-java-net-connectexception-connection-refused)
:::
## Multi-node
:::warning
Install and configure on the master first, then copy the configuration to the slave(s)
:::
* [reference](https://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm)
### IP address
* Set up a local network first (a mobile-phone hotspot is **not** recommended)
* Get the IPv4 address of both master and slave: run `ip a` in a terminal and you'll see output like
```shell
...
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 1c:b7:2c:1d:6f:cd brd ff:ff:ff:ff:ff:ff
inet 192.168.0.4/24 brd 192.168.0.255 scope global dynamic noprefixroute enp2s0
valid_lft 7082sec preferred_lft 7082sec
inet6 fe80::5986:6af2:7dcf:7b91/64 scope link noprefixroute
valid_lft forever preferred_lft forever
...
```
`192.168.0.4` on the `inet` line is the IP address
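The address can also be pulled out non-interactively. A sketch that parses the sample line above (on a live machine, feed it the output of `ip -4 addr show <interface>` instead):

```shell
# Extract the IPv4 address from an `ip a` inet line with grep and cut.
LINE='inet 192.168.0.4/24 brd 192.168.0.255 scope global dynamic noprefixroute enp2s0'
echo "$LINE" | grep -oE 'inet [0-9.]+' | cut -d' ' -f2
```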
### Map nodes
`$ sudo vi /etc/hosts`
and add the following mapping
```config
(ip_of_master) master
(ip_of_slave) slave
```
and comment out the `127.0.1.1` line
### **==Configuring Key Based Login==**
```shell
sudo ssh-keygen -t rsa
sudo ssh-copy-id -i ~/.ssh/id_rsa.pub (master_computer_name)@master
sudo ssh-copy-id -i ~/.ssh/id_rsa.pub (slave_computer_name)@slave
# if there is more than one slave, repeat for each of them
chmod 0600 ~/.ssh/authorized_keys
```
Test whether the setup succeeded
```shell
ssh (slave_computer_name)@slave
```
### Configure Hadoop
```shell
cd $HADOOP_HOME/etc/hadoop
```
edit as below
* core-site.xml
```config
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000/</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
```
* hdfs-site.xml
```config
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/tmp/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/tmp/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
```
* mapred-site.xml
```config
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
</property>
</configuration>
```
* hadoop-env.sh
I use the default path for `HADOOP_CONF_DIR`, but the path recommended by the tutorial is shown below for reference.
```shell
export JAVA_HOME=(location that you installed java)
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf
```
### Configure masters/slaves/workers
```shell
cd $HADOOP_HOME/etc/hadoop
```
* masters
```shell
sudo vi masters
```
and add
```config
(master_computer_name)@master
```
* slaves
```shell
sudo vi slaves
```
and add
```config
(slave_computer_name)@slave
```
* workers
```shell
sudo vi workers
```
and add
```config
(master_computer_name)@master
(slave_computer_name)@slave
```
### Copy to slaves
Using [scp](https://linux.die.net/man/1/scp)
```shell
cd $HADOOP_HOME/..
scp -r hadoop (slave_computer_name)@slave:/tmp
cd /tmp
mv hadoop (location to store hadoop)
```
### Start hadoop service
```shell
hdfs namenode -format   # DataNodes have no -format option; they are initialized on first start
start-all.sh
```
* check by `jps`
* master should contain `NodeManager`, `DataNode`, `ResourceManager`, `NameNode`, `Jps`, `SecondaryNameNode`, e.g.
```shell
24081 NodeManager
23346 DataNode
23891 ResourceManager
23162 NameNode
25309 Jps
23631 SecondaryNameNode
```
* slave should contain `NodeManager`, `Jps`, `DataNode`, e.g.
```shell
5552 NodeManager
5678 Jps
5375 DataNode
```
:::warning
If any of these daemons is missing, you **must** fix it before continuing.
:::
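The `jps` checks above can be scripted. A sketch run against the sample master listing; on a live node, capture the real output with `JPS_OUT="$(jps)"` instead:

```shell
# Flag any expected daemon that is absent from the jps output.
JPS_OUT='24081 NodeManager
23346 DataNode
23891 ResourceManager
23162 NameNode
23631 SecondaryNameNode'
for daemon in NameNode DataNode ResourceManager NodeManager SecondaryNameNode; do
  if echo "$JPS_OUT" | grep -qw "$daemon"; then
    echo "$daemon: running"
  else
    echo "$daemon: MISSING"
  fi
done
```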
## Reference
* [Hadoop Environment Setup](https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm)
* [IT 邦幫忙](https://ithelp.ithome.com.tw/articles/10190871)
* [Hadoop multi-node cluster](https://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm)
###### tags : `bauuuu1021`, `Hadoop`