# Hadoop Final Project
contributed by <`bauuuu1021`>
## Environment Setup
* References:
* [single node](https://www.edureka.co/blog/install-hadoop-single-node-hadoop-cluster)
* [multi nodes-1](https://www.edureka.co/blog/setting-up-a-multi-node-cluster-in-hadoop-2-x/)
* [multi nodes-2](https://sparkbyexamples.com/hadoop/apache-hadoop-installation/)
* [import/export virtual machine](https://askubuntu.com/questions/588426/how-to-export-and-import-virtualbox-vm-images)
* [boot failed due to inodes corrupted](https://askubuntu.com/questions/651577/dev-sda1-inodes-that-were-part-of-a-corrupted-orphan-linked-list-found)
* ==Don't forget to change the VM's network adapter to **NAT Network** and restart==
### New Multi-Node
* Follow this [tutorial](https://sparkbyexamples.com/hadoop/apache-hadoop-installation/) but skip steps 1-3 and 1-4
* [Master] Append IPs into `/etc/hosts`
* for example:
```
192.168.1.100 master
192.168.1.141 datanode1
192.168.1.113 datanode2
192.168.1.118 datanode3
```
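Once the entries are in place, you can check that each hostname resolves before moving on. A quick sketch (`getent` reads `/etc/hosts` directly; the hostnames are the ones from the example above):

```shell
# Check that every node name resolves; a "NOT in /etc/hosts" line
# means the corresponding entry is missing or mistyped.
for host in master datanode1 datanode2 datanode3; do
  if getent hosts "$host" >/dev/null; then
    echo "$host resolves"
  else
    echo "$host NOT in /etc/hosts"
  fi
done
```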
* [Master] ssh key setup
* generate
```shell
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
```
* copy to other machines
```shell
scp .ssh/authorized_keys datanode1:/home/ubuntu/.ssh/authorized_keys
scp .ssh/authorized_keys datanode2:/home/ubuntu/.ssh/authorized_keys
scp .ssh/authorized_keys datanode3:/home/ubuntu/.ssh/authorized_keys
```
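The three copies can also be generated in a loop. A sketch, assuming the same `ubuntu` user on every datanode; the `echo` keeps it a dry run — remove it to actually copy:

```shell
# Dry run: print the scp command for each datanode; drop 'echo' to execute.
for node in datanode1 datanode2 datanode3; do
  echo scp ~/.ssh/authorized_keys "$node:/home/ubuntu/.ssh/authorized_keys"
done
```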
:::success
For multi-homed machines:
* Append the following to `hdfs-site.xml`
```
<property>
<name>dfs.namenode.rpc-bind-host</name>
<value>0.0.0.0</value>
</property>
```
* [**Don't forget**] Create the `/usr/local/hadoop/hdfs/data` directory and change its owner:
```
$ sudo mkdir -p /usr/local/hadoop/hdfs/data
$ sudo chown ubuntu:ubuntu /usr/local/hadoop/hdfs/data
```
:::
### Ubuntu
```shell
sudo apt-get install ssh
sudo apt-get install rsync
```
### Java
Check whether Java is already installed
```shell
java -version
```
If not, install the `openjdk-8-jdk` package
```
sudo apt install openjdk-8-jdk
```
Find the installation path
```shell
update-alternatives --list java
```
If the output is `/usr/..../jre/bin/java`, then {path of java you found} is `/usr/..../jre`, i.e. the path with the trailing `/bin/java` removed.
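That trimming can be done with shell parameter expansion. A sketch; the example path below is the one used later in this guide — substitute whatever `update-alternatives` printed on your machine:

```shell
# Strip the trailing /bin/java from the path reported by
# update-alternatives to obtain the Java home directory.
JAVA_BIN="/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java"  # example value
JAVA_HOME="${JAVA_BIN%/bin/java}"
echo "$JAVA_HOME"
```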
Add environment variable
```shell
sudo vi ~/.bashrc
```
and append the following lines
```config
export JAVA_HOME={path of java you found}
export PATH=$PATH:$JAVA_HOME/bin
```
### Hadoop
==Earlier Hadoop versions may no longer be available for download; the [latest version](http://apache.stu.edu.tw/hadoop/common/) is recommended.== 3.2.0 was the latest version at the time of writing; substitute a newer release if one has since been published.
>Hadoop 2.6 and earlier support Java 6; Hadoop 2.7 and later support only Java 7; starting with Hadoop 3.0, only Java 8 is supported.
```shell
sudo wget https://archive.apache.org/dist/hadoop/core/hadoop-3.1.3/hadoop-3.1.3.tar.gz
sudo tar -zxvf hadoop-3.1.3.tar.gz
```
* [Download](https://archive.apache.org/dist/hadoop/core/)
Add environment variable
```shell
sudo vi ~/.bashrc
```
and append the following
```shell
export HADOOP_HOME="/home/ubuntu/hadoop"
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"
export PATH=$PATH:$JAVA_HOME/bin
```
Save the file, then reload it:
```
source ~/.bashrc
```
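A quick way to confirm the variables took effect in the current shell (a `<unset>` in the output means `~/.bashrc` was not sourced):

```shell
# Print each variable, falling back to <unset> so the check never errors out.
echo "HADOOP_HOME=${HADOOP_HOME:-<unset>}"
echo "JAVA_HOME=${JAVA_HOME:-<unset>}"
```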
:::warning
When I tried to run `hadoop`, the following error appeared.
```shell
bauuuu1021@x555:~$ hadoop version
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /usr/local/hadoop/logs/fairscheduler-statedump.log (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
...
at org.apache.hadoop.util.VersionInfo.<clinit>(VersionInfo.java:37)
Hadoop 3.2.0
Source code repository https://github.com/apache/hadoop.git -r e97acb3bd8f3befd27418996fa5d4b50bf2e17bf
Compiled by sunilg on 2019-01-08T06:08Z
Compiled with protoc 2.5.0
From source with checksum d3f0795ed0d9dc378e2c785d3668f39
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.0.jar
```
The problem was fixed by creating the missing log directory
```shell
sudo mkdir /usr/local/hadoop/logs/
```
and changing its access mode
```shell
sudo chmod -R 777 /usr/local/hadoop/logs
```
:::
Finally, verify the Java installation:
```shell
bauuuu1021@x555:~$ java -version
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
```
### Hello world(?)
Testing `Standalone` mode (the other modes are `Pseudo-Distributed` and `Fully-Distributed`)
```shell
cd ~
mkdir input
cp $HADOOP_HOME/etc/hadoop/*.xml input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar grep input output 'dfs[a-z.]+'
cat output/*
```
The expected output is
```shell
1 dfsadmin
```
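For intuition about what the example job computes, here is a rough plain-shell approximation (not the MapReduce job itself), run on a one-line sample instead of the real config files:

```shell
# Extract every string matching dfs[a-z.]+ and count distinct matches,
# which is essentially what the example grep job does across the input.
printf '<name>dfs.replication</name>\n<value>1</value>\n' \
  | grep -oE 'dfs[a-z.]+' | sort | uniq -c
```

This should print a count of 1 for `dfs.replication`.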
## Pseudo-Distributed
* modify hadoop-env.sh
```shell
sudo vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
```
find `export JAVA_HOME=...` and replace with
```config
export JAVA_HOME={path to java that you found}
```
* modify the setting of HDFS
* core-site.xml
```shell
sudo vi $HADOOP_HOME/etc/hadoop/core-site.xml
```
replace
```config
<configuration>
</configuration>
```
with
```config
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
```
* hdfs-site.xml
```shell
sudo vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
```
The default replication factor is 3, but pseudo-distributed mode only needs 1:
```config
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
```
* Yarn
* mapred-site.xml
```shell
sudo vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
```
change it to
```config
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
</configuration>
```
* yarn-site.xml
```shell
sudo vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
```
replace
```config
<configuration>
<!-- Site specific YARN configuration properties -->
</configuration>
```
with
```config
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
```
>(This step may be skippable in this mode)
>* Add a `hadoop` user and set up ssh
```shell
sudo useradd hadoop
```
set up the ssh keys
```shell
sudo ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
sudo cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
sudo chmod 0600 ~/.ssh/authorized_keys
```
* Format HDFS
```shell
hdfs namenode -format
```
Start the NameNode and YARN
```shell
cd $HADOOP_HOME/sbin
./start-dfs.sh
./start-yarn.sh
```
:::warning
A problem appeared on the first run
```shell
bauuuu1021@X555:~$ start-dfs.sh
Starting namenodes on [localhost]
localhost: Load key "/home/bauuuu1021/.ssh/id_rsa": Permission denied
localhost: bauuuu1021@localhost: Permission denied (publickey,password).
Starting datanodes
localhost: Load key "/home/bauuuu1021/.ssh/id_rsa": Permission denied
localhost: bauuuu1021@localhost: Permission denied (publickey,password).
Starting secondary namenodes [X555]
X555: Load key "/home/bauuuu1021/.ssh/id_rsa": Permission denied
X555: bauuuu1021@x555: Permission denied (publickey,password).
```
solved by fixing the key's permissions; note that ssh refuses private keys that are readable by others, so use `600` rather than `777` (and `chown` the file back to your user if it was generated with `sudo`)
```shell
chmod 600 /home/bauuuu1021/.ssh/id_rsa
```
:::
turn off namenode and yarn
```shell
cd $HADOOP_HOME/sbin
./stop-dfs.sh
./stop-yarn.sh
```
### HDFS operation
* [reference](https://ithelp.ithome.com.tw/articles/10191018)
Start the daemons first
```shell
cd $HADOOP_HOME/sbin
./start-dfs.sh
./start-yarn.sh
```
Create a testing file
```shell
echo "Hello World" >> test.txt
```
Put the testing file into HDFS
```shell
hadoop fs -put test.txt /
```
Check
```shell
hadoop fs -ls /
```
### Sample project
Before that, we need to make some modifications
* mapred-site.xml
```shell
cd $HADOOP_HOME/etc/hadoop
sudo vi mapred-site.xml
```
insert the following settings
```config
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
```
* bashrc
```shell
sudo vi ~/.bashrc
```
Add the following line **after** `$JAVA_HOME`
```config
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
```
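One caveat worth checking: `tools.jar` ships with the JDK, not the JRE, so if `JAVA_HOME` points at a `.../jre` directory the jar may live one level up. A sketch of the check (the path below is hypothetical):

```shell
# Verify tools.jar actually exists at the configured location; if JAVA_HOME
# ends in /jre, also try the parent JDK directory.
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/jre"   # hypothetical value
if [ -f "$JAVA_HOME/lib/tools.jar" ]; then
  echo "tools.jar: $JAVA_HOME/lib/tools.jar"
elif [ -f "${JAVA_HOME%/jre}/lib/tools.jar" ]; then
  echo "tools.jar: ${JAVA_HOME%/jre}/lib/tools.jar"
else
  echo "tools.jar not found; point HADOOP_CLASSPATH at a JDK"
fi
```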
After the setting above, you may execute the [testing project](https://ithelp.ithome.com.tw/articles/10191235) without problem.
:::info
Don't forget to start the NameNode and YARN first (e.g. after a reboot)
```shell
cd $HADOOP_HOME/sbin
./start-dfs.sh
./start-yarn.sh
```
:::
:::warning
A problem you might hit
```shell
bauuuu1021@X555:/usr/local/hadoop/etc/hadoop$ hadoop fs -ls
ls: Call From X555/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
```
try (note that reformatting the NameNode erases existing HDFS data)
```shell
stop-all.sh
hadoop namenode -format
start-all.sh
```
[reference](https://stackoverflow.com/questions/28661285/hadoop-cluster-setup-java-net-connectexception-connection-refused)
:::
## Multi-node
:::warning
Install and configure on the master first, then copy the configuration to the slave(s)
:::
* [reference](https://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm)
### IP address
* Set up a local network first (a mobile-phone hotspot is **not** recommended)
* Get the IPv4 address of both master and slave: run `ip a` in a terminal and you'll see output like
```shell
...
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 1c:b7:2c:1d:6f:cd brd ff:ff:ff:ff:ff:ff
inet 192.168.0.4/24 brd 192.168.0.255 scope global dynamic noprefixroute enp2s0
valid_lft 7082sec preferred_lft 7082sec
inet6 fe80::5986:6af2:7dcf:7b91/64 scope link noprefixroute
valid_lft forever preferred_lft forever
...
```
`192.168.0.4` on the `inet` line is the IP address
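The address can also be pulled out non-interactively. A sketch that parses the sample line above (on a live machine, feed it the output of `ip -4 addr show <interface>` instead):

```shell
# Extract the IPv4 address from an `ip a` inet line with grep and cut.
LINE='inet 192.168.0.4/24 brd 192.168.0.255 scope global dynamic noprefixroute enp2s0'
echo "$LINE" | grep -oE 'inet [0-9.]+' | cut -d' ' -f2
```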
### Map nodes
`$ sudo vi /etc/hosts`
and add the following mapping
```config
(ip_of_master) master
(ip_of_slave) slave
```
and comment out the `127.0.1.1` line
### **==Configuring Key Based Login==**
```shell
sudo ssh-keygen -t rsa
sudo ssh-copy-id -i ~/.ssh/id_rsa.pub (master_computer_name)@master
sudo ssh-copy-id -i ~/.ssh/id_rsa.pub (slave_computer_name)@slave
# if there is more than one slave, repeat for each of them
chmod 0600 ~/.ssh/authorized_keys
```
Test whether the setup succeeded
```shell
ssh (slave_computer_name)@slave
```
### Configure Hadoop
```shell
cd $HADOOP_HOME/etc/hadoop
```
edit as below
* core-site.xml
```config
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000/</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
```
* hdfs-site.xml
```config
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/tmp/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/tmp/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
```
* mapred-site.xml
```config
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
</property>
</configuration>
```
* hadoop-env.sh
I use the default path for `HADOOP_CONF_DIR`, but the path recommended by the tutorial is shown below for reference.
```shell
export JAVA_HOME=(location that you installed java)
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf
```
### Configure masters/slaves/workers
```shell
cd $HADOOP_HOME/etc/hadoop
```
* masters
```shell
sudo vi masters
```
and add
```config
(master_computer_name)@master
```
* slaves
```shell
sudo vi slaves
```
and add
```config
(slave_computer_name)@slave
```
* workers
```shell
sudo vi workers
```
and add
```config
(master_computer_name)@master
(slave_computer_name)@slave
```
### Copy to slaves
Using [scp](https://linux.die.net/man/1/scp)
```shell
cd $HADOOP_HOME/..
scp -r hadoop (slave_computer_name)@slave:/tmp
cd /tmp
mv hadoop (location to store hadoop)
```
### Start hadoop service
```shell
hdfs namenode -format   # DataNodes have no -format option; they are initialized on first start
start-all.sh
```
* check by `jps`
* master should contain `NodeManager`, `DataNode`, `ResourceManager`, `NameNode`, `Jps`, `SecondaryNameNode`, e.g.
```shell
24081 NodeManager
23346 DataNode
23891 ResourceManager
23162 NameNode
25309 Jps
23631 SecondaryNameNode
```
* slave should contain `NodeManager`, `Jps`, `DataNode`, e.g.
```shell
5552 NodeManager
5678 Jps
5375 DataNode
```
:::warning
If any of these daemons is missing, you **must** fix it before continuing.
:::
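The `jps` checks above can be scripted. A sketch run against the sample master listing; on a live node, capture the real output with `JPS_OUT="$(jps)"` instead:

```shell
# Flag any expected daemon that is absent from the jps output.
JPS_OUT='24081 NodeManager
23346 DataNode
23891 ResourceManager
23162 NameNode
23631 SecondaryNameNode'
for daemon in NameNode DataNode ResourceManager NodeManager SecondaryNameNode; do
  if echo "$JPS_OUT" | grep -qw "$daemon"; then
    echo "$daemon: running"
  else
    echo "$daemon: MISSING"
  fi
done
```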
## Reference
* [Hadoop Environment Setup](https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm)
* [IT 邦幫忙](https://ithelp.ithome.com.tw/articles/10190871)
* [Hadoop multi-node cluster](https://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm)
###### tags : `bauuuu1021`, `Hadoop`