---
tags: BigData-BS-2019, BigData-BS-2020
title: Lab block 3. Hadoop, Distributed Cluster, YARN, MapReduce.
---

# Lab block 3. Hadoop: Distributed Cluster, YARN, MapReduce

:::info
*VM info*:
**username:** vagrant
**password:** vagrant
:::

## How to use this document

This document is a tutorial on how to set up a distributed Hadoop cluster. It contains the steps necessary for the cluster's minimal working configuration in the simplest scenario - launching an example program.

Remember, Hadoop applies the parameters specified in a configuration only when those parameters are actually used. Consequently, parameters that are sufficient to start the resource manager can be insufficient to execute a job. If some parameters are incorrect, the execution of a Hadoop program will be terminated, and the error details will be written to a log file. Consult the log file to identify the cause of an error. There is a troubleshooting section at the end of this document, where you can find solutions for the most frequent problems.

Video for this session: [link](https://www.youtube.com/watch?v=MjvKMkfO-B8&list=PLcv13aCxIfjDY1qv79zgCHRUZuh0_7tA7&index=6)

## Prerequisites

To complete this tutorial, you need to install Vagrant virtualization software. VirtualBox is recommended as Vagrant's hypervisor (no other configuration was tested). To expedite the system setup, download the [virtual machine image](https://edisk.university.innopolis.ru/edisk/Public/Fall%202019/Big%20Data/) (1.2Gb).

Create `Vagrantfile` with the following content

```ruby
# -*- mode: ruby -*-
# vi: set ft=ruby :

BOX_PATH = 'hadoop_image.box'

Vagrant.configure("2") do |config|
  config.vm.define "server-1" do |subconfig|
    subconfig.vm.box = "server-1" #BOX_IMAGE
    subconfig.vm.box_url = BOX_PATH
    subconfig.vm.hostname = "server-1"
    subconfig.vm.network :private_network, ip: "10.0.0.11"
    subconfig.vm.network "forwarded_port", guest: 8088, host: 8088
    subconfig.vm.provider "virtualbox" do |v|
      v.memory = 512
    end
  end

  config.vm.define "server-2" do |subconfig|
    subconfig.vm.box = "server-2" #BOX_IMAGE
    subconfig.vm.box_url = BOX_PATH
    subconfig.vm.hostname = "server-2"
    subconfig.vm.network :private_network, ip: "10.0.0.12"
    subconfig.vm.provider "virtualbox" do |v|
      v.memory = 512
    end
  end

  config.vm.define "server-3" do |subconfig|
    subconfig.vm.box = "server-3" #BOX_IMAGE
    subconfig.vm.box_url = BOX_PATH
    subconfig.vm.hostname = "server-3"
    subconfig.vm.network :private_network, ip: "10.0.0.13"
    subconfig.vm.provider "virtualbox" do |v|
      v.memory = 512
    end
  end
end
```

Place `hadoop_image.box` in the same folder as `Vagrantfile` and start the VMs.

### System requirements

The current configuration creates three virtual machines, each with 512Mb of RAM. You can increase or decrease this amount based on your requirements. Each virtual machine occupies ~4Gb of hard disk space right after its creation. The size of a virtual machine can grow after you start working with it. It is recommended to have at least 20Gb of free disk space.

### Java 8

We are going to use Java 8. The newer versions of Hadoop support more recent versions of Java; however, issues were reported when using them. Errors related to the Java version are not evident and are troublesome to track down.

## Prepare the Cluster

Start the cluster by executing `vagrant up`.

:::warning
WARNING: The current configuration will occupy 15-20Gb of your disk space. If you have less than 4Gb of memory, it is highly recommended to find another machine before proceeding with this tutorial.
:::
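Before moving on, it is worth confirming that all three machines actually booted. A quick check from the host machine (run in the folder that contains the `Vagrantfile`):

```bash
# All three VMs - server-1, server-2, server-3 - should be reported as "running"
vagrant status
```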
Open your system manager and check how much memory is consumed by the VirtualBox services. Notice that it is larger than the amount allocated to the VMs.

Open `Vagrantfile` and scan through the configuration. You have 3 VMs configured: `server-1`, `server-2`, `server-3`. Each one is assigned a dedicated IP address.

Remember that you can log into a VM using `vagrant ssh vmname`, e.g.

```bash
vagrant ssh server-1
vagrant ssh server-2
vagrant ssh server-3
```

### Configuring Host Names

For proper operation, Hadoop prefers the configuration to use domain names rather than IP addresses. In order for our system to resolve domain names, we need a Domain Name Service (DNS). The cheapest way to set up a DNS substitute is [`/etc/hosts`](https://tldp.org/LDP/solrhe/Securing-Optimizing-Linux-RH-Edition-v1.3/chap9sec95.html). In this file, you specify a mapping between domain names and their IPs. Whenever you make a request to a domain, the system first checks `/etc/hosts` and only then tries to resolve the name with other DNS services.

Now, we want to enforce the following mapping

```
10.0.0.11 server-1
10.0.0.12 server-2
10.0.0.13 server-3
```

:::info
NOTE: If you run your cluster elsewhere, you can adopt any other domain name convention. Then adjust all the following configs accordingly.
:::

Set `/etc/hosts` on all nodes

```
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

10.0.0.11 server-1
10.0.0.12 server-2
10.0.0.13 server-3
```

### Distributing Credentials

For the proper functioning of HDFS and YARN, the namenode and the resource manager need ssh access to the datanodes and nodemanagers. In this tutorial, we adopt a simplified architecture of a distributed system, where the same machine plays the roles of both HDFS namenode and YARN resource manager. In production this setting is impractical, but it works for educational purposes.

We are going to use `server-1` for the namenode and the resource manager. This node will need ssh access to all other nodes. For this reason, access keys should be distributed across the nodes.

On `server-1`, generate a key (in case you don't have one yet) with

```bash
ssh-keygen -b 4096
```

and distribute this key to every node in the cluster with

```bash
ssh-copy-id -i $HOME/.ssh/id_rsa.pub vagrant@node-address
```

Copy the key to all nodes, including `server-1`. Check whether you have succeeded with

```bash
ssh server-1
ssh server-2
ssh server-3
```

You should be able to ssh without a password.

### Configure Environment Variables

When starting Hadoop services, the master node needs to start the necessary daemons on remote nodes. This is done through ssh. Besides the ssh access, the master node should be able to execute Hadoop binaries. To help locate those binaries, modify the `PATH` environment variable

```bash
echo "PATH=/home/vagrant/hadoop/bin:/home/vagrant/hadoop/sbin:$PATH" >> ~/.bashrc
```

The command above takes effect only after the next login. Make sure you log out from `server-1` at least once before proceeding with the instructions.

## Configure Hadoop

Every VM contains Hadoop binaries located in `/home/vagrant/hadoop`. Configuration files reside in `~/hadoop/etc/hadoop`.

:::info
We are going to configure `server-1` first and then copy the configuration to the other nodes. Assume further commands are executed on `server-1` unless specified otherwise.
:::
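Before editing any configs, it is worth double-checking the ssh setup and host name resolution from the previous sections, since mistakes there only surface later when the daemons start. A quick sanity check from `server-1`, assuming the key was distributed as described above:

```bash
# BatchMode=yes makes ssh fail instead of prompting for a password, so a
# missing key or a wrong /etc/hosts entry shows up immediately.
for node in server-1 server-2 server-3; do
    ssh -o BatchMode=yes vagrant@$node hostname
done
```

Each iteration should print the corresponding host name without asking for a password.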
### Default Java Environment

Change the variable `JAVA_HOME` in `~/hadoop/etc/hadoop/hadoop-env.sh` to `/usr/lib/jvm/java-folder` (Java is already installed in the VM image) or another path where you have your Java binaries installed.

### HDFS

#### Hadoop Temp

Change the current directory and create a folder for temporary files. This is necessary to ensure data persistence across datanodes (if this folder is not specified, you may experience problems after restarting the VMs).

```bash
cd ~
mkdir hadoop_tmp
```

Configure [`hdfs-site.xml`](https://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml):

```xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/vagrant/hadoop_tmp</value>
    </property>
</configuration>
```

#### Namenode

Configure `core-site.xml` and specify the namenode address

```xml
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://server-1:9000</value>
    </property>
</configuration>
```

#### Configure Workers

Edit the file `workers` (in the same configuration directory)

```
server-1
server-2
server-3
```

#### Distribute Configuration

To copy the configuration to the other nodes we will use the `scp` command.

```bash
for node in another-server-address-1 another-server-address-2; do
    scp conf/path/on/local/machine/* $node:conf/path/on/remote/machine/;
done
```

:::danger
You need to create `hadoop_tmp` directories on all nodes.
:::
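With the layout used in this tutorial, the distribution step might look like the sketch below (the paths are assumptions based on the defaults above); it also creates the `hadoop_tmp` directory on the other nodes:

```bash
# Run on server-1: push the configuration to the other nodes and create
# the temporary directory there (paths assume the defaults used above).
for node in server-2 server-3; do
    scp ~/hadoop/etc/hadoop/* $node:hadoop/etc/hadoop/
    ssh $node mkdir -p /home/vagrant/hadoop_tmp
done
```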
#### Format HDFS

The first time you start up HDFS, it must be formatted. Continue working on `server-1` and format a new distributed filesystem:

```bash
hdfs namenode -format
```

#### Start HDFS

All of the HDFS processes can be started with a utility script. On `server-1` execute:

```bash
start-dfs.sh
```

`server-1` will connect to the rest of the nodes and start the corresponding services.

#### Check Correct HDFS Functioning

```bash
hdfs dfsadmin -report
```

You should be able to run this command on any of the Hadoop cluster nodes and expect the following output

```
>> hdfs dfsadmin -report
Configured Capacity: 93497118720 (87.08 GB)
Present Capacity: 78340636672 (72.96 GB)
DFS Remaining: 78340562944 (72.96 GB)
DFS Used: 73728 (72 KB)
DFS Used%: 0.00%
Replicated Blocks:
	Under replicated blocks: 0
	Blocks with corrupt replicas: 0
	Missing blocks: 0
	Missing blocks (with replication factor 1): 0
	Pending deletion blocks: 0
Erasure Coded Block Groups:
	Low redundancy block groups: 0
	Block groups with corrupt internal blocks: 0
	Missing block groups: 0
	Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 10.0.0.11:9866 (server-1)
Hostname: server-1
Decommission Status : Normal
Configured Capacity: 31165706240 (29.03 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 3447148544 (3.21 GB)
DFS Remaining: 26111803392 (24.32 GB)
DFS Used%: 0.00%
DFS Remaining%: 83.78%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Feb 13 09:38:47 PST 2019
Last Block Report: Wed Feb 13 09:38:29 PST 2019
Num of Blocks: 0

Name: 10.0.0.12:9866 (server-2)
Hostname: server-2
Decommission Status : Normal
Configured Capacity: 31165706240 (29.03 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 3444572160 (3.21 GB)
DFS Remaining: 26114379776 (24.32 GB)
DFS Used%: 0.00%
DFS Remaining%: 83.79%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Feb 13 09:38:46 PST 2019
Last Block Report: Wed Feb 13 09:38:28 PST 2019
Num of Blocks: 0

Name: 10.0.0.13:9866 (server-3)
Hostname: server-3
Decommission Status : Normal
Configured Capacity: 31165706240 (29.03 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 3444572160 (3.21 GB)
DFS Remaining: 26114379776 (24.32 GB)
DFS Used%: 0.00%
DFS Remaining%: 83.79%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Feb 13 09:38:46 PST 2019
Last Block Report: Wed Feb 13 09:38:28 PST 2019
Num of Blocks: 0
```

If the command above fails or the number of nodes is less than three, refer to the troubleshooting section for possible solutions.

#### Working with File System

Download [Alice in Wonderland](https://edisk.university.innopolis.ru/edisk/Public/Fall%202019/Big%20Data/). Copy the text file to one of the nodes (use Vagrant's shared folders).

HDFS resolves relative paths against your user directory, which does not exist yet, so create it first

```bash
hdfs dfs -mkdir /user/
hdfs dfs -mkdir /user/vagrant
```

Then place the text file onto HDFS using `hdfs dfs -put`

```bash
hdfs dfs -put path/on/local/machine path/on/hdfs
```

After you have copied the file, you should be able to see it on HDFS

```bash
hdfs dfs -ls
```

Most Unix filesystem commands are available, e.g.

```bash
hdfs dfs -rm alice.txt
```
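A couple of other familiar commands can be handy for checking what actually ended up on HDFS, for example (assuming `alice.txt` is still present - it will be needed again for the wordcount job later):

```bash
hdfs dfs -du -h alice.txt        # size of the file as stored on HDFS
hdfs dfs -cat alice.txt | head   # print the first lines of the file
```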
### YARN

The YARN resource manager usually runs on a dedicated machine. Since many of us are limited in resources, we place it on the same node as the namenode. The file responsible for the YARN configuration is `yarn-site.xml`. More detailed information about the available configuration options can be found in the [official documentation](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html).

#### Resource Manager

```xml
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>server-1</value>
    </property>
</configuration>
```

As you can see, we are going to run the resource manager on the same machine as the namenode.

#### MapReduce

Configure the MapReduce framework in `mapred-site.xml`

```xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```

#### Start YARN

Distribute the configuration among the nodes and start the YARN daemons

```bash
start-yarn.sh
```

Check the status with

```
>> yarn node -list
INFO client.RMProxy: Connecting to ResourceManager at /10.0.0.11:8032
Total Nodes:3
         Node-Id      Node-State  Node-Http-Address  Number-of-Running-Containers
  server-1:35921         RUNNING      server-1:8042                             0
  server-2:34747         RUNNING      server-2:8042                             0
  server-3:40715         RUNNING      server-3:8042                             0
```

#### Run MapReduce Job

Run the `pi` estimation example that ships with Hadoop

```bash
hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar pi 20 10
```

You can expect to see similar output

```
2019-02-13 10:12:46,877 INFO mapreduce.Job: Job job_1550081099154_0001 completed successfully
2019-02-13 10:12:47,038 INFO mapreduce.Job: Counters: 53
	File System Counters
		FILE: Number of bytes read=2206
		FILE: Number of bytes written=21714789
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=26590
		HDFS: Number of bytes written=215
		HDFS: Number of read operations=405
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
	Job Counters
		Launched map tasks=100
		Launched reduce tasks=1
		Data-local map tasks=100
		Total time spent by all maps in occupied slots (ms)=3272064
		Total time spent by all reduces in occupied slots (ms)=826509
		Total time spent by all map tasks (ms)=1090688
		Total time spent by all reduce tasks (ms)=275503
		Total vcore-milliseconds taken by all map tasks=1090688
		Total vcore-milliseconds taken by all reduce tasks=275503
		Total megabyte-milliseconds taken by all map tasks=837648384
		Total megabyte-milliseconds taken by all reduce tasks=211586304
	Map-Reduce Framework
		Map input records=100
		Map output records=200
		Map output bytes=1800
		Map output materialized bytes=2800
		Input split bytes=14790
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=2800
		Reduce input records=200
		Reduce output records=0
		Spilled Records=400
		Shuffled Maps =100
		Failed Shuffles=0
		Merged Map outputs=100
		GC time elapsed (ms)=28868
		CPU time spent (ms)=131670
		Physical memory (bytes) snapshot=21592756224
		Virtual memory (bytes) snapshot=237003071488
		Total committed heap usage (bytes)=12206575616
		Peak Map Physical memory (bytes)=221532160
		Peak Map Virtual memory (bytes)=2352791552
		Peak Reduce Physical memory (bytes)=128483328
		Peak Reduce Virtual memory (bytes)=2351853568
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=11800
	File Output Format Counters
		Bytes Written=97
Job Finished in 393.5 seconds
Estimated value of Pi is 3.14800000000000000000
```

If you could not get similar output, try to resolve the issue by reading the log files.

#### Second MapReduce Job

Then, run the `wordcount` example

```bash
hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount TextFile OutputDirectory
```

Both `TextFile` and `OutputDirectory` are paths in HDFS. Use `alice.txt` as `TextFile`.

If no error has occurred, the output directory will contain two files

```
hdfs dfs -ls OutputDirectory
Found 2 items
-rw-r--r--   3 vagrant supergroup          0 2018-10-02 08:47 OutputDirectory/_SUCCESS
-rw-r--r--   3 vagrant supergroup     557120 2018-10-02 08:47 OutputDirectory/part-r-00000
```

You can examine the content of the output

```bash
hdfs dfs -tail OutputDirectory/part-r-00000
```
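While a job is running (or after it has finished), you can also look at it through YARN itself. A couple of options, relying on the port forwarding declared in the `Vagrantfile` above:

```bash
# The Vagrantfile forwards port 8088 from server-1 to the host machine,
# so the resource manager web UI is reachable at http://localhost:8088
# in the host's browser. From inside the cluster, applications can also
# be listed on the command line:
yarn application -list                  # currently running applications
yarn application -list -appStates ALL   # including finished ones
```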
# Troubleshooting

## General Troubleshooting with Logs

The default location of the log files is `~/hadoop/logs`. There you can find logs for the namenode, datanode, resourcemanager, and nodemanager, depending on what is running on a particular machine.
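For example, listing the log directory and tailing the relevant file on the failing node is usually enough to find the cause. The exact file names depend on the user, daemon, and host names, so the one below is only an illustration:

```bash
ls ~/hadoop/logs
# Daemon logs are named after the user, the daemon, and the host, e.g.:
tail -n 50 ~/hadoop/logs/hadoop-vagrant-namenode-server-1.log
```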
## Formatting HDFS

When you decide to format the file system, you need to clean the data on both the namenode and the datanodes.

1. Stop the filesystem with `stop-dfs.sh`.
2. Log into each datanode and the namenode and remove the data: `rm -r ~/hadoop_tmp/*`
3. Format the filesystem on the namenode: `hdfs namenode -format`
4. Now you can restart the filesystem: `start-dfs.sh`

## HDFS

### `hdfs dfsadmin` reports less than three nodes

Make sure

- You distributed the `ssh` key
- You added the workers to `etc/hadoop/workers`
- You distributed the configuration
- You formatted HDFS

### Cannot connect to `10.0.0.11:9000`

- The namenode did not start
- If the error message says that the connection is rejected, it is likely that the namenode resolves itself to a different IP than the other nodes use, i.e. the namenode resolves itself to `127.0.0.1`, while the other nodes resolve the namenode to `10.0.0.11`. Such differences are not allowed due to Hadoop's security policy.

## YARN

### YARN nodes do not start

Check the log files for the resourcemanager and for the nodemanagers.

- YARN requires you to use domain names in `yarn-site`, not IPs
- Verify that you copied the configuration
- Verify that your `/etc/hosts` is set up properly

### Jobs fail

While a job executes, you can run `yarn container -list` (it takes the application attempt ID as an argument) to check which worker executes the current container. Then you can check the nodemanager log files on the corresponding VM.

- Not enough physical memory - increase the VM's RAM
- Not enough virtual memory - increase the virtual to physical memory ratio or disable the virtual memory check
- A worker cannot connect to the resource manager - check that `/etc/hosts` is set correctly and the configuration is the same on all machines

## Other known errors

- `Caused by: java.net.UnknownHostException: datanode-2` - incorrect hostnames
- namenode log:
  ```
  2018-12-26 11:28:34,266 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
  org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /private/tmp/hadoop-LTV/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
  	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:376)
  	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:227)
  	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
  	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
  	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
  	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
  	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
  	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
  	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
  	at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
  2018-12-26 11:28:34,269 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /private/tmp/hadoop-LTV/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
  2018-12-26 11:28:34,277 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
  ```
  You need to reformat HDFS or clean `hadoop_tmp`
- datanode log:
  ```
  2018-12-26 11:28:58,118 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: 10.240.16.166:9000. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
  ```
  The namenode is down or the domain name is configured incorrectly

# Self-Check Questions

1. What is a namenode?
2. What is a datanode?
3. What is the replication factor?
4. Why did you need `ssh-copy-id`?
5. How is the Hadoop configuration different between the namenode and the datanodes?
6. How do you check the content of the HDFS file system?
7. How often do you need to run `hdfs namenode -format`?
8. What is YARN? What are the components of YARN?
9. How do you specify the worker nodes for YARN?
10. How do you list active YARN worker nodes?
11. Where does Hadoop store log files?
12. Which of the Hadoop services write log files?
13. What is the purpose of the file `/etc/hosts`?
14. What is virtual memory and why does Hadoop care about it?
15. How do you check the available storage in HDFS?

# Acceptance criteria:

0. Understanding of what you did
1. 3 workers ready to work (```yarn node -list```)
2. 3 datanodes in HDFS (```hdfs dfsadmin -report```)
3. The Alice file in HDFS
4. The Pi count example works
5. Wordcount works
6. **You know where to find log files on each machine**

# References

1. [Creating Vagrant Images](https://scotch.io/tutorials/how-to-create-a-vagrant-base-box-from-an-existing-one)
2. [Configuring Network Names](https://www.cloudera.com/documentation/enterprise/5-15-x/topics/cdh_ig_networknames_configure.html)

*[hypervisor]: A hypervisor or virtual machine monitor (VMM) is computer software, firmware or hardware that creates and runs virtual machines. A computer on which a hypervisor runs one or more virtual machines is called a host machine, and each virtual machine is called a guest machine.