---
tags: BigData-BS-2019
title: Lab Block 2. Hadoop Setting up a Single Node Hadoop Cluster
---

# Lab Block 2. Hadoop Setting up a Single Node Hadoop Cluster

This document describes how to set up and configure a single-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).

You can work under any OS you like, but it is recommended to use a Unix-based operating system or a bash shell under Windows (e.g. [Git Bash](https://www.atlassian.com/git/tutorials/git-bash), [Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10)).

The goal of this lab is for you to get some DevOps experience. Not all instructions here are correct, and sometimes you will be required to read logs, search for answers, or install new software. Pay close attention to the versions of your applications and keep in mind that a version mismatch can be an issue.

This lab consists of two parts. In the first part, you are going to configure Hadoop using Vagrant. In the second, you are going to repeat the same configuration procedure, but using Docker.

Video for this session: [link](https://www.youtube.com/watch?v=iHwGFh19YqQ&list=PLcv13aCxIfjDY1qv79zgCHRUZuh0_7tA7&index=5&t=1s)

:::info
If you would like to access the web interfaces for HDFS and YARN, configure [port forwarding](https://www.vagrantup.com/docs/networking/forwarded_ports) in `Vagrantfile`. You need to expose ports `9870` (namenode), `8088` (resource manager), and `8042` (node manager).
:::

## Download Vagrant Image with Hadoop

You can either download Hadoop from the [official website](https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/), or use our [vagrant image](https://edisk.university.innopolis.ru/edisk/Public/Fall%202019/Big%20Data/) with Hadoop already unpacked into the home directory.

:::info
*VM info*:
**username:** vagrant
**password:** vagrant
:::

In the rest of the tutorial it is assumed that you are working from inside the unpacked Hadoop directory. For our image this would be `~/hadoop`. Remember how to use `vagrant init` and `vagrant up`.
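If you have not used Vagrant for a while, a minimal workflow looks roughly like the sketch below. The box name `hadoop-lab` and the path to the downloaded `.box` file are placeholders, not something provided by the lab; port forwarding is added afterwards by editing the generated `Vagrantfile` (see the link in the info box above).

```bash
# register the downloaded box under a local name (placeholder name and path)
vagrant box add hadoop-lab /path/to/downloaded.box

# generate a Vagrantfile that uses this box, then boot and enter the VM
vagrant init hadoop-lab
vagrant up
vagrant ssh

# shut the VM down when you are done
vagrant halt
```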
## Configure Java Environment

In the distribution, edit the file `etc/hadoop/hadoop-env.sh` to define some parameters as follows:

```bash
export JAVA_HOME=/usr/java/latest
```

Try the following command:

```bash
$ bin/hadoop
```

This will display the usage documentation for the hadoop script.

Congratulations! Hadoop is configured! Just kidding...

## Hadoop Standalone Mode

By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.

The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory. (Attention: bash code)

```bash
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar grep input output 'dfs[a-z.]+'
cat output/*
```

Let's try to understand what this snippet is doing line by line:

1. `mkdir` creates a new directory.
2. `cp` is a utility to copy files. The format is `cp source destination`.
3. `bin/hadoop` is an executable. The next two arguments on the line are the arguments for this executable. First we tell it that we are providing a `jar` container, and specify its path. The rest of the arguments are the arguments for the executable inside the `jar` container. `hadoop-mapreduce-examples-3.0.3.jar` has a lot of examples inside, so we specify the one we want to execute: `grep`. The example named `grep` accepts three arguments: the input directory, the output directory, and the [RegExp](https://en.wikipedia.org/wiki/Regular_expression) pattern. Read more about this example job [here](https://cwiki.apache.org/confluence/display/HADOOP2/Grep).
4. `cat` is the utility that concatenates all the files provided as arguments.

:::info
A nice thing about Hadoop is that it can accept both files and folders as input. In the example above, if you specify a file as input, it will read this file. If you specify a directory, it will read all the files inside this directory, but not its subdirectories.
:::

Standalone mode is quite straightforward. Clean up the mess:

```bash
rm -r input
rm -r output
```

## Pseudo-Distributed Mode

Whereas standalone mode worked as a single process, now we are going to run Hadoop daemons on a single machine.

### Hadoop Configs

There are two primary configuration files to begin with. Start editing them.

Edit `etc/hadoop/core-site.xml`:

```xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
```

and `etc/hadoop/hdfs-site.xml`:

```xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
```

The first config specifies the location of the namenode, and the second the replication factor for HDFS, i.e. the number of times the same data is copied in HDFS for fail safety.

### Setup passphraseless ssh

Now check that you can ssh to the localhost without a passphrase:

```bash
$ ssh localhost
```

If you cannot ssh to localhost without a passphrase, execute the following commands:

```bash
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
```
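As a quick sanity check (not part of the original instructions), you can confirm that key-based login works without prompting for a password. `BatchMode=yes` tells ssh to fail immediately instead of falling back to a password prompt:

```bash
# should print the hostname without asking for a password
ssh -o BatchMode=yes localhost hostname
```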
### Execution

The following instructions are to run a MapReduce job locally.

1. Format the filesystem:

   ```bash
   $ /home/vagrant/hadoop/bin/hdfs namenode -format
   ```

2. Start the NameNode daemon and the DataNode daemon:

   ```bash
   $ sbin/start-dfs.sh
   ```

   The hadoop daemon log output is written to the `$HADOOP_LOG_DIR` directory (defaults to `$HADOOP_HOME/logs`).

3. Browse the web interface for the NameNode; by default it is available at:

   NameNode - http://localhost:9870/

Make the HDFS directories required to execute MapReduce jobs:

```bash
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
```

Copy the input files into the distributed filesystem:

```bash
$ bin/hdfs dfs -mkdir input
$ bin/hdfs dfs -put etc/hadoop/*.xml input
```

Check that the files are in place with `bin/hdfs dfs -ls /path`.

Run some of the examples provided:

```bash
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-checkYourVersion.jar grep input output 'dfs[a-z.]+'
```

Examine the output files. Either copy the output files from the distributed filesystem to the local filesystem and examine them:

```bash
$ bin/hdfs dfs -get output output
$ cat output/*
```

or view the output files on the distributed filesystem:

```bash
$ bin/hdfs dfs -cat output/*
```

When you’re done, stop the daemons with:

```bash
$ sbin/stop-dfs.sh
```

## YARN on a Single Node

You can run a MapReduce job on YARN in pseudo-distributed mode by setting a few parameters and additionally running the ResourceManager daemon and the NodeManager daemon.

The following instructions assume that you have completed the previous steps of this tutorial. Configure parameters as follows:

### Additional Configs

Edit `etc/hadoop/mapred-site.xml`:

```xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
```

Edit `etc/hadoop/yarn-site.xml`:

```xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
```

Start the ResourceManager daemon and the NodeManager daemon:

```bash
$ sbin/start-yarn.sh
```

Browse the web interface for the ResourceManager; by default it is available at:

ResourceManager - http://localhost:8088/

### Run a MapReduce job

Check whether you can run your previous hadoop job (the `grep` example).

When you’re done, stop the daemons with:

```bash
$ sbin/stop-yarn.sh
```

:::info
If something goes wrong, check the log files in `hadoop/logs`. Logfiles for the namenode and datanode are related to HDFS. If you receive error messages about inconsistent states, reboot the VM and re-format the namenode.
:::

### Troubleshooting Problems

This execution has likely failed. The standard way to figure out the root of the issue is to go through the log files (in case the problem is not evident from the standard output). For the current setup, logs can be found in `~/hadoop/logs`. You need to look through the logs of two processes: `resourcemanager` and `nodemanager`. If an issue is present, it will likely appear at the end of the log file. You can look at the last 100 lines of a file using `tail`:

```bash
tail -100 logfile.log
```

The error message you are supposed to find is:

```
Container ... is running ... beyond the 'VIRTUAL' memory limit. Current usage: 71.2 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.
```
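If you prefer not to scroll through the whole log, you can search for the message directly. This is just a convenience sketch; it assumes the default log naming scheme, in which the NodeManager log file name contains `nodemanager`:

```bash
# print matching lines (with line numbers) from the NodeManager log
grep -n "beyond the" ~/hadoop/logs/*nodemanager*.log
```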
The temporary solution that we will use is to disable the virtual memory check. Add the following configuration to `yarn-site.xml` and restart YARN (run `stop-yarn.sh` followed by `start-yarn.sh`):

```xml
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits will be enforced for containers</description>
</property>
```

Now you can run the job again and verify that everything works properly.

## Running Everything Using Docker

So far, we have used a virtual machine to run Hadoop services. Now it is time to do the same using Docker and see how the configuration process differs. The process will be a little less straightforward because first we need to create a Docker image with Hadoop inside.

### Creating Base Image

In the case of Vagrant, you had a pre-packaged image. Now you are going to create your own image for Docker. First, you need to download the [Hadoop binaries](https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz). Create a directory where the base image's configuration will reside, extract the Hadoop binaries into this directory, and create a `Dockerfile`:

```bash
mkdir hadoop-base
cd hadoop-base
tar -xvf hadoop.tar.gz
touch Dockerfile
```

Now, populate `Dockerfile` with content:

```Docker
# use standard Ubuntu image as the base
FROM ubuntu

# set environment variable with your Hadoop version
ENV HADOOP_VERSION 3.3.0

# Hadoop installation will reside in /hadoop
WORKDIR /hadoop
COPY hadoop-$HADOOP_VERSION /hadoop

# ubuntu image is a bare minimum. install necessary packages
RUN apt-get update && apt-get install openjdk-8-jre ssh -y

# set environment variables that might be useful in the future
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64
ENV HADOOP_HOME /hadoop
ENV PATH /hadoop/sbin:/hadoop/bin:$PATH

# Hadoop frowns upon running its services as the root user.
# Let's convince it that this is necessary by setting environment variables
ENV HDFS_NAMENODE_USER=root
ENV HDFS_DATANODE_USER=root
ENV HDFS_SECONDARYNAMENODE_USER=root
ENV YARN_RESOURCEMANAGER_USER=root
ENV YARN_NODEMANAGER_USER=root

# we do not have CMD or ENTRYPOINT set in this image
```

After the `Dockerfile` is ready, we can build the image:

```bash
docker build --tag=hadoop-base .
```

Run `docker image ls` and check the size of the image.
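As an optional sanity check (not part of the original instructions), you can start a throwaway container from the freshly built image and ask Hadoop for its version; `--rm` removes the container as soon as the command exits:

```bash
# should print the Hadoop version baked into the image (3.3.0)
docker run --rm hadoop-base hadoop version
```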
### Standalone Mode with Docker

Create a separate directory for configuring standalone mode:

```bash
mkdir standalone
cd standalone
touch run.sh
touch Dockerfile
```

The file `run.sh` will contain the already familiar commands to run the job in standalone mode:

```bash
mkdir input
cp etc/hadoop/*.xml input
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar grep input output 'dfs[a-z.]+'
cat output/*
```

The `Dockerfile` will be based on the image that we have just created:

```Docker
FROM hadoop-base:latest

WORKDIR /hadoop
COPY run.sh /hadoop

# make this script executable. otherwise, you will see errors
RUN chmod +x run.sh

# the default command executed when you do docker run
CMD /hadoop/run.sh
```

Now you can build a new image for standalone mode and run it:

```bash
docker build --tag=hadoop-standalone .
docker run hadoop-standalone
```

### Pseudo-Distributed Mode with Docker

Create a separate directory for configuring pseudo-distributed mode:

```bash
mkdir pseudo-distributed
cd pseudo-distributed
touch run.sh
touch Dockerfile
```

Copy the configuration files to this directory:

```bash
cp -r path/to/hadoop/etc/hadoop pseudo-distributed/hadoop-conf
```

You need the same configurations that you created when working with the Vagrant virtual machine. Make sure the following files are configured: `hadoop-env.sh`, `core-site.xml`, `hdfs-site.xml`, `mapred-site.xml`, `yarn-site.xml`.

First, configure `Dockerfile`:

```Docker
FROM hadoop-base:latest

WORKDIR /hadoop

# we set an environment variable pointing to the folder with configuration files for convenience
ENV HADOOP_CONF_DIR=/hadoop/etc/hadoop

# Setup passphraseless ssh
RUN ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
RUN cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
RUN chmod 0600 ~/.ssh/authorized_keys

# every new container will come with pre-formatted HDFS
RUN hdfs namenode -format

# copy configuration files
COPY hadoop-conf $HADOOP_CONF_DIR

# copy the run script and make it executable
COPY run.sh /hadoop
RUN chmod +x run.sh

# we will need to bind these ports when instantiating a new container to make web interfaces available
# read https://docs.docker.com/engine/reference/builder/#expose for more info
# namenode
EXPOSE 9870
# resource manager
EXPOSE 8088
# node manager
EXPOSE 8042

# set default run script
CMD /hadoop/run.sh
```

Now, `run.sh` should have similar content:

```bash
# ssh service is not running; we installed it with apt-get when creating hadoop-base
/etc/init.d/ssh start

start-dfs.sh
start-yarn.sh

# we need to keep the container busy to keep it alive
while true; do sleep 1000; done
```

You can run the container and look through the standard output to make sure you have configured the container correctly. Note that we need to explicitly forward ports to make them available on the host:

```bash
docker run -p 9870:9870 -p 8088:8088 -p 8042:8042 hadoop-pseudodistributed
```

Make sure that the output does not contain any error messages. In case there are some issues, check the container id with `docker container ls` and stop the container with `docker container stop <id>`.

If everything is configured correctly, you should be able to access the web interfaces for [HDFS](http://localhost:9870) and [YARN](http://localhost:8088).

:::warning
There are two ways to run the commands below. The first is using [`docker container exec`](https://docs.docker.com/engine/reference/commandline/exec/). The alternative is to launch a shell session with `docker container exec -it <id> bash` and run all of the commands directly from inside the container.
:::

Now that everything is configured, we can prepare the files for our test job:

```bash
docker container exec <id> hdfs dfs -mkdir /user/
docker container exec <id> hdfs dfs -mkdir /user/root
docker container exec <id> hdfs dfs -mkdir input
docker container exec <id> sh -c 'hdfs dfs -put /hadoop/etc/hadoop/*.xml input'
```

Make sure you understand these commands, especially the last one (why does it look different from the rest?).
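Before launching the job, it does not hurt to verify that the XML files actually ended up in HDFS (an optional check; `<id>` is your container id, as above):

```bash
# list the contents of the input directory in HDFS
docker container exec <id> hdfs dfs -ls input
```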
Run the job and check the output:

```bash
docker container exec <id> hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar grep input output 'dfs[a-z.]+'
docker container exec <id> sh -c 'hdfs dfs -cat output/*'
```

The output should be similar to what you have seen before:

```
1	dfsadmin
1	dfs.replication
```

:::info
In case the execution of this last job fails or takes too long to complete, try freeing as much RAM as possible and running the container again.
:::

# Self-check Questions

The goal for you is to understand how the software we use works, not to memorize the sequence of commands. The main question you should ask yourself is why.

1. If you have difficulties understanding *bash*, make sure to go over [this refresher](https://learnxinyminutes.com/docs/bash/).
2. What [version of Java](https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions) is required for your version of Hadoop?
3. What is the [example job](https://cwiki.apache.org/confluence/display/HADOOP2/Grep) that you run throughout this tutorial supposed to do?
4. What is the purpose of the `PATH` environment variable?
5. Where is the output of the test job stored?
6. What is the DFS replication factor?
7. What is passphraseless ssh, and why do you need it in this lab?
8. How does one upload files to HDFS?
9. Where does Hadoop store log files by default?
10. How does one verify that the job finished successfully?
11. What is stored in the file `~/.ssh/id_rsa`?
12. What is stored in the file `~/.ssh/authorized_keys`?
13. How are these two files used for passphraseless ssh?
14. What is the default user in Docker's ubuntu image?
15. How does one specify an environment variable in a `Dockerfile`?
16. What is the purpose of the command `chmod +x run.sh`?
17. What is the purpose of the command `/etc/init.d/ssh start`?
18. When creating a Dockerfile, what is the difference between `ENTRYPOINT` and `CMD`?
19. How does one execute a bash command in a docker container?
20. Assume you created a docker container and executed several commands. These commands created new files in the container's file system. How can you access these files after the container has stopped?
21. What is the difference between `docker run` and `docker container exec`?
22. Why do we call it pseudo-distributed mode, and not simply distributed?

# Acceptance criteria:

1. Pseudo-distributed mode in the Vagrant VM is working.
2. Pseudo-distributed mode in the Docker container is working.
3. You can explain what you did to make them work.