---
tags: BigData-BS-2020
title: Lab block 4. MapReduce with Java.
---

# Lab block 4. MapReduce with Java.

During this lab, you will learn how to compile and package containers for MapReduce. You will need to write a simple Java application and execute it on the cluster that you have created.

:::warning
If your computer struggles to run three VMs simultaneously, constrain yourself to a cluster with 1 or 2 VMs in it. If you use fewer than three nodes, make sure that the replication factor for HDFS is less than 3.
:::

Link for this session: [link](https://www.youtube.com/watch?v=ETDbbNpq8Mg&list=PLcv13aCxIfjDY1qv79zgCHRUZuh0_7tA7&index=7&t=0s)

## Initial Configuration

To run compiled Java applications on the Hadoop cluster, you will need a copy of the Hadoop binaries on your development machine (you can use one of the VMs or your host system to run Hadoop Java applications). If you are using your host system for development, copy Hadoop's configuration from your cluster to your host OS. You also need an appropriate version of the JDK installed. For convenience, add the Hadoop binaries to `PATH`.

:::success
You are recommended to use IntelliJ IDEA. Students have a free premium subscription for one year. An IDE will definitely make your life easier. If an IDE is too cumbersome for your taste, install one of the modern programmer's text editors: [Atom](https://atom.io/) or [Visual Studio Code](https://code.visualstudio.com/) (or another). They provide syntax support for most programming languages through plugins.
:::

Download [Alice in Wonderland](https://edisk.university.innopolis.ru/edisk/Public/Fall%202019/Big%20Data/alice.txt) and place it in your HDFS.

## WordCount Example

### Understanding MapReduce Code

An example of the MapReduce wordcount application is provided below. Read the code and understand what the different parts of the code do. Later you will need to extend the functionality of this MapReduce program.
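Before moving on to the code, double-check that the book from the previous section actually made it into HDFS. Below is a minimal sketch of the upload, assuming `alice.txt` was saved to your current directory and the Hadoop binaries are on your `PATH`:

```bash
# Create your HDFS home directory if it does not exist yet
hdfs dfs -mkdir -p /user/$USER
# Upload the book to your HDFS home directory
hdfs dfs -put alice.txt .
# Verify that the file is there
hdfs dfs -ls
```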
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Create a file named `WordCount.java` and copy the contents from above into it (the file name has to match the public class name).

### Compiling the Application

We are going to use `javac` to compile the sources and `jar` to create a jar container. In order to compile the sources, `javac` should be able to find the necessary MapReduce classes. The classpath for the Hadoop binaries can be retrieved by calling

```bash
hadoop classpath
```

If the instruction above works, you can run the compiler

```bash
javac -cp $(hadoop classpath) WordCount.java
```

The only thing left after a successful compilation is to create a `jar` container

```bash
jar -cvf WordCount.jar WordCount*.class
```

:::success
You can use IntelliJ IDEA to create artifacts (`jar`). Use [this](https://www.jetbrains.com/help/idea/creating-and-running-your-first-java-application.html#create_class) as a reference. Make sure to specify the correct version of Java for your project.
:::

Try executing this application on your cluster. Since the jar was created without a main class in its manifest, the class name has to be passed explicitly:

```bash
hadoop jar WordCount.jar WordCount TextFile OutputDirectory
```

Make sure everything works correctly. Count the occurrences of the word `rabbit`.

### Modify WordCount Code

In this part, you will need to introduce the following changes into the program that you have just compiled:

- transform words to lowercase
- remove all non-letter characters except `-`

A combination of `String.toLowerCase()` and `String.replaceAll()` with a suitable regular expression inside the mapper is one way to achieve this. Count the occurrences of the word `rabbit` again.

### Sort the Output of WordCount using MapReduce

You need to sort the output of WordCount by frequency. To implement this, you could run another MapReduce job where the mapper swaps the word and the frequency count (swap key and value). Since MapReduce sorts intermediate records by key between the map and reduce stages, the output ends up ordered by frequency, and the reducer is... trivial.
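If you want a starting point for the driver of this second job, here is a minimal sketch. The names `Sort` and `FrequencySortReducer` are assumptions chosen to match the `hadoop jar` command further down; the `FrequencySortMap` mapper is given in the next listing, and the reducer is yours to write:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Sort {

  // FrequencySortMap (given below) and your own FrequencySortReducer
  // go here as static nested classes, as in WordCount.

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "frequency sort");
    job.setJarByClass(Sort.class);
    job.setMapperClass(FrequencySortMap.class);
    job.setReducerClass(FrequencySortReducer.class);
    // The key is now the frequency, so the shuffle stage
    // sorts the records by frequency for us.
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```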
Go back and carefully study how [WordCount](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html) works. If you feel confident, implement the mapper yourself. Otherwise, just copy and paste this code. You should implement the reducer yourself.

```java
public static class FrequencySortMap
     extends Mapper<Object, Text, IntWritable, Text> {

  public void map(Object key, Text value, Context context
                  ) throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer stringTokenizer = new StringTokenizer(line);

    int frequency = 0;
    String word = "";
    // XXX: split the line, read the frequency
    // and the word
    context.write(new IntWritable(frequency), new Text(word));
  }
}
```

Run the container

```bash
hadoop jar Sort.jar Sort alice_wordcount/* alice_sorted
```

What are the top 3 nouns in the text?

# Troubleshooting

- **The Java compiler needs the Hadoop libraries to compile WordCount.** The classpath for Hadoop is returned by `hadoop classpath`.
- **Error: missing class.** If you see this error during compilation or execution, try downgrading to Java 8.

# Self-Check Questions

1. What are the main stages of MapReduce?
2. What are the types that can be used to instantiate the `Mapper` and `Reducer` templates?
3. What does `job.setMapperClass` do?
4. What does `job.setCombinerClass` do?
5. What does `job.setReducerClass` do?
6. What does `job.setOutputKeyClass` do?
7. Why does one need to have `job.waitForCompletion` in the code?
8. How does one create a jar container?
9. Can one apply several Map transformations within one MapReduce job?

# Acceptance Criteria

- The code compiles
- The results of the code execution are correct
- You can answer the self-check questions

# Further Reading

Use the [Apache Hadoop MapReduce Tutorial](https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html) to learn more.