---
tags: BigData-BS-2020
title: Lab block 4. MapReduce with Java.
---
# Lab block 4. MapReduce with Java.
During this lab, you will learn how to compile and package JAR archives for MapReduce. You will write a simple Java application and execute it on the cluster that you created earlier.
:::warning
If your computer struggles to run three VMs simultaneously, limit yourself to a cluster of one or two VMs. If you use fewer than three nodes, make sure that the HDFS replication factor is less than 3.
:::
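For reference, the replication factor is controlled by the `dfs.replication` property in `hdfs-site.xml`. A minimal sketch for a two-node cluster (set the value to match your number of DataNodes):
```xml
<!-- hdfs-site.xml: keep replication <= the number of DataNodes -->
<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>
```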
Link for this session: [link](https://www.youtube.com/watch?v=ETDbbNpq8Mg&list=PLcv13aCxIfjDY1qv79zgCHRUZuh0_7tA7&index=7&t=0s)
## Initial Configuration
To run compiled Java applications on the Hadoop cluster, you will need a copy of the Hadoop binaries on your development machine (you can use one of the VMs or your host system to run Hadoop Java applications).
If you are using your host system for development, copy Hadoop's configuration from your cluster to your host OS. You also need an appropriate version of the JDK installed. For convenience, add the Hadoop binaries to your `PATH`.
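A minimal sketch of this setup, assuming Hadoop is unpacked at `~/hadoop` on your development machine and one of the cluster VMs is reachable as `namenode` (both are placeholders for your actual paths and host names):
```bash
# Copy the cluster's configuration to the local Hadoop installation
scp -r namenode:~/hadoop/etc/hadoop/* ~/hadoop/etc/hadoop/
# Make the hadoop/hdfs/yarn commands available in this shell
export PATH="$PATH:$HOME/hadoop/bin:$HOME/hadoop/sbin"
```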
:::success
You are recommended to use IntelliJ IDEA. Students can get a free premium subscription for one year, and an IDE will definitely make your life easier. If an IDE is too cumbersome for your taste, install one of the modern programmer's text editors, such as [Atom](https://atom.io/) or [Visual Studio Code](https://code.visualstudio.com/). They provide syntax support for most programming languages through plugins.
:::
Download [Alice in Wonderland](https://edisk.university.innopolis.ru/edisk/Public/Fall%202019/Big%20Data/alice.txt) and place it in your HDFS.
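For example (assuming the file was downloaded to your current directory; the HDFS directory name `alice` is just a suggestion):
```bash
hdfs dfs -mkdir -p alice        # create a directory in your HDFS home
hdfs dfs -put alice.txt alice/  # upload the book
hdfs dfs -ls alice              # verify the upload
```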
## WordCount Example
### Understanding MapReduce Code
An example MapReduce WordCount application is provided below. Read the code and make sure you understand what its different parts do. Later, you will extend the functionality of this program.
```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {

    // Mapper: splits each input line into tokens and emits (word, 1)
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums up all the counts emitted for the same word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The combiner pre-aggregates map output locally to cut shuffle traffic
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input file or directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
Create a file named `WordCount.java` and copy the contents above into it.
### Compiling Application
We are going to use `javac` to compile the sources and `jar` to package them into a JAR archive.
In order to compile the sources, `javac` must be able to find the necessary MapReduce classes. The classpath for the Hadoop binaries can be retrieved by calling
```bash
hadoop classpath
```
If the command above works, you can run the compiler
```bash
javac -cp $(hadoop classpath) WordCount.java
```
The only thing left after a successful compilation is to package the class files into a JAR archive. Note that the nested classes compile into separate files (`WordCount$TokenizerMapper.class` and `WordCount$IntSumReducer.class`), hence the wildcard:
```bash
jar -cvf WordCount.jar WordCount*.class
```
:::success
You can use IntelliJ IDEA to create artifacts (`jar`). Use [this](https://www.jetbrains.com/help/idea/creating-and-running-your-first-java-application.html#create_class) as a reference. Make sure to specify the correct Java version for your project.
:::
Try executing this application on your cluster. Since the JAR was created without a `Main-Class` manifest entry, pass the class name explicitly:
```bash
hadoop jar WordCount.jar WordCount TextFile OutputDirectory
```
Make sure everything works correctly, then count the occurrences of the word `rabbit`.
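One way to check (a sketch; `part-r-*` matches the default names of the reducer output files in the output directory):
```bash
hdfs dfs -cat OutputDirectory/part-r-* | grep -w 'rabbit'
```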
### Modify WordCount Code
In this part, you will need to introduce the following changes into the program that you have just compiled:
- transform all words to lowercase
- remove all non-letter characters except `-`
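A minimal sketch of one way to do this, as a drop-in replacement for `TokenizerMapper.map` (assuming ASCII text, so `[^a-z-]` matches everything that is not a lowercase letter or a hyphen):
```java
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        // Lowercase the token, then drop every character that is
        // neither a letter nor a hyphen
        String token = itr.nextToken().toLowerCase().replaceAll("[^a-z-]", "");
        if (!token.isEmpty()) { // punctuation-only tokens become empty
            word.set(token);
            context.write(word, one);
        }
    }
}
```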
Count the occurrences of the word `rabbit` again.
### Sort the Output of WordCount using MapReduce
You need to sort the output of WordCount by frequency. To implement this, you can run another MapReduce job in which the mapper swaps the word and its frequency count (i.e., swaps key and value); since the shuffle phase sorts records by key, the reducer is trivial. Go back and carefully study how [WordCount](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html) works. If you feel confident, implement the mapper yourself; otherwise, copy the code below. You should implement the reducer yourself.
```java
public static class FrequencySortMap extends Mapper<Object, Text, IntWritable, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line is a WordCount output record: "word<TAB>frequency"
        String line = value.toString();
        StringTokenizer stringTokenizer = new StringTokenizer(line);
        String word = stringTokenizer.nextToken();
        int frequency = Integer.parseInt(stringTokenizer.nextToken());
        // Swap key and value: the shuffle phase will then sort records by frequency
        context.write(new IntWritable(frequency), new Text(word));
    }
}
```
Run the job
```bash
hadoop jar Sort.jar Sort alice_wordcount/* alice_sorted
```
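By default, `IntWritable` keys are sorted in ascending order, so with a single reducer the most frequent words end up at the bottom of the output. A quick way to inspect them (assuming the default output file names):
```bash
hdfs dfs -cat alice_sorted/part-r-* | tail -n 20
```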
What are the top 3 nouns in the text?
# Troubleshooting
- **The Java compiler needs the Hadoop libraries to compile WordCount**
The classpath for Hadoop is returned by `hadoop classpath`.
- **Error: missing class**
If you see this error during compilation or execution, try downgrading to Java 8.
# Self-Check Questions
1. What are the main stages of MapReduce?
2. What types can be used to instantiate the generic `Mapper` and `Reducer` classes?
3. What does `job.setMapperClass` do?
4. What does `job.setCombinerClass` do?
5. What does `job.setReducerClass` do?
6. What does `job.setOutputKeyClass` do?
7. Why does one need to have `job.waitForCompletion` in the code?
8. How does one create a JAR archive?
9. Can one apply several Map transformations within one MapReduce job?
# Acceptance Criteria
- Code compiles
- The results of code execution are correct
- You can answer self-check questions
# Further Reading
Use the [Apache Hadoop MapReduce Tutorial](https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html) to learn more.