# Building a Java-based Apache Maven MapReduce Application in Eclipse and Running It on a Remote Apache Hadoop Cluster

***The following walks through the MapReduce Tutorial from the official Hadoop docs: http://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html***

## Prerequisites

- JDK 8: https://www.oracle.com/tw/java/technologies/javase/javase-jdk8-downloads.html
- Eclipse IDE for Eclipse Committers: https://www.eclipse.org/downloads/packages/release/2020-06/r/eclipse-ide-eclipse-committers
- Apache Maven 3.6.3: https://maven.apache.org/install.html

## Building the jar

Eclipse > File > New > Other > Maven Project

![](https://i.imgur.com/WfDlkk6.png)

Select maven-archetype-quickstart > Next

![](https://i.imgur.com/Gdqjhq1.png)

Fill in the Group Id and Artifact Id > Finish

![](https://i.imgur.com/sR5WGW9.png)

> **groupId**: uniquely identifies your project across all projects. A group ID should follow Java's package name rules, meaning it starts with a reversed domain name you control, e.g. org.apache.maven, org.apache.commons.
>
> **artifactId**: the name of the jar without the version. If you created it, you can choose whatever name you want, using lowercase letters and no strange symbols. If it is a third-party jar, you have to take the name of the jar as it is distributed, e.g. maven, commons-math.

In WordCount/pom.xml, add the following dependencies:

```xml=
<dependencies>
	<dependency>
		<groupId>org.apache.hadoop</groupId>
		<artifactId>hadoop-common</artifactId>
		<version>2.7.1</version>
	</dependency>
	<dependency>
		<groupId>org.apache.hadoop</groupId>
		<artifactId>hadoop-hdfs</artifactId>
		<version>2.7.1</version>
	</dependency>
	<dependency>
		<groupId>org.apache.hadoop</groupId>
		<artifactId>hadoop-mapreduce-client-core</artifactId>
		<version>2.7.1</version>
	</dependency>
	<dependency>
		<groupId>org.apache.hadoop</groupId>
		<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
		<version>2.7.1</version>
	</dependency>
	<dependency>
		<groupId>org.apache.hadoop</groupId>
		<artifactId>hadoop-mapreduce-client-common</artifactId>
		<version>2.7.1</version>
	</dependency>
</dependencies>
```

> Pick the versions and artifacts you need from:
> https://mvnrepository.com/artifact/org.apache.hadoop

After Maven > Update Project, Eclipse automatically loads the dependencies above.

![](https://i.imgur.com/UUwmouD.jpg)
![](https://i.imgur.com/7KTx6Rq.png)

Confirm that they were loaded:

![](https://i.imgur.com/CTPw6aN.png)

Create a class WordCount.java with the code from the official tutorial, [Example: WordCount v1.0](http://hadoop.apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html). One addition to the tutorial code: a package declaration matching the Group Id/Artifact Id chosen above, so that the fully qualified name com.WordCount.WordCount used in the hadoop jar command at the end resolves correctly.

```java=
package com.WordCount; // must match the class name passed to "hadoop jar" below

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```
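Before packaging, it can be worth smoke-testing the job inside Eclipse using Hadoop's local job runner, which executes the map and reduce phases in-process against the local filesystem, with no cluster involved. The sketch below is an optional addition, not part of the official tutorial; the class name WordCountLocalTest and the local-input/local-output paths are illustrative assumptions.

```java=
package com.WordCount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative local smoke test (assumption, not from the tutorial).
public class WordCountLocalTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapreduce.framework.name", "local"); // run map/reduce in-process, no YARN
    conf.set("fs.defaultFS", "file:///");          // read and write the local filesystem

    Job job = Job.getInstance(conf, "word count (local)");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // "local-input" holds a few sample txt files; "local-output" must not exist yet,
    // because FileOutputFormat refuses to overwrite an existing output directory.
    FileInputFormat.addInputPath(job, new Path("local-input"));
    FileOutputFormat.setOutputPath(job, new Path("local-output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Run it as a plain Java application; the counts appear in local-output/part-r-00000 under the project directory.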
Run As > Maven build

![](https://i.imgur.com/cRAmM4V.jpg)

Goals: clean install > Run

![](https://i.imgur.com/ZKGRJ1X.png)

Console:

![](https://i.imgur.com/223xEwg.png)

cd into eclipse-workspace/WordCount/target under your own directory and you will find a new **WordCount-0.0.1-SNAPSHOT.jar**. This completes the jar-building part.

## Running the jar on Hadoop

Use scp to copy the jar from the local machine to the remote machine.

> ***Note the uppercase -P: scp takes the port with -P, unlike ssh's lowercase -p.***

```shell=
scp -P _____ eclipse-workspace/WordCount/target/WordCount-0.0.1-SNAPSHOT.jar user@___.__.__.__:/user/___
```

Log in to the remote machine over ssh:

```shell=
ssh -p _____ user@___.__.__.__
# enter password
```

Check that the jar arrived. scp writes to the remote machine's local filesystem, so list the directory with a plain ls (not hdfs dfs -ls):

```shell=
ls /user/___
```

Running WordCount requires two paths, an input directory and an output directory. Create the input directory first; do not create the output directory, since the job creates it itself and fails if it already exists.

```shell=
hdfs dfs -mkdir /user/___/input
```

Create the test data:

```shell=
vim test1.txt   # press i to enter insert mode
# type: Hello World Bye World
# press Esc, then :wq to save

vim test2.txt   # same steps
# type: Hello Hadoop Goodbye Hadoop
```

Put both txt files into the input directory:

```shell=
hdfs dfs -put test1.txt /user/___/input
hdfs dfs -put test2.txt /user/___/input
```

Check that they were uploaded:

```shell=
hdfs dfs -ls /user/___/input
```

Result:

```shell=
Found 2 items
-rw-rw-rw-   3 user user         22 2020-08-14 03:05 /user/___/input/test1.txt
-rw-rw-rw-   3 user user         28 2020-08-14 03:05 /user/___/input/test2.txt
```

Final step: run the jar, specifying the input and output directories. The fully qualified class name com.WordCount.WordCount must match the package declared in the source.

```shell=
hadoop jar WordCount-0.0.1-SNAPSHOT.jar com.WordCount.WordCount /user/___/input /user/___/output
```

![](https://i.imgur.com/b6ntXr9.png)
![](https://i.imgur.com/xaRwFVr.png)

The results are written to the output directory; view them with hdfs dfs -cat:

![](https://i.imgur.com/NymOQZU.png)
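With the two test files above, the counts should come out as Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2, matching the official tutorial's sample output. As an optional alternative to hdfs dfs -cat, the minimal sketch below reads a file from HDFS through the FileSystem API; the class name CatHdfsFile and its argument convention are illustrative assumptions, and it expects the cluster configuration (core-site.xml etc.) to be on the classpath, as it is when launched through the hadoop command on the cluster.

```java=
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Illustrative sketch (assumption, not from the tutorial): prints an HDFS file
// to stdout, e.g. the reducer output /user/___/output/part-r-00000.
public class CatHdfsFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);     // connects to whatever fs.defaultFS points at
    try (InputStream in = fs.open(new Path(args[0]))) {
      IOUtils.copyBytes(in, System.out, 4096, false); // stream the file to stdout
    }
  }
}
```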