Running Spark from Eclipse (Standalone, YARN-Client)
Source: http://www.cnblogs.com/zdfjf/p/5175566.html
We know that Eclipse has a Hadoop plugin that lets us browse files on HDFS, create MapReduce programs, and run them with "Run On Hadoop". So can we also run a Spark program directly from Eclipse, submitting it to the cluster in YARN-Client mode or running it in Standalone mode?
The answer is yes. Below I show how to run Spark's word-count program from Eclipse. I am using Hadoop 2.6.2 and Spark 1.5.2.
-
1. Running in Standalone mode
-
1.1 Create an ordinary Java project. The code is below:
 1 /*
 2  * Licensed to the Apache Software Foundation (ASF) under one or more
 3  * contributor license agreements. See the NOTICE file distributed with
 4  * this work for additional information regarding copyright ownership.
 5  * The ASF licenses this file to You under the Apache License, Version 2.0
 6  * (the "License"); you may not use this file except in compliance with
 7  * the License. You may obtain a copy of the License at
 8  *
 9  *    http://www.apache.org/licenses/LICENSE-2.0
10  *
11  * Unless required by applicable law or agreed to in writing, software
12  * distributed under the License is distributed on an "AS IS" BASIS,
13  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14  * See the License for the specific language governing permissions and
15  * limitations under the License.
16  */
17 
18 package com.frank.spark;
19 
20 import scala.Tuple2;
21 import org.apache.spark.SparkConf;
22 import org.apache.spark.api.java.JavaPairRDD;
23 import org.apache.spark.api.java.JavaRDD;
24 import org.apache.spark.api.java.JavaSparkContext;
25 import org.apache.spark.api.java.function.FlatMapFunction;
26 import org.apache.spark.api.java.function.Function2;
27 import org.apache.spark.api.java.function.PairFunction;
28 
29 import java.util.Arrays;
30 import java.util.List;
31 import java.util.regex.Pattern;
32 
33 public final class JavaWordCount {
34   private static final Pattern SPACE = Pattern.compile(" ");
35 
36   public static void main(String[] args) throws Exception {
37 
38     if (args.length < 1) {
39       System.err.println("Usage: JavaWordCount <file>");
40       System.exit(1);
41     }
42 
43     SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
44     sparkConf.setMaster("spark://192.168.0.1:7077");
45     JavaSparkContext ctx = new JavaSparkContext(sparkConf);
46     ctx.addJar("C:\\Users\\Frank\\sparkwordcount.jar");
47     JavaRDD<String> lines = ctx.textFile(args[0], 1);
48 
49     JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
50       @Override
51       public Iterable<String> call(String s) {
52         return Arrays.asList(SPACE.split(s));
53       }
54     });
55 
56     JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
57       @Override
58       public Tuple2<String, Integer> call(String s) {
59         return new Tuple2<String, Integer>(s, 1);
60       }
61     });
62 
63     JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
64       @Override
65       public Integer call(Integer i1, Integer i2) {
66         return i1 + i2;
67       }
68     });
69 
70     List<Tuple2<String, Integer>> output = counts.collect();
71     for (Tuple2<?,?> tuple : output) {
72       System.out.println(tuple._1() + ": " + tuple._2());
73     }
74     ctx.stop();
75   }
76 }
The code is copied straight from examples/src/main/java/org/apache/spark/examples/JavaWordCount.java in the extracted Spark distribution. The only changes are the added lines 44 and 46: line 44 sets the Master URL to the IP of the Hadoop master node, port 7077, and line 46 gives the Windows path where the packaged project jar is placed (the project must be exported as a jar to that path first; addJar ships it to the cluster so the executors can load its classes).
-
1.2 Add the Spark dependency spark-assembly-1.5.2-hadoop2.6.0.jar to the build path; it can be found in the lib directory of the extracted Spark distribution.
-
1.3 Configure the HDFS path of the file to be counted
Run As->Run Configurations
Click Arguments. Since line 47 of the program takes the path of the file to be counted as its argument, configure it here. The file must be on HDFS, so the IP here is again the IP of your Hadoop master machine.
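For example, the program argument could look like the following (the path is only an illustration; substitute your own NameNode address and input file):

hdfs://192.168.0.1:9000/user/hadoop/input/words.txt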
-
1.4 Now simply run the program; the word counts are printed to the Eclipse console. You can also check the submitted application in Spark's web UI (by default the standalone Master UI listens on port 8080 of the master node).
-
2. Running in YARN-Client mode
-
2.1 The code first
 1 /*
 2  * Licensed to the Apache Software Foundation (ASF) under one or more
 3  * contributor license agreements. See the NOTICE file distributed with
 4  * this work for additional information regarding copyright ownership.
 5  * The ASF licenses this file to You under the Apache License, Version 2.0
 6  * (the "License"); you may not use this file except in compliance with
 7  * the License. You may obtain a copy of the License at
 8  *
 9  *    http://www.apache.org/licenses/LICENSE-2.0
10  *
11  * Unless required by applicable law or agreed to in writing, software
12  * distributed under the License is distributed on an "AS IS" BASIS,
13  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14  * See the License for the specific language governing permissions and
15  * limitations under the License.
16  */
17 
18 package com.frank.spark;
19 
20 import scala.Tuple2;
21 import org.apache.spark.SparkConf;
22 import org.apache.spark.api.java.JavaPairRDD;
23 import org.apache.spark.api.java.JavaRDD;
24 import org.apache.spark.api.java.JavaSparkContext;
25 import org.apache.spark.api.java.function.FlatMapFunction;
26 import org.apache.spark.api.java.function.Function2;
27 import org.apache.spark.api.java.function.PairFunction;
28 
29 import java.util.Arrays;
30 import java.util.List;
31 import java.util.regex.Pattern;
32 
33 public final class JavaWordCount {
34   private static final Pattern SPACE = Pattern.compile(" ");
35 
36   public static void main(String[] args) throws Exception {
37 
38     System.setProperty("HADOOP_USER_NAME", "hadoop");
39 
40     if (args.length < 1) {
41       System.err.println("Usage: JavaWordCount <file>");
42       System.exit(1);
43     }
44 
45     SparkConf sparkConf = new SparkConf().setAppName("JavaWordCountByFrank01");
46     sparkConf.setMaster("yarn-client");
47     sparkConf.set("spark.yarn.dist.files", "C:\\software\\workspace\\sparkwordcount\\src\\yarn-site.xml");
48     sparkConf.set("spark.yarn.jar", "hdfs://192.168.0.1:9000/user/bigdatagfts/spark-assembly-1.5.2-hadoop2.6.0.jar");
49 
50     JavaSparkContext ctx = new JavaSparkContext(sparkConf);
51     ctx.addJar("C:\\Users\\Frank\\sparkwordcount.jar");
52     JavaRDD<String> lines = ctx.textFile(args[0], 1);
53 
54     JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
55       @Override
56       public Iterable<String> call(String s) {
57         return Arrays.asList(SPACE.split(s));
58       }
59     });
60 
61     JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
62       @Override
63       public Tuple2<String, Integer> call(String s) {
64         return new Tuple2<String, Integer>(s, 1);
65       }
66     });
67 
68     JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
69       @Override
70       public Integer call(Integer i1, Integer i2) {
71         return i1 + i2;
72       }
73     });
74 
75     List<Tuple2<String, Integer>> output = counts.collect();
76     for (Tuple2<?,?> tuple : output) {
77       System.out.println(tuple._1() + ": " + tuple._2());
78     }
79     ctx.stop();
80   }
81 }
-
2.2 Explanation of the code
Line 38: if your Windows user name differs from the user name on the cluster, set it here, otherwise HDFS accesses may fail with permission errors. For example, my Windows user name is Frank while the user on the Hadoop cluster is hadoop, so I set it as shown on line 38.
Line 46 configures the yarn-client run mode.
Line 48: when running in this mode, every run would otherwise upload spark-assembly-1.5.2-hadoop2.6.0.jar into the application-id directory created on HDFS for that run, which costs several minutes. Instead, you can set spark.yarn.jar: upload spark-assembly-1.5.2-hadoop2.6.0.jar to a fixed directory on HDFS once, so it no longer has to be copied from Windows to HDFS on every run (see the upload sketch at the end of this subsection). See https://spark.apache.org/docs/1.5.2/running-on-yarn.html.
spark.yarn.jar: The location of the Spark jar file, in case overriding the default location is desired. By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to a jar on HDFS, for example, set this configuration to "hdfs:///some/path".
Line 51: the Windows path where the packaged project jar is placed.
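As a one-time preparation for the spark.yarn.jar setting on line 48, the assembly jar can be copied to HDFS with the hdfs command-line tools, or with Hadoop's FileSystem API. Below is a minimal sketch of the latter; the local path to the Spark lib directory is an assumption, so adjust it to wherever your distribution is extracted, and keep the HDFS destination consistent with line 48:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadSparkAssembly {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to the NameNode as the cluster user "hadoop"
    FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.0.1:9000"), conf, "hadoop");
    // One-time copy of the assembly jar from the local Spark lib directory
    // (local path assumed) to the HDFS location referenced by spark.yarn.jar
    fs.copyFromLocalFile(
        new Path("C:\\software\\spark-1.5.2-bin-hadoop2.6\\lib\\spark-assembly-1.5.2-hadoop2.6.0.jar"),
        new Path("/user/bigdatagfts/spark-assembly-1.5.2-hadoop2.6.0.jar"));
    fs.close();
  }
}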
-
2.3 Project configuration
Place the three Hadoop configuration files under src, copied down from the Linux machines of the Hadoop cluster (for a YARN client these are typically core-site.xml, hdfs-site.xml, and yarn-site.xml).
-
2.4 Configure the HDFS path of the file to be counted
Same as 1.3; the results again appear in the Eclipse console.