Getting Started with Spark

Preface

Notes from getting started with Spark.

Installation

Install Scala

Download scala-2.12.7.tgz, extract it, and set the environment variables.

Install Spark

Download spark-2.4.0-bin-hadoop2.7.tgz, extract it, and set the environment variables.

# scala
export SCALA_HOME=/usr/scala
export PATH=$PATH:$SCALA_HOME/bin
# spark
export SPARK_HOME=/hadoop/spark-2.4.0
export PATH=$PATH:$SPARK_HOME/bin

spark-env.sh

First rename spark-env.sh.template to spark-env.sh, then add the following with vim:

JAVA_HOME=/usr/java/jdk1.8.0_171
SCALA_HOME=/usr/scala
HADOOP_HOME=/hadoop/hadoop-2.8.4
HADOOP_CONF_DIR=/hadoop/hadoop-2.8.4/etc/hadoop
SPARK_MASTER_IP=hadoop01
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_PORT=7078
SPARK_WORKER_WEBUI_PORT=8081
SPARK_WORKER_INSTANCES=1

spark-defaults.conf

mv spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf

spark.master spark://hadoop01:7077

slaves

mv slaves.template slaves
vim slaves

Contents of the slaves file (one worker hostname per line):

hadoop02
Copy

Install Scala on hadoop02, then copy the Spark directory from hadoop01 to hadoop02.

Installation and environment variable setup omitted ...
scp -r /hadoop/spark-2.4.0/ hadoop02:/hadoop/

Start
$SPARK_HOME/sbin/start-all.sh

Check the processes with jps: if a Master and a Worker process have appeared, the installation succeeded. The web UI is at http://hadoop01:8080/.
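As an optional sanity check (not part of the original steps), you can open a Scala shell against the master and run a tiny job; the master URL matches the spark.master configured above, and the expected output is simply the sum 5050.0.

// Start the shell with: spark-shell --master spark://hadoop01:7077
// `sc` is the SparkContext that spark-shell creates automatically.
val nums = sc.parallelize(1 to 100)   // distribute the numbers 1..100 across the workers
println(nums.sum())                   // prints 5050.0 if the cluster is working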

WordCount program

On Windows, install Scala, install the Scala plugin for IDEA, create a new Maven project, and add Scala support to the project.

pom

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.11</scala.version>
    <spark.version>2.4.0</spark.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>
<build>
    <finalName>scalaDemo</finalName>
    <plugins>
        <plugin>
            <!-- Compiles the Scala sources -->
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.1</version>
            <executions>
                <execution>
                    <id>scala-compile-first</id>
                    <phase>process-resources</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Note: the scala-maven-plugin must be added under the plugin tags, otherwise clicking maven -> compile will not compile the .scala files into class files.

code
package demo02

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // The master is taken from spark-defaults.conf / spark-submit, so only the app name is set here
    val conf = new SparkConf().setAppName("WC")
    val sc = new SparkContext(conf)
    val line = sc.textFile("hdfs://hadoop01:9000/wc_test_02")
    val result = line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.saveAsTextFile("hdfs://hadoop01:9000/spark/out2")
    sc.stop()
  }
}
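Before packaging, it can be handy to run the same logic locally from the IDE. The sketch below is not part of the original steps: the object name WordCountLocal and the input path input.txt are made up for illustration, and setMaster("local[*]") runs Spark inside the JVM so no cluster is needed.

package demo02

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local-mode variant for quick testing in the IDE
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WC-local").setMaster("local[*]") // run inside this JVM
    val sc = new SparkContext(conf)
    val counts = sc.textFile("input.txt") // hypothetical local input file
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println) // print to the console instead of writing to HDFS
    sc.stop()
  }
}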

Then compile, package, and upload the jar to Linux, ready to run.

Pitfalls

Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

This took several hours to sort out. Spark is really memory-hungry; the fix:

Edit spark-env.sh:

SPARK_WORKER_MEMORY=800M

Lower the memory (and restart the cluster so the change takes effect), then run:

./spark-submit --class demo02.WordCount --executor-memory 500m /root/Desktop/scalaDemo.jar

The specified memory must be less than 800m; after that the job ran successfully, with no output at all (no errors means success).
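For reference, the executor memory can also be set on the SparkConf instead of on the spark-submit command line; this is only a sketch of that alternative, and the value still has to stay below the worker's 800M.

import org.apache.spark.{SparkConf, SparkContext}

// Equivalent to --executor-memory 500m, but configured in code before the SparkContext is created
val conf = new SparkConf()
  .setAppName("WC")
  .set("spark.executor.memory", "500m") // must be smaller than SPARK_WORKER_MEMORY (800M)
val sc = new SparkContext(conf)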

Result

(brother,1)
(hello,4)
(world,1)
(spark,1)
(hadoop,1)

A simple Spark WordCount program, much simpler than writing MapReduce.

One problem along the way: why couldn't the SparkSubmit process be killed? Even kill didn't work; in the end there were too many processes and the only fix was to reboot the virtual machine.