Hive on Spark Environment Setup
Published: 2019-06-24


Building Spark from Source and Environment Setup

Note that you must have a version of Spark which does not include the Hive jars.

Building Spark:

git clone https://github.com/apache/spark.git spark_src
cd spark_src
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
./make-distribution.sh --name "spark-without-hive" --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.1 -Pyarn -DskipTests package

Spark setup: see the Spark environment setup chapter.
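If the build succeeds, make-distribution.sh leaves a .tgz in the Spark source root. A minimal sketch of placing it under the app directory used later in this post (the tarball name is an assumption inferred from the --name flag above and the install path below, and /home/spark/app is assumed to exist already):

# Unpack the freshly built Spark-without-Hive distribution (assumed artifact name)
tar -zxvf spark-1.3.0-bin-spark-without-hive.tgz -C /home/spark/app/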

 

Building Hive from Source and Environment Setup

Building Hive:

git clone https://github.com/apache/hive.git hive_on_spark
cd hive_on_spark
git checkout spark
mvn clean install -Phadoop-2,dist -DskipTests

After the build completes, the Hive binary package sits under the source tree at: packaging/target/apache-hive-1.2.0-SNAPSHOT-bin.tar.gz

Note that spark.version in pom.xml must match the Spark version used above:

<spark.version>1.3.0</spark.version>

Hive installation: see the Hive environment setup chapter.
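Likewise, a minimal sketch of unpacking the Hive tarball produced above into the same app directory (paths assume the build output location and install directory mentioned in this post):

# Unpack the freshly built Hive binary package (run from the hive_on_spark source root)
tar -zxvf packaging/target/apache-hive-1.2.0-SNAPSHOT-bin.tar.gz -C /home/spark/app/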

 

The Spark and Hive installation paths used in this example are as follows:

Spark installation directory: /home/spark/app/spark-1.3.0-bin-spark-without-hive

Hive installation directory: /home/spark/app/apache-hive-1.2.0-SNAPSHOT-bin

 

Ways to add the Spark dependency to Hive

Option 1: Set the property 'spark.home' to point to the Spark installation:

hive> set spark.home=/home/spark/app/spark-1.3.0-bin-spark-without-hive;

Option 2: Define the SPARK_HOME environment variable before starting Hive CLI/HiveServer2:

export SPARK_HOME=/home/spark/app/spark-1.3.0-bin-spark-without-hive

Option 3: Set the spark-assembly jar on the Hive auxpath:

hive --auxpath /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar

Option 4: Add the spark-assembly jar for the current user session:

hive> add jar /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar;

Option 5: Link the spark-assembly jar into $HIVE_HOME/lib.
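A minimal sketch of Option 5, assuming the Spark build above produced a single assembly jar under lib/ (the exact jar file name depends on the build):

# Symlink the Spark assembly jar into Hive's lib directory
ln -s /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar /home/spark/app/apache-hive-1.2.0-SNAPSHOT-bin/lib/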

 

An error you may hit while starting Hive:

[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
        at jline.TerminalFactory.create(TerminalFactory.java:101)
        at jline.TerminalFactory.get(TerminalFactory.java:158)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:229)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

Fix: export HADOOP_USER_CLASSPATH_FIRST=true

For fixes to errors in other scenarios, see: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

 

Another pitfall: you need to set the spark.eventLog.dir parameter, for example:

set spark.eventLog.dir=hdfs://hadoop000:8020/directory;

Otherwise queries keep failing with an error complaining that a directory like /tmp/spark-event does not exist.
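Since Spark only writes event logs into an existing directory, it is safest to create it on HDFS up front; a sketch using the NameNode address from this post:

# Create the Spark event log directory on HDFS before running queries
hdfs dfs -mkdir -p hdfs://hadoop000:8020/directory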

 

After starting Hive, set the execution engine to Spark:

hive> set hive.execution.engine=spark;

 

Set Spark's run mode (the Spark master):

hive> set spark.master=spark://hadoop000:7077;

Or use YARN: set spark.master=yarn;

 

Configure Spark-application configs for Hive

These can be configured in spark-defaults.conf or in hive-site.xml:

spark.master=
spark.eventLog.enabled=true
spark.executor.memory=512m
spark.serializer=org.apache.spark.serializer.KryoSerializer

spark.executor.memory=...              # Amount of memory to use per executor process.
spark.executor.cores=...               # Number of cores per executor.
spark.yarn.executor.memoryOverhead=...
spark.executor.instances=...           # The number of executors assigned to each application.
spark.driver.memory=...                # The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.
spark.yarn.driver.memoryOverhead=...   # We recommend 400 (MB).
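As an illustration, a few of these settings expressed as hive-site.xml properties might look like the sketch below; the values simply reuse the examples above and the standalone master URL from earlier, not tuned recommendations:

<property>
  <name>spark.master</name>
  <value>spark://hadoop000:7077</value>
</property>
<property>
  <name>spark.eventLog.enabled</name>
  <value>true</value>
</property>
<property>
  <name>spark.executor.memory</name>
  <value>512m</value>
</property>
<property>
  <name>spark.serializer</name>
  <value>org.apache.spark.serializer.KryoSerializer</value>
</property>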

For full details on these parameters, see the Hive on Spark: Getting Started documentation linked above.

 

After executing a SQL statement, you can view jobs, stages, and other information on the monitoring page:

hive (default)> select city_id, count(*) c from page_views group by city_id order by c desc limit 5;
Query ID = spark_20150309173838_444cb5b1-b72e-4fc3-87db-4162e364cb1e
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
state = SENT
state = STARTED
state = STARTED
state = STARTED
state = STARTED
Query Hive on Spark job[0] stages:
0
1
2
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2015-03-09 17:38:11,822 Stage-0_0: 0(+1)/1 Stage-1_0: 0/1 Stage-2_0: 0/1
state = STARTED
state = STARTED
state = STARTED
2015-03-09 17:38:14,845 Stage-0_0: 0(+1)/1 Stage-1_0: 0/1 Stage-2_0: 0/1
state = STARTED
state = STARTED
2015-03-09 17:38:16,861 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1 Stage-2_0: 0/1
state = SUCCEEDED
2015-03-09 17:38:17,867 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished Stage-2_0: 1/1 Finished
Status: Finished successfully in 10.07 seconds
OK
city_id    c
-1000      22826
-10        17294
-20        10608
-1         6186
237        4158
Time taken: 18.417 seconds, Fetched: 5 row(s)

 

 

