
python-3.x - How to connect Spark with Hive using PySpark?


I am trying to read a Hive table remotely using PySpark. It fails with an error stating that it cannot connect to the Hive metastore client.

I have read multiple answers on SO and other sources; they are mostly about configuration, but none of them addresses why I am unable to connect remotely. I read the documentation and observed that Spark can be connected to Hive without changing any configuration file. Note: I have port-forwarded from the machine where Hive is running and exposed it at localhost:10000. I even connected to the same setup using Presto and was able to run queries against Hive.

The code is:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext

SparkContext.setSystemProperty("hive.metastore.uris", "thrift://localhost:9083")

sparkSession = (SparkSession
    .builder
    .appName('example-pyspark-read-and-write-from-hive')
    .enableHiveSupport()
    .getOrCreate())

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)
df.write.saveAsTable('example')


I expect the output to be a confirmation that the table is saved, but instead I am facing this error.

The abstracted error is:
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 775, in saveAsTable
self._jwrite.saveAsTable(name)
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'

I issued the following command:
ssh -i ~/.ssh/id_rsa_sc -L 9000:A.B.C.D:8080 -L 9083:E.F.G.H:9083 -L 10000:E.F.G.H:10000 ubuntu@I.J.K.l

When I check ports 10000 and 9083 with the following commands:
aviral@versinator:~/testing-spark-hive$ nc -zv localhost 10000
Connection to localhost 10000 port [tcp/webmin] succeeded!
aviral@versinator:~/testing-spark-hive$ nc -zv localhost 9083
Connection to localhost 9083 port [tcp/*] succeeded!
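
As a side note, the same reachability check can be scripted in Python (a sketch equivalent to the nc calls above):

# Sketch: confirm the forwarded ports accept TCP connections,
# mirroring the nc -zv checks above (raises OSError if unreachable).
import socket

for port in (9083, 10000):
    with socket.create_connection(("localhost", port), timeout=5):
        print(f"localhost:{port} is reachable")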

After running the script, I get the following error:
Caused by: java.net.UnknownHostException: ip-172-16-1-101.ap-south-1.compute.internal
... 45 more

Best Answer

The catch is to have the Hive configuration stored while creating the Spark session itself.

sparkSession = (SparkSession
    .builder
    .appName('example-pyspark-read-and-write-from-hive')
    # Pass the metastore URI as a plain key/value pair; also passing
    # conf=SparkConf() would make builder.config() ignore the key/value.
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate()
)

It is worth noting that no changes to the Spark conf files are required; even serverless services such as AWS Glue can establish such a connection.
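
As a quick sanity check (a sketch, not part of the original answer), you can list the databases visible through the session; if the metastore URI were not picked up, Spark would typically fall back to a local Derby metastore containing only a default database:

# Sketch: verify the session is talking to the remote metastore by
# listing its databases. The names printed depend on what the remote
# Hive metastore actually contains.
for db in sparkSession.catalog.listDatabases():
    print(db.name)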

Full code:
from pyspark.sql import SparkSession

"""
Equivalent Java/Scala builder:

SparkSession ss = SparkSession
    .builder()
    .appName("Hive example")
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate();
"""

sparkSession = (SparkSession
    .builder
    .appName('example-pyspark-read-and-write-from-hive')
    .config("hive.metastore.uris", "thrift://localhost:9083")
    .enableHiveSupport()
    .getOrCreate()
)

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)

# Write into Hive
# df.write.saveAsTable('example')

# Read the table back from Hive
df_load = sparkSession.sql('SELECT * FROM example')
df_load.show()
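
If you enable the commented-out write, an explicit save mode is worth adding, since the default mode raises an AnalysisException when the table already exists. A minimal sketch (the database-qualified name default.example is an illustrative assumption):

# Sketch: write with an explicit mode so re-runs replace the table
# instead of failing because it already exists.
df.write.mode('overwrite').saveAsTable('default.example')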

Regarding python-3.x - How to connect Spark with Hive using PySpark?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55339022/
