
python - Pyspark: Selecting data from a remote Hive server


I am trying to read and write data stored in a remote Hive server from PySpark. I followed this example:

from os.path import expanduser, join, abspath

from pyspark.sql import SparkSession
from pyspark.sql import Row

# warehouse_location points to the default location for managed databases and tables
warehouse_location = 'hdfs://quickstart.cloudera:8020/user/hive/warehouse'

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
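
Note that spark.sql.warehouse.dir only tells Spark where managed tables are stored; to resolve databases and tables that already exist in a remote Hive metastore, the session usually also needs to know where that metastore is. A minimal sketch, assuming the metastore thrift service runs on quickstart.cloudera at the default port 9083 (both host and port are assumptions; check hive-site.xml on the cluster):

from pyspark.sql import SparkSession

warehouse_location = 'hdfs://quickstart.cloudera:8020/user/hive/warehouse'

# hive.metastore.uris points Spark's Hive support at the remote metastore;
# the thrift host/port below are assumed, not taken from the question.
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .config("hive.metastore.uris", "thrift://quickstart.cloudera:9083") \
    .enableHiveSupport() \
    .getOrCreate()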

The example shows how to create a new table in the warehouse:
# spark is an existing SparkSession
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

# Queries are expressed in HiveQL
spark.sql("SELECT * FROM src").show()

However, I need to access the existing table iris created in mytest.db, so the table location is
table_path = warehouse_location + '/mytest.db/iris'

How do I select from the existing table?

UPDATE

I have the metastore URL:
http://test.mysite.net:8888/metastore/table/mytest/iris

and the table location URL:
hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytest.db/iris

When I use hdfs://quickstart.cloudera:8020/user/hive/warehouse as the warehouse location in the code above and try:
spark.sql("use mytest")

I get the exception:
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Database 'mytest' not found;"
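
That error usually means the session cannot see a mytest database in whatever metastore it is actually connected to (without a metastore configuration, Spark may fall back to a local Derby metastore). A quick way to check what the session does see, using the standard catalog API:

# Databases and tables visible to this SparkSession; if "mytest" is not
# listed here, the session is not talking to the intended Hive metastore.
spark.sql("SHOW DATABASES").show()
print(spark.catalog.listDatabases())
print(spark.catalog.listTables("default"))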

What is the correct URL to select from iris?

Best Answer

You can query the table directly using

spark.sql("SELECT * FROM mytest.iris")

or specify the database you want to use:

spark.sql("use mytest")
spark.sql("SELECT * FROM iris)

Regarding python - Pyspark: Selecting data from a remote Hive server, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46036324/
