gpt4 book ai didi

hive - 在pyspark中查询HIVE表

转载 作者:行者123 更新时间:2023-12-03 11:09:07 28 4
gpt4 key购买 nike

我正在使用CDH5.5

我在HIVE默认数据库中创建了一个表,并且能够从HIVE命令中查询它。

输出量

hive> use default;

OK

Time taken: 0.582 seconds


hive> show tables;

OK

bank
Time taken: 0.341 seconds, Fetched: 1 row(s)

hive> select count(*) from bank;

OK

542

Time taken: 64.961 seconds, Fetched: 1 row(s)

但是,我无法从pyspark查询该表,因为它无法识别该表。
from pyspark.context import SparkContext

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)


sqlContext.sql("use default")

DataFrame[result: string]

sqlContext.sql("show tables").show()

+---------+-----------+

|tableName|isTemporary|

+---------+-----------+

+---------+-----------+


sqlContext.sql("FROM bank SELECT count(*)")

16/03/16 20:12:13 INFO parse.ParseDriver: Parsing command: FROM bank SELECT count(*)
16/03/16 20:12:13 INFO parse.ParseDriver: Parse Completed
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/context.py", line 552, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 40, in deco
raise AnalysisException(s.split(': ', 1)[1])
**pyspark.sql.utils.AnalysisException: no such table bank; line 1 pos 5**

新错误
>>> from pyspark.sql import HiveContext
>>> hive_context = HiveContext(sc)
>>> bank = hive_context.table("default.bank")
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/03/22 18:33:30 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:44 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/22 18:33:50 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/context.py", line 565, in table
return DataFrame(self._ssql_ctx.table(tableName), self)
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
return f(*a, **kw)
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o22.table.
: org.apache.spark.sql.catalyst.analysis.NoSuchTableException
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
at org.apache.spark.sql.hive.client.ClientInterface$$anonfun$getTable$1.apply(ClientInterface.scala:123)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123)
at org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:406)
at org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:422)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:203)
at org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:422)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

谢谢

最佳答案

我们无法将Hive表名直接传递给Hive上下文sql方法,因为它无法理解Hive表名。在pyspark shell中读取Hive表的一种方法是:

from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
bank = hive_context.table("default.bank")
bank.show()

要在配置单元表上运行SQL:
首先,我们需要注册从读取配置单元表获得的数据帧。
然后,我们可以运行SQL查询。
bank.registerTempTable("bank_temp")
hive_context.sql("select * from bank_temp").show()

关于hive - 在pyspark中查询HIVE表,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/36051091/

28 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com