gpt4 book ai didi

apache-spark - Spark SQL UNION - ORDER BY 列不在 SELECT 中

转载 作者:行者123 更新时间:2023-12-05 01:44:13 24 4
gpt4 key购买 nike

我正在执行两个临时表的 UNION 并尝试按列排序,但 spark 提示我排序所依据的列无法解析。这是错误还是我遗漏了什么?

lazy val spark: SparkSession = SparkSession.builder.master("local[*]").getOrCreate()
import org.apache.spark.sql.types.StringType

val oldOrders = Seq(
Seq("old_order_id1", "old_order_name1", "true"),
Seq("old_order_id2", "old_order_name2", "true")
)

val newOrders = Seq(
Seq("new_order_id1", "new_order_name1", "false"),
Seq("new_order_id2", "new_order_name2", "false")
)
val schema = new StructType()
.add("id", StringType)
.add("name", StringType)
.add("is_old", StringType)

val oldOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(oldOrders.map(x => Row(x:_*))), schema)
val newOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(newOrders.map(x => Row(x:_*))), schema)

oldOrdersDF.createOrReplaceTempView("old_orders")
newOrdersDF.createOrReplaceTempView("new_orders")

//ordering by column not in select works if I'm not doing UNION
spark.sql(
"""
|SELECT oo.id, oo.name FROM old_orders oo
|ORDER BY oo.is_old
""".stripMargin).show()

//ordering by column not in select doesn't work as I'm doing a UNION
spark.sql(
"""
|SELECT oo.id, oo.name FROM old_orders oo
|UNION
|SELECT no.id, no.name FROM new_orders no
|ORDER BY oo.is_old
""".stripMargin).show()

The output of the above code is:

+-------------+---------------+
| id| name|
+-------------+---------------+
|old_order_id1|old_order_name1|
|old_order_id2|old_order_name2|
+-------------+---------------+


cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
'Sort ['oo.is_old ASC NULLS FIRST], true
+- Distinct
+- Union
:- Project [id#121, name#122]
: +- SubqueryAlias oo
: +- SubqueryAlias old_orders
: +- LogicalRDD [id#121, name#122, is_old#123]
+- Project [id#131, name#132]
+- SubqueryAlias no
+- SubqueryAlias new_orders
+- LogicalRDD [id#131, name#132, is_old#133]

org.apache.spark.sql.AnalysisException: cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
'Sort ['oo.is_old ASC NULLS FIRST], true
+- Distinct
+- Union
:- Project [id#121, name#122]
: +- SubqueryAlias oo
: +- SubqueryAlias old_orders
: +- LogicalRDD [id#121, name#122, is_old#123]
+- Project [id#131, name#132]
+- SubqueryAlias no
+- SubqueryAlias new_orders
+- LogicalRDD [id#131, name#132, is_old#133]

因此,如果我不执行 UNION,则按不在 SELECT 子句中的列进行排序,如果我执行两个表的 UNION,则排序失败。

最佳答案

Spark SQL 的语法与 SQL 非常相似,但它们的工作方式却截然不同。在 Spark 的背后,一切都与 Rdds/数据帧有关。

在 UNION 语句之后,生成了一个新的数据框,如果我们没有选择它们,我们将无法引用旧表/数据框中的字段。

如何修复:

spark.sql(
"""
|SELECT id, name
|FROM (
| SELECT oo.id, oo.name, oo.is_old FROM old_orders oo
| UNION
| SELECT no.id, no.name, no.is_old FROM new_orders no
| ORDER BY oo.is_old
| ) t
""".stripMargin).show()

谢谢。

关于apache-spark - Spark SQL UNION - ORDER BY 列不在 SELECT 中,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/46834657/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com