
sql - How to use pivot in SQL (not as a DataFrame groupBy operator)?

Reposted. Author: 行者123  Updated: 2023-12-03 23:47:32

I have this DataFrame, and I am trying to convert the following DataFrame operation into SQL using the pivot function:

val df = Seq(
  (1, "a,b,c"),
  (2, "b,c")
).toDF("id", "page_path")
df.createOrReplaceTempView("df")

df.show()
df
  .withColumn("splitted", split($"page_path", ","))
  .withColumn("exploded", explode($"splitted"))
  .groupBy("id")
  .pivot("exploded")
  .count().show

Output produced:
+---+----+---+---+
| id| a| b| c|
+---+----+---+---+
| 1| 1| 1| 1|
| 2|null| 1| 1|
+---+----+---+---+


I saw this databricks link which uses pivot in SQL, and I tried to apply it, but failed.

Does anyone know how to apply the pivot function in SQL?

This is how I tried it:
spark.sql(
  """
    |select * from (
    |  select id, exploded from (
    |    select id, explode(split(page_path, ',')) as exploded from df
    |  )
    |  group by id, exploded
    |)
    |pivot (id, exploded)
  """.stripMargin
).show


which produces this result:
+---+--------+
| id|exploded|
+---+--------+
| 2| c|
| 1| c|
| 1| b|
| 2| b|
| 1| a|
+---+--------+

which is not the same as the output shown above from the DataFrame operation.

I also tried this:
spark.sql(
  """
    |select * from (
    |  select * from df lateral view explode(split(page_path, ',')) as exploded
    |)
    |pivot (exploded)
  """.stripMargin
).show

Exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Number of column aliases does not match number of columns. Number of column aliases: 1; number of columns: 3.; line 2 pos 21



I've been racking my brain wondering whether this is even possible.

Best Answer

We don't have to write a group by clause when using SQL pivot, because Spark performs the grouping implicitly.

From the documentation:

An important idea about pivot is that it performs a grouped aggregation based on a list of implicit group-by columns together with the pivot column. The implicit group-by columns are columns from the FROM clause that do not appear in any aggregate function or as the pivot column.
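To make the quoted semantics concrete, here is a plain-Python sketch (no Spark) of what the SQL pivot computes on the exploded rows of this question's data: group by the implicit group-by column (id), spread the pivot column's values ('a', 'b', 'c') into output columns, and fill each cell with count(exploded), leaving None where no matching row exists. The variable names are illustrative, not part of any Spark API.

```python
from collections import Counter

# Exploded rows of the question's DataFrame: (id, exploded)
rows = [(1, "a"), (1, "b"), (1, "c"), (2, "b"), (2, "c")]
pivot_values = ["a", "b", "c"]  # the "for exploded in (...)" list

counts = Counter(rows)  # (id, exploded) -> count, missing pairs count 0
ids = sorted({id_ for id_, _ in rows})  # implicit group-by column values

# One output row per id; a 0 count becomes None, like SQL's null cell.
pivoted = {
    id_: {v: (counts[(id_, v)] or None) for v in pivot_values}
    for id_ in ids
}
# pivoted == {1: {'a': 1, 'b': 1, 'c': 1}, 2: {'a': None, 'b': 1, 'c': 1}}
```

This mirrors the DataFrame output above: id 2 has no 'a' row, so its 'a' cell is null.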



In pivot we need to supply an aggregate expression over the pivot column, and after for, the list of pivot-column values we want as columns in the final dataset.

Using Spark SQL pivot (available from Spark 2.4):
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val df1: DataFrame = Seq((1, "a,b,c"), (2, "b,c")).toDF("id", "page_path")
df1.createOrReplaceTempView("df1")
spark.sql(
  """
    |select * from
    |( select id, explode(split(page_path, ',')) as exploded from df1 )
    |pivot (
    |  count(exploded) for exploded in ('a', 'b', 'c')
    |)
  """.stripMargin)
  .show

Result:
+---+----+---+---+
| id| a| b| c|
+---+----+---+---+
| 1| 1| 1| 1|
| 2|null| 1| 1|
+---+----+---+---+

Using the DataFrame API:
df1.withColumn("splitted", split($"page_path", ",")).
withColumn("exploded", explode($"splitted")).
groupBy("id").
pivot("exploded").
count().
show

Result:
+---+----+---+---+
| id| a| b| c|
+---+----+---+---+
| 1| 1| 1| 1|
| 2|null| 1| 1|
+---+----+---+---+

Regarding "sql - How to use pivot in SQL (not as a DataFrame groupBy operator)?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61739018/
