apache-spark - How to use grouping sets as an operator/method on a Dataset?

Reposted · Author: 行者123 · Updated: 2023-12-04 05:27:14

Is there no function-level grouping sets support in Spark's Scala API?

I don't know whether this patch made it into master:
https://github.com/apache/spark/pull/5080

I want to express this kind of query through the Scala DataFrame API:

GROUP BY expression list GROUPING SETS(expression list2)

The cube and rollup functions are available in the Dataset API, but I cannot find grouping sets. Why?

Best Answer

I want to do this kind of query by scala dataframe api.



tl;dr Prior to Spark 2.1.0, this was not possible, and there are currently no plans to add such an operator to the Dataset API.

Spark SQL supports the following so-called multi-dimensional aggregate operators:
  • the rollup operator
  • the cube operator
  • the GROUPING SETS clause (SQL mode only)
  • the grouping() and grouping_id() functions

  • Note: GROUPING SETS is only available in SQL mode. It is not supported in the Dataset API.
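Since a GROUPING SETS query is, conceptually, a union of several GROUP BY aggregations, its semantics can be sketched in plain Scala without Spark. The object and helper names below are made up for illustration; columns left out of a grouping set become None, which is what Spark renders as null:

```scala
object GroupingSetsSketch {
  // The same sales data as in the SQL example below, as plain tuples.
  val sales = Seq(
    ("Warsaw", 2016, 100), ("Warsaw", 2017, 200),
    ("Boston", 2015, 50), ("Boston", 2016, 150),
    ("Toronto", 2017, 50))

  // Aggregate for one grouping set: keys not in the set collapse to None.
  def aggregate(keepCity: Boolean, keepYear: Boolean): Map[(Option[String], Option[Int]), Int] =
    sales
      .groupBy { case (city, year, _) =>
        (if (keepCity) Some(city) else None, if (keepYear) Some(year) else None)
      }
      .map { case (key, rows) => key -> rows.map(_._3).sum }

  // GROUPING SETS ((city, year), (city), ()) == union of three aggregations.
  val result: Map[(Option[String], Option[Int]), Int] =
    aggregate(keepCity = true, keepYear = true) ++
      aggregate(keepCity = true, keepYear = false) ++
      aggregate(keepCity = false, keepYear = false)

  def main(args: Array[String]): Unit = {
    assert(result((Some("Warsaw"), None)) == 300) // per-city subtotal
    assert(result((None, None)) == 550)           // grand total
    println(result.size)                          // 9 result rows in total
  }
}
```

This union-of-aggregations view is only conceptual; internally Spark reaches the same result in one pass over the data.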

    Grouping sets
    val sales = Seq(
      ("Warsaw", 2016, 100),
      ("Warsaw", 2017, 200),
      ("Boston", 2015, 50),
      ("Boston", 2016, 150),
      ("Toronto", 2017, 50)
    ).toDF("city", "year", "amount")
    sales.createOrReplaceTempView("sales")

    // equivalent to rollup("city", "year")
    val q = sql("""
      SELECT city, year, sum(amount) as amount
      FROM sales
      GROUP BY city, year
      GROUPING SETS ((city, year), (city), ())
      ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
      """)
    scala> q.show
    +-------+----+------+
    |   city|year|amount|
    +-------+----+------+
    | Warsaw|2016|   100|
    | Warsaw|2017|   200|
    | Warsaw|null|   300|
    |Toronto|2017|    50|
    |Toronto|null|    50|
    | Boston|2015|    50|
    | Boston|2016|   150|
    | Boston|null|   200|
    |   null|null|   550|  <-- grand total across all cities and years
    +-------+----+------+
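The equivalence noted in the comment above holds because rollup over n columns expands to the n+1 prefix grouping sets. A small sketch in plain Scala (the object and method names are hypothetical):

```scala
object RollupSets {
  // rollup(c1, ..., cn) is shorthand for the grouping sets formed by all
  // prefixes of the column list: (c1..cn), (c1..cn-1), ..., (c1), ().
  def rollupSets(cols: List[String]): List[List[String]] =
    cols.inits.toList // `inits` yields every prefix, longest first

  def main(args: Array[String]): Unit =
    // rollup("city", "year") == GROUPING SETS ((city, year), (city), ())
    println(rollupSets(List("city", "year")))
}
```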

    // equivalent to cube("city", "year")
    // note the additional (year) grouping set
    val q = sql("""
      SELECT city, year, sum(amount) as amount
      FROM sales
      GROUP BY city, year
      GROUPING SETS ((city, year), (city), (year), ())
      ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
      """)
    scala> q.show
    +-------+----+------+
    |   city|year|amount|
    +-------+----+------+
    | Warsaw|2016|   100|
    | Warsaw|2017|   200|
    | Warsaw|null|   300|
    |Toronto|2017|    50|
    |Toronto|null|    50|
    | Boston|2015|    50|
    | Boston|2016|   150|
    | Boston|null|   200|
    |   null|2015|    50|  <-- total across all cities in 2015
    |   null|2016|   250|  <-- total across all cities in 2016
    |   null|2017|   250|  <-- total across all cities in 2017
    |   null|null|   550|
    +-------+----+------+
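Likewise, cube over n columns expands to all 2^n subsets of the grouping columns, which is exactly where the extra (year) set and its three all-city rows come from. A sketch with a hypothetical helper:

```scala
object CubeSets {
  // cube(c1, ..., cn) is shorthand for all 2^n subsets of the columns.
  def cubeSets(cols: List[String]): List[List[String]] =
    (cols.length to 0 by -1).flatMap(k => cols.combinations(k)).toList

  def main(args: Array[String]): Unit =
    // cube("city", "year") == GROUPING SETS ((city, year), (city), (year), ())
    println(cubeSets(List("city", "year")))
}
```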

    Regarding apache-spark - How to use grouping sets as an operator/method on a Dataset?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40923680/
