
performance - Spark SQL performance: version 1.6 vs version 1.5


I am trying to compare the performance of Spark SQL version 1.6 against version 1.5. In simple cases, Spark 1.6 is considerably faster than Spark 1.5. However, on more complex queries (in my case, an aggregation query with grouping sets), Spark SQL 1.6 is much slower than Spark SQL 1.5. Has anyone noticed the same issue? Even better, is there a solution for this kind of query?

Here is my code:

case class Toto(
  a: String = f"${(math.random*1e6).toLong}%06.0f",  // random 6-digit string key
  b: String = f"${(math.random*1e6).toLong}%06.0f",  // random 6-digit string key
  c: String = f"${(math.random*1e6).toLong}%06.0f",  // random 6-digit string key (not used by the query)
  n: Int = (math.random*1e3).toInt,                  // random integer measure in [0, 1000)
  m: Double = math.random*1e3)                       // random double measure in [0, 1000)

// Generate one million random rows and turn them into a DataFrame.
val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame(data)

df.registerTempTable("toto")
// Aggregate over three grouping sets: (a, b) together, a alone, and b alone.
val sqlSelect = "SELECT a, b, COUNT(1) AS k1, COUNT(DISTINCT n) AS k2, SUM(m) AS k3"
val sqlGroupBy = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlText = s"$sqlSelect $sqlGroupBy"

val rs1 = sqlContext.sql(sqlText)
rs1.saveAsParquetFile("rs1")
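As background on the query above: GROUP BY a, b GROUPING SETS ((a,b),(a),(b)) computes, in a single query, the same aggregates that three separate GROUP BY queries glued together with UNION ALL would produce, with NULL filling the column a grouping set omits. A minimal, hypothetical rewrite for illustration only (same toto temp table; typically slower, since it scans the table three times):

// Hypothetical UNION ALL equivalent of the grouping-sets query, for illustration.
// CAST(NULL AS STRING) keeps the column types consistent across the branches.
val sqlUnion = """
SELECT a, b, COUNT(1) AS k1, COUNT(DISTINCT n) AS k2, SUM(m) AS k3 FROM toto GROUP BY a, b
UNION ALL
SELECT a, CAST(NULL AS STRING) AS b, COUNT(1), COUNT(DISTINCT n), SUM(m) FROM toto GROUP BY a
UNION ALL
SELECT CAST(NULL AS STRING) AS a, b, COUNT(1), COUNT(DISTINCT n), SUM(m) FROM toto GROUP BY b
"""
val rsUnion = sqlContext.sql(sqlUnion)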

Here are two screenshots, Spark 1.5.2 and Spark 1.6.0, both run with --driver-memory=1G. The DAG for the Spark 1.6.0 run can be seen in the linked DAG view.

Best answer

Thanks to Herman van Hövell for his reply on the Spark dev community. To share it with other members, I am reposting his response here.

1.6 plans single distinct aggregates like multiple distinct aggregates; this inherently causes some overhead but is more stable in case of high cardinalities. You can revert to the old behavior by setting the spark.sql.specializeSingleDistinctAggPlanning option to false. See also: https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L452-L462



Actually, in order to revert, the setting value should be "true".
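A minimal sketch of how that setting could be applied, assuming the same sqlContext as in the question (the option is an internal Spark 1.6 flag, so treat this as illustrative):

// Revert to the pre-1.6 planning of single distinct aggregates.
// Per the correction above, "true" restores the old behavior here.
sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")
val rs2 = sqlContext.sql(sqlText)  // re-run the grouping-sets query with the old plan

The same option can also be passed at submit time, e.g. spark-submit --conf spark.sql.specializeSingleDistinctAggPlanning=true.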

For this question on Spark SQL performance (version 1.6 vs version 1.5), there is a similar question on Stack Overflow: https://stackoverflow.com/questions/35181158/
