
count - Spark: how to translate count(distinct(value)) into the DataFrame API

Reposted · Author: 行者123 · Updated: 2023-12-03 10:10:33

I am trying to compare different ways of aggregating my data.

This is my input data, with 2 elements (page, visitor):

(PAG1,V1)
(PAG1,V1)
(PAG2,V1)
(PAG2,V2)
(PAG2,V1)
(PAG1,V1)
(PAG1,V2)
(PAG1,V1)
(PAG1,V2)
(PAG1,V1)
(PAG2,V2)
(PAG1,V3)

Using a SQL command in Spark SQL with the following code:
import sqlContext.implicits._
case class Log(page: String, visitor: String)
val logs = data.map(p => Log(p._1,p._2)).toDF()
logs.registerTempTable("logs")
val sqlResult= sqlContext.sql(
"""select page
,count(distinct visitor) as visitor
from logs
group by page
""")
val result = sqlResult.map(x=>(x(0).toString,x(1).toString))
result.foreach(println)

I get this output:
(PAG1,3) // PAG1 has been visited by 3 different visitors
(PAG2,2) // PAG2 has been visited by 2 different visitors
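As a sanity check, the same per-page distinct-visitor count can be sketched with plain Scala collections, no Spark required. The local `Seq` below is a stand-in for the input pairs above:

```scala
object DistinctCheck extends App {
  // Local stand-in for the (page, visitor) input pairs above.
  val data = Seq(
    ("PAG1", "V1"), ("PAG1", "V1"), ("PAG2", "V1"), ("PAG2", "V2"),
    ("PAG2", "V1"), ("PAG1", "V1"), ("PAG1", "V2"), ("PAG1", "V1"),
    ("PAG1", "V2"), ("PAG1", "V1"), ("PAG2", "V2"), ("PAG1", "V3")
  )

  // select page, count(distinct visitor) ... group by page:
  // group the rows by page, then count the distinct visitors per group.
  val distinctVisitors: Map[String, Int] =
    data.groupBy(_._1).map { case (page, rows) =>
      page -> rows.map(_._2).distinct.size
    }

  println(distinctVisitors) // PAG1 -> 3, PAG2 -> 2
}
```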

Now, I'd like to get the same result using DataFrames and their API, but I can't get the same output:
import sqlContext.implicits._
case class Log(page: String, visitor: String)
val logs = data.map(p => Log(p._1,p._2)).toDF()
val result = logs.select("page","visitor").groupBy("page").count().distinct
result.foreach(println)

In fact, this is the output I get:
[PAG1,8]  // just the simple page count for every page
[PAG2,4]
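Those numbers are plain row counts: PAG1 appears 8 times in the input and PAG2 appears 4 times, so `groupBy("page").count()` counts rows per page, and the trailing `.distinct` only deduplicates the already-aggregated result rows, which changes nothing here. A minimal collections sketch of the two aggregations side by side (again using a hypothetical local `Seq` in place of the RDD):

```scala
object CountVsDistinct extends App {
  // Local stand-in for the (page, visitor) input pairs.
  val data = Seq(
    ("PAG1", "V1"), ("PAG1", "V1"), ("PAG2", "V1"), ("PAG2", "V2"),
    ("PAG2", "V1"), ("PAG1", "V1"), ("PAG1", "V2"), ("PAG1", "V1"),
    ("PAG1", "V2"), ("PAG1", "V1"), ("PAG2", "V2"), ("PAG1", "V3")
  )

  // What groupBy("page").count() computes: number of rows per page.
  val rowCounts = data.groupBy(_._1).map { case (p, rows) =>
    p -> rows.size                       // PAG1 -> 8, PAG2 -> 4
  }

  // What count(distinct visitor) computes: unique visitors per page.
  val uniqueVisitors = data.groupBy(_._1).map { case (p, rows) =>
    p -> rows.map(_._2).distinct.size    // PAG1 -> 3, PAG2 -> 2
  }

  println(rowCounts)
  println(uniqueVisitors)
}
```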

It's probably something silly, but I can't see it right now.

Thanks in advance!

FF

Best Answer

What you need is the DataFrame aggregate function countDistinct:

import sqlContext.implicits._
import org.apache.spark.sql.functions._

case class Log(page: String, visitor: String)

val logs = data.map(p => Log(p._1,p._2))
.toDF()

val result = logs.select("page","visitor")
// include 'page in agg: in Spark 1.3 groupBy.agg did not
// automatically retain the grouping column in the output
.groupBy('page)
.agg('page, countDistinct('visitor))

result.foreach(println)

Regarding count - Spark: how to translate count(distinct(value)) into the DataFrame API, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30218140/
