
scala - How to get the top-k frequent words in Spark without sorting?

Reposted · Author: 行者123 · Updated: 2023-12-01 22:38:56

In Spark, we can easily use map-reduce to count how many times each word occurs, and then use a sort to get the top-k frequent words:
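For context, the `wordCount` used below is assumed to be a pair RDD of `(word, count)` built with the usual map-reduce pattern; a minimal sketch (the Spark calls are shown as comments since they need a SparkContext, with a runnable plain-Scala equivalent underneath):

```scala
// In Spark, wordCount would typically be built as (not run here):
//   val wordCount = sc.textFile("input.txt")
//     .flatMap(_.split("\\s+"))
//     .map((_, 1))
//     .reduceByKey(_ + _)

// The same counting logic on a plain Scala collection (Scala 2.13+):
val words = "to be or not to be".split("\\s+").toList
val wordCount = words.groupMapReduce(identity)(_ => 1)(_ + _)
// wordCount is a Map[String, Int], e.g. "to" -> 2, "be" -> 2, "or" -> 1
```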

// Sort locally inside each partition and keep only its top-k results;
// no network communication.

val partialTopK = wordCount.mapPartitions(
  it => it.toArray.sortBy(-_._2).take(10).iterator,
  preservesPartitioning = true)


// Collect local top-k results, faster than the naive solution

val collectedTopK = partialTopK.collect
collectedTopK.size


// Compute global top-k at master,
// no communication, everything done on the master node

val topK = collectedTopK.sortBy(-_._2).take(10)

But I wonder whether there is a better solution that avoids sorting entirely?

Best Answer

I think you want takeOrdered:

Returns the first k (smallest) elements from this RDD as defined by the specified implicit Ordering[T] and maintains the ordering.

and top:

Returns the top k (largest) elements from this RDD as defined by the specified implicit Ordering[T].
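A minimal sketch of how these would replace the manual pipeline above (the Spark one-liners are shown as comments since they need a live RDD; the bounded-heap function underneath is a runnable plain-Scala illustration of the same O(n log k) idea, which avoids sorting the whole dataset):

```scala
import scala.collection.mutable

// On an RDD[(String, Int)] named wordCount, the one-liners would be (not run here):
//   val topK = wordCount.top(10)(Ordering.by(_._2))           // 10 largest counts
//   val topK = wordCount.takeOrdered(10)(Ordering.by(-_._2))  // same result
//
// The underlying idea: keep a bounded min-heap of size k while scanning,
// so only O(n log k) work and no full sort. The same on a plain iterator:
def topK[T](xs: Iterator[T], k: Int)(implicit ord: Ordering[T]): List[T] = {
  // Min-heap of size k: the smallest of the current top-k sits at the head.
  val heap = mutable.PriorityQueue.empty[T](ord.reverse)
  xs.foreach { x =>
    if (heap.size < k) heap.enqueue(x)
    else if (ord.gt(x, heap.head)) { heap.dequeue(); heap.enqueue(x) }
  }
  heap.dequeueAll.toList.reverse // largest first
}

val counts = List(("a", 5), ("b", 2), ("c", 9), ("d", 7), ("e", 1))
val top3 = topK(counts.iterator, 3)(Ordering.by(_._2))
// top3 == List(("c", 9), ("d", 7), ("a", 5))
```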

There are several other questions/answers that appear to be at least partial duplicates.

Regarding "scala - How to get the top-k frequent words in Spark without sorting?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29310687/
