
apache-spark - Why is all the data in one partition after reduceByKey?


I have this simple Spark program and I want to know why all the data ends up in a single partition.

val l = List((30002,30000), (50006,50000), (80006,80000), 
(4,0), (60012,60000), (70006,70000),
(40006,40000), (30012,30000), (30000,30000),
(60018,60000), (30020,30000), (20010,20000),
(20014,20000), (90008,90000), (14,0), (90012,90000),
(50010,50000), (100008,100000), (80012,80000),
(20000,20000), (30010,30000), (20012,20000),
(90016,90000), (18,0), (12,0), (70016,70000),
(20,0), (80020,80000), (100016,100000), (70014,70000),
(60002,60000), (40000,40000), (60006,60000),
(80000,80000), (50008,50000), (60008,60000),
(10002,10000), (30014,30000), (70002,70000),
(40010,40000), (100010,100000), (40002,40000),
(20004,20000),
(10018,10000), (50018,50000), (70004,70000),
(90004,90000), (100004,100000), (20016,20000))

val l_rdd = sc.parallelize(l, 2)

// print each item and index of the partition it belongs to
l_rdd.mapPartitionsWithIndex((index, iter) => {
  iter.toList.map(x => (index, x)).iterator
}).collect.foreach(println)

// key by the second element of each tuple and collect the first elements into a list.
// alternatively you can use aggregateByKey
val l_reduced = l_rdd.map(x => {
  (x._2, List(x._1))
}).reduceByKey((a, b) => {b ::: a})

// print the reduced results along with its partition index
l_reduced.mapPartitionsWithIndex((index, iter) => {
  iter.toList.map(x => (index, x._1, x._2.size)).iterator
}).collect.foreach(println)

When you run this, you will see that the data (l_rdd) is spread across both partitions. But once I reduce, the resulting RDD (l_reduced) also has two partitions, yet all of the data sits in one partition (index 0) while the other is empty. This happens even when the data is large (several GB). Shouldn't l_reduced also be split across the two partitions?

Best Answer

val l_reduced = l_rdd.map(x => {
  (x._2, List(x._1))
}).reduceByKey((a, b) => {b ::: a})

Referring to the code snippet above, you are keying (and therefore partitioning) the RDD by its second field. Every number in the second field ends in 0 (they are all multiples of 10000), so every key is even.
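A quick check (a small sketch, run against the l list defined above) confirms that every key produced by the map is even:

scala> l.map(_._2).forall(_ % 2 == 0)   // every second element is a multiple of 10000, hence even
res0: Boolean = true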

When HashPartitioner is invoked, the partition number of a record is determined by the following function:
def getPartition(key: Any): Int = key match {
  case null => 0
  case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
}

Utils.nonNegativeMod is defined as follows:
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

Let's see what happens when we apply the two pieces of logic above to your input:
scala> l.map(_._2.hashCode % 2) // numPartitions = 2
res10: List[Int] = List(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

Therefore, all of your records end up in partition 0.
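You can confirm the same result directly with Spark's public HashPartitioner class (a small sketch; it only needs Spark on the classpath, e.g. inside spark-shell):

import org.apache.spark.HashPartitioner

val hp = new HashPartitioner(2)
// getPartition applies exactly the hashCode / nonNegativeMod logic shown above
l.map { case (_, key) => hp.getPartition(key) }.distinct   // List(0): every key maps to partition 0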

You can fix this by repartitioning:
val l_reduced = l_rdd.map(x => {
  (x._2, List(x._1))
}).reduceByKey((a, b) => {b ::: a}).repartition(2)

This gives:
(0,100000,4)
(0,10000,2)
(0,0,5)
(0,20000,6)
(0,60000,5)
(0,80000,4)
(1,50000,4)
(1,30000,6)
(1,90000,4)
(1,70000,5)
(1,40000,4)

Alternatively, you can write a custom partitioner.
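As an illustration, here is a minimal sketch of such a partitioner (the class name EvenSpreadPartitioner is made up for this example). It divides out the shared factor of 10000 before taking the modulus, so these particular keys spread across partitions instead of all hashing to 0:

import org.apache.spark.Partitioner

class EvenSpreadPartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts

  // same non-negative modulus trick as Utils.nonNegativeMod above
  private def nonNegativeMod(x: Int): Int = {
    val rawMod = x % numParts
    rawMod + (if (rawMod < 0) numParts else 0)
  }

  override def getPartition(key: Any): Int = key match {
    case null   => 0
    case k: Int => nonNegativeMod(k / 10000)   // strip the shared factor of 10000 first
    case other  => nonNegativeMod(other.hashCode)
  }
}

// reduceByKey accepts an explicit Partitioner, which also avoids the extra shuffle of repartition():
val l_reduced_custom = l_rdd.map(x => (x._2, List(x._1)))
  .reduceByKey(new EvenSpreadPartitioner(2), (a, b) => b ::: a)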

A similar question about "apache-spark - Why is all the data in one partition after reduceByKey?" can be found on Stack Overflow: https://stackoverflow.com/questions/42077477/
