
scala - Spark: Difference between collect(), take() and show() outputs after conversion to a DF


I am using Spark 1.5.

I have a column of 30 ids which I load as integers from a database:

val numsRDD = sqlContext
  .table(constants.SOURCE_DB + "." + IDS)
  .select("id")
  .distinct
  .map(row => row.getInt(0))

This is the output of numsRDD:
numsRDD.collect.foreach(println(_))

643761
30673603
30736590
30773400
30832624
31104189
31598495
31723487
32776244
32801792
32879386
32981901
33469224
34213505
34709608
37136455
37260344
37471301
37573190
37578690
37582274
37600896
37608984
37616677
37618105
37644500
37647770
37648497
37720353
37741608

Next, I want to produce all 3-combinations of those ids, save each combination as a tuple of the form <tripletID: String, triplet: Array[Int]>, and convert the result into a DataFrame. I do this as follows:
// |combinationsDF| = 4060 combinations
val combinationsDF = sc
  .parallelize(numsRDD
    .collect
    .combinations(3)
    .toArray
    .map(row => row.sorted)
    .map(row => (
      List(row(0), row(1), row(2)).mkString(","),
      List(row(0), row(1), row(2)).toArray)))
  .toDF("tripletID", "triplet")
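The combination/key-building step can be checked outside Spark with a handful of made-up ids (a minimal plain-Scala sketch, not the actual database data):

```scala
object TripletSketch {
  def main(args: Array[String]): Unit = {
    // Made-up ids standing in for the 30 database ids
    val ids = Array(3, 1, 2, 5)

    // combinations(3) emits each 3-element subset exactly once; sorting
    // each subset makes the comma-joined key deterministic
    val rows = ids
      .combinations(3)
      .toArray
      .map(_.sorted)
      .map(t => (t.mkString(","), t))

    rows.foreach { case (key, arr) =>
      println(s"$key -> ${arr.mkString("[", ", ", "]")}")
    }
  }
}
```

With 4 ids this prints 4 rows ("1,2,3 -> [1, 2, 3]" first); with 30 ids it would be C(30, 3) = 4060 rows, matching the comment in the question's code.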

Once that is done, I try to print some of combinationsDF's contents just to make sure everything is as it should be. So I try this:
combinationsDF.show

which returns:
+--------------------+--------------------+
| tripletID| triplet|
+--------------------+--------------------+
|,37136455,3758227...|[32776244, 371364...|
|,37136455,3761667...|[32776244, 371364...|
|,32776244,3713645...|[31723487, 327762...|
|,37136455,3757869...|[32776244, 371364...|
|,32776244,3713645...|[31598495, 327762...|
|,37136455,3760089...|[32776244, 371364...|
|,37136455,3764849...|[32776244, 371364...|
|,37136455,3764450...|[32776244, 371364...|
|,37136455,3747130...|[32776244, 371364...|
|,32981901,3713645...|[32776244, 329819...|
|,37136455,3761810...|[32776244, 371364...|
|,34213505,3713645...|[32776244, 342135...|
|,37136455,3726034...|[32776244, 371364...|
|,37136455,3772035...|[32776244, 371364...|
|2776244,37136455...|[643761, 32776244...|
|,37136455,3764777...|[32776244, 371364...|
|,37136455,3760898...|[32776244, 371364...|
|,32879386,3713645...|[32776244, 328793...|
|,32776244,3713645...|[31104189, 327762...|
|,32776244,3713645...|[30736590, 327762...|
+--------------------+--------------------+
only showing top 20 rows

Clearly, the first element of each tripletID is missing. So, to be 100% sure, I use take(20) as follows:
combinationsDF.take(20).foreach(println(_))

It returns a more detailed representation, as follows:
[,37136455,37582274,WrappedArray(32776244, 37136455, 37582274)]
[,37136455,37616677,WrappedArray(32776244, 37136455, 37616677)]
[,32776244,37136455,WrappedArray(31723487, 32776244, 37136455)]
[,37136455,37578690,WrappedArray(32776244, 37136455, 37578690)]
[,32776244,37136455,WrappedArray(31598495, 32776244, 37136455)]
[,37136455,37600896,WrappedArray(32776244, 37136455, 37600896)]
[,37136455,37648497,WrappedArray(32776244, 37136455, 37648497)]
[,37136455,37644500,WrappedArray(32776244, 37136455, 37644500)]
[,37136455,37471301,WrappedArray(32776244, 37136455, 37471301)]
[,32981901,37136455,WrappedArray(32776244, 32981901, 37136455)]
[,37136455,37618105,WrappedArray(32776244, 37136455, 37618105)]
[,34213505,37136455,WrappedArray(32776244, 34213505, 37136455)]
[,37136455,37260344,WrappedArray(32776244, 37136455, 37260344)]
[,37136455,37720353,WrappedArray(32776244, 37136455, 37720353)]
[2776244,37136455,WrappedArray(643761, 32776244, 37136455)]
[,37136455,37647770,WrappedArray(32776244, 37136455, 37647770)]
[,37136455,37608984,WrappedArray(32776244, 37136455, 37608984)]
[,32879386,37136455,WrappedArray(32776244, 32879386, 37136455)]
[,32776244,37136455,WrappedArray(31104189, 32776244, 37136455)]
[,32776244,37136455,WrappedArray(30736590, 32776244, 37136455)]

So now I am certain that the first id of tripletID is, for whatever reason, being dropped. But if I try to use collect instead of take(20):
combinationsDF.collect.foreach(println(_))

everything is back to normal (!!!):
[32776244,37136455,37582274,WrappedArray(32776244, 37136455, 37582274)]
[32776244,37136455,37616677,WrappedArray(32776244, 37136455, 37616677)]
[31723487,32776244,37136455,WrappedArray(31723487, 32776244, 37136455)]
[32776244,37136455,37578690,WrappedArray(32776244, 37136455, 37578690)]
[31598495,32776244,37136455,WrappedArray(31598495, 32776244, 37136455)]
[32776244,37136455,37600896,WrappedArray(32776244, 37136455, 37600896)]
[32776244,37136455,37648497,WrappedArray(32776244, 37136455, 37648497)]
[32776244,37136455,37644500,WrappedArray(32776244, 37136455, 37644500)]
[32776244,37136455,37471301,WrappedArray(32776244, 37136455, 37471301)]
[32776244,32981901,37136455,WrappedArray(32776244, 32981901, 37136455)]
[32776244,37136455,37618105,WrappedArray(32776244, 37136455, 37618105)]
[32776244,34213505,37136455,WrappedArray(32776244, 34213505, 37136455)]
[32776244,37136455,37260344,WrappedArray(32776244, 37136455, 37260344)]
[32776244,37136455,37720353,WrappedArray(32776244, 37136455, 37720353)]
[643761,32776244,37136455,WrappedArray(643761, 32776244, 37136455)]
[32776244,37136455,37647770,WrappedArray(32776244, 37136455, 37647770)]
[32776244,37136455,37608984,WrappedArray(32776244, 37136455, 37608984)]
[32776244,32879386,37136455,WrappedArray(32776244, 32879386, 37136455)]
[31104189,32776244,37136455,WrappedArray(31104189, 32776244, 37136455)]
[30736590,32776244,37136455,WrappedArray(30736590, 32776244, 37136455)]
...

1. I have exhaustively checked the steps before parallelize; the array of combinations is fine.
2. I have also printed the output right after parallelize is applied; again, everything is fine.
3. The problem appears to be related to the conversion of numsRDD to a DF, and despite my best efforts I cannot deal with it.
4. I have also been unable to reproduce the problem with mock data using the same code snippet.

First: what is causing this problem?
Second: how do I fix it?

Best Answer

I would check your original numsRDD; it looks like you may have an empty string or null value in there. This works for me:

scala> val numsRDD = sc.parallelize(0 to 30)
numsRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> :pa
// Entering paste mode (ctrl-D to finish)

val combinationsDF = sc
  .parallelize(numsRDD
    .collect
    .combinations(3)
    .toArray
    .map(row => row.sorted)
    .map(row => (
      List(row(0), row(1), row(2)).mkString(","),
      List(row(0), row(1), row(2)).toArray)))
  .toDF("tripletID", "triplet")

// Exiting paste mode, now interpreting.

combinationsDF: org.apache.spark.sql.DataFrame = [tripletID: string, triplet: array<int>]

scala> combinationsDF.show
+---------+----------+
|tripletID| triplet|
+---------+----------+
| 0,1,2| [0, 1, 2]|
| 0,1,3| [0, 1, 3]|
| 0,1,4| [0, 1, 4]|
| 0,1,5| [0, 1, 5]|
| 0,1,6| [0, 1, 6]|
| 0,1,7| [0, 1, 7]|
| 0,1,8| [0, 1, 8]|
| 0,1,9| [0, 1, 9]|
| 0,1,10|[0, 1, 10]|
| 0,1,11|[0, 1, 11]|
| 0,1,12|[0, 1, 12]|
| 0,1,13|[0, 1, 13]|
| 0,1,14|[0, 1, 14]|
| 0,1,15|[0, 1, 15]|
| 0,1,16|[0, 1, 16]|
| 0,1,17|[0, 1, 17]|
| 0,1,18|[0, 1, 18]|
| 0,1,19|[0, 1, 19]|
| 0,1,20|[0, 1, 20]|
| 0,1,21|[0, 1, 21]|
+---------+----------+
only showing top 20 rows
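To see why a single empty value would produce exactly the leading-comma shape in the question's show() output, here is a tiny sketch (the blank first element is hypothetical, standing in for a bad id):

```scala
object LeadingCommaSketch {
  def main(args: Array[String]): Unit = {
    // If the first element stringifies to "", mkString(",") yields a key
    // that starts with a bare comma -- the shape seen in the question
    val good = List("32776244", "37136455", "37582274").mkString(",")
    val bad  = List("", "37136455", "37582274").mkString(",")
    println(good) // 32776244,37136455,37582274
    println(bad)  // ,37136455,37582274
  }
}
```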

The only other thing I can think of is that mkString doesn't work the way you expect. Try this string interpolation (which also avoids recreating the List):
val combinationsDF = sc
  .parallelize(numsRDD
    .collect
    .combinations(3)
    .toArray
    .map(row => row.sorted)
    // elements here are 3-element Arrays, so match on Array, not List
    .map { case Array(a, b, c) => (
      s"$a,$b,$c",
      Array(a, b, c)) })
  .toDF("tripletID", "triplet")

scala> combinationsDF.show
+---------+----------+
|tripletID| triplet|
+---------+----------+
| 0,1,2| [0, 1, 2]|
| 0,1,3| [0, 1, 3]|
| 0,1,4| [0, 1, 4]|
| 0,1,5| [0, 1, 5]|
| 0,1,6| [0, 1, 6]|
| 0,1,7| [0, 1, 7]|
| 0,1,8| [0, 1, 8]|
| 0,1,9| [0, 1, 9]|
| 0,1,10|[0, 1, 10]|
| 0,1,11|[0, 1, 11]|
| 0,1,12|[0, 1, 12]|
| 0,1,13|[0, 1, 13]|
| 0,1,14|[0, 1, 14]|
| 0,1,15|[0, 1, 15]|
| 0,1,16|[0, 1, 16]|
| 0,1,17|[0, 1, 17]|
| 0,1,18|[0, 1, 18]|
| 0,1,19|[0, 1, 19]|
| 0,1,20|[0, 1, 20]|
| 0,1,21|[0, 1, 21]|
+---------+----------+
only showing top 20 rows
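The pattern-matching variant can also be exercised without a cluster; a sketch with toy ids (note the elements are Arrays, which is why the match uses Array(a, b, c)):

```scala
object InterpolationSketch {
  def main(args: Array[String]): Unit = {
    val triples = Array(0, 1, 2, 3)
      .combinations(3)
      .toArray
      .map(_.sorted)
      // Destructure each 3-element Array and build the key via string
      // interpolation instead of mkString
      .map { case Array(a, b, c) => (s"$a,$b,$c", Array(a, b, c)) }

    triples.foreach { case (key, arr) =>
      println(s"$key -> ${arr.mkString("[", ", ", "]")}")
    }
  }
}
```

Destructuring once into a, b, c avoids building two throwaway List instances per row, and the interpolated key cannot silently swallow an element the way a bad mkString input can.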

Regarding "scala - Spark: Difference between collect(), take() and show() outputs after conversion to a DF", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41000273/
