
apache-spark - Randomly shuffling a column in a Spark RDD or DataFrame

Reposted. Author: 行者123. Updated: 2023-12-04 05:47:06

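Is there any way to shuffle a column of an RDD or DataFrame so that the entries in that column appear in random order? I am not sure which APIs I could use to do this.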

Best answer

How about selecting the column to shuffle, ordering it with orderBy(rand()), and then zipping it back onto the existing DataFrame by index?

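import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, rand}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Assumes an existing SparkSession named `spark` (as in spark-shell).
// Appends a positional "_index" column to a DataFrame.
def addIndex(df: DataFrame): DataFrame = spark.createDataFrame(
  // Add the index as an extra field on every row
  df.rdd.zipWithIndex.map { case (r, i) => Row.fromSeq(r.toSeq :+ i) },
  // Extend the schema with the new column
  StructType(df.schema.fields :+ StructField("_index", LongType, nullable = false))
)

case class Entry(name: String, salary: Double)

val r1 = Entry("Max", 2001.21)
val r2 = Entry("Zhang", 3111.32)
val r3 = Entry("Bob", 1919.21)
val r4 = Entry("Paul", 3001.5)

val df = addIndex(spark.createDataFrame(Seq(r1, r2, r3, r4)))
// Select the column, put it in random order, then index the shuffled rows
val df_shuffled = addIndex(df
  .select(col("salary").as("salary_shuffled"))
  .orderBy(rand()))

// Join on the index to pair each original row with a shuffled value
df.join(df_shuffled, Seq("_index"))
  .drop("_index")
  .show(false)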

+-----+-------+---------------+
|name |salary |salary_shuffled|
+-----+-------+---------------+
|Max  |2001.21|3001.5         |
|Zhang|3111.32|3111.32        |
|Paul |3001.5 |2001.21        |
|Bob  |1919.21|1919.21        |
+-----+-------+---------------+
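
The same index-shuffle-join idea can be sketched with plain Scala collections, which may make the three steps easier to follow without a Spark session. This is only an illustration, not the answer's code: the names Person, shuffleSalaries, and the seed parameter are hypothetical and were added here so the result is reproducible.

```scala
import scala.util.Random

object ShuffleColumnSketch {
  // Hypothetical record type for illustration; mirrors Entry from the answer.
  final case class Person(name: String, salary: Double)

  // Shuffles the salary "column" while leaving the other fields in place.
  // `seed` is an assumption added here to make runs reproducible.
  def shuffleSalaries(rows: Seq[Person], seed: Long): Seq[(Person, Double)] = {
    val rng = new Random(seed)
    // Step 1: attach a positional index to every original row.
    val indexedRows = rows.zipWithIndex.map { case (row, i) => i -> row }
    // Step 2: extract the column, shuffle it, and index the shuffled values.
    val shuffledByIndex = rng
      .shuffle(rows.map(_.salary))
      .zipWithIndex
      .map { case (salary, i) => i -> salary }
      .toMap
    // Step 3: join the two sides on the index.
    indexedRows.map { case (i, row) => (row, shuffledByIndex(i)) }
  }

  def main(args: Array[String]): Unit = {
    val people = Seq(
      Person("Max", 2001.21),
      Person("Zhang", 3111.32),
      Person("Bob", 1919.21),
      Person("Paul", 3001.5)
    )
    shuffleSalaries(people, seed = 42L).foreach { case (p, shuffled) =>
      println(s"${p.name}\t${p.salary}\t$shuffled")
    }
  }
}
```

As in the Spark version, every original value appears exactly once in the shuffled column; only its position changes.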

Regarding "apache-spark - Randomly shuffling a column in a Spark RDD or DataFrame", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37287543/
