gpt4 book ai didi

scala - 通过一些列值拆分 Spark 数据帧,然后独立于其他数据帧旋转每个生成的数据帧

转载 作者:行者123 更新时间:2023-12-03 12:50:20 27 4
gpt4 key购买 nike

我正在尝试 拆分 根据一个(或多个)列和 的值的数据框旋转每个生成的数据框 独立于其余部分。即,给定一个输入数据框:

val inputDF = Seq(("tom","20","a","street a","germany"),("jimmy","30","b","street b","germany"),
("lola","50","c","street c","argentina"), ("maria","60","d","street d","argentina"), ("joe","70","e","street e","argentina")
.toDF("name","age","company","address","country")

//+-----+---+-------+--------+---------+
//| name|age|company| address| country|
//+-----+---+-------+--------+---------+
//| tom| 20| a|street a| germany|
//|jimmy| 30| b|street b| germany|
//| lola| 40| c|street c|argentina|
//|maria| 50| d|street d|argentina|
//| joe| 60| e|street e|argentina|
//+-----+---+-------+--------+---------+

我需要按“国家”列的不同值拆分记录。对于输入数据帧,拆分应该产生:
//+-----+---+-------+--------+---------+
//| name|age|company| address| country|
//+-----+---+-------+--------+---------+
//| tom| 20| a|street a| germany|
//|jimmy| 30| b|street b| germany|
//+-----+---+-------+--------+---------+

//+-----+---+-------+--------+---------+
//| name|age|company| address| country|
//+-----+---+-------+--------+---------+
//| lola| 40| c|street c|argentina|
//|maria| 50| d|street d|argentina|
//| joe| 60| e|street e|argentina|
//+-----+---+-------+--------+---------+

而且我还必须轮换每个数据框下的“姓名”和“年龄”列,这样每个人都有不同的公司和地址,同时保持其余列的完整性。所需的输出数据框如下所示:
//+-----+---+-------+--------+---------+
//| name|age|company| address| country|
//+-----+---+-------+--------+---------+
//|jimmy| 30| a|street a| germany|
//| tom| 20| b|street b| germany|
//| joe| 60| c|street c|argentina|
//| lola| 40| d|street d|argentina|
//|maria| 50| e|street e|argentina|
//+-----+---+-------+--------+---------+

最后的行顺序无关紧要

我的第一次尝试(在 Spark-shell 上运行)

我尝试为每一行分配一个唯一的 id,然后打乱所需的列(名称和年龄),并使用辅助 id 值将重新排序的数据帧与数据帧的其余部分连接起来。这里的主要问题是使用 collect() 可能对大数据帧很危险,而 repartition(1) 几乎与分布式计算和 Spark 背道而驰(它已用于在使用不同数量的分区压缩 rdds 时避免异常) .
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, rand}
import org.apache.spark.sql.types.LongType

// column(s) names to split the input dataframe
val colToSplit = Seq("country")
val splitCols = colToSplit.map(col)

// list of columns names to be rotated (together)
val colsToRotate = Seq("name", "age")
val rotateCols = colsToRotate.map(col) :+ col(auxCol)

// add an auxiliar column for joining the dataframe in the final step
val auxCol = "aux"
val dfWithID = inputDF.withColumn(auxCol, monotonically_increasing_id())

val splitValuesSchema = dfWithID.select(splitCols: _*).schema

// create one dataframe for each value of the splitting column
val splitValuesDFs = dfWithID.select(splitCols: _*).distinct().collect()
.map(row => spark.sparkContext.makeRDD(List(row)))
.map(rdd => spark.createDataFrame(rdd, splitValuesSchema))

val rotateIDCols = Array(auxCol) ++ colsToRotate

// join the split values with their records (DFs with id + colsToRotate)
val splittedDFs = splitValuesDFs
.map(df => df.join(dfWithID, colToSplit).selectExpr(rotateIDCols: _*))

// random reorder the auxiliar id column (DFs with random ids)
val randIdDFs = splittedDFs
.map(df => df.select(auxCol).orderBy(rand()).toDF())

// get rdds with random ids
val randIdRdds = randIdDFs
.map(df => df.select(auxCol).rdd.map(row => row(0)))

// finally, zip and append the rdds with the random ids to the dataframes created by
// splitting the main df to obtain the rotated dataframe with all the data
val tuples = (splittedDFs, randIdRdds).zipped
val newRdds = tuples
.map((df: DataFrame, rdd) => df.rdd.repartition(1).zip(rdd.repartition(1))
.map(row => Row.fromSeq(row._1.toSeq ++ Seq(row._2))))

val tuples2 = (splittedDFs, newRdds).zipped
val rotatedDF = tuples2.map((df: DataFrame, rdd) => spark
.createDataFrame(rdd, df.schema.add("rotated_id", LongType)).drop(auxCol))
.reduce(_ union _).withColumnRenamed("rotated_id", "column2join")

// get the rest of the columns
val noRotateCols = dfWithID.columns.diff(colsToRotate).map(col)
val noRotatedDF = dfWithID.select(noRotateCols: _*)
.withColumnRenamed(auxCol, "column2join")

// join both dataframes
val outputDF = noRotatedDF.join(rotatedDF, "column2join")
.select(inputDF.columns.map(col): _*) // to keep the initial columns order

显示输出数据帧会产生与上述预期输出相似的结果(它基本上取决于 rand() 函数顺序)

我想尽可能避免使用收集和重新分区,并获得更实用的解决方案。

欢迎任何意见或想法!

最佳答案

我一直试图通过尽可能多地删除性能不佳的调用(重新分区和一些收集)来找到更好、更清晰、更实用的解决方案。我添加了一种辅助方法来索引数据帧行,以便能够连接不相关的部分(不能由任何公共(public)列连接的列或 dfs)。这是我目前的开发,也去掉了 rdds 和 dataframes 之间的多重转换,看起来更具可读性和可理解性。

我希望这可以帮助有同样问题的人。

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, rand}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// auxiliar method to index row in dataframes
def addRowIndex(df: DataFrame) = spark.createDataFrame(
df.rdd.zipWithIndex.map { case (row, index) => Row.fromSeq(row.toSeq :+ index) },
StructType(df.schema.fields :+ StructField("index", LongType, false))
)

// column(s) names to split the input dataframe
val colToSplit = Seq("country")
val splitCols = colToSplit.map(col)

// list of columns names to be rotated (together)
val colsToRotate = Seq("name", "age")

// add an auxiliar column for joining the dataframe in the final step
val auxCol = "aux"
val dfWithID = inputDF.withColumn(auxCol, monotonically_increasing_id())

val rotateIDCols = (Array(auxCol) ++ colsToRotate).map(col)

// get an array of dfs with the different values of the splitter column(s)
// --assuming there will not be too much different values in the splitter column--
val filterValues = dfWithID.select(splitCols: _*).distinct().collect()

// retrieve the different dfs according to the splitter values
val splitDfs = filterValues.map(filterRow => filterRow.getValuesMap(colToSplit)
.foldLeft(dfWithID) {
(df, filterField) =>
df.filter(col(filterField._1) === filterField._2)
.select(rotateIDCols: _*)
})

// get and random reorder the aux id column for each dataframe
val randIdDfs = splitDfs.map(_.select(auxCol).orderBy(rand()).toDF())

// remove aux column for each dataframe
val splitWithoutIdDfs = splitDfs.map(_.drop(auxCol))

val dfsTuples = splitWithoutIdDfs.zip(randIdDfs)

// index row of dfs with columns to rotate and dfs with random ids
val indexedDfsTuples = dfsTuples.map {
case (colsDf, idsDf) => (addRowIndex(colsDf), addRowIndex(idsDf))
}

// join reordered-ids dfs and cols to rotate dataframes by the index
val reorderedDfs = indexedDfsTuples.map {
case (df1, df2) => df1.join(df2, Seq("index"))
.drop("index").withColumnRenamed(auxCol, "column2join")
}

// union both dataframes to create the rotated df
reorderedDfs.tail.foldLeft(reorderedDfs.head) { (acc, df) => acc.union(df) }

// get the rest of the columns to get the part of the main df which does not change
val noRotateCols = dfWithID.columns.diff(colsToRotate).map(col)
val noRotatedDF = dfWithID.select(noRotateCols: _*)
.withColumnRenamed(auxCol, "column2join")

// join the rotated and no rotated dataframes
val outputDF = noRotatedDF.join(rotatedDF, "column2join")
.select(inputDF.columns.map(col): _*) // to keep the initial columns order

关于scala - 通过一些列值拆分 Spark 数据帧,然后独立于其他数据帧旋转每个生成的数据帧,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/51161899/

27 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com