Reshaping a Spark DataFrame from long to wide on large data sets

Reposted by: 行者123, updated: 2023-12-02 05:38:06

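I am trying to reshape my data frame from long to wide using the Spark DataFrame API. The data set is a collection of questions and answers from students. It is a huge data set, and both Q (question) and A (answer) range roughly from 1 to 50000. I would like to collect all possible Q*A pairs and use them to build columns. If a student answered 1 to question 1, we assign the value 1 to column 1_1. Otherwise, we give it a 0. The data set has been de-duplicated on S_ID, Q, A.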

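In R, I can simply use dcast from the reshape2 library, but I don't know how to do this with Spark. I found a pivot solution at the link below, but it requires a fixed number of distinct Q*A pairs. http://rajasoftware.net/index.php/database/91446/scala-apache-spark-pivot-dataframes-pivot-spark-dataframe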

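I also tried concatenating Q and A with a user-defined function and applying crosstab. However, I got the following error from the console, even though so far I have only tested my code on a sample data file: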

The maximum limit of 1e6 pairs have been collected, which may not be all of the pairs.  
Please try reducing the amount of distinct items in your columns.

Raw data:

S_ID, Q, A
1, 1, 1
1, 2, 2
1, 3, 3
2, 1, 1
2, 2, 3
2, 3, 4
2, 4, 5

=> After the long-to-wide transformation:

S_ID, QA_1_1, QA_2_2, QA_3_3, QA_2_3, QA_3_4, QA_4_5
1, 1, 1, 1, 0, 0, 0
2, 1, 0, 0, 1, 1, 1

R code.  
library(dplyr); library(reshape2);  
df1 <- df %>% group_by(S_ID, Q, A) %>% filter(row_number()==1) %>% mutate(temp=1)  
df1 %>% dcast(S_ID ~ Q + A, value.var="temp", fill=0)  

Spark code.
import org.apache.spark.sql.functions.udf
val fnConcatenate = udf((x: String, y: String) => "QA_" + x + "_" + y)
val df1 = df.distinct.withColumn("QA", fnConcatenate($"Q", $"A"))
val df2 = df1.stat.crosstab("S_ID", "QA")

Any ideas would be greatly appreciated.

Best answer

What you are trying to do here is flawed by design, for two reasons:

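1. You replace a sparse data set with a dense one. This is expensive in both memory and computation, and it is almost never a good idea when you have a large data set.
2. You limit the ability to process data locally. Simplifying a little, a Spark DataFrame is just a wrapper around RDD[Row]. This means that the larger the row, the fewer rows fit in a single partition, so operations such as aggregations become much more expensive and require more network traffic.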
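
To put rough numbers on the first point: the question says Q and A each range up to about 50000, so in the worst case (assuming every Q*A combination actually occurs, which the question does not claim) the wide table would have 2.5 billion columns. The sketch below is plain Scala arithmetic only. `studentAnswers` and the 12-bytes-per-entry figure are illustrative assumptions, not measurements of how Spark actually stores rows.

```scala
// Back-of-the-envelope sizes. All names here are hypothetical helpers.
val maxQ = 50000L                     // upper bound on question ids (from the question)
val maxA = 50000L                     // upper bound on answer ids (from the question)
val maxColumns = maxQ * maxA          // worst-case number of QA_* columns

val bytesPerDouble = 8L
val denseRowBytes = maxColumns * bytesPerDouble  // one fully dense wide row

val studentAnswers = 100L             // assumed: answers given by one student
// A sparse row needs roughly one (Int index, Double value) pair per non-zero entry.
val sparseRowBytes = studentAnswers * (4L + 8L)

println(s"worst-case columns: $maxColumns")
println(s"dense row bytes:    $denseRowBytes")
println(s"sparse row bytes:   $sparseRowBytes")
```

Under these assumptions a single dense row would take about 20 GB, while the sparse representation of the same student stays in the kilobyte range.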

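Wide tables are useful when you have proper columnar storage and can take advantage of things like efficient compression or aggregation. From a practical perspective, almost everything you can do with a wide table can be done with a long one using group / window functions.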
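
As a minimal sketch of that last claim, the snippet below answers two typical "wide table" questions directly on the long format with `groupBy`. It assumes Spark 2.x or later (it builds a local `SparkSession` rather than using `sqlContext`), and the names `longDF`, `studentsPerPair` and `answersPerStudent` are made up for illustration. It only covers grouping, not window functions, and relies on the data being de-duplicated on S_ID, Q, A as stated in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}

// Assumption: Spark 2.x+ running locally.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("long-format-sketch")
  .getOrCreate()
import spark.implicits._

// Sample rows from the question, kept in long format.
val longDF = Seq(
  (1, 1, 1), (1, 2, 2), (1, 3, 3),
  (2, 1, 1), (2, 2, 3), (2, 3, 4), (2, 4, 5)
).toDF("S_ID", "Q", "A")

// Wide-table question: "how many students have a 1 in column QA_q_a?"
// Long-table equivalent: count rows per (Q, A) pair.
val studentsPerPair = longDF
  .groupBy("Q", "A")
  .agg(count(lit(1)).as("n_students"))

// Wide-table question: "how many non-zero columns does each row have?"
// Long-table equivalent: count rows per student.
val answersPerStudent = longDF
  .groupBy("S_ID")
  .agg(count(lit(1)).as("n_answers"))

studentsPerPair.orderBy("Q", "A").show()
answersPerStudent.orderBy("S_ID").show()
```

Neither aggregation ever materializes a column per Q*A pair, so the row width stays at three columns regardless of how many distinct pairs exist.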

One thing you can try is to use sparse vectors to create a wide-like format:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.max
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.feature.StringIndexer
import sqlContext.implicits._

df.registerTempTable("df")
val dfComb = sqlContext.sql("SELECT s_id, CONCAT(Q, '\t', A) AS qa FROM df")

val indexer = new StringIndexer()
  .setInputCol("qa")
  .setOutputCol("idx")
  .fit(dfComb)

val indexed = indexer.transform(dfComb)

val n = indexed.agg(max("idx")).first.getDouble(0).toInt + 1

val wideLikeDF = indexed
  .select($"s_id", $"idx")
  .rdd
  .map{case Row(s_id: String, idx: Double) => (s_id, idx.toInt)}
  .groupByKey // This assumes no duplicates
  .mapValues(vals => Vectors.sparse(n, vals.map((_, 1.0)).toArray))
  .toDF("id", "qaVec")

The nice part is that you can easily convert it to an IndexedRowMatrix and, for example, compute the SVD:

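import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val mat = new IndexedRowMatrix(wideLikeDF.rdd.map {
  // Here we assume that s_id can be mapped directly to Long
  // If not it has to be indexed
  case Row(id: String, qaVec: SparseVector) => IndexedRow(id.toLong, qaVec)
})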

val svd = mat.computeSVD(3)

or convert it to a RowMatrix and get column statistics or compute principal components:

val colStats = mat.toRowMatrix.computeColumnSummaryStatistics()
val colSims = mat.toRowMatrix.columnSimilarities()
val pc = mat.toRowMatrix.computePrincipalComponents(3)

Edit:

In Spark 1.6.0+ you can use the pivot function.
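
A minimal sketch of what that could look like on the sample data from the question, assuming Spark 2.x or later (a local `SparkSession` is used here instead of `sqlContext`). The names `df`, `QA` and `wide` are illustrative. Note that the design concerns above still apply: pivot produces one real column per distinct Q*A pair, and as far as I know Spark caps the number of pivot values through `spark.sql.pivotMaxValues` (10000 by default), so this is likely only practical when the number of distinct pairs is small.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat_ws, count, lit}

// Assumption: Spark 2.x+ running locally.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("pivot-sketch")
  .getOrCreate()
import spark.implicits._

// Sample rows from the question.
val df = Seq(
  (1, 1, 1), (1, 2, 2), (1, 3, 3),
  (2, 1, 1), (2, 2, 3), (2, 3, 4), (2, 4, 5)
).toDF("S_ID", "Q", "A")

val wide = df
  // Build the column label, e.g. "QA_2_3" for Q = 2, A = 3.
  .withColumn("QA", concat_ws("_", lit("QA"), $"Q".cast("string"), $"A".cast("string")))
  .groupBy("S_ID")
  // One output column per distinct QA label.
  .pivot("QA")
  // Data is de-duplicated on (S_ID, Q, A), so each count is 1 or missing.
  .agg(count(lit(1)))
  // Missing combinations come back as null; turn them into 0.
  .na.fill(0)

wide.orderBy("S_ID").show()
```

If the set of labels is known up front, passing it as the second argument to `pivot` avoids the extra pass Spark otherwise needs to collect the distinct values.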

Regarding reshaping a Spark DataFrame from long to wide on large data sets, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31979256/
