
python - Finding changes between two dataframes in PySpark

Reposted. Author: 行者123. Updated: 2023-12-01 02:18:20

I have two dataframes, say dfA and dfB.

dfA:
IdCol | Col2 | Col3
id1 | val2 | val3

dfB:
IdCol | Col2 | Col3
id1 | val2 | val4

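The two dataframes are joined on IdCol. I want to compare them row by row and keep only the columns that differ, together with their values in each dataframe. For example, from the two dataframes above I want this result: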

dfChanges:
RowId | Col | dfA_value | dfB_value |
id1 | Col3 | val3 | val4 |

I am not sure how to approach this. Can someone point me in the right direction? Thanks in advance.

Edit

My attempt is below, but it is neither very clear nor very performant. Is there a better way?

from pyspark.sql.functions import col, lit

dfChanges = None

# For every column except the id column
for colName in dfA.columns[1:]:

    # Select the id column and the target column from both datasets
    # and subtract to find the rows that differ
    changedRows = dfA.select("IdCol", colName).subtract(dfB.select("IdCol", colName))

    # Join with dfB on the id to take the target column's value from there
    temp = changedRows.join(
        dfB.select(col("IdCol"), col(colName).alias("dfB_value")), "IdCol", "inner")

    # Rename the columns
    temp = temp.withColumnRenamed(colName, "dfA_value")
    temp = temp.withColumn("Col", lit(colName))

    # Append to a single dataframe
    if dfChanges is None:
        dfChanges = temp
    else:
        dfChanges = dfChanges.union(temp)

Best answer

Join the two dataframes by id:

dfA = spark.createDataFrame(
    [("id1", "val2", "val3")], ("Idcol1", "Col2", "Col3")
)

dfB = spark.createDataFrame(
    [("id1", "val2", "val4")], ("Idcol1", "Col2", "Col3")
)

dfAB = dfA.alias("dfA").join(dfB.alias("dfB"), "Idcol1")

Reshape:

from pyspark.sql.functions import col, struct

ids = ["Idcol1"]

vals = [struct(
    col("dfA.{}".format(c)).alias("dfA_value"),
    col("dfB.{}".format(c)).alias("dfB_value")
).alias(c) for c in dfA.columns if c not in ids]

Melt (defined here):

(melt(dfAB.select(ids + vals), ids, [c for c in dfA.columns if c not in ids])
    .where(col("value.dfA_value") != col("value.dfB_value"))
    .select(ids + ["variable", "value.dfA_value", "value.dfB_value"])
    .show())

+------+--------+---------+---------+
|Idcol1|variable|dfA_value|dfB_value|
+------+--------+---------+---------+
|   id1|    Col3|     val3|     val4|
+------+--------+---------+---------+

Regarding "python - Finding changes between two dataframes in PySpark", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48165234/
