python - Comparing two DataFrames in Pyspark

Reposted · Author: 行者123 · Updated: 2023-12-04 14:31:17

I am trying to compare two DataFrames that have the same number of columns (4 columns), with id as the key column in both DataFrames.

# header=True is assumed so the column names (id, name, sal, Address) come from the files
df1 = spark.read.csv("/path/to/data1.csv", header=True)
df2 = spark.read.csv("/path/to/data2.csv", header=True)

Now I want to append a new column to DF2, column_names, which is the list of columns whose values differ from df1:
df2.withColumn("column_names",udf())

DF1:
+---+-----+----+-------+
| id| name| sal|Address|
+---+-----+----+-------+
|  1|  ABC|5000|     US|
|  2|  DEF|4000|     UK|
|  3|  GHI|3000|    JPN|
|  4|  JKL|4500|    CHN|
+---+-----+----+-------+

DF2:
+---+-----+----+-------+
| id| name| sal|Address|
+---+-----+----+-------+
|  1|  ABC|5000|     US|
|  2|  DEF|4000|    CAN|
|  3|  GHI|3500|    JPN|
|  4|JKL_M|4800|    CHN|
+---+-----+----+-------+

Now I want DF3 to be:

DF3:
+---+-----+----+-------+------------+
| id| name| sal|Address|column_names|
+---+-----+----+-------+------------+
|  1|  ABC|5000|     US|          []|
|  2|  DEF|4000|    CAN|   [Address]|
|  3|  GHI|3500|    JPN|       [sal]|
|  4|JKL_M|4800|    CHN| [name, sal]|
+---+-----+----+-------+------------+

I saw this question, How to compare two dataframe and print columns that are different in scala. I tried it, but the result was different.

I was considering using a UDF: pass a row from each DataFrame to it, compare column by column, and return the list of differing columns. However, that would require both DataFrames to be in sorted order, so that the rows with the same id are sent to the UDF together. Sorting is an expensive operation here. Is there a better solution?

Best Answer

Assuming we can join the two datasets on id, I don't think a UDF is needed. This can be solved with an inner join and functions such as array and array_remove.
First, let's create the two datasets:

df1 = spark.createDataFrame([
    [1, "ABC", 5000, "US"],
    [2, "DEF", 4000, "UK"],
    [3, "GHI", 3000, "JPN"],
    [4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])

df2 = spark.createDataFrame([
    [1, "ABC", 5000, "US"],
    [2, "DEF", 4000, "CAN"],
    [3, "GHI", 3500, "JPN"],
    [4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])
First we inner-join the two datasets, then we generate the condition df1[col] != df2[col] for every column except id. When a column's values are not equal we return the column name, otherwise an empty string. The list of conditions becomes the items of an array, and finally we remove the empty items from it:
from pyspark.sql.functions import col, array, when, array_remove, lit

# get conditions for all columns except id
conditions_ = [when(df1[c] != df2[c], lit(c)).otherwise("") for c in df1.columns if c != 'id']

select_expr = [
    col("id"),
    *[df2[c] for c in df2.columns if c != 'id'],
    array_remove(array(*conditions_), "").alias("column_names")
]

df1.join(df2, "id").select(*select_expr).show()

# +---+-----+----+-------+------------+
# | id| name| sal|Address|column_names|
# +---+-----+----+-------+------------+
# | 1| ABC|5000| US| []|
# | 3| GHI|3500| JPN| [sal]|
# | 2| DEF|4000| CAN| [Address]|
# | 4|JKL_M|4800| CHN| [name, sal]|
# +---+-----+----+-------+------------+
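To see the per-row effect of the when/array/array_remove expression outside Spark, here is a small plain-Python sketch. The helper diff_columns is hypothetical (not part of the answer's code); it just mirrors the same logic on two dict rows:

```python
def diff_columns(row1, row2, key="id"):
    """Return the names of columns whose values differ between two rows.

    Mirrors the answer's Spark expression: for each non-key column emit the
    column name when the values differ (else an empty string), then drop
    the empty entries -- the same effect as array_remove(array(...), "").
    """
    conditions = [c if row1[c] != row2[c] else "" for c in row1 if c != key]
    return [c for c in conditions if c != ""]

row_df1 = {"id": 4, "name": "JKL", "sal": 4500, "Address": "CHN"}
row_df2 = {"id": 4, "name": "JKL_M", "sal": 4800, "Address": "CHN"}
print(diff_columns(row_df1, row_df2))  # -> ['name', 'sal']
```

This matches the last row of the Spark output above: for id 4, name and sal differ while Address is unchanged.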

Regarding "python - Comparing two DataFrames in Pyspark", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60279160/
