gpt4 book ai didi

r - 比较两个数据框并创建一个报告,在存在差异的地方给出字段名和旧/新值

转载 作者:行者123 更新时间:2023-12-04 18:00:57 24 4
gpt4 key购买 nike

我有两个数据框,df.olddf.newdf.old 包含 df.new 不包含的额外列。我想将 df.new 中的每个单元格与 df.old 中与同一行(相同的 ID_KEY)列相关的单元格进行比较.然后我想创建一个单独的数据框,它是所有提供 ID_KEY、Fieldname 以及旧值和新值的差异的报告。例如:

df.old:
ID_KEY | Date of Valuation | Original LTV | Tenure | Valuation in Current Condition | Comment
1 22/02/2016 76% Leasehold £151,000
2 22/02/2016 75% Leasehold £151,000
3 23/02/2016 76% Leasehold £150,000
4 24/02/2016 76% Freehold £151,000

df.new:
ID_KEY | Date of Valuation | Original LTV | Tenure | Valuation in Current Condition
1 21/02/2016 76% Leasehold £151,000
2 22/02/2016 73% Leasehold £151,000
3 23/02/2016 76% Leasehold £153,000
4 24/02/2016 76% Leasehold £151,000

报告:

ID_KEY | Fieldname        |              df.old_value | df.new_value
1 Date of Valuation 22/02/2016 21/02/2016
2 Original LTV 75% 73%
3 Valuation in Current Condition £150,000 £153,000
4 Tenure Freehold Leasehold

我可以设法用 VBA 编写它,但我的 R 代码有点生疏。我知道有一种更简单的方法可以使用拆分-应用-组合在 R 中编写此代码,但我无法弄明白。

最佳答案

选项 1:一种可能的方法是将数据帧融合成长格式,合并它们,然后过滤它们以查找不匹配的值:

library(reshape2)

df.old2 <- melt(df.old, id.vars = "ID_KEY", value.name = "df.old_value")
df.new2 <- melt(df.new, id.vars = "ID_KEY", value.name = "df.new_value")

df.merged <- merge(df.old2, df.new2, by = c("ID_KEY","variable"))
df.merged[df.merged$df.old_value!=df.merged$df.new_value,]

给出:

   ID_KEY                    variable df.old_value df.new_value
1 1 Date.of.Valuation 22/02/2016 21/02/2016
6 2 Original.LTV 75% 73%
12 3 Valuation.Current.Condition £150,000 £153,000
15 4 Tenure Freehold Leasehold

选项 2: 另一种方法是先将数据帧合并,然后使用增强的将结果融合成长格式data.table 包中的 melt 函数,它能够根据模式在输出中提供多个值列:

# create a vector with the common fieldnames
fnames <- names(df.new)[-1]
# or:
fnames <- names(df.old)[names(df.old) %in% names(df.new)][-1]

# merge the dataframes together based on "ID_KEY"
df1 <- merge(df.old, df.new, by = "ID_KEY")

# melt 'df1' into long format and check where the two value columns don't match
library(data.table)
melt(setDT(df1), "ID_KEY",
measure.vars = patterns(".x",".y"),
variable.name = "fieldname",
value.name = c("df.old_value","df.new_value"))[, fieldname := fnames[fieldname]
][df.old_value!=df.new_value][]

给出:

   ID_KEY                   fieldname df.old_value df.new_value
1: 1 Date.of.Valuation 22/02/2016 21/02/2016
2: 2 Original.LTV 75% 73%
3: 4 Tenure Freehold Leasehold
4: 3 Valuation.Current.Condition £150,000 £153,000

注意:我使用的数据在 df.old 中也有一个不匹配的 ID_KEY:

df.old <- read.table(text="ID_KEY  Date.of.Valuation  Original.LTV  Tenure  Valuation.Current.Condition  Comment
1 22/02/2016 76% Leasehold £151,000 Comment
2 22/02/2016 75% Leasehold £151,000 Comment
3 23/02/2016 76% Leasehold £150,000 Comment
4 24/02/2016 76% Freehold £151,000 Comment
5 24/02/2016 76% Freehold £151,000 Comment", header=TRUE)

df.new <- read.table(text="ID_KEY Date.of.Valuation Original.LTV Tenure Valuation.Current.Condition
1 21/02/2016 76% Leasehold £151,000
2 22/02/2016 73% Leasehold £151,000
3 23/02/2016 76% Leasehold £153,000
4 24/02/2016 76% Leasehold £151,000", header=TRUE)

新示例数据的更新:

将方法应用于 reshape2:

df.old2 <- melt(df.old, id.vars = "Loan Identifier", value.name = "df.old_value")
df.new2 <- melt(df.new, id.vars = "Loan Identifier", value.name = "df.new_value")

df.m <- merge(df.old2, df.new2, by = c("Loan Identifier","variable"))
df.r <- df.m[which(df.m$df.old_value!=df.m$df.new_value),]

给出:

> head(df.r)
Loan Identifier variable df.old_value df.new_value
1 960959610 Advance Amount (Gross Advance) 172499 166000
8 960959610 Completion date 1446422400 1447286400
11 960959610 Income B1 22800 47211
12 960959610 Income B2 22000 19461
13 960959610 Interest Rate 0.0309 0.0409
21 960959610 Original Term 420 240

对于 data.table,第一个示例数据集上使用的方法不起作用。类似于 reshape2 方法的可行解决方案:

# making copies, not necessarily needed
df.o <- as.data.table(df.old)
df.n <- as.data.table(df.new)

df.o2 <- melt(df.o, id.vars = "Loan Identifier", value.name = "df.old_value")
df.n2 <- melt(df.n, id.vars = "Loan Identifier", value.name = "df.new_value")

df.j <- df.n2[df.o2, on = c("Loan Identifier","variable")
][df.j$df.old_value!=df.j$df.new_value]

给出:

> head(df.j)
Loan Identifier variable df.new_value df.old_value
1: 960974101 Employment Type – B1 Employed or full loan is guaranteed Self-employed
2: 960959708 Employment Type – B1 Employed or full loan is guaranteed Self-employed
3: 960959806 Employment Type – B1 Self-employed Employed or full loan is guaranteed
4: 960973707 Property Postcode LE4 8EE TA1
5: 960974101 Property Postcode FY7 8HN BB2
6: 960959610 Property Postcode RG18 4QS BH9

关于r - 比较两个数据框并创建一个报告,在存在差异的地方给出字段名和旧/新值,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/35555621/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com