
R: Filter for duplicated values in two data.frames over two columns and by group

Reposted. Author: 行者123. Updated: 2023-12-03 14:30:14

I have a data frame called data that stores my regular data, with groups defined by ID.

data <- structure(list(NAME = structure(c(1L, 1L, 2L), .Label = c("NAME1", "NAME2"), class = "factor"), ID = c(23L, 23L, 57L), REF_YEAR = c(1920L, 1938L, 1869L), SURV_YEAR = c(1938L, 1962L, 1872L), VALUE = c(20L, 40L, 34L)), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR","VALUE"), class = "data.frame", row.names = c(NA, -3L))

NAME ID REF_YEAR SURV_YEAR VALUE
1 NAME1 23 1920 1938 20
2 NAME1 23 1938 1962 40
3 NAME2 57 1869 1872 34

I also have a second data.frame, dat_q, that I want to compare with data:
dat_q <- structure(list(NAME = structure(1:2, .Label = c("NAME1", "NAME2"), class = "factor"), ID = c(23L, 57L), REF_YEAR = c(1934L, 1866L), SURV_YEAR = c(1938L, 1868L), VALUE = structure(1:2, .Label = c("A", "B"), class = "factor")), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -2L))

NAME ID REF_YEAR SURV_YEAR VALUE
1 NAME1 23 1934 1938 A
2 NAME2 57 1866 1868 B

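My question: how can I remove every row of dat_q whose REF_YEAR or SURV_YEAR value equals a value in the same column of data (1938 in the sample data)? This should be applied per group (defined by ID), not across the whole data.frame.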

With my sample data, filtering dat_q should give this result:
  NAME  ID REF_YEAR SURV_YEAR VALUE
2 NAME2 57 1866 1868 B

Edit

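Here is some additional sample data for which the code provided by @thelatemail does not work. I don't understand why: the row in dat_q should be filtered out, because it contains values that also occur in data.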
data <- structure(list(NAME = structure(c(1L, 1L, 1L), .Label = "NAME1", class = "factor"), ID = c(226L, 226L, 226L), SURV_YEAR = c(2009L, 2010L, 2012L), REF_YEAR = c(2008L, 2009L, 2011L), VALUE = c(-7L, -37L,  -51L)), .Names = c("NAME", "ID", "SURV_YEAR", "REF_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -3L))

NAME ID SURV_YEAR REF_YEAR VALUE
1 NAME1 226 2009 2008 -7
2 NAME1 226 2010 2009 -37
3 NAME1 226 2012 2011 -51

dat_q <- structure(list(NAME = structure(1L, .Label = "NAME1", class = "factor"), ID = 226L, REF_YEAR = 2010L, SURV_YEAR = 2011L, VALUE = structure(1L, .Label = "-X", class = "factor")), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -1L))

NAME ID REF_YEAR SURV_YEAR VALUE
1 NAME1 226 2010 2011 -X
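
A quick check of this second sample suggests one possible explanation: the years in dat_q do occur in data, but in the other column (2010 is a SURV_YEAR in data, 2011 a REF_YEAR), so a strictly same-column comparison finds no match. The sketch below rebuilds the sample with data.frame(), omitting NAME and VALUE for brevity; the names same_col and cross_col are illustrative only, and no grouping is needed here because there is a single ID.

```r
# Second sample from the edit, rebuilt with data.frame() (NAME and VALUE omitted)
data <- data.frame(ID = c(226L, 226L, 226L),
                   SURV_YEAR = c(2009L, 2010L, 2012L),
                   REF_YEAR = c(2008L, 2009L, 2011L))
dat_q <- data.frame(ID = 226L, REF_YEAR = 2010L, SURV_YEAR = 2011L)

# Same-column comparison: REF_YEAR against REF_YEAR, SURV_YEAR against SURV_YEAR
same_col <- dat_q$REF_YEAR %in% data$REF_YEAR |
  dat_q$SURV_YEAR %in% data$SURV_YEAR

# Cross-column comparison: either year against both year columns
all_years <- c(data$REF_YEAR, data$SURV_YEAR)
cross_col <- dat_q$REF_YEAR %in% all_years | dat_q$SURV_YEAR %in% all_years

same_col   # FALSE: the row is kept by a same-column filter
cross_col  # TRUE: the row is dropped by a cross-column filter
```

If the row is meant to be removed, the comparison would have to look across both year columns rather than within each one.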

Best answer

I like using by in base R to work out the logic of problems like this. This works, but may be a bit slow:

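do.call(rbind, by(
  dat_q,
  dat_q$ID,
  function(x) {
    # rows of data that belong to the current ID group
    subdata <- data[data$ID == x$ID[1], ]
    # keep rows whose years do not occur in the same column of subdata
    x[!(x$REF_YEAR %in% subdata$REF_YEAR | x$SURV_YEAR %in% subdata$SURV_YEAR), ]
  }
))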

# NAME ID REF_YEAR SURV_YEAR VALUE
#57 NAME2 57 1866 1868 B
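
The same per-group, same-column logic can also be sketched without by(), by pasting ID and year into a single key so that %in% can only match within one ID. This is an alternative formulation rather than the code from the answer; it rebuilds the first sample with data.frame() (NAME omitted), and the names ref_hit, surv_hit, and filtered are illustrative.

```r
# First sample from the question, rebuilt with data.frame() (NAME omitted)
data <- data.frame(ID = c(23L, 23L, 57L),
                   REF_YEAR = c(1920L, 1938L, 1869L),
                   SURV_YEAR = c(1938L, 1962L, 1872L),
                   VALUE = c(20L, 40L, 34L))
dat_q <- data.frame(ID = c(23L, 57L),
                    REF_YEAR = c(1934L, 1866L),
                    SURV_YEAR = c(1938L, 1868L),
                    VALUE = c("A", "B"))

# Combine ID and year into one key so a match requires the same ID
ref_hit  <- paste(dat_q$ID, dat_q$REF_YEAR)  %in% paste(data$ID, data$REF_YEAR)
surv_hit <- paste(dat_q$ID, dat_q$SURV_YEAR) %in% paste(data$ID, data$SURV_YEAR)

# Drop every row of dat_q that matched in either column
filtered <- dat_q[!(ref_hit | surv_hit), ]
filtered
```

On this sample only the ID 57 row remains, which agrees with the result of the by() version.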

A data.table solution that follows the same logic will probably be faster:
library(data.table)
setDT(dat_q)
setDT(data)
dat_q[
  ,
  .SD[!(REF_YEAR %in% data$REF_YEAR[data[, ID == .BY]] |
        SURV_YEAR %in% data$SURV_YEAR[data[, ID == .BY]])],
  by = ID
]

# ID NAME REF_YEAR SURV_YEAR VALUE
#1: 57 NAME2 1866 1868 B

With data.table, I think you could also do it this way. After converting both to data.tables:
# using 1.9.3+, just remove `by=.EACHI` if you're using <= 1.9.2
setkey(data, ID)
setkey(dat_q, ID)

idx = data[dat_q, any(c(i.REF_YEAR, i.SURV_YEAR) %in% c(REF_YEAR, SURV_YEAR)), by=.EACHI]$V1
dat_q[!idx]
# NAME ID REF_YEAR SURV_YEAR VALUE
# 1: NAME2 57 1866 1868 B

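We perform a join on the key column, and for each row of dat_q we evaluate the expression in j on the matching rows of data. This gives us the logical values needed to index/subset dat_q afterwards.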
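
As a rough base R analogue of what that join computes (not how data.table implements it), the sketch below loops over the rows of dat_q, looks up the rows of data with the same ID, and evaluates the same any(... %in% ...) test. Note that this test compares either year against both year columns, unlike the same-column filters above. The sample is rebuilt with data.frame() (NAME omitted), and the helper name matched is illustrative.

```r
# First sample from the question, rebuilt with data.frame() (NAME omitted)
data <- data.frame(ID = c(23L, 23L, 57L),
                   REF_YEAR = c(1920L, 1938L, 1869L),
                   SURV_YEAR = c(1938L, 1962L, 1872L))
dat_q <- data.frame(ID = c(23L, 57L),
                    REF_YEAR = c(1934L, 1866L),
                    SURV_YEAR = c(1938L, 1868L),
                    VALUE = c("A", "B"))

# One logical per row of dat_q: does either of its years occur in
# either year column of the rows of data that share its ID?
idx <- vapply(seq_len(nrow(dat_q)), function(i) {
  matched <- data[data$ID == dat_q$ID[i], ]
  any(c(dat_q$REF_YEAR[i], dat_q$SURV_YEAR[i]) %in%
        c(matched$REF_YEAR, matched$SURV_YEAR))
}, logical(1))

idx            # TRUE FALSE
dat_q[!idx, ]  # keeps only the ID 57 row
```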

This question (R: Filter for duplicated values in two data.frames over two columns and by group) comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/26131370/
