gpt4 book ai didi

r - 如何按条件和按组过滤我的 data.table?

转载 作者:行者123 更新时间:2023-12-04 15:38:54 24 4
gpt4 key购买 nike

问题

我在 data.table 上工作,其中每一行都是医学观察。问题是我的数据有一些错误,我需要在进行分析之前更正它们。例如,男性患者可以进行观察,其中他被编码为女性。

解决方案

我的解决方案是由患者选择变量的众数(最频繁的值)。如果一名患者有 10 个男性观察结果和一个女性观察结果,则可以安全地假设他是男性。

我发现了使用 data.table 的巧妙方法。

DATA[j  = .N, 
by = .(ID, SEX)][i = base::order(-N),
j = .(SEX = SEX[1L]),
keyby = ID]

问题是,当一个病人有多种模式时,它只保留一种模式。所以一个50%男性和50%女性的患者会被算作男性,这最终会导致偏差。我想将它们编码为 NA。

纠正这个问题的唯一方法是使用 dplyr

DATA[j  = .N, 
by = .(ID, SEX)] %>%
group_by(ID) %>%
filter(N == max(N))

如果重复,则用 NA 替换 SEX 值。但它比 data.table 花费的时间更长,它不是很优化,而且我有一个包含大量变量的大数据集,这些变量也需要更正。

简历

如果不是唯一的,我如何获取患者变量的模式并用 NA 替换它?

例子

ID <- c(rep(x = "1", 6), rep(x = "2", 6))
SEX <- c("M","M","M","M","F","M","M","F","M","F","F","M")

require(data.table)
DATA <- data.table(ID, SEX)

# First method (doesn't work)
DATA[j = .N,
by = .(ID, SEX)][i = base::order(-N),
j = .(SEX = SEX[1L]),
keyby = ID]

# Second method (work with dplyr)
require(dplyr)
DATA[j = .N,
by = .(ID, SEX)] %>%
group_by(ID) %>%
filter(N == max(N)) %>%
mutate(SEX = if_else(condition = duplicated(ID) == TRUE,
true = "NA",
false = SEX)) %>%
filter(row_number() == n())

# Applied to my data it took 84.288 seconds

更新

@Cole 基于@Sindri_baldur 的想法提出的解决方案:

DATA <- data.table(
ID = c(rep(x = "1", 6), rep(x = "2", 6)),
SEX = c("M","M","M","M","F","M","M","F","M","F","F",NA),
V1 = c("a", NA, "a", "a", "b", "a", "b", "b", "b", "c", "b", "c")
)

our_mode_fac <- function(x) {
freq <- tabulate(x)
if (length(freq) == 0 || sum(freq == max(freq)) > 1 ) {NA}
else {levels(x)[which.max(freq)]}
}

vars <- c("SEX", "V1")

DATA[j = paste0(vars) := lapply(.SD, as.factor),
.SDcols = vars][j = vars := lapply(.SD, our_mode_fac),
.SDcols = vars,
by = ID]

它工作得很好。它采用众数,即使 NA 多于因子,并在多于 1 种众数时用 NA 替换值。

现在它也非常快:11 秒用于 3M+ 观察和 1M+ 患者(@Sindri_baldur 回答为 117 秒)。非常感谢你们俩,我非常感谢!

最佳答案

our_mode <- function(x) {
freq <- table(x)
if (length(freq) == 0 || sum(freq == max(freq)) > 1 ) {
NA
} else {
names(freq)[which.max(freq)]
}
}

vars <- c("SEX", "V1")
DATA[, paste0(vars, "_corrected") := lapply(.SD, our_mode), .SDcols = vars, by = ID]

ID SEX V1 SEX_corrected V1_corrected
1: 1 M a M a
2: 1 M <NA> M a
3: 1 M a M a
4: 1 M a M a
5: 1 F b M a
6: 1 M a M a
7: 2 M b F b
8: 2 F b F b
9: 2 M b F b
10: 2 F c F b
11: 2 F b F b
12: 2 <NA> c F b

可重现的数据

DATA <- data.table(
ID = c(rep(x = "1", 6), rep(x = "2", 6)),
SEX = c("M","M","M","M","F","M","M","F","M","F","F",NA),
V1 = c("a", NA, "a", "a", "b", "a", "b", "b", "b", "c", "b", "c")
)

请注意,our_mode() 并未针对速度进行优化。请参阅 Cole 提出的关于提高评论速度的建议。

关于r - 如何按条件和按组过滤我的 data.table?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/58746800/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com