gpt4 book ai didi

r - 在 R 中使用缺失值计算求和、减法或求和与减法

转载 作者:行者123 更新时间:2023-12-01 23:22:45 25 4
gpt4 key购买 nike

我想知道是否有一种优化的方法可以在缺少某些值时进行求和、减法或两者兼而有之。

比如下面的计算因为缺失不能直接做

library("data.table")
library("benchr")
library("glue")

dt <- data.table(A = c(NA, 2, 3, 4, NA),
B = c( 1, NA, 3, NA, NA),
C = c( 1, 2, NA, NA, NA))

dt[, SUM := A + B + C]
dt[, DIF := A - B - C]
dt[, MIX := A + B - C]

dt

A B C SUM DIF MIX
1: NA 1 1 NA NA NA
2: 2 NA 2 NA NA NA
3: 3 3 NA NA NA NA
4: 4 NA NA NA NA NA
5: NA NA NA NA NA NA

但是,我编写了一个可以实现预期结果的函数,但我不确定这是一种优化的方法,因为我已经制作了一份额外的数据副本,因此我不会更改原始变量。

fun1<- function(tbl, expr_text, allowed = NULL) {
lhs <- trimws(unlist(strsplit(gsub(":=.*", "", expr_text), split = "[[:punct:]]")))
rhs <- setdiff(trimws(unlist(strsplit(gsub(".*:=", "", expr_text), split = "[[:punct:]]"))), allowed)
aux_tbl <- copy(tbl)
if (is.null(allowed)) {
setnafill(aux_tbl, "const", fill = 0)
} else {
setnafill(aux_tbl, "const", fill = 0, cols = allowed)
}
aux_tbl[, eval(rlang::parse_expr(expr_text))]
expr_text <- glue::glue("{lhs} := fcase(rowSums(is.na(.SD)) < {length(rhs)}, {lhs})")
tbl[, (lhs) := aux_tbl[, eval(rlang::parse_expr(expr_text)), .SDcols = rhs][[lhs]]]
}

dt <- data.table(A = c(NA, 2, -3, 4, NA),
B = c( 1, NA, 3, NA, NA),
C = c( 1, 2, NA, NA, NA))

fun1(tbl = dt, expr_text = "SUM := A + B + C")
fun1(tbl = dt, expr_text = "DIF := A - B - C")
fun1(tbl = dt, expr_text = "MIX := A + B - C")

dt

A B C SUM DIF MIX
1: NA 1 1 2 -2 0
2: 2 NA 2 4 0 0
3: -3 3 NA 0 -6 0
4: 4 NA NA 4 4 4
5: NA NA NA 0 0 0

更新

实际上,如果所有值都缺失(第 5 行),那么结果也一定缺失,而不是像我第一次尝试时那样为零。我重新编写了函数来解决这个问题。

预期的结果应该是:

fun1 <- function(tbl, expr_text, allowed = NULL) {
tbl <- copy(tbl)
lhs <- trimws(unlist(strsplit(gsub(":=.*", "", expr_text), split = "[[:punct:]]")))
rhs <- setdiff(trimws(unlist(strsplit(gsub(".*:=", "", expr_text), split = "[[:punct:]]"))), allowed)
aux_tbl <- copy(tbl)
if (is.null(allowed)) {
setnafill(aux_tbl, "const", fill = 0)
} else {
setnafill(aux_tbl, "const", fill = 0, cols = allowed)
}
aux_tbl[, eval(rlang::parse_expr(expr_text))]
expr2 <- glue::glue("{lhs} := fcase(rowSums(is.na(.SD)) < {length(rhs)}, {lhs})")
tbl[, (lhs) := aux_tbl[[lhs]]]
tbl[, (lhs) := tbl[, eval(rlang::parse_expr(expr2)), .SDcols = rhs][[lhs]]][]
}

fun1(tbl = dt, expr_text = "MIX := A + B - C")

A B C SUM DIF MIX
1: NA 1 1 2 -2 0
2: 2 NA 2 4 0 0
3: -3 3 NA 0 -6 0
4: 4 NA NA 4 4 4
5: NA NA NA NA NA NA

基准

library("data.table")
library("benchr")
library("glue")

n <- 100000
set.seed(12345)
dt <- data.table(A = sample(c(rnorm((1 - 0.10)*n), rep(NA_real_, 0.10*n))),
B = sample(c(rnorm((1 - 0.20)*n), rep(NA_real_, 0.20*n))),
C = sample(c(rnorm((1 - 0.35)*n), rep(NA_real_, 0.35*n))))

fun1 <- function(tbl, expr_text, allowed = NULL) {
tbl <- copy(tbl)
lhs <- trimws(unlist(strsplit(gsub(":=.*", "", expr_text), split = "[[:punct:]]")))
rhs <- setdiff(trimws(unlist(strsplit(gsub(".*:=", "", expr_text), split = "[[:punct:]]"))), allowed)
aux_tbl <- copy(tbl)
if (is.null(allowed)) {
setnafill(aux_tbl, "const", fill = 0)
} else {
setnafill(aux_tbl, "const", fill = 0, cols = allowed)
}
aux_tbl[, eval(rlang::parse_expr(expr_text))]
expr2 <- glue::glue("{lhs} := fcase(rowSums(is.na(.SD)) < {length(rhs)}, {lhs})")
tbl[, (lhs) := aux_tbl[[lhs]]]
tbl[, (lhs) := tbl[, eval(rlang::parse_expr(expr2)), .SDcols = rhs][[lhs]]][]
}

fun2 <- function(tbl, expr_text, allowed = NULL) {
tbl <- copy(tbl)
sgn <- trimws(unlist(strsplit(gsub(".*:=|\\+", "", expr_text), split = "[[:alnum:]]")))
lhs <- trimws(unlist(strsplit(gsub(":=.*", "", expr_text), split = "[[:punct:]]")))
rhs <- setdiff(trimws(unlist(strsplit(gsub(".*:=", "", expr_text), split = "[[:punct:]]"))), allowed)
expr1 <- glue::glue("{lhs} := mapply(sum, {paste0(sgn, rhs, collapse=',')}, na.rm =TRUE)")
tbl[, eval(rlang::parse_expr(expr1))]
expr2 <- glue::glue("{lhs} := fcase(rowSums(is.na(.SD)) < {length(rhs)}, {lhs})")
tbl[, (lhs) := tbl[, eval(rlang::parse_expr(expr2)), .SDcols = rhs][[lhs]]][]
}

fun3 <- function(tbl, expr_text, allowed = NULL) {
tbl <- copy(tbl)
sgn <- paste0(trimws(unlist(strsplit(gsub(".*:=|\\+", "", expr_text), split = "[[:alnum:]]"))), 1, collapse = ", ")
lhs <- trimws(unlist(strsplit(gsub(":=.*", "", expr_text), split = "[[:punct:]]")))
rhs <- setdiff(trimws(unlist(strsplit(gsub(".*:=", "", expr_text), split = "[[:punct:]]"))), allowed)
expr1 <- glue::glue("{lhs} := rowSums(mapply('*', .SD, c({sgn})), na.rm =TRUE)")
tbl[, eval(rlang::parse_expr(expr1)), .SDcols = rhs]
expr2 <- glue::glue("{lhs} := fcase(rowSums(is.na(.SD)) < {length(rhs)}, {lhs})")
tbl[, (lhs) := tbl[, eval(rlang::parse_expr(expr2)), .SDcols = rhs][[lhs]]][]
}

fun4 <- function(tbl, expr_text, allowed = NULL) {
tbl <- copy(tbl)
rhs <- setdiff(trimws(unlist(strsplit(gsub(".*:=", "", expr_text), split = "[[:punct:]]"))), allowed)
lhs <- trimws(unlist(strsplit(gsub(":=.*", "", expr_text), split = "[[:punct:]]")))
aux_tbl <- copy(tbl)
if (is.null(allowed)) {
setnafill(aux_tbl, "const", fill = 0)
} else {
setnafill(aux_tbl, "const", fill = 0, cols = allowed)
}
is_missing <- tbl[, NA ^ (rowSums(!is.na(.SD)) == 0), .SDcols = rhs]
expr_text <- paste0(gsub(":=", ":= (", expr_text), ") * is_missing")
aux_tbl[, eval(rlang::parse_expr(expr_text))]
tbl[, (lhs) := aux_tbl[[lhs]]][]
}

res <- benchr::benchmark(
fun1 = fun1(tbl = dt, expr_text = "MIX := A + B + C"),
fun2 = fun2(tbl = dt, expr_text = "MIX := A + B + C"),
fun3 = fun3(tbl = dt, expr_text = "MIX := A + B + C"),
fun4 = fun4(tbl = dt, expr_text = "MIX := A + B + C")
)

print(res, order = "median")

Benchmark summary:
Time units : milliseconds
expr n.eval min lw.qu median mean up.qu max total relative
fun4 100 6.42 6.74 6.88 9.27 11.6 25.5 927 1.00
fun1 100 6.76 7.04 7.33 14.70 14.2 128.0 1470 1.07
fun3 100 8.76 9.14 13.10 16.40 18.1 101.0 1640 1.91
fun2 100 146.00 181.00 206.00 208.00 230.0 298.0 20800 30.00

我写了一些答案作为对它们进行基准测试的函数。我还创建了一个额外的 fun4,它比原来的 fun1 稍快。

Boxplot of Benchmark

我正在考虑使用 Rcpp 编写它,但我不确定它是否会使它变得更好。

有人知道更好的方法或有什么建议吗?

谢谢。

最佳答案

通过使用mapply

library(data.table)
dt <- data.table(A = c(1, 2, 3, 4, NA),
B = c( 1, NA, 3, NA, NA),
C = c( 1, 2, NA, NA, NA))

dt[, SUM := mapply(sum, A,B,C, na.rm =TRUE)]
dt[, DIF := mapply(sum, A,-B,-C, na.rm =TRUE)]
dt[, MIX := mapply(sum, A,B,-C, na.rm =TRUE)]

A B C SUM DIF MIX
1: 1 1 1 3 -1 1
2: 2 NA 2 4 0 0
3: 3 3 NA 6 0 6
4: 4 NA NA 4 4 4
5: NA NA NA 0 0 0

关于r - 在 R 中使用缺失值计算求和、减法或求和与减法,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/67741963/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com