
r - How to efficiently split a large data.table into two parts according to a rule involving 2 columns?

Reposted. Author: 行者123. Updated: 2023-12-03 01:20:20

What is the most efficient way (in both time and space) to split the following table

library(data.table)
dt = data.table(x=c(1,3,5,4,6,2), y=c(4,7,1,1,2,6))
> dt
   x y
1: 1 4
2: 3 7
3: 5 1
4: 4 1
5: 6 2
6: 2 6

into two separate tables, dt1 and dt2, such that dt1 contains a row (x,y) if and only if (y,x) is also a row in dt, and dt2 contains all remaining rows:

> dt1
x y
1: 1 4
2: 4 1
3: 6 2
4: 2 6

> dt2
x y
1: 3 7
2: 5 1

Efficiency is critical: the full table has nearly 200 million rows.

Best answer

Another option is to join the table to itself with the two columns swapped (a reversed self-join):

indx <- sort.int(dt[unique(dt), on = c(x = "y", y = "x"), which = TRUE, nomatch = 0L])

dt[indx]
# x y
# 1: 1 4
# 2: 4 1
# 3: 6 2
# 4: 2 6

dt[-indx]
# x y
# 1: 3 7
# 2: 5 1
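
The reversed self-join above can be wrapped in a small helper. The following is a minimal sketch, not part of the original answer: the function name `split_by_reverse` and the logical-mask step are assumptions added for illustration. The mask keeps both parts in the original row order without a sort, and it sidesteps negative indexing with an empty index vector, a known pitfall in base R, where `x[-integer(0)]` selects nothing.

```r
library(data.table)

# Hypothetical helper (the name is illustrative, not from the original answer).
split_by_reverse <- function(dt) {
  # Row numbers of dt whose (x, y) equals some (y, x) of dt:
  # join dt against its own distinct rows with the columns swapped.
  indx <- dt[unique(dt), on = c(x = "y", y = "x"), which = TRUE, nomatch = 0L]

  # Logical mask instead of dt[-indx]: keeps row order, safe when indx is empty.
  has_reverse <- logical(nrow(dt))
  has_reverse[indx] <- TRUE

  list(dt1 = dt[has_reverse], dt2 = dt[!has_reverse])
}

dt <- data.table(x = c(1, 3, 5, 4, 6, 2), y = c(4, 7, 1, 1, 2, 6))
res <- split_by_reverse(dt)
res$dt1  # rows (1,4), (4,1), (6,2), (2,6)
res$dt2  # rows (3,7), (5,1)
```

On a table of 200 million rows the mask costs one extra logical vector of that length, so whether it is preferable to `sort.int()` plus negative indexing depends on how much memory is available.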

Benchmark: if you do not care about row order, my solution appears to be faster on 200MM rows (both solutions return unordered results)

set.seed(123)
bigdt <- data.table(x = sample(1e3, 2e8, replace = TRUE),
y = sample(1e3, 2e8, replace = TRUE))

system.time(i1 <- bigdt[, .I[.N>1] ,.(X=pmax(x,y), Y=pmin(y,x))]$V1)
# user system elapsed
# 21.81 0.82 22.97

system.time(indx <- bigdt[unique(bigdt), on = c(x = "y", y = "x"), which = TRUE, nomatch = 0L])
# user system elapsed
# 17.74 0.90 18.80

# Checking if both unsorted and if identical when sorted
is.unsorted(i1)
# [1] TRUE
is.unsorted(indx)
# [1] TRUE

identical(sort.int(i1), sort.int(indx))
# [1] TRUE

Here is a non-degenerate case (one where indx != bigdt[, .I], that is, where not every row is selected):

set.seed(123)
n = 1e7
nv = 1e4
DT <- data.table(x = sample(nv, n, replace = TRUE), y = sample(nv, n, replace = TRUE))

library(microbenchmark)
microbenchmark(
akrun = {
idx = DT[, .I[.N > 1], by=.(pmax(x,y), pmin(x,y))]$V1
list(DT[idx], DT[-idx])
},
akrun2 = {
idx = DT[,{
x1 <- paste(pmin(x,y), pmax(x,y))
duplicated(x1)|duplicated(x1, fromLast=TRUE)
}]
list(DT[idx], DT[!idx])
},
davida = {
idx = DT[unique(DT), on = c(x = "y", y = "x"), which = TRUE, nomatch = 0L]
list(DT[idx], DT[-idx])
},
akrun3 = {
n = DT[, N := .N, by = .(pmax(x,y), pmin(x,y))]$N
DT[, N := NULL]
split(DT, n > 1L)
}, times = 1)

Unit: seconds
   expr       min        lq      mean    median        uq       max neval
  akrun  7.056609  7.056609  7.056609  7.056609  7.056609  7.056609     1
 akrun2 22.810844 22.810844 22.810844 22.810844 22.810844 22.810844     1
 davida  2.738918  2.738918  2.738918  2.738918  2.738918  2.738918     1
 akrun3  5.662700  5.662700  5.662700  5.662700  5.662700  5.662700     1
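
One caveat when comparing these approaches: they do not select the same rows on every input. The check below is a sketch added for illustration, and the table `edge` is made up rather than taken from the original post. Grouping by `pmax`/`pmin` with `.N > 1` treats an exact duplicate of a row as a match and skips a lone row with `x == y`, whereas the reversed self-join follows the rule as stated in the question, where (x,y) qualifies whenever (y,x) is a row.

```r
library(data.table)

# Made-up table: a duplicated pair without its reverse, plus a row with x == y.
edge <- data.table(x = c(1, 1, 3), y = c(4, 4, 3))

# Grouping approach: selects rows whose unordered pair occurs more than once.
i_group <- edge[, .I[.N > 1], by = .(pmax(x, y), pmin(x, y))]$V1

# Reversed self-join: selects rows (x, y) for which (y, x) is also a row.
i_join <- edge[unique(edge), on = c(x = "y", y = "x"), which = TRUE, nomatch = 0L]

sort(i_group)  # 1 2: the duplicated (1,4) rows, although (4,1) is absent
sort(i_join)   # 3:   only (3,3), which is its own reverse
```

If the real data can contain duplicate rows or rows with equal columns, it is worth deciding which of the two behaviours is wanted before picking a method on speed alone.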

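For more on "r - How to efficiently split a large data.table into two parts according to a rule involving 2 columns?", see the similar question on Stack Overflow: https://stackoverflow.com/questions/36444269/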
