
Rolling join + sum without blowing up memory

Reposted. Author: 行者123. Updated: 2023-12-04 11:55:06

My question is best explained with an example.

Setup

library(data.table)

IDs <- 5
samplesPerId <- 100
set.seed(2019)

foo <- data.table(
  id = rep(sample(1000000, size = IDs, replace = FALSE), each = samplesPerId),
  time = sample(999999, size = IDs * samplesPerId, replace = FALSE),
  val = round(runif(n = IDs * samplesPerId, min = 0, max = 1), 2)
)

setorderv(foo, c("id", "time"))
foo[, val_cmltv_max := cummax(val), by = id]  # running max of val per id
bar <- data.table(time = seq(1, 999999, by = 1))

> foo
id time val val_cmltv_max
1: 459383 11250 0.83 0.83
2: 459383 13774 0.45 0.83
3: 459383 22266 0.27 0.83
4: 459383 44513 0.37 0.83
5: 459383 49432 0.86 0.86
---
496: 826316 950991 0.36 0.98
497: 826316 960187 0.80 0.98
498: 826316 961433 0.17 0.98
499: 826316 965398 0.36 0.98
500: 826316 994626 0.07 0.98

> bar
time
1: 1
2: 2
3: 3
4: 4
5: 5
---
999995: 999995
999996: 999996
999997: 999997
999998: 999998
999999: 999999

Goal

For each time point 1, 2, ..., 999999, I want the sum of val_cmltv_max over the ids known at that time point. For example, at time 1 the sum should be 0 because no id has been observed yet, while at time 999999 the sum should be just under 5, since there are 5 ids and each id's val_cmltv_max should be close to 1 by then.

Current solution

Here I start from the cartesian product of every id (the 5 sampled ids) and every time point (1, 2, ..., 999999), which gives a ~5M-row intermediate table. Then I use a rolling join to attach each id's most recent record in foo to that big intermediate table, after which I can aggregate by summing val_cmltv_max per time point.

temp <- CJ(time = bar$time, id = sort(unique(foo$id)))
temp2 <- foo[temp, on = c("id", "time"), roll = TRUE]
result <- temp2[, list(sum_val_cmltv_max = sum(val_cmltv_max, na.rm = TRUE)), by = time]
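To make the `roll = TRUE` semantics concrete, here is a minimal, self-contained sketch (the tables `dt` and `qry` are illustrative, not the question's data); the rolling join carries each id's last preceding observation forward (LOCF):

```r
library(data.table)

# toy data: one id observed at times 10 and 20
dt  <- data.table(id = 1L, time = c(10L, 20L), val = c(0.3, 0.7))
qry <- data.table(id = 1L, time = c(5L, 15L, 25L))

# roll = TRUE rolls the most recent earlier row forward onto each query time
res <- dt[qry, on = c("id", "time"), roll = TRUE]
res$val  # NA (nothing known yet at time 5), then 0.3, then 0.7
```

This is why `sum(val_cmltv_max, na.rm = TRUE)` is needed above: query times before an id's first observation join to NA.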

> result
time sum_val_cmltv_max
1: 1 0.00
2: 2 0.00
3: 3 0.00
4: 4 0.00
5: 5 0.00
---
999995: 999995 4.95
999996: 999996 4.95
999997: 999997 4.95
999998: 999998 4.95
999999: 999999 4.95

Is there a way to do this that is both fast and memory-efficient, avoiding the huge intermediate table?

Best Answer

Ugh. Five minutes after posting, I realized the solution: instead of materializing the full cartesian product, record only the time points at which some id's running max changes, sum those changes per time point, and take a cumulative sum over bar.

# get the first row per unique (id, val_cmltv_max)
changes <- foo[foo[, .I[1L], by = list(id, val_cmltv_max)]$V1]

# For each id, get the change in val_cmltv_max
# Would use shift() here but it's slow
# changes[, val_cmltv_max_prev := shift(val_cmltv_max, type = "lag", fill = 0), by = id]
changes[, val_cmltv_max_prev := c(0, head(val_cmltv_max, -1)), by = id]
changes[, change := val_cmltv_max - val_cmltv_max_prev]

# aggregate changes by time
changes <- changes[, list(change = sum(change)), by = time]

# insert into bar and cumsum
bar[, change := 0]
bar[changes, change := i.change, on = "time"]
bar[, sum_val_cmltv_max := cumsum(change)]
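As a sanity check, the change-and-cumsum approach can be compared against the rolling-join reference on a small self-contained example (variable names `ref`, `chg`, and `s` are mine, not from the question):

```r
library(data.table)
set.seed(1)

# small analogue of foo/bar: 2 ids, 5 samples each, times in 1..20
foo <- data.table(id = rep(1:2, each = 5),
                  time = sample(20, 10),
                  val = round(runif(10), 2))
setorderv(foo, c("id", "time"))
foo[, val_cmltv_max := cummax(val), by = id]
bar <- data.table(time = 1:20)

# reference: cartesian product + rolling join + per-time sum
temp <- foo[CJ(time = bar$time, id = unique(foo$id)),
            on = c("id", "time"), roll = TRUE]
ref <- temp[, .(s = sum(val_cmltv_max, na.rm = TRUE)), by = time]

# memory-light: per-id changes in the running max, then cumsum
changes <- foo[foo[, .I[1L], by = .(id, val_cmltv_max)]$V1]
changes[, prev := c(0, head(val_cmltv_max, -1)), by = id]
chg <- changes[, .(change = sum(val_cmltv_max - prev)), by = time]
bar[, change := 0][chg, change := i.change, on = "time"]
bar[, s := cumsum(change)]

all.equal(ref$s, bar$s)  # TRUE
```

The change table has at most one row per increase of a running max, so its size is bounded by nrow(foo) rather than by the number of time points.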

Regarding rolling join + sum without blowing up memory, there is a corresponding question on Stack Overflow: https://stackoverflow.com/questions/59078415/
