r - Optimizing data.table in R?

I have two tables in R, shown below:

library(data.table)

DT.Purchase <- data.frame(ID = c(1,1,1,2,2,3,3,3,3,3,4,4,4,4),
                          CDS = c("0389","0389", "3298", "4545", "1282", "4545",
                                  "0389","0389", "5685", "4545", "1282", "0389",
                                  "1282", "1282"),
                          Date = c("5/28/2016","5/26/2016","8/9/2016","2/2/2015",
                                   "2/24/2015", "9/27/2015", "9/27/2015", "9/5/2015",
                                   "3/3/2016", "4/9/2014", "5/1/2014", "5/4/2014",
                                   "6/9/2014", "7/7/2014"),
                          JFK = c(T,F,F,F,T,T,F,F,T,F,T,T,T,F),
                          RFK = c(F,T,T,F,T,F,F,F,F,T,T,T,T,T),
                          RUG = c(T,F,T,F,T,F,F,F,F,T,F,F,T,T),
                          LPG = c(T,T,T,F,F,T,T,F,F,F,F,F,T,F))

# Parse the month/day/year strings into Date objects, then convert to data.table
DT.Purchase$Date <- as.Date(DT.Purchase$Date, format = "%m/%d/%Y")
DT.Purchase <- data.table(DT.Purchase)

ID  CDS   Date       JFK  RFK  RUG  LPG
 1  0389  5/28/2016  T    F    T    T
 1  0389  5/26/2016  F    T    F    T
 1  3298  8/9/2016   F    T    T    T
 2  4545  2/2/2015   F    F    F    F
 2  1282  2/24/2015  T    T    T    F
 3  4545  9/27/2015  T    F    F    T
 3  0389  9/27/2015  F    F    F    T
 3  0389  9/5/2015   F    F    F    F
 3  5685  3/3/2016   T    F    F    F
 3  4545  4/9/2014   F    T    T    F
 4  1282  5/1/2014   T    T    F    F
 4  0389  5/4/2014   T    T    F    F
 4  1282  6/9/2014   T    T    T    T
 4  1282  7/7/2014   F    T    T    F

DT.Stay <- data.frame(Stay.ID = c(1,2,3,5,6,9,10,11),
                      ID = c(1,1,2,3,3,3,4,4),
                      Start.Date = c('5/26/2016','8/1/2016', '2/1/2015', '3/1/2016',
                                     '9/1/2015', '4/9/2014', '4/7/2014','6/1/2014'),
                      End.Date = c('6/6/2016','9/1/2016','3/1/2015','3/7/2016',
                                   '9/30/2015','4/14/2014','5/9/2014','7/11/2014'))
DT.Stay$Start.Date <- as.Date(DT.Stay$Start.Date, format = "%m/%d/%Y")
DT.Stay$End.Date <- as.Date(DT.Stay$End.Date, format = "%m/%d/%Y")

DT.Stay <- data.table(DT.Stay)

Stay.ID  ID  Start.Date  End.Date
      1   1  5/26/2016   6/6/2016
      2   1  8/1/2016    9/1/2016
      3   2  2/1/2015    3/1/2015
      5   3  3/1/2016    3/7/2016
      6   3  9/1/2015    9/30/2015
      9   3  4/9/2014    4/14/2014
     10   4  4/7/2014    5/9/2014
     11   4  6/1/2014    7/11/2014

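In reality, DT.Purchase is much larger (10 million observations) and DT.Stay has more than 50,000 observations. The flag columns DT.Purchase[, .(JFK, RFK, RUG, LPG)] carry the weights c.weights = c(1,2,1,3), which represent the internal cost weights we are considering. The goal is to score each stay based on earlier purchases and these weights. Specifically, I want a cost.index for each Stay.ID: the weighted sum of all purchases by the same ID made on or before that stay's End.Date. The final data.table should look like this: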

Stay.ID cost.index
1 10
2 16
3 4
5 11
6 10
9 3
10 6
11 10
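
To make the arithmetic concrete, here is a minimal sketch of the per-purchase weighted cost, using only the three purchases of ID 1 from the data above. The `flags` table and the `cost` column are names made up for this illustration; they are not part of the question.

```r
library(data.table)

# Flag columns of the three purchases made by ID 1 (copied from DT.Purchase).
# `flags` and `cost` are hypothetical names used only in this sketch.
flags <- data.table(JFK = c(TRUE, FALSE, FALSE),
                    RFK = c(FALSE, TRUE, TRUE),
                    RUG = c(TRUE, FALSE, TRUE),
                    LPG = c(TRUE, TRUE, TRUE))
c.weights <- c(1, 2, 1, 3)  # weights for JFK, RFK, RUG, LPG, in that order

# The logical flags act as 0/1 in the matrix product, giving one cost per row
flags[, cost := as.vector(as.matrix(.SD) %*% c.weights),
      .SDcols = c("JFK", "RFK", "RUG", "LPG")]
print(flags$cost)
```

The first two purchases (5/28/2016 and 5/26/2016) fall on or before the End.Date of Stay.ID 1 and cost 5 each, which gives the 10 in the table. Adding the 8/9/2016 purchase (cost 6) gives the 16 listed for Stay.ID 2.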

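My current approach merges the two datasets on ID (with allow.cartesian = TRUE) and keeps the rows where Date <= End.Date. I then replace the flags with their weights and sum for each Stay.ID. It works, but I am looking for a faster way to do this. With 10 million and 50,000 observations, the merge consumes a lot of time and memory.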
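
For reference, the merge-and-filter approach described above might look roughly like the sketch below. This is a reconstruction, not the asker's actual code. It assumes the weighted cost has already been collapsed into a single hypothetical `cost` column, and it uses only the rows for IDs 1 and 2.

```r
library(data.table)

# Small stand-ins for DT.Purchase and DT.Stay (IDs 1 and 2 only).
# `cost` is a hypothetical precomputed column: flags %*% c(1, 2, 1, 3).
purchases <- data.table(
  ID   = c(1, 1, 1, 2, 2),
  Date = as.Date(c("2016-05-28", "2016-05-26", "2016-08-09",
                   "2015-02-02", "2015-02-24")),
  cost = c(5, 5, 6, 0, 4))
stays <- data.table(
  Stay.ID  = c(1, 2, 3),
  ID       = c(1, 1, 2),
  End.Date = as.Date(c("2016-06-06", "2016-09-01", "2015-03-01")))

# Every stay is paired with every purchase of the same ID (the expensive step),
# then purchases made after the stay's End.Date are dropped
merged   <- merge(stays, purchases, by = "ID", allow.cartesian = TRUE)
baseline <- merged[Date <= End.Date, .(cost.index = sum(cost)), keyby = Stay.ID]
print(baseline)
```

The intermediate `merged` table holds one row per (stay, purchase) pair within an ID, which is why this gets slow and memory-hungry at 10 million purchases.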

Best answer

With the latest development version (1.9.7+), which supports non-equi joins in on, something like this will work:

DT.Purchase[DT.Stay, on = .(ID = ID, Date >= Start.Date, Date <= End.Date),
.(Stay.ID, sum(as.matrix(.SD) %*% c.weights)),
by = .EACHI, .SDcols = JFK:LPG]

This assumes your dates are in Date or IDate format.
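
Note that the join above only counts purchases made between Start.Date and End.Date. If the intent is "all purchases on or before End.Date", as the question's wording suggests, dropping the Start.Date condition may be closer. The sketch below compares both variants on IDs 1 and 4, assuming a hypothetical precomputed `cost` column so the matrix product is not repeated for every group.

```r
library(data.table)

# Stand-ins for DT.Purchase / DT.Stay restricted to IDs 1 and 4.
# `cost` is a hypothetical precomputed column: flags %*% c(1, 2, 1, 3).
purchases <- data.table(
  ID   = c(1, 1, 1, 4, 4, 4, 4),
  Date = as.Date(c("2016-05-28", "2016-05-26", "2016-08-09",
                   "2014-05-01", "2014-05-04", "2014-06-09", "2014-07-07")),
  cost = c(5, 5, 6, 3, 3, 7, 3))
stays <- data.table(
  Stay.ID    = c(1, 2, 10, 11),
  ID         = c(1, 1, 4, 4),
  Start.Date = as.Date(c("2016-05-26", "2016-08-01", "2014-04-07", "2014-06-01")),
  End.Date   = as.Date(c("2016-06-06", "2016-09-01", "2014-05-09", "2014-07-11")))

# Variant 1: only purchases inside the stay window (what the answer's join does)
in.stay <- purchases[stays, on = .(ID, Date >= Start.Date, Date <= End.Date),
                     .(Stay.ID, cost.index = sum(cost)), by = .EACHI]

# Variant 2: every purchase by the same ID on or before End.Date
up.to.end <- purchases[stays, on = .(ID, Date <= End.Date),
                       .(Stay.ID, cost.index = sum(cost)), by = .EACHI]

print(in.stay[, .(Stay.ID, cost.index)])
print(up.to.end[, .(Stay.ID, cost.index)])
```

On this subset the window variant gives 10, 6, 6, 10 and the cumulative variant gives 10, 16, 6, 16 for Stay.IDs 1, 2, 10, 11. The question's expected table lists 10, 16, 6, 10, so neither variant reproduces it exactly. It is worth confirming which rule is intended before running this on the full data.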

In 1.9.6, you can use foverlaps instead:

foverlaps(setkey(DT.Purchase[, Date2 := Date], ID, Date, Date2),
setkey(DT.Stay, ID, Start.Date, End.Date))[,
sum(as.matrix(.SD) %*% c.weights), keyby = Stay.ID, .SDcols = JFK:LPG]

Regarding "r - Optimizing data.table in R?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39798523/
