
R: using ddply or aggregate

Reposted. Author: 行者123. Updated: 2023-12-04 01:40:04

I have a data frame with 3 columns: custId, saleDate, DelivDate.

> head(events22)
     custId            saleDate      DelivDate
1 280356593 2012-11-14 14:04:59 11/14/12 17:29
2 280367076 2012-11-14 17:04:44 11/14/12 20:48
3 280380097 2012-11-14 17:38:34 11/14/12 20:45
4 280380095 2012-11-14 20:45:44 11/14/12 23:59
5 280380095 2012-11-14 20:31:39 11/14/12 23:49
6 280380095 2012-11-14 19:58:32 11/15/12 00:10

Here is the dput:
> dput(events22)
structure(list(custId = c(280356593L, 280367076L, 280380097L,
280380095L, 280380095L, 280380095L, 280364279L, 280364279L, 280398506L,
280336395L, 280364376L, 280368458L, 280368458L, 280368456L, 280368456L,
280364225L, 280391721L, 280353458L, 280387607L, 280387607L),
saleDate = structure(c(1352901899.215, 1352912684.484, 1352914714.971,
1352925944.429, 1352925099.247, 1352923112.636, 1352922476.55,
1352920666.968, 1352915226.534, 1352911135.077, 1352921349.592,
1352911494.975, 1352910529.86, 1352924755.295, 1352907511.476,
1352920108.577, 1352906160.883, 1352905925.134, 1352916810.309,
1352916025.673), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
DelivDate = c("11/14/12 17:29", "11/14/12 20:48", "11/14/12 20:45",
"11/14/12 23:59", "11/14/12 23:49", "11/15/12 00:10", "11/14/12 23:35",
"11/14/12 22:59", "11/14/12 20:53", "11/14/12 19:52", "11/14/12 23:01",
"11/14/12 19:47", "11/14/12 19:42", "11/14/12 23:31", "11/14/12 23:33",
"11/14/12 22:45", "11/14/12 18:11", "11/14/12 18:12", "11/14/12 19:17",
"11/14/12 19:19")), .Names = c("custId", "saleDate", "DelivDate"
), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9",
"10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20"
), class = "data.frame")

For each custId, I am trying to find the DelivDate that belongs to the most recent saleDate.

I can do this with plyr::ddply like so:
dd1 <- ddply(events22, .(custId), .inform = TRUE, function(x) {
  x[x$saleDate == max(x$saleDate), "DelivDate"]
})

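My question is whether there is a faster way to do this, because the ddply approach is somewhat slow (the full dataset has about 400k rows). I have looked at aggregate(), but I don't know how to return a value other than the one I am taking the maximum of.

Any suggestions?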

Edit:

Here are the benchmark results for 10k rows at 10 replications:
         test replications elapsed relative user.self
2      AGG2()           10    5.96    1.000      5.93
1      AGG1()           10   20.87    3.502     20.75
5 DATATABLE()           10   61.32   10.289     60.31
3     DDPLY()           10   80.04   13.430     79.63
4    DOCALL()           10   90.43   15.173     88.39

Edit 2:
Although it is the fastest, AGG2() does not give the correct answer.
> head(agg2)
     custId            saleDate      DelivDate
1 280336395 2012-11-14 16:38:55 11/14/12 19:52
2 280353458 2012-11-14 15:12:05 11/14/12 18:12
3 280356593 2012-11-14 14:04:59 11/14/12 17:29
4 280364225 2012-11-14 19:08:28 11/14/12 22:45
5 280364279 2012-11-14 19:47:56 11/14/12 23:35
6 280364376 2012-11-14 19:29:09 11/14/12 23:01
> agg2 <- AGG2()
> head(agg2)
     custId      DelivDate
1 280336395 11/14/12 17:29
2 280353458 11/14/12 17:29
3 280356593 11/14/12 17:29
4 280364225 11/14/12 17:29
5 280364279 11/14/12 17:29
6 280364376 11/14/12 17:29
> agg2 <- DDPLY()
> head(agg2)
     custId             V1
1 280336395 11/14/12 19:52
2 280353458 11/14/12 18:12
3 280356593 11/14/12 17:29
4 280364225 11/14/12 22:45
5 280364279 11/14/12 23:35
6 280364376 11/14/12 23:01

Best answer

I would also recommend data.table here, but since you asked for an aggregate solution, here is one that combines aggregate and merge to get all the columns:

merge(events22, aggregate(saleDate ~ custId, events22, max))

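Or just aggregate, if you only want the "custId" and "DelivDate" columns:
# Caution: which.max(x) is a position within one custId group, but it is
# used here to index the full DelivDate column, so the rows returned are
# not guaranteed to be the right ones (see Edit 2 in the question).
aggregate(list(DelivDate = events22$saleDate),
          list(custId = events22$custId),
          function(x) events22[["DelivDate"]][which.max(x)])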
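
Because FUN receives only one group's values at a time, which.max(x) in the call above is a position inside that group, while events22[["DelivDate"]] is the whole column. One possible way around this, sketched here rather than taken from the original answer, is to aggregate over row numbers so the index can be mapped back to the full data frame. The toy data frame `ev` and the helper name `latest_deliv` are made up for illustration; only the column names follow the question.

```r
# Toy data in the shape of events22 (values are illustrative, not from the question)
ev <- data.frame(
  custId    = c(1L, 2L, 2L, 3L, 3L, 3L),
  saleDate  = as.POSIXct(c("2012-11-14 14:00:00", "2012-11-14 17:00:00",
                           "2012-11-14 16:00:00", "2012-11-14 18:00:00",
                           "2012-11-14 20:00:00", "2012-11-14 19:00:00"),
                         tz = "UTC"),
  DelivDate = c("d1", "d2", "d3", "d4", "d5", "d6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: aggregate over row numbers instead of saleDate values.
# Inside FUN, `i` holds the row numbers of one custId group, so
# i[which.max(...)] is a row number in the full data frame.
latest_deliv <- function(df) {
  aggregate(
    list(DelivDate = seq_len(nrow(df))),
    list(custId = df$custId),
    function(i) df$DelivDate[i[which.max(df$saleDate[i])]]
  )
}

res <- latest_deliv(ev)
print(res)
```

On the real data the call would be latest_deliv(events22). Ties on saleDate resolve to the first matching row, because which.max() returns the first maximum.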

Finally, here is an option using sqldf:
library(sqldf)
sqldf("select custId, DelivDate, max(saleDate) `saleDate`
from events22 group by custId")
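
One more base R option, which is not part of the original answer and was not benchmarked there: sort the rows so that each customer's latest sale comes first, then keep the first row per custId. This is only a sketch on made-up data (the data frame `ev` is hypothetical), and its speed on the real 400k-row data would have to be measured.

```r
# Toy data in the shape of events22 (values are illustrative, not from the question)
ev <- data.frame(
  custId    = c(1L, 2L, 2L, 3L, 3L, 3L),
  saleDate  = as.POSIXct(c("2012-11-14 14:00:00", "2012-11-14 17:00:00",
                           "2012-11-14 16:00:00", "2012-11-14 18:00:00",
                           "2012-11-14 20:00:00", "2012-11-14 19:00:00"),
                         tz = "UTC"),
  DelivDate = c("d1", "d2", "d3", "d4", "d5", "d6"),
  stringsAsFactors = FALSE
)

# Sort by customer, then by sale time descending (latest sale first).
ord <- order(ev$custId, -as.numeric(ev$saleDate))
ev_sorted <- ev[ord, ]

# duplicated() flags every row after the first within each custId,
# so negating it keeps exactly one row per customer: the latest sale.
latest <- ev_sorted[!duplicated(ev_sorted$custId), ]
print(latest)
```

This keeps all columns, like the merge approach, but returns exactly one row per customer even when two sales share the same maximum saleDate.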

Benchmarks

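I am not an expert in benchmarking or data.table, but it surprises me that data.table is not faster here. I suspect the results would be quite different on a larger dataset, for example your 400k rows. In any case, here is some benchmarking code modeled after @mnel's answer here, so you can run tests on your actual dataset for future reference.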
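library(rbenchmark)
library(plyr)        # ddply
library(data.table)  # used by DATATABLE()
library(sqldf)       # used by SQLDF()
# Assumed setup: `dt` is used below but never defined in the original
dt <- as.data.table(events22)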

First, set up a function for each approach you want to benchmark.
DDPLY <- function() {
  x <- ddply(events22, .(custId), .inform = TRUE,
             function(x) {
               x[x$saleDate == max(x$saleDate), "DelivDate"]
             })
}
DATATABLE <- function() {
  x <- dt[, .SD[which.max(saleDate), ], by = custId]
}
AGG1 <- function() {
  x <- merge(events22, aggregate(saleDate ~ custId, events22, max))
}
# Note: which.max(x) is a within-group position used to index the full
# column, so AGG2() is fast but does not return the correct rows.
AGG2 <- function() {
  x <- aggregate(list(DelivDate = events22$saleDate),
                 list(custId = events22$custId),
                 function(x) events22[["DelivDate"]][which.max(x)])
}
SQLDF <- function() {
  x <- sqldf("select custId, DelivDate, max(saleDate) `saleDate`
              from events22 group by custId")
}
DOCALL <- function() {
  do.call(rbind,
          lapply(split(events22, events22$custId), function(x) {
            x[which.max(x$saleDate), ]
          }))
}

Second, run the benchmark.
benchmark(DDPLY(), DATATABLE(), AGG1(), AGG2(), SQLDF(), DOCALL(), 
order = "elapsed")[1:5]
#          test replications elapsed relative user.self
# 4      AGG2()          100   0.285    1.000     0.284
# 3      AGG1()          100   0.891    3.126     0.896
# 6    DOCALL()          100   1.202    4.218     1.204
# 2 DATATABLE()          100   1.251    4.389     1.248
# 1     DDPLY()          100   1.254    4.400     1.252
# 5     SQLDF()          100   2.109    7.400     2.108
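
The DATATABLE() function above subsets .SD once per group, which is often cited as a slow pattern in grouped data.table code. A sketch of an alternative that first collects row numbers with .I and then subsets the table once is shown below. It assumes the data.table package is installed; `ev` and `dt` here are toy objects built for illustration, and no timing is claimed for the real data.

```r
library(data.table)

# Toy data in the shape of events22 (values are illustrative, not from the question)
ev <- data.frame(
  custId    = c(1L, 2L, 2L, 3L, 3L, 3L),
  saleDate  = as.POSIXct(c("2012-11-14 14:00:00", "2012-11-14 17:00:00",
                           "2012-11-14 16:00:00", "2012-11-14 18:00:00",
                           "2012-11-14 20:00:00", "2012-11-14 19:00:00"),
                         tz = "UTC"),
  DelivDate = c("d1", "d2", "d3", "d4", "d5", "d6"),
  stringsAsFactors = FALSE
)
dt <- as.data.table(ev)

# Step 1: within each custId, .I holds the row numbers in dt;
# keep the row number of the latest saleDate.
idx <- dt[, .I[which.max(saleDate)], by = custId]$V1

# Step 2: subset the table once using those row numbers.
latest <- dt[idx]
print(latest)
```

To compare it with the other approaches, it could be wrapped in a function and added to the benchmark() call in the same way as the functions above.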

For more on R: using ddply or aggregate, see the similar question found on Stack Overflow: https://stackoverflow.com/questions/14048739/
