
r - Fastest way to join two data.tables while aggregating one of them

Reposted · Author: 行者123 · Updated: 2023-12-04 09:12:33

Question

I have two data.tables.

Movements:

library(data.table)

consEx = data.table(
  begin = as.POSIXct(c(
    "2019-04-01 00:00:10", "2019-04-07 10:00:00", "2019-04-10 23:00:00",
    "2019-04-12 20:00:00", "2019-04-15 10:00:00", "2019-04-20 10:00:00",
    "2019-04-22 13:30:00", "2019-04-10 15:30:00", "2019-04-12 21:30:00",
    "2019-04-15 20:00:00")),
  end = as.POSIXct(c(
    "2019-04-01 20:00:00", "2019-04-07 15:00:00", "2019-04-11 10:00:00",
    "2019-04-12 23:30:00", "2019-04-15 15:00:00", "2019-04-21 12:00:00",
    "2019-04-22 17:30:00", "2019-04-10 20:00:00", "2019-04-13 05:00:00",
    "2019-04-15 12:30:00")),
  carId = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5),
  tripId = 1:10
)


And alerts:

alertsEx = data.table(
  timestamp = as.POSIXct(c(
    "2019-04-01 10:00:00", "2019-04-01 10:30:00", "2019-04-01 15:00:00",
    "2019-04-15 13:00:00", "2019-04-22 14:00:00", "2019-04-22 15:10:00",
    "2019-04-22 15:40:00", "2019-04-10 16:00:00", "2019-04-10 17:00:00",
    "2019-04-13 04:00:00")),
  type = c("T1", "T2", "T1", "T3", "T1", "T1", "T3", "T2", "T2", "T1"),
  carId = c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4),
  additionalInfo1 = rnorm(10, mean = 10, sd = 4)
)


The movements table records the periods (begin to end) during which a car was moving. The alerts table shows the times at which a car had an alert, and contains the type, timestamp, and carId.

I need to join these two tables and aggregate the alert data by type. When there are multiple alerts of the same type within the same period, additionalInfo1 should be summarised by its mean.

Current approach

I am currently doing this by looping over consEx and applying to each row a function that returns a list with the required computations:

findAlerts = function(begin, end, carId_2) {
  saida = alertsEx[timestamp >= begin & timestamp <= end & carId == carId_2, ]
  totals = nrow(saida)

  saida = split(saida, by = "type")

  resultsList = list(
    "totals"  = 0,
    "t1Count" = 0,
    "t1Mean"  = 0,
    "t2Count" = 0,
    "t2Mean"  = 0,
    "t3Count" = 0,
    "t3Mean"  = 0)
  resultsList[["totals"]] = totals

  types = names(saida)
  if ("T1" %in% types) {
    resultsList[["t1Count"]] = nrow(saida[["T1"]])
    resultsList[["t1Mean"]]  = mean(saida[["T1"]]$additionalInfo1)
  }

  if ("T2" %in% types) {
    resultsList[["t2Count"]] = nrow(saida[["T2"]])
    resultsList[["t2Mean"]]  = mean(saida[["T2"]]$additionalInfo1)
  }

  if ("T3" %in% types) {
    resultsList[["t3Count"]] = nrow(saida[["T3"]])
    resultsList[["t3Mean"]]  = mean(saida[["T3"]]$additionalInfo1)
  }

  return(resultsList)
}

for (i in 1:nrow(consEx)) {
  aux = findAlerts(consEx$begin[i], consEx$end[i], consEx$carId[i])
  consEx[i, names(aux) := aux]
}


Expected output

This gives the expected output:

                  begin                 end carId tripId totals t1Count    t1Mean t2Count    t2Mean t3Count    t3Mean
 1: 2019-04-01 00:00:10 2019-04-01 20:00:00     1      1      3       2 10.123463       1 14.479288       0  0.000000
 2: 2019-04-07 10:00:00 2019-04-07 15:00:00     1      2      0       0  0.000000       0  0.000000       0  0.000000
 3: 2019-04-10 23:00:00 2019-04-11 10:00:00     1      3      0       0  0.000000       0  0.000000       0  0.000000
 4: 2019-04-12 20:00:00 2019-04-12 23:30:00     2      4      0       0  0.000000       0  0.000000       0  0.000000
 5: 2019-04-15 10:00:00 2019-04-15 15:00:00     2      5      1       0  0.000000       0  0.000000       1  6.598062
 6: 2019-04-20 10:00:00 2019-04-21 12:00:00     3      6      0       0  0.000000       0  0.000000       0  0.000000
 7: 2019-04-22 13:30:00 2019-04-22 17:30:00     3      7      3       2  7.610410       0  0.000000       1 10.218593
 8: 2019-04-10 15:30:00 2019-04-10 20:00:00     4      8      2       0  0.000000       2  9.703278       0  0.000000
 9: 2019-04-12 21:30:00 2019-04-13 05:00:00     4      9      1       1  7.095564       0  0.000000       0  0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00     5     10      0       0  0.000000       0  0.000000       0  0.000000


But this is too slow for my real data, which has 13 million rows in consEx and 2 M in alerts. There are also several more types, and I need to compute the mean, minimum, and maximum for two metrics.

Is there a faster way to write the function or the loop?

Also, would writing essentially the same code in Python give better results? I am considering rewriting it, but that would take a long time and I am not sure it would solve the problem.

Thanks.

PS: Another approach I tried was looping over the alerts, assigning each alert a tripId from consEx, and then aggregating the resulting table, but that seemed even slower.

Best answer

This answer presents two data.table approaches. Both use a non-equi join to match alerts to trips and dcast() for reshaping, but they differ in how the aggregates are combined with consEx.

The first variant creates a new result dataset, leaving consEx unchanged.

The second variant modifies consEx in place. It is similar to Alexis' solution but more concise, and it uses the new setnafill() function (available in the development version 1.12.3 of data.table).

Variant 1: Create a new result dataset

agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp),
              .(tripId, type, additionalInfo1)][
  , .(Count = .N, Mean = mean(additionalInfo1)), by = .(tripId, type)][
  , totals := sum(Count), by = tripId][]

result <- dcast(agg[consEx, on = "tripId"], ... ~ type,
                value.var = c("Count", "Mean"), fill = 0)[
  , c("Count_NA", "Mean_NA") := NULL][
  is.na(totals), totals := 0]

setcolorder(result, names(consEx))
result


                  begin                 end carId tripId totals Count_T1 Count_T2 Count_T3   Mean_T1   Mean_T2   Mean_T3
 1: 2019-04-01 00:00:10 2019-04-01 20:00:00     1      1      3        2        1        0 12.654609 12.375862  0.000000
 2: 2019-04-07 10:00:00 2019-04-07 15:00:00     1      2      0        0        0        0  0.000000  0.000000  0.000000
 3: 2019-04-10 23:00:00 2019-04-11 10:00:00     1      3      0        0        0        0  0.000000  0.000000  0.000000
 4: 2019-04-12 20:00:00 2019-04-12 23:30:00     2      4      0        0        0        0  0.000000  0.000000  0.000000
 5: 2019-04-15 10:00:00 2019-04-15 15:00:00     2      5      1        0        0        1  0.000000  0.000000  9.316815
 6: 2019-04-20 10:00:00 2019-04-21 12:00:00     3      6      0        0        0        0  0.000000  0.000000  0.000000
 7: 2019-04-22 13:30:00 2019-04-22 17:30:00     3      7      3        2        0        1  8.586061  0.000000 11.498512
 8: 2019-04-10 15:30:00 2019-04-10 20:00:00     4      8      2        0        2        0  0.000000  8.696356  0.000000
 9: 2019-04-12 21:30:00 2019-04-13 05:00:00     4      9      1        1        0        0  9.343681  0.000000  0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00     5     10      0        0        0        0  0.000000  0.000000  0.000000


Note that the means may differ between the posted answers because additionalInfo1 was created by rnorm(10, mean = 10, sd = 4) without a prior call to set.seed(). Calling set.seed() creates reproducible random numbers.
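For illustration, fixing the seed before drawing makes repeated runs identical (the seed value 1 here is arbitrary):

```r
library(data.table)

set.seed(1)                       # fix the RNG state
a <- rnorm(10, mean = 10, sd = 4)
set.seed(1)                       # same seed ...
b <- rnorm(10, mean = 10, sd = 4)
identical(a, b)                   # ... identical draws: TRUE
```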
Explanation of variant 1

The first part finds the matching trip for each alert via a non-equi join and computes the aggregates by tripId and type, as well as the totals for each trip:
agg


   tripId type Count      Mean totals
1:      1   T1     2 12.654609      3
2:      1   T2     1 12.375862      3
3:      5   T3     1  9.316815      1
4:      7   T1     2  8.586061      3
5:      7   T3     1 11.498512      3
6:      8   T2     2  8.696356      2
7:      9   T1     1  9.343681      1
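The mechanics of the non-equi join can be seen on a small made-up example (toy tables, not the question's data): for each row of the inner table, data.table looks up the rows of the outer table whose interval contains the value.

```r
library(data.table)

periods <- data.table(pid = 1:2, lo = c(0, 10), hi = c(5, 15))
events  <- data.table(t = c(3, 12, 20))

# right join: one output row per event; after the join, the join column
# lo holds the event's value t, and pid is NA where no interval matched
periods[events, on = .(lo <= t, hi >= t), .(pid, t = lo)]
#    pid  t
# 1:   1  3
# 2:   2 12
# 3:  NA 20
```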


Note that tripId is assumed to be a unique key in consEx.

In the next step, the aggregates are right joined to consEx:
agg[consEx, on = "tripId"]


    tripId type Count      Mean totals               begin                 end carId
 1:      1   T1     2 12.654609      3 2019-04-01 00:00:10 2019-04-01 20:00:00     1
 2:      1   T2     1 12.375862      3 2019-04-01 00:00:10 2019-04-01 20:00:00     1
 3:      2 <NA>    NA        NA     NA 2019-04-07 10:00:00 2019-04-07 15:00:00     1
 4:      3 <NA>    NA        NA     NA 2019-04-10 23:00:00 2019-04-11 10:00:00     1
 5:      4 <NA>    NA        NA     NA 2019-04-12 20:00:00 2019-04-12 23:30:00     2
 6:      5   T3     1  9.316815      1 2019-04-15 10:00:00 2019-04-15 15:00:00     2
 7:      6 <NA>    NA        NA     NA 2019-04-20 10:00:00 2019-04-21 12:00:00     3
 8:      7   T1     2  8.586061      3 2019-04-22 13:30:00 2019-04-22 17:30:00     3
 9:      7   T3     1 11.498512      3 2019-04-22 13:30:00 2019-04-22 17:30:00     3
10:      8   T2     2  8.696356      2 2019-04-10 15:30:00 2019-04-10 20:00:00     4
11:      9   T1     1  9.343681      1 2019-04-12 21:30:00 2019-04-13 05:00:00     4
12:     10 <NA>    NA        NA     NA 2019-04-15 20:00:00 2019-04-15 12:30:00     5


Then dcast() reshapes the output from long to wide format.

Finally, the result is cleaned up by removing superfluous columns, replacing NA by 0, and reordering the columns.
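The long-to-wide reshape itself can be sketched on a toy table (hypothetical data): dcast() spreads type into columns, and fill = 0 supplies the value for combinations that do not occur.

```r
library(data.table)

DT <- data.table(tripId = c(1, 1, 2),
                 type   = c("T1", "T2", "T1"),
                 Count  = c(2, 1, 3))
# one row per tripId; one column per type; missing combinations become 0
dcast(DT, tripId ~ type, value.var = "Count", fill = 0)
#    tripId T1 T2
# 1:      1  2  1
# 2:      2  3  0
```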
Variant 2: Update in place
agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp),
              .(tripId, type, additionalInfo1)][
  , .(Count = .N, Mean = mean(additionalInfo1)), by = .(tripId, type)][
  , totals := sum(Count), by = tripId][]

wide <- dcast(agg, ... ~ type, value.var = c("Count", "Mean"), fill = 0)

consEx[wide, on = "tripId", (names(wide)) := mget(paste0("i.", names(wide)))]
setnafill(consEx, fill = 0) # data.table version 1.12.3+

consEx


                  begin                 end carId tripId totals Count_T1 Count_T2 Count_T3   Mean_T1   Mean_T2   Mean_T3
 1: 2019-04-01 00:00:10 2019-04-01 20:00:00     1      1      3        2        1        0 12.654609 12.375862  0.000000
 2: 2019-04-07 10:00:00 2019-04-07 15:00:00     1      2      0        0        0        0  0.000000  0.000000  0.000000
 3: 2019-04-10 23:00:00 2019-04-11 10:00:00     1      3      0        0        0        0  0.000000  0.000000  0.000000
 4: 2019-04-12 20:00:00 2019-04-12 23:30:00     2      4      0        0        0        0  0.000000  0.000000  0.000000
 5: 2019-04-15 10:00:00 2019-04-15 15:00:00     2      5      1        0        0        1  0.000000  0.000000  9.316815
 6: 2019-04-20 10:00:00 2019-04-21 12:00:00     3      6      0        0        0        0  0.000000  0.000000  0.000000
 7: 2019-04-22 13:30:00 2019-04-22 17:30:00     3      7      3        2        0        1  8.586061  0.000000 11.498512
 8: 2019-04-10 15:30:00 2019-04-10 20:00:00     4      8      2        0        2        0  0.000000  8.696356  0.000000
 9: 2019-04-12 21:30:00 2019-04-13 05:00:00     4      9      1        1        0        0  9.343681  0.000000  0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00     5     10      0        0        0        0  0.000000  0.000000  0.000000


Explanation of variant 2

The first part is the same as in variant 1.

In the next step, the aggregates are reshaped from long to wide format.

Then an update join is performed, in which consEx is updated in place with the matching rows of wide. The expression
(names(wide)) := mget(paste0("i.", names(wide))) 

is a shortcut for
`:=`(totals = i.totals, Count_T1 = i.Count_T1, ..., Mean_T3 = i.Mean_T3)

where the i. prefix refers to the columns of wide and is used to avoid ambiguity.

In the rows of consEx that have no match, the appended columns contain NA. So, finally, setnafill() is used to replace all NA values by 0.
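Both steps can be seen on a minimal made-up example (toy tables; assumes data.table 1.12.3+ for setnafill()):

```r
library(data.table) # 1.12.3+ for setnafill()

DT   <- data.table(id = 1:3)
wide <- data.table(id = c(1L, 3L), y = c(10, 30))

DT[wide, on = "id", y := i.y]   # update join: only rows 1 and 3 match
setnafill(DT, fill = 0)         # replace the NA in the unmatched row 2
DT
#    id  y
# 1:  1 10
# 2:  2  0
# 3:  3 30
```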
Benchmarks
At the time of writing, 6 different solutions had been posted, by:

Fino (the OP) in base R,
Dave2e using dplyr and tidyr,
Alexandre B. using Python,
Alexis using data.table,
and the 2 data.table variants above.

The 5 R solutions are compared using press() and mark() from the bench package.

The sample datasets are parameterised to vary the number of rows. As some solutions modify consEx, each benchmark run starts with a fresh copy.

Some minor changes were required to make the code work with bench. Most notably, the shortcut formula ... ~ type in dcast() caused an error when called within bench::mark() and had to be replaced by a version that explicitly names all LHS variables.
As the OP has pointed out that his version is quite slow, the first benchmark uses only 1000 rows of consEx and 50, 200, and 1000 rows of alertsEx:
bm <- bench::press(
n_trips = 1E3,
n_alerts = 1E3/c(20, 5, 1),
{
n_cars <- 5
n_periods <- round(n_trips / n_cars)
n_types = 3
types = sprintf("T%02i", 1:n_types)
consEx0 <- data.table(
tripId = 1:n_trips,
carId = rep(1:n_cars, each = n_periods),
begin = seq.POSIXt(as.POSIXct("2000-01-01"), length.out = n_periods, by = "14 days") %>%
rep(n_cars) %>%
`mode<-`("double")
)[, end := begin + 7*24*60*60]
set.seed(1)
idx <- sample(n_trips, n_alerts, TRUE)
alertsEx <- consEx0[
idx,
.(timestamp = begin + runif(n_alerts, 1, 7*24*60*60 - 1),
type = sample(types, n_alerts, TRUE),
carId,
additionalInfo1 = rnorm(n_alerts, mean = 10, sd = 4))]

bench::mark(
fino = {
consEx <- copy(consEx0)
findAlerts = function(begin,end,carId_2){
saida = alertsEx[timestamp >= begin & timestamp <= end & carId == carId_2,]
totals = nrow(saida)
saida = split(saida,by="type")
resultsList = list(
"totals" = 0,
"t1Count" = 0,
"t1Mean" = 0,
"t2Count" = 0,
"t2Mean" = 0,
"t3Count" = 0,
"t3Mean" = 0)
resultsList[["totals"]] = totals
types = names(saida)
if("T1" %in% types){
resultsList[["t1Count"]] = nrow(saida[["T1"]])
resultsList[["t1Mean"]] = mean(saida[["T1"]]$additionalInfo1)
}
if("T2" %in% types){
resultsList[["t2Count"]] = nrow(saida[["T2"]])
resultsList[["t2Mean"]] = mean(saida[["T2"]]$additionalInfo1)
}
if("T3" %in% types){
resultsList[["t3Count"]] = nrow(saida[["T3"]])
resultsList[["t3Mean"]] = mean(saida[["T3"]]$additionalInfo1)
}
return(resultsList)
}
for(i in 1:nrow(consEx)){
aux = findAlerts(consEx$begin[i],consEx$end[i],consEx$carId[i])
consEx[i,names(aux) := aux]
}
consEx[]
},
dave = {
# library(dplyr)
# library(tidyr)

consEx <- copy(consEx0)
#split the consEx down to separate carId
splitcons <- split(consEx, consEx$carId)

alertsEx$tripId <- NA
#loop and only search the cons that match the alerts
alertsEx$tripId <- apply(alertsEx, 1, function(a) {
#retrieve the list associated with the correct car
cons <- splitcons[[a[["carId"]]]]
alerttime <- as.POSIXct(a[["timestamp"]])
#find the trip which contains the alerttime
tripId <-
which((alerttime >= cons$begin) & (alerttime <= cons$end))
#identify and return the tripId
cons$tripId[tripId]
})

#Referenced:
#https://stackoverflow.com/questions/30592094/r-spreading-multiple-columns-with-tidyr
#https://stackoverflow.com/questions/35321497/spread-multiple-columns-tidyr
alertsEx2 <-
alertsEx %>%
dplyr::group_by(carId, tripId, type) %>%
dplyr::summarize(mean = mean(additionalInfo1), count = dplyr::n())
alerttable <-
tidyr::gather(alertsEx2, variable, val, -(carId:type), na.rm = TRUE) %>%
tidyr::unite(temp, type, variable) %>%
tidyr::spread(temp, val, fill = 0)
consEx %>%
dplyr::left_join(alerttable, by = c("tripId", "carId"))

},
alexis = {
consEx <- copy(consEx0)
joined <- consEx[alertsEx,
.(carId, tripId, type, additionalInfo1),
on = .(carId, begin <= timestamp, end >= timestamp)]
aggregated <- joined[, .(typeCount = .N, typeMean = mean(additionalInfo1)), by = .(carId, tripId, type)]
totals <- aggregated[, .(totals = sum(typeCount)), by = .(carId, tripId)]
long <- dcast(aggregated, carId + tripId ~ type, value.var = c("typeCount", "typeMean"), sep = "", fill = 0)
replaceNA <- function(x) { replace(x, is.na(x), 0) }
consEx[, `:=`(as.character(outer(types, c("Count", "Mean"), paste0)),
lapply(long[consEx,
as.character(outer(types, c("typeCount", "typeMean"),
function(a, b) { paste0(b, a) })),
with = FALSE,
on = .(carId, tripId)],
replaceNA))]
consEx[, totals := lapply(totals[consEx, x.totals, on = .(carId, tripId)], replaceNA)]
setcolorder(consEx, c("carId", "tripId", "begin", "end"))
consEx
},
uwe1 = {
consEx <- copy(consEx0)
agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp),
.(tripId, type, additionalInfo1)][
, .(Count = .N, Mean = mean(additionalInfo1)),
by = .(tripId, type)][
, totals := sum(Count), by = tripId][]
result <- dcast(agg[consEx, on = "tripId"],
tripId + carId + begin+ end + totals ~ type,
value.var = c("Count", "Mean"), fill = 0)[
, c("Count_NA", "Mean_NA") := NULL][
is.na(totals), totals := 0][]
# setcolorder(result, names(consEx))
result
},
uwe2 = {
consEx <- copy(consEx0)
agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp),
.(tripId, type, additionalInfo1)][
, .(Count = .N, Mean = mean(additionalInfo1)),
by = .(tripId, type)][
, totals := sum(Count), by = tripId][]
wide <- dcast(agg, tripId + totals ~ type, value.var = c("Count", "Mean"), fill = 0)
consEx[wide, on = "tripId", (names(wide)) := mget(paste0("i.", names(wide)))]
setnafill(consEx, fill = 0) # data.table version 1.12.3+
consEx
},
check = FALSE
)
}
)

library(ggplot2)
autoplot(bm)

[plot: autoplot(bm) of the first benchmark]
Note the logarithmic time scale.

The chart confirms the OP's assumption that his approach is much slower than any other solution: it took 3 seconds for 1000 rows, while the OP's production dataset has 12 M rows. The run time of Dave2e's approach increases with the number of alerts.

Also, the OP's approach has by far the highest memory allocation: up to 190 MB, while the data.table versions need only about 1 MB.
bm


# A tibble: 15 x 15
expression n_trips n_alerts min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time
<bch:expr> <dbl> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis>
1 fino 1000 50 2.95s 2.95s 0.339 149.25MB 3.39 1 10 2.95s <data~ <Rpro~ <bch~
2 dave 1000 50 21.52ms 22.61ms 41.6 779.46KB 5.94 21 3 505.01ms <data~ <Rpro~ <bch~
3 alexis 1000 50 21.98ms 23.09ms 41.2 1007.84KB 3.93 21 2 509.5ms <data~ <Rpro~ <bch~
4 uwe1 1000 50 14.47ms 15.45ms 62.3 919.12KB 1.95 32 1 513.94ms <data~ <Rpro~ <bch~
5 uwe2 1000 50 14.14ms 14.73ms 64.2 633.2KB 1.95 33 1 513.91ms <data~ <Rpro~ <bch~
6 fino 1000 200 3.09s 3.09s 0.323 155.12MB 4.53 1 14 3.09s <data~ <Rpro~ <bch~
7 dave 1000 200 53.4ms 62.11ms 11.0 1.7MB 5.49 8 4 728.02ms <data~ <Rpro~ <bch~
8 alexis 1000 200 22.42ms 23.34ms 41.5 1.03MB 3.77 22 2 530.25ms <data~ <Rpro~ <bch~
9 uwe1 1000 200 14.72ms 15.76ms 62.2 953.07KB 0 32 0 514.46ms <data~ <Rpro~ <bch~
10 uwe2 1000 200 14.49ms 15.17ms 63.6 695.63KB 1.99 32 1 503.4ms <data~ <Rpro~ <bch~
11 fino 1000 1000 3.6s 3.6s 0.278 187.32MB 3.61 1 13 3.6s <data~ <Rpro~ <bch~
12 dave 1000 1000 242.46ms 243.7ms 4.07 6.5MB 6.78 3 5 737.06ms <data~ <Rpro~ <bch~
13 alexis 1000 1000 24.32ms 25.78ms 37.6 1.21MB 3.95 19 2 505.84ms <data~ <Rpro~ <bch~
14 uwe1 1000 1000 16.04ms 16.8ms 56.8 1.05MB 1.96 29 1 510.23ms <data~ <Rpro~ <bch~
15 uwe2 1000 1000 15.69ms 16.41ms 54.8 938.74KB 3.92 28 2 510.63ms <data~ <Rpro~ <bch~


Therefore, Fino's approach is omitted from the second benchmark, which uses larger problem sizes of 10 k and 100 k rows for consEx and 2 k and 20 k rows for alertsEx.
[plot: autoplot(bm) of the second benchmark]
As Dave2e's approach is about 100 times slower for 20 k rows than the fastest approach, it is omitted from the third run, which simulates the OP's production dataset of 12 M rows of consEx and 2 M rows of alertsEx:
[plot: autoplot(bm) of the third benchmark]
print(bm, n = Inf)


# A tibble: 27 x 15
expression n_trips n_alerts min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time
<bch:expr> <dbl> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis>
1 alexis 1.00e5 20000 747.87ms 747.87ms 1.34 38.18MB 6.69 1 5 747.87ms <data~ <Rpro~ <bch~
2 uwe1 1.00e5 20000 124.52ms 138.89ms 7.32 37.88MB 3.66 4 2 546.17ms <data~ <Rpro~ <bch~
3 uwe2 1.00e5 20000 72.29ms 73.76ms 11.3 16.63MB 1.88 6 1 533.26ms <data~ <Rpro~ <bch~
4 alexis 1.00e6 20000 7.51s 7.51s 0.133 335.44MB 3.33 1 25 7.51s <data~ <Rpro~ <bch~
5 uwe1 1.00e6 20000 995.31ms 995.31ms 1.00 346.94MB 2.01 1 2 995.31ms <data~ <Rpro~ <bch~
6 uwe2 1.00e6 20000 714.19ms 714.19ms 1.40 89.27MB 1.40 1 1 714.19ms <data~ <Rpro~ <bch~
7 alexis 1.00e7 20000 1.67m 1.67m 0.00999 3.21GB 0.340 1 34 1.67m <data~ <Rpro~ <bch~
8 uwe1 1.00e7 20000 3.44m 3.44m 0.00485 3.41GB 0.00969 1 2 3.44m <data~ <Rpro~ <bch~
9 uwe2 1.00e7 20000 42.51s 42.51s 0.0235 810.3MB 0 1 0 42.51s <data~ <Rpro~ <bch~
10 alexis 1.00e5 200000 15.38s 15.38s 0.0650 73.22MB 0.0650 1 1 15.38s <data~ <Rpro~ <bch~
11 uwe1 1.00e5 200000 1.34s 1.34s 0.747 63.81MB 0 1 0 1.34s <data~ <Rpro~ <bch~
12 uwe2 1.00e5 200000 1.47s 1.47s 0.681 58.98MB 0 1 0 1.47s <data~ <Rpro~ <bch~
13 alexis 1.00e6 200000 2.91m 2.91m 0.00573 375.93MB 0.0115 1 2 2.91m <data~ <Rpro~ <bch~
14 uwe1 1.00e6 200000 9.72s 9.72s 0.103 371.69MB 0 1 0 9.72s <data~ <Rpro~ <bch~
15 uwe2 1.00e6 200000 888.67ms 888.67ms 1.13 161.82MB 0 1 0 888.67ms <data~ <Rpro~ <bch~
16 alexis 1.00e7 200000 6.29m 6.29m 0.00265 3.15GB 0.0928 1 35 6.29m <data~ <Rpro~ <bch~
17 uwe1 1.00e7 200000 2.45m 2.45m 0.00681 3.43GB 0.0136 1 2 2.45m <data~ <Rpro~ <bch~
18 uwe2 1.00e7 200000 12.48m 12.48m 0.00134 887.99MB 0.00134 1 1 12.48m <data~ <Rpro~ <bch~
19 alexis 1.00e5 2000000 3.04s 3.04s 0.329 207MB 0 1 0 3.04s <data~ <Rpro~ <bch~
20 uwe1 1.00e5 2000000 2.96s 2.96s 0.338 196.42MB 0 1 0 2.96s <data~ <Rpro~ <bch~
21 uwe2 1.00e5 2000000 2.81s 2.81s 0.355 187.79MB 0 1 0 2.81s <data~ <Rpro~ <bch~
22 alexis 1.00e6 2000000 6.96m 6.96m 0.00239 726.14MB 0.00479 1 2 6.96m <data~ <Rpro~ <bch~
23 uwe1 1.00e6 2000000 2.01m 2.01m 0.00827 631.1MB 0 1 0 2.01m <data~ <Rpro~ <bch~
24 uwe2 1.00e6 2000000 30.54s 30.54s 0.0327 584.81MB 0 1 0 30.54s <data~ <Rpro~ <bch~
25 alexis 1.00e7 2000000 31.54m 31.54m 0.000528 3.66GB 0.0127 1 24 31.54m <data~ <Rpro~ <bch~
26 uwe1 1.00e7 2000000 8.72m 8.72m 0.00191 3.67GB 0 1 0 8.72m <data~ <Rpro~ <bch~
27 uwe2 1.00e7 2000000 12.35m 12.35m 0.00135 1.58GB 0 1 0 12.35m <data~ <Rpro~ <bch~


Caveat: note that for most parameter variations the run time was verified by only one measurement. In addition, garbage collection (gc) may have had an influence because of the limited amount of RAM on my computer (8 GB). So your mileage may vary.

The timings are not consistent. In general, uwe2 (update in place) is substantially faster than uwe1 (new result set), which confirms jangorecki's assumption. alexis is almost always the slowest. (Note the log time scale in the charts, or compare the timings in the tables.)

However, some parameter variations deviate from that pattern. For 10 M trips, uwe1 is faster than uwe2 in two cases, and alexis is catching up as well. These effects might be due to garbage collection.

The memory allocation of the update-in-place approach is much smaller, even if this does not always pay off in speed.

Again, interpret the results with care. The third benchmark should be repeated on a larger machine and with more patience to allow for repeated measurements.

A similar question on Stack Overflow: https://stackoverflow.com/questions/56297316/
