
r - Fastest way to join two data.tables while aggregating one of them

Reposted · Author: 行者123 · Updated: 2023-12-04 09:12:33

Question

I have two data.tables.

Movements:

library(data.table)

consEx = data.table(
  begin = as.POSIXct(c(
    "2019-04-01 00:00:10", "2019-04-07 10:00:00", "2019-04-10 23:00:00",
    "2019-04-12 20:00:00", "2019-04-15 10:00:00", "2019-04-20 10:00:00",
    "2019-04-22 13:30:00", "2019-04-10 15:30:00", "2019-04-12 21:30:00",
    "2019-04-15 20:00:00")),
  end = as.POSIXct(c(
    "2019-04-01 20:00:00", "2019-04-07 15:00:00", "2019-04-11 10:00:00",
    "2019-04-12 23:30:00", "2019-04-15 15:00:00", "2019-04-21 12:00:00",
    "2019-04-22 17:30:00", "2019-04-10 20:00:00", "2019-04-13 05:00:00",
    "2019-04-15 12:30:00")),
  carId = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5),
  tripId = 1:10
)


And alerts:

alertsEx = data.table(
  timestamp = as.POSIXct(c(
    "2019-04-01 10:00:00", "2019-04-01 10:30:00", "2019-04-01 15:00:00",
    "2019-04-15 13:00:00", "2019-04-22 14:00:00", "2019-04-22 15:10:00",
    "2019-04-22 15:40:00", "2019-04-10 16:00:00", "2019-04-10 17:00:00",
    "2019-04-13 04:00:00")),
  type = c("T1", "T2", "T1", "T3", "T1", "T1", "T3", "T2", "T2", "T1"),
  carId = c(1, 1, 1, 2, 3, 3, 3, 4, 4, 4),
  additionalInfo1 = rnorm(10, mean = 10, sd = 4)
)


The movements table records the periods (begin to end) during which a car was moving. The alerts table shows the times at which a car had an alert, and contains the type, timestamp, and carId.

I need to join these two tables and aggregate the alert data by type. When there are multiple alerts of the same type within the same period, additionalInfo1 should be summarised by its mean.

Current approach

I am currently doing this by looping over consEx and applying to each row a function that returns a list with the required computations:

findAlerts = function(begin, end, carId_2) {
  saida = alertsEx[timestamp >= begin & timestamp <= end & carId == carId_2, ]
  totals = nrow(saida)

  saida = split(saida, by = "type")

  resultsList = list(
    "totals"  = 0,
    "t1Count" = 0,
    "t1Mean"  = 0,
    "t2Count" = 0,
    "t2Mean"  = 0,
    "t3Count" = 0,
    "t3Mean"  = 0)
  resultsList[["totals"]] = totals

  types = names(saida)
  if ("T1" %in% types) {
    resultsList[["t1Count"]] = nrow(saida[["T1"]])
    resultsList[["t1Mean"]]  = mean(saida[["T1"]]$additionalInfo1)
  }

  if ("T2" %in% types) {
    resultsList[["t2Count"]] = nrow(saida[["T2"]])
    resultsList[["t2Mean"]]  = mean(saida[["T2"]]$additionalInfo1)
  }

  if ("T3" %in% types) {
    resultsList[["t3Count"]] = nrow(saida[["T3"]])
    resultsList[["t3Mean"]]  = mean(saida[["T3"]]$additionalInfo1)
  }

  return(resultsList)
}

for (i in 1:nrow(consEx)) {
  aux = findAlerts(consEx$begin[i], consEx$end[i], consEx$carId[i])
  consEx[i, names(aux) := aux]
}


Expected output

This gives the expected output:

                  begin                 end carId tripId totals t1Count    t1Mean t2Count    t2Mean t3Count    t3Mean
 1: 2019-04-01 00:00:10 2019-04-01 20:00:00     1      1      3       2 10.123463       1 14.479288       0  0.000000
 2: 2019-04-07 10:00:00 2019-04-07 15:00:00     1      2      0       0  0.000000       0  0.000000       0  0.000000
 3: 2019-04-10 23:00:00 2019-04-11 10:00:00     1      3      0       0  0.000000       0  0.000000       0  0.000000
 4: 2019-04-12 20:00:00 2019-04-12 23:30:00     2      4      0       0  0.000000       0  0.000000       0  0.000000
 5: 2019-04-15 10:00:00 2019-04-15 15:00:00     2      5      1       0  0.000000       0  0.000000       1  6.598062
 6: 2019-04-20 10:00:00 2019-04-21 12:00:00     3      6      0       0  0.000000       0  0.000000       0  0.000000
 7: 2019-04-22 13:30:00 2019-04-22 17:30:00     3      7      3       2  7.610410       0  0.000000       1 10.218593
 8: 2019-04-10 15:30:00 2019-04-10 20:00:00     4      8      2       0  0.000000       2  9.703278       0  0.000000
 9: 2019-04-12 21:30:00 2019-04-13 05:00:00     4      9      1       1  7.095564       0  0.000000       0  0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00     5     10      0       0  0.000000       0  0.000000       0  0.000000


But this is too slow for my real data, which has 13 million rows in consEx and 2 M in alerts. There are also several more types, and I need to compute the mean, minimum, and maximum for two metrics.

Is there a faster way to write the function or the loop?

Also, would writing essentially the same code in Python give better results? I am considering rewriting it, but that would take a long time and I am not sure it would solve the problem.

Thanks.

PS: Another approach I tried was looping over the alerts, assigning each alert a tripId from consEx, and then aggregating the resulting table, but that seemed even slower.

Best answer

This answer presents two data.table approaches. Both use a non-equi join to match alerts to trips and dcast() for reshaping, but they differ in how the aggregates are combined with consEx.

The first variant creates a new result dataset, leaving consEx unchanged.

The second variant modifies consEx in place. It is similar to Alexis' solution but more concise, and it uses the new setnafill() function (available in the development version 1.12.3 of data.table).

Variant 1: Create a new result dataset

agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp),
              .(tripId, type, additionalInfo1)][
  , .(Count = .N, Mean = mean(additionalInfo1)), by = .(tripId, type)][
  , totals := sum(Count), by = tripId][]

result <- dcast(agg[consEx, on = "tripId"], ... ~ type,
                value.var = c("Count", "Mean"), fill = 0)[
  , c("Count_NA", "Mean_NA") := NULL][
  is.na(totals), totals := 0]

setcolorder(result, names(consEx))
result


                  begin                 end carId tripId totals Count_T1 Count_T2 Count_T3   Mean_T1   Mean_T2   Mean_T3
 1: 2019-04-01 00:00:10 2019-04-01 20:00:00     1      1      3        2        1        0 12.654609 12.375862  0.000000
 2: 2019-04-07 10:00:00 2019-04-07 15:00:00     1      2      0        0        0        0  0.000000  0.000000  0.000000
 3: 2019-04-10 23:00:00 2019-04-11 10:00:00     1      3      0        0        0        0  0.000000  0.000000  0.000000
 4: 2019-04-12 20:00:00 2019-04-12 23:30:00     2      4      0        0        0        0  0.000000  0.000000  0.000000
 5: 2019-04-15 10:00:00 2019-04-15 15:00:00     2      5      1        0        0        1  0.000000  0.000000  9.316815
 6: 2019-04-20 10:00:00 2019-04-21 12:00:00     3      6      0        0        0        0  0.000000  0.000000  0.000000
 7: 2019-04-22 13:30:00 2019-04-22 17:30:00     3      7      3        2        0        1  8.586061  0.000000 11.498512
 8: 2019-04-10 15:30:00 2019-04-10 20:00:00     4      8      2        0        2        0  0.000000  8.696356  0.000000
 9: 2019-04-12 21:30:00 2019-04-13 05:00:00     4      9      1        1        0        0  9.343681  0.000000  0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00     5     10      0        0        0        0  0.000000  0.000000  0.000000


Note that the means may differ between the posted answers because additionalInfo1 was created by rnorm(10, mean = 10, sd = 4) without a prior call to set.seed(). Calling set.seed() creates reproducible random numbers.
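For illustration, fixing the seed before drawing makes repeated runs identical (the seed value 1 here is arbitrary):

```r
library(data.table)

set.seed(1)                       # fix the RNG state
a <- rnorm(10, mean = 10, sd = 4)
set.seed(1)                       # same seed ...
b <- rnorm(10, mean = 10, sd = 4)
identical(a, b)                   # ... identical draws: TRUE
```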
Explanation of variant 1

The first part finds the matching trip for each alert via a non-equi join and computes the aggregates by tripId and type, as well as the totals for each trip:
agg


   tripId type Count      Mean totals
1:      1   T1     2 12.654609      3
2:      1   T2     1 12.375862      3
3:      5   T3     1  9.316815      1
4:      7   T1     2  8.586061      3
5:      7   T3     1 11.498512      3
6:      8   T2     2  8.696356      2
7:      9   T1     1  9.343681      1
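The mechanics of the non-equi join can be seen on a small made-up example (toy tables, not the question's data): for each row of the inner table, data.table looks up the rows of the outer table whose interval contains the value.

```r
library(data.table)

periods <- data.table(pid = 1:2, lo = c(0, 10), hi = c(5, 15))
events  <- data.table(t = c(3, 12, 20))

# right join: one output row per event; after the join, the join column
# lo holds the event's value t, and pid is NA where no interval matched
periods[events, on = .(lo <= t, hi >= t), .(pid, t = lo)]
#    pid  t
# 1:   1  3
# 2:   2 12
# 3:  NA 20
```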


Note that tripId is assumed to be a unique key in consEx.

In the next step, the aggregates are right joined to consEx:
agg[consEx, on = "tripId"]


    tripId type Count      Mean totals               begin                 end carId
 1:      1   T1     2 12.654609      3 2019-04-01 00:00:10 2019-04-01 20:00:00     1
 2:      1   T2     1 12.375862      3 2019-04-01 00:00:10 2019-04-01 20:00:00     1
 3:      2 <NA>    NA        NA     NA 2019-04-07 10:00:00 2019-04-07 15:00:00     1
 4:      3 <NA>    NA        NA     NA 2019-04-10 23:00:00 2019-04-11 10:00:00     1
 5:      4 <NA>    NA        NA     NA 2019-04-12 20:00:00 2019-04-12 23:30:00     2
 6:      5   T3     1  9.316815      1 2019-04-15 10:00:00 2019-04-15 15:00:00     2
 7:      6 <NA>    NA        NA     NA 2019-04-20 10:00:00 2019-04-21 12:00:00     3
 8:      7   T1     2  8.586061      3 2019-04-22 13:30:00 2019-04-22 17:30:00     3
 9:      7   T3     1 11.498512      3 2019-04-22 13:30:00 2019-04-22 17:30:00     3
10:      8   T2     2  8.696356      2 2019-04-10 15:30:00 2019-04-10 20:00:00     4
11:      9   T1     1  9.343681      1 2019-04-12 21:30:00 2019-04-13 05:00:00     4
12:     10 <NA>    NA        NA     NA 2019-04-15 20:00:00 2019-04-15 12:30:00     5


Then dcast() reshapes the output from long to wide format.

Finally, the result is cleaned up by removing superfluous columns, replacing NA by 0, and reordering the columns.
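The long-to-wide reshape itself can be sketched on a toy table (hypothetical data): dcast() spreads type into columns, and fill = 0 supplies the value for combinations that do not occur.

```r
library(data.table)

DT <- data.table(tripId = c(1, 1, 2),
                 type   = c("T1", "T2", "T1"),
                 Count  = c(2, 1, 3))
# one row per tripId; one column per type; missing combinations become 0
dcast(DT, tripId ~ type, value.var = "Count", fill = 0)
#    tripId T1 T2
# 1:      1  2  1
# 2:      2  3  0
```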
Variant 2: Update in place
agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp),
              .(tripId, type, additionalInfo1)][
  , .(Count = .N, Mean = mean(additionalInfo1)), by = .(tripId, type)][
  , totals := sum(Count), by = tripId][]

wide <- dcast(agg, ... ~ type, value.var = c("Count", "Mean"), fill = 0)

consEx[wide, on = "tripId", (names(wide)) := mget(paste0("i.", names(wide)))]
setnafill(consEx, fill = 0) # data.table version 1.12.3+

consEx


                  begin                 end carId tripId totals Count_T1 Count_T2 Count_T3   Mean_T1   Mean_T2   Mean_T3
 1: 2019-04-01 00:00:10 2019-04-01 20:00:00     1      1      3        2        1        0 12.654609 12.375862  0.000000
 2: 2019-04-07 10:00:00 2019-04-07 15:00:00     1      2      0        0        0        0  0.000000  0.000000  0.000000
 3: 2019-04-10 23:00:00 2019-04-11 10:00:00     1      3      0        0        0        0  0.000000  0.000000  0.000000
 4: 2019-04-12 20:00:00 2019-04-12 23:30:00     2      4      0        0        0        0  0.000000  0.000000  0.000000
 5: 2019-04-15 10:00:00 2019-04-15 15:00:00     2      5      1        0        0        1  0.000000  0.000000  9.316815
 6: 2019-04-20 10:00:00 2019-04-21 12:00:00     3      6      0        0        0        0  0.000000  0.000000  0.000000
 7: 2019-04-22 13:30:00 2019-04-22 17:30:00     3      7      3        2        0        1  8.586061  0.000000 11.498512
 8: 2019-04-10 15:30:00 2019-04-10 20:00:00     4      8      2        0        2        0  0.000000  8.696356  0.000000
 9: 2019-04-12 21:30:00 2019-04-13 05:00:00     4      9      1        1        0        0  9.343681  0.000000  0.000000
10: 2019-04-15 20:00:00 2019-04-15 12:30:00     5     10      0        0        0        0  0.000000  0.000000  0.000000


Explanation of variant 2

The first part is the same as in variant 1.

In the next step, the aggregates are reshaped from long to wide format.

Then an update join is performed, in which consEx is updated in place with the matching rows of wide. The expression
(names(wide)) := mget(paste0("i.", names(wide))) 

is a shortcut for
`:=`(totals = i.totals, Count_T1 = i.Count_T1, ..., Mean_T3 = i.Mean_T3)

where the i. prefix refers to the columns of wide and is used to avoid ambiguity.

In the rows of consEx that have no match, the appended columns contain NA. So, finally, setnafill() is used to replace all NA values by 0.
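Both steps can be seen on a minimal made-up example (toy tables; assumes data.table 1.12.3+ for setnafill()):

```r
library(data.table) # 1.12.3+ for setnafill()

DT   <- data.table(id = 1:3)
wide <- data.table(id = c(1L, 3L), y = c(10, 30))

DT[wide, on = "id", y := i.y]   # update join: only rows 1 and 3 match
setnafill(DT, fill = 0)         # replace the NA in the unmatched row 2
DT
#    id  y
# 1:  1 10
# 2:  2  0
# 3:  3 30
```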
Benchmarks
At the time of writing, 6 different solutions had been posted, by:

Fino (the OP) in base R,
Dave2e using dplyr and tidyr,
Alexandre B. using Python,
Alexis using data.table,
and the 2 data.table variants above.

The 5 R solutions are compared using press() and mark() from the bench package.

The sample datasets are parameterised to vary the number of rows. As some solutions modify consEx, each benchmark run starts with a fresh copy.

Some minor changes were required to make the code work with bench. Most notably, the shortcut formula ... ~ type in dcast() caused an error when called within bench::mark() and had to be replaced by a version that explicitly names all LHS variables.
As the OP has pointed out that his version is quite slow, the first benchmark uses only 1000 rows of consEx and 50, 200, and 1000 rows of alertsEx:
bm <- bench::press(
n_trips = 1E3,
n_alerts = 1E3/c(20, 5, 1),
{
n_cars <- 5
n_periods <- round(n_trips / n_cars)
n_types = 3
types = sprintf("T%02i", 1:n_types)
consEx0 <- data.table(
tripId = 1:n_trips,
carId = rep(1:n_cars, each = n_periods),
begin = seq.POSIXt(as.POSIXct("2000-01-01"), length.out = n_periods, by = "14 days") %>%
rep(n_cars) %>%
`mode<-`("double")
)[, end := begin + 7*24*60*60]
set.seed(1)
idx <- sample(n_trips, n_alerts, TRUE)
alertsEx <- consEx0[
idx,
.(timestamp = begin + runif(n_alerts, 1, 7*24*60*60 - 1),
type = sample(types, n_alerts, TRUE),
carId,
additionalInfo1 = rnorm(n_alerts, mean = 10, sd = 4))]

bench::mark(
fino = {
consEx <- copy(consEx0)
findAlerts = function(begin,end,carId_2){
saida = alertsEx[timestamp >= begin & timestamp <= end & carId == carId_2,]
totals = nrow(saida)
saida = split(saida,by="type")
resultsList = list(
"totals" = 0,
"t1Count" = 0,
"t1Mean" = 0,
"t2Count" = 0,
"t2Mean" = 0,
"t3Count" = 0,
"t3Mean" = 0)
resultsList[["totals"]] = totals
types = names(saida)
if("T1" %in% types){
resultsList[["t1Count"]] = nrow(saida[["T1"]])
resultsList[["t1Mean"]] = mean(saida[["T1"]]$additionalInfo1)
}
if("T2" %in% types){
resultsList[["t2Count"]] = nrow(saida[["T2"]])
resultsList[["t2Mean"]] = mean(saida[["T2"]]$additionalInfo1)
}
if("T3" %in% types){
resultsList[["t3Count"]] = nrow(saida[["T3"]])
resultsList[["t3Mean"]] = mean(saida[["T3"]]$additionalInfo1)
}
return(resultsList)
}
for(i in 1:nrow(consEx)){
aux = findAlerts(consEx$begin[i],consEx$end[i],consEx$carId[i])
consEx[i,names(aux) := aux]
}
consEx[]
},
dave = {
# library(dplyr)
# library(tidyr)

consEx <- copy(consEx0)
#split the consEx down to separate carId
splitcons <- split(consEx, consEx$carId)

alertsEx$tripId <- NA
#loop and only search the cons that match the alerts
alertsEx$tripId <- apply(alertsEx, 1, function(a) {
#retrieve the list associated with the correct car
cons <- splitcons[[a[["carId"]]]]
alerttime <- as.POSIXct(a[["timestamp"]])
#find the trip which contains the alerttime
tripId <-
which((alerttime >= cons$begin) & (alerttime <= cons$end))
#identify and return the tripId
cons$tripId[tripId]
})

#Referenced:
#https://stackoverflow.com/questions/30592094/r-spreading-multiple-columns-with-tidyr
#https://stackoverflow.com/questions/35321497/spread-multiple-columns-tidyr
alertsEx2 <-
alertsEx %>%
dplyr::group_by(carId, tripId, type) %>%
dplyr::summarize(mean = mean(additionalInfo1), count = dplyr::n())
alerttable <-
tidyr::gather(alertsEx2, variable, val, -(carId:type), na.rm = TRUE) %>%
tidyr::unite(temp, type, variable) %>%
tidyr::spread(temp, val, fill = 0)
consEx %>%
dplyr::left_join(alerttable, by = c("tripId", "carId"))

},
alexis = {
consEx <- copy(consEx0)
joined <- consEx[alertsEx,
.(carId, tripId, type, additionalInfo1),
on = .(carId, begin <= timestamp, end >= timestamp)]
aggregated <- joined[, .(typeCount = .N, typeMean = mean(additionalInfo1)), by = .(carId, tripId, type)]
totals <- aggregated[, .(totals = sum(typeCount)), by = .(carId, tripId)]
long <- dcast(aggregated, carId + tripId ~ type, value.var = c("typeCount", "typeMean"), sep = "", fill = 0)
replaceNA <- function(x) { replace(x, is.na(x), 0) }
consEx[, `:=`(as.character(outer(types, c("Count", "Mean"), paste0)),
lapply(long[consEx,
as.character(outer(types, c("typeCount", "typeMean"),
function(a, b) { paste0(b, a) })),
with = FALSE,
on = .(carId, tripId)],
replaceNA))]
consEx[, totals := lapply(totals[consEx, x.totals, on = .(carId, tripId)], replaceNA)]
setcolorder(consEx, c("carId", "tripId", "begin", "end"))
consEx
},
uwe1 = {
consEx <- copy(consEx0)
agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp),
.(tripId, type, additionalInfo1)][
, .(Count = .N, Mean = mean(additionalInfo1)),
by = .(tripId, type)][
, totals := sum(Count), by = tripId][]
result <- dcast(agg[consEx, on = "tripId"],
tripId + carId + begin+ end + totals ~ type,
value.var = c("Count", "Mean"), fill = 0)[
, c("Count_NA", "Mean_NA") := NULL][
is.na(totals), totals := 0][]
# setcolorder(result, names(consEx))
result
},
uwe2 = {
consEx <- copy(consEx0)
agg <- consEx[alertsEx, on = .(carId, begin <= timestamp, end >= timestamp),
.(tripId, type, additionalInfo1)][
, .(Count = .N, Mean = mean(additionalInfo1)),
by = .(tripId, type)][
, totals := sum(Count), by = tripId][]
wide <- dcast(agg, tripId + totals ~ type, value.var = c("Count", "Mean"), fill = 0)
consEx[wide, on = "tripId", (names(wide)) := mget(paste0("i.", names(wide)))]
setnafill(consEx, fill = 0) # data.table version 1.12.3+
consEx
},
check = FALSE
)
}
)

library(ggplot2)
autoplot(bm)

[plot: autoplot(bm) of the first benchmark]
Note the logarithmic time scale.

The chart confirms the OP's assumption that his approach is much slower than any other solution: it took 3 seconds for 1000 rows, while the OP's production dataset has 12 M rows. The run time of Dave2e's approach increases with the number of alerts.

Also, the OP's approach has by far the highest memory allocation: up to 190 MB, while the data.table versions need only about 1 MB.
bm


# A tibble: 15 x 15
expression n_trips n_alerts min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time
<bch:expr> <dbl> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis>
1 fino 1000 50 2.95s 2.95s 0.339 149.25MB 3.39 1 10 2.95s <data~ <Rpro~ <bch~
2 dave 1000 50 21.52ms 22.61ms 41.6 779.46KB 5.94 21 3 505.01ms <data~ <Rpro~ <bch~
3 alexis 1000 50 21.98ms 23.09ms 41.2 1007.84KB 3.93 21 2 509.5ms <data~ <Rpro~ <bch~
4 uwe1 1000 50 14.47ms 15.45ms 62.3 919.12KB 1.95 32 1 513.94ms <data~ <Rpro~ <bch~
5 uwe2 1000 50 14.14ms 14.73ms 64.2 633.2KB 1.95 33 1 513.91ms <data~ <Rpro~ <bch~
6 fino 1000 200 3.09s 3.09s 0.323 155.12MB 4.53 1 14 3.09s <data~ <Rpro~ <bch~
7 dave 1000 200 53.4ms 62.11ms 11.0 1.7MB 5.49 8 4 728.02ms <data~ <Rpro~ <bch~
8 alexis 1000 200 22.42ms 23.34ms 41.5 1.03MB 3.77 22 2 530.25ms <data~ <Rpro~ <bch~
9 uwe1 1000 200 14.72ms 15.76ms 62.2 953.07KB 0 32 0 514.46ms <data~ <Rpro~ <bch~
10 uwe2 1000 200 14.49ms 15.17ms 63.6 695.63KB 1.99 32 1 503.4ms <data~ <Rpro~ <bch~
11 fino 1000 1000 3.6s 3.6s 0.278 187.32MB 3.61 1 13 3.6s <data~ <Rpro~ <bch~
12 dave 1000 1000 242.46ms 243.7ms 4.07 6.5MB 6.78 3 5 737.06ms <data~ <Rpro~ <bch~
13 alexis 1000 1000 24.32ms 25.78ms 37.6 1.21MB 3.95 19 2 505.84ms <data~ <Rpro~ <bch~
14 uwe1 1000 1000 16.04ms 16.8ms 56.8 1.05MB 1.96 29 1 510.23ms <data~ <Rpro~ <bch~
15 uwe2 1000 1000 15.69ms 16.41ms 54.8 938.74KB 3.92 28 2 510.63ms <data~ <Rpro~ <bch~


Therefore, Fino's approach is omitted from the second benchmark, which uses larger problem sizes of 10 k and 100 k rows for consEx and 2 k and 20 k rows for alertsEx.
[plot: autoplot(bm) of the second benchmark]
As Dave2e's approach is about 100 times slower for 20 k rows than the fastest approach, it is omitted from the third run, which simulates the OP's production dataset of 12 M rows of consEx and 2 M rows of alertsEx:
[plot: autoplot(bm) of the third benchmark]
print(bm, n = Inf)


# A tibble: 27 x 15
expression n_trips n_alerts min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time
<bch:expr> <dbl> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis>
1 alexis 1.00e5 20000 747.87ms 747.87ms 1.34 38.18MB 6.69 1 5 747.87ms <data~ <Rpro~ <bch~
2 uwe1 1.00e5 20000 124.52ms 138.89ms 7.32 37.88MB 3.66 4 2 546.17ms <data~ <Rpro~ <bch~
3 uwe2 1.00e5 20000 72.29ms 73.76ms 11.3 16.63MB 1.88 6 1 533.26ms <data~ <Rpro~ <bch~
4 alexis 1.00e6 20000 7.51s 7.51s 0.133 335.44MB 3.33 1 25 7.51s <data~ <Rpro~ <bch~
5 uwe1 1.00e6 20000 995.31ms 995.31ms 1.00 346.94MB 2.01 1 2 995.31ms <data~ <Rpro~ <bch~
6 uwe2 1.00e6 20000 714.19ms 714.19ms 1.40 89.27MB 1.40 1 1 714.19ms <data~ <Rpro~ <bch~
7 alexis 1.00e7 20000 1.67m 1.67m 0.00999 3.21GB 0.340 1 34 1.67m <data~ <Rpro~ <bch~
8 uwe1 1.00e7 20000 3.44m 3.44m 0.00485 3.41GB 0.00969 1 2 3.44m <data~ <Rpro~ <bch~
9 uwe2 1.00e7 20000 42.51s 42.51s 0.0235 810.3MB 0 1 0 42.51s <data~ <Rpro~ <bch~
10 alexis 1.00e5 200000 15.38s 15.38s 0.0650 73.22MB 0.0650 1 1 15.38s <data~ <Rpro~ <bch~
11 uwe1 1.00e5 200000 1.34s 1.34s 0.747 63.81MB 0 1 0 1.34s <data~ <Rpro~ <bch~
12 uwe2 1.00e5 200000 1.47s 1.47s 0.681 58.98MB 0 1 0 1.47s <data~ <Rpro~ <bch~
13 alexis 1.00e6 200000 2.91m 2.91m 0.00573 375.93MB 0.0115 1 2 2.91m <data~ <Rpro~ <bch~
14 uwe1 1.00e6 200000 9.72s 9.72s 0.103 371.69MB 0 1 0 9.72s <data~ <Rpro~ <bch~
15 uwe2 1.00e6 200000 888.67ms 888.67ms 1.13 161.82MB 0 1 0 888.67ms <data~ <Rpro~ <bch~
16 alexis 1.00e7 200000 6.29m 6.29m 0.00265 3.15GB 0.0928 1 35 6.29m <data~ <Rpro~ <bch~
17 uwe1 1.00e7 200000 2.45m 2.45m 0.00681 3.43GB 0.0136 1 2 2.45m <data~ <Rpro~ <bch~
18 uwe2 1.00e7 200000 12.48m 12.48m 0.00134 887.99MB 0.00134 1 1 12.48m <data~ <Rpro~ <bch~
19 alexis 1.00e5 2000000 3.04s 3.04s 0.329 207MB 0 1 0 3.04s <data~ <Rpro~ <bch~
20 uwe1 1.00e5 2000000 2.96s 2.96s 0.338 196.42MB 0 1 0 2.96s <data~ <Rpro~ <bch~
21 uwe2 1.00e5 2000000 2.81s 2.81s 0.355 187.79MB 0 1 0 2.81s <data~ <Rpro~ <bch~
22 alexis 1.00e6 2000000 6.96m 6.96m 0.00239 726.14MB 0.00479 1 2 6.96m <data~ <Rpro~ <bch~
23 uwe1 1.00e6 2000000 2.01m 2.01m 0.00827 631.1MB 0 1 0 2.01m <data~ <Rpro~ <bch~
24 uwe2 1.00e6 2000000 30.54s 30.54s 0.0327 584.81MB 0 1 0 30.54s <data~ <Rpro~ <bch~
25 alexis 1.00e7 2000000 31.54m 31.54m 0.000528 3.66GB 0.0127 1 24 31.54m <data~ <Rpro~ <bch~
26 uwe1 1.00e7 2000000 8.72m 8.72m 0.00191 3.67GB 0 1 0 8.72m <data~ <Rpro~ <bch~
27 uwe2 1.00e7 2000000 12.35m 12.35m 0.00135 1.58GB 0 1 0 12.35m <data~ <Rpro~ <bch~


Caveat: note that for most parameter variations the run time was verified by only one measurement. In addition, garbage collection (gc) may have had an influence because of the limited amount of RAM on my computer (8 GB). So your mileage may vary.

The timings are not consistent. In general, uwe2 (update in place) is substantially faster than uwe1 (new result set), which confirms jangorecki's assumption. alexis is almost always the slowest. (Note the log time scale in the charts, or compare the timings in the tables.)

However, some parameter variations deviate from that pattern. For 10 M trips, uwe1 is faster than uwe2 in two cases, and alexis is catching up as well. These effects might be due to garbage collection.

The memory allocation of the update-in-place approach is much smaller, even if this does not always pay off in speed.

Again, interpret the results with care. The third benchmark should be repeated on a larger machine and with more patience to allow for repeated measurements.

A similar question on Stack Overflow: https://stackoverflow.com/questions/56297316/
