
r - How to speed up string split operations in data.table


I have a data.table in which several IDs have been pasted together into a single character column, separated by underscores. I am trying to split the IDs back into separate columns, but my best approach is really slow on a large dataset (~250M rows). Interestingly, the operation does not appear to take O(N) time, which is what I would expect. In other words, it is reasonably fast up to around 50M+ rows, and then becomes very slow.

Make some data

require(data.table)
set.seed(2016)
sim_rows <- 40000000
dt <- data.table(
LineId = rep("L0123", times=sim_rows),
StationId = rep("S0123", times=sim_rows),
TimeId = rep("T0123", times=sim_rows)
)
dt[, InfoId := paste(LineId, StationId, TimeId, sep="_")]
dt[, c("LineId", "StationId", "TimeId") := NULL]
gc(reset=T) # free up 1.5Gb of memory

dt
InfoId
1: L0123_S0123_T0123
2: L0123_S0123_T0123
3: L0123_S0123_T0123
4: L0123_S0123_T0123
5: L0123_S0123_T0123
---
39999996: L0123_S0123_T0123
39999997: L0123_S0123_T0123
39999998: L0123_S0123_T0123
39999999: L0123_S0123_T0123
40000000: L0123_S0123_T0123

Check timings

system.time( dt[1:10000000, c("LineId", "StationId", "TimeId") :=
tstrsplit(InfoId, split="_", fixed=TRUE)] )
user system elapsed
5.179 0.634 3.867

system.time( dt[1:20000000, c("LineId", "StationId", "TimeId") :=
tstrsplit(InfoId, split="_", fixed=TRUE)] )
user system elapsed
7.805 0.958 7.703

system.time( dt[1:30000000, c("LineId", "StationId", "TimeId") :=
tstrsplit(InfoId, split="_", fixed=TRUE)] )
user system elapsed
12.556 1.782 12.349

system.time( dt[1:40000000, c("LineId", "StationId", "TimeId") :=
tstrsplit(InfoId, split="_", fixed=TRUE)] )
user system elapsed
29.260 2.822 29.895

Check the result

dt
InfoId LineId StationId TimeId
1: L0123_S0123_T0123 L0123 S0123 T0123
2: L0123_S0123_T0123 L0123 S0123 T0123
3: L0123_S0123_T0123 L0123 S0123 T0123
4: L0123_S0123_T0123 L0123 S0123 T0123
5: L0123_S0123_T0123 L0123 S0123 T0123
---
39999996: L0123_S0123_T0123 L0123 S0123 T0123
39999997: L0123_S0123_T0123 L0123 S0123 T0123
39999998: L0123_S0123_T0123 L0123 S0123 T0123
39999999: L0123_S0123_T0123 L0123 S0123 T0123
40000000: L0123_S0123_T0123 L0123 S0123 T0123

How can I speed this baby up?

Best answer

stringr is newer, is built on stringi internally, and is usually even faster.

Moreover, stringi (and, to a lesser extent, stringr) provides multiple variants of each string operation (fixed/coll/regex/words/boundaries/charclass), each optimized for the type of operand, as the sketch below illustrates.
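As a minimal illustration of those variants (the vector x below is made up purely for demonstration), the fixed-pattern splitter treats "_" as a literal separator and skips the regex engine entirely, which is what makes it cheap here:

require(stringi)

# Made-up example vector, just to show the variant functions
x <- c("L0001_S0001_T0001", "L0002_S0002_T0002")

# Literal (fixed) pattern: no regex engine involved, fastest for a plain "_"
stri_split_fixed(x, "_")

# Regex pattern: same result for this input, but goes through the regex engine
stri_split_regex(x, "_")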

Try stri_split_fixed(..., '_'); it should be very fast.

require(stringi)
system.time( dt[1:1e6, c("LineId", "StationId", "TimeId") := stri_split_fixed(InfoId, "_")] )
user system elapsed
2.635 0.497 3.379 # on my old machine; please tell us your numbers?
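Note that stri_split_fixed() returns a list with one character vector per row, so mapping the result onto three separate columns still needs a transpose step, which is what tstrsplit() does internally around strsplit(). One way to combine it with data.table is the hedged sketch below (not taken from the answer above; timings will vary):

require(data.table)
require(stringi)

# stri_split_fixed() gives a per-row list; transpose() flips it into
# one vector per target column, which := can then assign directly.
dt[, c("LineId", "StationId", "TimeId") :=
     data.table::transpose(stri_split_fixed(InfoId, "_"))]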

Regarding r - How to speed up string split operations in data.table, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40295162/
