
r - Efficiently counting string occurrences across multiple columns

Reposted. Author: 行者123. Updated: 2023-12-04 11:55:58

I have a large data frame (> 4 million rows) whose columns yname1, yname2, and yname3 store strings:

yname1 | yname2 | yname3
aaaaaa | bbbaaa | bbaaaa
aaabbb | cccccc | aaaaaa
aaaaaa | aaabbb | dddddd
cccccc | dddddd | eeeeee

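Now I want to count how many times each string occurs in total across all three columns. The counts should be added as additional columns: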
yname1 | yname2 | yname3 | rcount1 | rcount2 | rcount3
aaaaaa | bbbaaa | bbaaaa | 3 | 1 | 1
aaabbb | cccccc | aaaaaa | 2 | 2 | 3
aaaaaa | aaabbb | dddddd | 3 | 2 | 2
cccccc | dddddd | eeeeee | 2 | 2 | 1

I wrote the following code, which does the job:
data3$rcount1 <- sapply(data3$yname1, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount2 <- sapply(data3$yname2, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount3 <- sapply(data3$yname3, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))

However, it is really slow and would take days to finish. Any ideas on how to speed it up?

Best answer

How about a data.table approach:

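library(data.table)
setDT(d)  # convert d to a data.table by reference

# Reshape to long format, then count each distinct string once
lookup <- melt(d, measure.vars = paste0("yname", 1:3))[, .N, by = value]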
# value N
#1: aaaaaa 3
#2: aaabbb 2
#3: cccccc 2
#4: bbbaaa 1
#5: dddddd 2
#6: bbaaaa 1
#7: eeeeee 1

# Join every column of d (only the yname columns here) against lookup to fetch its counts
d[, paste0("rcount", 1:3) :=
    lapply(d, function(x) lookup[x, , on = .(value)][, N])]

# yname1 yname2 yname3 rcount1 rcount2 rcount3
#1: aaaaaa bbbaaa bbaaaa 3 1 1
#2: aaabbb cccccc aaaaaa 2 2 3
#3: aaaaaa aaabbb dddddd 3 2 2
#4: cccccc dddddd eeeeee 2 2 1
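
The same idea (count every distinct string once, then look each cell up in that table) can also be sketched in base R. This is only an illustration on the toy data from the question, not part of the original answer; the names cols and counts are made up for the example, and it has not been benchmarked against the data.table version.

```r
# Toy data copied from the question
d <- data.frame(
  yname1 = c("aaaaaa", "aaabbb", "aaaaaa", "cccccc"),
  yname2 = c("bbbaaa", "cccccc", "aaabbb", "dddddd"),
  yname3 = c("bbaaaa", "aaaaaa", "dddddd", "eeeeee"),
  stringsAsFactors = FALSE
)

# Hypothetical helper name: the columns to count over
cols <- paste0("yname", 1:3)

# Count each distinct string once across all three columns
counts <- table(unlist(d[cols], use.names = FALSE))

# Index the named count table by each column's strings
d[paste0("rcount", 1:3)] <- lapply(d[cols], function(x) as.integer(counts[x]))

print(d)
```

The gain over the sapply version comes from scanning the data once to build the counts instead of rescanning all three columns for every row.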

Microbenchmark output, adapted from bgoldst's example but with 400,000 rows:
Unit: seconds
expr min lq mean median uq max neval
bgoldst(df) 21.445961 21.628228 21.876051 21.810496 22.091096 22.371697 3
alistaire(df) 20.685357 20.961761 21.255457 21.238164 21.540507 21.842850 3
jota(dt) 2.629337 2.692613 2.719207 2.755889 2.764141 2.772394 3
mhairi(df) 40.780441 41.048345 41.669798 41.316249 42.114476 42.912702 3
coffein(df) 35.669630 35.678719 36.453257 35.687808 36.845071 38.002334 3
espresso(df) 20.823840 20.976175 21.317218 21.128509 21.563907 21.999306 3

This question, r - Efficiently counting string occurrences across multiple columns, corresponds to a similar question on Stack Overflow: https://stackoverflow.com/questions/37491316/
