r - Efficient way to cbind lists by group in data.table

Reposted by: 行者123 Updated: 2023-12-04 11:33:59


I have a data.frame

Data

data = structure(list(mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD", 
    "ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"), class = c("cat", "dog")), .Names = c("mystring", 
    "class"), row.names = c(NA, -2L), class = "data.frame")

which looks like

#> data
#                                      mystring class
#1 AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD   cat
#2              ASDSDFJSKADDKJSJKDFKSADDLKJFLAK   dog

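I am searching for the start and end positions of the pattern "ADD" within the first 20 characters of each string in mystring, treating class as the grouping variable.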

I am using str_locate_all from the stringr package. Here is my attempt

library(data.table)
library(stringr)
setDT(data)[, 
  cbind(list(str_locate_all(substr(as.character(mystring), 1, 20), "ADD")[[1]][, 1]),
        list(str_locate_all(substr(as.character(mystring), 1, 20), "ADD")[[1]][, 2])), 
  by = class]

which gives my desired output

#   class V1 V2
#1:   cat  8 10
#2:   cat 16 18
#3:   dog 10 12

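Question:
I would like to know whether this is a standard approach or whether it can be done more efficiently. str_locate returns the start and end positions of the matched pattern in separate columns, and I put them into separate lists in order to cbind them with data.table. Also, how can I specify colnames for the cbinded columns here?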

Best answer

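I think you should first reduce the work done per group, so I would start by creating the substring for all groups at once.

# .Internal() skips substr()'s argument coercion; substr(mystring, 1L, 20L) returns the same result
setDT(data)[, submystring := .Internal(substr(mystring, 1L, 20L))]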

Then, using the stringi package (I don't like wrappers), you could do the following (though I can't vouch for its efficiency at the moment)

library(stringi)
data[, data.table(matrix(unlist(stri_locate_all_fixed(submystring, "ADD")), ncol = 2)), by = class]
#    class V1 V2
# 1:   cat  8 10
# 2:   cat 16 18
# 3:   dog 10 12
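The question also asked how to name the resulting columns, and the one-liner above relies on each class having a single string (with several strings per group, unlist() would interleave the starts and ends of different matrices). A minimal sketch that addresses both points, assuming a hypothetical table dt in which "cat" has two strings: the per-string matrices are stacked with do.call(rbind, ...), and as.data.table() keeps the "start" and "end" column names that stringi puts on each matrix.

```r
library(data.table)
library(stringi)

# Hypothetical data (not from the question): class "cat" has two strings
dt <- data.table(
  mystring = c("AASDAASADDLKJLKADDLK", "ADDXXADD", "ASDSDFJSKADDKJSJKDFK"),
  class    = c("cat", "cat", "dog")
)

# Stack the per-string match matrices row-wise within each group;
# the matrices carry the column names "start" and "end"
res <- dt[, as.data.table(do.call(rbind, stri_locate_all_fixed(mystring, "ADD"))),
          by = class]
res
```

If other names are preferred, setnames(res, c("start", "end"), c("V1", "V2")) renames them afterwards by reference.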

Alternatively, you could avoid the matrix and data.table calls per group and instead spread the data after all the locations have been detected

res <- data[, unlist(stri_locate_all_fixed(submystring, "ADD")), by = class]
# each group holds all starts followed by all ends, so the match index 1..n is repeated twice
res[, `:=`(varnames = rep(c("V1", "V2"), each = .N/2), MatchCount = rep(seq_len(.N/2), 2L)), by = class]
dcast(res, class + MatchCount ~ varnames, value.var = "V1")
#    class MatchCount V1 V2
# 1:   cat          1  8 10
# 2:   cat          2 16 18
# 3:   dog          1 10 12

A third, similar option is to first run stri_locate_all_fixed on the whole data set and only then unlist per group (rather than running both unlist and stri_locate_all_fixed per group)

res <- data[, .(stri_locate_all_fixed(submystring, "ADD"), class = class)]
res[, N := lengths(V1)/2L]
res2 <- res[, unlist(V1), by = "class,N"]
# N matches give N starts followed by N ends, so the match index 1..N is repeated twice
res2[, `:=`(varnames = rep(c("V1", "V2"), each = N[1L]), MatchCount = rep(seq_len(N[1L]), 2L)), by = class]
dcast(res2, class + MatchCount ~ varnames, value.var = "V1")
#    class MatchCount V1 V2
# 1:   cat          1  8 10
# 2:   cat          2 16 18
# 3:   dog          1 10 12
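The same idea of locating everything once can also be written without dcast. This is only a sketch under the assumption that a long table with one row per match is acceptable; the names loc, n, and mat are illustrative and not part of the answer above. The pattern is located once over the whole column, the class label is repeated once per match row, and the matrices are stacked in the same order.

```r
library(data.table)
library(stringi)

data <- data.table(
  mystring = c("AASDAASADDLKJLKADDLKKLLKJLJADDLJLKJLADLKLADD",
               "ASDSDFJSKADDKJSJKDFKSADDLKJFLAK"),
  class    = c("cat", "dog")
)

# Locate all matches once, over the first 20 characters of every string
loc <- stri_locate_all_fixed(substr(data$mystring, 1L, 20L), "ADD")

# Number of match rows per input string (a string with no match yields one NA row)
n <- vapply(loc, nrow, integer(1L))

# Stack the matrices and repeat each class label once per match row
mat <- do.call(rbind, loc)
res <- data.table(class = rep(data$class, n),
                  start = mat[, "start"],
                  end   = mat[, "end"])
res
```

Because nothing is computed per group here, the cost does not grow with the number of distinct classes.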

Regarding "r - Efficient way to cbind lists by group in data.table", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32296010/
