r - Speed of subsetting a data.table depends in a strange way on the specific key values?

Reposted by: 行者123 · Updated: 2023-12-03 04:24:16

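Can someone explain the following output? Unless I am missing something (which I probably am), the speed of subsetting a data.table seems to depend on the specific values stored in one of its columns, even when the columns are of the same class and show no obvious differences other than their values.

How is this possible?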

> dim(otherTest)
[1] 3572069       2
> dim(test)
[1] 3572069       2
> length(unique(test$keys))
[1] 28741
> length(unique(otherTest$keys))
[1] 28742
> sapply(test,class)
 thingy        keys 
"character" "character" 
> sapply(otherTest,class)
 thingy        keys 
"character" "character" 
> class(test)
[1] "data.table" "data.frame"
> class(otherTest)
[1] "data.table" "data.frame"
> start = Sys.time()
>   newTest = otherTest[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.5438871 secs
> start = Sys.time()
>   newTest = test[keys%in%partition]
>   end  = Sys.time() 
>   print(end - start)
Time difference of 42.78009 secs

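Summary edit: The difference in speed is not related to the data.tables having different sizes, nor to their having different numbers of unique values. As you can see in the revised example above, even after generating the keys so that they have the same number of unique values (in the same general range and sharing at least 1 value, but otherwise different), I get the same performance difference.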

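Regarding sharing data: unfortunately I cannot share the test table, but I can share otherTest. The whole idea was to replicate the test table as closely as possible (same size, same classes/types, same keys, same number of NA values, etc.) so that I could post it to SO. Strangely, the data.table I made up behaves very differently, and I can't figure out why!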

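I should also add that the only reason I suspect the problem comes from data.table is that a few weeks ago I ran into a similar problem when subsetting a data.table, and it turned out to be an actual bug in the new data.table release (I ended up deleting that question because it was a duplicate). That bug also involved subsetting a data.table with the %in% function: if the right-hand argument of %in% contained duplicate entries, duplicated output was returned. So with x = c(1,2,3) and y = c(1,1,2,2), x %in% y would return a vector of length 8. I have since reinstalled the data.table package, so I don't think this can be the same bug, but perhaps it is related?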
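For reference, base R's `%in%` returns exactly one logical value per element of its left-hand argument, however many duplicates the right-hand side contains. A minimal check using the vectors from the paragraph above:

```r
x <- c(1, 2, 3)
y <- c(1, 1, 2, 2)

# One logical per element of x; duplicates in y do not change the length.
res <- x %in% y
print(res)
print(length(res) == length(x))
```

Any result longer than `length(x)` therefore points to a problem outside base `%in%`.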

Edit (regarding Dean MacGregor's comment)

> sapply(test,class)
 thingy        keys 
"character" "character" 
> sapply(otherTest,class)
 thingy        keys 
"character" "character" 


# benchmarking the original test table
>   test2 =data.table(sapply(test ,as.numeric))
>   otherTest2 =data.table(sapply(otherTest ,as.numeric))
>   start = Sys.time()
>   newTest = test[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 52.68567 secs
> start = Sys.time()
>   newTest = otherTest[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.3503151 secs

#benchmarking after converting to numeric
> partition = as.numeric(partition)
> start = Sys.time()
>   newTest = otherTest2[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.7240109 secs
> start = Sys.time()
>    newTest = test2[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 42.18522 secs

#benchmarking again after converting back to character
> partition = as.character(partition)
> otherTest2 =data.table(sapply(otherTest2 ,as.character))
> test2 =data.table(sapply(test2 ,as.character))
> start = Sys.time()
>   newTest =test2[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 48.39109 secs
> start = Sys.time()
>   newTest = data.table(otherTest2[keys%in%partition])
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.1846113 secs

So the slowdown does not depend on the class.
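A side note on the conversion step: `sapply()` over a table typically collapses the columns into a matrix before `data.table()` rebuilds them. Below is a sketch of a more direct column-wise conversion. The three-row table is made up for illustration and is not the OP's data.

```r
library(data.table)

# Illustrative stand-in for the OP's table (values are assumptions).
dt <- data.table(thingy = c("1", "2", "3"), keys = c("10", "20", "10"))

# lapply() over .SD converts each column and returns a data.table directly.
dt_num <- dt[, lapply(.SD, as.numeric)]
print(sapply(dt_num, class))
```

This avoids the intermediate matrix, so it leaves one fewer difference between the tables being compared.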

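Edit: The problem clearly comes from data.table, because I can convert to a matrix and the problem disappears, then convert back to a data.table and the problem returns.

Edit: I have noticed that the problem is related to how the data.table function handles duplicates. That sounds right, because it resembles the bug in data.table 1.9.4 that I found last week and described above.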

>   newTest =test[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 39.19983 secs
> start = Sys.time()
>   newTest =otherTest[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.3776946 secs
> sum(duplicated(test))/length(duplicated(test))
[1] 0.991954
> sum(duplicated(otherTest))/length(duplicated(otherTest))
[1] 0.9918879
> otherTest[duplicated(otherTest)] =NA
> test[duplicated(test)]= NA
> start = Sys.time()
>   newTest =otherTest[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.2272599 secs
> start = Sys.time()
>   newTest =test[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.2041721 secs

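So even though they contain the same proportion of duplicates, the two data.tables (or, more specifically, the %in% function inside data.table) clearly handle duplicates differently. Another interesting observation related to duplicates (note that I am starting from the original tables again):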

> start = Sys.time()
>   newTest =test[keys%in%unique(partition)]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.6649222 secs
> start = Sys.time()
>   newTest =otherTest[keys%in%unique(partition)]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.205637 secs

So removing the duplicates from the right-hand argument of %in% also solves the problem.
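Dropping duplicates from the lookup vector is safe because it cannot change the result of `%in%`, only the amount of work done. A small sketch with made-up values (the names `keys` and `partition` mirror the question; the data are illustrative):

```r
keys <- c("a", "b", "c", "a", "d")
partition <- c("a", "a", "d", "d", "d")

with_dups <- keys %in% partition
without_dups <- keys %in% unique(partition)

# Both calls select exactly the same elements.
print(identical(with_dups, without_dups))
```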

So, given this new information, the question remains: why do these two data.tables handle duplicated values differently?

Best answer

You are focusing on data.table when what you should be focusing on is match (%in% is defined by the match operation) and the size of the vectors. A reproducible example:

library(microbenchmark)

set.seed(1492)

# sprintf to keep the same type and nchar of your values

keys_big <- sprintf("%014d", sample(5000, 4000000, replace=TRUE))
keys_small <- sprintf("%014d", sample(5000, 30000, replace=TRUE))

partition <- sample(keys_big, 250)

microbenchmark(
  "big"=keys_big %in% partition,
  "small"=keys_small %in% partition
)

## Unit: milliseconds
##   expr        min         lq       mean     median         uq        max neval cld
##    big 167.544213 184.222290 205.588121 195.137671 205.043641 376.422571   100   b
##  small   1.129849   1.269537   1.450186   1.360829   1.506126   2.848666   100  a

From the documentation:

match returns a vector of the positions of (first) matches of its first argument in its second.

This inherently means the speed will depend on the size of the vectors and on how "near the top" matches are found (or not found).
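To make the relationship concrete, here is a minimal sketch of how `match()` behaves and how `%in%` follows from it. The vectors `tbl` and `x` are illustrative only.

```r
tbl <- c("a", "b", "a", "b")
x <- c("b", "z", "a")

# match() reports the position of the FIRST occurrence, or NA when absent.
pos <- match(x, tbl)
print(pos)

# %in% is equivalent to asking whether match() found anything.
hit <- match(x, tbl, nomatch = 0L) > 0L
print(identical(hit, x %in% tbl))
```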

However, since you are working with character vectors, you can use %chin% from data.table to speed the whole thing up:

library(data.table)

microbenchmark(
  "big"=keys_big %chin% partition,
  "small"=keys_small %chin% partition
)
## Unit: microseconds
##   expr       min         lq       mean     median        uq        max neval cld
##    big 36312.570 40744.2355 47884.3085 44814.3610 48790.988 119651.803   100   b
##  small   241.045   264.8095   336.1641   283.9305   324.031   1207.864   100  a
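Before swapping operators in real code, it may be worth confirming that `%chin%` selects the same elements as `%in%`. A quick sketch, assuming character inputs (which `%chin%` requires) and made-up values:

```r
library(data.table)

x <- c("a", "b", "c")
tbl <- c("b", "b", "c")

# %chin% is data.table's character-only counterpart of %in%.
fast <- x %chin% tbl
base <- x %in% tbl
print(identical(fast, base))
```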

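You can also use the fastmatch package (but you already have data.table loaded and are working with character vectors, so it is six of one, half a dozen of the other):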

library(fastmatch)

# gives us similar syntax & functionality as %in% and %chin%

"%fmin%" <- function(x, table) fmatch(x, table, nomatch = 0) > 0

microbenchmark(
  "big"=keys_big %fmin% partition,
  "small"=keys_small %fmin% partition
)

## Unit: microseconds
##   expr       min         lq       mean     median        uq        max neval cld
##    big 75189.818 79447.5130 82508.8968 81460.6745 84012.374 124988.567   100   b
##  small   443.014   471.7925   525.2719   498.0755   559.947    850.353   100  a

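Either way, the size of the vectors is what ultimately determines how fast or slow the operation is, but the latter two options at least get you results more quickly. Here is a comparison of all three for both the small and the big vector: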

library(ggplot2)
library(gridExtra)

microbenchmark(
  "small_in"=keys_small %in% partition,
  "small_ch"=keys_small %chin% partition,
  "small_fm"=keys_small %fmin% partition,
  unit="us"
) -> small

microbenchmark(
  "big_in"=keys_big %in% partition,
  "big_ch"=keys_big %chin% partition,
  "big_fm"=keys_big %fmin% partition,
  unit="us"
) -> big

grid.arrange(autoplot(small), autoplot(big))


Update

Based on the OP's comment, here is another benchmark to consider, with and without data.table subsetting:

dat_big <- data.table(keys=keys_big)

microbenchmark(

  "dt"        = dat_big[keys %in% partition],
  "not_dt"    = dat_big$keys %in% partition,

  "dt_ch"     = dat_big[keys %chin% partition],
  "not_dt_ch" = dat_big$keys %chin% partition,

  "dt_fm"     = dat_big[keys %fmin% partition],
  "not_dt_fm" = dat_big$keys %fmin% partition

)

## Unit: milliseconds
##       expr       min        lq      mean    median        uq      max neval    cld
##         dt  11.74225  13.79678  15.90132  14.60797  15.66586 129.2547   100 a     
##     not_dt 160.61295 174.55960 197.98885 184.51628 194.66653 305.9615   100      f
##      dt_ch  46.98662  53.96668  66.40719  58.13418  63.28052 201.3181   100   c   
##  not_dt_ch  37.83380  42.22255  50.53423  45.42392  49.01761 147.5198   100  b    
##      dt_fm  78.63839  92.55691 127.33819 102.07481 174.38285 374.0968   100     e 
##  not_dt_fm  67.96827  77.14590  99.94541  88.75399  95.47591 205.1925   100    d

Regarding "r - Speed of subsetting a data.table depends in a strange way on the specific key values?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30680170/
