r - Efficiently handling by-group repeated values with data.table


Reposted by: 行者123  Updated: 2023-12-02 04:22:54

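What is the preferred way to get a single value from a column (variable) that is repeated by group (i.e. holds the same value in every row of the group)? Should I use variable[1]? Or should I include the variable in the by statement and use .BY$variable? Assume that I want the returned value to include variable as a column.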
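As a small illustration of the two options being compared, the sketch below builds a toy table (the names `dt`, `g`, `v`, and `const` are made up for this example and are not part of the benchmark) and shows that `const[1]` with grouping on `g` alone and `.BY$const` with grouping on `list(g, const)` compute the same per-group values. It only demonstrates that the results agree, not the timing difference.

```r
library(data.table)

# Toy data: `const` holds one repeated value per group `g` (hypothetical names)
dt <- data.table(g = c(1L, 1L, 2L, 2L, 2L), v = c(1, 2, 3, 4, 5))
dt[, const := sum(v), by = g]

# Option A: group on g only and take the first element of the repeated column
a <- dt[, list(const = const[1], out = sum(v * v) / const[1]), keyby = g]

# Option B: add the repeated column to `by` and read its value from .BY
b <- dt[, list(out = sum(v * v) / .BY$const), keyby = list(g, const)]

# Compare the computed columns (the keys differ, so compare column by column)
all.equal(a$out, b$out)
identical(a$const, b$const)
```

Both results contain the columns g, const, and out; the only structural difference is the key, which is g for option A and (g, const) for option B.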

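It seems pretty clear from the tests below that putting additional variables in the by statement slows things down, even after discounting the cost of keying by that new variable (or using trickery to tell data.table that no additional keying is necessary). Why do additional, already keyed by variables slow things down?

I guess I had hoped that including already keyed by variables would be a convenient syntactic trick for including those variables in the returned data.table without explicitly naming them in the j statement. That seems inadvisable, though, because there is some overhead associated with the additional by variables even when they are already keyed. So my question is: what is causing this overhead?

Some example data: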

library(data.table)
n <- 1e8
y <- data.table(sample(1:5,n,replace=TRUE),rnorm(n),rnorm(n))
y[,sumV2:=sum(V2),keyby=V1]

Timing shows that the approach using variable[1] (in this case, sumV2[1]) is faster.

x <- copy(y)
system.time(x[, list(out=sum(V3*V2)/sumV2[1],sumV2[1]),keyby=V1])
system.time(x[, list(out=sum(V3*V2)/.BY$sumV2),keyby=list(V1,sumV2)])

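I guess this is not surprising, since data.table has no way of knowing that the groups defined by setkey(V1) and setkey(V1,sumV2) are in fact identical.

What I do find surprising is that even when the data.table is keyed with setkey(V1,sumV2) (and we completely ignore the time needed to set the new key), using sumV2[1] is still faster. Why is that?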

x <- copy(y)
setkey(x,V1,sumV2)
system.time(x[, list(out=sum(V3*V2)/sumV2[1],sumV2[1]),by=V1])
system.time(x[, list(out=sum(V3*V2)/.BY$sumV2),by=list(V1,sumV2)])

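Additionally, the time it takes to run setkey(x,V1,sumV2) is not negligible. Is there any way to trick data.table into skipping the actual re-keying of x by telling it that the key is not materially changing?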

x <- copy(y)
system.time(setkey(x,V1,sumV2))

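Answering my own question, it seems we can skip the sort when setting the key by simply assigning the "sorted" attribute. Is this allowed? Will it break things?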

x <- copy(y)
system.time({
  setattr(x, "sorted", c("V1","sumV2"))
  x[, list(out=sum(V3*V2)/.BY$sumV2),by=list(V1,sumV2)]
})
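
To make the trick above concrete on a small scale: assigning the "sorted" attribute is what makes key() report a key, and setattr() does not check the row order when you do this. The sketch below (a toy table `dt` with hypothetical columns `g` and `const`) first verifies that the rows really are ordered by both columns before setting the attribute. The working assumption is that skipping such a check on data that is not actually sorted could produce wrong results from operations that rely on the key.

```r
library(data.table)

# Toy table, already ordered by g, with `const` constant within each g
dt <- data.table(g = c(1L, 1L, 2L, 2L), const = c(10, 10, 20, 20))

# Check the order ourselves, because setattr() will not do it for us:
# order() returns 1..n exactly when the rows are already sorted (ties keep position)
already_sorted <- identical(order(dt$g, dt$const), seq_len(nrow(dt)))

if (already_sorted) {
  # Mark the table as keyed without actually sorting it
  setattr(dt, "sorted", c("g", "const"))
}

key(dt)
```

The check itself costs a sort on large data, so it is only useful as a one-off sanity test, not as part of the timed code.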

I don't know whether this is bad practice or whether it could break things. But the setattr trick is considerably faster than explicit keying:

x <- copy(y)
system.time({
  setkey(x,V1,sumV2)
  x[, list(out=sum(V3*V2)/.BY$sumV2),by=list(V1,sumV2)]
})

But even the setattr trick combined with sumV2 in the by statement is still not as fast as leaving sumV2 out of the by statement altogether:

x <- copy(y)
system.time(x[, list(out=sum(V3*V2)/sumV2[1],sumV2[1]),keyby=V1])

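It seems to me that setting the key via the attribute and using sumV2 as a length-1 by variable within each group should be faster than keying only on V1 and using sumV2[1]. If sumV2 is not specified as a by variable, then the entire vector of repeated values in sumV2 has to be generated for each group before being subset to sumV2[1]. Compare this with the case where sumV2 is a by variable: there is only a length-1 vector for sumV2 in each group. Obviously my reasoning here is incorrect. Can anyone explain why? Why is sumV2[1] the fastest option, even compared with making sumV2 a by variable after using the setattr trick?

As an aside, I was surprised to learn that using attr<- is no slower than setattr (both are instantaneous, which implies no copying at all). This is contrary to my understanding that base R foo<- functions make copies of the data.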

x <- copy(y)
system.time(setattr(x, "sorted", c("V1","sumV2")))
x <- copy(y)
system.time(attr(x,"sorted") <- c("V1","sumV2"))

Relevant sessionInfo() for this question:

data.table version 1.12.2
R version 3.5.3

Best answer

OK, so I don't have a great technical answer, but I think I have figured this out conceptually with the help of options(datatable.verbose=TRUE).

Create the data

library(data.table)
n <- 1e8

y_unkeyed_5groups <- data.table(sample(1:5,n,replace=TRUE),rnorm(n),rnorm(n))
y_unkeyed_5groups[,sumV2:=sum(V2),keyby=V1]
y_unkeyed_10000groups <- data.table(sample(1:10000,n,replace=TRUE),rnorm(n),rnorm(n))
y_unkeyed_10000groups[,sumV2:=sum(V2),keyby=V1]

Slow run

x <- copy(y)
system.time({
  setattr(x, "sorted", c("V1","sumV2"))
  x[, list(out=sum(V3*V2)/.BY$sumV2),by=list(V1,sumV2)]
})
# Detected that j uses these columns: V3,V2 
# Finding groups using uniqlist on key ... 1.050s elapsed (1.050s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'list(sum(V3 * V2)/.BY$sumV2)'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
# memcpy contiguous groups took 0.305s for 6 groups
# eval(j) took 0.254s for 6 calls
# 0.560s elapsed (0.510s cpu) 
# user  system elapsed 
# 1.81    0.09    1.72

Fast run

x <- copy(y)
system.time(x[, list(out=sum(V3*V2)/sumV2[1],sumV2[1]),keyby=V1])
# Detected that j uses these columns: V3,V2,sumV2 
# Finding groups using uniqlist on key ... 0.060s elapsed (0.070s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'list(sum(V3 * V2)/sumV2[1], sumV2[1])'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
# memcpy contiguous groups took 0.328s for 6 groups
# eval(j) took 0.291s for 6 calls
# 0.610s elapsed (0.580s cpu) 
# user  system elapsed 
# 1.08    0.08    0.82

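The "Finding groups" step is what causes the difference (1.050s versus 0.060s in the output above). My guess is that setting a key is really just sorting (I should have guessed that from the way the attribute is named!) and does nothing to define where the groups start and end. So even though data.table knows that sumV2 is sorted, it does not know that the values are all the same, and it therefore has to find where the groups within sumV2 start and end.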
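
A rough way to picture this, in plain base R rather than data.table's internal C code: even when a column is known to be sorted, locating the group boundaries still takes a pass over all rows that compares each value with the one before it. The helper name `group_starts` below is made up for illustration and is not a data.table function.

```r
# Hypothetical helper: start positions of runs of equal values in a sorted vector
group_starts <- function(v) {
  n <- length(v)
  if (n == 0L) return(integer(0))
  # A new group starts at row 1 and wherever the value differs from the previous row
  which(c(TRUE, v[-1L] != v[-n]))
}

sorted_col <- c(5, 5, 5, 8, 8, 9)
starts <- group_starts(sorted_col)
# Group sizes follow from the distance between consecutive start positions
sizes <- diff(c(starts, length(sorted_col) + 1L))
starts  # 1 4 6
sizes   # 3 2 1
```

Under this picture, every extra by column adds another column to compare across all rows, which would be consistent with the longer "Finding groups" time reported for by=list(V1,sumV2).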

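My guess is that it would be technically possible to write data.table so that keying not only sorts but also stores the start and end rows of each group within the keyed variables. However, that could take up a lot of memory for data.tables with a large number of groups.

Knowing this, the recommendation would seem to be: do not repeat the same by statement over and over, and instead do everything you need within a single by statement. That is probably good advice in general, but it does not hold when there are only a few groups. See the following counterexample:

I rewrote the computation in what I assumed would be the fastest possible way with data.table (only a single by statement, and making use of GForce):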

library(data.table)
n <- 1e8
y_unkeyed_5groups <- data.table(sample(1:5,n, replace=TRUE),rnorm(n),rnorm(n))
y_unkeyed_10000groups <- data.table(sample(1:10000,n, replace=TRUE),rnorm(n),rnorm(n))

x <- copy(y_unkeyed_5groups)
system.time({
  x[, product:=V3*V2]
  outDT <- x[,list(sumV2=sum(V2),sumProduct=sum(product)),keyby=V1]
  outDT[,`:=`(out=sumProduct/sumV2,sumProduct=NULL) ]
  setkey(x,V1)
  x[outDT,sumV2:=i.sumV2]
  x[,product:=NULL]
  outDT
})

# Detected that j uses these columns: V3,V2 
# Assigning to all 100000000 rows
# Direct plonk of unnamed RHS, no copy.
# Detected that j uses these columns: V2,product 
# Finding groups using forderv ... 0.350s elapsed (0.810s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'list(sum(V2), sum(product))'
# GForce optimized j to 'list(gsum(V2), gsum(product))'
# Making each group and running j (GForce TRUE) ... 1.610s elapsed (4.550s cpu) 
# Detected that j uses these columns: sumProduct,sumV2 
# Assigning to all 5 rows
# RHS for item 1 has been duplicated because NAMED is 3, but then is being plonked. length(values)==2; length(cols)==2)
# forder took 0.98 sec
# reorder took 3.35 sec
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu) 
# Detected that j uses these columns: sumV2 
# Assigning to 100000000 row subset of 100000000 rows
# Detected that j uses these columns: product 
# Assigning to all 100000000 rows
# user  system elapsed 
# 11.00    1.75    5.33 


x2 <- copy(y_unkeyed_5groups)
system.time({
  x2[,sumV2:=sum(V2),keyby=V1]
  outDT2 <- x2[, list(sumV2=sumV2[1],out=sum(V3*V2)/sumV2[1]),keyby=V1]
})
# Detected that j uses these columns: V2 
# Finding groups using forderv ... 0.310s elapsed (0.700s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'sum(V2)'
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
# collecting discontiguous groups took 0.714s for 5 groups
# eval(j) took 0.079s for 5 calls
# 1.210s elapsed (1.160s cpu) 
# setkey() after the := with keyby= ... forder took 1.03 sec
# reorder took 3.21 sec
# 1.600s elapsed (3.700s cpu) 
# Detected that j uses these columns: sumV2,V3,V2 
# Finding groups using uniqlist on key ... 0.070s elapsed (0.070s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'list(sumV2[1], sum(V3 * V2)/sumV2[1])'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
# memcpy contiguous groups took 0.347s for 5 groups
# eval(j) took 0.265s for 5 calls
# 0.630s elapsed (0.620s cpu) 
# user  system elapsed 
# 6.57    0.98    3.99 

all.equal(x,x2)
# TRUE
all.equal(outDT,outDT2)
# TRUE

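OK, so it turns out that the efficiency gained from not repeating by statements and from using GForce does not pay off when there are only 5 groups (5.33s versus 3.99s elapsed above). With a larger number of groups it does make a difference, although I have not written this in a way that separates the benefit of using only one by statement without GForce from the benefit of using GForce with multiple by statements: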

x <- copy(y_unkeyed_10000groups)
system.time({
  x[, product:=V3*V2]
  outDT <- x[,list(sumV2=sum(V2),sumProduct=sum(product)),keyby=V1]
  outDT[,`:=`(out=sumProduct/sumV2,sumProduct=NULL) ]
  setkey(x,V1)
  x[outDT,sumV2:=i.sumV2]
  x[,product:=NULL]
  outDT
})
# 
# Detected that j uses these columns: V3,V2 
# Assigning to all 100000000 rows
# Direct plonk of unnamed RHS, no copy.
# Detected that j uses these columns: V2,product 
# Finding groups using forderv ... 0.740s elapsed (1.220s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'list(sum(V2), sum(product))'
# GForce optimized j to 'list(gsum(V2), gsum(product))'
# Making each group and running j (GForce TRUE) ... 0.810s elapsed (2.390s cpu) 
# Detected that j uses these columns: sumProduct,sumV2 
# Assigning to all 10000 rows
# RHS for item 1 has been duplicated because NAMED is 3, but then is being plonked. length(values)==2; length(cols)==2)
# forder took 1.97 sec
# reorder took 11.95 sec
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu) 
# Detected that j uses these columns: sumV2 
# Assigning to 100000000 row subset of 100000000 rows
# Detected that j uses these columns: product 
# Assigning to all 100000000 rows
# user  system elapsed 
# 18.37    2.30    7.31 

x2 <- copy(y_unkeyed_10000groups)
system.time({
  x2[,sumV2:=sum(V2),keyby=V1]
  outDT2 <- x2[, list(sumV2=sumV2[1],out=sum(V3*V2)/sumV2[1]),keyby=V1]
})

# Detected that j uses these columns: V2 
# Finding groups using forderv ... 0.770s elapsed (1.490s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'sum(V2)'
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
# collecting discontiguous groups took 1.792s for 10000 groups
# eval(j) took 0.111s for 10000 calls
# 3.960s elapsed (3.890s cpu) 
# setkey() after the := with keyby= ... forder took 1.62 sec
# reorder took 13.69 sec
# 4.660s elapsed (14.4s cpu) 
# Detected that j uses these columns: sumV2,V3,V2 
# Finding groups using uniqlist on key ... 0.070s elapsed (0.070s cpu) 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
# lapply optimization is on, j unchanged as 'list(sumV2[1], sum(V3 * V2)/sumV2[1])'
# GForce is on, left j unchanged
# Old mean optimization is on, left j unchanged.
# Making each group and running j (GForce FALSE) ... 
# memcpy contiguous groups took 0.395s for 10000 groups
# eval(j) took 0.284s for 10000 calls
# 0.690s elapsed (0.650s cpu) 
# user  system elapsed 
# 20.49    1.67   10.19 

all.equal(x,x2)
# TRUE
all.equal(outDT,outDT2)
# TRUE

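More generally, data.table is very fast, but to get the fastest and most efficient computations out of the underlying C code you need to pay close attention to how data.table works internally. I recently learned about the GForce optimization in data.table: when there is a by statement, it seems that specific forms of j statements (involving simple functions such as mean and sum) are parsed and executed directly in C.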
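
A minimal sketch of how one might check this on a toy table (the table `dt` and its columns are invented here): with verbose = TRUE, data.table reports whether it rewrote j to its grouped C implementation, and the datatable.optimize option can be lowered to compare against the ordinary per-group evaluation. The assumption, based on the package's documentation of that option, is that a level below 2 turns GForce off; the exact verbose messages vary by version.

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b", "b", "b"), v = c(1, 2, 3, 4, 5))

# Simple j of the form sum(col): eligible for GForce; verbose = TRUE prints the decision
with_gforce <- dt[, list(total = sum(v)), by = g, verbose = TRUE]

# Lower the optimization level to take the ordinary per-group evaluation path
old <- options(datatable.optimize = 1L)
without_gforce <- dt[, list(total = sum(v)), by = g]
options(old)  # restore the previous setting

# The two paths should agree on the result; only the speed differs on large data
all.equal(with_gforce, without_gforce)
```

On a table this small the timing difference is not measurable; the point is only to show where the optimization is reported and how to switch it off for a comparison.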

Regarding "r - Efficiently handling by-group repeated values with data.table", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58142097/
