r - Is data.table's `:=` operator really by reference?


Reposted by: 行者123 Updated: 2023-12-03 21:49:36


data.table's `:=` operator is documented as:

... adds or updates or removes column(s) by reference. It makes no copies of any part of memory at all.

So what is happening here?

library(data.table)

dt <- data.table(a = 1:5, b = 6:10)
address(dt$b)
# [1] "0000021cca78db58"

dt[, b := 2*a]
address(dt$b)
# [1] "0000021cc77ade10"

Why does the address of column b change?
I am using R 3.6.1 and data.table 1.12.8.

Best answer

You (or perhaps the column) just got plonked ;) The plonk behaviour is described fairly thoroughly in the help text ( ?`:=` ):

Unlike <- for data.frame, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it's clearer to readers of your code that you really do intend to change the column type.
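
As a minimal sketch of the coercion rule quoted above (assuming the data.table package is installed; the table `dt` and its values are made up for illustration), sub-assigning a double into an integer column leaves the column type alone, while a whole-column RHS replaces it:

```r
library(data.table)

dt <- data.table(a = 1:5, b = 6:10)   # both columns are integer

# Sub-assigning a double into one row of an integer column: the RHS is
# coerced to the column's type, so the column stays integer. Whether a
# warning is emitted depends on the data.table version, hence
# suppressWarnings().
suppressWarnings(dt[2, b := 7.9])
type_after_subassign <- typeof(dt$b)

# Supplying a whole column as the RHS replaces the column, type included.
dt[, b := as.numeric(b)]
type_after_full_rhs <- typeof(dt$b)
```

Comparing `type_after_subassign` with `type_after_full_rhs` shows which of the two assignments changed the column type.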

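However, the documentation currently does not state the relationship between plonking and memory explicitly (but see below). Hence your question, and those of others (on github: := does not update by reference existing column if i is missing , := doesn't always assign in-place ).
There are many interesting points in the github threads, but rather than restating them here, please go there and enjoy! One quote from Matt Dowle, though, I think justifies the plonk behaviour nicely: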

Instead of 5 column allocations, there's just one now for the a+a expression (the RHS, which gets created anyway) which is then plonked into the column slot by reference i.e. address(DT) doesn't change but address(DT$a) will change. That's correct behaviour, and most efficient, to save copying the whole RHS into the existing column (which is only possible if they're the same type anyway). Since the RHS is as long as the number of rows, it is just plonked in.
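
The claim in this quote can be checked with a small sketch (assuming data.table is installed; `DT` and its values are illustrative, not taken from the github thread). It compares the address of the table and of the column before and after a full-length assignment:

```r
library(data.table)

DT <- data.table(a = c(1.5, 2.5, 3.5))

table_before  <- address(DT)
column_before <- address(DT$a)

# The RHS a + a is as long as the table, so the freshly allocated
# vector is plonked into the column slot.
DT[, a := a + a]

table_same   <- identical(table_before, address(DT))
column_moved <- !identical(column_before, address(DT$a))
```

Here `table_same` records whether the table itself stayed where it was, and `column_moved` records whether the column now points at different memory.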

(Disclaimer: things may have changed in both data.table and R since that post, but I believe the main message is still valid.)

Regarding the documentation, there is an open PR ( update and clarify := docs ), which proposes a more explicit description of plonking and memory:

When a column is plonked, the original column is not updated by reference, because that needs to update every single element of that column.
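
One way to see that the original column vector is left untouched by a plonk is to keep a handle on it before assigning. This is only a sketch under the assumption that data.table is installed; the name `old_b` is hypothetical:

```r
library(data.table)

dt <- data.table(a = 1:5, b = 6:10)

old_b <- dt$b          # handle on the original column vector

dt[, b := 2L * a]      # full-length RHS, so the column is plonked

# The old vector still holds its original values; the column slot now
# holds the newly computed vector instead.
old_b_unchanged <- identical(old_b, 6:10)
new_b <- dt$b
```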

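Have I been bitten by this myself? Yes! For me it was not memory but column classes that caused some headache, and eventually I ended up here: Why is data.table casting column classes when I assign all columns by reference . After reading your question, I went back to that post and realized that the very nice answer by Matt addresses not "only" the class issue, but also the memory issue. I think it is worth repeating here (my bold and comments [] ):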

if length(RHS) == nrow(DT) then the RHS (and whatever its type) is plonked into that column slot. Even if those lengths are 1. If length(RHS) < nrow(DT), the memory for the column (and its type) is kept in place [implicitly memory not kept in place when length(RHS) == nrow(DT), I assume] but the RHS is coerced and recycled to replace the (subset of) items in that column.

If I need to change a column's type in a large table I write:
DT[, col := as.numeric(col)]
here as.numeric allocates a new vector, coerces "col" into that new memory, which is then plonked into the column slot. It's as efficient as it can be. The reason that's a plonk is because length(RHS) == nrow(DT).
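
The length rule in this quote can be sketched as follows, assuming data.table is installed; the table and the choice of rows are illustrative. A sub-assignment to some rows should keep the column's memory in place, while the full-length `as.numeric(col)` assignment should be a plonk:

```r
library(data.table)

DT <- data.table(col = 1:5)

addr_start <- address(DT$col)

# RHS shorter than the table (assigned to a subset of rows): the values
# are written into the existing column memory.
DT[1:2, col := 0L]
addr_subassign <- address(DT$col)

# length(RHS) == nrow(DT): as.numeric() allocates a new vector, which
# then takes over the column slot.
DT[, col := as.numeric(col)]
addr_plonk <- address(DT$col)

kept_in_place <- identical(addr_start, addr_subassign)
plonked       <- !identical(addr_start, addr_plonk)
```

Checking `typeof(DT$col)` afterwards confirms that the plonk also changed the column type.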