
r - How is the order within the result of a non-equi join determined?

Reposted. Author: 行者123. Updated: 2023-12-04 02:16:47

I am trying to understand the underlying logic of how the result of a non-equi join in data.table is ordered within each level of the on-variable.

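To be clear from the start: I have no problem with the order itself, or with sorting the output as desired after the join. However, because I find the output of all other data.table operations highly consistent, I suspect there is an ordering pattern to be found in non-equi joins as well.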

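I will give two examples in which two different "large" data sets are joined with a smaller one. I have tried to describe the most obvious patterns in the output of each join, as well as the instances where the patterns differ between the joins of the two data sets.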

library(data.table)
# the first 'large' data set
d1 <- data.table(x = c(rep(c("b", "a", "c"), each = 3), c("a", "b")),
                 y = c(rep(c(1, 3, 6), 3), 6, 6),
                 id = 1:11) # to make it easier to track the original order in the output
# x y id
# 1: b 1 1
# 2: b 3 2
# 3: b 6 3
# 4: a 1 4
# 5: a 3 5
# 6: a 6 6
# 7: c 1 7
# 8: c 3 8
# 9: c 6 9
# 10: a 6 10
# 11: b 6 11

# the small data set
d2 <- data.table(id = 1:2, val = c(4, 2))
# id val
# 1: 1 4
# 2: 2 2

Non-equi join between the first large data set and the small data set, on = .(y >= val):
d1[d2, on = .(y >= val)]
# x y id i.id
# 1: b 4 3 1 # Row 1-5, first match: y >= val[1]; y >= 4
# 2: a 4 6 1 # The rows within this match have the same order as the original data
# 3: c 4 9 1 # and runs consecutively from first to last match
# 4: a 4 10 1
# 5: b 4 11 1

# 6: b 2 2 2 # Row 6-13, second match: y >= val[2]; y >= 2
# 7: a 2 5 2 # The rows within this match do not have the same order as the original data
# 8: c 2 8 2 # Rather, they seem to come in chunks (6-8, 9-11, 12-13)
# First chunk starts with the match with lowest index, y[2]
# 9: b 2 3 2
# 10: a 2 6 2
# 11: c 2 9 2

# 12: a 2 10 2
# 13: b 2 11 2

The second "large" data set:
d3 <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = c(6, 1, 3),
                 id = 1:9)
# x y id
# 1: a 6 1
# 2: a 1 2
# 3: a 3 3
# 4: b 6 4
# 5: b 1 5
# 6: b 3 6
# 7: c 6 7
# 8: c 1 8
# 9: c 3 9

The same non-equi join between the second large data set and the small data set:
d3[d2, on = .(y >= val)]

# x y id i.id
# 1: a 4 1 1 # Row 1-3, first match (y >= 4), similar to output above
# 2: b 4 4 1
# 3: c 4 7 1

# 4: a 2 3 2 # Row 4-9, second match (y >= 2).
# 5: b 2 6 2 # Again, rows not consecutive.
# 6: c 2 9 2 # However, now the first chunk does not start with the match with lowest index,
# y[3] instead of y[1]

# 7: a 2 1 2 # y[1] appears after y[3]
# 8: b 2 4 2 # ditto
# 9: c 2 7 2

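Can anyone explain the logic of (1) the order within each level of the on-variable, especially within the second match, where the original order of the data is not preserved in the result? And (2) why does the order of the chunks within a match differ when the two different data sets are used?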
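For reference, the order one would see if each match simply kept the original row order of the data can be sketched in base R. This is only a cross-check under that assumption, not a description of what data.table does internally; the name expected_ids is made up for illustration, and y is written out with rep() instead of relying on recycling.

```r
# Base R cross-check (no data.table needed): for each val, list the ids of
# the rows of d3 that satisfy y >= val, in their original row order.
d3 <- data.frame(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(6, 1, 3), times = 3),
                 id = 1:9)
d2 <- data.frame(id = 1:2, val = c(4, 2))

# expected_ids is a hypothetical helper name, one list element per row of d2
expected_ids <- lapply(d2$val, function(v) d3$id[d3$y >= v])
names(expected_ids) <- paste0("val_", d2$val)
print(expected_ids)
```

Comparing these vectors with the id column of the join output makes it easy to see where the second match (y >= 2) departs from the original order.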

Best answer

Thanks for spotting this, reporting it here on SO, and filing it on GitHub. This should be fixed now in the current development version (v1.10.5 at the time of writing).

It should be available on CRAN soon as v1.10.6.

From the NEWS entry:

  1. Order of rows returned in non-equi joins were incorrect in certain scenarios as reported under #1991. This is now fixed. Thanks to @Henrik-P for reporting.
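A minimal sketch for checking the behaviour on an installed version, using the d1 and d2 from the question. It assumes that "correct" order means the original row order of d1 is kept within each row of d2, which the NEWS entry does not spell out; the names order_check and sorted_res are made up for illustration.

```r
library(data.table)

d1 <- data.table(x = c(rep(c("b", "a", "c"), each = 3), c("a", "b")),
                 y = c(rep(c(1, 3, 6), 3), 6, 6),
                 id = 1:11)
d2 <- data.table(id = 1:2, val = c(4, 2))

res <- d1[d2, on = .(y >= val)]

# For each row of d2 (i.id): does the d1 row id increase monotonically,
# i.e. is the original row order of d1 preserved within the match?
order_check <- res[, .(in_original_order = !is.unsorted(id)), by = i.id]
print(order_check)

# Regardless of the installed version, an explicit sort gives a defined order
sorted_res <- res[order(i.id, id)]
print(sorted_res)
```

If in_original_order is FALSE for any i.id, the installed version probably predates the fix, and sorting explicitly as in the last step is the safer choice.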

Regarding "r - How is the order within the result of a non-equi join determined?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40932231/
