gpt4 book ai didi

r - 如何根据时间有条件地连接来自 2 个数据集的数据?

转载 作者:行者123 更新时间:2023-12-05 02:35:17 25 4
gpt4 key购买 nike

在我的例子中,我有 2 个数据集要合并。我的第一个数据集包含一列位置和一列日期时间。

site              | datetime
---------------------------------------
Cave Mountain | 2021-08-08 06:28:26
Laurel Run | 2021-06-24 05:26:07
Laurel Run | 2021-10-05 05:24:04
Laurel Run | 2021-10-07 12:26:23
Smoke Cliffs | 2021-08-30 20:44:39
Smoke Hole Road | 2021-09-10 19:58:26
University Forest | 2021-09-25 02:15:04



data <- structure(list(site = c("Cave Mountain", "Laurel Run", "Laurel Run",
"Laurel Run", "Smoke Cliffs", "Smoke Hole Road", "University Forest"
), datetime = structure(c(1628418506.5, 1624526767, 1633425844.5,
1633623983.5, 1630370679, 1631318306, 1632550504), tzone = "America/New_York", class = c("POSIXct",
"POSIXt"))), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -7L), groups = structure(list(site = c("Cave Mountain",
"Laurel Run", "Smoke Cliffs", "Smoke Hole Road", "University Forest"
), .rows = structure(list(1L, 2:4, 5L, 6L, 7L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L), .drop = TRUE))

我的第二个数据集与第一个数据集一样有一列位置、一列日期时间和一列温度数据。

site              | datetime            | tempf
------------------------------------------------
Cave Mountain | 2020-06-20 21:20:10 | 79.601
Cave Mountain | 2020-07-16 21:05:20 | 69.640
Cave Mountain | 2020-10-07 21:12:00 | 66.706
Laurel Run | 2020-06-15 21:06:30 | 60.991
Laurel Run | 2020-06-26 21:23:40 | 64.204
Laurel Run | 2020-08-11 21:21:10 | 69.685
Smoke Cliffs | 2020-05-21 21:08:58 | 66.551
Smoke Cliffs | 2020-05-28 21:00:01 | 59.369
Smoke Hole Road | 2020-06-13 21:05:10 | 62.919
University Forest | 2020-06-14 21:14:00 | 63.230
University Forest | 2020-08-05 21:09:10 | 61.610
University Forest | 2020-08-08 21:11:00 | 65.007
University Forest | 2020-09-03 21:00:20 | 55.740
University Forest | 2020-09-19 21:16:50 | 67.246

ref <- structure(list(site = c("Cave Mountain", "Cave Mountain", "Cave Mountain",
"Laurel Run", "Laurel Run", "Laurel Run", "Smoke Cliffs", "Smoke Cliffs",
"Smoke Hole Road", "University Forest", "University Forest",
"University Forest", "University Forest", "University Forest"
), datetime = structure(c(1592702410, 1594947920, 1602119520,
1592269590, 1593221020, 1597195270, 1590109738, 1590714001, 1592096710,
1592183640, 1596676150, 1596935460, 1599181220, 1600564610),
tzone = "America/New_York", class = c("POSIXct",
"POSIXt")), tempf = c(79.601, 69.64, 66.706, 60.991, 64.204,
69.685, 66.551, 59.369, 62.919, 63.23, 61.61, 65.007, 55.74,
67.246)), row.names = c(NA, -14L), class = c("tbl_df", "tbl",
"data.frame"))

我的目标是通过查找与第一个数据集中每一列最接近的日期时间,然后将第二个数据集中的温度添加到第一个数据集中的关联行,从而从第二个数据集中提取温度。这种合并也应该是有条件的,基于站点。所需的输出如下所示。

site              | datetime            | tempf
------------------------------------------------
Cave Mountain | 2021-08-08 06:28:26 | 69.640
Laurel Run | 2021-06-24 05:26:07 | 64.204
Laurel Run | 2021-10-05 05:24:04 | 69.685
Laurel Run | 2021-10-07 12:26:23 | 69.685
Smoke Cliffs | 2021-08-30 20:44:39 | 59.369
Smoke Hole Road | 2021-09-10 19:58:26 | 62.919
University Forest | 2021-09-25 02:15:04 | 67.246

我一直在尝试使用 data.table 的 rolling join feature with roll = "nearest"来实现这一点,但我不知道如何在考虑站点的情况下有条件地实现它.

最佳答案

您可以在带有 nearest 选项的滚动连接的 on 参数中指定 sitedatetime:

library(data.table)
setDT(ref)
setDT(data)
ref[data,on=.(site,datetime),roll='nearest']

site datetime tempf
<char> <POSc> <num>
1: Cave Mountain 2021-08-08 06:28:26 66.706
2: Laurel Run 2021-06-24 05:26:07 69.685
3: Laurel Run 2021-10-05 05:24:04 69.685
4: Laurel Run 2021-10-07 12:26:23 69.685
5: Smoke Cliffs 2021-08-30 20:44:39 59.369
6: Smoke Hole Road 2021-09-10 19:58:26 62.919
7: University Forest 2021-09-25 02:15:04 67.246

如果需要 x. 还允许访问最近的引用点 datetime,例如,这对于删除过时的值很有用:

ref[data,.(site,datetime,nearest_datetime=x.datetime,tempf)
,on=.(site,datetime),roll='nearest']

site datetime nearest_datetime tempf
<char> <POSc> <POSc> <num>
1: Cave Mountain 2021-08-08 06:28:26 2020-10-07 21:12:00 66.706
2: Laurel Run 2021-06-24 05:26:07 2020-08-11 21:21:10 69.685
3: Laurel Run 2021-10-05 05:24:04 2020-08-11 21:21:10 69.685
4: Laurel Run 2021-10-07 12:26:23 2020-08-11 21:21:10 69.685
5: Smoke Cliffs 2021-08-30 20:44:39 2020-05-28 21:00:01 59.369
6: Smoke Hole Road 2021-09-10 19:58:26 2020-06-13 21:05:10 62.919
7: University Forest 2021-09-25 02:15:04 2020-09-19 21:16:50 67.246

关于r - 如何根据时间有条件地连接来自 2 个数据集的数据?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/70610682/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com