r - How to collapse a data frame based on the closest matching time in R

Reposted. Author: 行者123. Updated: 2023-12-04 12:36:05

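I have a data frame that currently contains two "time" columns in HH:MM:SS format. I want to collapse this data frame so that there is only one row for each unique "id" value. For each unique "id", I want to keep the row whose "time1" value is closest to its "time2" value. However, "time1" must be greater than "time2".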

Here is a simple example:

> dput(df)
structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L), count = c(23L, 23L, 23L, 23L, 45L, 45L,
45L, 45L, 67L, 67L, 67L, 67L, 88L, 88L, 88L, 88L), time1 = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L), .Label = c("00:13:00",
"01:13:00", "07:18:00", "18:14:00"), class = "factor"), time2 = structure(c(4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L), .Label = c("00:00:00",
"06:00:00", "12:00:00", "18:00:00"), class = "factor"), afn = c(3.36,
0.63, 1.77, 3.89, 3.36, 0.63, 1.77, 3.89, 3.36, 0.63, 1.77, 3.89,
3.36, 0.63, 1.77, 3.89), dfn = c(201.67, 157.27, 103.55, 191.41,
201.67, 157.27, 103.55, 191.41, 201.67, 157.27, 103.55, 191.41,
201.67, 157.27, 103.55, 191.41)), .Names = c("id", "count", "time1",
"time2", "afn", "dfn"), class = "data.frame", row.names = c(NA,
-16L))

> df
id count time1 time2 afn dfn
1 1 23 00:13:00 18:00:00 3.36 201.67
2 1 23 00:13:00 00:00:00 0.63 157.27
3 1 23 00:13:00 06:00:00 1.77 103.55
4 1 23 00:13:00 12:00:00 3.89 191.41
5 2 45 01:13:00 18:00:00 3.36 201.67
6 2 45 01:13:00 00:00:00 0.63 157.27
7 2 45 01:13:00 06:00:00 1.77 103.55
8 2 45 01:13:00 12:00:00 3.89 191.41
9 3 67 18:14:00 18:00:00 3.36 201.67
10 3 67 18:14:00 00:00:00 0.63 157.27
11 3 67 18:14:00 06:00:00 1.77 103.55
12 3 67 18:14:00 12:00:00 3.89 191.41
13 4 88 07:18:00 18:00:00 3.36 201.67
14 4 88 07:18:00 00:00:00 0.63 157.27
15 4 88 07:18:00 06:00:00 1.77 103.55
16 4 88 07:18:00 12:00:00 3.89 191.41

For the example above, I would like to end up with this result:

id  count   time1       time2       afn     dfn
1 23 00:13:00 00:00:00 0.63 157.27
2 45 01:13:00 00:00:00 0.63 157.27
3 67 18:14:00 18:00:00 3.36 201.67
4 88 07:18:00 06:00:00 1.77 103.55

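I have used the ddply() function to collapse data frames before, but never with a matching rule like this built in. I need to apply this to a data frame with many columns (far more than in the simple example given here), so any suggestions on how to do this would be great. Any help would be much appreciated. Thank you very much!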

Best answer

Here are a few solutions.

1) ave: This uses chron times together with subset and ave from base R:

library(chron)

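# Time difference as a fraction of a day; keep only rows where time1 is later than time2
df2 <- subset(transform(df, delta = as.vector(times(time1) - times(time2))), delta > 0)
# Within each id, keep the row with the smallest delta, then drop the helper column
df2[ave(df2$delta, df2$id, FUN = function(d) d == min(d)) == 1, names(df)]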
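
The same rule can also be sketched without chron, using only base R. This is a minimal sketch rather than part of the original answer: the helper to_seconds() is a hypothetical name introduced here, and the data frame is rebuilt with character columns (instead of the factors in the question) so that the block runs on its own.

```r
# Stand-in for the question's df, with character time columns
df <- data.frame(
  id    = rep(1:4, each = 4),
  count = rep(c(23L, 45L, 67L, 88L), each = 4),
  time1 = rep(c("00:13:00", "01:13:00", "18:14:00", "07:18:00"), each = 4),
  time2 = rep(c("18:00:00", "00:00:00", "06:00:00", "12:00:00"), times = 4),
  afn   = rep(c(3.36, 0.63, 1.77, 3.89), times = 4),
  dfn   = rep(c(201.67, 157.27, 103.55, 191.41), times = 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert "HH:MM:SS" to seconds since midnight
to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Difference in seconds; only rows with time1 strictly later than time2 qualify
delta <- to_seconds(df$time1) - to_seconds(df$time2)
keep  <- delta > 0
cand  <- df[keep, ]
cand_delta <- delta[keep]

# Within each id, flag the row(s) with the smallest positive difference
is_min <- ave(cand_delta, cand$id, FUN = function(d) d == min(d)) == 1
result <- cand[is_min, ]
rownames(result) <- NULL
result
```

Because delta is kept outside the data frame, no helper column has to be dropped afterwards, which may be convenient when df has many columns. If two rows of one id tie for the minimum, both are kept.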

2) dplyr: This uses chron times and the dplyr package:

library(chron)
library(dplyr)

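# Note: the %.% operator was removed from dplyr; use the %>% pipe instead
df %>%
  mutate(delta = as.vector(times(time1) - times(time2))) %>%  # difference as a fraction of a day
  filter(delta > 0) %>%               # time1 must be later than time2
  group_by(id) %>%
  filter(delta == min(delta)) %>%     # closest time2 within each id
  ungroup() %>%
  select(-delta)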
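
If exactly one row per id must be guaranteed even when two rows tie for the smallest difference, a variant using slice_min() may help. This is a sketch under assumptions, not part of the original answer: it assumes dplyr 1.0.0 or later (where slice_min() is available) and rebuilds the example data with character columns so that the block is self-contained.

```r
library(chron)
library(dplyr)

# Stand-in for the question's df, with character time columns
df <- data.frame(
  id    = rep(1:4, each = 4),
  count = rep(c(23L, 45L, 67L, 88L), each = 4),
  time1 = rep(c("00:13:00", "01:13:00", "18:14:00", "07:18:00"), each = 4),
  time2 = rep(c("18:00:00", "00:00:00", "06:00:00", "12:00:00"), times = 4),
  afn   = rep(c(3.36, 0.63, 1.77, 3.89), times = 4),
  dfn   = rep(c(201.67, 157.27, 103.55, 191.41), times = 4),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(delta = as.vector(times(time1) - times(time2))) %>%  # fraction of a day
  filter(delta > 0) %>%                            # time1 must be later than time2
  group_by(id) %>%
  slice_min(delta, n = 1, with_ties = FALSE) %>%   # at most one row per id
  ungroup() %>%
  select(-delta)
res
```

With with_ties = FALSE, ties are broken by row order, so this differs from filter(delta == min(delta)) only when an id has several equally close rows.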

3) sqldf

library(sqldf)

sqldf("select *, min(strftime('%s', time1) - strftime('%s', time2)) delta
from (select * from df where strftime('%s', time1) > strftime('%s', time2))
group by id")[seq_along(df)]

Or perhaps this variation, in which we compute delta in R and then use sqldf:

library(sqldf)
library(chron)

df2 = transform(df, delta = as.vector(times(time1) - times(time2)))

# select * already returns df2's delta, so keep only the original columns of df
sqldf("select *, min(delta) min_delta
from (select * from df2 where delta > 0)
group by id")[seq_along(df)]

4) data.table

library(data.table)
library(chron)

DT <- data.table(df)
DT[, delta := times(time1) - times(time2)
][delta > 0
][, .SD[delta == min(delta)], by = id
][, seq_along(df), with = FALSE]

Added additional solutions. Corrected the library and subset statements. Minor improvements.

Regarding "r - How to collapse a data frame based on the closest matching time in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21971930/
