
R - Find a sequence of row elements based on time constraints in a dataframe

Repost. Author: 行者123. Last updated: 2023-12-05 01:22:49

Consider the following data frame (sorted by id and time):

df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,32,1,2,6,17,24))
df
   id event time
1   1     a    1
2   1     b    3
3   1     b    6
4   1     b   12
5   1     a   24
6   1     b   30
7   1     a   42
8   2     a    1
9   2     a    2
10  2     b    6
11  2     a   17
12  2     a   24

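I want to count how many times a given sequence of events occurs within each "id" group. Consider the following sequence with time constraints: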

seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

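This means that event "a" can start at any time, event "b" must start no earlier than 2 and no later than 8 after event "a", and another event "a" must start no earlier than 12 and no later than 18 after event "b". Some rules for building sequences: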

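  1. Events do not need to be consecutive with respect to the "time" column. For example, seq can be built from rows 1, 3 and 5.
  2. To be counted, sequences must have a different first event. For example, if seq = rows 8, 10 and 11 was counted, then seq = rows 8, 10 and 12 must not be counted.
  3. Events may be included in many constructed sequences as long as they do not violate the second rule. For example, we count both sequences: rows 1, 3, 5 and rows 5, 6, 7.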
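To make the bounds concrete, here is a minimal base R sketch that checks whether a given set of row numbers forms a valid sequence. The helper name is_valid_seq is hypothetical (it is not part of the question), the data uses the time values shown in the printed table, and the sketch assumes the first bound applies to the start time of the first event while the remaining bounds apply to the gaps between consecutive events.

```r
# Data as printed in the question (the last time for id 1 is 42).
df <- data.frame(
  id    = c(rep(1, 7), rep(2, 5)),
  event = c("a", "b", "b", "b", "a", "b", "a", "a", "a", "b", "a", "a"),
  time  = c(1, 3, 6, 12, 24, 30, 42, 1, 2, 6, 17, 24),
  stringsAsFactors = FALSE
)
seq     <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

# Hypothetical helper: TRUE if the rows form a valid sequence within one id.
is_valid_seq <- function(rows) {
  ev <- df$event[rows]
  tm <- df$time[rows]
  # Assumption: bound 1 constrains the start time, the others constrain gaps.
  steps <- c(tm[1], diff(tm))
  length(unique(df$id[rows])) == 1 &&
    identical(ev, seq) &&
    all(steps >= time_LB & steps <= time_UB)
}

is_valid_seq(c(1, 3, 5))   # gaps are 5 and 18, both within bounds
is_valid_seq(c(1, 2, 5))   # gap from "b" to "a" is 21, above the upper bound 18
```

Rule 2 is not covered by this check; it only validates a single candidate against the event order and the time bounds.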

Expected result:

df1
  id count
1  1     2
2  2     2

There are some related questions: "R - Identify a sequence of row elements by groups in a dataframe" and "Finding rows in R dataframe where a column value follows a sequence".

Is there a way to solve this problem using "dplyr"?

Best answer

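I believe this is what you're looking for. It gives you the desired output. Note that there is a typo in your original question: when you define the time column in df, you typed 32 instead of 42. I call it a typo because it doesn't match the output shown immediately below the definition of df. I changed 32 to 42 in the code below.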

library(dplyr)

df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))

seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

df %>%
  full_join(df,by='id',suffix=c('1','2')) %>%
  full_join(df,by='id') %>%
  rename(event3 = event, time3 = time) %>%
  filter(event1 == seq[1] & event2 == seq[2] & event3 == seq[3]) %>%
  filter(time1 %>% between(time_LB[1],time_UB[1])) %>%
  filter((time2-time1) %>% between(time_LB[2],time_UB[2])) %>%
  filter((time3-time2) %>% between(time_LB[3],time_UB[3])) %>%
  group_by(id,time1) %>%
  slice(1) %>% # slice 1 row for each unique id and time1 (so no duplicate time1s)
  group_by(id) %>%
  count()

Here is the output:

# A tibble: 2 x 2
     id     n
  <dbl> <int>
1     1     2
2     2     2

Also, if you leave off the last 2 parts of the dplyr pipe that do the counting (to see which sequences it matches), you get the following sequences:

Source: local data frame [4 x 7]
Groups: id, time1 [4]

     id event1 time1 event2 time2 event3 time3
  <dbl> <fctr> <dbl> <fctr> <dbl> <fctr> <dbl>
1     1      a     1      b     6      a    24
2     1      a    24      b    30      a    42
3     2      a     1      b     6      a    24
4     2      a     2      b     6      a    24

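Edit, regarding the comment about generalizing this: yes, it can be generalized to sequences of arbitrary length, but it takes some R voodoo. Most notably, note the use of Reduce, which lets you apply a common function across a list of objects, and of foreach, which I borrow from the foreach package to do some arbitrary looping. Here is the code: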

library(dplyr)
library(foreach)

df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))

seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

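multi_full_join = function(df1,df2) {full_join(df1,df2,by='id')}
# give each copy its own event/time column names so repeated joins cannot collide
df_list = foreach(i=seq_along(seq)) %do% {setNames(df,c('id',paste0('event',i),paste0('time',i)))}
df2 = Reduce(multi_full_join,df_list)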

names(df2)[grep('event',names(df2))] = paste0('event',seq_along(seq))
names(df2)[grep('time',names(df2))] = paste0('time',seq_along(seq))
df2 = df2 %>% mutate_if(is.factor,as.character)

df2 = df2 %>%
  mutate(seq_string = Reduce(paste0,df2 %>% select(grep('event',names(df2))) %>% as.list)) %>%
  filter(seq_string == paste0(seq,collapse=''))

time_diff = df2 %>% select(grep('time',names(df2))) %>%
  t %>%
  as.data.frame() %>%
  lapply(diff) %>% # gaps between consecutive events, one vector per candidate row
  unlist %>% matrix(ncol=length(seq)-1,byrow=TRUE) %>% # one column per gap
  as.data.frame

foreach(i=seq_along(time_diff),.combine=data.frame) %do%
  {
    time_diff[[i]] %>% between(time_LB[i+1],time_UB[i+1])
  } %>%
  Reduce(`&`,.) %>%
  which %>%
  slice(df2,.) %>%
  filter(time1 %>% between(time_LB[1],time_UB[1])) %>% # deal with time1 bounds, which we skipped over earlier
  group_by(id,time1) %>%
  slice(1) # slice 1 row for each unique id and time1 (so no duplicate time1s)

The output is as follows:

Source: local data frame [4 x 8]
Groups: id, time1 [4]

     id event1 time1 event2 time2 event3 time3 seq_string
  <dbl>  <chr> <dbl>  <chr> <dbl>  <chr> <dbl>      <chr>
1     1      a     1      b     6      a    24        aba
2     1      a    24      b    30      a    42        aba
3     2      a     1      b     6      a    24        aba
4     2      a     2      b     6      a    24        aba

If you only want the counts, you can group_by(id) and then count(), as in the original code snippet.
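As a point of comparison, the folding idea behind Reduce can also be used to extend partial matches one event at a time instead of building one wide join. The following is a minimal base R sketch of that alternative, not the method used in the answer. The function name count_sequences and its internal column names (start, last) are hypothetical, and it assumes the same reading of the bounds as the code above: the first bound constrains the start time and the remaining bounds constrain the gaps.

```r
df <- data.frame(
  id    = c(rep(1, 7), rep(2, 5)),
  event = c("a", "b", "b", "b", "a", "b", "a", "a", "a", "b", "a", "a"),
  time  = c(1, 3, 6, 12, 24, 30, 42, 1, 2, 6, 17, 24),
  stringsAsFactors = FALSE
)
seq     <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

# Hypothetical helper: count sequences per id, one per distinct first event.
count_sequences <- function(df, seq, time_LB, time_UB) {
  # Candidate starts: rows matching the first event within the first bounds.
  ok <- df$event == seq[1] & df$time >= time_LB[1] & df$time <= time_UB[1]
  starts <- data.frame(id = df$id[ok], start = df$time[ok], last = df$time[ok])

  # Extend every partial match by event k; keep only (id, start, last).
  extend <- function(cur, k) {
    nxt  <- df[df$event == seq[k], c("id", "time")]
    m    <- merge(cur, nxt, by = "id")
    gap  <- m$time - m$last
    keep <- gap >= time_LB[k] & gap <= time_UB[k]
    unique(data.frame(id = m$id[keep], start = m$start[keep], last = m$time[keep]))
  }
  # Fold over the remaining positions of the sequence, starting from `starts`.
  done <- Reduce(extend, seq_along(seq)[-1], starts)

  # Rule 2: each distinct (id, start) counts once.
  firsts <- unique(done[, c("id", "start")])
  ids <- sort(unique(firsts$id))
  data.frame(id = ids,
             count = vapply(ids, function(i) sum(firsts$id == i), integer(1)))
}

count_sequences(df, seq, time_LB, time_UB)
```

Because each step keeps only the distinct (id, start, last) combinations, the intermediate tables stay smaller than the full cross product of all rows within an id. Like the dplyr version, ids with no matching sequence are dropped from the result rather than reported with a count of 0.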

Regarding "R - Find a sequence of row elements based on time constraints in a dataframe", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41772024/
