
R: Summing a variable based on certain conditions

Reposted. Author: 行者123. Updated: 2023-12-04 15:10:11

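Given a table, I am trying to learn how to use R to sum a variable only over the periods that satisfy a specific condition (based on other variables in the same table).

Using the dplyr library, I created some data and then summed it by group: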

#load library
library(dplyr)

#create data
data <- data.frame(

"col_a" = c("aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "aaa", "aaa"),
"col_b" = c("123", "124", "125", "126", "127", "128", "129", "130"),
"col_c" = c("2015", "2015", "2015", "2015", "2015", "2015", "2015", "2015"),
"col_d" = c("red", "red", "red", "blue", "blue", "green", "green", "green"),
"day_a" = c("2001-01-01", "2000-01-05", "2000-01-01", "2010-12-20", "2010-12-20", "2020-05-05", "2020-05-05", "2020-05-28"),
"day_b" = c("2001-01-10", "2000-01-10", "2000-01-01", "2010-12-25", "2010-12-22", "2020-05-15", "2020-05-20", "2020-05-30")

)

#format variable types

data$col_a = as.factor(data$col_a)
data$col_b = as.factor(data$col_b)
data$col_c = as.factor(data$col_c)

#format date variables
data$day_a = as.factor(data$day_a)
data$day_b = as.factor(data$day_b)

data$day_1 = as.Date(as.character(data$day_a))
data$day_2 = as.Date(as.character(data$day_b))

#create new variable based on difference between date variables
data$diff = data$day_2 - data$day_1
data$diff = as.numeric(data$diff)

#create file that sums days based on groups of "col_a, col_c, col_d"
file = data %>%
  group_by(col_a, col_c, col_d) %>%
  dplyr::summarize(Total = sum(diff, na.rm = TRUE), Count = n())

file = as.data.frame(file)

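Now, for the groups defined by "col_a, col_c, col_d", I would like to sum the "diff" variable subject to an additional condition.

For example, for the group "aaa, 2015, green", I only want to sum the "unique days", that is, days covered by overlapping intervals should be counted once. The intervals are (2020-05-05, 2020-05-15), (2020-05-05, 2020-05-20), (2020-05-28, 2020-05-30).

For this group, I want the "Total" variable to equal 15 + 2 = 17 ... rather than "27".

This is because the dates (2020-05-05, 2020-05-15) fall entirely within the dates (2020-05-05, 2020-05-20). I only want to sum the "unique" date ranges.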
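
To make the arithmetic for the green group concrete, here is a minimal base R sketch. It assumes, as the numbers 15 + 2 imply, that the end day of each interval is not counted; the variable names are illustrative only.

```r
# Green group intervals from the example data, as integer day numbers
start <- as.integer(as.Date(c("2020-05-05", "2020-05-05", "2020-05-28")))
end   <- as.integer(as.Date(c("2020-05-15", "2020-05-20", "2020-05-30")))

# Naive total: add up every interval length, so the overlap is counted twice
naive_total <- sum(end - start)            # 10 + 15 + 2

# Unique total: list the days each interval covers (end day excluded),
# then count every calendar day only once
covered <- unlist(Map(function(s, e) seq(s, e - 1L), start, end))
unique_total <- length(unique(covered))    # 15 + 2

naive_total
unique_total
```

The first value is the 27 produced by a plain grouped sum, and the second is the 17 wanted here.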

I am trying to end up with something like this:

final_result <- data.frame ( col_a = c("aaa", "aaa", "aaa"),
col_c = c("2015", "2015", "2015"),
col_d = c("blue", "green", "red"),
total = c("5","17","9"),
count = c("2", "3", "3")

)

Can anyone tell me how to do this?

Thanks

Best answer

Here is an approach using purrr::map2:

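First, convert the Date columns to their integer representation. Then use map2 to create, for each row, a vector holding the integer sequence between the two dates. It seems you do not want to count the last day, so I subtract 1 from day_2.

We now have a new column, dates, which holds each row's dates as an integer vector.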

library(purrr)
data %>%
  transmute(dates = map2(as.integer(day_1), as.integer(day_2) - 1, seq))
1 11323, 11324, 11325, 11326, 11327, 11328, 11329, 11330, 11331
2 10961, 10962, 10963, 10964, 10965
3 10957, 10956
4 14963, 14964, 14965, 14966, 14967
5 14963, 14964
6 18387, 18388, 18389, 18390, 18391, 18392, 18393, 18394, 18395, 18396
7 18387, 18388, 18389, 18390, 18391, 18392, 18393, 18394, 18395, 18396, 18397, 18398, 18399, 18400, 18401
8 18410, 18411
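
One detail in this output is worth a second look: row 3 has day_1 equal to day_2, so the call becomes seq(10957, 10956), which counts downward and returns two values instead of none. A minimal sketch of a guard follows, assuming a zero-length interval should contribute no days; `days_covered` is a hypothetical helper name, not part of the original answer.

```r
# Hypothetical helper: integer day numbers in [start, end),
# returning an empty vector when end is not after start
days_covered <- function(start, end) {
  n <- max(as.integer(end) - as.integer(start), 0L)
  as.integer(start) + seq_len(n) - 1L
}

d <- as.Date("2000-01-01")

# seq() with from > to counts down, so a zero-length interval yields two values
length(seq(as.integer(d), as.integer(d) - 1L))

# The guarded helper yields no days for the same interval
length(days_covered(d, d))

# A regular interval still yields one value per day, end day excluded
days_covered(as.Date("2000-01-05"), as.Date("2000-01-10"))
```

Under that assumption, the helper could stand in for seq in the map2 call, for example `map2(day_1, day_2, days_covered)`.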

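We can then group as you did before and summarize by unlisting the dates within each group and removing duplicates with unique. Counting the remaining dates gives the total.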

data %>%
  mutate(dates = map2(as.integer(day_1), as.integer(day_2) - 1, seq)) %>%
  group_by(col_a, col_c, col_d) %>%
  dplyr::summarize(Total = length(unique(unlist(dates))), Count = n())
# A tibble: 3 x 5
# Groups: col_a, col_c [1]
col_a col_c col_d Total Count
<fct> <fct> <chr> <int> <int>
1 aaa 2015 blue 5 2
2 aaa 2015 green 17 3
3 aaa 2015 red 16 3
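
As an alternative that avoids listing every single day, the intervals in a group can be sorted and merged, adding only the part of each interval that extends past what is already covered. The sketch below is not from the original answer; `union_days` is a hypothetical helper, and it assumes every end date is on or after its start date and that the end day is excluded.

```r
# Hypothetical helper: number of days covered by the union of [start, end) intervals
union_days <- function(start, end) {
  s <- as.integer(start)
  e <- as.integer(end)
  ord <- order(s)          # process intervals from the earliest start
  s <- s[ord]
  e <- e[ord]
  total <- 0
  cur_end <- -Inf          # right edge of the stretch covered so far
  for (i in seq_along(s)) {
    if (e[i] > cur_end) {
      # add only the part of this interval beyond what is already covered
      total <- total + e[i] - max(s[i], cur_end)
      cur_end <- e[i]
    }
  }
  total
}

green <- union_days(as.Date(c("2020-05-05", "2020-05-05", "2020-05-28")),
                    as.Date(c("2020-05-15", "2020-05-20", "2020-05-30")))
blue  <- union_days(as.Date(c("2010-12-20", "2010-12-20")),
                    as.Date(c("2010-12-25", "2010-12-22")))
red   <- union_days(as.Date(c("2001-01-01", "2000-01-05", "2000-01-01")),
                    as.Date(c("2001-01-10", "2000-01-10", "2000-01-01")))

c(blue = blue, green = green, red = red)
```

On the example data this gives 5 for blue and 17 for green, matching the table above, and 14 for red. The red total differs from the 16 shown because the zero-length interval (2000-01-01, 2000-01-01) contributes no days here, whereas seq(10957, 10956) contributes two. Inside the grouped pipeline, the helper could presumably be used as `summarize(Total = union_days(day_1, day_2), Count = n())`.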

Regarding "R: Summing a variable based on certain conditions", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/65333809/
