
r - Count cumulative unique semicolon-separated factors, grouped by Name

Reposted. Author: 行者123. Updated: 2023-12-02 01:05:03

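This is what my data frame looks like. The two rightmost columns are the columns I want. For each row, I am counting the cumulative number of unique FundType values. The fourth column is the cumulative unique count across all ActivityType values, and the fifth column is the cumulative unique count for ActivityType == "Sale" only.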

dt <- read.table(text='

Name ActivityType FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)

John Email a 1 0
John Sale a;b 2 2
John Webinar c;d 4 2
John Sale b 4 2
John Webinar e 5 2
John Conference b;d 5 2
John Sale b;e 5 3
Tom Email a 1 0
Tom Sale a;b 2 2
Tom Webinar c;d 4 2
Tom Sale b 4 2
Tom Webinar e 5 2
Tom Conference b;d 5 2
Tom Sale b;e;f 6 4

', header=T, row.names = NULL)

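I tried dt[, UniqueFunds := cumsum(!duplicated(FundType) & !FundType == ""), by = Name], but it counts, for example, a, a;b and c;d as 3 unique values instead of the desired 4, because the factors are separated by semicolons. Please let me know a solution.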
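To make the discrepancy concrete, here is a minimal base R sketch. The vector fund_type is a toy stand-in for the first three FundType values of one Name (an assumption for illustration, not the real data). It contrasts counting whole strings with counting the individual semicolon-separated tokens.

```r
# Toy stand-in for the first three FundType values of one Name
fund_type <- c("a", "a;b", "c;d")

# Counting whole strings: "a", "a;b" and "c;d" are three distinct strings
whole_string_count <- cumsum(!duplicated(fund_type))

# Split on ";" first, remembering which original row each token came from
pieces <- strsplit(fund_type, ";", fixed = TRUE)
tokens <- unlist(pieces)
row_id <- rep(seq_along(pieces), lengths(pieces))

# Running count of first-seen tokens, then the value reached at the end of each row
token_count <- cumsum(!duplicated(tokens))
per_row_count <- as.vector(tapply(token_count, row_id, max))

print(whole_string_count)
print(per_row_count)
```

The second vector is the one that matches the desired column, since a token is only counted the first time it appears anywhere in the group.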

Update: my real dataset looks more like this:

dt <- read.table(text='

Name ActivityType FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)
John Email "" 0 0
John Conference "" 0 0
John Email a 1 0
John Sale a;b 2 2
John Webinar c;d 4 2
John Sale b 4 2
John Webinar e 5 2
John Conference b;d 5 2
John Sale b;e 5 3
John Email "" 5 3
John Webinar "" 5 3
Tom Email a 1 0
Tom Sale a;b 2 2
Tom Webinar c;d 4 2
Tom Sale b 4 2
Tom Webinar e 5 2
Tom Conference b;d 5 2
Tom Sale b;e;f 6 4

', header=T, row.names = NULL)

The cumulative unique count needs to take missing values into account.

Best answer

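nrussell suggested a concise solution that writes a custom function. Let me leave what I came up with. Following your attempt, I tried using cumsum() and duplicated(). I did two major operations: one for alltype and the other for saleonly. First, I created an index for each Name. Then I split FundType and reshaped the data into long format with cSplit() from the splitstackshape package, and computed the running count of first-seen funds within each Name. Next, I chose the last row for each index number within each Name. Finally, I kept only one column, alltype.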

library(splitstackshape)
library(zoo)
library(data.table)

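# Row index within each Name, so the split rows can be collapsed back later
setDT(dt)[, ind := 1:.N, by = "Name"]
# Long format (one fund per row), running count of first-seen funds per Name,
# then keep the last split row of each original row
cSplit(dt, "FundType", sep = ";", direction = "long")[,
alltype := cumsum(!duplicated(FundType)), by = "Name"][,
.SD[.N], by = c("Name", "ind")][, list(alltype)] -> alltype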

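The second operation is for sale only. Basically, I repeated the same approach on the subset of the data where ActivityType is Sale, which is ana. I also created a dataset without the Sale rows, which is ana2. Then I created a list with the two datasets (l) and bound them. I reordered the combined data by Name and ind, took the last row for each Name and index number, took care of the NAs (filling NAs forward and replacing the leading NAs of each Name with 0), and finally kept only one column. The last step was to combine the original dt, alltype, and saleonly.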

# data for sale only
cSplit(dt, "FundType", sep = ";", direction = "long")[
ActivityType == "Sale"][,
saleonly := cumsum(!duplicated(FundType)), by = "Name"] -> ana

# Data without sale
cSplit(dt, "FundType", sep = ";", direction = "long")[
ActivityType != "Sale"] -> ana2

# Combine ana and ana2
l <- list(ana, ana2)
rbindlist(l, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,
.SD[.N], by = c("Name", "ind")][,
saleonly := na.locf(saleonly, na.rm = FALSE), by = "Name"][,
saleonly := replace(saleonly, is.na(saleonly), 0)][, list(saleonly)] -> saleonly

cbind(dt, alltype, saleonly)

Name ActivityType FundType UniqueFunds.AllTypes. UniqueFunds.SaleOnly. ind alltype saleonly
1: John Email a 1 0 1 1 0
2: John Sale a;b 2 2 2 2 2
3: John Webinar c;d 4 2 3 4 2
4: John Sale b 4 2 4 4 2
5: John Webinar e 5 2 5 5 2
6: John Conference b;d 5 2 6 5 2
7: John Sale b;e 5 3 7 5 3
8: Tom Email a 1 0 1 1 0
9: Tom Sale a;b 2 2 2 2 2
10: Tom Webinar c;d 4 2 3 4 2
11: Tom Sale b 4 2 4 4 2
12: Tom Webinar e 5 2 5 5 2
13: Tom Conference b;d 5 2 6 5 2
14: Tom Sale b;e;f 6 4 7 6 4
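The NA handling in the sale-only step can be illustrated on a single vector. The sketch below is a base R stand-in for what na.locf(x, na.rm = FALSE) followed by replace() does in the answer. The helper name locf_keep_leading and the toy vector are assumptions for illustration; the vector mirrors John's saleonly values before filling, with NA on the non-Sale rows.

```r
# Hypothetical helper: carry the last non-NA value forward and leave
# leading NAs in place (mimics zoo::na.locf(x, na.rm = FALSE) for a plain vector)
locf_keep_leading <- function(x) {
  # Position of the most recent non-NA element, or 0 if none seen yet
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  # Prepend an NA so that position 0 maps to NA
  c(NA, x)[idx + 1L]
}

# John's sale-only counts before filling: only Sale rows carry a value
saleonly_raw <- c(NA, 2, NA, 2, NA, NA, 3)

filled <- locf_keep_leading(saleonly_raw)
# Rows before the first Sale have nothing to carry forward, so they become 0
saleonly_final <- replace(filled, is.na(filled), 0)

print(saleonly_final)
```

In the answer this is done per Name (by = "Name"), so a value from one person is never carried into the next person's rows.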

Edit

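For the new dataset, I tried the following. Basically, I applied the approach I used for the sale-only data to this new dataset. The revision is only in the alltype part. First, I added the index, replaced "" with NA, and subsetted the data to the rows with non-NA values. This is temp. The next part is the same as in the previous answer. Then I wanted the rows with NA in FundType, so I used setdiff(). With rbindlist(), I combined the two datasets and created temp again. The rest is the same as in the previous answer. Nothing changes in the sale-only part. I hope this works for your real data.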

### all type

# Add the per-Name index, turn "" into NA, and keep only rows that have a fund
setDT(dt)[, ind := 1:.N, by = "Name"][,
FundType := replace(FundType, which(FundType == ""), NA)][!is.na(FundType)] -> temp
cSplit(temp, "FundType", sep = ";", direction = "long")[,
alltype := cumsum(!duplicated(FundType)), by = "Name"] -> alltype


whatever <- list(setdiff(dt, temp), alltype)
rbindlist(whatever, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,.SD[.N], by = c("Name", "ind")][,
alltype := na.locf(alltype, na.rm = FALSE), by = "Name"][,
alltype := replace(alltype, is.na(alltype), 0)][, list(alltype)] -> alltype


### sale only
cSplit(dt, "FundType", sep = ";", direction = "long")[
ActivityType == "Sale"][,
saleonly := cumsum(!duplicated(FundType)), by = "Name"] -> ana

cSplit(dt, "FundType", sep = ";", direction = "long")[
ActivityType != "Sale"] -> ana2

l <- list(ana, ana2)
rbindlist(l, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,
.SD[.N], by = c("Name", "ind")][,
saleonly := na.locf(saleonly, na.rm = FALSE), by = "Name"][,
saleonly := replace(saleonly, is.na(saleonly), 0)][, list(saleonly)] -> saleonly

cbind(dt, alltype, saleonly)


Name ActivityType FundType UniqueFunds.AllTypes. UniqueFunds.SaleOnly. ind alltype saleonly
1: John Email NA 0 0 1 0 0
2: John Conference NA 0 0 2 0 0
3: John Email a 1 0 3 1 0
4: John Sale a;b 2 2 4 2 2
5: John Webinar c;d 4 2 5 4 2
6: John Sale b 4 2 6 4 2
7: John Webinar e 5 2 7 5 2
8: John Conference b;d 5 2 8 5 2
9: John Sale b;e 5 3 9 5 3
10: John Email NA 5 3 10 5 3
11: John Webinar NA 5 3 11 5 3
12: Tom Email a 1 0 1 1 0
13: Tom Sale a;b 2 2 2 2 2
14: Tom Webinar c;d 4 2 3 4 2
15: Tom Sale b 4 2 4 4 2
16: Tom Webinar e 5 2 5 5 2
17: Tom Conference b;d 5 2 6 5 2
18: Tom Sale b;e;f 6 4 7 6 4
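For comparison, the custom-function route mentioned at the top of the answer can be sketched in base R as follows. This is not nrussell's code. The helper cum_unique_tokens, its keep argument, and the data frame df (retyped from the updated example) are assumptions for illustration. The idea is to walk down each Name's rows, keep a set of funds seen so far, and record the size of that set at every row.

```r
# Hypothetical helper (not from the original answer): running count of
# distinct ';'-separated tokens. `keep` flags the rows allowed to contribute.
cum_unique_tokens <- function(x, keep = rep(TRUE, length(x))) {
  x <- as.character(x)
  x[is.na(x)] <- ""                      # treat NA like an empty entry
  seen <- character(0)
  out <- integer(length(x))
  for (i in seq_along(x)) {
    if (keep[i]) {
      # strsplit("", ";") yields character(0), so empty rows add nothing
      seen <- union(seen, strsplit(x[i], ";", fixed = TRUE)[[1]])
    }
    out[i] <- length(seen)
  }
  out
}

# Updated example data, retyped without the two target columns
df <- data.frame(
  Name = rep(c("John", "Tom"), times = c(11, 7)),
  ActivityType = c("Email", "Conference", "Email", "Sale", "Webinar", "Sale",
                   "Webinar", "Conference", "Sale", "Email", "Webinar",
                   "Email", "Sale", "Webinar", "Sale", "Webinar",
                   "Conference", "Sale"),
  FundType = c("", "", "a", "a;b", "c;d", "b", "e", "b;d", "b;e", "", "",
               "a", "a;b", "c;d", "b", "e", "b;d", "b;e;f"),
  stringsAsFactors = FALSE
)

# Apply the helper within each Name via row indices
rows <- seq_len(nrow(df))
df$alltype <- ave(rows, df$Name,
                  FUN = function(i) cum_unique_tokens(df$FundType[i]))
df$saleonly <- ave(rows, df$Name,
                   FUN = function(i) cum_unique_tokens(df$FundType[i],
                                                       keep = df$ActivityType[i] == "Sale"))
print(df)
```

Because empty entries contribute no tokens, this version needs no separate NA fill step. The explicit loop keeps the logic easy to follow but may be slower than the data.table pipeline on large data.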

Regarding r - Count cumulative unique semicolon-separated factors, grouped by Name, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34502356/
