
r - data.table: get the N most frequent values by group

Reposted. Author: 行者123. Updated: 2023-12-04 22:19:49

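Suppose I want to find the 3 most frequently occurring zip codes for each purchase category. In this example, the categories are home, townhouse, and condo. I have transaction data like: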

library(data.table)

set.seed(1234)
d <- data.table(purch_id = 1:3e6,
                purch_cat = sample(x = c('home','townhouse','condo'),
                                   size = 3e6, replace=TRUE),
                purch_zip = formatC( sample(x = 1e4:9e4, size = 3e6, replace=TRUE),
                                     width = 5, format = "d", flag = "0") )

I know I can do this:
# there has to be a better way...
d[, list(purch_count = length(purch_id)),
  by = list(purch_cat, purch_zip)
  ][, purch_rank := rank(-purch_count, ties.method = 'min'),
    by = purch_cat
    ][purch_rank <= 3, ][order(purch_cat, purch_rank)]

    purch_cat purch_zip purch_count purch_rank
 1:     condo     39169          32          1
 2:     condo     15725          31          2
 3:     condo     75768          30          3
 4:     condo     72023          30          3
 5:      home     71294          30          1
 6:      home     56053          30          1
 7:      home     57971          29          3
 8:      home     77521          29          3
 9:      home     70124          29          3
10:      home     25302          29          3
11:      home     65292          29          3
12:      home     39488          29          3
13: townhouse     39587          33          1
14: townhouse     80365          30          2
15: townhouse     37360          30          2

But this is not the most elegant data.table approach, and it seems a bit slow.

Any suggestions for reducing the number of passes? Maybe something using table()? TYVM!

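Best answer

Edit: improved.
I think you're entirely on the right track. However, one key feature you're missing is the frank function, which is optimized and should speed up your code considerably (it runs almost instantly on the 3m-row sample data):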

d[ , .(purch_count = .N),
   by = .(purch_cat, purch_zip)
   ][, purch_rank := frank(-purch_count, ties.method = 'min'),
     keyby = purch_cat
     ][purch_rank <= 3,
       ][order(purch_cat, purch_rank)]

    purch_cat purch_zip purch_count purch_rank
 1:     condo     39169          32          1
 2:     condo     15725          31          2
 3:     condo     75768          30          3
 4:     condo     72023          30          3
 5:      home     71294          30          1
 6:      home     56053          30          1
 7:      home     57971          29          3
 8:      home     77521          29          3
 9:      home     70124          29          3
10:      home     25302          29          3
11:      home     65292          29          3
12:      home     39488          29          3
13: townhouse     39587          33          1
14: townhouse     80365          30          2
15: townhouse     37360          30          2
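
A note on why some categories return more than 3 rows: with ties.method = 'min', tied counts share the lowest rank of the tie, so filtering on purch_rank <= 3 keeps every zip code caught in a tie at or above third place. A minimal sketch on a made-up vector of counts (the values in `counts` are hypothetical, not taken from the sample data):

```r
library(data.table)

# Hypothetical per-zip counts for a single category (illustration only)
counts <- c(30, 30, 29, 29, 29, 28)

# With ties.method = "min", tied values share the lowest rank of the tie
# and the ranks that follow are skipped.
ranks <- frank(-counts, ties.method = "min")
ranks
# 1 1 3 3 3 6

# Filtering on rank <= 3 therefore keeps five values here, not three.
sum(ranks <= 3)
# 5
```

This matches the shape of the home rows above, where two zip codes tie at rank 1 and six tie at rank 3.
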

An incomplete answer using table (slow):
Yes, one approach is to use table:
d[ , {x <- table(purch_zip)
      x <- x[order(-x)]
      names(x[x %in% unique(x)[1:3]])
     }, keyby = purch_cat]

    purch_cat    V1
 1:     condo 39169
 2:     condo 15725
 3:     condo 72023
 4:     condo 75768
 5:      home 56053
 6:      home 71294
 7:      home 25302
 8:      home 39488
 9:      home 57971
10:      home 65292
11:      home 70124
12:      home 77521
13:      home 16943
14:      home 43003
15:      home 43426
16:      home 76501
17:      home 81754
18:      home 88978
19: townhouse 39587
20: townhouse 37360
21: townhouse 80365
22: townhouse 22402
23: townhouse 33518
24: townhouse 59347
25: townhouse 83099
    purch_cat    V1
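
The table() snippet keeps every zip code whose count is among the three highest distinct counts, which is why it returns more rows than the ranked version. That behaviour corresponds to dense ranking. If exactly N rows per category are wanted instead, one possible alternative (not part of the original answer) is to sort the counts and take head() per group. A rough sketch on hypothetical toy data, where `toy`, `counts`, `top_dense`, and `top_n` are illustrative names and N is 2:

```r
library(data.table)

# Hypothetical toy data, far smaller than the 3m-row sample
toy <- data.table(
  purch_cat = c("condo", "condo", "condo", "condo", "condo", "condo",
                "home", "home", "home"),
  purch_zip = c("11111", "11111", "22222", "22222", "33333", "44444",
                "55555", "55555", "66666")
)

# One row per (category, zip) with its frequency
counts <- toy[, .(purch_count = .N), by = .(purch_cat, purch_zip)]

# Dense rank <= 2: keep zips whose count is among the two highest
# distinct counts in their category (ties can return more than 2 rows)
counts[, purch_rank := frank(-purch_count, ties.method = "dense"),
       by = purch_cat]
top_dense <- counts[purch_rank <= 2][order(purch_cat, purch_rank)]

# Exactly 2 rows per category: sort by count, then take the head of each
# group; which tied zip survives depends on row order
top_n <- counts[order(purch_cat, -purch_count), head(.SD, 2), by = purch_cat]
```

Here top_dense has 6 rows because of ties, while top_n has exactly 2 rows per category. Which of the two is right depends on whether tied zip codes should all be reported.
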

Regarding "r - data.table: get the N most frequent values by group", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32339056/
