
r - Dplyr conditional window

Reposted. Author: 行者123. Updated: 2023-12-04 11:20:57

I am trying to transform the following R data.frame:

structure(list(
    Time = c("09:30:01", "09:30:29", "09:35:56", "09:37:17", "09:37:21",
             "09:37:28", "09:37:35", "09:37:51", "09:42:11", "10:00:31"),
    Price = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
    Volume = c(100, 200, 300, 100, 200, 300, 100, 200, 600, 100)),
  .Names = c("Time", "Price", "Volume"),
  row.names = c(NA, 10L),
  class = "data.frame")

Time Price Volume
1 09:30:01 1 100
2 09:30:29 2 200
3 09:35:56 3 300
4 09:37:17 4 100
5 09:37:21 5 200
6 09:37:28 6 300
7 09:37:35 7 100
8 09:37:51 8 200
9 09:42:11 9 600
10 10:00:31 10 100

into this:
       Time Price  Volume Bin
1 09:30:01 1 100 1
2 09:30:29 2 200 1
3 09:35:56 3 200 1
4 09:35:56 3 100 2
5 09:37:17 4 100 2
6 09:37:21 5 200 2
7 09:37:28 6 100 2
8 09:37:28 6 200 3
9 09:37:35 7 100 3
10 09:37:51 8 200 3
11 09:42:11 9 500 4
12 09:42:11 9 100 5
13 10:00:31 10 100 5

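Essentially, it computes a cumulative sum of Volume and starts a new bin every time that sum passes another 500. So bin 1 is 100+200+200: the Volume at 09:35:56 is split into 200/100, a new row is inserted, and the bin counter is incremented.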
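To make the arithmetic concrete, here is a minimal base R sketch of where the splits fall. It only uses the Volume column from the example; the names `size`, `before`, `after`, `first_bin` and `last_bin` are illustrative and do not come from the original post.

```r
# Volume column from the example data above
volume <- c(100, 200, 300, 100, 200, 300, 100, 200, 600, 100)
size <- 500  # bin size (illustrative name)

after  <- cumsum(volume)   # running total after each row
before <- after - volume   # running total before each row

first_bin <- before %/% size + 1        # bin that holds the row's first unit
last_bin  <- (after - 1) %/% size + 1   # bin that holds the row's last unit

# A row has to be split when its first and last unit land in different bins
data.frame(volume, before, after, first_bin, last_bin,
           split = first_bin != last_bin)
```

With this data the rows flagged for splitting are rows 3, 6 and 9, which are exactly the rows that appear twice in the desired output.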

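This is relatively straightforward in base R, but I was wondering whether dplyr offers a more elegant and hopefully faster way.

Cheers

Update:

Thanks @Frank and @AntoniosK.

To address your questions: the Volume values are all positive integers ranging from 1 to 10k.

I microbenchmarked both approaches on a dataset similar to the one above with about 200k rows. dplyr is slightly faster, but not by much.

Many thanks for the quick responses and the help.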

Best answer

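Not sure this is the best or the fastest way, but it seems to be quite fast for those Volume values. The philosophy is simple. Based on the Volume value, you create multiple rows with the same Time and Price values and Volume = 1. Then let cumsum add up the numbers and flag every time a new batch of 500 starts. Use those flags to create your Bin values.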
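The idea is easier to see on a toy input first. The sketch below is an illustration only: it uses three made-up volumes and a batch size of 5 instead of 500, and the names `row_id`, `unit_no`, `flag` and `bin` are hypothetical.

```r
# Toy input: three rows with volumes 3, 4 and 2, batch size 5 (made-up values)
volume <- c(3, 4, 2)
batch  <- 5

row_id  <- rep(seq_along(volume), volume)  # one entry per unit of volume
unit_no <- seq_along(row_id)               # same as cumsum() over the unit rows

# Flag the first unit of every new batch (unit 6 here; 501, 1001, ... in the post)
flag <- as.integer(unit_no %in% seq(batch + 1, sum(volume), by = batch))
bin  <- cumsum(flag) + 1                   # the bin counter grows at each flag

# Units per original row and bin: row 2 is split 2/2 across bins 1 and 2
table(row_id, bin)
```

Summing the units per row and bin then gives back the split volumes, which is what the final `summarise()` step in the full pipeline does.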

dt <- structure(list(
    Time = c("09:30:01", "09:30:29", "09:35:56", "09:37:17", "09:37:21",
             "09:37:28", "09:37:35", "09:37:51", "09:42:11", "10:00:31"),
    Price = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
    Volume = c(100, 200, 300, 100, 200, 300, 100, 200, 600, 100)),
  .Names = c("Time", "Price", "Volume"),
  row.names = c(NA, 10L),
  class = "data.frame")

library(dplyr)

dt %>%
  group_by(Time, Price) %>%  ## for each Time and Price
  do(data.frame(Volume = rep(1, .$Volume))) %>%  ## create as many rows, with Volume = 1, as the value of Volume
  ungroup() %>%  ## forget about the grouping
  mutate(CumSum = cumsum(Volume),  ## cumulative sums
         flag_500 = ifelse(CumSum %in% seq(501, sum(dt$Volume), by = 500), 1, 0),  ## flag 500 batches (at 501, 1001, etc.)
         Bin = cumsum(flag_500) + 1) %>%  ## create Bin values
  group_by(Bin, Time, Price) %>%  ## for each Bin, Time and Price
  summarise(Volume = sum(Volume)) %>%  ## get new Volume values
  select(Time, Price, Volume, Bin) %>%  ## use only if you want to re-arrange column order
  ungroup()  ## use if you want to forget the grouping

# Time Price Volume Bin
# (chr) (dbl) (dbl) (dbl)
# 1 09:30:01 1 100 1
# 2 09:30:29 2 200 1
# 3 09:35:56 3 200 1
# 4 09:35:56 3 100 2
# 5 09:37:17 4 100 2
# 6 09:37:21 5 200 2
# 7 09:37:28 6 100 2
# 8 09:37:28 6 200 3
# 9 09:37:35 7 100 3
# 10 09:37:51 8 200 3
# 11 09:42:11 9 500 4
# 12 09:42:11 9 100 5
# 13 10:00:31 10 100 5
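Because the approach above creates one row per unit of Volume, the intermediate table grows with the total volume rather than with the number of rows. As an alternative that is not part of the original answer, the same result can be sketched in base R by computing the pieces of each row directly. The function name `split_into_bins` and its `size` argument are hypothetical, and the sketch assumes strictly positive integer volumes, as stated in the question's update.

```r
# Example data from the question
dt <- data.frame(
  Time = c("09:30:01", "09:30:29", "09:35:56", "09:37:17", "09:37:21",
           "09:37:28", "09:37:35", "09:37:51", "09:42:11", "10:00:31"),
  Price = 1:10 + 0,
  Volume = c(100, 200, 300, 100, 200, 300, 100, 200, 600, 100),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split rows at every multiple of `size` of the running
# Volume total. Assumes all Volume values are positive integers.
split_into_bins <- function(df, size = 500) {
  end   <- cumsum(df$Volume)             # running total after each row
  start <- end - df$Volume               # running total before each row
  first_bin <- start %/% size + 1        # bin of the row's first unit
  last_bin  <- (end - 1) %/% size + 1    # bin of the row's last unit
  n_pieces  <- last_bin - first_bin + 1  # number of bins the row touches

  idx <- rep(seq_len(nrow(df)), n_pieces)         # source row of each piece
  bin <- first_bin[idx] + sequence(n_pieces) - 1  # bin of each piece
  lo  <- pmax(start[idx], (bin - 1) * size)       # piece start, clipped to its bin
  hi  <- pmin(end[idx], bin * size)               # piece end, clipped to its bin

  data.frame(Time = df$Time[idx], Price = df$Price[idx],
             Volume = hi - lo, Bin = bin, stringsAsFactors = FALSE)
}

res <- split_into_bins(dt)
res
```

On the example data this yields the same 13 rows as the output above. Whether it is faster than the dplyr pipeline on the 200k-row data was not measured here.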

Regarding r - Dplyr conditional window, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33525397/
