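I have a data table of about 1.5 million rows and hundreds of columns, representing dates with horse-racing results. It will be used for a predictive model, but first it needs feature engineering to compute strike rates for various entities, based on each entity's record up to the day before each race.
A "strike rate" can be defined in many ways, but a simple definition is the ratio of wins to runs for any given horse, trainer, jockey and so on. This must of course take all previous runs and wins into account, but exclude results from "today", since including them would be nonsense for building a model.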
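To make the definition concrete, here is a minimal base R sketch for a single entity. The toy data frame runs and its columns win and prior_rate are made up for illustration and are not part of the real data set.

```r
# Hypothetical record for one trainer: two runs on each of three race days.
runs <- data.frame(
  date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02",
                   "2020-01-02", "2020-01-03", "2020-01-03")),
  win  = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Strike rate going into each row: wins / runs over strictly earlier dates,
# so results from the row's own day never leak into the feature.
runs$prior_rate <- vapply(seq_len(nrow(runs)), function(i) {
  earlier <- runs$date < runs$date[i]
  sum(runs$win[earlier]) / sum(earlier)
}, numeric(1))

print(runs)
```

The first day has no history, so its rate is 0/0 (NaN in R); every later row uses only earlier days. This row-by-row scan is quadratic in the number of rows, which is why it does not scale to the full table.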
Anyway, a simplified data structure, adapted from some examples online, is enough to explain the problem.
Generate the data as follows:
library(data.table)

n <- 90
dt <- data.table(
  date=rep(seq(as.Date('2010-01-01'), as.Date('2015-01-01'), by='year'), n/6),
  finish=c(1:5),
  trainer=sort(rep(letters[1:5], n/5))
)
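The overall strike rate per trainer is simple to calculate:
dt[order(trainer, date), .(strike_rate = sum(finish==1)/.N), by=trainer]
I tried using the shift function and data.table constructs, but could not get them to work for this particular problem. In a loop, however, it works fine, albeit incredibly slowly: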
# order dates most recent to oldest so that the loop works backwards in time:
dt <- dt[order(-date)]
# find unique dates (converting to character as something weird with date)
dates = as.character(unique(dt$date))
for (d in dates) {
  # find unique trainers on this date
  trainers = unique(dt$trainer[dt$date==d])
  for (t in trainers) {
    trainer_past_form = dt[trainer==t & date < d]
    strike_rate = sum(trainer_past_form$finish==1)/nrow(trainer_past_form)
    # save this strike rate for this day and this trainer
    dt$strike_rate[dt$trainer==t & dt$date==d] <- strike_rate
  }
}
The loop produces the following table:
date finish trainer strike_rate
1: 2015-01-01 1 a 0.2000000
2: 2015-01-01 2 a 0.2000000
3: 2015-01-01 3 a 0.2000000
4: 2015-01-01 4 b 0.2000000
5: 2015-01-01 5 b 0.2000000
6: 2015-01-01 1 b 0.2000000
7: 2015-01-01 2 c 0.2000000
8: 2015-01-01 3 c 0.2000000
9: 2015-01-01 4 c 0.2000000
10: 2015-01-01 5 d 0.2000000
11: 2015-01-01 1 d 0.2000000
12: 2015-01-01 2 d 0.2000000
13: 2015-01-01 3 e 0.2000000
14: 2015-01-01 4 e 0.2000000
15: 2015-01-01 5 e 0.2000000
16: 2014-01-01 5 a 0.1666667
17: 2014-01-01 1 a 0.1666667
18: 2014-01-01 2 a 0.1666667
19: 2014-01-01 3 b 0.2500000
20: 2014-01-01 4 b 0.2500000
21: 2014-01-01 5 b 0.2500000
22: 2014-01-01 1 c 0.1666667
23: 2014-01-01 2 c 0.1666667
24: 2014-01-01 3 c 0.1666667
25: 2014-01-01 4 d 0.1666667
26: 2014-01-01 5 d 0.1666667
27: 2014-01-01 1 d 0.1666667
28: 2014-01-01 2 e 0.2500000
29: 2014-01-01 3 e 0.2500000
30: 2014-01-01 4 e 0.2500000
31: 2013-01-01 4 a 0.1111111
32: 2013-01-01 5 a 0.1111111
33: 2013-01-01 1 a 0.1111111
34: 2013-01-01 2 b 0.3333333
35: 2013-01-01 3 b 0.3333333
36: 2013-01-01 4 b 0.3333333
37: 2013-01-01 5 c 0.1111111
38: 2013-01-01 1 c 0.1111111
39: 2013-01-01 2 c 0.1111111
40: 2013-01-01 3 d 0.2222222
41: 2013-01-01 4 d 0.2222222
42: 2013-01-01 5 d 0.2222222
43: 2013-01-01 1 e 0.2222222
44: 2013-01-01 2 e 0.2222222
45: 2013-01-01 3 e 0.2222222
46: 2012-01-01 3 a 0.1666667
47: 2012-01-01 4 a 0.1666667
48: 2012-01-01 5 a 0.1666667
49: 2012-01-01 1 b 0.3333333
50: 2012-01-01 2 b 0.3333333
51: 2012-01-01 3 b 0.3333333
52: 2012-01-01 4 c 0.0000000
53: 2012-01-01 5 c 0.0000000
54: 2012-01-01 1 c 0.0000000
55: 2012-01-01 2 d 0.3333333
56: 2012-01-01 3 d 0.3333333
57: 2012-01-01 4 d 0.3333333
58: 2012-01-01 5 e 0.1666667
59: 2012-01-01 1 e 0.1666667
60: 2012-01-01 2 e 0.1666667
61: 2011-01-01 2 a 0.3333333
62: 2011-01-01 3 a 0.3333333
63: 2011-01-01 4 a 0.3333333
64: 2011-01-01 5 b 0.3333333
65: 2011-01-01 1 b 0.3333333
66: 2011-01-01 2 b 0.3333333
67: 2011-01-01 3 c 0.0000000
68: 2011-01-01 4 c 0.0000000
69: 2011-01-01 5 c 0.0000000
70: 2011-01-01 1 d 0.3333333
71: 2011-01-01 2 d 0.3333333
72: 2011-01-01 3 d 0.3333333
73: 2011-01-01 4 e 0.0000000
74: 2011-01-01 5 e 0.0000000
75: 2011-01-01 1 e 0.0000000
76: 2010-01-01 1 a NaN
77: 2010-01-01 2 a NaN
78: 2010-01-01 3 a NaN
79: 2010-01-01 4 b NaN
80: 2010-01-01 5 b NaN
81: 2010-01-01 1 b NaN
82: 2010-01-01 2 c NaN
83: 2010-01-01 3 c NaN
84: 2010-01-01 4 c NaN
85: 2010-01-01 5 d NaN
86: 2010-01-01 1 d NaN
87: 2010-01-01 2 d NaN
88: 2010-01-01 3 e NaN
89: 2010-01-01 4 e NaN
90: 2010-01-01 5 e NaN
Best answer
Here are some options.
1) Using a non-equi join:
dt[, strike_rate :=
  .SD[.SD, on=.(trainer, date<date), by=.EACHI, sum(finish==1L)/.N]$V1
]
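To see what the non-equi self-join matches, here is a small sketch on a made-up table. The names toy, prior, wins and runs are illustrative only. For each row, the join collects the same trainer's rows with a strictly earlier date and aggregates them once per row through by=.EACHI.

```r
library(data.table)

# Hypothetical history for one trainer over three race days.
toy <- data.table(
  trainer = c("a", "a", "a", "a"),
  date    = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03")),
  finish  = c(1L, 2L, 3L, 1L)
)

# Self-join: for every row of the inner toy, match rows of the outer toy
# with the same trainer and an earlier date, then summarise per inner row.
# Inside j, finish refers to the matched (earlier) rows.
prior <- toy[toy, on = .(trainer, date < date), by = .EACHI,
             .(wins = sum(finish == 1L), runs = .N)]

print(prior)
```

Dividing wins by runs gives the strike rate going into each row. Rows on the first date have no earlier runs to match, so they carry no usable rate.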
2) Using cumulative sums within each trainer, carried forward with nafill:
dt[order(trainer, date), strike_rate := {
  ri <- rleid(date)
  firstd <- which(diff(ri) != 0) + 1L
  cs <- replace(rep(NA_real_, .N), firstd, cumsum(finish==1L)[firstd - 1L])
  k <- replace(rep(NA_real_, .N), firstd, as.double(1:.N)[firstd - 1L])
  nafill(cs, "locf") / nafill(k, "locf")
}, trainer]
Output of setorder(dt, -date, trainer, finish)[]:
date finish trainer strike_rate
1: 2015-01-01 1 a 0.2000000
2: 2015-01-01 2 a 0.2000000
3: 2015-01-01 3 a 0.2000000
4: 2015-01-01 1 b 0.2000000
5: 2015-01-01 4 b 0.2000000
6: 2015-01-01 5 b 0.2000000
7: 2015-01-01 2 c 0.2000000
8: 2015-01-01 3 c 0.2000000
9: 2015-01-01 4 c 0.2000000
10: 2015-01-01 1 d 0.2000000
11: 2015-01-01 2 d 0.2000000
12: 2015-01-01 5 d 0.2000000
13: 2015-01-01 3 e 0.2000000
14: 2015-01-01 4 e 0.2000000
15: 2015-01-01 5 e 0.2000000
16: 2014-01-01 1 a 0.1666667
17: 2014-01-01 2 a 0.1666667
18: 2014-01-01 5 a 0.1666667
19: 2014-01-01 3 b 0.2500000
20: 2014-01-01 4 b 0.2500000
21: 2014-01-01 5 b 0.2500000
22: 2014-01-01 1 c 0.1666667
23: 2014-01-01 2 c 0.1666667
24: 2014-01-01 3 c 0.1666667
25: 2014-01-01 1 d 0.1666667
26: 2014-01-01 4 d 0.1666667
27: 2014-01-01 5 d 0.1666667
28: 2014-01-01 2 e 0.2500000
29: 2014-01-01 3 e 0.2500000
30: 2014-01-01 4 e 0.2500000
31: 2013-01-01 1 a 0.1111111
32: 2013-01-01 4 a 0.1111111
33: 2013-01-01 5 a 0.1111111
34: 2013-01-01 2 b 0.3333333
35: 2013-01-01 3 b 0.3333333
36: 2013-01-01 4 b 0.3333333
37: 2013-01-01 1 c 0.1111111
38: 2013-01-01 2 c 0.1111111
39: 2013-01-01 5 c 0.1111111
40: 2013-01-01 3 d 0.2222222
41: 2013-01-01 4 d 0.2222222
42: 2013-01-01 5 d 0.2222222
43: 2013-01-01 1 e 0.2222222
44: 2013-01-01 2 e 0.2222222
45: 2013-01-01 3 e 0.2222222
46: 2012-01-01 3 a 0.1666667
47: 2012-01-01 4 a 0.1666667
48: 2012-01-01 5 a 0.1666667
49: 2012-01-01 1 b 0.3333333
50: 2012-01-01 2 b 0.3333333
51: 2012-01-01 3 b 0.3333333
52: 2012-01-01 1 c 0.0000000
53: 2012-01-01 4 c 0.0000000
54: 2012-01-01 5 c 0.0000000
55: 2012-01-01 2 d 0.3333333
56: 2012-01-01 3 d 0.3333333
57: 2012-01-01 4 d 0.3333333
58: 2012-01-01 1 e 0.1666667
59: 2012-01-01 2 e 0.1666667
60: 2012-01-01 5 e 0.1666667
61: 2011-01-01 2 a 0.3333333
62: 2011-01-01 3 a 0.3333333
63: 2011-01-01 4 a 0.3333333
64: 2011-01-01 1 b 0.3333333
65: 2011-01-01 2 b 0.3333333
66: 2011-01-01 5 b 0.3333333
67: 2011-01-01 3 c 0.0000000
68: 2011-01-01 4 c 0.0000000
69: 2011-01-01 5 c 0.0000000
70: 2011-01-01 1 d 0.3333333
71: 2011-01-01 2 d 0.3333333
72: 2011-01-01 3 d 0.3333333
73: 2011-01-01 1 e 0.0000000
74: 2011-01-01 4 e 0.0000000
75: 2011-01-01 5 e 0.0000000
76: 2010-01-01 1 a NA
77: 2010-01-01 2 a NA
78: 2010-01-01 3 a NA
79: 2010-01-01 1 b NA
80: 2010-01-01 4 b NA
81: 2010-01-01 5 b NA
82: 2010-01-01 2 c NA
83: 2010-01-01 3 c NA
84: 2010-01-01 4 c NA
85: 2010-01-01 1 d NA
86: 2010-01-01 2 d NA
87: 2010-01-01 5 d NA
88: 2010-01-01 3 e NA
89: 2010-01-01 4 e NA
90: 2010-01-01 5 e NA
date finish trainer strike_rate
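The cumulative-sum idea can be shown in base R for a single trainer whose rows are already sorted by date. This is a simplified sketch with made-up vectors, not the answer's exact code: it looks up the running totals as they stood just before each row's date began.

```r
# Hypothetical sorted history for one trainer (1 = win, 0 = no win).
date <- as.Date(c("2020-01-01", "2020-01-01", "2020-01-02",
                  "2020-01-03", "2020-01-03"))
win  <- c(1, 0, 0, 1, 1)

# match() returns the first row of each date, so subtracting 1 gives the
# number of runs on strictly earlier dates (this relies on sorted dates).
n_prior <- match(date, date) - 1L

# Running win total, shifted by one position so that index n_prior + 1
# holds the wins accumulated over the first n_prior rows.
prior_wins <- c(0, cumsum(win))[n_prior + 1L]

rate <- prior_wins / n_prior
print(rate)
```

Unlike the data.table code above, this sketch yields NaN (0/0) rather than NA on the first date, and it would have to be applied per trainer, for example through by=trainer.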
3) A variant of the second approach that moves by=trainer into j :)
dt[order(trainer, date), strike_rate := {
  ri <- rleid(date)
  firstd <- which(diff(ri) != 0) + 1L
  cs <- cumsum(finish==1L)
  cumfinishes <- replace(rep(NA_real_, .N), firstd, cs[firstd - 1L])
  k <- replace(rep(NA_real_, .N), firstd, rowid(trainer)[firstd - 1L])
  newt <- which(trainer != shift(trainer))
  prevTrainer <- replace(rep(NA_real_, .N), newt, cs[newt - 1L])
  finishes <- cumfinishes - nafill(replace(prevTrainer, 1L, 0), "locf")
  finishes <- replace(finishes, newt, NaN)
  nafill(finishes, "locf") / nafill(k, "locf")
}]
4) Using Rcpp, which should be the fastest and also more readable:
library(Rcpp)
cppFunction("
NumericVector strike(IntegerVector date, IntegerVector finish, IntegerVector trainer) {
    int i, sz = date.size();
    double cumstrikes = 0, prevcs = NA_REAL, days = 1, prevdays = 1;
    NumericVector strikes(sz), ndays(sz);
    for (i = 0; i < sz; i++) {
        strikes[i] = NA_REAL;
    }
    if (finish[0] == 1)
        cumstrikes = 1;
    for (i = 1; i < sz; i++) {
        if (trainer[i-1] != trainer[i]) {
            cumstrikes = 0;
            days = 0;
        } else if (date[i-1] != date[i]) {
            strikes[i] = cumstrikes;
            ndays[i] = days;
        } else {
            strikes[i] = strikes[i-1];
            ndays[i] = ndays[i-1];
        }
        if (finish[i] == 1) {
            cumstrikes++;
        }
        days++;
    }
    for (i = 0; i < sz; i++) {
        strikes[i] /= ndays[i];
    }
    return strikes;
}")
dt[order(trainer, date), strike_rate := strike(date, finish, rleid(trainer))]
Regarding "r - Calculating a strike rate cumulatively by date with R data.table", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61485218/