
r - How to get slices with different row counts when subsetting by group in an R data.table

Reposted. Author: 行者123. Updated: 2023-12-03 16:58:34

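I have a data.table containing observations of temperature and other weather information over 15 days for multiple sites. This dput is for all observations at two sites.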

library(data.table)

DT <- structure(list(site = c("100", "100", "100", "100", "100", "100",
"100", "100", "100", "100", "100", "100", "100", "100", "100"
), precursor_date = structure(c(15203, 15202, 15201, 15200, 15199,
15198, 15197, 15196, 15195, 15194, 15193, 15192, 15191, 15190,
15189), class = "Date"), lat = c(46.864, 46.864, 46.864, 46.864,
46.864, 46.864, 46.864, 46.864, 46.864, 46.864, 46.864, 46.864,
46.864, 46.864, 46.864), lon = c(-67.998, -67.998, -67.998, -67.998,
-67.998, -67.998, -67.998, -67.998, -67.998, -67.998, -67.998,
-67.998, -67.998, -67.998, -67.998), origDate = structure(c(15204,
15204, 15204, 15204, 15204, 15204, 15204, 15204, 15204, 15204,
15204, 15204, 15204, 15204, 15204), class = "Date"), last = c(2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011), begin = c(2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011), precursor_day_labl = structure(1:15, .Label = c("obd_p1",
"obd_p2", "obd_p3", "obd_p4", "obd_p5", "obd_p6", "obd_p7", "obd_p8",
"obd_p9", "obd_p10", "obd_p11", "obd_p12", "obd_p13", "obd_p14",
"obd_p15"), class = "factor"), year = c(2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011
), yday = c(229, 228, 227, 226, 225, 224, 223, 222, 221, 220,
219, 218, 217, 216, 215), dayl = c(50112, 50457.6015625, 50457.6015625,
50803.19921875, 50803.19921875, 51148.80078125, 51148.80078125,
51494.3984375, 51494.3984375, 51840, 51840, 52185.6015625, 52185.6015625,
52531.19921875, 52531.19921875), prcp = c(0, 17, 5, 4, 6, 6,
13, 8, 0, 16, 14, 6, 0, 0, 7), srad = c(403.200012207031, 176,
249.600006103516, 288, 297.600006103516, 268.799987792969, 179.199996948242,
192, 406.399993896484, 208, 227.199996948242, 307.200012207031,
371.200012207031, 304, 182.399993896484), swe = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), tmax = c(22.5, 20.5, 24.5,
26.5, 25, 22.5, 20.5, 21, 24, 23.5, 25, 28, 24, 23.5, 22), tmin = c(10.5,
14.5, 15, 14.5, 12.5, 12, 14, 14, 11, 15.5, 16, 14, 12.5, 14.5,
15), vp = c(1280, 1640, 1720, 1640, 1440, 1400, 1600, 1600, 1320,
1760, 1800, 1600, 1440, 1640, 1720), satv = c(19.99250234375,
17.77504867875, 22.44414580875, 25.14497571375, 23.09540625,
19.99250234375, 17.77504867875, 18.30827693, 21.80845952, 21.18811306125,
23.09540625, 27.34314816, 21.80845952, 21.18811306125, 19.41676944
), r_hum = c(64.024001497749, 92.2641636397099, 76.6346830330004,
65.2217770527891, 62.3500614976193, 70.026251638163, 90.0138181850829,
87.3921672758967, 60.526971141151, 83.0654431053979, 77.9375768720241,
58.5155736507555, 66.0294230630738, 77.4018901663935, 88.5832221119478
)), class = c("data.table", "data.frame"), row.names = c(NA,
-15L))
# dput() also printed `.internal.selfref = <pointer: 0x000001b632fd1ef0>`; it is omitted here because a pointer cannot be parsed back into R.
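I want to get averages of the weather data (prcp, tmax, tmin, r_hum) for each #-day interval going backwards from the start day of the 15 days, which I call origDate in the DT. The dates that go into each site's 15-day window are under precursor_date. There is just one 2-day average, one 3-day average, one 4-day average, and so on, and each is based on the respective #-day window immediately preceding origDate. For example, if the start date is 2011-08-18, I want the average of the 2 days before 08-18 (08-17 and 08-16), of the 3 days before (08-17, 08-16, 08-15), and so on up to the largest window, 15 days (08-17 to 08-03). I do not need every small interval average that is possible within the 15-day window, just the one immediately preceding origDate.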
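As a small illustration of the windows described above, the dates covered by a k-day window can be listed with base R date arithmetic. This is only a sketch: `window_dates` and `orig_date` are names made up for the example and do not appear in the original data.

```r
# Hypothetical helper: the k dates immediately preceding orig_date, newest first.
window_dates <- function(orig_date, k) {
  orig_date - seq_len(k)
}

orig_date <- as.Date("2011-08-18")

window_dates(orig_date, 2)          # the 2-day window
window_dates(orig_date, 3)          # the 3-day window
range(window_dates(orig_date, 15))  # first and last day of the 15-day window
```

Every window starts at the day before origDate, so a larger window always contains the smaller ones.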
To get an idea of the subsetting I want, in dplyr I can use
df %>% group_by(site) %>% slice_head(n= x) 
# A tibble: 2,556 x 19
# Groups: site [1,278]
site precursor_date lat lon origDate last begin precursor_day_l~ year yday dayl prcp srad swe tmax
<chr> <date> <dbl> <dbl> <date> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 100 2011-08-17 46.9 -68.0 2011-08-18 2011 2011 obd_p1 2011 229 50112 0 403. 0 22.5
2 100 2011-08-16 46.9 -68.0 2011-08-18 2011 2011 obd_p2 2011 228 50458. 17 176 0 20.5
3 101 2011-08-11 44.3 -72.7 2011-08-12 2011 2011 obd_p1 2011 223 50458. 8 272 0 26
4 101 2011-08-10 44.3 -72.7 2011-08-12 2011 2011 obd_p2 2011 222 50803. 25 253. 0 26.5
5 102 2011-08-21 46.5 -68.0 2011-08-22 2011 2011 obd_p1 2011 233 49421. 0 378. 0 27
6 102 2011-08-20 46.5 -68.0 2011-08-22 2011 2011 obd_p2 2011 232 49421. 1 397. 0 28
where x is the number of days I want to subset from each group before taking the mean.
But if I use
df %>% group_by(site) %>% slice_head(n= x) %>% mean(prcp)
I get an error, and I don't know why. The message is
Warning message:
In mean.default(., "prcp") :
argument is not numeric or logical: returning NA
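Although I don't know why that error occurs, I would prefer to have the subsetting work within data.table anyway. The columns I want subset means for are prcp, tmax, tmin and r_hum. I will end up creating 60 new columns, 15 for each weather variable, and many of those columns will contain NA or something similar, because the DT has one daily observation per row. To give an idea of what the output might look like, here is a mock-up. It does not have to look exactly like this, as long as the means for each weather variable and time window in the DT are aligned with the appropriate site.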
site precursor_date    lat     lon   origDate ... prcp2dmean prcp3dmean prcp4dmean ... tmax2dmean tmax3dmean ...
100 2011-08-17 46.864 -67.998 2011-08-18 ... 1.2 1.4 1.4 ... 25 24 ...
100 2011-08-16 46.864 -67.998 2011-08-18 ... 1.2 1.4 1.4 ... 25 24 ...
100 2011-08-15 46.864 -67.998 2011-08-18 ... NA 1.4 1.4 ... NA 24 ...
100 2011-08-14 46.864 -67.998 2011-08-18 ... NA NA 1.4 ... NA NA ...
100 2011-08-13 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
100 2011-08-12 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
100 2011-08-11 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
100 2011-08-10 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
100 2011-08-09 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
100 2011-08-08 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
100 2011-08-07 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
100 2011-08-06 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
100 2011-08-05 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
100 2011-08-04 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
100 2011-08-03 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
10 2011-08-17 46.864 -67.998 2011-08-18 ... 1.2 1.4 1.4 ... 25 24 ...
10 2011-08-16 46.864 -67.998 2011-08-18 ... 1.2 1.4 1.4 ... 25 24 ...
10 2011-08-15 46.864 -67.998 2011-08-18 ... NA 1.4 1.4 ... NA 24 ...
10 2011-08-14 46.864 -67.998 2011-08-18 ... NA NA 1.4 ... NA NA ...
10 2011-08-13 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
10 2011-08-12 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
10 2011-08-11 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
10 2011-08-10 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
10 2011-08-09 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
10 2011-08-08 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
10 2011-08-07 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
10 2011-08-06 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
10 2011-08-05 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
10 2011-08-04 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
10 2011-08-03 46.864 -67.998 2011-08-18 ... NA NA NA ... NA NA ...
In my DT I tried
pi_df5[, pi_df5 %>% slice_head(n=2) %>% mean(prcp), by = site]
but this doesn't work.
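One likely reason both attempts fail: the pipe passes the whole sliced data frame to `mean()` as its first argument, and `mean()` only accepts numeric or logical vectors, so it warns and returns NA. In dplyr the mean would normally be computed inside `summarise()` instead. The base R sketch below reproduces the failure and shows one way to get a per-site mean of the first `x` rows. The data frame `toy` and the value of `x` are assumptions for illustration, and the third value for site 101 is made up.

```r
# Toy data: the first prcp values per site, newest day first.
toy <- data.frame(
  site = c("100", "100", "100", "101", "101", "101"),
  prcp = c(0, 17, 5, 8, 25, 1),
  stringsAsFactors = FALSE
)
x <- 2L  # number of rows (days) to keep per site

# mean() is not defined for a data frame: it warns and returns NA.
bad <- suppressWarnings(mean(toy))

# Split the column by site, keep the first x values per site, then average.
good <- sapply(split(toy$prcp, toy$site), function(v) mean(head(v, x)))
good
```

The same idea, averaging a column's first `x` values within each group, is what any grouped solution has to do; only the grouping mechanism differs.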

Best answer

Here is another option:

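cols <- c("prcp", "tmax", "tmin", "r_hum")   # weather variables to average
winsz <- 2L:15L                              # window sizes in days
# New columns are named <variable>_<window>: prcp_2, tmax_2, ..., r_hum_15.
DT[, as.vector(outer(cols, winsz, FUN=paste, sep="_")) := {
    res <- rep(NA, .N)   # all-NA template, one entry per row of the current site
    # For every (column, window) pair, write the mean of the first j rows into rows 1..j.
    ans <- outer(.SD, winsz, function(x, k) {
        Map(function(v, j) replace(res, 1L:j, sum(v[1L:j]) / j), x, k)
    })
}, site, .SDcols=cols]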
Output of head(DT):
   site precursor_date    lat     lon   origDate last begin precursor_day_labl year yday    dayl prcp  srad swe tmax tmin   vp     satv    r_hum prcp_2 tmax_2 tmin_2  r_hum_2   prcp_3 tmax_3   tmin_3  r_hum_3 prcp_4 tmax_4 tmin_4  r_hum_4 prcp_5
1: 100 2011-08-17 46.864 -67.998 2011-08-18 2011 2011 obd_p1 2011 229 50112.0 0 403.2 0 22.5 10.5 1280 19.99250 64.02400 8.5 21.5 12.5 78.14408 7.333333 22.5 13.33333 77.64095 6.5 23.5 13.625 74.53616 6.4
2: 100 2011-08-16 46.864 -67.998 2011-08-18 2011 2011 obd_p2 2011 228 50457.6 17 176.0 0 20.5 14.5 1640 17.77505 92.26416 8.5 21.5 12.5 78.14408 7.333333 22.5 13.33333 77.64095 6.5 23.5 13.625 74.53616 6.4
3: 100 2011-08-15 46.864 -67.998 2011-08-18 2011 2011 obd_p3 2011 227 50457.6 5 249.6 0 24.5 15.0 1720 22.44415 76.63468 NA NA NA NA 7.333333 22.5 13.33333 77.64095 6.5 23.5 13.625 74.53616 6.4
4: 100 2011-08-14 46.864 -67.998 2011-08-18 2011 2011 obd_p4 2011 226 50803.2 4 288.0 0 26.5 14.5 1640 25.14498 65.22178 NA NA NA NA NA NA NA NA 6.5 23.5 13.625 74.53616 6.4
5: 100 2011-08-13 46.864 -67.998 2011-08-18 2011 2011 obd_p5 2011 225 50803.2 6 297.6 0 25.0 12.5 1440 23.09541 62.35006 NA NA NA NA NA NA NA NA NA NA NA NA 6.4
6: 100 2011-08-12 46.864 -67.998 2011-08-18 2011 2011 obd_p6 2011 224 51148.8 6 268.8 0 22.5 12.0 1400 19.99250 70.02625 NA NA NA NA NA NA NA NA NA NA NA NA NA
tmax_5 tmin_5 r_hum_5 prcp_6 tmax_6 tmin_6 r_hum_6 prcp_7 tmax_7 tmin_7 r_hum_7 prcp_8 tmax_8 tmin_8 r_hum_8 prcp_9 tmax_9 tmin_9 r_hum_9 prcp_10 tmax_10 tmin_10 r_hum_10 prcp_11 tmax_11 tmin_11 r_hum_11 prcp_12 tmax_12
1: 23.8 13.4 72.09894 6.333333 23.58333 13.16667 71.75349 7.285714 23.14286 13.28571 74.36211 7.375 22.875 13.375 75.99087 6.555556 23 13.11111 74.27265 7.5 23.05 13.35 75.15193 8.090909 23.22727 13.59091 75.40517 7.916667 23.625
2: 23.8 13.4 72.09894 6.333333 23.58333 13.16667 71.75349 7.285714 23.14286 13.28571 74.36211 7.375 22.875 13.375 75.99087 6.555556 23 13.11111 74.27265 7.5 23.05 13.35 75.15193 8.090909 23.22727 13.59091 75.40517 7.916667 23.625
3: 23.8 13.4 72.09894 6.333333 23.58333 13.16667 71.75349 7.285714 23.14286 13.28571 74.36211 7.375 22.875 13.375 75.99087 6.555556 23 13.11111 74.27265 7.5 23.05 13.35 75.15193 8.090909 23.22727 13.59091 75.40517 7.916667 23.625
4: 23.8 13.4 72.09894 6.333333 23.58333 13.16667 71.75349 7.285714 23.14286 13.28571 74.36211 7.375 22.875 13.375 75.99087 6.555556 23 13.11111 74.27265 7.5 23.05 13.35 75.15193 8.090909 23.22727 13.59091 75.40517 7.916667 23.625
5: 23.8 13.4 72.09894 6.333333 23.58333 13.16667 71.75349 7.285714 23.14286 13.28571 74.36211 7.375 22.875 13.375 75.99087 6.555556 23 13.11111 74.27265 7.5 23.05 13.35 75.15193 8.090909 23.22727 13.59091 75.40517 7.916667 23.625
6: NA NA NA 6.333333 23.58333 13.16667 71.75349 7.285714 23.14286 13.28571 74.36211 7.375 22.875 13.375 75.99087 6.555556 23 13.11111 74.27265 7.5 23.05 13.35 75.15193 8.090909 23.22727 13.59091 75.40517 7.916667 23.625
tmin_12 r_hum_12 prcp_13 tmax_13 tmin_13 r_hum_13 prcp_14 tmax_14 tmin_14 r_hum_14 prcp_15 tmax_15 tmin_15 r_hum_15
1: 13.625 73.99771 7.307692 23.65385 13.53846 73.38476 6.785714 23.64286 13.60714 73.6717 6.8 23.53333 13.7 74.6658
2: 13.625 73.99771 7.307692 23.65385 13.53846 73.38476 6.785714 23.64286 13.60714 73.6717 6.8 23.53333 13.7 74.6658
3: 13.625 73.99771 7.307692 23.65385 13.53846 73.38476 6.785714 23.64286 13.60714 73.6717 6.8 23.53333 13.7 74.6658
4: 13.625 73.99771 7.307692 23.65385 13.53846 73.38476 6.785714 23.64286 13.60714 73.6717 6.8 23.53333 13.7 74.6658
5: 13.625 73.99771 7.307692 23.65385 13.53846 73.38476 6.785714 23.64286 13.60714 73.6717 6.8 23.53333 13.7 74.6658
6: 13.625 73.99771 7.307692 23.65385 13.53846 73.38476 6.785714 23.64286 13.60714 73.6717 6.8 23.53333 13.7 74.6658
You may get a warning:

Warning message:
In [.data.table(DT, , :=(as.vector(outer(cols, winsz, FUN = paste, :
  Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.


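But I think it is safe to ignore here: the example DT was recreated from dput() output with structure(), which is exactly the situation the message describes.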
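Because every window starts at the first row of a site's block (the most recent day), the mean over the first k rows can also be read off a running sum. The sketch below is an alternative formulation in base R, not the answer's own method. `first_k_means` is a hypothetical name, and `prcp` holds the fifteen site-100 values from the question's dput.

```r
# Mean of the first k values of v, for each k in ks, via a cumulative sum.
first_k_means <- function(v, ks) {
  stopifnot(all(ks >= 1L), all(ks <= length(v)))
  out <- cumsum(v)[ks] / ks
  names(out) <- paste0("k", ks)
  out
}

# prcp for site 100, newest day first (taken from the question's data).
prcp <- c(0, 17, 5, 4, 6, 6, 13, 8, 0, 16, 14, 6, 0, 0, 7)

res <- first_k_means(prcp, 2:15)
res[c("k2", "k4", "k15")]
```

For this column the 2-, 4- and 15-day results are 8.5, 6.5 and 6.8, which agree with prcp_2, prcp_4 and prcp_15 in the head(DT) output. Note that `cumsum()` propagates NA, so a missing observation would make every window that includes it NA.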

Regarding "r - How to get slices with different row counts when subsetting by group in an R data.table", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64740630/
