
r - Efficiently applying functions to a long data frame in R

Reposted. Author: 行者123. Updated: 2023-12-04 12:34:58

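I have a long data frame that contains meteorological data from a mast. It holds observations (data$value) of different parameters (wind speed, wind direction, air temperature, etc., in data$param) taken at the same time at different heights (data$z).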

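I am trying to slice this data efficiently by $time and then apply functions to all of the data collected for each time step. Functions are usually applied to a single $param at a time (i.e. I apply different functions to wind speed than to air temperature).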

Current approach

My current approach is to use a data.frame and ddply.

If I want to get all of the wind speed data, I run this:

# find good data ----
df <- data[((data$param == "wind speed") &
              !is.na(data$value)), ]

Then I run my function on df using ddply():

library(plyr)  # provides ddply() and .()
df.tav <- ddply(df,
                .(time),
                function(x) {
                  # one output row per time step
                  y <- data.frame(V1 = sum(x$value) + sum(x$z),
                                  V2 = sum(x$value) / sum(x$z))
                  return(y)
                })

Usually V1 and V2 are calls to other functions; these are just examples. I do need to run multiple functions on the same data, though.



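My current approach is very slow. I have not benchmarked it, but it is slow enough that I can go get a coffee and come back before a year's worth of data has been processed.

I have on the order of hundreds of towers to process, each with a year of data and 10-12 heights, so I am looking for something faster.

Data sample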
data <-  structure(list(time = structure(c(1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600,
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600,
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600,
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600,
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600,
1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 1262305200,
1262305200), class = c("POSIXct", "POSIXt"), tzone = ""), z = c(0,
0, 0, 100, 100, 100, 120, 120, 120, 140, 140, 140, 160, 160,
160, 180, 180, 180, 200, 200, 200, 40, 40, 40, 50, 50, 50, 60,
60, 60, 80, 80, 80, 0, 0, 0, 100, 100, 100, 120), param = c("temperature",
"humidity", "barometric pressure", "wind direction", "turbulence",
"wind speed", "wind direction", "turbulence", "wind speed", "wind direction",
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed",
"wind direction", "turbulence", "wind speed", "wind direction",
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed",
"wind direction", "turbulence", "wind speed", "wind direction",
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed",
"temperature", "barometric pressure", "humidity", "wind direction",
"wind speed", "turbulence", "wind direction"), value = c(-2.5,
41, 816.9, 248.4, 0.11, 4.63, 249.8, 0.28, 4.37, 255.5, 0.32,
4.35, 252.4, 0.77, 5.08, 248.4, 0.65, 3.88, 313, 0.94, 6.35,
250.9, 0.1, 4.75, 253.3, 0.11, 4.68, 255.8, 0.1, 4.78, 254.9,
0.11, 4.7, -3.3, 816.9, 42, 253.2, 2.18, 0.27, 229.5)), .Names = c("time",
"z", "param", "value"), row.names = c(NA, 40L), class = "data.frame")

Best answer

Use data.table:

library(data.table)
dt = data.table(data)

setkey(dt, param) # sort by param to look it up fast

dt[J('wind speed')][!is.na(value),
                    list(sum(value) + sum(z), sum(value)/sum(z)),
                    by = time]
#                  time      V1         V2
#1: 2009-12-31 18:10:00 1177.57 0.04209735
#2: 2009-12-31 18:20:00  102.18 0.02180000
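
Since V1 and V2 usually stand in for calls to other functions, the same pattern can be written with named helpers and named output columns. The following is only a minimal sketch and not part of the original answer: `mean_value` and `value_per_height` are hypothetical helpers, and `toy` is made-up data in the same long layout.

```r
library(data.table)

# Made-up toy data in the same long layout as the question (illustration only).
toy <- data.table(
  time  = rep(c(1, 2), each = 4),
  z     = rep(c(40, 80), times = 4),
  param = rep(c("wind speed", "wind speed", "temperature", "temperature"),
              times = 2),
  value = c(4, 6, -2, -3, 5, NA, -1, -2)
)

# Hypothetical helpers standing in for the real per-time-step functions.
mean_value       <- function(value) mean(value)
value_per_height <- function(value, z) sum(value) / sum(z)

setkey(toy, param)  # key on param so the lookup below is a binary search

# Select one param, drop missing values, then compute several
# named summaries per time step in a single pass.
res <- toy[.("wind speed")][!is.na(value),
                            .(avg   = mean_value(value),
                              ratio = value_per_height(value, z)),
                            by = time]
print(res)
```

Each extra function becomes one more named entry in the `.()` list, so every summary is computed in the same grouped pass over the data.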

If you want to apply a different function to each param, here is a more uniform approach.
# make dt smaller because I'm lazy
dt = dt[param %in% c('wind direction', 'wind speed')]

# now let's start - create another data.table
# that will have param and corresponding function
fns = data.table(p = c('wind direction', 'wind speed'),
                 fn = c(quote(sum(value) + sum(z)), quote(sum(value) / sum(z))),
                 key = 'p')
fns
#                p     fn
#1: wind direction <call>    # the fn column contains quoted calls
#2:     wind speed <call>    # i.e. this is getting fancy!

# now we can evaluate different functions for different params,
# sliced by param and time
dt[!is.na(value), {param; eval(fns[J(param)]$fn[[1]], .SD)},
   by = list(param, time)]
#            param                time           V1
#1: wind direction 2009-12-31 18:10:00 3.712400e+03
#2: wind direction 2009-12-31 18:20:00 7.027000e+02
#3:     wind speed 2009-12-31 18:10:00 4.209735e-02
#4:     wind speed 2009-12-31 18:20:00 2.180000e-02

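P.S. I think the fact that I have to reference param in some way before eval for the eval to work is a bug.

UPDATE: As of version 1.8.11 this bug has been fixed and the following works: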
dt[!is.na(value), eval(fns[J(param)]$fn[[1]], .SD), by = list(param, time)]
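
An alternative to storing quoted calls and evaluating them with eval is to keep ordinary functions in a named list and pick one per group. This is a sketch under assumptions and not from the original answer: `fn_by_param` and the `toy` data are hypothetical, and the group's param is read from data.table's `.BY`.

```r
library(data.table)

# Made-up toy data in the same long layout as the question (illustration only).
toy <- data.table(
  time  = rep(c(1, 2), each = 4),
  z     = rep(c(40, 40, 80, 80), times = 2),
  param = rep(c("wind speed", "wind direction"), times = 4),
  value = c(4, 250, 6, 260, 5, 240, NA, 255)
)

# Hypothetical lookup: one ordinary function per param, keyed by name.
fn_by_param <- list(
  "wind direction" = function(value, z) sum(value) + sum(z),
  "wind speed"     = function(value, z) sum(value) / sum(z)
)

# .BY holds the current group's by-values, so .BY$param is a single string
# that selects the function to apply to this (param, time) slice.
res <- toy[!is.na(value),
           .(V1 = fn_by_param[[.BY$param]](value, z)),
           by = .(param, time)]
print(res)
```

Plain functions are easier to test on their own than quoted calls, at the cost of passing the needed columns explicitly as arguments.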

Regarding "r - Efficiently applying functions to a long data frame in R", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/19054723/
