
r - Efficient indexing of date ranges for very large datasets


I have two very large, related data frames containing output data from a depth logger.

Edit: data for the MWE is easier to download:

library(RCurl)
dives_log <- read.csv(text = getURL("https://raw.githubusercontent.com/sebpardo/dive-data-mwe/master/dives_log.csv"),
                      stringsAsFactors = FALSE)
dives_summary <- read.csv(text = getURL("https://raw.githubusercontent.com/sebpardo/dive-data-mwe/master/dives_summary.csv"),
                          stringsAsFactors = FALSE)

dives_log$Date <- as.POSIXct(dives_log$Date, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
dives_summary$Start <- as.POSIXct(dives_summary$Start, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
dives_summary$Bottom.Start <- as.POSIXct(dives_summary$Bottom.Start, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
dives_summary$Ascent.Start <- as.POSIXct(dives_summary$Ascent.Start, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
dives_summary$Ascent.End <- as.POSIXct(dives_summary$Ascent.End, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")

The first (dives_log) contains logger data recorded every 10 seconds:

> head(dives_log)
Date Depth
1 2010-09-11 23:03:20 6.5
2 2010-09-11 23:03:30 6.5
3 2010-09-11 23:03:40 7.0
4 2010-09-11 23:03:50 7.0
5 2010-09-11 23:04:00 7.0
6 2010-09-11 23:04:10 0.0
> nrow(dives_log)
[1] 816036

while the second (dives_summary) holds a dive-profile summary, recording the start, bottom, and ascent times for each dive cycle.

> head(dives_summary)
Dive Start Bottom.Start Ascent.Start Ascent.End
1 1 2010-09-11 22:59:20 2010-09-11 23:03:20 2010-09-11 23:04:00 2010-09-11 23:04:10
2 2 2010-09-11 23:04:40 2010-09-11 23:04:50 2010-09-11 23:07:20 2010-09-11 23:08:30
3 3 2010-09-11 23:09:00 2010-09-11 23:13:00 2010-09-11 23:17:30 2010-09-11 23:18:00
4 4 2010-09-11 23:18:40 2010-09-11 23:19:00 2010-09-11 23:26:50 2010-09-11 23:27:20
5 5 2010-09-11 23:28:10 2010-09-11 23:28:50 2010-09-11 23:35:40 2010-09-11 23:36:20
6 6 2010-09-11 23:37:10 2010-09-11 23:37:30 2010-09-11 23:44:40 2010-09-11 23:45:30
> nrow(dives_summary)
[1] 12697

What I want to do is assign the dive number (dives_summary$Dive) and dive phase ("bottom" or "ascent") to each individual data point (i.e., row) of dives_log. So far, my approach has been to index the rows of dives_log that fall within the time range given in each row of dives_summary:

for (i in 1:nrow(dives_summary)) {
  pos.bottom <- which(dives_log$Date >= dives_summary$Bottom.Start[i] &
                      dives_log$Date <= dives_summary$Ascent.Start[i])
  pos.ascent <- which(dives_log$Date > dives_summary$Ascent.Start[i] &
                      dives_log$Date <= dives_summary$Ascent.End[i])

  dives_log[pos.bottom, "Phase"]  <- "bottom"
  dives_log[pos.ascent, "Phase"]  <- "ascent"
  dives_log[pos.bottom, "Number"] <- dives_summary[i, "Dive"]
  dives_log[pos.ascent, "Number"] <- dives_summary[i, "Dive"]
}

This does work, but it is extremely slow, since for each of the 12,697 rows of dives_summary the for loop must evaluate several logical comparisons against vectors of over 800k rows:

> head(dives_log)
Date Depth Phase Number
1 2010-09-11 23:03:20 6.5 bottom 1
2 2010-09-11 23:03:30 6.5 bottom 1
3 2010-09-11 23:03:40 7.0 bottom 1
4 2010-09-11 23:03:50 7.0 bottom 1
5 2010-09-11 23:04:00 7.0 bottom 1
6 2010-09-11 23:04:10 0.0 ascent 1

Is there any way to do this faster? I imagine this is achievable with data.table, but I can't wrap my head around how to vectorize the logical statements across two separate data frames. Any guidance would be greatly appreciated!

Best Answer

I'm not quite sure what @Frank means by melt. Perhaps he can illustrate it better.

Here is a version using non-equi joins.

# get to the bottom
dives_log[dives_summary, `:=`(
  Phase = "bottom",
  Number = Dive
), on = .(Date >= Bottom_Start, Date <= Ascent_Start)]

# rise to the top
dives_log[dives_summary, `:=`(
  Phase = "ascent",
  Number = Dive
), on = .(Date > Ascent_Start, Date <= Ascent_End)]
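
Note that these joins assume both objects are data.tables and that the summary columns use the underscore names from the fread data below. A minimal sketch of preparing the read.csv versions from the question (whose column names use dots) so the joins above run on them:

```r
library(data.table)

# Convert the question's data frames to data.tables in place
setDT(dives_log)
setDT(dives_summary)

# Rename the dot-named columns to the underscore names
# referenced in the joins above
setnames(dives_summary,
         c("Bottom.Start", "Ascent.Start", "Ascent.End"),
         c("Bottom_Start", "Ascent_Start", "Ascent_End"))
```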

Data:

dives_log <- fread('Date,Depth
"2010-09-11 23:03:20",6.5
"2010-09-11 23:03:30",6.5
"2010-09-11 23:03:40",7.0
"2010-09-11 23:03:50",7.0
"2010-09-11 23:04:00",7.0
"2010-09-11 23:04:10",0.0')[,
Date := as.POSIXct(Date)]

cols <- c("Start","Bottom_Start","Ascent_Start","Ascent_End")
dives_summary <- fread('Dive,Start,Bottom_Start,Ascent_Start,Ascent_End
1,"2010-09-11 22:59:20","2010-09-11 23:03:20","2010-09-11 23:04:00","2010-09-11 23:04:10"
2,"2010-09-11 23:04:40","2010-09-11 23:04:50","2010-09-11 23:07:20","2010-09-11 23:08:30"
3,"2010-09-11 23:09:00","2010-09-11 23:13:00","2010-09-11 23:17:30","2010-09-11 23:18:00"
4,"2010-09-11 23:18:40","2010-09-11 23:19:00","2010-09-11 23:26:50","2010-09-11 23:27:20"
5,"2010-09-11 23:28:10","2010-09-11 23:28:50","2010-09-11 23:35:40","2010-09-11 23:36:20"
6,"2010-09-11 23:37:10","2010-09-11 23:37:30","2010-09-11 23:44:40","2010-09-11 23:45:30"')[,
(cols) := lapply(.SD, as.POSIXct), .SDcols=cols]
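
On this toy data, the joins should reproduce the loop-based output from the question: the first five rows fall in dive 1's bottom window and the 23:04:10 row in its ascent window. A quick sanity check, assuming the joins above have been run:

```r
# Counts per dive number and phase; on the toy data this should show
# 5 "bottom" rows and 1 "ascent" row, all with Number 1
dives_log[, .N, by = .(Number, Phase)]
```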

Regarding r - efficient indexing of date ranges for very large datasets, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/48797012/
