
r - fwrite.data.table and `yyyy-mm-dd hh:mm:ss` format optimization with a fixed UTC offset

Reposted. Author: 行者123. Updated: 2023-12-05 07:27:48

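I want to use R data.table's fwrite to write timestamps in YYYY-MM-DD hh:nn:ss format (in the Etc/GMT+8 time zone, which does not observe DST) rather than the default (ISO 8601) YYYY-MM-DDThh:nn:ssZ format. Some of the timestamps have fractional seconds, which I want to round to the nearest second.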

Using lubridate, I have been able to read the dates with fread, then apply x:=with_tz(x, "Etc/GMT+8"), followed by x:=force_tz(x, "GMT").

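However, on my test dataset (6.5 million rows across 12 columns), most of my solutions are slow, and I am looking for a better approach. I do not want to use fwrite(..., dateTimeAs="write.csv"), because that ignores the fixed UTC offset in favor of local time.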
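A minimal base-R sketch of the target output, using two made-up timestamps (the values and the names `x`, `rounded`, and `out` are illustrative only). It rounds by adding 0.5 s and flooring, then formats in `Etc/GMT+8`, which under the POSIX sign convention means UTC-08:00 all year. It shows the desired result, not a fast way to get it.

```r
# Hypothetical sample timestamps in UTC, both with fractional seconds
x <- as.POSIXct(c("2018-12-17 20:00:00.6", "2018-07-01 12:34:56.2"), tz = "UTC")

# Round to the nearest second: add 0.5 s, then drop the fraction
rounded <- as.POSIXct(floor(as.numeric(x) + 0.5), origin = "1970-01-01", tz = "UTC")

# Render in the fixed UTC-08:00 zone (no DST shift, even for the July value)
out <- format(rounded, format = "%Y-%m-%d %H:%M:%S", tz = "Etc/GMT+8")
print(out)
```

The July value is shifted by the same 8 hours as the December one, which is the behavior that a DST-observing local zone would not give.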

(Various solutions moved to my "answer" below.)

Can you think of any other optimizations?

Best answer

Best solution so far: base-R + data.table + fasttime

#!/usr/bin/env Rscript
# above this point: set d_f and o_f to valid file paths

totTime<-proc.time()
install.load <- function(package.name)
{
if (!require(package.name, character.only=T)) install.packages(package.name)
library(package.name, character.only=T)
}
pp<-function(...) {
print(paste0(...))
}
ISO2Human<-function(x) {
ot<-substr(x,1,19) # ignore fractional seconds and "Z"
substr(ot,11,12)<-" "
if(anyNA(ot)) ot<-substr(x,1,10)
return(ot)
}

install.load('data.table')
install.load('fasttime')
pp("parameters read and libraries loaded: ",timetaken(totTime))

main <- function() {
dat<-fread(d_f,fill=TRUE)
# notably dat has a "d_utc" column in YYYY-MM-DD hh:nn:ss format
pp("data file Read: ",timetaken(totTime)) # 5.200sec

# A fair amount of code is inserted here. Highlights include
# 1. As computations appear to be faster in double/numeric form
# than POSIXct (and starts as character), I adjust it as follows:
# dat[,d_utc:=setattr(fastPOSIXct(d_utc,tz="GMT"),"class","numeric")]
# 2. dat gets merged with another DT using foverlaps, producing fo (see https://stackoverflow.com/q/53858287/4228193)
# as we resume code, 8.690sec have elapsed

# As my target timezone is UTC-08:00 (POSIXct ETC/GMT+8), I subtract 28800 seconds.
# But to protect against a rounding error in the double type
# (and because I have some fractional second data that I want to round)
# I add 0.5 to this value.
fo[,d_pst:=setattr(d_utc-28799.5,"class",c("POSIXct","POSIXt"))][,d_utc:=NULL]
pp("timestamps adjusted to PST (UTC-08:00): ",timetaken(totTime)) # 16.8sec

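The next part is the specific piece of code I was trying to optimize in this question; in the process, though, I found that some of the type conversions used above also seem to be faster.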

  tf<-tempfile()
fwrite(fo,file=tf)
fo<-fread(tf)
# fread reads in as character, not timestamps
# POSIXct's as.character and format calls are much slower than fwrite + fread (!!!)
fo[,DetectDate:=ISO2Human(DetectDate)]
# this truncates seconds, effectively rounding due to the previous adjustment of 0.5s
unlink(tf) # delete file
pp("coerced to string: ",timetaken(totTime)) # 26.9sec

  fwrite(fo, file = o_f, quote = FALSE)
pp("output file written: ",timetaken(totTime)) # 27.1sec
# aren't SSDs awesome?
}
main()
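The core of the script above can be tried on a tiny table. This is a sketch under assumptions: `fo_demo`, `back`, and the sample epoch values are made up, and `colClasses` is passed to fread because newer data.table releases may parse ISO 8601 timestamps as POSIXct rather than leaving them as character.

```r
library(data.table)

# Hypothetical stand-in for `fo`: UTC epoch seconds held as plain doubles
fo_demo <- data.table(
  id    = 1:3,
  d_utc = c(1545076800.6, 1530448496.2, 1546300799.5)
)

# Shift by -28800 s (UTC-08:00) and add 0.5 s so that truncation rounds
fo_demo[, d_pst := as.POSIXct(d_utc - 28799.5, origin = "1970-01-01", tz = "UTC")]
fo_demo[, d_utc := NULL]

# Round trip through a temp file: fwrite emits ISO 8601 text in UTC
tf <- tempfile(fileext = ".csv")
fwrite(fo_demo, file = tf)
# Assumption: force character so the timestamp column comes back as text
back <- fread(tf, colClasses = c(d_pst = "character"))
unlink(tf)

# Keep "YYYY-MM-DDThh:mm:ss" (drops fraction and "Z"), then swap "T" for a space
back[, d_pst := sub("T", " ", substr(d_pst, 1, 19), fixed = TRUE)]
print(back)
```

Because the values were shifted before writing, the "UTC" text that fwrite produces already reads as UTC-08:00 clock time, so only string trimming remains.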

Other solutions

Lubridate-based block (no temp file). The time at the top of each block is in mm:ss.

# 01:17
j<-copy(fo)
tt<-proc.time()
j[,c("dd","dt"):=IDateTime(d_pst, ms="nearest")]
# if adding 0.5 seconds, trunc rather than nearest
j[,d_pst:=paste(dd,dt)][,c("dd","dt"):=NULL]
timetaken(tt) # 1:17
j
j[,lapply(.SD,class)]
rm(j)

Converting base-R POSIXct to a string with as.character or format

# 01:02
j<-copy(fo)
tt<-proc.time()
j[,DD2:=format(DetectDate,"%Y-%m-%d %H:%M:%S")]
timetaken(tt) # 1:02
j
j[,lapply(.SD,class)]
rm(j)

Base-R implicit conversion to character, then pasting date and time together

# 12:36
j<-copy(fo)
tt<-proc.time()
j[,DD2:=paste(lapply(DetectDate,substr,1,10),lapply(DetectDate,substr,12,19))]
timetaken(tt) # 12:36
j
j[,lapply(.SD,class)]
rm(j)

Base-R, avoiding lapply (silly me)

# 02:29
j<-copy(fo)
tt<-proc.time()
j[,DD2:=paste(substr(DetectDate,1,10),substr(DetectDate,12,19))]
timetaken(tt) # 2:29
j
j[,lapply(.SD,class)] # just to confirm our target column is character
rm(j)
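The gap between the last two timings comes down to vectorization: `substr` already accepts a whole vector, so wrapping it in `lapply` only adds per-element call overhead. A small sketch on made-up strings (the vector `x` stands in for the character form of the timestamp column and is not from the original data):

```r
# Hypothetical ISO 8601 strings standing in for the DetectDate column
x <- c("2018-12-17T12:00:01Z", "2018-07-01T04:34:56.700000Z")

# Element by element: one substr() call per value via lapply
slow <- paste(unlist(lapply(x, substr, 1, 10)),
              unlist(lapply(x, substr, 12, 19)))

# Vectorized: substr() processes the whole vector in a single call
fast <- paste(substr(x, 1, 10), substr(x, 12, 19))

print(fast)
```

Both forms return the same strings; only the number of function calls differs.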

data.table + base-R, but using data.table's tstrsplit and paste rather than grabbing a character range

# 00:24
j<-copy(fo)
tt<-proc.time()
tf<-tempfile()
fwrite(j,file=tf)
fo2<-fread(tf)
fo2[,c("compDate","compTime","compMS"):=tstrsplit(DetectDate,"[TZ.]")][
,DD2:=paste(compDate,compTime)]
unlink(tf)
timetaken(tt) # 0:24
fo2
fo2[,lapply(.SD,class)]
rm(j,tf,fo2)

Essentially the best solution; reusing variable and field names, as the full script above does, brings it down to 10 seconds

# 00:14    
fap<-function(x) {
ot<-substr(x,1,19)
substr(ot,11,12)<-" "
if(anyNA(ot)) ot<-substr(x,1,10) # anyNA gives a single TRUE/FALSE; is.na(ot) is a vector and is not a valid if() condition
return(ot)
}
j<-copy(fo)
tt<-proc.time()
tf<-tempfile()
fwrite(j,file=tf)
fo2<-fread(tf)
fo2[,DD2:=fap(DetectDate)]
unlink(tf)
timetaken(tt) # 0:14
fo2
fo2[,lapply(.SD,class)]
rm(j,tf,fo2,fap)
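To see what the string helper does in isolation, here is a sketch of a per-element variant on made-up inputs. The name `iso_to_human` is hypothetical, and dropping the whole-vector `if` fallback is an assumption about the desired behavior (date-only and NA values simply pass through), not the original function.

```r
# Hypothetical helper, not part of the original script
iso_to_human <- function(x) {
  out <- substr(x, 1, 19)      # drop fractional seconds and the trailing "Z"
  substr(out, 11, 11) <- " "   # replace the "T" separator where one exists
  out
}

res <- iso_to_human(c("2018-12-17T12:00:01.100000Z",
                      "2018-12-31T16:00:00Z",
                      "2018-12-17",
                      NA))
print(res)
```

Strings shorter than 11 characters are left unchanged by the `substr<-` assignment, and NA stays NA, so no separate branch is needed for them.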

I am using an SSD, which probably speeds up the temp-file solutions considerably compared with a "standard" setup.

A similar question on this topic (r - fwrite.data.table and `yyyy-mm-dd hh:mm:ss` format optimization with a fixed UTC offset) can be found on Stack Overflow: https://stackoverflow.com/questions/53825194/
