gpt4 book ai didi

r - 根据值是否大于R中的阈值进行分段向量

转载 作者:行者123 更新时间:2023-12-04 05:17:30 25 4
gpt4 key购买 nike

我有一个长向量,我需要根据阈值将其分为多个部分。段是超过阈值的连续值。当值下降到阈值以下时,该段结束,下一个段开始,此时值再次超过阈值。我需要记录每个段的开始和结束索引。

以下是效率低下的实现。什么是最快,最合适的写方法?这非常丑陋,我必须假设有一个更干净的实现。

set.seed(10)
test.vec <- rnorm(100, 8, 10)
threshold <- 0
segments <- list()
in.segment <- FALSE
for(i in 1:length(test.vec)){

# If we're in a segment
if(in.segment){
if(test.vec[i] > threshold){
next
}else{
end.ind <- i - 1
in.segment <- FALSE
segments[[length(segments) + 1]] <- c(start.ind, end.ind)
}
}

# if not in segment
else{
if(test.vec[i] > threshold){
start.ind <- i
in.segment <- TRUE
}
}
}

编辑:所有解决方案的运行时

感谢您的所有答复,这很有帮助并且很有启发性。下面是对所有五个解决方案的小测试(提供的四个解决方案加上原始示例)。如您所见,这四个方案都是对原始解决方案的巨大改进,但是Khashaa的解决方案是迄今为止最快的。
set.seed(1)
test.vec <- rnorm(1e6, 8, 10);threshold <- 0

originalFunction <- function(x, threshold){
segments <- list()
in.segment <- FALSE
for(i in 1:length(test.vec)){

# If we're in a segment
if(in.segment){
if(test.vec[i] > threshold){
next
}else{
end.ind <- i - 1
in.segment <- FALSE
segments[[length(segments) + 1]] <- c(start.ind, end.ind)
}
}

# if not in segment
else{
if(test.vec[i] > threshold){
start.ind <- i
in.segment <- TRUE
}
}
}
segments
}

SimonG <- function(x, threshold){

hit <- which(x > threshold)
n <- length(hit)

ind <- which(hit[-1] - hit[-n] > 1)

starts <- c(hit[1], hit[ ind+1 ])
ends <- c(hit[ ind ], hit[n])

cbind(starts,ends)
}

Rcpp::cppFunction('DataFrame Khashaa(NumericVector x, double threshold) {
x.push_back(-1);
int n = x.size(), startind, endind;
std::vector<int> startinds, endinds;
bool insegment = false;
for(int i=0; i<n; i++){
if(!insegment){
if(x[i] > threshold){
startind = i + 1;
insegment = true; }
}else{
if(x[i] < threshold){
endind = i;
insegment = false;
startinds.push_back(startind);
endinds.push_back(endind);
}
}
}
return DataFrame::create(_["start"]= startinds, _["end"]= endinds);
}')

bgoldst <- function(x, threshold){
with(rle(x>threshold),
t(matrix(c(0L,rep(cumsum(lengths),2L)[-length(lengths)]),2L,byrow=T)+1:0)[values,])
}

ClausWilke <- function(x, threshold){
suppressMessages(require(dplyr, quietly = TRUE))
in.segment <- (x > threshold)
start <- which(c(FALSE, in.segment) == TRUE & lag(c(FALSE, in.segment) == FALSE)) - 1
end <- which(c(in.segment, FALSE) == TRUE & lead(c(in.segment, FALSE) == FALSE))
data.frame(start, end)
}

system.time({ originalFunction(test.vec, threshold); })
## user system elapsed
## 66.539 1.232 67.770
system.time({ SimonG(test.vec, threshold); })
## user system elapsed
## 0.028 0.008 0.036
system.time({ Khashaa(test.vec, threshold); })
## user system elapsed
## 0.008 0.000 0.008
system.time({ bgoldst(test.vec, threshold); })
## user system elapsed
## 0.065 0.000 0.065
system.time({ ClausWilke(test.vec, threshold); })
## user system elapsed
## 0.274 0.012 0.285

最佳答案

我喜欢将for loops转换为Rcpp很简单。

Rcpp::cppFunction('DataFrame findSegment(NumericVector x, double threshold) {
x.push_back(-1);
int n = x.size(), startind, endind;
std::vector<int> startinds, endinds;
bool insegment = false;
for(int i=0; i<n; i++){
if(!insegment){
if(x[i] > threshold){
startind = i + 1;
insegment = true; }
}else{
if(x[i] < threshold){
endind = i;
insegment = false;
startinds.push_back(startind);
endinds.push_back(endind);
}
}
}
return DataFrame::create(_["start"]= startinds, _["end"]= endinds);
}')
set.seed(1); test.vec <- rnorm(1e7,8,10); threshold <- 0;
system.time(findSegment(test.vec, threshold))

# user system elapsed
# 0.045 0.000 0.045

# @SimonG's solution
system.time(findSegments(test.vec, threshold))
# user system elapsed
# 0.533 0.012 0.548

关于r - 根据值是否大于R中的阈值进行分段向量,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/31235227/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com