
python - Filtering outliers - how to make the median-based Hampel function faster?

Reposted. Author: 太空狗. Updated: 2023-10-30 01:05:13
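I need to use a Hampel filter on my data to strip out outliers.

I haven't been able to find an existing implementation in Python; only in Matlab and R.

[Matlab function description][1]

[Stats Exchange discussion of the Matlab Hampel function][2]

[R pracma package vignette; contains the hampel function][3]

I've written the following function, modeled on the one in the R pracma package; however, it is far slower than the Matlab version. That is not ideal, so I would appreciate input on how to speed it up.

The function is shown below: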

import numpy as np

def hampel(x, k, t0=3):
    '''adapted from hampel function in R package pracma
    x= 1-d numpy array of numbers to be filtered
    k= number of items in window/2 (# forward and backward wanted to capture in median filter)
    t0= number of standard deviations to use; 3 is default
    '''
    n = len(x)
    y = x.copy()  # y is the corrected series; copy so the input is not modified in place
    L = 1.4826
    for i in range(k, n - k):  # 0-based equivalent of R's (k + 1):(n - k)
        window = x[(i - k):(i + k + 1)]
        if np.isnan(window).all():
            continue
        x0 = np.nanmedian(window)
        S0 = L * np.nanmedian(np.abs(window - x0))
        if np.abs(x[i] - x0) > t0 * S0:
            y[i] = x0
    return y

The R implementation in the "pracma" package, which I used as a model:

function (x, k, t0 = 3)
{
    n <- length(x)
    y <- x
    ind <- c()
    L <- 1.4826
    for (i in (k + 1):(n - k)) {
        x0 <- median(x[(i - k):(i + k)])
        S0 <- L * median(abs(x[(i - k):(i + k)] - x0))
        if (abs(x[i] - x0) > t0 * S0) {
            y[i] <- x0
            ind <- c(ind, i)
        }
    }
    list(y = y, ind = ind)
}
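
The constant `L = 1.4826` in both versions is, as far as can be told from the code, the usual factor that rescales a median absolute deviation (MAD) so that it estimates the standard deviation of normally distributed data, which is why `t0` can be read as a number of standard deviations. A minimal sketch that checks where the number comes from, using only the Python standard library (the names `q75` and `L_check` are illustrative and not taken from pracma):

```python
from statistics import NormalDist

# For a normal distribution, MAD = sigma * Phi^{-1}(0.75),
# so sigma can be estimated as MAD / Phi^{-1}(0.75).
q75 = NormalDist().inv_cdf(0.75)  # 75th percentile of the standard normal
L_check = 1.0 / q75               # factor that turns a MAD into a sigma estimate

print(round(L_check, 4))  # 1.4826
```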

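Any help making the function more efficient, or a pointer to an existing implementation in an existing Python module, would be much appreciated. Example data is below; the %%timeit cell magic in Jupyter indicates that the function currently takes 15 seconds to run: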

vals = np.random.randn(250000)
vals[3000] = 100
vals[200] = -9000
vals[-300] = 8922273

%%timeit
hampel(vals, k=6)

[1]: https://www.mathworks.com/help/signal/ref/hampel.html
[2]: https://dsp.stackexchange.com/questions/26552/what-is-a-hampel-filter-and-how-does-it-work
[3]: https://cran.r-project.org/web/packages/pracma/pracma.pdf

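Best answer

The earlier solution by @EHB (not reproduced on this page) is helpful, but it is incorrect. Specifically, the rolling median computed in median_abs_deviation is taken over difference, which is itself the difference between each data point and the rolling median computed in rolling_median. It should instead be the median of the differences between the data in the rolling window and the median of that window. I took that code and modified it: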

import numpy as np

def hampel(vals_orig, k=7, t0=3):
    '''
    vals_orig: pandas series of values from which to remove outliers
    k: size of window (including the sample; 7 is equal to 3 on either side of value)
    t0: number of scaled MADs beyond which a value is treated as an outlier
    '''

    # Make copy so original not edited
    vals = vals_orig.copy()

    # Hampel Filter
    L = 1.4826
    rolling_median = vals.rolling(window=k, center=True).median()
    MAD = lambda x: np.median(np.abs(x - np.median(x)))
    rolling_MAD = vals.rolling(window=k, center=True).apply(MAD)
    threshold = t0 * L * rolling_MAD
    difference = np.abs(vals - rolling_median)

    '''
    Perhaps a condition should be added here in the case that the threshold value
    is 0.0; maybe do not mark as outlier. MAD may be 0.0 without the original values
    being equal. See differences between MAD vs SDV.
    '''

    outlier_idx = difference > threshold
    vals[outlier_idx] = rolling_median[outlier_idx]
    return(vals)
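
If the rolling `apply` with a Python lambda is still too slow, one possible alternative is to build all windows at once with NumPy and take the medians along an axis. The following is a minimal sketch and not part of the original answer. The function name `hampel_vectorized` is made up; `k` is the half-window as in the question (so the answer's `k=7` corresponds to `k=3` here); the data are assumed to contain no NaNs; and `sliding_window_view` needs NumPy 1.20 or later. Like the versions above, it leaves the first and last `k` samples untouched. No timing is claimed for it.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def hampel_vectorized(x, k=3, t0=3.0):
    """Hampel filter sketch; k is the half-window (k samples on each side)."""
    x = np.asarray(x, dtype=float)
    y = x.copy()                      # corrected series; the input is left unchanged
    n = x.size
    if n < 2 * k + 1:
        return y                      # not enough samples for one full window

    # All full windows as rows of a read-only view, shape (n - 2k, 2k + 1).
    windows = sliding_window_view(x, 2 * k + 1)
    med = np.median(windows, axis=1)                          # per-window median
    mad = np.median(np.abs(windows - med[:, None]), axis=1)   # per-window MAD

    centre = x[k:n - k]               # samples that sit at the centre of a window
    outliers = np.abs(centre - med) > t0 * 1.4826 * mad

    inner = y[k:n - k]                # view into y, so assignment updates y
    inner[outliers] = med[outliers]
    return y

# Usage with data shaped like the question's example (assumed seed for repeatability).
rng = np.random.default_rng(0)
vals = rng.standard_normal(250000)
vals[3000] = 100
vals[200] = -9000
vals[-300] = 8922273
cleaned = hampel_vectorized(vals, k=6)
```

The intermediate array `windows - med[:, None]` holds `(n - 2k) * (2k + 1)` floats, so this trades memory for speed; for very long series or wide windows it may be better to process the data in chunks.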

Regarding "python - Filtering outliers - how to make the median-based Hampel function faster?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46819260/
