Pythonic way of detecting outliers in one-dimensional observation data

Reposted. Author: IT老高. Updated: 2023-10-28 20:31:11

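For the given data, I want to set the outlier values (defined by the 95% confidence level, the 95% quantile function, or any other required criterion) to NaN. Below are the data and code I am using right now. I would be glad if someone could explain this further.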

import numpy as np, matplotlib.pyplot as plt

data = np.random.rand(1000)+5.0

plt.plot(data)
plt.xlabel('observation number')
plt.ylabel('recorded value')
plt.show()

Best answer

The problem with using percentile is that the points identified as outliers are a function of your sample size.
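To make the sample-size dependence concrete, here is a minimal sketch that applies a two-sided 95% percentile cut to data with no outliers added at all. The seed and the normally distributed test data are assumptions chosen for illustration.

```python
import numpy as np

def percentile_based_outlier(data, threshold=95):
    """Flag points outside the central `threshold` percent of the data."""
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

rng = np.random.default_rng(0)  # arbitrary fixed seed (assumption)
flagged = {}
for num in [10, 50, 100, 1000]:
    clean = rng.normal(0, 0.5, num)  # deliberately contains no planted outliers
    flagged[num] = int(percentile_based_outlier(clean).sum())
    print(num, flagged[num])
```

Because the cut always trims the outer 5% of the sample, the number of flagged points grows roughly in proportion to the sample size, even though nothing unusual was added.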

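There are many ways to test for outliers, and you should give some thought to how you classify them. Ideally, you should use a-priori information (e.g. "anything above/below this value is unrealistic because...").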
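If such prior knowledge is available, the test reduces to a fixed-bounds mask. The sketch below assumes hypothetical limits of 5.0 and 6.0 (the range of the question's `np.random.rand(1000)+5.0` data) and a small made-up sample; neither comes from a real measurement.

```python
import numpy as np

# Hypothetical physical limits, assumed for illustration only.
LOWER, UPPER = 5.0, 6.0

data = np.array([5.2, 5.9, 7.4, 5.5, 4.1, 5.0])  # made-up sample
unrealistic = (data < LOWER) | (data > UPPER)

# Replace unrealistic values with NaN, keep everything else unchanged.
cleaned = np.where(unrealistic, np.nan, data)
print(cleaned)
```

The result does not depend on the sample size, since the bounds come from outside the data.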

However, a common, not-too-unreasonable outlier test is to remove points based on their "median absolute deviation".
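As a small worked example of the idea in one dimension (the five sample values are made up for illustration): take the median, then the median of the absolute deviations from it (the MAD), and scale each deviation into a "modified z-score". The factor 0.6745 is approximately the 0.75 quantile of the standard normal distribution, which makes the score comparable to an ordinary z-score for normally distributed data; 3.5 is the threshold used as the default in the code in this answer.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up sample with one extreme value

median = np.median(x)                # 3.0
abs_dev = np.abs(x - median)         # [2, 1, 0, 1, 97]
mad = np.median(abs_dev)             # 1.0
modified_z = 0.6745 * abs_dev / mad  # scaled absolute deviations

print(modified_z > 3.5)              # only the last point exceeds the threshold
```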

Here is an implementation for the N-dimensional case (from some code for a paper here: https://github.com/joferkington/oost_paper_code/blob/master/utilities.py):

import numpy as np

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.

    Parameters:
    -----------
        points : A numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    # Treat a 1-D input as a single column of observations.
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    # Euclidean distance of each observation from the per-dimension median.
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
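To connect this back to the question, here is one way the mask might be used to set outliers to NaN. This is only a sketch: the function is a compact copy of the one above, and the seed and the two planted outlier values are assumptions for the demo.

```python
import numpy as np

def is_outlier(points, thresh=3.5):
    """Compact copy of the MAD-based test above (same logic)."""
    if points.ndim == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    diff = np.sqrt(np.sum((points - median) ** 2, axis=-1))
    med_abs_deviation = np.median(diff)
    return 0.6745 * diff / med_abs_deviation > thresh

rng = np.random.default_rng(0)   # arbitrary fixed seed (assumption)
data = rng.random(1000) + 5.0    # same kind of data as in the question
data[[10, 500]] = [9.0, 0.5]     # two planted outliers, assumed for the demo

# Keep the array shape, but replace flagged values with NaN.
cleaned = np.where(is_outlier(data), np.nan, data)
print(np.flatnonzero(np.isnan(cleaned)))
```

One caveat: if more than half of the values are identical, the MAD is zero and the division no longer gives a meaningful score, so such data would need separate handling.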

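This is very similar to one of my previous answers, but I wanted to illustrate the sample size effect in detail.

Let's compare a percentile-based outlier test (similar to @CTZhu's answer) with a median-absolute-deviation (MAD) test for a variety of different sample sizes: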

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def main():
    for num in [10, 50, 100, 1000]:
        # Generate some data
        x = np.random.normal(0, 0.5, num-3)

        # Add three outliers...
        x = np.r_[x, -3, -10, 12]
        plot(x)

    plt.show()

def mad_based_outlier(points, thresh=3.5):
    # Same test as is_outlier above, without the docstring.
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh

def percentile_based_outlier(data, threshold=95):
    # Flag everything outside the central `threshold` percent of the data.
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

def plot(x):
    fig, axes = plt.subplots(nrows=2)
    for ax, func in zip(axes, [percentile_based_outlier, mad_based_outlier]):
        # Note: sns.distplot is deprecated in seaborn >= 0.11;
        # sns.kdeplot plus sns.rugplot draw the same density and rug.
        sns.distplot(x, ax=ax, rug=True, hist=False)
        outliers = x[func(x)]
        # Mark the points each test classifies as outliers in red.
        ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False)

    kwargs = dict(y=0.95, x=0.05, ha='left', va='top')
    axes[0].set_title('Percentile-based Outliers', **kwargs)
    axes[1].set_title('MAD-based Outliers', **kwargs)
    fig.suptitle('Comparing Outlier Tests with n={}'.format(len(x)), size=14)

main()

[Figures: "Comparing Outlier Tests" for n=10, 50, 100, and 1000, each with a "Percentile-based Outliers" panel above a "MAD-based Outliers" panel]

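Notice that the MAD-based classifier works correctly regardless of sample size, while the percentile-based classifier flags more points the larger the sample is, regardless of whether or not they are actually outliers.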
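The same comparison can be checked numerically without any plotting. The following sketch reuses the two tests and the three planted outliers (-3, -10, 12) from the code above; the fixed seed and the `counts` dictionary are additions for illustration.

```python
import numpy as np

def mad_based_outlier(points, thresh=3.5):
    if points.ndim == 1:
        points = points[:, None]
    median = np.median(points, axis=0)
    diff = np.sqrt(np.sum((points - median) ** 2, axis=-1))
    med_abs_deviation = np.median(diff)
    return 0.6745 * diff / med_abs_deviation > thresh

def percentile_based_outlier(data, threshold=95):
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

rng = np.random.default_rng(0)  # arbitrary fixed seed (assumption)
counts = {}                     # n -> (percentile count, MAD count)
for num in [10, 50, 100, 1000]:
    # Normal data plus the same three planted outliers as above.
    x = np.r_[rng.normal(0, 0.5, num - 3), -3, -10, 12]
    counts[num] = (int(percentile_based_outlier(x).sum()),
                   int(mad_based_outlier(x).sum()))
    print(f"n={num}: percentile flags {counts[num][0]}, MAD flags {counts[num][1]}")
```

The percentile count is fixed by the sample size alone (about 5% of the points), whereas the MAD count should stay close to the number of planted outliers, apart from the occasional extreme draw in the normal data.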

This question and answer, "Pythonic way of detecting outliers in one-dimensional observation data", come from Stack Overflow: https://stackoverflow.com/questions/22354094/
