
Recommended anomaly detection technique for simple, one-dimensional scenario?

Repost · Author: bug小助手 · Updated: 2023-10-25 23:48:39



I have a scenario where I have several thousand instances of data. The data itself is represented as a single integer value. I want to be able to detect when an instance is an extreme outlier.




For example, with the following example data:




a = 10
b = 14
c = 25
d = 467
e = 12


d is clearly an anomaly, and I would want to perform a specific action based on this.




I was tempted to just try and use my knowledge of the particular domain to detect anomalies. For instance, figure out a useful distance from the mean value and check for that, based on heuristics. However, I think it's probably better if I investigate more general, robust anomaly-detection techniques that have some theory behind them.



Since my working knowledge of mathematics is limited, I'm hoping to find a technique which is simple, such as using standard deviation. Hopefully the single-dimensioned nature of the data will make this quite a common problem, but if more information for the scenario is required please leave a comment and I will give more info.







Edit: I thought I'd add more information about the data and what I've tried, in case it makes one answer more correct than another.



The values are all positive and non-zero. I expect that the values will form a normal distribution. This expectation is based on intuition about the domain rather than on analysis; if that is a bad assumption, please let me know. In terms of clustering, unless there are also standard algorithms for choosing a k-value, I would find it hard to provide that value to a k-means algorithm.



The action I want to take for an outlier/anomaly is to present it to the user and recommend that the data point basically be removed from the data set (I won't get into how they would do that, but it makes sense for my domain), so that it will not be used as input to another function.



So far I have tried three-sigma and the IQR outlier test on my limited data set. IQR flags values that are not extreme enough, while three-sigma points out instances that better fit my intuition of the domain.






Information on algorithms or techniques, or links to resources for learning about this specific scenario, are valid and welcome answers.



What is a recommended anomaly detection technique for simple, one-dimensional data?



More replies

Don't underestimate the value of scientific knowledge. Black box procedures are rarely the way to go. Try to express your scientific knowledge in terms of simple statistics.


@Tristan: are you saying you think I should try to come up with a model which has some grounding in statistics, but ultimately is specific for my problem domain?


I'm just saying that your knowledge of what is reasonable (i.e., what model generates the good data and the bad data) is important information. You should design a procedure, such as using IQR, that is motivated by your scientific knowledge of the domain. I don't like things like k-means because, in my view, it is not well motivated and is inherently inflexible.

Top answers

Check out the three-sigma rule:




mu  = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
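Translated into Python, a minimal sketch using NumPy (the sample values are the ones from the question):

```python
import numpy as np

x = np.array([10, 14, 25, 467, 12])

mu = x.mean()          # mean of the data
std = x.std()          # standard deviation of the data
is_outlier = np.abs(x - mu) > 3 * std

print(x[is_outlier])
```

Worth noting: on a sample this small, the 467 inflates the mean and sigma so much that it does not actually exceed the 3-sigma fence, so nothing is flagged here; with the several thousand instances the asker has, the rule behaves much more reasonably.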


An alternative method is the IQR outlier test:




Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN x is a mild outlier
IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN x is an extreme outlier
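The same test sketched in Python (`np.percentile` uses linear interpolation for the quartiles, which is one of several common conventions):

```python
import numpy as np

x = np.array([10, 14, 25, 467, 12])

q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25                               # inter-quartile range

mild    = (x < q25 - 1.5 * iqr) | (x > q75 + 1.5 * iqr)
extreme = (x < q25 - 3.0 * iqr) | (x > q75 + 3.0 * iqr)

print(x[extreme])  # -> [467]
```

Unlike three-sigma, the quartile-based fences are barely moved by the 467 itself, so it is flagged even on this five-point sample.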


This test is the one usually employed by box plots (indicated by the whiskers):



[figure: box plot]






EDIT:




For your case (simple 1D univariate data), I think my first answer is well suited. That, however, isn't applicable to multivariate data.



@smaclell suggested using K-means to find the outliers. Besides the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with k-means is that it requires knowing in advance a good value for the number of clusters K.



A better-suited technique is DBSCAN: a density-based clustering algorithm. Basically, it grows regions of sufficiently high density into clusters, each of which is a maximal set of density-connected points.



[figure: DBSCAN clustering example]



DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.




If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.




If the number of neighbors is less than minPoints, the point is marked as noise.




If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points until they are depleted.




Finally the set of all points marked as noise are considered outliers.

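For 1-D data, the noise-labelling logic described above can be sketched without a library (a hand-rolled illustration rather than a full DBSCAN implementation; `eps=20` and `min_points=2` are values I picked for the question's sample, not canonical choices):

```python
import numpy as np

def dbscan_noise(x, eps, min_points):
    """Boolean mask of points DBSCAN would label as noise: a point is
    noise if it is neither a core point (>= min_points neighbours within
    eps, counting itself) nor within eps of some core point."""
    x = np.asarray(x, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])        # pairwise 1-D distances
    core = (dist <= eps).sum(axis=1) >= min_points
    reachable = (dist[:, core] <= eps).any(axis=1)
    return ~(core | reachable)

print(dbscan_noise([10, 14, 25, 467, 12], eps=20, min_points=2))
```

On the question's sample, 467 is the only point with no neighbour within `eps`, so it alone is marked as noise.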



There are a variety of clustering techniques you could use to try to identify central tendencies within your data. One such algorithm we used heavily in my pattern recognition course was K-means. This would allow you to identify whether there is more than one related set of data, such as a bimodal distribution. It does require some knowledge of how many clusters to expect, but it is fairly efficient and easy to implement.



Once you have the means, you can then try to find out whether any point is far from all of them. You can define 'far' however you want, but I would recommend the suggestions by @Amro as a good starting point.
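As a sketch of that two-step idea (a plain Lloyd's-algorithm k-means written out by hand; `k=2`, the random seed, and the question's sample values are all assumptions for illustration):

```python
import numpy as np

def kmeans_1d(x, k=2, iters=100, seed=0):
    """Plain Lloyd's algorithm on 1-D data; returns (centers, labels)."""
    x = np.asarray(x, dtype=float)
    centers = np.random.default_rng(seed).choice(x, size=k, replace=False)
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        # move each center to the mean of its assigned points
        new = np.array([x[labels == j].mean() if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

x = np.array([10, 14, 25, 467, 12])
centers, labels = kmeans_1d(x, k=2)
print(centers, labels)  # 467 ends up alone in its own cluster
```

On this sample the algorithm converges to one cluster around 15.25 and a singleton cluster at 467, so the outlier could then be flagged by checking distance from the nearest center.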



For a more in-depth discussion of clustering algorithms refer to the wikipedia entry on clustering.




This is an old topic, but it still lacks some information.


Evidently, this can be seen as a case of univariate outlier detection. The approaches presented above have several pros and cons. Here are some weak spots:




  1. Detecting outliers with the mean and sigma has the obvious disadvantage that the mean and sigma themselves depend on the outliers.

  2. The small-sample case (see the question, for example) is not adequately covered by 3-sigma, K-means, IQR, etc.

And I could go on... However, the statistical literature offers a simple robust metric: the median absolute deviation (MAD). (Medians are insensitive to outliers.)
Details can be found here: https://www.sciencedirect.com/book/9780128047330/introduction-to-robust-estimation-and-hypothesis-testing


I think this problem can be solved in a few lines of Python code like this:


import numpy as np
import scipy.stats as sts

x = np.array([10, 14, 25, 467, 12])  # your values
# MAD criterion: robust z-scores (MAD rescaled by 0.6745 to estimate sigma)
np.abs(x - np.median(x)) / (sts.median_abs_deviation(x) / 0.6745)

Subsequently you reject values above a certain threshold (the 97.5th percentile of the reference distribution); for an assumed normal distribution the threshold is 2.24. Here that translates to:


array([ 0.6745  ,  0.      ,  1.854875, 76.387125,  0.33725 ])

so the 467 entry is rejected.


Of course, one could argue that the MAD (as presented) also assumes a normal distribution. So why doesn't argument 2 above (the small sample) apply here? The answer is that the MAD has a very high breakdown point. It is easy to choose different threshold points from different distributions and come to the same conclusion: 467 is the outlier.
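For reuse, the criterion can be wrapped into one small function (a sketch; the 2.24 cutoff is the normal-distribution threshold quoted above):

```python
import numpy as np
import scipy.stats as sts

def mad_outliers(x, threshold=2.24):
    """Return the values whose MAD-based robust z-score exceeds threshold."""
    x = np.asarray(x, dtype=float)
    score = np.abs(x - np.median(x)) / (sts.median_abs_deviation(x) / 0.6745)
    return x[score > threshold]

print(mad_outliers([10, 14, 25, 467, 12]))  # -> [467.]
```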



Both three-sigma rule and IQR test are often used, and there are a couple of simple algorithms to detect anomalies.




The three-sigma rule is correct:
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier


The IQR test should be:




Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
IF x > Q75 + 1.5*IQR OR x < Q25 - 1.5*IQR THEN x is a mild outlier
IF x > Q75 + 3.0*IQR OR x < Q25 - 3.0*IQR THEN x is an extreme outlier


The anomaly detection of one-dimensional data is an open challenge. I have published a Python package named xiezhi, which can be applied to detect abnormal data in a list, especially when the list is large and only a few data points in it are anomalies. This tool is based on one of my research papers, and it has been proven to be theoretically robust. Here is a tutorial for xiezhi: https://medium.com/@hellojerrywong18/xiezhi-the-anomaly-detection-tool-for-one-dimensional-data-9108c539e692


If you have any problems or suggestions, please let me know.


More replies

+1 three-sigma and IQR look like good techniques, thanks for the insightful answer.


I like this simple advice. The IQR based statistic has the advantage of not being influenced by extreme outliers which will change the mean/sd.


@Anony-Mousse: fixed, thanks. Funny enough I first learned about DBSCAN in a machine-learning class using Weka software/book


Yes, the Weka software and book are very widely used, which is why it is a pity they made this error. Plus, the DBSCAN implementation in Weka is really crappy: it benchmarked well over 100x slower than mine, and even slower than their OPTICS implementation? OPTICS should be quite a bit slower.

@Anony-Mousse: If you are willing and have the time, you could contribute your implementation to Weka. It is open sourced under GPL, and no I'm not affiliated with them in any way :)


Agreed. K-Means is a simple, effective, and adaptive solution for this problem. Create two clusters, initialize properly, and one of the clusters should contain the meaningful data while the other gets the outlier(s). But be careful; if you have no outliers, then both clusters will contain meaningful data.


Well that is where it gets fun. It is often very difficult to determine the number of clusters and would be even harder doing it in a live system. Even in that case of one true cluster and another outlier cluster it could be argued the outliers are starting to represent a real mode for the data. I am going to add more links to provide other options.


This strikes me as the wrong tool for the job. He's primarily interested in fat tails, not bimodal distributions.


It depends on the asker's intent, so we cannot be completely sure. If the only intent is to assess how anomalous a data point is, then use simple statistics, of course. But if you want to, say, use the "good" data as an input to a subsequent function, then there may be value in classifying the points as "good" or "bad" (e.g., through K-means, etc.).


@Steve That is actually wrong. There is no reason why all the outliers should form a cluster. K-Means finds clusters for which the euclidean distance from its center is minimized - if the outliers are distributed evenly around the clusters, this will not help at all. The Euclidean distance results from a Gaussian assumption which is very vulnerable to outliers. Don't use K-Means for outlier detection only. You might want to use it for preprocessing and using three sigma afterwards, as stated by the original author.


I just noticed this and you are right, my IQR test wasn't correct. I'll update my answer, thanks.


FYI, the time complexity of xiezhi is O(N), where N is the size of the list.

Can you share the method used by xiezhi, e.g. by providing a reference to your research paper?


Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号