
Recommended anomaly detection technique for simple, one-dimensional scenario?

Repost · Author: bug小助手 · Updated: 2023-10-25 23:48:39



I have a scenario where I have several thousand instances of data. The data itself is represented as a single integer value. I want to be able to detect when an instance is an extreme outlier.




For example, with the following example data:




a = 10
b = 14
c = 25
d = 467
e = 12


d is clearly an anomaly, and I would want to perform a specific action based on this.




I was tempted to just try and use my knowledge of the particular domain to detect anomalies. For instance, figure out a useful distance from the mean value and check for that, based on heuristics. However, I think it's probably better if I investigate more general, robust anomaly-detection techniques that have some theory behind them.



Since my working knowledge of mathematics is limited, I'm hoping to find a technique which is simple, such as using standard deviation. Hopefully the single-dimensioned nature of the data will make this quite a common problem, but if more information for the scenario is required please leave a comment and I will give more info.







Edit: I thought I'd add more information about the data and what I've tried, in case it makes one answer more correct than another.



The values are all positive and non-zero. I expect that the values will form a normal distribution. This expectation is based on intuition about the domain rather than on analysis; if that is a bad assumption, please let me know. In terms of clustering, unless there are also standard algorithms for choosing a k-value, I would find it hard to provide that value to a k-means algorithm.



The action I want to take for an outlier/anomaly is to present it to the user and recommend that the data point basically be removed from the data set (I won't get into how they would do that, but it makes sense for my domain), so that it will not be used as input to another function.



So far I have tried three-sigma and the IQR outlier test on my limited data set. IQR flags values that are not extreme enough, while three-sigma points out instances that better fit my intuition of the domain.






Information on algorithms or techniques, or links to resources for learning about this specific scenario, are valid and welcome answers.



What is a recommended anomaly detection technique for simple, one-dimensional data?



More replies

Don't underestimate the value of scientific knowledge. Black box procedures are rarely the way to go. Try to express your scientific knowledge in terms of simple statistics.


@Tristan: are you saying you think I should try to come up with a model which has some grounding in statistics, but ultimately is specific for my problem domain?


I'm just saying that your knowledge of what is reasonable (i.e., what model generates the good data and the bad data) is important information. You should design a procedure, such as using IQR, that is motivated by your scientific knowledge of the domain. I don't like things like k-means because, in my view, it is not well motivated and is inherently inflexible.

Top answers

Check out the three-sigma rule:




mu  = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
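Translated into Python, a minimal sketch using NumPy (the sample values are the ones from the question):

```python
import numpy as np

x = np.array([10, 14, 25, 467, 12])

mu = x.mean()          # mean of the data
std = x.std()          # standard deviation of the data
is_outlier = np.abs(x - mu) > 3 * std

print(x[is_outlier])
```

Worth noting: on a sample this small, the 467 inflates the mean and sigma so much that it does not actually exceed the 3-sigma fence, so nothing is flagged here; with the several thousand instances the asker has, the rule behaves much more reasonably.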


An alternative method is the IQR outlier test:




Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN x is a mild outlier
IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN x is an extreme outlier
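The same test sketched in Python (`np.percentile` uses linear interpolation for the quartiles, which is one of several common conventions):

```python
import numpy as np

x = np.array([10, 14, 25, 467, 12])

q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25                               # inter-quartile range

mild    = (x < q25 - 1.5 * iqr) | (x > q75 + 1.5 * iqr)
extreme = (x < q25 - 3.0 * iqr) | (x > q75 + 3.0 * iqr)

print(x[extreme])  # -> [467]
```

Unlike three-sigma, the quartile-based fences are barely moved by the 467 itself, so it is flagged even on this five-point sample.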


This test is the one usually employed by box plots (indicated by the whiskers):



[figure: box plot]






EDIT:




For your case (simple 1D univariate data), I think my first answer is well suited. That, however, isn't applicable to multivariate data.



@smaclell suggested using K-means to find the outliers. Besides the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with k-means is that it requires knowing in advance a good value for the number of clusters K.



A better-suited technique is DBSCAN: a density-based clustering algorithm. Basically, it grows regions of sufficiently high density into clusters, each of which is a maximal set of density-connected points.



[figure: DBSCAN clustering example]



DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.




If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.




If the number of neighbors is less than minPoints, the point is marked as noise.




If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points until they are depleted.




Finally the set of all points marked as noise are considered outliers.

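For 1-D data, the noise-labelling logic described above can be sketched without a library (a hand-rolled illustration rather than a full DBSCAN implementation; `eps=20` and `min_points=2` are values I picked for the question's sample, not canonical choices):

```python
import numpy as np

def dbscan_noise(x, eps, min_points):
    """Boolean mask of points DBSCAN would label as noise: a point is
    noise if it is neither a core point (>= min_points neighbours within
    eps, counting itself) nor within eps of some core point."""
    x = np.asarray(x, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])        # pairwise 1-D distances
    core = (dist <= eps).sum(axis=1) >= min_points
    reachable = (dist[:, core] <= eps).any(axis=1)
    return ~(core | reachable)

print(dbscan_noise([10, 14, 25, 467, 12], eps=20, min_points=2))
```

On the question's sample, 467 is the only point with no neighbour within `eps`, so it alone is marked as noise.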



There are a variety of clustering techniques you could use to try to identify central tendencies within your data. One such algorithm we used heavily in my pattern recognition course was K-means. This would allow you to identify whether there is more than one related set of data, such as a bimodal distribution. It does require some knowledge of how many clusters to expect, but it is fairly efficient and easy to implement.



Once you have the means, you can then try to find out whether any point is far from all of them. You can define 'far' however you want, but I would recommend the suggestions by @Amro as a good starting point.
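As a sketch of that two-step idea (a plain Lloyd's-algorithm k-means written out by hand; `k=2`, the random seed, and the question's sample values are all assumptions for illustration):

```python
import numpy as np

def kmeans_1d(x, k=2, iters=100, seed=0):
    """Plain Lloyd's algorithm on 1-D data; returns (centers, labels)."""
    x = np.asarray(x, dtype=float)
    centers = np.random.default_rng(seed).choice(x, size=k, replace=False)
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        # move each center to the mean of its assigned points
        new = np.array([x[labels == j].mean() if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

x = np.array([10, 14, 25, 467, 12])
centers, labels = kmeans_1d(x, k=2)
print(centers, labels)  # 467 ends up alone in its own cluster
```

On this sample the algorithm converges to one cluster around 15.25 and a singleton cluster at 467, so the outlier could then be flagged by checking distance from the nearest center.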



For a more in-depth discussion of clustering algorithms refer to the wikipedia entry on clustering.




This is an old topic, but it still lacks some information.


Evidently, this can be seen as a case of univariate outlier detection. The approaches presented above have several pros and cons. Here are some weak spots:




  1. Detecting outliers with the mean and sigma has the obvious disadvantage that the mean and sigma themselves depend on the outliers.

  2. The small-sample case (see the question, for example) is not adequately covered by 3-sigma, K-means, IQR, etc.

And I could go on... However, the statistical literature offers a simple robust metric: the median absolute deviation (MAD). (Medians are insensitive to outliers.)
Details can be found here: https://www.sciencedirect.com/book/9780128047330/introduction-to-robust-estimation-and-hypothesis-testing


I think this problem can be solved in a few lines of Python code like this:


import numpy as np
import scipy.stats as sts

x = np.array([10, 14, 25, 467, 12])  # your values
# MAD criterion: robust z-scores (MAD rescaled by 0.6745 to estimate sigma)
np.abs(x - np.median(x)) / (sts.median_abs_deviation(x) / 0.6745)

Subsequently you reject values above a certain threshold (the 97.5th percentile of the reference distribution); for an assumed normal distribution the threshold is 2.24. Here that translates to:


array([ 0.6745  ,  0.      ,  1.854875, 76.387125,  0.33725 ])

so the 467 entry is rejected.


Of course, one could argue that the MAD (as presented) also assumes a normal distribution. So why doesn't argument 2 above (the small sample) apply here? The answer is that the MAD has a very high breakdown point. It is easy to choose different threshold points from different distributions and come to the same conclusion: 467 is the outlier.
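For reuse, the criterion can be wrapped into one small function (a sketch; the 2.24 cutoff is the normal-distribution threshold quoted above):

```python
import numpy as np
import scipy.stats as sts

def mad_outliers(x, threshold=2.24):
    """Return the values whose MAD-based robust z-score exceeds threshold."""
    x = np.asarray(x, dtype=float)
    score = np.abs(x - np.median(x)) / (sts.median_abs_deviation(x) / 0.6745)
    return x[score > threshold]

print(mad_outliers([10, 14, 25, 467, 12]))  # -> [467.]
```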



Both three-sigma rule and IQR test are often used, and there are a couple of simple algorithms to detect anomalies.




The three-sigma rule is correct:
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier


The IQR test should be:




Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
IF x > Q75 + 1.5*IQR OR x < Q25 - 1.5*IQR THEN x is a mild outlier
IF x > Q75 + 3.0*IQR OR x < Q25 - 3.0*IQR THEN x is an extreme outlier


The anomaly detection of one-dimensional data is an open challenge. I have published a Python package named xiezhi, which can be applied to detect abnormal data in a list, especially when the list is large and only a few data points in it are anomalies. This tool is based on one of my research papers, and it has been proven to be theoretically robust. Here is a tutorial for xiezhi: https://medium.com/@hellojerrywong18/xiezhi-the-anomaly-detection-tool-for-one-dimensional-data-9108c539e692


If you have any problems or suggestions, please let me know.


More replies

+1 three-sigma and IQR look like good techniques, thanks for the insightful answer.


I like this simple advice. The IQR based statistic has the advantage of not being influenced by extreme outliers which will change the mean/sd.


@Anony-Mousse: fixed, thanks. Funny enough I first learned about DBSCAN in a machine-learning class using Weka software/book


Yes, the Weka software and book are very widely used, which is why it is a pity they made this error. Plus, the DBSCAN implementation in Weka is really crappy: it benchmarked well over 100x slower than mine, and even slower than their OPTICS implementation? OPTICS should be quite a bit slower.

@Anony-Mousse: If you are willing and have the time, you could contribute your implementation to Weka. It is open sourced under GPL, and no I'm not affiliated with them in any way :)


Agreed. K-Means is a simple, effective, and adaptive solution for this problem. Create two clusters, initialize properly, and one of the clusters should contain the meaningful data while the other gets the outlier(s). But be careful; if you have no outliers, then both clusters will contain meaningful data.


Well that is where it gets fun. It is often very difficult to determine the number of clusters and would be even harder doing it in a live system. Even in that case of one true cluster and another outlier cluster it could be argued the outliers are starting to represent a real mode for the data. I am going to add more links to provide other options.


This strikes me as the wrong tool for the job. He's primarily interested in fat tails, not bimodal distributions.


It depends on the asker's intent, so we cannot be completely sure. If the only intent is to assess how anomalous a data point is, then use simple statistics, of course. But if you want to, say, use the "good" data as an input to a subsequent function, then there may be value in classifying the points as "good" or "bad" (e.g., through K-means, etc.).


@Steve That is actually wrong. There is no reason why all the outliers should form a cluster. K-Means finds clusters for which the euclidean distance from its center is minimized - if the outliers are distributed evenly around the clusters, this will not help at all. The Euclidean distance results from a Gaussian assumption which is very vulnerable to outliers. Don't use K-Means for outlier detection only. You might want to use it for preprocessing and using three sigma afterwards, as stated by the original author.


I just noticed this and you are right, my IQR test wasn't correct. I'll update my answer, thanks.


FYI, the time complexity of xiezhi is O(N), where N is the size of the list.

Can you share the method used by xiezhi, e.g. by providing a reference to your research paper?


Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号