I have thousands of x,y data pairs of observed and theoretical temperatures in a given region, in which I need to identify the data points that are anomalous for the purposes of my work. I have tried PCA, DBSCAN clustering, statistical thresholds, curve fitting, etc., but none of them worked. I then tried a KDE-based approach, which seems to work better for my data. To obtain the results shown below I used dscatter from https://it.mathworks.com/matlabcentral/fileexchange/8430-flow-cytometry-data-reader-and-visualization as:
[hAxes,col,ctrs1,ctrs2,F] = dscatter(X,Y,'BINS',[250,250]);
% Please note: I modified dscatter to return hAxes, col, ctrs1, ctrs2 and F, and I plot the contour line as:
contour(ctrs1,ctrs2,F,[0.0015 0.0015],'k-'); % or [0.015 0.015] ---> see figure below; [k k] requests the single level k
With this approach I was able to draw a contour line around the main cluster of my data, and in most cases it separated the 'main cluster' from the 'anomalous' points. However, if there are too many 'anomalous' data points (see xy4 in the figure below), the method fails because it uses a fixed threshold. Besides, I have to adjust the threshold for the region of each x,y pair, and I have no idea how to find the correct threshold level (see the figure below: for the same region, a threshold level of 0.0015 seems to work when there are only a few anomalous points, but a threshold of 0.015 is needed when the points are more spread out).
What I would really like to do is draw a curve around the main cluster of my scatterplot, so as to separate out the anomalous data points. I fully understand this may be a challenging task, but I hope you can suggest some good alternatives and/or solutions.
Another solution, since the KDE-based approach seems to work, might be to define the density threshold level automatically, but I don't really know where to start.
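One possible starting point for an automatic level, as a minimal sketch rather than a tested solution: estimate the density on a grid, read it off at every data point, and take as threshold the density value below which a chosen fraction of the points falls. The synthetic data, the Gaussian smoothing width `s`, and the retained fraction `p` are all assumptions made for illustration; the density here is built with base MATLAB functions (`histcounts2`, `conv2`) instead of the modified dscatter, so the numbers will not match the levels quoted above.

```matlab
% Sketch: derive the contour level from the data instead of fixing it.
% Synthetic data stands in for one x,y pair; all names are illustrative.
rng(0);
X = [randn(2000,1); 6*rand(100,1) - 3];                   % dense cluster + scattered points
Y = [X(1:2000) + 0.3*randn(2000,1); 6*rand(100,1) - 3];

nb = 250;                                                 % bins per axis, as in the dscatter call
xe = linspace(min(X), max(X), nb+1);                      % bin edges along x
ye = linspace(min(Y), max(Y), nb+1);                      % bin edges along y
[H, ~, ~, bx, by] = histcounts2(X, Y, xe, ye);            % counts and per-point bin indices

s = 5;                                                    % smoothing width in bins (assumed)
g = exp(-(-3*s:3*s).^2 / (2*s^2));
g = g / sum(g);                                           % normalised 1-D Gaussian kernel
F = conv2(g, g, H', 'same') / numel(X);                   % smoothed density, rows = y, columns = x

d = F(sub2ind(size(F), by, bx));                          % density at the bin of every point

p   = 0.95;                                               % assumed share of points in the main cluster
ds  = sort(d, 'descend');
thr = ds(ceil(p * numel(ds)));                            % level whose contour keeps about p of the points
isAnom = d < thr;                                         % candidate anomalous points

xc = (xe(1:end-1) + xe(2:end)) / 2;                       % bin centres for plotting
yc = (ye(1:end-1) + ye(2:end)) / 2;
plot(X(~isAnom), Y(~isAnom), 'b.', X(isAnom), Y(isAnom), 'r.'); hold on
contour(xc, yc, F, [thr thr], 'k-'); hold off
```

This only replaces an absolute level with a relative one: `p` still has to be chosen, and a case with many scattered points (such as xy4) would presumably need a smaller `p` than a case with few.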
Below, you can see 4 examples (xy1 to xy4 are attached; note that col1 and col2 = xy1, col3 and col4 = xy2, etc.). You can find the data here: https://www.mashupstack.com/share/64fc9e4fb3683
Example
On the left is the simple x,y scatterplot. In the middle is the ideal curve (in red, manually sketched) defining the boundary of my scatterplot. On the right is the KDE-based approach discussed above. Please note the last figure, where a threshold of 0.0015 fails and a threshold of 0.015 is needed instead.
Any help is greatly appreciated!
Any other approach to identify the points outside the red boundaries is more than welcome!
Related: note that a keyword which might help your search is "convex hull".
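To illustrate the keyword, here is a minimal sketch using the base MATLAB functions `convhull` and `inpolygon`. The synthetic data and the distance rule used to pick "core" points are hypothetical stand-ins; any other way of selecting the main cluster could be substituted.

```matlab
% Sketch: wrap a convex hull around selected "core" points and flag the rest.
rng(1);
x = [randn(500,1); 8*rand(20,1) - 4];                 % cluster + scattered points
y = [randn(500,1); 8*rand(20,1) - 4];

% Hypothetical core selection: points within 2.5 units of the median centre.
core = hypot(x - median(x), y - median(y)) < 2.5;
xcore = x(core);
ycore = y(core);

k  = convhull(xcore, ycore);                          % indices of hull vertices (closed polygon)
in = inpolygon(x, y, xcore(k), ycore(k));             % true for points inside or on the hull

plot(x(in), y(in), 'b.', x(~in), y(~in), 'r.'); hold on
plot(xcore(k), ycore(k), 'k-'); hold off
```

A convex hull cannot follow a curved or concave cluster edge; `boundary(x, y, s)` with a shrink factor `s` between 0 and 1 returns a tighter, possibly non-convex outline that can be passed to `inpolygon` in the same way.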