matlab - Irregular plot from k-means clustering, outlier removal

Reposted by: 太空宇宙 · Updated: 2023-11-03 19:21:49

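Hi, I am trying to cluster network data from the 1999 DARPA dataset. Unfortunately, I am not really getting clustered data, at least not compared with some of the literature that uses the same techniques and methods.

My data comes out like this: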

Matlab Figure 1

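As you can see, it is not very clustered. This is because the dataset contains a lot of outliers (noise). I have looked at some outlier removal techniques, but nothing I have tried so far really cleans the data. One of the methods I tried: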

%% When an outlier is considered to be more than three standard deviations away from the mean, determine the number of outliers in each column of the count matrix:

    mu = mean(data)
    sigma = std(data)
    [n,p] = size(data);
    % Create a matrix of mean values by replicating the mu vector for n rows
    MeanMat = repmat(mu,n,1);
    % Create a matrix of standard deviation values by replicating the sigma vector for n rows
    SigmaMat = repmat(sigma,n,1);
    % Create a matrix of zeros and ones, where ones indicate the location of outliers
    outliers = abs(data - MeanMat) > 3*SigmaMat;
    % Calculate the number of outliers in each column
    nout = sum(outliers) 
    % To remove an entire row of data containing the outlier
    data(any(outliers,2),:) = [];

On the first run, it removed 48 rows from the 1000 normalized random rows selected from the full dataset.

Here is the complete script I used on the data:

    %% load data
        %# read the list of features
        fid = fopen('kddcup.names','rt');
        C = textscan(fid, '%s %s', 'Delimiter',':', 'HeaderLines',1);
        fclose(fid);

        %# determine type of features
        C{2} = regexprep(C{2}, '.$','');              %# remove "." at the end
        attribNom = [ismember(C{2},'symbolic');true]; %# nominal features

        %# build format string used to read/parse the actual data
        frmt = cell(1,numel(C{1}));
        frmt( ismember(C{2},'continuous') ) = {'%f'}; %# numeric features: read as number
        frmt( ismember(C{2},'symbolic') ) = {'%s'};   %# nominal features: read as string
        frmt = [frmt{:}];
        frmt = [frmt '%s'];                           %# add the class attribute

        %# read dataset
        fid = fopen('kddcup.data_10_percent_corrected','rt');
        C = textscan(fid, frmt, 'Delimiter',',');
        fclose(fid);

        %# convert nominal attributes to numeric
        ind = find(attribNom);
        G = cell(numel(ind),1);
        for i=1:numel(ind)
            [C{ind(i)},G{i}] = grp2idx( C{ind(i)} );
        end

        %# all numeric dataset
        fulldata = cell2mat(C);

%% dimensionality reduction 
columns = 6
[U,S,V]=svds(fulldata,columns);

%% randomly select dataset
rows = 1000;
columns = 6;

%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows)';

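%# pick random columns
%# (indY was never defined; assumed here to be a random permutation of the columns of U)
indY = randperm( size(U,2) );
indY = indY(1:columns);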

%# filter data
data = U(indX,indY);

% apply min-max normalization over the whole matrix, mapping values to [0,1]
maxData = max(max(data));
minData = min(min(data));
data = ((data-minData)./(maxData-minData));

% output matching data
dataSample = fulldata(indX, :)

%% When an outlier is considered to be more than three standard deviations away from the mean, use the following syntax to determine the number of outliers in each column of the count matrix:

mu = mean(data)
sigma = std(data)
[n,p] = size(data);
% Create a matrix of mean values by replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate the location of outliers
outliers = abs(data - MeanMat) > 2.5*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers) 
% To remove an entire row of data containing the outlier
data(any(outliers,2),:) = [];

%% generate sample data
K = 6;
numObservarations = size(data, 1);
dimensions = 3;

%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);

%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
grid on
view([90 0]);

%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);

Here are two different clusters from the output:

enter image description here

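As you can see, the data looks cleaner and more clustered than the original. However, I still think a better method could be used.

For example, looking at the overall clustering, I still have a lot of noise (outliers) from the dataset, as can be seen here: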

enter image description here

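I need the outlier rows to be put into a separate dataset for later classification (they should only be removed from the clustering).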
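The requirement above (keep the outlier rows instead of discarding them) can be sketched as follows. This is a minimal illustration in Python rather than MATLAB, using only the standard library; the function name `split_outliers` and the toy data are assumptions made for the example, and the test mirrors the per-column "more than three standard deviations from the mean" rule from the question.

```python
from statistics import mean, stdev


def split_outliers(rows, threshold=3.0):
    """Split rows into (inliers, outliers) with a per-column z-score test.

    A row counts as an outlier if any of its values lies more than
    `threshold` sample standard deviations from that column's mean.
    (Hypothetical helper, not part of the original script.)
    """
    columns = list(zip(*rows))
    mu = [mean(col) for col in columns]
    sigma = [stdev(col) for col in columns]  # n-1 normalisation, like MATLAB's std
    inliers, outliers = [], []
    for row in rows:
        is_outlier = any(
            abs(x - m) > threshold * s for x, m, s in zip(row, mu, sigma)
        )
        (outliers if is_outlier else inliers).append(row)
    return inliers, outliers


# Toy data: 19 ordinary rows plus one extreme row.
rows = [[float(i % 5), 1.0] for i in range(19)] + [[100.0, 1.0]]
inliers, outliers = split_outliers(rows)
print(len(inliers), outliers)
```

In the MATLAB script, the same idea would amount to copying the rows selected by `any(outliers,2)` into a second variable before deleting them from `data`.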

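Here is the link to the DARPA dataset. Note that the 10% dataset has had its number of columns reduced significantly; most of the columns that are 0 or 1 throughout have been removed (42 columns reduced to 6):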

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

EDIT

The columns kept in the dataset are:

src_bytes: continuous.

dst_bytes: continuous.

count: continuous.

srv_count: continuous.  

dst_host_count: continuous.

dst_host_srv_count: continuous.

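RE-EDIT:

Based on the discussion with Anony-Mousse and his answer, there may be a way to reduce the noise in the clustering using K-Medoids http://en.wikipedia.org/wiki/K-medoids . I hope that the code I currently have does not need to change much, but so far I do not know how to implement it to test whether it reduces the noise significantly. So if anyone can show me a working example, that will be accepted as the answer.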
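As a rough illustration of the K-Medoids idea mentioned above, here is a minimal sketch in Python (standard library only, not MATLAB). It uses the simple alternating scheme (assign points to the nearest medoid, then move each medoid to the member with the smallest summed distance) rather than the full PAM swap search, and it is quadratic in cluster size, so it only suits small samples. The function name `k_medoids`, the `init` parameter, and the toy points are assumptions made for the example.

```python
import math
import random


def k_medoids(points, k, init=None, max_iter=100, seed=0):
    """Cluster `points` (equal-length tuples) around k medoids.

    Returns (medoid_indices, labels). `init` optionally fixes the starting
    medoid indices; otherwise they are sampled at random.
    """
    if init is None:
        init = random.Random(seed).sample(range(len(points)), k)
    medoids = list(init)

    def assign(current):
        # Index of the nearest medoid for every point.
        return [
            min(range(k), key=lambda c: math.dist(p, points[current[c]]))
            for p in points
        ]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # empty cluster: keep the old medoid
                new_medoids.append(medoids[c])
                continue
            # The medoid is the member with the smallest summed distance.
            new_medoids.append(min(
                members,
                key=lambda i: sum(math.dist(points[i], points[j]) for j in members),
            ))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Two compact groups plus one far-away outlier at (100, 100).
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (100, 100)]
medoids, labels = k_medoids(pts, k=2, init=[0, 4])
print([pts[m] for m in medoids], labels)
```

Because a medoid is always an actual data point, the outlier shifts the second cluster's center by at most one grid step here, whereas a mean would be dragged toward (100, 100). The outlier is still assigned to a cluster, though, so this reduces its influence rather than removing it.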

Best answer

Note that using this dataset is discouraged:

The dataset has errors: KDD Cup '99 dataset (Network Intrusion) considered harmful

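Reconsider using a different algorithm. k-means is not really appropriate for mixed-type data where many attributes are discrete and have very different scales. K-means needs to be able to compute sensible means. For a binary vector, "0.5" is not a sensible mean; it should be either 0 or 1.

Also, k-means does not cope well with outliers.

When plotting, make sure to scale the axes equally, otherwise the result will look wrong. Your X axis has a length of about 0.9, while your y axis has only 0.2; no wonder the clusters look squashed.

Overall, maybe the dataset simply doesn't have k-means-style clusters? You should definitely try a density-based method (since these can handle outliers), such as DBSCAN. But judging from the visualizations you added, I would say it has at most 4-5 clusters, and they are not very interesting. They can probably be captured with a few thresholds in some of the dimensions.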
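To make the density-based suggestion concrete, here is a minimal DBSCAN sketch in Python (standard library only, not MATLAB). It uses a brute-force neighbour search, so it is only meant for small samples; the names `dbscan` and `NOISE` and the values of `eps` and `min_pts` are assumptions chosen for the toy data, not tuned for the KDD data.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    A point is a core point if at least `min_pts` points (itself included)
    lie within distance `eps` of it.
    """
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(nbrs)
    return labels


pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (100, 100)]
print(dbscan(pts, eps=1.5, min_pts=3))
```

Unlike k-means, the isolated point is not forced into a cluster; it is labelled as noise, which matches the wish to set outliers aside for later classification.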

Parallel coordinates

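Here is a visualization of the dataset after z-normalization, shown in parallel coordinates, with 5000 samples. Bright green is normal.

You can clearly see the special properties of the dataset. All attacks are clearly different in attributes 3 and 4 (count and srv_count), and most are very concentrated in dst_host_count and dst_host_srv_count.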
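The z-normalization used for the plot above rescales each attribute separately to zero mean and unit standard deviation, which is different from the single global min-max scaling in the question's script. A minimal Python sketch (standard library only) follows; the function name `z_normalize` is hypothetical, and the use of the sample standard deviation is an assumption, since the answer does not say which variant was used.

```python
from statistics import mean, stdev


def z_normalize(rows):
    """Return a copy of `rows` with every column scaled to mean 0, std 1.

    Constant columns (standard deviation 0) are mapped to 0.0 to avoid
    dividing by zero.
    """
    columns = list(zip(*rows))
    mu = [mean(col) for col in columns]
    sigma = [stdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, mu, sigma)]
        for row in rows
    ]


# Two attributes on very different scales end up comparable.
print(z_normalize([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]))
```

Per-column scaling matters here because attributes such as src_bytes and count live on very different scales, and a global scaling lets the largest attribute dominate the distances.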

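I also ran OPTICS on this dataset. It found a number of clusters, most of them in the wine-colored attack pattern. But they are not really interesting. If you have 10 different hosts ping-flooding, they will form 10 clusters.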

OPTICS Clusters

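You can see very well that OPTICS managed to cluster a number of these attacks. It missed all the orange stuff (maybe if I had set minpts lower; it is quite spread out), but it even discovered *structure* within the wine-colored attack, breaking it into a number of separate events.

To really make sense of this dataset, you should start with feature extraction, for example by merging such ping-flood connection attempts into one aggregated event.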
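One possible reading of "merging connection attempts into one aggregated event" is sketched below in Python (standard library only). Everything about the record layout is hypothetical: the KDD Cup '99 records do not carry timestamps or host addresses in this form, so the `(timestamp, src, dst, service)` tuples, the `max_gap` parameter, and the function name `aggregate_events` are assumptions made purely for illustration.

```python
def aggregate_events(records, max_gap=1.0):
    """Fold repeated connections into aggregated events.

    `records` holds hypothetical (timestamp, src, dst, service) tuples.
    Records with the same (src, dst, service) key whose timestamps are at
    most `max_gap` apart are merged into one event with a running count.
    """
    events = []
    latest = {}  # key -> index of the most recent event for that key
    for ts, src, dst, service in sorted(records):
        key = (src, dst, service)
        idx = latest.get(key)
        if idx is not None and ts - events[idx]["end"] <= max_gap:
            events[idx]["end"] = ts
            events[idx]["count"] += 1
        else:
            latest[key] = len(events)
            events.append({"key": key, "start": ts, "end": ts, "count": 1})
    return events


records = [
    (0.0, "a", "v", "icmp"), (0.1, "a", "v", "icmp"), (0.2, "a", "v", "icmp"),
    (0.15, "b", "v", "http"), (5.0, "a", "v", "icmp"),
]
for event in aggregate_events(records):
    print(event)
```

After such a step, a flood of near-identical connections becomes a single row with a count, so a clustering algorithm no longer sees each flooding host as its own dense cluster.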

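Also note that this is an unrealistic scenario.

There are well-known patterns involved in the attacks, in particular port scans. These are best detected with a specialized port scan detector, not with learning.
The simulated data contains a lot of "attacks" that are completely uninteresting in practice. For example, the Smurf attack from the 90s makes up >50% of the dataset, and SYN floods are another 20%, while normal traffic is <20%!
For attacks of this kind, there are well-known signatures.
Many modern attacks (for example SQL injection) flow along with ordinary HTTP traffic and will not show up as anomalies in the raw traffic pattern.

Just don't use this data for classification or outlier detection. Just don't.

Quoting the KDNuggets link above: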

As a result, we strongly recommend that

(1) all researchers stop using the KDD Cup '99 dataset,

(2) The KDD Cup and UCI websites include a warning on the KDD Cup '99 dataset webpage informing researchers that there are known problems with the dataset, and

(3) peer reviewers for conferences and journals ding papers (or even outright reject them, as is common in the network security community) with results drawn solely from the KDD Cup '99 dataset.

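This is neither real nor realistic data. Go get something else.

Regarding matlab - Irregular plot from k-means clustering, outlier removal, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/11373176/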