
machine-learning - Why does vector normalization improve the accuracy of clustering and classification?

Reposted · Author: 行者123 · Updated: 2023-11-30 08:22:29

Mahout in Action mentions that normalization can slightly improve accuracy. Can anyone explain why? Thanks!

Best Answer

Standardization is not always required, but it rarely hurts.

Some examples:

K-means:

K-means clustering is "isotropic" in all directions of space and therefore tends to produce more or less round (rather than elongated) clusters. In this situation leaving variances unequal is equivalent to putting more weight on variables with smaller variance.

An example in Matlab:

X = [randn(100,2)+ones(100,2); ...
     randn(100,2)-ones(100,2)];

% Uncomment to denormalize the second feature:
% X(:,2) = X(:,2) * 1000 + 500;

opts = statset('Display','final');

[idx,ctrs] = kmeans(X, 2, ...
    'Distance','city', ...
    'Replicates',5, ...
    'Options',opts);

plot(X(idx==1,1), X(idx==1,2), 'r.', 'MarkerSize',12)
hold on
plot(X(idx==2,1), X(idx==2,2), 'b.', 'MarkerSize',12)
plot(ctrs(:,1), ctrs(:,2), 'kx', 'MarkerSize',12, 'LineWidth',2)
plot(ctrs(:,1), ctrs(:,2), 'ko', 'MarkerSize',12, 'LineWidth',2)
legend('Cluster 1','Cluster 2','Centroids', 'Location','NW')
title('K-means with normalization')

[Figures: k-means cluster assignments and centroids for the normalized and the denormalized data]

(FYI: How can I detect if my dataset is clustered or unclustered (i.e. forming one single cluster)?)
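The effect the quote describes can also be sketched in plain Python. The following is a toy re-implementation of Lloyd's k-means with hand-picked points and fixed initial centroids (not the book's data): with features on comparable scales the clusters follow x, but after multiplying y by 1000 the distance is dominated by y and the clustering changes completely.

```python
import math

def kmeans(points, centroids, iters=10):
    """Minimal Lloyd's algorithm: assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else ctr
            for pts, ctr in zip(clusters, centroids)
        ]
    labels = [min(range(len(centroids)),
                  key=lambda i: math.dist(p, centroids[i]))
              for p in points]
    return centroids, labels

# Two clusters separated along x; y is small "noise".
points = [(0.0, 0.1), (0.2, 0.9), (10.0, 0.2), (10.2, 0.8)]
_, labels = kmeans(points, [(0.0, 0.0), (10.0, 1.0)])
print(labels)    # clusters follow x: [0, 0, 1, 1]

# Rescale y by 1000: distances are now dominated by y,
# and k-means groups the same points by y instead of x.
scaled = [(x, y * 1000) for x, y in points]
_, labels2 = kmeans(scaled, [(0.0, 100.0), (10.0, 900.0)])
print(labels2)   # clusters follow y: [0, 1, 0, 1]
```

Standardizing each feature to unit variance before clustering would remove this sensitivity to the choice of scale.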

Distributed clustering:

The comparative analysis shows that the distributed clustering results depend on the type of normalization procedure.

Artificial neural network (inputs):

If the input variables are combined linearly, as in an MLP, then it is rarely strictly necessary to standardize the inputs, at least in theory. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with the exact same outputs as you had before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs.
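The "rescaling can be undone by the weights" argument above can be illustrated with a single linear unit (the numbers here are hypothetical): multiplying an input by a constant and dividing the matching weight by the same constant leaves the output unchanged.

```python
# A single linear unit: y = w1*x1 + w2*x2 + b.
def unit(x1, x2, w1, w2, b):
    return w1 * x1 + w2 * x2 + b

x1, x2 = 3.0, 0.002            # raw inputs on very different scales
w1, w2, b = 0.5, 400.0, -1.0

y = unit(x1, x2, w1, w2, b)

# Rescale x2 into a nicer range (multiply by 1000) and undo it
# inside the network by dividing the matching weight by 1000:
y_rescaled = unit(x1, x2 * 1000, w1, w2 / 1000, b)

print(y, y_rescaled)           # identical outputs
```

So standardization changes nothing about what an MLP can represent; its benefits are purely numerical, as the quote says: better-conditioned optimization and sensible default initializations.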

Artificial neural network (inputs/outputs):

Should you do any of these things to your data? The answer is, it depends.

Standardizing either input or target variables tends to make the training process better behaved by improving the numerical condition (see ftp://ftp.sas.com/pub/neural/illcond/illcond.html) of the optimization problem and ensuring that various default values involved in initialization and termination are appropriate. Standardizing targets can also affect the objective function.

Standardization of cases should be approached with caution because it discards information. If that information is irrelevant, then standardizing cases can be quite helpful. If that information is important, then standardizing cases can be disastrous.


Interestingly, changing the measurement units may even lead one to see a very different clustering structure: Kaufman, Leonard, and Peter J. Rousseeuw. "Finding Groups in Data: An Introduction to Cluster Analysis." (2005).

In some applications, changing the measurement units may even lead one to see a very different clustering structure. For example, the age (in years) and height (in centimeters) of four imaginary people are given in Table 3 and plotted in Figure 3. It appears that {A, B} and {C, D} are two well-separated clusters. On the other hand, when height is expressed in feet one obtains Table 4 and Figure 4, where the obvious clusters are now {A, C} and {B, D}. This partition is completely different from the first because each subject has received another companion. (Figure 4 would have been flattened even more if age had been measured in days.)

To avoid this dependence on the choice of measurement units, one has the option of standardizing the data. This converts the original measurements to unitless variables.
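A minimal Python sketch of that idea, with made-up heights: z-scores are invariant under a change of units, because dividing every measurement by a positive constant divides the mean and the standard deviation by the same constant.

```python
import math
from statistics import mean, stdev

def zscore(values):
    """Convert measurements to unitless z-scores: (x - mean) / stdev."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Heights of four hypothetical subjects, once in centimeters
# and once in feet (same people, different units).
height_cm = [160.0, 165.0, 185.0, 190.0]
height_ft = [v / 30.48 for v in height_cm]

# Raw values depend on the unit, but the z-scores do not:
z_cm, z_ft = zscore(height_cm), zscore(height_ft)
print(all(math.isclose(a, b) for a, b in zip(z_cm, z_ft)))  # True
```

Any clustering method that works on the standardized values therefore gives the same result whether height was recorded in centimeters or in feet.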

[Tables 3–4 and Figures 3–4: the same four subjects plotted with height in centimeters vs. in feet]

Kaufman et al. continue with some interesting considerations (p. 11):

From a philosophical point of view, standardization does not really solve the problem. Indeed, the choice of measurement units gives rise to relative weights of the variables. Expressing a variable in smaller units will lead to a larger range for that variable, which will then have a large effect on the resulting structure. On the other hand, by standardizing one attempts to give all variables an equal weight, in the hope of achieving objectivity. As such, it may be used by a practitioner who possesses no prior knowledge. However, it may well be that some variables are intrinsically more important than others in a particular application, and then the assignment of weights should be based on subject-matter knowledge (see, e.g., Abrahamowicz, 1985). On the other hand, there have been attempts to devise clustering techniques that are independent of the scale of the variables (Friedman and Rubin, 1967). The proposal of Hardy and Rasson (1982) is to search for a partition that minimizes the total volume of the convex hulls of the clusters. In principle such a method is invariant with respect to linear transformations of the data, but unfortunately no algorithm exists for its implementation (except for an approximation that is restricted to two dimensions). Therefore, the dilemma of standardization appears unavoidable at present and the programs described in this book leave the choice up to the user.

Regarding "machine-learning - Why does vector normalization improve the accuracy of clustering and classification?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/15777201/
