
algorithm - EM algorithm implementation for Gaussian Mixture Models

Repost. Author: 塔克拉玛干. Updated: 2023-11-03 03:42:25

Using the EM algorithm, I want to train a Gaussian Mixture Model with four components on a given dataset. The set is three-dimensional and contains 300 samples.

The problem is that after around 6 rounds of the EM algorithm, the covariance matrices sigma become close to singular according to MATLAB (rank(sigma) = 2 instead of 3). This in turn leads to undesired results such as complex values when evaluating the Gaussian distribution gm(k,i).

Furthermore, I used the log of the Gaussian to account for underflow troubles - see the E-step. I am not sure if this is correct and whether I have to take the exp of the responsibilities p(w_k | x^(i), theta) somewhere else.

Can you tell me if my implementation of the EM algorithm is correct so far?
And how can I account for the problem with the close-to-singular covariance sigma?

Here is my implementation of the EM algorithm:

First, I initialized the means and the covariances of the components using kmeans:

load('data1.mat');

X = Data'; % 300x3 data set
D = size(X,2); % dimension
N = size(X,1); % number of samples
K = 4; % number of Gaussian Mixture components

% Initialization
p = [0.2, 0.3, 0.2, 0.3]; % arbitrary pi
[idx,mu] = kmeans(X,K); % initial means of the components

% compute the covariance of the components
sigma = zeros(D,D,K);
for k = 1:K
    sigma(:,:,k) = cov(X(idx==k,:));
end

For the E-step, I compute the responsibilities using the following formula:

res(k,i) = p(k) * N(x^(i) | mu_k, sigma_k) / sum_j p(j) * N(x^(i) | mu_j, sigma_j)

where w_k is the k-th Gaussian component, x^(i) is a single data point (sample), and theta stands for the parameters of the Gaussian mixture model: mu, sigma, pi.

Here is the corresponding code:
% variables for convergence 
converged = 0;
prevLoglikelihood = Inf;
prevMu = mu;
prevSigma = sigma;
prevPi = p;
round = 0;
while (converged ~= 1)
    round = round + 1
    gm = zeros(K,N);     % gaussian component in the numerator
    sumGM = zeros(N,1);  % denominator of responsibilities
    % E-step: Evaluate the responsibilities using the current parameters
    % compute the numerator and denominator of the responsibilities
    for k = 1:K
        for i = 1:N
            Xmu = X(i,:) - mu(k,:);
            % I am using log to prevent underflow of the gaussian distribution (exp("small value"))
            logPdf = log(1/sqrt(det(sigma(:,:,k))*(2*pi)^D)) + (-0.5*Xmu*(sigma(:,:,k)\Xmu'));
            gm(k,i) = log(p(k)) + logPdf;
            sumGM(i) = sumGM(i) + gm(k,i);
        end
    end

% calculate responsibilities
res = zeros(K,N); % responsibilities
Nk = zeros(K,1);
for k = 1:K
    for i = 1:N
        % I tried to use exp(gm(k,i))/sumGM(i) to compute res but this leads to sum(pi) > 1.
        res(k,i) = gm(k,i)/sumGM(i);
    end
    Nk(k) = sum(res(k,:));
end
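For what it's worth, the exp belongs exactly at the normalization step: in the log domain you add log(p(k)) to the log-pdf, shift each sample's column by its maximum, and only then exponentiate (the log-sum-exp trick). A minimal NumPy sketch of that stabilized E-step; the function and variable names are mine, not from the post:

```python
import numpy as np

def responsibilities(logpdf, logpi):
    """Stable E-step. logpdf is (K, N) with log N(x_i | mu_k, sigma_k); logpi is (K,)."""
    log_gm = logpi[:, None] + logpdf          # log numerator: ADD the log prior
    m = log_gm.max(axis=0)                    # per-sample shift (log-sum-exp trick)
    log_denom = m + np.log(np.exp(log_gm - m).sum(axis=0))
    return np.exp(log_gm - log_denom)         # exp only here; columns sum to 1
```

Even when every raw density would underflow to zero (exponents near -1000), the shifted version keeps the ratios intact.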

Nk(k) is computed using the formula given in the M-step, where it is used to compute the new probabilities p(k).

M-step: re-estimate the parameters using the current responsibilities:
    % M-step: Re-estimate the parameters using the current responsibilities
    for k = 1:K
        for i = 1:N
            mu(k,:) = mu(k,:) + res(k,i).*X(k,:);
            sigma(:,:,k) = sigma(:,:,k) + res(k,i).*(X(k,:)-mu(k,:))*(X(k,:)-mu(k,:))';
        end
        mu(k,:) = mu(k,:)./Nk(k);
        sigma(:,:,k) = sigma(:,:,k)./Nk(k);
        p(k) = Nk(k)/N;
    end
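For comparison, the textbook M-step resets the accumulators each round, indexes the data by sample i rather than by component k, and forms sigma around the freshly updated mean. A hedged NumPy sketch (names are mine, not from the post):

```python
import numpy as np

def m_step(X, res):
    """Textbook GMM M-step. X is (N, D) data, res is (K, N) responsibilities."""
    N, D = X.shape
    K = res.shape[0]
    Nk = res.sum(axis=1)                  # effective sample count per component
    mu = (res @ X) / Nk[:, None]          # fresh means (accumulators start at zero)
    sigma = np.zeros((K, D, D))
    for k in range(K):
        Xc = X - mu[k]                    # center on the NEW mean mu_k
        sigma[k] = (res[k][:, None] * Xc).T @ Xc / Nk[k]
    pi = Nk / N                           # new mixing weights
    return mu, sigma, pi
```

With hard (0/1) responsibilities this reduces exactly to per-cluster sample means and covariances, which is a quick sanity check.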

Now, to check for convergence, the log-likelihood is computed using the following formula:

loglikelihood = sum_{i=1}^{N} log( sum_{k=1}^{K} p(k) * N(x^(i) | mu_k, sigma_k) )
    % Evaluate the log-likelihood and check for convergence of either 
% the parameters or the log-likelihood. If not converged, go to E-step.
loglikelihood = 0;
for i = 1:N
loglikelihood = loglikelihood + log(sum(gm(:,i)));
end


% Check for convergence of parameters
errorLoglikelihood = abs(loglikelihood-prevLoglikelihood);
if (errorLoglikelihood <= eps)
converged = 1;
end

errorMu = abs(mu(:)-prevMu(:));
errorSigma = abs(sigma(:)-prevSigma(:));
errorPi = abs(p(:)-prevPi(:));

if (all(errorMu <= eps) && all(errorSigma <= eps) && all(errorPi <= eps))
converged = 1;
end
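One remark on the stopping rule: eps in MATLAB is machine epsilon (about 2.2e-16), a threshold the log-likelihood change will essentially never reach; a relative tolerance is the more common test. A sketch in NumPy, assuming the per-sample component scores are kept in the log domain (the names and the tolerance value are my choice):

```python
import numpy as np

def loglik_and_converged(log_gm, prev_ll, rtol=1e-6):
    """log_gm: (K, N) array of log(p_k) + log N(x_i | mu_k, sigma_k).

    Returns the log-sum-exp log-likelihood and whether the relative change
    against the previous value is below rtol.
    """
    m = log_gm.max(axis=0)                                    # per-sample shift
    ll = float(np.sum(m + np.log(np.exp(log_gm - m).sum(axis=0))))
    converged = np.isfinite(prev_ll) and abs(ll - prev_ll) <= rtol * abs(prev_ll)
    return ll, converged
```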

prevLoglikelihood = loglikelihood;
prevMu = mu;
prevSigma = sigma;
prevPi = p;

end % while

Is there anything wrong with my implementation of the EM algorithm for Gaussian Mixture Models in MATLAB?
Previous question:

The problem was that I could not check for convergence using the log-likelihood, because it was -Inf. This resulted from rounded-to-zero values of the Gaussian when evaluating the responsibility formula (see the E-step).

Could you tell me if my implementation of the EM algorithm was correct so far?
And how could I account for the problem with the rounded-to-zero values?

Here was my implementation of the EM algorithm:

First, I initialized the means and the covariances of the components using kmeans:
load('data1.mat');

X = Data'; % 300x3 data set
D = size(X,2); % dimension
N = size(X,1); % number of samples
K = 4; % number of Gaussian Mixture components

% Initialization
p = [0.2, 0.3, 0.2, 0.3]; % arbitrary pi
[idx,mu] = kmeans(X,K); % initial means of the components

% compute the covariance of the components
sigma = zeros(D,D,K);
for k = 1:K
    sigma(:,:,k) = cov(X(idx==k,:));
end

For the E-step, I computed the responsibilities using the responsibility formula shown above.

Here is the corresponding code:
% variables for convergence 
converged = 0;
prevLoglikelihood = Inf;
prevMu = mu;
prevSigma = sigma;
prevPi = p;
round = 0;
while (converged ~= 1)
    round = round + 1
    gm = zeros(K,N);     % gaussian component in the numerator -
                         % some values evaluate to zero
    sumGM = zeros(N,1);  % denominator of responsibilities
    % E-step: Evaluate the responsibilities using the current parameters
    % compute the numerator and denominator of the responsibilities
    for k = 1:K
        for i = 1:N
            % HERE values evaluate to zero, e.g. exp(-746.6228) = 0
            gm(k,i) = p(k)/sqrt(det(sigma(:,:,k))*(2*pi)^D)*exp(-0.5*(X(i,:)-mu(k,:))*inv(sigma(:,:,k))*(X(i,:)-mu(k,:))');
            sumGM(i) = sumGM(i) + gm(k,i);
        end
    end

% calculate responsibilities
res = zeros(K,N); % responsibilities
Nk = zeros(K,1);
for k = 1:K
    for i = 1:N
        res(k,i) = gm(k,i)/sumGM(i);
    end
    Nk(k) = sum(res(k,:));
end

Nk(k) is computed using the formula given in the M-step.

M-step: re-estimate the parameters using the current responsibilities:
    % M-step: Re-estimate the parameters using the current responsibilities
    mu = zeros(K,3);
    for k = 1:K
        for i = 1:N
            mu(k,:) = mu(k,:) + res(k,i).*X(k,:);
            sigma(:,:,k) = sigma(:,:,k) + res(k,i).*(X(k,:)-mu(k,:))*(X(k,:)-mu(k,:))';
        end
        mu(k,:) = mu(k,:)./Nk(k);
        sigma(:,:,k) = sigma(:,:,k)./Nk(k);
        p(k) = Nk(k)/N;
    end

Now, to check for convergence, the log-likelihood is computed using the following formula:

loglikelihood = sum_{i=1}^{N} log( sum_{k=1}^{K} p(k) * N(x^(i) | mu_k, sigma_k) )
    % Evaluate the log-likelihood and check for convergence of either 
% the parameters or the log-likelihood. If not converged, go to E-step.
loglikelihood = 0;
for i = 1:N
loglikelihood = loglikelihood + log(sum(gm(:,i)));
end


% Check for convergence of parameters
errorLoglikelihood = abs(loglikelihood-prevLoglikelihood);
if (errorLoglikelihood <= eps)
converged = 1;
end

errorMu = abs(mu(:)-prevMu(:));
errorSigma = abs(sigma(:)-prevSigma(:));
errorPi = abs(p(:)-prevPi(:));

if (all(errorMu <= eps) && all(errorSigma <= eps) && all(errorPi <= eps))
converged = 1;
end

prevLoglikelihood = loglikelihood;
prevMu = mu;
prevSigma = sigma;
prevPi = p;

end % while

After the first round the loglikelihood is around 700.
In the second round it is -Inf, because some gm(k,i) values in the E-step are zero, and the log of zero is obviously negative infinity.
The zero values also lead to sumGM being equal to zero, and therefore to all-NaN entries in the mu and sigma matrices.

How can I solve this problem?
Can you tell me if there is something wrong with my implementation?
Can it be solved by somehow raising MATLAB's precision?
EDIT:

I added a scaling for the exp() term in gm(k,i). Unfortunately this does not help much: after a few more rounds I still run into the underflow problem.
scale = zeros(N,D);
for i = 1:N
    max = 0;
    for k = 1:K
        Xmu = X(i,:)-mu(k,:);
        if (norm(scale(i,:) - Xmu) > max)
            max = norm(scale(i,:) - Xmu);
            scale(i,:) = Xmu;
        end
    end
end


for k = 1:K
    for i = 1:N
        Xmu = X(i,:)-mu(k,:);
        % scale gm to prevent underflow
        Xmu = Xmu - scale(i,:);
        gm(k,i) = p(k)/sqrt(det(sigma(:,:,k))*(2*pi)^D)*exp(-0.5*Xmu*inv(sigma(:,:,k))*Xmu');
        sumGM(i) = sumGM(i) + gm(k,i);
    end
end

Furthermore, I noticed that the means initialized by kmeans are completely different from those of the following rounds, where the means are computed in the M-step.

kmeans:
mu =  13.500000000000000   0.026602138870044   0.062415945993735
      88.500000000000000  -0.009869960132085  -0.075177888210981
      39.000000000000000  -0.042569305020309   0.043402772876513
      64.000000000000000  -0.024519281362918  -0.012586980924762

After the M-step:
round = 2

mu =  1.000000000000000   0.077230046948357   0.024498886414254
      2.000000000000000   0.074260118474053   0.026484346404660
      3.000000000000002   0.070944016105476   0.029043085983168
      4.000000000000000   0.067613431480832   0.031641849205021

In the following rounds mu does not change at all; it stays the same as in round 2.
I guess this is caused by the underflow of gm(k,i)?
Either my implementation of the scaling is incorrect, or the entire implementation of the algorithm is wrong somewhere :(
EDIT 2:

After four rounds I got NaN values and looked at gm in more detail. Considering only one sample (and without the 0.5 factor), gm becomes zero in all components. Entered in MATLAB: gm(:,1) = [0 0 0 0]. This in turn leads to sumGM being equal to zero and hence to NaN, because I divided by 0. I have recorded this below:
round = 1

mu = 62.0000 -0.0298 -0.0078
37.0000 -0.0396 0.0481
87.5000 -0.0083 -0.0728
12.5000 0.0303 0.0614

gm(:,1) = [11.7488, 0.0000, 0.0000, 0.0000]


round = 2

mu = 1.0000 0.0772 0.0245
2.0000 0.0743 0.0265
3.0000 0.0709 0.0290
4.0000 0.0676 0.0316


gm(:,1) = [0.0000, 0.0000, 0.0000, 0.3128]

round = 3

mu = 1.0000 0.0772 0.0245
2.0000 0.0743 0.0265
3.0000 0.0709 0.0290
4.0000 0.0676 0.0316


gm(:,1) = [0, 0, 0.0000, 0.2867]


round = 4


mu = 1.0000 0.0772 0.0245
NaN NaN NaN
3.0000 0.0709 0.0290
4.0000 0.0676 0.0316

gm(:,1) = 1.0e-105 * [0, NaN, 0, 0.5375]

First of all, the means do not seem to change and are completely different from the kmeans initialization.
And according to the output of gm(:,1), every sample (not just the first one shown here) corresponds to only a single Gaussian component. Shouldn't each sample be "partially distributed" among all Gaussian components?
EDIT 3:

So I guess the problem with mu not changing was the first line in the M-step: mu = zeros(K,3);
To tackle the underflow problem, I am currently trying to use the log of the Gaussian:
function logPdf = logmvnpdf(X, mu, sigma, D)
    Xmu = X - mu;
    logPdf = log(1/sqrt(det(sigma)*(2*pi)^D)) + (-0.5*Xmu*inv(sigma)*Xmu');
end
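To avoid both the det()/inv() round-off and the near-singular warning, one common variant (not from the post) factors sigma with a Cholesky decomposition and adds a small ridge to the diagonal; the helper name mirrors the post's function, and the regularization value below is an arbitrary choice:

```python
import numpy as np

def logmvnpdf(x, mu, sigma, reg=1e-6):
    """Log-density of N(mu, sigma) at x, via Cholesky instead of det()/inv().

    A small ridge `reg` on the diagonal keeps a near-singular sigma factorizable
    (and real-valued), at the cost of a slightly biased density.
    """
    D = len(x)
    L = np.linalg.cholesky(sigma + reg * np.eye(D))   # sigma = L @ L.T
    z = np.linalg.solve(L, x - mu)                    # (x-mu)' inv(sigma) (x-mu) = z'z
    logdet = 2.0 * np.sum(np.log(np.diag(L)))         # log det(sigma) from diag(L)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + z @ z)
```

Working through the Cholesky factor never forms det(sigma) explicitly, so it cannot go negative by round-off and produce complex values the way sqrt(det(...)) can.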

The new problem is the covariance matrix sigma. MATLAB claims:

Warning: Matrix is close to singular or badly scaled. Results may be inaccurate.

After 6 rounds I get imaginary values for gm (the Gaussian distribution).
The updated E-step now looks like this:
gm = zeros(K,N);    % gaussian component in the numerator
sumGM = zeros(N,1); % denominator of responsibilities

for k = 1:K
    for i = 1:N
        %gm(k,i) = p(k)/sqrt(det(sigma(:,:,k))*(2*pi)^D)*exp(-0.5*Xmu*inv(sigma(:,:,k))*Xmu');
        %gm(k,i) = p(k)*mvnpdf(X(i,:),mu(k,:),sigma(:,:,k));
        gm(k,i) = log(p(k)) + logmvnpdf(X(i,:), mu(k,:), sigma(:,:,k), D);
        sumGM(i) = sumGM(i) + gm(k,i);
    end
end

Best Answer

It looks like you should be able to use a scale factor scale(i) to bring gm(k,i) into a representable range, because if you multiply gm(k,i) by scale(i), this factor also multiplies sumGM(i) and cancels when you compute res(k,i) = gm(k,i)/sumGM(i).

In theory I would set scale(i) = 1/max_k(exp(-0.5*(X(i,:)-mu(k,:))*inv(sigma(:,:,k))*(X(i,:)-mu(k,:))')), and in practice compute it without doing the exponentiation: work with its log, max_k(-0.5*(X(i,:)-mu(k,:))*inv(sigma(:,:,k))*(X(i,:)-mu(k,:))'), which gives you a common term to add to each exponent before calling exp(). This keeps at least the maximum within a representable range. Anything that still underflows to zero after this correction you do not care about anyway, because it is vanishingly small compared to the other contributions.
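A small NumPy demonstration of the suggested cancellation, using exponents of the magnitude quoted in the question (the specific numbers are illustrative, not from the data):

```python
import numpy as np

# Quadratic-form exponents of the size that underflows in the post:
# np.exp(-746.6) is exactly 0.0 in double precision.
expo = np.array([-746.6, -750.2, -901.4, -833.0])   # -0.5*(x-mu_k)' inv(S_k) (x-mu_k), per component k

naive = np.exp(expo)                  # all zeros -> sumGM == 0 -> NaN responsibilities
shifted = np.exp(expo - expo.max())   # divide everything by exp(max_k ...): the scale(i) factor
res = shifted / shifted.sum()         # the common factor cancels in the ratio
```

The naive version loses everything, while the shifted version recovers a valid responsibility vector dominated by the nearest component, exactly as the answer describes.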

Regarding algorithm - EM algorithm implementation for Gaussian Mixture Models, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31774147/
