
apache-spark - Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?


I am reducing the dimensionality of a Spark DataFrame with a PCA model using pyspark (the spark ml library) as follows:

pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)
where data is a Spark DataFrame with one column labeled features, each entry of which is a DenseVector of 3 dimensions:
data.take(1)
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1')
After fitting, I transform the data:
transformed = model.transform(data)
transformed.first()
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0.625]))
How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

Best Answer

[UPDATE: As of Spark 2.2, PCA and SVD are both available in PySpark; see JIRA ticket SPARK-6227 and PCA & PCAModel for Spark ML 2.2. The original answer below is still applicable for older Spark versions.]
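For those newer versions, here is a minimal sketch of the direct route (assuming Spark >= 2.x and an existing SparkSession named spark; the PCAModel attributes pc and explainedVariance give exactly what the question asks for):

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

# Assumes an existing SparkSession named `spark` (standard in Spark >= 2.x)
df = spark.createDataFrame([(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
                            (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
                            (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)], ["features"])
model = PCA(k=2, inputCol="features", outputCol="pca_features").fit(df)
print(model.pc)                 # DenseMatrix: the principal components (eigenvectors), one per column
print(model.explainedVariance)  # DenseVector: proportion of variance explained by each component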
Well, it seems incredible, but indeed there is no way to extract such information from a PCA decomposition (at least as of Spark 1.5). But again, there have been many similar "complaints"; see here, for example, for not being able to extract the best parameters from a CrossValidatorModel.
Fortunately, some months ago, I attended the 'Scalable Machine Learning' MOOC by AMPLab (Berkeley) & Databricks, i.e. the creators of Spark, where we implemented a full PCA pipeline "by hand" as part of the homework assignments. I have modified my functions from back then (rest assured, I got full credit :-), so as to work with dataframes as inputs (instead of RDD's), of the same format as yours (i.e. Rows of DenseVectors containing the numeric features).
We first need to define an intermediate function, estimateCovariance, as follows:

import numpy as np

def estimateCovariance(df):
    """Compute the covariance matrix for a given dataframe.

    Note:
        The multi-dimensional covariance array should be calculated using outer products.
        Don't forget to normalize the data by first subtracting the mean.

    Args:
        df: A Spark dataframe with a column named 'features', which (column) consists of DenseVectors.

    Returns:
        np.ndarray: A multi-dimensional array where the number of rows and columns both equal the
            length of the arrays in the input dataframe.
    """
    m = df.select(df['features']).map(lambda x: x[0]).mean()
    dfZeroMean = df.select(df['features']).map(lambda x: x[0]).map(lambda x: x - m)  # subtract the mean

    return dfZeroMean.map(lambda x: np.outer(x, x)).sum() / df.count()
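As a quick sanity check, the same covariance can be reproduced locally with plain NumPy; np.cov with bias=True uses the same 1/N normalization as the function above (a minimal sketch, with local_data as an illustrative stand-in for the collected features):

import numpy as np

# The same three rows as the test data further below; each row is one observation
local_data = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
                       [2.0, 0.0, 3.0, 4.0, 5.0],
                       [4.0, 0.0, 0.0, 6.0, 7.0]])
# rowvar=False: columns are the variables; bias=True: divide by N rather than N-1
print(np.cov(local_data, rowvar=False, bias=True))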
Then, we can write a main pca function as follows:
from numpy.linalg import eigh

def pca(df, k=2):
    """Computes the top `k` principal components, corresponding scores, and all eigenvalues.

    Note:
        All eigenvalues should be returned in sorted order (largest to smallest). `eigh` returns
        each eigenvector as a column. This function should also return eigenvectors as columns.

    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k (int): The number of principal components to return.

    Returns:
        tuple of (np.ndarray, RDD of np.ndarray, np.ndarray): A tuple of (eigenvectors, `RDD` of
            scores, eigenvalues). Eigenvectors is a multi-dimensional array where the number of
            rows equals the length of the arrays in the input `RDD` and the number of columns equals
            `k`. The `RDD` of scores has the same number of rows as `df` and consists of arrays
            of length `k`. Eigenvalues is an array of length d (the number of features).
    """
    cov = estimateCovariance(df)
    col = cov.shape[1]
    eigVals, eigVecs = eigh(cov)
    inds = np.argsort(eigVals)
    eigVecs = eigVecs.T[inds[-1:-(col+1):-1]]  # reorder eigenvectors, largest eigenvalue first
    components = eigVecs[0:k]
    eigVals = eigVals[inds[-1:-(col+1):-1]]  # sort eigenvalues, largest to smallest
    score = df.select(df['features']).map(lambda x: x[0]).map(lambda x: np.dot(x, components.T))
    # Return the `k` principal components, `k` scores, and all eigenvalues
    return components.T, score, eigVals
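The slice inds[-1:-(col+1):-1] may look cryptic; it simply reverses the ascending order produced by np.argsort. A tiny standalone illustration (vals is just a made-up array):

import numpy as np

vals = np.array([0.5, 3.2, 1.1])
inds = np.argsort(vals)              # indices in ascending order of value: [0, 2, 1]
col = len(vals)
print(inds[-1:-(col+1):-1])          # the same indices, reversed: [1, 2, 0]
print(vals[inds[-1:-(col+1):-1]])    # values sorted largest to smallest: [3.2, 1.1, 0.5]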
Testing
Let's first see the results with the existing method, using the example data from the Spark ML PCA documentation (modifying them so as to be all DenseVectors):
from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors

data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data, ["features"])
pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca_extracted.fit(df)
model.transform(df).collect()

[Row(features=DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]), pca_features=DenseVector([1.6486, -4.0133])),
Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), pca_features=DenseVector([-4.6451, -1.1168])),
Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), pca_features=DenseVector([-6.4289, -5.338]))]
Then, with our method:
comp, score, eigVals = pca(df)
score.collect()

[array([ 1.64857282, 4.0132827 ]),
array([-4.64510433, 1.11679727]),
array([-6.42888054, 5.33795143])]
Let me stress that we don't use any collect() methods in the functions we have defined above; score is an RDD, as it should be.
Notice that the signs of our second column are all opposite to the ones derived by the existing method; but this is not an issue: according to the (freely downloadable) An Introduction to Statistical Learning, co-authored by Hastie & Tibshirani, p. 382:

Each principal component loading vector is unique, up to a sign flip. This means that two different software packages will yield the same principal component loading vectors, although the signs of those loading vectors may differ. The signs may differ because each principal component loading vector specifies a direction in p-dimensional space: flipping the sign has no effect as the direction does not change. [...] Similarly, the score vectors are unique up to a sign flip, since the variance of Z is the same as the variance of −Z.
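A minimal numerical sketch of that point (x and v are made-up data, not from the book): projecting onto a loading vector v or onto -v gives scores with identical variance.

import numpy as np

np.random.seed(0)
x = np.random.randn(100, 3)           # 100 samples, 3 features
v = np.array([0.6, -0.8, 0.0])        # a hypothetical loading vector
print(np.allclose(np.var(x.dot(v)), np.var(x.dot(-v))))  # True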


Finally, now that we have the eigenvalues available, it is trivial to write a function for the percentage of variance explained:
def varianceExplained(df, k=1):
    """Calculate the fraction of variance explained by the top `k` eigenvectors.

    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k: The number of principal components to consider.

    Returns:
        float: A number between 0 and 1 representing the percentage of variance explained
            by the top `k` eigenvectors.
    """
    components, scores, eigenvalues = pca(df, k)
    return sum(eigenvalues[0:k]) / sum(eigenvalues)


varianceExplained(df,1)
# 0.79439325322305299
As a test, we also check whether the variance explained in our example data is 1.0 for k=5 (since the original data are 5-dimensional):
varianceExplained(df,5)
# 1.0
[Developed & tested with Spark 1.5.0 & 1.5.1]

Original question on Stack Overflow: https://stackoverflow.com/questions/33428589/
