
apache-spark - Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?


I am reducing the dimensionality of a Spark DataFrame with a PCA model using pyspark (the spark ml library) as follows:

pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)
where data is a Spark DataFrame with one column labeled features, each entry of which is a DenseVector of 3 dimensions:
data.take(1)
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1')
After fitting, I transform the data:
transformed = model.transform(data)
transformed.first()
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0.625]))
How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

Best Answer

[UPDATE: As of Spark 2.2, PCA and SVD are both available in PySpark; see JIRA ticket SPARK-6227 and PCA & PCAModel for Spark ML 2.2. The original answer below is still applicable for older Spark versions.]
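For those newer versions, here is a minimal sketch of the direct route (assuming Spark >= 2.x and an existing SparkSession named spark; the PCAModel attributes pc and explainedVariance give exactly what the question asks for):

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

# Assumes an existing SparkSession named `spark` (standard in Spark >= 2.x)
df = spark.createDataFrame([(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
                            (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
                            (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)], ["features"])
model = PCA(k=2, inputCol="features", outputCol="pca_features").fit(df)
print(model.pc)                 # DenseMatrix: the principal components (eigenvectors), one per column
print(model.explainedVariance)  # DenseVector: proportion of variance explained by each component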
Well, it seems incredible, but indeed there is no way to extract such information from a PCA decomposition (at least as of Spark 1.5). But again, there have been many similar "complaints"; see here, for example, for not being able to extract the best parameters from a CrossValidatorModel.
Fortunately, some months ago, I attended the 'Scalable Machine Learning' MOOC by AMPLab (Berkeley) & Databricks, i.e. the creators of Spark, where we implemented a full PCA pipeline "by hand" as part of the homework assignments. I have modified my functions from back then (rest assured, I got full credit :-), so as to work with dataframes as inputs (instead of RDD's), of the same format as yours (i.e. Rows of DenseVectors containing the numeric features).
We first need to define an intermediate function, estimateCovariance, as follows:

import numpy as np

def estimateCovariance(df):
    """Compute the covariance matrix for a given dataframe.

    Note:
        The multi-dimensional covariance array should be calculated using outer products.
        Don't forget to normalize the data by first subtracting the mean.

    Args:
        df: A Spark dataframe with a column named 'features', which (column) consists of DenseVectors.

    Returns:
        np.ndarray: A multi-dimensional array where the number of rows and columns both equal the
            length of the arrays in the input dataframe.
    """
    m = df.select(df['features']).map(lambda x: x[0]).mean()
    dfZeroMean = df.select(df['features']).map(lambda x: x[0]).map(lambda x: x - m)  # subtract the mean

    return dfZeroMean.map(lambda x: np.outer(x, x)).sum() / df.count()
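As a quick sanity check, the same covariance can be reproduced locally with plain NumPy; np.cov with bias=True uses the same 1/N normalization as the function above (a minimal sketch, with local_data as an illustrative stand-in for the collected features):

import numpy as np

# The same three rows as the test data further below; each row is one observation
local_data = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
                       [2.0, 0.0, 3.0, 4.0, 5.0],
                       [4.0, 0.0, 0.0, 6.0, 7.0]])
# rowvar=False: columns are the variables; bias=True: divide by N rather than N-1
print(np.cov(local_data, rowvar=False, bias=True))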
Then, we can write a main pca function as follows:
from numpy.linalg import eigh

def pca(df, k=2):
    """Computes the top `k` principal components, corresponding scores, and all eigenvalues.

    Note:
        All eigenvalues should be returned in sorted order (largest to smallest). `eigh` returns
        each eigenvector as a column. This function should also return eigenvectors as columns.

    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k (int): The number of principal components to return.

    Returns:
        tuple of (np.ndarray, RDD of np.ndarray, np.ndarray): A tuple of (eigenvectors, `RDD` of
            scores, eigenvalues). Eigenvectors is a multi-dimensional array where the number of
            rows equals the length of the arrays in the input `RDD` and the number of columns equals
            `k`. The `RDD` of scores has the same number of rows as `df` and consists of arrays
            of length `k`. Eigenvalues is an array of length d (the number of features).
    """
    cov = estimateCovariance(df)
    col = cov.shape[1]
    eigVals, eigVecs = eigh(cov)
    inds = np.argsort(eigVals)
    eigVecs = eigVecs.T[inds[-1:-(col+1):-1]]  # reorder eigenvectors, largest eigenvalue first
    components = eigVecs[0:k]
    eigVals = eigVals[inds[-1:-(col+1):-1]]  # sort eigenvalues, largest to smallest
    score = df.select(df['features']).map(lambda x: x[0]).map(lambda x: np.dot(x, components.T))
    # Return the `k` principal components, `k` scores, and all eigenvalues
    return components.T, score, eigVals
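The slice inds[-1:-(col+1):-1] may look cryptic; it simply reverses the ascending order produced by np.argsort. A tiny standalone illustration (vals is just a made-up array):

import numpy as np

vals = np.array([0.5, 3.2, 1.1])
inds = np.argsort(vals)              # indices in ascending order of value: [0, 2, 1]
col = len(vals)
print(inds[-1:-(col+1):-1])          # the same indices, reversed: [1, 2, 0]
print(vals[inds[-1:-(col+1):-1]])    # values sorted largest to smallest: [3.2, 1.1, 0.5]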
Testing
Let's first see the results with the existing method, using the example data from the Spark ML PCA documentation (modifying them so as to be all DenseVectors):
from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors

data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data, ["features"])
pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca_extracted.fit(df)
model.transform(df).collect()

[Row(features=DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]), pca_features=DenseVector([1.6486, -4.0133])),
Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), pca_features=DenseVector([-4.6451, -1.1168])),
Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), pca_features=DenseVector([-6.4289, -5.338]))]
Then, with our method:
comp, score, eigVals = pca(df)
score.collect()

[array([ 1.64857282, 4.0132827 ]),
array([-4.64510433, 1.11679727]),
array([-6.42888054, 5.33795143])]
Let me stress that we don't use any collect() methods in the functions we have defined above; score is an RDD, as it should be.
Notice that the signs of our second column are all opposite to the ones derived by the existing method; but this is not an issue: according to the (freely downloadable) An Introduction to Statistical Learning, co-authored by Hastie & Tibshirani, p. 382:

Each principal component loading vector is unique, up to a sign flip. This means that two different software packages will yield the same principal component loading vectors, although the signs of those loading vectors may differ. The signs may differ because each principal component loading vector specifies a direction in p-dimensional space: flipping the sign has no effect as the direction does not change. [...] Similarly, the score vectors are unique up to a sign flip, since the variance of Z is the same as the variance of −Z.
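A minimal numerical sketch of that point (x and v are made-up data, not from the book): projecting onto a loading vector v or onto -v gives scores with identical variance.

import numpy as np

np.random.seed(0)
x = np.random.randn(100, 3)           # 100 samples, 3 features
v = np.array([0.6, -0.8, 0.0])        # a hypothetical loading vector
print(np.allclose(np.var(x.dot(v)), np.var(x.dot(-v))))  # True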


Finally, now that we have the eigenvalues available, it is trivial to write a function for the percentage of variance explained:
def varianceExplained(df, k=1):
    """Calculate the fraction of variance explained by the top `k` eigenvectors.

    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k: The number of principal components to consider.

    Returns:
        float: A number between 0 and 1 representing the percentage of variance explained
            by the top `k` eigenvectors.
    """
    components, scores, eigenvalues = pca(df, k)
    return sum(eigenvalues[0:k]) / sum(eigenvalues)


varianceExplained(df,1)
# 0.79439325322305299
As a test, we also check whether the variance explained in our example data is 1.0 for k=5 (since the original data are 5-dimensional):
varianceExplained(df,5)
# 1.0
[Developed & tested with Spark 1.5.0 & 1.5.1]

Original question on Stack Overflow: https://stackoverflow.com/questions/33428589/
