scala - Performance of BucketedRandomProjectionLSH (org.apache.spark.ml.feature.BucketedRandomProjectionLSH)


Reposted by: 行者123. Updated: 2023-12-03 10:38:59


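Hello, I am using the BucketedRandomProjectionLSH algorithm (bucket length 2, 3 hash tables) to find similar people in a dataset of about 300,000 records. For each record I build a sparse vector of bigrams (1296 dimensions per vector) and then run an approximate similarity self-join on the dataset. As mentioned, the dataset is not very large, yet the job takes around 7 hours to complete on a 3-node Spark cluster (master node: m3.xlarge; core nodes: 2 m4.4xlarge). That is too slow, so I am looking for any benchmarks someone may have produced for this algorithm. Any guidance on how to tune it would also be very helpful.

Here is the code snippet for reference: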

import com.mongodb.spark._
import com.mongodb.spark.config.ReadConfig
import org.bson.Document
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.sql.functions.col
// Load the collection, then unwind and concatenate the source fields into one lower-cased string per record.
val rdd=sc.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://localhost:27017/Single.master","readPreference.name" -> "secondaryPreferred")))
val aggregatedRdd = rdd.withPipeline(Seq(Document.parse("{$unwind:'$sources'}"),Document.parse("{$project:{_id:0,id:'$sources._id',val:{$toLower:{$concat:['$sources.first_name','$sources.middle_name','$sources.last_name',{$substr:['$sources.gender',0,1]},'$sources.dob','$sources.address.street','$sources.address.city','$sources.address.state','$sources.address.zip','$sources.phone','$sources.email']}}}}")))
// bigramMap is the asker's own helper that builds the bigram vector; its definition is not shown.
val fDF=aggregatedRdd.map(line=>line.values()).map(ll=>bigramMap(ll.toArray)).toDF("id","idx","keys")
val columnNames = Seq("idx","keys")
val result = fDF.select(columnNames.head, columnNames.tail: _*)
val brp = new BucketedRandomProjectionLSH().setBucketLength(2).setNumHashTables(3).setInputCol("keys").setOutputCol("values")
val model = brp.fit(result)
// Self-join with distance threshold 100; keep each unordered pair once.
val outDD=model.approxSimilarityJoin(result, result, 100).filter("datasetA.idx < datasetB.idx").select(col("datasetA.idx").alias("idA"),col("datasetB.idx").alias("idB"),col("distCol"))
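
For readers unfamiliar with the two parameters set above, here is a plain-Scala sketch of the hash that bucketed random projection computes for each hash table: floor((x · v) / bucketLength), where v is a random unit vector. The object and function names, the seed, and the sample vector are assumptions made for illustration; this is not Spark's code.

```scala
import scala.util.Random

object BrpHashSketch {
  // Draw a random unit vector: Gaussian entries, normalised to length 1.
  def randomUnitVector(dim: Int, rng: Random): Array[Double] = {
    val v = Array.fill(dim)(rng.nextGaussian())
    val norm = math.sqrt(v.map(x => x * x).sum)
    v.map(_ / norm)
  }

  // One hash value: floor((x . v) / bucketLength).
  def hash(x: Array[Double], v: Array[Double], bucketLength: Double): Double = {
    val dot = x.zip(v).map { case (a, b) => a * b }.sum
    math.floor(dot / bucketLength)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)            // hypothetical seed
    val dim = 8                         // toy dimension, not the 1296 from the question
    val numHashTables = 3
    val bucketLength = 2.0
    // One random projection direction per hash table.
    val directions = Array.fill(numHashTables)(randomUnitVector(dim, rng))
    // Hypothetical binary feature vector.
    val x = Array(1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0)
    val hashes = directions.map(v => hash(x, v, bucketLength))
    println(hashes.mkString("hash values: [", ", ", "]"))
  }
}
```

Because v has unit length, the projection x · v can never exceed the Euclidean norm of x in absolute value. If the input vectors have small norms relative to the bucket length, most records may therefore land in the same one or two buckets per table, which would leave the join with many candidate pairs to compare. Whether that applies here depends on the actual bigram vectors, which are not shown.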

Best answer

I tried BucketedRandomProjectionLSH on 10,000,000 records. It took 3 hours. The only thing I did was cache the DataFrame beforehand.

df.persist()
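
A minimal sketch of where that call might sit relative to the fit and the self-join. This is not the answerer's actual code: the toy data, the object name, and the threshold 1.5 are assumptions for illustration, while the column names and LSH settings are borrowed from the question.

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CachedLshJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CachedLshJoin")
      .master("local[*]")
      .getOrCreate()

    // Toy stand-in for the bigram vectors (hypothetical data).
    val df = spark.createDataFrame(Seq(
      (0L, Vectors.sparse(6, Seq((0, 1.0), (3, 1.0)))),
      (1L, Vectors.sparse(6, Seq((0, 1.0), (4, 1.0)))),
      (2L, Vectors.sparse(6, Seq((1, 1.0), (5, 1.0))))
    )).toDF("idx", "keys")

    // Mark the input for caching so the fit and both sides of the
    // self-join can reuse it instead of recomputing its lineage.
    df.persist()

    val brp = new BucketedRandomProjectionLSH()
      .setBucketLength(2.0)
      .setNumHashTables(3)
      .setInputCol("keys")
      .setOutputCol("values")
    val model = brp.fit(df)

    // Approximate self-join; keep each unordered pair once.
    val pairs = model.approxSimilarityJoin(df, df, 1.5, "distCol")
      .filter(col("datasetA.idx") < col("datasetB.idx"))
      .select(
        col("datasetA.idx").alias("idA"),
        col("datasetB.idx").alias("idB"),
        col("distCol"))

    pairs.show()

    df.unpersist()
    spark.stop()
  }
}
```

Note that persist() is lazy: the data is only materialised when the first action runs. In the question's pipeline the DataFrame is built from a MongoDB aggregation and a bigram mapping, so without caching that work may be repeated each time the DataFrame is scanned.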

Regarding scala - Performance of BucketedRandomProjectionLSH (org.apache.spark.ml.feature.BucketedRandomProjectionLSH), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43927844/
