
scala - Computing associations with FPGrowth in PySpark vs. Scala


Using:

http://spark.apache.org/docs/1.6.1/mllib-frequent-pattern-mining.html

Python code:

from pyspark.mllib.fpm import FPGrowth
model = FPGrowth.train(dataframe, 0.01, 10)
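
For reference, a model trained this way only exposes frequent itemsets from the Python side. A minimal sketch (assuming an existing SparkContext sc; the transactions shown are hypothetical):

from pyspark.mllib.fpm import FPGrowth

# Hypothetical baskets; FPGrowth.train expects an RDD of transactions (lists of items)
transactions = sc.parallelize([["a", "b", "c"], ["a", "b"], ["b", "c"]])

model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)

# Frequent itemsets are available...
for itemset in model.freqItemsets().collect():
    print(itemset.items, itemset.freq)

# ...but pyspark.mllib exposes no generateAssociationRules / minConfidence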

Scala:

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

val data = sc.textFile("data/mllib/sample_fpgrowth.txt")

val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))

val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}

val minConfidence = 0.8
model.generateAssociationRules(minConfidence).collect().foreach { rule =>
  println(
    rule.antecedent.mkString("[", ",", "]")
      + " => " + rule.consequent.mkString("[", ",", "]")
      + ", " + rule.confidence)
}

From the code here, it appears that the wrapper behind the Python API does not take a minimum confidence:

def trainFPGrowthModel(
    data: JavaRDD[java.lang.Iterable[Any]],
    minSupport: Double,
    numPartitions: Int): FPGrowthModel[Any] = {
  val fpg = new FPGrowth()
    .setMinSupport(minSupport)
    .setNumPartitions(numPartitions)

  val model = fpg.run(data.rdd.map(_.asScala.toArray))
  new FPGrowthModelWrapper(model)
}

How can I add minConfidence in PySpark to generate association rules? We can see that Scala has it in the example, but Python does not.

Best Answer

Spark >= 2.2

There is a DataFrame-based ml API which provides AssociationRules:

from pyspark.ml.fpm import FPGrowth

data = ...

fpm = FPGrowth(minSupport=0.3, minConfidence=0.9).fit(data)
associationRules = fpm.associationRules
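
For context, a minimal end-to-end sketch (assuming an existing SparkSession spark; the items column name is the default itemsCol, and the baskets are hypothetical):

from pyspark.ml.fpm import FPGrowth

# Hypothetical toy baskets, one array of items per row
data = spark.createDataFrame(
    [(["a", "b", "c"],), (["a", "b"],), (["b", "c"],)],
    ["items"]
)

fpm = FPGrowth(itemsCol="items", minSupport=0.3, minConfidence=0.9).fit(data)

fpm.freqItemsets.show()
fpm.associationRules.show()  # columns: antecedent, consequent, confidence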

Spark < 2.2

Currently PySpark doesn't support extracting association rules (a DataFrame-based FPGrowth API with Python support is a work in progress, SPARK-1450), but we can easily work around that.

First you'll have to install SBT (just go to the downloads page) and follow the instructions for your operating system.

Next you'll have to create a simple Scala project with only two files:

.
├── AssociationRulesExtractor.scala
└── build.sbt

You can adjust it later to follow the established directory structure.

Next, add the following to build.sbt (adjust the Scala version and Spark version to match the ones you use):

name := "fpm"

version := "1.0"

scalaVersion := "2.10.6"

val sparkVersion = "1.6.2"

libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-mllib" % sparkVersion
)

and the following AssociationRulesExtractor.scala, which flattens each Rule into a plain array of simple values that can be shipped back to Python:

package com.example.fpm

import org.apache.spark.mllib.fpm.AssociationRules.Rule
import org.apache.spark.rdd.RDD

object AssociationRulesExtractor {
  def apply(rdd: RDD[Rule[String]]) = {
    rdd.map(rule => Array(
      rule.confidence, rule.javaAntecedent, rule.javaConsequent
    ))
  }
}

Open a terminal emulator of your choice, go to the root of the project and call:

sbt package

It will generate a jar file in the target directory. For example, with Scala 2.10 it will be:

target/scala-2.10/fpm_2.10-1.0.jar

Start the PySpark shell or use spark-submit and pass the path to the generated jar file via --driver-class-path:

bin/pyspark --driver-class-path /path/to/fpm_2.10-1.0.jar

In non-local mode:

bin/pyspark --driver-class-path /path/to/fpm_2.10-1.0.jar --jars /path/to/fpm_2.10-1.0.jar

In cluster mode the jar should be present on all nodes.

Add some convenience wrappers:

from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowthModel
from pyspark.mllib.common import _java2py
from collections import namedtuple


rule = namedtuple("Rule", ["confidence", "antecedent", "consequent"])

def generateAssociationRules(model, minConfidence):
    # Get active context
    sc = SparkContext.getOrCreate()

    # Retrieve extractor object
    extractor = sc._gateway.jvm.com.example.fpm.AssociationRulesExtractor

    # Compute rules
    java_rules = model._java_model.generateAssociationRules(minConfidence)

    # Convert rules to Python RDD
    return _java2py(sc, extractor.apply(java_rules)).map(lambda x: rule(*x))

Finally you can use these helpers as a function:

generateAssociationRules(model, 0.9)

or as a method:

FPGrowthModel.generateAssociationRules = generateAssociationRules
model.generateAssociationRules(0.9)
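
Either way the result is an RDD of the Rule namedtuples defined above, for example (assuming model was trained with pyspark.mllib as in the question):

for r in generateAssociationRules(model, 0.9).collect():
    print(r.antecedent, "=>", r.consequent, r.confidence)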

This solution depends on internal PySpark methods, so it is not guaranteed to be portable between versions.

Regarding "scala - Computing associations with FPGrowth in PySpark vs. Scala", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42222456/
