
python - Creating a custom Transformer in PySpark ML


I'm new to Spark SQL DataFrames and ML on them (PySpark). How can I create a custom tokenizer that, for example, removes stop words and uses some libraries from nltk? Can I extend the default one?

Best answer

Can I extend the default one?

Not really. The default Tokenizer is a subclass of pyspark.ml.wrapper.JavaTransformer and, like the other transformers and estimators from pyspark.ml.feature, delegates the actual processing to its Scala counterpart. Since you want to use Python, you should extend pyspark.ml.pipeline.Transformer directly:

import nltk

from pyspark import keyword_only ## < 2.0 -> pyspark.ml.util.keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol, Param, Params, TypeConverters
# Available in PySpark >= 2.3.0
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

class NLTKWordPunctTokenizer(
        Transformer, HasInputCol, HasOutputCol,
        # Credits https://stackoverflow.com/a/52467470
        # by https://stackoverflow.com/users/234944/benjamin-manns
        DefaultParamsReadable, DefaultParamsWritable):

    stopwords = Param(Params._dummy(), "stopwords", "stopwords",
                      typeConverter=TypeConverters.toListString)

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, stopwords=None):
        super(NLTKWordPunctTokenizer, self).__init__()
        self.stopwords = Param(self, "stopwords", "")
        self._setDefault(stopwords=[])
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None, stopwords=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setStopwords(self, value):
        return self._set(stopwords=list(value))

    def getStopwords(self):
        return self.getOrDefault(self.stopwords)

    # Required in Spark >= 3.0
    def setInputCol(self, value):
        """
        Sets the value of :py:attr:`inputCol`.
        """
        return self._set(inputCol=value)

    # Required in Spark >= 3.0
    def setOutputCol(self, value):
        """
        Sets the value of :py:attr:`outputCol`.
        """
        return self._set(outputCol=value)

    def _transform(self, dataset):
        stopwords = set(self.getStopwords())

        def f(s):
            tokens = nltk.tokenize.wordpunct_tokenize(s)
            return [t for t in tokens if t.lower() not in stopwords]

        t = ArrayType(StringType())
        out_col = self.getOutputCol()
        in_col = dataset[self.getInputCol()]
        return dataset.withColumn(out_col, udf(f, t)(in_col))

Example usage (data taken from ML - Features):

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
], ["label", "sentence"])

tokenizer = NLTKWordPunctTokenizer(
    inputCol="sentence", outputCol="words",
    stopwords=nltk.corpus.stopwords.words('english'))

tokenizer.transform(sentenceDataFrame).show()
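
Because the class mixes in DefaultParamsReadable and DefaultParamsWritable (available in PySpark >= 2.3.0), the configured transformer can also be persisted and reloaded like the built-in ones. A minimal sketch, assuming a writable example path /tmp/nltk-tokenizer:

# Saves only the params (inputCol, outputCol, stopwords); the path is just an example.
tokenizer.write().overwrite().save("/tmp/nltk-tokenizer")

# Reload the transformer with the same param values and reuse it.
loaded = NLTKWordPunctTokenizer.load("/tmp/nltk-tokenizer")
loaded.transform(sentenceDataFrame).show()

The same mixins are what make it possible to use the transformer as a stage of a pyspark.ml.Pipeline and persist it together with the rest of the pipeline.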

For a custom Python Estimator, see How to Roll a Custom Estimator in PySpark mllib.
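
As a rough illustration of that pattern (a hypothetical sketch, not the code from the linked answer), a pure-Python Estimator implements _fit and returns a Model; the MeanImputer / MeanImputerModel names and the mean-imputation logic below are made up for the example:

from pyspark import keyword_only
from pyspark.ml import Estimator, Model
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql import functions as F

class MeanImputer(Estimator, HasInputCol, HasOutputCol):
    """Hypothetical estimator: learns the mean of inputCol at fit time."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(MeanImputer, self).__init__()
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def _fit(self, dataset):
        # fit() calls this; compute the statistic and hand it to the model.
        mean = dataset.agg(F.avg(self.getInputCol())).first()[0]
        # _copyValues copies inputCol/outputCol onto the model and returns it.
        return self._copyValues(MeanImputerModel(mean=mean))

class MeanImputerModel(Model, HasInputCol, HasOutputCol):
    """Hypothetical model returned by MeanImputer.fit()."""

    def __init__(self, mean=None):
        super(MeanImputerModel, self).__init__()
        self.mean = mean

    def _transform(self, dataset):
        # Fill nulls in inputCol with the mean learned during fit().
        return dataset.withColumn(
            self.getOutputCol(),
            F.coalesce(dataset[self.getInputCol()], F.lit(self.mean)))

Fitting works like any built-in stage, e.g. model = MeanImputer(inputCol="x", outputCol="x_filled").fit(df).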

⚠ This answer depends on internal APIs and is compatible with Spark 2.0.3, 2.1.1, 2.2.0 or later (SPARK-19348). For code compatible with earlier Spark versions, see revision 8.

Regarding python - Creating a custom Transformer in PySpark ML, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32331848/
