python - pyspark.ml pipelines: are custom transformers necessary for basic preprocessing tasks?

Reposted. Author: 行者123. Updated: 2023-11-28 17:08:01

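Getting started with pyspark.ml and the pipelines API, I find myself writing custom transformers for typical preprocessing tasks so that I can use them in a pipeline. Examples: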

from pyspark.ml import Pipeline, Transformer


class CustomTransformer(Transformer):
    # lazy workaround - a transformer needs to have these attributes
    _defaultParamMap = dict()
    _paramMap = dict()
    _params = dict()


class ColumnSelector(CustomTransformer):
    """Transformer that selects a subset of columns
    - to be used as pipeline stage"""

    def __init__(self, columns):
        self.columns = columns

    def _transform(self, data):
        return data.select(self.columns)


class ColumnRenamer(CustomTransformer):
    """Transformer renames one column"""

    def __init__(self, rename):
        self.rename = rename

    def _transform(self, data):
        (colNameBefore, colNameAfter) = self.rename
        return data.withColumnRenamed(colNameBefore, colNameAfter)


class NaDropper(CustomTransformer):
    """
    Drops rows with at least one not-a-number element
    """

    def __init__(self, cols=None):
        self.cols = cols

    def _transform(self, data):
        dataAfterDrop = data.dropna(subset=self.cols)
        return dataAfterDrop


class ColumnCaster(CustomTransformer):

    def __init__(self, col, toType):
        self.col = col
        self.toType = toType

    def _transform(self, data):
        return data.withColumn(self.col, data[self.col].cast(self.toType))

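They work, but I was wondering whether this is a pattern or an anti-pattern: are such transformers a good way to work with the pipeline API? Was it necessary to implement them, or is equivalent functionality provided somewhere else?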

Best answer

I'd say it is primarily opinion-based, although it looks unnecessarily verbose, and Python Transformers don't integrate well with the rest of the Pipeline API.

It is also worth pointing out that everything you have here can be easily achieved with SQLTransformer. For example:

from pyspark.ml.feature import SQLTransformer

def column_selector(columns):
    return SQLTransformer(
        statement="SELECT {} FROM __THIS__".format(", ".join(columns))
    )

def na_dropper(columns):
    return SQLTransformer(
        statement="SELECT * FROM __THIS__ WHERE {}".format(
            " AND ".join(["{} IS NOT NULL".format(x) for x in columns])
        )
    )

With a little bit of effort you can use SQLAlchemy with the Hive dialect to avoid handwritten SQL.
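
The SQLTransformer pattern above can probably be stretched to cover the question's ColumnCaster as well. What follows is only a sketch under stated assumptions: the helper names cast_statement and column_caster are hypothetical (they are not part of pyspark), and the cast result is written to a new column, because combining SELECT * with an alias that reuses an existing column name would leave two columns with that name. SQLTransformer is imported inside the factory so that the statement string can be inspected without a Spark installation.

```python
def cast_statement(column, to_type, alias):
    """Build a SQLTransformer statement that adds `alias` as CAST(column AS to_type).

    Hypothetical helper for illustration; not part of pyspark.
    """
    return "SELECT *, CAST({} AS {}) AS {} FROM __THIS__".format(
        column, to_type, alias
    )


def column_caster(column, to_type, alias):
    """Wrap the statement in a SQLTransformer so it can serve as a pipeline stage."""
    # Imported lazily: building the statement string does not require Spark.
    from pyspark.ml.feature import SQLTransformer

    return SQLTransformer(statement=cast_statement(column, to_type, alias))


if __name__ == "__main__":
    # Prints: SELECT *, CAST(age AS double) AS age_double FROM __THIS__
    print(cast_statement("age", "double", "age_double"))
```

A stage built this way could then be listed alongside column_selector(...) and na_dropper(...) in Pipeline(stages=[...]). The column name "age" is made up for the example.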

Regarding "python - pyspark.ml pipelines: are custom transformers necessary for basic preprocessing tasks?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49734374/
