python - What is the best way to remove accents with Apache Spark DataFrames in PySpark?


I need to remove the accents from Spanish and other language characters across different datasets.

I have already written a function, based on the code provided in this post, to remove the special accents. The problem is that the function is slow because it uses a UDF. I am wondering whether I can improve its performance and get results in less time, since this approach works fine for small DataFrames but not for large ones.

Thanks in advance.

Here is the code, which you can run exactly as shown:

# Importing sql types
from pyspark.sql.types import StringType, IntegerType, StructType, StructField
from pyspark.sql.functions import udf, col
import unicodedata

# Building a simple dataframe:
schema = StructType([StructField("city", StringType(), True),
                     StructField("country", StringType(), True),
                     StructField("population", IntegerType(), True)])

countries = ['Venezuela', 'US@A', 'Brazil', 'Spain']
cities = ['Maracaibó', 'New York', ' São Paulo ', '~Madrid']
population = [37800000, 19795791, 12341418, 6489162]

# Dataframe:
df = sqlContext.createDataFrame(list(zip(cities, countries, population)), schema=schema)

df.show()

class Test():
    def __init__(self, df):
        self.df = df

    def clearAccents(self, columns):
        """This function deletes accents in string columns of a dataFrame;
        it does not eliminate the base characters, only the combining
        accent marks.

        :param columns: String or a list of column names.
        """
        # Filter all string columns in the dataFrame
        validCols = [c for (c, t) in filter(lambda t: t[1] == 'string', self.df.dtypes)]

        # If "*" is provided as the columns parameter, use every string column:
        if columns == "*":
            columns = validCols[:]

        # Receives a string as an argument
        def remove_accents(inputStr):
            # First, decompose the string (NFKD separates base chars from accents):
            nfkdStr = unicodedata.normalize('NFKD', inputStr)
            # Keep only chars that are not combining marks (i.e. drop the accents)
            withOutAccents = u"".join([c for c in nfkdStr if not unicodedata.combining(c)])
            return withOutAccents

        function = udf(lambda x: remove_accents(x) if x is not None else x, StringType())
        exprs = [function(col(c)).alias(c) if (c in columns) and (c in validCols)
                 else c for c in self.df.columns]
        self.df = self.df.select(*exprs)

foo = Test(df)
foo.clearAccents(columns="*")
foo.df.show()

Best Answer

One possible improvement is to build a custom Transformer, which will handle the Unicode normalization, plus a corresponding Python wrapper. It should reduce the overall overhead of passing data between the JVM and Python, and it requires neither modifications to Spark itself nor access to private APIs.

On the JVM side you need a transformer similar to this one:

package net.zero323.spark.ml.feature

import java.text.Normalizer
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param._
import org.apache.spark.ml.util._
import org.apache.spark.sql.types.{DataType, StringType}

class UnicodeNormalizer(override val uid: String)
  extends UnaryTransformer[String, String, UnicodeNormalizer] {

  def this() = this(Identifiable.randomUID("unicode_normalizer"))

  private val forms = Map(
    "NFC" -> Normalizer.Form.NFC, "NFD" -> Normalizer.Form.NFD,
    "NFKC" -> Normalizer.Form.NFKC, "NFKD" -> Normalizer.Form.NFKD
  )

  val form: Param[String] = new Param(this, "form",
    "unicode form (one of NFC, NFD, NFKC, NFKD)",
    ParamValidators.inArray(forms.keys.toArray))

  def setForm(value: String): this.type = set(form, value)

  def getForm: String = $(form)

  setDefault(form -> "NFKD")

  override protected def createTransformFunc: String => String = {
    val normalizerForm = forms($(form))
    (s: String) => Normalizer.normalize(s, normalizerForm)
  }

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == StringType, s"Input type must be string type but got $inputType.")
  }

  override protected def outputDataType: DataType = StringType
}

And here is the corresponding build definition (adjust the Spark and Scala versions to match your Spark deployment):

name := "unicode-normalization"

version := "1.0"

crossScalaVersions := Seq("2.11.12", "2.12.8")

organization := "net.zero323"

val sparkVersion = "2.4.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)
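
These Spark dependencies are only needed at compile time, since the same classes are already on the classpath of any running Spark deployment. With plain sbt package this makes no difference to the artifact, but if you later build a fat jar with sbt-assembly you may want to mark them as "provided" so they are excluded. A possible variant of the same dependency block:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
)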

On the Python side you need a wrapper similar to this:

from pyspark.ml.param.shared import *
# from pyspark.ml.util import keyword_only  # in Spark < 2.0
from pyspark import keyword_only
from pyspark.ml.wrapper import JavaTransformer


class UnicodeNormalizer(JavaTransformer, HasInputCol, HasOutputCol):

    @keyword_only
    def __init__(self, form="NFKD", inputCol=None, outputCol=None):
        super(UnicodeNormalizer, self).__init__()
        self._java_obj = self._new_java_obj(
            "net.zero323.spark.ml.feature.UnicodeNormalizer", self.uid)
        self.form = Param(self, "form",
                          "unicode form (one of NFC, NFD, NFKC, NFKD)")
        # kwargs = self.__init__._input_kwargs  # in Spark < 2.0
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, form="NFKD", inputCol=None, outputCol=None):
        # kwargs = self.setParams._input_kwargs  # in Spark < 2.0
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setForm(self, value):
        return self._set(form=value)

    def getForm(self):
        return self.getOrDefault(self.form)

Build the Scala package:

sbt +package

Include it when starting the shell or submitting an application. For example, for a Spark build that uses Scala 2.11:

bin/pyspark --jars path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar \
--driver-class-path path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar
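
The same two flags should also work with spark-submit when you run a standalone application instead of the shell (the script name here is just a placeholder):

bin/spark-submit \
    --jars path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar \
    --driver-class-path path-to/target/scala-2.11/unicode-normalization_2.11-1.0.jar \
    your_app.py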

With that you should be good to go. All that is left is a little regexp magic:

from pyspark.sql.functions import regexp_replace

normalizer = UnicodeNormalizer(form="NFKD",
                               inputCol="text", outputCol="text_normalized")

df = sc.parallelize([
    (1, "Maracaibó"), (2, "New York"),
    (3, " São Paulo "), (4, "~Madrid")
]).toDF(["id", "text"])

(normalizer
    .transform(df)
    .select(regexp_replace("text_normalized", "\\p{M}", ""))
    .show())

## +--------------------------------------+
## |regexp_replace(text_normalized,\p{M},)|
## +--------------------------------------+
## | Maracaibo|
## | New York|
## | Sao Paulo |
## | ~Madrid|
## +--------------------------------------+

Note that this follows the same conventions as the built-in text transformers and is not null safe. You can easily correct that by checking for null in createTransformFunc.
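
For example, a minimal null-safe variant of createTransformFunc could look like this; the null check is the only change from the version above:

override protected def createTransformFunc: String => String = {
  val normalizerForm = forms($(form))
  // Pass nulls through untouched instead of throwing a NullPointerException
  (s: String) => if (s == null) null else Normalizer.normalize(s, normalizerForm)
}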

Regarding "python - What is the best way to remove accents with Apache Spark DataFrames in PySpark?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38359534/
