
python - Combine PySpark DataFrame ArrayType fields into single ArrayType field


I have a PySpark DataFrame with two ArrayType fields:

>>> df
DataFrame[id: string, tokens: array<string>, bigrams: array<string>]
>>> df.take(1)
[Row(id='ID1', tokens=['one', 'two', 'two'], bigrams=['one two', 'two two'])]

I would like to combine them into a single ArrayType field:

>>> df2
DataFrame[id: string, tokens_bigrams: array<string>]
>>> df2.take(1)
[Row(id='ID1', tokens_bigrams=['one', 'two', 'two', 'one two', 'two two'])]

The syntax that works with strings does not seem to work here:

df2 = df.withColumn('tokens_bigrams', df.tokens + df.bigrams)

Thanks!

Best Answer

Spark >= 2.4
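The snippets below are run against a small demo DataFrame (columns tokens and tokens_bigrams, with one row whose second array is NULL) rather than the asker's exact schema; a minimal sketch of how it might be built, assuming an existing SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Demo DataFrame: one complete row plus one row whose second array is NULL,
# matching the outputs shown below.
df = spark.createDataFrame(
    [(["one", "two", "two"], ["one two", "two two"]), (["three"], None)],
    ("tokens", "tokens_bigrams")
)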

You can use the concat function (SPARK-23736):

from pyspark.sql.functions import col, concat 

df.select(concat(col("tokens"), col("tokens_bigrams"))).show(truncate=False)

# +---------------------------------+
# |concat(tokens, tokens_bigrams)   |
# +---------------------------------+
# |[one, two, two, one two, two two]|
# |null                             |
# +---------------------------------+

To keep the data when one of the values is NULL, you can combine array and coalesce, replacing a NULL array with an empty one before concatenating:

from pyspark.sql.functions import array, coalesce

df.select(concat(
    coalesce(col("tokens"), array()),
    coalesce(col("tokens_bigrams"), array())
)).show(truncate=False)

# +--------------------------------------------------------------------+
# |concat(coalesce(tokens, array()), coalesce(tokens_bigrams, array()))|
# +--------------------------------------------------------------------+
# |[one, two, two, one two, two two]                                   |
# |[three]                                                             |
# +--------------------------------------------------------------------+
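Applied to the asker's original schema (id, tokens, bigrams), the same Spark >= 2.4 approach might look like this sketch, which builds the desired tokens_bigrams column; if NULL arrays cannot occur, concat(col("tokens"), col("bigrams")) alone is enough:

from pyspark.sql.functions import array, coalesce, col, concat

# Sketch for the asker's DataFrame (id, tokens, bigrams): concatenate the two
# array columns, treating a NULL array on either side as empty.
df2 = df.withColumn(
    "tokens_bigrams",
    concat(coalesce(col("tokens"), array()), coalesce(col("bigrams"), array()))
).drop("tokens", "bigrams")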

Spark < 2.4

Unfortunately, concatenating array columns in the general case requires a UDF, for example:

from itertools import chain
from pyspark.sql.functions import col, udf
from pyspark.sql.types import *


def concat(type):
    # Returns a UDF that concatenates any number of array<type> columns,
    # treating a NULL array as empty.
    def concat_(*args):
        return list(chain.from_iterable((arg if arg else [] for arg in args)))
    return udf(concat_, ArrayType(type))

which can be used as:

df = spark.createDataFrame(
    [(["one", "two", "two"], ["one two", "two two"]), (["three"], None)],
    ("tokens", "tokens_bigrams")
)

concat_string_arrays = concat(StringType())
df.select(concat_string_arrays("tokens", "tokens_bigrams")).show(truncate=False)

# +---------------------------------+
# |concat_(tokens, tokens_bigrams)  |
# +---------------------------------+
# |[one, two, two, one two, two two]|
# |[three]                          |
# +---------------------------------+
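On the asker's DataFrame the UDF is applied the same way; a sketch that keeps only the combined field, matching the desired df2:

# Sketch for the asker's schema (id, tokens, bigrams). Note that concat here
# is the UDF factory defined above, not pyspark.sql.functions.concat.
concat_string_arrays = concat(StringType())
df2 = df.withColumn(
    "tokens_bigrams", concat_string_arrays("tokens", "bigrams")
).select("id", "tokens_bigrams")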

Regarding python - Combine PySpark DataFrame ArrayType fields into single ArrayType field, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/37284077/
