
arrays - How to remove elements from an array Column in Spark?

Reposted. Author: 行者123. Updated: 2023-12-02 06:44:00

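I have a Seq and a DataFrame. The DataFrame contains a column of array type. I am trying to remove from that column every element that appears in the Seq.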

For example:

val stop_words = Seq("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

+---------------------------------------------------+
|sorted_items |
+---------------------------------------------------+
|[flannel, and, for, s, shirts, sleeve, warm] |
|[3, 5, kitchenaid, s] |
|[5, 6, case, flip, inch, iphone, on, xs] |
|[almonds, chocolate, covered, dark, joe, s, the] |
|null |
|[] |
|[animation, book] |
+---------------------------------------------------+

Expected output:

+---------------------------------------------------+
|sorted_items |
+---------------------------------------------------+
|[flannel, shirts, sleeve, warm] |
|[3, 5, kitchenaid] |
|[5, 6, case, flip, inch, iphone, xs] |
|[almonds, chocolate, covered, dark, joe] |
|null |
|[] |
|[animation, book] |
+---------------------------------------------------+

How can this be done in an efficient, optimized way?

Best answer

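Use array_except from org.apache.spark.sql.functions (available since Spark 2.4). Note that it returns the remaining elements without duplicates: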

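import org.apache.spark.sql.{functions => F}

// Stop words as an Array, so F.lit can wrap them in an array literal column
val stopWords = Array("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

// array_except keeps the elements of sorted_items that are not in stopWords
val newDF = df.withColumn("sorted_items", F.array_except(df("sorted_items"), F.lit(stopWords)))

// false disables truncation so that full arrays are printed
newDF.show(false)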

Output:

+----------------------------------------+
|sorted_items |
+----------------------------------------+
|[flannel, shirts, sleeve, warm] |
|[3, 5, kitchenaid] |
|[5, 6, case, flip, inch, iphone, xs] |
|[almonds, chocolate, covered, dark, joe]|
|null |
|[] |
|[animation, book] |
+----------------------------------------+
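
If repeated elements in an array have to survive, array_except may not be the right fit, since it returns its result without duplicates. A minimal sketch of one alternative, assuming Spark 2.4 or later, is the SQL higher-order function filter called through F.expr. The local SparkSession, the sample rows (with a repeated "shirts" added for illustration), and the names stopArraySql and keptDF are assumptions made for this sketch and are not part of the original answer. The stop words are inlined as SQL string literals, which assumes none of them contains a single quote.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.{functions => F}

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("stop-words-filter-sketch")
  .getOrCreate()
import spark.implicits._

val stopWords = Seq("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

// Hypothetical sample rows: one array with a duplicate, one empty array, one null.
val df = Seq[Tuple1[Seq[String]]](
  Tuple1(Seq("flannel", "and", "for", "s", "shirts", "shirts")),
  Tuple1(Seq.empty[String]),
  Tuple1[Seq[String]](null)
).toDF("sorted_items")

// Render the stop words as a SQL array literal: array('a', 'and', ...).
// Assumes no stop word contains a single quote.
val stopArraySql = stopWords.map(w => s"'$w'").mkString("array(", ", ", ")")

// filter keeps each element x for which the lambda is true, preserving the
// original order and any repeats; a null array stays null.
val keptDF = df.withColumn(
  "sorted_items",
  F.expr(s"filter(sorted_items, x -> NOT array_contains($stopArraySql, x))")
)

keptDF.show(false)
```

With these sample rows the repeated "shirts" is kept, whereas array_except would collapse it to a single entry. On Spark 3.0 and later, the same operation is also available in the Scala API as F.filter, which avoids building a SQL string.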

Regarding "arrays - How to remove elements from an array Column in Spark?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56180887/
