
apache-spark - Splitting the contents of a string column in a PySpark DataFrame

Reposted. Author: 行者123. Updated: 2023-12-04 05:09:47

I have a PySpark DataFrame with one column that contains strings. I want to split this column into words.
Code:

>>> sentenceData = sqlContext.read.load('file://sample1.csv', format='com.databricks.spark.csv', header='true', inferSchema='true')
>>> sentenceData.show(truncate=False)
+---+---------------------------+
|key|desc                       |
+---+---------------------------+
|1  |Virat is good batsman      |
|2  |sachin was good            |
|3  |but modi sucks big big time|
|4  |I love the formulas        |
+---+---------------------------+


Expected Output
---------------

>>> sentenceData.show(truncate=False)
+---+-------------------------------------+
|key|desc                                 |
+---+-------------------------------------+
|1  |[Virat,is,good,batsman]              |
|2  |[sachin,was,good]                    |
|3  |....                                 |
|4  |...                                  |
+---+-------------------------------------+
How can I achieve this?

Best answer

Use the split function:

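from pyspark.sql.functions import split

# df is the question's DataFrame (sentenceData); withColumn returns a new DataFrame.
# The raw string passes \s+ to the regex engine without a Python escape warning.
df = df.withColumn("desc", split("desc", r"\s+"))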

Regarding "apache-spark - Splitting the contents of a string column in a PySpark DataFrame", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41283478/
