
apache-spark - Pyspark: split a DataFrame string column into multiple columns


I am running a Spark Structured Streaming example on Spark 3.0.0, using Twitter data. I push the Twitter data into Kafka, and a single record looks like this:

2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India

Here each field is separated by '|'. The fields are:

  1. Date-time

  2. User ID

  3. Tweet

  4. Location

Now, reading this message in Spark, I get a DataFrame like this:

 key |   value 
-----+-------------------------
| 2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India
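
For context, a minimal sketch of how such a key/value DataFrame is typically read from Kafka with Structured Streaming; the topic name and broker address below are assumptions, and the binary columns are cast to strings:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("twitter-split").getOrCreate()

# Assumed broker address and topic name; adjust to your setup.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "twitter")
      .load()
      # Kafka delivers key and value as binary, so cast them to strings
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))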

Based on this answer, I added the following block of code to my application:

split_col =  pyspark.sql.functions.split(df['value'], '|')


df = df.withColumn("Tweet Time", split_col.getItem(0))
df = df.withColumn("User ID", split_col.getItem(1))
df = df.withColumn("Tweet Text", split_col.getItem(2))
df = df.withColumn("Location", split_col.getItem(3))
df = df.drop("key")

But it gives me output like this:

                            A                                                                                                                                                                         |  B   |   C     |   D    |  E  |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+---------+--------+-----+
2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|2 | 0 | 2 | 0 |

But I want output like this:

       Tweet Time      |       User ID           |                            Tweet text                                                                                                        |   Location        |
-----------------------+-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
2020-07-21 10:48:19 | 1265200268284588034 | RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,… | Hyderabad, India |

Best answer

The second argument of split is a pattern: a string representing a Java regular expression. A bare '|' is therefore interpreted as regex alternation (which matches the empty string at every position) rather than as a literal pipe, which is why your split returned single characters.

Use "\\|" to split on the pipe, or the character class '[|]'.

from pyspark.sql.functions import split

# Escape the pipe so it is treated literally, not as regex alternation
# (the pattern '[|]' works equally well).
split_col = split(df.value, '\\|')

df = df.withColumn("Tweet Time", split_col.getItem(0))\
    .withColumn("User ID", split_col.getItem(1))\
    .withColumn("Tweet Text", split_col.getItem(2))\
    .withColumn("Location", split_col.getItem(3))\
    .drop("key")

Output:

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+
|value |Tweet Time |User ID |Tweet Text |Location |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+
|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|
|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+
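
Note that the original value column is still present in the output above. If you only want the four new columns, as in your desired output, a select afterwards is enough, for example:

df = df.select("Tweet Time", "User ID", "Tweet Text", "Location")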

Regarding apache-spark - Pyspark: split a DataFrame string column into multiple columns, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63009833/
