
hadoop - Regex pattern not working in PySpark after applying the logic

Reposted · Author: 行者123 · Updated: 2023-12-02 19:54:37

I have the following data:

>>> df1.show()
+-----------------+--------------------+
| corruptNames| standardNames|
+-----------------+--------------------+
|Sid is (Good boy)| Sid is Good Boy|
| New York Life| New York Life In...|
+-----------------+--------------------+

So, based on the data above, I need to apply a regex, create a new column, and fill it with the data from the second column, standardNames. I tried the code below:
spark.sql("select *, case when corruptNames rlike '[^a-zA-Z ()]+(?![^(]*))' or corruptNames rlike 'standardNames' then standardNames else 0 end as standard from temp1").show()  

It throws the following error:
pyspark.sql.utils.AnalysisException: "cannot resolve '`standardNames`' given input columns: [temp1.corruptNames, temp1. standardNames];
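The error message itself is a hint: the input columns are listed as `[temp1.corruptNames, temp1. standardNames]`, with a leading space before `standardNames`, so the literal name `standardNames` cannot be resolved. A hypothetical fix is to strip whitespace from the column names before registering the temp view; plain Python shows the idea (in PySpark this would be something like `df1 = df1.toDF(*[c.strip() for c in df1.columns])`):

```python
# Column names as reported in the AnalysisException (note the leading space)
columns = ['corruptNames', ' standardNames']

# Strip stray whitespace from each name
cleaned = [c.strip() for c in columns]
print(cleaned)  # ['corruptNames', 'standardNames']
```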

Best Answer

Try an example without the `select` SQL. I am assuming you want to create a new column called standardNames based on the corruptNames when the regex pattern is true, otherwise "do something else...".

Note: your pattern will not compile, because you need to escape the second-to-last ) with a backslash (\).

pattern = '[^a-zA-Z ()]+(?![^(]*))'    # this won't compile
pattern = r'[^a-zA-Z ()]+(?![^(]*\))'  # this will
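Spark's `rlike` uses Java regex, but Python's `re` module rejects the unescaped pattern for the same reason (an unbalanced closing parenthesis), so it makes a quick sanity check:

```python
import re

bad = '[^a-zA-Z ()]+(?![^(]*))'     # trailing ')' is unbalanced
good = r'[^a-zA-Z ()]+(?![^(]*\))'  # ')' escaped inside the lookahead

try:
    re.compile(bad)
    bad_compiles = True
except re.error:
    bad_compiles = False

print('bad pattern compiles:', bad_compiles)  # False
re.compile(good)                              # raises nothing
print('good pattern compiles: True')
```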


import pyspark.sql.functions as F

df_text = spark.createDataFrame([('Sid is (Good boy)',),('New York Life',)], ('corruptNames',))

pattern = r'[^a-zA-Z ()]+(?![^(]*\))'

df = (df_text.withColumn('standardNames',
                         F.when(F.col('corruptNames').rlike(pattern), F.col('corruptNames'))
                          .otherwise('Do something else')))

df.show()

#+-----------------+---------------------+
#| corruptNames| standardNames|
#+-----------------+---------------------+
#|Sid is (Good boy)| Do something else|
#| New York Life| Do something else|
#+-----------------+---------------------+
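Why both rows fall into the `otherwise` branch: the pattern only matches characters outside `a-zA-Z ()` that are not followed by a closing parenthesis, and both sample strings contain nothing but letters, spaces, and parentheses. A plain-Python approximation of `rlike` (which, like `re.search`, matches anywhere in the string) confirms this:

```python
import re

pattern = r'[^a-zA-Z ()]+(?![^(]*\))'

for s in ['Sid is (Good boy)', 'New York Life']:
    hit = re.search(pattern, s) is not None
    print(s, '->', s if hit else 'Do something else')
# Sid is (Good boy) -> Do something else
# New York Life -> Do something else
```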

Regarding hadoop - regex pattern not working in pyspark after applying the logic, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58694497/
