
regex - Pyspark string pattern from column values and a regexp expression


Hi, I have a dataframe with 2 columns:

+----------------------------------------+----------+
| Text | Key_word |
+----------------------------------------+----------+
| First random text tree cheese cat | tree |
| Second random text apple pie three | text |
| Third random text burger food brain | brain |
| Fourth random text nothing thing chips | random |
+----------------------------------------+----------+

I want to generate a third column containing the word that appears immediately before the Key_word in the Text:

+----------------------------------------+----------+-------------------+
| Text                                   | Key_word | word_bef_key_word |
+----------------------------------------+----------+-------------------+
| First random text tree cheese cat      | tree     | text              |
| Second random text apple pie three     | text     | random            |
| Third random text burger food brain    | brain    | food              |
| Fourth random text nothing thing chips | random   | Fourth            |
+----------------------------------------+----------+-------------------+

I tried this, but it doesn't work:

df2=df1.withColumn('word_bef_key_word',regexp_extract(df1.Text,('\\w+)'df1.key_word,1))

Here is the code to create the example dataframe:

df = sqlCtx.createDataFrame(
    [
        ('First random text tree cheese cat', 'tree'),
        ('Second random text apple pie three', 'text'),
        ('Third random text burger food brain', 'brain'),
        ('Fourth random text nothing thing chips', 'random')
    ],
    ('Text', 'Key_word')
)

Best Answer

Update

You can also do this without a udf by using pyspark.sql.functions.expr to pass column values as a parameter to pyspark.sql.functions.regexp_extract:

from pyspark.sql.functions import expr

df = df.withColumn(
    'word_bef_key_word',
    expr(r"regexp_extract(Text, concat('\\w+(?= ', Key_word, ')'), 0)")
)
df.show(truncate=False)
#+--------------------------------------+--------+-----------------+
#|Text |Key_word|word_bef_key_word|
#+--------------------------------------+--------+-----------------+
#|First random text tree cheese cat |tree |text |
#|Second random text apple pie three |text |random |
#|Third random text burger food brain |brain |food |
#|Fourth random text nothing thing chips|random |Fourth |
#+--------------------------------------+--------+-----------------+
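
As a sanity check of what the concat builds per row: for the first example row, the SQL string literal '\\w+(?= ' joined with the Key_word value 'tree' produces the regex \w+(?= tree). Here is a minimal plain-Python sketch of that pattern (outside Spark, using the first row's values):

import re

# The per-row pattern that concat(...) assembles; for Key_word = 'tree'
# the resulting regex is r'\w+(?= tree)'.
row_pattern = r'\w+(?= ' + 'tree' + ')'
print(re.findall(row_pattern, 'First random text tree cheese cat'))  # ['text']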

Original Answer

One way is to use a udf to perform the regex:

import re
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def get_previous_word(text, key_word):
    # match a word followed by a space and the key_word
    matches = re.findall(r'\w+(?= {kw})'.format(kw=key_word), text)
    return matches[0] if matches else None

get_previous_word_udf = udf(
    lambda text, key_word: get_previous_word(text, key_word),
    StringType()
)

df = df.withColumn('word_bef_key_word', get_previous_word_udf('Text', 'Key_word'))
df.show(truncate=False)
#+--------------------------------------+--------+-----------------+
#|Text |Key_word|word_bef_key_word|
#+--------------------------------------+--------+-----------------+
#|First random text tree cheese cat |tree |text |
#|Second random text apple pie three |text |random |
#|Third random text burger food brain |brain |food |
#|Fourth random text nothing thing chips|random |Fourth |
#+--------------------------------------+--------+-----------------+

The regex pattern '\w+(?= {kw})'.format(kw=key_word) means: match a word that is followed by a space and the key_word. If there are multiple matches, we return the first one. If there are no matches, the function returns None.
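
To illustrate those two edge cases, here is a small standalone sketch with hypothetical inputs:

import re

# Multiple matches: findall returns them in order, and the udf takes the first.
print(re.findall(r'\w+(?= text)', 'random text apple text pie'))  # ['random', 'apple']

# No match: findall returns an empty list, so the udf returns None.
print(re.findall(r'\w+(?= zebra)', 'no such word here'))  # []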

Regarding regex - Pyspark string pattern from column values and a regexp expression, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49538327/
