
python - Replace a word in a string if it is within a certain number of words of another word

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:05:06

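I have a text column called "DESCRIPTION" in a dataframe. I need to find every instance where the word "tile" or "tiles" is within 6 words of the word "roof", and then change only the word "tile/s" to "rooftiles". I need to do the same for "floor" and "tiles" (changing "tiles" to "floortiles"). This will help differentiate which building trade we are looking at when certain words are used in conjunction with others.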

To show what I mean, here is some sample data and my latest incorrect attempt:

import re
import pandas as pd

s1=pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2=pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3=pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])
df=pd.DataFrame([list(s1), list(s2), list(s3)], columns = ["DESCRIPTION"])
df

The solution I am after should look like this (in dataframe format):

1. After the storm the roof was damaged and some of the rooftiles are missing
2. I dropped the saw and it fell on the floor and damaged some of the floortiles
3. the roof was leaking and when I checked I saw that some of the tiles were cracked

Here I tried to use a regex pattern to match and replace the word "tiles", but it is completely wrong... Is there a way to do what I am trying to do? I am new to Python...

regex=r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*tiles)"
replacedString=re.sub(regex, r"(roof)\b\s+([^\s]+\s+){0,6}\b(.*rooftiles)", df['DESCRIPTION'])
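
For context on why the attempt above does not work, here is a minimal sketch. `re.sub` operates on a single string (passing it a pandas Series typically raises a TypeError), and its second argument is a replacement template, so regex syntax such as `\b` or `{0,6}` has no pattern meaning there; only backreferences like `\1` are interpreted. The sentence is taken from the sample data; the pattern is an illustrative assumption for this sketch, not the accepted solution.

```python
import re

text = "After the storm the roof was damaged and some of the tiles are missing"

# Illustrative pattern (an assumption for this sketch):
# group 1 = "roof" plus up to six following words and the separator,
# group 2 = "tile" or "tiles".
pattern = r"\b(roof(?:\W+\w+){0,6}\W+)(tiles?)\b"

# The replacement is a template: \1 and \2 refer back to the groups,
# and the literal text "roof" is glued in front of the matched tile word.
result = re.sub(pattern, r"\1roof\2", text)
print(result)
```

To apply something like this to a whole column, the function has to be called once per row, for example through `Series.apply`.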

Update: solution

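Thanks for all the help! I managed to get it working using Jan's code with a few additions/tweaks. The final working code is below (using the real file and data rather than the sample):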

import re
import pandas as pd

claims_file = pd.read_csv(project_path + claims_filename)  # read input file
# avoid errors caused by text that was just 'NA' being read in as NaN
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].fillna('NA')
# create the regex
rx = re.compile(r'''
    (                       # outer group
        \b(floor|roof)      # floor or roof
        (?:\W+\w+){0,6}\s*  # up to six "words"
    )
    \b(tiles?)\b            # tile or tiles
    ''', re.VERBOSE)

# create the reverse regex
rx2 = re.compile(r'''
    (                       # outer group
        \b(tiles?)          # tile or tiles
        (?:\W+\w+){0,6}\s*  # up to six "words"
    )
    \b(floor|roof)\b        # roof or floor
    ''', re.VERBOSE)
#apply it to every row of Loss Description:
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x))

#apply the reverse regex:
# \3 prefixes the tile word with floor/roof; the second \3 keeps the trailing floor/roof in place
claims_file["LOSS_DESCRIPTION"] = claims_file["LOSS_DESCRIPTION"].apply(lambda x: rx2.sub(r'\3\1\3', x))

# Write results into CSV file and check results
claims_file.to_csv(project_path + output_filename, index=False, encoding='utf-8')
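
The same two-pass idea can be checked on plain strings without the CSV file. The following is a stdlib-only sketch, not the exact code above: the helper name `tag_tiles` and the test sentences are assumptions for illustration, and the reverse replacement is written as `\3\1\3` so that the trailing `roof`/`floor` stays in the sentence.

```python
import re

# Forward: roof/floor, then up to six words, then tile(s).
FORWARD = re.compile(r"(\b(floor|roof)(?:\W+\w+){0,6}\s*)\b(tiles?)\b")
# Reverse: tile(s), then up to six words, then roof/floor.
REVERSE = re.compile(r"(\b(tiles?)(?:\W+\w+){0,6}\s*)\b(floor|roof)\b")

def tag_tiles(text):
    """Hypothetical helper: prefix tile(s) with the nearby roof/floor word."""
    # \1 = keyword plus the words in between, \2 = keyword, \3 = tile(s)
    text = FORWARD.sub(r"\1\2\3", text)
    # \1 = tile(s) plus the words in between, \3 = keyword
    text = REVERSE.sub(r"\3\1\3", text)
    return text

forward_result = tag_tiles("the roof was damaged and some of the tiles are missing")
reverse_result = tag_tiles("Can you give me the tile for the roof, please?")
print(forward_result)
print(reverse_result)
```

Because `\b` requires a word boundary before `tile`, an already rewritten `rooftiles` is not picked up again by the second pass.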

Best answer

You can use a regex-based solution here:

(                       # outer group
    \b(floor|roof)      # floor or roof
    (?:\W+\w+){1,6}\s*  # one to six "words"
)
\b(tiles?)\b            # tile or tiles

See a demo for the regex on regex101.com.
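
To make the group numbering concrete, here is a small sketch that runs the pattern above on the second sample sentence and inspects what each group captures. Groups are numbered by the order of their opening parentheses, and `(?:...)` does not create a group.

```python
import re

rx = re.compile(r"""
    (                       # group 1: keyword plus the words that follow it
        \b(floor|roof)      # group 2: floor or roof
        (?:\W+\w+){1,6}\s*  # one to six intervening words (non-capturing)
    )
    \b(tiles?)\b            # group 3: tile or tiles
""", re.VERBOSE)

m = rx.search("I dropped the saw and it fell on the floor and damaged some of the tiles")
groups = m.groups()
print(groups)
```

Group 1 ends just before the tile word, so a replacement of `\1\2\3` re-emits the original text and inserts the keyword directly in front of `tiles`.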


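Afterwards, just combine the captured groups and put them back together with rx.sub(), applying it to every item in the DESCRIPTION column, so that you end up with the following code: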

import re
import pandas as pd

s1 = pd.Series(["After the storm the roof was damaged and some of the tiles are missing"])
s2 = pd.Series(["I dropped the saw and it fell on the floor and damaged some of the tiles"])
s3 = pd.Series(["the roof was leaking and when I checked I saw that some of the tiles were cracked"])

df = pd.DataFrame([list(s1), list(s2), list(s3)], columns = ["DESCRIPTION"])

rx = re.compile(r'''
    (                       # outer group
        \b(floor|roof)      # floor or roof
        (?:\W+\w+){1,6}\s*  # one to six "words"
    )
    \b(tiles?)\b            # tile or tiles
    ''', re.VERBOSE)

# apply it to every row of "DESCRIPTION"
df["DESCRIPTION"] = df["DESCRIPTION"].apply(lambda x: rx.sub(r'\1\2\3', x))
print(df["DESCRIPTION"])


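Note, although your original question was not very clear on this: this solution will only find tile or tiles after roof, meaning that a sentence like "Can you give me the tile for the roof, please?" will not be matched (even though the word tile is within six words of roof).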
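
A quick way to see both limits (direction and distance) is to test the pattern against a few sentences. This is only a sketch: the pattern is written on one line instead of in verbose mode, and the list of sentences mixes two rows from the sample data with the example from the note above.

```python
import re

# Same pattern as in the answer, written without re.VERBOSE.
rx = re.compile(r"(\b(floor|roof)(?:\W+\w+){1,6}\s*)\b(tiles?)\b")

sentences = [
    # tiles follows roof with six words in between: matches
    "After the storm the roof was damaged and some of the tiles are missing",
    # tiles follows roof, but too far away: no match
    "the roof was leaking and when I checked I saw that some of the tiles were cracked",
    # tile comes before roof: no match
    "Can you give me the tile for the roof, please?",
]

matched = [bool(rx.search(s)) for s in sentences]
print(matched)
```

Handling the reverse order needs a second pattern with the two keywords swapped.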

Regarding "python - Replace a word in a string if it is within a certain number of words of another word", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44512411/
