
python - Extract sentences containing certain words using regex

Reposted. Author: 行者123. Last updated: 2023-12-02 01:47:36

Suppose I have the following string:

txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'

What I want to do is get all the sentences (i.e., the text between commas) that contain "car" or "wheels". Using regex, I did the following:

re.findall('[^,]*{}|{}[^,]*'.format('car', 'wheels'), txt)

I got this result:

['the car', ' the car', 'wheels', 'wheels are round', ' wheels make the car']

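Clearly, it returns only part of each sentence (the text up to "car", or the text from "wheels" onward), and the order of the words seems to matter. What I want to get is: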

['the car is running', 'the car has wheels', 'wheels are round', 'wheels make the car go']

Any ideas on how to do this?

Best answer

Your regex

re.findall('[^,]*{}|{}[^,]*'.format('car', 'wheels'), txt)

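needs only a small modification: wrap the alternatives in a (non-capturing) group. Otherwise the | applies to the entire regex, not just to car|wheels.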
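To illustrate the precedence issue, here is a minimal sketch showing that the ungrouped pattern behaves exactly like two independent alternatives, "anything up to car" or "wheels plus whatever follows". The variable names `original` and `explicit` are illustrative only and do not come from the question.

```python
import re

txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'

# The pattern from the question: | has the lowest precedence in a regex.
original = '[^,]*{}|{}[^,]*'.format('car', 'wheels')

# The same pattern with the implicit grouping spelled out.
explicit = '(?:[^,]*car)|(?:wheels[^,]*)'

# Both patterns return the same truncated matches.
assert re.findall(original, txt) == re.findall(explicit, txt)
print(re.findall(original, txt))
# ['the car', ' the car', 'wheels', 'wheels are round', ' wheels make the car']
```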

Your new regex would be

re.findall('[^,]*(?:{}|{})[^,]*'.format('car', 'wheels'), txt)

Output:

['the car is running', ' the car has wheels', ' wheels are round', ' wheels make the car go']
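Note that the matches above keep the space that follows each comma. As a hedged sketch (not part of the original answer), the same grouped regex can be built from a list of words, with `re.escape` guarding against words that contain regex metacharacters and `str.strip` removing the leading spaces. The names `words`, `alternation`, and `pattern` are assumptions made for this example.

```python
import re

txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'
words = ['car', 'wheels']  # illustrative list; any number of words works

# Escape each word, then join them into a single alternation.
alternation = '|'.join(re.escape(word) for word in words)

# The non-capturing group keeps | local to the word list,
# so findall returns whole matches rather than group contents.
pattern = re.compile(r'[^,]*(?:{})[^,]*'.format(alternation))

# Strip the space left over after each comma.
sentences = [match.strip() for match in pattern.findall(txt)]
print(sentences)
# ['the car is running', 'the car has wheels', 'wheels are round', 'wheels make the car go']
```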

However, I don't think regex is the right tool for this problem. I would suggest the following solution instead:

txt = 'the car is running, the car has wheels, wheels are round, the road is clear, wheels make the car go'
# Either:
sentences = [sentence.strip() for sentence in txt.split(",") if "car" in sentence or "wheels" in sentence]
# Or alternatively:
words = ["car", "wheels"]
sentences = [
    sentence.strip()  # Remove spaces before and after the sentence
    for sentence in txt.split(",")
    if any(
        word in sentence
        for word in words
    )
]
# This second method allows for checking for more than just 2 words

Output:

['the car is running', 'the car has wheels', 'wheels are round', 'wheels make the car go']
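One caveat with the `in` test is that it matches substrings, so "car" would also match "carpet". If whole-word matching is wanted, a possible variation is sketched below. It assumes words are separated by whitespace with no attached punctuation, and the helper name `sentences_with_words` is hypothetical.

```python
def sentences_with_words(text, words):
    """Return the comma-separated segments of text that contain any of words as a whole word."""
    wanted = set(words)
    return [
        sentence.strip()
        for sentence in text.split(",")
        # isdisjoint is False when the sentence shares at least one word with wanted
        if not wanted.isdisjoint(sentence.split())
    ]


sample = 'the car is running, the carpet is red, wheels are round'

# Substring test: "car" also matches inside "carpet".
print([s.strip() for s in sample.split(",") if "car" in s or "wheels" in s])
# ['the car is running', 'the carpet is red', 'wheels are round']

# Whole-word test: "carpet" is no longer a match.
print(sentences_with_words(sample, ["car", "wheels"]))
# ['the car is running', 'wheels are round']
```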

Performance

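The performance of the two approaches (list comprehension and regex) can be compared with the following script, which runs each code string 100 times against a text containing 40k sentences.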

import timeit
import re

# Set up a testing text with 40k sentences.
txt = (
    "the car is running, the car has wheels, wheels are round, the road is clear, "
    * 10000
)

# The (simple) list comprehension strategy
list_comp_time = timeit.timeit(
    '[sentence for sentence in txt.split(",") if "car" in sentence or "wheels" in sentence]',
    globals=globals(),
    number=100,
)

# A strategy using regex
regex_time = timeit.timeit(
    "re.findall('[^,]*(?:{}|{})[^,]*'.format('car', 'wheels'), txt)",
    globals=globals(),
    number=100,
)

print(f"The List Comprehension method took {list_comp_time:.8f}s")
print(f"The Regex method took {regex_time:.8f}s")

The output is:

The List Comprehension method took 0.48497320s
The Regex method took 3.71355870s

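In other words, the list comprehension approach is considerably more time-efficient: roughly 7.7 times faster than the regex in this run.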
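Before relying on such timings, it can help to confirm that both strategies return the same sentences and to repeat the measurement a few times. The sketch below is one possible way to do that and is not the answer's script: the function names, the smaller test text, and the `number`/`repeat` values are all assumptions, and absolute timings will vary by machine.

```python
import re
import timeit

# A smaller test text (4k sentences) so the sketch runs quickly.
txt = "the car is running, the car has wheels, wheels are round, the road is clear, " * 1000

# Compile once so pattern construction is not part of the timed work.
pattern = re.compile(r'[^,]*(?:car|wheels)[^,]*')


def with_list_comp(text):
    return [s.strip() for s in text.split(",") if "car" in s or "wheels" in s]


def with_regex(text):
    return [m.strip() for m in pattern.findall(text)]


# Both strategies should agree before their timings are compared.
assert with_list_comp(txt) == with_regex(txt)

# timeit.repeat returns one total per repetition; the minimum is
# commonly taken as the figure least disturbed by other processes.
list_comp_best = min(timeit.repeat(lambda: with_list_comp(txt), number=10, repeat=3))
regex_best = min(timeit.repeat(lambda: with_regex(txt), number=10, repeat=3))

print(f"List comprehension (best of 3): {list_comp_best:.6f}s")
print(f"Regex (best of 3): {regex_best:.6f}s")
```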

Regarding "python - Extract sentences containing certain words using regex", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/70729655/
