Python regex to match a complete sentence containing a keyword, without breaking on periods that do not end the sentence (.com, U.S., etc.)

Reposted. Author: 太空宇宙. Updated: 2023-11-03 21:16:08

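I am trying to write a regular expression that matches a complete sentence containing a keyword. Here is a sample paragraph:

“Cash taxes paid, net of refunds, were $412 million 2016. The U.S. Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.S. tax.”

I want to match the complete sentence that contains the keyword “subsidiaries”. To do this, I have been using the following regular expression: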

[^.]*?subsidiaries[^.]*\.

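However, this matches only “Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.”, because the match starts after and ends at a “.” inside “U.S.”. Is there a way to specify in the expression that I do not want it to stop at particular phrases such as “U.S.” or “.com”?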
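The behavior described above can be reproduced with a short script. This is only a minimal sketch using the standard `re` module; the variable names are illustrative and not part of the original question.

```python
import re

text = (
    "Cash taxes paid, net of refunds, were $412 million 2016. "
    "The U.S. Tax Act imposed a mandatory one-time tax on accumulated "
    "earnings of foreign subsidiaries and changed how foreign earnings "
    "are subject to U.S. tax."
)

# The character class [^.] can never cross a period, so the match is
# confined to the text between two consecutive periods (plus the closing one).
pattern = re.compile(r"[^.]*?subsidiaries[^.]*\.")
match = pattern.search(text)

# The match starts right after the first "U.S." and stops at the "U." of the second.
print(repr(match.group()))
```

Because every period is treated as a sentence boundary, both occurrences of "U.S." cut the sentence short.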

Best answer

I suggest tokenizing the text into sentences with NLTK, and then checking each item for the presence of the string.

import nltk, re

# sent_tokenize relies on NLTK's Punkt sentence tokenizer data, which must be
# downloaded once, e.g. nltk.download('punkt') ('punkt_tab' on newer NLTK releases).
text = "Cash taxes paid, net of refunds, were $412 million 2016. The U.S. Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.S. tax."
sentences = nltk.sent_tokenize(text)  # split the text into sentences
word = "subsidiaries"
# keep only the sentences that contain the keyword
print([sent for sent in sentences if word in sent])
# => ['The U.S. Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.S. tax.']

To extract only affirmative sentences (those ending with a period), add a sent.endswith('.') check to the condition:

print([sent for sent in sentences if word in sent and sent.endswith('.')])

You can even use a regular expression to check that the word you are filtering on appears as a whole word:

print([sent for sent in sentences if re.search(r'\b{}\b'.format(re.escape(word)), sent)])
# => ['The U.S. Tax Act imposed a mandatory one-time tax on accumulated earnings of foreign subsidiaries and changed how foreign earnings are subject to U.S. tax.']
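For readers who would rather stay with a single regular expression, as the question asks, one possible approach is to let the pattern consume a list of protected phrases as whole units instead of stopping at their periods. The sketch below is not part of the original answer: the helper name `sentence_with_keyword_pattern` and the `exceptions` list are hypothetical, and the list has to be maintained by hand.

```python
import re

def sentence_with_keyword_pattern(keyword, exceptions):
    """Build a regex for a period-terminated sentence containing `keyword`.

    `exceptions` are phrases whose periods should not end a sentence
    (hypothetical, hand-maintained list).
    """
    # Try longer phrases first so that a shorter one cannot shadow them.
    protected = "|".join(
        re.escape(e) for e in sorted(exceptions, key=len, reverse=True)
    )
    # One "unit" is either a whole protected phrase or any non-period character.
    unit = rf"(?:{protected}|[^.])"
    return re.compile(rf"{unit}*?\b{re.escape(keyword)}\b{unit}*\.")

text = (
    "Cash taxes paid, net of refunds, were $412 million 2016. "
    "The U.S. Tax Act imposed a mandatory one-time tax on accumulated "
    "earnings of foreign subsidiaries and changed how foreign earnings "
    "are subject to U.S. tax."
)

pattern = sentence_with_keyword_pattern("subsidiaries", ["U.S.", ".com"])
# Strip the whitespace that separates the match from the previous sentence.
found = [m.group().strip() for m in pattern.finditer(text)]
print(found)
```

This sketch has a known limitation: if a sentence ends with a protected phrase (for example "... in the U.S."), the final period is ambiguous and the match may be cut short. A trained sentence tokenizer such as the NLTK one used above is likely to be more robust for general text.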

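Regarding "Python regex to match a complete sentence containing a keyword, without breaking on periods that do not end the sentence (.com, U.S., etc.)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54695000/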
