python - Capturing possessives and prefixes with a Python regular expression


Reposted by: 太空宇宙 · Updated: 2023-11-03 15:47:03


I am trying to write a regular expression for Python that captures the various forms of "archipelago" that appear in a corpus.

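Here is a test string:

This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'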

I want to capture the following from the string:

archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes

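Attempt 1

Using the regular expression (archipelag.*?)\b and testing it with Pythex, I capture part of all six forms. But there are problems:

archipelago's is captured only as archipelago. I want the possessive.
meta-archipelagic is captured only as archipelagic. I want to be able to capture hyphenated prefixes.
protoarchipelagic is captured only as archipelagic. I want to be able to capture unhyphenated prefixes.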
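The behaviour described in Attempt 1 can be reproduced outside Pythex with a short script. This is only a sketch: the variable names (attempt_1, matches_1) are illustrative and not part of the original question. The prefixes are lost because the match can only start at the literal "archipelag", and the possessive is lost because \b is satisfied between "o" and the apostrophe.

```python
import re

corpus = (
    "This is my sentence about islands, archipelagos, and archipelagic spaces. "
    "I want to make sure that the archipelago's cat is not forgotten. "
    "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
    "who tend to spell the plural 'archipelagoes.'"
)

# Lazy match from the stem up to the first word boundary.
attempt_1 = re.compile(r"(archipelag.*?)\b")
matches_1 = attempt_1.findall(corpus)

print(matches_1)
# ['archipelagos', 'archipelagic', 'archipelago',
#  'archipelagic', 'archipelagic', 'archipelagoes']
```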

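Attempt 2

If I try the regular expression (archipelag.*?)\s (see Pythex), the possessive archipelago's is now captured, but the first instance is captured together with its trailing comma (i.e. archipelagos,). It also fails to capture the final 'archipelagoes.', which is not followed by whitespace.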
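Attempt 2 can be checked the same way. Again this is a sketch with illustrative names (attempt_2, matches_2). Because the pattern requires a whitespace character after the word, trailing punctuation is swallowed into the capture, and the last form, which sits at the very end of the string, produces no match at all.

```python
import re

corpus = (
    "This is my sentence about islands, archipelagos, and archipelagic spaces. "
    "I want to make sure that the archipelago's cat is not forgotten. "
    "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
    "who tend to spell the plural 'archipelagoes.'"
)

# Lazy match from the stem up to the first whitespace character.
attempt_2 = re.compile(r"(archipelag.*?)\s")
matches_2 = attempt_2.findall(corpus)

print(matches_2)
# ['archipelagos,', 'archipelagic', "archipelago's",
#  'archipelagic', 'archipelagic']
```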

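Best answer

The regular expression ((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?) works for this. You may want to modify it further if you have other requirements.

Note the use of non-capturing groups (?:...) to group parts of the expression, so that ? can match zero or one occurrence of the whole group.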
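One practical reason for keeping those groups non-capturing is how re.findall reports results: with a single capturing group it returns plain strings, while additional capturing groups turn each result into a tuple. The sketch below compares the answer's pattern with a hypothetical variant that uses ordinary groups. The sample string and the names non_capturing and capturing are assumptions made for illustration.

```python
import re

sample = "the archipelago's meta-archipelagic cat"

# The answer's pattern: only the outer group captures.
non_capturing = re.compile(r"((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?)")

# Hypothetical variant: the optional parts are ordinary capturing groups.
capturing = re.compile(r"((\b\w+\b-)?\b\w*archipelag\w*\b('s)?)")

plain = non_capturing.findall(sample)
tuples = capturing.findall(sample)

print(plain)   # ["archipelago's", 'meta-archipelagic']
print(tuples)  # [("archipelago's", '', "'s"), ('meta-archipelagic', 'meta-', '')]
```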

import re

pat = re.compile(r"((?:\b\w+\b-)?\b\w*archipelag\w*\b(?:'s)?)")

corpus = "This is my sentence about islands, archipelagos, and archipelagic spaces. I want to make sure that the archipelago's cat is not forgotten. And we cannot forget the meta-archipelagic and protoarchipelagic historians, who tend to spell the plural 'archipelagoes.'"

for match in pat.findall(corpus):
    print(match)

This prints:

archipelagos
archipelagic
archipelago's
meta-archipelagic
protoarchipelagic
archipelagoes
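For readers who want to see what each piece of the pattern does, the same expression can be written with re.VERBOSE so that every part carries a comment. This is a sketch of an equivalent formulation, not the answer's own code. The name pat_verbose is illustrative.

```python
import re

corpus = (
    "This is my sentence about islands, archipelagos, and archipelagic spaces. "
    "I want to make sure that the archipelago's cat is not forgotten. "
    "And we cannot forget the meta-archipelagic and protoarchipelagic historians, "
    "who tend to spell the plural 'archipelagoes.'"
)

pat_verbose = re.compile(
    r"""
    (                   # capture the whole word form
        (?:\b\w+\b-)?   # optional hyphenated prefix such as "meta-"
        \b\w*           # optional unhyphenated prefix such as "proto"
        archipelag      # the shared stem
        \w*             # suffix such as "os", "ic" or "oes"
        \b              # end of the word
        (?:'s)?         # optional possessive
    )
    """,
    re.VERBOSE,
)

forms = pat_verbose.findall(corpus)
for form in forms:
    print(form)
```

In verbose mode, whitespace and # comments inside the pattern are ignored, so the commented layout matches the same text as the one-line pattern.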

Here it is on regex101

This post on python - Capturing possessives and prefixes with a Python regular expression is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/49439828/
