
python - Removing numbers from text with a regular expression

Reposted. Author: 太空宇宙. Updated: 2023-11-04 09:27:32

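I am trying to clean up text for a machine learning application. These are essentially "semi-structured" specification documents, and I am trying to remove the section numbers that interfere with NLTK's sent_tokenize() function.

Here is a sample of the text I am working with: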

and a Contract for the work and/or material is entered into with some other person for a
greater amount, the undersigned hereby agrees to forfeit all right and title to the
aforementioned deposit, and the same is forfeited to the Crown.
2.3.3

...

(b)

until thirty-five days after the time fixed for receiving this tender,

whichever first occurs.
2.4

AGREEMENT

Should this tender be accepted, the undersigned agrees to enter into written agreement with
the Minister of Transportation of the Province of Alberta for the faithful performance of the
works covered by this tender, in accordance with the said plans and specifications and
complete the said work on or before October 15, 2019.

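I am trying to remove all section markers (e.g. 2.3.3, 2.4, (b)) without removing the numbers in dates.

Here is my current regex: [0-9]*\.[0-9]|[0-9]\.

Unfortunately, it also matches part of the date in the last paragraph ("2019." becomes "201"), and I don't know how to fix that, as I am not a regex expert.

Thanks for your help!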
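To see why the date gets clipped, it can help to list what the pattern actually matches. The snippet below is a small sketch; the `sample` string is an abbreviated piece of the text above (an assumption for illustration), not the full document.

```python
import re

# The pattern from the question.
pattern = r'[0-9]*\.[0-9]|[0-9]\.'

# Abbreviated sample: a date ending a sentence, followed by a section number.
sample = "complete the said work on or before October 15, 2019.\n2.4\n"

# The second alternative, [0-9]\. , matches any single digit followed by a
# period, so the "9." at the end of "2019." is caught along with "2.4".
matches = re.findall(pattern, sample)
cleaned = re.sub(pattern, '', sample)

print(matches)
print(repr(cleaned))
```

Nothing in the pattern ties a match to the start of a line, so any digit followed by a sentence-ending period is a candidate for removal.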

Accepted answer

You can try replacing the following pattern with an empty string:

((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))

import re
# text holds the document to clean; line-leading section labels are replaced with ''.
output = re.sub(r'((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))', '', text)
print(output)

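This pattern works by matching a section number as \d+(?:\.\d+)*, but only when it appears at the start of a line. It also matches lettered section labels as \([a-z]+\).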
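The same idea can be tried on a shortened sample as a self-contained sketch. Two things here are assumptions rather than part of the answer: the `sample` string is an abbreviated version of the text in the question, and the pattern uses `^` with `re.MULTILINE` in place of the two lookbehinds, which should match at the same positions (start of the string and right after each newline). The name `SECTION_LABEL` is only illustrative.

```python
import re

# Section numbers (2.3.3, 2.4) or lettered labels such as (b),
# matched only at the start of a line.
# re.MULTILINE makes ^ match at the start of the string and after every newline.
SECTION_LABEL = re.compile(r'^(?:\d+(?:\.\d+)*|\([a-z]+\))', re.MULTILINE)

# Abbreviated sample of the tender text from the question.
sample = (
    "aforementioned deposit, and the same is forfeited to the Crown.\n"
    "2.3.3\n"
    "(b)\n"
    "until thirty-five days after the time fixed for receiving this tender,\n"
    "2.4\n"
    "AGREEMENT\n"
    "complete the said work on or before October 15, 2019.\n"
)

# Remove the labels; the lines they occupied are left empty.
cleaned = SECTION_LABEL.sub('', sample)
print(cleaned)
```

Because the date sits in the middle of a line, it is left alone. One caveat of anchoring only on the line start: a line of body text that happens to begin with a number (for example a wrapped sentence starting with "15 days") would lose that number too, so it may be worth checking the output on real documents.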

Regarding "python - Removing numbers from text with a regular expression", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57020171/
