
Python regex replaces a string it should not match

Reposted. Author: 行者123. Updated: 2023-12-01 03:04:15

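Update: this problem was caused by a bug in the regex module, which the developers fixed in commit be893e9.

If you run into a similar problem, update your regex module.
You need version 2017.04.23 or later.

See here for more information.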
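
As a quick way to check whether an installation already includes the fix, the installed release can be compared against 2017.04.23. This is only a sketch: it assumes the package metadata reports the date-style release number, and the helper names (`parse_release`, `has_fix`, `installed_regex_release`) are made up for illustration.

```python
import re
from importlib import metadata

# Release named in the update note above.
MINIMUM_FIXED = (2017, 4, 23)


def parse_release(version_string):
    """Turn a string such as '2017.04.23' into the tuple (2017, 4, 23)."""
    return tuple(int(part) for part in re.findall(r"\d+", version_string)[:3])


def has_fix(version_string, minimum=MINIMUM_FIXED):
    """Return True if the given release is at least the fixed release."""
    return parse_release(version_string) >= minimum


def installed_regex_release():
    """Return the installed release string of 'regex', or None if absent."""
    try:
        return metadata.version("regex")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    release = installed_regex_release()
    if release is None:
        print("regex is not installed")
    elif has_fix(release):
        print("regex %s should already contain the fix" % release)
    else:
        print("regex %s predates the fix; consider upgrading" % release)
```

If the check reports an older release, upgrading the package (for example with `pip install --upgrade regex`) is the remedy suggested above.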


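Background: I am using a collection of regular expressions (english.lex) in a third-party Text2Speech engine to normalize input text before it is spoken.

For debugging purposes, I wrote the script below to see what my regex collection actually does to the input text.

My problem is that it performs a replacement for a regex that simply does not match.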


I have 3 files:

regex_preview.py

#!/usr/bin/env python
import codecs
import regex as re

input="Text2Speach Regex Test.txt"
dictionary="english.lex"

with codecs.open(dictionary, "r", "utf16") as f:
    reg_exen = f.readlines()
with codecs.open(input, "r+", "utf16") as g:
    content = g.read().replace(r'\\\\\"','"')

# apply all regular expressions to content
for line in reg_exen:
    line=line.strip()

    # skip comments
    if line == "" or line[0] == "#":
        pass
    else:
        # remove " from lines and split them into pattern and substitute
        pattern=re.sub('" "(.*[^\\\\])?"$','', line)[1:].replace('\\"','"')
        substitute=re.sub('\\\\"', '"', re.sub('^".*[^\\\\]" "', '', line)[:-1]).replace('\\"','"')

        print("\n'%s' ==> '%s'" % (pattern, substitute))

        print(content.strip())
        content = re.sub(pattern, substitute, content)
        print(content.strip())

english.lex (UTF-16 encoded)

# punctuation normalization
"(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+" "\""
"(…|—)" "..."

# stammered words: more general version accepting all words like ab... abcde (stammered words with vocal in stammered part)
"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t\f ]?)+(\1\w{2,})" "\1-\2"
# this should not match, but somehow it does o.O

Text2Speach Regex Test.txt (UTF-16 encoded)

“Erm….yes. Thank you for that.”
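
The quoting logic in regex_preview.py is the hardest part of the script to follow. As a rough sketch, and assuming every english.lex entry consists of exactly two double-quoted fields separated by a single space, with `\"` as the only escape that needs undoing (an assumption drawn from the sample entries, not from the engine's documentation), the split could be done with a single pattern. The names `LEX_LINE` and `parse_lex_line` are hypothetical, and the sketch uses the standard library `re` module.

```python
import re

# Two quoted fields separated by one space. Inside a field, a backslash
# always consumes the following character, so \" does not end the field.
LEX_LINE = re.compile(r'^"((?:[^"\\]|\\.)*)" "((?:[^"\\]|\\.)*)"$')


def parse_lex_line(line):
    """Return (pattern, substitute) for a lexicon entry, or None otherwise.

    Blank lines, comment lines starting with '#', and lines that do not
    fit the assumed two-field layout all yield None.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = LEX_LINE.match(line)
    if match is None:
        return None
    # Undo the \" escaping; all other backslashes belong to the regex.
    pattern, substitute = (field.replace('\\"', '"') for field in match.groups())
    return pattern, substitute


if __name__ == "__main__":
    print(parse_lex_line(r'"(\w+)-(\w+)" "\1-\2"'))
    print(parse_lex_line('# punctuation normalization'))
```

Returning None for anything that does not fit keeps the caller's loop simple: it can skip such lines instead of applying a half-parsed pattern.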

Running the script produces this output, in which the last regex somehow matches the content:

'(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+' ==> '"'
“Erm….yes. Thank you for that.”
"Erm….yes. Thank you for that."

'(…|—)' ==> '...'
"Erm….yes. Thank you for that."
"Erm....yes. Thank you for that."

'(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})' ==> '\1-\2'
"Erm....yes. Thank you for that."
"-yes. Thank you for that."

What I have tried so far:

I created this snippet to reproduce the problem:

#!/usr/bin/env python

import re
import codecs

content = u'"Erm....yes. Thank you for that."\n'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
substitute = r"\1-\2"
content = re.sub(pattern, substitute, content)

print(content)

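But this actually behaves as it should, so I am puzzled about what is going on here.

I hope someone can point me in the right direction for further investigation...

Best answer

The original script uses the alternative regex module instead of the standard library re module.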

import regex as re

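Evidently the two modules behave differently in this case. My guess is that it has something to do with nested groups. This expression contains a capturing group inside a non-capturing group, which is far too magical for my taste.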

import re     # standard library
import regex # completely different implementation

content = '"Erm....yes. Thank you for that."'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
substitute = r"\1-\2"

print(re.sub(pattern, substitute, content))
print(regex.sub(pattern, substitute, content))

Output:

"Erm....yes. Thank you for that."
"-yes. Thank you for that."

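When a substitution looks wrong, it can help to inspect the match itself instead of inferring it from the replaced text. The following is a minimal sketch with a hypothetical helper, `describe_match`, which takes the module as an argument so the same call works with the standard library `re` or, if it is installed, the third-party `regex` module.

```python
import re


def describe_match(engine, pattern, text):
    """Report the first match that `engine` finds in `text`, or None.

    `engine` is any module exposing a re-compatible search() function.
    """
    match = engine.search(pattern, text)
    if match is None:
        return None
    return {
        "engine": engine.__name__,   # shows which module really ran
        "span": match.span(),        # where the match starts and ends
        "matched": match.group(0),   # the text that would be replaced
        "groups": match.groups(),    # contents of the capturing groups
    }


content = '"Erm....yes. Thank you for that."'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"

# With the standard library this prints None, consistent with the
# unchanged first line of the output above.
print(describe_match(re, pattern, content))
```

Because the report includes `engine.__name__`, it also makes an alias such as `import regex as re` visible at a glance.
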
Regarding "Python regex replaces a string it should not match", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43560759/
