python - Backslash escape sequences and word boundaries in Python regular expressions

Reposted by: 行者123 · Updated: 2023-12-01 02:21:59

I am currently using re.sub(re.escape("andrew)"), "SUB", stringVar).

Expected behavior:

import re

stringVar = " andrew) "
re.sub(re.escape("andrew)"), "SUB", stringVar) # Returns " SUB "

Unexpected behavior:

stringVar = "zzzandrew)zzz"
re.sub(re.escape("andrew)"), "SUB", stringVar) # Returns "zzzSUBzzz"

So I tried using word boundaries to fix the "zzzandrew)zzz" case, but my fix broke my base case.

stringVar = " andrew) "
re.sub(r'\b%s\b' % re.escape("andrew)"), "SUB", stringVar) # Breaks and returns the original stringVar

From https://docs.python.org/2.0/ref/strings.html: raw strings use different rules for backslash escape sequences. So what else should I be doing besides re.escape?

Best answer

From the Python re module docs:

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. For example, r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
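The example at the end of that documentation excerpt can be checked directly. This is only a minimal sketch; the names `pattern`, `samples`, and `results` are illustrative and do not come from the original answer.

```python
import re

pattern = re.compile(r'\bfoo\b')

# The first four strings are the ones the docs say match;
# the last two are the ones the docs say do not.
samples = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']

# Map each sample to whether the pattern is found anywhere in it.
results = {s: bool(pattern.search(s)) for s in samples}

for s in samples:
    print(repr(s), results[s])
```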

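In your case, the only word boundary near the end of the name lies between andrew and ), because ) is the first character that is neither alphanumeric nor an underscore. The trailing \b in your pattern sits after the escaped ), where ) and the following space are both non-word characters, so there is no boundary there and the match fails. The example below shows what happens when you include or exclude ")" in the escaped text.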

>>> import re
>>> stringVar = " andrew) "
>>> re.sub(r'\b%s\b' % re.escape("andrew)"), "SUB", stringVar)
' andrew) '
>>> re.sub(r'\b%s\b' % re.escape("andrew"), "SUB", stringVar)
' SUB) '
>>> stringVar = "zzzandrew)zzz"
>>> re.sub(r'\b%s\b' % re.escape("andrew"), "SUB", stringVar)
'zzzandrew)zzz'
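To see where \b can match in the base-case string, one option is to list every position at which the empty pattern \b succeeds. This is an illustrative sketch; `string_var` and `boundaries` are names chosen here, not part of the original answer.

```python
import re

string_var = " andrew) "

# Collect every index where \b matches, i.e. where exactly one of the
# two neighbouring characters is a word character (\w).
boundaries = [m.start() for m in re.finditer(r'\b', string_var)]

print(boundaries)  # [1, 7]
```

Index 1 is just before "a" and index 7 is between "w" and ")". There is no boundary at index 8, after ")", which is where the pattern with the trailing \b would need one.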

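If you must include ")" in the escaped text, you can use a positive lookahead assertion as shown below. It matches only when "andrew)" is followed by a whitespace character (\s) or a non-word character (\W), without consuming that character.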

>>> stringVar = " andrew) "
>>> re.sub(r'\b%s(?=\s)' % re.escape("andrew)"), "SUB", stringVar)
' SUB '
>>> stringVar = "zzzandrew)zzz"
>>> re.sub(r'\b%s(?=\s)' % re.escape("andrew)"), "SUB", stringVar)
'zzzandrew)zzz'
>>> stringVar = " andrew) "
>>> re.sub(r'\b%s(?=\W)' % re.escape("andrew)"), "SUB", stringVar)
' SUB '
>>> stringVar = "zzzandrew)zzz"
>>> re.sub(r'\b%s(?=\W)' % re.escape("andrew)"), "SUB", stringVar)
'zzzandrew)zzz'
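One limitation of the lookaheads above is that (?=\s) and (?=\W) both require some character to follow, so they do not match when "andrew)" is the last thing in the string. A possible alternative, which is not part of the original answer, is to replace both boundaries with negative lookarounds that only forbid an adjacent word character. The helper name `replace_token` below is hypothetical.

```python
import re

def replace_token(token, replacement, text):
    """Replace `token` only where no word character touches either side.

    (?<!\\w) and (?!\\w) are used here in place of \\b, so the check still
    works when the token starts or ends with a non-word character such as
    ")" and when the token sits at the very start or end of the string.
    """
    pattern = r'(?<!\w)%s(?!\w)' % re.escape(token)
    return re.sub(pattern, replacement, text)

print(repr(replace_token("andrew)", "SUB", " andrew) ")))      # ' SUB '
print(repr(replace_token("andrew)", "SUB", "zzzandrew)zzz")))  # 'zzzandrew)zzz'
print(repr(replace_token("andrew)", "SUB", "andrew)")))        # 'SUB'
```

Note that the replacement string is still interpreted by re.sub, so backslashes in it would be treated as escapes.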

A similar question about backslash escape sequences and word boundaries in Python regular expressions can be found on Stack Overflow: https://stackoverflow.com/questions/47871938/