
python - Regex does not match Unicode characters

Reposted. Author: 行者123. Last updated: 2023-12-01 03:44:27

I am trying to use a regular expression on text that contains special characters such as à, è, ù and so on.

filter_2 = ur'(?:^\|\s+)?(?:(?:main_interests)|(?:influenced)|(?:influences))\s+?=[\s\W]+?(?:[\w}])*?([\d\w\s\-()*–&;\[\]|.<>:/",\']*)(?=\n)'
compiled = re.compile(filter_2, flags=re.U | re.M)
filter_list = re.findall(compiled, information)

The text below is the result of evaluating that expression.

[[Pedro Calderón de la Barca|Calderón]], [[Christian Fürchtegott Gellert|Gellert]], [[Oliver Goldsmith|Goldsmith]], [[Hafez]], [[Johann Gottfried Herder|Herder]], [[Homer]], [[Kālidāsa]], [[Kant]], [[Friedrich Gottlieb Klopstock|Klopstock]], [[Gotthold Ephraim Lessing|Lessing]], [[Carl Linnaeus|Linnaeus]], [[James Macpherson|Macpherson]], [[Jean-Jacques Rousseau|Rousseau]], [[Friedrich Schiller|Schiller]], [[William Shakespeare|Shakespeare]], [[Spinoza]], [[Emanuel Swedenborg|Swedenborg]],[[Karl Robert Mandelkow]], Bodo Morawe: Goethes Briefe. 2. edition. Vol. 1: Briefe der Jahre 1764–1786. ''Christian Wegner'', Hamburg 1968, p. 709 [[Johann Joachim Winckelmann|Winckelmann]]

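Now, when I apply a second regular expression to the text above to extract the words inside the double square brackets, the result is wrong. Every name that contains a special character (such as à, ù or è) is cut off at that character, so the output is not what I expect.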

filter_6 = ur'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
another_compiled = re.compile(filter_6, flags=re.U | re.M)
another_filtered_list = re.findall(another_compiled, (str(filter_list)))

These are my results:

[('Pedro Calder', ''), ('Christian F', ''), ('Oliver Goldsmith', ''), ('Hafez', ''), ('Johann Gottfried Herder', ''), ('Homer', ''), ('K', ''), ('Kant', ''), ('Friedrich Gottlieb Klopstock', ''), ('Gotthold Ephraim Lessing', ''), ('Carl Linnaeus', ''), ('James Macpherson', ''), ('Jean-Jacques Rousseau', ''), ('Friedrich Schiller', ''), ('William Shakespeare', ''), ('Spinoza', ''), ('Emanuel Swedenborg', ''), ('Karl Robert Mandelkow', ''), ('Johann Joachim Winckelmann', ''), ('Thomas Carlyle', ''), ('Ernst Cassirer', ''), ('Charles Darwin', ''), ('Sigmund Freud', ''), ('G', ''), ('Andr', ''), ('Hermann Hesse', ''), ('G.W.F. Hegel', ''), ('Muhammad Iqbal', ''), ('Daisaku Ikeda', ''), ('Carl Gustav Jung', ''), ('Milan Kundera', ''), ('S', ''), ('Jean-Baptiste Lamarck', ''), ('Joaquim Maria Machado de Assis', ''), ('Thomas Mann', ''), ('Friedrich Nietzsche', ''), ('France Pre', ''), ('Grigol Robakidze', ''), ('Friedrich Schiller', ''), ('Oswald Spengler', ''), ('Max Stirner', ''), ('Friedrich Wilhelm Joseph Schelling', ''), ('Arthur Schopenhauer', ''), ('Oswald Spengler', ''), ('Rudolf Steiner', ''), ('Henry David Thoreau', ''), ('Nikola Tesla', ''), ('Ivan Turgenev', ''), ('Ludwig Wittgenstein', ''), ('Richard Wagner', ''), ('Leopold von Ranke', '')]

These are the results I want to get:

MATCH 1 1. [2-28] Pedro Calderón de la Barca
MATCH 2 1. [43-72] Christian Fürchtegott Gellert
MATCH 3 1. [86-102] Oliver Goldsmith
MATCH 4 1. [118-123] Hafez
MATCH 5 1. [129-152] Johann Gottfried Herder
MATCH 6 1. [165-170] Homer
MATCH 7 1. [176-184] Kālidāsa
MATCH 8 1. [190-194] Kant
MATCH 9 1. [200-228] Friedrich Gottlieb Klopstock
MATCH 10 1. [244-268] Gotthold Ephraim Lessing
MATCH 11 1. [282-295] Carl Linnaeus
MATCH 12 1. [310-326] James Macpherson
MATCH 13 1. [343-364] Jean-Jacques Rousseau
MATCH 14 1. [379-397] Friedrich Schiller
MATCH 15 1. [412-431] William Shakespeare
MATCH 16 1. [449-456] Spinoza
MATCH 17 1. [462-480] Emanuel Swedenborg
MATCH 18 1. [501-522] Karl Robert Mandelkow
MATCH 19 1. [659-685] Johann Joachim Winckelmann

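All of the regular expressions were tested online and work perfectly there. Is there a way to actually include these special characters?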

Best answer

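In Python 3, the regex does not compile (the ur'' string prefix is not valid syntax there). It seemed to work for me when I changed this: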

filter_6 = ur'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'

to just a unicode (non-raw) string:

filter_6 = u'(?<=\[\[)([\w\s.-]+)((?=]])|(?=|))'
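As a rough Python 3 sketch of the same idea: a plain r'' literal is already a Unicode pattern in Python 3, so neither the u prefix nor re.U should be needed. The sample text is a shortened piece of the input from the question, and the name link_target is made up for this illustration. The lookahead here is also tightened to "]]" or a literal "|", since the unescaped (?=|) in the original matches the empty string; that is an assumption about what the pattern was meant to do, not something the answer states.

```python
import re

# Shortened sample of the wiki-style text from the question.
text = "[[Pedro Calderón de la Barca|Calderón]], [[Hafez]], [[Kālidāsa]], [[Kant]]"

# Raw string: backslashes reach the regex engine unchanged.
# In Python 3, str patterns are Unicode-aware by default, so \w
# also matches letters such as ó and ā.
# Assumption: the intended lookahead is "]]" or a literal "|".
link_target = re.compile(r'(?<=\[\[)([\w\s.-]+)(?=\]\]|\|)')

names = link_target.findall(text)
print(names)
```

With a single capture group, findall returns a plain list of strings instead of the tuples with an empty second element seen in the question.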

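In Python 2, I think the problem is the conversion of the list to a string. Calling str() on a list uses the repr of each element, so non-ASCII characters end up as escape sequences such as \xf3, which \w does not match. Changing str(filter_list) to ' '.join(filter_list) seemed to work for me.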
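The difference between the two conversions can be sketched in Python 3 as follows. This is an illustration under assumptions: ascii() is used as a stand-in for what str() does to a list of unicode strings in Python 2 (Python 3's own str() on a list keeps printable non-ASCII characters), and the two-element filter_list is a shortened, made-up sample in the shape of the question's data.

```python
import re

# Shortened, hypothetical stand-in for the list returned by the first findall.
filter_list = [
    "[[Pedro Calderón de la Barca|Calderón]], [[Homer]]",
    "[[Christian Fürchtegott Gellert|Gellert]]",
]

# Same character class as filter_6, without the lookaheads.
pattern = re.compile(r'(?<=\[\[)([\w\s.-]+)')

# ascii() escapes non-ASCII characters (ó becomes \xf3), which is assumed
# here to mimic Python 2's str() on a list of unicode strings.
as_repr = ascii(filter_list)

# Joining keeps the original characters intact.
joined = ' '.join(filter_list)

from_repr = pattern.findall(as_repr)  # names stop at the backslash of the escape
from_join = pattern.findall(joined)   # names are matched in full

print(from_repr)
print(from_join)
```

The truncated names in from_repr have the same shape as the 'Pedro Calder' and 'Christian F' entries in the question's output, which is consistent with the explanation above.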

The original question, python - Regex does not match Unicode characters, can be found on Stack Overflow: https://stackoverflow.com/questions/39170123/
