
python - Regex: split a long string on tabs and newlines

Reposted. Author: 太空宇宙. Updated: 2023-11-04 09:57:57

Consider the following string:

08/07/2017Peter Praet: Interview with De StandaardInterview with Peter Praet,  Member of the Executive Board of the ECB,  conducted by Pascal Dendooven and Goele De Cort on 3 July 2017,  published on 8 July 2017ENGLISH\n\t\t\t\t\t\t\tOTHER LANGUAGES\n\t\t\t\t\t\t\t(1)\n\t\t\t\t\t\t\t+\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\tSelect your language\n\t\t\t\t\t\t\t\n\t\t\t\t\t\tNederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaInterview with Benoît Cœuré,  Member of the Executive Board of the ECB,  conducted by Marie Charrel (Le Monde) and Alessandro Barbera (La Stampa),  on 3 July,  published on 7 July 2017ENGLISH"

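I want to extract the two sentences it contains, namely:

  • “08/07/2017Peter Praet: Interview with De StandaardInterview with Peter Praet,  Member of the Executive Board of the ECB,  conducted by Pascal Dendooven and Goele De Cort on 3 July 2017,  published on 8 July 2017ENGLISH”

  • “NederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaInterview with Benoît Cœuré,  Member of the Executive Board of the ECB,  conducted by Marie Charrel (Le Monde) and Alessandro Barbera (La Stampa),  on 3 July,  published on 7 July 2017ENGLISH”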

I tried [\w]+(?!\\t), but this also captures the t in t(1 and similar places.

What is the correct syntax here? Thanks!
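
The failed attempt can be reproduced on a shortened sample. This is only a sketch: it assumes the string holds literal backslash sequences (hence the raw string), and the variable names are illustrative rather than taken from the question.

```python
import re

# Shortened sample; the raw string keeps "\n" and "\t" as literal two-character sequences.
text = r"ENGLISH\n\t\tOTHER LANGUAGES\n\t(1)"

# The lookahead only rejects a run of word characters that is directly followed
# by a literal "\t". The "t" of an escape that precedes ordinary text still matches.
tokens = re.findall(r'[\w]+(?!\\t)', text)
print(tokens)  # ['ENGLISH', 'tOTHER', 'LANGUAGES', 't', '1']
```

The stray t in 'tOTHER' and the lone 't' before (1) are the leftovers the question describes.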

Best answer

Here you go, split on this:

r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*'

http://www.regex101.com/r/lNv8VO/1
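
As a rough illustration, the pattern can be handed to re.split. The sketch below uses a shortened version of the sample and assumes the string contains literal backslash sequences, which is what the pattern matches. If the data held real tab and newline characters instead, the pattern would have to match those. The names text, SEPARATOR and parts are illustrative.

```python
import re

# Shortened version of the sample; the raw string keeps "\n" and "\t" as
# literal backslash sequences, which is what the pattern looks for.
text = (
    r"08/07/2017Peter Praet: Interview with De Standaard, published on 8 July 2017ENGLISH"
    r"\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)\n\t\t\t+\n\t\t\n\t\t\tSelect your language\n\t\t"
    r"NederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La Stampa"
)

SEPARATOR = re.compile(r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*')

# Everything from the first escape sequence to the last one is one separator.
parts = SEPARATOR.split(text)
for part in parts:
    print(part)
```

On this input the split yields two pieces: the Peter Praet sentence and the one starting with NederlandsNL.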

Explanation

 (?: \\ [\\ntr] )+             # The start of a block of escaped \ or n or t or r
                               # Get as many as are there (like '\n\n\r\r\t\t\n\\', etc)
 (?:                           # Cluster optional
      (?:                           # ----------
           (?! \\ [\\ntr] )              # Not an escaped \ or n or t or r ahead
           .                             # This is ok, consume this
      )*                            # ---------- 0 to many times
      \\ [\\ntr]                    # A required escaped \ or n or t or r at the end
 )*                            # Cluster end, do 0 to many times
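
The expanded layout can be used directly in Python through the re.VERBOSE flag, which ignores whitespace outside character classes and treats # as a comment marker. A minimal sketch follows; the comments are reworded and the names COMPACT, VERBOSE and sample are made up for the example.

```python
import re

COMPACT = re.compile(r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*')

VERBOSE = re.compile(r"""
    (?: \\ [\\ntr] )+          # one or more escaped backslash, n, t or r in a row
    (?:                        # optional cluster
        (?:
            (?! \\ [\\ntr] )   # no escape sequence starts here
            .                  # so consume one character
        )*
        \\ [\\ntr]             # the cluster must end with an escape sequence
    )*
""", re.VERBOSE)

sample = r"first\n\t\tOTHER LANGUAGES\n\t(1)\n\tsecond"
print(VERBOSE.split(sample))  # ['first', 'second']
```

Both compiled patterns split the sample identically, so the verbose form is just a more readable spelling of the one-liner.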

Note
The regex above splits the text into at most two parts.
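
The reason is that a single match runs from the first escape sequence to the last one, so any text in between is treated as part of the separator. A small sketch with a made-up three-sentence input shows the effect:

```python
import re

SEPARATOR = re.compile(r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*')

# Hypothetical input: three sentences separated by two blocks of escapes.
text = r"Sentence one\n\t\tSentence two\n\t\tSentence three"

parts = SEPARATOR.split(text)
print(parts)  # ['Sentence one', 'Sentence three']
```

The middle sentence disappears because it sits between two escape sequences and is absorbed into the match.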

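If the separator block itself contains ordinary text (plain, unescaped r, n, t and other characters, such as OTHER LANGUAGES), you can still get more than two parts
by absorbing that text into the separator only when it is shorter than some threshold.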

@MadPhysicist suggested a length of 20. I gave it 40 and put it into
the regex as a range, in this part: (?:(?!\\[\\ntr]).){0,40}?

The new regex is

r'(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,40}?\s*\\[\\ntr])*'

https://regex101.com/r/lNv8VO/3
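
A hedged sketch of the bounded pattern in use: the input below is invented for illustration, with short labels that should be absorbed into the separator and one middle sentence of more than 40 characters that should survive as its own part.

```python
import re

BOUNDED = re.compile(r'(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,40}?\s*\\[\\ntr])*')

# Hypothetical input: short labels between escapes plus one long middle sentence.
text = (
    r"Opening sentence ENGLISH\n\t\tOTHER LANGUAGES\n\t(1)\n\t"
    r"A middle sentence that is clearly longer than forty characters"
    r"\n\t\tSelect your language\n\tNederlandsNL closing sentence"
)

parts = BOUNDED.split(text)
for part in parts:
    print(part)
```

The labels OTHER LANGUAGES, (1) and Select your language are short enough to be swallowed, while the long middle sentence ends the first match and comes back as a separate part, giving three pieces.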

Explanation

 (?s)                          # Modifiers:  dot-all
 (?: \\ [\\ntr] )+             # The start of a block of escaped \ or n or t or r
                               # Get as many as are there (like '\n\n\r\r\t\t\n\\', etc)
 (?:                           # Cluster optional
      \s*                           # Optional whitespace
      (?:                           # ----------
           (?! \\ [\\ntr] )              # Not an escaped \ or n or t or r ahead
           .                             # This is ok, consume this
      ){0,40}?                      # ---------- Allow (non-greedy) 0 to 40 characters for multiple sections
      \s*                           # Optional whitespace
      \\ [\\ntr]                    # A required escaped \ or n or t or r at the end
 )*                            # Cluster end, do 0 to many times
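
Since the threshold is a judgment call, one option is to build the pattern from a parameter. The helper build_separator below is hypothetical (it is not part of the original answer) and simply substitutes the bound into the same pattern.

```python
import re


def build_separator(max_gap: int) -> re.Pattern:
    """Compile the bounded separator with a configurable bound (hypothetical helper)."""
    return re.compile(
        r'(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,%d}?\s*\\[\\ntr])*' % max_gap
    )


# Made-up input: two short labels between the parts we care about.
text = r"Alpha\n\tOTHER LANGUAGES\n\tSelect your language\n\tOmega"

wide = build_separator(40).split(text)
narrow = build_separator(16).split(text)
print(wide)    # ['Alpha', 'Omega']
print(narrow)  # ['Alpha', 'Select your language', 'Omega']
```

With a bound of 40 both labels are absorbed. With a bound of 16 the 20-character label no longer fits and shows up as a part of its own, so the bound should sit above the longest label and below the shortest real sentence.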

Regarding python - Regex: split a long string on tabs and newlines, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45047122/
