
Python returns no matches for a working regular expression

Reposted. Author: 行者123. Updated: 2023-12-01 06:57:14

I have a string in Python that looks like the following (it is not a raw string):

Plenary Papers (1)
Peer-reviewed Papers (113)
PLENARY MANUSCRIPTS (1)
First Author Index

Harrer
Plenary Papers

One Some title
John W. Doe
2018 Physics SOmething Proceedings
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation
PEER REVIEWED MANUSCRIPTS (113)
First Author Index

Doe · Doe2 · Doe3 · Jonathan
Peer-reviewed Papers

Two some title
Alex White, Paul Klee, and Jacson Pollock
2018 Physics Research Conference Proceedings, doi:10.1234/perc.2018.pr.White
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation

Tree Some title
Suzanne Heck, Alex Someone, John I. Smith, and Andrew Bourgogne
2018 Physics Education Research Conference Proceedings, doi:10.2345/perc.2018.pr.Heck
Full Text: Download PDF - PER-Central Record
Show Abstract - Show Citation

..

I want to grab the metadata of these three papers, i.e. each title and the few lines after it (for example "One Some title", "John W. Doe" and "2018 Physics SOmething Proceedings").

I want to use two patterns to mark the start and the end of the selection:

r"\n\n" and r"Show Abstract - Show Citation".

This (almost) works on https://regex101.com/ with this regex:

\n\n(.*?)Show Abstract - Show Citation

One small problem is that it is greedy on the first two papers.

But it does not work in Python:

import re

pattern = r"\n\n(.*?)Show Abstract - Show Citation"

re.findall(pattern, titles)  # titles is the text above

# output is []
pattern_only_one_line = r"\nShow Abstract - Show Citation"

re.findall(pattern_only_one_line, titles)

# output shows three lines

Could this be yet another raw-string problem?

Best answer

The re.DOTALL flag is missing. Without it, . does not match newline characters, so (.*?) can never span more than one line.
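As a minimal sketch of that point, the snippet below runs the question's pattern with and without re.DOTALL. The `sample` string is a hypothetical, shortened stand-in for the real `titles` text, not the original data. It also suggests that the raw string is not the culprit here: the `re` module interprets `\n` itself, so the raw and non-raw spellings of this pattern behave the same.

```python
import re

# Hypothetical, shortened sample in the same shape as the question's text.
sample = (
    "Plenary Papers\n"
    "\n"
    "One Some title\n"
    "John W. Doe\n"
    "Show Abstract - Show Citation\n"
)

pattern = r"\n\n(.*?)Show Abstract - Show Citation"

# Without re.DOTALL, "." stops at every newline, so the lazy group cannot
# bridge the lines between the blank line and the end marker.
without_flag = re.findall(pattern, sample)

# With re.DOTALL, "." also matches "\n", so the group can span several lines.
with_flag = re.findall(pattern, sample, re.DOTALL)

# Same pattern written as a normal (non-raw) string: Python turns "\n" into
# a real newline before re sees it, and the result is identical.
non_raw = re.findall("\n\n(.*?)Show Abstract - Show Citation", sample, re.DOTALL)

print(without_flag)
print(with_flag)
print(non_raw == with_flag)
```

On regex101 the "single line" (s) option plays the role of re.DOTALL, which may explain why the same pattern appeared to work there.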

But we can do better (depending, of course, on what exactly you need): https://regex101.com/r/iN6pX6/199

import re
import pprint

titles = '''
[Omitted for brevity]
..
'''

pattern = r'''
(?P<title>[^\n]+)\n
(?P<subtitle>[^\n]+)\n
((?P<etc>[^\n].*?)\n\n|\n)
'''

# Make sure we don't have any extraneous whitespace but add the separator
titles = titles.strip() + '\n\n'

for match in re.finditer(pattern, titles, re.DOTALL | re.VERBOSE):
    title = match.group('title')
    subtitle = match.group('subtitle')
    etc = match.group('etc')
    print('## %r' % title)
    print('# %r' % subtitle)
    if etc:
        print(etc)
    print()
    # pprint.pprint(match.groupdict())
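The question also mentions that the original pattern captures too much for the first two papers: with re.DOTALL enabled, the lazy group starts at the first blank line, which sits before the index lines ("Harrer", "Plenary Papers") and not before the title. One possible way around this, shown only as a sketch and not taken from the answer above, is to leave re.DOTALL off and match whole non-empty lines, so the captured block cannot cross a blank line. The `sample` text and the name `line_pattern` are hypothetical.

```python
import re

# Hypothetical, shortened sample in the same shape as the question's text.
sample = (
    "First Author Index\n"
    "\n"
    "Harrer\n"
    "Plenary Papers\n"
    "\n"
    "One Some title\n"
    "John W. Doe\n"
    "Show Abstract - Show Citation\n"
    "PEER REVIEWED MANUSCRIPTS (113)\n"
    "\n"
    "Two some title\n"
    "Alex White\n"
    "Show Abstract - Show Citation\n"
)

# No re.DOTALL on purpose: ".+\n" matches exactly one non-empty line, so the
# group is a run of consecutive non-empty lines that directly precedes the
# end marker and directly follows a blank line.
line_pattern = r"\n\n((?:.+\n)+?)Show Abstract - Show Citation"

# Split every captured block into its individual lines.
records = [block.splitlines() for block in re.findall(line_pattern, sample)]

for record in records:
    print(record)
```

The "Harrer / Plenary Papers" block is skipped because it is followed by a blank line instead of the end marker, so only the paper records remain.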

Regarding Python returning no matches for a working regular expression, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58766018/
