
python - Regex to capture a multi-line text body

Reposted. Author: 太空宇宙. Updated: 2023-11-04 03:54:41

I have some text documents that look like this:

1a  Title
        Subtitle
            Description
1b  Title
        Subtitle A
            Description
        Subtitle B
            Description
2   Title
        Subtitle A
            Description
        Subtitle B
            Description
        Subtitle C
            Description

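I'm trying to use a regex to capture the "Description" lines, which are indented by 3 tabs. The problem is that a description sometimes wraps onto the next line, which is again indented by 3 tabs. Here is an example: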

1   Demo
        Example
            This is the description text body that I am
            trying to capture with regex.

I'd like to capture this text in a single group, ending up with:

This is the description text body that I am trying to capture with regex.

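Once I can do that, I'd also like to "flatten" the document so that each section sits on one line, delimited by characters rather than by newlines and tabs. So my example would become: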

1->Demo->->Example->->->This is the description text...

I'll be implementing this in Python, but any regex guidance is much appreciated!


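Update
I changed the delimiters in the flattened text to reflect the original indentation level, i.e. 1 tab becomes ->, 2 tabs become ->->, 3 tabs become ->->->, and so on.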

Also, if a title (section) has several subtitles (subsections), the flattened text should look like this:

1a->Title->->Subtitle->->->Description
1b->Title->->Subtitle A->->->Description
1b->Title->->Subtitle B->->->Description
2->Title->->Subtitle A->->->Description
2->Title->->Subtitle B->->->Description
2->Title->->Subtitle C->->->Description

Basically, the parent (number/title) is simply "reused" for each child (subtitle).

Best answer

You can do this without a regex:

txt='''\
1\tDemo
\t\tExample
\t\t\tThis is the description text body that I am
\t\t\ttrying to capture with regex.
\t\tSep
\t\t\tAnd Another Section
\t\t\tOn two lines
'''

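cap = []   # finished descriptions, one string per run of 3-tab lines
buf = []   # lines of the description currently being collected
for line in txt.splitlines():
    if line.startswith('\t\t\t'):
        # Still inside a description: keep collecting.
        buf.append(line.strip())
        continue
    if buf:
        # A shallower line ends the current description.
        cap.append(' '.join(buf))
        buf = []

# Flush a description that runs to the end of the text.
if buf:
    cap.append(' '.join(buf))

print(cap)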

Prints:

['This is the description text body that I am trying to capture with regex.', 
'And Another Section On two lines']

The advantage is that separate sections, each indented by 3 tabs, stay separate.
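For comparison, the same grouping can be sketched as a single pattern that matches a run of consecutive 3-tab lines. This is only an illustration, not part of the original answer: the names `sample`, `RUN_OF_DESCRIPTION_LINES`, and `capture_descriptions` are made up here, and unlike the `startswith` loop the pattern assumes exactly three tabs (a line with four tabs ends the run).

```python
import re

# Hypothetical sample input; tabs mark the nesting level.
sample = (
    "1\tDemo\n"
    "\t\tExample\n"
    "\t\t\tThis is the description text body that I am\n"
    "\t\t\ttrying to capture with regex.\n"
    "\t\tSep\n"
    "\t\t\tAnd Another Section\n"
    "\t\t\tOn two lines\n"
)

# One or more consecutive lines, each starting with exactly three tabs.
# re.M makes ^ match at the start of every line.
RUN_OF_DESCRIPTION_LINES = re.compile(r"(?:^\t{3}(?!\t).*(?:\n|\Z))+", re.M)


def capture_descriptions(text):
    """Return each run of 3-tab lines joined into a single string."""
    return [
        " ".join(line.strip() for line in m.group(0).splitlines())
        for m in RUN_OF_DESCRIPTION_LINES.finditer(text)
    ]


descriptions = capture_descriptions(sample)
print(descriptions)
```

Each match covers one whole wrapped description, so the two descriptions in the sample stay separate, just as in the loop version.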


OK, here is a complete regex solution:

txt='''\
1\tDemo
\t\tExample
\t\t\tThis is the description text body that I am
\t\t\ttrying to capture with regex.
2\tSecond Demo
\t\tAnother Section
\t\t\tAnd Another 3rd level Section
\t\t\tOn two lines
3\tNo section below
4\tOnly one level below
\t\tThis is that one level
'''

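import re

# Each match is one numbered section: from a line starting with a digit
# up to the next such line, or the end of the text.
for ms in re.finditer(r'^(\d+.*?)(?=^\d|\Z)', txt, re.S | re.M):
    section = ms.group(1)
    # Depth of every indented line; a list, so the emptiness test works.
    tm = [len(t) for t in re.findall(r'(^\t+)', section, re.M)]
    subsections = max(tm) if tm else 0
    # The section's first line is the header.
    sec = [re.search(r'(^\d+.*)', section).group(1)]
    if subsections:
        for i in range(2, subsections + 1):
            # Lines indented by exactly i tabs.
            lt = r'^{}([^\t]+)$'.format(r'\t' * i)
            level = re.findall(lt, section, re.M)
            sec.append(' '.join(s.strip() for s in level))

    print('->'.join(sec))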

Prints:

1   Demo->Example->This is the description text body that I am trying to capture with regex.
2 Second Demo->Another Section->And Another 3rd level Section On two lines
3 No section below
4 Only one level below->This is that one level
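To see what the outer pattern does by itself, here is a small sketch that only splits a text into numbered sections. The input string `sample` and the name `SECTION` are illustrative assumptions; the pattern is the one used in the solution above.

```python
import re

# Hypothetical input: two numbered sections, tabs mark the nesting.
sample = "1\tDemo\n\t\tExample\n2\tSecond Demo\n\t\tAnother Section\n"

# re.M lets ^ match at every line start; re.S lets . cross newlines.
# The lazy .*? stops at the first position where the lookahead sees
# a line that starts with a digit, or the end of the text.
SECTION = re.compile(r"^(\d+.*?)(?=^\d|\Z)", re.S | re.M)

sections = [m.group(1) for m in SECTION.finditer(sample)]
for s in sections:
    print(repr(s))
```

Because the lookahead does not consume anything, the next match starts exactly where the previous section ended, so no text is skipped between sections.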

Limitations:

1) This is limited to the format you described.
2) It will not handle reverse levels properly:

1 Section
    Second Level
        Third Level
    Second Level Again <== This would be jammed in with 'Second Level'
    How would you handle multiple levels?

3) It won't handle multi-line section headers:

3 Like
  This

Running it on your full example:

1a  Title->Subtitle->Description Second Line of Description
1b Title->Subtitle A Subtitle B->Description Description
2 Title->Subtitle A Subtitle B Subtitle C->Description Description Description

You can see that the second and third levels are joined together, but I don't know how you want that format handled.
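One possible way to get the layout requested in the question's update (one row per subtitle, with the number and title reused as the parent) is to skip the per-level regexes and track the current header and subtitle while scanning lines. The following is a minimal sketch, not the answerer's method. The function name `flatten` and the sample text are hypothetical, and it assumes that a header line is "number, whitespace, title", that subtitles have 2 tabs, and that any other indented line is description text.

```python
def flatten(text, arrow="->"):
    """Flatten a tab-indented outline into one row per subtitle.

    Assumed layout: 0 tabs = "number<whitespace>title", 2 tabs = subtitle,
    any other indented line = description (possibly wrapped).
    """
    rows = []
    header = None    # current "number->title" prefix
    subtitle = None  # current subtitle, if any
    desc = []        # wrapped description lines for that subtitle

    def emit():
        # Append the pending row, if a header has been seen.
        if header is None:
            return
        row = header
        if subtitle is not None:
            row += arrow * 2 + subtitle
        if desc:
            row += arrow * 3 + " ".join(desc)
        rows.append(row)

    for line in text.splitlines():
        if not line.strip():
            continue
        depth = len(line) - len(line.lstrip("\t"))
        content = line.strip()
        if depth == 0:
            emit()  # close the previous section
            header = arrow.join(content.split(None, 1))
            subtitle, desc = None, []
        elif depth == 2:
            if subtitle is not None or desc:
                emit()  # close the previous subtitle, reusing the header
            subtitle, desc = content, []
        else:
            desc.append(content)
    emit()  # flush the last pending row
    return rows


sample = (
    "1a\tTitle\n"
    "\t\tSubtitle\n"
    "\t\t\tDescription\n"
    "1b\tTitle\n"
    "\t\tSubtitle A\n"
    "\t\t\tDescription wrapped\n"
    "\t\t\tonto a second line\n"
    "\t\tSubtitle B\n"
    "\t\t\tDescription\n"
)
rows = flatten(sample)
print("\n".join(rows))
```

Since every new 2-tab line starts a fresh row, a "Second Level Again" line after a third-level line is no longer merged with the earlier second-level text. Multi-line section headers are still not handled.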

Regarding "python - Regex to capture a multi-line text body", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/19428739/
