python - How to use regular expressions to grab whole sentences from fragments in Python

Reposted. Author: 行者123. Updated: 2023-12-04 08:10:20

I have a VTT file like the following:

WEBVTT

1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.

3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes

4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language.
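I want to extract the fragments from the file and merge them into sentences. The output should look like this:
["In this lecture, we're going to talk about pattern matching in strings using regular expressions.", 'Regular expressions or regexes are written in a condensed formatting language.']
I was able to extract the fragments using this: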
pattern = r"[A-z0-9 ,.*?='\";\n-\/%$#@!()]+"

content = [i for i in re.findall(pattern, text) if (re.search('[a-zA-Z]', i))]
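I don't know how to extract whole sentences instead of fragments.
Also note that this is only a sample of the VTT file. The full file contains about 630 fragments, and some of them also contain integers and other special characters.
Any help is appreciated.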

Best answer

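Using re.sub, we can first remove the unwanted repeated text (the cue numbers and timestamp lines). A second substitution then replaces the remaining newlines with a single space, and re.findall picks out each sentence: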

import re

inp = """1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.

3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes

4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language."""

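# Remove each cue number line together with the timestamp line that follows it.
output = re.sub(r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\r?\n', '', inp)
# Replace the remaining line breaks with a single space.
output = re.sub(r'\r?\n', ' ', output)
# Capture each run of text up to and including the next period.
sentences = re.findall(r'(.*?\.)\s*', output)
print(sentences)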
This prints:
["In this lecture, we're going to talk about pattern matching in strings using regular expressions.",
'Regular expressions or regexes are written in a condensed formatting language.']
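
The question mentions that the real file starts with a WEBVTT header and that some fragments contain integers and special characters. The answer's input omits the header, and its final pattern splits at every period, so a number such as 3.14 would be cut in two. The following is a minimal sketch of one way to handle those cases, not part of the original answer. The helper name vtt_to_sentences and the choice of ., ! and ? as sentence terminators are assumptions made for illustration.

```python
import re

# Cue header: an optional numeric identifier line, then the timestamp line.
# [^\n]* tolerates anything trailing the timestamps on that line.
CUE_HEADER = re.compile(
    r"^(?:\d+\r?\n)?"
    r"\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}[^\n]*\r?\n",
    flags=re.MULTILINE,
)


def vtt_to_sentences(text):
    """Merge VTT cue text into a list of sentences (hypothetical helper)."""
    # Drop the leading "WEBVTT" header line, if present.
    text = re.sub(r"\AWEBVTT[^\n]*\r?\n", "", text)
    # Remove cue numbers and timestamp lines.
    text = CUE_HEADER.sub("", text)
    # Collapse every run of whitespace, including blank lines, into one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Split after ., ! or ? only when whitespace follows, so "3.14" stays intact.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
```

Calling vtt_to_sentences on the full contents of the file from the question should return the same two sentences as above. Abbreviations followed by a space (for example "e.g. ") would still be treated as sentence ends, so a file with many of those may need a more careful splitter.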

Regarding "python - How to use regular expressions to grab whole sentences from fragments in Python", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/66004017/
