
python - Randomly select sentences from text files and find the corresponding ID number

Repost. Author: 太空宇宙. Updated: 2023-11-03 13:38:18

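I am helping one of my professors with a research project that involves drawing one thousand random sentences from a set of 20 text files. The data all comes from the Corpus of Contemporary American English, in case anyone is familiar with it. In these text files, the data is laid out like this: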

##4000348 I must begin by saying this : In preparation for this lecture , I read ( or in some cases reread ) a number of the writings of Sidney Hook . I read them solely to give me the right starting point for a lecture given in honor of Sidney Hook . But instead I found myself infused with a set of ideas that were relevant to a different setting , a different occasion .

##4000349 I would like to think I am best known for my wisdom and learning , but in truth such fame as I have derives from my being a reputed conservative who is also dean of Yale College . That was the reason news of my appointment appeared in the Wall Street Journal and the National Review , which does n't usually happen to deans of Yale College , and does n't help them much when it does .


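So there are hundreds of paragraphs, each beginning with a number preceded by "##" (seven digits in the samples above). That number identifies the source of the sentence. I need to draw random sentences from these files and get the number that identifies where each one came from. Ideally, I would end up with something like: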

##4000348 I read them solely to give me the right starting point for a lecture given in honor of Sidney Hook

##4000349 I would like to think I am best known for my wisdom and learning , but in truth such fame as I have derives from my being a reputed conservative who is also dean of Yale College .

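I have managed to pull random sentences from the files (with the help of the kind people on Stack Overflow), but I can't figure out how to get the number attached to them (for example, if I take a sentence from the middle of a paragraph, how do I get the number from the start of that paragraph?). Can anyone help me think of a way to do this? Here is my code so far, which successfully extracts the sentences.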

# -*- coding: utf-8 -*-

import re
from random import sample

sentences = []
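# Collect every sentence from the yearly corpus files.
for i in range(1990, 2013):
    with open('w_acad_{}.txt'.format(i)) as f:
        sentences += re.findall(r".*?[\.\!\?]+", f.read())

# Draw the random sample and write one sentence per line.
selected = sample(sentences, 2000)
with open('out.txt', 'w') as f:
    f.write('\n'.join(selected))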

Best answer

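You could use a regular expression to extract each paragraph together with its source ID, and then extract the sentences from that paragraph much as you do now. This should help you capture the paragraphs: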

# with open... etc.
for source_id, paragraph in re.findall(r"(##\d+)([^#]+)", f.read()):
    sentences += [(source_id, sentence) for sentence in re.findall(r".*?[\.\!\?]+", paragraph)]

Now, sentences should be a list of tuples like ('##123', 'A sentence.'), which you can sample from as before.
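To see the whole approach end to end, here is a minimal, self-contained sketch that runs on an in-memory string instead of the corpus files. The names tagged_sentences, PARAGRAPH_RE and SENTENCE_RE are made up for illustration, and the sample text is a shortened, hypothetical stand-in for the real data. It reuses the two regular expressions from above and strips the whitespace that the sentence pattern leaves around each match.

```python
import random
import re

# Same patterns as in the answer above.
PARAGRAPH_RE = re.compile(r"(##\d+)([^#]+)")
SENTENCE_RE = re.compile(r".*?[.!?]+")


def tagged_sentences(text):
    """Return (source_id, sentence) tuples for every sentence in text."""
    pairs = []
    for source_id, paragraph in PARAGRAPH_RE.findall(text):
        for sentence in SENTENCE_RE.findall(paragraph):
            # The lazy pattern keeps the space before each sentence; drop it.
            pairs.append((source_id, sentence.strip()))
    return pairs


# Hypothetical two-paragraph sample in the corpus layout (not real data).
text = (
    "##4000348 I read them solely . But instead I found new ideas .\n"
    "##4000349 I would like to think so ! Does it help ?\n"
)

pairs = tagged_sentences(text)
selected = random.sample(pairs, 2)  # replace 2 with the real sample size

# One "ID sentence" line per sampled tuple, ready to write to a file.
output = "\n".join("{} {}".format(sid, s) for sid, s in selected)
print(output)
```

For the real files, the same function could be called on f.read() inside the existing loop, with the result added to sentences. One caveat: because the paragraph pattern uses [^#]+, a literal "#" inside a paragraph would cut that paragraph short, so it may be worth checking the data for stray "#" characters.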

Regarding python - randomly selecting sentences from text files and finding the corresponding ID number, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36164828/
