
python - How to read and label a text file line by line using nltk.corpus in Python

Reposted. Author: 太空宇宙. Updated: 2023-11-03 18:25:16

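My problem is to classify documents given two training data files, good_reviews.txt and bad_reviews.txt. First I need to load and label my training data, where each line is a document in its own right and corresponds to one review. My main task is then to classify the reviews (lines) in the given test data.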

I found a way to load and label the names data, as follows:

from nltk.corpus import names
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])

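So what I want is something similar that labels lines rather than words. I expected the code to look like this, which of course does not work because .lines is not a valid method: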

reviews = ([(review, 'good_review') for review in reviews.lines('good_reviews.txt')] +
           [(review, 'bad_review') for review in reviews.lines('bad_reviews.txt')])

I would like to get a result like this:

>>> reviews[0]
('This shampoo is very good blablabla...', 'good_review')

Best answer

If you are reading your own text files, this has little to do with NLTK; you can simply use file.readlines():

good_reviews = """This is great!
Wow, it amazes me...
An hour of show, a lifetime of enlightment
"""
bad_reviews = """Comme si, Comme sa.
I just wasted my foo bar on this.
An hour of s**t, ****.
"""
# Write the sample reviews to disk, one review per line.
with open('/tmp/good_reviews.txt', 'w') as fout:
    fout.write(good_reviews)
with open('/tmp/bad_reviews.txt', 'w') as fout:
    fout.write(bad_reviews)

# Read each file back and pair every line with its label.
reviews = []
with open('/tmp/good_reviews.txt', 'r') as fingood, open('/tmp/bad_reviews.txt', 'r') as finbad:
    reviews = ([(review, 'good_review') for review in fingood.readlines()] +
               [(review, 'bad_review') for review in finbad.readlines()])

print(reviews)

[Output]:

[('This is great!\n', 'good_review'), ('Wow, it amazes me...\n', 'good_review'), ('An hour of show, a lifetime of enlightment\n', 'good_review'), ('Comme si, Comme sa.\n', 'bad_review'), ('I just wasted my foo bar on this.\n', 'bad_review'), ('An hour of s**t, ****.\n', 'bad_review')]
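Note that readlines() keeps the trailing '\n' on every line, whereas the result asked for in the question has none. A minimal sketch of one way to handle this, assuming it is acceptable to strip surrounding whitespace and skip blank lines; load_labeled_lines is a hypothetical helper name, and the sample files are written to a temporary directory purely for the demo:

```python
import os
import tempfile


def load_labeled_lines(path, label):
    """Return (line, label) pairs for each non-empty line in the file at path."""
    pairs = []
    with open(path, 'r', encoding='utf-8') as fin:
        for line in fin:            # iterate over the file instead of readlines()
            text = line.strip()     # drop the trailing '\n' and outer whitespace
            if text:                # skip blank lines
                pairs.append((text, label))
    return pairs


# Hypothetical sample files, created only so the sketch runs on its own.
tmpdir = tempfile.mkdtemp()
good_path = os.path.join(tmpdir, 'good_reviews.txt')
bad_path = os.path.join(tmpdir, 'bad_reviews.txt')
with open(good_path, 'w', encoding='utf-8') as fout:
    fout.write("This is great!\nWow, it amazes me...\n\n")
with open(bad_path, 'w', encoding='utf-8') as fout:
    fout.write("I just wasted my foo bar on this.\n")

reviews = (load_labeled_lines(good_path, 'good_review') +
           load_labeled_lines(bad_path, 'bad_review'))
print(reviews[0])  # ('This is great!', 'good_review')
```

Iterating over the file object reads one line at a time, which may matter if the review files are large.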

If you want to use the NLTK movie reviews corpus, see Classification using movie review corpus in NLTK/Python
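For the classification step mentioned in the question, labeled lines usually need to be turned into feature sets first. The following is only a rough sketch, assuming a naive lowercase whitespace tokenizer and presence-only features; bag_of_words and the two sample reviews are illustrative and not part of the original answer:

```python
def bag_of_words(text):
    """Map a review to a {token: True} feature dict (naive whitespace tokenizer)."""
    return {token: True for token in text.lower().split()}


# Illustrative (review, label) pairs, shaped like the list built above.
labeled_reviews = [
    ('This is great!', 'good_review'),
    ('I just wasted my foo bar on this.', 'bad_review'),
]

# Each entry becomes (feature_dict, label).
featuresets = [(bag_of_words(text), label) for text, label in labeled_reviews]
print(featuresets[0])

# With NLTK installed, a list shaped like this can be passed to a classifier:
# import nltk
# classifier = nltk.NaiveBayesClassifier.train(featuresets)
```

A real pipeline would likely want a proper tokenizer and punctuation handling; the whitespace split here leaves tokens such as 'great!' intact.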

Regarding "python - How to read and label a text file line by line using nltk.corpus in Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23329051/
