
python-2.7 - Tokenizing a CSV file with NLTK in Python

Reposted. Author: 行者123. Updated: 2023-12-05 00:59:36

I've started experimenting with Python and NLTK. I'm running into a lengthy error message that I can't find a solution for, and I'd appreciate any insight you might have.

import nltk, csv, numpy
from nltk import sent_tokenize, word_tokenize, pos_tag
reader = csv.reader(open('Medium_Edited.csv', 'rU'), delimiter=',', quotechar='|')
tokenData = nltk.word_tokenize(reader)

I'm running Python 2.7 with the latest nltk package on OS X Yosemite. These are two lines of code I also tried, with no difference in the result:

with open("Medium_Edited.csv", "rU") as csvfile:
    tokenData = nltk.word_tokenize(reader)

This is the error message I'm seeing:

Traceback (most recent call last):
  File "nltk_text.py", line 11, in <module>
    tokenData = nltk.word_tokenize(reader)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 101, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 86, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1278, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer

Thanks in advance.

Best Answer

As you can read in the Python csv documentation, csv.reader "returns a reader object which will iterate over lines in the given csvfile". In other words, if you want to tokenize the text in the csv file, you have to iterate over those lines, and over the fields within each line:

for line in reader:
    for field in line:
        tokens = word_tokenize(field)

Also, since you import word_tokenize at the start of your script, you should call it as word_tokenize, not nltk.word_tokenize. That also means you can drop the import nltk statement.
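Putting the pieces together, a runnable sketch of the fix might look like the following. The delimiter and quotechar come from the question; the in-memory sample text and the simple whitespace `tokenize` stand-in are assumptions added so the sketch runs without NLTK or the original file — in real use, replace them with `open('Medium_Edited.csv', 'rU')` and `from nltk import word_tokenize`.

```python
import csv
import io

def tokenize(text):
    # Stand-in for nltk.word_tokenize so this sketch has no NLTK dependency;
    # in the real script, use: from nltk import word_tokenize
    return text.split()

# Hypothetical sample data; the question would use open('Medium_Edited.csv', 'rU').
csvfile = io.StringIO('hello world,|quoted field|\n')

tokens = []
reader = csv.reader(csvfile, delimiter=',', quotechar='|')
for line in reader:        # each `line` is a list of field strings
    for field in line:     # tokenize each string field, not the reader itself
        tokens.extend(tokenize(field))

print(tokens)  # ['hello', 'world', 'quoted', 'field']
```

Collecting the tokens with `extend` (rather than reassigning `tokens` on every field, as the snippet above does) keeps the results from all fields instead of only the last one.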

About python-2.7 - NLTK python tokenizing a CSV file: we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30571733/
