
python - Using NLTK's FreqDist

Reposted. Author: 太空宇宙. Updated: 2023-11-04 01:39:44

I am trying to get the frequency distribution of a set of documents using Python. For some reason my code does not work and produces this error:

Traceback (most recent call last):
  File "C:\Documents and Settings\aschein\Desktop\freqdist", line 32, in <module>
    fd = FreqDist(corpus_text)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 104, in __init__
    self.update(samples)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 472, in update
    self.inc(sample, count=count)
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 120, in inc
    self[sample] = self.get(sample,0) + count
TypeError: unhashable type: 'list'

Can you help?

Here is the code so far:

import os
import nltk
from nltk.probability import FreqDist


#The stop-words list
stopwords_doc = open("C:\\Documents and Settings\\aschein\\My Documents\\stopwords.txt").read()
stopwords_list = stopwords_doc.split()
stopwords = nltk.Text(stopwords_list)

corpus = []

#Directory of documents
directory = "C:\\Documents and Settings\\aschein\\My Documents\\comments"
listing = os.listdir(directory)

#Append all documents in directory into a single 'document' (list)
for doc in listing:
    doc_name = "C:\\Documents and Settings\\aschein\\My Documents\\comments\\" + doc
    input = open(doc_name).read()
    input = input.split()
    corpus.append(input)

#Turn list into Text form for NLTK
corpus_text = nltk.Text(corpus)

#Remove stop-words
for w in corpus_text:
    if w in stopwords:
        corpus_text.remove(w)

fd = FreqDist(corpus_text)

Best answer

Two thoughts that I hope at least contribute to an answer.

First, the documentation for the nltk.text.Text class states (emphasis mine):

A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text's contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.

So I am not sure that Text() is the way you want to handle this data. It seems to me that a plain list would do just fine.
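The plain-list approach can be sketched roughly as follows. This is an illustration, not the asker's actual fix: it assumes the TypeError arises because corpus.append(input) adds one token list per document, so the counter receives lists (which cannot be hashed) rather than strings. collections.Counter stands in for FreqDist so the sketch runs without NLTK (in recent NLTK releases FreqDist is a Counter subclass), and the two sample documents are made up.

```python
from collections import Counter

# Hypothetical stand-ins for the contents of two comment files.
documents = ["the cat sat on the mat", "the dog sat"]

nested = []  # what corpus.append(tokens) builds: [[...], [...]]
flat = []    # what corpus.extend(tokens) builds: one list of strings
for text in documents:
    tokens = text.split()
    nested.append(tokens)
    flat.extend(tokens)

# Counting the nested list fails, because each sample is itself a list
# and lists are not hashable.
try:
    Counter(nested)
    error = None
except TypeError as exc:
    error = str(exc)

# Counting the flat list of string tokens works.
counts = Counter(flat)

print(error)
print(counts.most_common(2))
```

With a flat list of strings, the same tokens could be passed to FreqDist directly, without wrapping them in nltk.Text first.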

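Second, I would caution you to think about the calculation you are asking NLTK to perform here. Removing stop-words before computing the frequency distribution means that your frequencies will be skewed; I do not understand why the stop-words are removed before tabulation rather than simply ignored when examining the distribution afterwards. (I suppose this second point would be better as a query or comment than as part of an answer, but I felt it was worth pointing out that the proportions will be skewed.) Depending on what you intend to use the frequency distribution for, this may or may not be a problem in itself.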
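A small sketch of the difference, under the same assumptions as before: collections.Counter stands in for FreqDist, and the token list and stop-word set are invented for illustration. It tabulates every token, skips stop-words only when inspecting the result, and compares the relative frequency of one word under both approaches.

```python
from collections import Counter

# Hypothetical tokens and stop-word set, for illustration only.
tokens = "the cat sat on the mat and the dog sat".split()
stopwords = {"the", "on", "and"}

# Tabulate every token, stop-words included.
counts = Counter(tokens)
total = sum(counts.values())

# Ignore stop-words only when inspecting the distribution.
content_words = [(w, n) for w, n in counts.most_common() if w not in stopwords]

# Relative frequency of "sat" against all tokens ...
freq_all = counts["sat"] / total

# ... and against only the tokens left when stop-words are removed first.
filtered = Counter(t for t in tokens if t not in stopwords)
freq_filtered = filtered["sat"] / sum(filtered.values())

print(content_words)
print(freq_all, freq_filtered)
```

The raw count of "sat" is the same either way; only its share of the total changes, which is the skew described above. Whether that matters depends on whether counts or proportions are being used.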

Regarding python - Using NLTK's FreqDist, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/6284855/
