Python frequency distribution (FreqDist/NLTK) problem

Reposted by: 行者123. Updated: 2023-11-28 21:27:45

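I am trying to break a list of words (a tokenized string) into every possible substring. I then want to run FreqDist over those substrings to find the most common one. The first part works fine. However, when I run FreqDist, I get this error: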

TypeError: unhashable type: 'list'

Here is my code:

import nltk

string = ['This','is','a','sample']
substrings = []

count1 = 0
count2 = 0

for word in string:
    while count2 <= len(string):
        if count1 != count2:
            temp = string[count1:count2]
            substrings.append(temp)
        count2 += 1
    count1 +=1
    count2 = count1

print substrings

fd = nltk.FreqDist(substrings)

print fd

The output of substrings is fine. Here it is:

[['This'], ['This', 'is'], ['This', 'is', 'a'], ['This', 'is', 'a', 'sample'], ['is'], ['is', 'a'], ['is', 'a', 'sample'], ['a'], ['a', 'sample'], ['sample']]

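However, I can't get FreqDist to run on it. Any insight would be greatly appreciated. In this case every substring has a frequency of just 1, but the program is meant to run on much larger text samples.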

Best answer

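I'm not entirely sure what you want, but the error message says that something tried to hash a list, which is usually a sign that the list is being put into a set or used as a dictionary key. We can work around this by giving it tuples instead.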
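As a minimal sketch of that point, using only a plain dictionary and no NLTK: a list is mutable, so Python refuses to hash it, while a tuple holding the same words is hashable and can serve as a key. The variable names here are illustrative only.

```python
# A plain dict stands in for any hash-based container (set, dict, counter).
counts = {}

try:
    counts[['This', 'is']] = 1       # a list used as a dictionary key
except TypeError as exc:
    error_message = str(exc)         # the message reports an unhashable list

counts[('This', 'is')] = 1           # the same words as a tuple are accepted

print(error_message)
print(counts)
```

The exact wording of the error message can differ between Python versions, but it is a TypeError in every case.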

>>> import nltk
>>> import itertools
>>> 
>>> sentence = ['This','is','a','sample']
>>> contiguous_subs = [sentence[i:j] for i,j in itertools.combinations(xrange(len(sentence)+1), 2)]
>>> contiguous_subs
[['This'], ['This', 'is'], ['This', 'is', 'a'], ['This', 'is', 'a', 'sample'],
 ['is'], ['is', 'a'], ['is', 'a', 'sample'], ['a'], ['a', 'sample'],
 ['sample']]

But we still get:

>>> fd = nltk.FreqDist(contiguous_subs)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/probability.py", line 107, in __init__
    self.update(samples)
  File "/usr/local/lib/python2.7/dist-packages/nltk/probability.py", line 437, in update
    self.inc(sample, count=count)
  File "/usr/local/lib/python2.7/dist-packages/nltk/probability.py", line 122, in inc
    self[sample] = self.get(sample,0) + count
TypeError: unhashable type: 'list'

However, if we turn the subsequences into tuples:

>>> contiguous_subs = [tuple(sentence[i:j]) for i,j in itertools.combinations(xrange(len(sentence)+1), 2)]
>>> contiguous_subs
[('This',), ('This', 'is'), ('This', 'is', 'a'), ('This', 'is', 'a', 'sample'), ('is',), ('is', 'a'), ('is', 'a', 'sample'), ('a',), ('a', 'sample'), ('sample',)]
>>> fd = nltk.FreqDist(contiguous_subs)
>>> print fd
<FreqDist: ('This',): 1, ('This', 'is'): 1, ('This', 'is', 'a'): 1, ('This', 'is', 'a', 'sample'): 1, ('a',): 1, ('a', 'sample'): 1, ('is',): 1, ('is', 'a'): 1, ('is', 'a', 'sample'): 1, ('sample',): 1>

Is that what you were looking for?
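The session above is Python 2 (`xrange`, the `print` statement). A rough Python 3 sketch of the same idea follows. It swaps in `collections.Counter` for `nltk.FreqDist` so that it runs without NLTK installed; in NLTK 3 `FreqDist` is a `Counter` subclass, so passing the same list of tuples to `nltk.FreqDist` should behave the same way. The helper name `contiguous_subsequences` and the repeated-word sample sentence are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def contiguous_subsequences(tokens):
    """Return every contiguous run of tokens as a hashable tuple."""
    # Each index pair (i, j) with i < j selects one slice tokens[i:j].
    return [tuple(tokens[i:j])
            for i, j in combinations(range(len(tokens) + 1), 2)]


# Hypothetical sample with a repeated phrase, so some counts exceed 1.
sentence = ['This', 'is', 'a', 'sample', 'This', 'is']
subs = contiguous_subsequences(sentence)

# Counter is used here in place of nltk.FreqDist (an assumption for portability).
counts = Counter(subs)
print(counts.most_common(3))
```

Note that the number of contiguous subsequences grows quadratically with the number of tokens, so for large texts it may be worth capping the slice length.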

A similar question about this Python frequency distribution (FreqDist/NLTK) problem can be found on Stack Overflow: https://stackoverflow.com/questions/10031470/
