
python - Pandas NLTK tokenizing "unhashable type: 'list'"

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:19:15

Following this example: Twitter data mining with Python and Gephi: Case synthetic biology

CSV loaded into a DataFrame with columns 'Country' and 'Responses':

'Country'
Italy
Italy
France
Germany

'Responses'
"Loren ipsum..."
"Loren ipsum..."
"Loren ipsum..."
"Loren ipsum..."
  1. Tokenize the text in 'Responses'
  2. Remove the 100 most common words (based on brown.corpus)
  3. Identify the 100 most common words that remain

I can get through steps 1 and 2, but I get an error on step 3:

TypeError: unhashable type: 'list'

I believe this is because I am working in a DataFrame and made this (likely erroneous) modification:

Original example:

#divide to words
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(tweets)

My code:

#divide to words
tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

My full code:

df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)

tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

words = df['tokenized_sents']

#remove 100 most common words based on Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]

Out: ['the',
',',
'.',
'of',
'and',
...]

#keep only most common words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]

TypeError: unhashable type: 'list'

There are many questions about unhashable lists, but I don't think any of them is quite the same. Any suggestions? Thanks.


Traceback

TypeError                                 Traceback (most recent call last)
<ipython-input-164-a0d17b850b10> in <module>()
1 #keep only most common words
----> 2 fdist = FreqDist(words)
3 mostcommon = fdist.most_common(100)
4 mclist = []
5 for i in range(len(mostcommon)):

/home/*******/anaconda3/envs/*******/lib/python3.5/site-packages/nltk/probability.py in __init__(self, samples)
104 :type samples: Sequence
105 """
--> 106 Counter.__init__(self, samples)
107
108 def N(self):

/home/******/anaconda3/envs/******/lib/python3.5/collections/__init__.py in __init__(*args, **kwds)
521 raise TypeError('expected at most 1 arguments, got %d' % len(args))
522 super(Counter, self).__init__()
--> 523 self.update(*args, **kwds)
524
525 def __missing__(self, key):

/home/******/anaconda3/envs/******/lib/python3.5/collections/__init__.py in update(*args, **kwds)
608 super(Counter, self).update(iterable) # fast path when counter is empty
609 else:
--> 610 _count_elements(self, iterable)
611 if kwds:
612 self.update(kwds)

TypeError: unhashable type: 'list'

Best answer

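The FreqDist function takes an iterable of hashable objects (meant to be strings, but it probably works with any hashable object). The error you are getting is because you pass in an iterable of lists. As you suggested, this is because of the change you made: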

df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

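If I understand the Pandas apply function documentation correctly, that line applies the nltk.word_tokenize function to each element of a Series. word_tokenize returns a list of words, so every row of the new column holds a list.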
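
A minimal reproduction of the failure, using only the standard library. The traceback above shows FreqDist delegating to collections.Counter, so Counter stands in for it here. `str.split` is an assumed stand-in for `nltk.word_tokenize`, and the sample rows are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for df['Responses']: one string per row.
responses = ["lorem ipsum dolor", "lorem ipsum sit", "amet lorem"]

# Stand-in for df['Responses'].apply(nltk.word_tokenize):
# each row becomes its own list of tokens.
tokenized_rows = [text.split() for text in responses]

# Counting the rows directly fails: each element is a list, and lists
# are mutable, so they cannot be hashed and used as counter keys.
try:
    Counter(tokenized_rows)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The counter never sees individual words here, only whole rows, which is why the error appears at the counting step rather than at the tokenizing step.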

As a solution, simply concatenate the lists into one before applying FreqDist, like so:

allWords = []
for wordList in words:
    allWords += wordList
FreqDist(allWords)
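
An equivalent way to flatten the per-row lists is `itertools.chain.from_iterable`, sketched here with the standard library only. `tokenized_rows` is a hypothetical stand-in for `df['tokenized_sents']`, and `Counter` stands in for `FreqDist` (which the traceback shows is built on it).

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for df['tokenized_sents']: one token list per row.
tokenized_rows = [["lorem", "ipsum"], ["lorem", "dolor"], ["lorem"]]

# Flatten the list of lists into one flat list of string tokens.
all_words = list(chain.from_iterable(tokenized_rows))

# Strings are hashable, so counting now works.
counts = Counter(all_words)
print(counts.most_common(1))  # [('lorem', 3)]
```

With a real DataFrame, `chain.from_iterable(df['tokenized_sents'])` should behave the same way, since iterating over a Series yields its values.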

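Here is a more complete revision that does what you want. If all you need is to identify the second set of 100 words, note that mclist will hold that set after the second pass.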

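import nltk
import pandas as pd
from nltk import FreqDist
from nltk.corpus import brown
from nltk.tokenize import RegexpTokenizer

# on pandas >= 1.3, use on_bad_lines='skip' instead of error_bad_lines=False
df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)

tokenizer = RegexpTokenizer(r'\w+')  # defined as in the question, but unused below
# tokenize each response; every row becomes a list of tokens
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

# flatten the per-row token lists into one list of words
lists = df['tokenized_sents']
words = []
for wordList in lists:
    words += wordList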

#remove 100 most common words based on Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]

Out: ['the',
',',
'.',
'of',
'and',
...]

#keep only most common words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
# mclist contains second-most common set of 100 words
words = [w for w in words if w in mclist]
# this will keep ALL occurrences of the words in mclist
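
The same pipeline can be written more compactly. The sketch below is an illustration, not the asker's exact setup: `common_words` is a small made-up set standing in for the Brown-corpus top 100, `Counter` stands in for `FreqDist`, and the cutoff is 2 instead of 100 so that the toy data shows an effect. Using sets for the word lists makes each `in` test a hash lookup instead of a scan through a list.

```python
from collections import Counter

# Hypothetical flat token list (after flattening the per-row lists).
words = ["the", "gene", "of", "gene", "cell", "the", "cell", "gene", "lab"]

# Made-up stand-in for the 100 most common Brown-corpus words.
common_words = {"the", "of", "and"}

# Step 2: drop the common words.
filtered = [w for w in words if w not in common_words]

# Step 3: find the most common remaining words (top 2 here, 100 in the question).
top_words = [word for word, _count in Counter(filtered).most_common(2)]

# Keep every occurrence of those top words.
top_set = set(top_words)
kept = [w for w in filtered if w in top_set]

print(top_words)  # ['gene', 'cell']
```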

Regarding python - Pandas NLTK tokenizing "unhashable type: 'list'", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38666973/
