gpt4 book ai didi

python IndexError 使用 gensim 进行 LDA 主题建模

转载 作者:太空宇宙 更新时间:2023-11-04 10:34:59 26 4
gpt4 key购买 nike

Another thread有一个与我类似的问题,但遗漏了可重现的代码。

相关脚本的目标是创建一个内存效率尽可能高的进程。所以我尝试编写一个类 corpus() 来利用 gensims 的功能。但是,我遇到了一个 IndexError,我不确定在创建 lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics)) 时如何解决

我使用的文档与 gensim 教程中使用的文档相同,我将其放入 tutorial_example.txt:

$ cat tutorial_example.txt 
Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey

收到错误

$./gensim_topic_modeling.py -mn2 -w'english' -l1 tutorial_example.txt 
Traceback (most recent call last):
File "./gensim_topic_modeling.py", line 98, in <module>
lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics))
File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 306, in __init__
self.update(corpus)
File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 543, in update
self.log_perplexity(chunk, total_docs=lencorpus)
File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 454, in log_perplexity
perwordbound = self.bound(chunk, subsample_ratio=subsample_ratio) / (subsample_ratio * corpus_words)
File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 630, in bound
gammad, _ = self.inference([doc])
File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 366, in inference
expElogbetad = self.expElogbeta[:, ids]
IndexError: index 7 is out of bounds for axis 1 with size 7

下面是 gensim_topic_modeling.py 脚本:

##gensim_topic_modeling.py

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import sys
import re
import codecs
import logging
import fileinput
from operator import *
from itertools import *
from sklearn.cluster import KMeans
from gensim import corpora, models, similarities, matutils
import argparse
from nltk.corpus import stopwords

reload(sys)
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
sys.stdin = codecs.getreader('utf-8')(sys.stdin)


##defs

def stop_word_gen():
nltk_langs=['danish', 'dutch', 'english', 'french', 'german', 'italian','norwegian', 'portuguese', 'russian', 'spanish', 'swedish']
stoplist = []
for lang in options.stop_langs.split(","):
if lang not in nltk_langs:
sys.stderr.write('\n'+"Language {0} not supported".format(lang)+'\n')
continue
stoplist.extend(stopwords.words(lang))
return stoplist


def clean_texts(texts):
# remove tokens that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
return [[word for word in text if word not in tokens_once] for text in texts]

##class

class corpus(object):
"""sparse vector matrix and dictionary"""
def __iter__(self):
first=True
for line in fileinput.FileInput(options.input, openhook=fileinput.hook_encoded("utf-8")):
# assume there's one document per line; tokenizer option determines how to split
if options.space_tokenizer:
rl = re.compile('\s+', re.UNICODE).split(unicode(line,'utf-8'))
else:
rl = re.compile('\W+', re.UNICODE).split(tagRE.sub(' ',line))
# create dictionary
tokens=[token.strip().lower() for token in rl if token != '' and token.strip().lower() not in stoplist]
if first:
first=False
self.dictionary=corpora.Dictionary([tokens])
else:
self.dictionary.add_documents([tokens])
self.dictionary.compactify
yield self.dictionary.doc2bow(tokens)


##main

if __name__ == '__main__':
##parser
parser = argparse.ArgumentParser(
description="Topic model from a column of text. Each line is a document in the corpus")
parser.add_argument("input", metavar="args")
parser.add_argument("-l", "--document-frequency-limit", dest="doc_freq_limit", default=1,
help="Remove all tokens less than or equal to limit (default 1)")
parser.add_argument("-m", "--create-model", dest="create_model", default=False, action="store_true",
help="Create and save a model from existing dictionary and input corpus.")
parser.add_argument("-n", "--number-of-topics", dest="number_of_topics", default=2,
help="Number of topics (default 2)")
parser.add_argument("-t", "--space-tokenizer", dest="space_tokenizer", default=False, action="store_true",
help="Use alternate whitespace tokenizer")
parser.add_argument("-w", "--stop-word-languages", dest="stop_langs", default="danish,dutch,english,french,german,italian,norwegian,portuguese,russian,spanish,swedish",
help="Desired languages for stopword lists")
options = parser.parse_args()

##globals

stoplist=set(stop_word_gen())
tagRE = re.compile(r'<.*?>', re.UNICODE) # Remove xml/html tags
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO, filename="topic-modeling-log")
logr = logging.getLogger("topic_model")
logr.info("#"*15 + " started " + "#"*15)

##instance of class

checker=corpus()
logr.info("#"*15 + " SPARSE MATRIX (pre-filter)" + "#"*15)

##view sparse matrix and dictionary

for vector in checker:
logr.info(vector)
logr.info("#"*15 + " DICTIONARY (pre-filter)" + "#"*15)
logr.info(checker.dictionary)
logr.info(checker.dictionary.token2id)
#filter
checker.dictionary.filter_extremes(no_below=int(options.doc_freq_limit)+1)
logr.info("#"*15 + " DICTIONARY (post-filter)" + "#"*15)
logr.info(checker.dictionary)
logr.info(checker.dictionary.token2id)

##Create lda model

if options.create_model:
tfidf = models.TfidfModel(checker,normalize=False)
print tfidf
logr.info("#"*15 + " corpus_tfidf " + "#"*15)
corpus_tfidf = tfidf[checker]
logr.info("#"*15 + " lda " + "#"*15)
lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics))
logr.info("#"*15 + " corpus_lda " + "#"*15)
corpus_lda = lda[corpus_tfidf]

##Evaluate topics based on threshold

scores = list(chain(*[[score for topic,score in topic] \
for topic in [doc for doc in corpus_lda]]))
threshold = sum(scores)/len(scores)
print "threshold:",threshold
print
cluster1 = [j for i,j in zip(corpus_lda,documents) if i[0][1] > threshold]
cluster2 = [j for i,j in zip(corpus_lda,documents) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(corpus_lda,documents) if i[2][1] > threshold]

生成的 topic-modeling-log 文件如下。在此先感谢您的帮助!

主题建模日志

2014-05-25 02:58:50,482 : INFO : ############### started ###############
2014-05-25 02:58:50,483 : INFO : ############### SPARSE MATRIX (pre-filter)###############
2014-05-25 02:58:50,483 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,483 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,483 : INFO : [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]
2014-05-25 02:58:50,483 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,483 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,483 : INFO : [(2, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
2014-05-25 02:58:50,483 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(4, 1), (10, 1), (12, 1), (13, 1), (14, 1)]
2014-05-25 02:58:50,484 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(3, 1), (10, 2), (13, 1), (15, 1), (16, 1)]
2014-05-25 02:58:50,484 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(8, 1), (11, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)]
2014-05-25 02:58:50,484 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)]
2014-05-25 02:58:50,485 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,485 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,485 : INFO : [(24, 1), (26, 1), (27, 1), (28, 1)]
2014-05-25 02:58:50,485 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,485 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,485 : INFO : [(24, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]
2014-05-25 02:58:50,485 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,485 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)
2014-05-25 02:58:50,485 : INFO : [(9, 1), (26, 1), (30, 1)]
2014-05-25 02:58:50,485 : INFO : ############### DICTIONARY (pre-filter)###############
2014-05-25 02:58:50,485 : INFO : Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,485 : INFO : {'minors': 30, 'generation': 22, 'testing': 16, 'iv': 29, 'engineering': 15, 'computer': 2, 'relation': 20, 'human': 3, 'measurement': 18, 'unordered': 25, 'binary': 21, 'abc': 0, 'ordering': 31, 'graph': 26, 'system': 10, 'machine': 6, 'quasi': 32, 'random': 23, 'paths': 28, 'error': 17, 'trees': 24, 'lab': 5, 'applications': 1, 'management': 14, 'user': 12, 'interface': 4, 'intersection': 27, 'response': 8, 'perceived': 19, 'widths': 34, 'well': 33, 'eps': 13, 'survey': 9, 'time': 11, 'opinion': 7}
2014-05-25 02:58:50,486 : INFO : keeping 12 tokens which were in no less than 2 and no more than 4 (=50.0%) documents
2014-05-25 02:58:50,486 : INFO : resulting dictionary: Dictionary(12 unique tokens: ['minors', 'graph', 'system', 'trees', 'eps']...)
2014-05-25 02:58:50,486 : INFO : ############### DICTIONARY (post-filter)###############
2014-05-25 02:58:50,486 : INFO : Dictionary(12 unique tokens: ['minors', 'graph', 'system', 'trees', 'eps']...)
2014-05-25 02:58:50,486 : INFO : {'minors': 0, 'graph': 1, 'system': 2, 'trees': 3, 'eps': 4, 'computer': 5, 'survey': 6, 'user': 7, 'human': 8, 'time': 9, 'interface': 10, 'response': 11}
2014-05-25 02:58:50,486 : INFO : collecting document frequencies
2014-05-25 02:58:50,486 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,486 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,486 : INFO : PROGRESS: processing document #0
2014-05-25 02:58:50,486 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,486 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,486 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,488 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,488 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,488 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,488 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)
2014-05-25 02:58:50,488 : INFO : calculating IDF weights for 9 documents and 34 features (51 matrix non-zeros)
2014-05-25 02:58:50,488 : INFO : ############### corpus_tfidf ###############
2014-05-25 02:58:50,488 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,488 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,489 : INFO : ############### lda ###############
2014-05-25 02:58:50,489 : INFO : using symmetric alpha at 0.5
2014-05-25 02:58:50,489 : INFO : using serial LDA version on this node
2014-05-25 02:58:50,489 : WARNING : input corpus stream has no len(); counting documents
2014-05-25 02:58:50,489 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,489 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,489 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,489 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,489 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,491 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,491 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,491 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,491 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)
2014-05-25 02:58:50,491 : INFO : running online LDA training, 2 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 50 with a convergence threshold of 0
2014-05-25 02:58:50,491 : WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
2014-05-25 02:58:50,491 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,491 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,492 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,492 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,492 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,492 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,492 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,492 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,492 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,492 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,493 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,493 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,493 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,493 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,493 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,493 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,493 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,493 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)

最佳答案

这是由于使用的 corpusdictionary 没有相同的 id-to-word 映射。如果您修剪字典并在错误的时间调用 dictionary.compactify() ,就会发生这种情况。

一个简单的例子就很清楚了。让我们制作一本字典:

from gensim.corpora.dictionary import Dictionary
documents = [
['here', 'is', 'one', 'document'],
['here', 'is', 'another', 'document'],
]
dictionary = Dictionary()
dictionary.add_documents(documents)

这本词典现在有这些词的条目并将它们映射到整数 id。将文档转换为 (id, count) 元组的向量(我们希望在将它们传递到模型之前这样做):

vectorized_corpus = [dictionary.doc2bow(doc) for doc in corpus]

有时您会想要修改您的词典。例如,您可能想要删除非常罕见或非常常见的词:

dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=100000)
dictionary.compactify()

删除单词会在字典中创建空白,但调用 dictionary.compactify() 会重新分配 id 以填补空白。但这意味着我们上面的 vectorized_corpus 不再使用与 dictionary 相同的 id,如果我们将它们传递到模型中,我们将得到一个 索引错误

解决方案:在进行更改并调用 dictionary.compactify() 之后, 使用字典制作矢量表示!

关于python IndexError 使用 gensim 进行 LDA 主题建模,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/23853828/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com