Python Gensim LDAMallet CalledProcessError with a large corpus (runs fine on a small corpus)


Reposted by: 行者123  Updated: 2023-12-01 08:07:07

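When I run a Gensim LDAMallet model on my full corpus of about 16 million documents, I get a CalledProcessError: "non-zero exit status 1". Interestingly, if I run exactly the same code on a test corpus of about 160,000 documents, it works perfectly. Since it runs fine on the small corpus, I am inclined to think the code itself is fine, but I am not sure what else could be causing this error.

I have tried editing the mallet.bat file as suggested in an earlier post, to no avail. I have also double-checked the paths, but that should not be the problem, since the same paths work with the smaller corpus.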

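import os
import gensim
from gensim import corpora

# lists_of_words: one list of string tokens per document
id2word = corpora.Dictionary(lists_of_words)
corpus = [id2word.doc2bow(doc) for doc in lists_of_words]
num_topics = 30
os.environ.update({'MALLET_HOME': r'C:/mallet-2.0.8/'})
mallet_path = r'C:/mallet-2.0.8/bin/mallet'
# gensim.models.wrappers is only available in gensim versions before 4.0
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)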

Here is the full traceback and error:

  File "<ipython-input-57-f0e794e174a6>", line 8, in <module>
    ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)

  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 132, in __init__
    self.train(corpus)

  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 273, in train
    self.convert_input(corpus, infer=False)

  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 262, in convert_input
    check_output(args=cmd, shell=True)

  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\utils.py", line 1918, in check_output
    raise error

CalledProcessError: Command 'C:/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\user\AppData\Local\Temp\2\e1ba4a_corpus.txt --output C:\Users\user\AppData\Local\Temp\2\e1ba4a_corpus.mallet' returned non-zero exit status 1.
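The traceback reports only the exit status, not what MALLET itself printed. One way to see the underlying message is to rerun the failing command outside gensim and capture its stderr. The sketch below is a minimal illustration; `run_and_capture` is a hypothetical helper name, and the command shown is a stand-in that fails on purpose, not the real MALLET call.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run a command and return (exit_code, stdout, stderr) as text."""
    result = subprocess.run(
        cmd,
        shell=isinstance(cmd, str),  # a single string goes through the shell, a list does not
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout, result.stderr


# Stand-in command that fails on purpose. To diagnose the real problem, pass the
# exact `mallet import-file ...` string quoted in the CalledProcessError instead.
code, out, err = run_and_capture(
    [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(1)"]
)
print("exit status:", code)
print("stderr:", err)
```

If the stderr text turns out to be a Java exception (for example an out-of-memory error), that points at the MALLET side rather than at the Python code.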

Best answer

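I'm glad you found my post, and I'm sorry it didn't work for you. I have hit this error for a number of reasons, mainly Java not being installed properly and the path not picking up the environment variable.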
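Those two causes can be checked before training starts. The following is a rough sketch, not part of gensim or MALLET: `find_mallet_problems` is a hypothetical helper, and the assumption that the Windows launcher may be named `mallet.bat` comes from the question's mention of that file.

```python
import os
import shutil


def find_mallet_problems(mallet_path, env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means none were found."""
    env = os.environ if env is None else env
    problems = []
    if which("java") is None:
        problems.append("java is not on PATH")
    mallet_home = env.get("MALLET_HOME")
    if not mallet_home:
        problems.append("MALLET_HOME is not set")
    elif not os.path.isdir(mallet_home):
        problems.append("MALLET_HOME does not point to a directory: " + mallet_home)
    # Assumption: on Windows the launcher is mallet.bat, so accept either name.
    if not (os.path.isfile(mallet_path) or os.path.isfile(mallet_path + ".bat")):
        problems.append("MALLET launcher not found: " + mallet_path)
    return problems


print(find_mallet_problems(r"C:/mallet-2.0.8/bin/mallet"))
```

An empty list only means these basic checks passed; it does not prove that MALLET will run.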

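Since your code runs on the smaller dataset, I would look at your data first. Mallet is finicky and accepts only the cleanest data; your full corpus may contain a null, punctuation, or a float that it trips over.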
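A cleaning pass along these lines could be run over the tokenized documents before the dictionary is built. This is only a sketch under assumptions: `clean_tokens` is a hypothetical helper, and dropping every float and every punctuation-only token is one possible policy, not a documented MALLET requirement.

```python
import string


def clean_tokens(doc):
    """Keep only non-empty tokens that contain at least one non-punctuation character."""
    cleaned = []
    for token in doc or []:  # a missing (None) document becomes an empty list
        if token is None or isinstance(token, float):
            continue  # drops nulls, NaN, and stray numeric cells
        token = str(token).strip()
        if not token or all(ch in string.punctuation for ch in token):
            continue  # drops empty strings and punctuation-only tokens
        cleaned.append(token)
    return cleaned


docs = [["assist", None, "payment", "!!!"], None, ["account", float("nan"), " "]]
print([clean_tokens(d) for d in docs])  # [['assist', 'payment'], [], ['account']]
```

Documents that end up empty after cleaning may also be worth removing before calling `doc2bow`.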

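Did you take a sample of your dictionary, or did you pass in the entire dataset?

This is basically what it is doing: sentences are split into words, words are mapped to numbers, and the numbers are then counted for frequency, for example: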

[(3, 1), (13, 1), (37, 1)]

Word 3 ("assist") appears 1 time. Word 13 ("payment") appears 1 time. Word 37 ("account") appears 1 time.
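The words-to-numbers-to-counts step can be illustrated without gensim. The sketch below imitates what `doc2bow` produces using only the standard library; `build_vocab` and `to_bow` are hypothetical names, and the ids it assigns will not match the ids gensim would assign.

```python
from collections import Counter


def build_vocab(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens missing from vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


docs = [["assist", "payment", "account"], ["payment", "payment", "late"]]
vocab = build_vocab(docs)
print(to_bow(docs[1], vocab))  # [(1, 2), (3, 1)]
```

Here "payment" has id 1 and appears twice, and "late" has id 3 and appears once.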

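Your LDA then looks at a word and scores it by how often it occurs together with every other word in the dictionary, and it does this for the whole dictionary. If you have it look at millions of words, it will crash very quickly.

This is how I implemented mallet and shrank my dictionary, not including stemming or the other preprocessing steps: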

import os
import gensim

# Build a dictionary that maps every distinct token to an integer id.
# processed_docs is a list of token lists, one per document.
dictionary = gensim.corpora.Dictionary(processed_docs)

# Preview the first few (id, token) pairs.
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break

# Drop tokens that appear in fewer than 15 documents (too rare) or in more than
# 50% of documents (so frequent that they tell us little about the topic),
# then keep at most the 100,000 most frequent of the remaining tokens.
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

# The words become numbers and are then counted for frequency.
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Preview one bag of words, e.g. row 4310: it has 27 words, and the word
# with index 2 shows up 4 times.
bow_corpus[4310]

os.environ['MALLET_HOME'] = 'C:\\mallet\\mallet-2.0.8'
mallet_path = 'C:\\mallet\\mallet-2.0.8\\bin\\mallet'

ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow_corpus, num_topics=20, alpha=0.1,
                                             id2word=dictionary, iterations=1000, random_seed=569356958)
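To make the effect of the `filter_extremes` call easier to see, here is a small standard-library approximation of the same document-frequency filtering. It is a sketch of the idea, not gensim's implementation; `filter_vocab` is a hypothetical name, and the tie-breaking order is an assumption.

```python
from collections import Counter


def filter_vocab(docs, no_below=15, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive document-frequency filtering."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = int(no_above * len(docs))
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most widespread tokens first; ties broken alphabetically (an assumption).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return set(kept[:keep_n])


docs = [["pay", "bill"], ["pay", "card"], ["pay", "bill", "late"], ["pay", "card"]]
print(sorted(filter_vocab(docs, no_below=2, no_above=0.5)))  # ['bill', 'card']
```

In this toy corpus "pay" is dropped because it appears in every document, and "late" is dropped because it appears in only one.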

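Also, I would put your ldamallet call in a separate cell, because it takes a long time to run, especially on a dataset of that size. I hope this helps; let me know if you are still getting errors :)

A similar question about "Python Gensim LDAMallet CalledProcessError with a large corpus (runs fine on a small corpus)" can be found on Stack Overflow: https://stackoverflow.com/questions/55485908/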
