python - Gensim LDA Multicore Python script runs much too slow

Reposted. Author: 行者123. Updated: 2023-11-29 09:53:09

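I am running the following Python script on a large dataset (roughly 100 000 items). It currently runs unacceptably slowly and would probably take at least a month to finish (no exaggeration). Obviously I would like it to run faster.

I have added a comment highlighting where I think the bottleneck is. I wrote the database import functions myself.

Any help is appreciated!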

# -*- coding: utf-8 -*-
import database
from gensim import corpora, models, similarities, matutils
from gensim.models.ldamulticore import LdaMulticore
import pandas as pd
from sklearn import preprocessing



def getTopFiveSimilarAuthors(author, authors, ldamodel, dictionary):
    vec_bow = dictionary.doc2bow([researcher['full_proposal_text']])
    vec_lda = ldamodel[vec_bow]

    # normalization
    try:
        vec_lda = preprocessing.normalize(vec_lda)
    except:
        pass

    similar_authors = []

    for index, other_author in authors.iterrows():
        if(other_author['id'] != author['id']):
            other_vec_bow = dictionary.doc2bow([other_author['full_proposal_text']])

            other_vec_lda = ldamodel[other_vec_bow]
            # normalization
            try:
                other_vec_lda = preprocessing.normalize(vec_lda)
            except:
                pass

            sim = matutils.cossim(vec_lda, other_vec_lda)
            similar_authors.append({'id': other_author['id'], 'cosim': sim})
    similar_authors = sorted(similar_authors, key=lambda k: k['cosim'], reverse=True)
    return similar_authors[:5]


def get_top_five_similar(author, authors, ldamodel, dictionary):
    top_five_similar_authors = getTopFiveSimilarAuthors(author, authors, ldamodel, dictionary)
    database.insert_top_five_similar_authors(author['id'], top_five_similar_authors, cursor)

connection = database.connect()
authors = []
authors = pd.read_sql("SELECT id, full_text FROM author WHERE full_text IS NOT NULL;", connection)

# create the dictionary
dictionary = corpora.Dictionary([authors["full_text"].tolist()])

# create the corpus/ldamodel
author_text = []

for text in author_text['full_text'].tolist():
    word_list = []
    for word in text:
        word_list.append(word)
    author_text.append(word_list)

corpus = [dictionary.doc2bow(text) for text in author_text]
ldamodel = LdaMulticore(corpus, num_topics=50, id2word = dictionary, workers=30)

#BOTTLENECK: the script hangs after this point.
authors.apply(lambda x: get_top_five_similar(x, authors, ldamodel, dictionary), axis=1)

Best answer

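I noticed the following problems in your code, though I am not sure they are the reason for the slow execution. The loop below is useless: author_text is an empty list at that point, so the loop body never runs (indexing a list with the string 'full_text' in fact raises a TypeError):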

for text in author_text['full_text'].tolist():
    word_list = []
    for word in text:
        word_list.append(word)
    author_text.append(word_list)

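There is also no need to loop over the words of the text. Calling split on the text is enough and gives you a list of words; do this once per author while looping over the authors.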
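To illustrate the point about split, here is a minimal sketch using a made-up sample string (an assumption, not data from the question). Iterating over a Python string yields single characters, whereas `str.split()` yields whitespace-separated words:

```python
# Hypothetical sample text standing in for one author's full_text value.
text = "topic models find themes"

# Iterating over a string yields characters, not words.
chars = [ch for ch in text]

# str.split() with no arguments splits on runs of whitespace.
words = text.split()

print(chars[:5])  # ['t', 'o', 'p', 'i', 'c']
print(words)      # ['topic', 'models', 'find', 'themes']
```

So an inner `for word in text` loop over a string would build a list of characters rather than the word list the model needs.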

Try writing it like this. First:

all_authors_text = []
# authors is a pandas DataFrame in the question, so iterate over its full_text column
for text in authors['full_text']:
    all_authors_text.append(text.split())

Then build the dictionary:

dictionary = corpora.Dictionary(all_authors_text)
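The shape of the input matters here: `corpora.Dictionary` takes an iterable of documents, each of which is a list of tokens. The sketch below is plain Python with invented sample texts and a hypothetical `vocabulary` helper (it does not call gensim). It contrasts the structure built in the question, `[authors["full_text"].tolist()]`, with the tokenized one suggested above:

```python
# Hypothetical stand-in for authors["full_text"].tolist()
texts = ["lda finds topics", "topics describe documents"]

# Shape used in the question: ONE document whose "tokens" are whole texts.
as_in_question = [texts]

# Shape suggested in the answer: one token list per author.
tokenized = [t.split() for t in texts]


def vocabulary(documents):
    """Collect the distinct tokens across all documents (illustrative helper)."""
    return sorted({token for doc in documents for token in doc})


print(len(as_in_question), vocabulary(as_in_question))
print(len(tokenized), vocabulary(tokenized))
```

With the first shape every complete text becomes a single vocabulary entry, so a dictionary built from it would not contain individual words.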

Regarding python - Gensim LDA Multicore Python script runs much too slow, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54431187/
