
Python Gensim: how to make WMD similarity run faster with multiprocessing

Reposted. Author: 太空狗. Updated: 2023-10-29 21:48:37

I am trying to make gensim WMD similarity run faster. Typically, this is what the documentation shows. Example corpus:

    my_corpus = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

    from gensim.models import Word2Vec
    from gensim.similarities import WmdSimilarity

    my_query = 'Human and artificial intelligence software programs'
    my_tokenized_query = ['human', 'artificial', 'intelligence', 'software', 'programs']

    # model: a Word2Vec model trained on about 100,000 documents similar to my_corpus
    model = Word2Vec.load(word2vec_model)

    def init_instance(my_corpus, model, num_best):
        instance = WmdSimilarity(my_corpus, model, num_best=num_best)
        return instance

    instance = init_instance(my_corpus, model, 1)
    instance[my_tokenized_query]

The best-matching document is "Human machine interface for lab abc computer applications", which is great.

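However, the instance lookup above takes a very long time. So I thought of splitting the corpus into N parts, running WMD with num_best = 1 on each part, and then at the end the part with the highest score would contain the most similar document.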

    import operator
    from multiprocessing import Process, Queue, Manager

    import gensim

    def main(my_query, global_jobs, process_tmp):
        process_query = gensim.utils.simple_preprocess(my_query)

        def worker(num, process_query, return_dict):
            # Each worker searches only its own slice of the corpus.
            instance = init_instance(
                my_corpus[num*chunk+1:num*chunk+chunk], model, 1)
            x = instance[process_query][0][0]  # index of the best match
            y = instance[process_query][0][1]  # similarity of the best match
            return_dict[x] = y

        manager = Manager()
        return_dict = manager.dict()

        for num in range(num_workers):
            process_tmp = Process(target=worker, args=(num, process_query, return_dict))
            global_jobs.append(process_tmp)
            process_tmp.start()
        for proc in global_jobs:
            proc.join()

        return_dict = dict(return_dict)
        ind = max(return_dict.iteritems(), key=operator.itemgetter(1))[0]
        print my_corpus[ind]

    # Output: "Graph minors A survey"

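The problem I have is that, although it outputs something, it does not give me a good match for my query from the corpus, even though it takes the maximum similarity over all the parts.

Am I doing something wrong?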

Best answer

Comment: chunk is a static variable: e.g. chunk = 600 ...

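If you define chunk as static, then you have to compute num_workers:

    10001 / 600 = 16.6683333333, rounded up to 17 num_workers

It is common to use no more processes than you have cores.
If you have 17 cores, that's fine.

The number of cores is static, so you should instead do:

    num_workers = os.cpu_count()
    chunk = chunksize(my_corpus, num_workers)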
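The arithmetic above is a ceiling division, and it can go in either direction: from a fixed chunk size to a worker count, or from a fixed worker count to a chunk size. A minimal sketch, assuming the corpus size of 10001 from the example; `workers_for_chunk` and `chunk_for_workers` are hypothetical helper names, not part of gensim or of the original answer:

```python
import os


def workers_for_chunk(n_docs, chunk):
    """Workers needed when the chunk size is fixed (integer ceiling division)."""
    return (n_docs + chunk - 1) // chunk


def chunk_for_workers(n_docs, num_workers):
    """Chunk size needed when the worker count is fixed (integer ceiling division)."""
    return (n_docs + num_workers - 1) // num_workers


# Fixed chunk of 600 documents over 10001 documents.
print(workers_for_chunk(10001, 600))  # 17

# Fixed worker count taken from the machine; os.cpu_count() may return None.
num_workers = os.cpu_count() or 1
chunk = chunk_for_workers(10001, num_workers)

# num_workers chunks of this size are enough to span the whole corpus.
print(num_workers * chunk >= 10001)  # True
```

Deriving the chunk size from the core count keeps the number of processes at or below the number of cores, which is the recommendation above.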

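  1. Not the same result: the query is tokenized differently than in the single-process run (simple_preprocess keeps 'and', for example). Change to:

    #process_query = gensim.utils.simple_preprocess(my_query)
    process_query = my_tokenized_query

  2. Every worker returns indices 0..n.
    Therefore, return_dict[x] can be overwritten by the last worker that has the same index with a lower value. The index in return_dict is not the same as the index in my_corpus. Change to:

    #return_dict[x] = y
    return_dict[(num * chunk) + x] = y

  3. Using +1 when computing the chunk slice skips the first document of each chunk.
    I don't know how you compute chunk; consider this example:

    def chunksize(iterable, num_workers):
        # Ceiling division, so that num_workers chunks cover every document.
        c_size, extra = divmod(len(iterable), num_workers)
        if extra:
            c_size += 1
        if len(iterable) == 0:
            c_size = 0
        return c_size

    # Usage
    chunk = chunksize(my_corpus, num_workers)
    ...
    #my_corpus_chunk = my_corpus[num*chunk+1:num*chunk+chunk]
    my_corpus_chunk = my_corpus[num * chunk:(num+1) * chunk]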
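Points 2 and 3 can be checked without gensim. The sketch below is an illustration under stated assumptions, not the answer's code: `toy_score` is a stand-in similarity (token overlap, not WMD), and `covered`, `best_in_chunk` and `best_overall` are hypothetical names. It shows which documents the `+1` slice never visits, and how adding the chunk offset turns chunk-local indices into corpus indices before the per-chunk results are merged.

```python
def chunksize(n_docs, num_workers):
    """Ceiling division, same result as the chunksize() helper above."""
    return -(-n_docs // num_workers) if n_docs else 0


def covered(n_docs, chunk, num_workers, buggy):
    """Return the set of corpus indices that the worker slices visit."""
    seen = set()
    for num in range(num_workers):
        start = num * chunk + 1 if buggy else num * chunk
        stop = num * chunk + chunk
        seen.update(range(n_docs)[start:stop])
    return seen


def toy_score(query, doc):
    """Stand-in similarity (NOT WMD): Jaccard overlap of lowercase tokens."""
    q, d = set(query), set(doc.lower().split())
    return len(q & d) / len(q | d)


def best_in_chunk(query, docs, offset):
    """Best (corpus_index, score) within one chunk; offset maps local to global."""
    scored = ((i, toy_score(query, doc)) for i, doc in enumerate(docs))
    local, score = max(scored, key=lambda pair: pair[1])
    return offset + local, score


def best_overall(query, corpus, num_workers):
    """Merge per-chunk winners; assumes a non-empty corpus."""
    chunk = chunksize(len(corpus), num_workers)
    results = {}
    for num in range(num_workers):
        docs = corpus[num * chunk:(num + 1) * chunk]
        if docs:
            index, score = best_in_chunk(query, docs, num * chunk)
            results[index] = score  # corpus-level keys cannot collide
    return max(results.items(), key=lambda item: item[1])


corpus = ["Human machine interface for lab abc computer applications",
          "A survey of user opinion of computer system response time",
          "The EPS user interface management system",
          "System and human system engineering testing of EPS",
          "Relation of user perceived response time to error measurement",
          "The generation of random binary unordered trees",
          "The intersection graph of paths in trees",
          "Graph minors IV Widths of trees and well quasi ordering",
          "Graph minors A survey"]
query = ['human', 'artificial', 'intelligence', 'software', 'programs']

# With the +1 slice, the first document of every chunk is never scored.
print(sorted(set(range(9)) - covered(9, 5, 2, buggy=True)))   # [0, 5]
print(sorted(set(range(9)) - covered(9, 5, 2, buggy=False)))  # []

# Index of the best document under the stand-in scorer.
print(best_overall(query, corpus, 2)[0])  # 3
```

Note that with the `+1` slice, worker 0 always starts at index 1, so the document the question expected as the best match (index 0) is never compared against the query. Each `(docs, offset)` pair is independent of the others, so the `best_in_chunk` calls are the part that could be handed to separate processes.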

Results: 10 cycles, Tuple = (index from worker num=0, index from worker num=1)

With multiprocessing, with chunk=5:
02,09:(3, 8), 01,03:(3, 5):
System and human system engineering testing of EPS
04,06,07:(0, 8), 05,08:(0, 5), 10:(0, 7):
Human machine interface for lab abc computer applications

Without multiprocessing, with chunk=5:
01:(3, 6), 02:(3, 5), 05,08,10:(3, 7), 07,09:(3, 8):
System and human system engineering testing of EPS
03,04,06:(0, 5):
Human machine interface for lab abc computer applications

Without multiprocessing, without chunking:
01,02,03,04,06,07,08:(3, -1):
System and human system engineering testing of EPS
05,09,10:(0, -1):
Human machine interface for lab abc computer applications

Tested with Python: 3.4.2

Regarding how to make WMD similarity in Python Gensim run faster with multiprocessing, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44000997/
