
python - Multithreading in spaCy: Is joblib necessary?

Reposted · Author: 太空宇宙 · Updated: 2023-11-04 02:08:06

Part of this documentation mentions that nlp.pipe() works in parallel, and gives the following example:

for doc in nlp.pipe(texts, batch_size=10000, n_threads=3):
    pass

After that, a longer example using joblib is given. I don't quite understand the relationship between the two. As I understand the documentation, if I just want to parallelize tokenization of many documents, the simple for loop above should work and I shouldn't have to use joblib, right?

My pipeline looks like this:

nlp = spacy.load('en', disable=['parser', 'ner', 'textcat'])

When do I need to use joblib?

Best Answer

Based on an answer in the spaCy GitHub issues:

We kept the n_threads argument to avoid breaking people's code, but unfortunately the implementation doesn't currently release the GIL, the way we did in v1. In v2 the neural network model is more complicated and more subject to change, so we haven't implemented it in Cython. We might at a later date.

In v2.1.0 (you can get an alpha by installing spacy-nightly), the matrix multiplications are now single-threaded. This makes it safe to launch multiple processes for the pipeline, so we can look at doing that internally. In the meantime, the n_threads argument sits idle...Which I agree is confusing, but removing it and breaking backwards compatibility seems worse.

So, to summarize: n_threads does nothing as of v2.1. What I do now is use spaCy together with joblib to process the dataset in minibatches.

spaCy has published an example for exactly this: Spacy Multiprocessing, and it works very well.

I have a dataset of nearly 4M short texts. Without the approach from that example, parsing took almost 23 hours to complete; using joblib with spaCy, it finished in an hour and a half!

For readers of this question, here is the link again: Spacy Multiprocessing

Regarding python - Multithreading in spaCy: Is joblib necessary?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54201004/

Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号