python - Sharing memory for gensim's KeyedVectors objects between docker containers

Reposted by: 行者123. Updated: 2023-12-01 01:43:42

I have created a Docker container that loads the GoogleNews-vectors-negative300 KeyedVectors model and pulls all of it into memory:

from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load(model_path, mmap='r')
word_vectors.most_similar('stuff')

I also have another Docker container that provides a REST API and loads the same model with:

KeyedVectors.load(model_path, mmap='r')

I observe that the fully loaded container takes more than 5 GB of memory, and each gunicorn worker takes 1.7 GB:

CONTAINER ID        NAME                        CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
acbfd080ab50        vectorizer_model_loader_1   0.00%               5.141GiB / 15.55GiB   33.07%              24.9kB / 0B         32.9MB / 0B         15
1a9ad3dfdb8d        vectorizer_vectorizer_1     0.94%               1.771GiB / 15.55GiB   11.39%              26.6kB / 0B         277MB / 0B          17

However, I would like all of these processes to share the same memory for the KeyedVectors, so that only 5.4 GB is needed in total across all containers.

Has anyone tried to achieve this and succeeded?

Edit: I tried the following code snippet, and it does share the same memory across different containers.

import mmap
from threading import Semaphore

with open("data/GoogleNews-vectors-negative300.bin", "rb") as f:
    # memory-map the file, size 0 means whole file
    fileno = f.fileno()
    mm = mmap.mmap(fileno, 0, access=mmap.ACCESS_READ)
    # read whole content
    mm.read()
    # block forever so the process, and with it the mapping, stays alive
    Semaphore(0).acquire()
    # close the map
    mm.close()

So the problem is that KeyedVectors.load(model_path, mmap='r') does not share memory.

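Edit 2: Studying gensim's source code, I found that it calls np.load(subname(fname, attrib), mmap_mode=mmap) to open the memory-mapped file. The following code sample shares memory across multiple containers.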

from threading import Semaphore

import numpy as np

data = np.load('data/native_format.bin.vectors.npy', mmap_mode='r')
print(data.shape)
# load whole file to memory
print(data.mean())
Semaphore(0).acquire()
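
To check whether pages really are shared in experiments like the two above, one option is to compare a process's Rss with its Pss and shared counters. The following is only a sketch, assuming a Linux kernel that exposes /proc/<pid>/smaps_rollup; the helper names parse_smaps_rollup and memory_breakdown are made up for illustration and are not part of gensim or Docker.

```python
from pathlib import Path


def parse_smaps_rollup(text):
    """Parse the 'Key:   123 kB' lines of a smaps_rollup file into {key: kB}."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Data lines look like "Rss:   1234 kB"; the address-range header does not.
        if len(parts) == 3 and parts[0].endswith(":") and parts[2] == "kB":
            fields[parts[0][:-1]] = int(parts[1])
    return fields


def memory_breakdown(pid="self"):
    """Read /proc/<pid>/smaps_rollup (Linux only) and return selected counters in kB."""
    text = Path(f"/proc/{pid}/smaps_rollup").read_text()
    fields = parse_smaps_rollup(text)
    keys = ("Rss", "Pss", "Shared_Clean", "Private_Clean", "Private_Dirty")
    return {key: fields.get(key, 0) for key in keys}
```

Calling memory_breakdown() inside each container after the model has been touched should give a rough picture: file-backed pages that are mapped by several processes tend to appear under Shared_Clean and pull Pss below Rss, while per-process Python objects appear under Private_Dirty.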

Best answer

After extensive debugging, I found that mmap works as expected for the numpy arrays in the KeyedVectors object.

However, KeyedVectors has other attributes, such as self.vocab, self.index2word and self.index2entity, which are not shared and consume about 1.7 GB of memory per object.
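
A rough, standard-library-only sketch of why such attributes are expensive: every word needs its own string, its own record object and its own dict slot on the Python heap of each process, and none of that can be memory-mapped. VocabEntry and measure_vocab_overhead are hypothetical stand-ins, not gensim classes, and the extrapolation assumes the roughly 3 million entries of the GoogleNews model; the number it prints is an estimate for this toy structure, not a measurement of gensim itself.

```python
import tracemalloc


class VocabEntry:
    """Hypothetical stand-in for a per-word vocabulary record."""

    def __init__(self, index, count):
        self.index = index
        self.count = count


def measure_vocab_overhead(n_words):
    """Return the Python heap bytes used per word by a vocab dict plus an index list."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    index2word = [f"word_{i}" for i in range(n_words)]
    vocab = {word: VocabEntry(i, n_words - i) for i, word in enumerate(index2word)}
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert len(vocab) == n_words  # both structures are still alive when measured
    return (after - before) / n_words


per_word = measure_vocab_overhead(100_000)
# Extrapolate to about 3 million vocabulary entries (an assumption, see above).
estimated_gb = per_word * 3_000_000 / 1e9
print(f"~{per_word:.0f} bytes per word, ~{estimated_gb:.2f} GB for 3M words")
```

Because these objects live in ordinary process memory, each gunicorn worker that loads the model builds its own copy, which is consistent with the per-worker cost described in the answer.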

Regarding "python - Sharing memory for gensim's KeyedVectors objects between docker containers", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51616074/


