
python - How to completely remove a word from a gensim Word2Vec model?

Reposted. Author: 太空狗. Updated: 2023-10-29 17:32:37

Given a model, e.g.

from gensim.models.word2vec import Word2Vec


documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]

texts = [d.lower().split() for d in documents]

w2v_model = Word2Vec(texts, size=5, window=5, min_count=1, workers=10)

it is possible to delete a word from the w2v vocabulary, e.g.

# Originally, it's there.
>>> print(w2v_model['graph'])
[-0.00401433 0.08862179 0.08601206 0.05281207 -0.00673626]

>>> print(w2v_model.wv.vocab['graph'])
Vocab(count:3, index:5, sample_int:750148289)

# Find most similar words.
>>> print(w2v_model.most_similar('graph'))
[('binary', 0.6781558990478516), ('a', 0.6284914612770081), ('unordered', 0.5971308350563049), ('perceived', 0.5612867474555969), ('iv', 0.5470727682113647), ('error', 0.5346164703369141), ('machine', 0.480206698179245), ('quasi', 0.256790429353714), ('relation', 0.2496253103017807), ('trees', 0.2276223599910736)]

# We can delete it from the dictionary
>>> del w2v_model.wv.vocab['graph']
>>> print(w2v_model['graph'])
KeyError: "word 'graph' not in vocabulary"

But when we look up words similar to other words after deleting graph, the word graph still pops up, e.g.

>>> w2v_model.most_similar('binary')
[('unordered', 0.8710334300994873), ('ordering', 0.8463168144226074), ('perceived', 0.7764195203781128), ('error', 0.7316686511039734), ('graph', 0.6781558990478516), ('generation', 0.5770125389099121), ('computer', 0.40017056465148926), ('a', 0.2762695848941803), ('testing', 0.26335978507995605), ('trees', 0.1948457509279251)]
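The output above suggests that the neighbour search does not consult the vocab dictionary when it builds its result list; it walks the stored vector rows and an index-to-word list instead. The sketch below reproduces that behaviour with a hypothetical `ToyVectors` class. It is a simplified stand-in written for illustration, not gensim's implementation, and the words and vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


class ToyVectors:
    """Hypothetical keyed-vector store with three parallel structures."""

    def __init__(self, words, vectors):
        self.index2word = list(words)                      # row index -> word
        self.vectors = [list(v) for v in vectors]          # row index -> vector
        self.vocab = {w: i for i, w in enumerate(words)}   # word -> row index

    def most_similar(self, word, topn=10):
        # Only the query lookup touches the vocab dict.
        query = self.vectors[self.vocab[word]]
        # Candidates come from the rows and index2word, not from vocab.
        scored = [(other, cosine(query, vec))
                  for other, vec in zip(self.index2word, self.vectors)
                  if other != word]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:topn]


toy = ToyVectors(["graph", "binary", "trees"],
                 [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

del toy.vocab["graph"]  # same move as deleting from w2v_model.wv.vocab

neighbours = [w for w, _ in toy.most_similar("binary")]
print(neighbours)  # 'graph' is still listed because its row was never removed
```

In this toy model, querying `graph` itself now raises a `KeyError`, yet `graph` still shows up as a neighbour of `binary`, which matches the symptom described in the question.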

How can a word be completely removed from a Word2Vec model in gensim?


Updated

To answer @vumaasha's comment:

could you give some details as to why you want to delete a word

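  • Let's say my universe of words is every word in the corpus, so that the model learns the dense relations between all words.

  • But when I want to generate similar words, they should come only from a subset of domain-specific words.

  • It is possible to generate more than enough words from .most_similar() and then filter them, but if the domain-specific space is small, the word I am looking for might be ranked the 1000th most similar, which is inefficient.

  • It would be better if a word were removed from the word vectors entirely, so that .most_similar() never returns words outside the specific domain.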
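The trade-off in the points above can be sketched without gensim: instead of asking for a long ranked list and discarding most of it, score only the candidates that belong to the domain subset. The function name `most_similar_within`, the `embeddings` dict, and its values are all hypothetical and exist only for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_within(embeddings, word, allowed, topn=10):
    """Rank only the words in `allowed` by similarity to `word`."""
    query = embeddings[word]
    scored = [(candidate, cosine(query, embeddings[candidate]))
              for candidate in allowed
              if candidate != word and candidate in embeddings]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up 2-d vectors; "lager" is close to "beer" but outside the domain.
embeddings = {
    "beer": [1.0, 0.1],
    "lager": [0.9, 0.2],
    "wine": [0.7, 0.6],
    "python": [0.0, 1.0],
}
domain = {"beer", "wine", "python"}

result = most_similar_within(embeddings, "beer", domain)
print([w for w, _ in result])
```

The cost of this search grows with the size of the domain subset rather than with the full vocabulary, which is the efficiency argument made in the list above.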

Best answer

I wrote a function that removes from a KeyedVectors object every word that is not in a predefined word list.

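import numpy as np


def restrict_w2v(w2v, restricted_word_set):
    # Written against the gensim 3.x KeyedVectors attributes
    # (vocab, index2entity, index2word, vectors, vectors_norm).
    # Make sure the normalized vectors exist before reading them.
    w2v.init_sims()

    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            # Re-index the kept word to its position in the compacted arrays.
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    # Replace every word-related attribute so they all stay in sync.
    w2v.vocab = new_vocab
    w2v.vectors = np.array(new_vectors)
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    w2v.vectors_norm = np.array(new_vectors_norm)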

It rewrites all of the word-related variables of the Word2VecKeyedVectors object.
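The important property of such a rewrite is that the word list, the vector rows, and the word-to-index mapping are compacted together, so that every kept word points at its new row. Below is a minimal sketch of that re-indexing on plain Python lists, with a consistency check. The helper names `restrict_rows` and `is_consistent` and the sample data are hypothetical and are not part of gensim.

```python
def restrict_rows(index2word, vectors, allowed):
    """Keep only rows whose word is in `allowed` and rebuild the index map."""
    new_index2word = []
    new_vectors = []
    new_index = {}
    for word, vec in zip(index2word, vectors):
        if word in allowed:
            # The new index is the position in the compacted list.
            new_index[word] = len(new_index2word)
            new_index2word.append(word)
            new_vectors.append(vec)
    return new_index2word, new_vectors, new_index


def is_consistent(index2word, vectors, index):
    """True if all three structures agree on size and on every word's row."""
    same_size = len(index2word) == len(vectors) == len(index)
    return same_size and all(index[w] == i for i, w in enumerate(index2word))


words = ["beer", "lager", "wine", "python"]
rows = [[1.0, 0.1], [0.9, 0.2], [0.7, 0.6], [0.0, 1.0]]

kept_words, kept_rows, kept_index = restrict_rows(
    words, rows, {"beer", "wine", "python"})

print(kept_words)
print(kept_index)
print(is_consistent(kept_words, kept_rows, kept_index))
```

A check like `is_consistent` is a cheap way to confirm that no structure was left behind, which is exactly the failure mode when only the vocab dictionary is edited.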

Usage:

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")

[('beers', 0.8409687876701355),
('lager', 0.7733745574951172),
('Beer', 0.71753990650177),
('drinks', 0.668931245803833),
('lagers', 0.6570086479187012),
('Yuengling_Lager', 0.655455470085144),
('microbrew', 0.6534324884414673),
('Brooklyn_Lager', 0.6501551866531372),
('suds', 0.6497018337249756),
('brewed_beer', 0.6490240097045898)]

restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")

[('lagers', 0.6570085287094116),
('wine', 0.6217695474624634),
('bash', 0.20583480596542358),
('computer', 0.06677375733852386),
('python', 0.005948573350906372)]

Regarding "python - How to completely remove a word from a gensim Word2Vec model?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48941648/
