
python - Word2Vec: How do I check the vector values of a trained model?

Reposted · Author: 行者123 · Updated: 2023-12-01 09:03:25

I recently started experimenting with word2vec. I trained my model and all the vectors were assigned, but I don't know how to find the value of each word's vector.

I tried printing the model, but it just outputs all the vectors it was trained on. I still don't understand: I thought the vectors were per word, yet somehow everything ends up in one list.

My understanding of word2vec is that every word (say W1) has its own vector, and each vector represents the similarity between the current word (W1) and another word (W2). Since each word is assigned a sparse vector, there should be a large number of vectors for W1 alone. However, when I print the model I (seemingly) only get vectors for a single word, and I'm not sure which word that is. Can somebody help me?

My code:

import collections
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

batch_size = 20
embedding_size = 2
num_sampled = 15


sentences = ["I have something that I want to say to him",
             "How are you",
             "We can see many stars tonight",
             "That's our house",
             "sung likes cats",
             "she loves dogs",
             "Do you know what he has done",
             "cats are great companions when they want to be",
             "We need to invest in clean, renewable energy",
             "women love his man",
             "queen love his king",
             "girl love his boy",
             "The line is too long. Why don't you come back tomorrow",
             "man and women roam in park",
             "Does it really matter",
             "dynasty king remain mortal"]

words = " ".join(sentences).split()
count = collections.Counter(words).most_common()
# Build dictionaries
reverse_dictionary = [i[0] for i in count] #reverse dic, idx -> word
dic = {w: i for i, w in enumerate(reverse_dictionary)} #dic, word -> id
voc_size = len(dic)
data = [dic[word] for word in words]


cbow_pairs = []
for i in range(1, len(data)-1):
    cbow_pairs.append([[data[i-1], data[i+1]], data[i]])

skip_gram_pairs = []
for c in cbow_pairs:
    skip_gram_pairs.append([c[1], c[0][0]])
    skip_gram_pairs.append([c[1], c[0][1]])



def generate_batch(size):
    assert size < len(skip_gram_pairs)
    x_data = []
    y_data = []
    r = np.random.choice(range(len(skip_gram_pairs)), size, replace=False)
    for i in r:
        x_data.append(skip_gram_pairs[i][0])    # n dim
        y_data.append([skip_gram_pairs[i][1]])  # n, 1 dim
    return x_data, y_data

# Input data
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
# Ops and variables pinned to the CPU because of missing GPU implementation
with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    embeddings = tf.Variable(
        tf.random_uniform([voc_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # lookup table

    # Construct the variables for the NCE loss
    nce_weights = tf.Variable(
        tf.random_uniform([voc_size, embedding_size], -1.0, 1.0))
    nce_biases = tf.Variable(tf.zeros([voc_size]))

# Compute the average NCE loss for the batch.
# This does the magic:
#   tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, ...)
# It automatically draws negative samples when we evaluate the loss.
loss = tf.reduce_mean(tf.nn.nce_loss(nce_weights, nce_biases, train_labels, embed, num_sampled, voc_size))
# Use the Adam optimizer
train_op = tf.train.AdamOptimizer(1e-1).minimize(loss)


# Launch the graph in a session
with tf.Session() as sess:
    # Initializing all variables
    tf.global_variables_initializer().run()

    for step in range(100):
        batch_inputs, batch_labels = generate_batch(batch_size)
        _, loss_val = sess.run([train_op, loss],
                               feed_dict={train_inputs: batch_inputs, train_labels: batch_labels})

    # Final embeddings are ready for you to use. Need to normalize for practical use
    trained_embeddings = embeddings.eval()

print(trained_embeddings)

Current output: somehow this output appears to be for only one word, rather than for every word in the corpus.

[[-0.751498   -1.4963825 ]
[-0.7022982 -1.4211462 ]
[-1.6240289 -0.96706766]
[-3.2109795 -1.2967492 ]
[-0.8835893 -1.5251521 ]
[-1.4316636 -1.4322135 ]
[-1.8665589 -1.1734825 ]
[-0.4726948 -1.836668 ]
[-0.11171409 -2.0847342 ]
[-1.0599283 -0.9792351 ]
[-1.6748023 -0.9584413 ]
[-0.8855507 -1.3226773 ]
[-0.9565117 -1.5730425 ]
[-1.2891663 -1.1687953 ]
[-0.06940217 -1.7782353 ]
[-0.92220575 -1.8264929 ]
[-3.2258956 -1.105678 ]
[-2.4262347 -0.9806146 ]
[-0.36716968 -2.3782976 ]
[-0.4972397 -1.9926786 ]
[-0.65995616 -1.2129989 ]
[-0.53334516 -1.5244756 ]
[-1.4961753 -0.5592766 ]
[-0.57391864 -1.9852302 ]
[-0.6580112 -1.0749325 ]
[-0.7821078 -1.598069 ]
[-1.264001 -1.002861 ]
[-0.23881587 -2.103974 ]
[-0.3729657 -1.9456012 ]
[-0.9266953 -1.516872 ]
[-1.4948957 -1.1232641 ]
[-1.109361 -1.3108519 ]
[-2.0748782 -0.93853486]
[-2.0241299 -0.8716516 ]
[-0.9448593 -1.0530868 ]
[-1.4578291 -0.57673496]
[-0.31915158 -1.4830168 ]
[-1.2568909 -1.0629684 ]
[-0.50458056 -2.2233846 ]
[-1.2059065 -1.0402468 ]
[-0.17204402 -1.8913956 ]
[-1.5484996 -1.0246676 ]
[-1.7026784 -1.4470854 ]
[-2.114282 -1.2304462 ]
[-1.6737207 -1.2598573 ]
[-0.9031189 -1.8086503 ]
[-1.4084693 -0.9171761 ]
[-1.261698 -1.5333931 ]
[-2.7891722 -0.69629264]
[-2.7634912 -1.0250676 ]
[-2.171037 -1.3402877 ]
[-1.5588827 -1.4741637 ]
[-2.012083 -1.6028976 ]
[-1.4286829 -1.485801 ]
[-0.06908941 -2.370034 ]
[-1.3277153 -1.2935033 ]
[-0.52055264 -1.2549478 ]
[-2.4971442 -0.6335571 ]
[-2.7244987 -0.6136059 ]
[-0.7155211 -1.8717885 ]
[-2.1862056 -0.78832203]
[-2.068198 -0.96536046]
[-0.9023069 -1.6741301 ]
[-0.39895654 -1.584905 ]
[-0.656657 -1.6787726 ]
[ 0.13354267 -2.105389 ]
[-1.248123 -1.7273897 ]
[-0.6168909 -1.3929827 ]
[-0.1866242 -2.0612721 ]
[-2.3246803 -1.1561321 ]
[ 0.88145804 0.35487294]]

Expected output example:

[-0.751498 -1.4963825 ] shown together with the word these two values belong to, for example "How" or "is".

Best Answer

If you have trained a Word2Vec model to learn 2-dimensional vectors for each word, then there will be one 2-dimensional vector per word.

I can't evaluate your full implementation - you should probably be using a known-good, off-the-shelf standard Word2Vec library. Also, Word2Vec really depends on large, varied amounts of training data - toy-sized examples usually won't show its real behavior and benefits.
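
As a rough sketch of what that looks like with an off-the-shelf library (here gensim, assuming the gensim 4.x API; the parameter values below are illustrative and not taken from the code above):

# Minimal gensim sketch (assumes gensim 4.x is installed).
# `sentences` is the toy corpus list defined in the question's code.
from gensim.models import Word2Vec

tokenized = [s.split() for s in sentences]        # gensim expects lists of tokens
model = Word2Vec(tokenized, vector_size=2, min_count=1, sg=1, epochs=100)

print(model.wv["cats"])                           # the learned 2-d vector for "cats"
print(model.wv.most_similar("cats", topn=3))      # nearest words by cosine similarity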

But since your sentences appear to contain a few dozen unique words, it looks correct that displaying the full trained_embeddings gives you a few dozen 2-dimensional vectors.

If you only need the vector for a single word, you have to look it up in that full set, at whatever position the word was assigned before training.
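
In the code above, that index mapping already exists as dic (word -> id) and reverse_dictionary (id -> word), so a single word's vector can be pulled out of trained_embeddings like this (a sketch reusing the variables from the question):

# Look up one word's row in the trained embedding matrix.
word = "cats"
print(word, trained_embeddings[dic[word]])

# Or print every word next to its learned 2-d vector:
for idx, w in enumerate(reverse_dictionary):
    print(w, trained_embeddings[idx])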

Regarding "python - Word2Vec: How do I check the vector values of a trained model?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52273067/
