
machine-learning - How are token embeddings created in BERT?

Reposted. Author: 行者123. Updated: 2023-11-30 08:51:53

The paper describing BERT has a section about WordPiece embeddings.

We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote input embedding as E, the final hidden vector of the special [CLS] token as C ∈ R^H, and the final hidden vector for the ith input token as T_i ∈ R^H. For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.

[Fig 2 from the paper]
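As a point of reference, the construction described in the quoted passage (input representation = token embedding + segment embedding + position embedding) can be sketched roughly as follows in PyTorch. The class name and default sizes are illustrative assumptions, not BERT's actual implementation (which additionally applies layer normalization and dropout to the sum):

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sketch of BERT-style input embeddings: token + segment + position."""
    def __init__(self, vocab_size=30000, hidden_size=768,
                 max_position=512, type_vocab_size=2):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.segment_embeddings = nn.Embedding(type_vocab_size, hidden_size)  # sentence A / B
        self.position_embeddings = nn.Embedding(max_position, hidden_size)    # learned positions

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        positions = positions.unsqueeze(0).expand_as(token_ids)
        # The three lookups are simply summed element-wise to form the input E.
        return (self.token_embeddings(token_ids)
                + self.segment_embeddings(segment_ids)
                + self.position_embeddings(positions))

emb = BertInputEmbeddings()
token_ids = torch.tensor([[101, 2023, 2003, 102]])   # e.g. [CLS] ... [SEP] (ids are illustrative)
segment_ids = torch.zeros_like(token_ids)            # all sentence A
print(emb(token_ids, segment_ids).shape)             # torch.Size([1, 4, 768])
```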

据我了解,WordPiece 将单词分割成单词片段,例如 #I #like #swim #ing,但它不会生成嵌入。但我在论文和其他来源中没有找到任何内容,这些 token 嵌入是如何生成的。他们在实际预训练之前接受过预训练吗?如何?或者它们是随机初始化的?

Best Answer

The word pieces themselves are trained separately, such that the most frequent words stay together as single tokens, while less frequent words eventually get split down to characters.
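To illustrate what such a trained vocabulary does at tokenization time, here is a toy sketch of the greedy longest-match-first lookup that WordPiece-style tokenizers use (the vocabulary below is made up for illustration and is not the real BERT vocabulary; learning that vocabulary from corpus frequencies is the separate training step mentioned above):

```python
# Toy WordPiece-style tokenization: greedy longest-match-first lookup against a
# fixed vocabulary. Frequent words stay whole; rare words get split into pieces
# (down to single characters). "##" marks word-internal pieces.
VOCAB = {"i", "like", "swim", "##ming", "##m", "##i", "##n", "##g"}

def wordpiece_tokenize(word, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:                 # try the longest remaining substring first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece       # word-internal pieces get the "##" prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:                    # no matching piece at all -> unknown token
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

print([wordpiece_tokenize(w) for w in "i like swimming".split()])
# [['i'], ['like'], ['swim', '##ming']]
```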

The embeddings are trained jointly with the rest of BERT. Back-propagation runs through all the layers down to the embeddings, which are updated just like any other parameters of the network.

Note that only the embeddings of tokens that are actually present in a training batch get updated; the rest stay unchanged. This is also a reason for keeping the word-piece vocabulary relatively small, so that every embedding gets updated frequently enough during training.
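A minimal PyTorch sketch (not BERT itself) of these last two points: the embedding table starts from a random initialization, gradients flow back into it like into any other parameter, and after a backward pass only the rows of the token ids that actually appear in the batch carry a nonzero gradient:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden = 10, 8

# Randomly initialized embedding table, trained jointly with a downstream layer.
embedding = nn.Embedding(vocab_size, hidden)
classifier = nn.Linear(hidden, 2)
optimizer = torch.optim.SGD(
    list(embedding.parameters()) + list(classifier.parameters()), lr=0.1)

batch = torch.tensor([[1, 4, 4, 7]])     # only token ids 1, 4 and 7 occur in this batch
labels = torch.tensor([0])

logits = classifier(embedding(batch).mean(dim=1))
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                          # gradients flow down into the embedding rows

# Rows 1, 4 and 7 have nonzero gradients; all other rows have zero gradient
# and are left unchanged by this plain SGD step.
nonzero_rows = (embedding.weight.grad.abs().sum(dim=1) > 0).nonzero().flatten()
print(nonzero_rows)                      # tensor([1, 4, 7])
optimizer.step()
```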

Regarding "machine-learning - How are token embeddings created in BERT?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57960995/
