tensorflow - Preprocessing for a seq2seq model

Reposted. Author: 行者123. Updated: 2023-11-30 08:37:46

I am trying to build a seq2seq model. I tried to follow the official TensorFlow tutorial, but it does not mention the preprocessing steps. I searched online, and every tutorial starts from the model with no information about preprocessing.

I need some information about the preprocessing steps involved in seq2seq:

Suppose I have a dataset like this (after encoding with an index2word vocabulary):

encoder [1, 2, 1, 3, 4] decoder [2, 3, 4]
encoder [2, 3, 4, 1] decoder [11, 3, 4, 5, 1, 22, 45, 1, 3, 42, 32, 65]
encoder [4, 5, 3, 11, 23, 1, 33, 44, 1, 3] decoder [4, 2, 3, 5]
encoder [44, 55] decoder [5, 6, 3, 2, 4, 22, 42, 11, 34]
encoder [1] decoder [55, 6, 3, 2, 4, 5, 6, 7, 7]
encoder [4, 2, 3, 4, 5] decoder [6, 5, 3, 5, 6, 7, 8, 2, 4, 5]
encoder [44, 2, 1, 22, 5, 3, 2] decoder [6, 5, 3, 4, 5, 6, 7]
encoder [55, 3, 1, 5, 1] decoder [5, 3, 2, 3, 4, 5]
encoder [14] decoder [5, 6, 7]

If I take 5 as the batch size, then the first batch is:

encoder [1, 2, 1, 3, 4] decoder [2, 3, 4]
encoder [2, 3, 4, 1] decoder [11, 3, 4, 5, 1, 22, 45, 1, 3, 42, 32, 65]
encoder [4, 5, 3, 11, 23, 1, 33, 44, 1, 3] decoder [4, 2, 3, 5]
encoder [44, 55] decoder [5, 6, 3, 2, 4, 22, 42, 11, 34]
encoder [1] decoder [55, 6, 3, 2, 4, 5, 6, 7, 7]

Now, after reading many articles, I found that there are four special tokens you must use to encode your data:

<PAD>: During training, we’ll need to feed our examples to the network in batches, so shorter sequences are padded with this token to a common length.

<EOS>: This is another necessity of batching as well, but more on the decoder side. It allows us to tell the decoder where a sentence ends, and it allows the decoder to indicate the same thing in its outputs as well.

<UNK>: Replaces words that are not in the vocabulary (out-of-vocabulary words).

<GO>: This is the input to the first time step of the decoder to let the decoder know when to start generating output.
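The four special tokens are usually reserved at the start of the word index before any real words are added. A minimal sketch (the vocabulary and the `encode` helper are hypothetical, not from the tutorial):

```python
# Reserve the special tokens at the start of the index so <PAD> maps to 0,
# which is the conventional padding id.
special_tokens = ['<PAD>', '<UNK>', '<GO>', '<EOS>']
words = ['hello', 'how', 'are', 'you', 'i', 'am', 'fine']
word2index = {w: i for i, w in enumerate(special_tokens + words)}

def encode(sentence, word2index):
    # Words missing from the vocabulary fall back to the <UNK> id.
    return [word2index.get(w, word2index['<UNK>']) for w in sentence]

encoded = encode(['hello', 'how', 'goes', 'it'], word2index)
```

Here `'goes'` and `'it'` are not in the vocabulary, so both map to the `<UNK>` id.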

Now, taking my batch as an example, I have questions about padding:

Should the encoder batch be the same size as the decoder batch?

If my padded encoder data batch looks like this:

encoder_input = [[1, 2, 1, 3, 4],
                 [2, 3, 4, 1],
                 [4, 5, 3, 11, 23, 1, 33, 44, 1, 3],
                 [44, 55],
                 [1]]

# after padding (max time step is 10)

encoder_padded = [[1, 2, 1, 3, 4, 0, 0, 0, 0, 0],
                  [2, 3, 4, 1, 0, 0, 0, 0, 0, 0],
                  [4, 5, 3, 11, 23, 1, 33, 44, 1, 3],
                  [44, 55, 0, 0, 0, 0, 0, 0, 0, 0],
                  [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

Now, should I pad the decoder sequences to the same length (max 10)? Or should I pad to the decoder batch's own maximum sequence length (max 12), like this:

decoder_input = [[2, 3, 4],
                 [11, 3, 4, 5, 1, 22, 45, 1, 3, 42, 32, 65],
                 [4, 2, 3, 5],
                 [5, 6, 3, 2, 4, 22, 42, 11, 34],
                 [55, 6, 3, 2, 4, 5, 6, 7, 7]]

# after padding (decoder batch max length is 12)

decoder_padded = [[2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                  [11, 3, 4, 5, 1, 22, 45, 1, 3, 42, 32, 65],
                  [4, 2, 3, 5, 0, 0, 0, 0, 0, 0, 0, 0],
                  [5, 6, 3, 2, 4, 22, 42, 11, 34, 0, 0, 0],
                  [55, 6, 3, 2, 4, 5, 6, 7, 7, 0, 0, 0]]
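The padding above can be done with a small helper that pads each batch to the length of its own longest sequence, so the encoder and decoder batches end up with different widths. A minimal sketch (the `pad_batch` name is hypothetical):

```python
def pad_batch(batch, pad_id=0):
    # Pad every sequence in the batch to the length of the longest one.
    # Each batch is padded independently, so encoder and decoder batches
    # may have different widths (here 10 vs. 12).
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

encoder_batch = [[1, 2, 1, 3, 4], [2, 3, 4, 1],
                 [4, 5, 3, 11, 23, 1, 33, 44, 1, 3], [44, 55], [1]]
encoder_padded = pad_batch(encoder_batch)  # every row has length 10
```

In practice `tf.keras.preprocessing.sequence.pad_sequences` does the same job (with post-padding enabled via `padding='post'`).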

And what should my final preprocessed data look like:

encoder_input  = ['hello','how','are','you','<PAD>','<PAD>','<PAD>']

decoder_output = ['<GO>','i','am','fine','<EOS>','<PAD>','<PAD>']

Is this format correct?

Best Answer

I hope this is useful.

should encoder batch should be same size to decoder batch ?

No. The decoder computation follows the encoder, so the two batches are fed to the network at different times and each can be padded to its own maximum length. The example you showed is correct.

One small correction to your last example: what you labeled the decoder output should be the decoder input. For that input pair, you should also have a target label:

encoder_input  = ['hello','how','are','you','<PAD>','<PAD>','<PAD>']
decoder_input = ['<GO>','i','am','fine','<EOS>','<PAD>','<PAD>']
target_label = ['i','am','fine','<EOS>','<PAD>','<PAD>']
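The decoder input and the target label are the same sentence shifted by one position: the input starts with `<GO>`, and the label ends with `<EOS>`. A minimal sketch of that construction (the `make_decoder_pair` helper is hypothetical; the target is padded to the same fixed length as the input here):

```python
def make_decoder_pair(target_words, max_len):
    # Decoder input: sentence shifted right, prefixed with <GO>.
    # Target label: sentence shifted left, ending in <EOS>.
    decoder_input = ['<GO>'] + target_words + ['<EOS>']
    target_label = target_words + ['<EOS>']
    decoder_input += ['<PAD>'] * (max_len - len(decoder_input))
    target_label += ['<PAD>'] * (max_len - len(target_label))
    return decoder_input, target_label

dec_in, target = make_decoder_pair(['i', 'am', 'fine'], max_len=7)
```

Note that `target` is just `dec_in` shifted left by one step, which is exactly what the decoder is trained to predict at each time step.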

Regarding tensorflow - preprocessing for a seq2seq model, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51089903/
