pytorch - Implementing Luong Attention in PyTorch

Reposted by: 行者123  Updated: 2023-12-02 15:01:08


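I am trying to implement the attention described in Luong et al. 2015 myself in PyTorch, but I can't get it to work. My code is below; for now I am only interested in the "general" attention case. I wonder whether I am missing an obvious mistake. It runs, but it does not seem to learn.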

import torch
import torch.nn as nn
import torch.nn.functional as F


class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p

        self.embedding = nn.Embedding(
            num_embeddings=self.output_size,
            embedding_dim=self.hidden_size
        )
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size, self.hidden_size)
        # hc: [hidden, context]
        self.Whc = nn.Linear(self.hidden_size * 2, self.hidden_size)
        # s: softmax
        self.Ws = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        gru_out, hidden = self.gru(embedded, hidden)

        # [0] remove the dimension of directions x layers for now
        attn_prod = torch.mm(self.attn(hidden)[0], encoder_outputs.t())
        attn_weights = F.softmax(attn_prod, dim=1) # eq. 7/8
        context = torch.mm(attn_weights, encoder_outputs)

        # hc: [hidden: context]
        out_hc = torch.tanh(self.Whc(torch.cat([hidden[0], context], dim=1)))  # eq. 5
        output = F.log_softmax(self.Ws(out_hc), dim=1)  # eq. 6

        return output, hidden, attn_weights
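
For reference, the equations that the code comments cite (eq. 5 to 8) can be written out as follows. This is a transcription of the "general" case from Luong et al. 2015 as I read it, so the mapping to the code is an assumption: `self.attn` plays the role of W_a, `self.Whc` of W_c, and `self.Ws` of W_s; the unnumbered last line is the weighted average that defines the context vector.

```latex
\begin{align}
\tilde{h}_t &= \tanh\left(W_c\,[c_t; h_t]\right) && \text{(eq. 5)} \\
p(y_t \mid y_{<t}, x) &= \operatorname{softmax}\left(W_s \tilde{h}_t\right) && \text{(eq. 6)} \\
a_t(s) &= \frac{\exp\left(\operatorname{score}(h_t, \bar{h}_s)\right)}
               {\sum_{s'} \exp\left(\operatorname{score}(h_t, \bar{h}_{s'})\right)} && \text{(eq. 7)} \\
\operatorname{score}(h_t, \bar{h}_s) &= h_t^\top W_a \bar{h}_s && \text{(eq. 8, general)} \\
c_t &= \sum_s a_t(s)\, \bar{h}_s
\end{align}
```

Two small differences from the code are worth noting: `nn.Linear` adds a bias term by default, which the paper's formulas do not include, and the code concatenates `[hidden, context]` rather than `[c_t; h_t]`, which only permutes the columns of W_c.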

I have studied the attention implemented in https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html and https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb

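The first is not the exact attention mechanism I am looking for. A major drawback is that its attention depends on the sequence length ( self.attn = nn.Linear(self.hidden_size * 2, self.max_length) ), which can be expensive for long sequences.
The second is closer to what the paper describes, but it is still not the same because it has no tanh. In addition, it became really slow after updating to the latest version of PyTorch ( ref ). I also don't know why it needs the last context ( ref ).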
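
To illustrate the point about sequence length, here is a minimal sketch of the "general" score in isolation. The names `general_attention`, `W_a` and `HIDDEN` are hypothetical and not taken from either tutorial. The only attention parameters form a hidden-by-hidden matrix, so the same layer handles a source of length 3 or 50 without any `max_length`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
HIDDEN = 8  # hypothetical toy size
# The only attention parameters: a (HIDDEN x HIDDEN) matrix, no max_length.
W_a = nn.Linear(HIDDEN, HIDDEN, bias=False)


def general_attention(h_t, encoder_outputs):
    """h_t: (1, hidden); encoder_outputs: (src_len, hidden)."""
    scores = torch.mm(W_a(h_t), encoder_outputs.t())  # (1, src_len)
    weights = F.softmax(scores, dim=1)                # each row sums to 1
    context = torch.mm(weights, encoder_outputs)      # (1, hidden)
    return weights, context


h_t = torch.randn(1, HIDDEN)
shapes = {}
for src_len in (3, 50):
    weights, context = general_attention(h_t, torch.randn(src_len, HIDDEN))
    shapes[src_len] = (tuple(weights.shape), tuple(context.shape))
```

The weights have one entry per source position, while the context vector keeps the hidden size regardless of how long the source is.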

Best answer

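This version works, and it closely follows the definition of Luong Attention (general). The main difference from the code in the question is the separation of embedding_size and hidden_size, which, after some experimentation, appears to be important for training. Previously I set both to the same size (256), which caused trouble for learning: the network seemed able to learn only half of the sequence.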

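import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed definition: the original snippet uses DEVICE without defining it.
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class EncoderRNN(nn.Module):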
    def __init__(self, input_size, embedding_size, hidden_size,
                 num_layers=1, bidirectional=False, batch_size=1):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bidirectional = bidirectional
        self.batch_size = batch_size

        self.embedding = nn.Embedding(input_size, embedding_size)

        self.gru = nn.GRU(embedding_size, hidden_size, num_layers,
                          bidirectional=bidirectional)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden(self):
        directions = 2 if self.bidirectional else 1
        return torch.zeros(
            self.num_layers * directions,
            self.batch_size,
            self.hidden_size,
            device=DEVICE
        )


class AttnDecoderRNN(nn.Module):
    def __init__(self, embedding_size, hidden_size, output_size, dropout_p=0):
        super(AttnDecoderRNN, self).__init__()
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p

        self.embedding = nn.Embedding(
            num_embeddings=output_size,
            embedding_dim=embedding_size
        )
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(embedding_size, hidden_size)
        self.attn = nn.Linear(hidden_size, hidden_size)
        # hc: [hidden, context]
        self.Whc = nn.Linear(hidden_size * 2, hidden_size)
        # s: softmax
        self.Ws = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        gru_out, hidden = self.gru(embedded, hidden)

        attn_prod = torch.mm(self.attn(hidden)[0], encoder_outputs.t())
        attn_weights = F.softmax(attn_prod, dim=1)
        context = torch.mm(attn_weights, encoder_outputs)

        # hc: [hidden: context]
        hc = torch.cat([hidden[0], context], dim=1)
        out_hc = torch.tanh(self.Whc(hc))  # torch.tanh replaces the deprecated F.tanh
        output = F.log_softmax(self.Ws(out_hc), dim=1)

        return output, hidden, attn_weights
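
The answer does not show how the two modules are driven, so the following is only a sketch of one plausible calling pattern, under stated assumptions: batch size 1, a unidirectional single-layer GRU, toy sizes, and token index 0 as a hypothetical start symbol. It uses bare layers instead of the classes above to stay self-contained. The point it illustrates is that `encoder_outputs` has to be a 2-D `(src_len, hidden_size)` matrix for the `torch.mm` calls in the decoder to work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, EMB, HID, SRC_LEN = 10, 4, 6, 5  # hypothetical toy sizes

# Encoder: feed one token at a time and stack the per-step GRU outputs
# into a 2-D (src_len, hidden) matrix.
enc_embedding = nn.Embedding(VOCAB, EMB)
enc_gru = nn.GRU(EMB, HID)
src = torch.randint(0, VOCAB, (SRC_LEN,))
hidden = torch.zeros(1, 1, HID)
step_outputs = []
for i in range(SRC_LEN):
    out, hidden = enc_gru(enc_embedding(src[i]).view(1, 1, -1), hidden)
    step_outputs.append(out[0])                   # (1, HID)
encoder_outputs = torch.cat(step_outputs, dim=0)  # (SRC_LEN, HID)

# Decoder: a single step mirroring AttnDecoderRNN.forward.
dec_embedding = nn.Embedding(VOCAB, EMB)
dec_gru = nn.GRU(EMB, HID)
attn = nn.Linear(HID, HID)
Whc = nn.Linear(HID * 2, HID)
Ws = nn.Linear(HID, VOCAB)

token = torch.tensor([0])  # hypothetical start-of-sequence index
_, hidden = dec_gru(dec_embedding(token).view(1, 1, -1), hidden)
scores = torch.mm(attn(hidden)[0], encoder_outputs.t())   # (1, SRC_LEN)
attn_weights = F.softmax(scores, dim=1)
context = torch.mm(attn_weights, encoder_outputs)         # (1, HID)
out_hc = torch.tanh(Whc(torch.cat([hidden[0], context], dim=1)))
log_probs = F.log_softmax(Ws(out_hc), dim=1)              # (1, VOCAB)
```

Because the embedding size (4) and hidden size (6) differ here, the sketch also exercises the separation of embedding_size and hidden_size that the answer describes.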

Regarding pytorch - Implementing Luong Attention in PyTorch, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50571991/
