
pytorch - Weight decay in AdamW and Adam


Is there any difference between torch.optim.Adam(weight_decay=0.01) and torch.optim.AdamW(weight_decay=0.01)?
Link to the docs: torch.optim

Best Answer

Yes, weight decay is handled differently in Adam and AdamW.

Loshchilov and Hutter pointed out in their paper (Decoupled Weight Decay Regularization) that the way weight decay is implemented in Adam in every library seems to be wrong, and proposed a simple way (which they call AdamW) to fix it.


In Adam, weight decay is usually implemented by adding wd * w (where wd is the weight decay factor) to the gradient (first case), rather than actually subtracting it from the weights (second case).
# 1st: Adam's weight decay implementation (L2 regularization added to the loss)
final_loss = loss + wd * all_weights.pow(2).sum() / 2
# 2nd: equivalent to this update in plain SGD
w = w - lr * w.grad - lr * wd * w

These methods are the same for vanilla SGD, but as soon as we add momentum, or use a more sophisticated optimizer like Adam, L2 regularization (first equation) and weight decay (second equation) become different.
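To see why they diverge, here is a minimal sketch of a single update step for one parameter tensor, assuming simplified update rules with bias correction omitted (the names m, v, beta1, beta2, eps, lr, wd are illustrative, not PyTorch's internal implementation):

import torch

def adam_l2_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    # L2 regularization: the decay term wd * w is folded into the gradient,
    # so it also passes through the adaptive scaling by sqrt(v) + eps.
    grad = grad + wd * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    w = w - lr * m / (v.sqrt() + eps)
    return w, m, v

def adamw_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    # Decoupled weight decay: the moments are built from the raw gradient only,
    # and the decay is subtracted directly from the weights, as in the second equation.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    w = w - lr * m / (v.sqrt() + eps) - lr * wd * w
    return w, m, v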


AdamW follows the second equation for weight decay.
In Adam:

weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)


In AdamW:

weight_decay (float, optional) – weight decay coefficient (default: 1e-2)
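As a quick usage sketch (both constructors are real PyTorch APIs; the model and hyperparameters below are just placeholders): passing the same weight_decay=0.01 to each gives different behavior, an L2 penalty added to the gradient for Adam versus decoupled decay for AdamW. Also note the different defaults quoted above (0 vs 1e-2).

import torch

model = torch.nn.Linear(10, 1)

# L2 penalty: wd * w is added to the gradient before the adaptive step
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)

# Decoupled weight decay: weights are decayed directly, separately from the gradient
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)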


Read more on the fastai blog.

On pytorch - weight decay in AdamW and Adam, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/64621585/
