machine-learning - SARSA value approximation for Cart-Pole


Reposted by: 行者123 Updated: 2023-11-30 08:37:46

I have a question about this SARSA FA (SARSA with function approximation) implementation.

In input cell 142 I see this modified update:

w += alpha * (reward - discount * q_hat_next) * q_hat_grad

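where q_hat_next is Q(S', a') and q_hat_grad is the gradient of Q(S, a) with respect to w (assuming the sequence S, a, R, S', a').

My question is: shouldn't the update look like this instead?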

w += alpha * (reward + discount * q_hat_next - q_hat) * q_hat_grad

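What is the intuition behind the modified update?

Best answer

I think you are right. I would also expect the update to contain the TD error term, which should be reward + discount * q_hat_next - q_hat.

For reference, this is the implementation: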

if done: # (terminal state reached)
   w += alpha*(reward - q_hat) * q_hat_grad
   break
else:
   next_action = policy(env, w, next_state, epsilon)
   q_hat_next = approx(w, next_state, next_action)
   w += alpha*(reward - discount*q_hat_next)*q_hat_grad
   state = next_state

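This is the pseudocode from Reinforcement Learning: An Introduction (by Sutton & Barto), page 171:

[image: pseudocode from the book, not reproduced in this copy]

Since the implementation is TD(0), n is 1. The update in the pseudocode can then be simplified from: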

w <- w + a[G - v(S_t,w)] * dv(S_t,w)

to (by substituting G == reward + discount*v(S_t+1,w)):

w <- w + a[reward + discount*v(S_t+1,w) - v(S_t,w)] * dv(S_t,w)

Or, using the variable names from the original code example:

w += alpha * (reward + discount * q_hat_next - q_hat) * q_hat_grad

I end up with the same update formula as you. It looks like a bug in the non-terminal state update.
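To make the difference concrete, here is a minimal sketch comparing the two updates on a toy problem. It assumes a linear approximator q_hat = w . x, which the original notebook may or may not use. The function names, the feature vector x, and the fixed value of q_hat_next are all hypothetical and chosen only for illustration.

```python
import numpy as np

def td_error_update(w, x, reward, q_hat_next, alpha=0.1, discount=0.9):
    """Update with the full TD error (assumes linear q_hat = w . x)."""
    q_hat = w @ x
    q_hat_grad = x  # gradient of w . x with respect to w
    return w + alpha * (reward + discount * q_hat_next - q_hat) * q_hat_grad

def modified_update(w, x, reward, q_hat_next, alpha=0.1, discount=0.9):
    """The update as written in the questioned code (no q_hat term, flipped sign)."""
    q_hat_grad = x
    return w + alpha * (reward - discount * q_hat_next) * q_hat_grad

# Toy setup: one active feature, and q_hat_next held fixed for simplicity.
x = np.array([1.0, 0.0])
reward, q_hat_next = 1.0, 5.0
target = reward + 0.9 * q_hat_next  # the value the TD update should approach

w_td = np.zeros(2)
w_mod = np.zeros(2)
for _ in range(200):
    w_td = td_error_update(w_td, x, reward, q_hat_next)
    w_mod = modified_update(w_mod, x, reward, q_hat_next)

q_td = w_td @ x    # settles at the target
q_mod = w_mod @ x  # keeps drifting by a constant amount every step
print(q_td, q_mod, target)
```

Under these assumptions the TD-error update has a fixed point at the target, because the step shrinks as q_hat approaches it. The modified update has no q_hat term, so its step size never depends on the current estimate and the weights drift without settling.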

Only the terminal case (if done is true) should be correct, because by definition q_hat_next is always 0 there: the episode has ended and no more reward can be obtained.
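Putting the terminal and non-terminal cases together, a corrected update step could be sketched as follows. The helper name sarsa_step and the example numbers are hypothetical. The variable names q_hat, q_hat_next, and q_hat_grad follow the original code, and the caller is assumed to have computed them already.

```python
import numpy as np

def sarsa_step(w, q_hat, q_hat_grad, reward, q_hat_next, done,
               alpha=0.01, discount=0.99):
    """One semi-gradient SARSA weight update (hypothetical helper).

    When done is True, q_hat_next is ignored: the value of a terminal
    state is 0, so the target reduces to the reward alone.
    """
    target = reward if done else reward + discount * q_hat_next
    return w + alpha * (target - q_hat) * q_hat_grad

# Example with a linear approximator (an assumption, not taken from the source).
w = np.array([0.5, -0.2])
x = np.array([1.0, 2.0])  # made-up features for (S, a)
q_hat = w @ x

w_nonterminal = sarsa_step(w, q_hat, x, reward=1.0, q_hat_next=2.0,
                           done=False, alpha=0.1, discount=0.5)
w_terminal = sarsa_step(w, q_hat, x, reward=1.0, q_hat_next=2.0,
                        done=True, alpha=0.1, discount=0.5)
print(w_nonterminal, w_terminal)
```

Folding both branches into a single target keeps the two cases consistent, so the terminal update is just the general formula with q_hat_next set to 0.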

Regarding machine-learning - SARSA value approximation for Cart-Pole, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51371975/
