reinforcement-learning - Any example code for the REINFORCE algorithm proposed by Williams?

Reposted. Author: 行者123. Updated: 2023-12-04 01:05:34

Best answer

Yes. Search GitHub and you will get a whole bunch of results:

GitHub: WILLIAMS+REINFORCE

The most popular result uses this code (in Python):

    __author__ = 'Thomas Rueckstiess, ruecksti@in.tum.de'

    from pybrain.rl.learners.directsearch.policygradient import PolicyGradientLearner
    # The original imported these from scipy; the top-level scipy aliases of
    # NumPy functions are deprecated, so they are taken from numpy here.
    from numpy import mean, ravel, array


    class Reinforce(PolicyGradientLearner):
        """ Reinforce is a gradient estimator technique by Williams (see
            "Simple Statistical Gradient-Following Algorithms for
            Connectionist Reinforcement Learning"). It uses optimal
            baselines and calculates the gradient with the log likelihoods
            of the taken actions. """

        def calculateGradient(self):
            # normalize rewards
            # self.ds.data['reward'] /= max(ravel(abs(self.ds.data['reward'])))

            # initialize variables: total reward per sequence (episode) and
            # the index at which each sequence starts
            returns = self.dataset.getSumOverSequences('reward')
            seqidx = ravel(self.dataset['sequence_index'])

            # sum of sequences up to n-1
            loglhs = [sum(self.loglh['loglh'][seqidx[n]:seqidx[n + 1], :]) for n in range(self.dataset.getNumSequences() - 1)]
            # append sum of last sequence as well
            loglhs.append(sum(self.loglh['loglh'][seqidx[-1]:, :]))
            loglhs = array(loglhs)

            # one baseline per parameter, weighted by the squared log likelihoods
            baselines = mean(loglhs ** 2 * returns, 0) / mean(loglhs ** 2, 0)
            # TODO: why gradient negative?
            gradient = -mean(loglhs * (returns - baselines), 0)

            return gradient
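
The snippet above depends on PyBrain's dataset and learner classes, so it cannot be run in isolation. As a rough illustration only, the same baseline and gradient arithmetic can be sketched with plain NumPy. The function name `reinforce_gradient` and its two arguments are hypothetical and not part of PyBrain. The sketch assumes the per-episode sums of the log-likelihood derivatives are already available as an array, guards against a zero denominator (which the snippet above does not do), and returns the estimate without the sign flip, since whether to negate depends on the convention of the optimizer that consumes it.

```python
import numpy as np


def reinforce_gradient(loglh_grads, returns):
    """Episodic REINFORCE gradient estimate with a per-parameter baseline.

    Hypothetical helper, not part of PyBrain.

    loglh_grads: array of shape (n_episodes, n_params); row i holds the sum
        over episode i of the log-likelihood derivatives of the taken actions.
    returns: array of shape (n_episodes,); total reward of each episode.
    """
    loglh_grads = np.asarray(loglh_grads, dtype=float)
    # Column vector so it broadcasts against every parameter column.
    returns = np.asarray(returns, dtype=float).reshape(-1, 1)

    squared = loglh_grads ** 2
    denom = squared.mean(axis=0)
    # Assumption: where a parameter's derivatives are all zero, use a
    # baseline of 0 instead of dividing by zero.
    safe_denom = np.where(denom > 0, denom, 1.0)
    baselines = np.where(denom > 0,
                         (squared * returns).mean(axis=0) / safe_denom,
                         0.0)

    # Average over episodes of (summed log-likelihood derivative) * (return - baseline).
    return (loglh_grads * (returns - baselines)).mean(axis=0)


if __name__ == "__main__":
    grads = [[1.0, 0.0], [-1.0, 0.0]]   # two episodes, two parameters
    episode_returns = [2.0, 0.0]
    print(reinforce_gradient(grads, episode_returns))
```

When every episode has the same return, the baseline equals that return and the estimate is zero, which is a quick sanity check for the baseline term.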

Regarding "reinforcement-learning - Any example code for the REINFORCE algorithm proposed by Williams?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28457688/
