I am trying to implement Actor-Critic with Keras and TensorFlow, but it never converges and I can't figure out why. Lowering the learning rate made no difference.
The code runs on Python 3.5.1 and TensorFlow 1.2.1.
import gym
import itertools
import matplotlib
import numpy as np
import sys
import tensorflow as tf
import collections
from keras.models import Model
from keras.layers import Input, Dense
from keras.utils import to_categorical
from keras import backend as K
env = gym.make('CartPole-v0')
NUM_STATE = env.env.observation_space.shape[0]
NUM_ACTIONS = env.env.action_space.n
LEARNING_RATE = 0.0005
TARGET_AVG_REWARD = 195
class Actor_Critic():
    def __init__(self):
        l_input = Input(shape=(NUM_STATE, ))
        l_dense = Dense(16, activation='relu')(l_input)

        ## Policy Network
        action_probs = Dense(NUM_ACTIONS, activation='softmax')(l_dense)
        policy_network = Model(input=l_input, output=action_probs)

        ## Value Network
        state_value = Dense(1, activation='linear')(l_dense)
        value_network = Model(input=l_input, output=state_value)

        graph = self._build_graph(policy_network, value_network)
        self.state, self.action, self.target, self.action_probs, self.state_value, self.minimize, self.loss = graph

    def _build_graph(self, policy_network, value_network):
        state = tf.placeholder(tf.float32)
        action = tf.placeholder(tf.float32, shape=(None, NUM_ACTIONS))
        target = tf.placeholder(tf.float32, shape=(None))

        action_probs = policy_network(state)
        state_value = value_network(state)[0]

        advantage = tf.stop_gradient(target) - state_value
        log_prob = tf.log(tf.reduce_sum(action_probs * action, reduction_indices=1))

        p_loss = -log_prob * advantage
        v_loss = tf.reduce_mean(tf.square(advantage))
        loss = p_loss + (0.5 * v_loss)

        # optimizer = tf.train.RMSPropOptimizer(LEARNING_RATE, decay=.99)
        optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
        minimize = optimizer.minimize(loss)

        return state, action, target, action_probs, state_value, minimize, loss

    def predict_policy(self, sess, state):
        return sess.run(self.action_probs, { self.state: [state] })

    def predict_value(self, sess, state):
        return sess.run(self.state_value, { self.state: [state] })

    def update(self, sess, state, action, target):
        feed_dict = {self.state: [state], self.target: target, self.action: to_categorical(action, NUM_ACTIONS)}
        _, loss = sess.run([self.minimize, self.loss], feed_dict)
        return loss
def train(env, sess, estimator, num_episodes, discount_factor=1.0):
    Transition = collections.namedtuple("Transition", ["state", "action", "reward", "loss"])
    last_100 = np.zeros(100)

    for i_episode in range(num_episodes):
        # Reset the environment and pick the first action
        state = env.reset()
        episode = []

        # One step in the environment
        for t in itertools.count():
            # Take a step
            action_probs = estimator.predict_policy(sess, state)[0]
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
            next_state, reward, done, _ = env.step(action)

            target = reward + (0 if done else discount_factor * estimator.predict_value(sess, next_state))

            # Update our policy estimator
            loss = estimator.update(sess, state, action, target)

            # Keep track of the transition
            episode.append(Transition(state=state, action=action, reward=reward, loss=loss))

            if done:
                break
            state = next_state

        total_reward = sum(e.reward for e in episode)
        last_100[i_episode % 100] = total_reward
        last_100_avg = sum(last_100) / 100
        total_loss = sum(e.loss for e in episode)
        print('episode %s loss: %f reward: %f last 100: %f' % (i_episode, total_loss, total_reward, last_100_avg))
        if last_100_avg >= TARGET_AVG_REWARD:
            break
    return
estimator = Actor_Critic()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    stats = train(env, sess, estimator, 2000, discount_factor=0.99)
Here is the log from the start of training ("last 100" is the average reward over the most recent 100 episodes; the array starts zero-filled, so this average just ramps up over the first 100 episodes and can be ignored there):
episode 0 loss: 17.662344 reward: 15.000000 last 100: 0.150000
episode 1 loss: 15.319713 reward: 13.000000 last 100: 0.280000
episode 2 loss: 38.097054 reward: 32.000000 last 100: 0.600000
episode 3 loss: 22.229492 reward: 19.000000 last 100: 0.790000
episode 4 loss: 31.027534 reward: 26.000000 last 100: 1.050000
episode 5 loss: 21.037663 reward: 18.000000 last 100: 1.230000
episode 6 loss: 18.750641 reward: 16.000000 last 100: 1.390000
episode 7 loss: 23.268227 reward: 20.000000 last 100: 1.590000
episode 8 loss: 27.251028 reward: 23.000000 last 100: 1.820000
episode 9 loss: 20.008078 reward: 17.000000 last 100: 1.990000
episode 10 loss: 28.213932 reward: 24.000000 last 100: 2.230000
episode 11 loss: 28.109922 reward: 23.000000 last 100: 2.460000
episode 12 loss: 25.068121 reward: 21.000000 last 100: 2.670000
episode 13 loss: 59.581238 reward: 50.000000 last 100: 3.170000
episode 14 loss: 26.618759 reward: 22.000000 last 100: 3.390000
episode 15 loss: 28.847467 reward: 24.000000 last 100: 3.630000
episode 16 loss: 22.534216 reward: 17.000000 last 100: 3.800000
episode 17 loss: 19.760979 reward: 15.000000 last 100: 3.950000
episode 18 loss: 31.018209 reward: 25.000000 last 100: 4.200000
episode 19 loss: 22.938683 reward: 16.000000 last 100: 4.360000
episode 20 loss: 30.372072 reward: 24.000000 last 100: 4.600000
After 500 episodes it has not only failed to improve, it is doing worse than at the start:
episode 501 loss: 97.043335 reward: 8.000000 last 100: 13.500000
episode 502 loss: 101.957603 reward: 11.000000 last 100: 13.510000
episode 503 loss: 100.277809 reward: 11.000000 last 100: 13.520000
episode 504 loss: 96.754257 reward: 9.000000 last 100: 13.510000
episode 505 loss: 99.436943 reward: 11.000000 last 100: 13.530000
episode 506 loss: 105.161621 reward: 16.000000 last 100: 13.580000
episode 507 loss: 65.993591 reward: 12.000000 last 100: 13.610000
episode 508 loss: 59.837429 reward: 9.000000 last 100: 13.600000
episode 509 loss: 92.478806 reward: 9.000000 last 100: 13.570000
episode 510 loss: 96.697289 reward: 14.000000 last 100: 13.620000
episode 511 loss: 94.611366 reward: 10.000000 last 100: 13.620000
episode 512 loss: 100.259460 reward: 15.000000 last 100: 13.680000
episode 513 loss: 88.776451 reward: 10.000000 last 100: 13.690000
episode 514 loss: 86.659203 reward: 9.000000 last 100: 13.700000
episode 515 loss: 105.494476 reward: 17.000000 last 100: 13.770000
episode 516 loss: 90.662186 reward: 12.000000 last 100: 13.770000
episode 517 loss: 90.777634 reward: 12.000000 last 100: 13.810000
episode 518 loss: 91.290558 reward: 14.000000 last 100: 13.860000
episode 519 loss: 94.902023 reward: 11.000000 last 100: 13.870000
episode 520 loss: 86.746582 reward: 12.000000 last 100: 13.900000
On the other hand, a plain policy gradient implementation does converge:
import gym
import itertools
import matplotlib
import numpy as np
import sys
import tensorflow as tf
import collections
from keras.models import Model
from keras.layers import Input, Dense
from keras.utils import to_categorical
from keras import backend as K
env = gym.make('CartPole-v0')
NUM_STATE = env.env.observation_space.shape[0]
NUM_ACTIONS = env.env.action_space.n
LEARNING_RATE = 0.0005
TARGET_AVG_REWARD = 195
class PolicyEstimator():
    """
    Policy Function approximator.
    """
    def __init__(self):
        l_input = Input(shape=(NUM_STATE, ))
        l_dense = Dense(16, activation='relu')(l_input)
        action_probs = Dense(NUM_ACTIONS, activation='softmax')(l_dense)
        model = Model(inputs=[l_input], outputs=[action_probs])

        self.state, self.action, self.target, self.action_probs, self.minimize, self.loss = self._build_graph(model)

    def _build_graph(self, model):
        state = tf.placeholder(tf.float32)
        action = tf.placeholder(tf.float32, shape=(None, NUM_ACTIONS))
        target = tf.placeholder(tf.float32, shape=(None))

        action_probs = model(state)
        log_prob = tf.log(tf.reduce_sum(action_probs * action, reduction_indices=1))
        loss = -log_prob * target

        # optimizer = tf.train.RMSPropOptimizer(LEARNING_RATE, decay=.99)
        optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
        minimize = optimizer.minimize(loss)

        return state, action, target, action_probs, minimize, loss

    def predict(self, sess, state):
        return sess.run(self.action_probs, { self.state: [state] })

    def update(self, sess, state, action, target):
        feed_dict = {self.state: [state], self.target: [target], self.action: to_categorical(action, NUM_ACTIONS)}
        _, loss = sess.run([self.minimize, self.loss], feed_dict)
        return loss
def train(env, sess, estimator_policy, num_episodes, discount_factor=1.0):
    Transition = collections.namedtuple("Transition", ["state", "action", "reward"])
    last_100 = np.zeros(100)

    for i_episode in range(num_episodes):
        # Reset the environment and pick the first action
        state = env.reset()
        episode = []

        # One step in the environment
        for t in itertools.count():
            # Take a step
            action_probs = estimator_policy.predict(sess, state)[0]
            action = np.random.choice(np.arange(len(action_probs)), p=action_probs)
            next_state, reward, done, _ = env.step(action)

            # Keep track of the transition
            episode.append(Transition(state=state, action=action, reward=reward))

            if done:
                break
            state = next_state

        # Go through the episode and make policy updates
        for t, transition in enumerate(episode):
            # The return after this timestep
            target = sum(discount_factor**i * t2.reward for i, t2 in enumerate(episode[t:]))
            # Update our policy estimator
            loss = estimator_policy.update(sess, transition.state, transition.action, target)

        total_reward = sum(e.reward for e in episode)
        last_100[i_episode % 100] = total_reward
        last_100_avg = sum(last_100) / 100
        print('episode %s reward: %f last 100: %f' % (i_episode, total_reward, last_100_avg))
        if last_100_avg >= TARGET_AVG_REWARD:
            break
    return
policy_estimator = PolicyEstimator()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    stats = train(env, sess, policy_estimator, 2000, discount_factor=1.0)
Referenced code:
https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py
https://github.com/coreylynch/async-rl
Any help is appreciated.
[Update]
I changed the code in _build_graph from
advantage = tf.stop_gradient(target) - state_value
log_prob = tf.log(tf.reduce_sum(action_probs * action, reduction_indices=1))
p_loss = -log_prob * advantage
v_loss = tf.reduce_mean(tf.square(advantage))
loss = p_loss + (0.5 * v_loss)
to
advantage = target - state_value
log_prob = tf.log(tf.reduce_sum(action_probs * action, reduction_indices=1))
p_loss = -log_prob * tf.stop_gradient(advantage)
v_loss = 0.5 * tf.reduce_mean(tf.square(advantage))
loss = p_loss + v_loss
This made it better, and it started reaching the maximum reward of 200. However, after 4000 episodes it still had not reached an average of 195.
Best answer
The first obvious problem is that the gradient is stopped on the wrong term of the advantage:
advantage = tf.stop_gradient(target) - state_value
should be
advantage = target - tf.stop_gradient(state_value)
since there is no gradient through target either way (it is a constant), and what you want is a policy gradient with no gradient flowing through the value network (the baseline). You already have a separate baseline loss for the value network, and that part looks fine.
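In equation form, the policy term should implement the standard advantage actor-critic update ∇θ log πθ(a|s) · A, with the advantage A = target − V(s) held constant, while the value network is trained only through the separate baseline loss on (target − V(s))². Because target is a placeholder and carries no gradient anyway, stopping the gradient on it is a no-op, so the original policy loss was also backpropagating into V(s), pushing the baseline in a direction unrelated to value regression.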
Another possible mistake is the way you reduce the losses. You explicitly call reduce_mean for v_loss, but never for p_loss. As a result the relative scaling of the two terms is off, and the value network will probably learn more slowly (since v_loss is averaged over the, presumably time, dimension first).
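Putting both points together, here is a minimal sketch of the loss section of _build_graph, using the stop_gradient placement from the [Update] above (stopping on the advantage inside p_loss, so v_loss still trains the baseline) and applying reduce_mean to the policy term as well; everything else in the graph stays as in the question:

advantage = target - state_value
log_prob = tf.log(tf.reduce_sum(action_probs * action, reduction_indices=1))

# stop_gradient on the advantage keeps the policy loss from sending
# gradients into the value network; only v_loss trains the baseline.
p_loss = -tf.reduce_mean(log_prob * tf.stop_gradient(advantage))

# Reduce both terms with reduce_mean so their relative scale is as intended.
v_loss = 0.5 * tf.reduce_mean(tf.square(advantage))

loss = p_loss + v_loss

With single-sample updates (batch size 1, as in the question's update() call) the reduce_mean on p_loss is numerically a no-op, but it keeps the two terms on the same scale once updates are batched.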
Original Stack Overflow question: https://stackoverflow.com/questions/45428574/