
python - Why doesn't my Deep Q Network master a simple Gridworld (Tensorflow)? (How to evaluate a Deep-Q-Net)

Reposted. Author: IT老高. Updated: 2023-10-28 20:55:48

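I am trying to get familiar with Q-learning and deep neural networks, and I am currently trying to implement Playing Atari with Deep Reinforcement Learning.

To test my implementation and play around with it, I thought I would try a simple gridworld. I have an N x N grid, start in the top-left corner, and finish in the bottom-right corner. The possible actions are: left, up, right, down.

Even though my implementation has become very similar to this one (hopefully a good one), it does not seem to learn anything. Looking at the total number of steps it needs to finish (I guess the average would be around 500 for a 10x10 grid, but there are also very low and very high values), it looks more random than anything else.

I tried it with and without convolutional layers and played around with all the parameters, but to be honest, I have no idea whether something is wrong with my implementation, whether it simply needs to train longer (I let it train for quite a while), or something else entirely. At least it seems to converge; here is a plot of the loss values from the first training session: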

[Image: loss plot]

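So what is the problem in this case?

But maybe more importantly, how can I "debug" these Deep-Q-Nets? In supervised training there are training, test, and validation sets, and a model can be evaluated with, for example, precision and recall. What options do I have for unsupervised learning with Deep-Q-Nets, so that next time I might be able to fix it myself?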
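One possible way to approach the evaluation question on a gridworld this small is to compute the exact Q-values by value iteration and compare the network against them, and to measure how many steps the greedy policy needs (the shortest path on a 10x10 grid is 18 steps). The sketch below is only an illustration under stated assumptions: it uses four actions, a reward of -1 per step and +100 at the goal, it leaves out the step-limit penalty, and the names `step`, `optimal_q` and `greedy_episode_length` are made up for this example.

```python
GAMMA = 0.95  # same discount as in the question's training code
N = 10
GOAL = N * N - 1
MOVES = [(0, -1), (-1, 0), (0, 1), (1, 0)]  # left, up, right, down as (d_row, d_col)


def step(pos, action):
    """Deterministic move; bumping into a wall leaves the position unchanged."""
    row, col = divmod(pos, N)
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N and 0 <= new_col < N:
        pos = new_row * N + new_col
    reward = 100.0 if pos == GOAL else -1.0
    return pos, reward, pos == GOAL


def optimal_q(sweeps=50):
    """Exact Q-values by value iteration, as a reference for a learned network."""
    q = [[0.0] * len(MOVES) for _ in range(N * N)]
    for _ in range(sweeps):
        for pos in range(N * N):
            if pos == GOAL:
                continue  # terminal state: its Q-values stay 0
            for action in range(len(MOVES)):
                nxt, reward, done = step(pos, action)
                q[pos][action] = reward + (0.0 if done else GAMMA * max(q[nxt]))
    return q


def greedy_episode_length(q_values, max_steps=1000):
    """Follow argmax-Q from the top-left cell and count steps until the goal.

    q_values is any callable mapping a flat position to a list of Q-values,
    so a wrapper around a trained network could be passed in as well.
    """
    pos, steps, done = 0, 0, False
    while not done and steps < max_steps:
        qs = q_values(pos)
        action = max(range(len(MOVES)), key=lambda a: qs[a])
        pos, _, done = step(pos, action)
        steps += 1
    return steps


reference = optimal_q()
print(greedy_episode_length(lambda pos: reference[pos]))
```

Tracking the greedy episode length with exploration switched off, and the gap between the network's Q-values and the reference table, tends to say more about progress than the training loss, because the regression targets move as the network changes.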

Finally, the code:

This is the network:

import tensorflow as tf

# weight_variable, bias_variable, conv2d and max_pool_2x2 are helper
# functions that are not shown in the post.
# tf.mul and tf.scalar_summary are the pre-1.0 TensorFlow names.

ACTIONS = 5

# Inputs
x = tf.placeholder('float', shape=[None, 10, 10, 4])  # stack of 4 grid frames
y = tf.placeholder('float', shape=[None])             # Q-learning targets
a = tf.placeholder('float', shape=[None, ACTIONS])    # one-hot chosen actions

# Layer 1 Conv1 - input
with tf.name_scope('Layer1'):
    W_conv1 = weight_variable([8,8,4,8])
    b_conv1 = bias_variable([8])
    h_conv1 = tf.nn.relu(conv2d(x, W_conv1, 5)+b_conv1)

# Layer 2 Conv2 - hidden1
with tf.name_scope('Layer2'):
    W_conv2 = weight_variable([2,2,8,8])
    b_conv2 = bias_variable([8])
    h_conv2 = tf.nn.relu(conv2d(h_conv1, W_conv2, 1)+b_conv2)
    h_conv2_max_pool = max_pool_2x2(h_conv2)

# Layer 3 fc1 - hidden 2
with tf.name_scope('Layer3'):
    W_fc1 = weight_variable([8, 32])
    b_fc1 = bias_variable([32])
    h_conv2_flat = tf.reshape(h_conv2_max_pool, [-1, 8])
    h_fc1 = tf.nn.relu(tf.matmul(h_conv2_flat, W_fc1)+b_fc1)

# Layer 4 fc2 - readout
with tf.name_scope('Layer4'):
    W_fc2 = weight_variable([32, ACTIONS])
    b_fc2 = bias_variable([ACTIONS])
    readout = tf.matmul(h_fc1, W_fc2)+ b_fc2

# Training
with tf.name_scope('training'):
    # Q-value of the action actually taken, selected via the one-hot mask
    readout_action = tf.reduce_sum(tf.mul(readout, a), reduction_indices=1)
    loss = tf.reduce_mean(tf.square(y - readout_action))
    train = tf.train.AdamOptimizer(1e-6).minimize(loss)

loss_summ = tf.scalar_summary('loss', loss)
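A quick shape check can help when reading a network like this one. The sketch below assumes that the `conv2d` helper applies the given stride in both spatial dimensions with `padding='SAME'` and that `max_pool_2x2` uses a 2x2 window with stride 2; neither helper is shown in the post, so both are assumptions. With `'SAME'` padding the output length along an axis is the input length divided by the stride, rounded up.

```python
def same_padding_output(size, stride):
    """Output length along one axis for padding='SAME' (ceiling division)."""
    return (size + stride - 1) // stride


side = 10                                   # the input grid is 10x10
after_conv1 = same_padding_output(side, 5)  # Layer1: 8x8 filters, stride 5
after_conv2 = same_padding_output(after_conv1, 1)  # Layer2: 2x2 filters, stride 1
after_pool = same_padding_output(after_conv2, 2)   # 2x2 max pool, stride 2 (assumed)
flat_features = after_pool * after_pool * 8        # 8 output channels

print(after_conv1, after_pool, flat_features)
```

Under these assumptions the first layer already reduces the 10x10 grid to 2x2, and the flattened size comes out as 8, which agrees with the `[-1, 8]` reshape in Layer3. That leaves only a handful of numbers to encode the agent's position, which may be one reason the convolutional version struggles.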

And here is the training:

import random
from collections import deque

import numpy as np

# 0 => left
# 1 => up
# 2 => right
# 3 => down
# 4 => noop

ACTIONS = 5
GAMMA = 0.95
BATCH = 50
TRANSITIONS = 2000    # replay memory size
OBSERVATIONS = 1000   # transitions to collect before training starts
MAXSTEPS = 1000

D = deque()
epsilon = 1

average = 0
for episode in range(1000):
    step_count = 0
    game_ended = False

    # one-hot position on the flattened 10x10 grid, starting top-left
    state = np.array([0.0]*100, float).reshape(100)
    state[0] = 1

    rsh_state = state.reshape(10,10)
    s = np.stack((rsh_state, rsh_state, rsh_state, rsh_state), axis=2)

    while step_count < MAXSTEPS and not game_ended:
        reward = 0
        step_count += 1

        read = readout.eval(feed_dict={x: [s]})[0]

        # epsilon-greedy action selection
        act = np.zeros(ACTIONS)
        action = random.randint(0,4)
        if len(D) > OBSERVATIONS and random.random() > epsilon:
            action = np.argmax(read)
        act[action] = 1

        # play the game
        pos_idx = state.argmax(axis=0)
        pos = pos_idx + 1

        state[pos_idx] = 0
        if action == 0 and pos%10 != 1: #left
            state[pos_idx-1] = 1
        elif action == 1 and pos > 10: #up
            state[pos_idx-10] = 1
        elif action == 2 and pos%10 != 0: #right
            state[pos_idx+1] = 1
        elif action == 3 and pos < 91: #down
            state[pos_idx+10] = 1
        else: #noop
            state[pos_idx] = 1
            pass

        # reward is still 0 at this point, so this branch never fires
        if state.argmax(axis=0) == pos_idx and reward > 0:
            reward -= 0.0001

        if step_count == MAXSTEPS:
            reward -= 100
        elif state[99] == 1: # reward & finished
            reward += 100
            game_ended = True
        else:
            reward -= 1

        s_old = np.copy(s)
        s = np.append(s[:,:,1:], state.reshape(10,10,1), axis=2)

        D.append((s_old, act, reward, s))
        if len(D) > TRANSITIONS:
            D.popleft()

        if len(D) > OBSERVATIONS:
            minibatch = random.sample(D, BATCH)

            s_j_batch = [d[0] for d in minibatch]
            a_batch = [d[1] for d in minibatch]
            r_batch = [d[2] for d in minibatch]
            s_j1_batch = [d[3] for d in minibatch]

            readout_j1_batch = readout.eval(feed_dict={x:s_j1_batch})
            y_batch = []

            for i in range(0, len(minibatch)):
                y_batch.append(r_batch[i] + GAMMA * np.max(readout_j1_batch[i]))

            train.run(feed_dict={x: s_j_batch, y: y_batch, a: a_batch})

        # nesting level assumed here; the indentation was lost in this copy
        if epsilon > 0.05:
            epsilon -= 0.01
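One detail that may be worth checking in a loop like this: in the algorithm from the DQN paper, the target for a transition that ends the episode is just the reward, with no bootstrapped term. The replay memory above stores no terminal flag, so every target includes `GAMMA * max Q(s')`. A minimal sketch of terminal-aware targets follows; the `done` array is a hypothetical extra field that would have to be stored with each transition, and `q_targets` is a name made up for this example.

```python
import numpy as np

GAMMA = 0.95  # same discount as in the training code


def q_targets(rewards, next_q, done):
    """Return y = r for terminal transitions, else r + GAMMA * max_a Q(s', a).

    rewards: shape (batch,)
    next_q:  shape (batch, actions), the network output for the next states
    done:    shape (batch,), True where the episode ended (hypothetical field)
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q = np.asarray(next_q, dtype=float)
    not_done = 1.0 - np.asarray(done, dtype=float)
    return rewards + GAMMA * not_done * next_q.max(axis=1)


# first transition reaches the goal, second one is an ordinary step
print(q_targets([100, -1], [[5.0, 7.0], [1.0, 3.0]], [True, False]))
```

In the loop above this would mean appending `game_ended` to each tuple stored in `D` and passing it along with `r_batch` and `readout_j1_batch`.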

Thanks for any help and ideas!

Best answer

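For those interested: I adjusted the parameters and the model further, but the biggest improvement came from switching to a simple feed-forward network with 3 layers and about 50 neurons in the hidden layer. For me, it converged in a pretty decent amount of time.

By the way, further tips on debugging are appreciated!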
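The answer gives no code for the feed-forward network. As a rough illustration only, here is a small sketch in plain NumPy, so that it runs without the legacy TensorFlow API used in the question. The input size of 100 (the flattened 10x10 grid), the ReLU activation, the weight initialization, the learning rate, and the names `forward` and `train_step` are all assumptions; only the layer count and the roughly 50 hidden units come from the answer.

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_HIDDEN, N_ACTIONS = 100, 50, 5  # flattened grid, ~50 hidden units, ACTIONS

params = {
    "W1": rng.normal(0.0, 0.1, (N_IN, N_HIDDEN)),
    "b1": np.zeros(N_HIDDEN),
    "W2": rng.normal(0.0, 0.1, (N_HIDDEN, N_ACTIONS)),
    "b2": np.zeros(N_ACTIONS),
}


def forward(params, states):
    """Return (hidden activations, Q-values) for a batch of flattened states."""
    hidden = np.maximum(0.0, states @ params["W1"] + params["b1"])
    return hidden, hidden @ params["W2"] + params["b2"]


def train_step(params, states, actions, targets, lr=0.01):
    """One SGD step on mean((Q(s, a) - y)^2); returns the loss before the update."""
    batch = len(states)
    hidden, q = forward(params, states)
    rows = np.arange(batch)
    error = q[rows, actions] - targets
    # gradient of the loss w.r.t. the Q-values: nonzero only for the taken actions
    grad_q = np.zeros_like(q)
    grad_q[rows, actions] = 2.0 * error / batch
    # backpropagate through the output layer and the ReLU
    grad_hidden = (grad_q @ params["W2"].T) * (hidden > 0)
    params["W2"] -= lr * hidden.T @ grad_q
    params["b2"] -= lr * grad_q.sum(axis=0)
    params["W1"] -= lr * states.T @ grad_hidden
    params["b1"] -= lr * grad_hidden.sum(axis=0)
    return float(np.mean(error ** 2))


# fit a fixed toy batch: 8 one-hot states with arbitrary targets
states = np.eye(N_IN)[:8]
actions = np.array([0, 1, 2, 3, 4, 0, 1, 2])
targets = np.linspace(-1.0, 1.0, 8)
losses = [train_step(params, states, actions, targets) for _ in range(200)]
print(losses[0], losses[-1])
```

With a one-hot position as input, the first layer acts much like a lookup table per grid cell, which might be why such a small dense network is easier to train here than the strided convolution.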

Regarding "python - Why doesn't my Deep Q Network master a simple Gridworld (Tensorflow)? (How to evaluate a Deep-Q-Net)", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35394446/
