
python - Inconsistency between tf.contrib.layers.fully_connected, tf.layers.dense, tf.contrib.slim.fully_connected and tf.keras.layers.Dense


I am trying to implement a policy gradient for the contextual bandits problem (https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-1-5-contextual-bandits-bff01d1aad9c).

I am defining a model in tensorflow that solves this problem with a single fully connected layer.

I am trying out different APIs from tensorflow, but I want to avoid the contrib package because it is not supported by tensorflow. I am interested in using the keras API, since I am already familiar with its functional interface, which is now available as tf.keras. However, I only seem to get results when I use tf.contrib.slim.fully_connected or tf.contrib.layers.fully_connected (the former calls the latter).

The following two snippets work fine (one_hot_encoded_state_input and num_actions both match the tensor shapes the layers expect).

import tensorflow.contrib.slim as slim
action_probability_distribution = slim.fully_connected(
    one_hot_encoded_state_input,
    num_actions,
    biases_initializer=None,
    activation_fn=tf.nn.sigmoid,
    weights_initializer=tf.ones_initializer())

from tensorflow.contrib.layers import fully_connected
action_probability_distribution = fully_connected(
    one_hot_encoded_state_input,
    num_actions,
    biases_initializer=None,
    activation_fn=tf.nn.sigmoid,
    weights_initializer=tf.ones_initializer())

On the other hand, the following does not work:

action_probability_distribution = tf.layers.dense(
    one_hot_encoded_state_input,
    num_actions,
    activation=tf.nn.sigmoid,
    bias_initializer=None,
    kernel_initializer=tf.ones_initializer())

nor does this:

action_probability_distribution = tf.keras.layers.Dense(
    num_actions,
    activation='sigmoid',
    bias_initializer=None,
    kernel_initializer='Ones')(one_hot_encoded_state_input)

The last two cases use tensorflow's higher-level APIs, layers and keras. Ideally, I would like to know whether I am reproducing the first two cases incorrectly with the last two, or whether the only issue is that the last two are simply not equivalent to the first two.

For completeness, here is all the code needed to run this (note: python 3.5.6 and tensorflow 1.12.0 were used).

import tensorflow as tf
import numpy as np
tf.reset_default_graph()

num_states = 3
num_actions = 4
learning_rate = 1e-3

state_input = tf.placeholder(shape=(None,),dtype=tf.int32, name='state_input')
one_hot_encoded_state_input = tf.one_hot(state_input, num_states)

# DOESN'T WORK
action_probability_distribution = tf.keras.layers.Dense(num_actions, activation='sigmoid', bias_initializer=None, kernel_initializer = 'Ones')(one_hot_encoded_state_input)

# WORKS
# import tensorflow.contrib.slim as slim
# action_probability_distribution = slim.fully_connected(one_hot_encoded_state_input,num_actions,\
# biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones_initializer())

# WORKS
# from tensorflow.contrib.layers import fully_connected
# action_probability_distribution = fully_connected(one_hot_encoded_state_input,num_actions,\
# biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones_initializer())

# DOESN'T WORK
# action_probability_distribution = tf.layers.dense(one_hot_encoded_state_input,num_actions, activation=tf.nn.sigmoid, bias_initializer=None, kernel_initializer=tf.ones_initializer())

action_probability_distribution = tf.squeeze(action_probability_distribution)
action_chosen = tf.argmax(action_probability_distribution)

reward_input = tf.placeholder(shape=(None,), dtype=tf.float32, name='reward_input')
action_input = tf.placeholder(shape=(None,), dtype=tf.int32, name='action_input')
responsible_weight = tf.slice(action_probability_distribution, action_input, [1])
loss = -(tf.log(responsible_weight)*reward_input)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
update = optimizer.minimize(loss)


bandits = np.array([[0.2, 0, -0.0, -5],
                    [0.1, -5, 1, 0.25],
                    [-5, 5, 5, 5]])

assert bandits.shape == (num_states, num_actions)

def get_reward(state, action):  # the lower the value of bandits[state][action], the higher the likelihood of reward
    if np.random.randn() > bandits[state][action]:
        return 1
    return -1

max_episodes = 10000
epsilon = 0.1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    rewards = np.zeros(num_states)
    for episode in range(max_episodes):
        state = np.random.randint(0, num_states)
        action = sess.run(action_chosen, feed_dict={state_input: [state]})
        if np.random.rand(1) < epsilon:
            action = np.random.randint(0, num_actions)

        reward = get_reward(state, action)
        sess.run([update, action_probability_distribution, loss],
                 feed_dict={reward_input: [reward], action_input: [action], state_input: [state]})

        rewards[state] += reward

        if episode % 500 == 0:
            print(rewards)

When one of the blocks commented # WORKS is used, the agent learns and maximizes the reward for all three states. In contrast, the blocks commented # DOESN'T WORK do not learn and usually converge very quickly to always choosing a single action. For example, the working behavior should print a rewards list of increasing positive numbers (a good cumulative reward for each state). The non-working behavior looks like a rewards list in which only one action increases its cumulative reward, usually at the expense of another (negative cumulative reward).

Best Answer

For anyone who runs into this, especially since tensorflow has many APIs for doing the same thing: the difference comes down to the bias initialization and its defaults. For tf.contrib and tf.slim, passing biases_initializer=None means that no bias is used. Reproducing this with tf.layers and tf.keras requires use_bias=False.
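Concretely, here is a minimal sketch of how the two non-working snippets from the question could be adjusted according to the answer (untested here; use_bias is a documented argument of both tf.layers.dense and tf.keras.layers.Dense in TF 1.x, and the variable names are taken from the question's code):

# tf.layers: disable the bias via use_bias=False instead of bias_initializer=None
action_probability_distribution = tf.layers.dense(
    one_hot_encoded_state_input,
    num_actions,
    activation=tf.nn.sigmoid,
    use_bias=False,
    kernel_initializer=tf.ones_initializer())

# tf.keras: same idea with the Keras layer
action_probability_distribution = tf.keras.layers.Dense(
    num_actions,
    activation='sigmoid',
    use_bias=False,
    kernel_initializer='Ones')(one_hot_encoded_state_input)

With the bias removed this way, all four versions should build the same single fully connected layer: a ones-initialized kernel, a sigmoid activation and no bias term.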

Regarding python - Inconsistency between tf.contrib.layers.fully_connected, tf.layers.dense, tf.contrib.slim.fully_connected and tf.keras.layers.Dense, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54221778/
