
How to compute the gradient in actor-critic reinforcement learning when the policy outputs multiple action probabilities (and a few related questions)




I am trying to run reinforcement learning with the actor-critic method.



Link: https://github.com/keras-team/keras-io/blob/master/examples/rl/actor_critic_cartpole.py



The agent takes as input a map of size max_x * max_y, randomly filled with 1s or 0s.



At each step, the agent outputs probabilities for x, y, and z, and the resulting action updates the map.



The agent is trained to change the map until it equals target_map, which is initialized at the beginning of training. An action is rewarded +1 if it changes the map and moves it closer to target_map, 0 if it fails to change the map, and -1 if it changes the map incorrectly.


The episode ends when cur_map becomes the same as target_map, or when the agent receives a -1 reward.



Here, x and y are positions on the map and z is 0 or 1.
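
To make the setup concrete, the environment's step logic behaves roughly like the sketch below; the class name GridEnv and the attribute names are illustrative placeholders, not my exact implementation.

import numpy as np

class GridEnv:
    """Simplified sketch of the map-editing environment described above."""

    def __init__(self, max_x, max_y):
        self.max_x, self.max_y = max_x, max_y
        # target_map is fixed once at the beginning of training
        self.target_map = np.random.randint(0, 2, size=(max_x, max_y))

    def reset(self):
        self.cur_map = np.random.randint(0, 2, size=(self.max_x, self.max_y))
        return self.cur_map.copy()

    def step(self, x, y, z):
        if self.cur_map[x, y] == z:
            reward = 0        # the action failed to change the map
        else:
            self.cur_map[x, y] = z
            # +1 if the change moves cur_map closer to target_map, -1 otherwise
            reward = 1 if self.target_map[x, y] == z else -1
        done = reward == -1 or np.array_equal(self.cur_map, self.target_map)
        return self.cur_map.copy(), reward, done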

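The model itself takes the map as input and has four outputs: softmax distributions over x, y, and z, plus a critic value. A minimal sketch of that structure (the map size and hidden layer size here are placeholders, not my actual settings):

from tensorflow import keras
from tensorflow.keras import layers

max_x, max_y = 8, 8     # example map size (placeholder)
num_hidden = 128        # placeholder hidden size

inputs = layers.Input(shape=(max_x, max_y))                   # the current map
common = layers.Flatten()(inputs)
common = layers.Dense(num_hidden, activation="relu")(common)

x_probs = layers.Dense(max_x, activation="softmax")(common)   # distribution over x
y_probs = layers.Dense(max_y, activation="softmax")(common)   # distribution over y
z_probs = layers.Dense(2, activation="softmax")(common)       # distribution over z (0 or 1)
critic = layers.Dense(1)(common)                              # state value V(s)

model = keras.Model(inputs=inputs, outputs=[x_probs, y_probs, z_probs, critic])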


So I have a few questions.




  1. Is this reward scheme a reasonable way to train the agent?



  2. To calculate the loss, I combine the probabilities of x, y, and z as in the code below. Is there a problem with this?




while True:
    state = env.reset()
    episode_reward = 0

    with tf.GradientTape() as tape:
        for _ in range(step_per_episode):
            state = tf.convert_to_tensor(state)
            state = tf.expand_dims(state, 0)      # add a batch dimension

            x_probs, y_probs, z_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Sample each action component from its own distribution
            x = np.random.choice(max_x, p=np.squeeze(x_probs))
            y = np.random.choice(max_y, p=np.squeeze(y_probs))
            z = np.random.choice(2, p=np.squeeze(z_probs))

            # this: joint log-probability of the sampled (x, y, z) action
            action_probs_history.append(
                tf.math.log(x_probs[0, x] * y_probs[0, y] * z_probs[0, z]))

            state, reward, done = env.step(x, y, z)
            rewards_history.append(reward)
            episode_reward += reward

            if done:
                break

        running_reward = running_reward * (1 - episode_weight) + episode_reward * episode_weight

        # Calculate expected value from rewards (discounted returns)
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []

        for log_prob, value, ret in history:
            diff = ret - value                     # advantage estimate
            actor_losses.append(-log_prob * diff)  # actor (policy) loss
            critic_losses.append(
                huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0)))

        # Backpropagation
        loss_value = sum(actor_losses) + sum(critic_losses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        action_probs_history.clear()
        critic_value_history.clear()
        rewards_history.clear()

...
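
Regarding question 2 and the line marked # this: because I treat x, y, and z as independent given the state, the joint probability of the sampled action is the product of the three head probabilities, so its log equals the sum of the three individual log-probabilities. Summing the logs is mathematically equivalent but numerically more stable than taking the log of the product:

log_prob = (tf.math.log(x_probs[0, x])
            + tf.math.log(y_probs[0, y])
            + tf.math.log(z_probs[0, z]))   # == log(P(x) * P(y) * P(z))
action_probs_history.append(log_prob)

The actor loss -log_prob * (ret - value) then backpropagates through all three softmax heads at once, which, as far as I understand, is the standard policy-gradient update for a factored action space.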


I need help; comments are welcome. Thanks.


