I am trying to run reinforcement learning with actor-critic methods
Link: https://github.com/keras-team/keras-io/blob/master/examples/rl/actor_critic_cartpole.py
The agent takes as input a map of size max_x * max_y, randomly filled with 1s or 0s.
The agent then updates the map by returning, at each step, probability distributions over x, y, and z, where x and y are positions on the map and z is 0 or 1; the sampled (x, y, z) action is applied to the map.
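Roughly, my model is a shared dense trunk with three softmax heads (one each for x, y, and z) plus a critic head, along the lines of the sketch below. This is a simplified stand-in rather than my exact architecture; the map size and the hidden width num_hidden here are placeholders.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

max_x, max_y = 8, 8     # placeholder map size
num_hidden = 128        # placeholder hidden width

# Flattened map of 0s and 1s as input
inputs = layers.Input(shape=(max_x * max_y,))
common = layers.Dense(num_hidden, activation="relu")(inputs)

# Three categorical heads, one per action component
x_head = layers.Dense(max_x, activation="softmax")(common)
y_head = layers.Dense(max_y, activation="softmax")(common)
z_head = layers.Dense(2, activation="softmax")(common)

# Critic head estimating the state value
critic = layers.Dense(1)(common)

model = keras.Model(inputs=inputs, outputs=[x_head, y_head, z_head, critic])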
Each step is trained to make the map equal to target_map, which is initialized at the beginning of training. An action is rewarded +1 if it changes the map and brings it closer to target_map, 0 if it fails to change the map, and -1 if it changes the map incorrectly.
The episode ends when cur_map matches target_map or when the agent receives a -1 reward.
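Concretely, my environment's step logic behaves roughly like the sketch below (simplified; the function signature and names here are illustrative, not my actual env.step):

import numpy as np

def step(cur_map, target_map, x, y, z):
    # Writing the value the cell already has does not change the map: reward 0
    if cur_map[x, y] == z:
        return cur_map, 0, False
    # Writing the value target_map expects brings the map closer: reward +1
    if target_map[x, y] == z:
        cur_map[x, y] = z
        done = bool(np.array_equal(cur_map, target_map))
        return cur_map, 1, done
    # Any other change is incorrect: reward -1 and the episode ends
    cur_map[x, y] = z
    return cur_map, -1, True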
So I have a few questions.
Is this a reasonable reward scheme to train with?
To calculate the loss, I am using the probabilities of x, y, and z as in the code below. Is there a problem with this?
while True:  # run until solved
    state = env.reset()
    episode_reward = 0
    with tf.GradientTape() as tape:
        for step in range(step_per_episode):
            x_probs, y_probs, z_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Sample each action component from its own categorical head
            x = np.random.choice(np.arange(max_x), p=np.squeeze(x_probs)[:max_x])
            y = np.random.choice(np.arange(max_y), p=np.squeeze(y_probs)[:max_y])
            z = np.random.choice(np.arange(2), p=np.squeeze(z_probs)[:2])

            # this is the line I am asking about: joint log-probability of (x, y, z)
            action_probs_history.append(
                tf.math.log(x_probs[0, x] * y_probs[0, y] * z_probs[0, z])
            )

            state, reward, done = env.step(x, y, z)
            rewards_history.append(reward)
            episode_reward += reward
            if done:
                break

        # Update the smoothed reward used to track training progress
        running_reward = running_reward * (1 - episode_weight) + episode_reward * episode_weight

        # Calculate expected value from rewards (discounted returns)
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            diff = ret - value  # advantage estimate
            actor_losses.append(-log_prob * diff)
            critic_losses.append(
                huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
            )

        # Backpropagation
        loss_value = sum(actor_losses) + sum(critic_losses)

    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    # Clear the histories for the next episode
    action_probs_history.clear()
    critic_value_history.clear()
    rewards_history.clear()
.
.
.
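About the action_probs_history line I marked above: my assumption is that the joint log-probability of the factored action (x, y, z) is just the sum of the per-head log-probabilities, since the three heads are treated as independent given the state. A tiny standalone check of what I mean (the numbers are hypothetical, this is not my training code):

import tensorflow as tf

# Hypothetical per-head probabilities for one sampled action (x, y, z)
px, py, pz = 0.2, 0.5, 0.9

# Joint log-probability of the factored action, two equivalent ways
joint = tf.math.log(px * py * pz)
summed = tf.math.log(px) + tf.math.log(py) + tf.math.log(pz)

print(float(joint), float(summed))  # both are about -2.41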
I need help; any comments are welcome. Thanks.