
python - Q-learning model not improving


I am trying to solve the CartPole problem in OpenAI Gym with Q-learning. I think I am misunderstanding how Q-learning works, because my model is not improving.

I use a dictionary as my Q-table. I "hash" each observation (turn it into a string) and use that string as a key in the table.

Each key (observation) in the table maps to another dictionary, in which I store every action taken in that state together with its associated Q-value.

With that said, an entry in my table might look like this:

"['0.102', '1.021', '-0.133', '-1.574']":
    0: 0.1

So in the state (observation) "['0.102', '1.021', '-0.133', '-1.574']", action 0 has been recorded with a Q-value of 0.1.
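In other words, the whole table is just a nested Python dictionary. A minimal sketch of one populated entry, with purely illustrative numbers, would be:

# Illustrative only: one entry of the nested-dictionary Q-table described above.
q_table = {
    "['0.102', '1.021', '-0.133', '-1.574']": {   # stringified observation
        0: 0.1,                                    # action 0 -> Q-value 0.1
    },
}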

Is there something wrong with my logic? I really cannot figure out where my implementation goes wrong.

import gym
import random
import numpy as np

ENV = 'CartPole-v0'

env = gym.make(ENV)


class Qtable:
    def __init__(self):
        self.table = {}

    def update_table(self, obs, action, value):
        obs_hash = self.hash_obs(obs)

        # Update table with new observation
        if obs_hash not in self.table:
            self.table[obs_hash] = {}
            self.table[obs_hash][action] = value
        else:
            # Check if the action has been recorded.
            # If so, only keep the new value when it is better.
            # If not, record the new action for this obs.
            if action in self.table[obs_hash]:
                if value > self.table[obs_hash][action]:
                    self.table[obs_hash][action] = value
            else:
                self.table[obs_hash][action] = value

    def get_prev_value(self, obs, action):
        obs_hash = self.hash_obs(obs)
        if obs_hash in self.table:
            if action in self.table[obs_hash]:
                return self.table[obs_hash][action]
        return 0

    def get_max_value(self, obs):
        obs_hash = self.hash_obs(obs)
        if obs_hash in self.table:
            key = max(self.table[obs_hash])
            return self.table[obs_hash][key]
        return 0

    def has_action(self, obs):
        obs_hash = self.hash_obs(obs)
        if obs_hash in self.table:
            if len(self.table[obs_hash]) > 0:
                return True
        return False

    def get_best_action(self, obs):
        obs_hash = self.hash_obs(obs)
        if obs_hash in self.table:
            return max(self.table[obs_hash])

    # Makes a hashable entry of the observation
    def hash_obs(self, obs):
        return str(['{:.3f}'.format(i) for i in obs])


def play():

    q_table = Qtable()

    # Hyperparameters
    alpha = 0.1
    gamma = 0.6
    epsilon = 0.1
    episodes = 1000

    total = 0

    for i in range(episodes):

        done = False
        prev_obs = env.reset()
        episode_reward = 0

        while not done:

            if random.uniform(0, 1) > epsilon and q_table.has_action(prev_obs):
                # Exploit learned values
                action = q_table.get_best_action(prev_obs)
            else:
                # Explore action space
                action = env.action_space.sample()

            # Render the environment
            # env.render()

            # Take a step
            obs, reward, done, info = env.step(action)

            if done:
                reward = -200

            episode_reward += reward

            old_value = q_table.get_prev_value(prev_obs, action)
            next_max = q_table.get_max_value(obs)

            # Q-learning update for the current state value
            new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

            q_table.update_table(obs, action, new_value)

            prev_obs = obs

        total += episode_reward

    print("average", total / episodes)
    env.close()


play()

Best answer

I think I figured it out. I had misunderstood this part: new_value = (1-alpha)*old_value + alpha*(reward + gamma*next_max)

Here next_max is the next state's "best move", and not (as it should be) the maximum value of that subtree: max() over the inner dictionary compares action keys rather than stored Q-values, so get_max_value does not actually return the highest Q-value of the next state.

So implementing the Q-table as a HashMap may not have been a good idea.
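To illustrate that point, here is a minimal sketch of dictionary-based lookups that take the maximum over the stored Q-values rather than over the action keys. The function names mirror the ones in the question, but this is only an illustrative sketch, not the poster's fixed code:

# Sketch only: corrected lookups for a dict-of-dicts Q-table, where
# table[obs_hash] maps action -> Q-value (names mirror the question's code).

def get_max_value(table, obs_hash):
    # max_a Q(s, a): maximum over the stored Q-values, not over the action keys
    actions = table.get(obs_hash)
    if actions:
        return max(actions.values())
    return 0

def get_best_action(table, obs_hash):
    # argmax_a Q(s, a): the action whose stored Q-value is largest
    actions = table.get(obs_hash)
    if actions:
        return max(actions, key=actions.get)
    return None

With get_max_value defined this way, the next_max term in new_value = (1-alpha)*old_value + alpha*(reward + gamma*next_max) really is the maximum over the actions recorded for the next state.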

Regarding "python - Q-learning model not improving", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54708749/
