I am using RLlib's PPOTrainer with a custom environment. I call trainer.train() twice: the first call completes successfully, but when I run it a second time it crashes with this error:
lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call (pid=15248) raise type(e)(node_def, op, message) (pid=15248)
tensorflow.python.framework.errors_impl.InvalidArgumentError:
Received a label value of 5 which is outside the valid range of [0, 5). Label values: 5 5
(pid=15248) [[node default_policy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at /tensorflow_core/python/framework/ops.py:1751) ]]
Here is my code:
main.py
import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.models import ModelCatalog

from MyEnv import MyEnv

# TreeObsPreprocessor is defined elsewhere in the project
ModelCatalog.register_custom_preprocessor("tree_obs_prep", TreeObsPreprocessor)
ray.init()
trainer = PPOTrainer(env=MyEnv, config={
    "train_batch_size": 4000,
    "model": {
        "custom_preprocessor": "tree_obs_prep"
    }
})

for i in range(2):
    print(trainer.train())
MyEnv.py
import gym
import numpy as np
from ray import rllib

class MyEnv(rllib.env.MultiAgentEnv):
    def __init__(self, env_config):
        self.n_agents = 2
        self.env = *CREATES ENV*
        self.action_space = gym.spaces.Discrete(5)
        self.observation_space = np.zeros((1, 12))

    def reset(self):
        self.agents_done = []
        obs = self.env.reset()
        return obs[0]

    def step(self, action_dict):
        obs, rewards, dones, infos = self.env.step(action_dict)
        d = dict()
        r = dict()
        o = dict()
        i = dict()
        for i_agent in range(len(self.env.agents)):
            if i_agent not in self.agents_done:
                o[i_agent] = obs[i_agent]
                r[i_agent] = rewards[i_agent]
                d[i_agent] = dones[i_agent]
                i[i_agent] = infos[i_agent]
        d['__all__'] = dones['__all__']
        for agent, done in dones.items():
            if done and agent != '__all__':
                self.agents_done.append(agent)
        return o, r, d, i
I don't know where the problem is. Any suggestions? What does this error mean?
Best answer
This comment was very helpful to me:
FWIW, I think such issues can happen if NaNs appear in the policy output. When that happens, you can get out of range errors.
Usually it's due to the observation or reward somehow becoming NaN, though it could be the policy diverging as well.
In my case, I had to fix my observations: the agent was unable to learn a policy, and at some point during training (at a random timestep) the actions it returned were NaN.
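In line with the comment quoted above, a cheap way to catch this early is to check observations and rewards for non-finite values before they leave the environment's step() method. Below is a minimal sketch; the helper name `sanitize` and the use of `np.nan_to_num` are my own suggestion, not part of the original code:

```python
import numpy as np

def sanitize(x):
    """Warn about NaN/Inf in an observation or reward and replace
    them with finite values so training does not silently diverge."""
    arr = np.asarray(x, dtype=np.float64)
    if not np.all(np.isfinite(arr)):
        # Raise here instead if you prefer to stop at the failing timestep.
        print("warning: non-finite values detected:", arr)
    # NaN -> 0.0, +/-Inf -> large finite numbers
    return np.nan_to_num(arr)
```

For example, inside step() one could assign `o[i_agent] = sanitize(obs[i_agent])` and `r[i_agent] = sanitize(rewards[i_agent])`, which makes the first NaN show up in the logs at the exact timestep it appears instead of surfacing later as an out-of-range label in the loss.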
Regarding python - RLLib - Tensorflow - InvalidArgumentError: Received a label value of N which is outside the valid range of [0, N), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59272939/