
matlab - Free-energy reinforcement learning implementation


I have been trying to implement the algorithm described here, and then test it on the "large action task" described in the same paper.

Algorithm overview:

[figure: the algorithm, as presented in the paper]

In short, the algorithm solves reinforcement-learning problems with an RBM of the form shown below by changing its weights so that the free energy of a network configuration equals the reward signal given for that state-action pair.
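For reference, the free energy of a configuration with binary hidden units is the standard RBM quantity; in MATLAB, using the variable names of my code further below and ignoring temperature, it looks roughly like this (a sketch, not the paper's exact definition):

% Free energy of one state-action configuration sa (row vector), standard
% binary-hidden-unit RBM formula; the value estimate is then Q(s,a) = -F(s,a).
hidinput = sa*vishid + hidbiases;                      % total input to each hidden unit
F = -(sa*visbiases') - sum(log(1 + exp(hidinput)));    % F(s,a)
Q = -F;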

To choose an action, the algorithm performs Gibbs sampling while holding the state variables fixed. Given enough time, this produces the action with the lowest free energy, and therefore the highest reward for the given state.
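A simplified sketch of that sampling procedure, with the state bits clamped and only the hidden units and action bits resampled (my own illustration using the variable names from the code below, not the authors' exact routine):

% s is the fixed state (1 x numdims); the action starts out random.
a = rand(1, numactiondims) > 0.5;
for k = 1:cdsteps                                 % e.g. 100 Gibbs iterations, as in the guideline
    x = [s a];
    hprobs = 1 ./ (1 + exp(-(x*vishid + hidbiases)/temp));           % P(h | s, a)
    h = hprobs > rand(size(hprobs));
    aprobs = 1 ./ (1 + exp(-(h*vishid(numdims+1:end,:)' + ...
                             visbiases(numdims+1:end))/temp));        % P(a | h)
    a = aprobs > rand(size(aprobs));              % resample only the action bits
end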

Large action task overview:

[figure: the large action task, as presented in the paper]

Summary of the authors' implementation guidelines:

A restricted Boltzmann machine with 13 hidden variables was trained on an instantiation of the large action task with an 12-bit state space and a 40-bit action space. Thirteen key states were randomly selected. The network was run for 12 000 actions with a learning rate going from 0.1 to 0.01 and temperature going from 1.0 to 0.1 exponentially over the course of training. Each iteration was initialized with a random state. Each action selection consisted of 100 iterations of Gibbs sampling.
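The exponential schedules, at least, are concrete. My reading of that sentence in code (iter runs from 1 to 12,000; the variable names are mine, not the authors'):

% Exponential annealing of the learning rate (0.1 -> 0.01) and the temperature
% (1.0 -> 0.1) over numiters = 12000 actions; my interpretation of the guideline.
frac     = (iter - 1) / (numiters - 1);
epsilonw = 0.1 * (0.01/0.1)^frac;   % learning rate
temp     = 1.0 * (0.1/1.0)^frac;    % temperature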

Important details that were omitted:

  • Are bias units needed?
  • Is weight decay needed? If so, L1 or L2?
  • Is a sparsity constraint needed on the weights and/or activations?
  • Were there modifications to gradient descent (e.g. momentum)?
  • What meta-parameters do these additional mechanisms require?

My implementation:

I initially assumed the authors were using none of the mechanisms beyond those described in the guidelines, so I tried training the network without bias units. This led to near-chance performance, and was my first clue that some of the mechanisms used must have been considered "obvious" by the authors and therefore omitted.

I experimented with the various omitted mechanisms mentioned above and got my best results by using the following (a sketch of the corresponding meta-parameter values follows the list):

  • softmax hidden units
  • a momentum of 0.9 (0.5 until iteration 5)
  • bias units for the hidden and visible layers
  • a learning rate 1/100th of the one listed by the authors
  • L2 weight decay of 0.0002
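For concreteness, these choices correspond to meta-parameter values along these lines in the code below (illustrative values only, not a claim about the authors' settings):

% Hypothetical meta-parameter values matching the list above (the names are the
% ones used in the per-iteration code below).
epsilonw   = 0.001;    % 1/100 of the authors' initial 0.1, annealed as sketched earlier
epsilonvb  = 0.001;    % visible-bias learning rate
epsilonhb  = 0.001;    % hidden-bias learning rate
weightcost = 0.0002;   % L2 weight decay
% momentum is set inside the loop: 0.5 for the first 5 iterations, 0.9 afterwards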

But even with all of these modifications, my performance on the task after 12,000 iterations typically hovered around an average reward of 28.

Code for each iteration:

%%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Clamp the current state and append a random initial action, then compute the
% hidden-unit probabilities for this state-action configuration.
data = [batchdata(:,:,(batch)) rand(1,numactiondims) > .5];
poshidprobs = softmax(data*vishid + hidbiases);
%%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

hidstates = softmax_sample(poshidprobs);

%%%%%%%%% START ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Gibbs sampling with the state clamped; temperature 0 (greedy) at test time.
if test
    [negaction, poshidprobs] = choose_factored_action(data(1:numdims), hidstates, vishid, hidbiases, visbiases, cdsteps, 0);
else
    [negaction, poshidprobs] = choose_factored_action(data(1:numdims), hidstates, vishid, hidbiases, visbiases, cdsteps, temp);
end

% Overwrite the action part of the configuration with the sampled action bits.
data(numdims+1:end) = negaction > rand(numcases,numactiondims);

if mod(batch,100) == 1
    disp(poshidprobs);
    disp(min(~xor(repmat(correct_action(:,(batch)),1,size(key_actions,2)), key_actions(:,:))));
end

% Statistics used in the weight and bias updates below.
posprods  = data' * poshidprobs;
poshidact = poshidprobs;
posvisact = data;
%%%%%%%%% END OF ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

if batch > 5
    momentum = .9;
else
    momentum = .5;
end

%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Free energy of the chosen state-action configuration; Q is its negation.
F = calcF_softmax2(data,vishid,hidbiases,visbiases,temp);
Q = -F;

% Reward is maximal when the chosen action matches the correct action bit-for-bit.
action = data(numdims+1:end);
reward = maxreward - sum(abs(correct_action(:,(batch))' - action));
% Track rewards separately for the two key actions (vector comparison: true iff all bits match).
if correct_action(:,(batch)) == correct_action(:,1)
    reward_dataA = [reward_dataA reward];
    Q_A = [Q_A Q];
else
    reward_dataB = [reward_dataB reward];
    Q_B = [Q_B Q];
end

% Error between the obtained reward and the current Q (= -F) estimate.
reward_error = sum(reward - Q);
rewardsum = rewardsum + reward;
errsum = errsum + abs(reward_error);
error_data(ind) = reward_error;
reward_data(ind) = reward;
Q_data(ind) = Q;

% Reward-error-weighted Hebbian updates with momentum and L2 weight decay.
vishidinc = momentum*vishidinc + ...
    epsilonw*( (posprods*reward_error)/numcases - weightcost*vishid);
visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*((posvisact)*reward_error - weightcost*visbiases);
hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*((poshidact)*reward_error - weightcost*hidbiases);

vishid    = vishid    + vishidinc;
hidbiases = hidbiases + hidbiasinc;
visbiases = visbiases + visbiasinc;
%%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
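The helpers softmax and softmax_sample are not shown above; minimal versions consistent with how they are called would look like this (calcF_softmax2 and choose_factored_action are essentially the free-energy and clamped-sampling routines sketched earlier):

function p = softmax(x)
% Softmax over the elements of a row vector x.
    ex = exp(x - max(x));    % subtract the max for numerical stability
    p  = ex / sum(ex);
end

function s = softmax_sample(p)
% Draw a one-hot sample from the discrete distribution p.
    k = find(rand < cumsum(p), 1);
    s = zeros(size(p));
    s(k) = 1;
end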

My request:

So, if any of you can get this algorithm working properly (the authors claim an average reward of ~40 after 12,000 iterations), I would be extremely grateful.

If my code appears to be doing something wrong, then calling attention to that would also make a great answer.

I'm hoping that whatever the authors omitted is indeed obvious to someone with more experience with energy-based learning than me, in which case simply point out what needs to be included in a working implementation.

Best Answer

  1. The algorithm in the paper looks strange. They use a kind of Hebbian learning that increases connection strengths, but no mechanism to decay them. Regular CD, in contrast, pushes up the energy of incorrect fantasies, which balances the overall activity. I would speculate that you need strong sparsity regularization and/or weight decay (see the sketch after this list).
  2. Biases never hurt :)
  3. Momentum and other fancy tricks may speed things up, but they usually aren't necessary.
  4. Why softmax on the hidden layer? Should it just be sigmoid?
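To make point 1 concrete, here is the contrast in update-rule form, written in the notation of the question's code (a sketch; negdata and neghidprobs stand for a model reconstruction and do not appear in the question's code):

% The paper's rule: only a reward-error-weighted Hebbian (positive-phase) term,
% so nothing pulls the weights back down unless you add decay yourself.
vishidinc = epsilonw * reward_error * (data' * poshidprobs) - weightcost * vishid;

% Ordinary CD-k, by contrast, subtracts a negative-phase ("fantasy") term that
% raises the energy of the model's own reconstructions and balances the growth.
vishidinc = epsilonw * (data' * poshidprobs - negdata' * neghidprobs);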

Regarding "matlab - Free-energy reinforcement learning implementation", the corresponding question can be found on Stack Overflow: https://stackoverflow.com/questions/10827089/
