I am trying to run reinforcement learning with actor-critic methods
Link: https://github.com/keras-team/keras-io/blob/master/examples/rl/actor_critic_cartpole.py
The agent takes as input a map of size max_x * max_y, randomly filled with 1s or 0s.
The agent then updates the map by returning, at each step, probability distributions over x, y, and z, where x and y are positions on the map and z is 0 or 1; the sampled (x, y, z) action is applied to the map.
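Roughly, my model is a shared dense trunk with three softmax heads (one each for x, y, and z) plus a critic head, along the lines of the sketch below. This is a simplified stand-in rather than my exact architecture; the map size and the hidden width num_hidden here are placeholders.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

max_x, max_y = 8, 8     # placeholder map size
num_hidden = 128        # placeholder hidden width

# Flattened map of 0s and 1s as input
inputs = layers.Input(shape=(max_x * max_y,))
common = layers.Dense(num_hidden, activation="relu")(inputs)

# Three categorical heads, one per action component
x_head = layers.Dense(max_x, activation="softmax")(common)
y_head = layers.Dense(max_y, activation="softmax")(common)
z_head = layers.Dense(2, activation="softmax")(common)

# Critic head estimating the state value
critic = layers.Dense(1)(common)

model = keras.Model(inputs=inputs, outputs=[x_head, y_head, z_head, critic])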
Each step is trained to make the map equal to target_map, which is initialized at the beginning of training. An action is rewarded +1 if it changes the map and brings it closer to target_map, 0 if it fails to change the map, and -1 if it changes the map incorrectly.
The episode ends when cur_map matches target_map or when the agent receives a -1 reward.
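Concretely, my environment's step logic behaves roughly like the sketch below (simplified; the function signature and names here are illustrative, not my actual env.step):

import numpy as np

def step(cur_map, target_map, x, y, z):
    # Writing the value the cell already has does not change the map: reward 0
    if cur_map[x, y] == z:
        return cur_map, 0, False
    # Writing the value target_map expects brings the map closer: reward +1
    if target_map[x, y] == z:
        cur_map[x, y] = z
        done = bool(np.array_equal(cur_map, target_map))
        return cur_map, 1, done
    # Any other change is incorrect: reward -1 and the episode ends
    cur_map[x, y] = z
    return cur_map, -1, True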
So I have a few questions.
Is this a reasonable reward scheme to train with?
To calculate the loss, I am using the probabilities of x, y, and z as in the code below. Is there a problem with this?
while True:  # run until solved
    state = env.reset()
    episode_reward = 0
    with tf.GradientTape() as tape:
        for step in range(step_per_episode):
            x_probs, y_probs, z_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Sample each action component from its own categorical head
            x = np.random.choice(np.arange(max_x), p=np.squeeze(x_probs)[:max_x])
            y = np.random.choice(np.arange(max_y), p=np.squeeze(y_probs)[:max_y])
            z = np.random.choice(np.arange(2), p=np.squeeze(z_probs)[:2])

            # this is the line I am asking about: joint log-probability of (x, y, z)
            action_probs_history.append(
                tf.math.log(x_probs[0, x] * y_probs[0, y] * z_probs[0, z])
            )

            state, reward, done = env.step(x, y, z)
            rewards_history.append(reward)
            episode_reward += reward
            if done:
                break

        # Update the smoothed reward used to track training progress
        running_reward = running_reward * (1 - episode_weight) + episode_reward * episode_weight

        # Calculate expected value from rewards (discounted returns)
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            diff = ret - value  # advantage estimate
            actor_losses.append(-log_prob * diff)
            critic_losses.append(
                huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
            )

        # Backpropagation
        loss_value = sum(actor_losses) + sum(critic_losses)

    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    # Clear the histories for the next episode
    action_probs_history.clear()
    critic_value_history.clear()
    rewards_history.clear()
.
.
.
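About the action_probs_history line I marked above: my assumption is that the joint log-probability of the factored action (x, y, z) is just the sum of the per-head log-probabilities, since the three heads are treated as independent given the state. A tiny standalone check of what I mean (the numbers are hypothetical, this is not my training code):

import tensorflow as tf

# Hypothetical per-head probabilities for one sampled action (x, y, z)
px, py, pz = 0.2, 0.5, 0.9

# Joint log-probability of the factored action, two equivalent ways
joint = tf.math.log(px * py * pz)
summed = tf.math.log(px) + tf.math.log(py) + tf.math.log(pz)

print(float(joint), float(summed))  # both are about -2.41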
I need help; any comments are welcome. Thanks.