
python - Why does my dictionary of values require a shallow copy to update correctly?


I am working with an Agent class in Python 2.7.11 that uses a Markov Decision Process (MDP) to search for an optimal policy π in a GridWorld. I am implementing basic value iteration for 100 iterations over all GridWorld states, using the following Bellman equation:

Vk+1(s) = max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ·Vk(s') ]

  • T(s,a,s') is the probability of successfully transitioning from the current state s to the successor state s' when taking action a.
  • R(s,a,s') is the reward for transitioning from s to s'.
  • γ (gamma) is the discount factor, where 0 ≤ γ ≤ 1.
  • Vk(s') is a recursive call that repeats the computation once s' has been reached.
  • Vk+1(s) denotes that after enough iterations k have occurred, the iterated value Vk converges and equals Vk+1.

This equation is derived by taking the maximum of the Q-value function, which is the function I use in my program:

Qk+1(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ·Vk(s') ]

When my Agent is constructed, it is passed an MDP, which is an abstract class containing the following methods:

# Returns all states in the GridWorld
def getStates()

# Returns all legal actions the agent can take given the current state
def getPossibleActions(state)

# Returns all possible successor states to transition to from the current state
# given an action, and the probability of reaching each with that action
def getTransitionStatesAndProbs(state, action)

# Returns the reward of going from the current state to the successor state
def getReward(state, action, nextState)
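For concreteness, here is a minimal sketch of a tiny concrete MDP exposing these four methods. The TwoStateMDP class, its states and its rewards are invented purely for illustration and are not part of the original GridWorld framework:

class TwoStateMDP(object):
    """A made-up two-state chain: A -> B -> TERMINAL (illustrative only)."""

    def getStates(self):
        return ['A', 'B', 'TERMINAL']

    def getPossibleActions(self, state):
        # No actions are available once the terminal state is reached
        return [] if state == 'TERMINAL' else ['go', 'stay']

    def getTransitionStatesAndProbs(self, state, action):
        if action == 'stay':
            return [(state, 1.0)]                # stay put with certainty
        if state == 'A':
            return [('B', 0.8), ('A', 0.2)]      # 'go' succeeds 80% of the time
        return [('TERMINAL', 1.0)]               # from B, 'go' always terminates

    def getReward(self, state, action, nextState):
        # Reward is only collected when the episode ends
        return 1.0 if nextState == 'TERMINAL' else 0.0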

My Agent is also passed the discount factor and the number of iterations. I use a dictionary to keep track of my values. Here is my code:

class IterationAgent:

    def __init__(self, mdp, discount = 0.9, iterations = 100):
        self.mdp = mdp
        self.discount = discount
        self.iterations = iterations
        self.values = util.Counter() # A Counter is a dictionary with default 0

        for transition in range(0, self.iterations, 1):
            states = self.mdp.getStates()
            valuesCopy = self.values.copy()
            for state in states:
                legalMoves = self.mdp.getPossibleActions(state)
                convergedValue = 0
                for move in legalMoves:
                    value = self.computeQValueFromValues(state, move)
                    if convergedValue <= value or convergedValue == 0:
                        convergedValue = value

                valuesCopy.update({state: convergedValue})

            self.values = valuesCopy

    def computeQValueFromValues(self, state, action):
        successors = self.mdp.getTransitionStatesAndProbs(state, action)
        reward = self.mdp.getReward(state, action, successors)
        qValue = 0
        for successor, probability in successors:
            # The Q value equation: Q*(a,s) = T(s,a,s')[R(s,a,s') + gamma(V*(s'))]
            qValue += probability * (reward + (self.discount * self.values[successor]))
        return qValue
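For reference, a hedged usage example (the TwoStateMDP sketch above stands in for a real GridWorld MDP, and util.Counter comes from the asker's course framework, behaving like a dict that defaults to 0):

agent = IterationAgent(TwoStateMDP(), discount=0.9, iterations=100)
print(agent.values)   # a util.Counter mapping each state to its converged value estimate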

This implementation is correct, although I am not sure why I need valuesCopy in order to successfully update my self.values dictionary. I tried the following to omit the copy, but it does not work, since it returns slightly incorrect values:

for i in range(0, self.iterations, 1):
    states = self.mdp.getStates()
    for state in states:
        legalMoves = self.mdp.getPossibleActions(state)
        convergedValue = 0
        for move in legalMoves:
            value = self.computeQValueFromValues(state, move)
            if convergedValue <= value or convergedValue == 0:
                convergedValue = value

        self.values.update({state: convergedValue})

My question is: why is making a copy of my self.values dictionary (via valuesCopy = self.values.copy()) necessary for my value dictionary to update correctly, given that the copy replaces it on every iteration anyway? Shouldn't updating the original produce the same values in a single pass?

Best Answer

Having the copy or not having it makes an algorithmic difference:

# You update your copy here, so the original will be used unchanged, which is not the 
# case if you don't have the copy
valuesCopy.update({state: convergedValue})

# If you have the copy, you'll be using the old value stored in self.value here,
# not the updated one
qValue += probability * (reward + (self.discount * self.values[successor]))
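To make the difference concrete, here is a minimal sketch contrasting the two update schemes (a plain dict stands in for util.Counter, and q_value is a hypothetical helper that backs up one state against a given value table):

def batch_sweep(values, states, q_value):
    # "Batch" (Jacobi-style) sweep: every state is backed up against the
    # values from the previous iteration, so update order does not matter.
    new_values = dict(values)                 # the shallow copy
    for s in states:
        new_values[s] = q_value(s, values)    # reads only the old values
    return new_values

def in_place_sweep(values, states, q_value):
    # In-place (Gauss-Seidel-style) sweep: states updated later in the same
    # sweep already see the new values written earlier, so the result
    # depends on the order in which states are visited.
    for s in states:
        values[s] = q_value(s, values)        # reads a mix of old and new values
    return values

Both schemes eventually converge to the same fixed point as the number of iterations grows, but after a fixed number of sweeps (100 here) they generally produce slightly different numbers, which is exactly the small discrepancy observed when the copy is omitted.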

Regarding "python - Why does my dictionary of values require a shallow copy to update correctly?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36369911/
