辅导 data程序、讲解 python设计编程
How to Train your Bee
Assignment 2 Help Guide
CALCULATE AN OPTIMAL POLICY FOR
© Dreamworks, ”Bee Movie”Recap: Bellman Equation
• Bellman Equation is used to calculate the optimal value of a state
• Equation looks complicated, but it’s just the highest expected reward from the best action:
• is the probability of entering the next state !
given we perform action in state
• ! is the reward received for performing action in state and entering the next state !
• is the discount factor (makes immediate rewards worth more than future rewards)
• is the value of the next state !
• The optimal policy, , is the best available action that can be performed in each state
• The value of a state is given by the highest expected reward when following the optimal
policy
a
Assignment 2 Help GuideR
)Recap: Policy Iteration
1. Arbitrarily assign a policy to each state (e.g.
action to be performed in every state is LEFT)
2. Until convergence
• Policy Evaluation: Determine the value of every state based
on the current policy
• Policy Improvement: Determine the best action to be
performed in every state based on the values of the current
policy, then update the policy based on the new best
action
• Convergence occurs if the policy between two
iterations does not change
• Tutorial 7 involves implementing Policy Iteration in
the simple Gridworld environment
Assignment 2 Help GuideComputing State Space
• Both Value Iteration and Policy Iteration require us to loop through every state
• Value Iteration: to determine the current best policy and the value of a state
• Policy Iteration: to determine the new best policy based on the current value of a state
• We need a way to compute every state so we can loop through them all
• One way is to get every combination of states possible
• In BeeBot, for a given level, each state is given by the position and orientation of the bee, the position and
orientation of the widgets
• We can determine the list of all states by computing every combination of bee position and orientation, widget
position, and widget orientation
• However, this might include some invalid combinations (e.g., widget or bee are inside a wall)
• Is there a better way we can search for possible states?
Assignment 2 Help GuideTransition Outcomes
• The probabilistic aspect of this assignment means that by performing certain actions in
certain states, there might be multiple next states that can be reached each with a different
probability and reward
• The assignment involves creating a get_transition_outcomes(state, action)
function
• Takes a (state, action) pair and for every possible next state returns the probability of ending up in that state
and the reward
• Should return a list or other data structure with all the next states, probabilities, and rewards
• This will be useful when utilising the Bellman Equation to calculate the value of a state
• When creating the transition function, there are a few things to consider:
• What are the random aspects of the BeeBot environment?
• What are the possible next states to consider from a given state?
• What are edge cases that need to be considered (e.g. moving near a wall or thorn)?
Assignment 2 Help GuideTransition Outcomes
• The transition function will usually assume a given action is valid, so we need to only feed it
actions that are valid for given states to avoid any odd behaviour
• We can cache if actions are valid to improve runtime
• perform_action(state, action)
• Might help you understand how to determine the possible next states for certain states and actions
• However, note that it only returns one possible state for a given action
• We can cache the results of the transition function to improve runtime
• Tutorial 6 involves creating a transition function for the simple Gridworld environment
Assignment 2 Help GuideTerminal States
• We need to create a way to handle terminal states when calculating the values and optimal
policies of states, otherwise the agent might think it can leave the terminal states
• There are two ways we can model the terminal states to do this
• Terminal states
• Set the value of a terminal state to 0
• Skip over it without updating its value if it is encountered in a loop
• Absorbing states
• Create a new state outside of the regular state space to send the agent to once it reaches a terminal state
• If the player performs any action in the absorbing state, it remains in the absorbing state
• The reward is always 0 for the absorbing state, no matter the action performed
Assignment 2 Help GuideReward Function
• Rewards are not dependent on the resulting state, but on the state the agent is in and the
action it performs
• Reward functions considered up until this point have been R(s), solely based on the state
the agent is in
• For BeeBot, the expected reward function is R(s, a) – actions also give rewards that need to
be considered, as well as any possible penalties
• We can use get_transition_outcomes(state, action) to get the rewards:
• We can start by initialising a matrix of all zeroes size |S| x |A|
• Then, loop over each (state, action) pair and initialise the total expected reward to 0
• Loop over the outcomes from get_transition_outcomes(state, action) and add the (probability x
reward) to compute the expected reward over all outcomes
• i.e. , = = ∑#! ! , ) ⋅ ( , , !
)
Assignment 2 Help GuideValue Iteration: Updating State Values
• How we choose to update states can affect the performance of our value iteration
• Batch updates uses the value of the next state from the previous iteration to update the
value of the current state in the current iteration
• In-place updates uses the value of the next state from the current iteration, if it has already
been calculated, to update the value of the current state in the current iteration
• If the next state has not yet been calculated in the current iteration, it uses the value from the previous iteration
• In-place updates typically converges in fewer iterations
• The order in which states are calculated also has an effect for in-place updates (i.e. starting
near the goal and working backwards may enable faster convergence)
Assignment 2 Help GuidePolicy Iteration: Linear Algebra
• The numpy library is allowed for this assignment, as well as built-in python libraries
• We can use linear algebra to compute the value of states for the policy evaluation step of
Policy Iteration
• is the identity matrix of size |S| x |S|
• $ is a matrix containing the transition probabilities based on the current policy
• is a vector containing rewards for every state based on the current policy
• numpy can be used to perform linear algebra
• vπ = np.linalg.solve(I − γPπ, r)
Assignment 2 Help GuideImproving Runtime
• We need to compute the state space to calculate the value of every state
• Calculating the value of every state can be time consuming, especially for large levels
• We can remove unreachable states from the state space and only consider states the agent can reach
• This can improve runtime as we are reducing the number of states to calculate the value of
• Remember to use caching where possible
• If you are repeatedly computing something, caching can drastically improve runtime
• Remember to limit use of inefficient data structures
• If you're checking whether an element is in a list (e.g. if next state is in the states list), either modify your code
to remove the need for this, or use a data structure with more efficient lookup (e.g. a set or dictionary)
Assignment 2 Help Guide“Flying is exhausting. Why don't you
humans just run everywhere, isn't
that faster?”
- Barry B. Benson
You can do it!
13
© Dreamworks, ”Bee Movie”
Assignment 2 Help Guide