讲解 ISE 537 Homework 5辅导 Python语言

ISE 537

Homework 5

(100 pts in total)

Due April 30, 2024 by 10:00 pm.

We expect you to submit two Python Notebooks for this assignment.

1. (Value Iteration Algorithm for Student Dilemma.) Let us consider the so-called Student Dilemma. It is a Markov decision process with 7 states, among which states 5 , 6, 7 are terminal states (observe that there are no arrows going out from these states). All the transitions are Markov and are labeled on the picture. For example, in state 1, we can choose to either work or rest. If we rest, then with probability 0.5 we will transition to state 2, and with probability 0.5 we will stay in state 1. The rewards marked in red are the rewards we can get as long as we stay in the state. For example, we will collect reward 0 as long as we are in state 1, no matter if we choose to rest or work. Consider an infinite horizon decision-making problem (without discount, i.e. γ = 1) until reaching one of the terminal states. The goal is to find the policy that maximizes the expected sum of rewards before achieving a terminal state.

Recall that the Bellman equation

(a) The optimal value function of states 1, 2, ··· , 7 satisfy:

Question: Please explain how to fill in the blanks?

(b) We also covered in class the Value Iteration Algorithm to calculate the numerical solution of (1):

i. Initialization at iteration 0: Let V(0)(s) be any function V (0) : S → R

ii. Given V (i) at iteration i, we compute:

iii. Terminate when V (i) stops improving; e.g., when maxs|V(i+1)(s) — V(i)(s)| is small

iv. Return the greedy policy π(K)(s) = arg maxa [r(s, a) + γEs~P(s,a,s′ )V(K)(s′ )]

Question: Please implement the algorithm to solve (1) (ideally with a Python note- book). Stop when maxs|V(i+1)(s) — V(i)(s)| drops below 10−2 .

i. How many iterations does it take to reach this error level?

ii. Please report your estimated values of V (1), ··· , V (7).

iii. Describe the out-put greedy policy.

2. (Q Learning for Cab Driving.) We use Q learning to train a smart cab driver (Smartcab). The Smartcab’s job is to pick up passenger(s) at one location and drop them off in another. Here are a few things that we’d love our Smartcab to take care of:

• Drop off the passenger to the right location;

• Save passenger’s time by taking minimum time possible to drop off;

• Take care of passenger’s safety and traffic rules.

There are different aspects that need to be considered here while modeling an RL solution to this problem: rewards, states, and actions. Please read the Ipython notebook “Q Learning for Cab Driving.ipynb” for more detailed demonstration and explanations.

(a) Implement the Q-learning update at the indicated place in the Ipython notebook “Q Learning for Cab Driving.ipynb”. You should follow the following equation to implement your code: