# **INTRODUCTION**
### *What is Dynamic Programming?*
Dynamic Programming is used whenever we can break down a problem into smaller components. We say that a problem has optimal substructure if the problem can be solved optimally by breaking it into sub-problems and then recursively finding the optimal solutions to the sub-problems.
* Dynamic: the problem has a sequential or temporal component, i.e. it unfolds step by step
* Programming: we are optimising a “program”, i.e. a policy
* c.f. linear programming
* A method for solving complex problems
* By breaking them down into subproblems
* Solve the subproblems
* Combine solutions to subproblems
### *Requirements for Dynamic Programming*
Dynamic Programming is a very general solution method for problems which have two properties:
* Optimal substructure
* Principle of optimality applies
* Optimal solution can be decomposed into subproblems
* Overlapping subproblems
* Subproblems recur many times
* Solutions can be cached and reused (see the sketch after this list)
* Markov decision processes satisfy both properties
* Bellman equation gives recursive decomposition
* Value function stores and reuses solutions
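As a quick, non-RL illustration of these two properties, consider computing Fibonacci numbers: the recurrence is the optimal substructure, and the repeated recursive calls are the overlapping subproblems that a cache avoids. A minimal sketch (the function name and cache choice are illustrative):

```python
from functools import lru_cache

@lru_cache(maxsize=None)          # cache and reuse solutions to overlapping subproblems
def fib(n: int) -> int:
    """The answer for n decomposes into the answers for n-1 and n-2 (optimal substructure)."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(50))                    # fast, because each subproblem is solved only once
```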
### *Planning by Dynamic Programming*
Dynamic programming assumes full knowledge of the MDP
It is used for planning in an MDP
1. For prediction:
We are given an MDP <S,A,P,R,γ> and policy π as input. We need to find the value function vπ associated with the policy.
2. For control:
We are given an MDP <S,A,P,R,γ> as input. We need to find the optimal value function v∗, from which the optimal policy π∗ follows. (One possible in-code representation of such an MDP is sketched below.)
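To make the algorithms below concrete, the code sketches in these notes assume one simple in-code representation of a finite MDP <S,A,P,R,γ>; the array names and the toy numbers are my own choice for illustration, not part of the lecture.

```python
import numpy as np

# A toy 2-state, 2-action MDP <S, A, P, R, gamma>, purely for illustration.
# P[s, a, s'] = probability of reaching s' after taking action a in state s
# R[s, a]     = expected immediate reward for taking action a in state s
n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions))

P[0, 0] = [0.9, 0.1]; R[0, 0] = 0.0    # action 0 in state 0 mostly stays put
P[0, 1] = [0.2, 0.8]; R[0, 1] = -1.0   # action 1 usually moves to state 1
P[1, 0] = [0.0, 1.0]; R[1, 0] = 1.0    # action 0 keeps us in state 1
P[1, 1] = [0.5, 0.5]; R[1, 1] = 0.5
gamma = 0.9
```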
# **POLICY EVALUATION**
### *Iterative Policy Evaluation*
In this problem, we want to estimate the value function for an MDP <S,A,P,R,γ> given some policy π. One way to solve the problem is to iteratively apply the Bellman expectation backup. The algorithm works in the following way:
* randomly initialize the values for the state-value function v. Let the starting point be v0 (a vector of state-values for all s∈S).
* for each iteration i, update vi(s) using the Bellman expectation equations and using the values from the previous iteration vi−1(s′), where s′ is the successor state to state s.
In other words, we use the state values of the successor states from the previous iteration to update the state value of the root state in the current iteration. This algorithm uses synchronous backups because we update the value function for all states at every iteration. The actual update formula is as follows:

vi(s) = Σ_a∈A π(a|s) ( R_s^a + γ Σ_s'∈S P_ss'^a vi−1(s') )
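A minimal sketch of this synchronous backup in code, assuming the P/R/gamma arrays introduced earlier and a policy given as a matrix pi[s, a] of action probabilities (the names and the stopping threshold are illustrative):

```python
import numpy as np

def policy_evaluation(P, R, gamma, pi, tol=1e-8, max_iter=10_000):
    """Iteratively apply the Bellman expectation backup for a fixed policy pi.

    P[s, a, s'] : transition probabilities, R[s, a] : expected rewards,
    pi[s, a]    : probability of taking action a in state s.
    """
    n_states = P.shape[0]
    v = np.zeros(n_states)                      # v0: arbitrary starting point
    for _ in range(max_iter):
        # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * v[s']
        q = R + gamma * P @ v
        v_new = np.sum(pi * q, axis=1)          # average over the policy
        if np.max(np.abs(v_new - v)) < tol:     # stop once a sweep barely changes v
            return v_new
        v = v_new
    return v
```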

### *Evaluating a Random Policy in the Small Gridworld*

* Undiscounted episodic MDP (γ = 1)
* Nonterminal states 1, ..., 14
* One terminal state (shown twice as shaded squares)
* Actions leading out of the grid leave state unchanged
* Reward is −1 until the terminal state is reached
* Agent follows the uniform random policy (evaluated in the sketch below)
π(n|·) = π(e|·) = π(s|·) = π(w|·) = 0.25
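To make this concrete, here is a sketch that builds the 4×4 gridworld model and evaluates the uniform random policy with the policy_evaluation function from the previous sketch; the state numbering (both shaded corners modelled as terminal cells) is an implementation choice.

```python
import numpy as np

# Reuses policy_evaluation from the sketch above.
N = 16                                        # cells 0..15; the two shaded corners are terminal
TERMINAL = {0, 15}
ACTIONS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # n, e, s, w

P = np.zeros((N, 4, N))
R = np.full((N, 4), -1.0)                     # reward is -1 on every transition...
for s in range(N):
    if s in TERMINAL:
        P[s, :, s] = 1.0                      # ...except from the terminal state, which absorbs
        R[s, :] = 0.0
        continue
    row, col = divmod(s, 4)
    for a, (dr, dc) in enumerate(ACTIONS):
        r2, c2 = row + dr, col + dc
        if not (0 <= r2 < 4 and 0 <= c2 < 4):
            r2, c2 = row, col                 # actions leading out of the grid leave the state unchanged
        P[s, a, r2 * 4 + c2] = 1.0

pi = np.full((N, 4), 0.25)                    # uniform random policy
v = policy_evaluation(P, R, gamma=1.0, pi=pi)
print(v.reshape(4, 4).round(1))
```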


# **POLICY ITERATION**
### *How to Improve a Policy*
* Given a policy π
* Evaluate the policy π
vπ(s) = Eπ[Rt+1 + γRt+2 + ... | St = s]
* Improve the policy by acting greedily with respect to vπ (see the sketch below)
π' = greedy(vπ)
* In the Small Gridworld the improved policy was already optimal, π' = π∗
* In general, need more iterations of improvement / evaluation
* But this process of policy iteration always converges to π∗
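A minimal sketch of the greedy improvement step π' = greedy(vπ) under the same array representation: a one-step lookahead followed by an argmax (names are illustrative).

```python
import numpy as np

def greedy(P, R, gamma, v):
    """Return a deterministic policy that acts greedily with respect to v."""
    q = R + gamma * P @ v            # q[s, a] via one-step lookahead
    best = np.argmax(q, axis=1)      # pick the maximising action in every state
    pi = np.zeros_like(R)
    pi[np.arange(len(best)), best] = 1.0
    return pi
```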
### *Policy Iteration*
In the policy iteration problem, given an MDP, we want to find the optimal policy π∗. The procedure is similar to the one from the previous section but consists of two steps. For each iteration i we do:
* update the state-value function vi under the current policy πi−1 using the policy evaluation algorithm
* update the policy by acting greedily with respect to vi:
πi = greedy(vi)

In the long run, the policy is guaranteed to converge to the optimal policy π∗ regardless of how the initial policy and state-value function are chosen. A sketch of the full loop is given below.
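Putting the two steps together, a minimal sketch of policy iteration, reusing the policy_evaluation and greedy helpers from the earlier sketches (all names are illustrative):

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iter=1_000):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from the uniform random policy
    for _ in range(max_iter):
        v = policy_evaluation(P, R, gamma, pi)              # evaluate the current policy
        pi_new = greedy(P, R, gamma, v)                     # improve by acting greedily w.r.t. v
        if np.array_equal(pi_new, pi):                      # stable policy => optimal policy
            return pi_new, v
        pi = pi_new
    return pi, v
```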
### *Jack’s Car Rental*
* States: Two locations, maximum of 20 cars at each
* Actions: Move up to 5 cars between locations overnight
* Reward: $10 for each car rented (must be available)
* Transitions: Cars returned and requested randomly
* Poisson distribution: n returns/requests with probability (λ^n / n!) e^−λ, as in the sketch below
* 1st location: average requests = 3, average returns = 3
* 2nd location: average requests = 4, average returns = 2
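As a quick check of the transition model, a sketch of the Poisson probabilities for requests and returns (here λ = 3, the average number of requests at the first location):

```python
from math import exp, factorial

def poisson(n, lam):
    """P(N = n) for a Poisson random variable with mean lam."""
    return lam ** n / factorial(n) * exp(-lam)

# Probability of seeing 0..5 rental requests at the first location (lambda = 3)
probs = [poisson(n, 3) for n in range(6)]
print([round(p, 3) for p in probs])
```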
### *Policy Iteration in Jack’s Car Rental*

*(Figure from the lecture: the sequence of greedy policies produced by policy iteration for this problem, together with the final value function.)*
### *Policy Improvement*
* Suppose that our current policy is deterministic, i.e. π(s) = a. In other words, in any state s the current policy chooses some action a with 100% certainty. Taking the greedy action in state s means choosing the action that maximises the action-value function in that state, i.e.

π'(s) = argmax_a∈A qπ(s, a)
* This improves the value from any state s over one step,

qπ(s, π'(s)) = max_a∈A qπ(s, a) ≥ qπ(s, π(s)) = vπ(s)
The equation above states that taking the greedy action in state s and then following policy π afterwards is at least as good as following policy π from the start. In other words, acting greedily for a single step and then returning to the original policy is at least as good as always acting according to the original policy.
* It therefore improves the value function, vπ'(s) ≥ vπ(s):

vπ(s) ≤ qπ(s, π'(s)) = Eπ'[Rt+1 + γ vπ(St+1) | St = s]
≤ Eπ'[Rt+1 + γ qπ(St+1, π'(St+1)) | St = s]
≤ Eπ'[Rt+1 + γ Rt+2 + γ^2 qπ(St+2, π'(St+2)) | St = s]
≤ Eπ'[Rt+1 + γ Rt+2 + ... | St = s] = vπ'(s)
The above tells us that, by following the greedy policy over the whole trajectory, we can expect to get at least as much reward as if we followed the original policy over the whole trajectory.
* If improvements stop,

qπ(s, π'(s)) = max_a∈A qπ(s, a) = qπ(s, π(s)) = vπ(s)
* Then the Bellman optimality equation has been satisfied:

vπ(s) = max_a∈A qπ(s, a)
* Therefore vπ(s) = v∗(s) for all s ∈ S
* so π is an optimal policy
### *Modified Policy Iteration*
* Does policy evaluation need to converge to vπ?
* Or should we introduce a stopping condition
* e.g. ε-convergence of value function
* Or simply stop after k iterations of iterative policy evaluation?
* For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy
* Why not update policy every iteration? i.e. stop after k = 1
* This is equivalent to value iteration (next section)
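A sketch of what such a truncated ("modified") policy evaluation could look like, with both stopping rules from the list above: at most k sweeps, or stop early on ε-convergence (names are illustrative).

```python
import numpy as np

def truncated_policy_evaluation(P, R, gamma, pi, v, k=3, eps=1e-6):
    """Run at most k sweeps of the Bellman expectation backup, stopping early on eps-convergence."""
    for _ in range(k):
        v_new = np.sum(pi * (R + gamma * P @ v), axis=1)
        if np.max(np.abs(v_new - v)) < eps:
            return v_new
        v = v_new
    return v
```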
# **VALUE ITERATION**
### *Principle of Optimality*
Any optimal policy can be decomposed into an optimal first action followed by an optimal policy from the successor state onwards. This means that if we know the optimal values of all successor states, we only need to find an optimal action over a single step and we are done.
Theorem (Principle of Optimality):
A policy π(a|s) achieves the optimal value from state s,
vπ(s) = v∗(s), if and only if
* For any state s' reachable from s
* π achieves the optimal value from state s' , vπ(s') = v∗(s')
### *Deterministic Value Iteration*
* If we know the solution to subproblems v∗(s')
* Then the solution v∗(s) can be found by one-step lookahead:

v∗(s) ← max_a∈A ( R_s^a + γ Σ_s'∈S P_ss'^a v∗(s') )
* The idea of value iteration is to apply these updates iteratively
* Intuition: start with final rewards and work backwards
* Still works with loopy, stochastic MDPs

### *Value Iteration*
The problem that the value iteration method aims to solve is finding an optimal policy by iteratively applying the Bellman optimality equation. To do that, we perform synchronous backups: at each iteration k and for all states s∈S, we update vk+1(s) from vk(s),

vk+1(s) = max_a∈A ( R_s^a + γ Σ_s'∈S P_ss'^a vk(s') )

Similar to iterative policy evaluation, we are updating the state-value function; however, we are not doing it for any particular policy. Similar to policy iteration, we are looking for an optimal policy; however, there is no explicit policy. Instead, we update the state-value function directly. Note that the intermediate state-value functions may not correspond to any meaningful or achievable policy.
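A minimal sketch of synchronous value iteration under the same array representation, with a greedy policy read off at the end via the greedy helper sketched earlier:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Iterate the Bellman optimality backup, then read off a greedy policy."""
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        v_new = np.max(R + gamma * P @ v, axis=1)   # Bellman optimality backup for every state
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return greedy(P, R, gamma, v_new), v_new
```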

### *Synchronous Dynamic Programming Algorithms*

| Problem | Bellman Equation | Algorithm |
| --- | --- | --- |
| Prediction | Bellman expectation equation | Iterative policy evaluation |
| Control | Bellman expectation equation + greedy policy improvement | Policy iteration |
| Control | Bellman optimality equation | Value iteration |

These algorithms are based on the state-value function vπ(s) or v∗(s), with a complexity of O(mn^2) per iteration for m actions and n states.
# **EXTENSIONS TO DYNAMIC PROGRAMMING**
### *Asynchronous Dynamic Programming*
The algorithms covered up to now update the state-value function for all states at the same time, which becomes costly when the number of states is large. For that reason, there are asynchronous extensions of the synchronous DP algorithms, in which we are selective about which states to update, and in what order, at each iteration. The following three approaches are common in practice:
1. In-place dynamic programming
The synchronous state-value update requires that we store both the old and the new state values, i.e. values from two consecutive iterations: for all s∈S,

vnew(s) ← max_a∈A ( R_s^a + γ Σ_s'∈S P_ss'^a vold(s') )

and then vold ← vnew.

In-place dynamic programming, on the other hand, stores only a single copy of the value function. In other words, for all s∈S,

v(s) ← max_a∈A ( R_s^a + γ Σ_s'∈S P_ss'^a v(s') )
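A sketch of one in-place sweep: there is only one value array, and each backup immediately sees the freshest values written earlier in the same sweep (the loop order over states is arbitrary here).

```python
import numpy as np

def in_place_sweep(P, R, gamma, v):
    """One in-place sweep of the Bellman optimality backup over all states."""
    for s in range(P.shape[0]):
        v[s] = np.max(R[s] + gamma * P[s] @ v)   # overwrite v[s] right away
    return v
```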

2. Prioritised Sweeping
The idea here is to update the states in order of their significance. For example, we could back up the state with the largest remaining Bellman error first, where the Bellman error of state s is

| max_a∈A ( R_s^a + γ Σ_s'∈S P_ss'^a v(s') ) − v(s) |
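A sketch of the selection rule: compute the Bellman error for every state and back up the most urgent one first. A full implementation would maintain a priority queue and update the priorities of affected predecessor states after each backup.

```python
import numpy as np

def prioritised_sweep_step(P, R, gamma, v):
    """Back up the single state whose Bellman error is currently largest."""
    backup = np.max(R + gamma * P @ v, axis=1)   # one-step lookahead for every state
    errors = np.abs(backup - v)                  # magnitude of the Bellman error
    s = int(np.argmax(errors))                   # most "urgent" state
    v[s] = backup[s]
    return s, v
```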

3. Real-Time Dynamic Programming
The idea of real-time DP is to update only those states that are relevant to the agent. For example, if the agent is actively exploring a certain part of the state space, there is little point in updating states it has not visited yet. Thus, after each time step St, At, Rt+1, we back up the state St:

v(St) ← max_a∈A ( R_St^a + γ Σ_s'∈S P_St,s'^a v(s') )
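A sketch of the corresponding backup: inside the agent's interaction loop, only the state actually visited at time t is updated (the surrounding loop and the variable s_t are illustrative, not part of the lecture).

```python
import numpy as np

def real_time_dp_backup(P, R, gamma, v, s_t):
    """Back up only the state the agent is currently in."""
    v[s_t] = np.max(R[s_t] + gamma * P[s_t] @ v)
    return v
```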
