---
title: planning by dynamic programming
tags: rl
---

# planning by dynamic programming

#### what is dynamic programming?
- a method for solving complex problems by breaking them down into subproblems
  - solve the subproblems
  - combine solutions to subproblems

#### dynamic programming is a very general solution method for problems which have two properties:
- optimal substructure
  - principle of optimality applies
  - optimal solution can be decomposed into subproblems
- overlapping subproblems
  - subproblems recur many times
  - solutions can be cached and reused
- Markov decision processes satisfy both properties
  - bellman equation gives recursive decomposition
  - value function stores and reuses solutions

#### planning by dynamic programming
- dynamic programming assumes full knowledge of the MDP
- it is used for *planning* in an MDP
- for prediction:
  - input: MDP ($S$, $A$, $P$, $R$, $\gamma$) and policy $\pi$
    - or: MRP ($S$, $P^\pi$, $R^\pi$, $\gamma$)
  - output: value function $v_\pi$
- or for control:
  - input: MDP ($S$, $A$, $P$, $R$, $\gamma$)
  - output: optimal value function $v_*$ and optimal policy $\pi_*$

### policy evaluation

#### iterative policy evaluation
- problem: evaluate a given policy $\pi$
- solution: iterative application of the Bellman expectation backup
  - $v_1 \to v_2 \to \dots \to v_\pi$
- using synchronous backups,
  - at each iteration $k + 1$
  - for all states $s \in S$
  - update $v_{k+1}(s)$ from $v_k(s')$, where $s'$ is a successor state of $s$

![](https://i.imgur.com/0pJYkyT.png)

#### how to improve a policy
given a policy $\pi$
- evaluate the policy $\pi$
- improve the policy by acting greedily with respect to $v_\pi$,

$\pi' = greedy(v_\pi)$

![](https://i.imgur.com/b98X43M.png)

this process of policy iteration always converges to $\pi_*$ (a minimal code sketch is given at the end of this section)

- policy evaluation: estimate $v_\pi$ (iterative policy evaluation)
- policy improvement: generate $\pi' \geq \pi$ (greedy policy improvement)

why greedy improvement works:
- consider a deterministic policy, $a = \pi(s)$
- we can *improve* the policy by acting greedily,

$\pi'(s) = \arg\max_{a \in A} q_\pi(s, a)$

- this improves the value from any state $s$ over one step,

$q_\pi(s, \pi'(s)) = \max_{a \in A} q_\pi(s, a) \geq q_\pi(s, \pi(s)) = v_\pi(s)$

- it therefore improves the value function, $v_{\pi'}(s) \geq v_{\pi}(s)$

![](https://i.imgur.com/BRQEOWm.png)

- if improvements stop,

$q_\pi(s, \pi'(s)) = \max_{a \in A} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)$

- then the bellman optimality equation has been satisfied,

$v_\pi(s) = \max_{a \in A} q_\pi(s, a)$

- therefore $v_\pi(s) = v_*(s)$ for all $s \in S$, so $\pi$ is an optimal policy

#### extensions to policy iteration
modified policy iteration
- does policy evaluation need to converge to $v_\pi$?
- or should we introduce a stopping condition
  - $\epsilon$-convergence of the value function
- or simply stop after $k$ iterations of iterative policy evaluation?
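as a concrete reference, here is a minimal numpy sketch of policy iteration (iterative policy evaluation alternated with greedy improvement). the array layout `P[s, a, s']` / `R[s, a]`, the function names, and the tolerance `theta` are illustrative assumptions, not something given in the lecture:

```python
import numpy as np

def policy_evaluation(P, R, gamma, pi, theta=1e-8):
    """Iterative policy evaluation: apply synchronous Bellman expectation
    backups until the largest change falls below theta."""
    n_states = P.shape[0]
    v = np.zeros(n_states)
    while True:
        # v_{k+1}(s) = R(s, pi(s)) + gamma * sum_s' P(s' | s, pi(s)) * v_k(s')
        v_new = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ v
                          for s in range(n_states)])
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

def greedy_improvement(P, R, gamma, v):
    """Greedy policy improvement: pi'(s) = argmax_a q(s, a)."""
    q = R + gamma * P @ v              # q[s, a] = R(s, a) + gamma * E[v(s')]
    return np.argmax(q, axis=1)

def policy_iteration(P, R, gamma):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)  # arbitrary initial deterministic policy
    while True:
        v = policy_evaluation(P, R, gamma, pi)
        pi_new = greedy_improvement(P, R, gamma, v)
        if np.array_equal(pi_new, pi):
            return pi, v                # greedy w.r.t. its own value => optimal
        pi = pi_new
```

stopping `policy_evaluation` after a fixed number of sweeps instead of running it to convergence gives the modified policy iteration discussed above.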
### value iteration

#### principle of optimality
any optimal policy can be subdivided into two components:
- an optimal first action $A_*$
- followed by an optimal policy from successor state $S'$

theorem: a policy $\pi(a | s)$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if
- for any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$, $v_\pi(s') = v_*(s')$

deterministic value iteration
- if we know the solution to the subproblems $v_*(s')$, then the solution $v_*(s)$ can be found by one-step lookahead

![](https://i.imgur.com/cMmhuCL.png)

- the idea of value iteration is to apply these updates iteratively
- intuition: start with final rewards and work backwards
- still works with loopy, stochastic MDPs

value iteration
- problem: find the optimal policy $\pi$
- solution: iterative application of the Bellman optimality backup
  - $v_1 \to v_2 \to \dots \to v_*$
- using synchronous backups
  - at each iteration $k + 1$, for all states $s \in S$, update $v_{k+1}(s)$ from $v_k(s')$
- converges to $v_*$
- unlike policy iteration, there is no explicit policy
  - intermediate value functions may not correspond to any policy

### synchronous dynamic programming algorithm summary

| problem | bellman equation | algorithm |
| -------- | -------- | -------- |
| prediction | bellman expectation equation | iterative policy evaluation |
| control | bellman expectation equation + greedy policy improvement | policy iteration |
| control | bellman optimality equation | value iteration |

- these algorithms are based on the state-value function $v_\pi(s)$ or $v_*(s)$
  - complexity $O(mn^2)$ per iteration, for $m$ actions and $n$ states
- they could also apply to the action-value function $q_\pi(s, a)$ or $q_*(s, a)$
  - complexity $O(m^2n^2)$ per iteration

### asynchronous dynamic programming
- the dynamic programming methods described so far used synchronous backups
  - all states are backed up in parallel
- asynchronous dynamic programming backs up states individually, in any order
  - for each selected state, apply the appropriate backup
- can significantly reduce computation
- guaranteed to converge if all states continue to be selected

three simple ideas for asynchronous dynamic programming:
- in-place dynamic programming
  - synchronous value iteration stores two copies of the value function, for all $s$ in $S$
    ![](https://i.imgur.com/ZjrEC8q.png)
  - in-place value iteration only stores one copy of the value function, for all $s$ in $S$ (contrasted with the synchronous version in the sketch below)
    ![](https://i.imgur.com/0zbsKdP.png)
- prioritized sweeping
  - use the magnitude of the bellman error to guide state selection
    ![](https://i.imgur.com/yr8CFGI.png)
  - back up the state with the largest remaining bellman error
  - update the bellman error of affected states after each backup
  - requires knowledge of reverse dynamics (predecessor states)
  - can be implemented efficiently by maintaining a priority queue
- real-time dynamic programming
  - idea: only back up states that are relevant to the agent
  - use the agent's experience to guide the selection of states
  - after each time-step $S_t$, $A_t$, $R_{t+1}$, back up the state $S_t$
    ![](https://i.imgur.com/plhdy0t.png)
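a minimal sketch of value iteration under the same assumed `P[s, a, s']` / `R[s, a]` array layout as the policy iteration sketch above, showing both the synchronous (two-copy) backup and the in-place (one-copy) variant; the function and parameter names are illustrative assumptions:

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8, in_place=False):
    """Apply Bellman optimality backups until the value function converges.

    in_place=False keeps two copies of v (synchronous backup);
    in_place=True overwrites v state by state (asynchronous, in-place backup)."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        if in_place:
            for s in range(n_states):
                # v(s) <- max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) v(s') ]
                backup = np.max(R[s] + gamma * P[s] @ v)
                delta = max(delta, abs(backup - v[s]))
                v[s] = backup                 # reused immediately by later states
        else:
            v_new = np.max(R + gamma * P @ v, axis=1)   # one full synchronous sweep
            delta = np.max(np.abs(v_new - v))
            v = v_new
        if delta < theta:
            break
    # extract a greedy policy only once the values have converged
    pi = np.argmax(R + gamma * P @ v, axis=1)
    return v, pi
```

note that the intermediate `v` need not correspond to any policy; a greedy policy is only extracted at the end, once the values have converged.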
### full-width backups
- dynamic programming uses full-width backups
- for each backup (sync or async)
  - every successor state and action is considered
  - using knowledge of the MDP transitions and reward function
- dynamic programming is effective for medium-sized problems (millions of states)
- for large problems dynamic programming suffers Bellman's *curse of dimensionality*
  - the number of states $n = |S|$ grows exponentially with the number of state variables
  - even one backup can be too expensive

### sample backups
- use sample rewards and sample transitions ($S$, $A$, $R$, $S'$) instead of the reward function $R$ and transition dynamics $P$
- advantages
  - model-free: no advance knowledge of the MDP required
  - breaks the curse of dimensionality through sampling
  - cost of a backup is constant, independent of $n = |S|$

### approximate dynamic programming
- approximate the value function using a function approximator $\hat{v}(s, \mathbf{w})$
- apply dynamic programming to $\hat{v}(\cdot, \mathbf{w})$
- e.g. fitted value iteration repeats at each iteration $k$,
  - sample states $\tilde{S} \subseteq S$
  - for each state $s \in \tilde{S}$, estimate the target value using the Bellman optimality equation,
    ![](https://i.imgur.com/qFexlnT.png)
  - train the next value function $\hat{v}(\cdot, \mathbf{w}_{k+1})$ using the targets $\{\langle s, \tilde{v}_k(s) \rangle\}$

### contraction mapping

some technical questions
- how do we know that value iteration converges to $v_*$?
- or that iterative policy evaluation converges to $v_\pi$?
- and therefore that policy iteration converges to $v_*$?
- is the solution unique?
- how fast do these algorithms converge?
- these questions are resolved by the *contraction mapping theorem* (a quick numerical check appears at the end of these notes)

value function space
- consider the vector space $V$ over value functions
- there are $|S|$ dimensions
- each point in this space fully specifies a value function $v(s)$
- what does a Bellman backup do to points in this space?
  - we will show that it brings value functions closer
  - and therefore the backups must converge on a unique solution

value function $\infty$-norm
- we will measure the distance between state-value functions $u$ and $v$ by the $\infty$-norm
- i.e. the largest difference between state values,
  ![](https://i.imgur.com/JKViMJu.png)

Bellman expectation backup is a contraction
- define the Bellman expectation backup operator $T^\pi$,
  ![](https://i.imgur.com/oTy2EQD.png)
- this operator is a $\gamma$-contraction, i.e. it makes value functions closer by at least $\gamma$,
  ![](https://i.imgur.com/88phMUY.png)

contraction mapping theorem
- for any metric space $V$ that is complete (i.e. closed) under an operator $T(v)$, where $T$ is a $\gamma$-contraction
  - $T$ converges to a unique fixed point
  - at a linear convergence rate of $\gamma$

convergence of iterative policy evaluation and policy iteration
- the Bellman expectation operator $T^\pi$ has a unique fixed point
- $v_\pi$ is a fixed point of $T^\pi$ (by the Bellman expectation equation)
- by the contraction mapping theorem, iterative policy evaluation converges on $v_\pi$, and policy iteration converges on $v_*$

Bellman optimality backup is a contraction
- define the Bellman optimality backup operator $T^*$,
  ![](https://i.imgur.com/SaQy2oK.png)
- this operator is a $\gamma$-contraction, i.e. it makes value functions closer by at least $\gamma$,
  ![](https://i.imgur.com/cTwRkNG.png)

convergence of value iteration
- the Bellman optimality operator $T^*$ has a unique fixed point
- $v_*$ is a fixed point of $T^*$ (by the Bellman optimality equation)
- by the contraction mapping theorem, value iteration converges on $v_*$
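as a quick numerical sanity check of the contraction property (not part of the lecture), the sketch below builds a small random MRP under a fixed policy, applies the Bellman expectation backup $T^\pi$ to two arbitrary value functions, and confirms that the $\infty$-norm distance shrinks by at least a factor of $\gamma$; the random construction and all names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9

# a random MRP under a fixed policy pi: each row of P_pi is a transition distribution
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)
R_pi = rng.random(n_states)

def T_pi(v):
    """Bellman expectation backup: (T^pi v)(s) = R^pi(s) + gamma * sum_s' P^pi(s'|s) v(s')."""
    return R_pi + gamma * P_pi @ v

# two arbitrary value functions
u = rng.normal(size=n_states)
v = rng.normal(size=n_states)

before = np.max(np.abs(u - v))               # ||u - v||_inf
after  = np.max(np.abs(T_pi(u) - T_pi(v)))   # ||T^pi u - T^pi v||_inf
print(after <= gamma * before + 1e-12)       # True: T^pi is a gamma-contraction
```

repeatedly applying `T_pi` from any starting value function therefore converges to its unique fixed point $v_\pi$, which is exactly what iterative policy evaluation does.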