---
title: planning by dynamic programming
tags: rl
---
# planning by dynamic programming
#### what is dynamic programming?
- a method for solving complex problems by breaking them down into subproblems
- solve the subproblems
- combine solutions to subproblems
#### dynamic programming is a very general solution method for problems which have two properties:
- optimal substructure
- principle of optimality applies
- optimal solution can be decomposed into subproblems
- overlapping subproblems
- subproblems recur many times
- solutions can be cached and reused
- Markov decision processes satisfy both properties
- bellman equation gives recursive decomposition
- value function stores and reuses solutions
#### planning by dynamic programming
- dynamic programming assumes full knowledge of the MDP
- it is used for *planning* in an MDP
- for prediction:
- input: MDP ($S$, $A$, $P$, $R$, $\gamma$) and policy $\pi$
- or: MRP ($S$, $P^\pi$, $R^\pi$, $\gamma$)
- output: value function $v_\pi$
- or for control:
- input: MDP ($S$, $A$, $P$, $R$, $\gamma$)
	- output: optimal value function $v_*$ and optimal policy $\pi_*$
### policy evaluation
#### iterative policy evaluation
- problem: evaluate a given policy $\pi$
- solution: iterative application of Bellman expectation backup
	- $v_1 \to v_2 \to \dots \to v_\pi$
- using synchronous backups,
	- at each iteration $k+1$, for all states $s \in S$, update $v_{k+1}(s)$ from $v_k(s')$, where $s'$ is a successor state of $s$ (a code sketch follows below):
$v_{k+1}(s) = \sum_{a \in A} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_k(s') \right)$
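
a minimal sketch of iterative policy evaluation in Python, assuming a gym-style transition table `P[s][a] = [(prob, next_state, reward), ...]` and a stochastic policy array `policy[s][a]` (both names are illustrative, not from the source); the `theta` threshold is the $\epsilon$-convergence stopping condition discussed under modified policy iteration below

```python
import numpy as np

def iterative_policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """Evaluate a fixed policy by repeated synchronous Bellman expectation backups.

    P[s][a] is assumed to be a list of (prob, next_state, reward) triples and
    policy[s][a] the probability of taking action a in state s (illustrative format).
    """
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        v_new = np.zeros(n_states)  # synchronous backup: build v_{k+1} from a frozen copy v_k
        for s in range(n_states):
            for a, pi_sa in enumerate(policy[s]):
                for prob, s_next, reward in P[s][a]:
                    v_new[s] += pi_sa * prob * (reward + gamma * v[s_next])
        delta = np.max(np.abs(v_new - v))
        v = v_new
        if delta < theta:  # epsilon-convergence stopping condition
            return v
```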

#### how to improve a policy
given a policy $\pi$
- evaluate the policy $\pi$
- improve the policy by acting greedily with respect to $v_\pi$
$\pi' = \text{greedy}(v_\pi)$

this process of policy iteration always converges to $\pi_*$
policy evaluation
- estimate $v_\pi$ (iterative policy evaluation)
policy improvement
- generate $\pi' \geq \pi$ (greedy policy improvement)
- consider a deterministic policy, $a = \pi(s)$
- we can *improve* the policy by acting greedily,
$\pi'(s) = \arg\max_{a \in A} q_\pi(s, a)$
- this improves the value from any state $s$ over one step,
$q_\pi(s, \pi'(s)) = \max_{a \in A} q_\pi(s, a) \geq q_\pi(s, \pi(s)) = v_\pi(s)$
- it therefore improves the value function, $v_{\pi'}(s) \geq v_\pi(s)$ (see the expansion below)
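
expanding the one-step inequality inside the expectation and repeating gives the standard argument behind this claim:
$$
\begin{aligned}
v_\pi(s) &\leq q_\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right] \\
&\leq \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right] \\
&\leq \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s \right] \\
&\leq \cdots \leq \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right] = v_{\pi'}(s)
\end{aligned}
$$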

- if improvements stop,
$q_\pi(s, \pi'(s)) = \max_{a \in A} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)$
- then the Bellman optimality equation has been satisfied,
$v_\pi(s) = \max_{a \in A} q_\pi(s, a)$
- therefore $v_\pi(s) = v_*(s)$ for all $s \in S$
so $\pi$ is an optimal policy
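
a sketch of the full policy iteration loop, reusing the hypothetical `iterative_policy_evaluation` and transition table `P[s][a]` from the sketch above; termination when the greedy policy stops changing mirrors the "improvements stop" argument

```python
import numpy as np

def policy_iteration(P, n_actions, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    n_states = len(P)
    policy = np.zeros(n_states, dtype=int)  # arbitrary deterministic starting policy, a = pi(s)
    while True:
        # policy evaluation: estimate v_pi for the current policy
        policy_probs = np.eye(n_actions)[policy]  # one-hot encoding of pi(a|s)
        v = iterative_policy_evaluation(P, policy_probs, gamma)

        # policy improvement: act greedily with respect to q_pi(s, a)
        q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                for prob, s_next, reward in P[s][a]:
                    q[s, a] += prob * (reward + gamma * v[s_next])
        new_policy = q.argmax(axis=1)

        if np.array_equal(new_policy, policy):  # improvements stopped: Bellman optimality holds
            return policy, v
        policy = new_policy
```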
#### extensions to policy iteration
modified policy iteration
- does policy evaluation need to converge to $v_\pi$?
- or should we introduce a stopping condition
- $\epsilon$-convergence of value function
- or simply stop after k iterations of iterative policy evaluation?
### value iteration
#### principle of optimality
any optimal policy can be subdivided into two components:
- an optimal first action $A_*$
- followed by an optimal policy from successor state S'
theorem
a policy $\pi(a \mid s)$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if
- for any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$, $v_\pi(s') = v_*(s')$
#### deterministic value iteration
- if we know the solution to subproblems $v_*(s')$, then the solution $v_*(s)$ can be found by one-step lookahead,
$v_*(s) \leftarrow \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_*(s') \right)$
- the idea of value iteration is to apply these updates iteratively
- intuition: start with final rewards and work backwards
- still works with loopy, stochastic MDPs
#### value iteration
- problem: find the optimal policy $\pi_*$
- solution: iterative application of Bellman optimality backup
- $v_1$ -> $v_2$ -> ... -> $v_*$
- using synchronous backups
	- at each iteration $k+1$, for all states $s \in S$, update $v_{k+1}(s)$ from $v_k(s')$ (sketched in code below):
$v_{k+1}(s) = \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_k(s') \right)$
- converges to $v_{*}$
- unlike policy iteration, there is no explicit policy
	- intermediate value functions may not correspond to any policy
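
a corresponding sketch of value iteration under the same assumed `P[s][a]` transition-table format; note there is no explicit policy during the iterations, and a greedy policy is only read off after convergence

```python
import numpy as np

def value_iteration(P, n_actions, gamma=0.9, theta=1e-8):
    """Apply the Bellman optimality backup to all states until the value function converges."""
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                for prob, s_next, reward in P[s][a]:
                    q[s, a] += prob * (reward + gamma * v[s_next])
        v_new = q.max(axis=1)  # Bellman optimality backup for every state
        if np.max(np.abs(v_new - v)) < theta:
            return q.argmax(axis=1), v_new  # greedy policy extracted only at the end
        v = v_new
```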
### synchronous dynamic programming algorithm summary
| problem | bellman equation | algorithm |
| -------- | -------- | -------- |
| prediction | bellman expectation equation | iterative policy evaluation |
| control | bellman expectation equation + greedy policy improvement | policy iteration |
| control | bellman optimality equation | value iteration |
- these algorithms are based on the state-value function $v_\pi(s)$ or $v_*(s)$
	- complexity $O(mn^2)$ per iteration, for $m$ actions and $n$ states
- they could also apply to the action-value function $q_\pi(s, a)$ or $q_*(s, a)$
	- complexity $O(m^2 n^2)$ per iteration
### asynchronous dynamic programming
- dynamic programming methods described so far used synchronous backups
- all states are backed up in parallel
- asynchronous dynamic programming backs up states individually, in any order
- for each selected state, apply the appropriate backup
- can significantly reduce computation
- guaranteed to converge if all states continue to be selected
three simple ideas for asynchronous dynamic programming:
- in-place dynamic programming
	- synchronous value iteration stores two copies of the value function: for all $s \in S$,
$v_{\text{new}}(s) \leftarrow \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_{\text{old}}(s') \right)$, then $v_{\text{old}} \leftarrow v_{\text{new}}$
	- in-place value iteration only stores one copy of the value function: for all $s \in S$,
$v(s) \leftarrow \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v(s') \right)$
- prioritized sweeping
	- use the magnitude of the Bellman error to guide state selection,
$\left| \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v(s') \right) - v(s) \right|$
	- backup the state with the largest remaining Bellman error
	- update the Bellman error of affected states after each backup
	- requires knowledge of reverse dynamics (predecessor states)
	- can be implemented efficiently by maintaining a priority queue (see the sketch after this list)
- real-time dynamic programming
	- idea: only backup states that are relevant to the agent
	- use the agent's experience to guide the selection of states
	- after each time-step $S_t$, $A_t$, $R_{t+1}$, backup the state $S_t$,
$v(S_t) \leftarrow \max_{a \in A} \left( R_{S_t}^a + \gamma \sum_{s' \in S} P_{S_t s'}^a \, v(s') \right)$
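
a sketch of prioritized sweeping over the same hypothetical `P[s][a]` table: states are backed up in place, one at a time, in order of Bellman error, and predecessors are re-prioritized after each backup (lazily, via duplicate heap entries)

```python
import heapq
import numpy as np

def prioritized_sweeping(P, n_actions, gamma=0.9, theta=1e-8, max_backups=100_000):
    """Asynchronous value iteration that always backs up the state with the largest Bellman error."""
    n_states = len(P)
    v = np.zeros(n_states)

    # reverse dynamics: which states can transition into s'
    predecessors = [set() for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for prob, s_next, _ in P[s][a]:
                if prob > 0:
                    predecessors[s_next].add(s)

    def backup_and_error(s):
        """Bellman optimality backup for s and its current Bellman error."""
        q_s = [sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in range(n_actions)]
        best = max(q_s)
        return best, abs(best - v[s])

    # max-heap of Bellman errors (negated, since heapq is a min-heap)
    queue = [(-backup_and_error(s)[1], s) for s in range(n_states)]
    heapq.heapify(queue)

    for _ in range(max_backups):
        neg_err, s = heapq.heappop(queue)
        if -neg_err < theta:
            break  # largest remaining Bellman error is below the threshold
        backup, _ = backup_and_error(s)  # recompute with the current value function
        v[s] = backup                    # in-place backup of a single state
        for s_pred in predecessors[s] | {s}:
            _, err = backup_and_error(s_pred)
            heapq.heappush(queue, (-err, s_pred))  # re-prioritize affected states
    return v
```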

### full-width backups
- dynamic programming uses full-width backups
- for each backup (sync or async)
- every successor state and action is considered using knowledge of the MDP transitions and reward function
- dynamic programming is effective for medium-sized problems (millions of states)
- for large problems, dynamic programming suffers from Bellman's *curse of dimensionality*
	- the number of states $n = |S|$ grows exponentially with the number of state variables
- even one backup can be too expensive
### sample backups
- using sample rewards and sample transitions ($S$, $A$, $R$, $S'$) instead of reward function $R$ and transition dynamics $P$
- advantages
- model-free: no advance knowledge of MDP required
- breaks the curse of dimensionality through sampling
	- cost of a backup is constant, independent of $n = |S|$ (one such sample backup is sketched below)
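
as an illustration of the idea (model-free methods are the subject of later lectures), a single TD(0)-style sample backup updates one state from one sampled transition at $O(1)$ cost:

```python
def sample_backup(v, s, reward, s_next, alpha=0.1, gamma=0.9):
    """Update v(s) toward the sampled one-step target reward + gamma * v(s_next),
    instead of taking a full-width expectation over P and R."""
    v[s] += alpha * (reward + gamma * v[s_next] - v[s])
    return v
```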
### approximate dynamic programming
- approximate the value function using a function approximator $\hat{v}(s, \mathbf{w})$
- apply dynamic programming to $\hat{v}(\cdot, \mathbf{w})$
- e.g. fitted value iteration repeats at each iteration $k$ (a sketch with a linear approximator follows below):
	- sample states $\tilde{S} \subseteq S$
	- for each state $s \in \tilde{S}$, estimate the target value using the Bellman optimality equation,
$\tilde{v}_k(s) = \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, \hat{v}(s', \mathbf{w}_k) \right)$
	- train the next value function $\hat{v}(\cdot, \mathbf{w}_{k+1})$ using targets $\{ \langle s, \tilde{v}_k(s) \rangle \}$
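
a sketch with a linear approximator $\hat{v}(s, \mathbf{w}) = \phi(s)^\top \mathbf{w}$ fitted by least squares; `features` (returning $\phi(s)$) and `backup_target` (evaluating the Bellman optimality backup for a state under the current approximate value function) are placeholder callables the caller must supply

```python
import numpy as np

def fitted_value_iteration(sample_states, features, backup_target, n_iters=50):
    """Repeatedly regress the approximate value function onto Bellman optimality targets."""
    phi = np.array([features(s) for s in sample_states])  # |S~| x d design matrix
    w = np.zeros(phi.shape[1])
    for _ in range(n_iters):
        v_hat = lambda s: np.dot(features(s), w)  # current approximation v_hat(s, w_k)
        # targets v~_k(s) from the Bellman optimality backup under the current v_hat
        targets = np.array([backup_target(s, v_hat) for s in sample_states])
        # "train" w_{k+1} on the pairs <s, v~_k(s)> -- here, a least-squares fit
        w, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return w
```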
### contraction mapping
#### some technical questions
- how do we know that value iteration converges to $v_*$?
- or that iterative policy evaluation converges to $v_\pi$?
- and therefore that policy iteration converges to $v_*$?
- is the solution unique?
- how fast do these algorithms converge?
- these questions are resolved by the *contraction mapping theorem*
#### value function space
- consider the vector space $V$ over value functions
	- there are $|S|$ dimensions
	- each point in this space fully specifies a value function $v(s)$
- what does a Bellman backup do to points in this space?
	- we will show that it brings value functions closer
	- and therefore that the backups must converge on a unique solution
#### value function $\infty$-norm
- we will measure the distance between state-value functions $u$ and $v$ by the $\infty$-norm
	- i.e. the largest difference between state values,
$\| u - v \|_\infty = \max_{s \in S} | u(s) - v(s) |$
#### Bellman expectation backup is a contraction
- define the Bellman expectation backup operator $T^\pi$,
$T^\pi(v) = R^\pi + \gamma P^\pi v$
- this operator is a $\gamma$-contraction, i.e. it makes value functions closer by at least $\gamma$,
$\| T^\pi(u) - T^\pi(v) \|_\infty = \| \gamma P^\pi (u - v) \|_\infty \leq \gamma \, \| u - v \|_\infty$
#### contraction mapping theorem
- for any metric space $V$ that is complete (i.e. closed) under an operator $T(v)$, where $T$ is a $\gamma$-contraction,
- $T$ converges to a unique fixed point, at a linear convergence rate of $\gamma$
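
concretely, for iterative policy evaluation (using $T^\pi v_\pi = v_\pi$ from the Bellman expectation equation), the error shrinks geometrically:
$$\| v_k - v_\pi \|_\infty = \| T^\pi v_{k-1} - T^\pi v_\pi \|_\infty \leq \gamma \, \| v_{k-1} - v_\pi \|_\infty \leq \cdots \leq \gamma^k \, \| v_0 - v_\pi \|_\infty \to 0 \text{ as } k \to \infty$$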
#### convergence of iterative policy evaluation and policy iteration
- the Bellman expectation operator $T^\pi$ has a unique fixed point
- $v_\pi$ is a fixed point of $T^\pi$ (by the Bellman expectation equation)
- by the contraction mapping theorem, iterative policy evaluation converges on $v_\pi$, and policy iteration converges on $v_*$
#### Bellman optimality backup is a contraction
- define the Bellman optimality backup operator $T^*$,
$T^*(v) = \max_{a \in A} \left( R^a + \gamma P^a v \right)$
- this operator is a $\gamma$-contraction, i.e. it makes value functions closer by at least $\gamma$,
$\| T^*(u) - T^*(v) \|_\infty \leq \gamma \, \| u - v \|_\infty$
#### convergence of value iteration
- the Bellman optimality operator $T^*$ has a unique fixed point
- $v_*$ is a fixed point of $T^*$ (by the Bellman optimality equation)
- by the contraction mapping theorem, value iteration converges on $v_*$