Policy Iteration Algorithm
===
Author
[Raj Ghugare](https://github.com/Raj19022000)
###### tags: `notes` `Finite MDPs` `Policy iteration`
Once you've understood the proof of convergence of Value iteration, it is fairly easy to see why policy iteration converges as well. Let's dive right into the algorithm (a small NumPy sketch follows the list):
1. Set $n = 0$.
2. Select an arbitrary deterministic policy $\pi_n$.
3. (Policy Evaluation) Evaluate $v_{\pi_n}$ by applying $L_{\pi_n}$ iteratively to any point in value-function space. As the number of iterations becomes large, the iterates converge to $v_{\pi_n}$.
We can also find $v_{\pi_n}$ using the equation:
$$v_{\pi_n}=(I-\gamma P^{\pi_n})^{-1}r_{\pi_n}$$
4. (Policy Improvement) Choose $\pi_{n+1}$ to be the greedy policy with respect to $v_{\pi_n}$:
$$\pi_{n+1}(s) = \arg\max_{a}\big[E[r|s,a]+\gamma \sum_{s' \in S}P(s'|s,a)\,v_{\pi_{n}}(s')\big]$$
5. If $\pi_{n} = \pi_{n+1}$, stop; else increment $n$ and go back to step 3.
6. $\pi_{n}$ is the optimal policy $\pi^{*}$.
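As a rough illustration of these steps, here is a minimal NumPy sketch. The array layout ($P$ stored as `P[a, s, s']`, expected rewards as `r[s, a]`) and the function name `policy_iteration` are assumptions of mine, not part of these notes; policy evaluation here uses the matrix-inversion form from step 3.

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Naive policy iteration for a finite MDP.

    Assumed layout:
      P[a, s, s'] = P(s' | s, a), shape (A, S, S)
      r[s, a]     = E[r | s, a],  shape (S, A)
      gamma in [0, 1)
    """
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)                    # step 2: arbitrary deterministic policy
    while True:
        # Step 3 (policy evaluation): v = (I - gamma * P_pi)^(-1) r_pi
        P_pi = P[pi, np.arange(S), :]              # S x S transition matrix under pi
        r_pi = r[np.arange(S), pi]                 # expected one-step reward under pi
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

        # Step 4 (policy improvement): act greedily with respect to v_pi
        q = r.T + gamma * P @ v                    # q[a, s] = E[r|s,a] + gamma * sum_s' P(s'|s,a) v(s')
        pi_new = np.argmax(q, axis=0)

        # Step 5: stop once the policy is stable, otherwise repeat
        if np.array_equal(pi_new, pi):
            return pi, v                           # step 6: pi is optimal
        pi = pi_new
```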
To prove that policy iteration works, we need to show that the policy improves at every iteration, i.e. the expected return from every state under the policy at the $(n+1)$-th iteration should be greater than (or at the very least equal to) the expected return from the same state under the policy at the $n$-th iteration.
We know that by definition $\pi_{n+1}$ is the greedy policy with respect to $v_{\pi_n}$, hence, component-wise for all states:
$r_{\pi_{n+1}} + \gamma P^{\pi_{n+1}}v_{\pi_{n}} \geq r_{\pi_{n}} + \gamma P^{\pi_{n}}v_{\pi_{n}}$
$r_{\pi_{n+1}} + \gamma P^{\pi_{n+1}}v_{\pi_{n}} \geq v_{\pi_{n}}$
where the second inequality holds because $v_{\pi_{n}}$ is a fixed point of the operator $L_{\pi_{n}}$, i.e. $r_{\pi_{n}} + \gamma P^{\pi_{n}}v_{\pi_{n}} = v_{\pi_{n}}$. The rest is just algebraic manipulation:
$r_{\pi_{n+1}}\geq (I - \gamma P^{\pi_{n+1}})v_{\pi_{n}}$
Multiplying both sides by $(I - \gamma P^{\pi_{n+1}})^{-1} = \sum_{k \geq 0}(\gamma P^{\pi_{n+1}})^{k}$, which has only non-negative entries and therefore preserves the inequality, gives
$(I - \gamma P^{\pi_{n+1}})^{-1}r_{\pi_{n+1}} \geq v_{\pi_{n}}$
$v_{\pi_{n+1}} \geq v_{\pi_{n}}$
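If you want to see this inequality empirically, here is a small check on a random toy MDP, reusing the assumed `P[a, s, s']` / `r[s, a]` layout from the sketch above. It performs a single evaluation/improvement step and compares the two value functions element-wise:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

# Random toy MDP: each row P[a, s, :] is a probability distribution over next states
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

def evaluate(pi):
    """Exact evaluation of a deterministic policy via the matrix-inversion formula."""
    P_pi = P[pi, np.arange(S), :]
    r_pi = r[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

pi = rng.integers(A, size=S)                       # arbitrary starting policy
v = evaluate(pi)
pi_new = np.argmax(r.T + gamma * P @ v, axis=0)    # greedy w.r.t. v_pi
v_new = evaluate(pi_new)

print(np.all(v_new >= v - 1e-12))                  # True: no state gets worse
```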
This is the most naive form of policy iteration. It is a well-established result that there is no need to evaluate a policy completely before acting greedily with respect to its evaluation, which gives rise to other variants of policy iteration. You can refer to Puterman's book on MDPs for a much deeper treatment of the different policy evaluation methods.
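For instance, one such variant (modified policy iteration, also treated in Puterman's book) replaces the exact solve with a fixed number of applications of $L_{\pi_n}$ per iteration. A rough sketch, with the parameter names and default values below being illustrative choices of mine:

```python
import numpy as np

def modified_policy_iteration(P, r, gamma, m=5, iters=200):
    """Partial-evaluation variant: apply L_pi only m times per improvement step
    instead of solving for v_pi exactly. Same assumed layout: P[a, s, s'], r[s, a]."""
    A, S, _ = P.shape
    v = np.zeros(S)
    for _ in range(iters):
        pi = np.argmax(r.T + gamma * P @ v, axis=0)      # greedy improvement w.r.t. current v
        P_pi = P[pi, np.arange(S), :]
        r_pi = r[np.arange(S), pi]
        for _ in range(m):                               # m sweeps of L_pi instead of an exact solve
            v = r_pi + gamma * P_pi @ v
    return pi, v
```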