Policy Iteration Algorithm
===
Author: [Raj Ghugare](https://github.com/Raj19022000)

###### tags: `notes` `Finite MDPs` `Policy iteration`

Once you've understood the proof of convergence of value iteration, it is fairly easy to see why policy iteration converges. Let's dive right into the algorithm:

1. Set n to 0.
2. Select an arbitrary deterministic policy $\pi_n$.
3. (Policy Evaluation) Evaluate $v_{\pi_n}$ by applying $L_{\pi_n}$ iteratively to any point in value-function space; as the number of iterations grows, the iterates converge to $v_{\pi_n}$. We can also compute $v_{\pi_n}$ exactly from the linear system:
$$v_{\pi_n}=(I-\gamma P^{\pi_n})^{-1}r_{\pi_n}$$
4. (Policy Improvement) Choose $\pi_{n+1}$ by acting greedily with respect to $v_{\pi_n}$ (a sketch of steps 3-5 in code is given after the proof below):
$$\pi_{n+1}(s) = \arg\max_{a}\Big[E[r|s,a]+\gamma \sum_{s' \in S}P(s'|s,a)\,v_{\pi_{n}}(s')\Big]$$
5. If $\pi_{n} = \pi_{n+1}$, stop; otherwise increment n and go back to step 3.
6. $\pi_{n}$ is the optimal policy $\pi^{*}$.

To prove that policy iteration works, we need to show that the policy improves at every iteration, i.e. the expected return from every state under the policy at the (n+1)th iteration is greater than (or at least equal to) the expected return from that state under the policy at the nth iteration.

By definition, $\pi_{n+1}$ is the greedy policy with respect to $v_{\pi_n}$, hence, component-wise over all states,

$$r_{\pi_{n+1}} + \gamma P^{\pi_{n+1}}v_{\pi_{n}} \geq r_{\pi_{n}} + \gamma P^{\pi_{n}}v_{\pi_{n}}$$
$$r_{\pi_{n+1}} + \gamma P^{\pi_{n+1}}v_{\pi_{n}} \geq v_{\pi_{n}}$$

since $v_{\pi_{n}}$ is a fixed point of the operator $L_{\pi_{n}}$. The rest is algebraic manipulation:

$$r_{\pi_{n+1}}\geq (I - \gamma P^{\pi_{n+1}})v_{\pi_{n}}$$
$$(I - \gamma P^{\pi_{n+1}})^{-1}r_{\pi_{n+1}} \geq v_{\pi_{n}}$$
$$v_{\pi_{n+1}} \geq v_{\pi_{n}}$$

The second-to-last step is valid because $(I - \gamma P^{\pi_{n+1}})^{-1} = \sum_{k \geq 0}(\gamma P^{\pi_{n+1}})^{k}$ has only non-negative entries, so multiplying both sides by it preserves the inequality.

This is the most naive form of policy iteration. It is a well-tested result that there is no need to completely evaluate a policy before acting greedily with respect to its evaluation, which gives rise to other variants of policy iteration. You can refer to Puterman's book on MDPs for a much deeper treatment of the different policy evaluation methods.
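To make the steps above concrete, here is a minimal sketch of the naive algorithm for a small tabular MDP, assuming the transition kernel and expected rewards are stored as NumPy arrays; the function name, array layout, and example numbers are illustrative, not from the original note. It evaluates each policy exactly with the matrix-inverse formula from step 3 and then improves it greedily as in step 4.

```python
import numpy as np

def policy_iteration(P, r, gamma=0.9):
    """Naive policy iteration for a finite MDP.

    P : (n_actions, n_states, n_states) array, P[a, s, s'] = P(s'|s,a)
    r : (n_states, n_actions) array, r[s, a] = E[r|s,a]
    Returns a deterministic optimal policy and its value function.
    """
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)  # step 2: arbitrary deterministic policy

    while True:
        # Step 3 (policy evaluation): v_pi = (I - gamma * P_pi)^{-1} r_pi,
        # solved as a linear system instead of forming the inverse explicitly.
        P_pi = P[pi, np.arange(n_states), :]   # rows P(.|s, pi(s))
        r_pi = r[np.arange(n_states), pi]      # rewards r(s, pi(s))
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

        # Step 4 (policy improvement): act greedily with respect to v_pi.
        # q[a, s] = E[r|s,a] + gamma * sum_s' P(s'|s,a) v(s')
        q = r.T + gamma * P @ v
        greedy = np.argmax(q, axis=0)
        # Keep the current action when it is already greedy, so ties cannot
        # make the policy oscillate and step 5's stopping test is reliable.
        pi_new = np.where(np.isclose(q[pi, np.arange(n_states)], q.max(axis=0)),
                          pi, greedy)

        # Step 5: stop once the policy no longer changes.
        if np.array_equal(pi_new, pi):
            return pi, v                       # step 6: pi is optimal
        pi = pi_new

# Hypothetical 2-state, 2-action MDP, just to show the call signature.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi_star, v_star = policy_iteration(P, r, gamma=0.9)
```

For large state spaces the exact solve would typically be replaced by repeated application of $L_{\pi_n}$ (iterative, possibly truncated, policy evaluation), which is exactly the trade-off the variants mentioned above exploit.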