Value Iteration Algorithm
===
Author: [Raj Ghugare](https://github.com/Raj19022000)

###### tags: `notes` `Finite MDPs` `Value iteration`

## What have we learnt till now:

1) In the value function space we know that the operator $L$ has a unique fixed point, $V^*$, which we also call the true value of the states.

2) Using Banach's fixed point theorem, together with the completeness of the value function space and the fact that $L$ is a contraction, the sequence $v^{N+1} = L^{N+1}v_0$ converges to $V^*$ as $N$ becomes very large, where $L$ is:
$$Lv = \max_\pi[r_\pi + \gamma P^\pi v]$$
assuming $\pi$ to be a deterministic policy.

3) Once these notions are rooted in your mind, the idea of value iteration becomes almost inevitable.

### Algorithm:

1) Select an arbitrary $v_0 \in V$, pick an $\epsilon > 0$ and set $n=0$.

2) Compute $v^{n+1}$ for each state $s$ by:
$$v^{n+1}(s) = \max_a\Big[E[r|s,a] + \gamma\sum_{j \in S} P(j|s,a)\, v^{n}(j)\Big]$$

3) If $||v^{n+1}-v^{n}|| \leq \epsilon(1-\gamma)/2\gamma$, go to step 4; else increment $n$ and go back to step 2.

4) For each state $s$, obtain a policy by acting greedily with respect to $v^{n+1}$:
$$\pi(s) = \arg\max_a\Big[E[r|s,a] + \gamma\sum_{j \in S} P(j|s,a)\, v^{n+1}(j)\Big]$$

5) Stop. (A small code sketch of these steps is given after the discussion below.)

After looking at this algorithm, the question that should arise in your mind is: why stop iterating when $||v^{n+1}-v^{n}|| \leq \epsilon(1-\gamma)/2\gamma$? Where did this magic number come from? To resolve this query I would first like to raise another question: when do you think we *should* stop iterating? Hopefully your answer is something like "when our value estimates are $\epsilon$-close to $v^*$". The catch lies in what you mean by "our value estimates". If you mean $v^n$, that would be wrong: $v^n$ is just a point in our value function space which happens to be close to the fixed point $v^*$; it is not necessarily the value function of any policy. So the right answer is that we should stop when $v^\pi$ is $\epsilon$-close to $v^*$, where $\pi$ is the deterministic greedy policy obtained in step 4. We could find $v^\pi$ by applying $L_\pi$ iteratively (it is the fixed point of $L_\pi$), but we don't actually need to compute this quantity; we only need to ensure that, whatever it may be, it lies within $\epsilon$ of $v^*$.

NOTE: The norm used in the following section is always $||\cdot||_{\infty}$.
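To make the steps above concrete, here is a minimal NumPy sketch of the loop. It assumes the finite MDP is given as an array `P` of transition probabilities with shape `(S, A, S)` and an array `R` of expected rewards with shape `(S, A)`; these names and shapes are illustrative choices, not part of the algorithm itself.

```python
import numpy as np

def value_iteration(P, R, gamma, eps):
    """Value iteration with the stopping rule from step 3.

    P : shape (S, A, S), P[s, a, j] = P(j | s, a)
    R : shape (S, A),    R[s, a]    = E[r | s, a]
    Assumes 0 < gamma < 1. Returns the final estimate v^{n+1}
    and the greedy (deterministic) policy from step 4.
    """
    n_states = P.shape[0]
    v = np.zeros(n_states)                        # arbitrary v_0 (step 1)
    threshold = eps * (1 - gamma) / (2 * gamma)   # stopping criterion from step 3
    while True:
        # One application of L: Q[s, a] = E[r|s,a] + gamma * sum_j P(j|s,a) v(j)
        Q = R + gamma * (P @ v)
        v_new = Q.max(axis=1)                     # step 2
        done = np.max(np.abs(v_new - v)) <= threshold  # ||v^{n+1} - v^n||_inf
        v = v_new
        if done:
            break
    # Step 4: act greedily with respect to v^{n+1}
    pi = (R + gamma * (P @ v)).argmax(axis=1)
    return v, pi
```

Note that the greedy policy in the last two lines is computed from the final estimate $v^{n+1}$, exactly as in step 4.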
### Proof:

Suppose that we have run the value iteration algorithm to completion, so that

* $||v^{n+1}-v^{n}|| \leq \epsilon(1-\gamma)/2\gamma$,
* $\pi(s) = \arg\max_a\big[E[r|s,a] + \gamma\sum_{j \in S} P(j|s,a)\, v^{n+1}(j)\big]$.

Using the triangle inequality:

$$||v^{\pi}-v^{*}|| \leq ||v^{\pi} - v^{n+1}|| + ||v^{n+1} - v^*||$$

Considering the first term on the RHS: since $v^\pi$ is the fixed point of $L_\pi$,

$$||v^{\pi} - v^{n+1}|| = ||L_\pi v^\pi - v^{n+1}||$$

Recall that $\pi$ is the deterministic policy which, in every state, chooses the action maximizing $E[r|s,a] + \gamma\sum_{j \in S} P(j|s,a)\, v^{n+1}(j)$. Applying $L_\pi$ to $v^{n+1}$ therefore takes exactly the same maximum as applying $L$ to $v^{n+1}$, i.e. $L_\pi v^{n+1} = L v^{n+1}$. Note also that $L_\pi$ is a $\gamma$-contraction, by the same argument that shows $L$ is one. Using the triangle inequality, the contraction property of $L_\pi$, and the fact that $v^{n+1} = Lv^n$:

$$||v^{\pi} - v^{n+1}|| \leq ||L_\pi v^\pi - L_\pi v^{n+1}|| + ||L_\pi v^{n+1} - v^{n+1}||$$
$$||v^{\pi} - v^{n+1}|| \leq \gamma||v^\pi - v^{n+1}|| + ||Lv^{n+1} - Lv^{n}||$$
$$||v^{\pi} - v^{n+1}|| \leq \gamma||v^\pi - v^{n+1}|| + \gamma||v^{n+1} - v^{n}||$$
$$||v^{\pi} - v^{n+1}|| \leq \gamma||v^{n+1} - v^{n}||/(1- \gamma)$$

Considering the second term on the RHS: applying the triangle inequality repeatedly along the tail of the sequence $v^{n+1}, v^{n+2}, \dots$, which converges to $v^*$, we obtain

$$||v^{n+1}-v^{*}|| \leq \sum_{k=0}^{\infty}||v^{n+k+2}-v^{n+k+1}||$$
$$||v^{n+1}-v^{*}|| \leq \sum_{k=0}^{\infty}||L^{k+1}v^{n+1}-L^{k}v^{n+1}||$$

Since $v^{n+1} = Lv^n$, the $k$-th term equals $||L^{k+1}v^{n+1}-L^{k+1}v^{n}||$, so using the fact that $L$ is a $\gamma$-contraction,

$$||v^{n+1}-v^{*}|| \leq \sum_{k=0}^{\infty} \gamma^{k+1}||v^{n+1}-v^{n}||$$
$$||v^{n+1}-v^{*}|| \leq ||v^{n+1}-v^{n}||\,\gamma/(1-\gamma)$$

By substituting the two results into the main inequality we get

$$||v^{\pi}-v^{*}|| \leq ||v^{\pi} - v^{n+1}|| + ||v^{n+1} - v^*|| \leq 2\gamma||v^{n+1} - v^{n}||/(1-\gamma)$$

Since we know, from the third step of our algorithm, that

$$||v^{n+1}-v^{n}|| \leq \epsilon(1-\gamma)/2\gamma$$

we conclude

$$||v^{\pi}-v^{*}|| \leq \epsilon$$

Hence the value of the greedy policy $\pi$ is within $\epsilon$ of the true optimal value of every state.
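If you want to see this bound in action, the following illustrative check (reusing the `value_iteration` sketch from above) builds a random MDP, evaluates the greedy policy exactly by solving the linear system $v^\pi = r_\pi + \gamma P_\pi v^\pi$, and compares it against a high-accuracy stand-in for $v^*$. All names and sizes here are hypothetical, chosen only for the example.

```python
import numpy as np

def policy_value(P, R, pi, gamma):
    """Exact v^pi by solving the linear system (I - gamma * P_pi) v = r_pi."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, pi]                  # P_pi[s, j] = P(j | s, pi(s))
    r_pi = R[idx, pi]                  # r_pi[s]    = E[r | s, pi(s)]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# A randomly generated MDP, purely for illustration
rng = np.random.default_rng(0)
S, A, gamma, eps = 20, 4, 0.9, 1e-3
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)      # make each P(.|s,a) a distribution
R = rng.random((S, A))

v_n1, pi = value_iteration(P, R, gamma, eps)   # sketch from above

# A very accurate stand-in for v*: run many more Bellman updates
v_star = v_n1.copy()
for _ in range(5000):
    v_star = (R + gamma * (P @ v_star)).max(axis=1)

gap = np.max(np.abs(policy_value(P, R, pi, gamma) - v_star))
print(gap, gap <= eps)                 # the proof guarantees gap <= eps
```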