Value Iteration Algorithm
===
Author
[Raj Ghugare](https://github.com/Raj19022000)
###### tags: `notes` `Finite MDPs` `Value iteration`
## What have we learnt till now:
1) In the value function space, we know that the operator $L$ has a unique fixed point $v^*$, which we also call the true (optimal) value of the states.
2) Using Banach's fixed point theorem and the completeness of the value function space, we know that the sequence $v^{n+1} = L^{n+1}v_0$ converges to $v^*$ as $n$ becomes very large,
where $L$ is the Bellman optimality operator:
$$Lv = \max_\pi[r_\pi + \gamma P^\pi v]$$
with the maximum taken over deterministic policies $\pi$ (a small computational sketch of $L$ is given right after this list).
3) Once these notions are rooted in your mind, the idea of value iteration becomes inevitable.
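To make the operator concrete, here is a minimal sketch of a single application of $L$ to a value vector for a finite MDP. The array layout (`P[s, a, j]` standing for $P(j|s,a)$ and `R[s, a]` for $E[r|s,a]$) and the function name are assumptions made for this illustration, not notation from the notes above.

```python
import numpy as np

def bellman_optimality_operator(v, P, R, gamma):
    """One application of L: (Lv)(s) = max_a [ E[r|s,a] + gamma * sum_j P(j|s,a) * v(j) ].

    v: current value estimates, shape (S,)
    P: transition probabilities, shape (S, A, S), with P[s, a, j] = P(j|s,a)
    R: expected rewards,         shape (S, A),   with R[s, a] = E[r|s,a]
    gamma: discount factor in (0, 1)
    """
    q = R + gamma * np.einsum("saj,j->sa", P, v)  # Q-values for every (s, a) pair
    return q.max(axis=1)                          # greedy maximization over actions
```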
### Algorithm:
1) Select an arbitrary $v_0 \in V$, pick an $\epsilon > 0$ and set $n=0$
2) Compute $v^{n+1}$ for each state $s$ by:
$$v^{n+1}(s) = \max_a[E[r|s,a] + \gamma\Sigma_{j \in S}P(j|s,a)\,v^{n}(j)]$$
3) If $||v^{n+1}-v^{n}|| \leq \epsilon(1-\gamma)/(2\gamma)$, then go to step 4; else increment the value of $n$ and go back to step 2.
4) For each state $s$, extract an $\epsilon$-optimal policy by acting greedily with respect to $v^{n+1}$ (a code sketch of the full procedure follows this list):
$$\pi(s) = \arg\max_a[E[r|s,a] + \gamma\Sigma_{j \in S}P(j|s,a)\,v^{n+1}(j)]$$
5) Stop
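The five steps above can be written out as a minimal code sketch. It reuses the assumed array layout from the earlier snippet; the helper name `value_iteration`, the choice $v_0 = 0$, and the sup-norm test via `np.max(np.abs(...))` are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma, epsilon):
    """Steps 1-5: iterate v <- Lv until ||v^{n+1} - v^n|| <= eps*(1-gamma)/(2*gamma),
    then act greedily with respect to the final iterate."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)                              # step 1: arbitrary v_0, n = 0
    threshold = epsilon * (1 - gamma) / (2 * gamma)     # stopping tolerance from step 3
    while True:
        q = R + gamma * np.einsum("saj,j->sa", P, v)    # E[r|s,a] + gamma * sum_j P(j|s,a) v^n(j)
        v_next = q.max(axis=1)                          # step 2: v^{n+1} = L v^n
        if np.max(np.abs(v_next - v)) <= threshold:     # step 3: sup-norm stopping test
            v = v_next
            break
        v = v_next                                      # otherwise increment n and repeat
    q = R + gamma * np.einsum("saj,j->sa", P, v)
    policy = q.argmax(axis=1)                           # step 4: greedy policy w.r.t. v^{n+1}
    return policy, v                                    # step 5: stop
```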
After looking at this algorithm, the question that should arise in your mind is: why stop iterating when $||v^{n+1}-v^{n}|| \leq \epsilon(1-\gamma)/(2\gamma)$?
Where did this magic number come from?
To resolve this query, I would first like to raise another question: when do you think we should stop iterating? Hopefully your answer is something like "when our value estimates are $\epsilon$-close to $v^*$". The catch lies in what you mean by "our value estimates". Do you mean $v^n$? That would be wrong, because $v^n$ is just a point in our value function space that happens to be close to the fixed point $v^*$; it need not correspond to the value of any policy.
So the right answer is that we should stop when $v_\pi$ is $\epsilon$-close to $v^*$, where $\pi$ is the deterministic greedy policy obtained in step 4. We could find $v_\pi$ by applying $L_\pi$ iteratively ($v_\pi$ is the fixed point of $L_\pi$). We do not actually need to compute this quantity; we only need to ensure that, whatever it is, it lies within $\epsilon$ of $v^*$.
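For a finite MDP, $v_\pi$ can be obtained either by iterating $L_\pi$ or, equivalently, by solving the linear system $v_\pi = r_\pi + \gamma P_\pi v_\pi$ directly. Below is a minimal sketch of the direct solve, under the same assumed array layout as the snippets above; the function name is again an assumption.

```python
import numpy as np

def evaluate_policy(policy, P, R, gamma):
    """Exact v_pi for a deterministic policy: solve (I - gamma * P_pi) v_pi = r_pi."""
    n_states = R.shape[0]
    P_pi = P[np.arange(n_states), policy]   # shape (S, S): row s is P(. | s, pi(s))
    r_pi = R[np.arange(n_states), policy]   # shape (S,):   E[r | s, pi(s)]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```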
NOTE:
The norm used throughout the following section is the sup norm $||\cdot||_{\infty}$.
### Proof:
Suppose that we have run the value iteration algorithm to completion.
We then have:
* $||v^{n+1}-v^{n}|| \leq \epsilon(1-\gamma)/(2\gamma)$
* $\pi(s) = \arg\max_a[E[r|s,a] + \gamma\Sigma_{j \in S}P(j|s,a)\,v^{n+1}(j)]$
Using the triangle inequality:
$$||v^{\pi}-v^{*}|| \leq ||v^{\pi} - v^{n+1}|| + ||v^{n+1} - v^*||$$
Considering the first term on the RHS:
since $v^\pi$ is the fixed point of $L_\pi$,
$$||v^{\pi} - v^{n+1}|| = ||L_\pi v^\pi - v^{n+1}||$$
We know that $\pi$ is the deterministic policy which, for every state, chooses the action maximizing $E[r|s,a] + \gamma\Sigma_{j \in S}P(j|s,a)\,v^{n+1}(j)$.
Hence, at the point $v^{n+1}$, applying $L_\pi$ is the same as applying $L$: that is, $L_\pi v^{n+1} = Lv^{n+1}$ (note that this equality is only guaranteed at $v^{n+1}$, not at arbitrary value functions).
Using the triangle inequality,
$$||v^{\pi} - v^{n+1}|| \leq ||L_\pi v^\pi - L_\pi v^{n+1}|| + ||L_\pi v^{n+1} - v^{n+1}||$$
Since $L_\pi$ is a $\gamma$-contraction, and since $L_\pi v^{n+1} = Lv^{n+1}$ while $v^{n+1} = Lv^{n}$,
$$||v^{\pi} - v^{n+1}|| \leq \gamma||v^\pi - v^{n+1}|| + ||Lv^{n+1} - Lv^{n}||$$
$L$ is also a $\gamma$-contraction, so
$$||v^{\pi} - v^{n+1}|| \leq \gamma||v^\pi - v^{n+1}|| + \gamma||v^{n+1} - v^{n}||$$
Rearranging,
$$||v^{\pi} - v^{n+1}|| \leq \gamma||v^{n+1} - v^{n}||/(1- \gamma)$$
Considering the second term on the RHS:
writing $v^{*} - v^{n+1}$ as the telescoping sum $\Sigma_{k=0}^{\infty}(v^{n+k+2}-v^{n+k+1})$ (the series converges because $v^{m} \to v^{*}$) and applying the triangle inequality, we obtain
$$||v^{n+1}-v^{*}|| \leq \Sigma_{k=0}^{\infty}||v^{n+k+2}-v^{n+k+1}||$$
$$||v^{n+1}-v^{*}|| \leq \Sigma_{k=0}^{\infty}||L^{k+1}v^{n+1}-L^{k}v^{n+1}||$$
Using the fact that $L$ is a $\gamma$-contraction and that $v^{n+1} = Lv^{n}$,
$$||v^{n+1}-v^{*}|| \leq \Sigma_{k=0}^{\infty} \gamma^{k+1}||v^{n+1}-v^{n}||$$
$$||v^{n+1}-v^{*}|| \leq ||v^{n+1}-v^{n}||(\gamma/(1-\gamma))$$
By substituting the two results into the main inequality, we get
$$||v^{\pi}-v^{*}|| \leq ||v^{\pi} - v^{n+1}|| + ||v^{n+1} - v^*||$$
$$||v^{\pi}-v^{*}|| \leq 2\gamma||v^{n} - v^{n+1}||/(1-\gamma)$$
Since we know, from the third step of our algorithm, that
$$||v^{n+1}-v^{n}|| \leq \epsilon(1-\gamma)/(2\gamma)$$
it follows that
$$||v^{\pi}-v^{*}|| \leq \epsilon$$
Hence the value of our greedy policy is $\epsilon$-close to the true value of every state.
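As an informal sanity check of this bound, one can run the sketches from earlier on a small randomly generated MDP and compare $v_\pi$ against a long-run estimate of $v^*$. The state/action counts, discount, and seed below are purely illustrative, and the snippet relies on the hypothetical `value_iteration` and `evaluate_policy` helpers defined above.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, epsilon = 4, 3, 0.9, 1e-3

P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)     # normalize rows into valid transition probabilities
R = rng.random((S, A))                # expected rewards E[r|s,a]

policy, v_final = value_iteration(P, R, gamma, epsilon)
v_pi = evaluate_policy(policy, P, R, gamma)

# Approximate v* by running the Bellman backup far beyond the stopping rule.
v_star = np.zeros(S)
for _ in range(10_000):
    v_star = (R + gamma * np.einsum("saj,j->sa", P, v_star)).max(axis=1)

print(np.max(np.abs(v_pi - v_star)) <= epsilon)   # the theorem guarantees this prints True
```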