Bellman equation
===
Author
[Raj Ghugare](https://github.com/Raj19022000)
###### tags: `notes` `Bellman equation` `Banach fixed point theorem`
### Introduction:
Bellman equations were one of the first ways to formalize MDPs and have been the backbone of theoretical reinforcement learning ever since.
Most readers who have seen the first few [David Silver lectures](https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ) or read introductory RL blogs will be familiar with the intuition behind Bellman equations and why they seem to work or converge. Here I want to go through the mathematical details of the proofs behind basic algorithms such as iterative policy evaluation.
I am going to assume that the reader is familiar with the basic concepts of
1. Value functions - $V(s)$
2. Policy - $\pi(a|s)$
3. Reward function - $R(s,a)$
4. State transition probabilities - $p(s'|s,a)$
### Bellman equations:
Listing the Bellman equation for value functions:
$V^\pi(s) = \Sigma_{a}\pi(a|s)\Sigma_{s'}p(s'|s,a)(E[r|s,a] + \gamma V^\pi(s'))$
Where $\gamma$ is the discounting factor.
Working with matrix equations is more convenient, so let's introduce some notation for that.
1. $r_{\pi}(s) = \Sigma_{a}\pi(a|s).\Sigma_{s'}p(s'|s,a)E[r|s,a]$
This is an $|S|$-dimensional reward vector whose $s$-th entry is the expected one-step reward from state $s$ when following the policy $\pi(a|s)$
2. $P_{\pi}(s,j) = \Sigma_{a}\pi(a|s)p(j|s,a)$
This is an $|S| \times |S|$ [stochastic matrix](https://en.wikipedia.org/wiki/Stochastic_matrix) whose $(s,j)$-th entry gives the total probability of transitioning from state $s$ to state $j$, given that we are following the policy $\pi$
We assume that $\gamma \in [0,1)$
Hence the matrix form of the bellman equation under policy $\pi$ is:
$V^\pi = r_\pi + \gamma P_\pi V^\pi$
Note that we can solve for $V^\pi$ directly:
$V^\pi = (I - \gamma P_\pi)^{-1}r_\pi$, where the inverse always exists: every eigenvalue of $\gamma P_\pi$ has modulus at most $\gamma < 1$, so $I - \gamma P_\pi$ can never have a 0 eigenvalue.
But here we will concentrate on iterative ways of solving for $V^\pi$.
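As a quick sketch of the direct solve (the 2-state values of $r_\pi$, $P_\pi$, and $\gamma$ below are made up purely for illustration, not from the note):

```python
import numpy as np

# Hypothetical toy inputs: r_pi and P_pi are assumed to have already been
# computed from pi(a|s), p(s'|s,a), and E[r|s,a] as defined above.
gamma = 0.9
r_pi = np.array([1.0, 0.0])        # expected one-step reward per state
P_pi = np.array([[0.5, 0.5],       # row-stochastic transition matrix
                 [0.2, 0.8]])

# Direct solution: V = (I - gamma * P_pi)^{-1} r_pi
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Sanity check: V satisfies the Bellman equation V = r_pi + gamma * P_pi @ V
assert np.allclose(V, r_pi + gamma * P_pi @ V)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is the standard numerically stable choice.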
### Prerequisite Math:
[Cauchy Sequence](https://en.wikipedia.org/wiki/Cauchy_sequence):
A sequence $x_1,x_2,\ldots$ is a Cauchy sequence if for every positive real number $\epsilon$
there exists a positive integer $N$ such that for all positive integers $m,n$ greater than $N$,
$|x_m - x_n| < \epsilon$
In words: for every pair $m,n$ beyond $N$, the absolute difference between the terms must be less than $\epsilon$.
[Banach Space](https://en.wikipedia.org/wiki/Banach_space):
A Banach space is a normed vector space that is complete: every Cauchy sequence in the space converges to a point that is itself in the space.
Any finite-dimensional normed vector space over $\mathbb{R}$ is a Banach space.
[Contraction Mapping](https://en.wikipedia.org/wiki/Contraction_mapping):
Given a normed space $V$, a function $T:V \rightarrow V$ is a contraction if
$||Tv - Tu|| \leq \lambda||v-u||$
$\forall u,v \in V$
where $\lambda \in [0,1)$
Note:
The norm $||.||$ we consider throughout is the infinity norm (max norm).
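As a quick numerical illustration of the definition (the matrix $P$ and vector $b$ below are arbitrary made-up data): an affine map $Tv = b + \gamma P v$ with a row-stochastic $P$ and $\gamma \in [0,1)$ satisfies the contraction inequality under the max norm, which we can spot-check on random pairs $u,v$:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9                                    # plays the role of lambda
P = rng.random((3, 3))
P /= P.sum(axis=1, keepdims=True)              # make rows sum to 1
b = rng.random(3)

T = lambda v: b + gamma * P @ v                # affine map T: R^3 -> R^3

# Empirically check ||Tv - Tu||_inf <= gamma * ||v - u||_inf
for _ in range(100):
    u, v = rng.normal(size=3), rng.normal(size=3)
    lhs = np.max(np.abs(T(v) - T(u)))          # infinity (max) norm
    rhs = gamma * np.max(np.abs(v - u))
    assert lhs <= rhs + 1e-12
```

This is exactly the shape of map we will meet again below as the Bellman operator.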
[Banach's Fixed point theorem](https://en.wikipedia.org/wiki/Banach_fixed-point_theorem):
The theorem states that if $V$ is a Banach space and $T:V \rightarrow V$ is a contraction mapping, then there exists a unique $v^*$ in $V$ such that
$$Tv^* = v^*$$
and for any arbitrary $v^0$ in $V$, the sequence {$v^n$} defined by
$$v^n = Tv^{n-1} = T^n{v^0}$$ converges to $v^*$.
$v^*$ is known as the fixed point of $T$ over the space $V$.
Let us first prove that {$v^n$} is a Cauchy sequence.
Without loss of generality, assume that $m>n$.
$||v^{m} - v^{n}|| \leq ||v^{m} - v^{m-1}|| + ||v^{m-1} - v^{n}||$ -(using the triangle inequality)
Applying the triangle inequality repeatedly,
$||v^{m} - v^{n}|| \leq ||v^{m} - v^{m-1}|| + ||v^{m-1} - v^{m-2}||+...+||v^{n+1} - v^{n}||$
$||v^{m} - v^{n}|| \leq \Sigma_{i=0}^{m-n-1}||v^{n+i+1}-v^{n+i}||$
Using the definition of the sequence {$v^n$},
$||v^{m} - v^{n}|| \leq \Sigma_{i=0}^{m-n-1}||T^{n+i} v^{1}- T^{n+i}v^{0}||$
Since we know T is a contraction,
$||v^{m} - v^{n}|| \leq \Sigma_{i=0}^{m-n-1}(\lambda^{n+i}.||v^{1}-v^{0}||)$
$||v^{m} - v^{n}|| \leq (\lambda^{n}.||v^{1}-v^{0}||)\Sigma_{i=0}^{m-n-1}\lambda^{i}$
$||v^{m} - v^{n}|| \leq (\lambda^{n}||v^{1}-v^{0}||)/(1-\lambda)$ -(bounding the partial sum by the infinite geometric series)
For large enough values of $n$ (forcing $m$ to be even larger), the right-hand side, and hence $||v^{m} - v^{n}||$, becomes smaller than any chosen $\epsilon$.
Hence {$v^n$} is a Cauchy sequence.
Since a Banach space is complete, the Cauchy sequence {$v^n$} converges to some limit $v^{*}$ in $V$. We now show that $v^*$ is a fixed point.
$0 \leq ||Tv^* - v^*||$ -(non-negativity of the norm)
Using the triangle inequality
$0 \leq ||Tv^* - v^*|| \leq ||Tv^* - v^n|| + ||v^n - v^*||$
$0 \leq ||Tv^* - v^*|| \leq ||Tv^* - Tv^{n-1}|| + ||v^n - v^*||$
$0 \leq ||Tv^* - v^*|| \leq \lambda||v^* - v^{n-1}|| + ||v^n - v^*||$ -(T is a contraction)
As n tends to infinity,
$||v^* - v^{n}||$ tends towards 0
similarly,
$||v^* - v^{n-1}||$ tends towards 0
Hence,
$||Tv^* - v^*|| \leq 0$
This is only possible if
$Tv^* = v^*$
Hence proved $v^*$ is indeed the fixed point of T over Banach space V.
Note:
The uniqueness can be easily proved by assuming two fixed points and then arriving at a contradiction.
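To see the theorem in action, here is a minimal numerical sketch (the map $T$ and the starting point are illustrative choices, not from the note): iterating the scalar contraction $T(x) = 0.5x + 1$, which has $\lambda = 0.5$ and unique fixed point $v^* = 2$, converges from any starting point.

```python
# Iterate v^n = T(v^{n-1}) for the contraction T(x) = 0.5*x + 1.
# Banach's theorem guarantees convergence to the unique fixed point v* = 2,
# regardless of the (arbitrary) starting value v^0.
T = lambda x: 0.5 * x + 1.0

v = 100.0                          # arbitrary v^0
for n in range(60):
    v = T(v)                       # v^n = T(v^{n-1})

assert abs(T(v) - v) < 1e-9        # v is (numerically) a fixed point
assert abs(v - 2.0) < 1e-9         # and it equals v* = 2
```

Note how the error shrinks by a factor $\lambda = 0.5$ per step, exactly the geometric rate used in the Cauchy-sequence bound above.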
### Bellman equation for $V^\pi$ :
* $V^\pi = r_\pi + \gamma P_\pi V^\pi$
Doesn't this look like a fixed point equation of the form $v = Tv$?
### Bellman Operator:
Let's make this clearer by defining an operator
$L_\pi(U) = r_\pi + \gamma P_\pi U$
If we prove that $L_\pi$ is a contraction, then we can apply Banach's fixed point theorem on the Banach space $\mathbb{R}^{|S|}$ of value vectors to prove the existence of a unique fixed point $V^\pi$.
Proof:
Let $u,v$ be two vectors in $\mathbb{R}^{|S|}$.
$L_\pi v - L_\pi u = \gamma P_\pi (v-u)$
Then for any state $s$,
$|L_\pi v(s) - L_\pi u(s)| = \gamma|\Sigma_{j \in S}P_\pi(j|s)(v(j)-u(j))| \leq \gamma\Sigma_{j \in S}P_\pi(j|s)|v(j)-u(j)|$
Let's say that for some state $s'$ we have
$|v(s') - u(s')| = ||v-u||_{\infty}$
i.e. the maximum absolute component-wise difference.
Then,
$|L_\pi v(s) - L_\pi u(s)| \leq \gamma\Sigma_{j \in S}P_\pi(j|s)|v(s')-u(s')|$
Since $\Sigma_{j \in S}P_\pi(j|s) = 1$,
$|L_\pi v(s) - L_\pi u(s)| \leq \gamma ||v-u||_{\infty}$, for all states $s \in S$
Since this inequality holds for every state $s$, taking the maximum over $s$ gives
$||L_\pi v - L_\pi u||_{\infty} \leq \gamma ||v-u||_{\infty}$
Hence it is proved that $L_\pi$ is a contraction mapping.
Life becomes easy, at least when we know the dynamics of our environment completely: given a policy, we can find the values of all the states using the fixed point theorem.
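Putting the pieces together, here is a minimal sketch of iterative policy evaluation — repeatedly applying $L_\pi$ from an arbitrary starting vector — using the same kind of made-up toy inputs as before (the values of $r_\pi$, $P_\pi$, and $\gamma$ are illustrative assumptions), checked against the direct matrix solve:

```python
import numpy as np

# Hypothetical toy inputs, as would be computed from the MDP dynamics.
gamma = 0.9
r_pi = np.array([1.0, 0.0])
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])

L = lambda V: r_pi + gamma * P_pi @ V      # Bellman operator L_pi

V = np.zeros(2)                            # arbitrary starting point V^0
for _ in range(500):
    V_new = L(V)
    if np.max(np.abs(V_new - V)) < 1e-10:  # max-norm stopping criterion
        V = V_new
        break
    V = V_new

# The iterates converge to the same V^pi given by the direct solve.
V_direct = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
assert np.allclose(V, V_direct, atol=1e-8)
```

Because $L_\pi$ is a $\gamma$-contraction, the max-norm error shrinks by at least a factor $\gamma$ per sweep, so the number of iterations needed grows only logarithmically in the desired accuracy.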