# [Nuts and Bolts Of Reinforcement Learning](#)
###### tags: `Blog` `MDP` `Bellman Equations` `Value-iteration` `Policy-iteration`
#### Author
[Raj Ghugare](https://github.com/RajGhugare19)
## Table Of Contents:
- [Introduction](#Introduction)
- [Appendix](#Appendix)
- [Proof of Bellman Equation](#Proof-of-Bellman-Equation)
    - [Existence and Uniqueness of Value Function](#Existence-and-Uniqueness-of-Value-Function)
- [Mathematical Prerequisites](#Mathematical-Prerequisites)
- [Contraction Mapping](#Contraction-Mapping)
- [Fixed Point Theorem](#Fixed-Point-Theorem)
- [Cauchy Sequence](#Cauchy-Sequence)
- [Bellman Operator](#Bellman-Operator)
- [Proof that Bellman Operator is a Contraction Mapping](#Proof-that-Bellman-Operator-is-a-Contraction-Mapping)
- [Bellman Optimality Operator](#Bellman-Optimality-Operator)
- [Proof that Bellman Optimality Operator is a Contraction Mapping](#Proof-that-Bellman-Optimality-Operator-is-a-Contraction-Mapping)
- [Value Iteration](#Value-Iteration)
- [Algorithm](#Value-Iteration-Algorithm)
- [Proof](#Value-Iteration-Proof)
- [Policy Iteration](#Policy-Iteration)
- [Algorithm](#Policy-Iteration-Algorithm)
- [Proof](#Policy-Iteration-Proof)
- [References](#References)
## Introduction
Bellman equations were one of the first ways to formalize MDPs and have been the backbone of theoretical reinforcement learning ever since. This post is meant to provide a mathematical overview and fundamental proofs of the basic algorithms in RL.
Hence, I am going to assume that the reader is familiar with the basic concepts of an MDP.
### Appendix
1. Policy - $\pi(s)$
2. Value functions - $V^{\pi}(s)$
3. Reward function - $R(s,a)$
4. State transition probabilities - $p(s'|s,a)$
5. Discount Factor - $\gamma$
6. Return - $\Sigma_{t=0}^{\infty} \gamma^{t}r_{t+1}$
7. Set of all states in the MDP - $S$
8. Set of all actions in the MDP - $A$
### Assumptions
* For our current mathematical analysis we will assume that the state space and the action space of the MDP are finite.
* The policies we deal with will be deterministic, i.e. given a state as input, the policy returns a single action rather than a probability distribution over the action space.
### Proof of Bellman Equation
Sutton and Barto -
> Almost all reinforcement learning algorithms involve estimating value functions—functions
of states (or of state–action pairs) that estimate how good it is for the agent to be in a
given state (or how good it is to perform a given action in a given state)
For any state $s$ in an MDP, by definition,
$$V^{\pi}(s) = E[\Sigma_{t=0}^{\infty}\gamma^{t}r_{t+1}|s_{0}=s,\pi]$$
$$V^{\pi}(s) = E[ r_{1} + \Sigma_{t=1}^{\infty}\gamma^{t}r_{t+1}|s_{0}=s,\pi]$$
We can divide the expectation into two terms,
$$V^{\pi}(s) = E[r_{1}|s_{0}=s,\pi] + E[\Sigma_{t=1}^{\infty}\gamma^{t}r_{t+1}|s_{0}=s,\pi]$$
$$V^{\pi}(s) = R(s,\pi(s)) + E[\Sigma_{t=1}^{\infty}\gamma^{t}r_{t+1}|s_{0}=s,\pi]$$
The rewards obtained from timestep $1$ onwards depend on $s_{0}$ only through the next state $s_{1}$ (the Markov property). Conditioning on $s_{1}=s'$ and averaging over $s'$,
$$V^{\pi}(s) = R(s,\pi(s)) + E_{s'}\big[E[\Sigma_{t=1}^{\infty}\gamma^{t}r_{t+1}|s_{1}=s',\pi]\big]$$
By stationarity of the MDP, the inner expectation is exactly $\gamma V^{\pi}(s')$: the expectation over time-steps has been converted into an expectation over next states. You will encounter such conversions many times throughout the RL literature, so make sure you understand this step fully before proceeding.
$$V^{\pi}(s) = R(s,\pi(s)) + E_{s'}[\gamma V^{\pi}(s')]$$
$$V^\pi(s) = R(s,\pi(s)) + \Sigma_{s'}p(s'|s,\pi(s)) \gamma V^\pi(s')$$
### Existence and Uniqueness of Value Function
Let us assume there are $n$ unique states in the MDP. Then there are $n$ Bellman equations in $n$ unknown variables. For simplification we write these equations in their matrix form:
$$V^{\pi} = R^{\pi} + \gamma P^{\pi}V^{\pi}$$
$V^{\pi} \rightarrow$ $n$-dimensional unknown vector whose co-ordinates are the values of the respective states.
$R^{\pi} \rightarrow$ $n$-dimensional vector whose $i$th entry is the reward $R(s_{i},\pi(s_{i}))$ obtained at state $s_{i}$ (a single number per state, since the policy is deterministic).
$P^{\pi} \rightarrow$ an $(n,n)$ [row stochastic matrix](https://en.wikipedia.org/wiki/Stochastic_matrix) with the $(i,j)$th element giving the probability of jumping to state $j$ from state $i$ while following policy $\pi$.
If the model is completely known, then we can directly solve for $V^{\pi}$:
$$V^\pi = (I - \gamma P^\pi)^{-1}R^\pi$$
To prove that $V^{\pi}$ exists and is unique, we need to show that the matrix $(I - \gamma P^{\pi})$ is always invertible. For this, it is enough to show that $(I - \gamma P^{\pi})V^{\pi}$ is non-zero whenever $V^{\pi}$ is non-zero.
Let $V^{\pi}$ be any vector with $||V^{\pi}||_{\infty} > 0$
$$||(I-\gamma P^{\pi})V^{\pi}||_{\infty} = ||V^{\pi}-\gamma P^{\pi}V^{\pi}||_{\infty}$$
$$||(I-\gamma P^{\pi})V^{\pi}||_{\infty} \geq ||V^{\pi}||_{\infty} - \gamma||P^{\pi}V^{\pi}||_{\infty}$$
Every element of $P^{\pi}V^{\pi}$ is a weighted average of the values of the possible next states, weighted by the transition dynamics of the environment. Since a weighted average can be at most the maximum value, we can say that
$$||P^{\pi}V^{\pi}||_{\infty} \leq ||V^{\pi}||_{\infty}$$
Using this relation,
$$||(I-\gamma P^{\pi})V^{\pi}||_{\infty} \geq ||V^{\pi}||_{\infty} - \gamma||V^{\pi}||_{\infty}$$
$$||(I-\gamma P^{\pi})V^{\pi}||_{\infty} \geq ||V^{\pi}||_{\infty}(1-\gamma)$$
$$||(I-\gamma P^{\pi})V^{\pi}||_{\infty} \geq ||V^{\pi}||_{\infty}(1-\gamma) > 0$$
So $(I-\gamma P^{\pi})$ maps every non-zero vector to a non-zero vector and is therefore invertible. This is one fundamental way to prove the existence and uniqueness of the value function of an MDP.
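To make this concrete, here is a minimal NumPy sketch for a hypothetical 2-state MDP (the transition probabilities and rewards below are made up purely for illustration) that solves for $V^{\pi}$ directly and checks that it satisfies the Bellman equation:

```python
import numpy as np

gamma = 0.9  # discount factor

# Hypothetical 2-state MDP under a fixed deterministic policy pi:
# P_pi[i, j] = p(s_j | s_i, pi(s_i)),   R_pi[i] = R(s_i, pi(s_i))
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
R_pi = np.array([1.0, 2.0])

# Direct solve of V = R_pi + gamma * P_pi V, i.e. V = (I - gamma * P_pi)^{-1} R_pi
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# Sanity check: V_pi satisfies the Bellman equation co-ordinate-wise
assert np.allclose(V_pi, R_pi + gamma * P_pi @ V_pi)
print(V_pi)
```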
## Mathematical Prerequisites
### Contraction Mapping
Given a normed space $V$, a function $T:V \rightarrow V$ is called a contraction mapping if, for all $v,u \in V$, the following is true:
$$||Tv - Tu|| \leq \lambda||v-u||$$
where $\lambda \in [0,1)$
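For example, the map $T(x) = \frac{x}{2} + 1$ on the real line (with the absolute value as norm) is a contraction mapping with $\lambda = \frac{1}{2}$, since $||Tx - Tu|| = \frac{1}{2}||x - u||$ for any two points $x,u$.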
### Fixed Point Theorem
Banach's fixed point theorem states that if:
$V$ is a [Banach Space](https://mathworld.wolfram.com/BanachSpace.html) and
$T:V \rightarrow V$ is a contraction mapping, then there exists a unique $v^*$ in $V$ such that
$$Tv^* = v^*$$
and for any arbitrary $v^0$ in $V$, the sequence $\{v^n\}$ defined by
$$v^n = Tv^{n-1} = T^n{v^0}$$ converges to $v^*$.
$v^*$ is known as the fixed point for T over the space V.
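As a quick illustration (a toy example of my own, not anything specific to RL), repeatedly applying the contraction $T(x) = \frac{x}{2} + 1$ from an arbitrary starting point converges to its unique fixed point $x^* = 2$:

```python
# Fixed-point iteration for the toy contraction T(x) = x/2 + 1, whose fixed point is x* = 2
T = lambda x: 0.5 * x + 1.0

x = 100.0             # arbitrary starting point v^0
for _ in range(50):   # x <- T(x), i.e. x^n = T^n(x^0)
    x = T(x)
print(x)              # ~2.0, the unique fixed point of T
```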
### Cauchy Sequence
A sequence $x_1,x_2,\dots$ is called a Cauchy sequence if for every positive real number $\epsilon$ there exists a positive integer $N$ such that for all integers $m,n$ greater than $N$,
$|x_m - x_n| < \epsilon$
i.e. for every pair $m,n$ beyond $N$ (and there are infinitely many such pairs), the absolute difference is less than $\epsilon$.
## Bellman Operator
The Bellman operator is an operator defined for mathematical convenience. It takes a vector from the value function space $V$ and maps it to another vector in the same space; it basically applies the Bellman equation to every co-ordinate of the vector.
$$L_\pi(U) = r_\pi + \gamma P^\pi U$$
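In code this operator is essentially a one-liner. A minimal sketch (the names `R_pi`, `P_pi` and `gamma` are placeholders of my own for the policy's reward vector, transition matrix and discount factor):

```python
import numpy as np

def bellman_operator(U, R_pi, P_pi, gamma):
    """Apply L_pi to a value vector U: (L_pi U)(s) = R_pi(s) + gamma * sum_s' P_pi[s, s'] * U(s')."""
    return R_pi + gamma * P_pi @ U
```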
### Proof that Bellman Operator is a Contraction Mapping
Let $u,v$ be two vectors in the value function space $V$. Note that these vectors need not be the state values corresponding to an actual policy; they can be arbitrary vectors in the space $V$.
By the definition of the Bellman operator,
$$L_\pi v - L_\pi u = r_\pi + \gamma P^\pi v - (r_\pi + \gamma P^\pi u)$$
$$L_\pi v - L_\pi u = \gamma P^{\pi}(v-u)$$
For any state $s \in S$ (i.e. any co-ordinate of these vectors), the following is true
$$L_\pi v(s) - L_\pi u(s) = \gamma P^{\pi}(v-u)(s)$$
$$L_\pi v(s) - L_\pi u(s) = \gamma \Sigma_{s' \in S}P(s'|s,\pi(s))(v(s')-u(s'))$$
Since $|v(s')-u(s')| \leq ||v-u||_{\infty}$ for every $s'$,
$$|L_\pi v(s) - L_\pi u(s)| \leq \gamma \Sigma_{s' \in S}P(s'|s,\pi(s))\,||v-u||_{\infty}$$
Since $\Sigma_{s' \in S}P(s'|s,\pi(s)) = 1$,
$$|L_\pi v(s) - L_\pi u(s)| \leq \gamma||v-u||_{\infty}$$
Since this is true for every co-ordinate $s$, the following also holds,
$$||L_\pi v - L_\pi u||_{\infty} \leq \gamma||v-u||_{\infty}$$
Hence, it is proved that $L_{\pi}$ is a contraction mapping.
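You can also verify this property numerically. A small self-contained sketch, with made-up `P_pi`, `R_pi` and `gamma`, checking the contraction inequality in the infinity norm for random vectors:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2], [0.3, 0.7]])   # hypothetical row-stochastic transition matrix
R_pi = np.array([1.0, 2.0])                 # hypothetical reward vector under pi
L_pi = lambda U: R_pi + gamma * P_pi @ U    # Bellman operator for pi

rng = np.random.default_rng(0)
v, u = rng.normal(size=2), rng.normal(size=2)

# ||L_pi v - L_pi u||_inf <= gamma * ||v - u||_inf
assert np.max(np.abs(L_pi(v) - L_pi(u))) <= gamma * np.max(np.abs(v - u)) + 1e-12
```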
### Bellman Optimality Operator
The Bellman optimality operator is, yet again, an operator defined for mathematical convenience. It takes a vector from the value function space $V$ and maps it to another vector in the same space; it applies the Bellman optimality equation to every co-ordinate of the vector.
$$L(U) = max_{\pi}(r_\pi + \gamma P^\pi U)$$
The above expression abuses notation slightly; what it actually means is that, for all states $s$ in $S$,
$$LU(s) = max_{a}(r(s,a) + \gamma\Sigma_{s'\in S}P(s'|s,a)U(s'))$$
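As a sketch, here is what this operator looks like in NumPy, assuming hypothetical arrays `R[s, a]` for the rewards and `P[s, a, s']` for the transition probabilities (these array layouts are my own choice, not something defined in the post):

```python
import numpy as np

def bellman_optimality_operator(U, R, P, gamma):
    """(L U)(s) = max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) * U(s') ].

    R has shape (S, A); P has shape (S, A, S) with P[s, a, s'] = p(s' | s, a).
    """
    Q = R + gamma * P @ U   # Q[s, a], shape (S, A)
    return Q.max(axis=1)    # maximise over actions for every state
```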
### Proof that Bellman Optimality Operator is a Contraction Mapping
Let us pre-define some required notations:
$$a_{v(s)}^* \in argmax_a[ R(s,a) + \gamma \Sigma_{s' \in S}P(s'|s,a)v(s')]$$
i.e. $a_{v(s)}^*$ is the action (or any one of the actions) that gives the best expected value, according to the value function $v$, from the state $s$.
Let $u,v$ be two vectors in the value function space $V$ and fix any state $s$. Without loss of generality assume that
$$Lv(s)-Lu(s) \geq 0$$
$$Lv(s)-Lu(s) = (R(s,a_{v(s)}^*) + \gamma \Sigma_{j \in S}p(j|s,a_{v(s)}^*)v(j)) - (R(s,a_{u(s)}^*) + \gamma \Sigma_{j \in S}p(j|s,a_{u(s)}^*)u(j))$$
Since $a_{u(s)}^*$ maximizes the bracketed expression for $u$, replacing it by $a_{v(s)}^*$ can only decrease the term being subtracted, so
$$Lv(s)-Lu(s) \leq (R(s,a_{v(s)}^*) + \gamma \Sigma_{j \in S}p(j|s,a_{v(s)}^*)v(j)) - (R(s,a_{v(s)}^*) + \gamma \Sigma_{j \in S}p(j|s,a_{v(s)}^*)u(j))$$
$$Lv(s)-Lu(s) \leq \gamma \Sigma_{j \in S}p(j|s,a_{v(s)}^*)v(j) - \gamma \Sigma_{j \in S}p(j|s,a_{v(s)}^*)u(j)$$
$$Lv(s)-Lu(s) \leq \gamma \Sigma_{j \in S}p(j|s,a_{v(s)}^*)(v(j)-u(j))$$
$$Lv(s)-Lu(s) \leq \gamma \Sigma_{j \in S}p(j|s,a_{v(s)}^*)||v-u||_{\infty}$$
Since the sum of all probabilities is $1$,
$$Lv(s)-Lu(s) \leq \gamma||v-u||_{\infty}$$
The same argument with the roles of $u$ and $v$ swapped covers the states where $Lu(s)-Lv(s) \geq 0$, so $|Lv(s)-Lu(s)| \leq \gamma||v-u||_{\infty}$ for every state. Therefore
$$||Lv-Lu||_{\infty} \leq \gamma||v-u||_{\infty}$$
Hence, the Bellman optimality operator is a contraction mapping.
## Value Iteration
1) In the value function space we know that the operator $L$ has a unique fixed point $V^*$, which we also call the true (optimal) value of the states.
2) Using Banach's fixed point theorem and the completeness of the value function space, we can say that the sequence $v^{n+1} = L^{n+1}v^{0}$ converges to $V^*$ as $n$ becomes very large,
where $L$ is:
$$Lv = max_\pi[r_\pi + \gamma P^\pi v]$$
assuming $\pi$ to be a deterministic policy.
3) Once these notions are rooted in your mind, the idea of value iteration follows naturally.
### Value Iteration Algorithm
1) Select an arbitrary $v_0 \in V$, pick an $\epsilon > 0$ and set $n=0$
2) Compute $v^{n+1}$ by applying L to $v^{n}$:
$$v^{n+1}(s) = max_a[R(s,a) + \gamma\Sigma_{j \in S}P(j|s,a)\,v^{n}(j)]$$
3) If $||v^{n+1}-v^{n}||_{\infty} \leq \epsilon(1-\gamma)/2\gamma$, then go to step 4; else increment the value of $n$ and go back to step 2.
4) For each state $s \in S$, extract a (near-optimal) policy by acting greedily:
$$ \pi(s) = argmax_a[R(s,a) + \gamma\Sigma_{j \in S}P(j|s,a)\,v^{n}(j)]$$
5) Stop
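Below is a minimal NumPy sketch of the algorithm above, assuming the same hypothetical `R[s, a]` and `P[s, a, s']` array layout as in the Bellman optimality operator sketch:

```python
import numpy as np

def value_iteration(R, P, gamma, eps=1e-6):
    n_states, n_actions = R.shape
    v = np.zeros(n_states)                                  # step 1: arbitrary v^0
    while True:
        v_new = (R + gamma * P @ v).max(axis=1)             # step 2: v^{n+1} = L v^n
        if np.max(np.abs(v_new - v)) <= eps * (1 - gamma) / (2 * gamma):
            break                                           # step 3: stopping criterion met
        v = v_new
    pi = (R + gamma * P @ v).argmax(axis=1)                 # step 4: act greedily w.r.t. v^n
    return v_new, pi
```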
After looking at this algorithm, the question that should arise in your mind is why we stop iterating once $||v^{n+1}-v^{n}|| \leq \epsilon(1-\gamma)/2\gamma$.
Where did this magic number come from?
To resolve this query, let me raise another question first: when do you think we should stop iterating? Hopefully your answer is something like "when our value estimates are epsilon-close to $v^*$". The catch lies in what we mean by "our value estimates". If you mean $v^n$, that would be wrong: $v^n$ is just a point in the value function space that is close to the fixed point $v^*$; it is not necessarily the value function of any policy.
So the right answer is that we should stop when $v_\pi$ is epsilon-close to $v^*$, where $\pi$ is the deterministic greedy policy obtained in step 4. In principle, $v_\pi$ could be found by applying $L_\pi$ iteratively (it is the fixed point of $L_\pi$), but we don't actually need to compute it; we only need to ensure that, whatever it is, it is epsilon-close to $v^*$.
NOTE
The norm used in the following section is always $||.||_{\infty}$
### Value Iteration Proof
Suppose that we have run the value iteration algorithm to completion. We then have:
* $||v^{n+1}-v^{n}|| \leq \epsilon(1-\gamma)/2\gamma$
* $\pi(s) = argmax_a[R(s,a) + \gamma\Sigma_{j \in S}P(j|s,a)\,v^{n}(j)]$
Using triangle inequality:
$$||v^{\pi}-v^{*}|| \leq ||v^{\pi} - v^{n+1}|| + ||v^{n+1} - v^*||$$
Considering the first term on RHS,
Since $v^\pi$ is the fixed point of $L_\pi$,
$$||v^{\pi} - v^{n+1}|| = ||L_\pi v^\pi - v^{n+1}||$$
We know that $\pi$ is the deterministic policy which, for every state, chooses the action maximizing $R(s,a) + \gamma\Sigma_{j \in S}P(j|s,a)v^{n}(j)$. In other words, $\pi$ is greedy with respect to $v^{n}$, so applying $L_\pi$ to $v^{n}$ is the same as applying $L$ to it:
$$L_\pi v^{n} = L v^{n} = v^{n+1}$$
Using this, the contraction property of $L_\pi$, and the triangle inequality,
$$||v^{\pi} - v^{n+1}|| = ||L_\pi v^\pi - L_\pi v^{n}|| \leq \gamma||v^\pi - v^{n}||$$
$$||v^{\pi} - v^{n+1}|| \leq \gamma||v^\pi - v^{n+1}|| + \gamma||v^{n+1} - v^{n}||$$
$$||v^{\pi} - v^{n+1}|| \leq \gamma||v^{n+1} - v^{n}||/(1- \gamma)$$
Considering the second term on RHS,
By applying the triangle inequality repeatedly, and using the fact that the sequence $v^{n+k}$ converges to $v^{*}$, we obtain
$$||v^{n+1}-v^{*}|| \leq \Sigma_{k=0}^{\infty}||v^{n+k+2}-v^{n+k+1}||$$
$$||v^{n+1}-v^{*}|| \leq \Sigma_{k=0}^{\infty}||L^{k+1}v^{n+1}-L^{k}v^{n+1}||$$
Using the fact that $L$ is a contraction, $||L^{k+1}v^{n+1}-L^{k}v^{n+1}|| \leq \gamma^{k}||Lv^{n+1}-v^{n+1}|| = \gamma^{k}||Lv^{n+1}-Lv^{n}|| \leq \gamma^{k+1}||v^{n+1}-v^{n}||$, so
$$||v^{n+1}-v^{*}|| \leq \Sigma_{k=0}^{\infty} \gamma^{k+1}||v^{n+1}-v^{n}||$$
$$||v^{n+1}-v^{*}|| \leq ||v^{n+1}-v^{n}||\,(\gamma/(1-\gamma))$$
By substituting the two results in the main inequality we get,
$$||v^{\pi}-v^{*}|| \leq ||v^{\pi} - v^{n+1}|| + ||v^{n+1} - v^*||$$
$$||v^{\pi}-v^{*}|| \leq 2\gamma||v^{n} - v^{n+1}||/(1-\gamma)$$
Since we know, from the third step of our algorithm that,
$$||v^{n+1}-v^{n}|| \leq \epsilon(1-\gamma)/2\gamma$$
$$||v^{\pi}-v^{*}|| \leq \epsilon$$
Hence the value of our greedy policy is epsilon-close to the true optimal value at every state.
## Policy Iteration
Once you've understood the proof of convergence of value iteration, it is fairly easy to understand why policy iteration converges. Let's dive right into the algorithm:
### Policy Iteration Algorithm
1. Set n as 0.
2. Select an arbitrary deterministic policy, $\pi_n$
3. (Policy Evaluation) Evaluate $v_{\pi_n}$ by applying $L_{\pi_n}$ iteratively to any point in the value function space. As the number of iterations becomes large, we converge towards $v_{\pi_n}$.
We can also find $v_{\pi_n}$ using the equation:
$$V_{\pi_n}=(I-\gamma P^{\pi_n})^{-1}r_{\pi_n} $$
4. (Policy Improvement) Choose $\pi_{n+1}$ by acting greedily according to $v_{\pi_n}$:
$$\pi_{n+1}(s) = argmax_{a}[R(s,a)+\gamma \Sigma_{s' \in S}P(s'|s,a)v_{\pi_{n}}(s')]$$
5. If $\pi_{n} = \pi_{n+1}$ stop, else increment n and go back to step 3.
6. $\pi_{n}$ is the optimal policy $\pi^{*}$.
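A minimal NumPy sketch of this loop, again assuming hypothetical `R[s, a]` and `P[s, a, s']` arrays as inputs:

```python
import numpy as np

def policy_iteration(R, P, gamma):
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)                      # steps 1-2: arbitrary deterministic policy
    while True:
        # Step 3 (policy evaluation): solve (I - gamma * P_pi) v = R_pi directly
        P_pi = P[np.arange(n_states), pi]                   # shape (S, S)
        R_pi = R[np.arange(n_states), pi]                   # shape (S,)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Step 4 (policy improvement): act greedily with respect to v_{pi_n}
        pi_new = (R + gamma * P @ v).argmax(axis=1)
        if np.array_equal(pi_new, pi):                      # step 5: policy stable
            return pi, v                                    # step 6: optimal policy
        pi = pi_new
```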
### Policy Iteration Proof
To prove that policy iteration works, we need to show that the policy gets better at every iteration, i.e. the expected return from every state under the policy at the $(n+1)$th iteration should be greater than (or at least equal to) the expected return from the same state under the policy at the $n$th iteration.
We know that by definition $\pi_{n+1}$ is the greedy policy according to $v_{\pi_n}$, hence for all states:
$$r_{\pi_{n+1}} + \gamma P^{\pi_{n+1}}v_{\pi_{n}} \geq r_{\pi_{n}} + \gamma P^{\pi_{n}}v_{\pi_{n}}$$
Since $v_{\pi_{n}}$ is the fixed point of the operator $L_{\pi_{n}}$, the right hand side equals $v_{\pi_{n}}$, so
$$r_{\pi_{n+1}} + \gamma P^{\pi_{n+1}}v_{\pi_{n}} \geq v_{\pi_{n}}$$
Rearranging,
$$r_{\pi_{n+1}}\geq (I - \gamma P^{\pi_{n+1}})v_{\pi_{n}}$$
Multiplying both sides by $(I - \gamma P^{\pi_{n+1}})^{-1} = \Sigma_{k=0}^{\infty}\gamma^{k}(P^{\pi_{n+1}})^{k}$, which has only non-negative entries and hence preserves the inequality, we get
$$(I - \gamma P^{\pi_{n+1}})^{-1}r_{\pi_{n+1}} \geq v_{\pi_{n}}$$
$$v_{\pi_{n+1}} \geq v_{\pi_{n}}$$
So the policy's value never decreases from one iteration to the next. Moreover, when the algorithm stops ($\pi_{n+1} = \pi_{n}$), the policy is greedy with respect to its own value function, so $v_{\pi_{n}}$ satisfies the Bellman optimality equation and $\pi_{n}$ is optimal; since a finite MDP has only finitely many deterministic policies, this happens after finitely many iterations.
This is the most naive form of policy iteration. It is a well-tested result that there is no need to completely evaluate a policy before acting greedily according to its evaluation, which gives rise to other variants of policy iteration methods.
## References
1) [Sutton and Barto](http://incompleteideas.net/book/the-book-2nd.html)
2) [Balaraman Ravindran's RL course](https://nptel.ac.in/courses/106/106/106106143/)