---
title: Markov Decision Processes
tags: rl
---

# Markov Decision Processes

Markov decision processes formally describe an environment for reinforcement learning where the environment is fully observable, i.e. the current state completely characterizes the process

almost all RL problems can be formalized as MDPs, e.g.
- optimal control primarily deals with continuous MDPs
- partially observable problems can be converted into MDPs
- bandits are MDPs with one state

### Markov Processes

#### Markov Property

"the future is independent of the past given the present"

$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, ..., S_t]$

- the state captures all relevant information from the history
- once the state is known, the history may be thrown away
- i.e. the state is a sufficient statistic of the future

**state transition matrix**

for a Markov state s and successor state s', the *state transition probability* is defined by

$P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$

the state transition matrix P defines transition probabilities from all states s to all successor states s'

![](https://i.imgur.com/sUXtVg0.png)

where each row of the matrix sums to 1

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1$, $S_2$, ... with the Markov property

A Markov Process (or Markov Chain) is a tuple (S, P)
- S is a (finite) set of states
- P is a state transition probability matrix, $P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$

### Markov Reward Process

A Markov reward process is a Markov chain with values

A Markov Reward Process is a tuple (S, P, R, $\gamma$)
- S is a finite set of states
- P is a state transition probability matrix, $P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
- R is a reward function, $R_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$

#### return

the return $G_t$ is the total discounted reward from time-step t

$G_t = R_{t+1} + \gamma R_{t+2} + ... = \sum^\infty_{k=0} \gamma^k R_{t+k+1}$

- the discount $\gamma \in [0, 1]$ is the present value of future rewards
- the value of receiving reward R after k + 1 time-steps is $\gamma^k R$
- this values immediate reward above delayed reward
  - $\gamma$ close to 0 leads to "myopic" evaluation
  - $\gamma$ close to 1 leads to "far-sighted" evaluation

most Markov reward and decision processes are discounted. why?
- mathematically convenient to discount rewards
- avoids infinite returns in cyclic Markov processes
- uncertainty about the future may not be fully represented
- if the reward is financial, immediate rewards may earn more interest than delayed rewards
- animal/human behavior shows preference for immediate reward
- it is sometimes possible to use undiscounted Markov reward processes (i.e. $\gamma$ = 1), e.g. if all sequences terminate
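as a quick numeric illustration of the return and the effect of $\gamma$ (not part of the definitions above), here is a minimal Python sketch; the helper name `discounted_return` and the reward sequence are made up for illustration:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite (terminating) reward sequence."""
    # rewards[k] plays the role of R_{t+k+1}
    return sum(gamma**k * r for k, r in enumerate(rewards))

# the same delayed reward valued myopically vs. far-sightedly
rewards = [0, 0, 0, 10]
print(discounted_return(rewards, gamma=0.1))   # 0.01  ("myopic")
print(discounted_return(rewards, gamma=0.99))  # ~9.70 ("far-sighted")
```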
#### value function

the value function v(s) gives the long-term value of state s

**definition** the *state value function* v(s) of an MRP is the expected return starting from state s

$v(s) = \mathbb{E}[G_t \mid S_t = s]$

#### Bellman Equation for MRPs

the value function can be decomposed into two parts:
- immediate reward $R_{t+1}$
- discounted value of successor state $\gamma v(S_{t+1})$

![](https://i.imgur.com/rIa4iJm.png)

![](https://i.imgur.com/utnb2pE.png)

the Bellman equation can be expressed concisely using matrices

$v = R + \gamma P v$

where v is a column vector with one entry per state

![](https://i.imgur.com/3SX70Sn.png)

solving the Bellman equation
- the Bellman equation is a linear equation
- it can be solved directly:

  $v = R + \gamma P v$

  $(I - \gamma P) v = R$

  $v = (I - \gamma P)^{-1} R$

- computational complexity is $O(n^3)$ for n states
- direct solution only possible for small MRPs
- there are many iterative methods for large MRPs, e.g.
  - dynamic programming
  - Monte-Carlo evaluation
  - temporal-difference learning

### Markov Decision Process

A Markov decision process (MDP) is a Markov reward process with decisions. It is an *environment* in which all states are Markov

**definition** A Markov Decision Process is a tuple ($S$, $A$, $P$, $R$, $\gamma$)
- $S$ is a finite set of states
- $A$ is a finite set of actions
- $P$ is a state transition probability matrix, $P^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $R$ is a reward function, $R^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$

#### policies

**definition** a policy $\pi$ is a distribution over actions given states

$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$

- a policy fully defines the behavior of an agent
- MDP policies depend on the current state (not the history)
- i.e. policies are *stationary* (time-independent), $A_t \sim \pi(\cdot \mid S_t), \forall t > 0$

given an MDP M = ($S$, $A$, $P$, $R$, $\gamma$) and a policy $\pi$
- the state sequence $S_1$, $S_2$, ... is a Markov process (S, $P^\pi$)
- the state and reward sequence $S_1$, $R_2$, $S_2$, ... is a Markov reward process (S, $P^\pi$, $R^\pi$, $\gamma$) where

  $P^\pi_{ss'} = \sum_{a \in A} \pi(a \mid s) P^a_{ss'}$

  $R^\pi_{s} = \sum_{a \in A} \pi(a \mid s) R^a_{s}$
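to make the induced MRP concrete, here is a minimal numpy sketch; the 2-state, 2-action numbers and the array layout `P[a, s, s']`, `R[a, s]` are assumptions for illustration. It builds $P^\pi$ and $R^\pi$ for a uniform random policy and then applies the direct Bellman solve $v_\pi = (I - \gamma P^\pi)^{-1} R^\pi$ from above:

```python
import numpy as np

# made-up 2-state, 2-action MDP, just to make the formulas concrete
# P[a, s, s'] = P[S_{t+1} = s' | S_t = s, A_t = a]
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
# R[a, s] = E[R_{t+1} | S_t = s, A_t = a]
R = np.array([[1.0,  0.0],
              [2.0, -1.0]])
# pi[s, a] = pi(a | s); here a uniform random policy
pi = np.full((2, 2), 0.5)
gamma = 0.9

# induced MRP: P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},  R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum('sa,asp->sp', pi, P)
R_pi = np.einsum('sa,as->s', pi, R)

# direct Bellman solve for the induced MRP: v_pi = (I - gamma P^pi)^{-1} R^pi
v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
print(v_pi)
```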
#### value function

**definition** the *state-value function* $v_\pi(s)$ of an MDP is the expected return starting from state s, and then following policy $\pi$

$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$

the *action-value function* $q_\pi(s, a)$ is the expected return starting from state s, taking action a, and then following policy $\pi$

$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

#### Bellman Expectation Equation

the state-value function can again be decomposed into immediate reward plus discounted value of successor state,

$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$

the action-value function can similarly be decomposed,

$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$

Bellman Expectation Equation for $V^\pi$

![](https://i.imgur.com/lOOdjwp.png)

Bellman Expectation Equation for $Q^\pi$

![](https://i.imgur.com/Z2Eg41w.png)

Bellman Expectation Equation for $v_\pi$

![](https://i.imgur.com/JZPBx0s.png)

Bellman Expectation Equation for $q_\pi$

![](https://i.imgur.com/cSgk7qV.png)

matrix form

the Bellman expectation equation can be expressed concisely using the induced MRP

$v_\pi = R^\pi + \gamma P^\pi v_\pi$

with direct solution

$v_\pi = (I - \gamma P^\pi)^{-1} R^\pi$

#### optimal value function

the *optimal state-value function* $v_*(s)$ is the maximum value function over all policies

$v_*(s) = \max_\pi v_\pi(s)$

the *optimal action-value function* $q_*(s, a)$ is the maximum action-value function over all policies

$q_*(s, a) = \max_\pi q_\pi(s, a)$

- the optimal value function specifies the best possible performance in the MDP
- an MDP is "solved" when we know the optimal value function

#### optimal policy

define a partial ordering over policies

$\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s), \forall s$

**theorem** for any Markov Decision Process
- there exists an optimal policy $\pi_*$ that is better than or equal to all other policies, $\pi_* \geq \pi, \forall \pi$
- all optimal policies achieve the optimal value function, $v_{\pi_*}(s) = v_*(s)$
- all optimal policies achieve the optimal action-value function, $q_{\pi_*}(s, a) = q_*(s, a)$

#### finding an optimal policy

an optimal policy can be found by maximizing over $q_*(s, a)$,

![](https://i.imgur.com/ZvHIumP.png)

- there is always a deterministic optimal policy for any MDP
- if we know $q_*(s, a)$, we immediately have the optimal policy

#### Bellman optimality equations for $v_*$ and $q_*$

ADD GRAPHS/DIAGRAMS HERE

#### solving the Bellman Optimality Equation

- the Bellman Optimality Equation is non-linear
- no closed form solution (in general)
- many iterative solution methods (a value-iteration sketch follows below)
  - value iteration
  - policy iteration
  - q-learning
  - sarsa
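as a rough illustration of one of these methods, here is a minimal value-iteration sketch in numpy, reusing the assumed array layout `P[a, s, s']`, `R[a, s]` from the earlier sketch; the function name and tolerance are arbitrary, and this is a sketch rather than a reference implementation:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup v(s) <- max_a [R^a_s + gamma * sum_{s'} P^a_{ss'} v(s')].

    P[a, s, s'] = P[S_{t+1} = s' | S_t = s, A_t = a]
    R[a, s]     = E[R_{t+1} | S_t = s, A_t = a]
    """
    n_states = P.shape[1]
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v        # q[a, s], one-step lookahead for every action
        v_new = q.max(axis=0)        # greedy backup over actions
        done = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if done:
            return v, q.argmax(axis=0)   # greedy (deterministic) policy w.r.t. q
```

the greedy `argmax` at the end reflects the point above that a deterministic optimal policy always exists once $q_*(s, a)$ is known.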
### extensions to MDPs

#### infinite and continuous MDPs

the following extensions are all possible:
- countably infinite state and/or action spaces
  - straightforward
- continuous state and/or action spaces
  - closed form for the linear quadratic regulator (LQR) model
- continuous time
  - requires partial differential equations
  - Hamilton-Jacobi-Bellman (HJB) equation
  - limiting case of the Bellman equation as the time-step $\to$ 0

#### partially observable MDPs

A partially observable Markov Decision Process is an MDP with hidden states. It is a hidden Markov model with actions

**definition** A POMDP is a tuple (S, A, O, P, R, Z, $\gamma$)
- S is a finite set of states
- A is a finite set of actions
- O is a finite set of observations
- P is a state transition probability matrix, $P^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- R is a reward function, $R^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- Z is an observation function, $Z^a_{s'o} = \mathbb{P}[O_{t+1} = o \mid S_{t+1} = s', A_t = a]$
- $\gamma$ is a discount factor, $\gamma \in [0, 1]$

belief states
- a *history* $H_t$ is a sequence of actions, observations and rewards, $H_t = A_0, O_1, R_1, ..., A_{t-1}, O_t, R_t$
- a *belief state* b(h) is a probability distribution over states, conditioned on the history h

  $b(h) = (\mathbb{P}[S_t = s^1 \mid H_t = h], ..., \mathbb{P}[S_t = s^n \mid H_t = h])$

reductions of POMDPs
- the history $H_t$ satisfies the Markov property
- the belief state b($H_t$) satisfies the Markov property

![](https://i.imgur.com/tijiS2Y.png)

- a POMDP can be reduced to an (infinite) history tree
- a POMDP can be reduced to an (infinite) belief state tree

#### undiscounted, average reward MDPs

an *ergodic Markov process* is
- recurrent: each state is visited an infinite number of times
- aperiodic: each state is visited without any systematic period

**theorem** an ergodic Markov process has a limiting stationary distribution $d^\pi(s)$ with the property

$d^\pi(s) = \displaystyle\sum_{s' \in S} d^\pi(s') P_{s's}$

an MDP is *ergodic* if the Markov chain induced by any policy is ergodic
- for any policy $\pi$, an ergodic MDP has an *average reward per time-step* $\rho^\pi$ that is independent of the start state

$\rho^\pi = \lim\limits_{T \to \infty} \frac{1}{T} \mathbb{E}\left[\displaystyle\sum_{t=1}^{T} R_t\right]$

average reward value function
- the value function of an undiscounted, ergodic MDP can be expressed in terms of average reward
- $v_\pi(s)$ is the extra reward due to starting from state s

![](https://i.imgur.com/y8MCCCf.png)

- there is a corresponding average reward Bellman equation

![](https://i.imgur.com/wU0nYTf.png)
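as a small illustration of the stationary-distribution property above, here is a minimal numpy sketch that approximates $d^\pi$ by repeatedly applying $d \leftarrow dP$; the 2-state transition matrix is made up and assumed ergodic, and the fixed iteration count is arbitrary:

```python
import numpy as np

def stationary_distribution(P, iters=1000):
    """Approximate d satisfying d(s) = sum_{s'} d(s') P_{s's} by power iteration."""
    n = P.shape[0]
    d = np.full(n, 1.0 / n)   # start from the uniform distribution
    for _ in range(iters):
        d = d @ P             # one step of d <- d P
    return d

# made-up ergodic 2-state chain (recurrent and aperiodic)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
d = stationary_distribution(P)
print(d)                      # satisfies d ~= d @ P (about [0.833, 0.167])
```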