# MDP
###### tags: `RL`
## Markov Processes:
- environment is fully observable
- partially observable problems can be converted to MDPs
- Markov Property:
$$\mathbb{P}[S_{t+1}|S_t] = \mathbb{P}[S_{t+1}|S_t, S_{t-1}, \dots, S_0]$$
- State Transition Probability:
$$P_{ss'} = \mathbb{P}[S_{t+1} = s'|S_t = s]$$
- S is a finite set of states
- P is a state transition probability matrix
## Markov Reward Process:
- R is a reward function
$$R_s = \mathbb{E}[R_{t+1}|S_t = s]$$
- $\gamma \in [0,1]$ is the discount factor.
- the return $G_t$ is the total discounted reward from time step $t$ (see the small numeric sketch at the end of this subsection):
$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum^{\infty}_{k=0}{\gamma^{k} R_{t+k+1}}$$
- Reasons for using a discount:
    - we have no perfect model of the future, so distant rewards are trusted less.
    - it keeps the maths bounded (mathematically convenient).
    - in financial settings, immediate rewards can earn more interest than delayed ones.
- It is possible to use an undiscounted MRP ($\gamma = 1$), e.g. if all sequences terminate.
- the value function $v(s)$ gives the long-term value of state $s$:
$$v(s) = \mathbb{E}[G_t | S_t = s]$$
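For intuition, here is a tiny numeric sketch of the return, using a made-up reward sequence and discount factor:
```python
# Discounted return G_t = R_{t+1} + gamma * R_{t+2} + ... for a made-up reward sequence.
rewards = [1.0, 0.0, -2.0, 3.0]   # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}; episode ends afterwards
gamma = 0.9

G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)   # 1 + 0.9*0 + 0.81*(-2) + 0.729*3 = 1.567
```
The value $v(s)$ is then the expectation of such returns over all trajectories starting in $s$.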
## Bellman Equation for MRPs:
$v(s) = \mathbb{E}[G_t|S_t = s]$
$= \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots |S_t = s]$
$= \mathbb{E}[R_{t+1} + \gamma( R_{t+2} + \gamma R_{t+3} + \dots)| S_t = s]$
$= \mathbb{E}[R_{t+1} + \gamma G_{t+1}|S_t = s]$
$= \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) | S_t = s]$ (by the law of iterated expectations)
$$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1})| S_t = s]$$
$$ v(s) = R_s + \gamma \sum_{s' \in S} {P_{ss'}v(s')}$$
In matrix form:
$$\begin{bmatrix}
v(1)\\
v(2)\\
\vdots \\
v(n)
\end{bmatrix} = \begin{bmatrix}
R_1\\
R_2\\
\vdots \\
R_n
\end{bmatrix} + \gamma \begin{bmatrix}
P_{11} & \dots & P_{1n}\\
\vdots && \vdots\\
P_{n1} & \dots & P_{nn}
\end{bmatrix}
\begin{bmatrix}
v(1)\\
v(2)\\
\vdots \\
v(n)
\end{bmatrix}
$$
In compact form:
$$v = R + \gamma P v $$
$$ v = (I-\gamma P)^{-1}R $$
- however, the direct solution has complexity $O(n^3)$, so it is only practical for small MRPs (like the small example below); large MRPs are solved with iterative methods (dynamic programming, Monte Carlo, TD)
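As a sanity check of the closed form above, here is a minimal NumPy sketch for a hypothetical 3-state MRP (all the numbers are made up purely for illustration):
```python
import numpy as np

# Hypothetical 3-state MRP: transition matrix P, reward vector R, discount gamma.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.0, 0.8],
              [0.0, 0.0, 1.0]])   # each row sums to 1; state 3 is absorbing
R = np.array([1.0, -2.0, 0.0])    # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# Direct solution of v = R + gamma * P * v  =>  v = (I - gamma * P)^{-1} R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```
This is exactly the $O(n^3)$ solve mentioned above; it is fine for a handful of states but not for large state spaces.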
## Markov Decision Processes:
- A is a finite set of actions; the transition probabilities and rewards are now conditioned on the action:
$$P^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s , A_t = a]$$
$$ R^a_s = \mathbb{E}[R_{t+1}|S_t = s, A_t = a]$$
- **Policy** :
$$ \pi (a|s) = \mathbb{P}[A_t = a | S_t = s]$$
- policies are stationary (time-independent): due to the Markov property, the MDP depends only on the current state.
With a policy $\pi$ fixed, the MDP reduces to an MRP by averaging over the policy (see the sketch below):
$$P^{\pi}_{ss'} = \sum_{a \in A} \pi(a|s)P^a_{ss'}$$
$$R^{\pi}_s = \sum_{a \in A} \pi(a|s)R^a_s $$
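A small sketch of this averaging step for a hypothetical 2-state, 2-action MDP (the arrays are made up; indexing follows the notation above, with `P[a, s, s2]` for $P^a_{ss'}$, `R[a, s]` for $R^a_s$ and `pi[s, a]` for $\pi(a|s)$):
```python
import numpy as np

# Made-up MDP with 2 states and 2 actions.
P = np.array([[[0.9, 0.1],
               [0.4, 0.6]],
              [[0.2, 0.8],
               [0.5, 0.5]]])      # P[a, s, s2] = P^a_{s s2}
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])        # R[a, s] = R^a_s
pi = np.array([[0.5, 0.5],
               [0.1, 0.9]])       # pi[s, a] = pi(a|s)

# Average over the policy to obtain the induced MRP:
P_pi = np.einsum('sa,asn->sn', pi, P)   # P^pi_{s s'} = sum_a pi(a|s) P^a_{s s'}
R_pi = np.einsum('sa,as->s', pi, R)     # R^pi_s     = sum_a pi(a|s) R^a_s
print(P_pi, R_pi)
```
Once reduced to an MRP, $v_{\pi}$ can be obtained with the same closed form as before: $v_{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}$.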
now we define
- state value function $v_{\pi}(s)$: the expected return starting from state $s$ and then following policy $\pi$
$$v_{\pi} (s) = \mathbb{E}_{\pi}[G_t | S_t = s]$$
and
- action value function $q_{\pi}(s,a)$: the expected return starting from state $s$, taking action $a$, and then following policy $\pi$
$$q_{\pi}(s,a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a]$$
### Bellman Expectation Equation:
$$v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1})| S_t = s]$$
$$q_{\pi}(s,a)= \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1})| S_t = s, A_t = a]$$
$$v_{\pi}(s) = \sum_{a \in A} \pi (a|s) q_{\pi}(s,a)$$
$$q_{\pi}(s, a) = R^a_{s} + \gamma \sum_{s' \in S} P^a_{ss'} v_{\pi}(s')$$
$$ v_{\pi}(s) = \sum_{a \in A} \pi(a|s)\left(R^a_{s} + \gamma \sum_{s' \in S} P^a_{ss'} v_{\pi}(s')\right)$$
$$q_{\pi}(s, a) = R^a_{s} + \gamma \sum_{s' \in S} P^a_{ss'}\sum_{a' \in A} \pi (a'|s') q_{\pi}(s',a')$$
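These equations can be checked numerically with iterative policy evaluation: repeatedly apply the Bellman expectation equation until it reaches its fixed point $v_{\pi}$. A minimal sketch on the same made-up 2-state, 2-action MDP as above:
```python
import numpy as np

# Same made-up MDP and policy as in the earlier sketch.
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])   # P[a, s, s2] = P^a_{s s2}
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[a, s] = R^a_s
pi = np.array([[0.5, 0.5],
               [0.1, 0.9]])                # pi[s, a] = pi(a|s)
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # q_pi(s,a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_pi(s')
    q = R.T + gamma * np.einsum('asn,n->as', P, v).T
    # v_pi(s) = sum_a pi(a|s) q_pi(s,a)
    v_new = np.sum(pi * q, axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v, q)
```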
### Optimal value function:
$$v_*(s) = \max_{\pi} v_{\pi}(s)$$
$$q_*(s,a) = \max_{\pi} q_{\pi}(s, a)$$
the MDP is solved once we know $q_*(s,a)$: acting greedily with respect to it gives an optimal policy.
- partial ordering over policies:
$\pi \geq \pi '$ if $v_{\pi}(s) \geq v_{\pi '}(s)$ for all $s$
### Bellman Optimality Equation:
$$v_*(s) = \max_{a} q_*(s,a)$$
$$q_*(s,a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'}v_*(s')$$
$$v_*(s) = \max_{a}\left(R^a_s + \gamma \sum_{s' \in S} P^a_{ss'}v_*(s')\right)$$
$$q_*(s,a) = R^a_s + \gamma \sum_{s' \in S} P^a_{ss'} \max_{a'}q_*(s', a')$$
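The Bellman optimality equation is non-linear (because of the max), so there is no general closed form; it is solved iteratively, e.g. by value iteration, which repeatedly applies the equation until $v_*$ converges. A minimal sketch on the same made-up MDP, shown only as a numerical illustration:
```python
import numpy as np

# Same made-up 2-state, 2-action MDP as above.
P = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.5, 0.5]]])   # P[a, s, s2] = P^a_{s s2}
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[a, s] = R^a_s
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # q_*(s,a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_*(s')
    q = R.T + gamma * np.einsum('asn,n->as', P, v).T
    # v_*(s) = max_a q_*(s,a)
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

greedy = q.argmax(axis=1)   # acting greedily w.r.t. q_* gives an optimal (deterministic) policy
print(v, greedy)
```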