---
title: Markov Decision Processes
tags: cs 593 rl
---
$MDP = \langle S, A, T, R, \gamma \rangle$
We define the following policy classes:
$\Pi^{HR} = \pi : A_t \sim \pi(h_t)$ (History-dependent randomized)
$\Pi^{HD} = \pi : A_t = \pi(h_t)$ (History-dependent deterministic)
$\Pi^{MR} = \pi : A_t \sim \pi(X_t, t)$ (Markov randomized)
$\Pi^{MD} = \pi : A_t = \pi(X_t, t)$ (Markov deterministic)
$\Pi^{SR} = \pi : A_t \sim \pi(X_t)$ (Stationary/Memoryless Markov randomized)
$\Pi^{SD} = \pi : A_t = \pi(X_t)$ (Stationary/Memoryless Markov deterministic)
### Tabular discounted stationary infinite horizon MDP
$P(x'|x,a), R(x,a)$
$V^\pi(x) = \mathbb{E}[\sum_{t=0}^\infty \gamma^tr_t|X_0 = x]$
$r_\pi \in \mathbb{R}^{|X|}, \; r_{\pi, i} = \sum_a \pi(a|x=i)R(x=i,a)$
$P_\pi \in \mathbb{R}^{|X| \times |X|}, \; P_{\pi, i, i'} = \sum_a \pi(a|x=i)P(x'=i'|x=i,a)$
$V^\pi = \sum_{t=0}^\infty \gamma^tP_\pi^tr_\pi$
$= r_\pi + \gamma P_\pi r_\pi + \gamma^2P_\pi^2r_\pi + ...$
$= r_\pi + \gamma P_\pi V^\pi$
$= \tau^\pi(V^\pi)$, where $\tau^\pi$ is the Bellman operator
**Theorem**: The Bellman Equation $V = \tau^\pi(V)$ has a unique solution $V_\pi$
**Proof**:
$V = r_\pi + \gamma P_\pi V$
$(I - \gamma P_\pi)V = r_\pi$
If $(I - \gamma P_\pi)$ is full rank, then the inverse exists and thus $V = V_\pi$ is unique
$||P_\pi||_\infty = 1$, so $||\gamma P_\pi||_\infty = \gamma < 1$
Therefore the Neumann series $\sum_{t \geq 0}(\gamma P_\pi)^t$ converges, so $(I - \gamma P_\pi)$ is full rank and its inverse exists
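Since $(I - \gamma P_\pi)$ is invertible, $V^\pi$ can be computed exactly by solving this linear system. A minimal numpy sketch on a randomly generated tabular MDP (all array names and sizes below are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random tabular MDP: P[a, x, x'] is a row-stochastic transition matrix, R[x, a] a reward table.
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

# A random stationary stochastic policy pi(a | x).
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

# r_pi and P_pi as defined above (policy-averaged rewards and transitions).
r_pi = (pi * R).sum(axis=1)
P_pi = np.einsum("xa,axy->xy", pi, P)

# Policy evaluation: solve (I - gamma * P_pi) V = r_pi.
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Sanity check: V is a fixed point of the Bellman operator tau^pi.
assert np.allclose(V, r_pi + gamma * P_pi @ V)
```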
### Bellman Optimality Equation
$V = \max_\pi (r_\pi + \gamma P_\pi V) = \tau V$
**Theorem**: There exists a unique $V$ that satisfies $V = \tau V$, with $V = V^*$
**Lemma**: The $\tau$ operator is a contraction operator under the $||\cdot||_\infty$ norm
(Proof in Puterman 6.2.4 or Silver course)
**Banach Fixed-Point Theorem**: Suppose $\Omega$ is a Banach space and $\mathcal{L}: \Omega \to \Omega$ is a contraction mapping. Then:
(a) There exists a unique $v^* \in \Omega$ such that $\mathcal{L}v^\star = v^\star$
(b) $\forall \; v^0 \in \Omega$ the sequence $\{v_n\}$ given by $v^{n+1} = \mathcal{L}v^n = \mathcal{L}^{n + 1}v^0$ converges to $v^*$
(Proof in Puterman 6.2.3)
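The fixed-point theorem is what justifies value iteration: starting from any $v^0$ and repeatedly applying the contraction $\tau$ converges to $V^*$ at a geometric rate $\gamma$ in sup norm. A hedged sketch, assuming tabular arrays `P[a, x, x']` and `R[x, a]` shaped as in the previous snippet:

```python
import numpy as np

def bellman_optimality_operator(V, P, R, gamma):
    # (tau V)(x) = max_a [ R(x, a) + gamma * <P(.|x, a), V> ]
    Q = R + gamma * np.einsum("axy,y->xa", P, V)
    return Q.max(axis=1)

def value_iteration(P, R, gamma, tol=1e-8):
    V = np.zeros(R.shape[0])
    while True:
        V_next = bellman_optimality_operator(V, P, R, gamma)
        # Contraction: the sup-norm error shrinks by a factor gamma each sweep.
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
```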
<!--
3/28/2022
-->
An MDP is **strongly connected** or **communicating** if for any $(x_1, x_2) \in \mathcal X \times \mathcal X$ there exists a policy $\pi \in \Pi^{MD}$ such that starting from $x_1$ and following $\pi$ gives a positive probability of reaching $x_2$ in finite time.
We define the **diameter** of an MDP $M$ as:
$D(M) = \max_{x' \neq x}\min_{\pi \in \Pi^{MD}} \mathbb E_\pi[\min\{i \geq 1 : x_i = x'\} \mid x_1 = x] - 1$
For a finite state MDP M, $D(M) < \infty$ if and only if M is strongly communicating
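A small sketch of the strong-connectivity check: build a directed graph with an edge $x \to x'$ whenever some action reaches $x'$ with positive probability, and verify that every state is reachable from every other (function and array names are my own assumptions):

```python
import numpy as np
from collections import deque

def is_communicating(P):
    # P[a, x, x']: the MDP is communicating iff the directed graph with an edge
    # x -> x' whenever max_a P(x'|x, a) > 0 is strongly connected.
    n_states = P.shape[1]
    adj = (P > 0).any(axis=0)          # adj[x, x'] = some action can reach x' from x
    for start in range(n_states):
        seen, queue = {start}, deque([start])
        while queue:                    # BFS over states reachable from `start`
            x = queue.popleft()
            for y in np.flatnonzero(adj[x]):
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        if len(seen) < n_states:
            return False
    return True
```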
### Infinite Horizon Average-Reward MDP
*Missing Notes*
V(x) is defined using the Cesaro sum, see photos and *Divergent Series* (Hardy, 1949)
For a value function $V : \mathcal X \to \mathbb R$: $\textbf{span}(V) = \max_{x \in \mathcal X}V(x) - \min_{x \in \mathcal X} V(x)$
Here $\bar P_\pi = \lim_{T \to \infty} \frac{1}{T}\sum_{t=1}^T P_\pi^{t-1}$ denotes the Cesaro-averaged transition matrix and $\rho^\pi = \bar P_\pi r_\pi$ the gain (average reward).
**Lemma:** For any memoryless policy, i.e. $\pi \in \Pi^S$: $\bar P_\pi V_\pi = 0$
**Proof:**
Note that $\bar P_\pi P_\pi = \bar P_\pi$, and write the truncated ($T$-step) bias as $V_\pi^T = \sum_{t=1}^T P_\pi^{t-1}(r_\pi - \rho^\pi)$
$\bar P_\pi V_\pi^T = \sum_{t=1}^T \bar P_\pi P_\pi^{t-1}(r_\pi - \rho^\pi)$
$=\sum_{t=1}^T \bar P_\pi(r_\pi - \rho^\pi) = T \bar P_\pi (r_\pi - \rho^\pi) = T(\rho^\pi - \rho^\pi) = 0$
$\bar P_\pi V_\pi = \lim_{T \to \infty} \frac{1}{T}\sum_{t=1}^T \bar P_\pi V_\pi^t = 0$
**Lemma:** For any memoryless policy i.e. $\pi \in \Pi^S, \rho^\pi + V_\pi = r_\pi + P_\pi V_\pi$
This is known as the Poisson equation; in RL it is usually called the (average-reward) Bellman equation
**Proof:**
$V_\pi = (I - P_\pi + \bar P_\pi)^{-1}r_\pi - \bar P_\pi r_\pi$
$(I - P_\pi + \bar P_\pi)(V_\pi + \rho^\pi) = r_\pi$
$V_\pi - P_\pi V_\pi + \bar P_\pi V_\pi + \rho^\pi - P_\pi\rho^\pi + \bar P_\pi \rho^\pi = r_\pi$
$V_\pi - P_\pi V_\pi + 0 + \rho^\pi - \rho^\pi + \rho^\pi = r_\pi$
$V_\pi + \rho^\pi = r_\pi + P_\pi V_\pi$
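These identities are easy to check numerically. A sketch that approximates $\bar P_\pi$ by the Cesaro average $\frac{1}{T}\sum_{t=0}^{T-1} P_\pi^t$ for a large $T$ and then verifies both lemmas (all names are illustrative; a real implementation could instead compute $\bar P_\pi$ from the stationary distribution):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n)

# Approximate the Cesaro limit P_bar = lim_T (1/T) sum_{t<T} P_pi^t.
T, M, P_bar = 5000, np.eye(n), np.zeros((n, n))
for _ in range(T):
    P_bar += M / T
    M = M @ P_pi

rho = P_bar @ r_pi                                            # gain rho^pi
V = np.linalg.solve(np.eye(n) - P_pi + P_bar, r_pi) - rho     # bias via the deviation formula

print(np.allclose(P_bar @ V, 0, atol=1e-3))                   # Lemma: P_bar V_pi = 0
print(np.allclose(rho + V, r_pi + P_pi @ V, atol=1e-3))       # Poisson / Bellman equation
```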
**Bellman Optimality Equation:**
$\rho + V(x) = \max_{a \in \mathcal A} \big[\bar r(x, a) + \langle P(\cdot | x,a), V \rangle\big]$
Bellman Operator: $(TV)(x) = \max_{a \in \mathcal A} \big[\bar r(x, a) + \langle P(\cdot | x,a), V \rangle\big]$
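One standard way to solve this fixed-point equation is relative value iteration: apply $T$ and re-center by subtracting the value at a reference state so the iterates stay bounded; the subtracted quantity converges to $\rho$. A hedged sketch, assuming the MDP is such that relative value iteration converges (e.g. unichain and aperiodic):

```python
import numpy as np

def relative_value_iteration(P, R, tol=1e-8, max_iter=100_000):
    # P[a, x, x'], R[x, a]; returns estimates of (rho, V) satisfying
    # rho + V(x) = max_a [ R(x, a) + <P(.|x, a), V> ], with V fixed only up to a constant.
    V = np.zeros(R.shape[0])
    rho = 0.0
    for _ in range(max_iter):
        TV = (R + np.einsum("axy,y->xa", P, V)).max(axis=1)   # apply the operator T
        rho = TV[0]                                            # value at reference state 0
        V_next = TV - rho                                      # re-center the iterate
        if np.max(np.abs(V_next - V)) < tol:
            break
        V = V_next
    return rho, V_next
```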
### UCRL2
Regret: $\bar R_T = T\rho^* - \sum_{t=1}^T r(x_t, a_t)$
We estimate $\hat P, \hat r$ at each episode, then choose the most optimistic $\tilde P$ and $\tilde r$ inside confidence sets centered at $\hat P$ and $\hat r$; we can prove that the true $P$ and $r$ lie in these confidence sets w.p. $1 - \delta$
To simplify, we assume that $0 \leq r \leq 1$ and that the reward is deterministic and known, so only $P$ has to be estimated.
$T_t(x,a) = \sum_{i=1}^t I\{x_i = x, a_i = a\}$
$\hat P_t(x'|x,a) = \frac{\sum_{i=1}^tI\{x_i = x, a_i = a, x_{i+1} = x'\}}{\max\{1, T_t(x,a)\}}$
$C_t^\delta (x,a) = \Big\{P' \in \Delta^{|\mathcal X|} : ||P' - \hat P_{t-1}(\cdot|x,a)||_1 \leq \sqrt{\frac{|\mathcal X|L_{t-1}(x,a)}{\max\{1, T_{t-1}(x,a)\}}}\Big\}$
$L_t(x,a) = 2 \log\Big(\frac{4|\mathcal X||\mathcal A|T_t(x,a)(1 + T_t(x,a))}{\delta}\Big)$, with $L_t(x,a) = 1$ if $T_t(x,a) = 0$
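A sketch of how these empirical quantities could be computed from one trajectory (the array names, shapes, and the helper itself are my own assumptions, not part of UCRL2's specification):

```python
import numpy as np

def empirical_model(states, actions, n_states, n_actions, delta):
    # states[0..t], actions[0..t-1]: states[i+1] is the state reached after (states[i], actions[i]).
    T_counts = np.zeros((n_states, n_actions))         # T_t(x, a)
    N = np.zeros((n_states, n_actions, n_states))      # transition counts N_t(x, a, x')
    for i in range(len(actions)):
        x, a, x_next = states[i], actions[i], states[i + 1]
        T_counts[x, a] += 1
        N[x, a, x_next] += 1
    P_hat = N / np.maximum(1, T_counts)[..., None]      # \hat P_t(x' | x, a)
    with np.errstate(divide="ignore"):
        L = 2 * np.log(4 * n_states * n_actions * T_counts * (1 + T_counts) / delta)
    L[T_counts == 0] = 1.0                              # L_t(x, a) = 1 when (x, a) never visited
    radius = np.sqrt(n_states * L / np.maximum(1, T_counts))   # L1 radius of C_t^delta(x, a)
    return P_hat, radius
```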
**Upper Bound:**
$\bar R_T \leq CD|\mathcal X|\sqrt{|\mathcal A|T\log(T|\mathcal X||\mathcal A|/\delta)} \; w.p. \; 1 - \delta$
**Lower Bound:**
For $|\mathcal X| > 3, |\mathcal A| > 2, D \geq b + 2 \log_{|\mathcal A|}|\mathcal X|, T \geq D|\mathcal X||\mathcal A|,$
$$\mathbb E[\bar R_T] \geq c\sqrt{D|\mathcal X||\mathcal A|T} \quad \text{for some constant } c > 0$$
UCRL2 is order optimal in $T$, but off from the lower bound by a factor of $\sqrt{|\mathcal X|D}$
Multinomial distribution: $P \in \Delta^d \implies P \in \mathbb R^d_+, \sum_{i=1}^dP_i = 1$
I.e. a d-sided die where the ith entry is the probability of i being face up
We can draw samples $s_1, ..., s_n \sim P$, represent each as a one-hot vector $e_{s_t}$, and estimate the probabilities by $\hat P = \frac{1}{n} \sum_{t=1}^n e_{s_t}$
**Theorem:** Concentration on Simplex, Weissman et al., 2001
Let $S_1, ..., S_n$ be i.i.d. random variables on $(\Omega, \mathcal F, \mathbb P)$ taking values in $\{1, ..., d\}$
Let $P_i = \mathbb P(S_t = i)$ and $\hat P_i = \frac{1}{n} \sum_{t=1}^n I\{S_t = i\}$, then:
$\mathbb P\Big(||P - \hat P||_1 > \sqrt{\frac{2d\log(2/\delta)}{n}}\Big) \leq \delta$
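A quick empirical sanity check of the bound: sample repeatedly from a fixed multinomial $P$, form $\hat P$ as the average of one-hot vectors as above, and count how often the $\ell_1$ deviation exceeds the stated radius (an illustrative sketch; the constants come from the theorem statement):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, delta = 6, 2000, 0.05
P = rng.random(d); P /= P.sum()

radius = np.sqrt(2 * d * np.log(2 / delta) / n)

failures = 0
for _ in range(1000):
    samples = rng.choice(d, size=n, p=P)
    P_hat = np.bincount(samples, minlength=d) / n      # average of the one-hot vectors e_{s_t}
    failures += np.abs(P - P_hat).sum() > radius

print(failures / 1000)   # should be well below delta = 0.05
```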
$C_t^\delta = \{P' : \mathcal X \times \mathcal A \to \Delta^{|\mathcal X|} \mid P'(\cdot | x, a) \in C_t^\delta(x,a) \; \forall \; x \in \mathcal X, a \in \mathcal A\}$
In the rest of this proof, we use the upper bound $L_t(x,a) \leq L = 2 \log(\frac{4|\mathcal X||\mathcal A|T(T+1)}{\delta})$
We want $\rho_k = \max_{x \in \mathcal X}\max_\pi\max_{P' \in C_{\tau_k}^\delta} \rho_x^\pi(P')$
We also know that $\rho_k + V_k(x) = \max_{a \in \mathcal A}\big[r(x,a) + \langle \tilde P_k(\cdot | x,a), V_k\rangle\big]$
Define $\tilde M_k$, the extended MDP with the same state space as $M$ but extended action space $\tilde{\mathcal A}_k(x) = \{(a, P') : a \in \mathcal A, P' \in C_{\tau_k}^\delta(x,a)\}$; the reward function is $\tilde r_k(x, (a, P')) = r(x,a)$ and the transition kernel is $\tilde P_k(\cdot | x, (a, P')) = P'(\cdot)$
Then solving the extended MDP $\tilde M_k$ using Value Iteration gives us $\rho_k, V_k$, with $\pi_k(x) \in \arg\max_{a \in \mathcal A} \big[r(x,a) + \max_{P' \in C_{\tau_k}^\delta(x,a)}\langle P', V_k\rangle\big]$
The maximizing $P'$ in the inner $\max$ above is the optimistic transition model $\tilde P_k$ for the original MDP; this inner maximization is a linear objective over the polytope $C_{\tau_k}^\delta(x,a)$ (an $\ell_1$-ball intersected with the simplex), so the optimum is attained at a vertex and can be found greedily by shifting probability mass toward the states with the largest $V_k$; see the sketch below
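A sketch of this inner maximization, using the greedy vertex construction from the UCRL2-style analysis: put as much mass as the radius allows on the state with the largest $V$, then remove the excess from the states with the smallest $V$ (function and argument names are my own):

```python
import numpy as np

def optimistic_transition(p_hat, radius, V):
    # Maximize <p, V> over { p in the simplex : ||p - p_hat||_1 <= radius }.
    p = p_hat.copy()
    best = np.argmax(V)
    p[best] = min(1.0, p_hat[best] + radius / 2.0)   # add at most radius/2 to the best state
    for s in np.argsort(V):                          # take the excess from the worst states first
        if p.sum() <= 1.0:
            break
        p[s] = max(0.0, p[s] - (p.sum() - 1.0))
    return p
```

Maximizing $r(x,a)$ plus the inner product of the resulting optimistic distribution with $V$ over the actions then gives one step of value iteration on the extended MDP $\tilde M_k$.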
Also, $D(\tilde M_k) \leq D$
**Lemma:** For a $(\rho, V)$ satisfying the Bellman optimality equation,
$span(V) \leq D$
(UCRL2, Bandit Algorithms CH38)
Therefore, for $\tilde M_k$, we have $\rho_k, \tilde P_k, V_k$ and we know $\text{span}(V_k) \leq D$; since $V_k$ is only defined up to an additive constant, we choose the shift such that $||V_k||_\infty \leq D/2$
$\bar R_T = \sum_{k=1}^K\sum_{t=\tau_k}^{\tau_{k+1} - 1}(\rho^* - r(x_t,a_t))$
$\leq \sum_{k=1}^K\sum_{t=\tau_k}^{\tau_{k+1} - 1}(\rho_k - r(x_t,\pi_k(x_t)))$ (since $\rho_k \geq \rho^*$ w.h.p. by optimism)
We know that $\rho_k = r(x_t,\pi_k(x_t)) - V_k(x_t) + \langle \tilde P_k(\cdot | x_t, \pi_k(x_t)), V_k \rangle$
(Bellman optimality equation for optimistic model)
$R_k \leq \sum_{t = \tau_k}^{\tau_{k+1}-1} - V_k(x_t) + \langle \tilde P_k(\cdot | x_t, \pi_k(x_t)), V_k \rangle$
$= \sum_{t = \tau_k}^{\tau_{k+1}-1} - V_k(x_t) + \langle P(\cdot | x_t, \pi_k(x_t)), V_k \rangle$
$+ \sum_{t = \tau_k}^{\tau_{k+1}-1}\langle \tilde P_k(\cdot | x_t, \pi_k(x_t)) - P(\cdot | x_t, \pi_k(x_t)), V_k \rangle$
(where $P$ is the true transition kernel)
$= A_1 + A_2$
$A_1 = \sum_{t = \tau_k}^{\tau_{k+1}-1} \big(V_k(x_{t+1}) - V_k(x_t)\big) + \big(\langle P(\cdot | x_t, \pi_k(x_t)), V_k \rangle - V_k(x_{t+1})\big)$
$\leq D + \sum_{t = \tau_k}^{\tau_{k+1}-1}\big(\langle P(\cdot | x_t, \pi_k(x_t)), V_k \rangle - V_k(x_{t+1})\big)$
(the telescoping sum is bounded by $\text{span}(V_k) \leq D$; the remaining sum is a martingale difference sequence, bounded w.h.p. via Azuma-Hoeffding)
$A_2 \leq \sum_{t=\tau_k}^{\tau_{k+1}-1}||\tilde P_k(\cdot | x_t, a_t) - P(\cdot | x_t, a_t)||_1||V_k||_\infty$
$\leq \frac{D\sqrt{L|\mathcal X|}}{2}\sum_{(x,a) \in \mathcal{X} \times \mathcal{A}}\frac{T_k(x,a)}{\sqrt{\max(1, T_{\tau_k - 1}(x,a))}}$
where $T_k(x,a)$ is the number of visits to $(x,a)$ during episode $k$