---
title: Markov Decision Processes
tags: cs 593 rl
---

$MDP = \langle S, A, T, R, \gamma \rangle$

We define the following policy classes:

$\Pi^{HR} = \pi : A_t \sim \pi(h_t)$ (History-dependent randomized)

$\Pi^{HD} = \pi : A_t = \pi(h_t)$ (History-dependent deterministic)

$\Pi^{MR} = \pi : A_t \sim \pi(X_t, t)$ (Markov randomized)

$\Pi^{MD} = \pi : A_t = \pi(X_t, t)$ (Markov deterministic)

$\Pi^{SR} = \pi : A_t \sim \pi(X_t)$ (Stationary/memoryless Markov randomized)

$\Pi^{SD} = \pi : A_t = \pi(X_t)$ (Stationary/memoryless Markov deterministic)

### Tabular discounted stationary infinite horizon MDP

$P(x'|x,a), R(x,a)$

$V^\pi(x) = \mathbb{E}[\sum_{t=0}^\infty \gamma^t r_t \mid X_1 = x]$

$r_\pi \in \mathbb{R}^{|X|}, \; r_{\pi, i} = \sum_a \pi(a \mid x=i) R(x=i,a)$

$P_\pi \in \mathbb{R}^{|X| \times |X|}, \; P_{\pi, i, i'} = \sum_a \pi(a \mid x=i) P(x'=i'|x=i,a)$

$V^\pi = \sum_{t=0}^\infty \gamma^t P_\pi^t r_\pi$

$= r_\pi + \gamma P_\pi r_\pi + \gamma^2 P_\pi^2 r_\pi + ...$

$= r_\pi + \gamma P_\pi V^\pi$

$= \tau^\pi(V^\pi)$ ($\tau^\pi$ = Bellman operator)

**Theorem**: The Bellman equation $V = \tau^\pi(V)$ has a unique solution $V^\pi$.

**Proof**:

$V = r_\pi + \gamma P_\pi V$

$(I - \gamma P_\pi)V = r_\pi$

If $(I - \gamma P_\pi)$ is full rank, then the inverse exists and thus $V = V^\pi$ is unique.

$||P_\pi|| = 1$, so $||\gamma P_\pi|| = \gamma < 1$: every eigenvalue of $\gamma P_\pi$ has magnitude at most $\gamma < 1$, so $(I - \gamma P_\pi)v = 0$ has no nonzero solution and $(I - \gamma P_\pi)$ is full rank.

### Bellman Optimality Equation

$V = \max_\pi (r_\pi + \gamma P_\pi V) = \tau V$

**Theorem**: There exists a unique $V$ that satisfies $V = \tau V$, and it is $V = V^*$.

**Lemma**: The $\tau$ operator is a contraction under the $||\cdot||_\infty$ norm (proof in Puterman 6.2.4 or Silver's course).

**Banach Fixed-Point Theorem**: Suppose $\Omega$ is a Banach space and $\mathcal{L}: \Omega \to \Omega$ is a contraction mapping. Then:

(a) There exists a unique $v^* \in \Omega$ such that $\mathcal{L}v^\star = v^\star$

(b) For all $v^0 \in \Omega$, the sequence $\{v^n\}$ given by $v^{n+1} = \mathcal{L}v^n = \mathcal{L}^{n + 1}v^0$ converges to $v^*$

(Proof in Puterman 6.2.3)

<!-- 3/28/2022 -->

An MDP is **strongly connected** or **communicating** if for any $(x_1, x_2) \in \mathcal X \times \mathcal X$ there exists a policy $\pi \in \Pi^{MD}$ such that, starting from $x_1$ and following $\pi$, there is a positive probability of reaching $x_2$ in finite time.

We define the **diameter** of an MDP $M$ as:

$D(M) = \max_{x' \neq x}\min_{\pi \in \Pi^{MD}} \mathbb E_\pi[\min\{i \geq 1 : x_i = x'\} \mid x_1 = x] - 1$

For a finite-state MDP $M$, $D(M) < \infty$ if and only if $M$ is communicating.

### Infinite Horizon Average-Reward MDP

*Missing Notes*

$V(x)$ is defined using a Cesàro sum; see photos and Divergent Series (Hardy, 1949). Below, $\bar P_\pi = \lim_{T \to \infty} \frac{1}{T}\sum_{t=1}^T P_\pi^{t-1}$ is the Cesàro-limit matrix, $\rho^\pi = \bar P_\pi r_\pi$ is the gain, and $V_\pi^T = \sum_{t=1}^T P_\pi^{t-1}(r_\pi - \rho^\pi)$ is the $T$-step bias, with $V_\pi$ its Cesàro limit.

For a value function $V: \mathcal X \to \mathbb R$: $\textbf{span}(V) = \max_{x \in \mathcal X}V(x) - \min_{x \in \mathcal X} V(x)$

**Lemma:** For any memoryless policy, i.e. $\pi \in \Pi^S$, $\bar P_\pi V_\pi = 0$

**Proof:** Note that $\bar P_\pi P_\pi = \bar P_\pi$, $\bar P_\pi r_\pi = \rho^\pi$, and $\bar P_\pi \rho^\pi = \rho^\pi$.

$\bar P_\pi V_\pi^T = \sum_{t=1}^T \bar P_\pi P_\pi^{t-1}(r_\pi - \rho^\pi)$

$=\sum_{t=1}^T \bar P_\pi(r_\pi - \rho^\pi) = T \bar P_\pi (r_\pi - \rho^\pi) = T(\rho^\pi - \rho^\pi) = 0$

$\bar P_\pi V_\pi = \lim_{T \to \infty} \frac{1}{T}\sum_{t=1}^T \bar P_\pi V_\pi^t = 0$
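A minimal numerical sketch of the average-reward quantities above, assuming a made-up 3-state ergodic chain (the matrix, rewards, and truncation lengths are placeholders, not from the notes): it approximates $\bar P_\pi$, $\rho^\pi$, and the bias $V_\pi$ from the definitions, then checks that $\bar P_\pi V_\pi \approx 0$ as this lemma states, and also checks the Poisson equation stated in the next lemma.

```python
import numpy as np

# Assumed 3-state ergodic chain under a fixed memoryless policy (placeholder values).
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])      # P_pi
r = np.array([1.0, 0.0, 0.5])        # r_pi

# For an aperiodic ergodic chain, the Cesaro limit P_bar equals lim_t P_pi^t,
# so a large matrix power is a good approximation.
P_bar = np.linalg.matrix_power(P, 1000)
rho = P_bar @ r                      # gain rho_pi (same value in every entry)

# Bias V_pi = lim_T sum_{t=1}^T P_pi^{t-1} (r_pi - rho_pi), truncated at T = 1000.
V = np.zeros(len(r))
Pt = np.eye(len(r))
for _ in range(1000):
    V += Pt @ (r - rho)
    Pt = Pt @ P

print("P_bar @ V (should be ~0):", P_bar @ V)                 # this lemma
print("Poisson residual (should be ~0):", rho + V - (r + P @ V))  # next lemma
```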
**Lemma:** For any memoryless policy, i.e. $\pi \in \Pi^S$, $\rho^\pi + V_\pi = r_\pi + P_\pi V_\pi$. This was historically called the Poisson equation, but now it is called the Bellman equation.

**Proof:**

$V_\pi = (I - P_\pi + \bar P_\pi)^{-1}r_\pi - \bar P_\pi r_\pi$

$(I - P_\pi + \bar P_\pi)(V_\pi + \rho^\pi) = r_\pi$

$V_\pi - P_\pi V_\pi + \bar P_\pi V_\pi + \rho^\pi - P_\pi\rho^\pi + \bar P_\pi \rho^\pi = r_\pi$

$V_\pi - P_\pi V_\pi + 0 + \rho^\pi - \rho^\pi + \rho^\pi = r_\pi$ (using $\bar P_\pi V_\pi = 0$ from the previous lemma, $P_\pi \rho^\pi = \rho^\pi$, and $\bar P_\pi \rho^\pi = \rho^\pi$)

$V_\pi + \rho^\pi = r_\pi + P_\pi V_\pi$

**Bellman Optimality Equation:** $\rho + V(x) = \max_{a \in \mathcal A} \bar r(x, a) + \langle P(\cdot | x,a), V \rangle$

Bellman operator: $T(V)(x) = \max_{a \in \mathcal A} \bar r(x, a) + \langle P(\cdot | x,a), V \rangle$

### UCRL2

Regret: $\bar R_T = T\rho^* - \sum_{t=1}^T r(x_t, a_t)$

We estimate $\hat P, \hat r$ at each iteration, then choose the most optimistic $\tilde P$ and $\tilde r$ within confidence sets centered at $\hat P$ and $\hat r$; we can prove that the true $P$ and $r$ lie in these confidence sets with probability $1 - \delta$.

To simplify, we assume that $0 \leq r \leq 1$ and that the reward is deterministic and known, so only $P$ needs to be estimated.

$T_t(x,a) = \sum_{i=1}^t I\{x_i = x, a_i = a\}$

$\hat P_t(x'|x,a) = \frac{1}{\max\{1, T_t(x,a)\}}\sum_{i=1}^t I\{x_i = x, a_i = a, x_{i+1} = x'\}$

$C_t^\delta (x,a) = \Big\{P' \in \Delta^{|\mathcal X|} : ||P' - \hat P_{t-1}(\cdot|x,a)||_1 \leq \sqrt{\frac{|\mathcal X|L_{t-1}(x,a)}{\max\{1, T_{t-1}(x,a)\}}}\Big\}$

$L_t(x,a) = 2 \log\Big(\frac{4|\mathcal X||\mathcal A|T_t(x,a)(1 + T_t(x,a))}{\delta}\Big)$, with $L_t(x,a) = 1$ if $T_t(x,a) = 0$

**Upper bound:** $\bar R_T \leq CD|\mathcal X|\sqrt{|\mathcal A|T\log(T|\mathcal X||\mathcal A|/\delta)}$ w.p. $1 - \delta$

**Lower bound:** For $|\mathcal X| > 3, |\mathcal A| > 2, D \geq b + 2 \log_{|\mathcal A|}|\mathcal X|, T \geq D|\mathcal X||\mathcal A|$:

$\mathbb E[\bar R_T] \geq c\sqrt{D|\mathcal X||\mathcal A|T}$ for some constant $c$

UCRL2 is therefore order optimal up to a factor of $\sqrt{|\mathcal X|D}$ (and log terms).
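A small sketch of how the per-pair confidence set $C_t^\delta(x,a)$ above could be assembled from visit counts. The helper names, the example counts, and the choice of $\delta$ are illustrative assumptions, not part of the notes; the radius and $L_t$ follow the formulas given above.

```python
import numpy as np

def confidence_radius(n_states, n_actions, count_xa, delta):
    """l1 radius of C_t^delta(x, a) given T_{t-1}(x, a) = count_xa visits."""
    if count_xa == 0:
        L = 1.0
    else:
        L = 2 * np.log(4 * n_states * n_actions * count_xa * (1 + count_xa) / delta)
    return np.sqrt(n_states * L / max(1, count_xa))

def empirical_transition(next_state_counts):
    """Empirical P_hat(.|x, a) from counts N(x, a, x'); uniform if never visited."""
    n = next_state_counts.sum()
    if n == 0:
        return np.full(len(next_state_counts), 1.0 / len(next_state_counts))
    return next_state_counts / n

counts = np.array([3, 1, 0, 2])     # assumed visit counts to 4 next states
P_hat = empirical_transition(counts)
radius = confidence_radius(n_states=4, n_actions=2,
                           count_xa=int(counts.sum()), delta=0.05)
# C_t^delta(x, a) = {P' in the simplex : ||P' - P_hat||_1 <= radius}
print(P_hat, radius)
```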
**Multinomial distribution:** $P \in \mathbb R^d$, $P \in \Delta^d \implies P \in \mathbb R^d_+, \sum_{i=1}^d P_i = 1$, i.e. a $d$-sided die where the $i$th entry is the probability of face $i$ landing up.

We can sample $\{e_{s_t}\}_{t=1}^n$ and estimate the probabilities $\hat P = \frac{1}{n} \sum_{t=1}^n e_{s_t}$

**Theorem (Concentration on the simplex, Weissman et al., 2001):** Let $S_1, ..., S_n$ be i.i.d. random variables on $(\Omega, \mathcal F, \mathbb P)$ with $S_t \in \{1, ..., d\}$. Let $P_i = \mathbb P(S_t = i)$ and $\hat P_i = \frac{1}{n} \sum_{t=1}^n I\{S_t = i\}$. Then:

$\mathbb P\Big(||P - \hat P||_1 > \sqrt{\frac{2d\log(2/\delta)}{n}}\Big) \leq \delta$

$C_t^\delta = \{P' : \mathcal X \times \mathcal A \to \Delta^{|\mathcal X|} \mid P'(\cdot | x, a) \in C_t^\delta(x,a) \; \forall \; x \in \mathcal X, a \in \mathcal A\}$

In the rest of this proof, we use an upper bound on $L_t(x,a)$: $L_t(x,a) \leq L = 2 \log\big(\frac{4|\mathcal X||\mathcal A|T(T+1)}{\delta}\big)$

We want $\rho_k = \max_{x \in \mathcal X}\max_\pi\max_{P' \in C_{\tau_k}^\delta} \rho_x^\pi(P')$

We also know that $\rho_k + V_k(x) = \max_{a \in \mathcal A} r(x,a) + \langle \tilde P_k(\cdot | x,a), V_k\rangle$

Define $\tilde M_k$, the extended MDP with the same state space as $M$ but extended action space $\tilde{\mathcal A}_k = \{(a, P') : a \in \mathcal A, P' \in C_{\tau_k}^\delta(x,a)\}$; the reward function is $\tilde r_k(x, (a, P')) = r(x,a)$ and the transition kernel is $\tilde P(\cdot | x, (a, P')) = P'$.

Then solving the extended MDP $\tilde M_k$ using value iteration gives us $\rho_k, V_k$, with

$\pi_k(x) \in \arg\max_{a \in \mathcal A} \big[r(x,a) + \max_{P' \in C_{\tau_k}^\delta(x,a)}\langle P', V_k\rangle\big]$

$\tilde P_k$, the maximizer of the inner $\max$, is the optimistic $P'$ for the original MDP. This inner maximization is tractable because $C_{\tau_k}^\delta(x,a)$ is a polytope (an $\ell_1$ ball intersected with the simplex), so maximizing a linear function over it only requires checking its corners, i.e. greedily moving probability mass onto the states with the largest $V_k$ (see the sketch at the end of this section).

Also, $D(\tilde M_k) \leq D$ (w.p. $1 - \delta$ the true transitions are among the extended actions, so any pair of states can be connected at least as quickly in $\tilde M_k$ as in $M$).

**Lemma:** For a $(\rho, V)$ satisfying the Bellman optimality equation, $span(V) \leq D$ (UCRL2; Bandit Algorithms, Ch. 38)

Therefore, for $\tilde M_k$ we have $\rho_k, \tilde P_k, V_k$ with $span(V_k) \leq D$. Since $V_k$ is only determined up to an additive constant, we choose $V_k$ such that $||V_k||_\infty \leq D/2$.

$\bar R_T = \sum_{k=1}^K\sum_{t=\tau_k}^{\tau_{k+1} - 1}(\rho^* - r(x_t,a_t))$

$\leq \sum_{k=1}^K\sum_{t=\tau_k}^{\tau_{k+1} - 1}(\rho_k - r(x_t,\pi_k(x_t)))$ (by optimism, $\rho_k \geq \rho^*$)

We know that $\rho_k = r(x_t,\pi_k(x_t)) - V_k(x_t) + \langle \tilde P_k(\cdot | x_t, \pi_k(x_t)), V_k \rangle$ (Bellman optimality equation for the optimistic model)

So, for the regret $R_k$ of episode $k$:

$R_k \leq \sum_{t = \tau_k}^{\tau_{k+1}-1} \big[- V_k(x_t) + \langle \tilde P_k(\cdot | x_t, \pi_k(x_t)), V_k \rangle\big]$

$= \sum_{t = \tau_k}^{\tau_{k+1}-1} \big[- V_k(x_t) + \langle P(\cdot | x_t, \pi_k(x_t)), V_k \rangle\big] + \sum_{t = \tau_k}^{\tau_{k+1}-1}\langle \tilde P_k(\cdot | x_t, \pi_k(x_t)) - P(\cdot | x_t, \pi_k(x_t)), V_k \rangle$

$= A_1 + A_2$

$A_1 = \sum_{t = \tau_k}^{\tau_{k+1}-1} \big[V_k(x_{t+1}) - V_k(x_t)\big] + \sum_{t = \tau_k}^{\tau_{k+1}-1}\big[\langle P(\cdot | x_t, \pi_k(x_t)), V_k \rangle - V_k(x_{t+1})\big]$

$\leq D + \sum_{t = \tau_k}^{\tau_{k+1}-1}\big[\langle P(\cdot | x_t, a_t), V_k \rangle - V_k(x_{t+1})\big]$

(the first sum telescopes to $V_k(x_{\tau_{k+1}}) - V_k(x_{\tau_k}) \leq span(V_k) \leq D$; the second is a martingale difference sequence)

$A_2 \leq \sum_{t=\tau_k}^{\tau_{k+1}-1}||\tilde P_k(\cdot | x_t, a_t) - P(\cdot | x_t, a_t)||_1\,||V_k||_\infty$

$\leq \frac{D\sqrt{L|\mathcal X|}}{2}\sum_{(x,a) \in \mathcal{X} \times \mathcal A}\frac{\nu_k(x,a)}{\sqrt{\max(1, T_{\tau_k - 1}(x,a))}}$

where $\nu_k(x,a)$ is the number of visits to $(x,a)$ during episode $k$.
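The inner maximization $\max_{P' \in C_{\tau_k}^\delta(x,a)} \langle P', V \rangle$ referred to above can be sketched as a greedy corner search over the $\ell_1$ ball: push as much mass as the radius allows onto the highest-$V$ state, then take it back from the lowest-$V$ states. The function name and the example inputs below are illustrative assumptions, not from the notes.

```python
import numpy as np

def optimistic_transition(P_hat, V, radius):
    """Maximize <P', V> over {P' in the simplex : ||P' - P_hat||_1 <= radius}."""
    P = P_hat.astype(float).copy()
    best = int(np.argmax(V))
    # Adding radius/2 to one entry and removing radius/2 elsewhere uses the full
    # l1 budget, so cap the increase of the best state at radius/2 (and at 1).
    P[best] = min(1.0, P[best] + radius / 2)
    surplus = P.sum() - 1.0               # mass that must be removed elsewhere
    for i in np.argsort(V):               # lowest-V states lose mass first
        if i == best:
            continue
        take = min(P[i], surplus)
        P[i] -= take
        surplus -= take
        if surplus <= 0:
            break
    return P

P_hat = np.array([0.4, 0.4, 0.2])
V = np.array([1.0, 3.0, 2.0])
print(optimistic_transition(P_hat, V, radius=0.3))   # [0.25, 0.55, 0.2]
```

In this toy call the optimistic kernel moves 0.15 of probability mass from the lowest-value state onto the highest-value one, which keeps $||P' - \hat P||_1 = 0.3$, exactly the allowed radius.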