---
title: POMDPs and Offline RL
tags: cs 593 rl
---

### Richly Observable Markov Decision Process (ROMDP)

$M = (\mathcal{X}, \mathcal{Y}, \mathcal{A}, P, P_1, R, O)$

$X_1 \sim P_1$
$X_{t+1} \sim P(\cdot \mid X_t, A_t)$
$Y_t \sim O(\cdot \mid X_t)$
$R_t \sim R(\cdot \mid X_t, A_t)$

Observations are rich enough to identify the latent state: knowing $O$, we can find a decoder $f: \mathcal{Y} \to \mathcal{X}$.

This can be viewed as an MDP on $\mathcal{Y}$-space with diameter $D_Y$:

$P(Y_{t+1} \mid Y_t, A_t) = \sum_{X_{t+1}} O(Y_{t+1} \mid X_{t+1})\, P(X_{t+1} \mid f(Y_t), A_t)$

### Partially Observable Markov Decision Process (POMDP)

$M = (\mathcal{X}, \mathcal{Y}, \mathcal{A}, P, P_1, O, R)$

$X_1 \sim P_1$
$Y_1 \sim O(\cdot \mid X_1)$
$X_2 \sim P(\cdot \mid X_1, A_1)$

History: $h_t = \{Y_1, A_1, Y_2, A_2, \ldots, Y_t\}$

Belief: $b_t(x) = P(X_t = x \mid h_t)$

$b_1(x) = P(X_1 = x \mid Y_1 = y) = \dfrac{P(X_1 = x, Y_1 = y)}{P(Y_1 = y)} = \dfrac{O(y \mid x)\, P_1(x)}{\sum_{x'} O(y \mid x')\, P_1(x')}$

In general, the belief is updated recursively: $b_{t+1}(x') \propto O(Y_{t+1} \mid x') \sum_x P(x' \mid x, A_t)\, b_t(x)$

Finding the optimal policy for an infinite-horizon POMDP is undecidable (Madani et al., 1999).

Fixed-horizon POMDPs are PSPACE-complete (Papadimitriou and Tsitsiklis, 1987).

For memoryless policies in a fixed-horizon POMDP, the optimal policy is stochastic (Singh et al., 1994), but finding it is NP-hard (Vlassis et al., 2011).

If we have access to an oracle that solves the corresponding optimization problem, we can combine it with UCRL2-style optimism to obtain a regret bound $\tilde{O}\!\left(D |\mathcal{X}|^{3/2} \sqrt{|\mathcal{A}| |\mathcal{Y}| T}\right)$, where $|\mathcal{X}| \leq |\mathcal{Y}|$.

### Off-Policy Policy Evaluation (OPPE)

### Off-Policy Policy Optimization (OPPO)

### Off-Policy Risk Assessment (OPRA)

Can we assess the performance of a policy with respect to all risk functionals, or a class of risk functionals, e.g. mean, standard deviation, upper/lower quantiles?

$D = \{(X_t, A_t, R_t)\}$ collected from $\pi_b$

$F_{\pi_b}$ is the CDF of the return $R$ under $\pi_b$

Most important risk functionals can be written as a function $\rho(F_{\pi_b})$, e.g. $E_{\pi_b}[R] = \int (1 - F_{\pi_b}(t))\, dt$ (for $R \geq 0$)

### OPPE in MDP

$\mathcal{M} = (\mathcal{X}, \mathcal{A}, P, R, P_1, H, \gamma)$, where $R(x, a)$ is deterministic

Behavior policy $\pi_b$ generates trajectories $\tau = (X_1, A_1, \ldots, X_H, A_H)$

We want to estimate

$\eta(\pi) = E_{P_\pi}\left[\sum_{h=1}^H \gamma^{h-1} r_h\right] = E_{P_{\pi_b}}\left[\dfrac{dP_\pi}{dP_{\pi_b}} \sum_{h=1}^H \gamma^{h-1} r_h\right]$ if $P_\pi \ll P_{\pi_b}$ (absolute continuity)

In the special case of Euclidean spaces, $dP_\pi = P_1(X_1)\, \pi(A_1 \mid X_1)\, P(X_2 \mid X_1, A_1) \cdots\, dX_1\, dA_1\, dX_2\, dA_2 \cdots$

The initial-state and transition terms cancel in the ratio, so $\dfrac{dP_\pi}{dP_{\pi_b}} = \prod_{h=1}^H \dfrac{\pi(A_h \mid X_h)}{\pi_b(A_h \mid X_h)} = \prod_{h=1}^H \rho_h$

Given trajectories $\{\tau^t\}_{t=1}^T$ under $\pi_b$:

$\hat{\eta}(\pi) = \dfrac{1}{T} \sum_{t=1}^T \left(\prod_{h=1}^H \rho_h^t\right) \left(\sum_{h=1}^H \gamma^{h-1} r_h^t\right)$ (unbiased Monte Carlo estimate; trajectory-wise importance sampling)

$\eta(\pi) = E_{P_{\pi_b}}\left[\sum_{h=1}^H \gamma^{h-1} \left(\prod_{h'=1}^{h} \rho_{h'}\right) r_h\right]$ (per-step importance sampling: the reward at step $h$ only needs the weights up to step $h$)

Let $\hat{V}_h(x) = \dfrac{1}{T_h(x)} \sum_{t=1}^T \rho_h^t \left(r_h^t + \gamma \hat{V}_{h+1}(X_{h+1}^t)\right) I(X_h^t = x)$, where $T_h(x) = \sum_{t=1}^T I(X_h^t = x)$

Then $\hat{V}_H(x) = \dfrac{1}{T_H(x)} \sum_{t=1}^T \rho_H^t r_H^t I(X_H^t = x)$

Calculate $\hat{V}_{H-1}(x) = \dfrac{1}{T_{H-1}(x)} \sum_{t=1}^T \rho_{H-1}^t \left(r_{H-1}^t + \gamma \hat{V}_H(X_H^t)\right) I(X_{H-1}^t = x)$; continuing this dynamic programming downwards on $h$ gives an unbiased estimate of $\eta(\pi)$, and its variance can be analyzed recursively as well.

Doubly robust estimator when given an approximate $\hat{Q}$: see photos

Jiang and Li (2016): Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
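
Some illustrative sketches (my own additions, not from the lecture). First, the belief update: a minimal discrete Bayes filter, assuming the POMDP is given as NumPy arrays `P1` (initial distribution over $\mathcal{X}$), `P` (transition kernel indexed `[x, a, x']`), and `O` (observation kernel indexed `[x, y]`); these names and shapes are my own convention.

```python
import numpy as np

def initial_belief(P1, O, y1):
    """b_1(x) ∝ O(y1 | x) P1(x), normalized over x."""
    b = O[:, y1] * P1
    return b / b.sum()

def belief_update(b, P, O, a, y_next):
    """b_{t+1}(x') ∝ O(y_{t+1} | x') * sum_x P(x' | x, a) b_t(x)."""
    predicted = b @ P[:, a, :]           # prediction step: sum_x b_t(x) P(x' | x, a)
    b_next = O[:, y_next] * predicted    # correction step: weight by observation likelihood
    return b_next / b_next.sum()

# Toy example: |X| = 2 latent states, |Y| = 2 observations, |A| = 1 action
P1 = np.array([0.5, 0.5])
P = np.array([[[0.9, 0.1]],
              [[0.2, 0.8]]])             # P[x, a, x']
O = np.array([[0.8, 0.2],
              [0.3, 0.7]])               # O[x, y]
b = initial_belief(P1, O, y1=0)
b = belief_update(b, P, O, a=0, y_next=1)
```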
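
Next, a minimal sketch of the trajectory-wise and per-step importance sampling estimators of $\eta(\pi)$, assuming each logged trajectory is a list of `(x, a, r)` tuples and that `pi(a, x)` and `pi_b(a, x)` return action probabilities; the data format and function signatures are assumptions for illustration, not a fixed API.

```python
import numpy as np

def is_estimate(trajectories, pi, pi_b, gamma):
    """Trajectory-wise IS: mean over trajectories of (prod_h rho_h) * (sum_h gamma^{h-1} r_h)."""
    vals = []
    for traj in trajectories:
        rho = np.prod([pi(a, x) / pi_b(a, x) for (x, a, r) in traj])
        ret = sum(gamma**h * r for h, (x, a, r) in enumerate(traj))
        vals.append(rho * ret)
    return float(np.mean(vals))

def per_step_is_estimate(trajectories, pi, pi_b, gamma):
    """Per-step IS: the reward at step h is weighted only by rho_1 * ... * rho_h."""
    vals = []
    for traj in trajectories:
        total, rho = 0.0, 1.0
        for h, (x, a, r) in enumerate(traj):
            rho *= pi(a, x) / pi_b(a, x)   # cumulative importance weight up to step h
            total += gamma**h * rho * r
        vals.append(total)
    return float(np.mean(vals))
```

Per-step IS typically has lower variance than the trajectory-wise estimator because each reward is multiplied only by the weights of the actions that preceded it.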
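
Finally, the doubly robust estimator (Jiang and Li, 2016) combines the importance weights with an approximate $\hat{Q}$. This is a sketch of the standard step-wise recursion $\hat{V}_{DR}^{(h)} = \hat{V}(X_h) + \rho_h\big(r_h + \gamma \hat{V}_{DR}^{(h+1)} - \hat{Q}(X_h, A_h)\big)$, with $\hat{V}_{DR}^{(H+1)} = 0$ and $\hat{V}(x) = \sum_a \pi(a \mid x)\hat{Q}(x, a)$; it is not necessarily identical to the derivation in the lecture photos, and `Q_hat` / `V_hat` are hypothetical callables.

```python
import numpy as np

def dr_estimate(trajectories, pi, pi_b, Q_hat, V_hat, gamma):
    """Doubly robust OPE: backward recursion
    V_DR^(h) = V_hat(x_h) + rho_h * (r_h + gamma * V_DR^(h+1) - Q_hat(x_h, a_h)),
    averaged over logged trajectories."""
    vals = []
    for traj in trajectories:
        v_dr = 0.0                          # V_DR^(H+1) = 0
        for (x, a, r) in reversed(traj):
            rho = pi(a, x) / pi_b(a, x)
            v_dr = V_hat(x) + rho * (r + gamma * v_dr - Q_hat(x, a))
        vals.append(v_dr)
    return float(np.mean(vals))
```

With exact importance weights (known $\pi_b$) the estimator is unbiased for any $\hat{Q}$, and a good $\hat{Q}$ reduces the variance relative to pure importance sampling.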