---
title: POMDPs and Offline RL
tags: cs 593 rl
---
### Richly Observable Markov Decision Process (ROMDP)
$\mathcal M = (\mathcal X, \mathcal Y, \mathcal A, P, P_1, R, O)$
$X_1 \sim P_1$
$X_{t+1} \sim P(\cdot | X_t, A_t)$
$Y_t \sim O(\cdot | X_t)$
$R_t \sim R(X_t, A_t)$
In an ROMDP each observation is generated by a unique hidden state (the supports of $O(\cdot|x)$ are disjoint across states), so knowing $O$ we can construct a decoder $f: \mathcal{Y} \to \mathcal{X}$ that maps each observation back to its generating state
This can be viewed as an MDP on $\mathcal Y$-space with diameter $D_Y$
$P(Y_{t+1}|Y_t, A_t) = \sum_{X_{t+1}} O(Y_{t+1} |X_{t+1})P(X_{t+1}|f(Y_t),A_t)$
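A minimal sketch of this reduction, assuming finite spaces and a known emission matrix with disjoint supports (all function and variable names below are illustrative):

```python
import numpy as np

# Minimal sketch: build the induced Y-space transition kernel of an ROMDP.
# O[x, y]     = O(y | x), emission probabilities (disjoint supports across x)
# P[x, a, x'] = P(x' | x, a), latent transition kernel
def induced_y_mdp(O: np.ndarray, P: np.ndarray) -> np.ndarray:
    n_x, n_y = O.shape
    n_a = P.shape[1]
    f = O.argmax(axis=0)              # decoder f(y): the unique x with O(y|x) > 0
    P_Y = np.zeros((n_y, n_a, n_y))
    for y in range(n_y):
        for a in range(n_a):
            # P(y'|y,a) = sum_{x'} O(y'|x') P(x'|f(y), a)
            P_Y[y, a] = P[f[y], a] @ O
    return P_Y
```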
### Partially Observable Markov Decision Process (POMDP)
$M = (\mathcal X, \mathcal Y, \mathcal A, P, P_1, O, R)$
$X_1 \sim P_1$
$Y_t \sim O(\cdot | X_t)$
$X_{t+1} \sim P(\cdot | X_t, A_t)$
$h_t = \{Y_1, A_1, Y_2, A_2, ..., Y_t\}$
Belief:
$b_t(x) = P(X_t = x | h_t)$
$b_1(x) = P(X_1 = x | Y_1 = y)$
$= \frac{P(X_1 = x, Y_1 = y)}{P(Y_1 = y)}$
$= \frac{O(y|x)P_1(x)}{\sum_{x'} O(y|x')P_1(x')}$
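The notes compute only $b_1$; for later steps, the standard recursive belief update (standard POMDP filtering, not taken from the notes) is
$b_{t+1}(x') = \frac{O(y_{t+1} | x') \sum_x P(x' | x, a_t)\, b_t(x)}{\sum_{x''} O(y_{t+1} | x'') \sum_x P(x'' | x, a_t)\, b_t(x)}$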
Optimal policy for infinite horizon POMDP is undecidable (Madani et al., 1999)
Fixed-horizon POMDP is PSPACE-complete (Papadimitriou and Tsitsiklis, 1987)
Among memoryless policies in a fixed-horizon POMDP, the optimal policy is in general stochastic (Singh et al., 1994), and finding it is NP-hard (Vlassis et al., 2011)
If we have access to an optimization oracle for the corresponding decision problem, UCRL2-style optimism gives a regret bound of $\tilde O(D|\mathcal X|^{3/2}\sqrt{|\mathcal A||\mathcal Y|T})$, where $|\mathcal X| \leq |\mathcal Y|$
### Off-Policy Policy Evaluation (OPPE)
### Off-Policy Policy Optimization (OPPO)
### Off-Policy Risk Assessment (OPRA)
Can we assess the performance of a policy with respect to all risk functionals, or a class of risk functionals, e.g. mean, standard deviation, upper/lower quantiles?
$D = \{(X_t, A_t, R_t)\}$ collected under behavior policy $\pi_b$
$F_{\pi_b}$ is the CDF of $R$ under $\pi_b$
Most risk functionals of interest can be written as a functional of the CDF, $\rho(F_{\pi_b})$
e.g. for $R \geq 0$, $E_{\pi_b}[R] = \int_0^\infty (1 - F_{\pi_b}(t))\,dt$
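A minimal plug-in sketch of a few such functionals computed from sampled returns (function and variable names are illustrative, not from the notes):

```python
import numpy as np

# Minimal plug-in sketch: risk functionals rho(F) computed from sampled returns
# observed under the behavior policy pi_b.
def risk_functionals(returns, alpha: float = 0.1) -> dict:
    r = np.asarray(returns, dtype=float)
    return {
        "mean": r.mean(),                             # rho = E[R]
        "std": r.std(ddof=1),                         # rho = standard deviation
        "lower_quantile": np.quantile(r, alpha),      # rho = lower quantile
        "upper_quantile": np.quantile(r, 1 - alpha),  # rho = upper quantile
    }
```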
### OPPE in MDP
$\mathcal M = (\mathcal X, \mathcal A, P, R, P_1, H, \gamma)$
$R(x,a)$ is deterministic
Behavior policy $\pi_b$ generates trajectories $\tau = (X_1, A_1, \ldots, X_H, A_H)$
We want to find $\eta(\pi) = E_{P_\pi}[\sum_{h=1}^H \gamma^{h-1}r_h]$
$= E_{P_{\pi_b}}\left[\frac{dP_{\pi}}{dP_{\pi_b}}\sum_{h=1}^H \gamma^{h-1}r_h\right]$ if $P_\pi \ll P_{\pi_b}$
In the special case of Euclidean spaces,
$dP_{\pi} = P_1(X_1)\,\pi(A_1 | X_1)\,P(X_2 | X_1, A_1)\cdots\, dX_1\,dA_1\,dX_2\,dA_2\cdots$
$\frac{dP_\pi}{dP_{\pi_b}} = \prod_{h=1}^H \frac{\pi(A_h | X_h)}{\pi_b(A_h | X_h)} = \prod_{h=1}^H \rho_h$
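Filling in the intermediate step (standard, not written in the notes): the initial-state and transition densities are the same under $\pi$ and $\pi_b$, so they cancel in the ratio:
$\frac{dP_\pi}{dP_{\pi_b}} = \frac{P_1(X_1)\left[\prod_{h=1}^H \pi(A_h | X_h)\right]\prod_{h=1}^{H-1} P(X_{h+1} | X_h, A_h)}{P_1(X_1)\left[\prod_{h=1}^H \pi_b(A_h | X_h)\right]\prod_{h=1}^{H-1} P(X_{h+1} | X_h, A_h)} = \prod_{h=1}^H \frac{\pi(A_h | X_h)}{\pi_b(A_h | X_h)}$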
Given trajectories $\{\tau^t\}_{t=1}^T$ under $\pi_b$
$\hat\eta(\pi) = \frac{1}{T}\sum_{t=1}^T \left(\prod_{h=1}^H \rho_h^t\right)\left(\sum_{h=1}^H \gamma^{h-1}r_h^t\right)$ (unbiased Monte Carlo / trajectory-wise IS estimate)
$\eta(\pi) = E_{P_{\pi_b}}\left[\sum_{h=1}^H \gamma^{h-1}\left(\prod_{h'=1}^{h} \rho_{h'}\right) r_h\right]$ (Per-step I.S.)
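A minimal sketch of both estimators (data layout and names are illustrative, not from the notes):

```python
import numpy as np

# Minimal sketch of trajectory-wise and per-step IS. Each trajectory is a
# length-H list of (x, a, r); pi(a, x) and pi_b(a, x) return action probabilities.
def is_estimators(trajectories, pi, pi_b, gamma):
    traj_is, per_step_is = [], []
    for traj in trajectories:
        rho = np.array([pi(a, x) / pi_b(a, x) for (x, a, r) in traj])
        rewards = np.array([r for (x, a, r) in traj])
        discounts = gamma ** np.arange(len(traj))
        # Trajectory-wise IS: one weight = product of rho over the whole horizon
        traj_is.append(np.prod(rho) * np.sum(discounts * rewards))
        # Per-step IS: reward at step h weighted only by the product up to h
        per_step_is.append(np.sum(discounts * np.cumprod(rho) * rewards))
    return float(np.mean(traj_is)), float(np.mean(per_step_is))
```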
Let $\hat V_h(x) = \frac{1}{T_h(x)}\sum_{t=1}^T \rho_h^t\,(r_h^t + \gamma\hat V_{h+1}(X_{h+1}^t))\,I(X_h^t = x)$, where $T_h(x) = \sum_{t=1}^T I(X_h^t = x)$ and $\hat V_{H+1} \equiv 0$
Then $\hat V_H(x) = \frac{1}{T_H(x)}\sum_{t=1}^T \rho_H^t r_H^t I(X_H^t = x)$
Calculate $\hat V_{H-1}(x) = \frac{1}{T_{H-1}(x)} \sum_{t=1}^T \rho_{H-1}^t (r_{H-1}^t + \gamma \hat V_H(X_H^t))I(X_{H-1}^t = x)$, and continuing this dynamic programming downwards in $h$ gives an unbiased estimate of $\eta(\pi)$ with lower variance than the trajectory-wise IS estimate
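A minimal sketch of this backward recursion for a finite state space (the data layout and the choice to average $\hat V_1$ over observed initial states are my assumptions, not from the notes):

```python
import numpy as np
from collections import defaultdict

# Minimal sketch of the backward DP estimator above for a finite state space.
# trajectories[t] is a length-H list of (x, a, r); rho[t][h] = pi/pi_b at step h.
def stepwise_dp_estimate(trajectories, rho, gamma):
    H = len(trajectories[0])
    V_next = defaultdict(float)       # \hat V_{H+1} = 0
    for h in range(H - 1, -1, -1):    # 0-indexed step; corresponds to h = H, ..., 1
        sums, counts = defaultdict(float), defaultdict(int)
        for t, traj in enumerate(trajectories):
            x, a, r = traj[h]
            v_next = V_next[traj[h + 1][0]] if h + 1 < H else 0.0
            sums[x] += rho[t][h] * (r + gamma * v_next)
            counts[x] += 1
        V_next = defaultdict(float, {x: sums[x] / counts[x] for x in counts})
    # \hat V_1 averaged over the observed initial states estimates eta(pi)
    return float(np.mean([V_next[traj[0][0]] for traj in trajectories]))
```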
Doubly Robust Estimator when given $Q'$: See photos
Nan Jiang and Lihong Li, Doubly Robust Off-Policy Value Evaluation for Reinforcement Learning (ICML 2016)
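As recalled from Jiang and Li (2016), not transcribed from the photos referenced above: with an approximate $\hat Q$ and $\hat V(x) = \sum_a \pi(a|x)\hat Q(x,a)$, the DR estimator uses the backward recursion
$\hat V^{DR}_h = \hat V(X_h) + \rho_h\big(r_h + \gamma \hat V^{DR}_{h+1} - \hat Q(X_h, A_h)\big)$, with $\hat V^{DR}_{H+1} = 0$,
and averages $\hat V^{DR}_1$ over trajectories; it remains unbiased if either the importance weights or $\hat Q$ are exact.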