# Finite-sample Analysis for Episodic MDPs

- We are interested in Monte Carlo estimates of Q functions, followed by a greedy action at each iteration, in episodic MDPs.
- The assumption is that at each iteration samples are generated for each state-action pair. If the sample size at each pair goes to infinity, then eventually the policy evaluation is exact and hence the algorithm converges almost surely.
- We want a finite-sample bound at each iteration for each state-action pair. Clearly, even for finite MDPs we cannot have almost sure convergence if we only sample finitely many episodes at each iteration. The best we can hope for is convergence with high probability.
- We are interested in episodic MDPs with undiscounted return. We assume that every policy induces an absorbing MRP.

There are two steps to establish this kind of result:

1. At each iteration, for the policy evaluation step, compute the sample complexity in terms of the MDP parameters and the absorption time associated with $\pi$.
2. Bound the number of policy improvement steps needed to reach the optimal policy.

## Policy evaluation step

Here we assume that the policy $\pi$ is fixed and we analyze the induced Markov reward process. Denote the transition matrix of the absorbing chain in block form by
$$P^\pi:=P = \begin{bmatrix} Q & R \\ O & I\end{bmatrix}.$$

Let $(s,a)$ be an arbitrary state-action pair. We want to find the minimum number $N$ of trajectories starting from $(s,a)$ and following $\pi$ such that the sample mean $\hat{q}(s,a)$ of the $N$ sampled returns approximates the action-value function $q_\pi(s,a)$ accurately.

We say that the length or absorption time of a trajectory is $L$ if it is of the form
$$s_0=s, a_0=a, r_1, s_1, a_1=\pi(s_1), r_2, s_2, \cdots, r_{L-1}, s_{L-1}, a_{L-1}=\pi(s_{L-1}), r_L, s_L$$
where $s_L$ is an absorbing state. It could happen for some $(s,a)$ that $L=1$.

Let $T_{s,a}$ denote the absorption time of a trajectory under $\pi$ starting from $(s,a)$, and let $p_{s,a}$ be the probability that $s_1$ is not terminal. For any distribution $\tau$ of $s_1$ on the transient states, let $T_\tau:= T_{s,a}-1$ denote the remaining absorption time after the first step. It is clear that
$$\begin{align} \mathbb{P}(T_\tau > k \mid T_{s,a}>1) & = \tau Q^k \mathbf{1} , \hspace{10pt} k\geq 0. \end{align}$$
Thus $\mathbb{P}(T_{s,a}=1)=1-p_{s,a}$ and for $k\geq 0$,
$$\begin{align} \mathbb{P}(T_{s,a}>k+1) & = \mathbb{P}(T_{s,a}>1) \mathbb{P}(T_{s,a}>k+1 \mid T_{s,a}>1) \\ & = \mathbb{P}(T_{s,a}>1) \mathbb{P}(T_\tau>k \mid T_{s,a}>1) \\ & = p_{s,a} \tau_{s,a} Q^{k}\mathbf{1} \leq p_{s,a} \|Q^k\mathbf{1}\|_\infty = p_{s,a} \|Q^k\|_{\infty\to\infty} , \end{align}$$
where $\tau_{s,a}$ is the distribution of $s_1$ on the transient states. We note that $p_{s,a}\tau_{s,a}$ is precisely the next-state distribution $\mathcal{P}(\cdot\mid s,a)$ restricted to the transient states. We would like to control this tail bound so that we can say something about the concentration of $\hat{q}$.

### Bounding the tail probabilities of $T_{s,a}$

For a vector $v$ let $v_i$ be its $i$-th component. For $m,n\geq 0$ we have
$$(Q^{m+n}\mathbf{1})_i = (Q^{m}Q^n\mathbf{1})_i = \sum_{j} Q^m_{ij}(Q^{n}\mathbf{1})_j \leq \begin{cases} \|Q^{n}\mathbf{1}\|_\infty (Q^{m}\mathbf{1})_i\\ \max_j Q^m_{ij} \|Q^{n}\mathbf{1}\|_1 \end{cases}$$
and therefore
$$\|Q^{m+n}\mathbf{1}\|_\infty \leq \begin{cases} \|Q^{n}\mathbf{1}\|_\infty \|Q^{m}\mathbf{1}\|_\infty\\ \|Q^m\|_\infty \|Q^{n}\|_1 \hspace{5pt}\text{(viewed as vectors)} \end{cases}$$
Note that powers of $Q$ may still have rows that sum to $1$, so neither factor on its own is automatically smaller than $1$.
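As a quick numerical sanity check of these inequalities, here is a minimal sketch (assuming `numpy`; the substochastic matrix `Q` below is hypothetical, not one from the text). It prints $\|Q^k\mathbf{1}\|_\infty$ for a few powers and verifies the first submultiplicative bound.

```python
import numpy as np

# Hypothetical substochastic matrix Q on 3 transient states (rows sum to < 1;
# the missing mass is the one-step absorption probability).
Q = np.array([[0.5, 0.3, 0.1],
              [0.2, 0.4, 0.3],
              [0.0, 0.5, 0.4]])

def row_sum_norm(A):
    """||A 1||_inf, which equals ||A||_{inf->inf} for a nonnegative matrix A."""
    return np.max(A @ np.ones(A.shape[1]))

# ||Q^k 1||_inf is nonincreasing in k and tends to 0 under the absorption assumption.
for k in range(1, 6):
    print(k, row_sum_norm(np.linalg.matrix_power(Q, k)))

# Submultiplicativity: ||Q^{m+n} 1||_inf <= ||Q^m 1||_inf * ||Q^n 1||_inf.
m, n = 2, 3
lhs = row_sum_norm(np.linalg.matrix_power(Q, m + n))
rhs = row_sum_norm(np.linalg.matrix_power(Q, m)) * row_sum_norm(np.linalg.matrix_power(Q, n))
assert lhs <= rhs + 1e-12
```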
We want an exponential decay. The second bound seems hard to control, so we turn to the first bound. The absorption assumption is equivalent to the statement that $Q^k\to O$ as $k\to\infty$. Hence for any $\epsilon\in(0,1)$ there exists some $K\in\mathbb{N}$ such that $k\geq K$ implies $\|Q^k\mathbf{1}\|_\infty \leq 1-\epsilon$. Thus for all $t\geq 0$ we have the sub-exponential bound
$$\begin{align} & \mathbb{P}(T_{s,a} > t+1)\leq \mathbb{P}(T_{s,a} > \lfloor t/K\rfloor K + 1) \leq p_{s,a}(1-\epsilon)^{\lfloor t/K\rfloor} \leq p_{s,a} C_1 e^{-C_2t} \end{align}$$
where $C_1=\frac{1}{1-\epsilon}$ and $C_2=\frac{1}{K}\log(\frac{1}{1-\epsilon})$. Therefore, to ensure that $\mathbb{P}(T_{s,a}>t+1)\leq \delta$ we can take
$$t\geq \frac{K}{\log\frac{1}{1-\epsilon}}\log \frac{p_{s,a}}{\delta (1-\epsilon)}.$$
In particular, if we take $\epsilon=1-1/e$ we have
$$ t\geq K\left(1+\log \frac{p_{s,a}}{\delta} \right).$$
One could obviously replace $p_{s,a}$ by $1$ in these bounds.

The issue with this bound is that we do not know $K$ explicitly. But this should not cause too much trouble, as $\|Q^k\mathbf{1}\|_\infty$ is nonincreasing in $k$ and we can compute $Q^{2^n}$ iteratively by repeated squaring.

### Sample complexity

Now we estimate the number of samples $N$ needed to estimate $q_\pi(s,a)$ for each pair $(s,a)$. Assume we have access to the gaps
$$\Delta(\pi; s,a):= q_\pi(s,a^\star)-q_\pi(s,a),$$
where $a^\star$ is an action maximizing $q_\pi(s,\cdot)$. Then we can compute
$$\Delta(\pi):=\min_{s}\min_{a: \text{ suboptimal for $\pi$ at $s$}}\Delta(\pi; s,a), \hspace{5pt} \Delta:= \min_{\pi\in \Pi_{\text{det}}} \Delta(\pi).$$

Recall that $\hat{q}(s,a)$ is the sample mean of $G_1,...,G_N$, where $G_i=r_1^{(i)}+\cdots +r_{T_i}^{(i)}$ is the return of an episode of random length $T_i$. Let $E_{s,a,t}$ be the event that at iteration $t$ we get
$$|\hat{q}(s,a)-q_\pi(s,a)|\geq \Delta/2.$$
Then, writing $T:=\max_{1\leq i\leq N} T_i$ (a union bound over the $N$ episodes gives $\mathbb{P}(T>T_0)\leq N\,\mathbb{P}(T_1>T_0)$; we suppress this factor of $N$, which only enters logarithmically), for any $T_0\in\mathbb{N}$ we have
$$\begin{align} \mathbb{P}(E_{s,a,t}) & = \mathbb{P}(E_{s,a,t} \cap \{T\leq T_0\}) + \mathbb{P}(E_{s,a,t} \cap \{T> T_0\}) \\ & \leq \mathbb{P}(E_{s,a,t} \cap \{T\leq T_0\}) + \mathbb{P}(T> T_0) \\ & \leq \mathbb{P}(E_{s,a,t} \cap \{T\leq T_0\}) + p_{s,a} C_1 e^{-C_2 (T_0-1)}. \end{align}$$
Since we have assumed that $r_t\in[0,1]$, applying Hoeffding's inequality to the $G_i$ gives
\begin{align} \mathbb{P}(E_{s,a,t} \cap \{T\leq T_0\}) \leq 2\exp\left(\frac{-2\Delta^2/4}{\sum_{i=1}^N T_0^2/N^2}\right) = 2\exp\left(\frac{-\Delta^2 N}{2 T_0^2}\right). \end{align}
Thus
\begin{align*} \mathbb{P}(E_{s,a,t}) &\leq 2\exp\left(\frac{-\Delta^2 N}{2 T_0^2}\right) + \frac{p_{s,a}}{1-\epsilon} \exp\left(-\frac{T_0-1}{K}\log\frac{1}{1-\epsilon}\right)\\ & = 2\exp\left(\frac{-\Delta^2 N}{2 T_0^2}\right) + p_{s,a}(1-\epsilon)^{\frac{T_0-1}{K}-1}. \end{align*}
One then needs to optimize over $N$ and $T_0$ to make both terms small. We can first fix $\epsilon$, which gives us the corresponding $K$, make the second term $\leq \eta_{s,a,t}/2$ to obtain $T_0$, and then choose $N$ to make the first term $\leq\eta_{s,a,t}/2$. Another way is to balance the exponents of the two terms: setting $\frac{\Delta^2 N}{T_0^2} \asymp T_0$ we get $N \asymp T_0^3\Delta^{-2}$ and hence $\mathbb{P}(E_{s,a,t})\lesssim e^{-CT_0}.$

:::info
A better bound may be possible by directly estimating the sum of all the $r_t$'s instead of first summing them into the $G_i$'s.
:::
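To make the two-step tuning concrete: solving the second term $p_{s,a}(1-\epsilon)^{\frac{T_0-1}{K}-1}\leq \eta/2$ for $T_0$ gives $T_0 \geq 1 + K\bigl(1+\log\frac{2p_{s,a}}{\eta}\big/\log\frac{1}{1-\epsilon}\bigr)$, and the first term then requires $N \geq \frac{2T_0^2}{\Delta^2}\log\frac{4}{\eta}$. The following minimal sketch just evaluates these two formulas (the function name and the example numbers are hypothetical; $K$, $\epsilon$, $\Delta$, $p_{s,a}$ and the per-pair failure budget $\eta$ are assumed known):

```python
import math

def evaluation_sample_size(K, eps, delta_gap, p_sa, eta):
    """
    Two-step tuning sketch for one (s, a) pair at one iteration:
      K        : power with ||Q^K 1||_inf <= 1 - eps
      eps      : the corresponding epsilon in (0, 1)
      delta_gap: the action-value gap Delta
      p_sa     : probability that the first transition is not absorbing
      eta      : target failure probability for this pair at this iteration
    Returns (T_0, N) so that each term in the bound on P(E_{s,a,t}) is <= eta / 2.
    """
    log_inv = math.log(1.0 / (1.0 - eps))
    # Second term: p_sa * (1 - eps)^((T_0 - 1)/K - 1) <= eta / 2.
    T0 = math.ceil(1 + K * (1 + math.log(2 * p_sa / eta) / log_inv))
    # First term: 2 * exp(-Delta^2 N / (2 T_0^2)) <= eta / 2.
    N = math.ceil(2 * T0 ** 2 / delta_gap ** 2 * math.log(4 / eta))
    return T0, N

# Example with hypothetical values.
print(evaluation_sample_size(K=5, eps=1 - 1 / math.e, delta_gap=0.1, p_sa=1.0, eta=0.01))
```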
In summary, by a union bound over $(s,a)$, at each iteration we can find the number of samples needed to ensure that the greedy action selected by our algorithm is correct with high probability.

<!--
:::info
***Does this work?***

We use the bounded difference inequality (Theorem 3.18 [5]). Let $A$ be the total number of actions for some fixed state $s$ and $R^{(s,a_j)}_i$ be the return of a random episode starting from $(s,a_j)$ and following $\pi$, where $i=1,...,N$ and $j=1,..., A$. Then $\{R^{(s,a_j)}_i\}_{i,j}$ are independent random variables. Let
$$X_j := \frac{1}{N} \sum_{i=1}^N R^{(s,a_j)}_i $$
be the estimate for $q_\pi(s,a_j)$ and
$$ \hat{q}_\pi(s,a^\star):=F(R^{(s,a_1)}, ...,R^{(s,a_A)}):=\max_{j=1,...,A} X_j $$
be a function of $NA$ variables that estimates the best action value $q_\pi(s,a^\star)$. From the assumption $0\leq r\leq 1$ we have $0\leq R^{(s,a_j)}_i \leq T_{s,a_j}$. Assume for the moment that $T_{s,a_j}\leq T_0$ for some $T_0$. Then changing each coordinate of $F$ can only change its value by at most $T_0/N$. So the bounded difference inequality tells us that for $\delta>0$ we have
\begin{align*} \mathbb{P} (F \leq \mathbb{E} F - \delta) \leq e^\frac{-\delta^2 N}{2AT_0^2}. \end{align*}
Observe that
\begin{align*} \mathbb{E} F \geq \max_{j=1,...,A} \mathbb{E}\left[X_j\right]=q_\pi(s,a^\star) \end{align*}
and hence
\begin{align*} \mathbb{P} (\hat{q}_\pi(s,a^\star)\leq q_\pi(s,a^\star) - \delta) \leq e^\frac{-\delta^2 N}{2AT_0^2}. \end{align*}
One would also need the second largest action value, written as
$$ G:=\text{snd}_{j=1,...,A} X_j \leq F, $$
for which the bounded difference inequality also applies. As a first step, assume there is a unique optimal action for $s$ under $\pi$. Let $a_-$ denote the best suboptimal action. We hope that $G$ and $q_\pi(s,a_-)$ stay close, so that we might be able to combine the two concentration results. Is it true that for independent random variables $\{X_j\}$ with $\mathbb{E}[X_1]> \mathbb{E}[X_2]\geq \cdots\geq \mathbb{E}[X_A]$ we have
$$ \mathbb{E} [\text{snd}_j X_j] < \mathbb{E}[X_1]= \max_j \mathbb{E} [X_j]? $$
:::
-->

## Policy improvement step

Now that we know how many samples are needed for each policy evaluation step, it remains to bound the total number of iterations needed for convergence. The central task is to quantify the minimum improvement achieved by one iteration. In other words, we would like to make sense of something like the following:

:::success
Suppose $d(\pi_1,\pi_2)$ is some distance measure between policies and $\pi^\star$ is some optimal policy. This metric should reflect the value functions $v_{\pi_i}$ in some way and also have some monotonicity/contraction properties. Define
$$\Lambda:= \min_{\pi: \text{ suboptimal}} d(\pi,\pi^\star). $$
The policy iteration starts at some policy $\pi_0$ and produces $v_t$ (or $q_t$) and $\pi_{t+1}$. Let $d_t:=d(\pi_t, \pi^\star)$ and let $N$ be the maximum number of steps required for convergence.

- Additive version: Find $\lambda>0$ such that $d_{t+1}\leq d_t -\lambda$. Then
$$N \leq \frac{d_0-\Lambda}{\lambda}.$$
- Multiplicative version: Find $\rho\in(0,1)$ such that $d_{t+1}\leq \rho d_{t}$. Then
$$ N \leq \dfrac{\log \frac{\Lambda}{d_0}}{\log\rho}. $$
:::
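For concreteness, here is a minimal sketch of the two iteration-count templates in the box above (pure arithmetic; the function names and the example values of $d_0$, $\Lambda$, $\lambda$, $\rho$ are hypothetical):

```python
import math

def additive_iteration_bound(d0, Lambda, lam):
    """If each iteration decreases d_t by at least lam: N <= (d0 - Lambda) / lam."""
    return math.ceil((d0 - Lambda) / lam)

def multiplicative_iteration_bound(d0, Lambda, rho):
    """If d_{t+1} <= rho * d_t with rho < 1: N <= log(Lambda / d0) / log(rho)."""
    return math.ceil(math.log(Lambda / d0) / math.log(rho))

# Hypothetical numbers: start at distance 10, smallest suboptimal distance 0.5.
print(additive_iteration_bound(d0=10.0, Lambda=0.5, lam=0.25))       # 38
print(multiplicative_iteration_bound(d0=10.0, Lambda=0.5, rho=0.9))  # 29
```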
### Case where all policies are proper

Let $\{1,...,n\}$ be the set of transient states and $t$ the terminal state. Since the value function is always $0$ at $t$, value functions $v$ are viewed as vectors in $\mathbb{R}^n$.

Assume that all (stationary) policies are proper, namely the Markov chain induced by each policy terminates with probability $1$. Then the Bellman operators $T$ and $T_\pi$ for any policy $\pi$ are contractions with respect to some weighted sup-norm defined by
$$ \| J \|_w := \max_{i} \frac{|J(i)|}{w(i)}, $$
where $w(i):=\max_{\pi} w_\pi(i)$ and $w_\pi(i)$ is the expected path length from state $i$ to the terminal state when following $\pi$ (see Sec. 3.3 in [1]). Write $\|w\|_\infty:=\max_i w(i)\geq 1$. Then the contraction constant is given by $\rho:=1- \|w\|_\infty^{-1}$. This contraction property gives (cf. Prop. 2.5.9 or 3.5.2 in [1])
$$ \|v_{\pi_{k+1}} - v^\star\|_w \leq \rho \|v_{\pi_{k}} - v^\star\|_w \implies \|v_{\pi_k} - v^\star\|_w \leq \rho^k \|v_{\pi_{0}} - v^\star\|_w. $$

Let $\mathcal{F}$ be the collection of suboptimal policies. Fix any optimal policy $\pi^\star$ and let
$$ \Delta_{\pi^\star}:=\min_s \min_{a: \hspace{2pt} v_\star(s)>q_{\pi^\star}(s,a)} [v_\star(s) - q_{\pi^\star}(s,a)] $$
be its value function gap. Then for any $\pi\in\mathcal{F}$ and state $i=1,...,n$ we have
\begin{align*} v_\star(i)-v_\pi(i) & = q_{\pi^\star}(i,\pi^\star(i)) - q_\pi(i, \pi(i)) \\ & = [q_{\pi^\star}(i,\pi^\star(i))- q_{\pi^\star}(i,\pi(i)) ] + [q_{\pi^\star}(i,\pi(i)) - q_\pi(i, \pi(i))] \\ & \geq \Delta_{\pi^\star}\mathbb{1}\{\pi(i) \text{ is suboptimal}\} + \sum_{j=1}^n p_{ij}(\pi(i)) [v_\star(j)-v_\pi(j)] \\ & = \Delta_{\pi^\star}\chi_\pi(i) + [Q^\pi (v_\star-v_\pi)](i), \end{align*}
where $Q^\pi$ is the chain on the transient states induced by $\pi$ and the $i$-th component of $\chi_\pi$ indicates whether $\pi(i)$ fails to maximize $q_{\pi^\star}(i, \cdot)$. Thus
\begin{align*} (I_n-Q^\pi) (v_\star-v_\pi) \geq \Delta_{\pi^\star} \chi_\pi, \end{align*}
and multiplying both sides by $(Q^\pi)^m$ and summing over $m=0,1,...$ gives
$$ v_\star-v_\pi \geq \Delta_{\pi^\star} (I_n-Q^\pi)^{-1} \chi_\pi. $$
The inverse is well-defined because we have assumed that all stationary policies are proper. Since $\pi$ is suboptimal, $\chi_\pi$ has at least one component equal to $1$, and $(I_n-Q^\pi)^{-1}=\sum_{m\geq 0}(Q^\pi)^m\geq I_n$ entrywise, so
\begin{align*} \|v_\star-v_\pi\|_w & \geq \frac{\Delta_{\pi^\star}}{\|w\|_\infty} \max_{i=1,...,n} [(I_n-Q^\pi)^{-1} \chi_\pi]_i \geq \frac{\Delta_{\pi^\star}}{\|w\|_\infty} \\ \implies \min_{\pi\in\mathcal{F}} \|v_\star-v_\pi\|_w & \geq \frac{\Delta_{\pi^\star}}{\|w\|_\infty}. \end{align*}
To guarantee convergence it suffices to take $k$ so large that
$$\rho^k \|v_{\pi_0}-v_\star\|_w \leq \frac{\Delta_{\pi^\star}}{\|w\|_\infty},$$
namely take $k$ to be the smallest integer larger than
$$\dfrac{\log \frac{\Delta_{\pi^\star}}{\|w\|_\infty \|v_{\pi_0}-v_\star\|_w}}{\log \rho} = \dfrac{\log \frac{\|w\|_\infty \|v_{\pi_0}-v_\star\|_w}{\Delta_{\pi^\star}}}{-\log \left(1-\|w\|_\infty^{-1}\right)} \leq \|w\|_\infty \log \frac{\|w\|_\infty \|v_\star-v_{\pi_0}\|_w}{\Delta_{\pi^\star}},$$
where, assuming the immediate reward is bounded in $[0,1]$,
$$ \|v_\star-v_{\pi_0}\|_w \leq \|v_\star-v_{\pi_0}\|_\infty \leq \|v_\star\|_\infty \leq \|w\|_\infty. $$
Therefore exact policy iteration converges after at most
$$ K := \left\lceil 2 \|w\|_\infty \log \frac{\|w\|_\infty}{\Delta_{\pi^\star}}\right\rceil $$
iterations. Note that computing $\|w\|_\infty$ amounts to solving an MDP of the same size as the original one; it is only easier in the sense that all the immediate rewards are deterministically $1$.
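The quantities appearing in this bound are easy to evaluate for small examples. The following sketch (assuming `numpy`; the two-state, two-action transition kernel `P` and the gap `Delta_star` are hypothetical) brute-forces $w_\pi=(I_n-Q^\pi)^{-1}\mathbf{1}$ over all deterministic policies, takes the entrywise maximum to obtain $w$, and then evaluates $\rho$ and the iteration bound above:

```python
import itertools
import numpy as np

# Hypothetical toy MDP: 2 transient states, 2 actions; P[a][i] is the row of
# transition probabilities over the transient states (missing mass = absorption).
P = {
    0: np.array([[0.2, 0.5],    # action 0 taken in state 0
                 [0.3, 0.3]]),  # action 0 taken in state 1
    1: np.array([[0.6, 0.3],
                 [0.1, 0.7]]),
}
n = 2
Delta_star = 0.1  # assumed value-function gap Delta_{pi*} (hypothetical)

def expected_absorption_times(Qpi):
    """w_pi = (I - Q^pi)^{-1} 1 for a proper policy pi."""
    return np.linalg.solve(np.eye(n) - Qpi, np.ones(n))

# Brute force over deterministic policies (one action choice per transient state).
w = np.zeros(n)
for actions in itertools.product(P.keys(), repeat=n):
    Qpi = np.stack([P[a][i] for i, a in enumerate(actions)])
    w = np.maximum(w, expected_absorption_times(Qpi))

w_inf = w.max()
rho = 1 - 1 / w_inf                                        # contraction constant
K = int(np.ceil(2 * w_inf * np.log(w_inf / Delta_star)))   # iteration bound
print(w, w_inf, rho, K)
```

The brute force over policies is exactly the difficulty noted above: for a non-trivial MDP one would instead have to solve the longest-expected-path problem.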
The main difficulty in estimating the convergence rate of PI is that the algorithm does not improve uniformly across states, as reflected by the presence of $\chi_\pi$. A reasonable next step is to analyze the improvement with respect to a non-sup norm.

In terms of operators, if $Q$ is the transition matrix of a policy that attains the maximum expected absorption times $w$, then $\|w\|_\infty = \|(I_n-Q)^{-1}\|_{\infty\to\infty}$, and we want an upper bound on this quantity that is easier to compute. Assume that all policies are proper. Recall that for any policy $\pi$ the induced transition matrix satisfies $\|(Q^\pi)^k\|_{\infty\to\infty} \searrow 0$. Since there are finitely many policies, for any $\epsilon\in (0,1)$ there exists $N_\epsilon\in\mathbb{N}$ such that for every policy $\pi$ we have $\|(Q^\pi)^{N_\epsilon} \|_{\infty\to\infty} \leq 1-\epsilon$. In words, for any trajectory, after $N_\epsilon$ steps the probability of absorption is at least $\epsilon$, regardless of which policy is used. Let $Q$ be the matrix induced by an optimal policy for the longest-expected-path problem. Then
\begin{align} \|w\|_\infty = \left\|\sum_{k=0}^\infty Q^k\right\|_{\infty\to\infty} \leq \sum_{k=0}^\infty N_\epsilon \left\| (Q^{N_\epsilon})^k\right\|_{\infty\to\infty} \leq \sum_{k=0}^\infty N_\epsilon \left\| Q^{N_\epsilon}\right\|^k_{\infty\to\infty} = \frac{N_\epsilon}{\epsilon}. \end{align}

Alternatively, note that $\|A\|_{\infty\to\infty}\leq \sqrt{n} \|A\|_{2\to 2}$ for any $n\times n$ matrix $A$ (see [4], Theorem 5.6.18). Thus $\|w\|_\infty$ is bounded by $\sqrt{n}$ divided by the smallest singular value of $I_n-Q$, where $n$ is the number of transient states and $Q$ is as above.

## Difficulties

1. Bound $\|w\|_\infty$ by something computable. Neither bound mentioned above removes the need to consider exponentially many policies.
2. The assumption that we know $\Delta$ and $\Delta_{\pi^\star}$ does not seem realistic, but it may not be possible to do without them.

However, these two difficulties seem intrinsic. Also see [2].

## Next steps

- Compare with existing sample complexity results
- Construct "bad MDPs": Ex. 2.5 in [1]
- Relax the assumption that all policies are proper: [2]
- Bound the error in $L^p$ instead of the sup-norm: [3]

# Theorem Statement

Setting:
- Episodic MDP with undiscounted return
- Immediate rewards are bounded in $[0,1]$
- All policies are proper

Let $w$ be the vector of largest (over policies) expected absorption times. For a policy $\pi$ define $\Delta(\pi)$ to be its value function gap, and let $\Delta:=\min_{\pi} \Delta(\pi)$ be the minimum gap across all deterministic policies.

:::success
**Theorem** Assume $K\in\mathbb{N}$ is such that the transition matrix $Q^\pi$ on the transient states induced by any policy $\pi$ satisfies
$$ \|{(Q^\pi)}^K\|_{\infty\to\infty}\leq \frac{1}{2} $$
and define
$$ L:= \left\lceil 2 \|w\|_\infty \log \frac{\|w\|_\infty}{\Delta(\pi^\star)}\right\rceil, $$
where $\pi^\star$ is an optimal policy. Then for any $\epsilon>0$, setting $\delta=1-(1-\epsilon)^{1/L}$, we have that with probability at least $1-\epsilon$, the policy iteration algorithm converges to an optimal policy within $L$ iterations, where each policy evaluation step requires
$$ N= \frac{18K^2}{\Delta^2} \left( \log \frac{4SA}{\delta}\right)^3 $$
samples for each state-action pair. In other words, this version of MC PI converges with probability at least $1-\epsilon$ given $LSAN$ episodes in total.
:::
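To see what the theorem asks for numerically, here is a minimal calculator for its quantities (all input values are hypothetical; it uses $\Delta(\pi^\star)$ for $L$ and $\Delta$ for $N$, as in the statement):

```python
import math

def mc_pi_sample_counts(K, w_inf, Delta, Delta_opt, S, A, eps):
    """
    Evaluate the theorem's formulas with given (hypothetical) inputs:
      K        : power with ||(Q^pi)^K||_{inf->inf} <= 1/2 for every policy pi
      w_inf    : ||w||_inf, the largest expected absorption time over policies
      Delta    : minimum value-function gap over deterministic policies
      Delta_opt: gap Delta(pi*) of an optimal policy
      S, A     : numbers of states and actions
      eps      : target overall failure probability
    """
    L = math.ceil(2 * w_inf * math.log(w_inf / Delta_opt))   # PI iterations
    delta = 1 - (1 - eps) ** (1 / L)                         # per-iteration failure budget
    N = math.ceil(18 * K ** 2 / Delta ** 2 * math.log(4 * S * A / delta) ** 3)
    return L, N, L * S * A * N                               # total number of episodes

print(mc_pi_sample_counts(K=5, w_inf=10.0, Delta=0.1, Delta_opt=0.1, S=4, A=2, eps=0.05))
```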
# References

[1] Bertsekas, Dimitri P. *Dynamic Programming and Optimal Control*, Vol. II, 4th edition. Athena Scientific (2015).

[2] Hansen, Eric A. "Suboptimality bounds for stochastic shortest path problems." arXiv preprint arXiv:1202.3729 (2012).

[3] [Farahmand, Amir-massoud, Csaba Szepesvári, and Rémi Munos. "Error propagation for approximate policy and value iteration." Advances in Neural Information Processing Systems 23 (2010).](https://proceedings.neurips.cc/paper_files/paper/2010/hash/65cc2c8205a05d7379fa3a6386f710e1-Abstract.html)

[4] Horn, Roger A., and Charles R. Johnson. "Norms for Vectors and Matrices." In *Matrix Analysis*, pp. 313-386. Cambridge University Press (2012). doi:10.1017/CBO9781139020411.008

[5] Van Handel, Ramon. "Probability in High Dimension." Lecture notes, Princeton University (2016).