# A Graph Perspective on Generalized Policy Iteration Algorithms
Given a finite Markov decision process (MDP), the most common way to solve the control problem is to adopt some form of generalized policy iteration (GPI), where one alternates between estimating the action value function $Q(s,a)$ for the current policy and setting the new policy to be greedy with respect to the estimate of $Q$. By the policy improvement theorem, if the evaluation step is exact, then the new greedy policy $\pi'$ is guaranteed to improve upon the current policy $\pi$ in the sense that the state value function satisfies $v_\pi(s)\leq v_{\pi'}(s)$ for all states $s$. Moreover, the inequality must be strict at some state $s$ unless the policy $\pi$ is already optimal. (See [1] for details.)
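In symbols, with $q_\pi$ denoting the exact action value function of $\pi$, the improvement step sets
$$\pi'(s) \in \arg\max_{a} q_\pi(s,a) \quad \text{for every state } s.$$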
Now construct a complete graph whose vertex set is the collection $\Pi$ of deterministic stationary policies for this MDP. The state value function induces a preorder on $\Pi$ defined by $\pi\leq \pi'$ if and only if $v_\pi\leq v_{\pi'}$ componentwise, as well as an equivalence relation $\equiv$ defined by $\pi\equiv\pi'$ if and only if $v_\pi=v_{\pi'}$. Also write $\pi< \pi'$ if $\pi\leq \pi'$ and $v_\pi(s)<v_{\pi'}(s)$ for some state $s$. Define $\tilde{\Pi}:=\Pi/\equiv$; the preorder on $\Pi$ induces a partial order on $\tilde{\Pi}$ whose maximum element is $[\pi^\star]$, the class of any optimal policy. For ease of language we also refer to the classes $[\pi]$ as policies, even though they are really classes of policies.
In this setting, GPI corresponds to a random walk on the quotient graph $\tilde{\Pi}$. If the policy evaluation step is exact, the policy improvement theorem says that for any starting policy $[\pi_0]\in\tilde{\Pi}$, the random walk (which is in fact deterministic) is a strictly increasing sequence which terminates at the absorbing state $[\pi^\star]$.
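To make this concrete, here is a minimal sketch of exact GPI (classical policy iteration) on a small randomly generated MDP; the transition kernel `P`, reward table `R`, and discount `gamma` below are illustrative assumptions rather than part of the setup above. Along the resulting walk, the value functions are componentwise non-decreasing, with a strict improvement at some state at every step, until the greedy policy stabilizes at a member of $[\pi^\star]$.

```python
import numpy as np

# Minimal sketch: exact policy iteration on a small synthetic MDP, illustrating
# the strictly increasing walk on policy classes described above.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9                      # illustrative sizes and discount
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']: transition kernel
R = rng.random((S, A))                       # R[s, a]: expected reward

def evaluate(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi."""
    P_pi = P[np.arange(S), pi]               # (S, S) transition matrix under pi
    r_pi = R[np.arange(S), pi]               # (S,) reward vector under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def greedy(v):
    """Greedy policy with respect to the one-step lookahead of v."""
    return (R + gamma * P @ v).argmax(axis=1)

pi = rng.integers(A, size=S)                 # arbitrary starting policy pi_0
values = [evaluate(pi)]
while True:
    pi_next = greedy(values[-1])
    if np.array_equal(pi_next, pi):          # greedy policy stable: reached [pi*]
        break
    pi = pi_next
    values.append(evaluate(pi))

# The walk is strictly increasing: v_{pi_t} <= v_{pi_{t+1}} componentwise,
# with strict inequality at some state (up to numerical tolerance).
for v_prev, v_next in zip(values, values[1:]):
    assert np.all(v_next >= v_prev - 1e-9) and np.any(v_next > v_prev + 1e-9)
print(f"reached an optimal policy after {len(values) - 1} improvement steps")
```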
An arbitrary GPI scheme need not converge without further assumptions. For example, Monte Carlo with Exploring Starts (MCES) [1], a form of GPI, can fail to converge without additionally assuming uniformity of the update rates. An example of non-convergence can be found in [2], where at each iteration the $Q$ estimate is only updated for the state-action pair $(s,a)$ at the start of the episode. A careful choice of the starting pair $(s,a)$ for each episode, based on the current $Q$ estimates, leads to cycling of the $Q$ estimates and hence failure of convergence. In this case, the random walk on $\tilde{\Pi}$ is only allowed to transition to a neighbor whose action differs at at most one state (modulo $\equiv$). All the possible transitions induce a sparse graph, and combined with inaccurate $Q$ estimates and a curated rule for exploring starts, the walk is made into a cycle that never hits the optimal policy (class).
From this perspective, tools from random walks and graph theory might be helpful in proving convergence; however, we caution that much complexity is hidden behind this drastically simplified graph structure, as it does not explicitly capture the MDP dynamics, and hence the interplay between the state-value and action-value functions.
As a crude example, let $D$ be the maximum possible length of a strictly increasing path in the graph on $\tilde{\Pi}$ induced by the algorithm's allowed transitions, so that any strictly increasing sequence of policies must hit $[\pi^\star]$ in at most $D$ steps. Let $E_t$ be the event that at iteration $t$ of the algorithm, either
1) the new policy $[\pi_{t+1}]$ does not strictly improve upon $[\pi_{t}]$ if $[\pi_{t}]\neq [\pi^\star]$ (including the case where the two are not comparable), or
2) the new policy $[\pi_{t+1}]$ becomes suboptimal if $[\pi_{t}]=[\pi^\star]$.
If $\mathbb{P}(E_t)<\delta/D$ for every $t$, then with probability at least $1-\delta$, the algorithm reaches an optimal policy within $D$ steps.
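Indeed, a union bound over the first $D$ iterations gives
$$\mathbb{P}\left(\bigcup_{t=1}^{D} E_t\right) \leq \sum_{t=1}^{D}\mathbb{P}(E_t) < D\cdot\frac{\delta}{D} = \delta,$$
and on the complementary event every one of these iterations either strictly improves the policy class or stays at $[\pi^\star]$, so the walk hits $[\pi^\star]$ within $D$ steps.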
If it is the case that $\sum_{t=1}^\infty \mathbb{P}(E_t) < \infty$, then by the Borel-Cantelli lemma, $\mathbb{P}(\limsup E_t)=0$. Taking the complement, we know that with probability 1 there exists some $t$ such that after the $t$-th iteration, the algorithm improves strictly at each iteration until it hits $[\pi^\star]$ and stays there (since $D<\infty$). In other words, the algorithm converges to $[\pi^\star]$ almost surely. We note that the assumption here is quite strong, as it requires the probability of a "bad event" at each iteration not only to go to $0$ but to drop fast enough to be summable. This means that if we were to perform policy evaluation from scratch at each iteration, we would need to collect increasingly many samples to achieve increasingly higher confidence. Therefore any "smart" algorithm would need to reuse past reward samples (coming from different policies) to estimate the current policy. In other words, we should design the algorithm so that the probability of $E_t$ given $E_{t-1}^c$ (or $E_{1}^c\cap\cdots\cap E_{t-1}^c$) is exceedingly small.
# GPI using Monte Carlo Evaluation
Consider the setting where at each iteration, we evaluate the current policy using the Monte Carlo method (i.e., averaging i.i.d. returns collected from each state-action pair while following the policy) and take the greedy policy with respect to that estimate. This differs from the MC control algorithm in [1] in that samples from previous iterations do not carry over. Our goal is to take the number of samples at each iteration so large that each greedy policy is correct (and hence the policy improves) with high probability.
For simplicity, assume that the MDP is episodic with rewards bounded in $[0,1]$ and that the maximum episode length is $T$ (we can relax this assumption by conditioning on $T$). At each iteration $t$, for each state $s$ and action $a$ we sample $n_{s,a,t}$ episodes. Let $E_{s,a,t}$ be the event that at this iteration
$$|Q_t(s,a)-q_{\pi_t}(s,a)|\geq \delta_{s,a,t},$$ where $Q_t(s,a)$ is the sample mean of the $n_{s,a,t}$ returns of trajectories starting from $(s,a)$. We hope that at each iteration $t$ we have $$\mathbb{P}\left(E_{s,a,t}\right) \leq \eta_{s,a,t}, \text{ where } \sum_{t=1}^\infty\sum_{s,a} \eta_{s,a,t} <\infty,$$ so that by union bound and Borel-Cantelli the algorithm converges with probability one. By Hoeffding's inequality we have $$\mathbb{P}(E_{s,a,t})\leq 2\exp\left(\frac{-2\delta^2_{s,a,t}n_{s,a,t}}{T^2} \right)$$ so setting the RHS to be $\eta_{s,a,t}$ we get
$$\delta_{s,a,t}=T\sqrt{\frac{\log(2/\eta_{s,a,t})}{2n_{s,a,t}}}.$$ For example, we can take $\eta_{s,a,t}:=1/(SAt^2)$, where $S$ and $A$ are the numbers of states and actions respectively, so that $\sum_{t}\sum_{s,a}\eta_{s,a,t}=\sum_t 1/t^2<\infty$. At iteration $t$, for each $s$ we then need to take $n_{s,a,t}$ large enough (making $\delta_{s,a,t}$ small enough) that the confidence interval around $Q_t(s, a^\star)$ for the greedy action $a^\star$ at $s$ (with respect to $q_{\pi_t}$) is disjoint from the confidence interval around $Q_t(s,a)$ for every other action $a$. This way we obtain a convergent Monte Carlo GPI algorithm.
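As a sanity check, below is a minimal sketch of this scheme on a small synthetic episodic MDP; the MDP itself (rewards in $[0,1]$, episodes truncated at $T$ steps) and the target half-width `delta` are illustrative assumptions, and in the analysis above `delta` would need to be smaller than half the smallest action gap at each state, which is not known a priori.

```python
import numpy as np

# Minimal sketch of Monte Carlo GPI with Hoeffding-based sample sizes.
# The synthetic episodic MDP below (rewards in [0, 1], episodes capped at T
# steps, state index S acting as the terminal state) is an illustrative
# assumption, as is the target half-width `delta`.
rng = np.random.default_rng(1)
S, A, T = 4, 2, 10
P = rng.dirichlet(np.ones(S + 1), size=(S, A))   # P[s, a, s'], s' = S is terminal
R = rng.random((S, A))                           # rewards in [0, 1]

def rollout(s, a, pi):
    """Sample one return from (s, a), following pi afterwards, truncated at T steps."""
    G = 0.0
    for _ in range(T):
        G += R[s, a]
        s = rng.choice(S + 1, p=P[s, a])
        if s == S:                               # episode terminated
            break
        a = pi[s]
    return G

def mc_gpi(iterations=5, delta=0.5):
    pi = rng.integers(A, size=S)                 # arbitrary initial policy
    for t in range(1, iterations + 1):
        eta = 1.0 / (S * A * t**2)               # eta_{s,a,t}; summable over (s, a, t)
        # Invert delta = T * sqrt(log(2 / eta) / (2 n)) for the sample size n_{s,a,t}.
        n = int(np.ceil(T**2 * np.log(2 / eta) / (2 * delta**2)))
        Q = np.array([[np.mean([rollout(s, a, pi) for _ in range(n)])
                       for a in range(A)] for s in range(S)])
        pi = Q.argmax(axis=1)                    # greedy policy improvement
    return pi

print(mc_gpi())
```

In practice one would reuse returns across iterations rather than resample from scratch, as noted earlier; the sketch keeps the evaluation step literal so that the Hoeffding schedule above applies directly.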
<!-- # Total orders on $\tilde{\Pi}$
A metric on $\Pi$ can help us analyze the effect of policy improvement in a more detailed manner. Any functional $\mathbb{R}^S\to\mathbb{R}$ on value functions that is increasing in each component naturally yields a total preorder and a pseudometric on $\Pi$. We do not explore this direction for now.
Observe that not all states are equally important when solving an MDP. Can we get anything out of the importance of states? Let $\pi^\star$ be an optimal policy with value function $v_\star$. Intuitively speaking, improving the value function at a state $s$ with small $v_{\star}(s)$ is not as helpful as improving it at a state $s'$ with large $v_{\star}(s')$. We can thus rank the importance of all the states $s$ by the value of $v_\star(s)$. In other words, we have the order $\leq_\star$ on the state space given by $$s\leq_\star s' \iff v_\star(s)\leq v_\star(s').$$ More generally, for each $\pi\in\Pi$ we can consider the order $\leq_\pi$ on states defined by $$s\leq_\pi s' \iff v_\pi (s)\leq v_\pi(s').$$ The preorders $\leq_\pi$ and $\leq_\star$ can be made into total orders using any predefined tie-breaking rule in case two states have the same value. For example, if the MDP satisfies OPFF then ties can be broken using the induced topological ordering.
:::spoiler We may also consider some form of weighted importance.
Notice that some states with large optimal values may be hard to get to. Define the importance of a state $s$ under policy $\pi$ and initial distribution $\mu_0$ as $$I_\pi(s; \mu_0):= v_\pi(s) \cdot \mathbb{E}_\pi[\text{number of visits to $s$ before termination} \mid \mu_0],$$ where $\mu_0$ is some distribution over the starting states or the starting state-action pairs when generating trajectories. The expected number of visits may be replaced by other quantities such as the hitting probability. For notational simplicity we drop the starting distribution $\mu_0$ when it is fixed (we have the freedom to choose it in our algorithm, e.g. uniform). This (along with a tie-breaking rule) gives rise to another total order $$s\leq_{I_\pi}s' \iff I_\pi(s) \leq I_\pi(s').$$ A global perspective on $\Pi$ suggests that we investigate $$I_\star:=\sup_{\pi^\star: \text{ optimal}}I_{\pi^\star}$$ and the associated order $\leq_{I_\star}$, which can be thought of as a weighted version of $\leq_\star$.
We caution that for a fixed $\mu_0$, $[\pi]=[\pi']$ does not imply $I_{\pi}=I_{\pi'}$. For example, consider the two-state MDP with a transient state $s$, a terminal state $t$, and two actions $a_1, a_2$ such that at state $s$,
- When taking action $a_1$, the chain transitions back to $s$ with reward $r_1$ with probability $p$, and the chain goes to $t$ with no reward with probability $1-p$
- When taking action $a_2$, the chain deterministically goes to $t$ with reward $r_2$.
There are two policies $\pi_i(s)=a_i$ $(i=1,2)$ with value functions $v_{\pi_1}(s)=\frac{p r_1}{1-p}$ and $v_{\pi_2}(s)=r_2$. Setting the initial action $\mu_0=\delta_{a_1}$ to always be $a_1$, the expected numbers of return visits to $s$ under $\pi_1$ and $\pi_2$ are $\frac{p}{1-p}$ and $p$ respectively. If we force $v_{\pi_1}(s)=v_{\pi_2}(s)$, i.e. $p=\frac{r_2}{r_1+r_2}$, we see that $I_{\pi_1}(s)=r_2^2/r_1$ while $I_{\pi_2}(s)=r_2^2/(r_1+r_2)$.
---
:::
Let $\leq_\Box$ denote any of the aforementioned total orders (defined by the value functions along with a tie-breaking rule). Then the state space can be arranged as a strictly increasing sequence
$$s_S \lneq_\Box s_{S-1} \lneq_\Box \cdots \lneq_\Box s_2 \lneq_\Box s_1,$$ where $S$ is the number of states and ties are broken in an arbitrary (but fixed) manner. Reorder $\mathbb{R}^S$ so that dimension $i$ corresponds to $s_i$ and let $\trianglelefteq_\Box$ denote the lexicographic order induced by $\leq_\Box$. Then $\trianglelefteq_\Box$ is a total order on $\mathbb{R}^S$ for state value functions, which in turn defines a total preorder on $\Pi$; namely, $\pi \trianglelefteq_\Box \pi'$ if and only if $v_\pi \trianglelefteq_\Box v_{\pi'}$. Therefore $\trianglelefteq_{\pi}$, $\trianglelefteq_{\star}$ and $\trianglelefteq_{I_\star}$ are well-defined total orders on $\tilde{\Pi}$.
Now we connect these new orders to the original preorder $\leq$ on $\Pi$. For a set of policies $U\subseteq \Pi$, consider the partial order $\leq_U$ on $\Pi$ defined by $\pi_1\leq_U \pi_2$ if and only if $\pi_1\trianglelefteq_\pi \pi_2$ for all $\pi\in U$. Then, assuming two policies $\pi_1$ and $\pi_2$ are comparable under $\leq$, we have $$\left[ \hspace{2pt}\exists \hspace{2pt} \pi\in\Pi: \hspace{5pt} \pi_1\trianglelefteq_\pi \pi_2 \hspace{2pt} \right] \implies \pi_1\leq \pi_2 \implies \pi_1\leq_\Pi \pi_2 .$$
Observe that $\pi_1\leq_\Pi \pi_2$ need not imply $\pi_1\leq \pi_2$. One can construct a counterexample with two transient states $s_1,s_2$ and two actions $a_1, a_2$ that both terminate the chain, where the reward for taking $a_j$ at $s_i$ is $r_{ij}$. In case $r_{11}>r_{12}>r_{21}>r_{22}$, the policies $\pi_1: s_i\mapsto a_i$ $(i=1,2)$ and $\pi_2:s_1\mapsto a_{2}, s_2\mapsto a_{1}$ are not comparable under $\leq$ but satisfy $\pi_2\leq_{\Pi}\pi_1$.
With a more careful analysis, this observation might explain the curious phenomenon that policy iteration tends to converge in a remarkably small number of iterations [3]. Very informally, if at each step the policy $\pi_t$ moves to the greedy policy $\pi_{t+1}$ of the current value function, then $\pi_t\leq_\Pi\pi_{t+1}$, so $[\pi_{t+1}]$ is a successor of $[\pi_t]$ in each of the $N$ directed increasing paths (one chain for each total order $\trianglelefteq_\pi$). If $N$ is not too small and the paths are sufficiently different when starting from $[\pi_t]$, then it is plausible that the next common vertex of all the paths takes quite a few steps to reach, so that the total number of iterations required is small.
-->
<!-- What if we only modify the policy one state at a time? estimate new $Q$ using the old one. it's like not going to the absolute best next step but just a strictly better one
we should have $\pi'>\pi$ still. and how many steps of this is needed to get to the actual greedy policy. also how many steps if we take greedy to converge-->
<!-- Maintain an V estimate on the side
-->
<!-- Now we modify the choice of the exploring-start state-action pair in the MCES algorithm in [2] as follows:
- Initialize $(s_0, a_0)$ and $\pi_0\in\Pi$ at random
- For each iteration $t=0,1,2,...$:
- Generate an episode starting from $(s_t, a_t)$ following $\pi_t$ where the next state is $s'_t$
- Update $Q(s_t, a_t)$ using the sample mean
- Update $\pi_{t+1}$ to be greedy with respect to $Q$
- Set the next exploring start pair to be $(s_{t+1}, a_{t+1})=(s'_t, \pi_t(s'_t))$
To make the algorithm converge it suffices to generate enough trajectories and use Hoeffding's inequality. -->
# References
[1] Sutton, Richard S., and Andrew G. Barto. "Reinforcement Learning: An Introduction." 2018.
[2] Wang, Che, et al. "On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning." 2020.
[3] [Santos, Manuel S., and John Rust. "Convergence Properties of Policy Iteration." SIAM Journal on Control and Optimization 42.6 (2004): 2094-2115.](https://editorialexpress.com/jrust/research/siam_dp_paper.pdf)