# A Graph Perspective on Generalized Policy Iteration Algorithms
Given a finite Markov decision process (MDP), the most common way to solve the control problem is to adopt some form of generalized policy iteration (GPI), where one alternates between estimating the action value function $Q(s,a)$ for the current policy and setting the new policy to be greedy with respect to the estimate of $Q$. By the policy improvement theorem, if the evaluation step is exact, then the new greedy policy $\pi'$ is guaranteed to improve upon the current policy $\pi$ in the sense that the state value function satisfies $v_\pi(s)\leq v_{\pi'}(s)$ for all states $s$. Moreover, the inequality must be strict at some state $s$ unless the policy $\pi$ is already optimal. (See [1] for details.)
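In symbols, with $q_\pi$ denoting the exact action value function of $\pi$, the improvement step sets
$$\pi'(s) \in \arg\max_{a} q_\pi(s,a) \quad \text{for every state } s.$$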
Now construct a complete graph whose vertex set is the collection $\Pi$ of deterministic stationary policies for this MDP. The state value function induces a preorder on $\Pi$ defined by $\pi\leq \pi'$ if and only if $v_\pi\leq v_{\pi'}$ componentwise, as well as an equivalence relation $\equiv$ defined by $\pi\equiv\pi'$ if and only if $v_\pi=v_{\pi'}$. Also write $\pi< \pi'$ if $\pi\leq \pi'$ and $v_\pi(s)<v_{\pi'}(s)$ for some state $s$. Define $\tilde{\Pi}:=\Pi/\equiv$; the preorder on $\Pi$ induces a partial order on $\tilde{\Pi}$ whose maximum element is $[\pi^\star]$, the class of any optimal policy. For ease of language we also refer to the classes $[\pi]$ as policies, even though they are really classes of policies.
In this setting, GPI corresponds to a random walk on the quotient graph $\tilde{\Pi}$. If the policy evaluation step is exact, the policy improvement theorem says that for any starting policy $[\pi_0]\in\tilde{\Pi}$, the random walk (which is in fact deterministic) is a strictly increasing sequence which terminates at the absorbing state $[\pi^\star]$.
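To make this concrete, here is a minimal sketch of exact GPI (classical policy iteration) on a small randomly generated MDP; the transition kernel `P`, reward table `R`, and discount `gamma` below are illustrative assumptions rather than part of the setup above. Along the resulting walk, the value functions are componentwise non-decreasing, with a strict improvement at some state at every step, until the greedy policy stabilizes at a member of $[\pi^\star]$.

```python
import numpy as np

# Minimal sketch: exact policy iteration on a small synthetic MDP, illustrating
# the strictly increasing walk on policy classes described above.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9                      # illustrative sizes and discount
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']: transition kernel
R = rng.random((S, A))                       # R[s, a]: expected reward

def evaluate(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = r_pi."""
    P_pi = P[np.arange(S), pi]               # (S, S) transition matrix under pi
    r_pi = R[np.arange(S), pi]               # (S,) reward vector under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def greedy(v):
    """Greedy policy with respect to the one-step lookahead of v."""
    return (R + gamma * P @ v).argmax(axis=1)

pi = rng.integers(A, size=S)                 # arbitrary starting policy pi_0
values = [evaluate(pi)]
while True:
    pi_next = greedy(values[-1])
    if np.array_equal(pi_next, pi):          # greedy policy stable: reached [pi*]
        break
    pi = pi_next
    values.append(evaluate(pi))

# The walk is strictly increasing: v_{pi_t} <= v_{pi_{t+1}} componentwise,
# with strict inequality at some state (up to numerical tolerance).
for v_prev, v_next in zip(values, values[1:]):
    assert np.all(v_next >= v_prev - 1e-9) and np.any(v_next > v_prev + 1e-9)
print(f"reached an optimal policy after {len(values) - 1} improvement steps")
```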
An arbitrary GPI scheme need not converge without further assumptions. For example, Monte Carlo with Exploring Starts (MCES) [1], a form of GPI, can fail to converge without additionally assuming uniformity of the update rates. An example of non-convergence can be found in [2], where at each iteration the $Q$ estimate is only updated for the state-action pair $(s,a)$ at the start of the episode. A careful choice of the starting pair $(s,a)$ for each episode, based on the current $Q$ estimates, leads to cycling of the $Q$ estimates and hence failure of convergence. In this case, the random walk on $\tilde{\Pi}$ is only allowed to transition to a neighbor whose action differs at at most one state (modulo $\equiv$). All the possible transitions induce a sparse graph, and combined with inaccurate $Q$ estimates and a curated rule for exploring starts, the walk is made into a cycle that never hits the optimal policy (class).
From this perspective, tools from random walks and graph theory might be helpful in proving convergence; however, we caution that much complexity is hidden behind this drastically simplified graph structure, as it does not explicitly capture the MDP dynamics, and hence the interplay between the state-value and action-value functions.
As a crude example, let $D$ be the maximum possible length of a strictly increasing path in the graph on $\tilde{\Pi}$ induced by the algorithm's allowed transitions, so that any strictly increasing sequence of policies must hit $[\pi^\star]$ in at most $D$ steps. Let $E_t$ be the event that at iteration $t$ of the algorithm, either
1) the new policy $[\pi_{t+1}]$ does not strictly improve upon $[\pi_{t}]$ if $[\pi_{t}]\neq [\pi^\star]$ (including the case where the two are not comparable), or
2) the new policy $[\pi_{t+1}]$ becomes suboptimal if $[\pi_{t}]=[\pi^\star]$.
If $\mathbb{P}(E_t)<\delta/D$ for every $t$, then with probability at least $1-\delta$, the algorithm reaches an optimal policy within $D$ steps.
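Indeed, a union bound over the first $D$ iterations gives
$$\mathbb{P}\left(\bigcup_{t=1}^{D} E_t\right) \leq \sum_{t=1}^{D}\mathbb{P}(E_t) < D\cdot\frac{\delta}{D} = \delta,$$
and on the complementary event every one of these iterations either strictly improves the policy class or stays at $[\pi^\star]$, so the walk hits $[\pi^\star]$ within $D$ steps.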
If it is the case that $\sum_{t=1}^\infty \mathbb{P}(E_t) < \infty$, then by the Borel-Cantelli lemma, $\mathbb{P}(\limsup E_t)=0$. Taking the complement, we know that with probability 1 there exists some $t$ such that after the $t$-th iteration, the algorithm improves strictly at each iteration until it hits $[\pi^\star]$ and stays there (since $D<\infty$). In other words, the algorithm converges to $[\pi^\star]$ almost surely. We note that the assumption here is quite strong, as it requires the probability of a "bad event" at each iteration not only to go to $0$ but to drop fast enough to be summable. This means that if we were to perform policy evaluation from scratch at each iteration, we would need to collect increasingly many samples to achieve increasingly higher confidence. Therefore any "smart" algorithm would need to reuse past reward samples (coming from different policies) to estimate the current policy. In other words, we should design the algorithm so that the probability of $E_t$ given $E_{t-1}^c$ (or $E_{1}^c\cap\cdots\cap E_{t-1}^c$) is exceedingly small.
# GPI using Monte Carlo Evaluation
Consider the setting where at each iteration, we evaluate the current policy using the Monte Carlo method (i.e., averaging i.i.d. returns collected from each state-action pair while following the policy) and take the greedy policy with respect to that estimate. This differs from the MC control algorithm in [1] in that samples from previous iterations do not carry over. Our goal is to take the number of samples at each iteration so large that each greedy policy is correct (and hence the policy improves) with high probability.
For simplicity, assume that the MDP is episodic with rewards bounded in $[0,1]$ and that the maximum episode length is $T$ (we can relax this assumption by conditioning on $T$). At each iteration $t$, for each state $s$ and action $a$ we sample $n_{s,a,t}$ episodes. Let $E_{s,a,t}$ be the event that at this iteration
$$|Q_t(s,a)-q_{\pi_t}(s,a)|\geq \delta_{s,a,t},$$ where $Q_t(s,a)$ is the sample mean of the $n_{s,a,t}$ returns of trajectories starting from $(s,a)$. We hope that at each iteration $t$ we have $$\mathbb{P}\left(E_{s,a,t}\right) \leq \eta_{s,a,t}, \text{ where } \sum_{t=1}^\infty\sum_{s,a} \eta_{s,a,t} <\infty,$$ so that by union bound and Borel-Cantelli the algorithm converges with probability one. By Hoeffding's inequality we have $$\mathbb{P}(E_{s,a,t})\leq 2\exp\left(\frac{-2\delta^2_{s,a,t}n_{s,a,t}}{T^2} \right)$$ so setting the RHS to be $\eta_{s,a,t}$ we get
$$\delta_{s,a,t}=T\sqrt{\frac{\log(2/\eta_{s,a,t})}{2n_{s,a,t}}}.$$ For example, we can take $\eta_{s,a,t}:=1/(SAt^2)$, where $S$ and $A$ are the numbers of states and actions respectively, so that $\sum_{t}\sum_{s,a}\eta_{s,a,t}=\sum_t 1/t^2<\infty$. At iteration $t$, for each $s$ we then need to take $n_{s,a,t}$ large enough (making $\delta_{s,a,t}$ small enough) that the confidence interval around $Q_t(s, a^\star)$ for the greedy action $a^\star$ at $s$ (with respect to $q_{\pi_t}$) is disjoint from the confidence interval around $Q_t(s,a)$ for every other action $a$. This way we obtain a convergent Monte Carlo GPI algorithm.
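As a sanity check, below is a minimal sketch of this scheme on a small synthetic episodic MDP; the MDP itself (rewards in $[0,1]$, episodes truncated at $T$ steps) and the target half-width `delta` are illustrative assumptions, and in the analysis above `delta` would need to be smaller than half the smallest action gap at each state, which is not known a priori.

```python
import numpy as np

# Minimal sketch of Monte Carlo GPI with Hoeffding-based sample sizes.
# The synthetic episodic MDP below (rewards in [0, 1], episodes capped at T
# steps, state index S acting as the terminal state) is an illustrative
# assumption, as is the target half-width `delta`.
rng = np.random.default_rng(1)
S, A, T = 4, 2, 10
P = rng.dirichlet(np.ones(S + 1), size=(S, A))   # P[s, a, s'], s' = S is terminal
R = rng.random((S, A))                           # rewards in [0, 1]

def rollout(s, a, pi):
    """Sample one return from (s, a), following pi afterwards, truncated at T steps."""
    G = 0.0
    for _ in range(T):
        G += R[s, a]
        s = rng.choice(S + 1, p=P[s, a])
        if s == S:                               # episode terminated
            break
        a = pi[s]
    return G

def mc_gpi(iterations=5, delta=0.5):
    pi = rng.integers(A, size=S)                 # arbitrary initial policy
    for t in range(1, iterations + 1):
        eta = 1.0 / (S * A * t**2)               # eta_{s,a,t}; summable over (s, a, t)
        # Invert delta = T * sqrt(log(2 / eta) / (2 n)) for the sample size n_{s,a,t}.
        n = int(np.ceil(T**2 * np.log(2 / eta) / (2 * delta**2)))
        Q = np.array([[np.mean([rollout(s, a, pi) for _ in range(n)])
                       for a in range(A)] for s in range(S)])
        pi = Q.argmax(axis=1)                    # greedy policy improvement
    return pi

print(mc_gpi())
```

In practice one would reuse returns across iterations rather than resample from scratch, as noted earlier; the sketch keeps the evaluation step literal so that the Hoeffding schedule above applies directly.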
<!-- # Total orders on $\tilde{\Pi}$
A metric on $\Pi$ can help us analyze the effect of policy improvement in a more detailed manner. Any functional $\mathbb{R}^S\to\mathbb{R}$ on value functions that is increasing in each component naturally yields a total preorder and a pseudometric on $\Pi$. We do not explore this direction for now.
Observe that not all states are equally important when solving an MDP. Can we get anything out of the importance of states? Let $\pi^\star$ be an optimal policy with value function $v_\star$. Intuitively speaking, improving the value function at a state $s$ with small $v_{\star}(s)$ is not as helpful as improving it at a state $s'$ with large $v_{\star}(s')$. We can thus rank the importance of all the states $s$ by the value of $v_\star(s)$. In other words, we have the order $\leq_\star$ on the state space given by $$s\leq_\star s' \iff v_\star(s)\leq v_\star(s').$$ More generally, for each $\pi\in\Pi$ we can consider the order $\leq_\pi$ on states defined by $$s\leq_\pi s' \iff v_\pi (s)\leq v_\pi(s').$$ The preorders $\leq_\pi$ and $\leq_\star$ can be made into total orders using any predefined tie-breaking rule in case two states have the same value. For example, if the MDP satisfies OPFF then ties can be broken using the induced topological ordering.
:::spoiler We may also consider some form of weighted importance.
Notice that some states with large optimal values may be hard to get to. Define the importance of a state $s$ under policy $\pi$ and initial distribution $\mu_0$ as $$I_\pi(s; \mu_0):= v_\pi(s) \cdot \mathbb{E}_\pi[\text{number of visits to $s$ before termination} \mid \mu_0],$$ where $\mu_0$ is some distribution over the starting states or the starting state-action pairs when generating trajectories. The expected number of visits may be replaced by other quantities such as the hitting probability. For notational simplicity we drop the starting distribution $\mu_0$ when it is fixed (we have the freedom to choose it in our algorithm, e.g. uniform). This (along with a tie-breaking rule) gives rise to another total order $$s\leq_{I_\pi}s' \iff I_\pi(s) \leq I_\pi(s').$$ A global perspective on $\Pi$ suggests that we investigate $$I_\star:=\sup_{\pi^\star: \text{ optimal}}I_{\pi^\star}$$ and the associated order $\leq_{I_\star}$, which can be thought of as a weighted version of $\leq_\star$.
We caution that for a fixed $\mu_0$, $[\pi]=[\pi']$ does not imply $I_{\pi}=I_{\pi'}$. For example, consider the two-state MDP with a transient state $s$, a terminal state $t$, and two actions $a_1, a_2$ such that at state $s$,
- When taking action $a_1$, the chain transitions back to $s$ with reward $r_1$ with probability $p$, and the chain goes to $t$ with no reward with probability $1-p$
- When taking action $a_2$, the chain deterministically goes to $t$ with reward $r_2$.
There are two policies $\pi_i(s)=a_i$ $(i=1,2)$ with value functions $v_{\pi_1}(s)=\frac{p r_1}{1-p}$ and $v_{\pi_2}(s)=r_2$. Setting the initial action $\mu_0=\delta_{a_1}$ to always be $a_1$, the expected numbers of return visits to $s$ under $\pi_1$ and $\pi_2$ are $\frac{p}{1-p}$ and $p$ respectively. If we force $v_{\pi_1}(s)=v_{\pi_2}(s)$, i.e. $p=\frac{r_2}{r_1+r_2}$, we see that $I_{\pi_1}(s)=r_2^2/r_1$ while $I_{\pi_2}(s)=r_2^2/(r_1+r_2)$.
---
:::
Let $\leq_\Box$ denote any of the aforementioned total orders (defined by the value functions along with a tie-breaking rule). Then the state space can be arranged as a strictly increasing sequence
$$s_S \lneq_\Box s_{S-1} \lneq_\Box \cdots \lneq_\Box s_2 \lneq_\Box s_1,$$ where $S$ is the number of states and ties are broken in an arbitrary (but fixed) manner. Reorder $\mathbb{R}^S$ so that dimension $i$ corresponds to $s_i$ and let $\trianglelefteq_\Box$ denote the lexicographic order induced by $\leq_\Box$. Then $\trianglelefteq_\Box$ is a total order on $\mathbb{R}^S$ for state value functions, which in turn defines a total preorder on $\Pi$; namely, $\pi \trianglelefteq_\Box \pi'$ if and only if $v_\pi \trianglelefteq_\Box v_{\pi'}$. Therefore $\trianglelefteq_{\pi}$, $\trianglelefteq_{\star}$ and $\trianglelefteq_{I_\star}$ are well-defined total orders on $\tilde{\Pi}$.
Now we connect these new orders to the original preorder $\leq$ on $\Pi$. For a set of policies $U\subseteq \Pi$, consider the partial order $\leq_U$ on $\Pi$ defined by $\pi_1\leq_U \pi_2$ if and only if $\pi_1\trianglelefteq_\pi \pi_2$ for all $\pi\in U$. Then, assuming two policies $\pi_1$ and $\pi_2$ are comparable under $\leq$, we have $$\left[ \hspace{2pt}\exists \hspace{2pt} \pi\in\Pi: \hspace{5pt} \pi_1\trianglelefteq_\pi \pi_2 \hspace{2pt} \right] \implies \pi_1\leq \pi_2 \implies \pi_1\leq_\Pi \pi_2 .$$
Observe that $\pi_1\leq_\Pi \pi_2$ need not imply $\pi_1\leq \pi_2$. One can construct a counterexample with two transient states $s_1,s_2$ and two actions $a_1, a_2$ that both terminate the chain, where the reward for taking $a_j$ at $s_i$ is $r_{ij}$. In case $r_{11}>r_{12}>r_{21}>r_{22}$, the policies $\pi_1: s_i\mapsto a_i$ $(i=1,2)$ and $\pi_2:s_1\mapsto a_{2}, s_2\mapsto a_{1}$ are not comparable under $\leq$ but satisfy $\pi_2\leq_{\Pi}\pi_1$.
With a more careful analysis, this observation might explain the curious phenomenon that policy iteration tends to converge in a remarkably small number of iterations [3]. Very informally, if at each step the policy $\pi_t$ moves to the greedy policy $\pi_{t+1}$ of the current value function, then $\pi_t\leq_\Pi\pi_{t+1}$, so $[\pi_{t+1}]$ is a successor of $[\pi_t]$ in each of the $N$ directed increasing paths (one chain for each total order $\trianglelefteq_\pi$). If $N$ is not too small and the paths are sufficiently different when starting from $[\pi_t]$, then it is plausible that the next common vertex of all the paths takes quite a few steps to reach, so that the total number of iterations required is small.
-->
<!-- What if we only modify the policy one state at a time? estimate new $Q$ using the old one. it's like not going to the absolute best next step but just a strictly better one
we should have $\pi'>\pi$ still. and how many steps of this is needed to get to the actual greedy policy. also how many steps if we take greedy to converge-->
<!-- Maintain an V estimate on the side
-->
<!-- Now we modify the choice of the exploring-start state-action pair in the MCES algorithm in [2] as follows:
- Initialize $(s_0, a_0)$ and $\pi_0\in\Pi$ at random
- For each iteration $t=0,1,2,...$:
- Generate an episode starting from $(s_t, a_t)$ following $\pi_t$ where the next state is $s'_t$
- Update $Q(s_t, a_t)$ using the sample mean
- Update $\pi_{t+1}$ to be greedy with respect to $Q$
- Set the next exploring start pair to be $(s_{t+1}, a_{t+1})=(s'_t, \pi_t(s'_t))$
To make the algorithm converge it suffices to generate enough trajectories and use Hoeffding's inequality. -->
# References
[1] Sutton, Richard S., and Andrew G. Barto. "Reinforcement Learning: An Introduction." 2018.
[2] Wang, Che, et al. "On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning." 2020.
[3] [Santos, Manuel S., and John Rust. "Convergence Properties of Policy Iteration." SIAM Journal on Control and Optimization 42.6 (2004): 2094-2115.](https://editorialexpress.com/jrust/research/siam_dp_paper.pdf)