This is a short note (and a corresponding colab notebook) about an example of a Markov chain whose state distribution does not converge to a stationary distribution.
This is probably an irrelevant technicality for reinforcement learning, but an interesting topic to understand nevertheless.
Consider an integer $N$ and the homogeneous Markov chain $(X_t)_{t \ge 0}$ on the states $\{0, 1, \dots, 2N-1\}$ such that the transition probabilities are:

$$P\big(X_{t+1} = (x+1) \bmod 2N \mid X_t = x\big) = P\big(X_{t+1} = (x-1) \bmod 2N \mid X_t = x\big) = \frac{1}{2}.$$
In other words, the state is a counter, and we either increase or decrease the counter by $1$, each with probability $\frac{1}{2}$. I added the $\bmod\, 2N$ bit so that the counter always stays finite, bounded between $0$ and $2N-1$.
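To make this concrete, here is a minimal numpy sketch of the transition matrix (the choice $N = 5$ and all variable names are mine, for illustration only):

```python
import numpy as np

# Hypothetical concrete instance: N = 5, so the states are 0, 1, ..., 9.
N = 5
S = 2 * N

# From state x, step to (x + 1) mod 2N or (x - 1) mod 2N, each with
# probability 1/2.
P = np.zeros((S, S))
for x in range(S):
    P[x, (x + 1) % S] = 0.5
    P[x, (x - 1) % S] = 0.5

assert np.allclose(P.sum(axis=1), 1.0)  # every row is a distribution
```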
The problem with this Markov chain, from the perspective of converging to a stationary distribution, is that the states are periodic. An even state can only be reached from another even state after an even number of steps. Therefore, the period of every state is $2$. Because the states are periodic, the Markov chain is not ergodic. See Wikipedia for a definition of the periodicity of Markov chains.
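A quick numerical check of this period-$2$ structure, under the same assumed setup as above:

```python
import numpy as np

N = 5
S = 2 * N
P = np.zeros((S, S))
for x in range(S):
    P[x, (x + 1) % S] = 0.5
    P[x, (x - 1) % S] = 0.5

# P^t[0, x] is the probability of being in state x after t steps from
# state 0; it is zero whenever x and t have different parities.
for t in range(1, 6):
    Pt = np.linalg.matrix_power(P, t)
    reachable = np.flatnonzero(Pt[0] > 0)
    assert all(x % 2 == t % 2 for x in reachable)
    print(t, reachable)
```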
A Markov chain is ergodic if it is irreducible and aperiodic: every state can be reached from every other state, and no state is periodic.
An ergodic Markov chain has a stationary distribution, and the state distribution converges to this stationary distribution irrespective of the starting distribution. Let's denote the stationary distribution by $\mu$, as in the paper. Then we have that

$$\lim_{t \to \infty} P(X_t = x \mid X_0 = x_0) = \mu(x),$$
and the limit does not depend on the initial state $x_0$. Moreover, it is also true that if $\mu_0$ is an initial state distribution, then

$$\lim_{t \to \infty} \mu_0 P^t = \mu.$$
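The chain defined above is periodic, so to see these equations in action we need a different chain; a lazy variant of the walk (stay put with probability $\frac{1}{2}$; this modification is mine, purely for illustration) is aperiodic and hence ergodic:

```python
import numpy as np

N = 5
S = 2 * N

# Lazy variant: stay with probability 1/2, otherwise step left or right.
# The self-loops make every state aperiodic, so this chain is ergodic.
P_lazy = np.zeros((S, S))
for x in range(S):
    P_lazy[x, x] = 0.5
    P_lazy[x, (x + 1) % S] = 0.25
    P_lazy[x, (x - 1) % S] = 0.25

mu = np.zeros(S)
mu[3] = 1.0  # arbitrary deterministic start, X_0 = 3
for _ in range(2000):
    mu = mu @ P_lazy

# By symmetry the stationary distribution is uniform, and mu_0 P^t
# converges to it no matter which start state we pick.
print(np.max(np.abs(mu - 1.0 / S)))   # ~0
print(np.allclose(mu, mu @ P_lazy))   # mu is (numerically) stationary
```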
However, if the Markov chain is non-ergodic, this convergence won't happen. For example, in the Markov chain we defined above, the one with periodicity $2$, the limit

$$\lim_{t \to \infty} P(X_t = x \mid X_0 = x_0)$$
simply does not exist. This sequence of probability distributions doesn't converge; it keeps oscillating between a uniform distribution over the odd states and a uniform distribution over the even states.
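Here is a sketch of the oscillation, again under the assumed $N = 5$ setup:

```python
import numpy as np

N = 5
S = 2 * N
P = np.zeros((S, S))
for x in range(S):
    P[x, (x + 1) % S] = 0.5
    P[x, (x - 1) % S] = 0.5

mu = np.zeros(S)
mu[0] = 1.0  # deterministic start at state 0
for _ in range(1000):
    mu = mu @ P

# Even after 1000 steps the distribution keeps alternating: all mass on
# the even states, then all mass on the odd states, and so on.
print(np.round(mu, 3))      # ~uniform over the even states
print(np.round(mu @ P, 3))  # ~uniform over the odd states
```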
However, consider an initial distribution that puts equal mass on the even and the odd states, for example $\mu_0(0) = \mu_0(1) = \frac{1}{2}$. Now, starting from this initial distribution, the state distribution does converge, to the uniform distribution over all $2N$ states.
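And a sketch of this convergent case, under the same assumptions:

```python
import numpy as np

N = 5
S = 2 * N
P = np.zeros((S, S))
for x in range(S):
    P[x, (x + 1) % S] = 0.5
    P[x, (x - 1) % S] = 0.5

# Split the initial mass evenly between the two parity classes.
mu = np.zeros(S)
mu[0] = 0.5
mu[1] = 0.5

for _ in range(1000):
    mu = mu @ P

# The parity classes stay balanced, so the oscillation cancels out and
# mu_0 P^t converges to the uniform distribution over all 2N states.
print(np.round(mu, 3))      # ~0.1 everywhere
print(np.round(mu @ P, 3))  # one more step barely changes anything
```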
Let's take the definition of the discounted stationary distribution, which isn't really a thing in Markov chains, but we use it in RL apparently:

$$\mu_\gamma(x \mid x_0) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t P(X_t = x \mid X_0 = x_0).$$
This limit exists even for non-ergodic Markov chains: each probability is at most $1$, so the series is dominated by the convergent geometric series $\sum_t \gamma^t$ and converges absolutely. However, this discounted stationary distribution-ish thing could depend on the initial state $x_0$.
I'm also not sure that this discounted distribution is ever independent of the initial condition $x_0$, even for ergodic Markov chains. Intuitively, it won't be: the largest weights $(1-\gamma)\gamma^t$ fall on the earliest time steps, and the early state distributions obviously depend on $x_0$.
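Here is a sketch of how one could compute this quantity by truncating the sum in the definition (the names and the truncation horizon are my choices):

```python
import numpy as np

N = 5
S = 2 * N
P = np.zeros((S, S))
for x in range(S):
    P[x, (x + 1) % S] = 0.5
    P[x, (x - 1) % S] = 0.5

def discounted_distribution(P, x0, gamma, T=2000):
    """Truncation of (1 - gamma) * sum_t gamma^t P(X_t = . | X_0 = x0)."""
    mu_t = np.zeros(P.shape[0])
    mu_t[x0] = 1.0                  # distribution of X_0
    total = np.zeros_like(mu_t)
    for t in range(T):
        total += gamma**t * mu_t    # accumulate the weighted sum
        mu_t = mu_t @ P             # advance to the distribution of X_{t+1}
    return (1 - gamma) * total

# The limit exists even for this periodic chain, but it clearly depends
# on both gamma and the initial state x0:
print(np.round(discounted_distribution(P, x0=0, gamma=0.9), 3))
print(np.round(discounted_distribution(P, x0=3, gamma=0.9), 3))
print(np.round(discounted_distribution(P, x0=0, gamma=0.5), 3))
```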
Here's a colab notebook illustrating the limiting behaviour of this Markov chain.
Calculate the discounted stationary distribution $\mu_\gamma(\cdot \mid x_0)$ as a function of $\gamma$ and $x_0$.
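As a starting hint (mine, not from the note), write $e_{x_0}$ for the point-mass distribution at $x_0$ as a row vector; then the geometric series can be summed in closed form:

$$\mu_\gamma(\cdot \mid x_0) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t\, e_{x_0} P^t = (1 - \gamma)\, e_{x_0} \left(I - \gamma P\right)^{-1},$$

where the inverse exists because the spectral radius of $\gamma P$ is $\gamma < 1$.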