$\newcommand{\E}{\mathbb{E}}
\newcommand{\var}{\text{var}}
\newcommand{\summation}[2]{\sum_{ #1 }^{ #2 } }
\newcommand{\sumtoinf}[1]{\sum_{ #1 }^\infty }
\newcommand{\texp}{\tau_{\exp}}
\newcommand{\tint}{\tau_\text{int}}$
# Thinning vs Sampling Efficiency
Consider a stationary process $\Psi \equiv \{\psi_t\} \equiv \{\psi(x_t)\}_{x_t \in X}$ with stationary distribution $\pi$, where $\psi : X \to \mathbb{R}$ and $\psi \in L^2(\pi)$, i.e. the process has finite variance. The normalized autocorrelation at lag $t \geq 0$ is defined as
$$\rho(t) \equiv \frac{C(t)}{C(0)} \propto \exp(-|t|/\texp) > 0, $$
where $C(t) \equiv \E_\pi[(\psi_s-\E_\pi[\psi_s])(\psi_{s+t} -\E_\pi[\psi_{s+t}])]$ is the autocovariance (independent of $s$ by stationarity) and $\texp$ is the exponential autocorrelation time. Then, the integrated autocorrelation time is defined as
$$\tint \equiv \frac{1}{2} + \sumtoinf{t=1} \rho(t).$$
In MCMC, $\texp$ measures the time taken to reach stationarity. If it is known, Sokal advises a burn-in of about $20 \, \texp$. On the other hand, $\tint$ controls the Monte Carlo error of the estimates once stationarity has been reached.
Thinning a process by $k \geq 1$ (keeping only every $k$-th value) is equivalent to letting $\rho(t)$ decay with time constant $\texp/k$, i.e. the decay is sped up by a factor of $k$. The resulting integrated autocorrelation time then reads
$$\begin{align}
\tint(k, \texp) &= \frac{1}{2}+ \sumtoinf{t=1} \rho(k\,t) \\
&= \frac{1}{2} + \sumtoinf{t=1} K \exp(-k|t|/\texp) \\
&= \frac{1}{2} + \frac{K \exp(-k/\texp) }{1-\exp(-k/\texp)}\\
&= \frac{1}{2} + \frac{K}{\exp(k/\texp) - 1}.\\
\end{align}$$
Here $K$ is the constant of proportionality. The infinite sum is a (convergent) geometric series with ratio $\exp(-k/\texp) < 1$. In the $k \to \infty$ limit, $\exp(k/\texp) \to \infty$, so the sum vanishes; this means $\rho(k\,t) \to 0$ for all $t \geq 1$ and $\tint \to 1/2$, i.e. the thinned process becomes uncorrelated in the limit.
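The closed form is easy to check numerically. Below is a minimal Python sketch (the function name `tint`, the default $K=1$, and the truncation point of the series are my own choices):
```python
import math

def tint(k, tau_exp, K=1.0):
    """Integrated autocorrelation time of a chain thinned by k,
    assuming rho(t) = K * exp(-|t| / tau_exp) for t >= 1."""
    if tau_exp == 0.0:
        return 0.5  # independent-sampler limit
    return 0.5 + K / math.expm1(k / tau_exp)  # expm1(x) = exp(x) - 1

# The closed form agrees with a truncated version of the geometric series.
k, tau_exp = 3, 10.0
truncated = 0.5 + sum(math.exp(-k * t / tau_exp) for t in range(1, 10_000))
assert abs(tint(k, tau_exp) - truncated) < 1e-9

# As k grows, tau_int decreases towards 1/2.
for k in (1, 5, 25, 125):
    print(k, tint(k, tau_exp))
```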
## MCMC
Consider a chain of $N$ samples of a variable $x \in X$ and a functional $\psi : X \to \mathbb{R}$. Throughout this text, we assume $k$ divides $N$ and define the effective sample size (ESS) of the chain $\Psi = \{ \psi(x_1), \ldots, \psi(x_N)\}$ as
$$\text{ESS}(k, \texp) = \frac{\lfloor N/k \rfloor}{2 \, \tint(k, \texp)} \times k = \frac{N}{2 \, \tint(k, \texp)}.$$
The multiplication by $k$ compensates for the loss of samples due to thinning. In other words, we are interested in the effect of thinning on the ESS through $\tint$. For an independent sampler, i.e. $\texp = 0$, we have $\tint(k, 0) = 1/2$ for all $k$; it is therefore unaffected by thinning and delivers a constant ESS of $N$ under our definition.
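Building on the `tint` sketch above, this definition and the independent-sampler case translate directly to code (again a sketch; the names are illustrative):
```python
def ess(k, tau_exp, N, K=1.0):
    """Effective sample size of N samples thinned by k; the factor of k
    in the definition cancels against N / k."""
    return N / (2.0 * tint(k, tau_exp, K))

N = 10_000
# Independent sampler (tau_exp = 0): the ESS stays at N for every k.
print([ess(k, 0.0, N) for k in (1, 2, 10, 100)])  # [10000.0, 10000.0, ...]
```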
#### One chain, thinned two ways
Consider the same chain $\Psi$ as above and thin it in two ways, with factors $k_1$ and $k_2$ where $k_1 > k_2$. It follows from the definitions of $\tint$ and ESS that more thinning yields a higher chain efficiency:
$$\frac{\texp}{k_1} < \frac{\texp}{k_2} \implies \tint(k_1, \texp) < \tint(k_2, \texp) \implies \text{ESS}(k_1, \texp) > \text{ESS}(k_2, \texp).$$
In practice, a correlated chain that is thinned with a sufficiently large $k$ is indistinguishable from a chain of independent samples; both will give an ESS of $N$.
#### Two chains with distinct $\texp$ values, thinned the same
Let us consider two chains $\Psi$ and $\Psi'$ with exponential autocorrelation times ${\texp}_{\Psi'} > {\texp}_\Psi$ and constants $K' > K$. If we thin both by $k$, then
$$\begin{align}
{\texp}_{\Psi'} &> {\texp}_\Psi\\
\implies {\tint}(k, {\texp}_{\Psi'}) &> {\tint}(k,{\texp}_{\Psi}) \\
\implies \text{ESS}(k, {\texp}_{\Psi'}) &< \text{ESS}(k, {\texp}_{\Psi})\end{align}$$
for all $1 \leq k \leq N.$ In other words, a less efficient chain cannot surpass the ESS of a more efficient chain by thinning.
![](https://i.imgur.com/bOunOWu.png =350x220)
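Both orderings can be checked with the `ess` sketch above; the parameter values below are arbitrary and, for simplicity, $K = K' = 1$:
```python
N, tau = 10_000, 10.0

# One chain, thinned two ways: k1 > k2 implies ESS(k1) > ESS(k2).
k1, k2 = 10, 2
assert ess(k1, tau, N) > ess(k2, tau, N)

# Two chains, thinned the same: a larger tau_exp implies a smaller ESS.
k = 5
assert ess(k, 15.0 * tau, N) < ess(k, tau, N)
```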
## What about cost?
![](https://i.imgur.com/fs0bK51.png =400x300)
An ESS estimate is not meaningful without accounting for the associated cost. Let the cost to compute one sample be $c$ and the cost to evaluate $\psi$ be $c_\psi$. Then, in general, a chain of $N$ samples thinned by $k$ costs
$$\text{cost}(k, c) = Nc + \left \lfloor \frac{N}{k}\right \rfloor c_\psi.$$
We consider $c_\psi$ and $N$ as constants for simplicity. Naturally, when thinning by $k$, we evaluate $\psi$ fewer times but compute $N$ samples all the same. The ESS normalized by cost then reads
$$
\begin{align}
\text{E}(k, \texp, c) &\equiv \text{ESS}(k, \texp)/\text{cost}(k, c) \\
&= \frac{N/ 2 \, \tint(k, \texp)} {Nc + \left \lfloor \frac{N}{k}\right \rfloor c_\psi}\\
&= \frac{N/ 2 \, \tint(k, \texp)} {Nc + \frac{N}{k} c_\psi} \quad \text{(assume $k$ divides $N$)}\\
&= \frac{1 } {2 \, \tint(k, \texp) \times (c + c_\psi/k)}.\\
\end{align}
$$
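The cost-normalized efficiency is a one-liner on top of the `tint` sketch from earlier (again an illustrative name, with $c_\psi = 0$ as the default):
```python
def efficiency(k, tau_exp, c, c_psi=0.0, K=1.0):
    """ESS per unit cost, assuming k divides N (the N's cancel)."""
    return 1.0 / (2.0 * tint(k, tau_exp, K) * (c + c_psi / k))
```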
Let two chains $\Psi$ and $\Psi'$ have exponential autocorrelation times $\texp=\tau$ and $\texp'= z \, \tau$ with $z > 1$, constants $K' \geq K$, and costs per sample $C > 1$ and $1$, respectively. We assume $c_\psi = 0$ here, describing situations where the cost of evaluating $\psi$ is negligible compared to the cost of computing a sample ($c_\psi \ll C$) and/or $\psi$ is already evaluated during sampling. We want to find out whether there exist a cost $C$ and a thinning factor $k$ such that $\Psi$ is more efficient than $\Psi'$ before thinning but less efficient than $\Psi'$ after thinning.
To describe the former case, we form the inequality
$$\begin{align}
\text{E}(k=1, c=C, \texp=\tau) &> \text{E}(k=1, c=1, \texp=z\, \tau)\\
\left( 1 + \frac{2K}{\exp(1/\tau) - 1} \right) \times C &< \left( 1 + \frac{2K'}{\exp(1/(z\, \tau)) - 1}\right).\\
\end{align}$$
To describe the latter case, we form the inequality
$$\begin{align}
\text{E}(k, c=C, \texp=\tau) &< \text{E}(k, c=1, \texp=z\, \tau)\\
\left( 1 + \frac{2K}{\exp(k/\tau) - 1} \right) \times C &> \left( 1 + \frac{2K'}{\exp(k/(z\, \tau)) - 1}\right).\\
\end{align}$$
The first inequality says that $C$ must not be so high that $\Psi$'s cost dominates its faster autocorrelation decay, making it less efficient than $\Psi'$ from the start. The second inequality says that $C$ must not be so low that $\Psi'$ can never catch up to $\Psi$'s efficiency by means of thinning. In all, as long as $C$ satisfies the first inequality, we can find the $k^*$ such that $\Psi'$ is more efficient than $\Psi$ after thinning by any $k \geq k^*$.
Example:
Let $\tau=10$, $z = 15$, $K = K' = 1$. Then the first inequality (the former case) yields $1 < C < 14.9875679888$. Indeed, when $\tau$ is large, $\exp(1/\tau) \approx 1 + 1/\tau$ is a reasonable approximation and the condition becomes $1 < C < z$.
If $C < 1$, which means $\Psi$ is cheaper to run than $\Psi'$ to begin with, then $\Psi$ remains more efficient no matter the thinning (left figure). If $C > 15$, then $\Psi'$ is more efficient even without thinning and remains so for all $k$ (right figure).
![](https://i.imgur.com/eyp4cJ4.png =230x160) ![](https://i.imgur.com/Esh3Pt5.png =230x160)
In other words, the ratio of the costs to produce one sample must not exceed the ratio of the exponential autocorrelation times.
If we take $C=10$, the second inequality gives $k^* = 25.86$. This means that for $\Psi'$ to be more efficient than $\Psi$, we need to thin by $k \geq 26$. Specifically, at low values of $k$, $\text{E}_\Psi > \text{E}_{\Psi'}$ (left). When $k \geq 26$, $\Psi'$ overtakes $\Psi$ in terms of ESS per cost, $\text{E}_\Psi < \text{E}_{\Psi'}$ (center). Both values eventually saturate (right).
![](https://i.imgur.com/Gvbk5lV.png =230x160)![](https://i.imgur.com/cK7Sgwz.png =230x160)![](https://i.imgur.com/eGh4uOd.png =230x160)
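Both numbers in the example can be reproduced with the `efficiency` sketch above; the bisection bracket and iteration count below are my own choices:
```python
tau, z, K = 10.0, 15.0, 1.0

# Upper bound on C from the first inequality (k = 1).
C_max = (1 + 2 * K / math.expm1(1 / (z * tau))) \
      / (1 + 2 * K / math.expm1(1 / tau))
print(C_max)  # ~14.9875679888

# Crossover k*: the efficiency gap between Psi' (cost 1, tau' = z * tau)
# and Psi (cost C = 10, tau) changes sign at k*.
C = 10.0

def gap(k):
    return efficiency(k, z * tau, 1.0, K=K) - efficiency(k, tau, C, K=K)

lo, hi = 1.0, 100.0  # gap(lo) < 0 < gap(hi)
for _ in range(60):  # bisection on the sign change
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if gap(mid) < 0 else (lo, mid)
print(lo)  # ~25.86
```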
## Conclusion
For estimation purposes, thinning brings no gain other than saving storage space. For the purpose of evaluating sampler performance, thinning should be avoided, as it may give the false impression that a cheap but 'weak' sampler (e.g. random-walk Metropolis) performs better than an expensive but 'strong' sampler (e.g. Hamiltonian Monte Carlo) when thinned by a sufficiently large $k$. The intuition is that looking only at samples that are far apart gives the weaker sampler time to decorrelate, while many of the stronger sampler's already-decorrelated samples are discarded. The total costs stay the same, however, so the strong sampler may end up performing worse once ESS is normalized by cost.